Scientific Text-Based Machine Learning

By Shehrin Sayed

University of California, Berkeley


Abstract

Please note that this page is under development and information provided may change.

Introduction:

Machine learning is undoubtedly a useful tool and is gradually changing how we function in everyday life. The application of this powerful tool to materials and device research may have a significant impact.

It has been shown [1] that a computer program can learn specialized concepts in science and engineering from the scientific literature. In particular, a word embedding model such as word2vec enables unsupervised learning by converting words into high-dimensional vectors, and the relations among these vectors reflect the knowledge contained in the training text corpus.
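As a simple illustration of the idea (a minimal sketch using a made-up toy corpus, not the APS abstracts used for the provided model), a small word2vec model can be trained with gensim, after which each word in the vocabulary maps to a vector:

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative only)
sentences = [["Fe", "is", "ferromagnetic"],
             ["Si", "is", "a", "semiconductor"],
             ["GaAs", "is", "a", "semiconductor"]]

# In gensim 3.8, "size" sets the dimensionality of the word vectors
toy_model = Word2Vec(sentences, size=50, window=3, min_count=1, workers=1)

# Each word in the vocabulary is now a 50-dimensional vector
print(toy_model.wv["Fe"].shape)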

Here, we provide a sample word embedding model trained on scientific abstracts from selected American Physical Society (APS) journals published between 1970 and 2019. The model sufficiently captures the knowledge contained in the training dataset.

User Guide:

You'll need an Anaconda installation with Python 3.9 and the gensim 3.8 package, which can be installed using one of the following commands:

pip install gensim==3.8

or

conda install gensim==3.8
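To confirm the installation, you can optionally check the installed version from Python:

import gensim
print(gensim.__version__)   # should report a 3.8.x version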

 

Then open a Python script and import NumPy and gensim:

import numpy as np

from gensim.models import Word2Vec

Load the model with its filepath:

w2v_model = Word2Vec.load("C:/Dummy_Folder/berkeley_sample_model")
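As an optional check that the model loaded correctly, you can inspect the vocabulary size and the vector dimensionality (the attribute names below follow the gensim 3.8 API):

print(len(w2v_model.wv.vocab))    # number of words in the vocabulary
print(w2v_model.wv.vector_size)   # dimensionality of each word vector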

 

To find words similar to a given keyword, use the following syntax:

w2v_model.wv.most_similar("keyword", topn=10)

where the topn parameter determines how many similar words to return.
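The call returns a list of (word, similarity score) pairs, so the results can be looped over and printed, for example:

# Each result is a (word, cosine similarity) tuple
for word, score in w2v_model.wv.most_similar("keyword", topn=10):
    print(word, score)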

 

To ask an analogy question, e.g., if Fe is ferromagnetic, then which materials are semiconductors, use the following expression:

w2v_model.wv.most_similar(positive=["Fe", "semiconductor"], negative=["ferromagnetic"], topn=10)

 

The "does not match" function can be used to identify which element in an array is highly different from others, e.g., the result of the following expression is "Ru".

w2v_model.wv.doesnt_match(["Fe", "Co", "Ru"])

 

To calculate the cosine similarity between the vectors of two words, use the following expression:

w2v_model.wv.similarity("word1", "word2")
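The returned score is the cosine of the angle between the two word vectors. As a sketch (assuming both words are in the vocabulary), the same number can be reproduced with numpy from the raw vectors:

vec1 = w2v_model.wv["word1"]
vec2 = w2v_model.wv["word2"]

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(cos_sim)   # matches w2v_model.wv.similarity("word1", "word2") up to floating-point precision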

 

We hope you enjoy playing with the model.

Sponsored by

This model was developed under a research project supported in part by the Applications and Systems-Driven Center for Energy-Efficient Integrated NanoTechnologies (ASCENT), one of six centers in the Joint University Microelectronics Program (JUMP), a Semiconductor Research Corporation (SRC) program sponsored by the Defense Advanced Research Projects Agency (DARPA), and in part by the U.S. Department of Energy under Contract No. DE-AC02-05-CH11231 within the Non-Equilibrium Magnetic Materials (NEMM) program.

References

[1] V. Tshitoyan et al., Nature 571, 95–98 (2019).

[2] S. Sayed et al., https://doi.org/10.21203/rs.3.rs-1718292/v1 (2022).

Publications

[1] S. Sayed et al., https://doi.org/10.21203/rs.3.rs-1718292/v1 (2022).

Cite this work

Researchers should cite this work as follows:

  • Shehrin Sayed (2023), "Scientific Text-Based Machine Learning," https://nanohub.org/resources/38156.

