Scientific Text-Based Machine Learning

By Shehrin Sayed

University of California, Berkeley


Abstract

Please note that this page is under development and information provided may change.

Introduction:

Machine learning is undoubtedly a useful tool and is gradually changing how we function in everyday life. The application of this powerful tool to materials and device research may have a significant impact.

It has been shown [1] that a computer program can learn specialized concepts in science and engineering from the scientific literature. In particular, a word embedding model such as word2vec enables unsupervised learning by converting words into high-dimensional vectors, and the relations among these vectors reflect the knowledge contained in the training text corpus.
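As a simple illustration of the idea (a minimal sketch using a made-up toy corpus, not the APS abstracts used for the provided model), a small word2vec model can be trained with gensim, after which each word in the vocabulary maps to a vector:

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative only)
sentences = [["Fe", "is", "ferromagnetic"],
             ["Si", "is", "a", "semiconductor"],
             ["GaAs", "is", "a", "semiconductor"]]

# In gensim 3.8, "size" sets the dimensionality of the word vectors
toy_model = Word2Vec(sentences, size=50, window=3, min_count=1, workers=1)

# Each word in the vocabulary is now a 50-dimensional vector
print(toy_model.wv["Fe"].shape)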

Here, we provide a sample word embedding model trained on scientific abstracts from selected American Physical Society (APS) journals published between 1970 and 2019. The model sufficiently captures the knowledge contained in the training dataset.

User Guide:

You'll need an Anaconda installation with Python 3.9 and the gensim 3.8 package, which can be installed using one of the following commands:

pip install gensim==3.8

or

conda install gensim==3.8
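To confirm the installation, you can optionally check the installed version from Python:

import gensim
print(gensim.__version__)   # should report a 3.8.x version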

 

Then open a Python script and import NumPy and gensim:

import numpy as np

from gensim.models import Word2Vec

Load the model with its filepath:

w2v_model = Word2Vec.load("C:/Dummy_Folder/berkeley_sample_model")
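As an optional check that the model loaded correctly, you can inspect the vocabulary size and the vector dimensionality (the attribute names below follow the gensim 3.8 API):

print(len(w2v_model.wv.vocab))    # number of words in the vocabulary
print(w2v_model.wv.vector_size)   # dimensionality of each word vector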

 

To find words similar to a given keyword, use the following syntax:

w2v_model.wv.most_similar("keyword", topn=10)

where the topn parameter determines how many similar words to return.
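The call returns a list of (word, similarity score) pairs, so the results can be looped over and printed, for example:

# Each result is a (word, cosine similarity) tuple
for word, score in w2v_model.wv.most_similar("keyword", topn=10):
    print(word, score)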

 

To ask an analogy question, e.g., if Fe is ferromagnetic, then which materials are semiconductors, use the following expression:

w2v_model.wv.most_similar(positive=["Fe", "semiconductor"], negative=["ferromagnetic"], topn=10)

 

The "does not match" function can be used to identify which element in an array is highly different from others, e.g., the result of the following expression is "Ru".

w2v_model.wv.doesnt_match(["Fe", "Co", "Ru"])

 

To calculate the cosine similarity between the vectors of two words, use the following expression:

w2v_model.wv.similarity("word1", "word2")
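The returned score is the cosine of the angle between the two word vectors. As a sketch (assuming both words are in the vocabulary), the same number can be reproduced with numpy from the raw vectors:

vec1 = w2v_model.wv["word1"]
vec2 = w2v_model.wv["word2"]

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(cos_sim)   # matches w2v_model.wv.similarity("word1", "word2") up to floating-point precision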

 

We hope you enjoy playing with the model.

Sponsored by

This model was developed under a research project supported in part by the Applications and Systems-Driven Center for Energy-Efficient Integrated NanoTechnologies (ASCENT), one of six centers in the Joint University Microelectronics Program (JUMP), a Semiconductor Research Corporation (SRC) program sponsored by the Defense Advanced Research Projects Agency (DARPA), and in part by the U.S. Department of Energy under Contract No. DE-AC02-05-CH11231 within the Non-Equilibrium Magnetic Materials (NEMM) program.

References

[1] V. Tshitoyan et al., Nature 571, 95–98 (2019).

[2] S. Sayed et al., https://doi.org/10.21203/rs.3.rs-1718292/v1 (2022).

Publications

[1] S. Sayed et al., https://doi.org/10.21203/rs.3.rs-1718292/v1 (2022).

Cite this work

Researchers should cite this work as follows:

  • Shehrin Sayed (2023), "Scientific Text-Based Machine Learning," https://nanohub.org/resources/38156.

