nanoHUB IGNITE 2021
Semiconductor Defects Design Challenge
Instructions
Research Motivation and Description
The spontaneous or intentional creation of point defects and impurities in semiconductors has a major influence on their optical and electronic properties, and consequently their performance in solar cells and modern electronics. Defects create energy states which may manifest in the band gap region and either cause a reduction in photovoltaic absorption by killing charge carriers, or potentially contribute to enhanced absorption by acting as stepping stones for electrons excited by sub-band-gap photons. Although defects are generally seen as undesirable for solar cell absorbers, there are great opportunities in exploring whether certain defects and impurities, in the right circumstances, can lead to the creation of “intermediate band photovoltaics” with the promise of improved efficiencies.
To this end, we performed screening of functional atomic impurities in Cd-chalcogenide semiconductors (e.g., CdTe and CdSe) using high-throughput computations and machine learning. High-performance computing resources located at Argonne National Lab (Carbon at CNM and LCRC) and Berkeley Lab (NERSC) were utilized to generate large databases of impurity properties from first principles-based density functional theory (DFT) computations. This dataset was combined with material descriptors encoding information ranging from coordination environments to tabulated elemental properties to cheaper DFT data, to train state-of-the-art regression models. Various regression techniques, such as LASSO, random forest, and kernel ridge regression, were used, and the predictive models were optimized with respect to the type and quantity of training data, optimal hyperparameter sets, and cross-validation errors. The best models thus achieved were deployed to make predictions for the combinatorial chemical space of all possible impurity atoms in Cd-chalcogenide compounds, following which screening was performed on the basis of their relative energetics, leading to a shortlist of impurities for every compound which can effect a desirable or a disastrous change in the optical and electronic properties of the semiconductor. The data and models developed in this work have major consequences for semiconductor applications ranging from solar cells to infrared sensors to quantum information sciences.
Accessing the tutorial on nanoHUB
Go to https://nanohub.org/resources/mldefect. Log in to nanoHUB and launch the mldefect tool. This will open a Jupyter notebook in which DFT data for impurities in Cd-chalcogenide semiconductors is used to train ML models that predict the formation energies of thousands of possible impurities. Go through this notebook carefully, reading all the descriptions and paying attention to the lines of code that (i) read the DFT data and ML descriptors from Excel files, (ii) call Python packages to train predictive models using techniques such as linear regression and random forest regression, (iii) calculate error bars on the random forest predictions, and (iv) plot and visualize the quality of the ML predictions and their root mean square errors.
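The notebook's end-to-end flow (load the descriptor table, split the data, fit a regressor, score on held-out points) can be sketched as below. Synthetic arrays stand in for the Excel files, which live on nanoHUB, and all shapes and noise levels are illustrative, not the real dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the DFT descriptor table that the notebook reads
# from Excel files (e.g., with pandas.read_excel); 19 descriptors per impurity.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 19))
# Fake "Cd-rich Delta_H" target: a linear combination of descriptors plus noise.
y = X @ rng.normal(size=19) + rng.normal(scale=0.1, size=400)

# Hold out 20% of the points for testing (t = 0.2 in the challenge notation).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"test RMSE: {rmse:.3f}")
```

In the real notebook, the same pattern repeats with different regressors and with parity plots of predicted vs. DFT-computed values.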
For a complete demonstration of the exercises in this notebook, check out this recorded nanoHUB tutorial:
https://www.youtube.com/watch?v=-EfHbtnI25g&ab_channel=nanohubtechtalks
Further details of this work can be obtained from the following publication:
A. Mannodi-Kanakkithodi et al., npj Computational Materials volume 6, Article number: 39 (2020).
Solve the following challenges
For the problems below, unless specified, use “Cd-rich Delta_H” as the property (output) of interest. This notebook only considers the impurity formation energy (three types of Delta_H values) and ignores the charge transition levels which are mentioned in the description cells. Change the descriptor dimensions, training set size, and regression algorithm of choice as specified in the challenge. You may have to add some lines of code or comment out some existing lines as necessary.
1. For each of the 3 properties of interest (formation energies under Cd-rich, moderate, and X-rich conditions), which of the 19 descriptors shows the maximum Pearson correlation coefficient with the property? Which descriptor shows the minimum Pearson correlation?
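A minimal way to compute these correlations, assuming the descriptors sit in a NumPy array with one column per descriptor (synthetic data here; in the notebook they come from the Excel files):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 19))  # hypothetical descriptor matrix (one column each)
# Fake target: dominated by descriptor 3, weakly dependent on descriptor 7.
y = 2.0 * X[:, 3] - 0.5 * X[:, 7] + rng.normal(scale=0.3, size=n)

# Pearson correlation of each descriptor column with the target property.
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
best = int(np.argmax(np.abs(corrs)))   # strongest (absolute) correlation
worst = int(np.argmin(np.abs(corrs)))  # weakest correlation
print(f"max |r|: descriptor {best} (r = {corrs[best]:.2f}); "
      f"min |r|: descriptor {worst} (r = {corrs[worst]:.2f})")
```

Repeat with each of the three Delta_H columns as `y` to answer the question for all three properties.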
2. Test the effect of the number of descriptors and the number of training data points on linear regression performance. Let the number of descriptors (descriptor dimension) be m and let the test fraction be t (t = 0.2 means 80% of the data is used to train the model).
(a) Keeping t = 0.2, plot the test set prediction error as a function of m, using the following values of m: m = 4, 8, 12, 15, 19. Since there are 19 total descriptors, there will be 19Cm ways of choosing m descriptors: for each value of m, choose m descriptors randomly 10 times, and train 10 linear regression models. Plot the average test error as well as the maximum and minimum test error bounds on the y-axis and m on the x-axis.
(b) Keeping m = 19 (which means using all the descriptors), train linear regression models (10 times each) for t = 0.9, 0.8, 0.7, … 0.1. Plot the average test error as well as the maximum and minimum test error bounds for each value of t on the y-axis, and (1-t) on the x-axis (indicating the training set size as a fraction).
Each linear regression model takes only a few seconds to train, so this will not be a time-consuming exercise.
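Part (a) can be sketched as follows, again on synthetic stand-in data. `rng.choice` draws each random descriptor subset, and the average/min/max over the 10 fits give the values and error bounds to plot (e.g., with plt.errorbar):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 19))                    # fake 19-descriptor matrix
y = X @ rng.normal(size=19) + rng.normal(scale=0.2, size=300)

results = {}
for m in (4, 8, 12, 15, 19):
    errs = []
    for _ in range(10):                           # 10 random subsets per m
        cols = rng.choice(19, size=m, replace=False)
        Xtr, Xte, ytr, yte = train_test_split(X[:, cols], y, test_size=0.2)
        model = LinearRegression().fit(Xtr, ytr)
        errs.append(mean_squared_error(yte, model.predict(Xte)) ** 0.5)
    results[m] = (np.mean(errs), np.min(errs), np.max(errs))

for m, (avg, lo, hi) in results.items():
    print(f"m={m:2d}  avg={avg:.3f}  min={lo:.3f}  max={hi:.3f}")
```

Part (b) follows the same pattern with `m` fixed at 19 and `test_size` looped over 0.9, 0.8, ... 0.1.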
3. Keeping t = 0.2 and m = 19, train random forest regression models with hyperparameter optimization and cross-validation. Using errors from 10 different runs (similar to challenge 2), compare the test set predictions using 5-fold CV and 10-fold CV. What are the optimal values of the 5 hyperparameters obtained from either CV? Since the model training takes time, only a single model will suffice for each choice of CV.
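One way to set up the 5-fold vs. 10-fold comparison is scikit-learn's GridSearchCV. The two-parameter grid below is only illustrative — the notebook's actual five hyperparameters and their ranges may differ — and the data is again a synthetic stand-in:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 19))
y = X @ rng.normal(size=19) + rng.normal(scale=0.2, size=200)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative grid; the notebook optimizes five RF hyperparameters.
grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
for cv in (5, 10):
    search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                          cv=cv, scoring="neg_root_mean_squared_error")
    search.fit(Xtr, ytr)
    print(f"cv={cv}: best params {search.best_params_}, "
          f"test score {search.score(Xte, yte):.3f}")
```

`search.best_params_` is where the optimal hyperparameter values asked for in this challenge are reported.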
4. Keeping t = 0.2, m = 19, and cv = 5, report the values of rmse_CdTeSe and rmse_CdSeS (the root mean square errors in prediction for impurities in compounds CdTeSe and CdSeS) for kernel ridge regression models trained using the following ranges of kernel parameters:
(a) l in np.logspace(-2, 2, 10) and p in np.logspace(0, 2, 10)
(b) l in np.logspace(-4, 4, 20) and p in np.logspace(0, 3, 20)
For both cases, train 10 models and report the average, maximum, and minimum test set prediction errors. Also show parity plots for the best model from either case, displaying the train and test points for each semiconductor type (in the final cell, comment out the plt.errorbar lines and uncomment the plt.scatter lines). The models for (b) may take much longer, since the search cycles through 20*20 = 400 hyperparameter combinations.
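A sketch of the grid search for case (a) on synthetic data. Mapping the notebook's kernel parameters l and p onto KernelRidge's gamma (kernel width) and alpha (regularization strength) is an assumption here, so adapt it to however the notebook defines them:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 19))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

# Case (a) ranges, assumed to map onto gamma and alpha (cv = 5 as specified).
grid = {"gamma": np.logspace(-2, 2, 10), "alpha": np.logspace(0, 2, 10)}
search = GridSearchCV(KernelRidge(kernel="rbf"), grid, cv=5)
search.fit(Xtr, ytr)
rmse = mean_squared_error(yte, search.predict(Xte)) ** 0.5
print(f"best params: {search.best_params_}, test RMSE: {rmse:.3f}")
```

For the rmse_CdTeSe and rmse_CdSeS values, compute the same RMSE restricted to the test points belonging to each semiconductor.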
5. Using the best models obtained in challenges 3 and 4, make predictions for all the impurities in the “X.csv” file. Select all the “CdTe_0.5Se_0.5” data points and rank them from lowest Cd-rich formation energy to highest; this orders the impurities from most likely to least likely to form in CdTe_0.5Se_0.5. Compare the lists obtained from the two models. (In the csv file, the first column is the impurity atom, the second column is the semiconductor name, and the third column is the defect site.)
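The filtering-and-ranking step might look like the sketch below; the DataFrame, its column names, and the prediction values are all hypothetical stand-ins for the real X.csv plus model output:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X.csv: impurity atom, semiconductor, defect site,
# matching the column order described in the challenge.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "impurity": ["Li", "Na", "K", "Mg", "Ca", "Zn"],
    "semiconductor": ["CdTe_0.5Se_0.5"] * 4 + ["CdTe"] * 2,
    "site": ["Cd_sub", "Te_sub", "interstitial", "Cd_sub", "Cd_sub", "Te_sub"],
})
# Fake model predictions standing in for model.predict(...) on the real data.
df["pred_Cd_rich_dH"] = rng.normal(size=len(df))

ranked = (df[df["semiconductor"] == "CdTe_0.5Se_0.5"]
          .sort_values("pred_Cd_rich_dH")   # lowest energy = most likely to form
          .reset_index(drop=True))
print(ranked[["impurity", "pred_Cd_rich_dH"]])
```

Producing `ranked` once per model (random forest and kernel ridge) gives the two lists to compare.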