Exercise 5
Part of the course Machine Learning for Materials and Chemistry.
Decision trees are easy to interpret and offer an intuitive view of the data. Building them, however, is not trivial, as we have seen in the lecture. This is a good opportunity to dive into scikit-learn, a popular machine learning toolkit. We will work on the raw data (sn2_lccsd_predictions.txt) used in a paper.
Task 5.1: One-hot encoding revisited
Download the data file. Each row has two space-separated columns:
- a string specifying the substituents (see Table 1)
- the activation energy in kcal/mol
For example, a row looks like this:
A_A_A_A_C_B 3.5838529869702143
Write a function that reads this file, uses sklearn.preprocessing.OneHotEncoder to encode the first column as a 2D array X, and converts the second column into a 1D vector y.
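As a starting point, the reading and encoding step could be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: the function name load_dataset is hypothetical, and splitting the label on "_" so that each substituent position becomes its own categorical feature is one possible reading of "encode the first column". It also assumes every label has the same number of positions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def load_dataset(lines):
    """Turn rows of the form 'A_A_A_A_C_B 3.58' into (X, y, encoder).

    `lines` is any iterable of strings, e.g. an open file handle.
    Hypothetical helper: the name and the return signature are assumptions.
    """
    labels, energies = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        # Assumption: treat each "_"-separated position as one feature.
        labels.append(parts[0].split("_"))
        energies.append(float(parts[1]))

    encoder = OneHotEncoder()
    # fit_transform returns a sparse matrix by default; densify it here.
    X = encoder.fit_transform(labels).toarray()
    y = np.array(energies)
    return X, y, encoder
```

With the data file in the working directory, it could be called as `X, y, encoder = load_dataset(open("sn2_lccsd_predictions.txt"))`. Returning the fitted encoder keeps the mapping from columns back to substituents available, which helps when reading the trees later.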
Task 5.2: Grow trees
Write a function to build decision trees for regression using sklearn.tree.DecisionTreeRegressor. Experiment with different numbers of levels in the tree and analyse the error of the fit. You may use sklearn.tree.plot_tree to visualise the trees.
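One way to organise the depth experiment is sketched below. It rests on assumptions the task does not fix: the function name scan_tree_depths is hypothetical, the error measure is the mean absolute error, and the fit is judged on a random 80/20 train/test split. Other choices, such as cross-validation or the root mean squared error, are equally valid.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def scan_tree_depths(X, y, depths=(1, 2, 4, 8, None), seed=0):
    """Fit one regression tree per depth; return (depth, train MAE, test MAE).

    Hypothetical helper. max_depth=None lets the tree grow until all
    leaves are pure or cannot be split further.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    results = []
    for depth in depths:
        tree = DecisionTreeRegressor(max_depth=depth, random_state=seed)
        tree.fit(X_train, y_train)
        train_err = mean_absolute_error(y_train, tree.predict(X_train))
        test_err = mean_absolute_error(y_test, tree.predict(X_test))
        results.append((depth, train_err, test_err))
    return results
```

Comparing the training and test errors across depths shows where additional levels stop improving the fit on unseen data and start fitting noise instead.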
Task 5.3: Coarse-graining
Replace the regression problem with a classification problem by grouping the activation energies into three groups: low (below 10 kcal/mol), high (above 35 kcal/mol) and middle (everything in between). Use sklearn.tree.DecisionTreeClassifier to learn the groups. Again, experiment with different numbers of levels and analyse the error of the fit. Which strategy (regression or classification) would you prefer and why?
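The binning and the classifier experiment might be sketched like this. The names energy_classes and scan_classifier_depths, the integer class codes 0/1/2, the use of accuracy as the score, and the 80/20 split are all assumptions made for illustration. Energies of exactly 10 or 35 fall into the middle group, following the wording "less than" and "higher than".

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def energy_classes(y, low=10.0, high=35.0):
    """Map energies to 0 (low), 1 (middle) or 2 (high)."""
    y = np.asarray(y, dtype=float)
    return np.select([y < low, y > high], [0, 2], default=1)


def scan_classifier_depths(X, y, depths=(1, 2, 4, 8, None), seed=0):
    """Fit one classifier per depth; return (depth, train acc, test acc).

    Hypothetical helper; accuracy is an assumed choice of score.
    """
    labels = energy_classes(y)
    X_train, X_test, c_train, c_test = train_test_split(
        X, labels, test_size=0.2, random_state=seed
    )
    results = []
    for depth in depths:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=seed)
        clf.fit(X_train, c_train)
        train_acc = accuracy_score(c_train, clf.predict(X_train))
        test_acc = accuracy_score(c_test, clf.predict(X_test))
        results.append((depth, train_acc, test_acc))
    return results
```

If the three groups turn out to be very unequal in size, plain accuracy can look good even for a tree that always predicts the largest group, so checking the class counts with `np.bincount(energy_classes(y))` is worthwhile before comparing with the regression results.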