Exercise 5

Part of the course Machine Learning for Materials and Chemistry.

Decision trees are easy to interpret and offer intuitive access to data. Building them, however, is not trivial, as we have seen in the lecture. This exercise is a good opportunity to dive into scikit-learn, a popular machine learning toolkit. We will work on the raw data (sn2_lccsd_predictions.txt) used in a paper.

Task 5.1: One-hot encoding revisited

Download the data file. Each row has two space-separated columns:

a string specifying the substituents (see Table 1)
the activation energy in kcal/mol

For example, a row would be

A_A_A_A_C_B 3.5838529869702143

Write a function that reads this file, uses sklearn.preprocessing.OneHotEncoder to encode the first column as a 2D array X, and collects the second column in a 1D vector y.
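
One possible starting point is sketched below. It assumes that the substituent string is split at the underscores, so that each of the six positions becomes its own categorical feature; this is an interpretation of the example row, not something the task prescribes. The function name load_dataset is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def load_dataset(path):
    """Read 'label energy' rows and return (X, y, encoder)."""
    labels, energies = [], []
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            label, energy = line.split()
            # Assumption: one letter code per position, joined by "_".
            labels.append(label.split("_"))
            energies.append(float(energy))

    encoder = OneHotEncoder()
    # fit_transform returns a sparse matrix by default; convert to a dense array.
    X = encoder.fit_transform(labels).toarray()
    y = np.array(energies)
    return X, y, encoder
```

Returning the fitted encoder as well keeps encoder.categories_ available, which helps to map the columns of X back to substituents when reading a tree.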

Task 5.2: Grow trees

Write a function to build decision trees for regression using sklearn.tree.DecisionTreeRegressor. Experiment with different numbers of levels in the tree (the max_depth parameter) and analyse the error of the fit. You may use sklearn.tree.plot_tree to visualise the trees.
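
A minimal sketch of such a depth scan could look as follows. The hold-out split of 20 %, the mean absolute error as the error measure, and the name scan_tree_depths are choices made here for illustration; the task leaves all of them open.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def scan_tree_depths(X, y, depths=(1, 2, 4, 8, None), seed=0):
    """Fit one regression tree per depth; return {depth: (train MAE, test MAE)}."""
    # Assumption: a single random 80/20 split is enough for a first look.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    errors = {}
    for depth in depths:
        # max_depth=None lets the tree grow until its leaves are pure.
        tree = DecisionTreeRegressor(max_depth=depth, random_state=seed)
        tree.fit(X_train, y_train)
        errors[depth] = (
            mean_absolute_error(y_train, tree.predict(X_train)),
            mean_absolute_error(y_test, tree.predict(X_test)),
        )
    return errors
```

Comparing the two numbers per depth shows where the tree starts to memorise the training data: the training error keeps shrinking while the test error levels off or grows.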

Task 5.3: Coarse-graining

Replace the regression problem with a classification problem by grouping the activation energies into three groups: low (less than 10 kcal/mol), high (more than 35 kcal/mol) and middle (everything in between). Use sklearn.tree.DecisionTreeClassifier to learn the groups. Again, experiment with different numbers of levels and analyse the error of the fit. Which strategy (regression or classification) would you prefer and why?
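
The grouping and the classifier scan can be sketched like this. The integer class labels 0, 1, 2, the accuracy score, the 80/20 split and both function names are assumptions for illustration; energies of exactly 10 or 35 are put into the middle group, following the wording of the task.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def energy_classes(y, low=10.0, high=35.0):
    """Map energies to 0 (below low), 2 (above high) or 1 (everything else)."""
    y = np.asarray(y)
    return np.where(y < low, 0, np.where(y > high, 2, 1))


def scan_classifier_depths(X, y, depths=(1, 2, 4, 8, None), seed=0):
    """Fit one classifier per depth; return {depth: (train acc., test acc.)}."""
    classes = energy_classes(y)
    X_train, X_test, c_train, c_test = train_test_split(
        X, classes, test_size=0.2, random_state=seed
    )
    scores = {}
    for depth in depths:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=seed)
        clf.fit(X_train, c_train)
        scores[depth] = (
            accuracy_score(c_train, clf.predict(X_train)),
            accuracy_score(c_test, clf.predict(X_test)),
        )
    return scores
```

Accuracy alone can be misleading if most reactions fall into one group, so it is worth checking the class counts (for example with np.bincount) before interpreting the scores.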