Machine Learning for Materials and Chemistry

Aims to be a elective Master module, can be followed by Bachelor students. Developed at University of Kassel.

Slides

Exercises

Meant to both deepen understanding of the lecture content, but also to boost Python skills.

Exercise 1: Linear regression and model stability

Linear regression is a common technique used in statistics and machine learning to model the relationship between two variables. However, when using linear regression with a small number of data points, the model may be subject to noise and may not accurately capture the underlying pattern in the data. This can result in overfitting, underfitting, or inaccurate predictions.

As more data points are added, the model becomes more accurate and better able to capture the underlying pattern in the data. This is because the addition of more data points provides more information about the relationship between the variables and helps to reduce the impact of noise in the data.

The Theil-Sen regressor is a robust alternative to linear regression that is designed to overcome the issue of noise in small datasets. It uses a method called the "median slope" to estimate the slope of the line that best fits the data, rather than the mean slope used in linear regression. However, as the number of data points increases, the difference between the Theil-Sen regressor and linear regression tends to become smaller, and linear regression can provide accurate results when enough data is available.

We have used this particular regressor to improve Hammett models for reaction barriers.

Task 1.1: Implementing Linear Regression

Write a Python function randomized_points(n: int) -> tuple[np.ndarray, np.ndarray] that generates n random points (\(x\)) uniformly distributed over the interval (0, 1) and the corresponding \(y = ax + b + \epsilon\), where \(a\) is the slope (\(a=2\)), \(b\) is the intercept (\(b=1\)), and \(\epsilon\) is the error term that is drawn from a normal distribution with mean 0 and standard deviation 0.2. The function should return two arrays xs and ys, where xs contains the \(x\)-coordinates of the n data points and ys contains the corresponding \(y\)-coordinates.
Write a Python function linear_regression(xs: np.ndarray, ys: np.ndarray) -> tuple[float, float] that takes in the arrays xs and ys from the previous step and returns the slope and intercept of a linear regression line that fits the data points.

Task 1.2: Using Theil-Sen Regressor from Scipy

Write a Python function theil_sen(xs: np.ndarray, ys: np.ndarray) -> tuple[float, float] that performs linear regression using Theil-Sen Regressor from Scipy. The function should take in the arrays xs and ys from Task 1 and return the slope and intercept of the regression line.

Task 1.3: Assessing the Models

Write a Python function predict(xs: np.ndarray, slope: float, intercept: float) -> np.ndarray that takes in the array xs and the slope and intercept of a regression line and returns an array of predicted \(y\)-values for the corresponding \(x\)-values in xs.
Write a Python function calculate_prediction_error(n: int, xs: np.ndarray, ys: np.ndarray, slope: float, intercept: float) -> Tuple[float, float] that generates n random points in the interval \((0, 1)\) and uses the predict function to estimate the corresponding \(y\)-values for each \(x\)-value using the slope and intercept of the regression line. The function should then calculate the prediction error by subtracting the true noise-free \(y\)-values from the estimated \(y\)-values. Finally, the function should return the mean and standard deviation of the prediction error.

Exercise 2: Turning a molecule into a mathematical object

When dealing with molecules in machine learning contexts, we need mathematical objects describing such molecules. The straightforward use of cartesian coordinates and nuclear charges (which enter the Hamiltonian) turns out to be disadvantageous. In this exercise, we explore two alternatives.

Normally, representations taylored to molecules such as the Coulomb Matrix perform better than one-hot encoding. However, sometimes meaningful categorial representations catch more of the Physics involved, such that one-hot-encoding leaves more advanced representations behind, e.g. in this work on learning reaction barriers.

Task 2.1: Implement the Coulomb Matrix representation

Write a python function coulomb_matrix(xs: list[float], ys: list[float], zs: list[float], elements: list[str]) -> list[list] to calculate the coulomb matrix representation from the cartesian coordinates (xs, ys, zs) and the nuclear charges.

\[\begin{split}M_{ij} =\begin{cases}\tfrac{1}{2} Z_{i}^{2.4} & \text{if } i = j \\\frac{Z_{i}Z_{j}}{\| {\bf R}_{i} - {\bf R}_{j}\|} & \text{if } i \neq j\end{cases}\end{split}\]

You may use the following molecular geometries to test your function:

Water:

O 0.0 0.0 0.0
H -0.77 -0.573 0.0
H 0.775 -0.566 0.0

Caffeine:

C 3.184 1.228 0.0
N 2.239 0.145 0.001
C 2.488 -1.176 0.004
N 1.401 -1.918 0.003
C 0.383 -1.03 -0.002
C 0.869 0.27 -0.003
C 0.012 1.416 -0.002
O 0.357 2.585 -0.003
N -1.33 1.037 0.002
C -1.851 -0.253 0.001
O -3.047 -0.443 0.005
N -0.95 -1.296 0.003
C -1.43 -2.661 -0.009
C -2.327 2.096 0.0
H 3.838 1.161 0.871
H 3.794 1.2 -0.905
H 2.628 2.164 0.033
H 3.48 -1.583 0.007
H -0.746 -3.272 0.576
H -2.432 -2.674 0.417
H -1.468 -3.041 -1.031
H -2.942 2.026 0.897
H -2.971 1.993 -0.873
H -1.797 3.045 -0.025

Task 2.2: One-hot encoding

To learn categorial data, we need a representation that allows for regression to be done. One-hot encoding is such a representation, of which we will now consider a simplified, slightly overcomplete version for sake of simplicity of this exercise. In one-hot encoding, each category in the data is represented by a binary vector, with a length equal to the number of categories. Each vector has a value of 1 in the position corresponding to the category it represents, and a value of 0 in all other positions. For example, if we have three categories: alkanes, amines, and acids, then the one-hot encoding for alkanes would be [1,0,0], for amines [0,1,0], and for acids [0,0,1]. If there are multiple such categories, they will be appended: if we take the categories (alkanes, amines), then the one-hot encoding would be [1,0,0,0,1,0].

Write a function one_hot(categories: list[str], possible_categories: list[str]) -> list[list[int]] that converts a list of categories such as letters chosen from a set of possible categories into a one-hot encoded vector.

Examples:

one_hot(["A"], ["A", "B"]) -> [1,0]
one_hot(["A", "C"], ["A", "B", "C"]) -> [1,0,0,0,0,1]

Task 2.3: Invertible representations

Is the Coulomb matrix representation invertible, i.e. can you recover the molecule from the representation? Explain your answer.
Write a function sum_formula(coulomb_matrix: list[list[float]]) -> str that obtains the sum formula from the Coulomb matrix representation.

Exercise 3: Loss functions and their impact

Loss functions constitute the key metric that is minimized in order to improve a model. There are many different possible choices which impact the predictions of a model. This means that the loss function needs to be specified to fully qualify a model.

Different loss functions can yield highly disparate predictive power, exemplified in this work estimating bond dissociation energies.

Task 3.1: Loss functions

Let's fit a function of the form

\[y=a\exp(bx)\]

which can be logarithmized as

\[\log(y)=\log(a)+bx\]

Write loss functions for the squared error of the residuals in the original form and the logarithmized form. Use the following signature: original_loss(xs: np.ndarray, ys: np.ndarray, params: tuple[float]) -> float.

Task 3.2: Fit models

Write a function random_data(c: float) -> tuple(np.ndarray, np.ndarray) that returns the values 0-9 for xs and some exponential of the form

\[y=a\exp(bx)+c\epsilon\]

where epsilon is some elementwise normal distributed random noise. Fit the data using the loss functions of task 1 and scipy.optimize.minimize once for \(c=0\) and once for \(c\) being some non-zero value. Discuss the results.

Task 3.3: Motivation for loss functions

Consider the effects you saw in task 3.2. What could be possible motiviations to choose one loss function over another? Is there a best loss function? Which problem do you see for loss functions for multivariate models?

Exercise 4: Simple models

Most models for materials/chemistry applications build upon some description of a local environment. One way of doing so is to assign the same property to each similar environment. These kind of models can be refined by making the environment larger, e.g. by considering more neighbors. If the whole molecule is segmented into complete yet disjunct parts, this family of models is called a group contribution model/method. Often, models of increasing local environments yield acceptable results, but fail to converge, as typically many-body and long-range effects and interactions contribute significantly, especially for charged systems.

When considering the full environment, i.e. the whole molecule for small ones, it is possible to learn the 3D geometry from the graphs.

Task 4.1: Dressed atom from a database

The file bondorder.txt contains one line per molecule with three space-separated groups:

comma-separated elements
comma-separated bond order matrix as upper triangle (there are relevant numpy functions; note that bond orders of 1.5 are used to denote aromatic bonds)
energy in Hartrees

For water, the row would be

O,H,H 1,1,0 -74.9621236632301

Write a function that reads this file and estimates the energies using the dressed atom model, i.e. approximating the total energy as a atom-wise contribution which only depends on the element in question. You can solve this via linear regression. Which relative error do you get for the energies?

Task 4.2: Bond counting

Now follow the same strategy as in task 1, but also consider all unique bonds, i.e. a CO bond, a CH bond, ... The total energy is now assumed to have one contribution from every element and another contribution from every bond. Which relative error do you get now?

Task 4.3: Generalisation

Give three different reasons (with examples) where this method must fail beyond the initial success you showed in task 1 and 2. Are these methods useful then? Discuss in the context of data standardisation.

Exercise 5: Decision trees

Task 5.1: Dicey

You have two dice A (can show 1,2,3) and B (can show 1, 3 each) and a coin C (can show the numbers 0 and 2). If the quantity

\[Q\equiv (A+B)C\]

is to be found: which item evaluated first offers the highest information gain? (Can be answered without any code.)

Task 5.2: Dressed atoms

Create a minimal decision tree which estimates the total energy of a molecule (any bonds obeying the octet rule) of at most 4 atoms of the elements C,H, and F. Why is it hard to generalize to many atoms (or even 5)?

Task 5.3

What do you expect if columns with random entries are added to a small dataset that is the input to a decision tree model? How would that be in the case of very large datasets?

Exercise 6: KRR

Task 6.1: Generate training data

Write a function training_data(npoints: int) which generates a training set of npoints training points over the interval (0.2, 5) including their y value, which here should be the Morse potential with unit parameters.

Task 6.2: Implement Kernel Ridge Regression

Do not use external libraries for this task, but rather implement KRR from the equations in the slides using scipy/numpy only. Implement a function build_model(points: np.ndarray) which takes the data points from task 1 and returns the model coefficients (alpha on the slides) for a Gaussian kernel. Write another function evaluate(points: np.ndarray, alphas: np.ndarray, testset: np.ndarray) which returns the mean absolute error on a number of points in the testset of the same length as the training data.

Task 6.3: Learning curves

Run your code for different sizes of the training set and plot the error as a function of the nuber of training points. What do you observe? Do you have some ideas how the learning curve could become steeper or be lower? Hint: What if the domain was not (0.2, 5) but rather (0.2, 100)?

Exercise 7: Hyperparameters

Task 7.1: Hyperparameter optimisation

Start from the solution code from exercise 6 and include the optimisation of the sole hyperparameter \(\gamma\) for each training set size. It is recommended that you just try a grid of values for the hyperparameter. Optimize using the correct loss function for KRR (which one is that?) Which problem do you face if you run this optimization repeatedly?

Task 7.2: Average performance

Using the loss function of KRR, plot the average performance over 10 models as a function of training set sizes by showing both average and the variance as an errorbar. Do the same for the MAE and the RMSE. What do you observe?

Task 7.3: Optimize the kernel

Implicitly, the kernel is another hyperparameter. For simplicity, we will consider one family of kernel functions only: choose different exponents of the distances in the Gaussian kernel function expression and plot the kernel performance as a function of that exponent.

Exercise 8: Bias

Task 8.1: Performance curves

Write (or extend previous) code to learn the function \(x^2\) using KRR and a Gaussian kernel. Plot the loss function as a function of training set size. Plot the average error in the minimum position and the curvature at the minimum as a function of training set size in the same plot. What do you observe? Note: you can obtain the curvature at the minimum by finite differences.

\[\frac{\partial^2 f(x)}{\partial x^2} \simeq \frac{f(x+h)-2f(x)+f(x-h)}{h^2}\]

Task 8.2: Regularization

Now plot the loss function over training set size for noisy data where you add Gaussian noise to each training point. Set lambda in KRR to different values. What do you observe?

Task 8.3: Implicit knowledge

A simpler model would be to directly fit a quadratic function to the training data and extract minimum and curvature from there directly rather than building the KRR model of task 1. Implement this simpler model for the noisy data of task 2. What do you observe? What would be the result of the comparison if we had no noise?

Exercise 9: Simple Neural Network

Task 9.1: Setup / understand

Run the following code for a simple neural network. Install tensorflow using pip install tensorflow and run the file. Go through the code and comment on what each line is doing.

import tensorflow.keras as keras
import tensorflow as tf

(x_train, y_train), _ = keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices(
    (x_train.reshape(60000, 784).astype("float32") / 255, y_train)
)
dataset = dataset.shuffle(buffer_size=1024).batch(512)

inputs = keras.Input(shape=(784,), dtype="float32")
x = keras.layers.Dense(16, activation="relu")(inputs)
x = keras.layers.Dense(16, activation="relu")(x)
x = keras.layers.Dense(16, activation="relu")(x)
outputs = keras.layers.Dense(10)(x)
model = keras.Model(inputs, outputs)

model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(dataset, epochs=10)
model.predict(dataset)
model.evaluate(dataset)

Task 9.2: Parameters

Vary some of the following: the learning rate, the layer size, the network depth, the number of training steps and the activation function. Describe and explain what you observe.

Task 9.3: Dimensionality reduction

Add another dense layer between the existing ones and reduce its dimensionality drastically. What would you expect to happen? How is it possible can ten digits be learned if the layer has dimensionality 1?