Machine Learning and Quantum Alchemy: Day 2

Part of the course Machine Learning and Quantum Alchemy.

Exercises

Implement a Gaussian kernel function: gaussian_kernel(x1: np.ndarray, x2: np.ndarray) -> float. Plot the result in one dimension for different values of x1 and x2. The equation is

\[k(x_1, x_2) = \exp\left(-\gamma \lVert x_1-x_2\rVert^2\right)\]

You may assume \(\gamma\) to be 1.

Example Solution

import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(x1: np.ndarray, x2: np.ndarray) -> float:
    difference_vector = x1-x2
    distance = np.linalg.norm(difference_vector)
    return np.exp(-distance**2)

xs = np.linspace(-5,5, 500)
ys = [gaussian_kernel(2, x) for x in xs]

plt.plot(xs, ys)

Using the functions for the Morse potential and the corresponding training data from the previous day, implement Kernel-Ridge-Regression as fit_krr(training_xs: np.ndarray, training_ys: np.ndarray) -> np.ndarray:

\[\alpha = (\mathbf{K} - \lambda\mathbf{I})^{-1}y\]

Where \(\alpha\) is the model coefficient vector and \(\mathbf{K}\) is the Kernel matrix. Each element of that matrix is obtained from applying the kernel function from task 1 to each pair of feature vectors in the training data:

\[\mathbf{K}_{ij} = k(x_i, x_j)\]

Make sure that the function fit_krr takes 2D arrays like scikit-learn even in the 1D case.

Build a model using your code for some random training data.
Example Solution
```
def fit_krr(training_xs: np.ndarray, training_ys: np.ndarray) -> np.ndarray:
    n = len(training_ys)
    K = np.ones((n, n))
    for i in range(n):
        for j in range(i+1, n):
            K[i, j] = gaussian_kernel(training_xs[i], training_xs[j])
            K[j, i] = K[i, j]

    lval = 1e-10
    alpha = np.linalg.inv(K - lval * np.identity(n))@training_ys
    return alpha
```

Implement the predictions in KRR in a function predict_krr(training_xs: np.ndarray, xprime: np.ndarray, alphas: np.ndarray) -> float. The prediction equation is

\[y'(x') = \sum_i\alpha_i k(x_i, x')\]

Where the sum runs over all training instances.

Example Solution

def predict_krr(training_xs: np.ndarray, xprime: np.ndarray, alphas: np.ndarray) -> float:
    yprime = 0
    n = len(alphas)
    for i in range(n):
        yprime += alphas[i] * gaussian_kernel(training_xs[i], xprime)
    return yprime

Let's move on to molecules. Use the following code to obtain the coordinates of naphthalene:

def get_geometry():
    import requests
    res = requests.get("https://zenodo.org/records/3994178/files/inp.xyz?download=1")
    geos = res.content.decode("utf-8").replace("C", "").replace("H", "")
    geos = geos.split("\n", maxsplit=2)[-1].replace("\n", " ")
    return np.array([float(_) for _ in geos.split()]).reshape(-1, 3)

Write a python function coulomb_matrix(positions: np.ndarray, nuclear_charges: np.ndarray) -> np.ndarray to calculate the coulomb matrix representation from the cartesian coordinates and the nuclear charges \(Z_i\).

\[\begin{split}M_{ij} =\begin{cases}\tfrac{1}{2} Z_{i}^{2.4} & \text{if } i = j \\\frac{Z_{i}Z_{j}}{\| {\bf R}_{i} - {\bf R}_{j}\|} & \text{if } i \neq j\end{cases}\end{split}\]

The return value of the function should be a 1D-vector which can be obtained with .reshape(-1).

Example Solution

# basic python
def coulomb_matrix(positions: np.ndarray, nuclear_charges: np.ndarray) -> np.ndarray:
    n = len(nuclear_charges)
    M = np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5*nuclear_charges[i]**2.4
            else:
                distance = np.linalg.norm(positions[i]-positions[j])
                M[i, j] = nuclear_charges[i] * nuclear_charges[j] / distance
    return M.reshape(-1)

Advanced solution

This approach employs numpy methods only to achieve the functionally identical results to the code above.

# numpy only
def coulomb_matrix(positions: np.ndarray, nuclear_charges: np.ndarray) -> np.ndarray:
    ds = np.linalg.norm(positions[:, np.newaxis]-positions[np.newaxis, :], axis=2)
    np.fill_diagonal(ds, 1)
    M = np.outer(nuclear_charges, nuclear_charges) / ds
    np.fill_diagonal(M, 0.5*nuclear_charges**2.4)
    return M.reshape(-1)

This however calculates the full symmetric Coulomb matrix, which is not quite needed, since it is symmetric. Using scipy.spatial.distance.pdist this can be done more efficiently.

Use get_data() in the following code to download a database of about 2286 naphathalene derivatives. Use the coulomb matrix representation to build a KRR model of 100 random molecules and predict on the others.

def get_data():
    import requests
    res = requests.get("https://zenodo.org/records/3994178/files/PBE.txt?download=1")
    charges = []
    energies = []
    for line in res.content.decode("ascii").split("\n")[1:-1]:
        c, e = line.split()
        charges.append([int(_) for _ in c])
        energies.append(float(e))
    charges, energies = np.array(charges, dtype=float), np.array(energies)
    ordering = np.arange(energies.shape[0])
    np.random.shuffle(ordering)
    return charges[ordering, :], energies[ordering]

Example Solution

charges, energies = get_data()
representations = []
for i in range(100):
    Zs = list(charges[i]) + [1.]*8
    representations.append(coulomb_matrix(geo, Zs))
alphas = fit_krr(representations, energies[:100])
errors = []
for i in range(len(energies)):
    Zs = list(charges[i]) + [1.]*8
    rep = coulomb_matrix(geo, Zs)
    errors.append(predict_krr(representations, rep, alphas) - energies[i])