Machine Learning and Quantum Alchemy: Day 1
Part of the course Machine Learning and Quantum Alchemy.
Exercises
- Write a function which calculates the Morse potential of two atoms: `morse_potential(distance: float) -> float`. This is the target function which we will model. We assume all parameters to be unity for simplicity.

  Example Solution

  ```python
  import numpy as np

  def morse_potential(distance: float) -> float:
      return (1 - np.exp(-(distance - 1))) ** 2
  ```
- Plot the function and identify the domain of interest.

  ```python
  import matplotlib.pyplot as plt

  plt.plot(xs, ys)     # plots a line
  plt.scatter(xs, ys)  # plots just points
  ```

  Example Solution

  ```python
  import matplotlib.pyplot as plt

  xs = np.linspace(0.1, 6)
  # list comprehension
  ys = [morse_potential(_) for _ in xs]
  plt.plot(xs, ys)
  ```
- Write a function to generate `n` uniformly random data points (positions and their functional values) within that domain: `generate_dataset(n: int) -> tuple[np.ndarray, np.ndarray]`. We will use this for the training data and for the test data.

  Example Solution

  ```python
  def generate_dataset(n: int):
      xs = np.random.uniform(0.1, 6, n)
      ys = morse_potential(xs)
      return xs, ys
  ```
- Implement nearest neighbor prediction at `position` by hand (without `scikit-learn`): `predict_nearest_neighbor(position: float, training_xs: np.ndarray, training_ys: np.ndarray) -> float`. Plot the results together with the correct function.

  Example Solution

  ```python
  def predict_nearest_neighbor(position: float, training_xs: np.ndarray, training_ys: np.ndarray) -> float:
      distances = abs(position - training_xs)
      nearest_neighbor_index = np.argmin(distances)
      return training_ys[nearest_neighbor_index]

  train_xs, train_ys = generate_dataset(10)
  xs = np.linspace(0.1, 6, 100)
  ys = morse_potential(xs)

  plt.plot(xs, ys, label="Ground truth")
  plt.scatter(train_xs, train_ys, label="training data")
  plt.plot(xs, [predict_nearest_neighbor(x, train_xs, train_ys) for x in xs], label="model prediction")
  plt.legend()
  ```
- Do the same using `scikit-learn` (will be first introduced in class).

  Example Solution

  ```python
  from sklearn import neighbors

  # as before
  train_xs, train_ys = generate_dataset(10)
  xs = np.linspace(0.1, 6, 100)
  ys = morse_potential(xs)

  # sklearn
  cls = neighbors.KNeighborsRegressor(n_neighbors=2)
  X = train_xs[:, np.newaxis]
  y = train_ys
  cls.fit(X, y)

  # inference
  Xprime = xs[:, np.newaxis]
  yprime = cls.predict(Xprime)

  plt.plot(xs, ys, label="Ground truth")
  plt.scatter(train_xs, train_ys, label="training data")
  plt.plot(xs, yprime, label="model prediction")
  plt.legend()
  ```
- Play with the number of training data points and the parameters of k-nearest neighbors in the scikit-learn implementation. Focus on the low-data regime with about 5-10 points. What do you observe empirically? (A sketch of such an experiment follows after this exercise list.)
  - How can we choose which parameter is best? How should we choose which parameter is best?
  - How does the fit quality improve with more data?
  - Which regions are particularly hard to predict and why?
  - Based on your findings for this 1D problem, which conclusions do you draw for higher dimensions?
- What is the best 1-nearest neighbor (`k=1`) model you can build with three training points only? Use your intuition from the previous tasks to first predict the solution, then implement it numerically with either a grid search or with `scipy.optimize`. (A possible starting point is sketched below.)
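A minimal sketch of the kind of experiment meant in the parameter exercise above, assuming the `morse_potential` and `generate_dataset` helpers from earlier; the choice of eight training points and of the compared `n_neighbors` values is arbitrary and only for illustration.

```python
from sklearn import neighbors
import matplotlib.pyplot as plt
import numpy as np

xs = np.linspace(0.1, 6, 200)
ys = morse_potential(xs)

# Compare a few k values in the low-data regime (here: 8 training points).
train_xs, train_ys = generate_dataset(8)
plt.plot(xs, ys, label="Ground truth")
plt.scatter(train_xs, train_ys, label="training data")
for k in (1, 2, 4):
    model = neighbors.KNeighborsRegressor(n_neighbors=k)
    model.fit(train_xs[:, np.newaxis], train_ys)
    plt.plot(xs, model.predict(xs[:, np.newaxis]), label=f"k={k}")
plt.legend()
```

Re-running the cell draws a fresh random training set, which by itself shows how strongly the fit depends on where the few points happen to land.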
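For the three-point exercise, one possible numerical route is to treat the three training positions as free parameters and minimize the model's error on a dense grid. This sketch assumes the `morse_potential` helper from above; the mean absolute error as objective, the Nelder-Mead method, and the starting guess are my own choices, and a plain grid search over the three positions is an equally valid alternative.

```python
import numpy as np
from scipy import optimize
from sklearn import neighbors

grid = np.linspace(0.1, 6, 500)
true_values = morse_potential(grid)

def model_error(train_xs: np.ndarray) -> float:
    # Mean absolute error of a 1-nearest-neighbor model built on the three points.
    model = neighbors.KNeighborsRegressor(n_neighbors=1)
    model.fit(train_xs[:, np.newaxis], morse_potential(train_xs))
    return np.mean(np.abs(model.predict(grid[:, np.newaxis]) - true_values))

# Start from an evenly spaced guess and let the optimizer move the three points.
result = optimize.minimize(model_error, x0=np.array([1.0, 2.5, 4.5]), method="Nelder-Mead")
print(result.x, result.fun)
```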
Advanced exercises (if the other ones are too easy)
- Write a function that calculates the Morse potential of a variable number \(n\) of particles in 3D: `morse_potential_3d(coordinates: np.ndarray) -> float`, where `coordinates` is a 1D array of length \(3n\) for \(n>1\). You may want to use `scipy.spatial.distance.pdist` for this.
- Write a function to produce randomized coordinates which are sampled from a normal distribution with center 0 and unit variance.
- Use `scikit-learn` to build a k-nearest neighbor model for this problem. Play around with it. Which physical effect that you already know about is the model oblivious to, so that it has to learn it from data?
- A learning curve shows how a model becomes more accurate with more data points. To this end, the mean absolute error is shown as a function of the number of training points on a log-log scale. Create learning curves for your code with \(k=1\) and different \(n\). What do you observe? Which conclusions can you draw for practical machine learning methods?

Sketches for these advanced exercises follow below.
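A sketch of the pairwise-sum implementation hinted at in the first advanced exercise; it assumes that the total energy is simply the sum of two-body Morse terms over all pairs, again with all parameters set to unity.

```python
import numpy as np
from scipy.spatial.distance import pdist

def morse_potential_3d(coordinates: np.ndarray) -> float:
    # Interpret the flat array of length 3n as n points in 3D.
    positions = coordinates.reshape(-1, 3)
    # pdist returns all pairwise distances; sum the two-body Morse terms.
    distances = pdist(positions)
    return float(np.sum((1 - np.exp(-(distances - 1))) ** 2))
```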
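For the coordinate sampling, a helper along the following lines would do; the name `generate_coordinates` is my own choice and not prescribed by the exercise.

```python
import numpy as np

def generate_coordinates(n: int) -> np.ndarray:
    # n particles in 3D, each coordinate drawn from a standard normal distribution.
    return np.random.normal(loc=0.0, scale=1.0, size=3 * n)
```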
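Building a k-nearest-neighbor model for the 3D problem then mirrors the 1D case. A minimal sketch, assuming the two helpers above; the particle number, dataset sizes, and `n_neighbors=1` are arbitrary illustration values.

```python
from sklearn import neighbors
import numpy as np

n_particles = 3
n_train, n_test = 200, 50

# Each row of X is one flattened geometry, y is its total Morse energy.
X_train = np.array([generate_coordinates(n_particles) for _ in range(n_train)])
y_train = np.array([morse_potential_3d(x) for x in X_train])
X_test = np.array([generate_coordinates(n_particles) for _ in range(n_test)])
y_test = np.array([morse_potential_3d(x) for x in X_test])

model = neighbors.KNeighborsRegressor(n_neighbors=1)
model.fit(X_train, y_train)
mae = np.mean(np.abs(model.predict(X_test) - y_test))
print(mae)
```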
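Finally, a sketch for the learning curve, reusing the arrays from the previous sketch: train on increasingly large subsets, evaluate the mean absolute error on the held-out test set, and plot both on a log-log scale. The particular subset sizes are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import neighbors

train_sizes = [4, 8, 16, 32, 64, 128]
errors = []
for size in train_sizes:
    model = neighbors.KNeighborsRegressor(n_neighbors=1)
    model.fit(X_train[:size], y_train[:size])
    errors.append(np.mean(np.abs(model.predict(X_test) - y_test)))

plt.loglog(train_sizes, errors, "o-")
plt.xlabel("number of training points")
plt.ylabel("mean absolute error")
```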