Machine Learning and Quantum Alchemy: Day 1
Part of the course Machine Learning and Quantum Alchemy.
Exercises
- Write a function which calculates the Morse potential of two atoms: `morse_potential(distance: float) -> float`. This is the target function which we will model. We assume all parameters to be unity for simplicity.

  Example Solution

  ```python
  import numpy as np

  def morse_potential(distance: float) -> float:
      return (1 - np.exp(-(distance - 1))) ** 2
  ```
- Plot the function and identify the domain of interest.

  ```python
  import matplotlib.pyplot as plt

  plt.plot(xs, ys)     # plots a line
  plt.scatter(xs, ys)  # plots just points
  ```

  Example Solution

  ```python
  import matplotlib.pyplot as plt

  xs = np.linspace(0.1, 6)
  # list comprehension
  ys = [morse_potential(_) for _ in xs]
  plt.plot(xs, ys)
  ```
- Write a function to generate `n` uniformly random data points (positions and their functional values) within that domain: `generate_dataset(n: int) -> tuple[np.ndarray, np.ndarray]`. We will use this for the training data and for the test data.

  Example Solution

  ```python
  def generate_dataset(n: int):
      xs = np.random.uniform(0.1, 6, n)
      ys = morse_potential(xs)
      return xs, ys
  ```
- Implement nearest neighbor prediction at `position` by hand (without `scikit-learn`): `predict_nearest_neighbor(position: float, training_xs: np.ndarray, training_ys: np.ndarray) -> float`. Plot the results together with the correct function.

  Example Solution

  ```python
  def predict_nearest_neighbor(position: float, training_xs: np.ndarray, training_ys: np.ndarray) -> float:
      distances = abs(position - training_xs)
      nearest_neighbor_index = np.argmin(distances)
      return training_ys[nearest_neighbor_index]

  train_xs, train_ys = generate_dataset(10)
  xs = np.linspace(0.1, 6, 100)
  ys = morse_potential(xs)

  plt.plot(xs, ys, label="Ground truth")
  plt.scatter(train_xs, train_ys, label="training data")
  plt.plot(xs, [predict_nearest_neighbor(x, train_xs, train_ys) for x in xs], label="model prediction")
  plt.legend()
  ```
- Do the same using `scikit-learn` (will be first introduced in class).

  Example Solution

  ```python
  from sklearn import neighbors

  # as before
  train_xs, train_ys = generate_dataset(10)
  xs = np.linspace(0.1, 6, 100)
  ys = morse_potential(xs)

  # sklearn
  cls = neighbors.KNeighborsRegressor(n_neighbors=2)
  X = train_xs[:, np.newaxis]
  y = train_ys
  cls.fit(X, y)

  # inference
  Xprime = xs[:, np.newaxis]
  yprime = cls.predict(Xprime)

  plt.plot(xs, ys, label="Ground truth")
  plt.scatter(train_xs, train_ys, label="training data")
  plt.plot(xs, yprime, label="model prediction")
  plt.legend()
  ```
- Play with the number of training data points and the parameters of k-nearest neighbors in the scikit-learn implementation. Focus on the low-data regime with about 5-10 points. What do you observe empirically? (A sketch of such an experiment follows after this exercise list.)
  - How can we choose which parameter is best? How should we choose which parameter is best?
  - How does the fit quality improve with more data?
  - Which regions are particularly hard to predict and why?
  - Based on your findings for this 1D problem, which conclusions do you draw for higher dimensions?
- What is the best 1-nearest neighbor (`k=1`) model you can build with three training points only? Use your intuition from the previous tasks to first predict the solution, then implement it numerically with either a grid search or with `scipy.optimize`. (A possible starting point is sketched below.)
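A minimal sketch of the kind of experiment meant in the parameter exercise above, assuming the `morse_potential` and `generate_dataset` helpers from earlier; the choice of eight training points and of the compared `n_neighbors` values is arbitrary and only for illustration.

```python
from sklearn import neighbors
import matplotlib.pyplot as plt
import numpy as np

xs = np.linspace(0.1, 6, 200)
ys = morse_potential(xs)

# Compare a few k values in the low-data regime (here: 8 training points).
train_xs, train_ys = generate_dataset(8)
plt.plot(xs, ys, label="Ground truth")
plt.scatter(train_xs, train_ys, label="training data")
for k in (1, 2, 4):
    model = neighbors.KNeighborsRegressor(n_neighbors=k)
    model.fit(train_xs[:, np.newaxis], train_ys)
    plt.plot(xs, model.predict(xs[:, np.newaxis]), label=f"k={k}")
plt.legend()
```

Re-running the cell draws a fresh random training set, which by itself shows how strongly the fit depends on where the few points happen to land.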
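For the three-point exercise, one possible numerical route is to treat the three training positions as free parameters and minimize the model's error on a dense grid. This sketch assumes the `morse_potential` helper from above; the mean absolute error as objective, the Nelder-Mead method, and the starting guess are my own choices, and a plain grid search over the three positions is an equally valid alternative.

```python
import numpy as np
from scipy import optimize
from sklearn import neighbors

grid = np.linspace(0.1, 6, 500)
true_values = morse_potential(grid)

def model_error(train_xs: np.ndarray) -> float:
    # Mean absolute error of a 1-nearest-neighbor model built on the three points.
    model = neighbors.KNeighborsRegressor(n_neighbors=1)
    model.fit(train_xs[:, np.newaxis], morse_potential(train_xs))
    return np.mean(np.abs(model.predict(grid[:, np.newaxis]) - true_values))

# Start from an evenly spaced guess and let the optimizer move the three points.
result = optimize.minimize(model_error, x0=np.array([1.0, 2.5, 4.5]), method="Nelder-Mead")
print(result.x, result.fun)
```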
Advanced exercises (if the other ones are too easy)
- Write a function that calculates the Morse potential of a variable number \(n\) of particles in 3D: `morse_potential_3d(coordinates: np.ndarray) -> float`, where `coordinates` is a 1D array of length \(3n\) for \(n>1\). You may want to use `scipy.spatial.distance.pdist` for this.
- Write a function to produce randomized coordinates which are sampled from a normal distribution with center 0 and unit variance.
- Use `scikit-learn` to build a k-nearest neighbor model for this problem. Play around with it. Which physical effect that you already know about is the model oblivious to, so that it has to learn it from data?
- A learning curve shows how a model becomes more accurate with more data points. To this end, the mean absolute error is shown as a function of the number of training points on a log-log scale. Create learning curves for your code with \(k=1\) and different \(n\). What do you observe? Which conclusions can you draw for practical machine learning methods?

Sketches for these advanced exercises follow below.
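A sketch of the pairwise-sum implementation hinted at in the first advanced exercise; it assumes that the total energy is simply the sum of two-body Morse terms over all pairs, again with all parameters set to unity.

```python
import numpy as np
from scipy.spatial.distance import pdist

def morse_potential_3d(coordinates: np.ndarray) -> float:
    # Interpret the flat array of length 3n as n points in 3D.
    positions = coordinates.reshape(-1, 3)
    # pdist returns all pairwise distances; sum the two-body Morse terms.
    distances = pdist(positions)
    return float(np.sum((1 - np.exp(-(distances - 1))) ** 2))
```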
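For the coordinate sampling, a helper along the following lines would do; the name `generate_coordinates` is my own choice and not prescribed by the exercise.

```python
import numpy as np

def generate_coordinates(n: int) -> np.ndarray:
    # n particles in 3D, each coordinate drawn from a standard normal distribution.
    return np.random.normal(loc=0.0, scale=1.0, size=3 * n)
```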
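Building a k-nearest-neighbor model for the 3D problem then mirrors the 1D case. A minimal sketch, assuming the two helpers above; the particle number, dataset sizes, and `n_neighbors=1` are arbitrary illustration values.

```python
from sklearn import neighbors
import numpy as np

n_particles = 3
n_train, n_test = 200, 50

# Each row of X is one flattened geometry, y is its total Morse energy.
X_train = np.array([generate_coordinates(n_particles) for _ in range(n_train)])
y_train = np.array([morse_potential_3d(x) for x in X_train])
X_test = np.array([generate_coordinates(n_particles) for _ in range(n_test)])
y_test = np.array([morse_potential_3d(x) for x in X_test])

model = neighbors.KNeighborsRegressor(n_neighbors=1)
model.fit(X_train, y_train)
mae = np.mean(np.abs(model.predict(X_test) - y_test))
print(mae)
```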
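Finally, a sketch for the learning curve, reusing the arrays from the previous sketch: train on increasingly large subsets, evaluate the mean absolute error on the held-out test set, and plot both on a log-log scale. The particular subset sizes are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import neighbors

train_sizes = [4, 8, 16, 32, 64, 128]
errors = []
for size in train_sizes:
    model = neighbors.KNeighborsRegressor(n_neighbors=1)
    model.fit(X_train[:size], y_train[:size])
    errors.append(np.mean(np.abs(model.predict(X_test) - y_test)))

plt.loglog(train_sizes, errors, "o-")
plt.xlabel("number of training points")
plt.ylabel("mean absolute error")
```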