
Machine Learning and Quantum Alchemy: Day 1

Part of the course Machine Learning and Quantum Alchemy.

Exercises

  1. Write a function that calculates the Morse potential of two atoms as a function of their distance: morse_potential(distance: float) -> float. This is the target function that we will model. For simplicity, we set all parameters of the potential to unity.

    Example Solution
    import numpy as np
    
    def morse_potential(distance: float) -> float:
        # Morse potential with well depth, width, and equilibrium distance all set to 1
        return (1 - np.exp(-(distance - 1))) ** 2
    
  2. Plot the function and identify the domain of interest.

    Hint:
    import matplotlib.pyplot as plt
    plt.plot(xs, ys)     # plots a connected line
    plt.scatter(xs, ys)  # plots individual points
    

    Example Solution
    import matplotlib.pyplot as plt
    
    xs = np.linspace(0.1, 6)
    # evaluate the potential at every grid point (list comprehension)
    ys = [morse_potential(x) for x in xs]
    
    plt.plot(xs, ys)
    
  3. Write a function to generate n uniformly random data points (positions and their functional values) within that domain: generate_dataset(n: int) -> tuple[np.ndarray, np.ndarray]. We will use this for the training data and for the test data.

    Example Solution
    def generate_dataset(n: int) -> tuple[np.ndarray, np.ndarray]:
        # n uniformly random positions in the domain of interest and their values
        xs = np.random.uniform(0.1, 6, n)
        ys = morse_potential(xs)
        return xs, ys
    
  4. Implement nearest-neighbor prediction at a given position by hand (without scikit-learn): predict_nearest_neighbor(position: float, training_xs: np.ndarray, training_ys: np.ndarray) -> float. Plot the results together with the true function.

    Example Solution
    def predict_nearest_neighbor(position: float, training_xs: np.ndarray, training_ys: np.ndarray) -> float:
        # return the value of the training point closest to the query position
        distances = np.abs(position - training_xs)
        nearest_neighbor_index = np.argmin(distances)
        return training_ys[nearest_neighbor_index]
    
    train_xs, train_ys = generate_dataset(10)
    xs = np.linspace(0.1, 6, 100)
    ys = morse_potential(xs)
    plt.plot(xs, ys, label="ground truth")
    plt.scatter(train_xs, train_ys, label="training data")
    plt.plot(xs, [predict_nearest_neighbor(x, train_xs, train_ys) for x in xs], label="model prediction")
    plt.legend()
    
  5. Do the same using scikit-learn (which will first be introduced in class).

    Example Solution
    from sklearn import neighbors
    
    # as before
    train_xs, train_ys = generate_dataset(10)
    xs = np.linspace(0.1, 6, 100)
    ys = morse_potential(xs)
    
    # scikit-learn nearest-neighbor regressor; k=1 matches the hand-written model
    model = neighbors.KNeighborsRegressor(n_neighbors=1)
    # scikit-learn expects a 2D feature matrix of shape (n_samples, n_features)
    X = train_xs[:, np.newaxis]
    y = train_ys
    model.fit(X, y)
    
    # inference on a dense grid
    Xprime = xs[:, np.newaxis]
    yprime = model.predict(Xprime)
    
    plt.plot(xs, ys, label="ground truth")
    plt.scatter(train_xs, train_ys, label="training data")
    plt.plot(xs, yprime, label="model prediction")
    plt.legend()
    
  6. Play with the number of training data points and with the parameters of k-nearest neighbors in the scikit-learn implementation. Focus on the low-data regime with about 5-10 points. What do you observe empirically? (A numerical starting point is sketched after the questions below.)

    • How can we choose which parameter is best? How should we choose which parameter is best?
    • How does the fit quality improve with more data?
    • Which regions are particularly hard to predict and why?
    • Based on your findings for this 1D problem, which conclusions do you draw for higher dimensions?
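    One possible way to compare parameter choices numerically (a sketch, not the official solution; the test-set size and the range of k values are arbitrary choices): hold out a large random test set and report the mean absolute error for each k.
    test_xs, test_ys = generate_dataset(1000)  # large held-out test set
    train_xs, train_ys = generate_dataset(8)   # low-data regime
    
    for k in range(1, 6):
        model = neighbors.KNeighborsRegressor(n_neighbors=k)
        model.fit(train_xs[:, np.newaxis], train_ys)
        mae = np.mean(np.abs(model.predict(test_xs[:, np.newaxis]) - test_ys))
        print(k, mae)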
  7. What is the best 1-nearest-neighbor (k=1) model you can build with only three training points? Use your intuition from the previous tasks to first predict the solution, then implement it numerically, either with a grid search or with scipy.optimize; one possible sketch of the scipy.optimize route follows.
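    A minimal sketch of the scipy.optimize route, assuming the quality measure is the mean absolute error over a dense grid (the grid density, the initial guess, and the choice of optimizer are arbitrary assumptions):
    import scipy.optimize
    
    grid = np.linspace(0.1, 6, 1000)
    true_ys = morse_potential(grid)
    
    def objective(positions: np.ndarray) -> float:
        # mean absolute error of a 1-NN model whose three training points
        # sit at the given positions on the true curve
        training_ys = morse_potential(positions)
        predictions = [predict_nearest_neighbor(x, positions, training_ys) for x in grid]
        return np.mean(np.abs(np.array(predictions) - true_ys))
    
    # Nelder-Mead copes with the non-smooth objective
    result = scipy.optimize.minimize(objective, x0=[1.0, 2.0, 4.0], method="Nelder-Mead")
    print(result.x, result.fun)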

Advanced exercises (if the other ones are too easy)

  1. Write a function that calculates the Morse potential of a variable number \(n\) of particles in 3D: morse_potential_3d(coordinates: np.ndarray) -> float, where coordinates is a 1D array of length \(3n\) for \(n>1\). You may want to use scipy.spatial.distance.pdist for this.
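    A possible sketch, assuming the total potential is the sum of the two-body Morse terms from exercise 1 over all unique pairs:
    from scipy.spatial.distance import pdist
    
    def morse_potential_3d(coordinates: np.ndarray) -> float:
        positions = coordinates.reshape(-1, 3)   # one row per particle
        distances = pdist(positions)             # all unique pairwise distances
        return np.sum((1 - np.exp(-(distances - 1))) ** 2)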

  2. Write a function that produces randomized coordinates sampled from a normal distribution with mean 0 and unit variance.
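    A possible sketch (the function name and signature are not fixed by the exercise):
    def generate_coordinates(n: int) -> np.ndarray:
        # 3n standard-normal coordinates for n particles
        return np.random.normal(0, 1, 3 * n)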

  3. Use scikit-learn to build a k-nearest-neighbor model for this problem and play around with it. Which physical effect that you already know about is the model oblivious to, so that it has to learn it from data?
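    A possible starting point, reusing the sketches above (the number of particles and the training-set size are arbitrary choices):
    n_particles = 3
    train_X = np.array([generate_coordinates(n_particles) for _ in range(100)])
    train_y = np.array([morse_potential_3d(c) for c in train_X])
    
    model = neighbors.KNeighborsRegressor(n_neighbors=1)
    model.fit(train_X, train_y)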

  4. A learning curve shows how a model becomes more accurate with more data points. To this end, the mean absolute error on a held-out test set is shown as a function of the number of training points on a log-log scale. Create learning curves for your code with \(k=1\) and different \(n\). What do you observe? Which conclusions can you draw for practical machine learning methods?
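    A possible sketch, building on the functions above (the training-set sizes and the test-set size are arbitrary choices):
    ns = [4, 8, 16, 32, 64, 128]
    test_X = np.array([generate_coordinates(n_particles) for _ in range(1000)])
    test_y = np.array([morse_potential_3d(c) for c in test_X])
    
    maes = []
    for n in ns:
        train_X = np.array([generate_coordinates(n_particles) for _ in range(n)])
        train_y = np.array([morse_potential_3d(c) for c in train_X])
        model = neighbors.KNeighborsRegressor(n_neighbors=1)
        model.fit(train_X, train_y)
        maes.append(np.mean(np.abs(model.predict(test_X) - test_y)))
    
    plt.loglog(ns, maes, marker="o")
    plt.xlabel("number of training points")
    plt.ylabel("mean absolute error")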