Exercise 2

Part of the course Machine Learning for Materials and Chemistry.

When dealing with molecules in machine learning contexts, we need mathematical objects describing such molecules. The straightforward use of Cartesian coordinates and nuclear charges (which enter the Hamiltonian) turns out to be disadvantageous, for example because the coordinates change when the molecule is translated or rotated. In this exercise, we explore two alternatives.

Normally, representations tailored to molecules, such as the Coulomb matrix, perform better than one-hot encoding. However, meaningful categorical representations sometimes capture more of the physics involved, so that one-hot encoding outperforms more advanced representations, e.g. in this work on learning reaction barriers.

Task 2.1: Implement the Coulomb matrix representation

Write a Python function coulomb_matrix(xs: list[float], ys: list[float], zs: list[float], elements: list[str]) -> list[list] to calculate the Coulomb matrix representation from the Cartesian coordinates (xs, ys, zs) and the nuclear charges, which follow from the element symbols in elements. You may use the following molecular geometries to test your function:

Water:

O 0.0 0.0 0.0
H -0.77 -0.573 0.0
H 0.775 -0.566 0.0

Caffeine:

C 3.184 1.228 0.0
N 2.239 0.145 0.001
C 2.488 -1.176 0.004
N 1.401 -1.918 0.003
C 0.383 -1.03 -0.002
C 0.869 0.27 -0.003
C 0.012 1.416 -0.002
O 0.357 2.585 -0.003
N -1.33 1.037 0.002
C -1.851 -0.253 0.001
O -3.047 -0.443 0.005
N -0.95 -1.296 0.003
C -1.43 -2.661 -0.009
C -2.327 2.096 0.0
H 3.838 1.161 0.871
H 3.794 1.2 -0.905
H 2.628 2.164 0.033
H 3.48 -1.583 0.007
H -0.746 -3.272 0.576
H -2.432 -2.674 0.417
H -1.468 -3.041 -1.031
H -2.942 2.026 0.897
H -2.971 1.993 -0.873
H -1.797 3.045 -0.025
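
As a starting point, here is a minimal sketch of one possible implementation. It assumes the commonly used Coulomb matrix definition, with diagonal elements 0.5 * Z_i^2.4 and off-diagonal elements Z_i * Z_j / |R_i - R_j|. The lookup table NUCLEAR_CHARGES is a hypothetical helper that only covers the elements appearing in the test geometries. Distances are used in whatever unit the coordinates are given in; if atomic units are wanted, the coordinates would have to be converted to Bohr first.

```python
import math

# Assumption: a small lookup table covering only the elements in the
# test geometries above; extend it for other molecules.
NUCLEAR_CHARGES = {"H": 1, "C": 6, "N": 7, "O": 8}


def coulomb_matrix(
    xs: list[float], ys: list[float], zs: list[float], elements: list[str]
) -> list[list]:
    """Return the (unsorted) Coulomb matrix as a nested list."""
    charges = [NUCLEAR_CHARGES[element] for element in elements]
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: 0.5 * Z_i^2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal: Coulomb repulsion between nuclei i and j
                distance = math.dist(
                    (xs[i], ys[i], zs[i]), (xs[j], ys[j], zs[j])
                )
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Usage with the water geometry given above
water = coulomb_matrix(
    [0.0, -0.77, 0.775],
    [0.0, -0.573, -0.566],
    [0.0, 0.0, 0.0],
    ["O", "H", "H"],
)
```

The resulting matrix is symmetric, and its row order follows the atom order of the input. This sketch does not sort rows or columns.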

Task 2.2: One-hot encoding

To learn categorical data, we need a representation that allows regression. One-hot encoding is such a representation; for simplicity, this exercise uses a simplified, slightly overcomplete version of it. In one-hot encoding, each category in the data is represented by a binary vector with a length equal to the number of categories. Each vector has a value of 1 in the position corresponding to the category it represents, and a value of 0 in all other positions. For example, if we have three categories (alkanes, amines, and acids), then the one-hot encoding for alkanes would be [1,0,0], for amines [0,1,0], and for acids [0,0,1]. If a data point consists of several categories, their vectors are concatenated: for the categories (alkanes, amines), the one-hot encoding would be [1,0,0,0,1,0].

Write a function one_hot(categories: list[str], possible_categories: list[str]) -> list[int] that converts a list of categories, such as letters, each chosen from a set of possible categories, into a single one-hot encoded vector.

Examples:

one_hot(["A"], ["A", "B"]) -> [1,0]
one_hot(["A", "C"], ["A", "B", "C"]) -> [1,0,0,0,0,1]
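
The two examples above could be reproduced by a minimal sketch like the following. It assumes that the result is a single flat list, as in the examples, and that a category missing from possible_categories should raise an error; both are choices of this sketch rather than requirements of the task.

```python
def one_hot(categories: list[str], possible_categories: list[str]) -> list[int]:
    """Concatenate one binary block per entry in `categories`."""
    encoded: list[int] = []
    for category in categories:
        # One block of zeros per category, with a single 1 at the
        # position of that category in `possible_categories`.
        block = [0] * len(possible_categories)
        # list.index raises ValueError for unknown categories.
        block[possible_categories.index(category)] = 1
        encoded.extend(block)
    return encoded


print(one_hot(["A"], ["A", "B"]))            # [1, 0]
print(one_hot(["A", "C"], ["A", "B", "C"]))  # [1, 0, 0, 0, 0, 1]
```

The output length is always len(categories) * len(possible_categories), which is what makes this variant slightly overcomplete.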

Task 2.3: Invertible representations

Is the Coulomb matrix representation invertible, i.e. can you recover the molecule from the representation? Explain your answer.
Write a function sum_formula(coulomb_matrix: list[list[float]]) -> str that obtains the sum formula from the Coulomb matrix representation.
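
One possible approach, sketched under assumptions: if the diagonal elements follow the common convention 0.5 * Z^2.4, then each nuclear charge can be recovered as Z = (2 * M_ii)^(1/2.4). The table ELEMENT_SYMBOLS is a hypothetical helper limited to H, C, N and O, and the Hill ordering of the output (carbon first, then hydrogen, then the rest alphabetically) is a choice of this sketch, not something the task prescribes.

```python
from collections import Counter

# Assumption: only the elements of the test molecules are covered.
ELEMENT_SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O"}


def sum_formula(coulomb_matrix: list[list[float]]) -> str:
    """Recover the sum formula from the diagonal of a Coulomb matrix."""
    counts: Counter = Counter()
    for i, row in enumerate(coulomb_matrix):
        # Invert M_ii = 0.5 * Z^2.4 and round to the nearest integer charge.
        charge = round((2.0 * row[i]) ** (1.0 / 2.4))
        counts[ELEMENT_SYMBOLS[charge]] += 1
    # Hill order: C and H first if carbon is present, otherwise alphabetical.
    if "C" in counts:
        order = ["C", "H"] + sorted(s for s in counts if s not in ("C", "H"))
    else:
        order = sorted(counts)
    return "".join(
        f"{s}{counts[s] if counts[s] > 1 else ''}" for s in order if s in counts
    )


# Usage: only the diagonal matters, so off-diagonal values are placeholders.
water_like = [
    [0.5 * 8 ** 2.4, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5],
]
print(sum_formula(water_like))  # H2O
```

Note that this sketch only reads the diagonal; the off-diagonal elements, which carry the geometric information, are not needed for the sum formula.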