Exercise 4
Part of the course Machine Learning for Materials and Chemistry.
Most models for materials and chemistry applications build upon some description of a local environment. One way of doing so is to assign the same property contribution to each similar environment. Models of this kind can be refined by making the environment larger, e.g. by considering more neighbors. If the whole molecule is partitioned into complete yet disjoint parts, this family of models is called a group contribution model or method. Models with increasingly large local environments often yield acceptable results but fail to converge, because many-body and long-range effects and interactions typically contribute significantly, especially for charged systems.
When the full environment is considered, i.e. the whole molecule in the case of small molecules, it is possible to learn the 3D geometry from the molecular graph.
Task 4.1: Dressed atom from a database
The file bondorder.txt contains one line per molecule with three space-separated groups:
- comma-separated elements
- comma-separated bond order matrix, given as its upper triangle without the diagonal (there are relevant numpy functions; note that bond orders of 1.5 are used to denote aromatic bonds)
- energy in Hartrees
For water, the row would be
O,H,H 1,1,0 -74.9621236632301
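The format above can be parsed along the following lines. This is only a minimal sketch: the function name parse_line is hypothetical, and the code assumes that the upper triangle is stored row by row and that every molecule has at least two atoms, so that all three groups are present.

```python
import numpy as np

def parse_line(line):
    """Split one line into elements, symmetric bond order matrix and energy."""
    elements_str, bonds_str, energy_str = line.split()
    elements = elements_str.split(",")
    n = len(elements)
    upper = np.array([float(x) for x in bonds_str.split(",")])
    # Assumption: entries are the strict upper triangle (no diagonal), row by row.
    bond_orders = np.zeros((n, n))
    rows, cols = np.triu_indices(n, k=1)
    bond_orders[rows, cols] = upper
    # Mirror the upper triangle to obtain the full symmetric matrix.
    bond_orders = bond_orders + bond_orders.T
    return elements, bond_orders, float(energy_str)

# The water example from the task description.
elements, bond_orders, energy = parse_line("O,H,H 1,1,0 -74.9621236632301")
```

For water, three atoms give three off-diagonal entries in the upper triangle, which matches the three values in the example row.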
Write a function that reads this file and estimates the energies using the dressed atom model, i.e. approximating the total energy as a sum of atom-wise contributions, each of which depends only on the element in question. You can solve this via linear regression. What relative error do you get for the energies?
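One possible way to set up the regression is sketched below: each molecule becomes a row of element counts, and a least-squares fit yields one energy per element. All function names are hypothetical, and the toy molecules and per-element energies are made up purely to exercise the code; they are not taken from bondorder.txt. The mean absolute relative error used here is also just one of several reasonable error definitions.

```python
import numpy as np

def element_counts(molecules, species):
    """Design matrix: one row per molecule, one column per element."""
    X = np.zeros((len(molecules), len(species)))
    for i, elements in enumerate(molecules):
        for el in elements:
            X[i, species.index(el)] += 1
    return X

def fit_dressed_atoms(molecules, energies):
    """Least-squares fit of one energy contribution per element."""
    species = sorted({el for elements in molecules for el in elements})
    X = element_counts(molecules, species)
    coeffs, *_ = np.linalg.lstsq(X, energies, rcond=None)
    return dict(zip(species, coeffs)), X @ coeffs

def mean_relative_error(predicted, reference):
    """Mean absolute relative deviation from the reference energies."""
    return np.mean(np.abs((predicted - reference) / reference))

# Toy data (made up): energies are exactly additive by construction,
# so the fit should recover the per-element values.
toy_atom_energies = {"C": -37.8, "H": -0.5, "O": -75.0}
toy_molecules = [["O", "H", "H"], ["C", "H", "H", "H", "H"], ["C", "O"], ["H", "H"]]
toy_energies = np.array([sum(toy_atom_energies[el] for el in m) for m in toy_molecules])

atom_energies, predictions = fit_dressed_atoms(toy_molecules, toy_energies)
error = mean_relative_error(predictions, toy_energies)
```

On real data the energies are not exactly additive, so the error will be nonzero; the toy set only checks that the machinery works.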
Task 4.2: Bond counting
Now follow the same strategy as in Task 4.1, but also consider all unique bonds, e.g. a CO bond, a CH bond, ... The total energy is now assumed to have one contribution from every element and another contribution from every bond. What relative error do you get now?
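The extended feature set could be built as sketched below. The function names are hypothetical, and the sketch assumes that a bond type is identified by its element pair only, so that single, double and aromatic CO bonds all count as the same feature; including the bond order in the key would be an equally valid reading of the task.

```python
from collections import Counter
import numpy as np

def bond_counts(elements, bond_orders):
    """Count bonds by element pair, e.g. 'C-H', ignoring the bond order."""
    counts = Counter()
    for i, j in zip(*np.triu_indices(len(elements), k=1)):
        if bond_orders[i, j] > 0:
            # Sort the pair so that C-O and O-C share one key.
            counts["-".join(sorted((elements[i], elements[j])))] += 1
    return counts

def features(elements, bond_orders):
    """Element counts plus bond-type counts for one molecule."""
    feats = Counter(elements)
    feats.update(bond_counts(elements, bond_orders))
    return feats

def design_matrix(feature_dicts):
    """Stack per-molecule features into a matrix with a shared column order."""
    columns = sorted({key for feats in feature_dicts for key in feats})
    X = np.array([[feats.get(key, 0) for key in columns] for feats in feature_dicts],
                 dtype=float)
    return columns, X

# Water and H2, with bond order matrices written out by hand.
water = features(["O", "H", "H"], np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]]))
hydrogen = features(["H", "H"], np.array([[0, 1], [1, 0]]))
columns, X = design_matrix([water, hydrogen])
# water contains O: 1, H: 2 and H-O: 2
```

The resulting matrix can be passed to the same least-squares fit as in the dressed atom model. Element and bond columns may be linearly dependent for some sets of molecules; np.linalg.lstsq then returns the minimum-norm solution, so the predictions remain well defined even if the individual coefficients are not unique.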
Task 4.3: Generalisation
Give three different reasons (with examples) why this method must fail beyond the initial success you showed in Tasks 4.1 and 4.2. Are these methods still useful? Discuss in the context of data standardisation.