Exercise 1

Part of the course Machine Learning for Materials and Chemistry.

Linear regression is a common technique in statistics and machine learning for modelling the relationship between two variables. With only a small number of data points, however, the fit is sensitive to noise and may not accurately capture the underlying trend in the data. This can result in overfitting, underfitting, or inaccurate predictions.

As more data points are added, the model becomes more accurate and better able to capture the underlying trend, because additional data points provide more information about the relationship between the variables and reduce the impact of noise.

The Theil-Sen regressor is a robust alternative to ordinary linear regression that is designed to cope with noise and outliers in small datasets. It estimates the slope of the best-fitting line as the median of the slopes of the lines through all pairs of data points (the "median slope"), rather than the least-squares fit used in ordinary linear regression. As the number of data points increases, the difference between the Theil-Sen regressor and linear regression tends to shrink, and linear regression provides accurate results when enough data is available.
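As a rough illustration of the median-slope idea (not part of the tasks below), a minimal NumPy sketch could look like the following; the name median_pairwise_slope is chosen here for illustration only:

```python
import itertools

import numpy as np


def median_pairwise_slope(xs: np.ndarray, ys: np.ndarray) -> float:
    """Theil-Sen idea: take the median of the slopes of the lines through
    all pairs of data points (pairs with equal x-values are skipped)."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i, j in itertools.combinations(range(len(xs)), 2)
        if xs[i] != xs[j]
    ]
    return float(np.median(slopes))
```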

We have used this particular regressor to improve Hammett models for reaction barriers.

Task 1.1: Implementing Linear Regression

  1. Write a Python function randomized_points(n: int) -> Tuple[np.ndarray, np.ndarray] that generates n random points x uniformly distributed over the interval (0, 1) and the corresponding y = a*x + b + epsilon, where a is the slope (a=2), b is the intercept (b=1), and epsilon is an error term drawn from a normal distribution with mean 0 and standard deviation 0.2. The function should return two arrays xs and ys, where xs contains the x-coordinates of the n data points and ys contains the corresponding y-coordinates.

  2. Write a Python function linear_regression(xs: np.ndarray, ys: np.ndarray) -> Tuple[float, float] that takes in the arrays xs and ys from the previous step and returns the slope and intercept of a linear regression line that fits the data points. (A sketch of both functions follows this list.)
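One possible sketch of the two functions, using np.polyfit for the least-squares fit (other routines such as scipy.stats.linregress would work equally well); the fixed values a=2, b=1, and noise level 0.2 are taken from the task description:

```python
from typing import Tuple

import numpy as np


def randomized_points(n: int) -> Tuple[np.ndarray, np.ndarray]:
    # x uniformly distributed over (0, 1); y = 2*x + 1 plus Gaussian noise (std 0.2)
    xs = np.random.uniform(0.0, 1.0, n)
    ys = 2.0 * xs + 1.0 + np.random.normal(0.0, 0.2, n)
    return xs, ys


def linear_regression(xs: np.ndarray, ys: np.ndarray) -> Tuple[float, float]:
    # least-squares fit of a degree-1 polynomial; np.polyfit returns (slope, intercept)
    slope, intercept = np.polyfit(xs, ys, 1)
    return float(slope), float(intercept)
```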

Task 1.2: Using the Theil-Sen Regressor from SciPy

  1. Write a Python function theil_sen(xs: np.ndarray, ys: np.ndarray) -> Tuple[float, float] that performs linear regression using the Theil-Sen regressor from SciPy. The function should take in the arrays xs and ys from Task 1.1 and return the slope and intercept of the regression line.
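A minimal sketch, assuming scipy.stats.theilslopes is used (it returns the slope, the intercept, and the bounds of a confidence interval for the slope):

```python
from typing import Tuple

import numpy as np
from scipy import stats


def theil_sen(xs: np.ndarray, ys: np.ndarray) -> Tuple[float, float]:
    # theilslopes expects y first, then x; the last two return values
    # are the lower and upper bounds of the slope's confidence interval
    slope, intercept, _, _ = stats.theilslopes(ys, xs)
    return float(slope), float(intercept)
```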

Task 1.3: Assessing the Models

  1. Write a Python function predict(xs: np.ndarray, slope: float, intercept: float) -> np.ndarray that takes in the array xs and the slope and intercept of a regression line and returns an array of predicted y-values for the corresponding x-values in xs.

  2. Write a Python function calculate_prediction_error(n: int, xs: np.ndarray, ys: np.ndarray, slope: float, intercept: float) -> Tuple[float, float] that generates n random points in the interval (0, 1) and uses the predict function to estimate the corresponding y-values for each x-value using the slope and intercept of the regression line. The function should then calculate the prediction error by subtracting the true noise-free y-values from the estimated y-values. Finally, the function should return the mean and standard deviation of the prediction error.

  3. Plot the results for the two models using matplotlib. (A sketch of all three steps follows below.)
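A minimal sketch of the assessment step, reusing the randomized_points, linear_regression, and theil_sen sketches from Tasks 1.1 and 1.2; the noise-free line y = 2*x + 1 is taken from the task description:

```python
from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np


def predict(xs: np.ndarray, slope: float, intercept: float) -> np.ndarray:
    # evaluate the fitted line at the given x-values
    return slope * xs + intercept


def calculate_prediction_error(n: int, xs: np.ndarray, ys: np.ndarray,
                               slope: float, intercept: float) -> Tuple[float, float]:
    # xs and ys (the training data) are accepted to match the task signature;
    # the error is evaluated on n fresh points against the noise-free line y = 2*x + 1
    x_test = np.random.uniform(0.0, 1.0, n)
    errors = predict(x_test, slope, intercept) - (2.0 * x_test + 1.0)
    return float(np.mean(errors)), float(np.std(errors))


# example usage: fit both models to a small dataset and plot them
xs, ys = randomized_points(10)
grid = np.linspace(0.0, 1.0, 100)
plt.scatter(xs, ys, label="data")
plt.plot(grid, predict(grid, *linear_regression(xs, ys)), label="least squares")
plt.plot(grid, predict(grid, *theil_sen(xs, ys)), label="Theil-Sen")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```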