Symbolic Regression

1. What is symbolic regression?

Symbolic regression is a machine learning technique that searches for a symbolic expression matching data from an unknown function. In other words, it is a method that identifies the underlying mathematical expression that best describes the relationship between one or more variables.

The symbolic regression problem for mathematical functions has been tackled with a variety of methods; here we will use genetic programming (GP).

2. How does it work?

It builds a population of random formulas representing relationships between the known independent variables and their dependent-variable targets, and uses these formulas to predict the targets.

Each successive generation of programs is then evolved from the previous one by selecting the fittest individuals from the population to undergo the genetic operations.
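These steps can be sketched with a toy example, where the "programs" are plain Python functions and fitness is the mean squared error (a minimal illustration, not gplearn's actual machinery):

```python
import numpy as np

rng = np.random.default_rng(0)
X0, X1 = rng.uniform(-1, 1, 50), rng.uniform(-1, 1, 50)
y = X0 * X0 - 3 * X1 + 0.5       # hidden target relationship

# a tiny "population" of candidate formulas
population = [
    lambda a, b: a + b,
    lambda a, b: a * a - 3 * b + 0.5,
    lambda a, b: a - b,
    lambda a, b: 2 * a * b,
]

def fitness(prog):
    """Mean squared error against the targets: lower is fitter."""
    return np.mean((prog(X0, X1) - y) ** 2)

# selection: the fittest individual survives to seed the next generation,
# where it would undergo genetic operations (crossover, mutation)
fittest = min(population, key=fitness)
print(fitness(fittest))  # → 0.0, the exact formula wins
```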

2.1. Representation

For example, take the following expression:

\begin{equation} y = X^2_0 - 3 X_1 + 0.5 \end{equation}

we can rewrite it as \begin{equation} y = X_0 \times X_0 - 3 \times X_1 + 0.5 . \end{equation}

But we can do more: we can write it as a LISP-style symbolic expression \begin{equation} y = (+ ( - (\times X_0 X_0)(\times 3 X_1)) 0.5 ) \end{equation} or we can even read it as a syntax tree, where the interior nodes are functions and the variables and constants are the terminal nodes:

(figure: syntax tree of the expression)
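A minimal sketch of this tree representation in Python, using nested tuples for interior (function) nodes and strings/numbers for terminal nodes (illustrative only, not gplearn's internal format):

```python
import operator

# y = X0*X0 - 3*X1 + 0.5 as a syntax tree:
# interior nodes are functions, leaves are variables and constants
tree = ('add',
        ('sub', ('mul', 'X0', 'X0'),
                ('mul', 3.0, 'X1')),
        0.5)

FUNCS = {'add': operator.add, 'sub': operator.sub, 'mul': operator.mul}

def evaluate(node, env):
    """Recursively evaluate a syntax tree given variable bindings."""
    if isinstance(node, tuple):
        fn, *args = node
        return FUNCS[fn](*(evaluate(a, env) for a in args))
    if isinstance(node, str):          # variable leaf
        return env[node]
    return node                        # constant leaf

print(evaluate(tree, {'X0': 2.0, 'X1': 1.0}))  # → 1.5
```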

2.2 Fitness

It determines how well a program performs. As in other machine learning methods, in GP we have to know whether the metric needs to be maximized or minimized in order to solve each specific problem.

2.3 Initialization

This comprises the parameters that must be chosen before running the symbolic regression.

2.4 Summarizing

(figure: scheme summarizing the genetic programming workflow)

3. What are we going to use?

gplearn implements Genetic Programming in Python, with a scikit-learn inspired and compatible API.


4. Symbolic Regressor Example

Here we will predict the Hubble expansion rate $H$ as a function of redshift $z$:

\begin{equation} H (z) = H_0 \sqrt{\Omega_k (1 + z)^2 + \Omega_m (1 + z)^3 + \Omega_r (1 + z)^4 + \Omega_{\Lambda}} \end{equation}

where $H_0$ is the Hubble constant, $\Omega_k$ the curvature density, $\Omega_m$ the matter density, $\Omega_r$ the radiation density, and $\Omega_{\Lambda}$ the dark energy density of the Universe, all evaluated today.
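As a sketch, the dataset can be generated directly from this formula; the parameter values below are assumed, Planck-like fiducial numbers, not necessarily the notebook's actual choices:

```python
import numpy as np

# assumed fiducial cosmological parameters, all evaluated today
H0, Om, Or, Ok = 70.0, 0.3, 1e-4, 0.0
OL = 1.0 - Om - Or - Ok            # flat-universe closure for dark energy

def hubble(z):
    """H(z) from the Friedmann equation above."""
    return H0 * np.sqrt(Ok * (1 + z)**2 + Om * (1 + z)**3
                        + Or * (1 + z)**4 + OL)

z = np.linspace(0.0, 2.0, 100)
H = hubble(z)
print(hubble(0.0))  # → 70.0, since H(0) = H0 by construction
```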

4.1 Importing libraries

4.2 Initial dataset and data analysis

Initial dataset:

Data visualization:

4.3 Preparing the data for ML

4.4 GPlearn implementation

4.4.1 First test:

a) Choosing only some functions

b) Fit:

c) Prediction

d) Score

4.4.1.1. Visualizing the symbolic function

a) Equation

b) Score

c) Plot

d) Tree

4.4.2 Second test:

a) Not imposing any function set

b) Fit

c) Prediction

d) Score

4.4.2.1 Visualizing the symbolic function

a) Equation

b) Plot

c) Tree

4.5 Comparing GPlearn to traditional ML approaches

4.5.1 Decision Tree Regressor

a) Model and fit

b) Prediction and score

c) Plot

4.5.2 Random Forest Regressor

a) Model and fit

b) Prediction and score

c) Plot

4.5.3 All together
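A sketch of the two baseline regressors on stand-in data (training-set scores only, for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0, (200, 1))
y = 70.0 * np.sqrt(0.3 * (1 + X[:, 0])**3 + 0.7)

scores = {}
for model in (DecisionTreeRegressor(random_state=0),
              RandomForestRegressor(n_estimators=100, random_state=0)):
    model.fit(X, y)                               # model and fit
    scores[type(model).__name__] = model.score(X, y)  # R^2 score
print(scores)
```

Unlike the symbolic regressor, these models give no closed-form expression, which is the trade-off this comparison highlights.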

5. Symbolic Classifier

The SymbolicClassifier works in exactly the same way as the SymbolicRegressor in how the evolution takes place. The only difference is that the output of the program is transformed through a sigmoid function in order to transform the numeric output into probabilities of each class.

In essence this means that a negative output of a function means that the program is predicting one class, and a positive output predicts the other.
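This mapping is just the logistic sigmoid applied to the raw program output, a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    """Squashes raw program output into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# negative raw output -> probability < 0.5 -> one class;
# positive raw output -> probability > 0.5 -> the other class
raw = np.array([-2.0, -0.1, 0.1, 2.0])
print(sigmoid(raw) > 0.5)  # → [False False  True  True]
```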

5.1 Importing libraries

5.2 Wisconsin breast cancer

a) Loading:

b) Description:

5.3 Data pre-processing

5.3.1 Shuffling data

5.3.2 Splitting data

5.4 Classifying

5.4.1 Fit

5.4.2 Predicting

5.4.3 Scoring

5.5 Visualizing

a) ROC curve:

b) Tree:

6. References