Teaching data fitting better

One of the most useful skills we teach on the physics degree is data analysis. This is important in almost all scientific research, and it is also key to good decision making in other fields such as economics. It is also a core part of data science, and increasing numbers of our graduates are going into careers as data scientists. One basic task in data analysis is fitting a model to noisy data, e.g. fitting a straight line y = mx + c to a set of data points (x, y). As far as I know there is essentially complete consensus about how to determine the best values of the two fit parameters, the intercept c and the slope m: minimise the sum of the squared differences between the fit function and the data points.
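To make the "sum of squared differences" idea concrete, here is a minimal sketch (the function name and data are illustrative, not from the notebook): the true line gives zero, and any other line does worse.

```python
import numpy as np

def sum_squared_residuals(m, c, x, y):
    """Sum of squared differences between the line m*x + c and the data."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum((y - (m * x + c)) ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])             # exactly y = 2x + 1
best = sum_squared_residuals(2.0, 1.0, x, y)   # the true line: zero residuals
worse = sum_squared_residuals(2.5, 1.0, x, y)  # any other line scores higher
```

The least-squares fit is simply the (m, c) pair that makes this quantity as small as possible.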

This can be done very simply, as there are simple mathematical expressions for c and m. You just plug the set of data points (x, y) into these expressions, and out pop the best-fit values of c and m. This is easy. But to do data analysis, such as fitting, properly, you need error bars on c and m. For example, if you are testing a hypothesis that the true slope is m = 5.5, then a best-fit value m = 5.1 is not enough: you need to know the uncertainty in this best-fit value of m. If the uncertainty is about 0.4 or larger, then the data are consistent with the hypothesis, while if the uncertainty is only 0.1, then the data have ruled out (falsified) your hypothesis.
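For the straight-line case the standard closed-form expressions are short enough to write out directly. A sketch (function and variable names are mine, for illustration):

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit of y = m*x + c; returns (m, c)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the best-fit line always passes through (x_mean, y_mean)
    c = y_mean - m * x_mean
    return m, c

m, c = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lying exactly on y = 2x + 1
```

In practice `np.polyfit(x, y, 1)` does the same job, but writing the expressions out makes clear that the best-fit values really are just "plug in and out they pop".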

The calculation of error bars is a lot trickier than just determining the best-fit values. First of all, there are several ways of doing it, and interpreting what they output is a lot harder on the brain than understanding that the best-fit values are those that minimise a sum of squares. The classical way to estimate uncertainties is via analytical expressions that basically assume the noise is Gaussian, and then give simple formulas for the uncertainties. These work well in most circumstances, but at least to me they are not very intuitive, and it is typically not clear how to generalise these elegant but somewhat obscure equations to other situations.
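For the record, the standard Gaussian-noise formulas for a straight-line fit can be coded up in a few lines. This is a sketch of those textbook expressions (names are illustrative), where the noise variance is estimated from the residuals with N − 2 degrees of freedom:

```python
import numpy as np

def line_fit_with_errors(x, y):
    """Least-squares fit of y = m*x + c with the classical Gaussian-noise
    standard errors on m and c. Returns (m, c, sigma_m, sigma_c)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    x_mean = x.mean()
    sxx = np.sum((x - x_mean) ** 2)
    m = np.sum((x - x_mean) * (y - y.mean())) / sxx
    c = y.mean() - m * x_mean
    # Estimate the noise variance from the residuals (n - 2 degrees of
    # freedom, since two parameters were fitted)
    resid = y - (m * x + c)
    s2 = np.sum(resid ** 2) / (n - 2)
    sigma_m = np.sqrt(s2 / sxx)
    sigma_c = np.sqrt(s2 * (1.0 / n + x_mean ** 2 / sxx))
    return m, c, sigma_m, sigma_c
```

The formulas are compact, but as the post says, it is not obvious from them *why* they work, or how to adapt them when the model or the noise is less friendly.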

An alternative to these classical expressions is what are called resampling techniques. Sampling is just selecting sets of points from a dataset, while resampling is the same except that you typically go back repeatedly to the same small dataset and sample it over and over again. The two most common resampling techniques are the jackknife and the bootstrap. Both are simple algorithms that are easy to code for a huge range of problems.
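The jackknife is the simpler of the two: refit the data N times, each time leaving one point out, and turn the spread of those N slopes into an error estimate. A minimal sketch (names are mine, not from the notebook):

```python
import numpy as np

def jackknife_slope_error(x, y):
    """Leave-one-out jackknife estimate of the uncertainty in the slope."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)

    def slope(xs, ys):
        xm = xs.mean()
        return np.sum((xs - xm) * (ys - ys.mean())) / np.sum((xs - xm) ** 2)

    # Fit n times, each time with one data point deleted
    loo_slopes = np.array([slope(np.delete(x, i), np.delete(y, i))
                           for i in range(n)])
    m_bar = loo_slopes.mean()
    # Jackknife variance: (n - 1)/n times the sum of squared deviations
    var_m = (n - 1) / n * np.sum((loo_slopes - m_bar) ** 2)
    return np.sqrt(var_m)
```

A nice sanity check is that for data lying exactly on a line, deleting any one point changes nothing, so the jackknife error is zero.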

Resampling techniques are a bit hard to explain (although this old article does a good job, I think), but basically the idea is that to determine uncertainties, what you would really like is a lot more data. Then you could just sample this additional data to get what you need. But what you have is very limited data, so resampling techniques make the best of this bad situation by repeatedly sampling from the limited data that you do have. They in effect assume that by sampling from this data as if you were sampling from a much bigger pool, you get the best error estimates you can from the limited data you have.
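The bootstrap makes this "pretend your dataset is the bigger pool" idea very literal: draw N points from your N data points *with replacement*, refit, repeat many times, and take the spread of the refitted slopes as the error bar. A sketch under illustrative names (note that resampling with replacement can, very rarely, draw the same x value every time, so real code might guard against that):

```python
import numpy as np

def bootstrap_slope_error(x, y, n_boot=1000, seed=0):
    """Bootstrap estimate of the slope uncertainty: refit many datasets
    resampled with replacement from the original points."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # n points drawn with replacement
        xs, ys = x[idx], y[idx]
        xm = xs.mean()
        slopes[b] = (np.sum((xs - xm) * (ys - ys.mean()))
                     / np.sum((xs - xm) ** 2))
    # The scatter of the resampled slopes is the bootstrap error bar
    return slopes.std(ddof=1)
```

The appeal for teaching is that the same half-dozen lines work unchanged for almost any fit: swap the slope formula for whatever model you like and the algorithm is identical.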

As these resampling techniques are algorithms, the best way to teach them is just to give the algorithms to the students and let them play around with them. With this in mind I have written an IPython notebook. As it is a notebook, I have put descriptions of what it does, links to further reading, etc., in it, to try and make it a more-or-less self-contained tutorial on doing a decent job of fitting a simple model to data, with both best-fit values and uncertainties. The students can just download and run the notebook, and hopefully look at how it works to get a feel for how these resampling techniques work. They can then edit it to fit whatever model they want to their own data.

The hope is that students will use the notebook as a springboard to better data analysis, both in terms of getting better results for the uncertainties, and in understanding that from noisy data all we can get is estimates of the uncertainties. One of the disadvantages of the standard expressions for errors is that they encourage students to treat the numbers as if they were set in stone, and not to question them. This is bad. They are just estimates, and sometimes the true values can lie outside them, either just by chance, as error estimates are inherently probabilistic in nature, or because there are systematic biases in either the data or the fitting procedure. Hopefully, the IPython notebook will help students learn this by doing.
