I am starting to sort out my Python teaching for the coming semester; the course includes some introductory data analysis. As part of this, I have just read a relatively old (2001) but, I think, influential article that compares and contrasts two schools of data analysis. Roughly speaking, these are:
- School A: fit a simple-as-possible model function to the data, for example a straight-line or exponential fit, to try to understand what is going on.
- School B: use a machine learning algorithm, such as a neural net or a support vector machine, to obtain the best possible predictions.
The author is Leo Breiman, a statistician who was encouraging his fellow statisticians to give School B a try. He thought many of them were sticking too rigidly to School A, and that is what prompted the article, which argues for School B.
I will basically be teaching School A, as this is the default approach in physics. Often in physics we are dealing with relatively simple systems where we have a good idea what the model function should be. For example, if our data are for radioactive decay, we know that the radioactivity should decrease exponentially with time, and so we will use an exponential model function. In these simple cases it is a no-brainer to go for School A. School B, by contrast, throws away a big piece of knowledge we have about the data, namely that it is decaying exponentially, which is very wasteful and inefficient.
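To make that concrete, here is a minimal sketch of the School A fit for the radioactive-decay example. The data are synthetic and the function and variable names are my own, but the pattern (assume an exponential, then fit only its two parameters) is the standard one:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic "radioactive decay" data: counts that fall off exponentially,
# plus a little noise standing in for measurement error.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 50)          # time, arbitrary units
counts = 1000.0 * np.exp(-t / 2.5) + rng.normal(0.0, 10.0, t.size)

# School A: we already know the form of the model function,
# so the fit only has to find its two parameters.
def decay(t, n0, tau):
    return n0 * np.exp(-t / tau)

(n0_fit, tau_fit), _ = curve_fit(decay, t, counts, p0=(800.0, 1.0))
print(f"fitted N0 = {n0_fit:.0f}, fitted decay time = {tau_fit:.2f}")
```

Because the physics fixes the form of the model, only two numbers have to be learned from the data, and both of them have a direct physical meaning.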
However, if the data are from a very complex system, where any guess at a model function is just that, a guess, then the two schools of thought each have pluses and minuses. Breiman's article is well worth a read; it is very clear. I took two related messages from it; others will take other things from it.
The first is that when the data are from a complex system, where we have no reason to expect the system to be linear or exponential or to follow some other simple function, then fitting some complex assumed model function can lead you up a blind alley. With enough parameters in your model function you can fit the data well, but this is meaningless. Such complex problems are the norm in the social sciences (people are complex!); they are common in biology, and they certainly occur in physics. For example, I study the nucleation of crystals, and this can be very complex. This highlights a problem with School A.
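A toy illustration of that blind alley (my own synthetic example, not one from the article): a polynomial with a dozen or so free parameters will go through fifteen noisy points very nicely, but the good fit tells us nothing about the underlying system.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Noisy data from a system whose true form we pretend not to know.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

# A degree-12 polynomial has 13 free parameters, enough to chase the noise,
# so the in-sample fit looks excellent...
wiggly = Polynomial.fit(x, y, deg=12)
print("worst in-sample residual:", np.max(np.abs(wiggly(x) - y)))

# ...but nothing has been learned about the system: just outside the range
# of the data the predictions are typically wildly off.
print("prediction at x = 1.2:", wiggly(1.2))
```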
The second is that one of the most common arguments against School B is overstated. This common argument is that machine learning techniques such as neural nets, support vector machines, and decision trees are so complicated, and have internal workings that are so complex, that they do not lead to understanding*.
Breiman’s point here is, I think, that although these machine learning approaches are complex, this does not stop you from working out which variables are the key ones. Roughly speaking, you can deliberately introduce artificial noise into any variable you select, and if that noise greatly decreases the accuracy of the predictions made by the neural net, decision tree, etc., then that variable must be important. However, if the noise has no effect on how accurate your predictions are, then you know that variable is largely irrelevant. In this way, you can get a lot of insight into which parts of your data are driving the effects you are interested in.
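I believe the noise-injection idea Breiman has in mind is what is now usually called permutation importance. A rough sketch, using scikit-learn and synthetic data of my own invention:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the output depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0.0, 0.1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
baseline = r2_score(y_test, model.predict(X_test))

# Scramble one variable at a time and see how much the accuracy drops;
# a large drop means the model was relying on that variable.
for col in range(X.shape[1]):
    X_scrambled = X_test.copy()
    X_scrambled[:, col] = rng.permutation(X_scrambled[:, col])
    drop = baseline - r2_score(y_test, model.predict(X_scrambled))
    print(f"variable {col}: accuracy drop = {drop:.3f}")
```

(scikit-learn also packages up the same idea as sklearn.inspection.permutation_importance, if you would rather not write the loop yourself.)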
I will still be teaching School A, as this is just an introductory course and that is the basic approach in physics, but I have added a link to the article on the course page, so the students can read a comparison of what they are doing with the now-very-fashionable machine learning approaches. If they read about machine learning outside the course, they will at least have some idea of how it relates to the data analysis they have been taught.
* To clarify what I mean by understanding here: we can define understanding data as having extracted key features of the data that allow us to make general predictions, in the following sense. We can draw conclusions such as: variable X, but not variable Y, controls output A. We can then predict that changing the value of X to a value far outside the range we have data for is likely to strongly change A, while doing the same for Y is unlikely to change A by much. The key phrase here is “far outside the range we have data for”. This definition means that understanding implies being able to extrapolate rationally. It is a criticism of School-B-type analyses that they don’t allow us to do that.
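A small synthetic example of the extrapolation point: an exponential fitted in the School A style can be evaluated far beyond the data and decays the way the physics says it should, while a tree-based School B model such as a random forest can only repeat values it saw near the edge of its training range.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.ensemble import RandomForestRegressor

# Training data only covers times 0 to 5.
rng = np.random.default_rng(3)
t = np.linspace(0.0, 5.0, 100)
counts = 1000.0 * np.exp(-t / 2.5) + rng.normal(0.0, 5.0, t.size)

# School A: the fitted exponential extrapolates as the physics dictates.
def decay(t, n0, tau):
    return n0 * np.exp(-t / tau)

(n0, tau), _ = curve_fit(decay, t, counts, p0=(800.0, 1.0))

# School B: a random forest asked about t = 20, far outside the data,
# can only return values close to those it saw near t = 5.
forest = RandomForestRegressor(random_state=0).fit(t.reshape(-1, 1), counts)

print("exponential fit at t = 20:", decay(20.0, n0, tau))
print("random forest at t = 20:  ", forest.predict(np.array([[20.0]]))[0])
```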