I am starting to sort out my Python teaching for the coming semester; the course contains some introductory data analysis. As part of this, I have just read a relatively old (2001) but, I think, influential article that compares and contrasts two schools of data analysis. Roughly speaking, these are:
- School A: fit a model function that is as simple as possible to the data, for example a straight-line or exponential fit, to try to understand what is going on.
- School B: use a machine learning algorithm, such as a neural net or a support vector machine, to obtain the best possible predictions (a minimal contrast of the two is sketched below).
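To make the contrast concrete, here is a minimal sketch on made-up straight-line data. The scikit-learn estimators and all the numbers are my own illustrative choices, not anything from the article itself:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=1.0, size=200)  # noisy line

# School A: fit the simplest plausible model and read off its parameters,
# to understand the process that generated the data.
line = LinearRegression().fit(x, y)
print(f"slope = {line.coef_[0]:.2f}, intercept = {line.intercept_:.2f}")

# School B: fit a flexible black-box model and judge it only by how well
# it predicts, without asking what is going on inside.
svm = SVR(kernel="rbf").fit(x, y)
print("SVM prediction at x = 5:", svm.predict([[5.0]]))
```

On data this simple the two schools agree; Breiman's point was about what to do when they don't.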
The author is Leo Breiman, a statistician who was encouraging his fellow statisticians to give School B a try. He thought many of them were sticking too rigidly to School A, and that is what inspired him to write the article.
Growing a crystal of a protein often starts by mixing a solution of the protein with a solution of a salt. If you imagine sitting at a point that starts in the protein solution, then as mixing occurs the protein diffuses away into the salt solution and is diluted, so the protein concentration decreases, while as the salt arrives, the salt concentration increases. This means that in a plot with salt concentration on the x-axis and protein concentration on the y-axis, the concentrations at that point move down and to the right.

The point will start at the position marked above by the blue circle and finish at the magenta circle. If the mixing is just diffusion of the protein and salt, and if they diffuse equally fast, the point will follow the path of the straight dashed red line above. But if the protein diffuses much more slowly (which it does), and if there is flow of the solutions (almost unavoidable except for the smallest volumes*), the point should follow the path of the dashed black line, which is of course a very different path.
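As a rough sanity check on this picture, here is a small sketch of pure 1D diffusion: protein and salt start in opposite halves of a box, and we track the (salt, protein) concentrations at one point. The diffusion constants, grid, and probe position are all illustrative assumptions, and flow (which the black path also involves) is not modelled, so this only captures the unequal-diffusion part of the story:

```python
import numpy as np

def diffuse(c, D, dx, dt, steps):
    """Explicit finite-difference 1D diffusion with no-flux walls."""
    for _ in range(steps):
        lap = np.empty_like(c)
        lap[1:-1] = (c[2:] - 2.0 * c[1:-1] + c[:-2]) / dx**2
        lap[0] = (c[1] - c[0]) / dx**2     # no-flux boundary
        lap[-1] = (c[-2] - c[-1]) / dx**2  # no-flux boundary
        c = c + D * dt * lap
    return c

n, dx, dt = 200, 1.0, 0.1                             # illustrative grid
protein = np.where(np.arange(n) < n // 2, 1.0, 0.0)   # protein on the left
salt = 1.0 - protein                                  # salt on the right
D_salt, D_protein = 1.0, 0.1  # protein ~10x slower (illustrative numbers)

probe = n // 4                # a point that starts in the protein solution
path = []
for _ in range(500):
    salt = diffuse(salt, D_salt, dx, dt, 10)
    protein = diffuse(protein, D_protein, dx, dt, 10)
    path.append((salt[probe], protein[probe]))

# With D_salt == D_protein the path stays on the straight line
# protein + salt = 1 (the red dashed path); with the slower protein the
# salt arrives before the protein has left, so the path bows away from it.
```

Setting `D_protein = D_salt` in this sketch recovers the straight-line path exactly, because the sum of the two concentrations then obeys the same diffusion equation and stays constant everywhere.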