One of the classic mistakes you can make as a teacher is to spot what you fondly think is a small gap in the curriculum, and then commit to filling it. The not-so-small gap is in our teaching of data analysis. Analysing data is, as I just said to our second years, a key part of doing science. As I also said to them, it is poor practice to use formulas or software such as Excel without knowing what they are doing. Both of these statements are true. The problem is that data analysis is a huge subject, and it is underpinned by lots of maths whose details I don’t know myself and will not be teaching to the students. So, by committing to do one extra lecture to try and improve matters here, I bit off a bit more than I could chew.

# Category Archives: statistics

## The “exemplary statistical safety record” and the fireball visible 60 km away

I am revising a numerical physics course for the forthcoming semester, in particular the bits about data analysis. So I have been reading a couple of books, both to learn from them and to see if they could be useful to the students. One compact but good summary is *The Data Loom* by Stephen Few. It is quite introductory and short, so I am thinking that it could be good to recommend to the students. It covers a lot of ground and I like the author’s practical, sceptical tone. It also has some excellent examples.

## Statisticians for machine learning

I am starting to sort out my Python teaching for the coming semester; the course contains some introductory data analysis. As part of this, I have just read a relatively old (2001) but I think influential article that compares and contrasts two schools of data analysis. Roughly speaking these are:

- School A): fit a simple-as-possible model function to the data, for example a straight-line or exponential fit, to try and understand what is going on.
- School B): use a machine learning algorithm such as a neural net, or a support vector machine, to obtain the best possible predictions.

The author is Leo Breiman, a statistician, who was encouraging his fellow statisticians to give School B a try. He thought many statisticians were sticking too rigidly to School A, and this inspired him to write this article, which argues for School B.

## Seventeen top-ten universities

The 2020 Guardian University League Tables are out, and Saturday’s print edition ran with the headline “Oxford falls to third place in university rankings”. As someone who teaches data analysis, that seemed to me to be quite a definite statement; there is no obvious caveat to indicate how confident they are of it. This omission concerns me, but to be fair to *The Guardian*, they have the 2020 league table data available for download as a spreadsheet. It looks like a fair number of the data values are missing, so I turned to the 2019 league table data. This data set looks complete, and is of the same form. Each university has nine data values, and in each case the analysis assumes that bigger is better, i.e., large values of each number indicate a good university, or good teaching, somehow*.

## Uncertainty estimates for Fermi estimates

In a final year course that I co-teach, I teach Fermi estimation* (my notes are here). Fermi estimates are simple back-of-the-envelope calculations. Let’s say you want a Fermi estimate of how many people in the UK take a train journey on a normal weekday. You start by saying “Well the population of the UK is about 60 million people”, then you say “I guess about 10% take a train journey on a given day, as 1% of people taking a train looks too low, while it is clearly not 100%”. The Fermi estimate is then that about 6 million people take the train in one day. To check this estimate, I did a little Googling, and there are about 6 million journeys per day in the UK. Assuming that people who travel in a day take two trips (e.g. to and from work), that is about 3 million people, so my estimate is too large by about a factor of two. Not bad for a simple estimate.
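The arithmetic above can be written out as a few lines of Python. This is only a sketch of the estimate as described in the text; the variable names are my own, and the numbers are the rough ones quoted above, not precise statistics.

```python
# Fermi estimate: how many people in the UK take a train on a normal weekday?
population = 60e6         # rough UK population, as used in the text
fraction_on_train = 0.10  # guess: 1% looks too low, 100% is clearly too high

estimated_travellers = population * fraction_on_train

# Check against the figure found by Googling: about 6 million journeys per day,
# assuming each traveller makes two trips (e.g. to and from work).
journeys_per_day = 6e6
trips_per_traveller = 2
observed_travellers = journeys_per_day / trips_per_traveller

# Ratio of estimate to check value; a ratio within a factor of a few is a
# reasonable outcome for a back-of-the-envelope calculation.
ratio = estimated_travellers / observed_travellers

print(f"Fermi estimate:   {estimated_travellers:.1e} people")
print(f"Check value:      {observed_travellers:.1e} people")
print(f"Estimate / check: {ratio:.1f}")
```

Note that the 10% guess sits at the geometric mean of the two bounds, 1% and 100%, which is a common way of picking a value when all you have is a "too low" and a "too high".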

## Excellence framework scored: must do better

The government has introduced the Teaching Excellence Framework (TEF), which purports to assess the excellence or otherwise of teaching in English universities. Surrey was awarded the highest score, a gold, in 2017*. But measuring teaching is hard and subjective, so mostly what the TEF measures is statistics about a university, plus a text summary. There is no actual observation of teaching in the TEF, let alone direct assessment of it. The august body, the Royal Statistical Society (RSS), has just issued a critique of the TEF. The critique reads like the feedback on a piece of statistics coursework submitted by an unusually weak student.

## Being useful to molecular biologists really boosts your citations

In an earlier blog post, I noted that by one metric for the impact of my published work I lose out to F.D.C. Willard, who was a cat. In a similar vein, above I have plotted the number of citations to my most highly cited paper of the last few years, *Dynamic Stratification in Drying Films of Colloidal Mixtures* by Fortini *et al.*, together with the number of citations of a 1970 paper by Laemmli. In Laemmli’s paper, he pioneered a technique called SDS-PAGE. Note that the column with our paper appears blank. This is because, on the scale of the plot, the bar for our paper is less than one pixel high. At the bottom I have replotted this bar graph on a log scale, where you can see the number of citations of our work.

## Dragon-kings, black swans and an anti-HIV drug

Next week, I am off to Paris for a workshop, so I am writing my talk. Above is a plot of French cities. The x-axis is the log of the rank of the city, where the ranking of the city is by size, i.e., the first point (shown in pink) is for France’s largest city, Paris, at an x value of log(1)=0, while the second point is France’s second largest city, Marseille, at a value log(2)=0.30, the third is Lyon at log(3)=0.48, etc. The y-axis is the population of the city, raised to the power *c* = 0.18. This exponent is a fitting parameter: the value of 0.18 is the one that makes the data closest to a straight line. As you can see, for this value of *c* the data falls on a pretty decent straight line.
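For anyone who wants to try this kind of plot on their own data, here is a minimal sketch of one way to pick the exponent. It assumes that "closest to a straight line" means the largest R² of a least-squares line through the points; that criterion, and the function names `straightness` and `best_exponent`, are my assumptions for illustration, not necessarily how the plot above was made. The logs are base 10, matching log(2)=0.30.

```python
import numpy as np

def straightness(populations, c):
    """R^2 of a straight-line fit of population**c against log10(rank)."""
    # Rank 1 is the largest city, so sort populations in descending order.
    pops = np.sort(np.asarray(populations, dtype=float))[::-1]
    x = np.log10(np.arange(1, len(pops) + 1))
    y = pops ** c
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

def best_exponent(populations, candidates=None):
    """Scan candidate exponents and return the one giving the straightest line."""
    if candidates is None:
        candidates = np.linspace(0.01, 1.0, 100)  # 0.01, 0.02, ..., 1.00
    scores = [straightness(populations, c) for c in candidates]
    return float(candidates[int(np.argmax(scores))])
```

Calling `best_exponent(populations)` with a list of city populations returns the scanned exponent with the highest R²; plotting `populations ** c` against `log10(rank)` for that value then shows how straight the line actually is.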

## People in this country have had enough of experts

The title is a quote from Michael Gove. To be fair to him, I think he was referring to economists, a profession whose track record of consistently making inaccurate predictions is notorious. But still, as a PhD-educated scientist, the quote does rankle a bit. Although that may be just my natural reaction to Michael Gove: he could read the phone book and I’d still get grumpy.

## Earthquakes as the price of cheap oil

The graph above shows the number of earthquakes in the US state of Oklahoma, for each year from 1978 to 2016 (the 2016 data is only up to September). The number is for earthquakes with magnitude greater than 3.0. There is a striking increase from 2008, when there were 2 earthquakes, to 2015, when there were 890.