An earlier post showed the data (blue and green symbols) above but not the fits (cyan and red curves). The data are the number of papers published in two scientific journals, *Nature* and *PLOS One*, as a function of the number of citations each paper received (plus 1, so that papers with zero citations can appear on a log-log plot). So, for example, the blue circle at the top left edge is at *x* and *y* coordinates of 10^{0} = 1 and 13,000, meaning that *PLOS One* published 13,000 articles that were cited 1 - 1 = 0 times. The papers were published in 2013 and 2014, and the citations were in 2015. The mean number of citations is the Journal Impact Factor (JIF), so the JIFs of the two journals are the means of the distributions above.

Both distributions are broad, and broad distributions are often fit with power laws. The cyan line above is a power-law fit to the large-citation tail of the distribution of papers from *PLOS One*. The fit is that the number of papers with *n* citations is proportional to *n*^{-4}. It is not a bad fit to the right-hand tail of the distribution, the part with the most highly cited papers. The red curve is a fit of a log-normal distribution to the number of *Nature* papers as a function of the number of citations, *n*, they get. It is also a reasonable fit, except at small *n*.
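
To make the idea of a power-law fit concrete, here is a minimal sketch of one simple way to estimate the exponent: a straight-line least-squares fit to the tail on log-log axes. This is not necessarily how the cyan line above was obtained, and the function name `fit_power_law_tail`, the cutoff `n_min`, and the synthetic counts are all assumptions made for illustration. Least squares on log-log data is also known to be a rough estimator when the counts are noisy, so treat it as a first look rather than a careful fit.

```python
import math


def fit_power_law_tail(n_values, counts, n_min=1):
    """Least-squares straight line through (log10 n, log10 count) for n >= n_min.

    Returns (slope, intercept); a pure power law count ~ n**slope is a
    straight line with that slope on a log-log plot.
    """
    points = [
        (math.log10(n), math.log10(c))
        for n, c in zip(n_values, counts)
        if n >= n_min and c > 0  # empty bins cannot be shown on a log scale
    ]
    if len(points) < 2:
        raise ValueError("need at least two non-empty bins in the tail")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, noise-free counts that follow n**-4 exactly (not the journal data).
n_values = list(range(1, 51))
counts = [1.0e6 * n ** -4 for n in n_values]
slope, intercept = fit_power_law_tail(n_values, counts, n_min=5)
print(f"fitted exponent: {slope:.2f}")
```

On real citation counts the fitted slope will depend on where the tail is taken to start, which is why `n_min` is left as an explicit choice.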

Power-law and log-normal distributions are actually close cousins*; typically they result from similar models. Power laws can result from what are often called ‘preferential attachment’ models*. In terms of paper citations, a preferential-attachment model just assumes that the more citations a paper has, the more likely it is to be cited again; e.g., a paper that has 4 citations picks up further citations at, on average, twice the rate of a paper that has 2 citations. This will give a power law, like that seen for *PLOS One*.
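
A preferential-attachment model is easy to simulate. The sketch below is a toy version under stated assumptions, not a model fitted to the journal data: papers appear one at a time, each new paper cites exactly one earlier paper, and the chance of being cited is proportional to (citations + 1). The "+ 1" is an assumption added so that uncited papers can still pick up a first citation. The function name `simulate_preferential_attachment` and its parameters are hypothetical.

```python
import random
from collections import Counter


def simulate_preferential_attachment(n_papers=20000, seed=0):
    """Toy citation model: each new paper cites one earlier paper, chosen
    with probability proportional to (citations so far + 1)."""
    rng = random.Random(seed)
    citations = [0]  # citations[i] = number of citations paper i has received
    tickets = [0]    # paper i appears in this list (citations[i] + 1) times
    for new_paper in range(1, n_papers):
        # Drawing a ticket uniformly picks a paper with probability
        # proportional to its ticket count, i.e. to citations + 1.
        cited = rng.choice(tickets)
        citations[cited] += 1
        tickets.append(cited)      # one extra ticket for the extra citation
        citations.append(0)
        tickets.append(new_paper)  # the new paper enters with a single ticket
    return citations


citations = simulate_preferential_attachment()
histogram = Counter(citations)  # histogram[n] = number of papers with n citations
for n in sorted(histogram)[:6]:
    print(n, histogram[n])
print("most-cited paper:", max(citations), "citations")
```

Plotting `histogram` on log-log axes (with the same +1 shift on the citation axis as in the figure) should show most papers with few or no citations and a long tail of a few heavily cited ones.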

Log-normal distributions arise* when, in successive time intervals, the variable (here the number of citations) is multiplied by factors that fluctuate around a constant mean value. For example, if each month the number of citations of a paper increases by on average 5% of the current number, but the actual increase varies from one month to another, then that will give a log-normal distribution. So, just as with the power-law model, the more citations a paper has, the more additional citations it tends to pick up.
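
The 5%-a-month example can be sketched in a few lines. The assumptions here are mine, for illustration only: every paper starts at a value of 1, the monthly growth is drawn uniformly between 0% and 10% (so it averages 5%), and citation numbers are treated as continuous. The names `simulate_multiplicative_growth` and `skewness` are hypothetical helpers. Because the logarithm of the final value is a sum of many independent small terms, it should be close to normally distributed, which is what log-normal means.

```python
import math
import random


def simulate_multiplicative_growth(n_papers=5000, n_months=60, seed=0):
    """Each month every paper's citation count grows by a random percentage
    drawn uniformly from 0% to 10% (mean 5%) of its current value."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_papers):
        value = 1.0
        for _ in range(n_months):
            value *= 1.0 + rng.uniform(0.0, 0.10)
        finals.append(value)
    return finals


def skewness(values):
    """Sample skewness: about 0 for a symmetric distribution,
    positive when there is a long right-hand tail."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n


finals = simulate_multiplicative_growth()
logs = [math.log(v) for v in finals]
print("skewness of values:     ", round(skewness(finals), 2))
print("skewness of log(values):", round(skewness(logs), 2))
```

The raw values should come out clearly right-skewed, while their logarithms should be nearly symmetric, as expected for a log-normal distribution.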

So both models include what is called the Matthew effect, named after the quote in the Gospel of Matthew: “For unto every one that hath shall be given, and he shall have abundance: but from him that hath not shall be taken even that which he hath.” In less Biblical language: the more you have, the more you get. Indeed, this type of model is used to model the distribution of wealth in society.

The preferential-attachment model is just one way of getting a power law, so a good power-law fit does not prove that this model explains the distribution of citations. But it is quite plausible: the more times a paper is cited, the more people will come across it in reference lists, which should tend to increase its citations. So maybe it is at least part of the explanation.

* There is a very nice overview, with history and derivations, of power-law and log-normal distributions in a paper by Michael Mitzenmacher.
