Wikipedia:Modelling Wikipedia's growth

This page analyses the article count data in Wikipedia:size of Wikipedia as of June 2003, and attempts to fit a simple numerical model of past and future growth to the observed article counts and growth rates.

Is the growth of Wikipedia exponential?

One common model of Wikipedia growth is that

  • more content leads to more traffic,
  • which leads to more edits
  • which generate more content.

Thus, the average rate of growth should be proportional to the size of Wikipedia, which is the defining property of exponential growth.

However, it is quite difficult to see whether this is the case, given the distorting effects of auto-generated articles, sampling noise and server slowdowns, which would act to hide any such trend, even if it were present.
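
In symbols (a sketch only: the constant k is introduced here purely for illustration and is not a value taken from the data), the hypothesis reads

<math>\frac {dy} {dt} = k y</math>

where y is the article count and t the time. Its solution, <math>y(t) = y(0) e^{kt}</math>, is exponential growth.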

Here is the graph of article count growth data from Wikipedia:size of Wikipedia:

Given the sizeable artifacts in the data, it is almost impossible to test whether the current growth is approximately linear, quadratic or exponential by fitting a curve to the data.

The scatter-plot below attempts to examine the exponential-growth hypothesis by plotting short-term incremental changes against absolute size. For each successive pair of recorded data points, the average rate of increase over the period was plotted against the average article count. The plot was then cropped to remove a moderate number of large positive outliers and a few negative ones, which represented the submission of large numbers of auto-generated articles, re-scalings of the article count, and software glitches.
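
The construction of the scatter-plot points can be sketched in a few lines of Python. This is only an illustration: the function name `rate_vs_size`, the crop limits `min_rate` and `max_rate`, and the sample numbers are all made up here, and the actual crop limits used for the plot are not stated on this page.

```python
from typing import List, Tuple

def rate_vs_size(
    samples: List[Tuple[float, float]],
    min_rate: float = 0.0,     # hypothetical lower crop limit (articles/day)
    max_rate: float = 400.0,   # hypothetical upper crop limit (articles/day)
) -> List[Tuple[float, float]]:
    """Turn (day, article_count) samples, sorted by day, into
    (average article count, average daily increase) points."""
    points = []
    for (t0, y0), (t1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        rate = (y1 - y0) / dt        # average increase per day in the period
        mean_count = (y0 + y1) / 2   # average article count in the period
        if min_rate <= rate <= max_rate:
            points.append((mean_count, rate))  # keep only points inside the crop
    return points

# Made-up samples: the jump of 10,000 articles in 10 days mimics a bulk upload.
samples = [(0, 1000), (10, 1600), (20, 2300), (30, 12300), (40, 13100)]
points = rate_vs_size(samples)
```

With these made-up samples, the bulk-upload interval is dropped by the crop and the three ordinary intervals remain.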

Some remaining outliers are still visible: by cross-checking with the article count vs. date graph, you can see how they correlate with Rambot activity, so I have chosen to ignore them for the purposes of curve fitting.

In particular, the following features are present:

  • two very low points at around 35,000 articles represent editing during the major server slowdown in June/July 2002
  • between 40,000 and 90,000 articles, the data is dominated by Rambot's auto-generated articles: most of the sample points in this interval are well off the top of the chart, at thousands of articles per day
  • the low outlier at around 120,000 articles is caused by the article counter being locked

Note that the data is really quite noisy. Further analysis is welcomed!

The red line is a visual fit for the trend, ignoring the outliers.

Speculative growth predictions

Hypothesis: the growth rate is a constant number of articles per day, submitted by "hard-core" Wikipedians, plus an additional number proportional to the article count of Wikipedia. Thus, it should be possible to fit a straight line to the bulk of the "main-line" points in the scatter plot.
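
A visual fit of this kind could be cross-checked numerically. The sketch below uses ordinary least squares, which is not the method used on this page, and runs on synthetic points that lie exactly on a line; the function name `fit_line` and the numbers are illustrative assumptions, not Wikipedia data.

```python
from typing import List, Tuple

def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit of rate = intercept + slope * count.
    Returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic (article count, articles/day) points on the line 50 + 0.001 * count.
points = [(c, 50 + 0.001 * c) for c in (10_000, 20_000, 30_000, 40_000)]
intercept, slope = fit_line(points)
```

On real data the outliers would have to be removed first, since least squares is sensitive to them.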

Here's a by-eye fit:

<math>\frac {dy} {dt} = 50 + \frac {170} {140000} y</math>

where y is the article count and t the time since January 10, 2001, measured in days. This is a first-order nonhomogeneous linear differential equation. Using this very crude model, we get the following prediction for human-contributed Wikipedia articles, assuming no slow-downs and no data-dumping:

Note how linear growth dominates the first part of the graph, with exponential effects only becoming visible in late 2003 / early 2004. After that, exponential growth dominates.

Surprisingly, the model is a remarkably good fit for the past, given that it was taken only from by-eye inspection of the scatter plot, with no attempt made to fit the prediction to the current figures. The model predicts around 100,000 articles in mid-2003, instead of the current 132,000. However, 36,000 of the current articles are Rambot-generated, and so not human-contributed growth in the sense meant above, which leaves about 96,000 human-contributed articles.

Prediction based on this model: 1,000,000 articles in mid-2008. After that, the size grows to several million over the next few years.
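
For readers who want to reproduce such figures, the differential equation above has a closed-form solution. The sketch below is a hedged illustration: the starting value y(0) = 0 on January 10, 2001 is an assumption (this page does not state the initial condition used), and the names `article_count` and `days_to_reach` are made up for this example.

```python
import math

A = 50.0              # constant term of the by-eye fit, articles per day
K = 170.0 / 140000.0  # proportional term of the by-eye fit, per day

def article_count(t_days: float, y0: float = 0.0) -> float:
    """Solution of dy/dt = A + K*y with y(0) = y0 (y0 = 0 is an assumption)."""
    return (y0 + A / K) * math.exp(K * t_days) - A / K

def days_to_reach(target: float, y0: float = 0.0) -> float:
    """Days after t = 0 at which the model reaches `target` articles."""
    return math.log((target + A / K) / (y0 + A / K)) / K

# Days after January 10, 2001 at which this sketch crosses one million articles.
t_million = days_to_reach(1_000_000)
```

A different assumed starting value shifts the dates, so results from this sketch need not match the figures quoted above exactly.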

Questions:

  • is this model even remotely valid? (Time will tell.)
  • how long can exponential growth go on, or is this really just the early part of a logistic curve?
  • what does this imply for server and traffic scaling?

Eventually there will probably be a point where the number of articles created each day begins to fall, for lack of new topics to write about. At that point, however, the amount of information in each article will probably begin to increase much faster.


All Wikipedia text is available under the terms of the GNU Free Documentation License

 