Is the growth of Wikipedia exponential?
One common model of Wikipedia growth is that
Thus, the average rate of growth should be proportional to the size of the Wikipedia.
However, it is quite difficult to see whether this is the case, because auto-generated articles, sampling noise and server slowdowns distort the data and would hide any such trend, even if it were present.
Here is the graph of article count growth data from Wikipedia:size of Wikipedia:
Given the sizeable artifacts in the data, it is almost impossible to test whether the current growth is approximately linear, quadratic or exponential by fitting a curve to the data.
The scatter-plot below attempts to examine the exponential-growth hypothesis by looking at the relationship of incremental short-term changes against absolute size. For each successive pair of data points recorded, the average rate of increase in the period was plotted against the average article count. The plot was then cropped to remove the moderate number of large positive and a few negative outliers that represented the submission of large numbers of auto-generated articles, re-scalings of the article count, and software glitches.
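The construction of the scatter plot can be sketched in a few lines of Python. This is only an illustration of the procedure described above, not the script used to produce the plot: the function names, the sample data and the cropping thresholds are all made up for the example.

```python
def rate_vs_size(samples):
    """Turn (day, article_count) samples, sorted by day, into scatter points.

    For each successive pair of samples, returns
    (average article count over the period, average articles per day).
    """
    points = []
    for (t0, y0), (t1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rate = (y1 - y0) / dt
        mean_count = (y0 + y1) / 2
        points.append((mean_count, rate))
    return points


def crop(points, min_rate, max_rate):
    """Keep only points whose rate lies within [min_rate, max_rate]."""
    return [(count, rate) for count, rate in points if min_rate <= rate <= max_rate]


# Made-up sample data for illustration only (not real Wikipedia counts).
# The jump between day 20 and day 30 imitates a bulk upload of
# auto-generated articles.
samples = [(0, 1000), (10, 1500), (20, 2100), (30, 12100), (40, 12900)]

points = rate_vs_size(samples)
kept = crop(points, min_rate=0, max_rate=200)  # thresholds are arbitrary here
```

With real data, the cropping limits would have to be chosen by inspecting the plot, as was done for the figure discussed here.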
Some remaining outliers are still visible: by cross-checking with the article count vs. date graph, you can see how they correlate with Rambot activity, so I have chosen to ignore them for the purposes of curve fitting.
In particular, the following features are present:
Note that the data is really quite noisy. Further analysis is welcomed!
The red line is a visual fit for the trend, ignoring the outliers.
Speculative growth predictions
Hypothesis: growth rate is a constant number of articles per day, submitted by "hard-core" wikipedians, with an extra number that is proportional to the article count of Wikipedia. Thus, it should be possible to fit a straight line to the bulk of the "main-line" points in the scatter plot.
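Written as a formula, the hypothesis is rate = a + b × (article count), where a is the constant daily contribution and b the proportional part. As a numerical cross-check on a visual fit, a straight line could also be fitted by ordinary least squares. The sketch below assumes the outliers have already been removed, since least squares is sensitive to them; the function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit of rate = a + b * count.

    points: iterable of (count, rate) pairs with outliers already removed.
    Returns (a, b).
    """
    points = list(points)
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all counts are identical; slope is undefined")
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Made-up points lying exactly on rate = 30 + 0.002 * count,
# used only to show that the fit recovers the line.
example_points = [(1000, 32.0), (2000, 34.0), (4000, 38.0), (8000, 46.0)]
a, b = fit_line(example_points)
```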
Here's a by-eye fit:
where y is the article count and t the time since January 10, 2001, measured in days. This is a first-order nonhomogeneous linear differential equation. Using this very crude model, we get the following prediction for human-contributed Wikipedia articles, assuming no slow-downs and no data-dumping:
Note how linear growth dominates the first part of the graph, with exponential effects only really becoming visible in late 2003 / early 2004. After that, exponential growth dominates.
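The linear-then-exponential behaviour follows from the general form of the model. For dy/dt = a + b·y with y(0) = y0, the solution is y(t) = (y0 + a/b)·exp(b·t) − a/b, and the proportional term b·y overtakes the constant term a once y exceeds a/b. The sketch below evaluates this with placeholder coefficients; the values of `A` and `B` and the choice y0 = 0 are assumptions made for illustration and are not the by-eye fit quoted in this article.

```python
import math


def article_count(t, a, b, y0=0.0):
    """Closed-form solution of dy/dt = a + b*y with y(0) = y0."""
    c = a / b
    return (y0 + c) * math.exp(b * t) - c


def days_to_reach(target, a, b, y0=0.0):
    """Invert the solution: days until the article count reaches target."""
    c = a / b
    return math.log((target + c) / (y0 + c)) / b


# Placeholder coefficients for illustration only (NOT the article's fit).
A = 50.0   # hypothetical constant contribution, articles per day
B = 0.001  # hypothetical proportional growth, per day

# Size at which the proportional term B*y equals the constant term A.
crossover_count = A / B

# Day on which that size is reached, starting from an assumed y0 = 0.
crossover_day = days_to_reach(crossover_count, A, B)
```

Starting from y0 = 0, the crossover always falls at t = ln(2)/b, whatever the value of a. Below that point the curve looks roughly linear; well above it, growth is essentially exponential.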
Surprisingly, the model is a remarkably good fit for the past, given that it was taken only from by-eye inspection of the scatter plot, with no attempt made to fit the prediction to the current figures. The model predicts around 100,000 articles in mid-2003, against a current count of 132,000. However, 36,000 of the current articles are Rambot-generated, and so are not human-contributed growth in the sense meant above; excluding them leaves about 96,000, close to the prediction.
Prediction based on this model: 1,000,000 articles in mid-2008. After that, the size grows to several million over the next few years.
Questions:
Eventually there will probably be a point where the number of articles created each day begins to fall, for lack of new things to write about. But the amount of information in each article will probably then begin to increase much faster.