Wikipedia talk:Statistics

OK, somebody has to say this: The fact that we are patting ourselves on the back for intentionally undercounting our articles is just plain silly. I just now went and looked under "short pages" at all 28 pages with exactly 100 bytes, and 13 of them contained a comma. Not a single one of them deserves to be called an article, but almost half are counted. Next I looked at all 33 pages with exactly 200 bytes, and 27 of those contained a comma. A few of them (not eighty percent!) might be considered articles under an extremely lenient definition of article, but does anyone outside of Wikipedia consider a single, brief paragraph to be an article? Are ANY of Britannica's articles under 500 bytes?
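To make the sampling argument concrete, here is a minimal sketch. It assumes, going only by what is described in this thread, that a page is counted when its text contains a comma; the function names are made up for illustration and are not taken from the Wikipedia software.

```python
def counts_as_article(text: str) -> bool:
    """Assumed rule from this discussion: a page is counted if it contains a comma."""
    return "," in text


def counted_share(counted: int, sampled: int) -> float:
    """Fraction of sampled pages that the rule would count."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    return counted / sampled


# Samples quoted above: 13 of 28 pages at 100 bytes, 27 of 33 pages at 200 bytes.
for size, counted, sampled in [(100, 13, 28), (200, 27, 33)]:
    print(f"{size} bytes: {counted_share(counted, sampled):.0%} counted")
```

On the quoted samples this works out to roughly 46% of the 100-byte pages and 82% of the 200-byte pages being counted.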

I estimate our median article size as 1000 bytes, because that's the size of our 18943rd longest page according to long pages. (18943 would be the median of 37886 total articles.) To my mind, a conservative count of articles would place a 1000-byte minimum, rather than a 1000-byte median, which would trim our total article count in half. But no matter how we count articles, let us at least prominently post the median size of the articles which are included in the count. And please, please don't call the count "unimpeachable". (For reference, my little tirade (including this sentence) is 1367 bytes long, i.e. rather longer than our median article.)
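The median estimate above can be sketched as follows. This is only an illustration of the rank arithmetic, assuming a list of page sizes sorted longest first (as on "long pages"); the helper names are hypothetical.

```python
from typing import Sequence


def median_rank(total_pages: int) -> int:
    """1-based rank of the median page, counting from the longest page.

    Follows the convention used in the comment above: rank = total / 2.
    """
    return total_pages // 2


def median_size(sizes_longest_first: Sequence[int]) -> int:
    """Size of the page sitting at the median rank."""
    if not sizes_longest_first:
        raise ValueError("need at least one page size")
    rank = max(1, median_rank(len(sizes_longest_first)))
    return sizes_longest_first[rank - 1]


print(median_rank(37886))  # the rank used in the estimate above
```

With 37886 pages the rank comes out as 18943, matching the figure quoted above.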

--Fritzlein 02:55 Aug 17, 2002 (PDT)

I agree.

I'm sure that I found information about the total database size of Wikipedia recently, but I can't find it again. Could this be made available again, please.

It would be good to have the info on this page, or a link on this page to where it can be found.

If I get an answer - this time I'll try to keep the info! David Martland 15:20 Dec 13, 2002 (UTC)

Don't know if this is what you want, but:

 $ ls -l
 total 2801904
 -rw-rw----    1 mysql    mysql        8852 Aug  9 19:27 archive.frm
 -rw-rw----    1 mysql    mysql    21077440 Dec 13 17:19 archive.MYD
 -rw-rw----    1 mysql    mysql        1024 Dec 13 17:19 archive.MYI
 -rw-rw----    1 mysql    mysql        8586 Jul 20 19:30 brokenlinks.frm
 -rw-rw----    1 mysql    mysql     7599784 Dec 13 17:58 brokenlinks.MYD
 -rw-rw----    1 mysql    mysql     6398976 Dec 13 17:58 brokenlinks.MYI
 -rw-rw----    1 mysql    mysql        9114 Nov 22 08:21 cur.frm
 -rw-rw----    1 mysql    mysql    451396440 Dec 13 17:59 cur.MYD
 -rw-rw----    1 mysql    mysql    201449472 Dec 13 17:59 cur.MYI
 -rw-rw----    1 mysql    mysql        8756 Jul 20 19:30 image.frm
 -rw-rw----    1 mysql    mysql        8586 Jul 20 19:30 imagelinks.frm
 -rw-rw----    1 mysql    mysql      192136 Dec 13 17:51 imagelinks.MYD
 -rw-rw----    1 mysql    mysql      175104 Dec 13 17:51 imagelinks.MYI
 -rw-rw----    1 mysql    mysql      457448 Dec 13 15:53 image.MYD
 -rw-rw----    1 mysql    mysql      215040 Dec 13 15:53 image.MYI
 -rw-rw----    1 mysql    mysql        8706 Jul 20 19:30 ipblocks.frm
 -rw-rw----    1 mysql    mysql        7300 Dec 12 15:23 ipblocks.MYD
 -rw-rw----    1 mysql    mysql        3072 Dec 12 15:23 ipblocks.MYI
 -rw-rw----    1 mysql    mysql        8582 Jul 20 19:30 links.frm
 -rw-rw----    1 mysql    mysql    41151856 Dec 13 17:59 links.MYD
 -rw-rw----    1 mysql    mysql    26686464 Dec 13 17:59 links.MYI
 -rw-rw----    1 mysql    mysql        8898 Nov 22 08:43 old.frm
 -rw-rw----    1 mysql    mysql        8790 Jul 20 19:30 oldimage.frm
 -rw-rw----    1 mysql    mysql       54436 Dec 13 09:51 oldimage.MYD
 -rw-rw----    1 mysql    mysql       11264 Dec 13 09:51 oldimage.MYI
 -rw-rw----    1 mysql    mysql    2082194432 Dec 13 17:59 old.MYD
 -rw-rw----    1 mysql    mysql    19528704 Dec 13 17:59 old.MYI
 -rw-rw----    1 mysql    mysql        8598 Jul 20 18:49 random.frm
 -rw-rw----    1 mysql    mysql       85000 Dec 13 04:47 random.MYD
 -rw-rw----    1 mysql    mysql        1024 Dec 13 04:47 random.MYI
 -rw-rw----    1 mysql    mysql        8964 Oct 28 08:06 recentchanges.frm
 -rw-rw----    1 mysql    mysql     3014196 Dec 13 17:59 recentchanges.MYD
 -rw-rw----    1 mysql    mysql     1556480 Dec 13 17:59 recentchanges.MYI
 -rw-rw----    1 mysql    mysql        8700 Jul 20 18:49 site_stats.frm
 -rw-rw----    1 mysql    mysql          29 Dec 13 17:59 site_stats.MYD
 -rw-rw----    1 mysql    mysql        2048 Dec 13 17:59 site_stats.MYI
 -rw-rw----    1 mysql    mysql        8874 Aug 24 02:54 user.frm
 -rw-rw----    1 mysql    mysql     1838736 Dec 13 17:48 user.MYD
 -rw-rw----    1 mysql    mysql      174080 Dec 13 17:48 user.MYI
 -rw-rw----    1 mysql    mysql        8590 Nov 27 15:15 watchlist.frm
 -rw-rw----    1 mysql    mysql      291006 Dec 13 17:37 watchlist.MYD
 -rw-rw----    1 mysql    mysql      471040 Dec 13 17:37 watchlist.MYI

I'm sure I've seen some info regarding the total size of Wikipedia (in bytes, or Mbytes) and the availability of downloading the whole database.

I'd be glad if someone could point me in the direction of that info or the relevant page again. I'll save the info this time!

It'd be good to have that info on this page - or a link - too. David Martland 15:18 Dec 13, 2002 (UTC)

There really needs to be a more conservative total article count that makes a distinction between encyclopedia articles and almanac articles and that excludes certain additional pages. What particularly troubles me is that there are now thousands of year almanac pages and that most of them can't even be considered almanac articles because they are just templates.

I therefore propose the following (in addition to the current criteria): 1) any page linked to centuries should be excluded from the total article count and should be given its own line in special:statistics, at least until most of these pages become almanac articles (the vast majority are either templates or templates with one or two entries); 2) any page with a link to Wikipedia:Disambiguation should be excluded from the count; 3) any page that is less than 500 bytes should be excluded from the count (E. coli is 610 bytes); and 4) there should be three "total article counts" for everything not excluded by the above: one for anything with the string list, chart, timeline or table in its title (these would be "almanac-like" articles), one for everything left over (these would be "encyclopedia articles") and one grand total count that would still be the number displayed on the Main Page.
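A rough sketch of how the four criteria above might be applied to a single page. Everything here is an assumption made for illustration: the function and category names are invented, "linked to centuries" is read as "has an outgoing link whose title ends in 'century'", and the title test is a plain substring match.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

ALMANAC_WORDS = ("list", "chart", "timeline", "table")
MIN_BYTES = 500  # threshold from criterion 3


def classify(title: str, size: int, links: Set[str]) -> str:
    """Return the category a page would fall into under the proposal."""
    # 1) pages linked to centuries get their own line (assumed link test)
    if any(link.lower().endswith("century") for link in links):
        return "century-linked"
    # 2) pages linking to Wikipedia:Disambiguation are excluded
    if "Wikipedia:Disambiguation" in links:
        return "disambiguation"
    # 3) pages under 500 bytes are excluded
    if size < MIN_BYTES:
        return "too short"
    # 4) remaining pages are split into almanac-like and encyclopedia articles
    lowered = title.lower()
    if any(word in lowered for word in ALMANAC_WORDS):
        return "almanac"
    return "encyclopedia"


def tally(pages: Iterable[Tuple[str, int, Set[str]]]) -> Counter:
    """Count pages per category and add the grand total of counted articles."""
    counts = Counter(classify(title, size, links) for title, size, links in pages)
    counts["grand total"] = counts["almanac"] + counts["encyclopedia"]
    return counts
```

A plain substring match is crude: a title such as "Vegetable" contains "table" and would land in the almanac bucket, so a real implementation would probably want word boundaries.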

Our current count is exaggerating the true number of articles we have and is harming the project as a result. We need to be honest with our article counts and very conservative -- otherwise we will lose credit with passers-by who are at first impressed by our article count but then find out that it is bloated. --mav 13:36 Aug 28, 2002 (PDT)

I think that sounds pretty reasonable. --Brion 13:47 Aug 28, 2002 (PDT)

As long as the criteria you select are easily computable, I'm happy to make whatever change in the software is necessary to reflect a better count. I also don't think anyone is making any claims about the accuracy of the count--the statistics page itself is careful to point out that these are just estimates. But I agree, a more conservative estimate is entirely warranted. --LDC

Great! While you are at it, a link to Wikipedia:What is an article under the first occurrence of that word on the special stats page would be nice. --mav

I never thought I'd say this, but I'd like something to be more conservative. :-) (Just the article count, not any of Bush's cabinet). --KQ

From a random sampling of pages, I would say that something like a third of our pages would truly count as useful articles in the eyes of a new user (that agrees with earlier observations from Kajakit on the mailing list). That would mean we have something like 10,000-15,000 'useful articles' in the database. We could proxy that by counting, say, articles over 1500 characters long. At the time of writing, that would give 14,148 'useful articles' compared to a headline number on the main page of 39,654.

The 1500 character threshold has the advantage of being long enough to cover most of the non-articles according to the criteria suggested by mav (century pages, disambiguation pages etc) automatically.

I would not like to see the headline count on the main page reduced - I think that would be confusing for new users and perhaps a bit demotivating for the rest of us. We could consider changing the main page wording to something like:

... We started in January 2001 and are already working on 0 articles, with more being added and improved all the time. We want to make over 100,000 complete articles, so let's get to work! Anyone, including you, can edit any article ....

That would let us keep the headline count without creating the suggestion that they are all finished, polished articles. We could keep a running total of '1500 character articles', and perhaps other sizes too, on the statistics pages for those that are interested.

Enchanter 17:19 Aug 28, 2002 (PDT)

I don't think 1500 characters is a particularly useful number -- there are many subjects for which 500 characters would suffice as a minimum article size (although something in the range of 500 - 1000 characters wouldn't bother me too much). Could you maybe run some quick numbers to see how much of a reduction would occur if my proposal were to be enacted (this could be done easily if there were a page count in "what links here")? I was thinking about a reduction of 5 - 7 (maybe 10) thousand. Even if it is more than that, I don't think that a temporary reduction in the total article count would hurt. We've already been through one round of this back when we upgraded to phase II and it didn't hurt anybody's morale that I know of. All that we have to do is write up an announcement that we are enacting a far more conservative definition of what we consider to be an article as far as automatic detection goes. --mav

Mav - here are some figures broadly following your proposal above:

 Total articles        45179  (including without comma)
 <500                 -14314
 Disambiguation         -289
 Year in review        -1292  (pages with numbers as title)
 'list' in title        -386
 'century' in title      -60
 'timeline' in title     -59
 'table' in title        -55
 TOTAL                 28903

The main message here is that what really drives the numbers is the threshold for article size that you use. The other exclusions make a relatively small difference (although I'm sure there are in fact many more 'list like' non articles that aren't picked up by these criteria).
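The point about the size threshold dominating can be checked with a few lines of arithmetic on the figures posted above. This sketch only compares the listed exclusions with one another; treating them as separate, non-overlapping groups is an assumption, since the table does not say how overlaps were handled.

```python
from typing import Dict

# Exclusion sizes as posted in the table above.
EXCLUSIONS: Dict[str, int] = {
    "under 500 bytes": 14314,
    "disambiguation": 289,
    "year in review": 1292,
    "'list' in title": 386,
    "'century' in title": 60,
    "'timeline' in title": 59,
    "'table' in title": 55,
}


def exclusion_shares(exclusions: Dict[str, int]) -> Dict[str, float]:
    """Share of all listed exclusions contributed by each criterion."""
    total = sum(exclusions.values())
    return {label: count / total for label, count in exclusions.items()}


shares = exclusion_shares(EXCLUSIONS)
for label, share in shares.items():
    print(f"{label:22s} {share:6.1%}")
```

On these figures the 500-byte threshold accounts for roughly 87% of everything excluded, with the year pages a distant second.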

I also don't think we could ever be able to teach a computer how to determine just what is, or is not, a 'useful' article. --mav

I agree. That's why I think the best process is to:

1) Decide the proportion of pages we want to count, by randomly sampling recent changes and making subjective decisions.
2) Choose an article size that gives broadly the number we want.

That's how I came up with the threshold of 1500. I absolutely agree that some articles that are shorter than 1500 are worthwhile, but these are offset by the longer articles that are not much use (according to my relatively strict subjective definition of an article).
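The second step of the process described above, picking a size that yields a chosen proportion, could be sketched like this. The function name and the rounding choice are assumptions for illustration; the proportion itself still has to come from the subjective sampling in the first step.

```python
from typing import Sequence


def threshold_for_share(sizes: Sequence[int], share: float) -> int:
    """Size of the shortest page kept when about `share` of pages are counted.

    Pages at or above the returned size make up roughly the requested share.
    """
    if not sizes:
        raise ValueError("need at least one page size")
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    ranked = sorted(sizes, reverse=True)          # longest pages first
    keep = max(1, round(len(ranked) * share))     # how many pages to count
    return ranked[keep - 1]


# Hypothetical page sizes, purely for demonstration.
sample_sizes = [100, 200, 400, 800, 1600, 3200]
print(threshold_for_share(sample_sizes, 0.5))
```

Ties in page size mean the share actually kept can be slightly larger than requested, which seems acceptable for a headline estimate.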

The impression I get is that the average quality of Wikipedia has been fairly constant. That is, the tendency for the average quality to rise as articles are improved and the tendency for average quality to fall as new stubs are added broadly cancel out. If so, then picking an article size threshold should give a reasonably stable indicator of articles up to a certain quality.

Enchanter 01:39 Aug 29, 2002 (PDT)

Thanks for doing the numbers -- this should give the developers plenty to chew on. I think a figure of 28,000 is about right. --mav

I like the format of this count. I'd love to see it replicated (i.e. automatically generated) on the statistics page. I agree that the size threshold is the most important decision we have to make, although excluding the other types of articles should also be done if it doesn't bog down the server to cut them out on the fly. Just to make things more confusing, I vote for a minimum of 1000 bytes, which cuts our count roughly in half. However, I would support either 500 or 1500 in preference to what we have now.

I don't care a great deal what number we use for the headline count, as long as the more detailed statistics are only a click deeper. --Karl Juhnke

Would it be possible to now and again run a query (perhaps from crontab) that showed how many "article" pages we have with c characters, c<1000, how many 1000<=c<2000,2000<=c<3000, et cetera? DanKeshet

At the moment:

       =0:     2
      <16:     3   (     1-15:     1)
      <31:    21   (    16-30:    18)
      <63:   111   (    31-62:    90)
     <125:  1222   (   63-124:  1111)
     <250:  4646   (  125-249:  3424)
     <500: 14138   (  250-499:  9492)
    <1000: 25474   (  500-999: 11336)
    <2000: 34739   (1000-1999:  9265)
    <4000: 40849   (2000-3999:  6110)
    total: 45172   (4000+:      4323)
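The figures in parentheses are just differences between consecutive running totals. A small sketch that rederives them from the left-hand column (the variable and function names are illustrative only):

```python
from typing import List, Tuple

# Running totals copied from the left-hand column of the table above.
CUMULATIVE: List[Tuple[str, int]] = [
    ("=0", 2), ("<16", 3), ("<31", 21), ("<63", 111), ("<125", 1222),
    ("<250", 4646), ("<500", 14138), ("<1000", 25474), ("<2000", 34739),
    ("<4000", 40849), ("total", 45172),
]


def bucket_counts(rows: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Turn running totals into per-bucket counts by differencing."""
    buckets = []
    previous = 0
    for label, running in rows:
        buckets.append((label, running - previous))
        previous = running
    return buckets


for label, count in bucket_counts(CUMULATIVE):
    print(f"{label:>6}: {count}")
```

The differences agree with the bracketed column, for example 11336 pages in the 500-999 range and 4323 pages of 4000 or more.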

The queries are of the form SELECT COUNT(*) FROM cur WHERE LENGTH(cur_text)<500 AND cur_is_redirect=0 AND cur_namespace=0 --Brion 20:38 Aug 28, 2002 (PDT)
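For anyone who wants to try the query shape quoted above without access to the live database, here is a self-contained sketch using Python's built-in sqlite3 module. The table and column names come from the query; the in-memory table, its sample rows and the helper names are made up for the demonstration. Note that SQLite's LENGTH counts characters for text values, whereas MySQL's LENGTH counts bytes, so results can differ on non-ASCII text.

```python
import sqlite3


def count_shorter_than(conn: sqlite3.Connection, limit: int) -> int:
    """Count non-redirect pages in namespace 0 whose text is shorter than limit."""
    row = conn.execute(
        "SELECT COUNT(*) FROM cur "
        "WHERE LENGTH(cur_text) < ? AND cur_is_redirect = 0 AND cur_namespace = 0",
        (limit,),
    ).fetchone()
    return row[0]


def demo_connection() -> sqlite3.Connection:
    """Build a tiny stand-in for the cur table with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cur (cur_text TEXT, cur_is_redirect INTEGER, cur_namespace INTEGER)"
    )
    rows = [
        ("x" * 120, 0, 0),            # short article
        ("x" * 700, 0, 0),            # medium article
        ("x" * 1500, 0, 0),           # longer article
        ("#REDIRECT [[Foo]]", 1, 0),  # redirect, ignored by the query
        ("x" * 50, 0, 1),             # other namespace, ignored by the query
    ]
    conn.executemany("INSERT INTO cur VALUES (?, ?, ?)", rows)
    return conn


conn = demo_connection()
for limit in (500, 1000, 2000):
    print(limit, count_shorter_than(conn, limit))
```

Running the same function over limits of 1000, 2000, 3000 and so on, and differencing the results, would give the evenly spaced buckets asked about earlier in this thread.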

Thanks, Brion! I think it's pretty interesting how it works out. DanKeshet

Can we have some figures for mean article word/character counts (ignoring markup and HTML), please? This would enable better comparisons with existing encyclopedias: see the article text for comparisons.

Character count... bah. That's unreliable too. Can I suggest a simple, effective, and _working_ solution? Yeah, that's what I thought. Since Wikipedia is user-moderated, why not add the option for registered users (or maybe unregistered ones too) to vote on how useful they found the article, including a reason why? Think Slashdot: (-7 too short), (+5 well written), etc. It wouldn't be hard to implement, and I think it would be good: you could then count the "real" articles based on their user approval. Of course, this is unfair to articles that are voted really low and then majorly updated to fix that, but still carry the low score. That's why I think ratings should be cleared every time there is a major update (e.g. not marked minor, plus some type of change comparison using the diff). Ideas, anyone? I think this is pretty good, and I'd be willing to implement it if people like the idea. I don't know how long it would take me, because I'm not familiar with the codebase, but I'm very proficient in PHP and DB work as well as other programming languages. I am already registered on SF too. So if anyone likes this idea, leave a comment here, on my talk page, or [e-mail me].

Lightning, Sept 29 3:17
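A bare-bones sketch of the rating idea above, to show how small the core logic could be. None of this reflects existing Wikipedia code: the class, the `changed_fraction` input (standing in for "some type of change comparison"), and both thresholds are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class ArticleRatings:
    """Votes for one article; each vote is a (score, reason) pair."""
    votes: List[Tuple[int, str]] = field(default_factory=list)

    def vote(self, score: int, reason: str) -> None:
        self.votes.append((score, reason))

    def on_edit(self, minor: bool, changed_fraction: float,
                major_threshold: float = 0.25) -> None:
        """Clear ratings after a major update (not minor and a large change)."""
        if not minor and changed_fraction >= major_threshold:
            self.votes.clear()

    def average(self) -> Optional[float]:
        """Mean score, or None when nobody has voted yet."""
        if not self.votes:
            return None
        return sum(score for score, _ in self.votes) / len(self.votes)


def count_approved(articles: Iterable[ArticleRatings], min_average: float = 1.0) -> int:
    """Number of articles whose average rating meets the approval bar."""
    total = 0
    for article in articles:
        avg = article.average()
        if avg is not None and avg >= min_average:
            total += 1
    return total
```

Unrated articles are left out of the approved count here; whether they should count by default is one of the design questions the proposal leaves open.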

All Wikipedia text is available under the terms of the GNU Free Documentation License
