Template talk:Wikipedialang (word count)

ordering

the following WPs have between 250k and 1M words at the moment:

cs, ko, id, uk, ms, se, ia, hr, nn, gl, lt, su, la, ar, cy, wa, tr, ku, tt

thus, 43 WPs have >250k words, while 54 WPs have >1000 articles. We may want to list all with >100k, if we want to list >50 languages on the main page. dab () 13:25, 30 Dec 2004 (UTC)

We were taking the word count from http://en.wikipedia.org/wikistats/EN/Sitemap.htm -- apparently counting the words in the entire database, not just in articles. We need of course a way to count the words in the article namespace only! dab () 14:22, 2 Jan 2005 (UTC)

An alternative would be "article count * mean size" (i.e. text size); for en: 446k * 2434 = 1085564k = 1035MB. We need to know how unicode encoding affects the byte count, though. E.g. does &thorn; count as 7 bytes, but þ as just one? dab () 14:27, 2 Jan 2005 (UTC)
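A minimal sketch of the arithmetic above and of the byte-count question, assuming 1 MB = 1024**2 bytes and that the database stores text as UTF-8 (both are assumptions, not confirmed anywhere on this page). The function names are made up for illustration. It shows that the HTML entity &thorn; occupies seven bytes as plain ASCII, while the literal character þ takes one byte in Latin-1 but two in UTF-8.

```python
# Figures quoted in the comment above (en:, early January 2005).
ARTICLES = 446_000
MEAN_ARTICLE_BYTES = 2434


def estimated_text_size(articles, mean_bytes):
    """Return (total bytes, total in MB), taking 1 MB = 1024**2 bytes."""
    total = articles * mean_bytes
    return total, total / 1024**2


def byte_len(text, encoding="utf-8"):
    """Number of bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


total_bytes, total_mb = estimated_text_size(ARTICLES, MEAN_ARTICLE_BYTES)
print(total_bytes, round(total_mb))   # 1085564000 1035

print(byte_len("&thorn;"))            # 7  (entity spelled out in ASCII)
print(byte_len("þ", "latin-1"))       # 1
print(byte_len("þ"))                  # 2  (UTF-8)
print(byte_len("漢"))                 # 3  (UTF-8)
```

So the same text can differ in byte size by a factor of two or more depending on how characters are stored, which is the concern raised above about using raw size as a measure.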

Link to wikistats: http://en.wikipedia.org/wikistats/EN/TablesDatabaseWords.htm

zh:

or characters for the Japanese and Chinese Wikipedias — this is probably wrong. At least in ja:, words will be counted, i.e. katakana or hiragana strings, or kanji with hiragana complement. I do not know how the algorithms work, exactly, though. But I imagine also in Chinese, words are counted, i.e. one or two 漢字 per word:
"In Chinese, a word or phrase (词/詞 cí) (a unit of meaning) is composed of one or more characters (字 zì), as in hànzì (汉字/漢字), which has two characters"

dab () 20:02, 2 Jan 2005 (UTC)

Computer algorithms can't separate character / kana strings into words, you need human intelligence for that... also the Wikistats page seems to equate Chinese characters with "whole words", so that's what it's apparently counting. -- ran (talk) 20:08, Jan 2, 2005 (UTC)

ah, sorry, I had not noticed you had put it back on the Main Page, otherwise I would have waited for you to react before reverting. Shall I revert to your footnote? :o) dab () 20:07, 2 Jan 2005 (UTC)
Sure, thanks :) -- ran (talk) 20:08, Jan 2, 2005 (UTC)
hm, so how do you explain that Chinese WP shows up as smaller compared to en: in word count? I think the algorithm weighs the character count with some average value of characters-per-word, then. dab () 20:13, 2 Jan 2005 (UTC)

The Chinese Wikipedia is somewhat stubby.... Serbia and Montenegro is a stub, for example.

Also, I quote from Wikistats: Database size depends on coding system (unicode characters take several bytes) and on how much meaning can be conveyed by one character (e.g. Chinese characters are whole words). -- ran (talk) 20:19, Jan 2, 2005 (UTC)


re: coding system, this I had suspected, and this is why I opted for word count rather than db size.
re: Chinese: I stand corrected then. It turns out zh: comes up as 3% of en: in word count, 4% in article count, and 0.4% in db size. The 3% and 4% values are "too high" then, but the 0.4% value is "too low" (more information per character). Any measurement more accurate than these three would be non-trivial (e.g. we could 'integrate' over character frequencies for a measure of 'information per character'). Since this concerns mostly just zh:, and seeing that word count at the moment gives the better value for zh (due to extreme stubbiness), I think the bottom line is still in favour of this template. dab () 20:24, 2 Jan 2005 (UTC)
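One possible reading of "information per character" is the Shannon entropy of the character frequency distribution, in bits per character. That interpretation is an assumption on my part, not something stated in the comment above, and the function name is made up for illustration. A rough sketch:

```python
from collections import Counter
from math import log2


def entropy_per_char(text):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


print(entropy_per_char("aabb"))   # 1.0  (two equally likely symbols)
print(entropy_per_char("abcd"))   # 2.0  (four equally likely symbols)
```

Run over a large sample of each language's article text, a larger value would suggest that each character carries more information. This ignores context between characters, so it is only a first-order measure.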

I agree that it's better, since a slightly-off ranking is certainly better than a totally-off ranking. Which is why I put the footnote in.

Perhaps we should contact the Wikistats people in some way, to find out exactly how they arrived at those statistics? Not just for Chinese and Japanese either, I'm also wondering about Korean (which does use spaces...), Thai, etc...-- ran (talk) 20:36, Jan 2, 2005 (UTC)

I've posted a question for User:Erik Zachte (talk), creator of Wikistats. -- ran (talk) 20:55, Jan 2, 2005 (UTC)

Erik has responded on his talk page; turns out that dab's hunch was right! (I'm pretty impressed :D) The stats do indeed use a multiplier factor, approximating the number of Japanese / Chinese characters per word. In this case the factors were arrived at by comparing Japanese / Chinese texts with English ones, hence giving the number of characters per English word, but the general effect is the same.
I've removed the logograph tag from the template. Great job on it btw, dab. ;) -- ran (talk) 01:45, Jan 3, 2005 (UTC)
well, from the reactions on Talk:Main Page, I expect the template will be scrapped after all. I suppose even if we do show it is superior, technically, habits are habits, and change but slowly. dab () 08:37, 3 Jan 2005 (UTC)
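The multiplier approach described above can be sketched roughly as follows. The factors in `CHARS_PER_WORD` are placeholders invented for illustration, not the values Wikistats uses, and the character ranges are a simplification (CJK Unified Ideographs plus hiragana and katakana only).

```python
# Placeholder factors for illustration only; NOT the actual Wikistats values.
CHARS_PER_WORD = {"zh": 1.5, "ja": 2.0}


def is_cjk(ch):
    """True for CJK Unified Ideographs, hiragana and katakana."""
    return "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff"


def estimate_words(text, lang):
    """Approximate a word count as (CJK characters) / (characters per word)."""
    cjk_chars = sum(1 for ch in text if is_cjk(ch))
    return cjk_chars / CHARS_PER_WORD[lang]


print(estimate_words("ひらがな", "ja"))   # 2.0  (4 characters / 2.0)
```

No segmentation into actual words is attempted here; the estimate is only as good as the chosen factor.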

reordering?

based on the feelings expressed below, I think it may be advisable to scrap "ordering size" and go back to tiers, i.e. listing WPs as on equal standing above some threshold of "encyclopedicity" (8M?). We effectively have tiers in the lower area, viz. 250k, 500k, 1M. Maybe we should 'granularize' them to >250k, >500k, >1M, >2M, >4M, >8M, with no further classification over 8M (a decent encyclopedic size of some 16 volumes). I am playing the big proponent here, but the present template was really created by User:GeorgeStepanek -- what do you think? Ran? anyone? dab () 13:22, 3 Jan 2005 (UTC)
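For illustration, the proposed tiers could be assigned like this. It is a sketch only: the thresholds are the ones suggested above, while the "unlisted" label for anything at or below 250k and the function name are made up here.

```python
from bisect import bisect_left

# Thresholds from the proposal above: >250k, >500k, >1M, >2M, >4M, >8M words.
TIER_FLOORS = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000, 8_000_000]
# "unlisted" is a hypothetical label for WPs at or below the lowest threshold.
TIER_LABELS = ["unlisted", ">250k", ">500k", ">1M", ">2M", ">4M", ">8M"]


def tier(word_count):
    """Return the highest tier whose threshold the word count strictly exceeds."""
    return TIER_LABELS[bisect_left(TIER_FLOORS, word_count)]


print(tier(300_000))      # >250k
print(tier(3_000_000))    # >2M
print(tier(50_000_000))   # >8M
```

Within a tier the WPs would then be on equal standing and could be listed alphabetically, with no further ranking above 8M.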


A plea from the Romanian Wikipedia for more relevance and understanding

dab, I know how it feels when a proposal is turned down on the basis of habit. But you need to understand that your proposal has very big implications: if we switch to word count ordering, then everything has to be changed - multilingual stats rankings, counts on the main pages of language WPs, everything. For this reason, I think we should put it this way: it's either the word count method or the article count method. As a representative of the Romanian Wikipedia, I feel the Romanian Wikipedia is being pulled from two sides by this proposal. I have nothing against it, broadly. But I am totally against using word counts on the main page while using article counts everywhere else. This is because maximising word counts and maximising article counts, while related, are in my opinion contradictory goals. Yes, it's true that by maximising article count we are also maximising word count and vice versa, but I think we need to give our language Wikipedias a bit of focus. We need to say to them: "We're now using word counts everywhere. It's a new and technically superior system." Then they know, and will try to improve articles by adding more information rather than by creating new articles. But if we continue measuring Wikipedia size by both article count and word count, what is a Wikipedia like ro.wiki to do? Should we continue to create new short articles, or work on longer articles?

Then there are the other contradictory goals: advancing our status in the Wikipedia community and serving the needs of our users. These are again two related but contradictory goals. If we focus on serving the needs of our users, our Wikipedia status declines. What I mean by this is: by focussing on the needs of our users, we would create a Wikipedia where we write long articles about highly relevant themes, meaning less growth in terms of articles. By focussing on our Wikipedia community status, we would be looking towards maximising Wikipedia's size statistically through article counts, word counts, internal links, etc. When we work to do this, we usually create stubs and unnecessary articles, thereby making us less relevant to our users.

Now the problem is this (and excuse me if I'm raving on and on): all of this struggle and uncertainty is caused because the Wikipedia community puts a great deal of pressure on local WPs. The Romanian Wikipedia does not want to create stubs; we are forced into it in order to maintain our status. We blatantly neglect and put down our smaller WPs, those with fewer than 1000 articles, passing them off as insignificant when we should be helping them. I remember reading someone's message on a talk page yesterday (I don't remember who said it) that "if Wikipedia X wants to be listed on the main page, it should try to reach 1000 articles and work harder". That made me feel really bad and I wondered to myself: who are we to impose this on smaller Wikipedias, some of them working under limited Internet access and poor conditions? How dare we force and threaten Wikipedias to work just to achieve a certain status in the Wikipedia community?

This leads to my second point: a dependence on artificial growth. It's basically like hiring excess labour just to keep everyone employed when less labour could do the same work more efficiently. We are compelling our Wikipedias to add more and more articles until they are bloated and full of stubs, just so we can then congratulate them on having reached milestones. Word count is also a problem - we will create the same dependence and urge to advance, only with word count it's a lot _fairer_.

I support word count more for this reason, but only if we discontinue article count. If we don't make it important anymore. If we can finally get to a stage where we can say that the Wikipedia X with 10,000 articles is better than Wikipedia Y with 55,000 articles just because Wikipedia X has more information and serves its users better.

Finally, what we need to remember is that Wikipedia is one project, not 200. Language rivalry is a terrible thing because while Wikipedia remains open and free, there will always be ways to cheat and to falsify growth. Picture this - ONE user decides to create artificial growth through stubs in just ONE language WP. This Wikipedia grows a lot and advances in article count by, let's say, 10 ranks. All the other Wikipedias are revolted, and they have to do something to reclaim their spot. Their only way, though, is to add stubs themselves. So they do it, and then more and more Wikipedias fall into the trap. While this may be an oversimplification, this scenario has happened and continues to happen today, leading to a cycle of bloated, artificial, irrelevant growth.

Now on to the flaws of word count. Yes, it's flawed, because it cuts off the wings of many smaller WPs. This is because we are totally removing the hope of growth. We are pressuring them to grow while at the same time restricting their means to grow. Because it's much harder to advance in word count rankings, many WPs will lose hope. What drives many Wikipedias to grow, even if that growth is artificial, is the fact that they know they can overtake other WPs and therefore heighten their status. Yes, it's a terrible occurrence of stupid rivalry, but that's what happens. And that's what we need to stop. The only way we can stop it is to stop congratulating Wikipedias on reaching milestones, to stop ranking them, to stop sorting them by word counts or article sizes, and instead simply to say this: your only goal is to serve your users in the best way possible. Not to try to beat other Wikipedias in size, not to try to have the most articles, but rather to have the most visitors. To get as many people involved as possible. It's only then that Wikipedia will grow in real terms.

For that reason, I don't particularly support either word count or article count. It's either one or the other, but in any case we should stop making such a big deal about rankings and statistics. It isn't those that count, it's relevance to a language's users. I would prefer a Romanian Wikipedia community with 10,000 articles yet dozens of users who are benefitting from Wikipedia to a Ro.wiki with 60,000 articles, ranked 5th or 6th, but which no one uses. Any thoughts on this are highly welcome. Thanks, Ronline 11:45, 3 Jan 2005 (UTC)

holy.... I certainly didn't want to stir "interwiki competition" by "ordering by size". I rather intended to remove competition based on article count. So many feelings seem to be attached to the "other languages" links on Main Page that I think I should better not interfere any further. However:
I do not see a need to rely on either word count or article count. This template is directed at visitors, giving a rough idea of multilingual wikihood. There is no need to draw any conclusion for internal statistics or what not from that. Indeed, I think I have shown that an accurate idea of size relations is only possible by a combination of word count, article count and db size.
I have regarded Main Page simply as the main page of en:. It is true however that http://wikipedia.org redirects there, making it a special case. But I think that decisions on the main page of en: should be unaffected by the question whether wikipedia.org should redirect there, or to a multilingual entry page. I agree, however, that en: has an additional responsibility as long as the former is the case, and that should maybe encourage us to include more interwiki links than we normally would. Still, as for 'clipping the wings' of small WPs, I think this is quite an improper characterisation of the issue.
If anything, the exaggerated importance attributed to this template by other WPs reinforces me in my opinion that it should not encourage speed-article-creation in order to get a higher standing. dab () 13:10, 3 Jan 2005 (UTC)
My understanding was that http://www.wikipedia.org redirects to the appropriate language page based on the preferred language variable in the HTTP request. Thus the English Main Page is only shown to those users who have English as their preferred language. The vast majority of the people who view this page have no interest in other languages; hence my desire to de-emphasise the whole issue. GeorgeStepanek\talk 20:00, 3 Jan 2005 (UTC)
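As a rough illustration of what language selection from the HTTP request could look like: the sketch below picks a language from an `Accept-Language` header value. It is a simplified, hypothetical example and not a description of how wikipedia.org actually behaves. The function name, the fallback to "en" and the handling of q-values are all assumptions.

```python
def preferred_language(accept_language, available, default="en"):
    """Pick the highest-weighted language from an Accept-Language value."""
    prefs = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        weight = 1.0  # a missing q parameter counts as the highest preference
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # treat malformed weights as "not acceptable"
        prefs.append((weight, tag.strip().lower()))

    # sorted() is stable, so equal weights keep their order in the header.
    for weight, tag in sorted(prefs, key=lambda p: -p[0]):
        base = tag.split("-")[0]  # "zh-CN" -> "zh"
        if weight > 0 and base in available:
            return base
    return default


print(preferred_language("ro,en;q=0.8", {"en", "ro"}))   # ro
```

Under this scheme a visitor whose browser sends `ro,en;q=0.8` would never see the English Main Page at all, which is the point made above.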

confusing; hard to search

This layout makes it very hard to find a major language, as there is no longer any alphabetic ordering. And the word count as displayed is confusing. I hope we can return to "over 10k" and "xxx-10k" lists, do away with the "50k+" section, and remember that article count is no big deal. +sj +

well, we did return to article count. I accept the 'hard to search' argument, but that would not stop us from remaking this template into a couple of tiers (>250k, >1M, >4M, >16M), if we can agree that word count is more useful than article count. dab () 14:57, 7 Jan 2005 (UTC)

Missing wikis

The ang wikipedia is missing from the stats page, and it hasn't been updated since May of this year. Accidental, or purposeful? Are any other wikis similarly neglected? --JamesR1701E 08:59, 24 September 2005 (UTC)