Talk:Bigram

This article is within the scope of WikiProject Philosophy, which collaborates on articles related to philosophy. To participate, you can edit this article or visit the project page for more details.
This article has not yet received a rating on the quality scale.
This article has not yet received an importance rating on the importance scale.

(I suspect that there is an error or an ambiguity in the information given below. If TH occurs 50 times in 200 letters of text, then half the text consists only of the bigram TH. This is incredible. Further, if I add up the occurrences of the 5 most frequent bigrams, they add up to 200. 200 bigram occurrences would involve 400 letters. Do I misunderstand the definition of bigram occurrence frequency? If so, what led me to this misunderstanding? Ramani)

I didn't double-check the numbers, but it seems to me you're missing something very important: 1 bigram consists of 2 letters, but 200 bigram occurrences do not necessarily mean 400 letters. Look at the word 'banana' -- The bigrams in this word are 'ba', 'an', 'na', 'an', 'na'. That's 5 bigrams, but only 6 letters.
Also, please discuss articles in their respective talk pages, not in the article itself. -- AWendt 11:12, 19 September 2007 (UTC)
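
To make the overlap concrete, here is a minimal Python sketch that lists the overlapping bigrams of a single word. The function name `letter_bigrams` is only for illustration and is not taken from the article.

```python
from collections import Counter

def letter_bigrams(word):
    """Return the overlapping letter pairs of a single word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

pairs = letter_bigrams("banana")
print(pairs)           # ['ba', 'an', 'na', 'an', 'na']
print(len(pairs))      # 5, i.e. len("banana") - 1
print(Counter(pairs))  # 'an' and 'na' occur twice each, 'ba' once
```

Each letter except the first and last belongs to two bigrams, which is why n letters give n-1 bigrams rather than n/2.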

Thanks. You are right in saying that the number of bigrams in a sequence of n letters is (n-1). But that does not answer the question on how the numbers given in the article are to be interpreted. The article says

"The most common letter bigrams in the English language are listed below, with the expected number of occurrences per 200 letters. In the analysis here, the bigrams are not permitted to span across consecutive words.

TH 50    AT 25    ST 20
ER 40    EN 25    IO 18
ON 39    ES 25    LE 18
AN 38    OF 25    IS 17
...      ...      ..."

I would expect the sum of the numbers shown above not to exceed 199. But the sum is larger. I would appreciate your double-checking the text. Ramani 12 Nov 07 —Preceding unsigned comment added by 122.167.157.78 (talk) 17:25, 12 November 2007 (UTC)

Why should the sum not exceed 199? The word BATH, for example, has both a TH and an AT within its last three letters. quota 21:13, 13 November 2007 (UTC)

There is definitely an error: the number of bigrams in n letters is equal to n-1, but the sum of all the bigram counts is much larger than 199. 200 is probably a typo for 2000. —Preceding unsigned comment added by 128.97.19.56 (talk) 21:44, 31 March 2008 (UTC)
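
A quick check of the twelve values quoted from the article, as a rough Python sketch. The dictionary holds only the entries shown in the quote, not the elided rows, and the variable names are mine.

```python
# Counts exactly as quoted from the article; the elided rows ("...") are left out.
quoted_counts = {
    "TH": 50, "AT": 25, "ST": 20,
    "ER": 40, "EN": 25, "IO": 18,
    "ON": 39, "ES": 25, "LE": 18,
    "AN": 38, "OF": 25, "IS": 17,
}

letters = 200
max_bigrams = letters - 1  # upper bound; not spanning word boundaries lowers it further

total = sum(quoted_counts.values())
print(total)                # 340
print(total > max_bigrams)  # True: these counts cannot all fit in 200 letters
```

Since every bigram occurrence starts at a distinct letter position, the counts of all distinct bigrams together can never exceed n-1, so twelve entries alone summing to 340 rules out "per 200 letters".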

Indeed. Here's a reference: [1]. This quotes TH as occurring 5532 times in 40,000 words (which would be about 200,000 letters). That's ~55 times in 2000 letters, which roughly matches the table above.
However, the table in the reference has HE and IN between TH and ER -- but the analysis over the larger sample should be better. I'll modify the article accordingly... quota (talk) 08:56, 1 April 2008 (UTC)
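
The rescaling above can be reproduced with a few lines of Python. This is only a sketch: the figure of 5 letters per word is an assumption implied by "about 200,000 letters", not a number taken from the reference.

```python
th_count = 5532        # TH occurrences quoted from the reference
words = 40_000         # sample size quoted from the reference
letters_per_word = 5   # assumption: rough average implied by "about 200,000 letters"

total_letters = words * letters_per_word
per_2000 = th_count / total_letters * 2000
per_200 = th_count / total_letters * 200

print(round(per_2000, 1))  # 55.3
print(round(per_200, 1))   # 5.5
```

Under that assumption TH comes out at roughly 55 per 2000 letters, close to the 50 in the table, whereas per 200 letters it would be only about 5.5.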