Talk:Letter frequencies


This page has been cited as a source by a media organization. See the 2005 press source article for details.

The citation is in: Alynne Morris. "Codebreaking- Frequency Analysis", BellaOnline, July 4, 2005.


Dispute

Actually, the source is Cryptographical Mathematics, by Robert Edward Lewand, and it does not state the sample size. The 15000 word sample is from someone named Tom's apparently independent analysis. hnw555, 11/28/06

The statistics are taken from http://www.central.edu/homepages/LintonT/classes/spring01/cryptography/letterfreq.html and are based on a ridiculously small sample size (15000 characters). It might also be copyrighted information.

Oh dear, I didn't notice that sample size. I think the thing to do is remove the data from this page and link to (preferably several) others with data. Objections? Frencheigh 03:23, 20 July 2005 (UTC)
I found a nice letter frequency calculator and I'm gonna give it input from a largish number of Wikipedia articles. It'll contain some bias from stub templates and things, but it should be a somewhat accurate estimate of the frequencies of letters in the English language. --Ihope127 19:37, 9 September 2005 (UTC)
...Whoops, it poofed. Ah well... --Ihope127 13:41, 10 September 2005 (UTC)
Wikipedia is a very biased source; besides the templates, there are going to be many foreign words, which will affect the frequency. For English you'll be better off getting 50MB of text files from Project Gutenberg. -- 6 October 2005
I don't think that sort of thing could be used here, as original research is not permitted on Wikipedia. (Wikipedia:No original research)
I guess the results from Project Gutenberg that I just posted could be construed as original research. If so, I apologize, and I guess I'll take them down. :-( I can provide a detailed explanation of my methods and source code, as well as full result data, to anyone interested. Matt Whitlock 21:54, 12 April 2006 (UTC)
Removed, for interested persons this is when they were added. Frencheigh 17:59, 5 July 2006 (UTC)
Meanwhile, I've heard that plain-old data isn't protected by copyright. So instead of what I suggested above, I envision a table where each row is a letter and each column is a different study, and in the cells is the frequency of that letter as given by that study. That way the page would be immediately useful for what I bet is the main reason somebody would want to view it. Anyone know if that would be legal? Frencheigh 23:38, 6 October 2005 (UTC)

Well the data that is in there now doesn't seem terribly good, so much better would be to replace it with something else. There's got to be published sources that have used large representative samples. Any ideas of where to look? - Taxman Talk 18:38, 30 November 2005 (UTC)

Ok, I did some searching and found corpus linguistics, which seems a much better way to do it. Summary statistics on the prominent corpora such as the Brown Corpus and the British National Corpus seem much more valuable than what is in the article. Only I couldn't find them. All I could find when searching was this, which lists some interesting letter frequencies in various languages, but they appear to be just from some guy's webpage that calculated them. Help on finding summary statistics on the corpora would be great. - Taxman Talk 22:02, 30 November 2005 (UTC)

You've all misunderstood the original quoted article. The frequencies given in the Wikipedia article are correct; note that they all match the second source quoted of British National Corpus to the accuracy given. The mistake was that the title "Tom's Letter Frequencies (in order)" in the center of the page is NOT the caption to the table above; rather, it is the heading for the paragraph and table BELOW. Note that the paragraph even says it is "below" and also that the second table, based on the 15,000 letter sample, is in "order" of frequency. Thus, the original Wikipedia article should stand as being accurate. (JPP)

Is the factual accuracy still disputed? Argyriou 20:09, 3 July 2006 (UTC)
It appears that the four following sections are still based upon that 15,000-char analysis. If there are no objections, I think I'll remove those four sections and attribute the rest above them to "Cryptographical Mathematics" by Robert Edward Lewand. Now, are we sure on the title? "Cryptological Mathematics" gets many more google hits. ([1], [2]). ((signature added later - comment by User:Frencheigh, PDT 15:59, 5 July 2006))
I've removed the {{disputed}} label. The sections User:Frencheigh removed can be found at [3] Argyriou 21:58, 11 July 2006 (UTC)
Directly above I was referring to the "Top 10 beginning of word letters", "Top 10 end of word letters", "Most common bigrams (in order)", and "Most common trigrams (in order)" sections, which were present during the above discussion, unlike the other Project Gutenberg ones I deleted recently (on account of their being OR, see farther up). I suppose I'll leave it for a bit again and clarify; I intend to remove all sections but "Relative frequencies of letters", "See also", and "External links", because the others are from the 15000-char analysis. Frencheigh 08:41, 12 July 2006 (UTC)
Done. Frencheigh 20:15, 19 July 2006 (UTC)

When I was a kid, I read a book on cryptography (I think it may have been "The First Book of Codes and Ciphers," which you can see Neal Stephenson reading in his author photo in "Cryptonomicon"!) that gave the frequency list as ETAONRISHDLFCMUGYPWBVKXJQZ. Anyone else recognize this ordering? Anyone know what statistical source it might have come from? I'm obviously not the only one who's ever thought it was the authoritative ordering, since googling that string of letters produces 187 results. --Mr. A. 21:07, 16 July 2006 (UTC)

Chart ordered by frequency would be helpful

The chart shown graphing letter frequency vs. letter is ordered alphabetically. An additional chart ordering the vertical bars by frequency (rather than alphabetically) would enhance the presentation.

I generated such a frequency-ordered chart on my Windows system using the Excel spreadsheet chart facility. I have not tried to add the result to the Wiki article because it's relatively ugly and because I couldn't figure out how to convert it to a .png file.
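For anyone without a spreadsheet to hand, a frequency-ordered chart can be roughed out in plain text. What follows is only a minimal sketch: the function name `bar_chart` and the `width` parameter are made up for illustration, and the sample holds just a handful of the percentages quoted elsewhere on this page rather than a full alphabet.

```python
def bar_chart(freqs, width=40):
    """Return text bars, most frequent letter first; ties stay alphabetical."""
    if not freqs:
        return []
    # Sort by descending percentage, then by letter to make ties deterministic.
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    top = ordered[0][1]
    lines = []
    for letter, pct in ordered:
        # Scale each bar against the most frequent letter; show at least one mark.
        bar = "#" * max(1, round(width * pct / top))
        lines.append(f"{letter} {pct:5.1f} {bar}")
    return lines

# Illustrative subset only, not a complete frequency table.
sample = {"a": 8.2, "e": 12.7, "q": 0.1, "t": 9.1, "z": 0.1}
chart = bar_chart(sample)
print("\n".join(chart))
```

A plotting library such as matplotlib could draw the same ordering as real bars and save it as a .png, but that is outside this sketch.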

I've found an ordered letter frequency of the English language on this page: http://www.csm.astate.edu/~rossa/datasec/frequency.html The source of the table is: H. Beker and F. Piper, Cipher Systems, Wiley-Interscience, 1982. I don't know if it would be ok to put it here.


CAN SOMEONE PLEASE ADD INFORMATION ABOUT HOW TO GENERATE LETTER FREQUENCY TABLES IN FOREIGN LANGUAGES (i.e. from texts that are loaded into a computer program)?!? [24.59.100.23]

I wrote a program that takes a file as input and generates a primitive frequency table...it won't work for Unicode, though. I have to fix that. If you want it, I can upload it to Wikipedia (is that legal?). 7 July 2006 - dargueta
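As a rough illustration of the kind of program being discussed (not dargueta's actual code, which isn't posted here), the sketch below counts letters in a Unicode-aware way, so it should cope with accented and non-Latin alphabets. The function names are invented for this example, and the UTF-8 default for the file encoding is an assumption.

```python
import unicodedata
from collections import Counter

def letter_frequencies(text):
    """Return {letter: percentage} over all alphabetic characters in text."""
    # NFC normalization merges a base letter and a combining accent into one
    # character wherever a precomposed form exists.
    normalized = unicodedata.normalize("NFC", text).lower()
    counts = Counter(ch for ch in normalized if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() yields the letters in descending order of count.
    return {ch: 100.0 * n / total for ch, n in counts.most_common()}

def letter_frequencies_from_file(path, encoding="utf-8"):
    """Read a text file (encoding is an assumption) and count its letters."""
    with open(path, encoding=encoding) as handle:
        return letter_frequencies(handle.read())

print(letter_frequencies("\u00c4\u00e4 \u00f6"))  # two of one letter, one of another
```

Digits, punctuation and spaces are skipped because `str.isalpha()` is false for them; whether accented letters should be folded into their base letters depends on the language and is left undecided here.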

Statistics from a larger sample size

In the book The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography by Simon Singh, I found the following table with a caption that reads:

This table of relative frequencies is based on passages taken from newspapers and novels, and the total sample was 100,362 alphabetic characters. The table was compiled by H. Beker and F. Piper, and originally published in Cipher Systems: The Protection Of Communication.

Note that the values below add to 100.3 due to rounding.

Letter Percentage Letter Percentage
a 8.2 n 6.7
b 1.5 o 7.5
c 2.8 p 1.9
d 4.3 q 0.1
e 12.7 r 6.0
f 2.2 s 6.3
g 2.0 t 9.1
h 6.1 u 2.8
i 7.0 v 1.0
j 0.2 w 2.4
k 0.8 x 0.2
l 4.0 y 2.0
m 2.4 z 0.1


The following table sorts the values given above in order of letter frequency.

Letter Percentage Letter Percentage
e 12.7 m 2.4
t 9.1 w 2.4
a 8.2 f 2.2
o 7.5 g 2.0
i 7.0 y 2.0
n 6.7 p 1.9
s 6.3 b 1.5
h 6.1 v 1.0
r 6.0 k 0.8
d 4.3 j 0.2
l 4.0 x 0.2
c 2.8 q 0.1
u 2.8 z 0.1
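Both the 100.3 total and the frequency ordering can be checked mechanically. The sketch below copies the percentages from the first table above; the variable names are arbitrary, and `Decimal` is used only so that the sum is not blurred by binary floating point.

```python
from decimal import Decimal

# Percentages as given in the table above (Beker and Piper, quoted by Singh).
PERCENT = {
    "a": "8.2", "b": "1.5", "c": "2.8", "d": "4.3", "e": "12.7",
    "f": "2.2", "g": "2.0", "h": "6.1", "i": "7.0", "j": "0.2",
    "k": "0.8", "l": "4.0", "m": "2.4", "n": "6.7", "o": "7.5",
    "p": "1.9", "q": "0.1", "r": "6.0", "s": "6.3", "t": "9.1",
    "u": "2.8", "v": "1.0", "w": "2.4", "x": "0.2", "y": "2.0",
    "z": "0.1",
}

freqs = {letter: Decimal(value) for letter, value in PERCENT.items()}

# Exact decimal sum of the rounded percentages.
total = sum(freqs.values())

# Descending frequency; ties are broken alphabetically, as in the sorted table.
ordering = "".join(sorted(freqs, key=lambda c: (-freqs[c], c)))

print(total)     # 100.3
print(ordering)  # etaoinshrdlcumwfgypbvkjxqz
```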


I took a little class on cryptology once, and The Code Book and Cryptological Mathematics were our textbooks. I'm pretty sure they have the same data, but in The Code Book it's rounded. --Ravi12346 19:40, 30 July 2006 (UTC)

Query

sth is a surprise in a list of high-frequency trigrams. On its own it's an abbreviation of south, and I can think of a few words containing it, but not enough to account for its listing here. Can anyone tell me what it is I haven't thought of?

I grepped a dictionary and came up with 414 results... admittedly, almost all of them you wouldn't use in conversation (try "somesthetic" and "chromesthesia"), but there are a couple like 'firsthand' and 'guesthouse' that aren't so outlandish.
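The dictionary search is easy to repeat. This is a minimal sketch, not the command actually used above: `words_containing` is a made-up helper, and the word-list path in the comment is only an assumption about where such a file might live on a Unix-like system.

```python
def words_containing(words, fragment="sth"):
    """Return the words that contain the fragment, ignoring case."""
    fragment = fragment.lower()
    stripped = (line.strip() for line in words)
    return [w for w in stripped if fragment in w.lower()]

# Hypothetical use with a word-list file (path is an assumption):
# with open("/usr/share/dict/words", encoding="utf-8") as f:
#     print(len(words_containing(f)))

sample = ["firsthand", "guesthouse", "south", "chromesthesia", "somesthetic"]
print(words_containing(sample))
```

The count will depend on which dictionary is used, so a different word list need not give 414.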

is as has was / this the that there they

Trigraphs ignoring spaces may not be of great practical use though. Uldoon 10:33, 10 March 2006 (UTC)
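The effect of ignoring spaces can be shown on a toy sentence. This is a sketch under simple assumptions (ASCII letters only, everything lowercased, made-up function names), counting trigrams once inside words and once across word boundaries.

```python
import re
from collections import Counter

def trigrams_within_words(text):
    """Count trigrams that lie entirely inside a single word."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(word[i:i + 3] for i in range(len(word) - 2))
    return counts

def trigrams_ignoring_spaces(text):
    """Count trigrams after deleting everything except the letters a-z."""
    letters = "".join(re.findall(r"[a-z]", text.lower()))
    return Counter(letters[i:i + 3] for i in range(len(letters) - 2))

text = "this is the way it was then"
print(trigrams_within_words(text)["sth"])     # 0
print(trigrams_ignoring_spaces(text)["sth"])  # 2
```

Here "is the" and "was then" each contribute an "sth" once the spaces are dropped, even though no word in the sentence contains it.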

Given that this seems suspicious, and that we have reproducible numbers from PG, might this section (and sections 1-4) be gotten rid of? Onepairofpants 14:38, 30 May 2006 (UTC)

I agree that the top portion of the page should be deleted. The sample size of that portion is 15000 characters with only 2700 words. And the input is definitely biased (license agreement from Sun, teaching philosophy of a computer science professor, letter of recommendation). This is probably why "sth" appears in the results.

American English

contains a lot more "z"s than British English. 218.102.218.250 03:02, 5 April 2006 (UTC)

Mainly, I assume, thro' a preference for -ize as a suffix in the US rather than -ise; this despite the reverence in which the Oxford English Dictionary is held, and its general preference for the former spelling.

Average Word length

I would be interested to know some more statistics about these letter frequencies, but I lack the skill to extract the relevant information from the PG archive's ample selection of texts. What is the average word length in English? I read somewhere that it was 4.26, though this was with a rather small sample size. Is the distribution of word lengths a standard distribution? If so, what is the standard deviation? How does letter frequency vary with word length? Obviously at words of 1 letter, the frequencies will be 0 apart from "I", "A" and possibly "O"... Would anyone have the ability and the capability to satisfy my curiosity? 86.20.233.151 20:59, 1 June 2006 (UTC)
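These statistics are straightforward to compute once a text is loaded. The sketch below is one possible approach, not an answer to the question: it assumes a word is any run of the letters a-z (so "don't" is split in two), the function name is invented, and the toy sentence says nothing about English as a whole.

```python
import re
import statistics
from collections import Counter, defaultdict

def word_length_stats(text):
    """Mean and standard deviation of word length, plus letters per length."""
    words = re.findall(r"[a-z]+", text.lower())
    lengths = [len(w) for w in words]
    # For each word length, count which letters occur in words of that length.
    letters_by_length = defaultdict(Counter)
    for w in words:
        letters_by_length[len(w)].update(w)
    return {
        "mean": statistics.mean(lengths),
        "stdev": statistics.pstdev(lengths),  # population standard deviation
        "length_counts": Counter(lengths),
        "letters_by_length": letters_by_length,
    }

stats = word_length_stats("I saw a cat on the mat")
print(stats["mean"], stats["stdev"])
print(stats["length_counts"])
print(stats["letters_by_length"][1])
```

`statistics.mean` raises an error on an empty text, so real input would need a guard. Comparing `length_counts` against a fitted curve would be the next step for anyone wanting to judge the shape of the distribution.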

Wheel of Fortune

So...H and D appear more than L, but the "gimme" letters in the last round of Wheel of Fortune are RSTLNE. That should appear somewhere towards the bottom of this article. --JD79 19:20, 26 January 2007 (UTC)

Pure speculation here, but (1) I think those are the gimme letters because people had gotten the idea that they were a "good set" and guessed the exact same letters all the time, and the producers probably wanted to mix it up; and (2) if you've already guessed S and T, you can probably infer the locations of H's, as well as the likelihood of the last letter's being D (the past tense suffix no doubt being the reason why D is so common). --Mr. A. 04:21, 27 January 2007 (UTC)