Tuesday, February 28, 2006

(Spelled) segment distribution: lexicon vs. corpus

Recently reading a historical para about Scrabble, I was surprised to (re)realize that the letter distributions in Scrabble are based on an (informal) corpus count (New York Times front page), not a dictionary-headword (i.e. lexicon) count. So, e.g., 12 per cent (12 tiles of 100) of letters in the Scrabble bag are 'e'; that's pretty exactly the percentage of letter occurrences that are 'e' in a corpus of written English.

The reason this surprised me is that in a corpus there will be many repetitions of function words, which probably inflates the percentages of certain letters, e.g. 't' (the, to, it), 'e' (the, he, she), etc., when compared to their percentage occurrence in the lexicon, which lists each function word only once.
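
To make the corpus/lexicon contrast concrete, here is a toy sketch in Python. The sample sentence is made up purely for illustration (it is not real corpus data), and `letter_percentages` is a hypothetical helper name. It counts letters once over every word token in the text and once over the set of distinct words.

```python
from collections import Counter

def letter_percentages(words):
    """Return {letter: percent of all letters} for an iterable of words."""
    counts = Counter(ch for w in words for ch in w.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: 100 * n / total for ch, n in counts.items()}

# Made-up sample text, purely illustrative
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()   # every occurrence counts (corpus-style)
types = set(tokens)     # each distinct word counts once (lexicon-style)

corpus_pct = letter_percentages(tokens)
lexicon_pct = letter_percentages(types)

for letter in "th":
    print(letter, round(corpus_pct[letter], 1), round(lexicon_pct[letter], 1))
```

In this tiny sample the repeated 'the' pushes both 't' and 'h' higher in the token count than in the type count, which is the effect I'd expect to see (on a smaller scale) in real data.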

But of course Scrabble is all about producing nice individual words, really, not about producing a corpus-like set of word tokens. Indeed, someone who repeatedly put down words like 'to', 'the', 'it', and so on would not get very far in a game of Scrabble. So it seemed to me that it would have been more appropriate to use a letter distribution based on the percentage of each letter occurring in a list of dictionary headwords.

I thought I would try to find out how different the lexicon-based letter distribution in English is from the corpus-based letter distribution, but I can't find any numbers online for the former. (Numbers for the latter are all over the place, of course, and match the Scrabble distributions pretty exactly; the reason letter distribution is so interesting to many people is that it's a good way to solve simple encryption problems.)

I know it would be a supersimple programming problem to produce a list of letters and their respective percentage distributions in the headwords of any online dictionary database (e.g. the fourth column of Mike Hammond's 'newdic' file), but it'd be a biggish time investment for me to figure it out right this second. Anyone who can see a quick and easy way to do it want to send me the numbers for comparison to the corpus numbers? It would be interesting...
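
For anyone inclined to try, a minimal sketch of the counting step in Python might look like the following. The function name, the `column` and `sep` parameters, and the sample lines are all hypothetical; in particular, I'm assuming a one-entry-per-line file with whitespace-separated fields, which may not match the actual layout of any given dictionary file.

```python
from collections import Counter
import string

def headword_letter_distribution(lines, column=0, sep=None):
    """Percentage of each letter a-z across one column of a word list.

    lines  : iterable of text lines, one dictionary entry per line (assumed)
    column : 0-based index of the field holding the spelled headword (assumed)
    sep    : field separator; None splits on any run of whitespace (assumed)
    """
    counts = Counter()
    seen = set()
    for line in lines:
        fields = line.split(sep)
        if len(fields) <= column:
            continue  # skip blank or short lines
        word = fields[column].lower()
        if word in seen:
            continue  # count each headword only once
        seen.add(word)
        counts.update(ch for ch in word if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: round(100 * counts[ch] / total, 2)
            for ch in string.ascii_lowercase}

# Made-up entries with the spelling in the fourth field (index 3)
sample = ["1 n x the", "2 n x cat", "3 v x sat", "4 n x cat"]
dist = headword_letter_distribution(sample, column=3)
print(dist["t"], dist["a"], dist["e"])
```

Since an open file is itself an iterable of lines, the same function should work on a real word list by passing the file object in place of `sample`, with `column` and `sep` adjusted to whatever the file actually uses.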

(Seems to me that it might even be useful, theoretically. If you're into exemplar models of mental lexicon representations, the frequency/markedness value of a given English segment in your mental inventory might be expected to correlate with its corpus distribution. If you're into a more traditional lexicon-based model instead, with a single abstract phonological representation of each word, you might expect segment markedness values to correlate with their lexicon distribution in English.)


Blogger Loxias said...

"Scrabble and the mental lexicon"

Hmmm, very very interesting.

So, that's why I can never get the 'right' letters in Scrabble... ;-)

I wish I could do some computing.

12:25 AM  
Blogger Lance said...

Fortunately, I have nothing better to do with my time. Well, OK, not much time; half-assed python scripting is easy.

I used (a version of) the OSPD as the lexicon--about 79,400 words--and got:

a 8.03
b 2.26
c 3.61
d 4.12
e 11.78
f 1.52
g 2.96
h 2.44
i 7.5
j 0.26
k 1.41
l 5.58
m 2.84
n 5.78
o 6.08
p 2.94
q 0.18
r 7.23
s 9.33
t 5.74
u 3.67
v 0.97
w 1.2
x 0.33
y 1.83
z 0.44

"E" is about the same, but "T" and "H" are much, much lower on this count.

6:03 PM  
Anonymous Tilde said...

Using that newdic file & my own half-assed Python-fu, I get:

'a': 8.9
'b': 2.1
'c': 4.7
'd': 2.9
'e': 11.0
'f': 1.4
'g': 2.2
'h': 2.3
'i': 8.8
'j': 0.2
'k': 0.8
'l': 5.5
'm': 3.2
'n': 6.8
'o': 6.9
'p': 3.2
'q': 0.2
'r': 7.5
's': 5.3
't': 7.7
'u': 3.8
'v': 1.2
'w': 0.9
'x': 0.3
'y': 1.8
'z': 0.4

6:59 PM  
Blogger Nayeli said...

It would be interesting to use the TWL or SOWPODS dictionaries themselves to compute the lexicon-based frequency.

3:08 PM  
