Sunday, March 05, 2006

Scrabble's letter distributions: Art or science?

Thanks to Lance's Python wizardry, we now have lexicon-based letter distribution counts! (See his comment to the previous post.) Interestingly, when you compare the lexicon-based and corpus-based counts side by side with the Scrabble counts, some odd discrepancies appear.

Here are some bar charts summarizing the results. The leftmost (blue) bar represents Lance's lexicon letter counts. The center (red) bar represents the corpus letter counts. And the rightmost (yellow) bar represents the Scrabble tile distribution (treated as if it's out of 100, though it ought to be 98, because of the two blank tiles).

[Bar chart: letters A-L]
[Bar chart: letters M-Z]

There are a few differences between the corpus and lexicon counts. As Lance notes, the letter 'h' occurs in the corpus way more frequently than it does in the lexicon (the, he, her, their, those, them,...). The letter 't' is also more frequent in the corpus than in the lexicon. Weirdly, the letter 's' is underrepresented in the corpus compared to the lexicon; I wonder if the headword list Lance chose included plurals of all the nouns?
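
To make the lexicon/corpus distinction concrete, here is a minimal sketch in Python. It is not Lance's script; the function name `letter_percentages` and the toy sentence are made up for illustration. A corpus count tallies every token as often as it occurs, while a lexicon count tallies each distinct word once, which is why frequent function words like "the" inflate 'h' and 't' in the corpus.

```python
from collections import Counter

def letter_percentages(words):
    """Return {letter: percent of all letters} for an iterable of words."""
    counts = Counter(ch for word in words for ch in word.lower() if ch.isalpha())
    total = sum(counts.values())
    return {letter: 100 * n / total for letter, n in counts.items()}

# Toy data, invented for illustration only (not the real corpus or lexicon).
corpus_tokens = "the dog and the cat and the bird".split()
lexicon = sorted(set(corpus_tokens))  # each distinct word counted exactly once

corpus_pct = letter_percentages(corpus_tokens)
lexicon_pct = letter_percentages(lexicon)

# Compare the two distributions for the letters discussed above.
for letter in "ht":
    print(letter, round(corpus_pct[letter], 1), round(lexicon_pct[letter], 1))
```

Even in this tiny example, 'h' and 't' take up a larger share of the corpus than of the lexicon, simply because "the" is repeated.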

What's interesting is that the Scrabble tile distribution matches lexicon frequency in some cases of discrepancy, corpus frequency in other cases, and neither in a couple of cases. This seems like a possibly odd result, given that the Scrabble letter distribution was supposed to have originally been based on a corpus count consisting of NYT front pages. Either the front pages on the days in question had some exceptional trends in letter usage (see below) or the creator of Scrabble, Alfred Mosher Butts, adjusted some frequencies based on his intuitions about what would make the game go better.

Of course, for letters whose frequency is less than 1%, Scrabble has a higher distribution because you can't have a letter with less than one tile. So 'q', for instance, is overrepresented in Scrabble, of necessity.
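
Here is a rough sketch of that one-tile floor, assuming you simply scale a percentage up to the 98 lettered tiles and round. The function `naive_tile_counts` and the three sample percentages are hypothetical (they are not Lance's numbers); the tile counts in `SCRABBLE_TILES` are the standard English-language set.

```python
# Standard English Scrabble letter tiles (the two blanks are excluded).
SCRABBLE_TILES = {
    "A": 9, "B": 2, "C": 2, "D": 4, "E": 12, "F": 2, "G": 3, "H": 2, "I": 9,
    "J": 1, "K": 1, "L": 4, "M": 2, "N": 6, "O": 8, "P": 2, "Q": 1, "R": 6,
    "S": 4, "T": 6, "U": 4, "V": 2, "W": 2, "X": 1, "Y": 2, "Z": 1,
}

def naive_tile_counts(percentages, total_tiles=98):
    """Scale letter percentages to tile counts, never going below one tile."""
    return {
        letter: max(1, round(pct * total_tiles / 100))
        for letter, pct in percentages.items()
    }

# Invented percentages, for illustration only.
sample = {"E": 11.0, "S": 8.0, "Q": 0.2}
print(naive_tile_counts(sample))

# Share of the lettered tiles that a single Q tile represents, in percent.
print(round(100 * SCRABBLE_TILES["Q"] / sum(SCRABBLE_TILES.values()), 2))
```

Any letter rarer than about half a percent would round to zero tiles without the `max(1, ...)` floor, so its share of the bag necessarily ends up higher than its share of the language.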

Other variations, though, seem to be more a matter of intuitive game-play facilitation. 'S', for instance, is less frequent in Scrabble than in either the lexicon or corpus -- obviously 's' makes high-scoring hooks easy, increasing its value as a letter, and Alfred foresaw this and deliberately made them scarcer. On the other hand, there are twice as many 'v's as there ought to be, as anyone who's tried to find a good way to use one knows (there are no two-letter words with 'v' in them -- hard to hook). On the mitigating side, there are fewer 'c's than there ought to be, at least compared to the lexicon distribution; there ought to be 3, but there are only 2 'c' tiles. Since there are also no legal two-letter words with 'c', that's kind of nice. I find it hard to imagine that Alfred was thinking about the availability of two-letter words, though maybe he was. Elsewhere, there are too many 'i's, compared to the lexicon count, and too many 'o's, but too few 'l's. In the latter two cases the corpus and Scrabble counts match pretty well, but he must just have been being perverse about the 'i's, because there the extra-high Scrabble count matches neither the lexicon nor the corpus count. (It does often feel like there are too many 'i's, IMHO.)

Anyway, thanks to Lance for pulling this data out! I think it's interesting how the two counts are actually not all that different. In the phonological version of this, of course, the edh segment would be the one with the way high count compared to the lexicon (rather than 'h' or 't' -- though maybe /h/ would also be high because of the pronouns). I wonder if any others would also exhibit significant mismatch?

Update: Check out this series of posts on the same topic at Nikolasco:
Scrabble Distributions
Best Fit Scrabble Letter Distribution
Super Scrabble

I was especially interested to see the results of his 'Best Fit' computations, and the discussion of Super Scrabble (which I have found to be actually quite a lot of fun—a more freewheeling game, especially with four players.)

Also, check out this post from Patrick Hall on Blogamundo about helping linguists execute their programming inclinations. Thanks for the thought, Patrick! I'll be watching for the updates.

10 Comments:

Blogger Lance said...

A little more python scripting...

S is far and away the most common final letter for words in the OSPD. 31.17% of all words end in S; the next most common are E, D, R, Y, at 11.1, 10.41, 7.83, 6.73, respectively. (Two words end in Q, which is <.01% of the lexicon.) The full list in order is:

SEDRYTGNLAHCMKOPIWXFBUZVJQ

Which is only partially an explanation. The real test is: what percentage of the occurrences of a given letter are at the end of a word? In fact, nearly half the S's in the OSPD are at the ends of words: 24,728 of the 49,932 S's, or 49.52%.

Interestingly, S isn't the letter most likely to appear at the end of the word when it appears. That would be Y, 54.51% of whose appearances are at the ends of words. Again, the full list:

YSDGXRTKENHLMCWAPFZOBIUJVQ
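
For anyone who wants to try this at home, the two measures above (share of words ending in a letter, and share of a letter's occurrences that are word-final) could be computed along these lines. This is a sketch, not the script that produced the numbers in this comment; `final_letter_stats` and the five-word toy list are made up, and a real run would read in a full word list instead.

```python
from collections import Counter

def final_letter_stats(words):
    """Return two dicts keyed by letter:
    percent of words ending in that letter, and
    percent of that letter's occurrences that are word-final."""
    words = [w.upper() for w in words if w]
    final_counts = Counter(w[-1] for w in words)
    all_counts = Counter(ch for w in words for ch in w)
    pct_words_ending = {
        letter: 100 * n / len(words) for letter, n in final_counts.items()
    }
    pct_occurrences_final = {
        letter: 100 * final_counts[letter] / all_counts[letter]
        for letter in final_counts
    }
    return pct_words_ending, pct_occurrences_final

# Toy word list, invented for illustration only.
toy_words = ["CATS", "DOGS", "SASSY", "BOX", "TAXI"]
pct_words_ending, pct_occurrences_final = final_letter_stats(toy_words)

# Rank letters by each measure, most common first.
print(sorted(pct_words_ending, key=pct_words_ending.get, reverse=True))
print(sorted(pct_occurrences_final, key=pct_occurrences_final.get, reverse=True))
```

The toy list shows how the two rankings can differ: S ends the most words, but Y is the letter most likely to be final when it shows up at all.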

"X" may be the most surprising: about 17% of all X's are at the ends of words, even though the likelihood of any given word ending in X is .38%.

Aren't you glad you asked?

10:20 AM  
Blogger Lance said...

OK, one more script, and then I swear it's back to the dissertation revision.

How many words ending in a given letter are still words when you remove that letter? By no means are these all plurals--not only do you get FLAMINGO/FLAMING, you get NEEDLESS/NEEDLES, so not even all the "still a word when a final S is removed" words are plurals.

Nevertheless: 20,972 of the words ending in S are still words when that S is removed. That's 84.81% of the words ending in S; compared to 28.68% of the words ending in D, 22.27% of the words ending in R, and 24.64% of the words ending in Y. (The percentages are somewhat less indicative here: 57% of words ending in J are still words when the J is removed, because there are only seven words ending in J, and four of them are HADJ, HAJJ, HAJ, and TAJ.)
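
A check like this is easy to sketch with a set lookup. The code below is an illustration under assumptions, not the script behind the figures above: `removable_final_letter` is a made-up name, and the toy lexicon is deliberately tiny, so it lacks short words that a real word list would contain.

```python
from collections import Counter

def removable_final_letter(words):
    """For each final letter, return (removable, total): how many words ending
    in that letter are still words once the last letter is dropped, and how
    many words end in that letter overall."""
    lexicon = {w.upper() for w in words if w}
    ending = Counter()
    removable = Counter()
    for word in lexicon:
        ending[word[-1]] += 1
        if word[:-1] in lexicon:  # set membership makes each lookup cheap
            removable[word[-1]] += 1
    return {letter: (removable[letter], ending[letter]) for letter in ending}

# Toy lexicon, invented for illustration only.
toy_lexicon = ["FLAMINGO", "FLAMING", "NEEDLESS", "NEEDLES", "NEEDLE",
               "HAJJ", "HAJ", "TAJ", "CATS"]
stats = removable_final_letter(toy_lexicon)

for letter, (kept, total) in sorted(stats.items()):
    print(letter, kept, total, round(100 * kept / total, 1))
```

In the toy data, NEEDLESS and NEEDLES both count as removable-S words even though only one of them is a plural, which is exactly the caveat mentioned above.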

"O" is a surprisingly curtailable letter; 23% of its 869 ending appearances are removable, often to unrelated words. Sure, there's BRONCO, BASSO, etc., but there's also BONGO, CAMEO, CANTO, COMBO, DINERO, EXPRESSO...

Work, I tell you, work.

10:34 AM  
Blogger Nikolas Coukouma said...

I just made a somewhat similar post, but comparing tile and lexical frequencies for British and American dictionaries.

2:58 AM  
Blogger Spudart said...

I found this blog post by doing a google search for: scrabble frequency. This post was fifth in the results.

Fascinating blog, I'm adding you to my weekly check. :-)

4:50 PM  
Blogger Manikandan said...

Hi. Nice blog. I need a free job-posting website. Can anybody help me?

10:16 AM  
Blogger Ethylene said...

My friends and I definitely think there are too many I's. We very often have more than one of them. We hate them.

9:04 AM  
