Tuesday, February 28, 2006

(Spelled) segment distribution: lexicon vs. corpus

Recently reading a historical para about Scrabble, I was surprised to (re)realize that the letter distributions in Scrabble are based on an (informal) corpus count (New York Times front page), not a dictionary-headword (i.e. lexicon) count. So, e.g., 12 per cent (12 tiles of 100) of letters in the Scrabble bag are 'e'; that's pretty exactly the percentage of letter occurrences that are 'e' in a corpus of written English.

The reason this surprised me is that in a corpus there will be many repetitions of function words, which probably inflates the percentages of certain letters, e.g. 't' (the, to, it), 'e' (the, he, she), etc., when compared to their percentage occurrence in the lexicon, which contains only one token of each function word.

But of course Scrabble is all about producing nice individual words, really, not about producing a corpus-like set of word tokens. Indeed, someone who repeatedly put down words like 'to', 'the', 'it', and so on would not get very far in a game of Scrabble. So it seemed to me that it would have been more appropriate to use a letter distribution based on the percentage of each letter occurring in a list of dictionary headwords.

I thought I would try to find out how different the lexicon-based letter distribution in English is from the corpus-based letter distribution, but I can't find any numbers online for the former. (Numbers for the latter are all over the place, of course, and match the Scrabble distributions pretty exactly; the reason letter distribution is so interesting to many people is that it's a good way to solve simple encryption problems.)

I know it would be a supersimple programming problem to produce a list of letters and their respective percentage distributions in the headwords of any online dictionary database (e.g. the fourth column of Mike Hammond's 'newdic' file), but it'd be a biggish time investment for me to figure it out right this second. Anyone who can see a quick and easy way to do it want to send me the numbers for comparison to the corpus numbers? It would be interesting...
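
For anyone who wants to try it, here is a minimal Python sketch of the comparison, not a finished tool. It counts each word type once for the lexicon figure and every token for the corpus figure. The function names are made up for illustration, and `headwords_from_file` simply assumes a whitespace-separated file with the spelled headword in the fourth column; the real layout of any particular dictionary file would need checking.

```python
from collections import Counter


def letter_percentages(words):
    """Map each letter a-z to its share (in percent) of all letter occurrences."""
    counts = Counter(
        ch for word in words for ch in word.lower() if "a" <= ch <= "z"
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: 100.0 * n / total for ch, n in sorted(counts.items())}


def lexicon_and_corpus(tokens):
    """Return (lexicon, corpus) letter percentages for a list of word tokens.

    The corpus figure counts every token; the lexicon figure counts each
    distinct word only once, as a list of headwords would.
    """
    tokens = [t.lower() for t in tokens]
    corpus = letter_percentages(tokens)
    lexicon = letter_percentages(set(tokens))
    return lexicon, corpus


def headwords_from_file(path, column=3):
    """Hypothetical loader: assumes whitespace-separated columns, with the
    spelled headword in `column` (0-based, so 3 is the fourth column)."""
    with open(path, encoding="utf-8") as f:
        rows = (line.split() for line in f)
        return [row[column] for row in rows if len(row) > column]


# Toy data only, just to show the effect of repeated function words.
toy_tokens = "the cat sat on the mat and the dog sat on the log".split()
lexicon_pct, corpus_pct = lexicon_and_corpus(toy_tokens)
for letter in "te":
    print(letter, round(lexicon_pct[letter], 1), round(corpus_pct[letter], 1))
```

Even in this tiny toy sample, 't' and 'e' come out with a higher percentage in the corpus column than in the lexicon column, because 'the' is counted four times in one and once in the other.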

(Seems to me that it might even be useful, theoretically -- if you're into exemplar models of mental lexicon representations, e.g., the frequency/markedness value of a given English segment in your mental inventory might be expected to correlate with the corpus distribution, while if you're not into exemplar models but rather a more traditional lexicon-based model, with a single abstract phonological representation of each given word, you might expect the segment markedness values to correlate with their lexicon distribution in English.)

Sunday, February 26, 2006

They did it!

Just have to note that the entire island of Newfoundland (and the considerably bigger hunk of land that is Labrador) has gone bananas with joy because the men's curling team won the gold medal on Friday. Hooray! They were very endearing about it too -- Brad Gushue, attempting to throw the hammer into scoring position in the decisive sixth end, with six of his stones already in place to score, threw the seventh stone right through the house and out the other side. Six points in an end is already an unheard-of amount in international-caliber curling, but he could have had seven. He said to his teammates in apology for the missed shot, "Sorry guys -- couldn't get my heart rate down." Six, seven -- it didn't matter; it was in the bag.
Apparently the provincial government declared a half-day holiday so schoolkids could go home and watch the game; the Memorial University of Newfoundland (my alma mater) set up a live feed on a screen in the Field House for the whole campus to come see, which they apparently did. A friend of my dad's, himself a come-from-away, wrote in bemusement: "There are apparently three important dates in the history of Nfld.: 1497, when John Cabot landed and claimed the island for England; 1949, the date of Confederation with Canada, and 2006, when Gushue and the boys won the gold medal in curling."
I expect it's keeping them all warm through the 60 centimeters of snow that apparently has just fallen on St. John's.

In other news, I'm about to go give a colloquium talk in Madison, where MILC will be happening again this year. Watch for an important paper by M. Goose and R. Rabbit!

Thursday, February 23, 2006

Search interlinear data from 600 languages!

University of Arizona alumnus and current University of Washington faculty member Will Lewis and his collaborator Scott Farrar, also a grad of UAZ, have put together a website that allows you to search for example sentences from over 600 languages, made available online in interlinearly glossed form. The website is called ODIN, for Online Database of INterlinear (linguistic data), and is part of the GOLD project.

The example sentences are culled from the web by a spider program that uses various heuristics to recognize the tell-tale signs of interlinearly glossed example sentences in on-line text. The spider then automatically identifies the language in the example by a combination of searching surrounding text for language names and statistically analyzing the letter sequences and comparing transitional probability profiles to those of language samples that have been hand-identified. It then enters the example into the database, which is searchable by language and gloss terms. (The 'Advanced Search' function, which allows access to the search-by-gloss terms, isn't linked to its button yet but will be soon.)
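
To give a feel for the letter-sequence half of that identification step, here is a rough Python sketch of one way to compare transitional probability profiles. It is an illustration under my own assumptions, not ODIN's actual code: it uses letter-to-letter (bigram) transitions with add-one smoothing, the function names and the `alphabet_size` value are invented, and the two training samples are toy sentences rather than real hand-identified data. The search of surrounding text for language names is not covered.

```python
import math
from collections import Counter, defaultdict


def _letters(text):
    """Keep only letters and spaces, lowercased."""
    return [c for c in text.lower() if c.isalpha() or c == " "]


def train_profile(sample):
    """Count letter-to-letter transitions in a hand-identified sample."""
    chars = _letters(sample)
    profile = defaultdict(Counter)
    for a, b in zip(chars, chars[1:]):
        profile[a][b] += 1
    return profile


def log_likelihood(text, profile, alphabet_size=30):
    """Sum of log P(next letter | current letter), with add-one smoothing.

    alphabet_size is an assumed rough symbol count used only for smoothing.
    """
    chars = _letters(text)
    score = 0.0
    for a, b in zip(chars, chars[1:]):
        row = profile.get(a, {})
        total = sum(row.values())
        score += math.log((row.get(b, 0) + 1) / (total + alphabet_size))
    return score


def guess_language(text, profiles):
    """Pick the language whose transition profile best fits the text."""
    return max(profiles, key=lambda lang: log_likelihood(text, profiles[lang]))


# Toy training samples (an assumption for illustration only).
profiles = {
    "english": train_profile(
        "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
    ),
    "spanish": train_profile(
        "el rapido zorro marron salta sobre el perro perezoso y el gato "
        "se sento en la alfombra"
    ),
}
print(guess_language("the cat", profiles))
print(guess_language("el gato", profiles))
```

With samples this small the guesses are only reliable for strings that look a lot like the training text; a real system would need much larger samples per language.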

Will and Scott are currently working on upgrading the search function to identify several easily-picked-out syntactic configurations, based mostly on the translation line. For example, sluicing examples are easy to pick out by looking at their translation lines: interlinearly glossed sentences whose English translations end in wh-words are mostly sluices; this afternoon a quick simple query to the database for sentences like that turned up examples of sluices in Passamaquoddy and Hausa. Besides sluicing, future search functionality may include ways to find sentences exhibiting obligatory control, gapping, ellipsis, ACD, scopal ambiguity, and whatever else they can think of straightforward ways to automatically identify. Suggestions from interested linguists very welcome!
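
The sluicing heuristic is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not the query that was actually run: the record layout (a dict with `language` and `translation` keys), the function names, and the wh-word list are all assumptions, and the example sentences are made up rather than taken from the database.

```python
import re

# Assumed list of English wh-words; extend as needed.
WH_WORDS = {"who", "whom", "whose", "what", "which", "when", "where", "why", "how"}


def ends_in_wh_word(translation):
    """True if the last word of the English translation is a wh-word."""
    words = re.findall(r"[a-z']+", translation.lower())
    return bool(words) and words[-1] in WH_WORDS


def find_sluice_candidates(examples):
    """Filter example records down to likely sluices."""
    return [ex for ex in examples if ends_in_wh_word(ex["translation"])]


# Made-up records, standing in for glossed examples.
examples = [
    {"language": "X", "translation": "Someone left, but I don't know who."},
    {"language": "Y", "translation": "What did she buy?"},
    {"language": "Z", "translation": "He fixed it, but he won't say how."},
]
for ex in find_sluice_candidates(examples):
    print(ex["language"], ex["translation"])
```

Final punctuation and quotation marks are ignored, so a translation ending in `who.` or `how?"` still counts. The filter only yields candidates: as noted above, such sentences are mostly, not always, sluices.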

In other news, the Newfoundland rink has made it into the gold medal game in men's curling! They play Finland for the gold today (Friday Feb. 24). (The Canadian women won a bronze yesterday.)

I checked out the team at the Canadian curling website and it turns out that of the guys on the ice, actually only Brad Gushue, the skip, is from St. John's. Mark Nichols and Mike Adam are from Labrador City, and Jamie Korab is from Harbour Grace. Their second rock, Russ Howard, is a longtime Canadian curling champ from the mainland. The coach, Toby McDonald, is a St. John's man.

Friday, February 17, 2006

I'm feewing wucky

My brother points out that in Google preferences you can set your preferred language for Google to talk to you in. It's got lots of your more interesting actual languages such as Xhosa, Twi and Guarani, as well as Klingon, Elmer Fudd, Hacker, Pig Latin, and Bork Bork Bork (what the Swedish Chef speaks). Took me a while to find English again from the Klingon menu, but I was able to switch to Elmer Fudd and then back to my native language from there.

The Canadian curlers have been rocking the house! (That's a little curling pun.) They were in first place until today, when they dropped back to a tie for second with Finland, behind Great Britain. Keep it up, b'ys! Get into that final!

Tuesday, February 14, 2006

Newfoundland English

Hey -- if you want to hear some (St. John's) Nfld. English, tune in to NBC's coverage of the Canadian curling team. The Newfoundland team is representing Canada this year, and they've been playing some awfully good ends. The teams are miked, so you can hear them discussing the shots and encouraging their stones. So check it out! and root for them a bit while you do. They lost a heartbreaker to Sweden today, but they're still in it.