Channel: Corpus linguistics
Browsing all 48 articles

Hwaet! Old English corpora and a quick look at my favorite word in Beowulf

More often than I should admit, when people talk about wh-words, I hear a sharp rush-clack of hwæt! That’s where “what” comes from, but it’s not just the inversion of w and h. The Anglo-Saxons seemed to...

Making a corpus from YouTube: dialects in North America

Here is a link to Rick Aschmann’s amazing collection of speech clips from Canadian and American speakers on YouTube using the Atlas of North American English as a starting point:...

Geography stuff for dialectology

InfoChimps have a geography API that might help you plot people against locations: bit.ly/xLJKzQ Brice Russ at OSU has been doing Twitter dialect stuff and has been using the Data Science Toolkit, but...

Build your own corpus (well, for now)

BootCaT is meant to help folks build up their own corpora from the Internet. However, it uses the Bing API and may not be able to do so for much longer, so it may go down temporarily. Go get your corpus...

Erorrs erorrs evrerywehere

Earlier I showed a neat resource that has grammatical and ungrammatical sentences from various linguistics papers over the years. But what if you want a whole bunch of English errors? In that case,...

World Englishes

Elizabeth Traugott offers these suggestions for corpora on world Englishes: eWAVE = The electronic World Atlas of Varieties of English. 2011. Edited by Bernd Kortmann and Kerstin Lunkenheimer. Leipzig:...

Persian verb inflections

Mohammad Sadegh Rasooli has put together a rule-based verb inflector for Persian as part of preprocessing work for the Persian dependency treebank and some other projects. Find out more: Project:...

Names

Stephanie Shih has been doing really fun work on what makes a name (first and last) using a corpus of Facebook names. This helps her get recent trends–the Social Security Administration releases first...

Verb phrase ellipsis corpus

One of the hot topics in syntax dissertations in recent years has been ellipsis (I’m not kidding–when I was on a job search committee a few years ago, ellipsis was *everywhere*). Bos and Spenader...

Super-European language translation corpus

The DGT-TM corpus takes sentences from 22 European languages and has translations (manually produced) for 21 other languages. That means there are about 3 million sentences per language. With all pairs...
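
To make the "all pairs" arithmetic concrete, here is a minimal sketch that counts how many language pairs 22 languages give you. The language labels are placeholders invented for illustration, not the actual DGT-TM language codes, and the count says nothing about how the corpus itself is packaged.

```python
from itertools import combinations, permutations

# Assumption: 22 languages, the figure quoted in the teaser above.
NUM_LANGUAGES = 22

# Placeholder labels (hypothetical), standing in for real language codes.
languages = [f"lang{i:02d}" for i in range(NUM_LANGUAGES)]

# Unordered pairs: A-B counts the same as B-A.
unordered_pairs = list(combinations(languages, 2))

# Directed pairs: translating A->B is distinct from B->A.
directed_pairs = list(permutations(languages, 2))

print(len(unordered_pairs))  # 231
print(len(directed_pairs))   # 462
```

Each language pairs with the 21 others, so there are 22 × 21 = 462 translation directions, or half that (231) if direction does not matter.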

Malto (a Dravidian language) corpus for dialect, folklore research and more!

Check out the new LDC corpus of Malto, which is a Dravidian language of India (I think ~200k speakers). Malto corpus (LDC2012S04) It’s 8 hours, 27 speakers (glosses/transcriptions/etc for 6 of the 8...

South Asian resources

Just a while ago, I mentioned a corpus of Malto, a Dravidian language of NE India and Bangladesh. But what about other South Asian resources? (Given the MILLIONS of people speaking many South Asian...

Power (Supreme Court Justices and Wikipedia editors)!

Cristian Danescu-Niculescu-Mizil, Timothy Hawes, and colleagues have released some more corpora that are worth playing with. The Wikipedia Talk Page Conversations Corpus: 125,000 conversations...

Aboriginal Australian language support

Here’s a table Piers Kelley put together for the R-N-L-D mailing list. It has some handy resources for people interested in corpora as well as language documentation/preservation/teaching/learning....

Article 0

Brendan O’Connor & Co. from CMU have updated their tweet parser and provided a bunch of other stuff, including a collection of 56 million English-language tweets....

Romnesia–word play techniques

(Add your suggestions in comments!) So one of the words hitting big around social media is “Romnesia”, referring to presidential candidate Mitt Romney’s forgetting of his position because it’s changed...

“Strength” in Presidential debates

Word clouds are pretty. Here’s what it looks like across presidential and vice-presidential debates from the first Kennedy-Nixon to the third Obama-Romney. Frequency is kind of like the old grey mare...

Python classes for free!

Sign up to get started with Python–its Natural Language Toolkit (NLTK) makes it easy to answer lots of linguistic questions. http://t.co/nU912Ew2
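
For a taste of the kind of linguistic question Python can answer, here is a minimal sketch of a word-frequency count. It uses only the standard library so it runs without installing anything (NLTK offers richer tools for the same job, such as `FreqDist`). The `word_frequencies` helper and the sample sentence are made up for illustration, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens in a string (hypothetical helper)."""
    # Crude tokenizer: runs of letters and apostrophes; real work needs more care.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Made-up sample text, not drawn from any corpus mentioned here.
sample = "What we call strength is often patience. Strength, strength, and more strength."

freqs = word_frequencies(sample)
print(freqs["strength"])     # 4
print(freqs.most_common(1))  # [('strength', 4)]
```

Swapping the sample string for the contents of a text file is enough to start asking which words a speaker or author leans on most.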

How do doctors and patients talk?

Kozloff and Barnett (2007) have a corpus of medical consultations (doctors advising patients about diabetes, coronary artery disease, etc). http://www.verilogue.com/

Sociolinguistic summary: news from NWAV 41

To get Linguists to dance, play songs that are not in English. #macarena #NWAV41 http://t.co/tloJETYh One parting shot from Dennis Preston: “it’s not hard to teach adjectives to Texans. Just ask if...
