LDC Corpora available at UIUC
The Beckman Institute has LDC membership for the years 1996, 1997,
1999, 2003, 2004, 2005, 2006, and 2007. The corpora listed below are
the ones we actually have copies of.
Margaret Fleck has (web-accessible, ask her for details):
- Arabic Treebank (volumes 1-3)
- Arabic Gigaword
- Boston Radio Speech Corpus (1997)
- Buckwalter Arabic morphological analyzer
- CELEX-2
- Comlex English lexicon (Pronlex)
- ECI Multilingual Corpus 1 (1994)
- English Gigaword
- Gulf Arabic Conversational Telephone Speech Transcripts
- HUB4 Broadcast News text data (1996)
- Switchboard:
  - Switchboard I Release 2 (1997) audio
  - Mississippi State word-level transcriptions
  - ICSI phonetic transcriptions
  - see Treebank-3 for versions with POS, disfluency, and/or syntactic parses
- Egyptian Arabic CALLHOME (transcripts)
- Propbank
- TDT 2 (version 3.2, English text only)
- Treebank-3
Mark Hasegawa-Johnson has:
- Switchboard:
  - Switchboard I Release 2 (1997) audio
  - "switchboard speaker ID Evaluation Test 1996" (what exactly is this?)
  - word transcriptions (the ones from Mississippi State? or....?)
  - "phonetic transcriptions for WS97 subset" (are these the ones distributed by ICSI?)
- Broadcast news: speech and transcriptions
- Mandarin Broadcast news
- Boston Radio Speech Corpus (1997)
- HUB4 Broadcast News text data (1996)
- TIMIT, NTIMIT
- CMU kids speech
- TIDIGITS
- ICSI Meeting corpus
- Santa Barbara corpus (part 2)
Richard Sproat has:
- Arabic Treebank (volumes 1-3)
- Arabic Gigaword
- Egyptian Arabic CALLHOME (transcripts)
- English Gigaword
- Buckwalter Arabic morphological analyzer
- Chinese Gigaword
- Chinese Treebank 4
- SigHan Chinese Treebank Segmentation Evaluation Corpus
- 2001 Hub5 Mandarin Transcription
Dan Roth has:
- ACE-2
- ACE 2004 Multilingual Training Corpus
- Comlex English lexicon (Pronlex)
- Google n-grams
- Hong Kong Hansards Parallel Text
- MUC-7
- North American News Corpus
- Propbank
- Reuters-21578 (1997) newswire data
- TDT 2 (version 3.2, English text only)
- TREC/AQUAINT
- Treebank-2
ChengXiang Zhai has:
- TREC-1 to TREC-8 disks
- lots more stuff, details coming soon
David Forsyth has:
- TRECVID news video (pre-release version, 2004)
Someone apparently has:
- TRAINS (1995)
- Form2 Kinematic Gesture