Resources for Corpus Linguistics

Corpus Linguistics © 2003 Anatol Stefanowitsch

1. (FIRST GENERATION) WRITTEN CORPORA

Brown Corpus
    Properties: 1m words, written, American, 1961
    Design: original (see Corpus Design handout)
    Availability: ICAME-CD, free online access at LDC

Lancaster-Oslo/Bergen (LOB)
    Properties: 1m words, written, British, 1961
    Design: based on BROWN
    Availability: ICAME-CD

Kolhapur Corpus of Indian English
    Properties: 1m words, written, Indian, 1978
    Design: roughly based on BROWN (less specific fiction, more general fiction)
    Availability: ICAME-CD

Wellington Corpus of Written New Zealand English
    Properties: 1m words, written, NZ, 1986-90
    Design: roughly based on BROWN (no subcategories within Fiction)
    Availability: ICAME-CD, Wellington Corpus CD

Australian Corpus of English (ACE)
    Properties: 1m words, written, Australian, 1986
    Design: roughly based on BROWN
    Availability: ICAME-CD, ACE-CD

London-Lund (LLC)
    Properties: 0.5m words, spoken, British, 1959ff
    Design: derived from SEU (Survey of English Usage)
    Availability: ICAME-CD

Freiburg-Brown (FROWN)
    Properties: 1m words, written, American, 1991
    Design: based on BROWN
    Availability: ICAME-CD

Freiburg-LOB (FLOB)
    Properties: 1m words, written, British, 1991
    Design: based on LOB (i.e. BROWN)
    Availability: ICAME-CD

2. MEGA CORPORA

Cobuild Bank of English
    Properties: > 300m words, mostly written, mostly British (some American)
    Design: opportunistic
    Availability: HarperCollins, online demo at Cobuild web site

British National Corpus (BNC)
    Properties: 100m words, spoken (10m) and written (90m), 1990s
    Design: original (see Corpus Design handout)
    Availability: BNC CD, online demo at BNC web site

3. VARIOUS SPECIALIZED CORPORA

Spoken Corpora

Corpus of Spoken American English
    Properties: growing, spoken, American
    Design: free conversation among friends in natural setting, tape recorded by participants
    Availability: free download at Talkbank

Switchboard (SWB)
    Properties: ca. 3m words, spoken, American, 1990s
    Design: telephone conversations between strangers on predetermined topics
    Availability: free online access at LDC

Spoken Portion of BNC
    Properties: 10m words, British, 1990s
    Design: see BNC
    Availability: BNC-World CD
    Note: Spoken language files are distributed through all subdirectories of the BNC; they can be extracted by searching for files that contain the string "stext" in the header.

Corpus of Spoken Professional American English
    Properties: ca. 2m words, spoken, American, 1990s
    Design: faculty meetings, committee meetings, White House press conferences
    Availability: CSPAE-CD (Athelstan)

Michigan Corpus of Academic Spoken English (MICASE)
    Properties: 1.7m words, spoken, academic English (University of Michigan), 1997-2001
    Design: representative of speech in academic settings
    Availability: free online access at MICASE web site

Diachronic

Complete Corpus of Old English
    Properties: written, Old English
    Design: full-text (contains all surviving Old English texts)
    Availability: University of Toronto

Helsinki Corpus of English Texts, Diachronic Part
    Properties: 1.5m words, written, Old English to Early Modern English
    Design: original
    Availability: ICAME-CD

Language Acquisition

Child Language Data Exchange System (CHILDES)
    Properties: broad range of corpora of child language
    Design: varies according to corpus
    Availability: CHILDES web site

4. TEXT ARCHIVES

Project Gutenberg
    http://gutenberg.net/
    Web interface: http://clwww.essex.ac.uk/w3c/corpus_ling/content/search_engine.html

University of Virginia Electronic Text Center
    http://etext.virginia.edu/

5. USING THE INTERNET AS A CORPUS

    http://www.webcorp.org.uk/

6. WEB PAGES OF MAJOR CORPORA OF ENGLISH

Talkbank
    www.talkbank.org

ICAME
    http://www.hit.uib.no/icame.html

British National Corpus (BNC)
    http://www.hcu.ox.ac.uk/BNC/
    Demo access: http://sara.natcorp.ox.ac.uk/lookup.html

COBUILD Bank of English
    http://www.cobuild.collins.co.uk/boe_info.html
    Demo access: http://www.cobuild.collins.co.uk/form.html

Linguistic Data Consortium (LDC)
    http://www.ldc.upenn.edu
    Demo access to BROWN and SWITCHBOARD: http://www.ldc.upenn.edu/lol

International Corpus of English (ICE)
    http://www.ucl.ac.uk/english-usage/ice/

Michigan Corpus of Academic Spoken English (MICASE)
    http://www.hti.umich.edu/m/micase/
    Online access: http://www.hti.umich.edu/cgi/m/micase/micase-idx?type=revise

Child Language Data Exchange System (CHILDES)
    http://childes.psy.cmu.edu/

Bergen Corpus of London Teenage Language (COLT)
    http://nora.hd.uib.no/colt/
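
The note on the spoken portion of the BNC (section 3) says that spoken files can be found by searching for the string "stext" near the start of each file. A minimal Python sketch of that search follows. The function name find_spoken_files, the 64 KB read limit, and the directory layout used in the usage note are illustrative assumptions, not part of the BNC distribution or its documentation.

```python
from pathlib import Path


def find_spoken_files(corpus_root, marker=b"stext", header_bytes=65536):
    """Return sorted paths of files whose leading bytes contain `marker`.

    `header_bytes` is an assumed upper bound on how far into a file the
    marker may appear; it is not a documented BNC value and may need tuning.
    """
    spoken = []
    # Walk every subdirectory, since spoken files are not kept in one place.
    for path in Path(corpus_root).rglob("*"):
        if not path.is_file():
            continue
        # Read only the start of the file instead of the whole text.
        with path.open("rb") as handle:
            head = handle.read(header_bytes)
        if marker in head:
            spoken.append(path)
    return sorted(spoken)
```

Calling find_spoken_files("path/to/BNC/Texts") would return the matching paths; that directory name is a placeholder for wherever the corpus files were copied from the CD. Reading bytes rather than decoded text avoids having to guess the file encoding.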
