Survey on European and Brazilian Portuguese Speech and Text Corpora

Carla Simões

Portuguese Corpora

• Where can we find them?
  – World Wide Web
    • www.elra.info, European Language Resources Association
    • www.elda.org, Evaluations and Language resources Distribution Agency
    • www.ldc.upenn.edu, Linguistic Data Consortium
    • www.iltec.pt, Instituto de Linguística Teórica e Computacional
    • www.clul.ul.pt, Centro de Linguística da Universidade de Lisboa
    • www.l2f.inesc-id.pt, Laboratório de Sistemas de Língua Falada, INESC
    • www.linguateca.pt, Language resource center for Portuguese
    • devoted.to/corpora, Bookmarks for Corpus-based Linguists
    • www.appen.com.au, Appen Speech Technologies

ELRA (www.elra.info)

European Language Resources Association

• Spoken corpus
  – Desktop/microphone
    • C-ORAL-ROM - Integrated reference corpora for spoken Romance languages
    • FASiL Portuguese “fasil-pt” corpus
    • Portuguese Speecon database
    • GlobalPhone Portuguese (Brazilian)
  – Telephony
    • Portuguese SpeechDat(M) database
    • Portuguese SpeechDat(II) FDB-4000
• Written corpus
  – Monolingual Lexicon
    • LusoLEX European Portuguese Lexicon
    • BrasiLEX Brazilian Portuguese lexicon
    • PAROLE Portuguese Lexicon
    • LABEL-LEX (MW)
    • LABEL-LEX (SW)
  – Written corpora
    • PAROLE Portuguese Corpus
    • ECI/MCI (European Corpus Initiative/Multilingual Corpus I)
    • MLCC - Multilingual and Parallel Corpora

Spoken corpus – Desktop/microphone

• C-ORAL-ROM - Integrated reference corpora for spoken Romance languages
  – The corpus consists of four comparable recording collections of Italian, French, Portuguese and Spanish spontaneous speech sessions (around 300,000 words for each language)
  – It provides the acoustic source of each session together with the following main annotations:
    • The orthographic transcription, in CHAT format, enriched with the tagging of terminal and non-terminal prosodic breaks
    • Session metadata
    • The text-to-speech synchronization, in WIN PITCH CORPUS format, based on the alignment of each transcribed utterance
  – Package:
    • Uncompressed .WAV files (Win PCM: 22,050 Hz; 16 bit)
    • Transcription files in .TXT and .XML format
    • Transcription files with PoS tagging in .TXT files
    • The frequency list of lemmas for each language collection in TXT files
    • Measurements of spoken language variability in EXCEL files
  – Catalog Reference: S0172
  – Source Channel: Microphone, Radio, Telephone, Television
  – Member prices
    • Academic - Commercial: 10000.00 EUR
    • Academic - Research: 1500.00 EUR
    • Commercial - Commercial: 10000.00 EUR
    • Commercial - Research: 10000.00 EUR
  – Non-member prices
    • Academic - Commercial: 20000.00 EUR
    • Academic - Research: 3000.00 EUR
    • Commercial - Commercial: 20000.00 EUR
    • Commercial - Research: 20000.00 EUR

• FASiL Portuguese unimodal “fasil-pt” corpus
  – The corpus was collected in the context of the FASiL project, EU FP5 IST-2001-38685 (http://www.fasil.co.uk), as a Wizard-of-Oz (WOZ) experiment
  – There are sound recordings of subject and wizard. A total of 70 subjects were recorded
  – The WOZ experiment concerns voice interaction with a Virtual Personal Assistant (VPA) for an email, calendar and contacts task
  – .wav files (u-law) for audio, plain ASCII text (.txt) for transcriptions
  – Catalog Reference: S0174-02
  – Distribution medium: CD-ROM, DVD
  – Also available: the FASiL combined unimodal “fasil-all” corpus and the FASiL multimodal “fasil-mm” corpus, where the subjects were recorded in the three project languages: Swedish, Portuguese and English
  – Member prices
    • Academic - Commercial: 8000.00 EUR
    • Academic - Research: 4000.00 EUR
    • Commercial - Commercial: 8000.00 EUR
    • Commercial - Research: 8000.00 EUR
  – Non-member prices
    • Academic - Commercial: 10000.00 EUR
    • Academic - Research: 8000.00 EUR
    • Commercial - Commercial: 10000.00 EUR
    • Commercial - Research: 10000.00 EUR

• Portuguese Speecon database
  – A Portuguese speech corpus recorded in Portugal
  – Recorded at 16 kHz, 16 bit, linear
  – 87 hours
  – Microphone: CloseTalk/FarTalk
  – Divided into two sets:
    • The first set comprises the recordings of 553 adult Portuguese speakers (266 males, 287 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place)
    • The second set comprises the recordings of 52 child Portuguese speakers (19 boys, 33 girls), recorded over 4 microphone channels in 1 recording environment (children's room)
  – The database is partitioned into 29 DVDs (first set) and 4 DVDs (second set)
  – Catalog Reference: S0180
  – Source Channel: Microphone
  – Member prices
    • Academic - Commercial: 67000.00 EUR
    • Academic - Research: 50000.00 EUR
    • Commercial - Commercial: 67000.00 EUR
    • Commercial - Research: 67000.00 EUR
  – Non-member prices
    • Academic - Commercial: 75000.00 EUR
    • Academic - Research: 60000.00 EUR
    • Commercial - Commercial: 75000.00 EUR
    • Commercial - Research: 75000.00 EUR

• GlobalPhone Portuguese (Brazilian)
  – GlobalPhone provides transcribed speech data for the development and evaluation of large vocabulary continuous speech recognition systems in the most widespread languages of the world
  – The Portuguese (Brazilian) corpus was produced using the Folha de São Paulo newspaper
  – The entire GlobalPhone corpus contains over 300 hours of speech spoken by more than 1500 native adult speakers
  – The Portuguese corpus contains recordings of 102 speakers (54 males, 48 females, with different age distribution) recorded in Porto Velho and São Paulo, Brazil
  – In each language about 100 adult native speakers were asked to read 100 sentences
  – About 2 Gb for each language
  – Close-speaking microphone, PCM encoding, mono quality, 16-bit quantization, and 16 kHz sampling rate
  – Catalog Reference: S0201
  – Info: http://www.cs.cmu.edu/~tanja/GlobalPhone
  – Distribution medium: DVD
  – Member prices
    • Academic - Commercial: 3000.00 EUR
    • Academic - Research: 600.00 EUR
    • Commercial - Commercial: 3000.00 EUR
    • Commercial - Research: 3000.00 EUR
  – Non-member prices
    • Academic - Commercial: 3600.00 EUR
    • Academic - Research: 700.00 EUR
    • Commercial - Commercial: 3600.00 EUR
    • Commercial - Research: 3600.00 EUR

Spoken corpus – Telephony

• Portuguese SpeechDat(M) database
  – An INESC project under a subcontract with Portugal Telecom (first phase)
  – Contains the recordings of 1,001 speakers (453 males, 548 females). This speech database was collected by Portugal Telecom within the European SpeechDat project
  – It has a good representation of many regional accents and of the age distribution
  – A pronunciation lexicon with a phonemic transcription in SAMPA is also included
  – 8 kHz, 8-bit A-law
  – Each speaker uttered the following items:
    • natural numbers
    • digits
    • money amounts
    • dates
    • time phrase
    • application words
    • spelled-out words
    • word spotting phrases
    • sentences
    • yes/no questions
    • spontaneous date
    • spontaneous time
    • region name
  – Catalog Reference: S0068
  – Source Channel: Telephone
  – Distribution: CD-ROM
  – Member prices
    • Academic - Commercial: 14000.00 EUR
    • Academic - Research: 11000.00 EUR
    • Commercial - Commercial: 14000.00 EUR
    • Commercial - Research: 14000.00 EUR
  – Non-member prices
    • Academic - Commercial: 20000.00 EUR
    • Academic - Research: 14000.00 EUR
    • Commercial - Commercial: 20000.00 EUR
    • Commercial - Research: 20000.00 EUR

• Portuguese SpeechDat(II) FDB-4000
  – An INESC project under a subcontract with Portugal Telecom (second phase)
  – Comprises 4027 Portuguese speakers (1861 males, 2166 females) recorded over the Portuguese fixed telephone network. It has a good representation of many regional accents and of the age distribution
  – 8-bit 8 kHz A-law
  – A pronunciation lexicon with a phonemic transcription in SAMPA
  – Each speaker uttered different items:
    • digits, numbers
    • currency money amount
    • dates
    • time phrases
    • spelled words
    • directory assistance utterances
    • yes/no questions
    • application words
    • phonetically rich words
    • phonetically rich sentences
  – Samples: http://speechdat.phonetik.uni-muenchen.de/speechdt/speechDB/FIXED1PT/HTML/index.html
  – Catalog Reference: S0092
  – Source Channel: Telephone
  – Distribution: CD-ROM
  – Member prices
    • Academic - Commercial: 40000.00 EUR
    • Academic - Research: 28000.00 EUR
    • Commercial - Commercial: 40000.00 EUR
    • Commercial - Research: 40000.00 EUR
  – Non-member prices
    • Academic - Commercial: 56000.00 EUR
    • Academic - Research: 48000.00 EUR
    • Commercial - Commercial: 56000.00 EUR
    • Commercial - Research: 56000.00 EUR

Written corpus – Monolingual Lexicon

• LusoLEX European Portuguese Lexicon and BrasiLEX Brazilian Portuguese lexicon
  – Available at Microsoft Language Resources

• PAROLE Portuguese Lexicon
  – It consists of 20 000 entries, morpho-syntactically and syntactically encoded
  – Distribution medium: CD-ROM
  – Member prices
    • Academic - Commercial: 10500.00 EUR
    • Academic - Research: 1400.00 EUR
    • Commercial - Commercial: 10500.00 EUR
    • Commercial - Research: 3500.00 EUR
  – Non-member prices
    • Academic - Commercial: 15000.00 EUR
    • Academic - Research: 2000.00 EUR
    • Commercial - Commercial: 15000.00 EUR
    • Commercial - Research: 5000.00 EUR

• LABEL-LEX (MW)
  – It is a Portuguese formalized lexicon, containing 88 619 multiword lexical units (formally, sequences of simple words)
  – Multiword units are often used to express ideas and concepts
  – Member prices
    • Academic - Commercial: 10000.00 EUR
    • Academic - Research: 3000.00 EUR
    • Commercial - Commercial: 10000.00 EUR
    • Commercial - Research: 10000.00 EUR
  – Non-member prices
    • Academic - Commercial: 15000.00
