Survey on European and Brazilian Portuguese Speech and Text Corpora
Total Page:16
File Type:pdf, Size:1020Kb
Carla Simões, [email protected] Survey on European and Brazilian Portuguese Speech and Text Corpora 1 Portuguese Corpora • Where can we find it? – World Wide Web • www.elra.info , European Language Resources Association • www.elda.org, Evaluations and Language resources Distribution Agency • www.ldc.upenn.edu, Linguistic Data Consortium • www.iltec.pt, Instituto de Linguística Teórica e Computacional • www.clul.ul.pt, Centro de Linguística da Universidade de Lisboa • www.l2f.inesc-id.pt, Laboratório de Sistemas de Língua Falada, INESC • www.linguateca.pt, Language resource center for Portuguese • devoted.to/corpora, Bookmarks for Corpus-based Linguists • www.appen.com.au, Appen Speech Technologies 2 ELRA www.elra.info 3 European Language Resources Association • Spoken corpus • Written corpus – Desktop/microphone – Monolingual Lexicon • C-ORAL-ROM - Integrated • LusoLEX European Portuguese reference corpora for spoken Lexicon romance languages • BrasiLEX Brazilian Portuguese • FASiL Portuguese “fasil-pt” corpus lexicon • Portuguese Speecon database • PAROLE Portuguese Lexicon • GlobalPhone Portuguese • LABEL-LEX (MW) (Brazilian) • LABEL-LEX (SW) – Telephony – Written corpora • Portuguese SpeechDat(M) • PAROLE Portuguese Corpus database • ECI/MCI (European Corpus • Portuguese SpeechDat(II) FDB- Initiative/Multilingual Corpus I) 4000 • MLCC - Multilingual and Parallel Corpora 4 Spoken corpus – Desktop/microphone • C-ORAL-ROM - Integrated reference corpora for • Catalog Reference : spoken romance languages S0172 – The corpus consists of four comparable recording • Source Channel : collections of Italian, French, Portuguese and Spanish Microphone, Radio, spontaneous speech sessions (around 300,000 words Telephone, Television for each Language) • Members Prices – Academic - Commercial – It provides the acoustic source of each session together 10000.00 EUR with the following main annotations: – Academic - Research • The orthographic transcription, in CHAT format, enriched 1500.00 EUR with the tagging of terminal and non terminal prosodic breaks – Commercial - Commercial 10000.00 EUR • Session metadata – Commercial - Research • The text to speech synchronization, in WIN PITCH CORPUS 10000.00 EUR format, based on the alignment of each transcribed utterance • Non Member Prices : – Package: – Academic - Commercial 20000.00 EUR • uncompressed .WAV files (Win PCM: 22,050 hz; 16 bit) – Academic - Research • Transcription files in .TXT and .XML format 3000.00 EUR • transcription files with PoS tagging in .TXT files – Commercial - Commercial 20000.00 EUR • The frequency list of lemmas for each language collection in – Commercial - Research TXT files 20000.00 EUR • Measurements of spoken language variability in EXCEL files 5 Menu Spoken corpus – Desktop/microphone • FASiL Portuguese unimodal “fasil-pt” corpus – The corpus was collected in the context of the FASiL • Members Prices project, EU FP5 IST-2001-38685 (http://www.fasil.co.uk), – Academic - Commercial as a wizard-of-oz experiment 8000.00 Academic - Research 4000.00 EUR – Commercial - – There are sound recordings of subject and wizard. A Commercial 8000.00 total of 70 subjects were recorded EUR – Commercial - Research 8000.00 EUR – The woz experiment is about the voice interaction with a • Non Member Prices Virtual Personal Assistant (VPA) for an email, calendar – Academic - Commercial and contacts task 10000.00 EUR – Academic - Research – .wav files (u-law) for audio, plain ASCII text (.txt) for 8000.00 EUR transcriptions – Commercial - • Catalog Reference : S0174-02 Commercial 10000.00 EUR • Distribution medium : CD-ROM, DVD – Commercial - Research 10000.00 EUR • Also available the FASiL combined unimodal “fasil- all” corpus and the FASiL multimodal “fasil-mm” corpus, where the subjects were recorded in three project languages: Swedish, Portuguese and English • Demo 6 Menu Spoken corpus – Desktop/microphone • Portuguese Speecon database • Catalog Reference : – It’s a Portuguese speech corpus recorded in S0180 Portugal • Source Channel : Microphone – Recorded at 16 KHz, 16 bit – linear – 87 hours • Members Prices – Microphone: CloseTalk/ FarTalk – Academic - Commercial 67000.00 – Divided in two sets: – Academic - Research 50000.00 EUR • The first set comprises the recordings of 553 – Commercial - Commercial adult Portuguese speakers (266 males, 287 67000.00 – Commercial - Research females), recorded over 4 microphone 67000.00 channels in 4 recording environments (office, entertainment, car, public place) • The second set comprises the recordings of • Non Member Prices 52 child Portuguese speakers (19 boys, 33 – Academic - Commercial girls), recorded over 4 microphone channels in 75000.00 EUR 1 recording environment (children room) – Academic - Research 60000.00 EUR • This database is partitioned into 29 DVDs – Commercial - Commercial (first set) and 4 DVDs (second set) 75000.00 EUR – Commercial - Research 75000.00 EUR 7 Menu Spoken corpus – Desktop/microphone • GlobalPhone Portuguese (Brazilian) • Catalog Reference : S0201 – provides transcribed speech data for the • Info development and evaluation of large vocabulary – http://www.cs.cmu.edu/~ continuous speech recognition systems in the most tanja/GlobalPhone widespread languages of the world • Distribution medium : DVD – The Portuguese (Brazilian) corpus was produced using the Folha de São Paulo newspaper • Members Prices – Academic - Commercial – The entire GlobalPhone corpus contains over 300 3000.00 hours of speech spoken by more than 1500 native – Academic - Research 600.00 adult speakers – Commercial - Commercial 3000.00 EUR – It contains recordings of 102 speakers (54 males, 48 – Commercial - Research females with different age distribution) recorded in 3000.00 Porto Velho and Sao Paulo, Brazil • Non Member Prices – Academic - Commercial – In each language about 100 adult native speakers 3600.00 Academic - were asked to read 100 sentences Research 700.00 – Commercial - Commercial – About 2 Gb for each language 3600.00 EUR – close-speaking microphone, PCM encoding, – Commercial - Research mono quality, 16-bit quantization, and 16 kHz 3600.00 sampling rate 8 Menu Spoken corpus – Telephony • Portuguese SpeechDat(M) database • An INESC project under a subcontract with Portugal • Catalog Reference : S0068 Telecom – first phase • Source Channel : Telephone – Contains the recordings of 1,001 speakers (453 males, • Distribution : 548 females). This speech database was collected by CD-ROM Portugal Telecom within the European SpeechDat • Members prices : project – Academic - Commercial – It has a good representation of many regional accents, 14000.00 and age distribution – EURAcademic - Research – It is also included a pronunciation lexicon with a 11000.00 phonemic transcription in SAMPA – EURCommercial - Commercial 14000.00 – 8 kHz, 8-bit A-law – EURCommercial - Research – Each speaker uttered the following items: 14000.00 EUR • natural numbers • digits • Non member prices : • money amounts – Academic - Commercial • dates 20000.00 • time phrase • application words – EURAcademic - Research • spelled-out words 14000.00 • word spotting phrases – EURCommercial - • sentences Commercial 20000.00 • yes/no questions – EURCommercial - Research • spontaneous date 20000.00 EUR • spontaneous time • region name 9 Menu Spoken corpus – Telephony • Portuguese SpeechDat(II) FDB-4000 • Catalog Reference : S0092 • An INESC project under a subcontract with Portugal Telecom – second phase • Source Channel : – Comprises 4027 Portuguese speakers (1861 males, 2166 Telephone females) recorded over the Portuguese fixed telephone • Distribution : network. It has a good representation of many regional CD-ROM accents, and age distribution – 8-bit 8 kHz A-law • Members prices : – A pronunciation lexicon with a phonemic transcription in – Academic - Commercial SAMPA 40000.00 EUR – Each speaker uttered different items: – Academic - Research - digits, numbers 28000.00 EUR - currency money amount – Commercial - Commercial - dates 40000.00 - time phrases : – Commercial - Research - spelled words : 40000.00 EUR - directory assistance utterances - yes/no questions • Non member prices : - application words – Academic - Commercial - phonetically rich words 56000.00 EUR - phonetically rich sentences – Academic - Research 48000.00 EUR • Samples - http://speechdat.phonetik.uni- – Commercial - Commercial muenchen.de/speechdt/speechDB/FIXED1PT/HTML/index. 56000.00 html – Commercial - Research 56000.00 EUR 10 Menu Written corpus - Monolingual Lexicon • LusoLEX European Portuguese Lexicon and BrasiLEX Brazilian Portuguese lexicon – Available at Microsoft Language Resources • PAROLE Portuguese Lexicon – It’s constituted by 20 000 entries morpho-syntactically and syntactically encoded – Distribution medium : CD-ROM • Members Prices – Academic - Commercial 10500.00 EUR – Academic - Research 1400.00 EUR – Commercial - Commercial 10500.00 EUR – Commercial - Research 3500.00 EUR • Non Member Prices – Academic - Commercial 15000.00 EUR – Academic - Research 2000.00 EUR – Commercial - Commercial 15000.00 EUR – Commercial - Research 5000.00 EUR 11 Menu Written corpus – Monolingual Lexicon • LABEL-LEX (MW) • Members Prices – It is a Portuguese formalized lexicon, – Academic - Commercial 10000.00 EUR containing 88 619 multiword lexical – Academic - Research 3000.00 EUR units (formally, sequences of simple – Commercial - Commercial 10000.00 EUR words) – Commercial - Research 10000.00 EUR – Often, are used to express ideas and • Non Member Prices concepts – Academic - Commercial 15000.00