Activity Report 2011 - 2016

Lorraine Laboratory of Research in Computer Science and its Applications ACTIVITY REPORT 2011 - 2016 PROSPECTIVES FOR 2017 - 2022 A research unit from the research department AM2I of Lorraine University: Automatics, Mathematics, Computer Science and their Interactions Volume 5 Contents Department 4: Natural Language Processing & Knowledge Discovery 5 Team Cello . 19 Team Multispeech . 25 Team Orpailleur . 37 Team QGAR . 47 Team Read. 55 Team Sémagramme . 61 Team SMarT. 71 Team Synalp . 79 References for Department 4 . 89 Department 4: Natural Language Processing & Knowledge Discovery 185 Department project. 185 Detailed projects of the NLPKD department teams . 189 CONTENTS | 1 | HCERES CONTENTS | 2 | HCERES 01 Activity Report Department 4 Natural Language Processing & Knowledge Discovery Department Head: Bruno Guillaume Team Cello . 19 Team Multispeech . 25 Team Orpailleur . 37 Team QGAR . 47 Team Read . 55 Team Sémagramme . 61 Team SMarT . 71 Team Synalp . 79 References for Department 4 . 89 Department 4 named Natural Language Processing & Knowledge Discovery (NLPKD) is com- posed of eight teams which are interested in the processing of all media used by humans for communication (speech, text and other kind of written document) and in the modeling of the content of these data (knowledge modeling, semantic of natural language). Speech processing and modeling is the main focus of MULTISPEECH and SMART (speech recognition, speech synthesis and translation are considered). Natural Language Processing is the main interest of SYNALP and SÉMAGRAMME with various activities about modeling, analysis and generation of several levels (syntax, semantics and discourse). QGAR and READ are active on document processing: segmentation, hand-written recognition, shape recognition. Knowledge discovery and knowledge processing is at the heart of ORPAILLEUR team: fundamental methods are developed and applied to areas like life science and data of the semantic web; CELLO is involved in knowledge modeling through extensions of modal logic. Despite different research areas, the eight teams of the department 4 share common background either on symbolic methods or on statistical approaches that are applied to most of these topics. Activity Report | 5 | HCERES Overview of Department 4 ................................................................................................................................. 1 Department Composition Department leader Bruno Guillaume (since September, 2013) Yannick Toussaint (before September, 2013) List of teams • CELLO: Computational Epistemic Logic in Lorraine • ORPAILLEUR: Knowledge Discovery and Knowledge Engineering (EPC Inria) • MULTISPEECH: Speech Modeling for Facilitating Oral-Based Communication (EPC Inria) • QGAR: Querying Graphics through Analysis and Recognition • READ: Reconnaissance de l’Ecriture et Analyse de Documents (Writting recognition and document analysis) • SÉMAGRAMME: Semantic Analysis of Natural Language (EPC Inria) • SMART: Statistical Machine Translation • SYNALP: Statistic and Symbolic Natural Language Processing PR MCF DR CR Total 2011 4 21 7 7 39 2016 7 22 6 12 47 PhD’s defended 40 On-going PhD’s 27 In the 67 PhD Thesis, 9 are co-supervised PhD funded outside the department. The 58 PhD remaining theses are funded as follows: 12 by MESR, 8 by Inria, 8 by industrial contracts, 7 by ANR projects, 4 by European projects, 2 by Foreign fundings (Vietnam and Algeria). Other funding are given by INCa, CNRS, FUI, DGA, Campus France, École Polytechnique. During the evaluation period, 26 post-docs and 28 engineers were in the teams of the department. Activity Report | 6 | HCERES Department evolution 2010 2011 2012 2013 2014 2015 2016 2017 6 7 TALARIS SYNALP 1 10 12 PAROLE MULTISPEECH 2 3 SMART 5 5 CALLIGRAMME SÉMAGRAMME Department 4 4 5 QGAR 2 2 READ 1 1 CELLO 12 12 ORPAILLEUR 3 CAPSID (Dept 5) The figure above gives an overview of the teams and their evolution during the evaluation period. Arrows correspond to creation of a new team where several members of the department move from an old team to the new one. 2 Life of the department The department organizes a seminar with invited talks every two months both for members of the department and for students of the Erasmus Mundus master program in NLP (Natural Language Processing). Since last year, a department day is organized each year where all PhD students are invited to present their works to members of other teams. The department is also in charge of the ranking of PhD applications and of the profiling of job for permanent positions. There are many collaborations between the different teams of the department, in particular through two national funding PIA (Programme d’Investissment d’Avenir): ORPAILLEUR and SYNALP are involved in the PIA Istex; SYNALP, SÉMAGRAMME and MULTISPEECH are involved in the PIA Ortolang (two engineers were shared by these teams). Combined national and regional fundings through CPER (Contrat de Plan État-Région) were used notably to construct common infrastructures like clusters and experimental platforms (“CPER TALC” until 2013 and “CPER LCHN” since 2014). In terms of collaborations with other departments, we mainly have interactions with Department 5 (common projects, co-supervised theses) and also a few interactions with Department 2. 3 Research topics Keywords: Speech processing, Natural Language Processing, Document Processing, Knowledge Pro- cessing and Knowledge Discovery. The teams of the NLPKD department have a wide scope, the main centers of interest are: • Speech Processing. Many different aspects are covered by teams MULTISPEECH, SMART and SYNALP. Main topics are: Speech Recognition, Speech Synthesis, Multimodal Speech Process- ing, Source Separation, Articulatory Modeling, Speech-to-speech Translation. Activity Report | 7 | HCERES • Natural Language Processing. In the department, the teams SMART, SYNALP, SÉMA- GRAMME and ORPAILLEUR are working on NLP. They are active in: Language Modeling (Formal semantics, Syntax-Semantic Interface, Discourse Dynamics), Natural Language Process- ing (Syntax and Semantics Parsing, Dialog Processing, Text Clustering), Natural Language Gen- eration and Text Mining. • Document Processing. The two teams READ and QGAR focus on Handwriting Recognition, Document Image Analysis and Indexing, Pattern Recognition, Feature Extraction, Segmentation and Incremental Classification. • Knowledge Processing and Knowledge Discovery. The ORPAILLEUR team has a strong focus on Knowledge with topics like: Knowledge Discovery in Databases, Formal Concept Analysis, Knowledge Engineering or Web of Data. The CELLO team is involved in Knowledge and Belief, Knowledge progression and its formalization in Modal Logic. The research activities in the department also cover a large range of methods: • Statistical or Learning based methods (HMM, BIRL, SVM, CRF, DNN, DBN) are used in all area described above, i.e. speech, text, document and Knowledge processing. • Symbolic based methods are also widely used in SÉMAGRAMME, SYNALP, ORPAILLEUR and CELLO with a focus on Logic in SÉMAGRAMME and CELLO, on Formal Concept Anal- ysis and Case-based Reasoning in ORPAILLEUR and on hybrid methods in SYNALP. The eight teams of the department are working in a large set of application domains: Foreign Lan- guage Learning, Sentiment Analysis and Opinion mining, Preference Modeling, Pathological speech or Pathological language, Life Science, Dialog System, Serious Games, Chemistry, Typography. We summarize below the research topics and application domains of these 8 teams as they exist in 2016. A more detailed description is given in the sections dedicated to each team. During the evaluation period, the PAROLE team has become MULTISPEECH and has now a stronger focus on speech. The main objective of the team is to better understand how humans produce and perceive speech. Statistical approaches are common for processing speech and they are used in ac- tual applications but their performance drops significantly when dealing with degraded speech, such as noisy signals and spontaneous speech. MULTISPEECH has used a wide range of statistical methods (with a focus on Deep Neural Networks) and applied them to audio source separation, speech enhance- ment, acoustic modeling, pronunciation modeling, language modeling and speech generation. One of the interests of the team is to investigate uncertainty estimation, i.e. to quantify the confidence in the output of a given speech processing technique and to exploit it for further processing. Uncertainty estimation is applied to acoustic modeling, speech recognition, phonetic segmentation and prosody. Another research axis is more oriented toward synthesis. In this axis, the team has worked on articulatory modeling, audio-signal speech synthesis (animation of a 3D model of human face) and categorization of sounds and prosody. Applications for hard-of-hiring children or for feedback for foreign language learners are considered. Since 2015 a part of the activities previously developed in the PAROLE team have led to a new team called SMART. The objective of SMART team is to develop statistical approaches for Natural Language Processing applications. Methods are corpus-based and researches are focused on multi-lingual aspects. Machine translation is one of the main area for the team. More precisely, application domains are speech- to-speech translation, translation of modern Arabic to different Arabic dialects and quality estimation for Activity Report | 8 | HCERES machine

Activity Report 2011 - 2016

An Analysis of Sentence Segmentation Features For

A Bayesian Framework for Word Segmentation: Exploring the Effects of Context

The Induction of Phonotactics for Speech Segmentation

Automatic Speech Segmentation

A Semi-Markov Model for Speech Segmentation with an Utterance-Break Prior

Seminar, Tue 4:15Pm

The Polysemy of the Words That Children Learn Over Time Arxiv

AUTOMATIC LINGUISTIC SEGMENTATION of CONVERSATIONAL SPEECH Andreas Stolcke Elizabeth Shriberg

Segmentation and Morphology

Unsupervised Segmentation of Audio Speech Using the Voting Experts Algorithm Matthew Im Ller Adam Miller Iowa State University

The Universal Dependencies Treebank of Spoken Slovenian

Sentence Boundary Detection Based on Parallel Lexical and Acoustic Models