Where Are We in Named Entity Recognition from Speech?
Antoine Caubrière (1), Sophie Rosset (2), Yannick Estève (3), Antoine Laurent (1), Emmanuel Morin (4)
(1) LIUM - Le Mans University, France, (2) LIMSI - University of Paris-Saclay, CNRS, France, (3) LIA - Avignon University, France, (4) LS2N - University of Nantes, CNRS, France

Abstract

Named entity recognition (NER) from speech is usually performed through a pipeline process that consists of (i) processing the audio with an automatic speech recognition (ASR) system and (ii) applying NER to the ASR output. The latest data available for named entity extraction from speech in French were produced during the ETAPE evaluation campaign in 2012. Since the publication of the ETAPE campaign results, major improvements have been made to NER and ASR systems, especially with the development of neural approaches for both components. In addition, recent studies have shown the capability of End-to-End (E2E) approaches for NER / SLU tasks. In this paper, we propose a study of the improvements made in speech recognition and named entity recognition for pipeline approaches. For this type of system, we propose an original 3-pass approach. We also explore the capability of an E2E system to perform structured NER. Finally, we compare the performance of the ETAPE systems (state of the art in 2012) with the performance obtained using current technologies. The results show the interest of the E2E approach, which nevertheless remains below an updated pipeline approach.

Keywords: Named Entity Recognition, Automatic Speech Recognition, Tree-structured Named Entity, End-to-End

1. Introduction

Named entity recognition seeks to locate and classify named entity mentions in unstructured text into predefined categories (such as person names, organizations, locations, ...). The Quaero project (Grouin et al., 2011) initiated an extended definition of named entities for French data. This extended version has a multilevel tree structure, where base entities are combined to define more complex ones. With the extended definition, named entity recognition consists in the detection, classification and decomposition of the entities. This new definition was used for the French evaluation campaign ETAPE (Galibert et al., 2014).

Since the publication of the ETAPE results in 2012, no new work has been published, to the best of our knowledge, on named entity recognition from speech for Quaero-like tree-structured French data. Tree-structured named entities cannot be tackled as a simple sequence labeling task. At the time of the ETAPE campaign, state-of-the-art work relied on multiple processing steps before rebuilding the tree structure. Conditional Random Fields (CRF) (Lafferty et al., 2001) were at the core of these earlier sequence labeling approaches. Some approaches (Dinarelli and Rosset, 2012; Dinarelli and Rosset, 2011) used a Probabilistic Context-Free Grammar (PCFG) (Johnson, 1998) in addition to CRF to implement a cascade model: the CRF was trained on component information and the PCFG was used to predict the whole entity tree. The winning ETAPE NER system (Raymond, 2013) used only CRF models, with one model per base entity.

Most typical approaches to named entity recognition from speech follow a two-step pipeline, with first an ASR system and then a NER system applied to the automatic transcriptions produced by the ASR system. In this configuration, the NER component must deal with an imperfect transcription of the speech. As a result, the quality of the automatic transcriptions has a major impact on NER performance (Ben Jannet et al., 2015).

In 2012, HMM-GMM implementations were still the state-of-the-art approach for ASR technologies. Since then, the great contribution of neural approaches to both NER and ASR tasks has been demonstrated. Recent studies (Lample et al., 2016; Ma and Hovy, 2016) improve NER accuracy by combining bidirectional Long Short-Term Memory (bLSTM) and CRF layers. Other studies (Tomashenko et al., 2016) rely on a combination of HMM and Deep Neural Networks (DNN) to reach state-of-the-art ASR performance. Lately, E2E approaches to named entity recognition from speech have been proposed (Ghannay et al., 2018); in that work, the E2E systems learn an alignment between the audio and a manual transcription enriched with named entities, without tree structure. Other work uses an End-to-End approach to map speech directly to intent, instead of mapping speech to words and then words to intent (Lugosch et al., 2019). These works show the growing interest in E2E approaches for this type of task.

In this paper, we propose a study of recent improvements for NER in the scope of the ETAPE campaign. We compare classical pipeline approaches with updated components and E2E approaches trained with two kinds of strategy. The first contribution of this paper is a 3-pass implementation to tackle tree-structured named entity recognition. It consists in splitting the tree-structured named entity annotation scheme into three parts, so that each part can be handled as classical sequence labeling before the complex structure is rebuilt. The second contribution is the application of an E2E approach to tree-structured named entity recognition. It consists in training a system that learns the alignment between the audio and the textual transcription enriched with the structured named entities.

After a description of the Quaero tree-structured named entity task (Section 2), we describe our 3-pass implementation, our state-of-the-art NER and ASR components, and our E2E implementation (Sections 3 and 4). Data sets (Section 5), experimental results and analyses (Section 6) are then presented, followed by a conclusion (Section 7).
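As an illustration of the two settings compared in this paper, the following is a minimal sketch; the asr, ner and e2e objects are hypothetical placeholders and do not correspond to the actual components described in the following sections.

# Minimal sketch of the two settings compared in this paper.
# The asr, ner and e2e objects are hypothetical placeholders.

def pipeline_ner_from_speech(audio, asr, ner):
    transcript = asr.transcribe(audio)   # (i) speech -> words (ASR)
    return ner.tag(transcript)           # (ii) words -> named entities (NER)

def end_to_end_ner_from_speech(audio, e2e):
    # A single model learns the alignment between audio and a transcription
    # enriched with named entity tags (speech -> enriched transcript).
    return e2e.decode(audio)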
2. Task definition

This study focuses on tree-structured named entities following the Quaero guideline (Rosset et al., 2011). This guideline allows annotation according to 8 main types of named entities: amount, event, func, loc, org, pers, prod and time. The annotation uses sub-types to set up a hierarchy of named entities in order to better describe the concepts. The final annotation is necessarily a leaf of the hierarchical tree, with annotation nodes separated by a dot: for example, loc.add.phys is the physical address of a place. With types and sub-types, there are 39 possible entity types in the Quaero annotation guideline.

In addition, in order to decompose the concepts, named entities are annotated with components; there are 28 possible components in the Quaero annotation guideline. The component is the smallest annotated element. Each word located inside a named entity must be annotated with a component, except for some articles and linking words. Most components depend on the named entity type (e.g. "day" and "week", which refer to the type "time"), but some are cross-cutting (e.g. "kind" and "qualifier", which can occur inside all named entity types).

Finally, annotations have a tree structure. A named entity can be composed of components and other named entities, themselves composed of components and named entities, without nesting limit. For example, the sentence "la mairie de paris" can be annotated as "la <org.adm <kind mairie > de <loc.adm.town <name paris > > >", where org.adm and loc.adm.town are named entity types with sub-types, and kind and name are components.

With the Quaero definition of named entities, NER consists in entity detection, classification and decomposition. Since this definition was used for the French evaluation campaign ETAPE, the task in this study consists in Quaero named entity extraction from speech.
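As an illustration, the following sketch (our own helper, not part of the Quaero tooling) reads the bracketed annotation of the example above into a nested structure, making the nesting of entity types/sub-types and components explicit.

import re

# Bracketed Quaero-style annotation of "la mairie de paris" from the example above.
ANNOTATION = "la <org.adm <kind mairie > de <loc.adm.town <name paris > > >"

def parse(tokens):
    """Recursively build (label, children) pairs; plain words stay as strings."""
    node = []
    while tokens:
        tok = tokens.pop(0)
        if tok == ">":        # closes the current entity or component
            return node
        if tok == "<":        # opens an entity or component: next token is its label
            label = tokens.pop(0)
            node.append((label, parse(tokens)))
        else:                 # a plain word
            node.append(tok)
    return node

tree = parse(re.findall(r"<|>|[^\s<>]+", ANNOTATION))
print(tree)
# ['la', ('org.adm', [('kind', ['mairie']), 'de', ('loc.adm.town', [('name', ['paris'])])])]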
3. Pipeline systems

3.1. 3-pass implementation

Our NER systems use the standard BIO2 format (Sang and Veenstra, 1999). Since tree-structured annotations cannot be predicted directly with a flat sequence labeling approach, we simplify the problem by reducing all the labels related to a word into a single one. Figure 1 illustrates an example of this concatenation.

[Figure 1: Transformation example of a tree-structured named entity sequence into BIO format. The example sentence means "the town hall of paris" in English.]

The label concatenation induces a dramatic increase in the number of predictable outputs for a classical sequence labeling approach: with this concatenation, the number grows to around 1,690 predictable tags. It also induces a large annotation sparsity. These issues motivated us to split the BIO annotation into different levels.

Since named entities are necessarily decomposed into components, three facts can be deduced. First, the root of the tree structure is necessarily a named entity type. Second, the leaves of the tree structure are mainly components. Third, the annotations between the leaves and the root of the structure are a mixture of types and components. Based on these observations, we split the concatenated BIO annotation into three levels.

The first level contains the annotations furthest from the word level. These annotations are the roots of the tree-structured named entities. This level is shown in green in Figure 1 and requires 96 predictable tags. The third level contains the annotations closest to the word level. These annotations are the leaves of the named entities. This level is shown in red in Figure 1 and requires 57 predictable tags. Finally, the second level contains all other annotations, i.e. the named entity types and/or components located between the root and the leaves of the named entities. This level is shown in black in Figure 1 and requires 187 predictable tags.

With the annotation divided into three levels, the tree-structured NER task is tackled as three sequence-labeling tasks.
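To make the split concrete, here is a minimal sketch under our own assumptions: the tag strings, the "_" separator and the exact BIO layout are hypothetical (Figure 1 defines the actual format), but the separation into root / intermediate / leaf levels follows the description above.

# Hypothetical concatenated BIO tags for "la mairie de paris"; the separator and
# exact layout are assumptions, not the paper's actual Figure 1 format.
concatenated = [
    ("la",     "O"),
    ("mairie", "B-org.adm_B-kind"),                     # root type + leaf component
    ("de",     "I-org.adm"),                            # linking word: type only
    ("paris",  "I-org.adm_B-loc.adm.town_B-name"),      # root + intermediate + leaf
]

def split_levels(tag):
    """Split one concatenated BIO tag into (root, intermediate, leaf) sub-tags."""
    if tag == "O":
        return "O", "O", "O"
    parts = tag.split("_")
    root = parts[0]                                  # level 1: furthest from the word
    leaf = parts[-1] if len(parts) > 1 else "O"      # level 3: closest to the word
    intermediate = "_".join(parts[1:-1]) or "O"      # level 2: everything in between
    return root, intermediate, leaf

for word, tag in concatenated:
    root, inter, leaf = split_levels(tag)
    print(f"{word:7s} root={root:12s} intermediate={inter:16s} leaf={leaf}")

After the three sequence-labeling passes, the predicted sub-tags of each word can be re-stacked to rebuild the complex structure, as outlined in the introduction.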