Labelling Companies Referred to in Newspaper Articles
Amine Nahid, Előd Egyed-Zsigmond, Sylvie Calabretto
Université de Lyon; LIRIS UMR 5205
Lyon, France

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
There are several domains where establishing links between newspaper articles and companies is useful. In this paper, we present the first elements of our solution to predict links between a newspaper article written in French and a list of companies identified by their name and activity domain. We base our study on a semi-automatically annotated article corpus and the almost complete list of official French company names. We combine statistical linguistic methods with acronym generation and filtering techniques to propose a global score that predicts a distance between a text and a company. The main objective of the study presented in this paper is the creation of a usual name list for each company in order to improve the labelling of newspaper articles.

CCS CONCEPTS
• Information systems → Document topic models; Relevance assessment.

KEYWORDS
natural language processing, named entity recognition, information retrieval, text mining, text tagging

1 INTRODUCTION
Businesses have always had an interest in assessing their performance and evaluating their financial and public relations situation. Hence, information contained in press articles, client feedback, etc. might be of strategic importance.

Our main problem is to link articles with companies for a very large number of companies registered in France and identified by their unique national identifier (SIREN code) and legal name. However, companies are seldom referenced in the press by their legal names, which are often long. Our project is to design a solution to link economic press articles written in French with a set of companies.

We have a semi-automatically annotated article ground truth corpus and the list of the official denominations of around 30,000 companies registered in France. Our main contribution in this paper is a protocol to construct the common names of companies given their legal name and the set of annotated articles.

We carry out our experiments and develop our tools on French language texts, but most of the methods used can be easily adapted to other languages.

This usual name list will then be combined with other methods to propose a global score that predicts a distance between a text and a company, in order to end up with a model that labels a newspaper article with its corresponding companies among our list.

2 STATE OF THE ART
Matching press articles with the companies they mention is part of the Named Entity Recognition (NER) domain. The term NER appeared for the first time at the MUC-6 conference [5]. The task of recognising company mentions in texts is hence a sub-problem of NER, where we are interested only in entities representing companies. The issue can be addressed with different approaches. A baseline approach would be to search for the official name of the company in the text. Nonetheless, searching for the official name of a company within a newspaper article might prove inefficient, given that most companies have usual or common names that slightly differ from their legal ones. Working on a German corpus, [3] proposed the use of dictionaries of colloquial names from various sources, as well as an alias generator that derives an alias from an official denomination (it applies some classic NLP data cleaning: removal of legal designations, special characters and geographic indications, and token normalisation). Other works have elaborated rule-based systems, relying on heuristics and/or hand-crafted rules at the morphological level [4, 6, 7]. Unfortunately, rule-based methods are domain and language specific, and are therefore not portable. There have been recent attempts to execute generic NER tasks using deep learning [2], but they usually need many more training examples than we have, annotated more precisely. We are also experimenting with CRF (Conditional Random Fields) based techniques, with promising results. These experiments will be reported in a future paper.

In the following section we propose a statistics-based protocol to tackle the company recognition problem through common name dictionary generation.

3 PROPOSED APPROACH
In this section we present our company usual name creation method, first based on the official names and then on generated acronyms.

3.1 Hypothesis
Since companies are rarely referred to by their legal names and are rather known by one or more common names, we need to provide an accurate automatic protocol to generate these common names. Through the observation of the legal names of a set of French companies, we made the following hypotheses:
• The common name of a company might be only its legal name.
• The common name of a company might be a contiguous sequence of terms that form the legal name (a word ngram of the legal name).
• The common name might be an acronym of the legal name or of some part of it.
With these hypotheses, we aim to implement a common name generator that operates in two steps: the first step generates the sub-sequences and then the acronyms. The second step is the search for the best subset of ngrams and acronyms to compose the common name set.

3.2 Pre-processing
For our study, we have two data sets. The first one catalogues around 30k French companies identified by their SIREN codes (unique French identifier for businesses and not-for-profit organisations) and legal names in capital characters with no accents.
The second data set contains around 120 thousand annotated French newspaper article URLs, manually labelled with the SIREN codes of the companies they are talking about. Its elements follow the scheme: id, SIREN code, legal name of the company, URL address of the article. We developed a scraper that collected the title and body of the articles when available. In the end we included only the articles whose content (title and text) we managed to scrape. That gave us a dataset of around 58k articles.

We cleaned the official names from the first data set by removing the punctuation marks, especially the dots, commas and parentheses. However, we chose to keep the hyphens, as their use in French is very common for compound names, which are considered as single terms in our model. Examples:
• For companies without the aforementioned special characters, e.g. ELECTRICITE DE FRANCE, nothing is removed.
• For CA INDOSUEZ WEALTH (FRANCE), the parentheses are irrelevant and would be problematic for the ngram and acronym generation; the name therefore has to be transformed to CA INDOSUEZ WEALTH FRANCE before any further processing.
• There is a company registered in France with the official name CASINO, GUICHARD-PERRACHON. The comma is removed (hence CASINO GUICHARD-PERRACHON) as it is useless for any further processing. However, as stated before, the hyphen is kept because GUICHARD-PERRACHON is actually one name and should not be considered as two separate terms.
• The dots are also removed so as to normalise the acronyms in use within the legal names, e.g. SARL and S.A.R.L.

For the second data set, we concatenate the titles and bodies under a single attribute that we call corpus. We also normalise the corpus by removing non-printable Unicode characters. The articles are then put into an Elasticsearch index.

Figure 1: Number of companies per number of referencing articles

Since not all companies have the privilege of being talked about very often in the press, our ground truth is similarly unbalanced. For our dataset, the graph (cf. Figure 1) shows the number of companies as a function of the number of articles labelled as talking about them. 2375 French companies have more than 6 articles labelled as talking about them. We shall call these companies well-documented companies and focus our study on them. We consider that for the other, less documented companies, it is difficult to generate usual names based on annotated articles.

3.3 ngram generator
We call ngrams all the contiguous sequences of terms contained in an expression. Our goal at this stage is to generate, for each company, all possible ngrams of its legal denomination.

For instance, for the company "COMPAGNIE DU RHONE", we should generate the following ngrams: "COMPAGNIE", "DU", "RHONE", "COMPAGNIE DU", "DU RHONE", "COMPAGNIE DU RHONE".

In order to filter out potentially irrelevant ngrams we introduced 2 rules: filter one-character-long ngrams, and filter ngrams based on their frequency in the official name list.

3.3.1 Occurrence frequency. For a given ngram we compute an inverse occurrence frequency score of_score(ngram), depending on the number of times it occurs in the set of company legal names. The higher of_score(ngram) is, the more unique the ngram is.

\[
\mathrm{of\_score}(ngram) = 1 - \frac{\mathrm{count}(\{ f \in C \mid \mathrm{contains}(ln_f, ngram) \})}{\mathrm{count}(C)} \qquad (1)
\]

Where:
• ngram is a subsequence of the legal name of the company c
containing n words, 0 < n ≤ word_count(ln_c)
• ln_c, ln_f are the legal names of the companies c, f
• C is the set of all the companies we have
• count({f ∈ C | contains(ln_f, ngram)}) is the size of the subset of companies from C containing ngram in their official names.

Similarly to the previous step, we end up with a key-value dictionary, called dict_acr, where the keys are the SIREN codes of the companies and the values are lists of retained acronyms for the given companies.
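
The cleaning rules of Section 3.2, the ngram generator of Section 3.3 and the score of Equation (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary keys (made-up placeholders, not real SIREN codes) and the choice to match ngrams on whole terms rather than on raw substrings are assumptions made for the example. The frequency-based filtering rule is left out because no threshold is given in the text.

```python
from __future__ import annotations

import re
from collections import Counter


def clean_legal_name(name: str) -> str:
    """Remove dots, turn commas and parentheses into spaces, keep hyphens."""
    name = name.replace(".", "")         # S.A.R.L. -> SARL
    name = re.sub(r"[,()]", " ", name)   # these separators become spaces
    return " ".join(name.split())        # collapse repeated whitespace


def word_ngrams(name: str) -> list[str]:
    """All contiguous term sequences of a cleaned name, shortest first."""
    terms = name.split()
    return [
        " ".join(terms[start:start + n])
        for n in range(1, len(terms) + 1)
        for start in range(len(terms) - n + 1)
    ]


def drop_single_chars(ngrams: list[str]) -> list[str]:
    """First filtering rule: discard ngrams that are one character long."""
    return [g for g in ngrams if len(g) > 1]


def of_scores(legal_names: dict[str, str]) -> dict[str, float]:
    """of_score(ngram) = 1 - (companies whose name contains ngram) / count(C)."""
    company_freq = Counter()
    for name in legal_names.values():
        # set(): each company counts at most once per ngram (assumed reading
        # of contains(ln_f, ngram), matching whole terms only)
        company_freq.update(set(word_ngrams(name)))
    total = len(legal_names)
    return {g: 1 - freq / total for g, freq in company_freq.items()}


# Toy input: legal names quoted in the text; keys are hypothetical placeholders.
RAW_NAMES = {
    "id-1": "COMPAGNIE DU RHONE",
    "id-2": "ELECTRICITE DE FRANCE",
    "id-3": "CA INDOSUEZ WEALTH (FRANCE)",
    "id-4": "CASINO, GUICHARD-PERRACHON",
}
CLEAN_NAMES = {key: clean_legal_name(name) for key, name in RAW_NAMES.items()}
SCORES = of_scores(CLEAN_NAMES)

if __name__ == "__main__":
    print(drop_single_chars(word_ngrams(CLEAN_NAMES["id-1"])))
    print(SCORES["FRANCE"], SCORES["RHONE"])
```

On this toy set of four names, "FRANCE" occurs in two legal names and obtains a score of 0.5, while "RHONE" occurs in one and obtains 0.75, which matches the intuition that a higher score marks a more distinctive ngram. Counting each company once per ngram keeps the numerator equal to the size of a subset of C, as in Equation (1).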