Detecting Personal Life Events from Social Media
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Automatic Correction of Real-Word Errors in Spanish Clinical Texts
sensors Article Automatic Correction of Real-Word Errors in Spanish Clinical Texts Daniel Bravo-Candel 1,Jésica López-Hernández 1, José Antonio García-Díaz 1 , Fernando Molina-Molina 2 and Francisco García-Sánchez 1,* 1 Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain; [email protected] (D.B.-C.); [email protected] (J.L.-H.); [email protected] (J.A.G.-D.) 2 VÓCALI Sistemas Inteligentes S.L., 30100 Murcia, Spain; [email protected] * Correspondence: [email protected]; Tel.: +34-86888-8107 Abstract: Real-word errors are characterized by being actual terms in the dictionary. By providing context, real-word errors are detected. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus. Then, the probability of a word being a real-word error is computed. On the other hand, state-of-the-art approaches make use of deep learning models to learn context by extracting semantic features from text. In this work, a deep learning model were implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation Model mapped erroneous sentences to correct them. For that, different types of error were generated in correct sentences by using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus due to privacy issues when dealing Citation: Bravo-Candel, D.; López-Hernández, J.; García-Díaz, with patient information. -
Knowledge Graphs on the Web – an Overview Arxiv:2003.00719V3 [Cs
January 2020 Knowledge Graphs on the Web – an Overview Nicolas HEIST, Sven HERTLING, Daniel RINGLER, and Heiko PAULHEIM Data and Web Science Group, University of Mannheim, Germany Abstract. Knowledge Graphs are an emerging form of knowledge representation. While Google coined the term Knowledge Graph first and promoted it as a means to improve their search results, they are used in many applications today. In a knowl- edge graph, entities in the real world and/or a business domain (e.g., people, places, or events) are represented as nodes, which are connected by edges representing the relations between those entities. While companies such as Google, Microsoft, and Facebook have their own, non-public knowledge graphs, there is also a larger body of publicly available knowledge graphs, such as DBpedia or Wikidata. In this chap- ter, we provide an overview and comparison of those publicly available knowledge graphs, and give insights into their contents, size, coverage, and overlap. Keywords. Knowledge Graph, Linked Data, Semantic Web, Profiling 1. Introduction Knowledge Graphs are increasingly used as means to represent knowledge. Due to their versatile means of representation, they can be used to integrate different heterogeneous data sources, both within as well as across organizations. [8,9] Besides such domain-specific knowledge graphs which are typically developed for specific domains and/or use cases, there are also public, cross-domain knowledge graphs encoding common knowledge, such as DBpedia, Wikidata, or YAGO. [33] Such knowl- edge graphs may be used, e.g., for automatically enriching data with background knowl- arXiv:2003.00719v3 [cs.AI] 12 Mar 2020 edge to be used in knowledge-intensive downstream applications. -
Welsh Language Technology Action Plan Progress Report 2020 Welsh Language Technology Action Plan: Progress Report 2020
Welsh language technology action plan Progress report 2020 Welsh language technology action plan: Progress report 2020 Audience All those interested in ensuring that the Welsh language thrives digitally. Overview This report reviews progress with work packages of the Welsh Government’s Welsh language technology action plan between its October 2018 publication and the end of 2020. The Welsh language technology action plan derives from the Welsh Government’s strategy Cymraeg 2050: A million Welsh speakers (2017). Its aim is to plan technological developments to ensure that the Welsh language can be used in a wide variety of contexts, be that by using voice, keyboard or other means of human-computer interaction. Action required For information. Further information Enquiries about this document should be directed to: Welsh Language Division Welsh Government Cathays Park Cardiff CF10 3NQ e-mail: [email protected] @cymraeg Facebook/Cymraeg Additional copies This document can be accessed from gov.wales Related documents Prosperity for All: the national strategy (2017); Education in Wales: Our national mission, Action plan 2017–21 (2017); Cymraeg 2050: A million Welsh speakers (2017); Cymraeg 2050: A million Welsh speakers, Work programme 2017–21 (2017); Welsh language technology action plan (2018); Welsh-language Technology and Digital Media Action Plan (2013); Technology, Websites and Software: Welsh Language Considerations (Welsh Language Commissioner, 2016) Mae’r ddogfen yma hefyd ar gael yn Gymraeg. This document is also available in Welsh. -
Learning to Read by Spelling Towards Unsupervised Text Recognition
Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta Andrea Vedaldi Andrew Zisserman Visual Geometry Group Visual Geometry Group Visual Geometry Group University of Oxford University of Oxford University of Oxford [email protected] [email protected] [email protected] tttttttttttttttttttttttttttttttttttttttttttrssssss ttttttt ny nytt nr nrttttttt ny ntt nrttttttt zzzz iterations iterations bcorote to whol th ticunthss tidio tiostolonzzzz trougfht to ferr oy disectins it has dicomered Training brought to view by dissection it was discovered Figure 1: Text recognition from unaligned data. We present a method for recognising text in images without using any labelled data. This is achieved by learning to align the statistics of the predicted text strings, against the statistics of valid text strings sampled from a corpus. The figure above visualises the transcriptions as various characters are learnt through the training iterations. The model firstlearns the concept of {space}, and hence, learns to segment the string into words; followed by common words like {to, it}, and only later learns to correctly map the less frequent characters like {v, w}. The last transcription also corresponds to the ground-truth (punctuations are not modelled). The colour bar on the right indicates the accuracy (darker means higher accuracy). ABSTRACT 1 INTRODUCTION This work presents a method for visual text recognition without read (ri:d) verb • Look at and comprehend the meaning of using any paired supervisory data. We formulate the text recogni- (written or printed matter) by interpreting the characters or tion task as one of aligning the conditional distribution of strings symbols of which it is composed. -
Spelling Correction: from Two-Level Morphology to Open Source
Spelling Correction: from two-level morphology to open source Iñaki Alegria, Klara Ceberio, Nerea Ezeiza, Aitor Soroa, Gregorio Hernandez Ixa group. University of the Basque Country / Eleka S.L. 649 P.K. 20080 Donostia. Basque Country. [email protected] Abstract Basque is a highly inflected and agglutinative language (Alegria et al., 1996). Two-level morphology has been applied successfully to this kind of languages and there are two-level based descriptions for very different languages. After doing the morphological description for a language, it is easy to develop a spelling checker/corrector for this language. However, what happens if we want to use the speller in the "free world" (OpenOffice, Mozilla, emacs, LaTeX, ...)? Ispell and similar tools (aspell, hunspell, myspell) are the usual mechanisms for these purposes, but they do not fit the two-level model. In the absence of two-level morphology based mechanisms, an automatic conversion from two-level description to hunspell is described in this paper. previous work and reuse the morphological information. 1. Introduction Two-level morphology (Koskenniemi, 1983; Beesley & Karttunen, 2003) has been applied successfully to the 2. Free software for spelling correction morphological description of highly inflected languages. Unfortunately there are not open source tools for spelling There are two-level based descriptions for very different correction with these features: languages (English, German, Swedish, French, Spanish, • It is standardized in the most of the applications Danish, Norwegian, Finnish, Basque, Russian, Turkish, (OpenOffice, Mozilla, emacs, LaTeX, ...). Arab, Aymara, Swahili, etc.). • It is based on the two-level morphology. After doing the morphological description, it is easy to The spell family of spell checkers (ispell, aspell, myspell) develop a spelling checker/corrector for the language fits the first condition but not the second. -
Easychair Preprint Feedback Learning: Automating the Process
EasyChair Preprint № 992 Feedback Learning: Automating the Process of Correcting and Completing the Extracted Information Rakshith Bymana Ponnappa, Khurram Azeem Hashmi, Syed Saqib Bukhari and Andreas Dengel EasyChair preprints are intended for rapid dissemination of research results and are integrated with the rest of EasyChair. May 12, 2019 Feedback Learning: Automating the Process of Correcting and Completing the Extracted Information Abstract—In recent years, with the increasing usage of digital handwritten or typed information in the digital mailroom media and advancements in deep learning architectures, most systems. The problem arises when these information extraction of the paper-based documents have been revolutionized into systems cannot extract the exact information due to many digital versions. These advancements have helped the state-of-the- art Optical Character Recognition (OCR) and digital mailroom reasons like the source graphical document itself is not readable, technologies become progressively efficient. Commercially, there scanner has a poor result, characters in the document are too already exists end to end systems which use OCR and digital close to each other resulting in problems like reading “Ouery” mailroom technologies for extracting relevant information from instead of “Query”, “8” instead of ”3”, to name a few. It financial documents such as invoices. However, there is plenty of is even more challenging to correct errors in proper nouns room for improvement in terms of automating and correcting post information extracted errors. This paper describes the like names, addresses and also in numerical values such as user-involved, self-correction concept based on the sequence to telephone number, insurance number and so on. -
Hunspell – the Free Spelling Checker
Hunspell – The free spelling checker About Hunspell Hunspell is a spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding. Hunspell interfaces: Ispell-like terminal interface using Curses library, Ispell pipe interface, OpenOffice.org UNO module. Main features of Hunspell spell checker and morphological analyzer: - Unicode support (affix rules work only with the first 65535 Unicode characters) - Morphological analysis (in custom item and arrangement style) and stemming - Max. 65535 affix classes and twofold affix stripping (for agglutinative languages, like Azeri, Basque, Estonian, Finnish, Hungarian, Turkish, etc.) - Support complex compoundings (for example, Hungarian and German) - Support language specific features (for example, special casing of Azeri and Turkish dotted i, or German sharp s) - Handle conditional affixes, circumfixes, fogemorphemes, forbidden words, pseudoroots and homonyms. - Free software (LGPL, GPL, MPL tri-license) Usage The src/tools dictionary contains ten executables after compiling (or some of them are in the src/win_api): affixcompress: dictionary generation from large (millions of words) vocabularies analyze: example of spell checking, stemming and morphological analysis chmorph: example of automatic morphological generation and conversion example: example of spell checking and suggestion hunspell: main program for spell checking and others (see manual) hunzip: decompressor of hzip format hzip: compressor of -
Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation
Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation Alexander Panchenkoz, Fide Martenz, Eugen Ruppertz, Stefano Faralliy, Dmitry Ustalov∗, Simone Paolo Ponzettoy, and Chris Biemannz zLanguage Technology Group, Department of Informatics, Universitat¨ Hamburg, Germany yWeb and Data Science Group, Department of Informatics, Universitat¨ Mannheim, Germany ∗Institute of Natural Sciences and Mathematics, Ural Federal University, Russia fpanchenko,marten,ruppert,[email protected] fsimone,[email protected] [email protected] Abstract manually in one of the underlying resources, such as Wikipedia. Unsupervised knowledge-free ap- Interpretability of a predictive model is proaches, e.g. (Di Marco and Navigli, 2013; Bar- a powerful feature that gains the trust of tunov et al., 2016), require no manual labor, but users in the correctness of the predictions. the resulting sense representations lack the above- In word sense disambiguation (WSD), mentioned features enabling interpretability. For knowledge-based systems tend to be much instance, systems based on sense embeddings are more interpretable than knowledge-free based on dense uninterpretable vectors. Therefore, counterparts as they rely on the wealth of the meaning of a sense can be interpreted only on manually-encoded elements representing the basis of a list of related senses. word senses, such as hypernyms, usage We present a system that brings interpretability examples, and images. We present a WSD of the knowledge-based sense representations into system that bridges the gap between these the world of unsupervised knowledge-free WSD two so far disconnected groups of meth- models. The contribution of this paper is the first ods. Namely, our system, providing access system for word sense induction and disambigua- to several state-of-the-art WSD models, tion, which is unsupervised, knowledge-free, and aims to be interpretable as a knowledge- interpretable at the same time. -
Ten Years of Babelnet: a Survey
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21) Survey Track Ten Years of BabelNet: A Survey Roberto Navigli1 , Michele Bevilacqua1 , Simone Conia1 , Dario Montagnini2 and Francesco Cecconi2 1Sapienza NLP Group, Sapienza University of Rome, Italy 2Babelscape, Italy froberto.navigli, michele.bevilacqua, [email protected] fmontagnini, [email protected] Abstract to integrate symbolic knowledge into neural architectures [d’Avila Garcez and Lamb, 2020]. The rationale is that the The intelligent manipulation of symbolic knowl- use of, and linkage to, symbolic knowledge can not only en- edge has been a long-sought goal of AI. How- able interpretable, explainable and accountable AI systems, ever, when it comes to Natural Language Process- but it can also increase the degree of generalization to rare ing (NLP), symbols have to be mapped to words patterns (e.g., infrequent meanings) and promote better use and phrases, which are not only ambiguous but also of information which is not explicit in the text. language-specific: multilinguality is indeed a de- Symbolic knowledge requires that the link between form sirable property for NLP systems, and one which and meaning be made explicit, connecting strings to repre- enables the generalization of tasks where multiple sentations of concepts, entities and thoughts. Historical re- languages need to be dealt with, without translat- sources such as WordNet [Miller, 1995] are important en- ing text. In this paper we survey BabelNet, a pop- deavors which systematize symbolic knowledge about the ular wide-coverage lexical-semantic knowledge re- words of a language, i.e., lexicographic knowledge, not only source obtained by merging heterogeneous sources in a machine-readable format, but also in structured form, into a unified semantic network that helps to scale thanks to the organization of concepts into a semantic net- tasks and applications to hundreds of languages. -
KOI at Semeval-2018 Task 5: Building Knowledge Graph of Incidents
KOI at SemEval-2018 Task 5: Building Knowledge Graph of Incidents 1 2 2 Paramita Mirza, Fariz Darari, ∗ Rahmad Mahendra ∗ 1 Max Planck Institute for Informatics, Germany 2 Faculty of Computer Science, Universitas Indonesia, Indonesia paramita @mpi-inf.mpg.de fariz,rahmad.mahendra{ } @cs.ui.ac.id { } Abstract Subtask S3 In order to answer questions of type (ii), participating systems are also required We present KOI (Knowledge of Incidents), a to identify participant roles in each identified an- system that given news articles as input, builds swer incident (e.g., victim, subject-suspect), and a knowledge graph (KOI-KG) of incidental use such information along with victim-related nu- events. KOI-KG can then be used to effi- merals (“three people were killed”) mentioned in ciently answer questions such as “How many the corresponding answer documents, i.e., docu- killing incidents happened in 2017 that involve ments that report on the answer incident, to deter- Sean?” The required steps in building the KG include: (i) document preprocessing involv- mine the total number of victims. ing word sense disambiguation, named-entity Datasets The organizers released two datasets: recognition, temporal expression recognition and normalization, and semantic role labeling; (i) test data, stemming from three domains of (ii) incidental event extraction and coreference gun violence, fire disasters and business, and (ii) resolution via document clustering; and (iii) trial data, covering only the gun violence domain. KG construction and population. Each dataset -
Web Search Result Clustering with Babelnet
Web Search Result Clustering with BabelNet Marek Kozlowski Maciej Kazula OPI-PIB OPI-PIB [email protected] [email protected] Abstract 2 Related Work In this paper we present well-known 2.1 Search Result Clustering search result clustering method enriched with BabelNet information. The goal is The goal of text clustering in information retrieval to verify how Babelnet/Babelfy can im- is to discover groups of semantically related docu- prove the clustering quality in the web ments. Contextual descriptions (snippets) of docu- search results domain. During the evalua- ments returned by a search engine are short, often tion we tested three algorithms (Bisecting incomplete, and highly biased toward the query, so K-Means, STC, Lingo). At the first stage, establishing a notion of proximity between docu- we performed experiments only with tex- ments is a challenging task that is called Search tual features coming from snippets. Next, Result Clustering (SRC). Search Results Cluster- we introduced new semantic features from ing (SRC) is a specific area of documents cluster- BabelNet (as disambiguated synsets, cate- ing. gories and glosses describing synsets, or Approaches to search result clustering can be semantic edges) in order to verify how classified as data-centric or description-centric they influence on the clustering quality (Carpineto, 2009). of the search result clustering. The al- The data-centric approach (as Bisecting K- gorithms were evaluated on AMBIENT Means) focuses more on the problem of data clus- dataset in terms of the clustering quality. tering, rather than presenting the results to the user. Other data-centric methods use hierarchical 1 Introduction agglomerative clustering (Maarek, 2000) that re- In the previous years, Web clustering engines places single terms with lexical affinities (2-grams (Carpineto, 2009) have been proposed as a solu- of words) as features, or exploit link information tion to the issue of lexical ambiguity in Informa- (Zhang, 2008). -
Sethesaurus: Wordnet in Software Engineering
This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TSE.2019.2940439 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 14, NO. 8, AUGUST 2015 1 SEthesaurus: WordNet in Software Engineering Xiang Chen, Member, IEEE, Chunyang Chen, Member, IEEE, Dun Zhang, and Zhenchang Xing, Member, IEEE, Abstract—Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language process (NLP) techniques can be utilized to harvest this knowledge base for software engineering tasks. However, consistent vocabulary for a concept is essential to make an effective use of these NLP techniques. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques to deal with such morphological forms are either designed for general English or mainly resort to domain-specific lexical rules. A thesaurus, which contains software-specific terms and commonly-used morphological forms, is desirable to perform normalization for software engineering text. However, constructing this thesaurus in a manual way is a challenge task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations.