Natural Language Processing and Text Mining with Graph-Structured Representations

Total Page:16

File Type:pdf, Size:1020Kb

Natural Language Processing and Text Mining with Graph-Structured Representations Natural Language Processing and Text Mining with Graph-Structured Representations by Bang Liu A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering Department of Electrical and Computer Engineering University of Alberta © Bang Liu, 2020 Abstract Natural Language Processing (NLP) and understanding aims to read from unformat- ted text to accomplish different tasks. As a first step, it is necessary to represent text as a simplified model. Traditionally, Vector Space Model (VSM) is most commonly used, in which text is represented as a bag of words. Recent years, word vectors learned by deep neural networks are also widely used. However, the underlying lin- guistic and semantic structures of text pieces cannot be expressed and exploited in these representations. Graph is a natural way to capture the connections between different text pieces, such as entities, sentences, and documents. To overcome the limits in vector space models, we combine deep learning models with graph-structured representations for various tasks in NLP and text mining. Such combinations help to make full use of both the structural information in text and the representation learning ability of deep neural networks. Specifically, we make contributions to the following NLP tasks: First, we introduce tree-based/graph-based sentence/document decomposition tech- niques to align sentence/document pairs, and combine them with Siamese neural network and graph convolutional networks (GCN) to perform fine-grained semantic relevance estimation. Based on them, we propose Story Forest system to automati- cally cluster streaming documents into fine-grained events, while connecting related events in growing trees to tell evolving stories. Story Forest has been deployed into Tencent QQ Browser for hot event discovery. Second, we propose ConcepT and GIANT systems to construct a user-centered, web-scale ontology, containing a large number of heterogeneous phrases conforming ii to user attentions at various granularities, mined from the vast volume of web docu- ments and search click logs. We introduce novel graphical representation and combine it with Relational-GCN to perform heterogeneous phrase mining and relation identi- fication. GIANT system has been deployed into Tencent QQ Browser for news feeds recommendation and searching, serving more than 110 million daily active users. It also offers document tagging service to WeChat. Third, we propose Answer-Clue-Style-aware Question Generation to automatically generate diverse and high-quality question-answer pairs from unlabeled text corpus at scale by mimicking the way a human asks questions. Our algorithms combine sentence structure parsing with GCN and Seq2Seq-based generative model to make the "one-to-many" question generation close to "one-to-one" mapping problem. A major part of our work has been deployed into real world applications in Tencent and serves billions of users. iii To my parents, Jinzhi Cheng and Shangping Liu. To my grandparents, Chunhua Hu and Jiafa Cheng. iv . Where there’s a will, there is a way. 事(º: v Acknowledgements This work is not mine alone. I learned a lot and got support from different people for the past six years. Firstly, I am immensely grateful to my advisor, Professor Di Niu, for teaching me how to become a professional researcher. I joined Di’s group since September, 2013. During the last 6 years, Di not only created a great research environment for me and all his other students, but also helped me by providing a lot of valuable experiences and suggestions in how to develop a professional career. More importantly, Di is not only a kind and supportive advisor, but also an older friend of mine. He always believes in me even though I am not always that confident about myself. I learned a lot from Di and I am very grateful to him. I learned a great deal from my talented collaborators and mentors: Professor Lin- glong Kong and Professor Zongpeng Li. Professor Linglong Kong is my co-supervisor. He is a very nice supervisor as well as friend. He is quite smart and can see the nature of research problems. I learned a lot from him in multiple research projects. Professor Zongpeng Li is one of the coauthors in my first paper. He is an amiable professor with great enthusiasm in research. Thanks for his support in my early research works. I would like to thanks Professor H. Vicky Zhao, who is my co-supervisor when I was pursuing my Master’s degree. I am also very grateful to Professor Jian Pei, Davood Rafiei and Cheng Li for being the members in my PhD supervisor committee. More- over, I would like to thank Professor Denilson Barbosa, James Miller and Scott Dick for being the members in the committee of my PhD candidacy exam. My friends have made my time over the last six years much more enjoyable. Thanks to my friends, including but not limited to Yan Liu, Yaochen Hu, Rui Zhu, Wuhua Zhang, Wanru Liu, Xu Zhang, Haolan Chen, Dashun Wang, Zhuangzhi Li, Lingju Meng, Qiuyang Xiong, Ting Zhao, Ting Zhang, Fred X. Han, Chenglin Li, Mingjun Zhao, Chi Feng, Lu Qian, Yuanyuan, Ruitong Huang, Jing Cao, Eren, Shuai Zhou, Zhaoyang Shao, Kai Zhou, Yushi Wang, etc. You are my family in Canada. Thank you for everything we have experience together. I am very grateful to Tencent for their support. I met a lot of friends and brilliant colleagues there. Thanks to my friends Jinghong Lin, Xiao Bao, Yuhao Zhang, Litao vi Hong, Weifeng Yang, Shishi Duan, Guangyi Chen, Chun Wu, Chaoyue Wang, Jinwen Luo, Nan Wang, Dong Liu, Chenglin Wu, Mengyi Li, Lin Ma, Xiaohui Han, Haojie Wei, Binfeng Luo, Di Chen, Zutong Li, Jiaosheng Zhao, Shengli Yan, Shunnan Xu, Ruicong Xu and so on. Life is a lot more fun with all of you. Thanks to Weidong Guo, Kunfeng Lai, Yu Xu, Yancheng He, and Bowei Long, for the full support to my research works in Tencent. Thanks to my friend Qun Li and my sister Xiaoqin Zhao for all the accompanying time. Thanks to my parents, Jinzhi Cheng and Shangping Liu, and my little sister Jia Liu. Your love is what makes me strong. Thanks to all my family members, I love all of you. Lastly, thanks to my grandparents, Chunhua Hu and Jiafa Cheng, who raised me. I will always miss you, grandma. vii Contents 1 Introduction 1 1.1 Motivation . .1 1.2 User and Text Understanding: a Graph Approach . .3 1.3 Contributions . .5 1.4 Thesis Outline . .8 2 Related Work 10 2.1 Information Organization . 10 2.1.1 Text Clustering . 10 2.1.2 Story Structure Generation . 11 2.1.3 Text Matching. 13 2.1.4 Graphical Document Representation . 14 2.2 Information Recommendation . 15 2.2.1 Concept Mining . 15 2.2.2 Event Extraction . 16 2.2.3 Relation Extraction . 16 2.2.4 Taxonomy and Knowledge Base Construction . 17 2.2.5 Text Conceptualization . 17 2.3 Reading Comprehension . 18 I Text Clustering and Matching: Growing Story Trees to Solve Information Explosion 20 3 Story Forest: Extracting Events and Telling Stories from Breaking News 22 3.1 Introduction . 23 3.2 Problem Definition and Notations . 27 3.2.1 Problem Definition . 27 viii 3.2.2 Notations . 28 3.2.3 Case Study . 29 3.3 The Story Forest System . 30 3.3.1 Preprocessing . 31 3.3.2 Event Extraction by EventX . 33 3.3.3 Growing Story Trees Online . 38 3.4 Performance Evaluation . 40 3.4.1 News Datasets . 41 3.4.2 Evaluation of EventX . 42 3.4.3 Evaluation of Story Forest . 48 3.4.4 Algorithm Complexity and System Overhead . 53 3.5 Concluding Remarks and Future Works . 55 4 Matching Article Pairs with Graphical Decomposition and Convo- lutions 57 4.1 Introduction . 57 4.2 Concept Interaction Graph . 59 4.3 Article Pair Matching through Graph Convolutions . 62 4.4 Evaluation . 64 4.4.1 Results and Analysis . 69 4.5 Conclusion . 71 5 Matching Natural Language Sentences with Hierarchical Sentence Factorization 72 5.1 Introduction . 73 5.2 Hierarchical Sentence Factorization and Reordering . 75 5.2.1 Hierarchical Sentence Factorization . 77 5.3 Ordered Word Mover’s Distance . 80 5.4 Multi-scale Sentence Matching . 84 5.5 Evaluation . 86 5.5.1 Experimental Setup . 86 5.5.2 Unsupervised Matching with OWMD . 88 5.5.3 Supervised Multi-scale Semantic Matching . 90 5.6 Conclusion . 91 II Text Mining: Recognizing User Attentions for Searching ix and Recommendation 93 6 A User-Centered Concept Mining System for Query and Document Understanding at Tencent 95 6.1 Introduction . 96 6.2 User-Centered Concept Mining . 100 6.3 Document Tagging and Taxonomy Construction . 103 6.3.1 Concept Tagging for Documents . 103 6.3.2 Taxonomy Construction . 106 6.4 Evaluation . 107 6.4.1 Evaluation of Concept Mining . 107 6.4.2 Evaluation of Document Tagging and Taxonomy Construction 110 6.4.3 Online A/B Testing for Recommendation . 111 6.4.4 Offline User Study of Query Rewriting for Searching . 113 6.5 Information for Reproducibility . 113 6.5.1 System Implementation and Deployment . 113 6.5.2 Parameter Settings and Training Process . 114 6.5.3 Publish Our Datasets . 116 6.5.4 Details about Document Topic Classification . 117 6.5.5 Examples of Queries and Extracted Concepts . 117 6.6 Conclusion . 118 7 Scalable Creation of a Web-scale Ontology 119 7.1 Introduction . 119 7.2 The Attention Ontology . 123 7.3 Ontology Construction . 125 7.3.1 Mining User Attentions . 127 7.3.2 Linking User Attentions . 134 7.4 Applications . 136 7.5 Evaluation .
Recommended publications
  • Construction of English-Bodo Parallel Text
    International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018 CONSTRUCTION OF ENGLISH -BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRANSLATION Saiful Islam 1, Abhijit Paul 2, Bipul Syam Purkayastha 3 and Ismail Hussain 4 1,2,3 Department of Computer Science, Assam University, Silchar, India 4Department of Bodo, Gauhati University, Guwahati, India ABSTRACT Corpus is a large collection of homogeneous and authentic written texts (or speech) of a particular natural language which exists in machine readable form. The scope of the corpus is endless in Computational Linguistics and Natural Language Processing (NLP). Parallel corpus is a very useful resource for most of the applications of NLP, especially for Statistical Machine Translation (SMT). The SMT is the most popular approach of Machine Translation (MT) nowadays and it can produce high quality translation result based on huge amount of aligned parallel text corpora in both the source and target languages. Although Bodo is a recognized natural language of India and co-official languages of Assam, still the machine readable information of Bodo language is very low. Therefore, to expand the computerized information of the language, English to Bodo SMT system has been developed. But this paper mainly focuses on building English-Bodo parallel text corpora to implement the English to Bodo SMT system using Phrase-Based SMT approach. We have designed an E-BPTC (English-Bodo Parallel Text Corpus) creator tool and have been constructed General and Newspaper domains English-Bodo parallel text corpora. Finally, the quality of the constructed parallel text corpora has been tested using two evaluation techniques in the SMT system.
    [Show full text]
  • Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna
    International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-8, Issue-7S, May 2019 Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna At this period corpora were used in semantics and related Abstract: Corpus linguistic will be the one of latest and fast field. At this period corpus linguistics was not commonly method for teaching and learning of different languages. used, no computers were used. Corpus linguistic history Proposed article can provide the corpus linguistic introduction were opposed by Noam Chomsky. Now a day’s modern and give the general overview about the linguistic team of corpus. corpus linguistic was formed from English work context and Article described about the corpus, novelties and corpus are associated in the following field. Proposed paper can focus on the now it is used in many other languages. Alongside was using the corpus linguistic in IT field. As from research it is corpus linguistic history and this is considered as obvious that corpora will be used as internet and primary data in methodology [Abdumanapovna, 2018]. A landmark is the field of IT. The method of text corpus is the digestive modern corpus linguistic which was publish by Henry approach which can drive the different set of rules to govern the Kucera. natural language from the text in the particular language, it can Corpus linguistic introduce new methods and techniques explore about languages that how it can inter-related with other. for more complete description of modern languages and Hence the corpus linguistic will be automatically drive from the these also provide opportunities to obtain new result with text sources.
    [Show full text]
  • Forescout Counteract® Endpoint Support Compatibility Matrix Updated: October 2018
    ForeScout CounterACT® Endpoint Support Compatibility Matrix Updated: October 2018 ForeScout CounterACT Endpoint Support Compatibility Matrix 2 Table of Contents About Endpoint Support Compatibility ......................................................... 3 Operating Systems ....................................................................................... 3 Microsoft Windows (32 & 64 BIT Versions) ...................................................... 3 MAC OS X / MACOS ...................................................................................... 5 Linux .......................................................................................................... 6 Web Browsers .............................................................................................. 8 Microsoft Windows Applications ...................................................................... 9 Antivirus ................................................................................................. 9 Peer-to-Peer .......................................................................................... 25 Instant Messaging .................................................................................. 31 Anti-Spyware ......................................................................................... 34 Personal Firewall .................................................................................... 36 Hard Drive Encryption ............................................................................. 38 Cloud Sync ...........................................................................................
    [Show full text]
  • Enriching an Explanatory Dictionary with Framenet and Propbank Corpus Examples
    Proceedings of eLex 2019 Enriching an Explanatory Dictionary with FrameNet and PropBank Corpus Examples Pēteris Paikens 1, Normunds Grūzītis 2, Laura Rituma 2, Gunta Nešpore 2, Viktors Lipskis 2, Lauma Pretkalniņa2, Andrejs Spektors 2 1 Latvian Information Agency LETA, Marijas street 2, Riga, Latvia 2 Institute of Mathematics and Computer Science, University of Latvia, Raina blvd. 29, Riga, Latvia E-mail: [email protected], [email protected] Abstract This paper describes ongoing work to extend an online dictionary of Latvian – Tezaurs.lv – with representative semantically annotated corpus examples according to the FrameNet and PropBank methodologies and word sense inventories. Tezaurs.lv is one of the largest open lexical resources for Latvian, combining information from more than 300 legacy dictionaries and other sources. The corpus examples are extracted from Latvian FrameNet and PropBank corpora, which are manually annotated parallel subsets of a balanced text corpus of contemporary Latvian. The proposed approach augments traditional lexicographic information with modern cross-lingually interpretable information and enables analysis of word senses from the perspective of frame semantics, which is substantially different from (complementary to) the traditional approach applied in Latvian lexicography. In cases where FrameNet and PropBank corpus evidence aligns well with the word sense split in legacy dictionaries, the frame-semantically annotated corpus examples supplement the word sense information with clarifying usage examples and commonly used semantic valence patterns. However, the annotated corpus examples often provide evidence of a different sense split, which is often more coarse-grained and, thus, may help dictionary users to cluster and comprehend a fine-grained sense split suggested by the legacy sources.
    [Show full text]
  • 20 Toward a Sustainable Handling of Interlinear-Glossed Text in Language Documentation
    Toward a Sustainable Handling of Interlinear-Glossed Text in Language Documentation JOHANN-MATTIS LIST, Max Planck Institute for the Science of Human History, Germany 20 NATHANIEL A. SIMS, The University of California, Santa Barbara, America ROBERT FORKEL, Max Planck Institute for the Science of Human History, Germany While the amount of digitally available data on the worlds’ languages is steadily increasing, with more and more languages being documented, only a small proportion of the language resources produced are sustain- able. Data reuse is often difficult due to idiosyncratic formats and a negligence of standards that couldhelp to increase the comparability of linguistic data. The sustainability problem is nicely reflected in the current practice of handling interlinear-glossed text, one of the crucial resources produced in language documen- tation. Although large collections of glossed texts have been produced so far, the current practice of data handling makes data reuse difficult. In order to address this problem, we propose a first framework forthe computer-assisted, sustainable handling of interlinear-glossed text resources. Building on recent standardiza- tion proposals for word lists and structural datasets, combined with state-of-the-art methods for automated sequence comparison in historical linguistics, we show how our workflow can be used to lift a collection of interlinear-glossed Qiang texts (an endangered language spoken in Sichuan, China), and how the lifted data can assist linguists in their research. CCS Concepts: • Applied computing → Language translation; Additional Key Words and Phrases: Sino-tibetan, interlinear-glossed text, computer-assisted language com- parison, standardization, qiang ACM Reference format: Johann-Mattis List, Nathaniel A.
    [Show full text]
  • Linguistic Annotation of the Digital Papyrological Corpus: Sematia
    Marja Vierros Linguistic Annotation of the Digital Papyrological Corpus: Sematia 1 Introduction: Why to annotate papyri linguistically? Linguists who study historical languages usually find the methods of corpus linguis- tics exceptionally helpful. When the intuitions of native speakers are lacking, as is the case for historical languages, the corpora provide researchers with materials that replaces the intuitions on which the researchers of modern languages can rely. Using large corpora and computers to count and retrieve information also provides empiri- cal back-up from actual language usage. In the case of ancient Greek, the corpus of literary texts (e.g. Thesaurus Linguae Graecae or the Greek and Roman Collection in the Perseus Digital Library) gives information on the Greek language as it was used in lyric poetry, epic, drama, and prose writing; all these literary genres had some artistic aims and therefore do not always describe language as it was used in normal commu- nication. Ancient written texts rarely reflect the everyday language use, let alone speech. However, the corpus of documentary papyri gets close. The writers of the pa- pyri vary between professionally trained scribes and some individuals who had only rudimentary writing skills. The text types also vary from official decrees and orders to small notes and receipts. What they have in common, though, is that they have been written for a specific, current need instead of trying to impress a specific audience. Documentary papyri represent everyday texts, utilitarian prose,1 and in that respect, they provide us a very valuable source of language actually used by common people in everyday circumstances.
    [Show full text]
  • Formal Concept Analysis in Knowledge Processing: a Survey on Applications ⇑ Jonas Poelmans A,C, , Dmitry I
    Expert Systems with Applications 40 (2013) 6538–6560 Contents lists available at SciVerse ScienceDirect Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa Review Formal concept analysis in knowledge processing: A survey on applications ⇑ Jonas Poelmans a,c, , Dmitry I. Ignatov c, Sergei O. Kuznetsov c, Guido Dedene a,b a KU Leuven, Faculty of Business and Economics, Naamsestraat 69, 3000 Leuven, Belgium b Universiteit van Amsterdam Business School, Roetersstraat 11, 1018 WB Amsterdam, The Netherlands c National Research University Higher School of Economics, Pokrovsky boulevard, 11, 109028 Moscow, Russia article info abstract Keywords: This is the second part of a large survey paper in which we analyze recent literature on Formal Concept Formal concept analysis (FCA) Analysis (FCA) and some closely related disciplines using FCA. We collected 1072 papers published Knowledge discovery in databases between 2003 and 2011 mentioning terms related to Formal Concept Analysis in the title, abstract and Text mining keywords. We developed a knowledge browsing environment to support our literature analysis process. Applications We use the visualization capabilities of FCA to explore the literature, to discover and conceptually repre- Systematic literature overview sent the main research topics in the FCA community. In this second part, we zoom in on and give an extensive overview of the papers published between 2003 and 2011 which applied FCA-based methods for knowledge discovery and ontology engineering in various application domains. These domains include software mining, web analytics, medicine, biology and chemistry data. Ó 2013 Elsevier Ltd. All rights reserved. 1. Introduction mining, i.e. gaining insight into source code with FCA.
    [Show full text]
  • Tracking Users Across the Web Via TLS Session Resumption
    Tracking Users across the Web via TLS Session Resumption Erik Sy Christian Burkert University of Hamburg University of Hamburg Hannes Federrath Mathias Fischer University of Hamburg University of Hamburg ABSTRACT modes, and browser extensions to restrict tracking practices such as User tracking on the Internet can come in various forms, e.g., via HTTP cookies. Browser fingerprinting got more difficult, as trackers cookies or by fingerprinting web browsers. A technique that got can hardly distinguish the fingerprints of mobile browsers. They are less attention so far is user tracking based on TLS and specifically often not as unique as their counterparts on desktop systems [4, 12]. based on the TLS session resumption mechanism. To the best of Tracking based on IP addresses is restricted because of NAT that our knowledge, we are the first that investigate the applicability of causes users to share public IP addresses and it cannot track devices TLS session resumption for user tracking. For that, we evaluated across different networks. As a result, trackers have an increased the configuration of 48 popular browsers and one million of the interest in additional methods for regaining the visibility on the most popular websites. Moreover, we present a so-called prolon- browsing habits of users. The result is a race of arms between gation attack, which allows extending the tracking period beyond trackers as well as privacy-aware users and browser vendors. the lifetime of the session resumption mechanism. To show that One novel tracking technique could be based on TLS session re- under the observed browser configurations tracking via TLS session sumption, which allows abbreviating TLS handshakes by leveraging resumptions is feasible, we also looked into DNS data to understand key material exchanged in an earlier TLS session.
    [Show full text]
  • January 10, 2021 Baptism of the Lord Sunday Preparing for Worship If I Had to Guess, I Would Estimate That I Have Something of Me
    January 10, 2021 Baptism of the Lord Sunday Preparing for worship If I had to guess, I would estimate that I have something of me. Each baptism required that I been baptized no less than 50 times in my life. get in the water. I had to remember to hold my Growing up as a pastor’s kid, playing baptism breath — that one was sometimes tricky. I had was one of the favorite pastimes in our house- to be ready for something, even if I wasn’t sure hold. My brother and I often baptized our what that something was. stuffed animals, or if they were very unlucky, Today, we recognize in worship the Baptism our live animals. Baby dolls and Barbies didn’t of the Lord Sunday. We recognize the step that escape my young baptismal zealotry. Most Jesus took to get into the water, to come nearer swimming pools flip-flopped between staging to us. Some of you might remember your own grounds for Marco Polo and baptism services of baptism. Some of you might be wondering what cousins and neighborhood kids. baptism might be like for you one day. Wherever Then, of course, there was the baptism where you are in your journey, baptism requires I stood in the baptistery of my beloved church getting into the water and being ready for as my father baptized me. There has been much something. Today, as you prepare for worship, light-hearted debate as to which one of these you might take a few deep cleansing breaths. baptisms “took,” which one really counted.
    [Show full text]
  • Concept Maps Mining for Text Summarization
    UNIVERSIDADE FEDERAL DO ESPÍRITO SANTO CENTRO TECNOLÓGICO DEPARTAMENTO DE INFORMÁTICA PROGRAMA DE PÓS-GRADUAÇÃO EM INFORMÁTICA Camila Zacché de Aguiar Concept Maps Mining for Text Summarization VITÓRIA-ES, BRAZIL March 2017 Camila Zacché de Aguiar Concept Maps Mining for Text Summarization Dissertação de Mestrado apresentada ao Programa de Pós-Graduação em Informática da Universidade Federal do Espírito Santo, como requisito parcial para obtenção do Grau de Mestre em Informática. Orientador (a): Davidson CurY Co-orientador: Amal Zouaq VITÓRIA-ES, BRAZIL March 2017 Dados Internacionais de Catalogação-na-publicação (CIP) (Biblioteca Setorial Tecnológica, Universidade Federal do Espírito Santo, ES, Brasil) Aguiar, Camila Zacché de, 1987- A282c Concept maps mining for text summarization / Camila Zacché de Aguiar. – 2017. 149 f. : il. Orientador: Davidson Cury. Coorientador: Amal Zouaq. Dissertação (Mestrado em Informática) – Universidade Federal do Espírito Santo, Centro Tecnológico. 1. Informática na educação. 2. Processamento de linguagem natural (Computação). 3. Recuperação da informação. 4. Mapas conceituais. 5. Sumarização de Textos. 6. Mineração de dados (Computação). I. Cury, Davidson. II. Zouaq, Amal. III. Universidade Federal do Espírito Santo. Centro Tecnológico. IV. Título. CDU: 004 “Most of the fundamental ideas of science are essentially simple, and may, as a rule, be expressed in a language comprehensible to everyone.” Albert Einstein 4 Acknowledgments I would like to thank the manY people who have been with me over the years and who have contributed in one way or another to the completion of this master’s degree. Especially my advisor, Davidson Cury, who in a constructivist way disoriented me several times as a stimulus to the search for new answers.
    [Show full text]
  • Empowering Video Storytellers: Concept Discovery and Annotation for Large Audio-Video-Text Archives
    Empowering Video Storytellers: Concept Discovery and Annotation for Large Audio-Video-Text Archives A large amount of video content is available and stored by broadcasters, libraries and other enterprises. However, the lack of effective and efficient tools for searching and retrieving has prevented video becoming a major value asset. Much work has been done for retrieving video information from constrained structured domains, such as news and sports, but general solutions for unstructured audio-video-language intensive content are missing. Our goal is to transform the multimedia archive into an organized and searchable asset so that human efforts can focus on a more important aspect - storytelling. This project involves a cross- disciplinary effort towards development of new techniques for organizing, summarizing, and question answering over audio-video-language intensive content, such as raw documentary footage. Such domains offer the full range of real-world situations and events, manifested with interaction of rich multimedia information. The team includes investigators from Schools of Engineering and Applied Science, Journalism, and Arts. It has extensive experience and deep expertise in analysis of multimedia information - video, audio, and language - as well as broad experience with professional documentary film production and interactive media designs. The project includes two major content partners, WITNESS and WNET, both providing unique video archives, well-defined application domains and user communities. In particular, WITNESS collects and manages a documentary video archive from activists all over the world to support the promotion of human rights. This archive has several important qualities: it is diverse (including interviews, documentation of events, and ‘hidden camera’ reporting), it is relatively ‘raw’ (shot by semi-professional operators), it has been shot for specific purposes (e.g.
    [Show full text]
  • Technical Phrase Extraction for Patent Mining: a Multi-Level Approach
    2020 IEEE International Conference on Data Mining (ICDM) Technical Phrase Extraction for Patent Mining: A Multi-level Approach Ye Liu, Han Wu, Zhenya Huang, Hao Wang, Jianhui Ma, Qi Liu, Enhong Chen*, Hanqing Tao, Ke Rui Anhui Province Key Laboratory of Big Data Analysis and Application, School of Data Science & School of Computer Science and Technology, University of Science and Technology of China {liuyer, wuhanhan, wanghao3, hqtao, kerui}@mail.ustc.edu.cn, {huangzhy, jianhui, qiliuql, cheneh}@ustc.edu.cn Abstract—Recent years have witnessed a booming increase of TABLE I: Technical Phrases vs. Non-technical Phrases patent applications, which provides an open chance for revealing Domain Technical phrase Non-technical phrase the inner law of innovation, but in the meantime, puts forward Electricity wireless communication, wire and cable, TV sig- higher requirements on patent mining techniques. Considering netcentric computer service nal, power plug Mechanical fluid leak detection, power building materials, steer that patent mining highly relies on patent document analysis, Engineering transmission column, seat back this paper makes a focused study on constructing a technology portrait for each patent, i.e., to recognize technical phrases relevant works have been explored on phrase extraction. Ac- concerned in it, which can summarize and represent patents cording to the extraction target, they can be divided into Key from a technology angle. To this end, we first give a clear Phrase Extraction [8], Named Entity Recognition (NER) [9] and detailed description about technical phrases in patents and Concept Extraction [10]. Key Phrase Extraction aims to based on various prior works and analyses. Then, combining characteristics of technical phrases and multi-level structures extract phrases that provide a concise summary of a document, of patent documents, we develop an Unsupervised Multi-level which prefers those both frequently-occurring and closed to Technical Phrase Extraction (UMTPE) model.
    [Show full text]