A HYBRID METHOD FOR TOPONYM RECOGNITION ON INFORMAL TURKISH TEXT

A THESIS SUBMITTED TO THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES OF MIDDLE EAST TECHNICAL UNIVERSITY

BY

MERYEM SAĞCAN

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER ENGINEERING

AUGUST 2014

Approval of the thesis:

A HYBRID METHOD FOR TOPONYM RECOGNITION ON INFORMAL TURKISH TEXT

submitted by MERYEM SAĞCAN in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering Department, Middle East Technical University by,

Prof. Dr. Canan Özgen Dean, Graduate School of Natural and Applied Sciences

Prof. Dr. Adnan Yazıcı Head of Department, Computer Engineering

Assoc. Prof. Dr. Pınar Karagöz Supervisor, Computer Engineering Department, METU

Examining Committee Members:

Prof. Dr. Ahmet Coşar Computer Engineering Department, METU

Assoc. Prof. Dr. Pınar Karagöz Computer Engineering Department, METU

Assoc. Prof. Dr. Osman Abul Computer Engineering Department, TOBB ETÜ

Assoc. Prof. Dr. Tolga Can Computer Engineering Department, METU

Dr. Ruket Çakıcı Computer Engineering Department, METU

Date:

I hereby declare that all information in this document has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all material and results that are not original to this work.

Name, Last Name: MERYEM SAĞCAN

Signature :

ABSTRACT

A HYBRID METHOD FOR TOPONYM RECOGNITION ON INFORMAL TURKISH TEXT

Sağcan, Meryem
M.S., Department of Computer Engineering
Supervisor: Assoc. Prof. Dr. Pınar Karagöz

August 2014, 86 pages

Since accessing the Internet is getting easier and people are more willing to share information on the Internet than previous generations, the data on such reachable sources are growing very rapidly day by day. Moreover, because of the popularity and widespread use of those sources, the information that researchers and organizations are interested in can be found somewhere in these data collections. The purpose of Information Extraction (IE) is to analyze this information cloud and to extract the desired data from it. This study designs a system dealing with a subfield of Information Extraction, namely Named Entity Recognition (NER), which many IE systems use as a basis.

NER is used to identify the entities related to the desired information in texts and classify them into a set of predefined categories such as person, location, and organization names, date and money expressions, etc. Since most of the desired information, such as trends, agendas, needs and thoughts of people, may vary among locations, and a location name can be used for more than one location, extracting location information is a research area of its own. There is a field for this purpose, named Toponym Extraction, which uses NER as a basic step in order to recognize location names.

Toponym Extraction consists of two steps, namely Toponym Recognition and Toponym Resolution. The first step, Toponym Recognition, is the subject of the proposed study. It aims to extract named entities referring to location names, whereas Toponym Resolution aims to decide which geographical coordinates the entity refers to, since a location name can be used for more than one set of geographical coordinates.

The prominence of social media such as Twitter and Facebook has drawn attention from companies and researchers interested in detecting trends; however, the informal and popular nature of these services leads to a large amount of noisy misspellings, lack of punctuation, non-standard abbreviations and abnormal capitalization, which make the recognition process really hard. This case creates a new challenge in the NER field; thus, it also creates a new challenge in Toponym Recognition.

The proposed system in this thesis constructs a hybrid NER system which uses both rule based and machine learning based techniques to extract toponyms from an informally written, unstructured text document which includes Turkish tweets. In this study, Conditional Random Fields (CRF) is used as a machine learning tool, and some features such as POS-Tags and a Conjunction Window are defined to train the constructed CRF model. In the rule based part, regular expressions which define rules to extract words containing "köy", "deniz", "şehir", "istan", etc. are used. The result of the rule based part is used as a feature in the machine learning part. All defined features are experimented with interchangeably and incrementally. In addition, various learning mechanisms within CRF are compared in terms of their accuracy. Finally, the proposed study shows the effect of the size of the training and test data sets on the system accuracy. These parameters are all experimented with, and the combination giving the best result is used in the comparison part, in which the system is compared with some previous studies.

Keywords: Named Entity Recognition, Twitter, Conditional Random Fields, Information Extraction, Toponym Recognition, Turkish Named Entity Extraction

ÖZ

GÜNDELİK TÜRKÇE METİNLERDE HİBRİT YÖNTEMLE YER İSİMLERİNİ TANIMA

Sağcan, Meryem
Yüksek Lisans, Bilgisayar Mühendisliği Bölümü
Tez Yöneticisi: Doç. Dr. Pınar Karagöz

Ağustos 2014, 86 sayfa

İnternet erişiminin ve internette bilgi paylaşmaktan çekinmeyen kişi sayısının artmasıyla beraber bu tarz kaynaklardaki bilgi birikimi hızla çoğalmaktadır. İnsanlar kendileri, istekleri ve şikayetleri hakkında bilgi paylaştıkça büyüyen bu bilgi karmaşası içerisinden ihtiyaç duyulan verinin çıkarılabilmesi Bilgi Çıkarımı (BÇ) bilim dalının ilgi alanıdır. Bu tezde sunulan çalışmada BÇ bilim dalının alt dalı olan Varlık İsimlerini Tanıma (VİT) yöntemi kullanılmıştır.

VİT yöntemi, dokümanlardaki ulaşılmaya çalışılan bilgilere ait varlık isimlerini bulur ve bu bulguları önceden tanımlanmış kategorilere göre sınıflandırır. Bu kategoriler insan isimleri, yer isimleri, tarih ve para miktarı gibi ifadelerdir. Moda, gündem, insanların düşünce ve ihtiyaçları değişik yerler arasında büyük farklılıklar gösterdiği için ve bir yer ismi birden fazla koordinatı temsil edebileceği için, yer bilgisi çıkarma işlemi ayrı bir alan altında incelenmektedir. Yer İsimleri Çıkarımı alanı bu amaç doğrultusunda çıkmıştır. İlk aşamada yer isimlerini tanıyabilmek için VİT yöntemlerini kullanmaktadır.

Yer İsimleri Çıkarımı işlemi iki kısımdan oluşmaktadır. İlk aşama Yer İsimleri Tanıma, ikinci aşama ise Yer İsimleri Çözümleme işlemleridir. Yer İsimleri Tanıma işlemi dokümanlardan yer ismi belirten varlık isimlerini çıkarmayı amaçlar. İkinci aşama ise, ilk kısımda bulunan yer isimlerinin gerçekte hangi coğrafik bölgeyi kastettiğini bulmaya çalışır. Çünkü yeryüzünde aynı ismi taşıyan birden fazla coğrafik koordinat bulunabilir. Yer İsimleri Çıkarımı işleminin ilk aşaması olan Yer İsimleri Tanıma işlemi VİT alanının bir alt alanıdır.

İnsanların sosyal medyada oldukça aktif olmaları, Facebook ve Twitter gibi sosyal medyanın önde gelen temsilcilerinin, toplumun eğilimini bulmaya çalışan firma ve araştırmacıların dikkatlerini çekmelerine sebep olmuştur. Fakat bu tarz sosyal medyadan edinilen veri yapısal bozukluk, noktalama işaretleri eksikliği, yanlış yazılmış kısaltmalar ve kurala uygun olmayan büyük harf kullanımları gibi fazlaca bozukluk içermektedir. Bu bozukluklar kelimelerin ve cümlelerin, dolayısıyla da yer isimlerinin anlaşılmasını zorlaştırmaktadır.

Bu tezde sunulan sistem, kural tabanlı ve makine öğrenimi tabanlı iki yöntemi birleştirerek Türkçe tweet içeren dokümanlardan yer ismi belirten varlık isimlerini tanımaya çalışan hibrit bir sistemdir. Bu sistemde makine öğrenimi için Şartlı Rastgele Alanlar modeli kullanılmıştır. Bu modeli eğitebilmek için yer isimlerini tanımlayan özellikler (Sözcük Türü, Düzenli İfadelerden çıkarılan özellikler, Konjonksiyon Penceresi, vs.) kullanılmaktadır. Kural tabanlı kısım içinse, içerisinde geçtiği ya da arkasından geldiği kelimelerin yer ismi olma ihtimallerini artıran bazı karakter dizileri ("köy", "deniz", "şehir", "istan", vb.) için düzenli ifadeler tanımlanmıştır. Kural tabanlı aşamanın çıktıları makine öğrenimine dayalı model için girdi olarak kullanılmıştır.

Çalışma kapsamında bu özelliklerden bazı Özellik Takımları oluşturulup bunların her biri için bir test koşturulmuştur. İkinci çeşit deneyde ise Şartlı Rastgele Alanları eğitmek için kullanılan farklı eğiticiler test edilmiştir. Son olarak eğitim ve test dosyalarının boyutları ve her birinin içindeki yer isimleri sayısı değiştirilerek, bu dosyaların sistem sonucu üzerindeki etkisi incelenmiştir. En iyi sonucu veren kombinasyon, daha önceden gündelik Türkçe veri üzerine çalışan bazı sistemlerle karşılaştırmada kullanılacaktır.

Anahtar Kelimeler: Varlık İsmi Tanıma, Twitter, Şartlı Rastgele Alanlar, Bilgi Çıkarımı, Yer İsimleri Tanıma, Türkçe Varlık İsmi Tanıma

To my mom, indebtedly, and to my husband, with love...

ACKNOWLEDGMENTS

I would like to express my deepest thanks to my supervisor Assoc. Prof. Dr. Pınar Karagöz for her encouragement and support throughout this study.

I would like to thank Dilek Önal for helping me at every stage of the thesis, especially in gathering the datasets.

Another special thanks goes to Gülşen Eryiğit for giving me the chance to use their pipeline, and to Dilek Küçük for her support in the experiments part of the thesis.

I would also like to thank my mother-in-law, Reşide Kılınç, for her support and her insightful attitude toward me.

I am deeply grateful to my mother, Cennet Susam, for her love and support. Without her, I could never have been at this point.

Finally, very special thanks to the love of my life, my dear husband, Ali Kılınç, for his support, for believing in me even when I lost belief in myself, and for giving me unlimited happiness, pleasure and love.

TABLE OF CONTENTS

ABSTRACT ...... v

ÖZ...... vii

ACKNOWLEDGMENTS ...... x

TABLE OF CONTENTS ...... xi

LIST OF TABLES ...... xiv

LIST OF FIGURES ...... xvi

LIST OF ABBREVIATIONS ...... xvii

CHAPTERS

1 INTRODUCTION ...... 1

1.1 Information Extraction and Named Entity Recognition . . . .1

1.2 Motivation and Contributions ...... 3

1.3 Thesis Organization ...... 5

2 BACKGROUND ...... 7

2.1 Toponym Recognition ...... 7

2.2 Morphological Analysis ...... 8

2.2.1 TRmorph ...... 9

2.3 NER Methods ...... 10

2.3.1 CRFs as a Machine Learning based NER Method . 11

2.4 Mallet ...... 14

2.4.1 Training CRF with Stochastic Gradient Trainer . . 15

2.4.2 Training CRF with Label Likelihood Trainer . . . 15

2.4.3 Training CRF with Value Gradient Trainer . . . . . 16

3 LITERATURE REVIEW ...... 17

3.1 Named Entity Recognition Systems ...... 17

3.2 NER on Informal Texts ...... 20

3.3 Named Entity Recognition On Turkish Texts ...... 23

3.3.1 NER On Formal Turkish Texts ...... 23

3.3.2 NER On Informal Turkish Texts ...... 25

3.4 Toponym Recognition ...... 27

3.5 Discussion ...... 29

4 METHODOLOGY ...... 31

4.1 Tokenization ...... 32

4.2 Morphological Analysis ...... 34

4.3 Normalization ...... 37

4.3.1 Multi Character Check ...... 37

4.3.2 En – Tr Character Check ...... 38

4.4 Extraction of Location Named Entities with CRFs ...... 41

5 EXPERIMENTS ...... 53

5.1 Introducing the Datasets ...... 53

5.2 Introducing the Experiments ...... 53

5.2.1 Feature Selection Experiments ...... 54

5.2.2 Experimental Evaluation on CRF Learners . . . . . 56

5.2.3 Experimental Evaluation on the size of Training and Test Data Sets ...... 56

5.2.4 Experimental Analysis for Accuracy ...... 57

5.2.5 Experimental Comparison with Related Work . . . 57

5.3 Experiments and Results ...... 57

5.3.1 Feature Selection Experiments ...... 58

5.3.2 Experimental Evaluation on CRF Learners . . . . . 66

5.3.3 Experimental Evaluation on the size of Training and Test Data Sets ...... 67

5.3.4 Experimental Analysis for Accuracy ...... 68

5.3.5 Experimental Comparison with Related Work . . . 69

6 CONCLUSION AND FUTURE WORK ...... 77

REFERENCES ...... 79

APPENDICES

A FINDINGS IN THE TEST DATA SET ...... 83

LIST OF TABLES

TABLES

Table 4.1 Possible Words After Multi-Character Check ...... 38

Table 4.2 Interchangeable Characters List for English-Turkish Character Check 39

Table 5.1 Definitions of Feature Sets ...... 55

Table 5.2 FeatureSet_1 Results ...... 58

Table 5.3 Metrics Obtained For FeatureSet_1 With No Normalization . . . . . 58

Table 5.4 FeatureSet_2 Results ...... 59

Table 5.5 Metrics Obtained For FeatureSet_2 With No Normalization . . . . . 60

Table 5.6 FeatureSet_3 Results ...... 61

Table 5.7 FeatureSet_4 Results ...... 61

Table 5.8 FeatureSet_5 Results ...... 62

Table 5.9 FeatureSet_6 Results ...... 62

Table 5.10 FeatureSet_7 Results ...... 62

Table 5.11 FeatureSet_8 Results ...... 63

Table 5.12 FeatureSet_9 Results ...... 64

Table 5.13 F-Measure Metrics Obtained For Each Feature Set ...... 64

Table 5.14 F-Measures from Each Trainer Experiments with the FeatureSet_8 (which gives the best result) ...... 67

Table 5.15 F-Measures from Each Experiment with the Different Numbers of Tweets and Toponyms ...... 68

Table 5.16 Second Dataset Results ...... 69

Table 5.17 Second Dataset Metrics ...... 69

Table 5.18 Comparison between Turkish NLP Pipeline and The Proposed System in terms of The Numbers of True Positive, True Negative, False Positive and False Negative Findings ...... 70

Table 5.19 Comparison between Turkish NLP Pipeline and The Proposed System in terms of Precision, Recall, F-Measure and Accuracy ...... 70

Table 5.20 Comparison between Rule-based NER System and The Proposed System in terms of The Numbers of True Positive, True Negative, False Positive and False Negative Findings ...... 74

Table 5.21 Comparison between Rule-based NER System and The Proposed System in terms of Precision, Recall, F-Measure and Accuracy ...... 74

LIST OF FIGURES

FIGURES

Figure 2.1 Graphical Model for Linear Chain CRF ...... 13

Figure 4.1 General Workflow of the Proposed Work ...... 32

Figure 4.2 Example which does not have multi-character in the initial form . . 40

Figure 4.3 Example which has multi-character in the initial form ...... 40

Figure 4.4 Training and Testing Phase ...... 41

Figure 4.5 MALLET Pipes used in Proposed System ...... 42

Figure 4.6 Training and Testing Process in MALLET ...... 48

Figure 5.1 Metrics Obtained For Each Feature Set ...... 64

Figure 5.2 Precision Metrics Obtained For Each Feature Set ...... 65

Figure 5.3 Recall Metrics Obtained For Each Feature Set ...... 66

LIST OF ABBREVIATIONS

IE  Information Extraction
BÇ  Bilgi Çıkarımı
NER  Named Entity Recognition
VİT  Varlık İsmi Tanıma
MALLET  MAchine Learning for LanguagE Toolkit
CRF  Conditional Random Fields
NLP  Natural Language Processing
ML  Machine Learning
POS-Tags  Part-Of-Speech Tags
HMMs  Hidden Markov Models
MEMMs  Maximum Entropy Markov Models

CHAPTER 1

INTRODUCTION

1.1 Information Extraction and Named Entity Recognition

Reaching the desired information has always been a problem throughout history, since the data on reachable sources and the number of those sources are growing very rapidly day by day. The main reason is that accessing the Internet is getting easier and people are more willing to share information on the Internet than previous generations. As the amount of information grows huge, it gets harder to find, access, analyze, and use valuable or desired information among it. The purpose of the Information Extraction (IE) field is to analyze this information clutter and to extract the desired data from it.

IE aims to make inferences from unstructured data. More formally, it deals with the automatic extraction of structured information, such as types of events, entities, and relationships between entities, from unstructured sources. It enables much richer forms of queries on those messy and unstructured sources than searching with a keyword. The extraction of information from noisy, unstructured sources is a challenging task that a large number of researchers have been working on for over two decades now [35].

Information extraction is used in many different domains such as machine learning, information retrieval, databases, web and document analysis. In the document analysis domain, information extraction is usually involved in language processing or linguistic decomposition to extract usable data from unformatted documents. One of the extraction tasks in IE is concentrated around the identification of named entities, such as people, company and location names, and relationships among them, from natural language text [35].

Named Entity Recognition (NER) is an important process for information extraction systems. The NER process is used to identify proper names in texts and classify them into a set of predefined categories such as person names, location names, organization names, date and time expressions, etc. For the NER task, a variety of techniques have been used [38]. These techniques can be grouped under three approaches:

• Rule Based (linguistic) approaches

• Machine Learning (ML) based approaches

• Hybrid Systems

In recent years, NER systems have evolved considerably for extracting information from formally written text documents. The new challenge is to achieve information extraction from informally written text such as e-mails, SMS messages, tweets, etc., since social media outlets such as Twitter and Facebook have drawn attention from companies and researchers interested in detecting trends. Since these sources can also be considered news sources spreading like wildfire, processing their information is crucial. The informal and popular nature of these services leads to a large amount of noisy misspellings, lack of punctuation, non-standard abbreviations and abnormal capitalization. Another problem with this information clutter is that it consists of short data because of the character limit of those services. These kinds of disadvantages cause traditional Natural Language Processing (NLP) techniques to have lower accuracy than the accuracy obtained from processing structured text such as newswire articles [26], [22].

Toponyms are indispensable for researchers and organizations when detecting trends. Since the trend in one place can differ from another, it is crucial to extract toponyms first for such purposes. However, the challenge mentioned for NER on informal data is also a problem for Toponym Recognition when applied to informally written documents.

The proposed system in this thesis constructs a hybrid NER system which uses both rule based and machine learning based techniques to extract toponyms from an informally written, unstructured text document which includes Turkish tweets.

1.2 Motivation and Contributions

Recently, NER systems have shifted their domain from analyzing formal texts to analyzing informal texts such as tweets, Facebook messages, SMS messages, etc., in order to obtain more accurate information about the needs and thoughts of people. Location information in those sources is one of the key features for such systems, since trends, agendas, needs and thoughts of people in one place can be totally different from those in another. Therefore, extracting location information, known as Toponym Recognition, from such sources is crucial. The main motivation of this study is that this domain is quite new for the Turkish language; therefore, it needs attention because of the lack of a system which recognizes toponyms with high accuracy.

The proposed study is a NER task which uses both machine learning based and rule based approaches. The constructed system uses the Conditional Random Fields (CRFs) model in the machine learning part to create the graphical model of the unstructured data, which is a plain text document that includes Turkish tweets obtained from Twitter.

CRF is chosen for the machine learning part since it has some advantages over other models such as Maximum Entropy Markov Models (MEMMs) and Hidden Markov Models (HMMs). CRFs can handle multiple interacting features and long-range dependencies between tokens, which frequently exist in real-life data, whereas the others require the independence of the tokens. In addition, CRF avoids the label bias problem which exists in the other models, since it is based on undirected graphical models. CRF outperforms both HMM and MEMM in real-life sequence labelling tasks.

The system needs some features in order to train CRFs. The first purpose is to analyze the impact of features on the accuracy of a NER system when the system works with informal data. The feature combination used in the study is different from other studies ([3], [30]) which also work on NER for the Turkish language. To extract POS-Tags as correctly as possible, a normalizer is implemented as in [3]; however, it makes different types of adjustments.

The proposed system is a hybrid NER system, and it trains CRFs with some rule-based features as in [3]; however, instead of using huge gazetteers, it uses some regular expressions. Therefore, the system does not use a tagged test data set, since there are no gazetteers to pre-check the test data set. The advantage of supporting a machine learning-based approach with rule-based features is that, if the desired information can be expressed with some pre-defined rules, adding an extra feature to tokens obeying those rules increases their probability of being labeled as a location in the machine learning part.
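The idea of turning a rule match into an extra learner feature can be sketched in Python. This is an illustrative sketch only: the feature names are invented here, and the hint strings are taken from the examples quoted in this thesis, not from its full rule set.

```python
import re

# Hypothetical hint strings ("köy", "deniz", "şehir", "istan") from the
# examples in this thesis; the actual rule set is richer than this.
LOCATION_HINTS = re.compile(r"(köy|deniz|şehir|istan)", re.IGNORECASE)

def token_features(token: str) -> dict:
    """Build a small feature dictionary for one token, including a
    rule-based flag that the CRF can weight like any other feature."""
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "suffix3": token[-3:].lower(),
        # Rule-based feature: does the token contain a location hint?
        "matches_location_rule": bool(LOCATION_HINTS.search(token)),
    }

print(token_features("Istanbul")["matches_location_rule"])  # True ("istan")
print(token_features("masa")["matches_location_rule"])      # False
```

Tokens matching a rule thus carry an extra positive signal into training, which is the "adding an extra feature" effect described above.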

In this study, MALLET [25] is used as a tool to run the CRFs algorithm. It offers different kinds of trainers which train CRFs in different ways (see Section 2.4). The effects of the CRF training approach are also analyzed.

The proposed system uses different combinations of features, CRF training, and informal-to-formal conversion than the previous studies which apply NER to Turkish tweet data.

We can summarize the contributions of this thesis work as follows:

• Feature sets which are used in the machine learning process differ from other NER systems which work on informally written Turkish texts. Some language independent features are also imported into the system to overcome the informality.

• For the machine learning model, CRFs are employed and, with the help of MALLET, various training mechanisms are compared. Each training mechanism trains the CRFs model with a different approach.

• The constructed system proposes a normalization phase which deals with simpler but more common issues in informally written Turkish texts. This kind of normalization yields fewer erroneous adjustments.

• Huge gazetteers are not used in the proposed system; therefore, there is no pre-checking phase before the machine learning process. In order to exploit the advantage of combining a rule-based approach with a machine learning phase, some regular expressions are employed.

1.3 Thesis Organization

The next chapter gives basic information about the tools and models which are used in the proposed system. The third chapter reviews studies related to the general approach of NER systems which work on the Turkish language, use informal data to extract information, or analyze informally written Turkish text. Chapter 4 presents the proposed work in detail. In Chapter 5, the approaches and results of the experiments, and comparisons of the presented system with existing studies, are given. The last chapter concludes the study with an overview and future work.

CHAPTER 2

BACKGROUND

In this chapter, brief information about the techniques, tools and algorithms related to the proposed work is given. Firstly, toponym recognition is explained in Section 2.1, and the definition of morphological processing, some morphological features of the Turkish language, and a brief description of a morphological analysis tool, TRmorph, are given in Section 2.2. In Section 2.3, the types of NER methods and CRFs, a model used for machine-learning based NER, are presented. In the final section of this chapter, Section 2.4, a sequence labelling tool used in the proposed work for tagging location entities with final classification labels is introduced.

2.1 Toponym Recognition

The purpose of the overall task is to find references to geographic locations, known as toponyms, in text and then to assign each toponym to the correct geographic coordinates among the many possible interpretations. Extracting toponyms from textual content is named Toponym Recognition, whereas interpreting the references is known as Toponym Resolution. These tasks have several difficulties: understanding natural language is a challenge for toponym recognition, whereas toponym resolution suffers from deciding which of the many possible locations is being referenced [21].

The first challenge is named geo/non-geo ambiguity: some location names can also refer to other entities such as a person, an organisation, etc. For example, while Paris in Paris Hilton is part of a named entity referring to a person, Paris can also be a toponym referring to the city in France. Another challenge is geo/geo ambiguity: since location names are not unique, more than one set of geographic coordinates can carry the same name. For example, there are more than 30 locations named Paris on earth [20].

Toponym recognition can be considered a sub-field of NER, since toponym recognition deals with finding location entities in textual content, whereas NER deals with finding entities of other types, such as names of people, organizations, dates, etc., as well as location entities.

2.2 Morphological Analysis

Morphological analysis focuses on the composition and organisation of words in human language, and it has an important role in many computational linguistics applications. It aims to understand how simple, complex, and varied words are formed, and in the end it generates a proper definition for a word. In general, morphological studies analyse words in terms of their meaningful constituents, their composition rules from significant components, and their place in sentences.

Morphological analysis is particularly important for languages which are morphologically complex, since in such languages some of the linguistic information denoted by more than one word, and the connections between words, are confined to only one word. This means that morphology is not equally conspicuous in all spoken languages. For example, a language which uses one word to express an action uses more morphology than a language expressing the same action with more than one word. Thus, we can say that Turkish makes more use of morphology than English, since English does not use morphology to separate "I went", "You went" and "We went" (it uses two different words to express the action and the person who went), whereas Turkish encodes both the action and the person performing it in only one word: "Gittim", "Gittin", and "Gittik". More detailed information about morphology in different languages can be found in [12].

Since Turkish is an agglutinative language, morphologists have to deal with a complex morphology. Word generation processes in Turkish can end up with words which are very long and sometimes even correspond to an English sentence. For example, "İstanbullulaştıramadıklarımızdanmışsınız" is a sentence consisting of only one word in Turkish. Its English equivalent is "You are one of those who we could not convert to an Istanbulite" (the example is taken from [5]). Although this example is not common in daily language, it illustrates the complex morphological structure of Turkish. In Turkish, a new word is mostly generated by adding an affix at the end of a root, just like in the previous example; this process is named Suffixation, and some suffixes attach to stems recursively. Moreover, the vowel of a particular suffix varies according to vowel harmony, whereas some suffixes do not obey this rule (such as kalp-ler, hakikat-siz, etc.); some suffixes begin with "ç/c", "k/g", or "t/d", and the choice changes according to the last phonological unit of the stem (posta-cı / süt-çü, etc.); and while attaching a suffix to a root or a stem, if the last letter of the stem is a vowel and the suffix starts with a vowel, then either the initial vowel of the suffix is deleted or a consonant (y, ş, s, or n) is inserted between them. These are some morphological characteristics of the Turkish language; for detailed information, [11] is a valuable source.
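As a small illustration of the vowel harmony rule described above, the following sketch picks the plural suffix (-lar / -ler) from the last vowel of the stem. It is a deliberate simplification: it handles only two-way harmony and ignores exceptional stems such as kalp-ler that do not obey the rule.

```python
# Two-way vowel harmony for the Turkish plural suffix: back vowels take
# -lar, front vowels take -ler (simplified; exceptions are not modeled).
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")

def pluralize(stem: str) -> str:
    """Pick -lar or -ler according to the last vowel of the stem."""
    for ch in reversed(stem.lower()):
        if ch in BACK_VOWELS:
            return stem + "lar"
        if ch in FRONT_VOWELS:
            return stem + "ler"
    return stem + "ler"  # fallback for vowel-less stems

print(pluralize("kitap"))  # kitaplar (last vowel 'a' is back)
print(pluralize("ev"))     # evler   (last vowel 'e' is front)
```

A full analyzer such as TRmorph must additionally model recursive suffixation, consonant alternations, and the exception lists mentioned above.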

Most natural language applications, for example summarization, sequence tagging, and machine translation, need basic language processes such as POS-tagging, tokenizing, chunking, etc. These needs make the morphological analysis process quite important. To meet this requirement, morphological analyzers have been developed for various languages. Since each language has different morphological features, each has to have its own morphological analyzer. For Turkish, the language to which the proposed work is applied, there are limited analyzers; two of those are TRmorph [5] and CLARIN [27]. In this study, TRmorph is chosen for the morphological analysis part.

2.2.1 TRmorph

TRmorph [5], an open-source finite-state morphological analyzer for Turkish, is used in this study. It is based on freely available software tools and resources, and it allows modifying the code according to needs. It comes with different kinds of tools for stemming/lemmatization, morphological segmentation, hyphenation, and predicting unknown words. It can be customized through some common choices during compilation, such as whether proper nouns not starting with a capital letter are allowed, whether tokens written in all capital letters are analysed, etc. Moreover, it allows adding or modifying lexical entries. These capabilities make it a useful morphological analyzer.

TRmorph consists of two implementations, namely the TRmorph Morphological Analyzer and the TRmorph Disambiguator. The TRmorph analyser gives more than one result for a word, since a word can have multiple meanings or can be divided into its meaningful root and suffixes in multiple ways. The analyser does not use context information, i.e., it does not take the other words in the sentence into account.

The disambiguator is responsible for choosing the appropriate one among the results of the analyzer. It does its job by taking the context into account, considering the conditional probabilities of tag sequences. For example, a word can have more than one role in a sentence; in other words, a word can be a noun or an adjective ("Saçlarının rengi kızıl." —> noun / "Kızıl saçlarını dalgalandırdı" —> adjective). The morphological analyser gives multiple results for "kızıl". To be able to decide which tag is appropriate for this word, the disambiguator must consider the other words in the sentence.

For more information about TRmorph, its webpage [6] can be visited.

2.3 NER Methods

In order to identify names in text, different methods are used. Those methods can be categorized into three groups, namely Rule-based NER, Machine Learning-based NER, and Hybrid NER.

Rule-based NER can be used if it is possible to write a set of regular expressions that extract the intended named entities. In other words, to be successful with rule-based NER, the natural language description and the rules employed to recognize the named entities using the syntactic and lexical structure of the words need to be formulated. Besides the rules, some gazetteers or general dictionaries can be required by the approach [18].

Machine learning-based NER systems use machine learning methods. These methods are language independent, since they do not analyze the meaning of the words used as features. Even so, individual features can be language dependent, such as POS-tags; however, Natural Language Tools (NL-Tools) are available for feature extraction to meet this need. Therefore, the development cycle of a machine learning-based system is generally faster than that of a rule-based system.

The last group is hybrid NER. A rule-based NER approach uses pre-defined rules (if regular expressions can be defined for the intended named entities), while machine learning-based NER approaches are widely used due to their ability to be trained easily, to adapt to different domains, etc. A hybrid NER system is a combination of both approaches. In other words, it combines a machine learning model, such as CRFs or a Hidden Markov Model, with rule-based components such as gazetteers, dictionaries, and pre-defined rule sets ([3], [43]). This work proposes a hybrid NER system: the CRFs method is used for machine learning, together with a set of pre-defined rules. Detailed information is presented in Chapter 4, Methodology.

2.3.1 CRFs as a Machine Learning based NER Method

CRFs [19] are undirected graphical models, and a CRF is a type of discriminative probabilistic model. CRFs generate a conditional probability distribution using feature sets, and they are used for labeling and segmenting structured data, such as sequential data. To label sequential data such as natural language text, CRFs need to be trained with a training dataset that includes the same types of features as the sequential data to be tested. By adding new features, the expressive power of CRFs can be increased. In the NER process, the input for the CRF is a token sequence extracted from a sentence or a document, and the state sequence is its corresponding label sequence.

When the edges between the nodes of the graphical model form a linear chain, the model is named a linear-chain CRF; such CRFs make a first-order Markov assumption and take the form of conditionally trained probabilistic finite state machines [38].

Linear-chain CRFs correspond to finite automata; according to [24], they can be viewed approximately as conditionally trained hidden Markov models (HMMs), or as a globally normalized extension of Maximum Entropy Markov Models.

CRFs are defined in [24] as follows. For labeling and segmentation tasks, one of the most common methods is the HMM, which identifies the most probable sequence of labels for a given token sequence. This model defines a joint probability distribution p(x, y) over observation sequences (x) and the corresponding label sequences (y). In order to define a joint distribution, HMMs enumerate all possible observation sequences; this is difficult if the observation elements cannot be represented independently of the other elements in the sequence. However, observation sequences in real life have multiple interacting features and long-range dependencies between observation elements. CRFs instead define a conditional probability over label sequences given a particular observation sequence x, rather than a joint distribution over both label and observation elements. They label an observation sequence by choosing the label sequence that yields the maximum conditional probability. CRFs have some advantages over HMMs. HMMs require the independence of the observations, whereas CRFs can handle the multiple interacting features and long-range dependencies that frequently exist in real-life data. Another advantage of CRFs is that they avoid the label bias problem, which is exhibited by HMMs and MEMMs (Maximum Entropy Markov Models) since these are based on directed graphical models. CRFs outperform both of these models in real-life sequence labeling tasks ([19], [32], [37]).

A CRF consists of a conditional distribution p(y|x) and an associated graphical struc- ture.

When modeling sequences with an undirected graphical structure, X is the set of input variables that are observed and Y is the set of output variables to be predicted. As an example, Figure 2.1 shows the graphical model of a linear-chain CRF.

Figure 2.1: Graphical Model for Linear Chain CRF

The undirected graph G = (V, E) consists of vertices and edges. Each vertex in V corresponds to an element in Y, and the absence of an edge between two vertices means that they are conditionally independent. Therefore, the defined potential functions must ensure that conditionally independent random variables do not appear in the same potential function.

The conditional distribution of the linear-chain CRF is defined as p(y|x), and the probability of a particular label sequence y given observation sequence x is defined as:

p(y \mid x) = \frac{1}{Z_\theta(x)} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t, x_t) \right\} \qquad (2.1)

where f_k(y_{t-1}, y_t, x_t) is a potential function of the transition from state y_{t-1} to state y_t with input x_t, \theta_k is a parameter optimized by the training, and Z_\theta(x) is the normalization factor:

Z_\theta(x) = \sum_{y \in \mathcal{Y}^T} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t, x_t) \right\} \qquad (2.2)

where y_t is the named entity label and x_t is the feature vector containing all the components of the global observations x [36].

A general CRF used in information extraction automatically builds a relational database from information contained in unstructured text. Unlike a linear-chain CRF, it can capture long-distance dependencies between labels; such a model is named a skip-chain CRF. The skip-chain CRF has better performance than a linear-chain CRF [39].

In the proposed system, a linear-chain CRF is used.
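Equations 2.1 and 2.2 can be made concrete with a toy model. The following sketch (the labels, feature functions, and weights are all invented; a real CRF learns the weights and uses dynamic programming instead of brute-force enumeration) computes the unnormalized score from Equation 2.1, the normalizer Z from Equation 2.2, and the maximum-probability label sequence:

```python
import itertools
import math

# Toy linear-chain CRF (a sketch of Equations 2.1 and 2.2, not Mallet's
# implementation). Labels, feature functions, and weights are invented.
LABELS = ["O", "LOC"]

def f_cap_is_loc(y_prev, y, x):    # capitalized token tagged LOC
    return 1.0 if x[0].isupper() and y == "LOC" else 0.0

def f_loc_to_loc(y_prev, y, x):    # transition LOC -> LOC
    return 1.0 if y_prev == "LOC" and y == "LOC" else 0.0

def f_lower_is_loc(y_prev, y, x):  # lowercase token tagged LOC (penalized)
    return 1.0 if x[0].islower() and y == "LOC" else 0.0

FEATURES = [(f_cap_is_loc, 2.0), (f_loc_to_loc, 1.0), (f_lower_is_loc, -3.0)]

def score(labels, tokens):
    # exp( sum_t sum_k theta_k * f_k(y_{t-1}, y_t, x_t) ); y_0 treated as "O"
    total, prev = 0.0, "O"
    for y, x in zip(labels, tokens):
        total += sum(theta * f(prev, y, x) for f, theta in FEATURES)
        prev = y
    return math.exp(total)

def predict(tokens):
    # Enumerate all label sequences (exponential; fine for a toy example).
    T = len(tokens)
    Z = sum(score(seq, tokens) for seq in itertools.product(LABELS, repeat=T))
    best = max(itertools.product(LABELS, repeat=T), key=lambda s: score(s, tokens))
    return list(best), score(best, tokens) / Z   # argmax and p(y|x)

labels, prob = predict(["geziyorum", "Ankara", "sokaklarında"])
print(labels)   # the capitalized token is tagged LOC
```

In practice the argmax is computed with the Viterbi algorithm and Z with the forward algorithm, both linear in T.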

2.4 Mallet

Mallet consists of integrated Java packages, and it can be used for statistical NLP, information extraction, document classification, clustering, etc. [25]. It comes with a wide variety of tools and implemented algorithms for machine learning applications. For document classification, Naive Bayes, Maximum Entropy, and Decision Tree algorithms are implemented. Mallet also implements sequence tagging algorithms to extract named entities from text; those algorithms are HMMs, MEMMs, and CRFs. For topic modeling, it has sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA. To numerically optimize the algorithms, Mallet has an efficient implementation of Limited Memory BFGS and some other numerical optimization methods. Lastly, Mallet has a "pipes" system to convert text documents into numerical representations; this process deals with tasks such as converting sequences into count vectors, tokenization, removing stopwords, etc. Detailed information and the Mallet Application Programming Interface (API) are available at [25].

The input data for Mallet is either a FeatureVector or a FeatureSequence. If the input is in text format, Mallet can convert it into one of these formats with the help of its pipes. A FeatureSequence is mutable and extendable storage for objects of the same class, whereas a FeatureVector stores the words by assigning an integer to each of them; thus, all words are unique in a FeatureVector.

Mallet uses different types of pipes in the data importing and pre-processing phases. All classes implemented in Mallet, and all pipes that can be used with it, are described in Mallet's javadoc. The pipes employed in this study will be described in the Proposed Work chapter.
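Mallet's pipes are Java classes; purely as an illustration of the idea, the following Python sketch (the pipe functions, stopword list, and `Alphabet` class are invented analogues, not Mallet's API) chains small transformations that tokenize, lowercase, remove stopwords, and map tokens to integer ids, producing a FeatureSequence-like representation:

```python
# A Python analogue of Mallet's "pipes" idea (Mallet itself is Java; the
# names below are invented, only the concept mirrors Mallet).
STOPWORDS = {"ve", "bir", "bu"}

def tokenize(text):
    return text.split()

def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

class Alphabet:
    """Maps each distinct token to a unique integer id."""
    def __init__(self):
        self.ids = {}
    def lookup(self, token):
        return self.ids.setdefault(token, len(self.ids))

def to_feature_sequence(tokens, alphabet):
    return [alphabet.lookup(t) for t in tokens]

# Chain the pipes, as Mallet chains Pipe objects on an instance.
alphabet = Alphabet()
pipeline = [tokenize, lowercase, remove_stopwords,
            lambda toks: to_feature_sequence(toks, alphabet)]
data = "Ankara ve Istanbul bu hafta"
for pipe in pipeline:
    data = pipe(data)
print(data)   # integer ids for "ankara", "istanbul", "hafta" -> [0, 1, 2]
```

The design point is that each pipe is a small, reusable step, so the same pre-processing chain can be applied identically to training and test data.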

In this study, Mallet is used with the CRFs algorithm for sequence tagging. There are different methods to train a CRF. The CRF training mechanisms implemented in Mallet are CRFTrainerByL1LabelLikelihood, CRFTrainerByLabelLikelihood, CRFTrainerByStochasticGradient, CRFTrainerByThreadedLabelLikelihood, and CRFTrainerByValueGradients.

CRFTrainerByLabelLikelihood, CRFTrainerByStochasticGradient, and CRFTrainerByValueGradients are used and compared in the experiments of this study. The results are given in the Experiments and Results chapter.

2.4.1 Training CRF with Stochastic Gradient Trainer

Stochastic Gradient Descent (SGD) is an optimization algorithm. The basic idea of SGD optimization is that it picks a training instance randomly and proceeds slowly in the direction specified by the gradient for only the current instance. The Stochastic Gradient trainer employs the log-likelihood objective function (more convenient than the likelihood), which is the logarithm of the likelihood objective function [39]. It does not parallelize well, and the single-instance direction of steepest descent can point in a direction which is not the optimum. The directions computed in label-likelihood or value gradient optimization (both of which use L-BFGS optimization) are better than those of stochastic gradient optimization, whereas stochastic gradient directions are computed much faster. However, the speedup which SGD optimization provides shows only on large datasets; in other words, stochastic gradient optimization does not give a considerable performance gain on small datasets. Stochastic gradient optimization is compared with the other optimization methods in Chapter 5 in terms of accuracy.
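The per-instance update described above can be sketched as follows. This is a minimal illustration only: logistic regression stands in for the CRF objective, and the toy data, learning rate, and step count are invented. Each iteration picks one random instance and moves the parameters along the gradient of that single instance's log-likelihood:

```python
import math
import random

# Sketch of stochastic gradient ascent on a log-likelihood objective.
# Logistic regression stands in for the CRF; data and rate are invented.
random.seed(0)

# Toy binary data: x -> label (1 if x > 0)
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w, b = 0.0, 0.0
rate = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for step in range(1000):
    x, y = random.choice(data)   # pick one training instance at random
    p = sigmoid(w * x + b)
    # gradient of the log-likelihood for this single instance only
    w += rate * (y - p) * x
    b += rate * (y - p)

# After training, the model separates the two classes
print(sigmoid(w * 2.0 + b) > 0.5, sigmoid(w * -2.0 + b) < 0.5)
```

Batch methods such as L-BFGS instead compute the gradient over all instances per step, which is slower per step but yields more reliable directions, matching the trade-off described above.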

2.4.2 Training CRF with Label Likelihood Trainer

Label likelihood is an objective function for CRFs, and its trainer tries to determine which label should be assigned to a given feature vector. First, it calculates the probability of each target label by counting the occurrences of each label in the training data set. After calculating the frequency of each label, for each feature it calculates the probability of that feature occurring together with each target label in the training data set. To calculate the final probability, it multiplies the probabilities of all features in the feature vector of a token with a target label, and it makes the same calculation for each label. Finally, it picks the label which gives the maximum probability as the estimated target label for that token [2]. The performance of the label likelihood trainer depends on the number of tokens in the training data set, the number of features, and the number of target labels. Therefore, label likelihood is not a good trainer for large training and test data sets, or for systems which have many features or a large target label set. However, it can easily be parallelized, since it does not decide the next direction using the current instance.
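The frequency-and-product calculation described above can be sketched as follows. Note that this is a simplification closer to a Naive Bayes estimate than to Mallet's actual trainer; the tiny training set, feature names, and add-one smoothing are invented for illustration:

```python
from collections import Counter, defaultdict

# Sketch of the frequency-based calculation described above (a Naive
# Bayes-style simplification; training data and features are invented).
train = [  # (feature vector, label)
    (["capitalized", "ends-da"], "LOC"),
    (["capitalized"], "PER"),
    (["lowercase"], "O"),
    (["lowercase", "ends-da"], "O"),
]

label_counts = Counter(label for _, label in train)
feat_counts = defaultdict(Counter)
for feats, label in train:
    for f in feats:
        feat_counts[label][f] += 1

def predict(feats):
    best, best_p = None, -1.0
    for label, n in label_counts.items():
        p = n / len(train)                            # P(label)
        for f in feats:
            p *= (feat_counts[label][f] + 1) / (n + 2)  # smoothed P(f | label)
        if p > best_p:
            best, best_p = label, p
    return best

print(predict(["capitalized", "ends-da"]))   # -> LOC
```

The cost of the final step grows with the product of tokens, features, and labels, which is why this trainer scales poorly to large feature and label sets, as noted above.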

2.4.3 Training CRF with Value Gradient Trainer

The Value Gradient Trainer is a trainer for CRFs that can combine multiple objective functions, each represented by an optimizable object. Optimizable objects may be based on batch label-likelihood, entropy regularization, value gradients, or label-likelihood; each of these optimizables is an objective function. In other words, the value gradient trainer uses more than one objective function to make predictions on unlabeled data. Its performance changes according to the objective functions used as optimizables. For example, if batch label-likelihood is chosen as the objective function, the performance will be acceptable, since it divides the input into multiple batches and then processes each batch in parallel. For more information about learning with multiple objectives, [31] can be examined.

CHAPTER 3

LITERATURE REVIEW

The aim of NER systems is to extract and classify named entities in a text document. NER is used as a basic step in every IE system, and some IE tasks, such as event extraction, perform NER first.

There are three types of named entities, namely ENAMEX, TIMEX, and NUMEX. ENAMEX contains entities which refer to person, location, and organization names. The TIMEX type represents time-of-day and date expressions. Entities referring to money and percentages are analyzed under the NUMEX type [4].

Since real-world domains are highly variable and very informal, learning-based systems are becoming more attractive because of their flexible nature. However, using rules in the learning phase makes the distinction between the desired entities and others clearer. Systems which use both learning-based NER and rule-based NER are named hybrid systems. The proposed work is a hybrid system which looks for toponyms in unstructured informal data. There are various studies in the NER field; most of them are proposed to work on formal data, such as news, whereas only a few are designed for named entity extraction from informal data.

3.1 Named Entity Recognition Systems

Even though the number of proposed studies in the NER field is large, many of them work on specific languages, such as English. NER in other languages, for which fewer studies have been proposed, still needs attention because of the lack of systems giving high accuracy. Since Turkish is one of those languages, analyzing studies which deal with languages that are not well-studied can give some clues.

There are many studies which apply NER to English text data using different kinds of features and techniques. For example, [10] combines classifiers, namely maximum entropy, HMM, transformation-based learning, and robust risk minimization classifiers, and uses a number of features to identify named entities, including words and their lemmas in a 5-word window surrounding the current word, POS-tags, text chunks in a [-1,+1] window, affixes, feature flags for each token such as allCaps, firstCap, 2digit, etc., and gazetteer information. The system has f-measure values of 91.15% for locations, 93.85% for person names, and 84.67% for organization names. [14] is another example of NER on English texts. It uses a CRF model and some local features to make predictions. The features can be listed as a window which contains the previous, current, and next words, character n-grams and the shallow parse chunk of the current word, POS-tags of the tokens in the window, and the presence of a word in a left window of size 5 around the current word. After a CRF is trained with those local features, another CRF is employed using the output of the first CRF and some majority features. Majority features can be exemplified as token-majority features (if there are two tokens which are both "Australia" and tagged as "Location", and another token which is also "Australia" but tagged as "Organization", the third token is re-tagged as location), entity-majority features, etc. By using all features, the system gives an f-measure of 87.24% over all entity types such as person, location, and organization.

[1] can be another example of a NER system, but it does not work on a well-studied language. Dr. Balabantaray, Suprava Das, and Tanaya Mishra conducted a case study exploiting the conventional method for NER in Odia (an Indian language). They used a part-of-speech (POS) tagger and gazetteers to generate feature sets for the training and test data sets. Since English and Indian languages have structural differences, most POS-taggers cannot be used for Odia. In this case study, a CRF-based POS-tagger, whose accuracy is not satisfactory in general, is implemented; even so, its accuracy in tagging proper nouns is quite high. After tagging the training and test data with the POS-tagger, they generate 3 different training and test data sets. The first column of each training data set is the word itself; the second column holds the POS-tags of the words for two of the training data sets, whereas it is a YES or NO tag (obtained using the gazetteer) for the third file; and the last column consists of named entity annotations generated by users. The test data sets contain words in the first column in all cases, with POS-tags in the second column for two of them and YES/NO tags for the last one, as in the training data sets; they do not contain the user-generated tags. After the preparation of the system, they conducted a set of experiments. The performance is satisfactory for the individual cases; however, combining POS-tags and gazetteer tags decreases the performance of the system.

Another study explores features for NER in formal Lithuanian texts [13]. The aim is to solve the NER task for the Lithuanian language, which is not well-studied for NER, and to explore different sets of features consisting of both language-dependent and language-independent features. The authors claim that, in spite of the high number of successful NER methods for English, for languages such as Lithuanian, which is completely different from English in terms of grammatical structure, there is no substantial study, especially using supervised machine learning techniques. In this research, PERSON, ORGANIZATION, and LOCATION tags are used to classify the named entities, and CRF is used as the machine learning technique. Since BIO tagging is used, there are actually 7 different result tags (B-LOC, I-LOC, B-PER, I-PER, B-ORG, I-ORG, O). The first step is tokenization and feature extraction. Each word is taken as a token, and since punctuation marks carry a lot of information about named entities in Lithuanian, they are also treated as tokens. After the tokenization step, some language-dependent and language-independent features which are critical for supervised NER are extracted from the dataset. The language-independent features are basically the token itself or a punctuation mark, Lowercase (a function returning 1 if all letters in the word are lowercase, zero otherwise), isFirstUpper (whether the first letter is capitalized or not), Acronym (1 if all characters are capitalized, 0 otherwise), Number (whether the token is a number or not), Length (length of the token), Prefix (the first n characters of the token, with n in [3-5] in this study), and Suffix (the last n characters, with n in [3-5]), whereas the language-dependent features are Lemma (transformation of the word into its major form), POS-tags, Stem (stem of the token), IsPERGaz (whether the token is in the PERSON gazetteer or not), and IsLOCGaz (whether the token is in the LOCATION gazetteer). In addition, a sliding window feature is also used. In the experiments, 9 different sets of features are used with 12 different window sizes, for a total of 108 experiments. The results show that language-dependent features increased the accuracy of NER, and that affix information (prefix and suffix) achieved similar accuracy to the POS, stem, and lemma features. The window size which yields the best result is [-2, +1], and the highest F-Score is obtained from the feature set in which all features are contained.
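The language-independent features listed above translate directly into code. The sketch below (function names and the window defaults are chosen for illustration; the exact definitions in [13] may differ) computes the Lowercase, isFirstUpper, Acronym, Number, Length, Prefix, and Suffix features, and attaches neighbours' features through a [-2, +1] sliding window:

```python
# Sketch of the language-independent features described above; the exact
# definitions used in the cited study may differ.
def features(token, n=3):
    return {
        "lowercase": int(token.islower()),
        "isFirstUpper": int(token[:1].isupper()),
        "acronym": int(token.isupper()),
        "number": int(token.isdigit()),
        "length": len(token),
        "prefix": token[:n],
        "suffix": token[-n:],
    }

def window_features(tokens, i, left=2, right=1):
    # Sliding window: attach neighbours' features, as in the [-2, +1] window.
    out = {("0", k): v for k, v in features(tokens[i]).items()}
    for off in range(-left, right + 1):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            for k, v in features(tokens[j]).items():
                out[(str(off), k)] = v
    return out

print(features("Vilnius"))
```

Because none of these functions inspect a lexicon, the same extractor works unchanged across languages, which is exactly what makes such features language independent.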

3.2 NER on Informal Texts

Since social media data carries more up-to-date information than news, records, etc., it is critical to extract this information in order to track trends, analyze what people need, etc. However, recognizing entities which are not formally written is a big challenge. NER on such data is quite new and still needs contributions.

J. Lingad, S. Karimi, and J. Yin analyze the applicability of NER to the extraction of locations from microblogs [22]. The study is a good example of the significance of the information in social media data. Location extraction from microblogs is critical since microblogs carry valuable information which can impact a person's life when a disaster is the matter of discussion. The purpose of the study is to understand which areas a disaster affects, where help is needed, and where help is available. Location is defined by the authors as geographic locations and points-of-interest. Geographic location means country, city, street, etc., and point-of-interest means hotels, pubs, etc. The objective of the research is to evaluate the effectiveness of major NLP tools with different settings, using Twitter data, which is short and informal. Stanford NER, OpenNLP, Yahoo! PlaceMaker, and TwitterNLP are the tools used in the comparison. Experiments are performed with two settings: out of the box (using already trained versions) and retraining. Since Stanford NER and OpenNLP are trained on formal data, they are retrained for informal data. Another point which needs to be investigated is the hashtag issue, because hashtags may indicate locations (exclude hashtags, or include hashtags without the hash symbol). According to the experiments with the already trained versions, Stanford NER gives the best result, and Yahoo! PlaceMaker is in second place with both the excluding and including hashtags settings. TwitterNLP is worse than these two systems, even though it is built for Twitter data, and OpenNLP is the worst. Experiments on the retrained versions show that Stanford NER is still the winner. In general, these results are better than the out-of-the-box results, because the retrained tools are able to handle informality.

Nerit [8] is a technical report from The Johns Hopkins University, and it proposes a language-independent NER system. The system is also designed to work with short and informal data, such as Twitter data. Since there is no specific language to work on, the authors do not use language-dependent morphological analysis methods. For example, instead of using POS-tagging or gazetteers, more generic techniques are used to obtain features describing the word, such as the character composition of a word, the context of the word in a message, etc. The feature templates used in the experiments are Token (the word itself), N-gram (character n-grams of the word), Context (k words before and k words after), Length (message and word length), and Position (position of the word in the message). The token itself is the baseline feature. Character n-grams are used to capture a word's prefix, suffix, or root; however, compared to morphological analysis, they provide limited modeling of morphology and of spelling-error robustness. Another feature is Context (a sliding window), which identifies common word patterns around the current word in order to spot similar entities of the same type. There are 10 entity types in this system, labeled URL, Date, Time, Telephone, Person, Percent, Org (Organization), Money, Location, and Email. The model used for the training and testing phases is a structural Support Vector Machine (SVM), which combines the discriminative learning of the SVM with a Hidden Markov Model. In the experiments, both English and Spanish datasets are used, and to analyze each feature's contribution to the model, each feature is evaluated independently.

The purpose of reviewing these studies is to gain the perspective needed to design a NER system working on informally written Turkish data. Therefore, several studies are reviewed in addition to [8] and [22].

Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou propose another hybrid NER framework [43], which combines a K-Nearest Neighbors (KNN) classifier with a linear CRF model to overcome the informality and insufficiency of Twitter data. The first step is gathering the feature vector. After a word in a tweet is converted into a bag-of-words vector by the KNN model (window size = 5), a feature matrix is created for that tweet; the created feature vector is then used in the CRF model. Before running the CRF model, KNN predicts the class of a word by leveraging similar and recently labeled tweets, and this class feature is also added to the input of the linear CRF model. The CRF conducts tweet-level NER. Newly labeled tweets are repeatedly added to the training set, and when the number of newly added training instances exceeds a threshold, the system (both the CRF and KNN models) retrains itself. Before the system starts working, some improvements are made to the data, such as removing stop words and normalizing the Twitter metadata. Within the scope of this study, no additional features are used, since extracting them requires a lot of computing resources; similarly, since gazetteers contain noise, they would affect the performance negatively. Some alternative models are also experimented with, such as classifiers based on Maximum Entropy and SVM instead of KNN, and alternative NER models (Hidden Markov Model, etc.) instead of the CRF method. With this system, the authors label tweets using four types of labels, namely LOCATION, PERSON, PRODUCT, and ORGANIZATION. The overall experimental results are 81.6, 78.8, and 80.2 percent for precision, recall, and F-Score, respectively.

The final reviewed study is by Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. They developed an experimental system, namely T-NER [34], in which they rebuild the NLP pipeline. In this study, they also propose a POS-tagger (T-POS), a shallow parser (T-CHUNK), and a capitalization classifier (T-CAP). Since the state-of-the-art POS-taggers are not very useful in informal contexts, they build their own POS-tagger using CRF. They annotated 16K tokens using the Penn Treebank tag set plus new tags added for Twitter-specific phenomena. Besides all these, for out-of-vocabulary words and lexical variations, hierarchical clustering (Jcluster) is performed. In addition to the clustering results, POS dictionaries, contextual features, spelling features, etc. are also used in the CRF for T-POS. For T-CHUNK, the token set used in T-POS is annotated with tags from the CoNLL shared task. CRF is again used in the shallow parsing step, and as features of the CRF, the shallow parsing feature set described by Sha and Pereira in a previous study [37], together with the Jcluster results, are put into the training and test data sets. Detection of IOB-encoded named entities is split into two steps, namely segmentation and classification of named entities, and in both parts CRF is again used. In the segmentation step, orthographic, contextual, and dictionary features (a set of type lists from Freebase), and the results of T-POS, T-CHUNK, and T-CAP (which predicts whether the tweet is informatively capitalized or not) are employed as features. After segmentation, the named entities need to be classified. Since individual tweets do not contain enough information to classify entities, they take advantage of large lists of named entities and their types gathered from Freebase. However, some entities in tweets do not exist in the Freebase baseline, or they have more than one type. For such unlabeled entities, LabeledLDA [33] is applied: each entity is modeled as a mixture of types and associated with a bag-of-words within its context window. It infers a distribution over the possible types of the entity; thus, LabeledLDA solves the non-availability-of-types and multiple-types problems. The experiments show that T-NER doubles the F-Score compared to Stanford NER.

3.3 Named Entity Recognition On Turkish Texts

NER in Turkish is quite well-studied on formal data. Since the new trend in the NER world is to recognize entities in informal data, there are only a few informal NER systems for Turkish, and the field is open to new perspectives.

3.3.1 NER On Formal Turkish Texts

The system proposed in [36] gives the highest accuracy among NER systems designed for formal Turkish data. The system, named the Turkish NLP Pipeline, performs some pre-processing on the dataset to extract the features which will be given to the CRF. This pre-processing includes tokenization, morphological processing, and gazetteer lookup. First, the data is tokenized. In tokenization, inflected proper nouns are separated into two tokens, the noun itself and the inflectional suffixes, since they are separated by an apostrophe. Each punctuation mark is treated as a token. In the output of the tokenizer, each line has only one token, and sentences are separated by an empty line. As the second step, a morphological process is applied to the tokenized data. In this step, a two-level morphological analyzer [27] is used: in the first level, the analyzer generates possible POS-tags for a token, and in the second level, it chooses the most probable tag according to the context. After morphological processing, the features which will be given to the CRF are prepared using the data coming from the morphological processing and the gazetteers. They have two types of gazetteers, namely base gazetteers and generator gazetteers. The base gazetteers include all location names in the Turkish postal code and telephone code systems; for person names, they also construct a base gazetteer from different sources. The second type, the generator gazetteers, include generator words such as "bakanlık" (if it comes after certain regular words such as "spot", "tarım", etc., it generates an organization, e.g. "Tarım Bakanlığı"). They use whether a token is in the base gazetteer or the generator gazetteer as a feature. In addition to the morphological and gazetteer lookup features, they also use some lexical features, such as capitalization (whether the token is in the proper name case, in mixed case, in all-uppercase, or in all-lowercase) and start-of-the-sentence (whether the token is in the first place in the sentence or not). In this study, a window of size (-3, +3) is defined; through this window, the features of the tokens within the window are also added as features for the current token. Finally, they obtain the results from the CRF tool using these features. The system gives the best result in the literature for NER on formal Turkish text: the f-measure in the CoNLL metric is 92.94% for person named entities, 88.77% for organization names, 92.93% for location names, and 91.94% overall.

Another NER system which processes formally written Turkish texts is proposed in [17]. This system employs rule-based NER, and it uses news texts, child stories, historical texts, and news video transcriptions as evaluation data. The proposed system defines rules to extract location, organization, and person names and time, date, and money expressions, in addition to some pre-defined lists. There are 4 types of lists in the study: a dictionary containing person names in Turkish, a list of well-known political figures in the world (such as Vladimir Putin, Ahmet Necdet Sezer, etc.), a list of well-known location names in the world, and a list of well-known organization names (such as Avrupa Birliği, Adalet Bakanlığı, etc.). In addition to these lists, 3 types of patterns are defined. These patterns are employed to extract location names (using patterns for "Sokak", "Yolu", "Kulesi", etc.), organization names (using patterns for "Üniversitesi", "Partisi", "A.Ş.", etc.), and temporal and numeric expressions (using patterns for "X başı/ortası/sonu...", where X can be a 4-digit year or the name of a month). Using these rules, the authors evaluate the system on their own data set. The overall results of the evaluation are 75.8%, 81.8%, and 78.7% for precision, recall, and f-measure, respectively.

3.3.2 NER On Informal Turkish Texts

Another study which constructs a NER system is [30]; this system analyzes informally written Turkish texts in order to extract the desired information. The authors conducted their study on an e-mail dataset which has informal content in the body part but a well-defined header. In the conducted study, a rule-based NER system is implemented by employing the CRFs method. Each word in the e-mail is treated as a Java object, and all features used in the CRF are defined as fields of the object. These fields are the word, the word length, its count, whether it is an abbreviation or not, its type (name, adverb, adjective, etc.), whether it is the first word of the sentence or not, etc. When the application runs, the frequency of the word and character n-grams (n = 1, 2, 3) are extracted. Then it is checked whether the word is in the gazetteers (abbreviation gazetteers, title gazetteers, and special word gazetteers ("Üniversite", "Hanım", "Bay", etc.)), whether it starts with a capital letter, whether it includes a punctuation mark, and whether it is in the header of the e-mail. For each property, the corresponding field in the Java object is set to 1. Finally, these properties are given to the CRF method to label the word. Besides these features, they use a (+1, -1) window to capture named entities referring to name-surname sequences, n-gram features to identify the suffixes, and the length of the token to find abbreviations. After extracting the features, the system gives f-measures of 0.94, 0.97, and 0.92 for academic, institutional, and personal e-mails, respectively, for person names, whereas for location names the f-scores are 0.68, 0.79, and 0.70 for the same types of e-mails.

[3] is an adapted version of the previously mentioned system, the Turkish NLP Pipeline [36]. In this study, the proposed NER system is designed to work on informally written Turkish text. This new system is evaluated on Twitter data, forum data, and speech-to-text data. Since real data contains a lot of informality, the previous system needs to be adapted to the characteristics of the new data. Firstly, the gazetteers are extended by adding some informally written names to the base gazetteers and some informal generators, such as "Abla", "Abi", and "Hoca", to the generator gazetteers. To make the morphological processing produce results for informally written tokens, a normalization phase is added to the system. The text normalizer handles the following cases:

• Slang words such as "Nbr?", which is adjusted to "Ne haber?".

• Tokens including repeated characters, for example "mutluyuuuum".

• Hashtags, vocatives, and mentions.

• Emo-style writings such as "$eker 4you".

• Tokens which are not capitalized correctly.

The rest of the system is the same as in the previously mentioned study. The system is evaluated with different feature sets. The CONLL metrics for the feature set giving the best results are 91.64% for the news data with no normalization, 19.28% for the Twitter data with normalization, 50.84% for the speech data with normalization, and 5.62% for the forum data.

[16] is the informal version of the study in [17]. In this study, the formal version ([17]) is used as a baseline, and its rules and resources are adapted to the new domain. The system works on tweets, and it needs to be adapted since Twitter data contains grammatical errors, abnormal capitalization, contracted words instead of full forms, and words written with English characters instead of the corresponding Turkish characters. In order to improve the performance of the initial system ([17]) on informally written tweets, the capitalization constraint of the baseline system is turned off, and a phase which generates all possible tokens by interchanging the English-Turkish characters is added. To reduce the number of those possible tokens, a list of correctly written Turkish words is used, and each candidate is checked against this list. Besides those adaptations, a normalization phase is designed. In the normalization process, if a token includes consecutively repeated characters and the token is not in the list containing formally written words, the repeated characters are contracted to one character. However, the system makes some wrong decisions in the normalization phase, such as converting "Çanakkale" to "Çanakale". Since the normalization process is applied before the NER system runs, an informal version of "şiir", "siir", is converted to "sir", which also yields a wrong result. The evaluation of the system results in an overall f-measure of 54.43% for the person, location, and organization types.

3.4 Toponym Recognition

Toponym Recognition is one part of Toponym Extraction, which consists of two steps; the other part is Toponym Resolution (or Disambiguation). The work proposed in this thesis is a Toponym Recognition system; it does not deal with the resolution of the toponyms. There are several studies on Toponym Extraction for English, whereas for Turkish, to the best of our knowledge, there is no system dealing with both the recognition and resolution parts. However, [28] is a valuable study on Toponym Recognition.

In [20], Toponym Extraction is divided into two parts, the first of which is recognizing the toponyms as usual. In order to recognize toponyms, the Stanford-NER tool [9] is used; however, some features need to be generated to be able to use this tool. To generate features, a POS-tagger is used to detect the proper nouns, since locations tend to consist of proper nouns. An entity dictionary which includes various types of entities, such as location names, colours, seasons, religions, etc., is also used to be able to recognize both toponyms and non-toponyms, since knowing that a token is not a toponym is important in the Toponym Resolution phase. After deciding on toponyms according to the dictionary and extracting the POS-tags of the tokens, the Stanford-NER system is run with those features. When the NER system finishes tagging, there might be mistakes in the selected entity boundaries. In order to correct such mistakes, a post-processing step is designed, in which a boundary expansion method is used. If an entity is found with a too-narrow boundary, and the same entity is found with its correct boundary in another part of the document, the system expands the boundary according to the correctly found entity. After the named entities referring to toponyms are extracted, the Toponym Resolution process decides on the geographical places to which the extracted entities refer.

In [38], toponym recognition is conducted with different kinds of features than in [20]. This system also uses POS-tags; however, it additionally uses suffix and prefix features, context words (features of the second previous, previous, next, and second next tokens), some digit features, and the named entity tags of the tokens within the context window. Digit features are extracted by analyzing whether tokens contain digits and whether they consist only of digits (2-digit or 4-digit). In addition, if a token is a combination of digits and punctuation symbols, digits and periods, digits and symbols (slash, hyphen, etc.), or digits and percentages, these properties are also extracted and used as features for the CRF. After extracting the desired features, the system employs a CRF model using these features, trained on a named-entity-tagged geological corpus. As a result of the experiments, the system gives precision, recall, and f-measure values of 77.05%, 77.27%, and 75.81%, respectively.

As mentioned before, to the best of our knowledge, there is no system covering both parts of the Toponym Extraction process. However, [28] proposes a Toponym Recognition system which works on informal Turkish data. In that study, three types of Toponym Recognition approaches, namely gazetteer-based, rule-based, and NER-based methods, are experimented with. Since the system works with informal data, a normalization phase is applied to the input data. While evaluating the approaches, each one is tested twice, once with and once without the normalization phase. In the gazetteer-based approach, three gazetteers (Cities of Turkey, Districts of Turkey, and Turkish Names of Countries) are created. After the tweets are morphologically processed and the stems of the tokens are extracted, a token is tagged as a toponym if it is found in one of the gazetteers. In the rule-based approach, if a token ends with "e", "de", or "den" (suffixes representing the dative, locative, and ablative forms of the token, respectively, in Turkish), it is tagged as a location, since in Turkish, location names are usually used in one of those forms. In the NER-based approach, pre-defined NER tools ([36] and [17]) are used as named entity taggers. The first approach gives a 43.19% f-measure for normalized tweets, whereas the second one gives a 28.37% f-measure for non-normalized data and 28.03% for normalized data. The last experiment compares the two previously developed NER-based systems, and the resulting f-measures are 30.69% for [36] and 56.30% for [17].

3.5 Discussion

NER is a well-studied method for extracting information from unstructured data. However, most of the produced systems depend on the language; NER is well-studied for languages such as English. The number of Turkish NER systems is much smaller than the number of those for English; however, there are some valuable Turkish NER systems which give high accuracy on formally written data. The new challenge is extracting the desired information from informal data, since the information on social media is valuable for understanding the needs of people, for anticipating coming disasters, etc. The previous section reviews studies which present hybrid NER systems for different languages and for different types of data.

Almost all of the mentioned systems use a hybrid type of NER; however, they define different kinds of features. Analyzing those studies informed the design of the proposed system.

POS-taggers are frequently used in NER systems, even those that work on informal data. Besides POS-taggers, gazetteers are also preferred for pre-labeling the test data. Some of the systems working on informal data choose to implement their own POS-tagger ([34] is an example), since existing taggers fail with informal data, and some of them define a normalization phase to make adjustments on the data ([3]). The most important work among those studies is [3]. Since it works with informal Twitter data written in Turkish, it is taken as a basis for the proposed work.

CHAPTER 4

METHODOLOGY

In this chapter, the proposed toponym recognition method is described in detail. The overall workflow is presented in Figure 4.1.

In order to use the CRF implementation of MALLET, the input data must be in an appropriate format. The CRF needs a training data set in which the location information is tagged correctly, in order to extract toponyms from the given test data set. These two files are generated from a text-based input file which contains Turkish tweets taken directly from Twitter, by running the "Input Files Generation" part shown in Figure 4.1.

The formats of the training and the test data sets are the same. The only difference is that toponyms are annotated in the training data set. The tokens of both data sets have to be analyzed with the same feature extraction processes. For example, the morphological characteristics of each word in each tweet have to be determined in both data sets.

Tokens which do not contain location information are marked with "O". If the location information consists of a single token, it is tagged with only "B-LOC"; if it includes multiple tokens, the first token is tagged with "B-LOC" and the others are tagged with "I-LOC". For instance:

Bu Det:def O
akşam Adv O
Taksim N:prop B-LOC
Gezi N I-LOC
Parkı’ndayız N p3s loc 0 V cpl:pres 1p I-LOC

Figure 4.1: General Workflow of the Proposed Work

In the example, tags such as "Det:def", "Adv", and "N:prop" are POS-tags, which are the results of the morphological processing. The CRF uses these tags, combined with the other generated features, in the stages of learning and extracting location information.

Consequently, the tweets given in the input file are converted into CRF input files (test and training data sets) in the steps of the Input Files Generation phase in Figure 4.1, and these files are used for training and testing in the CRF phase with the help of MALLET.

4.1 Tokenization

Tokenization, in its simplest form, is the process of splitting a text block into useful semantic units called tokens. These tokens are used in further processing such as feature extraction, training the CRF model, and testing. Basically, a token can be defined as a word that is separated by whitespace, punctuation marks, line breaks, etc. from its neighboring words in the sentence it belongs to.

The tokenizer used in this study is a module which uses Java regular expressions in order to create tokens. The key features of this module are:

• Splits a token sequence by using whitespace and line breaks as delimiters.

• Splits a token sequence at punctuation marks, such as commas and dots, unless doing so would disrupt the integrity of a token (e.g., a dot inside a URL).

• Punctuation marks, whitespace, and line breaks are not included in the generated token list.

• Mentions and hashtags are not split, and they are included in the token list.

• URLs are not split, and they are also included in the final list.

• Tokens which include an apostrophe are not split; they are accepted as one token, as can be seen in the example below.

For the following example sentences, the tokenizer gives the results below:

“Halk Tv’de son dakika haberi; Beşiktaş, Galatasaray ve Fenerbahçe taraftarları otobüsle Taksim’e hareket ediyor.”

Halk
Tv’de
son
dakika
haberi
Beşiktaş
Galatasaray
ve
Fenerbahçe
taraftarları
otobüsle
Taksim’e
hareket
ediyor

"@istanbul Merter’de köprülü kav¸saga˘ 2 ¸seriteklendi karayolu üzerindeki #Merter ile Incirli˙ http://t.co/RfoTBrLPFy."

@istanbul
Merter’de
köprülü
kavşağa
2
şerit
eklendi
karayolu
üzerindeki
#Merter
ile
İncirli
http://t.co/RfoTBrLPFy

As seen in the examples, each of the words containing an apostrophe ("Tv’de", "Taksim’e", "Merter’de") is treated as a single token, because the "de" and "e" suffixes change the details of the POS-tags of those tokens, and they are not meaningful as tokens by themselves for TRmorph. Hashtags and mentions are not removed from the tokens, and URLs are recognized as single tokens.

The output of the tokenization process is a file containing a single token per line, with an empty line between sentences; this file is the input for the Morphological Analysis phase.
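The thesis does not list the tokenizer's actual regular expressions, so the following is only a minimal sketch of the behavior described above; the class name and the pattern itself are assumptions, not the thesis implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the Java-regex-based tokenizer: URLs, mentions,
// hashtags, and apostrophe-bearing words survive as single tokens, while
// other punctuation and whitespace are dropped.
public class TweetTokenizer {

    private static final Pattern TOKEN = Pattern.compile(
            "https?://\\S+"                              // URLs stay whole
          + "|[@#][\\p{L}\\p{N}_]+"                      // mentions and hashtags stay whole
          + "|[\\p{L}\\p{N}]+(?:['\u2019][\\p{L}]+)?");  // words, possibly with an apostrophe suffix

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());  // punctuation and whitespace between matches are skipped
        }
        return tokens;
    }
}
```

With this sketch, the first example sentence yields "Tv'de" and "Taksim'e" as single tokens, and the semicolon and final period are discarded, matching the token lists shown above.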

4.2 Morphological Analysis

The morphological analysis phase takes the output of the tokenization process and morphologically analyzes each token. This is the phase where the TRmorph morphological analyzer tool is used.

The morphological analysis phase consists of two steps:

• Morphological Analysis (TRmorph Morphological Analyzer)

• Morphological Disambiguation (TRmorph Morphological Disambiguator)

As a preparation for the morphological analysis step, every letter of each token is capitalized, and each capitalized token is written on its own line in a file. This modification is needed because TRmorph works well with all-letter-capitalized words, and it does not generate POS-tags for a token which does not start with a capital letter, even if it is a proper noun.
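In Java, this capitalization step has to be locale-aware; a minimal sketch (the class name is illustrative) is:

```java
import java.util.Locale;

// Illustration of the capitalization preparation, assuming Java's
// locale-sensitive case conversion is used: upper-casing with the Turkish
// locale preserves the dotted/dotless i distinction (i -> İ and ı -> I),
// whereas the root locale would map i -> I and lose the distinction.
public class Capitalizer {

    private static final Locale TURKISH = new Locale("tr", "TR");

    public static String capitalize(String token) {
        return token.toUpperCase(TURKISH);
    }
}
```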

The second preparation is made in the TRmorph implementation. Instead of giving each token by itself to the analyzer, or giving a group of tokens, the TRmorph analyzer is modified to be able to read the input file and analyze each token in the file.

After these preparations, the pre-processed file is given to the TRmorph Morphological Analyzer. In the analysis phase, TRmorph processes each token individually. The analysis of a token can return multiple results, since a word can have multiple interpretations in Turkish when it is not processed within a context. For example, while analyzing the sentence "Kızıl gezegen hala sırlarla dolu.", TRmorph gives the following results for the word "kızıl":

kızıl kızıl
kızıl kız<2s>
kızıl kızıl<0><3p>
kızıl kızıl<0>
kızıl kızıl<0><0>
kızıl kızıl<0><0><3p>
kızıl kızıl<0><0><3s>
kızıl kızıl<0>
kızıl kızıl<0><3s>

However, the word is certainly used in a way that corresponds to only one of those morphological analysis results. This is why the morphological disambiguator is used after the morphological analysis. The morphological disambiguator takes the output of the analyzer and tries to find the correct analysis by taking the other tokens in the same sentence into consideration. For the word "kızıl", the disambiguator gives the following result by evaluating the token within the sentence to which it belongs, "Kızıl gezegen hala sırlarla dolu.":

kızıl kızıl

Finally, after taking the disambiguator results, the output file is prepared. Each line of the file includes the word itself and its morphological results, separated by whitespace. The example output file for the sentence above is as follows:

kızıl Adj
gezegen N
hala N
sırlarla N pl ins
dolu Adj

A complete list of the POS-tags that TRmorph generates can be found in the manual [6].

Because of the informal nature of Twitter text, TRmorph naturally does not have the ability to recognize informally written tokens. Moreover, it may recognize some tokens wrongly, since the informally written form of a token can be the correct form of another word. For example, the result for the above sentence, if it were written informally, would be the following:

kizil ???
gezegen gezegen
hala hala → WRONG! (Since it is mistyped, it means "aunt")
sirlarla ???
doluuuu ???

Since there is too much informality in the input files, a normalization phase is necessary. Therefore, the morphological analysis phase runs one more time in the Normalization part, after the informal-to-formal conversion; however, it runs slightly differently. This re-run step is explained in Section 4.3.

4.3 Normalization

The normalization phase generates possible formal forms of the tokens for which morphological analysis cannot produce results. At this stage, the two most common situations in informal Turkish writing are handled:

• Multi-Character Check: Checking for a letter which is written repeatedly within a word.

• En-Tr Character Check: Checking for interchangeably written Turkish and English characters.

The normalization phase gives all possible formal forms of an informally written word, obtained after these two stages, to the Morphological Process part. The first possibility for which the Morphological Analysis produces a result is accepted as the correct form and added to the main input file for the CRF.

4.3.1 Multi Character Check

According to the morphological features of the Turkish language, a letter can be repeated at most twice in a row within a word. However, on Twitter, in order to emphasize some emotions or situations, some characters in a token can be used more than two times. This is one of the most frequently encountered situations in social media data such as tweets, SMS messages, etc. Some example words of this kind are given below:

Seviyoruuummm
Anneee
Hoşbulduuuk
Geliyoooor

Possible correct forms of these words are generated by keeping each repeating character once and twice in the word. All possible generations for the above examples after the multi-character check are given in Table 4.1.

Table 4.1: Possible Words After Multi-Character Check

Seviyoruuuummm   Anneee   Hoşbulduuuk   Geliyoooor
Seviyorum        Ane      Hoşbulduk     Geliyor
Seviyorumm       Anne     Hoşbulduuk    Geliyoor
Seviyoruum       Anee
Seviyoruumm      Annee

All of these possibilities are given to the morphological analysis phase, and the four correct words (Seviyorum, Anne, Hoşbulduk, Geliyor) are found with POS-tags. However, after the multi-character checking process, the possible versions are not given to the morphological process directly. They are first given as input to the English-Turkish character checking process, which generates all possible versions for each of the tokens produced by this step.
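The candidate generation of Table 4.1 can be sketched as follows; the class and method names are illustrative, not the thesis implementation. Every run of a repeated letter branches into one and two occurrences, and all combinations are produced:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of the multi-character check: each run of identical letters is kept
// with one occurrence, and with two occurrences if it was repeated, so e.g.
// "Anneee" yields Ane, Anee, Anne, Annee as in Table 4.1.
public class MultiCharCheck {

    public static List<String> candidates(String token) {
        Set<String> out = new LinkedHashSet<>();
        generate(token, 0, new StringBuilder(), out);
        return new ArrayList<>(out);
    }

    private static void generate(String s, int i, StringBuilder prefix, Set<String> out) {
        if (i == s.length()) {
            out.add(prefix.toString());
            return;
        }
        char c = s.charAt(i);
        int j = i;
        while (j < s.length() && s.charAt(j) == c) j++;  // find the end of the run
        int run = j - i;
        // keep the run with one occurrence, and with two if it was repeated
        for (int keep = 1; keep <= Math.min(run, 2); keep++) {
            int len = prefix.length();
            for (int k = 0; k < keep; k++) prefix.append(c);
            generate(s, j, prefix, out);
            prefix.setLength(len);
        }
    }
}
```

For "Geliyoooor" this produces exactly the two candidates of Table 4.1, "Geliyor" and "Geliyoor", from which the morphological analyzer later selects the valid one.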

4.3.2 En – Tr Character Check

Another common informal writing habit is using some Turkish and English characters interchangeably. Some of the letters in the Turkish alphabet, such as "ı", "ğ", and "ö", do not have equivalent English letters; instead of these letters, "i", "g", and "o" are used.

In this stage of the normalization, possible correct forms are generated by interchanging these characters with their corresponding versions. A complete list of these letters and character groups is given in Table 4.2.

If there are tokens output by the multi-character checking stage, each of them is analyzed in this stage. If a token that is not recognized by TRmorph does not contain repeated characters, the original word itself is given to this phase. Example

38 Table 4.2: Interchangeable Characters List for English-Turkish Character Check

ç → c    Ç → C    c → ç    C → Ç
ı → i    I → İ    i → ı    İ → I
ğ → g    Ğ → G    g → ğ    G → Ğ
ö → o    Ö → O    o → ö    O → Ö
ş → s    Ş → S    s → ş    S → Ş
ü → u    Ü → U    u → ü    U → Ü
x → ks   X → KS   w → v    W → V
q → k    Q → K    sh → ş   SH → Ş
e → E    $ → Ş

informal words are listed below. Figure 4.2 and Figure 4.3 show the possible word generation flow for two of those example words.

Aksam (instead of Akşam)
Taxi (instead of Taksi)
Kizil (instead of Kızıl)
Diger (instead of Diğer)
Sewiyorum (instead of Seviyorum)
Istanbuuul (instead of İstanbul)

Figure 4.2: Example which does not have a repeated character in its initial form

Figure 4.3: Example which has a repeated character in its initial form

The output of the multi-character check stage is the same as the original word itself for the token in Figure 4.2, since there is no repeating character in it. The output of the second stage generates three possibilities. As a result, the Morphological Analysis phase produces POS-tag(s) for only the word "Kızıl", which is the true form of the original word.

As shown in Figure 4.3, the output of the multi-character check stage for the second example, "Istanbuuuul", generates two possible words, which contain the repeating character "u" once and twice. These two words are given to the En-Tr Character Checking stage as input, and the output of this stage includes all possibilities according to the character conversions given in Table 4.2. At the end, the Morphological Analysis phase produces POS-tag(s) for only the word "İstanbul", which is the true form of the original word. In this Morphological Analysis phase, TRmorph chooses the first possible form for which it can produce POS-tags as the true form of the token.
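The En-Tr candidate generation can be sketched in the same recursive style; this is a hypothetical illustration covering only the lowercase letter mappings of Table 4.2, with the remaining rows (uppercase letters, "x" → "ks", "$" → "Ş", etc.) handled analogously. The original token is kept among the candidates:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the En-Tr character check: for every mappable character,
// candidates are produced both with and without the Turkish substitution.
public class EnTrCheck {

    private static final Map<Character, Character> EN_TO_TR = Map.of(
            'c', '\u00e7',   // c -> ç
            'i', '\u0131',   // i -> ı
            'g', '\u011f',   // g -> ğ
            'o', '\u00f6',   // o -> ö
            's', '\u015f',   // s -> ş
            'u', '\u00fc');  // u -> ü

    public static List<String> variants(String token) {
        Set<String> out = new LinkedHashSet<>();
        expand(token, 0, new StringBuilder(), out);
        return new ArrayList<>(out);
    }

    private static void expand(String s, int i, StringBuilder sb, Set<String> out) {
        if (i == s.length()) {
            out.add(sb.toString());
            return;
        }
        char c = s.charAt(i);
        sb.append(c);                     // keep the character as written
        expand(s, i + 1, sb, out);
        sb.setLength(sb.length() - 1);
        Character tr = EN_TO_TR.get(c);
        if (tr != null) {                 // or substitute its Turkish counterpart
            sb.append(tr);
            expand(s, i + 1, sb, out);
            sb.setLength(sb.length() - 1);
        }
    }
}
```

For "Kizil" this yields four candidates (the original plus the three substituted forms mentioned for Figure 4.2), among them the correct "Kızıl", which the morphological analyzer then confirms.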

All of the processing in the tokenization, morphological analysis, and normalization phases could have been achieved with the Zemberek-NLP tool. Zemberek can tokenize, normalize, and then morphologically analyze a sentence. However, for the normalization process, Zemberek gives a large number of suggestions, and it does not convert some symbols, such as "$" and "e", to "Ş" and "E"; therefore, it is hard to select the correct form of an informal token, and it generates wrong morphological features for tokens including the "$" and "e" characters. Moreover, it is not capable of normalizing tokens which contain capital letters and punctuation marks. For the tokenization task, Zemberek splits tokens which contain hashtags and mentions and, most importantly, it splits URLs. Considering the disadvantages of Zemberek in the normalization and tokenization processes, we implement our own tokenization and normalization phases. For morphological analysis, TRmorph is selected.

4.4 Extraction of Location Named Entities with CRFs

Figure 4.4: Training and Testing Phase

The final step in the workflow is to train a supervised learning model and to test the model. In the previous phases, a training data set is prepared; however, to obtain more accurate test results, more features need to be generated. In this phase, features are generated with the appropriate MALLET pipes. The previously prepared training and test data sets include the token itself and its morphological analysis results as features. These data sets are given to MALLET as input, and some new features, which will hopefully increase the accuracy in the end, are added with MALLET pipes. All pipes and features are described below in the order of processing within MALLET, and shown in Figure 4.5:

Figure 4.5: MALLET Pipes used in Proposed System

• SimpleTaggerSentence2TokenSequence(): It converts a string consisting of lines of whitespace-separated tokens, or a two-dimensional string array, into a TokenSequence, which is a format that MALLET can process. Both structures actually represent rows of tokens. This pipe also extracts the labels of the elements in the training data set to create a target label sequence. The last token in each row is selected as the target label of the sequence in that line, whereas all other tokens in the same line are the features of the element represented by that row. In this study, three types of target labels are used, namely B-LOC, I-LOC, and O.

B-LOC and I-LOC represent tokens which belong to a Location Named Entity, whereas O represents a token which is not part of a named entity referring to a location. B and I stand for the Beginning token and an Inner token of an entity, respectively.

• RegexMatches(Tag, Regex): This pipe is used to tag tokens which match a regular expression pattern with a user-defined label. Tag, the first argument of the pipe, is the label, and the other argument, Regex, is a pattern. If a token matches the Regex, then the token is labeled with the Tag. For example, one of the RegexMatches pipes used in this study is:

pipes.add(new RegexMatches("LOCATION", Pattern.compile("^.+[Dd]eniz.*")));

This pipe adds "LOCATION" label as a feature for a token which satisfies the Pattern:

("^.+[Dd]eniz.*")

The pattern extracts each word that contains "deniz" or "Deniz" in it, and the pipe adds the "LOCATION" feature to those tokens. However, a token to be tagged as "LOCATION" cannot start with "deniz" or "Deniz", since the purpose is to add the feature to tokens such as "Karadeniz", "Akdeniz", etc. Words with "deniz" at the beginning probably refer to a person, or to the word "sea".

With the help of this pipe, tokens which contain deniz, şehir, köy, kuzey, güney, istan, etc. are given the "LOCATION" feature. However, this is of course not a final decision. The target tags in this study are O, B-LOC, and I-LOC. The "LOCATION" label is just a feature which increases the probability of the token being a location entity. These tokens are labeled because, in Turkish, words such as deniz, köy, and şehir are frequently used in location names; "Karadeniz", "Kırşehir", and "Mecidiyeköy" are some examples. Tokens such as "cadde" and "sokak" are probably also parts of location named entities, and it is worth tagging them with the "LOCATION" feature, too. Lastly, tokens which include "istan" are also tagged by this pipe, since in Turkish some country names end with this suffix, such as "Afganistan", "Hindistan", "Kazakistan", etc.

As a result, the "LOCATION" feature tells the CRF that there might be a location named entity around this feature.

• OffsetConjunctions(conjunctions): OffsetConjunctions takes a two-dimensional integer array as an argument. For example:

int[][] conjunctions = new int[4][]; conjunctions[0] = new int[] { -1 }; conjunctions[1] = new int[] { 1 }; conjunctions[2] = new int[] { -2 }; conjunctions[3] = new int[] { 2 };

According to this example, when a token is evaluated in this pipe, the pipe adds the features of the previous token, the next token, the second previous token, and the second next token to the feature list of the current token. An example of the resulting syntax after this pipe executes for the token "dünya", where the sentence is "Bulutlar bu dünya için fazla güzeller", is:

fazla@2 Adj@2 Bulutlar@-2 N@-2 pl@-2 icin@1 N:prop@1 p2s@1 bu@-1 Det:def@-1 Prn:pers:3p dünya

In this notation, fazla@2 shows that the second next token is "fazla", Adj@2 shows that the second next token is an adjective, N@-2 reveals that the second previous token is a noun, and so on.

• TokenTextCharSuffix(Tag, Position): This pipe is used to add a suffix feature to the token. It is similar to TokenTextCharPrefix, which is used to add prefix features. Since Turkish is an agglutinative language, TokenTextCharPrefix is not employed in this study, whereas TokenTextCharSuffix is used three times.

pipes.add(new TokenTextCharSuffix("C1=", 1)); pipes.add(new TokenTextCharSuffix("C2=", 2)); pipes.add(new TokenTextCharSuffix("C3=", 3));

These pipes add the last character, the last two characters, and the last three characters to the feature list of a token. For example, C3=dan, C2=an, and C1=n are the suffix features for the token "birazdan". These features are used to detect whether the token is in the dative, locative, ablative, etc. form.

• TokenFirstPosition(Tag): It adds Tag, the argument of the pipe, as a feature to the token in the first position of a sequence. A training or test data set consists of token sequences separated by empty lines. Each token at the first position of a token sequence has a feature named "FIRSTTOKEN".

• TokenTextCharNGrams(Tag, nGrams): TokenTextCharNGrams takes all n consecutive characters out of the token, and each of these substrings is added to the feature vector of that token as a feature. This pipe takes, as an argument, an integer array containing the n-values, and it extracts all n-consecutive-character substrings for each value in the array. The argument named Tag is used as the prefix of the extracted features. For example, for the token "harikası", when the nGrams array includes 3 and 7 as n-values and Tag is "TTCNG_", the pipe adds the following features to the feature vector of this token.

int[] nGrams = new int[] {3, 7} pipes.add(new TokenTextCharNGrams("TTCNG_", nGrams))

The Result for the token "harikası":

TTCNG_arikası TTCNG_harikas TTCNG_ası TTCNG_kas TTCNG_ika TTCNG_rik TTCNG_ari TTCNG_har

• PrintTokenSequenceFeatures(trainingFilename): It prints all tokens gathered from the training and test data sets, together with all features of each token, both those already in the input data set and those generated with MALLET pipes. In this workflow, this pipe is used just before the conversion from token sequence to feature vector sequence. This pipe has no effect on the CRF results; it just lists the tokens and their features to make the process traceable. With the help of this pipe, the tokens and their features, including the features generated by MALLET, for the sentence "Beşiktaş diyorum iyiki var." can be seen below:

– TTCNG_eşiktaş TTCNG_Beşikta TTCNG_taş TTCNG_kta TTCNG_ikt TTCNG_şik TTCNG_eşi TTCNG_Beş FIRSTTOKEN C3=taş C2=aş C1=ş iyiki@2 @-2 diyorum@1 V@1 cont@1 1s@1 @-1 N:prop Beşiktaş

– TTCNG_diyorum TTCNG_rum TTCNG_oru TTCNG_yor TTCNG_iyo TTCNG_diy C3=rum C2=um C1=m var@2 Exist@2 @-2 iyiki@1 Beşiktaş@-1 N:prop@-1 1s cont V diyorum

– TTCNG_iki TTCNG_yik TTCNG_iyi C3=iki C2=ki C1=i .@2 Punc@2 Beşiktaş@-2 N:prop@-2 var@1 Exist@1 diyorum@-1 V@-1 cont@-1 1s@-1 iyiki

– TTCNG_var C2=ar C1=r @2 diyorum@-2 V@-2 cont@-2 1s@-2 .@1 Punc@1 iyiki@-1 Exist var

– @2 iyiki@-2 @1 var@-1 Exist@-1 Punc .

• TokenSequence2FeatureVectorSequence(): It converts the feature sequence of each instance into a feature vector sequence which can be processed in the CRF training part.

These pipes are added into a pipe ArrayList, and then a Pipe object is defined as SerialPipes, which takes the pipe ArrayList as an argument. SerialPipes makes MALLET execute these pipes serially. After that, an InstanceList object is created for the training data set and another InstanceList for the test data set with this pipe. This means that both the training data set and the test data set are passed through the same pipe sequence, since all tokens in the training and test data sets have to have the same kinds of features. However, MALLET does not require each token to have the same number of features. For example, one token has the "LOCATION" feature because it matches one of the regular expressions, whereas another token does not have this feature because it matches none of them. Therefore, the correct phrasing is that all tokens have to pass through the same analysis.
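To make the generated features concrete, a hand-rolled sketch of three of the pipes described above is given below; the class is hypothetical, and in practice MALLET's own pipe classes (RegexMatches, TokenTextCharSuffix, TokenTextCharNGrams) do this work:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative re-implementation of three feature generators; names are
// assumptions, not MALLET APIs.
public class FeatureSketch {

    private static final Pattern DENIZ = Pattern.compile("^.+[Dd]eniz.*");

    // RegexMatches("LOCATION", ...): the feature fires for "Karadeniz"
    // but not for the bare name "Deniz", as discussed above.
    public static boolean hasLocationFeature(String token) {
        return DENIZ.matcher(token).matches();
    }

    // TokenTextCharSuffix("Cn=", n): the last n characters of the token.
    public static String suffixFeature(String token, int n) {
        return "C" + n + "=" + token.substring(Math.max(0, token.length() - n));
    }

    // TokenTextCharNGrams(tag, {n}): all n-character substrings of the token.
    public static List<String> charNGrams(String token, String tag, int n) {
        List<String> feats = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            feats.add(tag + token.substring(i, i + n));
        }
        return feats;
    }
}
```

For instance, suffixFeature("birazdan", 3) yields the "C3=dan" feature shown earlier, and charNGrams enumerates the "TTCNG_"-prefixed substrings in the feature listings above.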

After preparing the test and training data sets, the next operation is training the CRF. MALLET has a CRF class and several kinds of training methods. After creating the CRF object, a trainer is constructed, which is necessary for the training and testing processes. In this study, the CRF is trained with stochastic gradient, label likelihood, and value gradient trainers separately.

CRF constructs a finite state machine (FSM) [23] in order to model the unstructured data. Finite state machines consist of states which describe the status of the system; the system can be in only one state at a time. Each state in the constructed FSM corresponds to one of the feature labels in the training token sequences, and the stop states are the target labels (B-LOC, I-LOC, and O). After constructing the finite state machine, the connections between the states are defined. In this study, fully connected finite state machines are used. Connections are undirected and weighted; the weights are crucial for the input feature/transition combinations. Weights are specified with the setWeightsDimensionAsIn(trainingFile) method, in which only the parameters observed in the training data set are used for input feature/transition combinations.

After defining the finite state machine properties, such as its connections and weights, a start state is finally added to the model with MALLET's crf.addStartState() method. These are the general settings for all CRF trainers in MALLET; once they are done, a trainer for the CRF can be selected. The workflow of the training and testing process in MALLET is shown in Figure 4.6.

First, a stochastic gradient trainer is selected for the experiments in this study. The trainer definition is as follows:

Figure 4.6: Training and Testing Process in MALLET

CRFTrainerByStochasticGradient crft = new CRFTrainerByStochasticGradient(CRF, LearningRate);

To define a stochastic gradient trainer, a CRF object (defined previously) and a learning rate are needed. The learning rate determines how fast or slowly the system moves towards the optimal weights. If the learning rate is too large, the optimal solution can be missed; however, if it is too small, too many iterations are needed to converge to the optimal solution. Therefore, it is very important to use an optimal learning rate. It can be given by the user, or this responsibility can be delegated to MALLET. In this study, the trainer definition is the following:

CRFTrainerByStochasticGradient crft = new CRFTrainerByStochasticGradient(CRF, TrainFile);

With this definition, MALLET calculates the learning rate best suited to this training data set instead of requiring it to be given manually.
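The learning-rate trade-off described above can be illustrated with a toy one-dimensional gradient descent (illustrative only; this is not the CRF objective). On f(w) = (w - 3)^2, a small rate converges slowly, a moderate rate converges quickly, and a too-large rate overshoots and diverges.

```java
// Toy illustration of the learning-rate trade-off on f(w) = (w - 3)^2
// (illustrative only; not the CRF training objective).
public class LearningRateDemo {
    static double descend(double rate, int steps) {
        double w = 0.0;
        for (int i = 0; i < steps; i++) {
            double gradient = 2 * (w - 3); // derivative of (w - 3)^2
            w -= rate * gradient;          // gradient descent update
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(descend(0.01, 50)); // still far from 3: too slow
        System.out.println(descend(0.4, 50));  // close to the optimum 3
        System.out.println(descend(1.1, 50));  // overshoots and diverges
    }
}
```

This is why letting MALLET estimate a suitable learning rate from the training data is attractive in practice.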

After creating the CRF trainer and setting all necessary properties, the trainer can train and test. When the CRF finishes its job, it outputs the tagged sequences.

Secondly, the system is experimented with using a value gradient trainer. The construction of the finite state machine, the calculation of the weights, and the addition of a start state are the same as in the case where the system uses a stochastic gradient trainer.

Since the value gradient trainer can combine more than one objective function, the objective functions that are not used in the other experiments, namely batch label likelihood and entropy regularization, are employed in the value gradient experiment.

One optimizable is defined for entropy regularization and the other for batch label likelihood. Since batch label likelihood is the threaded version of label likelihood, a threaded optimizable has to be defined for it. The definitions are shown below:

CRFOptimizableByBatchLabelLikelihood batchLabel = new CRFOptimizableByBatchLabelLikelihood (crf, trainingInstances, 1);

CRFOptimizableByEntropyRegularization entropyLabel = new CRFOptimizableByEntropyRegularization (crf, trainingInstances);

ThreadedOptimizable optLabelBatch = new ThreadedOptimizable(batchLabel, trainingInstances, crf.getParameters().getNumFactors(), new CRFCacheStaleIndicator(crf));

Since the data set used in the experiments is not large enough to parallelize, the number of batches is set to 1. This means that batchLabel is effectively a label likelihood optimizable. After defining the optimizables, a CRFTrainerByValueGradients object is defined by combining those two optimizables, and then the CRF is trained with it. The trainer definition:

Optimizable.ByGradientValue[] opts = new Optimizable.ByGradientValue[] {optLabelBatch, entropyLabel};

CRFTrainerByValueGradients crfTrainer = new CRFTrainerByValueGradients(crf, opts);
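The combination performed above can be sketched conceptually: a value-gradient trainer combines several objectives by summing their gradients and following the combined gradient. This is an illustrative sketch, not MALLET's API; the toy one-dimensional objectives below merely stand in for batch label likelihood and entropy regularization.

```java
// Illustrative sketch of combining several objective functions by
// summing their gradients (not MALLET's CRFTrainerByValueGradients API).
public class CombinedObjective {
    interface Objective {
        double gradient(double w);
    }

    // hypothetical stand-in for the batch label-likelihood objective
    static final Objective LIKELIHOOD = w -> -2 * (w - 2);
    // hypothetical stand-in for the entropy-regularization objective
    static final Objective ENTROPY = w -> -w;

    static double combinedGradient(double w, Objective... objectives) {
        double g = 0.0;
        for (Objective o : objectives) {
            g += o.gradient(w); // sum the gradients of all objectives
        }
        return g;
    }

    public static void main(String[] args) {
        System.out.println(combinedGradient(0.0, LIKELIHOOD, ENTROPY)); // 4.0
    }
}
```

The trainer then ascends (or descends) along this combined gradient, so both objectives shape the learned weights simultaneously.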

The third experimented trainer is the label likelihood trainer. The same finite state machine construction process is also executed in the label likelihood trainer implementation. After constructing the FSM, a label likelihood trainer is defined and the CRF is trained with it:

CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);

The label likelihood trainer has an option to use the "unsupported trick", which aims to intelligently add a few weights for features that were not observed in the training data. For example, in [37], 3.8 million binary features are used; however, the training data used in the proposed system does not include many of these features. The label likelihood trainer chooses a subset of such features and uses them as unsupported features. While training with the unsupported trick option, the CRF model is first trained for a few iterations (so that the model is not fully trained), and then the trainer adds unsupported features to all cliques. At this stage, the model does not yet have the correct answer for all cliques. After adding the unsupported features, the trainer continues to train with the features specified in the training data [40].

Before training the CRF, this option is set so that the trainer will use the unsupported trick:

trainer.setUseSomeUnsupportedTrick(true);

CHAPTER 5

EXPERIMENTS

5.1 Introducing the Datasets

In the experiments, we used two data sets. The first one is from [15]. It contains 2320 tweets and 282 toponyms. We used this data set for the experiments aimed at finding the best features, training and test data set sizes, and trainer combination. This data set is divided into two parts to be used as training and test sets.

The second data set is from [29]. We used a subset that consists of 1027 tweets and 446 toponyms. We used this data set to analyze the effect of using a much larger number of toponyms for both training and testing. This data set is also divided into two parts in order to create training and test sets.

The first data set [15] was obtained from Twitter over a short time period. Therefore, it does not have a wide variety of toponyms. The second data set, however, was obtained from Twitter over a long time period and contains toponyms from different topics. It is taken from [29], an event detection study using clustering algorithms. Since it was prepared for use in an event detection study, it contains tweets referring to different events; hence, the second data set has a wide variety of toponyms.

5.2 Introducing the Experiments

In this section, several experiments are described. First, the defined features are grouped into sets in order to analyze their effects on the accuracy of the system; the aim is to find the feature combination that gives the highest F-measure value. After finding the best feature combination, the training mechanisms provided by MALLET are experimented with to show their effects on the system results. Finally, to investigate the effects of the training data set size and the number of toponyms used for training, training and test data sets of different sizes are created and experimented with. For all these experiments, the first data set [15] is used. After finding the best system combination, it is also experimented with on the second data set [29]; a subset of this data set is annotated and then divided into two parts, as with the first one. Finally, the proposed system is compared with two existing NER systems.

In the experiment results, F-measure is used to analyze the accuracy of the system. F-measure is a measure of the accuracy of a test, calculated as the harmonic mean of the precision and recall metrics:

F_measure = 2 · (precision · recall) / (precision + recall)    (5.1)

Precision represents the proportion of correctly recognized entities over all recognized entities, whereas recall represents the proportion of correctly recognized entities over the total number of toponyms in the test data set.
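These definitions can be computed directly from the true/false positive/negative counts. The sketch below uses as sample values the FeatureSet_8 counts reported later in Table 5.11 (TP = 42, FP = 10, FN = 48, with 90 toponyms in the test set):

```java
// Precision, recall, and F-measure from entity counts, as used
// throughout the experiment tables in this chapter.
public class Metrics {
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }

    static double recall(int tp, int fn) { return (double) tp / (tp + fn); }

    static double fMeasure(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return 2 * p * r / (p + r); // harmonic mean of precision and recall
    }

    public static void main(String[] args) {
        // FeatureSet_8 counts from Table 5.11: TP = 42, FP = 10, FN = 48
        System.out.printf("P=%.6f R=%.6f F=%.6f%n",
                precision(42, 10), recall(42, 48), fMeasure(42, 10, 48));
    }
}
```

Running this reproduces the FeatureSet_8 row of Table 5.13 (precision 0.807692, recall 0.466666, F-measure 0.591549).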

5.2.1 Feature Selection Experiments

In this set of experiments, in order to find the best set of features for accuracy, we analyze the performance of different feature sets. In Table 5.1, we list these feature sets and their definitions.

POS-tagging and token character suffix features can be used interchangeably. There are several studies in the literature that aim to find the POS-tags of unknown words, probably belonging to an unknown language; [42] and [41] are examples of such studies. Moreover, some POS-taggers use affix features to be able to generate POS-tags. Therefore, the POS-tags generated by TRmorph and the suffixes generated with the Mallet pipe are compared in terms of their accuracy. In FeatureSet_1, only POS-tags are used as features for location extraction, and in FeatureSet_2, suffix and n-gram features are experimented with. However, the results change according to the suffix size and the n values used in the n-grams. Therefore, three different feature sets are prepared to analyze the effects of these parameters. FeatureSet_2 uses the suffixes of the last one, two, and three characters, and it uses 3 and 7 for the n values. In the third set, the n values are changed from 3, 7 to 3, 4, 5. In FeatureSet_4, the n values giving the best result are used together with suffixes of the last one to five characters.

Table 5.1: Definitions of Feature Sets

Feature Set   Definition
FeatureSet_1  POS-Tags
FeatureSet_2  Suffix (suffixes of last 1, 2, 3 characters) + Text Character N-Grams (n-values = 3, 7)
FeatureSet_3  Suffix (suffixes of last 1, 2, 3 characters) + Text Character N-Grams (n-values = 3, 4, 5)
FeatureSet_4  Suffix (suffixes of last 1, 2, 3, 4, 5 characters) + Text Character N-Grams (best n-values from FeatureSet_2 and FeatureSet_3)
FeatureSet_5  POS-Tags + BestResultFrom(FeatureSet_2, FeatureSet_3, FeatureSet_4)
FeatureSet_6  BestResultFrom(FeatureSet_1, FeatureSet_2, FeatureSet_3, FeatureSet_4, FeatureSet_5) + Conjunction (-1, +1)
FeatureSet_7  BestResultFrom(FeatureSet_1, FeatureSet_2, FeatureSet_3, FeatureSet_4, FeatureSet_5) + Conjunction (-2, -1, +1, +2)
FeatureSet_8  BestResultFrom(FeatureSet_6, FeatureSet_7) + RegexMatches
FeatureSet_9  FeatureSet_8 + FirstToken

FeatureSet_5 is the combination of FeatureSet_1 and the best suffix and n-gram combination. Since TRmorph cannot give POS-tags for some informal tokens (those that cannot be corrected in the normalization phase), in FeatureSet_5, POS-tags are used together with suffix and text character n-gram features in order to generate morphological features for the tokens that do not have POS-tags.

After comparing the first five feature sets, the feature combination with the best result among them is chosen, and FeatureSet_6 and FeatureSet_7 are composed by adding conjunction features to that combination. In the sixth and seventh feature sets, the effect of the conjunction window size on the accuracy of the system is experimented with. In FeatureSet_6, the conjunction window is (-1, +1), and the system takes the features of the previous and the next tokens into consideration while analyzing a token. In FeatureSet_7, the conjunction window is set to (-2, -1, +1, +2), which also considers the second-previous and second-next tokens. FeatureSet_8 is the expanded version of the best-performing feature set so far, with the addition of the features obtained from the regular expressions. Lastly, FeatureSet_9 contains all of the features generated in this study.
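The conjunction idea can be sketched as follows. This is a simplified illustration, not MALLET's implementation; the feature names and offset prefixes are hypothetical. For each token, the features of the tokens at the given window offsets are copied in, prefixed with their offset.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of conjunction features: copy neighbors' features into the
// current token, tagged with their offset (illustrative names only).
public class ConjunctionSketch {
    static List<String> conjunction(List<List<String>> tokenFeatures,
                                    int index, int[] offsets) {
        List<String> expanded = new ArrayList<>(tokenFeatures.get(index));
        for (int off : offsets) {
            int j = index + off;
            if (j < 0 || j >= tokenFeatures.size()) continue; // outside the sequence
            for (String f : tokenFeatures.get(j)) {
                expanded.add("@" + off + ":" + f); // neighbor feature, offset-tagged
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        List<List<String>> seq = List.of(
                List.of("W=Efor"), List.of("W=Kuaför"), List.of("W=Istanbul"));
        // window (-1, +1), as in FeatureSet_6
        System.out.println(conjunction(seq, 1, new int[]{-1, 1}));
    }
}
```

With the (-1, +1) window, the middle token of "Efor Kuaför Istanbul" also receives the word features of its two neighbors, which is what helps recognize multi-token toponyms.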

5.2.2 Experimental Evaluation on CRF Learners

While the feature sets are being experimented with, Mallet uses the stochastic gradient trainer with the CRF. After choosing the feature set giving the best result, the value gradient trainer and the label likelihood trainer are experimented with using the same feature set. Finally, the training mechanism that is more successful than the others is chosen for the following experiments.

5.2.3 Experimental Evaluation on the size of Training and Test Data Sets

This set of experiments is conducted to determine the best sizes for the training and test data sets. The general expectation before the experiments was that the more token sequences the training data set has, the higher the accuracy obtained. With Experiment_1, Experiment_2 and Experiment_3, it is investigated how the number of toponyms in the training and test data sets affects the system output. The data set contains 2320 tweets; in other words, there are 2320 token sequences in the data set. Training and test data sets are created in different sizes. While creating the data sets, the number of toponyms in those sets is important: to be able to analyze the results of the feature sets and trainers accurately, both the training and test data sets should contain a reasonable number of toponyms. The number of tweets and the number of toponyms in each training and test data set for each experiment are as follows:

• Experiment_1: The training data set contains 1670 tweets (71.99%) and the test data set contains 650 tweets (28.01%). The number of toponyms in the training data set is 192; in the test data set, it is 90.

• Experiment_2: The training data set contains 1160 tweets (50.0%) and the test data set contains 1160 tweets (50.0%). The number of toponyms in the training data set is 144; in the test data set, it is 138.

• Experiment_3: The training data set contains 800 tweets (34.5%) and the test data set contains 1520 tweets (65.5%). The number of toponyms in the training data set is 103; in the test data set, it is 179.

5.2.4 Experimental Analysis for Accuracy

Lastly, the proposed system is experimented with on the second data set, which includes fewer tweets than the first data set but many more toponyms. In other words, the toponym density is quite high in the second data set. In this experiment, the best system configuration found in the previous experiments is used.

5.2.5 Experimental Comparison with Related Work

After conducting all experiments, the output of the combination giving the best result is compared with the results of some previous studies. First, the proposed system is compared with the Turkish NLP Pipeline [3], and second, with the system given in [16].

5.3 Experiments and Results

In this section, the results of the experiments described in Section 5.2 are given. First, the feature sets are experimented with. For each experiment, the Precision, Recall, F-Measure and Accuracy metrics are calculated, and the numbers of true positives, true negatives, false positives and false negatives are also given in a table. The results are evaluated using CoNLL-style calculations.

5.3.1 Feature Selection Experiments

In this group of experiments, which analyze changes to the features, a stochastic gradient trainer is used, and all metrics obtained from each feature set experiment are given in Table 5.13. The training and test data sets of Experiment_1 are used for these experiments.

The first experiment uses only POS-tags to extract toponyms, and it is run twice in order to show the impact of the normalization phase on the informal data. Table 5.2 shows the exact numbers of true/false positive/negative entities for each run.

Table 5.2: FeatureSet_1 Results

                                 TP    TN    FP    FN
FeatureSet_1 (Normalization)     31    6037    5    59
FeatureSet_1 (No Normalization)  23    6037    5    67

When there is no normalization phase, TRmorph cannot find a POS-tag in the Morphological Analysis and Disambiguation phase for some tokens that are written informally. The second run is conducted to measure the effect of the normalization phase; its accuracy results are given in Table 5.3.

Table 5.3: Metrics Obtained For FeatureSet_1 With No Normalization

              Precision   Recall     F-Measure   Accuracy
FeatureSet_1  0.821428    0.255555   0.389830    0.988250

The normalization phase makes some adjustments to the initial data. Some tokens that are toponyms, or that are not toponyms but affect the feature vector of the token sequence (which is important for sequence labelling), are recovered in the normalization phase, and this process increases the number of recognized toponyms.

As an example, the token "Istanbul" in the following sentence has no POS-tag, since it is written with "I" instead of "İ". Therefore, it is not found as a toponym in the second run of the experiment, whereas the token "İstanbul" has a B-LOC tag in the normalized version. The results for both runs are as follows:

For the experiment with no normalization:

I'm O
at O
Efor O
Kuafor O
Istanbul O

For the experiment with normalization:

I'm O
at O
Efor O
Kuaför O
İstanbul B-LOC

Apparently, there are still location names that are not recognized, such as "Efor Kuaför", even when the data is normalized. Therefore, other features are defined to increase the number of recognized toponyms.

The second set of features, FeatureSet_2, includes both suffixes and text character n-grams as features for tokens. The purpose is to compare the accuracy of POS-tags against that of suffixes and character n-grams, since these features can be used interchangeably. The FeatureSet_2 experiment is also run twice, with and without normalization. The results are shown in Table 5.4.

Table 5.4: FeatureSet_2 Results

                                 TP    TN    FP    FN
FeatureSet_2 (Normalization)     33    6037    5    57
FeatureSet_2 (No Normalization)  33    6037    5    57

The metrics obtained from the second run are shown in Table 5.5.

Table 5.5: Metrics Obtained For FeatureSet_2 With No Normalization

              Precision   Recall     F-Measure   Accuracy
FeatureSet_2  0.868421    0.366666   0.515625    0.989889

The first and second runs of the CRF with suffix and character n-gram features give the same result. Since there is no rule requiring a token to be meaningful in order to extract these kinds of features, all tokens already have their features in both cases. The difference between the features of an informal token is shown below for the token "viransehir", whose formal written form is "Viranşehir":

Features of the normalized case:

TTCNG_anşehir TTCNG_ranşehi TTCNG_iranşeh TTCNG_viranşe TTCNG_hir TTCNG_ehi TTCNG_şeh TTCNG_nşe TTCNG_anş TTCNG_ran TTCNG_ira TTCNG_vir C3=hir C2=ir C1=r

Features of the case with no normalization:

TTCNG_ansehir TTCNG_ransehi TTCNG_iranseh TTCNG_viranse TTCNG_hir TTCNG_ehi TTCNG_seh TTCNG_nse TTCNG_ans TTCNG_ran TTCNG_ira TTCNG_vir C3=hir C2=ir C1=r

Even though there is a difference between these feature vectors at the letter level, it did not affect the system accuracy.

The third feature set uses the features in FeatureSet_2 but with different n values for the character n-grams. In this experiment, the n values are changed from 3, 7 (in FeatureSet_2) to 3, 4, 5. The numbers of true positive, true negative, false positive and false negative findings are shown in Table 5.6.

Table 5.6: FeatureSet_3 Results

              TP    TN    FP    FN
FeatureSet_3  34    6037    5    56

According to the results in Table 5.4 and Table 5.6, the system can recognize more toponyms with the 3, 4, 5 n-values for the character n-grams. Therefore, the following experiments use these n-gram values.

After analyzing the effects of the n values and selecting the setting that finds the higher number of named entities, the impact of the suffix size is investigated. In the previous experiments, the suffix size is limited to the last three characters. If it is extended to the last five characters, Table 5.7 is obtained.

Table 5.7: FeatureSet_4 Results

              TP    TN    FP    FN
FeatureSet_4  31    6037    5    59

FeatureSet_3 and FeatureSet_4 differ from each other only in terms of the suffix size. According to the results of both experiments, FeatureSet_3 has a 52.71% F-measure, whereas the F-measure of FeatureSet_4 is 49.2%. Therefore, the suffix size is set to the last three characters of a token for the subsequent experiments.

The next experiment combines the POS-tag features with the suffix and character n-gram features. The hypothesis before the experiment is that using the suffixes and character n-grams to specify the morphological features of tokens for which TRmorph does not give any POS-tag should recognize more toponyms than the experiments using only one of them. In this way, it is planned to create morphological features for almost all tokens by supplementing TRmorph with the suffix and character n-gram features. FeatureSet_5 is experimented with only for the normalized data, since the first and second experiments clearly show the impact of the informal-to-formal conversion. The result of the experiment is in Table 5.8.

Table 5.8: FeatureSet_5 Results

              TP    TN    FP    FN
FeatureSet_5  38    6037    5    52

After verifying the hypothesis with the fifth experiment, the other features defined in the system are used by adding them to FeatureSet_5. The next set combines the fifth set with the conjunction features. Since the conjunction window size is configurable, not only FeatureSet_6 but also FeatureSet_7 uses the conjunction features; however, the size of the window changes. In FeatureSet_6, a window of [-1, +1] is used, while in FeatureSet_7 the size is [-2, +2], which includes the features of the second-previous, previous, next and second-next tokens. The purpose is to obtain higher accuracy in the recognition of named entities consisting of more than one token, since the conjunction feature takes the features of the tokens inside the window into consideration.

The results of the experiments on the sixth and seventh feature sets are shown in Table 5.9 and Table 5.10, respectively.

Table 5.9: FeatureSet_6 Results

              TP    TN    FP    FN
FeatureSet_6  39    6036    6    51

Table 5.10: FeatureSet_7 Results

              TP    TN    FP    FN
FeatureSet_7  39    6032   10    51

The optimum window size actually depends on the number of named entities that include more than one token and on the number of tokens in a named entity. The data set used in these experiments contains only a few named entities consisting of more than three tokens; therefore, expanding the window does not increase the number of recognized entities. However, some unrelated features, such as the POS-tag of the token two positions back, are added to a token even though they are not related to the current token. Since this happens frequently, the system finds some long named entities that are not actually named entities. Therefore, the number of false positive findings increases.

The base metric to measure the success of the system is the F-score, and FeatureSet_6 has the higher F-score, since this metric takes the false positive findings into account besides the true positives. Therefore, the following experiments use FeatureSet_6 as a basis.

The next experiment uses FeatureSet_8, which adds regex features to FeatureSet_6, the most successful feature set for toponym recognition among the first seven sets. It is hypothesized that labelling words, or parts of words, that generate toponyms, either when they follow a token or when they are contained in it, makes the recognition of those entities more probable. In this feature set, tokens that match the regular expression rules are tagged as "LOCATION". The results of the experiment are shown in Table 5.11.

Table 5.11: FeatureSet_8 Results

              TP    TN    FP    FN
FeatureSet_8  42    6032   10    48

According to the experiment results, it is evident that labelling these generative character sequences gives a gain of about 1.4% in the F-measure value, which is the measure of the achievement of the system.

The last experiment on feature sets is conducted with FeatureSet_9, which extends the eighth feature set by adding the "FirstToken" feature. This feature tags the first token in each token sequence. How this feature affects the result of the system is shown in Table 5.12.

According to the results, labelling the first tokens in the data set decreases the number of true positive findings. Therefore, the feature set that gives the best result is FeatureSet_8, and its superiority can be observed in Table 5.13 and Figure 5.1.

Table 5.12: FeatureSet_9 Results

              TP    TN    FP    FN
FeatureSet_9  38    6032   10    52

Figure 5.1: Metrics Obtained For Each Feature Set

Table 5.13: F-Measure Metrics Obtained For Each Feature Set

              Precision   Recall     F-Measure   Accuracy
FeatureSet_1  0.861111    0.344444   0.492063    0.989562
FeatureSet_2  0.868421    0.366667   0.515625    0.989889
FeatureSet_3  0.871794    0.377778   0.527131    0.990052
FeatureSet_4  0.868421    0.366667   0.515625    0.989889
FeatureSet_5  0.883721    0.422222   0.571429    0.990704
FeatureSet_6  0.866667    0.433333   0.577778    0.990705
FeatureSet_7  0.795918    0.433333   0.561151    0.990052
FeatureSet_8  0.807692    0.466666   0.591549    0.990541
FeatureSet_9  0.791667    0.422222   0.550724    0.989889

The results of the experiments described so far are given in Figure 5.2 and Figure 5.3, which show the changes between experiments in terms of precision and recall, respectively. Higher precision and recall are preferable; however, the feature set that gives the best precision may not give the best recall. For example, even though the highest precision is obtained by FeatureSet_5 (Figure 5.2), its recall is lower than that of FeatureSet_8. F-measure offers a solution to this problem by taking the harmonic mean of those metrics; therefore, it is the criterion used to decide which experiment gives the best result. The answer is FeatureSet_8, according to Figure 5.1.

Figure 5.2: Precision Metrics Obtained For Each Feature Set

FeatureSet_8 will be used in the subsequent experiments, which are conducted to investigate how the size of the input data sets and different kinds of trainers affect the system output.

Before concluding the feature selection experiments, another statistic needs to be given for the normalization phase. The data set used in the experiments has an informal structure; therefore, the morphological analysis phase cannot produce results for some tokens in the data set. After these tokens are given to the normalization phase, the following statistics are obtained:

Figure 5.3: Recall Metrics Obtained For Each Feature Set

The number of tokens in the data set: 20,380
The number of tokens for which morphological analysis gives no result before normalization: 2886
The number of tokens for which morphological analysis gives a result after normalization (the number of adjusted tokens): 1351

In conclusion, the normalization phase works with a success rate of about 47% on this data set.
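This figure follows directly from the counts above (1351 adjusted tokens out of 2886 unanalyzable ones); a quick check:

```java
// Normalization success rate: adjusted tokens over tokens that the
// morphological analysis could not handle before normalization.
public class NormalizationRate {
    static double successPercent(int adjusted, int unanalyzable) {
        return 100.0 * adjusted / unanalyzable;
    }

    public static void main(String[] args) {
        // counts from the data set statistics above
        System.out.printf("%.1f%%%n", successPercent(1351, 2886)); // about 47%
    }
}
```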

5.3.2 Experimental Evaluation on CRF Learners

Three types of trainer are used in the proposed work, and the one giving the best results will be the trainer for the following experiments. In the previous tests, the stochastic gradient trainer is employed, and it gives its best result with an F-measure of 59.15%. The other trainers, namely label likelihood and value gradient, are experimented with in turn using FeatureSet_8. Detailed information about these three trainers is given in Section 2.4. The trainers give the results shown in Table 5.14.

Table 5.14: F-Measures from the Trainer Experiments with FeatureSet_8 (the feature set giving the best result)

                             Precision   Recall     F-Measure   Accuracy
Stochastic Gradient Trainer  0.807692    0.466667   0.591549    0.990541
Label Likelihood Trainer     0.933333    0.466667   0.622222    0.991683
Value Gradient Trainer       0.95        0.422222   0.584615    0.991193

According to Table 5.14, the value gradient trainer is the most successful with respect to the precision metric; however, it gives the lowest recall value. The stochastic gradient trainer shares the highest recall, but its precision is quite low. The label likelihood trainer has the best F-measure, since it combines a high precision with the highest recall. Therefore, a label likelihood trainer is employed in the subsequent experiments.

5.3.3 Experimental Evaluation on the size of Training and Test Data Sets

The last parameter in constructing a successful system is the size of the training and test data sets. The number of toponyms in those data sets should also be meaningful. Since the training data set is used to train the CRF, a more comprehensive training data set yields better learning; therefore, location extraction from the test data set gives more accurate results. To show this effect of the training data set size, three experiments are conducted. In addition to the size of the training data set, the number of toponyms in that set is also critical: the more varied the location entities the CRF learns, the greater the variety of toponyms the system can extract. This condition is taken into account in the experiment preparation phase; therefore, different combinations of input data sets are generated.

First, the data set is divided into two sets such that the training data set includes 1670 tweets containing 192 location names, and the test data set contains 650 tweets and 90 location named entities. While separating the data set, a shuffling method is used, and it is checked that both the training and test data sets have a reasonable number of toponyms.

All previous experiments are conducted using these initially created files, and their results have already been presented. For the second experiment, the input data sets are created so that the training and test data sets have the same number of tweets; however, the training data set has a few more toponyms than the test data set. In this experiment, both the training and test data sets consist of 1160 tweets, but the training data set includes 144 toponyms, whereas the test set contains 138 location entities.

The last experiment uses a training data set that is much smaller than the test set. The numbers of tweets are 800 and 1520 for the training and test data sets, respectively; 103 location named entities are included in the training set and 179 in the test set. The results obtained from these three experiments can be seen in Table 5.15.

Table 5.15: F-Measures from Each Experiment with the Different Numbers of Tweets and Toponyms

                        Precision   Recall     F-Measure   Accuracy
Experiment_1 (192/90)   0.933333    0.466667   0.622222    0.991683
Experiment_2 (144/138)  0.921569    0.340579   0.497354    0.991460
Experiment_3 (103/179)  0.934782    0.240223   0.382222    0.990620

The experiments give the expected results. When the number of location named entities in the training data set is decreased, the CRF is trained with insufficient information about toponyms; therefore, its ability to recognize the named entities is also reduced.

In order to show the results of the best system configuration more clearly, the list of toponyms in the test data set and the true positive, false positive and false negative findings are given in Appendix A. According to the list, toponyms consisting of more than three tokens cannot be found, since the window size is 3. Some tokens referring to a location cannot be found, whereas others referring to the same location can be labelled as toponyms, since the sequences containing them are quite different and the training data set is not comprehensive enough to train the CRF with all those sequences.

5.3.4 Experimental Analysis for Accuracy

The previous experiments are conducted using the first data set [15]. This data set is larger than the second one; however, it contains fewer toponyms (282). The second one, even though it has fewer tweets, has more location named entities. Another experiment is designed for the second data set using the best combination. The findings and metrics obtained from this experiment are given in Table 5.16 and Table 5.17, respectively. The total toponym count in the test data set is 130, while the training set contains 316.

Table 5.16: Second Dataset Results

                TP    TN    FP    FN
Second Dataset  93    3409    8    37

Table 5.17: Second Dataset Metrics

                Precision   Recall     F-Measure   Accuracy
Second Dataset  0.920792    0.715385   0.805195    0.987313

The results are evidence that the more toponyms the machine learning model is trained with, the higher the accuracy a NER system gives. The second data set trains the model better, since it has more location names. Moreover, the training and test data sets composed from the second data set contain similar toponyms, since the probability that tweets belonging to the same event appear in both the training and test data sets is quite high.

5.3.5 Experimental Comparison with Related Work

The proposed system is compared with the Turkish NLP Pipeline ([3], [7]). In general, the pipeline performs similar steps to the proposed work, such as tokenization, normalization, and morphological processing; however, it uses different kinds of features and methods in these steps. The results of the proposed system are given in the previous section. The test data set used in those experiments, the one containing 650 tweets and 90 toponyms, is also run through the Turkish NLP Pipeline. First, the test data set is given to the normalization tool; the normalized output is then processed with the morphological analyzer. The analyzer produces all possible morphological features for each token. Then, the morphological disambiguator is run on the output of the analyzer. Lastly, the final results are taken from the Named Entity Recognizer. All of the mentioned tools are available on the pipeline's web site [7]. The comparison between the proposed work and the Turkish NLP Pipeline is given in Table 5.19 and Table 5.18, in terms of precision, recall, and F-measure metrics and the numbers of true/false positive and true/false negative findings, respectively.
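The four pipeline stages described above compose naturally: normalization feeds the analyzer, the analyzer feeds the disambiguator, and the disambiguator feeds the recognizer. The sketch below shows only this data flow; the stand-in functions are hypothetical (the real ITU pipeline tools are separate web services, not these lambdas):

```python
# Hypothetical wrappers around the four pipeline stages described above.
# The stage names mirror the text; none of this is the pipeline's real API.
def run_pipeline(tweet, normalize, analyze, disambiguate, recognize):
    tokens = normalize(tweet)                # normalization + tokenization
    analyses = [analyze(t) for t in tokens]  # all possible analyses per token
    tagged = disambiguate(analyses)          # pick one analysis per token
    return recognize(tagged)                 # final NER labels

# Toy stand-ins, purely to show how the stages chain together:
out = run_pipeline(
    "ankaraya gitcem",
    normalize=lambda s: s.split(),
    analyze=lambda t: [t + "+Noun"],
    disambiguate=lambda a: [x[0] for x in a],
    recognize=lambda ts: [(t, "O") for t in ts],
)
print(out)  # → [('ankaraya+Noun', 'O'), ('gitcem+Noun', 'O')]
```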

Table 5.18: Comparison between Turkish NLP Pipeline and The Proposed System in terms of The Numbers of True Positive, True Negative, False Positive and False Negative Findings

                      TP    TN    FP   FN
Proposed Work         42    6039  3    48
Turkish NLP Pipeline  23    6030  12   67

Table 5.19: Comparison between Turkish NLP Pipeline and The Proposed System in terms of Precision, Recall, F-Measure and Accuracy

                      Precision   Recall     F-Measure   Accuracy
Proposed Work         0.933333    0.466667   0.622222    0.991683
Turkish NLP Pipeline  0.657142    0.255556   0.367999    0.987116

According to the result tables, the proposed work achieves higher accuracy. The factors that yield these results are investigated, and the following issues are found in the Turkish NLP Pipeline:

• The tokenizer is not capable of tokenizing some inputs correctly. For example:

The original tweet: "I’m at Kapalıçarşı (İstanbul, Türkiye) w/ 86 others http://t.co/VpCuN9hPCj"

Its tokenized form after normalization in the pipeline, with the generated POS-Tags:

I’m                           I’m+?
at                            at+Noun+A3sg+Pnon+Nom
Kapalıçarşı                   Kapalıçarşı+Noun+Prop+A3sg+Pnon+Nom
(                             (+Punc
İstanbul                      İstanbul+Noun+Prop+A3sg+Pnon+Nom
,                             ,+Punc
Türkiye)                      Türkiye)+?
v/                            v/+?
86                            86+Num+Card
otokar                        otokar+Noun+A3sg+Pnon+Nom
@url[http://t.co/VpCuN9hPCj]  @url[http://t.co/VpCuN9hPCj]+?

Its tokenized form in the proposed system with the generated POS-Tags:

I’m                      Alpha p1s
at                       N
Kapalıçarşı              N:prop
İstanbul                 N:prop
Türkiye                  N:prop
w                        Alpha
86                       Num:ara
others                   ???
http://t.co/VpCuN9hPCj   ???
.                        Punc

Even though the token "Türkiye" itself is written correctly, the morphological analyzer does not produce any POS-Tag for it, since the form it receives, "Türkiye)", is meaningless because of the parenthesis attached at the end. This is not a unique case in the output of the normalization and tokenization phase. Since this token is not recognized, it is not tagged as a location in the final output (the Named Entity Recognizer output).
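Failures like "Türkiye)" can be avoided by splitting leading and trailing punctuation off each token before morphological analysis, while leaving URLs intact. The sketch below illustrates that idea; it is not the tokenizer actually used by either system:

```python
import re

def split_punct(token):
    """Split leading/trailing punctuation off a token, keeping URLs intact."""
    if token.startswith("http://") or token.startswith("https://"):
        return [token]
    # Capture leading punctuation, the core token, and trailing punctuation.
    m = re.match(r"^([(\[\"']*)(.*?)([)\]\"'.,!?]*)$", token)
    lead, core, trail = m.groups()
    return [piece for piece in (list(lead) + [core] + list(trail)) if piece]

print(split_punct("Türkiye)"))   # → ['Türkiye', ')']
print(split_punct("(İstanbul,")) # → ['(', 'İstanbul', ',']
```

With the parenthesis detached, "Türkiye" reaches the analyzer in a form it can recognize.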

• In the normalization phase, some adjustments on tokens result in words that are not correct. For example:

The original tweet: "Catili bi evim olsun cati katinin tamamni kendi odam yapiyim ilk kati kocaman mutfak ikinci katida salon odalar falan oha super be"

NER output for this tweet:

Çatılı    çatı+Noun+A3sg+Pnon+Nom^DB+Adj+With               O
bir       bir+Num+Card                                      O
evim      ev+Noun+A3sg+P1sg+Nom                             O
olsun     ol+Verb+Pos+Imp+A3sg                              O
Cat’i     Cad+Guess+Noun+Prop+A3sg+Pnon+Acc                 O
katinin   kati+Noun+A3sg+Pnon+Gen                           O
tamamen   tamamen+Adverb                                    O
kendi     kendi+Pron+Reflex+A3sg+P3sg+Nom                   O
odam      oda+Noun+A3sg+P1sg+Nom                            O
yapıyım   yapı+Noun+A3sg+Pnon+Nom^DB+Verb+Zero+Pres+A1sg    O
ilk       ilk+Adj                                           O
kati      kati+Adj                                          O
kocaman   kocaman+Adj                                       O
mutfak    mutfak+Noun+A3sg+Pnon+Nom                         O
ikinci    ikinci+Num+Ord                                    O
Kati’da   Kati’da+?                                         B-LOCATION
salon     salon+Noun+A3sg+Pnon+Nom                          O
odalar    oda+Noun+A3pl+Pnon+Nom                            O
falan     falan+Adj                                         O
oha       oha+Interj                                        O
Super     Super+Guess+Noun+A3sg+Pnon+Nom                    O
be        be+Interj                                         O

Actually, the normalization phase in the pipeline makes some successful adjustments. For example, it is capable of converting the informal phrase "ankaraya gitcem" into the formal "Ankara'ya gideceğim". However, the normalization process also yields many wrong conversions. For instance, "cati katinin" is converted to "Cat'i katinin". In particular, the conversion of the token "katida" makes the pipeline tag it as a location, which is quite inconsequent. The probability of encountering this kind of miscorrection is not low, although not all of them yield a mislabelling. In fact, such errors are easily corrected with an English-Turkish character checking task. The proposed system normalizes this tweet as follows:

Çatılı    N li Adj             O
bi        Num:typo             O
evim      N p1s                O
olsun     V imp 3s             O
çatı      N                    O
katının   Adj 0 N gen          O
tamamni                        O
kendi     Prn:refl:3s          O
odam      N p1s                O
yapıyım   N 0 V cpl:pres 1s    O
ilk       Adj                  O
kati      Adj                  O
kocaman   Adj                  O
mutfak    N                    O
ikinci    Num ord              O
katıda    Adj 0 N loc          O
salon     N                    O
odalar    N pl                 O
falan     N                    O
oha       Ij                   O
süper     Adj                  O
be        Ij                   O
.         Punc                 O

As a result, the pipeline has an issue with tokenization, whereas the normalization makes impressive adjustments for some tokens and sentences. However, the adjustments yield wrong results for many tokens in the dataset. Therefore, the negative effects of the normalization process overshadow its positive effects on the system.
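The English-Turkish character check mentioned above can be sketched as candidate generation plus lexicon lookup: for each ASCII letter that commonly replaces a Turkish-specific character in informal writing, try both variants and keep a candidate only if a Turkish dictionary contains it. The lexicon below is a toy stand-in, and the proposed system's actual procedure may differ:

```python
from itertools import product

# ASCII letters that informal Turkish writers commonly use in place of
# Turkish-specific characters.
ASCII_TO_TURKISH = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö",
                    "s": "sş", "u": "uü"}

def turkish_candidates(token):
    """Generate every variant obtained by restoring Turkish characters.
    Candidate count grows exponentially with token length, so in practice
    this is only viable for short tokens or with pruning."""
    choices = [ASCII_TO_TURKISH.get(ch, ch) for ch in token]
    return {"".join(combo) for combo in product(*choices)}

# A toy lexicon standing in for a real Turkish dictionary.
lexicon = {"çatı", "katı", "süper"}

def restore_turkish(token):
    matches = turkish_candidates(token) & lexicon
    return min(matches) if matches else token

print(restore_turkish("cati"))   # → çatı
print(restore_turkish("super"))  # → süper
```

Such a check would turn "cati" back into "çatı" instead of the pipeline's miscorrection "Cat'i".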

• Different CRF tools are used in the pipeline and in the proposed system. CRF is implemented with MALLET [25] in the proposed system, whereas CRF++ is employed in the pipeline. It is hard to demonstrate the effects of these tools on accuracy with an example. However, MALLET offers the opportunity to choose among different training methods even though the CRF model is still linear-chain. The conducted experiments showed that the training method has a serious effect on the accuracy of the system (Table 5.14). The difference between the methods constructing the CRF models in the two systems may therefore be affecting their accuracy.

The proposed system is also compared with a rule-based NER system ([17], [16]), using the results of the experiment in which the test data set contains 650 tweets and 90 toponyms. However, the rule-based NER system could not be run on exactly the same test data set. Since the whole dataset, which is divided in two in this thesis, is used in [16] as a single test set, a result file is generated by selecting from the output file of the rule-based system the same tweets that are in our test set. With this constructed test data set, Table 5.20 and Table 5.21 are obtained.

Table 5.20: Comparison between Rule-based NER System and The Proposed System in terms of The Numbers of True Positive, True Negative, False Positive and False Negative Findings

                       TP    TN    FP   FN
Proposed Work          42    6039  3    48
Rule-based NER System  59    6025  17   31

Table 5.21: Comparison between Rule-based NER System and The Proposed System in terms of Precision, Recall, F-Measure and Accuracy

                       Precision   Recall     F-Measure   Accuracy
Proposed Work          0.933333    0.466667   0.622222    0.991683
Rule-based NER System  0.776315    0.655555   0.710843    0.992172

The proposed system has a higher precision value; however, its recall value is considerably lower than that of the rule-based system. Since the rule-based system uses pre-defined lists and pre-defined rules, its results depend highly on those lists and on the dataset. Since it is impossible to re-run the rule-based system for now, it could not be evaluated on the second dataset.

CHAPTER 6

CONCLUSION AND FUTURE WORK

In this thesis, the aim is to extract named entities referring to location names from informal and unstructured data by training a learning-based NER system. The system needs features to be able to identify toponyms in the dataset. The better the features distinguish toponyms from non-toponyms, the better the system recognizes location named entities. Therefore, defining such features is critical. The proposed system aims to find the best feature combination for toponym recognition. Another purpose is to find the best training and testing method that works with the CRF model, and to observe the effects of the content and the size of the training and test data sets.

The proposed system is a hybrid NER system, since it uses several rule-based features while training a learning-based model. Analyzing the effects of rule-based features on a learning-based system is another purpose of the study.

Since the system works on informal data, a normalization phase is added to the workflow. This phase aims to adjust informally written Turkish tokens so that they can be recognized. Guessing the correct forms of tokens is a vital process, since each token contributes to the probability of finding toponyms even if it does not itself refer to a location name.

This system uses linear-chain CRFs as the machine learning model. However, it is claimed that skip-chain CRFs give more accurate results than linear-chain CRFs [39]. Therefore, the model of the proposed system could be changed from a linear-chain model to a skip-chain model. Other improvements could be made by giving the CRF model additional features that make toponyms more distinguishable, or by making the normalization phase more comprehensive. Moreover, using domain-specific data could also improve the system accuracy, since it increases the similarity between the training and test data sets. All of these modifications are left as future work.

The proposed learning-based system achieves an F-measure of about 62.22% on the first dataset and 80.5% on the second dataset for the informal Twitter domain when it uses the best combination of trainer, feature set, and training data set size. Considering the results of informal English NER systems, such as [8] with an F-measure of 66%, [38] with 75.81%, and [34] with 74%, and of informal Turkish NER systems, such as [3] with 36.8% and [16] with 71%, the proposed system gives quite good results. Since the results depend on the dataset used in the experiments, it is hard to claim that it gives the best result among informal Turkish NER studies. To establish its place in the literature, the same dataset must be analyzed with other Turkish NER systems, and different kinds of datasets should be evaluated with the proposed system. Because of annotated-dataset limitations, the accuracy of the system with a large training data set has not been tested.

REFERENCES

[1] R. Balabantaray, S. Das, and K. T. Mishra. Case Study of Named Entity Recognition in Odia Using CRF++ Tool. International Journal of Advanced Computer Science and Applications (IJACSA), 4:213–216, 2013.

[2] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O’Reilly Media, Inc., 1st edition, 2009.

[3] G. Çelikkaya, D. Torunoğlu, and G. Eryiğit. Named Entity Recognition on Real Data: A Preliminary Investigation for Turkish. Proceedings of the 7th International Conference on Application of Information and Communication Technologies, October 2013.

[4] N. Chinchor. MUC-7 Named Entity Task Definition. URL: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html [cited 15.07.2014].

[5] Ç. Çöltekin. A Freely Available Morphological Analyzer for Turkish. European Language Resources Association, pages 820–827, 2010.

[6] Ç. Çöltekin. TRmorph: A Turkish Morphological Analyzer, October 2013. URL: http://www.let.rug.nl/coltekin/trmorph/ [cited 23.06.2014].

[7] G. Eryiğit. ITU Turkish NLP Web Service. Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), April 2014.

[8] D. Etter, F. Ferraro, R. Cotterell, et al. Nerit: Named Entity Recognition for Informal Text. Technical report, The Johns Hopkins University, Human Language Technology Center of Excellence, 810 Wyman Park Drive, Baltimore, Maryland 21211, July 2013.

[9] J. R. Finkel, T. Grenager, and C. Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363–370, 2005.

[10] R. Florian, A. Ittycheriah, H. Jing, and T. Zhang. Named Entity Recognition through Classifier Combination. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, 4:168–171, 2003.

[11] A. Göksel and C. Kerslake. Turkish: A Comprehensive Grammar. Comprehensive Grammars. Routledge (Taylor and Francis), London, 2005.

[12] M. Haspelmath and A. D. Sims. Understanding Morphology. Hodder Education, an Hachette UK Company, 338 Euston Road, London NW1 3BH, 2nd edition, 2010.

[13] J. Kapociute-Dzikiene, A. Noklestad, J. Bondi Johannessen, and A. Krupavicius. Exploring Features for Named Entity Recognition in Lithuanian Text Corpus. Proceedings of the 19th Nordic Conference of Computational Linguistics, pages 73–88, May 2013.

[14] V. Krishnan and C. D. Manning. An Effective Two-stage Model for Exploiting Non-local Dependencies in Named Entity Recognition. Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 1121–1128, 2006.

[15] D. Kucuk, G. Jacquet, and R. Steinberger. Named Entity Recognition on Turkish Tweets. Proceedings of the Ninth International Conference on Language Resources and Evaluation, May 2014.

[16] D. Küçük and R. Steinberger. Experiments to Improve Named Entity Recogni- tion on Turkish Tweets. In Proceedings of the EACL Workshop on Language Analysis for Social Media, 2014.

[17] D. Küçük and A. Yazıcı. Named Entity Recognition Experiments on Turkish Texts. In Proceedings of the International Conference on Flexible Query Answering Systems, 5822:524–535, 2009.

[18] S. Kumova Metin, T. Kışla, and B. Karaoğlan. Named Entity Recognition In Turkish Using Association Measures. Advanced Computing: An International Journal (ACIJ), 3(4), July 2012.

[19] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, 2001.

[20] M. Lieberman and H. Samet. Multifaceted Toponym Recognition for Streaming News. ACM, pages 843–852, 2011.

[21] M. D. Lieberman, H. Samet, and J. Sankaranarayanan. Geotagging with Local Lexicons to Build Indexes for Textually-specified Spatial Data. In Proceedings of ICDE, pages 201–212, 2010.

[22] J. Lingad, S. Karimi, and J. Yin. Location Extraction from Disaster-related Microblogs. International World Wide Web Conference, pages 1017–1020, 2013.

[23] A. McCallum. Efficiently Inducing Features of Conditional Random Fields. Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI '03), pages 403–410, 2003.

[24] A. McCallum, K. Rohanimanesh, and C. Sutton. Dynamic Conditional Random Fields for Jointly Labeling Multiple Sequences. NIPS Workshop on Syntax, Semantics and Statistics, 2003.

[25] A. K. McCallum. MALLET: A Machine Learning for Language Toolkit., 2002. URL: http://mallet.cs.umass.edu [cited 25.06.2014].

[26] W. Murnane. Improving Accuracy of Named Entity Recognition on Social Me- dia Data, 2010.

[27] K. Oflazer. Two-level Description of Turkish Morphology. Proceedings of the Sixth Conference on European Chapter of the Association for Computational Linguistics, page 472, 1993.

[28] K. D. Önal, P. Karagöz, and R. Çakıcı. Türkçe Twitter Gönderilerinde Lokasyon Tanıma (Location Recognition in Turkish Twitter Posts). SIU, April 2014.

[29] O. Ozdikis, P. Senkul, and H. Oğuztüzün. Semantic Expansion of Tweet Contents for Enhanced Event Detection in Twitter. The IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, pages 20–24, 2012.

[30] S. Özkaya and B. Diri. Named Entity Recognition by Conditional Random Fields From Turkish Informal Texts. IEEE Signal Processing and Communications Applications Conference (SIU), pages 662–665, 2011.

[31] C. Pal, X. Wang, M. Kelm, and A. McCallum. Multi-Conditional Learning for Joint Probability Models with Latent Variables. Annual Conference on Neural Information Processing Systems Workshop on Advances in Structured Learning for Text and Speech Processing, 2006.

[32] D. Pinto, A. McCallum, X. Wei, and W. B. Croft. Table Extraction Using Conditional Random Fields. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 235–242, 2003.

[33] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora. EMNLP, 1:248–256, 2009.

[34] A. Ritter, S. Clark, and M. O. Etzioni. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011.

[35] S. Sarawagi. Information Extraction. Foundations and Trends in Databases, 1(3):261–377, March 2008.

[36] G. A. Şeker and G. Eryiğit. Initial Explorations on Using CRFs for Turkish Named Entity Recognition. 24th International Conference on Computational Linguistics, pages 2459–2474, 2012.

[37] F. Sha and F. Pereira. Shallow Parsing with Conditional Random Fields. NAACL, 1, 2003.

[38] N. V. Sobhana, M. Pabitra, and S. K. Ghosh. Conditional Random Field Based Named Entity Recognition in Geological Text. International Journal of Computer Applications, 1(3):119–125, February 2010.

[39] C. Sutton and A. Mccallum. Introduction to Statistical Relational Learning. MIT Press, 2007.

[40] C. Sutton and A. McCallum. An Introduction to Conditional Random Fields. Foundations and Trends in Machine Learning, 4:267–373, August 2012.

[41] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 1:173–180, 2003.

[42] H. Tseng, D. Jurafsky, and C. Manning. Morphological Features Help POS Tagging of Unknown Words across Language Varieties. Proceedings of the fourth SIGHAN bakeoff, 2005.

[43] L. Xiaohua, Z. Shaodian, W. Furu, and Z. Ming. Recognizing Named Entities in Tweets. Association for Computational Linguistics: Human Language Technologies, 1:359–367, 2011.

APPENDIX A

FINDINGS IN THE TEST DATA SET

Toponyms in Test Data Set
eskişehirde suriye Kapalıçarşı EDİRNEDEYİZ İstanbul İzmir’i Türkiye Şile Samsun Değirmen Çayırı köyü karadenize Gazelle Gölcük Kır Gazinosu Mısır Gölcük Filistin New York İstanbul Türkiye Türkiye ayvalığı Türkiye Merter’de istanbul’un D-100 karayolu Ankara Merter Türkiye İncirli izmir Adapazarı’nda ankaradayız yeni zelanda’nın Bağlarbaşı Caddesi Kumlubük Maris beachte İstanbul Panora Tepe Türkiye ÇUBUK Kadıköy’de Gallifrey’e İzmirde Ankarada Tunus İstanbul Fethiyedeyim

istanbulu india İstanbul fındıkzade Türkiye çeşme kampüsü manavgatta Efor Kuaför İstanbulla Isparta İzmir Gelendost Türkiye bodrum Mecidiyeköy Ortadoğu İstanbul türkbeleni Türkiye Forum Bornova Bahçelievler deneme anadolu lisesindeyim Enez’de Ankara Şaygan Anadolu lisesi Türkiye EDİRNE ANKARA REİNA türkiye Hayrabolu İstanbul Tomalı Hilmi Caddesi İstanbul İNCİRLİYE Ankaraya İZMİR ankaradayız Lös Angeles Power Spor Merkezi Samsun Avrupa kız kulesi ortadoğuda

True Positives in Test Data Set
eskişehirde Ankarada Kapalıçarşı İstanbul İstanbul istanbulu Türkiye İstanbul Değirmen Çayırı köyü Türkiye karadenize manavgatta Mısır İstanbulla Filistin Şaygan Anadolu lisesi İstanbul Türkiye

Türkiye Mecidiyeköy Türkiye İstanbul istanbul’un Türkiye Ankara Enez’de Türkiye Ankara Tomalı Hilmi Caddesi Türkiye ankaradayız ANKARA Bağlarbaşı Caddesi türkiye İstanbul İstanbul Türkiye İstanbul Kadıköy’de Ankaraya İzmirde ankaradayız

False Positives in Test Data Set
İstanbul İstanbul Radyosu

False Negatives in Test Data Set
Lös Angeles Tunus Samsun Fethiyedeyim kız kulesi india suriye fındıkzade EDİRNEDEYİZ çeşme kampüsü İzmir’i Efor Kuaför Şile Isparta Samsun Gelendost Gazelle Gölcük Kır Gazinosu bodrum Gölcük Ortadoğu New York türkbeleni Türkiye Forum Bornova Bahçelievler deneme anadolu lisesindeyim ayvalığı Merter’de İzmir D-100 karayolu EDİRNE Merter REİNA

İncirli Hayrabolu Adapazarı’nda izmir yeni zelanda’nın İNCİRLİYE Kumlubük Maris beachte İZMİR Panora Tepe Malta ÇUBUK Power Spor Merkezi Gallifrey’e Avrupa BURSA ortadoğuda
