
A STUDY ON NAME-ALIAS RELATIONSHIP IDENTIFICATION IN THAI NEWS ARTICLES

BY

THAWATCHAI SUWANAPONG

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN TECHNOLOGY SIRINDHORN INTERNATIONAL INSTITUTE OF TECHNOLOGY THAMMASAT UNIVERSITY ACADEMIC YEAR 2016

Abstract

A STUDY ON NAME-ALIAS RELATIONSHIP IDENTIFICATION IN THAI NEWS ARTICLES

by

Thawatchai Suwanapong

Bachelor of Industrial Technology, University, 1991 Master of Engineering, Chiang Mai University, 1995 Doctor of Philosophy in Technology, Sirindhorn International Institute of Technology, Thammasat University, 2016

Name alias recognition is one of the most important processes towards automatic content understanding. Its performance, however, depends greatly on text processing. This thesis presents the effects of preprocessing factors on name-alias relationship identification in Thai news articles. Three complementary groups of factors are investigated: weighting schemes, preprocessing steps (normalization, co-occurrence matrix calculation, and named entity type filtering), and clustering methods (similarity-based cutoff clustering, hierarchical clustering, and clustering based on equivalence relations). The effects of these factors are investigated experimentally using collections of 2,000 Thai news articles: 1,000 each in the domain of football and that of politics. The experimental results show that named entity type filtering, normalization, the co-occurrence matrix, and hierarchical clustering using the complete linkage function are helpful for name-alias relationship identification. The highest combination performances in the football and politics datasets are 97.24% and 76.37%, respectively. The results provide clear evidence that co-occurrence information is very important for identifying name-alias relations. This thesis also proposes an alternative co-occurrence matrix construction method using association measures. The effects of association measures are investigated by comparing their use with the traditional co-occurrence matrix construction method. Various complementary factors are considered in the comparison, e.g., weighting schemes, a normalization process, and linkage functions for hierarchical clustering. Using the same two collections of Thai news articles in experiments, combinations using association measures yield the highest performance in both news domains. The best performances achieved on the football and politics datasets are 98.75% and 83.87%, respectively.

Acknowledgements

Foremost, I would like to express my sincere gratitude to my advisor and co-advisor, Dr. Thanaruk Theeramunkong and Dr. Ekawit Nantajeewarawat, for their continuous support of my Ph.D. study and research, and for their patience, motivation, enthusiasm, and immense knowledge. Their guidance helped me throughout the research and writing of this dissertation. I could not have imagined having better advisors and mentors for my study at SIIT.

I extend my gratitude to the committee members, Dr. Pakinee Aimmanee, Dr. Cholwich Nattee, and Dr. Thepchai Supnithi, who gave many helpful comments, drawing on their broad professional and technical expertise, during the progress presentations and the thesis defense.

My gratitude also extends to the external examiner, Professor Takenobu Tokunaga, Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, for his kindness in examining this thesis as well as his noteworthy comments, which substantially improved the quality of this work.

I want to express my gratitude to Prof. Dr. Pakorn Adulbhan, Khun Darika Adulbhan, and Ajarn Wanida Somsongkul for supporting and encouraging me. My gratitude also goes to Ajarn Somchai Bhawaworrapun for his help during my stay in Pathum Thani. I also want to thank Asst. Prof. Boonying Somrang for his guidance on statistics.

Last, but not least, I would like to thank my workplace, Prince of Songkla University, Trang Campus, for supporting me with a full scholarship over the four years of this study.

Table of Contents

Chapter Page

Signature Page i

Abstract ii

Acknowledgements iii

Table of Contents iv

List of Tables vi

List of Figures viii

1 Introduction 1

1.1 Motivation 1

1.2 Problem Statement 3

1.3 Contributions 3

1.4 Organization 4

2 Backgrounds 5

2.1 Named Entity 5

2.2 Name Aliases in News Articles 10

2.3 Thai Name Aliases 12

2.4 Co-occurrence Word Information 13

3 Name-alias Relationship Identification 15

3.1 An Overview of the Framework 15

3.2 Preprocessing Factors and Clustering Methods 17

3.3 Experimental Settings 21

4 Experimental Results 25

4.1 Overall Results 25

4.2 Analysis of Effects of Preprocessing Factors and Clustering Methods 29

5 Exploration of Co-occurrence Matrix Construction Methods 34

5.1 An Overview of the Framework 34

5.2 Preprocessing Factors 35

5.3 Co-occurrence Matrix Construction (COC) 36

5.4 Evaluation Results 38

5.5 Discussion 45

6 Error Analysis 46

6.1 Analysis of Mismatched Name-Alias Pairs 46

6.2 Solutions 47

7 Conclusions and Future Works 49

7.1 Conclusions 49

7.2 Future Works 51

References 52

Appendices 61

Appendix A 62

Appendix B 71

Appendix C 80

Appendix D 89

Appendix E 98

List of Tables

Tables Page

1.1 Thai language characteristics compared to English 2

2.1 Examples of name aliases in Thai sport news 12

2.2 Examples of name aliases in Thai political news 12

3.1 Global weighting 18

3.2 Dataset characteristics 22

3.3 Number of formal names having each particular number of alias names 23

4.1 Top 20 combinations and those in the ranks 40–420 stepping by 20 in DFB 26

4.2 Combinations without any optional preprocessing step in DFB 26

4.3 Top 20 combinations and those in the ranks 40–420 stepping by 20 in DPL 27

4.4 Combinations without any preprocessing step in DPL 27

4.5 Comparisons between weighting schemes in DFB 30

4.6 Comparisons between weighting schemes in DPL 30

4.7 Comparisons between clustering methods in DFB 31

4.8 Comparisons between clustering methods in DPL 31

4.9 Comparison between optional preprocessing steps in DFB 32

4.10 Comparisons between optional preprocessing steps in DPL 32

5.1 Association measure functions for co-occurrence matrix construction 37

5.2 Baseline combinations in DFB 38

5.3 Top 25 combinations and some selected combinations in the ranks 30–468 in DFB 39

5.4 Baseline combinations in DPL 40

5.5 Top 25 combinations and some selected combinations in the ranks 30–468 in DPL 41

5.6 Comparisons between weighting schemes in DFB 42

5.7 Comparisons between weighting schemes in DPL 42

5.8 Comparisons between the combinations with and without NOR in DFB 43

5.9 Comparisons between the combinations with and without NOR in DPL 43

5.10 Comparison between co-occurrence matrix construction schemes in DFB 44

5.11 Comparison between co-occurrence matrix construction schemes in DPL 44

5.12 Comparisons between clustering linkage functions in DFB 45

5.13 Comparisons between clustering linkage functions in DPL 45

6.1 Examples of mismatched name-alias pairs derived from the highest performance combination in DFB 46

6.2 Examples of mismatched name-alias pairs derived from the highest performance combination in DPL 47

List of Figures

Figures Page

1.1 Example of Thai name-alias relations based on co-occurrence name information [51] 2

3.1 An example of Thai football news articles in the context of English Premier League [71] 15

3.2 An overview of name-alias identification framework 16

5.1 An overview of name-alias identification framework 35

Chapter 1

Introduction

1.1 Motivation

Name ambiguity is one of the hardest problems in various applications in the areas of language and linguistics, e.g., information retrieval [7, 45], knowledge discovery [27], and text summarization [59, 76]. There are two basic types of name ambiguity: (i) lexical ambiguity, i.e., ambiguity arising from different entities being referred to by the same name, and (ii) referential ambiguity, i.e., ambiguity arising from a single entity being referred to by multiple names. The use of name aliases [63] typically causes referential ambiguity. Lexical ambiguity has been addressed for many years, and several techniques for named entity disambiguation have been reported, e.g., [27, 73, 98]. By contrast, only a few works deal with referential ambiguity; most of them focus on name alias detection from web pages [2, 13, 38, 39, 67, 69].

Name aliases frequently occur in Thai literature, in both paper-based and online articles. Writers often refer to persons and places by using their name aliases, e.g., when talking about politicians [72, 83], countries [42], monks [57], etc. Name alias recognition in Thai is very difficult because many Thai sentences are not in the form of SVO (subject + verb + object) [10, 37], and neither special signs nor special characters are used to indicate proper nouns. No strict word order is imposed on Thai sentences. It is thus difficult to identify verb-like entity names. Moreover, a name alias can be constructed by combining noun, verb, adjective, or word components of other types, and can also be a transliterated name. Due to the difficulty of name alias detection in the Thai language, most existing Thai name recognition frameworks deal with only formal names [17, 49, 77, 84, 86, 87, 91], and work related to name aliases is relatively rare. Table 1.1 shows a comparison of the language characteristics between Thai and English [36].

However, a Thai name alias usually does not occur independently. It typically occurs in a document containing a formal name to which it refers (see Figure 1.1). This observation naturally motivates an attempt to identify name-alias relationships based on co-occurrence name information in given documents. Generally, co-occurrence information can be represented by a co-occurrence matrix. Co-occurrence matrices have been successfully applied in many fields, including relevance feedback [19], information science [50], and keyword extraction [52]. Using the vector space model, a basic way to construct a co-occurrence matrix is to multiply a name-by-document matrix by its transpose. The cosine similarity is often employed for measuring relationships among name vectors in a co-occurrence matrix.
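As a rough illustration of this construction (using a tiny hypothetical name-by-document matrix, not the thesis data), the matrix-multiplication step and the cosine similarity can be sketched in Python as follows:

```python
from math import sqrt

# Hypothetical name-by-document matrix A: rows = names, columns = documents,
# entries = occurrence counts of each name in each document.
names = ["Thaksin", "Maew", "Somchai"]
A = [
    [2, 1, 0, 3],   # "Thaksin"
    [1, 2, 0, 2],   # "Maew" (an alias of Thaksin, so it co-occurs with it often)
    [0, 0, 4, 1],   # "Somchai"
]

# Basic co-occurrence matrix: C = A * A^T (a names-by-names matrix).
n, d = len(A), len(A[0])
C = [[sum(A[i][k] * A[j][k] for k in range(d)) for j in range(n)] for i in range(n)]

def cosine(u, v):
    """Cosine similarity between two name vectors of the co-occurrence matrix."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

sim_alias = cosine(C[0], C[1])      # "Thaksin" vs. its alias "Maew"
sim_other = cosine(C[0], C[2])      # "Thaksin" vs. the unrelated "Somchai"
```

On this toy matrix the alias pair scores much higher than the unrelated pair, which is the intuition behind thresholding or clustering the similarity values.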

Table 1.1: Thai language characteristics compared to English

Characteristic                                                  English   Thai
Word boundary indicated by spaces                               Yes       No
An explicit mark (i.e., a full stop) at the end of a sentence   Yes       No
Capitalized letters                                             Yes       No
Writing left to right                                           Yes       Yes
Conjugation of verbs                                            Yes       No
Subject-verb agreement                                          Yes       No
Use of articles (definite/indefinite)                           Yes       No
Pronominal form of social position                              No        Yes
Noun classifier                                                 No        Yes

Figure 1.1: Example of Thai name-alias relations based on co-occurrence name information [51]

[Figure: an excerpt from a Thai political news article in which formal names, e.g., “พ.ต.ท.ทักษิณ ชินวัตร” and “นายสมชาย วงศ์สวัสดิ์”, co-occur with the aliases referring to them, e.g., “แม้ว” and “เขยแม้ว”; name aliases and formal names are highlighted in the figure.]

1.2 Problem Statement

A basic assumption of this thesis is that it is possible to find a relationship between names and name aliases by exploiting co-occurrence information [26, 44, 46, 61]. A name and a name alias that frequently co-occur in the same document are likely to refer to the same entity. Such a name-alias relationship can usually be discovered by using clustering techniques. The effectiveness of various preprocessing factors, including weighting schemes, normalization, co-occurrence calculation, and named entity type filtering, is investigated and compared when similarity-based cutoff clustering, hierarchical clustering [43, 96], and clustering based on equivalence relations [47, 48, 90, 95] are used.
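Clustering based on equivalence relations can be pictured as taking the transitive closure of a “similar enough” relation over name pairs. The following union-find sketch is only illustrative; the names, similarity values, and the 0.8 cutoff are hypothetical, not the settings used in the experiments:

```python
# Union-find sketch: names whose pairwise similarity exceeds a cutoff are declared
# equivalent, and the transitive closure of that relation yields the clusters.

def find(parent, x):
    """Return the representative of x's equivalence class (with path compression)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_by_equivalence(names, similarities, cutoff):
    parent = {x: x for x in names}
    for (a, b), sim in similarities.items():
        if sim >= cutoff:
            parent[find(parent, a)] = find(parent, b)   # merge the two classes
    clusters = {}
    for x in names:
        clusters.setdefault(find(parent, x), set()).add(x)
    return list(clusters.values())

# Hypothetical pairwise similarity scores between names.
names = ["Thaksin", "Maew", "Somchai", "Khoei Maew"]
similarities = {("Thaksin", "Maew"): 0.90,
                ("Somchai", "Khoei Maew"): 0.85,
                ("Thaksin", "Somchai"): 0.20}
clusters = cluster_by_equivalence(names, similarities, cutoff=0.8)
# Two clusters result: {"Thaksin", "Maew"} and {"Somchai", "Khoei Maew"}.
```

Each resulting cluster groups a formal name with the aliases judged equivalent to it, directly or transitively.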

Association rule mining is a technique for discovering interesting relations in databases [1, 74]. Based on co-occurring items in transactions, a rule of the form x → y is constructed to represent the relation “if an item x occurs in a transaction, then an item y is likely to also occur in that transaction”. In this thesis, association rule mining is applied to co-occurrence matrix construction, where a relation x → y represents the co-occurrence relation “if a name x occurs in a document, then a name y is likely to occur in the same document” mined from a document collection. To construct the co-occurrence information that exists between names, various association measures, e.g., Confidence, Klosgen, Laplace, Leverage, and Support [4, 80], are used. The effectiveness of co-occurrence matrices constructed by association measures and that of matrix multiplication are investigated and compared using various preprocessing factors, i.e., weighting schemes and normalization, and hierarchical clustering algorithms, i.e., single linkage, complete linkage, and centroid linkage. For evaluation, two datasets from the football and political news categories are employed.
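As a small illustrative sketch (with a toy four-document collection, not the thesis datasets), the Confidence measure from the list above can replace plain co-occurrence counts when filling the matrix:

```python
# Sketch of association-measure-based co-occurrence matrix construction.
# Each document is treated as a transaction of names; the matrix entry (x, y)
# holds Confidence(x -> y) = support(x, y) / support(x).
docs = [
    {"Thaksin", "Maew"},
    {"Thaksin", "Maew", "Somchai"},
    {"Thaksin"},
    {"Somchai"},
]
names = sorted({n for d in docs for n in d})

def support(itemset):
    """Fraction of documents containing every name in the itemset."""
    return sum(1 for d in docs if itemset <= d) / len(docs)

def confidence(x, y):
    return support({x, y}) / support({x})

M = {x: {y: confidence(x, y) for y in names} for x in names}
# Confidence("Maew" -> "Thaksin") = 1.0: "Maew" never occurs without "Thaksin",
# a strong hint of a name-alias relation. Note the matrix is asymmetric, unlike
# the one obtained by matrix multiplication.
```

The same skeleton accommodates the other measures (Klosgen, Laplace, Leverage, Support) by swapping the function applied to each name pair.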

1.3 Contributions

This study makes the following contributions:

1. To apply fundamental factors used in the vector space model and association rule mining for constructing combinations for name-alias relationship identification in Thai news articles.

2. To investigate the effects of these factors in each combination and compare performance differences between them.

3. To determine effective combinations that can further be applied to related tasks in Thai natural language processing, e.g., named entity detection/recognition, named entity disambiguation, and text summarization.

1.4 Organization

The rest of this thesis is organized as follows:

Chapter 2 provides a review of named entity extraction, named entity disam- biguation, name alias detection, and co-occurrence information usage.

Chapter 3 gives an overview of the name-alias relationship identification framework and details the framework components.

Chapter 4 investigates the effects of preprocessing factors and clustering methods.

Chapter 5 explores the effects of co-occurrence matrix construction methods and related factors.

Chapter 6 examines the highest performance combinations and proposes solutions for Thai name-alias relationship identification.

Chapter 7 draws conclusions and proposes future works.

Chapter 2

Backgrounds

This chapter summarizes named entity extraction, named entity disambiguation, and co-occurrence information applications for solving name ambiguity. Thai names and name aliases are also exemplified.

2.1 Named Entity

Named Entity (NE) recognition is widely used in Information Extraction (IE), Question Answering (QA), and other Natural Language Processing (NLP) applications [66]. It was first introduced in the Message Understanding Conferences (MUC), which influenced IE research in the U.S. in the 1990s [20, 32]. The NE task is to identify all named locations, named persons, named organizations, dates, times, monetary amounts, and percentages in text [11].

2.1.1 Named Entity Extraction

Broadly speaking, Information Extraction aims to identify structured and user-desired information from large volumes of unstructured text [34]. Named entity extraction/recognition/classification is a main process of IE, applied to discover a chunk of words which specifies a unique existence, i.e., a person, a time period, or a location [87]. In MUC-7, named entity extraction/recognition is classified into three main groups:

1. Entity names (person, organization, and location names).

2. Temporal expressions (date, time).

3. Numerical expressions (monetary amount, percentage).

Entity names are among the most difficult to recognize because the structures of the names are complicated and the contexts in which they occur vary [86]. Temporal expressions in texts contain significant temporal information [97], which can be applied to temporally aware systems. For example, consider a QA system in the newswire domain: if someone wants to know who was the Prime Minister (PM) of Bangladesh in February 1995, and he only has documents that tell him about the PM from 1994 to 1999, then a temporally aware system will help the QA system infer who was PM in February 1995 as well [88]. Numerical expressions are important because many named entity recognition (NER) approaches in MUC-7 used economic datasets [17]. Additionally, any new group of NEs is allowed depending on the dataset, e.g., medical, biology, politics, or sports.

Many NER systems based on pattern matching rules or statistical models have achieved satisfactory performance on well-formed text. Based on the 1997 MUC-7/MET-2 evaluation, NER systems have reached a 94% F-score on English newswire text, 85%–91% on Chinese text, 87%–93% on Japanese text, and less than 94% on Spanish text [21, 34].

2.1.2 Thai Named Entity Extraction

Discovering Thai entity names is a popular NE task. Charoenpornsawat et al. [16] addressed the problem of Thai proper name identification by using a feature-based approach to extract candidates and then using a learning algorithm, namely Winnow, to recognize the entity names. Their proposed work was successful, with about 92% accuracy. Chattrimongkol [17] developed a Thai named entity recognition and classification system using a hybrid approach composed of statistical and rule-based parts. The statistical part was used for extracting named entity candidates; in this part, the Localmaxs algorithm and various statistical methods were employed for measuring associations between syllables. Named entity candidates were then recognized and classified by linguistic rules in the second part. The recognition rates measured by F-score were 69.15%, 62.95%, and 38.87% for person names, organization names, and location names, respectively.

Sutheebanjard and Premchaiswadi [77] studied Thai personal named entity extraction without using word segmentation or part-of-speech (POS) tagging. By removing non-alphabetic characters and extracting the contextual environment of names (e.g., front and rear context), personal names were identified. The performances evaluated by the F-measure were 91.44%, 91.72%, and 27.28% for the political, financial, and sport news domains, respectively. Tirasaroj and Aroonmanakun [86] proposed Thai named entity recognition (NER) systems using supervised Conditional Random Fields (CRFs) with various answer patterns to find out whether different answer patterns would affect the performance of the systems. Three types of Thai NEs (i.e., person, organization, and location) in the BEST2009 corpus (Benchmark for Enhancing the Standard of Thai Language Processing [82]) were extracted with a maximum accuracy of 81.30% F-score.

Tongtep and Theeramunkong [87] proposed a multi-stage annotation framework for reducing the number of tokens marked as unknown and the number of terms identified as ambiguous. Two stages of chunking (i.e., named entity extraction by pattern matching and word segmentation by dictionary) and three stages of tagging (i.e., dictionary-based, pattern-based, and statistical-based) were employed. The pattern-based tagging was able to reduce the number of tokens marked as unknown by the dictionary-based tagging by 44.76%, and the statistical-based tagging was able to reduce the number of terms identified as ambiguous by both of the above methods by 72.44%. Tepdang [81] improved the performance of Thai word segmentation by utilizing named entity recognition. Based on three word-level grams (3, 5, and 7), NE prefixes and suffixes were used as the main features, and several models were constructed and compared. In the experiments, a Thai segmentation program (Thai Lexeme Analyzer) and the BEST2009 corpus (37 articles, 72,000 words) were used, and the F-scores were improved from 92.23% to 93.38%.

2.1.3 Ambiguous Name

As described in the previous section, NE tasks seem to be successful for many applications. However, though this sounds clear, special cases arise that require lengthy guidelines, e.g., Is “Wall Street Journal” an organization? Is “White House” an organization or a location? Is a street name a location? Should “yesterday” and “last Tuesday” be labelled dates? Is “mid-morning” a time? In order to achieve human annotator consistency, guidelines with numerous special cases have been defined for MUC-7 [20]. Basic problems in NE can be summarized as follows [55]:

• Variation of NEs:

– John Smith, John, Mr. John

• Ambiguity of NE types:

– John Smith (company vs. person) – May (person vs. month) – Washington (person vs. location) – 1945 (date vs. time)

• Ambiguity with common word:

– may (verb)

• Issues of , structure, domain, genre etc:

– Punctuation, spelling, spacing, formatting

As a matter of fact, ambiguous names occur in a remarkable number of cases. Bunescu and Pasca [15] reported that, in September 2005, the English version of Wikipedia had 1,783,868 occurrences of ambiguous named entities. Ambiguity limits the performance of free-text-based information retrieval systems [5] because it can make users retrieve information they are not interested in (e.g., searching for “Georgia” meaning the US state and retrieving documents about “Georgia” meaning the Asian country), or prevent them from finding documents related to what they are interested in (e.g., a search for “Angela Merkel” will not retrieve a document where “Angela Kasner”, her birth name, is mentioned instead) [27].

Generally, two types of ambiguity are often found [12]. Firstly, lexical ambiguity refers to the situation in which a name does not uniquely refer to only one certain entity, but instead ambiguously points to multiple entities. Secondly, referential ambiguity refers to the situation in which an entity has more than one name. Naturally, referential ambiguity is more difficult than lexical ambiguity. For example, in “JT signs on, Chelsea skipper pens fresh one-year deal” [85], how can anyone understand that “JT” means “John Terry”, a footballer for “Chelsea F.C.” in the English Premier League? However, a combination of lexical ambiguity and referential ambiguity is extremely difficult. Given a document including “Barcelona 2-0 victory over United in the 2009 final in Rome”, “Barcelona” and “United”, respectively, mean the professional football clubs “FC Barcelona” and “Manchester United F.C.”, while “Rome” denotes the capital city of Italy, and “the 2009 final” is the final match of the UEFA Champions League in 2009 [70]. The names “Barcelona” and “United” can also be a city in Spain and the American airline “United Airlines”, respectively. Many people may fail to disambiguate them because of a lack of knowledge or misunderstanding. For computers, disambiguating such terms accurately is far more challenging.

2.1.4 Named Entity Disambiguation (NED)

In natural language processing, named entity disambiguation is the problem of mapping mentions of entities in a text to the objects they refer to. It is a step further than named entity recognition, which involves the identification and classification of so-called named entities [31]. Damljanovic and Bontcheva [24] described a disambiguation algorithm using the context in which the particular entity appears and a weighted sum of the following three similarity metrics:

• String similarity: refers to the Levenshtein distance between text strings, e.g., “Paris” and “Paris Hilton”, “Paris” and “Paris”, or “Paris” and “Ontario”.

• Structural similarity: is calculated based on whether the ambiguous NE has a relation with any other NE from the same sentence or document. For example, if the document mentions both “Paris” and “France”, then structural similarity indicates that “Paris” refers to “the capital of France”, and all other candidate entities can be disregarded.

• Contextual similarity: is calculated based on the probability that two words have a similar meaning or they appear with a similar set of other words.
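To make the first of these metrics concrete, here is a minimal Levenshtein edit-distance sketch (a standard dynamic-programming formulation; [24] does not necessarily use this exact implementation):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = cur
    return prev[-1]

d_same = levenshtein("Paris", "Paris")          # identical strings -> 0
d_long = levenshtein("Paris", "Paris Hilton")   # 7 inserted characters -> 7
```

Smaller distances correspond to higher string similarity, so “Paris” is far closer to “Paris” than to “Paris Hilton” or “Ontario” under this metric.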

Based on these similarity metrics, Vu et al. [89] improved disambiguation performance using web directories over the vector space model and named entity recognition. Documents of 24 namesake people, each specialized in a research field, e.g., Computer Science, Physics, Medicine, or History, were used. A knowledge base was applied to the documents for measuring similarity among them, and the documents were then re-ranked so that documents relevant to the same person moved to the top of the ranking.

Fernández et al. [27] implemented the IdentityRank algorithm to address the problem of named entity disambiguation in news items. The algorithm differs from algorithms that try to disambiguate named entities by mapping or linking them to instances defined in an ontology, such as a knowledge base or a controlled vocabulary [23, 25, 33, 92], and from those based on clustering techniques that determine whether two occurrences of the same named entity in two documents refer to the same instance or not [6, 18, 41, 58]; instead, it processed news items to inform future decisions and used news metadata as features for disambiguation. In comparison with the aforementioned algorithms, the new algorithm was evaluated using several corpora of actual news items, achieving an average accuracy of around 96%.

Nui et al. [56] used entity pairs as probable matching candidates and constructed a connection graph between these pairs. The more similar the textual information of a probable pair, the greater the possibility that the two records correspond to the same real-world entity. The proposed method was used to disambiguate authors in publication records. The best F-score, 80.00%, was much better than the result from utilizing only the textual similarity.

While most of these works dealt with text collections, Spina et al. [73] developed keyword filtering for entity names in Twitter, e.g., companies and brands (“apple” may refer to the company, the fruit, the singer, etc.), where texts are very short and there is little context for disambiguation (1 tweet = 140 characters, ∼15 words). Filter keyword expressions were judged related/unrelated to the company by comparison with oracle keywords (perfect on Twitter). The automatically discovered filter keywords were able to classify a subset of 30%–60% of tweets with an accuracy range of 0.75–0.79.

2.2 Name Aliases in News Articles

Newspapers have influenced society for many decades in various aspects of life. With their popularity over other news media, they have been major sources for introducing and promoting new terminologies, including entity names, e.g., names of persons, organizations, places, and products, to the public. From topics and situations of interest, new informal names are often introduced to reflect personal attitudes at certain time points. They are known as name aliases, or aliases for short. An alias often characterizes a person, a group of persons, a place, or an organization in terms of some related activities or behaviours.

The fast growth of electronic media and Internet usage has shifted many news publishers to extend their content to online news. The number of name aliases increases rapidly as the number of published news articles increases. The hidden meanings of name aliases and the variety of their forms (word-component types) increase the difficulty of work related to news analysis, e.g., named entity extraction, named entity recognition, news summarization, and machine translation. Notwithstanding recent development of methods for detecting Thai named entities in news, most of the proposed works focus on formal names [49, 77, 79, 84, 86, 87], rather than name aliases.

Name aliases arise from several sources and depend greatly on writing styles, popularity, and time. It is thus difficult to create rules or patterns for alias detection and identification. Even for English documents, very few works on name alias detection have been reported in the literature. Bhat et al. [9] developed an automated system for discovering aliases for an entity by using a two-stage algorithm based on Latent Semantic Analysis (LSA). Candidate aliases produced using LSA in the first stage are re-ordered in the second stage by considering their types, which are determined by the words surrounding them. In [12, 13], Bollegala et al. used a set of known pairs of names and aliases for constructing web search queries and extracted lexical text patterns for finding aliases from text snippets returned by a search engine. Pantel [60] proposed an unsupervised information-theoretic approach for automatically detecting aliases in malicious environments. Entities exhibiting similar behaviours, determined based on the most informative observations (e.g., emails, phone calls, relational data) between them, are identified. Sapena et al. [65] compared two techniques for alias assignment tasks by using alias-entity pairs as feature functions. The first technique measured global similarity between an alias and all possible entities. It is extended by the second technique with the addition of syntactic features, such as the number of words in an alias or an entity. With this extension, improved performance was observed.

Recently, Shaikh et al. [67] proposed a technique called “exchange of vowels” for Arabic names that have “vowel variations”. This technique defined a list containing the vowels “a, e, i, o, u, y”, which represents the typographic substitution of vowels based on common spelling mistakes, e.g., “Osama bin Laden” and “Usama bin Ladan”. An extension of approximate string matching (ASM) algorithms was used for measuring the similarity between two name variations. In experiments, the extended algorithm showed a significant improvement in accuracy compared to the basic algorithm. Jakhete and Dharmadhikari [38] proposed four different approaches, i.e., lexical pattern frequency, co-occurrence frequency, web Dice, and graph mining measures, for detecting Indian person and location name aliases. To extract name aliases from a web search engine, lexical patterns were used. To identify the correct alias, similarity and graph mining measures were employed. The proposed method improved the precision, with reduced recall, compared to their baseline method.

Jiang et al. [39] used an active-learning-based method to detect entity aliases without string similarity. Alias candidates were extracted and pairwise compared with the concerned entities. An active-learning-based logistic regression classifier was employed to predict whether a candidate was an alias of a given entity. The experimental results on three datasets clearly demonstrated that the proposed approach could effectively detect entity aliases being referred to by referential ambiguity. Shen et al. [69] presented an order-of-magnitude-based similarity mechanism that integrated multiple link properties to derive semantically rich similarity descriptions. This approach extended conventional order-of-magnitude reasoning with the theory of fuzzy sets. The experimental results showed that their approach was effective for name variations caused by typographical and translation errors, e.g., “John Doe” and “Jon Doe”, or “Mohamed Atta” and “Muhammad Atta”. However, it would fail drastically for referential ambiguity, e.g., “Osama bin Laden” and “The Prince”.

Gandhi et al. [29] proposed a word co-occurrence graph, connecting nodes representing names and aliases based on their first-order associations with each other. A graph mining algorithm was used to find the hop distances between nodes for identifying the association orders between names and aliases. Ranking SVM was used to rank the anchor texts according to their co-occurrence statistics. The results revealed that the extracted aliases significantly improved recall in a relation detection task and proved useful in a web search task.

Anwar and Abulaish [2] presented a generic context-based approach for mining alias names of namesakes sharing a common name on the web. The proposed method employed a search-engine application programming interface to retrieve web pages relevant to a given name. The retrieved web pages were modeled as a graph, and a clustering algorithm was applied to disambiguate them. Thereafter, each obtained cluster, standing for one namesake, was mined for alias identification using a text-pattern-based statistical technique. Compared to the state-of-the-art method [13], the average precision derived from this method was considerably higher.

2.3 Thai Name Aliases

Some Thai name aliases are obtained from substrings of their formal names. For example, “เบิร์บ” (/boep/), a transliteration of “Berb”, is an alias for the football player “Dimitar Berbatov”; “แมนยู” (/maen/yu/), a transliteration of “Man U”, is an alias for “Manchester United Football Club”; and “เทพ” (/thep/), a transliteration of “Thep”, is an alias for the politician “Suthep Thuegsuban”. However, some aliases are semantically, rather than syntactically, related to their formal names. For example, “ป๋า” (/pa/), which literally means a father, an experienced man, or an old man, is an alias for “Sir Alex Ferguson”, the former manager of Manchester United Football Club, since he was highly recognized by Thai sport news columnists for his professional success. Similarly, in Thai political news, “เฮียโย่ง” (/hia/yong/), which literally means a tall person, is an alias for the politician “Korn Chatikavanij” owing to his physical appearance; and “สามสี” (/sam/si/), which literally translates as three colors, is an alias for “Trairong Suwankiri”, a former deputy prime minister of Thailand, because his first name also means three colors in Thai.

Aliases may be nested. For example, the ex-prime minister of Thailand, “Somchai Wongsawat”, is called “เขยแม้ว” (/khoei/maeo/), i.e., a brother-in-law of “แม้ว” (/maeo/), which is an alias for “Thaksin Shinawatra”, another ex-prime minister of Thailand.

Table 2.1: Examples of name aliases in Thai sport news

Formal name (English) | Formal name (Thai) | Alias | Romanized script | Meaning
Arsenal | อาร์เซนอล | ปืนใหญ่ | /puen/yi/ | big gun
Dimitar Berbatov | ดิมิทาร์ เบอร์บาตอฟ | ดาวเตะศิลปิน | /dao/te/sin/la/pin/ | artistic player
Manchester City | แมนเชสเตอร์ ซิตี | เรือใบ | /rueabai/ | sailboat
Manchester United | แมนเชสเตอร์ ยูไนเต็ด | ผี | /phi/ | ghost, devil
 | | ผีแดง | /phi/daeng/ | red devil
Sir Alex Ferguson | เซอร์ อเล็กซ์ เฟอร์กูสัน | ป๋า | /pa/ | experienced man

Table 2.2: Examples of name aliases in Thai political news

Formal name (English) | Formal name (Thai) | Alias | Romanized script | Meaning
Korn Chatikavanij | กรณ์ จาติกวณิช | เฮียโย่ง | /hia/yong/ | tall man
Nevin Chidchop | เนวิน ชิดชอบ | ห้อย | /hoi/ | mouth hanging
Phromphong Noprit | พร้อมพงศ์ นพฤทธิ | เสด็จพี | /sadet/phi/ | character in a TV series
Somchai Wongsawat | สมชาย วงศ์สวัสดิ | เขยแม้ว | /khoei/maeo/ | Thaksin’s brother-in-law
Trairong Suwankiri | ไตรรงค์ สุวรรณคีรี | สามสี | /sam/si/ | three colors

Tables 2.1 and 2.2 show examples of formal names and their semantically related aliases in Thai sport and political news articles, respectively.

2.4 Co-occurrence Word Information

Co-occurrence word information is highly useful in NLP systems [54]. It is important for resolving the ambiguity and polysemy of words to improve the accuracy of an entire system, and various studies have addressed it. Morita et al. [54] summarized relations between two basic words, which varied depending on their definitions, such as hierarchical relations (e.g., “Clothes” and “Sports shirt”); case relations (i.e., relations between a verb and a noun phrase); compound word relations (e.g., “America” and “United States of America”); and synonym relations (including shortened forms of synonyms). This work also described a classification technique for co-occurring words and co-occurrence frequency, and proposed a technique for automatically constructing a complete thesaurus.

Matsuo and Ishizuka [52] presented a new keyword extraction algorithm that applies to a single document without using a corpus. Co-occurrences of a term with frequent terms were counted. If a term appeared frequently with a particular subset of terms, the term was likely to have an important meaning. A co-occurrence matrix was then constructed by counting the frequencies of pairwise term co-occurrences. Similarity-based clustering and pairwise clustering were used so that two terms with similar distributions of co-occurrence with other terms, or that co-occur frequently, end up in the same cluster.
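The counting step described above can be sketched as follows. This is a minimal illustration over invented toy documents, not the authors’ implementation; here co-occurrence simply means appearing in the same document:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count how often each unordered term pair appears in the same document.

    docs: list of documents, each a list of tokens.
    Returns a Counter keyed by alphabetically sorted term pairs.
    """
    counts = Counter()
    for doc in docs:
        # set() ignores repeats within one document; sorted() canonicalizes pairs
        for a, b in combinations(sorted(set(doc)), 2):
            counts[(a, b)] += 1
    return counts

# toy documents (tokens are illustrative)
docs = [["moyes", "united", "persie"],
        ["united", "persie"],
        ["moyes", "united"]]
counts = cooccurrence_counts(docs)
# ("persie", "united") co-occur in two of the three documents
```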

Mori et al. [53] studied a novel keyword extraction method to extract user semantics from the web. Based on co-occurrence information of words, the proposed method extracted relevant keywords depending on the context of a person. They defined the co-occurrence of two words as their appearance on the same web page; if two words co-occur in many pages, those two are assumed to have a strong relation. The co-occurrence information was acquired from the number of retrieved documents in a search engine result. For example, the search result for the query “Alfred Kobsa and User Modeling” returned about 3,100 documents, while the query “Alfred Kobsa and Software engineering” returned about 450 documents. In this manner, it can be inferred that “User Modeling” is more relevant to “Alfred Kobsa” than “Software engineering”. They concluded that a word co-occurring with a person’s name in many web pages could be his or her keyword, and a word co-occurring with a context word in many web pages could be a keyword in that context. However, this work faced lexical ambiguity in the case of two or more people having the same full name.

Rokaya et al. [62] used co-word analysis to rank a selected sample of Field Association (FA) terms. A field association term (or word) is a minimum word which cannot be divided without losing its semantic meaning [28], or one of a limited set of discriminating terms that can specify document fields (e.g., “home run” can indicate the baseball field) [68]. Co-word analysis is a method used to establish a subject similarity between two documents. For example, in bibliometrics, if papers A and B are both cited by paper C, they may be said to be related to one another, even though they do not directly cite each other; if papers A and B are both cited by many other papers, they have a stronger relationship. In [62], as a step in co-word analysis, a matrix based on word co-occurrence was built. The value of the cell for two words was the number of times the two words appeared in the same document, and a higher co-occurrence frequency of two words meant a closer relation between them.

By taking advantage of co-occurrence information, Jin [40] introduced a novel classification method based on pattern co-occurrence to derive graph classification rules. This method employed a pattern exploration order such that complementary discriminative patterns were examined first. Patterns were grouped into co-occurrence rules during the pattern exploration, leading to an integrated process of pattern mining and classifier learning. The proposed method produced a more interpretable classifier and showed better or competitive classification effectiveness in terms of accuracy and execution time.

Chen and Yu [19] proposed a word co-occurrence matrix based method for relevance feedback. The definition of the word co-occurrence matrix was given first. Unlike other studies of word association, they considered the influence of inter-word distance and the co-window ratio. The co-occurrence matrix simply represented the semantic relations among documents and was used to calculate the similarity between documents. In the feedback process, the similarity score was combined with the initial score to improve retrieval effectiveness, which was demonstrated by experiments on a TREC dataset.

Chapter 3

Name-alias Relationship Identification

This chapter gives a basic idea of name-alias relationship identification in Thai news articles. The proposed name-alias relationship identification framework and the techniques used in the framework are explained.

3.1 An Overview of the Framework

The purpose of this dissertation is to classify names and name aliases into groups, each of which represents an entity (e.g., a person or an organization). Typically, a name alias in a news article does not occur without an occurrence of its formal name. Based on this observation, name-alias relationships can be identified in given documents: a co-occurrence relationship between names can be evaluated, and the names can then be grouped by a clustering method.

Figure 3.1: An example of Thai football news articles in the context of English Premier League [71]

มอยส์ (/moyes/) หวัง อาร์วีพี (/r/v/p/) ยิงกระจายช่วยผี (/phi/) ป้องกันแชมป์

เดวิด มอยส์ (/david/moyes/) กุนซือ แมนเชสเตอร์ ยูไนเต็ด (/manchester/united/) มั่นใจ โรบิน ฟาน เพอร์ซี่ (/robin/van/persie/) ศูนย์หน้าตัวเก่งจะยังคงโชว์ฟอร์มเทพ และยิงได้อย่างถล่มทลายเหมือนเดิมสำหรับภารกิจป้องกันแชมป์พรีเมียร์ลีกซีซั่นหน้า รับประทับใจการทำงานของดาวยิงทีมชาติฮอลแลนด์ หลังเจ้าตัวบินมาฝึกซ้อมกับ "ปีศาจแดง" (/pisat/daeng/) ที่ซิดนีย์ ประเทศออสเตรเลีย แล้ว

For more illustration, name relations in a news article are demonstrated. Figure 3.1 shows a football news article that contains three formal names, “เดวิด มอยส์” (/david/moyes/), “แมนเชสเตอร์ ยูไนเต็ด” (/manchester/united/), and “โรบิน ฟาน เพอร์ซี่” (/robin/van/persie/), and four name aliases, “มอยส์” (/moyes/), “อาร์วีพี” (/r/v/p/), “ผี” (/phi/), and “ปีศาจแดง” (/pisat/daeng/). Using the formal names and name aliases, 21 name pairs are constructed first (7C2 = 21). However, only name-alias pairs having the same types are considered, e.g., (“โรบิน ฟาน เพอร์ซี่”, “อาร์วีพี”) (both of type person), (“เดวิด มอยส์”, “อาร์วีพี”) (both of type person), and (“ปีศาจแดง”, “ผี”) (both of type organization). Using a clustering method, the remaining pairs can be merged into name clusters, e.g., {“โรบิน ฟาน เพอร์ซี่”, “อาร์วีพี”} and {“เดวิด มอยส์”, “มอยส์”}, where each cluster implies an entity.

Figure 3.2: An overview of name-alias identification framework

[Figure: a collection of news documents yields a name-by-document matrix. The preprocessing part applies (1) weighting, (2) normalization, (3) co-occurrence matrix calculation, (4) similarity matrix calculation, and (5) named entity type filtering to produce a weighted name-by-document matrix and similarity information. The clustering part then applies (6) similarity-based cutoff clustering, (7) hierarchical clustering, or (8) clustering based on equivalence relations.]

As depicted in Figure 3.2, the proposed name-alias relationship identification framework takes as input a name-by-document matrix constructed from a set of documents with their extracted terminology, such as person names and organization names. It consists of two main parts: preprocessing and clustering. In the first part, a weighted name-by-document matrix is constructed from the input name-by-document matrix and similarities among names are calculated. The obtained similarity information is used for clustering names into groups, based on which the name-alias relationships are determined.

Among the five preprocessing steps shown in the figure, weighting and similarity matrix calculation are mandatory. The other three steps, i.e., normalization, co-occurrence matrix calculation, and named entity type filtering, are optional; they are indicated by rectangles with dotted lines. Generally, any clustering method can be applied in the second part. Three types of clustering are investigated and compared in this work, i.e., similarity-based cutoff clustering, hierarchical clustering, and clustering based on equivalence relations.

3.2 Preprocessing Factors and Clustering Methods

The preprocessing factors and clustering methods used in this dissertation are detailed in this section.

3.2.1 Preliminary Notation

The following notation is used throughout:

1. m and n are the number of names and the number of documents, respectively, under consideration.

2. F and A are the set of formal names and the set of name aliases, respectively, under consideration.

3. A = (aij) ∈ Z^(m×n) is a given input name-by-document matrix, where aij is the number of occurrences of the i-th name in the j-th document.

4. T is a set of name types, e.g., person, organization.

5. τ :(F ∪ A) → T is a type assignment function, associating name types with names.

When no confusion is caused, the i-th name and the j-th document in the name-by- document matrix A are often identified with the indexes i and j, respectively. For example, if “เบิร์บ” (/boep/) is the 5th name, then τ(“เบิร์บ”) is also written as τ(5).

3.2.2 Preprocessing Factors

Weighting Scheme

Weighting has been commonly applied in the context of information retrieval in order to enhance retrieval effectiveness. In this work, weighting is employed to improve name identification performance. A general scheme for constructing a weighted matrix W = (wij) ∈ R^(m×n) from the input name-by-document matrix A is wij = lij gi, where lij is the local weight for a name i in a document j and gi is the global weight for a name i in the document collection [64].

The local and global weights are determined based on occurrence frequency in the input matrix A. Two local and six global weights are applied [8, 22]. The two local weights are term frequency (TF), i.e., lij = aij, and binary frequency (BF), i.e., if aij > 0, then lij = 1, otherwise lij = 0. The six global weights are given in Table 3.1, where

• gfi is the global name frequency for a name i, i.e., gfi = Σ_{j=1}^{n} aij,

• dfi is the document frequency for a name i, i.e., the number of documents containing the name i, and

• pij is the probability that a name i occurs in a document j with respect to the global name frequency, i.e., pij = aij / gfi.

Table 3.1: Global weighting

Global weight | Formula
One (ONE) | 1
Global Frequency–Inverse Document Frequency (GFIDF) | gfi / dfi
Inverse Document Frequency (IDF) | 1 + log(n / dfi)
Entropy (ENT) | 1 + Σ_{j=1}^{n} (pij log(pij)) / log(n)
Normal (NORM) | 1 / sqrt(Σ_{j=1}^{n} aij^2)
Probabilistic Inverse (PRINV) | log((n − dfi) / dfi)
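As an illustration of the scheme wij = lij gi, the following sketch combines the TF local weight with the IDF global weight 1 + log(n/dfi). It is a minimal sketch under assumed toy data (numpy is used for convenience; names occurring in no document are not handled):

```python
import numpy as np

def tf_idf_weight(A):
    """Return W with w_ij = a_ij * (1 + log(n / df_i))."""
    A = np.asarray(A, dtype=float)
    n = A.shape[1]                    # number of documents
    df = np.count_nonzero(A, axis=1)  # document frequency of each name
    g = 1.0 + np.log(n / df)          # IDF global weight per name
    return A * g[:, None]

A = [[2, 0, 1],   # a name occurring in 2 of 3 documents
     [1, 1, 1]]   # a name occurring in every document
W = tf_idf_weight(A)
# row 1 keeps its raw counts, because 1 + log(3/3) = 1
```

Substituting binary frequency for the local weight, or another formula from Table 3.1 for `g`, follows the same pattern.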

Normalization (NOR)

In the vector space model, normalization is often used for correcting discrepancies in documents [22]. In this work, the effect of normalization for weight adjustment is investigated. The Cosine function [8] is used for normalization, i.e., the normalized weighted matrix N = (nij) ∈ R^(m×n) is given by

    nij = wij / sqrt(Σ_{i=1}^{m} wij^2),     (3.1)

where wij is an element of the weighted matrix W. In this framework, the normalization step is optional.
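Equation (3.1) can be sketched as follows; this is a minimal illustration over an assumed toy matrix, scaling each document column to unit Euclidean length:

```python
import numpy as np

def cosine_normalize(W):
    """Apply Equation (3.1): n_ij = w_ij / sqrt(sum_i w_ij^2),
    i.e., scale each document column of W to unit Euclidean length."""
    W = np.asarray(W, dtype=float)
    norms = np.linalg.norm(W, axis=0)   # one norm per document column
    norms[norms == 0.0] = 1.0           # leave all-zero columns unchanged
    return W / norms

N = cosine_normalize([[3.0, 0.0],
                      [4.0, 2.0]])
# the first column (3, 4) becomes (0.6, 0.8)
```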

Co-occurrence Matrix Calculation (COC)

Co-occurrence information such as co-citations, co-words, or co-links [3, 50, 53] can be represented by a matrix called a co-occurrence matrix. A co-occurrence matrix, denoted by C, is constructed by multiplying a weighted name-by-document matrix (possibly normalized) by its transpose. In this framework, if the normalization step is applied, then C = N × N^T; otherwise, C = W × W^T.

18 An element of a co-occurrence matrix typically indicates how often two names co-occur in a similar linguistic context. With the assumption that frequent co-occurring names tend to refer to the same entity, it is expected that the addition of co-occurrence information can strengthen relations between names for similarity measurement.
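The construction C = N × N^T (or W × W^T) is a single matrix product; the following minimal sketch, over an assumed binary toy matrix, shows how off-diagonal entries count shared documents:

```python
import numpy as np

def cooccurrence_matrix(M):
    """C = M x M^T for a (possibly normalized) weighted
    name-by-document matrix M; entry c_ij aggregates the evidence
    that names i and j occur in the same documents."""
    M = np.asarray(M, dtype=float)
    return M @ M.T

M = np.array([[1.0, 1.0, 0.0],   # name 0 occurs in documents 0 and 1
              [1.0, 1.0, 1.0],   # name 1 occurs in every document
              [0.0, 0.0, 1.0]])  # name 2 occurs only in document 2
C = cooccurrence_matrix(M)
# names 0 and 1 share two documents, so C[0, 1] == 2;
# names 0 and 2 share none, so C[0, 2] == 0
```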

Similarity Matrix Calculation

A similarity function

    ς : (F ∪ A) × (F ∪ A) → {ϵ ∈ R | 0 ≤ ϵ ≤ 1},     (3.2)

is used for representing referential similarities between names. A value in the codomain of ς represents a similarity score. Typically, it can be determined using the Cosine similarity function, which is defined as follows: for any names i, j ∈ F ∪ A,

    ς(i, j) = (Σ_{k=1}^{p} xk yk) / (sqrt(Σ_{k=1}^{p} xk^2) · sqrt(Σ_{k=1}^{p} yk^2)),     (3.3)

where (x1, ..., xp) and (y1, ..., yp) are the vectors representing the names i and j, respectively, in the matrix input to this process. Note that (i) the input matrix for this step can be the weighted name-by-document matrix W, the normalized weighted matrix N, or the co-occurrence matrix C, depending on the choice of earlier processing steps; and (ii) in Equation (3.3), p is the number of documents (i.e., n) if the input matrix is W or N, and p is the number of names (i.e., m) if the input matrix is C.

Using the similarity function ς, a similarity matrix S = (sij) ∈ R^(m×m), where sij = ς(i, j), is constructed.
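Computing all pairwise Cosine similarities of Equation (3.3) can be sketched by normalizing each row vector and taking one matrix product; toy input values are assumptions for illustration:

```python
import numpy as np

def similarity_matrix(X):
    """s_ij = Cosine similarity (Equation 3.3) between the row
    vectors of X, where each row represents one name."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    norms[norms == 0.0] = 1.0           # avoid division by zero for empty rows
    unit = X / norms[:, None]           # unit-length name vectors
    return unit @ unit.T                # all pairwise dot products at once

S = similarity_matrix([[1.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0],
                       [0.0, 0.0, 1.0]])
# S[0, 1] = 2 / (sqrt(2) * sqrt(3)) ~ 0.816, and S[0, 2] = 0
```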

Named Entity Type Filtering (NET)

The similarity function ς produces a variety of referential similarities between names regardless of their types. Named entity type filtering eliminates similarity relations between names with different entity types. It constructs a typed similarity matrix T = (tij) ∈ R^(m×m) by

    tij = ς(i, j) if τ(i) = τ(j), and tij = 0 otherwise,     (3.4)

where τ is the type assignment function given in Section 3.2.1.
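Equation (3.4) amounts to masking the similarity matrix with a type-equality test; a minimal sketch over assumed toy types:

```python
import numpy as np

def type_filter(S, types):
    """Equation (3.4): zero out similarities between names whose
    entity types differ; types[i] plays the role of tau(i)."""
    S = np.asarray(S, dtype=float)
    types = np.asarray(types)
    same_type = types[:, None] == types[None, :]   # boolean mask, shape m x m
    return np.where(same_type, S, 0.0)

S = [[1.0, 0.8, 0.5],
     [0.8, 1.0, 0.4],
     [0.5, 0.4, 1.0]]
T = type_filter(S, ["person", "person", "organization"])
# cross-type similarities are removed: T[0, 2] == 0.0, but T[0, 1] == 0.8
```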

3.2.3 Clustering Method

To group names based on their similarity, cluster analysis is applied. Three clustering algorithms are considered, i.e., similarity-based cutoff clustering, hierarchical clustering, and clustering based on equivalence relations. The input for the clustering process in this framework is a symmetric matrix U = (uij) ∈ R^(m×m), where U is the typed similarity matrix T (Section 3.2.2) if the preprocessing step NET is applied; otherwise U is the similarity matrix S (Section 3.2.2).

Similarity-based Cutoff Clustering (SBC)

Similarity-based cutoff clustering [94] produces clusters from the similarity relations represented by the matrix U by using a cutting level (or threshold). Given a name i and a cutting level λ, a cluster Ci is the set {j | (1 ≤ j ≤ m) ∧ (uij ≥ λ)}. This clustering algorithm allows cluster overlapping [14], i.e., for any names i and j, Ci and Cj are in general not disjoint.
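The definition above can be sketched directly; this minimal illustration (toy matrix assumed) shows that the resulting clusters may overlap:

```python
def cutoff_clusters(U, lam):
    """For each name i, C_i = {j : u_ij >= lam}.
    U is a symmetric similarity matrix; clusters may overlap."""
    m = len(U)
    return [{j for j in range(m) if U[i][j] >= lam} for i in range(m)]

U = [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
clusters = cutoff_clusters(U, 0.5)
# names 0 and 1 are grouped together; name 2 stays alone
```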

Hierarchical Clustering

Hierarchical clustering [43] is applied for partitioning names into disjoint groups based on the similarity relations in the matrix U. The hierarchical clustering algorithm used takes an agglomerative approach [93], i.e., each name starts in its own cluster, and a hierarchy of clusters is built by successively merging the most similar pair of clusters. More precisely, it works as follows:

1. Create an initial cluster set CL0 = {C1, ..., Cm}, where for each i ∈ {1, ..., m}, Ci is the singleton cluster {i}.

2. Let h = 0.

3. While CLh contains more than one cluster, perform the following steps:

(a) Find the largest element uij in U such that (i) i ≠ j and (ii) i and j belong to different clusters in CLh.

(b) Let C and C′ be the clusters in CLh that contain the names i and j, respectively.

(c) Let Cnew = C ∪ C′.

(d) Let CLh+1 = (CLh − {C, C′}) ∪ {Cnew}.

(e) For any k ∈ Cnew, update the row and column for k in U using the procedure Update(k, i, j).

(f) h := h + 1.

The procedure Update used at Step 3e is parametrized by the choice of linkage function. It takes three input names k, i, and j and updates the row and column for k in U using the vectors representing the names i and j and a linkage function, denoted by Linkage, as follows: for each p ∈ {1, ..., m}, replace ukp and upk with Linkage(uip, ujp). Three linkage functions are considered, i.e., the single linkage (SLC), the complete linkage (CLC), and the centroid linkage (ZLC). Given real numbers x and y, Linkage(x, y) returns max(x, y), min(x, y), and (x + y)/2, respectively, when SLC, CLC, and ZLC are employed.
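The merge loop and linkage update can be sketched as follows. This is a simplified, unoptimized illustration (brute-force pair search per merge, toy matrix assumed), not the thesis implementation; `merges` controls how many merge steps are performed:

```python
import numpy as np

def agglomerate(U, linkage=min, merges=None):
    """Agglomerative clustering over a symmetric similarity matrix U
    (higher = more similar).  linkage is applied pairwise: max gives
    single (SLC), min gives complete (CLC), and lambda x, y: (x+y)/2
    gives centroid (ZLC) linkage."""
    U = np.asarray(U, dtype=float).copy()
    m = len(U)
    clusters = [{i} for i in range(m)]
    merges = m - 1 if merges is None else merges
    for _ in range(merges):
        # (a) most similar pair of names lying in different clusters
        best, bi, bj = float("-inf"), -1, -1
        for i in range(m):
            for j in range(i + 1, m):
                if not any(i in c and j in c for c in clusters) and U[i, j] > best:
                    best, bi, bj = U[i, j], i, j
        # (b)-(d) merge the two clusters containing bi and bj
        ci = next(c for c in clusters if bi in c)
        cj = next(c for c in clusters if bj in c)
        clusters = [c for c in clusters if c is not ci and c is not cj] + [ci | cj]
        # (e) linkage update of the rows/columns of every merged name
        merged_row = np.array([linkage(U[bi, p], U[bj, p]) for p in range(m)])
        for k in ci | cj:
            U[k, :] = merged_row
            U[:, k] = merged_row
    return clusters

U0 = [[1.0, 0.9, 0.1, 0.0],
      [0.9, 1.0, 0.0, 0.1],
      [0.1, 0.0, 1.0, 0.8],
      [0.0, 0.1, 0.8, 1.0]]
parts = agglomerate(U0, linkage=min, merges=2)  # complete linkage, 2 merges
# the two tight pairs are recovered: {0, 1} and {2, 3}
```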

Clustering Based on Equivalence Relations (ERC)

The similarity function represented by U determines a fuzzy relation R between names. From R, an equivalence relation R^∞ is constructed, where R^1 = R and, for any t > 1, R^t = R^(t−1) ∘ R, which is given by

    μ_{R^(t−1) ∘ R}(i, j) = max_{1≤k≤m} [ min[ μ_R(i, k), μ_{R^(t−1)}(k, j) ] ]     (3.5)

for any names i, j ∈ F ∪ A. A λ-cut is then applied to R^∞ in order to partition the set F ∪ A. The λ-cut equivalence relation R_λ is defined by

    μ_{R_λ}(i, j) = 1 if μ_{R^∞}(i, j) ≥ λ, and μ_{R_λ}(i, j) = 0 otherwise,     (3.6)

for any names i, j ∈ F ∪ A. The obtained quotient set (F ∪ A)/R_λ yields the set of clusters.
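The max-min composition of Equation (3.5) and the λ-cut of Equation (3.6) can be sketched as follows; a minimal illustration over an assumed toy relation, iterating the composition until a fixed point:

```python
import numpy as np

def transitive_closure(R):
    """Max-min transitive closure of a fuzzy relation R (Equation 3.5),
    iterated until a fixed point is reached."""
    R = np.asarray(R, dtype=float)
    while True:
        # (R o R)_ij = max_k min(R_ik, R_kj)
        comp = np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1)
        nxt = np.maximum(R, comp)
        if np.allclose(nxt, R):
            return nxt
        R = nxt

def lambda_cut_clusters(R, lam):
    """Partition the names via the lambda-cut of the closed relation
    (Equation 3.6); the quotient set gives the clusters."""
    Rinf = transitive_closure(R)
    m, seen, clusters = len(Rinf), set(), []
    for i in range(m):
        if i not in seen:
            cluster = {j for j in range(m) if Rinf[i, j] >= lam}
            seen |= cluster
            clusters.append(cluster)
    return clusters

R = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.8],
     [0.0, 0.8, 1.0]]
# transitivity lifts mu(0, 2) up to min(0.9, 0.8) = 0.8
parts_08 = lambda_cut_clusters(R, 0.8)  # one cluster {0, 1, 2}
parts_09 = lambda_cut_clusters(R, 0.9)  # {0, 1} and {2}
```

Because the λ-cut is taken on the transitive closure, the result is always a genuine partition rather than overlapping clusters.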

3.3 Experimental Settings

The collections of news articles used in this dissertation are described below, along with name annotation with type assignment, dataset characteristics, name-by-document matrix construction, and performance measurement.

3.3.1 Data Collection

Two datasets [78] obtained from different news domains are employed in experiments. The first dataset, referred to as DFB, consists of 1,000 football news articles in the context of the English Premier League, collected from Siam Sport Online [71], which is the most popular online sport newspaper in Thailand. The second dataset, referred to as DPL, consists of 1,000 political news articles, gathered from a famous Thai news website, Manager Online [51]. The two datasets are used since they contain many name aliases. Names in the two collections and their types are manually annotated. The annotated names in the first dataset are divided into two types, i.e., football clubs (FBC) and footballer names including managers (PLN). Those in the second dataset are divided into politician names (POL) and non-politician names (NOP).

3.3.2 Name Filtering

Names that rarely occur are unlikely to be recognized by the public and tend to have no alias. Such names are filtered out based on the median frequency, i.e., a name i is filtered out if gfi < f̃, where gfi is the global name frequency for i, and f̃ is the median frequency derived from all global name frequencies. The median frequency in DFB is 12, whereas that in DPL is 3. The remaining names in each dataset are recorded in a typed named entity (TNE) list. Only the names in a TNE list are used for constructing an input name-by-document matrix.
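The median-frequency filter can be sketched as follows; the frequencies are invented for illustration only:

```python
from statistics import median

def filter_rare_names(global_freqs):
    """Keep only names whose global frequency reaches the median
    frequency over all names (Section 3.3.2)."""
    f_tilde = median(global_freqs.values())
    return {name for name, gf in global_freqs.items() if gf >= f_tilde}

# hypothetical global name frequencies
freqs = {"david moyes": 40, "moyes": 55, "rvp": 12, "rare name": 1}
kept = filter_rare_names(freqs)
# the median is 26, so only "david moyes" and "moyes" survive
```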

3.3.3 Dataset Characteristics

Table 3.2 details the characteristics of the two datasets. Before filtering, the number of names in DFB is approximately half of that in DPL (264/515 for formal names and 144/274 for name aliases), formal names occur more frequently in DFB (8,221/3,463), and the numbers of name alias occurrences in DFB and DPL are approximately the same (14,048/14,103). After filtering, the number of names in DFB is still less than that in DPL (93/138 for formal names and 133/195 for name aliases), the number of name occurrences in DFB is higher (7,313/2,929 for formal names and 13,850/12,822 for name aliases), and the proportion of name occurrences to names in DFB is higher (78.63/21.22 for formal names and 104.14/65.75 for name aliases).

Table 3.2: Dataset characteristics

Characteristic | Before filtering: DFB | Before filtering: DPL | After filtering: DFB | After filtering: DPL
#formal names (types) | 264 | 515 | 93 | 138
#name aliases (types) | 144 | 274 | 133 | 195
#formal name occurrences | 8,221 | 3,463 | 7,313 | 2,929
#name alias occurrences | 14,048 | 14,103 | 13,850 | 12,822
#formal name occurrences / #formal names | 31.14 | 6.72 | 78.63 | 21.22
#name alias occurrences / #name aliases | 97.56 | 51.47 | 104.14 | 65.75
#name aliases / #formal names | 0.55 | 0.53 | 1.43 | 1.41
#name alias occurrences / #formal name occurrences | 1.71 | 4.07 | 1.89 | 4.38

Table 3.3: Number of formal names having each particular number of alias names

#Aliases | DFB: FBC | DFB: FLN | DPL: POL | DPL: NOP
0 | 0 | 37 | 23 | 2
1 | 0 | 17 | 59 | 9
2 | 5 | 15 | 23 | 3
3 | 5 | 3 | 9 | 1
4 | 6 | 0 | 1 | 2
5 | 2 | 1 | 2 | 0
6 | 1 | 0 | 0 | 2
7 | 1 | 0 | 0 | 1
8 | 0 | 0 | 1 | 0

Most formal names in the two datasets have at most two aliases, and only 11 formal names have more than four aliases. Table 3.3 shows the number of formal names having each particular number of alias names after name filtering. For example, the row for two aliases indicates in its second and third columns that 5 formal organization names and 15 formal person names in DFB have two alias names.

3.3.4 Evaluation Method

The performance of this framework is evaluated by the F1-measure (F1), defined by

    F1 = 2 × (P × R) / (P + R),     (3.7)

where P (precision) is the proportion of correctly retrieved name-alias pairs to all retrieved name-alias pairs, and R (recall) is that of correctly retrieved name-alias pairs to all relevant name-alias pairs. Varying cutting levels, varying levels of merging, and varying λ-values, respectively, are used for investigating the performance of similarity-based cutoff clustering, hierarchical clustering, and clustering based on equivalence relations. As a result, a large number of F1-values are obtained from each clustering method. The highest obtained F1-value is chosen to indicate the performance of the clustering method under consideration.

The recall and precision are calculated by enumerating all name-alias pairs for both reference and system answers. As an example, suppose that {a, b, c, d} and {x, y} are two name clusters, where a and x are formal names, b, c, and d are name aliases for a, and y is the name alias for x. Also suppose that the system can detect two name clusters {a, b, c} and {x, y, d}.

1. The reference name-alias pairs are (a, b), (a, c), (a, d), (b, c), (b, d), (c, d), and (x, y) (Nr = 7).

2. The system name-alias pairs are (a, b), (a, c), (b, c), (x, y), (x, d), and (y, d) (Ns = 6).

3. From the above combinations, the correct name-alias pairs are (a, b), (a, c), (b, c), and (x, y) (Nc = 4).

4. The recall is Nc/Nr = 4/7, while the precision is Nc/Ns = 4/6.

Using the precision and recall, the F1-measure is obtained.
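The pairwise evaluation above can be reproduced with a short sketch (function name is illustrative); it enumerates within-cluster pairs for the reference and system clusterings and computes the F1-measure of Equation (3.7):

```python
from itertools import combinations

def pairwise_f1(reference, system):
    """F1 over within-cluster name pairs, as in Section 3.3.4."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}
    ref, sys_ = pairs(reference), pairs(system)
    correct = ref & sys_
    precision = len(correct) / len(sys_)   # N_c / N_s
    recall = len(correct) / len(ref)       # N_c / N_r
    return 2 * precision * recall / (precision + recall)

# the worked example above
reference = [{"a", "b", "c", "d"}, {"x", "y"}]
system = [{"a", "b", "c"}, {"x", "y", "d"}]
f1 = pairwise_f1(reference, system)
# recall 4/7 and precision 4/6 give F1 = 8/13 ~ 0.615
```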

Chapter 4

Experimental Results

In this chapter, the effectiveness of various preprocessing factors of the framework in Chapter 3, including weighting, normalization, co-occurrence calculation, and named entity type filtering, is investigated and compared when similarity-based cutoff clustering, hierarchical clustering, and clustering based on equivalence relations are used. For evaluation, two datasets from the football and political news categories are employed.

4.1 Overall Results

This section presents the output performance obtained using different combinations of preprocessing factors and clustering methods.

4.1.1 Evaluation Results

As described in Section 3.2, two local weights (TF, BF), six global weights (ENT, GFIDF, IDF, NORM, ONE, PRINV), three optional preprocessing steps (NOR, COC, NET), and five clustering methods (SBC, SLC, CLC, ZLC, ERC) are selected. The number of all possible combinations is thus 2 × 6 × 2^3 × 5 = 480. The combinations are divided into two groups. The first group consists of all combinations with at least one of NOR, COC, and NET (480 − (2 × 6 × 5) = 420 combinations). The second group consists of all those in which none of the optional preprocessing steps is selected (2 × 6 × 5 = 60 combinations).

Table 4.1 shows the results of the first combination group in DFB, ranked by their output performance in descending order. The 3rd–5th columns indicate whether each optional preprocessing step is selected (‘1’) or not selected (‘0’). The 7th column shows characteristics of resulting clusters in the format M/S/A, where M is the number of clusters with multiple names, S is the number of singleton clusters, and A is the average number of names in a cluster. Referring to Tables 3.2 and 3.3, the clusters obtained by manual annotation (the reference clusters) in DFB, where each cluster contains exactly one formal name along with (zero or more) name aliases, have the

Table 4.1: Top 20 combinations and those in the ranks 40–420 stepping by 20 in DFB

Rank | Weighting | NOR | COC | NET | Clustering | Cluster characteristic | F1
1 | TF-ENT | 0 | 1 | 1 | CLC | 55/44/2.28 | 97.24
2 | TF-IDF | 0 | 1 | 1 | CLC | 56/41/2.33 | 97.07
3 | BF-ONE | 1 | 1 | 1 | CLC | 55/43/2.31 | 96.72
4 | TF-ENT | 1 | 1 | 1 | CLC | 57/39/2.35 | 96.35
5 | TF-IDF | 1 | 1 | 1 | CLC | 55/43/2.31 | 95.81
6 | TF-PRINV | 1 | 1 | 1 | CLC | 57/41/2.31 | 95.60
7 | TF-GFIDF | 0 | 1 | 1 | CLC | 56/42/2.31 | 95.41
8 | TF-NORM | 1 | 1 | 1 | CLC | 56/44/2.26 | 95.38
9 | BF-IDF | 1 | 1 | 1 | CLC | 53/49/2.22 | 95.19
10 | TF-ONE | 1 | 1 | 1 | CLC | 56/44/2.26 | 95.03
11 | TF-ONE | 0 | 1 | 1 | CLC | 56/43/2.28 | 94.83
12 | TF-PRINV | 0 | 1 | 1 | CLC | 54/46/2.26 | 94.66
13 | TF-GFIDF | 1 | 1 | 1 | CLC | 57/41/2.31 | 94.49
14 | BF-IDF | 0 | 1 | 1 | CLC | 53/52/2.15 | 94.32
15 | TF-ENT | 0 | 1 | 1 | ZLC | 57/40/2.33 | 94.12
16 | TF-ENT | 1 | 1 | 1 | ZLC | 60/37/2.33 | 94.12
17 | BF-PRINV | 1 | 1 | 1 | CLC | 53/51/2.17 | 93.83
18 | TF-GFIDF | 1 | 1 | 1 | ZLC | 59/42/2.24 | 93.83
19 | BF-PRINV | 0 | 1 | 1 | CLC | 54/51/2.15 | 93.51
20 | TF-GFIDF | 1 | 1 | 1 | SLC | 56/45/2.24 | 93.51
40 | TF-GFIDF | 0 | 1 | 1 | ZLC | 61/37/2.31 | 91.82
60 | BF-ENT | 1 | 1 | 1 | ZLC | 58/41/2.28 | 87.41
80 | TF-NORM | 1 | 1 | 1 | ZLC | 57/50/2.11 | 84.98
100 | TF-PRINV | 0 | 0 | 1 | ZLC | 56/52/2.09 | 84.75
120 | TF-ONE | 1 | 0 | 1 | SLC | 55/53/2.09 | 83.56
140 | BF-ENT | 0 | 0 | 1 | CLC | 53/65/1.92 | 82.01
160 | BF-ONE | 0 | 0 | 1 | ZLC | 57/54/2.04 | 79.92
180 | BF-ONE | 0 | 0 | 1 | ERC | 55/61/1.95 | 76.80
200 | TF-ONE | 0 | 1 | 1 | SBC | 224/2/4.97 | 74.63
220 | TF-IDF | 1 | 1 | 0 | SLC | 54/71/1.81 | 71.08
240 | TF-ENT | 0 | 0 | 1 | SBC | 218/8/5.03 | 68.30
260 | BF-NORM | 0 | 0 | 1 | SBC | 218/8/5.00 | 66.44
280 | TF-IDF | 1 | 0 | 0 | SLC | 51/76/1.78 | 64.90
300 | TF-NORM | 1 | 1 | 0 | ERC | 52/77/1.75 | 63.13
320 | TF-NORM | 1 | 1 | 0 | ZLC | 58/84/1.59 | 60.05
340 | TF-ONE | 0 | 1 | 0 | ZLC | 61/74/1.67 | 57.97
360 | BF-ENT | 1 | 1 | 0 | CLC | 55/79/1.69 | 56.21
380 | BF-IDF | 0 | 1 | 0 | ERC | 37/62/2.28 | 54.90
400 | BF-NORM | 0 | 1 | 0 | SBC | 191/35/5.35 | 52.35
420 | BF-NORM | 0 | 1 | 0 | ZLC | 53/96/1.52 | 47.69

Table 4.2: Combinations without any optional preprocessing step in DFB

Rank | Weighting | NOR | COC | NET | Clustering | Cluster characteristic | F1
1 | TF-* | 0 | 0 | 0 | ERC | 50/75/1.81 | 61.11
2 | TF-* | 0 | 0 | 0 | CLC | 53/84/1.65 | 60.00
3 | TF-* | 0 | 0 | 0 | SLC | 50/83/1.70 | 59.36
4 | TF-* | 0 | 0 | 0 | ZLC | 60/78/1.64 | 58.06
5 | BF-* | 0 | 0 | 0 | ERC | 51/85/1.66 | 57.27
6 | TF-* | 0 | 0 | 0 | SBC | 207/19/5.63 | 56.31
7 | BF-* | 0 | 0 | 0 | SLC | 51/91/1.59 | 55.42
8 | BF-* | 0 | 0 | 0 | SBC | 198/28/5.58 | 53.97
9 | BF-* | 0 | 0 | 0 | CLC | 56/96/1.49 | 50.66
10 | BF-* | 0 | 0 | 0 | ZLC | 59/79/1.64 | 49.28

Table 4.3: Top 20 combinations and those in the ranks 40–420 stepping by 20 in DPL

Rank | Weighting | NOR | COC | NET | Clustering | Cluster characteristic | F1
1 | TF-ONE | 0 | 1 | 1 | CLC | 103/66/1.97 | 76.37
2 | TF-ENT | 1 | 1 | 1 | CLC | 102/61/2.04 | 76.35
3 | TF-IDF | 1 | 1 | 1 | CLC | 104/61/2.02 | 76.22
4 | TF-ONE | 1 | 1 | 1 | CLC | 103/57/2.08 | 76.08
5 | TF-PRINV | 1 | 1 | 1 | CLC | 104/64/1.98 | 75.00
6 | TF-NORM | 1 | 1 | 1 | CLC | 101/65/2.01 | 73.36
7 | TF-ONE | 1 | 1 | 0 | CLC | 109/58/1.99 | 72.81
8 | TF-ENT | 0 | 1 | 1 | CLC | 107/55/2.06 | 72.05
9 | BF-GFIDF | 1 | 1 | 1 | CLC | 105/72/1.88 | 71.29
10 | TF-ENT | 1 | 1 | 0 | CLC | 107/48/2.15 | 71.24
11 | TF-GFIDF | 1 | 1 | 1 | CLC | 113/53/2.01 | 70.74
12 | TF-PRINV | 0 | 1 | 1 | CLC | 111/48/2.09 | 70.70
13 | BF-GFIDF | 1 | 1 | 0 | CLC | 109/70/1.86 | 70.41
14 | TF-IDF | 0 | 1 | 1 | CLC | 109/55/2.03 | 70.03
15 | TF-NORM | 0 | 1 | 1 | CLC | 104/64/1.98 | 69.94
16 | TF-IDF | 1 | 1 | 0 | CLC | 108/49/2.12 | 69.91
17 | BF-GFIDF | 0 | 1 | 1 | CLC | 107/73/1.85 | 69.88
18 | TF-ENT | 1 | 1 | 1 | ERC | 101/45/2.28 | 69.78
19 | TF-ENT | 1 | 1 | 1 | SLC | 101/49/2.22 | 69.47
20 | BF-ONE | 1 | 1 | 1 | CLC | 86/103/1.76 | 69.43
40 | TF-IDF | 1 | 1 | 0 | ZLC | 112/67/1.86 | 67.70
60 | TF-IDF | 1 | 0 | 1 | ERC | 106/59/2.02 | 66.98
80 | TF-NORM | 1 | 1 | 1 | ERC | 103/67/1.96 | 66.36
100 | BF-IDF | 1 | 1 | 0 | ZLC | 109/70/1.86 | 65.76
120 | TF-ENT | 0 | 0 | 1 | ZLC | 110/70/1.85 | 65.17
140 | TF-ENT | 0 | 1 | 0 | CLC | 107/58/2.02 | 64.55
160 | TF-ONE | 0 | 0 | 1 | ERC | 101/60/2.07 | 64.17
180 | BF-ENT | 1 | 1 | 0 | ERC | 100/69/1.97 | 63.76
200 | TF-NORM | 0 | 1 | 1 | ZLC | 111/71/1.83 | 63.46
220 | TF-ONE | 1 | 1 | 0 | ERC | 96/97/1.73 | 63.16
240 | TF-ENT | 0 | 0 | 1 | SLC | 103/68/1.95 | 62.50
260 | BF-GFIDF | 0 | 0 | 1 | CLC | 106/83/1.76 | 62.06
280 | BF-GFIDF | 1 | 1 | 1 | SBC | 309/24/4.18 | 61.68
300 | BF-NORM | 1 | 0 | 0 | ZLC | 109/81/1.75 | 61.04
320 | BF-GFIDF | 1 | 1 | 0 | SBC | 307/26/4.05 | 60.08
340 | TF-PRINV | 0 | 1 | 0 | SLC | 95/94/1.76 | 58.42
360 | TF-PRINV | 0 | 1 | 0 | ERC | 89/108/1.69 | 57.24
380 | BF-GFIDF | 0 | 1 | 0 | SBC | 292/41/3.93 | 56.32
400 | TF-NORM | 1 | 0 | 1 | SBC | 311/22/4.19 | 54.80
420 | TF-GFIDF | 0 | 1 | 0 | ERC | 71/144/1.55 | 44.07

Table 4.4: Combinations without any optional preprocessing step in DPL

Rank | Weighting | NOR | COC | NET | Clustering | Cluster characteristic | F1
1 | TF-* | 0 | 0 | 0 | CLC | 113/58/1.95 | 65.44
2 | TF-* | 0 | 0 | 0 | ZLC | 112/68/1.85 | 63.43
3 | BF-* | 0 | 0 | 0 | ERC | 98/66/2.03 | 62.14
4 | BF-* | 0 | 0 | 0 | ZLC | 108/83/1.74 | 60.68
5 | BF-* | 0 | 0 | 0 | CLC | 105/87/1.73 | 60.22
6 | BF-* | 0 | 0 | 0 | SLC | 100/85/1.80 | 59.45
7 | TF-* | 0 | 0 | 0 | ERC | 100/80/1.85 | 59.01
8 | TF-* | 0 | 0 | 0 | SLC | 104/69/1.92 | 58.96
9 | BF-* | 0 | 0 | 0 | SBC | 312/21/4.20 | 55.17
10 | TF-* | 0 | 0 | 0 | SBC | 323/10/4.45 | 53.31

The reference clusters in DFB have the characteristics 56/37/2.43, i.e., there are 56 formal names with aliases (56 = 17 + 5 + 15 + 5 + 3 + 6 + 2 + 1 + 1 + 1), 37 formal names without any alias, and on average a formal name has 2.43 − 1 = 1.43 aliases (2.43 = (93 + 133)/(56 + 37)). All combinations in the top 20 use the preprocessing steps COC and NET. All those in the top 14 use the clustering method CLC. The performance difference between the first rank and the last rank is approximately 50% (97.24 − 47.69).
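The M/S/A cluster-characteristic summary used throughout the tables can be computed directly from a clustering. The sketch below is illustrative; the function name and the synthetic cluster lists in the check are not from the thesis.

```python
def cluster_characteristics(clusters):
    """Summarize a clustering in the M/S/A format used in the tables:
    M = number of clusters with multiple names, S = number of singleton
    clusters, A = average number of names per cluster."""
    multi = sum(1 for c in clusters if len(c) > 1)
    single = sum(1 for c in clusters if len(c) == 1)
    avg = sum(len(c) for c in clusters) / len(clusters)
    return multi, single, round(avg, 2)
```

Applied to any clustering with 56 multi-name clusters, 37 singletons, and 226 names in total, this reproduces the DFB reference characteristics 56/37/2.43.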

Table 4.2 shows the performance of the second group in DFB. When none of NOR, COC, and NET is selected, the choice of global weights does not affect clustering results. The asterisk sign in the 2nd column represents any arbitrary global weight.

Table 4.3 shows the performance of the first combination group in DPL. All combinations in the top 20 use the preprocessing step COC and all those in the top 17 use the clustering method CLC. The reference clusters in this dataset have the characteristics 113/25/2.41. The performance difference between the first rank and the last rank is approximately 32% (76.37 − 44.07). Table 4.4 shows the performance of the second combination group in DPL.

4.1.2 Preliminary Discussion

As shown in Tables 4.1 and 4.3, the F1-performance of the best combination in DFB and that in DPL are 97.24% and 76.37%, respectively. From their characteristics shown in Table 3.2, the performance difference between the two datasets can be explained based on the following observations:

1. The number of name occurrences in DFB (7,313 formal name occurrences and 13,850 name alias occurrences) is far higher than that in DPL (2,929 formal name occurrences and 12,822 name alias occurrences).

2. The proportion of name occurrences to names in DFB (78.63 for formal names and 104.14 for name aliases) is far greater than that in DPL (21.22 for formal names and 65.75 for name aliases).

3. The proportion of name alias occurrences to formal name occurrences in DFB (1.89) is less than half of that in DPL (4.38).

Almost all combinations in the top 20 of both datasets use COC, i.e., co-occurrence information is very important. Consequently, from the first and second observations, the probability of successful name-alias relationship identification in DFB is higher than in DPL. Moreover, since most formal names in the two datasets have no more than 2 aliases (see Table 3.3), the third observation suggests that name-alias identification in DPL is more difficult.

Detailed analysis and discussion of the effects of preprocessing factors and clustering methods are given in the next section.

4.2 Analysis of Effects of Preprocessing Factors and Clustering Methods

Effects of weighting schemes and their statistical significance are investigated in Section 4.2.1. A similar investigation for clustering methods and that for optional preprocessing steps are given in Sections 4.2.2 and 4.2.3, respectively.

4.2.1 Effects of Weighting Schemes

To investigate their effects, the 12 weighting schemes are pairwise compared. For any weighting scheme w, let C(w) be the set of all combinations of w and the other factors, i.e., the three binary preprocessing steps and the clustering method (2 × 2 × 2 × 5 = 40 combinations). Given weighting schemes w1 and w2, two types of comparisons between C(w1) and C(w2) are conducted:

1. The performance of corresponding combinations: Based on the F1 measure, corresponding combinations in C(w1) and C(w2) are compared individually, and the comparison results are shown in the format W/D/L, where W, D, and L are the numbers of combinations on which w1 wins over, draws with, and loses to w2, respectively.

2. The significance of the difference between C(w1) and C(w2): The F-test is first used to compare the variance of the F1 performance of C(w1) with that of C(w2). If the variance difference is not significant, the t-test with equal variances is applied; otherwise, the t-test with unequal variances is used.
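This two-stage test can be sketched as follows using SciPy: a two-sided F-test on the sample variances decides which form of the independent two-sample t-test to apply. The function name and the 0.05 significance level are assumptions for illustration; the thesis does not state its threshold here.

```python
import numpy as np
from scipy import stats

def compare_groups(f1_a, f1_b, alpha=0.05):
    """Compare two lists of F1 scores: F-test on variances first, then the
    matching form of the t-test (equal or unequal variances)."""
    a = np.asarray(f1_a, dtype=float)
    b = np.asarray(f1_b, dtype=float)
    # Two-sided F-test for equality of variances.
    f = np.var(a, ddof=1) / np.var(b, ddof=1)
    p_var = 2 * min(stats.f.cdf(f, len(a) - 1, len(b) - 1),
                    stats.f.sf(f, len(a) - 1, len(b) - 1))
    equal_var = p_var >= alpha  # variances not significantly different
    _t, p = stats.ttest_ind(a, b, equal_var=equal_var)
    return p, equal_var
```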

Table 4.5 shows the comparison results in DFB, where the star sign indicates a statistically significant difference. The last two columns give the average F1 performance and the score of the first type of comparison, where a winning combination, a drawing combination, and a losing combination get 3 points, 1 point, and 0 points, respectively. For example, TF-ENT wins over, draws with, and loses to TF-GFIDF 17, 10, and 13 times, respectively, and the difference between C(TF-ENT) and C(TF-GFIDF) is not statistically significant. The table shows that (i) TF yields better results than BF and (ii) except for the case when GFIDF is used, the differences between TF and BF are statistically significant. The top 3 weighting schemes are TF-ENT, TF-IDF, and TF-GFIDF, with scores of 1,101, 1,031, and 1,022 points, respectively.
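The 3/1/0 scoring rule can be checked against the tables directly. The helper below is illustrative; the W/D/L tuples are TF-ENT's row from Table 4.5.

```python
def score(rows):
    """Total points for one scheme from its W/D/L results against all
    other schemes: 3 points per win, 1 per draw, 0 per loss."""
    return sum(3 * w + d for (w, d, _l) in rows)

# TF-ENT's W/D/L row against the 11 other schemes in Table 4.5 (DFB):
tf_ent = [(17, 10, 13), (18, 17, 5), (30, 10, 0), (22, 10, 8), (23, 13, 4),
          (40, 0, 0), (40, 0, 0), (39, 0, 1), (40, 0, 0), (38, 0, 2), (40, 0, 0)]
print(score(tf_ent))  # → 1101, the 1,101 points reported for TF-ENT
```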

Table 4.6 shows the comparison results in DPL. Except for the comparisons involving TF-GFIDF and BF-NORM, no statistically significant difference is observed.

Table 4.5: Comparisons between weighting schemes in DFB

Weighting  TF-ENT  TF-GFIDF  TF-IDF  TF-NORM  TF-ONE  TF-PRINV  BF-ENT  BF-GFIDF  BF-IDF  BF-NORM  BF-ONE  BF-PRINV  F1(avg)  Point
TF-ENT    -  17/10/13  18/17/5  30/10/0  22/10/8  23/13/4  40/0/0⋆  40/0/0  39/0/1⋆  40/0/0⋆  38/0/2⋆  40/0/0⋆  75.01  1,101
TF-GFIDF  13/10/17  -  14/10/16  27/10/3  18/10/12  18/10/12  39/0/1⋆  40/0/0  37/0/3⋆  40/0/0⋆  38/0/2⋆  40/0/0⋆  74.65  1,022
TF-IDF    5/17/18  16/10/14  -  30/10/0  19/10/11  16/15/9  40/0/0⋆  40/0/0  39/0/1⋆  40/0/0⋆  38/0/2⋆  40/0/0⋆  74.67  1,031
TF-NORM   0/10/30  3/10/27  0/10/30  -  2/10/28  1/10/29  35/0/5  25/0/15  34/0/6  40/0/0⋆  29/0/11  36/0/4  71.08  665
TF-ONE    8/10/22  12/10/18  11/10/19  28/10/2  -  12/10/18  40/0/0⋆  40/0/0  39/0/1⋆  40/0/0⋆  39/0/1⋆  40/0/0⋆  74.50  977
TF-PRINV  4/13/23  12/10/18  9/15/16  29/10/1  18/10/12  -  39/0/1⋆  40/0/0  38/0/2⋆  40/0/0⋆  38/0/2⋆  40/0/0⋆  74.53  979
BF-ENT    0/0/40⋆  1/0/39⋆  0/0/40⋆  5/0/35  0/0/40⋆  1/0/39⋆  -  2/10/28  13/17/10  28/10/2  8/10/22  16/13/11  67.16  282
BF-GFIDF  0/0/40  0/0/40  0/0/40  15/0/25  0/0/40  0/0/40  28/10/2  -  23/10/7  30/10/0  25/10/5  27/10/3  69.42  494
BF-IDF    1/0/39⋆  3/0/37⋆  1/0/39⋆  6/0/34  1/0/39⋆  2/0/38⋆  10/17/13  7/10/23  -  30/10/0  9/10/21  18/13/9  67.48  324
BF-NORM   0/0/40⋆  0/0/40⋆  0/0/40⋆  0/0/40⋆  0/0/40⋆  0/0/40⋆  2/10/28  0/10/30  0/10/30  -  2/10/28  1/10/29  63.64  65
BF-ONE    2/0/38⋆  2/0/38⋆  2/0/38⋆  11/0/29  1/0/39⋆  2/0/38⋆  22/10/8  5/10/25  21/10/9  28/10/2  -  22/11/7  67.77  405
BF-PRINV  0/0/40⋆  0/0/40⋆  0/0/40⋆  4/0/36  0/0/40⋆  0/0/40⋆  11/13/16  3/10/27  9/13/18  29/10/1  7/11/22  -  66.99  246

Table 4.6: Comparisons between weighting schemes in DPL

Weighting  TF-ENT  TF-GFIDF  TF-IDF  TF-NORM  TF-ONE  TF-PRINV  BF-ENT  BF-GFIDF  BF-IDF  BF-NORM  BF-ONE  BF-PRINV  F1(avg)  Point
TF-ENT    -  27/10/3⋆  21/12/7  25/10/5  27/10/3  25/10/5  28/0/12  24/0/16  27/0/13  30/0/10⋆  25/0/15  28/0/12  63.76  913
TF-GFIDF  3/10/27⋆  -  2/10/28⋆  13/10/17  6/10/24  2/10/28⋆  17/0/23⋆  14/0/26⋆  16/0/24⋆  24/0/16  16/0/24⋆  18/0/22  60.05  443
TF-IDF    7/12/21  28/10/2⋆  -  23/10/7  26/11/3  24/10/6  27/0/13  23/0/17  27/0/13  28/0/12⋆  25/0/15  27/0/13  63.50  848
TF-NORM   5/10/25  17/10/13  7/10/23  -  14/10/16  8/10/22  18/0/22  17/0/23  19/0/21  27/0/13  14/0/26  21/0/19  62.11  551
TF-ONE    3/10/27  24/10/6  3/11/26  16/10/14  -  4/10/26  16/0/24  17/0/23  19/0/21  28/0/12  18/0/20  20/0/20  62.18  555
TF-PRINV  5/10/25  28/10/2⋆  6/10/24  22/10/8  26/10/4  -  22/0/18  22/0/18  22/0/18  28/0/12⋆  22/0/18  24/0/16  63.05  731
BF-ENT    12/0/28  23/0/17⋆  13/0/27  22/0/18  24/0/16  18/0/22  -  9/11/20  21/13/6  28/10/2⋆  10/10/20  17/17/6  62.44  652
BF-GFIDF  16/0/24  26/0/14⋆  17/0/23  23/0/17  23/0/17  18/0/22  20/11/9  -  22/10/8  28/10/2⋆  10/10/20  23/11/6  62.76  730
BF-IDF    13/0/27  24/0/16⋆  13/0/27  21/0/19  21/0/19  18/0/22  6/13/21  8/10/22  -  29/10/1⋆  10/10/20  18/14/8  62.35  600
BF-NORM   10/0/30⋆  16/0/24  12/0/28⋆  13/0/27  12/0/28  12/0/28⋆  2/10/28⋆  2/10/28⋆  1/10/29⋆  -  2/10/28⋆  2/10/28⋆  60.56  302
BF-ONE    15/0/25  24/0/16⋆  15/0/25  26/0/14  22/0/18  18/0/22  20/10/10  20/10/10  20/10/10  18/10/2⋆  -  25/10/5  62.93  749
BF-PRINV  12/0/28  22/0/18  13/0/27  19/0/21  20/0/20  16/0/24  6/17/17  6/11/23  8/14/18  28/10/2⋆  5/10/25  -  62.18  527

The top 3 weighting schemes are TF-ENT, TF-IDF, and BF-ONE, with scores of 913, 848, and 749 points, respectively.

4.2.2 Effects of Clustering Methods

For any clustering method c, let C(c) be the set of all combinations of c and the other factors (2 × 6 × 2 × 2 × 2 = 96 combinations). Again, given clustering methods c1 and c2, the comparison between C(c1) and C(c2) is conducted using the two types of comparisons described in Section 4.2.1.

Table 4.7: Comparisons between clustering methods in DFB

Clustering  SBC  ERC  SLC  CLC  ZLC  F1(avg)  Point
SBC  -  0/0/96⋆  4/0/92⋆  21/0/75⋆  29/0/67⋆  63.91  162
ERC  96/0/0⋆  -  58/27/11  34/0/62  72/0/24  72.54  807
SLC  92/0/4⋆  11/27/58  -  23/0/73  69/0/27  71.85  612
CLC  75/0/21⋆  62/0/34  73/0/23  -  95/0/1  74.23  915
ZLC  67/0/29⋆  24/0/72  27/0/69  1/0/95  -  70.34  357

Table 4.7 shows the comparison results in DFB. The performance obtained using SBC is significantly different from that obtained using the other clustering methods. The top 3 clustering methods are CLC, ERC, and SLC, with scores of 915, 807, and 612 points, respectively.

Table 4.8: Comparisons between clustering methods in DPL

Clustering  SBC  ERC  SLC  CLC  ZLC  F1(avg)  Point
SBC  -  1/0/95⋆  0/0/96⋆  0/0/96⋆  0/0/96⋆  56.50  3
ERC  95/0/1⋆  -  52/9/35  18/0/78⋆  28/0/68⋆  62.58  588
SLC  96/0/0⋆  35/9/52  -  5/1/90⋆  17/0/79⋆  62.32  469
CLC  96/0/0⋆  78/0/18⋆  90/1/5⋆  -  75/2/19⋆  66.38  1,020
ZLC  96/0/0⋆  68/0/28⋆  79/0/17⋆  19/2/75⋆  -  63.85  788

Table 4.8 shows the comparison results in DPL. Except for the case when c1 is ERC and c2 is SLC, the difference between C(c1) and C(c2) is statistically significant. The top 3 clustering methods are CLC, ZLC, and ERC, with their scores being 1,020, 788, and 588 points, respectively.

4.2.3 Effects of Optional Preprocessing Steps

Next, consider the three preprocessing steps NOR, COC, and NET. Given any arbitrary permutation [p1, p2, p3] of these preprocessing steps, the following notation is used:

• For each i ∈ {1, 2}, si denotes the status of pi as to whether it is selected.

• C(s1, s2) is the set of all combinations of weights, preprocessing steps, and clustering methods in which p1 has the status s1, p2 has the status s2, and the status of p3 is arbitrary.

For example, C(NOR=1, COC=0) is the set of all combinations in which NOR is selected and COC is not selected (2 × 6 × 2 × 5 = 120 combinations).
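The combination counts behind this notation can be verified by enumerating the full factor space and filtering on the fixed statuses. The variable names below are illustrative only.

```python
from itertools import product

local = ['BF', 'TF']
global_ = ['ENT', 'GFIDF', 'IDF', 'NORM', 'ONE', 'PRINV']
binary = [0, 1]
clusterings = ['SBC', 'ERC', 'SLC', 'CLC', 'ZLC']

# Fields of each tuple: (local weight, global weight, NOR, COC, NET, clustering)
combos = list(product(local, global_, binary, binary, binary, clusterings))

# C(NOR=1, COC=0): NOR selected, COC not selected, NET arbitrary.
c_nor1_coc0 = [c for c in combos if c[2] == 1 and c[3] == 0]
```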

For any pair of distinct optional preprocessing steps p1 and p2, the two types of comparisons described in Section 4.2.1 are made between C(p1 = 1, s) and C(p1 = 0, s), where s is a given status of p2. Table 4.9 shows the comparison results in DFB. To illustrate the meaning of this table, consider, for example, the following two cells:

• “96/0/24⋆” in the column “NOR=0” of the row “COC=1 vs COC=0”.

• “109/0/11⋆” in the column “NOR=1” of the same row.

Table 4.9: Comparisons between optional preprocessing steps in DFB

Preprocessing     NOR=0  NOR=1  COC=0  COC=1  NET=0  NET=1  Point
NOR=1 vs NOR=0  -  -  74/1/45  102/2/16  106/1/13⋆  70/2/48⋆  1,062
COC=1 vs COC=0  96/0/24⋆  109/0/11⋆  -  -  89/0/31⋆  116/0/4⋆  1,230
NET=1 vs NET=0  120/0/0⋆  120/0/0⋆  120/0/0⋆  120/0/0⋆  -  -  1,440

Table 4.10: Comparisons between optional preprocessing steps in DPL

Preprocessing     NOR=0  NOR=1  COC=0  COC=1  NET=0  NET=1  Point
NOR=1 vs NOR=0  -  -  95/0/25⋆  112/0/8⋆  111/0/9⋆  96/0/24⋆  1,242
COC=1 vs COC=0  69/0/51  104/0/16⋆  -  -  82/0/38  91/0/29⋆  1,038
NET=1 vs NET=0  120/0/0⋆  113/2/5⋆  116/1/3⋆  117/1/2⋆  -  -  1,402

The first cell indicates the comparison result between C(COC=1, NOR=0) and C(COC=0, NOR=0), i.e., the result of comparing (i) the combinations in which COC is selected but NOR is not selected with (ii) the corresponding combinations in which both COC and NOR are not selected. Similarly, the second cell provides the comparison result between C(COC=1, NOR=1) and C(COC=0, NOR=1). These two cells show that:

1. No matter whether NOR is selected, the selection of COC yields better performance, and the difference between selection and non-selection of COC is statistically significant.

2. NOR affects COC, i.e., the use of NOR widens the difference between selection and non-selection of COC (the number of winning combinations increases from 96 to 109 when NOR is used).

The last column shows that the use of NET has the greatest overall effects (1,440 points), compared to the use of COC (1,230 points) and that of NOR (1,062 points).

Table 4.10 shows the comparison results in DPL. It indicates that NOR affects COC, COC affects both NOR and NET, and NET affects COC. Almost all comparison results in DPL are significantly different. The use of NET has the greatest overall effects (1,402 points), compared to the use of NOR (1,242 points) and that of COC (1,038 points).

Chapter 5

Exploration of Co-occurrence Matrix Construction Methods

This chapter explores an alternative co-occurrence matrix construction method using association measures. The effectiveness of the association measures and that of the traditional co-occurrence matrix construction method are investigated and compared under various preprocessing factors, i.e., weighting schemes, normalization, and linkage functions for hierarchical clustering. For evaluation, the two news collections used in Chapter 4 are employed.

5.1 An Overview of the Framework

The purpose of the new name-alias relationship identification framework is to classify names and name aliases into groups, each of which represents an entity (e.g., a person or an organization). Based on similarity measures derived from co-occurrence information, name-alias pairs with strong similarity are considered candidate name-alias pairs. Using a clustering method, candidate pairs are merged into hierarchical clusters. A name alias is associated with a formal name if they belong to the same cluster.

Figure 5.1 shows the proposed framework. It consists of four main parts:

1. Preprocessing: A name-by-document matrix constructed from a set of documents with their extracted terminology, such as person names and organization names, is taken as input to the framework. A weighted matrix is calculated from the name-by-document matrix. A normalization process is optionally applied to the weighted matrix.

2. Co-occurrence Matrix Construction: Two alternative co-occurrence matrix construction methods are explored, i.e., construction using matrix multiplication and construction using association measures.

3. Similarity Calculation: The similarity measuring process is used to construct relationships among names. The named entity (NE) type filtering process is employed to filter out related names having different types.

4. Name Clustering: After the NE type filtering process, the remaining names are grouped by using hierarchical clustering.

Figure 5.1: An overview of name-alias identification framework

[Figure content: a collection of news documents yields a name-by-document matrix, which flows through preprocessing (weighting scheme, normalization), co-occurrence matrix construction (matrix multiplication or association rule measure), similarity calculation (similarity measuring, NE type checking), and name clustering (hierarchical clustering).]

5.2 Preprocessing Factors

The preprocessing factors, co-occurrence matrix construction, similarity calculation, and clustering methods used in this framework are detailed below.

5.2.1 Weighting Schemes

Two local weights (BF and TF) and six global weights (ENT, GFIDF, IDF, NORM, ONE, PRINV) used in Chapter 3 are also applied in the new framework. The local and global weights are determined based on occurrence frequencies in the input matrix A. The general scheme wij = lij·gi is also used for constructing a weighted matrix W = (wij) ∈ R^{m×n}.

5.2.2 Normalization (NOR)

The normalization process proposed in Chapter 3 is still used. It takes the weighted matrix W as input. Using the Cosine function, a normalized weighted matrix N = (nij) ∈ R^{m×n} is constructed. In this framework, the normalization step is optional.
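A minimal sketch of the weighting and normalization steps follows. The exact global-weight formulas come from Chapter 3 and are not reproduced in this chapter, so the IDF-style global weight and row-wise cosine normalization below are assumptions for illustration only.

```python
import numpy as np

def weight_and_normalize(A):
    """A: m-by-n name-by-document matrix of raw occurrence counts.
    Returns the weighted matrix W (w_ij = l_ij * g_i) and a cosine-normalized
    matrix N in which each nonzero name vector has unit Euclidean length."""
    tf = A.astype(float)                        # local weight: term frequency
    df = np.count_nonzero(A, axis=1)            # documents containing each name
    g = np.log(A.shape[1] / np.maximum(df, 1))  # assumed IDF-style global weight
    W = tf * g[:, None]                         # w_ij = l_ij * g_i
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    N = W / np.maximum(norms, 1e-12)            # cosine (unit-length) rows
    return W, N
```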

5.2.3 Similarity Measure

For measuring referential similarities among names, the similarity function proposed in Chapter 3 is used. The input matrix for this step can be the weighted name-by-document matrix W, the normalized weighted matrix N, or the co-occurrence matrix C (Section 5.3), depending on the choice of earlier processing steps. In Equation (3.3), p is the number of documents (i.e., n) if the input matrix is W or N, and p is the number of names (i.e., m) if the input matrix is C. Using the similarity function ς in Equation (3.3), a similarity matrix S = (sij) ∈ R^{m×m}, where sij = ς(i, j), is constructed.

5.2.4 Named Entity Type Filtering (NET)

Using the type assignment function given in Section 3.2.1 and the input matrix S, a typed similarity matrix T = (tij) ∈ R^{m×m} is constructed. Different from the previous framework, the named entity type filtering applied in the new framework is mandatory.

5.3 Co-occurrence Matrix Construction (COC)

In this section, two methods for co-occurrence matrix construction are described. The first uses simple matrix multiplication and the second employs association measures. An element of the co-occurrence matrix typically indicates how often two names co-occur in a similar linguistic context. Under the assumption that frequently co-occurring names tend to refer to the same entity, the addition of co-occurrence information is expected to strengthen relations between names for similarity measurement. The co-occurrence matrix construction step itself is optional.

5.3.1 Matrix Multiplication (MULP)

Matrix multiplication is a basic method for co-occurrence matrix construction. A co-occurrence matrix C = (cij) ∈ R^{m×m} is constructed by multiplying a weighted name-by-document matrix by its transpose, i.e., C = W × Wᵀ, or a normalized weighted matrix by its transpose, i.e., C = N × Nᵀ.
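In NumPy the MULP construction is a one-liner: each entry c_ij is the inner product of the document-occurrence vectors of names i and j, so names appearing in the same documents get large co-occurrence values. The example matrix is made up for illustration.

```python
import numpy as np

def cooccurrence_mulp(W):
    """C = W × W^T: an m-by-m symmetric co-occurrence matrix."""
    return W @ W.T

# Three names over three documents; names 0 and 1 share documents, name 2 does not.
W = np.array([[1.0, 2.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
C = cooccurrence_mulp(W)
```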

5.3.2 Association Measures

Traditionally, association rule mining is applied to a set of transactions in order to discover rules for predicting the occurrence of an item in a transaction based on the occurrences of other items. In the new framework, association rule mining is adopted for finding name co-occurrence relations in a document collection. By applying association measures, various patterns of co-occurrence information are produced. The association measures are normally defined based on a support function. The support function given in [75] is used.

The input of the support function is either the weighted matrix W or the normalized weighted matrix N. Let U = (uij) ∈ R^{m×n} be the input matrix (U = W or U = N). The support function, denoted by σ, is defined by

σ(x, y) = ( Σ_{j=1}^{n} min(u_{xj}, u_{yj}) ) / ( Σ_{j=1}^{n} max_{i=1,…,m} u_{ij} )     (5.1)

for any names x and y in F ∪ A. Given a name x ∈ F ∪ A, σ(x, x) is simply written as σ(x).

Table 5.1: Association measure functions for co-occurrence matrix construction

Measure            Function       Definition
Support (SUPP)     Supp(x → y)    σ(x, y)
Confidence (CONF)  Conf(x → y)    max( σ(x,y)/σ(x), σ(y,x)/σ(y) )
Klosgen (KLOS)     Klos(x → y)    √σ(x,y) · max( σ(x,y)/σ(x) − σ(y), σ(y,x)/σ(y) − σ(x) )
Leverage (LEVE)    Leve(x → y)    σ(x, y) − σ(x)σ(y)
Lift (LIFT)        Lift(x → y)    σ(x, y) / ( σ(x)σ(y) )

Table 5.1 shows the association measure functions [4, 30, 35, 80] considered in this work. When an association measure function is selected, a co-occurrence matrix C = (cij) ∈ R^{m×m} is constructed. For example, if x and y are the i-th and j-th names and the confidence function (CONF) is used, then cij = Conf(x → y). Under the functions in Table 5.1, the rules x → y and y → x have the same association value for any names x and y; the obtained matrix C is thus symmetric.
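Equation (5.1) and the Table 5.1 measures can be sketched in NumPy as below. The function names are illustrative, and the tiny 2-name-by-2-document matrix in the check is made up.

```python
import math
import numpy as np

def support(U, x, y):
    """sigma(x, y) = sum_j min(u_xj, u_yj) / sum_j max_i u_ij  (Eq. 5.1)."""
    return np.minimum(U[x], U[y]).sum() / U.max(axis=0).sum()

def conf(U, x, y):
    s, sx, sy = support(U, x, y), support(U, x, x), support(U, y, y)
    return max(s / sx, s / sy)

def klos(U, x, y):
    s, sx, sy = support(U, x, y), support(U, x, x), support(U, y, y)
    return math.sqrt(s) * max(s / sx - sy, s / sy - sx)

def leve(U, x, y):
    return support(U, x, y) - support(U, x, x) * support(U, y, y)

def lift(U, x, y):
    return support(U, x, y) / (support(U, x, x) * support(U, y, y))
```

Because σ is symmetric in its arguments, each of these measures gives the same value for x → y and y → x, which is why the resulting matrix C is symmetric.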

5.3.3 Name Clustering

The hierarchical clustering described in Section 3.2.3 is applied to partition names into disjoint groups based on the typed similarity relations in the matrix T. It is the only clustering method used in this framework.
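The exact stopping criterion comes from Chapter 3 and is not restated here; the agglomerative sketch below, which merges clusters under complete linkage while their weakest pairwise similarity stays above a cutoff, is an assumption-laden illustration of the CLC variant, not the thesis's implementation.

```python
def complete_linkage_clusters(S, cutoff):
    """Agglomerative clustering over a symmetric similarity matrix S.
    Complete linkage scores a cluster pair by the *minimum* similarity
    between their members; merging stops once no pair clears the cutoff."""
    clusters = [[i] for i in range(len(S))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(S[i][j] for i in clusters[a] for j in clusters[b])
                if link > best:
                    best, pair = link, (a, b)
        if best < cutoff:
            break
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```

For example, with names 0 and 1 strongly similar and name 2 dissimilar to both, a cutoff of 0.5 yields the two clusters {0, 1} and {2}.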

5.4 Evaluation Results

The following combinations of factors, methods, and linkage functions are investigated:

• 2 × 6 × 2 = 24 combinations of preprocessing factors, i.e., select one of the two local weights (BF, TF), one of the six global weights (ENT, GFIDF, IDF, NORM, ONE, PRINV), and whether the normalization process (NOR) is applied.

• 7 alternative co-occurrence matrix construction methods, i.e., use matrix multiplication (MULP), use one of the five interestingness measures (SUPP, CONF, KLOS, LEVE, LIFT) for mining association rules, or disable co-occurrence matrix construction (NONE).

• 3 alternative linkage functions for hierarchical clustering, i.e., use one of single linkage (SLC), complete linkage (CLC), and centroid linkage (ZLC).

Altogether, 24×7×3 = 504 combinations are considered. The combinations are divided into two groups. The first group consists of all combinations in which NOR is disabled and co-occurrence matrix construction is disabled (2×6×1×1×3 = 36 combinations). The combinations in this group are used as baseline combinations. The second group consists of all the remaining combinations (504 − 36 = 468 combinations).
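The grouping into baseline and non-baseline combinations can be checked by enumeration. The variable names are illustrative only.

```python
from itertools import product

local = ['BF', 'TF']
global_ = ['ENT', 'GFIDF', 'IDF', 'NORM', 'ONE', 'PRINV']
nor = [0, 1]
coc = ['MULP', 'SUPP', 'CONF', 'KLOS', 'LEVE', 'LIFT', 'NONE']
linkage = ['SLC', 'CLC', 'ZLC']

# Fields of each tuple: (local weight, global weight, NOR, COC scheme, linkage)
combos = list(product(local, global_, nor, coc, linkage))

# Baseline group: NOR disabled and co-occurrence matrix construction disabled.
baseline = [c for c in combos if c[2] == 0 and c[3] == 'NONE']
```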

Table 5.2: Baseline combinations in DFB

Rank  Weighting  NOR  COC  Clustering  Cluster Characteristic  F1

1  TF-*  0  NONE  CLC  55/56/2.04  86.12
2  TF-*  0  NONE  SLC  55/49/2.17  84.84
3  TF-*  0  NONE  ZLC  56/52/2.09  84.75
4  BF-*  0  NONE  CLC  55/60/1.97  81.76
5  BF-*  0  NONE  ZLC  57/54/2.04  79.92
6  BF-*  0  NONE  SLC  56/62/1.92  77.12

Table 5.2 shows the results of the first combination group (baseline combinations) in DFB, ranked by their output performances in descending order. The 6th column shows characteristics of the resulting clusters in the format M/S/A, where M is the number of clusters with multiple names, S is the number of singleton clusters, and A is the average number of names in a cluster. Referring to Tables 3.2 and 3.3, the clusters obtained by manual annotation (the reference clusters) in DFB, where each cluster contains exactly one formal name along with (zero or more) name aliases, have the characteristics 56/37/2.43, i.e., there are 56 formal names with aliases (56 = 17 + 5 + 15 + 5 + 3 + 6 + 2 + 1 + 1 + 1), 37 formal names without any alias, and on average a formal name has 2.43 − 1 = 1.43 aliases (2.43 = (93 + 133)/(56 + 37)). When the preprocessing step NOR and co-occurrence matrix construction are both disabled, the choice of global weights does not affect clustering results. The asterisk sign in the 2nd column represents any arbitrary global weight.

Table 5.3: Top 25 combinations and some selected combinations in the ranks 30–468 in DFB

Rank  Weighting  NOR  COC   Clustering  Cluster Characteristic  F1
1    TF-ONE    0  SUPP  CLC  56/37/2.43  98.75
2    TF-ONE    1  SUPP  CLC  57/36/2.43  98.21
3    TF-ENT    0  MULP  CLC  55/43/2.31  97.80
4    TF-IDF    0  MULP  CLC  56/40/2.35  97.63
5    BF-GFIDF  1  LEVE  CLC  56/37/2.43  96.98
6    TF-ENT    0  SUPP  CLC  56/38/2.40  96.95
7    TF-PRINV  0  SUPP  CLC  56/38/2.40  96.95
8    BF-GFIDF  0  KLOS  CLC  56/38/2.40  96.95
9    TF-ENT    1  MULP  CLC  57/38/2.38  96.91
10   TF-PRINV  1  CONF  CLC  55/43/2.31  96.90
11   BF-GFIDF  1  KLOS  CLC  55/43/2.31  96.90
12   TF-PRINV  1  KLOS  CLC  55/40/2.38  96.77
13   TF-ONE    0  KLOS  CLC  54/43/2.33  96.59
14   TF-IDF    1  CONF  CLC  55/43/2.31  96.54
15   TF-PRINV  1  LEVE  CLC  55/43/2.31  96.54
16   TF-IDF    1  MULP  CLC  55/42/2.33  96.38
17   TF-IDF    0  CONF  CLC  55/42/2.33  96.38
18   TF-PRINV  1  MULP  CLC  56/41/2.33  96.36
19   TF-IDF    1  LEVE  CLC  54/46/2.26  96.34
20   BF-ONE    1  MULP  CLC  55/44/2.28  96.32
21   TF-GFIDF  0  LEVE  CLC  54/44/2.31  96.19
22   TF-GFIDF  0  MULP  CLC  55/42/2.33  96.17
23   TF-PRINV  1  SUPP  CLC  56/42/2.31  96.15
24   TF-IDF    1  KLOS  CLC  56/41/2.33  96.03
25   TF-NORM   1  MULP  CLC  56/43/2.28  95.96
30   TF-PRINV  0  LEVE  CLC  55/43/2.31  95.64
40   BF-GFIDF  1  CONF  CLC  56/37/2.43  95.39
50   TF-IDF    0  LEVE  CLC  55/43/2.31  94.91
60   TF-GFIDF  0  KLOS  ZLC  59/40/2.28  94.42
70   TF-GFIDF  1  KLOS  SLC  56/42/2.31  93.73
80   TF-GFIDF  1  KLOS  ZLC  59/40/2.28  93.51
90   BF-ENT    1  CONF  CLC  53/46/2.28  93.26
100  TF-IDF    1  MULP  ZLC  58/43/2.24  93.08
150  BF-ONE    1  SUPP  CLC  54/53/2.11  91.30
200  TF-PRINV  1  LIFT  CLC  56/50/2.13  88.19
250  TF-NORM   1  CONF  SLC  56/48/2.17  85.55
300  BF-PRINV  1  MULP  SLC  56/32/2.57  83.80
350  BF-ONE    1  LIFT  SLC  56/52/2.09  82.24
400  TF-NORM   1  NONE  SLC  56/56/2.02  80.00
450  BF-NORM   0  LIFT  ZLC  57/62/1.90  74.15
465  BF-NORM   0  KLOS  SLC  56/70/1.79  70.25
466  BF-NORM   1  KLOS  SLC  56/70/1.79  70.25
467  BF-NORM   0  SUPP  SLC  55/70/1.81  70.20
468  BF-NORM   0  SUPP  ZLC  60/64/1.82  68.89

Table 5.3 shows the performance of the second combination group in DFB. The 3rd column indicates whether the optional preprocessing step NOR is used ('1') or disabled ('0'). Most combinations in the top 25 use association rule measures (5, 5, 4, and 3 combinations with SUPP, KLOS, LEVE, and CONF, respectively). The interestingness measure function SUPP gives the highest performance. All combinations in the top 25 use the linkage function CLC, and 15 of them use the preprocessing step NOR. The performance difference between the first rank and the last rank is approximately 30% (98.75 − 68.89).

Table 5.4: Baseline combinations in DPL

Rank  Weighting  NOR  COC  Clustering  Cluster Characteristic  F1

1  TF-*  0  NONE  CLC  108/70/1.87  67.24
2  TF-*  0  NONE  ZLC  110/70/1.85  65.17
3  BF-*  0  NONE  ZLC  108/75/1.82  63.30
4  TF-*  0  NONE  SLC  103/68/1.95  62.50
5  BF-*  0  NONE  CLC  106/83/1.76  62.06
6  BF-*  0  NONE  SLC  102/82/1.81  61.51

Table 5.4 shows the performance of the first combination group (baseline combinations) in DPL. The reference clusters in this dataset have the characteristics 113/25/2.41. Table 5.5 shows the performance of the second combination group in DPL. Most combinations in the top 25 use association rule measures (9, 8, and 3 combinations with SUPP, KLOS, and LEVE, respectively). The function KLOS yields the highest performance. All combinations in the top 25 use the linkage function CLC, and 13 of them use the preprocessing step NOR. The performance difference between the first rank and the last rank is approximately 30% (83.87 − 53.81).

5.4.1 Effects of Weighting Schemes

To investigate their effects, the 12 weighting schemes are pairwise compared. For any weighting scheme w, let C(w) be the set of all combinations of w and the other factors (1 × 1 × 2 × 7 × 3 = 42 combinations). Given weighting schemes w1 and w2, the two types of comparisons between C(w1) and C(w2) are conducted in the same way as in Section 4.2.1.

Table 5.6 shows the comparison results in DFB, where the star sign indicates a statistically significant difference. The last two columns give the average F1 performance and the score of the first type of comparison, where a winning combination, a drawing combination, and a losing combination get 3 points, 1 point, and 0 points, respectively.

Table 5.5: Top 25 combinations and some selected combinations in the ranks 30–468 in DPL

Rank  Weighting  NOR  COC   Clustering  Cluster Characteristic  F1

1    BF-GFIDF  0  KLOS  CLC  103/57/2.08  83.87
2    TF-IDF    0  SUPP  CLC  107/53/2.08  79.27
3    TF-ONE    1  KLOS  CLC  108/50/2.11  77.88
4    BF-GFIDF  1  KLOS  CLC  102/62/2.03  77.76
5    TF-ONE    1  SUPP  CLC  104/62/2.01  77.06
6    BF-ONE    1  KLOS  CLC  89/83/1.94   76.53
7    TF-GFIDF  1  SUPP  CLC  105/63/1.98  76.42
8    TF-ONE    0  MULP  CLC  103/66/1.97  76.37
9    TF-ENT    1  MULP  CLC  102/61/2.04  76.35
10   TF-IDF    1  MULP  CLC  104/61/2.02  76.22
11   TF-ONE    1  MULP  CLC  103/57/2.08  76.08
12   TF-IDF    1  SUPP  CLC  109/52/2.07  75.93
13   TF-IDF    0  LEVE  CLC  105/60/2.02  75.63
14   TF-PRINV  1  MULP  CLC  104/64/1.98  75.00
15   TF-PRINV  1  SUPP  CLC  105/59/2.03  75.00
16   BF-ONE    0  KLOS  CLC  101/65/2.01  74.58
17   TF-ENT    1  LEVE  CLC  106/63/1.97  74.20
18   TF-PRINV  0  SUPP  CLC  104/62/2.01  74.11
19   TF-GFIDF  0  SUPP  CLC  104/72/1.89  74.03
20   TF-GFIDF  0  KLOS  SLC  95/40/2.47   73.88
21   TF-PRINV  0  LEVE  CLC  100/61/2.07  73.81
22   TF-ONE    0  KLOS  CLC  106/61/1.99  73.77
23   TF-ONE    0  SUPP  CLC  105/57/2.06  73.76
24   TF-ENT    1  SUPP  CLC  107/59/2.01  73.75
25   TF-IDF    0  KLOS  CLC  108/57/2.02  73.47
30   BF-GFIDF  0  KLOS  SLC  98/43/2.36   72.82
40   TF-IDF    0  LIFT  CLC  105/64/1.97  71.59
50   TF-GFIDF  1  MULP  CLC  113/53/2.01  70.74
60   BF-GFIDF  0  SUPP  CLC  111/64/1.90  70.30
70   TF-ONE    1  LIFT  CLC  103/65/1.98  70.02
80   BF-IDF    1  CONF  CLC  104/70/1.91  69.72
90   BF-GFIDF  1  LIFT  CLC  104/63/1.99  69.28
100  BF-PRINV  0  KLOS  CLC  108/67/1.90  68.78
150  BF-IDF    0  SUPP  CLC  105/67/1.94  67.31
200  TF-ENT    1  LEVE  SLC  99/72/1.95   65.85
250  BF-GFIDF  1  NONE  CLC  105/82/1.78  64.90
300  BF-PRINV  1  MULP  SLC  100/81/1.84  63.97
350  TF-PRINV  0  MULP  ZLC  117/65/1.83  63.05
400  BF-GFIDF  0  LEVE  ZLC  119/57/1.89  61.17
450  BF-NORM   1  LEVE  SLC  89/96/1.80   58.28
465  TF-GFIDF  0  MULP  ZLC  114/75/1.76  56.73
466  BF-NORM   1  LIFT  SLC  93/108/1.66  56.49
467  BF-ONE    0  LIFT  SLC  97/82/1.86   55.25
468  TF-GFIDF  0  MULP  SLC  76/136/1.57  53.81

Table 5.6: Comparisons between weighting schemes in DFB

Weighting  BF-ENT  BF-GFIDF  BF-IDF  BF-NORM  BF-ONE  BF-PRINV  TF-ENT  TF-GFIDF  TF-IDF  TF-NORM  TF-ONE  TF-PRINV  F1(avg)  Point
BF-ENT    -  10/3/29⋆  17/10/15  39/3/0⋆  13/3/26  18/4/20  1/0/41⋆  4/0/38⋆  0/0/42⋆  12/1/29  1/0/41⋆  1/0/41⋆  84.20  372
BF-GFIDF  29/3/10⋆  -  26/3/13  38/3/1⋆  31/3/8  29/3/10⋆  4/0/38⋆  10/0/32  3/0/39⋆  24/1/17  6/0/36  6/0/36⋆  86.69  634
BF-IDF    15/10/17  13/3/26  -  39/3/0⋆  16/3/23  25/3/14  1/0/41⋆  7/0/35⋆  1/0/41⋆  12/0/30  2/1/39⋆  2/0/40⋆  84.45  422
BF-NORM   0/3/39⋆  1/3/38⋆  0/3/39⋆  -  0/3/39⋆  0/3/39⋆  0/0/42⋆  1/0/41⋆  0/0/42⋆  0/0/42⋆  0/0/42⋆  0/0/42⋆  75.01  21
BF-ONE    26/3/13  8/3/31  23/3/16  39/3/0⋆  -  27/3/12  3/0/39⋆  8/0/34⋆  1/0/41⋆  20/0/22  5/1/36⋆  1/0/41⋆  84.99  499
BF-PRINV  20/4/18  10/3/29⋆  14/3/25  39/3/0⋆  12/3/27  -  0/0/42⋆  5/0/37⋆  0/0/42⋆  10/0/32⋆  4/0/38⋆  0/0/42⋆  83.92  358
TF-ENT    41/0/1⋆  38/0/4⋆  41/0/1⋆  42/0/0⋆  39/0/3⋆  42/0/0⋆  -  30/3/9⋆  12/12/18  37/3/2⋆  28/3/11  19/10/13  91.02  1,138
TF-GFIDF  38/0/4⋆  32/0/10  35/0/7⋆  41/0/1⋆  34/0/8⋆  37/0/5⋆  9/3/30⋆  -  8/3/31⋆  31/3/8⋆  17/3/22  11/3/28  88.54  894
TF-IDF    42/0/0⋆  39/0/3⋆  41/0/1⋆  42/0/0⋆  41/0/1⋆  42/0/0⋆  18/12/12  31/3/8⋆  -  38/3/1⋆  29/4/9  20/10/12  91.16  1,181
TF-NORM   29/1/12  17/1/24  30/0/12  42/0/0⋆  22/0/20⋆  32/0/10⋆  2/3/37⋆  8/3/31⋆  1/3/38⋆  -  8/3/31⋆  3/3/36⋆  85.77  599
TF-ONE    41/0/1⋆  36/0/6  39/1/2⋆  42/0/0⋆  36/1/5⋆  38/0/4⋆  11/3/28  22/3/17  9/4/29  31/3/8⋆  -  11/3/28  89.24  966
TF-PRINV  41/0/1⋆  36/0/6⋆  40/0/2⋆  42/0/0⋆  41/0/1⋆  42/0/0⋆  13/10/19  28/3/11  12/10/20  36/3/3⋆  28/3/11  -  90.61  1,106

Table 5.7: Comparisons between weighting schemes in DPL

Weighting  BF-ENT  BF-GFIDF  BF-IDF  BF-NORM  BF-ONE  BF-PRINV  TF-ENT  TF-GFIDF  TF-IDF  TF-NORM  TF-ONE  TF-PRINV  F1(avg)  Point
BF-ENT    -  12/3/27  22/5/15  37/3/2⋆  22/3/17  21/5/16  5/1/36⋆  14/0/28  5/0/37⋆  11/0/31  13/0/29⋆  3/0/39⋆  64.77  515
BF-GFIDF  27/3/12  -  31/3/8⋆  38/3/1⋆  29/3/10  33/3/6⋆  8/0/34  21/1/20  8/0/34  19/0/23  16/0/26  10/0/32  66.33  736
BF-IDF    15/5/22  8/3/31⋆  -  37/3/2⋆  19/3/20  17/5/20  4/0/38⋆  14/0/28  2/0/40⋆  11/0/31  9/0/33⋆  5/0/37⋆  64.54  442
BF-NORM   2/3/37⋆  1/3/38⋆  2/3/37⋆  -  6/3/33⋆  2/3/37⋆  2/0/40⋆  3/0/39⋆  2/0/40⋆  2/0/40⋆  3/0/39⋆  2/0/40⋆  61.15  96
BF-ONE    17/3/22  10/3/29  20/3/19  33/3/6⋆  -  20/4/18  6/0/36⋆  11/0/31  5/0/37⋆  17/0/25  9/0/33⋆  8/1/33⋆  64.41  485
BF-PRINV  16/5/21  6/3/33⋆  20/5/17  37/3/2⋆  18/4/20  -  4/0/38⋆  10/0/32  3/0/39⋆  11/0/31⋆  9/0/33⋆  4/0/38⋆  64.45  434
TF-ENT    36/1/5⋆  34/0/8  38/0/4⋆  40/0/2⋆  36/0/6⋆  38/0/4⋆  -  27/3/12  18/5/19  28/3/11⋆  22/3/17  19/3/20  67.27  1,026
TF-GFIDF  28/0/14  20/1/21  28/0/14  39/0/3⋆  31/0/11  32/0/10  12/3/27  -  8/3/31  15/3/24  16/3/23  7/3/32  65.89  724
TF-IDF    37/0/5⋆  34/0/8  40/0/2⋆  40/0/2⋆  37/0/5⋆  39/0/3⋆  19/5/18  31/3/8  -  27/3/12⋆  27/4/11  23/6/13  67.66  1,083
TF-NORM   31/0/11  23/0/19  31/0/11  40/0/2⋆  25/0/17  31/0/11⋆  11/3/28⋆  24/3/15  12/3/27⋆  -  18/3/21  12/3/27  65.72  789
TF-ONE    29/0/13⋆  26/0/16  33/0/9⋆  39/0/3⋆  33/0/9⋆  33/0/9⋆  17/3/22  23/3/16  11/4/27  21/3/18  -  14/3/25  66.67  853
TF-PRINV  39/0/3⋆  32/0/10  37/0/5⋆  40/0/2⋆  33/1/8⋆  38/0/4⋆  20/3/19  32/3/7  13/6/23  27/3/12  25/3/14  -  67.17  1,027

For example, in Table 5.6, BF-ENT wins over, draws with, and loses to BF-GFIDF 10, 3, and 29 times, respectively, and the difference between C(BF-ENT) and C(BF-GFIDF) is statistically significant. The table shows that TF yields better performance than BF. Except for the cases when TF-NORM and TF-ONE are used, the differences between TF and BF are statistically significant. The top 3 weighting schemes are TF-IDF, TF-ENT, and TF-PRINV, with scores of 1,181, 1,138, and 1,106 points, respectively.

Table 5.7 shows the comparison results in DPL. Again, TF yields better performance than BF. The top 3 weighting schemes are TF-IDF, TF-PRINV, and TF-ENT, with scores of 1,083, 1,027, and 1,026 points, respectively.

5.4.2 Effects of Normalization

Let C(NOR=1) denote the set of all combinations in which NOR is used (2 × 6 × 1 × 7 × 3 = 252 combinations), and C(NOR=0) the set of all those in which NOR is disabled (again 252 combinations). Following the two types of comparisons described in Section 4.2.1, C(NOR=1) and C(NOR=0) are compared. Table 5.8 shows the comparison results in DFB. The results indicate that the use of NOR yields better performance than the case when NOR is not used; however, the performance difference is not statistically significant. Table 5.9 shows the comparison results in DPL. Again, the use of NOR yields better performance than the case when it is disabled, and in DPL the performance difference between them is statistically significant.

Table 5.8: Comparisons between the combinations with and without NOR in DFB

Preprocessing | NOR=0 | NOR=1 | F1 (avg) | Point
NOR=0 | - | 86/5/161 | 86.04 | 263
NOR=1 | 161/5/86 | - | 86.56 | 488

Table 5.9: Comparisons between the combinations with and without NOR in DPL

Preprocessing | NOR=0 | NOR=1 | F1 (avg) | Point
NOR=0 | - | 67/1/184⋆ | 64.94 | 202
NOR=1 | 184/1/67⋆ | - | 66.06 | 553

5.4.3 Effects of Co-occurrence Matrix Construction Schemes

For any co-occurrence matrix construction scheme o, let C(o) be the set of all combinations of o and other factors (2 × 6 × 2 × 1 × 3 = 72 combinations). Given co-occurrence matrix construction schemes o1 and o2, the comparison between C(o1) and C(o2) is conducted according to the comparison schemes in Section 4.2.1. Table 5.10 shows the comparison results in DFB. The first row indicates that, except for the case when LIFT is used, co-occurrence matrix construction yields better performance, with statistical significance, than the case when it is disabled (NONE). KLOS performs better than MULP (the matrix multiplication method, denoted COOC in the table), with the comparison result 40/0/32. Except for the cases when SUPP and KLOS are used, the performance differences between MULP and the other schemes are statistically significant. The top 3 co-occurrence matrix construction schemes are MULP, KLOS, and LEVE, with their scores being 967, 961, and 740 points, respectively.

Table 5.10: Comparison between co-occurrence matrix construction schemes in DFB

Method | NONE | COOC | SUPP | CONF | KLOS | LEVE | LIFT | F1 (avg) | Point
NONE | - | 3/0/69⋆ | 8/0/64⋆ | 13/0/59⋆ | 7/0/65⋆ | 10/0/62⋆ | 47/0/25 | 82.02 | 264
COOC | 69/0/3⋆ | - | 52/3/17 | 48/1/23⋆ | 32/0/40 | 49/0/23⋆ | 71/0/1⋆ | 89.57 | 967
SUPP | 64/0/8⋆ | 17/3/52 | - | 41/0/31 | 21/0/51 | 37/0/35 | 63/0/9⋆ | 87.80 | 732
CONF | 59/0/13⋆ | 23/1/48⋆ | 31/0/41 | - | 21/1/50⋆ | 24/17/31 | 65/0/7⋆ | 87.08 | 688
KLOS | 65/0/7⋆ | 40/0/32 | 51/0/21 | 50/1/21⋆ | - | 48/0/24 | 66/0/6⋆ | 89.33 | 961
LEVE | 62/0/10⋆ | 23/0/49⋆ | 35/0/37 | 31/17/24 | 24/0/48 | - | 66/0/6⋆ | 87.29 | 740
LIFT | 25/0/47 | 1/0/71⋆ | 9/0/63⋆ | 7/0/65⋆ | 6/0/66⋆ | 6/0/66⋆ | - | 81.01 | 162

Table 5.11: Comparison between co-occurrence matrix construction schemes in DPL

Method | NONE | COOC | SUPP | CONF | KLOS | LEVE | LIFT | F1 (avg) | Point
NONE | - | 15/0/57⋆ | 14/0/58⋆ | 29/1/42 | 7/0/65⋆ | 31/0/41 | 51/0/21⋆ | 64.22 | 442
COOC | 57/0/15⋆ | - | 39/2/31 | 50/0/22 | 18/0/54⋆ | 49/0/23 | 59/0/13⋆ | 66.21 | 818
SUPP | 58/0/14⋆ | 31/2/39 | - | 50/0/22⋆ | 15/0/57⋆ | 48/0/24⋆ | 67/0/5⋆ | 66.68 | 809
CONF | 42/1/29 | 22/0/50 | 22/0/50⋆ | - | 14/0/58⋆ | 29/11/32 | 60/0/12⋆ | 64.94 | 579
KLOS | 65/0/7⋆ | 54/0/18⋆ | 57/0/15⋆ | 58/0/14⋆ | - | 58/0/14⋆ | 70/0/2⋆ | 68.60 | 1,086
LEVE | 41/0/31 | 23/0/49 | 24/0/48⋆ | 32/11/29 | 14/0/58⋆ | - | 61/0/11⋆ | 64.99 | 596
LIFT | 21/0/51⋆ | 13/0/59⋆ | 5/0/67⋆ | 12/0/60⋆ | 2/0/70⋆ | 11/0/61⋆ | - | 62.89 | 192

Table 5.11 shows the comparison results in DPL. The first row indicates that, except for the case when LIFT is used, co-occurrence matrix construction yields better performance than the case when it is disabled. From the same row, except for the cases when CONF and LEVE are used, the performance differences between NONE and the other schemes are statistically significant. KLOS performs better than MULP, with the comparison result 54/0/18, and the performance difference between them is statistically significant. The top 3 co-occurrence matrix construction methods are KLOS, MULP, and SUPP, with their scores being 1,086, 818, and 809 points, respectively.
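The association measures compared above can all be computed from document frequencies. A sketch using the common data-mining definitions of support (SUPP), confidence (CONF), lift (LIFT), leverage (LEVE), and the Klösgen measure (KLOS); the exact formulations used in the thesis may differ slightly:

```python
import math

def association_measures(df_x, df_y, df_xy, n_docs):
    """Association measures between names x and y from document frequencies.

    df_x, df_y: documents containing x (resp. y); df_xy: documents containing
    both; n_docs: collection size. Assumes df_x > 0. Common data-mining
    definitions; the thesis's exact formulations may differ.
    """
    px, py, pxy = df_x / n_docs, df_y / n_docs, df_xy / n_docs
    conf = pxy / px  # confidence P(y | x)
    return {
        "SUPP": pxy,                           # support P(x, y)
        "CONF": conf,
        "LIFT": pxy / (px * py),               # > 1 means positive association
        "LEVE": pxy - px * py,                 # leverage
        "KLOS": math.sqrt(pxy) * (conf - py),  # Klösgen measure
    }

# Table 6.1, pair 1: Scolari appears in 68 documents, Cole in 106,
# and they co-occur in 68 of the 1,000 DFB documents
m = association_measures(df_x=68, df_y=106, df_xy=68, n_docs=1000)
```

Applying one of these measures to every name pair yields a co-occurrence matrix whose cells reflect association strength rather than raw co-occurrence counts.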

5.4.4 Effects of Clustering Linkage Functions

For any clustering linkage function c, let C(c) be the set of all combinations of c and other factors (2 × 6 × 2 × 7 × 1 = 168 combinations). Again, given clustering linkage functions c1 and c2, C(c1) and C(c2) are compared. Table 5.12 shows the comparison results in DFB. The performance difference between any pair of linkage functions is statistically significant, and the linkage function CLC yields the highest score. Table 5.13 shows the comparison results in DPL. There is no statistically significant difference between SLC and ZLC. The linkage function with the highest performance is again CLC.

Table 5.12: Comparisons between clustering linkage functions in DFB

Clustering | SLC | CLC | ZLC | F1 (avg) | Point
SLC | - | 9/0/159⋆ | 128/1/39⋆ | 85.33 | 412
CLC | 159/0/9⋆ | - | 167/0/1⋆ | 89.68 | 978
ZLC | 39/1/128⋆ | 1/0/167⋆ | - | 83.90 | 121

Table 5.13: Comparisons between clustering linkage functions in DPL

Clustering | SLC | CLC | ZLC | F1 (avg) | Point
SLC | - | 8/0/160⋆ | 67/0/101 | 63.49 | 225
CLC | 160/0/8⋆ | - | 153/2/13⋆ | 68.92 | 941
ZLC | 101/0/67 | 13/2/153⋆ | - | 64.10 | 344
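CLC (complete linkage) measures the distance between two clusters by their farthest members, which tends to produce the compact groups that suit name-alias clusters. A self-contained sketch of agglomerative clustering with complete linkage; the `cutoff` stopping parameter is illustrative, since the thesis's cluster-cutting procedure is not reproduced here:

```python
def complete_linkage_clusters(dist, cutoff):
    """Agglomerative clustering with the complete linkage function (CLC).

    dist: symmetric distance matrix (list of lists). Repeatedly merges the
    two clusters with the smallest complete-linkage distance, stopping when
    that distance exceeds the cutoff.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # complete linkage: cluster distance = maximum pairwise distance
        d, ai, bi = min(
            (max(dist[i][j] for i in a for j in b), ai, bi)
            for ai, a in enumerate(clusters)
            for bi, b in enumerate(clusters) if ai < bi
        )
        if d > cutoff:
            break
        clusters[ai] = clusters[ai] + clusters[bi]
        del clusters[bi]
    return clusters

# Toy example: names 0-1 and 2-3 form two tight groups
d = [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.9, 0.9],
     [0.9, 0.9, 0.0, 0.2],
     [0.8, 0.9, 0.2, 0.0]]
print(complete_linkage_clusters(d, cutoff=0.5))  # [[0, 1], [2, 3]]
```

SLC (single linkage) is obtained by replacing `max` with `min` in the cluster-distance computation, which explains its tendency to chain loosely related names together.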

5.5 Discussion

Referring to Tables 5.3 and 5.5, the best F1-performances in DFB and DPL are 98.75% and 83.87%, respectively. From the dataset characteristics shown in Tables 3.2 and 3.3, the performance difference between the two datasets can be explained based on the preliminary discussion in Section 4.1.2.

Co-occurrence matrix construction is used in each of the top 25 combinations in both datasets (cf. Tables 5.3 and 5.5). This provides clear evidence that co-occurrence information is very important for identifying name-alias relations. As a result, from the first and the second observations, the probability of successful name-alias relationship identification in DFB is higher than in DPL. Moreover, since most formal names in the two datasets have no more than two aliases (see Table 3.3), the third observation suggests that name-alias identification in DPL is more difficult.

This work proposes the use of association measures as an alternative method for co-occurrence matrix construction. In contrast to the matrix multiplication method, which is based solely on name occurrence frequencies, the association measures determine statistical co-occurrence patterns from name occurrences. The experimental results indicate that the association measures can be used effectively for co-occurrence matrix construction, i.e., 68% and 80% of the top 25 combinations in DFB and DPL, respectively, use association measures. From Tables 5.10 and 5.11, the KLOS function outperforms both the matrix multiplication method and the other association measure functions. However, as shown in Sections 4.2.1, 5.4.2, and 5.4.4, not only the co-occurrence matrix but also the weighting schemes, normalization, and clustering linkage functions affect the combination performances.

Chapter 6

Error Analysis

This chapter examines errors in name-alias relationship identification. Common problems are summarized and their solutions are proposed.

6.1 Analysis of Mismatched Name-Alias Pairs

Table 6.1 shows examples of mismatched name-alias pairs in DFB, taken from the result of the highest performance combination in Table 5.3 (TF-ONE, NOR=0, SUPP, CLC, F1 = 98.75%). The 3rd and 4th columns present a name and its candidate alias, respectively, with the number of documents in which the name or alias occurs given in parentheses; e.g., "หลุยส์ เฟลิเป้ สโคลารี" is found in 68 documents. The last column presents the number of documents in which the name and its candidate alias co-occur (fc).

Table 6.1: Examples of mismatched name-alias pairs derived from the highest performance combination in DFB

Pair No. | Reference | Name | Alias | fc
1 | (f1, a1) | หลุยส์ เฟลิเป้ สโคลารี (Luiz Felipe Scolari) (68) | โคล (Cole) (106) | 68
2 | (a2, a1) | สโคลารี (Scolari) (68) | โคล (Cole) (106) | 68
3 | (a3, a1) | บิกฟิล (Big Fil) (44) | โคล (Cole) (106) | 44
4 | (f2, a4) | สจ๊วร์ต ดาวนิง (Stewart Downing) (18) | เซาธ์เกต (Southgate) (22) | 8

From Table 6.1, the first pair shows a mismatch between the names "หลุยส์ เฟลิเป้ สโคลารี" (Luiz Felipe Scolari) and "โคล" (Cole), represented by f1 and a1, respectively. Name a1 is an alias for "โจ โคล" (Joe Cole), a former footballer of Chelsea Football Club. Examining news articles that contain both names shows that they often co-occur because they worked for the same football club (f1 is a former manager of Chelsea Football Club). Inspecting the result of the highest performance combination shows that the name pairs (f1, a2) and (f1, a3) are identified correctly, where a2 and a3 are aliases for f1. The mismatch (f1, a1) then causes the mismatches (a2, a1) and (a3, a1). The same situation occurs in the fourth pair, in which f2 and a4 worked for a football club during the same period (a4 is an alias for "แกเร็ธ เซาธ์เกต" (Gareth Southgate), a former manager of Middlesbrough Football Club).

Table 6.2: Examples of mismatched name-alias pairs derived from the highest performance combination in DPL

Pair No. | Reference | Name | Alias | fc
1 | (a1, a2) | ชวนนท์ /chawa/non/ (30) | ชวน /chuan/ (102) | 30
2 | (f1, a2) | ชวนนท์ อินทรโกมาลย์สุต /chawa/non/-/in/thon/ko/man/sut/ (27) | ชวน /chuan/ (102) | 27
3 | (f2, a3) | สุชาติ ลายนําเงิน /su/chat/-/lainam/ngoen/ (5) | สุชาติ /su/chat/ (14) | 5
4 | (f3, a4) | สุชาติ เหมือนแกว้ /su/chat/-/muean/kaeo/ (4) | ป๊อด /pot/ (4) | 2

Table 6.2 shows examples of mismatched name-alias pairs in DPL, taken from the result of the highest performance combination in Table 5.5 (BF-GFIDF, NOR=0, KLOS, CLC, F1 = 83.87%). For the first pair, names a2 and a1 refer to "ชวน หลีกภัย" (/chuan/-/leekpai/, a former leader of the Democrat Party and a former Prime Minister of Thailand) and "ชวนนท์ อินทรโกมาลย์สุต" (/chawa/non/-/in/thon/ko/man/sut/, the current Democrat Party Spokesman), respectively. Because of their real-life relationship, a1 and a2 often co-occur in news articles. Moreover, "ชวน" (/chuan/) has two meanings: a person's name and a Thai verb (meaning "to persuade").

The third and fourth name pairs face the problem of lexical ambiguity. Name a3 is an alias for f3 but can also be an alias for f2. Name a4 is an alias for "พัชรวาท วงษ์สุวรรณ" (/phatchara/wat/-/wong/suwan/, a Police General and former National Police Chief). In many news articles, f3 and a4 often co-occur because they were in charge of the same political case.

From Tables 6.1 and 6.2, the common problems can be summarized as follows:

1. Group reference: Two or more persons who work for the same organization are likely to co-occur within a news document, e.g., most pairs in Tables 6.1 and 6.2.

2. Lexical ambiguity: Two or more persons sharing a name may cause errors in name-alias relationship identification. Moreover, some names or aliases are lexicographically identical to words in a dictionary.

6.2 Solutions

To resolve the issues arising in Thai name-alias relationship identification, the following suggestions are proposed:

• In most contexts, the first name and the last name of a full name can be considered as aliases of that full name. Exploiting this can avoid several cases of the lexical ambiguity problem.

• In the case of an alias or a name that is lexicographically identical to a word in a dictionary, a mechanism such as syntactic analysis is needed to determine whether a token is a word or a name.

• Group reference is commonly found in news articles. One possible solution is to consider documents from different time periods; a change of group members over time may improve name-alias identification.
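The first suggestion can be sketched as a simple heuristic: treat the space-separated parts of a full name as candidate aliases, since written Thai places a space between the first and last name. The function name is illustrative only:

```python
def alias_candidates(full_name):
    """Candidate aliases for a full name: its space-separated parts.

    Simple heuristic for the first suggestion above; assumes the first
    and last name are separated by a space, as in written Thai.
    """
    parts = full_name.split()
    # a single-token name yields no candidates of its own
    return set(parts) if len(parts) > 1 else set()

# e.g., "สจ๊วร์ต ดาวนิง" (Stewart Downing) yields its two single-token aliases
print(alias_candidates("สจ๊วร์ต ดาวนิง"))
```

Matching such candidates before statistical clustering would, for instance, link a3 ("สุชาติ") to both f2 and f3 in Table 6.2, making the remaining ambiguity explicit.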

However, the above solutions may not be suitable for all cases. A semantic analysis component should be considered rather than relying on statistics alone. Due to the complexity of structures in written Thai, it is necessary to develop a set of specific techniques for parsing Thai running text. This is a challenging task in Thai language processing.

Chapter 7

Conclusions and Future Work

This chapter summarizes the effects of the various factors used in the two frameworks and proposes directions for future work.

7.1 Conclusions

This thesis aims to study the effects of preprocessing factors, co-occurrence matrix construction methods, and clustering methods on name-alias relationship identification in Thai news articles. Two frameworks are proposed, designed based on the observation that a name and an alias for it usually co-occur in a news article. The first framework looks into the effects of preprocessing factors and clustering methods. The second framework focuses on the effects of co-occurrence matrix construction methods together with other factors.

In the first framework, various factors are considered for improving the recognition performance, i.e., two local weights (TF and BF), six global weights (ENT, GFIDF, IDF, NORM, ONE, and PRINV), three optional preprocessing steps (NOR, COC, and NET), and five alternative clustering methods (SBC, ERC, SLC, CLC, and ZLC). The performance of each possible combination of these factors is examined on two datasets extracted from two different news domains, i.e., football and politics. The effects of the optional factors and the candidate factors are investigated by directly comparing the performance of corresponding combinations and statistically testing the significance of their differences. The investigation reveals that in both news domains, (i) the best and second best weighting schemes are TF-ENT and TF-IDF, respectively; (ii) NET is the optional preprocessing step with the greatest overall effect; and (iii) CLC is the clustering method that yields the highest performance.
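The local and global weights combine multiplicatively in the usual vector-space fashion: the weight of a name in a document is its local weight times the name's global weight. A sketch of the TF-IDF and BF-IDF cases (the other global weights follow the same pattern); the formulas follow standard term-weighting definitions, which the thesis may parameterize differently:

```python
import math

def tf(count):
    """TF local weight: raw occurrence frequency of the name in the document."""
    return count

def bf(count):
    """BF local weight: binary flag for whether the name occurs at all."""
    return 1 if count > 0 else 0

def idf(doc_freq, n_docs):
    """IDF global weight (standard log form; the thesis's base may differ)."""
    return math.log2(n_docs / doc_freq)

def weight(count, doc_freq, n_docs, local=tf, glob=idf):
    """Weight of a name in a document: local weight x global weight."""
    return local(count) * glob(doc_freq, n_docs)

# A name occurring 3 times in a document and appearing in 68 of 1,000 documents
w_tf = weight(3, 68, 1000)             # TF-IDF
w_bf = weight(3, 68, 1000, local=bf)   # BF-IDF
```

Because BF discards the occurrence count, TF and BF differ most when names occur many times per document, which is consistent with TF's advantage being larger in the occurrence-rich football dataset.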

In the football dataset, TF performs better than BF regardless of the choice of global weighting, and except for the case when GFIDF is used, the performance difference between TF and BF is statistically significant. In the politics dataset, TF performs better than BF on average; however, the difference between them is often not statistically significant since the number of name occurrences in this dataset is relatively low. Considering the optional preprocessing steps, NOR affects COC (i.e., NOR widens the difference between selection and non-selection of COC), COC affects NOR, and NET affects COC in both datasets. In the politics dataset, COC also affects NET. The use of NET together with COC yields the highest performance. When NOR, COC, and NET are all applied, high output performance can be observed in both datasets. The highest combination performances in the football and politics datasets are 97.24% and 76.37%, respectively.

Identifying relationships between names and their aliases is an important issue for named entity disambiguation. Based on the observation that a Thai name alias often co-occurs with its formal name within a document, name co-occurrence information is applied for name-alias relationship identification. Typically, name co-occurrence information can be represented by a co-occurrence matrix. A common way to construct a co-occurrence matrix is to multiply a name-by-document matrix by its transpose. The second framework proposes an alternative method for co-occurrence matrix construction that uses association measures to determine statistical co-occurrence patterns from name occurrences. Five association measure functions (SUPP, CONF, KLOS, LEVE, and LIFT) are used, and they are compared with the traditional co-occurrence matrix construction method (MULP). Various factors are considered in the comparison, i.e., two local weights (TF and BF), six global weights (ENT, GFIDF, IDF, NORM, ONE, and PRINV), one optional preprocessing step (NOR), and three linkage functions for hierarchical clustering (SLC, CLC, and ZLC). Two collections of news articles extracted from the football (DFB) and politics (DPL) news categories are employed in the experiments.
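The traditional construction (MULP) can be written down directly: with a binary name-by-document matrix A, the product A·Aᵀ gives, in cell (i, k), the number of documents in which names i and k co-occur. A minimal sketch:

```python
def cooccurrence_matrix(name_by_doc):
    """Co-occurrence matrix C = A * A^T from a binary name-by-document matrix A.

    C[i][k] is the number of documents in which names i and k both occur;
    the diagonal C[i][i] is the document frequency of name i.
    """
    n = len(name_by_doc)
    return [[sum(ri * rk for ri, rk in zip(name_by_doc[i], name_by_doc[k]))
             for k in range(n)]
            for i in range(n)]

# 3 names over 4 documents: names 0 and 1 co-occur in documents 0 and 2
a = [[1, 0, 1, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
c = cooccurrence_matrix(a)
print(c[0][1])  # 2
```

The association-measure alternative replaces each off-diagonal count with a measure such as SUPP or KLOS computed from the same document frequencies, so that cells reflect statistical association strength rather than raw counts.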

The effects of co-occurrence matrix construction methods and other factors are investigated by comparing the performance of their corresponding combinations and by statistically testing the significance of their performance differences. The investigation reveals that (i) for the overall performance, SUPP and KLOS yield the highest performances in DFB and DPL, respectively, and (ii) on average, MULP and KLOS give better performances in DFB and DPL, respectively, compared to the other association measure functions. For the other factors, (i) the best and second best weighting schemes in DFB are TF-ENT and TF-IDF, respectively; (ii) the best and second best weighting schemes in DPL are TF-IDF and TF-PRINV, respectively; (iii) in both datasets, the use of NOR gives better performance than the case when it is disabled; and (iv) CLC is the best linkage function in both datasets. The best performances achieved on the football and politics datasets are 98.75% and 83.87%, respectively.

7.2 Future Work

The experimental results show that preprocessing factors and association measures can be used effectively for Thai name-alias relationship identification. Further work includes investigating the use of other association measure functions for discovering relationships between names and name aliases. An analysis of clustering results is another alternative for improving combination performance.

51 References

[1] Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining Association Rules Be- tween Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data held in Washington, D.C., USA, 26–28 May 1993 (pp. 207–216). New York: Association for Computing Machinery.

[2] Anwar, T. & Abulaish, M. (2014). Namesake Alias Mining on the Web and Its Role Towards Suspect Tracking. Information Sciences, 207, 123–145.

[3] Atlam, E. (2014). Improving the Quality of FA Word Dictionary Based on Co- occurrence Word Information and Its Hierarchically Classification. Information-An International Interdisciplinary, 17(2), 709–734.

[4] Azevedo, P. J. & Jorge, A. M. (2007). Comparing Rule Measures for Predictive Association Rules. In Proceedings of the 18th European Conference on Machine Learning held in Warsaw, Poland, 17–21 September 2007 (pp. 510–517). Berlin, Heidelberg: Springer-Verlag.

[5] Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: Addison Wesley Longman.

[6] Bagga, A. & Baldwin, B. (1998). Entity-based Cross-document Coreferencing Us- ing the Vector Space Model. In Proceedings of the 17th International Conference on Computational Linguistics held in Montreal, Canada, 10–14 August 1998 (p- p. 79–85), Stroudsburg: Association for Computational Linguistics.

[7] Ballesteros, L. & Croft, W. B. (1998). Resolving Ambiguity for Cross-language Retrieval. In Proceedings of the 21st Annual International ACM SIGIR Confer- ence on Research and Development in Information Retrieval held in Melbourne, Australia, 24–28 August 1998 (pp. 64–71). New York: Association for Computing Machinery.

[8] Berry, W. M. & Browne, M. (2005). Understanding Search Engines: Mathematical Modeling and Text Retrieval. Philadelphia: Society for Industrial and Applied Mathematics.

52 [9] Bhat, V., Oates, T., Shanbhag, V., & Nicholas, C. (2004). Finding Aliases on the Web Using Latent Semantic Analysis. Data and Knowledge Engineering, 49, 129–143.

[10] Bheganan, P., Nayak, R., & Xu, Y. (2009). Thai Word Segmentation with Hidden Markov Model and Decision Tree. In Proceedings of the 13th Pacific-Asia Knowl- edge Discovery and Data Mining Conference held in Bangkok, Thailand, 27–30 April 2009 (pp. 74–85). Berlin, Heidelberg: Springer-Verlag.

[11] Bikel, D. M., Schwartz, R., & Weischedel, R. M. (1999). An Algorithm that Learns What’s in a Name. Machine Learning, 34(1–3), 211–231.

[12] Bollegala, D., Honma, T., Matsuo, Y., & Ishizuka, M. (2008). Automatically Extracting Personal Name Aliases from the Web. In Proceedings of the 6th Inter- national Conference on Natural Language Processing held in Gothenburg, Sweden, 25–27 August 2008 (pp. 77–88). Berlin, Heidelberg: Springer-Verlag.

[13] Bollegala, D., Matsuo, Y., & Ishizuka, M. (2011). Automatic Discovery of Per- sonal Name Aliases from the Web. IEEE Transactions on Knowledge and Data Engineering, 23(6), 831–844.

[14] Bonchi, F., Gionis, A., & Ukkonen, A. (2011). Overlapping Correlation Clustering. In Proceedings of the 11th IEEE International Conference on Data Mining held in Vancouver, Canada, 11–14 December 2011 (pp. 51–60). Washington: IEEE Computer Society.

[15] Bunescu, R. & Pasca, M. (2006). Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics held in Trento, Italy, 3–7 April 2006 (pp. 9–16). –:–.

[16] Charoenpornsawat, P., Kijsirikul B., & Meknavin, S. (1998). Feature-based Proper Name Identification in Thai. In Proceedings of National Computer Science and Engineering Conference held in Bangkok, Thailand, 19–21 October 1998. –: –.

[17] Chattrimongkol, S. (2005). Named Entity Recognition and Classification in Thai, Master of Arts Thesis. Bangkok: Department of Linguistics, Faculty of Arts, University.

[18] Chen, Y. & Martin, J. (2007). Towards Robust Unsupervised Personal Name Disambiguation. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning held in Prague, Czech Republic, 28–30 June 2007 (pp. 190–198). –: –.

[19] Chen, Z. & Lu, Y. (2011). A Word Co-occurrence Matrix Based Method for Rel- evance Feedback. Computational Information Systems, 7(1), 17–24.

53 [20] Chinchor, N. (1998). MUC-7 Information Extraction Task Definition. Retrieved from May, 2014, http://www.itl.nist.gov/iaui/894.02/related_projects/ muc/proceedings/ie_task.html.

[21] Chinchor, N. (2001). Overviews of MUC-7/MET-2. Retrieved from May, 2014, http://www.itl.nist.gov/iaui/894.02/related_projects/muc/ proceedings/muc_7_proceedings/overview.html.

[22] Chisholm, E. & Kolda, T. G. (1999, March). New Term Weighting Formulas for the Vector Space Method in Information Retrieval (Report No. ORNL/TM- 13756). Retrieved from January 2012, http://www.sandia.gov/~tgkolda/pubs/ pubfiles/ornl-tm-13756.pdf.

[23] Cucerzan, S. (2007). Large-scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of the 2007 Joint Conference on Empirical Meth- ods in Natural Language Processing and Computational Natural Language Learning held in Prague, Czech Republic, 28–30 June 2007 (pp. 708–716). –: –.

[24] Damljanovic, D. & Bontcheva, K. (2012), Named Entity Disambiguation Using Linked Data. In Proceedings of the 9th Extended Semantic Web Conference held in Heraklion, Greece, 27–31 May 2012 (Poster). –: –.

[25] Dredze, M., McNamee, P., Rao, D., Gerber, A., & Finin, T. (2010). Entity Disam- biguation for Knowledge Base Population. In Proceedings of the 23rd International Conference on Computational Linguistics held in Beijing, China, 23–27 August 2010 (pp. 277–285). Stroudsburg: Association for Computational Linguistics.

[26] Fernandez-Amoros, D., Gil, R. H., Somolinos, J. A. C., & Somolinos, C. C. (2010). Automatic Word Sense Disambiguation Using Co-occurrence and Hierarchical In- formation. In Proceedings of the Natural Language Processing and Information Systems held in Cardiff, UK, 23–25 June 2010 (pp. 60–67). Berlin, Heidelberg: Springer-Verlag.

[27] Fernández, N., Fisteus, J. A., Sánchez, L., & López, G. (2012). IdentityRank: Named Entity Disambiguation in the News Domain. Expert Systems with Appli- cations, 39(10), 9207–9221.

[28] Fuketa, M., Atlam, E., Ghada, E., Morita, K., & Aoe, J. (2006). Building New Field Association Term Candidates Automatically by Search Engine. In Proceed- ings of Knowledge-Based Intelligent Information and Engineering Systems held in Bournemouth, UK, 9–11 October 2006 (pp. 325–330). Berlin, Heidelberg: Springer-Verlag.

[29] Gandhi, R. V., Suman, N., Zuber, M., & Mahender, U. (2013). Automatic Detec- tion of Name and Aliases From The Web. Engineering Research and Applications, 3(6), 1684–1689.

54 [30] Geng, L. & Hamilton, H. J. (2007). Choosing the Right Lens: Finding What is Interesting in Data Mining. Studies in Computational Intelligence, 43, 3–24.

[31] Gentile, A. L., Zhang, Z., Xia, L., & Iria, J. (2009). Graph-based Semantic Re- latedness for Named Entity Disambiguation. In Proceedings of International Con- ference on Software, Services, and Semantic Technologies held in Sofia, Bulgaria, 28–29 October 2009 (pp. 13–20). –: –.

[32] Grishman, R. & Sundheim, B. (1996). Message Understanding Conference-6: A Brief History, In Proceedings of The 16th International Conference on Computa- tional Linguistics held in Copenhagen, Denmark, 5–9 August 1996 (pp. 466–471). Stroudsburg: Association for Computational Linguistics.

[33] Hassell, J., Aleman-Meza, B., & Arpinar, I. B. (2006). Ontology-driven Auto- matic Entity Disambiguation in Unstructured Text. In Proceedings of the 5th In- ternational Semantic Web Conference held in Athens, USA, 5–9 November 2006 (pp. 44–57). Berlin, Heidelberg: Springer-Verlag.

[34] Huang, F. (2005). Multilingual Named Entity Extraction and Translation From Text and Speech, Ph.D. Thesis. Pittsburgh: School of Computer Science, Carnegie Mellon University.

[35] Huynh, H. H., Guillet, F., Blanchard, J., Kuntz, P., Briand, H., & Gras, R. (2007). A Graph-based Clustering Approach to Evaluate Interestingness Measures: A Tool and a Comparative Study. Studies in Computational Intelligence, 43, 25–50.

[36] Intarapaiboon, P. (2011). A Study on Domain-specific Information Extraction From Thai Text, Ph.D. Thesis. Pathum Thani: School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University.

[37] Iwasaki, S. & Ingkaphirom, P. (2007). A Reference Grammar of Thai, An Inter- national Review of General Linguistics. Lingua, 117, 1497–1512.

[38] Jakhete, S. A. & Dharmadhikari, S. C. (2012). Automatic Extraction of Entity Alias from the Web. Applied Information Systems, 3(8), 5–9.

[39] Jiang, L., Wang, J., Luo, P., Ning A., & Wang, M. (2012). Towards Alias Detection Without String Similarity: An Active Learning Based Approach. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval held in Portland, USA, 12–16 August 2012 (pp. 1155–1156). New York: Association for Computing Machinery.

[40] Jin, N., Young, C., & Wang, W. (2009). Graph Classification Based on Pattern Co-occurrence. In Proceedings of the 18th ACM Conference on Information and Knowledge Management held in Hong Kong, China, 2–6 November 2009 (pp. 573– 582). New York: Association for Computing Machinery.

55 [41] Jin, H., Huang, L., & Yuan, P. (2009). Name Disambiguation Using Semantic Association Clustering. In Proceedings of the 2009 IEEE International Conference on e-Business Engineering held in Macau China, 21–23 October 2009 (pp. 42–48). –: –.

[42] Jiranantanaporn, S. (2008). Country in the Viewpoint of Thai Mass Media. Humanities, 1(5), 47–70.

[43] Johnson, S. C. (1967). Hierarchical Clustering Schemes. Psychometrika, 32(3), 241–254.

[44] Justeson, J. S. & Katz, S. M. (1991). Co-occurrences of Antonymous Adjectives and Their Contexts. Computational Linguistics, 17(1), 1–19.

[45] Krovetz, R. & Croft, W. B. (1992). Lexical Ambiguity and Information Retrieval. ACM Transactions on Information Systems, 10(2), 115–141.

[46] Lancia, F. (2005). Word Co-occurrence and Theory of Meaning. Retrieved from August, 2013, http://www.mytlab.com/wcsmeaning.pdf.

[47] Lee, H. S. (1999). Automatic Clustering of Business Processes in Business Systems Planning. European Journal of Operation Research, 114, 354–362.

[48] Lee, H. S. (2001). An Optimal Algorithm for Computing the Max-min Transitive Closure of a Fuzzy Similarity Matrix. Fuzzy Sets and Systems, 123, 129–136.

[49] Lertcheva, N. & Aroonmanakun, W. (2009). A Linguistic Study of Product Names in Thai Economic News. In Proceedings of the 8th International Symposium on Natural Language Processing held in Bangkok, Thailand, 20–22 October 2009 (pp. 26–29). Bangkok: Dhurakij Pundit University.

[50] Leydesdorff, L. & Vaughan, L. (2006). Co-occurrence Matrices and Their Appli- cations in Information Science: Extending ACA to the Web Environment. The American Society for Information Science and Technology, 57(12), 1616–1628.

[51] Manager Online. (2009). Political News. Retrieved from July, 2009, http://www. manager.co.th.

[52] Matsuo, Y. & Ishizuka, M. (2004). Keyword Extraction from a Single Documen- t using Word Co-occurrence Statistical Information. Artificial Intelligence Tools 13(1), 157-169.

[53] Mori, J., Matsuo, Y., & Ishizuka, M. (2005). Finding User Semantics on the Web using Word Co-occurrence Information. In Workshop on Personalization on the Semantic Web held in Edinburgh, UK, 24–29 July 2005. –: –.

56 [54] Morita, K., Atlam, E., Fuketra, M., Tsuda, K., Oono, M., & Aoe, J. (2004). Word Classification and Hierarchy Using Co-occurrence Word Information. Information Processing and Management, 40, 957–972.

[55] Neumann, G. (2010). Named Entity Extraction. Retrieved from November, 2013, http://www.dfki.de/~neumann/InformationExtractionLecture2011/ sessions/3-NEE-Overview.pdf.

[56] Niu, L., Wu, J., & Shi, Y. (2012). Entity disambiguation with Textual and Con- nection Information. Procedia Computer Science, 9, 1249–1255.

[57] Paksasuk A. (2007). The Study of Thai Monk Naming, Master of Arts Thesis. Nakhon Pathom: Department of Thai, Graduate School, Silapakorn University.

[58] Pedersen, T., Purandare, A., & Kulkarni, A. (2005). Name Discrimination by Clustering Similar Contexts. In Proceedings of the 6th International Conference on Intelligent Text Processing and Computational Linguistics held in Mexico City, Mexico, 13–19 February 2005 (pp. 226–237). Berlin, Heidelberg: Springer-Verlag.

[59] Plaza, L., Stevenson, M., & Diáz, A. (2012). Resolving Ambiguity in Biomedical Text to Improve Summarization. Information Processing and Management, 48, 755–766.

[60] Pantel, P. (2006). Alias Detection in Malicious Environments. In Proceedings of AAAI Fall Symposium on Capturing and Using Patterns for Evidence Detection held in Arlington, USA, 13–15 October 2006 (pp. 14–20). Washington: American Association for Artificial Intelligence.

[61] Popescu, O. & Magnini, B. (2009). An Iterative Model for Discovering Person Co- references Using Name Frequency Estimates. In Proceedings of the 3rd Language and Technology Conference, Poznan, Poland, 5–7 October 2007 (pp. 428–439). Berlin, Heidelberg: Springer-Verlag.

[62] Rokaya, M., Atlam, E., Fuketa, M., Dorji, T. C., & Aoe, J. (2008). Ranking of Field Association Terms Using Co-word Analysis. Information Processing and Management, 44, 738–755.

[63] The Royal Institute Thailand. (2012). Thai Dictionary. Retrieved from June 2013, http://rirs3.royin.go.th/new-search/word-search-all-x.asp.

[64] Salton, G. & Buckley, C. (1988). Term-weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5), 513–524.

[65] Sapena, E., Padró, L., & Turmo, J. (2007). Alias Assignment in Information Extraction. Procesamiento del Lenguaje Natural, 39, 105–112.

57 [66] Sekine, S. (2004). Named Entity: History and Future. Retrieved from August, 2013, http://cs.nyu.edu/~sekine/papers/NEsurvey200402.pdf.

[67] Shaikh, M., Memon, N., & Wii, U. K. (2011). Extended Approximate String Matching Algorithms to Detect Name Aliases. In Proceedings of Intelligence and Security Informatics held in Beijing, China, 10–12 July 2011 (pp. 216–219). –: –.

[68] Sharif, U. M., Ghada, E., Atlam, E., Fuketa, M., Morita, K., & Aoe, J. (2007). Improvement of Building Field Association Term Dictionary Using Passage Re- trieval. Information Processing and Management, 43, 1793–1807.

[69] Shen, Q. Boongoen, T., & Price, C. (2012). Fuzzy Orders-of-Magnitude Based Link Analysis for Qualitative Alias Detection. IEEE Transaction on Knowledge and Data Engineering, 24(4), 649–664.

[70] Shirakawa, M., Wang, H., Song, Y., Wang, Z., Nakayama, K., & Hara, T. (2011, November). Entity Disambiguation Based on a Probabilistic Taxonomy (Report No. MSR-TR-2011-125). Retrieved August 2012, from http://research.microsoft.com/pubs/156194/techreport.pdf.

[71] Siamsport Online. (2009). English Premier League. Retrieved September 2009, from http://www.siamsport.co.th.

[72] Siriyuvasak, U., Jiajanpong, A., Taewutoom, W., & Urapeepatthanapong, T. (2012). A Study of Media Frame and Discourse on the 2007 Constitution Amendment in the Thai Press in 2012. Retrieved July 2013, from http://mediainsideout.net/research/2013/02/108.

[73] Spina, D., Gonzalo, J., & Amigó, E. (2013). Discovering Filter Keywords for Company Name Disambiguation in Twitter. Expert Systems with Applications, 40, 4986–5003.

[74] Srikant, R. & Agrawal, R. (1997). Mining Generalized Association Rules. Future Generation Computer Systems, 13(2–3), 161–180.

[75] Sriphaew, S. & Theeramunkong, T. (2007). Measuring the Validity of Documents Relations Discovered from Frequent Itemset Mining. In Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining held in Honolulu, USA, 1–5 April 2007 (pp. 293–299). –: –.

[76] Subha, R. & Palaniswami, S. (2013). Quality Factor Assessment and Text Summarization of Unambiguous Natural. Advances in Computing, Communication, and Control (Communications in Computer and Information Science, Vol. 361), 131–146.

[77] Sutheebanjard, P. & Premchaiswadi, W. (2009). Thai Personal Named Entity Extraction Without Using Word Segmentation or POS Tagging. In Proceedings of the 8th International Symposium on Natural Language Processing held in Bangkok, Thailand, 20–22 October 2009 (pp. 221–226). Bangkok: Dhurakij Pundit University.

[78] Suwanapong, T., Theeramunkong, T., & Nantajeewarawat, E. (2012). A Fuzzy-relation Approach for Name-alias Identification in Thai News Articles. In Proceedings of the 1st Asian Conference on Information Systems held in Siem Reap, Cambodia, 6–8 December 2012 (pp. 352–357). –: –.

[79] Suwanno, N., Suzuki, Y., & Yamazaki, H. (2007). Selecting the Optimal Feature Sets for Thai Named Entity Extraction. In Proceedings of the 3rd International Conference on Engineering and Environment held in Phuket, Thailand, 10–11 May 2007. Phuket: Prince of Songkla University.

[80] Tan, P. N., Kumar, V., & Srivastava, J. (2004). Selecting the Right Objective Measure for Association Analysis. Information Systems, 29, 293–313.

[81] Tepdang, S. (2010). Improving Thai Word Segmentation with Named Entity Recognition (Master's thesis). Pathum Thani: Computer Science Department, Faculty of Science, Thammasat University.

[82] Thailand National Electronics and Computer Technology Center. (2013). Benchmark for Enhancing the Standard of Thai Language Processing. Retrieved December 2013, from http://thailang.nectec.or.th/best/.

[83] Thailand Political Database. (2012). Thailand Political Database Newsletter. Retrieved July 2013, from http://www.tpd.in.th/pdffile/E-newsletter/E%20Newsletter8-2555(1-15jan12).pdf.

[84] Theeramunkong, T., Boriboon, M., Haruechaiyasak, C., Kittiphattanabawon, N., Kosawat, K., Onsuwan, C., Siriwat, I., Suwanapong, T., & Tongtep, N. (2010). THAI-NEST: A Framework for Thai Named Entity Tagging Specification and Tools. In Proceedings of the 2nd International Conference on Corpus Linguistics held in A Coruña, Spain, 13–15 May 2010 (pp. 895–908). –: –.

[85] The Sun. (2014). English Football News. Retrieved February 2014, from http://www.thesun.co.uk/sol/homepage/sport/football/5625286/John-Terry-signs-on-at-Chelsea-with-one-year-contract.html.

[86] Tirasaroj, N. & Aroonmanakun, W. (2011). The Effect of Answer Patterns for Supervised Named Entity Recognition in Thai. In Proceedings of the 25th Pacific Asia Conference on Language, Information, and Computation held in Singapore, 16–18 December 2011 (pp. 392–399). Singapore: Nanyang Technological University.

[87] Tongtep, N. & Theeramunkong, T. (2010). Pattern-based Extraction of Named Entities in Thai News Documents. Thammasat International Journal of Science and Technology, 15(1), 70–81.

[88] UzZaman, N. & Allen, J. F. (2011). Event and Temporal Expression Extraction from Raw Text: First Step Towards a Temporally Aware System. Semantic Computing, 4(4), 487–508.

[89] Vu, Q. M., Takasu, A., & Adachi, J. (2010). Improving the Performance of Per- sonal Name Disambiguation Using Web Directories. Information Processing and Management, 44, 1546–1561.

[90] Wang, Y. J. (2011). A Clustering System for Data Sequence Partitioning. Expert Systems with Applications, 38, 659–666.

[91] Wanvarie, D., Takamura, H., & Okumura, M. (2009). Character-based Thai Named Entity Recognition. Natural Language Processing, 3, 8–11.

[92] Xia, N., Lin, H., Yang, Z., & Li, Y. (2011). Combining Multiple Disambiguation Methods for Gene Mention Normalization. Expert Systems with Applications, 38(7), 7994–7999.

[93] Xu, R. & Wunsch II, D. (2005). Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, 16(3), 645–678.

[94] Yeung, D. S. & Wang, X. Z. (2002). Improving Performance of Similarity-Based Clustering by Feature Weight Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 556–561.

[95] Zadeh, L. A. (1971). Similarity Relations and Fuzzy Orderings. Information Sciences, 3, 177–200.

[96] Zhao, Y. & Karypis, G. (2002). Evaluation of Hierarchical Clustering Algorithms for Document Datasets. In Proceedings of the 11th International Conference on Information and Knowledge Management held in McLean, USA, 4–9 November 2002 (pp. 515–524). New York: Association for Computing Machinery.

[97] Zhao, X., Jin, P., & Yue, L. (2010). Automatic Temporal Expression Normalization with Reference Time Dynamic-Choosing. In Proceedings of the 23rd International Conference on Computational Linguistics held in Beijing, China, 23–27 August 2010 (pp. 1498–1506). Stroudsburg: Association for Computational Linguistics.

[98] Zheng, Z. & Zhu, X. (2013). Entity Disambiguation with Type Taxonomy. Computational Information Systems, 9(3), 1199–1207.

Appendices

Appendix A

The First Framework: Experimental Results in DFB

Rank Weighting NOR COC NET Clustering Cluster Characteristic F1

1 TF-ENT 0 1 1 CLC 55/44/2.28 97.24 2 TF-IDF 0 1 1 CLC 56/41/2.33 97.07 3 BF-ONE 1 1 1 CLC 55/43/2.31 96.72 4 TF-ENT 1 1 1 CLC 57/39/2.35 96.35 5 TF-IDF 1 1 1 CLC 55/43/2.31 95.81 6 TF-PRINV 1 1 1 CLC 57/41/2.31 95.60 7 TF-GFIDF 0 1 1 CLC 56/42/2.31 95.41 8 TF-NORM 1 1 1 CLC 56/44/2.26 95.38 9 BF-IDF 1 1 1 CLC 53/49/2.22 95.19 10 TF-ONE 1 1 1 CLC 56/44/2.26 95.03 11 TF-ONE 0 1 1 CLC 56/43/2.28 94.83 12 TF-PRINV 0 1 1 CLC 54/46/2.26 94.66 13 TF-GFIDF 1 1 1 CLC 57/41/2.31 94.49 14 BF-IDF 0 1 1 CLC 53/52/2.15 94.32 15 TF-ENT 0 1 1 ALC 57/40/2.33 94.12 16 TF-ENT 1 1 1 ALC 60/37/2.33 94.12 17 BF-PRINV 1 1 1 CLC 53/51/2.17 93.83 18 TF-GFIDF 1 1 1 ALC 59/42/2.24 93.83 19 BF-PRINV 0 1 1 CLC 54/51/2.15 93.51 20 TF-GFIDF 1 1 1 SLC 56/45/2.24 93.51 21 TF-ENT 1 1 1 ERC 56/39/2.38 93.45 22 TF-IDF 1 1 1 ERC 56/39/2.38 93.45 23 TF-ENT 1 1 1 SLC 56/39/2.38 93.45 24 TF-IDF 1 1 1 SLC 56/39/2.38 93.45 25 TF-ONE 1 1 1 ERC 57/43/2.26 93.33 26 TF-ONE 1 1 1 ALC 59/43/2.22 93.21 27 BF-GFIDF 1 1 1 CLC 54/51/2.15 93.18 28 TF-GFIDF 1 1 1 ERC 55/42/2.33 92.97 29 BF-ENT 1 1 1 CLC 55/48/2.19 92.83 30 TF-PRINV 1 1 1 ERC 56/37/2.43 92.78 31 TF-PRINV 1 1 1 SLC 56/37/2.43 92.78 32 TF-NORM 0 1 1 CLC 53/54/2.11 92.51 33 TF-IDF 1 1 1 ALC 59/37/2.35 92.50 34 TF-IDF 0 1 1 ALC 58/41/2.28 92.16 35 TF-PRINV 1 1 1 ALC 59/37/2.35 92.14 36 BF-ONE 0 1 1 CLC 55/47/2.22 92.13 37 BF-ENT 0 1 1 CLC 53/55/2.09 92.10 38 TF-ENT 0 1 1 SLC 56/39/2.38 91.97 39 TF-ONE 1 1 1 SLC 56/38/2.40 91.92 40 TF-GFIDF 0 1 1 ALC 61/37/2.31 91.82 41 BF-GFIDF 0 1 1 CLC 58/42/2.26 91.35 42 TF-ENT 0 1 1 ERC 55/40/2.38 91.27 43 TF-IDF 0 1 1 ERC 55/40/2.38 91.27 44 TF-IDF 0 1 1 SLC 55/40/2.38 91.27 45 TF-ONE 0 1 1 ALC 63/40/2.19 91.01


46 TF-GFIDF 0 1 1 ERC 56/44/2.26 90.94 47 TF-PRINV 0 1 1 ERC 55/43/2.31 90.88 48 TF-PRINV 0 1 1 SLC 55/43/2.31 90.88 49 TF-GFIDF 0 1 1 SLC 55/41/2.35 90.81 50 TF-PRINV 0 1 1 ALC 59/39/2.31 90.71 51 TF-ONE 0 1 1 ERC 56/47/2.19 90.57 52 TF-ONE 0 1 1 SLC 56/47/2.19 90.57 53 BF-GFIDF 1 1 1 ERC 55/45/2.26 89.59 54 BF-GFIDF 1 1 1 SLC 55/45/2.26 89.59 55 BF-GFIDF 1 1 1 ALC 61/41/2.22 89.06 56 BF-ONE 1 1 1 ERC 56/40/2.35 87.84 57 BF-ONE 1 1 1 SLC 56/40/2.35 87.84 58 BF-GFIDF 0 1 1 ERC 55/47/2.22 87.43 59 BF-GFIDF 0 1 1 SLC 55/47/2.22 87.43 60 BF-ENT 1 1 1 ALC 58/41/2.28 87.41 61 TF-ENT 0 0 1 CLC 56/53/2.07 86.80 62 TF-GFIDF 0 0 1 CLC 56/53/2.07 86.80 63 TF-IDF 0 0 1 CLC 56/53/2.07 86.80 64 TF-NORM 0 0 1 CLC 56/53/2.07 86.80 65 TF-ONE 0 0 1 CLC 56/53/2.07 86.80 66 TF-PRINV 0 0 1 CLC 56/53/2.07 86.80 67 TF-ONE 1 0 1 CLC 56/53/2.07 86.80 68 BF-GFIDF 0 1 1 ALC 62/40/2.22 86.71 69 BF-ONE 1 0 1 CLC 55/55/2.05 86.41 70 TF-IDF 1 0 1 CLC 56/54/2.05 86.35 71 TF-ENT 1 0 1 CLC 56/54/2.05 85.94 72 BF-ONE 0 1 1 ERC 56/39/2.38 85.71 73 BF-ONE 0 1 1 SLC 56/39/2.38 85.71 74 BF-ENT 1 0 1 CLC 54/57/2.04 85.66 75 BF-IDF 1 0 1 CLC 53/59/2.02 85.43 76 TF-GFIDF 1 0 1 CLC 54/58/2.02 85.25 77 TF-NORM 0 1 1 ALC 57/48/2.15 85.21 78 TF-NORM 0 1 1 ERC 56/45/2.24 85.16 79 BF-ONE 1 1 1 ALC 60/44/2.17 85.10 80 TF-NORM 1 1 1 ALC 57/50/2.11 84.98 81 TF-ENT 0 0 1 ERC 55/49/2.17 84.84 82 TF-GFIDF 0 0 1 ERC 55/49/2.17 84.84 83 TF-IDF 0 0 1 ERC 55/49/2.17 84.84 84 TF-NORM 0 0 1 ERC 55/49/2.17 84.84 85 TF-ONE 0 0 1 ERC 55/49/2.17 84.84 86 TF-PRINV 0 0 1 ERC 55/49/2.17 84.84 87 TF-ENT 0 0 1 SLC 55/49/2.17 84.84 88 TF-GFIDF 0 0 1 SLC 55/49/2.17 84.84 89 TF-IDF 0 0 1 SLC 55/49/2.17 84.84 90 TF-NORM 0 0 1 SLC 55/49/2.17 84.84 91 TF-ONE 0 0 1 SLC 55/49/2.17 84.84 92 TF-PRINV 0 0 1 SLC 55/49/2.17 84.84 93 TF-NORM 1 1 1 ERC 56/40/2.35 84.80 94 TF-NORM 1 1 1 SLC 56/40/2.35 84.80 95 TF-ENT 0 0 1 ALC 56/52/2.09 84.75 96 TF-GFIDF 0 0 1 ALC 56/52/2.09 84.75 97 TF-IDF 0 0 1 ALC 56/52/2.09 84.75 98 TF-NORM 0 0 1 ALC 
56/52/2.09 84.75 99 TF-ONE 0 0 1 ALC 56/52/2.09 84.75 100 TF-PRINV 0 0 1 ALC 56/52/2.09 84.75 101 BF-NORM 0 1 1 CLC 54/57/2.04 84.74 102 TF-NORM 1 0 1 CLC 56/56/2.02 84.73 103 TF-NORM 0 1 1 SLC 57/48/2.15 84.58 104 TF-ENT 1 0 1 ERC 55/50/2.15 84.39 105 TF-IDF 1 0 1 ERC 55/50/2.15 84.39 106 TF-PRINV 1 0 1 ERC 55/50/2.15 84.39


107 BF-ENT 1 1 1 ERC 55/35/2.51 84.34 108 BF-IDF 1 1 1 ERC 55/35/2.51 84.34 109 BF-ENT 1 1 1 SLC 55/35/2.51 84.34 110 BF-IDF 1 1 1 SLC 55/35/2.51 84.34 111 TF-ENT 1 0 1 SLC 55/52/2.11 84.25 112 TF-IDF 1 0 1 SLC 55/52/2.11 84.25 113 TF-PRINV 1 0 1 SLC 55/52/2.11 84.25 114 BF-IDF 1 1 1 ALC 59/44/2.19 84.25 115 BF-PRINV 1 1 1 ERC 55/50/2.15 83.98 116 BF-PRINV 1 1 1 SLC 56/32/2.57 83.80 117 TF-PRINV 1 0 1 CLC 56/58/1.98 83.78 118 TF-ONE 1 0 1 ERC 55/51/2.13 83.72 119 BF-ONE 0 1 1 ALC 61/44/2.15 83.59 120 TF-ONE 1 0 1 SLC 55/53/2.09 83.56 121 TF-ONE 1 0 1 ALC 57/51/2.09 83.40 122 BF-ENT 0 1 1 ERC 55/42/2.33 83.15 123 BF-ENT 0 1 1 SLC 55/42/2.33 83.15 124 TF-GFIDF 1 0 1 ERC 55/52/2.11 83.04 125 TF-GFIDF 1 0 1 SLC 56/55/2.04 83.00 126 BF-GFIDF 1 0 1 CLC 55/58/2.00 82.99 127 BF-IDF 1 0 1 ALC 56/55/2.04 82.76 128 BF-IDF 0 1 1 ERC 55/47/2.22 82.69 129 BF-IDF 0 1 1 SLC 55/47/2.22 82.69 130 TF-GFIDF 1 0 1 ALC 57/54/2.04 82.55 131 TF-ENT 1 0 1 ALC 58/51/2.07 82.45 132 TF-IDF 1 0 1 ALC 58/51/2.07 82.45 133 TF-PRINV 1 0 1 ALC 58/51/2.07 82.45 134 BF-IDF 0 1 1 ALC 60/42/2.22 82.42 135 BF-PRINV 1 1 1 ALC 60/44/2.17 82.38 136 BF-ENT 0 1 1 ALC 60/43/2.19 82.28 137 BF-PRINV 0 1 1 SLC 54/46/2.26 82.13 138 BF-PRINV 1 0 1 CLC 53/63/1.95 82.08 139 BF-PRINV 0 1 1 ERC 55/43/2.31 82.04 140 BF-ENT 0 0 1 CLC 53/65/1.92 82.01 141 BF-GFIDF 0 0 1 CLC 53/65/1.92 82.01 142 BF-IDF 0 0 1 CLC 53/65/1.92 82.01 143 BF-NORM 0 0 1 CLC 53/65/1.92 82.01 144 BF-ONE 0 0 1 CLC 53/65/1.92 82.01 145 BF-PRINV 0 0 1 CLC 53/65/1.92 82.01 146 BF-NORM 0 1 1 ERC 55/53/2.09 81.75 147 BF-NORM 0 1 1 SLC 55/53/2.09 81.75 148 TF-NORM 1 0 1 ALC 59/50/2.07 81.56 149 BF-GFIDF 1 0 1 ALC 57/51/2.09 81.53 150 BF-ONE 1 0 1 ALC 57/52/2.07 81.15 151 BF-PRINV 0 1 1 ALC 60/46/2.13 80.72 152 BF-NORM 1 1 1 CLC 52/68/1.88 80.58 153 BF-NORM 1 0 1 CLC 54/63/1.93 80.51 154 TF-NORM 1 0 1 ERC 56/54/2.05 80.24 155 TF-NORM 1 0 1 SLC 56/56/2.02 80.00 156 BF-ENT 0 0 1 ALC 57/54/2.04 79.92 157 BF-GFIDF 0 0 1 ALC 57/54/2.04 79.92 158 BF-IDF 0 0 1 
ALC 57/54/2.04 79.92 159 BF-NORM 0 0 1 ALC 57/54/2.04 79.92 160 BF-ONE 0 0 1 ALC 57/54/2.04 79.92 161 BF-PRINV 0 0 1 ALC 57/54/2.04 79.92 162 BF-ENT 1 0 1 ALC 57/56/2.00 79.33 163 BF-PRINV 1 0 1 ALC 57/56/2.00 79.33 164 BF-GFIDF 1 0 1 ERC 55/59/1.98 78.05 165 BF-GFIDF 1 0 1 SLC 56/62/1.92 77.80 166 BF-PRINV 1 0 1 ERC 56/62/1.92 77.45 167 BF-PRINV 1 0 1 SLC 56/62/1.92 77.45


168 BF-ENT 0 0 1 SLC 56/62/1.92 77.12 169 BF-GFIDF 0 0 1 SLC 56/62/1.92 77.12 170 BF-IDF 0 0 1 SLC 56/62/1.92 77.12 171 BF-NORM 0 0 1 SLC 56/62/1.92 77.12 172 BF-ONE 0 0 1 SLC 56/62/1.92 77.12 173 BF-PRINV 0 0 1 SLC 56/62/1.92 77.12 174 BF-NORM 1 1 1 ERC 56/52/2.09 77.05 175 BF-NORM 1 1 1 SLC 56/52/2.09 77.05 176 BF-ENT 0 0 1 ERC 55/61/1.95 76.80 177 BF-GFIDF 0 0 1 ERC 55/61/1.95 76.80 178 BF-IDF 0 0 1 ERC 55/61/1.95 76.80 179 BF-NORM 0 0 1 ERC 55/61/1.95 76.80 180 BF-ONE 0 0 1 ERC 55/61/1.95 76.80 181 BF-PRINV 0 0 1 ERC 55/61/1.95 76.80 182 BF-ENT 1 0 1 ERC 56/63/1.90 76.39 183 BF-IDF 1 0 1 ERC 56/63/1.90 76.39 184 BF-ENT 1 0 1 SLC 56/63/1.90 76.39 185 BF-IDF 1 0 1 SLC 56/63/1.90 76.39 186 BF-NORM 1 0 1 ALC 56/62/1.92 75.91 187 BF-ONE 1 0 1 SLC 55/61/1.95 75.83 188 BF-ONE 1 0 1 ERC 56/60/1.95 75.78 189 TF-GFIDF 1 1 1 SBC 221/5/5.00 75.50 190 TF-ENT 1 1 1 SBC 225/1/4.97 75.49 191 TF-PRINV 1 1 1 SBC 224/2/4.96 75.41 192 TF-GFIDF 1 1 0 CLC 55/69/1.82 75.11 193 TF-IDF 1 1 1 SBC 224/2/4.99 75.04 194 TF-IDF 0 1 1 SBC 226/0/5.04 75.03 195 TF-ENT 0 1 1 SBC 226/0/5.04 75.00 196 TF-ENT 1 1 0 CLC 56/63/1.90 74.95 197 TF-ONE 1 1 1 SBC 223/3/4.96 74.83 198 BF-GFIDF 1 1 1 SBC 224/2/4.89 74.82 199 TF-PRINV 0 1 1 SBC 226/0/5.06 74.74 200 TF-ONE 0 1 1 SBC 224/2/4.97 74.63 201 BF-NORM 0 1 1 ALC 59/57/1.95 74.63 202 TF-GFIDF 0 1 1 SBC 223/3/5.01 74.58 203 TF-PRINV 1 1 0 CLC 54/67/1.87 74.41 204 BF-GFIDF 0 1 1 SBC 222/4/4.96 73.64 205 BF-ENT 1 1 1 SBC 221/5/4.78 73.57 206 BF-NORM 1 1 1 ALC 59/61/1.88 73.36 207 BF-IDF 1 1 1 SBC 221/5/4.82 73.14 208 BF-ONE 1 1 1 SBC 222/4/5.02 73.10 209 TF-GFIDF 1 1 0 ERC 57/67/1.82 73.00 210 TF-GFIDF 1 1 0 SLC 57/67/1.82 73.00 211 BF-PRINV 1 1 1 SBC 221/5/4.83 72.94 212 BF-NORM 1 0 1 SLC 58/62/1.88 72.49 213 TF-ENT 1 1 0 ERC 51/62/2.00 71.76 214 TF-IDF 1 1 0 ERC 54/67/1.87 71.52 215 TF-ONE 1 1 0 CLC 57/67/1.82 71.40 216 TF-ONE 1 1 0 ERC 49/56/2.15 71.33 217 BF-ENT 0 1 1 SBC 221/5/4.88 71.28 218 TF-IDF 1 1 0 CLC 54/76/1.74 71.23 219 TF-ENT 1 1 0 SLC 
54/71/1.81 71.08 220 TF-IDF 1 1 0 SLC 54/71/1.81 71.08 221 BF-ONE 0 1 1 SBC 217/9/5.03 71.04 222 BF-PRINV 0 1 1 SBC 220/6/4.87 71.04 223 BF-IDF 0 1 1 SBC 221/5/4.91 71.00 224 BF-NORM 1 0 1 ERC 55/68/1.84 70.69 225 TF-NORM 0 1 1 SBC 223/3/5.06 70.67 226 TF-ONE 1 1 0 SLC 57/73/1.74 70.59 227 TF-NORM 1 1 1 SBC 223/3/5.05 70.46 228 TF-GFIDF 1 1 0 ALC 65/53/1.92 70.04


229 TF-GFIDF 1 0 1 SBC 222/4/5.04 70.03 230 TF-ENT 1 0 1 SBC 223/3/5.16 69.90 231 TF-IDF 1 0 1 SBC 222/4/5.14 69.87 232 TF-PRINV 1 0 1 SBC 222/4/5.15 69.78 233 TF-ONE 1 0 1 SBC 220/6/5.16 69.66 234 TF-PRINV 1 1 0 ERC 52/66/1.92 69.28 235 BF-GFIDF 1 0 1 SBC 219/7/5.07 69.00 236 TF-ENT 1 1 0 ALC 58/72/1.74 68.52 237 BF-ONE 1 0 1 SBC 220/6/5.10 68.46 238 TF-PRINV 1 1 0 ALC 60/69/1.75 68.36 239 TF-ONE 1 1 0 ALC 64/61/1.81 68.35 240 TF-ENT 0 0 1 SBC 218/8/5.03 68.30 241 TF-GFIDF 0 0 1 SBC 218/8/5.03 68.30 242 TF-IDF 0 0 1 SBC 218/8/5.03 68.30 243 TF-NORM 0 0 1 SBC 218/8/5.03 68.30 244 TF-ONE 0 0 1 SBC 218/8/5.03 68.30 245 TF-PRINV 0 0 1 SBC 218/8/5.03 68.30 246 TF-PRINV 1 1 0 SLC 54/72/1.79 68.29 247 TF-GFIDF 1 1 0 SBC 205/21/5.17 68.25 248 BF-PRINV 1 0 1 SBC 219/7/4.93 67.69 249 BF-IDF 1 0 1 SBC 219/7/4.98 67.66 250 BF-ENT 1 0 1 SBC 219/7/4.99 67.63 251 TF-PRINV 1 0 0 CLC 59/70/1.75 67.44 252 TF-NORM 1 0 1 SBC 218/8/5.07 67.34 253 TF-ONE 1 0 0 CLC 55/78/1.70 67.13 254 TF-ONE 0 1 0 ERC 47/61/2.09 66.90 255 TF-IDF 1 1 0 ALC 58/72/1.74 66.82 256 TF-IDF 1 0 0 CLC 59/71/1.74 66.51 257 BF-ENT 0 0 1 SBC 218/8/5.00 66.44 258 BF-GFIDF 0 0 1 SBC 218/8/5.00 66.44 259 BF-IDF 0 0 1 SBC 218/8/5.00 66.44 260 BF-NORM 0 0 1 SBC 218/8/5.00 66.44 261 BF-ONE 0 0 1 SBC 218/8/5.00 66.44 262 BF-PRINV 0 0 1 SBC 218/8/5.00 66.44 263 TF-ENT 0 1 0 CLC 54/71/1.81 66.37 264 BF-GFIDF 1 1 0 ERC 55/75/1.74 66.37 265 TF-ONE 1 1 0 SBC 205/21/5.28 66.26 266 TF-ENT 1 0 0 ERC 52/68/1.88 66.25 267 TF-GFIDF 1 0 0 CLC 56/78/1.69 66.20 268 BF-GFIDF 1 1 0 SLC 55/77/1.71 66.06 269 TF-IDF 1 0 0 ERC 51/70/1.87 65.97 270 TF-PRINV 1 0 0 ERC 51/70/1.87 65.97 271 TF-PRINV 0 1 0 CLC 54/73/1.78 65.77 272 TF-ONE 0 1 0 CLC 56/73/1.75 65.75 273 BF-NORM 0 1 1 SBC 212/14/4.64 65.75 274 TF-IDF 0 1 0 ERC 49/69/1.92 65.47 275 TF-PRINV 0 1 0 ERC 49/69/1.92 65.47 276 TF-ENT 1 0 0 SLC 53/73/1.79 65.34 277 TF-NORM 1 1 0 CLC 57/73/1.74 65.28 278 TF-ENT 1 1 0 SBC 205/21/5.32 65.13 279 TF-ENT 0 1 0 ERC 49/70/1.90 64.93 280 TF-IDF 1 0 
0 SLC 51/76/1.78 64.90 281 TF-GFIDF 1 0 0 ALC 60/72/1.71 64.78 282 TF-PRINV 1 1 0 SBC 206/20/5.37 64.67 283 TF-IDF 1 1 0 SBC 206/20/5.38 64.60 284 TF-PRINV 1 0 0 SLC 51/78/1.75 64.57 285 TF-GFIDF 1 0 0 ERC 53/67/1.88 64.44 286 TF-ENT 1 0 0 CLC 59/73/1.71 64.39 287 BF-NORM 1 1 1 SBC 205/21/4.39 64.02 288 TF-IDF 1 0 0 ALC 58/77/1.67 63.96 289 BF-NORM 1 0 1 SBC 203/23/4.53 63.90


290 TF-ONE 1 0 0 ERC 53/68/1.87 63.58 291 TF-NORM 1 0 0 ERC 49/71/1.88 63.52 292 TF-ONE 1 0 0 ALC 61/72/1.70 63.48 293 TF-PRINV 0 1 0 SLC 49/81/1.74 63.30 294 TF-GFIDF 0 1 0 ERC 47/61/2.09 63.29 295 TF-ONE 0 1 0 SLC 50/76/1.79 63.29 296 TF-IDF 0 1 0 CLC 54/80/1.69 63.21 297 TF-ENT 1 0 0 ALC 60/73/1.70 63.18 298 TF-PRINV 1 0 0 ALC 59/76/1.67 63.16 299 TF-ENT 0 1 0 SLC 51/77/1.77 63.13 300 TF-NORM 1 1 0 ERC 52/77/1.75 63.13 301 TF-GFIDF 1 0 0 SLC 53/75/1.77 62.95 302 TF-GFIDF 0 1 0 SLC 49/71/1.88 62.78 303 TF-ONE 0 1 0 SBC 208/18/5.63 62.55 304 TF-GFIDF 0 1 0 SBC 205/21/5.49 62.54 305 TF-ONE 1 0 0 SLC 53/74/1.78 62.39 306 TF-IDF 0 1 0 SLC 52/76/1.77 62.22 307 TF-NORM 1 0 0 CLC 57/79/1.66 62.17 308 TF-NORM 1 0 0 SLC 52/81/1.70 62.04 309 TF-NORM 0 1 0 CLC 56/81/1.65 61.84 310 TF-NORM 1 1 0 SLC 53/82/1.67 61.68 311 TF-ENT 0 1 0 SBC 211/15/5.88 61.62 312 TF-IDF 0 1 0 SBC 212/14/5.88 61.58 313 TF-ENT 0 1 0 ALC 56/82/1.64 61.54 314 TF-PRINV 0 1 0 SBC 206/20/5.84 61.46 315 TF-NORM 1 0 0 ALC 60/79/1.63 61.43 316 BF-GFIDF 1 1 0 SBC 206/20/5.58 61.42 317 TF-PRINV 0 1 0 ALC 56/80/1.66 61.24 318 TF-ENT 0 0 0 ERC 50/75/1.81 61.11 319 TF-GFIDF 0 0 0 ERC 50/75/1.81 61.11 320 TF-IDF 0 0 0 ERC 50/75/1.81 61.11 321 TF-NORM 0 0 0 ERC 50/75/1.81 61.11 322 TF-ONE 0 0 0 ERC 50/75/1.81 61.11 323 TF-PRINV 0 0 0 ERC 50/75/1.81 61.11 324 TF-GFIDF 0 1 0 CLC 57/74/1.73 61.03 325 TF-IDF 0 1 0 ALC 57/79/1.66 60.91 326 TF-NORM 1 1 0 ALC 58/84/1.59 60.05 327 TF-ENT 0 0 0 CLC 53/84/1.65 60.00 328 TF-GFIDF 0 0 0 CLC 53/84/1.65 60.00 329 TF-IDF 0 0 0 CLC 53/84/1.65 60.00 330 TF-NORM 0 0 0 CLC 53/84/1.65 60.00 331 TF-ONE 0 0 0 CLC 53/84/1.65 60.00 332 TF-PRINV 0 0 0 CLC 53/84/1.65 60.00 333 BF-GFIDF 0 1 0 ERC 45/62/2.11 59.86 334 BF-ONE 1 1 0 ERC 41/60/2.24 59.59 335 BF-GFIDF 0 1 0 SLC 46/69/1.97 59.49 336 TF-NORM 0 1 0 ALC 55/85/1.61 59.42 337 TF-ENT 0 0 0 SLC 50/83/1.70 59.36 338 TF-GFIDF 0 0 0 SLC 50/83/1.70 59.36 339 TF-IDF 0 0 0 SLC 50/83/1.70 59.36 340 TF-NORM 0 0 0 SLC 50/83/1.70 59.36 341 
TF-ONE 0 0 0 SLC 50/83/1.70 59.36 342 TF-PRINV 0 0 0 SLC 50/83/1.70 59.36 343 BF-GFIDF 1 1 0 CLC 55/79/1.69 59.35 344 BF-GFIDF 1 0 0 ERC 43/65/2.09 59.19 345 TF-NORM 0 1 0 ERC 49/85/1.69 58.93 346 BF-ENT 1 1 0 ERC 41/65/2.13 58.90 347 BF-PRINV 1 1 0 ERC 41/64/2.15 58.61 348 BF-IDF 1 1 0 ERC 41/66/2.11 58.43 349 TF-GFIDF 1 0 0 SBC 209/17/5.47 58.40 350 TF-PRINV 1 0 0 SBC 211/15/5.48 58.33


351 BF-ONE 1 1 0 SLC 42/69/2.04 58.28 352 TF-ENT 1 0 0 SBC 210/16/5.49 58.26 353 BF-GFIDF 1 1 0 ALC 56/82/1.64 58.25 354 TF-IDF 1 0 0 SBC 211/15/5.50 58.18 355 BF-PRINV 1 1 0 SLC 42/69/2.04 58.14 356 TF-ENT 0 0 0 ALC 60/78/1.64 58.06 357 TF-GFIDF 0 0 0 ALC 60/78/1.64 58.06 358 TF-IDF 0 0 0 ALC 60/78/1.64 58.06 359 TF-NORM 0 0 0 ALC 60/78/1.64 58.06 360 TF-ONE 0 0 0 ALC 60/78/1.64 58.06 361 TF-PRINV 0 0 0 ALC 60/78/1.64 58.06 362 BF-ONE 0 1 0 CLC 51/84/1.67 58.06 363 TF-GFIDF 0 1 0 ALC 61/79/1.61 58.00 364 TF-ONE 0 1 0 ALC 61/74/1.67 57.97 365 BF-GFIDF 0 1 0 SBC 200/26/5.35 57.93 366 TF-ONE 1 0 0 SBC 208/18/5.53 57.60 367 BF-IDF 0 1 0 CLC 53/74/1.78 57.58 368 BF-ONE 1 0 0 ERC 50/81/1.73 57.58 369 BF-ENT 1 1 0 SLC 42/72/1.98 57.55 370 BF-IDF 1 1 0 SLC 43/72/1.97 57.50 371 BF-GFIDF 0 1 0 CLC 57/78/1.67 57.28 372 BF-ENT 0 0 0 ERC 51/85/1.66 57.27 373 BF-GFIDF 0 0 0 ERC 51/85/1.66 57.27 374 BF-IDF 0 0 0 ERC 51/85/1.66 57.27 375 BF-NORM 0 0 0 ERC 51/85/1.66 57.27 376 BF-ONE 0 0 0 ERC 51/85/1.66 57.27 377 BF-PRINV 0 0 0 ERC 51/85/1.66 57.27 378 TF-NORM 0 1 0 SLC 42/74/1.95 57.25 379 BF-ENT 1 0 0 ERC 45/75/1.88 57.25 380 BF-IDF 1 0 0 ERC 45/75/1.88 57.25 381 BF-PRINV 1 0 0 ERC 45/75/1.88 57.25 382 BF-ONE 1 1 0 SBC 206/20/5.81 57.09 383 TF-NORM 1 1 0 SBC 196/30/5.34 56.90 384 BF-GFIDF 1 0 0 SLC 55/92/1.54 56.85 385 TF-NORM 0 1 0 SBC 200/26/5.45 56.77 386 BF-ENT 1 1 0 SBC 210/16/5.88 56.60 387 BF-IDF 1 1 0 SBC 210/16/5.86 56.50 388 BF-PRINV 1 1 0 SBC 211/15/5.99 56.42 389 TF-ENT 0 0 0 SBC 207/19/5.63 56.31 390 TF-GFIDF 0 0 0 SBC 207/19/5.63 56.31 391 TF-IDF 0 0 0 SBC 207/19/5.63 56.31 392 TF-NORM 0 0 0 SBC 207/19/5.63 56.31 393 TF-ONE 0 0 0 SBC 207/19/5.63 56.31 394 TF-PRINV 0 0 0 SBC 207/19/5.63 56.31 395 BF-ENT 0 1 0 CLC 53/82/1.67 56.21 396 BF-ENT 1 1 0 CLC 55/79/1.69 56.21 397 TF-NORM 1 0 0 SBC 200/26/5.36 56.16 398 BF-ONE 1 0 0 SLC 50/87/1.65 56.02 399 BF-GFIDF 1 0 0 SBC 202/24/5.48 55.99 400 BF-IDF 1 1 0 CLC 54/81/1.67 55.87 401 BF-PRINV 1 1 0 CLC 54/81/1.67 55.87 402 
BF-ONE 1 1 0 CLC 54/83/1.65 55.77 403 BF-ENT 1 0 0 SLC 54/89/1.58 55.75 404 BF-IDF 1 0 0 SLC 54/89/1.58 55.75 405 BF-PRINV 1 0 0 SLC 54/89/1.58 55.75 406 BF-NORM 1 0 0 ERC 51/86/1.65 55.71 407 BF-ONE 0 1 0 ERC 38/59/2.33 55.46 408 BF-ENT 0 0 0 SLC 51/91/1.59 55.42 409 BF-GFIDF 0 0 0 SLC 51/91/1.59 55.42 410 BF-IDF 0 0 0 SLC 51/91/1.59 55.42 411 BF-NORM 0 0 0 SLC 51/91/1.59 55.42


412 BF-ONE 0 0 0 SLC 51/91/1.59 55.42 413 BF-PRINV 0 0 0 SLC 51/91/1.59 55.42 414 BF-ENT 0 1 0 ERC 38/62/2.26 55.40 415 BF-ONE 0 1 0 SBC 205/21/5.69 55.37 416 BF-ONE 1 0 0 SBC 202/24/5.48 55.27 417 BF-PRINV 0 1 0 ERC 37/59/2.35 55.19 418 BF-IDF 1 0 0 SBC 198/28/5.36 55.15 419 BF-PRINV 1 0 0 SBC 198/28/5.36 55.12 420 BF-ENT 1 0 0 SBC 198/28/5.37 55.09 421 BF-ONE 1 0 0 CLC 55/83/1.64 54.94 422 BF-IDF 0 1 0 ERC 37/62/2.28 54.90 423 BF-ENT 0 1 0 SBC 206/20/5.96 54.73 424 BF-PRINV 0 1 0 CLC 51/82/1.70 54.63 425 BF-IDF 1 0 0 CLC 57/82/1.63 54.63 426 BF-IDF 0 1 0 SBC 203/23/5.72 54.62 427 BF-NORM 1 0 0 SLC 55/89/1.57 54.59 428 BF-PRINV 0 1 0 SBC 200/26/5.64 54.51 429 BF-PRINV 0 1 0 SLC 36/52/2.57 54.45 430 BF-ENT 0 1 0 SLC 38/68/2.13 54.11 431 BF-ENT 0 0 0 SBC 198/28/5.58 53.97 432 BF-GFIDF 0 0 0 SBC 198/28/5.58 53.97 433 BF-IDF 0 0 0 SBC 198/28/5.58 53.97 434 BF-NORM 0 0 0 SBC 198/28/5.58 53.97 435 BF-ONE 0 0 0 SBC 198/28/5.58 53.97 436 BF-PRINV 0 0 0 SBC 198/28/5.58 53.97 437 BF-IDF 0 1 0 SLC 36/53/2.54 53.95 438 BF-IDF 1 0 0 ALC 58/88/1.55 53.81 439 BF-GFIDF 0 1 0 ALC 58/79/1.65 53.53 440 BF-NORM 0 1 0 ERC 36/59/2.38 53.45 441 BF-NORM 1 1 0 ERC 50/92/1.59 53.18 442 BF-NORM 0 1 0 SLC 36/61/2.33 52.98 443 BF-PRINV 1 1 0 ALC 58/77/1.67 52.88 444 BF-IDF 1 1 0 ALC 59/77/1.66 52.78 445 BF-ONE 0 1 0 SLC 46/78/1.82 52.61 446 BF-PRINV 1 0 0 CLC 58/87/1.56 52.55 447 BF-NORM 1 0 0 SBC 193/33/5.17 52.50 448 BF-NORM 0 1 0 SBC 191/35/5.35 52.35 449 BF-GFIDF 1 0 0 CLC 55/96/1.50 52.22 450 BF-NORM 1 1 0 SLC 36/62/2.31 52.08 451 BF-GFIDF 1 0 0 ALC 59/73/1.71 51.99 452 BF-NORM 1 1 0 SBC 186/40/5.04 51.67 453 BF-NORM 1 0 0 CLC 56/91/1.54 51.40 454 BF-NORM 1 0 0 ALC 62/86/1.53 50.92 455 BF-PRINV 1 0 0 ALC 58/88/1.55 50.90 456 BF-ENT 1 1 0 ALC 58/80/1.64 50.73 457 BF-ENT 0 0 0 CLC 56/96/1.49 50.66 458 BF-GFIDF 0 0 0 CLC 56/96/1.49 50.66 459 BF-IDF 0 0 0 CLC 56/96/1.49 50.66 460 BF-NORM 0 0 0 CLC 56/96/1.49 50.66 461 BF-ONE 0 0 0 CLC 56/96/1.49 50.66 462 BF-PRINV 0 0 0 CLC 56/96/1.49 50.66 
463 BF-ONE 1 0 0 ALC 62/77/1.63 50.50 464 BF-ENT 1 0 0 ALC 58/90/1.53 50.39 465 BF-ENT 1 0 0 CLC 56/98/1.47 50.13 466 BF-ONE 1 1 0 ALC 55/95/1.51 49.74 467 BF-ENT 0 0 0 ALC 59/79/1.64 49.28 468 BF-GFIDF 0 0 0 ALC 59/79/1.64 49.28 469 BF-IDF 0 0 0 ALC 59/79/1.64 49.28 470 BF-NORM 0 0 0 ALC 59/79/1.64 49.28 471 BF-ONE 0 0 0 ALC 59/79/1.64 49.28 472 BF-PRINV 0 0 0 ALC 59/79/1.64 49.28


473 BF-ONE 0 1 0 ALC 56/93/1.52 49.22 474 BF-ENT 0 1 0 ALC 56/96/1.49 48.55 475 BF-IDF 0 1 0 ALC 56/96/1.49 48.55 476 BF-PRINV 0 1 0 ALC 55/98/1.48 48.15 477 BF-NORM 1 1 0 CLC 53/96/1.52 48.08 478 BF-NORM 0 1 0 CLC 55/97/1.49 47.89 479 BF-NORM 1 1 0 ALC 54/101/1.46 47.87 480 BF-NORM 0 1 0 ALC 53/96/1.52 47.69
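In the rankings above, the strongest combinations use hierarchical clustering with the complete linkage function (CLC). As an illustrative sketch only — using hypothetical 1-D toy data and a toy distance function, not the thesis implementation or its name-alias features — complete-linkage agglomerative clustering with a distance cutoff can be written as:

```python
# Illustrative complete-linkage agglomerative clustering with a distance
# cutoff. All names and the toy data below are hypothetical examples.

def complete_linkage_clusters(items, dist, cutoff):
    """Greedily merge the closest pair of clusters, where cluster distance
    is the MAXIMUM pairwise item distance (complete linkage); stop when no
    pair of clusters is closer than `cutoff`."""
    clusters = [[x] for x in items]
    while True:
        best = None  # (distance, i, j) of the closest mergeable pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: max distance over all cross-cluster pairs.
                d = max(dist(a, b) for a in clusters[i] for b in clusters[j])
                if d < cutoff and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

# Toy example: points close together end up in one cluster, analogous to
# grouping a name with its aliases under a similarity cutoff.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
result = complete_linkage_clusters(points, lambda a, b: abs(a - b), cutoff=1.0)
# result groups {0.0, 0.1, 0.2} and {5.0, 5.1}
```

Because complete linkage requires *every* pair across two clusters to be close, it tends to produce compact clusters, which is consistent with its strong showing in the rankings above.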

Appendix B

The First Framework: Experimental Results in DPL

Rank Weighting NOR COC NET Clustering Cluster Characteristic F1

1 TF-ONE 0 1 1 CLC 103/66/1.97 76.37 2 TF-ENT 1 1 1 CLC 102/61/2.04 76.35 3 TF-IDF 1 1 1 CLC 104/61/2.02 76.22 4 TF-ONE 1 1 1 CLC 103/57/2.08 76.08 5 TF-PRINV 1 1 1 CLC 104/64/1.98 75.00 6 TF-NORM 1 1 1 CLC 101/65/2.01 73.36 7 TF-ONE 1 1 0 CLC 109/58/1.99 72.81 8 TF-ENT 0 1 1 CLC 107/55/2.06 72.05 9 BF-GFIDF 1 1 1 CLC 105/72/1.88 71.29 10 TF-ENT 1 1 0 CLC 107/48/2.15 71.24 11 TF-GFIDF 1 1 1 CLC 113/53/2.01 70.74 12 TF-PRINV 0 1 1 CLC 111/48/2.09 70.70 13 BF-GFIDF 1 1 0 CLC 109/70/1.86 70.41 14 TF-IDF 0 1 1 CLC 109/55/2.03 70.03 15 TF-NORM 0 1 1 CLC 104/64/1.98 69.94 16 TF-IDF 1 1 0 CLC 108/49/2.12 69.91 17 BF-GFIDF 0 1 1 CLC 107/73/1.85 69.88 18 TF-ENT 1 1 1 ERC 101/45/2.28 69.78 19 TF-ENT 1 1 1 SLC 101/49/2.22 69.47 20 BF-ONE 1 1 1 CLC 86/103/1.76 69.43 21 TF-IDF 1 1 1 SLC 101/39/2.38 69.37 22 BF-PRINV 1 1 1 CLC 105/69/1.91 69.18 23 BF-ENT 1 1 1 CLC 92/98/1.80 68.94 24 BF-PRINV 1 1 0 CLC 105/72/1.88 68.90 25 TF-IDF 1 1 1 ERC 105/57/2.06 68.69 26 BF-IDF 0 1 1 CLC 107/69/1.89 68.67 27 TF-ENT 1 0 1 ALC 115/60/1.90 68.61 28 BF-IDF 1 1 1 CLC 86/105/1.74 68.51 29 TF-IDF 1 1 1 ALC 111/68/1.86 68.50 30 TF-GFIDF 0 1 1 CLC 114/54/1.98 68.49 31 BF-ENT 0 1 1 CLC 108/68/1.89 68.45 32 TF-PRINV 1 1 1 ERC 105/57/2.06 68.29 33 TF-PRINV 1 1 1 SLC 105/57/2.06 68.29 34 BF-ONE 1 1 1 ERC 99/71/1.96 68.26 35 TF-GFIDF 1 1 1 ALC 117/51/1.98 67.98 36 TF-IDF 1 0 1 ALC 114/60/1.91 67.92 37 TF-PRINV 1 0 1 ALC 116/58/1.91 67.81 38 TF-PRINV 1 1 1 ALC 117/44/2.07 67.73 39 TF-PRINV 1 0 1 CLC 113/65/1.87 67.70 40 TF-IDF 1 1 0 ALC 112/67/1.86 67.70 41 BF-PRINV 0 1 1 CLC 102/74/1.89 67.63 42 TF-GFIDF 1 0 1 CLC 110/73/1.82 67.60 43 TF-ENT 1 0 1 CLC 113/62/1.90 67.58 44 TF-IDF 1 0 1 CLC 113/62/1.90 67.58 45 TF-IDF 1 1 0 SLC 105/56/2.07 67.57


46 TF-ENT 1 1 0 SLC 104/62/2.01 67.40 47 TF-ENT 0 0 1 CLC 108/70/1.87 67.24 48 TF-GFIDF 0 0 1 CLC 108/70/1.87 67.24 49 TF-IDF 0 0 1 CLC 108/70/1.87 67.24 50 TF-NORM 0 0 1 CLC 108/70/1.87 67.24 51 TF-ONE 0 0 1 CLC 108/70/1.87 67.24 52 TF-PRINV 0 0 1 CLC 108/70/1.87 67.24 53 TF-GFIDF 1 0 0 CLC 111/66/1.88 67.24 54 TF-ENT 1 0 1 ERC 106/61/1.99 67.19 55 TF-ENT 1 1 0 ALC 113/70/1.82 67.14 56 BF-IDF 1 1 0 CLC 107/69/1.89 67.11 57 TF-ENT 1 1 1 ALC 112/71/1.82 67.02 58 TF-PRINV 1 0 0 CLC 115/62/1.88 67.01 59 BF-ONE 1 1 0 CLC 86/104/1.75 66.99 60 TF-IDF 1 0 1 ERC 106/59/2.02 66.98 61 BF-ONE 1 1 1 SLC 99/77/1.89 66.98 62 TF-ENT 1 0 0 ALC 119/55/1.91 66.90 63 TF-ENT 1 0 0 CLC 116/57/1.92 66.89 64 BF-NORM 0 1 1 CLC 106/69/1.90 66.78 65 BF-NORM 0 1 1 ALC 108/73/1.84 66.78 66 TF-ENT 1 0 0 ERC 106/62/1.98 66.77 67 TF-IDF 1 0 0 ERC 106/62/1.98 66.77 68 TF-PRINV 1 1 0 SLC 105/57/2.06 66.77 69 TF-PRINV 1 1 0 CLC 111/51/2.06 66.77 70 TF-PRINV 1 0 1 ERC 104/59/2.04 66.57 71 TF-IDF 1 0 0 CLC 115/63/1.87 66.55 72 TF-ONE 1 0 0 CLC 114/63/1.88 66.55 73 BF-ENT 1 0 1 CLC 109/70/1.86 66.55 74 BF-ENT 1 1 0 CLC 108/72/1.85 66.55 75 BF-ENT 1 1 1 ALC 109/71/1.85 66.55 76 BF-ONE 0 1 1 ALC 110/69/1.86 66.44 77 TF-PRINV 1 0 0 ALC 118/55/1.92 66.44 78 TF-ONE 1 0 1 CLC 113/64/1.88 66.44 79 TF-ONE 1 0 1 ALC 113/63/1.89 66.44 80 TF-NORM 1 1 1 ERC 103/67/1.96 66.36 81 TF-NORM 1 1 1 SLC 103/67/1.96 66.36 82 TF-ENT 1 0 1 SLC 109/61/1.96 66.34 83 TF-IDF 1 0 0 ALC 120/52/1.94 66.32 84 BF-IDF 1 0 1 CLC 108/72/1.85 66.32 85 BF-IDF 1 1 1 ALC 108/72/1.85 66.32 86 BF-ONE 1 1 0 ERC 99/68/1.99 66.18 87 BF-GFIDF 1 1 1 ERC 97/72/1.97 66.09 88 TF-IDF 1 0 1 SLC 107/60/1.99 66.02 89 TF-GFIDF 1 1 0 CLC 114/51/2.02 66.02 90 BF-ENT 1 1 0 ALC 110/69/1.86 65.99 91 TF-ONE 1 0 0 ALC 113/64/1.88 65.98 92 TF-GFIDF 1 0 1 ALC 112/67/1.86 65.98 93 BF-GFIDF 1 1 0 SLC 97/86/1.82 65.93 94 TF-ENT 1 0 0 SLC 108/66/1.91 65.89 95 TF-PRINV 1 1 0 ERC 101/78/1.86 65.89 96 BF-GFIDF 1 1 1 SLC 97/82/1.86 65.84 97 BF-ONE 0 1 1 ERC 102/78/1.85 
65.80 98 TF-IDF 1 0 0 SLC 107/63/1.96 65.79 99 BF-GFIDF 0 1 0 CLC 104/80/1.81 65.78 100 BF-IDF 1 1 0 ALC 109/70/1.86 65.76 101 TF-PRINV 1 1 0 ALC 112/70/1.83 65.73 102 BF-GFIDF 1 1 1 ALC 110/75/1.80 65.73 103 TF-ENT 1 1 0 ERC 103/61/2.03 65.66 104 TF-PRINV 1 0 1 SLC 105/60/2.02 65.62 105 BF-GFIDF 1 1 0 ALC 111/73/1.81 65.62 106 BF-ONE 1 1 1 ALC 107/83/1.75 65.60


107 TF-NORM 1 1 0 ERC 102/67/1.97 65.54 108 TF-NORM 1 1 0 SLC 102/67/1.97 65.54 109 TF-IDF 1 1 0 ERC 104/55/2.09 65.51 110 TF-ONE 1 1 1 ALC 119/60/1.86 65.50 111 BF-ONE 1 1 0 ALC 108/81/1.76 65.48 112 TF-ENT 0 0 0 CLC 113/58/1.95 65.44 113 TF-GFIDF 0 0 0 CLC 113/58/1.95 65.44 114 TF-IDF 0 0 0 CLC 113/58/1.95 65.44 115 TF-NORM 0 0 0 CLC 113/58/1.95 65.44 116 TF-ONE 0 0 0 CLC 113/58/1.95 65.44 117 TF-PRINV 0 0 0 CLC 113/58/1.95 65.44 118 BF-PRINV 1 0 1 CLC 107/72/1.86 65.41 119 BF-GFIDF 0 1 1 ALC 112/75/1.78 65.37 120 TF-PRINV 1 0 0 ERC 103/66/1.97 65.33 121 TF-NORM 1 1 0 CLC 102/68/1.96 65.30 122 BF-ENT 0 1 0 CLC 108/75/1.82 65.27 123 TF-NORM 1 1 0 ALC 113/70/1.82 65.26 124 TF-NORM 1 1 1 ALC 113/69/1.83 65.26 125 BF-PRINV 1 1 1 ALC 110/61/1.95 65.24 126 TF-ENT 0 0 1 ALC 110/70/1.85 65.17 127 TF-GFIDF 0 0 1 ALC 110/70/1.85 65.17 128 TF-IDF 0 0 1 ALC 110/70/1.85 65.17 129 TF-NORM 0 0 1 ALC 110/70/1.85 65.17 130 TF-ONE 0 0 1 ALC 110/70/1.85 65.17 131 TF-PRINV 0 0 1 ALC 110/70/1.85 65.17 132 BF-PRINV 0 1 0 CLC 104/81/1.80 65.16 133 BF-IDF 0 1 1 ALC 110/73/1.82 65.16 134 TF-NORM 1 0 1 CLC 113/68/1.84 65.15 135 TF-NORM 0 1 0 CLC 110/78/1.77 65.13 136 BF-ENT 1 1 1 SLC 100/74/1.91 65.10 137 BF-ONE 0 1 1 SLC 103/79/1.83 65.00 138 TF-ONE 0 1 0 CLC 101/71/1.94 64.94 139 BF-ONE 0 1 0 ALC 110/73/1.82 64.93 140 BF-GFIDF 1 0 1 CLC 105/82/1.78 64.90 141 BF-ONE 1 1 0 SLC 99/74/1.92 64.84 142 BF-ENT 1 1 1 ERC 100/71/1.95 64.83 143 BF-IDF 0 1 0 CLC 106/77/1.82 64.81 144 BF-PRINV 1 1 0 ALC 111/60/1.95 64.69 145 BF-ONE 0 1 0 ERC 101/78/1.86 64.55 146 TF-ENT 0 1 0 CLC 107/58/2.02 64.55 147 TF-ONE 1 0 1 ERC 102/67/1.97 64.53 148 BF-PRINV 0 1 1 ALC 111/67/1.87 64.52 149 BF-ENT 0 1 1 ALC 112/67/1.86 64.51 150 BF-ENT 1 1 0 SLC 101/73/1.91 64.47 151 BF-IDF 1 1 1 ERC 95/51/2.28 64.46 152 TF-NORM 1 0 0 CLC 114/67/1.84 64.45 153 BF-IDF 1 1 1 SLC 95/58/2.18 64.36 154 TF-PRINV 1 0 0 SLC 104/67/1.95 64.30 155 BF-ENT 1 0 1 ALC 110/66/1.89 64.30 156 BF-ONE 0 1 0 SLC 101/85/1.79 64.29 157 TF-GFIDF 1 1 
1 SLC 102/50/2.19 64.28 158 BF-IDF 0 1 0 ALC 109/73/1.83 64.27 159 BF-PRINV 1 1 1 ERC 96/51/2.27 64.26 160 TF-IDF 0 1 0 CLC 109/72/1.84 64.25 161 BF-ONE 1 0 1 CLC 109/73/1.83 64.24 162 TF-ENT 0 0 1 ERC 101/60/2.07 64.17 163 TF-GFIDF 0 0 1 ERC 101/60/2.07 64.17 164 TF-IDF 0 0 1 ERC 101/60/2.07 64.17 165 TF-NORM 0 0 1 ERC 101/60/2.07 64.17 166 TF-ONE 0 0 1 ERC 101/60/2.07 64.17 167 TF-PRINV 0 0 1 ERC 101/60/2.07 64.17

168 TF-ONE 1 1 1 SLC 97/59/2.13 64.15
169 BF-ENT 1 0 0 CLC 110/72/1.83 64.11
170 BF-PRINV 1 1 0 ERC 101/79/1.85 64.07
171 TF-ONE 1 1 1 ERC 99/91/1.75 64.05
172 BF-ENT 0 1 1 ERC 100/76/1.89 64.04
173 BF-PRINV 0 1 1 ERC 100/76/1.89 64.04
174 BF-PRINV 1 1 1 SLC 100/81/1.84 63.97
175 TF-GFIDF 1 0 0 ALC 116/62/1.87 63.90
176 TF-ONE 1 1 0 SLC 99/92/1.74 63.90
177 BF-IDF 1 0 0 CLC 109/74/1.82 63.87
178 BF-IDF 0 1 1 ERC 99/78/1.88 63.82
179 BF-ENT 0 0 1 ERC 100/67/1.99 63.80
180 BF-GFIDF 0 0 1 ERC 100/67/1.99 63.80
181 BF-IDF 0 0 1 ERC 100/67/1.99 63.80
182 BF-NORM 0 0 1 ERC 100/67/1.99 63.80
183 BF-ONE 0 0 1 ERC 100/67/1.99 63.80
184 BF-PRINV 0 0 1 ERC 100/67/1.99 63.80
185 TF-GFIDF 1 0 1 ERC 107/76/1.89 63.78
186 BF-ENT 1 1 0 ERC 100/69/1.97 63.76
187 BF-IDF 1 0 1 ALC 110/65/1.90 63.74
188 TF-ONE 1 0 1 SLC 102/69/1.95 63.72
189 TF-GFIDF 1 0 0 ERC 101/74/1.90 63.68
190 BF-PRINV 1 0 0 CLC 106/76/1.83 63.67
191 TF-NORM 1 0 0 ALC 112/68/1.85 63.67
192 TF-NORM 1 0 1 ALC 112/68/1.85 63.67
193 BF-ONE 1 0 0 CLC 110/72/1.83 63.65
194 BF-ONE 1 0 1 SLC 102/56/2.11 63.65
195 BF-ONE 0 1 1 CLC 89/108/1.69 63.64
196 BF-GFIDF 0 1 1 ERC 97/89/1.79 63.62
197 BF-GFIDF 0 1 1 SLC 97/89/1.79 63.62
198 BF-ONE 1 0 1 ALC 108/69/1.88 63.62
199 BF-ENT 1 0 0 ALC 113/64/1.88 63.59
200 TF-ONE 1 0 0 ERC 102/75/1.88 63.58
201 TF-ENT 0 1 1 ALC 107/88/1.71 63.52
202 BF-ONE 1 0 1 ERC 101/55/2.13 63.52
203 BF-GFIDF 1 0 0 CLC 108/77/1.80 63.51
204 BF-NORM 1 1 1 ERC 97/88/1.80 63.50
205 BF-NORM 1 1 1 SLC 97/88/1.80 63.50
206 TF-NORM 0 1 1 ALC 111/71/1.83 63.46
207 BF-GFIDF 1 1 0 ERC 96/85/1.84 63.46
208 BF-GFIDF 0 1 0 ALC 111/77/1.77 63.44
209 TF-ENT 0 0 0 ALC 112/68/1.85 63.43
210 TF-GFIDF 0 0 0 ALC 112/68/1.85 63.43
211 TF-IDF 0 0 0 ALC 112/68/1.85 63.43
212 TF-NORM 0 0 0 ALC 112/68/1.85 63.43
213 TF-ONE 0 0 0 ALC 112/68/1.85 63.43
214 TF-PRINV 0 0 0 ALC 112/68/1.85 63.43
215 BF-ONE 1 0 0 ALC 110/65/1.90 63.41
216 TF-NORM 1 0 1 ERC 100/60/2.08 63.33
217 TF-NORM 1 0 1 SLC 100/60/2.08 63.33
218 BF-IDF 1 1 0 SLC 100/81/1.84 63.33
219 BF-PRINV 1 1 0 SLC 100/81/1.84 63.33
220 BF-IDF 1 1 0 ERC 100/77/1.88 63.31
221 BF-ENT 0 0 1 ALC 108/75/1.82 63.30
222 BF-GFIDF 0 0 1 ALC 108/75/1.82 63.30
223 BF-IDF 0 0 1 ALC 108/75/1.82 63.30
224 BF-NORM 0 0 1 ALC 108/75/1.82 63.30
225 BF-ONE 0 0 1 ALC 108/75/1.82 63.30
226 BF-PRINV 0 0 1 ALC 108/75/1.82 63.30
227 BF-GFIDF 1 0 1 SLC 97/56/2.18 63.29
228 TF-ONE 1 1 0 ALC 122/56/1.87 63.27

229 BF-ENT 0 1 1 SLC 100/78/1.87 63.21
230 BF-PRINV 0 1 1 SLC 100/78/1.87 63.21
231 TF-ENT 0 1 0 ALC 107/87/1.72 63.18
232 TF-ONE 1 1 0 ERC 96/97/1.73 63.16
233 BF-NORM 1 1 1 CLC 99/85/1.81 63.14
234 BF-GFIDF 1 0 1 ERC 97/65/2.06 63.10
235 BF-PRINV 1 0 1 ALC 110/65/1.90 63.07
236 TF-PRINV 0 1 1 ALC 117/65/1.83 63.05
237 TF-GFIDF 1 0 1 SLC 100/78/1.87 63.05
238 BF-IDF 0 1 1 SLC 99/80/1.86 62.99
239 TF-GFIDF 1 0 0 SLC 101/76/1.88 62.95
240 BF-NORM 1 1 0 SLC 97/88/1.80 62.85
241 BF-NORM 0 1 0 ALC 111/72/1.82 62.83
242 TF-ONE 1 0 0 SLC 103/81/1.81 62.80
243 BF-GFIDF 1 0 1 ALC 110/70/1.85 62.76
244 BF-IDF 1 0 1 SLC 101/68/1.97 62.66
245 BF-PRINV 1 0 1 SLC 101/68/1.97 62.66
246 BF-ONE 0 1 0 CLC 109/79/1.77 62.63
247 BF-NORM 1 1 0 ERC 97/86/1.82 62.58
248 BF-ONE 1 0 0 SLC 98/63/2.07 62.57
249 TF-PRINV 0 1 0 CLC 108/74/1.83 62.52
250 BF-GFIDF 1 0 0 ALC 105/73/1.87 62.52
251 BF-ENT 1 0 1 SLC 101/70/1.95 62.52
252 TF-ENT 0 0 1 SLC 103/68/1.95 62.50
253 TF-GFIDF 0 0 1 SLC 103/68/1.95 62.50
254 TF-IDF 0 0 1 SLC 103/68/1.95 62.50
255 TF-NORM 0 0 1 SLC 103/68/1.95 62.50
256 TF-ONE 0 0 1 SLC 103/68/1.95 62.50
257 TF-PRINV 0 0 1 SLC 103/68/1.95 62.50
258 BF-ENT 0 1 0 ERC 98/76/1.91 62.46
259 BF-PRINV 0 1 0 ERC 98/76/1.91 62.46
260 BF-IDF 1 0 0 ALC 112/66/1.87 62.44
261 BF-NORM 1 1 0 CLC 103/78/1.84 62.35
262 BF-ONE 1 1 1 SBC 315/18/4.44 62.34
263 BF-PRINV 0 1 0 ALC 107/77/1.81 62.31
264 BF-IDF 0 1 0 ERC 97/78/1.90 62.25
265 TF-GFIDF 1 1 0 ALC 118/62/1.85 62.24
266 BF-ONE 1 0 0 ERC 97/60/2.12 62.23
267 BF-GFIDF 1 0 0 SLC 97/80/1.88 62.20
268 BF-ENT 0 0 0 ERC 98/66/2.03 62.14
269 BF-GFIDF 0 0 0 ERC 98/66/2.03 62.14
270 BF-IDF 0 0 0 ERC 98/66/2.03 62.14
271 BF-NORM 0 0 0 ERC 98/66/2.03 62.14
272 BF-ONE 0 0 0 ERC 98/66/2.03 62.14
273 BF-PRINV 0 0 0 ERC 98/66/2.03 62.14
274 BF-ENT 1 0 1 ERC 99/62/2.07 62.08
275 BF-IDF 1 0 1 ERC 99/62/2.07 62.08
276 BF-PRINV 1 0 1 ERC 99/62/2.07 62.08
277 BF-ENT 0 0 1 CLC 106/83/1.76 62.06
278 BF-GFIDF 0 0 1 CLC 106/83/1.76 62.06
279 BF-IDF 0 0 1 CLC 106/83/1.76 62.06
280 BF-NORM 0 0 1 CLC 106/83/1.76 62.06
281 BF-ONE 0 0 1 CLC 106/83/1.76 62.06
282 BF-PRINV 0 0 1 CLC 106/83/1.76 62.06
283 TF-ENT 1 1 1 SBC 322/11/4.85 62.05
284 TF-NORM 1 0 0 SLC 101/70/1.95 61.99
285 TF-IDF 0 1 1 ALC 115/73/1.77 61.98
286 BF-NORM 0 1 1 ERC 99/79/1.87 61.97
287 BF-PRINV 1 0 0 ALC 113/62/1.90 61.90
288 TF-IDF 1 1 1 SBC 322/11/4.88 61.90
289 BF-GFIDF 1 0 0 ERC 96/75/1.95 61.89

290 BF-ENT 0 1 0 SLC 98/80/1.87 61.86
291 BF-GFIDF 0 1 0 SLC 97/89/1.79 61.86
292 BF-PRINV 0 1 0 SLC 98/80/1.87 61.86
293 BF-ENT 0 1 0 ALC 107/83/1.75 61.84
294 TF-NORM 1 0 0 ERC 102/74/1.89 61.81
295 BF-NORM 1 1 1 ALC 107/83/1.75 61.79
296 BF-NORM 0 1 1 SLC 100/80/1.85 61.76
297 TF-GFIDF 1 1 1 ERC 92/99/1.74 61.74
298 BF-GFIDF 1 1 1 SBC 309/24/4.18 61.68
299 BF-IDF 0 1 0 SLC 97/82/1.86 61.64
300 BF-IDF 1 0 0 SLC 100/68/1.98 61.64
301 TF-NORM 0 1 1 SLC 98/66/2.03 61.60
302 BF-NORM 1 0 0 CLC 106/85/1.74 61.54
303 BF-ENT 0 0 1 SLC 102/82/1.81 61.51
304 BF-GFIDF 0 0 1 SLC 102/82/1.81 61.51
305 BF-IDF 0 0 1 SLC 102/82/1.81 61.51
306 BF-NORM 0 0 1 SLC 102/82/1.81 61.51
307 BF-ONE 0 0 1 SLC 102/82/1.81 61.51
308 BF-PRINV 0 0 1 SLC 102/82/1.81 61.51
309 BF-ENT 1 0 0 SLC 100/70/1.96 61.49
310 BF-PRINV 1 0 0 SLC 100/70/1.96 61.49
311 BF-ENT 1 1 1 SBC 322/11/4.72 61.42
312 BF-NORM 0 1 0 CLC 107/74/1.84 61.38
313 TF-GFIDF 0 1 0 CLC 108/77/1.80 61.36
314 BF-NORM 1 0 1 CLC 104/85/1.76 61.35
315 TF-NORM 0 1 1 ERC 98/65/2.04 61.28
316 BF-IDF 1 1 1 SBC 322/11/4.57 61.27
317 BF-NORM 1 1 0 ALC 102/94/1.70 61.12
318 BF-NORM 1 0 0 ALC 109/81/1.75 61.04
319 TF-IDF 0 1 0 ALC 115/72/1.78 60.93
320 BF-ENT 1 0 0 ERC 98/60/2.11 60.92
321 BF-IDF 1 0 0 ERC 98/60/2.11 60.92
322 BF-PRINV 1 0 0 ERC 98/60/2.11 60.92
323 TF-NORM 0 1 0 ALC 113/70/1.82 60.87
324 BF-NORM 1 0 1 ERC 96/79/1.90 60.86
325 BF-NORM 1 0 1 SLC 96/79/1.90 60.86
326 BF-NORM 1 0 1 ALC 105/76/1.84 60.85
327 BF-PRINV 1 1 1 SBC 322/11/4.62 60.80
328 TF-PRINV 1 1 1 SBC 324/9/4.94 60.69
329 BF-ENT 0 0 0 ALC 108/83/1.74 60.68
330 BF-GFIDF 0 0 0 ALC 108/83/1.74 60.68
331 BF-IDF 0 0 0 ALC 108/83/1.74 60.68
332 BF-NORM 0 0 0 ALC 108/83/1.74 60.68
333 BF-ONE 0 0 0 ALC 108/83/1.74 60.68
334 BF-PRINV 0 0 0 ALC 108/83/1.74 60.68
335 TF-PRINV 0 1 0 ALC 107/90/1.69 60.55
336 BF-NORM 0 1 0 ERC 97/79/1.89 60.46
337 BF-NORM 0 1 0 SLC 98/82/1.85 60.44
338 BF-GFIDF 0 1 0 ERC 93/102/1.71 60.40
339 BF-ONE 1 1 0 SBC 315/18/4.42 60.40
340 BF-ENT 0 0 0 CLC 105/87/1.73 60.22
341 BF-GFIDF 0 0 0 CLC 105/87/1.73 60.22
342 BF-IDF 0 0 0 CLC 105/87/1.73 60.22
343 BF-NORM 0 0 0 CLC 105/87/1.73 60.22
344 BF-ONE 0 0 0 CLC 105/87/1.73 60.22
345 BF-PRINV 0 0 0 CLC 105/87/1.73 60.22
346 TF-NORM 0 1 0 ERC 96/99/1.71 60.21
347 BF-NORM 1 0 0 ERC 98/87/1.80 60.13
348 BF-NORM 1 0 0 SLC 98/87/1.80 60.13
349 TF-ONE 1 1 1 SBC 320/13/4.72 60.12
350 BF-GFIDF 1 1 0 SBC 307/26/4.05 60.08

351 TF-ENT 1 1 0 SBC 323/10/4.61 60.07
352 TF-NORM 0 1 0 SLC 96/102/1.68 60.00
353 TF-IDF 0 1 1 SLC 97/67/2.03 59.85
354 TF-IDF 1 1 0 SBC 324/9/4.73 59.69
355 TF-ENT 0 1 1 SLC 96/87/1.82 59.67
356 TF-ENT 0 1 0 SLC 97/86/1.82 59.63
357 BF-ONE 0 1 1 SBC 313/20/4.39 59.47
358 BF-ENT 0 0 0 SLC 100/85/1.80 59.45
359 BF-GFIDF 0 0 0 SLC 100/85/1.80 59.45
360 BF-IDF 0 0 0 SLC 100/85/1.80 59.45
361 BF-NORM 0 0 0 SLC 100/85/1.80 59.45
362 BF-ONE 0 0 0 SLC 100/85/1.80 59.45
363 BF-PRINV 0 0 0 SLC 100/85/1.80 59.45
364 BF-ENT 1 1 0 SBC 322/11/4.36 59.43
365 BF-IDF 1 1 0 SBC 321/12/4.32 59.34
366 TF-PRINV 0 1 1 SLC 96/69/2.02 59.08
367 BF-PRINV 1 1 0 SBC 322/11/4.34 59.05
368 TF-ENT 0 0 0 ERC 100/80/1.85 59.01
369 TF-GFIDF 0 0 0 ERC 100/80/1.85 59.01
370 TF-IDF 0 0 0 ERC 100/80/1.85 59.01
371 TF-NORM 0 0 0 ERC 100/80/1.85 59.01
372 TF-ONE 0 0 0 ERC 100/80/1.85 59.01
373 TF-PRINV 0 0 0 ERC 100/80/1.85 59.01
374 TF-ENT 0 0 0 SLC 104/69/1.92 58.96
375 TF-GFIDF 0 0 0 SLC 104/69/1.92 58.96
376 TF-IDF 0 0 0 SLC 104/69/1.92 58.96
377 TF-NORM 0 0 0 SLC 104/69/1.92 58.96
378 TF-ONE 0 0 0 SLC 104/69/1.92 58.96
379 TF-PRINV 0 0 0 SLC 104/69/1.92 58.96
380 BF-GFIDF 0 1 1 SBC 303/30/4.13 58.91
381 TF-ENT 0 1 1 ERC 95/79/1.91 58.91
382 TF-PRINV 1 1 0 SBC 324/9/4.73 58.68
383 TF-GFIDF 1 1 0 SLC 92/98/1.75 58.65
384 BF-GFIDF 1 0 1 SBC 317/16/4.43 58.59
385 TF-IDF 0 1 0 SLC 95/86/1.84 58.58
386 TF-IDF 0 1 1 ERC 94/83/1.88 58.53
387 TF-IDF 1 0 1 SBC 324/9/4.66 58.51
388 TF-PRINV 0 1 0 SLC 95/94/1.76 58.42
389 TF-ENT 1 0 1 SBC 324/9/4.68 58.37
390 TF-ONE 0 1 1 ALC 108/93/1.66 58.27
391 BF-ENT 1 0 1 SBC 315/18/4.24 58.20
392 TF-PRINV 1 0 1 SBC 325/8/4.68 58.19
393 TF-PRINV 0 1 1 ERC 91/102/1.73 58.18
394 BF-IDF 1 0 1 SBC 315/18/4.24 58.18
395 TF-GFIDF 1 1 1 SBC 317/16/4.70 58.17
396 BF-ONE 1 0 1 SBC 319/14/4.54 58.07
397 TF-GFIDF 1 1 0 ERC 88/106/1.72 58.05
398 BF-PRINV 1 0 1 SBC 315/18/4.23 57.92
399 BF-ENT 0 1 1 SBC 314/19/4.39 57.76
400 TF-ONE 1 1 0 SBC 319/14/4.62 57.73
401 TF-ONE 0 1 0 ALC 108/92/1.67 57.68
402 TF-ONE 1 0 1 SBC 324/9/4.59 57.56
403 TF-IDF 0 1 0 ERC 89/107/1.70 57.54
404 TF-ONE 0 1 1 SLC 92/106/1.68 57.54
405 BF-IDF 0 1 1 SBC 314/19/4.50 57.49
406 BF-ONE 0 1 0 SBC 307/26/4.01 57.37
407 BF-GFIDF 1 0 0 SBC 319/14/4.32 57.28
408 TF-PRINV 0 1 0 ERC 89/108/1.69 57.24
409 TF-ENT 0 1 0 ERC 91/99/1.75 57.10
410 TF-ONE 0 1 1 ERC 90/113/1.64 57.04
411 TF-IDF 1 0 0 SBC 326/7/4.61 56.96

412 BF-PRINV 0 1 1 SBC 318/15/4.68 56.92
413 TF-GFIDF 1 0 1 SBC 320/13/4.43 56.89
414 TF-ENT 1 0 0 SBC 326/7/4.62 56.87
415 BF-ENT 1 0 0 SBC 314/19/4.08 56.76
416 BF-IDF 1 0 0 SBC 314/19/4.08 56.76
417 TF-GFIDF 0 1 1 ALC 114/75/1.76 56.73
418 BF-ONE 1 0 0 SBC 319/14/4.29 56.64
419 BF-ENT 0 0 1 SBC 308/25/4.11 56.61
420 BF-GFIDF 0 0 1 SBC 308/25/4.11 56.61
421 BF-IDF 0 0 1 SBC 308/25/4.11 56.61
422 BF-NORM 0 0 1 SBC 308/25/4.11 56.61
423 BF-ONE 0 0 1 SBC 308/25/4.11 56.61
424 BF-PRINV 0 0 1 SBC 308/25/4.11 56.61
425 TF-PRINV 1 0 0 SBC 327/6/4.59 56.58
426 BF-PRINV 1 0 0 SBC 314/19/4.11 56.43
427 TF-ONE 0 1 0 SLC 91/114/1.62 56.36
428 BF-GFIDF 0 1 0 SBC 292/41/3.93 56.32
429 TF-ENT 0 1 1 SBC 321/12/4.83 56.23
430 TF-ONE 1 0 0 SBC 326/7/4.57 56.18
431 BF-NORM 0 1 1 SBC 301/32/4.10 56.13
432 TF-IDF 0 1 1 SBC 320/13/4.78 55.87
433 BF-ENT 0 1 0 SBC 305/28/4.08 55.86
434 TF-NORM 1 1 1 SBC 303/30/4.04 55.86
435 TF-GFIDF 1 0 0 SBC 322/11/4.47 55.67
436 BF-IDF 0 1 0 SBC 303/30/4.09 55.66
437 TF-PRINV 0 1 1 SBC 320/13/4.73 55.47
438 BF-NORM 1 1 1 SBC 284/49/3.61 55.34
439 BF-PRINV 0 1 0 SBC 302/31/4.01 55.27
440 BF-ENT 0 0 0 SBC 312/21/4.20 55.17
441 BF-GFIDF 0 0 0 SBC 312/21/4.20 55.17
442 BF-IDF 0 0 0 SBC 312/21/4.20 55.17
443 BF-NORM 0 0 0 SBC 312/21/4.20 55.17
444 BF-ONE 0 0 0 SBC 312/21/4.20 55.17
445 BF-PRINV 0 0 0 SBC 312/21/4.20 55.17
446 TF-ENT 0 0 1 SBC 319/14/4.51 55.03
447 TF-GFIDF 0 0 1 SBC 319/14/4.51 55.03
448 TF-IDF 0 0 1 SBC 319/14/4.51 55.03
449 TF-NORM 0 0 1 SBC 319/14/4.51 55.03
450 TF-ONE 0 0 1 SBC 319/14/4.51 55.03
451 TF-PRINV 0 0 1 SBC 319/14/4.51 55.03
452 TF-GFIDF 1 1 0 SBC 312/21/4.80 54.84
453 TF-NORM 1 1 0 SBC 303/30/3.92 54.81
454 TF-NORM 1 0 1 SBC 311/22/4.19 54.80
455 BF-NORM 1 0 1 SBC 291/42/3.74 54.78
456 TF-NORM 0 1 1 SBC 309/24/4.35 54.71
457 BF-NORM 0 1 0 SBC 288/45/3.80 54.48
458 TF-ONE 0 1 1 SBC 307/26/4.48 54.25
459 TF-ONE 0 1 0 ERC 89/111/1.67 54.24
460 BF-NORM 1 1 0 SBC 285/48/3.58 54.18
461 TF-ENT 0 1 0 SBC 317/16/4.47 53.81
462 TF-GFIDF 0 1 1 ERC 76/136/1.57 53.81
463 TF-GFIDF 0 1 1 SLC 76/136/1.57 53.81
464 BF-NORM 1 0 0 SBC 298/35/3.72 53.67
465 TF-NORM 1 0 0 SBC 310/23/4.08 53.64
466 TF-IDF 0 1 0 SBC 316/17/4.44 53.60
467 TF-NORM 0 1 0 SBC 304/29/4.20 53.36
468 TF-ENT 0 0 0 SBC 323/10/4.45 53.31
469 TF-GFIDF 0 0 0 SBC 323/10/4.45 53.31
470 TF-IDF 0 0 0 SBC 323/10/4.45 53.31
471 TF-NORM 0 0 0 SBC 323/10/4.45 53.31
472 TF-ONE 0 0 0 SBC 323/10/4.45 53.31

473 TF-PRINV 0 0 0 SBC 323/10/4.45 53.31
474 TF-PRINV 0 1 0 SBC 316/17/4.50 53.28
475 TF-ONE 0 1 0 SBC 299/34/4.27 51.84
476 TF-GFIDF 0 1 0 ALC 101/109/1.59 51.72
477 TF-GFIDF 0 1 1 SBC 284/49/4.39 49.65
478 TF-GFIDF 0 1 0 SLC 74/149/1.49 48.73
479 TF-GFIDF 0 1 0 SBC 289/44/4.65 45.83
480 TF-GFIDF 0 1 0 ERC 71/144/1.55 44.07

Appendix C

The Second Framework: Experimental Results in DFB
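In this framework, the COC column lists the association measure used to build the co-occurrence matrix. As a hedged illustration only (not the thesis's actual implementation), and assuming that SUPP, CONF, and LIFT denote the standard support, confidence, and lift measures computed from document co-occurrence counts, the values for a term pair (a, b) can be sketched as follows; the function name `association_measures` is hypothetical:

```python
def association_measures(n_a, n_b, n_ab, n_docs):
    """Hedged sketch of standard association measures for a term pair.

    n_a, n_b  -- number of documents containing term a / term b
    n_ab      -- number of documents containing both terms
    n_docs    -- total number of documents in the collection
    """
    p_a, p_b, p_ab = n_a / n_docs, n_b / n_docs, n_ab / n_docs
    support = p_ab                                      # P(a, b)
    confidence = n_ab / n_a if n_a else 0.0             # P(b | a)
    lift = p_ab / (p_a * p_b) if p_a and p_b else 0.0   # P(a, b) / (P(a) P(b))
    return support, confidence, lift

# Illustrative counts only: terms co-occurring in 10 of 1,000 articles.
sup, conf, lift = association_measures(n_a=40, n_b=25, n_ab=10, n_docs=1000)
```

Each measure scores how strongly two name strings tend to appear in the same articles; the resulting matrix then replaces the raw co-occurrence counts before clustering.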

Rank | Weighting | NOR | COC | Clustering | Cluster Characteristic¹ | F

1 TF-ONE 0 SUPP CLC 56/37/2.43 98.75
2 TF-ONE 1 SUPP CLC 57/36/2.43 98.21
3 TF-ENT 0 MULP CLC 55/43/2.31 97.80
4 TF-IDF 0 MULP CLC 56/40/2.35 97.63
5 BF-GFIDF 1 LEVE CLC 56/37/2.43 96.98
6 TF-ENT 0 SUPP CLC 56/38/2.40 96.95
7 TF-PRINV 0 SUPP CLC 56/38/2.40 96.95
8 BF-GFIDF 0 KLOS CLC 56/38/2.40 96.95
9 TF-ENT 1 MULP CLC 57/38/2.38 96.91
10 TF-PRINV 1 CONF CLC 55/43/2.31 96.90
11 BF-GFIDF 1 KLOS CLC 55/43/2.31 96.90
12 TF-PRINV 1 KLOS CLC 55/40/2.38 96.77
13 TF-ONE 0 KLOS CLC 54/43/2.33 96.59
14 TF-IDF 1 CONF CLC 55/43/2.31 96.54
15 TF-PRINV 1 LEVE CLC 55/43/2.31 96.54
16 TF-IDF 1 MULP CLC 55/42/2.33 96.38
17 TF-IDF 0 CONF CLC 55/42/2.33 96.38
18 TF-PRINV 1 MULP CLC 56/41/2.33 96.36
19 TF-IDF 1 LEVE CLC 54/46/2.26 96.34
20 BF-ONE 1 MULP CLC 55/44/2.28 96.32
21 TF-GFIDF 0 LEVE CLC 54/44/2.31 96.19
22 TF-GFIDF 0 MULP CLC 55/42/2.33 96.17
23 TF-PRINV 1 SUPP CLC 56/42/2.31 96.15
24 TF-IDF 1 KLOS CLC 56/41/2.33 96.03
25 TF-NORM 1 MULP CLC 56/43/2.28 95.96
26 TF-ONE 1 CONF CLC 55/40/2.38 95.90
27 TF-ONE 1 MULP CLC 55/44/2.28 95.80
28 BF-IDF 1 MULP CLC 53/48/2.24 95.76
29 TF-ENT 0 CONF CLC 55/43/2.31 95.64
30 TF-PRINV 0 LEVE CLC 55/43/2.31 95.64
31 TF-ENT 1 LEVE CLC 55/43/2.31 95.64
32 TF-IDF 0 SUPP CLC 56/40/2.35 95.62
33 TF-ONE 0 MULP CLC 55/43/2.31 95.60
34 TF-IDF 1 SUPP CLC 58/36/2.40 95.50
35 TF-ENT 1 SUPP CLC 58/37/2.38 95.48
36 TF-PRINV 0 KLOS CLC 56/40/2.35 95.48
37 TF-IDF 0 KLOS CLC 56/41/2.33 95.46
38 TF-GFIDF 1 KLOS CLC 56/41/2.33 95.46
39 TF-ENT 0 LEVE CLC 55/43/2.31 95.43
40 BF-GFIDF 1 CONF CLC 56/37/2.43 95.39
41 BF-ONE 1 KLOS CLC 54/49/2.19 95.33
42 TF-ENT 1 CONF CLC 55/43/2.31 95.27
43 TF-GFIDF 1 MULP CLC 56/41/2.33 95.26
44 TF-PRINV 0 MULP CLC 54/45/2.28 95.24
45 TF-PRINV 0 CONF CLC 54/45/2.28 95.24

46 TF-ONE 1 KLOS CLC 56/39/2.38 95.00
47 TF-ONE 1 LEVE CLC 55/42/2.33 94.96
48 BF-ONE 0 KLOS CLC 54/48/2.22 94.93
49 TF-ENT 1 KLOS CLC 54/46/2.26 94.91
50 TF-IDF 0 LEVE CLC 55/43/2.31 94.91
51 TF-ENT 0 MULP ZLC 56/40/2.35 94.89
52 TF-ENT 1 MULP ZLC 59/37/2.35 94.89
53 TF-GFIDF 0 KLOS CLC 56/42/2.31 94.89
54 TF-ONE 0 CONF CLC 54/44/2.31 94.77
55 TF-ONE 0 LEVE CLC 54/44/2.31 94.77
56 TF-GFIDF 1 MULP ZLC 58/42/2.26 94.62
57 TF-ONE 1 KLOS SLC 56/38/2.40 94.58
58 TF-ENT 0 KLOS CLC 56/43/2.28 94.51
59 BF-PRINV 1 MULP CLC 53/50/2.19 94.42
60 TF-GFIDF 0 KLOS ZLC 59/40/2.28 94.42
61 BF-IDF 0 MULP CLC 52/54/2.13 94.12
62 TF-ONE 0 KLOS SLC 56/37/2.43 94.08
63 TF-GFIDF 0 KLOS SLC 55/45/2.26 94.05
64 TF-ONE 1 MULP ZLC 58/43/2.24 94.01
65 TF-IDF 1 SUPP SLC 56/40/2.35 93.97
66 TF-ENT 0 KLOS SLC 56/36/2.46 93.91
67 BF-ONE 1 CONF CLC 52/52/2.17 93.88
68 BF-ONE 1 LEVE CLC 52/52/2.17 93.88
69 TF-ENT 1 SUPP SLC 56/41/2.33 93.75
70 TF-GFIDF 1 KLOS SLC 56/42/2.31 93.73
71 TF-GFIDF 0 SUPP CLC 53/52/2.15 93.66
72 TF-ENT 1 CONF SLC 56/40/2.35 93.58
73 TF-IDF 1 CONF SLC 56/40/2.35 93.58
74 TF-ENT 1 LEVE SLC 56/40/2.35 93.58
75 TF-IDF 1 LEVE SLC 56/40/2.35 93.58
76 TF-PRINV 1 LEVE SLC 56/40/2.35 93.58
77 TF-ENT 1 KLOS SLC 56/35/2.48 93.57
78 TF-IDF 1 KLOS SLC 56/35/2.48 93.57
79 TF-GFIDF 1 MULP SLC 56/45/2.24 93.51
80 TF-GFIDF 1 KLOS ZLC 59/40/2.28 93.51
81 BF-GFIDF 1 KLOS SLC 56/39/2.38 93.48
82 TF-ENT 1 MULP SLC 56/39/2.38 93.45
83 TF-IDF 1 MULP SLC 56/39/2.38 93.45
84 BF-ENT 1 MULP CLC 55/47/2.22 93.43
85 TF-ONE 0 KLOS ZLC 59/38/2.33 93.38
86 TF-PRINV 1 KLOS ZLC 57/42/2.28 93.38
87 TF-IDF 1 KLOS ZLC 57/42/2.28 93.33
88 TF-ONE 1 KLOS ZLC 60/38/2.31 93.33
89 BF-PRINV 0 MULP CLC 53/53/2.13 93.31
90 BF-ENT 1 CONF CLC 53/46/2.28 93.26
91 TF-PRINV 1 CONF SLC 56/39/2.38 93.24
92 TF-IDF 0 KLOS SLC 56/39/2.38 93.24
93 TF-PRINV 1 KLOS SLC 56/39/2.38 93.24
94 TF-ENT 0 KLOS ZLC 60/41/2.24 93.23
95 BF-GFIDF 1 MULP CLC 55/49/2.17 93.21
96 TF-ENT 0 SUPP SLC 55/43/2.31 93.19
97 TF-GFIDF 1 LEVE CLC 54/44/2.31 93.17
98 TF-ENT 1 KLOS ZLC 59/39/2.31 93.16
99 TF-IDF 1 LEVE ZLC 58/43/2.24 93.11
100 TF-IDF 1 MULP ZLC 58/43/2.24 93.08
101 TF-IDF 0 MULP ZLC 57/41/2.31 92.96
102 TF-PRINV 0 SUPP SLC 55/40/2.38 92.95
103 BF-GFIDF 0 LEVE CLC 57/42/2.28 92.94
104 TF-PRINV 0 KLOS SLC 56/38/2.40 92.90
105 TF-PRINV 1 MULP SLC 56/37/2.43 92.78
106 TF-ENT 0 CONF SLC 56/42/2.31 92.74

107 TF-IDF 0 CONF SLC 56/42/2.31 92.74
108 TF-PRINV 0 CONF SLC 56/42/2.31 92.74
109 TF-ENT 0 LEVE SLC 56/42/2.31 92.74
110 TF-IDF 0 LEVE SLC 56/42/2.31 92.74
111 TF-PRINV 0 LEVE SLC 56/42/2.31 92.74
112 BF-GFIDF 0 KLOS SLC 56/36/2.46 92.69
113 TF-GFIDF 1 SUPP CLC 55/50/2.15 92.68
114 TF-IDF 0 SUPP SLC 55/42/2.33 92.67
115 TF-GFIDF 0 MULP ZLC 60/37/2.33 92.62
116 BF-GFIDF 1 SUPP CLC 55/48/2.19 92.60
117 TF-PRINV 1 SUPP SLC 56/41/2.33 92.59
118 BF-GFIDF 1 KLOS ZLC 60/40/2.26 92.54
119 TF-IDF 1 SUPP ZLC 59/37/2.35 92.50
120 BF-GFIDF 0 CONF CLC 55/45/2.26 92.50
121 TF-IDF 1 CONF ZLC 59/40/2.28 92.39
122 BF-ONE 0 MULP CLC 54/51/2.15 92.37
123 BF-ONE 0 SUPP CLC 54/51/2.15 92.37
124 TF-IDF 0 KLOS ZLC 57/44/2.24 92.34
125 TF-IDF 0 SUPP ZLC 57/40/2.33 92.31
126 TF-GFIDF 0 CONF CLC 54/50/2.17 92.31
127 TF-PRINV 0 SUPP ZLC 58/43/2.24 92.22
128 TF-ENT 1 SUPP ZLC 60/38/2.31 92.22
129 TF-PRINV 1 MULP ZLC 59/37/2.35 92.14
130 BF-ENT 1 LEVE CLC 52/51/2.19 92.05
131 TF-ENT 0 MULP SLC 56/39/2.38 91.97
132 BF-GFIDF 0 MULP CLC 56/45/2.24 91.93
133 TF-ONE 1 SUPP SLC 56/46/2.22 91.93
134 TF-ONE 1 MULP SLC 56/38/2.40 91.92
135 BF-ENT 0 MULP CLC 52/57/2.07 91.89
136 TF-NORM 0 MULP CLC 52/57/2.07 91.89
137 TF-ONE 0 MULP ZLC 62/40/2.22 91.84
138 TF-NORM 1 LEVE CLC 54/49/2.19 91.70
139 BF-ENT 0 LEVE CLC 54/47/2.24 91.62
140 BF-IDF 1 KLOS CLC 55/51/2.13 91.60
141 BF-IDF 0 SUPP CLC 53/54/2.11 91.57
142 BF-IDF 1 SUPP CLC 54/52/2.13 91.57
143 TF-ENT 0 SUPP ZLC 60/41/2.24 91.56
144 TF-PRINV 0 MULP ZLC 58/39/2.33 91.51
145 BF-IDF 0 KLOS CLC 51/59/2.05 91.40
146 BF-IDF 1 LEVE CLC 53/54/2.11 91.40
147 TF-ONE 0 SUPP SLC 56/47/2.19 91.32
148 TF-ENT 1 LEVE ZLC 60/42/2.22 91.32
149 BF-GFIDF 0 KLOS ZLC 60/38/2.31 91.31
150 BF-ONE 1 SUPP CLC 54/53/2.11 91.30
151 TF-IDF 0 MULP SLC 55/40/2.38 91.27
152 BF-IDF 1 CONF CLC 52/56/2.09 91.19
153 TF-PRINV 1 CONF ZLC 60/41/2.24 91.18
154 TF-NORM 1 CONF CLC 55/51/2.13 91.15
155 BF-PRINV 1 KLOS CLC 54/47/2.24 91.08
156 TF-GFIDF 1 CONF CLC 55/47/2.22 90.98
157 TF-PRINV 0 MULP SLC 55/43/2.31 90.88
158 TF-GFIDF 0 MULP SLC 55/41/2.35 90.81
159 BF-PRINV 1 SUPP CLC 54/52/2.13 90.70
160 TF-ONE 0 MULP SLC 56/47/2.19 90.57
161 TF-ENT 1 CONF ZLC 60/42/2.22 90.57
162 TF-NORM 0 CONF CLC 52/55/2.11 90.29
163 TF-PRINV 1 LEVE ZLC 60/42/2.22 90.29
164 TF-NORM 0 LEVE CLC 54/54/2.09 90.14
165 TF-GFIDF 1 LEVE SLC 56/45/2.24 90.09
166 TF-ENT 0 CONF ZLC 62/42/2.17 90.08
167 TF-NORM 1 KLOS CLC 58/47/2.15 90.06

168 TF-GFIDF 1 CONF SLC 55/46/2.24 89.93
169 BF-ENT 1 SUPP CLC 51/60/2.04 89.92
170 TF-PRINV 0 KLOS ZLC 58/42/2.26 89.90
171 TF-PRINV 1 SUPP ZLC 60/41/2.24 89.89
172 TF-IDF 0 CONF ZLC 62/42/2.17 89.87
173 TF-PRINV 0 LEVE ZLC 62/42/2.17 89.87
174 BF-ONE 1 KLOS ZLC 59/39/2.31 89.80
175 BF-PRINV 0 SUPP CLC 54/55/2.07 89.76
176 BF-ENT 0 SUPP CLC 51/59/2.05 89.75
177 TF-GFIDF 1 SUPP ZLC 60/47/2.11 89.75
178 BF-GFIDF 0 SUPP CLC 56/48/2.17 89.71
179 TF-NORM 0 SUPP CLC 56/49/2.15 89.71
180 TF-IDF 0 LEVE ZLC 61/43/2.17 89.69
181 BF-GFIDF 1 MULP SLC 55/45/2.26 89.59
182 TF-ONE 0 SUPP ZLC 62/42/2.17 89.48
183 TF-PRINV 0 CONF ZLC 62/42/2.17 89.48
184 BF-PRINV 1 LEVE CLC 52/59/2.04 89.28
185 TF-ENT 0 LEVE ZLC 61/44/2.15 89.27
186 BF-IDF 0 LEVE CLC 51/58/2.07 89.23
187 TF-NORM 1 SUPP CLC 56/50/2.13 89.19
188 BF-PRINV 0 KLOS CLC 53/57/2.05 89.15
189 BF-GFIDF 1 SUPP SLC 56/43/2.28 89.13
190 BF-GFIDF 1 MULP ZLC 61/41/2.22 89.06
191 TF-ONE 0 LEVE SLC 56/48/2.17 89.06
192 TF-ONE 1 CONF SLC 56/50/2.13 88.89
193 TF-NORM 0 KLOS CLC 57/46/2.19 88.89
194 TF-ONE 1 LEVE SLC 56/45/2.24 88.89
195 TF-ONE 0 CONF SLC 55/50/2.15 88.85
196 TF-ONE 1 SUPP ZLC 64/41/2.15 88.80
197 BF-ENT 0 KLOS CLC 48/64/2.02 88.63
198 TF-GFIDF 0 LEVE SLC 56/47/2.19 88.59
199 TF-GFIDF 0 SUPP ZLC 64/43/2.11 88.33
200 TF-PRINV 1 LIFT CLC 56/50/2.13 88.19
201 TF-IDF 1 LIFT CLC 55/53/2.09 88.14
202 BF-ONE 0 CONF CLC 56/51/2.11 88.10
203 TF-PRINV 0 LIFT CLC 55/51/2.13 88.06
204 TF-GFIDF 0 CONF SLC 55/48/2.19 88.05
205 BF-ONE 0 LEVE CLC 55/53/2.09 87.87
206 BF-GFIDF 1 CONF SLC 54/46/2.26 87.85
207 BF-GFIDF 1 LEVE SLC 54/46/2.26 87.85
208 BF-ONE 1 MULP SLC 56/40/2.35 87.84
209 BF-PRINV 0 CONF CLC 52/59/2.04 87.80
210 BF-ONE 1 KLOS SLC 56/43/2.28 87.80
211 BF-GFIDF 0 SUPP SLC 56/46/2.22 87.48
212 BF-GFIDF 0 MULP SLC 55/47/2.22 87.43
213 BF-ENT 1 MULP ZLC 58/41/2.28 87.41
214 TF-GFIDF 0 SUPP SLC 57/46/2.19 87.31
215 BF-ONE 0 KLOS SLC 56/42/2.31 87.31
216 BF-ONE 0 KLOS ZLC 59/41/2.26 87.31
217 BF-PRINV 1 CONF CLC 52/62/1.98 87.30
218 TF-IDF 0 LIFT CLC 56/50/2.13 87.23
219 BF-IDF 0 CONF CLC 55/54/2.07 87.03
220 TF-NORM 1 LIFT CLC 57/50/2.11 87.03
221 TF-GFIDF 1 SUPP SLC 57/45/2.22 86.97
222 TF-ENT 0 LIFT CLC 55/53/2.09 86.96
223 BF-ENT 0 CONF CLC 52/57/2.07 86.94
224 BF-ONE 1 SUPP SLC 56/44/2.26 86.83
225 BF-GFIDF 0 SUPP ZLC 66/39/2.15 86.82
226 BF-GFIDF 0 CONF SLC 54/47/2.24 86.79
227 BF-GFIDF 0 LEVE SLC 54/47/2.24 86.79
228 BF-ENT 1 KLOS CLC 54/53/2.11 86.77

229 BF-GFIDF 0 MULP ZLC 62/40/2.22 86.71
230 BF-ENT 1 LEVE SLC 56/45/2.24 86.64
231 BF-ENT 1 SUPP ZLC 59/46/2.15 86.59
232 TF-NORM 1 KLOS SLC 56/50/2.13 86.39
233 BF-PRINV 1 CONF ZLC 58/44/2.22 86.32
234 TF-ENT 1 LIFT CLC 54/54/2.09 86.22
235 BF-ENT 1 KLOS SLC 56/41/2.33 86.20
236 BF-IDF 1 KLOS SLC 56/41/2.33 86.20
237 BF-ONE 1 NONE CLC 54/57/2.04 86.18
238 TF-NORM 0 KLOS SLC 56/50/2.13 86.17
239 TF-NORM 0 LIFT CLC 57/51/2.09 86.17
240 BF-ENT 1 CONF SLC 56/44/2.26 86.15
241 BF-IDF 1 CONF SLC 56/44/2.26 86.15
242 BF-IDF 1 LEVE SLC 56/44/2.26 86.15
243 TF-ENT 0 NONE CLC 55/56/2.04 86.12
244 TF-GFIDF 0 NONE CLC 55/56/2.04 86.12
245 TF-IDF 0 NONE CLC 55/56/2.04 86.12
246 TF-NORM 0 NONE CLC 55/56/2.04 86.12
247 TF-ONE 0 NONE CLC 55/56/2.04 86.12
248 TF-PRINV 0 NONE CLC 55/56/2.04 86.12
249 TF-ONE 1 NONE CLC 55/56/2.04 86.12
250 BF-ONE 0 MULP SLC 56/39/2.38 85.71
251 BF-ONE 0 SUPP SLC 56/39/2.38 85.71
252 TF-IDF 1 NONE CLC 55/57/2.02 85.66
253 TF-NORM 0 MULP ZLC 57/47/2.17 85.66
254 TF-GFIDF 1 CONF ZLC 63/39/2.22 85.60
255 BF-PRINV 1 KLOS SLC 56/34/2.51 85.56
256 TF-NORM 1 CONF SLC 56/48/2.17 85.55
257 TF-NORM 1 LEVE SLC 56/48/2.17 85.55
258 TF-NORM 1 CONF ZLC 58/49/2.11 85.54
259 BF-PRINV 0 KLOS SLC 56/29/2.66 85.51
260 TF-NORM 1 KLOS ZLC 57/46/2.19 85.44
261 BF-ENT 1 NONE CLC 53/59/2.02 85.43
262 TF-NORM 1 MULP ZLC 57/49/2.13 85.43
263 BF-ONE 1 LIFT CLC 54/59/2.00 85.43
264 BF-IDF 1 SUPP ZLC 59/43/2.22 85.39
265 BF-ENT 0 KLOS SLC 56/31/2.60 85.36
266 TF-ONE 0 CONF ZLC 63/38/2.24 85.33
267 TF-NORM 1 LEVE ZLC 59/46/2.15 85.28
268 TF-ENT 1 NONE CLC 55/57/2.02 85.25
269 TF-GFIDF 0 CONF ZLC 62/41/2.19 85.21
270 BF-NORM 0 MULP CLC 54/56/2.05 85.20
271 BF-IDF 1 NONE CLC 52/61/2.00 85.19
272 BF-PRINV 0 LEVE CLC 55/57/2.02 85.19
273 BF-ONE 1 MULP ZLC 60/44/2.17 85.10
274 TF-NORM 0 CONF SLC 56/47/2.19 84.99
275 TF-NORM 0 LEVE SLC 56/47/2.19 84.99
276 BF-IDF 0 KLOS SLC 56/32/2.57 84.96
277 BF-PRINV 1 CONF SLC 56/43/2.28 84.95
278 BF-PRINV 1 LEVE SLC 56/43/2.28 84.95
279 TF-ONE 0 LEVE ZLC 63/38/2.24 84.94
280 BF-GFIDF 1 SUPP ZLC 64/36/2.26 84.91
281 TF-ENT 0 NONE SLC 55/49/2.17 84.84
282 TF-GFIDF 0 NONE SLC 55/49/2.17 84.84
283 TF-IDF 0 NONE SLC 55/49/2.17 84.84
284 TF-NORM 0 NONE SLC 55/49/2.17 84.84
285 TF-ONE 0 NONE SLC 55/49/2.17 84.84
286 TF-PRINV 0 NONE SLC 55/49/2.17 84.84
287 BF-ENT 0 LEVE SLC 56/39/2.38 84.81
288 TF-NORM 1 MULP SLC 56/40/2.35 84.80
289 TF-NORM 1 LIFT ZLC 57/51/2.09 84.80

290 TF-ENT 0 NONE ZLC 56/52/2.09 84.75
291 TF-GFIDF 0 NONE ZLC 56/52/2.09 84.75
292 TF-IDF 0 NONE ZLC 56/52/2.09 84.75
293 TF-NORM 0 NONE ZLC 56/52/2.09 84.75
294 TF-ONE 0 NONE ZLC 56/52/2.09 84.75
295 TF-PRINV 0 NONE ZLC 56/52/2.09 84.75
296 BF-IDF 1 CONF ZLC 61/42/2.19 84.65
297 BF-IDF 1 LEVE ZLC 61/42/2.19 84.65
298 BF-PRINV 1 KLOS ZLC 57/38/2.38 84.64
299 TF-NORM 0 MULP SLC 57/48/2.15 84.58
300 TF-ONE 1 LEVE ZLC 64/38/2.22 84.57
301 TF-GFIDF 1 NONE CLC 53/61/1.98 84.55
302 BF-PRINV 0 LIFT CLC 56/54/2.05 84.55
303 BF-ONE 1 SUPP ZLC 62/47/2.07 84.52
304 TF-GFIDF 0 LEVE ZLC 61/48/2.07 84.49
305 BF-ENT 0 CONF SLC 56/38/2.40 84.35
306 BF-IDF 0 CONF SLC 56/38/2.40 84.35
307 BF-IDF 0 LEVE SLC 56/38/2.40 84.35
308 BF-ENT 1 MULP SLC 55/35/2.51 84.34
309 BF-IDF 1 MULP SLC 55/35/2.51 84.34
310 TF-ENT 1 NONE SLC 55/52/2.11 84.25
311 TF-IDF 1 NONE SLC 55/52/2.11 84.25
312 TF-PRINV 1 NONE SLC 55/52/2.11 84.25
313 BF-IDF 1 MULP ZLC 59/44/2.19 84.25
314 TF-NORM 0 LEVE ZLC 57/51/2.09 84.13
315 TF-NORM 1 NONE CLC 55/59/1.98 84.02
316 TF-IDF 1 LIFT SLC 56/51/2.11 83.90
317 TF-PRINV 1 LIFT SLC 56/51/2.11 83.90
318 BF-PRINV 1 MULP SLC 56/32/2.57 83.80
319 BF-ENT 1 CONF ZLC 60/48/2.09 83.72
320 BF-IDF 1 KLOS ZLC 58/37/2.38 83.62
321 BF-ONE 0 MULP ZLC 61/44/2.15 83.59
322 BF-ONE 0 SUPP ZLC 61/44/2.15 83.59
323 BF-PRINV 1 SUPP SLC 55/41/2.35 83.58
324 TF-ONE 1 NONE SLC 55/53/2.09 83.56
325 TF-NORM 0 KLOS ZLC 57/45/2.22 83.56
326 TF-ENT 0 LIFT SLC 56/53/2.07 83.53
327 TF-IDF 0 LIFT SLC 56/53/2.07 83.53
328 TF-PRINV 0 LIFT SLC 56/53/2.07 83.53
329 BF-NORM 1 SUPP CLC 56/56/2.02 83.47
330 BF-PRINV 0 LEVE SLC 56/40/2.35 83.46
331 TF-ONE 0 LIFT CLC 55/58/2.00 83.33
332 TF-NORM 0 LIFT SLC 57/52/2.07 83.30
333 BF-PRINV 1 LEVE ZLC 59/47/2.13 83.27
334 TF-ENT 1 LIFT SLC 56/52/2.09 83.20
335 BF-ENT 0 KLOS ZLC 56/41/2.33 83.18
336 BF-ENT 1 KLOS ZLC 57/44/2.24 83.17
337 TF-IDF 0 LIFT ZLC 58/49/2.11 83.17
338 BF-ENT 0 MULP SLC 55/42/2.33 83.15
339 BF-ENT 1 SUPP SLC 55/42/2.33 83.15
340 TF-PRINV 1 NONE CLC 55/61/1.95 83.06
341 BF-PRINV 1 LIFT CLC 56/57/2.00 83.06
342 TF-GFIDF 1 NONE SLC 56/55/2.04 83.00
343 BF-IDF 0 LIFT CLC 54/62/1.95 82.99
344 TF-ONE 1 NONE ZLC 57/52/2.07 82.93
345 TF-GFIDF 1 LEVE ZLC 64/39/2.19 82.88
346 BF-IDF 1 NONE ZLC 56/55/2.04 82.76
347 BF-PRINV 0 CONF SLC 56/38/2.40 82.75
348 BF-GFIDF 1 NONE CLC 54/60/1.98 82.74
349 BF-IDF 0 MULP SLC 55/47/2.22 82.69
350 TF-ONE 1 CONF ZLC 63/45/2.09 82.69

351 TF-ENT 1 LIFT ZLC 57/54/2.04 82.66
352 BF-PRINV 0 CONF ZLC 60/43/2.19 82.61
353 BF-ONE 1 CONF SLC 55/48/2.19 82.56
354 BF-IDF 1 SUPP SLC 55/46/2.24 82.53
355 BF-PRINV 1 SUPP ZLC 59/42/2.24 82.49
356 BF-IDF 0 MULP ZLC 60/42/2.22 82.42
357 TF-NORM 1 SUPP ZLC 59/48/2.11 82.40
358 BF-PRINV 1 MULP ZLC 60/44/2.17 82.38
359 BF-PRINV 0 LEVE ZLC 61/42/2.19 82.38
360 TF-GFIDF 1 LIFT CLC 51/67/1.92 82.38
361 TF-GFIDF 1 LIFT SLC 55/46/2.24 82.35
362 TF-NORM 0 SUPP SLC 56/54/2.05 82.33
363 TF-NORM 1 SUPP SLC 56/54/2.05 82.33
364 BF-IDF 0 LEVE ZLC 59/47/2.13 82.33
365 BF-ENT 0 MULP ZLC 60/43/2.19 82.28
366 BF-ENT 0 SUPP SLC 55/50/2.15 82.26
367 BF-PRINV 0 SUPP ZLC 58/50/2.09 82.26
368 BF-ONE 1 LIFT SLC 56/52/2.09 82.24
369 TF-ONE 1 LIFT SLC 56/52/2.09 82.24
370 TF-GFIDF 0 LIFT SLC 55/46/2.24 82.22
371 TF-ONE 0 LIFT SLC 56/50/2.13 82.21
372 TF-NORM 0 CONF ZLC 57/53/2.05 82.16
373 BF-ONE 1 LEVE SLC 55/46/2.24 82.15
374 BF-GFIDF 1 LEVE ZLC 65/44/2.07 82.14
375 BF-PRINV 0 MULP SLC 54/46/2.26 82.13
376 BF-ENT 0 SUPP ZLC 61/47/2.09 82.12
377 TF-NORM 1 LIFT SLC 57/54/2.04 82.11
378 TF-GFIDF 1 NONE ZLC 57/55/2.02 82.06
379 BF-ONE 0 CONF SLC 55/48/2.19 82.01
380 BF-ONE 0 LEVE SLC 55/48/2.19 82.01
381 BF-ENT 0 LEVE ZLC 62/47/2.07 82.00
382 BF-ENT 0 LIFT SLC 56/51/2.11 81.98
383 BF-IDF 0 LIFT SLC 56/51/2.11 81.98
384 TF-ENT 1 NONE ZLC 58/52/2.05 81.97
385 TF-IDF 1 NONE ZLC 58/52/2.05 81.97
386 TF-PRINV 1 NONE ZLC 58/52/2.05 81.97
387 BF-IDF 0 SUPP SLC 55/49/2.17 81.94
388 TF-IDF 1 LIFT ZLC 59/51/2.05 81.93
389 BF-PRINV 1 NONE CLC 52/65/1.93 81.84
390 BF-ENT 0 NONE CLC 55/60/1.97 81.76
391 BF-GFIDF 0 NONE CLC 55/60/1.97 81.76
392 BF-IDF 0 NONE CLC 55/60/1.97 81.76
393 BF-NORM 0 NONE CLC 55/60/1.97 81.76
394 BF-ONE 0 NONE CLC 55/60/1.97 81.76
395 BF-PRINV 0 NONE CLC 55/60/1.97 81.76
396 BF-ONE 0 LIFT SLC 56/53/2.07 81.76
397 BF-ENT 1 LIFT SLC 56/53/2.07 81.76
398 BF-IDF 1 LIFT SLC 56/53/2.07 81.76
399 BF-NORM 0 MULP SLC 55/53/2.09 81.75
400 BF-PRINV 0 KLOS ZLC 61/41/2.22 81.73
401 BF-ONE 0 LIFT CLC 53/64/1.93 81.57
402 TF-NORM 1 NONE ZLC 59/50/2.07 81.56
403 BF-GFIDF 1 NONE ZLC 57/51/2.09 81.53
404 BF-ONE 1 NONE ZLC 57/52/2.07 81.15
405 BF-IDF 0 KLOS ZLC 61/37/2.31 81.00
406 TF-ENT 0 LIFT ZLC 59/52/2.04 80.97
407 TF-NORM 0 LIFT ZLC 58/50/2.09 80.89
408 BF-PRINV 0 SUPP SLC 54/42/2.35 80.81
409 BF-ENT 0 CONF ZLC 62/48/2.05 80.74
410 BF-PRINV 0 MULP ZLC 60/46/2.13 80.72
411 TF-NORM 0 SUPP ZLC 60/51/2.04 80.67

412 BF-GFIDF 1 CONF ZLC 66/45/2.04 80.67
413 BF-NORM 1 MULP CLC 52/68/1.88 80.58
414 BF-GFIDF 1 LIFT CLC 54/64/1.92 80.58
415 BF-PRINV 0 LIFT SLC 56/52/2.09 80.56
416 BF-NORM 1 NONE CLC 54/63/1.93 80.51
417 BF-ENT 0 LIFT CLC 55/59/1.98 80.49
418 BF-GFIDF 1 LIFT SLC 57/54/2.04 80.33
419 BF-PRINV 1 LIFT SLC 56/54/2.05 80.32
420 BF-GFIDF 0 LEVE ZLC 66/36/2.22 80.31
421 BF-IDF 1 LIFT CLC 54/63/1.93 80.17
422 TF-ONE 1 LIFT CLC 51/70/1.87 80.17
423 BF-GFIDF 0 LIFT SLC 55/50/2.15 80.16
424 TF-NORM 1 NONE SLC 56/56/2.02 80.00
425 BF-ENT 0 NONE ZLC 57/54/2.04 79.92
426 BF-GFIDF 0 NONE ZLC 57/54/2.04 79.92
427 BF-IDF 0 NONE ZLC 57/54/2.04 79.92
428 BF-NORM 0 NONE ZLC 57/54/2.04 79.92
429 BF-ONE 0 NONE ZLC 57/54/2.04 79.92
430 BF-PRINV 0 NONE ZLC 57/54/2.04 79.92
431 BF-NORM 0 LIFT CLC 57/57/1.98 79.92
432 BF-IDF 0 CONF ZLC 61/46/2.11 79.84
433 BF-GFIDF 0 LIFT ZLC 58/53/2.04 79.84
434 TF-PRINV 1 LIFT ZLC 57/56/2.00 79.58
435 BF-IDF 0 SUPP ZLC 62/44/2.13 79.52
436 BF-ENT 1 LEVE ZLC 58/56/1.98 79.41
437 BF-GFIDF 0 LIFT CLC 56/59/1.97 79.41
438 BF-ENT 1 NONE ZLC 57/56/2.00 79.33
439 BF-PRINV 1 NONE ZLC 57/56/2.00 79.33
440 TF-GFIDF 0 LIFT CLC 52/68/1.88 79.16
441 BF-GFIDF 0 CONF ZLC 64/40/2.17 79.04
442 BF-ONE 0 CONF ZLC 58/55/2.00 78.91
443 TF-PRINV 0 LIFT ZLC 60/48/2.09 78.70
444 BF-ENT 1 LIFT CLC 53/61/1.98 78.53
445 BF-ONE 0 LEVE ZLC 65/38/2.19 78.19
446 BF-GFIDF 1 LIFT ZLC 60/52/2.02 78.10
447 BF-ONE 1 CONF ZLC 56/60/1.95 78.06
448 BF-GFIDF 1 NONE SLC 56/62/1.92 77.80
449 BF-ONE 1 LEVE ZLC 63/45/2.09 77.73
450 BF-PRINV 1 LIFT ZLC 59/53/2.02 77.69
451 BF-PRINV 1 NONE SLC 56/62/1.92 77.45
452 TF-GFIDF 0 LIFT ZLC 58/60/1.92 77.25
453 BF-NORM 1 LEVE CLC 56/65/1.87 77.16
454 BF-ENT 0 NONE SLC 56/62/1.92 77.12
455 BF-GFIDF 0 NONE SLC 56/62/1.92 77.12
456 BF-IDF 0 NONE SLC 56/62/1.92 77.12
457 BF-NORM 0 NONE SLC 56/62/1.92 77.12
458 BF-ONE 0 NONE SLC 56/62/1.92 77.12
459 BF-PRINV 0 NONE SLC 56/62/1.92 77.12
460 BF-NORM 1 MULP SLC 56/52/2.09 77.05
461 TF-ONE 0 LIFT ZLC 60/54/1.98 76.96
462 BF-NORM 0 LIFT SLC 56/64/1.88 76.92
463 BF-ONE 0 LIFT ZLC 57/58/1.97 76.79
464 BF-PRINV 0 LIFT ZLC 58/54/2.02 76.70
465 BF-NORM 0 SUPP CLC 52/70/1.85 76.69
466 BF-ENT 0 LIFT ZLC 58/60/1.92 76.66
467 BF-NORM 1 LIFT SLC 56/65/1.87 76.66
468 BF-ONE 1 LIFT ZLC 58/53/2.04 76.60
469 BF-ENT 1 NONE SLC 56/63/1.90 76.39
470 BF-IDF 1 NONE SLC 56/63/1.90 76.39
471 BF-NORM 0 LEVE CLC 59/60/1.90 76.39
472 TF-ONE 1 LIFT ZLC 58/59/1.93 76.23

87 Cluster Rank Weighting NOR COC Clustering F Characteristic 1

473 BF-NORM 1 NONE ZLC 56/62/1.92 75.91 474 BF-ONE 1 NONE SLC 55/61/1.95 75.83 475 BF-NORM 1 LIFT CLC 53/72/1.81 75.59 476 BF-ENT 1 LIFT ZLC 56/64/1.88 75.59 477 BF-NORM 0 CONF CLC 59/60/1.90 75.49 478 BF-IDF 1 LIFT ZLC 57/63/1.88 75.49 479 BF-NORM 1 KLOS CLC 58/60/1.92 75.05 480 BF-IDF 0 LIFT ZLC 59/56/1.97 75.05 481 TF-GFIDF 1 LIFT ZLC 60/59/1.90 74.73 482 BF-NORM 1 CONF CLC 61/58/1.90 74.67 483 BF-NORM 0 KLOS CLC 60/60/1.88 74.67 484 BF-NORM 0 MULP ZLC 59/57/1.95 74.63 485 BF-NORM 1 LIFT ZLC 58/59/1.93 74.53 486 BF-NORM 0 LIFT ZLC 57/62/1.90 74.15 487 BF-NORM 0 LEVE ZLC 61/56/1.93 74.04 488 BF-NORM 1 MULP ZLC 59/61/1.88 73.36 489 BF-NORM 0 CONF ZLC 61/58/1.90 72.96 490 BF-NORM 1 NONE SLC 58/62/1.88 72.49 491 BF-NORM 0 KLOS ZLC 60/59/1.90 72.29 492 BF-NORM 1 LEVE SLC 56/62/1.92 71.88 493 BF-NORM 1 CONF SLC 55/62/1.93 71.70 494 BF-NORM 0 CONF SLC 55/63/1.92 71.61 495 BF-NORM 0 LEVE SLC 53/59/2.02 71.43 496 BF-NORM 1 CONF ZLC 61/62/1.84 71.11 497 BF-NORM 1 LEVE ZLC 62/61/1.84 70.82 498 BF-NORM 1 KLOS ZLC 60/62/1.85 70.61 499 BF-NORM 1 SUPP SLC 56/69/1.81 70.51 500 BF-NORM 1 SUPP ZLC 59/64/1.84 70.33 501 BF-NORM 0 KLOS SLC 56/70/1.79 70.25 502 BF-NORM 1 KLOS SLC 56/70/1.79 70.25 503 BF-NORM 0 SUPP SLC 55/70/1.81 70.20 504 BF-NORM 0 SUPP ZLC 60/64/1.82 68.89

Appendix D

The Second Framework: Experimental Results in DPL

Rank Weighting NOR COC Clustering Cluster Characteristic F1

1 BF-GFIDF 0 KLOS CLC 103/57/2.08 83.87
2 TF-IDF 0 SUPP CLC 107/53/2.08 79.27
3 TF-ONE 1 KLOS CLC 108/50/2.11 77.88
4 BF-GFIDF 1 KLOS CLC 102/62/2.03 77.76
5 TF-ONE 1 SUPP CLC 104/62/2.01 77.06
6 BF-ONE 1 KLOS CLC 89/83/1.94 76.53
7 TF-GFIDF 1 SUPP CLC 105/63/1.98 76.42
8 TF-ONE 0 MULP CLC 103/66/1.97 76.37
9 TF-ENT 1 MULP CLC 102/61/2.04 76.35
10 TF-IDF 1 MULP CLC 104/61/2.02 76.22
11 TF-ONE 1 MULP CLC 103/57/2.08 76.08
12 TF-IDF 1 SUPP CLC 109/52/2.07 75.93
13 TF-IDF 0 LEVE CLC 105/60/2.02 75.63
14 TF-PRINV 1 MULP CLC 104/64/1.98 75.00
15 TF-PRINV 1 SUPP CLC 105/59/2.03 75.00
16 BF-ONE 0 KLOS CLC 101/65/2.01 74.58
17 TF-ENT 1 LEVE CLC 106/63/1.97 74.20
18 TF-PRINV 0 SUPP CLC 104/62/2.01 74.11
19 TF-GFIDF 0 SUPP CLC 104/72/1.89 74.03
20 TF-GFIDF 0 KLOS SLC 95/40/2.47 73.88
21 TF-PRINV 0 LEVE CLC 100/61/2.07 73.81
22 TF-ONE 0 KLOS CLC 106/61/1.99 73.77
23 TF-ONE 0 SUPP CLC 105/57/2.06 73.76
24 TF-ENT 1 SUPP CLC 107/59/2.01 73.75
25 TF-IDF 0 KLOS CLC 108/57/2.02 73.47
26 TF-NORM 1 MULP CLC 101/65/2.01 73.36
27 TF-IDF 0 CONF CLC 101/71/1.94 73.25
28 TF-NORM 0 KLOS CLC 106/62/1.98 72.96
29 TF-ENT 0 CONF CLC 105/64/1.97 72.84
30 BF-GFIDF 0 KLOS SLC 98/43/2.36 72.82
31 TF-ENT 0 SUPP CLC 107/60/1.99 72.81
32 TF-GFIDF 1 KLOS SLC 99/42/2.36 72.14
33 TF-ENT 1 CONF CLC 107/62/1.97 72.12
34 TF-ENT 0 MULP CLC 107/55/2.06 72.05
35 TF-IDF 1 KLOS CLC 110/60/1.96 71.99
36 TF-PRINV 0 CONF CLC 103/66/1.97 71.97
37 TF-GFIDF 0 KLOS ZLC 114/36/2.22 71.93
38 TF-PRINV 1 KLOS CLC 115/44/2.09 71.92
39 TF-GFIDF 1 KLOS CLC 108/62/1.96 71.70
40 TF-IDF 0 LIFT CLC 105/64/1.97 71.59
41 BF-GFIDF 1 MULP CLC 105/72/1.88 71.29
42 BF-GFIDF 1 KLOS SLC 98/42/2.38 71.19
43 TF-PRINV 0 KLOS CLC 108/54/2.06 71.12
44 TF-ONE 0 LEVE CLC 105/67/1.94 71.10
45 TF-NORM 1 LEVE CLC 106/61/1.99 71.09


46 TF-GFIDF 0 CONF CLC 105/65/1.96 71.04
47 BF-GFIDF 1 SUPP CLC 103/67/1.96 70.98
48 TF-ENT 1 SUPP SLC 102/54/2.13 70.88
49 TF-GFIDF 0 KLOS CLC 108/60/1.98 70.81
50 TF-GFIDF 1 MULP CLC 113/53/2.01 70.74
51 TF-PRINV 0 MULP CLC 111/48/2.09 70.70
52 TF-ONE 1 KLOS SLC 101/44/2.30 70.70
53 TF-GFIDF 1 KLOS ZLC 119/38/2.12 70.59
54 BF-PRINV 1 CONF CLC 86/105/1.74 70.53
55 BF-ENT 0 LEVE CLC 104/65/1.97 70.51
56 TF-ENT 0 LEVE CLC 104/67/1.95 70.51
57 TF-ONE 0 CONF CLC 105/66/1.95 70.44
58 TF-NORM 1 CONF CLC 105/65/1.96 70.38
59 TF-IDF 1 CONF CLC 103/69/1.94 70.32
60 BF-GFIDF 0 SUPP CLC 111/64/1.90 70.30
61 TF-GFIDF 1 SUPP SLC 96/61/2.12 70.28
62 TF-PRINV 1 LIFT CLC 104/70/1.91 70.24
63 BF-ONE 1 KLOS SLC 100/56/2.13 70.23
64 TF-ENT 0 KLOS CLC 109/55/2.03 70.18
65 BF-NORM 0 KLOS CLC 103/67/1.96 70.14
66 BF-ENT 0 CONF CLC 90/87/1.88 70.11
67 TF-IDF 1 LIFT CLC 107/64/1.95 70.11
68 TF-PRINV 0 LIFT CLC 105/67/1.94 70.10
69 TF-IDF 0 MULP CLC 109/55/2.03 70.03
70 TF-ONE 1 LIFT CLC 103/65/1.98 70.02
71 BF-ENT 0 KLOS CLC 93/77/1.96 69.97
72 TF-NORM 0 MULP CLC 104/64/1.98 69.94
73 BF-GFIDF 1 LEVE CLC 104/67/1.95 69.90
74 TF-IDF 1 LEVE CLC 105/63/1.98 69.89
75 BF-GFIDF 0 MULP CLC 107/73/1.85 69.88
76 TF-ONE 1 SUPP SLC 101/61/2.06 69.83
77 TF-PRINV 1 CONF CLC 110/58/1.98 69.82
78 TF-PRINV 1 KLOS SLC 101/54/2.15 69.79
79 TF-ENT 1 KLOS SLC 102/51/2.18 69.77
80 BF-IDF 1 CONF CLC 104/70/1.91 69.72
81 TF-ONE 0 LIFT CLC 105/63/1.98 69.62
82 BF-PRINV 1 LEVE CLC 88/102/1.75 69.54
83 TF-IDF 1 KLOS SLC 103/56/2.09 69.50
84 TF-ENT 1 MULP SLC 101/49/2.22 69.47
85 BF-ONE 1 MULP CLC 86/103/1.76 69.43
86 TF-ENT 1 SUPP ZLC 113/66/1.86 69.43
87 BF-IDF 1 LEVE CLC 85/108/1.73 69.42
88 TF-IDF 1 MULP SLC 101/39/2.38 69.37
89 TF-ENT 1 KLOS CLC 109/62/1.95 69.35
90 BF-GFIDF 1 LIFT CLC 104/63/1.99 69.28
91 TF-NORM 1 KLOS CLC 104/70/1.91 69.27
92 BF-GFIDF 0 LEVE CLC 105/65/1.96 69.21
93 BF-PRINV 1 MULP CLC 105/69/1.91 69.18
94 TF-IDF 1 SUPP SLC 101/53/2.16 69.15
95 TF-GFIDF 1 LIFT CLC 100/75/1.90 69.13
96 BF-ENT 1 MULP CLC 92/93/1.80 68.94
97 BF-IDF 1 KLOS CLC 88/99/1.78 68.84
98 TF-PRINV 1 LEVE CLC 110/59/1.97 68.84
99 TF-NORM 0 SUPP CLC 104/69/1.92 68.82
100 BF-PRINV 0 KLOS CLC 108/67/1.90 68.78
101 BF-IDF 0 MULP CLC 107/69/1.89 68.67
102 TF-IDF 0 KLOS SLC 99/60/2.09 68.67
103 TF-ONE 0 KLOS SLC 95/60/2.15 68.67
104 TF-ENT 1 NONE ZLC 115/60/1.90 68.61
105 BF-ONE 1 SUPP SLC 101/61/2.06 68.60
106 BF-ONE 0 CONF CLC 90/95/1.80 68.60


107 TF-ENT 0 KLOS ZLC 113/62/1.90 68.60
108 BF-ONE 1 SUPP CLC 107/75/1.83 68.58
109 BF-IDF 1 MULP CLC 86/105/1.74 68.51
110 TF-IDF 1 MULP ZLC 111/68/1.86 68.50
111 TF-GFIDF 0 MULP CLC 114/54/1.98 68.49
112 TF-ENT 0 LIFT CLC 104/67/1.95 68.48
113 BF-ENT 0 MULP CLC 108/68/1.89 68.45
114 BF-ONE 0 KLOS SLC 98/60/2.11 68.38
115 BF-GFIDF 0 LIFT CLC 105/67/1.94 68.38
116 TF-ONE 1 KLOS ZLC 112/33/2.30 68.34
117 TF-PRINV 1 MULP SLC 105/57/2.06 68.29
118 BF-IDF 0 CONF CLC 84/109/1.73 68.21
119 TF-PRINV 0 KLOS SLC 99/64/2.04 68.20
120 TF-GFIDF 0 LIFT CLC 106/66/1.94 68.16
121 BF-IDF 0 KLOS CLC 106/67/1.92 68.10
122 BF-GFIDF 0 CONF CLC 105/68/1.92 68.06
123 TF-PRINV 1 KLOS ZLC 116/27/2.33 68.06
124 BF-PRINV 0 LEVE CLC 84/111/1.71 68.01
125 TF-GFIDF 1 MULP ZLC 117/51/1.98 67.98
126 BF-GFIDF 1 CONF CLC 103/70/1.92 67.97
127 TF-PRINV 0 KLOS ZLC 113/66/1.86 67.94
128 TF-IDF 1 NONE ZLC 114/60/1.91 67.92
129 BF-PRINV 1 KLOS CLC 108/69/1.88 67.92
130 BF-GFIDF 1 SUPP SLC 98/68/2.01 67.83
131 TF-PRINV 1 NONE ZLC 116/58/1.91 67.81
132 TF-IDF 1 KLOS ZLC 115/60/1.90 67.81
133 BF-IDF 0 LEVE CLC 103/74/1.88 67.77
134 BF-ENT 1 LEVE CLC 103/69/1.94 67.75
135 TF-PRINV 1 MULP ZLC 117/44/2.07 67.73
136 TF-PRINV 1 NONE CLC 113/65/1.87 67.70
137 TF-IDF 1 SUPP ZLC 116/61/1.88 67.70
138 BF-ENT 1 KLOS CLC 107/66/1.92 67.67
139 BF-ENT 1 CONF CLC 105/70/1.90 67.66
140 BF-PRINV 0 MULP CLC 102/74/1.89 67.63
141 BF-GFIDF 0 KLOS ZLC 116/48/2.03 67.63
142 TF-GFIDF 1 NONE CLC 110/73/1.82 67.60
143 TF-ENT 1 NONE CLC 113/62/1.90 67.58
144 TF-IDF 1 NONE CLC 113/62/1.90 67.58
145 BF-PRINV 0 SUPP CLC 108/64/1.94 67.53
146 TF-IDF 0 KLOS ZLC 112/67/1.86 67.47
147 BF-PRINV 0 CONF CLC 84/112/1.70 67.45
148 BF-PRINV 1 LIFT CLC 108/67/1.90 67.45
149 TF-ONE 0 SUPP SLC 102/77/1.86 67.41
150 BF-IDF 0 SUPP CLC 105/67/1.94 67.31
151 BF-IDF 1 KLOS SLC 98/45/2.33 67.29
152 TF-ENT 0 SUPP ZLC 113/71/1.81 67.25
153 TF-IDF 0 SUPP ZLC 113/71/1.81 67.25
154 TF-NORM 0 CONF CLC 107/82/1.76 67.25
155 TF-ENT 1 KLOS ZLC 114/66/1.85 67.25
156 TF-NORM 0 LEVE CLC 107/82/1.76 67.25
157 TF-ENT 0 NONE CLC 108/70/1.87 67.24
158 TF-GFIDF 0 NONE CLC 108/70/1.87 67.24
159 TF-IDF 0 NONE CLC 108/70/1.87 67.24
160 TF-NORM 0 NONE CLC 108/70/1.87 67.24
161 TF-ONE 0 NONE CLC 108/70/1.87 67.24
162 TF-PRINV 0 NONE CLC 108/70/1.87 67.24
163 BF-ENT 1 KLOS ZLC 112/63/1.90 67.24
164 TF-NORM 1 SUPP CLC 109/63/1.94 67.22
165 TF-ENT 0 KLOS SLC 99/51/2.22 67.14
166 TF-ENT 1 MULP ZLC 112/71/1.82 67.02
167 BF-ENT 0 KLOS ZLC 111/64/1.90 67.01


168 BF-ONE 1 MULP SLC 99/77/1.89 66.98
169 BF-PRINV 1 KLOS SLC 98/46/2.31 66.94
170 BF-ENT 1 KLOS SLC 94/42/2.45 66.83
171 BF-NORM 0 MULP CLC 106/69/1.90 66.78
172 BF-NORM 0 MULP ZLC 108/73/1.84 66.78
173 BF-GFIDF 0 SUPP ZLC 114/70/1.81 66.78
174 BF-GFIDF 1 SUPP ZLC 111/71/1.83 66.78
175 BF-ONE 0 LEVE CLC 91/96/1.78 66.78
176 TF-ONE 0 KLOS ZLC 119/34/2.18 66.67
177 TF-GFIDF 0 LEVE CLC 106/71/1.88 66.67
178 TF-NORM 0 LIFT CLC 110/68/1.87 66.67
179 TF-GFIDF 1 CONF CLC 106/73/1.86 66.56
180 BF-ONE 0 KLOS ZLC 108/65/1.92 66.56
181 TF-NORM 1 LIFT CLC 102/72/1.91 66.56
182 BF-ENT 1 NONE CLC 109/70/1.86 66.55
183 BF-ENT 1 MULP ZLC 109/71/1.85 66.55
184 TF-GFIDF 1 LEVE CLC 105/77/1.83 66.55
185 TF-ONE 1 NONE CLC 113/64/1.88 66.44
186 TF-ONE 1 NONE ZLC 113/63/1.89 66.44
187 BF-ONE 0 MULP ZLC 110/69/1.86 66.44
188 TF-NORM 1 MULP SLC 103/67/1.96 66.36
189 TF-ENT 1 NONE SLC 109/61/1.96 66.34
190 BF-ONE 1 LEVE CLC 88/106/1.72 66.33
191 BF-IDF 1 NONE CLC 108/72/1.85 66.32
192 BF-IDF 1 MULP ZLC 108/72/1.85 66.32
193 TF-ONE 1 LEVE CLC 103/72/1.90 66.23
194 BF-GFIDF 1 KLOS ZLC 115/61/1.89 66.21
195 TF-ENT 0 SUPP SLC 101/80/1.84 66.12
196 BF-IDF 1 SUPP CLC 108/68/1.89 66.11
197 TF-IDF 1 NONE SLC 107/60/1.99 66.02
198 TF-GFIDF 1 NONE ZLC 112/67/1.86 65.98
199 TF-PRINV 1 SUPP ZLC 115/62/1.88 65.98
200 TF-IDF 1 CONF SLC 100/75/1.90 65.94
201 TF-IDF 1 LEVE SLC 100/75/1.90 65.94
202 TF-PRINV 1 CONF SLC 98/79/1.88 65.93
203 TF-PRINV 1 LEVE SLC 98/79/1.88 65.93
204 TF-PRINV 1 SUPP SLC 104/72/1.89 65.91
205 TF-ENT 1 CONF SLC 99/72/1.95 65.85
206 TF-ENT 1 LEVE SLC 99/72/1.95 65.85
207 BF-GFIDF 1 MULP SLC 97/82/1.86 65.84
208 TF-ENT 1 LIFT CLC 101/79/1.85 65.79
209 BF-PRINV 0 KLOS ZLC 112/60/1.94 65.77
210 BF-ONE 1 SUPP ZLC 108/75/1.82 65.75
211 TF-NORM 1 LEVE SLC 102/65/1.99 65.74
212 BF-GFIDF 1 MULP ZLC 110/75/1.80 65.73
213 TF-NORM 1 SUPP ZLC 110/75/1.80 65.73
214 BF-ONE 1 LIFT CLC 105/71/1.89 65.67
215 BF-ENT 1 SUPP ZLC 110/65/1.90 65.66
216 BF-ONE 1 KLOS ZLC 113/61/1.91 65.65
217 BF-PRINV 1 KLOS ZLC 111/64/1.90 65.65
218 TF-NORM 1 LEVE ZLC 111/66/1.88 65.65
219 TF-PRINV 1 NONE SLC 105/60/2.02 65.62
220 TF-NORM 0 LEVE ZLC 112/73/1.80 65.61
221 BF-ONE 1 MULP ZLC 107/83/1.75 65.60
222 BF-IDF 0 KLOS ZLC 110/62/1.94 65.55
223 BF-ONE 0 SUPP ZLC 111/71/1.83 65.51
224 TF-ONE 1 MULP ZLC 119/60/1.86 65.50
225 BF-ENT 0 SUPP CLC 109/63/1.94 65.45
226 TF-NORM 1 KLOS ZLC 109/67/1.89 65.43
227 TF-IDF 0 SUPP SLC 102/82/1.81 65.42
228 BF-PRINV 1 SUPP CLC 105/74/1.86 65.42


229 BF-PRINV 1 NONE CLC 107/72/1.86 65.41
230 TF-PRINV 1 CONF ZLC 108/75/1.82 65.40
231 BF-GFIDF 0 MULP ZLC 112/75/1.78 65.37
232 BF-ENT 1 LIFT CLC 106/70/1.89 65.33
233 BF-IDF 1 KLOS ZLC 111/64/1.90 65.31
234 TF-PRINV 0 CONF ZLC 120/41/2.07 65.28
235 TF-NORM 1 CONF SLC 102/71/1.92 65.27
236 TF-NORM 1 MULP ZLC 113/69/1.83 65.26
237 TF-ONE 1 SUPP ZLC 118/64/1.83 65.25
238 BF-PRINV 1 MULP ZLC 110/61/1.95 65.24
239 TF-ONE 0 SUPP ZLC 114/73/1.78 65.23
240 BF-PRINV 0 LIFT CLC 105/71/1.89 65.22
241 BF-ENT 0 LEVE ZLC 112/62/1.91 65.21
242 BF-GFIDF 1 LEVE SLC 95/63/2.11 65.20
243 TF-ENT 1 LEVE ZLC 121/37/2.11 65.18
244 BF-IDF 1 LIFT CLC 104/78/1.83 65.18
245 TF-ENT 0 NONE ZLC 110/70/1.85 65.17
246 TF-GFIDF 0 NONE ZLC 110/70/1.85 65.17
247 TF-IDF 0 NONE ZLC 110/70/1.85 65.17
248 TF-NORM 0 NONE ZLC 110/70/1.85 65.17
249 TF-ONE 0 NONE ZLC 110/70/1.85 65.17
250 TF-PRINV 0 NONE ZLC 110/70/1.85 65.17
251 BF-IDF 0 MULP ZLC 110/73/1.82 65.16
252 TF-NORM 1 NONE CLC 113/68/1.84 65.15
253 BF-IDF 1 SUPP ZLC 112/58/1.96 65.12
254 BF-ENT 1 SUPP CLC 108/67/1.90 65.11
255 BF-ENT 1 MULP SLC 100/74/1.91 65.10
256 BF-ONE 0 MULP SLC 103/79/1.83 65.00
257 BF-ONE 0 SUPP SLC 103/79/1.83 65.00
258 TF-PRINV 0 SUPP SLC 101/81/1.83 65.00
259 BF-ONE 0 SUPP CLC 88/108/1.70 64.99
260 TF-NORM 1 SUPP SLC 104/71/1.90 64.94
261 TF-IDF 1 CONF ZLC 109/74/1.82 64.93
262 BF-GFIDF 1 NONE CLC 105/82/1.78 64.90
263 BF-ENT 0 KLOS SLC 99/50/2.23 64.84
264 TF-NORM 0 CONF SLC 101/78/1.86 64.83
265 TF-NORM 0 LEVE SLC 101/78/1.86 64.83
266 TF-PRINV 1 LEVE ZLC 109/75/1.81 64.81
267 BF-GFIDF 0 SUPP SLC 100/71/1.95 64.77
268 BF-IDF 0 SUPP ZLC 108/56/2.03 64.76
269 BF-ONE 1 CONF CLC 86/113/1.67 64.74
270 TF-NORM 0 SUPP ZLC 111/61/1.94 64.69
271 BF-GFIDF 1 CONF SLC 95/65/2.08 64.62
272 TF-NORM 1 CONF ZLC 114/69/1.82 64.55
273 BF-PRINV 1 LEVE SLC 98/65/2.04 64.53
274 BF-PRINV 0 MULP ZLC 111/67/1.87 64.52
275 BF-ENT 0 MULP ZLC 112/67/1.86 64.51
276 BF-NORM 1 CONF CLC 102/74/1.89 64.51
277 BF-IDF 0 LIFT CLC 107/78/1.80 64.46
278 TF-NORM 0 CONF ZLC 112/73/1.80 64.44
279 BF-NORM 1 LIFT CLC 103/71/1.91 64.43
280 TF-PRINV 0 SUPP ZLC 113/74/1.78 64.40
281 BF-ENT 0 SUPP ZLC 111/71/1.83 64.37
282 BF-IDF 1 MULP SLC 95/58/2.18 64.36
283 BF-IDF 0 KLOS SLC 100/56/2.13 64.36
284 BF-IDF 1 LEVE SLC 98/67/2.02 64.33
285 TF-ONE 1 LEVE SLC 93/60/2.18 64.32
286 BF-ENT 1 LEVE ZLC 111/65/1.89 64.32
287 TF-NORM 1 KLOS SLC 104/68/1.94 64.31
288 BF-ENT 1 NONE ZLC 110/66/1.89 64.30
289 BF-ENT 1 CONF ZLC 111/66/1.88 64.30


290 TF-GFIDF 1 MULP SLC 102/50/2.19 64.28
291 TF-GFIDF 1 CONF SLC 93/58/2.21 64.26
292 BF-GFIDF 1 CONF ZLC 116/50/2.01 64.26
293 TF-GFIDF 1 LEVE SLC 93/58/2.21 64.26
294 BF-ONE 0 LIFT CLC 103/81/1.81 64.25
295 BF-ONE 1 NONE CLC 109/73/1.83 64.24
296 BF-PRINV 1 CONF SLC 98/65/2.04 64.24
297 TF-ONE 0 LEVE ZLC 118/52/1.96 64.24
298 TF-IDF 1 LEVE ZLC 109/74/1.82 64.24
299 BF-NORM 1 LEVE CLC 102/74/1.89 64.19
300 TF-ONE 1 MULP SLC 97/59/2.13 64.15
301 BF-IDF 1 CONF SLC 98/67/2.02 64.12
302 BF-GFIDF 1 LEVE ZLC 113/62/1.90 64.11
303 TF-GFIDF 1 LEVE ZLC 121/35/2.13 64.11
304 BF-ENT 0 CONF ZLC 112/63/1.90 64.08
305 BF-PRINV 0 KLOS SLC 100/56/2.13 64.07
306 TF-NORM 0 LIFT ZLC 110/69/1.86 64.07
307 TF-ONE 1 CONF SLC 92/63/2.15 64.04
308 BF-ENT 1 CONF SLC 95/67/2.06 64.01
309 TF-ENT 1 LIFT SLC 101/84/1.80 64.01
310 TF-IDF 0 CONF ZLC 107/78/1.80 64.00
311 BF-PRINV 1 LEVE ZLC 114/51/2.02 63.99
312 BF-PRINV 1 MULP SLC 100/81/1.84 63.97
313 BF-ENT 1 SUPP SLC 103/74/1.88 63.95
314 BF-ENT 1 LEVE SLC 95/69/2.03 63.90
315 TF-IDF 0 LEVE ZLC 114/51/2.02 63.88
316 TF-ONE 0 CONF ZLC 125/31/2.13 63.86
317 TF-IDF 0 CONF SLC 100/93/1.73 63.84
318 TF-PRINV 0 CONF SLC 100/93/1.73 63.84
319 TF-IDF 0 LEVE SLC 100/93/1.73 63.84
320 TF-PRINV 0 LEVE SLC 100/93/1.73 63.84
321 BF-PRINV 0 SUPP ZLC 112/68/1.85 63.78
322 TF-NORM 1 LIFT ZLC 111/73/1.81 63.76
323 BF-IDF 1 NONE ZLC 110/65/1.90 63.74
324 BF-NORM 0 CONF CLC 106/74/1.85 63.74
325 TF-ONE 1 NONE SLC 102/69/1.95 63.72
326 TF-NORM 1 NONE ZLC 112/68/1.85 63.67
327 BF-PRINV 1 SUPP SLC 100/80/1.85 63.67
328 BF-ONE 1 NONE SLC 102/56/2.11 63.65
329 TF-ONE 1 CONF CLC 102/76/1.87 63.65
330 BF-ONE 0 MULP CLC 89/108/1.69 63.64
331 BF-ONE 1 NONE ZLC 108/69/1.88 63.62
332 BF-GFIDF 0 MULP SLC 97/89/1.79 63.62
333 BF-PRINV 1 CONF ZLC 108/71/1.86 63.61
334 TF-GFIDF 0 LEVE SLC 96/73/1.97 63.58
335 TF-ONE 1 LEVE ZLC 115/39/2.16 63.57
336 TF-ONE 0 LIFT ZLC 106/87/1.73 63.57
337 TF-NORM 0 KLOS SLC 104/70/1.91 63.56
338 BF-PRINV 1 SUPP ZLC 110/63/1.92 63.55
339 TF-ENT 0 LIFT ZLC 105/90/1.71 63.54
340 TF-ENT 0 LEVE ZLC 108/78/1.79 63.53
341 TF-ENT 0 MULP ZLC 107/88/1.71 63.52
342 BF-NORM 1 MULP SLC 97/88/1.80 63.50
343 TF-NORM 0 MULP ZLC 111/71/1.83 63.46
344 TF-IDF 1 LIFT SLC 100/86/1.79 63.46
345 TF-PRINV 1 LIFT SLC 100/86/1.79 63.46
346 BF-IDF 1 CONF ZLC 106/74/1.85 63.41
347 TF-ONE 1 LIFT ZLC 109/81/1.75 63.35
348 TF-PRINV 1 LIFT ZLC 111/56/1.99 63.34
349 TF-NORM 1 NONE SLC 100/60/2.08 63.33
350 BF-ENT 0 NONE ZLC 108/75/1.82 63.30


351 BF-GFIDF 0 NONE ZLC 108/75/1.82 63.30
352 BF-IDF 0 NONE ZLC 108/75/1.82 63.30
353 BF-NORM 0 NONE ZLC 108/75/1.82 63.30
354 BF-ONE 0 NONE ZLC 108/75/1.82 63.30
355 BF-PRINV 0 NONE ZLC 108/75/1.82 63.30
356 BF-GFIDF 1 NONE SLC 97/56/2.18 63.29
357 TF-PRINV 0 LIFT ZLC 106/89/1.71 63.29
358 TF-ONE 1 CONF ZLC 117/42/2.09 63.27
359 BF-ENT 0 MULP SLC 100/78/1.87 63.21
360 BF-PRINV 0 MULP SLC 100/78/1.87 63.21
361 BF-IDF 0 LEVE ZLC 110/65/1.90 63.19
362 TF-NORM 0 KLOS ZLC 111/65/1.89 63.18
363 BF-NORM 1 MULP CLC 99/85/1.81 63.14
364 TF-ENT 0 CONF SLC 100/94/1.72 63.12
365 BF-IDF 0 CONF ZLC 111/62/1.92 63.11
366 BF-PRINV 1 NONE ZLC 110/65/1.90 63.07
367 TF-GFIDF 1 NONE SLC 100/78/1.87 63.05
368 TF-PRINV 0 MULP ZLC 117/65/1.83 63.05
369 TF-GFIDF 0 SUPP SLC 87/58/2.30 63.02
370 BF-IDF 1 SUPP SLC 101/82/1.82 63.00
371 BF-IDF 0 MULP SLC 99/80/1.86 62.99
372 TF-ENT 1 LIFT ZLC 113/73/1.79 62.99
373 TF-ENT 0 LEVE SLC 100/96/1.70 62.97
374 TF-ENT 0 CONF ZLC 108/77/1.80 62.94
375 BF-ENT 0 LIFT CLC 103/71/1.91 62.93
376 TF-NORM 0 SUPP SLC 96/63/2.09 62.91
377 TF-GFIDF 0 CONF SLC 91/67/2.11 62.83
378 BF-IDF 1 LEVE ZLC 106/81/1.78 62.83
379 BF-NORM 0 LEVE CLC 105/73/1.87 62.81
380 BF-GFIDF 1 NONE ZLC 110/70/1.85 62.76
381 BF-PRINV 0 SUPP SLC 100/85/1.80 62.73
382 TF-PRINV 0 LEVE ZLC 107/85/1.73 62.72
383 TF-IDF 1 LIFT ZLC 110/81/1.74 62.70
384 BF-IDF 1 NONE SLC 101/68/1.97 62.66
385 BF-PRINV 1 NONE SLC 101/68/1.97 62.66
386 TF-GFIDF 0 LIFT ZLC 106/85/1.74 62.63
387 TF-ENT 1 CONF ZLC 108/79/1.78 62.54
388 BF-ENT 1 NONE SLC 101/70/1.95 62.52
389 TF-ENT 0 NONE SLC 103/68/1.95 62.50
390 TF-GFIDF 0 NONE SLC 103/68/1.95 62.50
391 TF-IDF 0 NONE SLC 103/68/1.95 62.50
392 TF-NORM 0 NONE SLC 103/68/1.95 62.50
393 TF-ONE 0 NONE SLC 103/68/1.95 62.50
394 TF-PRINV 0 NONE SLC 103/68/1.95 62.50
395 TF-GFIDF 0 LEVE ZLC 115/65/1.85 62.50
396 TF-GFIDF 1 SUPP ZLC 121/42/2.04 62.46
397 BF-NORM 1 CONF ZLC 103/86/1.76 62.46
398 BF-IDF 0 SUPP SLC 100/84/1.81 62.42
399 BF-ENT 1 LIFT ZLC 106/68/1.91 62.40
400 BF-NORM 0 LIFT CLC 101/76/1.88 62.38
401 TF-GFIDF 0 CONF ZLC 120/41/2.07 62.35
402 BF-NORM 1 KLOS CLC 105/83/1.77 62.21
403 BF-PRINV 0 CONF ZLC 111/63/1.91 62.18
404 TF-ONE 0 CONF SLC 97/69/2.01 62.17
405 TF-ONE 0 LEVE SLC 97/69/2.01 62.17
406 BF-NORM 0 SUPP CLC 105/73/1.87 62.16
407 BF-GFIDF 0 CONF SLC 88/68/2.13 62.15
408 BF-ENT 0 SUPP SLC 100/85/1.80 62.10
409 TF-GFIDF 1 CONF ZLC 122/40/2.06 62.10
410 BF-ENT 0 NONE CLC 106/83/1.76 62.06
411 BF-GFIDF 0 NONE CLC 106/83/1.76 62.06


412 BF-IDF 0 NONE CLC 106/83/1.76 62.06
413 BF-NORM 0 NONE CLC 106/83/1.76 62.06
414 BF-ONE 0 NONE CLC 106/83/1.76 62.06
415 BF-PRINV 0 NONE CLC 106/83/1.76 62.06
416 TF-IDF 0 MULP ZLC 115/73/1.77 61.98
417 BF-GFIDF 0 LEVE SLC 90/66/2.13 61.95
418 BF-PRINV 0 LEVE ZLC 111/76/1.78 61.92
419 BF-ONE 1 LIFT ZLC 109/71/1.85 61.82
420 BF-NORM 1 MULP ZLC 107/83/1.75 61.79
421 BF-NORM 0 MULP SLC 100/80/1.85 61.76
422 TF-IDF 0 LIFT ZLC 111/57/1.98 61.66
423 TF-NORM 1 LIFT SLC 103/77/1.85 61.61
424 TF-NORM 0 MULP SLC 98/66/2.03 61.60
425 BF-NORM 0 KLOS ZLC 103/84/1.78 61.57
426 BF-ENT 0 NONE SLC 102/82/1.81 61.51
427 BF-GFIDF 0 NONE SLC 102/82/1.81 61.51
428 BF-IDF 0 NONE SLC 102/82/1.81 61.51
429 BF-NORM 0 NONE SLC 102/82/1.81 61.51
430 BF-ONE 0 NONE SLC 102/82/1.81 61.51
431 BF-PRINV 0 NONE SLC 102/82/1.81 61.51
432 BF-NORM 1 NONE CLC 104/85/1.76 61.35
433 BF-GFIDF 1 LIFT ZLC 115/57/1.94 61.34
434 BF-GFIDF 0 CONF ZLC 120/57/1.88 61.25
435 BF-NORM 1 SUPP CLC 101/89/1.75 61.19
436 BF-GFIDF 0 LEVE ZLC 119/57/1.89 61.17
437 BF-PRINV 1 LIFT SLC 99/81/1.85 61.15
438 BF-ONE 0 LIFT ZLC 108/66/1.91 61.11
439 BF-NORM 0 LEVE ZLC 114/73/1.78 61.10
440 BF-GFIDF 0 LIFT ZLC 108/74/1.83 61.06
441 TF-GFIDF 1 LIFT ZLC 106/87/1.73 61.04
442 BF-IDF 0 CONF SLC 98/76/1.91 61.02
443 TF-PRINV 0 LIFT SLC 98/85/1.82 61.02
444 TF-ONE 1 LIFT SLC 97/88/1.80 61.01
445 BF-ENT 0 CONF SLC 97/83/1.85 60.97
446 BF-ENT 0 LEVE SLC 97/83/1.85 60.97
447 BF-ENT 1 LIFT SLC 98/83/1.84 60.93
448 BF-IDF 1 LIFT SLC 98/83/1.84 60.93
449 BF-NORM 1 NONE SLC 96/79/1.90 60.86
450 BF-NORM 1 NONE ZLC 105/76/1.84 60.85
451 TF-ENT 0 LIFT SLC 98/85/1.82 60.80
452 BF-PRINV 0 CONF SLC 99/77/1.89 60.79
453 BF-IDF 0 LEVE SLC 99/77/1.89 60.79
454 BF-NORM 0 CONF ZLC 116/69/1.80 60.75
455 BF-PRINV 1 LIFT ZLC 108/83/1.74 60.68
456 BF-ONE 1 CONF ZLC 112/64/1.89 60.64
457 BF-GFIDF 1 LIFT SLC 96/86/1.83 60.60
458 TF-GFIDF 0 SUPP ZLC 115/73/1.77 60.54
459 BF-IDF 1 LIFT ZLC 108/84/1.73 60.54
460 TF-NORM 0 LIFT SLC 97/96/1.73 60.52
461 TF-IDF 0 LIFT SLC 97/86/1.82 60.41
462 BF-ONE 1 CONF SLC 92/57/2.23 60.39
463 BF-ONE 0 CONF ZLC 119/59/1.87 60.28
464 BF-PRINV 0 LEVE SLC 98/79/1.88 60.25
465 BF-ONE 0 LEVE ZLC 113/60/1.92 60.23
466 BF-NORM 1 KLOS ZLC 104/83/1.78 59.93
467 BF-PRINV 0 LIFT ZLC 106/89/1.71 59.89
468 TF-IDF 0 MULP SLC 97/67/2.03 59.85
469 BF-ONE 1 LEVE SLC 91/58/2.23 59.79
470 BF-ENT 0 LIFT ZLC 105/88/1.73 59.75
471 BF-NORM 1 SUPP ZLC 100/96/1.70 59.71
472 BF-ONE 1 LEVE ZLC 115/69/1.81 59.68


473 TF-ENT 0 MULP SLC 96/87/1.82 59.67
474 BF-IDF 0 LIFT ZLC 105/90/1.71 59.67
475 TF-ONE 0 LIFT SLC 96/77/1.92 59.57
476 TF-GFIDF 1 LIFT SLC 90/82/1.94 59.36
477 BF-NORM 0 SUPP ZLC 112/74/1.79 59.15
478 TF-PRINV 0 MULP SLC 96/69/2.02 59.08
479 TF-GFIDF 0 LIFT SLC 96/81/1.88 59.08
480 BF-NORM 1 KLOS SLC 92/93/1.80 58.88
481 BF-NORM 1 LEVE ZLC 104/86/1.75 58.87
482 BF-ONE 0 CONF SLC 88/60/2.25 58.81
483 BF-NORM 0 KLOS SLC 93/99/1.73 58.80
484 BF-GFIDF 0 LIFT SLC 94/83/1.88 58.75
485 BF-NORM 1 CONF SLC 89/96/1.80 58.28
486 BF-NORM 1 LEVE SLC 89/96/1.80 58.28
487 TF-ONE 0 MULP ZLC 108/93/1.66 58.27
488 BF-ENT 0 LIFT SLC 98/86/1.81 58.06
489 BF-IDF 0 LIFT SLC 98/86/1.81 58.06
490 BF-PRINV 0 LIFT SLC 98/86/1.81 58.06
491 BF-NORM 1 SUPP SLC 92/102/1.72 57.92
492 BF-NORM 0 SUPP SLC 93/98/1.74 57.85
493 BF-NORM 0 LIFT ZLC 112/75/1.78 57.80
494 BF-ONE 0 LEVE SLC 85/53/2.41 57.59
495 TF-ONE 0 MULP SLC 92/106/1.68 57.54
496 BF-NORM 0 LEVE SLC 92/97/1.76 57.52
497 BF-NORM 1 LIFT ZLC 103/91/1.72 57.50
498 BF-ONE 1 LIFT SLC 99/84/1.82 57.43
499 BF-NORM 0 CONF SLC 92/99/1.74 57.33
500 BF-NORM 0 LIFT SLC 91/110/1.66 56.85
501 TF-GFIDF 0 MULP ZLC 114/75/1.76 56.73
502 BF-NORM 1 LIFT SLC 93/108/1.66 56.49
503 BF-ONE 0 LIFT SLC 97/82/1.86 55.25
504 TF-GFIDF 0 MULP SLC 76/136/1.57 53.81
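The COC column in the tables above distinguishes co-occurrence matrices built with different association measures. As a minimal illustrative sketch (not the thesis implementation), and assuming SUPP, CONF, and LIFT denote the standard support, confidence, and lift measures computed from document-level co-occurrence counts of a name pair:

```python
def association_measures(n_a, n_b, n_ab, n_docs):
    """Support, confidence, and lift for a candidate name-alias pair (a, b).

    n_a, n_b -- documents containing name a (resp. b)
    n_ab     -- documents containing both names
    n_docs   -- total documents in the collection
    """
    p_a, p_b, p_ab = n_a / n_docs, n_b / n_docs, n_ab / n_docs
    support = p_ab                                      # P(a, b)
    confidence = n_ab / n_a if n_a else 0.0             # P(b | a)
    lift = p_ab / (p_a * p_b) if p_a and p_b else 0.0   # P(a, b) / (P(a) P(b))
    return support, confidence, lift

# Hypothetical counts: the pair co-occurs in 2 of 10 documents
s, c, l = association_measures(n_a=4, n_b=5, n_ab=2, n_docs=10)
# support = 0.2, confidence = 0.5, lift = 0.2 / (0.4 * 0.5) = 1.0
```

A lift above 1 indicates the two names co-occur more often than expected under independence, which is the intuition behind using association measures for co-occurrence matrix construction.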

Appendix E

List of Publications

International Journal

• Suwanapong, T., Theeramunkong, T., and Nantajeewarawat, E.: Name-Alias Relationship Identification in Thai News Articles: A Comparison of Co-occurrence Matrix Construction Methods, Chiang Mai Journal of Science. (Article in Press)

• Suwanapong, T., Theeramunkong, T., and Nantajeewarawat, E.: Investigation of Preprocessing Factors and Clustering Methods on Name-Alias Relationship Identification in Thai News Articles, Information: An International Interdisciplinary Journal, Vol. 18, No. 7, July 2015, pp. 3001–3020.

International Conference

• Suwanapong, T., Theeramunkong, T., and Nantajeewarawat, E.: A Fuzzy-Relation Approach for Name-Alias Identification in Thai News Articles. In: Proceedings of the 1st Asian Conference on Information Systems (ACIS2012), 6–8 December 2012, Siem Reap, Cambodia, pp. 352–357.

• Theeramunkong, T., Boriboon, M., Haruechaiyasak, C., Kittiphattanabawon, N., Kosawat, K., Onsuwan, C., Siriwat, I., Suwanapong, T., and Tongtep, N.: THAI-NEST: A Framework for Thai Named Entity Tagging Specification and Tools. In: Proceedings of the 2nd International Conference on Corpus Linguistics (CILC10), 13–15 May 2010, A Coruña, Spain, pp. 895–908.

• Suwanapong, T., Theeramunkong, T., and Nantajeewarawat, E.: The Vector Space Models for Finding Co-occurrence Names as Aliases in Thai Sports News. In: Proceedings of the 2nd Asian Conference on Intelligent Information and Database Systems (ACIIDS2010), Volume 5990 of Lecture Notes in Computer Science, 24–26 March 2010, Hue City, Vietnam, pp. 122–130.

• Suwanapong, T. and Theeramunkong, T.: Aliases Discovered in Thai Sports News Articles. In: Proceedings of the 8th International Symposium on Natural Language Processing (SNLP2009), 20–22 October 2009, Bangkok, Thailand, pp. 63–66.
