Working Paper Series ISSN 1177-777X

Text Categorization and Similarity Analysis: Similarity measure, Architecture and Design

Michael Fowke1, Annika Hinze1, Ralf Heese2

Working Paper: 12/2013 December 2013

© 2013 Michael Fowke, Annika Hinze, Ralf Heese
1 Department of Computer Science, The University of Waikato, Private Bag 3105, Hamilton, New Zealand
2 Pingar International Ltd, 152 Quay St, Auckland, New Zealand


Abstract

This research investigates the most appropriate similarity measure for a document classification problem. The goal is to find a method that is accurate in finding both semantically related and version-related documents. A further requirement is that the method is efficient in its speed and disk usage. Simhash is found to be the measure best suited to the application, and it can be combined with other software to increase its accuracy. Pingar have provided an API that extracts the entities from a document and creates a taxonomy displaying the relationships between them; this extra information can be used to accurately classify input documents. Two algorithms are designed incorporating the Pingar API, and finally an efficient comparison algorithm is introduced to cut down the comparisons required.

1. Introduction

Document classification and provenance has become an important area of computer science as the amount of digital information is growing significantly. Software is now required to show similarities between documents (i.e. document classification) and to point out duplicates and possibly the history of each document (i.e. provenance). This honours project is done with Pingar, a company based in Auckland that aims to help organise the growing amount of unstructured digital data. Pingar provided the Pingar API and taxonomy generator, which are software tools that assist with document classification. Pingar also provided a document corpus to use for testing of the software to be created.

The intended outcome of the system is the ability to find the strength of semantic relationships and version relationships between an input document and any of the documents in the corpus. If a company has a collection of documents, they will be able to use the software to analyse a new document, find semantically related documents, and determine whether it is a version of an existing document.

The literature review [1] covered a number of different existing implementations that attempted to classify documents and make it easier for a user to organise their digital information. This report follows on from the literature review and is a more extensive look at the different similarity measures. The similarity measures are analysed on their ability to find related documents both semantically and in version. The measures are also analysed on their speed and the disk space required. The aim is to fully understand each and to choose the most appropriate to implement in the design of the software.

The remainder of the document is structured as follows. We start with some background on the project. The next section looks at each of the different similarity measures used to classify documents. The approaches are analysed on a set of criteria to determine the most appropriate. This method was then chosen and the rest of the document covers the research using this method. By the end of the report the design of the software can begin based on the conclusions found in the research.

2. Background

This section covers the background and resources that are required to understand the research problem.

2.1 Example documents

Throughout the report, five example documents will be used to illustrate how each of the similarity measures performs. As input text examples, the short documents [2], [3], [4] shown in the appendix are used. Figure 1 shows the relationship between each of the documents.

Figure 1: Relationship between documents

2.2 Semantic technology

Pingar have provided two pieces of their semantic technology software: the Pingar API for extracting entities from a document collection and a taxonomy generator for finding relationships between extracted entities. These two technologies will be used to generate more accurate classifications than using the document text alone.

An example to illustrate the technology is one using document 3. The Pingar API extracts two entities, including Jason Dufner, from the document. These entities are related as they are both people, and the API returns this information, which is visualised in Figure 2. The entities are the objects and the taxonomy gives the semantic relationships between them.

Figure 2: Output from API and taxonomy
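The kind of output described above can be pictured with a minimal sketch. The class, the entity table, and the relatedness rule below are hypothetical stand-ins for illustration only; they are not the actual Pingar API or taxonomy generator objects.

```java
import java.util.Map;

public class TaxonomyExample {

    // Hypothetical entity -> type table standing in for the API's output.
    static final Map<String, String> ENTITY_TYPES = Map.of(
            "Jason Dufner", "person",
            "Tiger Woods", "person",
            "New York", "location");

    // In the taxonomy sketch, two entities of the same type
    // (e.g. both people) are related, as in the document 3 example.
    static boolean related(String a, String b) {
        String typeA = ENTITY_TYPES.get(a);
        return typeA != null && typeA.equals(ENTITY_TYPES.get(b));
    }

    public static void main(String[] args) {
        System.out.println(related("Jason Dufner", "Tiger Woods")); // true
        System.out.println(related("Jason Dufner", "New York"));    // false
    }
}
```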


2.3 Software structure

As an outcome of the literature study reported in [1], the overall structure of the document classification software was designed as shown in Figure 3. The blue rectangles in Figure 3 are software and include the Pingar API and taxonomy generator as well as the software that will be created. The green ovals show the output from the Pingar software. The input to the system is shown at the top and is a document to be analysed. The output from the system is shown at the bottom and is the document with calculated relationships. The grey rectangle is the similarity measure, which is the main topic of this report.

Figure 3: Overall structure of the software

As shown in Figure 3, the software will receive a static input document with the aim of identifying which documents in the document corpus are semantically related and which ones are versions of the same text. As only the content of the static documents is known (no history data), the only classification strategy available is content analysis. The outcome of the software will be a set of relationships between the input document and documents within the collection. Each relationship will have a value from 0.0 to 1.0 to show how strong the relationship between the documents is.

Input data: Pingar provided a document corpus the software will be tested on. Static input documents mean that the software is unaware of any interactions a person has made with the documents before classification. A single input document will be compared against the documents in the corpus.

Pingar software: will be used to initially analyse the documents. More information about the Pingar API and taxonomy generator is given in section 2.2. The software will produce a taxonomy and a set of named entities with location references. The taxonomy shows the relationships between the extracted entities (e.g. a hierarchical structure). The API also gives information on which documents an entity occurs in. Named entities are phrases that contain the names of persons, organizations and locations [5].

The classification software: will use these entities from the Pingar API and taxonomy generator, as well as the original document text, to output the similarities between the input document and documents in the corpus. Both of these outputs are required and are fed into the software.

Distance measure: this is the component of the system that takes the input and determines the distance between two documents, i.e. their similarity. The measure will determine the strength of the relationship between each pair of documents.

Output data: the output will be a list of relationship values between the input document and each of the documents in the corpus. The values will be between 0.0 and 1.0 to show the strength of the relationship, with 1.0 being a perfect match and 0.0 being no relation. A version relationship is a closer relationship than being semantically related, so values from Ɵ to 1.0 are reserved for documents related by version. The symbol Ɵ is a placeholder, as it is not yet known what value should be the lower range of the version relationship. A score of Ɵ means a weak version relationship. Scores below Ɵ are given to documents with no version relationship. Only document relationships that are over a certain threshold (e.g. 0.5) will be shown.

2.4 Example of system

When document 3 is fed into the system, the extracted entities include golf, major, Jason Dufner and New York. The taxonomy will also identify that New York is a location and Jason Dufner is a person. This information is then input to the software to be created, as well as the original document text. The system has access to a corpus of pre-processed documents that will be compared to the input document. These are processed in advance so the computation time is minimised. The software created will take the input document and the information from the Pingar software and use the distance measure to find any documents in the corpus that are related. If document 4 is in the corpus then one of the relationships output will be that document 4 has a high semantic relationship with document 3. This is due to document 4 sharing common entities such as golf and Jason Dufner. Details on the similarity measure are provided in the next section.

3. Similarity Approaches

The similarity measure is the remaining part to finalise before the software can be built. This is the measure that will determine to what extent two static input documents are related. To make an informed decision for the most appropriate similarity measure, a number of criteria are introduced to help identify the best approach (section 3.1). Each of the measures analysed is then discussed in its design and illustrated using the example scenario from section 2.1. A table at the end of this chapter will summarise the performance of each measure. Each approach is awarded a ++ or + to show very good or fairly good performance, or a -- or - to show very poor or slightly poor performance.

3.1 Assessment criteria

The following criteria are used to evaluate the similarity measures.

1) Accurately find versions of the same document

The similarity approach must accurately identify that versions of the same document are related. A version is a document with mainly the same content, with only a few extra words inserted or removed. Documents 1 and 2 are versions as the first two paragraphs in document 2 have only a couple of extra words and document 2 has an extra paragraph at the end.


A similarity approach is awarded ++ if it is capable of finding document versions using the initial document text alone, and a single + if it can find versions but only by incorporating the Pingar API.

2) Accurately find semantically related documents

The similarity approach must accurately identify that two documents that share a high number of common themes and topics are semantically related. Documents 3 and 4 are semantically related as they are both talking about Jason Dufner winning his first PGA Championship title. A similarity approach is awarded ++ if it can find semantically similar documents from the initial document text alone, and a single + if it can do so only when the Pingar API is incorporated.

3) Produce a smaller representation of the input document (disk space)

The space required to store digital information is increasing, and the method chosen should represent the input document in a more compact form without losing important information. The software should use as little disk space as possible, so this criterion is necessary. A similarity measure is awarded a mark between ++ and -- based on how compact a document becomes, with ++ being maximum compression.

4) Number of comparisons required

Speed of the algorithm is very important, particularly when the software is run on a large number of documents. The computation time is heavily dependent on how many comparisons are required between documents. The greater the number of comparisons or operations required to find similar documents, the less efficient and slower the algorithm will be. The software should be able to find related documents with as few comparisons as possible. A similarity measure is awarded a mark between ++ and --, with ++ being a very low number of comparisons required.

5) Speed of comparisons

The number of comparisons required impacts speed, but so does the time taken for each of the comparisons. This criterion rates each approach on the computational complexity of each comparison required between the documents. The software should be able to do each comparison quickly, as this will likely cause the entire software to run quickly. A similarity approach is awarded a mark between ++ and --, with ++ being a very quick time for each comparison.

3.2 Simhash

The first similarity approach looked at was the simhash algorithm, described by Charikar [6]. Charikar argues that the amount of information currently available is large, and we should not be looking to compare entire documents but rather to create smaller representations of each document, which will then be compared. Hashing is an example of this strategy, as the content is represented by hashes, which leads to greater disk space efficiency and speed. Simhash is calculated by applying a family of hash functions to each of the input phrases, and the output is a value between 0 and 1: 1 shows that documents are identical and 0 shows that documents are very different. Charikar states that the similarity between two sets produced by using the family of hash functions can be estimated by counting the number of matching coordinates in their corresponding hash vectors. Similarity estimation is based on a test of equality of hash function values.

An implementation of the simhash algorithm is provided on a separate website [7]. It also describes why a simhash algorithm is much more effective than a normal hashing algorithm. A normal hashing algorithm will give very different hash values if the input phrase differs only slightly. Simhash will compute similar hash values for the first two paragraphs of documents 1 and 2; normal hashing would not do this due to the few extra words and the word reordering.

Simhash works well for finding different versions of the same document. Two different versions of a document are likely to be the same for large sections of text, with one having additional text or text removed. Figure 4 shows the first portion of shingles (2 letter pairs) extracted from the first paragraph of documents 1 and 2 when ordered alphabetically. The simhash value is calculated using the sum of the hash values of each of the shingles. Despite the different word ordering and slightly different words, both documents share most of the shingles so the hash values will be very similar. The extra paragraph in document 2 cannot be matched with document 1, but two paragraphs is enough to show a version. Simhash also performs well when words have different endings. Document 1 uses official and document 2 officially, yet simhash can identify that the majority of the word is the same.

Figure 4: Shingles from documents 1 and 2

Simhash would not work so well in finding documents that are related semantically when used on the original text. When analysing the entire document, simhash is unable to identify the key concepts to compare between documents. Also, simhash is looking at the words themselves and not at the meaning of the words. Documents 3 and 4 are related semantically and share a number of common key words, but simhash is unaware what the key words are. Document 3 uses the term winning score and document 4 uses leading score, which will not be determined as the same by simhash, which is looking at the letters and words. The Pingar API would make simhash accurate in this area. Creating a hash value of the phrase formed by joining all of the extracted entities together gives a very compact representation of the key concepts in the document. For document 3 this would mean hashing golf, Jason Dufner etc. This one hash value is all that would be required to compare this document with others in the collection of documents. There are other variations that could be used, such as calculating the simhash value for all the extracted concepts in each of the paragraphs in the document. This would again be compact, as only one number per paragraph would be required.

Simhash performs well on the criteria required of the similarity measure. As stated previously, it is able to find documents that are related in version. It is also able to identify documents that are semantically related, but only by using the output of the Pingar API as an input to hashing. A major advantage of simhash is its small representation of a document. When identifying versions of documents, simhash will give a single number per chunk (likely a sentence). This means the entire document can be represented by a list of values, such as document 5 being represented by 5 hash values. When identifying semantically related documents, a document can be represented by a single number, being the hash of the extracted concepts from the document. The power of using numbers instead of text is really exploited in the next stage of the algorithm. Once the chunks of text have been turned into hash values, the chunks can be ordered based on the numerical value of the hash. This saves time in the search for related documents. Finding versions using simhash involves a number of comparisons, with each of the chunks needing to be checked against each chunk in another document. The ordering of the hashed chunks reduces the number of comparisons required, as each chunk only requires checking against hash values that are similar in value. The number of comparisons becomes efficient due to the ability to order numbers. The computation required in each comparison is also minimal: each comparison involves finding the number of bits difference between two numbers, which is easy and fast to compute.

3.3 Clustering using Wikipedia guidance

The next method considered is to use Wikipedia to cluster documents. A similar approach was used by Huang in [8] and [9]. Wikipedia is becoming an increasingly useful resource for clustering documents as it is a huge, well structured collection of information. Knopp et al. state that Wikipedia has been explored in a number of natural language processing tasks in recent years [10]. Huang uses Wikipedia and the links to related pages to generate relationships between concepts. This information can then be used to supervise the normally unsupervised method of clustering. Using the extra semantic information found through Wikipedia analysis produced far more accurate clusters than the clustering algorithm could alone.

Figure 5: Clustering using documents 3 and 4

The Wikipedia part of the method would not be required, as the Pingar API gives the same output.

Huang used Wikipedia to find the relationship between concepts in a document, which the Pingar API does also. Concepts from document 5 such as physical exercise and obesity would be found to be related using Wikipedia, which the Pingar API will also find. This is a clever technique; however, it is not required in this software as the Pingar API does the same. Of more relevance is the technique that Huang used for finding the similarity of documents once Wikipedia had been used for finding the relationship between concepts. Huang states that concepts are clustered according to their pair-wise semantic relatedness as computed from Wikipedia. Active learning is then applied to documents using the clustered concepts to derive a "keep together" or "separate" relationship. From our example documents it would find that golf and PGA championships are concepts that should be kept together, and golf and bikes are likely to be separated. The report describes the method used to determine if a document was related to a cluster of concepts. The weighting of a cluster involves a calculation using the number of occurrences of a concept. This is one method that can be used to find the relationship between a document and concepts, but it is essentially a word frequency approach, which is addressed further in the next method on word frequency.

Huang uses the Cop-Kmeans clustering algorithm (a variation of the k-means algorithm) to cluster the documents. Cop-Kmeans uses the relationships stated above to guide the clustering, but underneath it still uses the simple k-means algorithm. Huang states that the clustering algorithm can only relate documents that use identical terminology, and important semantic relations between terms such as acronyms, synonyms, hypernyms, spelling variations and related terms are all ignored. This means that clustering is not able to find documents to be related unless they share exact words, as the algorithm is not aware of synonyms and other important language characteristics. Clustering would not identify that winning score and leading score from documents 3 and 4 are related. Clustering could be extended to look at only the concepts extracted by the Pingar API and their synonyms to resolve this issue.

Clustering would be able to identify semantically related documents if it used only the extracted concepts rather than the entire document. As stated by Huang [8], clustering can only relate documents that use identical terminology and will miss semantic relations. Huang also states that the terms used by clustering are usually single word terms, which causes the algorithm to miss important concepts that are more than one word. These issues would be solved by using the extracted concepts from the Pingar API. Documents 3 and 4 are related semantically, and Figure 5 shows how clustering can identify this relationship. The entities are extracted and clustering represents each document by a d-dimensional vector, with each dimension being the presence of an extracted entity. These vectors will be compared and related documents grouped accordingly. Analysing the text alone will not find that concepts like golf title and golf trophy are related, so synonyms from the Pingar API need to be used to find these concepts to be related.

Clustering is unable to find documents to be versions, as this involves using far too many dimensions in the d-dimensional vector. When looking for versions, the entire document text needs to be analysed, as it is not just the extracted entities that are important. This would mean thousands of dimensions in the vector, which is impractical. It would also become similar to a word frequency approach, which is discussed in the next section.

Clustering only meets some of the criteria. It is unable to identify versions as it cannot handle the high number of dimensions required in the vector representation. It is able to identify documents that are related semantically using the modification above that involves clustering using only the extracted concepts from the Pingar API. Clustering is able to represent the document in a more compact form, as it considers a document as only containing a number of key terms. This is not quite as small as the single numerical value used by simhash, but it is still an efficient use of disk space. Clustering treats the entire document as a whole and there is no implementation of clustering that involves breaking the document into chunks, therefore the number of comparisons is small. The comparisons between documents are also fast and efficient, as a value is calculated for each document, compared to the mean value of each cluster, and the document grouped accordingly.

3.4 Word Frequency

The final method introduced in the literature review was a technique using word frequency [11]. Stanford University used a system that uses word frequencies to detect plagiarism. The technique could identify if a submitted document was a copy of one already in their database of documents (i.e. a version). This is similar to the technique using simhash in that it is looking at the original document text. This method compares documents based on how many times certain words appear; if two documents are versions of each other and are very similar, then they will have similar word frequencies.

Word frequency works well in finding different versions of the same document using only the original document text, with a few modifications. Documents 1 and 2 are versions of each other, as it is clear the first two paragraphs are very similar, and Figure 6 shows how this works. The figure shows the words from the first paragraphs of documents 1 and 2. Even though the word ordering differs and there are a few different words, when ordered alphabetically it is clear that most of the words are common and therefore the documents are versions. If the word frequency was analysed over the entire document text, then two versions of the same document would not appear related if one document contains an extra section with completely different terms and word ratios. Document 2 has the extra paragraph on Lindsey Vonn, and when analysed as an entire document the word frequencies would not match. Analysing the document by paragraph would show that the first two paragraphs of each document are versions.

Figure 6: Words from documents 1 and 2

This method would find documents that are semantically similar by using a method similar to that used by simhash. Word frequency using only the original text would not work, as it focuses on words and not synonyms and misspellings.
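The paragraph-level word frequency comparison described above can be sketched as follows. The whitespace word splitting and the shared-occurrences overlap formula are illustrative assumptions; the report only states that versions will have similar word frequencies, without fixing an exact formula.

```java
import java.util.HashMap;
import java.util.Map;

public class WordFrequencyCompare {

    // Count occurrences of each lower-cased word in a paragraph.
    static Map<String, Integer> wordCounts(String paragraph) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : paragraph.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Assumed overlap measure: shared occurrences over total occurrences,
    // giving 1.0 for identical paragraphs and 0.0 for disjoint ones.
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        int shared = 0, total = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            shared += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        for (int v : a.values()) total += v;
        for (int v : b.values()) total += v;
        return total == 0 ? 0.0 : (2.0 * shared) / total;
    }

    public static void main(String[] args) {
        // A few reordered or inserted words barely change the counts,
        // so the paragraphs still score as versions of each other.
        Map<String, Integer> p1 = wordCounts("tiger woods has divorced his wife");
        Map<String, Integer> p2 = wordCounts("tiger woods has reportedly divorced his wife");
        System.out.println(similarity(p1, p2)); // high overlap
    }
}
```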

Documents 3 and 4 are very similar semantically, and a word frequency approach would not work as it is looking at the entire document and not just the concepts, which are all that matters for semantically related documents. It is also unaware that terms like winning score and leading score, which are used in the different documents, mean the same thing. When the Pingar API is combined with a word frequency method, these issues are resolved. The Pingar API extracts the concepts from a document and these can be compared to the concepts of another document. If one document contains the same concepts (or synonyms) as another document, then they are likely related semantically.

A word frequency approach does not meet all the criteria. It does meet the major criterion of being able to find versions of a document, as long as it is implemented by breaking the document into chunks. It also meets the major requirement of finding semantically related documents, as it can compare the presence of key concepts between documents when combined with the Pingar API. This approach does not perform so well on the remaining three criteria. Representing the document in a more compact way is an important requirement of the software. When looking for versions, this method would represent a document by a word frequency for each chunk (likely a paragraph) in the text. This does not reduce the size of the document at all. When looking for semantic relatedness this method would be better, in that a document would be represented by only a small set of main concepts. This is still not as compact as the simhash method, which only requires a single hash value for all the concepts combined. This method will involve a number of comparisons between the chunks in different documents. The difference between this and simhash is that the chunks are likely to be paragraphs in this algorithm and sentences in a simhash implementation. As a result, fewer chunks need comparing. But as the chunks are represented by word counts rather than numbers, every chunk will need comparing, whereas simhash can be much smarter about which chunks need comparing by ordering the chunks numerically. There is little ordering that can be applied to word counts except for partial alphabetical ordering. As a result, the total number of comparisons would be similar. The speed of each comparison would be slow, as each chunk would need comparing on the presence of different words and the count for each.

3.5 Summary of approaches

The performance of the three similarity approaches is summarised in Figure 7, and simhash receives the best score based on the assessment criteria.

                                                 Simhash   Clustering   Word frequency
Accurately find versions of documents               ++         --            ++
Accurately find semantically related documents      +          +             +
Create a smaller representation of a document       ++         +             -
Efficient number of comparisons required            +          ++            +
Good speed of comparison                            ++         +             -
Total                                               +8         +3            +2

Figure 7: performance on required criteria

The simhash method is the only one that performed positively on all 5 criteria. Clustering is unable to identify versions of a document, which is a major requirement of the project, and for this reason it is not considered further. The major differences between simhash and word frequency come in the way they process documents and the speed and space required. Simhash represents a document by a list of hash values when trying to identify versions, which is a compact representation. Document 1 would be represented by 3 hash values, 1 for each sentence in the document. This list can then be sorted so the comparison of chunks between documents can be done efficiently. Word frequency does not use a compact representation and instead looks at a word count for each chunk in a document. Document 1 would be represented by a list of word : frequency pairs. This increases the space required, and each chunk has no natural ordering to aid computation speed. The algorithm would need to check every chunk against every other chunk.

Figure 8 shows the difference in disk space required by the two methods, and Figure 9 shows how these values were calculated. The values were calculated for document 1. Both the approach to finding semantically related documents and the approach to finding versions of the same document consume over 10 times the disk space using word frequency.

                                Simhash    Word frequency
Finding versions                12 bytes   497 bytes
Finding semantic relatedness    4 bytes    48 bytes

Figure 8: disk space required for documents

Simhash (32 bits per hash value = 4 bytes)
Version: 3 sentences * 4 bytes = 12 bytes
Semantic: 8 extracted entities into 1 hash = 4 bytes

Word frequency (average word is 6 letters = 6 bytes)
Version: each paragraph is represented by a list of word-occurrence pairs, e.g. golf-2. The word requires on average 6 bytes and the number 1 byte, so 7 bytes per word. Document 1 has 24 different words in paragraph 1, 26 different words in paragraph 2 and 21 different words in paragraph 3: 24*7 + 26*7 + 21*7 = 497 bytes
Semantic: 6 bytes per each of the 8 entities extracted = 6 * 8 = 48 bytes

Figure 9: calculation of bytes required

It was shown above that simhash performs better than word frequency in the storage space required. The quality of the classification is also crucial in the decision making process, so simhash and word frequency are now compared on their accuracy. The statistics of interest are the precision and recall of the algorithms. The precision statistic shows the percentage of documents that are returned as similar that are in fact similar, i.e. true positives / (true positives + false positives). Recall shows the percentage of similar documents that are returned by the algorithm, i.e. true positives / (true positives + false negatives). The f measure is the combination of these two statistics to give an overall accuracy reading.

Sood [12] states that simhash algorithms have good recall but tend to have a low precision. This means that the algorithm is able to identify almost all the documents that are related and return them as related, so it is likely documents 3 and 4 will be identified as related.
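The precision and recall definitions above can be written out directly. The counts in the example below are made up purely to match Sood's 95% recall figure, and the harmonic-mean form of the f measure is assumed here, since the text does not spell out the exact formula.

```java
public class AccuracyStats {

    // precision = TP / (TP + FP), as defined in the text.
    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    // recall = TP / (TP + FN), as defined in the text.
    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    // The f measure combines precision and recall; the harmonic mean
    // is the usual definition and is an assumption here.
    static double fMeasure(double p, double r) {
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hypothetical counts: 19 related documents returned correctly,
        // 6 false positives, 1 related document missed.
        double p = precision(19, 6); // 0.76 - lower precision
        double r = recall(19, 1);    // 0.95 - high recall, as Sood reports
        System.out.println(fMeasure(p, r)); // mid-range f measure
    }
}
```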


algorithm can struggle in that it returns a number of false Break the phrases up into features positives which are documents that it decides are related This part of the algorithm involves breaking the input that should not be. It may find documents 4 and 5 to be phrases into smaller chunks called shingles with a few related where they should not be. Sood worked with a letters. The example given broke the text into two letter recall percentage of 95% and found the algorithm to be shingles and this was the method created and used fast but it did return a number of false-positives. Simhash initially. The two letter shingles include spaces and no receives a mid range f measure to reflect the high recall duplicate shingle is included. An example phrase is but lower precision. The precision can be improved with "Tiger Woods has reportedly divorced his wife Elin tightening up the criteria for documents to be related. If Nordegren" from document 1 and figure 11 shows this the recall percentage is lowered to 90% by only including broken into shingles. documents that are more closely related such as documents 1 and 2, the number of false-positives will reduce. This is the concept of a ROC curve (Receiver Operating Characteristic) where the threshold used for deciding whether documents are related is altered and the 'Figure 11: Document 1 in 2 letter shingles changes in true positive and false positive rates are monitored. There are also ways of introducing the Pingar Hash each feature API which should increase the precision of the algorithm. It was suggested that a 32-bit hashing algorithm was used. The recall for a word frequency algorithm is high as it This length was chosen to be long enough so that clashes will find documents to be related even if the words in the did not occur with different input being hashed to the sentence are reordered or a few extra words are inserted. same output. 
The length was also short enough to be Like the simhash algorithm, the issues arise with word computationally efficient. Testing was done with 16 and frequency having a lower precision. Paragraphs can have 64 bit algorithms also. The larger the hash value the very similar word frequencies and thus appear as related greater the bit difference between related phrases but it but in fact not be versions at all. The f measure will be increased linearly compared to the difference between mid range again to reflect the high recall but lower two other phrases. This means the bit difference between precision. two different sets of phrases may have been 4 and 8 with 32 bit and 8 and 16 with 64 bit so the results did not Word frequency and simhash perform similarly in terms reveal any further information. of recall and precision but the main difference between The Java [13] hashing function was giving fairly small the two is in disc usage and also speed of comparisons. hash values even though it was a 32-bit hashing function. Simhash is able to work much quicker and with the use of This appeared to be as the input to hash was small. As a less space so for these reasons it is chosen as the result some of the phrases were found to be very similar similarity measure to use. Testing can be done in the even if they were in fact completely different. A new implementation to determine the threshold to use for hashing algorithm implementation was written to resolve accuracy to get the most useful results regarding recall this. The hashing method is simple and gets the lowest 32 and precision. bits when the byte value of the two letters is multiplied by a large prime number. The prime number used was 4. Sim hash implementation 27644437 [14] and it gave good hash values. The The next stage is to work out exactly how to apply the algorithm works much better with this new hashing simhash method for the highest accuracy. The question of algorithm. Figure 12 shows this hashing algorithm. 
this research is how to best combine the Pingar API and simhash method to find document similarity in semantics and in version. Matpalm [7] had an implementation of the simhash algorithm which this research is based on. This section outlines the initial implementation of the simhash algorithm Figure 12: calculating hash value 4.1 Initial algorithm This is the algorithm used initially to calculate the Keep a 32 value array to modify simhash of chunks in a document. At each stage the idea For each of the hashed shingles described above, if a bit i is described as well as any experimentation done to find is set then add 1 to the value at position i in the array. If the best version. The simhash value used was the number bit i is not set in the hashed value, then subtract one from of bits difference between two generated hash values, the value at position i in the array. This was created fairly Figure 10 illustrates this difference in bits. The arrows quickly and there was little room for different represent the bits which differ in the numbers. The implementations. difference in bits in these two hash values is 8. Calculate 32-bit simhash value 1537307734 = 0 1 0 1 1 0 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 1 0 0 0 1 0 1 0 1 1 0 Set bit i to 1 if the value at position i in the array above is 1218804829 = 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0 1 > 0. Set the value to 0 otherwise. Again there is little room for variation. Figure 10: Difference in bits of sim hash values Difference in bits.
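The four steps above can be sketched end to end. The following is a reconstruction in Python rather than the project's Java implementation: the 2-letter shingling, the prime 27644437 and the 32-slot counting array follow the description, while the exact way the two byte values are combined before multiplying by the prime is an assumption, as are all function names.

```python
PRIME = 27644437  # the large prime named in the text, used to spread small inputs

def shingles(phrase):
    """Break a phrase into its set of 2-letter shingles (spaces kept, duplicates dropped)."""
    return {phrase[i:i + 2] for i in range(len(phrase) - 1)}

def hash32(shingle):
    """Hash one shingle to 32 bits: combine the two byte values, multiply by the
    prime and keep the lowest 32 bits. The byte combination here is an assumption."""
    b = shingle.encode("utf-8")
    return ((b[0] * 256 + b[1]) * PRIME) & 0xFFFFFFFF

def simhash(phrase):
    """Shingle, hash each shingle, vote +1/-1 per bit position over all shingle
    hashes, then set bit i of the result when the vote at position i is positive."""
    counts = [0] * 32
    for s in shingles(phrase):
        h = hash32(s)
        for i in range(32):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(32) if counts[i] > 0)

def bit_difference(a, b):
    """Number of bits in which two hash values differ (the similarity measure used)."""
    return bin(a ^ b).count("1")
```

For the pair in Figure 10, `bit_difference(1537307734, 1218804829)` returns 8, matching the figure.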


At this point the simhash value has been calculated for each of the phrases input to the algorithm. The next step is to find the difference between each hashed phrase in terms of bits. Each of the phrases is analysed and the algorithm outputs the number of bits of difference between each pair.

4.2 Output from initial algorithm
A first implementation of the algorithm had been created and was tested on a few phrases. Each phrase represents a chunk that would be extracted from a complete document. Figure 13 shows the three input phrases from the example documents and Figure 14 shows the bit differences of the hashed values.

Figure 13: Three input phrases to demonstrate similar text

Phrases | Number of bits different
1,2     | 3
1,3     | 10
2,3     | 10

Figure 14: Bit difference in example for versions

From this example it appears that the number of bits of difference did show which of the phrases were the most similar. However, the output did not show very well just how different each of the phrases were. The difference between phrase 3 and the other two is large, but it is difficult to judge how different they are using just the value output.

This implementation works fairly well and can identify phrases that are similar in terms of words. This can be used for finding documents that are versions of one another. When one word in a phrase was changed or removed, the simhash implementation would still find the phrases to be related. The current implementation is not very good at finding phrases that are related in terms of semantics but not words. Figures 15 and 16 show an illustration of this.

Figure 15: Input phrases for semantic relation

Phrases | Number of bits different
1,2     | 8
1,3     | 9
2,3     | 10

Figure 16: Bits different in semantic example

The first two phrases should be shown as slightly related but the output does not appear to show this. The first two are shown as similar, likely because of common words such as "score" and "270", but the difference for the other pairs is not much more. The simhash algorithm is trying to find similarities in the phrases based on common 2-letter pairs in the phrases. It is incapable of recognising related terms such as "leading" and "winning" in the first two phrases. The method needs to be adjusted to find semantic similarities, and this is where the Pingar API is useful.

5. Incorporating the Pingar API
The implementation described above is a good introduction to simhash, but alone it will not give a high level of accuracy for the classification software. The Pingar API can be incorporated particularly to help with finding semantic relatedness, but also related versions. This section is broken into two parts: the first looks at combining simhash with the Pingar API to find versions of a document, the second at finding semantically related documents.

5.1 Finding document versions
This is the part where simhash already performs well. Simhash analyses each chunk or sentence in a document and creates a hash value which can be compared against every chunk in other documents. Each method is applied to documents 1 and 2 to show its effectiveness.

5.1.1. Using simhash on original document text
When looking for versions, it is possible that using simhash on the original document text is the most accurate way to find relationships. The document is broken into chunks (likely sentences) and each chunk has its simhash value calculated. Each document is represented by a list of hash values which can then be compared against every other document. If documents share a number of chunks that are within a fixed number of bits of each other then the documents are found to be versions of each other. Documents 1 and 2 are shown to be related as the sentences contain similar words and structure in the first two paragraphs, as in Figure 17.

Figure 17: Differences in documents 1 and 2

The first sentences are the same but with different word ordering and slightly different words, such as "became" in document 1 and "was done" in document 2. The hash values will be very similar as simhash looks at 2-letter pairs. The hash values of the second sentences will be very similar for the same reasons. Document 2 has a 3rd sentence which

document 1 does not, but this is fine as the first part of each document is shown to be versions since 2/2 of the sentences from document 1 appear in document 2.

5.1.2. Using simhash on text with synonyms
The previous approach can be modified to incorporate the Pingar API. Figure 18 shows how the method will work on the first sentence of documents 1 and 2. This approach would find phrases to be versions if the author has swapped a concept for a synonym. The example finds "official" to be a concept and finds the synonyms to include in the input to hash. This example shows why looking at the concepts is unnecessary: including the synonym has added no further accuracy to the algorithm. In documents 1 and 2 there are no concepts that are in one document but not the other. All the differences in the documents that make them versions are in word ordering and the insertion/deletion of minor words, such as "it became" rather than "was done". The sentences in Figure 18 differ in that they have 2 different words and 1 word with a different ending. Using synonyms has not helped to improve this situation.

Figure 18: Simhash on chunks with synonyms

5.1.3. Summary of document version approaches
Using synonyms is not necessary for finding document versions. When a person is writing a document, they are just as likely to change the smaller words in a sentence as the concept words. The example showed that using the synonyms did not improve accuracy, as in general the synonyms will be a very small proportion of the words that differ between versions. Simhash will find sentences to be versions if only 1 or 2 words differ, so it is not necessary to introduce synonyms for finding document versions.

5.2 Finding semantic relationships
This is the part which requires a lot of assistance from the Pingar API. Simhash alone is incapable of finding semantically related documents as it does not consider synonyms or misspellings. The concepts extracted by the Pingar API can be used as an input to hash to generate more accurate results. To find the effectiveness, each method is shown on documents 3 and 4. Figures 19 and 20 show the entities extracted from documents 3 and 4. These will be used to illustrate the upcoming approaches.

Figure 19: Entities extracted from document 3
Keywords: Jason Dufner, Oak Hill Country Club, PGA Championship, golf, Keegan Bradley, bogey, Atlanta, major, winning score
People: Keegan Bradley, Jason Dufner
Locations: Oak Hill Country Club, New York, Atlanta

Figure 20: Entities extracted from document 4
Keywords: Jason Dufner, Tiger Woods, golf, Oak Hill, American, major, PGA
People: Jason Dufner, Tiger Woods, Jim Furyk
Locations: New York, Oak Hill, Torrey Pines

5.2.1. Hashing concepts from an entire document
The first approach to using the Pingar API is to extract all the concepts from the document and generate a simhash value for the concatenation of these concepts. Figure 21 demonstrates this process. Each document would be represented by a single number: the hash value of all of the concepts combined. Comparisons between documents are fast and easy with just a single number being compared for each document. Documents that share mainly the same concepts will have similar simhash values and be shown as semantically related.

Figure 21: Hashing of extracted concepts

5.2.2. Hashing concepts from each paragraph
The next approach is similar to the previous one except that the document is broken into sections. A document may contain two fairly separate sections, and this method would identify this and still find related documents. If the first half of document a is on an identical topic to document b, but document a then discusses a different topic in the second half, the previous approach would not find these documents related: the number of shared concepts would not be high enough to produce similar simhash values. Figure 22 shows an example of this from documents 3 and 4. The majority of each document is on the same topic; however, the last paragraph of document 3 is on the tournament 2 years ago whereas the last paragraph in document 4 is on Tiger Woods. That paragraph has a completely different simhash value, which can be discarded as the rest of the paragraphs are similar.
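Both the version check of Section 5.1.1 and the per-paragraph semantic check above reduce to the same decision: represent each document as a list of 32-bit hash values (one per sentence, or one per paragraph's concatenated concepts) and count how many of them lie within n bits of some hash in the other document. A minimal sketch, with illustrative names and thresholds (the paper leaves the exact threshold to testing):

```python
def bits_differ(a, b):
    """Number of differing bits between two 32-bit hash values."""
    return bin(a ^ b).count("1")

def related(hashes_a, hashes_b, n=3, min_shared=1):
    """True if at least min_shared hashes in hashes_a are within n bits of some
    hash in hashes_b. For versions the hashes are per-sentence; for semantic
    relatedness they are per-paragraph concept hashes, where one or two matching
    paragraphs may be enough since paragraphs do not always line up."""
    shared = sum(
        1 for ha in hashes_a
        if any(bits_differ(ha, hb) <= n for hb in hashes_b)
    )
    return shared >= min_shared
```

For example, `related([0b1010, 0xF0F0F0F0], [0b1011])` is True, because the first pair of hashes differs by only one bit.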


Figure 22: Hashing concepts per paragraph

5.2.3. Include frequency of entities
This adjustment to the simhash method for semantics uses the number of occurrences of each of the entities, and can be applied to either of the approaches above to improve accuracy. If two documents are closely related in terms of topic, not only will they use common words or concepts throughout but they will use these key terms many times.

Figure 23 displays an example of two documents with the entities extracted and their frequency. The first 3 documents are all adaptations of document 3 and the last is document 5, to illustrate the introduction of entity frequency.

Figure 23: Frequency of concepts in documents

One approach would be to include the term golf numerous times in the input to hash, i.e. for document 3a it would be simhash(golf golf golf golf Jason Dufner PGA championship). This is essentially a variation on the tf-idf algorithm (Term Frequency-Inverse Document Frequency) [15]. This algorithm calculates the importance of each term in a document: the importance goes up when a term is included many times in a document and goes down when the term is used many times across the entire document corpus. The trouble with this is that the simhash implementation looks at 2-letter shingles within the phrase and discards duplicates, so including golf many times in the input to hash has the same effect as including it once.

It was then investigated what would happen if the simhash algorithm was altered to include duplicates. The result was an algorithm that places a very high weighting on the entities that occur many times. Documents 3a and 3b were found to be no more similar than documents 3a and 5, and document 5 is completely different. The reason for this is that golf has such a high weighting in the calculation of simhash that the other terms are essentially ignored. To reinforce this, the similarity between documents 3a and 3c was calculated and found to be very high. This shows that the simhash value for document 3a is so highly influenced by the entity golf that it is almost the same as hashing only golf. This tf-idf style algorithm can still be used but it should be used carefully. An entity's frequency is important, as two documents that both mention golf 20 times should be shown as related regardless of the content of the rest of the documents, which this algorithm would show. The semantic similarity should therefore be calculated twice, once with the tf-idf weighting included and once without. Calculating the semantic similarity without including entities many times will find documents 3a and 3b to be semantically related, which is also true. When these two approaches are used carefully, documents can be analysed more accurately than by using either approach alone.

5.2.4. Summary of semantic approaches
The second approach would find documents to be related even if only a few of the paragraphs are related. This is the desired output; hashing the concepts from the entire document would not find these documents which have only some related paragraphs. The example showed a situation where new entities are introduced in a paragraph of document 3, with different entities in the last paragraph of document 4. The hash-per-paragraph approach finds the simhash values for the first few paragraphs to be identical, with only the last paragraphs being different, which it then discards.

Care will have to be taken when using this method, as the paragraphs in the documents may not always line up well. One document may talk about topic a in the first paragraph and topic b in the second, and a second document may do it vice versa. The comparisons between documents will need to check each paragraph against every paragraph in the other documents. The simhash values will be quite different between two paragraphs if one has an extra two entities, even with the rest of the paragraphs being very similar. This can cause an issue, and for this reason it is probably only necessary to find one or two paragraphs that have similar hash values between two documents, as more is fairly hard to achieve. Often the first paragraphs of documents sum up the main topics and these will be found to have similar hash values.

The frequency of entities will be considered by calculating semantic relatedness in two different ways. It was found that introducing an entity into the simhash calculation many times, if it occurred many times in the document, had a very large impact on the overall simhash value calculated. The documents will therefore be compared semantically both with tf-idf weighting and without.

A possible extension is to introduce broader/narrower concepts as determined by the Pingar API and taxonomy generator. This would be similar to the frequency issue in that the stronger relationships between concepts can be used with greater weighting in the simhash value. This will be considered as an extension to the project to achieve greater accuracy.

6. Optimizing the comparison of simhash chunks
The simhash implementation involves comparing a number of hash values between documents to find numbers that differ by a small number of bits. This is a process which uses a high level of computation and should be designed to be as efficient as possible. Simhash can be efficient in that the hash values can be ordered so that the minimum number of comparisons is carried out to find the related chunks. If every chunk is compared against every other chunk then the algorithm runs in O(n²). Documents 1 and 2 will need 6 comparisons to determine relatedness, as document 1 has 2 sentences and document 2 has 3 sentences. The suggested optimization will reduce this.
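The sort-and-rotate idea, which Section 6.1 describes in detail, can be sketched as follows. This is a simplified illustration (it keeps a small sorted-neighbour window and skips the bit-count pruning step), so like the original method it is heuristic: a rare match could be missed, and all names and thresholds are illustrative.

```python
import bisect

def rotate_left(value, bits=32):
    """Rotate a 32-bit value one bit to the left; pairwise bit differences are preserved."""
    return ((value << 1) | (value >> (bits - 1))) & ((1 << bits) - 1)

def close_chunks(chunk, others, n=3, bits=32, window=2):
    """Find hashes in `others` within n bits of `chunk` by sorting and checking
    only the neighbours of the insertion point, rotating once per bit so that
    differences in high bits eventually become differences in low bits."""
    found = set()
    pairs = [(h, h) for h in others]  # (rotated value, original value)
    c = chunk
    for _ in range(bits):
        pairs.sort()
        pos = bisect.bisect_left(pairs, (c, 0))
        # With similar high bits, near matches sit next to the insertion point.
        for rotated, original in pairs[max(0, pos - window):pos + window]:
            if bin(rotated ^ c).count("1") <= n:
                found.add(original)
        pairs = [(rotate_left(r, bits), o) for r, o in pairs if o not in found]
        c = rotate_left(c, bits)
    return found
```

For documents 1 and 2 this replaces the 6 exhaustive pairwise comparisons with a few sorted-window checks per rotation, which pays off as documents grow.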

6.1 Method for optimization
This is based on the method described by matpalm [7] but adjusted to fit this software. Each of the hashed chunks in a document must be checked against every other chunk in another document, but the number of comparisons can be reduced.

1. Remove chunks from document 2 that are very different
The first step is to count the number of bits set in the chunk from the first document currently being compared; call this x. Remove any chunks from the list of chunks from the second document that have more than x+n or fewer than x-n bits set, with n being the threshold for the number of bits of difference that is considered related. If, for example, a hashed chunk in document 2 has 26 bits set and the chunk from document 1 has 10 bits set, then the number of bits of difference is always going to be at least 16.

2. Order the list of hashed chunks for the second document
If the bits that differ between two hashed chunks are in the lowest few bits, then ordering the hash values will result in the similar chunks appearing next to each other in the list. Insert the hashed chunk from the first document into this list and remember its position. Figure 24 shows an ordered list of hash values; phrases (3,6) and (8,5) have ended up close together and both have a small bit difference.

Figure 24: Ordered list of hash values [7]

3. Find chunks close to the hashed chunk from the first document
Navigate through the list of chunks to find those that are within n bits of difference from the hashed chunk from the first document that has been inserted into the list. These are chunks that are closely enough related to the first chunk. Remove these chunks from the list of chunks from the second document.

4. Rotate each chunk one bit to the left
So far the method has only found chunks that differ in the lower bits of the hash value. By rotating all of the values one bit to the left, the difference between each of the values still remains intact. Figure 25 shows the rotated hash values with the bit difference the same as in the previous figure.

Figure 25: Bit difference of rotated hash values [7]

5. Repeat the steps of ordering and finding related chunks
Now repeat the steps of ordering the list of hashed chunks from document 2 and then removing the ones within n bits of difference of the hashed chunk from document 1. These chunks have a close enough relation to the hashed chunk from the first document.

6. Repeat steps 4 and 5
Rotate the bits in every chunk in the list by another bit to the left, sort the list and remove the closely related hash values. Repeat these steps as many times as there are bits in the hash values, i.e. rotate 32 times for 32-bit hash values.

7. Repeat steps 1-6 for each chunk from document 1
Repeat the process for each of the hashed chunks from the first document.

8. Repeat steps 1-7 for every other document in the collection
Repeat the earlier steps for every other document in the collection. So compare each chunk from document 1 with chunks in documents 2 to m, with m being the number of documents in the collection. Then compare every chunk in document 2 with chunks in documents 3 to m. Continue until chunks in documents m-1 and m have been compared.

7. Conclusion
Through its performance on a set of criteria, simhash was found to be the best performing of the similarity measures. Word frequency and simhash were both accurate in their classification; however, simhash was far more efficient in time taken and disk space used. A simple first version of simhash was introduced to discover how it worked and where it broke down. Pingar had provided an API and taxonomy generator which can be combined with the simhash method to fix most of the areas where the simhash algorithm struggled.

For document versions it was found that the best solution was to calculate the simhash value for each sentence within the original document, with the document summarised by a list of these hash values. A method introducing synonyms into the document text was tested and discovered to be of no great benefit to the algorithm, as the initial document text was satisfactory for finding versions of a document.

When classifying documents related semantically it was found that the best approach is to combine the extracted entities and their synonyms into a single simhash value for each paragraph within the document.
Introducing synonyms helps the algorithm to find related paragraphs that use different terms for the same ideas. Analysing the

document by paragraph rather than as a whole meant that a section of a document on a completely different topic would not stop the document being found to be related to another if the majority of the documents were similar. Documents will be analysed semantically twice, once with tf-idf weighting and once without, to improve the accuracy by including entity frequency. Broader/narrower relationships were considered but are left as extensions, time permitting.

An efficient method for comparing chunks was then introduced. This method was based on an algorithm introduced in the literature but modified to work with the nature of this simhash application. This improvement in efficiency helps make simhash a quick algorithm for classifying related documents.

8. References
[1] Fowke, M. (2013). Text categorization and analysis based on document history. Literature Review. Waikato University.
[2] (2010, August 23). Divorce of Tiger Woods and wife finalized. Short News.
[3] (2013, August 12). Golf: Jason Dufner claims first major title. Short News.
[4] (2013, August 7). Study: Walking, cycling to work may lower diabetes risk. Short News.
[5] Tjong Kim Sang, E. F., & De Meulder, F. (2003, May). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, Volume 4 (pp. 142-147). Association for Computational Linguistics.
[6] Charikar, M. S. (2002, May). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing (pp. 380-388). ACM.
[7] Matpalm. The simhash algorithm. Retrieved from http://matpalm.com/resemblance/simhash
[8] Huang, A., Milne, D., Frank, E., & Witten, I. H. (2008, December). Clustering documents with active learning using Wikipedia. In Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on (pp. 839-844). IEEE.
[9] Huang, A., Milne, D., Frank, E., & Witten, I. H. (2009). Clustering documents using a Wikipedia-based concept representation. In Advances in Knowledge Discovery and Data Mining (pp. 628-636). Springer Berlin Heidelberg.
[10] Knopp, J., Frank, A., & Riezler, S. (2010). Classification of named entities in a large multilingual resource using the Wikipedia category system (Master's thesis, University of Heidelberg).
[11] García-Molina, H., Gravano, L., & Shivakumar, N. (1996, December). dSCAM: Finding document copies across multiple databases. In Parallel and Distributed Information Systems, 1996. Fourth International Conference on (pp. 68-79). IEEE.
[12] Sood, S. (2011). Probabilistic Simhash Matching (Doctoral dissertation, Texas A&M University).
[13] Java. (2013). What is Java. Retrieved from www.java.com
[14] Wolfram MathWorld. (2013). Bell Number. Retrieved from http://mathworld.wolfram.com
[15] Ramos, J. (2003, December). Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning.

9. Appendix: Example documents

Document 1
Tiger Woods and his wife, Elin Nordegren, are reportedly divorced. According to their lawyers it became official in Bay County Circuit Court on Monday.
Woods and Nordegren have already commented on their divorce: "We are sad that our marriage is over and we wish each other the very best for the future."

Document 2
Tiger Woods has reportedly divorced his wife Elin Nordegren. According to their lawyers it was done officially on Monday in Bay County Circuit Court.
Woods and Nordegren commented already on their unfortunate divorce: "We are sad that our marriage is over and we wish each other the very best for the future."
Tiger has since been linked to another blonde woman in skier Lindsey Vonn. Vonn was spotted course side during the BMW championships.

Document 3
Jason Dufner finished bogey-bogey on the two most difficult holes on the Oak Hill Country Club course in New York to claim his first major golf title at the PGA Championship on Sunday.
The winning score was a 10-under 270, four shots better than the lowest score in the five previous majors at Oak Hill.
Two years ago in Atlanta, the 36-year-old had blown a five-shot lead and Keegan Bradley ended up winning the title.

Document 4
The PGA championship concluded in New York on Sunday with Jason Dufner winning his first major golf trophy.
Dufner won the tournament by 2 strokes over American Jim Furyk at the Oak Hill course with a leading score of 270, the best in five years.
Tiger Woods finished well down the field, which was frustrating for the hot favourite going into the event. It is now 5 years since Tiger won his last major title at the US Open in Torrey Pines.

Document 5
People who walk to work are 40 percent less likely to develop diabetes and 17 percent less likely to develop high blood pressure than those who drive, a new study by UK researchers suggests.
Of the adults who used private transport such as cars, motorbikes and taxis to get to work, 19 per cent were obese, compared to only 13 percent of those who cycled to work and 15 percent of those who walked.
"This study highlights that building physical activity into the daily routine by walking, cycling or using public transport to get to work is good for personal health," states study co-author Anthony Laverty.
