
Relating Articles Textually and Visually

Nachum Dershowitz, School of Computer Science, Tel Aviv University, Tel Aviv, Israel ([email protected])
Daniel Labenski, School of Computer Science, Tel Aviv University, Tel Aviv, Israel ([email protected])
Adi Silberpfennig, School of Electrical Engineering, Tel Aviv University, Tel Aviv, Israel ([email protected])

Lior Wolf, School of Computer Science, Tel Aviv University, Tel Aviv, Israel ([email protected])
Yaron Tsur, Department of Jewish History, Tel Aviv University, Tel Aviv, Israel ([email protected])

Abstract— Historical documents have been undergoing large-scale digitization over the past years, placing massive image collections online. Optical character recognition (OCR) often performs poorly on such material, which makes searching within these resources problematic and textual analysis of such documents difficult. We present two approaches to overcome this obstacle, one textual and one visual. We show that, for tasks like finding newspaper articles related by topic, poor-quality OCR text suffices. An ordinary vector-space model is used to represent articles. Additional improvements obtain by adding words with similar distributional representations. As an alternative to OCR-based methods, one can perform image-based search, using word spotting. Synthetic images are generated for every word in a lexicon, and word spotting is used to compile vectors of their occurrences. Retrieval is by means of a usual nearest-neighbor search. The results of this visual approach are comparable to those obtained using noisy OCR. We report on experiments applying both methods, separately and together, on historical Hebrew newspapers, with their added problem of rich morphology.

I. INTRODUCTION

Recent large-scale digitization and preservation efforts have made a huge quantity of images of historical books, manuscripts, and other media readily available over the Internet, opening doors to new opportunities, like distant reading.1

The task we tackle is topic detection – specifically, looking for newspaper articles that discuss the same ongoing event. Identifying related articles in historical archives can help scholars explore historical stories from a broad perspective, can provide visitors of archival websites with effective navigation – as in modern news sites – and can help compare the coverage of events between different publications.2 Unfortunately, optical character recognition (OCR) for older documents is not yet satisfactory, making it quite challenging to search within those images for something particular, limiting research options.3

We adopt the following nomenclature [1]: An event is “something that happens at a specific time and location”. A story is “a topically cohesive segment of news that includes two or more declarative independent clauses about a single event”; in our case, this is always a single newspaper article. A topic is “a set of news stories that are strongly related by some seminal real world event”. We use subject to refer to thematically related topics. For example [2], the eruption of Mount Pinatubo on June 15th, 1991 is an event, whereas the ensuing crisis and the aftermath caused by the cataclysmic event are part of the larger topic, and the subject is “natural disasters”. The following is one of the stories from that day (New York Times):

   Mount Pinatubo rumbled today with explosions that hurled ash and gas more than 12 miles high. . . . President Corazon C. Aquino dismissed a British newspaper report that the Americans had warned her of possible radioactive contamination if the volcano damaged nuclear storage sites at Clark Air Base. . . . She said the story was “baseless and purely fabricated.” . . .

We first consider the case where printed texts have been digitized, resulting in noisy OCR text. We approach the problem traditionally and then enhance it by using word embeddings to augment document representations. This helps solve the sparsity problem caused by language variation, morphology, and OCR noise. Augmentation proved to be helpful in finding more related articles, and proved to be even better than language-specific standardization methods.

As an alternative to working with OCR results, we propose a novel image-based approach for retrieval. Given a query image of a word, one seeks all occurrences of that same word within the dataset. To that end, we utilize a word-spotting engine that is both simple to implement and fast to run, making it extremely easy to incorporate for a wide range of resources. The method is completely unsupervised: words in the document images are unsegmented and unlabeled.

The next section summarizes some recent approaches to topic detection and to word spotting. It is followed by a section on the newspaper corpus we experimented with, its problematics, and the additional linguistic/visual resources and tools that we used. Section IV describes the textual method, and Section V describes the visual one. They are followed by experimental results for the two methods. We conclude with a brief discussion.

Research supported in part by Grant #I-145-101.3-2013 from the German-Israeli Foundation for Scientific Research and Development (GIF), Grant #01019841 from the Deutsch-Israelische Projektkooperation (DIP), Grant #1330/14 of the Israel Science Foundation (ISF), and a grant from the Blavatnik Family Fund. It forms part of D.L.'s and A.S.'s M.Sc. theses at Tel Aviv University.

1. http://www.nytimes.com/2011/06/26/books/review/the-mechanic-muse-what-is-distant-reading.html
2. A list of archives may be found at https://en.wikipedia.org/wiki/Wikipedia:LOONA.
3. See, for example, the famous newspaper article by Craig Claiborne (The New York Times, November 14, 1975), “Just a quiet dinner for two in Paris” (https://ia801009.us.archive.org/18/items/354644-claiborne/354644-claiborne.pdf, last item), which read as gibberish not too long ago (https://assets.documentcloud.org/documents/354644/claiborne.txt).

II. RELATED WORK

Topic Detection: The task we study is related to the field of topic detection and tracking (TDT). This research program, which began in 1997, aims to find core technologies for organizing streams of news articles by topic [3]. Event detection has recently regained popularity with the emergence of social media [4], [5]. We look for stories in a retrospective fashion on a static database (Retrospective Event Detection [RED]), unlike TDT, where one wishes “to identify new events as they occur, based on an on-line stream of stories” (New Event Detection [NED]). For example, a common way of approaching detection is by representing stories via a set of features and assigning every new story to the cluster of the most similar past one. Yang et al. [6] found that retrospective event detection can obtain much better results than the online setup.

Vector Representations: For solving such tasks, it is common to represent each story as a vector in a conventional vector space model [3]. Various term weighting schemes for combining term frequency (tf) and inverse document frequency (idf) can be employed [7]. We use this standard method for document representation and for measuring similarity.

Allan and Harding [2], [8] used a story representation of tf-idf weights of the 1000 most weighted terms in the document, computed the similarity of each story to every previous story, and assigned a story to the nearest neighbor's cluster if the similarity was above a threshold; otherwise, a new cluster was created containing the story. We take this basic approach. Other document representations and similarity measures have been suggested; for example, [9], [10] use language models for clusters to compute the probability that a new story is related to a cluster, and use that as a similarity measure.
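To make this scheme concrete, the following is a minimal sketch of tf-idf nearest-neighbor story clustering in the spirit of Allan and Harding [2], [8]. The use of scikit-learn's TfidfVectorizer and the 0.2 similarity threshold are illustrative assumptions of ours, not the actual engine or parameters used in this paper:

    # Sketch: assign each story to its nearest earlier story's cluster if
    # the cosine similarity clears a threshold; otherwise open a cluster.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def cluster_stories(stories, threshold=0.2):
        vectors = TfidfVectorizer().fit_transform(stories)  # one tf-idf row per story
        labels = [0]                        # the first story opens cluster 0
        for i in range(1, vectors.shape[0]):
            sims = cosine_similarity(vectors[i], vectors[:i])[0]
            nearest = sims.argmax()         # most similar earlier story
            if sims[nearest] >= threshold:
                labels.append(labels[nearest])    # join that story's cluster
            else:
                labels.append(max(labels) + 1)    # open a new cluster
        return labels

Since we work retrospectively on a static database, the whole collection can be vectorized at once, as in the sketch, rather than in an on-line stream.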
Lexical Expansions: Petrović and Osborne [4] use lexical paraphrases from various sources, including WordNet, to expand document vectors and overcome lexical variation in documents that discuss the same topic with different terms. Moran and McCreadie [5] show how to use word embeddings to improve accuracy by expanding tweets with semantically related phrases. Their method does not depend on complex linguistic resources like WordNet and still gets better results. This resource independence allows one to run it on low-resource languages. We use a similar approach to augment the document bag of words.

Morphological Expansions: Avraham and Goldberg [11] reported that, when using word embeddings in Hebrew, which is a morphologically rich inflectional language, word vectors bear a mix of semantic and morphological properties. Many neighboring word vectors are of inflected forms, rather than having semantic similarity. We leverage the morphological properties of word embeddings to expand terms with others sharing the same base form.

Noisy Texts: Agarwal et al. [12] experimented with texts from different resource types in order to understand how different levels of noise in the text harm text-classification success. They developed a spelling-error simulator and created synthetic texts with several levels of noise. They found that, with up to a 40% word error rate, the accuracy of the classification was not significantly affected. Vinciarelli [13], [14] reported similar results in text clustering and categorization (classification) tasks. We were encouraged by these results to work on topic detection despite the noisy OCR text.

Word Spotting: Whereas topic tracking in the field of NLP is widely researched, we suggest here a novel approach using images as sole input. This appears to be the first attempt to take an image-based approach to solving the topic-detection problem. The work of Wilkinson et al. [15] is somewhat related: they use a word-spotting engine for computing word clouds, whereas we apply spotting to topic detection and categorization.

Much progress has been made in recent years in the area of word spotting. We employ the method of [16] (see Section III); other recent methods include [17], [18], [19], [20], [21], [22], [23], [24].

III. RESOURCES

Corpus: Our experiments were performed on texts from the Jewish Historical Press (JPress),4 which offers full-text search of 124 publications (1.6 million pages) in 10 languages from a wide variety of Jewish communities. Newspapers were digitized, segmented into articles (with manual help), and processed by an OCR engine. Even though these technologies are mature and used heavily in many applications, they are not foolproof. JPress has to contend with certain phenomena that reduce the success of OCR and segmentation:5 “These phenomena include inferior quality of printing (which characterizes early publications), yellowing paper, marks or damage in the original printing, unique fonts, torn pages, scribbled-on pages, and even pages damaged by rodents.”

Technological limitations and poor-quality raw materials lead to word identification problems, where a word is not identified, or is identified incorrectly (characters were replaced or added), or words that were derived from noise in the image do not really exist. Another problem is incorrect segmentation, where a purported article contains more than one real article, or one real article is split into multiple ones. We will ignore the latter problem and regard a returned result as correct if any component of the segment matches.

Test Suite: The methods were evaluated primarily on a dataset comprising all issues of Davar from February 1981,6 for a total of 10881 images, divided into 3233 articles. We chose 111 query articles, some reporting a major continuous event, others reporting on a singular event with few or no related articles.

The average word error rate for this newspaper is about 20% [Ido Kissos, private communication], but is somewhat worse for the test suite. (For example, the error rate in the first two paragraphs of the lead article of the February 1 issue is 30/85 = 35%, and may even be considerably worse.) Only about 45% of the tokens produced by OCR of the test corpus are found in the MILA lexicon of modern Hebrew,7 on account of OCR errors, inflections, named entities, etc.

Lexicon: We make use of a Hebrew corpus (MILA) containing the protocols of Knesset sessions from January 2004 through November 2005.8 We compiled a list of all tokens appearing 10 or more times, removing non-Hebrew words. Out of these, we took the most frequent 100,000 words as our lexicon. For a visual lexicon, these were rendered as synthetic images in a standard font resembling most of the newspapers in the corpus.
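A short sketch of this lexicon construction, assuming a plain-text, whitespace-tokenized dump of the corpus (the file name and the Hebrew-letters test below are our own stand-ins):

    # Build the 100,000-word lexicon: tokens occurring 10 or more times,
    # with non-Hebrew tokens removed.  The path is a placeholder.
    from collections import Counter
    import re

    HEBREW = re.compile(r'^[\u05D0-\u05EA]+$')   # Hebrew letters only

    def build_lexicon(path='knesset.txt', min_count=10, size=100_000):
        counts = Counter()
        with open(path, encoding='utf-8') as f:
            for line in f:
                counts.update(line.split())
        kept = Counter({w: c for w, c in counts.items()
                        if c >= min_count and HEBREW.match(w)})
        return [w for w, _ in kept.most_common(size)]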
Morphological Tools: Hebrew is a morphologically rich language, with prefixes and suffixes attached to lemmas. Accordingly, we employ a Hebrew morphological analyzer, HebMorph,9 that associates a lemma with each word form found in the hspell word list [25].

Representation Tools: Distributional word-representation algorithms utilize the contexts in which words appear and their co-occurrence distributions to represent words. Their output is a mapping from words (and phrases) to real-number vectors. Recently, Mikolov et al. [26] presented a word-embedding tool, called “word2vec”, that uses shallow neural networks.10 It has gained popularity due to its efficiency and its success in representing words as vectors, and has been used in many NLP tasks to obtain state-of-the-art results.

Word Spotting: For the visual method, we try to find all occurrences of each word in the lexicon. We used an efficient word-spotting engine from [16], inspired by the work of Liao et al. [27] in the domain of face recognition. Training is unsupervised: no classifiers are used to learn similarities. The process starts by binarizing the newspaper images and forming bounding boxes around candidate target words. As the images are not pre-segmented into words, candidate words are oversampled. Each target is resized to a fixed-size rectangle and represented by image descriptors, followed by max-pooling. After spotting, only the best of multiple overlapping results is chosen.

Indexing the words of each newspaper page takes a few seconds, depending on the page's complexity. Retrieval requires a fraction of a second. Though this may not be the most accurate of spotting technologies, it is certainly one of the simplest, and it provides a level of accuracy within reach of state-of-the-art ones. Specifically, no optimization is performed during preprocessing or during querying, representations are easily stored in conventional data structures, and the whole pipeline is completely unsupervised. This contrasts with methods that employ techniques such as neural networks, SVM, or DTW.

4. http://jpress.org.il (Hebrew); http://web.nli.org.il/sites/JPress/English (English).
5. http://web.nli.org.il/sites/JPress/English/about/Pages/about_technology.aspx
6. http://web.nli.org.il/sites/JPress/English/Pages/Davar.aspx
7. http://mila.cs.technion.ac.il/resources_lexicons_mila.html
8. http://mila.cs.technion.ac.il/resources_corpora_haknesset.html
9. https://github.com/synhershko/HebMorph
10. https://code.google.com/archive/p/word2vec

IV. TEXTUAL METHOD

Given an article in a (historical) newspaper, we wish to find more articles relating to the same event(s). We do this by representing articles as vectors and finding the articles whose vectors are closest to that of the query. The main steps are the following: (A) text cleaning and preprocessing; (B) representing articles as vectors; (C) finding nearest neighbors of query vectors.

A. Text Preprocessing

OCR outputs many non-alphanumeric characters, because of noise in the image that is sometimes mistaken for numbers or the like. These spurious characters are unlikely to appear again in the same word in exactly the same location, and they cause the article vectors to become much sparser and harder to compare. So, we remove all non-alphanumeric characters, including punctuation, which has no impact on the semantics of the article. (In Hebrew, the only actual words affected by the removal of punctuation are acronyms, which have a double quotation mark in the penultimate position to signify the abbreviation; but all such acronyms map to the same words, and it is relatively rare for ambiguity to be introduced as a consequence.) Another problem that is common to OCR systems is poor word segmentation: some words are concatenated, while sometimes the output contains single letters as words – usually because of noise in the image. We remove all single-letter words, as there are none in Hebrew. On one particularly bad example, this cleaning process reduced the word error rate from 82% to 54%.
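A minimal sketch of this cleaning step; the regular expression is our own approximation of "non-alphanumeric":

    # Strip non-alphanumeric characters and drop single-letter tokens
    # (there are no one-letter words in Hebrew).
    import re

    def clean(text):
        text = re.sub(r'[^\w\s]', ' ', text)   # punctuation and image noise
        return ' '.join(t for t in text.split() if len(t) > 1)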
B. Representing Articles

Set of Words: This is the simplest baseline. Every document is represented by an unordered collection of its distinct words, and so the dimension of each binary vector is the same as the size of the vocabulary (the number of distinct words occurring in the corpus). In this scheme, all words in a document get the same weight, regardless of how many times each appears in the document and corpus.

Tf-Idf: This well-known representation is standard in information retrieval [7]; while simple, it gives good, competitive results. We used the square-root normalized version of tf-idf, as it performs well.

Morphological Augmentation: Adding lemmata to the vectors, while still keeping the original words, can increase the hit ratio when comparing documents. We use the Hebrew morphological tool to enrich vectors in this way. In case a token isn't recognized, we try to tolerate spelling errors and orthographical variants by deleting matres lectionis (optional vowel letters).

Augmenting with Word2vec: We leverage two attributes of word embeddings: words with different affixes but a shared lemma have similar vectors; and words with OCR recognition errors also have vectors similar to those of the actual words. The first attribute helps achieve generality similar to that of a morphological analyzer, but without any language-specific knowledge. The second can help counter the noise in the OCR output, but – unlike OCR post-processing for correcting errors – we do not need to replace the error with the correct word, but rather with a good representative that appears in other documents.

After running word2vec on the cleaned OCR'ed text, the following is performed for each word in the article (a code sketch appears at the end of this subsection):
1) Find the k nearest neighbors of the word (we used k = 20).
2) Filter the neighbors, keeping those with edit distance less than d (we used d = 3).
3) Add to the article's bag of words the most frequent neighbor, if its frequency is greater than the original word's.

We trained word2vec over all of the articles in the corpus, using word2vec's default configuration: continuous bag-of-words architecture, with the dimensionality of the word vectors set to 200, context window size set to 5, and the hierarchical softmax training algorithm.

Since we are matching stories about specific related events, not broad categories, we wanted to augment the documents' vectors only with syntactically similar words, and not with semantic paraphrases. So we filter out all neighbors that are at a large (standard) Levenshtein edit distance. Tests showed that 3 (additions/deletions/substitutions) is a good threshold, even though there still are a few cases where semantic neighbors (not sharing the root) pass through this filter.

Word2vec augments the vector with many mistaken OCR readings, as well as with words that have the same lemma, whereas the morphological analyzer only helps with the latter.
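One possible rendering of the augmentation loop above, assuming a trained gensim word2vec model (model) and a corpus frequency table (freq); the edit-distance import is one implementation among several:

    # Word2vec-based augmentation: for each word, take its k nearest
    # embedding neighbors, keep only near-spellings (edit distance < d),
    # and add the most frequent one if it outranks the word itself.
    from Levenshtein import distance   # any Levenshtein implementation works

    def augment(bag, model, freq, k=20, d=3):
        added = []
        for word in bag:
            if word not in model.wv:
                continue
            close = [w for w, _ in model.wv.most_similar(word, topn=k)
                     if distance(word, w) < d]
            if close:
                best = max(close, key=lambda w: freq.get(w, 0))
                if freq.get(best, 0) > freq.get(word, 0):
                    added.append(best)
        return bag + added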

Another advantage of word2vec is in augmenting named entities, which are frequent in the corpus, but which the lexicon does not cover and the analyzer cannot handle. The augmentation size of the word2vec method per word is 0 or 1, whereas the average augmentation size of the morphological method is 3, making the word2vec vector more compact and efficient, while getting better results.

Word-Vector Averaging: This method is a common baseline in sentence and document vector representation. It gives the centroid of a document's distributional word vectors, and is used to represent a document's topic. We found that using tf-idf weights for a weighted arithmetic mean of the word vectors performs better than uniform weighting, and we therefore only report results with frequency weighting.
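A sketch of this weighted-centroid representation, with model a gensim word2vec model and tfidf a word-to-weight mapping for the document (both assumed, not shown):

    # Document centroid: tf-idf-weighted mean of the word vectors.
    import numpy as np

    def centroid(words, model, tfidf):
        pairs = [(model.wv[w], tfidf[w])
                 for w in words if w in model.wv and w in tfidf]
        if not pairs:
            return np.zeros(model.vector_size)
        vecs, weights = zip(*pairs)
        return np.average(np.array(vecs), axis=0, weights=weights)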
C. Finding Neighbors

We use cosine similarity, popular in topic detection experiments [28], [29].

V. VISUAL METHOD

As in the textual approach, each article is represented as a vector, and one computes the distance between a query and all the articles in the dataset. Whereas it is quite straightforward to represent articles as vectors when assuming access to the text, we now work only with images as input. Since we cannot use the text itself to build a word list, we created a visual lexicon.

A. Method Description

We used the pre-computed Hebrew lexicon and generated synthetic images for each of its words. These synthetic images serve as queries and were fed to the word-spotting engine, which searched for visual matches within the images of the corpus. Based on these spotting results, we represent each article in the dataset using a term vector model combined with a traditional tf-idf weighting scheme. Finally, we retrieve related articles using cosine similarity for the nearest-neighbor search, as in the textual method.

First, we run word spotting to extract candidate words from the corpus, using the engine of [16]. The parameters can be divided into two groups: those concerning word detection in a binarized image, and those that govern the encoding process. The specific settings for the word detection were chosen based on a small sample: Connected components for candidate letters were in the range [35 .. 2000] pixels. The allowed height range was [4 .. 120] pixels, and width was [3 .. 280]. For candidate words, the bounding boxes' minimal height and width were 20 and 13 pixels, respectively. The maximal inter-word gap was 32 pixels and the maximal interline gap, 12. Candidates had a maximum bounding rectangle of 240×55 pixels. We also applied jittering, as described in [16], and the candidate-extraction stage resulted in 2,289,263 word candidates. The number of actual words is considerably smaller, due to both jittering and the overcompensating detection process, which suggests many word candidates for every real word.

The encoding pipeline resized each word candidate to a fixed 160×56 pixels, divided into a grid of 20×7 cells, each of 8×8 pixels. We used max pooling with a random partition of 3750 random exemplars into 250 subsets, each of cardinality 15. Each word is therefore represented by a vector in R+^250 containing pooled distances to random exemplars. The matrix containing the encoding for all word candidates fits in RAM and allows for efficient queries.

Each query image from the visual lexicon is considered separately and represented in the same way as candidate words. We tried different font sizes, and empirically found that large font sizes produce better spotting results; we therefore chose a size of 100pt. The query vector is compared to the corresponding vectors of all candidate targets extracted from the dataset. All candidates are ranked in accordance with the computed L2 distance. Then, for each query, we apply a threshold and consider only the set of candidates for which the visual similarity between the corpus candidate and the query passes a threshold τ. (The default was τ = 0.99.)

For better accuracy, we also disallow any candidate word image from appearing in more than one set of word matches. Out of all matches for the same bounding box, only the word with the highest rank is considered.

B. Article Representation

Applying a threshold enables us to count lexical occurrences in documents and treat our problem as an NLP task. As is usual, we use tf-idf. For term frequency, we compared three common variants: raw frequency of a term in a document, logarithmically scaled frequency, and square root of frequency. The inverse document frequency is the logarithmically scaled inverse fraction of the documents that contain the term.

Given an article, it is treated as explained above and represented as a vector using the tf-idf method, resulting in a vector in R^m, where m is the size of the lexicon. Nearest-neighbor search is then performed, using the cosine measure. The results are ranked in descending order and the top k results are returned.
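In outline, the candidate encoding described in Section V-A can be sketched as follows; descriptor() stands in for the image-descriptor computation of [16] (an assumption of ours), and the random partition is fixed once for the whole corpus:

    # Schematic encoding: distances to 3750 random exemplars, max-pooled
    # over a fixed random partition into 250 subsets of cardinality 15,
    # yielding a 250-dimensional nonnegative vector per word candidate.
    import numpy as np

    rng = np.random.default_rng(0)
    partition = rng.permutation(3750).reshape(250, 15)   # fixed subsets

    def encode(candidate, exemplar_descriptors):
        d = descriptor(candidate)            # placeholder for [16]'s descriptor
        dists = np.linalg.norm(exemplar_descriptors - d, axis=1)
        return dists[partition].max(axis=1)  # max-pool within each subset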

[Fig. 1 data: aligning articles across publishers – 151 query articles from Davar were manually searched for same-day parallel articles in the Maariv edition, and 59 articles had a match. Recall@1 by representation: TF-IDF 0.915, Hebrew Analyzer 0.898, Word2Vec NN 0.932. The ROC curve predicts whether the 1-NN is a parallel article or not, by the cosine similarity score.]

Fig. 1: ROC curve for locating parallel articles.

Fig. 2: The sensitivity of topic detection to the threshold τ. The graph shows precision while varying the threshold τ used to decide whether a spotted result is accepted.

VI. RESULTS

We evaluate the two methods, noisy OCR text and word spotting, on the same task of finding articles describing the same topic from different editions of the same newspaper.

The test is comprised of 111 articles chosen as queries, with the goal of retrieving all other items in the 3233-article test suite that refer to the same topic. For each query, the top 8 results were checked manually and labeled as one of: same topic, same subject (but not same topic), or unrelated. The average number of related events per query is 9.4, and the median is 7.

To measure recall, one needs a gold standard, listing all relevant articles for each query. The “gold” standard used below must be taken with a grain or more of salt, as it comprises all relevant articles found during any of the textual and visual experiments, but may miss a few relevant articles that never showed up.

A. Textual Method

As can be seen in Table I, the basic use of tf-idf vectors got about 90% of the related articles identified in the gold standard. The vectors augmented with Hebrew stems improved the results by another 2%, while the word2vec augmentation added 3% to the tf-idf baseline, beating the Hebrew analyzer. (Occasionally, there was no follow-up story, which is why precision even at rank 1 need not be perfect.)

Unsurprisingly, vector centroids did not give good results; they did, as should be expected, return articles on the same subject, rather than the same topic.

We performed additional tests on finding parallel articles (same day) in another newspaper (Maariv11) in the JPress archive. Of 59 events that appeared in both papers, tf-idf alone gave a precision of 91.5% for the top match; augmenting with word2vec gave 93.2%. The ROC curves for this experiment are shown in Fig. 1.
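For concreteness, the precision-at-k and recall-at-k figures reported in the tables can be computed as follows, where results maps each query to its ranked result list and gold to its set of related articles (the container names are ours):

    # Mean precision@k and recall@k against the pooled gold standard.
    def precision_recall_at_k(results, gold, k):
        precisions, recalls = [], []
        for q, ranked in results.items():
            hits = sum(1 for doc in ranked[:k] if doc in gold[q])
            precisions.append(hits / k)
            recalls.append(hits / len(gold[q]) if gold[q] else 1.0)
        return (sum(precisions) / len(precisions),
                sum(recalls) / len(recalls))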
B. Visual Method

Since our goal is to compare approaches, we report precision at k for the visual method, alongside the results of the OCR-based method. Table II shows the precision achieved by the visual detection system and its variants. As can be seen, normalization of tf-idf gives a sizable improvement compared to the baseline tf-idf. The log normalization variant seems to perform slightly better than root normalization (used in the textual method) and gives reasonable accuracy compared to the text-based approach, retrieving 84% of the possible related articles in the gold standard at rank 1, compared to 87% for the textual method. Overall, the textual method slightly outperforms the visual, across the board. See Table III.

We also present the results for categorizing by subject in a second column. As can be seen, for categorization the gap between the visual and text-based methods is much reduced.

We studied the effect of the visual-similarity threshold τ, which we integrated in the word-spotting engine, on both topic detection and categorization. Fig. 2 shows the effect on topic detection using log normalization. The systems are relatively robust with regard to this parameter. Within the range [0.990 .. 0.993], the dependency on the threshold is quite flat, and although there are many false positives, precision is high. Once τ goes above 0.993, not enough matches are accepted, so the number of false negatives is large and performance starts to degrade.

We also tried using the Hebrew morphological analyzer to augment terms, as we did in the textual experiments, but there was no noticeable improvement.

C. Combined Method

Combining the two methods, giving equal weight to the textual and visual vectors, after normalizing the distribution of scores of each method, makes for a small improvement in precision of the top 1 and 3 results. The combined method oftentimes promotes one additional correct match. See Fig. 3.

11. http://web.nli.org.il/sites/JPress/English/Pages/maariv_test.aspx
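A sketch of the combination, with z-scoring as one plausible way to normalize the two score distributions (the normalization method is not specified here, so this is an assumption):

    # Combine textual and visual similarity scores with weight alpha on
    # the textual side; alpha = 0.5 gives the equal weighting used here.
    import numpy as np

    def combine(textual, visual, alpha=0.5):
        def z(x):
            x = np.asarray(x, dtype=float)
            return (x - x.mean()) / (x.std() + 1e-9)
        return alpha * z(textual) + (1 - alpha) * z(visual)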

TABLE I: Textual method for topic detection.

Method                       Precision@1  Precision@3  Precision@5  Precision@8  Recall@8
Baseline (set of words)      0.77         0.66         0.58         0.48         0.54
Tf-idf (square root)         0.87         0.76         0.70         0.62         0.64
Word vector centroid         0.73         0.62         0.52         0.43         0.48
Morphological augmentation   0.89         0.78         0.71         0.65         0.73
Word2vec augmentation        0.90         0.80         0.72         0.64         0.72
Gold standard                0.96         0.89         0.82         0.71         0.79

TABLE II: Visual method for topic detection.

Method                       Precision@1  Precision@3  Precision@5  Precision@8  Recall@8
Tf-idf                       0.78         0.67         0.58         0.52         0.53
Tf-idf root normalization    0.83         0.70         0.62         0.54         0.55
Tf-idf log normalization     0.84         0.72         0.64         0.56         0.57
Gold standard                0.96         0.90         0.85         0.75         0.79

TABLE III: Comparative results for topic and subject detection.

                             Precision@1      Precision@3      Precision@5      Precision@8      Recall@8
Method                       Topic  Subject   Topic  Subject   Topic  Subject   Topic  Subject   Topic  Subject
Visual                       0.84   0.98      0.72   0.94      0.64   0.91      0.56   0.89      0.57   0.33
Textual (no augmentation)    0.87   0.99      0.76   0.95      0.70   0.93      0.62   0.89      0.64   0.34
Gold standard                0.96   1.00      0.90   0.98      0.85   0.94      0.75   0.90      0.79   0.38

Fig. 3: Combining the methods, ranging from the purely visual on the left (α = 0) to purely textual on the right (α = 1), and measuring Precision@k, for k = 1, 3, 5, 8. The textual vector is given weight α and the visual, 1−α.

VII. DISCUSSION

We have explored different ways of enriching historical datasets with language-based tools that can be incorporated in utilities to help in studies that depend on the huge mass of historical information now available in digital form. We showed how historical Hebrew newspaper articles can be connected together into a continuous story with high precision, despite poor-quality OCR.

To find related articles in the OCR'ed newspapers, we use a traditional tf-idf vector representation, which finds related articles by using nearest-neighbor search with a cosine similarity measure. It proved to work well, even though the input was very noisy and the language highly inflectional and morphologically rich. We found that affix removal and augmentation with lemmas improve precision. More importantly, we showed how to achieve even better results without any language-specific resources, by using word-embedding similarities.

This suggests that word embeddings should also be utilized in other information retrieval tasks, and could be used for term standardization instead of the usual lemmatization. For noisy OCR texts, this method can be used for post-processing error correction, and also to help find named entities that are not read correctly by OCR.

As an alternative to working with noisy text – so long as quality OCR technologies are still lacking for dealing with handwritten manuscripts and historical newspapers and documents – we have demonstrated that word-spotting technologies are up to the challenge.

Our visual alternative does not require any labeling or ground truth, and has the advantage that document images are its sole input; OCR is not needed at all. We have seen the feasibility of this image-based approach, and showed that it performs almost as well as a text-based system. On the other hand, the more we expand its vocabulary, with named entities for example, the better we should expect the visual method to perform.


We used an efficient and relatively flexible underlying word-spotting engine. This flexibility implies that word spotting can be very useful in practice – either as is, or as a plug-in method that can provide a list of candidates to another system. Since no optimization is necessary and the whole process is unsupervised, it can easily be adapted to different types of datasets of imaged texts.

Finally, we have seen that categorization works well, too, which suggests first running a classifier and using only subject-related articles for topic and event detection.

REFERENCES

[1] R. Nallapati, A. Feng, F. Peng, and J. Allan, “Event threading within news topics,” in Proc. Thirteenth ACM International Conference on Information and Knowledge Management (CIKM ’04). New York, NY: ACM, 2004, pp. 446–453.
[2] J. Allan, J. G. Carbonell, G. Doddington, J. Yamron, and Y. Yang, “Topic detection and tracking pilot study – Final report,” in Proc. Broadcast News Transcription and Understanding Workshop, 1998, pp. 194–218.
[3] J. Allan, Introduction to Topic Detection and Tracking. Boston, MA: Springer, 2002, pp. 1–16.
[4] S. Petrović, M. Osborne, and V. Lavrenko, “Using paraphrases for improving first story detection in news and Twitter,” in Proc. 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT ’12). Stroudsburg, PA: Association for Computational Linguistics, 2012, pp. 338–346.
[5] S. Moran, R. McCreadie, C. Macdonald, and I. Ounis, “Enhancing first story detection using word embeddings,” in Proc. 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’16). New York, NY: ACM, 2016, pp. 821–824.
[6] Y. Yang, T. Pierce, and J. Carbonell, “A study of retrospective and on-line event detection,” in Proc. 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98). New York, NY: ACM, 1998, pp. 28–36.
[7] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.
[8] J. Allan, S. Harding, D. Fisher, A. Bolivar, S. Guzman-Lara, and P. Amstutz, “Taking topic detection from evaluation to practice,” in Proc. 38th Annual Hawaii International Conference on System Sciences (HICSS ’05) – Track 4 – Volume 04. Washington, DC: IEEE Computer Society, 2005, pp. 101a–101a.
[9] J. P. Yamron, S. Knecht, and P. Van Mulbregt, “Dragon’s tracking and detection systems for the TDT2000 evaluation,” in Proc. Topic Detection and Tracking Workshop, 2000, pp. 75–80.
[10] M. Spitters and W. Kraaij, “Using language models for tracking events of interest over time,” in Proc. Workshop on Language Models for Information Retrieval (LMIR 2001), 2001.
[11] O. Avraham and Y. Goldberg, “Word embeddings in Hebrew: Initial results,” in Israeli Seminar on Computational Linguistics (ISCOL), Raanana, Israel, 2015.
[12] S. Agarwal, S. Godbole, D. Punjani, and S. Roy, “How much noise is too much: A study in automatic text classification,” in Seventh IEEE International Conference on Data Mining (ICDM 2007), Oct. 2007, pp. 3–12.
[13] D. Grangier and A. Vinciarelli, “Noisy text clustering,” Idiap Research Institute, Martigny, Switzerland, Tech. Rep. IDIAP-RR 04-31, Dec. 2004.
[14] A. Vinciarelli, “Noisy text categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1882–1895, 2005.
[15] T. Wilkinson and A. Brun, “Visualizing document image collections using image-based word clouds,” in Proc. 11th International Symposium on Advances in Visual Computing (ISVC), Part I, Las Vegas, NV, ser. Lecture Notes in Computer Science, vol. 9474. Cham, Switzerland: Springer, 2015, pp. 297–306.
[16] A. Kovalchuk, L. Wolf, and N. Dershowitz, “A simple and fast word spotting method,” in Proc. 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), Crete, Greece, Sep. 2014, pp. 3–8.
[17] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke, “A novel word spotting method based on recurrent neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, 2012.
[18] B. Gatos and I. Pratikakis, “Segmentation-free word spotting in historical printed documents,” in International Conference on Document Analysis and Recognition (ICDAR), 2009.
[19] J. A. Rodríguez-Serrano and F. Perronnin, “Local gradient histogram features for word spotting in unconstrained handwritten documents,” in International Conference on Frontiers in Handwriting Recognition (ICFHR), 2008.
[20] S. España-Boquera, M. Castro-Bleda, J. Gorbe-Moya, and F. Zamora-Martinez, “Improving offline handwritten text recognition with hybrid HMM/ANN models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[21] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, “Handwritten word spotting with corrected attributes,” in Proc. International Conference on Computer Vision (ICCV), 2013.
[22] T. M. Rath and R. Manmatha, “Features for word spotting in historical manuscripts,” in International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 218–222.
[23] A. Fischer, A. Keller, V. Frinken, and H. Bunke, “HMM-based word spotting in handwritten documents using subword models,” in Proc. 20th International Conference on Pattern Recognition, Aug. 2010, pp. 3416–3419.
[24] L. Rothacker, M. Rusiñol, and G. A. Fink, “Bag-of-features HMMs for segmentation-free word spotting in handwritten documents,” in Proc. 12th International Conference on Document Analysis and Recognition (ICDAR), Aug. 2013, pp. 1305–1309.
[25] N. Har’El and D. Kenigsberg, “Hspell – the free Hebrew spell checker and morphological analyzer,” in Israeli Seminar on Computational Linguistics, Dec. 2004.
[26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[27] Q. Liao, J. Z. Leibo, Y. Mroueh, and T. A. Poggio, “Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines?” Computing Research Repository (CoRR), vol. abs/1311.4082, 2013. [Online]. Available: http://arxiv.org/abs/1311.4082
[28] J. Han, Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann, 2005.
[29] A. Huang, “Similarity measures for text document clustering,” Christchurch, New Zealand, 2008.
