
Cinema Information Retrieval from Wikipedia

John Ryan
Courant Institute of Mathematical Sciences
251 Mercer St., New York, NY 10012
[email protected]

Abstract

In this paper, we explore various methods for information retrieval from a text corpus and how they compare when used to find the common link between several elements of a query. We do so with several experiments on a corpus of Wikipedia articles on movies and actors in cinema.

1 Introduction

Perhaps the best way to reference a movie, having forgotten the title, is to hint at it by listing starring actors. "I can't remember the exact name, but it's the one with Jennifer Lawrence and Bradley Cooper." Conversely, for lack of a name to place to a face, one may say "she was in Annie Hall and The Godfather," and hope that someone else can name the actress. Nowadays, this problem is quickly solved by a search engine - for example, a Google search of "Annie Hall The Godfather" returns the IMDB page and the Wikipedia page for Diane Keaton within the first three results. The search has found the common link between the two different entities of the same category which we inputted.

Suppose we'd like to perform this task offline on a corpus with an algorithm of our own. Jurafsky and Martin (2000) suggest using a TF-IDF vector space model and evaluating relevancy by cosine similarity. We will implement this system, and we will compare the results when we change the term weighting, and when we change the pairwise similarity metric.

2 Problem Statement

Our corpus is about 4,300 Wikipedia articles (converted to ASCII-only .txt documents). Around twenty percent of these articles describe actors, and the rest describe movies. The goal of our system will be to take as input a query of several elements of the same category (in this case, either actors or movies) and output the common link between those elements in a different category. For example, the system should respond to the input "Annie Hall The Godfather" with "Diane Keaton," and the input "Jennifer Lawrence Bradley Cooper" with "Silver Linings Playbook" (or "American Hustle").

2.1 Corpus

The corpus contains about 4,300 documents; 1,000 correspond to actors and 3,300 to movies. The documents are plaintext files obtained by downloading the Wikipedia articles' .html files and using the textutil tool in the Mac OS X terminal to convert from .html to .txt. Since the vectorizers require ASCII characters, a Java program using the Normalizer class was used to reduce non-ASCII characters to their ASCII equivalents where possible (so that, for example, "Renée Zellweger" would become "Renee Zellweger" instead of "Rene Zellweger"), and to get rid of all other non-ASCII symbols. Finally, everything below the "References" section header in each article (every Wikipedia page has one) was deleted to save space and keep things clean.
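For illustration, the same accent folding can be done in a few lines of Python with the standard unicodedata module (the preprocessing described above used a separate Java program with the Normalizer class; this is only an equivalent sketch):

    import unicodedata

    def to_ascii(text):
        # Decompose accented characters into base letter + combining mark, then
        # drop anything outside ASCII, so "Renée Zellweger" becomes
        # "Renee Zellweger" rather than "Rene Zellweger".
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")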
2.2 Vectorizer

Since the system will evaluate relevancy by comparing vectors with some pairwise metric, our first task is to decide how to represent queries and documents as vectors with real entries. In this case, we will consider the entries of the vectors (i.e. the tokens) to be n-grams in the vocabulary of the corpus.

The clear first choice for a vector representation is contained within sklearn's feature extraction library, and it is given by the CountVectorizer. In this model, the value for a given token in the vector of a query or document is just its frequency. For example, if we are using CountVectorizer with n = 1, then the vector for "foo foo bar bar bar" would look like {2, 3} (where the remaining entries for other tokens in the corpus would be 0).

Our second choice for vector representation is analogous to our first, and it is given by the HashingVectorizer. In this model, the entry for a given n-gram is again its frequency in the document, but now it is computed using a hash table. How is this different? On the one hand, our computations are sped up, since determining whether we have seen a certain token before now takes constant time. However, we risk having two tokens map to the same entry (a collision in the hash table), making the vector slightly misrepresentative. According to the documentation given by sklearn, this is not a big problem in general; however, we will see that our system's success may be affected by these collisions, undoubtedly owing to the size of our corpus.

The third and final choice for vector representation which we will test in our implementation is that of the TfidfVectorizer. The key difference in this model is that terms which appear in only a few documents are given much more weight. We expect this to be a much more effective model than the other two, as it is the only one we will use which weights a term with respect to both its document and the corpus. An example of the power of this weighting is found by considering the query "Ellen Page Marion Cotillard." Whereas the CountVectorizer and HashingVectorizer would mistakenly assign more relevancy to documents containing multiple occurrences of "page" and to articles for other actresses named Ellen, the TfidfVectorizer would recognize that "page" and "ellen" are much more common in the corpus than "marion" or "cotillard," and thus would assign relevancy more strongly to documents containing the latter terms. We expect that this impact will also be notable in our data when our input includes a movie containing the token "american," such as "American Hustle."

One might argue that cases such as those described in the previous paragraph will be fixed when n is set to 2. For example, the system wouldn't fall into the trap of considering "Ellen DeGeneres" when the token is "Ellen Page" instead of "Ellen". However, suppose we use a query such as "John Ratzenberger Billy Crystal" when looking for a certain Pixar film. When n = 2, we are saved from assigning relevancy to all the Johns and Billys in the corpus; however, as far as Pixar movies are concerned, the token "John Ratzenberger" is redundant, since he has been in every single Pixar movie. It would help if, recognizing that Billy Crystal is more helpful in distinguishing the movie, his name were given more weight.
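As a rough sketch of how the three representations can be built with sklearn (the parameter choices, such as reading documents by filename, are our own illustration rather than a record of the original program):

    from sklearn.feature_extraction.text import (CountVectorizer,
                                                  HashingVectorizer,
                                                  TfidfVectorizer)

    n = 1  # n-gram size; the experiments use n = 1 and n = 2
    count_vec = CountVectorizer(input="filename", ngram_range=(n, n))
    hash_vec = HashingVectorizer(input="filename", ngram_range=(n, n))
    tfidf_vec = TfidfVectorizer(input="filename", ngram_range=(n, n))

    # filenames lists the ~4,300 .txt articles; fit_transform returns a sparse
    # matrix with one row per document and one column per n-gram (or per hash
    # bucket in the HashingVectorizer case).
    doc_matrix = tfidf_vec.fit_transform(filenames)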
2.3 Comparing Vectors: Pairwise Metrics

In our implementation, we will compare results from two different metrics for pairwise similarity of vectors. The motivation for such techniques is that we'd like a quick way to decide how similar a document is to a query by analyzing the difference between the two vector representations in the multi-dimensional vector space.

The Cosine Similarity metric is solely concerned with the angle between the two vectors. We know that, for vectors x and y,

    x y^T = |x| |y| cos θ

where θ is the angle between the two vectors. Thus, to find where the angle is small, we only need to find where

    x y^T / (|x| |y|)     (1)

is close to 1 (this is algorithmically easy). Thus, if x and y plugged into (1) give 0.5 and x and y' give 0.75, then we believe that document y' is more relevant to document x than document y. We note that if x and y are normalized to begin with, then (1) is equivalent to

    x y^T     (2)

This leads us to the second pairwise metric we will use to evaluate query/document relevance: the Sigmoid Kernel. Starting with normalized x and y, the Sigmoid Kernel is

    tanh(γ x y^T + c0)

where γ is known as the slope and c0 as the intercept (these are chosen for us by sklearn). Again, we are looking for x and y which make this as close to one as possible.

The Sigmoid Kernel is included in the experiments detailed in this paper and in the final program because it produced the most interesting results of the pairwise metrics provided by sklearn.
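Both metrics are available in sklearn.metrics.pairwise. The following is a minimal sketch of ranking a query against the corpus, assuming for illustration that the query has been stacked as row 0 of the same matrix as the documents, and that titles holds the article titles in document order (both variable names are ours):

    from sklearn.metrics.pairwise import cosine_similarity, sigmoid_kernel

    # Assumed setup: matrix = vectorizer.fit_transform([query] + documents),
    # so row 0 is the query and the remaining rows are the corpus documents.
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    # Alternatively, with the slope gamma and intercept coef0 left at sklearn's defaults:
    # scores = sigmoid_kernel(matrix[0:1], matrix[1:]).ravel()

    # Print the ten highest-scoring documents, labelled 10 (best) down to 1.
    top_ten = scores.argsort()[::-1][:10]
    for rank, idx in enumerate(top_ten):
        print(10 - rank, titles[idx])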

3 Experiments

Since we will be working with two different pairwise metrics and three different vector space models, we have six different techniques by which to extract information from the corpus so as to answer a user's query. Keeping in mind that the goal is to identify the common link among elements of the user's query, we will ask random students at New York University for sample queries along with the intended common link. The reason for picking random students (as opposed to other computer science students) is to attempt to best emulate the average Google user and his/her lack of understanding of the underlying algorithms and of which queries will make things harder for the system. With at least 40 such queries, we will begin testing.

The test works by having a user pass the system a query, to which the system responds with ten guesses as to the common link between the elements in the query. For simplicity, we consider sequels containing the same cast to be identical to originals (for example, a guess of "The Matrix Revolutions" is as good a guess as "The Matrix" when given a query containing actors in both movies). A sample output is given in Figure 1.

    What is the query?
    steve martin,
    john candy,
    laila robins
    Calculating Tfidf Matrix
    Using TFIDF with Cosine Similarity
    10 Planes, Trains and Automobiles
    9 Steve Martin
    8 Nothing but Trouble (1991 film)
    7 Father of the Bride (1991 film)
    ...

Figure 1: Sample output

Because we are intentionally asking the system for at least 9 incorrect results, not much is found by evaluating F-score in the usual manner. Instead, we will evaluate the different techniques by awarding points for each correct guess based on its position in the output. For example, in Figure 1, the technique earned itself 10 points because the user was thinking of "Planes, Trains, and Automobiles." If the user had been thinking of "Father of the Bride," that would have been worth 7 points. If the user's intended link doesn't appear on the list, that is worth 0 points. After the 40 rounds of testing, the points will be summed up and compared.

3.1 Results

The experiments were successful, and the results match what we predicted in some cases, and surprise us in others. Before discussing the performances of the information retrieval techniques, we make some comments on the execution of the experiments.

The first aspect worthy of comment is the runtime. Calculating the matrices for the vectors took about 11 seconds for n = 1, regardless of the vectorization method. When n = 2, the CountVectorizer and TfidfVectorizer took 55 seconds and the HashingVectorizer took 25 seconds. It must be the case that an effective search engine would preprocess these values when it crawls the web for pages to return. The pairwise metric calculations and subsequent ranking of the vectors then took less than a second, a speed surely owing to the algorithmic efficiency mentioned earlier.

Figure 2 contains the points data for using Cosine Similarity as the pairwise metric when trying to find the movie in which all actors in the query appeared. The maximum possible score would be 200, and would have been achieved if the intended movie had been the system's first guess for all 20 rounds.

[Figure 2: Results for the six techniques]

Many aspects of these results were expected. In both the 1-gram and 2-gram cases, using TFIDF for the vector representations resulted in the most success. Furthermore, the scores didn't vary between using CountVectorizer and HashingVectorizer. However, something quite unpredictable happened during the testing that the graph in Figure 2 doesn't represent. In many cases where the other five techniques failed, using the Sigmoid Kernel with the CountVectorizer succeeded greatly, and in other cases the reverse happened. One example of the latter (drawn from the system's output files, all of which are included in the package corresponding to this paper) is when a user queried "daniel craig eva green mads mikkelsen," referring to the movie "Casino Royale." For that input, the system was able to correctly identify the movie under the other five techniques, earning 8 points for each of those methods; the Sigmoid Kernel technique, however, was wrong in all of its top ten guesses, earning 0 points.

On the other hand, when passed the query "marion cotillard joseph gordon-levitt ellen page" (referring to "Inception"), whereas the TFIDF techniques earned one point each and the Hashing techniques and the Count technique with Cosine Similarity earned 0 points, the Count technique with Sigmoid Kernel earned 9 points. Due to the counterintuitive nature of these results, the code was double-checked (the code is the same across all techniques with a few words changed) and the tests were run again several times; the results were validated.

[Figure 3: Results which we didn't expect]

Furthermore, when passed the query for "Casino Royale," the system was still able to come up with relevant results, such as "Skyfall," "Spectre," and "The Golden Compass" (shown in Figure 4); each of these guesses is correct for exactly one of the three actors mentioned in the query. On the other hand, when passed the query for "Inception," most of the guesses featured only one of the three actors.

    Using Count with sigmoid kernel
    10 Skyfall
    9 The Hunt (2012 film)
    8 Sin City: A Dame to Kill For
    7 300: Rise of an Empire
    6 Cowboys & Aliens
    5 The Golden Compass (film)
    4 Kingdom of Heaven (film)
    3 Spectre (2015 film)
    2 The Salvation (film)
    1 Defiance (2008 film)

Figure 4: Failure to find "Casino Royale"

Our suspicion is that the weakness in the methods in these cases is due to the fact that, given a query like "dicaprio winslet," the system believes (except in the TFIDF cases) that "dicaprio dicaprio," "dicaprio winslet," and "winslet winslet" are of equal relevance to the query (this generalizes easily to n-grams). Therefore, articles starring one of the actors, but mentioning him/her very many times, will be considered over the target article, which, although very unique in that it mentions all the actors, may only mention them a few times.

4 Future Work

There are a number of areas for future development. One problem with our current model is that 2-gram analysis on a query containing one-word entities will perform undesirably. For example, the query "prince of egypt hugo searching for bobby fischer" will not be helped by the presence of "hugo", since the system will stupidly look for "egypt hugo" and "hugo searching". A potential solution would be to vectorize with respect to both 1-grams and 2-grams, but then we might run into problems with "Hugo Weaving". This conundrum deserves more thought.
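The combined 1-gram/2-gram vectorization suggested above is directly expressible with sklearn's ngram_range parameter; a minimal sketch, with query and documents as placeholder variables:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Index unigrams and bigrams together, so a one-word title such as "hugo"
    # keeps its own feature alongside bigrams like "bobby fischer".
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    matrix = vectorizer.fit_transform([query] + documents)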
Currently, the program computes the matrices of vectors again for each query (since the first vector of the matrix represents the query, the matrix changes marginally for each query). It would speed up testing a great deal to have the matrices computed once, before all of the queries, and then updated for each query. Furthermore, higher-level n-grams would be reasonable to implement. Although this sounds simple, there are a few reasons it's a nontrivial task (a possible approach is sketched after the list).

1. The vector representing the query needs to be the same shape as the matrix. For example, if there are 1000 tokens in the corpus, the size of the vector for the query needs to be 1x1000.

2. If the query contains a token that doesn't appear in the corpus, the matrix needs to have a column added (or the word should be disregarded).
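One way to satisfy both points, assuming we are content to disregard unseen query tokens, is to fit the vectorizer on the corpus once and then map each incoming query into the existing feature space with transform(); the sketch below uses a placeholder documents variable and is a suggestion rather than the current program's behaviour:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)   # computed once, before any query

    def top_guesses(query, k=10):
        # transform() reuses the fitted vocabulary, so the query vector has the
        # same number of columns as doc_matrix and unseen tokens are ignored.
        query_vec = vectorizer.transform([query])
        scores = cosine_similarity(query_vec, doc_matrix).ravel()
        return scores.argsort()[::-1][:k]   # indices of the k most relevant documents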

Furthermore, although we specifically concentrated our examples in cinema, the only detail of our corpus which we used to our advantage is that our desired system output is the title of one of the documents. Therefore, the problem could be generalized to any such corpus. Examples would include identifying artists by their works, researchers by their discoveries, and more.

Finally, our 2-gram analysis was flawed in that, as a result of using the sklearn defaults, we would split a query like "marion cotillard tom hardy" into "marion cotillard", "cotillard tom", "tom hardy". While the presence of "cotillard tom" affects neither our runtimes nor our results in our experiments, we would run into problems with a query like "elton john wayne newton". A potential solution would be to have the user input the elements of the query separately.

Acknowledgments

The author would like to thank Adam Meyers for his help and support and the survey subjects for their creative contributions.

References

Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall, 2000. Print.

Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. Beijing; Cambridge, Mass.: O'Reilly, 2009. Print.

Pedregosa, Fabian, et al. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research 12 (2011): 2825-2830.