
NHL Player Search Engine

Team: Matt Verzak

Users and Uses

The NHL (National Hockey League) Player Search Engine lets users enter a query on the command line and retrieves links to the pages on the NHL website [1] of the players who are most relevant to that query. A query may contain any terms.

The system lets users find players from keywords, whether those keywords are facts they already know about a player or simply topics they find interesting. For example, a user who knows the name of a minor league team that a player played for can search on that team and find several other players who played for it. Adding a year to the search can surface that player's teammates on the minor league team.

The most obvious users of the system are fans of NHL teams, who could use it to explore connections between players. They could also search for their favorite team and get a list of its present and former players. For example, running a search on “Chicago Blackhawks” returns mostly current Chicago players but also includes several players for the Jets. These players all played for Chicago in the 2009-2010 season, when the team won the Stanley Cup. This was a very important season for Chicago, and the NHL Player Search Engine helps users discover more about it.

NHL commentators would likely find the NHL Player Search Engine useful as well. During games, commentators often remark on opposing players' histories with each other. A very common example is that players on opposing teams played on the same team at some point in their careers. The 2013-2014 NHL season included a break for the Winter Olympics in Sochi, and commentators often remind viewers of different players' participation in the Olympics. In a game between the Chicago Blackhawks and the Pittsburgh Penguins, commentators pointed out that the captains of both teams seemed to get along well as teammates on Team Canada in Sochi and a few weeks later were arguing with each other. Running a search for “Team USA Chicago Blackhawks Pittsburgh Penguins” reveals that two teammates on Team USA in Sochi were also on opposing teams in this game.

Many fans will remember these connections between players on elite Olympic hockey teams such as Team Canada, Team USA and Team Russia without the NHL Player Search Engine, because the 2014 Sochi Olympic ice hockey games were televised and widely watched. However, fans in the United States may not be familiar with some of the less popular Olympic teams. For example, running a search for “Finland” in the NHL Player Search Engine returns many of the players who played for Finland in Sochi, and one can discover that a single NHL team had three players on Team Finland.

Commentators likely remember which NHL players played in the Olympics and for which team. However, even they rarely know each player's history in the junior leagues. The NHL Player Search Engine can be used to find players who played for the same junior team. For example, running a search on the name of a junior hockey team returns a list made up mostly of former players of that team who currently play in the NHL. Notably, several of the players returned by one such search are now considered stars on their NHL teams.

Existing Technologies

Most of the functionality of the NHL Player Search Engine is to an extent supported by existing technologies. However, the NHL Player Search Engine is an improvement over existing technologies for certain applications.

Google is a general-purpose search engine that can accomplish many of the tasks that the NHL Player Search Engine targets. However, when users simply want to find players based on keywords, Google is not very efficient at retrieving results they will find useful. For example, running the “Team USA Chicago Blackhawks Pittsburgh Penguins” query in Google returns mostly news articles about the game the two teams played against each other following the Olympics and about the teams in general. Towards the bottom of the first page there are some Wikipedia articles, but none of them are for the relevant players.

The website for the National Hockey League includes a search function for players, but users can only search by player name or current NHL team. The NHL Player Search Engine project aims to expand upon this very basic search.

Implementation

The NHL Player Search Engine is composed of two parts. The first part is a basic web crawler that downloads all relevant documents from player pages on the NHL website. The second part is a search engine that builds an in-memory inverted index on those documents and uses that index to run queries over them efficiently. Python and JavaScript with PhantomJS [2] were used to build this system.

The first part is a basic web crawler that crawls player pages on the NHL website. The relevant pages all generate their content with JavaScript, so the open-source software PhantomJS [2] is used to load them. PhantomJS is a headless browser: it loads a page and runs its JavaScript without a display, which makes the dynamic content of a website retrievable. Each instance of the crawler obtains data for one NHL team. It starts at the page for that team, collects the link to the page of each current player on the team, then visits each player page and saves the player biography to a local directory. The crawling part of the system as a whole simply creates one crawler instance for each team in a known list of teams, which makes it possible to run the crawlers in parallel.
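The one-crawler-per-team structure described above can be sketched as follows. This is only an illustration in Python, not the project's PhantomJS code: `fetch_roster` and `fetch_bio` are hypothetical callables standing in for the PhantomJS page loads, and the file naming scheme is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-ins for the PhantomJS page loads.
FetchRoster = Callable[[str], List[str]]  # team name -> player page URLs
FetchBio = Callable[[str], str]           # player page URL -> biography text


def crawl_team(team: str, fetch_roster: FetchRoster, fetch_bio: FetchBio,
               out_dir: Path) -> int:
    """Save one biography file per current player of `team`; return the count."""
    team_dir = out_dir / team
    team_dir.mkdir(parents=True, exist_ok=True)
    urls = fetch_roster(team)
    for i, url in enumerate(urls):
        # Assumed naming scheme: player_<position in roster>.txt
        (team_dir / f"player_{i}.txt").write_text(fetch_bio(url), encoding="utf-8")
    return len(urls)


def crawl_all(teams: Iterable[str], fetch_roster: FetchRoster,
              fetch_bio: FetchBio, out_dir: Path,
              workers: int = 4) -> Dict[str, int]:
    """Run one crawler per team in parallel; return biographies saved per team."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            team: pool.submit(crawl_team, team, fetch_roster, fetch_bio, out_dir)
            for team in teams
        }
        return {team: future.result() for team, future in futures.items()}
```

Because each crawler writes only to its own team directory, the instances share no state and can run side by side without coordination.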

The second part consists of an inverted index and a basic user interface for running queries. Experimentation showed that the inverted index over all of the player biographies fits entirely in memory, eliminating the need for a database. The inverted index is constructed only once, when the program starts, and is then used throughout the life of the program. Users experience a short wait when the program first starts, followed by very efficient query execution.

Challenges with Implementation

The process of implementing the NHL Player Search Engine presented many challenges. They included conceptual difficulties as well as difficulties working with some of the technologies used.

The greatest conceptual challenge of this project was grasping the idea of an inverted index. This data structure is not the first that comes to mind when thinking about how to represent the documents in memory. However, after observing how documents are ranked based on a query, it makes sense that the preferred data structure for doing so is the inverted index. With an inverted index, it is easy to go through each term in a query, find it in the index and calculate scores for only documents that contain it (rather than iterate through all documents searching for it).
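A minimal sketch of this idea follows, assuming a simple lowercase alphanumeric tokenizer and using raw term frequency as a placeholder weight (the engine's actual weighting is described later in this report). The function names are illustrative and are not taken from the project's code.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Index = Dict[str, Dict[str, int]]  # term -> {doc_id: term frequency}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Index:
    """Build the in-memory inverted index once, up front."""
    index: Index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return dict(index)


def search(index: Index, query: str) -> List[Tuple[str, float]]:
    """Score only the documents that contain at least one query term."""
    scores: Dict[str, float] = defaultdict(float)
    for term in tokenize(query):
        # The postings for a term list exactly the documents that contain it,
        # so documents without the term are never visited.
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf  # placeholder weight: raw term frequency
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A document that contains none of the query terms never appears in `scores`, which is the saving over scanning every biography for every query.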

JavaScript with PhantomJS proved frustrating to work with at times. Working with it required learning an unfamiliar style of programming: asynchronous programming. In this style, a call to an "asynchronous" method returns immediately, and the method's work completes at some later point while the lines that follow the call continue to execute. At first, it was not apparent that the JavaScript was executing asynchronously, which led to a lot of confusion over why it failed on a line saying that a variable was undefined when that variable was set on the line above. It turned out that the variable was being set to the result of an asynchronous method that had not yet finished executing. Controlling the flow of the program proved to be challenging. The solution was to use a function called “waitFor”, provided in one of the examples on the PhantomJS website. This function takes a function that returns a boolean value and a continuation function. It evaluates the boolean function repeatedly and, once that function returns true, calls the continuation.

Effectiveness and Ideas for Improvement

In general, the NHL Player Search Engine is effective at finding players who are relevant to the user's query. However, there are many opportunities for improvement.

The NHL Player Search Engine uses a unigram language model to rank documents. It does not take relative word positions into account. This can often lead to a sub-optimal ranking of documents. For example, consider the query “Team Canada”. If there is one player who didn't play for Team Canada but was born in Canada and whose biography happens to contain the word “team” at some point, that player may be ranked higher than a player who actually played for Team Canada at the Olympics. So the NHL Player Search Engine could be improved by taking relative word positioning into account. Specifically, taking into account word pairs might lead to a significant improvement. This is because players have first and last names and teams have city and team names, so names of players and teams are often word pairs.
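The word-pair idea can be illustrated with a short sketch. It counts how many adjacent word pairs from the query also occur as adjacent pairs in a document. The tokenizer and function names are assumptions made for this example and are not part of the existing engine.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return every pair of adjacent tokens, in order."""
    return list(zip(tokens, tokens[1:]))


def pair_match_count(query: str, document: str) -> int:
    """Count document bigrams that also appear as bigrams in the query."""
    query_pairs = set(bigrams(tokenize(query)))
    return sum(1 for pair in bigrams(tokenize(document)) if pair in query_pairs)
```

For the query “Team Canada”, a biography such as "Born in Canada, he joined his first team at age five" contains both words but no matching pair, while "Won gold with Team Canada in Sochi" contains one. A ranking function could add a bonus for each matching pair on top of the single-term score.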

Another flaw is that the ranking of players depends entirely on the writers of the NHL website player biographies to include relevant information, to leave out irrelevant information, and to write a biography for each player in the first place. Sometimes the biographies contain misleading terms. For example, a biography will sometimes include stats from games in which the player played well. If Player A plays several good games against Team B but doesn't play Team B very often, the results of a search for Team B might include the page for Player A. Player A is arguably not that relevant to Team B, but is included in the results because the biography mentions the good games against Team B and names Team B several times. This could be remedied somewhat by placing a higher weight on information found in certain places on the webpage. For example, a player's name and team name appear in known locations on the player page, so it would be easy to weight terms matching those strings higher than terms appearing in the general biography.
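One way such location-based weighting might look is sketched below. The field names (`name`, `team`, `biography`) and the weight values are hypothetical and chosen only for illustration. The sketch assumes the crawler has already separated the page into those fields.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical field names and weights, chosen only for illustration.
FIELD_WEIGHTS: Dict[str, float] = {"name": 3.0, "team": 3.0, "biography": 1.0}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_term_counts(fields: Dict[str, str],
                         weights: Dict[str, float] = FIELD_WEIGHTS) -> Dict[str, float]:
    """Count each term once per occurrence, scaled by the weight of its field."""
    counts: Dict[str, float] = defaultdict(float)
    for field, text in fields.items():
        weight = weights.get(field, 1.0)  # fields without a weight count as 1.0
        for term in tokenize(text):
            counts[term] += weight
    return dict(counts)
```

With these weights, a single occurrence of a team name in the team field counts for more than two passing mentions of an opposing team in the biography, which addresses the Player A and Team B case described above.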

The document ranking in the NHL Player Search Engine currently combines inverse document frequency (IDF) and log term frequency (log TF). A document's score is simply the sum of these two values over each term in the query. A stronger ranking function such as BM25 could be implemented instead.
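The two per-term weights can be compared in a short sketch. The exact formulas used by the engine are not given here, so `log_tf_plus_idf` is an assumed form (log(1 + tf) plus log(N / df)). `bm25_term` uses a common BM25 variant with a non-negative IDF and the conventional defaults k1 = 1.2 and b = 0.75.

```python
import math


def log_tf_plus_idf(tf: int, df: int, n_docs: int) -> float:
    """Assumed form of the current per-term score: log TF plus IDF."""
    if tf == 0 or df == 0:
        return 0.0
    return math.log(1 + tf) + math.log(n_docs / df)


def bm25_idf(df: int, n_docs: int) -> float:
    """A common non-negative BM25 IDF variant."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)


def bm25_term(tf: int, df: int, n_docs: int, doc_len: int,
              avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Per-term BM25 score with term-frequency saturation and length normalization."""
    if tf == 0:
        return 0.0
    # Longer-than-average documents get a larger denominator, so each
    # occurrence of the term counts for less.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return bm25_idf(df, n_docs) * tf * (k1 + 1.0) / (tf + norm)
```

In either case a document's score is the sum of the per-term value over the query terms. The practical differences are that BM25 multiplies the TF and IDF components rather than adding them, caps the contribution of repeated terms, and discounts long biographies.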

The NHL Player Search Engine does not currently take user feedback into account. It could be improved by implementing a system that decides whether a document is relevant based on user ratings. Those ratings could then be used in a feedback system to improve document rankings.

References

[1] NHL.com - The National Hockey League. National Hockey League, n.d. Web. 9 May 2014.

[2] Hidayat, Ariya. PhantomJS. Computer software. PhantomJS. Vers. 1.9.7. N.p., n.d. Web. 9 May 2014.