Soccer Data Web Crawler

Prototype Report

Team No. 02

First Name Last Name                   Role
Trupti Sardesai                        Project Manager
Wenchen Tu                             Prototyper
Subessware Selvameena Karunamoorthy    System/Software Architect
Pranshu Kumar                          Requirements Engineer
Zhitao Zhou                            Feasibility Analyst
Yan Zhang                              Operational Concept Engineer
Qing Hu                                Life Cycle Planner
Amir ali Tahmasebi                     Shaper

Version History

Date: 10/11/14
Author: WT
Version: 1.0
Changes made: First Version of this Document
Rationale: For draft FC Package


Prototype Report Version 1.0
PRO_FCP_F14a_T02_V1.0 Version Date: 10/12/14

A.1. Introduction

A.1.1 Purpose of Prototype Report

A.1.2 Status

The first version of the Prototype Report records the prototypes for the basic web crawler, the social media crawler, and the developer user interface. These are the functions that our team is expected to deliver by the end of this semester. For the web crawler, we prototype the overall architecture and the spiders for two representative websites. For extracting data from social media, we prototype the workflow of using the Facebook API to acquire a player's Facebook data from the player's name. For the developer user interface, we implement a simple version that provides functions such as adding, updating, and deleting entries in the website list.

A.2. Navigation Flow

Figure 1: Navigation Flow of Soccer Data Web Crawler

A.3. Prototype

A.3.1 Developer User Interface

Description: The user interface for developers to manage the website list and other parameters.

Related Win Condition:
• WC_3398: As a developer, I can add, delete, and update the specific websites visited, the fields to capture from each website, and the frequency of crawler refreshes for each specified website.

Table 1: Developer User Interface
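The add, update, and delete operations on the website list described in Table 1 can be sketched as a small in-memory store. This is only an illustration under stated assumptions, not the prototype's actual implementation: the names `WebsiteEntry`, `WebsiteList`, and `refresh_hours`, as well as the placeholder URL `nasl.example`, are hypothetical and do not come from the report.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WebsiteEntry:
    """One row of the website list (all attribute names are hypothetical)."""
    url: str
    fields: list[str] = field(default_factory=list)  # fields to capture
    refresh_hours: int = 24                           # crawler refresh frequency


class WebsiteList:
    """In-memory website list keyed by URL."""

    def __init__(self) -> None:
        self._entries: dict[str, WebsiteEntry] = {}

    def add(self, entry: WebsiteEntry) -> None:
        if entry.url in self._entries:
            raise ValueError(f"{entry.url} is already in the list")
        self._entries[entry.url] = entry

    def update(self, url: str, *, fields=None, refresh_hours=None) -> None:
        entry = self._entries[url]  # raises KeyError for an unknown website
        if fields is not None:
            entry.fields = list(fields)
        if refresh_hours is not None:
            entry.refresh_hours = refresh_hours

    def delete(self, url: str) -> None:
        del self._entries[url]

    def get(self, url: str) -> WebsiteEntry:
        return self._entries[url]

    def urls(self) -> list[str]:
        return sorted(self._entries)


# Usage: exercise the three operations named in WC_3398.
website_list = WebsiteList()
website_list.add(WebsiteEntry("whoscored.com", ["player"], refresh_hours=12))
website_list.add(WebsiteEntry("nasl.example"))  # placeholder URL
website_list.update("nasl.example", fields=["player", "team"], refresh_hours=6)
website_list.delete("whoscored.com")
print(website_list.urls())  # ['nasl.example']
```

Keying the entries by URL keeps each operation a single dictionary lookup; a persistent store would be needed for the real interface.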

Figure 2: Websites list User Interface

Figure 3: Players User Interface

Figure 4: Attributes User Interface

Figure 5: Teams User Interface

A.3.2 Basic Web Crawler

This prototype presents the web crawler architecture, which consists of modules that extract both data and links from each website.

Description: The modules and workflow of the basic web crawler.

Related Win Conditions:
• WC_3473: The web crawler shall gather team information from the websites in the website list.
• WC_3472: The web crawler shall gather player information from the websites in the website list.
• WC_3413: The web crawler shall gather head shots of players from the biography page on the website being crawled so that the player's picture can be shown on the report being generated.
• WC_3412: The web crawler shall gather videos from the pages being crawled and ingest them into STBI as is so that the coach and fans are able to watch the relevant videos PA.

Table 2: Web Crawler Architecture
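The split between a data-extraction module and a link-extraction module can be sketched with the Python standard library alone. This is a hedged illustration, not the architecture shown in the report's figure: the class names, the `crawl` function, the choice of the page title as the "data", and the `example.test` pages are all assumptions made for the example, and `fetch` is an injected stand-in for a real HTTP client.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Link module: collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


class TitleExtractor(HTMLParser):
    """Data module stand-in: captures the text of the <title> tag."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` maps a URL to HTML text or None."""
    frontier = deque([start_url])
    seen = {start_url}
    results = {}
    while frontier and len(results) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        data = TitleExtractor()
        data.feed(html)
        results[url] = data.title.strip()
        links = LinkExtractor(url)
        links.feed(html)
        for link in links.links:
            if link not in seen:  # avoid revisiting pages
                seen.add(link)
                frontier.append(link)
    return results


# Usage with two fake pages instead of real network requests.
PAGES = {
    "http://example.test/": '<title>Teams</title><a href="/players">Players</a>',
    "http://example.test/players": '<title>Players</title><a href="/">Home</a>',
}
crawl_results = crawl("http://example.test/", PAGES.get)
print(crawl_results)
```

Passing `fetch` as a parameter keeps the extraction modules testable without network access; a real crawler would also need politeness delays and error handling.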

Figure 6: The basic web crawler architecture

We also analyzed the websites in the list the client gave us. They fall into two categories. In the first, the statistical table is generated statically, so we can get the table directly from the page source. In the second, the statistical data is generated dynamically when the page loads, so we need to find the request URL the page uses to load the data. The whoscored.com website belongs to the second category, while the nasl website belongs to the first.
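The difference between the two categories can be illustrated with a minimal sketch. For the static case, the table rows are parsed out of the page source; for the dynamic case, the records come from the body of the separate request the page issues. The sample HTML, the sample response, and the assumption that the dynamic response is a JSON list of objects are all hypothetical; the report does not state what format either website actually returns.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of every <tr> in the page as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def records_from_static_html(html):
    """Category 1: the statistical table is present in the page source."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows  # first row is treated as the header
    return [dict(zip(header, row)) for row in body]


def records_from_dynamic_response(body):
    """Category 2: `body` is the text returned by the data request URL.

    Assumed here to be a JSON list of objects; the real format may differ.
    """
    return json.loads(body)


# Hypothetical sample inputs for each category.
STATIC_HTML = (
    "<table><tr><th>Player</th><th>Goals</th></tr>"
    "<tr><td>A. Example</td><td>3</td></tr></table>"
)
DYNAMIC_BODY = '[{"Player": "A. Example", "Goals": "3"}]'

static_records = records_from_static_html(STATIC_HTML)
dynamic_records = records_from_dynamic_response(DYNAMIC_BODY)
print(static_records == dynamic_records)  # True
```

Normalizing both categories to the same list-of-records shape lets the rest of the crawler ignore how each website delivers its data.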

Description: The spider for crawling player data from the whoscored.com website.

Related Win Condition:
• WC_3472: The web crawler shall gather player information from the websites in the website list.

Table 3: Spider for whoscored.com

Figure 7: Whoscored.com spider code

Figure 8: Whoscored spider output

Description: The spider for crawling player data from the nasl website.

Related Win Condition:
• WC_3472: The web crawler shall gather player information from the websites in the website list.

Table 4: Spider for nasl

Figure 9: Nasl spider code

Figure 10: The basic web crawler architecture

A.3.3 Gather Social Media Data

Description: The overall workflow for using a player's name to get Facebook data.

Related Win Condition:
• WC_3416: The web crawler shall get the comments, name, number of members, and likes from specified Facebook pages.

Table 5: Workflow for gathering Facebook data

Figure 11: Workflow of gathering Facebook Data
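The workflow of going from a player's name to Facebook page data can be sketched as below. This is an outline under stated assumptions, not the team's implementation and not the Facebook API itself: `search_pages` and `fetch_page` are hypothetical callables standing in for whatever API requests the prototype makes, the dictionary keys mirror the fields named in WC_3416, and the exact-name matching rule is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageData:
    """Fields named in WC_3416."""
    name: str
    likes: int
    members: int
    comments: list


def gather_facebook_data(player_name, search_pages, fetch_page) -> Optional[PageData]:
    """Player name -> candidate pages -> chosen page -> page data.

    `search_pages(name)` returns a list of {"id", "name"} dicts and
    `fetch_page(page_id)` returns a dict with the WC_3416 fields.
    Both are hypothetical stand-ins for real API requests.
    """
    candidates = search_pages(player_name)
    if not candidates:
        return None
    # Assumed rule: prefer an exact, case-insensitive name match,
    # otherwise fall back to the first candidate.
    wanted = player_name.casefold()
    page_id = next(
        (c["id"] for c in candidates if c["name"].casefold() == wanted),
        candidates[0]["id"],
    )
    raw = fetch_page(page_id)
    return PageData(
        name=raw["name"],
        likes=raw["likes"],
        members=raw["members"],
        comments=list(raw["comments"]),
    )


# Usage with fake data instead of real API calls.
FAKE_SEARCH = {
    "Example Player": [
        {"id": "1", "name": "Example Player Fans"},
        {"id": "2", "name": "Example Player"},
    ]
}
FAKE_PAGES = {
    "2": {"name": "Example Player", "likes": 120, "members": 45, "comments": ["Great match!"]},
}

facebook_result = gather_facebook_data(
    "example player",
    lambda name: FAKE_SEARCH.get(name.title(), []),
    FAKE_PAGES.__getitem__,
)
print(facebook_result)
```

Injecting the two callables keeps the workflow independent of any particular API version and allows it to be tested offline.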
