Soccer Data Web Crawler
Prototype Report
Soccer Data Web Crawler
Team No. 02
Name                                   Role
Trupti Sardesai                        Project Manager
Wenchen Tu                             Prototyper
Subessware Selvameena Karunamoorthy    System/Software Architect
Pranshu Kumar                          Requirements Engineer
Zhitao Zhou                            Feasibility Analyst
Yan Zhang                              Operational Concept Engineer
Qing Hu                                Life Cycle Planner
Amir ali Tahmasebi                     Shaper
Version History
Date        Author    Version    Changes made                      Rationale
10/11/14    WT        1.0        First version of this document    For draft FC Package
A.1. Introduction
Prototype Report Version 1.0
A.1.1 Purpose of Prototype Report
PRO_FCP_F14a_T02_V1.0 Version Date: 10/12/14
A.1.2 Status
The first version of the Prototype Report records the prototypes for the basic web crawler, the social media crawler, and the developer user interface. These are the functions our team is expected to deliver by the end of this semester. For the web crawler, we prototype the overall architecture and the spiders for two representative websites. For extracting data from social media, we prototype the workflow of using the Facebook API to acquire a player's Facebook data from the player's name. For the developer user interface, we implement a simple version that provides functions to add, update, and delete entries in the website list.
A.2. Navigation Flow
Figure 1: Navigation Flow of Soccer Data Web Crawler
A.3. Prototype
A.3.1 Developer User Interface
Description: The user interface for developers to manage the website list and other parameters.
Related Win Condition: WC_3398: As a developer, I can add, delete, and update the specific websites visited, the fields to capture from each website, and the frequency of crawler refreshes for each specified website.
Table 1: Developer User Interface
Figure 2: Websites list User Interface
Figure 3: Players User Interface
Figure 4: Attributes User Interface
Figure 5: Teams User Interface
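The add, update, and delete operations on the website list described in WC_3398 can be illustrated with a minimal in-memory sketch. It is not the prototype's implementation: the class names, the field names, and the choice of hours as the unit for refresh frequency are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WebsiteEntry:
    """One row of the website list. All field names are hypothetical."""
    url: str
    fields_to_capture: list = field(default_factory=list)
    refresh_hours: int = 24  # assumed unit for the crawler refresh frequency


class WebsiteList:
    """In-memory website list supporting add, update, and delete."""

    def __init__(self):
        self._entries = {}

    def add(self, url, fields_to_capture, refresh_hours=24):
        if url in self._entries:
            raise ValueError(f"{url} is already in the website list")
        self._entries[url] = WebsiteEntry(url, list(fields_to_capture), refresh_hours)

    def update(self, url, fields_to_capture=None, refresh_hours=None):
        entry = self._entries[url]  # raises KeyError for an unknown website
        if fields_to_capture is not None:
            entry.fields_to_capture = list(fields_to_capture)
        if refresh_hours is not None:
            entry.refresh_hours = refresh_hours

    def delete(self, url):
        del self._entries[url]  # raises KeyError for an unknown website

    def get(self, url):
        return self._entries[url]

    def urls(self):
        return sorted(self._entries)
```

Keying the entries by URL keeps each website unique in the list, which matches the per-website refresh frequency in the win condition.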
A.3.2 Basic Web Crawler
This prototype presents the web crawler architecture, which consists of modules that extract both data and links from a website.
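The idea of extracting both data and links from each page can be sketched with the Python standard library alone. This is an illustrative sketch, not the architecture shown in the report's figure: the class name, the `crawl` function, and the injected `fetch` callable are hypothetical, and pages are supplied by the caller so the example does not touch the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects the href targets of <a> tags and the visible text of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the current page.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl. `fetch` maps a URL to its HTML, or None if unavailable."""
    queue, seen, results = deque([start_url]), {start_url}, {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTextParser(url)
        parser.feed(html)
        results[url] = parser.text      # the extracted data
        for link in parser.links:       # the extracted links feed the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Passing `fetch` in as a parameter keeps the extraction logic separate from how pages are downloaded, so the same loop can be exercised against a dictionary of saved pages.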
Description: The modules and workflow of the basic web crawler.
Related Win Conditions:
WC_3473: The web crawler shall gather team information from the websites in the website list.
WC_3472: The web crawler shall gather player information from the websites in the website list.
WC_3413: The web crawler shall gather head shots of players from the biography page on the website being crawled so that the player's picture can be shown on the report being generated.
WC_3412: The web crawler shall gather videos from the pages being crawled and ingest them into STBI as is so that the coach and fans are able to watch the relevant videos PA.
Table 2: Web Crawler Architecture
Figure 6: The basic web crawler architecture
We also analyzed the websites in the list the client gave us. They fall into two basic categories. In the first, the statistical table is generated statically, so we can get the table directly from the source code of the page. In the second, the statistical data is generated dynamically when the page is loaded, so we need to find the request URL that the page uses to load the data. The nasl website belongs to the first category, while whoscored.com belongs to the second.
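The difference between the two categories can be sketched as two small parsing functions. This is a hedged illustration, not the spider code shown in the figures: it assumes the dynamic response is JSON with a top-level "players" key, which is a made-up layout, and it works on strings passed in by the caller rather than on live pages.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of every <tr> in a page as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def parse_static_table(html):
    """Category 1: the table is already present in the page source."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def parse_dynamic_payload(payload):
    """Category 2: parse the response of the request URL the page loads data from."""
    return json.loads(payload)["players"]  # "players" key is hypothetical
```

For a dynamically generated page, the string passed to `parse_dynamic_payload` would be the body returned by the request URL, not the HTML of the page itself.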
Description: The spider for crawling player data from the whoscored.com website.
Related Win Condition: WC_3472: The web crawler shall gather player information from the websites in the website list.
Table 3: Spider for whoscored.com
Figure 7: Whoscored.com spider code
Figure 8: Whoscored spider output
Description: The spider for crawling player data from the nasl website.
Related Win Condition: WC_3472: The web crawler shall gather player information from the websites in the website list.
Table 4: Spider for nasl
Figure 9: Nasl spider code
Figure 10: The basic web crawler architecture
A.3.3 Gather Social Media Data
Description: The overall workflow for using a player's name to get Facebook data.
Related Win Condition: WC_3416: The web crawler shall get comments, name, number of members, and likes from specified Facebook pages.
Table 5: Workflow for gathering Facebook data
Figure 11: Workflow of gathering Facebook Data
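The workflow of going from a player's name to Facebook data can be outlined as a two-step lookup: search for candidate pages by name, then fetch the fields named in WC_3416 for the best match. The sketch below is an assumption about how such a workflow might be structured, not the team's implementation. `search_pages` and `fetch_page` are hypothetical callables standing in for the Facebook API requests, and the dictionary keys are illustrative rather than actual API field names.

```python
def gather_facebook_data(player_name, search_pages, fetch_page):
    """Look up a player's page by name and return the fields of interest.

    `search_pages(name)` should return a list of {"id", "name"} candidates and
    `fetch_page(page_id)` a dict of page data. Both are hypothetical stand-ins
    for the API requests, not functions of any real client library.
    """
    candidates = search_pages(player_name)
    if not candidates:
        return None

    # Assumption: prefer an exact, case-insensitive name match, else the first hit.
    wanted = player_name.casefold()
    match = next(
        (c for c in candidates if c["name"].casefold() == wanted),
        candidates[0],
    )

    page = fetch_page(match["id"])
    return {
        "name": page.get("name"),
        "members": page.get("members"),
        "likes": page.get("likes"),
        "comments": page.get("comments", []),
    }
```

Injecting the two callables keeps the matching logic testable without network access or an access token.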