Data Collection and Integration

01219335 Data Acquisition and Integration
Chaiporn Jaikaeo
Department of Computer Engineering, Kasetsart University
Revised 2020-10-07

Outline
• Data collection from various online sources
• Data cleaning
• Data transformation
• Web scraping
• Web API

Public Data Sources
• Examples in Thailand
  ◦ Government Open Data for Thailand: http://data.go.th
  ◦ National Statistical Office of Thailand: http://www.nso.go.th
• WikiData
  ◦ Free and open knowledge base in the form of structured data
• Google Dataset Search
  ◦ https://toolbox.google.com/datasetsearch
• Many, many more

Online Data Sharing Options
• Static data files such as csv, tsv, xlsx, txt, pdf, and images
• Human-friendly web pages
• Web API

Example: Household Income
• Available at
  ◦ https://www.cpe.ku.ac.th/~cpj/219335/data/household-income.html
• Originally obtained from NSO
• Deliberately modified to contain some inconsistency

Data Transformation
• The process of converting data from one format or structure into another to produce the desired output
• Transformation operations include (see the pandas sketch below)
  ◦ Pivoting/unpivoting (transposition)
  ◦ Mapping
  ◦ Joining
  ◦ Aggregating
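The following pandas snippet is a minimal sketch of the unpivoting operation named above; the DataFrame and its column names are hypothetical, loosely modeled on the household-income table used in the examples that follow.

    import pandas as pd

    # Hypothetical wide-format table: one column per year
    wide = pd.DataFrame({
        "Region": ["North", "South"],
        "2015": [18000, 21000],
        "2017": [19500, 22500],
    })

    # Unpivot (melt): turn the year columns into rows (long format)
    long = wide.melt(id_vars="Region", var_name="Year", value_name="Income")
    print(long)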

Messy Data
• Messy data contains inconsistency
• E.g., all of the following rows probably contain the same data

  Name             Marriage status   Birth date     Income (USD)
  Smith, John      Single            1980-09-23     $98,200
  Mr. John Smith   single            23.09.1980     98200
  John Smith       single            Sep 23, 1980   98,200.00

Data Cleaning
• The process of detecting and correcting, as well as removing, messy data
• Techniques/tools
  ◦ Spreadsheet software, e.g., Google Sheets, Microsoft Excel
    ◦ Pros: interactive, easy to get started, extensions may be readily available
    ◦ Cons: inflexible and tedious for repeated jobs
  ◦ Programming libraries, e.g., Python's Pandas (see the sketch below)
    ◦ Pros: most flexible
    ◦ Cons: difficult to set up and may lack interactivity
  ◦ Tools specifically designed to handle messy data, e.g., Trifacta, OpenRefine
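As a point of comparison with the OpenRefine workflow below, here is a hedged Pandas sketch that cleans the three messy rows shown above; the parsing choices (pandas ≥ 2.0's format="mixed", treating "23.09.1980" as day-first) are assumptions about the data, not part of the original slides.

    import pandas as pd

    # The three inconsistent rows from the "Messy Data" example
    df = pd.DataFrame({
        "Name": ["Smith, John", "Mr. John Smith", "John Smith"],
        "Marriage status": ["Single", "single", "single"],
        "Birth date": ["1980-09-23", "23.09.1980", "Sep 23, 1980"],
        "Income (USD)": ["$98,200", "98200", "98,200.00"],
    })

    # Normalize case, parse the mixed date formats, strip '$' and ','
    df["Marriage status"] = df["Marriage status"].str.lower()
    df["Birth date"] = pd.to_datetime(df["Birth date"],
                                      format="mixed", dayfirst=True)
    df["Income (USD)"] = (df["Income (USD)"]
                          .str.replace(r"[$,]", "", regex=True)
                          .astype(float))
    print(df)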

OpenRefine
• Formerly called Google Refine
• Now belongs to the open-source community
  ◦ Official website: http://openrefine.org/
  ◦ Source repository: https://github.com/OpenRefine/OpenRefine
• Runs locally as an interactive web application
• Cleaning and transformation instructions can be stored and reused

Example: OpenRefine
• Launch OpenRefine
• Open the link below and copy the table data into the clipboard
  ◦ https://www.cpe.ku.ac.th/~cpj/219335/data/household-income.html
• In OpenRefine, choose Clipboard as the data source
• Verify the import settings
• Create a project

Faceting
• Faceting allows data exploration by applying multiple filters
• A facet is created on a particular column
• Try creating a text facet on the Region column

Clustering
• Clustering helps find groups of different cell values that might be alternative representations of the same thing
• Values in a cluster can be merged into one

Transposition
• Displaying values from different years across multiple columns is good for presentation, but not appropriate as a database structure
• Transposition allows unpivoting columns into rows

Transforming Cells
• All values in a column can be mapped to new values using Python/Jython statements or GREL (Google Refine Expression Language) expressions, as sketched below
• E.g.,
  ◦ Converting text into a number
  ◦ Removing unnecessary characters such as commas
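As a concrete instance, the following Python/Jython expression (entered via Edit cells → Transform… with the language set to Python / Jython; OpenRefine binds the current cell's text to value) performs both example operations at once. A rough GREL equivalent would be value.replace(",", "").toNumber().

    # OpenRefine cell transform (language: Python / Jython)
    # 'value' holds the current cell's text, e.g. "$98,200"
    return float(value.replace(",", "").lstrip("$"))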
Possible Additional Steps
• Find and fix anomalies in the income values
  ◦ Using a Numeric facet
• Convert income values into numbers
  ◦ Using the Transform… menu
• Link parts of the data to other knowledge bases
  ◦ E.g., reconcile province cells to WikiData
• Export data to a csv file
• Import data into a MySQL database, normalizing as needed (a sketch follows this list)
  ◦ Access and manage your database via https://iot.cpe.ku.ac.th/pma
  ◦ Username: <ku-account>
  ◦ Password: <ku-google-email>
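A minimal sketch of the import step, assuming the cleaned table was exported as household-income.csv and that the MySQL server behind the phpMyAdmin site accepts direct connections; the host, database name, and table name here are placeholders, and the sketch relies on pandas with SQLAlchemy and the PyMySQL driver.

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder credentials: substitute your KU account and database
    engine = create_engine(
        "mysql+pymysql://<ku-account>:<password>@<host>/<database>")

    # Load the csv file exported from OpenRefine and store it as a table
    df = pd.read_csv("household-income.csv")
    df.to_sql("household_income", engine, if_exists="replace", index=False)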
Getting Data from Web Pages
• Much data is available only on web pages, where it is difficult to extract
• With data updated periodically, some level of automation is required
• Data extraction can be automated using a technique called web scraping
  ◦ The process of extracting information from a website in an automated way
  ◦ Usually based on specifying appropriate CSS selectors to pinpoint the desired data in an HTML document
• Some available tools
  ◦ Web Scraper (Chrome extension)
  ◦ Beautiful Soup for Python, or jsoup for Java

Web Scraper Extension
• Install the Web Scraper extension for Google Chrome
• Open Developer Tools and select the Web Scraper tab
• Documentation and a video tutorial can be found at https://webscraper.io
• Test sites for practicing are available at https://webscraper.io/test-sites

Example: SETTRADE
• Browse to SETTRADE's Market Summary page at https://www.settrade.com/C13_MarketSummary.jsp
• We want to extract the indices and values listed on the page
• The values keep changing over time, so we will collect them periodically

Launching Web Scraper
• Open Chrome's Developer Tools and select Web Scraper
• Create a new sitemap and paste the URL

Extracting Index Names and Values
• Add a Text selector and name it indices
• Choose Select and click on any two of the indices
• Also check Multiple, as multiple values will be extracted
• Click Data preview to make sure only and all needed values are extracted
• Repeat the process to extract the values

Web Scraping using Python
• The scraping process can be further automated using code, e.g., written in Python
• Beautiful Soup is a widely used web scraping library
• CSS selectors created in Web Scraper can be reused in code:

    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://www.settrade.com/C13_MarketSummary.jsp")
    soup = BeautifulSoup(response.text, "html.parser")

    # extract indices using the selector from Web Scraper;
    # also strip out wrapping whitespace
    indices = [s.text.strip() for s in soup.select(".col-md-8 a")]

    # extract values; get rid of ',' and convert to numbers
    values = [float(s.text.replace(",", ""))
              for s in soup.select(".col-md-8 td:nth-of-type(2)")]

    for idx, val in zip(indices, values):
        print(f"{idx} : {val}")

Web Scraping using Node-RED
• Web scraping can be done in Node-RED using the http request node in conjunction with the html node

Web Scraping: Caveats
• Be responsible!
• Respect the robots.txt file (see the sketch after this list)
• Do not hit servers too frequently
• If possible, scrape during off-peak hours
• Use the scraped data responsibly
  ◦ Respect copyright laws and be aware of potential copyright infringement
  ◦ Check the website's Terms of Service
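One way to honor robots.txt programmatically is Python's built-in urllib.robotparser; the sketch below reuses the SETTRADE URL from the earlier example and checks permission before fetching.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://www.settrade.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    # Only scrape the page if robots.txt allows it for our user agent
    url = "https://www.settrade.com/C13_MarketSummary.jsp"
    if rp.can_fetch("*", url):
        print("allowed to fetch:", url)
    else:
        print("robots.txt disallows:", url)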

Getting Data from Web Services
• Public data sources have increasingly made their data available as web services
• For example,
  ◦ The World Air Quality Project: http://aqicn.org
  ◦ Thailand Meteorological Department API (TMDAPI): https://data.tmd.go.th/api/index1.php
• Several public APIs are listed at
  ◦ https://github.com/public-apis/public-apis

Example: Air Quality Data
• The https://aqicn.org/json-api/doc/ website provides a data feed based on geographical location
• For example, this URL will get the feed from the station nearest to latitude 10.3, longitude 20.7 (see the sketch below):
  https://api.waqi.info/feed/geo:10.3;20.7/?token=XXXXX
• Registration is required to obtain an access token
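A minimal request sketch for this feed, assuming the response layout shown in the JSON API documentation (a top-level "status" field and the overall AQI under data → aqi); the token is a placeholder.

    import requests

    token = "XXXXX"  # placeholder: obtain a real token from aqicn.org
    url = f"https://api.waqi.info/feed/geo:10.3;20.7/?token={token}"

    reply = requests.get(url).json()
    if reply["status"] == "ok":
        print("AQI:", reply["data"]["aqi"])
    else:
        print("request failed:", reply)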

Conclusion
• Data can be found and collected from several online sources using various methods
• Data sources may be available in the form of static document files, static/dynamic web pages, and web services
• Web scraping is a technique to extract data from web sites
• Collected data needs to be carefully examined, cleaned, and transformed into appropriate structures for storage

Further Reading
• Cleaning data with OpenRefine
  ◦ https://libjohn.github.io/openrefine/
• Web Scraper documentation and video tutorials
  ◦ http://webscraper.io
• Node-RED Cookbook: HTTP requests
  ◦ https://cookbook.nodered.org/#http-requests

Assignment 7.1: AQI and Traffic
• Create Node-RED flows to
  ◦ Collect PM2.5 levels from aqicn.org every 15 minutes
    ◦ Choose three different locations around Bangkok
  ◦ Collect the Bangkok traffic index from Longdo every 15 minutes
    ◦ https://traffic.longdo.com/api/json/traffic/index
• Record data in separate tables in your MySQL database at 158.108.34.31
  ◦ Table name for PM2.5 levels: aqi
  ◦ Table name for traffic index: traffic
  ◦ Design your own schema for each table
• Create a dashboard that displays:
  ◦ Current values and a historical chart of PM2.5 levels for the 3 locations
  ◦ Current value and a historical chart of the Bangkok traffic index
