Data Mining Overview of Data Mining

CSE4334/5334 Data Mining
Overview of Data Mining

Chengkai Li
Department of Computer Science and Engineering
University of Texas at Arlington
Fall 2020

(Slides partly courtesy of Pang-Ning Tan, Michael Steinbach and Vipin Kumar, and Jiawei Han, Micheline Kamber and Jian Pei)

Big Data
Source: http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Big Data: The 4 Vs
o Volume
o Variety
o Velocity
o Veracity

Volume: How much data is out there?
Sources:
http://www.sciencedaily.com/releases/2013/05/130522085217.htm
http://www.storagenewsletter.com/rubriques/market-reportsresearch/ibm-cmo-study/

Volume: How much data is out there?
In total, 2.7 zettabytes of data exist in our digital universe. (“A terabyte is equal to 1,024 gigabytes. A petabyte is equal to 1,024 terabytes. An exabyte is equal to 1,024 petabytes. A zettabyte is equal to 1,024 exabytes.”)

Online Activity
o Every minute:
  o 149,513 emails are sent
  o 3.3 million Facebook posts are created
  o 3.8 million Google searches are performed
  o 65,972 Instagram photos are uploaded
  o 448,800 tweets are constructed
  o 500 hours of YouTube videos are uploaded
Source: https://www.nodegraph.se/big-data-facts/

Variety: Types of Data
Structured data
o (relational) database tables
o CSV/TSV files
Semi-structured data
o XML, JSON, RDF
Unstructured data
o text data (documents, Web pages, short texts, e.g., social media)
Multimedia data
o images, videos, audio
Other types of data
o matrices, graphs, sequences, time-series, spatio-temporal

Velocity: Streaming Data
o Stock trades
o Highway sensors
o Weather data
o Social media
o Telephone calls
o Video streaming
Source: http://mashable.com/2012/06/22/data-created-every-minute/

Veracity: uncertain and imprecise data
o Biases
o Data lineage
o Bugs
o Noise
o Abnormalities
o Information security
o Unreliable sources
o Falsification
o Uncertainty
o Out of date
o Human error
Source: https://simplicable.com/new/data-veracity

Data in Every Application Area
o Business: e-commerce, transactions (retailers, banking, credit cards), ratings, reviews, stock trading, …
o Web, social media (YouTube, Flickr, …), and social networks (Facebook, Twitter, …)
o News
o Science: bioinformatics, scientific experiments, environment, climate, astronomy
o Logs and measurements
o Personal information: emails, calendars, digital photos, videos
o Transportation
o Telecommunication
o Education
o Entertainment (film, music, gaming, …)
o Sports
o Health care
o Crime, security

What is Data Mining?
Data mining (knowledge discovery from data)
o Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
What is not Data Mining?
o Retrieving data rather than knowledge or patterns
o Finding results that are not interesting (trivial, explicit, known, useless)

Knowledge Discovery (KDD) Process
o Data mining plays an essential role in the knowledge discovery process
Source: http://cacm.acm.org/magazines/1996/11/8517-the-kdd-process-for-extracting-useful-knowledge-from-volumes-of-data/abstract

KDD Process: A Typical View from ML and Statistics
Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing
Data Pre-Processing
o Data integration
o Normalization
o Feature selection
o Dimension reduction
Data Mining
o Pattern discovery
o Association & correlation
o Classification
o Clustering
o Outlier analysis
Post-Processing
o Pattern evaluation
o Pattern selection
o Pattern interpretation
o Pattern visualization
This is a view from typical machine learning and statistics communities.

Data Mining: Confluence of Multiple Disciplines
(Diagram: Data Mining at the center, surrounded by the following.)
o Machine Learning
o Pattern Recognition
o Statistics
o Applications
o Visualization
o Algorithm
o Database Technology
o High-Performance Computing

Data Mining Software
Free, open-source
o RapidMiner
o Weka: data mining tool in Java
o SCaVis: scientific computation and visualization, Java
o Orange: Python suite
o Scikit-learn: Python machine learning library
o NumPy/SciPy/IPython/mlpy (Python modules for scientific computing, scientific library, interactive computing, machine learning)
o R: statistical computing and graphics
o RattleGUI: data mining GUI using R
o Octave: numerical analysis
o Shogun: machine learning toolkit in C++
Text Mining Tools
o NLTK (NLP Toolkit): NLP suite for Python
o SenticNet API: sentiment analysis
o Stanford NLP software
o UIMA
Large-Scale Data Processing, Machine Learning
o Apache Mahout
o GraphLab
o MapReduce/Hadoop
o Spark
o Pregel/Giraph
Commercial Products
o Matlab
o Oracle Data Mining
o SAS
o IBM SPSS
o Microsoft SQL Server Analysis Services
o HP Vertica
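
Illustration: the three KDD stages as code
The KDD view from the ML and statistics communities (pre-processing, mining, post-processing) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used in the course: the toy customer records, the function names, the choice of min-max normalization and k-means clustering as representatives of each stage, and the seeding of centroids with the first k points are all hypothetical choices made for this example. It uses only the Python standard library (Python 3.8+ for math.dist).

```python
import math

def min_max_normalize(rows):
    """Pre-processing: rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def k_means(points, k, iterations=20):
    """Mining: plain k-means clustering.

    Assumption for this sketch: centroids are seeded with the first k
    points so that the result is deterministic.
    """
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids

def within_cluster_sse(points, labels, centroids):
    """Post-processing: evaluate the pattern with the sum of squared
    distances from each point to its assigned centroid (lower is tighter)."""
    return sum(
        math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
    )

# Hypothetical toy input: (annual spend, store visits per month).
raw = [(200, 1), (220, 2), (250, 1), (900, 9), (950, 10), (1000, 8)]

normalized = min_max_normalize(raw)                      # Data pre-processing
labels, centroids = k_means(normalized, k=2)             # Data mining
sse = within_cluster_sse(normalized, labels, centroids)  # Post-processing

print("cluster labels:", labels)
print("within-cluster SSE:", round(sse, 4))
```

Normalizing first matters here because the two columns have very different scales; without it, the spend column would dominate the distance computation. In practice a library such as Scikit-learn (listed above) would normally replace the hand-written functions.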
