Data Science with Postgresql
Total Page:16
File Type:pdf, Size:1020Kb
Data Science with PostgreSQL Data Science with PostgreSQL Balázs Bárány Data Scientist pgconf.eu 2015 Data Science with PostgreSQL Contents Introduction What is Data Science? Process model Tools and methods of Data Scientists Data Science with PostgreSQL Business & data understanding Preprocessing Modeling Evaluation Deployment Summary I Who is a Data Scientist? Data Science with PostgreSQL Introduction What is Data Science? Sexiest job of the 21st century I According to Google, LinkedIn, ... Data Science with PostgreSQL Introduction What is Data Science? Sexiest job of the 21st century I According to Google, LinkedIn, ... I Who is a Data Scientist? Data Science with PostgreSQL Introduction What is Data Science? Data Science Venn Diagram (c) Drew Conway, 2010. CC-BY-NC I Mash up & format for analysis I Analyze & visualize I Predict & prescribe I Operationalize Data Science with PostgreSQL Introduction What is Data Science? Tasks of data scientists I Get data from various sources I Big data? I Analyze & visualize I Predict & prescribe I Operationalize Data Science with PostgreSQL Introduction What is Data Science? Tasks of data scientists I Get data from various sources I Big data? I Mash up & format for analysis I Predict & prescribe I Operationalize Data Science with PostgreSQL Introduction What is Data Science? Tasks of data scientists I Get data from various sources I Big data? I Mash up & format for analysis I Analyze & visualize I Operationalize Data Science with PostgreSQL Introduction What is Data Science? Tasks of data scientists I Get data from various sources I Big data? I Mash up & format for analysis I Analyze & visualize I Predict & prescribe Data Science with PostgreSQL Introduction What is Data Science? Tasks of data scientists I Get data from various sources I Big data? I Mash up & format for analysis I Analyze & visualize I Predict & prescribe I Operationalize Data Science with PostgreSQL Introduction What is Data Science? Process model The Data Mining process Cross Industry Standard Process for Data Mining (Kenneth Jensen/Wikimedia Commons) Data Science with PostgreSQL Tools and methods of Data Scientists Tools and methods Tools and methods Data Science with PostgreSQL Tools and methods of Data Scientists Scripting and programming I R I Python with extensions I Octave/Matlab, other mathematic languages I Hadoop and Big Data programming libraries (mostly Java) I Cloud services Data Science with PostgreSQL Tools and methods of Data Scientists Integrated GUI tools I (partly) Open Source: RapidMiner, KNIME, Orange I Data Warehouse tools extended for analytics: Pentaho, Talend I Many commercial tools, e. g. SAS, IBM SPSS I Hadoop-related newcomers: e. g. Datameer Data Science with PostgreSQL Tools and methods of Data Scientists Data Infrastructure I Databases and data stores I Relational, NoSQL I Hadoop clusters I In-memory I Data streams I Free-form data: text, images, video, audio, ... I Web APIs I Open Data I Joining, aggregating, ltering, calculating, ... I Data cleansing I Missing values I Abnormal values I Result: data set suitable for analytics Data Science with PostgreSQL Tools and methods of Data Scientists Data acquisition and preprocessing I Data ingestion in raw format I Data cleansing I Missing values I Abnormal values I Result: data set suitable for analytics Data Science with PostgreSQL Tools and methods of Data Scientists Data acquisition and preprocessing I Data ingestion in raw format I Joining, aggregating, ltering, calculating, ... I Result: data set suitable for analytics Data Science with PostgreSQL Tools and methods of Data Scientists Data acquisition and preprocessing I Data ingestion in raw format I Joining, aggregating, ltering, calculating, ... I Data cleansing I Missing values I Abnormal values Data Science with PostgreSQL Tools and methods of Data Scientists Data acquisition and preprocessing I Data ingestion in raw format I Joining, aggregating, ltering, calculating, ... I Data cleansing I Missing values I Abnormal values I Result: data set suitable for analytics I Classication (supervised): Prediction of a class or category I Regression (supervised): Prediction of numeric value I Clustering (unsupervised): Automatic grouping of data I Association analysis, outlier detection, time series prediction, ... Data Science with PostgreSQL Tools and methods of Data Scientists Predictive Modeling I Supervised and unsupervised methods I Target variable known or not I Clustering (unsupervised): Automatic grouping of data I Association analysis, outlier detection, time series prediction, ... Data Science with PostgreSQL Tools and methods of Data Scientists Predictive Modeling I Supervised and unsupervised methods I Target variable known or not I Classication (supervised): Prediction of a class or category I Regression (supervised): Prediction of numeric value Data Science with PostgreSQL Tools and methods of Data Scientists Predictive Modeling I Supervised and unsupervised methods I Target variable known or not I Classication (supervised): Prediction of a class or category I Regression (supervised): Prediction of numeric value I Clustering (unsupervised): Automatic grouping of data I Association analysis, outlier detection, time series prediction, ... I Store in ERP or CRM I Tell someone (email, popup) I Add a label (e. g. mark email as spam) I Interrupt nancial transaction => prescription I Order supplies => prescription I ... Data Science with PostgreSQL Tools and methods of Data Scientists Deployment and operationalization I Model application to new data => prediction + condence I What to do with predictions? I Interrupt nancial transaction => prescription I Order supplies => prescription I ... Data Science with PostgreSQL Tools and methods of Data Scientists Deployment and operationalization I Model application to new data => prediction + condence I What to do with predictions? I Store in ERP or CRM I Tell someone (email, popup) I Add a label (e. g. mark email as spam) Data Science with PostgreSQL Tools and methods of Data Scientists Deployment and operationalization I Model application to new data => prediction + condence I What to do with predictions? I Store in ERP or CRM I Tell someone (email, popup) I Add a label (e. g. mark email as spam) I Interrupt nancial transaction => prescription I Order supplies => prescription I ... Data Science with PostgreSQL Data Science with PostgreSQL Data Science with PostgreSQL Doing Data Science with PostgreSQL I Must be root and postgres I Maintain your PostgreSQL yourself I Able to compile stu I You should ask ;-) I your boss I co-workers I customer Data Science with PostgreSQL Data Science with PostgreSQL Caveats I This stu is not easy I You should ask ;-) I your boss I co-workers I customer Data Science with PostgreSQL Data Science with PostgreSQL Caveats I This stu is not easy I Must be root and postgres I Maintain your PostgreSQL yourself I Able to compile stu Data Science with PostgreSQL Data Science with PostgreSQL Caveats I This stu is not easy I Must be root and postgres I Maintain your PostgreSQL yourself I Able to compile stu I You should ask ;-) I your boss I co-workers I customer I Project goals and challenges I Availability of data and resources I Success criteria I Not a technical activity, PostgreSQL can't help much Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Business understanding I What is the purpose of the business? I What are existing processes? I Drivers of business success I Not a technical activity, PostgreSQL can't help much Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Business understanding I What is the purpose of the business? I What are existing processes? I Drivers of business success I Project goals and challenges I Availability of data and resources I Success criteria Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Business understanding I What is the purpose of the business? I What are existing processes? I Drivers of business success I Project goals and challenges I Availability of data and resources I Success criteria I Not a technical activity, PostgreSQL can't help much I Connecting separate data sources I Simple or complex IDs I Data size I Too small I Too big I Suitability for predictive modeling I Target variable? I Attribute types Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding I Existing data I Entities and covered concepts I Complete? Correct? In suitable form? I Usable? (regulations, access constraints, ...) I Data size I Too small I Too big I Suitability for predictive modeling I Target variable? I Attribute types Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding I Existing data I Entities and covered concepts I Complete? Correct? In suitable form? I Usable? (regulations, access constraints, ...) I Connecting separate data sources I Simple or complex IDs I Suitability for predictive modeling I Target variable? I Attribute types Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding I Existing data I Entities and covered concepts I Complete? Correct? In suitable form? I Usable? (regulations, access constraints, ...) I Connecting separate data sources I Simple or complex IDs I Data size I Too small I Too big Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding