Data Science with PostgreSQL

Data Science with PostgreSQL

Balázs Bárány

Data Scientist

pgconf.eu 2015 Data Science with PostgreSQL

Contents

Introduction  What is Data Science? Process model

Tools and methods of Data Scientists

Data Science with PostgreSQL Business & data understanding Preprocessing Modeling Evaluation Deployment

Summary I Who is a Data Scientist?

Data Science with PostgreSQL Introduction  What is Data Science?

Sexiest job of the 21st century

I According to Google, LinkedIn, ... Data Science with PostgreSQL Introduction  What is Data Science?

Sexiest job of the 21st century

I According to Google, LinkedIn, ...

I Who is a Data Scientist? Data Science with PostgreSQL Introduction  What is Data Science?

Data Science Venn Diagram

(c) Drew Conway, 2010. CC-BY-NC I Mash up & format for analysis

I Analyze & visualize

I Predict & prescribe

I Operationalize

Data Science with PostgreSQL Introduction  What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data? I Analyze & visualize

I Predict & prescribe

I Operationalize

Data Science with PostgreSQL Introduction  What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis I Predict & prescribe

I Operationalize

Data Science with PostgreSQL Introduction  What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize I Operationalize

Data Science with PostgreSQL Introduction  What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize

I Predict & prescribe Data Science with PostgreSQL Introduction  What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize

I Predict & prescribe

I Operationalize Data Science with PostgreSQL Introduction  What is Data Science? Process model The Data Mining process

Cross Industry Standard Process for Data Mining (Kenneth Jensen/Wikimedia Commons) Data Science with PostgreSQL Tools and methods of Data Scientists

Tools and methods

Tools and methods Data Science with PostgreSQL Tools and methods of Data Scientists

Scripting and programming

I

I Python with extensions

I Octave/Matlab, other mathematic languages

I Hadoop and Big Data programming libraries (mostly Java)

I Cloud services Data Science with PostgreSQL Tools and methods of Data Scientists

Integrated GUI tools

I (partly) Open Source: RapidMiner, KNIME, Orange

I Data Warehouse tools extended for analytics: Pentaho, Talend

I Many commercial tools, e. g. SAS, IBM SPSS

I Hadoop-related newcomers: e. g. Datameer Data Science with PostgreSQL Tools and methods of Data Scientists

Data Infrastructure

I Databases and data stores

I Relational, NoSQL I Hadoop clusters I In-memory

I Data streams

I Free-form data: text, images, video, audio, ...

I Web APIs

I Open Data I Joining, aggregating, ltering, calculating, ...

I Data cleansing

I Missing values I Abnormal values

I Result: data set suitable for analytics

Data Science with PostgreSQL Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format I Data cleansing

I Missing values I Abnormal values

I Result: data set suitable for analytics

Data Science with PostgreSQL Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format

I Joining, aggregating, ltering, calculating, ... I Result: data set suitable for analytics

Data Science with PostgreSQL Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format

I Joining, aggregating, ltering, calculating, ...

I Data cleansing

I Missing values I Abnormal values Data Science with PostgreSQL Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format

I Joining, aggregating, ltering, calculating, ...

I Data cleansing

I Missing values I Abnormal values

I Result: data set suitable for analytics I Classication (supervised): Prediction of a class or category

I Regression (supervised): Prediction of numeric value

I Clustering (unsupervised): Automatic grouping of data

I Association analysis, outlier detection, time series prediction, ...

Data Science with PostgreSQL Tools and methods of Data Scientists

Predictive Modeling

I Supervised and unsupervised methods

I Target variable known or not I Clustering (unsupervised): Automatic grouping of data

I Association analysis, outlier detection, time series prediction, ...

Data Science with PostgreSQL Tools and methods of Data Scientists

Predictive Modeling

I Supervised and unsupervised methods

I Target variable known or not

I Classication (supervised): Prediction of a class or category

I Regression (supervised): Prediction of numeric value Data Science with PostgreSQL Tools and methods of Data Scientists

Predictive Modeling

I Supervised and unsupervised methods

I Target variable known or not

I Classication (supervised): Prediction of a class or category

I Regression (supervised): Prediction of numeric value

I Clustering (unsupervised): Automatic grouping of data

I Association analysis, outlier detection, time series prediction, ... I Store in ERP or CRM

I Tell someone (email, popup)

I Add a label (e. g. mark email as spam)

I Interrupt nancial transaction => prescription

I Order supplies => prescription

I ...

Data Science with PostgreSQL Tools and methods of Data Scientists

Deployment and operationalization

I Model application to new data => prediction + condence

I What to do with predictions? I Interrupt nancial transaction => prescription

I Order supplies => prescription

I ...

Data Science with PostgreSQL Tools and methods of Data Scientists

Deployment and operationalization

I Model application to new data => prediction + condence

I What to do with predictions?

I Store in ERP or CRM

I Tell someone (email, popup)

I Add a label (e. g. mark email as spam) Data Science with PostgreSQL Tools and methods of Data Scientists

Deployment and operationalization

I Model application to new data => prediction + condence

I What to do with predictions?

I Store in ERP or CRM

I Tell someone (email, popup)

I Add a label (e. g. mark email as spam)

I Interrupt nancial transaction => prescription

I Order supplies => prescription

I ... Data Science with PostgreSQL Data Science with PostgreSQL

Data Science with PostgreSQL

Doing Data Science with PostgreSQL I Must be root and postgres

I Maintain your PostgreSQL yourself I Able to compile stu

I You should ask ;-)

I your boss I co-workers

I customer

Data Science with PostgreSQL Data Science with PostgreSQL

Caveats

I This stu is not easy I You should ask ;-)

I your boss I co-workers

I customer

Data Science with PostgreSQL Data Science with PostgreSQL

Caveats

I This stu is not easy

I Must be root and postgres

I Maintain your PostgreSQL yourself I Able to compile stu Data Science with PostgreSQL Data Science with PostgreSQL

Caveats

I This stu is not easy

I Must be root and postgres

I Maintain your PostgreSQL yourself I Able to compile stu

I You should ask ;-)

I your boss I co-workers

I customer I Project goals and challenges

I Availability of data and resources

I Success criteria

I Not a technical activity, PostgreSQL can't help much

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Business understanding

I What is the purpose of the business?

I What are existing processes?

I Drivers of business success I Not a technical activity, PostgreSQL can't help much

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Business understanding

I What is the purpose of the business?

I What are existing processes?

I Drivers of business success

I Project goals and challenges

I Availability of data and resources

I Success criteria Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Business understanding

I What is the purpose of the business?

I What are existing processes?

I Drivers of business success

I Project goals and challenges

I Availability of data and resources

I Success criteria

I Not a technical activity, PostgreSQL can't help much I Connecting separate data sources

I Simple or complex IDs

I Data size

I Too small

I Too big

I Suitability for predictive modeling

I Target variable? I Attribute types

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding

I Existing data

I Entities and covered concepts I Complete? Correct? In suitable form?

I Usable? (regulations, access constraints, ...) I Data size

I Too small

I Too big

I Suitability for predictive modeling

I Target variable? I Attribute types

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding

I Existing data

I Entities and covered concepts I Complete? Correct? In suitable form?

I Usable? (regulations, access constraints, ...)

I Connecting separate data sources

I Simple or complex IDs I Suitability for predictive modeling

I Target variable? I Attribute types

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding

I Existing data

I Entities and covered concepts I Complete? Correct? In suitable form?

I Usable? (regulations, access constraints, ...)

I Connecting separate data sources

I Simple or complex IDs

I Data size

I Too small

I Too big Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding

I Existing data

I Entities and covered concepts I Complete? Correct? In suitable form?

I Usable? (regulations, access constraints, ...)

I Connecting separate data sources

I Simple or complex IDs

I Data size

I Too small

I Too big

I Suitability for predictive modeling

I Target variable? I Attribute types I Analyze data distribution

I Group by and aggregate

I Count, Count Distinct, Min, Max

I Count NULLs I Search for missing links (incomplete foreign keys)

I Analyze surprizes

I Impossible values I Missing values in required elds

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding with PostgreSQL

I Get data into PostgreSQL

I Classical import process I Foreign Data Wrappers I Analyze surprizes

I Impossible values I Missing values in required elds

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding with PostgreSQL

I Get data into PostgreSQL

I Classical import process I Foreign Data Wrappers

I Analyze data distribution

I Group by and aggregate

I Count, Count Distinct, Min, Max

I Count NULLs I Search for missing links (incomplete foreign keys) Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding with PostgreSQL

I Get data into PostgreSQL

I Classical import process I Foreign Data Wrappers

I Analyze data distribution

I Group by and aggregate

I Count, Count Distinct, Min, Max

I Count NULLs I Search for missing links (incomplete foreign keys)

I Analyze surprizes

I Impossible values I Missing values in required elds I Or maybe...

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding with PostgreSQL  summary

I Good SQL knowledge required

I Tedious manual process

I repetitive I not suitable for large number of attributes

I No built-in visualization Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding with PostgreSQL  summary

I Good SQL knowledge required

I Tedious manual process

I repetitive I not suitable for large number of attributes

I No built-in visualization

I Or maybe... Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding SQL barchart output Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Bar chart from GUI tool Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Boxplot output I Introduction: R

I A free software environment for statistical computing and graphics

I Available in PostgreSQL

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding wrap up

I DBMS not built for this

I It can support more specialized tools Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Data understanding wrap up

I DBMS not built for this

I It can support more specialized tools

I Introduction: R

I A free software environment for statistical computing and graphics

I Available in PostgreSQL I PL/R: R as a loadable procedural language for PostgreSQL

I First released in 2003 by Joe Conway I License: GPL

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding PL/R: A statistical language for PostgreSQL

I R as a standalone language

I Mathematical and statistical methods

I Powerful visualization functions I Classical, modern and bleeding edge modeling I Arrays and data frames are central data types I Operates only in memory Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding PL/R: A statistical language for PostgreSQL

I R as a standalone language

I Mathematical and statistical methods

I Powerful visualization functions I Classical, modern and bleeding edge modeling I Arrays and data frames are central data types I Operates only in memory

I PL/R: R as a loadable procedural language for PostgreSQL

I First released in 2003 by Joe Conway I License: GPL Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding R usage outside of PostgreSQL

I Development environments

I RStudio (AGPL or commercial, local & web) I RKWard, (KDE projects) I StatET (Eclipse)

I Frontends

I

I Deducer I Rattle

I Web framework: Shiny (AGPL or commercial) Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Working with R in PostgreSQL

I Install functions in the database Example select install_rcmd(' myfunction <-function(x) {print(x)} ');

I Install without function body Example CREATE FUNCTION rnorm (n integer, mean double precision, sd double precision) RETURNS double precision[] AS  LANGUAGE 'plr'; I Execution in the database

I Clumsy, but direct data access

I Execution outside

I Simple and interactive, but data transfer

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Using R in PostgreSQL for data understanding

I Advanced visualization

I Data distributions

I Advanced statistics I Execution outside

I Simple and interactive, but data transfer

Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Using R in PostgreSQL for data understanding

I Advanced visualization

I Data distributions

I Advanced statistics

I Execution in the database

I Clumsy, but direct data access Data Science with PostgreSQL Data Science with PostgreSQL Business & data understanding Using R in PostgreSQL for data understanding

I Advanced visualization

I Data distributions

I Advanced statistics

I Execution in the database

I Clumsy, but direct data access

I Execution outside

I Simple and interactive, but data transfer I Rows: very dynamic

I Easy to create new rows by joining I Easy to lter

I Columns: not so much

I Easy to create new columns I Only explicit access

I Wider interpretation of preprocessing

I Enrichment with external data I New attributes from existing ones I Recoding, recalculation I Missing value handling

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing

I What databases are built for I Wider interpretation of preprocessing

I Enrichment with external data I New attributes from existing ones I Recoding, recalculation I Missing value handling

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing

I What databases are built for

I Rows: very dynamic

I Easy to create new rows by joining I Easy to lter

I Columns: not so much

I Easy to create new columns I Only explicit access Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing

I What databases are built for

I Rows: very dynamic

I Easy to create new rows by joining I Easy to lter

I Columns: not so much

I Easy to create new columns I Only explicit access

I Wider interpretation of preprocessing

I Enrichment with external data I New attributes from existing ones I Recoding, recalculation I Missing value handling Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: organizing workow

I Common Table Expressions

I organize processing steps

I partial and intermediate results

Example WITH source AS ( SELECT *, ROW_NUMBER() OVER () AS rownum FROM source_table ), no_missings AS ( SELECT * FROM source WHERE field1 IS NOT NULL AND field2 IS NOT NULL ) etc. I Partial aggregation by window functions

I In-group measures, e. g. ratio

I att / SUM(att) OVER (PARTITION BY ...)

I In-group numbering

I ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)

I Much easier in SQL than programming languages and data mining tools

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: attribute creation

I Aggregation I In-group numbering

I ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)

I Much easier in SQL than programming languages and data mining tools

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratio

I att / SUM(att) OVER (PARTITION BY ...) I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)

I Much easier in SQL than programming languages and data mining tools

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratio

I att / SUM(att) OVER (PARTITION BY ...)

I In-group numbering

I ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) I Much easier in SQL than programming languages and data mining tools

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratio

I att / SUM(att) OVER (PARTITION BY ...)

I In-group numbering

I ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...) Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratio

I att / SUM(att) OVER (PARTITION BY ...)

I In-group numbering

I ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)

I Much easier in SQL than programming languages and data mining tools I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle, MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)

I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: enrichment

I PostGIS for geodata I Other databases (other PostgreSQL server, MySQL, Oracle, MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)

I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki) I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)

I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle, MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...) I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)

I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle, MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala) I Write your own in C or Python or Ruby

Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle, MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)

I Files (CSV, ZIP, JSON, ...) Data Science with PostgreSQL Data Science with PostgreSQL Preprocessing Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle, MSSQL, JDBC, SQL Alchemy ...)

I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)

I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby I Some attempts to build them

I Naive Bayes, Linear Regression I Dicult for more advanced algorithms

I Better done in specialized language or tool

I PL/R I PL/Python

Data Science with PostgreSQL Data Science with PostgreSQL Modeling Model development

I Machine learning algorithms not well suited for SQL I Better done in specialized language or tool

I PL/R I PL/Python

Data Science with PostgreSQL Data Science with PostgreSQL Modeling Model development

I Machine learning algorithms not well suited for SQL

I Some attempts to build them

I Naive Bayes, Linear Regression I Dicult for more advanced algorithms Data Science with PostgreSQL Data Science with PostgreSQL Modeling Model development

I Machine learning algorithms not well suited for SQL

I Some attempts to build them

I Naive Bayes, Linear Regression I Dicult for more advanced algorithms

I Better done in specialized language or tool

I PL/R I PL/Python I PL/Python is an alternative to PL/R

Data Science with PostgreSQL Data Science with PostgreSQL Modeling PL/Python

I Python procedural language available in PostgreSQL

I scikit-learn: Machine learning toolbox for Python

I Classication, regression, clustering I Model selection, validation I Preprocessing

I matplotlib: Generic and statistical plotting Data Science with PostgreSQL Data Science with PostgreSQL Modeling PL/Python

I Python procedural language available in PostgreSQL

I scikit-learn: Machine learning toolbox for Python

I Classication, regression, clustering I Model selection, validation I Preprocessing

I matplotlib: Generic and statistical plotting library

I PL/Python is an alternative to PL/R I Results on the training set meaningless

I Split validation

I Cross validation

Data Science with PostgreSQL Data Science with PostgreSQL Modeling Evaluation of modeling results

I Models return predictions

I Prediction can be compared to known result (target variable)

I Measures of model performance: Accuracy, precision, recall, ... I Split validation

I Cross validation

Data Science with PostgreSQL Data Science with PostgreSQL Modeling Evaluation of modeling results

I Models return predictions

I Prediction can be compared to known result (target variable)

I Measures of model performance: Accuracy, precision, recall, ...

I Results on the training set meaningless Data Science with PostgreSQL Data Science with PostgreSQL Modeling Evaluation of modeling results

I Models return predictions

I Prediction can be compared to known result (target variable)

I Measures of model performance: Accuracy, precision, recall, ...

I Results on the training set meaningless

I Split validation

I Cross validation I get more data

I do more preprocessing

I select better classier

I optimize classier parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL Data Science with PostgreSQL Modeling Evaluation results

I Good result depends on the application

I If not good enough, I do more preprocessing

I select better classier

I optimize classier parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL Data Science with PostgreSQL Modeling Evaluation results

I Good result depends on the application

I If not good enough,

I get more data I select better classier

I optimize classier parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL Data Science with PostgreSQL Modeling Evaluation results

I Good result depends on the application

I If not good enough,

I get more data

I do more preprocessing I optimize classier parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL Data Science with PostgreSQL Modeling Evaluation results

I Good result depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classier I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment

Data Science with PostgreSQL Data Science with PostgreSQL Modeling Evaluation results

I Good result depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classier

I optimize classier parameters I Better done in data mining environment

Data Science with PostgreSQL Data Science with PostgreSQL Modeling Evaluation results

I Good result depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classier

I optimize classier parameters

I Cycle: preprocessing - modeling - evaluation Data Science with PostgreSQL Data Science with PostgreSQL Modeling Evaluation results

I Good result depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classier

I optimize classier parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment I Instant application using triggers

I Well-known execution environment

I Functionality available over standard interface

I Some models easily expressed in SQL

Data Science with PostgreSQL Data Science with PostgreSQL Deployment Deployment

I Advantages of deployment in the database:

I Less overhead I Well-known execution environment

I Functionality available over standard interface

I Some models easily expressed in SQL

Data Science with PostgreSQL Data Science with PostgreSQL Deployment Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers I Functionality available over standard interface

I Some models easily expressed in SQL

Data Science with PostgreSQL Data Science with PostgreSQL Deployment Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment I Some models easily expressed in SQL

Data Science with PostgreSQL Data Science with PostgreSQL Deployment Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment

I Functionality available over standard interface Data Science with PostgreSQL Data Science with PostgreSQL Deployment Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment

I Functionality available over standard interface

I Some models easily expressed in SQL I Put into global context

I PL/R: load(saved object, .GlobalEnv) I PL/Python: Global dictionary GD

I Application function in matching language

I Uses existing model I Returns target variable

I Trigger func or UPDATE uses application function

Data Science with PostgreSQL Data Science with PostgreSQL Deployment Deployment of PL/R or PL/Python models

I Model developed in database or outside I Trigger func or UPDATE uses application function

Data Science with PostgreSQL Data Science with PostgreSQL Deployment Deployment of PL/R or PL/Python models

I Model developed in database or outside

I Put into global context

I PL/R: load(saved object, .GlobalEnv) I PL/Python: Global dictionary GD

I Application function in matching language

I Uses existing model I Returns target variable Data Science with PostgreSQL Data Science with PostgreSQL Deployment Deployment of PL/R or PL/Python models

I Model developed in database or outside

I Put into global context

I PL/R: load(saved object, .GlobalEnv) I PL/Python: Global dictionary GD

I Application function in matching language

I Uses existing model I Returns target variable

I Trigger func or UPDATE uses application function I Modern SQL for preprocessing

I Foreign Data Wrappers for data integration

I Procedural languages for data mining

Data Science with PostgreSQL Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment I Foreign Data Wrappers for data integration

I Procedural languages for data mining

Data Science with PostgreSQL Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment

I Modern SQL for preprocessing I Procedural languages for data mining

Data Science with PostgreSQL Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment

I Modern SQL for preprocessing

I Foreign Data Wrappers for data integration Data Science with PostgreSQL Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment

I Modern SQL for preprocessing

I Foreign Data Wrappers for data integration

I Procedural languages for data mining Data Science with PostgreSQL Summary

Questions?

I Balázs Bárány,

I https://datascientist.at/