Introducing Oracle Machine Learning for Python
Mark Hornick, Senior Director, Data Science and Machine Learning at Oracle
www.twitter.com/MarkHornick

Future and past TechCasts
Submit a topic to share at https://analyticsanddatasummit.org/techcasts/

Analytics & Data Oracle User Community
Same great technical content, new name! www.andouc.org

Save the Date
TechCast Days, Winter Session: January 26-28, 2021
• Watch our website and social media channels for more details
• Share your knowledge, expertise, and ideas
• Submit your presentation by going to our website and clicking on "TechCasts"

Oracle Machine Learning for Python: Introduction
Mark Hornick, Senior Director, Data Science and Machine Learning
November 19, 2020
Copyright © 2020 Oracle and/or its affiliates.

Safe Harbor
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle’s products may change and remains at the sole discretion of Oracle Corporation.

What is Python?
• An interpreted, object-oriented, high-level, general-purpose programming language
• Designed for rapid application development and scripting to connect existing components
• Created in the late 1980s and first released in 1991
• Open source: https://www.python.org
• World-wide usage
  • Widely taught in universities
  • Many data scientists know and use Python
• Thousands of open source packages to enhance productivity
Traditional Python and Data Source Interaction
[Diagram: Python exchanges data with a Data Source in two ways. (1) Read/write Flat Files using built-in tool capabilities: extract / export from the data source, then read; export, then load. (2) Data source connectivity packages, e.g., cx_Oracle. Deployment: ad hoc cron job.]
• Access latency
• Paradigm shift: Python → Data Access Language → Python
• Memory limitation: data size, in-memory processing
• Single threaded
• Issues for backup, recovery, security
• Ad hoc production deployment

Oracle Machine Learning
• OML4SQL: SQL API
• OML4R: R API
• OML4Py*: Python API
• OML AutoML UI*: Code-free AutoML interface on Autonomous Database
• OML Notebooks: with Apache Zeppelin on Autonomous Database
• Oracle Data Miner: Oracle SQL Developer extension
• OML4Spark: R API on Big Data
• OML Services*: Model Deployment and Management, Cognitive Text
* Coming soon

Oracle Machine Learning Notebooks
Autonomous Database as a Data Science Platform
Collaborative UI
• Based on Apache Zeppelin
• Supports data scientists, data analysts, application developers, DBAs with SQL and Python
• Easy notebook sharing
• Permissions, versioning, and scheduling of notebooks
Included with Autonomous Database
• Automatically provisioned and managed
• In-database algorithms and analytics functions
• Explore and prepare, build and evaluate models, score data, deploy solutions
Oracle Machine Learning for Python
Supported in Oracle Autonomous Database with OML Notebooks
[Diagram: OML Notebooks and REST Interface on top of OML4Py]
Use Oracle Database as HPC environment
• Explore, transform, and analyze data faster and at scale
Use in-database parallelized and distributed ML algorithms
• Build more models on more data, and score large-volume data, faster
• Use in-database algorithms from OML4SQL via natural Python API
• Increased productivity from automatic data preparation, partitioned models, and integrated text mining capabilities
Execute Python scripts and manage Python objects in-database
• Collaborate: hand off data science products from data scientist to developers easily
• Run user-defined functions in data-parallel, task-parallel, and non-parallel fashion
• Return structured and image results in Python and REST API
New automatic machine learning (AutoML) and model explainability (MLX)
• Enhance data scientist productivity and enable non-experts to use and benefit from machine learning
• Algorithm selection, feature selection, hyperparameter tuning, model selection
• Model-agnostic identification of important features that impact model predictions

Transparency Layer
In-database performance: indexes, query optimization, parallelism, partitioning
• Leverages proxy objects for database data: oml.DataFrame

    # Create table from Pandas DataFrame data
    DATA = oml.create(data, table = 'BOSTON')
    # Get proxy object to DB table boston
    DATA = oml.sync(table = 'BOSTON')

• Uses familiar Python syntax to manipulate database data
• Overloads Python functions translating functionality to SQL

    DATA.shape
    DATA.head()
    DATA.describe()
    DATA.std()
    DATA.skew()
    TRAIN, TEST = DATA.split()
    TRAIN.shape
    TEST.shape
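To make the proxy-object idea on the Transparency Layer slide concrete, here is a minimal sketch using only the Python standard library. It is not OML4Py code and says nothing about how OML4Py is implemented: the `TableProxy` class, its `sql_log` attribute, the SQLite in-memory database, and the sample rows are all assumptions made for illustration. It only shows the general pattern in which a Python object stands in for a database table and turns attribute and method calls into SQL.

```python
import sqlite3


class TableProxy:
    """Hypothetical toy proxy: holds a table name, never the table's rows."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table
        self.sql_log = []  # every SQL statement this proxy has generated

    def _run(self, sql):
        # Record the generated SQL, then let the database do the work.
        self.sql_log.append(sql)
        return self.conn.execute(sql).fetchall()

    @property
    def shape(self):
        # The row count is computed inside the database.
        n_rows = self._run(f'SELECT COUNT(*) FROM "{self.table}"')[0][0]
        # SQLite-specific way to list a table's columns.
        n_cols = len(self._run(f'PRAGMA table_info("{self.table}")'))
        return (n_rows, n_cols)

    def head(self, n=5):
        # Only the first n rows cross the database boundary.
        return self._run(f'SELECT * FROM "{self.table}" LIMIT {int(n)}')


# Made-up sample rows standing in for the BOSTON table on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BOSTON (CRIM REAL, MEDV REAL)")
conn.executemany(
    "INSERT INTO BOSTON VALUES (?, ?)",
    [(0.1, 24.0), (0.2, 21.6), (0.3, 34.7)],
)

DATA = TableProxy(conn, "BOSTON")
print(DATA.shape)
print(DATA.head(2))
print(DATA.sql_log)
```

The design point is that `DATA` stays small no matter how large the table is: each call ships a statement to the database and brings back only the result.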
In-database scalable aggregation
Example using the crosstab function

    ONTIME_S = oml.sync(table="ONTIME_S")
    res = ONTIME_S.crosstab('DEST')
    type(res)
    res.head()

• Source data is a DataFrame, ONTIME_S, which is an Oracle Database table
• crosstab() function is overloaded to accept OML DataFrame objects and transparently generates SQL for scalable processing in Oracle Autonomous Database
• Returns an 'oml.core.frame.DataFrame' object
Generated SQL:

    select DEST, count(*)
    from ONTIME_S
    group by DEST

[Diagram: OML Notebooks → OML4Py → Oracle Database (in-db stats, user tables)]

OML4Py 1.0
Machine Learning in-database algorithms
Classification
• Decision Tree
• Naïve Bayes
• Generalized Linear Model
• Support Vector Machine
• Random Forest
• Neural Network
Regression
• Generalized Linear Model
• Neural Network
• Support Vector Machine
Clustering
• Expectation Maximization
• Hierarchical k-Means
Attribute Importance
• Minimum Description Length
Anomaly Detection
• 1 Class Support Vector Machine
Association Rules
• Apriori (Association Rules)
Feature Extraction
• Singular Value Decomposition
• Explicit Semantic Analysis
• Principal Component Analysis via SVD
Supports automatic data preparation, partitioned model ensembles, integrated text mining

Scalable in-database algorithms
Example using Support Vector Machine

    import oml
    from oml import svm

    # create proxy object
    ONTIME_S = oml.sync(table='ONTIME_S')

    # define model object
    settings = {'svms_outlier_rate' : 0.01}
    svm_mod = svm('anomaly_detection',
                  svms_kernel_function = 'dbms_data_mining.svms_linear',
                  **settings)

    # build anomaly detection model
    svm_mod = svm_mod.fit(x=ONTIME_S, y=None)

    # view model object
    svm_mod

[Diagram: OML Notebooks → OML4Py → Oracle Autonomous Database (user tables)]
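The crosstab slide earlier in this section pairs a Python call with the SQL it is said to generate (`select DEST, count(*) from ONTIME_S group by DEST`). The sketch below illustrates why pushing that aggregation into the database matters, using SQLite from the standard library as a stand-in for Oracle Database. The helper names `crosstab_in_db` and `crosstab_client_side`, the `ARRDELAY` column, and the sample rows are hypothetical; this is not the OML4Py implementation.

```python
import sqlite3
from collections import Counter

# Made-up sample rows standing in for the ONTIME_S table on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ONTIME_S (DEST TEXT, ARRDELAY INTEGER)")
conn.executemany(
    "INSERT INTO ONTIME_S VALUES (?, ?)",
    [("SFO", 5), ("BOS", 12), ("SFO", -3), ("JFK", 0), ("SFO", 20), ("BOS", 7)],
)


def crosstab_in_db(conn, table, column):
    """Count rows per distinct value with a GROUP BY run by the database."""
    sql = f'SELECT "{column}", COUNT(*) FROM "{table}" GROUP BY "{column}"'
    # Only one (value, count) pair per distinct value is returned to Python.
    return dict(conn.execute(sql).fetchall())


def crosstab_client_side(conn, table, column):
    """Same counts, but every row is first pulled into Python memory."""
    rows = conn.execute(f'SELECT "{column}" FROM "{table}"').fetchall()
    return dict(Counter(value for (value,) in rows))


in_db = crosstab_in_db(conn, "ONTIME_S", "DEST")
client = crosstab_client_side(conn, "ONTIME_S", "DEST")
print(in_db)
print(in_db == client)
```

Both functions give the same counts. The difference is how much data leaves the database: the in-database version returns one row per destination, while the client-side version transfers the whole column, which is where the memory limitation from the traditional approach comes in.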
Use matplotlib visualization with in-database model results
Example using OML Notebooks with in-database clustering model build and score
[Notebook screenshot, in three steps: drop existing model, build k-Means model, score using model]

Embedded Python Execution
Example of parallel partitioned data flow using third party package

    # user-defined function using sklearn
    def build_lm(dat):
        from sklearn import linear_model
        lm = linear_model.LinearRegression()
        X = dat[['PETAL_WIDTH']]
        y = dat[['PETAL_LENGTH']]
        lm.fit(X, y)
        return lm

    # select column(s) for partitioning data
    index = oml.DataFrame(IRIS['SPECIES'])

    # invoke function in parallel on IRIS table
    mods = oml.group_apply(IRIS, index,
                           func=build_lm,
                           parallel=2)
    mods.pull().items()

[Diagram: REST Interface and OML Notebooks → OML4Py → Oracle Autonomous Database (user tables), which spawns two Python Engines, each with OML4Py]

REST Interface for Embedded Python Execution
py_scripts for executing user-defined functions (Python "scripts")

    <oml-cloud-service-url>/oml/tenants/<tenant_name>/databases/<pdb_name>/api/py-scripts/v1/<operation>/<script_name>/

• <oml-cloud-service-url>: Cloud service URL
• <tenant_name>: Customer tenant name
• <pdb_name>: Name of pluggable database within ADB
• <operation>: do-eval, table-apply, group-apply, index-apply, or row-apply
• <script_name>: Name of script in repository
Example synchronous invocation from cURL (operation first, then script name, as in the URL pattern above)

    $ curl -X POST --header "Authorization: Bearer ${token}" \
        --header 'Content-Type: application/json' \
        --header 'Accept: application/json' \
        -d '{"graphicsFlag":true, "service":"MEDIUM"}' \
        "<oml-cloud-service-url>/oml/tenants/MYTENANT/databases/MYADW/api/py-scripts/v1/do-eval/RandomRedDots"

Asynchronous invocation also available

AutoML: new with OML4Py
Increase data scientist productivity, reduce overall compute time
[Diagram: ML Data Table → Auto Algorithm Selection (much faster than exhaustive search) → Auto Feature Selection (de-noise data and reduce # of features) → Auto Model Tuning (significant accuracy improvement) → Model]
Auto Algorithm Selection
• Identify in-database algorithm that achieves highest model quality
• Find best algorithm faster than with exhaustive search
Auto Feature Selection
• Reduce # of features by identifying most predictive
• Improve performance and accuracy
Auto Model Tuning
• Automatic tuning of algorithm hyperparameters
• Avoid manual or exhaustive search techniques
Enables non-expert users to leverage Machine Learning

Demo

Summary: OML4Py
Python access to Oracle Machine Learning in Autonomous Database
• Scalable data exploration, preparation, and analysis
• Scalable in-database machine learning
• Automation for greater data scientist productivity and non-expert