Deep Neural Networks on Big-Data
Aurelio Uncini, [email protected]
Ordine degli Ingegneri della provincia di Ancona

Facoltà di Ingegneria dell'Informazione, Informatica e Statistica (I3S)

http://ispac.diet.uniroma1.it — Ancona, 2015

Prologue

• Aristotle argued that all people express similar intellectual faculties and that the differences are due to teaching and example.

• My elementary school teacher said that “man is intelligent because he has the ability to adapt”.

• Bernard Widrow (LMS inventor): “I'm an ‘adaptive’ guy”

Keywords: teaching, example, and adaptation

The Big Data Phenomenon

Exponential growth of available information

• Social networks

• Sensor networks

• Internet of Things

• Bureaucratic and domain-specific databases

• Apps

• …

Big Data cycle

[Diagram: the Big Data cycle – Apps → Data → Users]

Big Data: many ‘V’

By 2020: about 44×10²¹ bytes (44 zettabytes)

Volume

Velocity

Variability

Variety

Source: IDC’s Digital Universe Study (EMC)

Big Data

• Untapped opportunities for socioeconomic growth (World Economic Forum)

• Data is the new oil of the Internet and the new currency of the digital world. (Meglena Kuneva, European Consumer Commissioner)

• Data in the 21st century is like oil in the 18th century: an immensely valuable, untapped asset.

Big Data - Many ‘V’ definition

Big problem: extraction of ‘V’alue from the large pools of data

Cost center → Profit center

Harvesting valuable knowledge from Big Data is not an ordinary task. Today, computational intelligence methods have come to play a vital role in Big Data analytics and knowledge discovery.

Big-Data relevant themes

[Diagram: Big-Data relevant themes]
• Data constraints: massive scale, decentralized, real-time streams
• Computational intelligence methods: deep learning methods, deep neural nets, convolutive neural nets, distributed neural nets, metaheuristics, …
• Infrastructure: massive-scale cloud storage, high-speed networks, high-speed computers, …
• Computational models: adaptive, parallel, distributed, local connections, ‘green’, …
• Tasks: modeling, prediction, classification, clustering, …
• Value: BD business models, BD analytics, high-value-added products, …

• Support projects that can transform our ability to extract knowledge and insights, in novel ways, from huge volumes of digital data.

• In April 2013, U.S. President Barack Obama announced another federal project, a new brain-mapping initiative called BRAIN (Brain Research Through Advancing Innovative Neurotechnologies).

• President Barack Obama’s Big Data Keynote – Hadoop World 2015 (he talks about the importance of Big Data and Data Science) (19 Feb 2015)

Biologically inspired computing

Biologically inspired approach …

[Diagram: the brain – instinct, knowledge, experience, culture, memory, deduction, emotions, a priori knowledge, rules, action, awareness, reasoning ability, …]

Moreover: fusion with other information … Most of our behavior, which combines information, knowledge, and intelligence, happens unconsciously.

Ex.: complex scene summarization in a few words.

Characteristics of the biological brain

The neuron cell

Dendrites (receivers)

Axon terminals (transmitters), cell body

[Diagram labels: nucleus, stimuli, response]

• Birth of Artificial Neural Networks (ANN) (’40s)
• The formal neuron of McCulloch-Pitts (1943)

• Simple biologically inspired circuit

Nonlinear function φ(·); synaptic weights w; threshold or bias; cell potential (activation) s.

The stimuli x₁ … x_M arrive on the dendrites, are weighted by w₁ … w_M and combined at the summing junction; the activation function produces the response on the axon:

  s = wᵀx,  y = φ(wᵀx)

• Can be implemented by a very simple algorithm, suitable for artificial neural networks; a minimal sketch follows.
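As an illustration (mine, not from the slides), the formal neuron above fits in a few lines of Python; the weights, bias, and tanh activation are arbitrary example choices:

```python
import numpy as np

def neuron(x, w, b, phi=np.tanh):
    """Formal neuron: cell potential s = w^T x + b, response y = phi(s)."""
    s = np.dot(w, x) + b      # summing junction (cell potential / activation)
    return phi(s)             # nonlinear activation function on the axon

x = np.array([0.5, -1.0, 2.0])   # stimuli arriving at the dendrites
w = np.array([0.3, 0.8, -0.2])   # synaptic weights
print(neuron(x, w, b=0.1))       # response
```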

Learning model and paradigms

• Learning model: simple rewarding mechanism

• In general terms we can define two learning paradigms:
  – Supervised
  – Unsupervised

Supervised learning

Learning through teaching by examples

wwnn1  Rewarding_Function

Stimuli Response

Supervisor or Teacher Comparison Correct answer

Reward mechihanism Error e[n ]

Rewarding mechanism: error-function minimization, provided through examples (external forcing).

Learning by error correction

• A learning algorithm with concrete and useful results is the LMS algorithm (delta rule) of Bernard Widrow (1959).

External stimuli x (signals) produce the response y = wᵀx; comparison with the desired output d (supervisor or teacher) gives the error e = d − y, and the learning algorithm updates the weights (a minimal sketch follows):

  w[n+1] = w[n] + μ·x·e
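A hedged sketch of the LMS/delta rule in Python; the learning rate mu and the toy teacher are my illustrative assumptions:

```python
import numpy as np

def lms_step(w, x, d, mu=0.01):
    """One LMS update: y = w^T x, e = d - y, w[n+1] = w[n] + mu * x * e."""
    e = d - np.dot(w, x)          # error from comparison with desired output
    return w + mu * x * e         # learning by error correction

rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(5000):
    x = rng.normal(size=2)        # external stimuli
    d = 2.0 * x[0] - 1.0 * x[1]   # supervisor's correct answer
    w = lms_step(w, x, d)
print(w)                          # converges toward [2, -1]
```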

Bernard Widrow, “I'm an ‘adaptive’ guy” – Professor Emeritus, Electrical Engineering Department, Stanford University, USA

Multi-Layer Neural Networks

Compare outputs with the correct answer to get the error signal; back-propagate the error signal to get the derivatives for learning.

  Outputs: y = Φ(W(3) Φ(W(2) Φ(W(1) x)))

[Diagram: input vector (pattern) x → W(1) → W(2) → W(3) → outputs y; many hidden layers; feed-forward computation (a minimal sketch of the forward pass follows)]

Back-Propagation learning algorithm (mid ’80s)
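A minimal sketch of the feed-forward computation y = Φ(W(3) Φ(W(2) Φ(W(1) x))); the layer sizes and random weights are illustrative assumptions, and the backward pass is omitted for brevity:

```python
import numpy as np

def forward(x, weights, phi=np.tanh):
    """Feed-forward pass through the stacked layers W(1), W(2), W(3)."""
    a = x
    for W in weights:
        a = phi(W @ a)            # each layer: nonlinearity of a weighted sum
    return a

rng = np.random.default_rng(1)
sizes = [4, 8, 8, 3]              # input, two hidden layers, output
weights = [rng.normal(scale=0.5, size=(m, n))
           for n, m in zip(sizes[:-1], sizes[1:])]
y = forward(rng.normal(size=4), weights)
print(y)                          # outputs, to be compared with the correct answer
```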

Unsupervised learning

Learning through self-adaptation

[Diagram: stimuli → learner → response]

Rewarding mechanism: no external forcing

Rewarding mechanism: a simple primal instinct that creates the adaptation, i.e., natural evolutionary behavior.

Unsupervised learning

Hebbian learning

• Hebb’s postulate
• The strength of the connection depends on the correlated activity of the neurons (a minimal sketch of the rule follows).
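As an illustration (not from the slides), the Hebbian update Δw = η·y·x strengthens weights in proportion to joint pre- and post-synaptic activity; η and the stimulus below are arbitrary:

```python
import numpy as np

def hebbian_step(w, x, eta=0.01):
    """Hebb's rule: no teacher and no error, only correlated activity."""
    y = np.dot(w, x)              # post-synaptic activity
    return w + eta * y * x        # delta_w = eta * y * x

w = np.array([0.1, 0.1])
for _ in range(100):
    w = hebbian_step(w, np.array([1.0, 0.2]))   # repeated correlated stimulus
print(w)                          # grows along the dominant input direction
```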

Donald Hebb (1904-1985), Canadian psychologist, McGill University, Montreal

Neural Networks History: Gartner Hype Cycle

• Neural network disillusionment

[Gartner hype-cycle chart of neural network history, 1950-2010: Technology Trigger (Widrow’s LMS, 1950-70), Peak of Inflated Expectations (NNs rebirth with BP and MLP, ’80s), Trough of Disillusionment (’90s), Slope of Enlightenment, Plateau of Productivity; RNN appears along the curve]

BP-NNs disillusionment, ’80s and ’90s

• Supervised learning
  – It requires labeled training data
  – Almost all data is unlabeled

• Long learning time
  – Very slow in networks with many hidden layers
  – Vanishing gradient problem

• It may fall into poor local minima
  – For deep networks these may be too far from the optimal solution

Back-propagation problems in the ’80s and ’90s

Three main problems of BP

1. Difficulty of producing labelled training data: not enough labelled data sets.

2. CPUs were not fast enough.

3. Difficulty of finding the correct weights: error-propagation problems.

What has happened recently

1. Labelled data sets got much bigger.

2. Computers got much faster.

3. New paradigm for learning deep layers using unlabeled data (2006).

• Result: deep neural networks are now the state of the art for many real-world problems.

Deep Neural Networks

Neural Networks History: Gartner Hype Cycle

[Updated hype-cycle chart: after the Trough of Disillusionment, DNNs (’06 onward) drive a second rebirth of NNs, climbing the Slope of Enlightenment toward a possible Plateau of Productivity (?); DNN (industry) follows]

Deep Neural Networks - Gartner Hype Cycle

[Chart: hypothesized trend of DNN expectations rising into strong-AI hype. WARNING: Bill Gates and Stephen Hawking have voiced concerns about strong AI]

http://www.huffingtonpost.com/james-barrat/hawking-gates-artificial-intelligence_b_7008706.html

Machine Learning performance vs amount of data

[Plot: performance vs. amount of data; deep learning methods keep improving as data grows, while standard machine learning algorithms plateau]

Deep Learning definition

• Many definitions:
• DL is a set of algorithms in machine learning that attempt to learn at multiple levels, corresponding to different levels of abstraction. It typically uses artificial neural networks.

• DL is a class of machine learning techniques that exploit many layers of non-linear information processing for supervised or unsupervised feature extraction and transformation, and for pattern analysis and classification.

• …

DL Biological evidence

• For example, the layered organization of the visual system

[Diagram: external stimuli → receptors → hidden layers (memory, ideation, psyche, etc.) → motoneurons → muscle cells]

Many levels of transformation

DL Psychological-cognitive evidence

• Knowledge is represented at different levels of abstraction

Abstraction

Wisdom

Insight

Understanding

Knowledge

Information

Data

Concreteness

The Ladder of Abstraction and the Data-Wisdom Pyramid

Example of Deep Learning solutions

• Apple – Siri speech recognition, iPhone personal assistant, …
• Facebook – massive data analysis, …
• Google – Translator, Android’s voice recognition, Word2Vec text processing (Google acquires AI startup DeepMind for > $500M), …
• IBM – brain-like computers, deep learning for Big Data (IBM acquires AlchemyAPI, enhancing Watson’s deep learning capabilities), …
• Microsoft – speech, massive data analysis, …
• Twitter – acquires deep learning startup Madbits
• Yahoo – acquires startup LookFlow to work on Flickr and deep learning

• As data keeps getting bigger, DL is coming to play a key role in:
  – Data modeling
  – Analytics solutions
  – Leverage for competitive advantage

Three main DNN families (L. Deng, D. Yu 2014)

• Deep networks for unsupervised or generative learning
  – Capture high-order correlations of the observed data when no information about target class labels is available.

• Deep networks for supervised learning
  – Directly provide discriminative power for pattern classification purposes.

• Hybrid deep networks
  – A mix of the previous models. The goal is discrimination, which is assisted, often in a significant way, by the outcomes of generative or unsupervised deep networks.

• Research activity in the field is very high.

Unsupervised generative model

• Ex. Deep Belief Networks (DBN)
• Stack of Restricted Boltzmann Machines (RBM)

Independent unsupervised training of each layer. DBNs can effectively utilize large amounts of unlabeled data for exploiting complex data structures.

[Diagram: input layer V → W(1) → hidden layer H1 (RBM) → W(2) → H2 (RBM) → W(3) → H3 (RBM) → W(4) → output layer O]

Deep networks for supervised learning

• Ex. Convolutional Neural Network (CNN), Yann LeCun (NYU)

Specific architecture for image classification

Fig. from: http://parse.ele.tue.nl/cluster/2/CNNArchitecture.jpg

Biologically inspired: small neuron collections look at small portions of the input image, like receptive fields.

Convolutional Neural Network (CNN)

Softmax to predict object class

Fully-connected layers

Convolutional layers, Layer 1 … Layer 7 (the same weights are used at all spatial locations in a layer)

Biologically inspired: small neuron collections look at small portions of the input image, like receptive fields; a minimal sketch of the shared-weight convolution follows.

Input
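A hedged numpy sketch of the weight-sharing idea: one small kernel is slid over all spatial locations of the image. The image and kernel values are arbitrary; real CNNs stack many such filters with pooling and nonlinearities:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: the same weights (kernel) are applied
    at every spatial location of the input."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # local receptive field
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(8, 8)                    # toy 8x8 input
kernel = np.array([[1.0, -1.0]])                # toy 1x2 edge-like filter
print(conv2d(image, kernel).shape)              # (8, 7) feature map
```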

Won the 2012 ImageNet challenge with a 16.4% top-5 error rate.

Hybrid DNN architecture

[Diagram: W(1) → W(2) → … → W(N−1) → W(N) → softmax classifier]

Unsupervised learning (hidden layers) → Supervised learning (softmax classifier)

Supervised final fine-tuning

DNN by stacked autoencoder

Output classes, softmax classifier

[Diagram: layer-wise pre-training phases P1 … P4 of network N, with weights W(1) … W(4)]

Separate unsupervised pre-training of the hidden layers; a minimal sketch of the procedure follows.
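A highly simplified sketch of the greedy layer-wise idea (my assumption-laden toy: linear tied-weight autoencoders trained by plain gradient descent; real systems use nonlinear layers and better optimizers):

```python
import numpy as np

def train_autoencoder(X, n_hidden, lr=0.05, epochs=200, seed=0):
    """Train a tied-weight linear autoencoder on X; return encoder weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_hidden, X.shape[1]))
    for _ in range(epochs):
        H = X @ W.T                         # encode
        E = H @ W - X                       # reconstruction error (decode - input)
        grad = (H.T @ E + (E @ W.T).T @ X) / len(X)
        W -= lr * grad
    return W

X = np.random.rand(100, 20)                 # toy unlabeled data
codes, weights = X, []
for n_hidden in [16, 8, 4]:                 # one autoencoder per hidden layer
    W = train_autoencoder(codes, n_hidden)
    weights.append(W)
    codes = codes @ W.T                     # feed the codes to the next layer
# `weights` would initialize the DNN before supervised fine-tuning.
```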

Large Scale Deep Neural Network

Parallel and distributed computing

SM-MIMD

DM-MIMD Vector Supercomputer

High-Speed Network

Workstation Storage

Special-purpose architectures, SIMD

Large Scale Distributed Deep Networks

• Problem: training a deep network with billions of parameters using tens of thousands of CPU cores.

• Exploit many kinds of parallelism

• Data parallelism

• Model parallelism

• Data and model parallelism (a toy sketch of data parallelism follows this list)
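A toy, synchronous sketch of data parallelism (mine, not from the slides): each worker computes a gradient on its own data shard, and the averaged gradient updates a shared toy linear model. Real systems run the workers on separate machines, and Downpour SGD (below) makes the exchange asynchronous:

```python
import numpy as np

def worker_gradient(w, X_shard, d_shard):
    """Each worker computes a gradient on its own shard (data parallelism)."""
    e = X_shard @ w - d_shard
    return X_shard.T @ e / len(X_shard)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
d = X @ np.arange(5.0)                          # toy linear targets
shards = np.array_split(np.arange(1000), 4)     # 4 workers
w = np.zeros(5)
for _ in range(100):
    grads = [worker_gradient(w, X[s], d[s]) for s in shards]
    w -= 0.1 * np.mean(grads, axis=0)           # average the workers' gradients
print(w)                                        # approaches [0, 1, 2, 3, 4]
```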

Large scale DNN

• Model parallelism

Minimal network traffic:

[Diagram: the model is partitioned across Machines 1-4; the most densely connected areas are placed on the same partition; data enters from below]

• Network partitions

Large scale SGD

• Asynchronous Stochastic Gradient Descent (SGD) (Widrow’s generalized delta rule)

wwnn 1  w nParameter Server ‘Downpour’ SGD(1). Model replicas asynchronously fetch parameters w and push gradients w to the parameter

server. wn wn

Model Replicas

Data Shards
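A toy single-process sketch of the Downpour interaction pattern described in the Dean et al. paper cited below; threads stand in for replica machines and the model is a trivial linear one (my assumptions):

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the global parameters w; replicas fetch/push asynchronously."""
    def __init__(self, dim, lr=0.05):
        self.w, self.lr = np.zeros(dim), lr
        self.lock = threading.Lock()
    def fetch(self):
        with self.lock:
            return self.w.copy()
    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad       # apply the pushed gradient

def replica(ps, shard, w_true):
    for x in shard:                        # each replica owns a data shard
        w = ps.fetch()                     # asynchronously fetch parameters
        e = np.dot(w - w_true, x)          # error on one sample
        ps.push(e * x)                     # push gradient of squared error

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
ps = ParameterServer(dim=3)
threads = [threading.Thread(target=replica,
                            args=(ps, rng.normal(size=(500, 3)), w_true))
           for _ in range(4)]              # 4 model replicas
for t in threads: t.start()
for t in threads: t.join()
print(ps.w)                                # near w_true despite async updates
```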

Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng, ‘Large Scale Distributed Deep Networks’, NIPS 2012.

Large scale L-BFGS

• Limited-memory quasi-Newton algorithm of Broyden, Fletcher, Goldfarb, and Shanno (L-BFGS).

wwnn1  w n Parameter Server

Coordinator w n w (small messages) n

Model Replicas L-BFGS-A: single ‘coordinator’ sends small messages to reppplicas and the parameter server to orchestrate batch optimization. Data Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng, ‘Large Scale Distributed Deep Networks’, NIPS 2012. DNN on Big-Data applications DNN: state-of-the-art performance reported in severaldl doma ins

• Text, Language Model and Natural Language Processing

• Information Retrieval

• Visual Object Recognition and Computer Vision

• Speech Recognition and Audio Processing

• Multimodal and Multi-task Learning: Text-Image, Speech-Image, …

Text and Language processing

• Feedforward Neural Net Language Model

[Diagram: neural net language model architecture – input words → projection layer (matrix U) → hidden layer (weights V, W) → output w(t)]

The training is done using backpropagation. The word vectors are in matrix U.

Text and Language processing

• Skip-gram architecture

Predicts the surrounding words w(t−2), w(t−1), w(t+1), w(t+2) given the current word w(t); a toy sketch follows.

[Diagram: input w(t) → hidden layer → output context words]
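A hedged numpy sketch of skip-gram with a full softmax (fine only for a toy vocabulary; real word2vec uses hierarchical softmax or negative sampling, and all sizes below are arbitrary):

```python
import numpy as np

def skipgram_step(U, V, center, context, lr=0.1):
    """One skip-gram update: predict one context word from the center word."""
    h = U[center]                              # hidden layer = embedding of w(t)
    p = np.exp(V @ h - np.max(V @ h))
    p /= p.sum()                               # softmax over the vocabulary
    p[context] -= 1.0                          # gradient of cross-entropy loss
    U[center] -= lr * (V.T @ p)                # update input embedding (in place)
    V -= lr * np.outer(p, h)                   # update output embeddings

rng = np.random.default_rng(0)
corpus = [0, 1, 2, 0, 1, 2, 0, 1, 2]           # toy stream of word ids
vocab, dim = 3, 4
U = rng.normal(scale=0.1, size=(vocab, dim))   # input word vectors
V = rng.normal(scale=0.1, size=(vocab, dim))   # output word vectors
for _ in range(200):
    for t in range(1, len(corpus) - 1):        # context window of +/- 1 word
        for c in (corpus[t - 1], corpus[t + 1]):
            skipgram_step(U, V, corpus[t], c)
# Rows of U are the learned word vectors.
```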

wt(2) Text and Language processing

• Ex. Skipgram Text Model

Hierarchical softmax Classifier

Single embedding function

Raw sparse features

Mikolov, Chen, Corrado, and Dean, ‘Efficient Estimation of Word Representations in Vector Space’, http://arxiv.org/abs/1301.3781

Text and Language processing

• Continuous Bag-of-words (CBOW) Architecture

Predicts the current word w(t) given the context words w(t−2), w(t−1), w(t+1), w(t+2).

[Diagram: input context words → hidden layer → output w(t)]

Example: Google

• Neural network trained to predict a word given the words nearby.

• It allows you to create numerical representations of each word.

• These representations can be mathematically manipulated like classic vectors.

• Training is carried out on databases of hundreds of billions of words.

http://deeplearning4j.org/word2vec.html

Example: Google

• W2V is a neural net that processes text before that text is handled by deep-learning algorithms.
• W2V creates features without human intervention, including the context of individual words.
• W2V can make highly accurate guesses about a word’s meaning based on its past appearances.

• Word: ‘france’

  Word          Cosine distance
  spain         0.678515
  belgium       0.665923
  netherlands   0.652428
  italy         0.633130
  switzerland   0.622323
  luxembourg    0.610033
  portugal      0.577154
  russia        0.571507
  germany       0.563291
  catalonia     0.534176
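Lists like the one above can be reproduced with, for example, the open-source gensim library; a hedged sketch, where the vectors file path is hypothetical and must point to real pre-trained word2vec vectors:

```python
from gensim.models import KeyedVectors

# "vectors.bin" is a hypothetical path to pre-trained word2vec vectors
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

for word, cosine in kv.most_similar("france", topn=10):
    print(f"{word:15s} {cosine:.6f}")
```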

Example: Google

• Here’s a graph of words associated with “China” using Word2vec

Example: Google

• ‘Semantic computation’

• The word vectors capture many linguistic regularities; for example, the vector operation vector('Paris') − vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'),

and

vector('king') − vector('man') + vector('woman') is close to vector('queen')
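Continuing the hedged gensim sketch above, the same analogies can be queried directly (again, "vectors.bin" is a hypothetical path to real pre-trained vectors):

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# vector('Paris') - vector('France') + vector('Italy') ~ vector('Rome')
print(kv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
# vector('king') - vector('man') + vector('woman') ~ vector('queen')
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```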

• W2V is a key element for the development of applications of great ‘V’alue operating on Big Data.

Visual object recognition: GoogleNet(1)

Deep network: 1 billion connections; a 9-layered, locally connected sparse autoencoder trained on a dataset of 10 million 200x200-pixel images downloaded from the Internet.

Training: parallel asynchronous stochastic gradient descent on a cluster of 1,000 machines (16,000 cores) for three days.

Image from (1)

(1) Q. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng, ‘Building high-level features using large scale unsupervised learning’, International Conference on Machine Learning, 2012.

Visual object recognition (2014 Google)

Winner of the 2014 ImageNet challenge with a 6.66% top-5 error rate

• 6 modules of convolutional layers.

• 24 layers deep!

• Good Fine-grained Classification

• Good Generalization

• Sensible errors

“Snake” “Dog”

Image from: http://www.engadget.com/2014/09/08/google-details-object-recognition-tech/

Visual object recognition: complex scene summarization in a few words(1)

“Two pizzas sitting on top of a stove top oven”

(1) Google Research Blog, http://googleresearch.blogspot.it/2014/11/a-picture-is-worth-thousand-coherent.html

App – NVIDIA’s DRIVE PX(1)

Self-Driving Cars Using Deep Learning

(1) http://dataconomy.com/nvidias-drive-px-platform-to-pave-way-for-self-driving-cars-using-deep-learning/

Industrial sectors of interest

• Topics include:

Banking / Retail / Finance
  – Identify: prospective customers, dissatisfied customers, good customers, bad payers
  – Obtain: more effective advertising, less credit risk, less fraud, decreased churn rate
  – Finance: econometrics, time-series analysis and prediction

Biomedical / Biometrics
  – Medicine: screening, diagnosis and prognosis, drug discovery (semantic medicine)
  – Security: face recognition, signature/fingerprint/iris verification, speaker recognition, DNA fingerprinting, …

Computer / Internet / Multimedia
  – Computer interfaces: troubleshooting wizards, handwriting and speech, brain waves
  – Internet: hit ranking, text categorization, text translation, sentiment analysis, …
  – Cyber security: network anomalies, cyber-attack prediction, spam detection, malicious-code recognition, …
  – Audio/video processing: audio-video content information retrieval, scene analysis, video games, virtual movies, …

Electrical / Computer Engineering
  – Wireless communication, cognitive radio, remote sensing, array processing, multi-sensor data fusion, robotics, Smart Grid, intelligent house, …

Data processing
  – Classification, time-series filtering, prediction, regression, clustering, spam filtering, security, …

• Etc.

Research activity @ Sapienza University of Rome

Computational intelligence

• Fast DNN model and architectures

• Random feature extraction

• Semi-supervised model

• Evolutionary methods for learning

• Distributed learning with Big Data

Research activity @ Sapienza University of Rome

Ex. Large Scale Distributed Learning on Big Data

• Development of learning algorithms that avoid communication with a single central node and that can scale to large networks.

• The data are distributed on a network of interconnected agents.

• Applications include: learning on sensor networks, peer-to-peer networks, swarms of robots, …

• Lynx toolbox: an open-source MATLAB toolbox designed for fast prototyping of supervised machine learning simulations.

Highlights of Deep Learning on Big Data

• DL can be used to merge symbolic and non-symbolic heterogeneous information

• Development of parallel DL algorithms distributed on clusters of servers and/or parallel hardware (e.g., CUDA GPUs, …)

• Supervised and unsupervised mixed learning

• Possibility of continuous adaptation (learning while working)

• Possibility of customized solutions for specific problems

• Real-time data stream processing

• Order of weeks to train on large-scale datasets even on the fastest available GPUs

• Heuristic approach for the determination of the network topology

• Many tricks to make them learn optimally

• Developing applications with DL requires expertise and experience.

Epilogue

• Had Aristotle and my elementary-school teacher already understood everything?

Conclusions

• The problem of artificial consciousness seems to be the final act in the history of engineering. Giving the term ‘engineer’ the extended meaning of ‘one who makes’, the construction of an artifact capable of saying “I exist” could represent the final dream of the human builder, who wants to build even without knowing. Ing. Vincenzo Tagliasco (1941-2008)

• My great ill has always been, and always will be, one: that of desiring and dreaming, instead of willing and doing. Ing. Carlo Emilio Gadda (1893-1973)

• Questions?