Data Analysis from Scratch with Python: The Complete Beginner's Guide to Techniques, and a Step-by-Step NLP Using Python Guide to Expert (Including Programming Interview Questions)

STEPHEN RICHARD

© Copyright 2019 by Stephen Richard. All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system.

TABLE OF CONTENTS

INTRODUCTION
CHAPTER 1: Method and Technique
    When to use each method and technique
    Machine learning methods
    Machine Learning Techniques
    Intel NLP Architect: Open Source Natural Language Processing Model Library
CHAPTER 2: How To Solve NLP Tasks: A Walkthrough On Natural Language Processing
    A look at the importance of Natural Language Processing
    NLP: How to Become a Natural Language Processing Specialist
CHAPTER 3: Programming Interview Questions
CHAPTER 4: What is data analysis and why is it important?
    Software for the Application of Automatic Learning Techniques for Medical Diagnosis
CHAPTER 5: What Is Machine Learning and What Is It Contributing to Cognitive Neuroscience?
    Machine learning techniques at the service of Energy Efficiency in the digital home
CHAPTER 6: 5 natural language processing methods that are rapidly changing the world around us
CHAPTER 8: How to develop a chatbot from scratch in Python: a detailed instruction
SUMMARY AND CONCLUSION

INTRODUCTION

Data analysis techniques have substantial support in machine learning for generating knowledge in the organization. Machine learning exploits statistics and many other areas of mathematics. Its advantage is speed. Although it would be unfair to say that the strength of these techniques is only speed, it must be taken into account that the speed characterizing this modality of artificial intelligence comes from processing power, both sequential and parallel, and from in-memory storage capacity. Beyond that, the potential of machine learning to complement other, more traditional data analysis techniques has to do with its ability to learn a representation of the data. Machine learning allows machines to use a language of their own for problem-solving. It is a language in continuous evolution, with roots in analysis, although, for many, it cannot be considered analytics as such. When the possibilities of machine learning complement the scope of data analysis techniques, it becomes much clearer what matters in terms of knowledge generation, not only at the quantitative level but also with a significant qualitative improvement.

Why machine learning is the best complement to data analysis techniques

Machine learning methods are far superior for analyzing data from multiple sources. Transactional information, data that comes from social media, or data that originates in systems such as CRM can overwhelm traditional data analysis techniques. In contrast, high-performance machine learning can analyze an entire Big Data set, instead of forcing business users to settle for a representative sample that, after all, remains a sample. This scalability not only allows predictive solutions based on sophisticated algorithms to be more accurate but also underlines the importance of software speed. It is already possible to interpret in real time the billions of rows and columns that must be investigated, while the analysis of the incoming data flow never stops. To take full advantage of machine learning, organizations must move towards a model that allows them to:

Gain intelligence and usability in the use of machine learning. Give access to this technology not only to data scientists and highly specialized professional profiles. Provide the means so that business users can also take advantage of the latest-generation analytical capabilities that machine learning offers them. When this technology is democratized and integrated with the data analysis techniques already in use in the organization, the business not only gains insights but also significantly shortens the time needed to generate quality knowledge. In this way, any organization, of any size, can exploit an unprecedented competitive potential.

When data analysis techniques find no substitute

It might seem obvious that the ideal complement to data analysis techniques could also be the determining factor in their demise. And yet it is not. Machine learning is a winning combination when one knows how to integrate it into an analytics strategy, but it cannot replace the predictive capabilities that every organization needs. Sometimes it is not efficient, nor profitable, much less logical, to resort to this form of artificial intelligence to solve specific questions. Typically, these are topics where the best answer comes from analytics, such as:

The marketing strategy that gives the best results. The calculation of routes. The behavior of a consumer segment.

Why invest in machine learning when it is not the best solution? Complement yes, substitute no. Although the data analysis techniques usually used in organizations can be limited, whether in the volume of data to be processed or in the speed at which insights must be obtained, machine learning is not always the most straightforward alternative either. Machine learning requires the organization to be able to provide data in certain conditions (a continuous flow in real time), and, on top of that, it needs an algorithm that can arrive at a set of rules that work regardless of whether information is available. And while it is true that machine learning is the only way to exploit the real value of IoT, or the most efficient and lowest-cost way to achieve innovative solutions around wearables (for example, in the health industry), it should also be borne in mind that many business decisions only require the support of a couple of processing rules, a solution that facilitates data discovery and trend identification, or software that forecasts future scenarios with a minimal margin of error.

Can you do without predictive analytics? Today, it would be unthinkable to define strategies or carry out any action without the support provided by data analysis techniques. Is it possible to keep moving forward without machine learning? Except in particular industries and particular cases, there is still room before it becomes imperative to turn to machines to identify potential patterns and correlations in data, despite the impressive results they deliver.

CHAPTER 1
Method and Technique

When to use each method and technique

You have probably heard more and more about machine learning, a subset of artificial intelligence. But what exactly can be done with it? The technology encompasses several methods and techniques, and each has a set of potential use cases. Companies would do well to examine them before moving forward with plans to invest in machine learning tools and infrastructure.

Machine learning methods

Supervised learning: Supervised learning is ideal if you know what you want a machine to learn. You can expose it to a vast set of training data, examine the result, and modify the parameters until you get the results you expect. Afterwards, you can see what the machine has learned by having it predict the results for a validation data set it has not seen before.

Supervised learning tasks include classification and prediction (regression). Supervised learning methods can be used for applications such as determining the financial risk of individuals and organizations based on past information about their financial performance. They can also give a good idea of how customers will act, or what their preferences are, based on previous behavior patterns. For example, the LendingTree online loan marketplace uses the DataRobot automated machine learning platform to customize experiences for its customers and predict their intent based on what they have done in the past, says Akshay Tandon, vice president and head of strategy and analytics. By predicting the intent of the client, primarily through lead scoring, LendingTree can distinguish people who are merely shopping for a rate from those who are really looking for a loan and are ready to apply for one. Using supervised learning techniques, his team built a classification model to estimate the probability of a lead closing.
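As a toy illustration of the supervised workflow described above (train on labeled examples, then check predictions on held-out validation data), here is a minimal sketch in plain Python using a 1-nearest-neighbor rule. The lead-scoring numbers and features are invented for illustration:

```python
# Toy supervised learning: predict whether a lead will close, based on two
# invented features: pages visited and days since last visit.
# Labels: 1 = ready to apply for a loan, 0 = just rate shopping.

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(train_X, train_y, point):
    """1-nearest-neighbor: copy the label of the closest training example."""
    nearest = min(range(len(train_X)), key=lambda i: distance(train_X[i], point))
    return train_y[nearest]

train_X = [(12, 1), (9, 2), (10, 1), (2, 30), (1, 25), (3, 40)]
train_y = [1, 1, 1, 0, 0, 0]

# Held-out validation data the model has not seen before.
valid_X = [(11, 2), (2, 35)]
valid_y = [1, 0]

predictions = [predict(train_X, train_y, p) for p in valid_X]
accuracy = sum(p == t for p, t in zip(predictions, valid_y)) / len(valid_y)
print(predictions, accuracy)  # [1, 0] 1.0
```

Real systems would of course use far richer features and a library model, but the shape of the workflow is the same: labeled training data, a learned rule, and evaluation on unseen data.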

Unsupervised learning: Unsupervised learning lets a machine explore a set of data and identify hidden patterns that link different variables. This method can be used to group data into clusters based only on their statistical properties. One useful application of unsupervised learning is the clustering used for probabilistic record linkage, a technique that extracts connections between data elements and builds on them to identify people and organizations and their connections in the physical or virtual world. This is especially useful for companies that need, for example, to integrate data from disparate sources and/or different business units to build a consistent and complete view of their customers, says Flavio Villanustre, vice president of technology at LexisNexis Risk Solutions, a company that uses analytics to help customers predict and manage risk. Unsupervised learning can also be used for sentiment analysis, which identifies people's emotional state based on their social media posts, emails, or other written comments, notes Sally Epstein, a machine learning engineering specialist at Cambridge Consultants. The firm has seen an increasing number of companies in financial services use unsupervised learning to obtain information on customer satisfaction.
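The grouping idea can be sketched with a minimal 1-D k-means (k=2) in plain Python. No labels are given; the algorithm separates the points purely by their values (the purchase amounts are invented):

```python
# Toy unsupervised learning: 1-D k-means with k=2.

def kmeans_1d(points, iterations=10):
    c1, c2 = min(points), max(points)          # crude initialization
    for _ in range(iterations):
        # assign each point to the nearest centroid
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        # move each centroid to the mean of its group
        c1 = sum(g1) / len(g1)
        c2 = sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

# e.g. daily purchase amounts from two unknown customer segments
amounts = [5, 6, 7, 8, 95, 100, 102, 110]
low, high = kmeans_1d(amounts)
print(low, high)  # [5, 6, 7, 8] [95, 100, 102, 110]
```

Production clustering would use a library implementation (for instance scikit-learn's KMeans) on many-dimensional data, but the assign-then-recompute loop is the same.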

Semi-supervised learning: Semi-supervised learning is a hybrid of supervised and unsupervised learning. By labeling a small portion of the data, a trainer can give the machine clues about how to group the rest of the data set. Semi-supervised learning can be used to detect identity fraud, among other uses. Fortunately, fraud is far less frequent than non-fraudulent activity, Villanustre notes, and as such, fraudulent activity can be considered an "anomaly" in the universe of legitimate activity. Even so, fraud does exist, and semi-supervised anomaly detection methods can be used to model solutions to these kinds of problems. This type of learning is implemented to identify fraud in online transactions. Semi-supervised learning can also be used when there is a mix of labeled and unlabeled data, which is often seen in large business environments, Epstein notes. Amazon has been able to improve the natural language understanding of its Alexa offering by training artificial intelligence algorithms on a combination of labeled and unlabeled data, she says. This has helped increase the accuracy of Alexa's responses.
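A minimal sketch of the idea, in the spirit of self-training: only a few transactions are labeled, and the remaining ones receive the label of the closest already-labeled point. The transaction amounts are invented, and real semi-supervised methods (label propagation, self-training with a proper classifier) are far more sophisticated:

```python
# Toy semi-supervised learning: propagate a handful of labels to
# unlabeled points by nearest-neighbor self-training.

labeled = {100: "legit", 102: "legit", 9000: "fraud"}   # amount -> label
unlabeled = [95, 110, 8700, 9500, 105]

while unlabeled:
    # pick the unlabeled point closest to any labeled point
    point = min(unlabeled,
                key=lambda p: min(abs(p - q) for q in labeled))
    nearest = min(labeled, key=lambda q: abs(point - q))
    labeled[point] = labeled[nearest]    # copy its neighbor's label
    unlabeled.remove(point)

print(labeled[8700], labeled[95])  # fraud legit
```

The key point is that three labels were enough to classify all eight transactions, which is exactly the economy that makes semi-supervised methods attractive when labeling is expensive.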

Reinforcement learning: Reinforcement learning lets a machine interact with its surroundings (for example, pushing damaged products from a conveyor into a container) and provides it with a reward when it does what you want. By automating the calculation of the reward, you can let the machine learn on its own time. One use case for reinforcement learning is the sorting of clothing and other items in a retail establishment. Some clothing retailers have been testing new types of technology, such as robotics, to help sort items such as clothing, footwear, and accessories, says David Schatsky, an analyst at Deloitte who focuses on emerging technology and business trends. Robots use reinforcement learning (as well as deep learning) to determine how much pressure to apply when grabbing objects and the best way to grab items in inventory, Schatsky notes. A variation of reinforcement learning is deep reinforcement learning, which is well suited to autonomous decision-making where supervised or unsupervised learning techniques alone cannot do the job.
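The trial-reward loop can be sketched with tabular Q-learning on a tiny invented world: an agent on a 5-cell line earns a reward only when it reaches the rightmost cell, and learns from that reward alone which action to prefer in each state:

```python
import random

# Toy reinforcement learning: tabular Q-learning on a 5-cell line.
# The agent starts at cell 0; reaching cell 4 yields a reward of +1.
# Actions: 0 = move left, 1 = move right.

random.seed(0)
n_states, actions = 5, (0, 1)
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

for _ in range(500):                    # training episodes
    s = 0
    while s != 4:
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda b: Q[(s, b)]))
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == 4 else 0.0     # automated reward calculation
        # standard Q-learning update
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                              - Q[(s, a)])
        s = s2

policy = [max(actions, key=lambda b: Q[(s, b)]) for s in range(4)]
print(policy)  # [1, 1, 1, 1]: the learned policy always moves toward the reward
```

Deep reinforcement learning replaces the Q table with a neural network, but the reward-driven update loop is conceptually the same.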

Deep learning: Deep learning can be used within other types of learning, such as unsupervised or reinforcement learning. In general terms, deep learning mimics some aspects of how people learn, mainly through the use of neural networks that identify the characteristics of a data set in more and more detail. Deep learning, in the form of deep neural networks (DNNs), has been used to accelerate high-content screening for drug discovery, says Schatsky. It involves applying DNN acceleration techniques to process multiple images in significantly less time while extracting a deeper understanding of the image features that the model ultimately learns. This machine learning method also helps many companies fight fraud, improving detection rates by using automation to detect irregularities. Deep learning is also used in the automotive industry. One company has developed a system based on neural networks that allows early detection of problems with cars, says Schatsky. The system can recognize noise and vibration and use any deviation from the norm to interpret the nature of the fault. It can become part of predictive maintenance, since it measures the vibrations of any moving part of the car and can notice even minor changes in its performance.

Machine Learning Techniques

Neural networks: Neural networks are built to mimic the structure of neurons in the human brain, with each artificial neuron connecting to other neurons within the system. Neural networks are assembled in layers, with neurons in one layer passing data to multiple neurons in the next layer, and so on. Eventually they reach the output layer, where the network presents its best guess at solving a problem, identifying an object, and so on.
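The layer-by-layer flow of a neural network can be sketched as a tiny forward pass. The weights here are hand-picked purely for illustration; in a real network they are learned during training:

```python
import math

# Toy neural network forward pass: 2 inputs -> 2 hidden neurons -> 1 output.

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def layer(inputs, weights, biases):
    """Each neuron sums its weighted inputs, adds a bias, and squashes."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

inputs = [0.5, 0.9]
hidden = layer(inputs, weights=[[1.0, -1.0], [0.5, 0.5]], biases=[0.0, -0.2])
output = layer(hidden, weights=[[2.0, -1.5]], biases=[0.1])[0]
print(output)  # a value between 0 and 1: the network's "best guess"
```

A deep network is simply many such layers stacked, with each layer re-representing the output of the previous one in progressively more abstract terms.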

Use cases for neural networks cover a variety of industries. In life sciences and health care, they can be used to analyze medical images to accelerate diagnostic processes and for drug discovery, says Schatsky. In telecommunications and media, neural networks can be used for language translation, fraud detection, and virtual assistant services. In financial services, they can be used for fraud detection, portfolio management, and risk analysis. In retail, they can be used to eliminate checkout lines and customize the customer experience.

Decision trees: A decision tree algorithm tries to classify items by identifying questions about their attributes that help decide which class to place them in. Each node in the tree is a question, with branches that lead to further questions about the items, and the leaves are the final classifications. Use cases for decision trees include building knowledge management platforms for customer service, price prediction, and product planning. An insurance company could use a decision tree when it needs to know what type of insurance products and premium adjustments are required based on potential risk, says Ray Johnson, chief data scientist at the business and technology consulting firm SPR. By overlaying location data with weather-related loss data, you can create risk categories based on the claims submitted and the cost amounts. Then you can evaluate new applications for coverage against the models to provide a risk category and possible financial impact, the executive said.

Random forests: While a single decision tree must be carefully trained to provide accurate results, the random forest algorithm takes a set of randomly created decision trees that base their decisions on different sets of attributes and lets them vote for the most popular class. Random forests are versatile tools for finding relationships in data sets, and they are quick to train, says Epstein.
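The voting mechanism of a random forest can be illustrated with a deliberately tiny sketch: several simple "trees" (here, hand-written one-question stumps on different attributes of a hypothetical email) each cast a vote, and the most popular class wins. Real forests learn hundreds of randomized trees from data, for example with scikit-learn's RandomForestClassifier; only the voting idea is shown here:

```python
from collections import Counter

# Toy random-forest voting: three stumps, each looking at a different
# attribute, vote on whether an email is spam.

def tree_caps(email):                    # stump on one attribute
    return "spam" if email["caps_ratio"] > 0.3 else "ham"

def tree_links(email):
    return "spam" if email["num_links"] > 3 else "ham"

def tree_known(email):
    return "ham" if email["known_sender"] else "spam"

def forest_predict(email):
    votes = Counter(tree(email) for tree in (tree_caps, tree_links, tree_known))
    return votes.most_common(1)[0][0]    # majority class wins

email = {"caps_ratio": 0.6, "num_links": 5, "known_sender": True}
print(forest_predict(email))  # spam: two of the three trees vote spam
```

Because each tree sees a different slice of the attributes, individual mistakes (here, tree_known is fooled by a compromised known sender) tend to be outvoted.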
For example, unsolicited bulk mail has long been a problem, not only for users but also for Internet service providers that have to manage the increased load on their servers. In response, automated methods have been developed to filter spam from standard email, using random forests to quickly and accurately identify unwanted emails, the executive said. Other uses of random forests include identifying diseases by analyzing patients' medical records, detecting bank fraud, predicting call volume in call centers, and predicting gains or losses from the purchase of a particular stock.

Clustering: Clustering algorithms use techniques such as k-means, mean-shift, or expectation-maximization to group data points based on shared or related characteristics. This is an unsupervised learning method that can be applied to classification problems. The clustering technique is particularly useful when you need to segment or categorize, Schatsky notes. Examples include segmenting customers by different characteristics to better target marketing campaigns, recommending news articles to certain readers, and effective law enforcement. Clustering is also effective for discovering groupings in complex data sets that may not be obvious to the human eye. Examples range from categorizing similar documents in a database to identifying crime hot spots in crime reports, says Epstein.

Association rule learning: Association rule learning is an unsupervised technique used in recommendation engines that looks for relationships between variables. This is the technique behind the "people who bought X also bought Y" suggestions on many e-commerce sites, and examples of its use are everywhere. A specific use case could be a specialty food retailer that wants to generate additional sales, says Johnson.
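The co-occurrence statistics behind such "bought X also bought Y" suggestions can be sketched in a few lines. The baskets are invented, and real systems use dedicated algorithms such as Apriori or FP-Growth on millions of transactions; this only shows the support/confidence idea:

```python
from itertools import combinations
from collections import Counter

# Toy association rule mining: count how often item pairs are bought
# together, then compute the confidence of the rule "X -> Y".

baskets = [
    {"chips", "salsa", "beer"},
    {"chips", "salsa"},
    {"chips", "beer"},
    {"bread", "butter"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

def confidence(x, y):
    """Estimated P(y in basket | x in basket)."""
    return pair_counts[frozenset((x, y))] / item_counts[x]

print(confidence("chips", "salsa"))  # two of the three chips baskets have salsa
```

A recommender would rank rules by confidence (and support) and suggest the right-hand side of the strongest rules at checkout.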
Such a retailer could use this technique to examine customers' buying behavior in order to offer cans and special product packages tied to events, sports teams, and so on. The association rules technique provides information that can reveal when and where customers bought their preferred combination of products. Using information about past purchases and their timing, the company can proactively create a rewards program, Johnson says, and make unique custom offers to boost future sales.

Intel NLP Architect: Open Source Natural Language Processing Model Library

Have you noticed that more and more companies are putting a bot widget on their site? Chatbots are everywhere today. And this is just one of many examples of Natural Language Processing (NLP) and Natural Language Understanding (NLU) technologies. The potential of NLP and NLU seems limitless. Now everyone understands that we are only at the beginning of a long journey. Titans in the IT industry are creating dedicated research departments to explore this area. Intel did not stand aside. Recently, Intel AI Lab released a product called NLP Architect. This is an open source library designed to serve as the basis for further research and collaboration of developers from around the world.

NLP Architect A team of NLP researchers and developers from Intel AI Lab is studying the current architecture of deep neural networks and methods for processing and understanding text. The result of their work was a set of tools that are interesting from both a theoretical and an applied point of view.

Here's what the current version of NLP Architect has:

Models that extract the linguistic characteristics of text: for example, a dependency parser (BIST) and an algorithm for extracting noun phrases;
State-of-the-art models for language understanding: for example, intent extraction and named entity recognition (NER);
Modules for semantic analysis: for example, collocations, the most probable sense of a word, and vector representations of noun phrases (NP2V);
Building blocks for conversational intelligence: for example, foundations for building chatbots, including a dialogue support system, a sequence chunker, and a system for determining user intent;
Examples of end-to-end deep neural networks with novel architectures: for example, question answering and machine reading comprehension systems.

For all of the above models there are examples of training and inference. The Intel team has also added scripts for typical tasks that arise when deploying such models: data processing pipelines and utilities often used in NLP. The library consists of separate modules, which simplifies integration. A general view of the NLP Architect framework is shown in the diagram below.

NLP Architect Framework

Components for NLP / NLU

Now let's take a closer look at some of the modules to better understand what is being discussed.

Sentence analyzer. Sequence chunking is one of the basic text-processing tasks; it consists of dividing sentences into syntactically related parts. For example, the sentence “Little Sasha walked along the highway” can be divided into four parts: the noun phrases “Little Sasha” and “highway”, the verb group “walked”, and the prepositional group “along”. The sentence analyzer in NLP Architect can build a suitable neural network architecture for different types of input data: tokens, part-of-speech labels, vector representations, and character-level features.

Semantic segmentation of noun phrases. A noun phrase consists of a head - a noun or pronoun - and several dependent qualifying members. To simplify, noun phrases can be divided into two types:

With descriptive structure: the dependent members do not significantly affect the semantics of the head (for example, “sea water”); With collocation structure: the dependent members significantly change the meaning of the head (for example, “guinea pig”). To determine the type of noun phrase, a multilayer perceptron is trained. This model is used in the semantic sentence segmentation algorithm: as a result of its work, noun phrases of the first type break up into several semantic elements, while noun phrases of the second type remain unified.

The parser performs grammatical analysis of sentences, examining the relationships between words and highlighting things like direct objects and predicates. NLP Architect includes a graph-based dependency parser that uses a BiLSTM to extract features.

The named entity recognizer (NER) identifies words or combinations of words in a text that belong to a class of interest. Examples of entities include names, numbers, places, currencies, dates, and organizations. Sometimes entities can be distinguished quite easily with the help of features such as word form, the presence of a word in a certain dictionary, or its part of speech. Quite often, however, these features are unknown or do not even exist; in such cases, to determine whether a word or phrase is an entity, its context must be analyzed. The NER model in NLP Architect is based on a bidirectional LSTM network and a CRF classifier. A high-level view of the model is presented below.

High-Level View of the NER Model

The user intent determination algorithm addresses the language understanding problem. Its purpose is to understand what kind of action is being discussed in the text and to identify all parties involved. For the sentence “Siri, please remind me to pick up things from the laundry on the way home,” the algorithm determines the action (“remind”), who should perform it (“Siri”), who asks for it to be performed (“I”), and the object of the action (“pick up things from the laundry”).

The word sense analyzer. This algorithm receives a word as input and returns all the senses of that word, along with numbers characterizing the prevalence of each sense in the language.

End-to-End Examples of Neural Networks

As already mentioned, the NLP Architect library includes several end-to-end models. Let us briefly look at two of them.

Reading comprehension. The folder named reading_comprehension contains an implementation of the language comprehension model from the paper Machine Comprehension Using Match-LSTM and Answer Pointer. The idea of this method is to construct a vector representation of a fragment of text that takes the question into account, and to feed this vector representation to a neural network that returns the positions of the beginning and end of the answer within the fragment.

Target dialogue

End-to-end neural network with memory for goal-oriented dialogue. The folder memn2n_dialogue contains an implementation of a memory network for maintaining a goal-oriented dialogue. During a goal-oriented dialogue, in contrast to a conversation on a free topic, the automatic system has a specific goal that must be achieved through interaction with the user. In short, the system needs to understand the user's request and complete the corresponding task in a limited number of dialogue turns. The task may be booking a table in a restaurant, setting a timer, and so on.

NLP Architect Visualizer The library includes a small web server - NLP Architect Server. It makes it easy to test the performance of different NLP Architect models. Among other things, the server has visualizers, which are pretty nice diagrams that demonstrate the operation of the models. Currently, two services support visualization - a parser and a recognizer of named entities. In addition, there is a template with which the user can add visualization for other services.

NLP Architect is an open and flexible library of text-processing algorithms, which makes it possible for developers from all over the world to collaborate. The Intel team continues to add the results of its research to the library so that anyone can take advantage of and improve on what they have done. To get started, just download the code from the GitHub repository and follow the installation instructions; there you can find comprehensive documentation for all the main modules and ready-made models. In future releases, Intel AI Lab plans to demonstrate the benefits of building text analysis algorithms with the latest deep learning technologies, and to include methods for sentiment extraction, topic and trend analysis, specialized vocabulary expansion, and relationship extraction. In addition, Intel experts are exploring unsupervised and semi-supervised learning methods, with which new interpretable models for understanding and analyzing text can be created that adapt to new areas of knowledge.

CHAPTER 2

How To Solve NLP Tasks: A Walkthrough On Natural Language Processing

Over the past year, the Insight team has participated in several hundred projects, combining the knowledge and experience of leading companies in the United States. They summarized the results of this work in a guide, on which this section is based, and distilled approaches to solving the most common applied machine learning problems. We will start with the simplest method that can work and gradually move on to more subtle approaches such as feature engineering, word vectors, and deep learning. After reading this section, you will know how to:

Collect, prepare, and inspect data; build simple models and, when necessary, make the transition to deep learning; interpret and understand your models, to make sure you are capturing information rather than noise. This section is written as a walkthrough; it can also be seen as a review of highly effective standard approaches.

Using machine learning to understand and use text Natural language processing provides exciting new results and is a very broad field. However, Insight identified the following key aspects of practical application that are much more common than others:

Identifying various cohorts of users or customers (for example, predicting customer churn, total customer profit, or product preferences); accurately detecting and extracting various categories of feedback (positive and negative opinions, references to individual attributes such as clothing size, etc.); classifying text according to its meaning (basic help request versus urgent problem). Despite the large number of scientific publications and tutorials on NLP available online, there are today practically no complete recommendations or tips on how to deal with NLP tasks effectively, starting from the very basics.

Step 1: Collect Your Data

Sample Data Sources Any machine learning task starts with data — whether it's a list of email addresses, posts, or tweets. Common sources of textual information are:

Product reviews (Amazon, Yelp and various app stores). Content created by users (tweets, Facebook posts, questions on StackOverflow). Diagnostic information (user requests, support tickets, chat logs).

Social Media Disasters dataset To illustrate the approaches described, we will use the Disasters in Social Media dataset, kindly provided by CrowdFlower.

Our task will be to determine which tweets are related to a disaster event, as opposed to tweets on irrelevant topics (for example, films). Why would we do this? A potential use would be to notify emergency officials only about genuine emergencies, without distracting them with reviews of the latest Adam Sandler film. The particular difficulty of this task is that both classes contain the same search terms, so we will have to use more subtle differences to separate them. In what follows, we will refer to tweets about disasters as “disaster” and tweets about everything else as “irrelevant.”

Labels. Our data is labeled, so we know which categories the tweets belong to. As Richard Socher emphasizes, it is usually faster, easier, and cheaper to find and label enough data to train a model on than to try to optimize a complex unsupervised method.

Step 2. Clean your data

Rule number one: “Your model can only ever be as good as your data.” One of the key skills of a professional data scientist is knowing whether the next step should be working on the model or on the data. As practice shows, it is better to look at the data first and only then clean it up. A clean dataset allows a model to learn meaningful features instead of overfitting on irrelevant noise. The following checklist is used to clean our data (details can be found in the code):
Delete all irrelevant characters (for example, any non-alphanumeric characters).
Tokenize the text by dividing it into individual words.
Remove irrelevant words, such as Twitter @mentions or URLs.
Convert all characters to lower case so that words such as “Hello”, “HELLO”, and “hello” are treated as the same word.
Consider combining misspelled or alternatively spelled words into a single representation (for example, “cool”/“kewl”/“cooool”).
Consider lemmatization, that is, reducing the different forms of a word to its dictionary form (for example, “be” instead of “am”, “are”, and “is”).
After we go through these steps and check for additional errors, we can begin to use the clean, labeled data to train models.
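Most of the checklist above can be sketched with the standard library's re module. Lemmatization and spelling normalization are omitted here; in practice a library such as NLTK or spaCy would handle those steps:

```python
import re

# A minimal version of the cleaning checklist (stdlib only).

def clean_tweet(text):
    text = text.lower()                          # lowercase everything
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"@\w+", " ", text)            # remove Twitter mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # drop non-alphanumerics
    return text.split()                          # tokenize on whitespace

tokens = clean_tweet("Forest FIRE near La Ronge! @user http://t.co/abc123")
print(tokens)  # ['forest', 'fire', 'near', 'la', 'ronge']
```

The order of the steps matters: URLs and mentions are stripped before punctuation removal, so their fragments do not survive as spurious tokens.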

Step 3. Choose a good data representation

Machine learning models take numerical values as input. For example, models that work with images take a matrix representing the intensity of each pixel in each color channel.

A smiling face, represented as an array of numbers

Our dataset is a list of sentences, so for our algorithm to extract patterns from the data, we must first find a way to represent it in a form our algorithm can understand.

One-hot encoding (“Bag of Words”). A natural way to represent text for computers is to encode each character individually as a number (ASCII encoding is an example of this approach). If we “fed” such a simple representation to a classifier, it would have to learn the structure of words from scratch, based only on our data, which is impossible for most datasets. We should therefore use a higher-level approach. For example, we can build a vocabulary of all the unique words in our dataset and associate a unique index with each word. Each sentence can then be represented as a list whose length equals the number of unique words in the vocabulary, where each index stores how many times the corresponding word appears in the sentence. This model is called the “Bag of Words”, since it is a representation that completely ignores the order of words in a sentence. Below is an illustration of this approach.

Representation of sentences as a "bag of words". The original sentences are on the left, their representations on the right. Each index in the vectors represents one particular word.
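A minimal pure-Python sketch of this indexing scheme (libraries such as scikit-learn's CountVectorizer do the same thing far more efficiently):

```python
def bag_of_words(sentences):
    """Build a word-to-index vocabulary and a count vector per sentence."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))  # assign the next free index
    vectors = []
    for sentence in sentences:
        vec = [0] * len(vocab)
        for word in sentence.split():
            vec[vocab[word]] += 1               # count occurrences per index
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat ate the fish"])
```

Each vector has one slot per vocabulary word, which is why the vectors grow as long as the vocabulary itself.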

Visualizing vector representations. The vocabulary of the Disasters on Social Media dataset contains about 20,000 words. This means that each sentence will be represented by a vector of length 20,000. The vector will consist mostly of zeros, since each sentence contains only a small subset of our vocabulary. To find out whether our vector representations (embeddings) capture information relevant to our task (for example, whether tweets are related to disasters or not), we should try to visualize them and see how well the classes are separated. Since vocabularies are usually very large and visualizing data in 20,000 dimensions is impossible, approaches like principal component analysis (PCA) help project the data into two dimensions.
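A sketch of such a projection via the singular value decomposition (in practice you would typically use sklearn.decomposition.PCA and a matplotlib scatter plot colored by class label):

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)     # PCA requires mean-centered data
    # The rows of Vt are the principal directions, ordered by variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T        # coordinates in the top-2 component basis

# Random stand-in for our 20,000-dimensional bag-of-words vectors
X = np.random.rand(100, 50)
points_2d = pca_2d(X)
```

The first component captures at least as much variance as the second, so the 2D plot preserves as much of the spread as any plane can.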

Visualization of the "bag of words" vector representations. Judging by the resulting plot, the two classes do not seem to be separated as well as they should be; this may be a property of our representation or simply an effect of the dimensionality reduction. To find out whether the "bag of words" features are useful to us, we can train a classifier on them.

Step 4. Classification

When tackling a task for the first time, it is common practice to start with the simplest method or tool that can solve the problem. When it comes to classifying data, the most common choice is logistic regression, because of its versatility and ease of interpretation. It is very simple to train, and its results are interpretable, since you can easily extract the most important coefficients from the model.

We will split our data into a training set, which we will use to fit the model, and a test set, to see how well the model generalizes to data it has not seen before. After training, we get an accuracy of 75.4%. Not bad! Guessing the most frequent class ("irrelevant") would give us only 57%.
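The split-train-evaluate loop can be sketched as follows. In practice you would use scikit-learn's train_test_split and LogisticRegression; to stay self-contained, this uses a minimal gradient-descent logistic regression on synthetic two-class data (the blob means, learning rate, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for bag-of-words features: two separable Gaussian blobs
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(3, 1, (100, 5))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out 20% as a test set
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
split = int(0.8 * len(y))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Minimal logistic regression trained by gradient descent
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X_train @ w + b)))       # predicted probabilities
    w -= 0.5 * (X_train.T @ (p - y_train)) / len(y_train)
    b -= 0.5 * np.mean(p - y_train)

accuracy = np.mean(((X_test @ w + b) > 0) == y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Accuracy on held-out data, not on the training set, is what tells us whether the model generalizes.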

However, even if 75% accuracy were sufficient for our needs, we should never ship a model to production without trying to understand it.

Step 5. Inspection

Confusion matrix. The first step is to understand what types of errors our model makes and which of them we would like to see less often. In our example, false positives classify an irrelevant tweet as a disaster, and false negatives classify a disaster as an irrelevant tweet. If our priority is to react to every potential event, we will want to reduce our false negatives. If, however, we are limited in resources, we may instead prioritize a lower false-positive rate to reduce the likelihood of false alarms. A good way to visualize this information is a confusion matrix, which compares the predictions made by our model with the true labels. Ideally, this matrix is a diagonal line running from the upper left to the lower right corner (meaning our predictions match the truth perfectly).

Our classifier produces proportionally more false negatives than false positives. In other words, our model's most common mistake is inaccurately classifying disasters as irrelevant. If false positives carry a high cost for law enforcement, this may be an acceptable bias for our classifier.
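A sketch of computing such a matrix by hand (sklearn.metrics.confusion_matrix produces the same layout); the example labels here are made up for illustration:

```python
def confusion_matrix(y_true, y_pred):
    """2x2 confusion matrix: rows are true labels, columns are predictions."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# 1 = disaster, 0 = irrelevant
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
m = confusion_matrix(y_true, y_pred)
# m[1][0] counts false negatives (disasters classified as irrelevant)
# m[0][1] counts false positives (irrelevant tweets classified as disasters)
```

In this toy example m[1][0] exceeds m[0][1], the same imbalance described in the text.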

Explaining and interpreting our model. To validate our model and interpret its predictions, it is important to look at which words it uses to make decisions. If our data is biased, our classifier will make accurate predictions on the sample data, but the model will not generalize well in the real world. The diagram below shows the most significant words for the disaster and irrelevant classes. Plotting word importance is easy with a "bag of words" and logistic regression, since we simply extract and rank the coefficients the model uses for its predictions. "Bag of words": word importance

Our classifier correctly found several patterns (hiroshima, massacre), but it has clearly overfit on some meaningless terms ("heyoo", "x1392"). Right now our "bag of words" model deals with a huge vocabulary of words, all of which it treats as equivalent. However, some of these words are very common and only add noise to our predictions. Therefore, we will try to find a way of representing sentences that accounts for word frequency, and see whether we can extract more useful information from our data.
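Extracting and ranking the coefficients, as described above, might look like this (the vocabulary and coefficient values are invented for illustration):

```python
def important_words(coefficients, vocab, top_n=3):
    """Rank vocabulary words by their logistic-regression coefficient."""
    index_to_word = {i: w for w, i in vocab.items()}
    order = sorted(range(len(coefficients)),
                   key=lambda i: coefficients[i], reverse=True)
    positive = [index_to_word[i] for i in order[:top_n]]         # push toward "disaster"
    negative = [index_to_word[i] for i in order[-top_n:][::-1]]  # push toward "irrelevant"
    return positive, negative

vocab = {"fire": 0, "lol": 1, "earthquake": 2, "cat": 3}
coefs = [2.1, -1.5, 3.0, -0.2]
pos, neg = important_words(coefs, vocab, top_n=2)
```

Because a bag-of-words model assigns one coefficient per vocabulary word, this ranking is a direct readout of what the classifier has learned.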

Step 6. Account for vocabulary structure: TF-IDF

To help our model focus on meaningful words, we can apply TF-IDF (Term Frequency, Inverse Document Frequency) weighting on top of our "bag of words" model. TF-IDF weighs words by how rare they are in our dataset, discounting words that are too common and only add noise. Below is the PCA projection of our new representation.

Visualization of vector representation using TF-IDF.

We can observe a clearer separation between the two colors, which suggests it should be easier for our classifier to separate the two groups. Let's see whether our results improve. Training another logistic regression on our new vector representations, we get an accuracy of 76.2%, a very slight improvement. Has our model also started to pick more important words? If this part of the result has improved, and we are not letting the model "cheat", this approach can be considered an improvement. TF-IDF: word importance. The words the model picks do look much more relevant. Even though the metrics on our test set increased only slightly, we now have much more confidence in using the model in a real system that interacts with customers.
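The weighting itself is straightforward to sketch in pure Python (this uses a plain log(N/df) inverse document frequency; library implementations such as scikit-learn's TfidfVectorizer use smoothed variants):

```python
import math

def tf_idf(documents):
    """TF-IDF weighting over tokenized documents (a minimal sketch)."""
    n = len(documents)
    # Document frequency: in how many documents does each word appear?
    df = {}
    for doc in documents:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    vectors = []
    for doc in documents:
        vec = {}
        for word in doc:
            tf = doc.count(word) / len(doc)   # term frequency within the document
            idf = math.log(n / df[word])      # rarity across the whole corpus
            vec[word] = tf * idf
        vectors.append(vec)
    return vectors

docs = [["fire", "in", "the", "city"], ["cats", "in", "the", "house"]]
vecs = tf_idf(docs)
# "in" and "the" occur in every document, so their weight is log(2/2) = 0
```

Words that appear everywhere get weight zero, which is exactly the noise-suppression effect described above.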

Step 7. Apply semantics: Word2Vec

Our latest model manages to pick up on the words that carry the most signal. However, when we release it in production, it will most likely encounter words that did not appear in the training set and will fail to classify those tweets accurately, even if it saw very similar words during training. To solve this problem, we need to capture the semantic meaning of words: it is important for us to understand that "good" and "positive" are closer to each other than "apricot" and "continent". We will use the Word2Vec tool to help us map word meanings.

Using pre-trained vectors. Word2Vec is a technique for finding continuous embeddings for words. It learns by reading a huge amount of text and memorizing which words appear in similar contexts. After training on enough data, Word2Vec generates a 300-dimensional vector for each word in the vocabulary, in which words with similar meanings are located closer to each other. The authors of the paper on continuous vector representations of words open-sourced a model pre-trained on a very large corpus, and we can use it in our model to bring in knowledge of the semantic meaning of words.

Sentence-level representations. A quick way to get sentence embeddings for our classifier is to average the Word2Vec vectors of all words in the sentence. This is the same approach as with the "bag of words" earlier, but this time we lose only the syntax of our sentence while preserving the semantic information.
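A sketch of this averaging (the three-dimensional toy vectors stand in for real pre-trained 300-dimensional Word2Vec embeddings, which you would normally load with a library such as gensim):

```python
import numpy as np

# Toy stand-in for pre-trained Word2Vec embeddings
embeddings = {
    "flood":  np.array([0.9, 0.1, 0.0]),
    "rescue": np.array([0.8, 0.2, 0.1]),
    "funny":  np.array([0.0, 0.9, 0.8]),
}

def sentence_vector(words, embeddings, dim=3):
    """Average the vectors of known words; zero vector if none are known."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

v = sentence_vector(["flood", "rescue", "unknownword"], embeddings)
```

Note that out-of-vocabulary words are simply skipped; the sentence vector is the mean of whatever words the embedding table does know.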

Sentence vector representations in Word2Vec. Here is a visualization of our new embeddings after applying these techniques: Visualization of Word2Vec vector representations. The two groups of colors now look even more separated, and this should help our classifier find the difference between the two classes. After training the same model (logistic regression) a third time, we get an accuracy of 77.7%, our best result so far! It is time to inspect our model.

The trade-off between complexity and explainability. Since our embeddings are no longer represented as a vector with one dimension per word, as in the previous models, it is now harder to tell which words are most relevant to our classification. Although we still have access to the coefficients of our logistic regression, they now relate to the 300 dimensions of our embeddings rather than to word indices. For such a small gain in accuracy, completely losing the ability to explain the model's behavior is too harsh a compromise. Fortunately, when working with more complex models, we can use interpreters like LIME to get some idea of how the classifier works.

LIME. LIME is available on GitHub as an open-source package. This black-box interpreter lets users explain the decisions of any classifier on one specific example by perturbing the input (in our case, deleting a word from a sentence) and observing how the prediction changes. Let's look at a couple of explanations for sentences from our dataset.

The right disaster-related words are picked up when classifying a tweet as "relevant".

Here the contribution of words to the classification seems less obvious. However, we do not have time to explore thousands of examples from our dataset. Instead, we will run LIME on a representative sample of the test data and see which words appear regularly and contribute most strongly to the final result. With this approach we can obtain word-importance scores just as we did for previous models, and validate our model's predictions. It appears that the model picks highly relevant words and makes correspondingly clear decisions. Of all the models so far, it selects the most relevant words, so it is the best candidate to ship to production.
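The perturbation idea behind LIME can be illustrated with a toy version: delete one word at a time and watch the classifier's score move. (The keyword-counting score function below is an invented stand-in for a real trained classifier, and this sketch is a simplification of what the LIME package actually does.)

```python
def explain_prediction(sentence, score_fn):
    """LIME-style word importance: remove each word, see how the score drops."""
    words = sentence.split()
    base = score_fn(words)
    importance = {}
    for i, word in enumerate(words):
        perturbed = words[:i] + words[i + 1:]
        importance[word] = base - score_fn(perturbed)  # big drop = important word
    return importance

# Hypothetical classifier score: fraction of "disaster" keywords in the sentence
DISASTER_WORDS = {"fire", "flood", "earthquake"}
def toy_score(words):
    return sum(w in DISASTER_WORDS for w in words) / max(len(words), 1)

imp = explain_prediction("huge fire downtown", toy_score)
```

Words whose removal lowers the score the most are the ones the classifier relies on, which is the intuition behind LIME's local explanations.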

Step 8. Using syntax with end-to-end approaches

We have looked at fast and efficient approaches for generating compact sentence embeddings. However, by omitting word order, we discard all syntactic information from our sentences. If these methods do not give good enough results, you can use a more complex model that takes whole sentences as input and predicts labels without needing an intermediate representation. A common way to do this is to treat a sentence as a sequence of individual word vectors obtained with Word2Vec or more recent approaches such as GloVe or CoVe. That is what we will do next. A highly efficient architecture that learns the model end to end, with no additional pre-processing or post-processing (source).
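The core operation of such models over a sequence of word vectors is a convolution: a small filter slides over every window of consecutive words and emits one activation per position. A minimal numpy sketch (the tiny 2-dimensional embeddings and filter values are invented purely for illustration; a real text CNN learns many such filters):

```python
import numpy as np

def conv1d_over_words(word_vectors, filt):
    """Slide a filter over every window of consecutive word vectors.

    word_vectors: (num_words, dim); filt: (window, dim).
    Returns one activation per window position, as in one text-CNN channel.
    """
    n, dim = word_vectors.shape
    window = filt.shape[0]
    return np.array([
        np.sum(word_vectors[i:i + window] * filt)
        for i in range(n - window + 1)
    ])

sentence = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 words, 2-dim embeddings
filt = np.array([[1.0, 0.0], [0.0, 1.0]])                   # a bigram filter
activations = conv1d_over_words(sentence, filt)
# Max-pooling over positions yields one feature per filter
feature = activations.max()
```

Because the filter spans a window of adjacent words, the resulting feature is sensitive to word order, which is precisely what the bag-of-words models threw away.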

Convolutional neural networks for sentence classification train very quickly and work well as an entry-level deep learning architecture. Although convolutional neural networks (CNNs) are mostly known for their performance on image data, they give excellent results on text and are usually much faster to train than more complex NLP approaches (such as LSTM networks and encoder/decoder architectures). This model preserves word order and learns valuable information about which word sequences predict our target classes. Unlike the previous models, it knows the difference between "Lesha eats plants" and "Plants eat Lesha". Training this model does not require much more effort than the previous approaches, and the result works much better, giving us an accuracy of 79.5%. As with the models reviewed earlier, the next step is to explore and explain its predictions using the methods described above, to make sure the model is the best option we can offer users. By this point, you should feel confident enough to handle the next steps yourself. So, here is a summary of the approach we have successfully put into practice:

Start with a quick and simple model;
explain its predictions;
understand what kinds of mistakes it makes;
use that knowledge to decide on the next step, whether it is working on the data or on a more complex model.

We walked through these approaches using a specific example, with models tailored to recognizing, understanding, and working with short texts such as tweets; however, the same ideas are widely applicable to many different tasks.

A look at the importance of Natural Language Processing

The Deep Learning Tsunami Deep Learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences. However, some pundits are predicting that the final damage will be even worse. Accompanying ICML 2015 in Lille, France, there was another, almost as big, event: the 2015 Deep Learning Workshop. The workshop ended with a panel discussion, and at it, Neil Lawrence said, “NLP is kind of like a rabbit in the headlights of the Deep Learning machine, waiting to be flattened.” Now that is a remark that the computational linguistics community has to take seriously! Is it the end of the road for us? Where are these predictions of steamrollering coming from? At the June 2015 opening of the Facebook AI Research Lab in Paris, its director Yann LeCun said: “The next big step for Deep Learning is natural language understanding, which aims to give machines the power to understand not just individual words but entire sentences and paragraphs.”1 In a November 2014 Reddit AMA (Ask Me Anything), Geoff Hinton said, “I think that the most exciting areas over the next five years will be really understanding text and videos. I will be disappointed if in five years’ time we do not have something that can watch a YouTube video and tell a story about what happened. In a few years’ time we will put [Deep Learning] on a chip that fits into someone’s ear and have an English-decoding chip that’s just like a real Babel fish.”2 And Yoshua Bengio, the third giant of modern Deep Learning, has also increasingly oriented his group’s research toward language, including recent exciting new developments in neural machine translation systems. It’s not just Deep Learning researchers. 
When leading machine learning researcher Michael Jordan was asked at a September 2014 AMA, “If you got a billion dollars to spend on a huge research project that you get to lead, what would you like to do?” he answered: “I’d use the billion dollars to build a NASA-size program focusing on natural language processing, in all of its glory (semantics, pragmatics, etc.).” He went on: “Intellectually I think that NLP is fascinating, allowing us to focus on highly structured inference problems, on issues that go to the core of ‘what is thought’ but remain eminently practical, and on a technology that surely would make the world a better place.” Well, that sounds very nice! So, should computational linguistics researchers be afraid? I’d argue, no. To return to the Hitchhiker’s Guide to the Galaxy theme that Geoff Hinton introduced, we need to turn the book over and look at the back cover, which says in large, friendly letters: “Don’t panic.”

The Success of Deep Learning There is no doubt that Deep Learning has ushered in amazing technological advances in the last few years. I won’t give an extensive rundown of successes, but here is one example. A recent Google blog post told about Neon, the new transcription system for Google Voice.3 After admitting that in the past Google Voice voicemail transcriptions often weren’t fully intelligible, the post explained the development of Neon, an improved voicemail system that delivers more accurate transcriptions, like this: “Using a (deep breath) long short-term memory deep recurrent neural network (whew!), we cut our transcription errors by 49 percent.” Do we not all dream of developing a new approach to a problem which halves the error rate of the previously state-of-the-art system?

Why Computational Linguists Need Not Worry

Michael Jordan, in his AMA, gave two reasons why he wasn't convinced that Deep Learning would solve NLP: "Although current deep learning research tends to claim to encompass NLP, I'm (1) much less convinced about the strength of the results, compared to the results in, say, vision; (2) much less convinced in the case of NLP than, say, vision, that the way to go is to couple huge amounts of data with black-box learning architectures." Jordan is certainly right about his first point: so far, problems in higher-level language processing have not seen the dramatic error rate reductions from deep learning that have been seen in speech recognition and in object recognition in vision. Although there have been gains from deep learning approaches, they have been more modest than sudden 25 percent or 50 percent error reductions. It could easily turn out that this remains the case; the really dramatic gains may only have been possible on true signal processing tasks. On the other hand, I'm much less convinced by his second argument. There are two reasons why NLP need not worry about deep learning: (1) it just has to be wonderful for our field for the smartest and most influential people in machine learning to be saying that NLP is the problem area to focus on; and (2) our field is the domain science of language technology; it is not about the best method of machine learning, because the central issue remains the domain problems. The domain problems will not go away. Joseph Reisinger wrote on his blog: "I get pitched regularly by startups doing 'generic machine learning' which is, in all honesty, a pretty ridiculous idea. Machine learning is not undifferentiated heavy lifting, it's not commoditizable like EC2, and closer to design than coding." From this perspective, it is people in linguistics, people in NLP, who are the designers.
Recently at ACL conferences, there has been an over-focus on numbers, on beating the state of the art. Call it playing the Kaggle game. More of the field's effort should go into problems, approaches, and architectures. Recently, one thing that I have been devoting a lot of time to, together with many other collaborators, is the development of Universal Dependencies. The goal is to develop a common syntactic dependency representation and POS and feature label sets that can be used with reasonable linguistic fidelity and human usability across all human languages. That is just one example; there are many other design efforts underway in our field. One other current example is the idea of Abstract Meaning Representation.

Deep Learning of Language

Where has Deep Learning helped NLP? The gains so far have come not so much from true Deep Learning (the use of a hierarchy of more abstract representations to promote generalization) as from the use of distributed word representations: real-valued vector representations of words and concepts. Having a dense, multidimensional representation of similarity between all words is incredibly useful in NLP, and not only in NLP. Indeed, the importance of distributed representations evokes the "Parallel Distributed Processing" mantra of the earlier surge of neural network methods, which had a much more cognitive-science directed focus (Rumelhart and McClelland 1986). Distributed representations can better explain human-like generalization, but also, from an engineering perspective, the use of small-dimensionality dense vectors for words allows us to model large contexts, leading to greatly improved language models. Seen from this new perspective, the exponentially greater sparsity that comes from increasing the order of traditional word n-gram models seems conceptually bankrupt.
Few people doubt that the idea of deep models will also prove useful. The sharing that occurs within deep representations can theoretically give an exponential representational advantage and, in practice, yields improved learning systems. The general approach to building Deep Learning systems is compelling and powerful: the researcher defines a model architecture and a top-level loss function, and then both the parameters and the representations of the model self-organize so as to minimize this loss, in an end-to-end learning framework. We are starting to see the power of such deep systems in recent work on neural machine translation (Sutskever, Vinyals, and Le 2014; Luong et al. 2015). Intelligence requires being able to understand bigger things from knowing about smaller parts. In particular for language, understanding novel and complex sentences crucially depends on being able to construct their meaning compositionally from the smaller parts, words and multi-word expressions, of which they are constituted. Recently, there have been many, many papers showing how systems can be improved by using distributed word representations from "deep learning" approaches, such as word2vec (Mikolov et al. 2013) or GloVe (Pennington, Socher, and Manning 2014). However, this is not actually building Deep Learning models, and there is hope that in the future more people will focus on the strongly linguistic question of whether we can build meaning composition functions in Deep Learning systems.

Scientific Questions That Connect Computational Linguistics and Deep Learning

I encourage people not to get into the rut of doing no more than using word vectors to make performance go up a couple of percent. Even more strongly, I would like to suggest that we might return instead to some of the interesting linguistic and cognitive issues that motivated noncategorical representations and neural network approaches. One example of noncategorical phenomena in language is the POS of words in the gerund V-ing form, such as driving. This form is classically described as ambiguous between a verbal form and a nominal gerund. In fact, however, the situation is more complex, as V-ing forms can appear in any of the four core categories of Chomsky (1970):

What is even more interesting is that there is evidence that there is not just an ambiguity but mixed noun–verb status. For example, a classic linguistic test for being a noun is appearing with a determiner, while a classic linguistic test for being a verb is taking a direct object. However, it is well known that the gerund nominalization can do both of these things at once:

1. The not observing this rule is that which the world has blamed in our satirist. (Dryden, Essay Dramatick Poesy, 1684, page 310)

2. The only mental provision she was making for the evening of life, was the collecting and transcribing all the riddles of every sort that she could meet with. (Jane Austen, Emma, 1816)

3. The difficulty is in the getting the gold into Erewhon. (Sam Butler, Erewhon Revisited, 1902)

This is often analyzed by some sort of category-change operation within the levels of a phrase-structure tree, but there is good evidence that it is in fact a case of noncategorical behavior in language. Indeed, this construction was used early on as an example of a "squish" by Ross (1972). Diachronically, the V-ing form shows a history of increasing verbalization, but in many periods it shows a notably non-discrete status. For example, we find clearly graded judgments in this domain:

4. Tom’s winning the election was a big upset.

5. This teasing John all the time has got to stop.

6. There is no marking exams on Fridays.

7. The cessation hostilities were unexpected. Various combinations of determiner and verb object do not sound so good, but still much better than trying to put a direct object after a nominalization via a derivational morpheme such as -ation. Houston (1985, page 320) shows that assignment of V-ing forms to a discrete part-of-speech classification is less successful (in a predictive sense) than a continuum in explaining the spoken alternation between -ing vs. -in’, suggesting that “grammatical categories exist along a continuum which does not exhibit sharp boundaries between the categories.”

8. [That kind [of knife]] isn’t used much.

9. We are [kind of] hungry. The interesting thing is that there is a path of reanalysis through ambiguous forms, such as the following pair, which suggests how one form emerged from the other.

10. [a [kind [of dense rock]]]

11. [a [[kind of] dense] rock] Tabor (1994) discusses how Old English has kind but few or no uses of kind of. Beginning in Middle English, ambiguous contexts, which provide a breeding ground for the reanalysis, start to appear (the 1570 example in Example (13)), and then, later, examples that are unambiguously the hedging modifier appear (the 1830 example in Example (14)):

12. A nette sent in to the see, and of alle kind of fishis gedrynge (Wyclif, 1382)

13. Their finest and best, is a kind of course red cloth (True Report, 1570)

14. I was kind of provoked at the way you came up (Mass. Spy, 1830)

This is history, not synchrony. Presumably kids today learn the softener use of kind/sort of first. Did the reader notice an example of it in the quote in my first paragraph?

15. NLP is kind of like a rabbit in the headlights of the deep learning machine (Neil Lawrence, DL workshop panel, 2015)

Whitney Tabor modeled this evolution with a small, but already deep, recurrent neural network with two hidden layers. He did that in 1994, taking advantage of the opportunity to work with Dave Rumelhart at Stanford. Just recently, there has started to be some new work harnessing the power of distributed representations for modeling and explaining linguistic variation and change. Sagi, Kaufmann, and Clark (2011), actually using the more traditional method of Latent Semantic Analysis to generate distributed word representations, show how distributed representations can capture a semantic change: the broadening and narrowing of reference over time. They look at examples such as how in Old English deer was any animal, whereas in Middle and Modern English it applies to one clear animal family. The words dog and hound have swapped: in Middle English, hound was used for any kind of canine, while now it is used for a particular sub-kind, whereas the reverse is true for dog. Kulkarni et al. (2015) use neural word embeddings to model the shift in meaning of words such as gay over the last century (exploiting the online Google Books Ngrams corpus). At a recent ACL workshop, Kim et al. (2014) use a similar approach, with word2vec, to look at recent changes in the meaning of words. For example, in Figure 1, they show how around 2000 the meaning of the word cell changed rapidly from being close in meaning to closet and dungeon to being close in meaning to phone and cordless. The meaning of a word in this context is the average over the meanings of all senses of the word, weighted by their frequency of use.

Figure 1 Trend in the meaning of cell, represented by showing its cosine similarity to four other words over time (where 1.0 represents maximal similarity, and 0.0 represents no similarity). These more scientific uses of distributed representations and Deep Learning for modeling phenomena characterize the previous boom in neural networks. There has been a bit of a kerfuffle online lately about citing and crediting work in Deep Learning, and from that perspective, it seems to me that the two people who scarcely get mentioned any more are Dave Rumelhart and Jay McClelland. Starting from the Parallel Distributed Processing Research Group in San Diego, their research program was aimed at a clearly more scientific and cognitive study of neural networks. Now, there are indeed some good questions about the adequacy of neural network approaches for rule-governed linguistic behavior. Old timers in our community should remember that arguing against the adequacy of neural networks for rule-governed linguistic behavior was the foundation for the rise to fame of Steve Pinker—and the foundation of the career of about six of his graduate students. It would take too much space to go through the issues here, but in the end, I think it was a productive debate. It led to a vast amount of work by Paul Smolensky on how basically categorical systems can emerge and be represented in a neural substrate (Smolensky and Legendre 2006). Indeed, Paul Smolensky arguably went too far down the rabbit hole, devoting a large part of his career to developing a new categorical model of phonology, Optimality Theory (Prince and Smolensky 2004). There is a rich body of earlier scientific work that has been neglected. It would be good to return some emphasis within NLP to cognitive and scientific investigation of language rather than almost exclusively using an engineering model of research. 
Overall, I think we should feel excited and glad to live in a time when Natural Language Processing is seen as so central to both the further development of machine learning and industry application problems. The future is bright. However, I would encourage everyone to think about problems, architectures, cognitive science, and the details of human language, how it is learned, processed, and how it changes, rather than just chasing state-of-the-art numbers on a benchmark task.

NLP: How to Become a Natural Language Processing Specialist

Along with the development of the field of Data Science, the demand for personnel in this industry is growing. So how do you become a specialist in natural language processing, one branch of data analysis? The job market in this area is not very large: although there seem to be many Data Science vacancies, NLP tasks rarely appear in employers' requests, and it is mainly companies in large cities (Moscow, St. Petersburg, Novosibirsk, Yekaterinburg) that are looking for natural language processing specialists. At the same time, employers are not only corporations but also small development teams and even startups. So in the NLP segment there are real prospects for beginners.

Specialties and tasks in natural language processing

To understand how to become a specialist in natural language processing, let us first look at what tasks are solved in this area and in which areas of business those solutions are in demand. Language is a complex combination of different levels, such as syntax, morphology, semantics, and discourse, and each level has its own specific tasks; in practice, however, most or all of the language levels are involved at once. For example, the classic tasks of syntax and morphology are tokenization (dividing text into words) and lemmatization (reducing a word to its base form). There are also tasks of parsing text (syntactic analysis) and extracting entities, for example names and geographical locations. Topic modeling (identifying topics in a large collection of documents) and sentiment analysis (determining emotional coloring) are associated with semantics. The level of discourse comes into play in text summarization, and machine translation involves all levels of language. Speech recognition and generation are also part of NLP. The next question is who needs all these tasks. They are mentioned in job advertisements across many areas of business. In particular, contact centers are interested in natural language processing, since they need to handle a massive flow of incoming requests: break them down into categories, identify topics, and automatically suggest answer options. Online stores are also looking for such specialists to improve search in their catalogs and to build conversational and recommendation systems. There is demand in marketing and PR: to investigate media coverage of a company and monitor whether the image formed in the audience is positive or negative, and likewise to study feedback and comments on social networks.
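A toy sketch of the two classic tasks just mentioned, tokenization and lemmatization (the tiny lemma dictionary is an invented stand-in; real pipelines use morphological analyzers or libraries such as NLTK or spaCy):

```python
import re

# Hand-made lemma dictionary; the entries are illustrative assumptions,
# not a real linguistic resource
LEMMAS = {"cities": "city", "went": "go", "stores": "store"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def lemmatize(tokens):
    """Reduce each token to its dictionary form where we know it."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = tokenize("She went to stores in two cities.")
lemmas = lemmatize(tokens)
```

Even this crude version shows why the two tasks are usually the first stage of any NLP pipeline: everything downstream operates on normalized tokens, not raw strings.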
In many areas of business, chatbots are used; for example, they are in demand among banks, which look for natural language processing specialists for their in-house development. In addition to companies that need such systems to support their own business processes, many IT companies in the B2B sector hire natural language processing specialists to develop software solutions for sale to their customers. In particular, regardless of the industry, companies with a massive flow of incoming documents and requests may benefit from systems that optimize performance: distributing requests by topic and department, highlighting the most important and most negative ones, speeding up translation into other languages, and improving search in the company database. Many medium and large companies also, sooner or later, have to segment their customer base.

Essential skills for an NLP specialist What skills are needed to master the profession of a natural language processing specialist?

You need to understand that natural language processing draws on several components: knowledge of the language, knowledge of mathematics and statistics, and programming skills. Moreover, mathematics and programming matter more than linguistics. There are general requirements that employers set for NLP job applicants. These include knowledge of mathematics, probability theory, and statistics, and an understanding of the areas of applicability and the pros and cons of the various families of machine learning algorithms (such as logistic regression, clustering algorithms, neural networks, boosting, and random forests). An NLP specialist also needs to be able to work with databases and know SQL. Sometimes knowledge is required not only of relational databases and related tools (PostgreSQL, MySQL, MS SQL, Oracle) but also of NoSQL systems (Cassandra, Redis, MongoDB). You may also need to get familiar with frameworks for working with big data and with various search engines.

A prerequisite is knowledge of data structures. The next category of skills is related to language: an understanding of morphological, graphematic, and syntactic analysis is necessary. You need to know the algorithms and techniques specific to natural language processing tasks and to understand such things as topic modeling, information retrieval, and distributional semantics. What programming language do you need to learn? There is no single answer. The most common request is for Python; less often, NLP specialists are required to know R. To develop production solutions, programmers most often need Java, C#/C++, or Scala.

In addition, there is a set of technologies that you need to be able to handle: the Data Science technology stack for Python. The base libraries for any data analyst are pandas (for working with data in tabular form), numpy (for working with large numerical arrays), and scipy (for scientific computing). For visualization, you may need the matplotlib and seaborn libraries. Machine learning requires knowledge of the core scikit-learn library and other specialized libraries (e.g., XGBoost and LightGBM for gradient boosting). For natural language processing tasks, one needs to understand the contents of special libraries: nltk, StanfordNLP, spacy, gensim, BigARTM, word2vec, fastText. Experience in building deep neural networks with the TensorFlow, Keras, and PyTorch frameworks is also needed. And, of course, it is advisable to be able to work with tools designed specifically for the Russian language, such as pymystem3 and pymorphy for morphological analysis, and Tomita parser and yargy for extracting facts and entities. Moreover, you need to be able to test methods of processing text data and know the methods for assessing the quality of models. Of course, this is the maximum program; depending on the specific tasks of the employer, a particular programming language or technology stack may be needed. A plus for developers of NLP solutions is having their own projects on GitHub, as well as certificates of participation in Kaggle and other machine learning competitions. An adequate level of English is also expected from applicants, but this is a universal requirement.

Where can I learn the NLP profession? Most NLP vacancies require a specialized education in computer science, computational mathematics, physics, or related fields, sometimes at the master's level or higher. But there are other options: for example, studying at a university with a degree in computational linguistics. A graduate of such an educational program is Polina Kazakova, a Data Scientist in the IRELA project and an employee of the MISiS Big Data Analysis Center. She develops data analysis systems for various companies, that is, she does exactly natural language processing.

Is it worth going into computational linguistics for the sake of working in NLP? "The computational linguistics program itself has a good purpose: to combine theory with practice," says Polina. "We had theoretical disciplines related to syntax, phonetics, morphology and others, and there were attempts to introduce exact disciplines such as mathematical analysis, linear algebra, statistics, the basics of programming, and even machine learning. And machine learning, as you might imagine, has been used very actively in natural language processing recently (although not in all tasks; in some places the rule-based approach still cannot be avoided). Returning to the program: the idea of combining so many things seems cool. But in practice, it turns out to be impossible to fit two fundamental educations into one degree, so we had a very good fundamental linguistic education and only some mathematics. Probably this is enough to start working in the field of NLP, but deeper knowledge will still be needed later. After graduating from the undergraduate program, I realized that I lacked fundamental knowledge from the basic course of mathematics. Therefore, for example, I am now independently studying mathematical analysis and linear algebra. My advice to people who want to do natural language processing: go into a technical specialty and study some linguistics alongside it, because in reality, highly specific knowledge of the language is very rarely useful in the practical problems of machine learning, especially now, when neural networks, given a large set of well-labeled data, derive by themselves all the rules that linguists used to craft manually. A good qualification in linguistics can sometimes give some advantage in NLP, but in most cases a good qualification in machine learning is what is really needed."

Who does the labeling of text data for neural networks and, in general, for training machine learning models? "This is a good question. Labeling is a common pain for everyone: linguists, machine learning specialists, and natural language processing specialists," says Polina Kazakova. "It is not easy to get good labels in sufficient volume; for this, you need people who will manually annotate the data. And linguists should handle the design of the labeling methodology. For example, for the task of tokenization, they must formulate rules for dividing the text into words (covering tricky cases such as 'sofa bed', 'some', or '90th year'). After that, you can bring in almost anyone to label the text according to the given rules." By the way, interns who want to develop further in the direction of Data Science and NLP can take on this task of data labeling; such vacancies appear from time to time. Despite the general need for graduates of technical specialties, teams of programmers and mathematicians sometimes require language experts. Polina recalls one such case: "Samsung has a voice assistant, and recently they began actively recruiting linguists to work in Russia. They work on speech recognition and synthesis, and they hired my classmate, an excellent theoretical phonetician, despite the fact that she had no programming experience and did not know what neural networks were. She was hired for her specific linguistic knowledge, but in my opinion, this is an exceptional case."

NLP Continuing Education Whatever basic education you receive, linguistic or mathematical, you can always find that some knowledge is lacking, especially since NLP and Data Science are developing so actively. How can you get additional education? The School of Data Analysis (SHAD) is a Yandex program that employers often mention as an advantage in a potential candidate. As the name suggests, it teaches the basics of machine learning and data analysis. The format of the school is more like a full-time master's program than continuing education: the program is designed for two years, with full-time evening study (there are branches in Moscow, Minsk, Yekaterinburg, and Nizhny Novgorod) and a correspondence track for nonresident students. You can enter SHAD free of charge through a competition (which is rather difficult, as the competition is fierce); if you fall short on points, you can study for a fee, but only in Moscow and only in person. SHAD lecture notes on some subjects are available to everyone. Some of the courses will be useful to a future NLP specialist:

Algorithms and Data Structures; Discrete Analysis and Probability Theory. To complement your education, you can take online training, for example on the Coursera platform. HSE publishes a list of courses recommended for undergraduates in computational linguistics, among which there are courses in NLP and machine learning. Natural Language Processing is HSE's own NLP course; the university is actively developing its online learning. Machine Learning with TensorFlow on Google Cloud Platform is a specialization (a set of related courses) developed by Google on the Coursera platform; the courses included can be studied separately, for example, Sequence Models for Time Series and Natural Language Processing. Text Retrieval and Search Engines and Text Mining and Analytics are courses included in the Data Mining specialization as part of the Master of Computer Science in Data Science (MCSDS) program from the University of Illinois (USA); these courses can also be taken independently. Natural Language Processing (NLP) and Speech Recognition Systems are courses included in the edX Microsoft Professional Program in Artificial Intelligence. The IBM Professional Certificate Program, on the same educational platform, consists of 5 courses: Deep Learning Fundamentals with Keras, Deep Learning with Python and PyTorch, Deep Learning with Tensorflow, Using GPUs to Scale and Speed-up Deep Learning, and the Applied Deep Learning Capstone Project. There are several training options on the Russian platform "Open Education". The peculiarity of this platform is that it was created with the participation of leading Russian universities; all of its courses are free, but you can also get a certificate and have these courses credited at your own university. On foreign platforms, you can often take courses for free, but you will have to pay for a certificate.
"Analyzing Data in Practice" is a course from MIPT which, among other things, covers text processing. Big Data Analytics is a program from NUST MISiS that covers many of the technologies required for NLP; its launch is scheduled for February. "Data Science and Big Data Analytics" is a survey course from St. Petersburg Polytechnic, partly devoted to text analysis. Enrollment for this course is now closed, as it is for many of the others above that are already running or recently finished, but you can subscribe to course updates so you don't miss the next intake. CHAPTER 3 Programming Interview Questions Are you preparing for your next programming interview? Perhaps it is your first: you are stepping into a very demanding field, where programming and coding are vital to positive disruption in every industry in the world. It all starts with people like you, working behind the scenes to make sure digital infrastructure, operating systems, and industrial applications work as needed. The following 50-plus programming interview questions should give you an idea of the kind of dialogue you can expect with a hiring manager or recruitment specialist in the programming field. Data structures are just the beginning; many things may be asked, including questions about solving logic problems in a coding language. Let's dive in, starting with some of the most basic interview questions that any professional will be familiar with.

1. Culture and Compliance You are probably no stranger to job interviews or ice-breaker "cultural fit" questions. In a programming interview, some questions in this category may be asked more specifically: you may be asked to describe your workflow, discuss your approach to working with teams, or explain how you stay organized during coding projects.

For example:

Which programming language did you learn first? Do you hold any certificates? What relevant skills did you acquire along the way? How do you stay organized, during complex projects and in general? How do you keep your knowledge and skills up to date in this sector? You will be familiar with standard job requirements by now, so questions about your educational background and experience should not feel out of place. Technology companies around the world are reassessing their business needs in light of the ongoing shortage of coding and computer science skills. If you are meeting with such a company, you must demonstrate your willingness to be a self-starter and engage in continuous learning to keep your skills relevant. However, there are some basics that every programming interviewee should know. To help you prepare, these programming interview questions are organized into 10 broad categories. Not all of them will apply to the position you are pursuing. Will all these questions be asked? No way! However, before you can know whether a career as a programmer is right for you, you need to know the breadth of the subject you will be responsible for.

2. General Programming and Design Questions Applicants who do not know these coding principles do not receive a call back. See how many of these questions you can answer in a row without slowing down.

What are variables? Without variables, there is no program. Variables are named values stored (or "declared") in the program; they represent the working data of the program, and their contents can change as the program runs and its conditions vary. What is inheritance? Inheritance is a foundation of object-oriented programming: it provides the ability to extend existing code to meet new requirements without writing it from scratch.
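The idea of extending existing code without rewriting it can be sketched in Python. The Queue and BoundedQueue classes below are hypothetical examples written for this illustration, not taken from any library:

```python
class Queue:
    """Existing code: a simple FIFO queue."""
    def __init__(self):
        self._items = []

    def push(self, x):
        self._items.append(x)

    def pop(self):
        return self._items.pop(0)   # remove from the front (FIFO)

    def __len__(self):
        return len(self._items)

class BoundedQueue(Queue):
    """A new requirement (a capacity limit) met by extending the
    existing class, not rewriting it."""
    def __init__(self, capacity):
        super().__init__()
        self.capacity = capacity

    def push(self, x):
        if len(self) >= self.capacity:
            raise OverflowError("queue full")
        super().push(x)             # reuse the inherited behavior

q = BoundedQueue(2)
q.push(1)
q.push(2)
```

BoundedQueue inherits pop and __len__ unchanged and only overrides push, which is the essence of extending code through inheritance.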

What is polymorphism? Polymorphism allows new objects to take on the properties of existing objects; the object that provides the inherited properties is known as the "base class" or "superclass". More precisely, it is the program's ability to handle an object differently depending on its type or class. Imagine a car versus a tank: polymorphism lets a common interface drive behavior specific to each vehicle type.
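The car-versus-tank idea can be made concrete with a small sketch. The class and method names here are invented for the example:

```python
class Vehicle:
    """Base class: the common interface."""
    def move(self):
        raise NotImplementedError

class Car(Vehicle):
    def move(self):
        return "rolling on wheels"

class Tank(Vehicle):
    def move(self):
        return "driving on tracks"

def advance(vehicle):
    # The caller only knows the common interface; each concrete
    # type supplies its own behavior. That is polymorphism.
    return vehicle.move()

moves = [advance(v) for v in (Car(), Tank())]
```

The `advance` function never checks what kind of vehicle it received; the correct `move` is chosen by the object's own type at run time.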

What are pointers? Pointers hold the memory addresses of other values. Changing a pointer can save time compared with copying the data it refers to, which makes pointers an efficient way to write programs that process data. Pointers are used heavily in C but do not exist in Java. Name the four types of storage classes. Register, static, external, and automatic.

Register: variables that the compiler tries to keep in CPU registers for fast access. Static: a storage class in which the variable retains its value between function calls; it is initialized once, at the start of the program. External: a variable defined outside the current file or block; the extern keyword tells the compiler to look for the definition elsewhere, so the most current value is read from wherever it is defined. Automatic: the default storage class for local variables defined inside blocks or functions.

What is encapsulation? This is one of the most important concepts in object-oriented design. It means bundling a class's data together with the set of instructions that operate on it: in encapsulation, an object contains not only its data but also the functions that can modify it. Encapsulation can also be used to hide the internal representation of the data while still allowing interaction with it.
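A minimal Python sketch of encapsulation, using a hypothetical BankAccount class: the balance and the only functions allowed to change it live together, and the internal state is hidden behind a read-only property.

```python
class BankAccount:
    """Data and the functions that modify it live together."""
    def __init__(self, balance=0):
        self._balance = balance     # leading underscore: internal by convention

    def deposit(self, amount):
        # The object guards its own invariants.
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self._balance += amount

    @property
    def balance(self):
        # Read access is allowed; direct writes are discouraged.
        return self._balance

acct = BankAccount(100)
acct.deposit(50)
```

Callers read `acct.balance` but cannot corrupt the state except through `deposit`, which enforces the rules.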

Name the basic data kinds. Integer, floating point (also called float or real), double, string, character, and Boolean. Name the six data structure kinds. Arrays, linked lists, queues, stacks, trees, and graphs. What is the difference between the "declaration" and the "definition" of a function or variable? A declaration states the type and name of an identifier and gives the linker what it needs to reference it. A definition supplies what the compiler needs to actually create the identifier: the function body or the variable's storage. Sometimes a declaration also provides an initial value, in which case it is a definition as well.

What is the difference between an interpreter and a compiler? An interpreter executes commands directly, while a compiler converts the commands (the program) from source code into machine code first. Compiled programs generally run faster than interpreted ones and scale better. Examples of interpreted languages: Basic, Korn shell, Ruby, Python. Examples of compiled languages: Java, C++. What are the pillars of object-oriented design? Encapsulation, polymorphism, inheritance, and abstraction.

What is a real-time operating system? A real-time operating system (OS) processes data as it becomes available, without buffering delays; it must respond within guaranteed time bounds, not merely when convenient. Example of a real-time use case: controlling factory machinery in real time. Examples of non-real-time operating systems: Windows and macOS, where response time is less critical. Ready to continue? From here, you can expect the hiring manager to move into some of the more specific knowledge a coder needs.

3. Algorithms Algorithms are the bread and butter of a coding professional's career. You know them as descriptions of how data is collected, sorted, and ultimately used in other processes. You can expect questions about algorithms such as the following: Name four types of sorting algorithms.

Quicksort, insertion sort, selection sort, and bubble sort.

How does a bucket sort algorithm work? Bucket sort creates an array of empty buckets and places each object into a bucket. Each bucket is then sorted, and the objects are returned, in order, to the original array.
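The steps above can be sketched in a few lines of Python. This version assumes the input values lie in [0, 1), a common simplifying assumption for bucket sort:

```python
def bucket_sort(values, n_buckets=10):
    """Distribute values in [0, 1) into buckets, sort each bucket,
    then concatenate the buckets back into one array."""
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        buckets[int(v * n_buckets)].append(v)   # place into its bucket
    for b in buckets:
        b.sort()                                # order each bucket
    return [v for b in buckets for v in b]      # concatenate in order

data = [0.42, 0.32, 0.23, 0.52, 0.25, 0.47, 0.51]
result = bucket_sort(data)
```

Bucket sort is fast when the input is roughly uniformly distributed, because each bucket then receives only a few elements to sort.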

How does a radix sort algorithm work? Radix sort orders data using integer keys rather than comparing whole values. Elements are grouped by the digits that share the same significant position: in other words, radix sort sorts numbers digit by digit, with each digit position acting as a key. It can use counting sort or bucket sort as its per-digit subroutine.

What is a counting sort algorithm and when is it useful? A counting sort algorithm counts the instances of each distinct value and arranges the output accordingly. It is most useful for arrays whose values fall within a small, known range. The counts themselves can also be used to find the most frequent values, the least frequent values, unique values, and so on.
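Counting sort is short enough to show in full. This sketch assumes non-negative integer inputs with a known maximum:

```python
def counting_sort(values, max_value):
    """Count instances of each value in 0..max_value,
    then rebuild the array in order from the counts."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1              # tally each value
    out = []
    for v, c in enumerate(counts):
        out.extend([v] * c)         # emit each value count times
    return out

result = counting_sort([4, 2, 2, 8, 3, 3, 1], max_value=8)
```

Because it never compares elements, counting sort runs in O(n + k) time for n values in range k, beating comparison sorts when k is small.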

4. Databases Database coding can involve using languages such as Python, Java, and C# to create and implement web-based, cloud-based, and mobile applications that individuals and organizations use to interact with collections of information. Not every programming career requires expertise in database management systems, but it doesn't hurt to know some basic concepts. Define the word database. What interactions does a database management system provide? A database is any electronic system of organized data that allows the data to be used, accessed, updated, and processed. Database management systems (DBMS) enable administrators, applications, software, and end users to interact with the data stored in the database.

How do database management systems improve on the functionality of file-based systems? A file-based system can interact directly with users, but not easily with many of them at the same time. A database management system mediates between the file system and applications through application programming interfaces (APIs) and is far better at serving multiple users concurrently, while ensuring that those users cannot modify the same data at the same time.

What are the three types of database technology? Structured/embedded data stores: helpful when writing a program that needs to save, manage, or read data but where a full database application is not desired. Navigational: navigational databases are associated with hierarchical or network models; they describe a database in which records are traversed individually by following links. SQL/relational: SQL stands for "structured query language" and is used to retrieve or update information in a relational database.

What is normalization used for? Normalization rearranges data to eliminate redundancy and save disk space. It does this by splitting data into multiple tables and describing the relationships between them.

What is the difference between DDL and DML? DDL stands for "data definition language" and specifies the structure of a database. DML stands for "data manipulation language" and specifies how data is retrieved from or modified in that database. Why is database partitioning required? Partitioning improves efficiency and data availability and minimizes data loss by creating separate, more stable, and more manageable file systems. Partitioning also allows parallel processing, and partitions that need more or less access can be handled separately.
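The DDL/DML distinction can be seen directly with Python's built-in sqlite3 module. The customers table here is a made-up example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database

# DDL: defines the structure of the database.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# DML: retrieves or modifies the data inside that structure.
conn.execute("INSERT INTO customers (name) VALUES (?)", ("Alice",))
conn.execute("INSERT INTO customers (name) VALUES (?)", ("Bob",))
rows = conn.execute("SELECT name FROM customers ORDER BY id").fetchall()
```

CREATE, ALTER, and DROP are DDL; SELECT, INSERT, UPDATE, and DELETE are DML.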

What is an entity and what is an entity set? An entity is a distinct object or data point, such as a person or place. An entity set groups entities of the same type. In Java terms, an entity is like an instance of a class. Example: class = "car"; instance/entity = "Cadillac CTS."

5. Arrays Arrays are the most common data structure in programming, and knowing how to access the information they hold is a vital skill. If variables are the basic building blocks, arrays are the next step.

Problem solving with arrays Let's see how well you can work with arrays and solve some basic problems: How do you check whether an array contains only numbers? How do you find all permutations of a given array? Find a missing number in an integer array (1-100). In an integer array, locate the duplicated number. How can you find duplicate numbers in an array that contains multiple duplicates? Given an unordered integer array, find the largest and smallest values. Find all pairs of integers in an array whose sum equals a given value.
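Two of the classics in that list have well-known one-pass solutions, sketched here in Python. The missing-number trick compares the expected sum of 1..n with the actual sum; the duplicate finder tracks values already seen:

```python
def find_missing(nums, n=100):
    """Missing number in 1..n: expected sum minus actual sum."""
    return n * (n + 1) // 2 - sum(nums)

def find_duplicate(nums):
    """First repeated number, using a set of values already seen."""
    seen = set()
    for x in nums:
        if x in seen:
            return x
        seen.add(x)
    return None

missing = find_missing([x for x in range(1, 101) if x != 73])
dup = find_duplicate([1, 4, 2, 4, 3])
```

Both run in O(n) time; the sum trick also uses O(1) extra space, which interviewers often ask for as a follow-up.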

6. Trees The next three categories (trees, graphs, and lists) are similar in some basic ways, but each still requires specific knowledge you should be able to call on during your interview. How well do you know your way around linear and nonlinear data structures? In Java, tree-based collections such as TreeMap and TreeSet are provided in the standard collections packages.

What is a binary search tree and what is it for? A binary tree is a data structure in which each node has at most two "children"; a binary search tree additionally keeps smaller keys in the left subtree and larger keys in the right, which makes searching, inserting, and deleting fast. Name five tree types. Binary trees, binary search trees, AVL trees, B-trees, and B+ trees.

Problem solving in trees: How do you traverse a binary tree? How do you count the leaf nodes in a given binary tree? How do you check whether a tree is balanced?
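Two of those problems have compact recursive solutions. This sketch uses a minimal hand-rolled Node class rather than any library:

```python
class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def count_leaves(node):
    """A leaf has no children; recurse over both subtrees."""
    if node is None:
        return 0
    if node.left is None and node.right is None:
        return 1
    return count_leaves(node.left) + count_leaves(node.right)

def is_balanced(node):
    """Balanced: subtree heights differ by at most 1 at every node.
    Returns a (balanced, height) pair so height is computed once."""
    if node is None:
        return True, 0
    lb, lh = is_balanced(node.left)
    rb, rh = is_balanced(node.right)
    return lb and rb and abs(lh - rh) <= 1, 1 + max(lh, rh)

#        1
#      /   \
#     2     3
#    / \
#   4   5
tree = Node(1, Node(2, Node(4), Node(5)), Node(3))
leaves = count_leaves(tree)        # 4, 5, and 3 are leaves
balanced, _ = is_balanced(tree)
```

Returning the height along with the balance flag keeps `is_balanced` O(n) instead of recomputing heights at every level.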

7. Graphs Graphs are nonlinear data structures consisting of multiple nodes (vertices) connected by edges. Graphs can represent network architecture and can be deployed to solve practical problems. In general, a graph describes directed or undirected relationships.

What is depth-first search for graphs? Depth-first search (DFS) explores graphs and trees starting from the root node and following each branch as far as possible before backtracking. DFS can be used to detect cycles in a graph, find a path, and find strongly connected components.

What is breadth-first search for graphs? Breadth-first search (BFS) examines all sibling nodes at the current depth before moving on to their children at the next depth. BFS can be used to discover the shortest path in a peer-to-peer network, and it appears in the web crawlers used by search engines. Problem solving with graphs: Given a graph, how do you determine whether it contains a cycle?
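The cycle-detection question above is a classic DFS application. The sketch below uses the standard three-color scheme on a directed graph represented as an adjacency dictionary (the graph data is made up for the example):

```python
def has_cycle(graph):
    """DFS with three colors: white = unvisited, gray = on the
    current path, black = finished. Reaching a gray neighbor means
    a back edge, i.e. a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GRAY
        for w in graph.get(v, []):
            if color[w] == GRAY:            # back edge found
                return True
            if color[w] == WHITE and visit(w):
                return True
        color[v] = BLACK                    # fully explored
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)

acyclic = {"a": ["b"], "b": ["c"], "c": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

The gray set is exactly the current DFS path, so a gray neighbor proves the path loops back on itself.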

8. Lists Lists are ordered data structures that may be sorted or unsorted. Lists can contain objects, strings, and integers; any data type can be stored in a list. In a linked list, each element holds a pointer to the next element in the list.

Problem solving with lists: How do you check whether a particular linked list contains a loop? How do you find the starting node of the loop? How do you reverse a linked list? How do you find the length of a singly linked list? How do you remove duplicate nodes from an unsorted linked list? How do you find the sum of two numbers represented as linked lists using a stack?
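The loop-detection question is usually answered with Floyd's "tortoise and hare" algorithm, sketched here with a minimal hand-rolled node class:

```python
class ListNode:
    def __init__(self, val):
        self.val = val
        self.next = None

def has_loop(head):
    """Floyd's tortoise and hare: the fast pointer advances two
    steps per iteration, the slow pointer one; they can meet only
    if the list contains a cycle."""
    slow = fast = head
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            return True
    return False

# Build 1 -> 2 -> 3, then close a loop back to node 2.
a, b, c = ListNode(1), ListNode(2), ListNode(3)
a.next, b.next = b, c
no_loop = has_loop(a)
c.next = b                  # introduce the cycle
loop = has_loop(a)
```

The appeal of Floyd's method is that it needs only two pointers, i.e. O(1) extra space, instead of a set of visited nodes.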

9. Networking Networking is hardly a new concept. However, as our technologies and industries become ever more interconnected, with 5G networks and the Internet of Things extending to billions of devices, it cannot hurt a programming job seeker to revisit the fundamentals of networks.

What is the difference between a LAN, a WAN, and a VLAN? LAN stands for "local area network" and describes a high-speed network of devices in the same location. WAN stands for "wide area network" and defines a network that spans geographically separate locations; a WAN can be public or private. A VLAN is a "virtual LAN" (a LAN within a LAN) that limits its membership: the network is subdivided so that communication is allowed only between known participants.

What are the differences between the OSI and TCP/IP models? OSI describes data transfer bottom-up, while TCP/IP takes a top-down view. OSI has seven layers; TCP/IP has only four. OSI is a conceptual reference model, while TCP/IP is a concrete, implemented protocol suite.

What is DDoS? DDoS means "distributed denial of service." It is a kind of DoS attack in which many external systems overwhelm the bandwidth of a targeted system, with the aim of making the target's services unusable. It is often described as a tool hackers use to steal information, but that is not accurate: the goal is disruption, not theft.

Define QoS. QoS stands for "quality of service." This set of technologies reduces network latency and jitter and increases service reliability by limiting the number of lost packets. Among other things, it sets priorities for data transfer, and it matters more than ever: if a program does not need immediate network access, it receives a lower QoS priority, while programs that require low latency receive a higher one.

10. Problem Solving Using Programming Some programming interviews include a practical, hands-on problem-solving demonstration to see how well you can combine major and minor programming concepts. Remember that the most important quality in any kind of programmer is the ability to think logically. Here is an example problem about random selection: If you have a set of 52 numbers, such as a deck of cards, it is easy to select one at random. Simply generate a random number between 1 and 52 and use it as an index into the deck. In other words, devise a system in which every number or card is equally likely to be the one randomly selected from the series or deck.

Here's a more difficult variation: Imagine that you cannot see or hold the entire set at once. You are not told how many numbers there will be in total, and you are shown only one number at a time. You cannot store all the numbers you have been shown, and you are only told when you have seen the last number of the set. If you are allowed to store at most two numbers at a time, can you still produce an answer that is a uniformly random selection from the whole stream?

Here's how to solve this problem: You only need to store two numbers: N, the count of numbers seen so far, and your current candidate answer, the value you would return if the stream ended at this point. When the first number is shown, it becomes your candidate with probability one: it is the only number seen. When the second number is shown, make it the new candidate with probability one half. Likewise, the third number becomes the new candidate with probability one third. In general, for each new number, generate a random integer between 1 and N, where N is the count of data points seen so far. If your random number equals N, keep the number just shown as your new candidate; otherwise, keep the previously stored answer. You perform this selection update every time you see a new data point. When you are told the stream has ended, the stored value is a correct answer, selected with equal probability from all the numbers seen: the probability is 1 in N, no matter how long the stream ran before it stopped. Here are some other problems you may work on: Write an algorithm to find every instance of a specific word in a text selection. Write a program to generate a random number in a specific range. How do you check whether a string contains only numbers? Given two character strings, find the minimum number of edits required to convert string 1 into string 2. How do you count the vowels and consonants in a given string? Given a grid where each square has a numeric value, how do you code a solution to find the lowest-value path from one side of the grid to the other? Ready for Your Programming Interview? I hope you are now ready to ace the programming interview and land the programming job you have dreamed of.
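The procedure just described is known as reservoir sampling (with a reservoir of size one), and it fits in a few lines of Python:

```python
import random

def pick_random(stream):
    """Reservoir sampling with k=1: store only the count and the
    current candidate; the n-th item replaces the candidate with
    probability 1/n, so every item ends up equally likely."""
    candidate = None
    for n, value in enumerate(stream, start=1):
        if random.randint(1, n) == n:      # true with probability 1/n
            candidate = value
    return candidate

# Every card is equally likely, even though only two values
# (the count and the candidate) are ever stored.
choice = pick_random(iter(range(1, 53)))
```

The stream is consumed one item at a time and never stored, which is exactly the constraint the interview problem imposes.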
If not, you'll be amazed at the quality of some of the online guides, tutorials, training materials, and full coding courses available, including Codeforces, Sphere Online Judge, HackerRank, and more. You have everything you need to enter this exciting space, and given the importance of and demand for innovation in robotics, machine learning, wireless networking, and much else, you have all the motivation in the world. Hopefully this overview of the major programming concepts has been a useful step toward an exciting career. What are you waiting for? CHAPTER 4 What is data analysis and why is it important? Data analysis is the process of evaluating data using analytical and statistical tools to discover useful information and help with business decision making. There are a number of data analysis methods, including data mining, text analysis, business intelligence, and data visualization.

How is data analysis performed? Data analytics is part of a larger process of deriving business intelligence. The process comprises one or more of the following steps:

Defining Goals: Every study should begin with a set of clearly defined business goals. Many of the decisions made during the rest of the process depend on how clearly the objectives of the study have been set. Posing Questions: An attempt is made to ask a question in the problem area, e.g., do red sports cars get into accidents more often than others? Data Collection: Data relevant to the question must be collected from the appropriate sources. In the example above, data might be gathered from a variety of sources, including DMV or police accident reports, insurance claims, and hospital admission records. When data is collected using surveys, a questionnaire must be designed and presented to the subjects, with questions appropriate to the statistical methods to be used. Data Wrangling: Raw data may arrive in several different formats. It must be cleaned and converted so that data analysis tools can import it. In our example, we might receive DMV accident reports as text files, insurance claims from a relational database, and admission records via an API. The data analyst must aggregate these different forms of data and convert them into a form suitable for the analysis tools. Data Analysis: This is the step in which the cleaned and aggregated data is imported into analysis tools, which allow you to explore the data, find patterns in it, and ask and answer what-if questions. This is the process of making sense of the collected data through the proper use of statistical methods. Drawing Conclusions and Predictions: Finally, after sufficient analysis, conclusions can be drawn from the data and appropriate predictions made. These conclusions and predictions can then be summarized in a report for the end users. Now, let's take a closer look at the individual methods of data analysis.
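The wrangling and analysis steps can be sketched end to end with Python's standard library. The raw data below is a hypothetical, deliberately messy export invented for the example (inconsistent label case and a blank row); real work would typically use pandas for the same steps:

```python
import csv
import io

# Hypothetical raw export: inconsistent case and a blank row to clean up.
raw = """color,accidents
Red,7
red,5

blue,2
Blue,1
"""

# Import: parse the raw text as CSV rows (DictReader skips blank rows).
rows = [r for r in csv.DictReader(io.StringIO(raw)) if r["color"]]

# Wrangling: normalize the category labels and convert types.
clean = [(r["color"].lower(), int(r["accidents"])) for r in rows]

# Analysis: aggregate accidents per color.
totals = {}
for color, n in clean:
    totals[color] = totals.get(color, 0) + n
```

After cleaning, "Red" and "red" correctly count as one category, which is the kind of normalization the wrangling step exists for.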

Data mining is a method of data analysis for discovering patterns in large datasets using methods from statistics, artificial intelligence, machine learning, and databases. The goal is to turn raw data into understandable business information. This may include identifying groups of data items (also known as cluster analysis) or identifying anomalies and dependencies between data groups.

Applications of data mining: Anomaly detection can process large amounts of data ("big data") and automatically identify outlier cases, possibly for exclusion from decision-making or for fraud detection (e.g., bank fraud). Learning customer buying habits: machine learning techniques can be used to model customer buying habits and determine frequently purchased items. Clustering can identify previously unknown groups within the data. Classification is used to automatically sort data entries into predefined categories. A common example is classifying email messages as "spam" or "non-spam" and having the system learn from the user.
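The spam example above can be sketched as a toy classifier. The labeled messages and the word-count scoring are invented for illustration; a real system would use a proper statistical model such as naive Bayes, but the "learn from labeled examples, then classify" loop is the same.

```python
from collections import Counter

# Toy training data: labeled messages (invented for illustration).
training = [
    ("win money now", "spam"),
    ("free prize win", "spam"),
    ("meeting agenda attached", "non-spam"),
    ("lunch tomorrow?", "non-spam"),
]

# "Learning": count how often each word appears per class.
word_counts = {"spam": Counter(), "non-spam": Counter()}
for text, label in training:
    word_counts[label].update(text.lower().split())

def classify(text):
    # Score each class by how often it has seen the message's words.
    scores = {
        label: sum(counts[w] for w in text.lower().split())
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)

print(classify("win a free prize"))  # prints "spam"
```

As the user labels more messages, the counters grow and the classifier's decisions shift, which is the "system learns from the user" behavior described above.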

Text Analytics: Text analysis is the process of deriving useful information from text. It is achieved by processing unstructured text, extracting meaningful numeric indexes from the information, and making the information available to statistical and machine learning algorithms for further processing.

The text extraction process includes one or more of the following steps: collection of information from various sources, including the web, file systems, databases, etc.; linguistic analysis, including natural language processing; pattern recognition (e.g., recognition of phone numbers, email addresses, etc.); and extraction of summary information from the text, such as the relative frequencies of words or the similarities between documents. Examples of text analysis applications: Analysis of open survey responses. These studies are exploratory and include open questions related to the survey topic; respondents can express their views without being limited to a particular response format. Analysis of emails, documents, etc. to filter out "junk". This also includes automatic classification of messages into predefined trays for routing to various departments. Investigating competitors by reviewing their websites, which can be used to derive information about competitors' activities. Security applications that can process intrusion detection logs.
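Two of the steps above, pattern recognition and extraction of word frequencies, fit in a few lines of Python. The regular expressions here are deliberately simplified sketches (real email and phone formats are far messier), and the sample text is invented.

```python
import re
from collections import Counter

text = """Contact us at support@example.com or call 555-123-4567.
The report, the summary, and the appendix are attached."""

# Pattern recognition: simplified regexes for emails and
# US-style phone numbers (illustrative, not production-grade).
emails = re.findall(r"[\w.+-]+@[\w-]+\.\w+", text)
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)

# Summary information: relative word frequencies.
words = re.findall(r"[a-z]+", text.lower())
freq = Counter(words)

print(emails)               # ['support@example.com']
print(phones)               # ['555-123-4567']
print(freq.most_common(1))  # [('the', 3)]
```

The counts produced this way are exactly the kind of "meaningful numeric indexes" that later statistical or machine learning steps consume.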

Business Intelligence: Business intelligence transforms data into usable intelligence for business purposes and can be used in an organization's strategic and tactical decision-making processes. It allows people to research trends in the collected data and derive insights from it.

Some examples of business intelligence in use today:

An organization's operating decisions such as product placement and pricing. Identification of new markets, assessment of product demand and suitability for different market segments. Budgeting and rolling forecasts. Using visual tools such as heat maps, pivot tables and geographic mapping.

Data Visualization: Data visualization simply refers to the visual representation of data. For the purposes of data analysis, it means using statistics, probability, pivot tables, and other artifacts to present data visually. It makes complex data more understandable and useful. Increasing amounts of data are generated by the many sensors in our environment (the "Internet of Things", or "IoT"). This data (referred to as "big data") presents challenges in understanding that can be alleviated by using data visualization tools. Data visualization is used in the following applications.

Extraction of summary data from raw IoT data. Using a bar chart to represent sales performance over multiple quarters. Using a histogram to show the distribution of a variable such as income by dividing its range into bins. Data analysis in review: data analysis is used to evaluate data with statistical tools in order to find useful information. A variety of methods are used for this purpose, including data mining, text analysis, business intelligence, and data visualization.

Software For The Application Of Automatic Learning Techniques For Medical Diagnosis
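The histogram idea can be sketched without a plotting library, since the binning alone already summarizes a distribution. The income values below are made up for illustration.

```python
# Made-up income values (in thousands) for illustration.
incomes = [12, 18, 25, 31, 33, 40, 47, 52, 58, 75]

# Divide the range [0, 80) into four equal-width bins of 20.
bin_width = 20
bins = [0, 0, 0, 0]
for x in incomes:
    bins[x // bin_width] += 1

# A crude text "histogram": one bar per bin.
for i, count in enumerate(bins):
    lo, hi = i * bin_width, (i + 1) * bin_width
    print(f"{lo:3d}-{hi:3d} | {'#' * count}")
```

A plotting library such as matplotlib would draw the same bins graphically; the counting step is identical either way.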

The objective is to provide an easy-to-use tool for health specialists that allows the application of Artificial Intelligence (AI) techniques in support of medical diagnosis based on the clinical method. Diagnostic support is among the most cited applications of AI internationally, in particular Supervised Classification, a branch of Machine Learning. The ARISE software implements a classification algorithm based on the main results of a doctoral thesis. That thesis presents the RISE algorithm ("Rule Induction from a Set of Exemplars"), which unifies the best ideas of the rule-induction and case-based learning paradigms, overcoming the limitations that each suffers separately. The experimental results place it in a privileged position among the algorithms of the state of the art. ARISE introduces modifications to the seminal algorithm that prevent inadequacies in its application. Experiments with training sets available internationally for testing show that ARISE preserves classification accuracy, the most desired quality, and that the slight decrease in processing speed is not prohibitive. Compared with foreign software in use in Cuba that implements other classification algorithms, ARISE surpasses them in efficiency and provides an interface that does not require user expertise, with functionality ranging from the creation of training sets to the classification (diagnosis) of individual cases or groups of cases. Keywords: Machine Learning, Classification, Medical Diagnosis, Clinical Method

Machine Learning faces the challenge of building computer programs that automatically improve with experience. Starting from examples provided by a tutor or instructor and from base or prior knowledge (the training set, or set of cases in medicine), the learning system creates general descriptions of concepts. Data Mining aims to generate information similar to what a human expert might generate, while also satisfying the principle of understandability. Its objective is to discover interesting knowledge, such as patterns, associations, changes, anomalies and significant structures, from large quantities of data stored in databases, data warehouses, or any other means of information storage. Data Mining makes use of Machine Learning, applying its methods to find patterns, with the difference that the scenario observed is a database; the set of examples constitutes the information necessary to train the system. Data Mining in medicine is focused on finding relevant trends, models, and relationships between diseases, between patients, between patients and diseases, and all possible combinations between patients, diseases, treatments, medications, etc.; in general, those aspects that are not readily visible with clinical methods. Given the existence of large volumes of data on the wide variety of symptoms and signs recorded in the clinical histories or medical records of cases treated in consultations at health institutions, the use of Machine Learning and Data Mining techniques to discover patterns and extract knowledge from them has become a very productive effort, one that would bring savings of resources to the health sector while keeping the doctor as the principal protagonist of medical diagnosis and of the clinical method.

The Clinical Method: The clinical method, also known as the method of clinical problem solving or diagnosis, is nothing more than "the application of the scientific method to study the health-disease process in the individual, with a view to knowing, evaluating and transforming health and disease in the individual or subject, in a way that involves all patients and includes all the specialties" (J. Fernández Sacasas, open dialogue on the clinical method, Granma newspaper, page 3, 14 January 2011). The clinical method comprises a group of steps that every doctor must apply in the search for the diagnosis of their patients' health problems, consisting of: the interview, or history, to learn the complaints or symptoms present; the physical exam, in search of definite signs through bodily exploration; then the grouping, relating, combining and integrating of the symptoms and signs found in order to establish the presumptive diagnostic hypotheses that explain the patient's health problem; and finally the testing of these hypotheses through complementary exams or through the evolution of the patient. If a hypothesis is not confirmed, each one of the steps will have to be completely reassessed.

How Data Mining Can Support the Clinical Method: The process of diagnosis is not a simple summation of the symptoms and signs found. In the words of Berger, "the enumeration of signs and symptoms is not equivalent to the diagnosis, in the same way that stacked bricks do not form a house." It is therefore worth remembering that all symptoms and signs have their hierarchy and need to be placed in the plane that corresponds to the diagnostic reasoning; they must be grouped, combined and related, since the diagnosis is not based on the sum of all the data obtained through the interview, the physical exam and the complementary exams, but on a synthesis of the truly relevant data, integrated with the experience and knowledge of the doctor. In agreement with the previous assertion, the application of the Data Mining technique known as "feature selection" is immediate: it consists in selecting the relevant or valuable information and eliminating the redundant or irrelevant information so that the mining process for a given task is efficient; here, it means finding the symptoms and signs truly relevant for diagnosis. Among the characteristics of health problems in primary care, the following are cited: An important part of the information supplied by the patient is interference, that is, it has no utility for solving the problem. This feature is reflected in one of the steps of the Data Mining process, which consists in eliminating the "noise" eventually present in the data. Patients come to the consultation ever more often with incomplete clinical pictures; a doctor who waits to make the diagnosis until the clinical picture is florid and fully developed will miss many diagnoses, or will make them belatedly.
The previous particularity directs attention to the presence of missing data in knowledge discovery processes, an aspect that is assessed in the training sets used for the application of the techniques evaluated in this work. A set of "strategies to follow to face health problems in primary care" is also raised, among them: 1. It is important to know which are the presenting symptoms of the main diseases for which patients seek medical attention. This implies the need for training sets oriented toward the positive (or negative) diagnosis of the most common diseases presented to primary care doctors, an essential aspect for the development of the Supervised Classification algorithms that Machine Learning offers, and one that would provide a valuable criterion with which the doctor can be helped to make a primary diagnosis. Classification is understood as "the assignment of a class label, or diagnosis, to a new example (a new case or patient); this classification is made on the basis of what a piece of software has learned from a training set (cases previously classified, labeled or diagnosed with absolute certainty), a set in which the new example to be classified is very unlikely to be present." Sometimes, in the initial stage of the process, the doctor can only sort the disease into one of two possible categories: urgent or not urgent, psychogenic or organic, infectious or not infectious, bacterial or viral infection, surgical or not surgical, ruling out, above all, the diseases that have serious consequences if they are not diagnosed early. This reinforces the comment made in strategy 1: Machine Learning techniques are directly applicable to classification problems as long as relevant training sets are available.
Family doctors, who generally attend many patients per day, learn to recognize patterns of signs and symptoms and to compare them with the models of disease stored in their memory, thanks to their studies and experience. That affirmation is almost verbatim the definition of what is known in Artificial Intelligence as "Case-Based Reasoning", with the advantage on the software side that it does not tire, does not lose concentration, and is not affected by external agents or even by moods. Complementary exams are part of the clinical method, and many hypotheses are tested with them, but the greater part of the daily diagnoses in the family doctor's consultation do not involve such exams. That statement by itself justifies supporting the clinical method with Machine Learning techniques. All of the above justifies the objective of this work: "to offer a tool easy to use by health specialists that enables the application of Artificial Intelligence (AI) techniques in support of medical diagnosis based on the clinical method." One infallible premise should be noted, which the eminent Dr. Fidel Ilizástigui Dupuy indicated: "The prediction that the doctor can be removed from the taking of the clinical history, the diagnosis, the prognosis and the treatment is only possible if one understands static diagnosis as putting a label on the patient, a matter which the doctors of the past already attempted; it will not happen if the diagnosis is understood as a total, integral and complete creative process concerning a human being. The rejection of computer technology by doctors is due to their belief that it can strip them of their role as clinicians. Computers and computing are not going to replace man; regulated and controlled, they must be incorporated into the doctor's work.
Computers have no creativity and cannot diagnose problems related to education and human values."

The RISE Algorithm: Its essence, in very summarized form, is as follows. Each example or case is a vector of attribute-value pairs, together with the specification of the class to which the example belongs; the attributes can be symbolic and/or numerical. Each rule is formed by a conjunction of antecedents and an implied class. Every antecedent is a condition on a single attribute, and there is at most one antecedent per attribute; that is, there may be attributes that form part of none of the antecedents of a rule. The conditions on symbolic attributes are equality tests of the form a_i = v_j, where a_i is the attribute and v_j is one of its legal values. The conditions on numerical attributes take the form of an interval of permissible values, v_j1 <= a_i <= v_j2, where v_j1 and v_j2 are two legal values of the attribute a_i. A rule covers an example if the example satisfies all the antecedents of the rule; a rule wins an example if it is the rule closest to the example according to a distance metric that will be described later. RISE looks for "good" rules going from the specific to the general, starting with the training set of examples itself as the first set of rules. RISE takes each rule and finds the nearest example of the same class that the rule does not yet cover (i.e., that is at a distance greater than zero from the rule) and tries to generalize the rule minimally so that it covers the example. The new rule is incorporated into the rule set, replacing the old rule, if the effect of the change (the inclusion of the rule in the set) on the global accuracy is positive; otherwise it is not incorporated into the set.
Generalizations are also accepted if they produce no effect on the global precision of the rule set. (The algorithm was presented in the dissertation "A Unified Approach to Concept Learning", submitted by Pedro Domingos for the degree of Doctor of Philosophy in Information and Computer Science at the University of California, Irvine, USA, in 1997, and summarized by its creator in 2004.) This procedure is repeated until, for each rule, the generalization attempt fails. In the worst case, no generalization is made at all, and the end result is a nearest-neighbor classifier that uses all the examples of the training set as rules. The classification of each test example is carried out by finding the rule closest to the example and assigning to the example the class of that rule. The precision (accuracy) Acc(RS, ES) of a set of rules RS over a set of examples ES is defined as the fraction of examples that the rule set classifies correctly; an example is classified correctly when the rule closest to it has the same class as the example. Since the set of examples ES is always the training set, it remains implicit (that is, the precision will be denoted by Acc(RS)). The precision is measured using the "leave one out" method: when trying to classify an example, the rule corresponding to that example is removed from the rule set, unless it already covers another example too. Each example memorizes its distance to the closest rule and the class assigned by that rule; when a rule is generalized, it is only necessary to match that one rule against all the examples and to check whether it now wins some example that it did not win before, and what effect that produces.
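A minimal sketch of the rule representation and nearest-rule classification described above. The distance here is a deliberate simplification (a mismatch count for symbolic conditions and the gap to the interval for numeric ones), not the full metric used by RISE, and the rules and cases are invented for the example.

```python
# A rule: ({attribute: condition}, class). A symbolic condition is a
# value; a numeric condition is an (inf, sup) interval.
rules = [
    ({"outlook": "sunny", "temperature": (69, 75)}, "play"),
    ({"outlook": "rainy"}, "stay home"),
]

def distance(antecedents, example):
    """Simplified rule-to-example distance: 0 when the rule covers
    the example, +1 per violated symbolic condition, and the gap to
    the interval per violated numeric condition."""
    d = 0.0
    for attr, cond in antecedents.items():
        value = example[attr]
        if isinstance(cond, tuple):            # numeric interval
            lo, hi = cond
            if value < lo:
                d += lo - value
            elif value > hi:
                d += value - hi
        elif value != cond:                    # symbolic equality test
            d += 1
    return d

def classify(example):
    # Assign the class of the nearest rule; a rule that covers the
    # example is at distance zero and therefore always wins.
    antecedents, cls = min(rules, key=lambda r: distance(r[0], example))
    return cls

print(classify({"outlook": "sunny", "temperature": 72}))  # prints "play"
```

Because the initial rule set in RISE is the training set itself, this nearest-rule scheme starts out behaving exactly like a nearest-neighbor classifier and only changes as rules are generalized.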
The examples wrongly classified previously that are now classified correctly after the generalization of the rule increase the accuracy of the rule set, while the examples previously classified correctly that are now misclassified decrease it. If the former are more numerous than the latter, the change in the precision of the rule set is positive, and the new, more general rule is accepted into the rule set. The distance measure used is a combination of the Euclidean distance for the numerical attributes and a simplified version of the value difference metric of Stanfill and Waltz. Let E = (e_1, e_2, ..., e_n, C) be an example with value e_i for the i-th attribute and class C. The ARISE Software: ARISE, described in detail elsewhere, is an implementation of a classification algorithm based on RISE that changes the way missing values are treated and that allows, in addition to classification, the induction of rules, something not implemented in the versions of the software available at the university. The software was tried and evaluated exhaustively with the objective of checking whether a proper implementation of the RISE algorithm had been obtained and whether the change in the treatment of missing values maintains the strengths of the original algorithm. Furthermore, the effectiveness of the implemented algorithm and of the product (ARISE) was tested by comparing the results with those obtained by the WEKA software using the algorithms ID3, C4.5 and NGE on a group of training sets available in the repository of the University of California, Irvine (UCI) (Repository of machine learning databases, Dept. of Information and Computer Science, University of California, Irvine, CA), available for free download at http://www.ics.uci.edu/~mlearn/MLRepository.html.
An Alternative Strategy for the Treatment of Missing Values: In the study that led to the choice of RISE as the algorithm to implement, a limitation was detected right away. In extreme cases, RISE's original strategy of handling missing attribute values as legitimate values can lead to a situation in which imprecise rules are obtained; to avoid that situation, an alternative strategy is proposed. In the case of numerical attributes, the problem is widely described by various authors. The bibliography widely references various strategies for the treatment of numerical attributes with absent values, among others: Remove from the training set the examples that have missing values. Replace the absent value by the mode or the mean of the attribute's values. Replace the absent attribute value by the mode or the mean of the attribute's values for the most frequent class. Other techniques supported by the probability distributions of the attribute values. Note: in the case of absent symbolic attributes, the previous techniques are applicable whenever their symbolic nature allows it. Calculation of the "mean" of symbolic attributes: The simplest method that appears in the bibliography for determining the "mean" of the values of a symbolic attribute is to select the most common value of the attribute; strictly speaking, however, that would be the mode rather than the mean of the attribute, and various authors agree that it does not produce the best classification accuracy. Given that the mean for numerical attributes is the value that minimizes the variance, we can extrapolate that idea to symbolic attributes. It is natural to follow the same strategy as with numerical attributes, since the mean is a measure of central tendency ideal for approximating the distribution of the values of a symbolic attribute.
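The simplest of the imputation strategies listed above (mode for symbolic attributes, mean for numeric ones) can be sketched as follows. The dataset is invented, and `None` marks a missing value.

```python
from collections import Counter
from statistics import mean

# Invented training rows; None marks a missing value.
rows = [
    {"fever": "yes", "temp": 39.0},
    {"fever": "yes", "temp": None},
    {"fever": None,  "temp": 37.5},
    {"fever": "no",  "temp": 36.8},
]

def impute(rows):
    """Replace missing symbolic values by the mode and missing
    numeric values by the mean of the observed values."""
    filled = [dict(r) for r in rows]
    for attr in rows[0]:
        observed = [r[attr] for r in rows if r[attr] is not None]
        if all(isinstance(v, str) for v in observed):
            default = Counter(observed).most_common(1)[0][0]  # mode
        else:
            default = mean(observed)                          # mean
        for r in filled:
            if r[attr] is None:
                r[attr] = default
    return filled

clean = impute(rows)
```

The class-conditional variant mentioned in the text would simply compute the mode or mean over the rows of the most frequent class instead of over all rows.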
To determine the distance between two values of a symbolic attribute, Domingos and other authors use the simplified value difference metric, SVDM(x_i, x_j), which compares the class-conditional behavior of the two values.

This coincides with other authors: if the distances among all the pairs of values of a symbolic attribute are computed, there will be a value of the attribute whose total distance to the rest is smallest (in case of a tie, one of the minimizing values can be selected at random), and that value which minimizes the distances can genuinely be considered the "average" of the set of values. Formally, the mean of a set of symbolic values can be defined as the value m (a value of the attribute) that minimizes the variance of the set, that is, the m that minimizes the sum, over all values v of the attribute, of SVDM(m, v)^2.
Here J is the set of values of the symbolic attribute and v is a value of the set J; m acts as the best approximation for the values of the set J. Analogously to the mean for numerical values, m is replaced in turn by every symbolic value of the set J, and the value m that minimizes the sum is taken as the "average" of the set of symbolic values J. That value m of the symbolic attribute, obtained as its "mean", can then replace each one of the absent values of the attribute, and it is normal to expect this to have an effect equivalent to what is obtained in the case of numerical attributes with missing values. With it, a "normal" training set is obtained (without missing attribute values), on which, applying the same reasoning as for missing numerical attributes, a better or equal classification precision will be obtained than if the original training set were used. This idea has against it that the calculation of m requires obtaining the distance between all the pairs of values of the symbolic attributes, that is, calculating SVDM(x_i, x_j) a number of times equal to the number of variations with repetition of a set with as many elements as the attribute has values, taken two at a time. With A_i being the number of distinct values of symbolic attribute i, this means calculating A_i^2 distances (that is, the number of ordered pairs of attribute values), and that number of distances has to be calculated for each symbolic attribute of the training set that has at least one missing value. Given that SVDM(x_i, x_j) is a distance function, d(x, x) = 0 holds, so the calculation can be avoided A_i times (the distance of a value to itself is zero), and the number of distances for a symbolic attribute is reduced to A_i^2 - A_i.
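The symbolic "mean" described above can be sketched directly as a minimization. For illustration, the distance between two symbolic values is a simple 0/1 mismatch standing in for the SVDM (which would require class-conditional statistics); with a 0/1 distance the minimizer reduces to the mode, which is precisely why the authors prefer the SVDM, but the minimization structure is the same.

```python
def symbolic_mean(values, dist):
    """Return the value m that minimizes the summed squared
    distance to all values in the list (the symbolic 'mean')."""
    candidates = set(values)
    return min(candidates,
               key=lambda m: sum(dist(m, v) ** 2 for v in values))

# Illustrative 0/1 mismatch distance standing in for the SVDM.
overlap = lambda a, b: 0.0 if a == b else 1.0

colors = ["red", "red", "blue", "green", "red"]
print(symbolic_mean(colors, overlap))  # prints "red"
```

Swapping `overlap` for a genuine SVDM function would give the behavior described in the text, since `symbolic_mean` only depends on the distance passed in.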
The fact that d(x, y) = d(y, x) could also be used to reduce the number of calculations further, but it complicates the logic of the calculation. An optimization for the calculation of the mean of symbolic values: after calculating the sum of the distances of all the values of a symbolic attribute with respect to the first of its values, for the second value the remaining values of the attribute are traversed while the accumulated sum of distances remains less than the sum of the distances with respect to the first value. If all the values of the attribute are traversed and the new sum of distances with respect to the second value turns out to be less than the previous sum, the smallest sum of distances and the attribute value for which it was obtained are updated. Otherwise, as soon as the accumulated sum of distances ceases to be less than the current minimum, the traversal moves on to the third value of the attribute without the need to run through them all; the smallest sum is not updated, and the cumulative sum of the distances of all the values with respect to the third value is computed, and so on. Following this idea considerably reduces the calculations, to a level that justifies the use of this proposal despite the slight increase in the complexity of the algorithm, which in the worst case (when the "mean" is the last of the values of the attribute) is quadratic with respect to the number of values of the attribute with the most distinct values that has at least one missing value. Potentials of ARISE: 1. An alternative strategy for the treatment of attributes with missing values is introduced, with the benefits brought by the new strategy. 2.
A strategy for decreasing the number of rules was adopted which, as was verified, does not decrease classification accuracy: removing the rules that are absorbed by other rules. A rule absorbs (or contains) another when both rules are of the same class and, for every attribute, the following holds: if the attribute is symbolic, their values are equal, or the value of the attribute is not present in one or both rules; if the attribute is numeric, the lower bound of the condition (the INF) of one is less than or equal to that of the other and the upper bound (the SUP) is greater than or equal to that of the other. If the above holds for all the conditions present in the rules, then one rule absorbs, or contains, the other. Example: A) If outlook = sunny and 69 <= temperature <= 75 and 70 <= humidity <= 70 then yes. B) If 69 <= temperature <= 75 and 70 <= humidity <= 80 then yes. In this example rule A) is absorbed by rule B), since all the conditions mentioned above are met for each attribute that forms part of the conditions of the rules. 3. The inaccuracies that the presence of multiple instances of a case in the training set provokes in the classification were discovered; these generate conceptual mistakes disclosed on the Web and will be an object of analysis in future work. 4. ARISE validates the input data and provides facilities for the acquisition of the cases to classify and for the creation and editing of training sets, suitable for non-specialist users. 5. The export to several languages helps the eventual creation of expert systems. 6. The results of ARISE against 3 of the most internationally cited algorithms on 17 international test sets show that it behaves best, in addition to having the advantages that motivated its choice.
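The absorption test above is mechanical enough to sketch in code. The rule encoding is invented for illustration (symbolic conditions as values, numeric conditions as (inf, sup) intervals), and the function is a simplified reading of the criterion in the text, not the ARISE implementation.

```python
def absorbs(b, a):
    """True if rule b absorbs rule a: same class, and every
    condition present in b is at least as general as the
    corresponding condition in a."""
    conds_a, class_a = a
    conds_b, class_b = b
    if class_a != class_b:
        return False
    for attr, cond_b in conds_b.items():
        if attr not in conds_a:
            # b constrains an attribute a leaves free: b is stricter there.
            return False
        cond_a = conds_a[attr]
        if isinstance(cond_b, tuple):          # numeric interval: INF/SUP test
            if not (cond_b[0] <= cond_a[0] and cond_a[1] <= cond_b[1]):
                return False
        elif cond_a != cond_b:                 # symbolic: values must be equal
            return False
    return True

# Rules A) and B) from the example in the text.
rule_a = ({"outlook": "sunny",
           "temperature": (69, 75),
           "humidity": (70, 70)}, "yes")
rule_b = ({"temperature": (69, 75),
           "humidity": (70, 80)}, "yes")

print(absorbs(rule_b, rule_a))  # prints True
```

Rule B) wins because it places no condition on outlook and a wider interval on humidity, so every example covered by A) is also covered by B); A) can therefore be removed without changing what the rule set covers.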
The results of the experiments allow us to affirm that replacing missing values with the means of their corresponding attributes maintains the precision of the algorithm and avoids the generation of inaccurate rules. Likewise, applying the ideas of:

1. Not eliminating rules that do not win examples. The ID3 algorithm is applicable only to training sets with discrete attributes, and is therefore favored in the tests; consequently, it could only be tried on 3 of the 17 sets.
2. Incorporating the strategy of eliminating rules absorbed by others.
3. Not admitting repeated instances of the same case.

ARISE preserves, like its base algorithm, its superiority over the algorithms used as counterparts, which represent the state of the art in classification algorithms. The software described is a tool that is easy for health specialists to use and makes possible the application of Artificial Intelligence (AI) techniques in support of medical diagnosis. In addition, it can export the generated decision rules to natural language and to the programming languages C++, Java, and Prolog, which eventually produces useful knowledge. To materialize this goal, projects are being carried out that make it possible to create training sets extracted from the data of Cuban patients, or to enrich the set of attributes that can be valued (with a decimal, nominal, or ordinal value) as a result of the interview and physical examination that the doctor performs on the patient, and that allow an accurate diagnosis. The comparison of diagnoses is made immediately on the basis of complementary analyses, against "virtual" diagnoses made solely on the basis of the interview and examination that the doctor performs on the patient, and on that basis reliable training sets are obtained, oriented toward what this proposal aspires to. CHAPTER 5 What Is Machine Learning And What Is It Contributing To Cognitive Neuroscience?

Machine learning allows an unprecedented understanding of complex data sets, which is giving it a growing role in our society in general and in cognitive neuroscience in particular. These applications have been an exciting advance in the study of basic questions about our cognitive system, as well as in the diagnosis of some important diseases that affect this system. Despite the novelty of these works, the flexibility of machine learning allows us to predict that its most relevant contributions are yet to come. Machine learning refers to the subfield of computer science specialized in the recognition of complex patterns in data sets. Unlike classical programming, in which a program executes the same (more or less complex) operation over and over again, the main feature of machine learning is that its programs manage to extract autonomously (that is, without being specifically programmed for this) relevant information from the data being processed. This information allows the program to "learn", that is, to improve in its execution of the task for which it had been programmed (Turing, 1950). By developing sophisticated algorithms (which can be understood as "models"), these approaches allow us to identify relationships invisible to the human eye. These algorithms interact with us in our day-to-day lives when, for example, our mobile phone's camera recognizes a face or when we use an automatic translation application. In this sense, part of the success of these tools is due to their extensive field of action: from systems that detect mutations in our DNA (Libbrecht and Noble, 2015) to "big data", which identifies patterns in huge data sets about, for example, different segments of our society (Boyd and Crawford, 2012). As expected, the cognitive sciences have not been immune to the development of these tools.
A simple query of the terms "machine learning" and "brain" in an article repository such as PubMed gives an idea of how quickly this approach has spread. But how does machine learning work in practice? To answer this question, it may help to understand what conceptual difference these analyses make, within cognitive neuroscience, with respect to more classical approaches. Imagine that the objective of our study is to compare the processing of images of dogs and cats in two brain regions (region X and region Y). For each of these regions we obtain an activation pattern for each stimulus presented (in the figure, each square of the pattern represents activation [black = activated; white = deactivated] in a voxel [the volumetric unit used in magnetic resonance imaging]). Following the logic of classical analyses, we would proceed to average the activity across these voxels. In all the patterns presented there are 3 activated voxels and 3 deactivated voxels. The average, therefore, would be the same for both animals in the two regions, suggesting that they are similarly involved in the processing of dogs and cats. However, we can observe that the patterns themselves do differ between the two categories. Although in this example the difference jumps out to the naked eye, this type of observation becomes complicated when what we have before us is a larger data set. It is here that researchers have used machine learning techniques to detect, among all the complexity of our data, patterns associated with different representations. Basically, the operation of these techniques consists in training an algorithm to differentiate two classes (in our case, dogs and cats) by presenting it with patterns associated with both categories (yellow and blue dots in the figure). In this way, the algorithm learns what rule to use to separate exemplars of each class (in this example, a diagonal line). The key step ("test") consists in the presentation of new unlabeled exemplars, in order to check the accuracy of the algorithm when assigning each exemplar to the corresponding class.
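The train/test logic described above can be sketched as a toy decoding analysis (a hypothetical illustration on simulated voxel patterns, not the pipeline of the studies cited):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trials, n_voxels = 100, 6  # hypothetical: 6-voxel activation patterns per trial

# Simulate patterns: "dog" trials activate the first voxels more, "cat" trials the last.
dogs = rng.normal(loc=[1, 1, 1, 0, 0, 0], scale=0.5, size=(n_trials, n_voxels))
cats = rng.normal(loc=[0, 0, 0, 1, 1, 1], scale=0.5, size=(n_trials, n_voxels))
X = np.vstack([dogs, cats])
y = np.array([0] * n_trials + [1] * n_trials)  # 0 = dog, 1 = cat

# The mean activation is essentially identical for both classes...
print(dogs.mean(), cats.mean())  # ...so averaging over voxels cannot separate them.

# ...but a classifier trained on the full patterns separates them (the "test" step).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("decoding accuracy:", clf.score(X_te, y_te))
```

Above-chance accuracy on the held-out trials is what licenses the conclusion that the region differentially represents the two categories.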
In this way, when the activity of a brain area allows classifying the exemplars of two different categories above chance level, we will assume that this area is differentially representing these categories. How important are these findings in cognitive neuroscience? While classical techniques allowed us to detect which areas seemed to be involved in certain processes, it was not possible to understand what representations were being encoded in those regions. Thus, for example, when the prefrontal cortex activates during an attention task, machine learning allows us to study whether this activation reflects non-specific control processes (in which different categories would be similarly represented) or, on the contrary, the differential coding of content relevant to the task (Haynes, 2015). The application of machine learning in more clinical contexts is especially exciting. For example, several studies have shown how, from structural images of the brain, it is possible to detect whether a person with mild cognitive impairment (MCI) will develop Alzheimer's-type dementia in the future (Moradi, Pepe, Gaser, Huttunen and Tohka, 2015). In this study, the researchers took advantage of one of the qualities of machine learning discussed above, the extraction of significant regularities from complex data sets. Specifically, they "fed" the classifier with three sources of information: images of the structural state of the brain, scores on different cognitive skills questionnaires, and the age of each participant, both for patients with MCI who subsequently developed Alzheimer's disease and for patients with MCI who did not develop the disease (in our previous example, this would correspond to the different blue and yellow dots). The results of this study showed that the algorithm could differentiate the two groups of patients with 80% accuracy using brain images alone, and that this percentage increased to 90% when all sources of information were combined.
Crucially, this type of result allows the diagnosis to be advanced by between 1 and 3 years compared to other available tools, which is a crucial advantage for treatment. In any case, the most interesting aspect of machine learning is its flexibility. Although introductory, this article allows us to see that these tools are moldable and applicable to many different problems. In a world like the current one, in which the amount of data generated at every moment is huge, the potential of machine learning is undeniable. The fact that the machine learning resources used in cognitive neuroscience are still limited suggests that their potential is even greater in this discipline. Hopefully, therefore, the most relevant contributions are yet to come. Machine learning techniques at the service of Energy Efficiency in the digital home

The efficient use of resources is a universal challenge for today's society, and this is reflected by the European Commission in its Roadmap. In the last decade, great advances have taken place in this field, giving rise to the concepts of Smart Grid and Smart City. The generalization of smart meters and sensors of different types in the domestic environment makes it easy and affordable to extract data on consumption, brightness, and so on. The idea is to learn from such data in order to extract value from them. Knowledge of users' consumption habits and their environment will make it possible to carry out actions that help improve energy efficiency. This communication aims to present a compendium of what these machine learning techniques can be, as well as a comparative study of the technological alternatives available to carry them out. Taking into account that any proposed solution must be scalable, since the data used grows continuously over time and the number of households integrating this type of device keeps increasing, the use of Big Data techniques will be crucial. For more than three decades, the European Union (EU) has based its policy of improving energy efficiency on strategies aimed at reducing energy consumption and preventing energy waste. In this way, the EU tries to contribute decisively to competitiveness, security of supply, and respect for the commitments made within the scope of the Kyoto Protocol on climate change (Grubb et al., 1999). Among the many potential sources of reduced energy consumption is housing, on which the Commission has focused its efforts, summarized in Directive 2010/31/EU on the energy performance of buildings (Commission, 2010), which notes that 40% of the EU's final energy is consumed in this area.
On the other hand, on June 20, 2013, the study entitled "Energy efficiency certificates in buildings and their impact on transaction prices and rents in European Union countries" was published, whose conclusions show that mandatory energy certification initiatives such as RD 235/2013 (Ballesteros and Shaw, 2013) will have a positive effect on the housing market, where homes with better efficiency ratings will be more highly valued. In general, however, these initiatives have not had the expected short-term impact: users do not choose one house over another for its energy certification, and when the occasion arises to replace an appliance, the most relevant criterion is not usually its efficiency level but its price or its features. And although awareness of climate change and of the need to consume less, and efficiently, is increasingly present in our daily lives, end users expect their home and the elements that make it up to provide a better quality of life, not to impose an additional concern that in many cases they neither understand nor care about. This is where projects such as Smart Home Energy (SHE) (SHE Consortium, 2012) present themselves as genuine energy assistants that allow end users to manage their home in an integral and unattended way, guaranteeing energy efficiency criteria and achieving the minimum point of demand and consumption at all times. To achieve this goal, these systems rest on three fundamental pillars: monitoring, prediction through machine learning, and control. In this way, projects such as the one presented in this communication are able to characterize the energy behavior of a home and issue the appropriate recommendations or corrections for its self-management. The document is organized as follows: first, a study of related research will be presented, both on improving energy efficiency and on applying similar techniques in other fields.
Subsequently, we review the technological alternatives that support the proposed system, which is then described in Section 4 together with its main modules. Finally, conclusions and future work are presented in the last section.

State of the Art

The basis of any self-learning model is its ability to predict the behavior of a system in order to develop control strategies for a specific purpose. Machine learning techniques have been successfully applied in other fields. In particular, recommender systems based on user preferences and similarities are increasingly present in everyday environments and applications. One of the best-known cases is the Amazon shopping website, where users receive recommendations based on purchases made by users considered similar (Linden et al., 2003). Similarly, the popular video-on-demand service Netflix uses the scores that users give to the movies they watch to propose movies to other users with similar tastes (Xavier Amatriain, Justin Basilico, 2012). It is also very common to use these systems in partner search services to propose possible candidates to the users of the service. As for prediction, well-known cases include the application of these techniques in economics, to try to anticipate the future evolution of the markets (Satchell and Knight, 2011), and in medicine, where they can help, for example, to anticipate a possible stroke (Khosla et al., 2010). It is also possible to perform classification and identification tasks, as evidenced by various studies in the field of computer vision (Chechik et al., 2010) and in the detection of malicious elements on the web (Ma et al., 2009). Returning to the topic of this communication, it must be taken into account that a house is a complex model that integrates different elements of consumption (appliances and electric vehicles), sources of uncertainty (outside temperature and the behavior of its occupants), and sources of energy production (air conditioning equipment, boilers, solar thermal or photovoltaic panels, other renewable energies, etc.).
To design and then develop an energy management system for homes, it is essential that it be able to predict parameters such as temperature or energy demand. Previous studies show that predictive models based on neural networks (González Lanza and Zamarreño Cosme, 2002) (González and Zamarreno, 2005) and Support Vector Machines (SVM) (Zhao and Magoulès) are well suited to this task.

Technologies involved

Data collection

The acquisition of data comprises hardware on the one hand and software on the other. Regarding the hardware, smart meters must be installed in the home, capable of taking consumption readings at regular intervals. The SHE project aims to be hardware independent, and currently includes support for the Current Cost EcoManager (Current Cost, 2012) and Plugwise (Plugwise, 2012) meters, which can be seen in Figure 1. Both are based on a network of devices placed between the appliance and the plug that send data to a central device via radio frequency and ZigBee, respectively. In addition, other household information (brightness, temperature, pollution level, etc.) can be collected through sensors. The data is extracted from the meters and sensors using a software adapter and offered as an energy service through the Web Services for Devices (WS4D) technology ("Web Services for Devices (WS4D) Website," 2012). This service will be consumed by the elements responsible for performing machine learning, which are described in Section 4 of this document.
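The adapter layer can be pictured with a minimal sketch (the class, field names, and vendor driver below are invented for illustration; they are not the SHE project's actual API):

```python
import time
from dataclasses import dataclass

@dataclass
class Reading:
    timestamp: float   # Unix time at which the reading was taken
    device_id: str     # identifier of the monitored appliance (hypothetical)
    watts: float       # instantaneous power draw reported by the meter

class MeterAdapter:
    """Hypothetical adapter that normalizes readings from a vendor-specific meter."""
    def __init__(self, vendor_client):
        self._client = vendor_client  # e.g. a driver for one meter brand

    def read(self) -> Reading:
        raw = self._client.poll()  # vendor-specific call, assumed here
        return Reading(time.time(), raw["id"], float(raw["power_w"]))

# A stub standing in for a real vendor driver, for demonstration only.
class FakeVendorClient:
    def poll(self):
        return {"id": "fridge-01", "power_w": 120.5}

reading = MeterAdapter(FakeVendorClient()).read()
print(reading.device_id, reading.watts)
```

The point of such a layer is exactly the hardware independence mentioned above: each meter brand gets its own driver, while the rest of the system consumes one uniform reading format.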

Data storage and processing

The data obtained from the home is stored centrally thanks to Cloud technology (Rhoton and Haukioja, 2011), which makes it unnecessary to have large storage capacity in the home itself. It also offers facilities to manage and maintain the integrity, security, and availability of data, storing replicas to be used in case of loss of information. Another key technology in the system, as it enables scalability, is Big Data (Marz and Warren, 2013). With an investment whose cost grows only linearly, it covers the storage and processing needs and allows elastic management of the Information Technology (IT) infrastructure, adapting its size according to need. Big Data supports large volumes of data of various types, processing them in an acceptable time. It also offers support for machine learning algorithms, visual data mining tools, near-real-time monitoring, and a series of new possibilities for analysis and information processing that fit perfectly with the proposed system. The standard frequency in energy measurement is one measurement every 15 minutes, due to the limitations of the technologies preceding Big Data. Increasing the frequency of measurements allows a more detailed knowledge of what is happening at home and thus performing different activities with the information. Going down to a frequency of one measurement per second, for example, implies a huge increase in the capacity needed to store and process the information. Likewise, monitoring several devices in the same household, or breaking down the general consumption into several devices, multiplies the volume further. In addition, the incorporation of almost all consumers into these measurement techniques would mean a significant increase in communications requirements. A home under these measurement requirements would generate and transmit about 1 kB of data per second.
This implies the need for a system capable of receiving and storing 32 Petabytes per million users per year (PMA). Additionally, it will be necessary to support the storage and processing of the operations to be carried out on all this information, which, depending on the functionalities to be offered, could increase this figure enormously. Among the intended requirements are that a user can access the system remotely on a regular basis (from several times a day to once a month, depending on the type of user) and consult their consumption detail. In addition, based on these consumption data, the system must be able to issue personalized recommendations for each user: energy recommendations, alerts about bad practices, consumption predictions with the greatest possible certainty (all decision making ultimately rests on a prediction), and so on.
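The storage figure quoted above follows from straightforward arithmetic (assuming 1 kB per second per home):

```python
bytes_per_second = 1_000           # 1 kB/s generated by one monitored home
seconds_per_year = 365 * 24 * 3600
homes = 1_000_000                  # one million users

bytes_per_year = bytes_per_second * seconds_per_year * homes
petabytes = bytes_per_year / 1e15
print(f"{petabytes:.1f} PB per million users per year")  # ≈ 31.5 PB
```

The result, about 31.5 PB, is the roughly 32 Petabytes per million users per year cited in the text.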

Machine learning tools

When it comes to machine learning, the use of a suitable tool can greatly simplify the work. First, Weka (Frank et al., 2010) is a well-consolidated tool whose main advantage is that it incorporates a large number of learning algorithms, so it can be very useful in the preliminary phases of a study, to test algorithms and to analyze the data for possible trends or relationships. Its great disadvantage is that it does not offer support for Big Data, which is why other alternatives better suited to the amount and variety of data to be handled in the proposed system were studied. Among them, Mahout (Owen et al., 2011) and Jubatus (Jubatus WebSite, 2011) stand out. Mahout is an Apache Open Source project, still under development, which offers a Java implementation of various machine learning algorithms for classification, clustering, and recommendation. Its roadmap includes increasing and diversifying the types of algorithms supported. Its great advantage is the scalability it offers by running on Hadoop and its MapReduce architecture (Lam, 2010), which also simplifies its integration into a Cloud system. As for Jubatus, it is a framework whose particularity is that the learning it performs is online; that is, the model is updated in each iteration, so it does not require large storage capacity, and it can offer a first response in a shorter time, increasing its accuracy over time. Its client/server architecture is also scalable, and it offers clients for several languages (Java, Python, C++, Ruby) with a package of very recent algorithms already implemented. However, as it is a recent project, many algorithms remain to be incorporated, especially those dealing with time series, whose intrinsic characteristics make them a particular case.

Proposed system

The proposed system, which can be seen graphically, will make use of the technologies described in the previous section to perform machine learning that helps improve the energy efficiency of the home through collaborative recommendations, predictions, and data analysis.

Collaborative Recommendations

The function of this module is to suggest to users actions that other, similar users have performed in their homes and that have helped them reduce energy consumption. For this, it is necessary to define how to calculate the similarity between two users, or, what amounts to the same thing, the "distance" between them. There are several algorithms to calculate this distance, such as Pearson's correlation coefficient, Euclidean distance, cosine similarity, etc. These algorithms are based on users rating the actions they take within a range of values, indicating their degree of acceptance. In the specific system proposed in this work, users do not rate how much they agree with performing an action; they simply say whether they accept it or not. That is, a user either does not accept the action (value 0) or accepts it completely (value 1). For this type of assessment, the most appropriate algorithms are those based on Tanimoto similarity (Cechinel et al., 2013), available in Mahout. The Tanimoto algorithm is based on the calculation of the Tanimoto similarity coefficient:

T = C / (A + B - C)

Where:

A = number of actions taken by user 1
B = number of actions taken by user 2
C = number of actions taken in common by users 1 and 2

Using this formula, all the distances between users are calculated and a matrix is built where all this data is stored. When making a recommendation to a user, the algorithm returns those actions that the most similar users have performed and that the user to whom we want to offer the recommendation has not yet performed.
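As an illustration, the coefficient can be computed directly from two users' sets of accepted actions (a minimal sketch; the action names are invented):

```python
def tanimoto(actions_1: set, actions_2: set) -> float:
    """Tanimoto similarity: C / (A + B - C), where C is the overlap."""
    a, b = len(actions_1), len(actions_2)
    c = len(actions_1 & actions_2)
    return c / (a + b - c) if (a + b - c) else 0.0

# Hypothetical binary "accepted action" profiles for two households.
user1 = {"lower_thermostat", "led_bulbs", "off_peak_laundry"}
user2 = {"lower_thermostat", "led_bulbs", "unplug_standby"}

print(tanimoto(user1, user2))  # 2 common / (3 + 3 - 2) = 0.5
```

Because the ratings are binary (accepted or not), each user reduces to a set of actions, which is exactly the situation where the Tanimoto coefficient applies.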

Predictions

In order to achieve efficient use of energy, a prediction system is provided that learns the consumption patterns of a given household. This knowledge captures behavior driven both by technical aspects and by habits of daily life, so that users can predict their consumption and adapt their activities toward consumption habits that are more economically efficient (moving actions to moments with a better rate) and more environmentally responsible. Several times a day, the system generates from massive amounts of information a weekly consumption prediction for each user, as can be seen in Figure 5. To make this type of prediction, the user's own consumption factors, general consumption factors, and other relevant external factors are taken into account, such as schedules, holiday calendars, prices, rates, meteorology... The strategy applied by the model allows defining and adjusting the way in which the system learns: what data is relevant, what data is not, and what data should be filtered or enhanced in learning. The use of a learning algorithm provides a memorization mechanism, capable of remembering and forgetting experienced situations in adequate proportions, and of providing a consistent value for situations that have not previously occurred.
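A drastically simplified predictor of this kind might look as follows (a toy sketch on synthetic data, not the SHE model; the features and coefficients are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic hourly samples: hour of day, outdoor temperature (°C), weekend flag.
hours = rng.integers(0, 24, size=500)
temp = rng.normal(15, 8, size=500)
weekend = rng.integers(0, 2, size=500)
# Invented ground truth: consumption (Wh) rises through the day and on weekends,
# and falls as the outdoor temperature rises (heating-dominated household).
consumption = 200 + 12 * hours - 4 * temp + 80 * weekend + rng.normal(0, 10, size=500)

X = np.column_stack([hours, temp, weekend])
model = LinearRegression().fit(X, consumption)

# Predict for tomorrow at 18:00, 10 °C, on a weekday:
pred = float(model.predict([[18, 10.0, 0]])[0])
print(round(pred))  # should land near 200 + 12*18 - 4*10 = 376
```

A real system would add the calendar, price, and rate factors mentioned above, and a model able to forget stale patterns, but the fit/predict structure is the same.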

The model has been tested and trained with buildings located in several climate zones, and a satisfactory result has been obtained in all cases. The effectiveness of the prediction system has been evaluated by statistical analysis. The interpretation of these statistics allows evaluating the error made in the prediction and the way in which this error is distributed. The suitability of the technique depends on interpreting these statistics against the situations to be detected or avoided. The tests carried out, whose results are shown in Table 1, apply the model to real data from a commercial establishment and yield the following values for these statistics:

Table 1: Results of the evaluation of the effectiveness of the prediction system through statistics

Mean absolute deviation (MAD). Atlantic = 237.46; Continental = 207.59. The average of the absolute forecast errors of each period t. It gives a perspective of the error that can be made in a prediction. The value is similar in both climate zones, with a better fit in the continental climate.

Mean absolute percentage error (MAPE). Atlantic = 0.099; Continental = 0.194. Indicates, on a [0, 1] scale, the average relative prediction error committed. The technique presents a better value on the oceanic (Atlantic) climate data set.

Percentage of absolute success (PA), with PA = (1 - MAPE) × 100. Atlantic = 90.09%; Continental = 80.59%. Presents the MAPE value as a success percentage in [0, 100]%, which allows comparing the algorithm across data sets with different values. The technique shows a better value on the oceanic climate data set.

Root mean square deviation (RMSD). Atlantic = 333.03; Continental = 281.17. An indicator of the average total error produced in the predictions over time. Errors are spread out over time if the value is low, or concentrated at certain times if it is high. Concentrated errors can be considered more convenient, since they can be treated as anomalies or associated with new variables not yet contemplated. The technique shows a higher concentration of error in the oceanic climate than in the continental one.

Standard deviation (SD). Atlantic = 233.49; Continental = 189.63. Indicates the dispersion of the error with respect to the mean error, that is, whether the error is concentrated at some points or evenly distributed. In the first case the algorithm behaves stably in terms of the error made in each prediction; in the second, there are large variations between predictions. The technique shows a better value in the continental climate than in the oceanic one.

In order for the prediction information to be interpretable by the user so that decisions can be made, it is presented in the form of a daily consumption graph. This graph can be divided into daily sections and weighted according to the rate applicable in each section, so that the user can schedule activities on the same day under different rates.

Analysis of data

The techniques known as Insight and Visual Mining give the user a general-purpose tool that can be customized at the time of use to draw their own conclusions about their data and thus make decisions. Through these techniques, the household information is presented to the user in a highly expressive visual way, allowing a great deal of varied information to be assimilated at a glance and avoiding many graphs of disjoint elements. A relevant graph is the "Heat Map" of consumption. As can be seen in Figure 6, consumption is represented on a disk consisting of seven concentric circles (one for each day of the week) and 24 segments (one for each time interval), where the color corresponds to the intensity of consumption.
This graph helps the user detect consumption differences between days of the week, and therefore associate them with habits in order to improve them. In addition, it allows the user to navigate through different weeks and select which devices to analyze. An inspection capability is also provided, allowing the user to select sections and thus filter the rest of the information in order to facilitate conclusions. In this way, a user can, for example, select their consumption on Mondays between certain time intervals. When selecting, the user can fit the graphs to the selected interval, zooming in on it and filtering the data. The user can also navigate the graphs by scrolling along the coordinate axis that represents time.
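The statistics reported in Table 1 above are standard forecast-error measures; for any series of actual and predicted values they can be computed as follows (a generic sketch, not the project's evaluation code):

```python
import math

def error_stats(actual, predicted):
    """MAD, MAPE, PA, RMSD, and SD of the forecast errors (actual - predicted)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mad = sum(abs(e) for e in errors) / n
    mape = sum(abs(e) / a for e, a in zip(errors, actual)) / n  # assumes actual > 0
    pa = (1 - mape) * 100
    rmsd = math.sqrt(sum(e * e for e in errors) / n)
    mean_e = sum(errors) / n
    sd = math.sqrt(sum((e - mean_e) ** 2 for e in errors) / n)
    return {"MAD": mad, "MAPE": mape, "PA": pa, "RMSD": rmsd, "SD": sd}

# Toy hourly consumption values (Wh) and model predictions, for illustration only.
actual = [500, 650, 700, 620]
predicted = [480, 660, 760, 600]
stats = error_stats(actual, predicted)
print({k: round(v, 3) for k, v in stats.items()})
```

Run on real consumption series per climate zone, these are the quantities tabulated above.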

Conclusions and future work

Energy knowledge of the home is a key factor in achieving efficient use of resources. The extended use of meters and sensors at the residential level makes it easy to obtain this information, but, due to its variety and size, it cannot feasibly be processed manually, so it is necessary to resort to machine learning techniques. Within the context of energy efficiency in the home, a collaborative recommender makes it possible to suggest energy-saving actions to groups of users with a similar profile, without needing prior knowledge of those users and their characteristics. The possibility of predicting the cost of consumption under time-discriminated energy rates for daily planning would allow users to better distribute their consumption habits throughout the day. In addition to the recommendations and predictions, users are offered a tool to perform manual analysis themselves, which would otherwise be unattainable given the huge amount of data that a digital home can produce. Another promising field is the anticipation of energy demand. In the proposed system, predictions could allow marketers and energy service companies to make smarter energy purchases and to analyze efficiency actions for their customers. A possible future line would be to identify devices from their consumption data when they are connected to the network, thus allowing demand to be anticipated. For this, the possibility of training a classifier in a supervised way is raised; that is, using consumption data labeled with the identifier of the device that generated them. In a second phase, the classifier has to be tested and adjusted until it reaches optimum precision. Another possible step in improving energy efficiency is to give the system the ability to simulate actions, in order to study their impact on both cost and consumption.
CHAPTER 6 5 natural language processing methods that are rapidly changing the world around us

Planning to learn NLP and develop natural language processing applications? Do you want to create your own application or program for the voice assistants Amazon Alexa or Yandex Alice? In this section we will talk about the areas of development and the techniques used to solve NLP tasks, so that it becomes easier for you to find your way around. Natural language processing (hereinafter NLP) is an area located at the intersection of computer science, artificial intelligence, and linguistics. Its goal is to process and "understand" natural language in order to translate text and answer questions. With the development of voice interfaces and chatbots, NLP has become one of the most important artificial intelligence technologies. But fully understanding and reproducing the meaning of language is an extremely difficult task, since human language has particular features: Human language is a system specifically designed to convey the meaning of spoken or written words. It is not just an environmental signal, but a deliberate transfer of information. In addition, the language is encoded in such a way that even young children can learn it quickly. Human language is a discrete, symbolic, categorical signaling system, and a reliable one. The categorical symbols of language are encoded as signals for communication over several channels: sound, gestures, writing, images, and so on. Moreover, language can be expressed through any of these channels.

Where NLP is applied Today, the number of useful applications in this area is growing rapidly: search (written or spoken); targeted online advertising; automatic (or assisted) translation; sentiment analysis for marketing; speech recognition and chatbots; voice assistants (automated customer support, ordering goods and services).

Deep Learning in NLP A significant part of NLP technology works thanks to deep learning, a field of machine learning that only began to gain momentum at the beginning of this decade, for the following reasons:

Large amounts of training data have been accumulated; computing power has grown (multi-core CPUs and GPUs); new models and algorithms have been created with advanced capabilities and improved performance, flexible training on intermediate representations, end-to-end learning methods that use context, and new methods of regularization and optimization. Most machine learning methods work well because of human-designed representations of the data and input features; the algorithm merely optimizes weights to make the final prediction better. In deep learning, the algorithm instead attempts to automatically extract the best features or representations from raw input. Manually created features are often too specialized, incomplete, and time-consuming to create and validate. In contrast, features discovered by deep learning are easily adaptable. Deep learning offers a flexible, versatile, and learnable framework for representing the world in both visual and linguistic information. Initially this led to breakthroughs in speech recognition and computer vision. These models are often trained with one common algorithm and do not require traditional feature engineering for a specific task. Vector representation (text embeddings) In traditional NLP, words are treated as discrete symbols, which are then represented as one-hot vectors. The problem with treating words as discrete symbols is that one-hot vectors admit no notion of similarity, so the alternative is to learn to encode similarity into the vectors themselves. Vector representation is a method of representing strings as vectors of real values. A dense vector is constructed for each word so that words found in similar contexts have similar vectors. Vector representation is considered the starting point for most NLP tasks and makes deep learning effective even on small datasets.
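A minimal sketch of the contrast between one-hot vectors and dense embeddings: under cosine similarity, every pair of distinct one-hot vectors looks equally unrelated, while dense vectors can place related words close together. The dense values below are invented for illustration, not trained embeddings.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# One-hot vectors: every pair of distinct words is equally (and totally) dissimilar.
one_hot = {"hotel": [1, 0, 0], "motel": [0, 1, 0], "banana": [0, 0, 1]}
print(cosine(one_hot["hotel"], one_hot["motel"]))   # 0.0 — no notion of similarity

# Dense vectors can encode similarity. These 3-d values are made up for
# illustration; real embeddings are learned from a corpus.
dense = {
    "hotel":  [0.9, 0.8, 0.1],
    "motel":  [0.8, 0.9, 0.0],
    "banana": [0.0, 0.1, 0.9],
}
print(cosine(dense["hotel"], dense["motel"]))    # high (~0.99)
print(cosine(dense["hotel"], dense["banana"]))   # low  (~0.16)
```

This is exactly why dense vectors are considered the starting point for most NLP tasks: similarity between words becomes a geometric property of the space.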
Word2vec and GloVe, vector representation techniques created by Google (Mikolov) and Stanford (Pennington, Socher, Manning) respectively, are popular and often used for NLP tasks. Let's look at these techniques.

Word2vec Word2vec accepts a large corpus of text in which each word in a fixed vocabulary is represented as a vector. The algorithm then runs through each position t in the text, which gives a center word c and context words o. The similarity of the word vectors for c and o is used to calculate the probability of o given c (or vice versa), and the word vectors are adjusted to maximize this probability.

Stop words (words with a very high frequency of occurrence; in English: a, the, of, then) are removed from the dataset to achieve a better Word2vec result. This helps improve model accuracy and reduce training time. In addition, negative sampling is used for each input, updating the weights for all correct labels but only for a small number of incorrect ones.

Word2vec comes in two model variants:

Skip-Gram: a context window containing k consecutive words is examined. One word is withheld, and a neural network is trained on all the words except the missing one, which the algorithm tries to predict. Therefore, if two words regularly share a similar context in the corpus, they will end up with close vectors. Continuous Bag of Words (CBOW): many sentences are taken from the corpus. Each time the algorithm sees a word, its adjacent words are taken as well. The context words are fed to the input of the neural network, which predicts the word in the center of that context. With thousands of such context/center-word pairs, we get many training instances for the network. The network trains, and the output of the encoded hidden layer becomes the embedding for a particular word. The same thing happens when the network trains on a large number of sentences: similar vectors are assigned to words used in similar contexts. The main complaint about Skip-Gram and CBOW is that they belong to the class of window-based models, which use corpus co-occurrence statistics inefficiently and can therefore produce suboptimal results. GloVe GloVe strives to solve this problem by capturing the meaning of a word embedding through the structure of the whole observed corpus. To do this, the model uses global word co-occurrence counts, minimizes a least-squares objective over them, and produces a word-vector space with meaningful substructure. Such a scheme sufficiently equates word similarity with vector distance.
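The Skip-Gram setup of center and context words can be sketched by generating (center, context) training pairs from a sliding window. This is a simplified illustration of the data-preparation step only; real implementations also subsample frequent words and use negative sampling, as described above.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs as used by the Skip-Gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the center is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

toks = "the quick brown fox".split()
print(skipgram_pairs(toks, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

CBOW uses the same windows with the roles reversed: the context words are the input and the center word is the prediction target.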

In addition to these two models, many recently developed technologies have found application: FastText, Poincare Embeddings, sense2vec, Skip-Thought, Adaptive Skip-Gram.

Machine translation Machine translation is the conversion of text in one natural language into text with equivalent content in another language, performed by a program or machine without human intervention. Machine translation uses statistics about how words are used in context. Machine translation systems are widely used, since translation between the world's languages is an industry with a volume of $40 billion per year. Some famous examples:

Google Translate translates 100 billion words per day. Facebook uses machine translation to automatically translate text in posts and comments, to break down language barriers and allow people from different parts of the world to communicate with each other. eBay uses machine translation technology to enable cross-border trade and connect buyers and sellers from different countries. Microsoft applies AI-based translation for end users and developers on Android, iOS, and Amazon Fire, regardless of Internet access. Systran was the first software provider to launch a neural machine translation engine, in 30 languages, in 2016. In traditional machine translation systems, you have to use a parallel corpus: a set of texts, each of which is translated into one or more other languages. For example, given the source language f (French) and the target e (English), it is necessary to build a statistical model that includes a probabilistic formulation of Bayes' rule, a translation model p(f|e) trained on the parallel corpus, and a language model p(e) trained on an English-only corpus. Needless to say, this approach misses hundreds of important details, requires a large amount of manual feature engineering, and consists of many separate and independent machine learning tasks. Neural Machine Translation (NMT) is an approach that models translation with a recurrent neural network (RNN). An RNN is a neural network whose state depends on previous steps: neurons receive information from previous layers as well as from themselves at the previous time step. This means that the order in which data is fed in and the network is trained matters: the result of feeding "Donald" then "Trump" does not coincide with the result of feeding "Trump" then "Donald".
The standard neural machine translation model is an end-to-end neural network, where the source sentence is encoded by an RNN called the encoder and the target words are predicted by another RNN called the decoder. The encoder "reads" the source sentence one token per time step and summarizes it in its last hidden layer. The decoder is trained with backpropagation to use this summary and produce the translated version. Remarkably, neural machine translation went from a peripheral research activity in 2014 to the standard for machine translation in 2016. Achievements of neural-network-based translation: End-to-end training: the parameters in NMT (Neural Machine Translation) are simultaneously optimized to minimize the loss function at the output of the network. Distributed representations: NMT makes better use of similarities between words and phrases. Better use of context: NMT uses more context, from the source and from the partially generated target text, to translate more accurately. Higher-quality text generation: translation based on deep learning is far superior in quality to the parallel-corpus method. The main problem with RNNs is the vanishing gradient problem, in which information is lost over time. Intuitively this may not seem serious, since it concerns only the weights, not the states of the neurons; but over time the weights become the places where information from the past is stored. If a weight is 0 or 100,000, the previous state will not be very informative. As a result, RNNs have difficulty remembering words that appear early in a long sequence, and predictions end up based mostly on the most recent words, which creates problems. Long Short-Term Memory networks (LSTM) try to deal with the vanishing gradient problem by introducing gates and a memory cell. Each neuron becomes a memory cell with three gates: input, output, and forget.
These gates act as bodyguards for information, allowing or blocking its flow.

The input gate determines how much information from the previous layer will be stored in the cell; the output gate determines how much of the current cell's state the next layer will see. The forget gate controls how strongly a value is kept in memory: if a new chapter begins while a neural network is studying a book, it sometimes becomes necessary to forget some words from the previous chapter. It has been shown that LSTMs can learn complex sequences and, for example, write in the style of Shakespeare or compose primitive music. Note that each of the gates is connected to the cell of the previous neuron with its own weight, which requires more resources. LSTMs are widespread and commonly used in machine translation; they are also the standard model for most sequence labeling tasks involving large amounts of data.
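The three gates above can be sketched as a toy, single-unit LSTM step. All weights below are made-up scalars chosen only for illustration; real LSTM cells use learned weight matrices over vector-valued states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a toy scalar (single-unit) LSTM cell showing the three gates.
    w is a dict of invented scalar weights, standing in for weight matrices."""
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate memory
    c = f * c_prev + i * g   # forget gate scales old memory, input gate admits new
    h = o * math.tanh(c)     # output gate decides how much of the cell is exposed
    return h, c

# Feed a short sequence through the cell with uniform made-up weights.
w = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(x, h, c, w)
print(h, c)
```

The key design point visible even in this toy: the cell state c is updated additively (f * c_prev + i * g), which is what lets gradients survive over longer sequences than in a plain RNN.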

Gated Recurrent Units (GRU) differ from LSTMs, although they too are an extension for recurrent neural networks. A GRU has one gate fewer, and the work is structured differently: instead of input, output, and forget gates there is an update gate, which determines how much information to keep from the last state and how much to pass through from previous layers. The reset gate functions much like the LSTM's forget gate, but it is located differently. GRUs always transmit their full state and have no output gate. Often these gates behave like an LSTM's; the big difference is that a GRU is faster and easier to train (but also less interpretable). In practice the two tend to neutralize each other, since a larger network is needed to restore expressiveness, which negates the gains. But in cases where extra expressiveness is not required, GRUs can outperform LSTMs.

In addition to these three main architectures, over the past few years, many improvements in neural network machine translation have appeared. Below are some noteworthy developments:

Sequence-to-Sequence Learning with Neural Networks proved the effectiveness of LSTMs for neural machine translation. The paper presents a general approach to sequence learning that makes minimal assumptions about the structure of the sequence. The method uses a multi-layer LSTM to map the input sequence to a vector of fixed dimension, and then another LSTM to decode the target sequence from that vector. Neural Machine Translation by Jointly Learning to Align and Translate introduced the attention mechanism into NLP (discussed in the next part). Recognizing that a fixed-length vector is a bottleneck for NMT performance, the authors propose letting the model automatically search for the parts of the source sentence that are relevant to predicting the target word, without having to form these parts explicitly. Convolutional over Recurrent Encoder for Neural Machine Translation augments the standard RNN encoder in NMT with an additional convolutional layer to capture wider context at the encoder output. Google created its own NMT system, Google's Neural Machine Translation, which addresses problems of accuracy and ease of use. The model consists of a deep LSTM network with 8 encoder and 8 decoder layers and uses both residual connections and attention connections from the decoder to the encoder network. Instead of recurrent neural networks, Facebook AI researchers use a convolutional neural network for sequence-to-sequence learning tasks in NMT.

Voice assistants A lot has been written about conversational artificial intelligence (AI); most development focuses on vertical chatbots, messenger platforms, and startup opportunities (Amazon Alexa, Apple Siri, Facebook M, Google Assistant, Microsoft Cortana, Yandex Alice). AI's ability to understand natural language is still limited, so creating a fully-fledged conversational assistant remains an open task. Nevertheless, the work below is a starting point for people interested in a breakthrough in voice assistants. Researchers from Montreal, the Georgia Institute of Technology, Microsoft, and Facebook created a neural network that generates context-sensitive conversational responses. The system can train on a large number of unstructured conversations from Twitter. A recurrent neural network architecture is used to address the sparsity that appears when integrating contextual information into the classical statistical model, allowing the system to take into account what was said earlier. The model shows a solid improvement over context-sensitive and context-insensitive machine translation and information retrieval baselines. The Neural Responding Machine (NRM), developed in Hong Kong, is a response generator for short-text conversations. NRM uses the general encoder-decoder framework: response generation is formalized as a decoding process based on a hidden representation of the input text, with both encoding and decoding implemented using recurrent neural networks. NRM is trained on a large dataset of single-turn dialogs collected from microblogs. Empirically, NRM was found to generate grammatically correct and contextually relevant answers for 75% of the input texts, outperforming state-of-the-art models in the same setting.

The latest model, Google's Neural Conversational Model, offers a simple approach to modeling dialog using the sequence-to-sequence framework. The model maintains a conversation by predicting the next sentence given the previous sentences of the dialog. Its strength is that it can be trained end-to-end, which requires far fewer hand-crafted rules.

The model can hold simple dialogs when trained on a large conversational dataset, and it can extract knowledge both from highly specialized datasets and from large, noisy, general-purpose datasets of movie subtitles. In the highly specialized domain of an IT help desk, the model finds solutions to technical problems through dialog. On noisy movie transcript datasets, it can perform simple common-sense reasoning.

Question answering (QA) systems The idea of question answering (QA) systems is to extract information directly from a document, conversation, online search, or any other source that meets the user's needs. Instead of forcing the user to read the full text, QA systems prefer to give short and concise answers. Today QA systems are easily combined with chatbots, go beyond searching text documents, and can extract information from sets of pictures. Most NLP tasks can be considered question-answering tasks. The paradigm is simple: a request is sent, and the machine provides an answer. By reading a text or a set of instructions, an intelligent system should find the answer to a wide range of questions, so naturally we want to build a model that can answer general questions. Especially for QA tasks, a powerful deep learning architecture has been created and optimized: the Dynamic Memory Network (DMN). Trained on a set of input data and questions, the DMN forms episodic memories and uses them to generate suitable answers. The architecture consists of the following components: The semantic memory module, similar to a knowledge base, consists of pre-trained GloVe vectors, which are used to create sequences of word embeddings from the incoming sentences. These vectors serve as input to the model. The input module processes the input vectors into sets of vectors called facts. This module is implemented with a Gated Recurrent Unit (GRU), which lets the network learn the relevance of the sentence in question. The question module processes the question word by word and produces a vector using the same GRU as the input module, with the same weights. The episodic memory module stores the fact and question vectors extracted from the input, encoded as embeddings.
This is similar to the process in the brain's hippocampus of retrieving temporal states in response to sounds or sights. The answer module generates a suitable response: at the last step, episodic memory contains the information needed for the answer. This module uses another GRU, trained with cross-entropy loss on the correct sequence, which is then converted back to natural language. The DMN does a good job on QA tasks and also outperforms other architectures on sentiment analysis and part-of-speech tagging. Since its initial release, the DMN has undergone a number of improvements to further increase accuracy on QA tasks:

DMN for textual and visual question answering applies the DMN to images: the input and memory modules are upgraded to answer visual questions. This model improves on existing architectures on most visual question-answering benchmarks without supervision. Dynamic coattention networks for question answering address the problem of getting stuck in a local maximum corresponding to an incorrect answer. The model fuses co-dependent representations of the question and the text to focus on their relevant parts; a dynamic pointing decoder then iterates over the full set of potential answers.

Text summarization It is difficult for a person to manually produce a summary of a large volume of text, so NLP tackles the problem of creating an accurate and concise summary of a source document. Text summarization is an important tool for interpreting textual information. Push notifications and article digests attract a lot of attention, and the number of tasks requiring sensible and accurate summaries of large fragments of text is growing day by day. Automatic extractive summarization works as follows. First, the frequency of occurrence of each word in the full document is computed, then the 100 most common words are kept and sorted. After that, each sentence is scored by the number of frequently used words it contains, with more common words weighted higher. Finally, the top X sentences are sorted according to their position in the original text.
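The frequency-based procedure above can be sketched in a few lines. This is a simplified toy (the example text and sentence-splitting regex are my own, and real systems would remove stop words first):

```python
from collections import Counter
import re

def summarize(text, n_sentences=1, top_words=100):
    """Toy frequency-based extractive summary, following the steps above."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = dict(Counter(words).most_common(top_words))  # keep most common words
    # Score each sentence by the summed frequency of its words.
    scored = []
    for idx, s in enumerate(sentences):
        score = sum(freq.get(w, 0) for w in re.findall(r"[a-z']+", s.lower()))
        scored.append((score, idx, s))
    best = sorted(scored, reverse=True)[:n_sentences]
    # Re-sort the chosen sentences by their position in the original text.
    return " ".join(s for _, idx, s in sorted(best, key=lambda t: t[1]))

text = ("Neural networks learn representations. "
        "Representations of words help neural networks translate. "
        "Bananas are yellow.")
print(summarize(text, n_sentences=1))
# Representations of words help neural networks translate.
```

The middle sentence wins because it contains the most high-frequency words of the document, which is exactly the scoring rule described above.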

While remaining simple and general, this automatic extractive summarization algorithm can work in complex situations where many implementations fail, for example on texts in foreign languages or with unusual vocabulary associations not contained in standard text corpora.

There are two fundamental approaches to text summarization: extractive and abstractive. The first extracts words and phrases from the original text to create a summary. The second learns an internal linguistic representation in order to generate a human-like summary, paraphrasing the original text. Extractive summarization methods work by subset selection: phrases or sentences are extracted from the article to form the summary. LexRank and TextRank are well-known representatives of this approach; both use variations of Google's PageRank algorithm. LexRank is an unsupervised graph-based algorithm that uses an inverse-document-frequency-weighted cosine as the measure of similarity between two sentences; this similarity is used as the weight of the graph edge between them. LexRank also adds a smart post-processing step that makes sure the top sentences are not too similar to each other. TextRank is similar to the LexRank algorithm but has some improvements, including:

Using lemmatization instead of stemming; applying part-of-speech tagging and named-entity recognition; extracting key phrases and sentences based on these words. Along with a summary of the article, TextRank extracts important key phrases.

Models for abstractive summarization use deep learning, which has enabled breakthroughs in this task. Below are noteworthy results from large companies in the NLP area: Facebook's Neural Attention is a neural network architecture that uses a local attention-based model capable of generating each word of the summary conditioned on the input sentence. Google's Sequence-to-sequence follows the encoder-decoder architecture: the encoder reads the source document and encodes it into an internal representation, and the decoder generates each word of the output summary using that encoded representation. IBM Watson uses a similar sequence-to-sequence model, but with attention and bidirectional recurrent neural network features.
CHAPTER 8 How to develop a chatbot from scratch in Python: a detailed instruction

You may have heard of Duolingo, a popular application for learning foreign languages in which learning takes the form of a game. Duolingo is popular for its innovative learning style; the concept is simple: five to ten minutes of interactive learning per day is enough to learn a language. Although Duolingo makes learning a new language possible, its users faced a problem: they felt they were not developing conversational skills, because they were learning on their own, and they were reluctant to practice in pairs out of embarrassment. This problem did not go unnoticed by the developers. The team solved it by building a chatbot into the application to help users gain conversational skills and put them into practice. Since the bots were designed to be talkative and friendly, Duolingo users can practice communicating at a time convenient for them, choosing an "interlocutor" from a set, until they overcome enough embarrassment to move on to communicating with other users. This solved the users' problem and accelerated learning through the application.

So what is a chatbot? A chatbot is a program that determines the needs of users and then helps to satisfy them (a money transaction, a hotel reservation, preparing documents). Today almost every company has a chatbot for interacting with users. Some ways chatbots are used:

Providing flight information; giving users access to information about their finances; customer support. The possibilities are endless. The history of chatbots dates back to 1966, when Joseph Weizenbaum developed the computer program ELIZA. The program imitates a therapist's speech style and consists of only 200 lines of code. You can still chat with ELIZA online.

How does a chatbot work? There are two types of bots: rule-based and self-learning. A bot of the first type answers questions based on the rules it was built with. The rules can be simple or very complex; such bots can handle simple requests but fail on complex ones. Self-learning bots are created using machine learning methods and are definitely more efficient than rule-based bots. There are two types of self-learning bots: retrieval-based and generative. Retrieval-based bots use heuristics to select an answer from a predefined library of responses. Such chatbots use the text of the message and the context of the dialog to select an answer from a predefined list. The context includes the current position in the dialog tree, all previous messages, and previously saved variables (for example, the username). The heuristics for choosing an answer can be designed in different ways, from conditional "either-or" logic to machine-learned classifiers. Generative bots can compose answers themselves and do not always respond with one of the predefined options. This makes them intelligent, since such bots study every word in the request and generate a response. Creating a bot in Python Let's assume that you know how to use the scikit-learn and NLTK libraries. If you are new to natural language processing (NLP), you can still follow along and then study the relevant literature.
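The rule-based and retrieval-based styles described above can be sketched in a few lines. The keywords, canned replies, and overlap heuristic below are invented for illustration; the retrieval bot we build later in the chapter uses TF-IDF and cosine similarity instead of raw word overlap.

```python
# Rule-based: a fixed pattern maps to a fixed reply.
RULES = {
    "hours": "We are open 9 to 5.",
    "refund": "Refunds take 3-5 business days.",
}

def rule_bot(message):
    for keyword, reply in RULES.items():
        if keyword in message.lower():
            return reply
    return "Sorry, I don't understand."

# Retrieval-based: pick the canned reply with the most word overlap
# with the user's message (a crude stand-in for TF-IDF similarity).
REPLIES = [
    "You can reset your password in settings.",
    "Shipping is free over $50.",
]

def retrieval_bot(message):
    words = set(message.lower().split())
    return max(REPLIES, key=lambda r: len(words & set(r.lower().split())))

print(rule_bot("What are your hours?"))            # We are open 9 to 5.
print(retrieval_bot("how do I reset my password"))  # picks the password reply
```

A generative bot has no such predefined list at all: it produces the reply token by token, which is why it needs a trained language model rather than a lookup table.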

Natural Language Processing (NLP) Natural language processing is a field of research that studies the interaction between human language and computers. NLP is based on a synthesis of computer science, artificial intelligence, and computational linguistics. It is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. A brief introduction to NLTK NLTK (Natural Language Toolkit) is a platform for creating Python programs that work with natural language. NLTK provides easy-to-use interfaces to more than 50 corpora and lexical resources, such as WordNet, as well as a set of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries. The book Natural Language Processing with Python is a practical introduction to programming for language processing.

Download and install NLTK Install NLTK: run pip install nltk. Test the installation: run python, then type import nltk.

Install NLTK packages Import NLTK and run nltk.download(). This opens the NLTK downloader, where you can choose the corpora and models to download. You can also download all packages at once.

Text preprocessing with NLTK The main problem with the data is that it is all in text format. Machine learning algorithms require a numerical feature vector, so before starting an NLP project you need to preprocess the text. Text preprocessing includes: Converting all letters to uppercase or lowercase, so that the algorithm does not treat the same word in different cases as different words. Tokenization: the process of converting ordinary text strings into a list of tokens, i.e. words. A sentence tokenizer produces a list of sentences, and a word tokenizer produces a list of words. The NLTK package includes a pre-trained Punkt tokenizer for English.

Removing noise, that is, everything that is not a digit or letter. Removing stop words: sometimes extremely common words, which appear to be of little value in forming an answer to a user's question, are excluded from the vocabulary entirely. These are called stop words (interjections, articles, some introductory words). Stemming: reducing a word to its stem. For example, if we stem the words "stem", "stemming", "stemmed" and "stemization", the result will be a single word, "stem". Lemmatization: a method slightly different from stemming. The main difference between the two is that stemming often creates non-existent words, while a lemma is a real word. Thus the stem, that is, the form obtained after stemming, cannot always be found in a dictionary, but a lemma can. An example of lemmatization: "run" is the lemma of the words "running" and "ran", and "better" and "good" share the same lemma and so are treated as the same word.
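The difference between stemming and lemmatization can be shown with a toy sketch. The suffix list and the mini lemma table below are invented stand-ins for a real stemmer (such as NLTK's Porter stemmer) and a real dictionary (such as WordNet):

```python
def toy_stem(word):
    # Crude suffix stripping — like real stemmers, it can produce non-words.
    for suffix in ("ization", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny hand-made lemma table standing in for a dictionary such as WordNet.
LEMMAS = {"running": "run", "ran": "run", "better": "good", "stemmed": "stem"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

print(toy_stem("stemming"))     # 'stemm' — a non-word, typical of stemmers
print(toy_lemmatize("better"))  # 'good' — a real dictionary word
```

This is exactly the trade-off described above: the stemmer is fast and dictionary-free but may emit non-words, while the lemmatizer always returns a real word at the cost of needing a dictionary.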

Bag of words After the preprocessing stage, the text needs to be converted into a vector (or array) of numbers. A "bag of words" is a representation of text that describes the presence of words within it. A bag of words consists of:

A vocabulary of known words; the frequencies with which each word occurs in the text. Why is a bag of words used? Because all information about the order or structure of the words in the text is discarded, and the model considers only how often particular words appear, not where they are located. The intuition is that texts are similar in content if they contain similar words, and that we can learn something about a text's content from its words alone. For example, if the vocabulary contains the words {Learning, is, the, not, great} and we want to vectorize the sentence "Learning is great", we get the vector (1, 1, 0, 0, 1).
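The bag-of-words example above can be reproduced directly (a binary variant that records presence rather than counts):

```python
def bow_vector(sentence, vocabulary):
    """Binary bag of words: 1 if the vocabulary word occurs in the sentence."""
    words = set(sentence.lower().split())
    return [1 if w.lower() in words else 0 for w in vocabulary]

vocab = ["Learning", "is", "the", "not", "great"]
print(bow_vector("Learning is great", vocab))   # [1, 1, 0, 0, 1]
```

Note that "Learning is great" and "great is Learning" produce the same vector: the word order is discarded, which is exactly the point (and the limitation) of the bag-of-words model.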

The TF-IDF method The problem with a bag of words is that highly frequent words carrying little information can dominate the text. A bag of words also gives greater weight to long texts than to short ones. One approach to these problems is to weight the frequency of a word not within one text but across all texts at once, so that the contribution of articles such as "a" and "the" is leveled out. This approach is called TF-IDF (Term Frequency - Inverse Document Frequency) and consists of two parts:

TF: the frequency of a word within one text. TF = (number of times the word "t" appears in the text) / (number of words in the text). IDF: a measure of how rare the word is across all texts. IDF = log(N / n), where N is the total number of texts and n is the number of texts containing "t" (implementations often add smoothing, for example 1 + log(N / n)). TF-IDF is a weight frequently used in information retrieval and text mining: a statistical measure of how important a word is to a text within a collection of texts.

Example Consider a text of 100 words in which the word "phone" appears 5 times. The TF for "phone" is (5/100) = 0.05. Now suppose we have 10 million documents and the word "phone" appears in a thousand of them. The IDF is log(10,000,000/1,000) = 4. Thus the TF-IDF weight is 0.05 * 4 = 0.20.
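The arithmetic of this example can be checked in Python, assuming the plain base-10 logarithm variant (without the +1 smoothing some libraries apply):

```python
import math

# "phone" appears 5 times in a 100-word document...
tf = 5 / 100
# ...and in 1,000 of 10 million documents in the collection.
idf = math.log10(10_000_000 / 1_000)
tfidf = tf * idf
print(tf, idf, tfidf)   # 0.05 4.0 0.2
```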

TF-IDF can be computed in scikit-learn like this: from sklearn.feature_extraction.text import TfidfVectorizer. Cosine similarity (Ochiai coefficient) TF-IDF is a transformation applied to texts to obtain real-valued vectors in a vector space. We can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing it by the product of their norms, which yields the cosine of the angle between the vectors. Cosine similarity is a measure of similarity between two nonzero vectors. Using this formula, we can calculate the similarity between any two texts d1 and d2.

Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||), where d1 and d2 are two nonzero vectors.
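The formula can be implemented directly:

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two nonzero vectors, as in the formula above."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

print(cosine_similarity([1, 0, 1], [1, 0, 1]))   # ~1.0 — same direction
print(cosine_similarity([1, 0, 0], [0, 1, 0]))   # 0.0 — orthogonal
```

In the chatbot below, the same computation is done by scikit-learn's cosine_similarity on the rows of the TF-IDF matrix.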

Chatbot training In our example, we will use a Wikipedia page as the text. Copy the contents of the page and place it in a text file called "chatbot.txt". You can use any other text instead.

Import the required libraries:

import nltk
import numpy as np
import random
import string  # to process standard python strings

Reading the data Let's read the chatbot.txt file and convert all the text into a list of sentences and a list of words for further preprocessing:

f = open('chatbot.txt', 'r', errors='ignore')
raw = f.read()
raw = raw.lower()  # convert to lowercase
nltk.download('punkt')    # first-time use only
nltk.download('wordnet')  # first-time use only
sent_tokens = nltk.sent_tokenize(raw)  # convert to a list of sentences
word_tokens = nltk.word_tokenize(raw)  # convert to a list of words

Let's look at examples of sent_tokens and word_tokens:

sent_tokens[:2]
['a chatbot (also known as a talkbot, chatterbot, bot, im bot, interactive agent, or artificial conversational entity) is a computer program or an artificial intelligence which conducts a conversation via auditory or textual methods.', 'such programs are often designed to convincingly simulate how a human would behave as a conversational partner, thereby passing the turing test.']
word_tokens[:5]
['a', 'chatbot', '(', 'also', 'known']

Preprocessing the source text Now we define a function LemTokens, which takes tokens as input and returns normalized tokens:

lemmer = nltk.stem.WordNetLemmatizer()
# WordNet is a semantically-oriented dictionary of English included in NLTK.

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

Keyword Matching

Define a greeting response for the bot. If the user greets the bot, the bot will say hello in return. ELIZA uses simple keyword matching for greetings; we will use the same idea.

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey",)
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

Response generation

To generate the bot's answer to an incoming question, we will use the concept of text similarity. Therefore, we start by importing the necessary modules.

Import the TfidfVectorizer from scikit-learn to convert the raw text collection into a TF-IDF feature matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

Also import the cosine similarity module from scikit-learn:

from sklearn.metrics.pairwise import cosine_similarity

This will be used to find the sentence in the corpus most similar to the user's request. It is the easiest way to create a chatbot.

Define a response function that returns one of several possible answers. If the query does not match any sentence in the corpus, the bot answers "I am sorry! I don't understand you."

def response(user_response):
    robo_response = ''
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        robo_response = robo_response + "I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response + sent_tokens[idx]
        return robo_response

Note that the main loop below appends the user's input to sent_tokens before calling response, so tfidf[-1] is the vector of the user's query itself; argsort()[0][-2] therefore picks the second most similar sentence, since the most similar one is always the query.

Finally, we specify the bot's opening and closing replies, depending on the user's input:

flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")
while flag == True:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("ROBO: You are welcome..")
        else:
            if greeting(user_response) != None:
                print("ROBO: " + greeting(user_response))
            else:
                sent_tokens.append(user_response)
                word_tokens = word_tokens + nltk.word_tokenize(user_response)
                final_words = list(set(word_tokens))
                print("ROBO: ", end="")
                print(response(user_response))
                sent_tokens.remove(user_response)
    else:
        flag = False
        print("ROBO: Bye! take care..")

That's all. We have written the code for our first bot in NLTK.
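The argsort indexing used in the response function is worth pausing on. A small numpy sketch (with invented similarity values) shows why the second-to-last index of the sorted order gives the best corpus match:

```python
import numpy as np

# Hypothetical similarity row as cosine_similarity would return it:
# the last entry (1.0) is the query compared with itself, so the best
# *corpus* match is the runner-up value.
vals = np.array([[0.1, 0.5, 0.3, 1.0]])

# argsort gives indices in ascending order of similarity: [0, 2, 1, 3].
# [-1] would be the query itself; [-2] is the most similar real sentence.
idx = vals.argsort()[0][-2]
print(idx)  # -> 1, the position of the 0.5 entry
```

If the runner-up similarity is 0 (no shared vocabulary at all), the function falls back to the "I don't understand" reply.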
It turned out not so bad. Even if the chatbot could not give a satisfactory answer to some questions, it coped well with others. Although our primitive bot hardly possesses cognitive skills, it was a good way to get acquainted with NLP and learn about chatbots. ROBO at least responds to user requests. It will not fool your friends, of course, and for a commercial system you will want to consider one of the existing bot platforms or frameworks, but this example will help you think through the architecture of a bot. SUMMARY AND CONCLUSION

Data Analytics, Data Science and Machine Learning

Data analysis

In today's world, all businesses generate large amounts of data from a variety of sources. Whether it comes from enterprise systems, social media and other online sources, smartphones and other customer computing devices, or the sensors and instruments that make up the Internet of Things, this data is extremely valuable to organizations that have the tools to capitalize on it. The general term for these tools is data analytics. Data analytics refers to the use of numerous techniques that find meaningful patterns in data. It is a process in which data is turned into insight and foresight. Data analysis tools allow us to explain what happened in the past, draw insights about the present, and make predictions about the future. The field of data analysis is not new; it has been used in business for decades. Data analysis can be as simple as using statistics to determine the average age or summarize other demographic characteristics of customers. A linear regression in an Excel spreadsheet can shed light on sales trends. However, old as it is, the field of data analytics never stands still. Businesses continually evolve as they apply more advanced techniques, such as real-time analysis of data as it flows into business intelligence applications. The desire to deepen understanding of the past, present, and future drives continued progress in the field of data analysis. These advances are needed because, in the business world, there are few "laws of nature" that tell us precisely what will happen. To achieve this higher level of understanding, businesses need to capture and analyze data using advanced techniques, which brings us to data science.
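The simple sales-trend analysis mentioned above can be done in a few lines of plain Python by fitting a least-squares trend line. The monthly sales figures here are invented purely for illustration:

```python
# Fit a least-squares trend line sales = a + b * month.
# The sales figures below are made-up illustration data.
months = [1, 2, 3, 4, 5, 6]
sales = [100.0, 110.0, 125.0, 130.0, 150.0, 160.0]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# slope b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) \
    / sum((x - mean_x) ** 2 for x in months)
a = mean_y - b * mean_x  # intercept

print("trend: sales = %.1f + %.1f * month" % (a, b))
```

A positive slope b quantifies the upward sales trend; this is exactly the calculation a spreadsheet's linear regression performs behind the scenes.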

Data Science

Data science is the pioneering edge of data analytics. It is the process of creating, testing, and evaluating new data analytics techniques and finding new ways to apply them. As its name implies, data science in essence follows soundly based approaches to scientific research. Data scientists try new algorithms to provide insight and understanding, measure the usefulness of these approaches, and demonstrate the accuracy of the results. If an approach proves generally useful, it becomes more widely known and joins the growing set of data analysis tools. Data analytics can be used in every business activity, and as companies embrace digital transformation, the need for analytics will only increase. Businesses should continuously advance their analytical capabilities as they pursue this transformation. One way to do this is to hire data scientists. Strong enterprise data cultures include data scientists who constantly work to extend what is possible while enabling the larger business staff to use mature, proven analytical tools.

Data Engineering

Although it does not grab the headlines, data engineering is an indispensable element of data analysis and data science. Simply put, data engineering makes data useful. It converts structured, unstructured, and semi-structured data from various systems and silos into consistent, usable data collections from which applications and algorithms can extract insight and value. Data engineering includes the task of cleaning up data sets, often a large amount of work when dealing with data from many different sources and/or with missing values, errors, and even biases. For example, if you run analytics on recent home sales, you may want to correct or remove any home records with a sales price of zero. Erroneous price data skews even simple analytics such as the average house price, so the data engineer should remove such records from the dataset or, where possible, correct them. In more complex data analysis, such data errors can have hidden influence that is not easily visible in the results, yet causes serious problems when the results are used. Although you have heard the term more often in recent years, there is nothing new about data engineering. Today, however, the need for it is growing, as organizations try to consolidate, improve, reformat, and clean data from an ever-growing set of sources. This is often required for advanced data analytics applications, including machine learning and deep learning. Data engineers must delete bad data, fill gaps, and ensure that the data does not bias the results.
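The home-sales example above can be sketched in a few lines of plain Python. The records and field names here are invented for illustration; real pipelines would do the same filtering over database tables or dataframes:

```python
# Hypothetical home-sales records; a zero price marks a data-entry error.
sales = [
    {"id": 1, "price": 250000},
    {"id": 2, "price": 0},        # erroneous record
    {"id": 3, "price": 310000},
]

# Data engineering step: drop records with non-positive prices.
clean = [rec for rec in sales if rec["price"] > 0]

avg_price = sum(rec["price"] for rec in clean) / len(clean)
print(avg_price)  # 280000.0 rather than a misleading 186666.67
```

Without the cleaning step, the single bad record drags the average down by roughly a third, exactly the kind of distortion the text warns about.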

Artificial Intelligence

Artificial intelligence refers to computer systems capable of making the kinds of classifications and decisions that normally require human intelligence. Common use cases for AI are image recognition and classification, speech recognition, and language translation. Even though people talk about AI as if it were new, it has actually been with us since the 1950s. Since the advent of computers, people have imagined that machines could be programmed to reason the way people do. Over the years, there have been different approaches to making computers as good as, or better than, humans at particular tasks. A few decades ago, one notable achievement was expert systems. These systems follow pre-programmed sets of rules created by humans to perform tasks independently of people. For example, we have all experienced expert systems in the form of automated answering systems, such as the button and menu options we navigate when we call a customer service desk. (Many of these are now being redesigned around natural language processing based on deep learning, becoming more flexible, more effective, and less frustrating as the technology improves.) More recently, an approach named machine learning has become the preferred method of realizing AI, and a subset of machine learning known as deep learning has proved highly effective when there is enough data to train models for specific problem types and workloads. At the broadest level, then, AI is the goal; machine learning and deep learning are the two approaches that make today's effective AI applications possible.

Machine Learning and Deep Learning

Machine learning is a subdomain of AI comprising systems that can learn from data and improve over time without explicit programming. Machine learning algorithms use data to create and refine rules; the computer then decides how to respond based on what it has learned from the data. The key is allowing the data to guide the development of the rules. Machine learning techniques can use a variety of data types, including unstructured and semi-structured data, to generate understanding that leads to system-generated actions and decisions. Consider a simple example of classic machine learning. You can give a system a set of features that are common to cats in photographs of various animal species. You can then let the system sort through databases full of animal photographs and find out which combinations of those human-supplied features identify all the cats in the mix. In this process, the machine learning system becomes better and better as it learns from its experience with the data. Deep learning is machine learning based on a deep hierarchy of interconnected "neural network" layers, capable of learning key features from the data provided to the system. A deep learning technique takes a large amount of data and determines the common rules and characteristics associated with it. As with classical machine learning, the data guides the training of the deep learning model. Let's extend our cat example: if you give a deep learning system a sufficient number of cat images, the system, on its own, identifies the features that make a cat a cat, such as eyes, ears, whiskers, and tail. This ability goes beyond classical machine learning, because you don't have to tell the system which features to look for; it works that out on its own.
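The idea of "letting the data guide the rules" can be illustrated with a deliberately tiny sketch: instead of hand-coding a threshold, we derive it from labeled examples. The feature name, values, and labels below are entirely invented for illustration, and this midpoint-of-means rule is a toy stand-in for real learning algorithms:

```python
# Toy "learning from data": derive a classification threshold from
# labeled examples instead of hard-coding it. All numbers are invented.
# feature = hypothetical "ear pointiness" score, label = is_cat
examples = [
    (0.9, True), (0.8, True), (0.7, True),
    (0.2, False), (0.3, False), (0.1, False),
]

# Learn the threshold as the midpoint between the two class means.
cat_mean = sum(x for x, y in examples if y) / sum(1 for _, y in examples if y)
other_mean = sum(x for x, y in examples if not y) / sum(1 for _, y in examples if not y)
threshold = (cat_mean + other_mean) / 2  # about 0.5 for this data

def predict(x):
    """Classify a new example using the rule learned from the data."""
    return x > threshold

print(predict(0.85), predict(0.15))  # True False
```

Change the training examples and the rule changes with them; that dependence of the rule on the data, rather than on the programmer, is the essence of machine learning.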

Why is this important? We all benefit from AI in almost every part of our lives. Did you use Google today to search the Internet? You took advantage of AI. Have you used a credit card recently? You took advantage of AI programs that authenticate user identities and potentially stop fraudulent transactions. Have you come across online stores that offer personalized recommendations for the products you're looking at? That is AI too. AI changes the basic rules of decision-making in organizations. Machine learning and deep learning techniques, for example, enable the processing of data that managers obtain from many sources, such as social media sites, customer information systems, and e-commerce sites. Decision makers can then adapt their product development, sales, and marketing strategies accordingly. It is important to emphasize that AI is no longer a niche practice. Businesses employ AI to build stronger relationships with a wider range of customers, to make smarter business decisions, to improve the efficiency of operations, and to bring better products and services to market. AI is used everywhere, from health and financial services to manufacturing and national defense. If you have a huge amount of data, AI can help you find and understand the patterns within it.