From Imitation to Prediction, Data Compression vs Recurrent Neural Networks for Natural Language Processing

Juan Andrés Laura, Gabriel Omar Masi and Luis Argerich
Departamento de Computación, Facultad de Ingeniería, Universidad de Buenos Aires
Email: [email protected], [email protected], largerich@fi.uba.ar

ASAI, Simposio Argentino de Inteligencia Artificial (46JAIIO - ASAI - ISSN: 2451-7585)

Abstract—In recent studies [1] [2] [3] Recurrent Neural Networks were used for generative processes and their surprising performance can be explained by their ability to create good predictions. In addition, data compression is also based on prediction. The problem thus comes down to whether a data compressor could perform as well as recurrent neural networks in the natural language processing tasks of sentiment analysis and automatic text generation. If this is possible, then the question becomes whether a compression algorithm is even more intelligent than a neural network in such tasks. In our journey we discovered what we think is the fundamental difference between a Data Compression Algorithm and a Recurrent Neural Network.

I. INTRODUCTION

One of the most interesting goals of Artificial Intelligence is the simulation of different human creative processes such as speech recognition, sentiment analysis, image recognition and automatic text generation. In order to achieve such goals, a program should be able to create a model that reflects how humans think about these problems.

Researchers believe that Recurrent Neural Networks (RNNs) are capable of understanding the way some tasks are done, such as music composition or the writing of texts. Moreover, RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next [1] [2].

Compression algorithms are also capable of understanding and representing different sequences, which is how the compression of a string is achieved. However, a compression algorithm might be used not only to compress a string but also to perform non-conventional tasks in the same way as neural networks (e.g. a compression algorithm could be used for clustering [4], sequence generation or music composition).

Both neural networks and data compressors should be able to learn from the input data to do the tasks for which they are designed. In this way, one could argue that a data compressor can be used to generate sequences, or that a neural network can be used to compress data. In consequence, using the best data compressor to generate sequences should produce better results than the ones obtained by a neural network; if this is not true, then the neural network should compress better than the state of the art in data compression.

The hypothesis for this research is that, if compression is based on learning from the input data set, then the best compressor for a given data set should be able to compete with other algorithms in natural language processing tasks. In the present work, this hypothesis will be analyzed for two given scenarios: sentiment analysis and automatic text generation.

II. DATA COMPRESSION AS AN ARTIFICIAL INTELLIGENCE FIELD

For many authors there is a very strong relationship between Data Compression and Artificial Intelligence [5] [6]. Data Compression is about making good predictions [7], which is also the goal of Machine Learning, a field of Artificial Intelligence.

Essentially, data compression involves two important steps: modeling and coding. Coding is a solved problem using arithmetic coding. The difficult task is modeling, because it comes down to building a description of the data using the most compact representation; this is again directly related to Artificial Intelligence. Using the Minimum Description Length principle [8], the efficiency of a Machine Learning algorithm can be measured in terms of how well it compresses the training data plus the size of the model itself.

A file containing the digits of π can be compressed with a very short program able to generate those digits: gigabytes of information can be compressed into a few thousand bytes. However, the problem arises when trying to find a program capable of understanding that our input file contains the digits of π. In consequence, achieving the best compression rate involves finding a program able to always find the most compact model to represent the data, and that is clearly an indication of intelligence, perhaps even of General Artificial Intelligence.

III. RNNS FOR DATA COMPRESSION

Recurrent Neural Networks, and in particular LSTMs, were used not only for predictive tasks [9] but also for Data Compression [10]. While LSTMs were brilliant in their text [2], music [3] and image generation [11] tasks, they were never able to defeat the state-of-the-art algorithms in Data Compression [10].

This might indicate that there is a fundamental difference between Data Compression and Generative Processes, and between Data Compression Algorithms and Recurrent Neural Networks. After the experiments, a fundamental difference will be shown in this research in order to explain why an RNN can be the state of the art in a generative process but not in Data Compression.

IV. SENTIMENT ANALYSIS

A. A Qualitative Approach

The sentiment of people can be determined according to what they write in many social networks such as Facebook, Twitter, etc. It looks like an easy task for humans; however, it can be far from easy for a computer to automatically determine the sentiment behind a piece of writing.

The task of guessing the sentiment of text using a computer is known as sentiment analysis, and one of the most popular approaches for this task is to use neural networks. In fact, Stanford University created a powerful neural network for sentiment analysis [12] which is used to predict the sentiment of movie reviews, taking into account not only the words in isolation but also the order in which they appear. In the first experiment, the Stanford neural network and a PAQ compressor [13] will be used for sentiment analysis of movie reviews in order to determine whether a user likes a given movie or not (i.e. each movie review will be classified as positive or negative). After that, the results obtained will be compared using the percentage of correctly classified movie reviews. Both algorithms will use a public data set of movie reviews [14].

In order to understand how sentiment analysis can be done with a data compressor, it is important to understand how Data Compression can be used to compute the distance between two strings via the Normalized Compression Distance [15]:

NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}

where C(x) is the size of applying the best possible compressor to x and C(xy) is the size of applying the best possible compressor to the concatenation of x and y.

The NCD is an approximation to the Kolmogorov distance between two strings, using a compression algorithm to approximate the complexity of a string, because the Kolmogorov Complexity is uncomputable. The principle behind the NCD is quite simple: when string y is concatenated after string x, y should be highly compressed whenever y is very similar to x, since the compressor can reuse the information already seen in x.

As introduced, a data compressor performs well when it is capable of understanding the data set that will be compressed. This understanding often grows as the data set becomes bigger, and in consequence the compression rate improves. However, this does not hold when future data (i.e. data that has not been compressed yet) has no relation to the already compressed data, because the more similar the information is, the better the compression rate achieved.

Let C(X1, X2, ..., Xn) be a compression algorithm that compresses a set of n files denoted by X1, X2, ..., Xn. Let P1, P2, ..., Pn and N1, N2, ..., Nm be a set of n positive reviews and m negative reviews respectively. Then, a review R can be predicted positive or negative using the following inequality:

C(P1, ..., Pn, R) − C(P1, ..., Pn) < C(N1, ..., Nm, R) − C(N1, ..., Nm)

The formula is a direct derivation from the NCD. When the inequality does not hold, the review is predicted negative. The order in which files are compressed must be considered: as can be seen from the proposed formula, the review R is always compressed last.

It is worth explaining why this inequality predicts whether a review is positive or negative. Suppose that the review R is a positive one and is compressed in order to classify it. If R is compressed after a set of positive reviews, then the compression rate should be better than the one obtained if R is compressed after a set of negative reviews, because R shares more related information with the set of positive reviews and in consequence should compress better. Interestingly, the positive and negative sets can have different sizes, and that is why it is important to subtract the compressed size of each set in the inequality.

B. Data Set Preparation

The Large Movie Review Dataset [14], which has been used for Sentiment Analysis competitions, is used in this research.

Table I
MOVIE REVIEW DATASET

           Positive   Negative
Total      12491      12499
Training    9999       9999
Test        2492       2500

C. PAQ for Sentiment Analysis

Using Data Compression for Sentiment Analysis is not a new idea. It has already been proposed at the IEEE 12th International Conference [16].
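The NCD and the compression-based classification rule described above can be sketched with a general-purpose compressor. The following is a minimal illustration, assuming Python's zlib as a stand-in for PAQ (zlib's model is far weaker and its 32 KB window limits how much of each class set it can exploit, so this sketches the method only, not the paper's results); the names `csize`, `ncd` and `classify` are illustrative and not from the paper.

```python
import zlib

def csize(*texts):
    # C(X1, ..., Xn): compressed size, in bytes, of the concatenated files.
    # zlib stands in here for the far stronger PAQ compressor used in the paper.
    return len(zlib.compress(" ".join(texts).encode("utf-8"), 9))

def ncd(x, y):
    # Normalized Compression Distance:
    # NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}
    cx, cy, cxy = csize(x), csize(y), csize(x, y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(review, positives, negatives):
    # The review R is compressed last, after each class set; predict the
    # class whose set makes R cheaper to append (fewer extra bytes).
    pos_cost = csize(*positives, review) - csize(*positives)
    neg_cost = csize(*negatives, review) - csize(*negatives)
    return "positive" if pos_cost < neg_cost else "negative"
```

Subtracting the compressed size of each class set, rather than comparing `csize(*positives, review)` and `csize(*negatives, review)` directly, is what makes the rule insensitive to the two sets having different sizes, exactly as the inequality requires.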
