Using News Articles to Predict Stock Price Returns

Ricardo Herrmann ([email protected])
Rodrigo Togneri ([email protected])

Luciano Tozato ([email protected])
Wei Lin ([email protected])

Knowledge Sharing Article © 2017 Dell Inc. or its subsidiaries.

Table of Contents

Overview
Introduction
Related Work
Deep CNN model for text classification
Empirical evaluation
Data sources
Stock prices
News articles
Pre-trained word embeddings
Results
Conclusion
References

Disclaimer: The views, processes or methodologies published in this article are those of the authors. They do not necessarily reflect Dell EMC’s views, processes or methodologies.

Overview

The stock market can be seen as a dynamic system where stock prices are affected by the behavior of traders trying to buy or sell stocks based on the most recent information they can get. There is a bi-directional relationship between stock prices and news articles, with a variable lag between the effects on each, as news influences traders' behavior and vice-versa. The collective sentiment may push stock prices up (the so-called "Bull Effect"), even creating economic bubbles, or down (the "Bear Effect").

Our proposal measures how many stock price movements we can predict with a simple model that combines structured (stock price time series) and unstructured (news articles) data, based solely on automatically extracted text features, to infer the effect of news on the prices of the stocks they refer to.

To deal with unstructured textual sources of data, we rely on recent advances in Deep Learning applied to Natural Language Processing (NLP), with a particular focus on algorithms that use the Vector Space Model (VSM) of semantics, so that we can work with a uniform representation of words, phrases and documents as numeric vectors. In our implementation, we take a declarative approach to describing numeric computations that can be executed efficiently on GPUs, using the Keras library with the TensorFlow or Theano engines.


Introduction

In a stock market, stocks, which are contracts representing fractional ownership of a company, are traded between buyers and sellers. Stock prices fluctuate according to the dynamics of supply and demand, with prices approaching an unknown, non-stationary equilibrium.

Although the Efficient Market Hypothesis (EMH) [1] establishes that stocks are always traded at their fair values, in practice price fluctuations are not completely random. To get better returns out of their financial assets, traders can exploit the field of Technical Analysis (TA), which comprises techniques for identifying patterns. In a simplified view, technical analysis uses past information about the market to forecast the direction of the market in the short-term future. Quantitative traders rely on this mostly numeric past information, deriving metrics that use, among other inputs, stock prices and trading volume.

On the other hand, qualitative trading is subjective, relying on human judgment and taking into account a multitude of information sources. Qualitative traders, or fundamentalists, rely heavily on information from financial news. Traditionally, the sources of information handled by computers are mostly structured, so computers play a much bigger role in quantitative trading. However, recent advances in Natural Language Processing (NLP) make it easier to process news sources and extract metrics from unstructured data, enabling computers to also empower traders in their qualitative analyses.

Sentence classification is a well-studied problem in NLP: given a set of pairs of source texts and the classes to which they belong, the task is to correctly classify previously unseen sentences or documents into a fixed set of classes. In Machine Learning [2] nomenclature, it is an instance of a supervised learning problem.

With that in mind, our hypothesis is that textual information can be automatically extracted from financial news using NLP techniques and used to predict whether particular news articles will push stock prices up or down, framed as a supervised learning problem.

In our experiment, we first captured a dataset of financial news, along with their publishing dates and related stock symbols. We then fetched historical price series for the corresponding stocks. Combining these sources, we created a dataset for training a learning algorithm to infer the movement of prices on the following day based on the text of news from the preceding day, treating it as a sentence classification problem with three classes related to the price movement: negative, neutral and positive. Our model relies on modern advances in NLP, namely word vectors, convolutional features, Rectified Linear Units (ReLUs) and a deep classification neural network.

Related Work

It has already been shown [3] that, within the same time period, the number of mentions of a company in the news and the transaction volume of that company's stock are correlated, as are the absolute return and the interest in a company in the news. The same study also shows there is no statistically significant correlation when the direction of the price movement is taken into account. This evidence, however, considers only the number of mentions and ignores the context in which companies are mentioned.

Regarding recent work on learning vector representations of words using neural networks, we can cite the work of Bengio et al. [4] (which establishes a Vector Space Model (VSM) for learning to predict the next word), Collobert & Weston [5, 6] (which uses multitask learning for language processing predictions using Convolutional Neural Networks (CNNs)), Mnih & Hinton [7] (a hierarchical distributed language model), Turian et al. [8] (which uses semi-supervised learning for combining different word representations), Mikolov et al. [9] (the word2vec model) and Pennington et al. [10] (GloVe word vectors).

Extending the same line of work to representations of sentences, we can cite Yessenalina & Cardie [11] (which models each word as a matrix and combines words using iterated matrix multiplication), Grefenstette et al. [12] (which establishes tensor-based compositional distributional semantics of words and functions), Le & Mikolov [13] (the doc2vec model) and Kim [14] (which uses CNNs for sentence classification).

Other related applications of NLP to the stock market include the work of Fehrer & Feuerriegel [15], which aims to predict the direction of stock movements following financial disclosures and shows how deep learning (more precisely, a recursive autoencoder [16]) can outperform the accuracy of random forests by 5.66% on a dataset of 8,359 headlines. Since financial news articles are not produced in large quantities, Quid [17] uses a CNN architecture similar to ours to classify small datasets of text containing company descriptions into "good" or "bad" quality classes. Taking a different, but also text-based, approach to the problem, Ding et al. [18] use a deep neural network and information extraction methods to obtain action-actor-object-timestamp tuples using an existing dependency parser [19].


Deep CNN model for text classification

In machine learning, the input information is described to the computer as a set of features. There are many approaches to text classification, but regarding their use of features they can be divided into two main categories (as in other areas of machine learning): feature engineering and representation learning. The former relies on feature extractors built by subject matter experts, which are usually highly specific to the task at hand. The latter incorporates into the learning model some generic form of feature extractor and treats its parameter values as part of the numeric optimization problem called learning, so that the algorithm "learns" how to build good features. Both strategies aim at deriving semantic features, relevant to the task, from the text's syntactic features.

In word-based modeling, sentences are first tokenized and then represented by a list or set of the words they contain. Each distinct word is assigned an index so that it can be represented more compactly, since in this case the characters comprising the words are not relevant. Classic NLP algorithms take as input words represented as one-hot encoded vectors, where each word is represented by a vector comprised entirely of zeros, with the exception of a value of one at the word's index.

The main problem with one-hot encoding is that all encoded words are orthogonal and equidistant, so it cannot represent the rich information about relationships between words. Another problem is that it makes the input data very sparse. The Vector Space Model (VSM) [20] is a solution to these problems. In this model, words are represented by dense numeric vectors of arbitrary dimensionality, but with far fewer dimensions than the one-per-vocabulary-word that one-hot encoding requires. Word vectors represented in the VSM can capture part of the semantics of their corresponding words: the Distributional Hypothesis (DH) [21] states that words that occur in similar contexts tend to have similar meanings, and the vectors can be computed using metrics that encourage context similarity.
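To make the contrast concrete, the following toy sketch (with made-up 3-dimensional vectors, not real GloVe values) shows how one-hot vectors treat every pair of distinct words as equally unrelated, while dense VSM vectors can place related words close together:

```python
# Toy comparison of one-hot and dense vector-space word representations.
import numpy as np

vocab = ["stock", "share", "banana"]
index = {w: i for i, w in enumerate(vocab)}

# One-hot: every pair of distinct words is orthogonal and equally distant.
one_hot = np.eye(len(vocab))
print(one_hot[index["stock"]] @ one_hot[index["share"]])   # 0.0
print(one_hot[index["stock"]] @ one_hot[index["banana"]])  # 0.0

# Dense VSM vectors (illustrative values): related words can be close.
dense = {
    "stock":  np.array([0.8, 0.1, 0.3]),
    "share":  np.array([0.7, 0.2, 0.4]),
    "banana": np.array([-0.1, 0.9, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(dense["stock"], dense["share"]))   # high similarity
print(cosine(dense["stock"], dense["banana"]))  # low similarity
```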

Once we have a good representation for words, we need to represent sentences. One traditional model is the bag-of-words model [21], where sentences are represented as unordered counts of the words they contain. Its major weakness is that, by losing the ordering of the words, many sentences lose their meaning. On the other extreme, we have syntactic parsers, which extract the tree structure and the role every word plays in the sentence. Syntactic tree representations of sentences are very good for information extraction tasks, but hard to obtain reliably with the current state of the art.

An intermediate approach to representing sentence structure, neither completely local nor completely global, and now made more popular by recent developments in Deep Learning [22], is the one used by CNNs [23]. In CNNs, convolutional features, which combine the information of local regions of the input, are trained during the learning process. By stacking multiple layers of convolutions, the reach of the local information increases, up to the point where information from both ends of a sentence can be combined, yielding a model that shares characteristics of both local and global approaches.

To represent the semantics of sentences, we can use a compositional approach based on word representations that capture semantics, such as those obtained by VSMs. According to Kim [14], pre-trained word vectors are "universal" feature extractors that can be utilized for various classification tasks, although learning task-specific vectors through fine-tuning results in further improvements. However, naïvely computing a sentence's vector as the average of its word vectors makes the semantic information captured by the vectors vanish quickly, as can be seen in Figure 1. The longer or more varied the sentence's contents, the closer the average vector gets to the vector of stop words or punctuation tokens, so we need an architecture that more effectively combines word representations to extract the meaning of sentences.

(a) Weak clustering of pure word vectors from GloVe. (b) Example sentence vectors obtained by taking the average word vector.

Figure 1: Comparison of the clustering characteristics of pre-trained GloVe word vectors and the average vector of a few sentences. We can easily see that, unlike word vectors, average sentence vectors constructed from the same word vectors lose their distinctive characteristics as they get closer to a common vector, with few distinguishing dimensions.
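As a minimal sketch of the naïve averaging baseline discussed above (assuming a dictionary `glove` mapping words to NumPy vectors, e.g. loaded from the public GloVe text files), the sentence vector is simply the mean of the available word vectors:

```python
# Naïve sentence vector: average of the word vectors found in `glove`.
import numpy as np

def average_vector(tokens, glove, dim=50):
    vectors = [glove[t] for t in tokens if t in glove]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Long, varied sentences tend to average out toward a similar "neutral"
# vector, losing the distinctive directions of their content words.
```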


The architecture we used for our model is a deep CNN with input word embeddings and a final dense classification network with softmax output, like the one shown in Figure 2. We used ReLU [24] as the non-linear activation function at each convolutional layer and at the penultimate densely-connected hidden layer. To combine the information extracted by the convolutional kernels, we used max-pooling after each convolution, followed by another convolutional layer with more kernels.

The two techniques we employed to avoid overfitting were Dropout and Batch Normalization, both briefly explained below.

Dropout is a popular regularization technique for CNNs. A dropout layer randomly zeroes out each neuron's activation with a certain probability, so that, during training, the network cannot rely on particular elements always being present. This prevents co-adaptation, forcing neurons to individually learn good features.

The other technique we used in our model is Batch Normalization [25] (also simply called batch norm), which, at each batch, applies a transformation that normalizes the activations toward zero mean and a standard deviation close to one. This acts as a regularizer and also lets us use higher learning rates.

Figure 2: Simplified architecture of the deep learning model we used to classify financial news.


Some of our model's hyper-parameter choices are architectural decisions based on the average size of the text data we are dealing with, such as the convolutional kernel's filter region size (3), the number of stacked convolutional layers (3), the number of convolutional features at each layer (16, 32 and 64) and the max-pooling size (2); there are also dropout rates to tune in order to avoid overfitting. We used 70% dropout at the embedding, convolutional (after batch normalization) and hidden layers. Guidelines on how to choose these kinds of architectural parameters are discussed by Zhang & Wallace [26].

We implemented our model in Python, using the Keras library with the TensorFlow backend so that we could benefit from automatic differentiation and Graphics Processing Unit (GPU) execution.
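The following is a minimal Keras sketch of the architecture described above. The exact layer ordering and the size of the penultimate dense layer (128) are our assumptions reconstructed from the text, not the original implementation:

```python
# Sketch of the deep CNN text classifier: embeddings -> 3 conv blocks -> dense.
from keras.models import Sequential
from keras.layers import (Embedding, Conv1D, MaxPooling1D, BatchNormalization,
                          Activation, Dropout, Flatten, Dense)

VOCAB_SIZE = 5001   # top 5,000 tokens + one shared index for rare/unknown words
EMBED_DIM = 50      # GloVe 50-d vectors
MAX_LEN = 70        # longest sentence in the training set

model = Sequential()
# weights=[embedding_matrix] would load the GloVe vectors built later;
# the layer starts frozen and is fine-tuned in a second training phase.
model.add(Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN,
                    trainable=False))
model.add(Dropout(0.7))

for filters in (16, 32, 64):            # three stacked convolutional blocks
    model.add(Conv1D(filters, 3, padding='same'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(0.7))
    model.add(MaxPooling1D(pool_size=2))

model.add(Flatten())
model.add(Dense(128, activation='relu'))   # penultimate layer size is assumed
model.add(Dropout(0.7))
model.add(Dense(3, activation='softmax'))  # negative / neutral / positive
```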

Empirical evaluation

Our overall approach was to create one script that fetches news articles related to selected stock tickers, another script that fetches their corresponding historical prices, and a main script that calculates each stock's return on the following day, matches the articles to that future return and creates a dataset in which each article's initial text is labeled according to whether the return is positive, negative or neutral. The return is considered neutral if it lies within a ±0.15% margin, chosen so that the labels in the dataset are balanced. A sketch of this labeling step is shown below.
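A minimal pandas sketch of the labeling step, assuming a `prices` DataFrame of daily closing prices (indexed by date, one column per ticker) and an `articles` DataFrame with `date`, `ticker` and `text` columns; these names are illustrative, not the original scripts:

```python
# Label each article by the sign of the next-day return of its ticker.
import pandas as pd

MARGIN = 0.0015  # ±0.15% band treated as "neutral"

# Return from each article's day to the following trading day.
next_day_return = prices.pct_change().shift(-1)

def label(r):
    """Map a next-day return to one of the three classes."""
    if pd.isna(r):
        return None
    if r > MARGIN:
        return 'positive'
    if r < -MARGIN:
        return 'negative'
    return 'neutral'

articles['label'] = [
    label(next_day_return.at[row.date, row.ticker])
    if row.date in next_day_return.index else None
    for row in articles.itertuples()
]
articles = articles.dropna(subset=['label'])
```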

Figure 3: Returns of stock prices relative to a common initial day.

The following sections provide additional information about each kind of data source we used in our experiment and report on the results.


Data sources

Stock prices

We wrote a short Python script that fetches the price data directly in CSV format, using the same URLs the finance portal's web UI provides for downloading the bulk data. Overall, we chose 141 stock symbols for analysis, including the 30 DJIA tickers, the companies valued at over 100 billion USD as of January 2017, some technology companies and some start-ups.

To check for anomalies, we plotted the time series of stock prices (Figure 3). As can be seen in the figure, the prices of most tickers stay concentrated around the unit line, meaning that the future price stays close to the price at the origin date. A few outlier companies exist in our dataset, but the stock value of all of them stays within a tenfold increase or decrease over the period of analysis. The sketch below shows how such a normalized plot can be produced.
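A short sketch of the sanity-check plot, again assuming the hypothetical `prices` DataFrame of closing prices indexed by date with one column per ticker:

```python
# Plot each ticker's price relative to its value on the common initial day.
import matplotlib.pyplot as plt

normalized = prices / prices.iloc[0]   # values near 1.0 = close to starting level
normalized.plot(legend=False)
plt.axhline(1.0, color='black', linewidth=0.5)
plt.ylabel('Price relative to initial day')
plt.show()
```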

News articles

To gather a text corpus that could easily be related to stocks, we started by writing a script that downloads the headline, date, Uniform Resource Locator (URL) and initial text of the news items that Google Finance associates with each stock symbol. We collected a dataset comprising 6,326 articles related to the previously mentioned stock tickers, ranging from Jan. 19, 2016 to Jan. 13, 2017.

In the dataset we collected, the top 20 sources (shown in Figure 4) account for a little more than 52% of the articles. The top three sources are Investorplace.com, The Cerbat Gem and SeekingAlpha, which are popular sources of financial news. We do not expose the source information to the training set in order to avoid data leakage [27] (unexpected additional information), because, at the very least, the class labels may be unbalanced across sources.


Figure 4: Quantity of articles from each news source.

We tokenized sentences using Python's Natural Language Toolkit (NLTK) [28], whose word tokenizer automatically makes use of the Punkt [29] sentence tokenizer. Punkt is a pre-trained unsupervised algorithm that builds a model of abbreviations, collocations and sentence-initial words, so that it can distinguish the dots in tokens like "U.S." from sentence-ending periods. After tokenization, all words were normalized to lowercase in order to match GloVe's uncased vectors. The resulting set of words forms our full vocabulary, totaling 15,651 distinct tokens in the dataset.

Since rare words may not be very informative and may even be misleading during sentence classification, we restrict our attention to the 5,000 most frequent tokens, forming our restricted vocabulary. All remaining rare words are assigned the same identifier, and thus share a single entry in the word embeddings. A sketch of these steps follows.
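A minimal sketch of the tokenization and vocabulary-restriction steps, assuming `texts` holds the articles' initial text; NLTK's `word_tokenize` relies on the pre-trained Punkt model, which must be downloaded first:

```python
# Tokenize, lowercase, and map to indices over the 5,000 most frequent tokens.
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

tokenized = [[w.lower() for w in word_tokenize(t)] for t in texts]

VOCAB_SIZE = 5000
counts = Counter(w for sent in tokenized for w in sent)
# Indices 1..5000 for frequent tokens; index 0 is shared by all rare tokens.
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common(VOCAB_SIZE))}

sequences = [[word_index.get(w, 0) for w in sent] for sent in tokenized]
```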

Our final training set comprised 5,000 articles, as we set aside 1,326 of them for validation. The sentence length distribution of the text in the training set can be seen in Figure 5. The shortest sentence has 11 words, the longest has 70, and the mean length is 38.25 words with a standard deviation of 8.6.

Pre-trained word embeddings

To avoid having to collect a large text corpus to improve vocabulary coverage, we used a static layer of word embeddings with the publicly available pre-trained word vectors from GloVe [10], which were trained on Wikipedia 2014 and Gigaword 5 [30] data and total 400,000 uncased words in their vocabulary. Since our text dataset is small, we used the version with 50-dimensional embeddings in our analysis. The sketch below shows how the embedding matrix can be assembled.
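A sketch of building the embedding matrix from the public `glove.6B.50d.txt` file, with `word_index` taken from the earlier vocabulary sketch; words missing from GloVe keep random initial vectors, as discussed below:

```python
# Build the (vocab_size + 1) x 50 embedding matrix from the GloVe text file.
import numpy as np

EMBED_DIM = 50

glove = {}
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

embedding_matrix = np.random.uniform(-0.05, 0.05,
                                     size=(len(word_index) + 1, EMBED_DIM))
for word, i in word_index.items():
    if word in glove:
        embedding_matrix[i] = glove[word]

# Passed to the Embedding layer via weights=[embedding_matrix].
```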


Figure 5: Histogram of the length of sentences in the training set.

Figure 6: Convergence of the model’s accuracy along training epochs, as measured in both training and validation sets.

Out of the 5,000 most frequent tokens present in our dataset, 4,218 (84.36%) are known words (i.e. they exist in GloVe). During training, unknown words are initialized with random vectors. Most of the unknown words, however, are numeric tokens. Had we wished to collapse these numeric tokens into a small set, one option would have been to use word shapes to represent them, but we wanted to keep potentially important numeric token features and let the learning algorithm decide which of them were relevant to the classification task.

Results

In an initial run, to check whether the model had sufficient capacity, we did not apply any kind of regularization and overfitted the training dataset on purpose. We achieved more than 96% training accuracy within a few epochs, which shows that the input features of interest can be represented by the model.


To optimize the categorical cross-entropy loss, we used a variation of Stochastic Gradient Descent (SGD) called Adaptive Moment Estimation (Adam) [31], which adapts the learning rate based on exponentially decaying averages of previous gradients. Keras' default learning rate of 10⁻³ for Adam was too high for our task, so we initialized it to 10⁻⁴ and then lowered it to 10⁻⁵ during fine-tuning.

After the initial checks, we turned the regularization layers (dropout and batch normalization) on. We ran our experiment on an Nvidia GeForce™ GTX 1070 GPU, using CUDA version 8.0 and cuDNN version 5.1. We achieved training convergence without overfitting by running 100 epochs of training with fixed word vectors and another 100 epochs of fine-tuning with training enabled for the embeddings layer. Using mini-batches of size 50 (1% of the training set), training took an average of 2.17 seconds/epoch (comprised of 100 mini-batches) on the TensorFlow backend and around 1.02 seconds/epoch using the Theano backend. Comparatively, training on an Intel Core™ i5-4690K CPU running at 3.50GHz takes 252 seconds/epoch. A sketch of this training schedule follows.
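A sketch of the two-phase schedule, reusing `model` from the architecture sketch and assuming `x_train`, `y_train`, `x_val`, `y_val` hold padded token sequences and one-hot class labels (the Keras 2 API of the time, e.g. `Adam(lr=...)`, is assumed):

```python
# Phase 1: train with frozen GloVe embeddings at a 1e-4 learning rate.
from keras.optimizers import Adam

model.compile(optimizer=Adam(lr=1e-4),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=50, epochs=100,
          validation_data=(x_val, y_val))

# Phase 2: unfreeze the embedding layer and fine-tune at a 1e-5 learning rate.
model.layers[0].trainable = True
model.compile(optimizer=Adam(lr=1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=50, epochs=100,
          validation_data=(x_val, y_val))
```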

The plot of the model's accuracy during the training phase is shown in Figure 6; validation accuracy converged to 39.21%. Fine-tuning the word embeddings could not improve the model's accuracy any further, possibly because only a few, infrequent words (those unknown to GloVe) were randomly initialized, so the convolutional features may have ignored these vectors.

We also used a simple Long Short-Term Memory (LSTM) [32] model to verify whether long-term dependencies were being missed by our convolutional model, comparing their achieved accuracies; its accuracy also fluctuated around 39%. A sketch of this baseline is shown below.
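A sketch of such an LSTM baseline; the hidden size (64) is an assumption, and only the overall shape (embeddings, an LSTM layer, a softmax output) follows the text. The constants come from the architecture sketch above:

```python
# Simple LSTM baseline for comparison with the convolutional model.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

lstm_model = Sequential()
lstm_model.add(Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN))
lstm_model.add(LSTM(64))
lstm_model.add(Dropout(0.7))
lstm_model.add(Dense(3, activation='softmax'))
lstm_model.compile(optimizer='adam', loss='categorical_crossentropy',
                   metrics=['accuracy'])
```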


Conclusion

Although the accuracy of the predicted classes is not as high as the accuracy achieved in other, similar text classification tasks, it is several percentage points above the one-in-three random baseline, which shows there are certain text structures that are useful for predicting the future variation of stock prices.

Compared to other text classification tasks, our problem is made more difficult by several characteristics: the Efficient Market Hypothesis (EMH), the temporal lags between financial events and the release of news articles, and the varying direction of causality between news and stock price changes, to name a few.

The stock market is a system with very complex dynamics, and we did not include other information we had about the stock prices, such as their auto-correlation in time, trading volume and many other important information sources traditionally used in technical analysis. It is as if we were analyzing just a marginal distribution of text characteristics from the full, unknown joint distribution of market information in order to predict our dependent variable: the class to which the price variation belongs.

We believe better performance could be achieved primarily by using bigger datasets and by including, in our source dataset, the time of day when the news was released, so that we know whether it was released before, during or after market hours. Other ideas for augmenting the model include properly normalizing some words, such as applying word shapes to numeric tokens, using structured syntactic parsing, and extracting more relational information between tokens, for example with dependency parsers already trained on English corpora. Relying on convolutional features over word vectors, however, still gives us the advantage of being able to adapt our model to other languages and markets.


References

[1] Martin Sewell. History of the Efficient Market Hypothesis. UCL Department of Computer Science Research Note, 2011. URL: http://www.cs.ucl.ac.uk/fileadmin/UCL-CS/images/Research_Student_Information/RN_11_04.pdf

[2] Richard O. Duda, Peter E. Hart and David G. Stork. Pattern Classification. 2nd edition, 2001. ISBN 0-471-05669-3. URL: http://dl.acm.org/citation.cfm?id=954544

[3] Merve Alanyali, Helen Susannah Moat and Tobias Preis. Quantifying the Relationship Between Financial News and the Stock Market. Scientific Reports 3, Article number: 3578 (2013). URL: http://www.nature.com/articles/srep03578

[4] Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Frédéric Morin and Jean-Luc Gauvain. Neural probabilistic language models. In Innovations in Machine Learning, pp. 137–186. Springer, 2006. URL: http://link.springer.com/chapter/10.1007%2F3-540-33486-6_6

[5] Ronan Collobert and Jason Weston. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. Proceedings of the 25th International Conference on Machine Learning, 2008. URL: http://doi.acm.org/10.1145/1390156.1390177

[6] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu and Pavel Kuksa. Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, vol. 12, 2011. URL: https://arxiv.org/pdf/1103.0398v1.pdf

[7] Andriy Mnih and Geoffrey Hinton. A Scalable Hierarchical Distributed Language Model. Advances in Neural Information Processing Systems, pp. 1081–1088, 2008. URL: http://dl.acm.org/citation.cfm?id=2981780.2981915

[8] Joseph Turian, Lev Ratinov and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. In: ACL (2010). URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.301.5840

[9] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 2013. URL: http://dl.acm.org/citation.cfm?id=2999792.2999959

[10] Jeffrey Pennington, Richard Socher and Christopher D. Manning. GloVe: Global Vectors for Word Representation. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014. URL: http://nlp.stanford.edu/pubs/glove.pdf


[11] Ainur Yessenalina and Claire Cardie. Compositional Matrix-space Models for Sentiment Analysis. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011. URL: http://dl.acm.org/citation.cfm?id=2145452

[12] Edward Grefenstette, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh and Marco Baroni. Multi-step regression learning for compositional distributional semantics. Proceedings of the 10th International Conference on Computational Semantics, IWCS 2013. URL: http://aclweb.org/anthology/W/W13/W13-0112.pdf

[13] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, 2014. URL: https://cs.stanford.edu/~quocle/paragraph_vector.pdf

[14] Yoon Kim. Convolutional Neural Networks for Sentence Classification. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014. URL: https://arxiv.org/pdf/1408.5882v2.pdf

[15] Ralph Fehrer and Stefan Feuerriegel. Improving Decision Analytics with Deep Learning: The Case of Financial Disclosures. The Computing Research Repository (CoRR), 2015. URL: http://dblp.uni-trier.de/rec/bib/journals/corr/FehrerF15

[16] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng and Christopher D. Manning. Semi-supervised Recursive Autoencoders for Predicting Sentiment Distributions. Proceedings of the Conference on Empirical Methods in Natural Language Processing (2011), pp. 151–161. URL: http://dl.acm.org/citation.cfm?id=2145432.2145450

[17] Ben Bowles. How Quid uses deep learning with small data. URL: https://quid.com/feed/how-quid-uses-deep-learning-with-small-data

[18] Xiao Ding, Yue Zhang, Ting Liu and Junwen Duan. Using Structured Events to Predict Stock Price Movement: An Empirical Investigation. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014. URL: http://emnlp2014.org/papers/pdf/EMNLP2014148.pdf

[19] Yue Zhang and Stephen Clark. Shift-Reduce CCG Parsing. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 19–24, 2011. URL: http://dblp.uni-trier.de/rec/bibtex/conf/acl/ZhangC11


[20] Peter D. Turney and Patrick Pantel. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37 (2010), pp. 141–188. URL: http://www.jair.org/media/2934/live-2934-4846-jair.pdf

[21] Zellig S. Harris. Distributional structure. Word, 10(2-3), 1954: pp. 146–162. URL: https://www.researchgate.net/publication/232591129_Distributional_Structure

[22] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. URL: http://www.deeplearningbook.org

[23] Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11) (1998): pp. 2278–2324. URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.138.1115

[24] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814. Omnipress, 2010. URL: https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf

[25] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015. URL: http://jmlr.org/proceedings/papers/v37/ioffe15.pdf

[26] Ye Zhang and Byron Wallace. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. URL: https://arxiv.org/abs/1510.03820

[27] Shachar Kaufman, Saharon Rosset and Claudia Perlich. Leakage in Data Mining: Formulation, Detection, and Avoidance. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2011, pp. 556–563. URL: http://dl.acm.org/citation.cfm?id=2020496

[28] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media Inc., 2009. URL: http://www.nltk.org

[29] Tibor Kiss and Jan Strunk. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: pp. 485–525, 2006. URL: http://dx.doi.org/10.1162/coli.2006.32.4.485

[30] Parker, Robert, et al. English Gigaword Fifth Edition LDC2011T07. DVD. Philadelphia: Linguistic Data Consortium, 2011. URL: https://catalog.ldc.upenn.edu/LDC2011T07


[31] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. The International Conference on Learning Representations (ICLR), San Diego, 2015. URL: https://arxiv.org/abs/1412.6980v8

[32] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, vol. 9, 1997. URL: http://dl.acm.org/citation.cfm?id=1246450


Dell EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." DELL EMC MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying and distribution of any Dell EMC software described in this publication requires an applicable software license.

Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.
