An Empirical Analysis of Feature Engineering for Predictive Modeling

Jeff Heaton, McKelvey School of Engineering, Washington University in St. Louis, St. Louis, MO 63130. Email: [email protected]

Abstract—Machine learning models, such as neural networks, decision trees, random forests, and support vector machines, accept a feature vector and provide a prediction. These models learn in a supervised fashion where we provide feature vectors mapped to the expected output. It is common practice to engineer new features from the provided feature set. Such engineered features will either augment or replace portions of the existing feature vector. These engineered features are essentially calculated fields based on the values of the other features. Engineering such features is primarily a manual, time-consuming task. Additionally, each type of model will respond differently to different kinds of engineered features. This paper reports empirical research to demonstrate what kinds of engineered features are best suited to various model types. We provide this recommendation by generating several datasets that we designed to benefit from a particular type of engineered feature. The experiment demonstrates to what degree the machine learning model can synthesize the needed feature on its own. If a model can synthesize a planned feature, it is not necessary to provide that feature. The research demonstrated that the studied models do indeed perform differently with various types of engineered features.

I. INTRODUCTION

Feature engineering is an essential but labor-intensive component of machine learning applications [1]. Most machine-learning performance is heavily dependent on the representation of the feature vector. As a result, scientists spend much of their effort designing preprocessing pipelines and data transformations [1].

To utilize feature engineering, the model must preprocess its input data by adding new features based on the other features [2]. These new features might be ratios, differences, or other mathematical transformations of existing features. This process is similar to the equations that human analysts design. They construct new features such as body mass index (BMI), wind chill, or the triglyceride/HDL cholesterol ratio to help understand the interactions of existing features.

Kaggle and ACM's KDD Cup have seen feature engineering play an essential part in several winning submissions. Individuals applied feature engineering to the winning KDD Cup 2010 competition entry [3]. Additionally, researchers won the Kaggle Algorithmic Trading Challenge with an ensemble of models and feature engineering. These individuals created these engineered features by hand.

Technologies such as deep learning [4] can benefit from feature engineering. Most research into feature engineering in this space has been in image and speech recognition [1]. Such techniques are successful in the high-dimension space of image processing and often amount to dimensionality reduction techniques [5] such as PCA [6] and auto-encoders [7].

II. BACKGROUND AND PRIOR WORK

Feature engineering grew out of the desire to transform non-normally distributed inputs. Such a transformation can be helpful for linear regression. The seminal work by George Box and David Cox in 1964 introduced a method for determining which of several power functions might be a useful transformation for the outcome of linear regression [8]. This technique became known as the Box-Cox transformation.

The alternating conditional expectation (ACE) algorithm [9] works similarly to the Box-Cox transformation: an individual can apply a mathematical function to each component of the feature vector outcome. However, unlike the Box-Cox transformation, ACE can guarantee optimal transformations for linear regression.

Linear regression is not the only machine-learning model that can benefit from feature engineering and other transformations. In 1999, researchers demonstrated that feature engineering could enhance rules learning performance for text classification [10]. Feature engineering was successfully applied to the KDD Cup 2010 competition using a variety of machine learning models.
III. EXPERIMENT DESIGN AND METHODOLOGY

Different machine learning model types have varying degrees of ability to synthesize various types of mathematical expressions. If the model can learn to synthesize an engineered feature on its own, there was no reason to engineer the feature in the first place. Demonstrating empirically a model's ability to synthesize a particular type of expression shows whether engineered features of that type might be useful to that model. To explore these relations, we created sixteen datasets containing the inputs and outputs that correspond to a particular type of engineered feature. If a machine-learning model can learn to reproduce that feature with a low error, it means that the particular model could have learned that engineered feature without assistance.

For this research, we considered only regression machine learning models. We chose the following four machine learning models because of their relative popularity and differences in approach:

• Deep Neural Networks (DANN)
• Gradient Boosted Machines (GBM)
• Random Forests
• Support Vector Machines for Regression (SVR)

To mitigate the stochastic nature of some of these machine learning models, each experiment was run 5 times, and the best run's outcome was used for the comparison. These experiments were conducted in the Python programming language, using the following third-party packages: Scikit-Learn [11] and TensorFlow [12]. Using this combination of packages, the model types of support vector machine (SVM) [13][14], deep neural network [15], random forest [16], and gradient boosting machine (GBM) [17] were evaluated against the following sixteen selected engineered features:

• Counts
• Differences
• Distance Between Quadratic Roots
• Distance Formula
• Logarithms
• Max of Inputs
• Polynomials
• Power Ratio (such as BMI)
• Powers
• Ratio of a Product
• Rational Differences
• Rational Polynomials
• Ratios
• Root Distance
• Root of a Ratio (such as Standard Deviation)
• Square Roots

The techniques used to create each of these datasets are described in the following sections. The Python source code for these experiments can be downloaded from the author's GitHub page [18] or Kaggle [19].

A. Counts

The count engineered feature counts the number of elements in the feature vector that satisfy a certain condition. For example, the program might generate a count feature that counts the other features above a specified threshold, such as zero. Equation 1 defines how a count feature might be engineered.

y = \sum_{i=1}^{n} (1 \text{ if } x_i > t \text{ else } 0)  (1)

The x-vector represents the input vector of length n. The resulting y contains an integer equal to the number of values above the threshold (t). The resulting y-count was uniformly sampled from integers in the range [1, 50], and the program creates the corresponding input vectors to generate a count dataset. Algorithm 1 demonstrates this process.

Algorithm 1 Generate count test dataset
1: INPUT: The number of rows to generate, r.
2: OUTPUT: A dataset where y contains random integers sampled from [0, 50], and x contains 50 columns randomly chosen to sum to y.
3: METHOD:
4: x ← [empty list]
5: y ← [empty list]
6: for n ← 1 to r do
7:   v ← zeros(50)                    ▷ vector of length 50
8:   o ← uniform_random_int(0, 50)    ▷ outcome (y)
9:   rem ← o                          ▷ remaining count to place
10:  while rem > 0 do
11:    i ← uniform_random_int(0, len(v) − 1)
12:    if v[i] = 0 then
13:      v[i] ← 1
14:      rem ← rem − 1
15:  x.append(v)
16:  y.append(o)
17: return [x, y]

Several example rows of the count input vector are shown in Table I. The y1 value simply holds the count of the number of features x1 through x50 that contain a value greater than 0.

TABLE I
COUNTS TRANSFORMATION

x1  x2  x3  ...  x50   y1
 1   0   1  ...    0    2
 1   1   1  ...    1   12
 0   1   0  ...    1    8
 1   0   0  ...    1    5
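The following Python sketch illustrates this generation procedure (our own approximation of Algorithm 1 with hypothetical helper names, not the published code from [18]; it places the ones with NumPy sampling without replacement rather than the rejection loop of the pseudocode):

```python
import numpy as np

def generate_count_dataset(rows, n_features=50, seed=42):
    """Build a counts dataset in the form of Table I: each target y is the
    number of inputs set to 1 (a hypothetical helper, not the paper's code)."""
    rng = np.random.default_rng(seed)
    x, y = [], []
    for _ in range(rows):
        v = np.zeros(n_features)
        outcome = int(rng.integers(0, n_features + 1))  # count target in [0, 50]
        # set 'outcome' distinct positions to 1 so the row sums to the target
        ones = rng.choice(n_features, size=outcome, replace=False)
        v[ones] = 1
        x.append(v)
        y.append(outcome)
    return np.array(x), np.array(y)

X, y = generate_count_dataset(1000)
assert np.array_equal(X.sum(axis=1), y)  # every row's sum equals its count target
```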
B. Differences and Ratios

Differences and ratios are common choices for feature engineering. To evaluate this feature type, a dataset is generated with x observations uniformly sampled from the real number range [0, 1]; a single y prediction is also generated that is one of several differences and ratios of the observations. When sampling uniform real numbers for the denominator, the range [0.1, 1] is used to avoid division by zero. The equations chosen are the simple difference (Equation 2), simple ratio (Equation 3), power ratio (Equation 4), product power ratio (Equation 5), and ratio of a polynomial (Equation 6).

y = x_1 - x_2  (2)

y = \frac{x_1}{x_2}  (3)

y = \frac{x_1}{x_2^2}  (4)

y = \frac{x_1 x_2}{x_3^2}  (5)

y = \frac{1}{5x + 8x^2}  (6)

C. Distance Between Quadratic Roots

It is also useful to see how capable the four machine learning models are at synthesizing ordinary mathematical equations. We generate the final synthesized feature from the distance between the roots of a quadratic equation. The distance between the roots of a quadratic equation can easily be calculated by taking the difference of the two outputs of the quadratic formula, as given in Equation 7, in its unsimplified form.

y = \frac{-b + \sqrt{b^2 - 4ac}}{2a} - \frac{-b - \sqrt{b^2 - 4ac}}{2a}  (7)

The dataset for the transformation represented by Equation 7 is generated by uniformly sampling x values from the real number range [−10, 10]. We discard any invalid results (inputs that do not produce two real roots).

D. Distance Formula

The distance formula contains squared differences inside a radical and is shown in Equation 8. The inputs are x values uniformly sampled from the range [0, 10], and the outcome is the Euclidean distance between (x_1, x_2) and (x_3, x_4).

y = \sqrt{(x_1 - x_2)^2 + (x_3 - x_4)^2}  (8)
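As an illustration of how rows for Equations 7 and 8 could be produced, here is a short sketch (our own, with hypothetical function names and NumPy sampling; the published code in [18] may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def quadratic_root_distance(n_rows):
    """Rows for Equation 7: distance between the roots of a*x^2 + b*x + c."""
    rows, targets = [], []
    while len(rows) < n_rows:
        a, b, c = rng.uniform(-10, 10, size=3)
        disc = b ** 2 - 4 * a * c
        if disc < 0 or a == 0:  # discard invalid results (no two real roots)
            continue
        y = (-b + np.sqrt(disc)) / (2 * a) - (-b - np.sqrt(disc)) / (2 * a)
        rows.append([a, b, c])
        targets.append(y)
    return np.array(rows), np.array(targets)

def distance_formula(n_rows):
    """Rows for Equation 8."""
    X = rng.uniform(0, 10, size=(n_rows, 4))
    y = np.sqrt((X[:, 0] - X[:, 1]) ** 2 + (X[:, 2] - X[:, 3]) ** 2)
    return X, y
```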
E. Logarithms and Power Functions

Statisticians have long used logarithms and power functions to transform the inputs to linear regression [8]. Researchers have also shown the usefulness of these functions as transformations for other model types, such as neural networks [20]. The log and power transforms used in this paper are of the types shown in Equations 9, 10, and 11.

y = \log(x)  (9)

y = x^2  (10)

y = x^{1/2}  (11)

This paper investigates using the natural log function, the second power, and the square root. For both the log and root transforms, random x values were uniformly sampled from the real number range [1, 100]. For the second power transformation, the x values were uniformly sampled from the real number range [1, 10]. A single x_1 observation is used to generate a single y_1 observation; the x_1 values are simply random numbers that produce the expected y_1 values by applying the corresponding function.

F. Max of Inputs

Ten random inputs are generated for the observations (x_1 through x_10). These random inputs are sampled uniformly in the range [1, 100]. The outcome is the maximum of the observations. Equation 12 shows how this research calculates the max-of-inputs feature.

y = \max(x_1, \ldots, x_{10})  (12)

G. Polynomials

Engineered features might take the form of polynomials. This paper investigated the machine learning models' ability to synthesize features that follow the polynomial given by Equation 13.

y = 1 + 5x + 8x^2  (13)

An equation such as this shows the models' ability to synthesize features that contain several multiplications and an exponent. The dataset was generated by uniformly sampling x from real numbers in the range [0, 2). The y_1 value is simply calculated based on x_1 as input to Equation 13.

H. Rational Differences and Polynomials

Useful features might also come from combinations of rational equations of polynomials. Equations 14 and 15 show the types of rational combinations of differences and polynomials tested by this paper. We also examine a ratio power equation, similar to the body mass index (BMI) calculation, shown in Equation 16.

y = \frac{x_1 - x_2}{x_3 - x_4}  (14)

y = \frac{1}{5x + 8x^2}  (15)

y = \frac{x_1}{x_2^2}  (16)

To generate a dataset containing rational differences (Equation 14), four observations are uniformly sampled from real numbers in the range [1, 10]. To generate a dataset of rational polynomials (Equation 15), a single observation is uniformly sampled from real numbers in the range [1, 10].
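The single-variable transforms above are straightforward to generate; a minimal sketch (our own, assuming NumPy and the sampling ranges stated in this section):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

x_log = rng.uniform(1, 100, n)
y_log = np.log(x_log)                          # Equation 9

x_pow = rng.uniform(1, 10, n)
y_pow = x_pow ** 2                             # Equation 10

x_sqrt = rng.uniform(1, 100, n)
y_sqrt = np.sqrt(x_sqrt)                       # Equation 11

x_max = rng.uniform(1, 100, (n, 10))
y_max = x_max.max(axis=1)                      # Equation 12

x_poly = rng.uniform(0, 2, n)
y_poly = 1 + 5 * x_poly + 8 * x_poly ** 2      # Equation 13
```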

IV. RESULTS ANALYSIS

To evaluate the effectiveness of the four model types over the sixteen different datasets, we must account for the differences in the ranges of the y values. As Table II demonstrates, the maximum, minimum, mean, and standard deviation of the datasets varied considerably. Because an error metric such as the root mean square error (RMSE) is in the same units as its corresponding y values, some means of normalization is needed. To allow comparison across datasets and provide this normalization, we made use of the normalized root-mean-square deviation (NRMSD) error metric shown in Equation 17. We capped all NRMSD values at 1.5; we considered values higher than 1.5 to have failed to synthesize the feature.

\text{NRMSD} = \frac{1}{\sigma} \sqrt{\frac{\sum_{t=1}^{T} (\hat{y}_t - y_t)^2}{T}}  (17)
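A one-function Python rendering of Equation 17 (our own sketch, including the 1.5 cap described above; the implementation in [18] may differ):

```python
import numpy as np

def nrmsd(y_true, y_pred, cap=1.5):
    """Normalized root-mean-square deviation: RMSE divided by the standard
    deviation of the observed targets, capped at `cap` (treated as failure)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    return min(rmse / np.std(y_true), cap)
```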

TABLE II
DATASET OBSERVATION (Y) STATISTICS AND RANGES

name     min          max        mean    std
bmi      6.25         87.77      37.22   18.22
counts   0.0          49.0       24.60   14.47
dev      7.56         41.70      27.01   4.57
diff     -0.99        0.98       -0.007  0.40
dist     0.07         11.55      4.71    2.25
log      0.00         4.60       3.66    0.87
max      37.89        99.99      90.87   8.38
poly     1.00         42.99      16.79   12.37
pow      1.00         99.99      37.29   29.25
quad     0.0          11.66      1.59    2.03
r_diff   -24,976.28   4,205.68   -2.83   259.28
r_poly   0.001        0.07       0.009   0.01
ratio    2.36e-05     93.92      2.33    5.32
rel      0.015        88.38      3.15    6.63
sqrt     1.00         9.99       6.75    2.28
sum      16.82        81.38      49.95   9.16

The results obtained by the experiments performed in this paper clearly indicate that some model types perform much better with certain classes of engineered features than other model types. The simple transformations that only involved a single feature were all easily learned by all four models. This included the log, polynomial, power, and root. However, none of the models were able to successfully learn the ratio difference feature. Table III provides the scores for each equation type and model. The model-specific results from this experiment are summarized in the following sections.

TABLE III
MODEL SCORES (NRMSD) FOR DATASETS

Feature   SVM    RF     GBM    Neural
bmi       0.00   0.00   0.00   0.03
counts    0.00   0.08   0.10   0.00
dev       0.03   0.45   0.32   0.05
diff      0.07   0.00   0.01   0.00
dist      0.03   0.17   0.12   0.01
log       0.07   0.00   0.00   0.00
max       0.36   0.06   0.12   0.07
poly      0.00   0.00   0.00   0.00
pow       0.00   0.00   0.00   0.00
quad      0.56   0.04   0.04   0.03
r_diff    0.99   1.39   1.5    1.00
r_poly    1.5    0.00   0.00   0.02
ratio     0.28   0.03   0.05   0.03
rel       0.07   0.05   0.09   0.02
sqrt      0.03   0.00   0.00   0.00
sum       0.00   0.33   0.28   0.00

A. Neural Network Results

For each engineered feature experiment, we created an ADAM-trained [21] deep neural network. We made use of a learning rate of 0.001, β1 of 0.9, β2 of 0.999, and ε of 1 × 10^-7, the default training hyperparameters for Keras ADAM.

The deep neural network contained a number of input neurons equal to the number of inputs needed to test that engineered feature type. Likewise, a single output neuron provided the value generated by the specified engineered feature. When viewed from the input to the output layer, there are five hidden layers, containing 400, 200, 100, 50, and 25 neurons, respectively. Each hidden layer makes use of a rectifier transfer function [22], making each hidden neuron a rectified linear unit (ReLU). We provide the results of these deep neural network engineered feature experiments in Figure 1.

Fig. 1. Deep Neural Network Engineered Features

The deep neural network performed well on all equation types except the ratio of differences. The neural network also performed consistently better on the remaining equation types than the other three models. An examination of the calculations performed by a neural network provides some insight into this performance. A single-layer neural network is essentially a weighted sum of the input vector transformed by a transfer function, as shown in Equation 18.

f(x, w, b) = \phi\left(\sum_{i=1}^{n} (w_i x_i) + b\right)  (18)

The vector x represents the input vector, the vector w represents the weights, and the scalar variable b represents the bias. The symbol φ represents the transfer function. This paper's experiments used the rectifier transfer function [22] for hidden neurons and a simple identity linear function for output neurons. The weights and biases are adjusted as the neural network is trained. A deep neural network contains many layers of these neurons, where each layer can transform the input (represented by x) into the next layer. This fact allows the neural network to be adjusted to perform many mathematical operations and explains some of the results shown in Figure 1. The neural network can easily add, sum, and multiply. This fact made the counts, diff, power, and rational polynomial engineered features all relatively easy to synthesize by using layers of Equation 18.
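A Keras sketch of this architecture (our own reconstruction from the description above: 400/200/100/50/25 ReLU hidden layers, a linear output neuron, and the default Adam settings; not the author's published training script):

```python
import tensorflow as tf

def build_dnn(n_inputs):
    """Five hidden ReLU layers (400, 200, 100, 50, 25) and a single linear output."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(400, activation="relu"),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(25, activation="relu"),
        tf.keras.layers.Dense(1),  # identity (linear) output neuron
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7),
        loss="mse")
    return model

# Example usage: model = build_dnn(n_inputs=4); model.fit(X, y, epochs=100, verbose=0)
```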
B. Support Vector Machine Results

The two primary hyper-parameters of an SVM are C and γ. It is customary to perform a grid search to find an optimal combination of C and γ [23]. We tried 3 C values of 0.001, 1, and 100, combined with the 3 γ values of 0.1, 1, and 10. This selection resulted in 9 different SVMs to evaluate. The experiment results are from the best combination of C and γ for each feature type. A third hyper-parameter specifies the type of kernel that the SVM uses, which for this paper is a Gaussian kernel. Because support vector machines benefit from having their input feature vectors normalized to a specific range [23], we normalized all SVM input vectors to the range [0, 1]. This normalization is a required step for the Gaussian kernel we investigated. Therefore, the SVM results are not as pure a feature engineering experiment as the other models, because the normalization itself adds calculations that the other models do not receive. We provide the results of the SVM engineered feature experiments in Figure 2.

Smola and Vapnik extended the original support vector machine to include regression; we call the resulting algorithm support vector regression (SVR) [24]. A full discussion of how an SVR is fitted and calculated is beyond the scope of this paper. However, the primary concern is how a fitted SVR calculates its final output. This calculation can help determine the types of transformations that an SVR can synthesize. The final decision function for an SVR is given by Equation 19.

f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*}) K(x_i, x) + \rho  (19)

The vector x represents the input vector. The difference between the two alphas is called the SVR's coefficient and is somewhat analogous to the weights of a neural network. The variable ρ represents the SVR intercept, which is somewhat analogous to the bias of a neural network. The kernel function K introduces a non-linearity that is somewhat analogous to the transfer function of a neural network; this paper used a radial basis function (RBF) kernel based on the Gaussian function.

Like the neural network, the SVR can perform multiplications and summations. Though there are many similarities between a neural network and an SVR, their final calculations also differ in many ways.

The support vector machine found the max, quadratic, ratio of differences, ratio of a polynomial, and ratio features all difficult to synthesize. All other feature experiments were within a low NRMSD level.
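A sketch of the grid search and [0, 1] normalization described above, using scikit-learn (our own illustration; the paper states only that the nine C/γ combinations were evaluated and the best kept, not that GridSearchCV was used):

```python
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

# Scale inputs to [0, 1], then search the 3x3 grid of C and gamma values.
pipe = Pipeline([
    ("scale", MinMaxScaler(feature_range=(0, 1))),
    ("svr", SVR(kernel="rbf")),
])
param_grid = {"svr__C": [0.001, 1, 100], "svr__gamma": [0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=3)
# search.fit(X, y); best_svr = search.best_estimator_
```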


Fig. 2. SVM Engineered Features
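Equation 19 can be checked directly against the attributes of a fitted scikit-learn SVR (dual_coef_ holds the α_i − α_i* terms, intercept_ holds ρ). A toy sketch of this check (our own, not part of the paper's experiments):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 2))
y = X[:, 0] - X[:, 1]  # the simple difference feature (Equation 2)

svr = SVR(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

K = rbf_kernel(X, svr.support_vectors_, gamma=1.0)    # K(x_i, x) terms
manual = K @ svr.dual_coef_.ravel() + svr.intercept_  # Equation 19
assert np.allclose(manual, svr.predict(X))
```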

Fig. 3. Random Forest Engineered Features

V. CONCLUSION & FURTHER RESEARCH
sum fegnee etrs twudb sflt e o more models’ how learning see machine to affect useful features be of would classes It complex types machine features. additional of help examine engineered variety also of a could might for We use types. research that useful model ensembles learning be Such of creation might models. the features guide learning what machine understand research other Further to types. needed model is learning machine features popular four input multiple of Engi- up focus. models. logical made learning a are seem machine that of features set neered wider a with features shown hyper- have the tuning might spent datasets parameters. time additional and Re- for generic improvement models. models reasonably some the individual made for for chosen we sults hyper-parameters Instead, the for datasets. choices the of each gradient or vector forest machine. random support a boosting or with network groups well two ensemble neural might these A machine of well. each very from perform model might machines, a of boosting up gradient made and ensembles forests random different synthesize than neural can features Because machines vector models. support individual and networks than typ- better models of perform ensembles why ically reasons the of one demonstrates differences. from with benefit wide ratios might on a here based to explored features useful models engineered be all paper. might models, this of difference in array of explored not consider. models ratios were of the these to differences of deal Because of any features ratio great by engineered a well a synthesized on of has based types features used Engineered the model on learning influence machine ensemble. an of in type other each of with types the well and work type will model that learning models machine particular a for hsppreaie 6dfeetegnee etrsfor features engineered different 16 examined paper This engineered other exploring on focus will research Future for models the tuning time significant spend not did We empirically also paper this by performed research The the research, this in performed experiments the on Based

Fig. 4. GBM Engineered Features
REFERENCES

[1] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[2] A. Coates, A. Y. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223.
[3] H.-F. Yu, H.-Y. Lo, H.-P. Hsieh, J.-K. Lou, T. G. McKenzie, J.-W. Chou, P.-H. Chung, C.-H. Ho, C.-F. Chang, Y.-H. Wei et al., "Feature engineering and classifier ensemble for KDD Cup 2010," KDD Cup, 2010.
[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[5] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[6] I. Jolliffe, Principal Component Analysis. Wiley Online Library, 2002.
[7] B. A. Olshausen et al., "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607–609, 1996.
[8] G. E. Box and D. R. Cox, "An analysis of transformations," Journal of the Royal Statistical Society, Series B (Methodological), pp. 211–252, 1964.
[9] L. Breiman and J. H. Friedman, "Estimating optimal transformations for multiple regression and correlation," Journal of the American Statistical Association, vol. 80, no. 391, pp. 580–598, 1985.
[10] S. Scott and S. Matwin, "Feature engineering for text classification," in ICML, vol. 99. Citeseer, 1999, pp. 379–388.
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," The Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[12] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[13] V. N. Vapnik and A. J. Chervonenkis, "Theory of pattern recognition," 1974.
[14] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, 1992, pp. 144–152.
[15] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, 1943.
[16] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[17] L. Mason, J. Baxter, P. Bartlett, and M. Frean, "Boosting algorithms as gradient descent in function space," in NIPS, 1999.
[18] J. Heaton, "Jeff Heaton's GitHub repository - conference/paper source code," https://github.com/jeffheaton/papers, accessed: 2016-01-31.
[19] J. Heaton, "Tabular feature engineering dataset," 2020. [Online]. Available: http://doi.org/10.34740/KAGGLE/DSV/1600809
[20] S. D. Balkin and J. K. Ord, "Automatic neural network modeling for univariate time series," International Journal of Forecasting, vol. 16, no. 4, pp. 509–515, 2000.
[21] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[22] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[23] C.-W. Hsu, C.-C. Chang, C.-J. Lin et al., "A practical guide to support vector classification," 2003.
[24] A. Smola and V. Vapnik, "Support vector regression machines," Advances in Neural Information Processing Systems, vol. 9, pp. 155–161, 1997.