
Kaggle Machine Learning Projects

Ashok Kumar Harnal
FORE School of Management, New Delhi

About Kaggle and About Projects

Kaggle is a platform for predictive modelling and analytics competitions, on which companies, public bodies and researchers post their data and pose predictive-analytics problems relating to it. Statisticians and data miners from all over the world compete to produce the best models. The data posted ranges from megabytes to terabytes; data in the gigabyte range is common. This competitive approach relies on the fact that there are countless strategies that can be applied to any predictive modelling task, and it is impossible to know at the outset which technique or analyst will be most effective.

Here is how it works:

1. The competition host (company) prepares the data and a description of the problem. Kaggle offers a service which helps the host do this, as well as frame the competition, anonymize the data, and integrate the winning model into their operations.
2. Participants experiment with different techniques and compete against each other to produce the best models. For most competitions, submissions are scored immediately (based on their predictive accuracy relative to a hidden solution file) and summarized on a live leaderboard.

Projects in which we participated

We have participated in a number of projects on Kaggle. Some of the projects are listed below. I also write a technical blog at http://ashokharnal.wordpress.com, where I describe in detail how we have executed these projects, the project code, as well as the results of the competitions. I have been using R and Python at different times.

This compilation is a record of projects that we have executed. For want of time, not all of them are listed in my technical blog. This booklet describes the projects and the associated problems, but not the solutions. If you wish to have access to the solutions as well, please log into FORE's e-learning site (http://203.122.28.250/bigdata) with userid 'myguest' and password Qwerty#123, and peruse the project code and results.


Contents

About Kaggle and About Projects
Here is how it works
Projects in which we participated
1. Bosch Production Line Performance
   Problem: Reduce manufacturing failures
2. Africa Soil Properties Challenge
   Problem: Predict physical and chemical properties of soil using spectral measurements
3. Rossmann Drug Store
   Problem: Forecast sales using store, promotion, and competitor data
4. Walmart: Acquire Valued Shoppers Challenge
   Problem: Predict which shoppers will become repeat buyers
5. Avazu CTR Prediction
   Problem: Predict whether a mobile ad will be clicked
6. Facial keypoints detection
   Problem: Detect the location of keypoints on face images
7. Forest Cover Prediction
   Problem: Use cartographic variables to classify forest categories
8. Boehringer Ingelheim: Which drugs are effective?
   Problem: Predict a biological response of molecules from their chemical properties
9. West Nile virus prediction
   Problem: Predict West Nile virus in mosquitos across the city of Chicago
10. Caterpillar tube pricing
    Problem: Model quoted prices for industrial tube assemblies
11. San Francisco Crime Classification
    Problem: Predict the category of crimes that occurred in the city by the bay
12. Airbnb New User Bookings
    Problem: Where will a new guest book their first travel experience?
13. TFI: Restaurant Revenue Prediction
    Problem: Predict annual restaurant sales based on objective measurements
14. Otto Group Product Classification Challenge
    Problem: Classify products into the correct category
15. Walmart Recruiting - Store Sales Forecasting
    Problem: Use historical markdown data to predict store sales
16. Springleaf: Determine whether to send a direct mail piece to a customer
    Problem: Predict which customers can be directly targeted
17. Santander Customer Satisfaction
    Problem: Which customers are happy customers?
18. Influencers in Social Networks
    Problem: Predict which people are influential in a social network
19. Predicting Red Hat Business Value
    Problem: Classifying customer potential


1. Bosch Production Line Performance

Problem: Reduce manufacturing failures (Area: Operation/Manufacturing)

A good chocolate soufflé is decadent, delicious, and delicate. But, it's a challenge to prepare. When you pull a disappointingly deflated dessert out of the oven, you instinctively retrace your steps to identify at what point you went wrong. Bosch, one of the world's leading manufacturing companies, has an imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. Part of doing so is closely monitoring its parts as they progress through the manufacturing processes.

Because Bosch records data at every step along its assembly lines, they have the ability to apply advanced analytics to improve these manufacturing processes. However, the intricacies of the data and complexities of the production line pose problems for current methods.

In this competition, Bosch is challenging participants to predict internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable Bosch to bring quality products at lower costs to the end user.

Data

Data Files

File Name               Available Formats
test_categorical.csv    .zip (19.75 mb)
train_categorical.csv   .zip (19.78 mb)
train_date.csv          .zip (58.77 mb)
test_date.csv           .zip (58.78 mb)
sample_submission.csv   .zip (1.55 mb)
test_numeric.csv        .zip (270.33 mb)
train_numeric.csv       .zip (269.98 mb)

The data for this competition represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).

The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.

On account of the large size of the dataset, we have separated the files by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken.
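To make the naming convention concrete, here is a minimal Python sketch (the parsing helper and example columns are only illustrative) that splits a column name into its line, station and feature parts and tells feature columns apart from date columns:

    import re

    def parse_column_name(col):
        """Split a Bosch column name such as 'L3_S36_F3939' into line, station and number."""
        m = re.fullmatch(r"L(\d+)_S(\d+)_([FD])(\d+)", col)
        if m is None:
            raise ValueError(f"Unexpected column name: {col}")
        line, station, kind, number = m.groups()
        return {"line": int(line),
                "station": int(station),
                "kind": "date" if kind == "D" else "feature",
                "number": int(number)}

    print(parse_column_name("L3_S36_F3939"))   # line 3, station 36, feature 3939
    print(parse_column_name("L0_S0_D1"))       # date column paired with feature L0_S0_F0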

In addition to being one of the largest datasets (in terms of number of features) ever hosted on Kaggle, the ground truth for this competition is highly imbalanced. Together, these two attributes are expected to make this a challenging problem.

File descriptions

• train_numeric.csv - the training set numeric features (this file contains the 'Response' variable)
• test_numeric.csv - the test set numeric features (you must predict the 'Response' for these Ids)
• train_categorical.csv - the training set categorical features
• test_categorical.csv - the test set categorical features
• train_date.csv - the training set date features
• test_date.csv - the test set date features
• sample_submission.csv - a sample submission file in the correct format


2. Africa Soil Properties Challenge

Problem: Predict physical and chemical properties of soil using spectral measurements (Area: Environment/Geology)

Advances in rapid, low cost analysis of soil samples using infrared spectroscopy, georeferencing of soil samples, and greater availability of earth remote sensing data provide new opportunities for predicting soil functional properties at unsampled locations. Soil functional properties are those properties related to a soil’s capacity to support essential ecosystem services such as primary productivity, nutrient and water retention, and resistance to soil erosion. Digital mapping of soil functional properties, especially in data sparse regions such as Africa, is important for planning sustainable agricultural intensification and natural resources management.

Diffuse reflectance infrared spectroscopy has shown potential in numerous studies to provide a highly repeatable, rapid and low cost measurement of many soil functional properties. The amount of light absorbed by a soil sample is measured, with minimal sample preparation, at hundreds of specific wavebands across a range of wavelengths to provide an infrared spectrum (Fig. 1). The measurement can be typically performed in about 30 seconds, in contrast to conventional reference tests, which are slow and expensive and use chemicals.


Conventional reference soil tests are calibrated to the infrared spectra on a subset of samples selected to span the diversity in soils in a given target geographical area. The calibration models are then used to predict the soil test values for the whole sample set. The predicted soil test values from georeferenced soil samples can in turn be calibrated to remote sensing covariates, which are recorded for every pixel at a fixed spatial resolution in an area, and the calibration model is then used to predict the soil test values for each pixel. The result is a digital map of the soil properties.

This competition asks one to predict 5 target soil functional properties from diffuse reflectance infrared spectroscopy measurements.

Data

File descriptions

• train.csv - the training set has 1158 rows.
• test.csv - the test set has 728 rows.
• sample_submission.csv - all zeros prediction, serving as a sample submission file in the correct format.

Data fields

SOC, pH, Ca, P, Sand are the five target variables for predictions. The data have been monotonously transformed from the original measurements and thus include negative values.

• PIDN: unique soil sample identifier
• SOC: Soil organic carbon
• pH: pH values
• Ca: Mehlich-3 extractable Calcium
• P: Mehlich-3 extractable Phosphorus
• Sand: Sand content
• m7497.96 - m599.76: There are 3,578 mid-infrared absorbance measurements. For example, the "m7497.96" column is the absorbance at wavenumber 7497.96 cm-1. We suggest removing the CO2 bands of the spectra, which are in the region m2379.76 to m2352.76, but one does not have to (a small filtering sketch follows after the field list below).
• Depth: Depth of the soil sample (2 categories: "Topsoil", "Subsoil")

We have also included some potential spatial predictors from remote sensing data sources. Short variable descriptions are provided below and additional descriptions can be found at AfSIS data. The data have been mean centered and scaled.

• BSA: average long-term Black Sky Albedo measurements from MODIS satellite images (BSAN = near-infrared, BSAS = shortwave, BSAV = visible)
• CTI: compound topographic index calculated from Shuttle Radar Topography Mission elevation data
• ELEV: Shuttle Radar Topography Mission elevation data
• EVI: average long-term Enhanced Vegetation Index from MODIS satellite images


• LST: average long-term Land Surface Temperatures from MODIS satellite images (LSTD = day-time temperature, LSTN = night-time temperature)
• Ref: average long-term Reflectance measurements from MODIS satellite images (Ref1 = blue, Ref2 = red, Ref3 = near-infrared, Ref7 = mid-infrared)
• Reli: topographic Relief calculated from Shuttle Radar Topography Mission elevation data
• TMAP & TMFI: average long-term Tropical Rainfall Monitoring Mission data (TMAP = mean annual precipitation, TMFI = modified Fournier index)
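As mentioned in the field list above, the CO2 band of the spectra (m2379.76 to m2352.76) may optionally be removed. Below is a minimal pandas sketch of that filtering step; the tiny DataFrame is made up and only a handful of spectral columns are shown:

    import pandas as pd

    def drop_co2_band(df, lo=2352.76, hi=2379.76):
        """Drop spectral columns whose wavenumber lies in the CO2 band [lo, hi]."""
        co2_cols = [c for c in df.columns
                    if c.startswith("m") and lo <= float(c[1:]) <= hi]
        return df.drop(columns=co2_cols)

    # Made-up example with three spectral columns and one target column
    train = pd.DataFrame({"m2379.76": [0.1, 0.2],
                          "m2352.76": [0.3, 0.4],
                          "m599.76":  [0.5, 0.6],
                          "SOC":      [1.0, 1.2]})
    print(drop_co2_band(train).columns.tolist())   # ['m599.76', 'SOC']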

Techniques used:

• Bayesian Additive Regression Trees (BART), a variation of boosting in which each weak learner is fitted to the errors of the ensemble so far, using a Bayesian approach.
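The sketch below is not BART itself (the Bayesian priors and posterior sampling are omitted); it only illustrates the fit-to-errors idea the bullet refers to, using plain decision trees on made-up data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    n_trees, shrinkage = 50, 0.1
    prediction = np.zeros_like(y)
    for _ in range(n_trees):
        residual = y - prediction                      # errors of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        prediction += shrinkage * tree.predict(X)      # each weak learner corrects the residuals

    print("training RMSE:", np.sqrt(np.mean((y - prediction) ** 2)))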

Least error: 0.4692; Worst error: 28.77


3. Rossmann Drug Store

Problem: Forecast sales using store, promotion, and competitor data (Area: Marketing/Human Resource)

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

In their first Kaggle competition, Rossmann is challenging one to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, one will help store managers stay focused on what’s most important to them: their customers and their teams!

One is provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

Data

Files

• train.csv - historical data including Sales
• test.csv - historical data excluding Sales
• sample_submission.csv - a sample submission file in the correct format
• store.csv - supplemental information about the stores

Data fields

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

• Id - an Id that represents a (Store, Date) duple within the test set
• Store - a unique Id for each store
• Sales - the turnover for any given day (this is what is to be predicted)
• Customers - the number of customers on a given day
• Open - an indicator for whether the store was open: 0 = closed, 1 = open


• StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
• SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
• StoreType - differentiates between 4 different store models: a, b, c, d
• Assortment - describes an assortment level: a = basic, b = extra, c = extended
• CompetitionDistance - distance in meters to the nearest competitor store
• CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
• Promo - indicates whether a store is running a promo on that day
• Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
• Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
• PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

Feature Engineering

1. Mean(Sales)/Mean(Customers) for each store
2. Mean(Sales)/Mean(Customers) for each store, by day of week (a pandas sketch of items 1 and 2 follows after this list)
3. Mean(Sales) by day of week only
4. Mean(Sales)/Mean(Customers) by day of week only
5. Mean(Sales) by each store and day of week
6. Mean(Sales) by each store and promo
7. Mean(Sales) by each store type and promo
8. Mean(Sales) by assortment and store type
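A minimal pandas sketch of features 1 and 2 above (the toy frame is made up; real column names follow the field list):

    import pandas as pd

    train = pd.DataFrame({
        "Store":     [1, 1, 1, 2, 2],
        "DayOfWeek": [1, 2, 1, 1, 2],
        "Sales":     [500.0, 600.0, 550.0, 800.0, 900.0],
        "Customers": [50, 55, 52, 70, 75],
    })

    # Feature 1: mean(Sales) / mean(Customers) for each store
    per_store = train.groupby("Store", as_index=False).agg(
        mean_sales=("Sales", "mean"), mean_cust=("Customers", "mean"))
    per_store["store_sales_per_customer"] = per_store["mean_sales"] / per_store["mean_cust"]

    # Feature 2: the same ratio for each (store, day-of-week) pair
    per_store_dow = train.groupby(["Store", "DayOfWeek"], as_index=False).agg(
        mean_sales=("Sales", "mean"), mean_cust=("Customers", "mean"))
    per_store_dow["store_dow_sales_per_customer"] = (
        per_store_dow["mean_sales"] / per_store_dow["mean_cust"])

    # Merge the engineered columns back onto the training frame
    train = train.merge(per_store[["Store", "store_sales_per_customer"]], on="Store")
    train = train.merge(
        per_store_dow[["Store", "DayOfWeek", "store_dow_sales_per_customer"]],
        on=["Store", "DayOfWeek"])
    print(train)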

Techniques used

1. Random Forest
2. XGBoost
3. Model mix: α * (Random Forest prediction) + (1 - α) * (XGBoost prediction) (a small blending sketch follows below)
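A small sketch of the model mix in item 3, blending two sets of predictions with a weight α. The data, regressors and α value are placeholders; a gradient boosting regressor stands in for XGBoost so the sketch stays self-contained:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    gb = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)   # XGBoost stand-in

    alpha = 0.6   # in practice the weight is tuned on a validation set
    blend = alpha * rf.predict(X_val) + (1 - alpha) * gb.predict(X_val)
    rmse = np.sqrt(np.mean((blend - y_val) ** 2))
    print(f"blended validation RMSE: {rmse:.2f}")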


4. Walmart: Acquire Valued Shoppers Challenge

Problem: Predict which shoppers will become repeat buyers (Area: Marketing/Human Behavior)

Consumer brands often offer discounts to attract new shoppers to buy their products. The most valuable customers are those who return after this initial incented purchase. With enough purchase history, it is possible to predict which shoppers, when presented an offer, will buy a new item. However, identifying the shopper who will become a loyal buyer -- prior to the initial purchase -- is a more challenging task.

The Acquire Valued Shoppers Challenge asks participants to predict which shoppers are most likely to repeat purchase. To aid with algorithmic development, we have provided complete, basket-level, pre-offer shopping history for a large set of shoppers who were targeted for an acquisition campaign. The incentive offered to that shopper and their post-incentive behavior is also provided.

This challenge provides almost 350 million rows of completely anonymised transactional data from over 300,000 shoppers. It is one of the largest problems run on Kaggle to date.

Warning: this is a large data set. The decompressed files require about 22GB of space.


This data captures the process of offering incentives (a.k.a. coupons) to a large number of customers and forecasting those who will become loyal to the product. Let's say 100 customers are offered a discount to purchase two bottles of water. Of the 100 customers, 60 choose to redeem the offer. These 60 customers are the focus of this competition. Predict which of the 60 will return (during or after the promotional period) to purchase the same item again.

To create this prediction, we give a minimum of a year of shopping history prior to each customer's incentive, as well as the purchase histories of many other shoppers (some of whom will have received the same offer). The transaction history contains all items purchased, not just items related to the offer. Only one offer per customer is included in the data. The training set is comprised of offers issued before 2013-05-01. The test set is offers issued on or after 2013-05-01.

Data

Files

Four relational files are provided:

• transactions.csv - contains transaction history for all customers for a period of at least 1 year prior to their offered incentive
• trainHistory.csv - contains the incentive offered to each customer and information about the behavioral response to the offer
• testHistory.csv - contains the incentive offered to each customer but does not include their response (one is predicting the repeater column for each id in this file)
• offers.csv - contains information about the offers

Fields

All of the fields are anonymized and categorized to protect customer and sales information. The specific meanings of the fields will not be provided (so don't bother asking). Part of the challenge of this competition is learning the taxonomy of items in a data-driven way.

history
• id - A unique id representing a customer
• chain - An integer representing a store chain
• offer - An id representing a certain offer
• market - An id representing a geographical region
• repeattrips - The number of times the customer made a repeat purchase
• repeater - A boolean, equal to repeattrips > 0
• offerdate - The date a customer received the offer

transactions
• id - see above
• chain - see above
• dept - An aggregate grouping of the Category (e.g. water)
• category - The product category (e.g. sparkling water)
• company - An id of the company that sells the item
• brand - An id of the brand to which the item belongs
• date - The date of purchase
• productsize - The amount of the product purchased (e.g. 16 oz of water)
• productmeasure - The units of the product purchase (e.g. ounces)
• purchasequantity - The number of units purchased
• purchaseamount - The dollar amount of the purchase

offers
• offer - see above
• category - see above
• quantity - The number of units one must purchase to get the discount
• company - see above
• offervalue - The dollar value of the offer
• brand - see above

Feature Engineering

1. What is the attitude of the customer towards the offered product? Has he purchased it in the past? (Binary attribute: Yes/No)
2. productAffinity - how many times the customer has purchased the offered category (number of transactions)
3. prod_purchasedamount - total amount spent on this category (a pandas sketch of features 2 and 3 follows after this list)
4. Customer attitude towards the offered brand: what are his total expenses on the brand?
5. brand_affinity - how interested a customer is in the brand of the company that offered this product; count the number of transactions that pertained to this brand
6. category_affinity - what is the attitude of the customer towards the offered category, i.e. even from other companies?
7. chain_affinity - attitude of the customer towards the store chain: does he visit it often? How often has the customer transacted, and what are his total purchases?
8. Basket of categories: what is the variety of the customer's category purchases?
9. Build a brand popularity score: which brand is more popular?
10. Build a category popularity score: which category is more popular?
11. Build a company popularity score: which company is more popular?
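A minimal pandas sketch of two of the affinity features above (category_affinity and prod_purchasedamount). The tiny frames stand in for transactions.csv and trainHistory.csv; field names follow the lists above:

    import pandas as pd

    transactions = pd.DataFrame({
        "id":             [1, 1, 1, 2, 2],
        "category":       [10, 10, 20, 10, 30],
        "purchaseamount": [3.0, 4.0, 2.5, 6.0, 1.0],
    })
    history = pd.DataFrame({"id": [1, 2], "category": [10, 20]})

    # category_affinity: number of past transactions in the offered category
    cat_counts = (transactions.groupby(["id", "category"]).size()
                  .rename("category_affinity").reset_index())
    history = history.merge(cat_counts, on=["id", "category"], how="left")

    # prod_purchasedamount: total amount spent on the offered category
    cat_spend = (transactions.groupby(["id", "category"])["purchaseamount"].sum()
                 .rename("prod_purchasedamount").reset_index())
    history = history.merge(cat_spend, on=["id", "category"], how="left")

    history = history.fillna({"category_affinity": 0, "prod_purchasedamount": 0.0})
    print(history)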


Top-Entry Score: 0.62703; Worst Entry: 0.4430


5. Avazu CTR Prediction

Problem: Predict whether a mobile ad will be clicked (Area: Advertising)

In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.

For this competition, we have provided 11 days' worth of Avazu data to build and test prediction models. Can one find a strategy that beats standard classification algorithms?

Data

File descriptions

• train - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks are subsampled according to different strategies.
• test - Test set. 1 day of ads for testing one's model predictions.
• sampleSubmission.csv - Sample submission file in the correct format, corresponds to the All-0.5 Benchmark.

Data fields

• id: ad identifier
• click: 0/1 for non-click/click
• hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
• C1 - anonymized categorical variable


• banner_pos
• site_id
• site_domain
• site_category
• app_id
• app_domain
• app_category
• device_id
• device_ip
• device_model
• device_type
• device_conn_type
• C14-C21 - anonymized categorical variables
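With this many high-cardinality categorical fields, one common (though by no means the only) approach is the hashing trick followed by a linear classifier. The sketch below is illustrative only; the four toy rows mimic a few of the fields listed above:

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.linear_model import LogisticRegression

    rows = [
        {"site_id": "a1", "app_id": "x1", "device_type": "1", "banner_pos": "0"},
        {"site_id": "a2", "app_id": "x1", "device_type": "1", "banner_pos": "1"},
        {"site_id": "a1", "app_id": "x2", "device_type": "0", "banner_pos": "0"},
        {"site_id": "a3", "app_id": "x2", "device_type": "1", "banner_pos": "1"},
    ]
    clicks = [0, 1, 0, 1]

    # Hash "field=value" strings into a fixed-width sparse matrix
    hasher = FeatureHasher(n_features=2 ** 10, input_type="string")
    X = hasher.transform([f"{k}={v}" for k, v in r.items()] for r in rows)

    model = LogisticRegression().fit(X, clicks)
    print(model.predict_proba(X)[:, 1])   # predicted click probabilities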

Highest Score: 0.3791384 Worst score: 23.72


6. Facial keypoints detection

Problem: Detect the location of keypoints on face images (Area: Computer Vision)

The objective of this task is to predict keypoint positions on face images. This can be used as a building block in several applications, such as:

• tracking faces in images and video
• analysing facial expressions
• detecting dysmorphic facial signs for medical diagnosis
• biometrics / face recognition

Detecting facial keypoints is a very challenging problem. Facial features vary greatly from one individual to another, and even for a single individual, there is a large amount of variation due to 3D pose, size, position, viewing angle, and illumination conditions. Computer vision research has come a long way in addressing these difficulties, but there remain many opportunities for improvement.

Each predicted keypoint is specified by an (x,y) real-valued pair in the space of pixel indices. There are 15 keypoints, which represent the following elements of the face: left_eye_center, right_eye_center, left_eye_inner_corner, left_eye_outer_corner, right_eye_inner_corner, right_eye_outer_corner, left_eyebrow_inner_end, left_eyebrow_outer_end, right_eyebrow_inner_end, right_eyebrow_outer_end, nose_tip, mouth_left_corner, mouth_right_corner, mouth_center_top_lip, mouth_center_bottom_lip

Left and right here refer to the point of view of the subject.


In some examples, some of the target keypoint positions are missing (encoded as missing entries in the csv, i.e., with nothing between two commas).

The input image is given in the last field of the data files, and consists of a list of pixels (ordered by row), as integers in (0,255). The images are 96x96 pixels.
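A minimal sketch (the helper name and the 3x3 demo string are made up) of turning the space-separated pixel string from the Image field into a square array:

    import numpy as np

    def decode_image(pixel_string, size=96):
        """Turn the space-separated pixel string from the csv into a size x size array."""
        pixels = np.array(pixel_string.split(), dtype=np.uint8)
        return pixels.reshape(size, size)

    # Tiny 3x3 example in the same format as the real 96x96 'Image' field
    demo = "0 255 0 128 64 32 10 20 30"
    print(decode_image(demo, size=3))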

Data

• training.csv: list of 7049 training images. Each row contains the (x,y) coordinates for 15 keypoints, and image data as a row-ordered list of pixels.
• test.csv: list of 1783 test images. Each row contains ImageId and image data as a row-ordered list of pixels.
• submissionFileFormat.csv: list of 27124 keypoints to predict. Each row contains a RowId, ImageId, FeatureName, Location. FeatureName are "left_eye_center_x," "right_eyebrow_outer_end_y," etc. Location is what needs to be predicted.

Best score: 1.9397; Worst score: 52.07


7. Forest Cover Prediction

Problem: Use cartographic variables to classify forest categories

(Area: Environment)

In this competition one is asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.

The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes than of forest management practices.

Each observation is a 30m x 30m patch. One is asked to predict an integer classification for the forest cover type. The seven types are:

1 - Spruce/Fir
2 - Lodgepole Pine
3 - Ponderosa Pine
4 - Cottonwood/Willow
5 - Aspen
6 - Douglas-fir
7 - Krummholz

Data

The training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features. One must predict the Cover_Type for every row in the test set (565892 observations).

Data Fields

Elevation - Elevation in meters
Aspect - Aspect in degrees azimuth
Slope - Slope in degrees
Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway
Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation (a sketch of collapsing these indicator columns follows after the soil type list below)
Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation

The wilderness areas are:

1 - Rawah Wilderness Area
2 - Neota Wilderness Area
3 - Comanche Peak Wilderness Area
4 - Cache la Poudre Wilderness Area

The soil types are:

1 Cathedral family - Rock outcrop complex, extremely stony.
2 Vanet - Ratake families complex, very stony.
3 Haploborolis - Rock outcrop complex, rubbly.
4 Ratake family - Rock outcrop complex, rubbly.
5 Vanet family - Rock outcrop complex complex, rubbly.
6 Vanet - Wetmore families - Rock outcrop complex, stony.
7 Gothic family.
8 Supervisor - Limber families complex.
9 Troutville family, very stony.
10 Bullwark - Catamount families - Rock outcrop complex, rubbly.
11 Bullwark - Catamount families - Rock land complex, rubbly.
12 Legault family - Rock land complex, stony.
13 Catamount family - Rock land - Bullwark family complex, rubbly.
14 Pachic Argiborolis - Aquolis complex.
15 unspecified in the USFS Soil and ELU Survey.
16 Cryaquolis - Cryoborolis complex.
17 Gateview family - Cryaquolis complex.
18 Rogert family, very stony.
19 Typic Cryaquolis - Borohemists complex.
20 Typic Cryaquepts - Typic Cryaquolls complex.
21 Typic Cryaquolls - Leighcan family, till substratum complex.
22 Leighcan family, till substratum, extremely bouldery.
23 Leighcan family, till substratum - Typic Cryaquolls complex.
24 Leighcan family, extremely stony.
25 Leighcan family, warm, extremely stony.
26 Granile - Catamount families complex, very stony.
27 Leighcan family, warm - Rock outcrop complex, extremely stony.
28 Leighcan family - Rock outcrop complex, extremely stony.
29 Como - Legault families complex, extremely stony.
30 Como family - Rock land - Legault family complex, extremely stony.
31 Leighcan - Catamount families complex, extremely stony.
32 Catamount family - Rock outcrop - Leighcan family complex, extremely stony.
33 Leighcan - Catamount families - Rock outcrop complex, extremely stony.
34 Cryorthents - Rock land complex, extremely stony.
35 Cryumbrepts - Rock outcrop - Cryaquepts complex.
36 Bross family - Rock land - Cryumbrepts complex, extremely stony.
37 Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.
38 Leighcan - Moran families - Cryaquolls complex, extremely stony.
39 Moran family - Cryorthents - Leighcan family complex, extremely stony.
40 Moran family - Cryorthents - Rock land complex, extremely stony.
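Since Wilderness_Area and Soil_Type arrive as 0/1 indicator columns (see the Data Fields above), a small optional preprocessing step is to collapse them back into single categorical columns. The sketch below uses a made-up three-row frame with only a few of the indicator columns:

    import pandas as pd

    df = pd.DataFrame({
        "Soil_Type1": [1, 0, 0],
        "Soil_Type2": [0, 1, 0],
        "Soil_Type3": [0, 0, 1],
        "Wilderness_Area1": [1, 0, 1],
        "Wilderness_Area2": [0, 1, 0],
    })

    soil_cols = [c for c in df.columns if c.startswith("Soil_Type")]
    wild_cols = [c for c in df.columns if c.startswith("Wilderness_Area")]

    # idxmax over the indicator columns recovers the single active category per row
    df["Soil_Type"] = df[soil_cols].idxmax(axis=1).str.replace("Soil_Type", "").astype(int)
    df["Wilderness_Area"] = (df[wild_cols].idxmax(axis=1)
                             .str.replace("Wilderness_Area", "").astype(int))
    print(df[["Soil_Type", "Wilderness_Area"]])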


Best Score: 1.0; Worst Score: 0.0000


8. Boehringer Ingelheim: Which drugs are effective?

Problem: Predict a biological response of molecules from their chemical properties

(Area: Molecular Biology)

The objective of the competition is to help us build as good a model as possible so that we can, as optimally as this data allows, relate molecular information to an actual biological response.

We have shared the data in the comma separated values (CSV) format. Each row in this data set represents a molecule. The first column contains experimental data describing an actual biological response; the molecule was seen to elicit this response (1), or not (0). The remaining columns represent molecular descriptors (d1 through d1776), these are calculated properties that can capture some of the characteristics of the molecule - for example size, shape, or elemental constitution. The descriptor matrix has been normalized.

The problem is to determine which molecular configurations are effective.

Data

The data is in the comma separated values (CSV) format. Each row in this data set represents a molecule. The first column contains experimental data describing a real biological response; the molecule was seen to elicit this response (1), or not (0). The remaining columns represent molecular descriptors (d1 through d1776); these are calculated properties that can capture some of the characteristics of the molecule - for example size, shape, or elemental constitution. The descriptor matrix has been normalized.

Overall accuracy: 76.16%


9. West Nile virus prediction

Problem: Predict West Nile virus in mosquitos across the city of Chicago

(Area: Public Health)

West Nile virus is most commonly spread to humans through infected mosquitos. Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever, to serious neurological illnesses that can result in death.

In 2002, the first human cases of West Nile virus were reported in Chicago. By 2004 the City of Chicago and the Chicago Department of Public Health (CDPH) had established a comprehensive surveillance and control program that is still in effect today.

Every week from late spring through the fall, mosquitos in traps across the city are tested for the virus. The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations.

Given weather, location, testing, and spraying data, this competition asks one to predict when and where different species of mosquitos will test positive for West Nile virus. A more accurate method of predicting outbreaks of West Nile virus in mosquitos will help the City of Chicago and CDPH more efficiently and effectively allocate resources towards preventing transmission of this potentially deadly virus.

Data

In this competition, one will be analyzing weather data and GIS data and predicting whether or not West Nile virus is present, for a given time, location, and species.

Every year from late-May to early-October, public health workers in Chicago set up mosquito traps scattered across the city. Every week from Monday through Wednesday, these traps collect mosquitos, and the mosquitos are tested for the presence of West Nile virus before the end of the week. The test results include the number of mosquitos, the mosquito species, and whether or not West Nile virus is present in the cohort.


Main dataset

These test results are organized in such a way that when the number of mosquitos exceeds 50, they are split into another record (another row in the dataset), such that the number of mosquitos is capped at 50.

The locations of the traps are described by the block number and street name. For convenience, we have mapped these attributes into Longitude and Latitude in the dataset. Please note that these are derived locations. For example, Block=79 and Street="W FOSTER AVE" gives us an approximate address of "7900 W FOSTER AVE, Chicago, IL", which translates to (41.974089, -87.824812) on the map.

Some traps are "satellite traps". These are traps that are set up near (usually within 6 blocks) an established trap to enhance surveillance efforts. Satellite traps are postfixed with letters. For example, T220A is a satellite trap to T220.
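Two small preprocessing steps follow from the description above: rows that were split at the 50-mosquito cap can be re-aggregated, and the parent of a satellite trap can be read off the trap id. A minimal pandas sketch with made-up rows:

    import pandas as pd

    records = pd.DataFrame({
        "Date":         ["2013-07-01"] * 3,
        "Trap":         ["T220", "T220", "T220A"],
        "Species":      ["CULEX PIPIENS"] * 3,
        "NumMosquitos": [50, 12, 7],      # the first two rows are one test split at the 50 cap
        "WnvPresent":   [0, 1, 0],
    })

    # Re-aggregate split rows: total mosquitos, virus present if present in any split row
    agg = (records.groupby(["Date", "Trap", "Species"], as_index=False)
           .agg(NumMosquitos=("NumMosquitos", "sum"),
                WnvPresent=("WnvPresent", "max")))

    # Satellite traps such as T220A map back to their parent trap T220
    agg["ParentTrap"] = agg["Trap"].str.extract(r"^(T\d+)", expand=False)
    print(agg)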

Spray Data

The City of Chicago also does spraying to kill mosquitos. We give the GIS data for their spray efforts in 2011 and 2013. Spraying can reduce the number of mosquitos in the area, and therefore might eliminate the appearance of West Nile virus.


Weather Data

It is believed that hot and dry conditions are more favorable for West Nile virus than cold and wet ones. We provide a NOAA dataset of the weather conditions from 2007 to 2014, during the months of the tests.

Station 1: CHICAGO O'HARE INTERNATIONAL AIRPORT (Lat: 41.995, Lon: -87.933, Elev: 662 ft. above sea level)
Station 2: CHICAGO MIDWAY INTL ARPT (Lat: 41.786, Lon: -87.752, Elev: 612 ft. above sea level)


File descriptions

• train.csv, test.csv - the training and test set of the main dataset. The training set consists of data from 2007, 2009, 2011, and 2013, while in the test set one is requested to predict the test results for 2008, 2010, 2012, and 2014.
  - Id: the id of the record
  - Date: date that the WNV test is performed
  - Address: approximate address of the location of trap. This is used to send to the GeoCoder.
  - Species: the species of mosquitos
  - Block: block number of address
  - Street: street name
  - Trap: Id of the trap
  - AddressNumberAndStreet: approximate address returned from GeoCoder
  - Latitude, Longitude: Latitude and Longitude returned from GeoCoder
  - AddressAccuracy: accuracy returned from GeoCoder
  - NumMosquitos: number of mosquitoes caught in this trap
  - WnvPresent: whether West Nile Virus was present in these mosquitos. 1 means WNV is present, and 0 means not present.
• spray.csv - GIS data of spraying efforts in 2011 and 2013
  - Date, Time: the date and time of the spray
  - Latitude, Longitude: the Latitude and Longitude of the spray
• weather.csv - weather data from 2007 to 2014. Column descriptions are in noaa_weather_qclcd_documentation.pdf.
• sampleSubmission.csv - a sample submission file in the correct format

Best Entry: 0.85991; Worst Entry: 0.40415


10. Caterpillar tube pricing

Problem: Model quoted prices for industrial tube assemblies

(Area: Logistics)

Caterpillar's signature bright yellow machinery is one of the first things one notices when walking past a construction site. Caterpillar sells an enormous variety of larger-than-life construction and mining equipment to companies across the globe. Each machine relies on a complex set of tubes (yes, tubes!) to keep the forklift lifting, the loader loading, and the bulldozer from dozing off.

Like snowflakes, it's difficult to find two tubes in Caterpillar's diverse catalogue of machinery that are exactly alike. Tubes can vary across a number of dimensions, including base materials, number of bends, bend radius, bolt patterns, and end types.

Currently, Caterpillar relies on a variety of suppliers to manufacture these tube assemblies, each having their own unique pricing model. This competition provides detailed tube, component, and annual volume datasets, and challenges participants to predict the price a supplier will quote for a given tube assembly.

The dataset is comprised of a large number of relational tables that describe the physical properties of tube assemblies.

The competition challenges participants to combine the characteristics of each tube assembly with supplier pricing dynamics in order to forecast a quote price for each tube. The quote price is labeled as cost in the data.

Data

File descriptions

train_set.csv and test_set.csv

This file contains information on price quotes from our suppliers. Prices can be quoted in 2 ways: bracket and non-bracket pricing. Bracket pricing has multiple levels of purchase based on quantity (in other words, the cost is given assuming a purchase of quantity tubes). Non-bracket pricing has a minimum order amount (min_order) for which the price would apply. Each quote is issued with an annual_usage, an estimate of how many tube assemblies will be purchased in a given year.

tube.csv

This file contains information on tube assemblies, which are the primary focus of the competition. Tube assemblies are made of multiple parts. The main piece is the tube, which has a specific diameter, wall thickness, length, number of bends and bend radius. Either end of the tube (End A or End X) typically has some form of end connection allowing the tube assembly to attach to other features. Special tooling is typically required for short end straight lengths (end_a_1x, end_a_2x refer to if the end length is less than 1 times or 2 times the tube diameter, respectively). Other components can be permanently attached to a tube, such as bosses, brackets or other custom features.

bill_of_materials.csv

This file contains the list of components, and their quantities, used on each tube assembly.

specs.csv

This file contains the list of unique specifications for the tube assembly. These can refer to materials, processes, rust protection, etc.

tube_end_form.csv

Some end types are physically formed utilizing only the wall of the tube. These are listed here.

components.csv

This file contains the list of all of the components used. Component_type_id refers to the category that each component falls under.

comp_[type].csv

These files contain the information for each component.

type_[type].csv

These files contain the names for each feature.
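Because the data is spread over relational tables, a first step is usually to join them onto the quotes. The sketch below assumes a shared tube_assembly_id key (not spelled out above) and uses made-up rows standing in for train_set.csv, tube.csv and bill_of_materials.csv:

    import pandas as pd

    train_set = pd.DataFrame({"tube_assembly_id": ["TA-0001", "TA-0002"],
                              "annual_usage": [10, 250],
                              "quantity": [1, 5],
                              "cost": [21.9, 12.3]})
    tube = pd.DataFrame({"tube_assembly_id": ["TA-0001", "TA-0002"],
                         "diameter": [6.35, 19.05],
                         "length": [137.0, 46.0],
                         "num_bends": [8, 3]})
    bill_of_materials = pd.DataFrame({"tube_assembly_id": ["TA-0001", "TA-0001", "TA-0002"],
                                      "component_id": ["C-1622", "C-1629", "C-1312"],
                                      "quantity": [2, 2, 1]})

    # Join tube geometry onto the quotes, then add a simple per-assembly component count
    data = train_set.merge(tube, on="tube_assembly_id", how="left")
    component_counts = (bill_of_materials.groupby("tube_assembly_id")["quantity"].sum()
                        .rename("n_components").reset_index())
    data = data.merge(component_counts, on="tube_assembly_id", how="left")
    print(data)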


11. San Francisco Crime Classification

Problem: Predict the category of crimes that occurred in the city by the bay

(Area: Crime)

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, one must predict the category of crime that occurred.

This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning weeks 1, 3, 5, 7, ... belong to the test set and weeks 2, 4, 6, 8, ... belong to the training set.

The problem is to build a model that predicts the category of crime that occurred in the city by the bay.


Data

Data fields

• Dates - timestamp of the crime incident
• Category - category of the crime incident (only in train.csv). This is the target variable to predict.
• Descript - detailed description of the crime incident (only in train.csv)
• DayOfWeek - the day of the week
• PdDistrict - name of the Police Department District
• Resolution - how the crime incident was resolved (only in train.csv)
• Address - the approximate street address of the crime incident
• X - Longitude
• Y - Latitude
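Since only a timestamp and a location are given, a typical first step is to expand the Dates field into calendar features. A minimal pandas sketch with two made-up rows:

    import pandas as pd

    train = pd.DataFrame({
        "Dates":      ["2015-05-13 23:53:00", "2015-05-13 08:30:00"],
        "Category":   ["WARRANTS", "LARCENY/THEFT"],
        "PdDistrict": ["NORTHERN", "SOUTHERN"],
        "X": [-122.4258, -122.4077],
        "Y": [37.7745, 37.7786],
    })

    train["Dates"] = pd.to_datetime(train["Dates"])
    train["Hour"]  = train["Dates"].dt.hour
    train["Month"] = train["Dates"].dt.month
    train["Year"]  = train["Dates"].dt.year
    print(train[["Dates", "Hour", "Month", "Year"]])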

***************


12. Airbnb New User Bookings

Problem: Where will a new guest book their first travel experience?

(Area: Tourism)

Instead of waking to overlooked "Do not disturb" signs, Airbnb travelers find themselves rising with the birds in a whimsical treehouse, having their morning coffee on the deck of a houseboat, or cooking a shared regional breakfast with their hosts.

New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand.

Data

In this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user's first booking destination will be. All the users in this dataset are from the USA.

There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL','DE', 'AU', 'NDF' (no destination found), and 'other'. Please note that 'NDF' is different from 'other' because 'other' means there was a booking, but is to a country not included in the list, while 'NDF' means there wasn't a booking.

The training and test sets are split by dates. In the test set, you will predict all the new users with first activities after 7/1/2014 (note: this is updated on 12/5/15 when the competition restarted). In the sessions dataset, the data only dates back to 1/1/2014, while the users dataset dates back to 2010.

File descriptions

• train_users.csv - the training set of users
• test_users.csv - the test set of users
• sample_submission.csv - correct format for submitting your predictions


Fields

• id: user id
• date_account_created: the date of account creation
• timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
• date_first_booking: date of first booking
• gender
• age
• signup_method
• signup_flow: the page a user came to sign up from
• language: international language preference
• affiliate_channel: what kind of paid marketing
• affiliate_provider: where the marketing is, e.g. google, craigslist, other
• first_affiliate_tracked: what's the first marketing the user interacted with before signing up
• signup_app
• first_device_type
• first_browser
• country_destination: this is the target variable you are to predict

sessions.csv - web sessions log for users
• user_id: to be joined with the column 'id' in the users table
• action
• action_type
• action_detail
• device_type
• secs_elapsed

countries.csv - summary statistics of destination countries in this dataset and their locations

age_gender_bkts.csv - summary statistics of users' age group, gender, country of destination
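Because the web session log is one row per action while the prediction is one row per user, a common step is to aggregate sessions.csv per user and join the result onto the users table. A minimal pandas sketch with made-up rows:

    import pandas as pd

    users = pd.DataFrame({"id": ["u1", "u2"],
                          "age": [35, 28],
                          "signup_method": ["basic", "facebook"]})
    sessions = pd.DataFrame({"user_id": ["u1", "u1", "u2"],
                             "action": ["search", "search", "lookup"],
                             "secs_elapsed": [320.0, 45.0, 600.0]})

    # Per-user aggregates of the session log
    per_user = (sessions.groupby("user_id")
                .agg(n_actions=("action", "size"),
                     total_secs=("secs_elapsed", "sum"))
                .reset_index())

    # Join onto the users table via id <-> user_id
    users = users.merge(per_user, left_on="id", right_on="user_id", how="left")
    print(users.drop(columns="user_id"))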


13. TFI: Restaurant Revenue Prediction

Problem: Predict annual restaurant sales based on objective measurements (Area: Strategic Planning)

With over 1,200 quick service restaurants across the globe, TFI (Tab Food Investments) is the company behind some of the world's most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures.

New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI (Tab Food Investments) to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.

TFI has provided a dataset with 137 restaurants in the training set, and a test set of 100000 restaurants. The data columns include the open date, location, city type, and three categories of obfuscated data: Demographic data, Real estate data, and Commercial data. The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis.


Data

File descriptions

• train.csv - the training set. Use this dataset for training your model.
• test.csv - the test set. To deter manual "guess" predictions, Kaggle has supplemented the test set with additional "ignored" data. These are not counted in the scoring.
• sampleSubmission.csv - a sample submission file in the correct format

Data fields

• Id: Restaurant id.
• Open Date: opening date for a restaurant
• City: City that the restaurant is in. Note that some names contain unicode characters.
• City Group: Type of the city. Big cities, or Other.
• Type: Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile
• P1, P2 - P37: There are three categories of these obfuscated data. Demographic data are gathered from third party providers with GIS systems. These include population in any given area, age and gender distribution, development scales. Real estate data mainly relate to the m2 of the location, front facade of the location, car park availability. Commercial data mainly include the existence of points of interest including schools, banks, other QSR operators.
• Revenue: The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. Please note that the values are transformed so they don't mean real dollar values.


14. Otto Group Product Classification Challenge

Problem: Classify products into the correct category (Area: e-commerce)

The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). We are selling millions of products worldwide every day, with several thousand products being added to our product line.

A consistent analysis of the performance of our products is crucial. However, due to our diverse global infrastructure, many identical products get classified differently. Therefore, the quality of our product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights we can generate about our product range.

Each row corresponds to a single product. There are a total of 93 numerical features, which represent counts of different events. All features have been obfuscated and will not be defined any further.

There are nine categories for all products. Each target category represents one of our most important product categories (like fashion, electronics, etc.). The products for the training and testing sets are selected randomly.

Data

• trainData.csv - the training set
• testData.csv - the test set
• sampleSubmission.csv - a sample submission file in the correct format

Data fields

• id - an anonymous id unique to a product
• feat_1, feat_2, ..., feat_93 - the various features of a product
• target - the class of a product
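Since there are nine target classes, a submission needs one probability per class for each product. The sketch below is only illustrative: simulated data stands in for the 93 count features, and the Class_1 ... Class_9 column naming is an assumption about the submission layout:

    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy stand-in: 93 count-like features, 9 classes
    X, y = make_classification(n_samples=300, n_features=93, n_informative=20,
                               n_classes=9, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # One probability column per class for the first five products
    proba = clf.predict_proba(X[:5])
    submission = pd.DataFrame(proba, columns=[f"Class_{i + 1}" for i in range(9)])
    submission.insert(0, "id", range(1, 6))
    print(submission.round(3))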


15. Walmart Recruiting - Store Sales Forecasting

Problem: Use historical markdown data to predict store sales (Area: Retail Sales)

One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line.

In this recruiting competition, job-seekers are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants must project the sales for each department in each store. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales, but it is challenging to predict which departments are affected and the extent of the impact.

You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department- wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
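The statement that holiday weeks carry five times the weight suggests a weighted mean absolute error. The sketch below is a plausible reading of that statement, not the official evaluation code; the numbers are made up:

    import numpy as np

    def weighted_mae(y_true, y_pred, is_holiday, holiday_weight=5.0):
        """Mean absolute error with holiday weeks weighted more heavily."""
        w = np.where(is_holiday, holiday_weight, 1.0)
        return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

    y_true = np.array([20000.0, 15000.0, 42000.0])
    y_pred = np.array([21000.0, 14000.0, 39000.0])
    is_holiday = np.array([False, False, True])   # say, the Thanksgiving week
    print(weighted_mae(y_true, y_pred, is_holiday))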

Data

stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of store.

train.csv

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

• Store - the store number
• Dept - the department number
• Date - the week
• Weekly_Sales - sales for the given department in the given store
• IsHoliday - whether the week is a special holiday week

test.csv

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

features.csv

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

• Store - the store number
• Date - the week
• Temperature - average temperature in the region
• Fuel_Price - cost of fuel in the region
• MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
• CPI - the consumer price index
• Unemployment - the unemployment rate
• IsHoliday - whether the week is a special holiday week

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13


16. Springleaf: Determine whether to send a direct mail piece to a customer

Problem: Predict which customers can be directly targeted (Area: Marketing/Sales)

Springleaf puts the humanity back into lending by offering their customers personal and auto loans that help them take control of their lives and their finances. Direct mail is one important way Springleaf's team can connect with customers who may be in need of a loan.

Direct offers provide huge value to customers who need them, and are a fundamental part of Springleaf's marketing strategy. In order to improve their targeted efforts, Springleaf must be sure they are focusing on the customers who are likely to respond and be good candidates for their services.

Using a large set of anonymized features, Springleaf is asking you to predict which customers will respond to a direct mail offer. You are challenged to construct new meta-variables and employ feature-selection methods to approach this dauntingly wide dataset.

You are provided a high-dimensional dataset of anonymized customer information. Each row corresponds to one customer. The response variable is binary and labeled "target". You must predict the target variable for every row in the test set.

The features have been anonymized to protect privacy and are comprised of a mix of continuous and categorical features. You will encounter many "placeholder" values in the data, which represent cases such as missing values. We have intentionally preserved their encoding to match with internal systems at Springleaf. The meaning of the features, their values, and their types are provided "as-is" for this competition; handling a huge number of messy features is part of the challenge here.
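A small first pass over such a wide, messy table is often to drop constant columns and count suspected placeholder codes. The sketch below uses a made-up frame and hypothetical placeholder values; the real codes would have to be read off the data itself:

    import pandas as pd

    train = pd.DataFrame({
        "VAR_0001": [3, 5, 3, 7],
        "VAR_0002": [-1, -1, -1, -1],        # constant column, carries no information
        "VAR_0003": [9999, 12, 9999, 34],    # 9999 used here as a missing-value code
        "target":   [0, 1, 0, 1],
    })

    features = train.drop(columns="target")

    # Columns with a single distinct value can be dropped outright
    constant_cols = [c for c in features.columns if features[c].nunique() == 1]

    # Share of suspected placeholder codes in each remaining column
    placeholders = {-1, 9999}                # hypothetical codes, for illustration only
    placeholder_share = features.drop(columns=constant_cols).isin(placeholders).mean()

    print("constant columns:", constant_cols)
    print(placeholder_share)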

Data

One is provided a high-dimensional dataset of anonymized customer information, as described above. The train.csv file is around 1 GB in size (145,231 rows x 1,934 columns); the test file is also around 1 GB (145,232 rows x 1,933 columns).

Results achieved

Results achieved are as below:


17. Santander Customer Satisfaction

Problem: Which customers are happy customers?

From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.

Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.

In this competition, one will work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience.

Data Files

File name               Available Formats
Sample_submission.csv   .zip (175.67 kb)
test.csv                .zip (3.31 mb)
train.csv               .zip (3.34 mb)

You are provided with an anonymized dataset containing a large number of numeric variables. The "TARGET" column is the variable to predict. It equals 1 for unsatisfied customers and 0 for satisfied customers.

The task is to predict the probability that each customer in the test set is an unsatisfied customer.

File descriptions

1. train.csv - the training set including the target
2. test.csv - the test set without the target
3. sample_submission.csv - a sample submission file in the correct format

Results achieved

The problem was solved using glm (binomial family) model of sparkR (ver 1.6.1). Results achieved are as below:

*************


18. Influencers in Social Networks

Problem: Predict which people are influential in a social network

Data Science London and the UK Windows Azure Users Group in partnership with Microsoft and Peerindex, announce the Influencers in Social Networks competition as part of The Big Data Hackathon. This competition asks you to predict human judgments about who is more influential on social media.

The dataset, provided by Peerindex, comprises a standard, pair-wise preference learning task. Each data point describes two individuals, A and B. For each person, 11 pre-computed, non-negative numeric features based on twitter activity (such as volume of interactions, number of followers, etc.) are provided.

The binary label represents a human judgment about which one of the two individuals is more influential. A label '1' means A is more influential than B; '0' means B is more influential than A. The goal of the challenge is to train a machine learning model which, for pairs of individuals, predicts the human judgment on who is more influential with high accuracy. Labels for the dataset have been collected by PeerIndex.
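One common formulation for such pairwise preference data (an illustration, not necessarily the approach used here) is to learn on the difference between A's and B's feature vectors. A self-contained sketch with simulated features:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_pairs, n_feat = 200, 11                   # 11 Twitter-activity features per person

    A = rng.gamma(2.0, 1.0, size=(n_pairs, n_feat))
    B = rng.gamma(2.0, 1.0, size=(n_pairs, n_feat))
    # Pretend the first feature (say, follower volume) drives the human judgment
    y = (A[:, 0] > B[:, 0]).astype(int)         # 1 -> A judged more influential

    # Pairwise-preference trick: model the difference of the two (log-scaled) feature vectors
    X = np.log1p(A) - np.log1p(B)
    model = LogisticRegression().fit(X, y)
    print("training accuracy:", model.score(X, y))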

Kaggle Reference: https://www.kaggle.com/c/predict-who-is-more-influential-in-a-social-network

Data Files

File name               Available Formats
Sample_submission.csv   .csv (102.85 kb)
test.csv                .csv (1.29 mb)
train.csv               .csv (1.20 mb)

The dataset, provided by Peerindex, comprises a standard, pair-wise preference learning task. Each data point describes two individuals. Pre-computed, standardized features based on twitter activity (such as volume of interactions, number of followers, etc.) is provided for each individual.

The discrete label represents a human judgement about which one of the two individuals is more influential. The goal of the challenge is to train a machine learning model which, for a pair of individuals, predicts the human judgement on who is more influential with high accuracy. Labels for the dataset have been collected by PeerIndex.

Keywords: Social media analytics; twitter analytics; social networks


19. Predicting Red Hat Business Value

Problem: Classifying customer potential

Like most companies, Red Hat is able to gather a great deal of information over time about the behavior of individuals who interact with them. They’re in search of better methods of using this behavioral data to predict which individuals they should approach—and even when and how to approach them.

In this competition, Kagglers are challenged to create a classification algorithm that accurately identifies which customers have the most potential business value for Red Hat based on their characteristics and activities.

With an improved prediction model in place, Red Hat will be able to more efficiently prioritize resources to generate more business and better serve their customers.

Data Files

File Name               Available Formats
people.csv              .zip (3.22 mb)
sample_submission.csv   .zip (1.18 mb)
act_test.csv            .zip (4.03 mb)
act_train.csv           .zip (17.07 mb)

This competition uses two separate data files that may be joined together to create a single, unified data table: a people file and an activity file.


The people file contains all of the unique people (and the corresponding characteristics) that have performed activities over time. Each row in the people file represents a unique person. Each person has a unique people_id.

The activity file contains all of the unique activities (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents a unique activity performed by a person on a certain date. Each activity has a unique activity_id.

The challenge of this competition is to predict the potential business value of a person who has performed a specific activity. The business value outcome is defined by a yes/no field attached to each unique activity in the activity file. The outcome field indicates whether or not each person has completed the outcome within a fixed window of time after each unique activity was performed.

The activity file contains several different categories of activities. Type 1 activities are different from type 2-7 activities because there are more known characteristics associated with type 1 activities (nine in total) than type 2-7 activities (which have only one associated characteristic).

To develop a predictive model with this data, one has to merge the files together into a single data set. The two files can be joined together using people_id as the common key. All variables are categorical, with the exception of 'char_38' in the people file, which is a continuous numerical variable.
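A minimal pandas sketch of that merge (the values are made up; only the people_id key and the outcome field follow the description above):

    import pandas as pd

    people = pd.DataFrame({"people_id": ["ppl_1", "ppl_2"],
                           "char_1": ["type 2", "type 1"],
                           "char_38": [36, 72]})
    activities = pd.DataFrame({"people_id": ["ppl_1", "ppl_1", "ppl_2"],
                               "activity_id": ["act_1", "act_2", "act_3"],
                               "activity_category": ["type 1", "type 4", "type 2"],
                               "outcome": [0, 1, 1]})

    # One unified table: every activity row enriched with its person's characteristics
    data = activities.merge(people, on="people_id", how="left")
    print(data)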

(Refer: https://www.kaggle.com/c/predicting-red-hat-business-value/data )

******************
