Information Universe, Fall 2011
Project Proposal
2010-23381 YeRim Choi

Finding the topics most relevant to fluctuations in stock price data

1. Goal
  - Find people's major interests depending on the state of the stock market
  - Examine the relationship between the word counts of search queries and stock prices

2. Model
  1) For each query, extract keywords
  2) Count the number of appearances of each keyword per day
  3) Find the relationship between the daily word counts and the stock price
  4) Select the keywords that have a high correlation with the stock price
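The four steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the proposed implementation: keywords are taken to be plain whitespace tokens (the proposal plans an SVM for that step), the relationship is measured with the Pearson correlation coefficient, and the function names and the 0.8 threshold are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def daily_keyword_counts(queries):
    """queries: iterable of (date, query_text) pairs.

    Returns {keyword: Counter({date: count})}. Keywords are simply
    whitespace tokens here (an assumption made for this sketch).
    """
    counts = defaultdict(Counter)
    for date, text in queries:
        for word in text.lower().split():
            counts[word][date] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_keywords(counts, prices, threshold=0.8):
    """prices: {date: price}. Keep keywords whose |r| >= threshold."""
    dates = sorted(prices)
    price_series = [prices[d] for d in dates]
    selected = {}
    for word, per_day in counts.items():
        # Days on which the keyword never appeared count as zero.
        r = pearson([per_day.get(d, 0) for d in dates], price_series)
        if abs(r) >= threshold:
            selected[word] = r
    return selected
```

Calling daily_keyword_counts on the parsed query log and passing the result to select_keywords together with a {date: price} mapping returns the keywords that pass the threshold along with their correlation coefficients.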

3. Data
  - AOL Search Data
  - NASDAQ Exchange Daily 1970-2010

4. Data Description
  - AOL Search Data
    i. Overview
       The AOL Search Data is a collection of real query logs from real users. The data set consists of 20M web queries collected from 650k users over three months. The data is sorted by anonymous user ID and arranged sequentially.
    ii. History
       In August 2006, AOL released the search data. Within days, the company realized that this was a mistake, withdrew the data and made a public apology. Many copies of the data set were made before it was withdrawn, and it is still available for download on some sites.
    iii. Format: {AnonID, Query, QueryTime, ItemRank, ClickURL}
       AnonID – an anonymous user ID number.
       Query – the query issued by the user, case shifted with most punctuation removed.
       QueryTime – the time at which the query was submitted for search.
       ItemRank – if the user clicked on a search result, the rank of the item on which they clicked is listed.
       ClickURL – if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.
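As an illustration of the record format above, one log line could be parsed as follows. This is a sketch under assumptions that the proposal does not state: fields are taken to be tab-separated, QueryTime is taken to look like "2006-03-01 09:30:00", and ItemRank and ClickURL may be missing when the user did not click a result. The names QueryRecord and parse_aol_line are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class QueryRecord(NamedTuple):
    anon_id: int
    query: str
    query_time: datetime
    item_rank: Optional[int]   # None when no result was clicked
    click_url: Optional[str]   # None when no result was clicked


def parse_aol_line(line):
    """Parse one tab-separated log line (assumed layout) into a QueryRecord."""
    fields = line.rstrip("\n").split("\t")
    # Pad missing click fields so that unpacking always succeeds.
    fields += [""] * (5 - len(fields))
    anon_id, query, query_time, item_rank, click_url = fields[:5]
    return QueryRecord(
        anon_id=int(anon_id),
        query=query,
        # Assumed timestamp format; adjust if the files differ.
        query_time=datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S"),
        item_rank=int(item_rank) if item_rank else None,
        click_url=click_url or None,
    )
```

The date part of query_time (record.query_time.date()) is what would be used to group queries by day for the daily word counts.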

  - NASDAQ Exchange Daily 1970-2010
    i. Overview
       Historical NASDAQ stock data from 1970 to 2010, including daily open, close, low, high and trading volume figures. The data is organized alphabetically by ticker symbol.
    ii. Format: {exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close}
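A possible way to read one ticker's prices from this data and turn them into day-to-day fluctuations is sketched below. It assumes a comma-separated file whose header row uses exactly the field names listed above, and dates written as YYYY-MM-DD so that sorting them as strings gives chronological order. Using the adjusted close is a choice made for this sketch, and the function names are hypothetical.

```python
import csv


def load_close_prices(csv_file, symbol):
    """Return {date: adjusted close} for one ticker symbol.

    csv_file is any open text file (or file-like object) with a header row.
    """
    prices = {}
    for row in csv.DictReader(csv_file):
        if row["stock_symbol"] == symbol:
            prices[row["date"]] = float(row["stock_price_adj_close"])
    return prices


def daily_changes(prices):
    """Return {date: price minus the previous trading day's price}."""
    dates = sorted(prices)  # assumes ISO-formatted dates (YYYY-MM-DD)
    return {cur: prices[cur] - prices[prev]
            for prev, cur in zip(dates, dates[1:])}
```

Either the price level or the daily change could then be paired by date with the keyword counts taken from the query log.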

5. Method
  - Keyword extraction using a Support Vector Machine
  - Regression between keyword counts and stock price
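The regression step could, in its simplest form, be an ordinary least-squares fit of the stock price on the daily count of a single keyword. The sketch below assumes that simple linear model, which the proposal does not specify, and the function name fit_line is hypothetical. The SVM-based keyword extraction is not shown, since it would need labelled training examples that are not described here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    xs: daily counts of one keyword; ys: stock prices on the same days.
    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("keyword count is constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2: share of the price variance explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot if ss_tot else 0.0
    return slope, intercept, r_squared
```

Keywords could then be ranked by r_squared, with the sign of the slope showing whether a keyword's count rises or falls with the price.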

6. Plan

  Milestones: ~11/2, ~11/16, ~11/30, ~12/14
  Tasks: Keyword extraction, Number count, Relationship discovery, Keyword selection, Documentation