Relationship Analysis between User’s Contexts and Real Input Words through Twitter

Yutaka Arakawa, Shigeaki Tagashira and Akira Fukuda
Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan
744, Motooka, Nishi-ku, Fukuoka, Fukuoka, JAPAN, 819-0395
Email: arakawa, shigeaki, [email protected]

Abstract—In this paper, we propose a method to evaluate the effectiveness of our proposed context-aware text entry by using Twitter. We focus on “geo-tagged” public tweets because they include a user’s important contexts: real location and time. We also focus on TV program listings because 50% of iPhone traffic in Japan is generated at home, where people often tweet while watching TV. A cyclical collecting system based on the Streaming API and the Search API of Twitter is proposed for gathering the target tweets efficiently. In order to find the relationship between a user’s contexts and the words actually used, we compare the actually tweeted words with words obtained from the Local Search API of Yahoo! Japan, which is used for our context-aware text entry, and with words obtained from TV program listings. We analyze 471274 tweets collected from 15 December 2009 to 10 June 2010 to identify their relationship to landmark information and TV programs. As a result, we show that 5.1% of tweets include landmark words, and 9% of tweets include TV program words. Additionally, we show that there are location-dependent words and time-dependent words.

I. INTRODUCTION

A recent survey in Japan [1] found that over fifty percent of Internet users access the Internet from mobile devices. Among them, over eighty percent access information not through the hierarchical menus of official sites but through search engines such as Google. Moreover, current mobile devices can use not only text messaging but also web mail such as Gmail, which means that users have occasion to input long texts. The increase of text input on mobile devices drives the demand for improving text entry.

Recent mobile phones are generally equipped with clever text entry that has a predictive transform function. This function consists of dictionaries, syntactic parsing, and learning. When a user inputs “a”, it picks up the words that start with “a” from a dictionary and recommends candidate words that seem appropriate according to syntactic parsing. In addition, it sorts the candidates based on the user’s input history, such as frequency or the time stamp of the latest use. Recently, iWnn [2], a Japanese text entry adopted in many kinds of mobile phones, suggests more appropriate words according to the current season, the time, the body of a received e-mail, and superior-subordinate relationships. As another approach that differs from text entry, “Google Suggest” provides often-used keyword combinations for optimizing search terms and reducing keystrokes.

Meanwhile, we have proposed context-aware text entry [3], which can suggest useful words based on the user’s context such as location, presence, and time. Among the above-mentioned three components of predictive transform in text entry, we mainly focus on how to build the dictionary dynamically. The reason is that a word not included in the dictionary cannot be recommended, and we are not specialists in philological syntactic parsing. In our proposed system, based on the user’s current contexts, the dictionary in the mobile phone is updated periodically in cooperation with a dictionary creation server on the Internet, which generates the user’s current dictionary dynamically by using several public Web APIs. In our current prototype, “location” and “time” are adopted as the user’s context: landmark names surrounding the user are added based on the user’s location, and TV programs’ titles and performers’ names are added based on “time”. The reason why we use TV programs is that 50% of iPhone traffic in Japan is transmitted through home WiFi networks. As a result, if you input “H” near the venue of Globecom2010, the system may suggest “Hyatt Regency Miami” as one of the candidates. If you input “J” or “K” while watching “24”, the system may return “Jack Bauer” or “Kiefer Sutherland” respectively. We have already constructed an OpenWnn-based prototype system on an Android terminal [4]. We have also tried several system architectures and have confirmed that one of them can achieve a sufficient response time [5]. In addition, we are evaluating our system through demonstrations and questionnaires with several people.

Although our system looks effective, there is an important remaining issue. Since we started this research with the assumption that such a system would be convenient for us, there is no evidence or quantitative evaluation demonstrating its effectiveness. It is hard to gather a large amount of results through questionnaire-based evaluation. In addition, if we log the sentences actually input on a mobile phone, we must consider privacy protection in relation to personal data.

In this paper, we propose a method for evaluating context-aware text entry by using Twitter. Twitter is a microblogging service on the Internet, where a short message of up to 140 characters, called a tweet, can be posted. These tweets are generally open to the public. We focus on Twitter because 1) we can obtain a huge number of public strings from various users, and 2) the Geotagging API released in November 2009 enables users to add a geocode to each tweet. This means that we can extensively and publicly collect real input sentences that include users’ real locations. Therefore, we think that analyzing the collected data clarifies the relationship between users’ contexts and real input words. As a result, it can show the effectiveness of our proposed text entry quantitatively.

First, we construct a tweet collecting system that obtains Japanese tweets with geocodes, where we effectively combine two APIs of Twitter, the Streaming API and the Search API, to gather a huge number of tweets. Our system has already gathered half a million tweets since 15 December 2009. Next, we analyze the collected data by comparing it with data obtained from other APIs. In this paper, we use the “Yahoo! Local Search API [6]” for obtaining landmark information, and use a “TV program listing on the Internet” for obtaining TV programs’ titles and performers’ names. These are the same APIs used for making the dictionary in our context-aware text entry. In our relationship analysis, both data sets are separated into words by using the “Yahoo! Japanese language morphological analysis API” and the “Yahoo! Key phrase extraction API”.

As a result, we show that 5.1% of tweets include landmark words, and 9% of tweets include TV program words. Additionally, we show that there are location-dependent words and time-dependent words. The rest of the paper is organized as follows. Section II presents our previously proposed context-aware input method editor. Section III explains Twitter. Section IV explains the relationship analysis. Finally, results are shown in Section V.

II. CONTEXT-AWARE TEXT ENTRY FOR MOBILE PHONE

Fig. 1 shows typical service examples in which our proposed context-aware text entry works effectively. It indicates that the importance of words varies with location (i.e., the user’s context). For example, nearby station names are used at stations, landmark names are used at new places, and product names are used at bookstores and electronics retail stores. The most characteristic point is that the dictionary is automatically and dynamically updated by mashing up public Web APIs on the Internet. A nearest-station API can be used for obtaining the names of stations near the user. Also, a landmark information API can be adopted to search for landmark names surrounding the user.

Fig. 1. Typical Effective Examples of context-aware text entry

In addition, we introduced a learning process into this system. If a user selects a word among the suggested candidates at a station, the system judges that it may be used at the same place in the future. If the word is not used, it judges that the word is not useful at this place. By repeating these learning processes cyclically, the words related to a certain place come to be suggested there, and normal words are suggested in other places.

A. System architecture

The architecture of the prototype system is shown in Figure 2. It is composed of three parts: the local device, our server on the Internet, and general web services on the Internet. The local device has various sensors such as GPS and an acceleration sensor. In our prototype system, we use a PC as the local device and adopt the Google Maps API in place of a GPS sensor for setting the user’s location visually.

Fig. 2. The architecture of prototype system

The internal server in the center of Figure 2 is the main part of our proposed system. It collects information, estimates the user’s context, creates the dynamic dictionary, and suggests words by utilizing the user’s context. These functions could be built on the local device. However, we place them on a server on the global network because not only the accuracy of the estimation algorithm but also the processing speed is important. Besides, we design the system so that collecting sensor information and inputting text on the local device work asynchronously. The dictionary is updated whenever the location changes. As a result, the local device only searches a pre-constructed database when text is input. This architecture prevents the processing speed from slowing down when the number of external web servers increases.

The external servers on the right of Figure 2 are not our servers but are provided by several companies. This prototype system cooperates with the Yahoo Local Search API, the Google Maps API, and the Gurunavi API. Words provided by these APIs are the materials for the personal context-aware dictionary.

We develop two prototype systems. One is an extension of OpenWnn for Android; the other is an ATOK Direct plug-in for

Windows and Mac. “ATOK [7]” is one of the major text entry systems in Japan, along with Microsoft’s text entry.

B. Remaining Issue

Since we started this research with the assumption that such a system would be convenient for us, there is no evidence or quantitative evaluation demonstrating its effectiveness. It is hard to gather a large amount of results through questionnaire-based evaluation. In addition, if we log the sentences actually input on a mobile phone, we must consider privacy protection in relation to personal data.

III. TWITTER

Twitter [8] is one of the major microblogging and social networking services today, in which a short message of up to 140 characters, called a tweet, can be posted. Tweets are generally open to the public as a “public timeline”. Since Twitter releases many kinds of APIs for general users, we can obtain other users’ tweets through these APIs. In this paper, we use the Streaming API and the Search API for obtaining tweets.

The Streaming API, officially released in January 2010, allows near-realtime access to users’ tweet timelines. Tweets created by a public account are candidates for inclusion in the Streaming API. However, the Streaming API only provides randomly sampled tweets, which amount to about 10% of all tweets.

The Search API allows us to search tweets with a query in which we can set parameters such as target text, language, user ID, geocode, and time span. In this paper, we use this API for collecting the past tweets of users who have once posted with a geocode. How to combine these two APIs is described in the following section.

IV. RELATIONSHIP ANALYSIS

For analyzing the relationship between users’ contexts and the sentences actually input, we construct a collecting system for Twitter and compare the collected tweets with landmark information obtained from the Yahoo! Local Search API [6] and TV information obtained from an online TV program listing [9]. These are the same APIs used for making the dictionary in our previously proposed context-aware text entry.

A. Cyclical collecting system for Twitter

We construct a tweet collecting system that obtains Japanese tweets tagged with a geocode, where we effectively combine two APIs of Twitter, the Streaming API and the Search API, to gather a huge number of tweets. Fig. 3 shows the cyclical tweet collecting system based on the Streaming API and the Search API. The reason for using two APIs is as follows. Since tweets obtained through the Streaming API are written in various languages, we need to filter them and pick up the target tweets, which are written in Japanese and have a geocode, as shown on the left side of Fig. 3. As a result, we obtain less than 1% of the tweets. If a user wants to add a geocode to his or her own tweets, the user must have a client that can tag the user’s current location through the Twitter API. In other words, a user who has tagged once is likely to post other tagged tweets. Therefore, we pick up the IDs of users who posted a geo-tagged tweet, and we collect their past tweets through the Search API cyclically.

Fig. 3. Cyclical collecting system based on Streaming API and Search API

B. Matching Process

A tweet consists of three data items: time information, location information, and the input text, as shown in Fig. 4. From the location information and the “Yahoo! Local Search API”, we pick up the names of the surrounding landmarks within one kilometer of the user’s current location. Examples of typical landmarks are stations, city halls, schools, hospitals, post offices, and so on. From the time information and the TV program listing, we pick up performers’ names and TV programs’ titles. As target channels, we adopt 12 key stations in the Tokyo area and the Fukuoka area. Fukuoka, where our university is located, is one of the major cities in western Japan. Since it is hard to collect past TV program listings and the obtainable data is extremely large, we analyze only about one month of data (from 7 January 2010 to 2 February 2010).

Fig. 4. Flow of relationship analysis

All the data are separated into words by using the “Yahoo! Japanese language morphological analysis API” and the “Yahoo! Key phrase extraction API”. After that, we compare these words with each other and evaluate the matching rate.

Fig. 5. Distribution of collected tweets per day (start: 15 Dec. 2009; end: 10 June 2010; the initial period used only the Streaming API)

Fig. 6. Geographical distribution of a location independent word: Noodle
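The matching process of Section IV-B can be illustrated with a minimal sketch. It assumes that each tweet is a (time, latitude, longitude, text) record, following the layout illustrated in Fig. 4, and that the landmark list has already been fetched. The function names, the landmark coordinates, and the whitespace-style tokenizer are hypothetical stand-ins: the paper uses the Yahoo! Local Search API and the Yahoo! morphological analysis API, neither of which is called here. The matching rate is taken to be the fraction of tweets that share at least one word with the landmarks within one kilometer, which is one plausible reading of the text rather than the authors' exact definition.

```python
import math
import re

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def tokenize(text):
    """Hypothetical stand-in for morphological analysis: lower-cased word set."""
    return {w.lower() for w in re.findall(r"\w+", text)}


def nearby_landmark_words(lat, lon, landmarks, radius_km=1.0):
    """Collect the words of all landmark names within radius_km of a position."""
    words = set()
    for name, l_lat, l_lon in landmarks:
        if haversine_km(lat, lon, l_lat, l_lon) <= radius_km:
            words |= tokenize(name)
    return words


def matching_rate(tweets, landmarks, radius_km=1.0):
    """Fraction of tweets sharing at least one word with nearby landmark names."""
    if not tweets:
        return 0.0
    hits = 0
    for _time, lat, lon, text in tweets:
        if tokenize(text) & nearby_landmark_words(lat, lon, landmarks, radius_km):
            hits += 1
    return hits / len(tweets)


# Example records in the (time, lat, lon, text) layout of Fig. 4.
SAMPLE_TWEETS = [
    ("2010-06-28T17:04:25", 34.54324, 131.234234, "Honda and Matsui, Good job!! #worldcup"),
    ("2010-06-28T17:05:32", 33.59723, 130.217793, "I am staying Hyatt Regency Miami."),
]
# Assumed landmark position, chosen only so that it lies near the second tweet.
SAMPLE_LANDMARKS = [("Hyatt Regency Miami", 33.5975, 130.2180)]

print(matching_rate(SAMPLE_TWEETS, SAMPLE_LANDMARKS))
```

In this sketch, common words inside landmark names (for example “of” in “Bank of America”) would also count as matches, so a real analysis would likely need a stop-word filter or the key phrase extraction step mentioned above.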

V. RESULTS

Fig. 5 shows the distribution of the 471274 tweets collected from 15 December 2009 to 10 June 2010. The geographical scope of the tweets is limited to the Japan area, which is the area from latitude 24 north and longitude 123 east to latitude 46 north and longitude 146 east. This limitation is due to the limitation of the Yahoo! Local Search API, which is provided only by Yahoo! Japan. Since we used only the Streaming API at first, the number of tweets in the first month is 10 to 100 times smaller than in subsequent periods. This indicates that our proposed cyclical tweet collecting system is very effective.

The average length of the collected tweets is 48.8 characters, and tweets of about 30 characters are the most common. From these results, we think that abbreviated notation is often used in tweets. The average and maximum numbers of landmarks obtained at a certain position from the Yahoo! Local Search API are 22.9 and 71 respectively. For 10.2% of positions, no landmark information can be obtained from this API. Maximum number of landmarks per position is 66. Meanwhile, the average and maximum numbers of words obtained from the TV program listing are 149.1 words/hour and 790 words/hour respectively, which is about 10 times larger than for landmark information.

The percentage of tweets including the words obtained according to the tweeted position is 5

Fig. 7. Geographical distribution of location dependent words: Shibuya, Shinjuku

Fig. 8. Distribution of a time dependent word: Ryoma-den (per day). “Ryoma-den” is a TV drama currently broadcast on NHK at 20 o’clock every Sunday.

Fig. 9. Distribution of a time dependent word: Ryoma-den (per hour). “Ryoma-den” is a TV drama currently broadcast on NHK at 20 o’clock every Sunday.

Finally, we refer to the dependency on time and location. Fig. 6 shows the geographical distribution of tweets that include “noodle”. Since the plots are widely distributed all over Japan, the word “noodle” can be determined to be a location-independent word. On the other hand, we notice that each plot type (circle and plus) in Fig. 7 is concentrated in a certain area. In this figure, circle plots and plus plots show the geographical distribution of tweets that include “Shinjuku” and “Shibuya” respectively. The centers of the concentrated areas are Shinjuku station

and Shibuya station of JR (Japan Railways). From this result, the words “Shinjuku” and “Shibuya” can be defined as location-dependent words.

Fig. 8 and Fig. 9 show the distribution of tweets that include “Ryoma-den” per day and per hour respectively. “Ryoma-den” is a popular TV program currently broadcast by Japan Broadcasting Corporation (NHK) at 20 o’clock every Sunday. As shown in Fig. 8, the number of tweets on Sundays is obviously larger than on other days of the week. Also, we can see that the number of tweets at 20 o’clock is remarkably larger than in other time slots. As a result, the word “Ryoma-den” highly depends on time.

We are now picking up other typical words that highly depend on either time or location. We hope that picking up such context-aware words will improve our context-aware text entry system.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a cyclical tweet collecting system and have collected nearly half a million geo-tagged tweets written in Japanese for analyzing the relationship between users’ context (location and time) and real input words. We have collected 471274 tweets since 15 December 2009. Statistical analysis shows that 5.1% of tweets include landmark words, and 9% of tweets include TV program words. Additionally, geographical mapping provides evidence of the location/time dependence of real input words. As a first step, we have focused on Japanese tweets, but this relationship must exist regardless of language. We are now trying to pick up the words with high location dependency by calculating the geographical distribution ratio.

ACKNOWLEDGMENT

This work was carried out under the joint research program of the NTT Service Integration Laboratories and the National Institute of Informatics. It was performed using the facilities provided by them.

REFERENCES

[1] rTYPE. (2009) Survey of mobile web site. (in Japanese). [Online]. Available: http://release.center.jp/2008/11/0502.html (last access: 2009/12/1)
[2] OMRON SOFTWARE. (2009) iwnn. (in Japanese). [Online]. Available: http://www.omronsoft.co.jp/SP/mobile/iwnn/ (last access: 2009/12/1)
[3] S. Suematsu, Y. Arakawa, S. Tagashira, and A. Fukuda, “Network-based context-aware input method editor,” in The Sixth International Conference on Networking and Services (ICNS 2010), 7 March 2010, pp. 1–6.
[4] Y. Arakawa, S. Suematsu, S. Tagashira, Y. Yamaguchi, Y. Tanaka, and A. Fukuda, “Implementation of network-based context-aware editor,” in IEICE Technical Reports, ser. MoMuC2009-58, vol. 109, no. 380, 21 January 2010, pp. 31–34, (in Japanese).
[5] S. Suematsu, Y. Arakawa, S. Tagashira, and A. Fukuda, “On improvement of response time for network-based context-aware japanese input method editor,” in IEICE General Conference, no. B-15-18, 19 March 2010, (in Japanese).
[6] Yahoo Japan Corporation., “Yahoo! developer network- map,” http://developer.yahoo.co.jp/webapi/map/ (last access: 2009/12/1), 2009, (in Japanese).
[7] JustSystems Corporation, “Atok.com,” http://www.atok.com/ (last access: 2009/12/1), 2009, (in Japanese).
[8] Twitter, “Twitter,” http://twitter.com/.
[9] Toshiba, “Net de navi,” http://tvsurf.jp/tv/.
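As a supplementary illustration of the time-dependence analysis in Section V (Fig. 8 and Fig. 9), the per-weekday and per-hour counts of tweets containing a given word could be computed along the following lines. This is only a sketch under stated assumptions: the (timestamp, text) record layout, the function name, and the sample tweets are hypothetical and are not taken from the collected data set, and a plain substring test stands in for the morphological analysis used in the paper.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def time_distribution(tweets, word):
    """Count tweets containing `word`, grouped by weekday name and by hour."""
    by_weekday = Counter()
    by_hour = Counter()
    needle = word.lower()
    for timestamp, text in tweets:
        if needle in text.lower():
            t = datetime.fromisoformat(timestamp)
            by_weekday[WEEKDAYS[t.weekday()]] += 1
            by_hour[t.hour] += 1
    return by_weekday, by_hour


# Made-up sample tweets; 9 May 2010 and 16 May 2010 were Sundays.
SAMPLE = [
    ("2010-05-09T20:05:00", "Watching Ryoma-den now"),
    ("2010-05-09T20:30:00", "Ryoma-den is great"),
    ("2010-05-11T09:00:00", "Read about Ryoma-den"),
    ("2010-05-12T12:00:00", "Lunch time"),
    ("2010-05-16T20:10:00", "ryoma-den again"),
]

by_weekday, by_hour = time_distribution(SAMPLE, "Ryoma-den")
print(by_weekday.most_common())
print(by_hour.most_common())
```

A word whose counts concentrate on one weekday and one hour, as “Ryoma-den” does in Fig. 8 and Fig. 9, would be a candidate time-dependent word under this kind of counting.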