1st Symposium on Information Management and Big Data

8th, 9th and 10th September 2014 Cusco - Peru

PROCEEDINGS

© SIMBig, Andina University of Cusco and the authors of individual articles. Proceedings Editors: J. A. Lossio-Ventura and H. Alatrista-Salas

TABLE OF CONTENTS

SIMBig 2014 Conference Organization Committee
SIMBig 2014 Conference Reviewers
SIMBig 2014 Paper Contents
SIMBig 2014 Keynote Speaker Presentations
Full and Short Papers

The SIMBig 2014 Organization Committee confirms that full and concise papers accepted for this publication:
• Meet the definition of research in relation to creativity, originality, and increasing humanity's stock of knowledge;
• Are selected on the basis of a peer review process that is an independent, qualified expert review;
• Are published and presented at a conference having national and international significance as evidenced by registrations and participation; and
• Are made available widely through the Conference web site.

Disclaimer: The SIMBig 2014 Organization Committee accepts no responsibility for omissions and errors.

SIMBig 2014 Organization Committee

GENERAL CHAIR

• Juan Antonio LOSSIO VENTURA, LIRMM, University of Montpellier 2, Montpellier, France
• Hugo ALATRISTA SALAS, UMR TETIS, Irstea, France

LOCAL CHAIR

• Cristhian GANVINI VALCARCEL, Andina University of Cusco, Peru
• Armando FERMIN PEREZ, National University of San Marcos, Peru
• Cesar A. BELTRAN CASTAÑON, GRPIAA, Pontifical Catholic University of Peru
• Marco RIVAS PEÑA, National University of San Marcos, Peru


SIMBig 2014 Conference Reviewers

• Salah Ait-Mokhtar, Xerox Research Centre Europe, FRANCE
• Jérôme Azé, LIRMM - University of Montpellier 2, FRANCE
• Cesar A. Beltrán Castañón, GRPIAA - Pontifical Catholic University of Peru, PERU
• Sandra Bringay, LIRMM - University of Montpellier 3, FRANCE
• Oscar Corcho, Ontology Engineering Group - Polytechnic University of Madrid, SPAIN
• Gabriela Csurka, Xerox Research Centre Europe, FRANCE
• Mathieu d'Aquin, Knowledge Media Institute (KMi) - Open University, UK
• Ahmed A. A. Esmin, Federal University of Lavras, BRAZIL
• Frédéric Flouvat, PPME Labs - University of New Caledonia, NEW CALEDONIA
• Hakim Hacid, Alcatel-Lucent, Bell Labs, FRANCE
• Dino Ienco, Irstea, FRANCE
• Clement Jonquet, LIRMM - University of Montpellier 2, FRANCE
• Eric Kergosien, Irstea, FRANCE
• Pierre-Nicolas Mougel, Hubert Curien Labs - University of Saint-Etienne, FRANCE
• Phan Nhat Hai, Oregon State University, USA
• Thomas Opitz, University of Montpellier 2 - LIRMM, FRANCE
• Jordi Nin, Barcelona Supercomputing Center (BSC) - Technical University of Catalonia (BarcelonaTECH), SPAIN
• Gabriela Pasi, Information Retrieval Lab - University of Milan Bicocca, ITALY
• Miguel Nuñez del Prado Cortez, Intersec Labs, FRANCE
• José Manuel Perea Ortega, European Commission - Joint Research Centre (JRC), Ispra, ITALY
• Yoann Pitarch, IRIT, FRANCE
• Pascal Poncelet, LIRMM - University of Montpellier 2, FRANCE
• Julien Rabatel, LIRMM, FRANCE
• Mathieu Roche, Cirad - TETIS and LIRMM, FRANCE
• Nancy Rodriguez, LIRMM - University of Montpellier 2, FRANCE
• Hassan Saneifar, Raja University, IRAN
• Nazha Selmaoui-Folcher, PPME Labs - University of New Caledonia, NEW CALEDONIA
• Maguelonne Teisseire, Irstea - LIRMM, FRANCE
• Boris Villazon-Terrazas, iLAB Research Center - iSOCO, SPAIN
• Pattaraporn Warintarawej, Prince of Songkla University, THAILAND
• Osmar Zaïane, Department of Computing Science, University of Alberta, CANADA

SIMBig 2014 Paper Contents

Title and Authors Page

Keynote Speaker Presentations (Pascal Poncelet, Mathieu Roche, Hakim Hacid, Mathieu d'Aquin) ... 7

Long Papers
Identification of Opinion Leaders Using Text Mining Technique in Virtual Community, Chihli Hung and Pei-Wen Yeh ... 8
Quality Metrics for Optimizing Parameters Tuning in Clustering Algorithms for Extraction of Points of Interest in Human Mobility, Miguel Nunez Del Prado Cortez and Hugo Alatrista-Salas ... 14
A Cloud-based Exploration of Open Data: Promoting Transparency and Accountability of the Federal Government of Australia, Richard Sinnott and Edwin Salvador ... 22
Discovery Of Sequential Patterns With Quantity Factors, Karim Guevara Puente de La Vega and César Beltrán Castañón ... 33
Online Courses Recommendation based on LDA, Rel Guzman Apaza, Elizabeth Vera Cervantes, Laura Cruz Quispe and Jose Ochoa Luna ... 42

Short Papers
ANIMITEX project: Image Analysis based on Textual Information, Hugo Alatrista-Salas, Eric Kergosien, Mathieu Roche and Maguelonne Teisseire ... 49
A case study on Morphological Data from Eimeria of Domestic Fowl using a Multiobjective Genetic Algorithm and R&P for Learning and Tuning Fuzzy Rules for Classification, Edward Hinojosa Cárdenas and César Beltrán Castañón ... 53
SIFR Project: The Semantic Indexing of French Biomedical Data Resources, Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche and Maguelonne Teisseire ... 58

Spanish Track
Mathematical modelling of the performance of a computer system, Felix Armando Fermin Perez ... 62
Clasificadores supervisados para el análisis predictivo de muerte y sobrevida materna, Pilar Hidalgo ... 67


SIMBig 2014 Keynote Speakers

Mining Social Networks: Challenges and Directions
Pascal Poncelet, LIRMM, Montpellier, France
Professor Poncelet's talk focused on the analysis of data associated with social networks. In his presentation, Professor Poncelet summarized several techniques within the knowledge discovery in databases process.

NLP approaches: How to Identify Relevant Information in Social Networks?
Mathieu Roche, UMR TETIS, Irstea, France
In this talk, Dr. Roche outlined recent trends in text mining tasks. Several techniques (sentiment analysis, opinion mining, entity recognition, ...) applied to different corpora (tweets, blogs, SMS, ...) were detailed in this presentation.

Social Network Analysis: Overview and Applications
Hakim Hacid, Zayed University, Dubai, UAE
Dr. Hacid presented a complete overview of techniques for exploiting social web data. Techniques such as social network analysis, social information retrieval and mashups were presented.

Putting Intelligence in Web Data, with Examples from Education
Mathieu d'Aquin, Knowledge Media Institute, The Open University, UK
Dr. d'Aquin presented several web mining approaches applied to education. In this talk, Dr. d'Aquin detailed themes such as intelligent web information and knowledge processing and the semantic web, among others.

Identification of Opinion Leaders Using Text Mining Technique in Virtual Community

Chihli Hung and Pei-Wen Yeh
Department of Information Management, Chung Yuan Christian University, Taiwan 32023, R.O.C.
[email protected], [email protected]

Abstract

Word of mouth (WOM) affects the buying behavior of information receivers more strongly than advertisements. Opinion leaders further affect others in a specific domain through their new information, ideas and opinions. Identification of opinion leaders has become one of the most important tasks in the field of WOM mining. Existing work to find opinion leaders is based mainly on quantitative approaches, such as social network analysis and involvement. Opinion leaders often post knowledgeable and useful documents. Thus, the contents of WOM are useful to mine opinion leaders as well. This research proposes a text mining-based approach to evaluate the features of expertise, novelty and richness of information from the contents of posts for the identification of opinion leaders. According to experiments on a real-world bulletin board data set, the proposed approach demonstrates high potential in identifying opinion leaders.

1 Introduction

This research identifies opinion leaders using the technique of text mining, since opinion leaders affect other members via word of mouth (WOM) on social networks. WOM, as defined by Arndt (1967), is an oral person-to-person communication between an information receiver and a sender, who exchange experiences of a brand, a product or a service for a non-commercial purpose. The Internet provides human beings with a new way of communication. Thus, WOM influences consumers more quickly, broadly, widely and significantly, and consumers are further influenced by other consumers without any geographic limitation (Flynn et al., 1996).

Nowadays, making buying decisions based on WOM has become one of the collective decision-making strategies. It is natural that all kinds of human groups have opinion leaders, explicitly or implicitly (Zhou et al., 2009). Opinion leaders usually have a stronger influence on other members through their new information, ideas and representative opinions (Song et al., 2007). Thus, how to identify opinion leaders has increasingly attracted the attention of both practitioners and researchers.

As opinion leadership is a relationship between members in a society, many existing opinion leader identification tasks define opinion leaders by analyzing the entire opinion network in a specific domain, based on the technique of social network analysis (SNA) (Kim, 2007; Kim and Han, 2009). This technique depends on the relationship between initial publishers and followers. A member with the greatest value of network centrality is considered as an opinion leader in this network (Kim, 2007).

However, a junk post does not present useful information. A WOM post with new ideas is more interesting. A spam link usually wastes readers' time. A long post is generally more useful than a short one (Agarwal et al., 2008). A focused document is more significant than a vague one. That is, different documents may have different influences on readers due to their quality of WOM. WOM documents per se can also be a major indicator for recognizing opinion leaders. However, such quantitative approaches, i.e. number-based or SNA-based methods, ignore the quality of WOM and only include quantitative contributions of WOM.

Expertise, novelty, and richness of information are three important features of opinion leaders, which are obtained from WOM documents (Kim and Han, 2009). Thus, this research proposes a text mining-based approach in order to identify opinion leaders in a real-world bulletin board system.

Besides this section, this paper is organized as follows. Section 2 gives an overview of features of opinion leaders. Section 3 describes the proposed text mining approach to identify opinion leaders. Section 4 describes the data set, experiment design and results. Finally, a conclusion and further research work are given in Section 5.

2 Features of Opinion Leaders

The term "opinion leader", proposed by Katz and Lazarsfeld (1957), comes from the field of communication. Based on their research, the influence of an advertising campaign on a political election is smaller than that of opinion leaders. This is similar to findings in product and service markets. Although advertising may increase recognition of products or services, word of mouth disseminated via personal relations in social networks has a greater influence on consumer decisions (Arndt, 1967; Khammash and Griffiths, 2011). Thus, it is important to identify the characteristics of opinion leaders.

According to the work of Myers and Robertson (1972), opinion leaders may have the following seven characteristics. Firstly, opinion leadership in a specific topic is positively related to the quantity of output of the leader, who talks about, knows and is interested in that topic. Secondly, people who influence others are themselves influenced by others in the same topic. Thirdly, opinion leaders usually have more innovative ideas in the topic. Fourthly and fifthly, opinion leadership is positively related to overall leadership and to an individual's social leadership. Sixthly, opinion leaders usually know more about demographic variables in the topic. Finally, opinion leaders are domain dependent. Thus, an opinion leader influences others in a specific topic in a social network. He or she knows more about this topic and publishes more new information.

Opinion leaders usually play a central role in a social network. The characteristics of typical network hubs usually cover six aspects: they are ahead in adoption, connected, travelers, information-hungry, vocal, and exposed to media more than others (Rosen, 2002). Ahead in adoption means that network hubs may not be the first to adopt new products, but they are usually ahead of the rest of the network. Connected means that network hubs play an influential role in a network, such as an information broker among various different groups. Traveler means that network hubs usually love to travel in order to obtain new ideas from other groups. Information-hungry means that network hubs are expected to provide answers to others in their group, so they pursue lots of facts. Vocal means that network hubs love to share their opinions with others and get responses from their audience. Exposed to media means that network hubs open themselves to more communication from mass media, and especially to print media. Thus, a network hub or an opinion leader is not only an influential node but also a novelty early adopter, generator or spreader. An opinion leader has rich expertise in a specific topic and loves to be involved in group activities.

As members in a social network influence each other, the degree centrality of members and their involvement in activities are useful to identify opinion leaders (Kim and Han, 2009). Inspired by the PageRank technique, which is based on the link structure (Page et al., 1998), OpinionRank was proposed by Zhou et al. (2009) to rank members in a network. Jiang et al. (2013) proposed an extended version of PageRank based on sentiment analysis and MapReduce. Agarwal et al. (2008) identified influential bloggers through four aspects: recognition, activity generation, novelty and eloquence. An influential blog is recognized by others when this blog has a lot of in-links. The feature of activity generation is measured by how many comments a post receives and the number of posts it initiates. Novelty means novel ideas, which may attract many in-links from the blogs of others. Finally, the feature of eloquence is evaluated by the length of a post. A lengthy post is treated as an influential post.

Li and Du (2011) determined the expertise of authors and readers according to the similarity between their posts and a pre-built term ontology. However, both the information novelty and the influential position features are dependent on linkage relationships between blogs. We propose a novel text mining-based approach and compare it with several quantitative approaches.

3 Quality Approach - Text Mining

Contents of word of mouth contain lots of useful information, which is highly related to important features of opinion leaders. Opinion leaders usually provide knowledgeable and novel information in their posts (Rosen, 2002; Song et al., 2007). An influential post is often eloquent (Keller and Berry, 2003). Thus, expertise, novelty, and richness of information are important characteristics of opinion leaders.

3.1 Preprocessing

This research uses a traditional Chinese text mining process, including Chinese word segmenting, part-of-speech filtering and removal of stop words for the data set of documents. As a single Chinese character is very ambiguous, segmenting Chinese documents into proper Chinese words is necessary (He and Chen, 2008). This research uses the CKIP service (http://ckipsvr.iis.sinica.edu.tw/) to segment Chinese documents into proper Chinese words with their suitable part-of-speech tags. Based on these processes, 85 words are organized into controlled vocabularies, as this approach is efficient for capturing the main concepts of documents (Gray et al., 2009).

3.2 Expertise

Expertise can be evaluated by comparing members' posts with the controlled vocabulary base (Li and Du, 2011). For member i, words are collected from his or her posted documents and the member vector is represented as f_i = (w_1, w_2, ..., w_j, ..., w_N), where w_j denotes the frequency of word j used in the posted documents of user i and N denotes the number of words in the controlled vocabulary. We then normalize the member vector by his or her maximum frequency of any significant word, m_i. The degree of expertise can be calculated by the Euclidean norm as shown in (1):

exp_i = || f_i / m_i ||,    (1)

where ||·|| is the Euclidean norm.

3.3 Novelty

We utilize the Google Trends service (http://www.google.com/trends) to obtain the first-search time tag for significant words in documents. Thus, each significant word has its specific time tag taken from the Google search repository. For example, the first-search time tag for the search term Nokia N81 is 2007, and for Nokia Windows Phone 8 it is 2011. We define three degrees of novelty evaluated by the interval between the first-search year of a significant word and the collection year of our targeted document set, i.e. 2010. A significant word belongs to normal novelty if the interval is equal to two years. A significant word with an interval of less than two years belongs to high novelty and one with an interval greater than two years belongs to low novelty. We then summarize all novelty values based on the significant words used by a member in a social network. The equation of novelty for member i is shown in (2):

nov_i = (e_h + 0.66 e_m + 0.33 e_l) / (e_h + e_m + e_l),    (2)

where e_h, e_m and e_l are the numbers of words that belong to the groups of high, normal and low novelty, respectively.

3.4 Richness of Information

In general, a long document offers some useful information to its readers (Agarwal et al., 2008). Thus, the richness of information of posts can be used for the identification of opinion leaders. We use both textual information and multimedia information to represent the richness of information as (3):

ric = d + g,    (3)

where d is the total number of significant words that the user uses in his or her posts and g is the total number of multimedia objects that the user posts.

3.5 Integrated Text Mining Model

Finally, we integrate expertise, novelty and richness of information from the content of posted documents. As each feature has its own distribution and range, we normalize each feature to a value between 0 and 1. Thus, the weight of an opinion leader based on the quality of posts becomes the average of these three features as (4):

ITM = (Norm(nov) + Norm(exp) + Norm(ric)) / 3.    (4)

4 Experiments

4.1 Data Set

Due to the lack of an available benchmark data set, we crawl WOM documents from the Mobile01 bulletin board system (http://www.mobile01.com/), which is one of the most popular online discussion forums in Taiwan. This bulletin board system allows its members to contribute their opinions free of charge and its contents are available to the public. A bulletin board system generally has an organized structure of topics. This organized structure provides people who are interested in the same or similar topics with an online discussion forum that forms a social network. Finding opinion leaders on bulletin boards is important since they contain a lot of focused WOM. In our initial experiments, we collected 1537 documents, which were initiated by 1064 members and attracted 9192 followers, who posted 19611 opinions on those initial posts. In this data set, the total number of participants is 9460.

4.2 Comparison

As we use real-world data, which has no ground truth about opinion leaders, a user-centered evaluation approach should be used to compare the differences between models (Kritikopoulos et al., 2006). In our research, there are 9460 members in this virtual community. We suppose that ten of them have a high possibility of being opinion leaders.

As the identification of opinion leaders is treated as one of the important tasks of social network analysis (SNA), we compare the proposed model (i.e. ITM) with three famous SNA approaches: degree centrality (DEG), closeness centrality (CLO) and betweenness centrality (BET). Involvement (INV) is an important characteristic of opinion leaders (Kim and Han, 2009). The number of documents that a member initiates plus the number of derivative documents by other members is treated as involvement.

Thus, we have one qualitative model, i.e. ITM, and four quantitative models, i.e. DEG, CLO, BET and INV. We put the top ten rankings from each model in a pool of potential opinion leaders. Duplicate members are removed and 25 members are left. We request 20 human testers, who have used and are familiar with Mobile01. In our questionnaire, quantitative information is provided, such as the number of documents that the potential opinion leaders initiate and the number of derivative documents that are posted by other members. For the qualitative information, a maximum of three documents from each member are provided randomly to the testers. The top 10 rankings are also considered as opinion leaders based on human judgment.

4.3 Results

We suppose that ten of the 9460 members are considered as opinion leaders. We collect the top 10 ranking members from each model and remove duplicates. We request 20 human testers to identify 10 opinion leaders from the 25 potential opinion leaders obtained from the five models. According to the experiment results in Table 1, the proposed model outperforms the others. This shows the significance of the documents per se. Even though INV is a very simple approach, it performs much better than the social network analysis models, i.e. DEG, CLO and BET. One possible reason is the sparse network structure: many sub-topics are in the bulletin board system, so these topics form several isolated sub-networks.

Table 1: Results of models evaluated by recall, precision, F-measure and accuracy.
Model | Recall | Precision | F-measure | Accuracy
DEG   | 0.45   | 0.50      | 0.48      | 0.56
CLO   | 0.36   | 0.40      | 0.38      | 0.48
BET   | 0.64   | 0.70      | 0.67      | 0.72
INV   | 0.73   | 0.80      | 0.76      | 0.80
ITM   | 0.82   | 0.90      | 0.86      | 0.88
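The quantitative baselines used in this comparison could be reproduced roughly as follows. This is a sketch rather than the authors' implementation: it assumes the bulletin board reply network is available as (initial publisher, follower) edge pairs and uses the networkx library for the centrality measures.

```python
import networkx as nx

def centrality_rankings(reply_edges, top_k=10):
    """Rank members by the three SNA baselines (DEG, CLO, BET).

    reply_edges: iterable of (initial_publisher, follower) pairs; an
    undirected reply graph is assumed here, which may differ from the
    graph construction actually used in the paper.
    """
    g = nx.Graph()
    g.add_edges_from(reply_edges)
    rankings = {}
    for name, scores in [("DEG", nx.degree_centrality(g)),
                         ("CLO", nx.closeness_centrality(g)),
                         ("BET", nx.betweenness_centrality(g))]:
        rankings[name] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return rankings

def involvement_ranking(initiated, derived, top_k=10):
    # INV: number of documents a member initiates plus the number of
    # derivative documents posted by other members on those threads.
    inv = {m: initiated.get(m, 0) + derived.get(m, 0)
           for m in set(initiated) | set(derived)}
    return sorted(inv, key=inv.get, reverse=True)[:top_k]
```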

5 Conclusions and Further Work

Word of mouth (WOM) has a powerful effect on consumer behavior. Opinion leaders have a stronger influence on other members in an opinion society. How to find opinion leaders has been of interest to both practitioners and researchers. Existing models mainly focus on quantitative features of opinion leaders, such as the number of posts and the central position in the social network. This research considers this issue from the viewpoint of text mining. We propose an integrated text mining model by extracting three important features of opinion leaders, namely novelty, expertise and richness of information, from documents. Finally, we compare this proposed text mining model with four quantitative approaches, i.e. involvement, degree centrality, closeness centrality and betweenness centrality, evaluated by human judgment. In our experiments, we found that the involvement approach is the best one among the quantitative approaches. The text mining approach outperforms its quantitative counterparts, as the richness of document information provides a similar function to the qualitative features of opinion leaders. The proposed text mining approach further measures opinion leaders based on the features of novelty and expertise.

In terms of possible future work, integrated strategies combining qualitative and quantitative approaches could take advantage of both. For example, a 2-step integrated strategy, which uses the text mining-based approach in the first step and the quantitative approach based on involvement in the second step, may achieve better performance. Larger scale experiments, including more topics, documents and testers, should be carried out in order to produce more general results.

References

Agarwal, N., Liu, H., Tang, L. and Yu, P. S. 2008. Identifying the Influential Bloggers in a Community. Proceedings of WSDM, 207-217.

Arndt, J. 1967. Role of Product-Related Conversations in the Diffusion of a New Product. Journal of Marketing Research, 4(3):291-295.

Flynn, L. R., Goldsmith, R. E. and Eastman, J. K. 1996. Opinion Leaders and Opinion Seekers: Two New Measurement Scales. Academy of Marketing Science.

He, J. and Chen, L. 2008. Chinese Word Segmentation Based on the Improved Particle Swarm Optimization of Neural Networks. Proceedings of IEEE Cybernetics and Intelligent Systems, 695-699.

Jiang, L., Ge, B., Xiao, W. and Gao, M. 2013. BBS Opinion Leader Mining Based on an Improved PageRank Algorithm Using MapReduce. Proceedings of Chinese Automation Congress, 392-396.

Katz, E. and Lazarsfeld, P. F. 1957. Personal Influence. New York: The Free Press.

Keller, E. and Berry, J. 2003. The Influentials: One American in Ten Tells the Other Nine How to Vote, Where to Eat, and What to Buy. The Free Press.

Khammash, M. and Griffiths, G. H. 2011. Arrivederci CIAO.com, Buongiorno Bing.com - Electronic Word-of-Mouth (eWOM), Antecedences and Consequences. International Journal of Information Management, 31:82-87.

Kim, D. K. 2007. Identifying Opinion Leaders by Using Social Network Analysis: A Synthesis of Opinion Leadership Data Collection Methods and Instruments. PhD Thesis, the Scripps College of Communication, Ohio University.

Kim, S. and Han, S. 2009. An Analytical Way to Find Influencers on Social Networks and Validate their Effects in Disseminating Social Games. Proceedings of Advances in Social Network Analysis and Mining, 41-46.

Kritikopoulos, A., Sideri, M. and Varlamis, I. 2006. BlogRank: Ranking Weblogs Based on Connectivity and Similarity Features. Proceedings of the 2nd International Workshop on Advanced Architectures and Algorithms for Internet Delivery and Applications, Article 8.

Li, F. and Du, T. C. 2011. Who Is Talking? An Ontology-Based Opinion Leader Identification Framework for Word-of-Mouth Marketing in Online Social Blogs. Decision Support Systems, 51:190-197.

Myers, J. H. and Robertson, T. S. 1972. Dimensions of Opinion Leadership. Journal of Marketing Research, 4:41-46.

Page, L., Brin, S., Motwani, R. and Winograd, T. 1998. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford University.

Rosen, E. 2002. The Anatomy of Buzz: How to Create Word of Mouth Marketing, 1st ed., Doubleday.

Song, X., Chi, Y., Hino, K. and Tseng, B. L. 2007. Identifying Opinion Leaders in the Blogosphere. Proceedings of CIKM'07, 971-974.

Zhou, H., Zeng, D. and Zhang, C. 2009. Finding Leaders from Opinion Networks. Proceedings of the 2009 IEEE International Conference on Intelligence and Security Informatics, 266-268.

Quality Metrics for Optimizing Parameters Tuning in Clustering Algorithms for Extraction of Points of Interest in Human Mobility

Miguel Núñez del Prado Cortez, Peru I+D+I, Technopark IDI, [email protected]
Hugo Alatrista-Salas, GRPIAA Labs., PUCP, Peru I+D+I, [email protected]

Abstract

Clustering is an unsupervised learning technique used to group a set of elements into non-overlapping clusters based on some predefined dissimilarity function. In our context, we rely on clustering algorithms to extract points of interest in human mobility as an inference attack for quantifying the impact of a privacy breach. Thus, we focus on the selection of input parameters for the clustering algorithm, which is not a trivial task due to the direct impact of these parameters on the result of the attack. Namely, if we use too relaxed parameters we will have too many points of interest, but if we use a too restrictive set of parameters, we will find too few groups. Accordingly, to solve this problem, we propose a method to select the best parameters to extract the optimal number of POIs based on quality metrics.

1 Introduction

The first step in inference attacks over mobility traces is the extraction of the points of interest (POIs) from a trail of mobility traces. Indeed, this phase directly impacts the global accuracy of an inference attack that relies on POI extraction. For instance, if an adversary wants to discover Alice's home and place of work, the result of the extraction must be as accurate as possible, otherwise they can confuse or simply not find important places. In addition, for a more sophisticated attack such as next place prediction, a mistake when extracting POIs can significantly decrease the global precision of the inference. Most of the extraction techniques use heuristics and clustering algorithms to extract POIs from location data. On the one hand, heuristics rely on the dwell time, which is the loss of signal when a user gets into a building. Another used heuristic is the residence time, which represents the time that a user spends at a particular place. On the other hand, clustering algorithms group nearby mobility traces into clusters.

In particular, in the context of POI extraction, it is important to find a suitable set of parameters for a specific clustering algorithm in order to obtain a good accuracy as a result of the clustering. The main contribution of this paper is a methodology to find an "optimal" configuration of input parameters for a clustering algorithm based on quality indices. This optimal set of parameters allows us to have the appropriate number of POIs in order to perform further inference attacks. This paper is organized as follows. First, we present some related works on parameter estimation techniques in Section 2. Afterwards, we describe the clustering algorithms used to perform the extraction of points of interest (POIs) as well as the metrics to measure the quality of the formed clusters in Sections 3 and 4, respectively. Then, we introduce the method to optimize the choice of the parameters in Section 5. Finally, Section 6 summarizes the results and presents future directions.

2 Related works

Most of the previous works estimate the parameters of the clustering algorithms for point of interest extraction by using empirical approaches or highly computationally expensive methods. For illustration purposes, we consider two classical clustering approaches, K-means (MacQueen et al., 1967) and DBSCAN (Ester et al., 1996).

In the former clustering algorithm, the main issue is how to determine k, the number of clusters. Therefore, several approaches have been proposed to address this issue (Hamerly and Elkan, 2003; Pham et al., 2005). The latter algorithm relies on the OPTICS algorithm (Ankerst et al., 1999), which searches the space of parameters of DBSCAN in order to find the optimal number of clusters. The more parameters a clustering algorithm has, the bigger the combinatorial space of parameters is. Nevertheless, these methods to calibrate clustering algorithm inputs do not guarantee a good accuracy for extracting meaningful POIs. In the next section, we describe the clustering algorithms used in our study.

3 Clustering algorithms for extraction of points of interest

To perform the POI extraction, we rely on the following clustering algorithms:

3.1 Density Joinable Cluster (DJ-Cluster)

DJ-Cluster (Zhou et al., 2004) is a clustering algorithm taking as input a minimal number of points minpts, a radius r and a trail of mobility traces M. This algorithm works in two phases. First, the pre-processing phase discards all the moving points (i.e. those whose speed is above ε, for ε a small value) and then squashes series of repeated static points into a single occurrence for each series. Next, the second phase clusters the remaining points based on neighborhood density. More precisely, the number of points in the neighborhood must be equal to or greater than minpts and these points must be within radius r from the medoid of the set of points, where the medoid is the real point m that minimizes the sum of distances from m to the other points in the cluster. Then, the algorithm merges the new cluster with the already computed clusters that share at least one common point. Finally, during the merging, the algorithm erases the old computed clusters and only keeps the new cluster, which contains all the other merged clusters.

3.2 Density Time Cluster (DT-Cluster)

DT-Cluster (Hariharan and Toyama, 2004) is an iterative clustering algorithm taking as input a distance threshold d, a time threshold t and a trail of mobility traces M. First, the algorithm starts by building a cluster C composed of all the consecutive points within distance d from each other. Afterwards, the algorithm checks whether the accumulated time of the mobility traces between the youngest and the oldest one is greater than the threshold t. If it is the case, the cluster is created and added to the list of POIs. Finally, as a post-processing step, DT-Cluster merges the clusters whose medoids are less than d/3 far from each other.

3.3 Time Density (TD-Cluster)

Introduced in (Gambs et al., 2011), TD-Cluster is a clustering algorithm inspired from DT-Cluster, which takes as input parameters a radius r, a time window t, a tolerance rate τ, a distance threshold d and a trail of mobility traces M. The algorithm starts by iteratively building clusters from a trail M of mobility traces that are located within the time window t. Afterwards, for each cluster, if a fraction of the points (above the tolerance rate τ) are within radius r from the medoid, the cluster is integrated into the list of clusters outputted, whereas otherwise it is simply discarded. Finally, as for DT-Cluster, the algorithm merges the clusters whose medoids are less than d far from each other.

3.4 Begin-end heuristic

The objective of the begin and end location finder inference attack (Gambs et al., 2010) is to take as meaningful points the first and last points of a journey. More precisely, this heuristic considers that the beginning and ending locations of a user, for each working day, might convey some meaningful information.

Since we have introduced the different clustering algorithms used to extract points of interest, we present in the next section the indices to measure the quality of the clusters.
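As an illustration of how one of these extraction steps might look in practice, the following is a simplified sketch of the DT-Cluster loop described above (consecutive points within distance d, a minimum accumulated time t, and a medoid-based merge at d/3); the planar Euclidean distance and the data layout are assumptions, not the authors' implementation.

```python
from math import dist  # Euclidean distance; a geodesic distance would be used on raw GPS data

def medoid(points):
    # The actual point minimizing the sum of distances to the other points.
    return min(points, key=lambda m: sum(dist(m, x) for x in points))

def dt_cluster(traces, d, t):
    """Simplified DT-Cluster sketch. traces: time-ordered (timestamp, (x, y))
    tuples; d: distance threshold; t: time threshold (same unit as timestamps)."""
    clusters, current = [], [traces[0]]
    for ts, pt in traces[1:]:
        if dist(pt, current[-1][1]) <= d:
            current.append((ts, pt))                  # consecutive point within d
        else:
            if current[-1][0] - current[0][0] >= t:   # enough accumulated time
                clusters.append([p for _, p in current])
            current = [(ts, pt)]
    if current[-1][0] - current[0][0] >= t:
        clusters.append([p for _, p in current])

    # Post-processing: merge clusters whose medoids are closer than d/3.
    merged = []
    for c in clusters:
        for m in merged:
            if dist(medoid(c), medoid(m)) < d / 3:
                m.extend(c)
                break
        else:
            merged.append(c)
    return merged
```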

4 Cluster quality indices

One aspect of POI-extraction inference attacks is the quality of the obtained clusters, which impacts the precision and recall of the attack. In the following subsections we describe some metrics to quantify how accurate or "how good" the outcome of the clustering task is. Intuitively, a good clustering is one that identifies a group of clusters that are well separated from each other, compact and representative. Table 1 summarizes the notation used in this section.

Table 1: Summary of notations
Symbol | Definition
C | An ensemble of clusters.
c_i | The i-th cluster of C.
n_c | The number of clusters in C.
m_i | The medoid point of the i-th cluster.
d(x, y) | The Euclidean distance between x and y.
|c_i| | The number of points in a cluster c_i.
m′ | The closest point to the medoid m_i.
m′′ | The second closest point to the medoid m_i.
|C| | The total number of points in the set C.

4.1 Intra-inter cluster ratio

The intra-inter cluster ratio (Hillenmeyer, 2012) measures the relation between compact (Equation 1) and well separated (Equation 3) groups. More precisely, we first take the intra-cluster distance, which is the average distance from each point in a cluster c_i to its medoid m_i:

DIC(c_i) = (1 / (|c_i| - 1)) Σ_{x_j ∈ c_i, x_j ≠ m_i} d(x_j, m_i).    (1)

Then, the average intra-cluster distance is computed using Equation 2:

AVG_DIC(C) = (1 / n_c) Σ_{c_i ∈ C} DIC(c_i).    (2)

Afterwards, the mean distance among all medoids (DOC) in the clustering C is computed using Equation 3:

DOC(C) = (1 / (n_c × (n_c - 1))) Σ_{c_i ∈ C} Σ_{c_j ∈ C, i ≠ j} d(m_i, m_j).    (3)

Finally, the intra-inter ratio rii is given by Equation 4 as the average intra-cluster distance divided by the inter-cluster distance:

rii(C) = AVG_DIC(C) / DOC(C).    (4)

The intra-inter ratio has an approximately linear complexity in the number of points and gives low values to well separated and compact clusterings.

4.2 Additive margin

This index is inspired by Ben-David and Ackerman's k-additive Point Margin (K-AM) metric (Ben-David and Ackerman, 2008), which evaluates how well centered clusters are. We measure the difference between the distances from the medoid m_i to its two closest points m′ and m′′ for a given group c_i belonging to a clustering C (Equation 5):

K-AM(c_i) = d(m_i, m′′_i) - d(m_i, m′_i).    (5)

Once the average of the k-additive point margins over all groups c_i in a clustering C is computed, we take the ratio between this average and the minimal intra-cluster distance (Equation 1), as shown in Equation 6:

AM(C) = ( (1 / n_c) Σ_{c_i ∈ C} K-AM(c_i) ) / min_{c_i ∈ C} DIC(c_i).    (6)

The additive margin method has a linear complexity in the number of clusters. This metric gives a high value for well centered clusters.

4.3 Information loss

The information loss ratio is a metric inspired by the work of Solé and coauthors (Solé et al., 2012). The basic idea is to measure the percentage of information that is lost when representing the original data only by a certain number of groups (e.g., when we represent the POIs by the cluster medoids instead of the whole set of points). To evaluate the percentage of information loss, we compute the sum of the distances of each point x_j to its medoid m_i over all clusters c_i ∈ C, as shown in Equation 7:

SSE(C) = Σ_{c_i ∈ C} Σ_{x_j ∈ c_i} d(x_j, m_i).    (7)

Then, we estimate the accumulated distance of all points of a trail of mobility traces in the clustering C to a global centroid (GC) using Equation 8:

SST(C) = Σ_{x_i ∈ C} d(x_i, GC).    (8)

Finally, the ratio between the aforementioned distances is computed using Equation 9, which gives the information loss ratio:

IL(C) = SSE(C) / SST(C).    (9)

The computation of this ratio has a linear complexity. The lower the value of this ratio, the more representative the clusters are.
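The three indices above could be computed along the following lines; this is a sketch only, assuming clusters are lists of (x, y) points with pre-computed medoids and using planar Euclidean distances.

```python
from math import dist

def dic(cluster, m):
    # Equation (1): average distance from each non-medoid point to the medoid.
    others = [x for x in cluster if x != m]
    return sum(dist(x, m) for x in others) / len(others)

def intra_inter_ratio(clusters, medoids):
    # Equations (2)-(4): average intra-cluster distance over the mean
    # pairwise distance between medoids; low values are better.
    n = len(clusters)
    avg_dic = sum(dic(c, m) for c, m in zip(clusters, medoids)) / n
    doc = sum(dist(mi, mj) for i, mi in enumerate(medoids)
              for j, mj in enumerate(medoids) if i != j) / (n * (n - 1))
    return avg_dic / doc

def additive_margin(clusters, medoids):
    # Equations (5)-(6): average k-additive point margin over the minimal
    # intra-cluster distance; high values indicate well centered clusters.
    def k_am(cluster, m):
        d1, d2 = sorted(dist(x, m) for x in cluster if x != m)[:2]
        return d2 - d1
    avg_kam = sum(k_am(c, m) for c, m in zip(clusters, medoids)) / len(clusters)
    return avg_kam / min(dic(c, m) for c, m in zip(clusters, medoids))

def information_loss(clusters, medoids, global_centroid):
    # Equations (7)-(9): SSE over SST; lower is better.
    sse = sum(dist(x, m) for c, m in zip(clusters, medoids) for x in c)
    sst = sum(dist(x, global_centroid) for c in clusters for x in c)
    return sse / sst
```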

4.4 Dunn index

This quality index (Dunn, 1973; Halkidi et al., 2001) attempts to recognize compact and well-separated clusters. The computation of this index relies on a dissimilarity function (e.g. the Euclidean distance d) between medoids and on the diameter of a cluster (cf. Equation 10) as a measure of dispersion:

diam(c_i) = max_{x, y ∈ c_i, x ≠ y} d(x, y).    (10)

Then, if the clustering C is compact (i.e., the diameters tend to be small) and well separated (the distances between cluster medoids are large), the result of the index, given by Equation 11, is expected to be large:

DI(C) = min_{c_i ∈ C} [ min_{c_j ∈ C, j = i+1} [ d(m_i, m_j) / max_{c_k ∈ C} diam(c_k) ] ].    (11)

The greater this index, the better the performance of the clustering algorithm is assumed to be. The main drawbacks of this index are its computational complexity and its sensitivity to noise in the data.

4.5 Davies-Bouldin index

The objective of the Davies-Bouldin index (DBI) (Davies and Bouldin, 1979; Halkidi et al., 2001) is to evaluate how well the clustering was performed by using properties inherent to the dataset considered. First, we use a scatter function within the cluster c_i of the clustering C (Equation 12):

S(c_i) = sqrt( (1 / n_c) Σ_{x_j ∈ c_i} d(x_j, m_i) ).    (12)

Then, we compute the distance between two different clusters c_i and c_j, given by Equation 13:

M(c_i, c_j) = d(m_i, m_j).    (13)

Afterwards, a similarity measure between two clusters c_i and c_j, called R-similarity, is estimated based on Equation 14:

R(c_i, c_j) = (S(c_i) + S(c_j)) / M(c_i, c_j).    (14)

After that, the most similar cluster c_j to c_i is the one maximizing the result of the function R_all(c_i), which is given by Equation 15 for i ≠ j:

R_all(c_i) = max_{c_j ∈ C, i ≠ j} R(c_i, c_j).    (15)

Finally, the DBI is equal to the average of the similarity between clusters in the clustering set C (Equation 16):

DBI(C) = (1 / n_c) Σ_{c_i ∈ C} R_all(c_i).    (16)

Ideally, the clusters c_i ∈ C should have the minimum possible similarity to each other. Accordingly, the lower the DB index, the better the clustering formed. These indices will be used to maximize the number of significant places a clustering algorithm can find. More precisely, in the next section we evaluate the aforementioned clustering algorithms as well as the method to extract the meaningful places using the quality indices.
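A compact sketch of these two indices, following the medoid-based formulation above (including the 1/n_c factor written in Equation 12); clusters with at least two points and planar Euclidean distances are assumed.

```python
from math import dist, sqrt

def dunn_index(clusters, medoids):
    # Equations (10)-(11): smallest medoid-to-medoid distance divided by
    # the largest cluster diameter; larger values are better.
    diameter = max(max(dist(x, y) for x in c for y in c if x != y)
                   for c in clusters)
    separation = min(dist(medoids[i], medoids[j])
                     for i in range(len(medoids))
                     for j in range(i + 1, len(medoids)))
    return separation / diameter

def davies_bouldin(clusters, medoids):
    # Equations (12)-(16): average, over clusters, of the maximal
    # R-similarity to any other cluster; lower values are better.
    n = len(clusters)
    s = [sqrt(sum(dist(x, m) for x in c) / n)
         for c, m in zip(clusters, medoids)]
    r_all = [max((s[i] + s[j]) / dist(medoids[i], medoids[j])
                 for j in range(n) if j != i)
             for i in range(n)]
    return sum(r_all) / n
```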

5 Selecting the optimal parameters for clustering

In order to establish how to select the best set of parameters for a given clustering algorithm, we have computed the precision, recall and F-measure for all users of the LifeMap dataset (Chon and Cha, 2011). One of the unique characteristics of this dataset is that the POIs have been annotated by the users. Consequently, given a set of clusters c_i ∈ C such that C = {c_1, c_2, c_3, ..., c_n} and a set of points of interest (POIs) defined by the users, P_poi = {p_poi_1, p_poi_2, p_poi_3, ..., p_poi_n}, we were able to compute the precision, recall and F-measure as we detail in the next subsection.

5.1 Precision, recall and F-measure

To compute the precision (cf. Equation 17), we take as input a clustering set C, the ground truth represented by the vector P_poi (which was defined manually by each user) as well as a radius used to count all the clusters c ∈ C that are within the radius of some p_poi ∈ P_poi; these represent the "good clusters". Then, the ratio of the number of good clusters to the total number of found clusters is computed. This measure illustrates the proportion of extracted clusters that are POIs among all extracted clusters:

Precision = good clusters / total number of extracted clusters.    (17)

To compute the recall (cf. Equation 18), we take as input a clustering set C, a vector of POIs P_poi as well as a radius used to count the discovered POIs p_poi ∈ P_poi that are within the radius of some cluster c ∈ C; these represent the "good POIs". Then, the ratio between the number of good POIs and the total number of POIs is evaluated. This metric represents the percentage of unique POIs that are extracted:

Recall = good POIs / total number of POIs.    (18)

Finally, the F-measure is defined as the weighted average of the precision and recall, as shown in Equation 19:

F-measure = 2 × (precision × recall) / (precision + recall).    (19)
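A sketch of these three measures is given below, assuming the extracted clusters are represented by their medoids and that a "good cluster" or "good POI" is decided by a simple within-radius test.

```python
from math import dist

def extraction_scores(medoids, annotated_pois, radius):
    """Precision, recall and F-measure as in Equations (17)-(19).

    medoids: medoids of the extracted clusters; annotated_pois: the
    user-defined ground-truth POIs; a medoid within `radius` of some
    annotated POI counts as a good cluster, and a POI within `radius`
    of some medoid counts as a discovered (good) POI.
    """
    good_clusters = sum(1 for m in medoids
                        if any(dist(m, p) <= radius for p in annotated_pois))
    good_pois = sum(1 for p in annotated_pois
                    if any(dist(m, p) <= radius for m in medoids))
    precision = good_clusters / len(medoids) if medoids else 0.0
    recall = good_pois / len(annotated_pois) if annotated_pois else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```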

We present the dataset used for our experiments in the next subsection.

5.2 Dataset description

In order to evaluate our approach, we use the LifeMap dataset (Chon et al., 2012), which is composed of the mobility traces of 12 users collected for a year in Seoul, Korea. This dataset comprises locations (latitude and longitude) collected with a frequency of between 2 and 5 minutes, with a user-defined point of interest flag set to true if the mobility trace is considered important or meaningful by the user. Table 2 summarizes the main characteristics of this dataset, such as the collection period, the average number of traces per user, the total number of mobility traces in the dataset, and the minimal and maximal number of mobility traces per user.

Table 2: Main characteristics of the LifeMap dataset.
Characteristics | LifeMap
Total nb of users | 12
Collection period (nb of days) | 366
Average nb of traces/user | 4 224
Total nb of traces | 50 685
Min #Traces for a user | 307
Max #Traces for a user | 9 473

Since we have described our dataset, we present the results of our experiments in the next subsection.

5.3 Experimental results

This section is composed of two parts. In the first part, we compare the performance of the previously described clustering algorithms with two baseline clustering algorithms, namely k-means and DBSCAN. In the second part, a method to select the most suitable parameters for a clustering algorithm is presented.

In order to compare the aforementioned clustering algorithms, we take into account the precision, recall and F-measure obtained, the average execution time, the number of input parameters and the time complexity. To evaluate these algorithms, we used the LifeMap dataset with POI annotations and a set of different parameter configurations for each algorithm, which are summarized in Table 3. After running these configurations, we obtained the results shown in Table 4 for the different input values.

Table 3: Summary of input parameters for clustering algorithms.
Input parameter | Possible values
Tolerance rate (%) | {0.75, 0.8, 0.85, 0.9}
Minpts (points) | {3, 4, 5, 6, 7, 8, 9, 10, 20, 50}
Eps (Km.) | {0.01, 0.02, 0.05, 0.1, 0.2}
Merge distance (Km.) | {0.02, 0.04, 0.1, 0.2, 0.4}
Time shift (hour) | {1, 2, 3, 4, 5, 6}
K (num. clusters) | {5, 6, 7, 8, 9}

Table 4: The characteristics of the clustering algorithms.
Algorithm | Precision | Recall | F-measure | Time (s) | Number of parameters | Complexity
DBSCAN | 0.58 | 0.54 | 0.48 | 316 | 3 | O(n²)
DJ-Cluster | 0.74 | 0.52 | 0.52 | 429 | 3 | O(n²)
DT-Cluster | 0.38 | 0.47 | 0.39 | 279 | 3 | O(n²)
k-means | 0.58 | 0.51 | 0.49 | 299 | 2 | O(n)
TD-Cluster | 0.43 | 0.54 | 0.44 | 362 | 4 | O(n²)

It is possible to observe that the precision of DJ-Cluster is better than that of the other clustering algorithms. In terms of recall, DBSCAN and TD-Cluster perform best, but DJ-Cluster is just behind them. Moreover, DJ-Cluster has the best F-measure. Regarding the execution time, DT-Cluster is the fastest one while DJ-Cluster is the slowest algorithm, due to its preprocessing phase. Despite the high computational time of DJ-Cluster, this algorithm performs well in terms of F-measure.

In the following, we describe our method to choose "optimal" parameters for obtaining a good F-measure. We have used the aforementioned algorithms with different sets of input parameter configurations for the users with POI annotations in the LifeMap dataset (Chon and Cha, 2011). Once clusters are built, we evaluate the clusters issued from the different configurations of the distinct algorithms using the previously described quality indices. Afterwards, we were able to estimate the precision, recall and F-measure using the manual annotation of POIs by the users in the LifeMap dataset.

Regarding the relation between the quality indices and the F-measure, we studied the relationship between these factors in order to identify the indices that are highly correlated with the F-measure, as can be observed in Figure 1. We observe that the two best performing indices, except for k-means, are IL and DBI. The former shows a negative correlation with respect to the F-measure, while the latter has a positive dependency with the F-measure. Our main objective is to be able to identify the relationship between quality and F-measure among the previously evaluated clustering algorithms. Accordingly, we discard the intra-inter cluster ratio (RII) and the additive margin (AM), which only perform well when using the k-means and DJ clustering algorithms. Finally, we observe that the Dunn index has a poor performance. Based on these observations, we were able to propose an algorithm to automatically choose the best configuration of input parameters.

Figure 1: Correlation of quality indices with the computed F-measure, where A) is the correlation measured between the user annotation and the centroid at 20 m of radius, B) at 35 m radius, C) at 50 m radius, D) at 100 m radius, and DBI = Davies-Bouldin index, DI = Dunn index, IL = Information loss, kAM = Additive margin and RII = Ratio intra-inter cluster.

5.4 Parameter selection method

Let us define a vector of parameters p_i ∈ P, a set of vectors P such that P = {p_1, p_2, p_3, ..., p_n}, and a trail of mobility traces M of a user. From the previous sections we have the clustering function C(p_i) and the quality metrics Information Loss IL(C) and Davies-Bouldin index DBI(C). Thus, once we compute the clustering algorithm and the quality metrics for each vector of parameters for a given user u, each vector of parameters yields a tuple composed of the parameters, the trail of mobility traces, the result of the clustering algorithm and the quality metrics, (p_i, M, C_{p_i}, IL_{C_{p_i}}, DBI_{C_{p_i}}). We also define a matrix χ′_u, which is the matrix χ_u sorted by IL in ascending order. Finally, the result matrix χ_u is of the form:

χ_u = [ (p_1, M, C_{p_1}, IL_{C_{p_1}}, DBI_{C_{p_1}}) ;
        (p_2, M, C_{p_2}, IL_{C_{p_2}}, DBI_{C_{p_2}}) ;
        (p_3, M, C_{p_3}, IL_{C_{p_3}}, DBI_{C_{p_3}}) ;
        ... ;
        (p_n, M, C_{p_n}, IL_{C_{p_n}}, DBI_{C_{p_n}}) ]


Figure 2: Comparison between the F-measure and the parameter selection based on the schema in Figure ??, where DBI = Davies-Bouldin index, IL = Information loss, IL_DBI = the combination of IL and DBI, and MAX is the maximal computed F-measure (taken as reference to compare with IL_DBI). Resume is the average of all the results using the different clustering algorithms.

Therefore, the parameter selection function S(χ_u) can be defined as:

S(χ_u) = { p_i,  if max_{p_i}(DBI) and min_{p_i}(IL) coincide ;
           p′_i, if max_{p′_i}(DBI) is in the 1st quartile of χ′_u }    (20)

In detail, the function S takes as input a χ matrix containing the parameter vectors p_i, a trail of mobility traces M, the computed clustering C(p_i, M) as well as the quality metrics, namely the Information loss IL(C) and the Davies-Bouldin index DBI(C). Once all these values have been computed for each evaluated set of parameters, two cases are possible. In the first case, both IL and DBI agree on the same set of input parameters. In the second situation, IL and DBI each point to a different set of parameters. In this case, the algorithm sorts the values by IL in ascending order (i.e., from the smallest to the largest information loss value). Then, it chooses the set of parameters with the greatest DBI within the first quartile.
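A sketch of this selection rule, assuming each row of χ_u is represented as a (p_i, C_{p_i}, IL, DBI) tuple (dropping the shared trace M):

```python
def select_parameters(chi_u):
    """Selection function S of Equation (20).

    chi_u: list of (params, clustering, il, dbi) tuples for one user.
    """
    best_il = min(chi_u, key=lambda row: row[2])    # smallest information loss
    best_dbi = max(chi_u, key=lambda row: row[3])   # greatest Davies-Bouldin value
    if best_il[0] == best_dbi[0]:
        return best_il[0]                           # both indices agree
    # Otherwise sort by IL in ascending order and pick the greatest DBI
    # within the first quartile.
    chi_sorted = sorted(chi_u, key=lambda row: row[2])
    first_quartile = chi_sorted[:max(1, len(chi_sorted) // 4)]
    return max(first_quartile, key=lambda row: row[3])[0]
```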

For the sake of evaluation, our methodology was tested using the LifeMap dataset to check whether the chosen parameters are optimal. We have tested the method with the seven users of LifeMap who manually annotated their POIs. Consequently, for every set of settings of each clustering algorithm, we have computed the F-measure, because we have the ground truth, as depicted in Figure 2. The "MAX" bar represents the best F-measure for the given user and it is compared to the F-measures obtained when using "DBI", "IL" or "IL_DBI" as indicators to choose the best input parameter configuration. Finally, this method has a satisfactory performance, extracting a good number of POIs for maximizing the F-measure and achieving a difference of only 9% with respect to the F-measure computed from the data with the ground truth.

6 Conclusion

In the current paper, we have presented a method to extract an optimal number of POIs. Consequently, based on the method described in this paper, we are able to find an appropriate number of POIs relying only on the quality metrics of the extracted clusters and without knowledge of the ground truth. Nonetheless, we are aware of the small size of the dataset, but the results encourage us to continue in this direction.

Therefore, in the future we plan to test our method on a larger dataset and in the presence of noise such as downsampling or random distortion. Another idea is to evaluate the impact of this method in more complex attacks, like the prediction of future locations or de-anonymization, to verify whether this step can affect the global result of a chain of inference attacks.

References

Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999). OPTICS: ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2):49–60.

Ben-David, S. and Ackerman, M. (2008). Measures of clustering quality: A working set of axioms for clustering. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 121–128, Vancouver, Canada.

Chon, Y. and Cha, H. (2011). LifeMap: A smartphone-based context provider for location-based services. Pervasive Computing, IEEE, 10(2):58–67.

Chon, Y., Talipov, E., Shin, H., and Cha, H. (2012). CRAWDAD data set yonsei/lifemap (v. 2012-01-03). Downloaded from http://crawdad.cs.dartmouth.edu/yonsei/lifemap.

Davies, D. L. and Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227.

Dunn, J. C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32–57.

Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Knowledge Discovery and Data Mining, 2(4):226–231.

Gambs, S., Killijian, M.-O., and Núñez del Prado Cortez, M. (2010). GEPETO: A GEoPrivacy-Enhancing TOolkit. In Advanced Information Networking and Applications Workshops, pages 1071–1076, Perth, Australia.

Gambs, S., Killijian, M.-O., and Núñez del Prado Cortez, M. (2011). Show me how you move and I will tell you who you are. Transactions on Data Privacy, 2(4):103–126.

Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(3):107–145.

Hamerly, G. and Elkan, C. (2003). Learning the k in K-means. In Neural Information Processing Systems, pages 1–8, Vancouver, Canada.

Hariharan, R. and Toyama, K. (2004). Project Lachesis: Parsing and modeling location histories. Lecture Notes in Computer Science - Geographic Information Science, 3(1):106–124.

Hillenmeyer, M. (2012). Intra and inter cluster distance. http://www.stanford.edu/~maureenh/quals/html/ml/node82.html.

MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, Berkeley, CA, USA.

Pham, D. T., Dimov, S. S., and Nguyen, C. D. (2005). Selection of K in K-means clustering. Journal of Mechanical Engineering Science, 205(1):103–119.

Solé, M., Muntés-Mulero, V., and Nin, J. (2012). Efficient microaggregation techniques for large numerical data volumes. International Journal of Information Security - Special Issue: Supervisory control and data acquisition, 11(4):253–267.

Zhou, C., Frankowski, D., Ludford, P., Shekhar, S., and Terveen, L. (2004). Discovering personal gazetteers: an interactive clustering approach. In Proceedings of the Annual ACM International Workshop on Geographic Information Systems, pages 266–273, New York, NY, USA.

A Cloud-based Exploration of Open Data: Promoting Transparency and Accountability of the Federal Government of Australia

Edwin Salvador, Department of Computing and Information Systems, University of Melbourne, Melbourne, Australia, [email protected]
Richard Sinnott, Department of Computing and Information Systems, University of Melbourne, Melbourne, Australia, [email protected]

Abstract

The Open Data movement has become more popular since governments such as the USA, UK, Australia and New Zealand decided to open up much of their public information. Data is open if anyone is free to use, reuse and redistribute it. The main benefits that a government can obtain from Open Data include transparency, participation and collaboration. The aim of this research is to promote transparency and accountability of the Federal Government of Australia by using Cloud-related technologies to transform a set of publicly available data into human-friendly visualizations in order to facilitate its analysis. The datasets include details of politicians, parties, political opinions and government contracts, among others. This paper describes the stages involved in transforming an extensive and diverse collection of data to support effective visualization that helps to highlight patterns in the datasets that would otherwise be difficult or impossible to identify.

1 Introduction

In recent years, the Open Data movement has become increasingly popular since the governments of various countries such as the USA, UK, Australia, New Zealand and Ghana, amongst many others, decided to open up (some of) their data sets. In order to consider data as open, it should ideally be available online in formats that are easy to read by computers, and anyone must be allowed to use, reuse and redistribute it without any restriction (Dietrich et al., 2012). Furthermore, in the Open Data Handbook (Dietrich et al., 2012), the authors state that most of the data generated by governments are public data by law, and therefore they should be made available for others to use where the privacy of citizens and national security issues are not challenged. According to the Open Government Data definition ("Welcome to Open Government Data," 2014), there are three main benefits that governments can obtain by opening up their data: transparency, participation and collaboration.

Acquiring and processing the amount of data generated by governments may lead to workloads that are beyond the capacity of a single computer. Fortunately, the emergence of new technologies, such as Cloud Computing, makes it easier to scale the data processing demands in a seamless and scalable manner (Buyya, Yeo, Venugopal, Broberg, & Brandic, 2009). Whilst for some disciplines and domains finer grained security is an impediment to the adoption of Cloud computing, e.g. medicine, open government data has, by its very nature, no such impediments. Cloud computing also encourages the creation of more innovative services, including those based on processing and analyzing datasets made available by governments. The sharing of technologies as open source solutions also goes hand in hand with open data initiatives.

The aim of this paper is to describe an approach taken to leverage the benefits provided by Open Data from the Australian government using Cloud-related technologies through the Australian national cloud facility: National eResearch Collaboration Tools and Resources (NeCTAR – www.nectar.org.au) and specifically the NeCTAR Research Cloud. The paper begins with a brief introduction to Open Data, providing its definition, its benefits and also its potential disadvantages. We then describe the advantages of using Cloud Computing to deal with Open Data. The details of the approach taken to harvest, clean and store publicly available data from Australian government resources, followed by the analyses and visualizations of these datasets, are then given. Finally, the paper concludes by pointing out the importance of Open Government Data and the role of Cloud Computing in leveraging the benefits offered by Open Data.

22 pointing out the importance of Open Government and analyze these data. However, according to Data and the role of Cloud Computing to (Davies, 2010; Dietrich et al., 2012; Lathrop & leverage the benefits offered by Open Data. It is Ruma, 2010; Robinson, Yu, Zeller, & Felten, emphasized that there is no causality implied in 2008), most of the data collected by government this paper regarding the analysis of the data is public by law and therefore, it should be made offered. However we strongly believe that open open and available for everyone to use. In some discussions about causality are an essential cases, when governments have failed to make element in the transparency of Government more data easily accessible, citizens have had to find generally. alternative ways to harvest and process these data to give it a meaningful use. A well-known 2 Open Data case is the portal GovTrack.us which was launched in 2004 by a student who harvested a 2.1 Definition set of government data and published it in more The Open Knowledge Foundation defines Open accessible formats. This kind of initiatives have Data as ‘any data that can be freely used, reused influenced in governments’ decisions to make and redistributed by anyone – subject only, at government data publicly available (Brito, 2007; most, to the requirement of attribute and/or Hogge, 2010; Lathrop & Ruma, 2010). It should share-alike’ (Doctorow et al., 2014). We be noted also that government does not always emphasize two important conditions that are not share data effectively across its own departments clearly mentioned in this short definition. First, – here the data includes both open and non-open data can be considered as open if it is easily data. The government departments of accessible which means that data should be immigration, employment, education, health, available on the Internet and in formats that are transport, etc. all have subsets of the total machine readable. Second, the terms reuse and “government” data, but the use of this data in redistribute include the possibility of intermixing integrated frameworks by government is two or more datasets in order to discover currently lacking. relations that would not be visible when having Since 2009, various countries have started the datasets separated. The full definition Open Data initiatives by launching portals in provided by the Open Knowledge Foundation which they publish government datasets to be (Doctorow et al., 2014) gives further details of downloaded by anyone. Among these countries the conditions that should be satisfied by data to are the USA (data.gov), the UK (data.gov.uk), be considered as open. The final purpose of all Australia (data.gov.au), Ghana (data.gov.gh) and these conditions required by Open Data is to New Zealand (data.govt.nz). These sources of ensure the potential interoperability of datasets, data are useful but do not include the tools to i.e. it is possible to combine any number of these compare all of the data sets in any meaningful datasets and subsequently identify their inter- manner. Instead they are typically large relationships. Ideally this should be part of a collections of documents and individual larger system as opposed say to having many (distinct) data sets. Often they are available as individual data sets (e.g. spreadsheets). The true spreadsheets, CSV files with no means for direct power of Open Data is derived from the comparison or analysis across the data sets. 
analytical tools and capabilities used to identify patterns that would otherwise remain hidden 2.3 Benefits across multiple, diverse data sets. Many authors (Brito, 2007; Davies, 2010; Dietrich et al., 2012; Hogge, 2010; Lathrop & 2.2 Open Government Data Ruma, 2010; Robinson et al., 2008) agree about Governments are constantly gathering data from the benefits that can be obtained by governments many types of sources: the population, taxes, when they decide to open up their data, namely: quality of life indicators and indeed anything that transparency, participation and collaboration. could help the government to monitor and These benefits are directly derived from the improve the management and governance of corresponding Open Data requirements: freedom their country. Historically, only governmental of use, reuse and redistribution. In this context, entities (departments) have had access to process the fact that anyone is free to use government

In this context, the fact that anyone is free to use government data leads to an increase in government transparency. Hogge (2010), in her study, mentions that transparency is not only about citizens trying to find irregularities in government actions; it is also about citizens constantly monitoring their governments' activities and providing feedback to improve processes and public services, and according to the Open Government Data definition ("Welcome to Open Government Data," 2014), this is what defines a well-functioning democratic society.

Open Data not only requires data to be accessible, it also requires the freedom to reuse these data for different purposes. This allows citizens to combine two or more datasets to create mash-ups and highlight potentially hidden relations between different datasets (Brito, 2007; Davies, 2010; Lathrop & Ruma, 2010). This improves the participation of citizens from different fields such as developers, scientists and indeed journalists. This is particularly important to governments since citizens can experiment in the creation of new services based on government data, and the government is subsequently able to evaluate the most useful services and, where appropriate, shape future policy based on new knowledge. This has the added value of encouraging the participation of more citizens in government activities and increases the number of new services that could benefit the government.

The third key benefit of Open Data is collaboration, which is directly derived from the freedom of users to redistribute government data, e.g. combining two or more datasets for a specific purpose and making the resulting dataset available for others to use. In this way, citizens are collaborating with each other while they are contributing to the government by creating services and solving problems. In some cases, this model of combining data sets to develop new, targeted solutions has spawned a range of start-ups and industries, e.g. San Francisco and the Civic Innovation activities (http://innovatesf.com/category/open-data/). Although the process of making data publicly available can be seen as laborious and cost intensive to the government agencies involved, it brings further economic benefits to governments since it will improve the participation of people in the creation of innovative services (Hogge, 2010).

2.4 Barriers

According to (Davies, 2010; Lathrop & Ruma, 2010), transparency should not be focused only on the accountability and transparency of government. In fact, this could generate excessive attention to government's mistakes and, consequently, create an image of government as corrupt. This is clearly a reason why governments might not want to open up their data. However, the authors state that instead of reducing transparency, this problem could be addressed by creating a culture of transparency that not only judges when public entities behave badly, but is also capable of registering approval when governments successfully solve public problems or deliver services in a cost effective manner.

Furthermore, many governments and indeed individuals are concerned about the privacy of citizens. Although it is possible to anonymize datasets before they are made publicly available, it requires considerable time, effort and expense of public workers, and sometimes it is not possible to guarantee that the data will be fully anonymized (Lathrop & Ruma, 2010). For this reason, some governments prefer to keep the data private. However, it is often the case that terms such as protecting national security or citizen privacy are used as a blanket to deny access to many other data sets that are not contentious.

An additional barrier that stops governments making data publicly available is the fact that many data sets are stored on older forms of data storage media such as paper files and proprietary databases which do not allow for easy extraction and publication. Furthermore, open data also requires appropriate (rich) metadata to describe it: the context in which it was collected, by whom and when. In some cases, this additional information is not directly available.

2.5 Disadvantages

Data can be open to misinterpretation, which can subsequently generate civic confusion and extra problems for governments. For instance, (Lathrop & Ruma, 2010) mentions a case where people correlated the locations of crimes in a city with the number of immigrants in those locations and drew conclusions like "This is a high crime neighborhood because many immigrants live here".

24 here”. Something which is not necessarily true as particularly important when working with many other aspects must be taken into account to government data as they can become quite determine the reasons of high levels of crimes in voluminous, they can change over time, they a location. require veracity of information to be checks, and Another disadvantage of publicly available data when comparisons and analyses are made is for the potential for it to be manipulated with between data sets these can result in the intention of satisfying personal interests. This computationally expensive requirements. Cloud is difficult for a government to control and could Computing is especially suited to this be problematic since people often do not always environment since it is possible to scale out verify data before making conclusions. Key to resources to satisfy needs and (in principle) pay tackling this is the spirit of open data: it should for those extra resources only for the time that be possible to verify or refute the conclusions are actually being used. This is convenient that are drawn by access to the original data sets. specially for people considered ‘civil hackers’ Thus for any data that is accessed (harvested) it who create services based on government data should always be possible to go back to the and most often without financial reward (Davies, original (definitive) sources of the data (since it 2010; Hogge, 2010). This contributes to the is open). emergence of new questions and reduces the time needed to answer these questions, which 3 Cloud Computing encourages people to collect more data and Open data benefits greatly by access to open data create more innovative services. processing platforms. Cloud computing offers The Cloud provider utilized here is the one approach that is directly suited to the NeCTAR Research Cloud, which is an processing of open data. The National Institute of Australian government funded project that offers Standards and Technology (NIST) (Mell & an IaaS platform with free access to Australian Grance, 2011), points out five essential academics, or more precisely members of characteristics that define the Cloud Computing organizations subscribed to the Australian model: on-demand self-service, broad network Access Federation (AAF – www.aaf.edu.au) access, resource pooling, rapid elasticity, and such as the University of Melbourne. This work measured service. In order to adapt to different utilised two virtual machines (VMs) each with 2 types of users, Cloud providers offer three levels cores, 8GB RAM and 100GB of storage. While of abstraction: Software as a Service (SaaS) with the VMs were located in different zones, both examples being Salesforce’s CRM and Google have the same architecture (Figure 1). This Docs; Platform as a Service (PaaS) with allowed them to act as master at any time examples being Microsoft Azure and Google providing high availability to the system. App Engine, and Infrastructure as a Service (IaaS) with examples being Amazon EC2, Rackspace and the Australian NeCTAR Research Cloud. There are also many different Cloud deployment models: Private Clouds, Public Clouds, Community Clouds, and Hybrid Clouds (Mell & Grance, 2011; Sriram & Khajeh- Hosseini, 2010; Velte, Velte, & Elsenpeter, 2009; Zhang, Cheng, & Boutaba, 2010). 
Ideally open data should be processed on open Clouds and the applications and interpretation of the data utilizing open sources data models for complete transparency of the data and the associated data processing pipelines. One of the main reasons for the success of Cloud computing is the capacity to rapidly scale up or scale down on demand, at an affordable cost and ideally in an automated fashion. This is Figure 1. Architecture.
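The high-availability arrangement in Figure 1 relies on the two CouchDB instances replicating to each other, as described later in Section 4.1. The following is a minimal sketch of how such a continuous, bi-directional replication pair could be configured through CouchDB's standard _replicate API; the host names and database name are placeholders, not values from the paper.

```python
import requests

# Placeholder endpoints for the two NeCTAR VMs (not the project's real addresses).
NODE_A = "http://vm-a.example.org:5984"
NODE_B = "http://vm-b.example.org:5984"
DB = "opengov"

def replicate(source, target):
    """Ask the CouchDB instance at `source` to continuously push DB to `target`."""
    resp = requests.post(
        f"{source}/_replicate",
        json={"source": DB, "target": f"{target}/{DB}", "continuous": True},
    )
    resp.raise_for_status()
    return resp.json()

# Bi-directional replication: whichever VM is harvesting, the other stays in sync.
replicate(NODE_A, NODE_B)
replicate(NODE_B, NODE_A)
```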

4 Implementation of the Open Government Data Access, Process and Visualise Platform

4.1 Data Layer

The key focus of the work is on access to and use of open government data. A Data Layer that harvested and processed these data was key to this. This layer was responsible for dealing with raw data coming from external sources. The data sets that were harvested and used as the basis for the work described here included:

• Australian Electoral Commission (www.aec.gov.au)
  o Annual Returns (2003 - 2013) (includes: party returns, political donations, Associated Entities, Political Expenditure)
  o Election Returns (2003 - 2013) (includes: donations to candidates, donors details, senate groups)
  o Election Results (2004 - 2010) (includes: details of Federal election results divided into general, house and senate)
  o Federal electoral boundary GIS data (Census 2011)
• Portal data.gov.au
  o Historical Australian Government Contract Data (1999 - 2013)
  o Members of Parliament webpages and social networks
  o Portfolio of Responsibilities
• Parliament of Australia (www.aph.gov.au/Parliamentary_Business/Hansard)
  o House Hansard
  o Senate Hansard
• Lobbyists Details
  o Australian Government (www.lobbyists.pmc.gov.au)
  o Victoria (www.lobbyistsregister.vic.gov.au)
  o Queensland (www.lobbyists.integrity.qld.gov.au)
  o Western Australia (www.lobbyists.wa.gov.au)
  o Tasmania (www.lobbyists.dpac.tas.gov.au)
  o New South Wales (www.dpc.nsw.gov.au)
  o South Australia (www.dpc.sa.gov.au)

The analyses and visualizations of these data that drove and shaped the work were based on: political donations, election results, historical contracts data and political speeches. These data were selected following discussions with researchers at the Centre for Advanced Data Journalism at the University of Melbourne. The Data Layer itself was divided into three stages: data harvesting, data cleansing and data storage, which are described here.

Data Harvesting

It should be noted that most of the datasets that were harvested satisfy the requirements of Open Data, i.e. they are downloadable and are provided in machine-readable formats such as CSV and XML. It is also noted that there are other data that do not satisfy all of these requirements. For instance, the lobbyist registers for all the Australian States are available only in HTML (via web pages). In this case, it was necessary to implement web scrapers to extract the data from the webpages and then store it in the database (a sketch of this approach is given below). This technique is inefficient and has several disadvantages for how data can be released as open data and subsequently used and interpreted. Firstly, it is error prone, because a scraper may assume that a webpage follows a standard, but there is the possibility of mistakes in the scraped HTML, which would cause the scraper to obtain erroneous data. Furthermore, it is a tedious task since it is almost impossible to build a scraper that works with many webpages, as different sites use completely different designs. Lastly, the design of a webpage can change without any notice, which would render a given scraper totally useless and require a new scraper to be produced. Nevertheless, it is an effective technique when used carefully and after ensuring that all data obtained is verified before performing further analyses and interpretations. The information should also include metadata on when the data was accessed and scraped.
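As an illustration of the scraping approach just described, the following sketch shows how a state register published as an HTML table might be harvested and stamped with access metadata. The URL, table structure and field names are assumptions for illustration only, not the actual layout of any of the registers listed above.

```python
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup

# Hypothetical register page; the real registers differ in URL and layout.
REGISTER_URL = "https://www.example.gov.au/lobbyist-register"

def scrape_register(url=REGISTER_URL):
    """Scrape an HTML table of lobbyists into JSON-ready dicts with access metadata."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    records = []
    for row in soup.select("table tr")[1:]:           # skip the header row
        cells = [c.get_text(strip=True) for c in row.find_all("td")]
        if len(cells) < 2:
            continue                                   # malformed rows are a known risk
        records.append({
            "type": "lobbyist",                        # mirrors the "type" field used in storage
            "organisation": cells[0],
            "client": cells[1],
            "source_url": url,
            "accessed": datetime.now(timezone.utc).isoformat(),
        })
    return records
```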

Additionally, even when data is made available in a more accessible (downloadable) format, further work is often required. For example, despite the fact that the Hansard political speeches of Australia are provided as downloadable XML files, there is no way to download the whole collection of speeches, or to select a range of speech dates to be downloaded. Consequently, it is often necessary to download one file at a time, which makes the process inefficient taking into account that there are thousands of files. As a result, whilst the data is open, the way in which it is made available is not really conducive to further processing without computational approaches to overcome these limitations, e.g. implementing processes to download all of the XML files.

It is recognized that the difficulties faced in harvesting public data are understandable, since many governments (including the Australian government) are still in the process of opening their datasets and learning about how best to do this. These lessons are often spotlighted through important initiatives such as organized events used to receive feedback from data enthusiasts about how to improve the available datasets or which new datasets could be made available. For instance, GovHack is an annual event which brings together people from government, industry, academia and the general public to experiment with government data and encourage open government and open data in Australia. Additionally, there exist various open data portals around Australia, including the national portal data.gov.au and portals for every State such as http://data.nsw.gov.au, and even some cities like Melbourne have launched their own open data portals, e.g. https://data.melbourne.vic.gov.au/.

Data Cleansing

Every dataset collected will typically contain some extra and/or useless data that needs to be removed in order to improve the quality of the data and increase the consistency between different datasets, allowing them to be combined and interpreted more easily. To aid in data consistency, datasets in different formats such as CSV or XML were converted to JavaScript Object Notation (JSON) objects. Although this process was simple, there were some difficulties to overcome in specific datasets. For instance, the XML files of the Hansard political speeches have different structures over different time periods, which made the process of parsing the whole collection more complex. However, it was possible to find certain levels of consistency in most of the datasets, which allowed the use of Python scripts to convert hundreds of datasets and then store them in the database.

Data Storage

Due to the variety of sources and the lack of a well-defined schema, CouchDB was selected as an appropriate database to store all the harvested data. CouchDB is a schema-free, NoSQL, document-oriented database (Anderson, Lehnardt, & Slater, 2010). It stores its documents as JSON objects. In this model, each row of each dataset was stored as an independent JSON document with an extra field "type", and in some cases "subtype", added in order to facilitate the exploration of the different datasets in the database. Although both VMs were set up to be able to act as master at any given time, in this stage one VM could be considered as master and the other one as slave, because only one of them could harvest data from external sources at a time while it replicated all the new data to the other. CouchDB provides strong replication processes that allow setting up a bi-directional replication between the databases in each VM. This allowed both databases to be kept up to date while only one of them was harvesting data.

4.2 Analysis Layer

In addition to the flexibility provided by CouchDB to store schema-free documents, one of the main reasons for choosing this database was its support for MapReduce-based views. MapReduce is one of the most effective approaches to deal with large-scale data problems and allows the separation of what computations are performed from how those computations are performed (Buyya et al., 2009; Dean & Ghemawat, 2008; Ekanayake, Pallickara, & Fox, 2008; Lin & Dyer, 2010; Segaran & Hammerbacher, 2009; White, 2012). Therefore, to analyze the data the developer only needs to focus on the first part, which consists of writing two functions: a map function and a reduce function (a sketch is given below). The run-time system handles how those computations are performed by managing failures, scheduling and intercommunication. The complexity of the map and reduce functions can vary and depends on the type of analysis to be performed on the data.
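As a concrete sketch of such a view, the snippet below defines a CouchDB design document (via the couchdb Python package) whose map function emits the year of each donation document and whose reduce function sums the donated amounts. The server address, database name and document fields ("type", "date", "amount") are assumptions for illustration, not the project's actual schema.

```python
import couchdb

couch = couchdb.Server("http://localhost:5984")   # placeholder server address
db = couch["opengov"]                              # placeholder database name

# CouchDB views are written in JavaScript and stored in a design document.
db["_design/donations"] = {
    "views": {
        "total_by_year": {
            "map": """
                function (doc) {
                  if (doc.type === 'donation' && doc.date && doc.amount) {
                    emit(doc.date.substring(0, 4), doc.amount);  // key: year
                  }
                }
            """,
            "reduce": "_sum",   # built-in reduce: sums the emitted amounts
        }
    }
}

# Querying the view groups donations by year, e.g. for the timelines in Section 4.3.
for row in db.view("donations/total_by_year", group=True):
    print(row.key, row.value)
```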

Furthermore, CouchDB documents and views are indexed using B-Tree data structures, which are very efficient for storing large amounts of data (Anderson et al., 2010; Bayer, 1997). The index for a view is created only the first time that the view is queried and allows large amounts of data to be retrieved very quickly. In order to reflect the current state of the database, the index of a view only needs to incorporate the documents that have changed. Although this process is very efficient, it can introduce high latency to queries when a large number of documents have changed (Anderson et al., 2010). This is a common problem faced by applications where documents in the database tend to be updated frequently. However, since the type of data used in this project is largely historical and not changing dynamically, CouchDB views were used successfully.

Most of the data analyses were performed using CouchDB views. These analyses included political donations over time, aggregation of donations such as retrieving the largest donation in a certain election period, and correlation between different datasets, for instance donations vs votes received by a political party. However, there were some cases where it was not possible to perform more complex analyses using only CouchDB views. For example, despite the fact that CouchDB owes many of its advantages to B-Trees, it also inherits one of their main drawbacks, which is the inability to perform multi-dimensional queries (Bayer, 1997). In other words, CouchDB views are excellent for processing queries such as the sum of donations received in the year 2004 (point queries) or the sum of donations received between 2004 and 2010 (range queries). However, for multi-dimensional queries such as the sum of donations received by a candidate from a specific donor in 2004 (4-dimensional), there were challenges that required support for other data processing capabilities. For this kind of query it was required to provide a visualization that showed an overview of the political donations. This visualization was required to group, color and filter donations in multiple ways and to show a summary for every group of donations. The summary includes the total sum of donations, the number of donations in that group, details of the largest donation and the top 3 donations received by candidates, parties and States. In order to solve this multi-dimensional query limitation, CouchDB functionalities were extended with ElasticSearch.

ElasticSearch is a distributed search engine built on top of Apache Lucene, which among other features provides full text search capabilities whilst hiding the complexity of Lucene behind a simple and coherent API. In spite of the document storage capabilities of ElasticSearch, it is mainly used as an extension for NoSQL databases thanks to a range of available plugins. For instance, it offers the possibility of indexing any CouchDB database through a plugin that listens to the changes API of CouchDB, making the database searchable and allowing more complex queries and more complete analyses of the data (Gormley & Tong, 2014). Using CouchDB in conjunction with ElasticSearch allows taking advantage of the most important features provided by each technology, namely durability and advanced search capabilities respectively. The number of features offered by ElasticSearch is vast and more details can be found in (Gormley & Tong, 2014).

ElasticSearch was also useful to build visualizations that correlate the political donations and the government contracts, by searching for all the occurrences of the donors' names in the dataset of government contracts. This was done through the Python API client provided by ElasticSearch and a Python script which returned a list of donor names that appeared in the government contracts dataset, indicating the field where each was found; this helped to show the correlation of both datasets in a timeline (a sketch of this kind of query is given below).
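The following sketch illustrates that kind of cross-dataset lookup with a recent (8.x-style) elasticsearch Python client: each donor name is searched across a contracts index and any matching fields are reported. The index name, field names and the phrase-matching query are assumptions for illustration, not the exact queries used in the project.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder address

def donors_in_contracts(donor_names, index="contracts"):
    """Return donor names that also appear in the contracts index,
    together with the (assumed) fields in which they were found."""
    matches = {}
    for name in donor_names:
        hits = es.search(
            index=index,
            query={"multi_match": {"query": name,
                                   "fields": ["supplier_name", "agency", "description"],
                                   "type": "phrase"}},
            size=5,
        )["hits"]["hits"]
        if hits:
            matches[name] = sorted({
                field for hit in hits
                for field, value in hit["_source"].items()
                if isinstance(value, str) and name.lower() in value.lower()
            })
    return matches
```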

4.3 Presentation Layer

Visualisation is essential when dealing with large-scale heterogeneous data sets. Indeed, all of the data analyses would have limited value if it were not possible to visualize them in a human-friendly way. This is especially important in open government data initiatives, where the focus is less on the detailed scientific model of discovery and more on the big picture questions that can be illustrated through the data itself. The presentation layer was based mainly on JavaScript using the D3.js library, the Google Charts API and jQuery. In this section we illustrate some of the visualizations for the analyses mentioned previously.

Figure 2 shows an example of the multiple ways of visualizing political donations through one of the visualizations that were built. Each bubble in the figure represents a donation received by a candidate, the size of the bubble represents the amount of money donated, the color in this case represents a political party, and each group of bubbles is an election period. This is an interactive visualization, so donations can be grouped, colored and filtered by all the features contained in the dataset, which include election period, candidate, party, electorate, electorate state, donor, donor state, donor suburb, and nil return. Furthermore, the labels for each group (including the main title) are clickable: they contain the summary for every group of donations, and the main title contains the summary for all four groups.

Figure 2. Overview of Donations.
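A sketch of how the per-group summaries shown in Figure 2 might be assembled from the donation documents before being handed to the JavaScript front end; the field names ("party", "candidate", "amount") are assumptions for illustration.

```python
from collections import defaultdict

def summarise(donations, group_by="party"):
    """Group donation records and compute the summary used by the bubble chart:
    total amount, number of donations, largest donation and top-3 recipients."""
    groups = defaultdict(list)
    for d in donations:
        groups[d[group_by]].append(d)

    summary = {}
    for key, items in groups.items():
        ranked = sorted(items, key=lambda d: d["amount"], reverse=True)
        summary[key] = {
            "total": sum(d["amount"] for d in items),
            "count": len(items),
            "largest": ranked[0],
            "top3_candidates": [d["candidate"] for d in ranked[:3]],
        }
    return summary

# Example with made-up records:
print(summarise([
    {"party": "A", "candidate": "X", "amount": 1000},
    {"party": "A", "candidate": "Y", "amount": 250},
    {"party": "B", "candidate": "Z", "amount": 400},
]))
```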

This visualization facilitates a detailed exploration of the donations dataset and could be considered as a starting point for further analyses. Another way of visualizing the donations is on a timeline, as shown in Figure 3. This shows the total number of donations received by date. Something interesting to point out here is that most of the peaks are in the years 2004, 2007 and 2010, years in which federal elections took place. This pattern of donations increasing in election years is also visible when looking at donations made by individual entities. Figure 4 illustrates all the donations made by an entity over time and highlights the tendency of making more donations in election years.

Figure 3. Summation of Political Donations Visualised over Timeline.

Figure 4. Individual Donations made by a given entity over time.

An additional scenario of interest is the correlation between political donations and government contracts, i.e. grants/awards made to entities (most often companies). With the results obtained from the ElasticSearch analysis described in the previous section, donations and contracts were displayed in a timeline to explore whether the number of donations or the amount of money donated by an entity influenced (was correlated with) the number and the value of contracts that they subsequently obtained. Figure 5 shows this scenario for a specific entity. It can be seen that there are several cases where contracts are obtained right before or after some donations have been made.

In addition to the graph shown in Figure 5, this visualization also provides the details of the donations and contracts related to the entity being analyzed. Thus one can see the persons involved as well as political parties and governmental agencies, and try to find more patterns to perform further investigations. For instance, a next step might be to investigate who is on the board of the companies making donations and to see if there exists a direct or indirect relation with the governmental agency that is offering the contract. It is emphasized that this is only one of the many scenarios that can be visualized with this analysis, and in many of the cases there did not exist a clear correlation between the two datasets. However, this specific scenario helps us to demonstrate how mash-ups highlight hidden relations between apparently unrelated datasets. For transparency of government it is important to ensure that where major grants are awarded, an independent review of political donations prior to the award can be scrutinized to ensure the independence of government.

Figure 5. Colour-coded Correlation of Donations (A, B, E, F, Q-W, Y, Z) vs Government Contracts/Awards (C, D, G, H, X).

A further visualization is illustrated in Figure 6, which shows the correlation of terms used in political speeches over time. The figure demonstrates the correlation between the terms "boats" and "immigration" and indicates how both terms tend to be used on the same dates. This visualization is useful to get an idea of what topics are being discussed by the members of parliament in different periods.

Figure 6. Political Speeches: correlation of words used over time.

An additional visualization using word clouds (Figure 7) was implemented to explore the most popular terms used by politicians in their speeches. This visualization allows one to see a general word cloud for all the politicians in the collection of speeches and provides the possibility of filtering by year, month and day, as well as selecting up to three politicians to show a comparison of the terms used by each of them over time. These word clouds provide a simple overview of the topics discussed by each politician. For instance, in Figure 7 the word cloud on the left belongs to the Shadow Minister for Transport and Infrastructure, and so we can see that the most popular words are highly related to this position, such as "infrastructure", "transport", "highway", "safety", and "airport". The word cloud on the right shows the words used by the Prime Minister in his speeches in May 2014, which is the month when the federal budget was presented to the parliament. In this case, we can see that the words "budget", "deficit", "spending", and "tax" are amongst the most popular ones. This demonstrates that word clouds give us an idea of the topics dealt with in parliament in different periods of time by different politicians. The combination of terms used in speeches and decisions made in the award of contracts are also essential to correlate, e.g. speeches about the importance of the Australian car industry should not be correlated/associated with political donations from car manufacturers, for example, if government is to be truly transparent and ultimately accountable for the independence of the decisions it makes.
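The term weights behind such word clouds can be derived with a simple frequency count over the speech documents for a given politician and period, as sketched below; the document fields ("speaker", "date", "text") and the stop-word list are assumptions for illustration.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "to", "a", "in", "that", "is", "for", "on", "we"}

def word_cloud_counts(speeches, speaker, period):
    """Count terms used by `speaker` in speeches whose date starts with `period`
    (e.g. '2014-05'); the top terms drive the word cloud sizing."""
    counts = Counter()
    for doc in speeches:
        if doc["speaker"] == speaker and doc["date"].startswith(period):
            words = re.findall(r"[a-z']+", doc["text"].lower())
            counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(50)
```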


Figure 7. Word clouds comparing terms used by two politicians.

5 Conclusions

This paper presents an introduction to Open Data and points out how it could help governments to improve transparency and accountability. Furthermore, it describes some reasons why governments refuse to engage in Open Data initiatives, as well as the disadvantages encountered if such initiatives are not managed correctly. The work described how and why Cloud Computing provides an appropriate environment for working with Open Data, and identified and presented one of the many approaches that can be taken to set up this environment and the associated technologies involved. It also identified some of the common challenges faced by projects that deal with publicly available data and the methods used to overcome these challenges. Moreover, it showed multiple ways of visualizing data and how different datasets could be correlated to explore a portfolio of government data that is openly available on the web.

This work has many refinements that are currently ongoing. These include the incorporation of further data, e.g. membership of companies by politicians/their families/associates, as well as exploring social media use. The use of Twitter in particular offers a rich source of Open Data that can be accessed and used to help promote the overall information of government. Who is following whom on Twitter, who tweets on what topics, what is their sentiment on particular topics and how does this change over time are all on-going activities that are being pursued.

In all of this, it is emphasized that the purpose of this work is not to draw conclusions on any given government activity – this is the responsibility of others, e.g. investigative journalists. However, for truly democratic and transparent governments it is essential that the data can be reviewed and analysed and stand up to public scrutiny. We strongly encourage this endeavor. All of the software and systems used in this work are also available. The existing prototype system is available at http://130.56.249.15/proj/.

Acknowledgments

The authors are thankful for the feedback and discussions that have shaped this work. Specifically we would like to thank Prof. Margaret Simons (Director of the Centre for Advanced Journalism at the University of Melbourne).

References

Anderson, J. C., Lehnardt, J., & Slater, N. (2010). CouchDB: the definitive guide. O'Reilly Media, Inc.
Bayer, R. (1997). The universal B-tree for multidimensional indexing: General concepts. In Worldwide Computing and Its Applications (pp. 198-209). Springer.
Brito, J. (2007). Hack, mash & peer: Crowdsourcing government transparency.
Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., & Brandic, I. (2009). Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6), 599-616.
Davies, T. (2010). Open data, democracy and public sector reform: A look at open government data use from data.gov.uk. http://practicalparticipation.co.uk/odi/report/wp-content/uploads/2010/08/How-is-open-governmentdata-being-used-in-practice.pdf
Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
Dietrich, D., Gray, J., McNamara, T., Poikola, A., Pollock, R., Tait, J., et al. (2012). The Open Data Handbook. Retrieved 2014, from http://opendatahandbook.org/en/
Doctorow, C., Suber, P., Hubbard, T., Murray-Rust, P., Walsh, J., Tsiavos, P., et al. (2014). The Open Definition. Retrieved 2014, from http://opendefinition.org/

Ekanayake, J., Pallickara, S., & Fox, G. (2008). MapReduce for data intensive scientific analyses. Paper presented at eScience '08, the IEEE Fourth International Conference on eScience.
Gormley, C., & Tong, Z. (2014). Elasticsearch: The Definitive Guide. O'Reilly Media, Inc.
Hogge, B. (2010). Open data study. A report commissioned by the Transparency and Accountability Initiative, available for download at: http://www.soros.org/initiatives/information/focus/communication/articles_publications/publications/open-data-study-20100519
Lathrop, D., & Ruma, L. (2010). Open government: Collaboration, transparency, and participation in practice. O'Reilly Media, Inc.
Lin, J., & Dyer, C. (2010). Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers.
Mell, P., & Grance, T. (2011). The NIST Definition of Cloud Computing.
Robinson, D., Yu, H., Zeller, W. P., & Felten, E. W. (2008). Government data and the invisible hand. Yale JL & Tech., 11, 159.
Segaran, T., & Hammerbacher, J. (2009). Beautiful data: the stories behind elegant data solutions. O'Reilly Media, Inc.
Sriram, I., & Khajeh-Hosseini, A. (2010). Research agenda in cloud technologies. arXiv preprint arXiv:1001.3259.
Velte, T., Velte, A., & Elsenpeter, R. (2009). Cloud computing, a practical approach. McGraw-Hill, Inc.
Welcome to Open Government Data. (2014). Retrieved 2014, from http://opengovernmentdata.org/
White, T. (2012). Hadoop: The definitive guide. O'Reilly Media, Inc.
Zhang, Q., Cheng, L., & Boutaba, R. (2010). Cloud computing: state-of-the-art and research challenges. Journal of Internet Services and Applications, 1(1), 7-18.

Discovery of Sequential Patterns with Quantity Factors

Karim Guevara Puente de la Vega
Universidad Católica de Santa María / Arequipa
Universidad Nacional de San Agustín / Arequipa
[email protected]

Cesar Beltrán Castañón
Departamento de Ingeniería
Pontificia Universidad Católica del Perú / Lima
[email protected]

Abstract

Sequential pattern mining stems from the need to obtain patterns that are repeated across multiple transactions in a database of sequences, where the sequences are related to time or to another type of criterion. This work presents the proposal of a new technique for the discovery of sequential patterns from a database of sequences, where the patterns not only provide information on how the items relate over time, but where the mining process itself also includes the quantity factors associated with each of the items that are part of a sequence; as a result, information can be obtained about how these items relate with regard to their associated amounts. The proposed algorithm uses divide-and-conquer techniques, as well as indexing and partitioning of the database.

1 Credits

This document was written as part of the development of the 1st Symposium on Information Management and Big Data, SIMBig 2014. It has been adapted from the instructions for earlier ACL proceedings.

2 Introduction

Sequential pattern mining is the process by which the relationships between occurrences of sequential events are obtained, in order to find whether there is a specific order in which these events occur. There are many investigations in this area of study. All of them make use of the minimum support constraint, and some include other restrictions, such as the time interval within which the events are required to happen, the use of taxonomies defined by the user, and allowing the items of a sequence to have occurred not necessarily in a single transaction, but in two or more, as long as the times of these transactions fall within some small time window determined by the user.

In addition, the algorithms for mining sequential patterns treat the patterns in a uniform manner, despite the fact that the individual patterns in a sequence can have important differences, such as the amount associated with each item that makes up the pattern. For the foregoing reasons, the present paper proposes a technique intended to exploit these intrinsic relationships of the sequential patterns, in this specific case the relationship to the amount of each of the items. The inclusion of this aspect in sequential pattern mining makes it possible to obtain a set of sequential patterns that are not only frequent but that also let us know how the amounts associated with each item included in a frequent sequential pattern relate to each other. The inclusion of the quantity restriction within the extraction process of the frequent sequential patterns can therefore provide much more meaningful information.

The article is organized as follows: Section 3 covers previous work. Section 4 gives a description of the problem. Section 5 introduces the proposed technique. Section 6 shows the experiments and results, followed by the conclusions, future work and references.

3 Previous works

The techniques for the discovery of association rules are essentially boolean, since the quantities of the items purchased are discarded and attention is only paid to whether something was purchased or not. An important area of study is sequential pattern mining, which involves the extraction of patterns that are repeated across multiple transactions in a transactional database and that are related to time or another type of sequence. The problem of sequential pattern mining was introduced by Agrawal and Srikant (1995) with the example of typical customer purchases in a video rental shop. Customers often rent "Star Wars", then "Empire Strikes Back" and then "Return of the Jedi". All these rentals need not have been made consecutively; that is to say, there could be customers that rented some other video in between, and such sequences of transactions would still fall into the same pattern.

Research on mining sequential patterns is based on events that took place in an orderly fashion over time.

Most of the implemented algorithms for the extraction of frequent sequences use one of three different types of approaches, according to the way they evaluate the support of the candidate sequential patterns. The first group of algorithms is based on the apriori property, introduced by Agrawal and Srikant (1994) in the mining of association rules. This property states that any sub-pattern of a frequent pattern is also frequent, allowing candidate sequences to be pruned during the candidate generation process. Based on this heuristic, Agrawal and Srikant (1995) proposed algorithms such as AprioriAll and AprioriSome. The substantial difference between these two algorithms is that AprioriAll generates the candidates from all the large sequences found, even those that may not be maximal, whereas AprioriSome only counts those large sequences that are maximal, thus reducing the search space of the patterns.

In subsequent work, Srikant and Agrawal (1996) proposed the GSP algorithm (Generalized Sequential Patterns), also based on the apriori technique, outperforming the previous algorithms by up to 20 times. Until that time, the algorithms that had been proposed for mining sequential patterns focused on obtaining patterns taking into account only the minimum support given by the user. But these patterns could fit transactions that had occurred at very distant intervals of time, which was not convenient for the purposes of mining. So, in that work, they propose the idea that, in addition to the minimum support, the user could be given the ability to specify an interest in obtaining patterns that fit transactions occurring in certain periods of time. This is achieved through the inclusion of restrictions on the maximum and minimum distance between transactions, the size of the window in which the sequences must occur, and inheritance relationships (taxonomies), which are cross-relations through a hierarchy.

In these apriori-based algorithms, the greatest effort was focused on developing specific structures to represent the candidate sequential patterns and, in this way, make the support counting operations faster.

The second group comprises the algorithms that seek to reduce the size of the set of scanned data by projecting the initial database and obtaining the patterns from the projections, without involving a candidate generation process. Using this technique, under a divide-and-conquer approach, Han et al. (1996) proposed the algorithm FreeSpan (Frequent pattern-projected Sequential pattern mining), and Pei et al. (2001) proposed PrefixSpan (Prefix-projected Sequential pattern mining). In these algorithms the database of sequences is projected recursively into a set of smaller databases from which the sub-sequence fragments grow, based on the current set of frequent sequences, and the patterns are extracted from these projections.

Han et al. (1996) show that FreeSpan extracts the full set of patterns and is more efficient and considerably faster than the GSP algorithm. However, a sub-sequence may be generated by combinations of sub-strings in a sequence, while the projection in FreeSpan must follow the sequence in the initial database without reducing its length. In addition, it is very expensive that the growth of a sub-sequence may be explored at any split point within a candidate sequence. As an alternative to this problem, Pei et al. (2001) propose PrefixSpan. The general idea is to examine only the prefixes, and to project into the projected databases only their corresponding postfix sub-sequences. In each of these projected databases the sequential patterns are expanded by exploring only the locally frequent patterns. PrefixSpan extracts the full set of patterns, and its efficiency and implementation are considerably better than both GSP and FreeSpan.

The third group is formed by algorithms that keep in memory only the information necessary for the evaluation of the support. These algorithms are based on so-called occurrence lists, which contain a description of the locations where the patterns occur in the database. Under this approach, Zaki (2001) proposes the SPADE algorithm (Sequential PAttern Discovery using Equivalence classes), where he introduces the transformation of the database into a vertical format; in addition, and in contrast to the apriori-based algorithms, it does not perform multiple passes over the database and can extract all the frequent sequences in only three passes. This is due to the incorporation of new techniques and concepts such as the list of identifiers (id-list) in vertical format that is associated with the sequences; from these lists, frequent sequences can be generated by means of temporal joins (a small sketch of the vertical format is given below).
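The id-list idea can be sketched as follows: each item is mapped to the list of (sequence id, itemset position) pairs where it occurs, so that the occurrences of longer patterns can be obtained by joining the lists of their parts. This is an illustrative simplification in Python, not SPADE itself.

```python
from collections import defaultdict

def vertical_format(db):
    """Map each item to its id-list: the (sequence id, itemset position) pairs
    at which it occurs. `db` is a list of sequences, each a list of item sets."""
    id_lists = defaultdict(list)
    for sid, sequence in enumerate(db):
        for pos, itemset in enumerate(sequence):
            for item in itemset:
                id_lists[item].append((sid, pos))
    return id_lists

def follows(id_a, id_b):
    """A simple temporal join: sequence ids in which some occurrence of the
    second item comes after an occurrence of the first one."""
    return sorted({sid_b for (sid_a, pos_a) in id_a
                         for (sid_b, pos_b) in id_b
                         if sid_a == sid_b and pos_b > pos_a})
```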

SPADE also uses a lattice-based approach to decompose the search space into small classes that can be processed independently in main memory, and it uses both breadth-first and depth-first search to find the frequent sequences within each class.

In addition to the techniques mentioned earlier, Lin and Lee (2005) propose the first algorithm that implements the idea of memory indexing, called MEMISP (Memory Indexing for Sequential Pattern mining). The central idea of MEMISP is to use memory both for the data sequences and for the indexes built during the mining process, and to implement an indexing and search strategy to find all the frequent sequences from the data sequences held in memory, which are read from the database in a first pass. It requires only one pass over the database, or at most two for very large databases. It also avoids the generation of candidates and the projection of databases, but it presents as a disadvantage a high CPU and memory utilization.

The fourth group of algorithms is composed of all those that use fuzzy techniques. One of the first works is that of Wang et al. (1999), who propose a new data-mining algorithm which takes advantage of fuzzy set theory to enhance the capability of exploring interesting sequential patterns from databases with quantitative values. The proposed algorithm integrates concepts of fuzzy sets and the AprioriAll algorithm to find interesting sequential patterns and fuzzy association rules from transaction data. The rules can thus predict what products, and in what quantities, will be bought next by a customer, and can be used to provide suggestions to appropriate supervisors.

Wang et al. (1999) propose the fuzzy quantitative sequential patterns (FQSP) algorithm, where an item's quantity in the pattern is represented by a fuzzy term rather than a quantity interval. In their work an Apriori-like algorithm was developed to mine all FQSP; it suffers from the usual weaknesses of that family, including: (1) it may generate a huge set of candidate sequences and (2) it may require multiple scans of the database. Therefore, an Apriori-like algorithm often does not have good performance when a sequence database is large and/or when the number of sequential patterns to be mined is large.

Chen et al. (2006) propose the divide-and-conquer fuzzy sequential mining (DFSM) algorithm to solve the same problem presented by Hong using the same concepts; since DFSM possesses the divide-and-conquer strategy of the PrefixSpan algorithm, its performance is better than Hong's algorithm and, consequently, better than that of Wang et al.

Fiot (2008) in her work suggests that a quantitative item is partitioned into several fuzzy sets. In the context of fuzzy logic, a fuzzy item is the association of a fuzzy set b with its corresponding item x, i.e. [x,b]. In the database, each record is associated with a fuzzy item [x,b] according to its degree of membership. A set of fuzzy items is denoted by the pair (X,B), where X is the set of items and B is a set of fuzzy sets. In addition, she argues that a fuzzy g-k-sequence (s1, s2, ..., sp) is formed by g fuzzy itemsets s=(X,B) grouping k fuzzy items [x,b]; fuzzy sequential pattern mining therefore consists in finding the maximally frequent fuzzy g-k-sequences.

Fiot (2008) also provides a general definition of the frequency of a sequence and presents three algorithms to find fuzzy sequential patterns: SpeedyFuzzy, which counts all the objects or items of a fuzzy set in the same way regardless of their degree, provided it is greater than 0, so that all objects have the same weight; MiniFuzzy, which counts the objects or items of a fuzzy set but supports only those items of the candidate sequence whose degree of membership is greater than a specified threshold; and TotallyFuzzy, which accounts for each object and each sequence, taking into account the importance of the set or sequence of data and considering the best degree of membership.

4 Description of the Problem

A sequence s, denoted by <e1, e2, ..., en>, is an ordered set of n elements, where each element ei is a set of objects called an itemset. An itemset, denoted by (x1[c1], x2[c2], ..., xq[cq]), is a non-empty set of q elements, where each element xj is an item, represented by a literal, and cj is the amount associated with the item xj, represented by a number in square brackets. Without loss of generality, the objects of an element are assumed to be in lexicographical order by their literal. The size of the sequence s, denoted by |s|, is the total number of objects in all the elements of s, so a sequence s is a k-sequence if |s|=k. For example, <(a[5])(c[2])(a[1])>, <(a[2],c[4])(a[3])> and <(b[2])(a[2],e[3])> are all 3-sequences. A sequence s = <e1, e2, ..., en> is a sub-sequence of another sequence s' = <e'1, e'2, ..., e'm> if there are integers 1 ≤ i1 < i2 < ... < in ≤ m such that e1 ⊆ e'i1, e2 ⊆ e'i2, ..., en ⊆ e'in; s' is then said to contain the sequence s.

Similarly, <(b,c)(c)(a,c,e)> contains <(b)(a,e)>, where the quantities may be different.

The support (sup) of a sequential pattern X is defined as the fraction of the records that contain X over the total number of records in the database. The counter for each item is increased by one each time the item is found in a different transaction of the database during the scanning process. This means that the support counter does not take into account the quantity of the item. For example, in one transaction a customer buys three bottles of beer, but this only increases the support counter of {beer} by one; in other words, if a transaction contains an item, the support counter of that item is incremented by one only.

Each sequence in the database is known as a data sequence. The support of the sequence s is denoted by s.sup and represents the number of data sequences that contain s divided by the total number of sequences in the database. The minSup threshold is the minimum support specified by the user. A sequence s is frequent if s.sup ≥ minSup, and it is then a frequent sequential pattern. Thus, given the value of minSup and a database of sequences, the problem of sequential pattern mining is to discover the set of all sequential patterns whose supports are greater than or equal to the minimum support (s.sup ≥ minSup).

Consider the sequence C6 of the database in Figure 1 (shown below), which consists of three elements: the first has the objects b and c, the second has the object c, and the third has the objects a, c, and e. The support of <(b)(a)> is 4/6, since all the data sequences with the exception of C2 and C3 contain <(b)(a)>. The sequence <(a,d)(a)> is a sub-sequence of both C1 and C4, and therefore <(a,d)(a)>.sup = 2/6.

Definition: given a pattern α and a frequent item x in the database of sequences, α' is a:
• Pattern of type-1: if α' can be formed by adding the itemset that contains the item x to α as a new element of α.
• Pattern of type-2: if α' can be formed by the extension of the last element of α with x.

The item x is called the stem of the sequential pattern α', and α is the prefix pattern of α'. For example, given the pattern <(a)> and the frequent item b, the type-1 pattern <(a)(b)> is obtained by adding (b) to <(a)>, and the type-2 pattern <(a,b)> by the extension of <(a)> with b. Similarly, <(a)> is the prefix pattern (α-pat), which in turn is a frequent sequence, and b is the stem of both <(a)(b)> and <(a,b)>. Note that the null sequence, denoted by <>, is the α-pat of any frequent 1-sequence. Therefore, a frequent k-sequence is either a type-1 or a type-2 pattern of a frequent (k-1)-sequence.

The database of sequences of Figure 1, which includes the amounts for the items, has six data sequences:

Sequences
C1 = <(a[1],d[2]) (b[3],c[4]) (a[3],e[2])>
C2 = <(d[2],g[1]) (c[5],f[3]) (b[2],d[1])>
C3 = <(a[5],c[3]) (d[2]) (f[2]) (b[3])>
C4 = <(a[4],b[2],c[3],d[1]) (a[3]) (b[4])>
C5 = <(b[3],c[2],d[1]) (a[3],c[2],e[2]) (a[4])>
C6 = <(b[4],c[3]) (c[2]) (a[1],c[2],e[3])>

Figure 1. Database of sequences
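To make the definitions above concrete, the following Python sketch (not part of the original paper) encodes the database of Figure 1 and checks the supports quoted in the text; the quantities are carried along but, as stated above, are ignored when counting support.

```python
# Each data sequence is a list of itemsets; an itemset maps item -> quantity,
# mirroring the database of Figure 1 (quantities shown in square brackets).
DB = [
    [{"a": 1, "d": 2}, {"b": 3, "c": 4}, {"a": 3, "e": 2}],          # C1
    [{"d": 2, "g": 1}, {"c": 5, "f": 3}, {"b": 2, "d": 1}],          # C2
    [{"a": 5, "c": 3}, {"d": 2}, {"f": 2}, {"b": 3}],                # C3
    [{"a": 4, "b": 2, "c": 3, "d": 1}, {"a": 3}, {"b": 4}],          # C4
    [{"b": 3, "c": 2, "d": 1}, {"a": 3, "c": 2, "e": 2}, {"a": 4}],  # C5
    [{"b": 4, "c": 3}, {"c": 2}, {"a": 1, "c": 2, "e": 3}],          # C6
]

def contains(data_seq, pattern):
    """True if `pattern` (a list of item sets) is a sub-sequence of `data_seq`,
    ignoring quantities, exactly as the support definition requires."""
    i = 0
    for itemset in data_seq:
        if i < len(pattern) and pattern[i].issubset(itemset):
            i += 1
    return i == len(pattern)

def support(pattern):
    """Fraction of data sequences that contain the pattern."""
    return sum(contains(ds, pattern) for ds in DB) / len(DB)

print(support([{"b"}, {"a"}]))        # 4/6, as stated in the text
print(support([{"a", "d"}, {"a"}]))   # 2/6
```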

5 Algorithm for the discovery of sequential patterns with quantity factors - MSP-QF

The algorithm for mining sequential patterns with quantity factors arises from the need to discover, from a database of sequences, the set of sequential patterns that include the amounts associated with each of the items that take part in the transactions of the database. Having this additional information, one can know with greater precision not only the relationship with respect to time that exists between the various items involved in the transactions of a sequence, but also the relationship between the amounts of these items.

The algorithm MSP-QF is based on the idea of the use of prefixes and on the creation of indexes, built from the database of sequences or from other indexes generated during the mining process, in which frequent patterns are searched for recursively. As a result of the exploration of a particular index, fewer and shorter data sequences need to be processed, while the patterns that are found become longer. In addition, if the sequence database is very large, partitioning techniques are used so that the algorithm is applied to each of the partitions as if it were a smaller database.

5.1 Procedure of the algorithm MSP-QF

Listed below are the steps of the proposed algorithm.

Step 1: Partitioning and scanning of the database of sequences. Depending on the size of the database, it can be partitioned so that each of the partitions is scanned independently. For each partition, the sequences are constructed and stored in the DBSeq structure. At the same time, the index of items is generated, in which the support of each item found during the scanning process is stored.

Step 2: The index of items is filtered to keep those that are frequent, i.e., those whose support is greater than or equal to the minSup determined by the user. All these items form sequences of size |s| = 1 and therefore constitute the set of 1-sequences. For all these frequent item sequences, the amounts associated with each item are recorded at the time the sequence is saved in the set of frequent patterns.

Step 3: For each of the frequent patterns α found in step 2, or as a result of step 4, the index α-idx is constructed, with entries (ptr_ds, pos), where ptr_ds refers to a sequence of the database in which the pattern α appears, and pos is the pair (posItemSet, posItem), where posItemSet is the position of the itemset in the sequence and posItem is the position of the item in that itemset of the sequence where the pattern appears. The values of pos allow the subsequent scans to be performed only from these positions onwards in a given sequence.

Step 4: Find the stems of type-1 and/or type-2 for each pattern α and its corresponding index α-idx generated in the previous step, considering only those items of the sequences referred to in α-idx and the respective values of pos. At the same time as the stems are found, their supports are calculated, and the amount recorded for the item in the sequence of the database being examined is added to the list of quantities of the item that forms the stem. The information on the stems and their quantities is stored in another index of stems. This step is repeated for the same pattern until it has no more stems.

Step 5: When there are no more stems, the index of stems is filtered to keep those that are frequent. For all the frequent stems (sequences), the quantities associated with each item are discretized and stored in the set of frequent patterns. Before adding a pattern to this set, it is verified that the recently found frequent pattern has not already been added to it as a result of applying the algorithm to a partition of the database processed previously. If the frequent pattern already exists in the set of frequent patterns, the discretization process is applied again to the union of the set of quantities associated with the sequence already stored as a frequent pattern and the set of quantities of the newly discovered frequent pattern; otherwise, the frequent pattern found in the current partition is added directly to the set of frequent patterns. Steps 3, 4 and 5 are then performed recursively with each of the frequent patterns found in the process.

Discretization Function: This function is responsible for converting the set of quantities associated with an item into the range of values given by the mean and standard deviation of this set. For example, given the sequences of Figure 1, the set of quantities associated with the item <(a)> is 1, 5, 4, 3, 1, which after being discretized gives the interval [ 2.8±1.6 ].
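The Discretize step can be reproduced in a few lines of Python, as sketched below; using the population standard deviation is our assumption, chosen because it reproduces the [ 2.8±1.6 ] interval quoted above.

```python
from statistics import mean, pstdev

def discretize(quantities):
    """Collapse the quantities observed for an item into the interval
    [mean ± standard deviation], as the Discretize function does."""
    return mean(quantities), pstdev(quantities)

# Quantities of item 'a' taken from the data sequences of Figure 1.
m, sd = discretize([1, 5, 4, 3, 1])
print(f"[ {m:.1f}±{sd:.1f} ]")   # -> [ 2.8±1.6 ]
```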

To summarize the steps carried out in the proposed algorithm, Figure 3 shows a schematic of the entire procedure.

5.2 Algorithm specification MSP-QF

Here we show the specification of the proposed algorithm MSP-QF.

Algorithm MSP-QF
In:  DB = database of sequences
     minSup = minimum support
     partition = number of sequences included in each of the partitions
Out: set of all sequential patterns with quantity factors.
Procedure:
1. Partition the DB.
2. Scan each partition into main memory and:
   (i) build the sequences and store them in the DBSeq structure.
   (ii) index the items and determine the support of each item.
   (iii) associate the quantities of each item in a sequence with the list of item quantities in the index.
3. Find the set of frequent items.
4. For each frequent item x,
   (i) form the sequential pattern α = <(x)>
   (ii) call Discretize(α) to discretize the set of quantities associated with each item x.
   (iii) store α in the set of frequent patterns.
   (iv) call Indexing(x, <>, DBSeq) to build the α-idx index.
   (v) call Mining(α, α-idx) to obtain patterns from the index α-idx.

Subroutine Indexing(x, α, set_Seq)
Parameters:
  x = a type-1 or type-2 stem;
  α = prefix pattern (α-pat);
  set_Seq = set of data sequences.
  /* If set_Seq is an index, then each data sequence in the index is referenced by the element ptr_ds, which is part of the entry (ptr_ds, pos) of the index. */
Out: index α'-idx, where α' represents the pattern formed by the stem x and the prefix pattern α-pat.
Procedure:
1. For each data sequence ds of set_Seq:
   (i)  If set_Seq = DBSeq then pos_initial = 0, else pos_initial = pos.
   (ii) Find the stem in the sequence ds from position (pos_initial + 1):
        1. If the stem x is at position pos in ds, then insert the pair (ptr_ds, pos) into the α'-idx index, where ptr_ds references ds.
        2. If the stem x is equal to the item x' of the sequence ds, add the quantity q associated with the item x' to the list of quantities related to x.
2. Return the α'-idx index.

Subroutine Discretize(α)
Parameters: α = a pattern, which is a sequence.
Output: the arithmetic mean and standard deviation of the quantities associated with each item of the pattern α.
Procedure:
1. For each itemset s in α do
   a) For each item x in s do
      (i)  calculate the arithmetic mean and standard deviation of the set of quantities associated with the item x.
      (ii) store the arithmetic mean and standard deviation in the pattern α.

Subroutine Mining(α, α-idx)
Parameters: α = a pattern; α-idx = an index.
Procedure:
1. For each data sequence ds referenced by the ptr_ds of an entry (ptr_ds, pos) in α-idx:
   (i) starting from position (pos + 1) up to |ds|, determine the potential stems and increase the support of each of these stems by one.
2. Filter the stems that have a sufficiently large support.
3. For each stem x found in the previous step:
   (i)   form the sequential pattern α' from the prefix pattern α-pat and the stem x.
   (ii)  call Discretize(α') to discretize the quantities associated with the items of α'.
   (iii) call Indexing(x, α, α'-idx) to build the index α'-idx.
   (iv)  call Mining(α', α'-idx) to discover sequential patterns from the index α'-idx.

Figure 2. Specification of the algorithm MSP-QF

Figure 3. Schema of the procedures of the algorithm MSP-QF
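As an illustration of the Discretize subroutine, the quantity factor of an item is simply the mean and (population) standard deviation of the quantities observed for that item. The following is a minimal sketch in Python/NumPy; the function name is ours and is not part of the paper's specification.

    import numpy as np

    def quantity_factor(quantities):
        """Discretize a list of item quantities into the interval [mean ± std]."""
        q = np.asarray(quantities, dtype=float)
        return q.mean(), q.std()   # population standard deviation

    # quantities of item <(a)> in the example of figure 1
    mean, std = quantity_factor([1, 5, 4, 3, 1])
    print(f"[ {mean:.1f}±{std:.1f} ]")   # -> [ 2.8±1.6 ]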

6 Experiments and Results

The experiments to test the proposed technique were carried out in two different scenarios, which are described below.

6.1 Scenario 1: Real data

The technique was applied to the analysis of the market basket of a supermarket. These tests consisted of obtaining the set of frequent sequential patterns from a database collected over three non-consecutive periods. The first period goes from mid-December 1999 until mid-January 2000; the second from early 2000 until the beginning of June of the same year; the third from late August 2000 until the end of November 2000. The database consists of 88163 transactions and about 3000 unique items for approximately 5133 customers.

The purpose of these tests is to discover patterns of customer behavior in the supermarket and, in addition, to obtain the quantity of each of the items purchased by these customers as a result of applying the proposed technique, which gives more accurate and significant information in terms of the quantity purchased of each item.

Seven tests were carried out with minimum supports of 10%, 2%, 1.5%, 1.0%, 0.75%, 0.50% and 0.25%; the results, shown in figure 4, were compared with the results of the MEMISP technique.

minSup (%) | MEMISP Exe. time (s) | MEMISP No. patterns | MSP-QF Exe. time (s) | MSP-QF No. patterns
10.00      | 4                    | 50                  | 5                    | 50
2.00       | 12                   | 824                 | 15                   | 824
1.50       | 16                   | 1371                | 19                   | 1371
1.00       | 22                   | 2773                | 27                   | 2773
0.75       | 28                   | 4582                | 35                   | 4582
0.50       | 39                   | 9286                | 50                   | 9286
0.25       | 72                   | 30831               | 89                   | 30831

Figure 4. Results of the tests for scenario 1 (analysis of the market basket): execution time vs. minimal support for MEMISP and MSP-QF

In the test with minSup = 2%, 824 sequential patterns with quantity factors were obtained, some of which are:

<Olive[ 1.06±0.51 ]>
<Olive[ 1.03±0.5 ]> <Paprika[ 0.37±0.23 ]>
<Olive[ 1.01±0.5 ]> <Porro[ 0.57±0.27 ]>
<Celery[ 0.56±0.25 ]> <Lettuce[ 0.53±0.24 ], Lentils[ 0.54±0.23 ]>
<Celery[ 0.56±0.26 ]> <Lettuce[ 0.55±0.24 ], Paprika[ 0.34±0.17 ]>
<Lettuce[ 0.61±0.25 ], Lentils[ 0.56±0.26 ]> <Porro[ 0.59±0.24 ], Paprika[ 0.33±0.15 ]>
<Porro[ 0.54±0.27 ], Lentils[ 0.62±0.25 ]> <Lentils[ 0.58±0.25 ]> <Paprika[ 0.35±0.17 ]>

From these sequential patterns we can state the following about the purchases made by customers:
• Customers buy only olives, in a quantity of 1.06±0.51.
• Customers who first purchased only olives return a next time for paprika or for porro, with quantities of 0.37±0.23 and 0.57±0.27, respectively. Those who buy paprika afterwards had bought olives in a quantity equal to 1.03±0.5, while those who acquire porro did so with a quantity equal to 1.01±0.5.
• Those who buy lettuce also buy lentils at the same time, in quantities equal to 0.61±0.25 and 0.56±0.26, respectively. Later, these same customers buy porro and paprika in quantities equal to 0.59±0.24 and 0.33±0.15.
• Those who buy porro also buy lentils in the same transaction; later they return to buy only lentils, and the next time only paprika, in the quantities listed in the pattern.

6.2 Scenario 2: Synthetic data

In this second scenario, multiple synthetic databases (datasets) were generated by means of the Synthetic Data Generator Tool. The generation process is the one described by Agrawal and Srikant (1995), with the parameters used in the work of Lin and Lee (2005). In this scenario, effectiveness, efficiency and scalability tests were carried out.

The effectiveness and efficiency tests were performed on datasets generated with the following parameters: NI = 25000, NS = 5000, N = 10000, |S| = 4, |I| = 1.25, corrS = 0.25, crupS = 0.75, corrI = 0.25 and crupI = 0.75. The results of these tests were compared with the results obtained for the algorithms PrefixSpan-1, PrefixSpan-2 and MEMISP.

Efficiency tests: a first subset of tests was run for |C| = 10 and a database of 200,000 sequences, with different values of minSup. The results are shown in figure 5.

Figure 5. Test results for |C|=10 and |DB|=200K (execution time vs. minimal support for PrefixSpan-1, PrefixSpan-2, MEMISP and MSP-QF)

The second subset of tests was conducted with a dataset with values of |C| and |T| of 20 and 5, respectively. This value of |T| implies that the number of items per transaction increases, which means that the database is larger and denser with respect to the number of frequent sequential patterns that may be obtained. The results of these tests are shown in figure 6.

Figure 6. Test results for |C|=20, |T|=5 and |DB|=200K

A last subset of efficiency tests was carried out under the same parameters as the previous subset, except that |T| was increased to 7.5. The results are shown in figure 7.

Figure 7. Test results for |C|=20, |T|=7.5 and |DB|=200K

Efficacy tests: 92 efficacy trials were carried out with the same datasets used in the efficiency tests. In four of the tests, carried out with |C| = 20 and |T| equal to 2.5, 5 and 7.5, the algorithm did not obtain the same number of sequential patterns as PrefixSpan-1 and PrefixSpan-2; these 4 tests represent 4% of the total.

Scalability tests: the scalability tests used synthetically generated datasets with the same parameter values as the first subset of efficiency tests, and a minimal support equal to 0.75%. The number of sequences in these datasets ranged from |DB| = 1000K to 10000K, i.e., from one million to ten million sequences. Figure 8 shows the results of these tests.

Figure 8. Results of the scalability tests for minSup=0.75% and different dataset sizes (execution time relative to |DB|=1000K vs. millions of sequences)

7 Conclusion

We have proposed an algorithm for the discovery of sequential patterns that allows, as part of the mining process, the quantity factors linked to the frequent sequential patterns to be inferred from the quantities associated with each of the items of the transactions that make up a sequence.

The proposed technique has been designed to use a compact set of indices that focus the search for sequential patterns on the frequent patterns already found, which represent the prefixes of the patterns to be found. For this reason, the size of the indexes decreases as the mining process progresses.

In addition, the information provided by the frequent patterns with quantity factors is much more accurate, since it not only describes the temporal relationship of the items across the various transactions, but also the relationship between the quantities of some items and others, which enriches the semantics provided by the set of sequential patterns.

Finally, from the results obtained in Section 6 we can conclude that the proposed technique meets the objectives of the mining process: it is effective, efficient and scalable, since it shows a linear behavior as the sequence database grows, and when applied to large databases its performance turned out to be better than that of the techniques discussed in this work.

References

Agrawal Rakesh and Srikant Ramakrishnan. 1994. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB).
Agrawal Rakesh and Srikant Ramakrishnan. 1995. Mining sequential patterns. In 11th International Conference on Data Engineering (ICDE'95), Taipei, Taiwan.
Agrawal Rakesh and Srikant Ramakrishnan. 1996. Mining quantitative association rules in large relational tables. In Proceedings of the 1996 ACM SIGMOD Conference, Montreal, Québec, Canada.
Alatas Bilal, Akin Erhan and Karci Ali. 2008. MODENAR: Multi-objective differential evolution algorithm for mining numeric association rules. Applied Soft Computing.
Chen Yen-Liang and Huang T. Cheng-Kui. 2006. A new approach for discovering fuzzy quantitative sequential patterns in sequence databases. Fuzzy Sets and Systems, 157(12):1641–1661.
Fiot Céline. 2008. Fuzzy sequential patterns for quantitative data mining. In Galindo, J. (Ed.), Handbook of Research on Fuzzy Information Processing in Databases.
Han Jiawei, Pei Jian, Mortazavi-Asl Behzad, Chen Qiming, Dayal Umeshwar and Hsu Mei-Chun. 2000. FreeSpan: Frequent pattern-projected sequential pattern mining. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Karel Filip. 2006. Quantitative and ordinal association rules mining (QAR mining). In 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems (KES 2006), South Coast, UK. Springer, Heidelberg.
Lin Ming-Yen and Lee Suh-Yin. 2005. Fast discovery of sequential patterns through memory indexing and database partitioning. Journal of Information Science and Engineering.
Market-Basket Synthetic Data Generator, http://synthdatagen.codeplex.com/.
Molina L. Carlos. 2001. Torturando los Datos hasta que Confiesen. Departamento de Lenguajes y Sistemas Informáticos, Universidad Politécnica de Cataluña, Barcelona, España.
Papadimitriou Stergios and Mavroudi Seferina. 2005. The fuzzy frequent pattern tree. In 9th WSEAS International Conference on Computers, Athens, Greece. World Scientific and Engineering Academy and Society.
Pei Jian, Han Jiawei, Mortazavi-Asl Behzad and Pinto Helen. 2001. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the 17th International Conference on Data Engineering (ICDE '01).
Srikant Ramakrishnan and Agrawal Rakesh. 1996. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th International Conference on Extending Database Technology (EDBT'96), pages 3–17, Avignon, France.
Takashi Washio, Yuki Mitsunaga and Hiroshi Motoda. 2005. Mining quantitative frequent itemsets using adaptive density-based subspace clustering. In Fifth IEEE International Conference on Data Mining (ICDM'05), Houston, Texas, USA. IEEE Computer Society.
Wang Shyue, Kuo Chun-Yn and Hong Tzung-Pei. 1999. Mining fuzzy sequential patterns from quantitative data. In 1999 IEEE International Conference on Systems, Man, and Cybernetics, vol. 3, pp. 962–966.
Zaki Mohammed J. 2001. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1-2):31–60.

Online Courses Recommendation based on LDA

Rel Guzman Apaza, Elizabeth Vera Cervantes, Laura Cruz Quispe, José Ochoa Luna
National University of St. Agustin, Arequipa - Perú
{r.guzmanap, elizavvc, lvcruzq, eduardo.ol}@gmail.com

Abstract

In this paper we propose a course recommendation system based on the historical grades of students in college. Our model is able to recommend available courses on sites such as Coursera, Udacity, edX, etc. To do so, probabilistic topic models are used as follows. On one hand, the Latent Dirichlet Allocation (LDA) topic model infers topics from the content given in a college course syllabus. On the other hand, topics are also extracted from massive open online course (MOOC) syllabi. These two sets of topics and the grading information are matched using a content-based recommendation system so as to recommend relevant online courses to students. Preliminary results show the suitability of our approach.

1 Introduction

Nowadays, the amount of educational resources spread over the Internet is huge and diverse (Martin, 2012). Massive Open Online Courses (MOOCs) such as Coursera, Udacity and edX, to name a few, are gaining momentum (Fischer, 2014). It is possible to find courses from almost every knowledge domain. This vast offer overwhelms any user willing to find courses according to his/her background. The task can be tedious because it involves accessing each platform, searching the available courses, selecting some of them, reading each course syllabus carefully, and choosing appropriate content. The process can become unmanageable if we extend the search beyond online courses to general educational content.

In this work we propose a system for online course recommendation, with MOOC courses as the primary focus. To do so, we rely on topic models (Blei, 2012), an unsupervised probabilistic generative model which, given a set of documents and a number of topics as input, automatically returns a relevant set of words probabilistically associated with each topic. Why is this scheme valuable? Consider, for instance, the huge number of digitalized books of a public library: this algorithm can automatically discover the main topic words and therefore allows one to gain insight into the content of the books.

Currently, educational systems and data mining form an emerging research area (Romero and Ventura, 2010). These systems use different recommendation techniques in order to suggest online learning activities, based on preferences, knowledge and data from other students with similar interests (Romero et al., 2007). In (Kuang et al., 2011) the authors provide resource recommendation for users of an e-learning system based on contents and user log activities: a method based on topic modeling that used Latent Dirichlet Allocation (LDA) to obtain a low-dimensional vector, Gibbs sampling for inference, and cosine similarity between document topic distributions to find neighbor resources. The authors of (Haruechaiyasak and Damrongrat, 2008) also recommend documents, in this case Wikipedia articles, by calculating similarity measures among the topic distributions of the articles. The model proposed in (Sadikov and Bratko, 2011) is a hybrid recommendation system whose core is a linear regression model based on stochastic gradient descent.

To predict the rank of a lecture, they used and compared the predictions made by content-based and collaborative-based methods. In that paper the attributes that represent each video-lecture were established manually, unlike our work, where the attributes of the courses are defined by the LDA algorithm. In (Sadikov and Bratko, 2011), to find a rank they measured the correlation between an old lecture (a lecture the visitor has already seen) and the new lectures (lectures the visitor has not seen yet), ordered these measures in a list where the lowest comes first, and used these computations in the linear regression model. They also report that there was not much difference between using content-based or collaborative-based methods, but that their system could have been improved by using textual attributes, which is our case.

In our proposal, the Latent Dirichlet Allocation (LDA) (Blei et al., 2003) topic model is mainly used as a feature descriptor of courses. Thus, we assume that each course has a set of inherent topics and therefore relevant words that summarize them. In our content-based recommendation setting, those are the input features that describe courses. We are concerned with discovering the parameter vector of users, i.e., weights over topic words that denote user preferences on courses. In order to infer this user vector, we rely on supervised machine learning algorithms: we take the grades obtained in college courses as ratings, learn user weights, and predict ratings for unseen MOOC courses. Preliminary results show the suitability of this approach.

The paper is organized as follows. In Section 2 background is given. In Section 3 our proposal is presented. Section 4 shows experimental results. Finally, Section 5 concludes the paper.

2 Background

2.1 Probabilistic Topic Modeling

Topic models are probabilistic models that have been mainly used to discover topics in big collections of text documents. They are unsupervised learning techniques (Duda et al., 2012) that do not require any prior annotation or labeling of the documents: the topics emerge from the analysis of the original texts (Blei, 2012). To do so, they assume that each document is a combination of topics and each topic is a probability distribution over words (Blei et al., 2003). Topic models are a type of graphical model based on Bayesian networks.

The generative process described by a topic model does not make any assumptions about the order of words as they appear in documents. The only information relevant to the model is the number of times words are produced; this is known as the "bag-of-words" assumption (Steyvers and Griffiths, 2007).

There are two main topic models: LDA (Blei et al., 2003) and Probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 1999). In this work we use LDA due to its more general model. It is also worth noting that LDA has been previously used in recommendation systems (Romero and Ventura, 2010; Romero et al., 2007; Kuang et al., 2011).

2.2 Topic Modeling using Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is widely used for identifying topics in a set of documents, building on previous work by Hofmann (Hofmann, 1999). The corresponding graphical model representation is depicted in Figure 1, where each document is represented as a mixture of a fixed number of topics, with topic z receiving weight θ_z^(d) in document d, and each topic is a probability distribution over a finite vocabulary of words, with word i having probability φ_i^(z) in topic z.

Figure 1: Graphical model for the topic modeling using plate notation

Symmetric Dirichlet priors are placed on θ^(d) and φ^(j), with θ^(d) ∼ Dirichlet(α) and φ^(j) ∼ Dirichlet(β), where α and β are hyper-parameters that affect the sparsity of these distributions.

The hyper-parameter α can be interpreted as a prior observation count for the number of times a topic is sampled in a document, and β as the prior observation count for the number of times words are sampled from a topic before any word from the corpus is observed. This smooths the word distribution in every topic, with the amount of smoothing determined by β. The goal of inference in this model is to identify the values of φ and θ, given a corpus of D documents represented by a vocabulary of W words.

In our proposal, each course is a document d that has its related sequence of N_d word tokens, out of N words in the overall corpus.

2.3 Gibbs Sampling Algorithm

Many algorithms have been proposed in the literature to obtain the main variables of interest θ and φ. (Hofmann, 1999) used the expectation-maximization (EM) algorithm; this approach suffers from problems involving local maxima of the likelihood function, which has motivated the search for better estimation algorithms like the ones proposed in (Blei et al., 2003; Buntine, 2002; Minka and Lafferty, 2002).

Instead of directly estimating the variables for each document, another approach is the algorithm called "Gibbs sampling" (Griffiths and Steyvers, 2004), which provides a relatively efficient method for extracting a set of topics from a large corpus. Gibbs sampling considers each word token in the text collection in turn, and estimates the probability of assigning the current word token to each topic, conditioned on the topic assignments of all other word tokens. From this conditional distribution, a topic is sampled and stored as the new topic assignment for this word token. We write this conditional distribution as:

P(z_j | z_{N\j}, w_N) = \frac{n^{(w_j)}_{z_j, N\j} + \beta}{n^{(\cdot)}_{z_j, N\j} + W\beta} \cdot \frac{n^{(d_j)}_{z_j, N\j} + \alpha}{n^{(d_j)}_{\cdot, N\j} + T\alpha}

where:
w_N = (w_1, ..., w_N) are the words in the entire corpus;
z_N = (z_1, ..., z_N) are the topic assignments of the words;
z_{N\j} denotes (z_1, ..., z_{j-1}, z_{j+1}, ..., z_N);
W is the size of the vocabulary;
n^{(w_j)}_{z_j, N\j} is the number of times the word w_j is assigned to topic z_j;
n^{(\cdot)}_{z_j, N\j} is the total number of words assigned to topic z_j;
n^{(d_j)}_{z_j, N\j} is the number of times a word in document d_j is assigned to topic z_j;
n^{(d_j)}_{\cdot, N\j} is the total number of words in document d_j;
all counts excluding the current token j.

From this probability distribution it is possible to make inference, in order to compute the conditional probability of the topic structure given the observed document. The probability distribution of topics in a document represents a feature vector for that document.
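The following is a minimal sketch, in Python/NumPy, of one collapsed Gibbs sampling update for a single token, implementing the conditional distribution above. It assumes that the count matrices n_wt (word-topic), n_t (topic totals), n_dt (document-topic) and n_d (document lengths) are maintained elsewhere; the names and data layout are ours, not the authors'.

    import numpy as np

    def resample_token(d, w, z_old, n_wt, n_t, n_dt, n_d, alpha, beta):
        """Resample the topic of one token (word w in document d)."""
        W, T = n_wt.shape
        # remove the current token from all counts
        n_wt[w, z_old] -= 1
        n_t[z_old] -= 1
        n_dt[d, z_old] -= 1
        n_d[d] -= 1
        # conditional distribution over the T topics
        p = ((n_wt[w, :] + beta) / (n_t + W * beta)) * \
            ((n_dt[d, :] + alpha) / (n_d[d] + T * alpha))
        p /= p.sum()
        z_new = np.random.choice(T, p=p)
        # add the token back with its new topic assignment
        n_wt[w, z_new] += 1
        n_t[z_new] += 1
        n_dt[d, z_new] += 1
        n_d[d] += 1
        return z_new

Iterating this update over every token of every document for a few hundred sweeps yields samples from which θ and φ can be estimated.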

(wj ) (dj ) We assume that the matrix is sparse, meaning that nz ,N j + β nz ,N j + α P (z z , w ) = j \ j \ most entries are unknown. An unknown rating im- j N j N ( ) (d ) | \ n · + W β · n j + T α plies that we have no explicit information about the zj ,N j ,N j \ · \ user’s preference for the item. The goal of a rec- where: ommendation system is to predict the blanks in the utility matrix. wN = (w1, . . . , wN ) are the words in the entire corpus There are two basic architectures for a recommen- dation system (Rajaraman and Ullman, 2012): zN = (z1, . . . , zN ) are the topic assignments of the words Content-based systems focus on properties of • items. Similarity of items is determined by

• Content-based systems focus on the properties of items. Similarity of items is determined by measuring the similarity in their properties.

• Collaborative-filtering systems focus on the relationship between users and items. Similarity of items is determined by the similarity of the ratings of those items by the users who have rated both items.

In a content-based system, we must construct a profile for each item, which is a record or collection of records representing important characteristics of that item. In simple cases, the profile consists of some characteristics of the item that are easily discovered; for example, for a movie there are the set of actors, the director, and the genre or general type of movie. In documents it is not immediately apparent what the values of the features should be. There are many kinds of documents for which a recommendation system can be useful: for example, there are many news articles published each day and we cannot read all of them, so a recommendation system can suggest articles on topics a user is interested in. Unfortunately, documents do not tend to have readily available features. A substitute that has been useful in practice is the identification of words that characterize the topic of a document. One approach is to compute the TF-IDF (term frequency - inverse document frequency) score for the words in the document; the words with the highest scores characterize the document, and documents are then represented by sets of words. In this paper we use a different approach, which relies on finding document topic information by means of topic modeling algorithms such as LDA.

3 Proposal

In order to recommend online courses, each course is considered a document with a given content. To characterize each course, LDA is used to uncover the semantic structure hidden in the document. Since LDA allows us to get a topic distribution for each course, this output is used as a feature vector for courses (items, in a content-based recommendation setting). A recommendation system is built using item profiles and utility matrices, and we treat the problem as one of machine learning: regard the given data as a training set and, for each user, build a classifier that predicts the rating of all items. The rest of this section describes our main design choices.

Consider the utility matrix in Table 1, used to represent a content-based recommendation system. The first column contains course names (college and MOOC courses). The second column contains the feature descriptors of the courses; each row denotes a different course, so each course has a different feature vector. The third column shows the user vector profile Θ^(1) for user 1; this vector could comprise user 1's preferences about art, math, social sciences, and so on. The same column also shows user 1's ratings for each course (they are in fact the grades obtained in college by user 1 — see for instance the rating 12 for Calculus). Further columns for user 2, user 3, and so on, would be added accordingly. Our goal is to predict the missing ratings for MOOC courses (the "?" symbol in the last row) for user 1 (user 2, 3, etc.).

Table 1: Utility matrix for courses

Course     | Features           | User 1 profile — rating
Calculus   | x_1, ..., x_n      | θ_1^(1), ..., θ_n^(1) — 12
...        | ...                | ...
ML (MOOC)  | x_1, ..., x_n      | θ_1^(1), ..., θ_n^(1) — ?

In order to do so, we perform the following steps:

• Extract item vectors for courses: item vectors are defined by course content, i.e., text that describes the courses, such as the "about the course" information. In order to construct the item vectors (features from documents), we rely on the Latent Dirichlet Allocation algorithm, which extracts topic information from text as probability distributions over words. Since we use a machine learning setting, item vectors are the features of a regression/classification problem, which we denote X = {X_1, X_2, ..., X_n}.

• Learn the user's vector: interests about course topics can be modeled by a user vector, which must be learned for each user. To do so, we use a machine learning approach: all available ratings (grading information in college) are used to train a multilinear regression model (Bishop and others, 2006). The user's vector is therefore the resulting set of parameters (or weights), Θ^(1) = {θ_1^(1), ..., θ_n^(1)}, learned from the training data (for instance, all courses and grades of user 1). There are m (number of users) sets of parameters. In a multilinear regression algorithm we want to find the values of Θ that minimize the cost function:

  J(Θ_0, Θ_1, ..., Θ_n) = (1 / 2m) Σ_{i=1}^{m} (h_Θ(x^(i)) − y^(i))^2

  We define the hypothesis:

  h_Θ(x) = Θ^T x = Θ_0 x_0 + Θ_1 x_1 + Θ_2 x_2 + ... + Θ_n x_n

  where Θ_0, Θ_1, ..., Θ_n are the parameters we want to estimate by minimizing the cost function. One way to minimize the cost function is the gradient descent method, where each iteration brings the parameters θ_j closer to the optimal values that minimize J(θ). For n > 1, repeat until convergence:

  Θ_j := Θ_j − α (1/m) Σ_{i=1}^{m} (h_Θ(x^(i)) − y^(i)) x_j^(i)   (simultaneously updating Θ_j for j = 0, ..., n)

  A small code sketch of this learning step, together with the prediction step, is given after this list of steps.

• Given the item and user vectors, the goal is to predict a rating R_C for a MOOC course C with feature vector X_C for a user U with user vector profile Θ^(U). The predicted rating is given by:

  R_C = X_C^T Θ^(U)

An overview of the recommendation system is depicted in Figure 2. We estimate the ratings for a student and, to recommend courses, we take a "top-10 best recommendations" approach: each student always gets 10 recommended courses, namely the MOOCs most related to the courses in which the student got the 10 lowest grades.

Figure 2: Block diagram for the recommendation system

4 Experimental Results

This section shows preliminary experimental results conducted on real-world data sets. Courses and user grading information were extracted from a Peruvian university. The MOOC courses were extracted from Coursera, considering the following categories: "business and management", "computer science - artificial intelligence", "computer science - software engineering", "computer science - systems and security", "computer science - theory", "mathematics", and "statistics and data analysis". The most significant information for each course is given by the "Introduction", "About the Course" and "FAQ" sections.

All extracted information was preprocessed as follows: remove non-ASCII characters, strip HTML tags, remove special strings, and remove multiple spaces and blank lines. After that we built the corpus used by the LDA algorithm. The number of Coursera courses considered was 69, while the number of college courses was 43, which gives 112 courses in total. The topic modeling algorithm used the Gibbs sampling inference procedure and, following (Blei, 2012), we set the parameters α = 50/T and β = 0.01. The number of iterations was chosen to be large enough to guarantee convergence, N = 200.
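The sketch below, in Python/NumPy, illustrates the two learning and recommendation steps listed above: learning a user vector by batch gradient descent on the cost function J(Θ), and predicting R_C = X_C^T Θ^(U) for MOOC courses to pick a top-N list. It assumes LDA topic proportions are already available as NumPy arrays; function and variable names are ours and the learning rate and iteration count are illustrative, not the authors' settings.

    import numpy as np

    def learn_user_vector(X, y, lr=0.01, iters=5000):
        """Multilinear regression via batch gradient descent.
        X: (m, n) course feature vectors (LDA topic proportions),
        y: (m,) ratings (inverted grades) of one student."""
        m, n = X.shape
        Xb = np.hstack([np.ones((m, 1)), X])        # add the bias term x0 = 1
        theta = np.zeros(n + 1)
        for _ in range(iters):
            grad = Xb.T @ (Xb @ theta - y) / m      # gradient of J(theta)
            theta -= lr * grad
        return theta

    def recommend(theta, mooc_X, mooc_names, top_n=10):
        """Predict a rating for every MOOC course and return the top-N."""
        Xb = np.hstack([np.ones((len(mooc_X), 1)), np.asarray(mooc_X)])
        ratings = Xb @ theta
        order = np.argsort(ratings)[::-1][:top_n]
        return [(mooc_names[i], float(ratings[i])) for i in order]

    # usage sketch: theta = learn_user_vector(college_topics, student_ratings)
    #               print(recommend(theta, coursera_topics, coursera_names))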

To measure performance, accuracy was considered by counting the number of correct matches between college courses and Coursera courses. Figure 3 illustrates the impact of the number of topics T in the topic model: higher accuracy is achieved when a higher number of topics is used, so we set the number of topics T equal to the number of Coursera courses.

Figure 3: Accuracy of the recommendation system according to the number of topics; better precision (and also efficiency) is obtained when the number of topics is equal to the number of Coursera courses, T = 69

The goal of our proposal is to recommend courses to students who have received low grades in college; therefore, we use grades as ratings. To keep a recommendation-system setting, we decided to invert the grading information: a grade of 20 turns into a rating of 0 and vice versa (this step might not be necessary in other recommendation systems). Mean normalization is also used to get more reliable recommendations for students with few grades available, for instance first-year students.

For testing, we define a variable "top-N" which denotes the number of courses to recommend. For instance, for student "a" we recommend the top-N Coursera courses with the greatest predicted ratings. In Figure 4, the x-axis denotes several values of top-N and the y-axis the accuracy obtained. An accuracy over 0.6 is achieved for top-N greater than or equal to 10.

Figure 4: Recommendation system accuracy according to the number of recommended courses (top-N)

In Figure 5, a comparison between the ratings for "Coursera courses" and "college courses" for one student is shown. We intend to show the proximity of the predicted data (ratings on Coursera courses) and the provided data (ratings on college courses).

Figure 5: Comparison chart of predicted (Coursera) and real (university) ratings for one student
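The grade-to-rating conversion described above (inverting the 0-20 grading scale and mean-normalizing) can be sketched as follows; this is only an illustration, since the exact normalization used by the authors is not specified.

    import numpy as np

    def grades_to_ratings(grades, max_grade=20.0):
        """Invert 0-20 grades so that low grades become high ratings,
        then mean-normalize per student (assumed form of the normalization)."""
        g = np.asarray(grades, dtype=float)
        ratings = max_grade - g            # grade 20 -> rating 0, and vice versa
        return ratings - ratings.mean()    # mean normalization

    print(grades_to_ratings([12, 15, 8]))  # e.g. a grade of 12 becomes rating 8 before centering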

5 Conclusion

We have introduced a novel approach for recommending online courses that combines the probabilistic topic model LDA with content-based recommendation systems. In short, we use a machine learning approach in which LDA allows us to extract feature descriptors from courses, and rating prediction is performed by inferring user profile parameters using multilinear regression. Preliminary experimental results show that our algorithm performs well when compared to a similar approach based on cosine similarity with LDA.

Although we have focused on MOOCs as the source of recommendation content, nothing prevents us from using this approach beyond that domain; further domains can be included by performing feature topic extraction. Future work will investigate scalability issues. In this sense, topic models such as LDA have scalable versions available — for instance, a MapReduce implementation is given in the Apache Mahout library (https://mahout.apache.org/) — and there are also scalable versions of multilinear regression.

References

Christopher M. Bishop et al. 2006. Pattern Recognition and Machine Learning, volume 1. Springer, New York.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March.
David M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4):77–84.
Wray Buntine. 2002. Variational extensions to EM and multinomial PCA. In Machine Learning: ECML 2002, pages 23–34. Springer.
Richard O. Duda, Peter E. Hart, and David G. Stork. 2012. Pattern Classification. John Wiley & Sons.
Gerhard Fischer. 2014. Beyond hype and underestimation: identifying research challenges for the future of MOOCs. Distance Education, 35(2):149–158.
Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235.
Choochart Haruechaiyasak and Chaianun Damrongrat. 2008. Article recommendation based on a topic model for Wikipedia Selection for Schools. 5362:339–342.
Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM.
Wei Kuang, Nianlong Luo, and Zilei Sun. 2011. Resource recommendation based on topic model for educational system. 2:370–374, August.
Fred G. Martin. 2012. Will massive open online courses change how we teach? Communications of the ACM, 55(8):26–28, August.
Thomas Minka and John Lafferty. 2002. Expectation-propagation for the generative aspect model. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 352–359. Morgan Kaufmann Publishers Inc.
Anand Rajaraman and Jeffrey David Ullman. 2012. Mining of Massive Datasets. Cambridge University Press, Cambridge.
Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to Recommender Systems Handbook. Springer.
C. Romero and S. Ventura. 2010. Educational data mining: A review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40(6):601–618, November.
Cristóbal Romero, Sebastián Ventura, José Antonio Delgado, and Paul De Bra. 2007. Personalized links recommendation based on data mining in adaptive educational hypermedia systems. 4753:292–306.
Er Sadikov and Ivan Bratko. 2011. Recommending videolectures with linear regression.
Mark Steyvers and Tom Griffiths. 2007. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440.

ANIMITEX project: Image Analysis based on Textual Information

Hugo Alatrista-Salas, Eric Kergosien, Mathieu Roche and Maguelonne Teisseire
TETIS (Irstea, Cirad, AgroParisTech), Montpellier, France
LIRMM (CNRS, Univ. Montpellier 2), Montpellier, France
[email protected]

Abstract

With the amount of textual data available on the web, new knowledge extraction methodologies are needed. Some original methods allow users to combine different types of data in order to extract relevant information. In this context, this paper draws the main objectives of the ANIMITEX project, which combines spatial and textual data. The data preprocessing step is detailed.

Keywords: Knowledge extraction, Text mining, Satellite images, Spatial feature identification

1 Aims of the ANIMITEX project

A lot of high-resolution satellite data are now available. This raises the issue of fast and effective satellite image analysis, as it still requires a costly human involvement. In this context, remote sensing approaches make it possible to tackle this challenge. The exploratory and ambitious ANIMITEX project (http://www.lirmm.fr/~mroche/ANIMITEX/, web site in French) aims at processing massive and heterogeneous textual data (i.e., a big data context) in order to provide crucial information to enrich the analysis of satellite images.

The large amount of data is associated with an increasing temporal repetitivity. For instance, today around ten images are available per year (e.g., SPOT, Landsat), and in 3 years one image every 5 days (based on the Sentinel-2 satellites) will be available.

The ANIMITEX project has many application areas, such as image annotation (Forestier et al. 2010). For instance, identifying the precise type of culture or the function of a building is not always possible using only images. Nevertheless, textual data may contain this kind of information and give additional meaning to the images. The development of approaches based on image/text matching therefore becomes crucial in order to complete image analysis tasks (Alatrista-Salas and Béchet 2014). It also enables a better classification of the data.

Moreover, image-text matching will enrich Information Retrieval (IR) methods and will provide users with a more global context of the data (Sallaberry et al. 2008). This can be crucial for decision makers in the context of land-use planning projects, which have to take into account the opinions of experts related to a territory (managers, scientists, associations, specialized companies, and so forth).

In the context of the ANIMITEX project, we plan to investigate two specific scenarios: (i) the construction of a road to the north of Villeveyrac (a city close to Montpellier, France); (ii) a port activity area, called Hinterland, in the Thau area (near Sète, France). The aim of these case studies is to enrich images with information present in documents, e.g., the opinions about land-use planning extracted from newspapers.

Section 2 describes the proposed data preprocessing step. Section 3 details the partners involved in the project.

2 Data preprocessing

The current work focuses on adapting Natural Language Processing (NLP) techniques for the recognition of Spatial Features (SF) and thematic/temporal information (Gaio et al. 2012; Maurel et al. 2011). In the proposed approach, the SF appearing in a text are composed of at least one Named Entity (NE) and one or more spatial indicators specifying its location (Lesbegueries et al. 2006). For this, a set of articles (i.e., 12000 documents) concerning the Thau region between the years 2010 and 2013 has been acquired. A second part of the data set is composed of raster files (Pléiades image mosaics - spatial resolution 2x2 m - 4 spectral bands) covering all regions of the Thau lagoon (see Figure 1). The satellite images are available via the GEOSUD Equipex (http://www.equipex-geosud.fr/).

Figure 1: Mosaic of Pléiades images around the Thau basin; the images on the right represent the superposition of a classification on the raster file.

A detailed classification of the land occupation is currently in progress. It will lead to a digital vector layer where each SF (represented by a polygon) belongs to a class of specific land use. The nomenclature of this classification is organized into four hierarchical levels (see Figure 2). Moreover, we investigate multi-scale information associated with different levels of classification of the satellite images.

From this corpus, NLP methods have been applied in order to identify linguistic features concerning the spatial, thematic, and temporal information in the documents. The combined use of lexicons and dedicated rules (Gaio et al. 2012) allows us to identify Absolute (e.g., Montpellier) and Relative (e.g., south of Montpellier) Spatial Features (ASF and RSF) (Lesbegueries et al. 2006; Kergosien et al. 2014). A first framework based on sequential pattern mining (Cellier et al. 2010) has been proposed to discover relationships between SF (Alatrista-Salas and Béchet 2014). To this end, a two-step process has been defined (see Figure 3).

SF validation: for each identified ASF, we check in external resources whether there is a corresponding spatial representation. In particular, we have used the layers provided by the IGN (Institut National de l'information Géographique et Forestière, i.e., the French national mapping agency): municipalities, roads, railways, buildings, etc. In addition, if an ASF is not present in the IGN resources, we use gazetteers (GeoNames and OpenStreetMap) to complete the information. Concerning the representation of RSF, we use spatial indicators of topological order associated with the ASF. Following the scopes proposed in (Sallaberry et al. 2008), the spatial indicators of topological order have been grouped into five categories:

• Proximity: different indicators can be used for the proximity relationship, such as: near, around, beside, close to, periphery, etc.

• Distance: the indicators used in this relationship are of the form: x km, x miles, etc. Two representations are then proposed in our approach: 1) computing the distance from the centroid of the ASF and building a circular buffer of size x around the centroid; 2) considering the shape of the ASF and building a buffer of size x from the edge of the processed ASF (a small code sketch of these two buffer constructions is given after this list, following Figure 3).

Figure 2: Nomenclature of the Thau region used for the image classification (in French)

Figure 3: From document to footprint definition: the two-step process

• Inclusion: this binary operation allows us to check whether an ASF is inside another, taking into account indicators such as: center, in the heart of, in, inside, etc.

• Orientation: this unary relationship has been broadly studied in the literature, and different approaches have been proposed to identify the cardinal points of an ASF. We have chosen to use the conical model proposed in (Frank 1991). For this, we use the centroid of the ASF and build a buffer around it; the size of this buffer is calculated taking into account the surface of the studied ASF. We then decompose the buffer into four equal areas (forming an "X") from the centroid; each intersection between the buffer and the cones thus formed represents one of the four cardinal points.

• Geometry: geometry relations are built from at least two ASF. These relationships are, for example, the union, the adjacency, the difference, or the position of an ASF with respect to other ASF, for example "C between A and B" (where A, B and C are ASF), etc.

Representation of the spatial footprint: after the extraction step and the spatial representation of the ASF and RSF, the spatial footprint associated with the treated document can be mapped. In this process, two main problems have been identified. The first one is the persistent ambiguity of some NE contained in SF, because some NE (e.g., Montagnac) correspond to several places. To address this issue, a configurable spatial filter based on predefined scenarios has been developed: for example, to identify events related to a specific land-use planning project that occurred in a part of the Thau lagoon area, only the SF contained in this area are explored. The second issue is related to the use of external resources and the identification of the spatial representation appropriate to each ASF. Taking into account the spatial indicator (e.g., town, road, etc.) preceding the toponymic name is a first answer, because it allows us to specify the type of the SF and thus to take into account the appropriate spatial representation.
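The two spatial representations of the Distance relation described in the list above (a circular buffer around the centroid of the ASF, and a buffer built from the edge of the ASF) can be sketched with a generic geometry library. The example below uses the shapely package; the function name, the toy polygon and the buffer size are ours and only illustrate the idea.

    from shapely.geometry import Polygon

    def distance_buffers(asf_polygon, x):
        """Two possible representations of the indicator 'at x (km)' around an ASF."""
        # 1) circular buffer of radius x around the centroid of the ASF
        centroid_buffer = asf_polygon.centroid.buffer(x)
        # 2) buffer of size x built from the edge of the ASF polygon
        edge_buffer = asf_polygon.buffer(x).difference(asf_polygon)
        return centroid_buffer, edge_buffer

    # toy ASF: a 2 km x 1 km rectangle (coordinates in km)
    asf = Polygon([(0, 0), (2, 0), (2, 1), (0, 1)])
    circle, ring = distance_buffers(asf, x=0.5)
    print(round(circle.area, 2), round(ring.area, 2))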

Thematic information is identified using semantic resources (i.e., the AGROVOC thesaurus, the nomenclature resulting from the image classifications, ...) (Buscaldi et al. 2013).

These linguistic features allow us to identify specific phenomena in the documents (e.g., land-use planning, environmental change, natural disasters, etc.). The main idea is to link the phenomena identified in the images with subjects found in documents during the same period. Overall, the ANIMITEX project allows users to integrate different information sources, i.e., both types of expressions (texts vs. images). The main objective is to enrich the information conveyed by a text with images, and vice versa.

3 Consortium of the project

The multidisciplinary consortium of the project involves three research domains: Computer Science, Geography and Remote Sensing. More precisely, expertise in remote sensing and in mining complex and heterogeneous spatio-temporal data is one of the foundations of the project.

TETIS (Territories, Environment, Remote Sensing and Spatial Information, Montpellier) aims to produce and disseminate knowledge, concepts, methods and tools to characterize and understand the dynamics of rural areas and territories, and to control the spatial information on these systems. LIRMM (Informatics, Robotics and Microelectronics, Montpellier) focuses on knowledge extraction. ICube (Strasbourg) is specialized in image analysis and complex data mining; ICube collaborates with geographers from the LIVE laboratory (Image Laboratory, City, and Environment) and with specialists in NLP (LiLPa lab – Linguistics, Language, Speech). These two locations (Montpellier and Strasbourg) constitute a cluster of local skills related to all major aspects of the project. LIUPPA (Pau) includes researchers specializing in Information Extraction (IE) and Information Retrieval (IR); the main work of this partner is about the extraction and management of geographical information. GREYC (Caen) brings researchers in data mining (e.g., mining sequences in order to discover relationships between spatial entities) and NLP; for this aspect, collaborations with two other labs (LIPN and IRISA) have been developed.

Acknowledgments

The authors thank Midi Libre (French newspaper) for its expertise on the corpus and all the partners of the ANIMITEX project for their involvement. This work is partially funded by the Mastodons CNRS grant and the GEOSUD Equipex.

References

Alatrista-Salas H. and Béchet N. Fouille de textes : une approche séquentielle pour découvrir des relations spatiales. In Cergeo Workshop - EGC, 2014.
Buscaldi D., Bessagnet M.N., Royer A. and Sallaberry C. Using the Semantics of Texts for Information Retrieval: A Concept- and Domain Relation-Based Approach. In Proceedings of ADBIS (2) - Advances in Databases and Information Systems, pp. 257-266, 2013.
Cellier P., Charnois T., Plantevit M. and Crémilleux B. Recursive Sequence Mining to Discover Named Entity Relations. In Symposium on Intelligent Data Analysis, LNCS, pp. 30-41, 2010.
Forestier G., Puissant A., Wemmert C. and Gançarski P. Knowledge-based Region Labeling for Remote Sensing Image Interpretation. Computers, Environment and Urban Systems, Vol. 36(5), pp. 470-480, 2012.
Frank A. U. Qualitative spatial reasoning with cardinal directions. In Seventh Austrian Conference on Artificial Intelligence, volume 287 of Informatik-Fachberichte, pages 157–167. Springer, Berlin Heidelberg, 1991.
Gaio M., Sallaberry C. and Nguyen V.T. Typage de noms toponymiques à des fins d'indexation géographique. TAL, 53(2):143–176, 2012.
Kergosien E., Laval B., Roche M. and Teisseire M. Are opinions expressed in land-use planning documents? International Journal of Geographical Information Science, Vol. 28(4), pp. 739-762, 2014.
Lesbegueries J., Gaio M. and Loustau P. Geographical information access for non-structured data. In Proceedings of the 2006 ACM Symposium on Applied Computing, SAC '06, pages 83–89, New York, NY, USA, 2006.
Maurel D., Friburger N., Antoine J.-Y., Eshkol-Taravella I. and Nouvel D. CasEN: a transducer cascade to recognize French named entities. TAL, 52(1):69–96, 2011.
Sallaberry C., Gaio M. and Lesbegueries J. Fuzzying GIS topological functions for GIR needs. In Jones C. B. and Purves R., editors, 5th ACM Workshop on Geographic Information Retrieval, pages 1–8, 2008.

A case study on Morphological Data from Eimeria of Domestic Fowl using a Multiobjective Genetic Algorithm and R&P for Learning and Tuning Fuzzy Rules for Classification

Edward Hinojosa C. and Cesar A. Beltran C.
Dept. of Informatics Engineering, Pontifical Catholic University of Peru, Lima, Peru
[email protected], [email protected]

Abstract

In this paper, we use fuzzy rule-based classification systems to classify cells of Eimeria of the domestic fowl based on morphological data. Thirteen features were extracted from the images of the cells; these features are genetically processed for learning fuzzy rules, and a reward-and-punishment method is used for tuning the weights of the fuzzy rules. The experimental results show that our classifier, based on interpretable fuzzy rules, has a classification rate similar to that of a non-parametric, non-interpretable method.

1 Introduction

Fuzzy systems were proposed by Zadeh in 1965 (Zadeh, 1965); they are systems based on the theory of fuzzy sets and fuzzy logic. One of the most important types of fuzzy systems are the Fuzzy Rule-Based Classification Systems (FRBCSs) (Herrera, 2005; Herrera, 2008). The classification problem is studied in the machine learning, data mining, database, and information retrieval communities, with applications in several domains.

Rules are a paradigm for representing knowledge, and they have the capacity to build a linguistic model interpretable to users. The learning (or automatic generation) and tuning of the fuzzy rules of an FRBCS from data samples is a difficult task (Herrera, 2008). This task can be considered an optimization or search process that can be managed using Evolutionary Algorithms (EAs). Genetic Algorithms (GAs) are among the best-known and most widely used EAs. FRBCSs are called Genetic Fuzzy Rule-Based Systems (GFRBSs) when GAs are used to learn or tune them, and GFRBSs continue to be researched and used in recent years (Nojima and Ishibuchi, 2013; Chen et al., 2013; Jalesiyan et al., 2014).

Generally, an FRBCS is composed of two components (Herrera, 2005): the Knowledge Base (KB) and the Inference Mechanism (IM). The KB is in turn composed of two components, the Data Base (DB) and the Rule Base (RB). This paper is concerned with the genetic learning of the RB.

The most commonly used approaches for rule learning in FRBCSs using GAs are Pittsburgh, Michigan, Iterative Rule Learning (IRL) and Genetic Cooperative-Competitive Learning (GCCL). In the Pittsburgh approach, each chromosome encodes a set of fuzzy rules; after the genetic process, the RB is the best chromosome (De Jong et al., 1993). In the Michigan approach, each chromosome encodes a single rule; after the genetic process, the RB is the set of chromosomes (rules) of the population (Holland and Reitman, 1978). In the IRL approach, each chromosome also encodes a single rule, but after each genetic process the best rule is selected and inserted into the RB, and the process is repeated iteratively until a condition is satisfied (Gonzalez and Perez, 2012). The GCCL approach is a hybrid of the Pittsburgh and Michigan approaches: the rules (chromosomes) cooperate among themselves as in the Pittsburgh approach and compete among themselves as in the Michigan approach (Giordana and Neri, 1995).

This paper is based on the IRL approach using a Multiobjective Genetic Algorithm (MOGA).

We use MOGAs because the process of learning fuzzy rules in FRBCSs considers two objectives, accuracy and interpretability. These objectives are considered contradictory (Casillas and Carse, 2009), and we search for a trade-off between them. Accuracy is measured by the classification rate, while interpretability is measured by several features of the FRBCS, for example the number of rules or the number of conditions of each rule. We specifically use the well-known algorithm called Non-dominated Sorting Genetic Algorithm II (NSGA-II) (Deb et al., 2002). After learning the fuzzy rules, we use a Reward and Punishment (R&P) method (Nozaki et al., 1996) to tune the factors (weights) of the rules in order to improve the accuracy of the FRBCS.

We apply the proposed method to classify cells of Eimeria of the domestic fowl. The Eimeria genus comprises a group of protozoan parasites that infect a wide range of hosts; a total of seven different Eimeria species infect the domestic fowl, causing enteritis with severe economic losses. We use three groups of morphological features: geometric measures, curvature characterization, and internal structure quantification (Beltran, 2007).

This paper is organized as follows: Section 2 presents the basic concepts of classification and of the FRBCSs employed in this paper. Section 3 describes the multiobjective genetic algorithm NSGA-II used in this work. The proposed method for learning the RB and tuning the weight of each rule is detailed in Section 4. Section 5 shows the classification results on the morphological features of the Eimeria genus. The conclusions of this work are presented in Section 6.

2 Fuzzy Rule Based Classification Systems

The classification problem is studied in the machine learning, data mining, database, and information retrieval communities, with applications in several domains such as medicine (Kumar et al., 2013), target marketing (Yongzhi et al., 2013), and biology (Silla and Kaestner, 2013), among others.

Any classification problem has a set of examples E = {e_1, e_2, ..., e_p} and a set of classes C = {C_1, C_2, ..., C_m}; the objective is to label each example e_q ∈ E with a class C_j ∈ C. Each e_q is defined by a set of features or characteristics e_q = {a_q1, a_q2, ..., a_qn}.

An FRBCS solves classification problems using rules, usually with the following structure:

R_i: IF V_1 IS T_1l1 AND V_2 IS T_2l2 AND ... AND V_n IS T_nln THEN Class = C_j WITH CF_i

where:
R_i : index of the fuzzy rule i.
V_1, V_2, ..., V_n : linguistic variables, i.e., the features of each example e_q.
T_1l1, T_2l2, ..., T_nln : linguistic terms (fuzzy sets) used for representing the class C_j.
C_j : the class of the fuzzy rule R_i.
CF_i : the certainty grade (i.e., rule weight) of the rule R_i.

Usually an FRBCS has two main components (Herrera, 2005), the Knowledge Base (KB) and the Inference Mechanism (IM), which are detailed below:

1. The Knowledge Base: the KB is composed of two components:
   (a) The Data Base: the DB contains the membership functions, fuzzy sets or linguistic terms for each linguistic variable of the classification problem.
   (b) The Rule Base: the RB contains the collection of fuzzy rules representing the knowledge.

2. The Inference Mechanism: the IM is the fuzzy-logic reasoning process that determines the outputs corresponding to fuzzified inputs (Lakhmi and Martin, 1998). The most common fuzzy inference methods for fuzzy classification problems are the classic and general reasoning methods (Cordon et al., 2013). This paper uses the classic method; a small sketch of this reasoning step is given right after this list.
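The following is a minimal sketch of the classic (single-winner) reasoning method for weighted fuzzy rules of the form shown above, assuming triangular membership functions and the minimum t-norm for the antecedent; the membership parameters, the toy rules and all names are ours and are not taken from the paper.

    def triangular(x, a, b, c):
        """Triangular membership function with support [a, c] and peak at b."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def classify(example, rules):
        """Classic reasoning: each rule matches the example with the minimum of its
        antecedent memberships; the class of the rule maximizing matching * CF wins."""
        best_class, best_score = None, 0.0
        for rule in rules:
            degrees = [triangular(example[v], *params)
                       for v, params in rule["antecedents"].items()]
            score = min(degrees) * rule["CF"]
            if score > best_score:
                best_class, best_score = rule["class"], score
        return best_class

    # toy rule base over two (hypothetical) normalized morphological features
    rules = [
        {"antecedents": {"area": (0.0, 0.2, 0.5), "curvature": (0.3, 0.6, 0.9)},
         "class": "E. acervulina", "CF": 1.0},
        {"antecedents": {"area": (0.4, 0.7, 1.0), "curvature": (0.0, 0.3, 0.6)},
         "class": "E. maxima", "CF": 0.8},
    ]
    print(classify({"area": 0.25, "curvature": 0.55}, rules))  # -> E. acervulina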
Algorithm Multiobjetive II Any classification problem has a set of exam- Is a new version of the NSGA (Srinivas and Deb, ples E = e , e , ..., e and a set of classes C = 1994), the NSGA-II was proposed by Deb in { 1 2 p} C ,C , ..., C , the objective is labeled each ex- 2002 (Deb et al., 2002) and it is computationally { 1 2 m}

3 Non-dominated Sorting Genetic Algorithm II (NSGA-II)

NSGA-II is a newer version of the NSGA (Srinivas and Deb, 1994); it was proposed by Deb in 2002 (Deb et al., 2002) and is computationally more efficient, elitist, and does not need additional parameters to be defined.

In NSGA-II, the population Q_t (size N) is generated using the parent population P_t (size N). After this, the two populations are combined to generate the population R_t (size 2N). The population R_t is sorted according to the dominance of the solutions in different Pareto fronts (Pareto, 1896) and to the crowding distance. A new population P_{t+1} (size N) is generated with the best Pareto fronts F1, F2, F3, and so forth, until the size of P_{t+1} equals N. The solutions in the Pareto fronts beyond this limit are removed. Afterwards, P_{t+1} becomes the new P_t and the process is repeated until a stopping condition is satisfied. Figure 1 shows the evolution of the solutions in NSGA-II. More details on NSGA-II can be found in (Deb et al., 2002).

Figure 1: Evolution of the Solutions in NSGA-II
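The generational scheme just described can be summarized in a short sketch; `make_offspring`, `fast_non_dominated_sort`, and `crowding_sort` are placeholders for the standard NSGA-II operators of Deb et al. (2002) and are not implemented here.

```python
def nsga2(P, N, generations, make_offspring, fast_non_dominated_sort, crowding_sort):
    """Skeleton of one NSGA-II run; the three operator arguments are assumed to be
    provided elsewhere (selection/crossover/mutation, Pareto sorting, crowding sorting)."""
    for _ in range(generations):
        Q = make_offspring(P, N)             # Q_t (size N) generated from the parents P_t
        R = P + Q                            # combined population R_t (size 2N)
        fronts = fast_non_dominated_sort(R)  # Pareto fronts F1, F2, F3, ...
        next_P = []
        for front in fronts:
            if len(next_P) + len(front) <= N:
                next_P.extend(front)                      # the whole front fits into P_{t+1}
            else:
                ranked = crowding_sort(front)             # most spread-out solutions first
                next_P.extend(ranked[:N - len(next_P)])   # truncate by crowding distance
                break
        P = next_P                           # P_{t+1} becomes the new P_t
    return P
```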

4 Proposed Method

This section presents the proposed methods for learning fuzzy rules using the IRL approach and a MOGA, and for tuning the weights of the fuzzy rules using an R&P method. Each method is detailed in the next subsections.

4.1 Learning Fuzzy Rules

The proposed method for learning fuzzy rules is based on the iterative multiobjective genetic method described in (Hinojosa and Camargo, 2012) and uses a MOGA to learn a single fuzzy rule in each iteration. The main difference of the proposed method is the module for defining the order of the classes to be learned. The proposed method is illustrated in Figure 2.

Figure 2: Proposed Method for Learning Fuzzy Rules

A set of examples is used as the training set. The proposed IRL method starts by defining the order of the classes to be learned. After that, a class is selected and the module that generates the best rule using a MOGA is executed. The MOGA considers two objectives for minimization: accuracy and interpretability. The accuracy is determined by the integrity and consistency of each rule (Gonzalez and Perez, 1999), and the interpretability is defined by the number of conditions of each rule. When the best rule in the Pareto front improves the classification rate of the RB, this rule is inserted into the RB, some examples are marked, and the process of learning a fuzzy rule starts again. When the best rule in the Pareto front does not improve the classification rate of the RB, the process verifies whether the whole sequence of classes has been learned; if not, a new class is selected and the process of learning a fuzzy rule starts again; otherwise, the process finishes and the set of best rules is the RB.

In the process detailed above, all rules have a weight equal to one. These weights can be tuned to improve the classification rate. This tuning is detailed in the next subsection.
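A hedged outline of the iterative rule-learning loop just described is given below; `run_moga`, `classification_rate`, and `mark_examples` stand in for the paper's modules (the MOGA of Section 3, the evaluation of the rule base, and example marking), whose internals are not reproduced here.

```python
def learn_rule_base(training_set, class_order, run_moga, classification_rate, mark_examples):
    """Sketch of the IRL scheme of Figure 2: one rule is learned per MOGA run and kept
    only if it improves the classification rate of the current rule base (RB)."""
    RB = []
    remaining_classes = list(class_order)        # learning order of the classes
    current_class = remaining_classes.pop(0)
    while True:
        best_rule = run_moga(training_set, current_class)   # best rule of the Pareto front
        if classification_rate(RB + [best_rule], training_set) > classification_rate(RB, training_set):
            RB.append(best_rule)                             # rule improves the RB: keep it
            mark_examples(training_set, best_rule)           # mark the examples it covers
        elif remaining_classes:
            current_class = remaining_classes.pop(0)         # move on to the next class
        else:
            break                                            # whole class sequence learned
    return RB                                                # all rules start with weight 1
```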

4.2 Tuning the Weights of the Fuzzy Rules

We use the method proposed in (Nozaki et al., 1996) for this task. This method rewards (increases) the weight of a fuzzy rule R_i when an example e_q is correctly classified by this rule, according to the following equation:

$CF_i^{new} = CF_i^{old} + n_1 \, (1 - CF_i^{old})$   (1)

And this method punishes (decreases) the weight of the fuzzy rule R_i when an example e_q is misclassified by this rule, according to the following equation:

$CF_i^{new} = CF_i^{old} - n_2 \, CF_i^{old}$   (2)

In the experimental study detailed in Section 5 we used the values n_1 = 0.001 and n_2 = 0.1, and the tuning procedure was run for 500 iterations.
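Equations (1) and (2) translate directly into a small update routine; the sketch below assumes the rule weights are kept in a list and that a `winner_rule_index` function (an assumption, not part of the paper) returns the rule that classifies each example, with n_1 = 0.001 and n_2 = 0.1 as in the experiments.

```python
def tune_weights(CF, examples, labels, winner_rule_index, rule_class,
                 n1=0.001, n2=0.1, iterations=500):
    """Reward-and-punishment tuning of the rule weights CF_i (Nozaki et al., 1996):
    Eq. (1) rewards the winning rule on a correctly classified example,
    Eq. (2) punishes it on a misclassified one."""
    for _ in range(iterations):
        for x, y in zip(examples, labels):
            i = winner_rule_index(x, CF)             # rule that classifies example x
            if rule_class[i] == y:
                CF[i] = CF[i] + n1 * (1.0 - CF[i])   # Eq. (1): reward
            else:
                CF[i] = CF[i] - n2 * CF[i]           # Eq. (2): punishment
    return CF
```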

5 Experimental Study

The experimental study aims to show the application of the proposed method and to compare it with the classification obtained by a non-parametric method for classifying cells of the Eimeria of the domestic fowl based on morphological data. The Eimeria genus comprises a group of protozoan parasites that infect a wide range of hosts; seven different Eimeria species infect the domestic fowl, causing enteritis with severe economic losses. The protozoan morphology was represented by 13 features: mean of curvature, standard deviation of curvature, entropy of curvature, major axis (length), minor axis (width), symmetry through the major axis, symmetry through the minor axis, area, entropy of internal structure, second angular moment, contrast, inverse difference moment, and entropy of the co-occurrence matrix. These features are used as the input pattern for the classification process.

Table 1 shows the classes and the number of examples or instances of each class. More details on how the features were extracted, and about the Eimeria genus, can be found in (Beltran, 2007).

Class Number   Class Name      # of Examples
1              E. acervulina   636
2              E. maxima       321
3              E. brunetti     418
4              E. mitis        757
5              E. praecox      747
6              E. tenella      608
7              E. necatrix     404

Table 1: Distribution of Classes

Table 2 shows the parameters for learning the fuzzy rules with NSGA-II (the MOGA used in this paper).

Parameter                Value
Population size          50
Crossover rate           1.0
Mutation rate            0.2
Number of generations    500
Mark value               0.3

Table 2: Parameters of the Proposed Method

After the process of learning the fuzzy rules, the process of tuning the weights starts. Table 3 shows the result of the classification (dispersion matrix) after tuning the weights.

After applying the proposed method, the result is a set of rules similar to the set shown in Figure 3. These rules have a high level of interpretability for expert users.

Figure 3: Proposed Method for Learning Fuzzy Rules

We compared the proposed method with the non-parametric classifier proposed in (Beltran, 2007) on the same set of examples. Table 4 shows the classification rate for each class for both classifiers. These results show that the proposed method (PM) has a classification rate (overall 77.17) similar to that of the non-parametric method (NPM) (overall 80.24), but with a high degree of interpretability. The non-parametric method does not consider interpretability.
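The per-class rates reported in Tables 3 and 4 below can be read off a dispersion (confusion) matrix. The sketch below, with clearly toy counts, only illustrates the bookkeeping; it is not the paper's data.

```python
def per_class_rates(confusion):
    """Per-class classification rate (diagonal / column total) and overall rate from a
    confusion matrix whose entry [predicted][actual] holds raw counts."""
    classes = list(confusion)
    rates = {}
    for c in classes:
        column_total = sum(confusion[p][c] for p in classes)   # examples whose true class is c
        rates[c] = 100.0 * confusion[c][c] / column_total
    total = sum(confusion[p][c] for p in classes for c in classes)
    overall = 100.0 * sum(confusion[c][c] for c in classes) / total
    return rates, overall

toy = {  # predicted -> actual -> count (illustrative values only)
    "A": {"A": 8, "B": 1},
    "B": {"A": 2, "B": 9},
}
rates, overall = per_class_rates(toy)
print(rates, overall)   # {'A': 80.0, 'B': 90.0} 85.0
```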

       ACE    MAX    BRU    MIT    PRA    TEN    NEC
ACE    85.06  0.00   0.00   2.38   0.13   0.16   7.18
MAX    0.00   98.44  0.72   0.00   0.00   0.00   0.00
BRU    0.00   1.56   87.08  0.00   6.29   1.48   0.74
MIT    1.73   0.00   0.00   86.79  4.95   1.64   4.70
PRA    2.67   0.00   5.50   6.87   69.34  10.53  24.01
TEN    7.86   0.00   6.70   1.19   16.06  82.73  37.62
NEC    2.67   0.00   0.00   2.77   3.21   3.45   25.74

Table 3: Results of the Proposed Method

Class Name      PM      NPM
E. acervulina   85.06   87.70
E. maxima       98.44   96.12
E. brunetti     87.08   94.98
E. mitis        86.79   86.27
E. praecox      69.34   64.46
E. tenella      82.73   76.53
E. necatrix     25.74   55.60

Table 4: The PM vs. NPM

6 Conclusions

In this article, we proposed an iterative multiobjective genetic method to learn fuzzy classification rules. The fuzzy rules are learned in each iteration depending on the sequence of classes. After that, the weights of the fuzzy rules are tuned using an R&P method. The results obtained indicate that FRBCSs have better interpretability and similar accuracy compared with a non-parametric method for classifying the Eimeria of the domestic fowl.

References

Zadeh, L. A. 1965. Fuzzy Sets. Information and Control, 8(3):338–353.

Herrera, F. 2005. Genetic fuzzy systems: Status, critical considerations and future directions. International Journal of Computational Intelligence Research, 1(1):59–67.

Herrera, F. 2008. Genetic fuzzy systems: taxonomy, current research trends and prospects. Evolutionary Intelligence, 1(1):27–46.

Nojima, Y. and Ishibuchi, H. 2013. Multiobjective genetic fuzzy rule selection with fuzzy relational rules. IEEE International Workshop on Genetic and Evolutionary Fuzzy Systems (GEFS), 1:60–67.

Chen, S.-M., Chang, Y.-C., Pan, J.-S. 2013. Fuzzy Rules Interpolation for Sparse Fuzzy Rule-Based Systems Based on Interval Type-2 Gaussian Fuzzy Sets and Genetic Algorithms. IEEE Transactions on Fuzzy Systems, 21(3):412–425.

Jalesiyan, H., Yaghubi, M., Akbarzadeh, T.M.R. 2014. Rule selection by Guided Elitism genetic algorithm in Fuzzy Min-Max classifier. Conference on Intelligent Systems, 1:1–6.

De Jong, K.A., Spears, W.M., Gordon, D.F. 1993. Using genetic algorithms for concept learning. Machine Learning, 13:161–188.

Holland, J. and Reitman, J. 1978. Cognitive Systems Based on Adaptive Algorithms. ACM SIGART Bulletin.

Gonzalez, A. and Perez, R. 1999. SLAVE: A genetic learning system based on an iterative approach. IEEE Transactions on Fuzzy Systems, 7:176–191.

Gonzalez, A. and Perez, R. 2009. Improving the genetic algorithm of SLAVE. Mathware and Soft Computing, 16:59–70.

Giordana, A. and Neri, F. 1995. Search-intensive concept induction. Evolutionary Computation, 3:375–416.

Casillas, J. and Carse, B. 2009. Special issue on Genetic Fuzzy Systems: Recent Developments and Future Directions. Soft Computing, 13(5):417–418.

Deb, K., Pratap, A., Agarwal, S., Meyarivan, T. 2002. A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197.

Srinivas, N. and Deb, K. 1994. Multiobjective Optimization Using Nondominated Sorting in Genetic Algorithms. Evolutionary Computation, 2:221–248.

Nozaki, K., Ishibuchi, H., Tanaka, H. 1996. Adaptive fuzzy rule-based classification systems. IEEE Transactions on Fuzzy Systems, 4(3):238–250.

Pareto, V. 1896. Cours d'Economie Politique. Droz, Genève.

Beltran, C. 2007. Análise e reconhecimento digital de formas biológicas para o diagnóstico automático de parasitas do gênero Eimeria. PhD thesis, USP, Brazil.

Kumar, S.U., Inbarani, H.H., Kumar, S.S. 2013. Bijective soft set based classification of medical data. International Conference on Pattern Recognition, Informatics and Mobile Engineering, 1:517–521.

Yongzhi, Ma., Hong Gao, Yi Ding, Wei Liu. 2013. Logistics market segmentation based on extension classification. International Conference on Information Management, Innovation Management and Industrial Engineering, 2:216–219.

Silla, C.N. and Kaestner, C.A.A. 2013. Hierarchical Classification of Bird Species Using Their Audio Recorded Songs. IEEE International Conference on Systems, Man, and Cybernetics, 1:1895–1900.

Lakhmi, C. and Martin, N. 1998. Fusion of Neural Networks, Fuzzy Systems and Genetic Algorithms: Industrial Applications (International Series on Computational Intelligence). CRC Press.

Cordon, O., Herrera, F., Hoffmann, F., Magdalena, L. 2001. Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases. World Scientific.

Hinojosa, E. and Camargo, H.A. 2012. Multiobjective genetic generation of fuzzy classifiers using the iterative rule learning. International Conference on Fuzzy Systems, 1:1–8.

SIFR Project: The Semantic Indexing of French Biomedical Data Resources

Juan Antonio Lossio-Ventura, Mathieu Roche, Clement Jonquet, Maguelonne Teisseire
LIRMM, CNRS, Univ. Montpellier 2, Montpellier, France — TETIS, Cirad, Irstea, AgroParisTech, Montpellier, France
[email protected], [email protected]

Abstract

The Semantic Indexing of French Biomedical Data Resources project proposes to investigate the scientific and technical challenges in building ontology-based services to leverage biomedical ontologies and terminologies in indexing, mining and retrieval of French biomedical data.

1 Introduction

Nowadays, the large amount of data available online usually consists of unstructured text, for example clinical reports, adverse event reports, and electronic health records (Lossio-Ventura et al., 2013). These texts are regularly written using a specific language (expressions and terms) used by a community. That is why there is a need to formalize and index technical terms or concepts, which is very time-consuming.

Relevant terms are useful for gaining a better understanding of the conceptual structure of a domain. They can be: (i) single-word terms (easy to extract), or (ii) multi-word terms (difficult).

In the biomedical field, there is a large gap between the existing resources (ontologies) in English and in French. In English there are about 7,000,000 terms associated with 6,000,000 concepts, such as those of UMLS(1) or BioPortal (Noy et al., 2009), whereas in French there are only about 330,000 terms associated with 160,000 concepts (Neveol et al., 2014). There is therefore a need to enrich terminologies and ontologies in French. Our work thus consists of two main steps: (i) the extraction of biomedical terms, and (ii) ontology enrichment, in order to populate ontologies with the extracted terms.

The article is organized as follows. We first discuss the methodology put in place for this project in Section 2. The evaluation of the precision is presented in Section 3, followed by the conclusions in Section 4.

(1) http://www.nlm.nih.gov/research/umls

2 Methodology

Our work is divided into two main processes: (i) the extraction of biomedical terms, and (ii) ontology enrichment, explained below.

2.1 Automatic Extraction of Biomedical Terms

Term extraction is an essential task in the acquisition of knowledge about a domain. In this work we present the measures created for this purpose, measures based on several criteria (linguistic, statistical, graph-based, and web-based) to improve the result of biomedical term extraction. The measures presented below are made available to the community through the application called BIOTEX (Lossio-Ventura et al., 2014).

2.1.1 Linguistics

These techniques try to retrieve terms through the formation of patterns. The main idea is the construction of rules, normally by hand, that describe the common structures of the terms of a domain using orthographic, lexical, or morpho-syntactic features. In many cases, dictionaries containing technical terms (e.g., specific prefixes, suffixes, and acronyms) are also used to help extract

terms (Krauthammer et al., 2004).

2.1.2 Statistics

Statistical techniques rely on the evidence present in the corpus through contextual information. Such approaches mainly address the recognition of general terms (Van Eck et al., 2010). Most measures are based on frequency, and most works combine linguistic and statistical information; this is the case of C-value (Frantzi et al., 2000), which combines statistical and linguistic information for the extraction of multi-word terms as well as long and nested terms. It is the best-known measure in the literature. In the work of (Zhang et al., 2008), it was shown that C-value obtains the best results compared with other measures. Besides English, C-value has also been applied to other languages such as Japanese, Serbian, Slovenian, Polish, Chinese (Ji et al., 2007), Spanish (Barrón-Cedeño et al., 2009), and Arabic. For this reason, in our first work (Lossio-Ventura et al., 2013) we modified and adapted it for French.

Starting from C-value, we have created other measures, such as F-TFIDF-C, F-OCapi, C-OKapi, and C-TFIDF (Lossio-Ventura et al., 2014); these measures obtain better results than C-value. Finally, a new measure based on linguistic and statistical information is LIDF-value (Lossio-Ventura et al., 2014) (Linguistic patterns, IDF, and C-value information), which improves by a wide margin the results obtained by the measures cited above.

2.1.3 Graphs

The graph model is an alternative to the information model; it clearly shows the relations between nodes thanks to the edges. Thanks to centrality algorithms, groups of information in graphs can be exploited. There are applications of graphs to Information Retrieval (IR) in the context of social networks, collaboration networks, and recommender systems (Noh et al., 2009).

A graph-based measure created for this process is TeRGraph (Lossio-Ventura et al., 2014) (Terminology Ranking based on Graph information). This measure aims to improve the precision of the top k terms extracted after applying LIDF-value. The graph is built with the list of terms obtained with LIDF-value, where the nodes represent terms related to other terms through co-occurrence in the corpus.

2.1.4 Web

Several Web Mining studies focus on semantic similarity and semantic relatedness, that is, on quantifying the degree to which some words are related, taking into account not only similarity but also any possible semantic relation between them. The first web measure we created was WebR (Lossio-Ventura et al., 2014), followed by its improvement called WAHI (Lossio-Ventura et al., 2014) (Web Association based on Hits Information). Our web-based measure aims to re-rank the list previously obtained with TeRGraph. We show with this measure that the precision of the top k extracted terms exceeds the results of the measures mentioned above (see Section 3).

2.2 Ontology Enrichment

The objective of this process is to enrich terminologies or ontologies with the new terms extracted in the previous process. The three main steps of this process are:

(1) Determine whether a term is polysemous: with the help of Meta-Learning, we have been able to predict with 97% confidence whether a term is polysemous. This contribution will be valorized at the ECIR 2015 conference.
(2) Identify the possible meanings if the term is polysemous: this is our current work; with the help of clustering, in particular clustering over the graphs, we try to solve this problem.
(3) Place the term in an ontology.

3 Experiments

3.1 Data, protocol and validation

In our experiments we used the standard GENIA corpus(2), which is composed of 2,000 titles and abstracts of journal articles taken from the Medline database and contains more than 400,000 words. The GENIA corpus contains linguistic expressions that refer to entities of interest in molecular biology, such as proteins, genes, and cells.

(2) http://www.nactem.ac.uk/genia/genia-corpus/term-corpus
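Before turning to the results, here is a hedged sketch of the C-value baseline recalled in Section 2.1.2 (Frantzi et al., 2000); the candidate list and frequencies are invented, and details such as filtering candidates by linguistic patterns are omitted.

```python
import math

def c_value(candidates):
    """C-value for multi-word candidate terms (Frantzi et al., 2000).
    `candidates` maps a term (tuple of words) to its corpus frequency."""
    scores = {}
    for a, freq_a in candidates.items():
        # frequencies of longer candidates that contain `a` as a nested sub-term
        nesting = [f for b, f in candidates.items()
                   if len(b) > len(a)
                   and any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))]
        if nesting:
            score = math.log2(len(a)) * (freq_a - sum(nesting) / len(nesting))
        else:
            score = math.log2(len(a)) * freq_a
        scores[a] = score
    return scores

toy = {("soft", "tissue"): 5, ("soft", "tissue", "infection"): 3, ("blood", "cell"): 7}
print(c_value(toy))  # the nested term ("soft", "tissue") is penalized
```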

3.2 Results

The results are evaluated in terms of the precision obtained over the top k extracted terms (P@k) for the proposed measures and the baseline (reference) measures for the extraction of multi-word terms. In the following subsections, we limit the results for the graph-based measure to only the first 8,000 extracted terms, and the results for the web-based measure to only the first 1,000 terms.
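Precision at rank k, as used here, can be computed as follows; a minimal sketch assuming a ranked list of extracted terms and a reference set of valid terms (the example terms are invented).

```python
def precision_at_k(ranked_terms, reference_terms, k):
    """P@k: fraction of the top-k extracted terms that are valid reference terms."""
    top_k = ranked_terms[:k]
    return sum(1 for t in top_k if t in reference_terms) / float(k)

ranked = ["t cell", "gene expression", "in this paper", "nf-kappa b"]
gold = {"t cell", "gene expression", "nf-kappa b"}
print(precision_at_k(ranked, gold, 4))  # 0.75
```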

3.2.1 Linguistic and statistical results

Figure 1: Comparison of the precision of LIDF-value with the best baseline measures

3.2.2 Graph-based results

Figure 2: Comparison of the precision of TeRGraph and LIDF-value

3.2.3 Web-based results

Figure 3: Comparison of the precision of WAHI and TeRGraph

4 Future Work

This article presents the methodology proposed for the SIFR project. The project consists of two main processes.

The first process, Automatic Extraction of Biomedical Terms, is finished and is being valorized in the several publications cited above. In this process we showed that the proposed measures improve the precision of automatic term extraction compared with the most popular term extraction measures.

The second process, Ontology Enrichment, in turn divided into three stages, is our current task; only the first stage has been completed. In this process we seek to find the best position for a term within an ontology.

As future work, we plan to complete the second process. We also plan to test these general approaches on other domains, such as ecology and agronomy. Finally, we plan to apply these approaches to corpora in Spanish.

Acknowledgements

This project is supported in part by the French National Research Agency (ANR) under the JCJC program, grant ANR-12-JS02-01001, as well as by the University of Montpellier 2, the CNRS, and the FINCyT scholarship program, Peru.

References

Barrón-Cedeño, A., Sierra, G., Drouin, P., Ananiadou, S. 2009. An improved automatic term recognition method for Spanish. Computational Linguistics and Intelligent Text Processing, pp. 125-136. Springer.

Frantzi, K., Ananiadou, S., Mima, H. 2000. Automatic recognition of multiword terms: the C-value/NC-value method. International Journal on Digital Libraries, (3):115-130.

60 Ji, L., Sum, M., Lu, Q., Li, W., Chen, Y. 2007. Chinese Van Eck, N.J., Waltman, L., Noyons, E.CM., Buter, Terminology Extraction Using Window-Based Con- R.K. 2010. Automatic term identification for bib- textual Information. Proceedings of the 8th Inter- liometric mapping. Scientometrics, vol. 82, pp. 581- national Conference on Computational Linguistics, 596. Intelligent Text Processing (CICLing07), pp. 62-74. Springer-Verlag, Mexico City, Mexico. Zhang, Z., Iria, J., Brewster, C., Ciravegna, F. 2008. A Comparative Evaluation of Term Recognition Al- Krauthammer, M., Nenadic, G. 2004. Term Iden- gorithms. Proceedings of the Sixth International tification in the Biomedical Literature. Journal of Conference on Language Resources, Evaluation Biomedical Informatics, vol. 37, pp. 512-526. Else- (LREC08). Marrakech, Morocco. vier Science, San Diego, USA. Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teis- seire M. 2014. BIOTEX: A system for Biomed- ical Terminology Extraction, Ranking, and Valida- tion. Proceedings of the 13th International Seman- tic Web Conference (ISWC’14). Trento, Italy. Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teisseire M. 2014. Integration of linguistic and Web infor- mation to improve biomedical terminology ranking. Proceedings of the 18th International Database En- gineering and Applications Symposium (IDEAS’14), ACM. Porto, Portugal. Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teis- seire M. 2014. Yet another ranking function to automatic multi-word term extraction. Proceed- ings of the 9th International Conference on Natural Language Processing (PolTAL’14), Springer LNAI. Warsaw, Poland. Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teis- seire M. 2014. Biomedical Terminology Ex- traction: A new combination of Statistical, Web Mining Approaches. Proceedings of Journees´ internationales d’Analyse statistique des Donnees´ Textuelles (JADT2014). Paris, France. Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teisseire M. 2013. Combining C-value, Keyword Extraction Methods for Biomedical Terms Extraction. Pro- ceedings of the Fifth International Symposium on Languages in Biology, Medicine (LBM13), pp. 45- 49, Tokyo, Japan. Neveol, A., Grosjean, J., Darmoni, S., Zweigenbaum, P. 2014. Language Resources for French in the Biomedical Domain. Proceedings of the 9th In- ternational Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland Noh, TG., Park, SB., Yoon, HG., Lee, SJ., Park, SY. 2009. An Automatic Translation of Tags for Multimedia Contents Using Folksonomy Networks. Proceedings of the 32nd International ACM SIGIR Conference on Research, Development in Informa- tion Retrieval SIGIR ’09, pp. 492-499. Boston, MA, USA, ACM. Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D. L., Storey, M., Chute, C.G., Musen, M. A. 2009. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic acids research, vol. 37(suppl 2), pp 170–173.

Mathematical modeling of the performance of a computer system

Félix Armando Fermín Pérez Universidad Nacional Mayor de San Marcos Facultad de Ingeniería de Sistemas e Informática Lima, Perú [email protected]

Abstract

The management of computer systems is generally manual. Autonomic computing, through self-management, tries to minimize human intervention by using autonomic controllers; to this end, a mathematical model of the system under study is obtained first, and then a controller that governs the behavior of the system is designed. In this work, the mathematical model of a web server is determined as a black-box model by system identification, using data collected and stored in the server's own log during operation of the computer system under study.

Keywords: autonomic computing, self-management, system identification.

1 Introduction

Technology influences every aspect of daily life. Computer applications, for example, are increasingly complex, heterogeneous, and dynamic, and so is the information infrastructure, such as the Internet, which incorporates large amounts of computing and communication resources, data storage, and sensor networks, with the risk that they become fragile, unmanageable, and insecure. It therefore becomes necessary to have experienced and dedicated administrators of services and servers, in addition to software tools for monitoring and supervision, in order to guarantee the agreed levels of quality of service (Fermín, 2012).

Diao et al. (2005) promote the use of autonomic computing by means of closed-loop automatic control systems, reducing the human intervention that Fox and Patterson (2003) have also identified as a main part of the problem, due to the use of ad hoc procedures in which resource management still depends heavily on manual control and administration.

Regarding autonomic computing, Kephart and Chess (2003) mention that the idea is for a computer system to work just like the human autonomic nervous system when it regulates our temperature, breathing, heart rate, and other functions without our being constantly aware of it; that is, it promotes the least possible human intervention in managing the performance of computer systems, moving toward their self-management. Such self-management is characterized by the properties of self-configuration, self-healing, self-optimization, and self-protection.

In the vision of autonomic computing, human administrators simply specify the high-level business objectives, which serve as a guide for the underlying autonomic processes. Human administrators can thus concentrate more easily on defining business policies at a high level and are freed from permanently dealing with the low-level technical details needed to reach those objectives, since these tasks are now performed by the autonomic system through a closed-loop autonomic controller that permanently monitors the system, uses the data collected from the running system itself, compares these metrics with those proposed by the human administrator, and controls the performance of the system, for example by keeping the response time of the system within preset levels.

According to control theory, the design of a controller depends on a good mathematical model. In the case of a computer system, a mathematical model must be obtained first in order to then design a closed-loop (feedback) controller; however, computer systems are quite difficult to model because their behavior is highly stochastic. According to Hellerstein et al. (2004), queueing theory has been used to model them, treating them as networks of queues and servers, with fairly good results but mainly for steady-state behavior, and not when the often highly dynamic transient behavior of a computer system's time response must be modeled, where the task becomes complicated.

Therefore, this article addresses the mathematical modeling of a computer system by means of system identification, an empirical approach in which, according to Ljung (1987), the input and output parameters of the system under study must be identified from data collected from the running system itself, in order to then build a parametric model, such as an ARX model, using statistical autoregression techniques. Section 2 describes some basic theory on metrics for monitoring the performance of a computer system. Section 3 addresses the mathematical modeling, and Section 4 describes the experiment carried out; finally, Section 5 presents the conclusions and future work.

2 Performance monitoring

In autonomic computing, the data obtained from monitoring the performance of the system under study contributes fundamentally to the representation of the state or behavior of the system, that is, to its mathematical model. According to Lalanda et al. (2013), knowing the state of the system from the functional and non-functional perspectives is vital for carrying out the operations needed to achieve the objectives at the desired level, and performance monitoring makes it possible to know how well this is being achieved. Performance data is generally obtained via the log of the system under study, using analysis tools that mainly apply statistical techniques.

Among the performance metrics there was initially processing speed, but as more components were added to the computing infrastructure, new metrics emerged, the main ones being those that give an idea of the throughput or work done in a period of time, the utilization of a component, or the time needed to perform a particular task, such as the response time. Lalanda et al. (2013) mention that the most popular performance metrics are the following:
- Floating-point operations per second (FLOPS): gives an idea of processing performance and allows comparisons between machines that run complex floating-point mathematical algorithms in scientific applications.
- Response time: the time it takes the system to carry out a functional processing unit. It is considered a measurement of the duration of the reaction to a given input and is mainly used in interactive systems. Responsiveness is another metric used especially when measuring real-time systems; it is the time elapsed between the start and the end of the execution of a task or thread.
- Latency: a measure of the delay experienced in a system, generally used when describing data communication elements to get an idea of network performance. It takes into account not only the CPU processing time but also the queueing delays during the transport of a data packet, for example.
- Utilization and load: intertwined metrics used to understand the resource management function; they give a measure of how well the components of a system are being used, described as a utilization percentage. Load measures the work done by the system and is usually represented as an average load over a period of time.

There are many other performance metrics:
- number of transactions per unit of cost.

- reliability function: the time during which a system has been operating without failure.
- availability function: indicates that the system is ready to be used when needed.
- size or weight of the system: indicates portability.
- performance per watt: represents the computation rate per watt consumed.
- heat generated by the components, since in large systems a cooling system is expensive.

All of these metrics, among others, give a better picture of the non-functional state of a system or provide a means to detect an event that has occurred and has modified the non-functional behavior. In the present case, the response time was initially chosen as the metric, since the computer system under study is a web server, which is interactive by nature, so that what is measured is the duration of the reaction to a given input.

3 Mathematical modeling

In autonomic computing, based on feedback control theory, to design an autonomic controller one must first obtain the mathematical model of the system under study, as shown in Figure 1. The mathematical model of a server can be obtained through two approaches: one based on basic laws, and an empirical one called system identification. Parekh et al. (2002) indicate that previous work has tried to use principles, axioms, postulates, laws, or basic theories to determine the mathematical model of computer systems, but without success, since it is difficult to build such a model given their complex and stochastic nature; moreover, detailed knowledge of the system under study is required, even more so when software versions are updated every so often, and finally this approach does not consider model validation.

Figure 1. Mathematical modeling based on control theory (adapted from Hellerstein, 2004).

In contrast, according to Ljung (1987), system identification is an empirical approach in which the input and output parameters of the system under study are identified, and a parametric model, such as an ARX model, is then built using statistical autoregression techniques. This parametric model or equation relates the input and output parameters of the system according to the following equation:

y(k+1) = A y(k) + B u(k)     (1)

where:
y(k): output variable
u(k): input variable
A, B: autoregression parameters
k: k-th sample.

This empirical approach treats the system under study as a black box, so it is not affected by the complexity of the system or by the lack of expert knowledge; even when software versions are updated, it is enough to estimate the model parameters again. Thus, for an Apache web server, the parametric equation relates the input parameter MaxClients (MC), an Apache configuration parameter that determines the maximum number of simultaneous client connections that can be served, and the output parameter response time (TR), which indicates how quickly the server responds to the service requests of its clients; see Figure 2.

Figure 2. Input and output of the system to be modeled (own elaboration).

In (Hellerstein et al., 2004) it is proposed to carry out the identification of computer systems, such as web servers, as follows:
1. Specify the scope of what is to be modeled, based on the inputs and outputs considered.
2. Design experiments and collect enough data to estimate the parameters of the linear difference equation of the desired order.
3. Estimate the model parameters using least-squares techniques.
4. Evaluate the goodness of fit of the model; if the quality of the model must be improved, one or more of the previous steps must be revisited.

4 Experiment

The architecture of the implemented experiment is shown in Figure 3.

Figure 3. Architecture for the identification of the model of a web server (own elaboration).

The personal computer PC1 is the Apache web server; it also contains the software sensor that records the start and end times of each HTTP request that reaches the web server, data that is stored in the server's own log. The response time of each HTTP request completed in a time unit and the average response time of the server are then computed. The web server does nothing by itself; for this reason, the personal computer PC2 contains a workload generator that simulates the activity of the users who want to access the web server; in this case JMeter was used, a workload-generating application that is part of the Apache project.

The operation of the web server depends not only on the activity of the users but also on the input signal MaxClients (MC), which takes the form of a variable discrete sinusoid that excites the web server together with the HTTP requests from the workload generator (PC2). Thus, with the activity of the web server stored in its log, the software sensor computes the values of the output signal response time (TR), shown in Figure 4.

Figure 4. Input signal MaxClients (MC) and output signal response time (TR).

With the MC and TR data obtained, the regression parameters A and B of the ARX parametric equation are estimated using the least-squares techniques implemented in the Matlab System Identification Toolbox, obtaining the following first-order parametric equation:

TR(k+1) = 0.06545 TR(k) + 4.984 MC(k+1)
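The two numerical steps just described, averaging per-interval response times from the logged request timestamps and fitting the first-order ARX parameters by least squares, can be sketched as follows. This uses NumPy rather than the Matlab System Identification Toolbox used by the author, the log format is an assumption, and the fit metric is only an R^2-style approximation of the toolbox's goodness-of-fit figure.

```python
import numpy as np

def average_response_times(requests, interval=1.0):
    """Mean response time per time interval from (start, end) timestamps
    (the exact log format is assumed; only the aggregation idea matters)."""
    buckets = {}
    for start, end in requests:
        buckets.setdefault(int(start // interval), []).append(end - start)
    return [np.mean(buckets[b]) for b in sorted(buckets)]

def fit_arx(tr, mc):
    """Least-squares estimate of A, B in TR(k+1) = A*TR(k) + B*MC(k+1)."""
    y = np.asarray(tr[1:])                       # TR(k+1)
    X = np.column_stack([tr[:-1], mc[1:]])       # [TR(k), MC(k+1)]
    (A, B), *_ = np.linalg.lstsq(X, y, rcond=None)
    return A, B

def fit_quality(y, y_hat):
    """Percentage of variation explained (R^2-style), comparable in spirit to the 87% reported."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return 100.0 * (1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2))
```

With measured MC and TR series, `fit_arx` would return estimates playing the role of the 0.06545 and 4.984 values reported above.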

It can be observed that the current response time depends on the previous response time and on the MaxClients input parameter. The model obtained is evaluated using the r2 (goodness-of-fit) metric of the toolbox used, which indicates the percentage of variation with respect to the original signal. In the case study, the model obtained has a goodness of fit of 87%, which can be considered an acceptable model. Figure 5 shows the great similarity between the measured output signal and the estimated output signal.

Figure 5. Measured TR and estimated TR.

5 Conclusions

The identification of the behavior of a computer system treated as a black box is possible; to do so, its operation must be simulated with hardware and with software tools such as JMeter and Matlab, for example.

The mathematical model determined from the data collected in the log of the system under study comes quite close to the real model; in this case a goodness of fit of 87% was obtained.

The implemented software sensor made it possible to compute the response time from the data stored in the log of the server itself.

From the data collected it is observed that computer systems behave close to linearly only in stretches, so it is suggested to experiment with non-linear mathematical models in order to compare the goodness of fit of both.

As future work, we plan to design an autonomic controller based on the linear ARX model, although later on the design of non-linear autonomic controllers for computer systems in general will be considered, since their non-linear temporal behavior makes the use of artificial intelligence techniques such as fuzzy logic and neural networks appropriate.

6 References

Diao, Y.; Hellerstein, J.; Parekh, S.; Griffith, R.; Kaiser, G.; Phung, D. (2005). A Control Theory Foundation for Self-Managing Computing Systems. IEEE Journal on Selected Areas in Communications, Vol. 23, Issue 12, pp. 2213-2222. IEEE. DOI: 10.1109/JSAC.2005.857206.

Fermín, F. (2012). La Teoría de Control y la Gestión Autónoma de Servidores Web. Memorias del IV Congreso Internacional de Computación y Telecomunicaciones. ISBN 978-612-4050-57-2.

Fox, A.; Patterson, D. (2003). Self-repairing computers. Scientific American, Jun. 2003, Vol. 288, Issue 6, pp. 54-61.

Hellerstein, J.; Diao, Y.; Parekh, S.; Tilbury, D. (2004). Feedback Control of Computing Systems. Hoboken, New Jersey: John Wiley & Sons, Inc. ISBN 0-471-26637-X.

Kephart, J.; Chess, D. (2003). The vision of autonomic computing. Computer, Jan. 2003, Vol. 36, Issue 1, pp. 41-50. IEEE. DOI: 10.1109/MC.2003.1160055.

Lalanda, P.; McCann, J.; Diaconescu, A. (2013). Autonomic Computing: Principles, Design and Implementation. London: Springer-Verlag. ISBN 978-1-4471-5006-0.

Ljung, L. (1987). System Identification: Theory for the User. Englewood Cliffs, New Jersey: Prentice Hall.

Parekh, S.; Gandhi, N.; Hellerstein, J.; Tilbury, D.; Jayram, T.; Bigus, J. (2002). Using control theory to achieve service level objectives in performance management. Real-Time Systems, Jul. 2002, Vol. 23, Issue 1/2, pp. 127-141. Norwell, Massachusetts: Kluwer Academic Publishers. DOI: 10.1023/A:1015350520175.

Supervised classifiers for the predictive analysis of maternal death and survival

Pilar Vanessa Hidalgo León

Universidad Andina del Cusco/San Jerónimo, Cusco-Perú [email protected]

Abstract

This work is based on the analysis of supervised classifiers that can produce acceptable results for the prediction of maternal death and survival, according to characteristics of patients with complications during pregnancy as determined by public health experts. We describe the development methodology, the particularities of the sample, and the instruments used to process the data; the results of the research after the evaluation of each classifier and, among them, the one that gives the best results; the histograms for each attribute of the patients and their inclusion in the sample; the parameters that determine their correct classification; and the comparison of the results of each type of classifier within the family to which it belongs, in order to then implement the identified algorithm, Naive Bayes with the kernel estimator activated (KERNEL=TRUE), in a software tool that contributes to accurate and well-supported decision making for health professionals. In conclusion, a supervised classifier was found that responds positively to changing and improving the problem surrounding maternal survival despite complications.

Keywords: maternal mortality, supervised classifiers, Bayesian networks, supervised learning.

1 Introduction

The objective of using supervised classifiers (Araujo, 2006) is to build models that optimize a performance criterion using data or previous experience. In the absence of human experience, in order to resolve a dilemma that requires a precise explanation, systems implemented with classification models have been an important part of decision making. Moreover, when the problem requires promptness because of its nature, classifiers transform data into knowledge and provide successful applications.

In the case of risk factors for maternal health, there are statistical studies and public health applications to determine them, but they are not simultaneously integrated as part of a classification probability model. For this reason, this study will determine the supervised classifier that is most efficient in time and results and that establishes the difference between classes of pregnant patients with complications during pregnancy who may come to present fatal symptoms and those who do not, thus helping health personnel to make the best possible decision and to prevent future rises in the mortality rate of their community.

The problems that cause this indicator to rise are already known and are highlighted in this research:

"Today there is sufficient evidence showing that the main causes of maternal death are postpartum hemorrhage, pre-eclampsia or sepsis, and problems related to the presentation of the fetus. We also know which are the most effective and safest measures for treating these obstetric emergencies. To apply them, the pregnant woman must reach a health facility with adequate resolution capacity, but unfortunately many indigenous women do not go to this service for various reasons, related both to the geographic, economic, social, and cultural characteristics of their population groups and to the deficiencies of the health system itself. In recent years, many


efforts have been made to reverse this situation, both through projects promoted by the state and through those carried out by non-governmental development organizations. These efforts have had uneven results, however, mainly because of the poor adaptation of the projects to the geographic and infrastructure context in which a large part of the indigenous population lives, to their economic difficulties in accessing the service, their culture, their own concepts of health and illness, and their own health system." (Cordero, 2010)

2 Content

Problem

Which are the supervised classifiers that predict maternal death or survival with the greatest effectiveness?

To adequately determine the effectiveness of each algorithm, we therefore ask:

• What are the specificity, the correct classification rate, the absolute error, and the sensitivity of each supervised classifier with respect to the maternal mortality data?
- Neural networks
- Bayesian networks
- Logistic regression
- Decision trees
- Distance-based algorithms

Using the Weka tool, the sensitivity and the closest certainty of each of these algorithms were determined, and the conclusion will suggest the most efficient one:
- Neural networks
- Bayesian networks
- Logistic regression
- Decision trees
- Distance-based algorithms

Objectives

• Determine the supervised classifier that gives the best results for the predictive analysis of maternal death and survival.

To achieve this objective, one must:

• Determine the specificity, the correct classification rate, the absolute error, and the sensitivity of each supervised classifier with respect to the maternal death and survival data.

At present, maternal mortality is a decisive development indicator for Latin American countries, being not only an indicator of poverty and inequality but also of the vulnerability of women's rights (OMS, 2008).

Limitations of the study

The data collected for this study regarding deceased patients who had complications during pregnancy amounted to 48 cases, since the records of the clinical histories corresponding to the cases of death are not legible, nor are they preserved in the best conditions in the archives of the Cusco Regional Health Directorate; this means that the sample could not be enriched with a greater diversity of data (DIRESA, 2007).

There is little research on this topic related to the use of supervised classifiers, and other research on medical diagnoses is similar to this sample, but none is directly related.

The relevance of the data was limited to the background of studies on maternal mortality (age, marital status, illiteracy, occupation, origin, contraception, environment (social stratum), prenatal check-ups, home location, delay in receiving care, professional care, family history, birth interval (in years), parity (number of children), untreated complications, death), preserving the anonymity of each patient.


General hypothesis

Hi: There are supervised classifiers that predict maternal death or survival with effectiveness.

Null hypothesis: There are no supervised classifiers that predict maternal death or survival with effectiveness.

Definition of variables

Main variable: supervised classifiers.

Variables involved:

Variable                         Supervised classifiers
Dimension                        Neural networks, supervised algorithms, Bayesian networks, decision trees, logistic regression, instance-based algorithms
Indicator / measurement criteria Correct classification, incorrect classification, sensitivity, specificity, execution time, mean absolute error, kappa statistic, root mean squared error, relative absolute error, root relative squared error
Class                            Discrete numeric
Instrument                       Weka 3.5.7, Explorer

Table I: Variables involved

Research methodology

Quasi-experimental, applied, inductive.

Description of the sample and collection method

The data were collected from all 100 clinical histories (HC) of cases of maternal survival and of maternal death. The characteristics of the pregnant women are very similar to one another and correspond to the population of the city of Cusco, from the archive of the Southern Network of the Regional Health Directorate on the sanitary monitoring of maternal mortality in the Cusco Region.

The number is limited, because the histories from 1992 to 2011 were neither written nor preserved in the best condition, making it difficult to interpret enough data for analysis.

These patients were not necessarily attended in these facilities from the beginning; rather, because of their complications during delivery and pregnancy, they were referred to the provincial capitals and then to the facilities with greater resolution capacity for their care.

The data from the clinical histories included:
• Age
• Marital status
• Illiteracy
• Occupation
• Origin
• Contraception
• Environment (social stratum)
• Prenatal check-ups
• Home location
• Delay in receiving care
• Professional care
• Family history
• Birth interval (in years)
• Parity (number of children)
• Untreated complications
• Death

Between the 52 survivors and 48 deceased patients, both groups with similar characteristics, the determining factors were (Ramírez, 2009):

Home location / delay in receiving care
Check-ups > 2: n (49.0/2.0)
PROCEDENCIA (origin) = rural: s (6.0/1.0)

Technique and research instruments

The Weka Explorer tool was used for the interpretation of the data, with the supervised classification options and the algorithms offered by this tool (Corso, 2009).
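The measurement indicators listed in Table I can be computed from a 2x2 confusion matrix once a classifier has been run; a minimal sketch is shown below, where the counts are purely illustrative (48 deceased and 52 survivors as in the sample, but an invented split).

```python
def binary_indicators(tp, fn, fp, tn):
    """Sensitivity, specificity, accuracy and Cohen's kappa from a 2x2 confusion matrix
    (tp = deceased correctly identified, tn = survivors correctly identified)."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)                  # true-positive rate (class "deceased")
    specificity = tn / (tn + fp)                  # true-negative rate (class "survivor")
    accuracy = (tp + tn) / n
    # expected agreement by chance, from the row/column marginals
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e)
    return sensitivity, specificity, accuracy, kappa

print(binary_indicators(tp=43, fn=5, fp=7, tn=45))   # illustrative counts only
```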


The following supervised classifiers were evaluated:

• Neural networks
- MultilayerPerceptron
- RBFNetwork
• Bayesian networks
- BayesNet
- Bayes, simple estimator
- BMA Bayes
- Naive-Bayes
- BayesNet, kernel
- Naive-Bayes, supervised discretization
• Decision trees
- J48
- DecisionTable
• Logistic regression
- MultiClassClassifier
- Logistic
• Distance-based algorithms
- IBk
- LWL
- KStar

These results were compared with the classification rules provided by the immediate-prediction algorithms:
- OneR
- ZeroR

Data collection procedure

The clinical histories were inserted into a CSV (comma-delimited) file whose header contains the labels of each attribute, and whose last column refers to the class to which each record belongs.

With respect to the particular indicators for each of the attributes of the subjects of the class, we have the following values (Table II):

Attribute                        Negative                         Positive                                   Values
Age                              Under 19 or over 35              19-35                                      14-48
Marital status                   Single                           With partner                               Single / with partner
Education                        Illiterate                       Primary                                    Illiterate / primary / secondary / higher
Occupation                       Unpaid                           Paid                                       Paid / unpaid
Origin                           Rural                            Urban                                      Rural / urban
Contraception                    No                               Yes                                        Yes / no
Environment (social stratum)     Low                              Medium-high                                Low / high
Prenatal check-ups               1 to 5                           6 or more                                  0-12
Home location / delay in care    More than 2 hours                Less than 3 hours from the health facility Less than 1 hour, 1-2, 3-5, 6 or more
Professional care                No                               Yes                                        No / yes
Family history                   No                               Yes                                        No / yes
Birth interval (years)           Under 2 or over 4                2-4                                        First pregnancy, under 1, 1-3, 4-6
Parity (number of children)      Primiparous or more than 4       2-4                                        0-10
Untreated complications          Complications before and during  No complications                           No / yes
Deceased                         Yes                              No                                         No / yes

Table II: Range of values of the attributes of the patients in the sample

The evaluation considered:
• The quality of the structuring.
• Comparing all the measurements of each classifier, algorithm by algorithm.
• Evaluating the results.
• Interpreting the results.

2.9 Plan for the analysis of the information

• Determination of objectives.
• Data preparation.
• Data selection: identification of the external and internal information sources and selection of the necessary data subset.
• Preprocessing: study of the quality of the data and determination of the mining operations that can be performed.
• Data transformation: conversion of the data into an analytical model.
• Data mining: automated processing of the selected data with an appropriate combination of algorithms (Ramírez, 2011).
• Data analysis: interpretation of the results obtained in the previous stage, generally with the help of a visualization technique.
• Assimilation of the discovered knowledge (Calderón, 2006).
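A hedged sketch of the kind of comparison described above: loading the CSV, training several classifier families (scikit-learn analogues of the Weka algorithms listed, not Weka itself), and estimating accuracy by cross-validation. The file name and the exact column layout are assumptions, and Weka's kernel-estimator Naive Bayes has no exact scikit-learn twin.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Assumed layout: one row per clinical record, last column = class (deceased / survivor).
data = pd.read_csv("mortalidad_materna.csv")          # hypothetical file name
X = pd.get_dummies(data.iloc[:, :-1])                 # one-hot encode categorical attributes
y = data.iloc[:, -1]

models = {                                            # rough scikit-learn counterparts
    "Naive Bayes": GaussianNB(),
    "Decision tree (J48-like)": DecisionTreeClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "k-NN (IBk-like)": KNeighborsClassifier(),
    "Multilayer perceptron": MLPClassifier(max_iter=2000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)      # 10-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```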


Histogramas:

Cada Atributo es evaluado visualmente por los • Fracción de verdaderos negativos (FVN). histogramas que arroja Weka (en total 15), por Demuestra la cantidad de pacientes que ejemplo con respecto a la EDAD de las pacientes realmente pertenecen a la clase de la muestra. Sobreviviente. Quiere decir que si el algoritmo estudiado tiene alto porcentaje de especificidad determina con gran éxito la probabilidad de sobrevida en pacientes complicadas durante su embarazo según los datos proporcionados en la ficha de antecedentes.

• Clasificación correcta : de la totalidad de datos, entre los que 52 que pertenecen a la clase Sobreviviente, y los 48 que pertenece a la clase Fallecida, determina dentro de cada clase cuantas instancias luego de la construcción del clasificador cuantas si pertenecen a la clase determinada. Gráfico 1: Histograma de la Edad de las pacientes de la muestra • En el caso de pertenecer a la clase Este histograma nos muestra el intervalo de edad sobreviviente o a la fallecida de las 100 de las pacientes de la muestra, la diferencia de instancias cuantas fueron clasificadas colores determinan la clase a la que pertenece correctamente. cada intervalo. Podemos observar lo siguiente: • Clasificación incorrecta: del mismo • Las pacientes del intervalo 14-20 años modo la cantidad de instancias que no pertenecen en su mayoría a la clase fueron clasificadas de manera correcta, “sobreviviente” son las que de manera supervisada se sabe que pertenecen a una u otra clase y fueron • Las pacientes del intervalo 21-31 años incluidas dentro de la cual no eran. Si el tienen un mayor porcentaje de sobrevida, indicador emite un número mayor al 50% coincide con el promedio de edad de la cantidad total de instancias, no se adecuado y de la muestra. debe considerar como eficiente.

• El intervalo de paciente entre 32-40 años • Sensibilidad: es la capacidad del tiene mayor porcentaje de muerte. algoritmo de clasificar a las pacientes complicadas dentro de la clase Fallecidas. • Las pacientes mayores a 40 años Es decir que si el clasificador tiene un alto pertenecen a la clase “fallecida” en gran porcentaje tiene mejor curva de corte y porcentaje. discernimiento entre los sujetos que Resultados por algoritmo testeado: pertenecen o no a la clase fallecida, es así que si la cifra de sensibilidad es del 90%, Los algoritmos usados para evaluar la base de existe entonces esa probabilidad de que la datos en mortalidad materna dieron como paciente fallezca. resultado cifras continuas indicando los siguientes sucesos: Tabla 2.

• Mean absolute error: Se define error • Especificidad: es la probabilidad de que absoluto de una medida la diferencia entre pacientes complicadas y de riesgo el valor medio obtenido y el hallado en esa pertenezcan a la clase Sobreviviente. Es medida todo en valor absoluto. decir los verdaderos negativos.

71

• The mean absolute error is therefore the average of the absolute classification errors over all the subjects. A classifier that yields a higher figure (above 0.1) has a high classification error and should not be preferred over classifiers that yield a lower figure.

• Execution time: measured in seconds, it is the amount of time the classifier takes to build its architecture and to produce results.

• A classifier may be regarded as efficient if the time it takes to produce results is under 5 seconds; even so, this characteristic is only useful in combination with the other indicators.

• Kappa statistic: the Kappa statistic is the agreement between the classification observers. In a classification matrix, it compares the expected index on the main diagonal (Xii, the element classified into the same class by both observers) with the real index obtained after the classification performed by the selected architecture (linear regression, backpropagation, Naive Bayes, etc.), expressed as the percentage distance from that value.

• For example, if the expected matrix places the value at 25.00 and the architecture's result is 26.7, the difference would be 1.7, equivalent to 90.32%. Hence, the larger this percentage, the closer the classifier is to being considered efficient.

• Root mean squared error: the root mean squared error is a frequently used measure of the differences between the values predicted by a model or an estimator and the values actually observed. RMSD is a good measure of precision, but only for comparing prediction errors within a single data set and not across different ones, since it depends on the scale of the sample. These individual differences are also called residuals, and the RMSD aggregates them into a single measure of predictive ability.

• Relative absolute error: the error relative to each characteristic of the class. For example, consider the relative error of having a birth-spacing interval of 0.5 years and belonging or not to the Deceased class: the classification would indicate that the instance does belong, since that is the value indicated for patients at risk. In this case the positive value for belonging to the Survivor class is between 2 and 4 years, or 0.5 years for a first pregnancy; the instance is included in the Deceased class if the error between the values determined by the class and the entered value is less than 1.

• Root relative squared error: the root relative squared error E_i of an individual program i is evaluated by the equation

E_i = \sqrt{ \frac{\sum_{j=1}^{n} \left( P_{(ij)} - T_j \right)^2}{\sum_{j=1}^{n} \left( T_j - \bar{T} \right)^2} }

where P_{(ij)} is the value predicted by program i for sample case j (out of the n sample cases), T_j is the target value for sample case j, and \bar{T} is given by the formula

\bar{T} = \frac{1}{n} \sum_{j=1}^{n} T_j

For a perfect fit the numerator is equal to 0 and E_i = 0, so the E_i index ranges from 0 to infinity, with 0 corresponding to the ideal. (A worked numerical sketch of these error measures and of the Kappa statistic is given below.)

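As a worked illustration of the error measures and the Kappa statistic described above, the minimal sketch below computes them for a small hypothetical vector of predictions and targets and a hypothetical 2x2 confusion matrix; all numbers are illustrative assumptions and are not results from the paper.

```java
public class ErrorMetrics {
    public static void main(String[] args) {
        // Hypothetical predicted vs. target values for n sample cases.
        double[] predicted = {0.9, 0.2, 0.7, 0.1, 0.8};
        double[] target    = {1.0, 0.0, 1.0, 0.0, 1.0};
        int n = target.length;

        // Mean of the target values (T-bar in the root relative squared error formula).
        double mean = 0;
        for (double t : target) mean += t;
        mean /= n;

        double sumAbs = 0, sumSq = 0, sumAbsToMean = 0, sumSqToMean = 0;
        for (int j = 0; j < n; j++) {
            double diff = predicted[j] - target[j];
            sumAbs += Math.abs(diff);
            sumSq  += diff * diff;
            sumAbsToMean += Math.abs(target[j] - mean);
            sumSqToMean  += (target[j] - mean) * (target[j] - mean);
        }

        double mae  = sumAbs / n;                              // mean absolute error
        double rmse = Math.sqrt(sumSq / n);                    // root mean squared error
        double rae  = 100.0 * sumAbs / sumAbsToMean;           // relative absolute error, %
        double rrse = 100.0 * Math.sqrt(sumSq / sumSqToMean);  // root relative squared error, %
        System.out.printf("MAE = %.4f, RMSE = %.4f, RAE = %.2f%%, RRSE = %.2f%%%n",
                mae, rmse, rae, rrse);

        // Kappa statistic from a hypothetical 2x2 confusion matrix
        // (rows: true class, columns: predicted class).
        double[][] m = {{47, 5}, {4, 44}};
        double total = 100;
        double observed = (m[0][0] + m[1][1]) / total;          // observed agreement
        double expected = ((m[0][0] + m[0][1]) * (m[0][0] + m[1][0])
                         + (m[1][0] + m[1][1]) * (m[0][1] + m[1][1])) / (total * total); // chance agreement
        double kappa = (observed - expected) / (1 - expected);
        System.out.printf("Kappa = %.4f%n", kappa);
    }
}
```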

INDICATORS                  | Rules: OneR | Bayesian Networks: Naïve Bayes (kernel) | Neural Networks: RBFNetwork
Specificity                 | 0.836       | 0.91                                    | 0.9
Correct classification      | 84          | 91                                      | 90
Incorrect classification    | 16          | 9                                       | 10
Sensitivity                 | 0.868       | 0.914                                   | 0.9
Mean absolute error         | 0.16        | 0.1142                                  | 0.1468
Execution time (s)          | 0.2         | 0.01                                    | 0.26
Kappa statistic             | 0.6759      | 0.819                                   | 0.7997
Root mean squared error     | 0.4         | 0.2737                                  | 0.3032
Relative absolute error     | 32.03%      | 22.86%                                  | 29.39%
Root relative squared error | 80.01%      | 54.74%                                  | 60.64%

Table III: Supervised classification algorithms with the best results per family

INDICATORS                  | Decision Trees: J48 | Distance Algorithms: LWL | Logistic Regression: Logistic
Specificity                 | 0.9                 | 0.889                    | 0.82
Correct classification      | 90                  | 89                       | 82
Incorrect classification    | 10                  | 11                       | 18
Sensitivity                 | 0.9                 | 0.902                    | 0.82
Mean absolute error         | 0.1174              | 0.168                    | 0.1753
Execution time (s)          | 0.02                | 0                        | 0.81
Kappa statistic             | 0.7994              | 0.6388                   | 0.6388
Root mean squared error     | 0.299               | 0.3019                   | 0.4157
Relative absolute error     | 23.75%              | 33.64%                   | 35.10%
Root relative squared error | 59.80%              | 60.39%                   | 83.15%

Table IV: Supervised classification algorithms with the best results per family

Description of the proposed methodology

It is therefore proposed that, after evaluating each of the most important indicators in the construction of the classifier, those classifiers that do not meet the following characteristics be discarded:

• Specificity > 90%
• Correct classification > 90 instances
• Incorrect classification < 10 instances
• Sensitivity > 90%
• Mean absolute error < 0.1 ideal
• Kappa statistic > 0.79, > 0.9 ideal
• Root mean squared error < 0.3, < 1 ideal
• Relative absolute error < 25%, < 1 ideal
• Root relative squared error < 50%, 0 ideal

A classifier that meets these specifications is considered optimal for integration into a system that will evaluate the database containing the patients' data, comparing it against each new incoming record.

Evaluation stage:

Each of the classifiers studied, applied to the patient sample, produced the indicators shown in the tables, with an acceptable level of performance and good results for the study; the architecture with the best percentages within the parameters is the Naive Bayes algorithm with a kernel estimator.

The remaining architectures are therefore discarded as appropriate for this type of sample. This also allows the results shared by the architectures to be visualised more clearly and correctly, leaving behind the paradigm that regards logistic regression as the most suitable technique for preventive diagnosis in health care.

The kernel estimator, being a non-parametric construction method, is more flexible than classifiers that rely on parameters. By dividing the sample into 40 groups for the age variable, 15 for parity and 17 for prenatal controls, the exploration discovers new classification alternatives among the attributes of the class, showing them to be relevant.

The most significant problem is that it requires a larger sample size to maintain these results, since it uses grouping (clustering) for classification, which can be a disadvantage where supervised classifiers are concerned. This is known as the curse of dimensionality, because the results depend on the choice of a smoothing parameter, in this case the values of age, number of prenatal controls and number of children, which makes the classification less objective.

When Naive Bayes works with non-parametric estimation, the structures built from this architecture are obtained from a tree formed with potentially predictive multidimensional variables. (Serrano, 2010)

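The classifiers compared in Tables III and IV (OneR, Naive Bayes with a kernel estimator, RBFNetwork, J48, LWL, Logistic) are all available in WEKA. The sketch below is a minimal, hypothetical example of how the winning configuration, Naive Bayes with a kernel estimator, could be evaluated with 10-fold cross-validation through the WEKA Java API; the ARFF file name and the "Deceased" class label are assumptions, not details taken from the paper.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateNaiveBayesKernel {
    public static void main(String[] args) throws Exception {
        // Load the data set (hypothetical file name) and use the last attribute as the class.
        Instances data = DataSource.read("maternal_mortality.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Naive Bayes with a non-parametric kernel density estimator for numeric attributes.
        NaiveBayes nb = new NaiveBayes();
        nb.setUseKernelEstimator(true);

        // 10-fold cross-validation and the indicators reported in the tables.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        int deceased = data.classAttribute().indexOfValue("Deceased"); // hypothetical class label

        System.out.println("Correct classification (%):      " + eval.pctCorrect());
        System.out.println("Incorrect classification (%):    " + eval.pctIncorrect());
        System.out.println("Sensitivity (Deceased):          " + eval.truePositiveRate(deceased));
        System.out.println("Specificity (Deceased):          " + eval.trueNegativeRate(deceased));
        System.out.println("Kappa statistic:                 " + eval.kappa());
        System.out.println("Mean absolute error:             " + eval.meanAbsoluteError());
        System.out.println("Root mean squared error:         " + eval.rootMeanSquaredError());
        System.out.println("Relative absolute error (%):     " + eval.relativeAbsoluteError());
        System.out.println("Root relative squared error (%): " + eval.rootRelativeSquaredError());
    }
}
```

The printed indicators correspond to the rows of Tables III and IV, so the thresholds of the proposed methodology can be checked directly against this output; the same evaluation loop can be repeated for OneR, RBFNetwork, J48, LWL and Logistic by swapping the classifier object.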

3 Conclusions

With respect to the answers to the hypotheses, we can state the following:

• We found that all the architectures proposed in this work retain a fairly acceptable effectiveness in each of their indicators, which together amount to 80% efficacy, the best of them being Naive Bayes with a kernel estimator.

• These hypotheses are discarded in the light of the work carried out.

• Bayesian Networks give the study of the maternal mortality and survival data a specificity of 91%, correct classification of 91%, an absolute error of 0.1142 and a recommended sensitivity of 0.914.

• Neural Networks give the study of the maternal mortality and survival data a specificity of 90%, correct classification of 90%, an absolute error of 0.1468 and a recommended sensitivity of 90%.

• Decision Trees give the study of the maternal mortality and survival data a specificity of 90%, correct classification of 90%, an absolute error of 0.1174 and a recommended sensitivity of 90%.

• Distance-based algorithms give the study of the maternal mortality and survival data a specificity of 88.9%, correct classification of 89%, an absolute error of 0.168 and a recommended sensitivity of 90.2%.

• Logistic Regression gives the study of the maternal mortality and survival data a specificity of 82%, correct classification of 82%, an absolute error of 0.1753 and a recommended sensitivity of 82%.

Figure 2: Statistical indicators of all the supervised classification algorithms

4 Recommendations

• It is recommended to use discretized data if Naive Bayes is to be used for the evaluation (see the sketch after this list).

• Group the data in the way proposed in Table 1, so that the parameters for maternal mortality established by public health experts are respected.

• It is important to compare the results with the OneR decision rule in future experiments, since it gives the best notion of truthfulness and effectiveness when analysing the information.

• The aim is to implement a system in R Project to enter new records, keeping in memory the database collected through this work.

• To help with the larger sample that Naive Bayes Kernel requires, it is necessary to enter into the database the records of maternal deaths and the daily complications of pregnant women at the national level.

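As a companion to the first recommendation, the sketch below shows one way to discretize the numeric attributes with WEKA's supervised Discretize filter before training a plain Naive Bayes model; the file name is a hypothetical placeholder and the filter choice is one reasonable option rather than the paper's prescribed procedure.

```java
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class DiscretizeThenNaiveBayes {
    public static void main(String[] args) throws Exception {
        // Hypothetical data set; the class attribute is assumed to be the last one.
        Instances data = DataSource.read("maternal_mortality.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Class-aware (supervised) discretization of numeric attributes such as
        // age, parity and number of prenatal controls.
        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, discretize);

        // Plain Naive Bayes on the discretized data; no kernel estimator is needed
        // once the attributes are nominal.
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(discretized);
        System.out.println(nb);
    }
}
```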

María García Jiménez, & Aránzazu Álvarez Sierra. (2010) Análisis de Datos en WEKA Agradecimientos – Pruebas de Selecitividad. Articulo. Agradezco al Dr. Basilio Araujo y al Dr. Yosu Yuramendi, por su apoyo y María N. Moreno García* L. A. (2005) perseverancia para la culminación de este Obtención Y Validación De Modelos De proyecto de fin de master. Estimación De software Mediante Técnicas De Minería De Datos. Revista colombiana Al personal de salud que proporciono la de computación, 3[1], 53-71. información clínica que se usó en la investigación. Msp Mynor Gudiel M., 1. E. (2001-2002) 5 Síntesis curricular de la autora: Modelo Predictor De Mortalidad Materna. Mortalidad Materna, 22-29. Pilar Hidalgo León, Ingeniera de Sistemas de la Universidad Andina del Cusco en Perú, Organización Mundial De La Salud. Docente en la Facultad de Ingeniería, Mortalidad Materna En 2005, (2008) : Sustentación de proyecto de máster en la Estimaciones Elaboradas Por La Oms, El Universidad del País Vasco. Unicef, El Unfpa Y El Banco. Ginebra 27: Contacto: [email protected] , Ediciones de la OMS. Universidad Andina de Cusco, San Jerónimo, Cusco, Perú. Porcada, V. R. (2003) Clasificación Supervisada Basada En Redes Bayesianas. 6 Referencias bibliográficas Aplicación En Biología Computacional. Clasificadores supervisados: el objetivo es Madrid: Tesis Doctoral. obtener un modelo clasificatorio valido para permitir tratar casos futuros.( 2006) Ramírez Ramírez, R., & Reyes Moyano, J. , Araujo, B. S. Aprendizaje Automatico: (2011). Indicadores De Resultado conceptos básicos y avanzados. Madrid, Identificados En Los Programas España: Pearson Prentice Hall, Estratégicos. Lima: Encuesta Demográfica Y De Salud Familiar – Endes Cordero Muñoz, L., Luna Flórez, A., & Ramirez, C. J. (2009)Caracterización De Vattuone Ramírez, M. (2010) Salud de la Algunas Técnicas Algoritmicas De La mujer indígena : intervenciones para reducir Inteligencia Artificial Para El la muerte materna. © Banco Interamericano Descubrimiento De Asociaciones Entre de Desarrollo. Variables Y Su Aplicación En Un Caso De Antonio Serrano, E. S., (2010) Redes Investigación Específico. Medellin: Tesis Neuronales Artificiales. Universidad de Magistral. Valencia. Reproductive Health Matters, (2009) Calderón Saldaña, J., & Alzamora de los Mortalidad Y Morbilidad Godos Urcia, L. (2009) Regresión Logística Materna:Gestación Más Segura Para Las Aplicada A La Epidemiología. Revista Mujeres. Lima: © Reproductive Health Salud, Sexualidad y Sociedad, 1[4]. Matters.

Calderón, S. G, (2006) Una Metodología Vilca, C. P. (2009) Clasificación De Tumores Unificada para la Evaluación de Algoritmos De Mama Usando Métodos. Lima: Tesis. de Clasificación tanto Supervisados como Dirección Regional De Salud Cusco, (2007) No-Supervisados. México D. F.: resumen de Análisis De Situación De La Mortalidad tesis doctoral. Materna Y Perinatal, Región Cusco. Corso, C. L. (2009) Aplicación de algoritmos de clasificación supervisada usando Weka. Ministerio De Salud, Dirección General De Córdoba: Universidad Tecnológica Epidemiología Situación De Muerte Nacional, Facultad Regional Córdoba. Materna, (2010-2011), Perú.
