Use of Web Analytics and Data Mining to Improve the Navigation Process and Marketing Plan

Authors: Andre Luiz Zambalde, Alexander Kippes, Ahmed Ali Abdalla Esmin, Eder Bruno Fonseca

Knowledge of consumer habits helps companies explore user needs, how customers use products, and how they react to marketing campaigns. Marketing experts often use as many of the available channels as possible in order to generate the most revenue. Using every available channel, however, often has the negative effect that companies lose sight of the real target group for their products. Sending messages only over the mass media is not the best solution for every marketing campaign. The goal must be to identify the relevant focus groups and to send them individually targeted messages. To reach this goal, companies should use tools based on mining techniques that help discover hidden consumer information. Data Mining consists of statistical techniques that help to find hidden information in huge amounts of data related to user and customer experiences. It is used to find common user behavior patterns, and the extracted information can improve strategic business and marketing decision making. This paper describes the use of web analytics and data mining to improve the web site navigation process for marketing purposes, based on a real case study. The research was conducted between July 2009 and October 2009. The unit of analysis was the web site of the distance-learning post-graduate course in Software Engineering offered by the Federal University of Lavras. Data were collected using techniques of knowledge discovery in databases and web usage mining. The results provide valuable information on how to improve the marketing strategy for the website. A small minority of visitors used search engines to find the website, and a large majority of users left the website after the first click. The website should first be optimized for search engines, and an AdWords marketing campaign could be initiated to generate more visitors from search engines. Furthermore, an e-mail marketing campaign could increase the direct hits on the intended doorway page. Such a campaign could be sent to former students of the University or of other universities. To estimate the success, different landing pages could be used in the e-mail and AdWords campaigns. If a redesign including search engine optimization is initiated, the website has to be examined with a Data Mining Context algorithm. The goal would be to find out the keyword density and the search ranking of the page. The ranking is a product of the strengths of the links that are connected to the page. Considering the high number of visitors that left quickly, it is recommended that the back links of the page be checked to find out which of them help to improve the performance.


1 INTRODUCTION

1.1 Contextualization and motivation

The Internet has drastically changed the world in which we live. It connects millions of people and brings companies closer to their customers. With the Internet, a massive amount of information is interconnected and data is available to everyone, everywhere. In the so-called information age, the amount of data available to users and businesses is enormous, and it needs to be filtered to obtain new and relevant information. In business, knowledge, especially knowledge about consumer behavior, is necessary to survive in today's highly competitive markets. Consumers often act differently than companies expect, so it is necessary to find out more about consumer needs and expectations. The Internet provides new opportunities to obtain the data needed to discover consumer knowledge in real time. Modern computer systems can help companies find relevant information that can be the foundation of innovation processes. Consumers talk freely on the Internet and exchange messages in many different ways. Data Mining can also be used in marketing strategies and in selling products online. The size of the company is not important: start-ups as well as established companies can profit from Data Mining technologies. Data Mining can be used to identify special user groups and to create individually targeted marketing campaigns that deliver information to these groups. The usability of a website can be improved by examining what the main navigation paths are and why users might choose navigation paths that were not expected.

1.2 Objectives and structure of the work

The Federal University of Lavras (UFLA), MG, Brazil, offers distance learning courses. On different websites it presents the courses and their objectives to potential learners. This research examines the commercial website of a distance learning course in software engineering. The goal is to find out how users enter, navigate, and leave the website. It also examines which technologies the visitors use, in order to create specific target groups and to help test updates. The overall goal is to propose marketing strategies that can improve the performance of the website in the future.

This case study is divided into four parts. The first part explains the theoretical background of the research. It deals with customer behavior and online marketing, explaining how consumer behavior has changed over the past decades, how the Internet brought about recent changes, and how companies create marketing strategies. Furthermore, the concept of Knowledge Discovery in Databases and the techniques and forms of Data Mining are explained. The second part describes the methodology used to collect and evaluate the data. It explains how the research is conducted, how the website is observed, and how relevant information is extracted. The overall Knowledge Discovery in Databases process is explained, from the collection of data to the point where the results reveal useful and relevant new information. The last two parts, the results and discussion and the conclusion, present the results of the research, including which technologies the observed visitors used and how they navigated through the website. Opportunities for new marketing concepts and strategies that can help to improve the performance of the website are presented.


2 THEORETICAL REFERENCE

2.1 Consumer behavior

2.1.1 Today’s consumer

Markets consist of consumers and companies, with new competing companies constantly entering the market and trying to sell their products. According to Vieira (2002), the customer relationship has increasingly become a center of strategic thinking. In unstable markets it is necessary to investigate consumer behavior in order to satisfy expectations and needs; this is considered the most important factor for surviving in hypercompetition (CARLOS; VARGAS, 2008). Consumer behavior is understood as what a consumer feels, thinks, and does during the process of making a purchase decision. When making a decision, the consumer is influenced by cultural background, knowledge, and the friends in his or her environment. The social and cultural environments of a customer change (VIEIRA, 2002); therefore, winning a customer is only the first step in creating a lasting relationship and satisfying the customer in the long term. A customer model must be simple, but it has to be checked for appropriateness every time it is used.

2.1.2 What is consumer satisfaction?

According to Lira and Marchetti (1999), there are five steps in a consumer’s purchase decision (Figure 2.1). In the first step the consumer compares an ideal form, derived from knowledge and values, with the real situation. Then the consumer searches for more information to make a decision. After the search the consumer evaluates the relevant information from different perspectives. Based on the results the consumer buys a product, and in the post-buy decision phase the real satisfaction evolves.

[Figure 2.1: flow diagram with five steps: compare reality against ideal form -> search for information -> evaluation -> buy decision -> post-buy decision]

Figure 2.1. Consumer ideal behavior buy decision graph.

Satisfaction, and with it human behavior, is of high importance for companies. According to Chauvel (1999), maximizing customer satisfaction is the most important instrument for a company’s success. The different ways in which a product is used therefore have to be considered in marketing decisions. Some models are criticized because they see usage only from the perspective of the manufacturer and not from the customer’s side. The consumer is not a single stereotypical person. There are several types of consumers with different minds, ideas of use, and rationalities (CHAUVEL, 1999). The rationalities of different persons range widely, and with them what is considered rational. When a customer acquires a product, its usage can be quite different from what the manufacturer intended, and the customer’s behavior might be considered irrational. This observed irrationality in consumer behavior was the reason why companies started to examine consumers’ minds and to interpret cognitive processes (CHAUVEL, 1999).


2.1.3 Online consumer behavior

Today many technical innovations like the Internet inform consumers quickly about new products and even offer them the opportunity to create their own retail outposts (GABRIEL, 2008). The online consumer spends a lot more time browsing for products and information than he or she would in real stores. In physical stores a seller can quickly find out whether a product is a success but cannot see quickly when it is a flop. In the online marketplace this changes: sellers can see very quickly when a product flops, because many customers influence the purchasing process of others by sharing their opinions online or leaving hidden messages in comments or guestbooks (BURBY; ATCHISON, 2007; CAMPELL, 2005). The consumer is the most important force in markets. Without a consumer there is no market. Companies should have a special interest in the reactions of customers. Every reaction of a customer must lead to a reaction of the company, with the goal of providing better services or products (BURBY; ATCHISON, 2007). Therefore, companies have to measure and collect data on customer behavior. In the data collection, companies save page visits and navigation routes. This kind of data may describe the cognitive processes of the customer and can be used to learn about customers and the success of business decisions. This applies not only to the current situation: the history of a customer’s behavior can help to learn from the past and improve future marketing and business decisions. To reach these goals the website has to be optimized in a customer-driven process (BURBY; ATCHISON, 2007). In order to understand online customers, the company needs knowledge about online customer behavior (SONG; ZAHEDI, 2005). Song and Zahedi (2005) present a generic model of online customer behavior leading to a purchase decision.

2.2 Today’s Marketing

Today the Internet is not only a platform to sell products but also a platform that gives companies the opportunity to collect data about customers and to evaluate the performance of a website or campaign very quickly. Automatic performance tests and the evaluation of customer behavior can support marketing and business decisions (BURBY; ATCHISON, 2007). Consumers often use search engines and keywords to find products online. For marketing strategists keyword marketing has become a main strategy, and businesses often spend their marketing budgets on getting as many potential customers to their websites as possible. However, not every visitor will make a purchase decision, and the keyword is only a doorway that does not always match the visitor’s wishes and needs. Data Mining can help to sort the visitors into groups of interest (BURBY; ATCHISON, 2007). To get a better overview of the website, its content has to be grouped. According to Huang (2007), there are four basic steps in an online shopping process: product information site, click-forward links, basket link, and the purchase. The model can be modified for individual needs, since not all shopping processes need all of these steps.

2.3 Knowledge Discovery with Web Mining

2.3.1 Data Mining and knowledge discovery

The enormous amount of collected data creates the need for processes that deliver useful information to support decision making (ZAIANE, 1999). The amount of data alone does not create more information or knowledge by itself. It is necessary to find techniques that help to produce useful information from the collected data (OBERLE, 2000). Data Mining and Knowledge Discovery are often used as synonyms, but Data Mining is only one part of the Knowledge Discovery process (ZAIANE, 1999). Knowledge Discovery in Databases (KDD) is a process that uses Data Mining to extract useful information from a database.

2.3.2 Data collection, saving and filtering

Collecting relevant data is the key issue, and common errors occur when tracking visitors on a website. Returning visitors can be falsely counted as new visitors because the cookies in the browser have been deleted. Employees of the company often use the website and might be counted as visitors or potential customers. Web crawlers, e.g. those of search engines, visit websites regularly and produce traffic that is not related to real consumer traffic. Tracking tools can get confused and deliver wrong results when the website structure is implemented inconsistently. Further, the web analytics tool could filter the wrong data, so that important data is lost or less important data is saved (BURBY; ATCHISON, 2007). All these common distortions in web traffic must be taken into account when collecting and filtering traffic data. Once the data is gathered, it needs to be filtered and written to databases or Data Warehouses. Data Warehouses are often used in the Data Mining process but are not mandatory (ZAIANE, 1999).
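As an illustration of the filtering step, the following minimal sketch removes crawler and employee traffic from a list of hits. The crawler markers, the internal network range, and the field names `ip` and `user_agent` are hypothetical; the paper does not specify how such traffic was identified.

```python
from ipaddress import ip_address, ip_network

# Hypothetical values: the paper lists neither crawler names nor internal networks.
CRAWLER_MARKERS = ("bot", "crawler", "spider")
INTERNAL_NETWORKS = (ip_network("10.0.0.0/8"),)


def is_consumer_hit(hit):
    """Return True if the hit is neither a crawler nor an internal (employee) request."""
    agent = hit["user_agent"].lower()
    if any(marker in agent for marker in CRAWLER_MARKERS):
        return False
    address = ip_address(hit["ip"])
    return not any(address in network for network in INTERNAL_NETWORKS)


# Made-up sample hits for illustration only.
hits = [
    {"ip": "203.0.113.7", "user_agent": "Mozilla/4.08 [en] (Win98; I ;Nav)"},
    {"ip": "203.0.113.9", "user_agent": "ExampleBot/1.0"},
    {"ip": "10.1.2.3", "user_agent": "Mozilla/4.08 [en] (Win98; I ;Nav)"},
]
consumer_hits = [hit for hit in hits if is_consumer_hit(hit)]
```

Deleted cookies cannot be corrected this way; that distortion has to be kept in mind when interpreting the share of new visitors.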

2.3.4 Web Mining

Web Mining techniques can be separated into three domains. The first domain is Web Content Mining, which explores unstructured data resources like texts, audio, video, and hyperlinks (ZAIANE, 1999; ONODA, 2006). According to Zidrina and Raudys (2005), web content mining is, for example, the identification of the density of special words or word groups. Every crawler of a full-text search engine uses this technique to capture web content and evaluate each tag, especially the links of a web page, in order to rank the relevance of the content in its database. The second domain is Web Structure Mining, which searches for domains that link to a specific domain (ZAIANE, 1999). These techniques can be useful for exploring rankings within a search engine or the interconnections of a website within the Internet. The third domain is Web Usage Mining, which explores the visitors’ behavior (ZIDRINA; RAUDYS, 2005). With each click on a website a visitor makes a request for a web page. Web servers automatically record each request and save a log file record. This record stores the time, the IP address, the operating system, and other technical or regional information. It also delivers information about technical errors and the usability of a website. Therefore, an analyst can estimate the degree of customer satisfaction with the methods of Web Usage Mining.
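The keyword density mentioned for Web Content Mining can be sketched as the share of words on a page that match a given keyword. This is one simple reading of "density"; the function name, the tokenization, and the sample sentence are assumptions made for illustration, not the method of a particular search engine.

```python
import re


def keyword_density(text, keyword):
    """Share of words in `text` equal to `keyword` (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


# Made-up page text for illustration only.
sample = ("Software engineering course: a distance learning course "
          "in software engineering")
density = keyword_density(sample, "course")  # 2 of 10 words
```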

2.3.5 Web Usage Mining

Web Usage Mining is a concept that uses different analytical techniques with the aim of producing new knowledge. In Web Usage Mining, a server automatically collects various information on visitors and visits, which is later examined with data mining techniques. A KDD process can, for example, collect the user IP, session ID, timestamps, and browser type and version, among others. With this data, especially the timestamps and the visited pages, click stream analysis makes it possible to describe a series of clicks through the website and to record the time a user spent on each page (LEE et al., 2001). Web Usage Mining stores the user traffic in a database or data warehouse that can be analyzed with rules and knowledge sets. The resulting information can be used to produce knowledge for an effective Internet marketing plan (KWAN et al., 2005). The process of Web Usage Mining consists of the following steps:

Data catching

Data collection is the foundation of web analytics, and it is important to choose the right method for it. The first method used in web analytics tools was log file analysis (Figure 2.2). In recent years JavaScript, web beacons, and sniffers were added as methods for analyzing web traffic. JavaScript is the most widely used tool to collect usage data, with sniffers and web beacons being special tools based on JavaScript. All data collection methods have their advantages and disadvantages, and hybrid solutions are often used to offset the disadvantages. In summary, there are mainly three methods of collecting traffic data: log files from the web server (logs), page tags (tags), and hybrid solutions (logs + tags), the latter being the most complete. When the data is collected in the form of web log data, it has to be prepared for further analysis (KAUSHIK, 2007; CARNEIRO, 2006).

198.102.031.003 - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.ak-web4u.de/index.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

Figure 2.2. Example of the cached web log data
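A log record like the one in Figure 2.2 can be split into its fields with a regular expression. The sketch below assumes that the fields are separated by single spaces, as in common Apache-style access logs; the pattern and the field names are illustrative and are not taken from the tools used in this study.

```python
import re

# Assumed layout: ip, one placeholder field, [timestamp], "request",
# status, size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


line = ('198.102.031.003 - [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.ak-web4u.de/index.html" '
        '"Mozilla/4.08 [en] (Win98; I ;Nav)"')
record = parse_log_line(line)
```

Lines that do not match the assumed layout return `None`, so malformed records can be counted and inspected instead of being silently dropped.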

Preparing and transforming data

The web log data is captured in the format described above but cannot be examined directly. According to Zidrina and Raudys (2005), web log data is not always clean and includes useless data resulting from the common errors described above, e.g. crawler information or employee visits. Therefore, the data needs to be cleaned and transferred into a format that the data mining software can interpret. The analyst has to identify the relevant session IDs and the connected data and write them to a database. Then, depending on the case and the kind of web page, a suitable data mining technique can be selected in order to identify hidden information.
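Once the data of a session is cleaned, the click stream analysis described in this section can derive the time spent on each page from consecutive timestamps. The following is a minimal sketch under the assumption that the clicks of one session are available as time-ordered (timestamp, page) pairs; the sample timestamps are made up.

```python
from datetime import datetime


def time_per_page(clicks):
    """clicks: list of (timestamp, page) for one session, sorted by time.

    Returns (page, seconds) pairs for every page except the last one,
    whose duration is unknown because no later request exists.
    """
    durations = []
    for (start, page), (end, _next_page) in zip(clicks, clicks[1:]):
        durations.append((page, (end - start).total_seconds()))
    return durations


# Made-up session for illustration only.
session = [
    (datetime(2009, 7, 7, 13, 55, 36), "Home"),
    (datetime(2009, 7, 7, 13, 56, 6), "Course"),
    (datetime(2009, 7, 7, 13, 58, 6), "Matriculation"),
]
durations = time_per_page(session)
```

The time on the final page of a visit cannot be measured from request timestamps alone, which is why the sketch leaves it out rather than guessing a value.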

3. METHODOLOGY

3.1 Type of research

This study is applied in nature, and the problem addressed is qualitative with exploratory objectives. The case study examines secondary user data from website interactions. The model and theoretical background are based on bibliographic and documentary research. In this case study, 1009 website visits that occurred between July 7th and October 1st, 2009 are monitored and evaluated. The object of this research is a distance learning course offered by UFLA (Federal University of Lavras, MG, Brazil). The observed website is a content site that provides information about the software engineering course (http://www.nte.ufla.br/esl/wp/) (see Figure 4.1).


“TrackMine” is used as the monitoring tool that collects and filters the data and observes each click on the connected web pages of the course site “ESL”. “TrackMine” is an analytics software package developed by the Computer Science Department at UFLA (http://tm-licesa.dcc.ufla.br/trackmine), see Figure 3.1. The statistical software package SPSS is used to analyze the collected data and to find hidden information.

Figure 3.1. Screenshot of Track4Mine web analytics tool.

3.2. Process and methodology (KDD and Mining)

According to Zaiane (1999), Oberle (2000), and Song and Zahedi (2005), the process of KDD can be divided into four steps: (1) the collecting, cleaning, and filtering of the data, (2) the transformation of the data, (3) the selection of a data mining technique, and (4) the reporting and learning from the results.

Data collection, cleaning and filtering

The web analytics tool “TrackMine” uses JavaScript tags that are embedded in the pages of the observed website. With each click the visitor sends an HTTP request that creates an entry in the log file with the session ID. Once the data is collected, it is cleaned, filtered, and written into a database, with irrelevant data being eliminated (ZAIANE, 1999).

Data transforming

The collected and cleaned website data needs to be transformed into a separate database that can be read and explored in the SPSS statistical analysis software. A special algorithm searches for log entries with the same IP and the same session ID. All of these entries are transformed into one database entry. For each row (one visit of one client) the whole web path is saved. The screen resolution, the operating system, the keyword, and the web browser of the client are also saved. For each client the number of clicks and the number of visits to each page of the website are counted.
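The transformation step can be sketched as grouping log entries by IP and session ID and writing one row per visit. The "special algorithm" itself is not given in the paper, so the sketch below is only one possible reading; the field names, the fixed path length of six steps, and the sample entries are assumptions.

```python
from collections import Counter, defaultdict

MAX_STEPS = 6  # assumed number of path columns per visit row


def build_visit_rows(entries):
    """entries: iterable of (ip, session_id, page_id) in click order.

    Returns one row per (ip, session_id) with a zero-padded click path,
    the number of clicks, and the number of visits to each page.
    """
    paths = defaultdict(list)
    for ip, session_id, page_id in entries:
        paths[(ip, session_id)].append(page_id)

    rows = []
    for (ip, session_id), path in paths.items():
        padded = (path + [0] * MAX_STEPS)[:MAX_STEPS]  # 0 marks "left the site"
        rows.append({
            "ip": ip,
            "session_id": session_id,
            "path": padded,
            "clicks": len(path),
            "page_counts": dict(Counter(path)),
        })
    return rows


# Made-up log entries for illustration only.
entries = [
    ("203.0.113.7", "s1", 25),
    ("203.0.113.8", "s2", 40),
    ("203.0.113.7", "s1", 33),
    ("203.0.113.7", "s1", 25),
]
rows = build_visit_rows(entries)
```

Attributes such as screen resolution, operating system, keyword, and browser could be added to each row in the same way; they are omitted here to keep the sketch short.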


Data Mining

The data analyses are carried out with the SPSS software. SPSS is a software package from IBM that offers many analytic methods to inform and direct decisions. With this tool, many analyses that yield further information can be carried out easily.

Reporting and learning

Once the mining is done, a report has to be written in which the analyst interprets and learns from the analyses. This may be the most important part for further analyses. Once the report is written, the whole process starts again: the analyst collects data, analyzes it, reports, and learns from the report.

4. RESULTS AND DISCUSSION

4.1 The distance learning course “ESL”

The observed website is a content page that informs the visitor about the software engineering distance learning course offered by UFLA. The website is a dynamic web page that uses the content management system WordPress. WordPress uses PHP as its programming language and generates the website dynamically with each click. The website includes three areas for links (Figure 4.1). The first and second areas are monitored. The first area includes the links Home, Methodology, Course, Objects, Discipline, and Professor. The second area includes Matriculation and Contact. The third area includes external links, which are not monitored. Second-level links are also not monitored because they are not often used. All link areas appear on every page, so the visitor can click on the monitored links on any page during the visit. The user can leave the website at any time or use the external links in the third area.

Figure 4.1. Screenshot of Website for Software Engineering Course.


4.2 Web Mining of the “ESL” website

4.2.1 Association rules to find the site path

Every web page of the “ESL” site is linked to every other page. When a visitor enters the website he is monitored, and each time he clicks an entry is saved in the database. For each visitor the complete web path is recorded. For example, User 1 started on landing page 40, which is the matriculation page, and then left the website (Table 4.1). The fourth user entered on the same page and then clicked in the following sequence: Course -> Methodology -> Objects -> Discipline -> Professor. A zero as the next click means that the user left the website. The data for each visit to the website is collected and saved in the database in this form.

Table 4.1. User navigation path (Source: database of “TrackMine”).

User     PN1  PN2  PN3  PN4  PN5  PN6
User 1    40    0    0    0    0    0
User 2    25    0    0    0    0    0
User 3    25    0    0    0    0    0
User 4    40   33   36   34   35   42
User 5    25   33   25   33   40   33

The navigation through the website is analyzed using association rules. Each user takes a particular click path (Table 4.1), which is transferred into a matrix (Table 4.2). For example, User 5 used the click path 25 (Home) -> 33 (Course) -> 25 (Home) -> 33 (Course) -> 40 (Matriculation) -> 33 (Course). Each user activity is saved in the matrix following this example: User 5 comes from outside (Start) and enters the homepage, adding 1 to the column 'Home' in row 'Start'; then he goes from 'Home' to 'Course', adding 1 to the column 'Course' in row 'Home', and so on until the click path is finished. When all user paths are saved in the matrix, each cell is divided by the total of its row, resulting in the percentage, or chance, of clicking on a link (column) when on a specific page (row). For example, 65% of users enter on the page 'Home', and from there 19% visit the page 'Matriculation'. Through this matrix the click path of the average user can be estimated. To explore the user navigation in a visual form, the matrix can be transformed into a graph (Figure 4.2). Decisions with less than 10% are omitted in this figure to reduce the complexity of the graph. This overview can help to redesign the website in order to guide users along the paths that the owner intends. Figure 4.2 shows that there are mainly two landing pages for the website, the 'Home' page (65%) and the 'Matriculation' page (27%). The 'Home' page is an intended landing page, while 'Matriculation' is not intended to be a landing page. Few people stay on the 'Matriculation' page and register; most start to search for more information. From 'Matriculation', 19% go to the 'Object' page and 31% to the 'Course' page first. 15% leave the website from 'Matriculation', and it is not measured how many of them left without matriculating.
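The construction of the matrix described above can be sketched as follows: count each transition between consecutive pages, including the entry from 'Start' and the exit to 'End', then divide each cell by its row total. The sketch uses the five click paths of Table 4.1 as input and assumes that the end of each recorded path counts as leaving the website; the function and variable names are illustrative.

```python
from collections import defaultdict


def transition_matrix(paths):
    """paths: list of page-ID lists, one per visit.

    Returns {from_page: {to_page: share}}, where the shares of each
    row sum to 1.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        steps = ["Start"] + list(path) + ["End"]
        for src, dst in zip(steps, steps[1:]):
            counts[src][dst] += 1

    matrix = {}
    for src, row in counts.items():
        total = sum(row.values())
        matrix[src] = {dst: n / total for dst, n in row.items()}
    return matrix


# Click paths of the five users in Table 4.1 (zeros removed).
paths = [
    [40],
    [25],
    [25],
    [40, 33, 36, 34, 35, 42],
    [25, 33, 25, 33, 40, 33],
]
matrix = transition_matrix(paths)
# matrix["Start"][25] is the share of visits that land on page 25 (Home).
```

With only five visits the shares are coarse; the values in Table 4.2 result from applying the same counting to all monitored visits.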


Here, 60% of users click from the 'Object' page to the 'Contact' page and 46% of users click from 'Methodology' to 'Contact'. This could imply that the information provided on the 'Methodology' and 'Object' pages is not sufficient. The e-mails sent via the 'Contact' page should be evaluated to determine whether the information on the methodology or the object of the course was insufficient. It is also notable that a large percentage of users left the website from the pages 'Object' and 'Discipline'. Either these visitors left because they were well informed, or they were no longer interested after reading the provided information.

Table 4.2. Matrix with association rules to find the navigation path.

Page                         Start  Home  Meth.  Course  Contact  Object  Disc.  Prof.  Matr.  End
Start                        0.00   0.65  0.01   0.02    0.00     0.04    0.00   0.00   0.27   0.00
Home                         0.00   0.12  0.05   0.32    0.03     0.20    0.03   0.02   0.19   0.03
Methodology                  0.00   0.08  0.01   0.16    0.46     0.19    0.04   0.01   0.04   0.02
Course                       0.00   0.18  0.39   0.02    0.09     0.18    0.03   0.01   0.08   0.02
Contact                      0.00   0.08  0.12   0.06    0.00     0.60    0.08   0.00   0.02   0.03
Object                       0.00   0.10  0.04   0.08    0.06     0.06    0.18   0.00   0.07   0.42
Discipline                   0.00   0.21  0.04   0.08    0.04     0.16    0.01   0.01   0.01   0.45
Professor                    0.00   0.20  0.03   0.18    0.05     0.07    0.13   0.02   0.17   0.15
Matriculation promotion pg.  0.00   0.13  0.03   0.31    0.01     0.19    0.02   0.09   0.18   0.04
End                          0.00   0.00  0.00   0.00    0.00     0.00    0.00   0.00   0.00   0.00

Figure 4.2. User decision.
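The reduction used for the user decision graph, where decisions with less than 10% are omitted, can be sketched as a simple threshold filter over the matrix. The two rows below are copied from the 'Start' and 'Home' rows of the matrix (zero cells left out); the function name and the data layout are assumptions made for illustration.

```python
THRESHOLD = 0.10  # decisions below 10% are omitted from the graph

# 'Start' and 'Home' rows of the navigation matrix, zero cells left out.
matrix = {
    "Start": {"Home": 0.65, "Methodology": 0.01, "Course": 0.02,
              "Object": 0.04, "Matriculation": 0.27},
    "Home": {"Home": 0.12, "Methodology": 0.05, "Course": 0.32,
             "Contact": 0.03, "Object": 0.20, "Discipline": 0.03,
             "Professor": 0.02, "Matriculation": 0.19, "End": 0.03},
}


def strong_edges(matrix, threshold=THRESHOLD):
    """Return (from_page, to_page, share) for all shares >= threshold,
    sorted from the most to the least frequent decision."""
    edges = [
        (src, dst, share)
        for src, row in matrix.items()
        for dst, share in row.items()
        if share >= threshold
    ]
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


edges = strong_edges(matrix)
```

The resulting edge list can be passed to any graph drawing tool; only the filtering is shown here.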


5. CONCLUSION AND FURTHER RESEARCH

The described research provides valuable information on how to improve the marketing strategy for the website. A small minority of visitors used search engines to find the website, and a large majority of users left the website after the first click. The website should first be optimized for search engines, and an AdWords marketing campaign could be initiated to generate more visitors from search engines. Furthermore, an e-mail marketing campaign could increase the direct hits on the intended doorway page. Such a campaign could be sent to former students of the University or of other universities. To estimate the success, different landing pages could be used in the e-mail and AdWords campaigns. If a redesign including search engine optimization is initiated, the website has to be examined with a Data Mining Context algorithm. The goal would be to find out the keyword density and the search ranking of the page. The ranking is a product of the strengths of the links that are connected to the page. Considering the high number of visitors that left quickly, it is recommended that the back links of the page be checked to find out which of them help to improve the performance.

Acknowledgements

To FAPEMIG and CNPq for supporting this work.

REFERENCES

KAUSHIK, A. Web Analítica Uma Hora por Dia. Wiley Sybex: Rio de Janeiro, 2007.
BURBY, J.; ATCHISON, S. Web Analytics: Using Data To Make Smart Business Decisions. 1.0, John Wiley & Sons Pub, 2007.
CARLOS, A.; VARGAS, R. A utilidade da pesquisa do consumidor. In: Encontro de Marketing da ANPAD, 2008.
CHAUVEL, M. A. A satisfação do consumidor no pensamento de marketing: revisão de literatura. In: ENANPAD, 23, 1999.
GAJDZIK, J. Data Mining: Methoden und Algorithmen. ETH Zürich, 2002.
HUANG, X. Comparison of interest measures for web usage mining: An empirical study. World Science, vol. 6, pp. 15-41, 2007.
HEINRICHS, J. H.; LIM, J.-S. Integrating web-based data mining tools with business models for knowledge management. Decision Support Systems, vol. 35, pp. 103-112, 2003.
NGAI, E. W. T. Application of data mining techniques in customer relationship management. ScienceDirect, vol. 36, pp. 2592-2602, 2009.
KWAN, S. Y.; FONG, J.; WONG, H. K. An e-customer behavior model with online analytical mining for Internet marketing planning. Decision Support Systems, vol. 41, pp. 189-204, 2005.
LEE, J.; PODLASECK, M.; SCHONBERG, E.; HOCH, R. Visualization and Analysis of Clickstream Data of Online Stores for Understanding Web Merchandising. Data Mining and Knowledge Discovery, vol. 5, pp. 59-84, 2001.
LIRA, M. A.; MARCHETTI, R. Análise e segmentação do mercado consumidor de farmácias e drogarias. ENANPAD, 2006.
SONG, J.; ZAHEDI, F. M. A theoretical approach to web design in e-Commerce: A belief reinforcement model. Management Science, vol. 51, pp. 1219-1235, 2005.
ASH, T. Landing page optimization: Definitive Guide to Testing and Tuning for Conversions. January 29, 2008.
TUG, E. Automatic discovery of the sequential accesses from web log data files via genetic algorithm. Science Direct, vol. 19, pp. 180-186, 2003.
VIEIRA, A. V. Fazendo uma revisão nas áreas de influência no comportamento do consumidor. Universidade Paranese: Belém, 2002.
ZIDRINA, P.; RAUDYS, A. A process of knowledge discovery from web log data. SpringerScience, vol. 28, pp. 79-104, 2005.
