DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS, STOCKHOLM, SWEDEN 2017

PDF document search within a very large database

LIZHONG WANG

KTH SKOLAN FÖR INFORMATIONS- OCH KOMMUNIKATIONSTEKNIK

Abstract

A digital search engine, which takes a search request from a user and returns a result that responds to that request, is indispensable for modern users who are accustomed to the Internet. At the same time, the PDF document format has been adopted by more and more people and is widely used today because of its convenience and effectiveness, and the traditional library has already started to be replaced by its digital counterpart. Combining these two factors, a document-based search engine that can query a digital document database with an input file is urgently needed. This thesis is a software development project that aims to design and implement a prototype of such a search engine and to propose potential optimization methods for Loredge. The research can be divided into two main parts: prototype development and optimization analysis. It involves an analytical study of sample documents provided by Loredge and a multi-perspective performance analysis. The prototype contains reading, preprocessing and similarity measurement. The reading part reads a PDF file using the imported library Apache PDFBox. The preprocessing part processes the read document and generates a document fingerprint. The similarity measurement is the final stage, which measures the similarity between the input fingerprint and all document fingerprints in the database. The optimization analysis balances resource consumption in terms of response time, accuracy rate and memory consumption. According to the performance analysis, the shorter the document fingerprint is, the better the search program performs. Moreover, a permanent feature database and a similarity-based filtration mechanism are proposed to further optimize the program. This project lays a foundation for further study of document-based search engines by providing a feasible prototype and relevant experimental data. The study concludes that future work should mainly focus on improving the effectiveness of database access, which involves data entry labeling and search algorithm optimization.

Keywords: Portable Document Format, Search, Document Identification, Cosine Similarity, Document Preprocessing, Document Search, Optimization Method, Performance Analysis, Classification, Regression, Loredge.


Abstrakt

Digital sökmotor, som tar en sökfråga från användaren och sedan returnerar ett resultat som svarar på den begäran tillbaka till användaren, är oumbärligt för moderna människor som brukar surfa på Internet. Å andra sidan, det digitala dokumentets format PDF accepteras av fler och fler människor, och det används i stor utsträckning i denna tidsålder på grund av bekvämlighet och effektivitet. Det följer att det traditionella biblioteket redan har börjat bytas ut av det digitala biblioteket. När dessa två faktorer kombineras, framgår det att det brådskande behövs en dokumentbaserad sökmotor, som har förmåga att fråga en digital databas om en viss fil. Den här uppsatsen är en mjukvaruutveckling som syftar till att designa och implementera en prototyp av en sådan sökmotor, och föreslå relevant optimeringsmetod för Loredge. Den här undersökningen kan huvudsakligen delas in i två kategorier, prototyputveckling och optimeringsanalys. Arbeten involverar en analytisk forskning om exempeldokument som kommer från Loredge och en prestandaanalys utifrån flera perspektiv. Prototypen innehåller läsning, förbehandling och likhetsmätning. Läsningsdelen läser in en PDF-fil med hjälp av en importerad Java bibliotek, Apache PDFBox. Förbehandlingsdelen bearbetar det inlästa dokumentet och genererar ett dokumentfingeravtryck. Likhetsmätningen är det sista steget, som mäter likheten mellan det inlästa fingeravtrycket och fingeravtryck av alla dokument i Loredge databas. Målet med optimeringsanalysen är att balansera resursförbrukningen, som involverar responstid, noggrannhet och minnesförbrukning. Ju kortare ett dokuments fingeravtryck är, desto bättre prestanda visar sökprogram enligt resultat av prestandaanalysen. Dessutom föreslås en permanent databas med fingeravtryck, och en likhetsbaserad filtreringsmekanism för att ytterligare optimera sökprogrammet. Det här projektet har lagt en solid grund för vidare studier om dokumentbaserad sökmotorn, genom att tillhandahålla en genomförbar prototyp och tillräckligt relevanta experimentella data. Den här studie visar att kommande forskning bör huvudsakligen inriktas på att förbättra effektivitet i databasåtkomsten, vilken innefattar data märkning och optimering av sökalgoritm.

Nyckelord: Portable Document Format, Sökning, Dokument Identifiering, Cosine Similarity, Dokument Förbehandling, Dokument Sökning, Optimering metod, Prestandaanalys, Klassificering, Regression, Loredge


Contents

Chapter 1 Introduction
1.1 Background
1.2 Problem Statement
1.3 Purpose
1.4 Approach
1.5 Limitations
1.6 Delimitations
1.7 Thesis Outline

Chapter 2 Background
2.1 Portable Document Format
2.2 Article
2.3 Big Data and Database
2.4 Machine Learning
2.5 Apache PDFBox Java Library

Chapter 3 Methodology
3.1 Research Strategy
3.2 Understanding the Client Requirement
3.3 Data Collection
3.4 Prototype Development
3.5 Evaluation Method

Chapter 4 Requirement and Data Analysis
4.1 Client Requirements
4.2 Data Analysis

Chapter 5 Prototype Analysis
5.1 Prototype Development Result
5.2 Document Fingerprint Analysis
5.3 Prototype Environment Analysis
5.4 Accuracy Analysis
5.5 Response Time Analysis
5.6 Memory Consumption Analysis
5.7 Similarity Analysis

Chapter 6 Discussion
6.1 Methodology and Consequence
6.2 Development Environment Discussion
6.3 Problem Statement Reiteration
6.4 Summative Evaluation

References


Chapter 1 Introduction

With the exponential growth in the development and usage of digital libraries, more and more documents are not only published but also stored on the Internet instead of in traditional media. A critical and complex task is to build a digital database search engine that can use a document file as a search statement to query a very large database and return the file that has the highest similarity to the input. This thesis presents the task of developing such a document based search engine, analyzing execution results under different arguments and putting forward proposals aimed at optimizing search performance. The performance requirement on the search engine is short response time and low workload combined with high accuracy and scalability. Since the Portable Document Format (PDF), released as an open standard by the International Organization for Standardization, is one of the most widely used digital document formats, all input files in this thesis are in PDF format. The remaining sections of this chapter briefly introduce the motivation of the research, describe and define the objective and focus area of the work, and then declare both the delimitations and limitations of the research.

1.1 Background

1.1.1 Loredge Background

Loredge is a startup company founded in October 2015. It aims to develop a digital platform that helps people manage knowledge more effectively. One of the four founders, Anna Abelin, describes the original purpose behind Loredge: “the quantity of information is increasing at a mind-blowing speed, it becomes more and more important to be able to decide what information is useful.” [1] The whole Loredge application consists of three components: a platform application installed on the client side, a server that handles all data transfer and provides database services to the various clients, and a database that statically stores all related user information and document files. The Loredge database is very large, and it is designed to accept and store only document files in the Portable Document Format. The PDF files can be uploaded or shared by either users or administrators. One of the key functions supported by Loredge is PDF Annotation Sharing. Loredge holds the original PDF file in its database and generates an editable copy when users want to work on the original file, for example to add comments and highlights. Moreover, the copy can be shared among a group of users, while it is only visible and editable for the users who have permission to access it. In order to turn Loredge into a global open knowledge platform, where people can both manage their own knowledge and visit other people’s knowledge, the company is going to seek cooperation with more individuals and organizations.

1.1.2 Problem Background

Each time a user opens a PDF file in the Loredge platform, the server checks whether the file is already in its database by measuring the document's similarity to all existing files in the database. If the highest measured similarity is below a certain value, the platform decides that the file is uploaded for the first time and automatically stores it in the database; otherwise the platform returns the file with the highest similarity found in the database to the user, and the user decides either to use the existing version or to upload and share a new file. This pre-matching process is designed to reduce data redundancy in the database by avoiding, as far as possible, saving identical or over-similar files, that is, documents with identical or over-similar contents in either the same or a different format. The key algorithm in the pre-matching process is also applied in another important process, database creation. It extracts certain data from an input PDF file and uses these data to create a unique identifier representing the file. The generated identifier (also called a fingerprint) works as a primary key or index in the search process. It aims to reduce the search time in the pre-matching process, because comparing two keys is obviously easier and quicker than comparing two original files.
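To make the pre-matching decision concrete, the following Java sketch compares an input fingerprint against all stored fingerprints and either returns the best match or signals that the file should be stored as new. The Fingerprint interface, the similarityTo method and the threshold value are illustrative assumptions, not Loredge's actual implementation.

    import java.util.List;

    // Illustrative pre-matching sketch: find the most similar stored fingerprint
    // and decide whether the uploaded file is new or already in the database.
    public class PreMatcher {

        static final double THRESHOLD = 0.8; // assumed cut-off, not Loredge's real value

        // A fingerprint is assumed to expose a pairwise similarity in [0, 1].
        interface Fingerprint {
            double similarityTo(Fingerprint other);
            String documentId();
        }

        // Returns the id of the best match, or null if the file should be stored as new.
        static String preMatch(Fingerprint input, List<Fingerprint> database) {
            Fingerprint best = null;
            double bestScore = 0.0;
            for (Fingerprint candidate : database) {
                double score = input.similarityTo(candidate);
                if (score > bestScore) {
                    bestScore = score;
                    best = candidate;
                }
            }
            // Below the threshold the document is treated as uploaded for the first time.
            return (best != null && bestScore >= THRESHOLD) ? best.documentId() : null;
        }
    }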


1.2 Problem Statement

This project aims to solve the problem by answering the following underlying questions:

How can the file with the highest document similarity to an input PDF document be found in a very large document database?

In what way can the developed file search method be improved?

The first objective focuses on developing a program that measures the document similarity between two PDF documents. It mainly contains two phases: document reading and document comparison. The second objective is an optimization that improves the developed program from several perspectives. For such a search engine, accuracy and response time are considered the two most important factors that decide the performance.

1.3 Purpose

The purpose of the research is to propose and develop a framework for the document-based search engine used in the Loredge platform. Relevant and feasible optimization methods should also be studied in order to provide sufficient experimental data for Loredge. The study touches on several scientific areas: the first stage requires capability in document analysis and program development, and the optimization stage needs techniques from analytic geometry and machine learning. The experience gained from this project lays a foundation for further software development.

1.4 Approach

The whole project is carried out in a traditional waterfall model. First, the problem statement is clearly understood. Then relevant research data are collected and analyzed in order to gain a better understanding of the research subject. The prototype development is then performed in an incremental model, following a design-implementation-test cycle. Afterwards, the developed prototype is analyzed and evaluated from several angles. Lastly, the whole project is evaluated by applying both formative and summative evaluation. Both quantitative/qualitative research and inductive/deductive approaches are applied to achieve different purposes.


1.5 Limitations

The project has several aspects that cannot be controlled. They are restrictions on the research and affect the methodology, design and conclusions to a greater or lesser extent.

• Sampling data. The sample provided by Loredge consists of two hundred documents (100 input files + 100 database files). All analysis and optimization in the research are based entirely on the results for these sample documents. An overfitting problem may occur, such that the final solution does not generalize well, which means that the program may perform poorly on new input data.

• Environment. The developed program is executed only on a personal laptop computer, so all execution results that are analyzed and discussed are based on one system environment. Other system environments, both hardware and software, may give different results and solutions.

• Personal knowledge. The research and its solution are also limited by the author's knowledge and experience in program development.

1.6 Delimitations

The delimitations set boundaries for the dissertation. They aim to make the research goals and objectives more achievable, so that the research does not become unnecessarily and impossibly large to complete.

• Document format. The implemented program uses the Portable Document Format as the only input document format; in other words, formats other than PDF do not work in the current version of the search engine. The reason is that PDF is the most widely and commonly used document format. Furthermore, the project focuses mainly on the text content of the PDF files, since it is very difficult to develop an effective but simple method to compare multimedia and images.

• Database. This study creates a simple database where each data entry is a plain txt file. There are many capable database systems that can be downloaded and used for free, but it is difficult to learn and understand a new piece of software in a short time.

• Programming language. Java is the main development language in this project. The program imports an open source Java library in order to work with PDF documents. There are several candidate programming languages, but most of them can express the same solution, for example an implementation in Java can be ported to C, C++ or Python. The research simply chooses the language best suited to the development.


1.7 Thesis Outline

This thesis consists of the following main chapters.

Chapter 2. Background. This chapter introduces the background of the problem and the thesis. The reader obtains the background information and related knowledge needed to understand the thesis.

Chapter 3. Methodology. This chapter summarizes all research methodologies applied in the thesis. All research strategies in the project workflow are introduced in detail. Moreover, the evaluation methods are described together with several relevant goals.

Chapter 4. Requirement and Data Analysis. The client requirements and the sample data are introduced and analyzed in this chapter. In terms of the client requirements, both functional and non-functional requirements are specified according to the understanding of the requirements. The sample data and the collected data are presented separately, and a deeper analysis of the data is conducted.

Chapter 5. Prototype Analysis. This chapter presents the execution results and analyzes the developed prototype from several different perspectives. The analysis not only determines the performance of the current version but also provides the optimization methods.

Chapter 6. Discussion and Evaluation. The last chapter discusses and evaluates the whole project by answering the problem stated in Chapter 1 and applying the evaluation methods described in Chapter 3.



Chapter 2 Background

This chapter introduces the background of the thesis. Basic theory of the PDF document format and of databases is given in order to understand the project. Several machine learning approaches are also described briefly.

2.1 Portable Document Format

The Portable Document Format (PDF) was invented by Adobe Systems Incorporated. It is a digital file format used for presenting and exchanging documents independently of application software, hardware and operating system. [2]

2.1.1 Content of a PDF file

Each PDF file encapsulates a flat document that can include the different elements needed to display it, such as text and images. The first version of PDF supported only plain text formatting and inline images, but more and more content types have been supported in recent years: not only text and images, but also fonts, diagrams, tables and hyperlinks, and even multimedia can be found in a PDF file. [4]

2.1.2 Advantages in using PDF files

The greatest merit of PDF is that it is easy for everyone to create and read. Once a PDF file is created, it strictly keeps its fixed content and layout; the same PDF file always displays exactly the same things no matter which software, device or operating system it is opened on. PDF uses little space on devices, because it can compress a large amount of information into a small file size without degrading the quality of the images, which makes document organization and arrangement a lot easier. There are different security options in the PDF settings that can be customized by the user to protect PDF files against different security issues. Because of these merits, PDF is more and more widely used, especially after the great rise of the World Wide Web and the HyperText Markup Language (HTML) document. In 2008, PDF was officially released as an open standard published by the International Organization for Standardization [3].


2.1.3 Disadvantages and Difficulties in PDF file handling

There are two different types of PDF files [5]. The first is the native PDF, which is made from an electronic source. The second is the scanned PDF, which is made by scanning a physical paper document with a scanner. A native PDF relies on electronic character designations, so users can directly select, search and copy text in such a text-based PDF. A scanned PDF, however, treats the whole scanned document as an image; if users want to perform such operations, they must first use Optical Character Recognition [6] to convert the images into machine-encoded text. Because the layout of a PDF is freely defined by the user, there is no good rule that lets a machine handle the different kinds of content (text, diagrams, tables and so on) separately; for example, it cannot identify whether a graphic is an image or a diagram, and for a machine there is no significant difference between body text and a legend.

2.2 Article

An article can be defined in the following way: “A written composition in prose, usually nonfiction, on a specific topic, forming an independent part of a book or other publication, as a newspaper or magazine.” [7]

2.2.1 Publication

When the Internet was not widely used, most articles were published through traditional media, such as newspapers, TV and radio. Along with the development of technology, the Internet has emerged as a new medium and has played an important role in the media market in recent years. Because of the merits of the Internet, more and more articles are published directly on the new media instead of the legacy media; at the very least, there will be a copy of the article on the Internet.

2.2.2 Display and Storage on the Internet

Digital articles on the Internet are usually displayed and stored in two ways. Most articles are published directly on web pages, for example on news websites and blogs. The other way to read an article is to click a hyperlink that points to a file. This method is widely used when the uploaded article file has already been laid out by the author or publisher in a fixed format, for example an academic or scientific paper. Furthermore, the user can also use the hyperlink to download the file to a local disk. The uploaded files come in different document formats, but the most used format is PDF, the standard for document exchange.


2.3 Big Data and Database

2.3.1 Big Data Introduction

Big Data refers, as the term can be interpreted literally, to huge volumes of data. The history of Big Data as a special term in information technology is brief; there is no official document that determines when the term was first mentioned and used, but an industry-recognized answer is the 1990s. Entering the Information and Internet Age, data is increasingly created and presented in digital format. Data-storage capacity keeps increasing at a geometric rate, which makes data sets larger and more complex. Thus, it is extremely hard to process, manage and sort such data sets, manually or automatically, within a reasonable time.

2.3.2 Database and Fingerprint

A database is a collection of digital data; each data entry is an independent individual, and the entries can be organized by certain rules. Each entry in the database has a unique identification, which consists of keys that represent and identify the data entry. The fingerprint is generated manually, automatically or both when the data entry is stored or altered in the database. The greatest benefit of having a fingerprint is convenience and efficiency when querying a very large database; a high-quality fingerprint ensures the efficiency and availability of the search. There are some requirements that need to be satisfied when keys are generated from the data entry. In principle, the fingerprint must be unique so that the corresponding search result is exact and unambiguous. A non-redundant key that contains only necessary information can also save more data storage and system memory than a redundant one. A further improvement is to apply the basic concept of the Support Vector Machine [8] [9]. By using short and varied keys instead of long and complex keys to generate the fingerprint, each data entry can be treated as an independent point. These points are virtually transformed into a high-dimensional space where each key works as an axis of the coordinate system. This projection makes identification and separation much easier than usual. Comparing the positions of two data points can be done in a parallel way, so the execution time of the search program is reduced. In addition, using keys that best represent and define the original data is a good choice for database modification and maintenance.
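As an illustration of treating every key as one axis of a coordinate system, a document fingerprint can be stored as a sparse map from key to weight, so that each document becomes a point in a high-dimensional space. The Java class below is only a sketch under that assumption; it is not taken from the Loredge code base.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: a sparse fingerprint, i.e. a point in a high-dimensional space
    // where every distinct key (e.g. a word) is one axis of the coordinate system.
    public class SparseFingerprint {

        private final Map<String, Double> weights = new HashMap<>();

        // Increase the coordinate on the axis that corresponds to the given key.
        public void add(String key, double weight) {
            weights.merge(key, weight, Double::sum);
        }

        public double get(String key) {
            return weights.getOrDefault(key, 0.0);
        }

        public Map<String, Double> asMap() {
            return weights;
        }
    }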


2.3.3 Database Search Engine

A database search engine is a kind of search engine that operates on a digital database. The search engine applies certain computer algorithms to search the database and identify items that match the information given by the user. With ever-growing databases, people have higher expectations on search quality, and programmers face several challenges in search engine design. The three factors that determine the performance of a search engine are generally response time, accuracy rate and operational cost. In reality, it is almost impossible for a search engine to have the shortest response time, the highest accuracy rate and the lowest cost simultaneously; the goal is therefore to find a point that fulfills these conditions as far as possible at the same time. The factor with the greatest impact on the performance of the search program is speed, and program execution speed can be optimized in different ways. The simplest and most direct improvement is to apply a faster search algorithm. Many other optimization methods can be implemented. Reducing and removing redundant and unnecessary information from the original data keeps the program from checking useless data when it goes through the database. Pre-sorting and classifying the entries in the database can also simplify processing in the stages where data transformations and aggregations happen. Preprocessing and filtering the input data from the user is another effective method, for example treating words in singular and plural form as the same and ignoring common stop words in the search statement.
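The following Java sketch illustrates such input preprocessing with a hard-coded stop-word list and a deliberately naive singular/plural rule; a production system would use a proper stemmer and a fuller stop-word list, so the names and rules here are assumptions for illustration only.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of search-statement preprocessing: lower-casing, stop-word removal
    // and a very naive singular/plural normalization.
    public class QueryPreprocessor {

        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("a", "an", "the", "and", "or", "of", "to", "in"));

        public static List<String> preprocess(String query) {
            List<String> terms = new ArrayList<>();
            for (String token : query.toLowerCase().split("[^a-z]+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) {
                    continue; // drop empty tokens and common stop words
                }
                // naive plural handling: "documents" -> "document"
                if (token.endsWith("s") && token.length() > 3) {
                    token = token.substring(0, token.length() - 1);
                }
                terms.add(token);
            }
            return terms;
        }
    }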

2.4 Machine Learning

Machine Learning is a scientific discipline focused on the development of algorithms that spot patterns or make predictions from empirical data [12]. Machine learning research often overlaps with computational and mathematical statistics, which means that a task in machine learning is often to solve a mathematical optimization problem that selects the best or most suitable element from a set of available alternatives.

2.4.1 Supervised Machine Learning

Supervised machine learning infers a function from a dataset where all data are labeled, so that each training example is a pair of an input and a corresponding output. By analyzing the relationship between input and output data, supervised learning infers a function that maps from the input variable to the output variable.

Output = f(Input)

The learning performance depends directly on this mapping function, so the most important factor in the learning is to find the function that best represents the relationship between the input data and the output data.

2.4.1.1 Classification Analysis

Classification analysis is the task of identifying the category to which the input data belongs, which means that the output variable of the learning model is categorical and discrete. An algorithm that implements classification is often a mathematical function that maps input data to a category. Figure 2.1 is a simple example of a classification that separates red and blue points.

Figure 2.1 A simple classification example

The linear classifier is the most basic classification algorithm; it applies a linear combination of feature values to make a classification decision [13].

o = f(w · x) = f( Σ_i w_i x_i ), where o is the output, w is the weight vector and x is the input vector.


This learning seeks the best weights w_i, and there are two suitable algorithms for achieving that goal. Perceptron learning is an incremental learning rule whose weights only change when the output is wrong; it is guaranteed to converge when the problem can be solved, that is, when the data are linearly separable. The delta rule is another incremental learning rule, but its weights always change regardless of the output, and it will always arrive at an optimal solution to the problem. In order to avoid a separating hyperplane with poor generalization, the distances (also called margins) from the data points to the hyperplane can be maximized, because the hyperplane with the widest margin often generalizes best. Simplicity and efficiency are the major advantages of such a classifier; both training and validation are easy to implement. Because the linear classification model is quite simple, overfitting is not a problem that needs to be seriously considered. There are, however, several limitations to using linear classification. First of all, the classifier relies heavily on the distribution and structure of the data, and the classification model will be too simple and general to be useful when the number of samples is small. Linear classification is also poorly suited to handling linearly inseparable data. Finally, a linear classifier essentially works in the given low-dimensional feature space and provides no high-dimensional pattern recognition.
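The perceptron rule described above can be sketched in a few lines of Java; the learning rate value and the bias handling below are illustrative choices, not part of the thesis prototype.

    // Sketch of the perceptron learning rule: weights are only updated
    // when the predicted class is wrong.
    public class Perceptron {

        private final double[] w;      // weight vector, one weight per feature
        private double bias = 0.0;
        private final double lr = 0.1; // learning rate (illustrative value)

        public Perceptron(int features) {
            this.w = new double[features];
        }

        // Linear combination followed by a threshold: returns +1 or -1.
        public int predict(double[] x) {
            double sum = bias;
            for (int i = 0; i < w.length; i++) {
                sum += w[i] * x[i];
            }
            return sum >= 0 ? 1 : -1;
        }

        // One training step; target is +1 or -1.
        public void train(double[] x, int target) {
            int output = predict(x);
            if (output != target) {               // perceptron rule: change only on error
                for (int i = 0; i < w.length; i++) {
                    w[i] += lr * (target - output) * x[i];
                }
                bias += lr * (target - output);
            }
        }
    }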

2.4.1.2 Regression Analysis

Regression is a statistical task in data mining. It studies whether two or more variables are related, in terms of direction and intensity, and then establishes a mathematical model to predict the target associated with an arbitrary new input. The dependent variable of the learning model must be real-valued and continuous, or at least close to continuous, while there is no specific requirement on the independent variables. A regression model mainly contains three kinds of variables: unknown parameters, independent variables and a dependent variable, as shown below, where Y is the dependent variable, β is the unknown parameter and X is the independent variable.

Y ≈ f(β, X)

Different types of the function f are used in different regression analyses. One always uses the function that best describes the relationship between the dependent and independent variables, and in order to make the analysis easier and more effective, the function should be as flexible and convenient as possible.


Linear regression is the most basic and most commonly used regression; it uses the least squares approach to model the relationship between one dependent variable and one or more independent variables, where the dependent variable is a linear combination of the parameters. Linear regression assumes that the sampled data points are independent of each other. Figure 2.2 below shows a simple linear regression that fits the blue points with a red trend line.

Figure 2.2. A simple linear regression example

Given random sample data (Y_i, X_i1, X_i2, ..., X_ip), i ∈ [1, n], there is an error term ε_i, a random variable that captures any other effect on Y_i, so a multiple linear regression model can be written as:

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + β_3 X_i3 + ... + β_p X_ip + ε_i,  i ∈ [1, n]

The linear regression model does not need to be a linear function of the independent variables; the linearity here means that the conditional expectation of Y_i is linear in the parameters β_0, β_1, β_2, ..., β_p. The conditional expectation of Y_i is a random variable equal to the average of Y_i over each possible condition. A more general linear model can be written in matrix form as:

Y = Xβ + ε


Here Y is a column vector that contains the observed values; β is a column vector that contains the set of unknown parameters that need to be estimated; X is an observation matrix of the independent variables; and ε is a column vector of unobserved random variables. Simple linear regression refers to linear regression with one independent variable and one dependent variable; the relationship is a non-vertical straight line, and in this case X, Y, β and ε in the model above are all scalars rather than matrices or vectors. Linear regression is widely used in different areas [14]. One of the most representative applications is linear trend estimation [15]. A trend line can usually be drawn by visually inspecting a set of data points, but a more correct solution is to use linear regression to model the points with a linear function and calculate its slope, as sketched below. The limitation of linear regression is that it shows optimal results when the relationship between the dependent and independent variables is linear or almost linear, so it is inappropriate to use when the data have nonlinear relationships.
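As an illustration of linear trend estimation, the following Java sketch computes the ordinary least squares slope and intercept for the one-variable case; the class and method names are illustrative only.

    // Sketch of simple linear regression (one independent variable): the least
    // squares estimates of intercept and slope for a trend line y = a + b*x.
    public class SimpleLinearRegression {

        // Returns { intercept, slope }; assumes x and y have equal length and x is not constant.
        public static double[] fit(double[] x, double[] y) {
            int n = x.length;
            double meanX = 0.0, meanY = 0.0;
            for (int i = 0; i < n; i++) {
                meanX += x[i] / n;
                meanY += y[i] / n;
            }
            double sxy = 0.0, sxx = 0.0;
            for (int i = 0; i < n; i++) {
                sxy += (x[i] - meanX) * (y[i] - meanY);
                sxx += (x[i] - meanX) * (x[i] - meanX);
            }
            double slope = sxy / sxx;
            double intercept = meanY - slope * meanX;
            return new double[] { intercept, slope };
        }
    }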

2.5 Apache PDFBox Java Library

Apache PDFBox [16] is an open source Java library for working with PDF documents. It implements a variety of features for dealing with PDF documents, such as extracting text, splitting and merging files, creating new PDFs and printing PDFs. The newest version of Apache PDFBox is 2.0.3, released on September 17th, 2016. The PDFBox project was started by Ben Litchfield in 2002 as a way of extracting PDF content so that it could be indexed by the Lucene search engine. [17] It entered Apache Incubation on February 7th, 2008, [18] and the first release of Apache PDFBox took place on September 23rd, 2009. The Apache PDFBox project graduated and became an Apache top level project in October 2009. The class PDFTextStripper [19] is used for extracting text content from a PDF file: it takes a PDF document and strips out all text content sequentially from all pages, ignoring the formatting. There is another class, PDFTextStripperByArea, that extends PDFTextStripper. This class is also used for extracting text content from a PDF document, but the difference is that it extracts text from a specified region.
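A minimal example of the text extraction described above, assuming Apache PDFBox 2.x (PDDocument and PDFTextStripper as documented by the project); the surrounding class is illustrative.

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Minimal text extraction with Apache PDFBox 2.x: PDFTextStripper strips
    // the plain text of all pages sequentially, ignoring the formatting.
    public class PdfTextExtractor {

        public static String extractText(File pdfFile) throws IOException {
            try (PDDocument document = PDDocument.load(pdfFile)) {
                PDFTextStripper stripper = new PDFTextStripper();
                return stripper.getText(document);
            }
        }
    }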


Chapter 3 Methodology

This chapter introduces and explains the main strategies and methods applied in the research. A flowchart represents and describes the workflow of the whole project step by step, and each field in the flowchart is explained in detail in the subsections.

3.1 Research Strategy

In order to achieve the objectives stated in section 1.2, a suitable and effective research strategy must be set up before the project starts. The research strategy is a general guide that helps to gain useful and reliable information and data relating to the research issue, and then to use these materials to analyze and solve the research problem in a systematic and methodical way.

3.1.1 Main Methodology

This section briefly introduces some common methods used in academic and scientific research. These methods are applied to varying degrees in different phases of this thesis.

3.1.1.1 Quantitative Research

Quantitative research is a methodological paradigm that relies primarily on quantifiable data. To achieve its objective, the development of a mathematical model, quantitative research focuses on generalizing data and inferring the quantitative relationships within them. The sample in quantitative research is often a large number of instances representing the research object. The data collection stage often applies structured techniques, so the gathered data can be reliably aggregated and comparisons between different sample groups are feasible. More importantly, the collected data must be measurable, which means that they should be in numerical and statistical form [20]. The most classical and widely applied method in the data analysis phase is statistical analysis, which describes and explains both the relationships within and the structure of the sample data. In other words, quantitative research objectively infers a mathematical and statistical model from the sample data, and the inferred model is applicable to further research on the entire population of interest.


3.1.1.2 Qualitative Research

Qualitative research is the opposite methodological approach; it relies on the collection of qualitative data. It contains different research methods and is therefore widely applied in different areas. Its objective is to gain a better understanding of phenomena and the underlying reasons or motivations for decision-making [21] [22], and the aim can vary with different research backgrounds. Qualitative research focuses on smaller but more concentrated samples that produce knowledge about specific research cases. Both structured and unstructured methods can be applied for collecting and categorizing data into different groups based on the research themes. Unlike the statistical requirements in quantitative research, any type of data related to the research is acceptable in a qualitative study. It follows that the data analysis phase is usually non-statistical; the approach seeks the meaningful content of the sample data and then gives a reliable explanation.

3.1.1.3 Inductive Approach

The inductive approach, also called inductive reasoning, is a research method for proposing theories and ideas. This approach always begins by collecting data relevant to the study topic, after which an observation stage looks for patterns or regularities in the large amount of sample data. After analyzing the patterns and regularities, some temporary hypotheses can be roughly constructed. As a result, a more general and certain theory based on the hypotheses is proposed. The procedure of the inductive approach is illustrated in Figure 3.1.

Figure 3.1 The process of the inductive approach: Data Collection → Data Observation → Pattern/Regularity → Hypothesis Construction → Theory Proposition

In short, the inductive approach is generally a subjective approach [23] that always starts with something specific and then moves towards something more general, which is why it is also known as the “bottom-up” approach.


3.1.1.4 Deductive Approach

The deductive approach works in the opposite direction to the inductive one. It is an objective approach that focuses on testing proposed hypotheses based on existing theories, as Figure 3.2 shows. Since the deductive approach moves from a general level to a specific one, it has also received the informal name “top-down”.

Figure 3.2 The process of the deductive approach: Existing Theory Study → Hypothesis Proposition → Data Collection → Observation & Testing → Confirmation & Rejection

This approach starts with studying and understanding existing theories. After that, a research topic based on the theory study is brought up. After turning the question into a provable hypothesis, the data collection phase is carried out in order to prepare specific and relevant data for observation and testing. The observation and testing stage applies certain methods to test the proposed hypothesis. Finally, by examining the outcome of the testing stage, the hypothesis is either confirmed or rejected. If it is rejected, the original theory may need to be modified or updated.

3.1.2 Methodology applied in the thesis

This research mostly follows the inductive approach. It starts with understanding the client requirements and collecting relevant experimental data, and aims to find the best or optimal solution to the research subject. More importantly, no hypothesis is proposed before the project begins. Both quantitative and qualitative methods are applied in the research to achieve their respective objectives: the qualitative method is mainly used in the data collection and evaluation phases, and the quantitative method is conducted in the data analysis and performance analysis phases.


3.1.3 Process overview

This project is in essence a software development project, and it applies the most common software development model, the waterfall model. The waterfall is a sequential life cycle model that includes a number of clearly defined and distinct work stages. Using the model is intended to create a high-performance product that satisfies all client requirements as far as possible. The overall structure of the waterfall is very simple to understand, and each phase is easy to use and manage. Moreover, it works more effectively than other models when the project is relatively small and the client requirements are clearly understood. The process of the whole research is shown in Figure 3.3 below; a detailed description of each step is given in the following sections.

Figure 3.3 The process of the research: Problem Statement → Requirement Elicitation and Requirement Analysis (Proactive Evaluation) → Data Collection: Sample Documents, Development Environment, Similarity Measure (Clarificative Evaluation) → Prototype Development: Design, Implementation, Testing (Interactive Evaluation) → Summative Evaluation


3.2 Understanding the Client Requirement

This section describes how the client requirements are understood and analyzed.

3.2.1 Requirements Elicitation

The problem statement must be understood before the project starts. Most clients usually have an abstract and broad idea of what they want as a result, but do not know the details of that idea. A meeting with Loredge is arranged in the initial stage, where the client introduces the background and objective of the Loredge software and then presents the client requirements. Afterwards, a detailed discussion of the requirements is conducted with the client in order to understand the requirements completely.

3.2.2 Requirements Analysis

The requirements analysis is intended to turn the abstract idea into a specific and executable program development. Both functional and non-functional requirements from the client are specified and analyzed. The functional requirements define what a system is supposed to do, and the non-functional ones define how the system is supposed to be; in other words, both the objective and the method of the development are defined. As a result, the development model is decided and the development schedule is planned.

3.3 Data Collection

The data collection is the preparation stage of the research, where all useful and necessary knowledge and material relating to the research subject is gathered. The preparation consists of three independent parts, as Figure 3.4 shows. A sequential order is applied in order to make each part easy to plan and manage.

Figure 3.4 The process of the Data Collection: Sample Document Collection → Development Environment → Statistical Algorithm Selection


3.3.1 Sample Document Collection

Since the research object is the PDF document, the collection process starts with gathering PDF documents for the research. The sample requirement is that it must be possible to measure similarities among the sample documents, which means that document comparison is feasible and the degree of similarity can be presented in numerical form. After the sample documents are collected, an analysis of document structure and content is conducted. The analysis gives a basic understanding of what a PDF document looks like and what content it usually contains. This understanding is intended to be one of the major factors in the program development, and the result of the document analysis helps to prepare other relevant experimental data.

3.3.2 Development Environment

After the sample document collection, it is time to choose a programming language. Since all programming languages have their own advantages and disadvantages, comparison and evaluation are necessary; the language most suitable for the research is the first choice. Furthermore, an extensible resource that is able to work with PDF documents is needed. This resource is a computing library written in a certain programming language, and it is mainly used for fetching content from a PDF file. Moreover, a digital document library is needed: a database intended to store all sample documents and relevant data.

3.3.3 Statistical Algorithm Selection

A set of numerical and statistical methods must be set up to perform the similarity measurement among PDF documents. The method is applied not only in the document comparison, but also in database creation and updates; the measurement is undoubtedly the core unit of the search program. To achieve the best performance, the selection process is based on the result of the document analysis. The chosen measure should satisfy both the functional and non-functional requirements as far as possible. The decision is made after a comprehensive consideration of the factors that affect performance, for example the convenience, complexity, efficiency and extensibility of the metric.


3.4 Prototype Development

In software development, a prototype is a rudimentary working model of a product or information system, usually built for demonstration purposes or as part of the development [24]. The prototype in this research refers to a rough implementation of a Loredge search engine program. It is used to concretely demonstrate the operational principle and working process of the program.

3.4.1 Design of Prototype

The identified and analyzed functional and non-functional requirements are designed and determined. The functional requirements refer to the functionalities and objectives of the prototype and are defined as a set of inputs, logical behavior and outputs. In contrast, the non-functional requirements are detailed in the system architecture design and are defined as a set of performance characteristics of the system. In this research, the whole development is divided into a set of small unit developments. A unit test is arranged after each unit development is done; afterwards, all units are integrated as a group, and finally an integration test is planned. The process of the development is illustrated in Figure 3.5.

Figure 3.5 The process of the Prototype Development: Unit Development → Unit Testing → Integration → Integration Test

Each unit here represents a module of the whole program and has a clearly defined functionality. A unit works as an independent and complete small program, which takes certain input variables as parameters and generates the required output as its result. The program division is based on the functional requirements; in other words, the whole program consists of several flexible functionalities executed sequentially. In the meantime, a set of non-functional requirements is designed and stated. The non-functional requirements are more varied and multi-objective than the functional ones. The design is a combination of the requirement analysis and an existing resource, a given checklist of non-functional requirements. [2]


3.4.2 Implementation of Prototype

The implementation in this research refers to writing program code in the selected programming language in an integrated development environment. The statistical algorithm was decided in the data collection stage, and the structure and functionalities of the program have also been designed, so the implementation phase is rather simple, although the design is adjusted where necessary according to the actual situation.

3.4.3 Testing of Prototype

Program testing is an important phase in software development. In this research, two kinds of testing methods are applied: unit testing and integration testing. Unit testing is intended to isolate each unit of the program for analysis and debugging; it is carried out after each individual unit is implemented. The unit test feeds in different arguments and checks whether the generated results are as expected; if they are, the tested unit has achieved the designed functionality. By applying unit testing, program defects are detected early in the development cycle, so design improvement and code refactoring remain possible. The integration test is executed after all individual units have been tested and integrated; it aims to verify the functional and reliability requirements of the integrated units. As with unit testing, the integration test takes different types of arguments to check whether the combination of units works correctly as designed; both the generated result and the quality of the execution are verified and analyzed.
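As an illustration of such a unit test, the JUnit 4 sketch below feeds different arguments into the illustrative QueryPreprocessor class from section 2.3.3 and compares the generated results with the expected ones. JUnit is not prescribed by the thesis; it is assumed here purely for demonstration.

    import static org.junit.Assert.assertEquals;

    import java.util.Arrays;
    import java.util.Collections;

    import org.junit.Test;

    // Illustrative unit test for the QueryPreprocessor sketch: different inputs
    // are processed and the generated term lists are compared with expectations.
    public class QueryPreprocessorTest {

        @Test
        public void stopWordsAreRemovedAndPluralsNormalized() {
            assertEquals(Arrays.asList("search", "document"),
                    QueryPreprocessor.preprocess("a search of the documents"));
        }

        @Test
        public void emptyQueryGivesEmptyTermList() {
            assertEquals(Collections.emptyList(), QueryPreprocessor.preprocess(""));
        }
    }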

3.5 Evaluation Method

The implemented prototype is analyzed and evaluated from different perspectives in order to determine whether the implementation achieves both the non-functional and the functional requirements as designed and expected. The evaluation result is documented to provide a good foundation for further study and development. Both formative and summative evaluation are applied in order to obtain high-quality results.

3.5.1 Formative Evaluation

Formative evaluation aims to strengthen or improve the object being evaluated [25], and it is generally performed before and during the project's implementation. [26] This category of evaluation involves qualitative methods and focuses strongly on the details of the content and performance of the object.


The whole research is divided into several short phases, and the prototype is developed iteratively using a systems development life cycle. After each phase and life cycle, a short meeting with Loredge is arranged for feedback, for evaluating what the phase achieved and for planning the next step. This set of phased evaluations during the development ensures that the client requirements are correctly understood and defined and, importantly, that the research is on the right path.

3.5.1.1 Proactive Evaluation

Proactive evaluation is a method that provides information for decisions about how to best develop a program in advance of the planning stage. [27] The evaluation is performed after the Data Collection stage but before the Prototype Development, and the corresponding approaches are needs assessment and research review. The key questions that must be clearly answered are:

• What do people know about the research?
• What are the needs of the research?
• What kind of program does the research need?

The client requirements are evaluated to determine the scope of the project, and the evaluation of the collected data aims to synthesize the related literature for each identified client requirement. The result of the evaluation is intended to provide enough information and understanding for the prototype development.

3.5.1.2 Clarificative Evaluation

Clarificative evaluation focuses on clarifying the intent and operation of a program. [28] The reason for applying clarificative evaluation is that the main direction and objectives of the research are clear, but the detailed process is rather vague. The evaluation is performed during the Prototype Development. The two related approaches carried out in the research are evaluability assessment and logic mapping, and the two corresponding key questions set for the evaluation are:

• What are the intended results?
• How can the intended results be achieved by the program?

A graphical description of the logical relationship between the inputs, behavior and outputs of the program is made in the design phase of the first development life cycle, and it is updated and modified after each iterative cycle. The short meetings after each phase and cycle work as the evaluability assessment, which is intended not only to identify the program's objectives and its intended activities, but also to control the gap between them.


3.5.1.3 Interactive Evaluation

Interactive evaluation is synonymous with improvement: this kind of evaluation aims to improve the program's design and implementation. The evaluation is used during the Prototype Development with one important topic:

• How could the program be changed to be more effective?

The evaluation is carried out in two steps: determining whether the latest version of the program is effective, and determining whether it needs improvement. The first step starts with program performance testing using different input arguments, followed by comparing and analyzing the corresponding output results. Since both the number and the type of sample data are limited, trend analysis is required before applying the program to a large population. Afterwards, parameter adjustment is performed in order to optimize the program performance.

3.5.2 Summative Evaluation

In contrast, the summative evaluation examines the effects or outcomes of some object, [26] and it takes place at the end of an operating cycle of the project. The evaluation aims to determine whether the project has achieved its objectives or outcomes. Outcome evaluation, performed at the end of the project, is a method to summarize the performed project; it focuses on summarizing and examining the outcomes of the project rather than its content. Several questions are clearly answered in the evaluation:

• Have the goals/objectives of the project been met?
• What is the overall impact of the program?
• What resources will be needed in order to address the program's weaknesses?
• How sustainable is the project, and does it need to be continued in its entirety?

Both the process and the result of the evaluation are presented to the client after the project. The evaluation is associated with objective and quantitative methods of data collection. In other words, the evaluation not only gives a summary of the project, but also provides a foundation for further development.


Chapter 4 Requirement and Data Analysis

This chapter analyzes the data collected in the preparation stage. The result of the analysis provides a basic understanding for the prototype development.

4.1 Client Requirements

The client requirements are presented and analyzed in this section.

4.1.1 Requirement Presentation

After the meeting with Loredge, the client requirement is clearly presented and understood. The requirement is to develop a prototype of a document based search engine, whose behavior is described as follows. The program reads in a PDF document and then uses the input file as a search statement to query a PDF-file based database. A document comparison process is performed in order to calculate the similarity between the input file and each data entry in the database. Finally, the program returns the counterpart PDF file that has the highest similarity with the input file as the result.

4.1.2 Requirement Analysis

The client requirements are analyzed after the requirements elicitation. Both the functional and the non-functional requirements are specified in the analysis, and the result of the analysis gives direction to the data collection and the prototype development.

4.1.2.1 Functional Requirements

The functional requirements refer to the functionalities of the search engine and mainly consist of the following components.

• Reading in PDF documents. This is the premise and basis on which the program can exist and develop. The system must be able to read the content of a PDF document file word by word, even character by character, into its memory.


• Pre-processing documents. To be able to measure document similarity, the original documents must be converted into a certain statistical data form. Furthermore, in order to make the comparison process more effective, the original data should be pre-processed so that redundant and irrelevant data are filtered out.

• Generating a unique document identification. The identification is a data entry stored in the database; it represents the original document.

• Measuring similarities between PDF documents. A similarity measure is required in order to quantify the similarity between two documents. As a result, the similarity is a statistical, numeric value presented as a percentage: generally speaking, if two documents are exactly the same, the similarity is 100%; if they are completely different, the similarity is 0%. A sketch of one such measure is given after this list.
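One common way to obtain such a percentage, and the measure suggested by the thesis keywords, is the cosine similarity between two term-frequency fingerprints. The Java sketch below assumes the fingerprints are word-count maps; multiplying the result by 100 gives the percentage. It is an illustration, not the prototype's actual code.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch: cosine similarity between two word-count fingerprints.
    // Returns 1.0 for identical term distributions and 0.0 for disjoint ones.
    public class CosineSimilarity {

        public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
            Set<String> terms = new HashSet<>(a.keySet());
            terms.addAll(b.keySet());
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (String term : terms) {
                double x = a.getOrDefault(term, 0);
                double y = b.getOrDefault(term, 0);
                dot += x * y;
                normA += x * x;
                normB += y * y;
            }
            if (normA == 0.0 || normB == 0.0) {
                return 0.0; // at least one empty fingerprint
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }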

4.1.2.2 Non-functional Requirements

The non-functional requirements describe how the search engine works. Several basic and specific criteria for the prototype are defined:

• Platform independence/platform compatibility. Because the project is a kind of software development that should be usable in various system environments, the developed product must have strong compatibility; it should execute correctly on various platforms and operating systems, e.g. Windows, iOS and Android.

• Response time. In search engine development, the response time is the total time spent responding to a search request. For a search engine, the response time is one of the most important factors that determine whether the search process is effective or not, and most performance optimization focuses on reducing this time.

• Accuracy. The other non-functional requirement for a search engine, even more important than the response time for users, is accuracy. The accuracy of the query result is required to be as high as possible within an acceptable range.

• Extensibility. Extensibility is the capacity to extend the development environment with further functionalities. As the project is a prototype development, it requires good extensibility in order to accept and implement new extensions to the user environments.


4.2 Data Analysis

All the relevant data and material are collected and analyzed in this section.

4.2.1 Selection of Sample Documents

In this section, the relevant sample documents are collected, and a document analysis is performed in order to understand these documents and prepare the document pre-processing.

4.2.1.1 Sample Document Collection

Loredge provides a set of sample PDF documents for the research, described as follows. The sample consists of 200 articles evenly divided between two folders: one folder is the author copy and the other is the published copy. The author copy refers to the original version of the document owned by the authors, and the published copy is the same article but with a few modifications compared with the original version. Therefore, each document in the author-copy folder always has exactly one highly similar counterpart in the published-copy folder.

4.2.1.2 Sample Document Analysis

After studying and analyzing the collected sample documents, several characteristics of the documents are obtained and summarized. This understanding aims to improve the program efficiency by enabling a data pre-processing step.

4.2.1.2.1 Document Structure Analysis All pages of all the documents in both folders have a header and a footer that are separated from the main body, as shown in Figure 4.1. In most cases, the header and footer contain meta-data that identify the document or the page, such as page number, author name, publisher name, article title and issue number. Hence, the footer and header are considered redundant information and are filtered out.

Figure 4.1 Document Structure (a document page consists of a header, a main body and a footer)


4.2.1.2.2 Document Content Analysis The sample PDF documents, in both the author-copy and published-copy folders, contain only plain text and static images in the main body, as shown in Figure 4.2. The images can be divided into two categories: text-based images and pure images. The plain text within the documents consists of three basic elements: the alphabet, punctuation marks and Arabic numerals. Fortunately, there are no multimedia plugins in the sample files.

Figure 4.2 Basic elements of Document Content (plain text: alphabet, punctuation, Arabic numerals; static images: pure images and text-based images)

According to the result of the functional requirements analysis in section 4.1.2.1, in order to make the search engine effective, the redundant and irrelevant data in the sample documents should be pre-processed and filtered out.

There are several reasons why the images within the sample documents are not considered useful data. Firstly, not all the documents contain images, so images are not necessary for a document. Secondly, an image usually works only as an additional explanation of the document, which means that it is mainly a graphical representation of certain plain text. Furthermore, it is extremely difficult to design and implement a small but effective image-comparison program.

After filtering out all the images, the document analysis and pre-processing focus on the plain text of the documents: the punctuation, the Arabic numerals and the alphabet. The usage and importance of each are analyzed. Punctuation is used to structure and organize a written work; it separates sentences and their elements and clarifies meaning. Thus, punctuation carries no specific meaning of its own; it works as a language tool that helps the reader read and understand printed text correctly, and it is not a key element in the document analysis. Consequently, the punctuation is also filtered out.


The Arabic numerals are the ten digits from 0 to 9, and all numbers are combinations of these digits. Numbers usually express the degree of some quantity; they make text data numerical. The Arabic numerals are not necessary for a document, since they can be represented by words, e.g. 5 is equal to five in English. Generally, the numerals account for a low proportion of a document, as shown in Figure 4.3, and they have little substantial effect on the content of the document. For this reason, the document analysis does not focus on the Arabic numerals, and the numbers are also filtered out.


Figure 4.3 Percentage of Arabic Numerals in 200 sample documents

It could not be more obvious that the most important and useful part of a document is the words, the smallest units of language that carry particular meanings. The word is the most crucial factor in the understanding and analysis of a document. The following analysis therefore focuses on the words; their attributes and frequencies are examined in order to perform a further pre-processing of the sample documents. All the sample PDF documents are scientific and academic articles written in English, so the words here mainly refer to English words, consisting of the English alphabet. In certain documents there are a few non-English words that contain non-English letters, usually Greek letters in mathematical and physical formulas, e.g. α, β and µ. These words mainly denote specific variables and objects. Since their frequencies are even lower than those of the punctuation and the numbers, the non-English words are also removed in the document pre-processing.


It seems that no matter what kind of document it is, there are some common words, e.g. the, as, an, there, and. These words usually belong to the word classes pronoun, preposition and determiner. They are the most common words and therefore have a relatively high frequency in the language; in computing they are called Stop Words. After making a rough stop-word statistic over the sample documents, it is evident from Figure 4.4 that the proportions of the stop words are large and noticeable. The distribution of the points in the diagram shows that the percentages remain stable around 40%, almost half of the document. Consequently, the stop words play an important role in any text-based document.


Figure 4.4 Percentage of stop words in sample documents

On the other hand, the stop words in a document have no important significance, and they do not provide any additional information for a search engine. For this reason, they are usually considered irrelevant for any search purpose; e.g. there is no significant difference between apple and the apple. At the same time, the stop words have a greater amount and higher frequency than other words, as illustrated above. These two disadvantages mean that querying and processing the stop words in a search program consume a large amount of both time and space. In other words, the search performance becomes weaker if the stop words are kept. In conclusion, filtering out the stop words is a necessary and important process that aims to improve the search performance.


After removing the punctuation, the Arabic numerals, the non-English letters and the stop words from the sample documents, the remainder consists of words that better identify and represent the original document. The two charts in Figure 4.5 show the quantity and the percentage distributions of the Remaining Words in the 200 sample documents.


Figure 4.5 The quantity (left) and the percentage (right) distributions of the remaining words in the 200 sample documents

The quantity points are widely distributed in the left half of Figure 4.5, but it is quite obvious that most of them are over one thousand and that the highest values reach ten thousand. In contrast with the quantity, the percentage points in the right half of Figure 4.5 remain stable at around 40%-50% over the 200 sample documents. Of course, there is no guarantee that the quantity of the remaining words stays under ten thousand in any document, since the PDF format has no quantitative restrictions on either words or pages. But it is an undoubted fact that a complete English scientific or academic article usually has more than 2000 words and contains about 1000 remaining words, which corresponds to circa 4-5 pages in A4, the most common page size.


4.2.1.2.3 Document Fingerprint Proposition Clearly, a document comparison over several thousands of words has a large time and memory consumption. To improve the search process and save both time and space, a further reduction of the quantity of the words to be compared is required. In this project, the final product is called a feature, which is an identification of the original PDF document. As a result, a relatively fast feature comparison is performed instead of a slow document comparison. In order to make the comparison process permanently effective, an extra feature database on the server side is required. The feature database stores all the features that are generated from the documents in the original PDF document database. The whole search process is thus carried out as follows. The program reads in an input PDF document and generates an input feature that represents the document; the input feature is then used as the query statement to iterate through the feature database to find a feature that satisfies the search requirement. Finally, the feature database can directly point to the corresponding PDF document in the original PDF document database, as illustrated in Figure 4.6.

Figure 4.6 Applying a feature database to speed up the search (the client-side search program converts the input PDF document into an input feature, which is matched against the server-side feature database; the feature database points to the original PDF document database in the Loredge database)


Since the feature must be a unique identification of the original PDF document according to the functional requirements, it is also called a document fingerprint. Two factors must be considered during document fingerprint generation.

• The Size of the Document Fingerprint. The size refers to the quantity of elements used to construct the document fingerprint. A large and detailed document fingerprint always gives a more accurate representation of the original document, but its resource consumption in both time and memory is higher. Conversely, a small and general fingerprint has a lower resource consumption, but it gives a weaker representation and lacks an effective way to avoid duplication. The most suitable size of the document fingerprint therefore achieves a balance between accuracy and resource consumption.

• The Content of the Document Fingerprint. In general, the words that are not very common in the language but have a high frequency in the original document should be used to create the document fingerprint, as they are more representative. In order to describe the importance of a word and identify the document, the occurrence of each chosen word is also required. A document fingerprint is therefore a set of key-value pairs, where the key is a high-frequency word and the value is the occurrence of the word; a minimal sketch of such a key-value representation is given below. The problem to be further analyzed is how to set an occurrence requirement, which should vary with the document.
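As an illustration of this key-value representation (a sketch only, not the Loredge implementation), the word counts of an already pre-processed document can be collected into a Java Map, where each key is a word and each value is its occurrence:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: a document fingerprint represented as key-value pairs,
// where the key is a word and the value is its occurrence count.
public class FingerprintSketch {

    // Counts how often each word occurs in the (already pre-processed) word list.
    static Map<String, Integer> countOccurrences(List<String> remainingWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : remainingWords) {
            counts.merge(word, 1, Integer::sum); // increment the count for this word
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input, matching the example fingerprints used later in the text.
        List<String> words = List.of("apple", "banana", "apple", "orange",
                "banana", "apple", "banana", "banana", "orange");
        System.out.println(countOccurrences(words)); // e.g. {banana=4, orange=2, apple=3}
    }
}
```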

4.2.1.2.4 Document Analysis Conclusion The document pre-processing aims to improve the search performance. The punctuation, the numbers, the non-English words and the stop words are filtered out from the sample PDF documents. A document fingerprint that represents the original document, consisting of high-frequency words and their occurrences, is then extracted. The two factors to take into consideration in the selection are the size and the content.


4.2.2 Selection of Programing Environment This section presents the development environment, the programing language, the relevant resources and the local database. The selection of the development environment is important for meeting the requirements specified in section 4.1.2.2.

4.2.2.1 Programing Language Java is the programing language used in the program development, since it has several advantages over other programing languages. It is an object-oriented language, and the greatest merit of object-oriented programing is code reuse and recycling. Importantly, Java is platform-independent, which means that a Java-based program can be executed on practically every hardware and software platform. This characteristic satisfies one of the non-functional requirements, platform-independence or platform-compatibility. Moreover, Java-based programs are extensible; a large number of Java libraries have been developed by professional companies and published as open source on the Internet. These ready-made libraries make the programing convenient and effective. Thus, another specified non-functional requirement, extensibility, is also met. Additionally, Java is robust and reliable; latent problems and errors can be detected and reported already during the compilation phase, which makes debugging and testing easier and more effective.

4.2.2.2 Relevant Resource Apache PDFBox is imported as a ready-made Java library in order to be able to work with PDF documents. The library is used for reading and extracting content out of the PDF documents, and it implements the first functional requirement specified in section 4.1.2.1. The reason for applying an existing library is to avoid wasting time on developing a new application, since PDFBox is sufficiently mature and stable.
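For illustration, a minimal reading sketch with PDFBox might look as follows; the file name is a placeholder, and the exact loading call depends on the PDFBox version (this sketch assumes the 2.x API):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Minimal sketch of reading the text content of a PDF file with Apache PDFBox 2.x.
public class PdfReaderSketch {

    static String readText(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document); // plain text of the whole document
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is a placeholder path, not a file from the Loredge sample set.
        System.out.println(readText(new File("sample.pdf")));
    }
}
```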

4.2.2.3 Local Database The database refers to the feature database, which is a local document folder that contains text files. Each text file is a feature that represents a PDF document. The project does not focus on choosing a sophisticated database, and a simply structured database usually generalizes well. Using such a database mainly aims to save the time of learning and testing more advanced but complex database software. Moreover, the response time analysis becomes easier in a simpler development environment.


4.2.3 Selection of Similarity Measurement This section presents the similarity measurement chosen in this project. Several of its advantages are also described in order to provide a good understanding.

4.2.3.1 Cosine Similarity The chosen measure is Cosine Similarity. The similarity is measured as the cosine of the angle between two vectors in N-dimensional space, where N can be any positive integer. Since the measure is used here in positive space, the outcome ranges from 0 to 1: it is 1 when the two vectors have exactly the same orientation, and 0 when the vectors are mutually perpendicular.

Figure 4.7 Cosine Similarity Example

An example of applying the cosine similarity is illustrated in Figure 4.7. The similarity between vector a and vector b is calculated as the cosine of the angle θ between the vectors, as shown in the figure. According to the Euclidean dot product, the formula is:

$$\vec{a} \cdot \vec{b} = \|\vec{a}\| \, \|\vec{b}\| \cos\theta$$

where $\vec{a} \cdot \vec{b}$ is the dot product, $\|\vec{a}\|$ is the length (norm) of vector a, and $\|\vec{b}\|$ is the length of vector b. Thus, the function can be rewritten as:

$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$

Here $a_i$ and $b_i$ are the components (dimensions) of vector a and vector b, and all the components in the formula are positive.
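A minimal Java sketch of this formula, for two vectors that already share the same dimensions, might look as follows (the numeric example in main is illustrative only):

```java
// Minimal sketch of the cosine similarity formula above for two vectors
// that already share the same dimensions (components are non-negative).
public class CosineSketch {

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];   // sum of a_i * b_i
            normA += a[i] * a[i]; // sum of a_i^2
            normB += b[i] * b[i]; // sum of b_i^2
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // design choice: similarity with a zero vector is defined as 0
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // e.g. {3, 4, 2} vs {1, 3, 5} gives approximately 0.7847
        System.out.println(cosineSimilarity(new double[]{3, 4, 2}, new double[]{1, 3, 5}));
    }
}
```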


4.2.3.2 Application of cosine similarity within the document comparison Returning to the conclusion of the document analysis in section 4.2.1.2, the parameters used for calculating the similarity are the document fingerprints. A document fingerprint is a set of key-value pairs, where the key is a high-frequency word and the value is the occurrence of the word. In the cosine similarity measure, these fingerprints are treated as vectors in a high-dimensional positive space, where each word in the fingerprint refers to a unique dimension and the corresponding occurrence is the value of that dimension in the coordinate system.

For example, document A is characterized by a vector A = {apple/3, banana/4, orange/2}, and document B is represented by a vector B = {apple/1, banana/3, orange/5}. The three common dimensions in this example are apple, banana and orange, which means that both vectors lie in exactly the same three-dimensional space. According to the cosine similarity measure, the similarity is calculated as:

$$\cos\theta = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \, \|\vec{B}\|} = \frac{3 \cdot 1 + 4 \cdot 3 + 2 \cdot 5}{\sqrt{3^2 + 4^2 + 2^2} \cdot \sqrt{1^2 + 3^2 + 5^2}} = 0.7847$$

Thus, the similarity between the two vectors is 0.7847; in other words, the similarity between the two documents is 78.47%, expressed as a percentage.

4.2.3.3 Advantages of using the Cosine Similarity The main characteristics of the cosine similarity are described in this section.

Dimension Requirement As explained in the introduction above, the cosine similarity measure is able to calculate the similarity between two vectors in a space of any dimension. Moreover, it does not even require that the vectors contain exactly the same dimensions, because the measure is based entirely on the direction of the vectors. The only requirement is that both vectors lie in positive dimensional spaces. This characteristic suits the fingerprint comparison. According to the definition in the previous section, the document fingerprint is generated from the remaining words. Different documents generally have different remaining words, in terms of quantity (Figure 4.5) and variety; hence, fingerprints extracted from different documents generally consist of different quantities and varieties of elements.


For example, document C is characterized by a fingerprint C = {apple/3, banana/4}, and document D is represented by a fingerprint D = {apple/1, orange/5, melon/2}. Since there are four unique dimensions in this case (apple, banana, orange and melon), both vectors must be placed in the same space, so that they have the same dimensions in the same order:

• vector C becomes {apple/3, banana/4, orange/0, melon/0},
• vector D becomes {apple/1, banana/0, orange/5, melon/2}.

Applying the Euclidean dot product, the similarity is calculated like this:

$$\cos\theta = \frac{3 \cdot 1 + 4 \cdot 0 + 0 \cdot 5 + 0 \cdot 2}{\sqrt{3^2 + 4^2 + 0^2 + 0^2} \cdot \sqrt{1^2 + 0^2 + 5^2 + 2^2}} = 0.1095 = 10.95\%$$

In summary, the cosine similarity is an effective measure even when two fingerprints consist of different quantities and varieties of elements. The solution is that the original vectors are converted into new vectors in the same dimensional space, which consists of all the dimensions occurring in either of the original vectors.
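A minimal sketch of this idea in Java, operating directly on two fingerprint Maps and treating missing words as zero components, might look as follows (the example fingerprints mirror the C/D example above):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of comparing two fingerprints with different words:
// the vectors are implicitly extended with zero components over the union of all words.
public class FingerprintSimilaritySketch {

    static double cosineSimilarity(Map<String, Integer> c, Map<String, Integer> d) {
        Set<String> allWords = new HashSet<>(c.keySet());
        allWords.addAll(d.keySet()); // union of the dimensions of both fingerprints

        double dot = 0.0, normC = 0.0, normD = 0.0;
        for (String word : allWords) {
            double ci = c.getOrDefault(word, 0); // missing word = component 0
            double di = d.getOrDefault(word, 0);
            dot += ci * di;
            normC += ci * ci;
            normD += di * di;
        }
        if (normC == 0.0 || normD == 0.0) {
            return 0.0; // design choice: similarity with an empty fingerprint is 0
        }
        return dot / (Math.sqrt(normC) * Math.sqrt(normD));
    }

    public static void main(String[] args) {
        Map<String, Integer> c = Map.of("apple", 3, "banana", 4);
        Map<String, Integer> d = Map.of("apple", 1, "orange", 5, "melon", 2);
        System.out.println(cosineSimilarity(c, d)); // approximately 0.1095
    }
}
```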

Outcome Requirement The outcome requirement is specified in the functional requirements in section 4.1.2.1: the degree of similarity must be a numeric value presented as a percentage, ranging from 0% (totally different) to 100% (identical). The outcome of the cosine function has two characteristics that make the measure a good choice for the similarity calculation in this project.

• The range satisfies the outcome requirement: the outcome of the cosine similarity of two vectors in any positive dimensional space ranges from 0 to 1, and the outcome is a continuous variable.

• The cosine function is monotone in any positive dimensional space; the two extremes of the function are 1, which gives the highest similarity, and 0, which gives the lowest. The similarity decreases progressively as the angle between the two vectors increases. The monotone interval ensures that the similarity between two vectors is unique, which makes it easier to find the highest similarity among the sample data.


Chapter 5 Prototype Analysis After collecting and analyzing the relevant data, a prototype of the search engine is developed based on the understanding gained from the data analysis. The whole development is presented, analyzed and evaluated in this chapter. Due to the confidentiality agreements, several key parts of the development are not presented.

5.1 Prototype Development Result The result of the prototype development is presented in this section.

5.1.1 Prototype Design The prototype development consists of several consecutive but independent unit developments, as illustrated in Figure 5.1 below. Each unit development contains a set of functionalities and aims to implement a specified functional requirement defined in section 4.1.2.1.

PDF Reading (library importing, document reading) → Document Processing (document preprocessing, fingerprint extraction) → Similarity Calculation (vector creation, cosine similarity)

Figure 5.1 Unit developments

5.1.1.1 PDF Document Reading This unit development enables the search program to work with PDF documents. Apache PDFBox is imported by the program as a Java library. The input PDF document is converted into a Java object, and then only the text content of the document is read into the program's memory word by word.


5.1.1.2 Document Processing This unit development converts the original documents into a comparable data format. It consists of the document pre-processing and the document fingerprint extraction. The document pre-processing is intended to provide high-quality data for extracting the document fingerprint by filtering out redundant and irrelevant data. As concluded from the document analysis, the punctuation, the Arabic numerals, the non-English letters and the stop words are removed from the document. The extracted fingerprint consists of a set of the highest-frequency words with their occurrences in the document. It works not only as the vector used in the similarity calculation, but also as the identification of the data entry stored in the database.
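As a rough illustration of this pre-processing step (a sketch only; the actual stop-word list and filtering rules of the project are not reproduced here), the filtering could be expressed in Java as follows:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Minimal sketch of the pre-processing: punctuation, Arabic numerals and
// non-English letters are stripped, the text is lower-cased, and stop words
// are removed. The stop-word list is a tiny illustrative subset.
public class PreprocessSketch {

    private static final Set<String> STOP_WORDS =
            Set.of("the", "as", "an", "a", "there", "and", "of", "to", "in");

    static List<String> remainingWords(String rawText) {
        return Arrays.stream(rawText.toLowerCase()
                        .replaceAll("[^a-z]+", " ") // keep only English letters
                        .trim()
                        .split("\\s+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word)) // drop stop words
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(remainingWords("The 5 apples, as shown in Figure 2, are red."));
        // -> [apples, shown, figure, are, red]  (with this illustrative stop-word list)
    }
}
```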

5.1.1.3 Similarity Calculation This is the querying process of the document search. A fingerprint extracted from the input document is used as a search statement to query the fingerprint-based database. Linear search is applied in order to iterate through all data entries in the database. First, the document fingerprints are converted into vectors in the same positive N-dimensional space according to their sizes, and then the cosine similarity is used to measure the similarity between the vectors. Finally, the data entry with the highest and most suitable similarity to the input document is returned.
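A minimal sketch of this querying step is shown below; the in-memory map stands in for the text-file based feature database, and the similarity function (e.g. the cosine similarity sketched in section 4.2.3.3) is passed in as a parameter:

```java
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Minimal sketch of the linear search over all stored fingerprints,
// returning the name of the data entry with the highest similarity to the input.
public class LinearSearchSketch {

    static String bestMatch(Map<String, Integer> inputFingerprint,
                            Map<String, Map<String, Integer>> featureDatabase,
                            ToDoubleBiFunction<Map<String, Integer>, Map<String, Integer>> similarity) {
        String best = null;
        double bestSimilarity = -1.0;
        for (Map.Entry<String, Map<String, Integer>> entry : featureDatabase.entrySet()) {
            double s = similarity.applyAsDouble(inputFingerprint, entry.getValue());
            if (s > bestSimilarity) { // keep the most similar data entry seen so far
                bestSimilarity = s;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```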

5.1.2 Prototype Implementation The prototype is implemented by writing Java code.


5.1.3 Prototype Testing/Debugging The testing/debugging phase of the program development aims to determine whether the designed and implemented prototype works correctly and returns the expected values. Several different types of input parameters are used in the tests, which is intended to give the program better generalization.

5.1.3.1 Unit Testing The unit testing is performed at the end of each unit development. According to the design, there are three corresponding unit tests in the prototype development.

PDF Document Reading This unit test checks whether the text content of the input PDF document is correctly extracted. The content of the PDF document is read into a plain text file, and a text-content comparison between the input PDF document and the output text file is then carried out by visual inspection.

Document Processing This test aims to determine whether all redundant and irrelevant data is correctly removed from the input document. In the same manner, the pre-processed document is written to a text file word by word. Afterwards, a testing program is executed; it reads the text file and checks whether the file contains the defined redundant and irrelevant data.

Similarity Calculation Since it is extremely difficult to check against arithmetical errors in computing, this unit test focuses on determining whether the unit can measure the similarity between two vectors of different dimensions. Because the outcome is a real value ranging from 0 to 1, the second focus of the unit test is to determine whether the measured result lies within this interval.

5.1.3.2 Integration testing Sometimes the developed units work correctly separately, but this does not guarantee that they work correctly together. The integration testing focuses on checking the parameters passed among the different modules/units. As a result, the data structures of the passed parameters are unified as needed, e.g. all integers are represented as Long and all floating-point numbers as Float. The representations of fingerprints and vectors are integrated into the same data structure.

5.2 Document Fingerprint Analysis In this section, the designed and implemented document fingerprint extraction method is presented and analyzed, since the fingerprint is the key element of the program. First, the terms proposed in section 4.2.1.2 are reviewed, and the representation of the fingerprint in Java is introduced. Then, the document fingerprint extraction methods are described. Finally, several different versions of the method are given in order to compare the performance of the prototype from different perspectives.

5.2.1 Term Definition The term Remaining Words refers to the words of an English document that remain after removing redundant and irrelevant data. Studying the remaining words aims to provide high-quality data for producing the Document Fingerprint. The Document Fingerprint refers to a unique document identification, consisting of high-frequency words and their occurrences. The fingerprint is derived from the Remaining Words and aims to improve the document comparison process. It is used not only for processing the input document, but also for creating the feature database.

5.2.2 Representation As previously stated, the higher occurrence a word has in an article, the more representative the word generally is. Thus, the words used for creating the fingerprint are supposed to have higher occurrences than other words. To be able to obtain the high-frequency words effectively, the set of the remaining words is converted into the Java object Map [29], where the Key is the word and the Value is the occurrence of the word. The Map is then sorted into descending order of Value. Put simply, the most frequent word is placed at the top of the sorted Map, and the least frequent one at the bottom. Afterwards, a certain number of entries is extracted top-down for creating the document fingerprint. The fingerprint is therefore also represented by a Map, where the Key is a high-frequency word and the Value is the occurrence of the word.
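For illustration, sorting such a Map into descending order of Value can be sketched in Java as follows (this is only one possible way to do it, not necessarily the one used in the prototype):

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of sorting the word-count Map into descending order of Value,
// so that the most frequent words can be read top-down during fingerprint extraction.
public class SortByValueSketch {

    static Map<String, Integer> sortByValueDescending(Map<String, Integer> counts) {
        Map<String, Integer> sorted = new LinkedHashMap<>(); // preserves insertion order
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .forEachOrdered(e -> sorted.put(e.getKey(), e.getValue()));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortByValueDescending(Map.of("apple", 3, "banana", 4, "orange", 2)));
        // -> {banana=4, apple=3, orange=2}
    }
}
```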


5.2.3 Extraction Methods As mentioned above, the document fingerprint results from the remaining words of the original document. In other words, there must be a certain relationship between the document fingerprint and the remaining words. In this project, the relationship is regarded as a mathematical function:

Document Fingerprint = function(Remaining Words)

There are two important factors in the generation of the document fingerprint, according to the conclusion of the document analysis in section 4.2.1.2.

• Size. The number of high-frequency words used for creating the document fingerprint.
• Content. The property/occurrence requirement for the chosen words.

The fingerprint is briefly presented as a tuple consisting of Size and Content.

Document Fingerprint = {Size, Content}

By setting variables for these two factors, a draft of the document fingerprint selection standard is formed. The standard describes both the Size and the Content requirements.

A Document Fingerprint consists of no more than X highest-frequency words, each of which occurs more than Y times in the document.

• The variable X gives the planned size of the document fingerprint.
• The variable Y gives the minimum occurrence required for a word to be included in the document fingerprint.

The selection is performed in the following steps:
1. Read X entries from the sorted Map RemainingWords.
2. Obtain the Values of all the entries.
3. Compare the Values with Y:
   a. If and only if the Value is greater than Y, put the corresponding entry into the Map DocumentFingerprint.
   b. Otherwise, ignore the corresponding entry.
4. Return the Map DocumentFingerprint.


The selection process is presented in pseudo code in Figure 5.2.

Figure 5.2 The pseudo code of Document Fingerprint selection process
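Since Figure 5.2 is reproduced only as an image, a minimal Java sketch of the selection steps listed above is given here for readability; it assumes the Map is already sorted in descending order of Value, and X and Y denote the planned size and the minimal occurrence requirement:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the selection steps, assuming sortedRemainingWords is already
// sorted in descending order of Value (occurrence).
public class FingerprintSelectionSketch {

    static Map<String, Integer> selectFingerprint(Map<String, Integer> sortedRemainingWords,
                                                  int x, int y) {
        Map<String, Integer> documentFingerprint = new LinkedHashMap<>();
        int read = 0;
        for (Map.Entry<String, Integer> entry : sortedRemainingWords.entrySet()) {
            if (read++ >= x) {
                break; // step 1: read at most X entries from the top of the sorted Map
            }
            if (entry.getValue() > y) {
                // step 3a: the Value is greater than Y, keep the entry
                documentFingerprint.put(entry.getKey(), entry.getValue());
            }
            // step 3b: otherwise the entry is ignored
        }
        return documentFingerprint; // step 4
    }
}
```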

As a result, the document fingerprint selection standard is completely defined as follows.

In this study, the planned size X of the document fingerprint and the minimal occurrence requirement Y are defined as independent linear functions (f and g) of the quantity of the remaining words R:

X = f(R)
Y = g(R)

R is a non-negative integer, and both X and Y are expected to be non-negative integers as return values of the functions.

By substituting the applied linear functions into the relationships above and summarizing, the document fingerprint is described as a tuple of functions of the quantity of the remaining words in the document.

Document Fingerprint = {X, Y} = {f(R), g(R)}


5.2.4 Experimental Extraction Method The key part of the prototype performance analysis is to compare, analyze and evaluate the execution results of applying the document fingerprints. The resource consumption, response time and accuracy are the key factors taken into the analysis. Since the project aims to find the optimal solution to document search, different versions of the fingerprint extraction method are necessary.

Version Complete: Document Fingerprint satisfies {X, Y} = {R, 0} This version does not actually apply the selection standard, because all the remaining words are used for creating the document fingerprint. The planned size is the quantity of the remaining words, i.e. X = R, and no minimal occurrence requirement is set up, which means that Y = 0. The defined functions f and g are not used in this version.

Version Feature Full: Document Fingerprint satisfies {X, Y} = {f(R), g(R)} This is the standard feature version, which applies the selection standard described above. The linear function f(R) is the standard planned size of the document fingerprint, and g(R) decides the standard minimal occurrence requirement. This method produces a fingerprint that is much smaller than the Complete version.

Version Feature Half: Document Fingerprint satisfies {X, Y} = {f(R)/2, g(R)·2} This version is an adjusted form of Feature Full and also applies the selection standard. It halves the planned size but doubles the minimal occurrence requirement of the document fingerprint. The method is intended to generate a document fingerprint that is about half the length of the fingerprint produced by Feature Full.

Version Feature Quarter: Document Fingerprint satisfies {X, Y} = {f(R)/4, g(R)·4} This is also an adjusted form of the standard feature version and applies the selection standard. It quarters the planned size and quadruples the minimal occurrence requirement of the document fingerprint. The document fingerprint produced by this method is approximately a quarter of the length of that produced by Feature Full.

The last three versions are named Feature because they work as features extracted from the remaining words of the original document.


5.3 Prototype Environment Analysis In this section, the execution environment of the prototype is introduced and clarified.

5.3.1 Research Objective The project has two objectives to achieve as follows.

• To develop a prototype of the document-based search program, which returns the data entry that has the highest similarity to the input document.
• To find the optimal solution to the document search program.

5.3.2 Experiment Material Introduction The experimental subject is a set of PDF documents, which is provided by Loredge. The PDF documents are sorted into two folders. The files that have the same name in the folders are the same articles but in different formats, according to Loredge’s explanation.

• folder Author contains 100 articles named from 001 to 100
• folder Publish contains 100 articles named from 001 to 100

In this project, there are two different databases, a document database and a feature database, as shown in Figure 4.6. The folder Author itself works as the document database, and the documents within the folder are used to create the feature database. The PDF documents in the folder Publish are the input documents used as search statements against the feature database created from Author. Linear search is applied in this project in order to obtain all possible data for further research.

After the prototype of the search program is implemented, several document-reading experiments are performed. The results of these experiments show that there are several special documents in the two folders.

Folder Author

• Documents 008 and 024 are almost the same: 008 is a draft version while 024 is a formal version of the same article.
• Document 058 is partly unreadable by Apache PDFBox.


Folder Publish

• Documents 008 and 024 are identical.
• Document 077 is completely unreadable by Apache PDFBox.

These special documents have the following consequences:

• If the input document is either 008 or 024, both data entries 008 and 024 in the database Author are correct search results.
• Input document 077 has no correct answer in the database Author.
• Input document 058 has no correct answer in the database Author.

In order to achieve the first objective of the project, measuring the similarity between the documents, the Cosine Similarity introduced in section 4.2.3 is applied. The returned similarity is in the range from 0% to 100%, where 0% means that the documents are totally different and 100% means that they are identical.

For the second objective, the document fingerprint is proposed in section 4.2.1.2.3, and the related extraction method is introduced in section 5.2.3. Furthermore, four versions of the method are described in section 5.2.4. These methods are applied not only to process the input search statement but also to create the feature database (Figure 4.6). The feature database Author is a document folder that contains text files. Applying the document fingerprint aims to improve the search process, especially the document reading and the document comparison.

The performance analysis focuses on comparing and analyzing the program execution results generated when the different document fingerprint extraction methods (section 5.2.4) are applied, both to the input document and to the feature database. Thus, there are in total four scenarios to be analyzed, as shown in Figure 5.3 below.

Scenario Name   Applied Document Fingerprint Extraction Method
F-Complete      Version Complete
F-Full          Version Feature Full
F-Half          Version Feature Half
F-Quarter       Version Feature Quarter

Figure 5.3 Scenarios to be analyzed in this section


5.4 Accuracy Analysis The accuracy is definitely the most important factor that determines the performance of any type of search program, as users always want the correct search result, corresponding to the input search query.

In this project, the search program queries the feature database Author with an input PDF document from the folder Publish. All the correct search results are clearly pre-defined.

• Correct Search Result. The returned data entry from the feature database has the same name as the input PDF document.
• Wrong Search Result. The returned data entry from the feature database does not have the same name as the input PDF document.

Thus, the Accuracy Rate is calculated as follows:

$$\text{Accuracy Rate} = \frac{\text{Total Correct}}{\text{Total Correct} + \text{Total Wrong}} \times 100\%$$

Figure 5.5 is a multi-line chart that illustrates the correct and wrong answers under the four defined scenarios. The horizontal axis shows the file name of the input document, and the colored lines represent the respective defined scenarios. Each fluctuation of a line refers to a wrong answer, and the rest are correct answers. The details about the wrong answers (input document and wrong answer) are shown in Figure 5.6.


Figure 5.5 Fluctuation chart showing the correct/wrong answers under the four scenarios


Scenario     Input document   Output data entry
F-Complete   8                24
F-Complete   58               100
F-Complete   77               60
F-Full       8                24
F-Full       58               100
F-Full       77               88
F-Half       8                24
F-Half       16               82
F-Half       58               24
F-Half       77               88
F-Quarter    16               67
F-Quarter    24               8
F-Quarter    58               100
F-Quarter    77               88

Figure 5.6 Wrong search results under the four scenarios

Since there are several special documents in the folders, as introduced in section 5.3.2, the final accuracy rate must be calculated after considering the consequences of these documents. The accuracy rate of the prototype under the scenarios is shown in Figure 5.7.

Scenario Name   Accuracy Rate
F-Complete      100.00 %
F-Full          100.00 %
F-Half          99.00 %
F-Quarter       99.00 %

Figure 5.7 The accuracy rate under the scenarios in the final statistics

As can be seen from Figure 5.7, the accuracy is highest (100%) under both F-Complete and F-Full, while it is slightly lower (99%) under F-Half and F-Quarter. Nevertheless, all the applied document fingerprint extraction methods behave well with respect to accuracy. Taking the size of the fingerprints into the analysis, although F-Complete and F-Full have the same accuracy rate, the size of F-Complete is far larger than the other three. Furthermore, the size of F-Quarter is about half of F-Half, which in turn is almost half of F-Full, yet the gap among their accuracy rates is only 1%. In this project, the document fingerprint with the relatively best balance between size and accuracy is therefore F-Quarter, but the final answer depends on the fault tolerance that Loredge will decide in the future.


5.5 Response Time Analysis The response time is the time the search program takes to respond to a search request. It is one of the important factors that determine the performance of the designed document fingerprint extraction method and the developed program.

In this project, the whole folder Publish is used as a single search statement. In other words, the response time is measured from the moment the program starts to read the folder Publish, which contains 100 PDF documents, until the program returns the 100 data entries with the corresponding highest document similarities. Each response time measurement in this project is performed 50 times, and the average value is adopted as the final measured result to be shown and analyzed. Applying repeated measurements aims to reduce measurement error and leads to greater confidence in calculating an accurate average measurement [31]. The average value is used as the representation of the set of measured results, since it is more reliable and usually closest to the true value.
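As an illustration of this measurement procedure (a sketch only, with a dummy workload in place of the real search), the repeated timing and averaging could be implemented as follows:

```java
// Minimal sketch of how one response-time figure could be obtained: the same search
// run is repeated a number of times and the arithmetic mean of the elapsed times is reported.
// runSearch is a placeholder for one complete query of the folder Publish.
public class ResponseTimeSketch {

    static double averageResponseTimeSeconds(Runnable runSearch, int repetitions) {
        long totalNanos = 0L;
        for (int i = 0; i < repetitions; i++) {
            long start = System.nanoTime();
            runSearch.run();                       // one full search request
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) repetitions / 1.0e9; // mean elapsed time in seconds
    }

    public static void main(String[] args) {
        // Dummy workload instead of the real search, just to show the measurement pattern.
        double mean = averageResponseTimeSeconds(() -> Math.sqrt(Math.random()), 50);
        System.out.println("Average response time: " + mean + " s");
    }
}
```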

There are two main objectives of performing the response time analysis:

• Determining the performance of the applied document fingerprint.
• Proposing relevant response time improvement methods.

The whole response time analysis is performed from two perspectives: a global response time analysis and a local response time analysis.

Global Response Time Analysis (hereinafter referred to as the global analysis). In this analysis, the response time is measured and studied as a whole. It mainly studies the relationship between the response time and the size of the database. The global analysis is performed in two steps. First, a statistical model of the response time is developed; it is intended to estimate the response time for a considerably larger database. Then, by comparing the predicted response times under the four defined scenarios described in Figure 5.3, it is possible to determine the performance of the designed document fingerprint methods.

Local Response Time Analysis (hereinafter referred to as the local analysis). In contrast with the global analysis, this analysis focuses on studying the inner structure of the measured response time. The local analysis mainly aims to explore the possibilities of improving the response time. To achieve this objective, the whole measured response time is segmented into several separate partitions, so that more details can be observed and it becomes possible to determine whether each partition can be improved.


5.5.1 Global Response Time Analysis The Loredge database is designed to be large enough to store thousands or even millions of data entries, but the quantity of the given sample documents is quite limited. It is therefore important to simulate the response time the search program would take to work with a considerably larger database. The simulation is a prediction based on all the existing sample data, and the result of the prediction is used for determining whether the developed program performs well. This section contains three sequential phases, as shown in Figure 5.8: measuring the response time based on the existing data, predicting the response time according to the measured results, and performing comparisons among the obtained predicted response times.

Data Measurement → Prediction → Analysis Conclusion

Figure 5.8 The process of Global Response Time Analysis

5.5.1.1 Response Time Measurement Two things are varied in the response time measurement: the first variable is the size of the database Author, and the second is the defined scenario. According to the description of the experimental subject in section 5.3.2, the database contains 100 data entries, which are the document fingerprints generated from the original PDF documents in the folder Author. In other words, the maximal available database size in this research is 100. It is not enough to measure the response time only once; to be able to make predictions about the response time for a large database, more data points are required. Thus, the response time measurement is performed several times on the database Author with different sizes. The database size is adjusted to 10, 20, 30, ..., 100 data entries. For each database size, the sample data entries are randomly selected in order to reduce or even eliminate experimental sampling bias and noise [30]. In the performed experiment, each response time is measured under a circumstance that is a combination of one scenario and one database size, and the final response time for one circumstance is the mean value of the set of response times measured under that circumstance.


Based on the 10 different database sizes and the 4 scenarios, 40 average response times are obtained after the measurements. The average response times are shown in Figure 5.9. The horizontal axis refers to the database size, and the vertical axis shows the response time in seconds. The different colors of the data points refer to the different defined scenarios.


Figure 5.9 The variation of the response time with the size of the database. The response time is measured in seconds, and the size of the database refers to the quantity of data entries.

It is manifest from Figure 5.9 that the response time grows with the database size under all four scenarios: the greater the database size, the longer the response time. The gap between the response time under F-Complete and the other three scenarios widens considerably with the growth of the database size. It is difficult to distinguish the other three scenarios from each other, since the gaps among their data points are barely noticeable in most cases.


5.5.1.2 Response Time Prediction In order to be able to predict the response time, a statistical model that mathematically describes the generation of the data points in Figure 5.9 is required. The statistical model involves independent and dependent variables. The dependent variable is the one to be predicted, and the independent variable is the one used to predict the dependent variable. It is clear that the response time is the dependent variable and the database size is the independent variable in this project. Thus, the required statistical model is a mathematical function that describes the relationship between the dependent variable Response Time and the independent variable Database Size:

Response Time = function(Database Size)

Since the response time changes with the database size, the statistical model to be inferred is a trend estimation. The trend could simply be drawn by eye through the sets of data points in Figure 5.9, but in order to make the prediction more accurate and reliable, linear regression, a statistical technique, is applied. Since there is only one independent variable (Database Size) in the model, the applied trend estimation is a Simple Linear Regression, which is defined in section 2.4.1.2. With the simple linear regression, the data points are modeled as follows:

$$Y = \beta X + \varepsilon$$

This is a simple linear function that describes the relationship between the variables X and Y. Y is the response time, which is the outcome to be predicted, and X represents the database size, which is the known input variable. β and ε are two coefficients to be estimated: β is the slope of the line and ε is the vertical intercept. In order to fit the obtained data points to the model, the Linear Least Squares approach is applied. The approach finds the solution that minimizes the sum of squared differences between the data points and the fitted values of the model. The approach is simple but effective for this optimization problem, especially when the required statistical model is a linear function.
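As an illustration of the least squares fit (a sketch with hypothetical measurement values, not the project's measured data), the closed-form estimates of β and ε can be computed as follows:

```java
// Minimal sketch of fitting the simple linear model Y = βX + ε with the
// linear least squares approach, using the closed-form estimates of the
// slope β and the intercept ε.
public class LeastSquaresSketch {

    // Returns {beta, epsilon} for the best-fitting line through the points (x[i], y[i]).
    static double[] fitLine(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double covXY = 0.0, varX = 0.0;
        for (int i = 0; i < n; i++) {
            covXY += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
        }
        double beta = covXY / varX;            // slope
        double epsilon = meanY - beta * meanX; // vertical intercept
        return new double[]{beta, epsilon};
    }

    public static void main(String[] args) {
        // Hypothetical measurements: database sizes and response times in seconds.
        double[] size = {10, 20, 30, 40, 50};
        double[] time = {28.4, 31.0, 33.5, 36.1, 38.6};
        double[] fit = fitLine(size, time);
        System.out.println("y = " + fit[0] + "x + " + fit[1]);
    }
}
```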



Figure 5.10 The estimated trend lines under four defined scenarios. The estimation is based on the data points in the Figure 5.9. The applied approach is Linear Least Squares.

According to Figure 5.10, all the estimated trends of the response time grow linearly, as expected. F-Complete is the scenario with the longest response time, regardless of the database size. While the response time of F-Complete increases strongly, the other three have lower rates of increase. As the database size grows, the gaps among the three scenarios become wider and more noticeable; for the same database size, F-Full lies above F-Half, which in turn lies above F-Quarter. The estimated line functions that describe the relationship between the response time and the database size are given in Figure 5.11 below.

Estimated Trend Line   Linear Function (y = Response Time, x = Database Size)
F-Complete             y = 2.53E-01x + 2.59E+01
F-Full                 y = 1.21E-01x + 2.53E+01
F-Half                 y = 1.16E-01x + 2.52E+01
F-Quarter              y = 1.09E-01x + 2.54E+01

Figure 5.11 Statistical models of the trend line estimation of the response time in seconds

In order to show the response time for a larger database, the obtained trend lines in Figure 5.10 are extrapolated. Figure 5.12 below is a data table that predicts the response time for database sizes from 10 up to one trillion (10^12) data entries.

Predicted Response Time under the scenarios (seconds)
Database Size   F-Complete   F-Full     F-Half     F-Quarter
1.00E+01        2.84E+01     2.65E+01   2.63E+01   2.64E+01
1.00E+02        5.12E+01     3.74E+01   3.68E+01   3.63E+01
1.00E+03        2.79E+02     1.47E+02   1.41E+02   1.34E+02
1.00E+04        2.56E+03     1.24E+03   1.19E+03   1.12E+03
1.00E+05        2.53E+04     1.22E+04   1.16E+04   1.09E+04
1.00E+06        2.53E+05     1.21E+05   1.16E+05   1.09E+05
1.00E+07        2.53E+06     1.21E+06   1.16E+06   1.09E+06
1.00E+08        2.53E+07     1.21E+07   1.16E+07   1.09E+07
1.00E+09        2.53E+08     1.21E+08   1.16E+08   1.09E+08
1.00E+10        2.53E+09     1.21E+09   1.16E+09   1.09E+09
1.00E+11        2.53E+10     1.21E+10   1.16E+10   1.09E+10
1.00E+12        2.53E+11     1.21E+11   1.16E+11   1.09E+11

Figure 5.12 The predicted response times for the selected set of database sizes

5.5.1.3 Global Analysis Conclusion Combining the selected data in Figure 5.12 with the predicted linear functions in Figure 5.10 shows that as the database size grows, the contribution of the coefficient ε to the response time becomes negligible; in other words, the response time depends mainly on the coefficient β as the database size goes to infinity. The coefficients β of the linear functions are respectively 2.53E-01, 1.21E-01, 1.16E-01 and 1.09E-01. Thus, the ratio among the response times under the four scenarios is 100:48:46:43 for a very large database. A simple conclusion is drawn: the scenario F-Complete has the longest response time, around twice that of the other scenarios under any circumstance, while there are no such large gaps among the other three. All the document fingerprint methods in section 5.2.4 that apply the selection standard clearly and strongly reduce the response time, but a further reduction of the size of the document fingerprint provides only a marginal further reduction of the response time. For the same very large database size, from shortest to longest response time, the scenarios are sorted as follows:

F-Quarter (43) < F-Half (46) < F-Full (48) < F-Complete (100)


5.5.2 Local Response Time Analysis The local analysis is carried out after the global analysis. In contrast with the global analysis, the local one focuses on more specific and detailed aspects. The main purpose of the local response time analysis is to explore the possibility of improving the response time. To achieve this objective, a deeper observation and study of the obtained and predicted response times is required. The local analysis is performed in three stages, as shown in Figure 5.13 below.

Response Time Partition → Partition Analysis → Improvement Proposition

Figure 5.13 The process of Local Response Time analysis

First, the response time partition is performed: the whole search process is divided into several independent partitions, and the response time of each partition is measured. Afterwards, the partition response times are analyzed and evaluated. Both longitudinal and crosswise comparisons are carried out in order to gain a better understanding of each partition, and several problems are clarified by performing the partition analysis from several perspectives. Finally, relevant improvements for each partition are identified.

5.5.2.1 Response Time Partition This is the key part of the local analysis. The whole search program is broken up into four partitions according to the search process, which consists of reading a PDF document, pre-processing the document, querying the database and returning the result. Each partition has a clear task and all the partitions are independent. The response time of each partition is measured. The response time partition is shown in Figure 5.14.

Response Time: Reading-Time, Processing-Time, Match-Time, Result-Time

Figure 5.14 The response time partition


5.5.2.2 Partition Analysis In this section, all the partition times are analyzed and evaluated. The partition analysis focuses on clarifying the following problems:

• Which part of the response time varies with the database size?
• Which part of the response time varies with the applied document fingerprint?
• What causes the gaps between the measured response times?
• What is the importance and proportion of each partition in the whole search process?

The partitions are reading-time, processing-time, match-time and result-time.

Reading-Time The reading-time refers to the time the search program takes to read in the PDF document using the imported Apache PDFBox. It is measured from the moment the program starts to read the address of the PDF document until the program has extracted the content of the document. Since the reading process has nothing to do with either the database or the applied document fingerprint, a simple hypothesis is made: the reading-time is assumed to be a fixed value under any circumstance in this project. The partition analysis starts with measuring the reading-time and the whole response time.

Measured Reading-Time
              F-Complete              F-Full                  F-Half                  F-Quarter
database      Reading    %response    Reading    %response    Reading    %response    Reading    %response
10            2.30E+01   81.12%       2.32E+01   85.92%       2.29E+01   85.93%       2.27E+01   86.03%
20            2.30E+01   74.88%       2.29E+01   82.68%       2.28E+01   83.01%       2.26E+01   83.03%
30            2.27E+01   68.71%       2.26E+01   79.38%       2.26E+01   79.90%       2.26E+01   79.84%
40            2.30E+01   63.87%       2.27E+01   76.62%       2.27E+01   77.07%       2.28E+01   77.41%
50            2.31E+01   59.43%       2.32E+01   73.73%       2.31E+01   74.27%       2.31E+01   74.55%
60            2.28E+01   55.84%       2.29E+01   71.44%       2.31E+01   71.89%       2.34E+01   72.24%
70            2.37E+01   52.98%       2.38E+01   69.23%       2.37E+01   69.71%       2.38E+01   70.11%
80            2.25E+01   49.95%       2.28E+01   67.21%       2.27E+01   67.60%       2.27E+01   67.93%
90            2.31E+01   47.96%       2.35E+01   65.95%       2.32E+01   66.40%       2.25E+01   66.27%
100           2.32E+01   45.02%       2.36E+01   62.63%       2.38E+01   63.30%       2.34E+01   63.75%

Figure 5.15 The data table shows the measured reading-time in seconds and the percentage of the response time that the reading-time accounts for, per database size and scenario.

According to Figure 5.15, which shows the measured reading-time, all the reading-times fluctuate marginally and irregularly around 2.30E+01 seconds as the database size grows. Figure 5.16 below is a bar chart that graphically illustrates the reading-time from Figure 5.15 together with the response time from Figure 5.9.



Figure 5.16 The bar chart that compares the whole response time with the reading-time

According to Figure 5.16, the proposed hypothesis is confirmed: the reading-time in this project is essentially a fixed value, which varies with neither the database size nor the document fingerprint.

Figure 5.17 The percentage change of the reading-time's share of the whole response time, as the database size grows


Figure 5.17 above shows how the share of the whole response time constituted by the reading-time changes with the growth of the database size under the defined scenarios. The proportion of the reading-time decreases as the database size grows under all the scenarios; as the database size increases towards infinity, the reading-time will account for only a small proportion. A short conclusion is drawn: the reading-time is essentially a fixed value, because it depends only on the performance of Apache PDFBox, the file reading of Java and the rest of the program execution environment. The only ways to improve this partition time are to import other libraries to work with the PDF documents and to execute the program in a better system environment.

Processing-Time The processing-time refers to the time the program takes to pre-process the document and extract the document fingerprint with the defined method. This is the partition time measured after the reading process. The document pre-processing clearly does not depend on the database, since the process is performed before the program connects to the database. Thus, the processing-time is assumed to vary with the applied document fingerprint, not with the database size.

Measured Processing-Time
              F-Complete                F-Full                    F-Half                    F-Quarter
database      Processing   %response    Processing   %response    Processing   %response    Processing   %response
10            2.58E+00     9.11%        2.58E+00     9.57%        2.59E+00     9.75%        2.62E+00     9.93%
20            2.57E+00     8.34%        2.54E+00     9.18%        2.53E+00     9.21%        2.49E+00     9.13%
30            2.48E+00     7.52%        2.53E+00     8.89%        2.49E+00     8.80%        2.54E+00     8.94%
40            2.49E+00     6.91%        2.50E+00     8.44%        2.50E+00     8.48%        2.48E+00     8.39%
50            2.55E+00     6.54%        2.61E+00     8.29%        2.57E+00     8.26%        2.58E+00     8.33%
60            2.50E+00     6.12%        2.54E+00     7.92%        2.56E+00     7.99%        2.60E+00     8.02%
70            2.64E+00     5.92%        2.69E+00     7.82%        2.70E+00     7.95%        2.70E+00     7.94%
80            2.45E+00     5.43%        2.50E+00     7.38%        2.50E+00     7.42%        2.50E+00     7.47%
90            2.51E+00     5.21%        2.59E+00     7.26%        2.57E+00     7.36%        2.47E+00     7.29%
100           2.55E+00     4.95%        2.65E+00     7.04%        2.63E+00     7.01%        2.61E+00     7.09%

Figure 5.18 The data table shows the measured processing-time in seconds and the percentage of the response time that the processing-time accounts for, per database size and scenario.

According to Figure 5.18, the processing-time varies within a narrow range around 2.50E+00 seconds, and the fluctuation shows no clear regularity. Thus, this partition time does not depend on the database. F-Complete has the shortest processing-time, followed by F-Quarter, F-Half and F-Full in second, third and fourth place respectively.


[Bar chart: Comparison (Response Time, Processing-Time). Y-axis: time in seconds (0 to 6.00E+01); X-axis: database sizes 10–100 under F-Complete, F-Full, F-Half and F-Quarter.]

Figure 5.19 The bar chart that compares the whole response time with the processing-time

Figure 5.19 above plots the measured processing-time from Figure 5.18 and the response time from Figure 5.9 in the same bar chart. Unlike the reading-time shown in Figure 5.16, the processing-time accounts for a very low proportion of the whole response time. Neither the fluctuation of the processing-time nor the processing-time itself has a significant effect on the whole response time. Analogously to the reading-time, as the database size grows towards infinity, the proportion of the processing-time approaches zero. In summary, the assumption is partly verified: the processing-time in this project is a fixed value. It varies only marginally with the applied document fingerprint extraction method, and the variation is so small and irregular that it can be ignored. Thus, this partition is not the factor that causes the gaps among the measured response times under the scenarios. The scope for improving this partition time is quite limited in this project. Running the search program in a better system environment (both hardware and software) would certainly shorten the processing-time. The other option is to apply a higher-quality fingerprint extraction method, but the relationship between the extraction method and the processing-time remains to be studied and analyzed.


Match-Time
The match-time is the time the program takes to query the database. This partition time is measured from the moment the program starts to read the first data entry from the local database until it has generated a list that contains the measured document similarities between the input document and all data entries in the database.

Measured Match-Time (seconds) and its share of the response time

Database   F-Complete             F-Full                 F-Half                 F-Quarter
size       Match       %resp.     Match       %resp.     Match       %resp.     Match       %resp.
10         2.77E+00    9.77%      1.21E+00    4.50%      1.15E+00    4.32%      1.07E+00    4.03%
20         5.16E+00    16.77%     2.25E+00    8.13%      2.14E+00    7.78%      2.13E+00    7.83%
30         7.84E+00    23.76%     3.34E+00    11.72%     3.20E+00    11.30%     3.18E+00    11.21%
40         1.05E+01    29.21%     4.43E+00    14.93%     4.26E+00    14.44%     4.19E+00    14.19%
50         1.32E+01    34.02%     5.66E+00    17.98%     5.43E+00    17.46%     5.31E+00    17.11%
60         1.56E+01    38.04%     6.61E+00    20.62%     6.45E+00    20.11%     6.40E+00    19.73%
70         1.84E+01    41.10%     7.90E+00    22.94%     7.58E+00    22.33%     7.45E+00    21.94%
80         2.01E+01    44.61%     8.60E+00    25.40%     8.40E+00    24.97%     8.23E+00    24.59%
90         2.25E+01    46.81%     9.56E+00    26.78%     9.17E+00    26.24%     8.97E+00    26.43%
100        2.58E+01    50.03%     1.14E+01    30.32%     1.12E+01    29.68%     1.07E+01    29.15%

Figure 5.20 The data table shows the measured match-time in seconds and the percentage of the response time that the match-time occupies, per database size under the four scenarios.

[Bar chart: Comparison (Response Time, Match-Time). Y-axis: time in seconds (0 to 6.00E+01); X-axis: database sizes 10–100 under F-Complete, F-Full, F-Half and F-Quarter.]

Figure 5.21 The bar chart that compares the whole response time with the match-time


According to Figure 5.20, the match-time clearly increases with the growth of the database size. For the same database size, the F-Complete has the longest match-time, roughly twice as long as under the other scenarios, while the gaps among the other three are relatively small. The match-time ranking obtained from Figure 5.20 is F-Complete > F-Full > F-Half > F-Quarter. It is evident from Figure 5.21 that the growth rates of the response time and the match-time are quite similar. A deeper and more detailed study of the relationship between the response time, the match-time and the database size is therefore performed.

The match-time is assumed to increase linearly under all scenarios. Hence, the linear function that describes the relationship between the database size and the match-time is:

MatchTime = β · DatabaseSize + ε

After substituting the data from Figure 5.20 into the model and performing the linear regression, the estimated linear functions are obtained as shown in Figure 5.22.

Estimated linear functions (M = Match-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    M = 2.52E-01x + 3.26E-01
F-Full        M = 1.10E-01x + 7.56E-02
F-Half        M = 1.07E-01x + 1.83E-02
F-Quarter     M = 1.03E-01x + 7.11E-02

Figure 5.22 Estimated functions between the match-time in seconds and the database size.
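The regression itself is ordinary least squares over ten (database size, match-time) pairs per scenario. The sketch below is a minimal, self-contained way to reproduce such an estimate in Java; the class and method names are illustrative and not taken from the prototype, and the example data are the F-Quarter match-times from Figure 5.20.

```java
// Ordinary least-squares fit of y = beta * x + epsilon.
public final class LinearTrend {

    /** Returns {beta, epsilon} for the best-fitting line through (x[i], y[i]). */
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        double beta = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double epsilon = (sumY - beta * sumX) / n;   // intercept
        return new double[] { beta, epsilon };
    }

    public static void main(String[] args) {
        // Match-times (seconds) for F-Quarter from Figure 5.20, database sizes 10..100.
        double[] size = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        double[] time = {1.07, 2.13, 3.18, 4.19, 5.31, 6.40, 7.45, 8.23, 8.97, 10.7};
        double[] fit = fit(size, time);
        System.out.printf("M = %.3e * x + %.3e%n", fit[0], fit[1]);  // close to Figure 5.22
    }
}
```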

The next step is to compare the slopes of the estimated match-time functions with the slopes of the estimated response time functions shown in Figure 5.9.

             Slope (Response Time)   Slope (Match-Time)   Slope Ratio
F-Complete   2.53E-01                2.52E-01             100 : 99.65
F-Full       1.21E-01                1.10E-01             100 : 90.21
F-Half       1.16E-01                1.07E-01             100 : 92.05
F-Quarter    1.09E-01                1.03E-01             100 : 94.95

Figure 5.23 Comparing the slopes of the estimated response time and match-time functions


The result of the comparison is shown in Figure 5.23 above. As the database size approaches infinity, the ratio between the response time and the match-time approaches the ratio between the slopes of the corresponding functions. The Slope Ratio is calculated as slope(Response Time) : slope(Match-Time), and its second term therefore also indicates the percentage of the whole response time that the match-time constitutes for a very large database under each scenario.

A general conclusion is drawn. The match-time varies with both the database size and the applied document fingerprint, and the variation is mainly driven by the database size. As the database size increases, the percentage of the whole response time constituted by the match-time approaches more than 90% under every scenario. Thus, the match-time is by far the most important partition and the one that most strongly affects the response time.

To be able to propose an improvement method for this partition, a further partition of the match-time is required. This further partition is called the “second partition” and is based on the set of phases in the match process shown in Figure 5.24. In order to distinguish the first partition from the second partition, a small “m” is put before the second partition names.

Match-Time

mReading-Time mProcessing-Time mCalculation-Time mResult-Time

Figure 5.24 The further partition on the match-time.

mReading-Time
The mReading-time refers to the time the program takes to read the data entries in the local database during the match process. This partition time is measured from the moment the program starts to read the address of the first data entry until it has finished reading all the data entries into memory. The assumption about this partition is that a larger database causes a longer mReading-time, since there are more data entries to be read. Likewise, the more information a data entry contains, the longer the mReading-time becomes.


Measured mReading-Time (seconds) and its share of the match-time

Database   F-Complete             F-Full                 F-Half                 F-Quarter
size       mReading    %match     mReading    %match     mReading    %match     mReading    %match
10         2.12E+00    76.47%     1.19E+00    98.21%     1.14E+00    99.17%     1.06E+00    99.57%
20         3.91E+00    75.66%     2.21E+00    98.18%     2.12E+00    99.16%     2.12E+00    99.60%
30         5.98E+00    76.28%     3.28E+00    98.17%     3.17E+00    99.18%     3.17E+00    99.63%
40         8.03E+00    76.25%     4.35E+00    98.16%     4.22E+00    99.20%     4.17E+00    99.63%
50         1.01E+01    76.17%     5.55E+00    98.15%     5.39E+00    99.18%     5.29E+00    99.62%
60         1.19E+01    76.56%     6.49E+00    98.16%     6.40E+00    99.19%     6.37E+00    99.62%
70         1.41E+01    76.50%     7.75E+00    98.19%     7.52E+00    99.22%     7.43E+00    99.64%
80         1.55E+01    77.11%     8.45E+00    98.23%     8.34E+00    99.24%     8.20E+00    99.66%
90         1.72E+01    76.58%     9.37E+00    98.05%     9.09E+00    99.22%     8.94E+00    99.67%
100        1.99E+01    77.13%     1.12E+01    98.22%     1.11E+01    99.23%     1.07E+01    99.67%

Figure 5.25 The data table shows the measured mReading-time in seconds and the percentage of the match-time that the mReading-time occupies, per database size under the scenarios.

[Bar chart: Comparison (Match-Time, mReading-Time). Y-axis: time in seconds (0 to 3.00E+01); X-axis: database sizes 10–100 under F-Complete, F-Full, F-Half and F-Quarter.]

Figure 5.26 The bar chart that graphically presents and compares the measured match-time and mReading-time shown in Figure 5.25


According to Figure 5.25, the mReading-time increases substantially with the growth of the database. The mReading-time of the F-Complete is far longer than its counterparts in the other three scenarios for every database size, while the gaps among the other three scenarios are relatively small. As can be seen from Figure 5.26, the measured mReading-time always accounts for a large proportion of the match-time. A linear trend estimation of the mReading-time as a function of the database size is made for each scenario; the estimated functions are shown in Figure 5.27.

Estimated linear functions (Mr = mReading-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    Mr = 1.94E-01x + 1.76E-01
F-Full        Mr = 1.08E-01x + 7.42E-02
F-Half        Mr = 1.06E-01x + 1.62E-02
F-Quarter     Mr = 1.03E-01x + 6.86E-02

Figure 5.27 Trend functions between the mReading-time in seconds and the database size.

             Slope (Match)   Slope (mReading)   Slope Ratio
F-Complete   2.52E-01        1.94E-01           100 : 77.16
F-Full       1.10E-01        1.08E-01           100 : 98.17
F-Half       1.07E-01        1.06E-01           100 : 99.25
F-Quarter    1.03E-01        1.03E-01           100 : 99.68

Figure 5.28 Comparison of the slopes of the mReading-time and match-time functions.

According to the slope ratios between the match-time and mReading-time functions shown in Figure 5.28 above, it follows that as the database size increases towards infinity, the mReading-time under the scenarios occupies respectively 77.16%, 98.17%, 99.25% and 99.68% of the match-time. Summing up, the mReading-time is by far the most important partition and the one with the greatest effect on the match-time. The F-Quarter is the document fingerprint with the best performance in this partition. The factors that decide the performance of the mReading process are the file-reading technique, the database size and the system environment, which leaves little room for improvement within the program itself. A document fingerprint extraction method that generates a smaller fingerprint gives a shorter mReading-time. A more effective way of reading local files in Java would also help to reduce the time the program takes to read a file, but that is a subject for future work. Alternatively, applying an efficient database system is always an effective way to reduce the mReading-time.
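For illustration, the mReading step amounts to loading every fingerprint file of the local feature database into memory. The sketch below shows one way to do this with the standard Java file API; the folder path, the "*.txt" layout and the class name are assumptions for this example, not the prototype's actual code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Reads every fingerprint file in a local feature-database folder into memory.
public final class FeatureDatabaseReader {

    public static Map<String, String> readAll(Path featureDir) throws IOException {
        Map<String, String> fingerprints = new LinkedHashMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(featureDir, "*.txt")) {
            for (Path entry : entries) {
                // File name identifies the data entry; content is its fingerprint text.
                String content = new String(Files.readAllBytes(entry), StandardCharsets.UTF_8);
                fingerprints.put(entry.getFileName().toString(), content);
            }
        }
        return fingerprints;
    }
}
```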


mProcessing-Time
The mProcessing process is carried out after the mReading process. This partition time is the time the program takes to process the input document fingerprint and the database fingerprint (the read data entry) so that the similarity can be measured.

Measured mProcessing-Time (seconds) and its share of the match-time

Database   F-Complete               F-Full                   F-Half                   F-Quarter
size       mProcessing  %match      mProcessing  %match      mProcessing  %match      mProcessing  %match
10         6.39E-01     23.10%      1.93E-02     1.59%       7.58E-03     0.66%       2.69E-03     0.25%
20         1.23E+00     23.89%      3.68E-02     1.63%       1.43E-02     0.67%       5.15E-03     0.24%
30         1.83E+00     23.29%      5.50E-02     1.65%       2.11E-02     0.66%       7.33E-03     0.23%
40         2.45E+00     23.31%      7.35E-02     1.66%       2.76E-02     0.65%       9.51E-03     0.23%
50         3.10E+00     23.37%      9.45E-02     1.67%       3.57E-02     0.66%       1.24E-02     0.23%
60         3.58E+00     22.99%      1.10E-01     1.66%       4.25E-02     0.66%       1.51E-02     0.24%
70         4.23E+00     23.05%      1.30E-01     1.65%       4.84E-02     0.64%       1.66E-02     0.22%
80         4.52E+00     22.45%      1.39E-01     1.62%       5.25E-02     0.63%       1.83E-02     0.22%
90         5.17E+00     22.98%      1.72E-01     1.80%       5.98E-02     0.65%       1.92E-02     0.21%
100        5.79E+00     22.43%      1.86E-01     1.63%       7.12E-02     0.64%       2.32E-02     0.22%

Figure 5.29 The data table shows the measured mProcessing-time in seconds and the percentage of the match-time that the mProcessing-time occupies, per database size under the scenarios.

According to Figure 5.29, the mProcessing-time clearly increases strongly with both the database size and the size of the document fingerprint. In contrast, the proportion of the mProcessing-time shows only a very indistinct fluctuation. In the same manner as before, a linear trend estimation of the mProcessing-time is performed to determine the importance of this partition for a very large database.

Estimated linear functions (Mp = mProcessing-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    Mp = 5.65E-02x + 1.47E-01
F-Full        Mp = 1.85E-03x + 1.40E-05
F-Half        Mp = 6.77E-04x + 8.19E-04
F-Quarter     Mp = 2.19E-04x + 8.86E-04

Figure 5.30 Trend functions between the mProcessing-time in seconds and the database size.

Taking the slopes of the estimated linear functions that describe the relationship between the database size and the mProcessing-time in Figure 5.30 above, and comparing them with the slopes of the estimated match-time functions, gives the result shown in Figure 5.31 below.


             Slope (Match)   Slope (mProcessing)   Slope Ratio
F-Complete   2.52E-01        5.65E-02              100 : 22.40
F-Full       1.10E-01        1.85E-03              100 : 1.69
F-Half       1.07E-01        6.77E-04              100 : 0.63
F-Quarter    1.03E-01        2.19E-04              100 : 0.21

Figure 5.31 Comparing the slopes of the match-time and mProcessing-time functions

According to Figure 5.31, as the database size approaches infinity, the mProcessing-time occupies respectively 22.40%, 1.69%, 0.63% and 0.21% of the match-time under the scenarios. Except for the F-Complete, the proportions of the mProcessing-time under the other three scenarios are very small in comparison with the mReading-time. The F-Complete obviously has the longest mProcessing-time, because its document fingerprint contains the most high-frequency words to be processed. It is clear that the shortest document fingerprint yields the shortest mProcessing-time.

mCalculation-Time
This partition time is the time the program takes to calculate the document similarity between the two vectors by applying the Cosine Similarity. The result of the mCalculation-time measurement is shown in Figure 5.32 below.

Measured mCalculation-Time (seconds) and its share of the match-time

Database   F-Complete                F-Full                    F-Half                    F-Quarter
size       mCalculation  %match      mCalculation  %match      mCalculation  %match      mCalculation  %match
10         9.94E-03      0.36%       9.33E-04      0.08%       6.47E-04      0.06%       5.26E-04      0.05%
20         1.98E-02      0.38%       1.73E-03      0.08%       1.22E-03      0.06%       9.90E-04      0.05%
30         2.94E-02      0.37%       2.53E-03      0.08%       1.69E-03      0.05%       1.29E-03      0.04%
40         3.97E-02      0.38%       3.31E-03      0.07%       2.18E-03      0.05%       1.71E-03      0.04%
50         5.00E-02      0.38%       4.37E-03      0.08%       2.87E-03      0.05%       2.25E-03      0.04%
60         5.89E-02      0.38%       4.72E-03      0.07%       3.19E-03      0.05%       2.49E-03      0.04%
70         7.00E-02      0.38%       5.59E-03      0.07%       3.48E-03      0.05%       2.61E-03      0.04%
80         7.60E-02      0.38%       5.89E-03      0.07%       3.71E-03      0.04%       2.72E-03      0.03%
90         8.73E-02      0.39%       6.66E-03      0.07%       3.77E-03      0.04%       2.56E-03      0.03%
100        9.72E-02      0.38%       7.30E-03      0.06%       4.60E-03      0.04%       3.12E-03      0.03%

Figure 5.32 The data table shows the measured mCalculation-time in seconds and the percentage of the match-time that the mCalculation-time occupies, per database size under the scenarios.


According to Figure 5.32 above, the mCalculation-time is proportional to the database size, but it occupies only a very small part of the match-time in every case. The fluctuation of its percentage, in contrast, decreases as the database size grows. The F-Complete has the longest mCalculation-time, and there is no large gap among the other three. In order to gain a better understanding of the relationship between the mCalculation-time and the match-time, a trend estimation is carried out. The estimated functions that describe the relationship between the mCalculation-time and the database size are shown in Figure 5.33, and the slope ratio between the mCalculation-time and the match-time is shown in Figure 5.34.

Estimated linear functions (Mc = mCalculation-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    Mc = 9.65E-04x + 7.70E-04
F-Full        Mc = 7.02E-05x + 4.45E-04
F-Half        Mc = 4.11E-05x + 4.78E-04
F-Quarter     Mc = 2.69E-05x + 5.47E-04

Figure 5.33 Trend functions between the mCalculation-time in seconds and the database size.

             Slope (Match)   Slope (mCalculation)   Slope Ratio
F-Complete   2.52E-01        9.65E-04               100 : 0.38
F-Full       1.10E-01        7.02E-05               100 : 0.06
F-Half       1.07E-01        4.11E-05               100 : 0.04
F-Quarter    1.03E-01        2.69E-05               100 : 0.03

Figure 5.34 Comparing the slopes of the match-time and mCalculation-time functions

The shorter document fingerprint gives the better performance in this partition, but there are two reasons why further improvement is unnecessary and hardly even possible. Firstly, analogously to the mProcessing-time, the mCalculation-time takes up a smaller proportion as the database grows, and that proportion becomes small enough to be ignored, as shown in Figure 5.34. More importantly, the similarity calculation applies the Cosine Similarity, whose expression is already completely defined, so there is no effective way to control or improve the calculation itself. The performance of this partition is entirely determined by the programming language and the system environment.


mResult-Time
The mResult process is the last partition of the match process. It refers to the time the program takes to put all the measured document similarities into a Java Map.
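To make the step concrete, the sketch below collects similarities into a Map and then picks the entry with the highest similarity (the later result step). The class name, method name and example entries are purely illustrative and not taken from the prototype.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Stores measured similarities in a Map and selects the best-matching entry.
public final class SimilarityResult {

    public static Map.Entry<String, Double> bestMatch(Map<String, Double> similarities) {
        // The entry with the highest similarity value is the returned answer.
        return Collections.max(similarities.entrySet(), Map.Entry.comparingByValue());
    }

    public static void main(String[] args) {
        Map<String, Double> similarities = new HashMap<>();
        similarities.put("document_16.txt", 0.81);   // hypothetical entries
        similarities.put("document_55.txt", 0.61);
        similarities.put("document_95.txt", 0.93);
        System.out.println(bestMatch(similarities)); // document_95.txt=0.93
    }
}
```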

Measured mResult-Time (seconds) and its share of the match-time

Database   F-Complete            F-Full                F-Half                F-Quarter
size       mResult     %match    mResult     %match    mResult     %match    mResult     %match
10         8.44E-04    0.03%     5.97E-04    0.05%     5.68E-04    0.05%     5.51E-04    0.05%
20         1.61E-03    0.03%     1.07E-03    0.05%     1.07E-03    0.05%     9.96E-04    0.05%
30         2.21E-03    0.03%     1.49E-03    0.04%     1.43E-03    0.04%     1.41E-03    0.04%
40         3.71E-03    0.04%     2.25E-03    0.05%     2.18E-03    0.05%     2.11E-03    0.05%
50         4.84E-03    0.04%     2.77E-03    0.05%     2.69E-03    0.05%     2.59E-03    0.05%
60         5.83E-03    0.04%     3.30E-03    0.05%     3.26E-03    0.05%     3.31E-03    0.05%
70         6.72E-03    0.04%     3.90E-03    0.05%     3.78E-03    0.05%     3.90E-03    0.05%
80         6.59E-03    0.03%     3.79E-03    0.04%     3.93E-03    0.05%     3.80E-03    0.05%
90         6.43E-03    0.03%     4.85E-03    0.05%     4.46E-03    0.05%     4.21E-03    0.05%
100        9.53E-03    0.04%     5.61E-03    0.05%     5.68E-03    0.05%     4.99E-03    0.05%

Figure 5.35 The data table shows the measured mResult-time in seconds and the percentage of the match-time that the mResult-time occupies, per database size under the scenarios.

According to Figure 5.35, the main trend under the four scenarios is that the mResult-time grows with the database size. However, the percentage of the match-time that the mResult-time constitutes fluctuates within a very small interval in all circumstances. Thus, the mResult-time is assumed not to account for a larger proportion of the match-time as the database size increases.

Estimated linear functions (Me = mResult-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    Me = 8.72E-05x + 3.51E-05
F-Full        Me = 5.37E-05x + 1.14E-05
F-Half        Me = 5.30E-05x - 1.18E-05
F-Quarter     Me = 4.88E-05x + 1.04E-04

Figure 5.36 Trend functions for the relationship between the mResult-time and the database size.

The estimated trend functions in Figure 5.36 verify the assumption: the slopes (growth rates) of the functions are small in comparison with the other partitions, which means that improvement of this partition can be ignored.


Result-Time
In this process, the program sorts the obtained Map of measured document similarities and returns the entry with the highest document similarity as the response to the search query. The result of the measurement is recorded in Figure 5.37 below.

Measured Result-Time (seconds) and its share of the response time

Database   F-Complete            F-Full                F-Half                F-Quarter
size       Result      %resp.    Result      %resp.    Result      %resp.    Result      %resp.
10         1.15E-03    0.00%     7.24E-04    0.00%     7.08E-04    0.00%     6.78E-04    0.00%
20         1.36E-03    0.00%     9.52E-04    0.00%     9.26E-04    0.00%     8.60E-04    0.00%
30         1.82E-03    0.01%     1.34E-03    0.00%     1.29E-03    0.00%     1.21E-03    0.00%
40         2.46E-03    0.01%     1.92E-03    0.01%     1.84E-03    0.01%     1.79E-03    0.01%
50         2.74E-03    0.01%     2.34E-03    0.01%     2.33E-03    0.01%     2.19E-03    0.01%
60         2.94E-03    0.01%     2.49E-03    0.01%     2.47E-03    0.01%     2.44E-03    0.01%
70         3.33E-03    0.01%     2.88E-03    0.01%     2.87E-03    0.01%     2.82E-03    0.01%
80         3.27E-03    0.01%     2.83E-03    0.01%     2.77E-03    0.01%     2.69E-03    0.01%
90         3.41E-03    0.01%     3.02E-03    0.01%     2.93E-03    0.01%     2.64E-03    0.01%
100        3.70E-03    0.01%     3.55E-03    0.01%     3.47E-03    0.01%     3.43E-03    0.01%

Figure 5.37 The data table shows the measured result-time in seconds and the percentage of the response time that the result-time occupies, per database size under the four scenarios.

As can be seen from Figure 5.37 above, both the value and the proportion of the result-time increase as the database size grows. A trend line estimation of the result-time is performed in the same manner, applying the linear statistical model.

Estimated linear functions (R = Result-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    R = 2.87E-05x + 1.04E-03
F-Full        R = 3.05E-05x + 5.26E-04
F-Half        R = 3.00E-05x + 5.09E-04
F-Quarter     R = 2.90E-05x + 4.78E-04

Figure 5.38 Trend functions for the relationship between the result-time and the database size.

The slopes of the functions shown in Figure 5.38 show that the gaps among the four scenarios become smaller as the database grows towards infinity; in other words, no scenario performs clearly better than the other three. On the other hand, the result-time accounts for an even smaller proportion than all the other partition times analyzed in this section. Hence, improving this partition is also less effective and unnecessary.


5.5.2.3 Local Analysis Summarization
After performing the response time partition and the related partition analyses, an in-depth understanding of the response time is obtained. According to the results of the partition analyses, all partitions are sorted into two categories, fixed partition time and flexible partition time, as shown in Figure 5.39.

Fixed Partition Time:
• reading-time
• processing-time

Flexible Partition Time:
• match-time
• mReading-time
• mProcessing-time
• mCalculation-time
• mResult-time
• result-time

Figure 5.39 Categories of the partition times: database-independent (fixed) and database-dependent (flexible).

All the flexible partition times increase with the growth of the database size, while there is no correlation between the fixed partition time and the database.

The reading-time is an absolutely fixed partition time, which means that it varies with neither the database size nor the applied document fingerprint extraction method. The average reading-time is 2.30E+01 seconds. Only a few factors affect the reading-time: Apache PDFBox is the library imported in this project to work with the PDF documents, and Java is the only programming language. The only possible improvement of this partition is to apply another library for working with PDF documents and extracting their content.

The processing-time is a fixed partition time. The average processing-times under the scenarios are respectively 2.53E+00, 2.57E+00, 2.57E+00 and 2.56E+00 seconds. The performance of this partition depends on the applied document fingerprint extraction method, and the gaps among the average values are quite small. The detailed relationship between the processing-time and the document fingerprint needs to be studied further in order to propose an improvement for this partition time.


The match-time is definitely the most important partition of the whole response time. As shown in Figure 5.23, as the database size approaches infinity, the match-time accounts for more than 90% of the whole response time under every scenario. After making the secondary partition and the related analyses, several more detailed insights into the match-time are gained.

1. The greatest partition of the match-time is definitely the mReading-time, the time the program takes to read the data entries in the database. As the database size increases towards infinity, the proportion of the response time that the mReading-time occupies under each scenario is respectively 76.20%, 88.56%, 91.31% and 94.60%, according to Figure 5.23 and Figure 5.28.

2. Even though all the other secondary partition times (mProcessing, mCalculation and mResult) vary directly with the database size, their proportions do not increase with the database size. Furthermore, they are small enough to be ignored in comparison with the mReading-time.

Based on these insights, the improvement effort should focus on the following areas.

Applied Database. In this project, local document folders are used as the database, since they are simple to modify and maintain. But there is no guarantee that this kind of database is also efficient. Thus, applying a mature, high-quality database system is very likely to reduce the time the program takes to communicate with the database. The applied database can be either a database server or a database management system.

Searching Algorithm. This project applies a linear search algorithm with linear time complexity O(n). There are two reasons why the linear search is applied: firstly, the local database is unsorted; furthermore, all document similarities between the input file and the data entries are wanted for further research. The advantage of the linear search is its low memory consumption and error rate, but its biggest disadvantage is its time complexity, which is higher than that of many other search algorithms. Applying a search algorithm with better time complexity would clearly reduce the database working time, but the feature database must then be sorted in some way.

The result-time is also a flexible partition time that varies directly with the database size. By the same token, both the value and the proportion of the result-time are small enough (as shown in Figure 5.37) to be ignored in every case, even though both of them grow with the database. Furthermore, the performance of this partition is mainly determined by the database size and the system environment. Hence, a relevant improvement of this partition is practically impossible and unnecessary.


5.6 Memory Consumption Analysis
As described in Figure 4.6, the server-side database (the Loredge database) consists of a document database (hereinafter referred to as the OPDF-database) that stores the original PDF documents, and a feature database that contains the document fingerprints extracted from the documents in the OPDF-database. In this section, the memory consumption of the database is analyzed. The analysis aims to give a better understanding of the designed extraction method, determining its performance and proposing possible improvements. The size of the OPDF-database is a fixed constant, namely the size of the document folder Author provided by Loredge, whereas the cost of the feature database is determined by the document fingerprint. In the search for an optimal solution, four kinds of document fingerprints are applied in the project. The F-Complete uses all the remaining words extracted from the original document to create the document fingerprint, while the F-Full, F-Half and F-Quarter apply the document fingerprint selection standard (section 5.2.3) to the remaining words to generate shorter but more representative document fingerprints. As their names suggest, the F-Quarter is designed to be half the length of the F-Half, which in turn is half the length of the F-Full, and the F-Complete is far larger than all three.

The memory measurement is carried out by measuring the sizes of the feature databases for the folder Author under the defined document fingerprint extraction methods (F-Complete, F-Full, F-Half and F-Quarter) and the size of the OPDF-database. The measurement unit of the memory consumption is the byte, a unit of digital information consisting of eight bits.
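How the folder sizes were measured is not specified in the thesis; the sketch below is one way such a measurement could be reproduced in Java. The folder path in the example is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Sums the sizes (in bytes) of all regular files under a database folder.
public final class FolderSize {

    public static long sizeInBytes(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // "feature-db/F-Quarter" is a hypothetical folder name, not the thesis's actual path.
        System.out.println(sizeInBytes(Paths.get("feature-db/F-Quarter")) + " bytes");
    }
}
```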

Memory Cost Measurement

Database Name            Size in bytes
Original PDF Document    257,927,992
F-Complete               1,609,340
F-Full                   89,040
F-Half                   39,602
F-Quarter                14,707

Figure 5.70 Result of the database size memory measurement in bytes

The result of the memory measurement is presented in Figure 5.70 above. The measured sizes of the databases are respectively 257,927,992 bytes, 1,609,340 bytes, 89,040 bytes, 39,602 bytes and 14,707 bytes. The size of the OPDF-database is obviously far larger than the sizes of all the feature databases.


Two factors cause this large gap. The first and most important factor is the difference in file format. The files in the OPDF-database are in PDF format, which is much more complex than the plain text format used in the feature databases. Even when a PDF file and a text file contain exactly the same content, their sizes differ greatly; for example, a file containing a single character is 145,556 bytes as a PDF but 1 byte as a text file. The second factor is the difference in content. The document fingerprint is plain text consisting of words extracted from the original document after the pre-processing, while the PDF file contains the complete document, which may include other elements besides plain text, e.g. figures and multimedia.

The size of the Loredge database is determined by one of the feature databases together with the OPDF-database. According to Figure 5.70, the cost of the feature databases is tiny compared with the cost of the OPDF-database. More specifically, applying the F-Complete, F-Full, F-Half or F-Quarter feature database expands the total size by 0.6239%, 0.0345%, 0.0154% and 0.0057% respectively. The feature databases, especially those that apply the document fingerprint selection standard, obviously do not impose a large extra burden. The result also conforms to the design requirements (section 5.2.3) of the document fingerprint extraction method. The ratio of the memory consumption between the F-Complete, F-Full, F-Half and F-Quarter is 1 : 0.0553 : 0.0246 : 0.0091. It shows that the F-Complete is far larger than the other three, and that the F-Full is more than double the F-Half, which in turn is more than double the F-Quarter.

To sum up, the memory consumption of any of the feature databases is very small compared with the OPDF-database. Thus, applying the document fingerprint is an acceptable improvement method. Among the feature databases, the shorter the document fingerprint is, the less memory/space the feature database needs. In this project, the F-Quarter has the best performance in memory consumption, since its extraction method aims to generate the shortest document fingerprint that still represents the original document.


5.7 Similarity Analysis
In this section, all the obtained document similarities are analyzed in order to gain a better understanding of the developed prototype. The analysis aims to provide a solid foundation for further research on performance improvement.

5.7.1 Problem Statement
As defined in the project, the returned data entry needs to satisfy only one requirement: having the highest document similarity to the input PDF document. The catch is that the highest similarity is not necessarily the most suitable, or even a correct, answer. For example, a poor 1% similarity may be the highest one and therefore be returned as the final answer, simply because there is no better alternative.

According to the conclusion drawn from the response time analysis in section 5.5, the mReading-time taken to access the feature database occupies by far the largest proportion of the whole response time as the database grows towards infinity, since the linear search algorithm used in this project iterates over the whole database. In the extreme case, even if the first processed data entry has a 100% similarity to the input document, the program still iterates over the database and processes all the remaining data entries.

The first problem is quite easy to solve. Like most search engines, the program can return what it believes is the correct answer and let the user decide whether to accept the result or not; even though 1% is a very low similarity, it is the best answer found in the database. The second problem is more serious. The unnecessary time spent working with the database directly causes a longer response time, which is not acceptable and leads to poor search performance. Thus, finding an effective method to avoid the redundant database access is important and urgent in practice.

5.7.2 Analysis Statement
A further study of the measured document similarities under the four defined scenarios is performed to solve these problems. Instead of measuring and sorting all similarities, an appropriate statistical model is inferred to process and filter the measured similarities. The model aims to improve the search performance by both reducing the response time and making the search result high-quality and reliable. The model sets up three intervals for the document similarity, which is measured between the input document and a data entry in the database. Based on the pre-defined outcome range (from 0% to 100%) of the Cosine Similarity, which is described in


section 4.2.3.3, the three intervals are continuous and together make up the whole outcome range, as shown in Figure 5.41. Each interval implements a functionality that processes the data entry based on its measured similarity.

Cosine Similarity Outcome Range

Ignorable Interval | Pending Interval | Acceptable Interval

Figure 5.41 The outcome range of the Cosine Similarity consists of three intervals.

Ignorable Interval. This interval ranges from 0% to C1, which is the lower boundary of the document similarity. A data entry is directly ignored and not processed further if its measured similarity falls in this interval. The interval is intended to prevent unreliable results by filtering out the data entries whose similarity to the input document is too low.

Acceptable Interval. This interval is set up to avoid redundant database access. More precisely, the search process is immediately terminated and the corresponding data entry is returned as the result when the measured similarity falls in the range from C2 to 100%. In contrast to C1, C2 is the upper boundary of the document similarity.

Pending Interval. A data entry is given pending status if its measured similarity lies in the range from C1 to C2. The entry and its similarity are temporarily put into a list, which is sorted after the whole match process. This interval is applied to the similarities that are neither Ignorable nor Acceptable.

Determining the most suitable ranges for the three intervals is the subject studied in this section. In other words, the two boundaries C1 and C2 need to be inferred and verified. The inference is based on objective evidence, namely the document similarities measured under the four defined scenarios.
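For illustration, the proposed filtration mechanism can be sketched as below. C1 and C2 are the lower and upper boundaries inferred later in this section; the class, the method names and the vector representation of the fingerprints are assumptions made for this example, not the prototype's actual code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed interval-based filtration during the match process.
public final class FilteredSearch {

    private final double lowerBoundary;  // C1: below this, the entry is ignored
    private final double upperBoundary;  // C2: at or above this, return immediately

    public FilteredSearch(double c1, double c2) {
        this.lowerBoundary = c1;
        this.upperBoundary = c2;
    }

    /** Returns the name of the best matching entry, or null if nothing exceeds C1. */
    public String search(Map<String, double[]> database, double[] inputFingerprint) {
        Map<String, Double> pending = new HashMap<>();
        for (Map.Entry<String, double[]> entry : database.entrySet()) {
            double sim = cosineSimilarity(inputFingerprint, entry.getValue());
            if (sim >= upperBoundary) {
                return entry.getKey();             // Acceptable: stop searching
            }
            if (sim >= lowerBoundary) {
                pending.put(entry.getKey(), sim);  // Pending: decide after the scan
            }                                      // else Ignorable: skip entirely
        }
        return pending.isEmpty() ? null
                : Collections.max(pending.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    private static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```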

5.7.3 Overall Distribution Analysis
To infer and verify the upper boundary C2 and the lower boundary C1, an overall similarity analysis is performed. The document similarities between all the input files (the PDF documents in the folder Publish) and all the data entries (the document fingerprints in the feature database Author) are measured. Each input document has 100 corresponding document similarities under each scenario, so the total number of document similarities per scenario is 100 * 100 = 10,000.


[Scatter plot: Similarity distribution of all measured document similarities. Y-axis: similarity (0% to 100%); X-axis: input documents 0–100; series: F-Complete, F-Full, F-Half, F-Quarter.]

Figure 5.42 The distribution of all the measured document similarities.

According to Figure 5.42, which plots all the measured document similarities under the four scenarios, most data points lie below 60% but above 10%, while the points are relatively sparse in the other areas under every scenario. It follows that the designed fingerprint distinguishes the similarities in an effective way. The similarities over 90% are assumed to be useful and of high quality, while the similarities below 10% are relatively useless. Hence, the following analysis sets up two boundaries based on this overall distribution in order to infer the three intervals in a more effective and clear way.

Three terms are defined here and used in the following analysis to clarify the differences. HSIM is the highest of all document similarities measured between the input document and all data entries in the database; the data entry that has the HSIM to the input file is considered the correct answer and returned to the user, as defined in this project. CSIM is the document similarity measured between the input document and the data entry that has the same name; under ideal circumstances the CSIM is expected to be equal to the HSIM. SSIM refers to the second highest similarity measured between the input document and all data entries in the database.


5.7.4 Upper Boundary Inference
The upper boundary is intended to improve the search performance by avoiding unnecessary database access as far as possible. In contrast to C1, C2 is much easier to infer and verify. Under absolutely ideal conditions, C2 should be set to 100% in order to determine whether two documents are identical. But in this project, the sample PDF documents with the same name are only highly similar to each other, which means that the upper boundary needs to be adjusted downwards appropriately.

[Scatter plot: CSIM under the four scenarios. Y-axis: similarity (0% to 100%); X-axis: documents with the same name, 0–100; series: F-Complete, F-Full, F-Half, F-Quarter.]

Figure 5.43 The CSIM under the four defined scenarios.

According to Figure 5.43 above, there is no obvious difference or gap among the CSIM data points under the four scenarios, and most points lie in the range from 80% to 100%. It follows that the four scenarios can be discussed and analyzed as a whole. Since the CSIM is, in the best case, equal to the HSIM that is returned as the result, the upper boundary analysis focuses on setting up a suitable C2 that separates the CSIM data points with the highest similarities from all the available data points in Figure 5.42 as well as possible. Furthermore, the boundary is assumed to be located in the range where the points are most concentrated.


Although all similarity data points above the upper boundary C2 (hereafter called CAND points) are candidates that may be returned as the correct answer, there is no necessary connection between the CSIM points and the CAND points. Two cases have to be taken into consideration when evaluating the upper boundary: 1. the CAND points may include other data points than the CSIM points; 2. not all CSIM points are necessarily placed above the upper boundary. In other words, all the CSIM points become CAND points only under the absolute best conditions.

To externalize the quality of an applied upper boundary and to pick the most effective one, the effectiveness of the upper boundary is evaluated from two perspectives, which quantify the effectiveness of the applied C2.

Accuracy Tolerance Rate (ATR). This is the proportion of the number of CAND points (NCAND) that is made up of CSIM points (NCSIM). A higher ATR means that NCSIM and NCAND are closer to each other, so the boundary C2 classifies the data points with high quality and the model approaches the ideal condition. Conversely, a lower ATR means that the CAND points include more incorrect and redundant points, and that C2 does not classify well.

Result Selection Rate (RSR). This is the probability that a CSIM point is returned as the correct answer to the input document. It is measured by dividing NCSIM by the total number of possible returned data points (in Figure 5.42), which is a constant of 40,000 over the four scenarios. Thus, a high RSR means that the CSIM is very likely to be returned as the answer.

C2 (%)   NCSIM   NCAND   ATR       RSR
70       390     604     64.57%    0.98%
72.5     386     549     70.31%    0.97%
75       386     512     75.39%    0.97%
77.5     383     471     81.32%    0.96%
80       377     431     87.47%    0.94%
82.5     371     409     90.71%    0.93%
85       366     394     92.89%    0.92%
87.5     341     353     96.60%    0.85%
90       307     314     97.77%    0.77%
92.5     263     265     99.25%    0.66%
95       212     212     100.00%   0.53%
97.5     121     121     100.00%   0.30%
100      0       0       0.00%     0.00%

Figure 5.44 The NCSIM, NCAND, ATR and RSR for C2 in the range from 70% to 100%


Figure 5.44 above is a table that records the NCSIM and NCAND, and their corresponding ATR and RSR, for upper boundary values C2 in the range from 70% to 100%. It is clear from the table that both the NCSIM and the NCAND decrease as C2 increases, and that the NCAND decreases faster than the NCSIM. Once C2 passes a certain value, the NCSIM and NCAND become identical. Consequently, the ATR grows with C2 and reaches its peak of 100% when the NCSIM equals the NCAND. Conversely, the RSR decreases as C2 (and thereby the ATR) increases, because the RSR is directly proportional to the NCSIM as defined.

The conclusion drawn from Figure 5.44 is that inferring the upper boundary amounts to finding a balance between the return probability (RSR) and the return quality (ATR). More specifically, it becomes a conflict between the accuracy discussed in section 5.4 and the response time analyzed in section 5.5. A higher C2 helps to provide a more accurate result, but the response time is not necessarily reduced much, because fewer entries satisfy the stricter condition; a lower C2 reduces the response time noticeably, but the returned result probably has a lower quality. This project proposes a statistical model to find the best balance between accuracy and response time when setting the upper boundary C2. The final effectiveness of the applied boundary (EUB) is quantified as the product of the ATR and the RSR, as shown below.

EUB = ATR * RSR = (NCSIM / NCAND) * (NCSIM / 40,000) = NCSIM² / (40,000 * NCAND)
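As a check of the formula against the measured data: at C2 = 85%, Figure 5.44 gives NCSIM = 366 and NCAND = 394, so EUB = 366² / (40,000 * 394) ≈ 0.85%, which matches the value plotted for 85% in Figure 5.45.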

Effectiveness of the Upper Boundary (EUB) per applied C2:

C2 (%)   70      72.5    75      77.5    80      82.5    85      87.5    90      92.5    95      97.5    100
EUB      0.63%   0.68%   0.73%   0.78%   0.82%   0.84%   0.85%   0.82%   0.75%   0.65%   0.53%   0.30%   0.00%

Figure 5.45 Effectiveness of the applied upper boundary for C2 in the range from 70% to 100%


Figure 5.45 above is a bar chart showing the final statistical result for the EUB under the different applied upper boundaries C2. According to the figure, the trend of the EUB falls into two intervals. When C2 is at most 85%, the EUB increases slowly with the growth of C2 and reaches its highest point at 85%. After that, the EUB plummets towards 0 as C2 approaches 100%. Thus, the conclusion about the upper boundary is that it works most effectively, in the designed evaluation system, when C2 is set to 85%; at this value the accuracy and the response time are likely to be the most balanced.

5.7.5 Lower Boundary Inference
In comparison with the upper boundary, which filters the correct answers out of all possible answers, the lower boundary to be inferred filters out the answers that are definitely incorrect and useless. According to the previous section, the upper boundary C2 should be 85%, the most effective threshold for classifying the measured similarities. Technically, all data entries with similarities below 85% would be ignored under the most ideal conditions. But in reality there is no guarantee that all CSIM data points lie above 85%, as shown in Figure 5.43. This means that a buffer zone between the upper and lower boundaries is required to hold the CSIM points that cannot be returned in the Acceptable Interval; this area is the Pending Interval as defined. Hence, the first factor used to determine the lower boundary C1 is the set of CSIM data points that lie below the upper boundary C2. In other words, the lower boundary should at least be placed below the lowest such CSIM point (see Figure 5.46).

F-Complete F-Full F-Half F-Quarter Document CSIM Document CSIM Document CSIM Document CSIM 16 81.03% 5 83.92% 5 82.26% 5 77.01% 55 71.66% 55 70.56% 16 76.68% 7 72.28% 57 84.93% 57 84.53% 55 67.66% 16 78.84% 59 78.62% 59 81.09% 59 77.58% 22 79.82% 85 79.47% 95 81.19% 95 80.65% 42 80.91% 95 77.43% 51 84.29% 55 61.22% 59 70.98% 62 84.12% 95 77.74% Figure 5.46 The CSIM that are smaller 85% under the full scenarios.


The CSIM values that are smaller than 85% under each scenario are presented in Figure 5.46. Two special, negative exceptions (documents 58 and 77), described in section 5.3.2, are filtered out in order to make the inference more reliable and general. According to the figure, the lowest CSIM point is 61.22%, obtained for input document 55 under the F-Quarter. It follows that the first mandatory condition is C1 < 61.22%. Furthermore, according to the client requirements from Loredge, the program should return a result for every search query, even when the corresponding similarity is very low, and let the user decide whether to trust the result or not. Thus, the second factor used to infer C1 must ensure that the pending interval is wide enough to include at least one data point, so that the program can return a non-empty result in every case. In other words, the analysis of the second factor is performed on the situation where all the available CSIM/HSIM data points have been removed from Figure 5.42.

[Scatter plot: SSIM under the four scenarios. Y-axis: similarity (0% to 100%); X-axis: input documents 0–100; series: F-Complete, F-Full, F-Half, F-Quarter.]

Figure 5.47 The SSIM under the four defined scenarios


After removing the CSIM/HSIM points, the SSIM becomes the highest similarity to the input document under each scenario, as shown in Figure 5.47. To meet the client requirement, at the very least the SSIM should fall inside the pending interval. According to the figure, the lowest data point is 21.25%, belonging to document 100 under the F-Quarter. Thus, the second condition is C1 < 21.25%.

Combining the two inferred conditions, the lower boundary C1 should be set to no more than 21.25%, so that no CSIM data point is missed when there are CSIM values that are not returned in the Acceptable Interval, and so that a reliable result can still be returned when no CSIM data point is available. The upper boundary works most effectively for filtering out the correct answers, or more exactly for balancing the response time and the accuracy, when C2 is set to 85%.


Chapter 6 Discussion
This chapter presents a discussion of the performed project. Firstly, the applied methodology is discussed and evaluated. Then, the proposed problem statement is reviewed in order to evaluate the obtained results. After that, a summative evaluation is performed to determine whether the objectives of the project have been achieved.

6.1 Methodology and Consequence
The overall work of this project is to develop a reliable and effective PDF-document based search program to be used in the Loredge platform. The problem is stated in section 1.2, with the background introduced in section 1.1.2. This project seeks to present a feasible and complete prototype of the search program and to perform the relevant performance analysis and evaluation, which is intended to provide a meaningful foundation for further work in the field of digital document processing and searching.

6.1.1 Methodology of the Development
The whole project follows the traditional waterfall model, consisting of Problem Statement, Data Collection, Prototype Development and Prototype Analysis, while the prototype development applies an incremental development model that integrates a set of unit developments, each following Design, Implementation and Testing/Debugging. Both the qualitative research method, associated with the inductive approach, and the quantitative method, associated with the deductive approach, are applied in the project in order to better meet the client requirements.

The main reason for applying the waterfall model is that it makes the progress of the project easy to manage and control, since every phase has a specific deliverable and a review process [32], and the phases do not overlap. Furthermore, the project plan is developed only after the client requirements are clearly understood. The largest disadvantage is that the waterfall model has poor fault tolerance, since it cannot handle uncertain risks. Once the project is in the later phases, such as the performance analysis and evaluation, it is impossible to go back to the earlier phases, especially the client requirements and the prototype design, to make any modification.


The analysis and evaluation are entirely based on the results obtained from the previous phase, whether those results are correct or not.

The study is conducted in four phases. The first phase is the Problem Statement, which elicits the client requirements and performs the relevant analysis. The main approach is qualitative interviews, arranged several times with Loredge to clarify and deeply understand the client requirements: what they want the program to do and how they want it to behave.

The second phase is the Data Collection, involving the collection of the related experimental material and an analytical study of the PDF documents provided by Loredge. The sample PDF documents, the development environment and the statistical measurements are determined in this early phase of the project. The selection of the programming environment (both the language and the database) is based on analyzing the merits and shortcomings of several available alternatives. The core of this phase is the analytical study of the sample PDF documents, which are scientific and academic articles in English. This document study is a combination of qualitative and quantitative approaches: the qualitative approach provides a linguistic basis for filtering out irrelevant and redundant data to make the documents effectively comparable, while the quantitative approach explains statistically why these data are removed, by measuring the percentage of the whole document that they occupy. In view of the outcome, the combined approach helps to focus on the data that is most representative of the original document.

The third phase, the Prototype Development, applies an incremental development model consisting of Design, Implementation and Testing/Debugging. By dividing the client requirements into units/modules, each with a clearly defined functionality, the risk management and the program testing/debugging become easier and more flexible, because problems and risks are identified and handled early during each unit development. The most significant potential disadvantage of this approach is that the whole set of client requirements needs to be well defined in order to be broken down into pieces. The second shortcoming is the integration, a relatively complex process that takes a considerable amount of time to merge the modules and perform the related integration tests in this project.

The last phase, the Prototype Analysis, quantitatively presents and analyzes the performance of the designed document fingerprint and the developed prototype from several perspectives. The outcomes of this approach (the accuracy rate, the linear regression model predicting the response time, the memory comparison and the similarity boundary assumption) not only make up a practical and comprehensive performance analysis but also lay a solid foundation for further study. The main disadvantage is the time consumption: to make the results and conclusions more accurate, a considerably large amount of sample data must be processed. In this project, the response time analysis is by far the most time-consuming part; the response time measurement is

performed at least 20,000 times, each of which takes from a few minutes to a few tens of minutes. The second disadvantage is the shortage of available experimental data. This shortage makes it very difficult to confirm and verify the developed statistical models, because an obtained statistical model needs to be retested and refined several times with new data before an explicit conclusion can finally be drawn.

6.1.2 Methodology of the Evaluation
Both formative and summative evaluations are carried out in this project. The most important and necessary of the formative evaluations is the proactive evaluation, which is performed only once, after the Data Collection, together with Loredge in order to gain a better understanding of the client requirements and the collected material before designing the prototype. Since the prototype is not a large program and the whole requirement is broken up into pieces, all inputs, outputs and logical relationships are clearly defined. Consequently, the clarificative evaluation, which clarifies the intent of the program, becomes somewhat redundant; this kind of evaluation works more like pseudocode describing the logical structure of the program. The interactive evaluation is performed twice in this project. The first time, the code quality is improved during the unit integration by following a consistent style and removing redundant and duplicated code. The second time, the interactive evaluation is based on the result of the prototype analysis and aims to improve the performance of the developed prototype.

6.2 Development Environment Discussion
Java is the programming language selected for this project, used in the integrated development environment NetBeans. NetBeans is a well-structured IDE that provides a very comprehensive overview [33]. Its most important merit is that both Java and NetBeans are cross-platform, which is one of the non-functional requirements. Furthermore, NetBeans contains many powerful built-in plugins compared with other IDEs; during the implementation and debugging of the prototype, it provides hints on code quality optimization and exception handling.

Apache PDFBox is the Java library imported for working with the PDF documents. Its first advantage is that it is easy to install and use, which avoids spending too much time on learning new things. PDFBox is also reliable and stable: it identifies each thrown exception, provides a relevant solution, and did not crash even once during the whole program development.

The local document folders work as the database, and the text files within them are the data entries. This is the simplest and most effective arrangement for modification and maintenance. In addition, it is easy to back up the database by copying the folders, and the performance analysis of the database becomes easier.


6.3 Problem Statement Reiteration
In this section, the problem statement from section 1.2 is reiterated and combined with a review of the results obtained in this project.

6.3.1 Document Similarity Measurement Conclusion and Discussion

• How can a file that has the highest document similarity to an input PDF document be found in a very large document database?

The solution is the developed prototype. The key part of the prototype is measuring the document similarity between the input file and each data entry in the database with the Cosine Similarity, which measures the similarity between two non-zero vectors of the same dimension. All the data entries are then sorted by their measured similarity, and the one with the largest similarity is returned.

The Cosine Similarity is the only similarity measure applied in the project. Its main advantage is that it is easy to understand and implement, since the formula is already clearly defined. Its largest shortcoming is that it is a direction-based similarity measure: it only considers the angle between two vectors, while the magnitudes of the vectors are completely ignored. In an extreme case, the similarity between two documents can approach the maximum value even though there is a considerable gap between their sizes. According to the statistics of the document sizes, there is no necessary and consistent relation between the word counts of PDF documents with the same name in the folder Publish and the folder Author; it is therefore very difficult to take the document size into account in the similarity measurement. The second disadvantage of the Cosine Similarity is that it does not consider the content and meaning of the documents; in other words, the measurement is not conducted from a linguistic perspective. More precisely, the only factors taken into the measurement are the words and their occurrence counts, while the order and the grammatical properties of the words in the documents are irrelevant.
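For concreteness, a minimal sketch of the Cosine Similarity over two word-to-occurrence maps (the vector form implied by the fingerprints) is shown below; the class and method names are illustrative and not taken from the prototype.

```java
import java.util.Map;

// Cosine similarity between two documents represented as word -> occurrence maps.
public final class CosineSimilarity {

    public static double between(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        // Only words that occur in both documents contribute to the dot product.
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    private static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }
}
```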


6.3.2 Optimization Method Conclusion and Discussion

• In which way can the developed file search method be improved?

6.3.2.1 Document Pre-processing

Irrelevant and redundant data in the original documents are filtered out during the document reading process, both to reduce the amount of data to be compared and to improve the similarity quality, as shown in section 4.2.1.2. The selection of the data to be removed is a combination of subjective and objective research. The objective selection covers punctuation and non-English alphabets, both of which undoubtedly have little practical significance in an English article, while the removal of Arabic numerals and stop words is a subjective decision. Statistics of the document preprocessing show that it removes on average 57.52% of the total words in each PDF document in the folder Author, while the value is 53.87% in the folder Publish. It follows that this process substantially reduces the data used for the similarity measurement. However, the reduction can possibly be improved further. The most flexible variable is the stop-word list, since it is difficult to determine whether a word is of vital importance for the content of an article. The stop-word list used in the project is homemade and therefore incomplete and imperfect. In the future, a deeper and more comprehensive study that refines the list needs to be conducted. Overall, increasing the size of the stop-word list productively improves the document preprocessing.
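As an illustration of this step, the following is a minimal sketch of the preprocessing, assuming the text has already been extracted from the PDF; the stop-word list shown here is a tiny placeholder, not the list used in the project.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Keeps only English letters, lower-cases the text and drops stop words. */
public class Preprocessor {

    // Illustrative placeholder; the project used a larger, homemade stop-word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to", "in", "is"));

    public static List<String> preprocess(String rawText) {
        // Punctuation, Arabic numerals and non-English alphabets are replaced by spaces.
        String cleaned = rawText.toLowerCase().replaceAll("[^a-z]+", " ");
        List<String> words = new ArrayList<>();
        for (String w : cleaned.trim().split("\\s+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }
}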

6.3.2.2 Document Fingerprint

The document fingerprint is proposed in section 4.2.1.2.3 and described in detail in section 5.2. It is a unique representation and identification of the original document, consisting of a set of high-frequency words together with their occurrence counts. Its main purpose is to speed up the comparison by performing a faster feature comparison. The largest advantage of the designed fingerprint is that it is simple to create and use. The selection of the document fingerprint is a trade-off that balances response time, accuracy rate and memory consumption. According to the result of the performance analysis in Chapter 5, as the size of the document fingerprint decreases, the response time of the search process and the memory consumption of the database storage are substantially reduced. Although the accuracy rate of the search result declines gradually, it remains at a relatively high level, above 99%. The conclusion drawn is that F-Quarter, which generates the smallest document fingerprint, provides the best performance of the search program.
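A minimal sketch of the fingerprint extraction is given below: the preprocessed words are counted and only the most frequent ones are kept. The parameter topN stands in for the fingerprint size computed by the linear functions of section 5.2.2 (for example the F-Quarter scenario); a fixed parameter is used here only for illustration.

import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds a fingerprint as the topN most frequent words with their occurrences. */
public class FingerprintExtractor {

    public static Map<String, Integer> extract(List<String> words, int topN) {
        // Count the occurrences of each preprocessed word.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        // Keep only the topN most frequent words, highest occurrence first.
        Map<String, Integer> fingerprint = new LinkedHashMap<>();
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
              .limit(topN)
              .forEach(e -> fingerprint.put(e.getKey(), e.getValue()));
        return fingerprint;
    }
}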


Due to the limitation of the sample documents in terms of quantity and quality, the obtained results and conclusions are incomplete and limited. The reason is that the gaps among the accuracy rates under the four scenarios are not wide enough to support any feasible quantitative research, such as regression or classification analysis. The two main disadvantages of the designed document fingerprint are its simple composition and its subjective extraction method, which affect the response time and the applicability of the search program respectively. The simple composition is a double-edged sword: its merit is that the fingerprint is easy to create and use, while its shortcoming is ineffectiveness in the search process. The evidence is that the document fingerprint only contains data generated from the document itself; no extra label or search index is included. Consequently, a linear search must be conducted during the database access in order not to miss any data entry, and each fingerprint must be read completely during the similarity measurement. As a result, the response time suffers. In contrast with an objective content inference based on qualitative research, the key linear functions in the extraction method introduced in section 5.2.2 are largely subjective assumptions, which rely heavily on the number of pages and words in the PDF documents. It follows that the extraction method only works well when the size of the document reaches the values assumed in section 4.2.1.2.2. In other words, a document must be large enough to generate any fingerprint at all. This subjective assumption makes the program unreliable and uncertain under some conditions. Although this shortcoming does not appear in this project, because all the sample PDF documents satisfy the size requirement, there is no guarantee that all documents will meet the requirement in practical use of the search program.

6.3.2.3 Feature Database

This optimization is derived from the document fingerprint. By creating an extra feature database alongside the Loredge database, the database query process is permanently accelerated. The feature database is intended to store the document fingerprints that are extracted from the documents in the original PDF document database. The most important factor that affects the effectiveness of the feature database is its size. According to the memory consumption analysis in section 5.6, the feature database is far smaller than the original PDF database under any scenario. The preliminary conclusion is that the feature database is an effective optimization method that improves the search performance at low cost. However, since the cost of database creation, operation and maintenance is unknown, it is difficult to say that the feature database is definitely effective. Further study of the feature database should be an integrated research effort that covers all relevant areas.
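As a sketch of how such a feature database could be built on top of the existing folder-based storage, each fingerprint could be written once as a small text file of word-count pairs; the folder name and file format below are assumptions for illustration, not the layout used by Loredge.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Writes one fingerprint per document into a separate feature folder. */
public class FeatureDatabaseWriter {

    public static void store(String documentName, Map<String, Integer> fingerprint)
            throws IOException {
        Path featureDir = Paths.get("feature-db");      // hypothetical folder name
        Files.createDirectories(featureDir);
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : fingerprint.entrySet()) {
            lines.add(e.getKey() + " " + e.getValue()); // one "word count" pair per line
        }
        Files.write(featureDir.resolve(documentName + ".txt"), lines);
    }
}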


6.4 Summative Evaluation

As mentioned in Chapter 3, the summative evaluation is conducted at the end of the project to summarize the project by answering the following questions.

6.4.1 Outcome Evaluation

• Have the goals/objectives of the project been met?

The main purpose of the project stated in section 1.3 is generally fulfilled, since the development was conducted as planned and designed. Although the study is tentative and incomplete, Loredge is expected to benefit from it. The designed prototype, together with the proposed optimization methods, provides general but valuable insights and suggestions for developing a complete and effective PDF document based search engine for Loredge. Furthermore, the extensive experimental data that concretely presents the performance of the prototype lays a solid foundation for further analysis and development. When evaluating rationally and critically, however, it is difficult to claim that the client requirements are fully achieved. To elaborate, all the functional requirements are completely met, because both the output and the procedure of the program achieve the anticipated effect, while there is no effective method to examine the non-functional requirements. It follows that the performance of the result is restricted by the limitations and delimitations of the project; in particular, the response time analyzed in section 5.5 is still too long compared to existing search engines.

6.4.2 Impact Evaluation

• What is the overall impact of the project?

The major overall impact of this project on the IT industry is the framework and prototype of a document based search engine. The second impact is the designed feature database, which is placed on the server side and works as a table of contents for the original document database. Each search request queries the feature database after it arrives at the server side, and the query result then decides whether the original database needs to be queried further. The search performance optimization approach that applies a representative third-party database is not only feasible and effective in document based search, but also in all kinds of search that involve big data. Besides the two impacts mentioned above, the project also proposes a flexible and effective approach to preprocess a text-based document in order to obtain its more important and representative data. It is both an objective and a subjective method, which provides a new and general clue for document refinement and data mining.

6.4.3 Improvement Evaluation

• What resources will be needed in order to improve the program?

The largest weakness of the project is that it lacks sufficient resources (hardware and software) to retest and refine the prototype, due to both the limitations and the delimitations. Combining these limitations with the performance analysis in Chapter 5, there are several feasible methods and suggestions that could possibly optimize the program and lay a solid foundation for further development.

PDF Document Library

Apache PDFBox is the only Java library imported to work with PDF documents in this project. Because of the time limitation of the project, no other Java library was applied and compared to PDFBox. PDFBox affects the search performance mainly from two perspectives: the document reading time and the quality of the extracted content. In this project, the average time spent on reading in a PDF document is 0.24 seconds. Although the proportion of the reading time decreases as the database size grows, 0.24 seconds is still too long compared to a real-world search engine. Regarding content extraction, PDFBox does not successfully read and handle all the sample PDF documents; the failure rate is 1%.
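For reference, the following is a minimal sketch of how the reading step can be done with Apache PDFBox 2.x, loading a PDF file and extracting its plain text; it illustrates the library usage rather than reproducing the prototype's exact code.

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

/** Loads a PDF document and extracts its plain text with PDFBox. */
public class PdfReader {

    public static String readText(File pdfFile) throws IOException {
        // try-with-resources closes the document even if the extraction fails.
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }
}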

Database Application

The implemented database is definitely one of the most important factors that affect the search performance. The local text-file reading approach is simple but ineffective compared to a mature database management system. This impact is mainly reflected in the response time and the memory consumption.


According to the response time analysis in section 5.5, the database access time becomes by far the largest part of the whole response time as the database grows. On the other hand, the proposed feature database requires extra memory on the server-side database; the relevant analysis is presented in section 5.6. Applying a mature database management system, such as MySQL or Oracle, is possibly an effective approach to reduce both the data access time and the memory consumption of the feature database.

Database Access Approach

A complete linear search is the only search approach implemented in this project. The subjective reason is to not miss any available data entry in the database, so that all document similarities can be obtained for further study; the objective factor is that both the feature and the original document databases are constructed sequentially. The main disadvantage of the linear search is clearly its time complexity, since all elements must be visited. To reduce and optimize the database access, categorizing the data entries by labeling them with representative tags is a feasible and effective solution. A tag could be the type of the document, the size of the document or other metadata that represents the document; moreover, multi-tag labeling identifies and categorizes an entry with higher quality. A second approach is to apply a more effective search algorithm, such as binary search. However, this kind of search algorithm requires that all data entries in the database are pre-sorted in a certain order, so the cost of the pre-sorting must be taken into account in the total cost. Another solution is to set upper and lower boundaries for the measured document similarity, as analyzed in section 5.7. This approach improves the performance by returning an entry that has a sufficiently high similarity and ignoring entries that have a sufficiently low similarity.
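The sketch below illustrates such a similarity-based filtration during the linear scan, reusing the CosineSimilarity sketch from section 6.3.1; the boundary values are illustrative assumptions, not the values measured in section 5.7.

import java.util.List;
import java.util.Map;

/** Linear scan with an early return above the upper boundary and
 *  rejection of candidates below the lower boundary. */
public class FilteredSearch {

    private static final double UPPER_BOUND = 0.95; // assumed "high enough" similarity
    private static final double LOWER_BOUND = 0.10; // assumed "low enough" similarity

    public static int findBestMatch(Map<String, Integer> query,
                                    List<Map<String, Integer>> database) {
        int bestIndex = -1;
        double bestSimilarity = LOWER_BOUND;
        for (int i = 0; i < database.size(); i++) {
            double s = CosineSimilarity.similarity(query, database.get(i));
            if (s >= UPPER_BOUND) {
                return i;               // good enough: stop scanning immediately
            }
            if (s > bestSimilarity) {   // entries below the lower boundary are ignored
                bestSimilarity = s;
                bestIndex = i;
            }
        }
        return bestIndex;               // -1 if no entry exceeds the lower boundary
    }
}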

6.4.4 Sustainability Evaluation

• How sustainable is the project, and does it need to be continued in its entirety?

Research on document based search engines is absolutely sustainable. In contrast with the image-based search engine provided by Google several years ago, the document based one is still in its infancy, but it has bright market prospects.


This project is supposed to be carried out with minimal resources. By following a result-oriented development approach, the program aims at a high-quality search engine that focuses entirely on returning a correct result to the user. The successfully developed prototype indicates the minimal resource consumption required to implement such a search engine. The only uncertain variable that affects the sustainability of the project development is Apache PDFBox, which is an open source library; the gap in effectiveness between PDFBox and other feasible Java libraries needs to be studied and clarified for resource optimization. The search algorithm applied in the database access is also a key point that can improve the resource consumption in terms of both time and energy. Moreover, a program with high code quality implemented by Loredge's professional developers would further reduce the energy consumption. Looking at the shortcomings of the project, the part that consumed the most resources is clearly the performance analysis. During this phase, the developed program had to be tested and debugged thousands of times with various parameters and input arguments in order to obtain as much data as possible for accurate statistical analysis and for finding relevant optimal solutions. This is an unavoidable problem that all software developers face, since every newly developed program needs to be retested and refined in order to achieve the best performance. A general solution that alleviates the problem is to make the development plan as detailed and well-structured as possible.


References

[1]. Emelie Södergren, "I'm here to realize Loredge”, retrieved 2017-02-03 Availability at: https://www.kth.se/en/innovation/nyheter/jag-ar-har-for-att-forverkliga-loredge-1.663504

[2]. Adobe Systems Incorporated, PDF Reference, sixth edition, version 1.7, retrieved 2017-02-03, Availability at: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.

[3]. Adobe Systems Incorporated (2008), Public Patent License, ISO 32000-1: 2008 – PDF 1.7 (PDF), retrieved 2017-02-03, Availability at: https://www.adobe.com/pdf/pdfs/ISO32000- 1PublicPatentLicense.pdf

[4]. Multimedia and PDFs (Acrobat Pro DC), retrieved 2017-02-03 Availability at: https://helpx.adobe.com/acrobat/using/adding-multimedia-pdfs.html

[5]. HOW IS A PDF CREATED?, retrieved 2017-02-03, Availability at: http://www.investintech.com/resources/articles/createapdf/

[6]. Optical character recognition, retrieved 2017-02-27, Availability at: https://en.wikipedia.org/wiki/Optical_character_recognition

[7]. Oxford Dictionaries, second definition of Article, retrieved 2017-02-27, Availability at: https://en.oxforddictionaries.com/definition/article

[8]. DD2431 Machine Learning, "Support Vector Machines", retrieved 2017-01-12, Availability at: https://www.kth.se/social/files/57dbef14f2765420e74aaf42/08-linsep-handout.pdf

[9]. Jason Weston, "Support Vector Machine Tutorial", retrieved 2017-01-12, Availability at: http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf

[10]. Search algorithm, retrieved 2017-01-12, Availability at: https://en.wikipedia.org/wiki/Search_algorithm

[11]. Logarithmic time, Time complexity, retrieved 2017-01-12, Availability at: https://en.wikipedia.org/wiki/Time_complexity#logarithmic_time

[12]. Master's programme in Machine Learning, retrieved 2017-01-15, Availability at: https://www.kth.se/en/studies/master/machinelearning/description-1.48533

[13]. DD2431 Machine Learning, Linear Separation, retrieved 2017-01-15, Availability at: https://www.kth.se/social/files/57dbef14f2765420e74aaf42/08-linsep-handout.pdf

[14]. Linear regression, Applications of linear regression, retrieved 2017-02-22, Availability at: https://en.wikipedia.org/wiki/Linear_regression#Applications_of_linear_regression

[15]. Linear trend estimation, retrieved 2017-02-22, Availability at: https://en.wikipedia.org/wiki/Linear_trend_estimation#Goodness_of_fit_.28R-squared.29_and_trend


[16]. PDFBox, retrieved 2017-01-17, Availability at: https://pdfbox.apache.org/

[17]. Apache PDFBox and FontBox 1.0.0 released, retrieved 2017-01-17, Availability at: http://www.h-online.com/open/news/item/Apache-PDFBox-and-FontBox-1-0-0-released-932436.html

[18]. PDFBox Project Incubation Status, retrieved 2017-01-17, Availability at: https://incubator.apache.org/projects/pdfbox.html

[19]. PDFBox API Docs, retrieved 2017-01-17, Availability at: https://pdfbox.apache.org/docs/2.0.3/javadocs/

[20]. Given, Lisa M. (2008). The Sage encyclopedia of qualitative research methods. Los Angeles, Calif.: Sage Publications. ISBN 1-4129-4163-6, retrieved 2017-03-22,

[21]. Susan E. Wyse, "What is the Difference between Qualitative Research and Quantitative Research?", retrieved 2017-03-22, Availability at: http://www.snapsurveys.com/blog/what-is-the-difference-between-qualitative-research-and-quantitative-research/

[22]. Qualitative research, retrieved 2017-02-01, Availability at: https://en.wikipedia.org/wiki/Qualitative_research

[23]. John Dudovskiy, "Inductive Approach (Inductive Reasoning)", retrieved 2017-02-01, Availability at: http://research-methodology.net/research-methodology/research-approach/inductive-approach-2/

[24]. Margaret Rouse, "Definition prototype", retrieved 2017-04-01, Availability at: http://searchmanufacturingerp.techtarget.com/definition/prototype

[25]. William M.K. Trochim, Introduction to Evaluation, retrieved 2017-03-05, Availability at: http://www.socialresearchmethods.net/kb/intreval.php

[26]. Evaluation Toolbox, Formative evaluation, retrieved 2017-03-06, Availability at: http://evaluationtoolbox.net.au/index.php?option=com_content&view=article&id=24&Itemid= 125

[27]. Owen, 2007, Proactive Evaluation, retrieved 2017-03-06, Availability at: http://4p95salima.weebly.com/

[28]. Evaluation Terminology 2014, Section 3 Workbooks and Support, retrieved 2017-03-06, Availability at: http://www.fin.gov.nt.ca/sites/default/files/documents/section_3_workbooks_and_supports.pdf

[29]. Java SE8 documentation, Interface Map, retrieved 2017-03-01, Availability at: https://docs.oracle.com/javase/8/docs/api/java/util/Map.html

[30]. Saul McLeod, "Sampling Methods", retrieved 2017-04-18, Availability at: http://www.simplypsychology.org/sampling.html


[31]. Sandra Slutz, Kenneth L. Hess, "Increasing the Ability of an Experiment to Measure an Effect", retrieved 2017-05-01, Availability at: http://www.sciencebuddies.org/science-fair-projects/top_research-project_signal-to-noise-ratio.shtml

[32]. What is Waterfall model - advantages, disadvantages and when to use it?, retrieved 2017-02-06, Availability at: http://istqbexamcertification.com/what-is-waterfall-model-advantages-disadvantages-and-when-to-use-it/

[33]. Geertjan-Oracle, Top 5 Benefits of "NetBeans Platform for Beginners", Mar 14, 2014, retrieved 2017-02-06, Availability at: https://blogs.oracle.com/geertjan/entry/top_5_benefits_of_netbeans



TRITA-ICT-EX-2017:49
