Proceedings Template - WORD s3

DBMS Comparison in Relation to an Implemented Fuzzy Database Design Model

ABSTRACT terminology will be necessary before any of these paths can be A database is an application that facilitates the storage, explored. searching, indexing, and retrieval of an organized collection of information stored in files. A database management system 2. BASIC TERMINOLOGY (DBMS) is an application that manages and provides a user A database is an organized collection of information interface to a database. Existing applications using databases arranged for ease and speed of searching and retrieval [1]. This typically are written with a DBMS package already selected. As common definition does not fully encompass what a database is. the needs of the application increase, it becomes necessary to look A file could easily use the same definition. A database is an at other DBMS packages and determine either to continue using application that facilitates the storage, searching, indexing, and the current solution or change the underlying DBMS system. retrieval of an organized collection of information stored in files. Given the need for performance, efficiency, and the ability to A database management system (DBMS) is an application that integrate an English language front end, open source DBMS manages and provides a user interface to a database. packages need to be evaluated. This paper evaluates two different The DBMS interface allows the use of queries (or open source packages, MySQL and Apache Derby and compares questions) to be posed against the data held in the database. them to the current Microsoft SQL Server solution. It was Queries can be stored by the DBMS to speed up retrieval and save determined that for smaller data sets that performance and the user from typing commonly used queries. These stored efficiency impacts were negligible. As for the integration issue it queries (or procedures) can also be optimized for faster execution. becomes a matter of locating a solution that has the abilities being Stored procedures will be discussed in greater detail in section 4. sought. 3. CURRENT APPLICATION Categories and Subject Descriptors The University's School of Engineering houses the H.2.4 [Database Management]: Systems – Relational Databases Department of Computer Science's Database Research Group. This research group is currently researching a prototype fuzzy General Terms database with a natural language query interface [2]. The system Management, Measurement, Performance, Design, Reliability, allows a user to pose queries and retrieve images of faces based Experimentation. on certain selection criteria: shape of the subject's face and the color of their eyes. The query “Show all the people with slightly blue eyes,” can be interpreted differently by different Keywords communities. For instance, in a community where blue eyes are Database, Fuzzy, Fuzzy Database, Relational Database, DBMS, rare in general, the community may decide that anyone with any Database Management System, MySQL, Apache Derby, Derby, blue eyes is considered very blue, and that someone with hazel SQL Server, Microsoft SQL Server eyes is considered slightly blue. A community where blue eyes are common may believe that slightly blue is interpreted as 1. INTRODUCTION someone with pale blue or blue with a lower intensity. Database technology permeates every facet of modern Another example is when a user is posing the query life. It is used in cell phones for storing phone numbers, “Show all the people with a very wide face.” If the community addresses, meeting times, alarms, etc. It is used in satellite and consists of members who predominantly have thinner faces, those cable boxes to list shows and times. Databases are used in whose faces are generally longer in height than width, then that airports, on websites, for radio stations, and even in newer home community may believe that those with rounder faces are wider. appliances. Thanks to the Internet, the need to search a mountain In contrast, those with a round face may believe that anyone with of information quickly and efficiently and in a timely matter is of a thinner face but very wide nose constitutes a wide face. The utmost concern to users. With all of the technology available, prototype allows the system to “learn” community beliefs about how does one decide whether to stick with their current solution such terminology such as “Very Blue Eyes” or “Slightly Wide or to see if some other solution is even feasible? This is a study to Face.” determine whether or not changing the underlying database There is much information about a person’s face that platform will improve the performance of the Database Research can be stored. This information includes eye color, facial Group's application. length/width, nose length/width, and an image of the person. It is necessary to first look at the current solution, and Table 3.1 and Table 3.2 show two tables used in the current fuzzy then to compare it with at least one other possible solution. This implementation dealing specifically with eye color and image will either reinforce the belief in the current solution, or it will respectively. The fuzziness of the data lies in the weights used. suggest that another solution would be more beneficial. For instance, the image associated with ID 1, can be interpreted as Understanding all the advantages and disadvantages, along with slightly blue, medium brown, and very green. Each modifier all the difficulties that come with change, can provide the (slightly, medium, very) has a range of thresholds. These ranges foundation for a solid working solution that can evolve and will be covered in detail later on in this section. change as the needs of the model changes. A solid grasp of the Table 3.1. Eye Color Table after 1000 queries, the same person’s blue eye color value may change to 0.2, which drops it from the medium blue category and ID Color Weight moves it to the slightly blue category. 1 Blue 0.25 Table 3.3 Range Table 1 Brown 0.51 1 Green 0.76 2 Blue 0.77 2 Brown 0.11 High Low Modifier Threshold 2 Green 0.34 1 0.75 Very 0.87 0.74 0.30 Medium 0.52 The Table 3.2 Person Table 0.29 0 Slightly 0.20 current fuzzy ID Gender Image database system is running on a Dell GX260 with Microsoft Windows XP running Microsoft's SQL Server. The application 1 F Binary Image interface is written in Microsoft's .Net Studio using Visual Basic. 2 M Binary Image The natural language processing is performed using the English Language Front end for Access (ELF). [3] ELF facilitates the use of the fuzzy modifiers and synonyms used in the prototype. There are four base tables used by the prototype: Color, Many different approaches to the application interface Face, Person, and Range. Person stores the ID, gender, and a and initialization have been taken over time. [4][5][6][7][8] This binary image for each person in the table. Color stores the ID, eye has created multiple prototypes with slight variations to include color, and weight of that color for each person. Face stores the concepts such as the use of compound queries, the use of different ID, weight of the width, and width with attribute values of: initialization methods, and the introduction of spatial data in “narrow”, “average”, and “broad”. In the case of the Face and addition to the dynamic data. While collating all of the different Color tables, weight is used to determine the extent that the methods into a single cohesive prototype application is being particular feature satisfies its predicate. Table 3.3 shows that for performed, the underlying database management system is also person 1, the weights are consistent with the belief that the being scrutinized for cost and portability. community believes this person has a broad face, but for person 2, the weights suggest that the community believes the person may have either a narrow or average face with a fairly low belief that it 4. CURRENT DBMS is broad. The current DBMS is Microsoft's SQL Server running on a Dell GX260 using Microsoft's Windows XP as an Operating Table 3.3 Face Table System. The obvious advantage to using this system is that it has already been written and is currently in use. This removes the ID Face Weight necessity of porting over, or translating source code or data from 1 Average 0.18 one platform into another, of the existing application source code, database tables and database stored procedures. 1 Broad 0.78 Secondly, the system is well established. All parts of the 1 Narrow 0.12 system have been around long enough that resources, assistance, documentation on the individual packages of software are readily 2 Average 0.49 available. It can be extremely frustrating to find documentation on 2 Broad 0.12 newer software. 2 Narrow 0.51 The third advantage of the current SQL Server Looking implementation is that it is already installed and configured on the back to Table 3.1, for person 1 the weights suggest that the image computer system being used for research. Installing and is seen as either having green or brown eyes and a low belief that configuring a new database management system can be a complex the person has blue eyes. The low brown value in conjunction process: finding all of the dependent software, getting the source with the high blue value lends to the belief that the community code, finding documentation that is written on the user’s level, believes the person has blue eyes with possibly some green in and going through the installation process with little or no errors. them. One disadvantage to the current database management These values are adjusted by the application each time system is the use of “off the shelf”, or store bought prepackaged the queries are run and feedback is given. Modifiers such as software, Microsoft’s SQL Server (MSSQL). This is proprietary “very”, “medium”, and “slightly” are used and assigned ranges software, meaning that only the manufacturer has the source code and threshold values as shown in Table 3.4. Each attribute, such and can make any modifications to the core portions of the as “blue”, “green”, and “brown” for eye color, has a value and is software. This signifies that if there is a new development in categorized dependent on the range in which it falls. For instance, database technology, such as the use of spatial data, a developer a person’s image may be initialized with a value of 0.6 for its blue must wait until either an update is available for the current eye color value, which puts it in the medium blue category, but product, or a new version of the software with the new standards based, easy to use, written in Java, and that has a small modifications built-in is available for purchase. footprint. [10] Second, a DBMS package, such as SQL Server, may MySQL, unlike Derby, is not developed in a public use its own “flavor” of an implementation [9] that the user may or community. MySQL was first released internally in 1995 and is may not completely understand. SQL Server uses TSQL which sponsored and owned by a single for-profit company, MySQL AB provides a set of programming extensions to add features such as [11]. MySQL has commercial licensing, is open source, and is transaction control to the structured query language. TSQL is supported by people from all over the world working together via only used by Sybase and Microsoft, and queries written in TSQL the Internet. As of 2005, the production release is version 4.1.14. may not work at all on other platforms. Slight differences in table MySQL is currently used by many web service companies and creation will be illustrated while covering some of the difficulties works on multiple platforms. with different DBMS packages. A stored procedure, that works on one database management system may or may not have the same 5.1 Installation effect when written for another database management system. Installation of the selected DBMS packages is the first For instance, the create procedure signature in SQL Server's major hurdle in the feasibility process. In the case of MySQL, stored procedure encapsulates the name of the procedure installation can be fairly quick. First the user must download the “Fetch_Data” with braces [], while it's MySQL counterpart does package dependent on platform. The Windows version comes not. with an installer to guide the user through the process. Depending on download speed and other considerations, the install process A third disadvantage to the current database can take as little as a few minutes. management system is a simple matter of cost. Due to the plethora of open source database management packages out there, Apache Derby's installation is a little more complex. such as Apache Derby and MySQL, whether it is still cost The home page of Apache Derby can be a little intimidating [11]. effective to maintain the current system using SQL Server comes There is no quick tutorial or installation application available from into question. It may not matter as much with educational the onset. Prior to August of 2005, Apache Derby was still in discounts, educational versions, etc, but it makes a significant incubator status, meaning that it was still in unofficial release difference in a production environment or in cases where money status. With the official release, it is possible to download all files may not be as readily available. needed to get Derby up and running from one location. Users have two options for acquiring Derby: Using the snapshot jars, A fourth disadvantage is that with a proprietary DMBS and recompiling the source code. Since customization isn't package it may not be possible to integrate new functionality into necessary during this testing the snapshot jars that are previously the program. This would force a user to wait until a new version is compiled were used. Derby also requires having a Java run-time released with the modifications. Conversely, an open source environment installed. This increases the time necessary for solution could make it possible to blur the interface with another installation due to configuring the Java run-time environment application such as ELF. class path after installing the environment. 5. DBMS Difficulties 5.2 Configuration Discovering and acknowledging the difficulties of a After getting the candidate DBMS packages installed, DBMS package is of paramount concern when determining the each must configured for use. Configuration can sometimes be viability of the package. There are four basic considerations when done during the installation process via the installer application, or choosing a DBMS: package choice, installation, configuration, it is a stand alone process that requires editing configuration text and porting over the tables and stored procedures. files or running configuration applications. This process sets up There are a multitude of choices when picking a DBMS the file structure, initial administrative accounts, and the initial packages, such as Oracle, DB2, Apache Derby/Cloudscape, security and permission guidelines. MySQL, etc. Testing every package would be time consuming, MySQL can be configured manually or via a and focusing on specific needs can significantly decrease the configuration initialization file. In the case of Windows, it runs a number of options. In the case of the fuzzy database application, configuration wizard which walks the user through the the focus is to find an open source package that is easy to use, configuration process. Using the default configuration options, modifiable, and currently used in some production environments. configuration can be accomplished in approximately one to two Therefore, Apache Derby/Cloudscape and MySQL were chosen minutes. as possible alternatives to the current SQL Server system. Apache Derby configuration is more complicated. The Apache Derby is an open source, Java based, relational Derby installation creates Java JAR files that contain the classes database management system [10]. Apache Derby started out as for Derby. These files must be put on the computer and then JBMS created by Cloudscape, Inc back in 1997. In 1999, linked via the operating system's class path. All other Informix acquired Cloudscape, and was then acquired by IBM in configuration settings are done through the tools and user 2001. In 2004, IBM contributed the Cloudscape source code to interfaces supplied by Derby. If the user is used to the process, the Apache Software Foundation for release as an open-source installation can take as little as 5 minutes. package. Apache Derby can be used in a client/server mode or as 5.3 Porting Over Data an embedded system, meaning that Derby is used within an The next step, after installation and configuration is application without the need to install a separate application. complete, is to see if the tables can be ported over. There are Derby's Charter is to create a database technology that is secure, several approaches to porting over tables and their data from one platform to another. The user can port the information by hand, manually creating the tables on the new platform and entering the into account when modifying a SQL Server script to create tables data into the tables by typing in the data. This is inefficient, leaves on Derby and MySQL. room for error, and is impractical in the case of large table structures CREATE TABLE products The user can export the data via a script function that ( creates a file with the necessary SQL statements to recreate the Product_id int IDENTITY (10, 2), Product_name varchar (50) data on the same platform. This script is generally created by the ) DBMS package. This script can then be used to import the data Figure 5.31 Creating a table in SQL Server into the new platform. Some platforms, like MySQL, allow the exportation of either the entire database or just one table. The script is configured for the exporting platform, i.e. MySQL, and CREATE TABLE products ( does not require any editing if the data is being ported over to Product_id INT, another MySQL database. Derby and MySQL differ slightly in Product_name CHAR (50), the creating of tables, so editing would be required. Primary key (Product_id) ); The user could also use some connectivity client like Figure 5.32 Creating a table in Apache Derby Open Database Connectivity (ODBC) for Microsoft Windows that would do the translation for the user as far as translation. This assumes that such a client exists and that it will minimize or CREATE TABLE products eliminate the need for any modification of any code. ( Product_id INT, Product_name VARCHAR (50), 5.3.1 Configuration Primary key (Product_id) Porting SQL Server tables to MySQL is relatively ); simple using an ODBC client. MySQL has an ODBC client Figure 5.33 Creating a table in MySQL (MyODBC) that allows any application that can use ODBC to map to a MySQL database and perform operations on it without After the tables were created and the data was ported knowing the specific syntax for MySQL. It has some drawbacks over, without the Person table, the data for the Person table was though. Binary types do not port over. This is a typical problem created by hand. Due to a limitation in the size of string literals in with many ODBC clients. Derby, the images had to be saved locally and then an application MySQL has two noticeable problems with porting was written to insert them as prepared insertion statements in tables over from SQL Server using ODBC. First, it requires Java. The prototype uses 40 images, with some images in the installing a specific third party ODBC driver on the SQL Server 100KB+ range. Using the combination of porting methods, all the Machine. Second, SQL Server stores images of a person's face as tables and data were ported over. an IMAGE type, but when told to translate the image to MySQL, However, this still left the stored procedures. These which stores binary data as BLOB's, ODBC gives an error. All procedures are stored by the DBMS. Stored procedures add other tables except the one(s) with images transferred. The tables functionality for applications and allow some specifics about the with images had to be inserted in to the database by hand. database to be hidden from the user. For instance, procedures to 5.3.2 Porting SQL Server to Derby fetch images from the database can be written and referenced by Apache Derby has the same problem with BLOB types. the commands “FetchALL()”, ”FetchBlue()”, etc. Therefore Derby also required a special third party ODBC Driver, which moving the procedures is only required because there is an was actually written for DB2. SQL Server had could not create application that is calling them. Rewriting the application to not the tables via the ODBC driver, and all of the tables had to be use these stored procedures is another solution. created by hand. 5.4 Stored Procedures After the tables were created, ODBC was used to port Stored procedures created the largest hurdle to cross. over a majority of the data. To be specific, all the data in tables Stored procedures can increase efficiency for heavy CPU other than the Person table was able to be migrated into the intensive queries and queries that are run frequently. However, manually created tables. In the case of the tables containing they work differently on different database management systems, images, the Person table, converting SQL Server image tables to if they are implemented at all. Depending on the complexity of another platform required extensive rewriting of the SQL code the application using the stored procedures and the frequency with due to the problem with the BLOB types to conform to Derby’s which they are used, it may behoove an application to remove the SQL Syntax. stored procedures and modify its source code accordingly. The Person table had to be exported to a SQL Server 5.4.1 Stored Procedures in MySQL script using the Enterprise Manager. This script was modified to MySQL does not support the use of stored procedures create the tables on MySQL and Derby. To understand why this prior to version 5.0. The stored procedures used in the application process can be time consuming, it is necessary to observe the were examined, and it was discovered that the each of the two differences in between MySQL, Derby, and SQL Server (see stored procedures were called only once. This gives two options: Figures 5.31 5.32 and 5.33). Note the difference in declaring the modify the source of the application to remove the stored primary key using “IDENTITY” in SQL Server (Figure 5.31). procedures, or wait until 5.0 is released and bring MySQL back Another minor difference is in Derby and the use of “CHAR” vs on the table for further consideration. “VARCHAR” (Figure 5.32). These differences must be taken 5.4.2 Stored Procedures in Apache Derby Connection Tim e Apache Derby, on the other hand, does support stored procedures. But stored procedures in Derby work differently than they do for SQL Server, which uses TSQL for stored procedures. 3 5 0

Functionality that exists in TSQL may not exist in Derby. This 3 0 0 problem, combined with the MySQL stored procedure problem, led to the decision to modify the application in order to facilitate 2 5 0 portability. 2 0 0 S Q L Theoretically, MySQL and Apache Derby are prime D e r b y candidates for porting over the current project. However, the 1 5 0 M y S Q L application interface needed to be rewritten for each case in order 1 0 0 to test for speed and efficiency. 5 0 5.5 Testing 0 Testing for the application was limited to three major 4 0 3 , 0 0 0 3 0 , 0 0 0 3 0 0 , 0 0 0 timing factors: connection, query, and insertion. Connection time Q u e r y S i z e is defined as the duration in milliseconds to establish a connection to the database management server. Query time is defined as the time to open a connection and pose a query and print the result. Figure 5.4 Connection Time Trial Insertion time is defined as the duration it takes to insert X number of tuples into the database. Query Tim e Each test consisted of testing the three factors on varying query sizes. With the limited amount of data on hand, 16,000 three additional tables were created. Each table consisted of 3 14,000 attributes: integer identification, an alphanumeric string, and a text string. Three operations were written to time the execution 12,000

of each of the three factors. All three operations were done on ) 10,000

s SQL each DBMS a total of 20 times over a period of a week. The m ( 8,000 Derby average of all 20 trials was used in determining the final results. e m i MySQL The first operation tested connection time. Figure 5.4 T 6,000 illustrates that the initial connection was the highest for all three 4,000 DBMS. It went down significantly re-connections. SQL server 2,000 was the fastest initially, but Derby was quicker after the initial connection. The difference between the fastest and slowest 0 connection was 111 milliseconds, which is negligible. After the 40 3,000 30,000 300,000 initial connection, the reconnects were so fast that there really Query Size isn’t a lot of difference between the three. The second operation, querying the database for varying Figure 5.5 Query Time Trial query sizes, showed some interesting results. Figure 5.5 shows the times for queries. For 40 images, SQL Server was the fastest, Insertion Tim e but only by a very small amount. Likewise all three DBMS packages ran fairly close until the query size reached 300,000 tuples. At 30,000 tuples, SQL Server shined, while MySQL and 8000 Derby started to bog down. The time difference at 30,000 was 7000 still fairly close and at 300,000, the difference was a matter of seconds. Though there are speed differences, the open source 6000

DBMS packages proved that they can handle simple database ) 5000 s SQL c

functionality almost as quickly and efficiently as an Enterprise e s

( 4000 Derby level DBMS like SQL Server. For large datasets on the other e

m MySQL hand, SQL Server shined. i 3000 T 2000

0 3,000 30,000 300,000 Query Size

Figure 5.6 Insertion Time Trial The last operation tested, insertions, had the widest [3] Elf Software Home Page – Retrieved November 14, 2005, variance. Figure 5.6 illustrates that MySQL had the worst time from http://www.elfsoft.com/ inserting new data as the table size grew. [4] Sanghi, S., Determining Membership Function Values to MySQL went up exponentially from the beginning while Derby Optimize Retrieval in a Fuzzy Relational Database, May did well in the beginning but after 30,000 insertions the time grew 2005. Retrieved August 8,2005 from exponentially. SQL Server grew linearly as opposed to the open http://www.people.vcu.edu/~lparker/DBGroup/MemFunc.do source variants. Further research could reveal that something c such as how the package does indexes or how the package [5] Bhootra, R. and Mehrotra, A., Overview of Natural allocates memory could affect insertion performance. Language Interfaces, December 2004 Retrieved August 8,2005 from 6. FUTURE WORK http://www.people.vcu.edu/~lparker/DBGroup/NLI.doc Memory usage could be monitored on cross platform [6] Crouse, B. and Pandrinath, A., Spatial Databases, December testing as far as using alternative operating systems. Tests on 2004 Retrieved August 8,2005 from different indexing methods such as insertion methods could also http://www.people.vcu.edu/~lparker/DBGroup/SpatialDB.do be performed. Furthermore, tests could be performed to see if c different methods of creating insertions can increase performance. [7] Dattatri, S. and Joy, K., Implementing a Fuzzy Relational 7. CONCLUSION Database and Querying System With Community Defined The current implementation of the prototype on SQL Membership Values, November 2004. Retrieved August 8,2005 from Server will allow much larger datasets in the future. Given that the group wants to add functionality to the database by integrating http://www.people.vcu.edu/~lparker/DBGroup/FuzzyDB.rtf ELF, an open source package will be more beneficial. Derby and [8] Crabtree, P., Fuzzy Relational Database and Querying MySQL provide excellent performance with small sets. System With Compound Query Capabilities, May 2005. Connection time and query time seem to be inconsequential with Retrieved August 8,2005 from the size of the data set currently being used, but caution would be http://www.people.vcu.edu/~lparker/DBGroup/Compound.d needed if the size of the data set grew. Operations in which data oc is being added or modified on the database also need to be [9] Transact SQL Overview – Retrieved August 8, 2005, from monitored. http://msdn.microsoft.com/library/default.asp? url=/library/en-us/tsqlref/ts_tsqlcon_6lyk.asp 8. REFERENCES [10] Derby Home Page, Retrieved August 8, 2005, from [1] Database Retrieved August 7, 2005, from http://db.apache.org/derby/ http://www.answers.com/database [11] MySQL AB Home Page, Retrieved August 9, 2005, from [2] Joy, K. and Dattari, S. Implementing a Fuzzy Relational http://www.mysql.com/ Database Using Community Defined Membership Values, Proceedings of the 2005 ACM SE Conference, Vol. 1, pages 264-265 Retrieved August 8,2005 from http://www.people.vcu.edu/~lparker/DBGroup/Poster.doc