METADATA BASED SEARCH IN LAN

SALABHA P S, ABDUL NIZAR M

College of Engineering, Trivandrum

Abstract— In recent years, the number of ways to keep and manage personal information has increased considerably, in line with the overall increase in the number of devices, technologies and applications on which knowledge workers rely. The fragmentation of personal information increases the probability of keeping something locked away in a device, application or format and forgetting that it was ever seen, heard, or read in the first place. Information does not reside only on personal computers, but is continuously produced and revised in local area networks. Owing to the increased availability of data and the evolution of standards in recent years, applications of semantic technologies in organizational information systems have increased. The application of ontologies in organizational information systems allows the integration of heterogeneous information items within the organizational memory. Semantic architectures bring together information sources which previously would have been more difficult to manage. In this paper we develop a tool that aims to use semantic technologies effectively to support searching data in a LAN based on the attributes of a file. The data stored in the LAN is indexed periodically, and the information collected is stored in the form of RDF triples. This metadata can later be used for searching documents.

Keywords— indexing, LAN, metadata, search

I. INTRODUCTION

Traditional personal computers manage resources in two ways. One way is based on directory and file name. The other is to manage resources through applications: each type of resource file is associated with one or more applications and can only be accessed by an appropriate application. For example, Word documents can be edited by Word, and video files may be processed by Windows Media Player, Real Player or other applications. However, with the rapid development and popularization of personal computers and the Internet, the traditional management of personal computer resources can no longer meet the actual needs of users, mainly in the following aspects:

• Rapid increase in the size of hard disks
• Fragmentation of personal information
• Filename may not reflect content
• Computers cannot get a great deal of information about the content of files
• Inability to manage semantic links between the resources
• Time to locate information

This paper employs the emerging technology of the semantic desktop to provide a new way to solve this problem. The Semantic Desktop is a collective term for ideas related to changing a computer's user interface and data handling capabilities so that data is more easily shared between different applications or tasks, and so that data that once could not be automatically processed by a computer could be. It also encompasses some ideas about being able to automatically share information between different people. This concept is very much related to the Semantic Web but is distinct insofar as its main concern is the personal use of information. The semantic desktop improves the efficiency with which we search for and locate resources. It can also create relations between different types of resources, so that users can handle their personal affairs more efficiently.

II. SEMANTIC TECHNOLOGY

Semantic technology encodes meanings separately from data and content files, and separately from application code. This enables machines as well as people to understand, share and reason with them at execution time. With semantic technologies, adding, changing and implementing new relationships or interconnecting programs in a different way can be just as simple as changing the external model that these programs share.

With traditional information technology, on the other hand, meanings and relationships must be predefined and "hard wired" into data formats and the application program code at design time. This means that when something changes, when previously unexchanged information needs to be exchanged, or when two programs need to interoperate in a new way, humans must get involved. Off-line, the parties must define and communicate between them the knowledge needed to make the change, then recode the data structures and program logic to accommodate it, and then apply these changes to the database and the application. Then, and only then, can they implement the changes.

Semantic technologies are "meaning-centered." They include tools for: auto recognition of topics and

Proceedings of Fifth IRAJ International Conference, 15th September 2013, Pune, India, ISBN: 978-93-82702-29-0

concepts, information and meaning extraction, and categorization. Given a question, semantic technologies can directly search topics, concepts and associations that span a vast number of sources. Semantic technologies provide an abstraction layer above existing IT technologies that enables bridging and interconnection of data, content, and processes. From the portal perspective, semantic technologies can also be thought of as a new level of depth that provides far more intelligent, capable, relevant, and responsive interaction than information technologies alone.

III. DESIGN CONCEPT

A. Resource Identification

In the semantic desktop, URIs are used to identify data. Each resource has a unique identifier that can be used to locate data from any computer in a network. The basic syntax of URI is :// ? . For example it can be of form mail://///.

B. Information Representation

Resource Description Framework (RDF) is a common language used for the expression of World Wide Web information resources. It is specifically used for representing metadata about web resources, such as a web page's title, author and modification time. By generalizing the concept of web resources, RDF can be used to express anything that can be identified on the web, even if it cannot be accessed directly from the web. The relationship between a computer resource, its URI and RDF is shown in Figure 1.

Figure 1: Resource described in RDF

RDF is intended for situations in which this information needs to be processed by applications, rather than being only displayed to people. RDF provides a common framework for expressing this information so it can be exchanged between applications without loss of meaning. Since it is a common framework, application designers can leverage the availability of common RDF parsers and processing tools. The ability to exchange information between different applications means that the information may be made available to applications other than those for which it was originally created.

C. RDF Data Storage

Currently, there are three kinds of RDF data storage: RDF/XML files, a special XML/RDF database (triple store) and a traditional relational database. For a small amount of data, storage in RDF/XML files is adequate. For large amounts of data, taking into account scalability, data integrity and security, query efficiency and many other factors, using a relational database or an RDF/XML database to store RDF data is a relatively good choice.

A triplestore is a purpose-built database for the storage and retrieval of triples, a triple being a data entity composed of subject-predicate-object, like "Bob is 35" or "Bob knows Fred". Much like a relational database, one stores information in a triplestore and retrieves it via a query language. Unlike a relational database, a triplestore is optimized for the storage and retrieval of triples. In addition to queries, triples can usually be imported and exported using Resource Description Framework (RDF) and other formats. Some triplestores can store billions of triples. Some of the implementations of triple stores are Soprano, Virtuoso Universal Server, Apache Jena and Sesame.

D. Ontology

In the context of knowledge sharing, the term ontology means a specification of a conceptualization. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. An ontology provides a shared vocabulary, which can be used to model a domain, that is, the types of objects and/or concepts that exist, and their properties and relations.

We can establish hundreds of relationships between resources. Different people may refer to the same relationship (or property) using different names. Also, the same name may mean different things to different people. This may create ambiguity while creating an ontology. Instead of creating our own ontology, we used ontologies developed as an open-source project called Shared Desktop Ontologies (SDO). The development process is centered around the SDO Trac repository, which is open to contributions from everyone.

IV. PROTOTYPE SYSTEM

In order to support metadata-based search in a LAN, a prototype system called Locator was developed. It utilizes semantic technologies to collect metadata and identify relationships between resources.

A. System Architecture

Figure 2 shows the system architecture of Locator. It includes the following parts: an adapter, an ontology database and a query module. The indexer, which is present in every node of the LAN, includes the adapter and an indexing mechanism. The indexer was written using Apache Lucene 4.4.0. The adapter extracts textual data from resources, then converts the data into RDF format and stores it in the core ontology database. The indexer runs periodically to extract metadata from resources and updates the index and the information in the central ontology library. Instead of using a central ontology, we may divide the LAN into clusters and allot an ontology to each cluster.

Figure 2: Architecture of Locator

The Sesame ontology database stores an ontology that describes the data stored in the LAN. The ontology database is the core of semantic systems. It describes the information stored in the LAN and the links between resources. The queries are based on the information stored in the ontology library.

The query module is responsible for processing semantics when a user enters a query and for indexing of metadata. The directory mapping module identifies the relationships between resources using the ontology database and groups them into semantic directories based on these relationships. When a user issues a query, it is analyzed and matched with the ontology library to find results. The results of queries are ranked according to relevance. The frequency of the keyword in the file and the frequency of access are used as metrics to calculate relevance. This information is collected during the indexing phase of operation.

V. EXPERIMENTS AND RESULTS

The experiments were conducted on a LAN with 4 nodes running Ubuntu 12.04 LTS, each with an Intel Core i5-2450M CPU @ 2.50 GHz processor and 4 GB main memory. The following observations were made during the experiments. Table 1 shows the performance of Locator for different data sets.

Table 1: Performance of Locator

Total size   No. of files   Index size   Indexing time   Search time
107 MB       113            33 MB        33 s            8 s
507 MB       10215          68.6 MB      213 s           10 s
1.12 GB      281            415 MB       291 s           25 s
42.5 GB      199943         4.57 GB      4396 s          97 s

The size of the index and the indexing time depend mainly on the number of files indexed and the type of file. The indexer part of Locator was written mainly to extract textual data from files. Indexing of non-textual data will not be as efficient and will take more time. As the number of text files and their size increase, the index size will also increase. Also, as Lucene uses an inverted index, the larger the vocabulary of the user, the larger the index will be. The indexer part of Locator is executed periodically, every 30 minutes. During the first run of Locator, the indexer processed information at the rate of almost 50 GB per hour. As incremental indexing was performed, subsequent indexing runs were much faster. Only those files modified after the previous indexing run need to be re-indexed. The index size varies between 2 and 10 percent of the size of the actual data. The larger the number of textual files, the larger the index size.

While querying, however large the data set may be, the first response was obtained in a matter of milliseconds. The final result depends on the index size and the frequency of the keyword. Locator ranked the results according to relevance. The frequency of the keyword in the file and the frequency of access are used as metrics to calculate relevance. In Windows 7 and Ubuntu native desktop search, the results were produced in the order of storage location.

The index size and the time needed for indexing files are comparable. They grow at almost the same rate. But the relationship between index size and search time is different. By examining the last two rows of Table 1, we can see that even when there was a roughly 11-fold increase in index size (415 MB to 4.57 GB), the increase in search time was less than 4-fold (25 s to 97 s). Also, search time is independent of indexing time. Even when the time taken to index the files is very large, the time to search the data will be small.
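The growth rates discussed above can be checked directly from the values as printed in Table 1. The following is a minimal sketch, not part of Locator; it assumes 1 GB = 1000 MB when converting the table entries, and the names ROWS, growth and growth_table are chosen here for illustration only.

```python
# Rows of Table 1 as (total size MB, files, index size MB, indexing s, search s).
# Sizes are converted with 1 GB = 1000 MB, which is an assumption about the units.
ROWS = [
    (107.0, 113, 33.0, 33, 8),
    (507.0, 10215, 68.6, 213, 10),
    (1120.0, 281, 415.0, 291, 25),
    (42500.0, 199943, 4570.0, 4396, 97),
]


def growth(prev, curr):
    """Return (index size ratio, indexing time ratio, search time ratio)."""
    return (curr[2] / prev[2], curr[3] / prev[3], curr[4] / prev[4])


def growth_table(rows):
    """Compute the growth ratios between each pair of consecutive rows."""
    return [growth(a, b) for a, b in zip(rows, rows[1:])]


if __name__ == "__main__":
    for index_ratio, indexing_ratio, search_ratio in growth_table(ROWS):
        print(
            f"index x{index_ratio:.1f}  "
            f"indexing time x{indexing_ratio:.1f}  "
            f"search time x{search_ratio:.1f}"
        )
```

For the last two rows this gives an index size ratio of about 11 and a search time ratio of about 3.9, so search time grows markedly more slowly than the index.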


A large increase in index size will cause only a small increment in search time. In short, the system is scalable.

A fuzzy search is a process that locates files that are likely to be relevant to a search argument even when the argument does not exactly correspond to the desired information. A fuzzy search is done by means of a fuzzy matching program, which returns a list of results based on likely relevance even though search argument words and spellings may not exactly match. Locator supports fuzzy queries, whereas the native desktop search of Windows 7 and Ubuntu does not.

Besides contents and filename, Locator provides additional search filters: author, file type, date of last modification and size. It can also show relationships between files. Windows 7 provides only two additional filters: size and date of last modification.

CONCLUSION

As the size of the LAN increases, the fragmentation of data increases and it becomes more difficult to search for and locate files. By indexing data in the LAN and storing it in a central index, searching becomes faster. Using semantic technologies to collect metadata allows us to create more efficient queries and to find relationships between files. Use of semantic technologies improves the efficiency of resource discovery and location, and it can create semantic links between the resources. Through the semantic links between the resources, users can handle personal business more effectively, and they may discover new information related to their queries.

Linked Data describes a method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies such as HTTP, RDF and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried. By using linked data technology along with the semantic desktop, in future we can create applications that extract only the relevant information from files without reading the entire file.
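To make the idea of storing file attributes as RDF triples and searching on them more concrete, the following minimal sketch models triples as plain Python tuples. It is an illustration only and not the Locator implementation, which stores its data in a Sesame database. The file URIs and the predicate names (author, fileType) are hypothetical and are not terms taken from the Shared Desktop Ontologies.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical file URIs and predicate names, for illustration only.
TRIPLES: List[Triple] = [
    ("file://node1/docs/report.pdf", "author", "Alice"),
    ("file://node1/docs/report.pdf", "fileType", "pdf"),
    ("file://node2/notes/todo.txt", "author", "Bob"),
    ("file://node2/notes/todo.txt", "fileType", "txt"),
    ("file://node2/notes/plan.txt", "author", "Alice"),
    ("file://node2/notes/plan.txt", "fileType", "txt"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def files_with(triples: List[Triple], **attributes: str) -> List[str]:
    """Return the sorted URIs of files whose metadata matches every attribute."""
    result = None
    for predicate, value in attributes.items():
        subjects = {s for s, _, _ in match(triples, predicate=predicate, obj=value)}
        result = subjects if result is None else result & subjects
    return sorted(result or [])


if __name__ == "__main__":
    print(files_with(TRIPLES, author="Alice", fileType="txt"))
```

Each search filter becomes one triple pattern, and combining filters is a set intersection over the matching subjects. A real triple store would express the same query in its own query language.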

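The incremental indexing described in Section V, in which only files modified after the previous indexing run are re-indexed, can be sketched as a modification-time filter. This is a hedged illustration using the Python standard library, not the Lucene-based indexer of Locator; the function name files_to_reindex and the use of a POSIX timestamp for the last run are assumptions.

```python
import os
from typing import List


def files_to_reindex(root: str, last_indexed: float) -> List[str]:
    """Return paths under root modified after the timestamp last_indexed.

    last_indexed is a POSIX timestamp (seconds since the epoch) recorded
    at the end of the previous indexing run.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Compare the file's modification time with the previous run.
            if os.path.getmtime(path) > last_indexed:
                changed.append(path)
    return sorted(changed)
```

A periodic job would call this with the time of the previous run, index only the returned files, and then record the new timestamp. On the first run a timestamp of 0 selects every file, which is consistent with the first run being the slowest.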

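The relevance ranking described in Sections IV and V uses the frequency of the keyword in the file and the frequency of access, but the paper does not give the formula that combines them. The sketch below assumes a simple weighted sum; the weights, the Hit record and the function names are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    path: str
    keyword_count: int  # occurrences of the query keyword in the file
    access_count: int   # number of times the file has been accessed


def relevance(hit: Hit, keyword_weight: float = 1.0, access_weight: float = 0.5) -> float:
    """Score a hit as a weighted sum of the two metrics (assumed formula)."""
    return keyword_weight * hit.keyword_count + access_weight * hit.access_count


def rank(hits: List[Hit], **weights: float) -> List[Hit]:
    """Return the hits ordered from most to least relevant."""
    return sorted(hits, key=lambda h: relevance(h, **weights), reverse=True)


if __name__ == "__main__":
    sample = [Hit("a.txt", 2, 0), Hit("b.txt", 1, 10), Hit("c.txt", 5, 1)]
    for hit in rank(sample):
        print(hit.path, relevance(hit))
```

Both counts can be gathered while indexing, which matches the statement that this information is collected during the indexing phase. Setting access_weight to 0 reduces the ranking to pure keyword frequency.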

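The fuzzy matching discussed in Section V can be illustrated with the similarity matcher in the Python standard library. The paper does not state which matching method Locator uses, so this sketch swaps in difflib.get_close_matches as a stand-in; the function name fuzzy_lookup, the sample vocabulary and the cutoff of 0.7 are assumptions.

```python
import difflib
from typing import Iterable, List


def fuzzy_lookup(term: str, vocabulary: Iterable[str], limit: int = 3, cutoff: float = 0.7) -> List[str]:
    """Return indexed terms that approximately match a possibly misspelled term.

    Matching is case-insensitive; results keep the original spelling
    from the vocabulary and are ordered from best to worst match.
    """
    lowered = {word.lower(): word for word in vocabulary}
    close = difflib.get_close_matches(term.lower(), list(lowered), n=limit, cutoff=cutoff)
    return [lowered[word] for word in close]


if __name__ == "__main__":
    indexed_terms = ["metadata", "ontology", "indexing", "Locator"]
    print(fuzzy_lookup("metadta", indexed_terms))
```

A misspelled query term is first mapped to the closest indexed terms, and those terms are then used for the ordinary lookup. A higher cutoff makes matching stricter, at the cost of missing more distant misspellings.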