A Fuzzy Grassroots Ontology for improving Weblog Extraction

Edy Portmann, Andreas Meier Information Systems Research Group Department of Informatics University of Fribourg Journal of Digital Boulevard de Pérolles 90 Information Management CH-1700 Fribourg Switzerland {edy.portmann, andreas.meier}@unifr.ch

Abstract: This paper presents fuzzy clustering algorithms to top-down-approach of ontologies, because fuzzy sets are establish a grassroots ontology – a machine-generated weak more suitable for characterizing vague information. Murthy ontology – based on . Furthermore, it describes and Biswas state in [3] that fuzzy logic is an addition to a search engine for vaguely associated terms and aggregates conservative logic and handles the concept of partial truth them into several meaningful cluster categories, based on the along with true and false, which is used for qualitative rather introduced weak grassroots ontology. A potential application than quantitative judgement. Hence fuzzy logic follows the of this ontology, weblog extraction, is illustrated using a simple way humans think and helps to handle real world complexi- example. Added value and possible future studies are discussed ties more efficiently. Therefore, it converts imprecise human in the conclusion. information to precise mathematical models. Consider, for example, the following paradox of a marriage: I Categories and Subject Descriptors cannot live with her, and I cannot live without her. Both state- I.2.3 [Deduction and Theorem Proving]; Uncertainty, ``fuzzy,'' and ments are – to a certain degree – true. The dynamic between probabilistic reasoning: H.3.3 [Information Search and Retrieval]; those statements is – along with other things – what keeps Clustering marriage interesting. That is what fuzzy logic deals with. Down with both statements (with, without) in fuzzy set theory certain General Terms: Fuzzy logic, Clustering, GUI membership degrees comes along. Keywords: Fuzzy data clustering, Information retrieval, Ontology, In this proposal folksonomies – human-made taxonomies – will Search engine, Semantic web, Weblog be converted to machine-understandable ontologies adapted from fuzzy set theory. To search for information from weblogs, Received: 19 November 2009; Revised: 31 December 2009; a user types one or more search terms in a graphical user inter- Accepted: 18 January 2010 face (GUI) and defines an associated relevance for each of them (as a membership function). This user need is then processed 1. Motivation and Related Work by a query engine, which generates term- and relevance-based clusters. The relationship grade depends on the relevance In Information Retrieval (IR), it is pivotal to know how to find selected by the user. To discover the related terms, the query relevant information despite to information overload. However, engine primarily draws upon a previously built ontology. defining relevance is a challenge. Do the retrieved documents and/or queries resulting from a set of keywords exactly match The ontology, which is a set of associated tags with related the semantic content of the given search terms? weights, is compiled with a . The weights are repre- sented using the semantic closeness of the terms. To achieve The problem with the resulting documents and queries is that this, it is essential to establish terms and their relationships to they often yield results which are only partially relevant to their each other, as Hasan-Montero and Herrero-Solana discuss actual semantic contents. As a consequence, the matching of in [4]. They propose an algorithm for semantic clustering and a document to the query terms is often vague. The users would provide an example of its use. Furthermore, Kaser and Lemire be frequently better off if they not only received exact results, suggest in [5] that associated tags ought to be placed near one but also related outcomes (presented as suggestions). These another. The easiest possible way similarity between two tags suggestions ideally should also be searchable, allowing the us- can be measured is to count the number of co-occurrences, ers to interact during the search process and hence find more that is, the number of times two tags are allocated to the same suitable results, narrow the query or expand it as desired. source. Nevertheless, there exist other, different measurements This paper discusses an approach to find more appropriate in- to establish similarity, such as locality sensitive hashing (LSH) formation by searching weblogs using fuzzy logic, here referred – where the tags are hashed, so that similar tags are mapped to as fuzzy weblog extraction. In [1], Meier, Schindler and Werro to the same set with a high probability – and collaborative filter- describe fuzzy logic as an appropriate instrument for rough ing (CF) – where several users define tags and their relations modelling the kind of uncertainty related with vagueness. The jointly – as well. core power of fuzzy logic is the fuzzy set theory, first proposed After the similarities among all necessary tags are calculated, by Zadeh in [2] as an extension of the traditional set theory. it is possible to derive clusters with the help of fuzzy clustering This paper shows that fuzzy sets can overcome the gap algorithms – for instance, the fuzzy c-means (FCM) algorithm between the bottom-up-approach of folksonomies and the described in [6] by Bezdek. The resulting clusters are the initial

276 Journal of Digital Information Management  Volume 8 Number 4  August 2010 point to obtain an ontology and the basis for the suggestions information and to reveal the underlying structure in a profitable related to the search. By this means the clusters can be an- way for the end user. Using established methods to represent ticipated in relation with the user need as folders and contain knowledge gained from unstructured data will be beneficial for the individual term-based search results; in addition they are the Web 2.0 too, in that it provides users with enriched Semantic searchable – as a result the user can interact. The established Web features to organize and understand their data. Hence it is search results of all the documents represented will be compiled possible for the user to generate new knowledge as knowledge by a meta search engine. A meta search engine permits the is the understanding of a subject with the ability to use it for a user to enter the query once and thereby it accesses several specific purpose if appropriate. search engines at once, as Schwartz in [7] explains. The benefit to the user is that instead of delivering millions of 2.2 Folksonomy and Weblogs search results in one long list, this search engine groups simi- Social software covers a collection of tools that empowers users lar results together into clusters and presents them in folders to interact and share data, as communication and interactive like Ferragina and Gulli in [8] propose. These folders help to applications. Communication tools such as weblogs typically recognize search results by topic so it is possible to zero in on handle the capturing, storing and presentation of statements. precisely what was searched for or expose unforeseen rela- Interactive tools manage interceding interactions between user tionships among items. Rather than scrolling through multiple groups. They focus on establishing and obtaining a relation pages, the folders help to uncover missed results or results that amongst users, facilitating the mechanics of conversation. A were hidden in a ranked list. typical example of interaction tools are folksonomies. The paper is structured as follows: Chapter 2 clarifies the main The portmanteau “folksonomy” from ‘folk’ and ‘taxonomy’ means elements of the proposed concept. Chapter 3 describes the the practice and technique of collaboratively creating and research topic in more depth; furthermore a simple example is manipulating tags to annotate and categorize content. By this given. Chapter 4 summarizes the results and shows potential means a document, a uniform resource locator (URL), a picture, issues for further research. a movie, etc. can be marked as content. As Voss explains in [11], in folksonomies, freely chosen keywords are used instead 2. Concept and Associated Components of a controlled vocabulary. Such meta data are straightforward to create, but generally lack any kind of formal grounding, as For better understanding, this chapter first clarifies the main used in the Semantic Web. In this sense folksonomies will be fundamentals of this paper. In chapter 2.1, necessary Semantic used as a starting point to harvest social knowledge from these Web issues are outlined. The next chapter 2.2 describes user-generated taxonomies and to build-up an ontology, as Wu, social software (and its applications), followed by Information Zubair and Maly suggest in [12]. Retrieval (including web search engines) in chapter 2.3. Then common metrics (chapter 2.4) will be explained, since the The second element important to mention are weblogs or short fuzzy data clustering (chapter 2.5) is based on a metric-based blogs. As Picot and Fischer illustrate in [13], a “weblog” is a tag space – a two dimensional space model. At the end, the made-up word of ‘’ (WWW) and ‘log’ and de- resulting ontology is described in chapter 2.6. Each of these scribes a type of website, usually administrated by an individual fundamentals will be briefly demonstrated and related academic with periodical reverse-chronological entries of comments, work will be provided as well. description of events, or other objects such as movies, pictures or diagrams. Large quantities of blogs are composed of com- 2.1 From Web 2.0 to Web 3.0 mentary, news or information on special topics. A typical weblog The expression Web 2.0 refers to the alleged next generation combines text, images, and essential links to other blogs, web of web improvement, which aims to ease communication, pages, and additionally to other media. Therefore, an important promote safe information sharing, and enable interoperability advantage is that blogs based on hyperlinks typically inform and collaboration on the Internet. Thus Web 2.0 concepts faster than broadcast or print media. Thus, the latest information have led to the expansion and development of web-based on specific topics is generally found in weblogs. communities, hosted services, and applications such as social software, to cite an example. A major reason for their overnight 2.3 Information Retrieval and Search Engines success is the breathtaking simplicity of use. These sites do Information Retrieval is interdisciplinary, based on computer not only afford data but also generate a plethora of weakly science, and establishes the retrieval of information from a arranged meta data. set of documents as Baeza-Yates and Ribeiro-Neto outline in [14]. Accordingly, IR represents the science of searching for Although the term Web 2.0 seemed to announce a new ver- documents, for information within documents and for meta sion of the Internet, according to O’Reilly, it is just a shift, not data about documents, as well as that of searching on technical, but on social interaction, as for example software and the WWW. There is a partial coverage in the handling of developer and end-user use the Internet. In [9], O’Reilly declares the terms Data Retrieval, Document Retrieval, Information that Web 2.0 technically does not differ from the earlier Internet, Retrieval, and Text Retrieval, but each has its own body of retrospectively marked as Web 1.0. In contrast to this static literature, practices, technologies and theory. expert generated content in Web 1.0, interactive elements are crucial in Web 2.0. Normally, IR systems are used to diminish what is called in- formation overload. Web search engines are the most obvious At present, the Web 3.0 can amend the bottom-up wisdom- IR application. A web search engine is an instrument intended of-the-crowds attempt of the Web 2.0 in a top-down manner, to search for information on the Internet. The search results as Cardoso states in [10]. Its fundamental aim is a stronger are typically presented in a single list and are generally called knowledge representation as is possible with folksonomies, for “hits”. The information can consist of images, text, web pages, example. While users provide their data, they generally have and auxiliary types of documents. a structure in mind. Indeed, this structure is buried in the data and it needs to be extracted for advanced use. Different kinds A number of search engines mine data by dint of a web agent of self-acting mechanisms are indispensable to extract hidden available in newsbooks, databases, or open directories. Particular

Journal of Digital Information Management  Volume 8 Number 4  August 2010 277 arearecomposedcomposed of ofcommentarycommentary, , news newsoror information information DistanceDistance metrics metrics attend attend to to the the identification identification of of the the areonon composed special special topics topics of commentary. . A A typical typical ,weblog newsweblogorcombines combinesinformation text, text, Distancedistancedistance metricsbetween between attend two two indiv indivto idualtheidual identification elements elements. The. The of basis thebasis onimages, images, special and and topics essential essential. A typicallinkslinks toweblog to other other combinesblogs, blogs, w web eb text, pa pag- g- distanceforfor distance distance between measurement measurement two individual is is the theelements Minkowski Minkowski. Themetricmetric basisas as images,es,es, and and and additionally additionally essential links to to other otherto other media. media. blogs, Therefore Therefore web pa,g-,ana n fordiscussed discusseddistance bymeasurementby Su Su and and Chou Chou is inthe in [16 [Minkowski16]:]: metric as es,important important and additionally advantage advantage tois otheris that that blogsmedia. blogs based basedTherefore on on hype , hypean r-r- discussed by Su and Chou in [16]: importantlinkslinks typically typically advantage inform informis faster thatfaster blogs than than based broadcast broadcast on hype or or print r- print (1) (1) linksmedia.media. typicallyThusThus , informthe, thelatestlatestfaster information information than broadcast on on specific specific or printtopics topics (1) media.isis generally generallyThus, foundthe foundlatest in in weblogs. weblogs.information on specific topics is generally found in weblogs. TheThe critical critical factor factor in in this this equation equation is isto to obtain obtainthethe constant , which defines the Minowski metric. 2.32.3InformationInformation Retrieval Retrievalandand S Searchearch E Enginenginess Theconstant critical factor, which in this defines equation the isMinowski to obtain metric.the 2.3 Information Retrieval and Search Engines constantInIn so so doing, doing,one,onewhich has has todefines to consider consider the thatMinowski that Minowski Minowski metric. m me- e- Information Retrieval is interdisciplinary, based on Intrics sotrics doing,count countoneasas a has a matter matter to consider of of principle principle that Minowski on on dissimilarity dissimilarity me- Information Retrieval is interdisciplinary, based on measurements; that is, the bigger the measure the Informationcomputer Retrievalscience, isand interdisciplinary establishes the, based retrieval on of tricsmeasurements count as a; matterthat is , ofthe principle bigger the on dissimilarity measure the , and establishes the retrieval of more dissimilar the single elements are. In many computerinformation science from, aand set establish of documentses the as retrieval Baeza- Yates of measurementsmore dissimilar; that the is , singlethe biggerelements the measureare. In many the information from a set of documents as Baeza-Yates cases, not the distance but the similarity is desired. informationand Ribeiro from-Neto a set outlineof documentsin [14] as. AccordinglyBaeza-Yates, IR morecases dissimilar, not the distancethe single but elements the similarityare. isIn desiredmany . and Ribeiro-Neto outline in [14]. Accordingly, IR Commonly the similarity is obtained by subtracting andrepresents Ribeiro-Netothe scienceoutline ofin searching [14]. Accordingly for documents,, IR casesCommonly, not the the distance similarity but is the obtained similarity by is subtracting desired. represents the science of searching for documents, the particular distance from 1. Therefore it is possible representsfor informationthe science within of documents searching and for documents, for meta data Commonlythe particular the distance similarity from is obtained1. Therefore by subtractingit is possible for information within documents and for meta data to calculate one from the other. for about information documents, within as documents well as that and of for searching meta data data- theto particularcalculate onedistance from fromthe other. 1. Therefore it is possible about documents, as well as that of searching data- For metric data mainly four distance measurements aboutbases documents, and the WWWas well. Thereas that is ofa partialsearchingcoverage data- in toFor calculate metric one data from mainly the other. four distance measurements bases and the WWW. There is a partial coverage in are used: the Euclidean, the squared Euclidean, the basesthe and handling the WWW of the. There terms is D ataa partial Retrieval,coverage Documentin Forare metric used : datathe Euclidean, mainly fourthe distancesquared measurements Euclidean, the the handling of the terms Data Retrieval, Document Block- and Chebyshev's distance. theR handlingetrieval, ofInformation the terms Retrieval Data Retrieval,, and Text Document Retrieval , areBlock used- and: the Chebyshev's Euclidean, thedistance.squared Euclidean, the Retrieval, Information Retrieval, and Text Retrieval, However, for non-metric data, coefficients such as for Retrieval,but each Information has its own Retrieval body, of and literature, Text Retrieval practice, s, BlockHowever- and, Chebyshev'sfor non-metric distance. data, coefficients such as for but each has its own body of literature, practices, example the Dice, the Jaccard, the Kulczynski, the search engines as Technorati,buttechnologies Blogdigger, each has etc. itsand are own theory special body. search of literature, For pract metricice sdata, mainlyHoweverexample four ,distance for the non Dice,- metricmeasurements the data Jaccard,, coefficients are the used: Kulczynski,such as for the technologies and theory. Russel and Rao, the Simple Matching and the Tani- engines for searching blogs.technologies ToNormally index the, andIR underlying systemstheory. sources are used the to diminishthe Euclidean, what is the examplesquaredRussel andEuclidean, the Dice,Rao, the thethe Simple Jaccard,Block- andMatching the Cheby Kulczynski, and- the Tan the i- Normally, IR systems are used to diminish what is moto coefficient, are used Normallycalledcalled information, informationIR systems overload overload are used. Web. Web to search diminishsearch engines engines what areis are Russelmoto coefficientand Rao, ,theareSimple used Matching and the Tani- search engine draws on web agents. shev’s distance. widely, as Backhaus, calledthe informationmost obvious overloadIR application.. Web search A web engines search engineare motowidely coefficient, as , are Backhaus, used the most obvious IR application. A web search engine Erichson, Plinke and Wei- A web agent is a programthe thatis most an accumulatesinstrument obvious IR intended pagesapplication. fromto search Athe w eb forsearchHowever, information engine for onnon-metric widelyErichson, data,, as coefficients Plinke Backhaus, and such We i- as for example is an instrument intended to search for information on ber in [17] explain; multiple Internet in an automatedis and anthe instrumentmethodical Internet. intended Theway. searchAccordingto search results to for areinformationthe typically Dice, on thepr e- Jaccard,Erichson,ber thein [ 17 Kulczynski, Plinke] explain and; multiple the We Russeli- and Rao, the Internet. The search results are typically pre- non-metric but set-based Kobayashi and Takeda inthe [15]sented Internet there in are. a The singleother search termslist and resultssuch are as generally are typicallythecalled Simplepr “e-hits Matching”. bernon andin- metric[17 the] explain Tanimoto but ; setmultiple -coefficient,based are used sented in a single list and are generally called “hits”. ordinal distance used in ants, automatic indexers, bots,sentedThe crawlers, in information a single worms, list can orand webconsist are spiders generally of images,widely,called text “ashits, Backhaus,”web. nonordinal Erichson,-metric distance but Plinke set andused-based Weiber in in [17] ex- The information can consist of images, text, web Information Retrieval are and web robots. Thepages information, and auxiliary can consisttypes of of documents images,plain; .text multiple, web non-metricordinalInformation but distance set-based Retrieval used ordinal inare distance used pages, and auxiliary types of documents. the Jaccard, the Simple pagesAA number number, and auxiliary of ofsearchsearchtypes engines engines of documents mine mine data data. in byInformationby dint dint of of a aRetrieval Informationthe are Jaccard the Jaccard, Retrieval, the Simpletheare Simple Matching Primarily, these agents are usedweb to create agent aavailable copy of all inthe newsbooks, visited databases, or theMatching Jaccard , andthe lastlySimple the FigureFigure 1 .1Venn. Venn diagram. diagram. A web number agent of searchavailable engines in newsbooks, mine data databases,andby dintlastly of the a or Dice coefficient.Matching and lastly the pages for later processing by the search engine that will list the Dice coefficient. webopenopen agent directories directoriesavailable. .Particular Particular in newsbooks, search search engines engines databases, as as Tec Tec orh- h- MatchingDice coefficient. and lastly the Figure 1. Venn diagram. pages to provide fast andopen sophisticatednorati, directories Blogdigger, searches.. Particular etc. Therefore, are search special engines searchThe as enginesJaccard Tech- forcoefficient Dice coefficient. for example measures similarity be- norati, Blogdigger, etc. are special search engines for The Jaccard coefficient for example measures simi- these agents gather metanorati, datasearching from Blogdigger, web blogs. pages, etc.To suchareindex special as the tags underlying search tween engines sources sample for the sets. TheLet AJaccard and B becoefficient the sets forof resources example measures so simi- searching blogs. To index the underlying sources the larity between sample sets. Let and be the sets of from folksonomies. searchingsearchsearch engine engineblogs. draws Todraws index on on webthe web underlyingagents. agents. thesources Jaccard the coefficient Thelarity Jaccardis betweendefinedcoefficient sampleas the size setsfor of . exampleLetthe intersectionandmeasures be the sets simi- of resources so the Jaccard coefficient is defined as the searchAAwebweb engine agent agent drawsisis a a progron progr webamam agents.thatthat accumulates accumulatesdivided pagesbypages the size larity resourcesof the between union so ofsample the the Jaccard sample sets. coefficientLet sets and(cf. Venn isbe defined the sets as ofthe Hence, the web agent initially starts with a list to visit, called the size of the intersection divided by the size of the Afromwebfrom the agent the Internet Internetis a progrinin an amanautomatedautomatedthat accumulates and anddiagram methodical methodicalpages in fig. 1): resourcessize of theso the intersection Jaccard coefficient divided by is thedefined size as of the the “seeds”. While the agent visits these seeds, it identifies all the union of the sample sets (cf. Venn diagram in fig. 1): fromway the. AccordingInternet in to an Kobayashiautomated and and Takeda methodical in [15] sizeunion of of the the intersection sample sets divided(cf. Venn by dia thegram size in offig . the1): tagged sources in the site andway .subjoins According them to in Kobayashi the so-called and Takeda in [15] waytherethere. According are are other other toterms terms Kobayashisuchsuch as as andants,ants, Takedaautomatic automatic in inde [15inde]x-x- union of the sample sets (cf. Venn diagram in fig. 1): “crawl frontier list”. Sources from the crawl frontier are visited thereers,ers, are bots, bots, other cr crawlers awlersterms, ,wormssuchworms as, ,orants,or web web automatic spiders spiders and inde andwebx-web (2) (2) (2) recursively in accordance to a set of conventions. ers,robotrobot bots,s.s. crawlers, worms, or web spiders and web (2) robotPrimarilys. , these agents are used to create a copy of Primarily, these agents are used to create a copy of AnotherAnothercommonlycommonlyusedused coeffi coefficientcientisis t hetheSimpleSimple 2.4 Distance Metrics Primarilyallall the the , visitedthese visited agentspagespages are for for used later later to processing create processingAnother a copy by commonly by of the the usedMatching coefficientcoefficient. is theSimple This Matchingcoefficient coef for- a set and A metric is a function whichsearch defines engine a distance that will betweenlist the pages ficient. to provide This fastcoefficient AnotherMatching for commonlya coefficient.set A and usedB Thisis calculated coefficoefficientcient as: foris athe set Simpleand allsearch the visitedengine pagesthat will for list laterthe pages processing to provide by the fast Matchingis calculatedcoefficient. as: This coefficient for a set and elements. Therefore a distanceand metric sophisticated attends tosearches. the analysis Therefore , these agents is calculated as: searchand sophisticated engine that willsearches. list the Therefore pages to, providethese agents fast is calculated as: of differences. Adequate distanceandgather sophisticated indices meta datashould searches.from consider web Therefore pages, the such, these as agentstags from gather meta data from web pages, such as tags from (3) character of different scalesgatherfolksonomies folksonomiesand meta at the data same. . from time web undertake pages, such as tags from (3) (3) the effort to figure out the folksonomiescommonHenceHence, the, mostthe.web websignificant agent agentinitial initialinformationlylystartsstarts with with a alist list to to visit, visit, (3) content of different elements.Hencecalledcalled, the the theweb “seeds“ seedsagent”. ”initial. WhileWhilely startsthethe agentwithagent a visitslistvisits to visit, these these TheTheDiceDicecoefficientcoefficientisis a a modification modification of of the the Simple Simple calledseeds the, it “identifiesseeds”. Whileall the thetagged agent sourcesThevisits Dicein these the coefficient site Matchingis a modification coefficient. of theThe Simple Dice coefficient Matching for a set are composed of commentary, news or informationDistance metricsDistance attend metrics toseeds the identification, attend it identifies to the all of identification the the tagged distance sources of the in the site TheMatchingDice coefficient coefficient.isThe a modification Dice coefficient of the for Simple a set seedsand, subjoins it identifiesthem all in thethe taggedso-called sources“crawlcoefficient.in frontier the sitelist The”. Diceand coefficientis obtained for a withset :A and B is obtained on special topics. A typical weblog combinesbetween text, two distance individual between elements.and subjoins two Theindivthem basisidualin elementsthe for so distance-called. The “ crawlbasis frontier list”. Matchingand is coefficient.obtained withThe: Dice coefficient for a set andSources subjoinsfromthem thein crawlthe so frontier-called are“crawl visited with:frontier recursivelylist”. images, and essential links to other blogs, wmeasurementeb pag- isfor the distance Minkowski Sourcesmeasurement metricfrom as discussed theis the crawl Minkowski frontierby Su and aremetric visitedas recursively and is obtained with: Sourcesin accordancefrom the tocrawl a set frontier of conventions are visited. recursively es, and additionally to other media. ThereforeChou, inan [16]: discussed byin Su accordance and Chou toin a[16 set]: of conventions. (4 (4) ) in accordance to a set of conventions. important advantage is that blogs based on hyper- (4) (4) 2.42.4DistanceDistance M Metricsetrics links typically inform faster than broadcast or print 2.5 Fuzzy Logic, Fuzzy Sets and Fuzzy Data Clus- 2.4 Distance Metrics (1) (1) 2.5 Fuzzy Logic, Fuzzy Sets and Fuzzy Data Clus- media. Thus, the latest information on specific topics A metric is a function which defines a distance be- 2.5 Fuzzyteringtering Logic, Fuzzy Sets and Fuzzy Data Clus- is generally found in weblogs. A metric is a function which defines a2.5 distance Fuzzy Logic, be- Fuzzy Sets and Fuzzy Data Clustering Atweenmetrictween elements. iselements. a function Therefore Therefore which a definesadistancedistance a metric distance metric attend attend be- s s tering The critical factor in this equation is to obtain theFuzzy logic allows theF modelinguzzy logic of allows uncertainty the modeling associatedof withuncertainty asso- The critical factor in this tweenequationtoto the theelements. analysisisanalysis to obtain Therefore of of differencethe difference constant, a distances.s . Adequate Adequate metric attend distance distances Fuzzy logic allows the modeling of uncertainty asso- 2.3 Information Retrieval and Search Engines constant , which defines the Minowski metric. ciated with vagueness and imprecision and putting r ( ≥ 1)which defines theto indices Minowskiindices the analysis should shouldmetric. of consider Inconsider difference so doing, the thes . characterone Adequatecharacter vagueness of ofdistance different different and imprecisionFuzzyciated logic with and allows vagueness putting the thismodeling andinto impreciappropriateof uncertaintysion and assputtingo- In so doing, one has to consider that Minowski me- this into appropriate mathematical equations. Human indicesscalesscales shouldandand at at theconsider the same same the time time character undertake undertake mathematical of the the different effort effort to toequations. ciatedthis into Humanwithappropriate vagueness reasoning mathematical is and not imprecidichotomous, sionequations and .put Humanting Information Retrieval is interdisciplinary, basedhas to onconsider trics that count Minowskias a metrics matter count of principle as a matter on dissimilarity of reasoning is not dichotomous, unlike software pro- scalesfigurefigure and out out at the the the common common same time most most undertake significant significant unlike theinformatio informatio effort software to n n programs,thisreasoning into appropriate where is not everything dichotomous, mathematical is either unlikeequations true software . Human pro- computer science, and establishes the retrievalprinciple of on dissimilaritymeasurements measurements;; that is, the that bigger is, the the bigger measure the grams, where everything is either true or false. It figurecontentcontent out of of the different different common elements elements most. significant. orinformatio false. It ndeals withreasoninggrams, imprecision where is not everythingand dichotomous, the conceptions is eitherunlike truearesoftware or false. pro- It information from a set of documents as Baezathe-Yates measure themore more dissimilar dissimilar the the single single elements elements are.are. InIn many deals with imprecision and the conceptions are ambi- content of different elements. ambiguous in the sensegrams,deals that with where they im precision everythingcannot be and sharply is the either conceptions defined. true or are false. amb It i- and Ribeiro-Neto outline in [14]. Accordinglymany, IRcases, notcases the, notdistance the distance but the butsimilarity the simi islarity desired. is desired . For instance, the questiondeals with whether imprecision the temperature and the conceptions is ‘hot’ or are ambi- represents the science of searching for documents,Commonly theCommonly similarity is the obtained similarity by issubtracting obtained the by par subtracting- not cannot be unanimously answered. Despite the fact that the for information within documents and for metaticular data distancethe from particular 1. Therefore distance it isfrom possible 1. Therefore to calculate it is possible about documents, as well as that of searchingone dat froma- the other.to calculate one from the other. definition of the word ‘hot’ is clear, it is not possible to clearly bases and the WWW. There is a partial coverage in For metric data mainly four distance measurementsstate if a temperature is hot because the answer may depend the handling of the terms Data Retrieval, Document are used: the Euclidean, the squared Euclidean, theon the individual perception. For a person it may even not be Retrieval, Information Retrieval, and Text Retrieval, Block- and Chebyshev's distance. possible to give a precise and clear answer as belonging to but each has its own body of literature, practices, However, for non-metric data, coefficients such as fora concept (e.g. hot temperature) is often not sharp but fuzzy, technologies and theory. example the Dice, the Jaccard, the Kulczynski, theinvolving a partial matching expressed in the natural language Normally, IR systems are used to diminish what is Russel and Rao, the Simple Matching and the Tanbyi- the expressions ‘fairly, ‘slightly’, ‘more or less’, etc. called information overload. Web search engines are moto coefficient, are used In figure 2, the meaning of the expressions ‘cold’, ‘warm’, and the most obvious IR application. A web search engine widely, as Backhaus, is an instrument intended to search for information on Erichson, Plinke and Wei- ‘hot’ is represented by functions mapping a temperature scale. the Internet. The search results are typically pre- ber in [17] explain; multiple A point on that scale has three truth values — one for each of sented in a single list and are generally called “hits”. non-metric but set-based the three functions. The vertical line in the image represents The information can consist of images, text, web ordinal distance used in a particular temperature that the three arrows (truth values) pages, and auxiliary types of documents. Information Retrieval are determine. Since the topmost arrow (one) points at 0.8, this A number of search engines mine data by dint of a the Jaccard, the Simple temperature may be interpreted as ‘fairly cold’. The second ar- web agent available in newsbooks, databases, or Matching and lastly the Figure 1. Venn diagram.row (two) pointing at 0.3 may be described as ‘slightly warm’ and open directories. Particular search engines as Tech- Dice coefficient. the arrow at the bottom (three) points to zero, as ‘not hot’. norati, Blogdigger, etc. are special search engines for searching blogs. To index the underlying sources the The Jaccard coefficient for example measures simBasedi- on fuzzy logic, the fuzzy set theory was developed in search engine draws on web agents. larity between sample sets. Let and be the sets of1965 by Zadeh at the University of Berkeley, California. The A web agent is a program that accumulates pages resources so the Jaccard coefficient is defined as thefuzzy set theory is – built on intuitive deduction by considering from the Internet in an automated and methodical size of the intersection divided by the size of thehuman imprecision and subjectivity – not an inexact theory way. According to Kobayashi and Takeda in [15] union of the sample sets (cf. Venn diagram in fig. 1):but a rigorous mathematical one which deals with uncertainty there are other terms such as ants, automatic index- Figure 1. Venn diagram and subjectivity. ers, bots, crawlers, worms, or web spiders and web (2) robots. 278 Journal of Digital Information Management  Volume 8 Number 4  August 2010 Primarily, these agents are used to create a copy of Another commonly used coefficient is the Simple all the visited pages for later processing by the Matching coefficient. This coefficient for a set and search engine that will list the pages to provide fast is calculated as: and sophisticated searches. Therefore, these agents gather meta data from web pages, such as tags from folksonomies. (3) Hence, the web agent initially starts with a list to visit, called the “seeds”. While the agent visits these The Dice coefficient is a modification of the Simple seeds, it identifies all the tagged sources in the site Matching coefficient. The Dice coefficient for a set and subjoins them in the so-called “crawl frontier list”. and is obtained with: Sources from the crawl frontier are visited recursively in accordance to a set of conventions. (4) 2.4 Distance Metrics 2.5 Fuzzy Logic, Fuzzy Sets and Fuzzy Data Clus- A metric is a function which defines a distance be- tering tween elements. Therefore a distance metric attends to the analysis of differences. Adequate distance Fuzzy logic allows the modeling of uncertainty asso- indices should consider the character of different ciated with vagueness and imprecision and putting scales and at the same time undertake the effort to this into appropriate mathematical equations. Human figure out the common most significant information reasoning is not dichotomous, unlike software pro- content of different elements. grams, where everything is either true or false. It deals with imprecision and the conceptions are ambi- Figure 2. Fuzzy logic temperature sets (including truth values)

In [2] and [18], Zadeh, Dubois, Prade and Yager explain that in traditional set theory, the membership of elements in a set is assessed in binary terms corresponding to a two-valued condition. As a result, an element either belongs or does not Figure 3. An example of a weak ontology belong to the set. Contrary to this, the fuzzy set theory allows for continuous assessment of the membership of elements in a intervention by human beings. In addition, a weak ontology con- set. According to Dubois and Prade in [19], fuzzy sets general- verges with Boolean logic and other subfields in which automatic ize the classical sets, since the indicator functions of classical reasoning is known to be possible. sets are particular cases of the membership functions of fuzzy sets, where the second only take values 0 or 1. 3. The Fuzzy Grassroots Ontology The term “data clustering” describes the method of grouping In the following, important aspects of this research are presented data elements into clusters or classes, so that elements in the in more detail. To establish a common vocabulary, the scope of same cluster are as identical as possible, and elements in dif- this research is specified in chapter 3.1. Next, each element of ferent clusters are as diverse as possible. Depending on the the proposed search engine is elaborated. This chapter closes intention for which clustering is being used and the nature of with a simple example to emphasize the benefit of the fuzzy the data, special metrics of relationship may be used to place weblog extraction (chapter 3.2). items into clusters, where the relationship measure controls how the clusters are shaped. Hence one has to distinguish between 3.1 Building Blocks hard and soft clustering. Information overload leads to an important question: How can In hard clustering, data is separated into distinctive clusters, relevant information be extracted from Web 2.0 and Web 3.0 where all data elements belong precisely to one single clus- applications? One possible approach is the proposed extraction ter. According to fuzzy set theory, Bezdek shows in [6] that in with the use of a fuzzy data clustering algorithm. fuzzy clustering, data elements can belong to more than one cluster. Associated with each element is a set of membership In figure 4, an overview over the fuzzy weblog extraction is levels, which indicate the potency of the relationship between given. The main parts are the graphical user interface (with that data element and a particular cluster. Fuzzy clustering is integrated search query engine), the meta search engine, and a method of assigning these diverse levels of membership and the aggregated documents. Each element is discussed in more allocating data elements to one or more clusters according to depth in the rest of this section. the membership values. 3.1.1 Interface 2.6 Ontology To search for relevant information from weblogs, a user enters Ontologies – the term has its origin in philosophy – are in the search terms ka, kb and kc in an interface, called user need theory artifacts of objects and their ties. Hence ontologies (cf. chapter 1). Included in this user interface is a convenient provide criteria for distinguishing various types of objects (e.g. tool (e.g. a slider) to determine the weight k of the key terms concrete and abstract, existent and non-existent, real and ka, kb and kc and which generates the search query. A suitable ideal, independent and dependent) and their ties (relations, dependences and predication). Within computer science the term stands for a design model for specifying the world that consists of a set of types, relationships and properties. What is provided precisely can deviate, but these properties are fundamentals of every ontology. According to Gruber in [20], an ontology is a “formal, explicit specification of a shared conceptualization”. There is an expectation that the model bear analogy to the real world as well; however, it definitely offers a common terminology which can be used to model a domain. A domain is the type of objects and concepts that exist, and their properties and relations In theory there is a distinction between strong and weak ontologies, whereas we use at this point only weak ontologies (cf. fig. 3). A weak ontology is one that is not sufficiently as rigorous as a strong one and therefore allows software to insert new details without an Figure 4. Overview over the weblog extraction process

Journal of Digital Information Management  Volume 8 Number 4  August 2010 279 example which clarifies this can be found in [14] by Baeza-Yates moreover, we need to distinguish between the single-word and and Ribeiro-Neto. They define a search query[q = ka ∧ (kb ∨-kc)]. the phrase-based approaches as well. In their example, each cci, i∈{1,2,3} is a conjunctive module. Dictionary-based approaches compare entered query terms Let Da be the fuzzy set of documents related to the term ka . with a dictionary and if the query term is not covered by the This set may be composed, for example, of the documents dj dictionary they search for similar terms. which have a degree of relationship maj bigger than the already _ Statistical methods refer by misspellings with no or only few hits predefined weightK . Further, let Da be the complement of the _ to the most commonly used similar syntax. To determine the set Da . The fuzzy set Da is related to ka, the negation of the phonetic similarity the words will be reduced to a code, which term ka . Correspondingly, we can characterize fuzzy sets Db conforms to similar terms. A well-known example especially for and Dc related to the key terms kb and kc, equally. Because the English language is the Soundex algorithm – patented by every single one of these sets is fuzzy, a document dj might Russell – for indexing names by sound. The goal is for homo- belong to the set Da for example, even if the document text dj phones to be encoded to the same representation so that they does not include the term ka. can be matched despite minor differences in spelling. The pat- After the entire user need is specified, it can be processed by ents to this can be found in [23] and [24]. The algorithm mainly the query engine. encodes consonants; a vowel will not be encoded unless it is the first letter. The Soundex algorithm as an outline is: 3.1.2 The Query Engine 1. Capitalize all letters in the word and drop all punctuation The query engine generates a fuzzy set search query from the marks. provided data (user need). The query fuzzy set Dq is a fusion of 2. Retain the first letter of the word. the fuzzy set related with the three conjunctive components of cc1, cc2, cc3. The relationship mdj of a document dj in the fuzzy answer 3. Change all occurrence of the letters set Dq is calculated as md, j = mcc + cc + cc , where mi, j,∈ {a, b,c} is the 1 2 2, 1 • 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y' → 0 relationship of di in the fuzzy set associated with ki . In this case, the level of relationship in the disjunctive fuzzy set is calculated 4. Replace consonants with digits as follows: using an algebraic sum. Additionally, the level of relationship in • 'B', 'F', 'P', 'V' → 1 a conjunctive fuzzy set is calculated using an algebraic product. • 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z' → 2 The adoption of algebraic sums and products yields relationship levels which can vary seamlessly. • 'D', 'T' → 3 • To generate an adequate fuzzy set search query, the query 'L' → 4 engine uses the data provided by web agents that crawled • 'M', 'N' → 5 through the Internet and collected meta data (e.g. tags from • 'R' → 6 folksonomies). In this sense the ability to find high-quality sources, such as documents or people, is important to over- 5. Collapse adjacent identical digits into a single digit of that come the information overload. Collaborative filtering systems, value. or recommender systems, identify high-quality sources utilizing 6. Remove all non-digits after the first letter. individual knowledge. One known algorithm which is successful 7. Return the starting letter and the first three remaining digits. in identifying high-quality sources in a hyperlinked environment If needed, append zeroes to make it a letter and three digits. automatically is the Hyperlink-Induced Topic Search (HITS) algorithm proposed by Kleinberg in [21]. Improvements to the Soundex algorithm, as for example a fuzzy Soundex algorithm as Holmes and McCabe present in HITS starts from a small root set of documents to a larger [25], are the basis for many modern phonetic algorithms and set T by adding up documents that link to and from the can be used to correct misspellings in taxonomies. A major documents in the root set. The ambition of the algorithm is advantage of the utilization of a Soundex algorithm is that the to identify hubs, the documents that link to numerous high correctly spelled ontology terms can as well be used as a kind quality documents, and authorities, the documents that are of auto-completion and -suggestion while the user is typing linked from numerous high quality documents. The hyper- search terms in the interface. link structure amongst the documents in T, is exposed by the adjacency matrix A, where Aij denotes whether there Nevertheless, the recovered tags will be arranged as Hasan- is a link from document di to document dj. By means of this Montero and Herrero-Solana in [4] suggest. Tag similarity is matrix A, a weighting algorithm constantly updates the hub measured as a kind of semantic correlation between tags, weight and authority weight for every document, until the considered by means of relative co-occurrence among tags. weights converge. Essentially, the hubs and authorities are This is the already presented Jaccard similarity coefficient, documents with biggest values in the main eigenvectors of which is, according to these authors, superior to other coef- ATA and AAT, correspondingly. ficients. LetA and B be the sets of resources characterized by two tags, relative co-occurrence is ascertained with (2). That A well-known problem in folksonomies, where people choose is, relative co-occurrence is identical to the partition among the their own tags to annotate sources, is that typing errors can amount of resources in which tags co-occur, and the amount occur since there is no editorial supervision. This leads to of resources in which any one of the two tags appear. This overlapping but barely relating terms in the underlying ontology. collection method causes these tags to become united and Certainly it can be assumed that a search system finds relevant offers a semantically consistent picture where nearly all tags information despite misspelling in tags, because the queries are related to each other. This semantically consistent picture contain the same mistakes. But the necessity of fault-tolerant is referred to as tag space. treatment of queries becomes clear. According to Lewan- dowski in [22] one has to distinguish between different types Hence these tags will be sorted by a fuzzy clustering algo- of typing error improvements as for example dictionary-based rithm, for instance the aforementioned FCM algorithm by and statistical approaches. Within the statistical approaches, Bezdek in [6].

280 Journal of Digital Information Management  Volume 8 Number 4  August 2010 covered by the dictionary they search for similar Hence these tags will be sorted by a fuzzy clustering terms.covered by the dictionary they search for similar algorithm,Hence these for tags instancewill bethesortedafore bymentioned a fuzzy clustering FCM Statisticalterms. methods refer by misspellings with no or algorithmalgorithm, by Bezdekfor instance in [6]. the aforementioned FCM onlycoveredStatisticalcovered few by hits bymethods the to the dictionary the dictionary refer most by commonlythey they misspellings search search used forThe with for similar similar FCM similar no or algorithm TheHencealgorithmHence FCMattemptsthese thesealgorithm by tags Bezdekto tags splitwill attemptswill beain be limited[sorted6].sorted to splitbycollection by aa fuzzya limitedfuzzy ofclustering clusteringcollec- syntax.terms.onlyterms. few To hitsdetermine to the the most phonetic commonly similarity usedelements similar the X = {Xtionalgorithm,1,...,Thealgorithm, of X nelementsFCM} into for algorithm afor assortment instanceX instance= { attempts, . .the of.,the c aforefuzzy}to intoafore splitmentioned clustersamentioned assortmenta limited FCMcolle FCMof c- wordsStatisticalsyntax.Statistical will methods beTomethods reduceddetermine refer refer to aby the code, by misspellings phonetic misspellings which conforms similarity withwith with no regard no toor the or to somealgorithmfuzzytionalgorithm specifiedof clusters elements by by Bezdek Bezdek with condition. X =inregard {in[6 ][,.6 .] .to.In ., some fuzzy} into specified clustering,a assortment condi- of similaronlywordsonly few terms. few will hits hits be A to reduced towell the the- known most mostto a commonly example code, commonly which especially used usedconformseach similar point similar for to has tionThea Thelevel.fuzzy FCMIn FCM of fuzzy clusters algorithmbelonging algorithm clustering, with attempts to attemptsregard clusters, each toto pointto split someas split ina has specifiedlimitedfuzzya limited a levelcollecollecondofc- c-i- thesyntax.similarsyntax. English To terms. To languagedetermine determine A well is-known the the the phonetic Soundexphoneticexample similarity algoriespeciallysimilaritylogic,thm rather the – thefor than belongingtion belongingtiontion of. ofelementsIn elements fuzzyto to clusters,just X clustering, one =X {= particular as{, . , in. .., each. fuzzy., cluster.} into} point logic,into a Thus,assortmenta has ratherassortment a level thanof ofof patentedwordsthewords English will willby be Russell be reduced language reduced – for to to is aindexing code, a the code,Soundex whichnames which conforms bypointsalgoriconforms soundthm on to . the to– edgebelongingfuzzybelonging offuzzy a clusterscluster clusters to to just may clusters,with one with be regard particular inregard asthe to incluster tosome fuzzycluster. some to specified logic, a specified Thus,less rather pointscondcond thani- i- Thesimilarpatentedsimilar goal terms. terms.is by for RussellA hA wellomophones well- known–-knownfor indexing example toexample be encodednamesespeciallyespeciallysignificant by to sound the for for level. ontion thanbelonging tion the. In. points edgeIn fuzzy fuzzy in to of the clustering, just a clustering, clustercenter one particular ofmay each cluster each be point in. pointcluster. For the has each has clus Thus, a ter alevel level topoints of a of sametheThethe English representation Englishgoal is language for language homophones so is that is the the theySoundex toSoundex becan encoded bealgori pointalgori matchedthm X tothm there– the – islessbelonging aon belongingcoefficient significant the edge to to clusters, givinglevel of clusters, a than clusterthe as grade aspoints in may in fuzzy fuzzyof in be being the logic,in logic, center thein ratherthe clus rather of ter clu than s-to than a despitesamepatented minor representation by differencesRussell – sofor in that indexingspelling. they namesThe can patentsbeth by matched sound to . terlessbelonging. For significant each topoint justlevel onetherethan particular ispointsa coefficient incluster. the center Thus,giving of points theclu s- patented by Russell – for indexing names kby clustersound .u k(x). Characteristically,belonging to just theone sum particularth of those cluster. coefficients Thus, points thisThedespiteThe can goal goal beminor is is forfound for differencesh omophones hinomophones[23] and in tospelling. [24] to be be. encoded The encodedThe algorithm patents to to the the to gradeteron. theForof being edgeeach in ofpoint the a clusterthereclustermay is a becoefficient in. Characterist the clus givingter toi-the a is defined as 1: on the edge of a clusterthmay be in the cluster to a mainlysamethissame representationcan encodes representation be foundconsonants; soin so[23] that that a and they vowel they [24] can willcan. The be not be matched algorithm be matched en- callylessgradeless ,significant the significantof sum being oflevel those inlevel thethan coefficientsthan points clusterpoints in isinthe definedthe center . centerCharacterist as of 1: ofclu clus- s-i- codedmainlydespite unless encodesminor it is differences theconsonants; first letter. in spelling.a The vowel Soundex willThe not patents a belgo- e n- to callyter. For, the each sum ofpoint those there coefficients is a coefficient is defined giving as 1: the despite minor differences in spelling. The patents to ter. For each point thereth is a coefficient giving the Figure 5. Dendrogramm with example clusters rithmthiscodedthis canas can an unless be o be utline found foundit isis in:thein[23] first[23]and letter.and [24] [24]The. The. SoundexThe algorithm algorithmalg o- gradegradeof ofbeing being in inthe the th clustercluster . Characterist. (5)Characterist (5i-) i- mainlyrithmmainly encodesas encodes an outline consonants; consonants;is: a vowela vowel will will not not be be en- en- callycally, the, the sum sum of ofthose those coefficients coefficients is definedis defined as as 1: 1: (5) codedcoded1. unless Capitalizeunless it isit istheall the lettersfirst first letter. inletter. the The wordThe Soundex Soundexand dropalg aallo-lg o- The focal point of a cluster is by fuzzy -means, theWith the FCM algorithm – on the basis of the collected tags rithm aspunctuation an outline marks.is: rithm as1. anCapitalize outline is: all letters in the word Theand dropfocal allpoint averageof Thea cluster focalof is allpoint by points, fuzzyof a clusterc weighted-means, is bythe by fuzzy average their amount-means, and( 5of the)( 5 in) alliance with the chosen weight K (user need) of the key 2. Retain the first letter of the word. punctuation marks. of all points, weightedbelongingaverage by to oftheir the all cluster: amount points, of weighted belonging by to their the amount termsof ka, kb and kc (cf. fig. 4) – it is possible to identify all related 3. 2.1.ChangeRetainCapitalize all the occurrence firstall letters letter ofin the word.lettersword and drop all 1. Capitalize all letters in the word andcluster: drop all ThebelongingThe focal focal point topoint the of ofcluster:a clustera cluster is isby byfuzzyfuzzy - means,-means, theterms the to a certain degree. These terms are sub-summarized 3. Changepunctuation all occurrence marks. of the letters punctuation marks. averageaverageof of all all points, points, weighted weighted by by their their amount amount (6withof) of fuzzy set search query, which after that will be sent to a 4.2. 2.ReplaceRetainRetain the consonants the first first letter letter ofwith ofthe digitsthe word. word. as follows : belonging to the cluster: meta search engine. The belonging degrees of the terms later 3. Change all occurrence of the letters belonging to the cluster: (6) (6) 3.4. ChangeReplace all consonantsoccurrence withof the digits letters as follows: are used as ranking factor.   The amount of belonging is associated to the inverse  (6) 4. Replace  consonants with digits as follows: of Thethe distanceamount ofto belongingthe heart of is the associated cluster: to the inverse (6) 4. Replace consonants with digits as follows: 3.1.3 The Meta Search Engine   The amount of belongingof the distance is associated to the heart to the of inversethe cluster: of the The amount of belonging is associated to the inverseA meta search engine sends the provided terms of the fuzzy   distance to the heartThe amountof the cluster:of belonging is associated to the inverse (7) of the distance to the heart of the cluster: set search query to numerous weblog search engines, such as   of the distance to the heart of the cluster: (7) Technorati, Blogdigger, etc. (cf. fig. 4). 5. Collapse  adjacent identical digits into a sin- In that case the coefficients are normalized and fuz- (7) (7) 5. gleCollapse digit  of that adjacent value. identical digits into a sin- zyfiedIn that with case a truetheparameter coefficients are normalizedso that their and sum The(7 fu) z- fuzzy set search query contains the term searched for, 6. Removegle digit a llof non that-digits value. after the first letter. is zyfied1. Hence with, a true parameter so that theirand sum the user-chosen related terms are drawn on the build 7. 5.ReturnCollapse the startingadjacent letter identical and digits the first into three a si n- 5.6. CollapseRemove adjacent all non- digitsidentical after digits the first into letter. a sin- InisInthat 1.that Hencecase casethe, the coefficients coefficients are are normalized normalized and and fuontology. z-fuz- 7. remaininggleReturngle digit digit of the digits.ofthat that starting value. value. If needed, letter and append theIn firstthat zeroes case three the coefficientszyfied with area true normalizedparameter and fuzzyfiedso that with their (8 sum) zyfied with a true parameter so that their sumAfter the completed search, the meta search engine aggregates 6. 6.toRemove remainingmakeRemove it a all a nonletterll digits. non-digits and-digits If three needed,after after thedigits. the first append firsta letter. true letter. zeroes parameter m(>is 1. 1) Hence so that, their sum is 1. Hence, is 1. Hence, the (8 results) and displays them clustered in folders according to 7. 7. ReturntoReturn make the theit startinga startingletter and letter letter three and and digits. the the first first three three Improvementsremainingremaining to the digits. Soundex digits. If needed,If needed,algorithm, append appendas for zeroes zeroesex- their (8 )source. ample a fuzzyto make Soundex it a letteralgorithm and three as digits. Holmes and For equal to 2, this is the same as normalizing(8) the(8) Improvementsto make toit athe letter Soundex and threealgorithm, digits. as for ex- To rank the returned documents inside the clusters, machine McCabeample presenta fuzzyin Soundex [25], are algorithmthe basis for as many Holmes mo d- and coefficientFor equals linearly to 2, tothis make is the their same sumas 1.normalizing When isthe close to 1, then the cluster center closest to the pointlearning (ML) – as Taylor et al. in [26] explains – could be used ernImprovementsMcCabeImprovements phonetic present algo to torithmsthein the [25]Soundex Soundex and, are can thealgorithm, bealgorithm,basis used for as tomany as correctfor for moe x-e d-x- coefficients linearly to make their sum 1. When is is For given aequal considerably to 2, this islarge ther sameamountas normalizingextra weightto thediscover appropriate web ranking functions. An input for ML misspellingsampleernample phonetica a fuzzy fuzzyin taxonomies. algoSoundex Soundexrithms algorithm andAalgorithm major can advantage be as as used Holmes Holmes to of correct andthe and Forclose equal to 1, tothen 2, thisthe clusteris the same centeras closestnormalizing to the thepoint For m equal to 2,than thiscoefficient isthe the others. sames linearly as normalizing to make the their coefficients sum 1. When can isbe the degrees of affiliation for the ontology-based terms utilizMcCabemisspellingsMcCabeation present of present a Soundexin intaxonomies. in[25] [25],algorithmare, are the A the majorbasisis basis that advantagefor thefor many manycorrect mo of mod-ly thed- coefficientis givens alinearly considerably to make large theirr amountsum 1. Whenextra weightis linearly to makeThe theircloseFCM sum to algorithm 1,1. thenWhen the focusesm clusteris close on center tominimizing 1, closestthen the anto theobje found.pointc- spelledernutilizern phonetic a phonetic ontologytion of algo a algoSoundextermsrithmsrithms can andalgorithm andas canwell can bebe is be usedusedthat used the toas to correctcorrecta correctkind ly closethan to the 1, others.then the cluster center closest to the point cluster center closesttiveis function. givento the pointa Toconsiderably isgenerate given a considerably anlarge ontologyr amount thelargerextraproposed, weight ofmisspellings spelledmisspellingsauto-completion ontology in intaxonomies. taxonomies. termsand -suggestion can A asmajorA majorwellwhileadvantage beadvantage used the asuser of a ofthe kindis the is The givenFCM a considerablyalgorithm focuses large ronamount minimizingextraanweight obje c- amount extra weightextendedthan than the standardthe others. others. function is: typingutilizofutiliz aauto tionsearchation- completionof ofa term Soundexa Soundexs inand thealgorithm interface-algorithmsuggestionis. isthat whilethat the the thecorrectcorrect userly isly thantive the function. others. To generate an ontology the proposed,3.1.4 Aggregated Documents The FCM algorithm focuses on minimizing an objec- Neverthelessspelledtypingspelled ontology search ontology, the term terms recovered termss in can the can asinterfacetags as well well will be .be used usedarrangedThe as as aFCM kinda as kind algorithm Theextended focusesFCM algorithmstandard on minimizing focuses function an onis objective: minimizing funcan- objeThec- aim is to organize the search results into several meaningful tive1. function.Select anTo amountgenerateof anclusters. ontology the proposed, Hasanof Neverthelessofauto auto-Montero-completion-completion, andtheand recovered Herreroand -suggestion -suggestion-Solanatags willwhile inwhile be [4 the]arrangedtion. suggest.the user Touser generate is as is tive an ontologyfunction. the To proposed,generate extendedan ontology standard the proposed, categories (clusters). These clusters are a group of similar typing search terms in the interface. extended2. 1. AssignSelectstandard coefficientsan amountfunction erraticallyof is clusters.: to each point TtypingagHasan similarity search-Montero term is measureds andin the Herrero interfaceas- Solanaa . kind in of [function4 semantic] suggest. is: extended standard function is: topics related to a term. The benefits to the user include an for being in the clusters. correlationNeverthelessTNeverthelessag similaritybetween, the, t ishe recovered measuredrecovered tags, consideredtagstagsas willa will kindbe bybe arranged ofmeansarranged semantic as of as 2. Assign coefficients erratically to each impression point of the available themes or topics, as well as an 3. 1.ReiterateSelect anuntilamount the algorithof clusters.m has converged relativeHasancorrelationHasan- Montero co-Montero-occurrencebetween and and Herrero tags,among Herrero -consideredSolana-tagsSolana. This in in [4by ]1.is[ 4 suggest. ] means the suggest.Selectal- of an amount1. of Selectclusters.for beingan amount in the clusters.of clusters. overview of related results in folders rather than scattered 2.(thatAssign is, the coefficients coefficients' erratically adjust amongto eachtwo point readyTagrelativeTag similarity presented similarity co-occurrence is Jaccard ismeasured measured among similarityasas atags a coefficientkind kind. This of of semantic is, semantic whichthe a l- 2.3. AssignReiterate coefficientsuntil the erratically algorithm to has each converged point 2. Assign coefficients iterationserraticallyfor being tois innoeach the more clusters.point than for being, a given in sensitithroughoutv- a list. iscorrelation,readycorrelationaccording presentedbetween tobetween these Jaccard tags,authors tags, considered similarity ,consideredsuperior coefficienttobyby other means means ,coe which off- of for(that being is, in the the coefficients'clusters. adjust among two the clusters. 3. Reiterate until the algorithm has converged ficients.relativeisrelative, according Let co co- occurrence-occurrenceand to thesebe theamongauthors among sets, tagsofsuperiortags resources. This. Thisto isother charaisthethe coeac-l-af-l- 3. ityReiterate iterationsboundaryuntil is value no the more): algorith thanm , has a given converged sensitiThe v- basis for the definition of the clusters is, as previously, the (that is, the coefficients' adjust among two terizedreadyficients.ready presentedby presented Let two tags,and Jaccard Jaccard relativebe the similarity similarity cosets-occurrence of coefficient resources coefficient3. is , asce Reiterate whichchara, whichr- c- until the algorithm(thatitya. boundary is, has Calculate the converged coefficients' valuethe): (that centroidadjust is, the among for eachtwopreliminarily built grassroots ontology. iterations is no more than , a given sensitiv- tainedis,terizedisaccording, according withby (2). twoto tothese tags,Thatthese authors relativeis, authors relative, superior co, superior- occurrence co-occurrenceto tootherother iscoefficients’ coeasce coe isf- r-f- adjust amongiterations twoa. cluster, is iterationsCalculate no more using is than theno themore, centroid a given than formula sensiti for (6) eachv- ity boundary value): identicalficients.tainedficients. LettowithLet the and (2). andpartition That bebe the is,amongthe sets relativesets ofthe ofresources resources coamount-occurrence charaof ∈chara , rae-c- given isc- sensitivity ityboundary boundaryabove. value):cluster, value ): using the formula3.2(6) Example of Information Screening sourcesterized inby which two tags, tags corelative-occur, co and-occurrence the amount is asceof r- b. a.Forabove.Calculate each point,the compute centroid its for coeff eachi- terizedidenticalby twoto the tags, partition relativeamong co-occurrencethe amount is asceof r- re- a. Calculate the centroid for eachFigure 7. Weak ontology with reference to the example a. Calculate the centroid for eachcientscluster, cluster, of being usingusing in the the the clusters, formula us-(6) resourcestainedsourcestained with with in in (2). wh which (2).ich That anyThat tags is, one cois, relative- occur, of relativethe two andco co-occurrence tags- theoccurrence amountappear is. of is b. cluster,For each using point, the compute formula its coeff(6) i- formula (6) above. ingabove. the formula (8) above. To explain the benefit of this proposed fuzzy weblog extraction Thisidenticalresourcesidentical collectionto to thein themethod wh partition ichpartition any causes among oneamong these of thethethet agtwoamount samount to tags becomeofappear of re- r e-. above.cients of being in the clusters, us- 4. Reiterateb. stepFor 1 each to 3 point,for every compute cluster its until coeffapproach, i- let’s introduce a small example. In the first part of this unitedsourcesThissources collectionand in in whichoffers whichmethod tags a tags semantically co causes co-occur,-occur, these andconsistent and thet ag the samount toamountpicture becomeb. Forof of each point, computeb. itsFor coefficientsing each the formula point, of computebeing(8) above. in the its coeffi- there is onlycients one term of being left in inthe the cluster. clusters,section us- the example will be explained. Afterwards the results whereresourcesunitedresourcesnearly and in in whalloffers whichtagsich any a areany semantically one related one of ofthetothe twoeachconsistenttwo tags other. tagsappearappearpictureThisclusters,. . using the4. formulaReiterate cients(8) stepabove. of 1 being to 3 for in the every clusters, cluster uuntils- 5. Concatenateing all the same formula terms(8) together.above. of the boolean search will be compared with the result of the semanticallyThiswhereThis collection collectionnearly consistentmethodallmethod tags causesarepicture causes related these isthese referredtot ageachtags tos toother. tobecome as become tagThis there ising only the one formula term left(8) above.in the cluster. 4. Reiterate step 1 to 4.3 forReiterate every cluster step until 1 to there 3 for is every only clusterfuzzy until search. spaceunitedsemanticallyunited. and andoffers offersconsistent a asemantically semanticallypicture isconsistent referredconsistent topicture aspicture tag 4.5. ReiterateConcatenate step 1all tosame 3 for terms every together. cluster until one term leftThe in the concatenation cluster.there is only is necessary one term left because in the cluster. the terms wherespacewherenearly. nearlyall alltags tags are are related relatedto toeach each other. other.ThisThis there is only one term left in the cluster. can – by5. drawingConcatenate on the all proposedsame terms fuzzy together. clustering semanticallysemantically consistent consistentpicturepicture is is referred referred 5. to to asConcatenate astag tag allThe 5.same concatenationConca termstenate together. isall necessarysame terms because together. the terms3.2.1 Problem Specifications space. can – by drawing on the proposed fuzzy clusteringThe problem with new and previously unobserved information space. The concatenation is necessary because the terms can – by TheThe concatenation concatenation is is necessary necessary because because the the terms termsis that the relationship to a term or other terms is not precisely drawing on the proposed fuzzy clustering method – belong cancan – –byby drawing drawing on on the the proposed proposed fuzzy fuzzy clustering clusteringknown. For example, the screen-producing company, Samsung, to more than one cluster. Nevertheless, with the proposed is screening the business competitors in the Internet for new method a model can be derived (cf. dendrogramm in fig. 5) killer applications for organic light emitting diodes (OLEDs). with several clusters the term belongs to a certain degree in, dependent on the membership level. This model is referred to An OLED (very often also named Organic Electro Lumi- as weak ontology. nescence – OEL) is any light emitting diode (LED) whose

Journal of Digital Information Management  Volume 8 Number 4  August 2010 281 Figure 6. Related terms in four different weblogs

emissive electroluminescent layer is made up of a film of organic compounds. The layer typically contains a polymer substance that allows appropriate organic compounds to be deposited. Figure 8. (a) Search with relation to other terms and (b) with assigned The completely different manufacturing process of OLEDs weight K (≥ 0.8) lends itself to various advantages over flat-panel displays pre- pared with Liquid Cristal Display (LCD) technology, as OLEDs enable a larger variety of colors, gamut, brightness, contrast and viewing angle than LCDs, due to the fact that OLED pixels directly emit light. In the Blogosphere new technologies are discussed earlier and therefore information is spread faster than in most other media. Thus in this example there are four weblogs (A to D) Figure 9. Boolean search result with reference to the example with different entries related to OLEDs. As shown in figure 6 the terms mentioned in these weblogs are OLED, LED, LCD, OEL. In contrast to the proposed fuzzy search approach, it is possible to find not only the searched term, but also to a certain degree 3.2.2 Pre-search related terms (as defined in theuser need). As a result it is pos- sible to find weblog A as well as weblog D (fig. 10). Thus, the To create an ontology, an agent first crawls the Internet for tags. benefit for the user is that he can discover new relationships. A web agent is – as already mentioned – a computer program As illustrated above, the search for ‘OLED’ originates not only that searches the Internet in a methodical, automated manner. an ‘OLED’ result but also a ‘LED’ result. A transformation of the Nevertheless this agent applies the previously discussed HITS user predefined weight K leads to in- or exclusion of further algorithm with the specified Jaccard similarity coefficient to the related terms. This transformation can be performed easily with set of tags it found to generate a tag space. the use of the introduced slider. With the use of FCM, the specified tag space will be partitioned in several meaningful clusters, as for example an ‘OLED’ cluster (fig. 7). This cluster represents a part of the – with the FCM as well – built ontology with the focus ‘OLED’. In this example, an arbitrary chosen range of relationship to the ‘OLED’ cluster is defined for OEL, LED and for LCD as well.

Figure 10. Search result based on weak ontology with reference to the example

Instead of delivering millions of search results in one long list, the search engine groups similar results together into clusters. Clusters help to see search results by topic so it is possible to zero in on precisely what was searched for or discover un- Figure 7. Weak ontology with reference to the example foreseen associations among elements. Rather than scrolling through page after page, the clusters help to find results missed without fuzzy weblog extraction or that were hidden deep in 3.2.3 The search the ranked list. To search for groundbreaking new OLED technology in these weblogs, a user enters search terms and related relevance Hence the two located weblogs will be presented to the user in a simple-to-use interface such as for example, ‘OLED’ with in two different folders. One folder is labeled ‘OLED’ and the a search range of [0.8..1] (see weight K in chapter 3.1). This other is labeled ‘OEL’. range is defined with the use of a slide control. In this example shown in figure 8a and 8b, the search terms and 4. Conclusion and Outlook their relevance include the nucleus OLED. OEL is included as People questioned O’Reilly and Battelle ever since the term well, however only with a disc range of [0.9..1]. As depicted in Web 2.0 was introduced, “What’s next?” as they state in [27]. “Is figure 8b LED with the disc range of [0.6..1] is excluded. it the Semantic Web? The Sentient Web? Is it the Social Web? The Mobile Web? Is it some form of Virtual Reality?” and they 3.2.4 Results answer with “It is all of those, and more.” They are probably right With a boolean search it is only possible to find weblog A (fig. 9), because “the Web is no longer a collection of static pages [...]. because the term “OLED” is or is not mentioned in the weblog Increasingly, the Web is the world – everything and everyone in (two-valued logic). The other related terms cannot be found. the world casts an “information shadow”, an aura of data which,

282 Journal of Digital Information Management  Volume 8 Number 4  August 2010 when captured and processed intelligently, offers extraordinary References opportunity and mind bending implications.” as they answer the question themselves in [27]. [1] Meier, A., Schindler, G., Werro, N. (2008). Fuzzy classification on relational databases. In: Galindo M. (Ed.): A new approach to gain deeper insights in a part of this “infor- Handbook of Research on Fuzzy Information Processing mation shadow” is the introduced fuzzy weblog search engine. in Databases, Volume II, Information Science Reference, Due to the fact that the boundaries in the fuzzy set theory are p.586-614. not well-defined, it is possible to find more numerous and higher quality results with this new Web 3.0 kind of a weblog search. [2] Zadeh, L. A. (1965). Fuzzy Sets. Information and Control, The results should be presented in an understandable way by 8, p.338-353. using folders as suggested in this paper. The folders will be [3] Murthy, S.G.K., Biswas, R.N. (2004). A Fuzzy Logic Based defined by the fuzzy clusters through FCM. Search Technique for Digital Libraries. DESIDOC Bulletin of A prototype for fuzzy weblog extraction is at the early stage of Information Technology, Vol. 24, No. 6, p.3-9. development (see architecture in fig. 4). It evaluates if there is a [4] Hasan-Montero, Y. and Herrero-Solana, V. (2006). possibly superior coefficient to the Jaccard similarity coefficient Improving Tag-Clouds as a Visual Information Retrieval proposed by Hasan-Montero and Herrero-Solana in [4]. Further Interfaces. Proceedings of International Conference on tests include comparisons with the Dice, the Kulczynski, the Multidisciplinary Information Sciences and Technologies, Russel and Rao, the Simple Matching and the Tanimoto coef- Mérida. ficient. Additionally, the HITS algorithm will be tested against [5] Kaser, O., Lemire, D. (2007). Tag-Cloud Drawing: Algorithms other comparable algorithms as for example Google PageRank for Cloud Visualization. Electronic Edition, Banff. or Yahoo! TrustRank and with the variation of different associ- ated Soundex algorithms it is also possible to vary. Another [6] Bezdek, J.C. (1981). Pattern Recognition with Fuzzy essential evaluation will be to weigh the FCM algorithm against Objective Function Algorithms. Plenum Press, New York. other comparable algorithms, such as the “Fuzzy clustering [7] Schwartz, C. (1998). Web Search Engines. Journal by Local Approximation of Memberships” (FLAME) algorithm of the American Society for Information Science. 49(11), suggested by Fu and Medico in [28]. p.973–982. A focal point of this research is to construct an adaptive man- [8] Ferragina, P. Gulli, A. (2005). A Personalized Search Engine machine interaction interface as previous search engines Based on Web-Snippet Hierarchical Clustering. International capitalize only insufficiently on the need of the users to interact World Wide Web Conference Committee, p.801-810, Chiba, with the search engine in a straightforward manner. In order to Japan. not confuse the user, the interaction should be kept as simple [9] Oreilly, T. (2007). What is Web 2.0: Design Patterns as possible. Therefore the key for it is an easily manageable and Business Models for the Next Generation of Software. graphical user interface which has to be designed. Communications & Strategies, No. 1, p.17, First Quarter. Other 2D or 3D visualization possibilities as for instance Kuhn, [10] Cardoso, J. (2007). The Semantic Web Vision, Where Are Erni, Loretan and Nierstrasz with their approach for software We? IEEE Computer Society, p.22-26. visualization in [29] or Portmann and Kuhn for weblog search in [30] presents can be explored in further research. The over- [11] Voss, J. (2007). Tagging, Folksonomy & Co - Renaissance of arching philosophy of the GUI should be to constantly simplify Manual Indexing? Proceedings of the International Symposium the search process for the user – for example by going into an of Information Science, p.234–254, Cologne. interaction with them. [12] Wu, H., Zubair M., Maly, K. (2006). Harvesting Social Knowledge from Folksonomies. ACM, p.1-12, Odense, Another unexplored area for future research is the dynamic Denmark. storage of the terms found by the agent. For this purpose, storage space is intended to use a multidimensional data- [13] Picot, A., Fischer, T. (2006). Weblogs professional, base, as a repository of the stored meta data. A conceivable fundamentals, concept and practice in a corporate environment approach to this is akin to the fuzzy data warehouse as pro- (in German). Dpunkt, Heidelberg. posed by Fasel and Zumstein in [31] for web analytics. For [14] Baeza-Yates, R., Ribeiro-Neto, B. (1999). Modern search speed reasons this data warehouse should possibly Information Retrieval. ACM press, Essex. be split across several different machines in the long run. Consequently future balance-load considerations have to be [15] Kobayashi, M., Takeda, K. (2000). Information Retrieval taken into consideration. on the Web. IBM Research, ACM Computing Surveys, Vol. 32, No. 2, p.144-173. 5. Acknowledgement [16] Su, M.C., Chou, C. H. (2001). A Modified Version of the K-Means Algorithm with a Distance Based on Cluster Upon the recommendation of the referees this paper is a revised Symmetry. IEEE Transactions on Pattern Analysis and Machine version of Portmann’s paper – which can be found in [32] – for Intelligence, Vol. 23, No. 6, p.674-680. the 2nd International Conference on the Applications of Digital Information and Web Technologies (ICADIWT2009) in London. [17] Backhaus, K., Erichson, B., Plinke, W., Weiber, R. (2006). That is why we are grateful to the referees for this honor. Multivariate Analysis, an application-oriented adoption (in German). Springer, Berlin. We also appreciate our colleagues of the Fuzzy Marketing Methods research center (www.FMsquare.org) who always [18] Dubois, D., Prade, H., Yager, R.R. (1993). Readings help with word and deed. This center implements the ideas of in Fuzzy Sets for Intelligent Systems. Morgan Kaufmann fuzzy classification methods to numerous fields of applications Publishers, San Mateo. and therefore, the research center welcomes cooperation with [19] Dubois, D., Prade, H. (1988). Fuzzy Sets and Systems. practitioners as well as researchers. Academic Press, New York.

Journal of Digital Information Management  Volume 8 Number 4  August 2010 283 [20] Gruber, T. R. (1993). A Translation Approach to Portable [28] Fu, L., Medico, E. (2007). FLAME, a novel fuzzy clustering Ontology Specifications. Knowledge Systems Laboratory, method for the analysis of DNA microarray data. BMC Technical Report KSL 92-71, p.199-220. Bioinformatics, p.1-15. [21] Kleinberg, J. (1998). Authoritative sources in a hyperlinked [29] Kuhn, A., Erni, D., Loretan, P., Nierstrasz, O. (2009). environment. ACM-SIAM Symposium on Discrete Algorithms, Software Cartography: Thematic Software Visualization with p.1-33, Odense, Denmark. Consistent Layout. Proceedings of 15th Working Conference [22] Lewandowski, D. (2005). Web Information Retrieval, on Reverse Engineering, IEEE Computer Society Press, technologies for information search in the Internet (in p.209-218. German). Deutsche Gesellschaft f. Informationswissenschaft [30] Portmann, E., Kuhn, A. (2010). Extraction and Cartography u. Informationspraxis, Düsseldorf. of Weblog Information” (in German). in HMD271 Web 3.0 & [23] Russel, R. C. (1918). US Patent 1261167. Semantic Web. [24] Russel, R. C. (1922). US Patent 1435663. [31] Fasel, D., Zumstein, D. (2009). A Fuzzy Data Warehouse Approach for Web Analytics. In: Miltiadis d. Lytras, E. Damiani, [25] Holmes, D., McCabe, M. C. (2002). Improving Precision J.M. Carroll, R.D. Tennyson, D. Avison, A. Naeve, A. Dale, P. and Recall for Soundex Retrieval. Proceedings of the Lefrere, F. Tan, J. Sipior, and G. Vossen, editors, Visioning International Conference on Information Technology: Coding and Engineering the Knowledge Society – A Web Science and Computing, p.22. Perspective, volume 5736 of Lecture Notes in Computer [26] Taylor, M., Zaragoza, H., Craswell, N., Robertson, S., Science, p.276–285, Springer. Burges, C. (2006). Optimisation methods for ranking functions [32] Portmann, E. (2009). Weblog Extraction with Fuzzy with multiple parameter. CIKIM, p.585-593. Classification Methods. 2nd International Conference on the [27] O’Reilly, T., Battelle, J. (2009). Web Squared: Web 2.0 Five Applications of Digital Information and Web Technologies, Years On. Web 2.0 Summit, O’Reilly Media, Inc, p.1-13. p.422-427, London.

284 Journal of Digital Information Management  Volume 8 Number 4  August 2010