ORGANOGRAPHS Multi-faceted Hierarchical Categorization of Web Documents

Rodrigo Dias Arruda Senra, Claudia Bauzer Medeiros Institute of Computing, University of Campinas, UNICAMP, Campinas, SP, Brazil [email protected], [email protected]

Keywords: Organographs, Hierarchical Content Categorization, Bookmarking, Classification

Abstract: The data deluge of information in the Web challenges internauts to organize their references to interesting content in the Web as well as in their private storage space off-line. Having an automatically managed personal index to content acquired from the Web is useful for everybody, but critical to researchers and scholars. In this paper, we discuss concepts and problems related to organizing information through multi-faceted hierarchical categorization. We introduce the organograph as a mechanism to specify multiple views of how content is organized. Organographs can help scientists to automatically organize their documents along multiple axes, improving sharing and navigation through themes and concepts according to a particular research objective.

1 INTRODUCTION the pivot artifact to discuss our approach to hierar- chical structuring, considering that digital content is The organization, archival and sharing of digi- often encapsulated, stored and shared as files. tal content generated by scientists is important in The issues for filesystems can be transported to eScience research - e.g. reports, algorithms and data. other manifestations of the hierarchical structuring Scientists must be able to efficiently organize and dis- pattern. First issue, the membership relation between seminate their knowledge, not only within a project, files and directories, is often static and manually de- but also with the community at large, where the pre- fined by the user on a per-file-instance basis. Sec- ferred environment is the Web. This ond, the hierarchy implicitly defines a content catego- complicates document management, since each scien- rization, thus some given file frequently can only be tist (or group) may use different document standards, placed in a single point inside the hierarchy. We refer storage mechanisms and vocabularies. Such issues to this issue as the single category problem, further are nowadays a prime research topic in – exploring it in section 2.2. Above all, the organiza- a novel research domain which is concerned with the tion of a directory hierarchy is not shared dissociated Web as the primary object of interest. from the content itself, e.g., people do not exchange Continuing our research (Senra and Medeiros, hierarchiesof empty folders even thoughthere is valu- 2009) about data sharing in eScience, this paper con- able knowledge encoded in their structuring. We only cerns with information organization and collabora- share directory trees associated with content, because tion on the Web. We are interested in the organiza- we lack the tools to categorize our content accord- tion of scholarly digital content (documents used by ing to a foreign directory hierarchy and vice-versa. scientists), through automatic hierarchical structur- We refer to this issue as the content-recategorization ing, multifaceted filtering and sharing. Hierarchical problem, and discuss it in section 2.4. Structuring is a pervasive approach towards informa- The Web aggravated these issues. It is often eas- tion organization. It is the cognitve pattern we use to ier and quicker to find a piece of information in the organize everything (e.g., files, messages). Although , than to locate the one already available in ubiquitous, we often create our hierarchies manually our local computer or Intranet. Not only is this an in- and in an ad hoc fashion. We choose filesystems as efficient way of managing content, but it can lead to problems such as needless duplication, loss of quality Another important concept is the notion of ontol- and provenance mishaps. The information available ogy ontology (Uschold and Gruninger, 1996): an ex- “off-line” (i.e. in our local computer or Intranet) is plicit and rigorous specification of a conceptualiza- not the same information available in the Internet. Al- tion, that organizes some knowledge domain. For though not necessarily as fresh, in some aspects local many applications, ontologies are mostly hierarchical, information can be richer. It has been filtered, possi- containing all the entities and their relations, usually bly transformed, can be sensitive or private, and may restricted to is-a and part-of. Ontologies can describe no longer be available on the Web. IUs, categories, and relationships between them. This paper proposesan approachtowards the auto- matic organization of documents generated and used 2.2 The Single Category Problem by scientists. Our approach enables this organization along multiple axes, thus facilitating its sharing and The way we organize digital documents in hierarchies dissemination. Our hypothesis is that hierarchies can still follows the metaphors of the physical world: be shared in isolation from their generative collection, archives, drawers and folders. The majority of tools and used to organize other (non-generative) collec- used to organize documents relies on such single tax- tions. onomic organization patterns - e.g., file managers or browsers, e-mail clients, or software navigational menus. When the digital space inherited this pattern 2 APPROACHES TO CONTENT from the physical world, it also inherited some unnec- ORGANIZATION essary limitations. For instance, many software arti- facts restrict a digital document to be placed in a sin- gle directory in the filesystem hierarchy, or an email We are interested in information organization,and message to be stored in a single folder. we define it as: ”the structuring of information units Digital libraries are presented as a means to solve by a group of agents according to a set of consensual this issue – by allowing multifaceted content orga- and shared principles to achieve a defined goal”. Our nization using links and structures. Other definition has five elements. The agents represent the means of circumventing this limitation is the use of users and software artifacts that interact. The infor- copies, hard-links (a.k.a clones) and soft-links (a.k.a mation units (IUs) represent the granularity at which symbolic links). While these mechanisms support content is encapsulated and manipulated. The struc- content multi-categorization, they also introduce new ture is defined by intermediate aggregators that parti- problems. Copies increase storage space, complicate tion the set of IUs; these aggregators are often related consistency maintenance and may cause redundant to content categorization or clusterization. The prin- processing. Links change the hierarchy traversal from ciples are the algorithms and knowledge bases that tree-based into a graph-based procedure that often in- drive the structuring process. The goal is usually to troduces problematic cycles. Furthermore, links in- improve how the agents obtain access to the desired troduce a duality: sometimes they are treated trans- IUs. In this paper, we refer to file and document in- parently, other times not – causing an identity prob- terchangeably as concrete instances of IUs; similarly, lem between the link and its referred object. Thus, directories are instances of hierarchical aggregatorsto archiving systems and digital libraries still lack flex- enforce structuring upon the IUs. ible content organization support, and suffer from a 2.1 Basic Concepts rigid content structuring.

Organization as part of Information Retrieval (IR), 2.3 Alternatives to Rigid Hierarchical involves several tasks (Jackson and Moulinier, 2002) Structuring – e.g., sorting, summarizing, indexing. We are par- ticularly interested in Classification or Categorization In opposition to the hierarchical organization strategy, – the assignment of IUs to a pre-established set of there are other two approaches to organize content: classes or categories defined by the agents. Meth- and full-text management engines. The ods for categorization differ in the form of the clas- full-text search approach abolishes classes and cate- sifier, the technique for training, and the representa- gories altogether. Objects are indexed by their tex- tion of the IU, e.g, see (Weigend et al., 1999) on text. tual content and retrieved by a subset of keywords. While the categorization task assumes an existing set The object collection cannot be browsed by topic (or of classes(categories), clustering aims to create or dis- any other property) because it is unstructured, only cover a set of classes(clusters). ranked result sets can be iterated. Furthermore, full- text mechanisms do not support multimedia content, egorization problem. The content-recategorization which is increasingly common. problem is a variation of hierarchical classification Folksonomies (or social tagging) is a recent re- because the content subject to classification is already sponse to the demands of content organization. In classified according to a source hierarchy, which this paradigm, content is annotated and categorized could be used to improve the classification process by multiple labels or tags that form a cloud of un- towards the new given target hierarchy. For a full structured categories, represented by words or short discussion and survey about hierarchical classifica- phrases. Any object can be associated with any num- tion see (Gordon, ). Many classification methods dis- ber of tags, thereforebelonging to multiple categories, cussed in the literature are not fully automatic, requir- solving the single category problem. Tags can evolve ing supervised training and user feedback – e.g., naive dynamically, being used to browse and retrieve IUs. bayes, support vector machine, or neural networks. Therefore, folksonomies preserve classes or emulate If the categorization process is distributed across categories, but not their inter-relations and structure. different people, or even done by a single person at In order to minimize the lack of structure, some tag- different times, then categorizations will differ – e.g., ging systems (e.g.Delicious and Connotea) use co- (Bonifacio et al., 2000) have shown that community occurrence of tags for navigational purposes. Folk- members keep their own perspective on a commu- sonomies have other limitations (Giannakidou et al., nity repository. The reason why many community 2008) that restrict their usability, such as: tag valida- repository initiatives fail is due to the fact that a sin- tion, uncontrolled vocabularies, spamming and term gle (though shared) categorization scheme is not ac- ambiguity and redundancy. cepted or understood by the entire community. This Summing up, taxonomies suffer from the single reinforces the idea that single category approaches to category problem, full-text search provides no struc- classification are doomed to fail. turing nor categories, and folksonomies have unstruc- tured and unrelated categories. 2.5 Categories for Categorization 2.4 The Content-Recategorization Problem We are interested in improving automation for dif- ferent categorization tasks, which we call: first- As mentioned in section 1, hierarchies still lack the time, refactor, shared, and synchronized categoriza- desirable property of dynamic and flexible member- tion. First-time categorization is when we categorize ship relation between content and category. One of an assorted digital document collection for the first our goals is to allow hierarchical organizations to be time. shared in isolation from the content they categorize. Refactor categorization is when we already have a First we need to distinguish the generative collection categorized collection, but we need to either judge its from a subordinate collection. A generative collec- coherence or refactor it. The quality and suitability of tion is the set of IUs from which the agents derived the categorization may vary for different user groups a hierarchical categorization scheme. A subordinate or for the same group doing tasks at different times. collection is the reciprocal entity, i.e., any set of IUs subject to a specific (generated) classification scheme. Shared categorization is when we want to re-use For example, suppose some researcher has a hier- the categorization scheme from a different group and archical collection of articles. The folders represent apply it to our own content, or do the inverse task – categories, and the union of all articles are the gener- using our own categorization scheme to browse the ative collection. There are two recategorization sce- shared content from others. narios. The first is to fit her articles to an external Synchronized categorization is a stricter version categorization, for instance according to the Library of shared categorization, when two groups are simul- of Congress Subject Headings (LCSH). In this case, taneously manipulating content and categorization. each heading from LCSH would become a folder (cat- Each group may adopt a different categorization, and egory), and the researcher’s articles would become the yet they must edit and evolve the same documents. subordinate collection to be organized according to Supposing that one group devised a proper (refactor) this new hierarchy. The second scenario is to fit other categorization for its purposes, then this group should documents to her own hierarchy scheme, for example be capable of applying that same categorization to to browse a colleague’s collection as if it were orga- any exchanged content. In other words, in the last nized with her own personal classification scheme. two tasks what is sought is a solution to the content- Each scenario means solving the hierarchical cat- recategorization problem. 3 THE ORGANOGRAPH processed and indexed to populate what we call the FRAMEWORK attribute-space. The attribute-space is a dictionary- like whose keys are document IDs and In order to accomplish those four categorization whose values are records with heterogeneousschemas tasks, we present a semantically-grounded organiza- – because different documents might have distinct tional structure that we called organographs. An sets of attributes. It can be implemented on top of a organograph is a user parametrization to a hierarchi- NoSQL database or on top a relational database with cal categorization task, that is editable, persistable and a star-join (multi-dimensional) schema.It is built by shareable. applying all suitable information extractors available Organographs improve access to digital content to all documents in the source collection. The car- because they are context-sensitive and built to meet dinality of the attribute-space for a given document specific user needs, preserving user familiarity with depends on the availability of information extractors categories and their structure. Multiple organographs applicable to the respective document type. Hence, it can be generated from the same collection given a grows incrementally with the advent of new informa- different parametrization, and a single organograph tion extractors. could be applied to different and unrelated collec- For instance, the attribute-space for some x.pdf tions. Each generated organograph should be inter- document has the schema: title, author, publica- preted as a multi-faceted view (Dakka et al., 2007) tion date, document type, word frequency vector, of the generative collection. A concrete example of and other user defined keywords. The first four at- organograph is given in figure 1, which is detailed in tributes are tagged (by the extractor) with their re- section 3.3. spective elements counterparts. Further- more, x.pdf’s attribute-space is augmented with at- 3.1 UseCase tributes retrieved from datasource (the host filesys- tem), such as: document size, last access/modification date, owner, group, and access permissions. If origi- Consider the following example: a researcher uses a nated from the Web, the attribute-space could be aug- hierarchy created to gather all material related to his mented with information extracted from the host site research project. Some documents were generated lo- or the surrounding Web pages. Once the attribute- cally, while others were retrieved from the Web. The space is populated, the researcher possesses a vocab- document tree contains his publications, unpublished ulary of attributes with which he can write organo- papers, some papers (categorized by subject) he al- graph specifications that will guide the materializa- ready read, and papers to read. He wants to make tion of diverse multi-faceted views. this document tree available for his research group, We propose four steps to construct organographs: including his own publications, but not his unpub- lished materials. Moreover, he wants to organize the Step 1: apply information extraction techniques resulting collection by publication date (year/month) to the IUs (e.g. documents), factoring out attributes and then by ACM’s 1998 Computing Classification and properties (defining an attribute space). Each IU System (http://www.acm.org/about/class/1998). No- is assigned a unique ID that serves for indexation pur- tice that the classifying criteria are either based on poses. Step 2: automatically generate categories from attributes that are intrinsic to the IU’s content (e.g.: a user-given organograph that either explicitly enu- publication date, text subject, annotations), or based merates categories or provides generative rules that on attributes dependent on the user context – being define them. User parametrization should anchor cat- content-independent (eg.: read vs unread, published egories and their inter-relationships to concepts in on- vs unpublished). In this example, the attributes pub- tologies. Step 3: run a categorization algorithm spec- lished/not published and read/not read are implicitly ified in the user-given organograph, resulting in the given by categories in the original document orga- assignment of IU IDs (step 1) to the generated cate- nization, therefore annotations can be derived auto- gories (step 2). Step 4: use virtualization and links matically using proper attribute extractors (Dakka and to materialize the emergent categorization using the Ipeirotis, 2008). In other cases, annotations must be same structuring metaphor (e.g. directories) of the performed manually. generative collection.

3.2 Methodology 3.3 Concrete Organograph Example

Prior to constructing the target organizing hierar- We provide the organograph described in figure 1 to chy, all documents in the source collection are pre- illustrate what we mean by user parametrization. 1: 2: 3: 8: 9: 14: 15: 18: 19: 20: application/pdf 21: unpublished 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32:

Figure 1: Sample Organograph Specification

Line 1 defines an organographinstance with iden- ipating scientist can construct a personal organization tification for persistence purposes. Line 2 defines the scheme to allow other researchers to “see” organiza- topmost level (root node). Lines 3-7 define the labels tions differently. for the first level nodes, consisting of the 4-digit year The CAIMAN system (Lacher and Groh, ) facili- present in the publication date attribute. Lines 8-30 tates document exchange between geographically dis- define the second nested level, where lines 9-13 define persed people. Each community member organizes similarly a 3-letter month label for this level’s nodes his collection according to his own categorization based on the same attribute. Lines 14-29 define the scheme (ontology), then CAIMAN maps concepts in third level, where lines 15-24 apply an algorithm to personal ontologies to concepts in a communityontol- do topical clusterization according to ACM’s CCS on- ogy. (Bloehdorn et al., ) performed text mining exper- tology. The choice of categorization algorithm is left iments in the medical domain in which the ontologi- for the user, in this example we chose “naive bayes” cal structures used were acquired automatically in an omitting from the example some necessary parame- unsupervised learning process. They have shown that ters such as the trainning set used. Line 19 defines the automatically learned ontologies and manually engi- generative datasource collection subject to inclusive neered ones, were both competitive and improved re- (line 20) and exclusive filters (line 21). Finally, lines sults on text clustering and classification tasks. (Gi- 26-28 define the inner-most level nodes (tree leaves), annakidou et al., 2008) propose an approach for so- consisting of hyperlinks to the IU’s (documents) fil- cial data clustering which combines semantic, social tered. and content-based information. They devised an un- The exact syntax and semantics of the domain spe- supervised model for efficient and scalable mining on cific language used to code the organograph lies out- multimedia social-related data, which leads to the ex- side the scope of this paper. traction of rich and trustworthy semantics and the im- provement of retrieval in a social tagging system. (Du and Chen, 2007) devised a desktop-based personal in- 4 RELATED WORK formation management that further exploits a social networking environment for collaborative knowledge There are many research initiatives on tagging and creation, integration and sharing, based on the inte- text mining, on the Web, whose goal is document gration of technologies and collabora- sharing. However, to our knowledge, ours is the first tive tagging. (Chen and Roberts, 2007) presented an work that is geared towards organizingfor sharing in a architecture capable of recovering the context of tags collaborative Web environment, in which each partic- and drive emergent semantics, using them to organ- ically build ontologies. The generated semantic hi- ACKNOWLEDGEMENTS erarchy is used to enforce structure and semantics in collaborative tagging. That approach was adopted in This work was supported by Fapesp, CNPq, CAPES practice in the Online Open Publishing System. and INCT in Web Science (CNPq 557.128/2009-9). Our work also goes toward filling gaps in Web Sci- ence research, in the area of designing and develop- ing infrastructures for collaboration on the Web. The REFERENCES term Web Science was first introduced by Berners- Lee. It has since given origin to large international Bloehdorn, S., Cimiano, P., and Hotho, A. Learning on- research efforts, including The Web Science Trust tologies to improve text clustering and classification. (http://webscience.org/) or the Brazilian Institute for In From Data and Information Analysis to Knowledge Web Science Research (http://webscience.org.br). Engineering: Proceedings of the 29th Annual Confer- Formally, research in Web Science is concerned ence of the German Classification Society. with the Web as the primary object of interest. In Bonifacio, M., Bouquet, P., and Manzardo, A. (2000). A our work, this means among others concentrating on distributed intelligence paradigm for knowledge man- agement. In AAAI Spring Symposium Series 2000 on organographs as a means of sharing and exchanging Bringing Knowledge to Business Processes. knowledge. Furthermore, once document organiza- tions are shared, the researcherscan reuse each other’s Chen, L. and Roberts, C. (2007). Semantic tagging for large-scale content management. In WI ’07: Proceed- data – which is the essence of scientific collaboration ings of the IEEE/WIC/ACM International Conference – without having to concern themselves with estab- on Web Intelligence. lishing standards for document organization. Dakka, W. and Ipeirotis, P. G. (2008). Automatic extrac- tion of useful facet hierarchies from text . In ICDE, pages 466–475. 5 CONCLUSIONS Dakka, W., Ipeirotis, P. G., and Wood, K. R. (2007). Faceted browsing over large databases of text-annotated ob- jects. In ICDE, pages 1489–1490. This paper presented a conceptual framework to Du, Y. and Chen, L. (2007). Using personalized knowledge automate information organization and support col- portal for information and knowledge integration and laborative work on the Web. Our core proposal is the sharing. In SKG ’07: Proceedings of the Third In- organograph – a persistent and shareable organiza- ternational Conference on Semantics, Knowledge and tion that emerges from automatic feature-extraction, Grid. classification and clustering. Though our discussion Giannakidou, E., Kompatsiaris, I., and Vakali, A. (2008). was centered in document sharing and reuse, our end- Semsoc: Semantic, social and content-based cluster- users are scientists that work cooperatively in some ing in multimedia collaborative tagging systems. In eScience domain. In such a context, documents refer ICSC ’08: Proceedings of the 2008 IEEE Interna- tional Conference on . not only to scientific papers and reports, but also data files containing experimental data, or images. Under Gordon, A. Hierarchical classification. Clustering and clas- sification. this perspective, our proposalcan be extended to other domains in which cooperation on the Web is required. Jackson, P. and Moulinier, I. (2002). Natural language pro- cessing for online applications: text retrieval, extrac- At the same time, we need to concern ourselves tion, and categorization. John Benjamins Publishing with the Web Science issue of visibility. It is not Company. enough to share organographs, if we also want the Lacher, M. and Groh, G. Facilitating the exchange of ex- documents to be visible beyond a research group. In- plicit knowledge through ontology mappings. In Pro- deed, the validation of scientific experiments requires ceedings of the Fourteenth International Florida Arti- reproducibility – and this means that documents asso- ficial Intelligence Research Society Conference. ciated with an eScience project must all, at the end, Senra, R. D. A. and Medeiros, C. B. (2009). SciFrame: a become available. This means that we must also con- conceptual framework to describe data sharing in e- sider some sort of Publication Directory, in which a Science. SBBD. III e-Science Workshop. group’s (or a project’s) organographs can be accessed Uschold, M. and Gruninger, M. (1996). Ontologies: Princi- by all interested in accessing the main results of that ples, methods and applications. The Knowledge Engi- group. This kind of solution is part of our ongoing neering Review, 11(02). research. We will validate this concept using real ap- Weigend, A. S., Wiener, E. D., and Pedersen, J. O. (1999). plications that run in the Web and that have been im- Exploiting hierarchy in text categorization. Inf. Retr., plemented by our research group, in distinct scientific 1(3). domains.