Knowledge Management Enviroments for High Throughput Biology
Total Page:16
File Type:pdf, Size:1020Kb
Knowledge Management Enviroments for High Throughput Biology Abhey Shah A Thesis submitted for the degree of MPhil Biology Department University of York September 2007 Abstract With the growing complexity and scale of data sets in computational biology and chemoin- formatics, there is a need for novel knowledge processing tools and platforms. This thesis describes a newly developed knowledge processing platform that is different in its emphasis on architecture, flexibility, builtin facilities for datamining and easy cross platform usage. There exist thousands of bioinformatics and chemoinformatics databases, that are stored in many different forms with different access methods, this is a reflection of the range of data structures that make up complex biological and chemical data. Starting from a theoretical ba- sis, FCA (Formal Concept Analysis) an applied branch of lattice theory, is used in this thesis to develop a file system that automatically structures itself by it’s contents. The procedure of extracting concepts from data sets is examined. The system also finds appropriate labels for the discovered concepts by extracting data from ontological databases. A novel method for scaling non-binary data for use with the system is developed. Finally the future of integrative systems biology is discussed in the context of efficiently closed causal systems. Contents 1 Motivations and goals of the thesis 11 1.1 Conceptual frameworks . 11 1.2 Biological foundations . 12 1.2.1 Gene expression data . 13 1.2.2 Ontology . 14 1.3 Knowledge based computational environments . 15 1.3.1 Interfaces . 16 1.3.2 Databases and the character of biological data . 17 1.4 Goals of the thesis . 19 1.5 Structure of the thesis . 19 2 Objectivism: Formal Concept Analysis 21 2.1 Epistemology of objectivism . 21 2.1.1 Applying objectivism . 22 2.1.2 Gene expression data . 22 2.1.2.1 Clustering techniques . 23 2.1.2.2 Bi-clustering . 23 2.1.2.3 Formal Concept Analysis . 23 2.1.3 The amino acids: an example data set . 24 2.2 Basics of Formal Concept Analysis . 24 2.2.1 Definitions . 24 2.2.1.1 Formal context: . 24 2.2.1.2 Partial orders: . 25 2.2.1.3 Lattices: . 25 2.2.1.4 Power sets: . 25 2.2.1.5 Formal concept: . 25 2.2.1.6 Hasse diagram . 26 2.2.1.7 Ideals and filters . 27 2.2.1.8 Concept lattices: . 27 1 2.2.1.9 Attribute partial order: . 27 2.2.2 Measures . 27 2.2.2.1 Chains and height . 27 2.2.2.2 Anti-chains and width . 28 2.2.3 Simplifying attributes and feature selection . 28 2.3 Next closure algorithm implementation . 29 2.3.1 Calculating Closed Sets from Next Closure . 29 2.3.1.1 Algorithmic Complexity . 33 2.3.2 Constructing the lattice . 34 2.3.3 Unfolding the lattice to form a hierarchy . 35 2.4 Labelling concepts . 37 2.4.1 Reduced labelling for concept lattices . 37 2.4.2 Naming . 39 2.5 Non-binary data . 39 2.5.1 Nominal scaling . 40 2.5.2 Ordinal scaling . 41 2.5.3 Logical scaling . 41 2.6 Discussion . 42 3 Ontology: algorithms, metrics and representations 43 3.1 Introduction to ontology and applications . 44 3.1.1 Ontology . 44 3.1.2 Structure and implementation of open biomedical Ontologies . 45 3.1.3 Ontologies for labelling . 45 3.1.4 The Gene Ontology . 45 3.1.4.1 Structure . 46 3.1.4.2 Usage of GO . 47 3.1.5 Annotation and curation . 48 3.1.5.1 Functional roles of genes . 48 3.1.5.2 Process of construction . 48 3.1.5.3 Context . 49 3.2 Ontology analysis . 49 3.2.1 Ordered set structural measures . 50 3.2.1.1 Depth . 50 3.2.1.2 Specificity and centrality . 50 3.2.1.3 Coverage . 51 3.2.2 Informational value . 51 3.2.2.1 Metric . 51 3.2.2.2 Atomic symbols within GO terms . 52 3.2.2.3 Gene cluster annotation statistical significance . 53 2 3.2.3 Usage . 53 3.3 Implementing Gene Ontology data services . 54 3.3.1 Introduction . 54 3.3.2 Gene Ontology services . 54 3.3.3 Taxonomies in SQL . 54 3.3.4 Alternate database models and interfaces . 55 3.3.5 GOS . 56 3.3.6 Measure implementation . 56 3.3.6.1 Resnick similarity and coverage . 56 3.3.6.2 Specificity and depth . 57 3.3.7 Query by genes . 57 3.4 GNS . 57 3.5 Usage of the GOS . 58 3.5.1 Initialization . 58 3.5.2 Basic navigational commands . 58 3.5.3 Advanced Usage . 58 3.6 Concept naming: labelling directories of a file system . 60 3.6.1 Criteria . 60 3.6.2 Algorithm implementation . 61 3.6.3 Unfolding the lattice . 62 3.7 Improvements to ontology . 63 3.7.1 Improvements and evolutions to the Gene Ontology . 63 3.7.1.1 Relational biology . 64 3.7.1.2 Annotation and folksonomies . 65 3.8 Summary . 66 4 Formal concept file system 68 4.1 Introduction to Inferno . 68 4.1.1 Inferno and Plan 9 . 68 4.1.2 Architecture . 69 4.1.3 Limbo . 70 4.1.4 Styx: a common network access protocol . 70 4.1.5 Per process file system based interface - computable namespaces . 72 4.1.5.1 An example file server: the network device . 74 4.1.5.2 An example file server: the domain name server . 77 4.1.5.3 HURD: Alternate approach to user level program file systems 78 4.1.6 Heuristics of resources as file servers . 78 4.2 Review of semantic file systems based on FCA . 79 4.2.1 Libferris . 79 4.2.1.1 Libferris example . 80 3 4.2.2 LISFS . 80 4.2.3 LISFS example: exploring two data structures . 81 4.2.4 Review . 83 4.3 Design strategy . 84 4.3.1 Key processes . 84 4.3.2 Data structures . 85 4.4 A prototype system . 86 4.4.1 Feature extraction - Transducers . 86 4.4.2 Concept formation - Conceptderive . 86 4.4.3 Directory generation - FCA2fs . 88 4.4.4 File system traversal and naming . 89 4.5 GO utilities . 89 4.5.0.1 GOS reloadable implementation . 89 4.5.0.2 Distributed middle ware . 91 4.6 Implementation of poly-hierarchy handling . 92 4.6.1 Using the file-system in the.