Large-Scale Data Fusion by Collective Matrix Factorization Tutorial at the Basel Computational Biology Conference, Basel, Switzerland, 2015


Data Fusion Tutorial, [BC]2, Basel, June 9, 2015

[Comic: Jane looks for help! NAR has just published 176 new biological databases*, and Jane's personal "hairball" network is a mess: all different edge types, impossible to stitch into a single data table, and even with GO annotations a manual analysis would not scale to every data source out there.]

[Figure: a hairball network of DNA-repair genes (e.g., RAD51, BRCA1, BRCA2, MSH2, MLH1, ATM, RAD50, MRE11A) connected to annotations and pathways such as Homologous Recombination Repair, Double-Strand Break Repair, Mismatch Repair, Meiotic Recombination and the Fanconi anemia pathway.]

* Fernandez-Suarez & Galperin, Nucleic Acids Research, 2013.

Welcome to the hands-on Data Fusion Tutorial! These notes include an introduction to integrative data analysis, with examples from collaborative filtering and systems biology, and the Orange workflows that we will construct during the tutorial.

This tutorial is designed for data mining researchers and biologists with an interest in data analysis and large-scale data integration. We will explore latent factor models, a popular class of approaches that have in recent years seen many successful applications in integrative data analysis. We will describe the intuition behind matrix factorization and explain why factorization approaches are suitable when collectively analyzing many heterogeneous data sets. To practice data fusion, we will construct visual data fusion workflows using Orange and its Data Fusion Add-on.

Tutorial instructors: Marinka Zitnik and Blaz Zupan, with help from members of the Bioinformatics Lab, Ljubljana.

If you haven't already installed Orange, please follow the installation guide at http://biolab.github.io/datafusion-installation-guide.

* See http://helikoid.si/recomb14/zitnik-zupan-recomb14.png for our full award-winning poster on data fusion.

Lesson 1: Everything is a Matrix

In many data mining applications there is plenty of potentially beneficial data available. However, these data naturally come in various formats and at different levels of granularity, can be represented in entirely different input data spaces, and typically describe distinct data types. For joint predictive modeling of heterogeneous data we need a generic way to encode data sets that may be fundamentally different from each other, both in type and in structure. An effective way to organize such a data compendium is to view each data set as a matrix. Matrices describe dyadic relationships, i.e. relationships between two groups of objects: a matrix relates the objects in its rows to the objects in its columns.
Examples of data matrices commonly used in the analysis of biological data include protein-protein interaction scores from the STRING database, represented in a gene-to-gene matrix:

[Figure: a small gene interaction network over the genes gacT, gemA, rdiA, racN, racJ, racI, xacA and racM, shown next to its corresponding gene-to-gene matrix.]

A gene interaction network can easily be converted to a matrix: each weighted edge in the network corresponds to a matrix entry.

Binary matrices can be used to associate Gene Ontology terms with cellular pathways:

[Figure: part of the N-Glycan biosynthesis pathway (genes alg13, alg7, alg1, alg14, dpm1, dpm2, dpm3, with a branch to fructose and mannose metabolism), shown next to a binary matrix relating orthology entries, e.g., dolichol kinase (K00902), alpha-mannosidase II (K01231) and the oligosaccharyltransferase complex (K12668), to the ontology terms GO:0004168, GO:0004572 and GO:0008250.]

Binary relations between two object types can be represented with a binary matrix.

Binary matrices can likewise tag research articles with Medical Subject Headings (MeSH):

[Figure: papers cited in PubMed, tagged with MeSH terms such as Cell separation, Cytoplasmic vesicles/metabolism, Ethidium/metabolism, Immunity/innate, Mutation, Phagocytes/cytology, Phagocytes/immunology and Phagocytosis.]

Papers cited in PubMed are tagged with MeSH terms. We can use one large binary matrix to encode the relations between research articles and MeSH terms.

Or they can encode the membership of genes in pathways, with one column for each pathway:

[Figure: part of the N-Glycan biosynthesis pathway (genes alg13, alg7, alg1, alg2, alg14, alg11), with branches to fructose and mannose metabolism and GPI-anchor biosynthesis, shown next to a binary gene-to-pathway membership matrix.]

Just like the relations between MeSH terms and scientific papers, we can encode the pathway memberships of genes in one large matrix that has genes in its rows and pathways in its columns.
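The network-to-matrix conversion described above can be sketched in a few lines of NumPy. The gene names follow the figure, but the edges and their weights are invented here purely for illustration:

```python
import numpy as np

# Hypothetical weighted interactions; the gene names follow the figure,
# but the edges and weights are made up for illustration.
genes = ["gacT", "gemA", "rdiA", "racN", "racJ", "racI", "xacA", "racM"]
edges = [("gacT", "gemA", 0.9), ("gacT", "rdiA", 0.4),
         ("gemA", "racN", 0.7), ("racJ", "racI", 0.8),
         ("racI", "xacA", 0.5), ("racN", "racM", 0.3)]

index = {g: i for i, g in enumerate(genes)}   # gene name -> row/column
R = np.zeros((len(genes), len(genes)))
for u, v, w in edges:
    R[index[u], index[v]] = w
    R[index[v], index[u]] = w   # undirected network: a symmetric matrix
```

Each weighted edge becomes one (pair of) matrix entries, and absent edges stay zero, which is exactly the sparse gene-to-gene matrix the figure depicts.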
The structure of the Gene Ontology can be represented with a real-valued matrix whose elements give the distance, or semantic similarity, between the corresponding ontology terms:

[Figure: part of the Gene Ontology graph, with terms such as Response to stimulus, Response to stress, Response to external stimulus, Response to biotic stimulus, Response to other organisms, Defense response and Defense response to bacterium, shown next to a square term-to-term distance matrix.]

Any ontology can be represented with a square matrix: we use the ontology to measure distances between its entities and encode these distances in a distance matrix.

Lesson 2: The Challenge

Suppose we would like to identify genes whose mutants exhibit a certain phenotype, e.g., genes that are sensitive to Gram-negative bacteria. In addition to the current knowledge about phenotypic annotations, i.e. the data encoded in a gene-to-phenotype matrix, which might be incomplete and contain some erroneous information, there exists a variety of circumstantial evidence, such as gene expression data, literature data, annotations of research articles, etc. An obvious question is how to link these seemingly disparate data sets. In many applications there exists some correspondence between different input dimensions. For example, genes can be linked to MeSH terms via gene-to-publication and publication-to-MeSH-term data matrices. This is an important observation, which we exploit to define a relational structure over the entire data system. The major challenge in such problems is how to jointly model multiple types of data heterogeneity in a mutually beneficial way.
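The correspondence between input dimensions mentioned above can be made concrete with a toy example: multiplying a gene-to-publication matrix by a publication-to-MeSH-term matrix chains the two relations, linking genes to MeSH terms. All values below are invented for illustration:

```python
import numpy as np

# Toy binary matrices (made-up values): G[i, j] = 1 if gene i is mentioned
# in publication j; M[j, k] = 1 if publication j is tagged with MeSH term k.
G = np.array([[1, 1, 0],    # gene A appears in publications 1 and 2
              [0, 1, 1]])   # gene B appears in publications 2 and 3
M = np.array([[1, 0],       # publication 1 is tagged with MeSH term X
              [1, 1],       # publication 2 is tagged with X and Y
              [0, 1]])      # publication 3 is tagged with Y

# The matrix product chains the two relations: entry (i, k) counts the
# publications that link gene i to MeSH term k.
GM = G @ M
```

Here gene A is linked to term X through two publications and to term Y through one, even though no gene-to-MeSH matrix was ever measured directly; this is the kind of indirect correspondence that data fusion exploits.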
For example, in the scheme below, can information about the relatedness of MeSH terms and the similarity between phenotypes from the Phenotype Ontology help us improve the accuracy of recognizing Gram-negative-defective genes?

[Figure: a data fusion scheme linking genes, mutant phenotypes (e.g., Gram neg. defective, Gram pos. defective, aberrant spore color, decreased chemotaxis), publications from PubMed, MeSH terms and MeSH annotations, expression data over time points, the Phenotype Ontology and the MeSH Ontology; example genes include spc3, swp1, kif9, alyL, nagB1, gpi, shkA and nip7.]

The data excerpt on the right comes from a gene prioritization problem where our goal was to find candidates for bacterial response genes in the social amoeba Dictyostelium. Other than a few seed genes, there was no data from which we could directly infer the bacterial phenotype of mutants. Hence, we considered circumstantial data sets and hoped that their fusion would uncover interesting new bacterial response genes.

Lesson 3: Recommender Systems

Sparse matrices and matrix completion have been thoroughly addressed in the area of machine learning called recommender systems. Several methods from this field form the foundation of matrix-based data fusion. Hence, we take a brief detour here from fusion to recommender systems, and for a while, from biology to movies.

How would you decide which movie to recommend to a friend? Obviously, a useful source of information would be the ratings, from one star up to five, of the movies your friend has seen in the past. Movie recommender systems primarily use user rating information, from which they estimate correlations between different movies and similarities between users, and infer a prediction model that can be used to recommend which movie a user should see next. For example, in the figure below we see a movie ratings data matrix containing information for four users and four movies.
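To make the idea concrete, here is a minimal matrix factorization sketch that completes a small four-user, four-movie ratings matrix of this kind. The ratings and all hyperparameters (two latent factors, learning rate, regularization strength) are invented for this sketch, and production recommenders use far more refined variants:

```python
import numpy as np

# A 4x4 ratings matrix; 0 marks a missing rating. The values are
# invented for this sketch.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
observed = R > 0                       # mask of known ratings

rng = np.random.default_rng(0)
k = 2                                  # number of latent factors
U = rng.random((4, k))                 # user factors
V = rng.random((4, k))                 # movie factors

# Gradient descent on the squared error over the observed entries only,
# with a small L2 penalty on the factors to discourage overfitting.
for _ in range(5000):
    E = (R - U @ V.T) * observed       # error on known ratings only
    U += 0.01 * (E @ V - 0.02 * U)
    V += 0.01 * (E.T @ U - 0.02 * V)

R_hat = U @ V.T                        # dense: fills in the missing ratings
```

The factorization fits only the observed entries, yet the product U @ V.T is dense, so the previously unknown entries come out as predictions; this matrix completion view is precisely what collective matrix factorization generalizes to many matrices at once.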
Notice that in a real setting such matrices can contain information for millions of users and hundreds of thousands of movies. However, each individual user typically sees only a small proportion of all the movies and rates even fewer of them. Hence, data matrices in recommender systems are typically extremely sparse, e.g., it is common that up to ~99% of matrix elements are unknown. This characteristic together