Overview of Data Management Tools


Data Management: Collecting, Protecting and Managing Data
Presented by Andre Hackman
Johns Hopkins Bloomberg School of Public Health, Department of Biostatistics
July 17, 2014

Objectives
• Review available database and data management tools and their capabilities
• Demonstrate examples of good data management
• Discuss strategies and good practice for data security and backup
• Discuss data conversion and transfer options

Overview of Data Management Tools
Types of data management tools to consider:
• Spreadsheet: Microsoft Excel, Google Docs Spreadsheet, Open Office Calc
• Desktop database: Microsoft Access, FileMaker Pro
• Specialized databases: Epi Info, CSPro, EpiData
• Web database: REDCap, OpenClinica
• Mobile data collection: Open Data Kit, CommCare, Magpi, iFormBuilder

Choosing a data management tool
Pick the tool that is appropriate for the data needs of the project. What challenges will you face with each tool?
• Complexity and scale of the data being collected
  – Is it longitudinal data with many time points?
  – What is the expected number of records?
• Mixture of collection modes: mobile, kiosk, and computer entry
  – How will data flow between them?
• Complexity of the data entry forms
• Multiple-user access
  – Can you control and restrict access?
  – Restrict data views by site
  – Control whether users have full or read-only access
• Merging/importing/exporting data capabilities
• Audit capabilities
• Reporting capabilities during and after the data collection: built-in or a separate tool?
  – During: What data is collected or missing? Running validation checks on the data; interim looks at the trends
  – After: Summaries; analytic results of the study
Spreadsheets
• The "lowest common denominator": a format that all users can open and edit
• Often used by labs to output their results
• Disadvantages
  – Data can be corrupted by single-column sorts or accidental keystrokes
  – The data entry interface is limited
  – The program sometimes tries to "assist" the user with the data
  – Not designed for multi-user entry (unless you are using a web-based spreadsheet)
• Helpful spreadsheet data management commands
  – Transpose data (Paste Special, Transpose)
  – Remove duplicates (Data, Remove Duplicates wizard)
  – Match-merge data (VLOOKUP)
  – Concatenation/substrings (CONCATENATE, MID)
  – Data validation (Data, Data Validation)
  – Use "Forms" for data entry (Excel: add the Form button to the Quick Access Toolbar)
  – Use "Table" to maintain integrity (Insert, Table)
  – Use Filter to quickly subset your data

Spreadsheets: data management tips
• Transpose data (Copy, Paste Special, Transpose): from columns to rows
• Concatenation: create new fields by concatenating fields together
• Data validation: for coded fields, create validation rules to ensure consistent data
• Remove Duplicates tool: to ensure unique records, use the Remove Duplicates option. Be careful if you are not checking entire records.
• "Forms" tool for data entry: some spreadsheets can automatically create a data entry form.
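The spreadsheet commands above have direct counterparts in a scripting language. A minimal sketch using Python's pandas library (not part of the original slides; the column names and values are invented for illustration):

```python
import pandas as pd

# A small lab-style result sheet with one duplicated record.
results = pd.DataFrame({
    "subject_id": ["S001", "S002", "S002", "S003"],
    "visit":      [1, 1, 1, 2],
    "glucose":    [92, 105, 105, 88],
})

# "Remove Duplicates": drop rows that repeat entire records.
results = results.drop_duplicates()

# VLOOKUP-style match merge against a site roster.
roster = pd.DataFrame({"subject_id": ["S001", "S002", "S003"],
                       "site": ["Baltimore", "Nairobi", "Baltimore"]})
merged = results.merge(roster, on="subject_id", how="left")

# CONCATENATE: build a composite key from two fields.
merged["record_key"] = merged["subject_id"] + "-" + merged["visit"].astype(str)

# Data validation: flag values outside an allowed range.
bad = merged[~merged["glucose"].between(40, 400)]

# Transpose (Paste Special, Transpose): columns become rows.
wide = merged.set_index("record_key").T
```

Unlike manual spreadsheet commands, these steps can be rerun unchanged whenever the lab delivers a new file, which also serves as an audit record of what was done to the data.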
The form is simple, but often better than trying to enter data directly into the spreadsheet.

Desktop databases
• Microsoft Access and FileMaker Pro are commonly used
• Include database, forms, reports, queries, and custom programming
• Easy to set up initially using wizards
• Designed for simultaneous access by multiple users
• Disadvantages
  – By default, limited to local network access
  – Built-in wizards make it easy to start, but tasks beyond the wizards require training
  – Scalability can be an issue if the amount of data is large

(Screenshot slides: Microsoft Access, showing all database elements in one file; data entry forms; queries; reports. From "Access 2007: The Missing Manual", Matthew MacDonald, O'Reilly Media, 2006.)

Specialized databases
• Epi Info, CSPro, and EpiData are common
• Include database, forms, reports, queries, custom programming, and analysis tools
• Quickly set up a database by building a dictionary and forms
• Some include mapping capability to map your data
• Some can be networked for multi-user access but aren't truly multi-user
• Disadvantages
  – Can require significant training time (CSPro, for example)
  – Tend to be Windows-only applications

(Screenshot slides: EpiData; Epi Info form design; Epi Info mapping.)

Web databases
• REDCap and OpenClinica are examples
• Include database, forms, reports, queries, user/site management, audit trails, backup, and security
• Automatically build forms from codebooks
• Accessible from a web browser; no additional software needed
• Scale for very large studies and multiple sites
• Most have large user communities as a resource
• Disadvantages
  – Features that YOU want may not exist yet
  – Some changes require approval by a system administrator
• http://www.project-redcap.org/

(Screenshot slides: REDCap project home view; entering data; exporting data; setting user rights; audit trail logging.)

Mobile data collection
• Apple (iPhone, iPad) and Android (phones and tablets) are the main options
• Data is entered directly on a smartphone or tablet
• Data can be stored locally and transmitted manually, OR it can be transferred in real time to a central database
• Smartphones work well for simple question types
• Tablets are better suited for complex question types with many answers, or for seeing previous answers
• Considerations:
  – If data is stored on the device, is it protected in case of theft, damage, etc.?
  – Do you need a mobile data connection to transfer data on demand, or will it be transferred via WiFi or USB?
  – Battery life of the device
  – Where will the data ultimately be stored? A central server? A website?
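The "store locally, transmit when connected" pattern described above can be sketched in Python. This is a hypothetical illustration of the queue-and-flush logic only, not any particular vendor's API; the record fields and the stand-in transmit function are invented:

```python
import json
from typing import Callable

class LocalStore:
    """Holds completed forms on the device until a connection is available."""

    def __init__(self) -> None:
        self._queue: list[dict] = []

    def save(self, record: dict) -> None:
        # In a real app this would write to encrypted on-device storage,
        # which matters if the device is lost or stolen.
        self._queue.append(record)

    def pending(self) -> int:
        return len(self._queue)

    def flush(self, transmit: Callable[[str], bool]) -> int:
        """Send queued records via `transmit`; keep any that fail.

        Returns the number of records successfully sent.
        """
        sent, remaining = 0, []
        for record in self._queue:
            if transmit(json.dumps(record)):
                sent += 1
            else:
                remaining.append(record)
        self._queue = remaining
        return sent

# Simulated use: two interviews saved offline, then synced over WiFi.
store = LocalStore()
store.save({"subject_id": "S001", "consent": "yes"})
store.save({"subject_id": "S002", "consent": "no"})
uploaded = store.flush(transmit=lambda payload: True)  # stand-in for an HTTP POST
```

Keeping failed records in the queue means a dropped connection only delays transfer rather than losing data, which is the property the considerations list is driving at.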
Options for mobile data collection
• iFormBuilder: https://www.iformbuilder.com/
• doForms: http://www.doforms.com/
• Magpi: https://www.magpi.com
• Formhub (Android only): https://www.formhub.org/
• Enketo: https://enketo.org/
• Open Data Kit (ODK) (Android only): this is what our Center currently uses for mobile data collection

(Screenshot slides: Open Data Kit sample screens.)

Form & Database Design
Recommendations for fields and database design:
• Create codes for missing and refused data fields
  – A common convention is "9" for missing, but only if 9 is not a valid value
  – Problematic if the 9's get included in the analysis
  – Sometimes you need to distinguish between "missing – not obtained", "don't know", and "refused to answer"
• Use multiple identifiers on each form as a cross-check
  – You don't need to save all of them each time; it is simply a cross-check
• Consider requiring that all fields are entered unless they are…
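The caution about 9's leaking into the analysis can be made concrete. A sketch in Python, where the specific codes 7/8/9 and the response values are illustrative assumptions rather than taken from the slides:

```python
# Hypothetical coded survey field: 1-5 are valid answers;
# 7 = "don't know", 8 = "refused", 9 = "missing - not obtained".
MISSING_CODES = {7: "don't know", 8: "refused", 9: "missing - not obtained"}

responses = [3, 5, 9, 2, 7, 4, 8, 1]

# Naive mean: the 7/8/9 codes are treated as real answers and inflate it.
naive_mean = sum(responses) / len(responses)

# Correct: separate valid values from coded missingness before analyzing.
valid = [r for r in responses if r not in MISSING_CODES]
reasons = [MISSING_CODES[r] for r in responses if r in MISSING_CODES]
clean_mean = sum(valid) / len(valid)
```

Here the naive mean is 4.875 while the mean of the valid answers is 3.0, and the recoding also preserves the distinction between "don't know", "refused", and "not obtained" for reporting.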