Data Management for Scientific Research
Photo: © Stanza. Used with permission.

Course Introduction
Welcome to a course in data management for scientific research projects.

Course Structure
Casual “guided” study-group approach
Presentations, demos, hands-on exercises, discussions and “homework”
Materials: A textbook, eBooks, websites, and online videos

Why Take This Course?
PracticalComputing.org
Researchers work with increasing amounts of data.
Many students do not have training in data management.
Science degree programs generally do not address this gap.
It is difficult for “non-majors” to get into IT courses.
This leaves students and research teams struggling to cope, which in turn places a heavy burden on IT support.
Our data management course provides the skills needed to address these issues.
Exciting new discoveries await those who can effectively sift through mounds of data!

Participant Introductions
Please introduce yourself and share:
Your degree program and emphasis
Your research area (general topic)
Your current research project (specific topic)
The types of data or data systems you use in this project
What you hope to get out of this course

Session 1: Data System Essentials
How will you manage your data? You need a data system. There are many choices. To pick the best one, you need to state your requirements.
Photo: NASA

Today's Learning Objectives
In this session, you will …
Become familiar with common types of data systems
Learn to differentiate between flat files and relational databases
Learn to differentiate between spreadsheets and databases
Learn how to model system functions and interactions
Learn how to create system diagrams
Learn how to state system requirements
Ultimately, this knowledge will help you select or design the best data system for your needs.

Types of Data Systems
Categories: Unlinked, Linked
Types: Flat file, Network, Distributed, Hierarchical, Relational, Object relational, NoSQL

Flat Files: MS Office Documents, Plain Text Files (CSV, TXT), Instrument Output, Stats. Program Output
Relational Databases: MS Access, FileMaker Pro, SQLite, MS SQL Server, Oracle, MySQL, MariaDB, PostgreSQL

Spreadsheets and Databases
An excellent short video presentation explaining the differences between databases and spreadsheets can be found on YouTube:
Video: What are databases? - lynda.com
Watching this video is a “homework” assignment. So for now, we will just summarize the differences.
Source: WHO and Mozilla/dietrich

Spreadsheets
Convenient
Interactive
Visual
Flexible
Portable
Source: WHO

Databases
Manageable
Structured
Standardized
Scalable
Accessible
Graphic: Mozilla/dietrich

Designing a Data System
To design a data system, we need to identify requirements and map out interactions and components. In this course you will learn how to create:
Use Case Diagrams
Data Flow Diagrams
Entity Relationship Diagrams
So let's get started!
Graphic: EPISTLE and its successors / Matthew West, Julian Fowler, Razorbliss / Wikimedia

Get the Picture: Use Case Diagrams
Let's visualize a model of a “system” …
[Figure: use case diagram of a restaurant system. Actors: Waiter, Chef, Client, Cashier. Use cases inside the system boundary: Order Food, Order Wine, Cook Food, Serve Food, Serve Wine, Eat Food, Drink Wine, Pay for Food, Pay for Wine. The <<extend>> relationships carry the conditions “if wine was ordered”, “if wine was served” and “if wine was consumed”. Interaction labels: receive order, confirm order, place order, facilitate payment, pay, accept payment.]
Graphic: Kishorekumar 62 (redrawn by Marcel Douwe Dekker) / Wikimedia
Use Case Diagrams focus on the “what” and not the “how”. They model what people want to do with a system. A use case describes a “goal”, expressed as an “action”. People and other external entities are modeled as “actors” that “interact” with the system.
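
For those who prefer code to pictures, the actor and use case vocabulary above can be sketched as a small data structure. This is only an illustrative sketch in Python: the class `UseCaseModel`, its methods, and the helper `build_restaurant_model` are hypothetical names invented here, and the actor-to-goal links are a plausible guess for a restaurant system rather than a transcription of the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class UseCaseModel:
    """A tiny in-memory stand-in for a use case diagram (hypothetical helper)."""

    system: str  # name of the system boundary
    # Maps each actor to the set of use cases (goals) that actor takes part in.
    interactions: dict[str, set[str]] = field(default_factory=dict)

    def add_interaction(self, actor: str, use_case: str) -> None:
        """Record that an actor interacts with the system to reach a goal."""
        self.interactions.setdefault(actor, set()).add(use_case)

    def actors(self) -> list[str]:
        """All actors, sorted by name."""
        return sorted(self.interactions)

    def use_cases(self) -> list[str]:
        """All distinct use cases, sorted by name."""
        return sorted(set().union(*self.interactions.values()))

    def actors_for(self, use_case: str) -> list[str]:
        """Actors that take part in the given use case."""
        return sorted(a for a, ucs in self.interactions.items() if use_case in ucs)


def build_restaurant_model() -> UseCaseModel:
    # Assumed links between actors and goals, for illustration only.
    model = UseCaseModel(system="Restaurant")
    for actor, use_case in [
        ("Client", "Order Food"),
        ("Waiter", "Order Food"),
        ("Chef", "Cook Food"),
        ("Waiter", "Serve Food"),
        ("Client", "Eat Food"),
        ("Client", "Pay for Food"),
        ("Cashier", "Pay for Food"),
    ]:
        model.add_interaction(actor, use_case)
    return model


if __name__ == "__main__":
    model = build_restaurant_model()
    for uc in model.use_cases():
        print(f"{uc}: {', '.join(model.actors_for(uc))}")
```

Note that the structure records only who wants to achieve what, in “verb noun” form. It says nothing about how the goal is carried out, which matches the “what, not how” focus of a use case diagram.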

Example System Interactions
Imagine a system called “research project.” Some interactions that might appear in a model of this system are:
1. Researcher proposes experimental design.
2. Principal investigator approves experimental design.
3. Researcher creates survey.
4. Subject takes survey.
5. Subject provides survey results.
6. Researcher analyses results.
7. Researcher produces manuscript.
8. Principal investigator reviews manuscript.
Let's visualize these interactions in a use case diagram…

Research Project Use Case Diagram
If we were only modeling the data system, we would probably remove the goals (and some actors) which were “out of scope” with respect to the data system…

Survey Data System Use Case Diagram
1. Researcher uploads survey.
2. Subject takes survey.
3. Subject uploads results.
4. Researcher downloads results.
Here, the scope only encompasses the goals of conducting the survey and returning results.
This is the basic operation of the Open Data Kit system, which we will learn about next week.

But what good are they, really?
Modeling diagrams help you:
Clarify your own understanding
Explore possibilities
Communicate with others
Prepare for more detailed design steps
As a researcher, you can use these to clarify your project scope and requirements. They will help you present your project needs to others, such as your collaborators and support staff.
Use case diagrams identify what a system must do and how people will interact with it. A complete use case model includes use case diagrams and textual descriptions of each use case.

Hands-on Group Exercise
Photo: SarahStierch / Wikimedia

Create a Use Case Diagram
As a group, list the goals (actions, use cases) for your research data system in “verb noun” form. Then figure out who (actors) will interact to perform those actions.
Draw a simple use case diagram with stick figures (actors) and ellipses (goals, use cases). Use pen and paper or software.
All of the ellipses should be enclosed in a “system boundary” box (if the software supports that), with the stick figures outside of the box. Lines (interactions) should connect the actors to their goals. Label the lines with what the actor does (action) to achieve the goal.

Discussion
We will display your diagrams on the screen and discuss them.
Graphic: Jagbirlehl / Wikimedia

In the Coming Sessions...
We will continue the Needs Analysis of your data system with:
Detailed Use Cases - including text descriptions
More Systems Analysis - including Data Flow Diagrams
A Requirements Document - compiling the above
A Feasibility Study
This will present us with some options, typically:
Do nothing (business as usual)
Get something “off the shelf” (free or commercial)
Build something ourselves
… or some combination of the above.

Action Items
Videos
Readings
Tasks

Watch Videos
Watch these videos in the order listed.
What are databases?
Discover to Deliver Structured Conversation
Use Case Diagram Tutorial (watch first two or more)
ODK (watch one or two)

Readings
Read: In the PCfB textbook: “Before You Begin”, pp. 1-6; Chapters 1-3, pp. 9-43; and Appendix 1, pp. 451-453
Read: Use Case Tips
Skim Wikipedia articles: Data Management, Data System, Data Modeling, Needs Analysis, Agile Modeling, SDLC
Skim eBook chapter: Beginning Database Design Chapter 3: Initial Requirements and Use Cases
Explore websites: Agile Modeling, ODK

Tasks
We have several tasks to perform as “homework” before our next session. They should be fairly quick to complete. You might do one task per day, spending maybe 15-30 minutes on each task.

Task 1: Your favorite website's database
Find out through Internet research what database system (product name, database type, etc.) underlies your favorite or most-visited website. Examples might be a webmail, search, social, video/movie/music/store, blog, forum, or news website.
(Since there are links to information about this on Facebook below, pick another site if that was your favorite.) If the site is popular, you will likely find a blog, news article or conference presentation mentioning the technology that the site uses, including its back-end database system. Look up the database system product name in Wikipedia. Try to determine why that product was chosen over the alternatives. Be ready to share this information in the next class session in a one-minute verbal presentation.

Task 2: Limits to Excel as a "database"
Find out the actual limits on MS Excel (max. file size, number of rows, etc.) that would make it unusable as a database if those limits were exceeded. (These may vary depending on the software version.) How about for OpenOffice (LibreOffice) “Calc”? (bonus points)
For the Excel experts (bonus points): How do you link spreadsheets by matching column headings, control the allowed values which can be entered in a column, protect cells (say, those containing formulas or constants) from being changed, restrict who can modify or view certain spreadsheets, and access the linked spreadsheets from other applications (like websites or statistics programs) over a network? If you know how to do these things, please demonstrate in class for us.

Task 3: Data Sources and Needs Analysis
Use your wiki in Redmine (or GitHub, etc.) to document the list of the data sources you will be working with in your project. Note the file types/applications, the organizations/persons/processes they came from, and what you will do to/with them. The wiki language supports tables, which might be a good way to format the information in the wiki. Later you will use this wiki to further elucidate your “data dictionary”. Perform a