Data Management for Scientific Research

Photo: © Stanza. Used with permission.

Course Introduction

Welcome to a course in data management for scientific research projects.

Course Structure

Casual “guided” study-group approach
Presentations, demos, hands-on exercises, discussions and “homework”
Materials: A textbook, eBooks, websites, and online videos

Why Take This Course?

PracticalComputing.org

Researchers work with increasing amounts of data.
Many students do not have training in data management.
Science degree programs generally do not address this gap.
It is difficult for “non-majors” to get into IT courses.
This leaves students and research teams struggling to cope, which in turn places a heavy burden on IT support.

Our data management course provides the skills needed to address these issues.

Exciting new discoveries await those who can effectively sift through mounds of data!

Participant Introductions

Please introduce yourself and share your:

Degree program and emphasis
Research area (general topic)
Current research project (specific topic)
The types of data or data systems you use in this project
What you hope to get out of this course

Session 1: Data System Essentials

How will you manage your data?

You need a data system.

There are many choices.

To pick the best one, you need to state your requirements.

Photo: NASA

Today's Learning Objectives

In this session, you will …

Become familiar with common types of data systems
Learn to differentiate between flat files and relational databases
Learn to differentiate between spreadsheets and databases
Learn how to model system functions and interactions
Learn how to create system diagrams
Learn how to state system requirements

Ultimately, this knowledge will help you select or design the best data system for your needs.

Types of Data Systems

Unlinked: Flat file
Linked: Network, Distributed, Hierarchical, Relational, Object relational, NoSQL

Flat Files: MS Office Documents, Plain Text Files (CSV, TXT), Instrument Output, Stats. Program Output
Relational Databases: MS Access, FileMaker Pro, SQLite, MS SQL, Oracle, MySQL, MariaDB, PostgreSQL

Spreadsheets and Databases

An excellent short video presentation explaining the differences between databases and spreadsheets can be found on YouTube:

Video: What are databases? - lynda.com

Source: WHO and Mozilla/dietrich

Watching this video is a “homework” assignment.

So for now, we will just summarize the differences.

Spreadsheets

Convenient
Interactive
Visual
Flexible
Portable

Source: WHO

Databases

Manageable
Structured
Standardized
Scalable
Accessible
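To make the flat file versus relational distinction concrete, here is a minimal Python sketch. The survey data, the table names (`subject`, `response`), and the column names are all hypothetical and chosen only for illustration. It reads a small flat file in which each row repeats the subject's details, then splits the same rows into two linked tables in an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical flat file: every row repeats the subject's details.
FLAT_CSV = """subject_id,subject_name,question,answer
1,Ana,Q1,yes
1,Ana,Q2,no
2,Ben,Q1,yes
"""


def load_flat(text):
    """Read the flat file into a list of dicts (one dict per row)."""
    return list(csv.DictReader(io.StringIO(text)))


def build_relational(rows):
    """Split the flat rows into two linked tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE response ("
        "subject_id INTEGER NOT NULL REFERENCES subject(id), "
        "question TEXT NOT NULL, answer TEXT NOT NULL)"
    )
    for row in rows:
        # Each subject is stored once; responses link back by subject_id.
        conn.execute(
            "INSERT OR IGNORE INTO subject (id, name) VALUES (?, ?)",
            (int(row["subject_id"]), row["subject_name"]),
        )
        conn.execute(
            "INSERT INTO response (subject_id, question, answer) VALUES (?, ?, ?)",
            (int(row["subject_id"]), row["question"], row["answer"]),
        )
    conn.commit()
    return conn


def yes_answers_by_name(conn):
    """Join the two tables to count 'yes' answers per subject."""
    query = (
        "SELECT s.name, COUNT(*) FROM subject AS s "
        "JOIN response AS r ON r.subject_id = s.id "
        "WHERE r.answer = 'yes' GROUP BY s.name ORDER BY s.name"
    )
    return conn.execute(query).fetchall()


rows = load_flat(FLAT_CSV)
conn = build_relational(rows)
yes_counts = yes_answers_by_name(conn)
print(yes_counts)
```

In the flat file, correcting a subject's name means editing every row for that subject. In the relational version the name lives in one place and the join reassembles the combined view on demand.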

Graphic: Mozilla/dietrich

Designing a Data System

To design a data system, we need to identify requirements and map out interactions and components. In this course you will learn how to create:

Use Case Diagrams
Data Flow Diagrams
Entity Relationship Diagrams

Graphic: EPISTLE and its successors / Matthew West, Julian Fowler, Razorbliss / Wikimedia

So let's get started!

Get the Picture: Use Case Diagrams

Let's visualize a model of a “system” …

[Figure: restaurant use case diagram. Actors: Waiter, Chef, Client, Cashier. Use cases inside the system boundary: Order Food, Order Wine, Cook Food, Serve Food, Serve Wine, Eat Food, Drink Wine, Pay for Food, Pay for Wine. The wine use cases carry conditions: {if wine was ordered}, {if wine was served}, {if wine was consumed}.]

Graphic: Kishorekumar 62 (redrawn by Marcel Douwe Dekker) / Wikimedia

Use Case Diagrams focus on the “what” and not the “how”. They model what people want to do with a system. A use case describes a “goal”, expressed as an “action”. People and other external entities are modeled as “actors” that “interact” with the system.

Example System Interactions

Imagine a system called “research project.” Some interactions that might appear in a model of this system are:

1. Researcher proposes experimental design.
2. Principal investigator approves experimental design.
3. Researcher creates survey.
4. Subject takes survey.
5. Subject provides survey results.
6. Researcher analyses results.
7. Researcher produces manuscript.
8. Principal investigator reviews manuscript.

Let's visualize these interactions in a use case diagram…

Research Project Use Case Diagram

If we were only modeling the data system, we would probably remove the goals (and some actors) that are “out of scope” with respect to the data system…

Survey Data System Use Case Diagram

Here, the scope only encompasses the goals of conducting the survey and returning results.

1. Researcher uploads survey.
2. Subject takes survey.
3. Subject uploads results.
4. Researcher downloads results.
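The four interactions above can also be written down as data before anything is drawn. The following Python sketch is one possible way to do that, not a standard notation: the names `SURVEY_DATA_SYSTEM`, `goals_by_actor` and `describe` are hypothetical. It groups each goal under its actor and prints a plain-text outline that mirrors the diagram, with use cases inside the system boundary and actors outside it.

```python
from collections import defaultdict

# The four survey data system interactions as (actor, goal) pairs,
# with each goal in "verb noun" form.
SURVEY_DATA_SYSTEM = [
    ("Researcher", "upload survey"),
    ("Subject", "take survey"),
    ("Subject", "upload results"),
    ("Researcher", "download results"),
]


def goals_by_actor(interactions):
    """Group goals under the actor who pursues them, preserving list order."""
    grouped = defaultdict(list)
    for actor, goal in interactions:
        grouped[actor].append(goal)
    return dict(grouped)


def describe(interactions, system_name):
    """Render a text outline: use cases inside the boundary, actors outside."""
    lines = [f"System boundary: {system_name}"]
    for actor, goals in goals_by_actor(interactions).items():
        lines.append(f"  Actor (outside boundary): {actor}")
        for goal in goals:
            lines.append(f"    Use case (inside boundary): {goal}")
    return "\n".join(lines)


model = goals_by_actor(SURVEY_DATA_SYSTEM)
outline = describe(SURVEY_DATA_SYSTEM, "Survey Data System")
print(outline)
```

Listing the pairs first makes it easy to check that every goal has an actor and that no out-of-scope actor remains before the diagram is drawn.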

These four interactions are the basic operation of the Open Data Kit system, which we will learn about next week.

But what good are they, really?

Modeling diagrams help you:

Clarify your own understanding
Explore possibilities
Communicate with others
Prepare for more detailed design steps

As a researcher, you can use these to clarify your project scope and requirements. They will help you present your project needs to others, such as your collaborators and support staff.

Use case diagrams identify what a system must do and how people will interact with it. A complete use case model includes use case diagrams and textual descriptions of each use case.

Hands-on Group Exercise

Photo: SarahStierch / Wikimedia

Create a Use Case Diagram

As a group, list the goals (actions, use cases) for your research data system in “verb noun” form. Then figure out who (actors) will interact to perform those actions.

Draw a simple use case diagram with stick figures (actors) and ellipses (goals, use cases). Use pen and paper or diagramming software.

All of the ellipses should be enclosed in a “system boundary” box (if the software supports that), with the stick figures outside of the box.

Lines (interactions) should connect the actors to their goals. Label the lines with what the actor does (action) to achieve the goal.

Discussion

We will display your diagrams on the screen and discuss them.

Graphic: Jagbirlehl / Wikimedia

In the Coming Sessions...

We will continue the Needs Analysis of your data system with:

Detailed Use Cases - including text descriptions
More Systems Analysis - including Data Flow Diagrams
A Requirements Document - compiling the above
A Feasibility Study

The feasibility study will present us with some options, typically:

Do nothing (business as usual)
Get something “off the shelf” (free or commercial)
Build something ourselves
… or some combination of the above.

Action Items

Videos
Readings
Tasks

Watch Videos

Watch these videos in the order listed.

What are databases?
Discover to Deliver Structured Conversation
Use Case Diagram Tutorial (watch first two or more)
ODK (watch one or two)

Readings

Read: In the PCfB textbook: “Before You Begin”, pp. 1-6; Chapters 1-3, pp. 9-43; and Appendix 1, pp. 451-453
Read: Use Case Tips
Skim Wikipedia articles: Data Management, Data System, Data Modeling, Needs Analysis, Agile Modeling, SDLC
Skim eBook chapter: Beginning Design Chapter 3: Initial Requirements and Use Cases
Explore websites: Agile Modeling, ODK

Tasks

We have several tasks to perform as “homework” before our next session. They should be fairly quick to complete. You might do one task per day, spending maybe 15-30 minutes on each task.

Task 1: Your favorite website's database

Find out through Internet research what database system (product name, database type, etc.) underlies your favorite or most-visited website. Examples might be a webmail, search, social, video/movie/music/store, blog, forum, or news website. (Since there are links to information about this on below, pick another site if that was your favorite.)

If the site is popular, you will likely find a blog, news article or conference presentation mentioning the technology that the site uses, including its back-end database system. Look up the database system product name in Wikipedia. Try to determine why that product was chosen over the alternatives. Be ready to share this information in the next class session in a one-minute verbal presentation.

Task 2: Limits to Excel as a "database"

Find out the actual limits on MS Excel (max. file size, number of rows, etc.) that would make it unusable as a database if those limits were exceeded. (These may vary depending on the software version.)

How about for OpenOffice (LibreOffice) “Calc”? (bonus points)

For the Excel experts (bonus points): How do you link spreadsheets by matching column headings, control the allowed values which can be entered in a column, protect cells (say, those containing formulas or constants) from being changed, restrict who can modify or view certain spreadsheets, and access the linked spreadsheets from other applications (like websites or statistics programs) over a network? If you know how to do these things, please demonstrate in class for us.

Task 3: Data Sources and Needs Analysis

Use your wiki in Redmine (or GitHub, etc.) to document the data sources you will be working with in your project. Note the file types/applications, the organizations/persons/processes they came from, and what you will do to/with them.

The wiki language supports tables, which might be a good way to format the information in the wiki. Later you will use this wiki to further elucidate your “data dictionary”.
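If you keep your list of data sources in a script or spreadsheet export, the wiki table can be generated from it. This is a minimal Python sketch under stated assumptions: the two example sources and the column names are hypothetical, and the output uses the pipe-delimited table syntax of GitHub-flavored Markdown. A Redmine wiki may be configured with a different markup, so check which one yours uses and adjust the row format accordingly.

```python
# Hypothetical data sources; replace with the ones in your own project.
DATA_SOURCES = [
    {
        "source": "Field survey",
        "file_type": "CSV",
        "origin": "Survey app export",
        "use": "Clean and analyze",
    },
    {
        "source": "Lab instrument",
        "file_type": "TXT",
        "origin": "Instrument output",
        "use": "Import into database",
    },
]

# Column order for the table (assumed names, matching the dict keys above).
COLUMNS = ["source", "file_type", "origin", "use"]


def to_markdown_table(records, columns):
    """Build a pipe-delimited Markdown table with a header and separator row."""
    header = "| " + " | ".join(columns) + " |"
    separator = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(record[col]) for col in columns) + " |"
        for record in records
    ]
    return "\n".join([header, separator] + body)


table = to_markdown_table(DATA_SOURCES, COLUMNS)
print(table)
```

Paste the printed text into a wiki page. Keeping the source list as data means the table can be regenerated whenever a data source is added.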

Perform a needs analysis. For example, how will you access your data (from campus, remotely, from a mobile device, using what software?) and what sorts of security protections will you need (encryption, access controls)? What other goals and requirements do you have? Store the detailed list in your wiki.

Task 4: Use Case Diagram for your project

Based on your needs analysis, produce a Use Case Diagram for your research study data system. Make it more detailed than the one we made in class today. Break out complicated actions into separate, more detailed, diagrams if you need to.

Include all of the data-related goals and tasks associated with your research project from beginning to end. Go into a level of detail which would communicate your data system needs clearly to an IT professional (analyst, designer, developer, or administrator).

You can use pen and paper to make the diagram or you can use software tools such as Creately, Gliffy, Dia, or Violet. You will present this diagram (for two minutes) in the next class session.

Task 5: Get files for textbook exercises

Download Examples from the textbook and extract the example files from the “pcfb_examples.zip” file to the folder “pcfb”. Put that folder in whichever environment you will be working in. For now, this will probably be your “Documents” folder on your own computer or in your “home directory” on a Unix or server.

See also

Google and Facebook Team Up to Modernize Old-School Databases
WebScaleSQL: MySQL for Facebook-sized databases
What database actually FACEBOOK uses?
What database does Facebook use?
NYT: Healthcare.gov Project Chaos Due Partly To Unorthodox Database Choice (Slashdot)
Topics in Data Management
Is “Data” Singular or Plural?

Questions and Comments

Image: © Nevit Dilmen / Wikimedia

Some Parting Words

It is estimated that 40% of the defects that make it into the testing phase of enterprise software have their root cause in errors in the original requirements documents.

Image: Andy Glover. Used with permission.

From: Obamacare's Website Is Crashing Because Backend Was Doomed In The Requirements Stage (Forbes)