Data Management: Project Lesson Overview

Intro:
- Sources of data
- Choosing a question

Learning goals of this lesson:
- to help you choose a topic for which data exists and is accessible to you
- to help you find the data to support your project
- to teach you how to access quality information from Statscan (the public face of our government statistics bureau)

1. Statistics in the News

http://google.com Search for statscan news

http://www.thestar.com/business/tech_news/2013/11/01/statscan_data_points_to_canadas_growing_digital_divide_geist.html

Canadians & literacy, numeracy http://www.sunnewsnetwork.ca/sunnews/canada/archives/2013/10/20131008-102228.html

http://ca.news.yahoo.com/adding-up-the-ways-we%E2%80%99re-falling-behind-in-education-194540699.html

http://www.nationalpost.com/m/wp/sports/mlb/blog.html?b=sports.nationalpost.com/2013/09/24/more-than-two-thirds-of-quebecers-want-major-league-baseball-back-in-montreal-according-to-poll

Statscan twitter

2. What are statistics used for?

Medical studies; Epidemiology (tracking disease outbreaks)

Merchandising (tracking consumer purchases)

Government policy

Personal decisions – stocks, investments, productivity

Sports (draft position) http://www.tsn.ca/columnists/scott_cullen/?id=267960

A move towards “evidence-based decision making”

3. Good graphs; Bad graphs

Animating data to produce “information” http://www.youtube.com/watch?v=jbkSRLYSojo

How to Make Data Look Sexy (CNN)

Are women bad at math? http://www.slate.com/blogs/xx_factor/2013/08/29/are_women_bad_at_math_graphs_refute.html

Bad graphs
http://misterguch.brinkster.net/graph.html
http://gator.gatewayk12.org/~smcgrail/myweb/powerpoint/misleading_graphs/here_are_some_examples_of_mislea.htm
http://en.wikipedia.org/wiki/Misleading_graph

Bad graphs

http://lilt.ilstu.edu/gmklass/pos138/datadisplay/images/phillips2.jpg

Three elements of bad graph design: Data Ambiguity, Data Distortion, and Data Distraction. http://lilt.ilstu.edu/gmklass/pos138/datadisplay/badchart.htm

Types of graphs http://cas.illinoisstate.edu/jpda/charting_data/fillesbysection.shtml

Scatterplots (correlation vs causation)

4. How do I find a topic?

Where to get ideas for your project:

1. Do a literature search using the databases (Virtual Library) (secondary; library) to see
- What relationships are out there
- What data has been collected already

2. Browse StatsCan to see what categories may be of interest to you.

3. Other online searches (include the term statistics or MDM 4U) http://mathforum.org/workshops/usi/dataproject/usi.hslessons.html

MDM4U resources on the Brock University website
http://www.brocku.ca/cmt/mdm4u/intro.htm
http://www.brocku.ca/cmt/mdm4u/resr/index.html

4. Brainstorm topics of interest

5. Exemplars
http://schools.hwdsb.on.ca/highland/files/2011/02/MDM4U-Final-Project-PGA.pdf
http://teacherweb.com/ON/statistics/Math/photo3.stm

Proposal Phase:

1. Do some presearching; find background information on the topic; determine what is already known in the field.
2. Thesis: your main thesis question/statement and the sub-problems you are going to answer.
3. Determine the Population you will seek out or the Sample you will use.
4. Analyze: explain each of the following:
   1. What are the main variables in your question?
   2. Can these variables be measured statistically?
   3. Is there enough data to make an interesting analysis?
5. Hypothesis: predict what you expect to find or observe.
6. Why is it important for you to investigate this topic? Who is it most relevant to?
7. Data: include either
   1) all of the raw data that you are going to use from the Internet, books, etc., sourced. NOTE: For large datasets, a 1-page sample including a WWW link (with Name/Title) to the rest is sufficient. I need to know how the researchers got their data.
   OR
   2) the survey that you are going to use. It should not be distributed yet.

HINT: Start your Bibliography as soon as you find your first useful web site. Trying to go back and find information later is a nightmare.

5. Research and Project Design

1. Title Page

2. Table of Contents
- Include section headings and page numbers
- NOT numbered

3. Summary (like an abstract)
- Do not write this until you are finished your project!
- Page numbering starts here (1); insert a section break
- In one page, briefly summarize your entire report.
- A summary section is something that would be read by a manager who didn’t have enough time to read the entire report, so make sure that you have enough details that it can stand by itself.
- At the very least, include the following information:
  - Problem: A clear statement of what you are trying to learn
  - Plan: The procedure you will use to carry out the study (How do you choose people? How do you measure? Who does the measuring? What methods are you going to use?)
  - Data: The data are collected according to the plan (What data did you collect? Where did it come from?)
  - Analysis: The data are summarized and analyzed to answer the thesis question (numerical, graphical, informative sentences)
  - Conclusions are drawn about what has been learned (note any biases, suggest further studies)

4. Problem
- Main thesis question. The thesis question is the theme of your report (e.g. What is the relationship between an NBA player’s salary and their success?). Try to use the word “relationship” in your thesis question. Remember, you do not have the tools to try and find any cause and effect.
- Sub-questions: The sub-questions are the smaller questions that you will answer that will lead you to conclude on your main thesis question. These should be specific enough that they contain your variables that you will compare. The problems may evolve slightly throughout the life of your project. (e.g. What is the relationship between salary and a player’s points per game? What is the relationship between salary and a player’s rebounds per game? What is the relationship between salary and the number of games that a player has won?)
- Hypothesis: What do you expect to find?
- Define the population and describe the characteristics of the population (e.g. all players in the NBA that played at least 70 games during the 2011-12 regular season).
- Define the independent variables (e.g. points per game in 2011-12 NBA regular season, rebounds per game in 2011-12 NBA regular season).
- Define the dependent variables (e.g. player salary).

5. Plan
- Select the sampling method and justify your choice
- Design and explain the Experiment/Survey/Questionnaire/Data Collection process.
- Identify any possible biases

NOTE: if the data is not your own, you need to find out as much of the above information as possible and point out the parts that you don’t know.

6. Data
- Put all of your raw data collected in an appendix, not in this section
- Include summaries of your key variables here (frequency tables, but not histograms or graphs)
- Identify all problems you ran into with your data (Did you need to ‘massage’ it to use it in Excel/Fathom? Did you alter the scale?)

7. Analysis For each sub-question identified, use the concepts we learned in class to describe the data or find trends/relationships. Only include those that are relevant.

(a) Numerical Statistics (your report must include at least 3)
- Find means, modes, and medians
- Find the standard deviation, Q1, Q3, IQR, percentiles
- Use linear regression and find the correlation coefficient, equation of a line of best fit
- Use non-linear regression and find the coefficient of determination, equation of a curve of best fit
- Relate your data to the Normal Distribution, Binomial Distribution or another distribution.
- Use z-scores and z-tables to find some useful information.
- Permutations, Combinations and Probability:
  - Predict the probability of certain events using your model
  - Do something else relating to probability
  - Use a simulation to help you discover a probability
  - Use the binomial theorem
  - Create a probability distribution
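As a rough illustration of several items in list (a), here is a minimal Python sketch using only the standard library. The numbers are made up for illustration and are not real data, and the function names (`summarize`, `linear_fit`, `z_score`) are hypothetical. Note that `statistics.quantiles` follows one of several quartile conventions, so Q1 and Q3 may differ slightly from what Excel or Fathom reports.

```python
import statistics

# Made-up illustrative numbers (NOT real data): hours of study and test marks
hours = [1, 2, 3, 4, 5, 6]
marks = [52, 58, 61, 70, 74, 81]


def summarize(values):
    """Return common one-variable statistics for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


def linear_fit(xs, ys):
    """Return (slope, intercept, r) for the least-squares line of best fit."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5  # correlation coefficient
    return slope, intercept, r


def z_score(value, values):
    """Return how many sample standard deviations `value` lies from the mean."""
    return (value - statistics.mean(values)) / statistics.stdev(values)


if __name__ == "__main__":
    print(summarize(marks))
    slope, intercept, r = linear_fit(hours, marks)
    print(f"marks = {slope:.2f} * hours + {intercept:.2f}, r = {r:.3f}")
```

Whatever tool you use, remember that a strong correlation coefficient describes a relationship; it does not show cause and effect.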

(b) Graphical Representations (you must include at least 3)
- Scatter plots (this should be included in every project as you will be finding many relationships)
- Bar graph / histogram / frequency polygon (histogram + curve) / cumulative frequency polygon (each freq. is a cumulative total) / relative frequency polygon (freq. as a %) / line graph / moving average
- Box and whisker
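Before drawing a cumulative or relative frequency polygon, or a moving-average line, you need the numbers behind it. The following is a minimal Python sketch of how those numbers could be computed; the function names are hypothetical and the sample values in the usage lines are made up.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(values):
    """Return rows of (value, frequency, cumulative frequency, relative frequency in %)."""
    counts = Counter(values)
    keys = sorted(counts)
    freqs = [counts[k] for k in keys]
    cumulative = list(accumulate(freqs))  # running total of the frequencies
    total = len(values)
    return [(k, f, c, 100 * f / total) for k, f, c in zip(keys, freqs, cumulative)]


def moving_average(values, window):
    """Return the simple moving average over each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Made-up illustrative values (NOT real data)
    for row in frequency_table([1, 2, 2, 3, 3, 3]):
        print(row)
    print(moving_average([2, 4, 6, 8], window=2))
```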

(c) Information – descriptive sentences. This part is very important and often overlooked by students. Don’t just provide numbers and statistics. Be sure to interpret them for the reader. What do the numbers tell you? Include this with each concept / graph.

8. Conclusion
- Draw conclusions that directly relate to your thesis.
- Note any biases that you believe occurred in your study.
- Make suggestions for further/follow-up studies or any modifications that you would make to the current study.

9. Bibliography

Web sites cited using APA format.

Research Cycle: Steps to Success Maximizing Evidence and Data Sheets (from HWDSB eBEST)

1) Develop Your Question
- Identify the area of interest (issue, concern, untested hypotheses, unanswered question, etc.).
- Define in clear, specific terms, the actual, specific problems that will be the focus of your investigation.
- Identify the variables you are interested in. What (I.V.) causes what (D.V.)?
- Identify the independent and dependent variables.

2) Gather Existing Evidence/Data on Your Topic
- You may conduct a literature search related to the problem area.
- You may examine any available data/information.

3) Make a Prediction
- Based on the literature that you have read, what is your hypothesis?
- The hypothesis should engage the two variables: a change in the I.V. affects the D.V. in some way. For example: as poverty rises, rates of attending university will fall.
- Be specific and attempt an explanation of why you think the hypothesis is true.

4) Make a Study/Evaluation Plan
Will you seek out existing data and repurpose it or create your own through a survey?

If you are using existing data, you need to
- Evaluate the quality of the source of your data
- Determine how it was collected
- Was it linked to by others? See if other sources support this source. This is done by adding “link:” to the beginning of the URL (e.g., type the following into a Google search window: link:http://www.hwdsb.on.ca)
- How was the data collected? Is the data for the two variables related?
  - If you are studying the hypothesis above you cannot get poverty statistics from Alberta and university attendance rates from Ontario.
  - If you were studying teen sleep and Grade 12 averages, you must use data that was collected from the same students for whom you will gather grade data.

If you do a survey, you need to determine
- Who are the participants?
- What data/information will you collect?
- Are there any ethical issues to consider?
- Where will the data collection take place?
- When do you plan to collect the data? Once? More than once?
- How will you collect the data? What tools/instruments will you use?
- Do you plan to include an intervention?
- What type of analysis will you use to make sense of the findings?

5) Collect Data/Gather the Information
- Put your study/evaluation plan in place!
- Watch for confounds

6) Examine/Explore/Analyze Data
- Identify and describe the key findings (look for trends, graph the data, examine other relationships).
- Graphs should be made of summary data only.
- Graphs should always show the I.V. on the horizontal or x-axis and the D.V. on the vertical or y-axis.
- Graphs should be appropriate to the type of data (bar graphs for comparison of non-continuous information and line graphs for continuous data)
  - Continuous data is data for which values exist at all levels (e.g. time)
  - Non-continuous data is data such as eye colour (either blue or brown or hazel or other categories you create)
- Some important considerations:
  - What were the most important findings? And why are they so important?
  - To whom are the findings most relevant?
  - What are the limitations of the research?
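The point in step 4 about using data collected from the same students can be made concrete. The sketch below is a minimal Python illustration, assuming each dataset is keyed by a hypothetical student ID; all values are made up. Only students who appear in both datasets produce a usable (I.V., D.V.) pair.

```python
def pair_by_id(first, second):
    """Pair two measurements for every ID that appears in both datasets."""
    shared_ids = sorted(set(first) & set(second))
    return [(first[i], second[i]) for i in shared_ids]


# Made-up illustrative values keyed by a hypothetical student ID
sleep_hours = {"s01": 7.5, "s02": 6.0, "s03": 8.0}
grade_average = {"s02": 71, "s03": 84, "s04": 90}

# Students s01 and s04 are dropped because each is missing from one dataset
pairs = pair_by_id(sleep_hours, grade_average)
print(pairs)
```

If very few IDs are shared, the two sources do not really describe the same group and should not be compared.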

Potential Problems: Almost EVERY problem that I have seen on final projects was because of an incomplete or poorly done proposal phase. The following factors have created flawed projects in past Data Management courses:

- Projects were far too large in scope. A research team of 100 working for 25 years would be unable to prove causation in the way that these students wished to do. This happens most often with projects like drunk driving, teenage pregnancy or economic problems. Choose less glamorous and smaller topics that you can find data about. Make sure data is available.
- Projects which attempted to prove causation instead of correlation.
- Projects whose entire body of evidence was based on unreliable sources from the Internet. They made no attempt to figure out where their sources' data came from.
- Projects where random sampling involved giving a survey to everyone in their class. Sample size too small or too homogeneous.
- Projects where the students developed their surveys first and their research questions second. They ended up not asking the correct survey questions and were unable to prove their point.
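Surveying everyone in your own class is a convenience sample, not a random one. As a minimal sketch of the alternative, the Python below draws a simple random sample from a sampling frame. The roster of ID numbers, the sample size of 50, and the function name are assumptions made for illustration only.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct members so that every member is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable for your report
    return rng.sample(list(population), k)


# Hypothetical sampling frame: ID numbers standing in for a full school roster
roster = list(range(1, 1201))
sample = simple_random_sample(roster, 50, seed=2013)
print(sorted(sample))
```

Recording the seed in your Plan section lets a reader reproduce exactly which participants were selected.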

6. Variables

Decide on variables

Developing a Good Research Question

Identify the variables you are going to study. Your preliminary research should allow you to develop an hypothesis to relate two variables. One is called the independent variable and the other is a dependent variable. Your hypothesis should be phrased such that you are studying the effect of the independent variable on the other or dependent variable. You should attempt to find data sets that control any other variables.

E.g. Does the number of hours children watch television per day affect levels of obesity?

Independent variable:

Dependent variable:

Controls:

Independent variable - The variable you think affects something else. You select different levels for this variable and then see how this affects another variable. You have to have an idea that the variables are related: that one causes the other to change, that they are correlated.

Dependent variable - This is the variable you would measure.

Controls: the standard to which you compare your results; some variables need to be controlled (e.g. age/sex).

Relate them using an hypothesis: Does A (I.V.) affect B (D.V.)? Be specific. Define A and B fully.

7. Where can you find data?

First you will need to decide what type of data you will collect for your project:

1. Primary Data is information that you collect on your own. For example, this could be obtained by having students at AHS complete a questionnaire on paper or online (using Fathom, surveymonkey.com, etc.). It could also be an actual experiment/simulation that you conduct using a computer.

2. Secondary Data is information that you are taking from another source. It is important to use reliable sources. When choosing your topic, be sure that you will be able to find good data. Some places that students often obtain data from are as follows:

- Let’s go on a bit of a tour of Statscan http://www.statcan.gc.ca/start-debut-eng.html
  Home login: go to the virtual library, then Virtual tools, then estat
- University of Michigan Library of Statistics http://www.lib.umich.edu/govdocs/stats.html
- Nation Master www.nationmaster.com
- http://cas.illinoisstate.edu/jpda/finding_data/internationaldata.shtml
- http://ontario.compareschoolrankings.org/secondary/SchoolsByRankLocationName.aspx

Use search terms like: data sets; statistics

Wolfram Alpha http://www.wolframalpha.com/

Google Trends (under categories; shows you the number of Google searches) http://www.google.com/trends/

RDC http://www.statcan.gc.ca/rdc-cdr/data-donnee-eng.htm

Open data project http://www.data.gc.ca/default.asp?lang=En&n=F9B7A1E3-1

UN data http://data.un.org/ http://data.un.org/Search.aspx?q=population

Datamob http://datamob.org/datasets

Areas of data

CDC http://www.cdc.gov/nchs/fastats/Default.htm http://www.cdc.gov/nchs/products/hestats.htm

Health http://www.nlm.nih.gov/hsrinfo/datasites.html http://phpartners.org/health_stats.html http://web4.uwindsor.ca/units/leddy/leddy.nsf/HealthStatistics!OpenForm (list of Canadian sites for health info)

World Bank - Economic Data http://data.worldbank.org/ http://data.worldbank.org/data-catalog

Open Data Hamilton, ON http://openhamilton.ca/article/starting-hamiltons-open-data-sets

Freebase -gives info and data http://www.freebase.com/

Data.gov http://www.data.gov/

List of open datasets http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public

Library and Archives Canada http://www.collectionscanada.gc.ca/opendata/900-1000-e.html

Open Government http://open.gc.ca/index-eng.asp

Canadian data http://data.gc.ca/eng

Citing

Examples of Data Citations

Always check your syllabus or author guidelines to see if they contain directions for citing data. Some data distributors will suggest citations that you may use. Most common style guides do not give specific instructions for citing data; however, here is an example from one that does:

Publication Manual of the American Psychological Association (APA), 6th Edition

Pew Hispanic Center. (2004). Changing channels and crisscrossing cultures: A survey of Latinos on the news media [Data file and code book]. Retrieved from http://pewhispanic.org/datasets/

How do I cite data?

When you're writing a research paper, it is necessary to cite your use of sources, typically as footnotes at the bottom of the page or in a bibliography at the end of the paper. It is crucial to provide references for your reader to better understand the context of your research and to give credit for people's work that you've used. As research becomes more data-intensive, it is important to cite your use of datasets in addition to traditional publications such as journal articles, books, and conference proceedings.

Digital datasets come in a wide variety of formats. Some examples include:

- spreadsheets
- interview transcripts
- sensor and instrument readings
- high resolution images
- gene sequences
- software source code
- video recordings

* The emerging best practice is to cite data just as you would cite a research article. *

Most traditional forms of documents are not capable of representing these kinds of data, and so datasets can be published separately in data repositories and other web sites. Whether you produced the data yourself or you're using someone else's data in your research, it is important to maintain a linkage between your paper and its supporting datasets by citing them. Not only does this give credit to the person who created the data, but it enables others to reproduce your research and verify your results. In some cases, sharing a dataset may have more scholarly impact than publishing a book or journal article.

There are many challenges in citing data. In most disciplines, there are no clear instructions on how to cite data. In fact, most of the major style guides (APA, MLA, the Chicago Manual of Style) do not directly address the issue of data citation. Data is not recognized as a format in many citation management tools and tutorials. Some kinds of data are dynamic, such as a weather dataset, and may change every hour or every day, so it's difficult to know what to cite.

Here are some tips for citing data properly:

- Always look for instructions in your syllabus or the author guidelines on how to cite data. You may be able to find examples from previously published papers to imitate.
- The distributor you downloaded the dataset from may suggest a citation. Some examples include ICPSR, OECD, and Dryad.
- If there are no explicit instructions for citing data, there may be instructions for a similar format, such as citation styles for electronic resources, web pages, or tables, that can be used.

Try to capture these important elements in your data citation:

- Who produced the dataset (creator or author)
- The title of the dataset
- The unique identifier of the dataset, preferably a Digital Object Identifier (DOI) or minimally a link to the dataset if it is online
- The date the dataset was published and its version number, if it has one
- The date and time the dataset was accessed
- The distributor of the dataset
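To see how these elements fit together, here is a minimal Python sketch that assembles a citation in the pattern of the APA (6th edition) example given in this handout. The function name and parameter names are hypothetical, and the pattern covers only the creator, year, title, description, and link; elements such as a DOI, version number, or access date would need to be added according to whatever style your teacher requires.

```python
def format_data_citation(creator, year, title, description, url):
    """Assemble a data citation in the pattern of the handout's APA example."""
    return f"{creator}. ({year}). {title} [{description}]. Retrieved from {url}"


# Values taken from the APA example cited in this handout
citation = format_data_citation(
    creator="Pew Hispanic Center",
    year=2004,
    title="Changing channels and crisscrossing cultures: A survey of Latinos on the news media",
    description="Data file and code book",
    url="http://pewhispanic.org/datasets/",
)
print(citation)
```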

Keep in mind that some datasets are dynamic and change over the course of time. Always try to cite the specific version of the dataset that you used. Some distributors provide a checksum to ensure that the dataset hasn't been changed or corrupted since it was published, which may be included in a source note. Other important information for understanding and using the dataset may be included in supplementary files (e.g., codebook, readme.txt) that may be available at the same link in the citation or in the source notes of your paper.
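If a distributor publishes a checksum, you can compare it against the file you downloaded. The sketch below is a minimal Python illustration that assumes the published checksum is a SHA-256 value in hexadecimal; distributors may use a different algorithm, so check their documentation. The function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large datasets do not need to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published_hex):
    """Return True if the file's checksum equals the one the distributor published."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

A mismatch means the file differs from the published version, so it is worth downloading again before you analyze or cite it.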

Responsible Use of Data

Be sure to examine the license associated with the data you're citing, to make sure your use is acceptable. If the dataset is derivative of one or more other datasets, you may need to review their licenses and credit their sources also. If you're including a substantial portion of someone else's data in your paper, you may need to seek their permission. Some data distributors request that you submit your citation to them to help them track the use of their data.

Is the data that you're citing accurate? Is the dataset described and presented in a way that users will recognize and use it appropriately? Does the data contain sensitive information, such as phone numbers or other personal identity information? If your research includes the use of human subjects, you will need to confirm that your data meets the requirements set by your institutional review board (IRB) or other ethical norms.

Data vs. Datum

Remember: "data" is plural. The singular form of data is datum, which means a "data point". It sounds odd, but it is grammatically correct to say "The data show us..." and not "The data shows us..."

Health:
-- Teen Drug Use and Abuse (Jeff Madeiros and Corey Hoekstra, St. Peter H.S., 2008)
-- Obesity and Diabetes: A Growing Epidemic (Jennifer Brierley, St. Pius X H.S., 2006)
-- Teen Pregnancy and Abortion (Nicole Sanger and Brittany Burg, St. Peter H.S., 2006)
-- Canada's Health by Region (presentation, Colin McClenaghan and Jack Wei, K.C.V.I., 2006)
-- Dietary Habits of Canadians (Bronwyn, Hillcrest H.S., 2006)
-- Factors affecting Thyroid Condition (presentation, Shafaq, Nepean H.S., 2004)
-- Does the Marital Status of Parents affect their Kids? (presentation, Cassandra, Brandon, Victoria, Derek, Mother Teresa HS, Ottawa, Dec. 2008)

Education:
-- Video game usage and Absenteeism in Canadian high schools (Bryan Smith, Orangeville D.S.S., 2008)
-- Education, Salary, and Career Paths (Patrick Jackson, North Dundas S.S., 2006)
-- Factors affecting Student Achievement (Lisa Hoople, Sacred Heart C.H.S., 2006)
-- Substance Use and Academic Performance (Charlie Berrigan, Smiths Falls C.I., 2006)

Politics and Economics:
-- Variables affecting Voter Turnout in Canadian Federal Elections (Shaun Banke, Holy Trinity H.S., 2006)
-- Political Opinions of Students (presentation, J. Maier and S. Gordon, St. Peter H.S., 2006)
-- Political Opinions of Students (dataset, J. Maier and S. Gordon, St. Peter H.S., 2006)
-- Analysis of the Canadian Federal Debt using E-STAT (Linda, Pickering H.S., 2004)
-- Factors affecting Income (Jodi Morden and Mike Curridor, Sacred Heart, 2003)

Transportation:
-- Factors that Influence Collisions (Dimitar Hristov, Vesko Avramov, and Ayman Barri, Brookfield H.S., 2006)
-- Why do Young People pay more for Auto Insurance? (T.J., Opeongo H.S., 2005)
-- Teenager Driving Infractions (Erin Knox and Katherine Renner, Sacred Heart C.H.S., 2006)

Environment and Energy:
-- Global Warming (Renpeng Sun, Nepean H.S., 2008)
-- Greenhouse Gas Emissions (Sarah Deslippe, K.C.V.I., Kingston, 2006)
-- Carbon Dioxide Emissions (Mathew Hall, Dr. Williams S.S., 2006)
-- The Future of Electricity (Jonathan Thomas, Opeongo H.S., 2006)
-- Trends in Energy Consumption in Canada (Chenyu Bing, Sir Robert Borden H.S., 2006)

Other:
-- Unemployment & Divorce in Canada (Rachel Wang, Glebe C.I., 2008)
-- The Effect of Tourism on GDP in Canada (Feifei C., L'Amoreaux C.I., 2007)
-- Travellers in Canada (Gosia, Sacred Heart C.H.S., 2005)
-- Investigation on the effects of Data Manipulation (Matt, Will, and James, Carleton Place H.S., 2005)
-- Factors affecting Internet Use (Bryan W., Earl of March S.S., 2004)