Assignment #1 - Graphical and Numerical Summaries of Data

Assignment #2 – Descriptive Statistics

1. Maine Mercury Study

Data File: Maine Mercury Study

Background Information: What is mercury? Mercury is a “heavy metal” that occurs naturally in the environment in several forms (elemental mercury, organic and inorganic mercury). Mercury occurs naturally in the earth’s crust and oceans and is released into the earth’s atmosphere. In addition, human activity results in releases of mercury through the burning of fossil fuels and incineration of household and industrial waste.

Mercury enters fish through two known mechanisms. When water passes over fish gills, fish absorb mercury directly from the water. In addition, fish take in mercury by eating other organisms. Mercury tends to bioaccumulate at the top levels of the food chain. Bioaccumulation occurs when microorganisms convert inorganic mercury into toxic organic compounds which become concentrated in fatty tissues as they move up the food chain (EPA, 1994).

Mercury is a toxin that acts upon the human nervous system. Consumption of mercury-laden fish can lead to a variety of neurological and physiological disorders in humans. Because mercury acts upon the nervous system, developing children and fetuses are especially sensitive to mercury’s effects (Bahnick et al. 1994).

In 1993, the U.S. EPA and the state of Maine implemented the “Maine Fish Tissue Contamination Project”. The goals of the project were to determine the distribution of selected contaminants in fish from Maine lakes, to determine risk to human and wildlife consumers of fish from Maine lakes, and to identify factors that affect the distribution of contaminants in fish tissue. To select the sample of lakes, the research team identified 1073 lakes in Maine that had previously been surveyed, found to have significant fisheries, and were reasonably accessible. The identified lakes are a subset of the total number of lakes in Maine, 2314. From the 1073 lakes, a simple random sample of 150 lakes was selected for study. Out of the original 150 lakes selected, samples were collected from only 125 of these lakes during the summers of 1993 and 1994. Non-sampled lakes were either not reasonably accessible or did not have desired fish species available.

A group of “target species” was determined based on the species’ desirability as game fish, and other factors. The data included here involve only the predator species from the original target species list and thus only 120 lakes out of the original 150 lakes are included. To collect the fish specimens, field crews obtained up to 5 fish from the hierarchical order of preferred predator species group. Field protocols targeted fish that were of comparable age, legal length limit, “desirability” as game species, and likelihood of capture. Fish were collected by angling, gill nets, trap nets, dip nets or beach seines. Care was taken to keep fish clean and free of contamination. Upon capture, fish were immediately killed if alive. Fish were rinsed in lake water and wrapped in aluminum foil, labeled with an identification number, and kept on ice in a cooler. Upon returning from the field, fish were immediately frozen for later analyses. In the laboratory, the fish fillet (muscle) of each fish was extracted. The fillets from each lake were ground up, combined and homogenized, and then the tissue was subsampled to analyze mercury levels.

Another goal of the study was to examine external stressors and other factors potentially responsible for elevated levels of mercury in fish. The information would then be used to gain insights on conditions and sources that could be used in managing problems detected. The factors were divided into fish factors, lake factors, and geographic stressors (watersheds and airsheds). Only a subset of the factors are used here. Lake characteristics include lake size, depth, elevation, lake type, lake stratification, watershed drainage area, runoff factors, lake flushing rate, and impoundment class (see 2 – Data/Variable Types handout for definitions).

a) Briefly comment on any limitations and deficiencies you see with this study. (2 pts.)

b) The U.S. Food and Drug Administration has determined that samples with more than 1.0 ppm mercury are above the safety limit. Most states consider 0.5 ppm mercury levels (Maine uses .43 ppm) to be high enough to consider taking action (e.g., issuing a health advisory, considering methods of clean-up, etc.).

As indicated by the data collected here, are mercury levels high enough to be of concern in Maine? To answer this question look at both the percentage of lakes where the measured mercury level in sampled fish was above 1 ppm and the percentage of lakes where the Hg level was above .43 ppm. Summarize your findings.
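For readers who want to check the counts outside JMP, the two percentages can be sketched in Python as follows. This is only an illustration under stated assumptions: the readings are made-up values, not the Maine data, and the names merc_ppm and percent_above are hypothetical.

```python
# Hypothetical mercury readings (ppm), one per lake. NOT the Maine data.
merc_ppm = [0.12, 0.38, 0.45, 0.51, 0.77, 1.02, 0.29, 1.30, 0.43, 0.60]


def percent_above(values, threshold):
    """Return the percentage of values strictly greater than threshold."""
    count = sum(1 for v in values if v > threshold)
    return 100.0 * count / len(values)


pct_above_fda = percent_above(merc_ppm, 1.0)     # FDA safety limit
pct_above_maine = percent_above(merc_ppm, 0.43)  # Maine action level

print(f"Above 1.0 ppm:  {pct_above_fda:.1f}%")
print(f"Above 0.43 ppm: {pct_above_maine:.1f}%")
```

The comparison is strict, so a lake measured at exactly 0.43 ppm is not counted as above the Maine level; whether boundary values should count is a judgment the assignment leaves open.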

To do this in JMP, select Tables > Sort and choose Merc (ppm) as the variable to sort by; this produces a new spreadsheet with the waterways sorted by mercury level. Then highlight the cases where mercury is above 1 ppm. The number of selected cases out of the 120 total is displayed in the lower left corner of the spreadsheet. (4 pts.)

c) The industries that benefit from dams and dam construction are concerned that environmentalists will claim that high mercury levels in fish are related to the presence of a dam (or man-made flowage) in the lake’s drainage. Do the data support this claim? Justify your answer by providing an appropriate plot(s) and summary statistics. (4 pts.)

d) Previous studies (Nilsson and Hakanson, 1992) suggest that mercury levels vary by lake type with oligotrophic lakes experiencing the highest mercury levels and eutrophic lakes experiencing the lowest mercury levels. Does the Maine study support this? Justify your answer by providing an appropriate plot and summary statistics. (4 pts.)

2. Comparison of Cell Characteristics of Benign and Malignant Breast Tumors

Data File: BreastDiag.JMP

Key Words: Comparative Boxplots, ANOVA graphics, Summary Statistics

These data come from a study of breast tumors conducted at the University of Wisconsin-Madison. The goal was to determine if malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells. The sample of tumor cells was examined under an electron microscope and a variety of cell shape characteristics were measured.

Your goal is to use summary statistics and graphical displays to determine which characteristics are most useful for discriminating between benign and malignant tumors.

The variables in the data file are:

• ID - patient identification number (not used)
• Diagnosis determined by biopsy - B = benign or M = malignant
• Radius = radius (mean of distances from center to points on the perimeter)
• Texture = texture (standard deviation of gray-scale values)
• Smoothness = smoothness (local variation in radius lengths)
• Compactness = compactness (perimeter^2 / area - 1.0)
• Concavity = concavity (severity of concave portions of the contour)
• Concavepts = concave points (number of concave portions of the contour)
• Symmetry = symmetry (measure of symmetry of the cell nucleus)
• FracDim = fractal dimension ("coastline approximation" - 1)

Medical literature citations: W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792-796, 1995.

See also:
http://www.cs.wisc.edu/~olvi/uwmp/mpml.html
http://www.cs.wisc.edu/~olvi/uwmp/cancer.html

a) Use JMP to obtain histograms, outlier boxplots, and numerical summary statistics (mean, median, quantiles, standard deviation, etc.) for the cell radii for both malignant and benign breast tumor cells. Make sure the histograms have uniform scaling.

To do this in JMP select Analyze > Distribution and place Radius in the Y, Columns box and put Diagnosis in the By box. Be sure to select both Stack and Uniform Scaling from the Distributions pull-down menu as shown below.

Use the results to make a comparison between malignant and benign tumor cells in terms of cell radius. Your comparison should address each of the following aspects:

• measures of location (mean, median, quantiles) (2 pts.)
• measures of variability (range, interquartile range, standard deviation, CV) (2 pts.)
• distributional shape (histogram and boxplot) (1 pt.)

b) Repeat part (a) for Symmetry of the tumor cells. (5 pts.)

c) Use the Fit Y by X option in the Analyze menu to construct comparative boxplots for the cell radii in the two groups. To do this place the variable Diagnosis in the X box and Radius in the Y box. Then select the appropriate display options. Briefly summarize what you see from this plot. What are the advantages/disadvantages, if any, in using comparative boxplots versus the stacked histograms you examined in part (a)? (3 pts.)
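The location and variability measures named above can also be computed by hand. The following Python sketch shows one way to do so for each diagnosis group. It assumes made-up radius values (not taken from BreastDiag.JMP), and the names radius_by_diagnosis and summarize are hypothetical.

```python
import statistics

# Hypothetical radius values by diagnosis. NOT taken from BreastDiag.JMP.
radius_by_diagnosis = {
    "B": [10.0, 11.0, 12.0, 13.0, 14.0],
    "M": [14.0, 16.0, 18.0, 20.0, 22.0],
}


def summarize(values):
    """Return location and variability summaries for one group."""
    # Quartiles using Python's default ("exclusive") interpolation rule.
    q1, _, q3 = statistics.quantiles(values, n=4)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sd": sd,
        "cv_percent": 100.0 * sd / mean,  # coefficient of variation
    }


for group, values in radius_by_diagnosis.items():
    print(group, summarize(values))
```

Statistical packages use several different interpolation rules for quantiles, so quartiles and the interquartile range from this sketch may differ slightly from the values JMP reports. The CV expresses the standard deviation as a percentage of the mean, which makes spread comparable between groups whose means differ.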

3. Walleyes in Selected Major Waterways in Minnesota

Data File: Walleyes Major Waterways.JMP

These data come from a larger database of walleye data collected by the Minnesota Department of Natural Resources.

The variables in this data file are:

• WATERWAY – name of waterway
• LOCATION – location of waterway or sampling location
• LGTHIN – length of the walleye in inches
• WTLB – weight of walleye in pounds
• HGPPM – mercury concentration found in fillet in parts per million (ppm)
• Year – Year fish was sampled (1993 – 1997)
• log(Hg) – log base 10 of the mercury concentration

a) Construct histograms of the length (in.), the weight (lbs.), and the mercury concentration (HGPPM) found in the walleyes in this sample. How would you characterize the distributional shape of each variable? Give an appropriate typical value for each measured characteristic of the walleyes from these waterways. Comment on the amount of variation found in each of these characteristics relative to one another using appropriate summary statistics. (6 pts.)

b) Using comparative boxplots and group summary statistics (mean, median, standard deviation, etc.), comment on differences you see in these three characteristics (length, weight, Hg level) across waterways. (3 pts. for each characteristic, 9 pts. total)

c) Construct a histogram of the log base 10 of the mercury concentration found in these walleyes, log10(Hg). How would you characterize the distributional shape of the log mercury concentrations? (1 pt.)

d) Convert the median of the mercury concentration in the log base 10 scale back to the original scale. How does it compare to the median of the mercury concentration found in part (a)? (2 pts.)

e) Convert the mean of the mercury concentration in the log base 10 scale back to the original scale. How does it compare to the mean of the mercury concentration found in part (a)? (2 pts.)

f) One might conjecture that the older, and hence larger, a fish is the higher the concentration of mercury we would find in its tissues. To investigate this, construct a scatterplot of HGPPM (Y) vs. LGTHIN (X). What do you conclude? (2 pts.)

g) Now construct a scatterplot of log(Hg) (Y) vs. LGTHIN (X). What advantage, if any, do you see in using mercury concentration in the log scale vs. the original scale? Explain. (2 pts.)

h) Examine separate scatterplots of log(Hg) (Y) vs. LGTHIN (X) for each waterway. To do this in JMP, select Analyze > Fit Y by X, put log(Hg) in the Y, Response box and LGTHIN in the X, Factor box, AND put WATERWAY in the By box (see below). This will construct a separate scatterplot of log10(Hg) vs. LGTHIN for each WATERWAY. Does the strength of the relationship between fish length and Hg concentration depend upon the sampling location? Explain. (3 pts.)

Your Fit Y by X dialog box for part (h) should look like this.

i) Assuming the log10(Hg) levels found in walleyes from the Boulder Reservoir are approximately normally distributed, give a range of values in the original scale that is very likely, say a 95% chance, to contain the Hg level found in a randomly selected walleye from this waterway. You will need the summary statistics for log10(Hg) levels obtained from walleyes sampled at Boulder Reservoir found in your analysis from part (b). (2 pts.)
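The back-transformations in parts (d), (e), and (i) all raise 10 to a quantity computed in the log scale. The following Python sketch illustrates the mechanics. It assumes made-up concentrations (not the walleye data), hypothetical variable names, and the common "mean plus or minus 2 standard deviations" approximation for a 95% range; using 1.96 in place of 2 would be an equally reasonable choice.

```python
import math
import statistics

# Hypothetical mercury concentrations (ppm). NOT the walleye data.
hg_ppm = [0.10, 0.20, 0.40, 0.80, 1.60]
log_hg = [math.log10(x) for x in hg_ppm]

# Part (d): log10 preserves order, so back-transforming the median of
# the logs recovers the original median (exactly for odd sample sizes).
median_back = 10 ** statistics.median(log_hg)

# Part (e): back-transforming the mean of the logs gives the geometric
# mean, which is never larger than the arithmetic mean.
mean_back = 10 ** statistics.mean(log_hg)

# Part (i): approximate 95% range assuming the logs are roughly normal.
# Compute mean +/- 2 SD in the log scale, then convert each end to ppm.
log_mean = statistics.mean(log_hg)
log_sd = statistics.stdev(log_hg)
low_ppm = 10 ** (log_mean - 2 * log_sd)
high_ppm = 10 ** (log_mean + 2 * log_sd)

print(f"median: {statistics.median(hg_ppm):.3f}, back-transformed: {median_back:.3f}")
print(f"mean:   {statistics.mean(hg_ppm):.3f}, back-transformed: {mean_back:.3f}")
print(f"approximate 95% range: {low_ppm:.3f} to {high_ppm:.3f} ppm")
```

Note that the resulting range is not symmetric about the back-transformed mean in the original scale, because equal distances in the log scale correspond to equal ratios, not equal differences, in ppm.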

4. A Closer Look at Walleyes in the Mississippi River and Boulder Reservoir

Data File: Walleyes Miss and Boulder.JMP

Construct separate cumulative distribution function (CDF) plots for length (in.) and mercury concentration (ppm) for fish sampled from Mississippi River and Boulder Reservoir. To do this in JMP select Fit Y by X placing both LGTHIN and HGPPM in the Y box and Waterway in the X box. Then select the CDF Plots option from the Oneway Analysis... pull-down menu. Use these plots to answer the following questions, justifying your answers: (1 pt. each)

a) In which waterway, Mississippi or Boulder Reservoir, are you more likely to catch a walleye over 15 inches in length?

b) In which waterway are you more likely to catch a walleye under 20 inches in length?

c) In which waterway are you more likely to catch a walleye having a mercury concentration exceeding .75 ppm in its fillet?

d) In which waterway are you more likely to catch a walleye having a mercury contamination level below .50 ppm?
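A CDF plot shows, for each value x, the proportion of observations at or below x. The following Python sketch computes that proportion directly, which may help when reading "over" and "under" probabilities off the plots. It assumes made-up lengths (not the data in the JMP file), and the names lengths and ecdf are hypothetical.

```python
# Hypothetical walleye lengths (inches). NOT the data in the JMP file.
lengths = {
    "Mississippi River": [12.5, 14.0, 15.5, 17.0, 19.0, 21.5, 23.0, 16.0],
    "Boulder Reservoir": [11.0, 13.0, 14.5, 15.0, 16.5, 18.0, 20.5, 12.0],
}


def ecdf(values, x):
    """Empirical CDF: proportion of observations less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)


for waterway, values in lengths.items():
    p_over_15 = 1 - ecdf(values, 15.0)  # proportion longer than 15 in.
    p_upto_20 = ecdf(values, 20.0)      # proportion at or below 20 in.
    print(f"{waterway}: over 15 in. = {p_over_15:.3f}, "
          f"up to 20 in. = {p_upto_20:.3f}")
```

For an "over x" question the relevant quantity is one minus the CDF height at x, so the waterway whose curve is lower at x is the one where exceeding x is more likely. For an "under x" question the higher curve at x wins.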