
University of Colorado
Department of Civil, Environmental and Architectural Engineering
Statistical Methods for Water and Environmental Engineers
CVEN-5454, Spring 2007
Finals (Take Home)
Date: 05/05/2007    Due: 05/06/2007, 8 AM
80 points

Please write the steps clearly so that points can be awarded even when the numerical answers are incorrect. If you use R (or other software), include the commands/code used in addition to the steps. ABSOLUTELY no consulting with each other. Knowing that you are all mature graduate students, I will trust your conscience and the honor system.

Functions of Random Variables

1. (a) A sample of 12, 11.2, 13.5, 12.3, 13.8, 11.9 comes from a population with the density function
       f(x; theta) = theta / x^(theta+1)
   (i) Find the maximum likelihood estimate of theta.

   (b) A dike is proposed to be built to protect a coastal area from ocean waves. Assume that the wave height H is related to the wind velocity V by the equation
       H = 0.2*V
   where H is in meters and V is in kilometers per hour (kph). The annual maximum wind velocity is assumed to have a log-normal distribution with a mean of 80 kph and a coefficient of variation (CV) of 15% (CV = standard deviation/mean).
   (ii) Determine the probability distribution of the annual maximum wave height and its parameters.
   (iii) What should be the design height of the dike for a 5% risk? [Hint: You have to use the PDF of H derived in (ii) above.]
   [5, 6, 4]
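As a rough illustration of the maximum likelihood step in 1(a), the log-likelihood can be maximized numerically and compared against the value obtained by setting its derivative to zero. This is only a sketch: the support x >= 1 and the search interval passed to `optimize` are assumptions not stated in the problem.

```r
# Sample from problem 1(a)
x <- c(12, 11.2, 13.5, 12.3, 13.8, 11.9)
n <- length(x)

# Log-likelihood of f(x; theta) = theta / x^(theta + 1)
# (assumption: support x >= 1 and theta > 0)
loglik <- function(theta) n * log(theta) - (theta + 1) * sum(log(x))

# Numerical maximum over an assumed search interval (0.001, 10)
theta_num <- optimize(loglik, interval = c(0.001, 10), maximum = TRUE)$maximum

# Stationary point of the log-likelihood:
# d(loglik)/d(theta) = n / theta - sum(log(x)) = 0
theta_closed <- n / sum(log(x))

print(c(numerical = theta_num, closed_form = theta_closed))
```

The two values should agree to within the tolerance of `optimize`; the written steps of the derivation are still what the problem asks for.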

Interval Estimation

2. The daily dissolved oxygen concentration (DO) at a location downstream from an industrial plant has been recorded for 10 consecutive days.

       Day         1    2    3    4    5    6    7    8    9    10
       DO (mg/l)   1.8  2.0  2.1  1.7  1.2  2.3  2.5  2.9  1.6  2.2

   (a) Determine the 95% confidence interval for the true mean.
   (b) If the engineer is not satisfied with the width of the confidence interval and would like to reduce it by 10% while keeping the 95% confidence level, how many additional daily measurements have to be gathered?
   (c) Determine the 95% confidence interval for the true variance.
   (d) If the 95% confidence interval for the mean DO over another set of 10 consecutive days is (2.5, 3.2) mg/l, can you say that the mean DO is different between these two sets?
   [4, 3, 2, 1]

Hypothesis Testing

3. Contaminant concentrations at two sites for a 20-day period are given in the file http://civil.colorado.edu/~balajir/r-session-files/finals/prob3-data.txt (the first column is the day, the second is the contaminant concentration at location 1, and the third is the concentration at location 2).
   (i) Is the average contaminant concentration the same at the two locations? Use an appropriate parametric test and an appropriate nonparametric test, and compare the results.
   (ii) Which one of the tests would you recommend? Justify your choice.
   (iii) For the data in problem 2, the mean DO is 2.03 mg/l. If the mean DO exceeds 2.03 by as much as 0.42, is the sample size n = 10 adequate to ensure that H0: mu = 2.03 mg/l will be rejected with probability at least 0.8?
   [7, 3, 5]

Short Answers Testing Concepts

4. (i) What are some drawbacks of linear regression?
   (ii) Why is PRESS preferred in model selection?
   (iii) Confidence intervals on the regression estimate from linear regression theory are always symmetric. This might not be realistic. Nonparametric methods, such as bootstrap methods, are attractive alternatives. Outline clearly the steps to obtain a 95% CI of the regression estimate at a point, say xp. Say you have observations (x1, y1), (x2, y2), …, (xn, yn). (If you wish and find it easy, outline the steps as pseudo-R code.)
   (iv) Suppose you have two sets of observations, X: x1, x2, x3, …, xn and Y: y1, y2, y3, …, yn. You wish to test if xbar = ybar (xbar and ybar are the averages of X and Y, respectively) using a bootstrap approach. Outline the steps. (If you wish, you can use R syntax.)
   (v) What is ‘strength of evidence’ in a hypothesis testing context?
   [3, 2, 5, 4, 1]
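The two bootstrap procedures referred to in 4(iii) and 4(iv) could be sketched in R along the following lines. This is one possible reading, not the expected answer: it assumes a pairs (case-resampling) bootstrap with percentile limits for the regression estimate, and a bootstrap test in which both samples are first shifted to a common mean so that the null hypothesis holds. The function names `boot_ci_at` and `boot_mean_test`, the number of resamples B, and the synthetic data are all hypothetical.

```r
# Percentile bootstrap CI for the regression estimate at xp
# (assumption: pairs bootstrap, resampling (x, y) rows together)
boot_ci_at <- function(x, y, xp, B = 2000, level = 0.95) {
  n <- length(x)
  est <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)   # resample row indices
    cf <- lsfit(x[idx], y[idx])$coefficients  # intercept, slope
    est[b] <- cf[1] + cf[2] * xp              # fitted value at xp
  }
  alpha <- 1 - level
  quantile(est, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Bootstrap test of equal means
# (assumption: shift each sample to the pooled mean to impose H0)
boot_mean_test <- function(x, y, B = 2000) {
  d_obs <- mean(x) - mean(y)
  pooled <- mean(c(x, y))
  x0 <- x - mean(x) + pooled
  y0 <- y - mean(y) + pooled
  d_star <- replicate(
    B,
    mean(sample(x0, replace = TRUE)) - mean(sample(y0, replace = TRUE))
  )
  mean(abs(d_star) >= abs(d_obs))             # two-sided p-value
}

# Synthetic demonstration data (not from the exam)
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)
ci <- boot_ci_at(x, y, xp = 5)

a <- rnorm(30, mean = 0)
b <- rnorm(30, mean = 5)
p_val <- boot_mean_test(a, b)

print(ci)
print(p_val)
```

Unlike the interval from linear regression theory, the percentile limits in `ci` need not be symmetric about the fitted value at xp, which is the point raised in 4(iii).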

Linear Regression and R-coding problem

5. Australian mean annual rainfall (in millimeters) is provided at http://www.bom.gov.au/climate/change/rain03.txt
   You can see that log(annual rainfall) is close to a Normal distribution (check). A line plot of the data
       > plot(X, Y, xlab="Year", ylab="log(Annual Rainfall)", type="l")
   tells you that there is a trend in the rainfall over time. Investigating this forms the basis for this problem.

   (a) Fit a linear regression trend, with X = Year and Y = log(annual rainfall):
       Y = beta0 + beta1 * X
   Write a small R script to perform this regression, CLEARLY COMMENTED. (Of course, the commands are all there in the files I pointed to, so you can use them.) Check the beta0 and beta1 you obtain from your script against the default R command ‘lsfit’, i.e.,
       > zz = lsfit(X, Y)

   (b) Write another small R script to perform the ANOVA. In particular, compute (i) the F-test for the overall regression; (ii) R^2 and adjusted R^2; and (iii) the t-tests for the two coefficients. Compare the output from your script to that from ‘ls.print’, i.e.,
       > ls.print(zz)

   (c) You now look at the scatterplot and the linear regression line that you fitted above.
       > plot(X, Y, xlab="Year", ylab="log(Annual Rainfall)")   # This does the scatterplot
       > abline(lsfit(X, Y))   # This command adds the regression line through the scatterplot
   Looking at this, you notice that there are some outliers that clearly influence the fit. To remedy this you decide to perform a ‘weighted least squares’, described on pages 280-285 of the Helsel and Hirsch book (http://pubs.usgs.gov/twri/twri4a3/pdf/chapter10new.pdf). Write an R script to perform the ‘iterative weighted least squares’. The steps are as follows:
   (i) Obtain residuals from the current linear regression line fitted in (a) above.
   (ii) Obtain weights for each data point using the bisquare weight function described on page 283 of the Helsel and Hirsch book. You will obtain a weight for each data point, say w1, w2, w3, …, wN, stored in a vector ‘w’. Create a diagonal matrix from this. You can achieve this with the command
       > Wt = diag(w, nrow=N, ncol=N)
   (iii) Now obtain the linear regression fit using these weights. The equation for beta now includes Wt:
       Beta = (XX^T Wt XX)^(-1) (XX^T Wt Y)
   where XX is the augmented X matrix, which has 1s in the first column and X in the second column. (The command lsfit can take a vector of weights through its ‘wt’ argument:
       > lsfit(X, Y, wt=w)
   You can use this as a check.)
   (iv) Plot the new regression line on the scatterplot.
   (v) Use the residuals from this new regression line and repeat steps (ii) through (iv) about 2-3 times. The regression line will start to stabilize, giving you a ‘robust’ regression. Don’t go more than 3 times.
   (vi) Compare the GCV of this robust regression with that from (a) above.
   [5, 10, 10]
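The iteration in steps (i) through (v) of 5(c) can be sketched as follows. This is a minimal illustration under stated assumptions, not the expected solution. The data are synthetic stand-ins rather than the rainfall series, with X as a simple year index. The bisquare scaling u = e / (c * median|e|) with c = 6 is an assumption, so check the exact form and constant on page 283 of Helsel and Hirsch. The helper name `bisquare` is hypothetical.

```r
# Synthetic stand-in data (NOT the Australian rainfall series)
set.seed(42)
N <- 60
X <- 1:N                                   # stand-in for Year
Y <- 6 + 0.001 * X + rnorm(N, sd = 0.05)   # stand-in for log(rainfall)
Y[c(5, 40)] <- Y[c(5, 40)] + 0.6           # two artificial outliers

XX <- cbind(1, X)                          # augmented matrix: 1s, then X

# Bisquare weights (assumed scaling: c_tune * median absolute residual)
bisquare <- function(e, c_tune = 6) {
  u <- e / (c_tune * median(abs(e)))
  ifelse(abs(u) < 1, (1 - u^2)^2, 0)
}

# Ordinary least squares as the starting fit, as in part (a)
beta_ols <- solve(t(XX) %*% XX, t(XX) %*% Y)
beta <- beta_ols

# Steps (i)-(iii), repeated 3 times as suggested in step (v)
for (iter in 1:3) {
  e <- as.vector(Y - XX %*% beta)          # (i) residuals from current fit
  w <- bisquare(e)                         # (ii) one weight per data point
  Wt <- diag(w, nrow = N, ncol = N)
  # (iii) Beta = (XX^T Wt XX)^(-1) (XX^T Wt Y)
  beta <- solve(t(XX) %*% Wt %*% XX, t(XX) %*% Wt %*% Y)
}

# Check against lsfit using the final weight vector
check <- lsfit(X, Y, wt = w)$coefficients

# (iv) scatterplot with the OLS line (dashed) and the robust line (solid)
plot(X, Y, xlab = "Year index", ylab = "log(Annual Rainfall)")
abline(beta_ols[1], beta_ols[2], lty = 2)
abline(beta[1], beta[2])

print(cbind(ols = as.vector(beta_ols), robust = as.vector(beta), lsfit = check))
```

Solving the weighted normal equations with `solve(A, b)` avoids forming the inverse explicitly. With actual calendar years in X, centering the year before forming XX tends to improve the conditioning of XX^T Wt XX.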