<<

This article was downloaded by: [92.118.247.142] On: 30 January 2014, At: 07:48 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

The American Statistician Publication details, including instructions for authors and subscription information: http://amstat.tandfonline.com/loi/utas20 The and Taylor’s Expansion in Statistics Changyong Feng a , Hongyue Wang a , Yu Han a , Yinglin Xia a & Xin M. Tu a a Department of Biostatistics and Computational Biology , University of Rochester , Rochester , NY , 14642 , USA Accepted author version posted online: 02 Oct 2013.Published online: 18 Nov 2013.

To cite this article: Changyong Feng , Hongyue Wang , Yu Han , Yinglin Xia & Xin M. Tu (2013) The Mean Value Theorem and Taylor’s Expansion in Statistics, The American Statistician, 67:4, 245-248, DOI: 10.1080/00031305.2013.844203 To link to this article: http://dx.doi.org/10.1080/00031305.2013.844203

PLEASE SCROLL DOWN FOR ARTICLE

Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http:// amstat.tandfonline.com/page/terms-and-conditions General

The Mean Value Theorem and Taylor’s Expansion in Statistics

Changyong FENG, Hongyue WANG,YuHAN, Yinglin XIA, and Xin M. TU

(MVT), which are useful tools in mathematics, have been widely The mean value theorem and Taylor’s expansion are powerful used in statistics to study the asymptotic properties of estima- tools in statistics that are used to derive estimators from nonlin- tors obtained from nonlinear estimating equations. In ear estimating equations and to study the asymptotic properties we know that the mean value theorem exists for multivariate of the resulting estimators. However, the mean value theorem real-valued differentiable functions. However, a similar mean for a vector-valued differentiable does not exist. Our value theorem does not exist for vector-valued differentiable survey shows that this nonexistent theorem has been used for a functions. Our survey shows that this fact is not well appreci- long time in statistical literature to derive the asymptotic prop- ated and many frequently cited papers and books in statistics erties of estimators and is still being used. We review several have used this nonexistent theorem in their respective contexts. frequently cited papers and monographs that have misused this In this article, we show how the nonexistent mean value the- “theorem” and discuss the flaws in these applications. We also orem for vector-valued has been used offer methods to fix such errors. in some highly cited journal papers and monographs. We also discuss the reason for such flaws and methods to fix them. KEY WORDS: Asymptotic normality; Consistent estimator; The article is organized as follows. In Section 2 we briefly Estimating equation. summarize the Taylor’s expansion and MVT, especially for the case of vector-valued multivariate differentiable functions. In Section 3 we discuss some published works that have misused the MVT and Taylor’s expansion. In Section 4 we show how to 1. INTRODUCTION fix the flaws, followed by the conclusion in Section 5. Many estimators in statistics are obtained through estimat- ing equations. For example, the maximum likelihood estimator 2. MEAN VALUE THEOREM AND TAYLOR’S (MLE) is the solution of the equation (Cramer,´ 1946); the EXPANSION OF VECTOR-VALUED FUNCTIONS least square estimator in linear regression is the solution of nor- mal equation (Seber and Lee 2003); the generalized estimating We now summarize some well-known results on the mean equations (GEE) estimator (Liang and Zeger 1986), is the solu- value theorem and Taylor’s expansion in mathematical analysis. tion of a set of equations. Under mild regularity conditions, the All these results can be obtained from any standard book on estimators from such estimating equations are consistent and mathematical analysis such as Rudin (1976).

Downloaded by [92.118.247.142] at 07:48 30 January 2014 f O → R asymptotically normally distributed. Their asymptotic variance Suppose that O is an open interval and : is differ- a,b ⊂ O θ ∈ a,b can be easily obtained from some robust procedures such as the entiable. Then for any [ ] , there exists ( ) such “sandwich variance estimator” (Huber 1967; White 1982; Liang that and Zeger 1986 ). f (b) − f (a) = f (θ)(b − a). (1) Estimating equations are generally nonlinear. Highly efficient numerical methods are available to solve those equations. How- This is the well-known mean value theorem in calculus. This ever, the proof of consistency and asymptotic normality is not so result can be easily generalized to multivariate real-valued dif- straightforward. Taylor’s expansion and the mean value theorem ferentiable functions. Suppose G is an open convex subset of Rp (p>1) and f : G → R is a differentiable function. Then C. Feng is Associate Professor (E-mail: [email protected]), H. Wang is Re- for any a, b ∈ G, there exists θ between a and b such that search Assistant Professor (E-mail: [email protected]), Y. Han is Ph.D. candidate (E-mail: yu [email protected]), Y. Xia is Professor (E-mail: f (b) − f (a) =∇f (θ)(b − a), (2) yinglin [email protected]) and X. M. Tu is Research Assistant Professor (E-mail: xin [email protected]), Department of Biostatistics and Compu- where ∇f (θ) (a row vector) is the gradient of f at θ. tational Biology, University of Rochester, Rochester, NY 14642, USA. This However, the MVT for vector-valued differentiable functions research was supported by a research grant from the Clinical and Translational does not exist. To see this, consider a vector-valued function Sciences Institute at the University of Rochester Medical Center. The authors p thank the editor, associate editor, and two referees for their very insightful f . For an open convex subset G ⊂ R (p ≥ 1), the function q comments which improved this article. f : G → R (q>1) is said to be differentiable at x ∈ G if there

© 2013 American Statistical Association DOI: 10.1080/00031305.2013.844203 The American Statistician, November 2013, Vol. 67, No. 4 245 p q exists a linear Df (x):R → R such that from (7) we can only have for vectors θ1,...,θq , ⎡ ⎤ f (x + h) − f (x) − Df (x)h ∇f1(θ1) lim = 0. (3) ⎢ ⎥ x+h∈G, h→0 h f − f = ⎢ . ⎥ − , (b) (a) ⎣ . ⎦ (b a) The function f is differentiable in G if it is differentiable at each ∇fq (θq ) point of G.From(3) which is totally different. f (x + h) = f (x) + Df (x)h + o (h) . (4) Sometimes the MVT and Taylor’s expansion are used inter- changeably in the statistical literature. See Example 1 in the next If f is differentiable in G, for any a ∈ G, the first-order Taylor’s section, in which the Taylor’s expansion is actually the NEMVT. expansion of f around a is

f (x) = f (a) + Df (a)(x − a) + o (x − a) . (5) 3. SOME MISUSES OF THE MEAN VALUE THEOREM IN STATISTICS Although (5) is quite similar to (2) and the remainder term in (5) has higher order than the linear part, it does not mean that for Our survey of the literature shows that the NEMVT (6) has any a, b ∈ G there exists θ on the line segment between a and been used quite extensively for decades in the statistical litera- b such that ture. In this section we focus on five highly cited publications (two journal article and three published monographs) to illus- f − f = Df θ − . (b) (a) ( )(b a) (6) trate the extent of the problem. To make our argument precise, we try to use the same notation as in the original works and We call (6)thenonexistent mean value theorem (NEMVT) for point out the exact formula where mistakes occur. Please refer vector-valued functions. to the original works for more details. Here is an example to show why (6) is not true in general. Consider the function f : R2 → R2 defined by Example 1. Wei, Lin, and Weissfeld (1989). x + sin y This is a highly cited article in survival analysis, especially f (x,y) = . x + cos y in the literature of multivariate survival data and recurrent event failure time data. In this article, the marginal hazard functions 2 This function is continuously differentiable in R with of failure times are assumed to be of the form  1 cos y λk(t) = λk (t)exp(β Z(t)), Df (x,y) = . 0 k 1 − sin y where βk is a p × 1 vector. The marginal working independence Note that f (0, 0) = f (0, 2π) = (0, 1) . Assume that there ex- partial likelihood was used to estimate the parameter βk.The ists (x∗,y∗) on the line segment between (0, 0) and (0, 2π) such score function Uk(βk, ∞)isalsoap × 1 vector. To study the that (6) holds, then we must have asymptotic properties of βˆk, Wei, Lin, and Weissfeld (1989) stated “By the Taylor expansion of Uk(βk, ∞) around βk, 0 1 cos y∗ 0 0 f (0, 2π) − f (0, 0) = = − . we have” 0 1 − sin y∗ 2π 0 −1/2 ∗ 1/2 n Uk(βk, ∞) = Aˆk(βk )n (βˆk − βk), (8) y∗ = y∗ = This means that cos sin 0, which is impossible. ∗ where “βk is on the line segment between βˆk and βk”. See We give an argument to show why (6) is generally wrong. ∗ formula (A.1) and the expression of Aˆk(β ) in Wei, Lin, and Downloaded by [92.118.247.142] at 07:48 30 January 2014 Our argument is not novel to mathematicians, but may be new Aˆ β∗ −DU β∗, ∞ /n to some statisticians. Weissfeld (1989). In fact, k( ) is exactly k( k ) . p q It is obvious that Equation (8) [formula (A.1) in Wei, Lin, Let f = (f1,...,fq ) : R → R ,p≥ 1,q >1bediffer- p and Weissfeld (1989)] is not obtained from the entiable. For k = 1,...,q, fk : R → R are multivariate real- valued differentiable functions. According to (2), for any a, b ∈ expansion but from the application of the NEMVT on the vector- p valued function Uk(βk, ∞). R , there exists θk between a and b such that Example 2. Gross and Huber (1987). fk(b) − fk(a) =∇fk(θk)(b − a),k= 1,...,q. (7) Gross and Huber (1987) studied the asymptotic properties θ Note that k may be different for different k. This means in of semiparametric estimators of regression parameters in the θ general we cannot find a universal between a and b such that hazard functions of matched-pair survival data. In their setup, Df x x ∈ Rp (7) holds for all k, even when ( ) is invertible for all . the regression parameter β is a p × 1 vector. The partial like- θk All examples discussed in Section 3 simply assume that is lihood score function DLn(β) (formula 4 of Gross and Huber the same for all k. Although 1987) is a vector-valued function with p components. To study ⎡ ⎤ the asymptotic normality of the maximum partial likelihood ∇f1(θ) ⎢ ⎥ estimator β˜n, they used the following expansion of the score Df θ = ⎢ . ⎥,θ∈ Rp, DL β ( ) ⎣ . ⎦ function n around the true value 0: ∇f θ 2 ∗ q ( ) DLn(β˜n) = DLn(β0) + β˜n − β0 D Ln(βn ), (9)

246 General D2 ∗ where “ denotes the matrix of second partial with “where θn is an intermediate value between θˆn and θ0” (Tsiatis ∗ respect to β, and βn lies between β0 and β˜n” [see lines 3–6 on 2006, p. 30). p31 of Gross and Huber (1987)]. To study the asymptotic normality of MLE βˆn, they expanded F ∗ 4. CAN WE FIX THE FLAWS? the score function DLn around β in the following way −1/2 F −1/2 F ∗ In Section 3, we showed some misuses of the MVT in pub- n DLn (βˆn) = 0 = n DLn (β ) 1/2 ∗ −1 2 F ∗∗ lished papers and monographs, although the conclusions usu- + n (βˆn − β )n D L (β ), (10) n n ally remain unchanged in large samples. Here we use the idea of ∗∗ ∗ “with βn ∈ (β , βˆn)” (see Gross and Huber 1987, sec. 3.2). Andersen et al. (1993) to illustrate how to fix the mistake. Con- It is again clear that these two expansions of the vector-valued sider the problem mentioned earlier in Kalbfleisch and Prentice  function used by Gross and Huber (1987) are the application of (2002). Assume that U(β) = (U1(β),...,Up(β)) , U(βˆ) = 0, formula (6). and βˆ is a consistent estimator of β.FromMVT(2)wehave that for k = 1,...,p, Example 3. Kalbfleisch and Prentice (2002). √ −1/2 −1/2 −1 (k∗) 0 = n Uk(βˆ) = n Uk(β) + n ∇Uk β · n(βˆ − β), This book is a revised and expanded version of Kalbfleisch and Prentice (1980). Since the publication of the first edition, where β(k∗) is between βˆ and β. Suppose we can prove that the book has become an “authority on censoring and likelihood −n−1∇U β(k∗) →   =  ,...,  k( ) k, and ( 1 p) is nonsingu- and the hazard function approach to models for several types of lar, then we can prove failure” (Andersen et al. 1993,p.7). √ −1 −1/2 In the proof of the asymptotic normality of the Cox partial n(βˆ − β) =  · n Uk(β) + op(1). βˆ β likelihood estimator of parameter vector (formula 5.59 in Another way to correct the mistake is to use the Kalbfleisch and Prentice 2002, p. 176), the following formula form of the mean value theorem for vector-valued functions, −1/2 −1/2 f Rp → Rq Df 0 = n U(β,τˆ ) = n U(β0,τ) which states that if : is differentiable and is /  − ∗ {  − 

Downloaded by [92.118.247.142] at 07:48 30 January 2014 f + − f = Df · + o   , Harrington 1991, p. 299). Here I(β,τ) =−∂U(β,τ)/∂β . (x0 t) (x0) (x0) t ( t ) (15) Example 5. Tsiatis (2006). which is exactly Equation (4), equivalent to the definition of differentiability. Tsiatis (2006) is a very nice introduction to semiparametric Ferguson (1996, pp. 45, 122, and 138) used (14) to prove the methods in statistics and is more accessible than some classical multivariate δ-method, the asymptotic normality of the MLE, books in semiparametric statistical methods such as Bickel et al. and the equivalence of the one-step estimator to the MLE. Shao (1993), which may be too theoretical for graduate students in (2003, p. 191) also used (14) to prove the asymptotic normality statistics and applied researchers. of the MLE. Unfortunately, the NEMVT was used numerous times in Tsi- atis (2006) to derive certain estimators. For example, in the dis- cussion of M-estimator, Tsiatis (2006, p. 30) has the following 5. CONCLUSION expansion of estimating equation Our limited survey establishes misuse of the MVT for vector- n valued differentiable functions in the statistical literature for = m Z , θˆ = m Z ,θ 0 ( i n) ( i 0) more than two decades. All the examples discussed in Section 3 i=1 i=1 have been cited numerous times in many statistics/biostatistics p×p n ∂m Z ,θ∗ publications. Some books discussed in Section 3 have been + ( i n ) θˆ − θ , ∂θ ( n 0) (13) widely used as textbooks in graduate education or as reference i=1 books for researchers. Our article reminds statisticians that such

The American Statistician, November 2013, Vol. 67, No. 4 247 an “MVT” does not exist and appropriate results should be used Huber, P. J. (1967), “The Behavior of Maximum Likelihood Estimation Un- to ensure the validity of estimators and asymptotic properties der Nonstandard Conditions,” in Proceedings of the Fifth Berkeley Sym- derived from estimating equations. As one reviewer pointed out posium on Mathematical Statistics and Probability, vol. 1, pp. 221– 233. [245] “Our statistical literature should not be perpetuating this error.” Lang, S. (1993), Real and Functional Analysis, New York: Springer. [247] [Received March 2013. Revised September 2013.] Liang, K. Y., and Zeger, S. L. (1986), “Longitudinal Data Analysis Using Generalized Linear Model,” Biometrika, 73, 13–22. [245] Kalbfleisch, J. D., and Prentice, R. L. (1980), The Statistical Analysis of Failure REFERENCES Time Data, New York: Wiley. [247]

Andersen, P. K., Borgan, Ø., Gill, R. D., and Keiding, N. (1993), Statistical Kalbfleisch, J. D., and Prentice, R. L. (2002), The Statistical Analysis of Failure Models Based on Counting Processes, New York: Springer. [247] Time Data (2nd ed.), Hoboken, NJ: Wiley. [247] Bickel, P. J., Klaassen, C. A., Ritov, Y., and Wellner, J. A. (1993), Efficient Rudin, W. (1976), Principles of Mathematical Analysis (3rd ed.), New York: and Adaptive Inference in Semiparametric Models, Baltimore, MD: Johns McGraw-Hill. [245] Hopkins University Press. [247] Seber, G. A. F., and Lee, A. J. (2003), Linear Regression Analysis (2nd ed.), Cramer,´ H. (1946), Mathematical Methods of Statistics, Princeton, NJ: Princeton Hoboken, NJ: Wiley. [245] University Press. [245] Shao, J. (2003), Mathematical Statistics (2nd ed.), New York: Springer. [247] Ferguson, T. S. (1996), A Course in Large Sample Theory, New York: Chapman Tsiatis, A. A. (2006), Semiparametric Theory and Missing Data,NewYork: & Hall. [247] Springer. [247] Fleming, T. R., and Harrington, D. P. (1991), Counting Processes and Survival Wei, L. J., Lin, D. Y., and Weissfeld, L. (1989), “Regression Analysis of Multi- Analysis, Hoboken, NJ: Wiley. [247] variate Incomplete Failure Time Data by Modeling Marginal Distributions,” Gross, S. T., and Huber, C. (1987), “Matched Pair Experiments: Cox and Maxi- Journal of the American Statistical Association, 84, 1065–1073. [246] mum Likelihood Estimation,” Scandinavian Journal of Statistics, 14, 27–41. White, H. (1982), “Maximum Likelihood Estimation of Misspecified Models,” [246,247] Econometrica, 50, 1–25. [245] Downloaded by [92.118.247.142] at 07:48 30 January 2014

248 General