
Large Deviations and Applications

Francis Comets
Professor
Université Paris Diderot, Paris, France

Large deviations is concerned with the study of rare events and of small probabilities. Let X_i, 1 ≤ i ≤ n, be independent identically distributed (i.i.d.) real random variables with expectation m, and let X̄_n = (X_1 + ⋯ + X_n)/n be their empirical mean. The law of large numbers shows that, for any Borel set A ⊂ ℝ not containing m in its closure, P(X̄_n ∈ A) → 0 as n → ∞, but it does not tell us how fast the probability vanishes. Large deviations theory gives us the rate of decay, which is exponential in n. Cramér's theorem states that

    P(X̄_n ∈ A) = exp(−n [inf{I(x); x ∈ A} + o(1)])

for all intervals A. The rate function I can be computed as the Legendre conjugate of the logarithmic moment generating function of X,

    I(x) = sup{λx − ln E exp(λX); λ ∈ ℝ},

and is called the Cramér transform of the common law of the X_i's. The natural assumption is the finiteness of the moment generating function in a neighborhood of the origin, i.e., the property of exponential tails. The function I : ℝ → [0, +∞] is convex with I(m) = 0.

● In the Gaussian case X_i ∼ N(m, σ²), we find I(x) = (x − m)²/(2σ²).
● In the Bernoulli case P(X_i = 1) = p = 1 − P(X_i = 0), we find I(x) = x ln(x/p) + (1 − x) ln((1 − x)/(1 − p)) for x ∈ [0, 1], and I(x) = +∞ otherwise.

To emphasize the importance of rare events, let us mention a consequence, the Erdős–Rényi law: consider an infinite sequence X_i, i ≥ 1, of Bernoulli i.i.d. variables with parameter p, and let R_n denote the length of the longest consecutive run, contained within the first n tosses, in which the fraction of 1s is at least a (a > p). Erdős and Rényi proved that, almost surely as n → ∞,

    R_n / ln n → 1/I(a),

with the function I from the Bernoulli case above. Though it may look paradoxical, large deviations are at the core of this event of full probability. This result is the basis of bioinformatics applications like sequence matching, and of statistical tests for sequence randomness.

The theory does not only apply to independent variables, but allows for many variations, including weakly dependent variables in a general state space, Markov or 7Gaussian processes, large deviations from 7ergodic theorems, non-asymptotic bounds, asymptotic expansions (Edgeworth expansions), etc.

Here is the formal definition. Given a Polish space (i.e., a separable complete metric space) X, let {P_n} be a sequence of Borel probability measures on X, let a_n be a positive sequence tending to infinity, and finally let I : X → [0, +∞] be a lower semicontinuous functional on X. We say that the sequence {P_n} satisfies a large deviation principle with speed a_n and rate I if, for each measurable set E ⊂ X,

    − inf_{x ∈ E°} I(x) ≤ liminf_n a_n^{−1} ln P_n(E) ≤ limsup_n a_n^{−1} ln P_n(E) ≤ − inf_{x ∈ Ē} I(x),

where Ē and E° denote respectively the closure and interior of E. The rate function can be obtained as

    I(x) = − lim_{δ↘0} lim_{n→∞} a_n^{−1} ln P_n(B(x, δ)),

with B(x, δ) the ball of center x and radius δ.

● Sanov's theorem and sampling with replacement: let µ be a probability measure on a set Σ that we assume finite for simplicity, with µ(y) > 0 for all y ∈ Σ. Let Y_i, i ≥ 1, be an i.i.d. sequence with law µ, and N_n the score vector of the n-sample,

    N_n(y) = Σ_{i=1}^n 1_y(Y_i).

By the law of large numbers, N_n/n → µ almost surely. From the 7multinomial distribution, one can check that, for all ν such that nν is a possible score vector for the n-sample,

    (n + 1)^{−|Σ|} e^{−nH(ν|µ)} ≤ P(n^{−1} N_n = ν) ≤ e^{−nH(ν|µ)},

where H(ν|µ) = Σ_{y ∈ Σ} ν(y) ln(ν(y)/µ(y)) is the relative entropy of ν with respect to µ.
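The multinomial sandwich bound above can be checked directly for small n. The sketch below is a minimal illustration in plain Python with exact rational arithmetic; the three-letter alphabet, the law µ, the empirical measure ν, and the sample size are made-up choices, not values from the article.

```python
import math
from fractions import Fraction

n = 30
mu = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]   # sampling law µ on Σ = {0, 1, 2}
counts = [10, 10, 10]                                   # score vector nν (sums to n)
nu = [Fraction(k, n) for k in counts]                   # empirical measure ν

# Exact multinomial probability P(N_n / n = ν).
prob = Fraction(math.factorial(n))
for k, m in zip(counts, mu):
    prob = prob * m**k / math.factorial(k)

# Relative entropy H(ν|µ) = Σ_y ν(y) ln(ν(y)/µ(y)), in nats.
H = sum(float(v) * math.log(v / m) for v, m in zip(nu, mu))

lower = (n + 1) ** (-len(mu)) * math.exp(-n * H)
upper = math.exp(-n * H)
assert lower <= float(prob) <= upper   # the sandwich bound holds
```

The polynomial factor (n + 1)^{−|Σ|} makes the lower bound quite loose for such a small n, but both inequalities already hold, and the exponential scale e^{−nH(ν|µ)} is visible.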
The large deviations theorem holds for the empirical distribution of a general n-sample, with speed n and rate I(ν) = H(ν|µ) given by the natural generalization of the above formula. This result, due to Sanov, has many consequences in information theory (Dembo and Zeitouni; den Hollander), and for exponential families in statistics. Applications in statistics also include point estimation (by giving the exponential rate of convergence of M-estimators), hypothesis testing (Bahadur efficiency) (Kester), and concentration inequalities (Dembo and Zeitouni).

The Freidlin–Wentzell theory deals with diffusion processes with small noise,

    dX_t^ε = b(X_t^ε) dt + √ε σ(X_t^ε) dB_t,  X_0^ε = y,  ε > 0.

The coefficients b, σ are uniformly Lipschitz functions, and B is a standard Brownian motion (see 7Brownian Motion and Diffusions). The sequence X^ε, as ε ↘ 0, can be viewed as a small random perturbation of the ordinary differential equation

    dx_t = b(x_t) dt,  x_0 = y.

Indeed, X^ε → x in the supremum norm on bounded time intervals. Freidlin and Wentzell have shown that, on a finite time interval [0, T], the sequence X^ε with values in the path space obeys the LDP with speed ε^{−1} and rate function

    I(ϕ) = (1/2) ∫_0^T |σ(ϕ(t))^{−1} (ϕ̇(t) − b(ϕ(t)))|² dt

if ϕ is absolutely continuous with square-integrable derivative and ϕ(0) = y; I(ϕ) = ∞ otherwise. (To fit in the above formal definition, take a sequence ε = ε_n ↘ 0, and for P_n the law of X^{ε_n}.)

The Freidlin–Wentzell theory has applications in physics (metastability phenomena) and engineering (tracking loops, statistical analysis of signals, stabilization of systems, and algorithms) (Freidlin and Wentzell; Dembo and Zeitouni; Olivieri and Vares).

About the Author
Francis Comets is Professor of Applied Mathematics at the University of Paris - Diderot (Paris 7), France. He is the Head of the team "Stochastic Models" in the CNRS laboratory "Probabilité et modèles aléatoires". He is the Deputy Director of the Foundation Sciences Mathématiques de Paris since its creation. He has coauthored many research papers, with a focus on random media, and one book ("Calcul stochastique et modèles de diffusions", in French, Dunod, with Thierry Meyre). He has supervised numerous Ph.D. theses, and served as Associate Editor for Stochastic Processes and their Applications.

Cross References
7Asymptotic Relative Efficiency in Testing
7Chernoff Bound
7Entropy and Cross Entropy as Diversity and Distance Measures
7Laws of Large Numbers
7Limit Theorems of Probability Theory
7Moderate Deviations
7Robust Statistics

References and Further Reading
Dembo A, Zeitouni O () Large deviations techniques and applications. Springer, New York
den Hollander F () Large deviations. Am Math Soc, Providence, RI
Freidlin MI, Wentzell AD () Random perturbations of dynamical systems. Springer, New York
Kester A () Some large deviation results in statistics. CWI Tract, Centrum voor Wiskunde en Informatica, Amsterdam
Olivieri E, Vares ME () Large deviations and metastability. Cambridge University Press, Cambridge

Miodrag Lovric (ed.), International Encyclopedia of Statistical Science, © Springer-Verlag Berlin Heidelberg

Laws of Large Numbers

Andrew Rosalsky
Professor
University of Florida, Gainesville, FL, USA

The laws of large numbers (LLNs) provide bounds on the fluctuation behavior of sums of random variables and, as we will discuss herein, lie at the very foundation of statistical science. They have a history going back over three centuries. The literature on the LLNs is of epic proportions, as this concept is indispensable in probability and statistical theory and their application.

Probability theory, like some other areas of mathematics such as geometry for example, is a subject arising from an attempt to provide a rigorous mathematical model for real world phenomena. In the case of probability theory, the real world phenomena are chance behavior of biological processes or physical systems such as gambling games and their associated monetary gains or losses.

The probability of an event is the abstract counterpart to the notion of the long-run relative frequency of the occurrence of the event through infinitely many replications of the experiment. For example, if a quality control engineer asserts a given probability that a widget produced by her production team meets specifications, then she is asserting that in the long-run the corresponding percentage of those widgets meets specifications. The phrase "in the long-run" requires the notion of limit as the sample size approaches infinity. The long-run relative frequency approach for describing the probability of an event is natural and intuitive but, nevertheless, it raises serious mathematical questions. Does the limiting relative frequency always exist as the sample size approaches infinity, and is the limit the same irrespective of the sequence of experimental outcomes? It is easy to see that the answers are negative. Indeed, in the above example, depending on the sequence of experimental outcomes, the proportion of widgets meeting specifications could fluctuate repeatedly from near 0 to near 1 as the number of widgets sampled approaches infinity. So in what sense can it be asserted that the limit exists and equals the asserted probability? To provide an answer to this question, one needs to apply a LLN.

The LLNs are of two types, viz., weak LLNs (WLLNs) and strong LLNs (SLLNs). Each type involves a different mode of convergence. In general, a WLLN (resp., a SLLN) involves convergence in probability (resp., convergence almost surely (a.s.)). The definitions of these two modes of convergence will now be reviewed. Let {U_n, n ≥ 1} be a sequence of random variables defined on a probability space (Ω, F, P) and let c ∈ ℝ. We say that U_n converges in probability to c (denoted U_n →_P c) if

    lim_{n→∞} P(|U_n − c| > ε) = 0 for all ε > 0.

We say that U_n converges a.s. to c (denoted U_n → c a.s.) if

    P({ω ∈ Ω : lim_{n→∞} U_n(ω) = c}) = 1.

If U_n → c a.s., then U_n →_P c; the converse is not true in general.

The celebrated Kolmogorov SLLN (see, e.g., Chow and Teicher) is the following result. Let {X_n, n ≥ 1} be a sequence of independent and identically distributed (i.i.d.) random variables and let c ∈ ℝ. Then

    (Σ_{i=1}^n X_i)/n → c a.s. if and only if EX = c.  (1)

Using statistical terminology, the sufficiency half of (1) asserts that the sample mean converges a.s. to the population mean as the sample size n approaches infinity, provided the population mean exists and is finite. This result is of fundamental importance in statistical science. It follows from (1) that

    if EX = c ∈ ℝ, then (Σ_{i=1}^n X_i)/n →_P c;  (2)

this result is the Khintchine WLLN (see, e.g., Petrov).

Next, suppose {A_n, n ≥ 1} is a sequence of independent events all with the same probability p. A special case of the Kolmogorov SLLN is the limit result

    p̂_n → p a.s.,  (3)

where p̂_n = (Σ_{i=1}^n I_{A_i})/n is the proportion of {A_1, ..., A_n} to occur, n ≥ 1. (Here I_{A_i} is the indicator function of A_i, i ≥ 1.) This result is the first SLLN ever proved and was discovered by Émile Borel in 1909. Hence, with probability 1, the sample proportion p̂_n approaches the population proportion p as the sample size n → ∞. It is this SLLN which thus provides the theoretical justification for the long-run relative frequency approach to interpreting probabilities. Note, however, that the convergence in (3) is not pointwise on Ω but, rather, is pointwise on some subset of Ω having probability 1. Consequently, any interpretation of p = P(A_1) via (3) necessitates that one has an a priori intuitive understanding of the notion of an event having probability 1.

The SLLN (3) is a key component in the proof of the Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems) which, roughly speaking, asserts that with probability 1, a population distribution function can be uniformly approximated by a sample (or empirical) distribution function as the sample size approaches infinity. This result is referred to by Rényi as the fundamental theorem of mathematical statistics and by Loève as the central statistical theorem.

In 1713, Jacob Bernoulli (1654–1705) proved the first WLLN

    p̂_n →_P p.  (4)

Bernoulli's renowned book Ars Conjectandi (The Art of Conjecturing) was published posthumously in 1713, and it is here where the proof of his WLLN was first published. It is interesting to note that there is a gap of nearly two centuries between the WLLN (4) of Bernoulli and the corresponding SLLN (3) of Borel.

An interesting example is the following modification of an example of Stout. Suppose that the quality control engineer referred to above would like to estimate the proportion p of widgets produced by her production team that meet specifications.

She estimates p by using the proportion p̂_n of the first n widgets produced that meet specifications, and she is interested in knowing if there will ever be a point in the sequence of examined widgets such that, with probability (at least) a specified large value, p̂_n will be within ε of p and stay within ε of p as the sampling continues (where ε > 0 is a prescribed tolerance). The answer is affirmative since (3) is equivalent to the assertion that for a given ε > 0 and δ > 0, there exists a positive integer N_{ε,δ} such that

    P(∩_{n=N_{ε,δ}}^∞ [|p̂_n − p| ≤ ε]) ≥ 1 − δ.

That is, the probability is arbitrarily close to 1 that p̂_n will be arbitrarily close to p simultaneously for all n beyond some point. If one applied instead the WLLN (4), then it could only be asserted that for a given ε > 0 and δ > 0, there exists a positive integer N_{ε,δ} such that

    P(|p̂_n − p| ≤ ε) ≥ 1 − δ for all n ≥ N_{ε,δ}.

There are numerous other versions of the LLNs, and we will discuss only a few of them. Note that the expressions in (1) and (2) can be rewritten, respectively, as

    (Σ_{i=1}^n X_i − nc)/n → 0 a.s. and (Σ_{i=1}^n X_i − nc)/n →_P 0,

thereby suggesting the following definitions. A sequence of random variables {X_n, n ≥ 1} is said to obey a general SLLN (resp., WLLN) with centering sequence {a_n, n ≥ 1} and norming sequence {b_n, n ≥ 1} (where 0 < b_n → ∞) if

    (Σ_{i=1}^n X_i − a_n)/b_n → 0 a.s. (resp., (Σ_{i=1}^n X_i − a_n)/b_n →_P 0).

A famous result of Marcinkiewicz and Zygmund (see, e.g., Chow and Teicher) extended the Kolmogorov SLLN as follows. Let {X_n, n ≥ 1} be a sequence of i.i.d. random variables and let 0 < p < 2. Then

    (Σ_{i=1}^n X_i − nc)/n^{1/p} → 0 a.s. for some c ∈ ℝ if and only if E|X|^p < ∞.

In such a case, necessarily c = EX if p ≥ 1, whereas c is arbitrary if p < 1. Feller extended the Marcinkiewicz–Zygmund SLLN to the case of a more general norming sequence {b_n, n ≥ 1} satisfying suitable growth conditions.

The following WLLN is ascribed to Feller by Chow and Teicher. If {X_n, n ≥ 1} is a sequence of i.i.d. random variables, then there exist real numbers a_n, n ≥ 1, such that

    (Σ_{i=1}^n X_i − a_n)/n →_P 0  (6)

if and only if

    nP(|X| > n) → 0 as n → ∞.  (7)

In such a case, a_n − nE(X I_{[|X| ≤ n]}) → 0 as n → ∞. The condition (7) is weaker than E|X| < ∞. If {X_n, n ≥ 1} is a sequence of i.i.d. random variables where X has probability density function

    f(x) = c/(x² log|x|) for |x| ≥ e, and f(x) = 0 for |x| < e,

where c is a constant, then E|X| = ∞ and the SLLN (Σ_{i=1}^n X_i)/n → c a.s. fails for every c ∈ ℝ, but (7) and hence the WLLN (6) hold with a_n = 0, n ≥ 1. Klass and Teicher extended the Feller WLLN to the case of a more general norming sequence {b_n, n ≥ 1}, thereby obtaining a WLLN analog of Feller's extension of the Marcinkiewicz–Zygmund SLLN.

Good references for studying the LLNs are the books by Révész, Stout, Loève, Chow and Teicher, and Petrov. While the LLNs have been studied extensively in the case of independent summands, some of the LLNs presented in these books involve summands obeying a dependence structure other than that of independence.

A large literature of investigation on the LLNs for sequences of Banach space valued random elements has emerged, beginning with the pioneering work of Mourier. See the monograph by Taylor for background material and results. Excellent references are the books by Vakhania, Tarieladze, and Chobanyan, and by Ledoux and Talagrand. More recent results are provided by Adler et al., by Cantrell and Rosalsky, and by the references in these two articles.

About the Author
Professor Rosalsky is Associate Editor of the Journal of Applied Mathematics and Stochastic Analysis and of the International Journal of Mathematics and Mathematical Sciences. He has collaborated with research workers from many different countries. At the University of Florida, he twice received a Teaching Improvement Program Award and on five occasions was named an Anderson Scholar Faculty Honoree, an honor reserved for faculty members designated by undergraduate honor students as being the most effective and inspiring.

Cross References
7Almost Sure Convergence of Random Variables
7Borel–Cantelli Lemma and Its Generalizations

7Chebyshev’s Inequality sciences (including Statistics) are taught in English. e 7Convergence of Random Variables reason is that most of the scientic literature is in English 7Ergodic eorem and teaching in the native language may leave graduates 7Estimation: An Overview at a disadvantage. Since only few instructors speak Ara- 7Expected Value bic, the university adopts a policy of no communication 7Foundations of Probability in Arabic in classes and o›ce hours. Students are required 7Glivenko-Cantelli eorems to achieve a minimum level in English (about . IELTS 7Probability eory: An Outline score) before they start their study program. Very few stu- 7Random Field dents achieve that level on entry and the majority spends 7Statistics, History of about two semesters doing English only. 7Strong Approximations in Probability and Statistics

References and Further Reading Language and Cultural Problems It is to be expected that students from a non-English- Adler A, Rosalsky A, Taylor RL () A weak law for normed speaking background will face serious di›culties when weighted sums of random elements in Rademacher type p Banach spaces. J Multivariate Anal :– learning in English especially in the rst year or two. Most Cantrell A, Rosalsky A () A strong law for compactly uni- of the literature discusses problems faced by foreign stu- formly integrable sequences of independent random elements dents pursuing study programs in an English-speaking in Banach spaces. Bull Inst Math Acad Sinica :– country, or a minority in a multi-cultural society (see Chow YS, Teicher H () Probability theory: indepen- for example Coutis P. and Wood L., Hubbard R, Koh E). dence, interchangeability, martingales, rd edn. Springer, New York Such students live (at least while studying) in an English- Feller W () A limit theorem for random variables with infinite speaking community with which they have to interact on a moments. Am J Math :– daily basis. ese di›culties are more serious for our stu- Klass M, Teicher H () Iterated logarithm laws for asymmet- dents who are studying in their own country where English L ric random variables barely with or without finite mean. Ann is not the o›cial language. ey hardly use English outside Probab :– Ledoux M, Talagrand M () Probability in Banach spaces: classrooms and avoid talking in class as much as they can. isoperimetry and processes. Springer, Berlin Loève M () Probability theory, vol I, th edn. Springer, New York Mourier E () Eléments aléatoires dans un espace de Banach. 
My SQU Experience Annales de l’ Institut Henri Poincaré :– Statistical concepts and methods are most ešectively Petrov VV () Limit theorems of probability theory: sequences taught through real-life examples that the students appre- of independent random variables. Clarendon, Oxford Rényi A () Probability theory. North-Holland, Amsterdam ciate and understand. We use the most popular textbooks Révész P () The laws of large numbers. Academic, New York in the USA for our courses. ese textbooks use this Stout WF () Almost sure convergence. Academic, New York approach with US students in mind. Our main prob- Taylor RL () Stochastic convergence of weighted sums of ran- lems are: dom elements in linear spaces. Lecture notes in mathematics, vol . Springer, Berlin ● Most of the examples and exercises used are com- Vakhania NN, Tarieladze VI, Chobanyan SA () Probability pletely alien to our students. e discussions meant to distributions on Banach spaces. D. Reidel, Dordrecht, Holland maintain the students’ interest only serve to put ours oš. With limited English they have serious di›culties understanding what is explained and hence tend not to listen to what the instructor is saying. ey do not read the textbooks because they contain pages and pages of Learning Statistics in a Foreign lengthy explanations and discussions they cannot fol- Language low. A direct ešect is that students may nd the subject boring and quickly lose interest. eir attention then K„†o†§ M. AfoufZ«†± turns to the art of passing tests instead of acquiring the Sultan Qaboos University, Muscat, Sultanate of Oman intended knowledge and skills. To pass their tests they use both their class and study times looking through Background examples, concentrating on what formula to use and e Sultanate of Oman is an Arabic-speaking country, where to plug the numbers they have to get the answer. 
where the medium of instruction in pre-university edu- is way they manage to do the mechanics fairly well, cation is Arabic. In Sultan Qaboos University (SQU) all but the concepts are almost entirely missed.  L Least Absolute Residuals Procedure

● e problem is worse with introductory probability I expect such textbooks to go a long way to enhance courses where games of chance are extensively used students’ understanding of Statistics. An international as illustrative examples in textbooks. Most of our stu- project can be initiated to produce an introductory statis- dents have never seen a deck of playing cards and some tics textbook, with dišerent versions intended for dišerent may even be ošended by discussing card games in a geographical areas. e English material will be the same; classroom. the examples vary, to some extent, from area to area and glossaries in local languages. Universities in the develop- e burden of nding strategies to overcome these di›cul- ing world, naturally, look at western universities as models, ties falls on the instructor. Statistical terms and concepts and international (western) involvement in such a project such as parameter/statistic, sampling distribution, unbi- is needed for it to succeed. e project will be a major con- asedness, consistency, su›ciency, and ideas underlying tribution to the promotion of understanding Statistics and hypothesis testing are not easy to get across even in the stu- excellence in statistical education in developing countries. dents’ own language. To do that in a foreign language is a e international statistical institute takes pride in sup- real challenge. For the Statistics program to be successful, porting statistical progress in the developing world. is all (or at least most of the) instructors involved should be project can lay the foundation for this progress and hence up to this challenge. is is a time-consuming task with lit- is worth serious consideration by the institute. tle reward, other than self satisfaction. 
In SQU the problem is compounded further by the fact that most of the instruc- tors are expatriates on short-term contracts who are more Cross References likely to use their time for personal career advancement, 7Online Statistics Education rather than time-consuming community service jobs. 7Promoting, Fostering and Development of Statistics in Developing Countries 7Selection of Appropriate Statistical Methods in Develop- ing Countries What Can Be Done? 7Statistical Literacy, Reasoning, and inking For our rst Statistics course we produced a manual that 7Statistics Education contains very brief notes and many samples of previous quizzes, tests, and examinations. It contains a good col- References and Further Reading lection of problems from local culture to motivate the students. e manual was well received by the students, to Coutis P, Wood L () Teaching statistics and academic language in culturally diverse classrooms. http://www.math.uoc.gr/ the extent that students prefer to practice with examples ~ictm/Proceedings/pap.pdf from the manual rather than the textbook. Hubbard R () Teaching statistics to students who are learning Textbooks written in English that are brief and to the in a foreign language. ICOTS  point are needed. ese should include examples and exer- Koh E () Teaching statistics to students with limited language cises from the students’ own culture. A student trying skills using MINITAB http://archives.math.utk.edu/ICTCM/ VOL/C/paper.pdf to understand a specic point gets distracted by lengthy explanations and discouraged by thick textbooks to begin with. In a classroom where students’ faces clearly indicate that you have got nothing across, it is natural to try explain- ing more using more examples. In the end of semester evaluation of a course I taught, a student once wrote “e Least Absolute Residuals instructor explains things more than needed. 
He makes Procedure simple points di›cult.” is indicates that, when teaching R†h„Z§o W††Z“ FZ§uf§™±„u§ in a foreign language, lengthy oral or written explanations Honorary Reader in Econometrics are not helpful. A better strategy will be to explain concepts Victoria University of Manchester, Manchester, UK and techniques briežy and provide plenty of examples and exercises that will help the students absorb the material by osmosis. e basic statistical concepts can only be ešec- For , , . . . , , let , ,..., , represent the th i = n {xi xi xiq yi} i tively communicated to students in their own language. observation on a set of q +  variables and suppose that we For this reason textbooks should contain a good glossary wish to t a linear model of the form where technical terms and concepts are explained using the local language. yi = xi β + xi β + ⋯ + xiq βq + єi Least Absolute Residuals Procedure L  to these n observations. en, for p > , the Lp-norm In his variant of the standard LAR procedure, there are two tting procedure chooses values for , ,..., to min- explanatory variables of which the rst is constant  b b bq xi = imise the -norm of the residuals n p ~p where, for and the values of b and b are constrained to satisfy the Lp [∑i= SeiS ]   , , . . . , , the th residual is dened by adding-up condition n y b x b  usually i = n i ∑i=( i −  − i ) = associated with the least squares procedure developed by ... . ei = yi − xib − xib − − xiqbq Gauss in  and published by Legendre in . Com- puter algorithms implementing this variant of the LAR e most familiar Lp-norm tting procedure, known procedure with q  variables are still to be found in the as the 7least squares procedure, sets p =  and chooses ≥ values for , ,..., to minimize the sum of the squared literature. b b bq residuals n . 
For an account of recent developments in this area, ∑i= ei A second choice, to be discussed in the present article, see the series of volumes edited by Dodge (, , sets  and chooses , ,..., to minimize the sum , ). For a detailed history of the LAR procedure, p = b b bq of the absolute residuals n . analyzing the contributions of Boškovic,´ Laplace, Gauss, ∑i= SeiS A third choice sets and chooses , ,..., to Edgeworth, Turner, Bowley and Rhodes, see Farebrother p = ∞ b b bq minimize the largest absolute residual n . (). And, for a discussion of the geometrical and maxi= SeiS  mechanical representation of the least squares and LAR Setting ui ei and vi  if ei  and ui  = = ≥ = tting procedures, see Farebrother (). and vi = −ei if ei < , we nd that ei = ui − vi so that the least absolute residuals (LAR) tting problem chooses , ,..., to minimize the sum of the absolute residuals b b bq About the Author n Richard William Farebrother was a member of the teach- Q(ui + vi) ing staš of the Department of Econometrics and Social i= Statistics in the (Victoria) University of Manchester from subject to  until  when he took early retirement (he has L been blind since ). From  until  he was an for , , . . . , xib + xib + ⋯ + xiqbq + Ui − vi = yi i = n Honorary Reader in Econometrics in the Department of Economic Studies of the same University. He has published and Ui ≥ , vi ≥  for i = , , . . . , n. three books: Linear Least Squares Computations in , e tting problem thus takes the form of a linear LAR Fitting Linear Relationships: A History of the Calculus of programming problem and is oŸen solved by means of a Observations – in , and Visualizing statistical variant of the dual simplex procedure. Models and Concepts in . 
He has also published more Gauss has noted (when q = ) that solutions of this than  research papers in a wide range of subject areas problem are characterized by the presence of a set of q including econometric theory, computer algorithms, sta- zero residuals. Such solutions are robust to the presence of tistical distributions, statistical inference, and the history of outlying observations. Indeed, they remain constant under statistics. Dr. Farebrother was also visiting Associate Pro- variations in the other n − q observations provided that fessor () at Monash University, Australia, and Visiting these variations do not cause any of the residuals to change Professor () at the Institute for Advanced Studies in their signs. Vienna, Austria. e LAR tting procedure corresponds to the maxi- mum likelihood estimator when the є-disturbances follow a double exponential (Laplacian) distribution. is estima- Cross References tor is more robust to the presence of outlying observations 7Least Squares than is the standard least squares estimator which maxi- 7Residuals mizes the likelihood function when the є-disturbances are normal (Gaussian). Nevertheless, the LAR estimator has an References and Further Reading asymptotic as it is a member of Huber’s Dodge Y (ed) () Statistical data analysis based on the L-norm class of M-estimators. and related methods. North-Holland, Amsterdam ere are many variants of the basic LAR proce- Dodge Y (ed) () L-statistical analysis and related methods. dure but the one of greatest historical interest is that North-Holland, Amsterdam L proposed in  by the Croatian Jesuit scientist Rugjer Dodge Y (ed) () -statistical procedures and related topics. (or Rudjer) Josip Boškovic´ (–) (Latin: Rogerius Institute of Mathematical Statistics, Hayward Dodge Y (ed) () Statistical data analysis based on the L-norm Josephus Boscovich; Italian: Ruggiero Giuseppe Boscovich). and related methods. 
Birkhäuser, Basel  L Least Squares

Farebrother RW () Fitting linear relationships: a history of the In linear regression each fi is a linear function of type p calculus of observations –. Springer, New York fi cij x ,..., xn βj and the model () may be = Qj= (  ) Farebrother RW () Visualizing statistical models and concepts. presented in vector-matrix notation as Marcel Dekker, New York y = Xβ + e, where ,..., T, ,..., T and y = (y yn) e = (e en) β = ,..., T, while is a matrix with entries . (β βp) X n × p xij If ,..., are not correlated with mean zero and a com- e en Least Squares mon (perhaps unknown) variance then the problem of Best Linear Unbiased Estimation (BLUE) of β reduces to Cñu«’Zë S±u˛£•†ZŽ nding a vector β̂ that minimizes the norm UUy − Xβ̂UU = Professor  T Maria Curie-Skłodowska University, Lublin, Poland (y − Xβ̂) (y − Xβ̂) University of Rzeszów, Rzeszów, Poland Such a vector is said to be the ordinary LS solution of the overparametrized system Xβ ≈ y. On the other hand the last one reduces to solving the consistent system

T T Least Squares (LS) problem involves some algebraic and X Xβ = X y numerical techniques used in “solving” overdetermined n of linear equations called normal equations. In particular, systems F(x) ≈ b of equations, where b ∈ R while F(x) is a column of the form if rank(X) = p then the system has a unique solution of the form ⎡ ⎤ ̂ T − T . ⎢ ⎥ β = (X X) X y ⎢ f(x) ⎥ ⎢ ⎥ ⎢ ⎥ For linear regression yi α βxi ei with one regressor ⎢ f x ⎥ = + + ⎢ ( ) ⎥ x the BLU estimators of the parameters α and β may be ⎢ ⎥ ⎢ ⎥ presented in the convenient form as F(x) = ⎢ ⎥ ⎢ ⋯ ⎥ ⎢ ⎥ ⎢ ⎥ nsxy ⎢fm− x ⎥ β̂ and α y β̂x, ⎢ ( )⎥ =  ̂ = − ⎢ ⎥ ns x ⎢ ⎥ ⎢ fm(x) ⎥ where ⎣ ⎦ x y with entries , , . . . , where ,..., T. ns x y (∑i i)(∑i i), fi = fi(x) i = n x = (x xp) xy = Q i i − i n e LS problem is linear when each fi is a linear function, x  x y and nonlinear – if not.   (∑i i) , ∑i i and ∑i i nsx = Q xi − x = y = Linear LS problem refers to a system Ax = b of linear i n n n equations. Such a system is overdetermined if . If n > p b ∉ For its computation we only need to use a simple pocket the system has no proper solution and will be range(A) calculator. denoted by Ax ≈ b. In this situation we are seeking for a solution of some optimization problem. e name “Least Example e following table presents the number of resi- Squares” is justied by the -norm commonly used as a l dents in thousands (x) and the unemployment rate in % (y) measure of imprecision. for some cities of Poland. Estimate the parameters β and α. e LS problem has a clear statistical interpretation in regression terms. Consider the usual regression model

,..., ; ,..., for , . . . , () xi        yi = fi(xi xip β βp) + ei i = n yi . . . . . . . where xij, i = , . . . n, j = , . . . p, are some constants given by experimental design, fi, i = , . . . n, are given functions depending on unknown parameters , , . . . , , while In this case , , .,  , βj j = p ∑i xi = ∑i yi = ∑i xi = , , . . . , , are values of these functions, observed with and , . erefore  , . and yi i = n ∑i xiyi = nsx = nsxy = some random errors ei. We want to estimate the unknown −. us β̂ = −. and ̂α = . and hence f (x) = parameters βi on the basis of the data set {xij, yi}. −.x + .. Lévy Processes L 
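The pocket-calculator formulas above are easy to evaluate directly. The sketch below applies them to a small invented data set (the numerical values of the entry's Polish-cities table did not survive in this copy, so these x, y values are illustrative only).

```python
# Sketch of the BLU estimators for simple linear regression:
#   beta_hat  = ns_xy / ns_x^2,   alpha_hat = ybar - beta_hat * xbar,
# with ns_xy = sum(x*y) - sum(x)*sum(y)/n and ns_x^2 = sum(x^2) - sum(x)^2/n.

def fit_simple_regression(x, y):
    n = len(x)
    ns_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ns_x2 = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    beta = ns_xy / ns_x2                      # slope
    alpha = sum(y) / n - beta * sum(x) / n    # intercept
    return alpha, beta

# Hypothetical data: residents in thousands (x), unemployment rate in % (y).
x = [100, 200, 300, 400, 500]
y = [12.0, 10.5, 9.0, 7.5, 6.0]
alpha, beta = fit_simple_regression(x, y)
# Fitted line: f(x) = alpha + beta * x
```

Because the invented y values lie exactly on a line, the fit reproduces it (slope −0.015, intercept 13.5); with noisy data the same formulas give the BLUE.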

If the variance–covariance matrix of the error vector e coincides (except for a multiplicative scalar σ²) with a positive definite matrix V, then Best Linear Unbiased estimation reduces to the minimization of (y − Xβ̂)^T V^{−1}(y − Xβ̂), called the weighted LS problem. Moreover, if rank(X) = p then its solution is given in the form

β̂ = (X^T V^{−1} X)^{−1} X^T V^{−1} y.

It is worth adding that a nonlinear LS problem is more complicated and its explicit solution is usually not known. Instead, iterative algorithms are suggested.

Total least squares problem. The problem has been posed in recent years in numerical analysis as an alternative to the LS problem in the case when all data are affected by errors.

Consider an overdetermined system of n linear equations Ax ≈ b with k unknowns x. The TLS problem consists in minimizing the Frobenius norm

‖[A, b] − [Â, b̂]‖_F

for all Â ∈ R^{n×k} and b̂ ∈ range(Â), where the Frobenius norm is defined by ‖(aij)‖_F² = Σ_{i,j} aij². Once a minimizing [Â, b̂] is found, then any x satisfying Âx = b̂ is called a TLS solution of the initial system Ax ≈ b.

The trouble is that the minimization problem may not be solvable, or its solution may not be unique. As an example one can set

A = [1 0; 0 0] and b = (1, 1)^T.

It is known that the TLS solution (if it exists) is always better than the ordinary LS solution in the sense that the correction b − Âx has smaller l2-norm. The main tool in solving TLS problems is the following Singular Value Decomposition: for any n × k matrix A with real entries there exist orthonormal matrices P = [p1, …, pn] and Q = [q1, …, qk] such that

P^T AQ = diag(σ1, …, σm), where σ1 ≥ ⋯ ≥ σm and m = min{n, k}.

About the Author
For biography see the entry 7Random Variable.

Cross References
7Absolute Penalty Estimation
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Autocorrelation in Regression
7Best Linear Unbiased Estimation in Linear Models
7Estimation
7Gauss-Markov Theorem
7General Linear Models
7Least Absolute Residuals Procedure
7Linear Regression Models
7Nonlinear Models
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Simple Linear Regression
7Statistics, History of
7Two-Stage Least Squares

References and Further Reading
Björk A () Numerical methods for least squares problems. SIAM, Philadelphia
Kariya T () Generalized least squares. Wiley, New York
Rao CR, Toutenberg H () Linear models, least squares and alternatives. Springer, New York
Van Huffel S, Vandewalle J () The total least squares problem: computational aspects and analysis. SIAM, Philadelphia
Wolberg J () Data analysis using the method of least squares: extracting the most information from experiments. Springer, New York

Lévy Processes

Mohamed Abdel-Hameed
Professor
United Arab Emirates University, Al Ain, United Arab Emirates

Introduction
Lévy processes have become increasingly popular in engineering (reliability, dams, telecommunication) and mathematical finance. Their applications in reliability stem from the fact that they provide a realistic model for the degradation of devices, while their applications in the mathematical theory of dams stem from the fact that they provide a basis for describing the water input of dams. Their popularity in finance is because they describe the financial markets in a more accurate way than the celebrated Black–Scholes model. The latter model assumes that the rates of return on assets are normally distributed, so that the process describing the asset price over time is a continuous process. In reality, asset prices have jumps or spikes, and asset returns exhibit fat tails and 7skewness, which negates the normality assumption inherent in the Black–Scholes model.
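Returning briefly to the SVD-based total least squares computation described in the Least Squares entry above: when the smallest singular value of [A, b] is simple and the corresponding right singular vector has a nonzero last component, the TLS solution can be read off that vector. The sketch below (NumPy-based, with invented data; not from the original entry) also reproduces the degenerate example in which no TLS solution of this form exists.

```python
# Sketch: total least squares via the SVD of the compound matrix [A, b].
# If v is the right singular vector for the smallest singular value and
# its last component is nonzero, x_tls = -v[:-1] / v[-1].
import numpy as np

def tls(A, b):
    C = np.column_stack([A, b])
    _, _, Vt = np.linalg.svd(C)
    v = Vt[-1]                      # right singular vector, smallest sigma
    if abs(v[-1]) < 1e-12:          # degenerate: no TLS solution of this form
        raise ValueError("TLS solution does not exist")
    return -v[:-1] / v[-1]

# A noisy overdetermined system built around x = (1, 2); the TLS
# estimate should recover it closely.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))
x_true = np.array([1.0, 2.0])
b = A @ x_true + 0.01 * rng.normal(size=50)
x_hat = tls(A, b)
```

For A = diag(1, 0) and b = (1, 1)^T, the smallest singular value of [A, b] is attained by a vector whose last component is zero, so `tls` raises, matching the non-solvability discussed above.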

Because of the deficiencies of the Black–Scholes model, researchers in mathematical finance have been trying to find more suitable models for asset prices. Certain types of Lévy processes have been found to provide a good model for creep of concrete, fatigue crack growth, corroded steel gates, and chloride ingress into concrete. Furthermore, certain types of Lévy processes have been used to model the water input in dams.

In this entry, we review Lévy processes, give important examples of such processes, and state some references to their applications.

Lévy Processes
A 7stochastic process X = {Xt, t ≥ 0} that has right continuous sample paths with left limits is said to be a Lévy process if the following hold:

1. X has stationary increments, i.e., for every s, t ≥ 0, the distribution of X(t+s) − Xt is independent of t.
2. X has independent increments, i.e., for every t, s ≥ 0, X(t+s) − Xt is independent of (Xu, u ≤ t).
3. X is stochastically continuous, i.e., for every t ≥ 0 and ε > 0: lim_{s→t} P(|Xt − Xs| > ε) = 0.

That is to say, a Lévy process is a stochastically continuous process with stationary and independent increments whose sample paths are right continuous with left-hand limits.

If Φt(z) is the characteristic function of a Lévy process, then its characteristic component φ(z) def= (1/t) ln Φt(z) is of the form

φ(z) = iza − z²b/2 + ∫_R [exp(izx) − 1 − izx I_{|x|<1}] ν(dx),

where a ∈ R, b ∈ R+, and ν is a measure on R satisfying ν({0}) = 0 and ∫_R (1 ∧ x²) ν(dx) < ∞. The measure ν characterizes the size and frequency of the jumps. If the measure is infinite, then the process has infinitely many jumps of very small sizes in any small interval. The constant a defined above is called the drift term of the process, and b is the variance (volatility) term.

The Lévy–Itô decomposition identifies any Lévy process as the sum of three independent processes; it is stated as follows: Given any a ∈ R, b ∈ R+ and measure ν on R satisfying ν({0}) = 0 and ∫_R (1 ∧ x²)ν(dx) < ∞, there exists a probability space (Ω, F, P) on which a Lévy process X is defined. The process X is the sum of three independent processes X⁽¹⁾, X⁽²⁾, and X⁽³⁾, where X⁽¹⁾ is a Brownian motion with drift a and volatility b (in the sense defined below), X⁽²⁾ is a compound Poisson process, and X⁽³⁾ is a square integrable martingale. The characteristic components of X⁽¹⁾, X⁽²⁾, and X⁽³⁾ (denoted by φ⁽¹⁾(z), φ⁽²⁾(z) and φ⁽³⁾(z), respectively) are as follows:

φ⁽¹⁾(z) = iza − z²b/2,
φ⁽²⁾(z) = ∫_{|x|≥1} (exp(izx) − 1) ν(dx),
φ⁽³⁾(z) = ∫_{|x|<1} (exp(izx) − 1 − izx) ν(dx).

Examples of the Lévy Processes

The Brownian Motion
A Lévy process is said to be a Brownian motion (see 7Brownian Motion and Diffusions) with drift µ and volatility rate σ² if µ = a, b = σ², and ν(R) = 0. Brownian motion is the only nondeterministic Lévy process with continuous sample paths.

The Inverse Brownian Process
Let X be a Brownian motion with drift µ > 0 and volatility rate σ². For any x > 0, let Tx = inf{t : Xt > x}. Then Tx is an increasing Lévy process (called the inverse Brownian motion); its Lévy measure is given by

υ(dx) = (1/√(2πσ²x³)) exp(−µ²x/(2σ²)) dx.

The Compound Poisson Process
The compound Poisson process (see 7Poisson Processes) is a Lévy process where b = 0 and ν is a finite measure.

The Gamma Process
The gamma process is a nonnegative increasing Lévy process X, where b = 0, a = ∫₀¹ x ν(dx), and its Lévy measure is given by

ν(dx) = (α/x) exp(−x/β) dx, x > 0,

where α, β > 0. It follows that the mean term (E(X1)) and the variance term (V(X1)) for the process are equal to αβ and αβ², respectively. The following is a simulated sample path of a gamma process (Fig. 1).

The Variance Gamma Process
The variance gamma process is a Lévy process that can be represented either as the difference between two independent gamma processes or as a Brownian process subordinated by a gamma process. The latter is accomplished by a random time change, replacing the time of the Brownian process by a gamma process with a mean term equal to 1. The variance gamma process has three parameters: µ, the drift term of the Brownian process; σ, the volatility of the Brownian process; and ν, the variance term of the gamma process.
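A gamma process path like the one shown in Fig. 1 can be simulated by summing independent gamma increments: over a time step of length dt the increment is Gamma-distributed with shape α·dt and scale β, so that E(Xt) = αβt. The sketch below uses the standard library only; the parameter values are illustrative (the entry's own values were lost in extraction).

```python
# Sketch: simulating a gamma process on [0, t_max] by summing
# independent Gamma(alpha*dt, beta) increments. Increments are
# nonnegative, so the simulated path is nondecreasing.
import random

def gamma_process_path(alpha, beta, t_max, n_steps, seed=1):
    rng = random.Random(seed)
    dt = t_max / n_steps
    path = [0.0]
    for _ in range(n_steps):
        # random.gammavariate takes (shape, scale) parameters
        path.append(path[-1] + rng.gammavariate(alpha * dt, beta))
    return path

path = gamma_process_path(alpha=1.0, beta=1.0, t_max=10.0, n_steps=1000)
# With alpha = beta = 1, E(X_10) = 10 and V(X_10) = 10.
```

The same building block gives a variance gamma path: simulate a gamma subordinator with unit mean rate and evaluate a drifted Brownian motion at those random times.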

Lévy Processes. Fig. 1 Gamma process

The following are two simulated sample paths, one for a Brownian motion with given drift and volatility terms, and the other for a variance gamma process with the same values of the drift and volatility terms and a fixed value of ν (Fig. 2).

Lévy Processes. Fig. 2 Brownian motion and variance gamma sample paths

About the Author
Professor Mohamed Abdel-Hameed was the chairman of the Department of Statistics and Operations Research, Kuwait University. He was on the editorial board of Applied Stochastic Models and Data Analysis. He was the Editor (jointly with E. Cinlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies and Accelerated Life Testing (Academic Press). He is an elected member of the International Statistical Institute. He has published numerous papers in statistics and operations research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS () Life distribution of devices subject to a Lévy wear process. Math Oper Res
Abdel-Hameed M () Optimal control of a dam using P(λ,τ,M) policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob
Black F, Scholes MS () The pricing of options and corporate liabilities. J Pol Econ
Cont R, Tankov P () Financial modeling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC () The variance gamma process and option pricing. Eur Finance Rev
Patel A, Kosko B () Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw
van Noortwijk JM () A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety

Life Expectancy

Maja Biljan-August
Full Professor
University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age x, as determined by statistics.

Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy (ex) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as

ex = Tx / lx,

where x is age; Tx is the number of person-years lived aged x and over; and lx is the number of survivors at age x according to the life table.
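The discrete formula above can be evaluated directly from a life-table column of survivors. In the sketch below, person-years lived in each single-year age interval are approximated by the trapezoid (average of adjacent lx values); the lx numbers are invented for illustration.

```python
# Sketch: remaining life expectancy e_x = T_x / l_x from a small
# hypothetical life table. L_x (person-years lived in [x, x+1)) is
# approximated as (l_x + l_{x+1}) / 2, and T_x = sum of L_y for y >= x.

def life_expectancies(l):
    """l: survivors l_x at integer ages 0..w."""
    n = len(l)
    big_l = [(l[i] + l[i + 1]) / 2 for i in range(n - 1)]  # L_x values
    e = []
    for x in range(n - 1):
        t_x = sum(big_l[x:])          # person-years lived aged x and over
        e.append(t_x / l[x])          # e_x = T_x / l_x
    return e

l = [100000, 99000, 98500, 98000, 97000]   # hypothetical l_x, ages 0-4
e = life_expectancies(l)
# e[0] is the (truncated) life expectancy at birth for this toy table.
```

A real table would close out the oldest open-ended age group separately; the truncation here is only to keep the sketch short.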

Life expectancy can be calculated for combined sexes or separately for males and females. There can be significant differences between the sexes.

Life expectancy at birth (e0) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

e0 = T0 / l0,

where T0 is the total size of the population (in person-years lived) and l0 is the number of births (the original number of persons in the birth cohort).

Life expectancy declines with age. Life expectancy at birth is highly influenced by the infant mortality rate. The paradox of the life table refers to a situation where life expectancy at birth increases for several years after birth (e0 < e1 < e2 … and even beyond). The paradox reflects the higher rates of infant and child mortality in populations in the pre-transition and middle stages of the demographic transition.

Life expectancy at birth is a summary measure of mortality in a population. It is a frequently used indicator of health standards and socio-economic living standards. Life expectancy is also one of the most commonly used indicators of social development. This indicator is easily comparable through time and between areas, including countries. Inequalities in life expectancy usually indicate inequalities in health and socio-economic development.

Life expectancy rose throughout human history. In ancient Greece and Rome, the average life expectancy was very low by modern standards; between 1800 and 2000, life expectancy at birth rose from about 30 years to a global average of more than 60 years, and to substantially more in the richest countries (Riley ). Furthermore, in most industrialized countries, in the early twenty-first century, life expectancy averaged about 80 years (WHO). These changes, called the "health transition," are essentially the result of improvements in public health, medicine, and nutrition.

Life expectancy varies significantly across regions and continents, from very low values in some central African populations to 80 years and above in many European countries. The more developed regions have a markedly higher average life expectancy at birth than the less developed regions. The two continents that display the most extreme differences in life expectancies are North America and Africa where, according to recent UN data, the gap between average life expectancies is widest.

Countries with the highest life expectancies in the world are Australia, Iceland, Italy, Switzerland, and Japan; Japanese men and women live the longest on average (WHO ). In countries with a high rate of HIV infection, principally in Sub-Saharan Africa, average life expectancy is far lower; some of the world's lowest life expectancies are in Sierra Leone, Afghanistan, Lesotho, and Zimbabwe.

In nearly all countries, women live longer than men. The gap between the world's average life expectancies at birth for females and males is about five years. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes (markedly lower life expectancy for men), whereas in Tonga, life expectancy for males exceeds that for females (WHO ).

Life expectancy is assumed to rise continuously. According to UN estimates, global life expectancy at birth is likely to rise to an average of about 75 years by 2045–2050, and it is expected to continue to vary widely across countries thereafter. Long-range United Nations population projections predict that by 2300, on average, people can expect to live well beyond 90 years, with the lowest national value (Liberia) in the high 80s and the highest (Japan) above 100 years.

For more details on the calculation of life expectancy, including continuous notation, see Keyfitz ( , ) and Preston et al. ( ).

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N () Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N () Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M () Demography: measuring and modelling population processes. Blackwell, Oxford
Riley JC () Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT () Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA () The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization () World health statistics. WHO, Geneva

World Population Prospects () Highlights. United Nations, New York
World Population to 2300 () United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that F has a density f(x) = (d/dx)F(x), and define the force of mortality or death intensity at age x as the function

µ(x) = −(d/dx) ln{1 − F(x)} = f(x) / {1 − F(x)}.

Heuristically, it is interpreted by the relation µ(x)dx = P{x < X < x + dx | X > x}. Conversely, 1 − F(x) = exp{−∫₀ˣ µ(s)ds}. The survivor function is defined as ℓ(x) = ℓ(0){1 − F(x)}, normally with ℓ(0) = 100,000. In mortality applications ℓ(x) is the expected number of survivors to exact age x out of an original cohort of 100,000 newborn babies. The survival probability is

t p x = P{X > x + t | X > x} = ℓ(x + t)/ℓ(x) = exp{−∫₀ᵗ µ(x + s)ds},

and the non-survival probability is (the converse) t q x = 1 − t p x. For t = 1 one writes q x and p x = 1 − q x. In particular, we get ℓ(x + 1) = ℓ(x) p x. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of p x for all relevant x.

The life expectancy is e°₀ = EX = ∫₀^∞ ℓ(x)dx / ℓ(0). (The subscript 0 in e°₀ indicates that the life expectancy is computed at age 0 (i.e., for newborn individuals), and the superscript ° indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is

e°x = E(X − x | X > x) = ∫₀^∞ ℓ(x + t)dt / ℓ(x),

i.e., it is the expected lifetime remaining to someone who has attained age x.

To turn to the statistical estimation of these various quantities, suppose that the function µ(x) is piecewise constant, which means that we take it to equal some constant, say µj, over each interval (xj, xj+1) for some partition {xj} of the age axis. For a collection {Xi} of independent observations of X, let Dj be the number of Xi that fall in the interval (xj, xj+1). In mortality applications, this is the number of deaths observed in the given interval. For the cohort of the initially newborn, Dj is the number of individuals who die in the interval (called the occurrences in the interval). If individual i dies in the interval, he or she will of course have lived for Xi − xj time units during the interval. Individuals who survive the interval will have lived for xj+1 − xj time units in the interval, and individuals who do not survive to age xj will not have lived during this interval at all. When we aggregate the time units lived in (xj, xj+1) over all individuals, we get a total Rj which is called the exposures for the interval, the idea being that individuals are exposed to the risk of death for as long as they live in the interval. In the simple case where there are no relations between the individual parameters µj, the collection {Dj, Rj} constitutes a statistically sufficient set of observations with a likelihood Λ that satisfies the relation ln Λ = Σj{−µj Rj + Dj ln µj}, which is easily seen to be maximized by µ̂j = Dj/Rj. The latter fraction is therefore the maximum-likelihood estimator for µj. (In some connections an age schedule of mortality will be specified, such as the classical Gompertz–Makeham function µx = a + bcˣ, which does represent a relationship between the intensity values at the different ages x, normally for single-year age groups. Maximum likelihood estimators can then be found by plugging this functional specification of the intensities into the likelihood function, finding the values â, b̂, and ĉ that maximize Λ, and using â + b̂ĉˣ for the intensity in the rest of the life table computations. Methods that do not amount to maximum likelihood estimation will sometimes be used because they involve simpler computations. With some luck they provide starting values for the iterative process that must usually be applied to produce the maximum likelihood estimators. For an example, see Forsén ().) This whole schema can be extended trivially to cover censoring (withdrawals) provided the censoring mechanism is unrelated to the mortality process.
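The occurrence–exposure estimator µ̂j = Dj/Rj above leads directly to a life table: under a constant hazard in a single-year interval, the survival probability is exp(−µ̂), and the ℓ(x) column follows by recursion from the radix ℓ(0) = 100,000. The sketch below uses invented occurrence and exposure counts.

```python
# Sketch: from occurrences D_j and exposures R_j to a life table.
#   mu_hat_j = D_j / R_j          (maximum-likelihood hazard estimate)
#   p_hat_j  = exp(-mu_hat_j)     (single-year survival probability)
#   l(x+1)   = l(x) * p_hat(x)    (survivor recursion, l(0) = 100000)
import math

def build_life_table(D, R, radix=100000.0):
    mu = [d / r for d, r in zip(D, R)]
    p = [math.exp(-m) for m in mu]
    l = [radix]
    for px in p:
        l.append(l[-1] * px)
    return mu, p, l

D = [50, 40, 45, 60]            # hypothetical deaths per age interval
R = [10000, 9900, 9800, 9700]   # hypothetical person-years of exposure
mu, p, l = build_life_table(D, R)
```

The same three lines of arithmetic apply unchanged to any other positive duration variable (recovery times, birth intervals, machine lifetimes), as the entry notes.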

If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by µ̂x in this interval, then p̂x = e^{−µ̂x} is an estimator of the single-year survival probability px. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ̂(x + 1) = ℓ̂(x)p̂x for x = 0, 1, …, and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.

So far we have argued as if the life table is computed for a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely from birth in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question and also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a {px} schedule from data collected for the members of a population during a limited time period and to use the mechanics of life-table construction to produce a period life table from the px values.

Life table techniques are described in detail in most introductory textbooks in actuarial statistics, 7biostatistics, 7demography, and epidemiology. See, e.g., Chiang (), Elandt-Johnson and Johnson (), Manton and Stallard (), and Preston et al. (). For the history of the topic, consult Seal (), Smith and Keyfitz (), and Dupâquier ().

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL () The life table and its applications. Krieger, Malabar
Dupâquier J () L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL () Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J
Manton KG, Stallard E () Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M () Demography: measuring and modeling populations. Blackwell, Oxford
Seal H () Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika
Smith D, Keyfitz N () Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of R^k, and the variable y is assumed to take values in a subset of R^n: the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ), (1)

where c can depend on y but not on θ. In more general settings where the model is semi-parametric or non-parametric the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist: see Van der Vaart () and Murphy and Van der Vaart (). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of θ, for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = (n choose y) θ^y (1 − θ)^{n−y}, y = 0, 1, …, n; θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θ^y (1 − θ)^{n−y}
and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ = y/n. This model might be appropriate for a sampling scheme which records the number of successes among n independent trials that result in success or failure, each trial having the same probability of success, θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean µ and variance σ²:

L(θ; y) = exp{−(n/2) log σ² − (1/(2σ²)) Σ(yi − µ)²},

where θ = (µ, σ²). This could also be plotted as a function of µ and σ² for a given sample y1, …, yn, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean ȳ = n^{−1}Σyi and sample variance s² = (n − 1)^{−1}Σ(yi − ȳ)², or equivalently through Σyi and Σyi². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.

Inference
The likelihood function was defined in a seminal paper of Fisher (1922), and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ̂)/L(θ) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall () and Edwards (), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1 or possibly 2, or that a likelihood function can be constructed that depends only on a component of θ that is of interest; see the section "7Nuisance Parameters" below.

In general, we would wish to calibrate our inference for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ | y) = L(θ; y)π(θ) / ∫ L(θ; y)π(θ)dθ,

where π(θ) is the density for the prior measure, and π(θ | y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form, "having observed y successes in n trials, and assuming a uniform prior π(θ) = 1, the posterior probability that θ exceeds a given value is such-and-such," and so on.

This is a very brief description of Bayesian inference, in which probability statements refer to the distribution generated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or more usually, by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²_k random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

U(θ; Y) = (∂/∂θ) log L(θ; Y).

If Y = (Y1, …, Yn) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability, but sometimes almost surely, can be difficult: see Scholz () for a summary of some of the issues that arise.

Assuming that θ̂ is consistent for θ, and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)^T i(θ)(θ̂ − θ) →d χ²_k, (2)
U(θ)^T i^{−1}(θ) U(θ) →d χ²_k, (3)
2{ℓ(θ̂) − ℓ(θ)} →d χ²_k, (4)

where i(θ) = E{−ℓ″(θ; Y); θ} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²_k is the 7chi-square distribution with k degrees of freedom.

These results are all versions of a more general result that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., that it have integral 1 over the parameter space.
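The binomial example and the likelihood-ratio result (4) can be illustrated numerically. The sketch below (data values invented) computes the log-likelihood, checks that it is maximized at θ̂ = y/n, and evaluates the statistic 2{ℓ(θ̂) − ℓ(θ)} for a hypothesized θ, which is referred to a χ² distribution with one degree of freedom.

```python
# Sketch: binomial log-likelihood l(theta) = y*log(theta) + (n-y)*log(1-theta),
# its maximizer theta_hat = y/n, and the likelihood-ratio statistic
# w = 2*{l(theta_hat) - l(theta_0)}, approximately chi-squared_1.
import math

def loglik(theta, y, n):
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

y, n = 30, 100           # invented data: 30 successes in 100 trials
theta_hat = y / n        # maximum likelihood estimate

# Likelihood-ratio statistic for the hypothesis theta = 0.5
w = 2 * (loglik(theta_hat, y, n) - loglik(0.5, y, n))
# w is about 16.5 here, far beyond the chi-squared_1 upper tail,
# so theta = 0.5 is firmly rejected at conventional levels.
```

Plotting `loglik` over a grid of θ values reproduces the likelihood plots advocated by Royall and Edwards in the one-parameter case.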

normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., have integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂ or the χ²_k distribution of the log-likelihood ratio does not lead easily to interval estimates for components of θ. However it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k-dimensional parameter of interest (k will often be 1). The profile log-likelihood function of ψ is

ℓ_P(ψ) = ℓ(ψ, λ̂_ψ),

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified under suitable smoothness conditions that results similar to those at (1)–(3) hold as well for the profile log-likelihood function: in particular

2{ℓ_P(ψ̂) − ℓ_P(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} →d χ²_k.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters: in particular it doesn't allow for errors in estimation of λ. For example the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(yᵢ − ŷᵢ)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference such improvements are "automatically" included in the formulation of the marginal posterior density for ψ:

π_M(ψ | y) ∝ ∫ π(ψ, λ | y) dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f(y₁; ψ) f(y₂ | y₁; λ), or
f(y; θ) = f(y₁ | y₂; ψ) f(y₂; λ).

A discussion of conditional inference and density factorizations is given in Reid (). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (1)–(3): see, for example, Severini () and Brazzale et al. ().

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (). But many other extensions have been suggested: to empirical likelihood (Owen ), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh ), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. ); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox ; Fraser ). The direct likelihood approach of Royall () and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction. There is recent interest in composite likelihoods for multi-dimensional responses Yᵢ, which are products of lower dimensional conditional or marginal distributions (Varin ). Reid () concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.
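The divisor-n phenomenon for the profile likelihood, mentioned in this entry, is easy to reproduce numerically (a sketch, not part of the original entry; the straight-line design, true coefficients, and noise level are arbitrary choices). Profiling β out of the normal log-likelihood and then maximizing over σ² returns the estimator RSS/n rather than the usual RSS/(n − p):

```python
import math
import random

random.seed(7)
n = 40
x = [i / n for i in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 0.5) for xi in x]

# Ordinary least squares for y = b0 + b1*x (p = 2 parameters); for fixed
# sigma^2 the constrained MLE of beta is the OLS fit, independent of sigma^2.
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Profile log-likelihood of sigma^2: l_P(s2) = -(n/2) log(2*pi*s2) - rss/(2*s2).
def profile_loglik(s2):
    return -0.5 * n * math.log(2 * math.pi * s2) - rss / (2 * s2)

grid = [rss / n * (0.5 + 0.01 * k) for k in range(1, 151)]  # grid containing rss/n
s2_profile = max(grid, key=profile_loglik)
print(s2_profile, rss / n, rss / (n - 2))  # maximizer equals rss/n, not rss/(n - p)
```

The grid maximizer of ℓ_P(σ²) is exactly RSS/n, while the usual unbiased estimator RSS/(n − 2) is strictly larger, illustrating that the profile likelihood does not account for estimation of the nuisance parameter β.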

Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (), and Severini (): this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada (–). During (–) she served as the President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation () "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal, Statistical Society of Canada (), and the Florence Nightingale David Award, Committee of Presidents of Statistical Societies (). She is Associate Editor of Statistical Science (–), Bernoulli (–) and Metrika (–).

Cross References
7Bayesian Analysis or Evidence Based Statistics?
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs. Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's view
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc :–
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables. J R Stat Soc B :– (with discussion)
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika :–
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A :–
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat :–
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat :–
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika :–
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci :–
Reid N () Likelihood. J Am Stat Assoc :–
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B :–
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. http://www.stieltjes.org/archief/biennial/frame/node.html
Varin C () On composite marginal likelihood. Adv Stat Anal :–

Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory which, at the same time, has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X, X₁, X₂, … is a sequence of i.i.d. random variables,

Sₙ := Σⱼ₌₁ⁿ Xⱼ,

and the expectation a := EX exists, then n⁻¹Sₙ →a.s. a (almost surely, i.e., with probability 1). Thus the value na can be called the first order approximation for the sums Sₙ.

The Central Limit Theorem (CLT) gives one a more precise approximation for Sₙ. It says that, if σ² := E(X − a)² < ∞, then the distribution of the standardized sum ζₙ := (Sₙ − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζₙ < x) → Φ(x) := (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt.

The quantity nEξ + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for Sₙ.

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late s (published posthumously in ). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in  and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both the methodology and the applicability range of the CLT were achieved by P.L. Chebyshev () and A.M. Lyapunov ().

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) := P(X > x) + P(X < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum Sₙ/σ(n), where σ(n) := P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P, and we assume that Eξ = 0 when the expectation is finite, converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the Xⱼ's are identically distributed and proceeding to study the so-called triangular array scheme, where the distributions of the summands Xⱼ = X_{j,n} forming the sum Sₙ depend not only on j but on n as well. In this case, the class of all limit laws for the distribution of Sₙ (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

3. Relaxing the assumption of independence of the Xⱼ's. Several types of "weak dependence" assumptions on Xⱼ under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

4. Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζₙ < x) − Φ(x) → 0 and

asymptotic expansions for this difference (in the powers of n^(−1/2) in the case of i.i.d. summands) have been obtained under broad assumptions.

5. Studying large deviation probabilities for the sums Sₙ (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζₙ > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζₙ > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

6. Considering observations X₁, …, Xₙ of a more complex nature – first of all, multivariate random vectors. If Xⱼ ∈ R^d then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X − EX)ᵀ(X − EX).

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let Xₙ* := (X₁, …, Xₙ) be a sample from a distribution F and Fₙ*(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems) stating that sup_u |Fₙ*(u) − F(u)| →a.s. 0 as n → ∞ is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample Xₙ* of a large enough size n.

The existence of consistent estimators for the unknown parameters a = Eξ and σ² = E(X − a)² also follows from the LLN since, as n → ∞,

a* := (1/n) Σⱼ₌₁ⁿ Xⱼ →a.s. a,    (σ²)* := (1/n) Σⱼ₌₁ⁿ (Xⱼ − a*)² = (1/n) Σⱼ₌₁ⁿ Xⱼ² − (a*)² →a.s. σ².

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F. The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing, one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory. Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then, in a number of cases, one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory and some other applications one of the main problems is concerned with the distribution of S := sup_{k≥0} (S_k − θk), under the assumption that EX = 0. If θ > 0 then S is a proper random variable. If, however, θ ≤ 0 then S →a.s. ∞. Here we deal with the so-called "transient phenomena." It turns out that if σ² := Var(X) < ∞ then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²},  x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation Ee^{μ(X−θ)} = 1 has a solution μ₀ > 0 then, in the above example, one has

P(S > x) ∼ c e^{−μ₀x},  x → ∞,

where c is a known constant. If the distribution F of X is 7subexponential (in this case, Ee^{μ(X−θ)} = ∞ for any μ > 0) then

P(S > x) ∼ (1/θ) ∫ₓ^∞ (1 − F(t)) dt,  x → ∞.
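A small simulation illustrates the Kingman–Prokhorov approximation for the supremum of a random walk with negative drift (a sketch, not part of the original entry: the standard normal increments, drift θ = 0.05, level x = 0.5, and the finite horizon standing in for the supremum over all k are arbitrary choices):

```python
import math
import random

random.seed(3)
theta, sigma2 = 0.05, 1.0    # small drift; Var(X) = 1
x = 0.5                      # level at which P(theta * S > x) is estimated
reps, horizon = 1000, 3000   # horizon long enough for sup(S_k - theta*k) to settle

hits = 0
for _ in range(reps):
    walk, sup = 0.0, 0.0
    for _ in range(horizon):
        walk += random.gauss(-theta, 1.0)   # one step of S_k - theta*k
        if walk > sup:
            sup = walk
    hits += theta * sup > x
estimate = hits / reps
print(estimate, math.exp(-2 * x / sigma2))  # simulated vs. limiting e^{-2x/sigma^2}
```

For this small value of θ the simulated probability is already close to the limiting value e^{−2x/σ²}.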

This LT enables one to find approximations for P(S > x) for large x. For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle) which states that, as n → ∞, the processes {ζₙ(t) := (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence {S₁, …, Sₙ} can also be classified as LTs for 7stochastic processes. The same can be said about the Law of iterated logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times but crosses the boundary (1 + ε)σ√(2k ln ln k) finitely many times with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics a result closely related to the functional CLT says that the so-called "empirical process" {√n(Fₙ*(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) := w(t) − tw(1) is the Brownian bridge process. This LT implies 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function Fₙ*(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of a branching process and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), since . He is Head of the Probability and Statistics Chair at the Novosibirsk University since . He is Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician) (). He was awarded the State Prize of the USSR (), the Markov Prize of the Russian Academy of Sciences () and the Government Prize in Education (). Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'Addition des Variables Aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general, and very flexible, class of parametric models for continuous longitudinal data is formulated below.

With respect to the estimation of unit-specific parameters bᵢ, the posterior distribution of bᵢ given the observed data yᵢ can be shown to be (multivariate) normal with mean vector equal to DZᵢ′Vᵢ⁻¹(α)(yᵢ − Xᵢβ). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes (EB) estimates b̂ᵢ for the bᵢ. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷᵢ ≡ Xᵢβ̂ + Zᵢb̂ᵢ of the ith profile.
It can easily be shown that

ŷᵢ = ΣᵢVᵢ⁻¹Xᵢβ̂ + (I_{nᵢ} − ΣᵢVᵢ⁻¹)yᵢ,

which can be interpreted as a weighted average of the population-averaged profile Xᵢβ̂ and the observed data yᵢ, with weights ΣᵢVᵢ⁻¹ and I_{nᵢ} − ΣᵢVᵢ⁻¹, respectively. Note that the "numerator" Σᵢ of ΣᵢVᵢ⁻¹ represents within-unit variability and the "denominator" is the overall covariance matrix Vᵢ. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile Xᵢβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂ᵢ) ≤ var(λ′bᵢ) = λ′Dλ.

The general, flexible class of models referred to above is formulated as follows:

yᵢ | bᵢ ∼ N(Xᵢβ + Zᵢbᵢ, Σᵢ),    (1)
bᵢ ∼ N(0, D),    (2)

where Xᵢ and Zᵢ are (nᵢ × p) and (nᵢ × q) dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σᵢ is a (nᵢ × nᵢ) covariance matrix which depends on i only through its dimension nᵢ, i.e., the set of unknown parameters in Σᵢ will not depend upon i. Finally, bᵢ is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector yᵢ of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, bᵢ), while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(bᵢ) = 0 implies that the mean of yᵢ still equals Xᵢβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, yᵢ also follows a normal distribution with mean vector Xᵢβ and with covariance matrix Vᵢ = ZᵢDZᵢ′ + Σᵢ.

Note that the random effects in (1) implicitly imply the marginal covariance matrix Vᵢ of yᵢ to be of the very specific form Vᵢ = ZᵢDZᵢ′ + Σᵢ. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σᵢ = σ²I_{nᵢ}. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Zᵢ which are nᵢ-dimensional vectors containing only ones. The marginal model implied by expressions (1) and (2) is

yᵢ ∼ N(Xᵢβ, Vᵢ),    Vᵢ = ZᵢDZᵢ′ + Σᵢ,

which can be viewed as a multivariate linear regression model, with a very particular parameterization of the covariance matrix Vᵢ.

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics (–), and Co-Editor of Biometrics (–). Currently, he is Co-Editor of Biostatistics (–). He was President of the International Biometric Society (–), received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data, and on surrogate marker evaluation.
Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association, for courses at Joint Statistical Meetings.

Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.
(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y | x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles or other forms. Perhaps y is corn yield from a given plot of earth and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this
Wiley, linkage is specied by a function f known to the experi- New York menter, one that depends upon parameters β whose values Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, are not known, and also upon unseen random errors ε New York about which statistical assumptions are made. ese mod- Leyland AH, Goldstein H () Multilevel modelling of health els prove surprisingly žexible, as when localized linear statistics. Wiley, Chichester Lindsey JK () Models for repeated measurements. Oxford regression models are knit together to estimate a regres- University Press, Oxford sion function nonlinear in β. Draper and Smith () is Littell RC, Milliken GA, Stroup WW, Wolfinger RD, a plainly written elementary introduction to linear regres- Schabenberger O () SAS for mixed models, nd edn. sion models, Rao () is one of many established general SAS Press, Cary references at the calculus level. Longford NT () Random coefficient models. Oxford University Press, Oxford Molenberghs G, Verbeke G () Models for discrete longitudinal Aspects of Data, Model and Notation data. Springer, New York Suppose a time varying sound signal is the superposition Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. of sine waves of unknown amplitudes at two xed known Springer, New York frequencies embedded in background Searle SR, Casella G, McCulloch CE () Variance components. y[t] = sin . sin . . Wewrite Wiley, New York β +β [ t]+β [ t]+ε β +βx +βx +ε Verbeke G, Molenberghs G () Linear mixed models for longi- for , , , , , , , β = (β β β) x = (x x x) x ≡ x(t) = tudinal data. Springer series in statistics. Springer, New York sin . , sin . , . A natural choice of regres- [ t] x(t) = [ t] t ≥ Vonesh EF, Chinchilli VM () Linear and non-linear models for sion function is m x, β E y x, β β β x β x the analysis of repeated measurements. Marcel Dekker, Basel ( ) = ( S ) =  +   +   provided . 
In the Weiss RE () Modeling longitudinal data. Springer, New York Eε ≡ classical linear regression model West BT, Welch KB, Gałecki AT () Linear mixed models: a prac- one assumes for dišerent instances “i” of observation that tical guide using statistical software. Chapman & Hall/CRC, random errors satisfy ,  , Eεi ≡ Eεiεk ≡ σ > i = k ≤ Boca Raton ,  otherwise. Errors in linear regression mod- n Eεiεk ≡ Wu H, Zhang J-T () Nonparametric regression methods for els typically depend upon instances i at which we select longitudinal data analysis. Wiley, New York observations and may in some formulations depend also Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton on the values of x associated with instance i (perhaps the Linear Regression Models L  errors are correlated and that correlation depends upon asteroid Ceres would re-appear from behind the sun aŸer the x values). What we observe are yi and associated val- it had been lost to view following discovery only  days ues of the independent variables x. at is, we observe before. Gauss’ prediction was well removed from all others (yi, , sin[.ti], sin[.ti]), i ≤ n. e linear model on data and he soon followed up with numerous other high-caliber may be expressed y = xβ+ε with y = column vector {yi, i ≤ successes, each achieved by tting relatively simple models n}, likewise for ε, and matrix x (the design matrix) whose motivated by Keppler’s Laws, work at which he was very  entries are . adept and quick. ese were ts to imperfect, sometimes n xik = xk[ti] limited, yet fairly precise data. Legendre () published a substantive account of least squares following which the Terminology method became widely adopted in astronomy and other Independent variables, as employed in this context, is elds. See Stigler (). misleading. 
It derives from language used in connec- By contrast Galton (), working with what might tion with mathematical equations and does not refer to today be described as “low correlation” data, discovered statistically independent random variables. Independent deep truths not already known by tting a straight line. variables may be of any dimension, in some applications No theoretical model previously available had prepared functions or surfaces. If y is not scalar-valued the model Galton for these discoveries which were made in a study is instead a multivariate linear regression. In some ver- of his own data w = standard score of weight of parental sions either x, β or both may also be random and subject sweet pea seed(s), y = standard score of seed weights(s) to statistical models. Do not confuse multivariate linear of their immediate issue. Each sweet pea seed has but one regression with which refers to multiple linear regression parent and the distributions of x and y the same. Work- a model having more than one non-constant independent ing at a time when correlation and its role in regression variable. were yet unknown, Galton found to his astonishment a nearly perfect straight line tracking points (parental seed L General Remarks on Fitting Linear weight w, median lial seed weight m(w)). Since for this data this was the least squares line (also the regres- Regression Models to Data sy ∼ sw Early work (the classical linear model) emphasized inde- sion line since the data was bivariate normal) and its slope was (the correlation). Medians being pendent identically distributed (i.i.d.) additive normal rsy~sw = r m(w) essentially equal to means of y for each w greatly facilitated errors in linear regression where 7least squares has par- calculations owing to his choice to select equal numbers ticular advantages (connections with 7multivariate nor- mal distributions are discussed below). 
In that setup least of parent seeds at weights , ±, ±, ± standard deviations squares would arguably be a principle method of tting from the mean of w. Galton gave the name co-relation linear regression models to data, perhaps with modica- (later correlation) to the slope ∼ . of this line and tions such as Lasso or other constrained optimizations that for a brief time thought it might be a universal constant. achieve reduced sampling variations of coe›cient estima- Although the correlation was small, this slope nonetheless tors while introducing bias (Efron et al. ). Absent a gave measure to the previously vague principle of reversion breakthrough enlarging the applicability of the classical (later regression, as when larger parental examples beget linear model other methods gain traction such as Bayesian ošspring typically not quite so large). Galton deduced the general principle that if  r  then for a value w Ew methods (7Markov Chain Monte Carlo having enabled < < > their calculation); Non-parametric methods (good perfor- the relation Ew < m(w) < w follows. Having sensibly mance relative to more relaxed assumptions about errors); selected equal numbers of parental seeds at intervals may Iteratively Reweighted least squares (having under some have helped him observe that points (w, y) departed on conditions behavior like maximum likelihood estimators each vertical from the regression line by statistically inde- without knowing the precise form of the likelihood). e pendent N(, θ) random residuals whose variance θ >  Dantzig selector is good news for dealing with far fewer was the same for all w. Of course this likewise amazed him observations than independent variables when a relatively and by  he had identied all these properties as a con- small fraction of them matter (Candès and Tao ). sequence of bivariate normal observations (w, y), (Galton ). 
Echoes of those long ago events reverberate today in Background our many statistical models “driven,” as we now proclaim, C.F. Gauss may have used least squares as early as . In by random errors subject to ever broadening statistical  he was able to predict the apparent position at which modeling. In the beginning it was very close to truth.  L Linear Regression Models
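The two-frequency signal model of the "Aspects of Data, Model and Notation" section can be sketched numerically. Everything below is hypothetical: the frequencies w1 and w2, the coefficients and the noise level are invented for illustration, since the article's numerical values are not given here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frequencies, coefficients and noise level (illustrative only).
w1, w2 = 1.3, 2.7
beta_true = np.array([0.5, 2.0, -1.0])

t = np.linspace(0.0, 10.0, 200)
# Design matrix: constant column plus the two sine regressors.
X = np.column_stack([np.ones_like(t), np.sin(w1 * t), np.sin(w2 * t)])
y = X @ beta_true + rng.normal(0.0, 0.1, size=t.size)

# Ordinary least squares estimate of (beta1, beta2, beta3).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With a small noise level the recovered amplitudes sit close to the values used to generate the data, which is all the sketch is meant to show.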

,   , : Aspects of Linear Regression Models EεI ≡ Eεi ≡ σ > i ≤ n Data of the real world seldom conform exactly to any deter- , . () yi = xi β + ⋯ + xip βp + εi i ≤ n ministic mathematical model y = f (x, β) and through the device of incorporating random errors we have now an e interpretation is that response yi is observed established paradigm for tting models to data (x, y) by for the ith sample in conjunction with numerical values statistically estimating model parameters. In consequence ,..., of the independent variables. If these errors (xi xip) we obtain methods for such purposes as predicting what {εI} are jointly normally distributed (and therefore sta- will be the average response y to particular given inputs tistically independent having been assumed to be uncor- x; providing margins of error (and prediction error) for related) and if the matrix xtrx is non-singular then the various quantities being estimated or predicted, tests of maximum likelihood (ML) estimates of the model coef- hypotheses and the like. It is important to note in all this cients , are produced by ordinary least squares {βk k ≤ p} that more than one statistical model may apply to a given (LS) as follows: problem, the function and the other model components f tr − tr () dišering among them. Two statistical models may disagree βML = βLS = (x x) x y = β + Mε − substantially in structure and yet neither, either or both for M = (xtrx) xtr with xtr denoting matrix transpose of x − may produce useful results. In this respect statistical mod- and (xtrx)  the matrix inverse. ese coe›cient estimates eling is more a matter of how much we gain from using βLS are linear functions in y and satisfy the Gauss–Markov a statistical model and whether we trust and can agree properties ()(): upon the assumptions placed on the model, at least as a , . () practical expedient. 
In some cases the regression function E(βLS)k = βk k ≤ p conforms precisely to underlying mathematical relation- and, among all unbiased estimators ∗ (of that are βk βk) ships but that does not režect the majority of statistical linear in y, practice. It may be that a given statistical model, although  ∗  , for every . () far from being an underlying truth, confers advantage by E((βLS)k − βk) ≤ E ‰βk − βkŽ k ≤ p capturing some portion of the variation of y vis-a-vis x. e Least squares estimator () is frequently employed method principle components, which seeks to nd relatively without the assumption of normality owing to the fact that small numbers of linear combinations of the independent properties ()() must hold in that case as well. Many sta- variables that together account for most of the variation tistical distributions F, t, chi-square have important roles in of y, illustrates this point well. In one application elec- connection with model () either as exact distributions for tromagnetic theory was used to generate by computer an quantities of interest (normality assumed) or more gener- elaborate database of theoretical responses of an induced ally as limit distributions when data are suitably enriched. electromagnetic eld near a metal surface to various com- binations of žaws in the metal. e role of principle com- Algebraic Properties of Least Squares ponents and linear modeling was to establish a simple Setting all randomness assumptions aside we may exam- model režecting those ndings so that a portable device ine the algebraic properties of least squares. If y = xβ + ε could be devised to make detections in real time based on then βLS = My = M(xβ + ε) = β + Mε as in (). at the model. is, the least squares estimate of model coe›cients acts on If there is any weakness to the statistical approach xβ + ε returning β plus the result of applying least squares it lies in the fact that margins of error, statistical tests to ε. 
is has nothing to do with the model being correct and the like can be seriously incorrect even if the predic- or ε being error but is purely algebraic. If ε itself has the tions ašorded by a model have apparent value. Refer to form xb + e then Mε = b + Me. Another useful observa- Hinkelmann and Kempthorne (), Berk (), tion is that if x has rst column identically one, as would Freedman (), Freedman (). typically be the case for a model with constant term, then each row , , of satises .  (i.e., is a Mk k > M Mk = Mk ) so . and . . contrast (βLS)k = βk + Mk ε Mk (ε + c) = Mk ε so ε may as well be assumed to be centered for k > . Classical Linear Regression Model ere are many of these interesting algebraic properties    and Least Squares such as s (y−xβLS) = (−R )s (y) where s(.) denotes the e classical linear regression model may be expressed y = sample standard deviation and R is the multiple correlation xβ + ε, an abbreviated matrix formulation of the system of dened as the correlation between y and the tted val- equations in which random errors ε are assumed to satisfy ues xβLS. Yet another algebraic identity, this one involving Linear Regression Models L  an interplay of permutations with projections, is exploited has been examined from an on-line learning perspective to help establish for exchangeable errors ε, and contrasts (Vovk ). v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution Joint Normal Distributed (x, y) as of v.(βLS − β) conditional on the 7order statistics of ε. (See LePage and Podgorski ). Freedman and Lane in Motivation for the Linear Regression  advocated tests based on permutation bootstrap of Model and Least Squares For the moment, think of x, y as following a multivariate residuals as a descriptive method. ( ) normal distribution, as might be the case under process control or in natural systems. 
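The purely algebraic identities above can be checked numerically. The design matrix below is an arbitrary invented example; the point is only that βLS = β + Mε and the contrast property of the rows of M hold with no model assumptions at all.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design with constant first column; M = (x'x)^{-1} x'.
n, p = 50, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
M = np.linalg.inv(x.T @ x) @ x.T

beta = np.array([1.0, -2.0, 0.5])
eps = rng.normal(size=n)

# Purely algebraic: least squares applied to x.beta + eps returns
# beta plus least squares applied to eps alone.
lhs = M @ (x @ beta + eps)
rhs = beta + M @ eps

# Rows M_k, k > 1, are contrasts (M_k . 1 = 0), so adding a constant
# to eps leaves those coordinates of the estimate unchanged.
shifted = M @ (eps + 7.0)
```

Here `lhs` and `rhs` agree to machine precision, and the coordinates of `shifted` beyond the first coincide with those of `M @ eps`.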
Joint Normally Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y | x) = Ey + E((y − Ey) | x) = Ey + x.β for every x,

and the discrepancies y − E(y | x) are for each x distributed N(0, σ²) for a fixed σ², independent for different x.

Generalized Least Squares
If the errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3)(4) (any positive multiple of Σ will produce the same result), given by:

βML = (xᵗʳ Σ⁻¹ x)⁻¹ xᵗʳ Σ⁻¹ y = β + (xᵗʳ Σ⁻¹ x)⁻¹ xᵗʳ Σ⁻¹ ε.   (5)

The generalized least squares solution (5) retains properties (3)(4) even if normality is not assumed. It must not be confused with generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:
Errors model: Model (1) above.
Sampling model: Data (xi1, ..., xip, yi), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population). In the sampling model, {εi} are simply the residual discrepancies y − xβLS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at 0, ±1, ±2, ±3 standard deviations from the mean. Both with and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from i.i.d. normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data, such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case to see how closely they agree.

Reproducing Kernel Generalized Linear Regression Model
Parzen (, Sect. ) developed the reproducing kernel framework extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = {ε(t), t ∈ T} has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation:

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σi βi wi(t), t ∈ T,

where the wi(.) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests and other matters of interest, and establishing the Gauss–Markov properties (3)(4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, analysis of variance, principal component analysis and their consequent distribution theory.
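A minimal numerical sketch of the generalized least squares solution (5). The AR(1)-type correlation structure below is an assumption chosen purely for illustration; the sketch also shows that any positive multiple of Σ yields the same estimate.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 120
x = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([1.0, 0.05])

# Assumed correlation structure, known up to a scalar (illustrative).
rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Correlated normal errors via the Cholesky factor of Sigma.
y = x @ beta + np.linalg.cholesky(Sigma) @ rng.normal(size=n)

# Generalized least squares, equation (5).
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)

# Any positive multiple of Sigma produces the same estimate.
Si3 = np.linalg.inv(3.0 * Sigma)
beta_gls3 = np.linalg.solve(x.T @ Si3 @ x, x.T @ Si3 @ y)
```

The two estimates coincide up to numerical error, reflecting the remark in the text that Σ need only be known up to a positive multiple.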

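The errors-model bootstrap discussed above can be illustrated with an Efron-style resampling of least squares residuals. The straight-line data below are simulated, and the procedure is a generic residual bootstrap rather than the exact constructions of the papers cited in this entry.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up straight-line data under the errors model.
n = 80
x = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])
y = x @ np.array([2.0, 0.7]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
resid = y - x @ beta_hat

# Residual bootstrap: refit on y* = x.beta_hat + resampled residuals.
B = 500
boot = np.empty((B, 2))
for b in range(B):
    y_star = x @ beta_hat + rng.choice(resid, size=n, replace=True)
    boot[b], *_ = np.linalg.lstsq(x, y_star, rcond=None)

# Bootstrap standard errors for the intercept and slope.
se = boot.std(axis=0)
```

The spread of the resampled coefficients gives the bootstrap standard errors from which confidence intervals of the kind mentioned above could be formed.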
Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w, owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable such as w² for a parabolic fit would have over-fit the data, possibly spoiling the discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff between reliable estimation of a few model coefficients, versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y | X > c) < E(X | X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c) then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX) define the difference J of indicator random variables, J = (X > c) − (Y > c). J is zero unless one indicator is 1 and the other 0. YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

EY(X > c) = E(Y(Y > c) + YJ) = EX(X > c) + E YJ < EX(X > c) + c EJ = EX(X > c),

yielding E(Y | X > c) < E(X | X > c). ◻

Cross References
▸Adaptive Linear Regression
▸Analysis of Areal and Spatial Interaction Data
▸Business Forecasting Methods
▸Gauss–Markov Theorem
▸General Linear Models
▸Heteroscedasticity
▸Least Squares
▸Optimum Experimental Design
▸Partial Least Squares Regression Versus Other Methods
▸Regression Diagnostics
▸Regression Models with Increasing Numbers of Unknown Parameters
▸Ridge and Surrogate Ridge Regressions
▸Simple Linear Regression
▸Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat
Freedman D () Bootstrapping regression models. Ann Stat

Freedman D () Statistical models and shoe leather. Sociol Methodol
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F () Typical laws of heredity. Nature
Galton F () Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O () Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E () An approach to time series analysis. Ann Math Stat
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family
Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x1, ..., xn) is a sample from a stochastic process x = {x1, x2, ...}. Let pn(x(n); θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ Rᵏ is a parameter. Define the log-likelihood ratio Λn = ln[pn(x(n); θn)/pn(x(n); θ)], where θn = θ + n^(−1/2) h, and h is a (k × 1) vector. The joint density pn(x(n); θ) belongs to a local asymptotic normal (LAN) family if Λn satisfies

Λn = hᵗ n^(−1/2) Sn(θ) − (1/2) hᵗ (n⁻¹ Jn(θ)) h + op(1),   (1)

where Sn(θ) = d ln pn(x(n); θ)/dθ, Jn(θ) = −d² ln pn(x(n); θ)/(dθ dθᵗ), and

(i) n^(−1/2) Sn(θ) →d Nk(0, F(θ)), (ii) n⁻¹ Jn(θ) →p F(θ),   (2)

F(θ) being the limiting Fisher information matrix. Here, F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂n is a consistent, asymptotically normal and efficient estimator of θ with

√n (θ̂n − θ) →d Nk(0, F⁻¹(θ)).   (3)

A large class of models involving classical i.i.d. (independent and identically distributed) observations are covered by the LAN framework. Many time series models and Markov processes also are included in the LAN family. If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λn satisfies (1) and (2) with F(θ) random, the density pn(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm Jn^(1/2)(θ) to get the limiting normal distribution, viz.,

Jn^(1/2)(θ) (θ̂n − θ) →d N(0, I),   (4)

where I is the identity matrix. Two examples belonging to the LAMN family are given below:

Example 1 (Variance mixture of normals). Suppose, conditionally on V = v, (x1, x2, ..., xn) are i.i.d. N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pn(x(n); θ) ∝ (1 + (1/2) Σᵢ₌₁ⁿ (xᵢ − θ)²)^(−(n/2 + 1)).

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂n = x̄, and √n (x̄ − θ) →d t(2), Student's t with two degrees of freedom. It is interesting to note that the variance of the limit distribution of x̄ is ∞!

Example 2 (Autoregressive process). Consider a first-order autoregressive process {xt} defined by xt = θ xt−1 + et, t = 1, 2, ..., with x0 = 0, where the {et} are assumed to be i.i.d. N(0, 1) random variables.

n   where F is a distribution having no other parameters. We then have pn x n ; θ exp xt θxt . (⋅) ( ( ) ) ∝ −  Q( − −) Dišerent ’s correspond to dišerent members of the  F(⋅) For the stationary case, , this model belongs SθS < family. (a, b) is called the location–scale parameter, a being to the LAN family. However, for SθS > , the model the location parameter and b being the scale parameter. For belongs to the LAMN family. For any , the ML esti- θ xed b =  we have a subfamily which is a location family n n − with parameter a, and for xed a  we have a scale family mator θ̂ x x x . One can verify that = n = ŒQ i i−‘ ŒQ i−‘ with parameter b. e variable   √ d  − n(θ̂n − θ) Ð→ N(, ( − θ ) ), for SθS < , and X − a Y =  − n d b (θ − ) θ (θ̂n − θ) Ð→ Cauchy, for SθS > . is called the reduced or standardized variable. It has a =  About the Author and b = . If the distribution of X is absolutely continuous Dr. Ishwar Basawa is a Professor of Statistics at the with density function University of Georgia, USA. He has served as interim dFX(x S a, b) head of the department (–), Executive Edi- fX x a, b ( S ) = d x tor of the Journal of Statistical Planning and Inference then a, b is a location scale-parameter for the distribu- (–), on the editorial board of Communications ( ) tion of X if (and only if) in Statistics, and currently the online Journal of Prob- . Professor Basawa is a Fellow of  x a ability and Statistics f x a, b f − the Institute of Mathematical Statistics and he was an X( S ) = b ‹ b  Elected member of the International Statistical Institute. for some density f (⋅), called the reduced density. All distri- He has co-authored two books and co-edited eight Pro- butions in a given family have the same shape, i.e., the same ceedings/Monographs/Special Issues of journals. He has skewness and the same kurtosis. When Y has mean µY and authored more than  publications. 
His areas of research standard deviation σY then, the mean of X is E(X) = a + include inference for stochastic processes, time series, and » b µY and the standard deviation of X is Var(X) = b σY . asymptotic statistics. e location parameter a, a ∈ R is responsible for the distribution’s position on the abscissa. An enlargement Cross References (reduction) of a causes a movement of the distribution to 7Asymptotic Normality the right (leŸ). e location parameter is either a measure 7Optimal Statistical Inference in Financial Engineering of central tendency e.g., the mean, median and mode or it 7Sampling Problems for Stochastic Processes is an upper or lower threshold parameter. e scale param- eter b, b > , is responsible for the dispersion or variation References and Further Reading of the variate X. Increasing (decreasing) b results in an Basawa IV, Scott DJ () Asymptotic optimal inference for enlargement (reduction) of the spread and a correspond- non-ergodic models. Springer, New York ing reduction (enlargement) of the density. b may be the LeCam L, Yang GL () Asymptotics in statistics. Springer, standard deviation, the full or half length of the support, New York or the length of a central ( − α)–interval. e location-scale family has a great number of members:

● Arc-sine distribution Location-Scale Distributions ● Special cases of the beta distribution like the rectan- gular, the asymmetric triangular, the U–shaped or the H™§«± R†••u power–function distributions Professor Emeritus for Statistics and Econometrics ● CZ¶h„í and half–CZ¶h„í distributions Justus–Liebig–University Giessen, Giessen, Germany ● Special cases of the χ–distribution like the half– normal, the RZíu†„ and the MZìëu–B™±ñ“Z•• distributions A random variable X with realization x belongs to the ● Ordinary and raised cosine distributions location-scale family when its cumulative distribution is a ● Exponential and režected exponential distributions function only of x a b: ( − )~ ● Extreme value distribution of the maximum and the minimum, each of type I x − a FX x a, b Pr X x a, b F ; a , b ; Hyperbolic secant distribution ( S ) = ( ≤ S ) = ‹ b  ∈ R > ● Location-Scale Distributions L 

● LZ£Zhu distribution with variance–covariance matrix ● Logistic and half–logistic distributions Var x b B. ● Normal and half–normal distributions ( ) = Parabolic distributions ● e GLS estimator of θ is ● Rectangular or uniform distribution Semi–elliptical distribution ′ − ′ ● ̂θ = ‰A Ω AŽ A Ω x ● Symmetric triangular distribution and its variance–covariance matrix reads ● Tu†««†u§ distribution with reduced density f (y) = exp y  exp  y exp y , y   ( ) −   + − ( ) ≥  ′ − Var‰̂θ Ž = b ‰A Ω AŽ . ● V–shaped distribution e vector α and the matrix B are not always easy to nd. For each of the above mentioned distributions we can For only a few location–scale distributions like the expo- design a special probability paper. Conventionally, the nential, the režected exponential, the extreme value, the abscissa is for the realization of the variate and the ordinate, logistic and the rectangular distributions we have closed- called the probability axis, displays the values of the cumu- form expressions, in all other cases we have to evalu- lative distribution function, but its underlying scaling is ate the integrals dening E and E . For according to the percentile function. e ordinate value ‰Yi:nŽ ‰Yi:n Yj:nŽ more details on linear estimation and probability plot- belonging to a given sample data on the abscissa is called ting for location-scale distributions and for distributions plotting position; for its choice see Barnett (, ), which can be transformed to location–scale type see Rinne Blom (), Kimball (). When the sample comes from (). Maximum likelihood estimation for location–scale the probability paper’s distribution the plotted data will distributions is treated by Mi (). randomly scatter around a straight line, thus, we have a graphical goodness-t-test. When we t the straight line by eye we may read oš estimates for a and b as the abscissa or About the Author L dišerence on the abscissa for certain percentiles. A more Dr. 
Horst Rinne is Professor Emeritus (since ) of objective method is to t a least-squares line to the data, statistics and econometrics. In  he was awarded the and the estimates of a and b will be the the parameters of Venia legendi in statistics and econometrics by the Faculty this line. of economics of Berlin Technical University. From  e latter approach takes the order statistics , to  he worked as a part-time lecturer of statistics at Xi:n X:n ≤ ... as regressand and the mean of the Berlin Polytechnic and at the Technical University of Han- X:n ≤ ≤ Xn:n reduced order statistics : E as regressor, which nover. In  he joined Volkswagen AG to do operations αi:n = ‰Yi:nŽ under these circumstances acts as plotting position. e research. In  he was appointed full professor of statis- regression model reads: tics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the faculty of eco- , Xi:n = a + b αi:n + εi nomics and management science for three years. He got further appointments to the universities of Tübingen and where εi is a random variable expressing the dišerence between and its mean E . As the order Cologne and to Berlin Polytechnic. He was a member of Xi:n ‰Xi:nŽ = a+b αi:n statistics and – as a consequence – the disturbance the board of the German Statistical Society and in charge Xi:n of editing its journal Allgemeines Statistisches Archiv, now terms εi are neither homoscedastic nor uncorrelated we have to use – according to Lloyd () – the general-least- AStA – Advances in Statistical Analysis, from  to . squares method to nd best linear unbiased estimators of He is Co-founder of the Econometric Board of the German Society of Economics. Since  he is Elected member a and b. Introducing the following vectors and matrices: of the International Statistical Institute. 
He is Associate editor for Quality Technology and Quantitative Manage- ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ X n  α n ε ⎜ : ⎟ ⎜ ⎟ ⎜ : ⎟ ⎜  ⎟ ment. His scientic interests are wide-spread, ranging from ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎛ ⎞ nancial mathematics, business and economics statistics, ⎜ X:n ⎟ ⎜  ⎟ ⎜ α:n ⎟ ⎜ ε ⎟ a = ⎜ ⎟ = ⎜ ⎟ = ⎜ ⎟ = ⎜ ⎟ = ⎜ ⎟ x : ⎜ ⎟,  : ⎜ ⎟, α : ⎜ ⎟, ε : ⎜ ⎟, θ : ⎜ ⎟, ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ technical statistics to econometrics. He has written numer- ⎜ ⋮ ⎟ ⎜ ⋮ ⎟ ⎜ ⋮ ⎟ ⎜ ⋮ ⎟ ⎝ b ⎠ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ous papers in these elds and is author of several textbooks ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ Xn:n  αn:n εn and monographs in statistics, econometrics, time series A := ‰ Ꭰanalysis, multivariate statistics and statistical quality con- trol including process capability. He is the author of the the regression model now reads text e Weibull Distribution: A Handbook (Chapman and x = A θ + ε Hall/CRC, ).  L Logistic Normal Distribution
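Lloyd's GLS estimator can be tried numerically for the rectangular (uniform) case, one of the distributions with closed-form α and B: taking the standard uniform on (0, 1) as the reduced distribution, αi:n = i/(n + 1) and Cov(Ui:n, Uj:n) = αi:n(1 − αj:n)/(n + 2) for i ≤ j. A minimal Python sketch (the function name, seed and sample size are illustrative, not from the entry):

```python
import numpy as np

def lloyd_gls_uniform(sample):
    """Lloyd-type GLS (BLUE) estimates of location a and scale b from an
    ordered sample, with the standard uniform on (0, 1) as the reduced
    distribution, for which alpha and B have closed forms."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    alpha = np.arange(1, n + 1) / (n + 1)        # E(U_{i:n}) = i/(n + 1)
    # Cov(U_{i:n}, U_{j:n}) = alpha_i (1 - alpha_j) / (n + 2) for i <= j
    B = np.minimum.outer(alpha, alpha) * (1.0 - np.maximum.outer(alpha, alpha)) / (n + 2)
    A = np.column_stack([np.ones(n), alpha])     # design matrix (1, alpha)
    Binv_A = np.linalg.solve(B, A)               # B^{-1} A
    theta = np.linalg.solve(A.T @ Binv_A, Binv_A.T @ x)
    return theta                                 # (a_hat, b_hat)

rng = np.random.default_rng(1)
a_hat, b_hat = lloyd_gls_uniform(rng.uniform(2.0, 5.0, size=500))  # a = 2, b = 3
```

The same template applies to any reduced distribution for which α and B can be computed, numerically if necessary.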

Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat
Barnett V (1976) Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J ( ) MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H ( ) Location-scale distributions – linear estimation and probability plotting. http:geb.uni-giessen/geb/volltexte//

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution, see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti 2002) and different formulations may be appropriate in particular applications (Aitchison 1982). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1982, 1986).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses ri out of mi trials (i = 1, …, n), the response probabilities, Pi, are assumed to have a logistic-normal distribution with logit(Pi) = log(Pi/(1 − Pi)) ∼ N(µi, σ²), where µi is modelled as a linear function of explanatory variables x1, …, xp. The resulting model can be summarized as

Ri | Pi ∼ Binomial(mi, Pi)
logit(Pi) | Z = ηi + σZ = β0 + β1xi1 + ⋯ + βpxip + σZ
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model, see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods such as Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata and Genstat.

Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (1982),

E[Ri] = mi πi

and

Var(Ri) = mi πi (1 − πi)[1 + σ²(mi − 1)πi(1 − πi)],

where logit(πi) = ηi. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (mi = 1) it is not possible to have overdispersion arising in this way.

About the Author
For biography see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR (2001) Generalized, linear and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated.

Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as:

f(yi; πi) = πi^yi (1 − πi)^(1−yi).   (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972 Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yi; θi, ϕ) = exp{(yiθi − b(θi))/α(ϕ) + c(yi; ϕ)},   (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b′′(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli pdf (1) into exponential family form (2) as:

f(yi; πi) = exp{yi ln(πi/(1 − πi)) + ln(1 − πi)}.   (3)

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.
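As a quick numerical check of the cumulant calculations above (a sketch, not part of the entry): with the canonical link θ = ln(π/(1 − π)) the cumulant is b(θ) = −ln(1 − π) = ln(1 + e^θ), and finite differences recover the mean b′(θ) = π and variance b′′(θ) = π(1 − π); the value π = 0.3 and the step size are arbitrary choices.

```python
import numpy as np

def b(theta):
    """Bernoulli cumulant: with theta = ln(pi/(1 - pi)),
    b(theta) = -ln(1 - pi) = ln(1 + e^theta)."""
    return np.log1p(np.exp(theta))

pi = 0.3                                  # illustrative success probability
theta = np.log(pi / (1 - pi))             # canonical (logit) link
h = 1e-5                                  # finite-difference step

mean = (b(theta + h) - b(theta - h)) / (2 * h)                  # approximates b'(theta)
variance = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2  # approximates b''(theta)
```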

The traditional GLM symbol for the mean, µ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as:

L(µi; yi) = Σ(i=1..n) {yi ln(µi/(1 − µi)) + ln(1 − µi)},   (4)

or

L(µi; yi) = Σ(i=1..n) {yi ln(µi) + (1 − yi) ln(1 − µi)}.   (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ(i=1..n) {yi ln(yi/µi) + (1 − yi) ln((1 − yi)/(1 − µi))}.   (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of µ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(µi/(1 − µi)). We have, therefore,

xi′β = ln(µi/(1 − µi)) = β0 + β1x1 + β2x2 + ⋯ + βnxn.   (7)

The value of µi, for each observation in the logistic model, is calculated as

µi = 1/(1 + exp(−xi′β)) = exp(xi′β)/(1 + exp(xi′β)).   (8)

The functions to the right of µi are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, µ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than µ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ(i=1..n) (yi − µi)xi   (9)

∂²L(β)/∂β∂β′ = −Σ(i=1..n) {xi xi′ µi(1 − µi)}.   (10)

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of µ, ln(µ/(1 − µ)), where the odds are understood as the success of µ over its failure, 1 − µ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)
Survived     child    adults    Total
no              52       765      817
yes             57       442      499
Total          109     1,207    1,316

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.527. This value is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age reproduces this value of 0.527 as the exponentiated coefficient of age, together with its standard error, z statistic, p-value and confidence interval. With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had [1/0.527 ≈ 1.90] some 90% – or nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age were recorded as a continuous predictor in the Titanic data and the odds ratio were calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates; i.e., in estimating µ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, y, indicating the number of successes (1s) for a specific covariate pattern, and a denominator, m, the number of observations having that specific covariate pattern. The response y/m is binomially distributed as:

f(yi; πi, mi) = (mi choose yi) πi^yi (1 − πi)^(mi−yi),   (11)

with a corresponding log-likelihood function expressed as

L(µi; yi, mi) = Σ(i=1..n) {yi ln(µi/(1 − µi)) + mi ln(1 − µi) + ln(mi choose yi)}.   (12)

Taking derivatives of the cumulant, −mi ln(1 − µi), as we did for the binary response model, produces a mean of µi = mi πi and variance µi(1 − µi/mi).

In grouped format, each row of the data records y, the number of times a specific pattern of covariates is successful; cases, the number of observations having that specific covariate pattern; and the values of the predictors, say x1, x2 and x3. A first row with cases = 3 and y = 1, for instance, records three observations sharing one covariate pattern, of which one has a response of 1 and the other two of 0. All current commercial software applications estimate this type of logistic model using GLM methodology. The data may be restructured into individual observation format rather than grouped, with one row per observation and the same logic as described; modeling would result in identical parameter estimates. It is not uncommon to find a large individual-based data set being grouped into a far smaller number of covariate-pattern rows as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Rather than comparing observed with expected probabilities across a single set of levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC) (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats.
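The gradient and Hessian given in the text are all that a Newton–Raphson fit of the logistic model requires. A minimal sketch using the child/adult Titanic counts (the counts are those commonly reported for the Titanic passenger data and are assumed here); for a single binary predictor the fitted odds ratio must reproduce the table's cross-product ratio:

```python
import numpy as np

def fit_logit_newton(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson logistic regression: score sum_i (y_i - mu_i) x_i,
    Hessian -sum_i x_i x_i' mu_i (1 - mu_i)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (y - mu)
        hessian = -(X * (mu * (1.0 - mu))[:, None]).T @ X
        step = np.linalg.solve(hessian, score)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# expand the 2x2 table: (x = 1 adult / 0 child, y = 1 survived / 0 not, count)
cells = [(1, 1, 442), (1, 0, 765), (0, 1, 57), (0, 0, 52)]   # assumed counts
x = np.repeat([c[0] for c in cells], [c[2] for c in cells]).astype(float)
y = np.repeat([c[1] for c in cells], [c[2] for c in cells]).astype(float)
X = np.column_stack([np.ones_like(x), x])

beta = fit_logit_newton(X, y)
or_hat = np.exp(beta[1])          # exponentiated coefficient = odds ratio for age
```

Here or_hat equals the table's cross-product ratio (442/765)/(57/52) up to convergence tolerance, matching the "about half the odds" interpretation.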

However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and with Robert Muenchen is coauthor of R for Stata Users (Springer). Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the university's Athletic Hall of Fame (he was a two-time US champion track & field athlete); he is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG ( ) Logistic regression; a self-teaching guide. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London
Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software.

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

f(x) = exp{−(x − µ)/τ} / (τ[1 + exp{−(x − µ)/τ}]²),  −∞ < x < ∞,   (1)

and the cumulative distribution function is

F(x) = 1/[1 + exp{−(x − µ)/τ}],  −∞ < x < ∞.   (2)

The distribution is symmetric about the mean µ and has variance τ²π²/3, so that when comparing the standard logistic distribution (µ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = 1/[1 + exp{(x − µ)/τ}],  −∞ < x < ∞,

h(x) = (1/τ) · 1/[1 + exp{−(x − µ)/τ}].

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for µ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times, T, is to assume a logistic model for log T, giving a log-logistic model for T analogous to the lognormal model. The resulting hazard function based on the logistic distribution in (1) is

h(t) = (α/θ)(t/θ)^(α−1) / [1 + (t/θ)^α],  t, θ, α > 0,

where θ = e^µ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (µ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x   (3)

and uniquely characterizes the standard logistic distribution. Equation (3) provides a very simple way of simulating from the standard logistic distribution by setting X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (2) we take τX + µ.

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney (1971) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β0 + β1d,

which, by the identity (3), implies a logistic tolerance distribution with parameters µ = −β0/β1 and τ = 1/|β1|.

The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and for a two-level factor the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972).
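The logit identity above makes simulation a one-liner, X = log[U/(1 − U)] with U ∼ U(0, 1). A quick numerical check of this recipe, of the variance π²/3 of the standard logistic, and of the identity f(x) = F(x)[1 − F(x)] (sample size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=200_000)
x = np.log(u / (1.0 - u))       # X = log[U/(1 - U)] ~ standard logistic

t = np.linspace(-5.0, 5.0, 11)              # grid for the identity check
F = 1.0 / (1.0 + np.exp(-t))                # cdf with mu = 0, tau = 1
f = np.exp(-t) / (1.0 + np.exp(-t)) ** 2    # pdf with mu = 0, tau = 1
```

For the general distribution, τ·x + µ gives draws with location µ and scale τ.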

Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit Lorenz Curve model appeared well before the unifying treatment of J™„Z• Fu“Z• generalized linear models. A comprehensive treatment of 7 Professor Emeritus logistic regression including models and applications is 7 Folkhälsan Institute of Genetics, Helsinki, Finland given in Agresti () and Hilbe ().

Definition and Properties About the Author It is a general rule that income distributions are skewed. John Hinde is Professor of Statistics, School of Mathe- Although various distribution models, such as the Lognor- matics, Statistics and Applied Mathematics, National Uni- mal and the Pareto have been proposed, they are usually versity of Ireland, Galway, Ireland. He is past President applied in specic situations. For general studies, more of the Irish Statistical Association (–), the Sta- wide-ranging tools have to be applied, the rst and most tistical Modelling Society (–) and the European common tool of which is the Lorenz curve. Lorenz () Regional Section of the International Association of Sta- developed it in order to analyze the distribution of income tistical Computing (–). He has been an active and wealth within populations, describing it in the follow- member of the International Biometric Society and is the ing way: incoming President of the British and Irish Region (). He is an Elected member of the International Statisti- 7 Plot along one axis accumulated percentages of the popula- cal Institute (). He has authored or coauthored over tion from poorest to richest, and along the other, wealth held  papers mainly in the area of statistical modelling and by these percentages of the population. statistical computing and is coauthor of several books, e Lorenz curve L(p) is dened as a function of the including most recently Statistical Modelling in R (with proportion p of the population. L(p) is a curve starting M. Aitkin, B. Francis, and R. Darnell, Oxford University from the origin and ending at point (, ) with the follow- Press, ). He is currently an Associate Editor of Statis- ing additional properties (I) L(p) is monotone increasing, tics and Computing and was one of the joint Founding (II) L(p) ≤ p, (III) L(p) convex, (IV) L() =  and L() = . Editors of Statistical Modelling. 
e Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. ). e Lorenz curve satises the general rules: Cross References 7 A unique Lorenz curve corresponds to every distribution. The 7Asymptotic Relative E›ciency in Estimation contrary does not hold, but every Lorenz L(p) is a common 7Asymptotic Relative E›ciency in Testing curve for a whole class of distributions F(θ x) where θ is an 7Bivariate Distributions arbitrary constant. 7Location-Scale Distributions 7Logistic Regression e higher the curve, the less inequality in the income 7Multivariate Statistical Distributions distribution. If all individuals receive the same income, 7Statistical Distributions: An Overview then the Lorenz curve coincides with the diagonal from (, ) to (, ). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner References and Further Reading of the square. Agresti A () Categorical data analysis, nd edn. Wiley, New York Consider two Lorenz curves LX(p) and LY (p). If Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling LX(p) ≥ LY (p) for all p, then the distribution correspond- in R. Oxford University Press, Oxford ing to LX(p) has lower inequality than the distribution Berkson J () Application of the logistic function to bio-assay. corresponding to LY (p) and is said to Lorenz dominate the J Am Stat Assoc :– other. Figure  shows an example of Lorenz curves. Finney DJ () Statistical method in biological assay, rd edn. Griffin, London e inequality can be of a dišerent type, the corre- Hilbe JM () Logistic regression models. Chapman & Hall/CRC sponding Lorenz curves may intersect, and for these no Press, Boca Raton Lorenz ordering holds. is case is seen in Fig. . Under Johnson NL, Kotz S, Balakrishnan N () Continuous univariate such circumstances, alternative inequality measures have distributions, vol , nd edn. 
Under such circumstances alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini (). This index is the ratio of the area between the diagonal and the Lorenz curve to the whole area under the diagonal. This definition yields Gini indices satisfying 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
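The area-ratio definition translates directly into code. A minimal sketch (the helper name is ours; the area under the empirical curve is obtained with the trapezoidal rule, and the area under the diagonal is 1/2):

```python
import numpy as np

def gini_index(incomes):
    """Gini index: (area between diagonal and Lorenz curve) / (1/2)."""
    x = np.sort(np.asarray(incomes, dtype=float))
    p = np.arange(len(x) + 1) / len(x)
    L = np.concatenate(([0.0], np.cumsum(x) / x.sum()))
    area_under_L = (np.diff(p) * (L[:-1] + L[1:]) / 2).sum()  # trapezoids
    return (0.5 - area_under_L) / 0.5

print(gini_index([5.0, 5.0, 5.0, 5.0]))  # perfect equality -> 0.0
```

At the other extreme, concentrating all income on one of n individuals gives 1 − 1/n (e.g. 0.75 for gini_index([0, 0, 0, 1])), approaching 1 as n grows.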

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering; LX(p) ≥ LY(p)

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem. Let u(x) be a continuous monotone increasing function and assume that μY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists and
(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the level μY/μX once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the theorem indicates that if the ratio u(x)/x is increasing, then the Gini index increases, but this case has only minor practical importance. A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
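The three cases of the theorem are easy to illustrate numerically. In this sketch the incomes and the transformations u are made-up examples: u(x) = x**0.8 has u(x)/x decreasing (a progressive rule, case (I)), u(x) = 0.7x has u(x)/x constant (a flat tax, case (II)), and u(x) = x**1.2 has u(x)/x increasing (case (III)).

```python
import numpy as np

def lorenz_points(values):
    """Cumulative income shares L(i/n), i = 0..n, poorest to richest."""
    x = np.sort(np.asarray(values, dtype=float))
    return np.concatenate(([0.0], np.cumsum(x) / x.sum()))

incomes = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
L_X = lorenz_points(incomes)

L_prog = lorenz_points(incomes ** 0.8)  # case (I): u(x)/x decreasing
L_flat = lorenz_points(0.7 * incomes)   # case (II): u(x)/x constant
L_regr = lorenz_points(incomes ** 1.2)  # case (III): u(x)/x increasing

assert np.all(L_prog >= L_X - 1e-12)  # post-tax curve Lorenz dominates
assert np.allclose(L_flat, L_X)       # flat tax leaves the curve unchanged
assert np.all(L_regr <= L_X + 1e-12)  # inequality increases
```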


Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; using the Gini index, L1(p) has greater inequality than L2(p)

About the Author
Dr Johan Fellman is Professor Emeritus in statistics. He obtained his Ph.D. in mathematics (University of Helsinki) in  with the thesis On the Allocation of Linear Observations, and in addition he has published about  printed articles. He served as professor in statistics at Hanken School of Economics (–) and has been a scientist at Folkhälsan Institute of Genetics since . He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.

Cross References
Econometrics
Economic Statistics
Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U () On the measurement of the degree of progression. J Publ Econ
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see Decision Theory: An Introduction, and Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth there is no loss, so the value of the loss function is then zero; otherwise the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next, we describe examples of loss functions. First let us consider a test problem. Then Θ is divided into two disjoint subsets Θ0 and Θ1 describing the null hypothesis and the alternative set, Θ = Θ0 + Θ1. The usual loss function is then given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1,
L(θ, ϑ) = 1 if θ ∈ Θ0, ϑ ∈ Θ1 or θ ∈ Θ1, ϑ ∈ Θ0.

For point estimation problems we assume that Θ is a normed linear space and let |·| be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = |θ − ϑ|, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = ℝ. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : ℝ → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t²/2 if |t| ≤ γ, and ℓ(t) = γ|t| − γ²/2 if |t| > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ).
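The |t|^p family and the Huber loss above can be written as plain functions; a minimal sketch (the function names are ours). Note that the two Huber pieces agree at |t| = γ, so the loss is continuous there.

```python
import math

def power_loss(t, p=2):
    """l(t) = |t|**p: p = 2 is the square loss, p = 1 the robust L1 loss."""
    return abs(t) ** p

def huber_loss(t, gamma=1.0):
    """l(t) = t**2/2 if |t| <= gamma, else gamma*|t| - gamma**2/2."""
    if abs(t) <= gamma:
        return t * t / 2
    return gamma * abs(t) - gamma ** 2 / 2

# quadratic near zero, linear in the tails; both losses are symmetric
assert math.isclose(huber_loss(1.0), 0.5)   # boundary value t**2/2
assert math.isclose(huber_loss(3.0), 2.5)   # tail value 1*3 - 1/2
assert huber_loss(-3.0) == huber_loss(3.0)
```

The linear tails are what make the Huber loss robust: a gross outlier contributes only proportionally to its size, not quadratically.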
There are, however, many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian () introduced the LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
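A sketch of the LinEx loss as a plain function (the constants a = b = 1 are illustrative defaults). The loss vanishes only at t = 0, grows exponentially on one side and only about linearly on the other; which side is which depends on the sign conventions chosen for the error t and for a.

```python
import math

def linex_loss(t, a=1.0, b=1.0):
    """LinEx loss b*(exp(a*t) - a*t - 1), with tuning constants a, b > 0."""
    return b * (math.exp(a * t) - a * t - 1)

assert linex_loss(0.0) == 0.0              # no loss at the truth
assert linex_loss(3.0) > linex_loss(-3.0)  # asymmetric: e**3 - 4 vs e**-3 + 2
```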
7Decision eory: An Overview In case a density of the underlying distribution of the 7Entropy and Cross Entropy as Diversity and Distance data is known up to an unknown parameter the class of Measures divergence losses can be dened. Specic cases of these 7Sequential Sampling losses are the Hellinger and the Kulback-Leibler loss. 7Statistical Inference for Quantum Systems In regression, however, the loss is used in a dišer- ent way. Here it is assumed that the unknown location References and Further Reading parameter is an element of a known class F of real valued Bischoff W () Best ϕ-approximants for bounded weak loss functions. Given observations (data) ,..., observed functions. Stat Decis :– n y yn at design points ,..., of the experimental region a Varian HR () A Bayesian approach to real estate assessment. t tn loss function is used to determine an estimation for the In: Fienberg SE, Zellner A (eds) Studies in Bayesian economet- rics and statistics in honor of Leonard J. Savage, North Holland, L unknown regression function by the ‘best approximation’, Amsterdam, pp –