Faculty of Sciences, Department of Applied Mathematics and Computer Science

Semiparametric Efficiency

Karel Vermeulen

Prof. Dr. S. Vansteelandt

Thesis submitted to obtain the degree of Master in Mathematics, specialization Applied Mathematics

Academic year 2010-2011

To my parents, my brother Lukas
To my best friend Sara

A mathematical theory is not to be considered complete until you have made it so clear that you can explain it to the first man whom you meet on the street... David Hilbert

Preface

My interest in semiparametric theory awoke several years ago, two and a half to be precise. At that time, I was in my third year of mathematics and had to choose a subject for my Bachelor thesis. My choice was a geometrical approach to the asymptotic efficiency of estimators, based on the monograph by Anastasios Tsiatis, [35], under the supervision of Prof. Dr. Stijn Vansteelandt. In this manner, I entered the world of semiparametric efficiency. However, at that point I did not know it was just the beginning. Shortly after I wrote this Bachelor thesis, Prof. Dr. Stijn Vansteelandt asked me to be involved in research on semiparametric inference for so-called probabilistic index models, in the context of a one-month student job. Under his guidance, I applied semiparametric estimation theory to this setting and obtained the semiparametric efficiency bound against which the efficiency of estimators in this model can be measured. Some results of this research are contained within this thesis. While short, this experience really convinced me that I wanted to write a thesis on semiparametric efficiency. That feeling was only strengthened after following the course Causality and Missing Data, taught by Prof. Dr. Stijn Vansteelandt. This course showed me how semiparametric estimators are used in real-life applications. Hence, my decision was made: I really wanted to write a thesis on semiparametric efficiency, and after some meetings with Prof. Dr. Stijn Vansteelandt, my decision became reality. During the month of August 2010, I had several talks with Prof. Dr. Stijn Vansteelandt about which monographs and articles we should use, since there is so much information available. Luckily, semiparametric theory is one of the interests of Prof. Dr. Stijn Vansteelandt, so he really could guide me through the great amount of available information. Shortly after these talks, I started studying the monograph by Aad van der Vaart, [40]. It was really tough to go through this manuscript and, in addition, to understand everything that was written down. It gave me some uneasiness, since I really lost track among all the abstract definitions and I did not quite understand the relation with the monograph by Anastasios Tsiatis. This feeling largely determined the structure of my thesis: studying both theories in much detail and pointing out the relations and differences. That is why my thesis is partitioned into different parts: Part II describes the theory as presented in the monograph by Anastasios Tsiatis, and Part III describes the theory as presented in the monograph by Aad van der Vaart. Part I is primarily dedicated to the basics of functional analysis, on which semiparametric theory heavily relies. As I went through the literature, these relations and differences between both approaches to semiparametric efficiency became clear to me. While the abstract theory is developed in Part III, the relations and differences with Part II are indicated, and it is shown that Part III is a proper generalization of Part II. Some hints were given in the landmark paper by Whitney Newey, [27]. In my opinion, this paper is a mix of both theories and actually a nice transition from the geometrical approach to semiparametric efficiency to the more abstract approach, although the abstract approach is also of a quite geometrical nature. It was my experience that the literature is often so difficult to understand that it is not attractive

anymore. Hence, this thesis is also written with the intention that those interested in semiparametric theory, just entering the magnificent world of semiparametric models and estimation, have a nice document available that explains the theory in great detail without being too complicated or obscure, or so I hope. Finally, I also want to use this opportunity to thank everyone who in some way or other has contributed to this thesis.

• First and most importantly, I want to thank my promoter Stijn. He made it possible for me to write this thesis and stood by my side while I was entering the world of semiparametric efficiency. I want to thank Stijn for all the hours we spent in his office discussing all the questions I had, pointing out my mistakes, making improvements and sharing his ideas with me. I also want to thank him for all the opportunities he gave me during the past five years. He really made my interest grow to pleasant heights. In summary, without Stijn it would not have been possible to write this thesis, and it was a great time to write it under his supervision. In addition, I cannot believe how much I learned during the last year thanks to Stijn. I cannot imagine a better coach than Stijn.

• Next, I want to thank my parents, Ann and Philip, who let me choose my studies independently and supported me unconditionally during the past five years. They gave me the great opportunity to study mathematics in Ghent. Especially, I want to thank my mother Ann for always listening to my (to her) boring and tedious explanations whenever I was excited about the insights I obtained in semiparametric theory. She always tried to listen with full attention, even though she did not understand a thing of what I was saying. I also want to thank my little brother Lukas, who always made me laugh with his funny jokes and remarks about my passion for mathematics.

• Another crucial person I want to thank is my best friend and study mate, Sara. Writing a thesis involves a lot of solitude. Luckily, I could rely on her presence during the dark winter months, as she stayed at my place for days on end while we worked together on our own theses. Sara was (and still is) a great refuge when I was a bit down because writing this thesis was so difficult and sometimes I just wanted to give up. I cannot imagine how this year would have been without her presence and support. In addition, she gave me a lot of advice on the difficulties I encountered when typing my thesis in LaTeX, as well as hints for making nice figures to illustrate the theory.

• It may sound a bit funny, but implicitly I also want to thank Lady Gaga for her music, which really helped me to relax during the periods I was really stressed. Her music helped me to escape from reality sometimes when I needed it.

• I also want to thank the rest of my family, other study mates and friends. Especially my sister Annelies, my niece Helena, my aunt Greta and my friends Machteld, Catherine, Lotte, Wendy, Elien, Xavier, Lieselot, Bert and many others. Further, the intellectual progress I made during the past five years would not have been possible without the educational staff of the University of Ghent, so many thanks to them as well.

• Many thanks to Jan as well for reading some parts of this thesis and pointing out some typographical matters.

Karel Vermeulen
Ghent, May 2011

Permission for Use

The author gives permission to make this master's thesis available for consultation and to copy parts of this master's thesis for personal use. Any other use is subject to the limitations of copyright law, in particular with regard to the obligation to explicitly state the source when quoting results from this master's thesis.

Karel Vermeulen, May 2011

Dutch Summary

Introduction

Parametric models, roughly speaking models described by a finite-dimensional parameter, and estimation theory for parametric models have historically received a great deal of attention in the literature. The best-known statistical methods are parametric. A very important result is the asymptotic normality and asymptotic efficiency of the maximum likelihood estimator (MLE). This result has its origin in the work of Fisher in the 1920s. Another fundamental result in parametric estimation theory is the Cramér-Rao lower bound for unbiased estimators, which originates in the work of Cramér and Rao in the 1940s. We may thus conclude that a great deal of theory is available when we wish to estimate parameters in parametric models.

In the second half of the previous century, however, a new type of model was introduced: semiparametric models. These can be seen as the intermediate case between parametric models and nonparametric models. Roughly speaking, semiparametric models are models described by a parameter that contains both a finite-dimensional and an infinite-dimensional part. The finite-dimensional part (or at least a component of it) is usually the parameter we are interested in, and the infinite-dimensional part is usually called the nuisance parameter.

Only in recent decades have semiparametric models received increasing attention. This attention was motivated mainly by model misspecification. The semiparametric approach to model misspecification is to leave certain parts of the joint density function describing the model completely unspecified. By keeping the parameter space infinite-dimensional, we impose far fewer restrictions on the distribution our observed data may have. A consequence is that solutions, if they exist and are reasonable, have wider applicability. Semiparametric models are an important added value compared with fully nonparametric models, because semiparametric models perform better with small amounts of data than nonparametric models, which tend to perform rather poorly with small amounts of data.

Part I: Introduction and Mathematical Background

The first, introductory part of this thesis is meant to spark the reader's interest in semiparametric theory and to familiarize the reader with the mathematical machinery we will need.

Chapter 1: Introduction

The first chapter familiarizes the reader with semiparametric models. We briefly describe a number of examples of semiparametric models chosen from the wide range of

available examples. In particular, we introduce the restricted moment model, which models the conditional expectation of a response variable Y as a function of a number of covariates X, i.e., E(Y|X) = µ(X, β), where β is a finite-dimensional parameter. This type of model is then extended to partially linear regression models. Another type of model that we will study in more detail later is a statistical model in which it is only assumed that the density is symmetric. Afterwards we also introduce mixture models and Cox's proportional hazards model. The latter is one of the most widely used models in survival analysis.

Chapter 2: Banach and Hilbert Spaces

In the second chapter, some basic results from functional analysis, on which semiparametric theory relies heavily, are summarized. First we introduce the basic notions of a Banach space B relative to a norm ‖·‖ and a Hilbert space H relative to an inner product ⟨·,·⟩, which are indispensable for the development of semiparametric theory. Once the tone is set, we give a short overview of important concepts such as linear functionals, dual spaces, linear operators, adjoint operators and inverse operators. The chapter closes with one of the most important concepts from Hilbert space geometry, which will be used extensively: orthogonality in Hilbert spaces with respect to an inner product. We discuss the projection theorem, which lies at the basis of the geometric approach to the asymptotic efficiency of estimators in statistical models. We also elaborate on the concept of a projection operator, which will turn out to be of fundamental importance as well. Finally, we consider an application of the projection theorem that will prove useful for later purposes.

Part II: Introduction to Semiparametric Efficiency for the Class of RAL Estimators

In the second part of this thesis we acquaint the reader with the basic notions of semiparametric theory. This part is based on the first five chapters of the book Semiparametric Theory and Missing Data by Anastasios Tsiatis, see [35], and partly on the article Semiparametric Efficiency Bounds by Whitney Newey, see [27].

Chapter 3: Geometry of Influence Functions

In the third chapter we introduce the geometry of influence functions for parametric models. First we introduce the Hilbert space H^q of q-dimensional mean-zero random functions with finite variance. This is a closed subspace of the Hilbert space of q-dimensional square integrable functions, with the covariance taken as the inner product. This space is of fundamental importance in the second part, and RAL estimators (introduced further on) can be viewed as points in this Hilbert space. Next, we take the time to emphasize which type of estimators we will consider in this part, the so-called regular asymptotically linear (RAL) estimators. Regularity roughly corresponds to the property that the limiting distribution of the estimator β̂_n for the parameter β does not depend on the underlying local data generating process, and asymptotic linearity means that

√n(β̂_n − β_0) = (1/√n) Σ_{i=1}^n ϕ(X_i) + o_P(1),

where ϕ is the unique influence function of the estimator. Once we have fixed the type of estimators to which we restrict ourselves in the second part, we give a theorem that will allow us

to represent these RAL estimators as points of the Hilbert space H^q by means of their corresponding influence function, see Theorem 3.2.

Once we have come this far, we can introduce one of the most important concepts used in this thesis, which will enable us to visualize semiparametric efficiency: the tangent space J and the nuisance tangent space Λ. Note that we can only speak of a nuisance tangent space when the parameter can be partitioned neatly into the part we are interested in and the nuisance parameter. In this section we discuss a new result on the construction of a RAL estimator given an influence function. Afterwards we present the implications of the above-mentioned result, which mainly follow from the fact that the influence function of a RAL estimator is orthogonal to the nuisance tangent space Λ. We close the section with a new result: the construction of a RAL estimator when the parameter of interest is a function of the total parameter describing the model. The conclusion of these results is that every RAL estimator can be represented by a point in the Hilbert space H^q and that the most efficient RAL estimator corresponds to the influence function with the smallest variance, i.e., the influence function that lies closest to the origin.

Finally, after all these technical results, we can search for the efficient influence function, i.e., the influence function with the smallest variance. We show that the set of all influence functions forms a linear variety in the Hilbert space H^q, see Theorem 3.4. This theorem enables us to find the efficient influence function, see Theorem 3.5. The conclusion is that we obtain it as the projection of an arbitrary influence function onto the tangent space J. We also obtain an explicit form for this efficient influence function. The chapter closes with a study of the case where the parameter can be partitioned neatly, and we see that the efficient influence function can be viewed as a rescaling of the efficient score.

Chapter 4: Extension to Semiparametric Models

In the fourth chapter we extend the geometry of influence functions to semiparametric models. This is done via so-called parametric submodels. The largest part of this chapter focuses on the situation where the parameter describing the model can be partitioned into the parameter of interest and the infinite-dimensional nuisance parameter. Once we have explained how the influence function of a semiparametric RAL estimator relates to the influence functions for parametric models, we introduce the concept of the semiparametric nuisance tangent space. This space is of fundamental importance and is defined as the mean-square closure of the union of all nuisance tangent spaces of all parametric submodels. Afterwards we can also give the analogue of Theorem 3.2 from the third chapter, see Theorem 4.1. This theorem will enable us to make geometric interpretations about influence functions of RAL estimators analogous to those in the third chapter. Further, we also define the important concept of the semiparametric efficiency bound.

Finally, we also look for the efficient influence function for a semiparametric model. As described in [35], we first do this for the case where the parameter can be partitioned neatly. In that case we find that the efficient influence function is a rescaling of the efficient score. Moreover, we show that the semiparametric efficiency bound is exactly equal to the variance of the efficient influence function. Afterwards we discuss in much detail how to find the efficient influence function for a semiparametric model in which the parameter we wish to estimate is a function of the infinite-dimensional parameter describing the model. In this case we do not have a nuisance tangent space but only the tangent space J. Little attention is given to this case in [35]; the remainder of that section in this thesis is therefore mainly own work. We define the semiparametric efficiency bound in this case, but we can only do so if we assume that at least one RAL estimator exists. Moreover, we define the efficient influence function as the projection of an arbitrary influence function onto the tangent space J. We prove in much detail that the variance of the efficient influence function equals the semiparametric efficiency bound. We close the chapter with some tools for constructing tangent spaces. In particular, we derive the tangent space for a nonparametric model and we partition the Hilbert space H^q following the partition of a joint density function into conditional density functions.

Chapter 5: Applications

In the fifth and final chapter of this part we discuss some nice applications of the theory developed in the third and fourth chapters. First we thoroughly study the restricted moment model for a continuous outcome Y. We describe in detail how to find the nuisance tangent space Λ, how to obtain the orthogonal complement Λ^⊥ and, subsequently, how a unique RAL estimator corresponds to each element of Λ^⊥. The result is that we obtain the GEE estimators, and we can then also easily derive the most efficient GEE estimator. We also briefly describe how to find the most efficient estimator when Y is, for example, binary.
In a next section we briefly discuss a concept that shows up in many semiparametric models: adaptive estimation. Briefly, this means that we can estimate the finite-dimensional part of the parameter in a semiparametric model equally well whether or not certain infinite-dimensional parts of the model are known. This is applied to some simple examples.

In the third section of this chapter we give an example of a semiparametric model in which the parameter describing the model cannot be partitioned neatly: estimating a treatment effect in a pretest-posttest study or, more generally, estimating a treatment effect in a randomized study with baseline covariates. We derive the tangent space together with its orthogonal complement and then indicate that a unique RAL estimator corresponds to each element of the orthogonal complement. Afterwards we derive the most efficient estimator and we see how the pretest measurement, or more generally the baseline covariates, can be used to gain efficiency. We also take the opportunity to investigate under which minimal conditions baseline covariates can be used to gain efficiency, and we discuss the concept of so-called auxiliary variables. We end this section with some technical results that are own work.

As the last part of this chapter we discuss so-called probabilistic index models, proposed by Prof. Dr. Olivier Thas. The results in this section are entirely own work and came about after a vacation job under the supervision of Prof. Dr. Stijn Vansteelandt. We only indicate how the nuisance tangent space can be constructed. For finding the orthogonal complement and deriving the efficient score we refer to [43], because the resulting expressions are very technical and complicated.

Part III: Abstract Approach to Semiparametric Efficiency

In the third part of this thesis we introduce an abstract approach to semiparametric theory. This part is mainly based on the lecture notes Semiparametric Statistics by

Aad van der Vaart, see [40]. In particular, we discuss the first three lectures. This rather difficult manuscript written by Aad van der Vaart contains ten lectures in total. Unfortunately, we had to restrict ourselves to the first three lectures in order not to lose sight of the goal of this thesis: to give an accessible account of abstract semiparametric theory. Nevertheless, the foundations are well covered in these three lectures. The further lectures treat more specialized topics, for example how empirical process theory can be used for the evaluation of semiparametric estimators. This, however, is not discussed in this thesis. These ten lectures demonstrate the richness of semiparametric theory, and readers whose appetite for semiparametric theory has been whetted after reading this thesis are therefore gladly referred to [40]. A great help in writing this part was also the book Asymptotic Statistics by Aad van der Vaart, see [39].

Chapter 6: Tangent Sets and Efficient Influence Function

At the beginning of this chapter we indicate how we represent the parameter we wish to estimate: as a functional ψ defined on a statistical model P. Next we introduce differentiable paths and see how score functions can be introduced in an abstract way, in a mean-square sense. This differentiability in mean-square sense implies other types of convergence; this is the content of Proposition 6.1, which is an own result. We also define tangent sets as an arbitrary collection of scores, an extension of the notion of tangent space from the second part. Next we show that score functions have finite variance and mean zero, the fundamental properties of a score function. Moreover, we explain the relation with scores defined in the ordinary sense. We then also prove in detail the asymptotic normality of the log likelihood ratio. This is a result of fundamental importance for describing lower bounds for the asymptotic efficiency of estimators.

In the second section we define differentiable functionals, the parameters to which we will restrict ourselves. It is discussed extensively that the RAL estimators from the second part satisfy this definition, so that this is indeed an extension. We then introduce the efficient influence function in an abstract way and argue that it is an extension of the efficient influence function from the second part. The chapter closes with some examples to clarify the heavy theory. First we discuss a parametric model, then a nonparametric model, and we end with Cox's proportional hazards model. The proofs of these examples are mainly own work.

Chapter 7: Lower Bounds

In this chapter we discuss a number of lower bounds. First we argue why these lower bounds are so important. One of the most important reasons is that they quantify how much efficiency we lose by taking a semiparametric approach instead of a parametric one, so that we can make a trade-off between loss of efficiency and the problem of model misspecification. We first discuss in detail the semiparametric efficiency bound from this abstract point of view. Afterwards we discuss some deep theorems concerning the asymptotic efficiency of semiparametric estimators.
We first argue why this is not a trivial task, because we no longer restrict ourselves to RAL estimators. We illustrate this with a famous, or rather notorious, example: the super-efficient estimator constructed by Hodges. Afterwards we discuss the local asymptotic minimax theorem (LAM theorem), see Theorem 7.3. This theorem is applicable to all possible semiparametric estimators in a given semiparametric model, by considering shrinking neighbourhoods around the true distribution. Next we introduce regular estimators in a rigorous way and describe the convolution theorem, see Theorem 7.4. The latter, however, is only applicable to regular estimators. To indicate that asymptotic efficiency cannot be defined in an absolute sense, we illustrate this with the Stein shrinkage estimator. We close this section with an attempt to define an asymptotically efficient estimator and show that the empirical distribution is asymptotically efficient in a nonparametric model.

In the last section of this chapter we consider semiparametric models whose parameter can be partitioned neatly and we obtain results analogous to (but more general than) those in the fourth chapter of this thesis. The proofs are mainly own work. The chapter closes by applying the theory to the restricted moment model and to a model in which we only assume that the densities are symmetric. These derivations are also mainly own work.

Chapter 8: Calculus of Scores

With Chapters 6 and 7, the foundations of semiparametric efficiency have been laid and we have generalized the results from the second part. This chapter already contains a more advanced topic in semiparametric theory. It provides us with a way to determine the efficient influence function in arbitrary semiparametric models. In this eighth chapter the general theory of score operators is introduced. In the first section of this chapter we introduce the theory of score operators for a semiparametric model of the form

P = {P_η : η ∈ H},

where H itself represents a semiparametric model, possibly nonparametric. As announced, we introduce score operators, but also adjoint score operators and the information operator. We investigate how these objects are related to the efficient influence function. We note that most proofs and derivations are own work; the fundamental ideas, however, come from [40]. In the second section we extend this theory to semiparametric models of the form

P = {P_η : η ∈ H},

but where H now does not necessarily represent a semiparametric model and may be an arbitrary infinite-dimensional set. We extend the concepts of a score operator, adjoint score operator and information operator to this setting. We apply this to a parametric model and obtain familiar results. The worked example is entirely own work.

Chapter 9: Applications to the Calculus of Scores

In the last chapter of this thesis we consider some applications of the general theory of score operators. First of all, we consider information loss models. In this case we observe a variable X that is a known transformation of an unobserved Y, i.e., X = m(Y), where the distribution η of Y belongs to a given semiparametric model H. Deriving the score operator in this case is intuitively immediately clear: it is the conditional expectation. It is, however, very technical to prove this. Nevertheless, this proof is explained in detail. Once this is proved, it is trivial to determine the adjoint score operator.

In the second section of this chapter we apply this to mixture models. First we discuss mixture models with a known kernel and then, very briefly, with a kernel belonging to a parametric model. Even though densities in this model can exhibit very special properties, we prove that the most efficient estimator in such models is the empirical distribution. The derivations are mainly own work and, in addition, we show that the tangent set is a convex cone. The proof of this is also own work.

Finally, we apply the general theory of score operators from the previous chapter to semiparametric models in a strict sense, in which the parameter can be partitioned neatly. Under suitable conditions we can give an explicit expression for the projection operator onto the nuisance tangent space and, consequently, also for the efficient influence function for the finite-dimensional parameter. Moreover, we take the opportunity to derive an expression for the efficient influence function for a one-dimensional function of the infinite-dimensional nuisance parameter. All derivations are described in detail, so that it is clear how we arrive at the final expressions.

The last section of this chapter deals with finding the efficient influence function in the Cox model, but now under right censoring, which makes the computations rather complex. Nevertheless, the results are quite elegant. First we derive the score operator, which at first sight looks quite simple. Showing that this operator is bounded, however, is very complex; this proof is entirely own work. Next we derive the adjoint operator. The resulting expression is very complex. Nevertheless, using some ingenious ideas of van der Vaart, we obtain, after quite a bit of calculation, an elegant expression for the efficient score. The efficient information matrix can be obtained by using martingale theory. With a remark on adaptive estimation in this model and the connection with Cox's partial likelihood, we conclude this thesis.

Appendices

We further note that at the end of this thesis, in the appendices, some basic notions from asymptotic statistics and estimation theory are summarized. We also state the law of iterated expectation and the law of iterated variance. The law of iterated expectation is applied numerous times in this thesis, and it is fair to remark that this result is also of fundamental importance for the development of semiparametric theory.

Contents

Preface
Permission for Use
Dutch Summary
Table of Contents
List of Figures
Notation

I Introduction & Mathematical Background

1 Introduction
  1.1 Basic Notation and Definitions
  1.2 Why Semiparametric Theory?
  1.3 Examples of Semiparametric Models
    1.3.1 Restricted Moment Model
    1.3.2 Partially Linear Regression
    1.3.3 Symmetric Location
    1.3.4 Mixture Models
    1.3.5 Proportional Hazards Model

2 Banach and Hilbert Spaces
  2.1 Definitions and Basic Properties
    2.1.1 Banach Spaces
    2.1.2 Hilbert Spaces
  2.2 Linear Functionals and Dual Spaces
  2.3 Linear Operators, Adjoints and Ranges
    2.3.1 Linear Operators
    2.3.2 Adjoint Operators
    2.3.3 Ranges and Inverse Operators
  2.4 Orthogonality and Projections
    2.4.1 Orthogonality
    2.4.2 Orthogonal Projections
    2.4.3 Applications of the Projection Theorem

II Introduction to Semiparametric Efficiency for the Class of RAL Estimators

3 Geometry of Influence Functions
  3.1 The Space of Mean-Zero q-dimensional Random Functions
  3.2 Influence Functions and (Regular) Asymptotically Linear Estimators
    3.2.1 Asymptotically Linear Estimators and Influence Functions
    3.2.2 Regular Asymptotically Linear Estimators
  3.3 Geometry of Influence Functions for Parametric Models
    3.3.1 The Case θ = (β^T, η^T)^T
    3.3.2 The Case β = β(θ)
    3.3.3 Importance
  3.4 Efficient Influence Function
    3.4.1 Comparing Variances when Dimension is Greater than One
    3.4.2 Deriving the Efficient Influence Function

4 Extension to Semiparametric Models
  4.1 Parametric Submodels
  4.2 Influence Functions for Semiparametric RAL Estimators
  4.3 Semiparametric Nuisance Tangent Space
  4.4 Efficient Influence Function
    4.4.1 The Case θ = (β^T, η^T)^T
    4.4.2 The General Case β = β(θ)
  4.5 Some Tools for Practical Applications
    4.5.1 Tangent Space for Nonparametric Models
    4.5.2 Partitioning the Hilbert Space

5 Applications
  5.1 Restricted Moment Model
    5.1.1 Tangent Space and Nuisance Tangent Space for Parametric Submodels
    5.1.2 Semiparametric Tangent Space and Nuisance Tangent Space
    5.1.3 The Class of Influence Functions
    5.1.4 The Efficient Influence Function
    5.1.5 Some Additional Notes on the Restricted Moment Model
  5.2 Adaptive Estimation
  5.3 Estimating Treatment Difference Between Two Treatments
    5.3.1 Model Description and Preliminary Estimator
    5.3.2 The Tangent Space and Its Orthogonal Complement
    5.3.3 The Efficient Influence Function
    5.3.4 Auxiliary Variables
    5.3.5 Some Technical Results
  5.4 Probabilistic Index Models
    5.4.1 Model Formulation
    5.4.2 Nuisance Tangent Space for a Parametric Submodel
    5.4.3 Semiparametric Nuisance Tangent Space

III Abstract Approach to Semiparametric Efficiency

6 Tangent Sets and Efficient Influence Function
  6.1 Score Functions and Tangent Sets
  6.2 Differentiable Functionals and Efficient Influence Function
  6.3 Examples
    6.3.1 Parametric Model
    6.3.2 Nonparametric Model
    6.3.3 Proportional Hazards Model

7 Lower Bounds
  7.1 Why Are Semiparametric Efficiency Bounds so Important?
  7.2 Parametric Point of View
  7.3 Semiparametric Efficiency Bounds
    7.3.1 Semiparametric Efficiency Bound
    7.3.2 Some Lower Bound Theorems
    7.3.3 Empirical Distribution
  7.4 Semiparametric Models in a Strict Sense
    7.4.1 Efficient Score Function and Efficient Information Matrix
    7.4.2 Restricted Moment Model, Revisited
    7.4.3 Symmetric Location
    7.4.4 The Infinite Bound Case

8 Calculus of Scores
  8.1 Introduction
  8.2 Score and Information Operators
    8.2.1 Semiparametric Models Indexed by a Probability Measure η Contained in a Model H
    8.2.2 Semiparametric Models Indexed by a Parameter η Contained in an Arbitrary Set H

9 Applications to the Calculus of Scores
  9.1 Information Loss Models
  9.2 Mixture Models
    9.2.1 Mixtures with Known Kernel p(x|z)
    9.2.2 Semiparametric Mixtures with Kernel p_θ(x|z)
  9.3 Semiparametric Models in a Strict Sense
    9.3.1 Efficient Influence Function for θ
    9.3.2 Efficient Influence Function for χ(η)
    9.3.3 Conclusion
  9.4 Cox's Proportional Hazards Model Under Right Censoring
    9.4.1 Building the Model Under Right Censoring
    9.4.2 Known Covariate Distribution p_Z(z) and Conditional Censoring Distribution F_{C|Z}(c|z)
    9.4.3 Unknown Covariate Distribution p_Z(z) and Conditional Censoring Distribution F_{C|Z}(c|z), Adaptive Estimation

Appendices

A Fundamentals about Asymptotic Statistics
  A.1 Stochastic Convergence
    A.1.1 Types of Convergence
    A.1.2 Properties of Stochastic Convergence
    A.1.3 Stochastic o and O Symbols
  A.2 Asymptotic Properties of Estimators

B Law of Iterated Expectation and Variance

Bibliography

Index

List of Figures

2.1 Projection onto a linear subspace
3.1 Depiction of a linear variety V
4.1 The class of influence functions for semiparametric RAL estimators
4.2 Projection of S_β onto Λ and Λ_γ
4.3 Projection of ϕ_a onto J^⊥ and J_γ^⊥
4.4 Projection of ϕ_a onto J and J_{γj}
7.1 Examples of bowl-shaped subconvex loss functions
7.2 When θ ≠ 0, P_θ(X_n ≠ S_n) → 0
7.3 When θ = 0, P_0(|X_n| < n^{−1/4}) → 1
7.4 When θ_n = n^{−1/3}, P_{θ_n}(S_n = 0) → 1
7.5 Quadratic risk functions of the Hodges estimator

Notation

□  end of a proof
∼  distributed as
△  end of an example
⊥⊥  independence
≲  smaller than up to a constant
∧  minimum
⊕  direct sum of subspaces of a Hilbert space
0_{q×r}  matrix of zeros
A*  adjoint operator of an operator A
A_η  score operator, general (Part III)
A_η*  adjoint score operator (Part III)
A_η* A_η  information operator (Part III)
A, B, C, U  σ-algebra
α_{θ,η}, α_{θ,η}*  score operator for θ and its adjoint for a model P_{θ,η} (Part III)
A_{θ,η}, A_{θ,η}*  total score operator and its adjoint for a model P_{θ,η} (Part III)
B_{θ,η}, B_{θ,η}*  score operator for η and its adjoint for a model P_{θ,η} (Part III)
(B, ‖·‖)  normed space with corresponding norm
B*, H*  dual space
β  parameter of interest indexing a model
D(f), D(A)  domain of definition
→_D  convergence in distribution
ε, e  error term
E  expectation
expit  logistic regression function
F_n  empirical distribution function
ϕ(X)  influence function (Part II)
ϕ^eff(X)  efficient influence function (Part II)
ϕ^eff_{β,γ}(X), ϕ^eff_γ(X)  efficient influence function for parametric submodel (Part II)
g  score function (Part III)
Γ(θ)  matrix of partial derivatives (Part II)
H  infinite-dimensional set (Part III)
Ḣ_η  tangent set (Part III)
(H, ⟨·,·⟩)  inner product space with corresponding inner product
H_1 × H_2  direct sum of Hilbert spaces
H_0^⊥  orthogonal complement
H_η  subset of a Hilbert space indexing directions for η in H (Part III)
H^q  space of mean-zero q-dimensional random functions
λ(t), λ(t|Z)  hazard function and conditional hazard function
Λ(t), Λ(t|Z)  cumulative hazard function and conditional cumulative hazard function
Λ  nuisance tangent space (Part II)
Λ_γ  nuisance tangent space of parametric submodel (Part II)
I  identity operator
I(x ∈ A)  indicator function for a set A
I_{q×q}  identity matrix
I(θ)  Fisher information matrix (Part II)
I_θ  Fisher information matrix (Part III)
Ĩ_{θ,η}  efficient information matrix (Part III)
1  identity function
⟨·,·⟩  inner product
J  tangent space (Part II)
J_β  tangent space for the parameter of interest (Part II)
J_γ  tangent space for parametric submodel (Part II)
J  tangent space of the nonparametric model (Part II)
J_j  tangent space of the j-th component of the nonparametric model (Part II)
lin  linear span
ℓ  bowl-shaped subconvex loss function (Part III)
ℓ_Y  Lebesgue measure corresponding with the random variable Y
ℓ̃_{θ,η}  efficient score function (Part III)
ℓ̇_θ  score function for the parametric component (Part III)
L(B_1, B_2)  space of linear bounded operators between two Banach spaces
lim inf  lower limit
lim sup  upper limit
L_2(µ)  space of square integrable functions with respect to a measure µ
L_2^0(η), L_2^0(P_η)  space of mean-zero square integrable functions with respect to η, P_η
L_P  limiting distribution of an estimator sequence under P
N(A)  kernel of an operator A
N(µ, Σ)  normal distribution
η  nuisance parameter indexing a model
η̇_{P_{θ,η}}  nuisance tangent set (Part III)
‖·‖  norm
o_P(1)  term that converges in probability to zero (stochastic order symbol)
O_P(1)  bounded in probability (stochastic order symbol)
(Ω, U, P)  probability space
P  probability measure
P  probability distribution
P_θ  probability distribution of a parametric model
P_{β,η}, P_{θ,η}  probability distribution of a semiparametric model in a strict sense
P g  expectation of g under P, i.e., ∫ g dP
P  statistical model
P_{β,γ}, P_γ  parametric submodel (Part II)
Ṗ_P  tangent set of the model P at P (Part III)
P(ψ̃_P ψ̃_P^T)  semiparametric efficiency bound (Part III)
P_n  empirical distribution
→_P  convergence in probability
ψ(P)  functional, parameter to be estimated (Part III)
ψ̇_P  derivative map of a differentiable functional (Part III)
ψ̃_P, ψ̃_{θ,η}, ψ̃_{P_η}  (efficient) influence function (Part III)
ψ̃_{θ,η}^θ  efficient influence function for ψ(P_{θ,η}) = θ (Part III)
ψ̃_{θ,η}^{χ(η)}  efficient influence function for ψ(P_{θ,η}) = χ(η) (Part III)
Π(·|H_0)  orthogonal projection operator
Π_{θ,η}  orthogonal projection operator onto the closed linear span of the nuisance tangent set (Part III)
R(A)  range of an operator A
R^d  d-dimensional Euclidean space
S^eff  efficient score (Part II)
S^eff_{β,γ}  efficient score for parametric submodel (Part II)
S_θ, S_β, S_η, S_γ  score vector (Part II)
T_n  estimator sequence
t ↦ P_t  differentiable path (Part III)
τ  end of a survival study
·^T  transpose
θ  parameter indexing a model
Θ  open subset of a finite-dimensional space
µ, µ_t  dominating measure
ν_X, ν_Y, ν_Z, ν_j, ν  dominating measure
V  linear variety of influence functions, parametric model
V*  linear variety of influence functions, semiparametric model
V  linear variety of influence functions, restricted moment model
var  variance
x ⊥ y  orthogonal elements
(X, A), (Y, B), (Z, C)  measurable space
(X, A, P)  measure space
χ(·)  functional to be estimated as a function of a parameter (Part III)
χ̇_θ  matrix of partial derivatives of the function χ(θ) (Part III)
χ̇_η  derivative map as a functional (Part III)
χ̃_η  (efficient) influence function (Part III)

Part I

Introduction & Mathematical Background


Chapter 1

Introduction

1.1 Basic Notation and Definitions

Statistical problems are described using probability models. That is, the data are envisioned as a realization of a vector of random variables X_1, ..., X_n. Each X_i can be a vector of random variables itself, corresponding to the data collected on the i-th individual in a sample of n individuals chosen from some population of interest. Each of these X_i, i = 1, ..., n, is a measurable function from some probability space, say (Ω, U, P), to some measurable space (X, A). We assume that the observations are independent and identically distributed, i.e., we assume an i.i.d. sample. These X_i have a distribution P on (X, A), so (X, A, P) is a measure space. One then believes that the distribution P belongs to some statistical or probability model, where a model consists of a class of distributions or densities that we believe might have generated the data. The distributions in a model are identified through a set of parameters (possibly infinite-dimensional). The problem is usually set up in such a way that the value of the parameters, or at least the value of some subset of the parameters describing the distribution that generates the data, is of importance. So, the question then is: how can we learn about this true parameter value from a sample of observed data? We wish to do this in such a way that we extract as much information as possible from the data.

We may give a classification of different types of models. Though the definitions below are not very rigorous, they constitute an intuitive notion of the three important types of models. Let us start with the simplest and most familiar kind of model.

Definition 1.1. A model P that can be smoothly indexed by a Euclidean vector, a vector of a finite number of real values (the parameters), is called a finite-dimensional parametric model.

For finite-dimensional parametric models, the class of distributions can be described as

P = {P_θ | θ ∈ Θ ⊂ R^m}. (1.1)

The dimension m is some finite positive integer. We shall not attempt to make the definition more precise by specifying what smoothly means. Note that this should cover all classical statistical models, including exponential families. When we consider such a parametric model, we make a lot of assumptions about the data-generating mechanism. The shape of our distribution is fixed and the only degrees of freedom are the m parameters describing the model. If one wishes not to make any essential assumptions on the distribution of the data, we roughly obtain the other extreme.


Definition 1.2. A model P containing all probability distributions on the measurable space (X , A), is called a nonparametric model.

In this case, we do not have a finite-dimensional component of the parameter; the parameter is fully infinite-dimensional. There is also an intermediate case, for which the theory in this thesis is mainly developed.

Definition 1.3. A model P containing probability distributions described through a parameter that contains both a finite-dimensional component and an infinite-dimensional component is referred to as a semiparametric model.

In this case, the class of distributions is so large that the parameter indexing the model is infinite-dimensional. Thus, a semiparametric model can be seen as an infinite-dimensional model that is essentially smaller than the set of all possible distributions. Examples of semiparametric models will be given shortly. By allowing the space of parameters to be infinite-dimensional, we put fewer restrictions on the probabilistic constraints that our data might have. A semiparametric model in a strict sense will typically be denoted by

P = {P_{θ,η} : θ ∈ Θ, η ∈ H},

where Θ ⊂ R^m and where H is an infinite-dimensional set. The prime interest will typically be in the finite-dimensional parameter θ, which we call the parameter of interest, and the infinite-dimensional part η will be referred to as the nuisance parameter. This terminology will be used throughout the whole thesis, also in parametric models. This will be made more specific when we need it.
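To fix ideas, here is a minimal illustrative sketch in Python (my own addition, not part of the original text; all names such as sample_parametric and eta_sampler are hypothetical). It contrasts how a parametric family is indexed by a finite-dimensional θ alone, while a strict-sense semiparametric family is indexed by a finite-dimensional β together with an entire unspecified error distribution η.

```python
import numpy as np

rng = np.random.default_rng(1)

# Parametric model: indexed by the finite-dimensional theta = (mu, sigma) only.
def sample_parametric(theta, n):
    mu, sigma = theta
    return rng.normal(mu, sigma, size=n)

# Semiparametric model (strict sense): indexed by a finite-dimensional beta
# *and* an infinite-dimensional nuisance parameter eta, here represented by an
# arbitrary mean-zero error sampler that the model leaves unspecified.
def sample_semiparametric(beta, eta_sampler, X):
    return X @ beta + eta_sampler(len(X))

# Two members of the same semiparametric model, differing only in eta:
X = rng.normal(size=(5, 2))
beta = np.array([1.0, -1.0])
eta_normal = lambda n: rng.normal(0.0, 1.0, size=n)
eta_skewed = lambda n: rng.exponential(1.0, size=n) - 1.0   # also mean zero
print(sample_semiparametric(beta, eta_normal, X))
print(sample_semiparametric(beta, eta_skewed, X))
```

Both calls at the end draw from members of the same semiparametric model; they differ only in the infinite-dimensional component η.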

1.2 Why Semiparametric Theory?

Parametric theory has received a lot of attention in the literature. Indeed, most well-known elementary statistical methods are parametric. One of the most important results concerns the asymptotic normality and efficiency of the maximum likelihood estimator (MLE), rooted in the work by Fisher in the 1920s. Another key result is the lower bound theory rooted in the work by Cramér and Rao in the 1940s. As such, a great deal of theory is available when dealing with estimation of statistical models. Thus, when there is so much information available, why consider developing a completely new theory to deal with estimation in semiparametric models? Semiparametric models have only received increasing attention in recent years. This attention has been motivated primarily by the problem of misspecification of statistical models. The semiparametric approach to misspecification is to allow the functional form of some components of the model to be unrestricted. By allowing the space of parameters to be partly infinite-dimensional, we put fewer restrictions on the probabilistic constraints that our data might have. Therefore, solutions, if they exist and are reasonable, will have greater applicability and robustness. This approach is an important complement to fully nonparametric models, which may not be very useful with small amounts of data or data of large dimension.

1.3 Examples of Semiparametric Models

It is time to consider some important examples of semiparametric models. Most of these examples will be used later on to clarify the abstract and tough semiparametric theory that is developed, which will heavily rely on functional analysis. For those not familiar with semiparametric models, the following examples give a good idea of how wide the range of these models can be and how they can be parameterized in an interesting way.

1.3.1 Restricted Moment Model

We begin our list of examples with a common and probably the best-known statistical model. This model describes the relationship between a response variable Y, possibly vector-valued, and a vector of covariates X. For this example, we adopt the notation from Tsiatis (2006), [35], p. 3-7. We consider a family of probability distributions for Z = (X, Y) that satisfy the regression relationship

E(Y|X) = µ(X, β), (1.2)

where µ(X, β) is a known function of X, smooth in the unknown q-dimensional parameter β. The function µ(X, β) has the same dimension as the response variable Y, say d. The function µ(X, β) may be linear or nonlinear in β. Since we allow a multivariate as well as a univariate response variable, the restricted moment model covers the modelling of multivariate and longitudinal response variables (d > 1) as a function of the covariates, as well as the traditional regression models for a univariate response variable (d = 1). Let us look at some one-dimensional examples. When µ(X, β) = β^T X*, this is a linear model with X* = (1, X^T)^T. This enables us to allow for an intercept. When we take µ(X, β) to be exp(β^T X*), we obtain a log-linear model. Also the logistic regression model belongs to the realm of restricted moment models, by letting µ(X, β) = e^{β^T X*}/(1 + e^{β^T X*}). When we do not want an intercept, just replace X* by X. No other essential assumptions are made on the class of probability distributions other than (1.2). Therefore, we will call these models restricted moment models. The models were studied in Chamberlain (1986), [8] and Newey (1988), [26] in the econometrics literature. They were also popularized in the statistics literature in Liang and Zeger (1986), [23]. We now demonstrate how we can obtain an interesting parametrization θ = (β^T, η^T)^T of the model. The parameter η will turn out to be infinite-dimensional and describes the nonparametric component of the model. Thus, this parametrization reveals that the restricted moment model is a semiparametric model in a strict sense. Those familiar with semiparametric models can skip this derivation. It will be the only introductory example that is dealt with in so much detail. It is given so that the reader who is unfamiliar with semiparametric theory gets more intuition for the concept of infinite-dimensional parameters. For ease of explanation, we suppose that Y is a continuous random variable, so the dominating measure is the ordinary Lebesgue measure, which we will denote by ℓ_Y. The model (1.2) can be rewritten as Y = µ(X, β) + ε, E(ε|X) = 0.
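To make the different choices of µ(X, β) mentioned above concrete, the following minimal sketch (my own illustration; the names mu_linear, mu_loglinear and mu_logistic are hypothetical) evaluates the linear, log-linear and logistic mean functions on a small design matrix.

```python
import numpy as np

def mu_linear(X, beta):
    """Linear mean: mu(X, beta) = beta^T X*, with X* = (1, X^T)^T for an intercept."""
    X_star = np.column_stack([np.ones(len(X)), X])
    return X_star @ beta

def mu_loglinear(X, beta):
    """Log-linear mean: mu(X, beta) = exp(beta^T X*)."""
    X_star = np.column_stack([np.ones(len(X)), X])
    return np.exp(X_star @ beta)

def mu_logistic(X, beta):
    """Logistic mean: mu(X, beta) = expit(beta^T X*)."""
    X_star = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-(X_star @ beta)))

# Example: two covariates, beta = (intercept, slope_1, slope_2).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
beta = np.array([0.5, 1.0, -2.0])
print(mu_linear(X, beta), mu_loglinear(X, beta), mu_logistic(X, beta))
```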

The data are realizations of (Y_1, X_1), ..., (Y_n, X_n) that are i.i.d. with density for a single observation given by p_{Y,X}{y, x; β, η(·)}, where η(·) denotes the infinite-dimensional nuisance parameter. As we shall see, it characterizes the joint distribution of ε and X. The joint distribution of (ε, X) together with the parameter β will induce the joint distribution of (Y, X). Indeed, ε = Y − µ(X, β) and hence p_{Y,X}(y, x) = p_{ε,X}{y − µ(x, β), x}. Note that this is possible since Y is also continuous. When Y is, for example, dichotomous, this would not be possible, since ε = Y − µ(X, β) may no longer have a dominating measure that allows us to define densities. We will briefly discuss this case in Chapter 5. The condition E(ε|X) = 0 implies that for (ε, X), we only consider joint density functions p_{ε,X}(ε, x) = p_{ε|X}(ε|x) p_X(x) that satisfy:

1. p_{ε|X}(ε|x) ≥ 0 for all (ε, x),

2. ∫ p_{ε|X}(ε|x) dℓ_Y(ε) = 1 for all x,

3. ∫ ε p_{ε|X}(ε|x) dℓ_Y(ε) = 0 for all x,

4. p_X(x) ≥ 0 for all x,

5. ∫ p_X(x) dν_X(x) = 1.

Hereby νX (x) denotes the dominating measure corresponding with the random variable X. Since the covariates X may be continuous, discrete or mixed, we leave it unspecified. E.g., νX could be the Lebesgue measure or the counting measure or a product of such measures. We now construct the set of all joint distributions for (ε, X) satisfying the conditions above.

(a) Choose an arbitrary positive function of (ε, x): h^(0)(ε, x) > 0 for all (ε, x).

(b) Normalize this function to be a conditional density:

h^(1)(ε, x) = h^(0)(ε, x) / ∫ h^(0)(ε, x) dℓ_Y(ε),

or equivalently, ∫ h^(1)(ε, x) dℓ_Y(ε) = 1 for all x.

(c) Center this function. To do this, consider a random variable ε* with conditional density h^(1)(ε′, x) = p(ε* = ε′ | X = x). Given X = x, this random variable has mean

κ(x) = ∫ ε′ h^(1)(ε′, x) dℓ_Y(ε′).

Now put ε = ε* − κ(X); then E(ε|X = x) = E(ε*|X = x) − κ(x) = 0. The Jacobian of this transformation is 1 and ε* = ε + κ(X), thus the conditional density of ε given X is given by the function

η_1(ε, x) = h^(1){ε + ∫ ε′ h^(1)(ε′, x) dℓ_Y(ε′), x}.
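The normalization and centering in steps (a)-(c) are easy to mimic numerically. The sketch below is my own illustration, using a grid approximation of the Lebesgue integral and an arbitrarily chosen positive starting function h0; it checks that the constructed conditional density integrates to one and has conditional mean zero for a fixed covariate value x.

```python
import numpy as np

# Grid approximation of the error axis (truncated for this illustration).
eps = np.linspace(-10.0, 10.0, 4001)
d_eps = eps[1] - eps[0]
x = 1.3  # a fixed covariate value

# (a) An arbitrary positive function h0(eps, x) > 0 (a skewed, unnormalized shape).
h0 = np.exp(-0.5 * (eps - 0.7 * x) ** 2) * (1.0 + 0.9 * np.tanh(eps))

# (b) Normalize it into a conditional density h1(. | x).
h1 = h0 / (h0.sum() * d_eps)

# (c) Center it: kappa(x) = int eps' h1(eps' | x) d eps', then eta1(eps, x) = h1(eps + kappa | x).
kappa = (eps * h1).sum() * d_eps
eta1 = np.interp(eps + kappa, eps, h1, left=0.0, right=0.0)

print("integral of eta1 :", eta1.sum() * d_eps)          # close to 1  (condition 2)
print("conditional mean :", (eps * eta1).sum() * d_eps)  # close to 0  (condition 3)
```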

By construction, the function η_1(ε, x) satisfies conditions (1), (2) and (3). Note that we started with an arbitrary positive function h^(0). The space of these functions is infinite-dimensional; consequently, the space of functions η_1(ε, x) is infinite-dimensional. Analogously, we can construct a density function p_X(x) = η_2(x) for X satisfying (4) and (5) by repeating steps (a) and (b). The space of functions η_2(x) will also be infinite-dimensional.

Conclusion. The restricted moment model can be characterized by the parameters

{β, η1(·), η2(·)}, 1.3. Examples of Semiparametric Models 7

with β ∈ R^q finite-dimensional and η_1(ε, x) = p_{ε|X}(ε|x), η_2(x) = p_X(x) infinite-dimensional. The parameter of interest is β and the nuisance parameter is η(·) = {η_1(·), η_2(·)}. The joint distribution of (Y, X) is given by

pY,X {y, x; β, η1(·), η2(·)} = η1{y − µ(x, β), x}η2(x).

Hence, we write

P = {P_{β,η} : β ∈ R^q, η ∈ H}, (1.3)

where P_{β,η} is the probability measure corresponding with the density p_{Y,X}{y, x; β, η_1(·), η_2(·)} and H is the set of all possible values the nuisance parameter η = (η_1, η_2) can take. Note that H is an infinite-dimensional set. Finally, it is possible to contrast this semiparametric model with the parametric version (for simplicity we take Y to be one-dimensional),

Yi = µ(Xi, β) + εi, i = 1, . . . , n,

where ε_i are i.i.d. N(0, σ²). That is,

p_{Y|X}(y | x; β, σ²) = (2πσ²)^{−1/2} exp[ −{y − µ(x, β)}² / (2σ²) ].

It is clear that this model is much more restrictive than the semiparametric model (1.3) defined earlier. The robustness of the semiparametric model is now very clear: estimators for parameters in the semiparametric model are less sensitive to misspecification of the distribution of ε, in contrast with the parametric model, which assumes normally distributed error terms. Note that we do not need to specify the distribution of the covariates X, because the parameter β is ancillary for p_X(x), i.e., the density function p_X(x) does not depend on the parameter β. This intuitive statement will be made mathematically rigorous later on when we discuss adaptive estimation.
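The robustness claim can be illustrated with a small simulation (again my own sketch, not taken from the text). The least squares estimator only uses the moment restriction E(Y|X) = β^T X*, so it remains consistent for β even under markedly non-normal, heteroscedastic errors, a situation in which the parametric normal-error model above would be misspecified.

```python
import numpy as np

rng = np.random.default_rng(2011)
n, beta_true = 100_000, np.array([1.0, 2.0])

X = rng.uniform(-1, 1, size=n)
X_star = np.column_stack([np.ones(n), X])            # design with intercept
# Non-normal, heteroscedastic errors that still satisfy E(eps | X) = 0:
eps = (1.0 + np.abs(X)) * (rng.exponential(1.0, n) - 1.0)
Y = X_star @ beta_true + eps

# Least squares solves the empirical moment equations sum_i X*_i (Y_i - beta^T X*_i) = 0,
# which only require the restricted moment condition, not normality of the errors.
beta_hat = np.linalg.lstsq(X_star, Y, rcond=None)[0]
print("true beta:", beta_true, " estimate:", beta_hat)
```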

1.3.2 Partially Linear Regression

This example extends the previous example. The restricted moment model specifies that the conditional mean of the response variable Y given a vector of covariates X is a fixed, known function µ(X, β) of an unknown finite-dimensional parameter β,

E(Y |X) = µ(X, β).

A nonparametric regression model would replace this by an arbitrary function Ψ, perhaps restricted to be smooth, E(Y|X) = Ψ(X). The intermediate case that mixes these two extremes is also a semiparametric model, for instance

E(Y |X,W ) = µ{βT X + Ψ(W )}

k for β ∈ R , W an additional set of covariates and Ψ ranging over the class of all twice differen- tiable functions on the domain of W . The function µ is still fixed and known. Note that this type of model may occur in a causal setting when we are only interested in the effect of some variable X on the response variable Y and we leave the effect of a confounder W unspecified because we are not interested in the exact magnitude of this effect. 8 Chapter 1. Introduction

To describe the full model, we write Z = (Y,X,W ). We use the results from the previous example. The only new component is the function Ψ and the additional covariates W . The generalization is now straightforward. Thus, the joint density can be described as

p_Z{z; β, η_1(·), η_2(·), Ψ(·)} = η_1[y − µ{β^T x + Ψ(w)}, x, w] η_2(x, w),

where β continues to be the finite-dimensional parameter of interest and the nuisance parameter is now the infinite-dimensional triplet

η(·) = {η1(·), η2(·), Ψ(·)}.

There are other ways to describe this model, see [39], p.7. For an application of this model, we refer to Robins et al. (1992), [31].

1.3.3 Symmetric Location

The following model is the symmetric location model. Consider for example the case where we have collected data and we have a priori knowledge that the data are distributed symmetrically about their mean, but we have no further information. In this case, it is reasonable not to make any further modelling assumptions, to avoid problems resulting from model misspecification. We now describe how we can parameterize this model, keeping in mind estimation of the location parameter. Let us start with a given θ ∈ R and a probability density η on R that is symmetric about 0. Denote by P_{θ,η} the measure with density x ↦ η(x − θ). The corresponding semiparametric model becomes

P = {P_{θ,η} : θ ∈ R, η ∈ H},     (1.4)

with H the set of all proper absolutely continuous densities symmetric about zero with finite Fisher information for location, i.e.,

I_θ(η) = E[{∂/∂θ log η(X − θ)}²] = ∫ {η'(z)/η(z)}² η(z) dz < +∞.

Note that H is an infinite-dimensional set. The parametric component of this model, the parameter of interest, is the parameter θ, and the nonparametric component, the nuisance parameter, is the density η. This model arose naturally in the study of nonparametric testing theory and was studied long before the general subject of semiparametric models had been conceived. As we shall see in Chapter 7, this model turns out to be a very special model with regard to the estimation of the parameter θ. There exist estimators for θ in this model with η unknown which are asymptotically of the same quality as the efficient estimators in the corresponding model with η known, e.g., the sample mean when η is normal.
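As a small numerical illustration of the Fisher information for location appearing in (1.4) (the particular densities and the integration grid are illustrative choices), one can check the classical values I_θ(η) = 1 for the standard normal density and I_θ(η) = 1/3 for the standard logistic density:

```python
import numpy as np

z = np.linspace(-12.0, 12.0, 48001)

def fisher_location(eta, d_eta):
    # I_theta(eta) = int {eta'(z)/eta(z)}^2 eta(z) dz
    return np.trapz(d_eta(z) ** 2 / eta(z), z)

normal = lambda z: np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
d_normal = lambda z: -z * normal(z)

logistic = lambda z: np.exp(-z) / (1 + np.exp(-z)) ** 2
d_logistic = lambda z: logistic(z) * (np.exp(-z) - 1) / (np.exp(-z) + 1)

print(fisher_location(normal, d_normal))      # ~ 1.0
print(fisher_location(logistic, d_logistic))  # ~ 0.3333
```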

1.3.4 Mixture Models

In this section, we introduce the notion of a semiparametric mixture model. First we introduce this type of model in a more abstract way and afterwards we give an intuitive example. Let us start by assuming that x ↦ p_θ(x|z) is a probability density for every pair (θ, z) ∈ Θ × Z,

where Θ is a subset of R^k and (Z, C) is a measurable space. In addition, suppose the map (x, z) ↦ p_θ(x|z) is jointly measurable. This guarantees that

p_{θ,η}(x) = ∫_Z p_θ(x|z) dη(z)

defines a proper density for every probability measure η on (Z, C). When the probability distribution η is degenerate at z, i.e., a probability distribution that puts all its mass on the value z,

η({z̃}) = 1 if z̃ = z and η({z̃}) = 0 if z̃ ≠ z,

the mixture density p_{θ,η}(·) reduces to the density p_θ(·|z). Hence, the model of all mixture densities p_{θ,η}(·) is considerably bigger than the original model p_θ(·|z), which is parametric if z is Euclidean and the map (θ, z) ↦ p_θ(·|z) is smooth. We can represent the mixture model as the semiparametric model

P = {P_{θ,η} : θ ∈ Θ, η ∈ H},

where P_{θ,η} denotes the probability measure corresponding to the mixture density p_{θ,η} and H is the infinite-dimensional set of all probability measures on the measurable space (Z, C). Let us give an intuitive example: the errors-in-variables model. A typical observation is a pair X = (X_1, X_2) where X_1 = Z + e and X_2 = g_β(Z) + f for a bivariate normal vector (e, f) with mean zero and unknown covariance matrix Σ ∈ Θ_Σ, i.e., (e, f) ∼ N(0, Σ), and a function g_β known up to a finite-dimensional parameter β ∈ Θ_β. The set Θ_Σ is the set of all possible 2 × 2 covariance matrices and Θ_β ⊂ R^k where k is the dimension of β. The parameter θ is thus given by (β, Σ). The distribution of Z is unknown. The kernel p_θ(·|z) is in this case a multivariate normal density. Thus,

(X_1, X_2) | Z ∼ N( (Z, g_β(Z))^T, Σ ).

A particular example is the linear errors-in-variables model for which β = (β0, β1) and gβ(z) = β0 + β1z.
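A minimal simulation sketch of this linear errors-in-variables model (the gamma mixing distribution for Z, the diagonal Σ and all numerical values are illustrative assumptions) shows why the mixture formulation matters: naively regressing X_2 on X_1 is attenuated towards zero because X_1 measures Z with error, while the model itself leaves the distribution of Z completely unspecified:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta0, beta1 = 1.0, 2.0                       # g_beta(z) = beta0 + beta1 * z
var_e, var_f = 1.0, 2.0                       # diagonal covariance matrix Sigma of (e, f)

Z = rng.gamma(shape=2.0, scale=1.0, size=n)   # the unknown mixing distribution eta
X1 = Z + rng.normal(scale=np.sqrt(var_e), size=n)
X2 = beta0 + beta1 * Z + rng.normal(scale=np.sqrt(var_f), size=n)

slope_naive = np.cov(X1, X2)[0, 1] / np.var(X1)    # least squares of X2 on X1
attenuation = np.var(Z) / (np.var(Z) + var_e)
print(slope_naive, beta1 * attenuation)            # both ~ 1.33, not the true beta1 = 2
```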

1.3.5 Proportional Hazards Model

In many biomedical applications, we are often interested in modelling the survival time of individuals as a function of covariates. Thus, a typical observation is a pair X = (T, Z) with T a survival time, e.g., the time at which an individual died, and Z the vector of covariates. The survival time T is the response variable whose distribution depends on explanatory variables Z. Before we build the joint density of T and Z, we give some basic definitions and relations of some special functions that are encountered in survival analysis, related to the marginal distribution of T. For simplicity, we assume that T is continuous. The density of the survival time T is denoted by f(t). The corresponding distribution function is then F(t) = ∫_0^t f(u) du. Thus, F(t) = P(T ≤ t) is the probability that the individual died before time t. In survival analysis, the interest tends to focus on the survivor function S(t) = P(T > t) = 1 − F(t). Thus, S(t) = ∫_t^{+∞} f(u) du. Next, we define the hazard function,

λ(t) = lim_{h→0} P(t ≤ T < t + h | T ≥ t) / h.

This is also called the instantaneous failure rate, i.e., the instantaneous probability of dying at time T = t, given that the individual is still alive. It is easy to show that

λ(t) = f(t)/S(t) = f(t)/{1 − F(t)}.     (1.5)

This equation shows that we can write λ(t) in terms of f(t). We can also do the reverse. Denote by Λ(t) = ∫_0^t λ(u) du the cumulative hazard function. For a sufficiently smooth survivor function S(t), we have the relation S'(t) = −f(t) because S(t) = 1 − F(t). Together with (1.5), we obtain

−∂/∂t {log S(t)} = f(t)/S(t) = λ(t)

and thus S(t) = 1 − F(t) = exp{−∫_0^t λ(u) du} = e^{−Λ(t)}. Differentiating with respect to t yields

f(t) = λ(t) e^{−Λ(t)}.     (1.6)

This shows that the relationship between f(t) and λ(t) is one-to-one and onto. We are now ready to postulate a semiparametric model to describe the relationship between T and Z. A popular model in survival analysis is Cox’s proportional hazards model, first introduced by Cox (1972), [9]. The assumption is that the conditional hazard rate is a function of Z and takes the form

λ(t|Z) = lim_{h→0} P(t ≤ T < t + h | T ≥ t, Z) / h = λ(t) exp(θ^T Z),

where λ(t) is an unknown baseline hazard function. Interest often focuses on the finite-dimensional parameter θ ∈ R^k when Z is k-dimensional. This parameter θ describes the magnitude of the effect that the covariates have on the survival time. The underlying hazard function λ(t) is considered as a nuisance parameter. Since this function can be any positive function in t, subject to some regularity conditions, this nuisance parameter is infinite-dimensional. Using (1.6), we see that

p_{T|Z}{t|z; θ, λ(·)} = λ(t|z) e^{−Λ(t|z)} = λ(t|z) exp{−∫_0^t λ(u|z) du}
= λ(t) e^{θ^T z} exp{−e^{θ^T z} ∫_0^t λ(u) du}
= λ(t) e^{θ^T z} exp{−e^{θ^T z} Λ(t)}.     (1.7)

Using this, we are now able to describe the joint density of T and Z,

p_{T,Z}{t, z; θ, λ(·), p_Z(·)} = p_{T|Z}{t|z; θ, λ(·)} p_Z(z) = λ(t) e^{θ^T z} exp{−e^{θ^T z} Λ(t)} p_Z(z).     (1.8)

Exactly as in the example of the restricted moment model, p_Z(z) is defined as a function of z such that p_Z(z) is a positive function that integrates to 1. Thus, the distribution of the covariates Z is left unspecified. This is also an infinite-dimensional nuisance parameter.
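As a small simulation sketch of the densities (1.7)-(1.8) (the Weibull baseline hazard, the normal covariate distribution and all parameter values are illustrative assumptions), survival times can be drawn by inverse-transform sampling: since S(t|z) = exp{−e^{θ^T z} Λ(t)}, setting this equal to a uniform variable U gives T = Λ^{-1}{−log(U) e^{−θ^T Z}}:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
theta = np.array([0.5, -1.0])                 # parameter of interest
lam, gamma = 0.2, 1.5                         # Weibull baseline: Lambda(t) = lam * t**gamma

Z = rng.normal(size=(n, 2))                   # covariate distribution p_Z, left unspecified by the model
U = rng.uniform(size=n)

Lambda_T = -np.log(U) * np.exp(-Z @ theta)    # Lambda(T) implied by S(T|Z) = U
T = (Lambda_T / lam) ** (1.0 / gamma)         # invert the Weibull cumulative baseline hazard

# Round-trip check: exp(theta' Z) * Lambda(T) should be standard exponential (mean 1)
print(np.mean(np.exp(Z @ theta) * lam * T ** gamma))   # ~ 1.0
```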

From the derivation above, we see that the model can be parameterized in an interesting way with θ ∈ R^k the parameter of interest and the infinite-dimensional nuisance parameter η = {λ(·), p_Z(·)}. We conclude that this semiparametric model can be written as

P = {P_{θ,η} : θ ∈ R^k, η ∈ H},     (1.9)

with H the set of all possible values for the nuisance parameter η = (λ, p_Z). Note that H is an infinite-dimensional set. We see this model is a semiparametric model that can be represented naturally with a finite number of parameters of interest and an infinite-dimensional nuisance parameter, a semiparametric model in a strict sense. It was one of the first models to be studied using semiparametric theory, as presented in Begun et al. (1983), [3]. The proportional hazards model has gained a great deal of popularity because it is more flexible and less vulnerable to model misspecification than a finite-dimensional parametric model that assumes the hazard function for T has a specific functional form in terms of a few parameters, e.g.,

• Exponential model: constant hazard over time, λ(t) ≡ λ,

• Weibull model: λ(t) = γλt^{γ−1} with λ the scale parameter and γ the shape parameter. When γ = 1, we obtain a constant hazard over time.

Following the same reasoning as with the restricted moment model, we do not need to specify the distribution of Z because θ is ancillary for p_Z(z).

Chapter 2

Banach and Hilbert Spaces

In this chapter we sketch the basic theory concerning Banach spaces and in particular Hilbert spaces, which play an important role in this thesis. We will not recap basic terminology such as vector spaces. We refer readers who are not at ease with this chapter to Luenberger (1969), [24] for a more detailed description of these basic facts. Easy proofs or sketches are sometimes given to provide some intuition for the basic concepts, but they can be skipped by the less interested reader.

2.1 Definitions and Basic Properties

2.1.1 Banach Spaces

Definition 2.1. A normed linear space B is a vector space equipped with a norm ‖·‖ : B → R satisfying

(i) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ B (triangle inequality),

(ii) ‖αx‖ = |α| ‖x‖ for all α ∈ R and x ∈ B,

(iii) ‖x‖ ≥ 0 for all x ∈ B and ‖x‖ = 0 iff x = 0.

We denote this by (B, ‖·‖).

Proposition 2.1. From (i) it easily follows that ‖x − y‖ ≥ ‖x‖ − ‖y‖ for all x, y ∈ B, the reversed triangle inequality.

Definition 2.2. A sequence {x_n}_{n≥1} ⊂ B is called a Cauchy sequence iff ‖x_p − x_q‖ → 0 as p, q → +∞. A normed linear space is called complete iff every Cauchy sequence in B has a limit point which is also in B, i.e., every Cauchy sequence is convergent and has its limit in B.

Definition 2.3. A complete normed linear space is called a Banach space.

We give two examples that will be of interest for our purposes.

Example 2.1. The most well-known example of a Banach space is the Euclidean space B = R^d with ‖x‖ the Euclidean norm |x| = (Σ_{i=1}^d x_i²)^{1/2} = √(x^T x).

Example 2.2. Let µ be a positive measure on a measurable space (X, B). For any measurable function f : X → R and 0 < p < ∞, let ‖f‖_{L_p(µ)} = {∫_X |f(x)|^p dµ(x)}^{1/p} and L_p(µ) = {f : ‖f‖_{L_p(µ)} < +∞}. The normed linear space (L_p(µ), ‖·‖_{L_p(µ)}) is a Banach space. Note that we call two functions f and g equal (equivalent) if ‖f − g‖_{L_p(µ)} = 0, which means that they are equal almost everywhere (a.e.).

2.1.2 Hilbert Spaces

Definition 2.4. A vector space H is an inner product space or a pre-Hilbert space if there is an inner product ⟨·, ·⟩ : H × H → R which satisfies

(i) ⟨x, y⟩ = ⟨y, x⟩ for all x, y ∈ H,

(ii) ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩ for all x, y, z ∈ H,

(iii) ⟨αx, y⟩ = α⟨x, y⟩ for all α ∈ R and x, y ∈ H,

(iv) ⟨x, x⟩ ≥ 0 for all x ∈ H and ⟨x, x⟩ = 0 iff x = 0.

We denote this by (H, ⟨·, ·⟩).

Proposition 2.2. Any inner product space (H, ⟨·, ·⟩) is a normed linear space (H, ‖·‖) with norm ‖x‖ = √⟨x, x⟩.

Definition 2.5. An inner product space (H, ⟨·, ·⟩) that is complete with respect to the norm √⟨·, ·⟩ is called a Hilbert space.

Note that by Proposition 2.2 every Hilbert space is also a Banach space. The converse is not true since the norm must be associated with an inner product, which is not always the case. An inner product now enables us to define the notion of orthogonal elements.

Definition 2.6. Two elements x and y from a Hilbert space (H, ⟨·, ·⟩) are called orthogonal, denoted x ⊥ y, iff ⟨x, y⟩ = 0. They are called orthonormal if in addition ‖x‖ = ‖y‖ = 1.

An easy but useful consequence of orthogonality is the Pythagorean theorem.

Theorem 2.1 (Pythagorean Theorem). If x and y are orthogonal elements of a Hilbert space H, then ‖x + y‖² = ‖x‖² + ‖y‖².

Proof. This follows from an easy calculation,

‖x + y‖² = ⟨x + y, x + y⟩ = ‖x‖² + ‖y‖² + 2⟨x, y⟩ = ‖x‖² + ‖y‖²

since an inner product is symmetric and x ⊥ y.

We give examples of Hilbert spaces that will be of interest for our purposes.

Example 2.3. The d-dimensional Euclidean space H = R^d is a Hilbert space relative to the inner product ⟨x, y⟩ = x^T y = Σ_{i=1}^d x_i y_i.

This is a special case of the following property.

Proposition 2.3. Every finite-dimensional inner product space (H, ⟨·, ·⟩) is a Hilbert space (for the associated norm).

Proof. Let H be a d-dimensional inner product space and let {φ_1, . . . , φ_d} be an orthonormal basis for H. Now take any Cauchy sequence {x_n}_{n≥1} ⊂ H with

x_n = Σ_{j=1}^d α_n^j φ_j,

with α_n^j ∈ R for any n and j. By definition of a Cauchy sequence, we have that

‖x_p − x_q‖² = ‖ Σ_{j=1}^d (α_p^j − α_q^j) φ_j ‖² → 0 as p, q → +∞.

By the Pythagorean theorem, it follows that |α_p^j − α_q^j| → 0 as p, q → +∞ for any j = 1, . . . , d. This means that the d sequences {α_n^j}_{n≥1} are Cauchy sequences in R. Since R is complete, there exist real numbers α^j = lim_{n→+∞} α_n^j for j = 1, . . . , d. From this it follows that

x_n = Σ_{j=1}^d α_n^j φ_j → Σ_{j=1}^d α^j φ_j =: x ∈ H.

This shows that every Cauchy sequence is convergent, which concludes the proof.

Example 2.4. Recall Example 2.2 and take p = 2. It can be shown that the space L_2(µ) = {f : ‖f‖_{L_2(µ)} < +∞} is a Hilbert space with ⟨f, g⟩ = ∫_X f(x)g(x) dµ(x) for any f, g ∈ L_2(µ). This is the prototype of an infinite-dimensional Hilbert space.

The following theorem, which is quite essential in Hilbert space theory, will be used throughout.

Theorem 2.2. Every closed subspace H_0 of a Hilbert space (H, ⟨·, ·⟩) is a Hilbert space itself relative to the inner product of H.

Proof. Let {x_n}_{n≥1} ⊂ H_0 be a Cauchy sequence. Since H_0 ⊂ H, {x_n}_{n≥1} is also a Cauchy sequence in H. Because H is a Hilbert space, {x_n}_{n≥1} converges to an element x ∈ H. Because H_0 is closed, it contains all its limit points. Therefore we see that x ∈ H_0. We conclude that H_0 must be a Hilbert space.

The next theorem is a fundamental theorem in Hilbert space theory and is of great importance. We will use it a lot in this thesis.

Theorem 2.3 (Cauchy-Schwarz inequality). For any two elements x and y from a Hilbert space (H, ⟨·, ·⟩), we have |⟨x, y⟩| ≤ ‖x‖ ‖y‖, with equality holding if and only if x and y are linearly dependent, i.e., x = αy for some real number α.

We end this section with the notion of the direct sum of Hilbert spaces.

Definition 2.7. Suppose we have two Hilbert spaces (H_1, ⟨·, ·⟩_{H_1}) and (H_2, ⟨·, ·⟩_{H_2}). These can be combined into another Hilbert space, called the direct sum, and denoted by H_1 × H_2. This is the set {(x_1, x_2) : x_i ∈ H_i, i = 1, 2}. The inner product for this linear space is defined by

⟨(x_1, x_2), (y_1, y_2)⟩_{H_1×H_2} = ⟨x_1, y_1⟩_{H_1} + ⟨x_2, y_2⟩_{H_2},

with x_i, y_i ∈ H_i, i = 1, 2. It is easy to show that (H_1 × H_2, ⟨·, ·⟩_{H_1×H_2}) is a Hilbert space.

We illustrate this with an easy example.

Example 2.5. We take H_i to be the Euclidean space R^{d_i}, i = 1, 2. We know that R^{d_1} × R^{d_2} = R^{d_1+d_2}. It is a Hilbert space relative to the inner product ⟨·, ·⟩_{R^{d_1+d_2}}. This inner product equals the inner product defined in the preceding definition, ⟨·, ·⟩_{R^{d_1}×R^{d_2}}. This shows that the Hilbert space (R^{d_1} × R^{d_2}, ⟨·, ·⟩_{R^{d_1}×R^{d_2}}) equals the Hilbert space (R^{d_1+d_2}, ⟨·, ·⟩_{R^{d_1+d_2}}).

2.2 Linear Functionals and Dual Spaces

Consider a Banach space B.

Definition 2.8. A mapping f : B → R : x ↦ f(x) is called a functional on B. With D(f) the domain of definition of the functional f, the functional f is called linear iff D(f) is a linear space and

f(αx + βy) = αf(x) + βf(y) for all α, β ∈ R and for all x, y ∈ D(f).

Definition 2.9. A linear functional f is called bounded iff the constant

‖f‖ = sup_{x∈D(f), ‖x‖≤1} |f(x)|     (2.1)

is finite.

For the sake of completeness, we state the following fundamental theorem in functional analysis.

Theorem 2.4 (Hahn-Banach). Let B be any real normed space (for example a Banach space) and f be a linear bounded functional with D(f) ⊂ B. Then there exists a linear bounded functional f̃ defined on the whole space B such that

(i) ‖f̃‖ = ‖f‖,

(ii) f̃(x) = f(x), for all x ∈ D(f).

An important implication of this theorem is that any linear bounded functional on a real normed space can be considered to be defined on the whole space. Thus, without loss of generality we can say that D(f) = B. The set of all linear bounded functionals defined on a Banach space B is also a Banach space with norm defined as in (2.1). It is called the dual space of B and is denoted B^*. We can now define two types of convergence in B^*.

Definition 2.10. Take any sequence {f_n}_{n≥1} ⊂ B^* and f ∈ B^*. Then we say

(i) the sequence {f_n}_{n≥1} is strongly convergent to f, denoted f_n → f, iff ‖f_n − f‖ → 0 as n → +∞,

(ii) the sequence {f_n}_{n≥1} is weakly convergent to f, denoted f_n ⇀ f, iff f_n(x) → f(x) as n → +∞ for all x ∈ B.

The next theorem is a famous theorem in functional analysis and will be very useful in the development of the more abstract semiparametric theory to estimate functionals, see Part III. It gives a representation of linear bounded functionals on Hilbert spaces.

Theorem 2.5 (Riesz' Representation Theorem). Let H be a Hilbert space with inner product ⟨·, ·⟩. Each linear bounded functional f on H can be written as

f(x) = ⟨φ, x⟩, for all x ∈ H.

The element φ ∈ H is uniquely determined. Moreover, ‖f‖ = ‖φ‖.

This theorem and the Cauchy-Schwarz inequality show that there exists an isometry, i.e., a linear operator that leaves the norm invariant, between H and H^*, from which we can identify H with its dual space H^* and consider them to be equivalent.

2.3 Linear Operators, Adjoints and Ranges

2.3.1 Linear Operators

Let B_1 and B_2 be two Banach spaces. An operator A : B_1 → B_2 is a mapping from B_1 to B_2. The domain of definition of A is denoted D(A).

Definition 2.11. An operator A : B1 → B2 is called linear iff D(A) is a linear space and

A(αx + βy) = αA(x) + βA(y), for all α, β ∈ R and for all x, y ∈ D(A).

Definition 2.12. An operator A : B1 → B2

(i) is called bounded if it maps any bounded set in B1 into a bounded set in B2,

(ii) is called continuous in x if A(xn) → A(x) for any xn → x. Moreover A is continuous in D(A) if A is continuous in any point of D(A).

We now state an interesting proposition.

Proposition 2.4. Assume A : B1 → B2 is a linear operator defined on the whole space B1. Then the following statements are equivalent,

(i) A is continuous,

(ii) A is bounded,

(iii) there exists a positive constant C such that ‖A(x)‖_{B_2} ≤ C ‖x‖_{B_1} for any x ∈ B_1.

It turns out that continuity and boundedness are equivalent concepts for linear operators defined on the whole space. Moreover, we have a practical condition to check the boundedness and, equivalently, the continuity of an operator A : B_1 → B_2.

Let us introduce the set A = {‖Ax‖_{B_2} : ‖x‖_{B_1} ≤ 1} for any linear bounded operator A from B_1 into B_2. By the boundedness of A, we know that the set A is bounded and thus the supremum of A exists, ‖A‖ = sup A. It is then easy to show that ‖Ax‖_{B_2} ≤ ‖A‖ ‖x‖_{B_1} for all x.

Definition 2.13. The space of all linear bounded operators from B_1 into B_2 is a linear space, which is denoted by L(B_1, B_2). The norm in this space is defined as ‖A‖ = sup A.

Proposition 2.5. The space of linear bounded operators L(B1, B2) is a Banach space.

Note it is actually enough that only B_2 is a Banach space; it is sufficient that B_1 is a real normed space. The next theorem will be of importance for our purposes. We admit, this theorem will only be used to fill in some details.

Theorem 2.6 (Continuous Prolongation of an Operator). We consider a linear operator A : D(A) ⊂ B_1 → B_2 whose domain D(A) is dense in B_1. In addition, suppose that A is bounded on D(A). Then there exists a linear bounded operator Ã defined on the whole space B_1 such that

(i) Ã = A on D(A),

(ii) ‖Ã‖ = ‖A‖.

Proof. We only show how the operator A can be extended. The remainder of the theorem can be found in [33], Theorem 1.5, p.7. Since D(A) is dense in B_1, any x ∈ B_1 can be approached by a sequence {x_n}_{n≥1} ⊂ D(A). We see that

‖Ax_n − Ax_m‖_{B_2} = ‖A(x_n − x_m)‖_{B_2} ≤ ‖A‖ ‖x_n − x_m‖_{B_1},

which implies that {Ax_n}_{n≥1} ⊂ B_2 is a Cauchy sequence since {x_n}_{n≥1} is convergent. Because B_2 is a Banach space, the sequence {Ax_n}_{n≥1} must also be convergent and thus we can define

Ãx := lim_{n→+∞} Ax_n for x ∉ D(A).

When x ∈ D(A), we simply define Ãx := Ax.

2.3.2 Adjoint Operators

Let B_1 and B_2 be Banach spaces with dual spaces B_1^* and B_2^* respectively. Suppose that A ∈ L(B_1, B_2).

Definition 2.14. The adjoint of A is defined to be the linear operator A^* from B_2^* to B_1^* satisfying

(A^* y^*)x = y^*(Ax),

for x ∈ B_1 and y^* ∈ B_2^*.

It is easy to see this is a linear operator.

Theorem 2.7. If A ∈ L(B_1, B_2), then ‖A‖ = ‖A^*‖.

By this theorem, we see that A^* ∈ L(B_2^*, B_1^*), thus it is a continuous linear operator. A more detailed proof of the existence, uniqueness, linearity and boundedness of the adjoint can be found in [32], Theorem 4.10, p. 93. We now give some basic rules to do calculations with adjoint operators.

Proposition 2.6. Let B_i, i = 1, 2, 3 be Banach spaces, then the following assertions are true:

(i) If A, B ∈ L(B_1, B_2), then (A + B)^* = A^* + B^*.

(ii) If A ∈ L(B_1, B_2) and α ∈ R, then (αA)^* = αA^*.

(iii) If A ∈ L(B_1, B_2) and B ∈ L(B_2, B_3), then (BA)^* = A^* B^*.

(iv) If A ∈ L(B_1, B_2) has a bounded inverse A^{-1} (the inverse of an operator is defined in Definition 2.16), then (A^{-1})^* = (A^*)^{-1}.

We specialize these considerations to Hilbert spaces H_1 and H_2. Note the results obtained so far remain applicable because a Hilbert space is a Banach space. We have seen that the dual space of a Hilbert space can be identified with the original space. The adjoint of an operator A : H_1 → H_2 can thus be seen as a map A^* : H_2 → H_1 characterized by the property

⟨Ax, y⟩_{H_2} = ⟨x, A^* y⟩_{H_1},

for all x ∈ H_1 and y ∈ H_2. For Hilbert spaces, we have the following interesting property.

Proposition 2.7. Suppose H is a Hilbert space. If A ∈ L(H, H), then (A^*)^* = A.

Remark 2.1. An operator between Euclidean spaces can be identified with a matrix and its adjoint is then simply the transpose of the matrix. The results summarized in the preceding propositions are the familiar properties of the transpose of matrices.
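A quick numerical illustration of Remark 2.1 (the matrices and vectors are random and purely illustrative): for the Euclidean inner product the defining property of the adjoint holds with the transpose in the role of A^*, and the composition rule of Proposition 2.6 becomes the familiar rule for transposes:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 5))                  # A : R^5 -> R^3, identified with a 3 x 5 matrix
x, y = rng.normal(size=5), rng.normal(size=3)

# <Ax, y> = <x, A^T y>: the adjoint of A is its transpose
print(np.allclose(np.dot(A @ x, y), np.dot(x, A.T @ y)))   # True

B = rng.normal(size=(5, 4))                  # B : R^4 -> R^5, so A B : R^4 -> R^3
print(np.allclose((A @ B).T, B.T @ A.T))     # the adjoint of a composition reverses the order
```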

For later purposes, we consider the adjoint of a restriction A_0 : H_{1,0} ⊂ H_1 → H_2 of A. This is defined to be the composition Π ◦ A^* of the projection Π : H_1 → H_{1,0} and the adjoint of the original A. The projection operator Π will be discussed in much detail in the next section, but intuitively this is clear.

2.3.3 Ranges and Inverse Operators

Let H1 and H2 be Hilbert spaces.

Definition 2.15. If A ∈ L(H_1, H_2), then we define

(i) N(A) = {x ∈ H_1 : Ax = 0} to be the kernel or the null space of A,

(ii) R(A) = {Ax ∈ H_2 : x ∈ H_1} to be the range of A.

The range R(A) is a linear space. However, R(A) is not necessarily closed and this can have serious implications, also in semiparametric estimation theory (as we shall see in Chapter 8). We state a proposition that gives some equivalences of closed ranges in terms of the adjoint.

Proposition 2.8. Let A : H1 → H2 be a continuous linear map between two Hilbert spaces. Then the following statements are equivalent,

(i) R(A) is closed,

(ii) R(A∗) is closed,

(iii) R(A∗A) is closed.

In that case R(A∗) = R(A∗A).

Next we state another important theorem in terms of the inverse of A which we define first.

Definition 2.16. For an operator A : B_1 → B_2 that is one-to-one, i.e., such that for any y ∈ R(A) there exists a unique x ∈ D(A) such that Ax = y, we can define the inverse operator

A^{-1} : R(A) ⊂ B_2 → B_1 : y ↦ x,

with D(A^{-1}) = R(A) and R(A^{-1}) = D(A).

Lemma 2.1. An operator A : B1 → B2 is one-to-one if and only if N (A) = {0}.

So, when the kernel is trivial, the inverse operator can be defined.

Theorem 2.8 (Bounded Inverse Theorem). Let A : H_1 → H_2 be a continuous linear map between two Hilbert spaces. Suppose A is one-to-one (N(A) = {0}); then R(A) is closed if and only if A^{-1} is continuous (i.e., bounded).

In the light of this theorem, we also have a result involving the operator A^*A, which will also be of interest later on.

Theorem 2.9. Let A : H_1 → H_2 be a continuous linear map between two Hilbert spaces. The map A^*A : H_1 → H_1 is one-to-one and onto (i.e., for any y ∈ H_1 there exists a unique x ∈ H_1 such that A^*Ax = y) and has a continuous inverse if and only if A is one-to-one and R(A) is closed.

In contrast to the range R(A), the kernel N(A) of a continuous linear operator is always closed, which is easy to prove. Take a sequence {x_n}_{n≥1} ⊂ N(A) that converges to x ∈ H_1. We then deduce that

‖Ax‖_{H_2} = ‖A(x − x_n)‖_{H_2} ≤ ‖A‖ ‖x − x_n‖_{H_1} → 0.

This shows that Ax = 0 and thus x ∈ N(A). We conclude this section with another theorem to ensure that the inverse operator is continuous.

Theorem 2.10. Let A : H_1 → H_2 be a continuous linear map between two Hilbert spaces that is one-to-one. Then the inverse A^{-1} from R(A) to H_1 is continuous if and only if there exists a positive constant C_0 such that

‖Ax‖_{H_2} ≥ C_0 ‖x‖_{H_1}, for all x ∈ D(A).

2.4 Orthogonality and Projections

We end this chapter with a discussion of orthogonality in a Hilbert space H and orthogonal projections onto subspaces. These concepts will be of fundamental importance in the development of semiparametric theory.

2.4.1 Orthogonality

Recall that we defined x, y ∈ H to be orthogonal, written x ⊥ y, iff ⟨x, y⟩ = 0. Now consider a subset H_0 ⊂ H. We say that x ∈ H is orthogonal to H_0, written x ⊥ H_0, iff x ⊥ y for all y ∈ H_0. Analogously, we say that two subsets H_1 and H_2 are orthogonal iff ⟨x, y⟩ = 0 for any pair x ∈ H_1 and y ∈ H_2.

Definition 2.17. Let H_0 be a subset of H. The orthogonal complement of H_0, denoted H_0^⊥, is the closed subspace

H_0^⊥ = {y ∈ H : ⟨x, y⟩ = 0 for all x ∈ H_0}.

Note that we can define the orthogonal complement of any subset of a Hilbert space and the result is always a closed subspace. Linearity follows from the linearity of the inner product and the closedness follows from the continuity of the inner product.

Proposition 2.9. For subsets Hi ⊂ H, i = 0, 1, 2, we have that,

(i) (H_0^⊥)^⊥ = H_0,

(ii) (H_1 + H_2)^⊥ = H_1^⊥ ∩ H_2^⊥,

(iii) H_1^⊥ + H_2^⊥ = (H_1 ∩ H_2)^⊥.

Using this terminology, we can state an interesting relation between the kernel and the range of a linear bounded operator A and its adjoint A^* between two Hilbert spaces H_1 and H_2. Its usefulness will appear in later chapters.

Proposition 2.10. Let A : H_1 → H_2 be a continuous linear map between two Hilbert spaces. Then

(i) N(A) = R(A^*)^⊥,

(ii) N(A^*) = R(A)^⊥.

2.4.2 Orthogonal Projections

In this section we state a key result for Hilbert spaces which we will use repeatedly. It gives rise to the concept of a projection operator. Next, we conclude this chapter with some applications of this important theorem.

Definition 2.18. Let H_0 be a subset of the Hilbert space H and take x ∈ H. We call

d(x, H_0) = inf_{y∈H_0} ‖x − y‖

the distance of x to H_0.

Definition 2.19. A subset H_0 of the Hilbert space H is called convex iff for any x, y ∈ H_0, the segment {αx + (1 − α)y : 0 ≤ α ≤ 1} is contained within H_0.

Lemma 2.2 (Projection Lemma). Let H_0 be a closed, convex subset of the Hilbert space H and take x ∈ H. Then there exists a unique element x_0 ∈ H_0 for which ‖x − x_0‖ = d(x, H_0).

Theorem 2.11 (Projection Theorem). Let H_0 be a closed subspace of the Hilbert space H and take x ∈ H. Then x can be uniquely written as

x = x_0 + (x − x_0),

where x_0 = Π(x|H_0) ∈ H_0, the unique element from the previous lemma, and x − x_0 = x − Π(x|H_0) ∈ H_0^⊥. The element Π(x|H_0) is uniquely determined by the orthogonality relationship ⟨x − Π(x|H_0), y⟩ = 0 for all y ∈ H_0.

Note we need the completeness of the Hilbert space H to assure the existence of x_0. As a consequence of the Projection Theorem, we see that the Hilbert space H can be written as

H = H_0 ⊕ H_0^⊥,     (2.2)

for any closed subspace H_0. The situation is sketched in Figure 2.1.

Figure 2.1: Projection onto a linear subspace.

Definition 2.20. The operator Π(·|H0) defined in the previous theorem is called the orthogonal projection operator onto the closed subspace H0. The element x0 = Π(x|H0) is then called the orthogonal projection of x onto H0.

The following proposition gives the key properties of the orthogonal projection operator.

Proposition 2.11.

1. If H0 is a closed subspace of H, then the orthogonal projection operator Π(·|H0) ≡ Π satisfies:

(i) Π is linear: Π(x + y) = Π(x) + Π(y), for all x, y ∈ H,

(ii) Π is idempotent: Π² = Π,

(iii) Π is self-adjoint: Π^* = Π.

2. Properties (ii) and (iii) are equivalent to

(iv) ⟨Π(x), Π(y)⟩ = ⟨Π(x), y⟩ for all x, y ∈ H.

3. If an operator Π satisfies (i), (ii) and (iii), or equivalently (i) and (iv), then Π is the orthogonal projection operator onto H_0 := R(Π).

From (iv) it follows that the orthogonal projection operator is bounded,

‖Π(x)‖² = ⟨Π(x), Π(x)⟩ = ⟨Π(x), x⟩ ≤ ‖Π(x)‖ ‖x‖,

which shows that ‖Π(x)‖ ≤ ‖x‖. By this, we see that ‖Π‖ ≤ 1. In addition, one can prove that ‖Π‖ = 1. Note, when Π is an orthogonal projection operator, we see that (2.2) yields

H = R(Π) ⊕ N (Π).

The next proposition gives some useful rules when calculating orthogonal projections.

Proposition 2.12.

1. If H1 and H2 are closed subspaces of H and H1 ⊂ H2, then

Π(·|H1) = Π{Π(·|H2)|H1},

or with Πi ≡ Π(·|Hi) for i = 1, 2 we have Π1 = Π1 ◦ Π2.

2. If H1 and H2 are closed subspaces of H and H1 ⊥ H2, then

Π(·|H1 + H2) = Π(·|H1) + Π(·|H2).

The first assertion states that the projection onto the closed subspace H1 can be found by first projecting onto the closed subspace H2 and then projecting this projection onto H1. The second assertion states that the projection onto the sum of two orthogonal closed subspaces is the same as the sum of the projections onto these closed subspaces separately. Let us state a final result involving the operator A∗A as defined earlier that will be used in later chapters.

Proposition 2.13. Let A : H_1 → H_2 be a continuous linear map between two Hilbert spaces. If A^*A : H_1 → H_1 is continuously invertible, then A(A^*A)^{-1}A^* : H_2 → R(A) is the orthogonal projection onto the range of A.
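A small numerical check of Proposition 2.13 in the Euclidean case (the random matrix below is purely illustrative): with A of full column rank, P = A(A^*A)^{-1}A^* is idempotent, self-adjoint and acts as the identity on R(A), i.e., it is the orthogonal projection onto the range of A:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 3))                    # full column rank with probability one
P = A @ np.linalg.inv(A.T @ A) @ A.T           # A (A*A)^{-1} A*

print(np.allclose(P @ P, P))                   # idempotent: P^2 = P
print(np.allclose(P.T, P))                     # self-adjoint: P^* = P
v = A @ rng.normal(size=3)                     # an arbitrary element of R(A)
print(np.allclose(P @ v, v))                   # P is the identity on R(A)
w = rng.normal(size=6)
print(np.allclose(A.T @ (w - P @ w), 0))       # the residual w - Pw is orthogonal to R(A)
```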

2.4.3 Applications of the Projection Theorem

In this section we consider some applications of the Projection Theorem; more specifically, we consider projections onto finite-dimensional subspaces. We take our inspiration from Tsiatis (2006), [35], §2.4, but we deduce the results for arbitrary Hilbert spaces. The applications we consider here are natural generalizations.

Projection Onto a Finite-Dimensional Subspace of H

Let H be a Hilbert space relative to an inner product ⟨·, ·⟩. Now take k arbitrary elements u_1, . . . , u_k of H. Denote this by the vector u = [u_1, . . . , u_k]^T. As a finite-dimensional subspace of H we consider the linear span of the vector u. That is,

H_0 = lin u = {a^T u : a ∈ R^k}.

When the components of u are linearly independent, the dimension of H_0 is equal to k. Let x be an arbitrary element of H. By the Projection Theorem, the projection of x onto H_0 is given by the unique element Π(x|H_0) = a_0^T u that satisfies

⟨x − a_0^T u, a^T u⟩ = Σ_{j=1}^k a_j ⟨x − a_0^T u, u_j⟩ = 0

for all a = [a_1, . . . , a_k]^T ∈ R^k. It is not difficult to see that this is equivalent with ⟨x − a_0^T u, u_j⟩ = 0 for all j = 1, . . . , k. Just put a = [0, . . . , 1, . . . , 0]^T with a one in the j-th position. This can be written as

⟨x − a_0^T u, u^T⟩ = 0^{1×k},

where ⟨x − a_0^T u, u^T⟩ denotes the vector [⟨x − a_0^T u, u_1⟩, . . . , ⟨x − a_0^T u, u_k⟩]. This implies that ⟨x, u^T⟩ = a_0^T ⟨u, u^T⟩, where ⟨u, u^T⟩ ∈ R^{k×k} is the matrix with (i, j)-th entry ⟨u_i, u_j⟩. Suppose this matrix is invertible, then

a_0^T = ⟨x, u^T⟩ ⟨u, u^T⟩^{-1},

in which case the unique projection is equal to

Π(x|H_0) = ⟨x, u^T⟩ ⟨u, u^T⟩^{-1} u.     (2.3)

Projection Onto a Finite-Dimensional Subspace of H^q

As before, let H be a Hilbert space relative to the inner product ⟨·, ·⟩. Now consider the direct sum of q copies of H, H^q = H × · · · × H. As seen before, this is a Hilbert space relative to the inner product

⟨x, y⟩_{H^q} = Σ_{i=1}^q ⟨x_i, y_i⟩

for all x = [x_1, . . . , x_q]^T, y = [y_1, . . . , y_q]^T ∈ H^q. Take r arbitrary elements v_1, . . . , v_r of H. Denote this by the vector v = [v_1, . . . , v_r]^T. As a finite-dimensional subspace of H^q we consider the linear span of the vector v. That is,

H_0 = lin v = {Bv : B ∈ R^{q×r}}.

When the components of v are linearly independent, the dimension of H_0 is equal to qr. Let x = [x_1, . . . , x_q]^T be an arbitrary element of H^q. By the Projection Lemma, the projection of x onto H_0 is given by the unique element Π(x|H_0) = B_0 v that satisfies

‖x − B_0 v‖_{H^q} = inf_{B∈R^{q×r}} ‖x − Bv‖_{H^q}.

By definition of the inner product in H^q, we see that

‖x − Bv‖²_{H^q} = ⟨x − Bv, x − Bv⟩_{H^q} = Σ_{i=1}^q ⟨x_i − (Bv)_i, x_i − (Bv)_i⟩ = Σ_{i=1}^q ‖x_i − Σ_{j=1}^r B_{ij} v_j‖².

From this we see that minimizing ‖x − Bv‖_{H^q} is equivalent to minimizing each component ‖x_i − Σ_{j=1}^r B_{ij} v_j‖ for i = 1, . . . , q. Thus, finding the projection of the q-dimensional vector x onto the subspace H_0 is equivalent to taking each x_i and projecting it individually onto the subspace spanned by v in H, as in the previous application. The formula is given by (2.3). Stacking these individual projections into a vector shows that the unique projection is equal to

Π(x|H_0) = ⟨x, v^T⟩ ⟨v, v^T⟩^{-1} v,     (2.4)

where ⟨x, v^T⟩ ∈ R^{q×r} is the matrix with (i, j)-th entry ⟨x_i, v_j⟩. The matrix ⟨v, v^T⟩ is defined as before and is assumed to be invertible. Note we could also obtain this result by a direct calculation, see [35], p.17-18, but here we used a more insightful argument.
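In the Hilbert space of mean-zero random functions used later in this thesis, the inner products in (2.4) are covariances, and the projection can be approximated from a sample by replacing them with sample covariances. The following sketch (the underlying random variables and the Monte Carlo sample size are illustrative assumptions) checks that the residual x − Π(x|H_0) obtained from formula (2.4) is uncorrelated with every component of v, which is exactly the defining orthogonality property:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
W = rng.normal(size=(n, 3))

# a q = 2 dimensional mean-zero random function x and an r = 2 dimensional v (illustrative choices)
x = np.column_stack([W[:, 0] + 0.5 * W[:, 1], W[:, 1] ** 2 - 1.0])
v = np.column_stack([W[:, 1], W[:, 2]])

x_vT = x.T @ v / n                       # sample version of <x, v^T>, a q x r matrix
v_vT = v.T @ v / n                       # sample version of <v, v^T>, an r x r matrix

B0 = x_vT @ np.linalg.inv(v_vT)          # Pi(x | H_0) = B0 v, formula (2.4)
resid = x - v @ B0.T                     # realized values of x - Pi(x | H_0)

print(np.round(resid.T @ v / n, 6))      # ~ 0: the residual is orthogonal to every v_j
```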

Part II

Introduction to Semiparametric Efficiency for the Class of RAL Estimators


Chapter 3

Geometry of Influence Functions

In this chapter, we give a crash course in the geometry of influence functions. For a more detailed description of this theory and additional proofs of theorems, we refer to Tsiatis (2006), [35], Chapter 3 or [42] (in Dutch). Hilbert spaces will play an important role and this will allow us to study efficiency of estimators in a geometric way. In this chapter, we restrict ourselves to parametric models, models that are smoothly indexed by a finite-dimensional real valued vector of parameters. In the next chapter, we extend the ideas put forward here to semiparametric models. Rather than defining a statistical model as a class of probability distributions as in the introduction, here we adopt the definition from [35]. A parametric model is thus defined as a set of probability densities,

P = {p(x; θ) : θ ∈ Θ ⊂ R^p},     (3.1)

with respect to some dominating measure ν_X. The dimension p is some finite positive integer. Note this is just a notational detail, as both definitions (1.1) and (3.1) are equivalent; we mention it only to avoid confusing the reader. The set Θ is defined to be open so we are able to define derivatives with respect to each θ ∈ Θ. In practice, we are often only interested in a subset of the parameters. The theory we develop here deals with estimation of the part we are interested in. Nonetheless, the entire set of parameters is necessary to describe the class of possible densities. More specifically, we partition the parameter θ as (β^T, η^T)^T where β is a q-dimensional vector and η is an r-dimensional vector with q + r = p. The part β will be referred to as the parameter of interest we want to estimate and, as throughout this thesis, the part η will be referred to as the so-called nuisance parameter. For later reference, the truth will be denoted by θ_0 = (β_0^T, η_0^T)^T. Before defining the class of estimators we shall consider in this part, we define an important Hilbert space that will be extensively used in this part.

3.1 The Space of Mean-Zero q-dimensional Random Functions

In this section we give an important example of a Hilbert space that will be extensively used in this part. Recall that our data are envisioned as realizations of the random vectors X_1, . . . , X_n. Let X denote a single observation and, as we have seen, the underlying probability space was denoted by (Ω, U, P). Let P denote the distribution of X. Now consider the set of all q-dimensional mean-zero random functions of X, h : Ω → R^q, by which we mean that h(X) is measurable, satisfies the mean-zero property, i.e., E{h(X)} = 0, and in addition we demand E{h^T(X)h(X)} < +∞. When we refer to an element h, we implicitly mean h(X). We denote this set by H_q. It is clear this set is a linear space. The subscript q is used to highlight the dimension of the function h. This is not the dimension of the space H_q itself since this set is infinite-dimensional. See [35], p.12 for more information about this. We thus see that H_q is an infinite-dimensional vector space. We now define an inner product on the space H_q. For two functions h_1, h_2 ∈ H_q, we define

⟨h_1, h_2⟩ = E(h_1^T h_2) = ∫_X h_1^T(x) h_2(x) dP(x) = ∫_X h_1^T(x) h_2(x) p_X(x) dν_X(x).

This inner product is called the covariance inner product. It clearly satisfies the axioms (i), (ii) and (iii) of an inner product. As for condition (iv), we define h_1 to be equal (equivalent) to h_2 if h_1 equals h_2 almost everywhere. This means that P(h_1 ≠ h_2) = 0. Thus we defined an inner product space with associated norm {E(h^T h)}^{1/2}. Now we argue why this is a Hilbert space. Take an arbitrary element h ∈ H_q. Every component of h can be seen to belong to L_2(P) because E(h^T h) < +∞, thus h ∈ L_2^q(P), a Hilbert space. Since H_q is a closed subspace of L_2^q(P), by Theorem 2.2 we conclude that H_q is a Hilbert space.

Remark 3.1. It can be insightful to note that H_q, the set of q-dimensional mean-zero random functions, is the direct sum of q copies of H_1, the set of 1-dimensional mean-zero random functions, i.e., H_q = H_1^q. The inner product defined on H_q corresponds with the inner product for the direct sum of Hilbert spaces. Take h_i ∈ H_q with

h_i = [h_i^{(1)}, h_i^{(2)}, . . . , h_i^{(q)}]^T,

i = 1, 2. We have that h_i^{(j)} ∈ H_1 for each j = 1, . . . , q and i = 1, 2. We see that

E(h_1^T h_2) = Σ_{j=1}^q E(h_1^{(j)} h_2^{(j)}).

This indeed corresponds with the definition of the inner product of the direct sum of Hilbert spaces.

The Hilbert space Hq will be of great importance in the development of the semiparametric theory discussed in this part. This part directly focuses on the estimation of a finite-dimensional parameter. The Hilbert space Hq will also show its importance in Part III, but it will not be used so explicitly because the theory developed in that part will be more general. That part will be more abstract than Part II and focuses on the estimation of functionals on the model, a more general approach for semiparametric efficiency.

3.2 Influence Functions and (Regular) Asymptotically Linear Estimators

Before studying the efficiency theory of estimators, we discuss the type of estimators we will restrict ourselves to, so-called regular asymptotically linear estimators, to be defined shortly. This is a quite restrictive class of estimators; however, as stated in [35], most reasonable estimators belong to this class. The theory developed in this part is made only for this class of estimators. For a more general discussion of semiparametric efficiency, we will have to wait until Part III, where we will discuss a much richer but more abstract and tedious theory.

3.2.1 Asymptotically Linear Estimators and Influence Functions

As noticed before, we are interested in estimating β. An estimator β̂_n for β is a q-dimensional measurable random function of our sample X_1, . . . , X_n. Let us define what is meant by an asymptotically linear estimator.

Definition 3.1. An estimator β̂_n for β is called asymptotically linear (ALE) iff there exists a q-dimensional measurable random function ϕ(X, θ_0) such that

1. E_{θ_0}{ϕ(X, θ_0)} = 0^{q×1},

2. E_{θ_0}{ϕ^T(X, θ_0) ϕ(X, θ_0)} < +∞ and E_{θ_0}{ϕ(X, θ_0) ϕ^T(X, θ_0)} is non-singular, and most importantly,

√n(β̂_n − β_0) = (1/√n) Σ_{i=1}^n ϕ(X_i, θ_0) + o_P(1),     (3.2)

where o_P(1) is a term that converges in probability to zero as n goes to infinity.

Remark 3.2. The function ϕ(X, θ_0) is defined with respect to the true distribution p(x; θ_0) that generates the data. We use the notation ϕ(X, θ) to emphasize that this random function will vary according to the value of θ in the model. To avoid overloading the notation, we assume ϕ(X) is evaluated at the truth and thus E{ϕ(X)} is a shorthand for E_{θ_0}{ϕ(X, θ_0)}.

Definition 3.2. The q-dimensional measurable random function ϕ(X) is called the influence function of the estimator β̂_n for β. The function ϕ(X_i) is referred to as the i-th influence function or the influence function of the i-th observation.

This terminology comes from the robustness literature, where the influence function at a point x is defined as measuring the influence of an infinitesimal change in the data generating distribution (by adding a probability mass at that point x) on the estimator, see Hampel (1974), [14]. Note we defined ϕ(X) to be the influence function of the estimator βˆn. This is justified by the following theorem.

Theorem 3.1. An asymptotically linear estimator has a unique (a.s.) influence function.

By this we mean that if ϕ_1 and ϕ_2 are two influence functions of β̂_n, then P(ϕ_1 = ϕ_2) = 1. The nice thing about asymptotically linear estimators is that they have interesting asymptotic properties and they are fully characterized by their influence function. First we note that, by definition of an asymptotically linear estimator β̂_n, this estimator is consistent, i.e., β̂_n → β_0 in probability under θ_0. In addition, by the CLT and Slutsky's theorem, we obtain that

√n(β̂_n − β_0) →^{D(θ_0)} N(0, E(ϕϕ^T)).     (3.3)

This means an asymptotically linear estimator is asymptotically normal with asymptotic variance equal to the variance of the influence function. By Prohorov's theorem, the estimator β̂_n is a √n-consistent estimator for β_0, i.e., √n(β̂_n − β_0) = O_P(1).

From the asymptotic normality and the uniqueness of ϕ we see that, in an asymptotic sense, an asymptotically linear estimator can be identified through its influence function. We conclude that every ALE is a consistent, moreover √n-consistent, and asymptotically normal estimator. This representation of estimators through their influence function lends itself to geometric interpretations in terms of the Hilbert space of mean-zero q-dimensional random functions because, for fixed θ, the influence function ϕ ∈ H_q by definition. However, before describing this geometry, we comment on some additional regularity conditions. We keep the discussion to a minimum as this will be discussed in full detail in Chapter 7.
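A minimal Monte Carlo sketch of (3.2)-(3.3) for the simplest possible case (the exponential data distribution is an illustrative assumption): the sample mean is asymptotically linear with influence function ϕ(X) = X − β_0, so its scaled error should be approximately normal with variance E(ϕϕ^T) = var(X):

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_rep = 2_000, 5_000
beta0 = 1.0                                   # true mean of an Exponential(1) observation

X = rng.exponential(scale=beta0, size=(n_rep, n))
beta_hat = X.mean(axis=1)                     # the sample mean as estimator of beta0

# For the sample mean the o_P(1) term in (3.2) is identically zero:
# sqrt(n)(beta_hat - beta0) = n^{-1/2} sum_i (X_i - beta0), with E(phi^2) = var(X) = 1.
scaled_err = np.sqrt(n) * (beta_hat - beta0)
print(scaled_err.mean(), scaled_err.var())    # ~ 0 and ~ 1, matching N(0, E(phi phi^T))
```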

3.2.2 Regular Asymptotically Linear Estimators

It is a well known result that the variance of an unbiased estimator is not smaller than the Cramér-Rao lower bound, see for example Casella and Berger (2002), [6], §7.3. However, most reasonable estimators are only asymptotically unbiased. One can expect that the asymptotic variance of such asymptotically unbiased estimators will also not be smaller than the Cramér-Rao lower bound. Under suitable regularity conditions, the maximum likelihood estimator (MLE) attains the Cramér-Rao lower bound. But there is more. One of the peculiarities of asymptotic theory is that one can construct asymptotically unbiased estimators that have asymptotic variance equal to the Cramér-Rao lower bound for most of the parameter values but have smaller asymptotic variance than the Cramér-Rao lower bound for the other parameter values. Such estimators are referred to as super-efficient. Super-efficiency seems like a good property for an estimator to possess, but it comes at the expense of poor estimation in the neighbourhood of the point of super-efficiency. This will be discussed in full detail in Chapter 7. To exclude such estimators, we impose some additional regularity conditions on the class of estimators. Specifically, we will require that an estimator is regular, as we now define.

Definition 3.3. Consider a local data generating process (LDGP), where for each n the data are distributed according to a parameter θ_n, where √n(θ_n − θ^*) converges to a constant vector τ. This means θ_n is close to some fixed parameter θ^* relative to the sample size. To be precise, for each n we have an i.i.d. sample X_{1n}, . . . , X_{nn} with density function p(x; θ_n), where θ_n = (β_n^T, η_n^T)^T and θ^* = (β^{*T}, η^{*T})^T. An estimator β̂_n, more specifically β̂_n(X_{1n}, . . . , X_{nn}), is said to be regular if for each θ^*, √n(β̂_n − β_n) has a limiting distribution that does not depend on the LDGP.

For asymptotically linear estimators, this will just mean that if the true asymptotic distribution is given by

√n{β̂_n(X_{1n}, . . . , X_{nn}) − β^*} →^{D(θ^*)} N(0, Σ^*),

where X_{1n}, . . . , X_{nn} is an i.i.d. sample with density function p(x; θ^*) for all n, then

√n{β̂_n(X_{1n}, . . . , X_{nn}) − β_n} →^{D(θ_n)} N(0, Σ^*),

where X_{1n}, . . . , X_{nn} is an i.i.d. sample with density function p(x; θ_n) for all n and for every sequence (θ_n)_{n=1}^{+∞} satisfying √n(θ_n − θ^*) → τ, where τ is an arbitrary constant vector.
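The super-efficiency phenomenon mentioned above, and the local sequences θ_n appearing in the definition of regularity, can be made concrete with a classical thresholding construction (this Hodges-type estimator, the normal model and all constants are illustrative and not part of this chapter's development): at β = 0 it beats the usual asymptotic variance, but along a local alternative β_n = τ/√n it behaves much worse than the sample mean, which is exactly the behaviour regularity excludes:

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_rep = 10_000, 200_000

def hodges(x_bar, n):
    # threshold the sample mean to 0 whenever it is smaller than n^{-1/4}
    return np.where(np.abs(x_bar) < n ** (-0.25), 0.0, x_bar)

for beta in (0.0, 5.0 / np.sqrt(n)):          # the truth 0, and a local alternative tau/sqrt(n)
    # the mean of n i.i.d. N(beta, 1) observations is exactly N(beta, 1/n)
    x_bar = rng.normal(loc=beta, scale=1.0 / np.sqrt(n), size=n_rep)
    print(beta,
          n * np.mean((x_bar - beta) ** 2),             # ~ 1 for the sample mean at both points
          n * np.mean((hodges(x_bar, n) - beta) ** 2))  # ~ 0 at beta = 0, but ~ 25 at the local alternative
```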

Asymptotically linear estimators that are also regular are referred to as regular asymptotically linear estimators, or RAL estimators. This will be the class of estimators we will focus on in this and the next two chapters. In Tsiatis (2006), [35], p.27, it is argued that it is reasonable to restrict ourselves to RAL estimators. Note, however, that although a great range of estimators fall into this class, it can sometimes be worthwhile to consider non-regular estimators, and then the theory developed in this part fails to deal with this. It would be nice to have an easy criterion to check this regularity. Such a criterion is available for asymptotically linear estimators. In addition, it will present a powerful result that allows us to describe the geometry of influence functions for RAL estimators. This will aid us in defining and visualizing efficiency of RAL estimators and help us generalize ideas to semiparametric models. Before stating this theorem, we define and give notation for the score vector for a single observation X.

Definition 3.4. Consider a parametric model and an observation X ∼ p_X(x; θ) with θ = (β^T, η^T)^T. Then the score vector for X, denoted by S_θ(X, θ_0), is defined by

S_θ(x, θ_0) = ∂ log p_X(x; θ)/∂θ |_{θ=θ_0}.     (3.4)

This is the p-dimensional vector of partial derivatives of the log likelihood with respect to the elements of the parameter θ, evaluated at the truth. Note this vector can be partitioned as S_θ(X, θ_0) = {S_β^T(X, θ_0), S_η^T(X, θ_0)}^T with S_β the q-dimensional vector of partial derivatives of the log likelihood with respect to the elements of the parameter of interest β, and S_η the r-dimensional vector of partial derivatives of the log likelihood with respect to the elements of the nuisance parameter η. For parametric models, it is always possible to parametrize the model such that the partition θ = (β^T, η^T)^T is possible. When we consider semiparametric models, this will not always be the case. The parameter of interest can then be seen as a smooth function of θ, i.e., β(θ). In some applications, this is a more natural representation, as we will see in Chapter 5. The theory developed in this part is made principally for the case where the parameter θ can be partitioned as (β^T, η^T)^T. Let us now state the announced theorem on which the theory in this chapter is based. One direction of the implication is stated in Tsiatis (2006), [35], p.28. That the other direction of the implication also holds is stated in Newey (1990), [27], p.103. For this purpose and for later reference we will use the following notation.

Definition 3.5. Let the parameter of interest β(θ) be a smooth q-dimensional function of the p-dimensional parameter θ (q < p). Then denote the q × p matrix of partial derivatives by

∂β(θ)/∂θ^T = Γ(θ), i.e., Γ_{ij}(θ) = ∂β_i/∂θ_j.

Theorem 3.2. Let the parameter of interest β(θ) be a smooth q-dimensional function of the p-dimensional parameter θ (q < p) such that Γ(θ) exists, has rank q and is continuous in θ in a neighbourhood of the truth θ_0. Also let β̂_n be an asymptotically linear estimator with influence function ϕ(X) such that E_θ(ϕ^T ϕ) exists and is continuous in θ in a neighbourhood of θ_0. Then β̂_n is regular if and only if

E{ϕ(X) S_θ^T(X, θ_0)} = Γ(θ_0).     (3.5)

Although this is a very important theorem, the proof is not given. The proof as presented in [35], §3.2, which only proves the theorem in one direction, makes use of the definition of regularity and other concepts that fall outside the scope of this thesis (in particular the concept of contiguity; additional smoothness conditions are discussed that make a local data generating process contiguous to the sequence of distributions at the truth). In [27], p.127-128, a sketch of a proof of both directions is presented, but in a semiparametric setting. However, it a fortiori holds for parametric models.

Corollary 3.1. If in addition the parameter θ can be partitioned as (β^T, η^T)^T, then under the same assumptions as in Theorem 3.2, β̂_n is regular if and only if

(i) E{ϕ(X) S_β^T(X, θ_0)} = I^{q×q},

(ii) E{ϕ(X) S_η^T(X, θ_0)} = 0^{q×r},

where I^{q×q} denotes the q × q identity matrix and 0^{q×r} denotes the q × r matrix of zeros.

Note this is an immediate consequence of Theorem 3.2, because in this case Γ(θ_0) = [I^{q×q}, 0^{q×r}].

The Class of m-estimators.

To gain some insight into Theorem 3.2, under sufficient regularity conditions, Corollary 3.1 can be proven directly for the special case of the class of m-estimators, see [35], §3.2. We summarize below some properties of these so-called m-estimators. The details are also found in [35], §3.2. Such estimators are defined as follows. Take a p-dimensional function of X and θ, m(X, θ), for which E_θ{m(X, θ)} = 0, E_θ{m^T(X, θ) m(X, θ)} < +∞ and E_θ{m(X, θ) m^T(X, θ)} is positive definite for all θ ∈ Θ. Note, for fixed θ, this function belongs to H_p. The m-estimator θ̂_n is defined as the solution of Σ_{i=1}^n m(X_i, θ̂_n) = 0 from an i.i.d. sample with density function p_X(x; θ).

Example 3.1 (MLE). Under suitable regularity conditions, the MLE of θ is an m-estimator. It is found by solving the score equations in θ,

Σ_{i=1}^n S_θ(X_i, θ) = 0.

Because the score vector has mean zero¹, we see that we can take m(X, θ) to be S_θ(X, θ).

These m-estimators are ALE’s with influence function given by

" ( )#−1 ∂m(X, θ) ϕm(X) = − E m(X, θ0). (3.6) ∂θT θ=θ0

The asymptotic normality and the consistency then follows from the properties of ALE’s.

1See for example equation (7.3.8) of Casella and Berger (2002), [6] but this is quite trivial to obtain by noting R that pX (x; θ)dνX (x) = 1 for all θ ∈ Θ and then taking the derivative with respect to θ (assuming we can interchange integration and differentiation). 3.3. Geometry of Influence Functions for Parametric Models 35

Example 3.2 (MLE, continued). Recall the case of the MLE. We have seen this is a special case of an m-estimator where m(X, θ) = Sθ(X, θ). Using (3.6), we see that the influence function for the MLE is given by

" ( )#−1 ∂Sθ(X, θ) ϕMLE(X) = − E Sθ(X, θ0). ∂θT θ=θ0

T We note that −∂Sθ(X, θ)/∂θ equals −Sθθ(X, θ), the p×p matrix of second order partial deriva- tives of the log likelihood with respect to θ. Under suitable regularity conditions (see Casella and Berger (2002), [6], §7.3) the Fisher information matrix I(θ0), defined as the p × p matrix T var{Sθ(X, θ0)} = E{Sθ(X, θ0)Sθ(X, θ0) }, equals −E{Sθθ(X, θ)}. From this we see that the influence function of the MLE is given by

−1 ϕMLE(X) = {I(θ0)} Sθ(X, θ). (3.7)

−1 The asymptotic distribution is normal with mean zero and variance equal to {I(θ0)} .

Under the assumptions made in Theorem 3.2, it is possible to show that the first q parts of the p-dimensional vector (3.6) satisfy the conditions of Corollary 3.1 and from this we can conclude that every m-estimator2 is a RAL estimator. These considerations show that the class of RAL estimator is quite broad since this class contains the wide range of m-estimators.
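To illustrate formula (3.6) numerically (the estimating function, the exponential data distribution and all sample sizes are illustrative assumptions), take a scalar X and θ = (mean, variance) with m(X, θ) = (X − θ_1, (X − θ_1)² − θ_2)^T; here E{∂m/∂θ^T} = −I at the truth, so (3.6) gives ϕ_m(X) = m(X, θ_0), and the scaled Monte Carlo covariance of the m-estimator should match the covariance of this influence function:

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_rep = 2_000, 5_000

# m(X, theta) = (X - theta1, (X - theta1)^2 - theta2) estimates the mean and variance jointly;
# its solution is (sample mean, uncorrected sample variance).
X = rng.exponential(scale=1.0, size=(n_rep, n))         # theta0 = (1, 1) for Exponential(1)
theta_hat = np.column_stack([X.mean(axis=1), X.var(axis=1)])

emp = n * np.cov(theta_hat, rowvar=False)               # Monte Carlo n * Cov(theta_hat)

X_big = rng.exponential(scale=1.0, size=1_000_000)      # influence function values phi_m(X) = m(X, theta0)
phi = np.column_stack([X_big - 1.0, (X_big - 1.0) ** 2 - 1.0])

print(np.round(emp, 1))                                 # both ~ [[1, 2], [2, 8]]
print(np.round(np.cov(phi, rowvar=False), 1))
```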

3.3 Geometry of Influence Functions for Parametric Models

Now that we have discussed which class of estimators we will consider, it is time to look at the geometric interpretations Theorem 3.2 implies. As we noted before, for fixed θ and thus in particular at the truth θ_0, the influence function of an ALE, ϕ(X, θ_0), belongs to H_q, the Hilbert space of q-dimensional measurable functions of X with mean zero and finite variance. In Example 3.1 it is seen that under suitable regularity conditions, S_θ(X, θ_0) has mean zero. This enables us to define a finite-dimensional subspace of H_q.

Definition 3.6. The finite-dimensional subspace of H_q spanned by the score vector S_θ(X, θ_0),

J = {B^{q×p} S_θ(X, θ_0) : B^{q×p} ∈ R^{q×p}},     (3.8)

is called the tangent space.

3.3.1 The Case θ = (β^T, η^T)^T

As we assume, the parameter θ can be partitioned as (β^T, η^T)^T. This results in a decomposition of the tangent space.

Definition 3.7. The finite-dimensional subspace of H_q spanned by the score vector for η, S_η(X, θ_0),

Λ = {B^{q×r} S_η(X, θ_0) : B^{q×r} ∈ R^{q×r}},     (3.9)

is called the nuisance tangent space.

²Of course under suitable regularity conditions, which we do not discuss for the time being, but these can be found in [35], §3.2.

Definition 3.8. The finite-dimensional subspace of H_q spanned by the score vector for β, S_β(X, θ_0),

J_β = {B^{q×q} S_β(X, θ_0) : B^{q×q} ∈ R^{q×q}},     (3.10)

is called the tangent space of the parameter of interest.

With these definitions, it is clear we can write J as the direct sum of J_β and Λ, J = J_β ⊕ Λ. Now consider a RAL estimator β̂_n with corresponding influence function ϕ_{β̂_n}(X). From (ii) of Corollary 3.1, it is clear that ϕ_{β̂_n}(X) ∈ Λ^⊥. The influence function ϕ_{β̂_n}(X) also satisfies (i) of Corollary 3.1. What can we say about the converse? Take any ϕ ∈ H_q satisfying (i) and (ii) of Corollary 3.1. Does there exist a RAL estimator β̂_n for β_0 with influence function ϕ?

Construction of a RAL Estimator With Given Influence Function

The answer to the aforementioned question is positive. A proof of this is given in Tsiatis (2006), [35], p.38-41. The starting point of the proof is considering the estimator βˆn that solves the estimating equations (with ϕ ∈ Hq satisfying (i) and (ii) of Corollary 3.1)

Σ_{i=1}^n ϕ(X_i, β̂_n, η̂_n) = 0,

where η̂_n is a √n-consistent estimator for η_0. It is then shown that β̂_n is an ALE with influence function ϕ and, since it satisfies the conditions of Corollary 3.1, it is a RAL estimator. However, the √n-consistency of the estimator η̂_n for the nuisance parameter η_0 can be too strong, that is, we may not find a consistent estimator for η_0 that converges at speed √n. An analogous heuristic proof can be given when we impose slightly weaker conditions on the consistent estimator η̂_n: we assume we can find an n^{1/4+ε}-consistent estimator η̂_n for η_0, where ε > 0 is a small number. In addition, we need to assume that ϕ is continuously differentiable up to second order with respect to η. Before we prove this, we derive some key facts that will be of great importance in our proof.
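A minimal numerical sketch in the spirit of this construction (the working model, the binned variance estimator playing the role of the preliminary nuisance estimator, and all constants are illustrative assumptions rather than the general argument): in a restricted-moment-type model E(Y|X) = βX one can plug a crude preliminary estimate of the conditional error variance into an estimating equation of the form Σ_i ϕ(X_i, β̂_n, η̂_n) = 0 with ϕ ∝ X(Y − βX)/σ²(X), and the resulting estimator behaves, to first order, like the one that uses the true variance function:

```python
import numpy as np

rng = np.random.default_rng(8)
n, n_rep, beta0 = 2_000, 2_000, 2.0
est_plugin, est_oracle = [], []

for _ in range(n_rep):
    X = rng.uniform(0.5, 2.0, size=n)
    sigma2 = 0.5 + X ** 2                                # true conditional variance of the error
    Y = beta0 * X + rng.normal(scale=np.sqrt(sigma2))

    # preliminary nuisance estimate: unweighted residuals, then a crude binned variance function
    b_init = np.sum(X * Y) / np.sum(X ** 2)
    resid2 = (Y - b_init * X) ** 2
    bins = np.digitize(X, np.quantile(X, np.linspace(0, 1, 11)[1:-1]))
    sigma2_hat = np.array([resid2[bins == b].mean() for b in range(10)])[bins]

    # solve sum_i X_i (Y_i - beta X_i) / sigma2_hat(X_i) = 0 for beta (closed form here)
    est_plugin.append(np.sum(X * Y / sigma2_hat) / np.sum(X ** 2 / sigma2_hat))
    est_oracle.append(np.sum(X * Y / sigma2) / np.sum(X ** 2 / sigma2))

print(np.mean(est_plugin), np.mean(est_oracle))          # both ~ 2.0: consistency
print(n * np.var(est_plugin), n * np.var(est_oracle))    # scaled variances nearly coincide
```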

Lemma 3.1. Take any influence function ϕ(X, β, η) for β. Under suitable regularity conditions3, we have that

(i) −E{∂ϕ/∂η^T (X, β0, η)}|_{η=η0} = E{ϕ(X, β0, η0) Sη^T(X, β0, η0)},

(ii) −E{∂ϕ/∂β^T (X, β, η0)}|_{β=β0} = E{ϕ(X, β0, η0) Sβ^T(X, β0, η0)}.

Proof. The proof of this is not difficult. By definition of an influence function, we have that

E_{β0,η}{ϕ(X, β0, η)} = 0. That is,

∫ ϕ(x, β0, η) p(x, β0, η) dνX(x) = 0.

3Regularity conditions assuring that ϕ(X, β, η) has continuous first order derivatives with respect to θ and η and assuring we can interchange differentiation and integration.

Taking the derivative with respect to η at η0 and assuming we can interchange differentiation and integration, we obtain

∫ ∂ϕ/∂η^T (x, β0, η)|_{η=η0} p(x, β0, η0) dνX(x) + ∫ ϕ(x, β0, η0) Sη^T(x, β0, η0) p(x, β0, η0) dνX(x) = 0.

From this we obtain

−E{∂ϕ/∂η^T (X, β0, η)}|_{η=η0} = E{ϕ(X, β0, η0) Sη^T(X, β0, η0)}.

Similarly, we can show that

−E{∂ϕ/∂β^T (X, β, η0)}|_{β=β0} = E{ϕ(X, β0, η0) Sβ^T(X, β0, η0)}

by noting that E_{β,η0}{ϕ(X, β, η0)} = 0 and then taking the derivative with respect to β at β0.
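The identity in Lemma 3.1 (i) is easy to check numerically. The following hedged sketch (my own illustration, not from the references) uses the model X ~ N(β, η), with η the variance, and the candidate influence function ϕ(X, β, η) = (X − β) + c{(X − β)² − η}, which has mean zero for all (β, η) and satisfies E{ϕ Sβ} = 1; it is not orthogonal to the nuisance score, so both sides of (i) are nonzero and equal c.

```python
# Hedged Monte Carlo check (not from [35]) of Lemma 3.1 (i) in X ~ N(beta, eta),
# eta = variance.  phi(X, beta, eta) = (X - beta) + c{(X - beta)^2 - eta} is an
# unbiased estimating function with E{phi S_beta} = 1; both sides of (i) equal c.
import numpy as np

rng = np.random.default_rng(1)
beta0, eta0, c, n = 1.0, 2.0, 0.3, 500_000
X = rng.normal(beta0, np.sqrt(eta0), size=n)

def phi(x, beta, eta):
    return (x - beta) + c * ((x - beta) ** 2 - eta)

# nuisance score for eta in N(beta, eta): {(x-beta)^2 - eta} / (2 eta^2)
S_eta = ((X - beta0) ** 2 - eta0) / (2 * eta0 ** 2)

lhs = -np.mean((phi(X, beta0, eta0 + 1e-4) - phi(X, beta0, eta0 - 1e-4)) / 2e-4)
rhs = np.mean(phi(X, beta0, eta0) * S_eta)
print(lhs, rhs)   # both close to c = 0.3
```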

As announced, we will work with influence functions satisfying conditions (i) and (ii) of Corollary 3.1. This specializes the previous lemma. Due to the orthogonality of the influence function ϕ and the nuisance tangent space Λ, we see that

E{∂ϕ/∂η^T (X, β0, η)}|_{η=η0} = 0. (3.11)

This equation has an interesting interpretation: the influence function ϕ(X, β0, η), viewed as a function of η, does not vary locally, by which we mean that small changes in η do not affect the influence function. As we will see, this implies some local robustness. Therefore it is insightful to go through the upcoming derivation (without going into detail on the technical difficulties). The other condition implies that

−E{∂ϕ/∂β^T (X, β, η0)}|_{β=β0} = I^{q×q}. (3.12)

The upcoming derivation is not based on the reference works in the bibliography. Thus, let ϕ be a q-dimensional measurable function with mean zero and finite variance that satisfies conditions (i) and (ii) of Corollary 3.1. Now suppose (for small ε) we have a preliminary n^{1/4+ε}-consistent estimator ηˆn for the nuisance parameter η0, i.e., n^{1/4+ε}(ηˆn − η0) = OP(1). Now define βˆn to be the solution of

Σ_{i=1}^n ϕ(Xi, βˆn, ηˆn) = 0. (3.13)

We now give a sketch of how it can be shown that βˆn is a RAL estimator for β0 with influence function ϕ(X, β0, η0). A Taylor expansion with respect to β about β0 yields

0 = n^{-1/2} Σ_{i=1}^n ϕ(Xi, βˆn, ηˆn)
  = n^{-1/2} Σ_{i=1}^n ϕ(Xi, β0, ηˆn) + {n^{-1} Σ_{i=1}^n ∂ϕ/∂β^T (Xi, βn^*, ηˆn)} √n(βˆn − β0),

where βn^* is an intermediate value between βˆn and β0. From this we obtain

√n(βˆn − β0) = {−n^{-1} Σ_{i=1}^n ∂ϕ/∂β^T (Xi, βn^*, ηˆn)}^{-1} n^{-1/2} Σ_{i=1}^n ϕ(Xi, β0, ηˆn).

From the uniform WLLN and the Continuous Mapping Theorem, we see that

{−n^{-1} Σ_{i=1}^n ∂ϕ/∂β^T (Xi, βn^*, ηˆn)}^{-1} →P [−E{∂ϕ/∂β^T (X, β, η0)}|_{β=β0}]^{-1}

and from (3.12) we obtain that

√n(βˆn − β0) = n^{-1/2} Σ_{i=1}^n ϕ(Xi, β0, ηˆn) + oP(1). (3.14)

This result is not satisfactory, as we want the asymptotic distribution of βˆn to be independent of our estimator ηˆn. In addition, ϕ(Xi, β0, ηˆn) need not even have mean zero under the truth. Using the orthogonality of ϕ(X, β0, η0) and the nuisance tangent space Λ brings us back on the right track. First, we use a (formal) Taylor expansion in η about η0 (assuming ϕ is twice continuously differentiable with respect to η),

n^{-1/2} Σ_{i=1}^n ϕ(Xi, β0, ηˆn) = n^{-1/2} Σ_{i=1}^n ϕ(Xi, β0, η0)
  + {n^{-1/2} Σ_{i=1}^n ∂ϕ/∂η^T (Xi, β0, η0)}(ηˆn − η0) + {n^{-1} Σ_{i=1}^n ∂²ϕ/∂η² (Xi, β0, ηn^*)} √n(ηˆn − η0)²,

where ηn^* is an intermediate value between ηˆn and η0. Let us investigate the different parts of this long equation. By the key fact that E{∂ϕ/∂η^T (X, β0, η0)} = 0 and the CLT, we see that

n^{-1/2} Σ_{i=1}^n ∂ϕ/∂η^T (Xi, β0, η0) →D N(0, Σ),

where Σ represents the variance of ∂ϕ/∂η^T. This means that n^{-1/2} Σ_{i=1}^n ∂ϕ/∂η^T (Xi, β0, η0) = OP(1). From the consistency of ηˆn we then obtain

{n^{-1/2} Σ_{i=1}^n ∂ϕ/∂η^T (Xi, β0, η0)}(ηˆn − η0) = OP(1) oP(1) = oP(1).

Note that this step is possible due to the orthogonality relation (ii) of Corollary 3.1. Next, we consider the last term. By the uniform WLLN,

n^{-1} Σ_{i=1}^n ∂²ϕ/∂η² (Xi, β0, ηn^*) →P E{∂²ϕ/∂η² (X, β0, η0)}.

We assume this is finite. Now we use the n^{1/4+ε}-consistency of ηˆn. We have that n^{1/4+ε}(ηˆn − η0) = OP(1). This implies that

n^{1/4}(ηˆn − η0) = n^{1/4+ε}(ηˆn − η0) n^{−ε} = OP(1) oP(1) = oP(1)

and by the Continuous Mapping Theorem we have √n(ηˆn − η0)² = oP(1). Summing up the obtained results, we find that

n^{-1/2} Σ_{i=1}^n ϕ(Xi, β0, ηˆn) = n^{-1/2} Σ_{i=1}^n ϕ(Xi, β0, η0) + oP(1).

Substituting this in (3.14) yields

√n(βˆn − β0) = n^{-1/2} Σ_{i=1}^n ϕ(Xi, β0, η0) + oP(1). (3.15)

Because ϕ(X, β0, η0) satisfies (i) and (ii) of Corollary 3.1, we have constructed a RAL estimator and (3.15) shows that it has influence function ϕ(X, β0, η0).
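The local robustness just derived, namely that the estimating equation with ηˆn plugged in behaves to first order as if η0 were known, can be seen in a small simulation. The toy model below is my own choice and not an example from [35]: Y = β0 + η0 X + ε with X and ε independent standard normals, so that the estimating function Y − β − ηX has mean zero and, because E(X) = 0, is orthogonal to the score for the slope η at the truth.

```python
# Hedged simulation sketch (not from [35]): with an estimating function that is
# orthogonal to the nuisance score, plugging in a consistent eta_hat or the true
# eta_0 gives (essentially) the same sampling distribution of sqrt(n)(betahat-beta0).
import numpy as np

rng = np.random.default_rng(2)
beta0, eta0, n, n_rep = 1.0, 2.0, 2_000, 2_000
draws_hat, draws_true = [], []

for _ in range(n_rep):
    X = rng.normal(size=n)
    Y = beta0 + eta0 * X + rng.normal(size=n)
    eta_hat = np.polyfit(X, Y, 1)[0]          # root-n consistent slope estimate
    beta_hat  = np.mean(Y - eta_hat * X)      # solves sum_i {Y_i - b - eta_hat X_i} = 0
    beta_true = np.mean(Y - eta0 * X)         # the same equation with eta_0 known
    draws_hat.append(np.sqrt(n) * (beta_hat - beta0))
    draws_true.append(np.sqrt(n) * (beta_true - beta0))

print(np.var(draws_hat), np.var(draws_true))  # both close to var(eps) = 1
```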

Implications

The construction above has serious and important implications for the asymptotic distribution of RAL estimators with influence function orthogonal to the nuisance tangent space that we now wish to highlight.

1. The first thing we wish to highlight is that the argument of the derivation above was independent of the choice of the n^{1/4+ε}-consistent estimator ηˆn for the nuisance parameter η0. In [35], the argument is independent of the choice of the √n-consistent estimator for η0.

2. The second thing we wish to highlight is that the asymptotic distribution of the estimator obtained by solving the estimating equation (3.13) is the same as the asymptotic distribution of the estimator β˜n solving the estimating equation

Σ_{i=1}^n ϕ(Xi, β˜n, η0) = 0, (3.16)

thus assuming the true value of the nuisance parameter is known. This can easily be seen from (3.14), where ηˆn should be replaced by η0 and no further Taylor expansion with respect to η is needed. That is, both βˆn and β˜n have

N(0, E{ϕ(X, β0, η0) ϕ^T(X, β0, η0)})

as asymptotic distribution. This fact follows from the orthogonality of the estimating function evaluated at the truth (the influence function ϕ(X, β0, η0)) to the nuisance tangent space. As we noted, this translates into the fact that ϕ, as a function of η, does not vary locally, by which we mean it is unaffected by small, local variations of the nuisance parameter. More specifically, if we use an influence function of β that is orthogonal to the nuisance tangent space as our estimating function, we do not need to acknowledge the uncertainty of estimating the nuisance parameter in the asymptotic variance of the estimator βˆn: the orthogonality automatically accounts for the uncertainty in ηˆn. Hence, this kind of robustness, where the asymptotic distribution of an estimator does not depend on whether the true value of the nuisance parameter is known or whether it is estimated in the estimating equation, nor, by the first remark, on how it has been estimated, is one of the bonuses of working with estimating equations whose estimating functions are orthogonal to the nuisance tangent space.

3. The third thing we now wish to highlight is that, by Corollary 3.1, all RAL estimators have influence functions that belong to a subset V of Hq satisfying

(i) E{ϕ(X) Sβ^T(X, θ0)} = I^{q×q},
(ii) E{ϕ(X) Sη^T(X, θ0)} = 0^{q×r}.

In addition, by the heuristic argument above it follows that any element of V ⊂ Hq is the influence function of some RAL estimator. This is the geometric interpretation of the conditions of Corollary 3.1.

3.3.2 The Case β = β(θ)

Even though the focus of this part is on the situation of the previous section, where the parameter can be nicely partitioned, it can be worthwhile to consider the more general case where our parameter of interest β is a smooth function of θ, i.e., β = β(θ). This is not really considered in [35]. However, it is implicitly used there in the search for what we will call the efficient influence function. Therefore, the following derivations are not based on the reference works in the bibliography. It is clear we will have less structure than in the previous section because the setting is more general. This situation also gives a hint of how the theory will be developed in the next part. Consider a RAL estimator βˆn with corresponding influence function ϕ_{βˆn}(X). We know that ϕ_{βˆn}(X) must satisfy (3.5) of Theorem 3.2. What can we say about the converse? Take any ϕ ∈ Hq satisfying (3.5) of Theorem 3.2. Does there exist a RAL estimator βˆn for β0 with influence function ϕ?

Construction of a RAL Estimator With Given Influence Function

The answer to the aforementioned question is also positive. We will give a heuristic argument to show this. The starting point of the proof is considering an arbitrary solution θˆn of the estimating equation (with ϕ(X, θ0) ∈ Hq satisfying (3.5) of Theorem 3.2)

Σ_{i=1}^n ϕ(Xi, θˆn) = 0.

We will show (heuristically) that β(θˆn) is an ALE with influence function ϕ(X, θ0) and, since it satisfies (3.5) of Theorem 3.2, it will be a RAL estimator. A Taylor expansion with respect to θ about θ0 yields

0 = n^{-1/2} Σ_{i=1}^n ϕ(Xi, θˆn)
  = n^{-1/2} Σ_{i=1}^n ϕ(Xi, θ0) + {n^{-1} Σ_{i=1}^n ∂ϕ/∂θ^T (Xi, θn^*)} √n(θˆn − θ0),

where θn^* is an intermediate value between θˆn and θ0. From the uniform WLLN we obtain that

n^{-1} Σ_{i=1}^n ∂ϕ/∂θ^T (Xi, θn^*) →P E{∂ϕ/∂θ^T (X, θ0)}.

This shows that

−E{∂ϕ/∂θ^T (X, θ0)} √n(θˆn − θ0) = n^{-1/2} Σ_{i=1}^n ϕ(Xi, θ0) + oP(1). (3.17)

Completely analogously as in Lemma 3.1, starting from the property that Eθ{ϕ(X, θ)} = 0 and then differentiating with respect to θ and evaluating at θ0, we obtain that

−E{∂ϕ/∂θ^T (X, θ0)} = E{ϕ(X, θ0) Sθ^T(X, θ0)} = Γ(θ0).

Combining this with (3.5) and substituting this in (3.17), we obtain

Γ(θ0) √n(θˆn − θ0) = n^{-1/2} Σ_{i=1}^n ϕ(Xi, θ0) + oP(1). (3.18)

Note this equation does not imply that θˆn is an ALE. We assumed β = β(θ), a smooth function. Thus, a Taylor expansion of this smooth function in θ about θ0 yields

√n{β(θˆn) − β(θ0)} = ∂β(θ0)/∂θ^T √n(θˆn − θ0) + oP(1) = Γ(θ0) √n(θˆn − θ0) + oP(1).

Note the estimator θˆn needs to be of sufficient quality to assure the remainder term is oP(1); e.g., n^{1/4+ε}-consistency will be enough. The (technical) argument is the same as for the case θ = (β^T, η^T)^T, see §3.3.1. From (3.18) we obtain the desired result,

√n{β(θˆn) − β(θ0)} = n^{-1/2} Σ_{i=1}^n ϕ(Xi, θ0) + oP(1). (3.19)

This shows that β(θˆn) is an ALE with influence function ϕ(X, θ0) and as mentioned before it is a RAL estimator since ϕ(X, θ0) satisfies (3.5).
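As a hedged numerical illustration of this construction (again my own, not from the reference works), take θ to be the mean of a N(θ, 1) sample, β(θ) = θ², and θˆn the solution of Σ(Xi − θ) = 0; by (3.19), the plug-in estimator β(θˆn) should be asymptotically linear with influence function Γ(θ0)(X − θ0), where Γ(θ0) = 2θ0.

```python
# Hedged sketch (not from [35]) of the beta = beta(theta) case: theta is the mean
# of N(theta, 1), beta(theta) = theta^2, and phi(X, theta_0) = 2*theta_0*(X - theta_0).
import numpy as np

rng = np.random.default_rng(3)
theta0, n = 1.5, 100_000
X = rng.normal(theta0, 1.0, size=n)

theta_hat = X.mean()                                    # solves the estimating equation
lhs = np.sqrt(n) * (theta_hat ** 2 - theta0 ** 2)       # sqrt(n){beta(theta_hat) - beta(theta_0)}
rhs = np.sqrt(n) * np.mean(2 * theta0 * (X - theta0))   # (1/sqrt(n)) sum_i phi(X_i, theta_0)
print(lhs, rhs)                                         # differ by o_P(1)
```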

Implication

Because this setting is more general compared to the setting where θ can be partitioned, it is now clear we obtain less structure when the parameter β is a smooth function of θ. Nonetheless, we now also obtain a nice geometric interpretation of Theorem 3.2. From this theorem we know that all RAL estimators have an influence function that belongs to a subset V of Hq satisfying

E{ϕ(X) Sθ^T(X, θ0)} = Γ(θ0).

In addition, by the heuristic argument above it follows that any element of V ⊂ Hq is the influence function of some RAL estimator.

3.3.3 Importance

Up to now, we were able to visualize each RAL estimator (in both cases) as an element of V, via a correspondence that is onto: each RAL estimator has a unique representation described by its influence function belonging to V, and every element of V arises in this way. These RAL estimators are then asymptotically normally distributed and this normal distribution is completely characterized through the corresponding influence function,

√n(βˆn − β0) →D N(0, E(ϕϕ^T)).

Because we have determined the whole class of RAL estimators (to which we restrict ourselves in this part) and we have a nice representation, the definition of the most efficient RAL estimator is now straightforward. We will look for the RAL estimator with the smallest asymptotic variance. As we argued earlier, the exciting thing is that the asymptotic variance precisely equals the variance of the corresponding influence function. We noted these influence functions can be considered as points in the subset V of the infinite-dimensional Hilbert space Hq. Hence, we can make the following very important interpretation. The search for the RAL estimator with the smallest asymptotic variance is equivalent to the search for the point in V that is closest to the origin of Hq, since the squared distance to the origin represents the variance of that element. This will be clarified in the next section.

3.4 Efficient Influence Function

In this section we are finally ready to show how the geometry of Hilbert spaces allows us to identify the most efficient influence function, by which we mean the influence function with the smallest variance. However, some caution is needed. When the dimension q of β is at least 2, we must be careful about what we mean by smaller variance, because the variance of a q-dimensional influence function is a q × q matrix.

3.4.1 Comparing Variances when Dimension is Greater than One

Consider two RAL estimators βˆn^{(1)} and βˆn^{(2)} for β0 with influence functions ϕ^{(1)}(X) and ϕ^{(2)}(X), respectively. We say that the estimator βˆn^{(1)} is more efficient than the estimator βˆn^{(2)} if, for any a ∈ R^q, √n{a^T βˆn^{(1)} − a^T β0} has smaller asymptotic variance than √n{a^T βˆn^{(2)} − a^T β0}. Equivalently, var{ϕ^{(1)}(X)} ≤ var{ϕ^{(2)}(X)} if and only if var{a^T ϕ^{(1)}(X)} ≤ var{a^T ϕ^{(2)}(X)} for all a ∈ R^q. More precisely,

a^T [E{ϕ^{(2)}(X) ϕ^{(2)T}(X)} − E{ϕ^{(1)}(X) ϕ^{(1)T}(X)}] a ≥ 0.

This means the matrix E{ϕ^{(2)}(X) ϕ^{(2)T}(X)} − E{ϕ^{(1)}(X) ϕ^{(1)T}(X)} must be semi-positive definite. Thus, we say that var{ϕ^{(1)}(X)} ≤ var{ϕ^{(2)}(X)} iff var{ϕ^{(2)}(X)} − var{ϕ^{(1)}(X)} is semi-positive definite. Unfortunately, another problem arises when the dimension of β is bigger than one. When we are working in H1, the variance of h ∈ H1 equals its squared norm, i.e., var(h) = ‖h‖². If h1 and h2 are elements of H1 that are orthogonal to each other, the Pythagorean Theorem states that

var(h1 + h2) = var(h1) + var(h2).

When we are working in Hq, the variance of h ∈ Hq does not equal its squared norm, since the variance is a matrix and the norm is a real number. However, there is an important special case where we can write a similar relation for the variances. We use the definition as presented in [35].

Definition 3.9. A subspace U ⊂ Hq is called a q-replicating space if U is a direct sum of q copies of a subspace U1 ⊂ H1, i.e., U = U1 × · · · × U1 = U1^q. Every u ∈ U can then be written as (u1, . . . , uq)^T with uj ∈ U1 for all j = 1, . . . , q.

Example 3.3. Consider the subspace S ⊂ Hq spanned by v(X) ∈ Hr, more specifically

S = {B^{q×r} v(X): B^{q×r} ∈ R^{q×r}}.

It is not difficult to see that S is a q-replicating space, S = U1^q with

U1 = {b^T v(X): b ∈ R^r}.

Just take b to be the rows of the matrix B.

Since the tangent space J and the nuisance tangent space Λ are subspaces spanned by score vectors, these are examples of q-replicating spaces. For this type of spaces we can now state an interesting generalization of the Pythagorean Theorem.

Theorem 3.3 (Multivariate Pythagorean Theorem). If h ∈ U ⊂ Hq with U a q-replicating space and ℓ ∈ Hq is orthogonal to U, then

var(ℓ + h) = var(ℓ) + var(h), (3.20)

where var(h) = E(hh^T). As a consequence, we obtain the Multivariate Pythagorean Theorem: for all h^* ∈ Hq,

var(h^*) = var{Π(h^*|U)} + var{h^* − Π(h^*|U)}. (3.21)

Remark 3.3. This is quite easy to prove. A proof can be found in [35]. A more detailed proof can be found in [42].

Remark 3.4. The existence of the projection Π(h∗|U) of h∗ onto U readily follows from the Projection Theorem. For more information about projections, see §2.4.2.

In particular, this means that, for such cases, the variance of ℓ + h, for q-dimensional ℓ and h, is larger (in the sense defined above) than either the variance of ℓ or that of h: var(ℓ + h) ≥ var(ℓ) and var(ℓ + h) ≥ var(h). This will be very useful in what follows. When we consider the decomposition of h ∈ Hq into the projection onto J or Λ and the residual after projecting, we can immediately apply the Multivariate Pythagorean Theorem, because J and Λ are q-replicating spaces.
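The following sketch (my own toy construction, not from [35]) verifies Theorem 3.3 for q = 2 in a setting where every element is a linear function of Z ~ N(0, I3), so that all variances reduce to exact matrix algebra; it also checks the resulting semi-positive definite ordering.

```python
# Hedged check (not from [35]) of the Multivariate Pythagorean Theorem, q = 2.
# Lambda is the 2-replicating space spanned by S_eta(Z) = (Z_1, Z_2)^T; h = B S_eta
# lies in Lambda, and l is the residual of an arbitrary g after projecting onto Lambda.
import numpy as np

C = np.array([[1., 0., 0.],
              [0., 1., 0.]])          # S_eta = C Z, so E(S_eta S_eta^T) = I_2
A = np.array([[1., 2., 3.],
              [0., 1., -1.]])         # g = A Z, an arbitrary element of H_2
B = np.array([[2., -1.],
              [0.5, 1.]])             # h = B S_eta, an element of Lambda

P = A @ C.T @ np.linalg.inv(C @ C.T)  # projection coefficient E(g S^T) E(S S^T)^{-1}
L = A - P @ C                         # l = g - Pi(g | Lambda) = L Z, orthogonal to Lambda

var_l  = L @ L.T                      # var(l)  (cov(Z) = I_3)
var_h  = (B @ C) @ (B @ C).T          # var(h)
var_lh = (L + B @ C) @ (L + B @ C).T  # var(l + h)

print(np.allclose(var_lh, var_l + var_h))            # multivariate Pythagoras holds
print(np.linalg.eigvalsh(var_lh - var_l) >= -1e-12)  # var(l+h) - var(l) is semi-positive definite
```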

3.4.2 Deriving the Efficient Influence Function

After all the work that has been done up to now, the derivation of the efficient influence function is quite straightforward. Recall that we defined the efficient influence function for RAL estimators to be the influence function with smallest variance, where the meaning of smallest is now fully clear.

Figure 3.1: Depiction of a linear variety V .

The General Case β = β(θ)

It turns out the set V of all influence functions for RAL estimators (defined in the previous section) is a linear variety, a translation of a subspace away from the origin. More specifically, a linear variety V can be written as V = x0 + M where x0 ∈ Hq, x0 ∉ M, ‖x0‖ ≠ 0 and M is a subspace of Hq. The situation is sketched in Figure 3.1.

Theorem 3.4. The set of all influence functions of RAL estimators, i.e., the elements of Hq that satisfy condition (3.5) of Theorem 3.2, is the linear variety ϕ∗(X) + J ⊥, where ϕ∗(X) is any influence function and J ⊥ is the orthogonal complement of the tangent space. Thus we can write V = ϕ∗(X) + J ⊥.

Proof. The proof is not that difficult. Take any ℓ(X) ∈ J⊥. Due to the orthogonality, it must satisfy E{ℓ(X) Sθ^T(X, θ0)} = 0^{q×p}. Now set ϕ(X) = ϕ^*(X) + ℓ(X) with ϕ^*(X) an arbitrary element of V. An easy calculation shows that

E{ϕ(X) Sθ^T(X, θ0)} = E[{ϕ^*(X) + ℓ(X)} Sθ^T(X, θ0)]
  = E{ϕ^*(X) Sθ^T(X, θ0)} + E{ℓ(X) Sθ^T(X, θ0)}
  = Γ(θ0) + 0^{q×p} = Γ(θ0),

which shows that ϕ(X) ∈ V and thus that ϕ^*(X) + J⊥ ⊂ V. Conversely, take ϕ(X) ∈ V. We can write this as ϕ(X) = ϕ^*(X) + {ϕ(X) − ϕ^*(X)}. Since

E[{ϕ(X) − ϕ^*(X)} Sθ^T(X, θ0)] = E{ϕ(X) Sθ^T(X, θ0)} − E{ϕ^*(X) Sθ^T(X, θ0)} = Γ(θ0) − Γ(θ0) = 0,

we trivially obtain that ϕ(X) − ϕ^*(X) ∈ J⊥, which shows that ϕ(X) ∈ ϕ^*(X) + J⊥ and thus that V ⊂ ϕ^*(X) + J⊥. Combining the obtained results, we see that V = ϕ^*(X) + J⊥, as desired.
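As a quick sanity check of this theorem (a hedged sketch of my own, not from [35]), consider the model N(µ, σ²) with β = µ: ϕ^*(X) = X − µ is an influence function, and ℓ(X) = (X − µ)³ − 3σ²(X − µ) is orthogonal to both scores, so ϕ^* + ℓ must again satisfy E{ϕ Sθ^T} = Γ(θ0) = (1, 0); being further from the origin, it has a strictly larger variance, anticipating the efficiency discussion below.

```python
# Hedged Monte Carlo check (not from [35]) of the linear variety structure in
# N(mu, s2) with beta = mu: adding an element of the orthogonal complement of the
# tangent space preserves the defining condition but increases the variance.
import numpy as np

rng = np.random.default_rng(7)
mu0, s20, n = 0.0, 1.0, 1_000_000
X = rng.normal(mu0, np.sqrt(s20), size=n)

S_mu = (X - mu0) / s20                              # score for mu
S_s2 = ((X - mu0) ** 2 - s20) / (2 * s20 ** 2)      # score for s2

phi_star = X - mu0                                  # an influence function for mu
ell = (X - mu0) ** 3 - 3 * s20 * (X - mu0)          # orthogonal to both scores
phi = phi_star + ell                                # another element of V

print(np.mean(phi * S_mu), np.mean(phi * S_s2))     # approximately 1 and 0
print(phi_star.var(), phi.var())                    # the second variance is larger
```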

We have now determined a nice formula for the set of all influence functions for RAL estimators. As a consequence, we just need to search in this set for the one with the smallest variance, the

efficient influence function, which will be denoted by ϕeff(X). This will be the one closest to the origin. Recall that this should be read as follows: for any influence function ϕ(X) ≠ ϕeff(X), the matrix var{ϕ(X)} − var{ϕeff(X)} is semi-positive definite. The efficient influence function in the general case is derived in the following theorem.

Theorem 3.5. The efficient influence function is given by

ϕeff(X) = ϕ^*(X) − Π(ϕ^*(X)|J⊥) = Π(ϕ^*(X)|J), (3.22)

where ϕ^*(X) is an arbitrary influence function and J is the tangent space. This can be explicitly written as

ϕeff(X) = Γ(θ0) I^{-1}(θ0) Sθ(X, θ0). (3.23)

Proof. From the previous theorem, we know we need to search in the linear variety V = ϕ^*(X) + J⊥. Now put ϕeff = ϕ^* − Π(ϕ^*|J⊥) = Π(ϕ^*|J). By definition of projections, Π(ϕ^*|J⊥) ∈ J⊥. Thus, ϕeff(X) ∈ V. This means that ϕeff(X) is an influence function. Therefore, the linear variety V can also be written as ϕeff(X) + J⊥. This means that any influence function ϕ can be written as ϕeff + ℓ with ℓ ∈ J⊥. Moreover, ϕeff is orthogonal to J⊥ and it belongs to the q-replicating space J. Using the Multivariate Pythagorean Theorem gives

var(ϕ) = var(ϕeff) + var(ℓ). (3.24)

This shows that ϕeff is indeed the efficient influence function. The uniqueness also follows from this equation. When ϕ˜ = ϕeff + ℓ˜ ∈ V is also an influence function with variance equal to var(ϕeff), we see from (3.24) that var(ℓ˜) = 0, which implies that ‖ℓ˜‖ = 0 and thus that ℓ˜ = 0 in Hq. Hence, ϕ˜ = ϕeff in Hq.
We now prove the second part. Since ϕeff(X) = Π{ϕ^*(X)|J} ∈ J, there must exist a matrix Beff^{q×p} ∈ R^{q×p} such that ϕeff(X) = Beff^{q×p} Sθ(X, θ0). Because ϕeff(X) ∈ V, it also satisfies E{ϕeff(X) Sθ^T(X, θ0)} = Γ(θ0). Equivalently, Beff^{q×p} E{Sθ(X, θ0) Sθ^T(X, θ0)} = Γ(θ0). As defined in Example 3.2, we see that the Fisher information matrix appears in our calculations. Rewriting this gives Beff^{q×p} = Γ(θ0) I^{-1}(θ0). Consequently, the efficient influence function is given by

ϕeff(X) = Γ(θ0) I^{-1}(θ0) Sθ(X, θ0).

Note that ϕeff is also the unique influence function belonging to the tangent space J.
Remark 3.5. Note the proofs of both theorems are given because they explicitly show how the geometry of the influence functions is used. Therefore, the proofs are very insightful.
Remark 3.6. To obtain the desired formula (3.23), we had to do a lot of work. Remember the structure of this formula. In Part III, §6.3.1, we will see that we have to do almost no work to obtain this result for parametric models, due to the abstract approach that will be used. It will almost be just a matter of definitions.

The variance of the efficient influence function now follows from an easy calculation,

var{ϕeff(X)} = E{ϕeff(X) ϕeff^T(X)} = Γ(θ0) I^{-1}(θ0) Γ^T(θ0).

This corresponds with the multivariate Cramér-Rao bound4.

4The multivariate Cramér-Rao bound will be recaptured in Chapter 7.
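Formula (3.23) and the Cramér-Rao correspondence can be verified numerically. The sketch below is my own illustration, assuming the model N(µ, σ²) with θ = (µ, σ²) and the (arbitrarily chosen) parameter of interest β(θ) = µ + σ², so that Γ(θ0) = (1, 1); the Monte Carlo variance of ϕeff should match Γ(θ0) I^{-1}(θ0) Γ^T(θ0).

```python
# Hedged sketch (not from [35]) of formula (3.23) in N(mu, s2), theta = (mu, s2),
# beta(theta) = mu + s2.  The Fisher information is diag(1/s2, 1/(2 s2^2)) and the
# Cramer-Rao bound for beta is Gamma I^{-1} Gamma^T = s2 + 2 s2^2.
import numpy as np

rng = np.random.default_rng(4)
mu0, s20, n = 1.0, 2.0, 1_000_000
X = rng.normal(mu0, np.sqrt(s20), size=n)

Gamma = np.array([[1.0, 1.0]])                              # d beta / d theta^T
I = np.diag([1.0 / s20, 1.0 / (2.0 * s20 ** 2)])            # Fisher information

S_theta = np.vstack([(X - mu0) / s20,                               # score for mu
                     ((X - mu0) ** 2 - s20) / (2 * s20 ** 2)])      # score for s2
phi_eff = (Gamma @ np.linalg.inv(I) @ S_theta).ravel()              # formula (3.23)

print(phi_eff.var(), (Gamma @ np.linalg.inv(I) @ Gamma.T).item())   # both near 10
```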

The Case θ = (β^T, η^T)^T

It is instructive to consider the special case θ = (β^T, η^T)^T. For this we introduce the notion of the efficient score vector. It turns out the efficient score vector is closely related to the efficient influence function. Recall that in this case the tangent space possesses more structure: we could define the notion of a nuisance tangent space.

Definition 3.10. The efficient score is defined to be the residual of the score vector with respect to the parameter of interest after projecting it onto the nuisance tangent space, i.e.,

Seff (X, θ0) = Sβ(X, θ0) − Π{Sβ(X, θ0)|Λ}.

We now use (2.4); from this we see that Π{Sβ(X, θ0)|Λ} = E(Sβ Sη^T){E(Sη Sη^T)}^{-1} Sη(X, θ0) and the efficient score can then be written as

Seff(X, θ0) = Sβ(X, θ0) − E(Sβ Sη^T){E(Sη Sη^T)}^{-1} Sη(X, θ0).

The term E(Sβ Sη^T){E(Sη Sη^T)}^{-1} Sη(X, θ0) can be seen as the predicted value from the regression of Sβ on Sη. The following corollary shows that the efficient influence function in this case can be obtained from the efficient score by appropriate scaling.

Corollary 3.2. When the parameter θ can be partitioned as (β^T, η^T)^T, where β is the parameter of interest and η the nuisance parameter, then the efficient influence function can be written as

ϕeff(X, θ0) = {E(Seff Seff^T)}^{-1} Seff(X, θ0). (3.25)

Proof. We just give a sketch; the easy calculations are left to the reader. First we argue that ϕeff(X, θ0) defined in (3.25) belongs to the class of influence functions for RAL estimators. It is certainly orthogonal to the nuisance tangent space since the efficient score is orthogonal to the nuisance tangent space by construction. From this we see that E(ϕeff Sη^T) = 0^{q×r}. By appropriately scaling the efficient score, we see that E(ϕeff Sβ^T) = I^{q×q}. This uses the fact that E(Seff Sβ^T) = E(Seff Seff^T). This shows that ϕeff(X, θ0) is an appropriate influence function. To be the efficient influence function, it must be the unique influence function in J = Jβ ⊕ Λ. Since Sβ(X, θ0) ∈ Jβ and Π{Sβ(X, θ0)|Λ} ∈ Λ, we can conclude that ϕeff(X, θ0) ∈ J.

Remark 3.7. Note Corollary 3.2 also admits a straightforward proof, because in this case Γ(θ0) = [I^{q×q}, 0^{q×r}]. This and (3.23) then imply (3.25).

We end this chapter with a calculation of the variance of the efficient influence function. It will explicitly show how much information is lost due to the presence of the nuisance parameter. An easy calculation gives

var(ϕeff) = E(ϕeff ϕeff^T)
  = E[{E(Seff Seff^T)}^{-1} Seff Seff^T {E(Seff Seff^T)}^{-1}]
  = {E(Seff Seff^T)}^{-1} = {var(Seff)}^{-1}.

This shows the variance of the efficient influence function equals the inverse of the variance of the efficient score. When the variance of the efficient score is large, the variance of the efficient influence function will be small, and vice versa. Using the explicit form of the efficient score, we can obtain a more specialized expression,

var(Seff) = var{Sβ − E(Sβ Sη^T) E(Sη Sη^T)^{-1} Sη}
  = var(Sβ) − 2 E(Sβ Sη^T) E(Sη Sη^T)^{-1} E(Sη Sβ^T) + var{E(Sβ Sη^T) E(Sη Sη^T)^{-1} Sη}
  = var(Sβ) − 2 E(Sβ Sη^T) E(Sη Sη^T)^{-1} E(Sη Sβ^T) + E(Sβ Sη^T) E(Sη Sη^T)^{-1} E(Sη Sη^T) E(Sη Sη^T)^{-1} E(Sη Sβ^T)
  = var(Sβ) − E(Sβ Sη^T) E(Sη Sη^T)^{-1} E(Sη Sβ^T).

Now define Iββ = E(Sβ Sβ^T), Iηη = E(Sη Sη^T) and Iβη = E(Sβ Sη^T). Note that in this notation, the Fisher information matrix becomes

I(θ0) = ( Iββ    Iβη
          Iβη^T  Iηη ).

We obtain that

var(ϕeff) = {var(Seff)}^{-1} = (Iββ − Iβη Iηη^{-1} Iβη^T)^{-1}. (3.26)

Equation (3.26) now explicitly reveals the information that is lost due to the presence of the unknown nuisance parameter. If the nuisance parameter were known, the efficient score would equal the ordinary score for β, i.e., Seff(X, θ0) = Sβ(X, θ0), and the variance of the efficient influence function would be Iββ^{-1}. However, when the nuisance parameter is unknown, we need to modify the variance. Equation (3.26) shows that when the nuisance parameter η is unknown, the most efficient estimator of β has variance given by the inverse of the variance of the efficient score, i.e., the variance of the ordinary score for β after subtracting Iβη Iηη^{-1} Iβη^T. Thus, Iβη Iηη^{-1} Iβη^T quantifies the information that is lost due to the presence of the nuisance parameter. This is a nice result to end this chapter with and we are ready to extend the ideas put forward here to semiparametric models in the next chapter.
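The information-loss formula (3.26) is easy to see in a toy example (my own construction, not from the references): with X = (X1, X2), X1 ~ N(β, 1) and X2 ~ N(β + η, 1) independent, we have Iββ = 2, Iηη = 1 and Iβη = 1, so the efficient score reduces to X1 − β and Iβη Iηη^{-1} Iβη^T = 1 quantifies exactly the information carried by X2 that is lost when η is unknown.

```python
# Hedged Monte Carlo check (not from [35]) of the information loss formula (3.26)
# in the toy model X1 ~ N(beta, 1), X2 ~ N(beta + eta, 1), independent.
import numpy as np

rng = np.random.default_rng(5)
beta0, eta0, n = 0.5, 1.0, 1_000_000
X1 = rng.normal(beta0, 1.0, size=n)
X2 = rng.normal(beta0 + eta0, 1.0, size=n)

S_beta = (X1 - beta0) + (X2 - beta0 - eta0)   # score for beta
S_eta  = X2 - beta0 - eta0                    # score for eta

I_bb = np.mean(S_beta * S_beta)
I_ee = np.mean(S_eta * S_eta)
I_be = np.mean(S_beta * S_eta)

S_eff = S_beta - I_be / I_ee * S_eta          # residual of S_beta on S_eta
print(np.mean(S_eff ** 2), I_bb - I_be ** 2 / I_ee)   # both near 1
print(1.0 / (I_bb - I_be ** 2 / I_ee))                # var(phi_eff) = 1, vs 0.5 if eta were known
```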

Remark 3.8. If we denote by βˆn^{MLE} and ηˆn^{MLE} the values of β and η that maximize the likelihood Π_{i=1}^n p(Xi; β, η), then under suitable regularity conditions, the estimator βˆn^{MLE} of β0 is a RAL estimator whose influence function is the efficient influence function given by (3.25). If however the parameter of interest is given by β(θ) and we define by θˆn^{MLE} the value of θ that maximizes the likelihood Π_{i=1}^n p(Xi; θ), then under suitable regularity conditions, the estimator β(θˆn^{MLE}) of β0 is a RAL estimator with efficient influence function (3.23). An example of this is given in Example 3.2, where β(θ) = θ and thus Γ(θ0) = I^{p×p}. We can conclude, as we already know, that the MLE (under suitable regularity conditions) is the most efficient regular estimator in parametric models.

Chapter 4

Extension to Semiparametric Models

In the previous chapter, we developed the theory of the geometry of influence functions for RAL estimators for parameters in finite-dimensional parametric models. In this chapter, our focus will be on semiparametric models. As in the previous chapter, we will represent a statistical model as a class of densities rather than a class of distributions. Moreover, we will focus our attention on semiparametric models in a strict sense,

P = {p(x; β, η): β ∈ Θ ⊂ R^q, η ∈ H},

with respect to some dominating measure νX. The q-dimensional parameter of interest β belongs to an open subset Θ of R^q and the nuisance parameter η belongs to some infinite-dimensional set H. This situation will show how the geometry of influence functions can be easily extended. In addition, as for the rest of this thesis, we assume the parameters β and η to be variationally independent. This means that any choice of (β, η) ∈ Θ × H results in a proper density p(x; β, η) in the semiparametric model; more specifically, the parameters β and η can be perturbed independently. We write θ = (β^T, η^T)^T 1 and, as before, the truth will be denoted by θ0 = (β0^T, η0^T)^T. The true density is then denoted by p0(x) = p(x; β0, η0). Keep in mind, however, that some problems lend themselves more naturally to models represented by the class of densities p(x; θ), where θ is infinite-dimensional and the parameter of interest is a smooth q-dimensional function β(θ) of θ, e.g., estimating the mean in a nonparametric model. It can therefore be worthwhile to consider this case as well. Unfortunately, not much attention has been given to the generalization of the geometry of influence functions for this case in Tsiatis (2006), [35]. Therefore, in §4.4.2, we will extend the concepts introduced for semiparametric models in a strict sense and we will be able to prove analogous results; however, we will need to make one crucial additional assumption to define efficiency, in contrast with a semiparametric model in a strict sense where, as we shall see, efficiency can always be defined. Even though this chapter will illuminate some nice results, they will not be completely satisfactory. The reason is that this theory is only applicable to regular estimators that are in addition asymptotically linear, for which defining efficiency is straightforward by just comparing the asymptotic variances. Consequently, this motivates the more abstract approach in Part III. The abstract approach will reveal more structure to the semiparametric efficiency problem. Nonetheless, this chapter gives a nice and intuitive transition to semiparametric models and the basic

1Even though the infinite-dimensional parameter η is not really a vector, we use this notation to remain consistent with the notation of the previous chapter.

facts about semiparametric efficiency can be given. However, as mentioned already, only for the class of semiparametric RAL estimators. We are ready to extend the ideas put forward in the previous chapter to semiparametric models, starting with the important notion of parametric submodels. As in the parametric context, we will try to obtain a criterion against which to measure the efficiency of estimators, using the geometry of influence functions. These parametric submodels will form the missing link between parametric models and semiparametric models to make the extension.

4.1 Parametric Submodels

The transition from finite-dimensional parametric models to infinite-dimensional semiparametric models happens through so called parametric submodels of the semiparametric model. The tech- nique of working with finite-dimensional problems as an approximation to infinite-dimensional problems and then taking limits to infinity is widely used in mathematics. This will also be the case here.

Definition 4.1. A parametric submodel of P, denoted Pβ,γ = {p(x; β, γ): β ∈ R^q, γ ∈ R^r}, is a class of densities indexed by a finite-dimensional parameter (β^T, γ^T)^T ∈ R^{q+r} such that

(i) Pβ,γ ⊂ P,

(ii) p0(x) ∈ Pβ,γ, i.e., there exists a vector (β0, γ0) such that p0(x) = p(x; β0, γ0).

This definition shows we are actually identifying the infinite-dimensional nuisance parameter η with the r-dimensional nuisance parameter γ. The dimension of γ depends on the choice of the parametric submodel. The reason we introduce these parametric submodels is that we could as well analyze the data in a certain parametric submodel.

Remark 4.1. In the previous chapter, we implicitly imposed some regularity conditions on the considered parametric models. For example, the parametric model had to possess sufficient regularity to assure the interchange of differentiation and integration of the density with respect to the parameters. This was necessary to show that the score has mean zero. Consequently, the parametric model had to satisfy certain smoothness conditions. Therefore, we will impose similar smoothness conditions on the parametric submodels of a semiparametric model. Appropriate smoothness and regularity conditions are given in Definition A.1 of the appendix of Newey (1990), [27], where the notions of smooth and regular (when the information matrix is non-singular) models are also formally introduced. Using this terminology, we are implicitly working with smooth and regular parametric submodels. However, once we have defined the semiparametric nuisance tangent space, it will turn out this is actually no restriction.

Before we proceed, some caution is needed with the terms parametric model on the one hand and parametric submodel on the other hand. A parametric model is a model where the densities are characterized through a finite-dimensional parameter, under the assumption that these parameters suffice to describe the density of the data. A parametric submodel, however, is only a conceptual idea that is used to help us develop theory for semiparametric models. We assume the truth is contained in this submodel but, since the truth is unknown, we can only describe such submodels generically. These submodels are hence not useful for data analysis. We clarify the concept of a parametric submodel with an easy example.

Additive semiparametric regression model. This model was first introduced in Engle et al. (1986), [11]. It assumes the data satisfy

Yi = Xi^T β0 + g0(Vi) + εi, (4.1)

where Yi is a scalar response variable and Xi and Vi are vectors of exogenous variables. The function g0(v) is unknown and εi is a disturbance term. We want to focus on the semiparametric nature of the regression function (4.1). Therefore, we assume that the disturbance is independent of the regressors. In addition, we assume that each εi is distributed as N(0, σ0²) with σ0² known and that the density p0(x, v) of Xi and Vi is also known. The parameter of interest is then the finite-dimensional parameter β with true value β0 and the nuisance parameter is the unknown function g(·), the nonparametric component. Thus, this model is parametrized through the parameter {β, g(·)} and this fits our current setting. A parametric submodel corresponds to a parametrization of the function g(v), say g(v, γ), such that g(v, γ0) = g0(v) for some finite-dimensional γ0. The parameters of this parametric submodel are then θ = (β^T, γ^T)^T. This is not a parametric model, since it is clear this parametric submodel can only be described generically.
For those still not feeling at ease with the difference between both concepts, we refer to Tsiatis (2006), [35], p. 60-61, where an additional example is given for Cox's proportional hazards model.
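A hedged sketch of what such a parametric submodel could look like: pick finitely many directions b1(v), ..., br(v) and perturb the (conceptually known) true function g0 along them, g(v, γ) = g0(v) + γ^T b(v), so that γ0 = 0 recovers the truth. The particular g0 and basis below are assumptions made only for this illustration; as stressed above, such submodels can only be described generically in practice.

```python
# Hedged illustration (not from [35]) of a parametric submodel for the additive
# semiparametric regression model: g(v, gamma) = g_0(v) + gamma^T b(v), gamma_0 = 0.
import numpy as np

def g0(v):                       # hypothetical true nonparametric component
    return np.sin(v)

def basis(v):                    # hypothetical perturbation directions b_1, b_2, b_3
    return np.vstack([v, v ** 2, np.cos(v)])

def g_submodel(v, gamma):        # the parametric submodel for g
    return g0(v) + gamma @ basis(v)

v = np.linspace(0.0, 3.0, 5)
print(np.allclose(g_submodel(v, np.zeros(3)), g0(v)))   # gamma_0 = 0 gives the truth
```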

4.2 Influence Functions for Semiparametric RAL Estimators

We now show how the (smooth and regular) parametric submodels introduced in the previous section will help us generalize the geometry of influence functions for RAL estimators in parametric models to semiparametric models. Since parametric submodels are indexed by a finite-dimensional parameter θ = (β^T, γ^T)^T, where β ∈ R^q and γ ∈ R^r, we can apply the theory developed in the previous chapter. We list the most important properties of the influence function of a RAL estimator in a parametric submodel Pβ,γ below:

1. Every influence function belongs to a subspace of Hq (see §3.1) that is orthogonal to the nuisance tangent space of the parametric submodel, i.e., for every influence function ϕ we have that

ϕ ⊥ Λγ = {B^{q×r} Sγ(X, β0, γ0): B^{q×r} ∈ R^{q×r}},

where Sγ(x, β0, γ0) = ∂ log p(x; β0, γ0)/∂γ.

2. The efficient influence function of the parametric submodel is given by

ϕβ,γ^eff(X, β0, γ0) = {E(Sβ,γ^eff Sβ,γ^{eff T})}^{-1} Sβ,γ^eff(X, β0, γ0),

where

Sβ,γ^eff(X, β0, γ0) = Sβ(X, β0, γ0) − Π{Sβ(X, β0, γ0)|Λγ},

the efficient score of this parametric submodel, with Sβ(x, β0, γ0) = ∂ log p(x; β0, γ0)/∂β.

3. Finally, the asymptotic variance of the most efficient RAL estimator and thus the variance of the efficient influence function of the parametric submodel is given by

[E{Sβ,γ^eff(X, β0, γ0) Sβ,γ^{eff T}(X, β0, γ0)}]^{-1}.

Note this is the smallest possible asymptotic variance among all RAL estimators for β in the considered parametric submodel.

An estimator, say βˆn, for β is a RAL estimator for the semiparametric model P if it is a RAL estimator for every parametric submodel Pβ,γ. Intuitively, this can be seen as follows. If βˆn is a RAL estimator for β in the semiparametric model P, we want that

√n(βˆn − β) →D(β,η) N(0, Σβ,η) for all p(x; β, η) ∈ P.

Such an estimator then necessarily satisfies

√n(βˆn − β) →D(β,γ) N(0, Σβ,γ) for all p(x; β, γ) ∈ Pβ,γ,

since Pβ,γ ⊂ P. Therefore, an influence function of a RAL estimator in the semiparametric model P must be an influence function of a RAL estimator in every parametric submodel Pβ,γ of P. The converse need not be true. The situation is made visual in Figure 4.1. The picture shows the situation where we considered three parametric submodels

Figure 4.1: The class of influence functions for semiparametric RAL estimators.

Pβ,γi for i = 1, 2, 3. The corresponding class of influence functions for RAL estimators is denoted

Vβ,γi. The class of influence functions for semiparametric RAL estimators is denoted by V and is situated in the intersection of the sets Vβ,γi for i = 1, 2, 3. Of course, all these sets are subsets of the bigger set Hq. Although this heuristic argument is enough to understand what is happening, we want to be a bit more formal.

Definition 4.2. Consider a semiparametric estimator βˆn for β. We define βˆn to be regular if it is regular in every regular parametric submodel and the limiting distribution does not depend on the parametric submodel. Recall that the estimator βˆn is regular in a parametric submodel if its limiting distribution does not depend on the LDGP.

This definition enables us to state a theorem analogous to Theorem 3.2 in the parametric case. It gives a similar criterion to check the regularity of an asymptotically linear estimator in a semiparametric model. It is extracted from Newey (1990), [27]. Because it gives an if and only if condition, it yields a geometric interpretation (similar to the parametric case) to the class of RAL estimators as we will see in the following sections.

Theorem 4.1. Suppose that βˆn is a semiparametric asymptotically linear estimator for β with influence function ϕ. Suppose that for all parametric submodels Pγ the parameter of interest β(γ) is a smooth q-dimensional function of the p-dimensional parameter γ (q < p) such that Γ(γ) = ∂β(γ)/∂γ^T, the q × p matrix of partial derivatives (Γij(γ) = ∂βi/∂γj), exists, has rank q and is continuous in γ in a neighbourhood of the truth γ0. In addition, assume that Eγ(ϕ^T ϕ) is continuous in γ in a neighbourhood of γ0. Then βˆn is regular if and only if

E{ϕ(X) Sγ^T(X, γ0)} = Γ(γ0), (4.2)

for all parametric submodels Pγ.

A proof of this theorem is given in the appendix of [27]. It also makes use of contiguity and therefore we omit it since this is not within the scope of this thesis. Trivially, when the parameter can be partitioned, we obtain a consequence similar to Corollary 3.1.

Corollary 4.1. Under the same assumptions as in Theorem 4.1, βˆn is regular if and only if, for all parametric submodels Pβ,γ where γ ∈ R^r,

(i) E{ϕ(X) Sβ^T(X, β0, γ0)} = I^{q×q},

(ii) E{ϕ(X) Sγ^T(X, β0, γ0)} = 0^{q×r},

where I^{q×q} denotes the q × q identity matrix and 0^{q×r} denotes the q × r matrix of zeros.

For the case where the parameter θ can be nicely partitioned as the parameter of interest β and the nuisance parameter η, the discussion above, especially Corollary 4.1, has important implications. First, the influence function of any semiparametric RAL estimator must be orthogonal to the nuisance tangent space Λγ of any parametric submodel Pβ,γ. A second implication is that the variance of a semiparametric RAL estimator cannot be smaller than [E{Sβ,γ^eff(X, β0, γ0) Sβ,γ^{eff T}(X, β0, γ0)}]^{-1} for every parametric submodel Pβ,γ. These implications lead to the following definition.

Definition 4.3. When the parameter θ can be nicely partitioned as (β^T, η^T)^T, the semiparametric efficiency bound is given by

sup_{Pβ,γ ⊂ P} [E{Sβ,γ^eff Sβ,γ^{eff T}}]^{-1}. (4.3)

Since the variance of a semiparametric RAL estimator for β cannot be smaller than

[E{Sβ,γ^eff(X, β0, γ0) Sβ,γ^{eff T}(X, β0, γ0)}]^{-1}

for every parametric submodel Pβ,γ, (4.3) represents a lower bound for the asymptotic variance of a semiparametric RAL estimator. Hence, efficient estimators in semiparametric models cannot be more precise than efficient estimators in parametric models. This is clear since semiparametric models contain less information than parametric models. However, this should not be taken in a strict sense, as we shall see when we discuss adaptive estimation. This also has a geometric interpretation, as we will discuss in the next section. We end this section with some additional terminology. When we obtain the efficient estimator, e.g., for the parameter β, this estimator will typically depend on functions of the nuisance parameters. In this case, the efficient estimator will only be efficient when the nuisance parameter, or at least a function of the nuisance parameter, can be estimated in a consistent way. When the efficient estimator does not depend on functions of the nuisance parameter, this estimator will always be efficient.

Definition 4.4. A semiparametric RAL estimator βˆn for which the asymptotic variance equals (4.3) for p0(x) = p(x; β0, η0) is called locally efficient at p0(x). If the semiparametric RAL estimator βˆn has asymptotic variance equal to (4.3) regardless of the truth p0(x), βˆn is called globally efficient.

Examples of locally efficient estimators will be given in §5.1 and §5.3. An example of a globally efficient estimator will be given in §5.1.5 and at the end of this thesis, in §9.4.3.

4.3 Semiparametric Nuisance Tangent Space

We now define what is meant by the nuisance tangent space in a semiparametric sense. Recall that the Hilbert space Hq is also a metric space with the distance defined as

d(h1, h2) = ‖h2 − h1‖ = [E{(h1 − h2)^T(h1 − h2)}]^{1/2}.

For those not familiar with metric spaces and the basic topological notions such as closures and limit points, we refer to Rudin (1974), [32], Chapter 1.

Definition 4.5. The (semiparametric) nuisance tangent space, denoted Λ, is defined as the mean-square closure of all nuisance tangent spaces Λγ = {B^{q×r} Sγ(X, β0, γ0): B^{q×r} ∈ R^{q×r}} of all parametric submodels Pβ,γ ⊂ P. More specifically, Λ ⊂ Hq is the space of all q-dimensional functions h(X) ∈ Hq for which there exists a sequence {Bj Sγj(X)}_{j=1}^{+∞} such that

‖h(X) − Bj Sγj(X)‖² → 0 as j → +∞,

for a sequence of parametric submodels Pβ,γj, where the matrices Bj have appropriate dimensions, i.e., Bj ∈ R^{q×rj} when γj ∈ R^{rj}.

Note we use the term mean-square because limits are taken with respect to the distance, which is the square root of the expected sum of squared differences between the q components of the two elements; the distance in Hq thus represents a variance. The semiparametric nuisance tangent space can then simply be written as

Λ = closure{ ∪_{Pβ,γ ⊂ P} Λγ },

where the closure is taken in the mean-square sense described above. By definition, the semiparametric nuisance tangent space Λ is closed. Unfortunately, it need not be linear. In [35], it is argued that this is a linear space in most applications and therefore, for the rest of this part, we assume that Λ is linear. Hence, it is allowed to call it a space and not just a set. However, this merely follows from the way the nuisance tangent space is defined. In the next part, we will define the (nuisance) tangent set of semiparametric models in a slightly different (and more general) way and the theory will be developed without relying on the linearity assumption. In addition, the tangent set will not necessarily be closed, in contrast with our definition here. The closedness and the linearity are required to guarantee the existence and uniqueness of projections onto these spaces. In the next part, this problem will be solved by simply taking the closed linear span of these sets to project on. For the time being, we focus our attention again on the current setting where Λ is a closed linear subset (a closed subspace) of Hq, so in the next section we can apply the Projection Theorem without any additional arguments to consider projections onto Λ. We now state a lemma that assures the results obtained under the restriction to smooth and regular parametric submodels remain valid.

Lemma 4.1. The scores in the definition of the semiparametric nuisance tangent space can be restricted to be those for regular parametric submodels (i.e., smooth submodels with a nonsingular information matrix) without shrinking the semiparametric nuisance tangent space.

This lemma is extracted from the appendix of Newey (1990), [27]. A proof is also given there, but it is omitted here because of its technical nature; it gives no additional insight into the semiparametric efficiency theory. We end the discussion of the nuisance tangent space for a semiparametric model P with a nice geometric interpretation of why semiparametric RAL estimators cannot be more efficient than parametric RAL estimators. Indeed, for any parametric submodel Pβ,γ ⊂ P, we have that Λγ ⊂ Λ. Hence, the semiparametric nuisance tangent space is bigger than the nuisance tangent space of any parametric submodel. This means that the distance of any influence function to the semiparametric nuisance tangent space becomes smaller; the inverse of this distance becomes larger, and this is exactly a measure of the variability of a RAL estimator.

4.4 Efficient Influence Function

4.4.1 The Case θ = (β^T, η^T)^T

Before defining the efficient influence function for semiparametric models, we look for a practical expression for (4.3), the semiparametric efficiency bound. This will be obtained in terms of the so called semiparametric efficient score. We get our inspiration from the previous chapter. Definition 4.6. The semiparametric efficient score for β is defined as

Seff(X, β0, η0) = Sβ(X, β0, η0) − Π{Sβ(X, β0, η0)|Λ}. (4.4)

Note this is a well-defined object: since Λ is closed and linear, we are assured that Π{Sβ(X, β0, η0)|Λ} exists and is unique. The nice thing about this always-existing object is that it characterizes the semiparametric efficiency bound. Before showing this, we prove that the semiparametric nuisance tangent space Λ is a q-replicating space.
Lemma 4.2. The semiparametric nuisance tangent space Λ is a q-replicating space.

Proof. Take an arbitrary h(X) ∈ Λ ⊂ Hq. This h(X) is a q-dimensional vector, say h(X) = [h^{(1)}(X), . . . , h^{(q)}(X)]^T. By the definition of Λ, there exists a sequence {Bj Sγj}_{j=1}^{+∞} such that

‖h − Bj Sγj‖ → 0 as j → +∞,

where the Bj are matrices of appropriate dimensions and Sγj is the nuisance score vector of the corresponding parametric submodel Pβ,γj. Denote the i-th row of Bj by Bj^{(i)}, then

‖h − Bj Sγj‖² = E{(h − Bj Sγj)^T (h − Bj Sγj)} = Σ_{i=1}^q E{(h^{(i)} − Bj^{(i)} Sγj)²}. (4.5)

As j → +∞, ‖h − Bj Sγj‖² → 0. From (4.5), we see that this is a sum of squares. Therefore, this sum will approach zero if and only if each term approaches zero. This means that each component h^{(i)} belongs to a space for which there exists a sequence {bj}_{j=1}^{+∞}, with bj a vector of appropriate dimension, for which

‖h^{(i)} − bj^T Sγj‖ → 0 as j → +∞.

This shows that Λ is a q-replicating space.

Theorem 4.2. The semiparametric efficiency bound (4.3) equals the inverse of the variance of the semiparametric efficient score, i.e.,

sup_{Pβ,γ ⊂ P} [E{Sβ,γ^eff Sβ,γ^{eff T}}]^{-1} = [E{Seff(X, β0, η0) Seff^T(X, β0, η0)}]^{-1}.

Proof. First we prove the case q = 1 to gain some intuition into the problem. Consider an arbitrary parametric submodel Pβ,γ with nuisance tangent space Λγ. Note that for q = 1, E(Sβ,γ^eff Sβ,γ^{eff T}) = ‖Sβ,γ^eff‖². Define

W = sup_{Pβ,γ ⊂ P} ‖Sβ,γ^eff‖^{-2},

with Sβ,γ^eff(X, β0, γ0) = Sβ(X, β0, γ0) − Π{Sβ(X, β0, γ0)|Λγ}. Since Λγ ⊂ Λ, we trivially obtain that ‖Seff‖ ≤ ‖Sβ,γ^eff‖ for any parametric submodel Pβ,γ. The situation is sketched in Figure 4.2, where we clearly see the norm of Seff = Sβ − Π(Sβ|Λ) is smaller than the norm of Sβ,γ^eff = Sβ − Π(Sβ|Λγ). Therefore we have

‖Seff‖^{-2} ≥ sup_{Pβ,γ ⊂ P} ‖Sβ,γ^eff‖^{-2} = W. (4.6)

We now show the reverse inequality. Because Π(Sβ|Λ) ∈ Λ, there exists a sequence of parametric submodels Pβ,γj with corresponding score vectors Sγj such that

‖Π(Sβ|Λ) − Bj Sγj‖² → 0 as j → +∞,

Figure 4.2: Projection of Sβ onto Λ and Λγ.

where the matrices Bj have appropriate dimensions. By the definition of W, for any parametric submodel Pβ,γj, we have that ‖Sβ,γj^eff‖^{-2} ≤ W. Hence,

W^{-1} ≤ ‖Sβ,γj^eff‖² = ‖Sβ − Π(Sβ|Λγj)‖²
  ≤ ‖Sβ − Bj Sγj‖²   (1)
  = ‖Sβ − Π(Sβ|Λ)‖² + ‖Π(Sβ|Λ) − Bj Sγj‖².   (2)

(1) Because Sβ − Π(Sβ|Λγj ) ⊥ Λγj , kSβ − Π(Sβ|Λγj )k is the shortest distance to Λγj from Sβ.

The inequality is valid since BjSγj belongs to Λγj .

(2) Because Sβ − Π(Sβ|Λ) ⊥ Λ and Π(Sβ|Λ) − BjSγj belongs to Λ, this equality is valid by the Pythagorean Theorem.

Now take the limit for j → +∞ and we obtain

‖Seff‖² = ‖Sβ − Π(Sβ|Λ)‖² ≥ W^{-1},

and thus

‖Seff‖^{-2} ≤ W. (4.7)

From (4.6) and (4.7), it then follows that ‖Seff‖^{-2} = W. When q > 1, the proof is the same. We only need to replace the norms by variances. First we show that it is justified to replace the norms by variances in equation (4.6). This will be justified when we can show that var(Sβ,γ^eff) − var(Seff) is semi-positive definite. Note that Λγ ⊂ Λ. Using this and Proposition 2.12, we see that Π(Sβ|Λγ) = Π{Π(Sβ|Λ)|Λγ}. We can then write

Sβ − Π(Sβ|Λγ) = Sβ − Π{Π(Sβ|Λ)|Λγ}

= Sβ − Π(Sβ|Λ) + Π(Sβ|Λ) − Π{Π(Sβ|Λ)|Λγ}.

This shows that

Sβ,γ^eff = Seff + [Π(Sβ|Λ) − Π{Π(Sβ|Λ)|Λγ}],

thus the efficient score of the parametric submodel is the sum of the efficient score of the semiparametric model and the residual of Π(Sβ|Λ) after projecting it onto the nuisance tangent space Λγ of the parametric submodel. This residual belongs to Λ and Seff belongs to Λ⊥. This is also made visual in Figure 4.2. Since Λ is a q-replicating space, we can use the Multivariate Pythagorean Theorem and this yields

var(Sβ,γ^eff) = var(Seff) + var([Π(Sβ|Λ) − Π{Π(Sβ|Λ)|Λγ}]).

From this we see that the matrix var(Sβ,γ^eff) − var(Seff) equals the variance of the residual [Π(Sβ|Λ) − Π{Π(Sβ|Λ)|Λγ}] and thus it is semi-positive definite. Analogously, one proves that the transition (1) remains valid. The only problem that is left is (2). To make this transition, we also need the Multivariate Pythagorean Theorem and the fact that Λ is a q-replicating space. We obtain

var(Sβ − BjSγj ) = var{Sβ − Π(Sβ|Λ)} + var{Π(Sβ|Λ) − BjSγj }, with Π(Sβ|Λ) − BjSγj ∈ Λ.

In the next theorem, we will construct a unique element of Hq that always exists, which will be defined to be the efficient influence function. When there exists a semiparametric RAL estimator with an influence function whose variance equals the semiparametric efficiency bound, i.e., the inverse of the variance of the efficient score vector, we will show this influence function must be the efficient influence function. Unfortunately, there is no guarantee such a RAL estimator can be derived.

Theorem 4.3. Every semiparametric RAL estimator for β has an influence function ϕ(X) that satisfies

(i) E{ϕ(X) Sβ^T(X, β0, η0)} = E{ϕ(X) Seff^T(X, β0, η0)} = I^{q×q},

(ii) Π{ϕ(X)|Λ} = 0, i.e., ϕ ⊥ Λ.

The efficient influence function is then the unique element in Hq that satisfies (i) and (ii) whose variance equals (4.3). The efficient influence function is then given by

ϕeff(X, β0, η0) = {E(Seff Seff^T)}^{-1} Seff(X, β0, η0). (4.8)

Proof. We first show (ii). We need to show that ⟨ϕ, h⟩ = 0 for all h ∈ Λ. By the definition of Λ, there exists a sequence {Bj Sγj(X)}_{j=1}^{+∞} for which

‖h(X) − Bj Sγj(X)‖ → 0 as j → +∞,

where BjSγj (X) ∈ Λγj and the matrices Bj have appropriate dimensions. We trivially have that

⟨ϕ, h⟩ = ⟨ϕ, Bj Sγj⟩ + ⟨ϕ, h − Bj Sγj⟩.

We know that every influence function of a semiparametric RAL estimator for β is also an influence function for a RAL estimator in a parametric submodel. From (ii) of Corollary 4.1, we

see that ϕ ⊥ Λγj and thus ⟨ϕ, Bj Sγj⟩ = 0. In addition, from the Cauchy-Schwarz inequality, we obtain

|⟨ϕ, h⟩| ≤ ‖ϕ‖ ‖h − Bj Sγj‖.

Now take the limit for j → +∞ and we see that ⟨ϕ, h⟩ = 0. Since h was arbitrary, ϕ ⊥ Λ. This shows (ii).

Let us now show (i). For this, we will use (i) of Corollary 4.1; we have

E{ϕ(X) Sβ^T(X, β0, η0)} = I^{q×q}.

By the definition of the efficient score, we can write

E{ϕ(X) Seff^T(X, β0, η0)} = E{ϕ(X) Sβ^T(X, β0, η0)} − E[ϕ(X) Π{Sβ^T(X, β0, η0)|Λ}].

Since Π{Sβ(X, β0, η0)|Λ} ∈ Λ and from (ii) it follows that ϕ ⊥ Λ, we have E[ϕ(X) Π{Sβ^T(X, β0, η0)|Λ}] = 0. Hence,

E{ϕ(X) Seff^T(X, β0, η0)} = E{ϕ(X) Sβ^T(X, β0, η0)} = I^{q×q}.

This shows (i).

The only thing that is left to show is that the efficient influence function is given by (4.8) and that it is unique. It is trivial to see that ϕeff (X) satisfies (i). By definition Seff is orthogonal to Λ, thus ϕeff (X) also satisfies (ii). The variance of ϕeff (X) is given by

E{ϕeff(X) ϕeff^T(X)} = [E{Seff(X) Seff^T(X)}]^{-1} = W.

This means that (4.8) indeed represents the efficient influence function. To prove the uniqueness of the efficient influence function, we suppose there exists another influence function ϕ˜(X) satisfying (i) and (ii) and

E{ϕ˜(X) ϕ˜^T(X)} = [E{Seff(X) Seff^T(X)}]^{-1} = W.

We deduce that

E[{ϕ˜(X) − ϕeff(X)}{ϕ˜(X) − ϕeff(X)}^T]
  = E{ϕ˜(X) ϕ˜^T(X)} + E{ϕeff(X) ϕeff^T(X)} − 2E{ϕ˜(X) ϕeff^T(X)}
  = 2W − 2E{ϕ˜(X) Seff^T(X)}W = 2W − 2 I^{q×q} W = 0.

This implies that

‖ϕ˜(X) − ϕeff(X)‖² = E[{ϕ˜(X) − W Seff(X)}^T{ϕ˜(X) − W Seff(X)}] = 0.

From this we conclude that the efficient influence function ϕeff(X) is unique.
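To see Theorem 4.3 at work in a concrete semiparametric model (a hedged sketch of my own, not taken from [35]), consider the conditional mean model E(Y|X) = βX with the conditional variance σ²(X) and the remaining distributions left unspecified. A classical result states that the efficient score for such conditional mean models is X{Y − βX}/σ²(X); the simulation below compares the unweighted estimating function X(Y − βX) with the (oracle-weighted) efficient one and shows that the latter attains a smaller variance, consistent with the efficiency bound.

```python
# Hedged simulation sketch (not from [35]): in E(Y|X) = beta*X with var(Y|X) = sigma^2(X),
# the efficiently weighted estimating equation has smaller variance than the unweighted one.
import numpy as np

rng = np.random.default_rng(6)
beta0, n, n_rep = 2.0, 2_000, 2_000

def sigma2(x):                      # heteroscedastic conditional variance (an assumption)
    return 0.5 + x ** 2

est_plain, est_eff = [], []
for _ in range(n_rep):
    X = rng.uniform(0.5, 2.0, size=n)
    Y = beta0 * X + rng.normal(size=n) * np.sqrt(sigma2(X))
    # solve sum_i w(X_i) X_i (Y_i - beta X_i) = 0 in closed form
    est_plain.append(np.sum(X * Y) / np.sum(X ** 2))                        # w = 1
    est_eff.append(np.sum(X * Y / sigma2(X)) / np.sum(X ** 2 / sigma2(X)))  # w = 1/sigma^2

print(np.var(est_plain), np.var(est_eff))   # the efficiently weighted estimator has smaller variance
```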

4.4.2 The General Case β = β(θ)

Unfortunately, Theorem 4.3 is only applicable to semiparametric models in which the parameter θ can be nicely partitioned into a finite-dimensional parameter β and an infinite-dimensional parameter η. In this way we could construct the nuisance tangent space Λ, which plays a crucial role in the preceding display. This is not always the case. As mentioned before, it is often more natural to describe a semiparametric model through an infinite-dimensional parameter θ, where the parameter of interest is a smooth q-dimensional function β(θ). As in the parametric model, the tangent space for θ will play a crucial role. Let us be a bit more formal. Recall that for parametric models, the set of influence functions for RAL estimators is a linear variety Vθ = ϕ(X) + Jθ^⊥, where ϕ(X) was allowed to be the influence function of any RAL estimator for β(θ) and Jθ was defined to be the tangent space for θ, i.e., the linear span of the score vector for θ. We now show how we can extend the geometry of influence functions for parametric models to semiparametric models. We will use the same terminology and concepts as before, so a sketch will be enough to understand what is happening. The semiparametric model is defined to be the class of densities P = {p(x; θ): θ ∈ H}, where H is an arbitrary infinite-dimensional set. The concept of a parametric submodel will be important again, but in a slightly different form. It will be defined as a set Pγ ⊂ P containing the truth. This class of densities is indexed by a finite-dimensional parameter γ, say p-dimensional. The parameter of interest then must satisfy β(θ0) = β(γ0)². As for the semiparametric nuisance tangent space, we define the semiparametric tangent space to be the mean-square closure of the tangent spaces of all parametric submodels Pγ ⊂ P, i.e.,

J = closure{ ∪_{Pγ ⊂ P} Jγ }.

Thus, J ⊂ Hq is the space of all q-dimensional functions h(X) ∈ Hq for which there exists a sequence {Bj Sγj(X)}_{j=1}^{+∞} such that

‖h(X) − Bj Sγj(X)‖² → 0 as j → +∞,

for a sequence of parametric submodels Pγj, where the matrices Bj have appropriate dimensions, i.e., Bj ∈ R^{q×pj} if γj ∈ R^{pj}. As for the nuisance tangent space, it is justified to restrict ourselves to smooth and regular (non-singular information matrix) submodels without shrinking the tangent space (see [27], Lemma A.2). In Tsiatis (2006), [35], p.67, it is stated that Theorem 3.4 and Theorem 3.5 can be easily generalized to semiparametric models, that the proof is the same, and that it is therefore omitted. However, things are not as easy as that. There are some subtleties underlying this and the proofs are not as analogous as stated. Why is that? Theorem 3.4 and Theorem 3.5 rely on Theorem 3.2. As presented in [35], this is a theorem applicable only to parametric models. We conclude that in order to generalize Theorem 3.4 and Theorem 3.5, we first need to generalize Theorem 3.2. This has already been accomplished, see Theorem 4.1.

² We have that β(θ₀) = β(γ₀) with a slight abuse of notation. The reason is that for β(θ₀) the parameter β is seen as a function of the infinite-dimensional parameter θ, while for β(γ₀) the parameter β is seen as a function of the finite-dimensional parameter γ.

The good news is that we can now generalize Theorem 3.4 and Theorem 3.5. Before doing so, we try to define the semiparametric efficiency bound for this general case. This is not given in [35], but this generalization will bring us closer to the semiparametric efficiency bound defined in Part III. However, for the time being, the semiparametric efficiency bound discussed here merely follows from the restriction to RAL estimators, regular and asymptotically linear estimators. In Part III, we will see how we can generalize this and the semiparametric efficiency bound will be obtained very easily, not to say trivially.
We follow the same reasoning as before. Consider an arbitrary (regular) parametric submodel P_γ ⊂ P where γ ∈ R^p. We know that the efficient influence function in this parametric submodel is given by

    ϕ_γ^eff(X) = Γ(γ₀)I⁻¹(γ₀)S_γ(X, γ₀),

with corresponding variance Γ(γ₀)I⁻¹(γ₀)Γᵀ(γ₀). Now consider a semiparametric RAL estimator β̂_n for β(θ) with influence function ϕ. Since ϕ is the influence function of a semiparametric RAL estimator, it must be the influence function for a RAL estimator in every parametric submodel. In addition, the efficient influence function for this parametric submodel can then be written as

    ϕ_γ^eff(X) = ϕ(X) − Π{ϕ(X)|J_γ^⊥} = Π{ϕ(X)|J_γ}.

The variance of a semiparametric RAL estimator therefore cannot be smaller than Γ(γ₀)I⁻¹(γ₀)Γᵀ(γ₀) for any parametric submodel P_γ. This observation leads to the following definition.

Definition 4.7. If the parameter of interest is a smooth q-dimensional function β(θ), the semiparametric efficiency bound is given by

    sup_{P_γ ⊂ P} {Γ(γ₀)I⁻¹(γ₀)Γᵀ(γ₀)}.    (4.9)

This represents a lower bound for the asymptotic variance of a semiparametric RAL estimator. Hence, as before, estimators in semiparametric models cannot be more efficient than in parametric models, since the latter contain more information. However, this should not be taken in a strict sense, since it is possible to obtain equality for certain parametric submodels, as we shall note when we discuss adaptive estimation. Note this bound can be defined without any semiparametric RAL estimator even existing. Unfortunately, it is not easy to calculate, since it involves the calculation of the lower bound for every regular parametric submodel, after which we need to take the supremum. When the parameter can be nicely partitioned, we have seen that this supremum equals the variance of the efficient influence function, i.e., the inverse of the variance of the efficient score. Can we obtain a similar result? The good news is we can. However, we need to make an assumption. We need to assume there exists at least one RAL estimator for β(θ), with influence function ϕ_a(X). We use the subscript a to highlight that this is an assumption; there is no guarantee such a RAL estimator exists. When this assumption fails, our criterion to measure efficiency against will not be valid. To avoid this assumption, we will have to wait until Part III, where the more abstract approach will reveal a much more general result without assuming the existence of a certain RAL estimator. Instead, we will have a weaker assumption. Therefore, the efficiency theory to be developed in Part III will have greater applicability. For the time being, under this assumption, we can give the generalization of Theorem 3.4.

Theorem 4.4. If a semiparametric RAL estimator β̂_n for β(θ) exists, then the influence function of this estimator must belong to the linear variety

    V* = ϕ_a(X) + J^⊥,

where ϕ_a(X) is the influence function of any semiparametric RAL estimator for β(θ), assuming at least one exists (it can be taken to be the influence function of any semiparametric RAL estimator), and J is the semiparametric tangent space.

Proof. Suppose the semiparametric RAL estimator β̂_n has influence function ϕ̃(X). We can trivially write ϕ̃(X) = ϕ_a(X) + {ϕ̃(X) − ϕ_a(X)}. If we want to show that ϕ̃(X) ∈ V*, we thus need to show that ϕ̃(X) − ϕ_a(X) ∈ J^⊥, i.e., ⟨ϕ̃(X) − ϕ_a(X), h(X)⟩ = 0 for all h(X) ∈ J. By the definition of J, there exists a sequence {B_j S_{γ_j}(X)}_{j=1}^{+∞} for which

    ‖h(X) − B_j S_{γ_j}(X)‖ → 0 as j → +∞,

where B_j S_{γ_j}(X) ∈ J_{γ_j} and the matrices B_j have appropriate dimensions. We have that

    ⟨ϕ̃(X) − ϕ_a(X), h(X)⟩ = ⟨ϕ̃(X) − ϕ_a(X), B_j S_{γ_j}(X)⟩ + ⟨ϕ̃(X) − ϕ_a(X), h(X) − B_j S_{γ_j}(X)⟩.

We need to show that both terms appearing on the right-hand side are equal to zero as j → +∞. The second one is easy: by the Cauchy–Schwarz inequality, we obtain

    ⟨ϕ̃(X) − ϕ_a(X), h(X) − B_j S_{γ_j}(X)⟩ ≤ ‖ϕ̃(X) − ϕ_a(X)‖ ‖h(X) − B_j S_{γ_j}(X)‖.

Now take the limit for j → +∞ and we see that lim_{j→+∞} ⟨ϕ̃(X) − ϕ_a(X), h(X) − B_j S_{γ_j}(X)⟩ = 0. It is more tricky to show that the first term equals zero. We will use Theorem 4.1. We have that

    E[{ϕ̃(X) − ϕ_a(X)}S_{γ_j}ᵀ(X)] = E{ϕ̃(X)S_{γ_j}ᵀ(X)} − E{ϕ_a(X)S_{γ_j}ᵀ(X)}.

Since we assumed that both ϕ̃(X) and ϕ_a(X) are influence functions of semiparametric RAL estimators, we can apply (4.2). We obtain

    E[{ϕ̃(X) − ϕ_a(X)}S_{γ_j}ᵀ(X)] = Γ(γ_{j,0}) − Γ(γ_{j,0}) = 0.

This implies that E[{ϕ̃(X) − ϕ_a(X)}ᵀ B_j S_{γ_j}(X)] = 0 and thus

    ⟨ϕ̃(X) − ϕ_a(X), B_j S_{γ_j}(X)⟩ = 0

for all j. We can conclude that ϕ̃(X) − ϕ_a(X) ∈ J^⊥. Hence, we see that ϕ̃(X) ∈ V* = ϕ_a(X) + J^⊥.

We now define an element in V∗ that is well-defined under the aforementioned assumption.

Definition 4.8. We define the efficient influence function ϕ_eff(X) to be the orthogonal projection of any influence function (i.e., an arbitrary element of the linear variety V*) onto the semiparametric tangent space, i.e.,

    ϕ_eff(X) = ϕ(X) − Π{ϕ(X)|J^⊥} = Π{ϕ(X)|J}.    (4.10)

The influence function ϕ(X) can be taken to be any influence function, e.g., ϕ_a(X). This definition is independent of the chosen influence function, since any element of V* yields the same result. Note that ϕ_eff(X) belongs to V* by construction, since equation (4.10) shows that it is written as ϕ(X), an arbitrary influence function, minus an element of the orthogonal complement J^⊥ of J. To be the efficient influence function, it must be the element of V* with the smallest variance. This will indeed be the case, as we will see in the proof of the following theorem.

Theorem 4.5. If a RAL estimator β̂_n for β exists that has variance equal to the variance of the efficient influence function (4.10), then the influence function of this estimator must be equal to the efficient influence function ϕ_eff(X), which is the element of V* with the smallest variance.

Proof. Because ϕ_eff(X) belongs to V* by construction, this linear variety can be equivalently written as

    V* = ϕ_eff(X) + J^⊥.    (4.11)

Consequently, any other influence function can be written as ϕ(X) = ϕ_eff(X) + ℓ(X), with ℓ(X) ∈ J^⊥. As for the semiparametric nuisance tangent space Λ, we can show that the semiparametric tangent space J is a q-replicating space. Therefore, we can use the Multivariate Pythagorean Theorem, because ϕ_eff(X) is in J. We obtain

    var{ϕ(X)} = var{ϕ_eff(X)} + var{ℓ(X)}.    (4.12)

This shows that ϕ_eff(X) is indeed the element of V* with smallest variance. Therefore, it is justified to define ϕ_eff(X) to be the efficient influence function. Now suppose that β̂_n is a RAL estimator for β with influence function ϕ̃(X) and with asymptotic variance equal to the variance of the efficient influence function ϕ_eff(X). We write ϕ̃(X) = ϕ_eff(X) + ℓ̃(X). Using (4.12), we obtain var{ℓ̃(X)} = 0, since var{ϕ̃(X)} = var{ϕ_eff(X)} by assumption. This implies that ‖ℓ̃(X)‖ = 0, which means that ℓ̃(X) = 0 in H_q. We may conclude that ϕ̃(X) = ϕ_eff(X).

Note that, as in the parametric case, we find that ϕ_eff(X) is the unique influence function in the tangent space J. Indeed, on the one hand, by (4.11), each influence function can be written as

    ϕ(X) = ϕ_eff(X) + ℓ(X),   ℓ(X) ∈ J^⊥.    (4.13)

On the other hand, by the Projection Theorem, we know that H_q = J ⊕ J^⊥, thus each h(X) ∈ H_q can be uniquely written as h(X) = Π{h(X)|J} + Π{h(X)|J^⊥}. Hence, (4.13) is this unique decomposition of ϕ(X). This shows that if ℓ(X) ≠ 0, i.e., ϕ(X) ≠ ϕ_eff(X), so that the projection of ϕ(X) onto J^⊥ is not zero, then ϕ(X) ∉ J, since J is the kernel of the projection onto J^⊥. We end this section by showing that the semiparametric efficiency bound in this case also equals the variance of the efficient influence function, again under the aforementioned assumption.

Theorem 4.6. The semiparametric efficiency bound (4.9) equals the variance of the efficient influence function defined in (4.10), i.e.,

    sup_{P_γ ⊂ P} {Γ(γ₀)I⁻¹(γ₀)Γᵀ(γ₀)} = var(ϕ_eff).

Proof. Consider the semiparametric RAL estimator with influence function ϕa(X). Consider an arbitrary parametric submodel Pγ ⊂ P with corresponding tangent space Jγ. We know that the efficient influence function for this parametric submodel is given by

    ϕ_γ^eff(X) = ϕ_a(X) − Π{ϕ_a(X)|J_γ^⊥}.

By definition, the efficient influence function for this semiparametric model can be written as

    ϕ_eff(X) = ϕ_a(X) − Π{ϕ_a(X)|J^⊥}.

Since J_γ ⊂ J, we have that J^⊥ ⊂ J_γ^⊥. By Proposition 2.12, we know that

    Π{ϕ_a(X)|J^⊥} = Π[Π{ϕ_a(X)|J_γ^⊥}|J^⊥].

We can write

    ϕ_eff(X) = ϕ_a(X) − Π[Π{ϕ_a(X)|J_γ^⊥}|J^⊥]
             = ϕ_a(X) − Π{ϕ_a(X)|J_γ^⊥} + Π{ϕ_a(X)|J_γ^⊥} − Π[Π{ϕ_a(X)|J_γ^⊥}|J^⊥]
             = ϕ_γ^eff(X) + (Π{ϕ_a(X)|J_γ^⊥} − Π[Π{ϕ_a(X)|J_γ^⊥}|J^⊥]).

This decomposition shows that ϕ_eff(X) can be written as the sum of the efficient influence function ϕ_γ^eff(X) for a parametric submodel and the residual after projecting the orthogonal projection of ϕ_a(X) onto J_γ^⊥ onto the orthogonal complement of the semiparametric tangent space, J^⊥. This residual belongs to J_γ^⊥ and ϕ_γ^eff belongs to J_γ. The situation is sketched in Figure 4.3. Since J_γ is a q-replicating space, we can use the Multivariate Pythagorean Theorem.

[Figure 4.3: Projection of ϕ_a onto J^⊥ and J_γ^⊥.]

This yields that

    var{ϕ_eff(X)} = var{ϕ_γ^eff(X)} + var(Π{ϕ_a(X)|J_γ^⊥} − Π[Π{ϕ_a(X)|J_γ^⊥}|J^⊥]).

This equation shows us that var{ϕ_eff(X)} − var{ϕ_γ^eff(X)} equals the variance of the residual Π{ϕ_a(X)|J_γ^⊥} − Π[Π{ϕ_a(X)|J_γ^⊥}|J^⊥] and is thus positive semi-definite. Under the assumption that ϕ_a(X) is the influence function of a semiparametric RAL estimator, we have that Γ(γ₀) = E{ϕ_a(X)S_γᵀ(X)}, and this condition implies that we can write

    ϕ_γ^eff(X) = Γ(γ₀)I⁻¹(γ₀)S_γ(X).

From these results, we can conclude that

    var{ϕ_eff(X)} ≥ Γ(γ₀)I⁻¹(γ₀)Γᵀ(γ₀)    (4.14)

for every parametric submodel P_γ ⊂ P, and thus we have found that var{ϕ_eff(X)} is an upper bound for the semiparametric efficiency bound. That var{ϕ_eff(X)} is indeed the supremum we are looking for easily follows from the definition of J and the characterization of a supremum. Since ϕ_eff(X) ∈ J, there exists a sequence {B_j S_{γ_j}(X)}_{j=1}^{+∞} for which

    ‖ϕ_eff(X) − B_j S_{γ_j}(X)‖ → 0 as j → +∞,

where B_j S_{γ_j}(X) ∈ J_{γ_j} and the matrices B_j have appropriate dimensions. Since J_{γ_j} ⊂ J and by Proposition 2.12, we obtain that

    ϕ_{γ_j}^eff(X) = Π{ϕ_a(X)|J_{γ_j}} = Π[Π{ϕ_a(X)|J}|J_{γ_j}] = Π{ϕ_eff(X)|J_{γ_j}}.

From this, we deduce

    ‖ϕ_eff(X) − ϕ_{γ_j}^eff(X)‖ ≤ ‖ϕ_eff(X) − B_j S_{γ_j}(X)‖ → 0 as j → +∞.

The situation is sketched in Figure 4.4. From this equation, it follows that ϕ_{γ_j}^eff(X) → ϕ_eff(X)

[Figure 4.4: Projection of ϕ_a onto J and J_{γ_j}.]

as j → +∞ in Hq. This implies that

    var{ϕ_{γ_j}^eff(X)} = Γ(γ_{j,0})I⁻¹(γ_{j,0})Γᵀ(γ_{j,0}) → var{ϕ_eff(X)} as j → +∞.

This, together with (4.14), completes the proof.

What is not clear from the theorems above is whether there exist semiparametric RAL estimators with influence function satisfying conditions (i) and (ii) of Theorem 4.3 or more generally, belonging to the linear variety V∗. In many cases, deriving the space of influence functions, or even the space orthogonal to the nuisance tangent space, for semiparametric models in a strict sense, will suggest how semiparametric estimators may be constructed and even how to find locally or globally efficient semiparametric estimators. We will explicitly illustrate this for the Restricted Moment Model, introduced in Chapter 1. But before doing so, it can be instructive to develop some methods and tools for finding tangent spaces.

4.5 Some Tools for Practical Applications

4.5.1 Tangent Space for Nonparametric Models

Suppose we have collected an i.i.d. sample X₁, ..., X_n with true density function p₀(x) and we are interested in estimating a q-dimensional parameter β, e.g., the mean. Suppose we do not want to make any modelling assumptions. Therefore, consider a nonparametric model P, i.e., the collection of all measurable functions p(x) with respect to some dominating measure ν_X satisfying p(x) ≥ 0 and ∫ p(x) dν_X(x) = 1. To discuss efficiency, we need to construct the tangent space.

Theorem 4.7. The tangent space J of a nonparametric model is the entire Hilbert space H_q, i.e., the space of all measurable mean-zero random functions with finite variance. Thus, we can write

    H_q = the mean-square closure of ⋃_{P_γ ⊂ P} J_γ = J.

Proof. We first look for a conjecture for the tangent space by considering parametric submodels. Thus, consider a parametric submodel P_γ = {p(x; γ): γ ∈ Θ ⊂ R^s} where Θ is open. The tangent space for this parametric submodel is given by J_γ = {B^{q×s} S_γ(X): B^{q×s} ∈ R^{q×s}}, where S_γ(x) = ∂ log p(x; γ₀)/∂γ with p(x; γ₀) = p₀(x), the truth. Since we only need to consider smooth and regular submodels, we know that E{S_γ(X)} = 0. From this we see that the structure of the tangent space of a parametric submodel enables us to make an educated guess for the nonparametric tangent space. This leads to the following conjecture for the tangent space:

    J^conj = {all q-dimensional mean-zero random functions} = H_q.

We already know that J_γ ⊂ J^conj for all parametric submodels. What is left to show is that each element of J^conj is either an element of J_γ for some parametric submodel or a limit of such elements. This is not that difficult. Take any h(X) ∈ J^conj that is bounded, i.e., sup_{x∈X} |h(x)| is finite. Next, consider the parametric submodel p(x; γ) = p₀(x){1 + γᵀh(x)} where γ ∈ R^q. Note that for γ = 0, we have that p(x; 0) = p₀(x), the truth. This explicitly shows that we can only describe the densities of a submodel generically, since p₀(x) is unknown. By the boundedness of

h, we can find an open set Θ (0 ∈ Θ) in R^q of sufficiently small γ such that {1 + γᵀh(x)} ≥ 0 for all x. This is necessary to guarantee that p(x; γ) is a proper density. Moreover, since Θ is open, we can also ensure that the partial derivatives of p(x; γ) with respect to γ exist. In addition, we have that

    ∫ p(x; γ) dν_X(x) = ∫ p₀(x) dν_X(x) + γᵀ ∫ h(x)p₀(x) dν_X(x) = 1 + γᵀE{h(X)} = 1.

We can conclude that p(x; γ) defines a proper density for all γ ∈ Θ. It is trivial to see that the score for γ is S_γ(X) = h(X). With B^{q×q} = I^{q×q} (i.e., the q × q identity matrix), we see that h(X) ∈ J_γ, the tangent space of the considered submodel. The proof is completed by noting that the bounded functions are dense in H_q.

This theorem has a very interesting implication. There can be at most one influence function of a RAL estimator and, when a RAL estimator exists, its influence function must be the efficient influence function. Indeed, we have seen that the influence function of a RAL estimator must belong to the linear variety V* = ϕ_eff(X) + J^⊥, where J denotes the tangent space for the nonparametric model. Since J = H_q, we know that J^⊥ = {0} and thus V* = {ϕ_eff(X)}. This shows that ϕ_eff(X) is the unique influence function. This can also be deduced as follows. Recall that the efficient influence function is given by

    ϕ_eff(X) = ϕ(X) − Π{ϕ(X)|J^⊥},

where ϕ(X) is the influence function of an arbitrary RAL estimator. Since J^⊥ = {0}, it is trivial to see that Π{ϕ(X)|J^⊥} = 0 and therefore we have that ϕ_eff(X) = ϕ(X); thus the unique influence function must be the efficient influence function. There are more ways to obtain this result, e.g., by noting that ϕ_eff(X) is the unique influence function in J = H_q.
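To make this implication concrete, consider estimating the scalar mean β = E(X) without any model assumptions. The sample mean is asymptotically linear with influence function ϕ(X) = X − β, which by the above must be the efficient influence function. The following Python sketch is a purely numerical illustration added here; the gamma distribution, sample size and number of replications are arbitrary choices. It compares the influence-function-based standard error with the Monte Carlo standard deviation of the sample mean.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 2000

# Monte Carlo distribution of the sample mean over many replications
means = np.array([rng.gamma(2.0, 1.5, size=n).mean() for _ in range(reps)])

# a single sample: influence-function-based standard error with phi(X) = X - E(X)
X = rng.gamma(2.0, 1.5, size=n)
phi = X - X.mean()                      # estimated influence function
se_if = np.sqrt(phi.var() / n)          # asymptotic se = sqrt(var{phi(X)}/n)

print("Monte Carlo sd of the sample mean:", means.std())
print("influence-function-based se      :", se_if)

The two printed numbers should be close, illustrating that the variance of the (efficient) influence function governs the asymptotic behaviour of the unique RAL estimator in the nonparametric model.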

4.5.2 Partitioning the Hilbert Space

When considering a nonparametric model, thus all possible density functions for a random vector X, we already know the tangent space is the entire Hilbert space. But there is more structure to the problem than this. We get our inspiration from a well known fact, the partitioning of the density function of a random vector into a product of conditional density functions. Let us be a bit more formal. Suppose X is an m-dimensional random vector, i.e., X = [X(1),...,X(m)]. We use superscripts since this does not denote a sample but the different components of one observation. We know the density function pX (x) of X can be written as

    p_X(x) = p_{X^(1)}(x^(1)) ∏_{j=2}^{m} p_{X^(j)|X^(1),...,X^(j−1)}(x^(j)|x^(1), ..., x^(j−1)).    (4.15)

The function p_{X^(j)|X^(1),...,X^(j−1)}(x^(j)|x^(1), ..., x^(j−1)) denotes the conditional density of X^(j) given X^(1), ..., X^(j−1), defined with respect to the dominating measure ν_j. Since we work with a nonparametric model, we do not impose any restrictions on p_X(x) other than that it defines a proper density. This is equivalent to putting no restrictions on the conditional densities other than that they define proper densities, i.e., the j-th conditional density can be envisioned as an arbitrary positive function η_j(x^(1), ..., x^(j)) such that

    ∫ η_j(x^(1), ..., x^(j)) dν_j(x^(j)) = 1,

for all j = 1, ..., m. From this we see that the nonparametric model can be represented by m infinite-dimensional nuisance parameters {η₁(·), ..., η_m(·)}, and we assume them to be variationally independent. Recall that this means that the product of any η₁(·), ..., η_m(·) can be used to construct a proper density for the random vector X = [X^(1), ..., X^(m)] in the nonparametric model. We have now described in a rigorous way what is meant by partitioning the density into variationally independent conditional densities. The exciting news is that the geometry of Hilbert spaces allows us to translate this into a partitioning of the Hilbert space H_q. To gain more insight into the problem, we consider an arbitrary parametric submodel given by the class of densities

    p_X(x; γ) = p_{X^(1)}(x^(1); γ₁) ∏_{j=2}^{m} p_{X^(j)|X^(1),...,X^(j−1)}(x^(j)|x^(1), ..., x^(j−1); γ_j),    (4.16)

where γ_j ∈ R^{s_j} for all j = 1, ..., m and the γ_j are variationally independent, i.e., the parameters can be perturbed independently so we can define partial derivatives with respect to these parameters. Assume the truth is attained in γ₀ = (γ_{1,0}ᵀ, ..., γ_{m,0}ᵀ)ᵀ. To find the tangent space for this submodel, we need to differentiate the log likelihood with respect to all parameters. The log likelihood is given by

    log p_X(x; γ) = log p_{X^(1)}(x^(1); γ₁) + ∑_{j=2}^{m} log p_{X^(j)|X^(1),...,X^(j−1)}(x^(j)|x^(1), ..., x^(j−1); γ_j).    (4.17)

Without looking at (4.17), we can write that the score for γj, j = 1, . . . , m, is given by

    S_{γ_j}(x^(1), ..., x^(m)) = ∂/∂γ_j log p_X(x^(1), ..., x^(m); γ_{1,0}, ..., γ_{m,0}).

However, (4.17) implies some more structure for the corresponding scores. The score for γ₁ becomes S_{γ₁}(x^(1)) = ∂ log p_{X^(1)}(x^(1); γ_{1,0})/∂γ₁, a function of x^(1) only. Analogously, we see that S_{γ_j}(x^(1), ..., x^(j)) is given by ∂ log p_{X^(j)|X^(1),...,X^(j−1)}(x^(j)|x^(1), ..., x^(j−1); γ_{j,0})/∂γ_j, a function of x^(1), ..., x^(j) only, for j = 2, ..., m. From this, we may conclude that the tangent space for the parametric submodel is given by the space

    J_γ = J_{γ₁} ⊕ ··· ⊕ J_{γ_m},

where J_{γ_j} = {B^{q×s_j} S_{γ_j}(X^(1), ..., X^(j)): B^{q×s_j} ∈ R^{q×s_j}}. This means that an arbitrary element of J_γ can be written as

    B^{q×s} S_γ(X^(1), ..., X^(m)) = ∑_{j=1}^{m} B^{q×s_j} S_{γ_j}(X^(1), ..., X^(j)),    (4.18)

where B^{q×s_j} ∈ R^{q×s_j} (j = 1, ..., m) and B^{q×s} ∈ R^{q×s} with s = ∑_{j=1}^{m} s_j. It is easy to show that E{S_{γ₁}(X^(1))} = 0 and E{S_{γ_j}(X^(1), ..., X^(j))|X^(1), ..., X^(j−1)} = 0 for j = 2, ..., m. This property remains valid for all elements in the linear span of S_{γ_j}(X^(1), ..., X^(j)), i.e., J_{γ_j}, for j = 1, ..., m. It is a good exercise to show that S_{γ_j} ⊥ S_{γ_{j'}} for j ≠ j'. A similar calculation will be given shortly. Note this implies that the m spaces J_{γ_j} are mutually orthogonal. To obtain the tangent space for the nonparametric model, we need to take the mean-square closure, and we obtain that

    H_q = J = J₁ ⊕ ··· ⊕ J_m,³

³ We use these special notations J and J_j to highlight that this is the tangent space for a nonparametric model.

where J_j, j = 1, ..., m, is the mean-square closure of the tangent spaces of all parametric submodels for the infinite-dimensional parameter η_j(·), i.e.,

    J_j = the mean-square closure of ⋃_{P_{γ_j} ⊂ P_j} J_{γ_j}.

The set P_j is the set of all functions η_j(·) and the set P_{γ_j} is a parametric submodel of P_j given by

    P_{γ_j} = {p_{X^(j)|X^(1),...,X^(j−1)}(x^(j)|x^(1), ..., x^(j−1); γ_j): γ_j ∈ Θ_j ⊂ R^{s_j}}.

As we announced, we obtain a partition of the Hilbert space H_q. However, this is not yet the result we were looking for; it is not satisfactory. We want an explicit description of the m spaces J_j for j = 1, ..., m. This will enable us to find some interesting relations between these spaces, similar to those of the parametric submodel. It is not difficult to find an explicit description of the m spaces. Motivated by the properties of the (conditional) scores of the aforementioned parametric submodel, we can make an educated guess,

    J₁ = {α₁(X^(1)) ∈ H_q : E{α₁(X^(1))} = 0},    (4.19a)

    J_j = {α_j(X^(1), ..., X^(j)) ∈ H_q : E{α_j(X^(1), ..., X^(j))|X^(1), ..., X^(j−1)} = 0},  j = 2, ..., m.    (4.19b)

It follows that J_{γ_j} ⊂ J_j for all j = 1, ..., m. To show that each J_j indeed equals the mean-square closure of all parametric submodels, we use similar arguments as in the previous section. Take j ∈ {1, ..., m} and take α_j(X^(1), ..., X^(j)) ∈ J_j to be bounded. Consider the corresponding submodel

    p_j(x^(j)|x^(1), ..., x^(j−1); θ_j) = p_{0,j}(x^(j)|x^(1), ..., x^(j−1)){1 + θ_jᵀ α_j(x^(1), ..., x^(j))},

where p_{0,j}(x^(j)|x^(1), ..., x^(j−1)) denotes the true conditional density of X^(j) given the variables X^(1), ..., X^(j−1), and θ_j ∈ Θ_j ⊂ R^q where Θ_j is an open set consisting of sufficiently small θ_j to guarantee that p_j(x^(j)|x^(1), ..., x^(j−1); θ_j) ≥ 0. This is possible since α_j is assumed to be bounded. Using the fact that α_j ∈ J_j, it is easy to show that p_j(x^(j)|x^(1), ..., x^(j−1); θ_j) integrates to one. Since Θ_j is open, we can define partial derivatives with respect to θ_j and we obtain that the score for this parametric submodel is given by α_j. Since the bounded functions are dense in J_j, any element of J_j is a limit of such bounded functions. This completes the derivation.
As for the parametric submodels, we have that J_j, j = 1, ..., m, are mutually orthogonal spaces. This follows from an easy calculation using the law of iterated expectations. Take j < j', h_j ∈ J_j and h_{j'} ∈ J_{j'}. We find that

    E(h_jᵀ h_{j'}) = E{E(h_jᵀ h_{j'} | X^(1), ..., X^(j'−1))} = E{h_jᵀ E(h_{j'} | X^(1), ..., X^(j'−1))} = 0.

The m spaces Jj, j = 1, . . . , m, can be equivalently written as

    J_j = {h_j*(X^(1), ..., X^(j)) − E{h_j*(X^(1), ..., X^(j))|X^(1), ..., X^(j−1)} : h_j* ∈ L_{2,j}(P_j)},    (4.20)

where L_{2,j}(P_j) denotes the set of all square-integrable functions of X^(1), ..., X^(j) with respect to the probability measure P_j, the probability distribution corresponding to the true conditional density function p_{0,j} of X^(j) given the variables X^(1), ..., X^(j−1).
This decomposition of the tangent set implies that any h(X^(1), ..., X^(m)) ∈ H_q can be decomposed into orthogonal elements

    h = h₁ + ··· + h_m.

By the Projection Theorem, we know that hj is the orthogonal projection of h onto Jj, i.e., hj = Π(h|Jj). It is not difficult to see that these projection operators are given by

    h₁(X^(1)) = E(h|X^(1)),    (4.21a)
    h_j(X^(1), ..., X^(j)) = E(h|X^(1), ..., X^(j)) − E(h|X^(1), ..., X^(j−1)),  j = 2, ..., m.    (4.21b)

This is easy to show. Take j ∈ {1, ..., m}. It is easy to see that h_j ∈ J_j. Therefore, by the Projection Theorem, we only need to show that h − h_j is orthogonal to every element of J_j. This is not so difficult to show using the law of iterated expectations. Take an arbitrary ℓ_j ∈ J_j. We sequentially obtain that

    E{(h − h_j)ᵀ ℓ_j} = E[E{(h − h_j)ᵀ ℓ_j | X^(1), ..., X^(j)}]
                     = E[{E(h|X^(1), ..., X^(j)) − h_j}ᵀ ℓ_j]
                     = E[{E(h|X^(1), ..., X^(j−1))}ᵀ ℓ_j]
                     = E(E[{E(h|X^(1), ..., X^(j−1))}ᵀ ℓ_j | X^(1), ..., X^(j−1)])
                     = E[{E(h|X^(1), ..., X^(j−1))}ᵀ E(ℓ_j | X^(1), ..., X^(j−1))] = 0,

where we used the definition of h_j and the assumption that ℓ_j ∈ J_j to obtain the zero.
We conclude that we have obtained a very interesting result. At first sight, this gives a partitioning of the tangent space only for nonparametric models. However, as we will especially discuss for the restricted moment model, this special structure also arises in many semiparametric models. We give an intuitive argument. Since a semiparametric model imposes some additional restrictions on the class of admitted density functions, or equivalently, on the class of allowed conditional density functions, the corresponding tangent space J_j and especially the corresponding nuisance tangent space Λ_j ⊂ J_j for η_j(·) will be subspaces of the nonparametric space J_j. This will be made more specific in the next chapter, when we study the restricted moment model in much detail. However, some caution is needed. This partitioning is only possible when the conditional density functions can be described through variationally independent nuisance parameters. Semiparametric models can become quite complicated and describing the conditional densities through variationally independent parameters can be too restrictive. An example of a model where the different components of the joint density function cannot be indexed by variationally independent parameters is given in the next chapter, the so-called probabilistic index models. Nonetheless, when such a partition of the nuisance tangent space is possible, it is sometimes surprising that no efficiency is lost when some components of the decomposition of the joint density function are left unspecified. This will be illustrated for the restricted moment model and some simple additional examples when we consider adaptive estimation. Another interesting implication is that for semiparametric models in a strict sense (where the parameter can be nicely partitioned), a decomposition of the nuisance tangent space into mutually orthogonal spaces implies that, by Proposition 2.12, a projection onto the nuisance tangent space can be obtained as the sum of the projections onto the different orthogonal components of the nuisance tangent space.
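As a small numerical illustration of the decomposition (4.21), take m = 2 and q = 1 with a toy discrete joint distribution chosen purely for this sketch (none of the numbers come from the text). The Python code below verifies by exact enumeration that h = h₁ + h₂ and that E(h₁h₂) = 0.

import numpy as np

# toy joint pmf of (X1, X2) on {0,1} x {0,1,2}; probabilities sum to one
pmf = {(0, 0): 0.10, (0, 1): 0.25, (0, 2): 0.15,
       (1, 0): 0.20, (1, 1): 0.10, (1, 2): 0.20}

def h_raw(x1, x2):                 # an arbitrary function; centered below so E(h) = 0
    return x1 * x2 + np.sin(x2)

Eh = sum(p * h_raw(a, b) for (a, b), p in pmf.items())

def E_h_given_x1(x1):              # E{h | X1 = x1} for the centered h
    px1 = sum(p for (a, _), p in pmf.items() if a == x1)
    return sum(p * (h_raw(a, b) - Eh) for (a, b), p in pmf.items() if a == x1) / px1

# decomposition (4.21) with m = 2: h1 = E(h|X1), h2 = h - E(h|X1)
terms = []
for (x1, x2), p in pmf.items():
    h = h_raw(x1, x2) - Eh
    h1 = E_h_given_x1(x1)
    h2 = h - h1
    terms.append((p, h, h1, h2))

sum_check  = max(abs(h - (h1 + h2)) for _, h, h1, h2 in terms)
orthogonal = sum(p * h1 * h2 for p, _, h1, h2 in terms)   # E(h1 * h2)
print(sum_check, orthogonal)       # both are zero (up to rounding)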

Chapter 5

Applications

5.1 Restricted Moment Model

It is time to apply the theory developed in the previous chapter to the restricted moment model. The derivation below is primarily based on Tsiatis (2006), [35], §4.5. However, some additional structure is clarified. Recall this model describes the relationship between a univariate or multivariate response variable Y, say d-dimensional, and a set of covariates X. It is modelled by considering the conditional expectation of Y given X as a known d-dimensional function of X described through a finite-dimensional parameter β, say q-dimensional, i.e., E(Y|X) = µ(X, β). We will consider estimation of the parameter β using an i.i.d. sample Z_i = (Y_i, X_i), i = 1, ..., n. Assume as before that the d-dimensional response variable Y is continuous with dominating measure the ordinary Lebesgue measure, denoted ℓ_Y. After we have discussed this case, we will briefly discuss how to deal with a discrete response variable Y. The covariates X are allowed to be anything: continuous, discrete or even mixed. The dominating measure is denoted ν_X. In §1.3.1, we have seen we could equivalently write this model as Y = µ(X, β) + ε where E(ε|X) = 0 with ε = Y − µ(X, β). It was shown the model then becomes

    P = { p_{Y,X}{y, x; β, η₁(·), η₂(·)} = η₁{y − µ(x, β), x} η₂(x) },

where η₁(ε, x) = p_{ε|X}(ε|x) = p_{Y|X}(y|x) and η₂(x) = p_X(x). The only restriction on the density functions for X and ε, other than that they define proper densities, is that E(ε|X) = 0. Therefore, η₁(ε, x) needs to be nonnegative and should satisfy

    ∫ η₁(ε, x) dℓ_Y(ε) = 1,   ∫ ε η₁(ε, x) dℓ_Y(ε) = 0,   for all x.    (5.1)

In addition, the function η₂(x) also needs to be nonnegative and should satisfy

    ∫ η₂(x) dν_X(x) = 1.    (5.2)

The conditions (5.1) and (5.2) will be important when constructing the nuisance tangent space. As discussed in §1.3.1, β ∈ R^q is our finite-dimensional parameter of interest and η = {η₁(·), η₂(·)} is referred to as the nuisance parameter. For further reference, the truth is denoted by

    p₀(y, x) = η_{1,0}{y − µ(x, β₀), x} η_{2,0}(x) = p_{Y,X}{y, x; β₀, η_{1,0}(·), η_{2,0}(·)}.


5.1.1 Tangent Space and Nuisance Tangent Space for Parametric Submodels

The first step in our search for the efficient influence function in the restricted moment model is to construct the tangent space and especially the nuisance tangent space, since the parameter can be nicely partitioned. However, before doing so, it is convenient to first consider parametric submodels to gain some insight into the restrictions the model imposes on the score vectors corresponding to the nuisance parameters. This will enable us to make an educated guess for the semiparametric nuisance tangent space.
Instead of arbitrary functions η₁(ε, x) and η₂(x) for p_{ε|X}(ε|x) and p_X(x) satisfying (5.1) and (5.2), we consider parametric submodels p_{ε|X}(ε|x; γ₁) and p_X(x; γ₂) where γ₁ ∈ R^{r₁} and γ₂ ∈ R^{r₂}, r = r₁ + r₂. Thus, γ = (γ₁ᵀ, γ₂ᵀ)ᵀ ∈ R^r. From this it is clear we consider the parametric submodel

    P_{β,γ} = { p(y, x; β, γ₁, γ₂) = p_{ε|X}{y − µ(x, β)|x; γ₁} p_X(x; γ₂) : θ = (βᵀ, γ₁ᵀ, γ₂ᵀ)ᵀ ∈ Θ_{β,γ} },    (5.3)

where β, γ₁ and γ₂ are variationally independent. The set Θ_{β,γ} ⊂ R^{q+r} is considered to be an open set, so we can take partial derivatives with respect to these parameters. The parametric submodel also contains the truth, denoted p₀(y, x) = p_{ε|X}{y − µ(x, β₀)|x; γ_{1,0}} p_X(x; γ_{2,0}). Let us investigate the tangent space and especially the nuisance tangent space for the parametric submodel (5.3).

Remark 5.1. For further reference, since scores are mostly evaluated at the truth, we omit this in the arguments of the score. If scores are not evaluated at the truth, we write the parameter in the argument of the score. We will also, as before, use the shorthand ε for y − µ(x, β₀) to simplify notation.

The score for θ, i.e., S_θ(y, x) = ∂ log p(y, x; β₀, γ_{1,0}, γ_{2,0})/∂θ, can be partitioned into three parts, the scores for β, γ₁ and γ₂. Thus,

    S_θ(y, x) = [S_βᵀ(y, x), S_{γ₁}ᵀ(y, x), S_{γ₂}ᵀ(y, x)]ᵀ,

where the scores are defined as the partial derivatives of the log likelihood with respect to the corresponding parameter.

Tangent Space for the Parameter of Interest β: Jβ

Not much has to be said about this space. It is defined to be the space

    J_β = {B^{q×q} S_β(Y, X): B^{q×q} ∈ R^{q×q}},

the linear span of the score vector for β. We will discuss this space in more detail from the semiparametric point of view. We now focus our attention on the nuisance tangent space for the parametric submodel. The log likelihood for the submodel is given by

    log p(y, x; β, γ₁, γ₂) = log p_{ε|X}{y − µ(x, β)|x; γ₁} + log p_X(x; γ₂).    (5.4)

Following the ideas put forward in §4.5.2, this partitioning of the density function into two variationally independent parts implies a similar decomposition of the nuisance tangent space Λ_γ into two mutually orthogonal subspaces. First we discuss the properties of these subspaces separately and next we explicitly show that they are mutually orthogonal. This will turn out to be trivial. In what follows, we fix the value of β at β₀.

Nuisance Tangent Space for the Nuisance Parameter γ1: Λγ1

The nuisance tangent space for γ₁ is defined to be the space

    Λ_{γ₁} = {B^{q×r₁} S_{γ₁}(ε, X): B^{q×r₁} ∈ R^{q×r₁}}.    (5.5)

By (5.4), we see that

    S_{γ₁}(ε, x) = ∂ log p_{ε|X}{y − µ(x, β₀)|x; γ_{1,0}}/∂γ₁.

We now use the two restrictions we imposed on the conditional density function to obtain some fundamental properties of the score for γ₁ in the parametric submodel.

(i) We know that ∫ p_{ε|X}(ε|x; γ₁) dℓ_Y(ε) = 1. Differentiating with respect to γ₁ yields ∂/∂γ₁ ∫ p_{ε|X}(ε|x; γ₁) dℓ_Y(ε) = 0. Interchanging integration and differentiation, evaluating

at the truth γ1,0 and using the definition of Sγ1 (ε, x) yields that

    E{S_{γ₁}(ε, X)|X} = 0.    (5.6)

(ii) By definition of the restricted moment model, we have that E(ε|X) = 0. This is equivalent to ∫ p_{ε|X}(ε|x; γ₁) εᵀ dℓ_Y(ε) = 0^{1×d}. Differentiating with respect to γ₁ yields ∂/∂γ₁ ∫ p_{ε|X}(ε|x; γ₁) εᵀ dℓ_Y(ε) = 0^{r₁×d}. Interchanging integration and differentiation, evaluating

at the truth γ_{1,0} and using the definition of S_{γ₁}(ε, x) yields that

    E{S_{γ₁}(ε, X)εᵀ|X} = 0^{r₁×d},    (5.7)

where 0^{r₁×d} denotes an r₁ × d matrix of zeros.

Nuisance Tangent Space for the Nuisance Parameter γ2: Λγ2

The nuisance tangent space for γ₂ is defined to be the space

    Λ_{γ₂} = {B^{q×r₂} S_{γ₂}(ε, X): B^{q×r₂} ∈ R^{q×r₂}}.    (5.8)

By (5.4), we see that

    S_{γ₂}(x) = ∂ log p_X(x; γ_{2,0})/∂γ₂.

We have the following fundamental properties.

(i) Since γ₂ is only involved in the density function p_X(x; γ₂) of X, we obtain that S_{γ₂}(X) is a function of X only.

(ii) Since the distribution of X is completely left unspecified, we can only use the fact that ∫ p_X(x; γ₂) dν_X(x) = 1. Differentiating this equality with respect to γ₂ yields ∂/∂γ₂ ∫ p_X(x; γ₂) dν_X(x) = 0. Interchanging integration and differentiation, evaluating at

the truth γ_{2,0} and using the definition of S_{γ₂}(x) yields that

    E{S_{γ₂}(X)} = 0.    (5.9)

These considerations will help us to make an educated guess for the semiparametric nuisance tangent space. First we explicitly show that Λγ1 ⊥ Λγ2 .

Lemma 5.1. The space Λγ1 is orthogonal to the space Λγ2 .

Proof. Using the law of iterated expectation, we easily deduce that

    E{S_{γ₁}(ε, X)S_{γ₂}ᵀ(X)} = E[E{S_{γ₁}(ε, X)S_{γ₂}ᵀ(X)|X}] = E[E{S_{γ₁}(ε, X)|X}S_{γ₂}ᵀ(X)] = 0^{r₁×r₂},

where the last equality follows from (5.6) and the fact that S_{γ₂}(X) is a function of X only. This shows that any component of S_{γ₁}(ε, X) is orthogonal to any component of S_{γ₂}(X) in H₁. Since any element of Λ_{γ₁} is a linear combination of the components of S_{γ₁}(ε, X) and any element of Λ_{γ₂} is a linear combination of the components of S_{γ₂}(X), we may conclude that Λ_{γ₁} ⊥ Λ_{γ₂}.

Conclusion

The total nuisance tangent space for the parametric submodel is the direct sum of the two orthogonal nuisance tangent spaces for γ₁ and γ₂, i.e., Λ_γ = Λ_{γ₁} ⊕ Λ_{γ₂} with Λ_{γ₁} ⊥ Λ_{γ₂}. More specifically, we have that

    Λ_γ = {B^{q×r} S_γ(ε, X): B^{q×r} ∈ R^{q×r}},

with S_γ(ε, X) = [S_{γ₁}ᵀ(ε, X), S_{γ₂}ᵀ(X)]ᵀ and thus

    B^{q×r} S_γ(ε, X) = B₁^{q×r₁} S_{γ₁}(ε, X) + B₂^{q×r₂} S_{γ₂}(X)

with B₁^{q×r₁} S_{γ₁}(ε, X) ⊥ B₂^{q×r₂} S_{γ₂}(X). For completeness, we note that the total tangent space for the parametric submodel, denoted by J^sub, can be written as

    J^sub = J_β ⊕ Λ_γ = J_β ⊕ Λ_{γ₁} ⊕ Λ_{γ₂}.

The properties developed for an arbitrary submodel will now help us to construct the semiparametric tangent space and especially the semiparametric nuisance tangent space.

5.1.2 Semiparametric Tangent Space and Nuisance Tangent Space

Partitioning the Hilbert Space

Using the results obtained in §4.5.2, with a slight abuse of notation, we know that the Hilbert space H_q equals the tangent space J of a nonparametric model and can be partitioned into two subspaces, i.e.,

    H_q = J = J₁ ⊕ J₂,   J₁ ⊥ J₂,

where

    J₁ = {a(ε, X) ∈ H_q : E{a(ε, X)|X} = 0},
    J₂ = {α(X) ∈ H_q : E{α(X)} = 0}.

Tangent Space for the Parameter of Interest β: Jβ

Even though not much has to be said about the tangent space for the parameter of interest, it can be insightful to investigate the properties of the score vector for β, i.e.,

    S_β(y, x) = ∂/∂β log p_{Y,X}{y, x; β, η_{1,0}(·), η_{2,0}(·)} |_{β=β₀}
             = ∂/∂β log η_{1,0}{y − µ(x, β), x} |_{β=β₀}
             = ∂/∂β log η_{1,0}(ε, x) = S_β(ε, x).

The tangent space for the parameter of interest β for the semiparametric model is then simply given by

    J_β = {B^{q×q} S_β(Y, X): B^{q×q} ∈ R^{q×q}},

the linear span of the score vector for β. Note we do not have to take the mean-square closure over all parametric submodels, since the parameter β is finite-dimensional and we do not need to consider parametric submodels to define the tangent space for the parameter of interest. The model implies two fundamental properties for the score for β which are insightful, the second of which will be useful for later calculations.

(i) We know that ∫ η_{1,0}{y − µ(x, β), x} dℓ_Y(y) = 1. Differentiating with respect to β yields ∂/∂β ∫ η_{1,0}{y − µ(x, β), x} dℓ_Y(y) = 0. Interchanging integration and differentiation, evaluating at the truth β₀ and using the definition of S_β(ε, x) yields that

    E{S_β(ε, X)|X} = 0.    (5.10)

(ii) Due to the model restriction, we know that

    ∫ η_{1,0}{y − µ(x, β), x}{y − µ(x, β)}ᵀ dℓ_Y(y) = 0^{1×d}.

Differentiating with respect to β yields

    ∂/∂β ∫ η_{1,0}{y − µ(x, β), x}{y − µ(x, β)}ᵀ dℓ_Y(y) = 0^{q×d}.

Interchanging integration and differentiation, using the chain rule, evaluating at the truth β₀ and using the definition of S_β(ε, x), we obtain that

    ∫ S_β(ε, x) εᵀ η_{1,0}(ε, x) dℓ_Y(ε) + ∫ {−∂µ(x, β₀)/∂β}^{q×d} η_{1,0}(ε, x) dℓ_Y(ε) = 0^{q×d}.

Now define D(X) to be ∂µ(X, β₀)/∂βᵀ. This is a function of X only and the latter equation can be written as the following compact formula,

    Dᵀ(X) = E{S_β(ε, X)εᵀ|X}.    (5.11)

Note that (5.10) implies that J_β is a proper subspace of J₁ that in addition satisfies (5.11). It then follows that J_β is automatically orthogonal to J₂. This means that no information is lost if the distribution of X is left unspecified. We now focus on the more important space, the semiparametric nuisance tangent space. This space will allow us to construct the class of all RAL estimators. By definition, the semiparametric nuisance tangent space Λ is the mean-square closure of all parametric submodel nuisance tangent spaces, i.e.,

    Λ = the mean-square closure of ⋃_{P_{β,γ} ⊂ P} Λ_γ.

Since Λγ = Λγ1 ⊕ Λγ2 , we see that

    Λ = (the mean-square closure of ⋃_{P_{β,γ} ⊂ P} Λ_{γ₁}) ⊕ (the mean-square closure of ⋃_{P_{β,γ} ⊂ P} Λ_{γ₂}) = Λ₁ ⊕ Λ₂.

Nuisance Tangent Space for the Nuisance Parameter η1(·): Λ1

We now state a theorem that gives us a precise description of the nuisance tangent space Λ1. Using the techniques we discussed already, it will be easy to prove the theorem.

Theorem 5.1. The space Λ₁ is the space of all q-dimensional random functions a(ε, X) ∈ H_q satisfying

(i) E{a(ε, X)|X} = 0^{q×1},

(ii) E{a(ε, X)εᵀ|X} = 0^{q×d}.

Proof. We know the space Λ₁ is the mean-square closure of all parametric submodel nuisance tangent spaces Λ_{γ₁}, where the corresponding score for γ₁, i.e., S_{γ₁}(ε, X), must satisfy the two fundamental properties (5.6) and (5.7) and is further left unspecified. It follows that any element of Λ_{γ₁} must satisfy properties (i) and (ii) of this theorem. Hence, a reasonable conjecture for the semiparametric nuisance tangent space for η₁(·) is

    Λ₁^conj = {a(ε, X) ∈ H_q : E{a(ε, X)|X} = 0^{q×1} and E{a(ε, X)εᵀ|X} = 0^{q×d}}.

We already know that Λ_{γ₁} ⊂ Λ₁^conj. To prove the converse, take any a(ε, X) ∈ Λ₁^conj that is bounded. Consider the parametric submodel

    p_{ε|X}(ε|x; γ₁) = η_{1,0}(ε, x){1 + γ₁ᵀ a(ε, x)}

for a q-dimensional parameter γ₁. By the boundedness of a(ε, X), the parameter γ₁ can be chosen sufficiently small to ensure that {1 + γ₁ᵀ a(ε, x)} ≥ 0 for all ε, x. It is clear this parametric submodel contains the truth: just take γ₁ to be 0. Since E{a(ε, X)|X} = 0, it is clear that p_{ε|X}(ε|x; γ₁) defines a proper density function for all x and γ₁, i.e.,

    ∫ p_{ε|X}(ε|x; γ₁) dℓ_Y(ε) = 1.

In addition, from E{a(ε, X)εᵀ|X} = 0, it easily follows that E(ε|X) = 0 for all x and γ₁, i.e.,

    ∫ ε p_{ε|X}(ε|x; γ₁) dℓ_Y(ε) = 0.

Following the same argument as before, we can define partial derivatives with respect to γ₁ ∈ Θ₁ ⊂ R^q, an open set. Hence, it is trivial to see that the score for γ₁ for this parametric submodel is given by S_{γ₁}(ε, X) = a(ε, X), and with B^{q×q} = I^{q×q}, we see that a(ε, X) ∈ Λ_{γ₁}. Finally, any a(ε, X) ∈ Λ₁^conj can be written as a limit of bounded elements from Λ₁^conj, since the bounded functions are dense in H_q. This shows that Λ₁^conj = Λ₁.

Now that we have obtained a precise description of the semiparametric nuisance tangent space for η₁(·), condition (i) of Theorem 5.1 shows that Λ₁ ⊂ J₁. In addition, if we define J₁ = J_β ⊕ Λ₁, we see that this J₁ is a subspace of the nonparametric space J₁ above. The space J₁ has the interpretation of the tangent space for the conditional density of Y given X, parametrized through β and η₁(·).

Nuisance Tangent Space for the Nuisance Parameter η2(·): Λ2

The construction of the nuisance tangent space for η₂(·) now almost becomes a trivial task. Since we are considering marginal distributions of X with no restrictions, finding the space Λ₂ is similar to finding the tangent space for a nonparametric model.

Theorem 5.2. The space Λ₂ is the space of all q-dimensional random functions of X, α(X) ∈ H_q, and thus E{α(X)} = 0^{q×1}.

Proof. We know the space Λ₂ is the mean-square closure of all parametric submodel nuisance tangent spaces Λ_{γ₂} where, as we have seen, the corresponding score for γ₂, i.e., S_{γ₂}(X), is a function of X only, must satisfy (5.9), and is further left unspecified. It follows that any element of Λ₂ must satisfy these properties. Hence, a reasonable conjecture for the semiparametric nuisance tangent space for η₂(·) is

    Λ₂^conj = {α(X) ∈ H_q : E{α(X)} = 0^{q×1}}.

We already know that Λ_{γ₂} ⊂ Λ₂^conj. To prove the converse, take an arbitrary bounded function α(X) ∈ Λ₂^conj. By similar reasoning as in Theorem 5.1, we can show that the parametric submodel

    p_X(x; γ₂) = η_{2,0}(x){1 + γ₂ᵀ α(x)},

for a q-dimensional parameter γ₂, is a proper parametric submodel containing the truth with score vector S_{γ₂}(X) = α(X). With B^{q×q} = I^{q×q}, we see that α(X) ∈ Λ_{γ₂}. Finally, any α(X) ∈ Λ₂^conj can be written as a limit of bounded elements from Λ₂^conj, since the bounded functions are dense in Λ₂^conj. This shows that Λ₂^conj = Λ₂.

Now that we have obtained a precise description of the semiparametric nuisance tangent space for η₂(·), we see that Λ₂ = J₂.

Further Properties of Λ1 and Λ2

Recall that we obtained that

(i) Λ₁ = {a(ε, X) ∈ H_q : E{a(ε, X)|X} = 0^{q×1} and E{a(ε, X)εᵀ|X} = 0^{q×d}},

(ii) Λ₂ = {α(X) ∈ H_q : E{α(X)} = 0}.

By the results obtained in §4.5.2, we know that J1 ⊥ J2. Since Jβ ⊕ Λ1 = J1 ⊂ J1 and Λ2 = J2, we see that:

Jβ ⊥ Λ2, (5.12)

Λ1 ⊥ Λ2. (5.13)

Looking at the description of Λ1, we see that Λ1 = J1 ∩ Λb, with

    Λ_b = {b(ε, X) ∈ H_q : E{b(ε, X)εᵀ|X} = 0^{q×d}}.

We already know that Λ = Λ₁ ⊕ Λ₂. In the present notation, we can write Λ = (J₁ ∩ Λ_b) ⊕ J₂. This could be our final conclusion. However, the restricted moment model possesses more structure, and this will reveal a simpler form for the nuisance tangent space Λ. With it, deriving the orthogonal complement Λ^⊥ of the nuisance tangent space will become much easier.

Lemma 5.2. We have the following relations:

(i) J₁ = J₂^⊥,

(ii) J2 ⊂ Λb,

(iii) Λ = (J1 ∩ Λb) ⊕ J2 = Λb.

Proof. (i) This follows from the discussion in §4.5.2. However, it can be insightful to prove this explicitly for this specific case. We first show that J1 ⊥ J2. This follows from a simple application of the law of iterated expectations. Take a(ε, X) ∈ J1 and α(X) ∈ J2,

    E{αᵀ(X)a(ε, X)} = E[E{αᵀ(X)a(ε, X)|X}] = E[αᵀ(X)E{a(ε, X)|X}] = 0.

To show that J₂ is indeed the orthogonal complement of J₁, we need to show that any h ∈ H_q can be written as h₁ + h₂ with h_i ∈ J_i, i = 1, 2. Just write h = {h − E(h|X)} + E(h|X). It is trivial to see that {h − E(h|X)} ∈ J₁ and E(h|X) ∈ J₂. Hence, we obtain that Π(h|J₁) = {h − E(h|X)} and Π(h|J₂) = E(h|X).

(ii) Take α(X) ∈ J₂. Then

    E{α(X)εᵀ|X} = α(X)E(εᵀ|X) = 0.

Hence, we obtain that α(X) ∈ Λ_b and thus J₂ ⊂ Λ_b.

(iii) We first show that Λ ⊂ Λ_b. Any element of Λ can be written as h₁ + h₂ with h₁ ∈ J₁ ∩ Λ_b and h₂ ∈ J₂. By definition of the intersection, we see that h₁ ∈ Λ_b. In addition, we just showed that J₂ ⊂ Λ_b and hence h₂ ∈ Λ_b. Since Λ_b is a linear space, we obtain that h₁ + h₂ ∈ Λ_b and

thus Λ ⊂ Λb. We now show that Λb ⊂ Λ. Take h ∈ Λb. Since E(h|X) ∈ J2 ⊂ Λb and Λb is a linear space, we obtain that {h − E(h|X)} ∈ Λb. This implies that h can be written as

h = {h − E(h|X)} + E(h|X)

with E(h|X) ∈ J₂ and {h − E(h|X)} ∈ Λ_b. Since J₁ = J₂^⊥, we see that {h − E(h|X)} also belongs to J₁ and hence {h − E(h|X)} ∈ J₁ ∩ Λ_b, from which we can conclude that Λ_b ⊂ (J₁ ∩ Λ_b) ⊕ J₂ = Λ. From the two inclusions Λ_b ⊂ Λ and Λ ⊂ Λ_b, we conclude that Λ_b = Λ.

From this lemma, it is clear we have obtained a simple description of the nuisance tangent space Λ for the restricted moment model, i.e.,

    Λ = {h(ε, X) ∈ H_q : E{h(ε, X)εᵀ|X} = 0^{q×d}}.    (5.14)

In §7.4.2, this result will be deduced in a more direct way without considering the decomposition Λ = Λ1 ⊕ Λ2 by representing the restricted moment model in a slightly different way.

5.1.3 The Class of Influence Functions

In this section, we wish to derive the set of all influence functions of RAL estimators for the q-dimensional parameter β in the restricted moment model. Recall that from Theorem 4.3, we know that the influence function of any semiparametric RAL estimator must satisfy

1. E{ϕ(ε, X)S_βᵀ(ε, X)} = I^{q×q},

2. Π{ϕ(ε, X)|Λ} = 0, i.e., ϕ ⊥ Λ.

To find the set of all influence functions, from (2) it is clear we first need to identify the orthogonal complement Λ⊥ of the nuisance tangent space Λ.

The Orthogonal Complement of the Nuisance Tangent Space: Λ⊥

The space Λ^⊥ is the linear space of residuals h(ε, X) − Π{h(ε, X)|Λ} for all h(ε, X) ∈ H_q. However, this characterization is not very useful as such. Luckily, the restricted moment model admits a simple formula for Λ^⊥, which we describe in the following theorem.

Theorem 5.3. The space orthogonal to the nuisance tangent space is given by

⊥ n q×d q×d o Λ = A (X)ε : A(X) ∈ R (X) , (5.15)

where R^{q×d}(X) denotes the space of all q × d matrices of arbitrary L₂-functions of X over the field R. Moreover, the projection of any h(ε, X) ∈ H_q onto Λ satisfies

    h(ε, X) − Π{h(ε, X)|Λ} = g(X)ε,    (5.16)

where the q × d matrix g(X) of functions of X is given by

    g(X) = E{h(ε, X)εᵀ|X}{E(εεᵀ|X)}⁻¹.    (5.17)

Proof. We first prove that Λ ⊥ Λ^⊥. Take any A(X)ε ∈ Λ^⊥ and a(ε, X) ∈ Λ. We know that E{a(ε, X)εᵀ|X} = 0. This means that E{a_j(ε, X)ε_{j'}|X} = 0 for all j = 1, ..., q and j' = 1, ..., d. Hereby we use the notation a(ε, X) = [a₁(ε, X), ..., a_q(ε, X)]ᵀ, ε = [ε₁, ..., ε_d]ᵀ and

    A(X) = [ A₁₁(X)  ···  A_{1d}(X) ]
           [   ⋮             ⋮      ]
           [ A_{q1}(X) ··· A_{qd}(X) ].

Using this, we deduce that

    E{aᵀ(ε, X)A(X)ε} = E[E{aᵀ(ε, X)A(X)ε|X}] = ∑_{j=1}^{q} ∑_{j'=1}^{d} E[A_{jj'}(X)E{a_j(ε, X)ε_{j'}|X}] = 0.

We can conclude that Λ ⊥ Λ^⊥. To prove that Λ^⊥ is indeed the orthogonal complement of Λ, we need to show that any h ∈ H_q can be written as h₁ + h₂ with h₁ ∈ Λ and h₂ ∈ Λ^⊥. This is equivalent to showing that there exists a q × d matrix g(X) of functions of X such that h(ε, X) − g(X)ε ∈ Λ. Thus, we need to solve the equation

    E[{h(ε, X) − g(X)ε}εᵀ|X] = 0,

which yields g(X) = E{h(ε, X)εᵀ|X}{E(εεᵀ|X)}⁻¹. This is clearly just a function of X. We assume that the conditional variance matrix E(εεᵀ|X) is positive definite and hence invertible.

Corollary 5.1. From (5.16) and (5.17), it follows that

    Λ = {h(ε, X) − E{h(ε, X)εᵀ|X}{E(εεᵀ|X)}⁻¹ε : h(ε, X) ∈ H_q}.    (5.18)
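The projection formulas (5.16)-(5.17) can be checked numerically. The Python sketch below takes q = d = 1 and assumes a toy data-generating mechanism (X uniform on (0,1), ε|X normal with variance 1 + X) together with the element h(ε, X) = ε³, for which g(X) = E{hε|X}/E{ε²|X} = 3(1 + X); none of these specific choices comes from the text, they are purely illustrative. It verifies by Monte Carlo that the residual h − g(X)ε is orthogonal to an arbitrary element A(X)ε of Λ^⊥.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X = rng.uniform(0.0, 1.0, size=n)
sigma2 = 1.0 + X                         # assumed conditional variance of eps given X
eps = rng.normal(0.0, np.sqrt(sigma2))   # E(eps | X) = 0, as the model requires

h = eps**3                               # an arbitrary mean-zero element of H_1
g = 3.0 * sigma2                         # g(X) = E{h*eps|X} / E{eps^2|X} = 3*(1 + X)
resid = h - g * eps                      # claimed projection of h onto Lambda

A = np.cos(X)                            # arbitrary function of X; A(X)*eps lies in Lambda-perp
print(np.mean(resid * A * eps))          # approx 0: the residual is orthogonal to Lambda-perp
print(np.mean(h * A * eps), np.mean(g * eps * A * eps))   # approximately equal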

Now that we have determined the orthogonal complement Λ^⊥, we have almost identified the class of all influence functions for the restricted moment model. On top of the orthogonality to the nuisance tangent space, influence functions of RAL estimators need to be normalized in the sense that E{ϕ(ε, X)S_βᵀ(ε, X)} = I^{q×q}. Thus, to obtain an influence function, we need to multiply the elements of Λ^⊥ with a normalization factor, which is a matrix of constants in this case. Let us be more formal. Take an arbitrary element of Λ^⊥, say A(X)ε. Define ϕ(ε, X) = C A(X)ε, where C ∈ R^{q×q} is a matrix of constants. Then ϕ(ε, X) will be an influence function iff E{C A(X)ε S_βᵀ(X, β₀, η₀)} = I^{q×q}, from which we see that

    C = [E{A(X)ε S_βᵀ(ε, X)}]⁻¹.    (5.19)

Equation (5.19) can be simplified. Using the law of iterated expectations and the transposed version of (5.11), we easily deduce that

    C = [E{A(X)ε S_βᵀ(ε, X)}]⁻¹ = (E[A(X)E{ε S_βᵀ(ε, X)|X}])⁻¹ = [E{A(X)D(X)}]⁻¹.

From these calculations, we may conclude that the class of all influence functions for the restricted moment model is given by

    V = {[E{A(X)D(X)}]⁻¹ A(X){Y − µ(X, β₀)} : A(X) ∈ R^{q×d}(X)}.    (5.20)

Construction of Semiparametric RAL Estimators: GEE

Now that we have derived the class of influence functions for the restricted moment model, we would like, if possible, to construct the corresponding RAL estimators. The shape of V motivates us to consider an m-estimator β̂_n (introduced in §3.2.2) for β that solves

    ∑_{i=1}^{n} [E{A(X)D(X)}]⁻¹ A(X_i){Y_i − µ(X_i, β̂_n)} = 0.

Equivalently, this can be written as

    ∑_{i=1}^{n} A(X_i){Y_i − µ(X_i, β̂_n)} = 0.    (5.21)

These equations are well known in statistics: they are called the Generalized Estimating Equations, GEE, as defined in Liang and Zeger (1986), [23]. The solution β̂_n to these estimating equations is then called a GEE-estimator. Subject to some regularity conditions, β̂_n is a consistent and asymptotically normal estimator for β. This follows from the properties of an ALE as shown in [35], §4.1. We omit this derivation to avoid more technical difficulties; the argument given in [35] merely makes use of a Taylor expansion of (5.21) about the truth β₀. The result is then that

    √n(β̂_n − β₀) = (1/√n) ∑_{i=1}^{n} [E{A(X)D(X)}]⁻¹ A(X_i){Y_i − µ(X_i, β₀)} + o_P(1).

It follows that β̂_n is an ALE with influence function ϕ(Y, X) = {E(AD)}⁻¹A(X){Y − µ(X, β₀)}. By Theorem 4.3, it then follows that β̂_n is a RAL estimator. The asymptotic variance of β̂_n equals the variance of the corresponding influence function, and by a simple calculation using the law of iterated variance, we deduce that

    var{ϕ(Y, X)} = [E{A(X)D(X)}]⁻¹ var[A(X){Y − µ(X, β₀)}] [E{A(X)D(X)}]⁻ᵀ
                 = [E{A(X)D(X)}]⁻¹ (E{var[A(X){Y − µ(X, β₀)}|X]} + var{E[A(X){Y − µ(X, β₀)}|X]}) [E{A(X)D(X)}]⁻ᵀ
                 = [E{A(X)D(X)}]⁻¹ E{A(X)V(X)Aᵀ(X)} [E{A(X)D(X)}]⁻ᵀ,

where V(X) = var(Y|X), the d × d conditional variance matrix of Y given X. It is self-evident that, for data-analytic applications such as constructing confidence intervals for β, we must be able to derive a consistent estimator for the asymptotic variance of β̂_n. We now give one possibility, with some arguments for constructing such an estimator, without going into technical details. The estimator we will discuss is a so-called sandwich estimator and is widely used in semiparametric inference, not only in the case of the restricted moment model.

First suppose the truth β₀ is known to us. By the WLLN, a consistent estimator for E(AD) is given by

    Ê₀(AD) = (1/n) ∑_{i=1}^{n} A(X_i, β₀)D(X_i, β₀),    (5.22)

where the subscript 0 is used to emphasize that the estimator is evaluated at the truth. Also by the WLLN, a consistent estimator for E(AVAᵀ) is given by

    Ê₀(AVAᵀ) = (1/n) ∑_{i=1}^{n} A(X_i, β₀){Y_i − µ(X_i, β₀)}{Y_i − µ(X_i, β₀)}ᵀ Aᵀ(X_i, β₀).    (5.23)

Since the value β₀ is not known to us, these estimators are not very useful. However, because β̂_n is a consistent estimator for β₀, using the uniform WLLN, we obtain that

    Ê(AD) = (1/n) ∑_{i=1}^{n} A(X_i, β̂_n)D(X_i, β̂_n)

is a consistent estimator for E(AD) and

    Ê(AVAᵀ) = (1/n) ∑_{i=1}^{n} A(X_i, β̂_n){Y_i − µ(X_i, β̂_n)}{Y_i − µ(X_i, β̂_n)}ᵀ Aᵀ(X_i, β̂_n)

is a consistent estimator for E(AVAᵀ). Finally, the Continuous Mapping Theorem yields that

    {Ê(AD)}⁻¹ Ê(AVAᵀ) {Ê(AD)}⁻ᵀ    (5.24)

is a consistent estimator for the asymptotic variance of the RAL estimator β̂_n. From the shape of (5.24), it is clear why we call this a sandwich estimator. More details about these sandwich estimators can be found in Liang and Zeger (1986), [23]. Readers not feeling at ease with this concept are referred to [35], p. 57-58, where an application of this methodology is given for the log-linear model.
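As an illustration of the GEE (5.21) and the sandwich estimator (5.24), the following Python sketch uses a toy scalar-response example (not from the text) with a linear mean function µ(X, β) = β₀ + β₁X and the choice A(X) = Dᵀ(X) = (1, X)ᵀ; the data-generating process and all numerical settings are assumptions made only for this sketch. With a linear mean the estimating equation is linear in β and can be solved directly.

import numpy as np

rng = np.random.default_rng(2)
n = 5_000
X = rng.uniform(-1.0, 1.0, size=n)
beta_true = np.array([1.0, 2.0])
# heteroscedastic errors: the restricted moment model only asks E(Y|X) = mu(X, beta)
Y = beta_true[0] + beta_true[1] * X + rng.normal(0.0, 0.5 + np.abs(X))

# mean model mu(X, beta) = beta0 + beta1*X, so D(X) = (1, X); take A(X) = D(X)^T
Amat = np.column_stack([np.ones(n), X])          # n x 2, row i = A(X_i)^T

# GEE (5.21): sum_i A(X_i){Y_i - mu(X_i, beta)} = 0 -> linear in beta, solve directly
beta_hat = np.linalg.solve(Amat.T @ Amat, Amat.T @ Y)

# sandwich variance estimator (5.24)
resid = Y - Amat @ beta_hat
E_AD = (Amat.T @ Amat) / n                       # (1/n) sum A(X_i) D(X_i)
E_AVA = (Amat.T @ (Amat * resid[:, None]**2)) / n
bread = np.linalg.inv(E_AD)
avar = bread @ E_AVA @ bread.T                   # asymptotic variance of sqrt(n)(beta_hat - beta0)
print(beta_hat)
print("standard errors:", np.sqrt(np.diag(avar) / n))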

5.1.4 The Efficient Influence Function

We now enter the final step in our search for the most efficient RAL estimator. Recall that, since the parameter can be nicely partitioned into the parameter of interest β and the nuisance parameter η, we can find the efficient influence function by considering the efficient score, defined in (4.4). Hence, we need to take the residual after projecting S_β onto the nuisance tangent space Λ, i.e.,

    S_eff(ε, X) = S_β(ε, X) − Π{S_β(ε, X)|Λ}.

Using (5.16) and (5.17), we can write

    S_eff(ε, X) = E{S_β(ε, X)εᵀ|X}V⁻¹(X)ε.

Using (5.11), we obtain a simpler formula,

    S_eff(ε, X) = Dᵀ(X)V⁻¹(X)ε.    (5.25)

To obtain the efficient influence function, we need to multiply by the appropriate normalization matrix of constants C_eff. With A(X) = Dᵀ(X)V⁻¹(X), we obtain that this normalization

matrix is given by C_eff = [E{Dᵀ(X)V⁻¹(X)D(X)}]⁻¹. Consequently, the efficient influence function is given by

    ϕ_eff(Y, X) = [E{Dᵀ(X)V⁻¹(X)D(X)}]⁻¹ Dᵀ(X)V⁻¹(X){Y − µ(X, β₀)}.    (5.26)

An easy calculation yields that var{ϕ_eff(Y, X)} = [E{Dᵀ(X)V⁻¹(X)D(X)}]⁻¹, which exactly equals the semiparametric efficiency bound [E{S_eff(ε, X)S_effᵀ(ε, X)}]⁻¹. Using the ideas of the previous section, the optimal estimator for β can be obtained by solving the estimating equations

    ∑_{i=1}^{n} Dᵀ(X_i)V⁻¹(X_i){Y_i − µ(X_i, β)} = 0.    (5.27)

We see that, to obtain the optimal estimator for β, we need to solve the efficient score equations, since we are summing up the efficient score (5.25) evaluated over the collected sample.
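Continuing the toy linear example from the previous sketch, if V(X) were known the efficient score equations (5.27) reduce to weighted least squares with weights V(X_i)⁻¹. The Python sketch below assumes a known variance function V(X) = (0.5 + |X|)² (an assumption made only for this illustration) and also reports the estimated efficiency bound [Ê{DᵀV⁻¹D}]⁻¹.

import numpy as np

rng = np.random.default_rng(3)
n = 5_000
X = rng.uniform(-1.0, 1.0, size=n)
V = (0.5 + np.abs(X))**2                     # assumed-known conditional variance var(Y|X)
Y = 1.0 + 2.0 * X + rng.normal(0.0, np.sqrt(V))

D = np.column_stack([np.ones(n), X])         # D(X_i) = (1, X_i); the mean model is linear in beta
W = 1.0 / V

# efficient score equations (5.27): sum_i D^T(X_i) V(X_i)^{-1} {Y_i - mu(X_i, beta)} = 0,
# which for a linear mean model is exactly weighted least squares
beta_eff = np.linalg.solve(D.T @ (D * W[:, None]), D.T @ (W * Y))

# semiparametric efficiency bound [E{D^T V^{-1} D}]^{-1}, estimated by its sample analogue
bound = np.linalg.inv((D.T @ (D * W[:, None])) / n)
print(beta_eff)
print("efficient standard errors:", np.sqrt(np.diag(bound) / n))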

5.1.5 Some Additional Notes on the Restricted Moment Model

1. A Different Representation

We now briefly describe how we can generalize the results obtained in the preceding sections to a discrete response variable. Since in the previous sections we assumed the response variable Y was continuous, the error term ε = Y − µ(X, β) was also continuous, and this allowed us to consider conditional densities for ε given X with respect to the Lebesgue measure. Consequently, the model could be described through variationally independent parameters {β, η₁(ε, x), η₂(x)}. However, we now suppose Y is not a continuous but a discrete response variable; e.g., Y is a dichotomous response variable, in which case we will use the logistic regression model¹, which is also a restricted moment model. To see where the difficulties arise, just look at the residual Y − µ(X, β). This variable is not discrete anymore and it may no longer have a dominating measure that allows us to define densities. Hence, we need to work with densities defined on (Y, X), p(y, x), with respect to a dominating measure ν_Y × ν_X where ν_Y will typically be the counting measure. We now briefly describe how to generalize the earlier obtained results to this more general case. A more detailed description can be found in Tsiatis (2006), [35], p. 88-92.

The joint density can be written as p(y, x) = p(y|x)p(x) and the truth is denoted by p₀(y, x) = p₀(y|x)p₀(x). Parametric submodels are then described in completely the same way as in §5.1.1, i.e., p(y, x; β, γ₁, γ₂) = p(y|x; β, γ₁)p(x; γ₂) for finite-dimensional parameters (βᵀ, γ₁ᵀ, γ₂ᵀ)ᵀ. Using the same methodology, one can show that Λ = Λ₁ ⊕ Λ₂, where

    Λ₂ = J₂ = {α(X) ∈ H_q : E{α(X)} = 0}

and

    Λ₁ = {a(Y, X) ∈ H_q : E{a(Y, X)|X} = 0 and E{a(Y, X)Yᵀ|X} = 0}.

¹ Recall that for the logistic regression model, we work with the function µ(X, β) = exp(βᵀX*)/{1 + exp(βᵀX*)}, where X* = (1, Xᵀ)ᵀ and E(Y|X) = P(Y = 1|X).

Since Λ₁ = J₁ ∩ Λ_b, where J₁ = {a(Y, X) ∈ H_q : E{a(Y, X)|X} = 0} and Λ_b = {a(Y, X) ∈ H_q : E[a(Y, X){Y − µ(X, β₀)}ᵀ|X] = 0}, we have that Λ = (J₁ ∩ Λ_b) ⊕ J₂. Using the same series of lemmas, we obtain that Λ = Λ_b. Hence, we obtain that

    Λ^⊥ = {A(X){Y − µ(X, β₀)} : A(X) ∈ R^{q×d}(X)},

and for any h(Y, X) ∈ H_q, we have that

    Π{h(Y, X)|Λ^⊥} = h(Y, X) − Π{h(Y, X)|Λ} = E[h(Y, X){Y − µ(X, β₀)}ᵀ|X]V⁻¹(X){Y − µ(X, β₀)}.

As a last step, we need to derive the efficient score. Using the same arguments, we obtain that

⊥ Seff (Y,X) = Π{Sβ(Y,X)|Λ } T −1 = E[Sβ(Y,X){Y − µ(X, β0)} |X]V (X){Y − µ(X, β0)} T −1 = D (X)V (X){Y − µ(X, β0)}.

Remark 5.2. We advise the reader to fill in the details themselves, as it is a good exercise in how semiparametric theory is used in practice. As we already noted, the details of these calculations can be found in [35], p. 88-92; thus, [35] can be used to check your calculations.

2. Working Variance Assumption

Recall the optimal estimating equation for the semiparametric restricted moment model:

\[ \sum_{i=1}^n D^T(X_i)V^{-1}(X_i)\{Y_i-\mu(X_i,\beta)\} = 0. \qquad (5.28) \]
This estimating equation depends on using the correct conditional variance matrix $V(X) = \mathrm{var}(Y|X)$. Unfortunately, in the restricted moment model, this is left unspecified. Consequently, the function $V(X)$ is unknown to us and needs to be estimated. Let us first look at a special case in which $V(X)$ can be derived exactly.

Logistic Regression Model. We already know that the logistic regression model is a special case of the restricted moment model. The response variable $Y$ is dichotomous and one-dimensional. More specifically, we have that

\[ P(Y=1|X) = E(Y|X) = \mu(X,\beta) = \frac{\exp(\beta^T X^*)}{1+\exp(\beta^T X^*)}, \]
where $X^* = (1,X^T)^T$. Since $Y$ is binary, we know that

\[ V(X) = \mathrm{var}(Y|X) = \mu(X,\beta_0)\{1-\mu(X,\beta_0)\} = \frac{\exp(\beta_0^T X^*)}{\{1+\exp(\beta_0^T X^*)\}^2}. \]

Using the chain rule, we deduce that $D^T(X) = X^*V(X)$, where we used the specific shape of $V(X)$. This implies that the optimal estimating equation in the logistic regression model is given by
\[ \sum_{i=1}^n \begin{pmatrix}1\\ X_i\end{pmatrix}\left\{Y_i - \frac{\exp(\beta^T X_i^*)}{1+\exp(\beta^T X_i^*)}\right\} = 0. \qquad (5.29) \]
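For concreteness, a short numerical sketch of solving (5.29) is given below; the simulated design and parameter values are illustrative assumptions and not part of the text.

```python
import numpy as np
from scipy.optimize import root

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative simulated data (assumed, not from the thesis)
rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=n)
Xstar = np.column_stack([np.ones(n), X])          # X* = (1, X^T)^T
beta_true = np.array([-0.3, 0.8])
Y = rng.binomial(1, expit(Xstar @ beta_true))

def eq(beta):
    # sum_i X*_i {Y_i - expit(beta^T X*_i)} = 0, i.e. equation (5.29)
    return Xstar.T @ (Y - expit(Xstar @ beta))

beta_hat = root(eq, x0=np.zeros(2)).x
print("beta_hat:", beta_hat, " truth:", beta_true)
```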

From equation (5.29), it is clear that $\beta$ is the only unknown piece and can therefore be nicely solved for without estimating other parts of the estimating equation. Henceforth, the resulting estimator is a globally efficient estimator.

Unfortunately, this is not a general approach, since the result merely follows from the fact that $Y$ is dichotomous. In what follows, we take $Y$ to be one-dimensional for ease of explanation. One approach that is proposed in [35] is to posit some relationship for the variance function $V(x)$, either completely specified or as a function of an additional finite-dimensional parameter $\xi$ as well as the parameter $\beta$, i.e., $V(x,\xi,\beta)$. This is referred to as a working variance assumption because this model may not contain the truth and thus this is an assumption. Some examples are $\exp(\xi_0+\xi_1^Tx)$ or $\xi_0^2\{\mu(x,\beta)\}^{\xi_1}$. Note these functions are always positive. Obviously, the parameter $\xi$ needs to be estimated. One way to do this, as discussed in [35], is to consider $\hat\beta_n^{\mathrm{init}}$, an initial consistent estimator for $\beta$ using the function $A(X) = D(X)$ in the estimating equation (5.21) (which is equivalent to a working variance $V(X)$ equal to the identity matrix). Using this initial estimator, an estimator for $\xi$ can be found by solving
\[ \sum_{i=1}^n Q(X_i,\xi,\hat\beta_n^{\mathrm{init}})\Big[\{Y_i-\mu(X_i,\hat\beta_n^{\mathrm{init}})\}^2 - V(X_i,\xi,\hat\beta_n^{\mathrm{init}})\Big] = 0, \]
where $Q(X,\xi,\beta)$ is an arbitrary vector of functions of $X$, $\xi$ and $\beta$ of dimension equal to the dimension of $\xi$, e.g., $Q(X,\xi,\beta) = \partial V(X,\xi,\beta)/\partial\xi$. The resulting estimator is denoted by $\hat\xi_n$. It can be shown that under weak regularity conditions, $\hat\xi_n$ converges in probability to some constant $\xi^*$. In addition, it can also be shown that using $V(X_i,\hat\xi_n,\hat\beta_n^{\mathrm{init}})$ in (5.28) will result in a RAL estimator $\hat\beta_n$ with influence function
\[ \varphi(Y,X) = [E\{A(X)D(X)\}]^{-1}A(X)\{Y-\mu(X,\beta)\}, \]

where $A(X) = D^T(X)V^{-1}(X,\xi^*,\beta_0)$. This has a very important implication: the resulting estimator is locally efficient. If the working variance contains the truth, $V(X_i,\hat\xi_n,\hat\beta_n^{\mathrm{init}})$ will converge to the truth $V(x,\xi_0,\beta_0)$ and the resulting estimator is semiparametric efficient; otherwise it is not. However, if the working variance is not correct, since this is still a function of $X$, the resulting estimator for $\beta$ is still a RAL estimator and thus consistent and asymptotically normal. The asymptotic variance of this estimator can then be estimated using the sandwich estimator (5.24) with $A(X) = D^T(X,\hat\beta_n)V^{-1}(X_i,\hat\xi_n,\hat\beta_n^{\mathrm{init}})$. No further adjustment of the asymptotic variance is necessary since the influence function is orthogonal to the nuisance tangent space, assuming $V(X_i,\hat\xi_n,\hat\beta_n^{\mathrm{init}})$ is at least an $n^{1/4}$-consistent estimator. Below, we illustrate this methodology for a log-linear model (see [35], p. 96, for more details).

Log-Linear Model. A log-linear model is used to model the relationship of a positive response variable $Y$ as a function of covariates $X$. For simplicity, we take $Y$ to be one-dimensional. Let us denote by $X = [X_1,\ldots,X_{q-1}]^T$ the $(q-1)$-dimensional vector of covariates. The log-linear model then assumes
\[ \log\{E(Y|X)\} = \alpha + \sum_{i=1}^{q-1}\delta_iX_i. \]
The parameter of interest is then given by $\beta = [\alpha,\delta_1,\ldots,\delta_{q-1}]^T$.

Remark 5.3. The log-linear model is a restricted moment model given by

\[ E(Y|X) = \mu(X,\beta) = \exp\Big(\alpha + \sum_{i=1}^{q-1}\delta_iX_i\Big). \]

The transformation exp(·) assures the conditional mean response is always positive. We know the optimal semiparametric estimator for β is given as the solution to the equation

\[ \sum_{i=1}^{n} D^T(X_i,\beta)V^{-1}(X_i)\{Y_i-\mu(X_i,\beta)\} = 0. \]
It is easy to see that
\[ D^T(X_i,\beta) = \frac{\partial\mu(X_i,\beta)}{\partial\beta} = \mu(X_i,\beta)\begin{pmatrix}1\\ X_i\end{pmatrix}. \]
Since $Y$ will typically represent a count, we might be willing to assume that $Y$ follows a Poisson distribution. In this case, $V(X) = \mu(X,\beta)$. This may be too strong an assumption and hence we assume $V(X) = \sigma^2\mu(X,\beta)$, where $\sigma^2$ is a scale factor. Under this assumption, the most efficient RAL estimator is the solution to the estimating equation

\[ \sum_{i=1}^{n}\begin{pmatrix}1\\ X_i\end{pmatrix}\left\{Y_i - \exp\Big(\alpha + \sum_{j=1}^{q-1}\delta_jX_{ji}\Big)\right\} = 0. \]

Thus, if $V(X)$ is indeed given by $\sigma^2\mu(X,\beta)$, the resulting estimator is efficient, and if $V(X)$ is not given by $\sigma^2\mu(X,\beta)$, we still obtain a RAL estimator since $\sigma^2\mu(X,\beta)$ is a function of $X$ only. Hence, the resulting estimator is locally efficient. For a more detailed discussion, we refer to McCullagh and Nelder (1989), [25], Chapter 6.
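The sketch below illustrates, under purely illustrative assumptions (a simulated Poisson data set with a single covariate and the working variance $V(x,\xi,\beta)=\xi_0\,\mu(x,\beta)^{\xi_1}$), how the two-step working-variance strategy described above could be carried out numerically: an initial estimator with $A(X)=D(X)$, a moment equation for $\xi$, and a final pass through the optimal equation (5.28).

```python
import numpy as np
from scipy.optimize import root

# Two-step "working variance" sketch for a log-linear model.
# Assumed working variance: V(x, xi, beta) = xi0 * mu(x, beta)^xi1 (illustrative).
rng = np.random.default_rng(2)
n = 2000
X = rng.uniform(0, 1, size=n)
Xstar = np.column_stack([np.ones(n), X])          # (1, X)
beta_true = np.array([0.5, 1.0])                  # (alpha, delta), illustrative truth
Y = rng.poisson(np.exp(Xstar @ beta_true))        # counts; truth is Poisson, so xi ~ (1, 1)

def mu(beta):
    return np.exp(Xstar @ beta)

# Step 1: initial estimator with A(X) = D(X), i.e. working variance = identity
def eq_init(beta):
    m = mu(beta)
    return Xstar.T @ (m * (Y - m))

beta_init = root(eq_init, x0=np.zeros(2)).x

# Step 2: estimate xi from sum_i Q(X_i, xi) [ {Y_i - mu_i}^2 - V(X_i, xi) ] = 0,
# with Q = dV/dxi
m_init = mu(beta_init)
def eq_xi(xi):
    V = xi[0] * m_init ** xi[1]
    Q = np.column_stack([m_init ** xi[1],
                         xi[0] * m_init ** xi[1] * np.log(m_init)])
    return Q.T @ ((Y - m_init) ** 2 - V)

xi_hat = root(eq_xi, x0=np.array([1.0, 1.0])).x

# Step 3: plug the fitted working variance into the optimal equation (5.28)
def eq_final(beta):
    m = mu(beta)
    V = xi_hat[0] * m ** xi_hat[1]
    return (Xstar * (m / V)[:, None]).T @ (Y - m)    # sum_i D^T V^{-1} (Y_i - mu_i)

beta_hat = root(eq_final, x0=beta_init).x
print("initial:", beta_init, " xi:", xi_hat, " final:", beta_hat)
```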

5.2 Adaptive Estimation

We now briefly discuss a phenomenon that frequently occurs in semiparametric theory and explicitly shows why a semiparametric approach can be attractive. This phenomenon is called adaptive estimation. As stated in Newey (1990), [27], adaptive estimation refers to models where the parameters of interest can be estimated equally well whether the nonparametric part of the model (or at least some part of it) is unknown or known. Such models are those where the semiparametric bound is equal to the bound that applies if (a part of) the nonparametric component of the model is known. For a more detailed discussion of this subject, we refer to Begun et al. (1983), [3] and Bickel et al. (1993), [5]. We will focus on semiparametric models of the form

1. $\mathcal{P}_1 = \{p(x;\beta,\eta) : \beta\in\Theta_\beta\subset\mathbb{R}^q,\ \eta\in\mathcal{H}\}$,
2. $\mathcal{P}_2 = \{p(x;\beta,\alpha,\eta) : \beta\in\Theta_\beta\subset\mathbb{R}^q,\ \alpha\in\Theta_\alpha\subset\mathbb{R}^r,\ \eta\in\mathcal{H}\}$,

where $\Theta_\beta$ and $\Theta_\alpha$ are open sets and $\mathcal{H}$ is an arbitrary infinite-dimensional set. For $\mathcal{P}_1$, the nuisance parameter is only the infinite-dimensional parameter $\eta$, and for $\mathcal{P}_2$, the nuisance parameter is the finite-dimensional parameter $\alpha$ together with the infinite-dimensional parameter $\eta$. These are both semiparametric models in a strict sense where the parameter can be nicely partitioned. Let us denote for both models the nuisance tangent space for the infinite-dimensional parameter by $\Lambda$. For the second model, the total nuisance tangent space equals $J_\alpha\oplus\Lambda$, where $J_\alpha = \{B^{q\times r}S_\alpha : B^{q\times r}\in\mathbb{R}^{q\times r}\}$ is the nuisance tangent space corresponding with the finite-dimensional parameter $\alpha$.

From the algebraic lemma in §3 of Stein (1956), [34], it follows that in a parametric model, a condition for the Cramér-Rao bound for the parameters of interest to be the same whether the nuisance parameters are known or unknown is block diagonality of the information matrix. The geometric interpretation of this block diagonality is orthogonality of the scores for the parameters of interest and the nuisance parameters. This has a nice semiparametric generalization: orthogonality to the nuisance tangent space. We give a necessary condition for both models $\mathcal{P}_1$ and $\mathcal{P}_2$, the second being an extension of the first.

1. A necessary condition for the existence of an adaptive estimator for $\beta$ in the model $\mathcal{P}_1$ is that $E(S_\beta^T h) = 0$ for all $h\in\Lambda$, i.e., $S_\beta\perp\Lambda$.

2. A necessary condition for the existence of an adaptive estimator for $\beta$ in the model $\mathcal{P}_2$ is that $E\big\{[S_\beta - E(S_\beta S_\alpha^T)\{E(S_\alpha S_\alpha^T)\}^{-1}S_\alpha]^T h\big\} = 0$ for all $h\in\Lambda$. By (2.4), it follows that $E(S_\beta S_\alpha^T)\{E(S_\alpha S_\alpha^T)\}^{-1}S_\alpha = \Pi(S_\beta|J_\alpha)$, and thus the necessary condition is that $S_\beta - \Pi(S_\beta|J_\alpha) = \Pi(S_\beta|J_\alpha^\perp)\perp\Lambda$.

It is important to remember that these results give necessary conditions. When they hold, it may be possible to construct estimators, computed without knowledge of (parts of) the nonparametric component, whose asymptotic variance equals the variance attainable if those parts were known; by this we mean that the semiparametric efficiency bound does not grow bigger if some additional parts of the nonparametric component are left unspecified. This still sounds quite difficult to visualize. However, we already encountered an example of this phenomenon in the restricted moment model. We deduced that $S_\beta\perp\Lambda_1$. Henceforth, this orthogonality implied no information was lost by leaving the distribution of $X$ unspecified. We now illustrate this phenomenon with some additional examples so the meaning becomes more obvious. Moreover, these examples yield nice exercises to practice the construction of semiparametric nuisance tangent spaces, and therefore we only sketch the results without giving all technical details. These exercises are quite straightforward and the same techniques as for the restricted moment model need to be used. Before doing so, we state a useful lemma for calculating projections, as presented in [27]. We omit some technical conditions since these would overload the formulas. These can also be found in [27], as well as a proof. The first result was already encountered in §4.5.2.

Lemma 5.3. We have the following projection results:

(i) Consider two random variables $U$ and $V$ with $E(U) = 0$, and let $\mathcal{V}$ be the space of all square integrable functions of $V$ with mean zero. The orthogonal projection of $U$ onto $\mathcal{V}$ is then given by $\Pi(U|\mathcal{V}) = E(U|V)$.
(ii) Consider four random variables $U$, $V$, $W$ and $T$. Suppose $V$ and $W$ are functions of $T$ such that $E(UU^T|T)$ is constant and positive definite. Let $\mathcal{U}_V$ be the space of all square integrable functions of the form $\alpha(V)U$. The orthogonal projection of $WU$ onto $\mathcal{U}_V$ is then given by $\Pi(WU|\mathcal{U}_V) = E(W|V)U$.
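As a quick sanity check of Lemma 5.3, (ii), one can verify the defining orthogonality property by Monte Carlo: the residual $WU - E(W|V)U$ should be orthogonal to every element $\alpha(V)U$ of $\mathcal{U}_V$. The particular distributions in the sketch below are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of Lemma 5.3 (ii); all distributional choices are illustrative.
rng = np.random.default_rng(3)
n = 200_000
V = rng.integers(0, 2, size=n)              # binary V
W = 2.0 * V + rng.normal(size=n)            # W a noisy function of T = (V, noise)
U = rng.normal(size=n)                      # U independent of T, E(U) = 0, E(U^2|T) constant

# E(W|V) estimated by group means (the exact conditional mean is 2V here)
EW_given_V = np.where(V == 1, W[V == 1].mean(), W[V == 0].mean())
residual = W * U - EW_given_V * U

for alpha in (lambda v: v, lambda v: 1 - v, lambda v: np.sin(v)):
    inner_product = np.mean(residual * alpha(V) * U)
    print(f"<residual, alpha(V)U> ~ {inner_product:+.4f}")   # all approximately zero
```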

1. Additive Semiparametric Regression Model

Recall that the additive semiparametric regression model satisfies the relation (4.1),

\[ Y_i = X_i^T\beta_0 + g_0(V_i) + \varepsilon_i, \]

introduced in §4.1, where we assumed the function $g_0(v)$ is unknown, the disturbance $\varepsilon_i$ is independent of the regressors and normally distributed with known variance $\sigma_0^2$, and the distribution of the regressors is also assumed to be known. Note this is a special case of a partially linear regression model, introduced in §1.3.2. An arbitrary density function in this model is given by
\[ p\{x,v,y;\beta,g(\cdot)\} = \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left[-\frac{1}{2}\frac{\{y-x^T\beta-g(v)\}^2}{\sigma_0^2}\right]p_0(x,v). \]
As discussed earlier, a parametric submodel is given by
\[ p(x,v,y;\beta,\gamma) = \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left[-\frac{1}{2}\frac{\{y-x^T\beta-g(v,\gamma)\}^2}{\sigma_0^2}\right]p_0(x,v). \]
Writing down the log likelihood, taking the partial derivatives with respect to $\beta$ and $\gamma$ and evaluating at the truth, it is easy to show that the scores are given by $S_\beta = \varepsilon x/\sigma_0^2$ and $S_\gamma = \varepsilon g_\gamma/\sigma_0^2$, where $\varepsilon = y - x^T\beta_0 - g_0(v)$ and $g_\gamma = \partial g(v,\gamma_0)/\partial\gamma$. Since $g_\gamma$ is a vector consisting of arbitrary functions of $v$ only, a conjecture for the nuisance tangent space is given by
\[ \Lambda_1 = \big\{\varepsilon\alpha(v) : \alpha(v)\in L_2(P)\big\}, \qquad (5.30) \]
where $P$ denotes the corresponding probability measure. Using the same arguments as for the restricted moment model, we can check this conjecture is the nuisance tangent space we are looking for. Just consider the parametric submodel $g(v,\gamma) = g_0(v) + \sigma_0^2\gamma^T\alpha(v)$.
Now that we have convinced ourselves that the nuisance tangent space is given by (5.30), we can search for the efficient score $S_{\mathrm{eff}} = S_\beta - \Pi(S_\beta|\Lambda_1)$. Using Lemma 5.3, (ii), this becomes almost trivial. Just put $U = \varepsilon$, $W = X/\sigma_0^2$, take the lemma's $V$ to be the covariate $V$ and $T = (X,V)$. By the independence assumption, we have that $E(UU^T|T) = E(\varepsilon^2) = \sigma_0^2$, a constant. Henceforth, we may conclude that
\[ S_{\mathrm{eff}} = S_\beta - E(W|V)U = \frac{\varepsilon}{\sigma_0^2}\{X - E(X|V)\}. \qquad (5.31) \]
However, we must admit, the assumption that $\sigma_0$ and the density of $X$ and $V$ are known is not realistic. Therefore, we now assume these are unknown. The nuisance parameter now becomes $(\sigma, g(\cdot), \eta(\cdot))$, where $\eta(\cdot)$ denotes the density of $X$ and $V$. Surprisingly, the efficient score given by (5.31) does not change. Indeed, consider the same parametric submodel as before, but now we insert a submodel for the density of $x$ and $v$ as well, described through a finite-dimensional parameter $\gamma^*$, i.e., $p(x,v;\gamma^*)$. Assume $\gamma$ and $\gamma^*$ are variationally independent. The scores for $\beta$ and $\gamma$ remain the same, but now we have additional scores for $\sigma$ and $\gamma^*$ given by $S_\sigma = (\varepsilon^2-\sigma_0^2)/\sigma_0^3$ and $S_{\gamma^*} = \partial\log p(x,v;\gamma_0^*)/\partial\gamma^*$, respectively. Consequently, a conjecture for the nuisance tangent space is
\[ \Lambda = J_\sigma\oplus\Lambda_1\oplus\Lambda_2, \qquad (5.32) \]
where $J_\sigma$ denotes the linear span of $S_\sigma$, $\Lambda_1$ is defined in (5.30) and
\[ \Lambda_2 = \big\{a(X,V) : E\{a(X,V)\} = 0\big\}. \]
It is easy to check that these are three mutually orthogonal spaces. Using Proposition 2.12, we see that we can obtain the efficient score by projecting $S_\beta$ onto $J_\sigma$, $\Lambda_1$ and $\Lambda_2$ separately and then adding these projections. However, using the fact that $\varepsilon\perp\!\!\!\perp(X,V)$, an easy calculation shows that

\[ E\{S_\beta^T a(X,V)\} = \frac{1}{\sigma_0^2}E\{\varepsilon X^T a(X,V)\} = \frac{1}{\sigma_0^2}E(\varepsilon)E\{X^T a(X,V)\} = 0, \]

which means that $S_\beta\perp\Lambda_2$. In addition, it is also easy to show that $S_\beta\perp J_\sigma$. Indeed, using the fact that $\varepsilon\perp\!\!\!\perp(X,V)$ and the normality of $\varepsilon$,

\[ E(S_\beta S_\sigma) = E\left\{\frac{X\varepsilon}{\sigma_0^2}\,\frac{\varepsilon^2-\sigma_0^2}{\sigma_0^3}\right\} = \frac{1}{\sigma_0^5}E(X\varepsilon^3) - \frac{1}{\sigma_0^3}E(X\varepsilon) = \frac{1}{\sigma_0^5}E(X)E(\varepsilon^3) - \frac{1}{\sigma_0^3}E(X)E(\varepsilon) = 0. \]

Henceforth, we see that Π(Sβ|Λ) = Π(Sβ|Λ1) and the efficient score Seff given by (5.31) remains the same if σ0 and the density of X and V are unknown. In addition, the necessary conditions for adaptive estimation are fulfilled. Consequently, if a semiparametric RAL estimator for β exists, β can be estimated with the same level of efficiency whether the parameter σ0 and the density of X and V are known or not, i.e., the asymptotic distribution of such a RAL estimator does not change.
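A minimal numerical sketch of how the efficient score (5.31) translates into an estimator is given below: subtract estimates of $E(\cdot|V)$ from both $X$ and $Y$ and regress one residual on the other. To keep the conditional expectations trivial to estimate, $V$ is taken binary so that $E(X|V)$ and $E(Y|V)$ are simply group means; all data-generating choices are illustrative assumptions.

```python
import numpy as np

# Sketch of estimating beta in Y = X^T beta0 + g0(V) + eps via the efficient
# score (5.31).  V is binary here purely for illustration.
rng = np.random.default_rng(4)
n = 5000
V = rng.integers(0, 2, size=n)
X = 0.5 * V + rng.normal(size=n)             # X correlated with V
g0 = np.where(V == 1, 2.0, -1.0)             # unknown function g0(V)
beta0 = 1.5
Y = beta0 * X + g0 + rng.normal(size=n)

# Estimate E(X|V) and E(Y|V) by group means, then "partial them out"
Xc = X - np.where(V == 1, X[V == 1].mean(), X[V == 0].mean())
Yc = Y - np.where(V == 1, Y[V == 1].mean(), Y[V == 0].mean())

# Solves sum_i {X_i - E(X|V_i)} * {Y_i - E(Y|V_i) - (X_i - E(X|V_i)) beta} = 0
beta_hat = np.sum(Xc * Yc) / np.sum(Xc ** 2)
print("beta_hat:", beta_hat, " truth:", beta0)
```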

2. Linear Location-Shift Regression Model

We now discuss another type of semiparametric model and its relation with adaptive estimation, the so-called location-shift regression models. These models involve unknown disturbance distributions. Our focus will be on the linear location-shift model to keep things as simple as possible. In this model, we assume we have a one-dimensional and continuous response variable $Y$ and a vector of covariates $X$, satisfying the relationship

\[ Y = X^T\beta_0 + \varepsilon, \qquad (5.33) \]
where $\varepsilon$ denotes the disturbance term. We assume $\varepsilon\perp\!\!\!\perp X$ and the density function of $\varepsilon$ is given by $p_0(\varepsilon)$, which is unknown. Note we do not allow an intercept in (5.33); any constant is absorbed in $\varepsilon$. In addition, denote the true density of $X$ by $p_0(x)$, which is also unknown. An arbitrary density function in this model is given by

\[ p_{Y,X}(y,x;\beta) = p_\varepsilon(y-x^T\beta)\,p_X(x). \]

Note that this factoring of the density function already reveals a decomposition of the nuisance tangent space. A parametric submodel is given by a set of density functions

\[ p(y,x;\gamma_1,\gamma_2) = p(y-x^T\beta;\gamma_1)\,p(x;\gamma_2), \]
where $\gamma_1$ and $\gamma_2$ are some finite-dimensional parameters which we assume to be variationally independent, and we write $\gamma = (\gamma_1^T,\gamma_2^T)^T$. The truth is denoted by $\gamma_0 = (\gamma_{10}^T,\gamma_{20}^T)^T$. Writing down the log likelihood, taking the partial derivatives with respect to $\beta$, $\gamma_1$ and $\gamma_2$ and evaluating at the truth, it is easy to show that the scores are given by $S_\beta = -Xs_1(\varepsilon)$, $S_{\gamma_1} = s_2(\varepsilon)$ and

$S_{\gamma_2} = s_3(X)$, where $s_1(\varepsilon) = \partial\log p(\varepsilon;\gamma_{10})/\partial\varepsilon = \partial\log p_0(\varepsilon)/\partial\varepsilon$, $s_2(\varepsilon) = \partial\log p(\varepsilon;\gamma_{10})/\partial\gamma_1$ and $s_3(x) = \partial\log p(x;\gamma_{20})/\partial\gamma_2$. From this it is clear that a conjecture for the nuisance tangent space is given by
\[ \Lambda = \Lambda_1\oplus\Lambda_2, \qquad (5.34) \]
where $\Lambda_1 = \{\alpha_\varepsilon(\varepsilon) : E\{\alpha_\varepsilon(\varepsilon)\} = 0\}$ and $\Lambda_2 = \{\alpha_X(X) : E\{\alpha_X(X)\} = 0\}$. It is easy to show that $\Lambda_1\perp\Lambda_2$. To check the conjecture (5.34), just use similar arguments as for the restricted moment

model and use submodels $p(\varepsilon;\gamma_1) = p_0(\varepsilon)\{1+\gamma_1^T\alpha_\varepsilon(\varepsilon)\}$ and $p(x;\gamma_2) = p_0(x)\{1+\gamma_2^T\alpha_X(x)\}$. Now that we have convinced ourselves that the nuisance tangent space is given by (5.34), we can search for the efficient score $S_{\mathrm{eff}} = S_\beta - \Pi(S_\beta|\Lambda)$. First we prove an easy lemma, see [35], p. 107.

Lemma 5.4. If the random variable $\varepsilon$ is continuous with support on the real line, then

\[ E\{s_1(\varepsilon)\} = 0, \quad\text{where}\quad s_1(\varepsilon) = \partial\log p_0(\varepsilon)/\partial\varepsilon. \]

Proof. Since $p_0(\varepsilon-\xi)$ defines a proper density for all constants $\xi\in\mathbb{R}$, we have that $\int p_0(\varepsilon-\xi)\,d\varepsilon = 1$. Differentiating with respect to $\xi$ yields $\frac{\partial}{\partial\xi}\int p_0(\varepsilon-\xi)\,d\varepsilon = 0$. Assuming we can interchange integration and differentiation, we obtain $-\int p_0'(\varepsilon-\xi)\,d\varepsilon = 0$, where $p_0'(\varepsilon) = dp_0(\varepsilon)/d\varepsilon$. In the usual way, we obtain
\[ -\int s_1(\varepsilon)p_0(\varepsilon)\,d\varepsilon = 0, \quad\text{or}\quad E\{s_1(\varepsilon)\} = 0. \]

Using Proposition 2.12, this can be equivalently written as $S_{\mathrm{eff}} = S_\beta - \Pi(S_\beta|\Lambda_1) - \Pi(S_\beta|\Lambda_2)$. However, it turns out that $S_\beta\perp\Lambda_2$ and thus $\Pi(S_\beta|\Lambda_2) = 0$. Indeed, since $\varepsilon\perp\!\!\!\perp X$, we deduce that

\[ E\{S_\beta^T\alpha_X(X)\} = -E\{X^T\alpha_X(X)s_1(\varepsilon)\} = -E\{X^T\alpha_X(X)\}E\{s_1(\varepsilon)\} = 0, \]
because $E\{s_1(\varepsilon)\} = 0$. Henceforth, we only need to compute $\Pi(S_\beta|\Lambda_1)$. For this purpose, we use Lemma 5.3, (i). Just put $U = S_\beta$ and $V = \varepsilon$. From this we see that $\Pi(S_\beta|\Lambda_1) = E(S_\beta|\varepsilon)$. The efficient score is thus given by

\[ S_{\mathrm{eff}} = S_\beta - E(S_\beta|\varepsilon) = -\{X - E(X|\varepsilon)\}s_1(\varepsilon) = -\{X - E(X)\}s_1(\varepsilon). \qquad (5.35) \]

We can give an interesting interpretation to the direction $\Pi(S_\beta|\Lambda_1)$ we subtract from the score $S_\beta$. It is the direction in $\Lambda$ of the score for location in a more restrictive model where the density of $\varepsilon$ is completely known up to a location parameter $\alpha$. More specifically, suppose $p_0(\varepsilon) = h(\varepsilon-\alpha_0)$, where $h(\cdot)$ is known and $\alpha_0$ is unknown. The score for $\beta$ remains the same, but now we have an additional score for $\alpha$, i.e., $S_\alpha = -s_1(\varepsilon)$. The partial score for $\beta$ is

\[ S_\beta - E(S_\beta S_\alpha)\{E(S_\alpha^2)\}^{-1}S_\alpha = -\left[X - \frac{E\{Xs_1^2(\varepsilon)\}}{E\{s_1^2(\varepsilon)\}}\right]s_1(\varepsilon) = -\{X - E(X)\}s_1(\varepsilon), \]
where we used that $\varepsilon\perp\!\!\!\perp X$. Thus, the partial score for $\beta$ equals the efficient score $S_{\mathrm{eff}}$ and is orthogonal to the nuisance tangent space. Henceforth, the necessary conditions for adaptive estimation are satisfied. This means that the parameter $\beta$ can be estimated equally well whether the distribution of $\varepsilon$ is completely unknown or known up to a location parameter $\alpha$. The parametric submodel where the density function of $\varepsilon$ is known up to a location parameter is what we refer to as the least favourable parametric submodel. In addition, as we noted already, since $S_\beta$ is automatically orthogonal to $\Lambda_2$, the parameter $\beta$ can also be estimated equally well whether or not the distribution of $X$ is known.
In the discussion above, we assumed the relation between $Y$ and $X$ was linear. However, this need not be the case. A more general discussion of this type of model, where we assume that

$Y = \mu(X,\beta)+\varepsilon$, is given in Tsiatis (2006), [35], §5.1. In this case, $\mu(X,\beta)$ can be any known function described through a finite-dimensional parameter $\beta$. It turns out that the structure of the nuisance tangent space $\Lambda$ will be completely the same. However, the score $S_\beta$ will be more complicated in general, and therefore the simple arguments we used here will fail. Therefore, in [35], the orthogonal complement $\Lambda^\perp$ of the nuisance tangent space $\Lambda$ is constructed and afterwards the residual $S_\beta - \Pi(S_\beta|\Lambda)$ is sought within the space $\Lambda^\perp$. We do not give this general derivation so as not to overload the reader. Nevertheless, this general derivation is very interesting and is very well described in Tsiatis (2006), [35], §5.1. In this general case, the efficient score is given by

\[ S_{\mathrm{eff}} = -[D^T(X,\beta_0) - E\{D^T(X,\beta_0)\}]\,s_1(\varepsilon), \]

where $D(x,\beta_0) = \partial\mu(x,\beta_0)/\partial\beta^T$. Note this clearly coincides with the linear case as discussed here, since there $D(x,\beta_0) = x^T$. It is interesting to note that any location-shift regression model can be seen as a restricted moment model, but with the additional restriction that the disturbance is independent of the covariates, i.e., $\varepsilon\perp\!\!\!\perp X$. Intuitively, this is clear. In [35], this is explicitly shown for the linear model.
This has an interesting implication. Since the location-shift regression model is more restrictive than the restricted moment model, we may expect the class of semiparametric RAL estimators for $\beta$ in the location-shift regression model to be larger than the class of semiparametric RAL estimators for $\beta$ in the restricted moment model. Henceforth, we may expect the semiparametric efficiency bound for $\beta$ in the location-shift regression model to be smaller than the semiparametric efficiency bound in the restricted moment model.
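To see what the efficient score equation derived from (5.35) looks like in practice, the sketch below solves it for simulated data under the purely illustrative assumption that the error density is standard logistic, so that $s_1$ is available in closed form; in the semiparametric model $s_1$ would itself have to be estimated.

```python
import numpy as np
from scipy.optimize import root_scalar

# Efficient score equation for the linear location-shift model Y = X beta0 + eps,
# *assuming* (for illustration only) standard logistic errors, for which
# s1(e) = d/de log p0(e) = 1 - 2*expit(e).

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def s1(e):
    return 1.0 - 2.0 * expit(e)

rng = np.random.default_rng(5)
n = 2000
X = rng.normal(size=n)
beta0 = 0.7
Y = beta0 * X + rng.logistic(size=n)

def eff_score_eq(beta):
    # sum_i {X_i - mean(X)} s1(Y_i - X_i beta) = 0  (the overall sign in (5.35) is irrelevant)
    return np.sum((X - X.mean()) * s1(Y - X * beta))

beta_hat = root_scalar(eff_score_eq, bracket=[-5.0, 5.0]).root
print("beta_hat:", beta_hat, " truth:", beta0)
```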

5.3 Estimating Treatment Difference Between Two Treatments

We now consider the problem of estimating the mean treatment difference between two treatments in a randomized pretest-posttest study or, more generally, in a randomized study with covariate adjustment. This problem will be represented through an infinite-dimensional parameter $\theta$. The parameter of interest, the mean treatment difference, will be given by $\beta(\theta)$, a function of $\theta$. Henceforth, we will use the methodology outlined in §4.4.2.

5.3.1 Model Description and Preliminary Estimator

Let us first explain what is meant by a randomized pretest-posttest study. In the beginning of the study, a random sample of subjects is chosen from some population of interest. For each subject, a pretest measurement, denoted by $Y_1$, is made. Next, the subjects are randomized to one of two treatments. We denote this by a treatment indicator $A\in\{1,0\}$. This is done with probabilities $\delta$ and $1-\delta$, i.e., $P(A=1)=\delta$ and $P(A=0)=1-\delta$. The randomization probability $\delta$ is chosen by the investigator and thus known. After some prespecified time period, a posttest measurement $Y_2$ is made, which we assume to be one-dimensional. The goal of the study is to estimate the effect of the treatment intervention on the posttest measurement. An example of such a study is given in [35], p. 127.
Although our focus will be on the pretest-posttest study, the results we will obtain will be directly applicable to the case of covariate adjustment. In that case, $Y_2$ will be the response variable and $A$ remains the treatment indicator. $Y_1$ then represents a vector of baseline covariates that are collected on all individuals prior to randomization. Henceforth, the exact same strategy can be used as for the pretest-posttest study, with the only difference being the interpretation of $Y_1$. Let us be a bit more formal. From the discussion above, it is clear we are interested in the estimation of the treatment effect, defined as

\[ \beta = E(Y_2|A=1) - E(Y_2|A=0), \qquad (5.36) \]
which is one-dimensional since $Y_2$ is one-dimensional. The data from such a study are then represented as the vectors $Z_i = (Y_{1i},A_i,Y_{2i})$, $i=1,\ldots,n$. A naive estimator for this treatment difference is simply the difference in the treatment-specific sample averages of posttest response,
\[ \hat\beta_n^{\mathrm{naive}} = \frac{\sum_{i=1}^n A_iY_{2i}}{\sum_{i=1}^n A_i} - \frac{\sum_{i=1}^n(1-A_i)Y_{2i}}{\sum_{i=1}^n(1-A_i)}. \qquad (5.37) \]
Other examples are also given in [35], §5.4, e.g., a least squares estimator in an analysis-of-covariance model or, alternatively, using a simple restricted moment model for which we already know how to derive the efficient generalized estimating equations (GEE). For a study of the relative efficiency of these estimators, we refer to Yang and Tsiatis (2001), [44].
Although comparing these ad hoc semiparametric estimators for $\beta$ gives some interesting results, it is still not clear how we can use the pretest measurement $Y_1$ (or the additional baseline covariates) to extract more information from our data and thus how to gain efficiency by considering the variable $Y_1$. Using semiparametric theory, we will obtain the desired answer in a very elegant way. In addition, we will study under which minimal condition we can gain efficiency by considering baseline covariates, so-called auxiliary variables.
Upon using the semiparametric theory outlined in §4.4.2, we need to derive the set of influence functions of RAL estimators for $\beta$. Henceforth, we need to consider the linear variety $\varphi(Z)+\mathcal{J}^\perp$, to be defined shortly. However, as we noticed, we need to dispose of at least one influence function of a RAL estimator for $\beta$. This is easy to obtain by considering the estimator $\hat\beta_n^{\mathrm{naive}}$ defined in (5.37). We denote the true parameter by $\beta_0 = E(Y_2|A=1) - E(Y_2|A=0) = \mu_2^{(1)} - \mu_2^{(0)}$. Subtracting $\beta_0$ from both sides of (5.37) and multiplying by $\sqrt n$, we obtain
\[ \sqrt n(\hat\beta_n^{\mathrm{naive}}-\beta_0) = \sqrt n\left(\frac{\sum_{i=1}^n A_iY_{2i}}{\sum_{i=1}^n A_i} - \mu_2^{(1)}\right) - \sqrt n\left(\frac{\sum_{i=1}^n(1-A_i)Y_{2i}}{\sum_{i=1}^n(1-A_i)} - \mu_2^{(0)}\right). \]
We can rewrite this as
\[ \sqrt n(\hat\beta_n^{\mathrm{naive}}-\beta_0) = \frac{1}{\sqrt n}\frac{\sum_{i=1}^n A_i(Y_{2i}-\mu_2^{(1)})}{\sum_{i=1}^n A_i/n} - \frac{1}{\sqrt n}\frac{\sum_{i=1}^n(1-A_i)(Y_{2i}-\mu_2^{(0)})}{\sum_{i=1}^n(1-A_i)/n}. \]

Using the WLLN, we deduce that $\sum_{i=1}^n A_i/n\xrightarrow{P}\delta$ and $\sum_{i=1}^n(1-A_i)/n\xrightarrow{P}1-\delta$. Hence, we can write that
\[ \sqrt n(\hat\beta_n^{\mathrm{naive}}-\beta_0) = \frac{1}{\sqrt n}\sum_{i=1}^n\left\{\frac{A_i}{\delta}(Y_{2i}-\mu_2^{(1)}) - \frac{1-A_i}{1-\delta}(Y_{2i}-\mu_2^{(0)})\right\} + o_P(1). \qquad (5.38) \]

This shows that $\hat\beta_n^{\mathrm{naive}}$ is an ALE with influence function
\[ \varphi(Z_i) = \frac{A_i}{\delta}(Y_{2i}-\mu_2^{(1)}) - \frac{1-A_i}{1-\delta}(Y_{2i}-\mu_2^{(0)}). \qquad (5.39) \]

We also need to show that this ALE is a RAL estimator. The derivation of this regularity we postpone until the end of this section because of its technical nature; the less interested reader can therefore skip that derivation. Luckily, the estimator $\hat\beta_n^{\mathrm{naive}}$ turns out to be regular and hence it is a RAL estimator. Now that we have identified one influence function of a RAL estimator for $\beta$, we can identify the linear variety of influence functions by deriving the tangent space $\mathcal{J}$ and its orthogonal complement $\mathcal{J}^\perp$.

5.3.2 The Tangent Space and Its Orthogonal Complement

The Tangent Space J

To construct the tangent space $\mathcal{J}$, we need to be more specific about the shape of the joint density function of $Z = (Y_1,A,Y_2)$. By the design of the study, the only restriction on the joint density function is the one induced by the randomization. That is, the pretest measurement $Y_1$ (or equivalently the additional baseline covariates) is (are) independent of the treatment indicator $A$. The distribution of $A$ is given by $P(A=1)=\delta$ and $P(A=0)=1-\delta$, where $\delta$ denotes the randomization probability, known by design. We want to use the results from §4.5.2. Therefore, we factor the joint density as

\[ p_{Y_1,A,Y_2}(y_1,a,y_2) = p_{Y_1}(y_1)\,p_{A|Y_1}(a|y_1)\,p_{Y_2|Y_1,A}(y_2|y_1,a), \qquad (5.40) \]

with respect to some dominating measure $\nu_{Y_i}$ for $Y_i$, $i=1,2$. If we do not impose any restrictions on the distribution of $Z = (Y_1,A,Y_2)$, as if we were considering a nonparametric model, we know that this results in a partition of the Hilbert space $\mathcal{H}_1 = J_1\oplus J_2\oplus J_3$, where
\[ J_1 = \big\{\alpha_1(Y_1) : E\{\alpha_1(Y_1)\} = 0\big\}, \quad J_2 = \big\{\alpha_2(Y_1,A) : E\{\alpha_2(Y_1,A)|Y_1\} = 0\big\}, \quad J_3 = \big\{\alpha_3(Y_1,A,Y_2) : E\{\alpha_3(Y_1,A,Y_2)|Y_1,A\} = 0\big\}, \]
and the $J_i$, $i=1,2,3$, are mutually orthogonal spaces. Since we do impose restrictions on the observed data distribution, the tangent space $\mathcal{J}$ for this model will be smaller. Indeed, by randomization, $Y_1\perp\!\!\!\perp A$ and the distribution of $A$ is completely known by design,

\[ p_{A|Y_1}(a|y_1) = P_A(a) = \delta^a(1-\delta)^{(1-a)}. \]

Henceforth, (5.40) becomes

\[ p_{Y_1,A,Y_2}(y_1,a,y_2) = P_A(a)\,p_{Y_1}(y_1)\,p_{Y_2|Y_1,A}(y_2|y_1,a). \qquad (5.41) \]

The density $p_{Y_1}(y_1)$ of $Y_1$ and the conditional density $p_{Y_2|Y_1,A}(y_2|y_1,a)$ of $Y_2$ given $A$ and $Y_1$ are left completely unspecified. Consequently, it is clear that $\mathcal{J} = J_1\oplus J_3$.

Orthogonal Complement and Linear Variety of Influence Functions

Using the results obtained so far, it is trivial to see that $\mathcal{J}^\perp = J_2$. From (4.20) it follows that any element of $\mathcal{J}^\perp$ can be written as $h^*(Y_1,A) - E\{h^*(Y_1,A)|Y_1\}$ for an arbitrary square integrable function $h^*$ of $Y_1$ and $A$. Using the fact that $A$ is dichotomous, the function $h^*$ can be written

as $h^*(Y_1,A) = Ah_1(Y_1) + h_2(Y_1)$, where $h_1$ and $h_2$ are arbitrary square integrable functions of $Y_1$. From this, we deduce that $\mathcal{J}^\perp$ is the space of all functions given by
\[ h^*(Y_1,A) - E\{h^*(Y_1,A)|Y_1\} = Ah_1(Y_1) + h_2(Y_1) - \{E(A|Y_1)h_1(Y_1)+h_2(Y_1)\} = (A-\delta)h_1(Y_1), \]
where we used the fact that $A\perp\!\!\!\perp Y_1$ and $E(A) = P(A=1) = \delta$. This shows that $\mathcal{J}^\perp = \{(A-\delta)h^*(Y_1) : h^*\ \text{an arbitrary square integrable function of}\ Y_1\}$. Using this and (5.39), we obtain that
\[ \varphi(Z)+\mathcal{J}^\perp = \left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)}) - \frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)}) + (A-\delta)h^*(Y_1) : h^*(Y_1)\in L_2(Y_1)\right\}, \qquad (5.42) \]
where $L_2(Y_1)$ denotes the set of all square integrable functions of $Y_1$. Let us denote an arbitrary influence function, characterized by a particular function $h^*$, from our linear variety $\varphi(Z)+\mathcal{J}^\perp$ by $\varphi_{h^*}$, i.e.,
\[ \varphi_{h^*}(Z) = \frac{A}{\delta}(Y_2-\mu_2^{(1)}) - \frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)}) + (A-\delta)h^*(Y_1). \]
It is not difficult to show that the estimator
\[ \hat\beta_n^* = \frac{\sum_{i=1}^n A_iY_{2i}}{\sum_{i=1}^n A_i} - \frac{\sum_{i=1}^n(1-A_i)Y_{2i}}{\sum_{i=1}^n(1-A_i)} + \frac{1}{n}\sum_{i=1}^n(A_i-\bar A)h^*(Y_{1i}) = \hat\beta_n^{\mathrm{naive}} + \frac{1}{n}\sum_{i=1}^n(A_i-\bar A)h^*(Y_{1i}), \quad \bar A = \frac{1}{n}\sum_{i=1}^n A_i, \]
is an asymptotically linear estimator (ALE) with influence function $\varphi_{h^*}$. Indeed,
\[ \sqrt n(\hat\beta_n^*-\beta_0) = \sqrt n(\hat\beta_n^{\mathrm{naive}}-\beta_0) + \frac{1}{\sqrt n}\sum_{i=1}^n(A_i-\bar A)h^*(Y_{1i}). \]
Because $\sum_{i=1}^n A_i/n\xrightarrow{P}\delta$, we have
\[ \sqrt n(\hat\beta_n^*-\beta_0) = \sqrt n(\hat\beta_n^{\mathrm{naive}}-\beta_0) + \frac{1}{\sqrt n}\sum_{i=1}^n(A_i-\delta)h^*(Y_{1i}) + o_P(1). \]
From the latter equation, it is clear that $\hat\beta_n^*$ is an ALE with influence function $\varphi_{h^*}$. It can be shown that $\hat\beta_n^*$ is also regular, which we will show in §5.3.5 (1). The asymptotic variance of the estimator will vary according to the choice of the function $h^*$. It thus remains to find the function $h^*$ for which the asymptotic variance is minimal.
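The construction above can be illustrated numerically. The sketch below simulates a pretest-posttest study and computes the naive estimator (5.37) together with one augmented estimator from the linear variety (5.42); the data-generating mechanism and the particular $h^*$ are illustrative assumptions (here $h^*$ is derived from the true linear conditional means, so the augmented estimator should be close to efficient in this toy setting).

```python
import numpy as np

# Naive estimator (5.37) versus an augmented estimator beta* = beta_naive + mean{(A - Abar) h*(Y1)}.
# All data-generating choices and the choice of h* are illustrative assumptions.
rng = np.random.default_rng(6)
delta = 0.5          # randomization probability, known by design
beta0 = 1.0          # true treatment difference
n, nrep = 400, 2000

naive, augmented = [], []
for _ in range(nrep):
    Y1 = rng.normal(size=n)                              # pretest measurement
    A = rng.binomial(1, delta, size=n)                   # treatment indicator
    Y2 = Y1 + beta0 * A + rng.normal(scale=0.5, size=n)  # posttest measurement

    b_naive = Y2[A == 1].mean() - Y2[A == 0].mean()
    # h*(Y1) = -(1/delta + 1/(1-delta)) * (Y1 - mean(Y1)): the choice implied by the
    # true conditional means E(Y2|A=j, Y1), which are linear in Y1 with slope 1 here
    h_star = -(1 / delta + 1 / (1 - delta)) * (Y1 - Y1.mean())
    b_aug = b_naive + np.mean((A - A.mean()) * h_star)

    naive.append(b_naive)
    augmented.append(b_aug)

print("empirical sd, naive    :", np.std(naive))
print("empirical sd, augmented:", np.std(augmented))
```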

5.3.3 The Efficient Influence Function

From (4.10), we know that the efficient influence function is given by $\varphi(Z) - \Pi\{\varphi(Z)|\mathcal{J}^\perp\}$. Since we dispose of the influence function of a particular RAL estimator, we can use this formula. Without the a priori knowledge of the influence function of such an estimator, our derivation would fail. Fortunately, we have one available, and we can write that
\[ \varphi_{\mathrm{eff}}(Z) = \frac{A}{\delta}(Y_2-\mu_2^{(1)}) - \Pi\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,\mathcal{J}^\perp\right\} - \left[\frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)}) - \Pi\left\{\frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)})\,\Big|\,\mathcal{J}^\perp\right\}\right]. \]

To find the efficient influence function, we thus need to find these orthogonal projections onto $\mathcal{J}^\perp$. Since $\mathcal{J}^\perp = J_2$, using (4.21b), we see that these projections are easy to obtain. For any function $\alpha(Y_1,A,Y_2)$, we have that

\[ \Pi\{\alpha(Y_1,A,Y_2)|\mathcal{J}^\perp\} = \Pi\{\alpha(Y_1,A,Y_2)|J_2\} = E\{\alpha(Y_1,A,Y_2)|Y_1,A\} - E\{\alpha(Y_1,A,Y_2)|Y_1\}. \]

Let us begin by calculating the first projection,
\[ \Pi\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,\mathcal{J}^\perp\right\} = E\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,Y_1,A\right\} - E\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,Y_1\right\} \]
\[ = \frac{A}{\delta}\big\{E(Y_2|Y_1,A=1)-\mu_2^{(1)}\big\} - E\left[E\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,Y_1,A\right\}\Big|\,Y_1\right] = \frac{A-\delta}{\delta}\big\{E(Y_2|Y_1,A=1)-\mu_2^{(1)}\big\}, \]
where we used the law of iterated expectations in the second line of our calculation. Similarly, we deduce that
\[ \Pi\left\{\frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)})\,\Big|\,\mathcal{J}^\perp\right\} = \frac{1-A}{1-\delta}\big\{E(Y_2|Y_1,A=0)-\mu_2^{(0)}\big\} - E\left[E\left\{\frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)})\,\Big|\,Y_1,A\right\}\Big|\,Y_1\right] = -\frac{A-\delta}{1-\delta}\big\{E(Y_2|Y_1,A=0)-\mu_2^{(0)}\big\}. \]
Substituting these results in our formula for the efficient influence function, we obtain that
\[ \varphi_{\mathrm{eff}}(Z) = \frac{A}{\delta}Y_2 - \frac{A-\delta}{\delta}E(Y_2|A=1,Y_1) - \left\{\frac{1-A}{1-\delta}Y_2 + \frac{A-\delta}{1-\delta}E(Y_2|A=0,Y_1)\right\} - \beta_0. \qquad (5.43) \]
The efficient RAL estimator can then be obtained by evaluating the efficient influence function at our data, summing up, equating to zero and solving for $\beta$. This would lead to the estimator
\[ \hat\beta_n^{\mathrm{eff}} = \frac{1}{n}\sum_{i=1}^n\left[\frac{A_i}{\delta}Y_{2i} - \frac{A_i-\delta}{\delta}E(Y_2|A=1,Y_{1i}) - \left\{\frac{1-A_i}{1-\delta}Y_{2i} + \frac{A_i-\delta}{1-\delta}E(Y_2|A=0,Y_{1i})\right\}\right]. \]

However, to construct this efficient RAL estimator for $\beta$, we would need to know $E(Y_2|A=1,Y_1)$ and $E(Y_2|A=0,Y_1)$, which, of course, we do not. This problem is comparable to that described in the second note of §5.1.5. Following the same strategy, we need to posit models for $E(Y_2|A=1,Y_1)$ and $E(Y_2|A=0,Y_1)$ in terms of some finite-dimensional parameters $\xi_1$ and $\xi_0$, i.e.,
\[ E(Y_2|A=j,Y_1) = \zeta_j(Y_1,\xi_j), \quad j=0,1. \]
These are two restricted moment models for the subsets of individuals in treatment groups $A=1$ and $A=0$, respectively. For these models we then use the efficient generalized estimating equations (GEE) to obtain estimators $\hat\xi_{1n}$ and $\hat\xi_{0n}$ for $\xi_1$ and $\xi_0$, using patients $\{i:A_i=1\}$ and $\{i:A_i=0\}$, respectively. These estimators are consistent for $\xi_1$ and $\xi_0$, respectively, if the models are correctly specified. It turns out that, even if these models are incorrectly specified, under suitable regularity conditions the estimators $\hat\xi_{jn}$ converge to some probability limits $\xi_j^*$ for $j=0,1$. Hence, we propose the estimator

\[ \hat\beta_n = \frac{1}{n}\sum_{i=1}^n\left[\frac{A_i}{\delta}Y_{2i} - \frac{A_i-\delta}{\delta}\zeta_1(Y_{1i},\hat\xi_{1n}) - \left\{\frac{1-A_i}{1-\delta}Y_{2i} + \frac{A_i-\delta}{1-\delta}\zeta_0(Y_{1i},\hat\xi_{0n})\right\}\right]. \qquad (5.44) \]
This is an ALE with influence function

\[ \varphi_{\hat\beta_n}(Z) = \frac{A}{\delta}(Y_2-\mu_2^{(1)}) - \frac{A-\delta}{\delta}\{\zeta_1(Y_1,\xi_1^*)-\mu_2^{(1)}\} - \frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)}) - \frac{A-\delta}{1-\delta}\{\zeta_0(Y_1,\xi_0^*)-\mu_2^{(0)}\}. \qquad (5.45) \]

Because of its technical nature, this derivation is not given in [35], where it is left as an exercise. However, it can be interesting to go through an argument leading to this result. Since this involves some technical difficulties, we postpone this derivation until the end of this section; the less interested reader can skip it. The question remains whether the ALE $\hat\beta_n$ is also regular. The answer turns out to be positive, since $\varphi_{\hat\beta_n}(Z)\in\varphi(Z)+\mathcal{J}^\perp$. Indeed, just put
\[ h^*(Y_1) = -\frac{1}{\delta}\{\zeta_1(Y_1,\xi_1^*)-\mu_2^{(1)}\} - \frac{1}{1-\delta}\{\zeta_0(Y_1,\xi_0^*)-\mu_2^{(0)}\}. \]
From this we see that, whether or not the posited models for the conditional expectations $E(Y_2|A=1,Y_1)$ and $E(Y_2|A=0,Y_1)$ are correct, the resulting estimator has its influence function in the class of influence functions of RAL estimators. Henceforth, the considered estimator $\hat\beta_n$ is regular. We may conclude that the estimator $\hat\beta_n$ is a consistent, asymptotically normal semiparametric RAL estimator for $\beta$, even if the posited models for the treatment-specific conditional expectations of $Y_2$ given $Y_1$ are incorrect, and this estimator is semiparametric efficient if the posited models are correct. Henceforth, we obtain a locally efficient estimator. In Leon et al. (2003), [21], it is discussed that a reasonable attempt at modelling these conditional expectations will lead to estimators with very high efficiency, even if the model is not exactly correct.
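A numerical sketch of the locally efficient estimator (5.44) is given below, positing, as an illustrative assumption, linear working models $\zeta_j(Y_1,\xi_j) = \xi_{j0}+\xi_{j1}Y_1$ fitted by ordinary least squares within each treatment arm.

```python
import numpy as np

# Locally efficient estimator (5.44) with arm-specific linear working models for
# E(Y2 | A = j, Y1).  Data generation and the working models are illustrative assumptions.
rng = np.random.default_rng(7)
n, delta, beta0 = 1000, 0.5, 1.0
Y1 = rng.normal(size=n)
A = rng.binomial(1, delta, size=n)
Y2 = Y1 + beta0 * A + rng.normal(scale=0.5, size=n)

def fit_arm(j):
    # OLS of Y2 on (1, Y1) within arm A = j; returns fitted values zeta_j(Y1) for all subjects
    Xj = np.column_stack([np.ones((A == j).sum()), Y1[A == j]])
    xi = np.linalg.lstsq(Xj, Y2[A == j], rcond=None)[0]
    return xi[0] + xi[1] * Y1

zeta1, zeta0 = fit_arm(1), fit_arm(0)

# Plug the fitted working models into (5.44)
beta_hat = np.mean(
    A / delta * Y2 - (A - delta) / delta * zeta1
    - (1 - A) / (1 - delta) * Y2 - (A - delta) / (1 - delta) * zeta0
)
print("locally efficient estimate:", beta_hat, " truth:", beta0)
```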

5.3.4 Auxiliary Variables

Using semiparametric theory, we investigated how we can use the pretest measurement or, more generally, the baseline covariates, to obtain a more efficient estimator for the parameter of interest $\beta = E(Y_2|A=1)-E(Y_2|A=0)$ compared to the case where we do not have such additional measurements. For that reason, we call the measurements that make up the variable $Y_1$ auxiliary variables, as they are not needed to define the parameter of interest.
We now wish to investigate under which minimal condition we can gain efficiency by considering additional baseline measurements. Thus, to focus our attention, suppose we want to estimate some $q$-dimensional parameter $\beta(\theta)$ in a model $p_Z(z;\theta)$, where $\theta$ is an infinite-dimensional parameter describing the model and $Z$ represents the vector of collected data for one individual. Next, suppose we also collected additional auxiliary variables $W$. Surprisingly, it turns out that if we do not impose additional restrictions on the joint distribution of $(W,Z)$ other than those from the marginal distribution of $Z$, then the class of RAL estimators is the same as if we never considered $W$. We briefly discuss how we can obtain this result. This will be quite easy using the semiparametric machinery we developed so far.
We introduce some supplementary notation. The Hilbert space of $q$-dimensional mean-zero square integrable functions of $Z$ we denote by $\mathcal{H}_q^Z$. Similarly, the Hilbert space of $q$-dimensional mean-zero square integrable functions of $W$ and $Z$ we denote by $\mathcal{H}_q^{WZ}$. If we consider estimators for $\beta$ solely based on the data from $Z$, the class of influence functions for $\beta$ is given by

\[ \varphi(Z) + \mathcal{J}^{Z\perp}, \]
where $\varphi(Z)$ is the influence function of any RAL estimator for $\beta$ and $\mathcal{J}^{Z\perp}$ is the orthogonal complement of the tangent space defined in $\mathcal{H}_q^Z$. However, if we also consider auxiliary variables $W$, then the space of influence functions of RAL estimators for $\beta$ is given by

\[ \varphi(Z) + \mathcal{J}^{WZ\perp}, \]

where $\mathcal{J}^{WZ\perp}$ is the orthogonal complement of the tangent space defined in $\mathcal{H}_q^{WZ}$. The key result now is that if no restrictions are put on the conditional density $p_{W|Z}(w|z)$, where the marginal density of $Z$ is assumed to be from the semiparametric model $p_Z(z;\theta)$, then the orthogonal complement of the tangent space $\mathcal{J}^{WZ}$ for the semiparametric model for the joint distribution of $(W,Z)$, i.e., $\mathcal{J}^{WZ\perp}$, is equal to the orthogonal complement of the tangent space $\mathcal{J}^Z$ for the semiparametric model of the marginal distribution of $Z$ alone, i.e., $\mathcal{J}^{Z\perp}$. Indeed, under the aforementioned assumptions, the model for $(W,Z)$ can be represented by the class of densities

\[ p_{W,Z}(w,z;\theta,\theta^*) = p_{W|Z}(w|z;\theta^*)\,p_Z(z;\theta), \]
where $p_{W|Z}(w|z;\theta^*)$ can be any arbitrary conditional density of $W$ given $Z$, and $\theta$ and $\theta^*$ are variationally independent. Using the results obtained in §4.5.2, we see that the part of the tangent space corresponding with the infinite-dimensional parameter $\theta^*$ is the space

\[ J_1 = \big\{a^q(W,Z) : E\{a^q(W,Z)|Z\} = 0\big\} \]

and the entire Hilbert space $\mathcal{H}_q^{WZ}$ can be written as

\[ \mathcal{H}_q^{WZ} = J_1\oplus\mathcal{H}_q^Z, \qquad J_1\perp\mathcal{H}_q^Z, \]

since $\mathcal{H}_q^Z = \{\alpha^q(Z) : E\{\alpha^q(Z)\} = 0\}$. Because $\theta$ and $\theta^*$ are variationally independent, it is easy to see that
\[ \mathcal{J}^{WZ} = J_1\oplus\mathcal{J}^Z, \qquad J_1\perp\mathcal{J}^Z. \]
Henceforth, from these two equations it is immediately clear that the space orthogonal to the tangent space $\mathcal{J}^{WZ}$ is the space orthogonal both to $J_1$ and to $\mathcal{J}^Z$. Thus it must be the space within $\mathcal{H}_q^Z$ that is orthogonal to $\mathcal{J}^Z$. We conclude that $\mathcal{J}^{WZ\perp} = \mathcal{J}^{Z\perp}$. This means that the space of influence functions of RAL estimators for $\beta$ is the same for both models. In particular, this means both models give rise to the same efficient influence function.
Conclusion. If we are not willing to make any additional assumptions regarding the relationship between the auxiliary variables $W$ and the variables $Z$, then we cannot gain efficiency by considering these auxiliary variables $W$, and we need not consider these variables when estimating $\beta$. In our example of the pretest-posttest study, it turned out we gained efficiency by considering the auxiliary variable $Y_1$. This corresponds with the result obtained here. Indeed, a relationship was induced between $Y_1$ and the treatment indicator $A$ due to randomization: $Y_1\perp\!\!\!\perp A$.

5.3.5 Some Technical Results

As promised, in this section we discuss some technical results. They are not based on any of the reference works in the bibliography. As we noted, the less interested reader can skip these results since the calculations are not essential to understand the derivations made earlier.

1. Regularity of $\hat\beta_n^{\mathrm{naive}}$

In §5.3.1, we showed that the naive estimator $\hat\beta_n^{\mathrm{naive}}$, given by
\[ \hat\beta_n^{\mathrm{naive}} = \frac{\sum_{i=1}^n A_iY_{2i}}{\sum_{i=1}^n A_i} - \frac{\sum_{i=1}^n(1-A_i)Y_{2i}}{\sum_{i=1}^n(1-A_i)}, \]
is an ALE with influence function
\[ \varphi(Z) = \frac{A}{\delta}(Y_2-\mu_2^{(1)}) - \frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)}). \]

We still need to show that this ALE is a regular estimator and thus a RAL estimator. For this purpose, we use Theorem 4.1, which gives a necessary and sufficient condition for an ALE to be regular. Thus, we need to show that (4.2) is fulfilled for any parametric submodel. From (5.41), we see that an arbitrary parametric submodel $\mathcal{P}_\gamma$ for $\gamma = (\gamma_1^T,\gamma_2^T)^T$ can be written as

\[ p_Z(z;\gamma_1,\gamma_2) = P_A(a)\,p(y_1;\gamma_1)\,p(y_2|y_1,a;\gamma_2), \]
where $\gamma_1$ and $\gamma_2$ are finite-dimensional variationally independent parameters and the truth is denoted by $\gamma_0 = (\gamma_{10}^T,\gamma_{20}^T)^T$. The scores are given by $S_{\gamma_1}(y_1) = \partial\log p(y_1;\gamma_{10})/\partial\gamma_1$ and

$S_{\gamma_2}(y_2,y_1,a) = \partial\log p(y_2|y_1,a;\gamma_{20})/\partial\gamma_2$. They satisfy the usual properties of scores, i.e.,

\[ E\{S_{\gamma_1}(Y_1)\} = 0 \quad\text{and}\quad E\{S_{\gamma_2}(Y_2,Y_1,A)|Y_1,A\} = 0. \]

We will show that
\[ E(\varphi S_\gamma^T) = [E(\varphi S_{\gamma_1}^T),\,E(\varphi S_{\gamma_2}^T)] = \left[\frac{\partial\beta(\gamma_0)}{\partial\gamma_1^T},\,\frac{\partial\beta(\gamma_0)}{\partial\gamma_2^T}\right] = \Gamma(\gamma_0). \]

Let us first calculate $E(\varphi S_{\gamma_1}^T)$ and $E(\varphi S_{\gamma_2}^T)$. Since $A$ is dichotomous and independent of $Y_1$, we can write

\[ E(\varphi S_{\gamma_1}^T) = E\{(Y_2-\mu_2^{(1)})S_{\gamma_1}^T(Y_1)|A=1\} - E\{(Y_2-\mu_2^{(0)})S_{\gamma_1}^T(Y_1)|A=0\} \]
\[ = E\{Y_2S_{\gamma_1}^T(Y_1)|A=1\} - E\{Y_2S_{\gamma_1}^T(Y_1)|A=0\} - \mu_2^{(1)}E\{S_{\gamma_1}^T(Y_1)|A=1\} + \mu_2^{(0)}E\{S_{\gamma_1}^T(Y_1)|A=0\} \]
\[ = E\{Y_2S_{\gamma_1}^T(Y_1)|A=1\} - E\{Y_2S_{\gamma_1}^T(Y_1)|A=0\} - \mu_2^{(1)}E\{S_{\gamma_1}^T(Y_1)\} + \mu_2^{(0)}E\{S_{\gamma_1}^T(Y_1)\} \]
\[ = E\{Y_2S_{\gamma_1}^T(Y_1)|A=1\} - E\{Y_2S_{\gamma_1}^T(Y_1)|A=0\}. \]

To do a similar calculation for $S_{\gamma_2}$, we need the additional argument that $E\{S_{\gamma_2}(Y_2,Y_1,A)|A=j\} = 0$ for $j=0,1$. For this purpose, we use the law of iterated expectations,

\[ E\{S_{\gamma_2}(Y_2,Y_1,A)|A=j\} = E[E\{S_{\gamma_2}(Y_2,Y_1,A)|Y_1,A=j\}|A=j] = 0, \]
for $j=0,1$. It then easily follows that

\[ E(\varphi S_{\gamma_2}^T) = E\{Y_2S_{\gamma_2}^T(Y_2,Y_1,A)|A=1\} - E\{Y_2S_{\gamma_2}^T(Y_2,Y_1,A)|A=0\}. \]

Next, we calculate $\partial\beta(\gamma_0)/\partial\gamma_1^T$ and $\partial\beta(\gamma_0)/\partial\gamma_2^T$. First we explicitly show how the parameter $\beta$ depends on the parameters $\gamma_1$ and $\gamma_2$, i.e., what the function $\beta(\gamma)$ looks like. We write

\[ \beta(\gamma) = E_\gamma(Y_2|A=1) - E_\gamma(Y_2|A=0) = \int y_2\,p(y_1;\gamma_1)\,p(y_2|y_1,1;\gamma_2)\,d\nu_{Y_1}\times\nu_{Y_2}(y_1,y_2) - \int y_2\,p(y_1;\gamma_1)\,p(y_2|y_1,0;\gamma_2)\,d\nu_{Y_1}\times\nu_{Y_2}(y_1,y_2). \]

We use the notation $E_\gamma$ to highlight that we are not taking expectations under the truth. Assuming we can interchange differentiation and integration and using the usual techniques for exercises of this type, we deduce that
\[ \frac{\partial\beta(\gamma_0)}{\partial\gamma_1^T} = \int y_2\frac{\partial p(y_1;\gamma_{10})}{\partial\gamma_1^T}p(y_2|y_1,1;\gamma_{20})\,d\nu_{Y_1}\times\nu_{Y_2}(y_1,y_2) - \int y_2\frac{\partial p(y_1;\gamma_{10})}{\partial\gamma_1^T}p(y_2|y_1,0;\gamma_{20})\,d\nu_{Y_1}\times\nu_{Y_2}(y_1,y_2) \]
\[ = \int y_2S_{\gamma_1}^T(y_1)p(y_1;\gamma_{10})p(y_2|y_1,1;\gamma_{20})\,d\nu_{Y_1}\times\nu_{Y_2}(y_1,y_2) - \int y_2S_{\gamma_1}^T(y_1)p(y_1;\gamma_{10})p(y_2|y_1,0;\gamma_{20})\,d\nu_{Y_1}\times\nu_{Y_2}(y_1,y_2) \]
\[ = E\{Y_2S_{\gamma_1}^T(Y_1)|A=1\} - E\{Y_2S_{\gamma_1}^T(Y_1)|A=0\}. \]
Similarly, we deduce that
\[ \frac{\partial\beta(\gamma_0)}{\partial\gamma_2^T} = E\{Y_2S_{\gamma_2}^T(Y_2,Y_1,A)|A=1\} - E\{Y_2S_{\gamma_2}^T(Y_2,Y_1,A)|A=0\}. \]
These results show that for any parametric submodel $\mathcal{P}_\gamma$, the ALE $\hat\beta_n^{\mathrm{naive}}$ with influence function $\varphi$ satisfies
\[ E(\varphi S_\gamma^T) = [E(\varphi S_{\gamma_1}^T),\,E(\varphi S_{\gamma_2}^T)] = \left[\frac{\partial\beta(\gamma_0)}{\partial\gamma_1^T},\,\frac{\partial\beta(\gamma_0)}{\partial\gamma_2^T}\right] = \Gamma(\gamma_0). \]
Theorem 4.1 then assures us that $\hat\beta_n^{\mathrm{naive}}$ is a RAL estimator. To end this discussion about regularity, recall that an arbitrary influence function is given by

\[ \varphi_{h^*}(Z) = \frac{A}{\delta}(Y_2-\mu_2^{(1)}) - \frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)}) + (A-\delta)h^*(Y_1) = \varphi(Z) + (A-\delta)h^*(Y_1). \]
The corresponding ALE is given by

\[ \hat\beta_n^* = \frac{\sum_{i=1}^n A_iY_{2i}}{\sum_{i=1}^n A_i} - \frac{\sum_{i=1}^n(1-A_i)Y_{2i}}{\sum_{i=1}^n(1-A_i)} + \frac{1}{n}\sum_{i=1}^n(A_i-\bar A)h^*(Y_{1i}) = \hat\beta_n^{\mathrm{naive}} + \frac{1}{n}\sum_{i=1}^n(A_i-\bar A)h^*(Y_{1i}), \quad \bar A = \frac{1}{n}\sum_{i=1}^n A_i. \]

To show that $\hat\beta_n^*$ is regular, we need to show that $E(\varphi_{h^*}S_\gamma^T) = [E(\varphi_{h^*}S_{\gamma_1}^T),\,E(\varphi_{h^*}S_{\gamma_2}^T)] = \Gamma(\gamma_0)$. Since $\varphi_{h^*}(Z) = \varphi(Z) + (A-\delta)h^*(Y_1)$, it is clear that the only thing left to show is that $E\{(A-\delta)h^*(Y_1)S_\gamma^T\} = [E\{(A-\delta)h^*(Y_1)S_{\gamma_1}^T\},\,E\{(A-\delta)h^*(Y_1)S_{\gamma_2}^T\}] = 0$. If we are able to do so, we know $\hat\beta_n^*$ is a RAL estimator. The first one is easy; using the fact that $Y_1\perp\!\!\!\perp A$, we obtain

\[ E\{(A-\delta)h^*(Y_1)S_{\gamma_1}^T(Y_1)\} = E(A-\delta)\,E\{h^*(Y_1)S_{\gamma_1}^T(Y_1)\} = 0. \]

Second, using the law of iterated expectations and the fact that $E\{S_{\gamma_2}(Y_2,Y_1,A)|Y_1,A\} = 0$, we obtain

\[ E\{(A-\delta)h^*(Y_1)S_{\gamma_2}^T(Y_2,Y_1,A)\} = E[E\{(A-\delta)h^*(Y_1)S_{\gamma_2}^T(Y_2,Y_1,A)|Y_1,A\}] = E[(A-\delta)h^*(Y_1)E\{S_{\gamma_2}^T(Y_2,Y_1,A)|Y_1,A\}] = 0. \]

The calculations we made above are valid for any parametric submodel. Hence, by Theorem 4.1 we know $\hat\beta_n^*$ is a RAL estimator.

2. Influence Function for $\hat\beta_n$

Recall that we considered the locally efficient estimator βˆn for the treatment difference β given by (5.44),

\[ \hat\beta_n = \frac{1}{n}\sum_{i=1}^n\left[\frac{A_i}{\delta}Y_{2i} - \frac{A_i-\delta}{\delta}\zeta_1(Y_{1i},\hat\xi_{1n}) - \left\{\frac{1-A_i}{1-\delta}Y_{2i} + \frac{A_i-\delta}{1-\delta}\zeta_0(Y_{1i},\hat\xi_{0n})\right\}\right]. \]

The truth was denoted by $\beta_0 = \mu_2^{(1)} - \mu_2^{(0)}$. Suppose the estimators $\hat\xi_{jn}$, $j=0,1$, are $\sqrt n$-consistent, i.e., there exist $\xi_j^*$, $j=0,1$, such that $\sqrt n(\hat\xi_{jn}-\xi_j^*) = O_P(1)$ for $j=0,1$. In addition, we assume that the functions $\zeta_j(y_1,\xi_j)$, as functions of $\xi_j$, are differentiable in a neighbourhood of $\xi_j^*$ for $j=0,1$, for all $y_1$. Under these assumptions, we heuristically show that the influence function of the estimator $\hat\beta_n$ is given by (5.45).

Some simple algebra yields

\[ \sqrt n(\hat\beta_n-\beta_0) = \frac{1}{\sqrt n}\sum_{i=1}^n\left[\frac{A_i}{\delta}(Y_{2i}-\mu_2^{(1)}) - \frac{A_i-\delta}{\delta}\{\zeta_1(Y_{1i},\hat\xi_{1n})-\mu_2^{(1)}\}\right] - \frac{1}{\sqrt n}\sum_{i=1}^n\left[\frac{1-A_i}{1-\delta}(Y_{2i}-\mu_2^{(0)}) + \frac{A_i-\delta}{1-\delta}\{\zeta_0(Y_{1i},\hat\xi_{0n})-\mu_2^{(0)}\}\right]. \]

Next, we use a Taylor expansion of the functions $\zeta_j$ about $\xi_j^*$, $j=0,1$,

\[ \zeta_j(Y_{1i},\hat\xi_{jn}) = \zeta_j(Y_{1i},\xi_j^*) + \frac{\partial\zeta_j}{\partial\xi_j^T}(Y_{1i},\tilde\xi_{j,n})(\hat\xi_{jn}-\xi_j^*), \]

where $\tilde\xi_{j,n}$ is an intermediate value between $\xi_j^*$ and $\hat\xi_{jn}$. This yields
\[ \sqrt n(\hat\beta_n-\beta_0) = \frac{1}{\sqrt n}\sum_{i=1}^n\left[\frac{A_i}{\delta}(Y_{2i}-\mu_2^{(1)}) - \frac{A_i-\delta}{\delta}\{\zeta_1(Y_{1i},\xi_1^*)-\mu_2^{(1)}\}\right] - \frac{1}{\sqrt n}\sum_{i=1}^n\left[\frac{1-A_i}{1-\delta}(Y_{2i}-\mu_2^{(0)}) + \frac{A_i-\delta}{1-\delta}\{\zeta_0(Y_{1i},\xi_0^*)-\mu_2^{(0)}\}\right] \]
\[ - \frac{1}{n}\sum_{i=1}^n\frac{A_i-\delta}{\delta}\frac{\partial\zeta_1}{\partial\xi_1^T}(Y_{1i},\tilde\xi_{1,n})\times\sqrt n(\hat\xi_{1n}-\xi_1^*) - \frac{1}{n}\sum_{i=1}^n\frac{A_i-\delta}{1-\delta}\frac{\partial\zeta_0}{\partial\xi_0^T}(Y_{1i},\tilde\xi_{0,n})\times\sqrt n(\hat\xi_{0n}-\xi_0^*). \]
It remains to show that the last two terms of the equation above are $o_P(1)$ terms. We already know that $\sqrt n(\hat\xi_{jn}-\xi_j^*) = O_P(1)$ for $j=0,1$. In addition, by the uniform WLLN and the fact that $Y_1\perp\!\!\!\perp A$, we deduce that
\[ \frac{1}{n}\sum_{i=1}^n\frac{A_i-\delta}{\delta}\frac{\partial\zeta_1}{\partial\xi_1^T}(Y_{1i},\tilde\xi_{1,n}) \xrightarrow{P} E\left\{\frac{A-\delta}{\delta}\frac{\partial\zeta_1}{\partial\xi_1^T}(Y_1,\xi_1^*)\right\} = E\left(\frac{A-\delta}{\delta}\right)E\left\{\frac{\partial\zeta_1}{\partial\xi_1^T}(Y_1,\xi_1^*)\right\} = 0, \]
\[ \frac{1}{n}\sum_{i=1}^n\frac{A_i-\delta}{1-\delta}\frac{\partial\zeta_0}{\partial\xi_0^T}(Y_{1i},\tilde\xi_{0,n}) \xrightarrow{P} E\left\{\frac{A-\delta}{1-\delta}\frac{\partial\zeta_0}{\partial\xi_0^T}(Y_1,\xi_0^*)\right\} = E\left(\frac{A-\delta}{1-\delta}\right)E\left\{\frac{\partial\zeta_0}{\partial\xi_0^T}(Y_1,\xi_0^*)\right\} = 0. \]

Henceforth, the last two terms equal $o_P(1)O_P(1) = o_P(1)$. Moreover, we obtain that
\[ \sqrt n(\hat\beta_n-\beta_0) = \frac{1}{\sqrt n}\sum_{i=1}^n\left[\frac{A_i}{\delta}(Y_{2i}-\mu_2^{(1)}) - \frac{A_i-\delta}{\delta}\{\zeta_1(Y_{1i},\xi_1^*)-\mu_2^{(1)}\} - \frac{1-A_i}{1-\delta}(Y_{2i}-\mu_2^{(0)}) - \frac{A_i-\delta}{1-\delta}\{\zeta_0(Y_{1i},\xi_0^*)-\mu_2^{(0)}\}\right] + o_P(1), \]
from which we may conclude that $\hat\beta_n$ is indeed an ALE with influence function given by (5.45), and by the argument in the previous note, we know it is a RAL estimator.

5.4 Probabilistic Index Models

In this section, we present a recently developed semiparametric model proposed by Prof. Dr. Olivier Thas, the so-called probabilistic index models. It gives an example of a model where the different components of the joint distribution cannot be indexed by variationally independent parameters. The results are novel work in collaboration with Prof. Dr. Stijn Vansteelandt. The calculations will become very technical and complicated. Henceforth, we only show how to construct the nuisance tangent space. For the derivation of the orthogonal complement of the nuisance tangent space and a first approach to the construction of an efficient estimator, we refer to [43].

5.4.1 Model Formulation

Let Y (X) denote a random variable which has distribution function FY |X(y|x), and let Y de- note a random variable with distribution function FY (y), which is the distribution of Y (X) marginalised over the p-dimensional X. Let FX(x) denote the distribution function of X. Let fY (y), fY |X(y|x) and fX(x) denote the corresponding densities with respect to the dominating measures νY and νX.

Remark 5.4. In what follows, we will encounter a lot of integral equations. To keep the notation as simple as possible, we will use the notation $dx$ and $dy$ instead of $d\nu_X(x)$ and $d\nu_Y(y)$, respectively.

Consider (Yi, Xi), i = 1, . . . , n, to be an i.i.d. random sample from the joint distribution of Y and X which satisfies the following restriction:

\[ P\{Y' < Y(X)\} = \mathrm{expit}\{\alpha + m(X;\beta)\}, \qquad (5.46) \]
where $m(X;\beta)$ is a known function, smooth in $\beta$ and satisfying $m(0;\beta)=0$, with $\beta^T = (\beta_1,\ldots,\beta_p)$ a $p$-dimensional parameter. The parameters $\beta$ and $\alpha$ are unknown and only the parameter $\beta$ is of interest. Equation (5.46) can be rewritten as an integral equation that we will use in the construction of the nuisance tangent space. Indeed, we deduce that

\[ \mathrm{expit}\{\alpha + m(X;\beta)\} = P\{Y' < Y(X)\} = E[I\{Y' < Y(X)\}] = \int I(y'<y)\,dF_Y(y')\,dF_{Y|X}(y|X). \]

This can be written as
\[ \mathrm{expit}\{\alpha + m(X;\beta)\} = \int I(y'<y)\,f_{Y|X}(y'|x')\,f_X(x')\,f_{Y|X}(y|X)\,dy\,dy'\,dx'. \qquad (5.47) \]

This equation shows that we cannot choose the density functions $f_X(x)$ and $f_{Y|X}(y|X)$ independently. The consequence is that we cannot describe both density functions in terms of two variationally independent parameters separately. This will make the calculations more difficult. Henceforth, we first choose a density function for $X$. Next, we take a conditional density function for $Y$ given $X$ satisfying (5.47). Therefore we parametrize the model as $f_X(x;\eta_2)$ and $f_{Y|X}(y|X;\beta,\eta_1,\eta_2)$, where $\beta$ is the parameter of interest and the parameters $\eta_1$ and $\eta_2$ are considered to be variationally independent infinite-dimensional nuisance parameters. Note we can now take $\eta_1$ and $\eta_2$ to be variationally independent since the conditional density of $Y$ given $X$ is allowed to depend on $\eta_2$. In terms of this notation, (5.47) becomes

\[ \mathrm{expit}\{\alpha + m(X;\beta)\} = \int I(y'<y)\,f_{Y|X}(y'|x';\beta,\eta_1,\eta_2)\,f_X(x';\eta_2)\,f_{Y|X}(y|X;\beta,\eta_1,\eta_2)\,dy\,dy'\,dx'. \qquad (5.48) \]

$^2$The function expit denotes the logistic regression function, i.e., $\mathrm{expit}(z) = \dfrac{e^z}{1+e^z}$.

For further reference, the truth is denoted α0, β0, η1,0 and η2,0. In addition, we write f0(y, x) = fY |X(y|x; β0, η1,0, η2,0)fX(x; η2,0).

There is one subtlety left. We do not allow the conditional density function $f_{Y|X}(y|X;\beta,\eta_1,\eta_2)$ to depend on the parameter $\alpha$. Why is that? Due to the independence of $Y'$ and $Y$, we know that $P(Y' < Y) = \tfrac12$. This implies the following integral condition,
\[ \int \mathrm{expit}\{\alpha + m(x;\beta)\}\,dF_X(x) = \frac12. \]
This can be equivalently written as
\[ \int \mathrm{expit}\{\alpha + m(x;\beta)\}\,f_X(x;\eta_2)\,dx = \frac12. \qquad (5.49) \]

Hence, the parameter α can be seen as a function of β and η2, i.e., α ≡ α(β, η2) solves (5.49) for a given β and η2. We end this section with an interesting interpretation. Note that (5.46) can also be written as

\[ P\{Y' < Y(X)\} = E\{F_Y(Y)|X\} = \mathrm{expit}\{\alpha + m(X;\beta)\}. \qquad (5.50) \]
This suggests that the model defined by (5.46) can be viewed as a restricted moment model for a transformed outcome, where the transformation $F_Y(\cdot)$ is indexed by an infinite-dimensional nuisance parameter.
Conclusion. The probabilistic index model can be characterized by the parameters

{β, η1(·), η2(·)}.

The finite-dimensional parameter $\beta\in\mathbb{R}^p$ is the parameter of interest and $\eta(\cdot) = \{\eta_1(\cdot),\eta_2(\cdot)\}$ is the infinite-dimensional nuisance parameter. The model is then given by the set $\mathcal{P}$ of density functions $f_{Y|X}(y|x;\beta,\eta_1,\eta_2)f_X(x;\eta_2)$ satisfying (5.48).
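The role of the constraint (5.49) can be illustrated numerically: given $\beta$, a functional form for $m$ and a sample standing in for $F_X$, the value $\alpha(\beta,\eta_2)$ is pinned down by one-dimensional root finding. The exponential design and the linear index $m(x;\beta) = \beta x$ in the sketch below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq

# Sketch of the constraint (5.49): alpha is determined by beta and the distribution of X.
# The integral over F_X is replaced by an average over a simulated sample of X (assumption).

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(8)
X = rng.exponential(scale=1.0, size=100_000)   # sample standing in for F_X (illustrative)
beta = 0.8                                      # illustrative value of the parameter of interest

def constraint(alpha):
    # empirical version of  E[expit{alpha + m(X; beta)}] - 1/2  with m(x; beta) = beta * x
    return np.mean(expit(alpha + beta * X)) - 0.5

alpha_hat = brentq(constraint, -10.0, 10.0)
print("alpha(beta) ~", alpha_hat)
```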

5.4.2 Nuisance Tangent Space for a Parametric Submodel

As for the restricted moment model, we first consider parametric submodels to gain some insight in the restrictions the model implies on the score vectors corresponding with the nuisance parameters. This will enable us to make an educated guess for the nuisance tangent space in the next section.
We consider parametric submodels $f_{Y|X}(y|x;\beta,\gamma_1,\gamma_2)$ and $f_X(x;\gamma_2)$, where $\gamma_1\in\mathbb{R}^{r_1}$ and $\gamma_2\in\mathbb{R}^{r_2}$, $r = r_1+r_2$. Thus, $\gamma = (\gamma_1^T,\gamma_2^T)^T\in\mathbb{R}^r$. Hence, we consider the parametric submodel $\mathcal{P}_{\beta,\gamma}\subset\mathcal{P}$,
\[ \mathcal{P}_{\beta,\gamma} = \big\{f_{Y|X}(y|x;\beta,\gamma_1,\gamma_2)f_X(x;\gamma_2) : \theta = (\beta^T,\gamma_1^T,\gamma_2^T)^T\in\Theta_{\beta,\gamma}\big\}, \qquad (5.51) \]
where $\beta$, $\gamma_1$ and $\gamma_2$ are variationally independent. The set $\Theta_{\beta,\gamma}$ is considered to be an open set, so we can take partial derivatives with respect to these parameters. The parametric submodel also contains the truth, which is denoted $f_0(y,x) = f_{Y|X}(y|x;\beta_0,\gamma_{1,0},\gamma_{2,0})f_X(x;\gamma_{2,0})$. Let us investigate the nuisance tangent space for the parametric model (5.51).
Remark 5.5. For further reference, since scores are mostly evaluated at the truth, we omit this in the arguments of the score. If scores are not evaluated at the truth, we write the parameter in the argument of the score.

The score vectors for the nuisance parameters $\gamma_1$ and $\gamma_2$ are given by
\[ S_{\gamma_1}(y,x) = \frac{\partial}{\partial\gamma_1}\log f_{Y|X}(y|x;\beta_0,\gamma_{1,0},\gamma_{2,0}), \]
\[ S_{\gamma_2}(y,x) = \frac{\partial}{\partial\gamma_2}\log f_{Y|X}(y|x;\beta_0,\gamma_{1,0},\gamma_{2,0}) + \frac{\partial}{\partial\gamma_2}\log f_X(x;\gamma_{2,0}) = S_{\gamma_2}^{(1)}(y,x) + S_{\gamma_2}^{(2)}(x), \]
where

\[ S_{\gamma_2}^{(1)}(y,x) = \frac{\partial}{\partial\gamma_2}\log f_{Y|X}(y|x;\beta_0,\gamma_{1,0},\gamma_{2,0}) \quad\text{and}\quad S_{\gamma_2}^{(2)}(x) = \frac{\partial}{\partial\gamma_2}\log f_X(x;\gamma_{2,0}). \]
The nuisance tangent space for the parametric submodel is then given by

\[ \Lambda_\gamma = \big\{B^{p\times r}S_\gamma(Y,X) : B^{p\times r}\in\mathbb{R}^{p\times r}\big\}, \]

where $S_\gamma = (S_{\gamma_1}^T,S_{\gamma_2}^T)^T$. This can be written as $\Lambda_\gamma = \Lambda_{\gamma_1}\oplus\Lambda_{\gamma_2}$ with
\[ \Lambda_{\gamma_1} = \big\{B^{p\times r_1}S_{\gamma_1}(Y,X) : B^{p\times r_1}\in\mathbb{R}^{p\times r_1}\big\}, \]
\[ \Lambda_{\gamma_2} = \big\{B^{p\times r_2}S_{\gamma_2}(Y,X) = B^{p\times r_2}\{S_{\gamma_2}^{(1)}(Y,X)+S_{\gamma_2}^{(2)}(X)\} : B^{p\times r_2}\in\mathbb{R}^{p\times r_2}\big\}. \]

Nuisance Tangent Space for the Nuisance Parameter γ1: Λγ1

The nuisance tangent space for γ1, as given above, is defined to be the space n o p×r1 p×r1 p×r1 Λγ1 = B Sγ1 (Y, X): B ∈ R . (5.52)

We now use the fact that $f_{Y|X}(y|x;\beta,\gamma_1,\gamma_2)$ defines a proper density and the identity (5.48) to obtain fundamental properties of the score for $\gamma_1$ in the parametric submodel.

(i) We know that $\int f_{Y|X}(y|x;\beta_0,\gamma_1,\gamma_{2,0})\,dy = 1$. Differentiating with respect to $\gamma_1$, interchanging integration and differentiation, evaluating at the truth $\gamma_{1,0}$ and using the definition of $S_{\gamma_1}(y,x)$ yields

\[ E\{S_{\gamma_1}(Y,X)|X\} = 0. \qquad (5.53) \]

(ii) By definition of the probabilistic index model, we have the identity (5.48). Taking the partial derivative with respect to $\gamma_1$ yields
\[ \frac{\partial}{\partial\gamma_1}\int I(y'<y)\,f_{Y|X}(y'|x';\beta_0,\gamma_1,\gamma_{2,0})\,f_X(x';\gamma_{2,0})\,f_{Y|X}(y|X;\beta_0,\gamma_1,\gamma_{2,0})\,dy\,dy'\,dx' = \frac{\partial}{\partial\gamma_1}\mathrm{expit}\{\alpha_0 + m(X;\beta_0)\} = 0. \]
Interchanging differentiation and integration, evaluating at the truth $\gamma_{1,0}$ and using the definition of $S_{\gamma_1}(y,x)$ yields
\[ \int I(y'<y)\{S_{\gamma_1}(y,x)+S_{\gamma_1}(y',x')\}\,f_{Y|X}(y'|x';\beta_0,\gamma_{1,0},\gamma_{2,0})\,f_X(x';\gamma_{2,0})\,f_{Y|X}(y|X;\beta_0,\gamma_{1,0},\gamma_{2,0})\,dy\,dy'\,dx' = 0, \]
or equivalently
\[ E[I(Y'<Y)\{S_{\gamma_1}(Y,X)+S_{\gamma_1}(Y',X')\}|X] = 0. \qquad (5.54) \]

Nuisance Tangent Space for the Nuisance Parameter γ2: Λγ2

The nuisance tangent space for $\gamma_2$ is defined to be the space
\[ \Lambda_{\gamma_2} = \big\{B^{p\times r_2}S_{\gamma_2}(Y,X) = B^{p\times r_2}\{S_{\gamma_2}^{(1)}(Y,X)+S_{\gamma_2}^{(2)}(X)\} : B^{p\times r_2}\in\mathbb{R}^{p\times r_2}\big\}. \]

We now also use the fact that fY |X(y|x; β, γ1, γ2) defines a proper density and the identity (5.48) to obtain fundamental properties of the score for γ2 in the parametric submodel. In addition, we also use the fact that fX(x; γ2) defines a proper density.

(i) In the usual way, using the fact that fX(x; γ2) defines a proper density, we obtain

\[ E\{S_{\gamma_2}^{(2)}(X)\} = 0. \qquad (5.55) \]

(ii) Also in the usual way, using the fact that $f_{Y|X}(y|x;\beta,\gamma_1,\gamma_2)$ defines a proper density, we obtain
\[ E\{S_{\gamma_2}^{(1)}(Y,X)|X\} = 0. \qquad (5.56) \]
(iii) Finally, by definition of the probabilistic index model, we have the identity (5.48). Taking the partial derivative with respect to $\gamma_2$ yields
\[ \frac{\partial}{\partial\gamma_2}\int I(y'<y)\,f_{Y|X}(y'|x';\beta_0,\gamma_{1,0},\gamma_2)\,f_X(x';\gamma_2)\,f_{Y|X}(y|X;\beta_0,\gamma_{1,0},\gamma_2)\,dy\,dy'\,dx' = \frac{\partial}{\partial\gamma_2}\mathrm{expit}\{\alpha(\beta_0,\gamma_2)+m(X;\beta_0)\}. \]

Note we now cannot equate the right-hand side to zero, since $\alpha$ is a function of $\gamma_2$. Therefore, we explicitly write the dependence of $\alpha$ on $\beta$ and $\gamma_2$. Upon interchanging integration and differentiation, evaluating at the truth $\gamma_{2,0}$ and using the definitions of $S_{\gamma_2}^{(1)}(Y,X)$ and $S_{\gamma_2}^{(2)}(X)$, we obtain
\[ \int I(y'<y)\{S_{\gamma_2}^{(1)}(y,x)+S_{\gamma_2}^{(1)}(y',x')+S_{\gamma_2}^{(2)}(x')\}\,f_{Y|X}(y'|x';\beta_0,\gamma_{1,0},\gamma_{2,0})\,f_X(x';\gamma_{2,0})\,f_{Y|X}(y|X;\beta_0,\gamma_{1,0},\gamma_{2,0})\,dy\,dy'\,dx' = \mathrm{expit}'\{\alpha_0+m(X;\beta_0)\}\frac{\partial\alpha(\beta_0,\gamma_{2,0})}{\partial\gamma_2}, \]
or equivalently

\[ E[I(Y'<Y)\{S_{\gamma_2}^{(1)}(Y,X)+S_{\gamma_2}^{(1)}(Y',X')+S_{\gamma_2}^{(2)}(X')\}|X] = \mathrm{expit}'\{\alpha_0+m(X;\beta_0)\}\frac{\partial\alpha(\beta_0,\gamma_{2,0})}{\partial\gamma_2}. \]

We now look for an expression for $\partial\alpha/\partial\gamma_2$. This can be obtained by taking the partial derivative of (5.49) with respect to $\gamma_2$,
\[ \frac{\partial}{\partial\gamma_2}\int \mathrm{expit}\{\alpha(\beta_0,\gamma_2)+m(x;\beta_0)\}\,f_X(x;\gamma_2)\,dx = 0. \]
Upon interchanging integration and differentiation, evaluating at the truth and using the definition of $S_{\gamma_2}^{(2)}(x)$, this yields
\[ \int\left[\mathrm{expit}'\{\alpha_0+m(x;\beta_0)\}\frac{\partial\alpha(\beta_0,\gamma_{2,0})}{\partial\gamma_2} + \mathrm{expit}\{\alpha_0+m(x;\beta_0)\}S_{\gamma_2}^{(2)}(x)\right]f_X(x;\gamma_{2,0})\,dx = 0. \]

This yields an expression for $\partial\alpha/\partial\gamma_2$:
\[ \frac{\partial\alpha(\beta_0,\gamma_{2,0})}{\partial\gamma_2} = -\frac{E[\mathrm{expit}\{\alpha_0+m(X;\beta_0)\}S_{\gamma_2}^{(2)}(X)]}{E[\mathrm{expit}'\{\alpha_0+m(X;\beta_0)\}]}. \qquad (5.57) \]
Using (5.57), we obtain the following condition
\[ E[I(Y'<Y)\{S_{\gamma_2}^{(1)}(Y,X)+S_{\gamma_2}^{(1)}(Y',X')+S_{\gamma_2}^{(2)}(X')\}|X] = -\frac{\mathrm{expit}'\{\alpha_0+m(X;\beta_0)\}}{E[\mathrm{expit}'\{\alpha_0+m(X;\beta_0)\}]}\,E[\mathrm{expit}\{\alpha_0+m(X;\beta_0)\}S_{\gamma_2}^{(2)}(X)]. \qquad (5.58) \]

5.4.3 Semiparametric Nuisance Tangent Space

We now focus on the space that is of interest, the semiparametric nuisance tangent space $\Lambda$. By definition, the semiparametric nuisance tangent space $\Lambda$ is the mean-square closure of all parametric submodel nuisance tangent spaces, i.e.,
\[ \Lambda = \overline{\bigcup_{\mathcal{P}_{\beta,\gamma}\subset\mathcal{P}}\Lambda_\gamma}. \]

Since $\Lambda_\gamma = \Lambda_{\gamma_1}\oplus\Lambda_{\gamma_2}$, we see that
\[ \Lambda = \overline{\bigcup_{\mathcal{P}_{\beta,\gamma}\subset\mathcal{P}}\Lambda_{\gamma_1}} \;\oplus\; \overline{\bigcup_{\mathcal{P}_{\beta,\gamma}\subset\mathcal{P}}\Lambda_{\gamma_2}} = \Lambda_1\oplus\Lambda_2. \]

Nuisance Tangent Space for the Nuisance Parameter η1(·): Λ1

We now state a theorem that gives us a precise description of the nuisance tangent space Λ1.

Theorem 5.4. The nuisance tangent space Λ1 for the nuisance parameter η1(·) is given by

\[ \Lambda_1 = \Big\{ h_1(Y,X)\in\mathcal{H}_p : E\{h_1(Y,X)\mid X\}=0 \text{ and } E[I(Y'<Y)\{h_1(Y,X)+h_1(Y',X')\}\mid X]=0, \text{ with } (X,Y)\amalg(X',Y') \Big\}. \tag{5.59} \]

Remark 5.6. We omit the proof since it is analogous to the calculations we made for the restricted moment model. The proof is given in [43].

Nuisance Tangent Space for the Nuisance Parameter η2(·): Λ2

As for the semiparametric nuisance tangent space for the nuisance parameter η1(·), we now state a theorem that gives us a precise description of the nuisance tangent space Λ2. We also omit the proof since the calculations are analogous to those for the restricted moment model. The proof is given in [43].

Theorem 5.5. The nuisance tangent space $\Lambda_2$ for the nuisance parameter $\eta_2(\cdot)$ is given by
\[ \begin{aligned} \Lambda_2 = \Big\{ h_2(Y,X)+h_3(X)\in\mathcal{H}_p :\ & E\{h_2(Y,X)\mid X\}=0,\ E\{h_3(X)\}=0 \text{ and} \\ & E[I(Y'<Y)\{h_2(Y,X)+h_2(Y',X')+h_3(X')\}\mid X] \\ &\quad = -\frac{\operatorname{expit}'\{\alpha_0+m(X;\beta_0)\}}{E[\operatorname{expit}'\{\alpha_0+m(X;\beta_0)\}]}\,E[\operatorname{expit}\{\alpha_0+m(X;\beta_0)\}h_3(X)], \\ & \text{with } (X,Y)\amalg(X',Y') \Big\}. \end{aligned} \tag{5.60} \]

Relation Between Λ1 and Λ2

We end with a relation between the spaces Λ1 and Λ2 and a slightly different representation for the space Λ.

Lemma 5.5. We have the following relations,

(i)Λ1 ⊂ Λ2,

(ii) Λ = Λ2, where Λ is the complete nuisance tangent space.

Proof. To prove (i), we take an arbitrary element $h_1(Y,X)$ from $\Lambda_1$ and take $h_3$ to be the zero function, i.e., $h_3(X)=0$ for all $X$. It is then easy to see that $h_1(Y,X)+h_3(X)$ is an element of $\Lambda_2$. Indeed, $E\{h_1(Y,X)\mid X\}=0$ because $h_1(Y,X)$ is in $\Lambda_1$ and $E\{h_3(X)\}$ is trivially zero. Moreover, with $h_3(X)=0$ for all $X$,
\[ E[I(Y'<Y)\{h_1(Y,X)+h_1(Y',X')\}\mid X]=0 \]
implies
\[ E[I(Y'<Y)\{h_1(Y,X)+h_1(Y',X')+h_3(X')\}\mid X] = -\frac{\operatorname{expit}'\{\alpha_0+m(X;\beta_0)\}}{E[\operatorname{expit}'\{\alpha_0+m(X;\beta_0)\}]}\,E[\operatorname{expit}\{\alpha_0+m(X;\beta_0)\}h_3(X)]. \]

That (ii) holds, i.e., Λ = Λ2, is now trivial to see since Λ = Λ1 ⊕ Λ2 and Λ1 ⊂ Λ2.

Lemma 5.6. An equivalent way of representing the nuisance tangent space $\Lambda$ is
\[ \begin{aligned} \Lambda = \Big\{ h_1(Y,X)+h_2(X)\in\mathcal{H}_p :\ & E\{h_1(Y,X)\mid X\}=0,\ E\{h_2(X)\}=0 \text{ and} \\ & E\big[[I(Y'<Y)-\operatorname{expit}\{\alpha_0+m(X;\beta_0)\}]\{h_1(Y,X)+h_1(Y',X')+h_2(X')\}\mid X\big] \\ &\quad = -\frac{\operatorname{expit}'\{\alpha_0+m(X;\beta_0)\}}{E[\operatorname{expit}'\{\alpha_0+m(X;\beta_0)\}]}\,E[\operatorname{expit}\{\alpha_0+m(X;\beta_0)\}h_2(X)], \\ & \text{with } (X,Y)\amalg(X',Y') \Big\}. \end{aligned} \]

Proof. This follows from Lemma 5.5 upon noting that the restriction
\[ E[I(Y'<Y)\{h_1(Y,X)+h_1(Y',X')+h_2(X')\}\mid X] = -\frac{\operatorname{expit}'\{\alpha_0+m(X;\beta_0)\}}{E[\operatorname{expit}'\{\alpha_0+m(X;\beta_0)\}]}\,E[\operatorname{expit}\{\alpha_0+m(X;\beta_0)\}h_2(X)] \]
is equivalent with
\[ E\big[[I(Y'<Y)-\operatorname{expit}\{\alpha_0+m(X;\beta_0)\}]\{h_1(Y,X)+h_1(Y',X')+h_2(X')\}\mid X\big] = -\frac{\operatorname{expit}'\{\alpha_0+m(X;\beta_0)\}}{E[\operatorname{expit}'\{\alpha_0+m(X;\beta_0)\}]}\,E[\operatorname{expit}\{\alpha_0+m(X;\beta_0)\}h_2(X)]. \]

This can be obtained by a direct computation using the law of iterated expectation,

\[ \begin{aligned} -E[\operatorname{expit}\{\alpha_0+m(X;\beta_0)\}\{h_1(Y,X)+h_1(Y',X')+h_2(X')\}\mid X] &= -\operatorname{expit}\{\alpha_0+m(X;\beta_0)\}\,\big[E\{h_1(Y,X)\mid X\}+E\{h_1(Y',X')\mid X\}+E\{h_2(X')\mid X\}\big] \\ &= -\operatorname{expit}\{\alpha_0+m(X;\beta_0)\}\,\big[E\{h_1(Y,X)\mid X\}+E[E\{h_1(Y',X')\mid X'\}]+E\{h_2(X')\}\big] = 0. \end{aligned} \]

The last step uses the fact that $(X,Y)\amalg(X',Y')$.

Part III

Abstract Approach to Semiparametric Efficiency


Chapter 6

Tangent Sets and Efficient Influence Function

In this chapter, we will generalize the basic concepts of semiparametric efficiency theory, e.g., tangent sets, score functions and the efficient influence function. We will argue that these concepts, as introduced here, are indeed proper generalizations of the corresponding concepts introduced in Part II.

6.1 Score Functions and Tangent Sets

Consider a semiparametric model $\mathcal{P}$. In general, we shall be interested in estimating some functional $\psi:\mathcal{P}\to\mathbb{B}: P\mapsto\psi(P)$, where $\mathbb{B}$ is a Banach space, e.g., $\mathbb{R}$ or $\mathbb{R}^m$ in a multi-dimensional setting. We immediately note an important example of a functional $\psi$. Consider a semiparametric model in a strict sense,

P = {Pθ,η : θ ∈ Θ, η ∈ H}.

The set $\Theta$ is an open subset of $\mathbb{R}^m$ and $H$ is some infinite-dimensional set (see §1.3 for examples). This semiparametric model is indexed by two parameters $\theta$ and $\eta$, where $\theta$ is finite-dimensional and $\eta$ is the infinite-dimensional nuisance parameter. This will not stop us from also considering the estimation of $\eta$. Models with a partitioned parameter $(\theta,\eta)$ of this form are semiparametric models in a strict sense. The finite-dimensional parameter $\theta$ is the parametric component and the infinite-dimensional parameter $\eta$ is the nonparametric component. Alternatively, we could call this a parametric-nonparametric model, having in mind that $\eta$ could be an element of a nonparametric model, see Begun et al. (1983), [3]. When we are interested in the parameter $\theta$, we wish to estimate the functional

\[ \psi:\mathcal{P}\to\mathbb{R}^m : P_{\theta,\eta}\mapsto\psi(P_{\theta,\eta})=\theta. \]
Another example arises in a nonparametric model $\mathcal{P}$ consisting of all distribution functions of a continuous random variable $Z$. Suppose we are interested in estimating the mean $E(Z)=\int z\,dP$. The functional corresponding with this problem is
\[ \psi:\mathcal{P}\to\mathbb{R} : P\mapsto\psi(P)=\int z\,dP. \]
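As a purely illustrative numerical sketch (not part of the original development; the sample, the distribution and all variable names are my own assumptions), the second functional can be evaluated at the empirical distribution $P_n$, which places mass $1/n$ on each observation, giving the familiar plug-in estimator:

    import numpy as np

    # A minimal sketch: psi(P) = int z dP evaluated at the empirical distribution
    # P_n is simply the sample mean (the exponential sample is a hypothetical choice).
    rng = np.random.default_rng(0)
    z = rng.exponential(scale=2.0, size=1000)   # hypothetical sample from some P
    psi_plugin = z.mean()                       # psi(P_n) = (1/n) * sum_i Z_i
    print("plug-in estimate of psi(P) = E(Z):", psi_plugin)
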


Comparing this with the general set-up of Part II, we already encounter the first difference. In Part II, we developed a theory that directly deals with estimation of the parameter of interest for semiparametric models in a strict sense, and sometimes, as seen in §5.3, a more natural representation is that the parameter of interest is a function of an infinite-dimensional parameter. In that case, we could not define efficiency without having a preliminary RAL estimator available. This is something we want to avoid. Therefore, we unify both cases by considering estimation of a functional ψ working on the model P. It is clear this encompasses all the examples of Part II, so we are considering a proper generalization. Our main interest is in functionals ψ that allow an asymptotic theory analogous to the theory for smooth parametric models. We start by developing a notion of information for estimating ψ(P ) given the model P. In parametric models, we have a strict definition for the Fisher information for estimating the parameters. So, what can we say about the information for the semiparametric model for estimating ψ(P )? To give some intuitive notion of information in semiparametric models, we follow the same reasoning as in Chapter 4. We consider a smooth parametric submodel P0 = {Pθ|θ ∈ Θ} ⊂ P. This smoothness will be defined shortly. By the definition of a parametric submodel, the truth is contained in P0. For this parametric submodel, we can calculate the Fisher information for estimating ψ(Pθ). Then the information for estimating ψ(P ) in the whole model is certainly not bigger than the information covered by each of these parametric submodels. So it is certainly not bigger than the infimum of the informations over all submodels. Let this be clear, because in parametric models, one imposes more restrictions from which one can extract more information for estimating ψ(P ). The information for P is then simply defined as this infimum. If there exists a submodel for which the infimum is reached, we call this submodel the least favourable or hardest submodel, as already encountered in §5.2. It seems that in most situations, it suffices to consider one-dimensional submodels P0. We know that they should pass through the true distribution P . Note that these parametric submodels are only conceptual and of theoretical nature because we do not have knowledge about the true distribution. They should also be differentiable in an appropriate way, which we shall define now. This kind of differentiability is what is meant by smooth submodels and is necessary to introduce the notion of a score function in the abstract sense.

Definition 6.1. A differentiable path is a map $t\mapsto P_t$ from a neighbourhood of 0, $(0,\varepsilon)\subset(0,+\infty)$, to $\mathcal{P}$ such that, for some measurable function $g:\mathcal{X}\to\mathbb{R}$,
\[ \int\left[\frac{dP_t^{1/2}-dP^{1/2}}{t}-\frac{1}{2}g\,dP^{1/2}\right]^2\to 0 \quad\text{as } t\to 0. \tag{6.1} \]

The function $g$ is called the score function of the submodel $\{P_t \mid 0\le t<\varepsilon\}$ at $t=0$, with $P_0=P$, the truth.

The preceding definition is not very enlightening. First of all, the objects $dP_t^{1/2}$ seem rather odd. They can be formalized by introducing the Hilbert space of square roots of measures. This concept is due to Le Cam, who came up with the idea to define differentiable paths through root-densities in the 1960s; see Le Cam (1960), [17], (1969), [18] and (1986), [19] for more information. However, these square roots of measures are not necessary for our purposes. A simpler form reads as
\[ \int\left(\frac{p_{tt}^{1/2}-p_t^{1/2}}{t}-\frac{1}{2}g\,p_t^{1/2}\right)^2 d\mu_t\to 0 \quad\text{as } t\to 0, \tag{6.2} \]
where, for each $t$, $\mu_t$ is an arbitrary measure relative to which $P$ and $P_t$ possess densities $p_t$ and $p_{tt}$. We will not pay much attention to these measure theoretical details. This formula is a bit more insightful than (6.1) but it is still not satisfactory. Therefore we manipulate (6.1) in a different way. We deduce that

\[ \int\left(\frac{dP_t^{1/2}-dP^{1/2}}{t}-\frac{1}{2}g\,dP^{1/2}\right)^2 = \int\left(\frac{dP_t^{1/2}-dP^{1/2}}{t\,dP^{1/2}}-\frac{g}{2}\right)^2 dP. \]

So we see that (6.1) is equivalent with
\[ \lim_{t\to 0} E\left\{\left(\frac{dP_t^{1/2}-dP^{1/2}}{t\,dP^{1/2}}-\frac{g}{2}\right)^2\right\} = 0. \tag{6.3} \]

We could use (6.3) as the definition for a differentiable path because now, the interpretation of this type of differentiability is more clear: we have some kind of differentiability in quadratic mean of the square root of our densities. In words we say that a differentiable path is a para- metric submodel {Pt|0 ≤ t < ε} that is differentiable in quadratic mean at t = 0 with score function g. Such a differentiable path is also called a smooth parametric submodel. If in addition the corresponding information matrix is nonsingular, this is called a regular para- metric submodel. This terminology is introduced in Newey (1990), [27], Definition A.1 of the appendix.
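To make this type of differentiability concrete, here is a minimal numerical sketch in Python (the normal location submodel and all variable names are illustrative assumptions of mine, not part of the text): for the path $P_t=N(t,1)$ with score $g(x)=x$ at $t=0$, the quantity in (6.1) should shrink as $t\to 0$.

    import numpy as np
    from scipy.stats import norm

    # Check (6.1) numerically for the (assumed) path P_t = N(t, 1), score g(x) = x.
    x = np.linspace(-10, 10, 20001)
    p = norm.pdf(x)                      # density of P = N(0, 1)
    g = x                                # score function of the path at t = 0

    for t in [0.5, 0.1, 0.02, 0.004]:
        pt = norm.pdf(x, loc=t)          # density of P_t = N(t, 1)
        integrand = ((np.sqrt(pt) - np.sqrt(p)) / t - 0.5 * g * np.sqrt(p)) ** 2
        print(t, np.trapz(integrand, x))  # decreases towards 0 as t -> 0
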

Remark 6.1. This differentiability in quadratic mean requires convergence in the $L_2(P)$-sense (or in the $L_2(\mu_t)$-sense for (6.2)), so the statements involved need only hold a.s. (almost surely), or more generally a.e. (almost everywhere). Sets with probability 0 play no role in this type of convergence, as opposed to pointwise convergence.

We are now ready to define tangent sets.

Definition 6.2. Letting $t\mapsto P_t$ range over a collection of smooth and regular parametric submodels, we obtain a collection of score functions, which we call a tangent set of the model $\mathcal{P}$ at $P$. We denote this tangent set by $\dot{\mathcal{P}}_P$. When we consider all possible differentiable paths $t\mapsto P_t$, we obtain the maximal collection of score functions. Henceforth, this set is referred to as the maximal tangent set.

Remark 6.2. At first sight, this definition of a tangent set does not correspond with the def- inition of a semiparametric tangent space as defined in Chapter 4 as the mean-square closure of all tangent spaces of regular parametric submodels. However, after we defined the efficient influence function in the next section, we will argue we defined quite similar objects.

We can ask ourselves a natural question. Does differentiability in quadratic mean imply other types of convergence? The answer to this question is positive. The result is not based on the reference works in the bibliography.

Proposition 6.1. Suppose we have a differentiable path t 7→ Pt with score function g, so we have differentiability in quadratic mean to g. This implies

1. convergence in probability, i.e., $\dfrac{dP_t^{1/2}-dP^{1/2}}{t\,dP^{1/2}}\xrightarrow{P}\dfrac{g}{2}$ as $t\to 0$,

2. convergence in distribution, i.e., $\dfrac{dP_t^{1/2}-dP^{1/2}}{t\,dP^{1/2}}\xrightarrow{D}\dfrac{g}{2}$ as $t\to 0$.

Proof. Let us first prove the convergence in probability. We know that
\[ E\left\{\left(\frac{dP_{t_n}^{1/2}-dP^{1/2}}{t_n\,dP^{1/2}}-\frac{g}{2}\right)^2\right\}\to 0, \]
where $t_n$ is an arbitrary sequence of real numbers such that $t_n\to 0$ as $n\to+\infty$. Set $X_n = \dfrac{dP_{t_n}^{1/2}-dP^{1/2}}{t_n\,dP^{1/2}}$. So, we know that $E\{(X_n-g/2)^2\}\to 0$ as $n\to+\infty$ and we must show that $X_n\xrightarrow{P} g/2$ as $n\to+\infty$. Take any $\varepsilon>0$; using Markov's inequality, we obtain
\[ P\left(\left|X_n-\frac{g}{2}\right|>\varepsilon\right)\le\frac{E\{(X_n-g/2)^2\}}{\varepsilon^2}. \]
By the differentiability in quadratic mean, we obtain
\[ P\left(\left|X_n-\frac{g}{2}\right|>\varepsilon\right)\to 0 \]
as $n\to+\infty$, for any $\varepsilon>0$. By definition of convergence in probability, we have shown that
\[ \frac{dP_t^{1/2}-dP^{1/2}}{t\,dP^{1/2}}\xrightarrow{P}\frac{g}{2}. \]
The second assertion is now trivial because convergence in probability implies convergence in distribution.

The following lemma gives two fundamental but familiar properties of score functions. These are consequences of the differentiability in quadratic mean. The proof of this lemma makes us familiar with the notations $Pg$ and $Pg^2$.

Lemma 6.1. Every score function satisfies $Pg=0$ and $Pg^2<+\infty$.

Proof. For given but arbitrary $(t_n)_{n\ge 1}$ such that $t_n\to 0$ as $n\to+\infty$, let $p_n$ and $p$ be densities of $P_{t_n}$ and $P$ relative to a dominating measure $\mu$. In many cases this will be the ordinary Lebesgue measure. By (6.1) the sequence $(\sqrt{p_n}-\sqrt{p})/t_n$ converges in quadratic mean (i.e., in $L_2(\mu)$-sense) to $\frac{1}{2}g\sqrt{p}$. Hence,
\[ \left\|\frac{\sqrt{p_n}-\sqrt{p}}{t_n}-\frac{1}{2}g\sqrt{p}\right\|_{L_2(\mu)}^2\to 0. \]
We have that $(\sqrt{p_n}-\sqrt{p})/t_n\in L_2(\mu)$ for all $n\ge 1$. This can be obtained as follows. We have that $\sqrt{p_n}\in L_2(\mu)$ and $\sqrt{p}\in L_2(\mu)$ because $\|\sqrt{p_n}\|_{L_2(\mu)}=\|\sqrt{p}\|_{L_2(\mu)}=1$, and any linear combination of these functions is also in $L_2(\mu)$. This shows that $g\sqrt{p}/2$ is a limit of $L_2(\mu)$-functions and therefore an $L_2(\mu)$-function. Now it is easy to see that $g\in L_2(P)$,
\[ \frac{1}{4}Pg^2 = \frac{1}{4}\int g^2\,dP = \int\left(\frac{1}{2}g\sqrt{p}\right)^2 d\mu < +\infty. \]
Next we deduce that
\[ \lim_{t_n\to 0}\left\|\sqrt{p_n}-\sqrt{p}-\frac{1}{2}g\sqrt{p}\,t_n\right\|_{L_2(\mu)} = \lim_{t_n\to 0}\left\|\frac{\sqrt{p_n}-\sqrt{p}}{t_n}-\frac{1}{2}g\sqrt{p}\right\|_{L_2(\mu)}\times\lim_{t_n\to 0}t_n = 0, \]
because both parts tend to zero. By the reversed triangle inequality, it follows that
\[ \left|\,\|\sqrt{p_n}-\sqrt{p}\|_{L_2(\mu)} - t_n\left\|\tfrac{1}{2}g\sqrt{p}\right\|_{L_2(\mu)}\right| \le \left\|\sqrt{p_n}-\sqrt{p}-\tfrac{1}{2}g\sqrt{p}\,t_n\right\|_{L_2(\mu)} \]
and taking limits we obtain
\[ 0\ge\lim_{t_n\to 0}\|\sqrt{p_n}-\sqrt{p}\|_{L_2(\mu)} - \|g\sqrt{p}/2\|_{L_2(\mu)}\times\lim_{t_n\to 0}t_n = \lim_{t_n\to 0}\|\sqrt{p_n}-\sqrt{p}\|_{L_2(\mu)}, \]
so this must be zero. We conclude that $\lim_{t_n\to 0}\sqrt{p_n}=\sqrt{p}$ in $L_2(\mu)$. This enables us to show that $Pg=0$. Indeed, by the continuity of the inner product (so we can interchange limit and integration),
\[ Pg = \int g\,dP = \int gp\,d\mu = \int\frac{1}{2}g\sqrt{p}\times 2\sqrt{p}\,d\mu = \lim_{t_n\to 0}\int\frac{\sqrt{p_n}-\sqrt{p}}{t_n}\times(\sqrt{p_n}+\sqrt{p})\,d\mu. \]
The right hand side equals
\[ \lim_{t_n\to 0}\int\frac{p_n-p}{t_n}\,d\mu = \lim_{t_n\to 0}\frac{\int p_n\,d\mu-\int p\,d\mu}{t_n} = 0, \]
because both probability densities integrate to 1. We conclude that $Pg=0$.

What can we conclude from this lemma? Looking at the properties of a score function $g$ in the sense of Definition 6.1, a tangent set is a subset of $L_2(P)$ (i.e., $Pg^2$ is finite) consisting of functions with mean zero ($Pg=0$). Henceforth, score functions defined in the sense of Definition 6.1 possess the same fundamental properties as score functions defined in Part II. In many cases, the tangent set is a linear space, i.e., closed under taking linear combinations. In this case we speak of a tangent space. We can give some geometric interpretation to this tangent set. We may visualize the model $\mathcal{P}$, or rather the corresponding set of square roots of measures $dP^{1/2}$, as a subset of the unit ball of a Hilbert space ($L_2(\mu)$ if the model is dominated). A tangent set $\dot{\mathcal{P}}_P$, or rather the set of all objects $\frac{1}{2}g\,dP^{1/2}$, is then tangent to this subset of the unit ball in the Hilbert space. Note however that we have not defined a tangent set to be equal to the set of all score functions $g$ that correspond to some differentiable submodel. As stated in [40], for many purposes this maximal tangent set, i.e., the tangent set consisting of score functions of all the parametric submodels, is too big. So we have given ourselves the flexibility of calling any set of score functions a tangent set. The drawback will be that in any result obtained later on we must specify which tangent set we are working with. The approach we used above to define score functions is not very familiar. As in Part II, we usually define score functions as the pointwise derivative of the logarithm of the density. So, for a parametric submodel $t\mapsto P_t$ we obtain, for every $x$,

\[ g(x) = \frac{\partial}{\partial t}\Big|_{t=0}\log dP_t(x). \]
This pointwise differentiability is not required by (6.1). Conversely, given this pointwise differentiability, we are not assured to have (6.1). We still need to be able to apply a convergence theorem for integrals to obtain this type of convergence in quadratic mean, such as the Dominated Convergence Theorem of Lebesgue¹ or the Monotone Convergence Theorem², since we need to interchange limit and integration. An example of such a calculation will be given in §6.3.2. Thus, both definitions of score functions are not equivalent. But there is some good news. When the conditions of the following lemma are satisfied, the pointwise differentiability implies (6.1). This lemma seems to solve most examples, as stated in [40].

Lemma 6.2. If $p_t$ is a probability density relative to a fixed measure $\mu$, $t\mapsto\sqrt{p_t(x)}$ is continuously differentiable in $t$ in a neighbourhood of 0, and
\[ t\mapsto\int\frac{\dot{p}_t^2}{p_t}\,d\mu, \qquad \dot{p}_t=\frac{\partial p_t}{\partial t}, \]
is finite and continuous in this neighbourhood, then $t\mapsto P_t$ is a differentiable path.

Remark 6.3. Note that $\int\frac{\dot{p}_t^2}{p_t}\,d\mu$ is simply the Fisher information for the one-dimensional submodel $t\mapsto p_t$.

For a proof of this lemma, we refer to van der Vaart (1998), [39], Chapter 7, Lemma 7.6, p.95-96. Examples of the application of this lemma are also given on p.96 of [39]. More particularly, it shows that (under some smoothness conditions) exponential families and location models are in the realm of Lemma 6.2. In addition, on p.97 of [39], a counterexample is given: a uniform distribution on the interval $[0,\theta]$, which has density function $p_\theta(x)=\frac{1}{\theta}I(0\le x\le\theta)$. This family is not differentiable in quadratic mean and hence not within the realm of Lemma 6.2. The reason is that the support of the uniform distribution depends too much on the parameter. It is also clear that the function $\theta\mapsto\sqrt{p_\theta(x)}$ is not differentiable. Because we will jump several times between the different definitions of score functions, we give a heuristic argument that in sufficiently smooth cases, both definitions are the same. With the notations of the preceding lemma,
\[ g(x) = \frac{\partial\log p_t(x)}{\partial t}\Big|_{t=0} = \frac{1}{p(x)}\frac{\partial p_t(x)}{\partial t}\Big|_{t=0}. \]
Using this and looking at (6.2), we see
\[ \frac{\sqrt{p_t(x)}-\sqrt{p(x)}}{t}\approx\frac{\partial\sqrt{p_t(x)}}{\partial t}\Big|_{t=0} = \frac{1}{2\sqrt{p(x)}}\frac{\partial p_t(x)}{\partial t}\Big|_{t=0} = \frac{1}{2}g(x)\sqrt{p(x)},
\]

¹ Consider a measure space $(\mathcal{X},\mathcal{A},\mu)$. Let $\{f_n\}_{n=1}^{+\infty}$ be a sequence of measurable functions $f_n:\mathcal{X}\to\mathbb{R}$ converging almost everywhere to a function $f:\mathcal{X}\to\mathbb{R}$. If there exists another measurable function $F:\mathcal{X}\to\mathbb{R}$ such that $\int_{\mathcal{X}}F(x)\,d\mu(x)<+\infty$ and $|f_n|\le F$ almost everywhere for every $n$, then
\[ \lim_{n\to+\infty}\int_{\mathcal{X}}f_n(x)\,d\mu(x) = \int_{\mathcal{X}}\lim_{n\to+\infty}f_n(x)\,d\mu(x) = \int_{\mathcal{X}}f(x)\,d\mu(x). \]

² Consider a measure space $(\mathcal{X},\mathcal{A},\mu)$. Let $\{f_n\}_{n=1}^{+\infty}$ be a sequence of measurable functions $f_n:\mathcal{X}\to\mathbb{R}$ converging almost everywhere to a function $f:\mathcal{X}\to\mathbb{R}$. If
\[ 0\le f_1\le f_2\le f_3\le f_4\le\ldots \]
almost everywhere, then
\[ \lim_{n\to+\infty}\int_{\mathcal{X}}f_n(x)\,d\mu(x) = \int_{\mathcal{X}}\lim_{n\to+\infty}f_n(x)\,d\mu(x) = \int_{\mathcal{X}}f(x)\,d\mu(x). \]

with $p_0=p$, the truth. So why do we use such a difficult definition of score functions? The reason is given in van der Vaart (1988), [36]. The quite abstract Definition 6.1, which gives a quadratic mean version of a score function, precludes many of the awkward regularity conditions usually attached to the Cramér-Rao bound. Thus, when the statistical model is sufficiently smooth, both definitions of the score function are equivalent and for such smooth statistical models the awkward regularity conditions attached to the Cramér-Rao bound are satisfied.

The differentiability (6.1) ensures a type of local asymptotic normality of the log likelihood ratio, similar to that for the ordinary pointwise definition of a score function, but without imposing additional difficult regularity conditions. The proof of the lemma is long and technical. It can be skipped by the less interested reader. Nonetheless, it explicitly shows how the definition of score functions in the mean-square sense is used. So, it is worth paying some attention to the proof. It is explained in full detail.

Lemma 6.3. If the path $t\mapsto P_t$ in $\mathcal{P}$ satisfies (6.1), then
\[ \log\prod_{i=1}^n\frac{dP_{1/\sqrt{n}}}{dP}(X_i) = \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i) - \frac{1}{2}Pg^2 + o_P(1). \tag{6.4} \]

Proof. We adopt the notation of the proof of Lemma 6.1, but with $t_n=\frac{1}{\sqrt{n}}$. Next, we define the random variables
\[ W_{ni} = 2\left(\sqrt{\frac{p_n}{p}}(X_i)-1\right), \qquad i=1,\ldots,n. \tag{6.5} \]

These are well-defined with $P$-probability 1. Because $X_1,\ldots,X_n$ are independent and identically distributed random variables, we deduce
\[ \begin{aligned} \operatorname{var}\left\{\sum_{i=1}^n W_{ni}-\frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i)\right\} &= \operatorname{var}\left[\sum_{i=1}^n\left\{W_{ni}-\frac{1}{\sqrt{n}}g(X_i)\right\}\right] = \sum_{i=1}^n\operatorname{var}\left\{W_{ni}-\frac{1}{\sqrt{n}}g(X_i)\right\} \\ &\le\sum_{i=1}^n E\left[\left\{W_{ni}-\frac{1}{\sqrt{n}}g(X_i)\right\}^2\right] = nE\left[\left\{W_{ni}-\frac{1}{\sqrt{n}}g(X_i)\right\}^2\right] = E\left[\left\{\sqrt{n}\,W_{ni}-g(X_i)\right\}^2\right]. \end{aligned} \]

By the definition of $W_{ni}$ and (6.1), we obtain
\[ \operatorname{var}\left\{\sum_{i=1}^n W_{ni}-\frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i)\right\}\le 4E\left[\left\{\frac{\sqrt{p_n}-\sqrt{p}}{\sqrt{p}/\sqrt{n}}(X_i)-\frac{1}{2}g(X_i)\right\}^2\right]\to 0. \tag{6.6} \]

This means that, asymptotically, the sum $\sum_{i=1}^n W_{ni}$ is as variable as the sum $\frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i)$. Note that the latter sum has mean zero. Next, we want to know something about the asymptotic mean. Some easy calculations show us that
\[ \begin{aligned} E\left(\sum_{i=1}^n W_{ni}\right) &= 2nE\left\{\sqrt{\frac{p_n}{p}}(X_i)-1\right\} = 2n\left(\int\sqrt{\frac{p_n}{p}}\,p\,d\mu-1\right) = n\int\left(2\sqrt{p_n}\sqrt{p}-p-p_n\right)d\mu \\ &= -n\int(\sqrt{p_n}-\sqrt{p})^2\,d\mu = -\int\left(\frac{\sqrt{p_n}-\sqrt{p}}{1/\sqrt{n}}\right)^2 d\mu = -\left\|\frac{\sqrt{p_n}-\sqrt{p}}{1/\sqrt{n}}\right\|_{L_2(\mu)}^2. \end{aligned} \]

The convergence of the sequence $-\left\|\frac{\sqrt{p_n}-\sqrt{p}}{1/\sqrt{n}}\right\|_{L_2(\mu)}^2$ is assured by (6.1). This yields
\[ E\left(\sum_{i=1}^n W_{ni}\right)\to-\left\|\frac{1}{2}g\sqrt{p}\right\|_{L_2(\mu)}^2 = -\frac{1}{4}Pg^2. \tag{6.7} \]

Asymptotically, the sum $\sum_{i=1}^n W_{ni}$ has mean $-\frac{1}{4}Pg^2$. By (6.6) and (6.7), it is justified to write
\[ \sum_{i=1}^n W_{ni} = \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i) - \frac{1}{4}Pg^2 + o_P(1). \tag{6.8} \]

Let us now express the log likelihood ratio in terms of $\sum_{i=1}^n W_{ni}$ through a Taylor expansion of the logarithm. First, we write
\[ \log\left\{\prod_{i=1}^n\frac{p_n}{p}(X_i)\right\} = \sum_{i=1}^n\log\left\{\frac{p_n}{p}(X_i)\right\} = 2\sum_{i=1}^n\log\left\{\sqrt{\frac{p_n}{p}}(X_i)\right\} = 2\sum_{i=1}^n\log\left(1+\frac{1}{2}W_{ni}\right). \]

Now, using the Taylor expansion of the logarithm,
\[ \log(1+x) = x-\frac{1}{2}x^2+x^2R(2x) \]
with $R(2x)\to 0$ as $x\to 0$, we get
\[ \log\left\{\prod_{i=1}^n\frac{p_n}{p}(X_i)\right\} = \sum_{i=1}^n W_{ni}-\frac{1}{4}\sum_{i=1}^n W_{ni}^2+\frac{1}{2}\sum_{i=1}^n W_{ni}^2R(W_{ni}) \tag{6.9} \]

with $R(W_{ni})\xrightarrow{P}0$ as $W_{ni}\xrightarrow{P}0$ (see Lemma A.3). Using (6.8) in (6.9) brings us a step closer to the final result,
\[ \log\left\{\prod_{i=1}^n\frac{p_n}{p}(X_i)\right\} = \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i)-\frac{1}{4}Pg^2-\frac{1}{4}\sum_{i=1}^n W_{ni}^2+\frac{1}{2}\sum_{i=1}^n W_{ni}^2R(W_{ni})+o_P(1). \tag{6.10} \]

Can we say something about the term $-\frac{1}{4}\sum_{i=1}^n W_{ni}^2$? We will show that this term converges in probability to $-\frac{1}{4}Pg^2$. Let us define the random variables $A_{ni}=nW_{ni}^2-g^2(X_i)$, $i=1,\ldots,n$. We argue that $A_{ni}\xrightarrow{P}0$. Indeed, we deduce
\[ A_{ni} = nW_{ni}^2-g^2(X_i) = 4n\left\{\frac{\sqrt{p_n}-\sqrt{p}}{\sqrt{p}}(X_i)\right\}^2-g^2(X_i). \]

From Proposition 6.1 we know that $\sqrt{n}\,\dfrac{\sqrt{p_n}-\sqrt{p}}{\sqrt{p}}(X_i)\xrightarrow{P}g(X_i)/2$. Henceforth, by the Continuous Mapping Theorem,
\[ n\left\{\frac{\sqrt{p_n}-\sqrt{p}}{\sqrt{p}}(X_i)\right\}^2\xrightarrow{P}\frac{1}{4}g^2(X_i) \]

and thus $A_{ni}\xrightarrow{P}0$. As a consequence, we deduce
\[ \sum_{i=1}^n W_{ni}^2 = \frac{1}{n}\sum_{i=1}^n g^2(X_i)+\frac{1}{n}\sum_{i=1}^n A_{ni}\xrightarrow{P}Pg^2. \]

From this it is clear that $\sum_{i=1}^n W_{ni}^2 = Pg^2+o_P(1)$. Using this in (6.10) gives
\[ \log\left\{\prod_{i=1}^n\frac{p_n}{p}(X_i)\right\} = \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i)-\frac{1}{2}Pg^2+\frac{1}{2}\sum_{i=1}^n W_{ni}^2R(W_{ni})+o_P(1). \tag{6.11} \]

This is almost (6.4). The last thing we need to show is that $\frac{1}{2}\sum_{i=1}^n W_{ni}^2R(W_{ni})=o_P(1)$. We will do this in a number of steps. First, we show that $nP(|W_{ni}|>\varepsilon\sqrt{2})\to 0$. Indeed, we get
\[ \begin{aligned} nP(|W_{ni}|>\varepsilon\sqrt{2}) &= nP(W_{ni}^2>2\varepsilon^2) = nP\{g^2(X_i)+A_{ni}>2n\varepsilon^2\} \\ &\le nP\{g^2(X_i)>n\varepsilon^2\}+nP(|A_{ni}|>n\varepsilon^2). \end{aligned} \]

Now introduce the variable $Y=g^2(X_i)$ and the set $\mathcal{G}=\{g^2(X_i)>n\varepsilon^2\}=\{Y:Y>n\varepsilon^2\}$. We can then write
\[ Pg^2 1\{g^2(X_i)>n\varepsilon^2\} = Pg^2 1(\mathcal{G}) = \int_{\mathcal{G}}y\,dP_Y(y), \]
where $P_Y$ denotes the distribution of $Y=g^2(X_i)$. Since $y\in\mathcal{G}$, we have
\[ Pg^2 1\{g^2(X_i)>n\varepsilon^2\}\ge\int_{\mathcal{G}}n\varepsilon^2\,dP_Y(y) = n\varepsilon^2 P(Y\in\mathcal{G}) = n\varepsilon^2 P\{g^2(X_i)>n\varepsilon^2\}. \]
Rewriting this yields
\[ nP\{g^2(X_i)>n\varepsilon^2\}\le\frac{1}{\varepsilon^2}Pg^2 1\{g^2(X_i)>n\varepsilon^2\}. \tag{6.12} \]
Now using (6.12) and Markov's inequality, we obtain
\[ nP(|W_{ni}|>\varepsilon\sqrt{2})\le\varepsilon^{-2}Pg^2 1\{g^2>n\varepsilon^2\}+\varepsilon^{-2}E|A_{ni}|. \]

We already argued that $A_{ni}\xrightarrow{P}0$. By the Continuous Mapping Theorem, we get $E|A_{ni}|\xrightarrow{P}0$. This is just a number, thus we have ordinary convergence $E|A_{ni}|\to 0$. Now, because $Pg^2<+\infty$, the set $\{g^2>n\varepsilon^2\}$ converges to a set with measure 0, because the score function $g$ can only be infinite in a zero-set. Therefore $Pg^2 1\{g^2>n\varepsilon^2\}\to 0$ and thus $nP(|W_{ni}|>\varepsilon\sqrt{2})\to 0$. This is an upper bound for $P(\max_{1\le i\le n}|W_{ni}|>\varepsilon\sqrt{2})$. Indeed, $P(\max_{1\le i\le n}|W_{ni}|>\varepsilon\sqrt{2})$ is the probability that at least one of the $|W_{ni}|$ is bigger than $\varepsilon\sqrt{2}$, so this is no bigger than the sum of the probability that $|W_{n1}|$ is bigger than $\varepsilon\sqrt{2}$ disregarding the other ones, the probability that $|W_{n2}|$ is bigger than $\varepsilon\sqrt{2}$ disregarding the other ones, and so forth. This implies
\[ \max_{1\le i\le n}|W_{ni}|\xrightarrow{P}0. \]

By the property of the function $R$, the sequence $\max_{1\le i\le n}|R(W_{ni})|$ converges in probability to zero as well. The term $\frac{1}{2}\sum_{i=1}^n W_{ni}^2R(W_{ni})$ is bounded by $\frac{1}{2}\max_{1\le i\le n}|R(W_{ni})|\sum_{i=1}^n W_{ni}^2$. We already know that $\sum_{i=1}^n W_{ni}^2\xrightarrow{P}Pg^2$, thus $\sum_{i=1}^n W_{ni}^2\xrightarrow{D}Pg^2$ and, by Prohorov's theorem, $\sum_{i=1}^n W_{ni}^2$ is bounded in probability. This yields
\[ \frac{1}{2}\max_{1\le i\le n}|R(W_{ni})|\sum_{i=1}^n W_{ni}^2 = o_P(1)O_P(1) = o_P(1). \]
Combining this with (6.11) gives us the final result,
\[ \log\left\{\prod_{i=1}^n\frac{p_n}{p}(X_i)\right\} = \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i)-\frac{1}{2}Pg^2+o_P(1). \]
This yields the theorem.

The local asymptotic normality can be obtained in the usual way by applying the CLT. This yields
\[ \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i) = \sqrt{n}\left\{\frac{1}{n}\sum_{i=1}^n g(X_i)-Pg\right\}\xrightarrow{D}N(0,Pg^2). \]
By Lemma 6.3 we now see that
\[ \log\prod_{i=1}^n\frac{dP_{1/\sqrt{n}}}{dP}(X_i)\xrightarrow{D}N(-Pg^2/2,\,Pg^2). \tag{6.13} \]
This result shows its importance in proving lower bound theorems and the representation of an asymptotically efficient estimator (see Chapter 7). Thus, because the log likelihood ratio is asymptotically normal for parametric submodels satisfying (6.1), it will follow that the information in these models with respect to $t$ can be compared by just looking at the variance $Pg^2$ of the score function $g$. This shows that the representation (6.4) of the log likelihood ratio is a quite important result and also reveals the importance of our definition of the score function $g$ in the context of differentiable paths to preclude awkward regularity conditions.
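The limit (6.13) can be checked by simulation. The following minimal Python sketch uses the normal location path $t\mapsto N(t,1)$ as an illustrative assumption of mine (it is not an example from the text); there $g(x)=x$ and $Pg^2=1$, so the log likelihood ratio should be approximately $N(-1/2,1)$.

    import numpy as np

    # Simulate log prod dP_{1/sqrt(n)}/dP (X_i) for the assumed path N(t, 1) vs N(0, 1).
    rng = np.random.default_rng(1)
    n, reps = 2000, 5000
    llr = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)
        t = 1.0 / np.sqrt(n)
        llr[r] = np.sum(-(x - t) ** 2 / 2 + x ** 2 / 2)   # exact log likelihood ratio
    print("mean:", llr.mean(), " (close to -1/2)")
    print("var :", llr.var(),  " (close to 1)")
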

6.2 Differentiable Functionals and Efficient Influence Function

For defining the information for estimating ψ(P ), only those submodels t 7→ Pt along which the parameter t 7→ ψ(Pt) is differentiable in an appropriate sense are of interest. A minimal requirement is that the map t 7→ ψ(Pt) is differentiable at t = 0, but we need more. We follow the definition as first introduced by Levit (1978), see [22] and Koshevnik and Levit (1976), see [16].

Definition 6.3. A map $\psi:\mathcal{P}\to\mathbb{B}$ is differentiable at $P$ relative to a given tangent set $\dot{\mathcal{P}}_P$ if there exists a continuous linear map $\dot\psi_P:L_2(P)\to\mathbb{B}$ such that for every $g\in\dot{\mathcal{P}}_P$ and a submodel $t\mapsto P_t$ with score function $g$,
\[ \frac{\partial\psi(P_t)}{\partial t}\Big|_{t=0} = \lim_{t\to 0}\frac{\psi(P_t)-\psi(P)}{t} = \dot\psi_P g. \tag{6.14} \]
As with the definition of a differentiable path, this definition also needs some comments. First, this definition requires that the derivative of the map $t\mapsto\psi(P_t)$ exists in the ordinary sense. Indeed, this is a map between $\mathbb{R}$ and the Banach space $\mathbb{B}$. For such a map we have the classical notion of a derivative with respect to the norm of $\mathbb{B}$. In addition, we require a special representation of this derivative. There must exist a continuous linear map $\dot\psi_P:L_2(P)\to\mathbb{B}$ that maps the score function $g$ of the parametric submodel $t\mapsto P_t$ to the derivative $\frac{\partial\psi(P_t)}{\partial t}\big|_{t=0}$. So, we are indeed asking more than ordinary differentiability at $t=0$.

Remark 6.4. For those familiar with advanced topics in functional analysis, note that this derivative is much like a Hadamard derivative of $\psi$, viewed as a map on the space of square roots of measures to $\mathbb{B}$. Indeed, we can write (6.14) as
\[ \left\|\frac{\psi(P_t)-\psi(P)}{t}-\dot\psi_P g\right\|_{\mathbb{B}}\to 0. \]
This is very similar to the definition of a Hadamard derivative, see for example Bickel et al. (1993), [5], §A.5.

At first sight it is not clear why we demand this kind of differentiability. To gain some insight, let us look at what happens in the case when we wish to estimate a $k$-dimensional Euclidean parameter $\psi(P)$, so $\mathbb{B}=\mathbb{R}^k$. We shall derive an interesting form for (6.14). Assume $\psi$ is differentiable at $P$. From Definition 6.3, we have the existence of a map $\dot\psi_P:L_2(P)\to\mathbb{R}^k$ such that for every $g\in\dot{\mathcal{P}}_P$ and a submodel $t\mapsto P_t$ with score function $g$, $\frac{\partial\psi(P_t)}{\partial t}\big|_{t=0}=\dot\psi_P g$. Componentwise application of the Riesz representation theorem for Hilbert spaces (see Theorem 2.5) allows us to write the derivative $\dot\psi_P$ in the form of a (vector-valued) inner product. Riesz' theorem yields the existence of a fixed, vector-valued, measurable function $\tilde\psi_P\in L_2^k(P)$, $\tilde\psi_P:\mathcal{X}\to\mathbb{R}^k$, such that
\[ \dot\psi_P g = \langle\tilde\psi_P,g\rangle_{L_2(P)} = \int\tilde\psi_P g\,dP\in\mathbb{R}^k. \]
The inner product $\langle\tilde\psi_P,g\rangle_{L_2(P)}$ is actually a $k$-dimensional vector of inner products,
\[ \big[\langle\tilde\psi_P^1,g\rangle_{L_2(P)},\langle\tilde\psi_P^2,g\rangle_{L_2(P)},\ldots,\langle\tilde\psi_P^k,g\rangle_{L_2(P)}\big]^T, \]
where $\tilde\psi_P^i$ denotes the $i$-th component of $\tilde\psi_P$. Now we obtain a more enlightening form,
\[ \frac{\partial\psi(P_t)}{\partial t}\Big|_{t=0} = P\tilde\psi_P g = E(\tilde\psi_P g). \tag{6.15} \]
A function $\tilde\psi_P$ satisfying (6.15) is defined to be an influence function³.

³ For those familiar with the robustness literature, note we can write (6.14) as
\[ \psi(P_t)-\psi(P) = t\,\dot\psi_P g + o(t) = t\int\tilde\psi_P g\,dP + o(t). \]
This representation is quite similar to that of the definition of an influence function in the robustness literature, where $P_t$ should be taken to be $(1-t)P+t\delta_x$, with $\delta_x$ the Dirac measure.

In the light of Part II, looking at Theorem 4.1, we see that the influence function $\varphi$ of a RAL estimator for the $k$-dimensional parameter $\psi(P)$ satisfies (6.15) for all one-dimensional parametric submodels $t\mapsto P_t$ with corresponding score function $g$. Therefore, one way of finding an influence function $\tilde\psi_P$ is to find the influence function $\varphi$ of any RAL estimator; we are then assured the parameter $\psi(P)$ is differentiable in the appropriate sense. However, this is not necessary. There may be other ways to obtain an influence function $\tilde\psi_P$ that is not necessarily the influence function of a RAL estimator. Indeed, as shown by van der Vaart (1991) in [38], ordinary differentiability of $\psi(P_t)$ as a function of $t$ and the existence of a regular estimator (not necessarily asymptotically linear) are sufficient for the existence of such a $\tilde\psi_P$. The only-if part of Theorem 4.1 specializes this sufficient condition. In addition, this definition of an influence function does not require the influence function to have mean zero. Henceforth, we have generalized the concept of an influence function and furthermore, it seems we are on the right track to define efficiency without assuming the existence of a RAL estimator; and, as it should be for a good generalization, by Theorem 4.1 the influence function $\varphi$ of a RAL estimator for the parameter $\psi(P)$ satisfies (6.15).
The Riesz representation theorem assures uniqueness of the function $\tilde\psi_P$ when the inner products $\langle\cdot,\cdot\rangle_{L_2(P)}$ are specified for all functions of $L_2(P)$. Here, the function $\tilde\psi_P$ is not uniquely defined by the functional $\psi$ and the model $\mathcal{P}$, since only inner products of $\tilde\psi_P$ with elements $g$ of the tangent set $\dot{\mathcal{P}}_P$ are specified. The tangent set does not span all of $L_2(P)$. Nonetheless, using the projection theorem of Hilbert spaces, we can construct a unique $\tilde\psi_P$ whose coordinate functions are contained in $\overline{\operatorname{lin}}\,\dot{\mathcal{P}}_P$, the mean-square closure of the linear span of the tangent set. This $k$-dimensional function is unique. This is quite straightforward. Every $i$-th component $\tilde\psi_P^i$ of $\tilde\psi_P$ is an element of $L_2(P)$. By the projection theorem of Hilbert spaces, there exists a unique element in $\overline{\operatorname{lin}}\,\dot{\mathcal{P}}_P$ that minimizes $\|\tilde\psi_P^i-h\|_{L_2(P)}$ over all $h\in\overline{\operatorname{lin}}\,\dot{\mathcal{P}}_P$. Because $\overline{\operatorname{lin}}\,\dot{\mathcal{P}}_P$ is a closed linear subspace of $L_2(P)$, the orthogonal projection $\Pi\big(\tilde\psi_P^i\,\big|\,\overline{\operatorname{lin}}\,\dot{\mathcal{P}}_P\big)$ minimizes this norm. After projecting each component of $\tilde\psi_P$ onto $\overline{\operatorname{lin}}\,\dot{\mathcal{P}}_P$, we obtain the unique $k$-dimensional vector

\[ \Pi\big(\tilde\psi_P\,\big|\,\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P\big) = \Big[\Pi\big(\tilde\psi_P^1\,\big|\,\overline{\operatorname{lin}}\,\dot{\mathcal{P}}_P\big),\Pi\big(\tilde\psi_P^2\,\big|\,\overline{\operatorname{lin}}\,\dot{\mathcal{P}}_P\big),\ldots,\Pi\big(\tilde\psi_P^k\,\big|\,\overline{\operatorname{lin}}\,\dot{\mathcal{P}}_P\big)\Big]^T\in\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P. \]

Because $\tilde\psi_P-\Pi\big(\tilde\psi_P\,\big|\,\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P\big)\perp\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P$, the inner product
\[ \Big\langle\tilde\psi_P-\Pi\big(\tilde\psi_P\,\big|\,\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P\big),\,g\Big\rangle_{L_2(P)} \]
equals zero for all $g\in\dot{\mathcal{P}}_P$. This yields
\[ \frac{\partial\psi(P_t)}{\partial t}\Big|_{t=0} = \langle\tilde\psi_P,g\rangle_{L_2(P)} = \Big\langle\Pi\big(\tilde\psi_P\,\big|\,\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P\big),\,g\Big\rangle_{L_2(P)}. \]
This unique influence function $\Pi\big(\tilde\psi_P\,\big|\,\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P\big)$ is defined to be the efficient influence function and, with a slight abuse of notation, we denote it again by $\tilde\psi_P$. So, for further reference, when we write $\tilde\psi_P$, we refer to the efficient influence function. By the reasoning outlined above, it can be found as the projection of any other influence function onto $\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P$. Henceforth, another way to define an influence function is as a measurable function $\psi_P\in L_2^k(P)$, $\psi_P:\mathcal{X}\to\mathbb{R}^k$, whose orthogonal projection onto $\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P$ is the efficient influence function $\tilde\psi_P$. When we talk about score operators in Chapter 8, we will prove the uniqueness of this efficient influence function in another way.

Remark 6.5. Sometimes influence functions and the efficient influence function are referred to as gradients and the canonical gradient, respectively, see van der Vaart (1988), [36].

At this point, we can ask ourselves the question: why do we give this projection $\tilde\psi_P$ the special name efficient influence function? This name suggests that $\tilde\psi_P$ has something to do with efficiency of estimators for $\psi(P)$. Indeed, as we shall see in the next chapter, the efficient influence function $\tilde\psi_P$ will characterize asymptotic efficiency bounds. However, with the knowledge we have obtained from Part II, especially from Chapter 4, it is not surprising that we define this unique projection onto the closed linear span of the tangent set to be the efficient influence function. We explain this in more detail below.
First, we argue that Definition 6.2 of the tangent set $\dot{\mathcal{P}}_P$ coincides with the definition of the semiparametric tangent space as defined in Chapter 4, as the mean-square closure of all tangent spaces of regular parametric submodels. This was denoted by $J$. From the discussion in the previous section, we convinced ourselves that the objects $g$ we defined to be score functions correspond to ordinary pointwise score functions for one-dimensional submodels in sufficiently smooth cases. We know that $J\subset\mathcal{H}_k$ and therefore, scores in the ordinary sense have mean zero and finite variance. By Lemma 6.1, scores $g\in\dot{\mathcal{P}}_P$ also possess these properties, and therefore we see that $\dot{\mathcal{P}}_P\subset\mathcal{H}_1$. However, since we are estimating a $k$-dimensional parameter $\psi(P)$ in the ordinary sense, $J$ is defined to be a closed space of $k$-dimensional functions. In contrast, $\dot{\mathcal{P}}_P\subset\mathcal{H}_1$ is defined to be a set of one-dimensional functions. First, to meet the linearity condition, we take the linear span, i.e., $\operatorname{lin}\dot{\mathcal{P}}_P$. Secondly, to meet the closedness condition, we take the mean-square closure, i.e., $\overline{\operatorname{lin}}\,\dot{\mathcal{P}}_P$. Finally, in Chapter 4, we deduced that $J$ is a $k$-replicating space, i.e., $J$ can be written as the direct sum of $k$ copies of a closed subspace of $\mathcal{H}_1$. Henceforth, it is clear that $J$ and $\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P$ represent quite similar objects, since $\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P$ clearly is a $k$-replicating space. The only difference is that $J$ can be seen as the maximal tangent space, while $\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P$ can be seen as a closed subspace of $J$, since we defined a tangent set to be any set of score functions. When $\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P$ represents the maximal tangent set, they can be considered to be the same. This argument now shows that the definition of the efficient influence function $\tilde\psi_P$ as the orthogonal projection onto $\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P$ is very natural. If we take $\overline{\operatorname{lin}}^k\,\dot{\mathcal{P}}_P$ to be the maximal tangent set, the efficient influence function $\tilde\psi_P$ equals the efficient influence function $\varphi_{\mathrm{eff}}$ as defined in (4.10).
The flexibility we have allowed in the definition of the tangent set now shows its advantage. These tangent sets $\dot{\mathcal{P}}_P$ are made to depend both on the model $\mathcal{P}$ and the functional $\psi$. We do not always want to use the maximal tangent set, which is the set of all score functions of differentiable submodels $t\mapsto P_t$, because the parameter $\psi$ may not be differentiable relative to it. Note this implies no RAL estimator for $\psi(P)$ can be found. According to our definition of a tangent set, every subset of a tangent set is a tangent set itself. We may conclude that the efficient influence function $\tilde\psi_P$ is a proper extension of the efficient influence function $\varphi_{\mathrm{eff}}$ for RAL estimators.
In addition, it is exciting that the definitions we gave here, which at first sight seem rather strange and abstract, turn out to be not so strange at all: they really extend the ideas put forward in Part II and they will enable us to develop a much broader and richer approach to defining efficiency in semiparametric models. We now end this section with an interesting property. The definition of a differentiable path implies a geometric view of the maximal tangent set. This is the content of the following proposition.

Proposition 6.2. The maximal tangent set is a cone, i.e., if $g\in\dot{\mathcal{P}}_P$ and $a\ge 0$, then $ag\in\dot{\mathcal{P}}_P$.

Proof. Suppose $g$ is a score function with corresponding submodel $t\mapsto P_t$, i.e., $g$ satisfies (6.1). The case $a=0$ is trivial. For an arbitrary $a>0$, $ag$ is a score function with corresponding submodel $t\mapsto P_{at}$. Indeed, for sufficiently small $t$, the definition of $g$ yields
\[ \int\left[\frac{dP_{at}^{1/2}-dP^{1/2}}{at}-\frac{1}{2}g\,dP^{1/2}\right]^2\to 0. \]
This is equivalent with
\[ \int\left[\frac{dP_{at}^{1/2}-dP^{1/2}}{t}-\frac{1}{2}ag\,dP^{1/2}\right]^2\to 0, \]
which tells us that $ag$ is the score function with corresponding submodel $t\mapsto P_{at}$.

It seems that it is rarely a loss of generality to assume that the tangent set we work with is a cone as well. The submodels $t\mapsto P_t$ and $t\mapsto P_{at}$, with corresponding score functions $g$ and $ag$ respectively, are said to be submodels in the same direction. For typical semiparametric models $\mathcal{P}$, $\mathcal{P}$ is a large set of probability measures. Henceforth, there may be submodels in many directions. By this geometric interpretation, we can refer to a tangent set as a tangent cone. This corresponds with the classical notion of a cone as the union of lines pointing in different directions.

6.3 Examples

The previous section was quite abstract and tough. Therefore, we now consider some examples that will clarify the concepts of a tangent set and the efficient influence function as presented in this abstract way. First, we consider a parametric model and obtain a familiar result. Secondly, we look at the other extreme by considering a nonparametric model, where we will also obtain familiar results. Finally, we try to get some intuition into the shape of the tangent set of a semiparametric model, namely Cox's proportional hazards model.

6.3.1 Parametric Model

In the parametric case, we already know that the maximum likelihood estimator is the most efficient estimator in the asymptotic sense. The MLE is obtained by maximizing the log likelihood, or equivalently, by solving the score equations. Fortunately, we obtain the same result when applying the theory developed above.
Consider a parametric model with parameters $\theta$ ranging over an open subset $\Theta$ of $\mathbb{R}^m$, given by densities $p_\theta$ with respect to some measure $\mu$, e.g., the ordinary Lebesgue measure. We assume that the model is sufficiently smooth. In addition, suppose that there exists a vector-valued measurable map $\dot\ell_\theta:\mathcal{X}\to\mathbb{R}^m$ such that, as $h\to 0$, $h\in\mathbb{R}^m$,
\[ \int\left[p_{\theta+h}^{1/2}-p_\theta^{1/2}-\frac{1}{2}h^T\dot\ell_\theta\,p_\theta^{1/2}\right]^2 d\mu = o(\|h\|^2). \tag{6.16} \]
This condition enables us to derive the familiar tangent set of a parametric model.

Proposition 6.3. A tangent set at $P_\theta$ is given by the linear space $\operatorname{lin}\dot\ell_\theta = \{h^T\dot\ell_\theta : h\in\mathbb{R}^m\}$. So $\dot{\mathcal{P}}_{P_\theta}=\operatorname{lin}\dot\ell_\theta$ is a tangent space.

Proof. Fix an arbitrary $h\in\mathbb{R}^m$ and consider the one-dimensional submodel $t\mapsto P_{\theta+th}$ with $t$ sufficiently small. We need to show that $h^T\dot\ell_\theta$ is the corresponding score function. First, note that $\|th\|^2$ tends to zero as $t$ approaches zero. Equation (6.16) yields
\[ \int\left[p_{\theta+th}^{1/2}-p_\theta^{1/2}-\frac{1}{2}th^T\dot\ell_\theta\,p_\theta^{1/2}\right]^2 d\mu = o(\|th\|^2) = o(t^2\|h\|^2). \]

Dividing both sides by $t^2$ and taking the limit for $t$ tending to zero gives
\[ \lim_{t\to 0}\int\left[\frac{p_{\theta+th}^{1/2}-p_\theta^{1/2}}{t}-\frac{1}{2}h^T\dot\ell_\theta\,p_\theta^{1/2}\right]^2 d\mu = 0. \]

The result now follows by Definition 6.1.

If the model is sufficiently smooth, see Lemma 6.2, $\dot\ell_\theta$ is the ordinary score function, the derivative of the log likelihood. Thus, the tangent space at $P_\theta$ is spanned by the score functions for the coordinates of the parameter $\theta$.
Now suppose one wishes to estimate a $k$-dimensional function $\chi:\Theta\to\mathbb{R}^k$. We can now explicitly describe the relation between the tangent space for a parametric model as defined in Part II and the tangent space obtained here. Comparing the tangent space $\dot{\mathcal{P}}_{P_\theta}$ with the tangent space $J$ for the parametric model (see Definition 3.8), we see that
\[ J = \operatorname{lin}^k\dot\ell_\theta = \dot{\mathcal{P}}_{P_\theta}^k, \]
and this corresponds with the fact that $J$ is a $k$-replicating space, see Example 3.3. Henceforth, by the argument given in the previous section, we should obtain a familiar form for the efficient influence function, as we describe below.
Let us investigate the differentiability of the parameter $\chi(\theta)$. The Fisher information matrix for estimation of $\theta$ is $I_\theta=P_\theta(\dot\ell_\theta\dot\ell_\theta^T)$. If $I_\theta$ is invertible, then every map $\chi:\Theta\to\mathbb{R}^k$ that is differentiable in the ordinary sense as a map between Euclidean spaces is differentiable as a map $\psi(P_\theta)=\chi(\theta)$ on the model relative to the given tangent space $\dot{\mathcal{P}}_{P_\theta}$. This follows, as we have shown above, because the submodel $t\mapsto P_{\theta+th}$ has score function $h^T\dot\ell_\theta$ and, by using the chain rule,
\[ \frac{\partial\psi(P_{\theta+th})}{\partial t}\Big|_{t=0} = \frac{\partial\chi(\theta+th)}{\partial t}\Big|_{t=0} = \frac{\partial\chi}{\partial\theta}\times\frac{\partial(\theta+th)}{\partial t}\Big|_{t=0} = \dot\chi_\theta h. \]
The derivative $\dot\chi_\theta$ is a $k\times m$ matrix, thus a linear operator between $\mathbb{R}^m$ and $\mathbb{R}^k$. Can we now derive the efficient influence function from this equation? It is not enough to write this as $\int\tilde\psi_P g\,dP$ without further comments, see (6.15). We need to argue that this is the unique influence function in $\{h^T\dot\ell_\theta : h\in\mathbb{R}^m\}$. This is quite straightforward, using the non-singularity of $I_\theta$ and the associativity of matrix multiplication,
\[ \frac{\partial\psi(P_{\theta+th})}{\partial t}\Big|_{t=0} = \dot\chi_\theta I_\theta^{-1}I_\theta h = P_\theta\big\{(\dot\chi_\theta I_\theta^{-1}\dot\ell_\theta)(\dot\ell_\theta^T h)\big\} = P_\theta\big\{(\dot\chi_\theta I_\theta^{-1}\dot\ell_\theta)(h^T\dot\ell_\theta)\big\} \]
for each $h\in\mathbb{R}^m$. We conclude that $\tilde\psi_{P_\theta}=\dot\chi_\theta I_\theta^{-1}\dot\ell_\theta$ is the efficient influence function, since it is in the tangent set $\dot{\mathcal{P}}_{P_\theta}$.

Now compare this with (3.23). We obtain the exact same result as expected, but in a different notation: $\Gamma(\theta_0)=\dot\chi_\theta$, $I(\theta_0)=I_\theta$ and $S_\theta=\dot\ell_\theta$.
We now focus attention on an interesting common case. Suppose the parameter $\theta$ can be written as $(\beta^T,\eta^T)^T$, where $\beta\in\mathbb{R}^k$ is the parameter of interest and $\eta\in\mathbb{R}^{m-k}$ is the nuisance parameter. We wish to find the efficient influence function for the parameter $\beta$. Thus, $\psi(P_\theta)=\chi(\theta)=\beta$. The derivative $\dot\chi_\theta$ becomes
\[ \dot\chi_\theta = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 & \cdots & 0 \end{pmatrix} = \big[I_{k\times k}\ \ 0_{k\times(m-k)}\big], \]
where $I_{k\times k}$ is the $k$-dimensional identity matrix and $0_{k\times(m-k)}$ is a $k\times(m-k)$ matrix of zeros. The efficient influence function $\tilde\psi_{P_\theta}=\dot\chi_\theta I_\theta^{-1}\dot\ell_\theta$ thus consists of the first $k$ components of the vector $I_\theta^{-1}\dot\ell_\theta$.
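The formula $\tilde\psi_{P_\theta}=\dot\chi_\theta I_\theta^{-1}\dot\ell_\theta$ is easy to verify numerically. The following minimal Python sketch uses the normal model $N(\mu,\sigma^2)$ with $\theta=(\mu,\sigma)$ and $\chi(\theta)=\mu$; this concrete model and all variable names are illustrative assumptions of mine, not an example from the text. Here the efficient influence function reduces to $x-\mu$, and the Monte Carlo average of $\tilde\psi_{P_\theta}\dot\ell_\theta^T$ recovers $\dot\chi_\theta$, in line with (6.15).

    import numpy as np

    mu, sigma = 1.0, 2.0
    rng = np.random.default_rng(2)
    x = rng.normal(mu, sigma, size=100000)

    score = np.vstack([(x - mu) / sigma**2,                   # d/dmu log p
                       ((x - mu)**2 - sigma**2) / sigma**3])  # d/dsigma log p
    I = np.diag([1 / sigma**2, 2 / sigma**2])                 # Fisher information of N(mu, sigma^2)
    chi_dot = np.array([1.0, 0.0])                            # derivative of chi(theta) = mu

    eif = chi_dot @ np.linalg.inv(I) @ score                  # efficient influence function
    print(np.allclose(eif, x - mu))                           # True: it equals x - mu here
    print(score @ eif / x.size)                               # approx. chi_dot = [1, 0], cf. (6.15)
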

6.3.2 Nonparametric Model

Let us look at the other extreme, a nonparametric model. In this case, the model $\mathcal{P}$ consists of all probability laws on the sample space. We focus attention on the construction of the tangent set. We will obtain the same result as in §4.5.1. Let us denote, as usual, the tangent set at $P$ by $\dot{\mathcal{P}}_P$. We argue that this set consists of all measurable functions $g\in L_2(P)$, i.e., $Pg^2=\int g^2\,dP<+\infty$, satisfying $Pg=\int g\,dP=0$. Because, by Lemma 6.1, any score function satisfies this mean zero property and has finite variance, we know that $\dot{\mathcal{P}}_P\subset\mathcal{H}_1$, where we used the notation of Part II. Since we impose no further restrictions on the model, it seems plausible that we should indeed obtain this maximal tangent set. The proof is sketched in much detail. It shows how we can use the abstract Definition 6.1 and the notion of parametric submodels to check that some conjecture of a tangent set is really a tangent set.

Proposition 6.4. The tangent set $\dot{\mathcal{P}}_P$ is the set of all mean-zero measurable functions $g\in L_2(P)$, i.e., $\dot{\mathcal{P}}_P=\mathcal{H}_1$. In particular, this set is linear, so $\dot{\mathcal{P}}_P$ is a tangent space.

Proof. We need to show that every mean-zero measurable function g ∈ L2(P ) is a score function. For doing so, we construct a corresponding parametric submodel. It suffices to exhibit suitable one-dimensional submodels. The true density is denoted by p0(x). As usual, we assume the model is dominated by a measure µ. First, we take a bounded function g. More explicitly, let g be such that ∃M ∈ R for which supx∈X |g(x)| < M. We consider two types of examples. Consider a submodel linear in g,

pt(x) = {1 + tg(x)}p0(x). (6.17)

By the boundedness of $g$, we are assured that for $t$ sufficiently small $\{1+tg(x)\}\ge 0$ for all $x$. In addition, we have that
\[ \int p_t(x)\,d\mu(x) = \int\{1+tg(x)\}p_0(x)\,d\mu(x) = \int dP(x)+t\int g(x)\,dP(x) = 1+tPg = 1, \]
where we used the fact that $Pg=0$. Henceforth, we have defined proper density functions for sufficiently small $t$. It is easy to check the pointwise definition of the score function as the derivative of the log likelihood,
\[ g(x) = \frac{\partial\log p_t(x)}{\partial t}\Big|_{t=0}. \]
It follows that this submodel has score function $g$ at $t=0$ in the ordinary sense. We ask ourselves the question: is this also a score function in the sense of (6.1)? This follows by Lemma 6.2, but also by a direct calculation, as we demonstrate here. Using (6.3) yields
\[ \begin{aligned} &\lim_{t\to 0}\int\left[\frac{\sqrt{\{1+tg(x)\}p_0(x)}-\sqrt{p_0(x)}}{t}-\frac{1}{2}g(x)\sqrt{p_0(x)}\right]^2 d\mu(x) \\ &\qquad = \lim_{t\to 0}\int\left\{\frac{\sqrt{1+tg(x)}-1}{t}-\frac{1}{2}g(x)\right\}^2 p_0(x)\,d\mu(x) \\ &\qquad = \lim_{t\to 0}\int\left\{\frac{\sqrt{1+tg(x)}-1}{t}-\frac{1}{2}g(x)\right\}^2 dP(x) \\ &\qquad \stackrel{(1)}{=} \int\left\{\frac{\partial\sqrt{1+tg(x)}}{\partial t}\Big|_{t=0}-\frac{1}{2}g(x)\right\}^2 dP(x) = 0. \end{aligned} \]
We need an argument to assure we can interchange the limit and integration, so that the equality (1) is valid. The argument is quite technical and not very enlightening. Therefore, we will only demonstrate this for the calculation above. We will need to give a similar argument below, but we skip that since it is so technical. In addition, the less interested reader can skip the rest of the proof because, for practical calculations, the most important arguments are given. Nonetheless, it is important to read the notes below the proof, since there we describe why it is actually justified to end the proof here. The argument to interchange limit and integration proceeds as follows. Introduce the function

\[ f(x,t) = \sqrt{1+tg(x)}, \]
for all $x$ and $t$ sufficiently small. Using the mean-value theorem, we can write, almost everywhere,
\[ \frac{\sqrt{1+tg(x)}-1}{t} = \frac{f(x,t)-f(x,0)}{t} = \frac{\partial f}{\partial t}(x,\xi_t), \]
where $\xi_t\in(0,t)$ and $\xi_t\to 0$ as $t\to 0$. We have that
\[ \frac{\partial f}{\partial t}(x,t) = \frac{1}{2\sqrt{1+tg(x)}}\,g(x), \]
and especially for $t=0$, we have
\[ \frac{\partial f}{\partial t}(x,0) = \frac{1}{2}g(x). \]
If we take $t$ to be sufficiently small and using the boundedness of $g$, we have that
\[ \left|\frac{\partial f}{\partial t}(x,t)\right|\le|g(x)|, \]
almost everywhere. Next, we write
\[ \int\left\{\frac{\sqrt{1+tg(x)}-1}{t}-\frac{1}{2}g(x)\right\}^2 dP(x) = \int\left\{\frac{\sqrt{1+tg(x)}-1}{t}\right\}^2 dP(x) - \int\left\{\frac{\sqrt{1+tg(x)}-1}{t}\right\}g(x)\,dP(x) + \frac{1}{4}\int g^2(x)\,dP(x). \]

Now put $t_n=1/n$ for $n\ge n^*$ such that $t_n$ is sufficiently small. We first deal with the first part of the latter equation. Define, for $n\ge n^*$,
\[ F_n(x) = \left\{\frac{f(x,t_n)-1}{t_n}\right\}^2. \]

It follows that $F_n(x)\le g^2(x)$ almost everywhere and, since the score function $g\in L_2(P)$, we know that $\int g^2(x)\,dP(x)<+\infty$. Finally, we note that $F_n$ converges almost everywhere to $\frac{1}{4}g^2(x)$. Now using the Dominated Convergence Theorem of Lebesgue, we can interchange limit and integration and obtain
\[ \lim_{t\to 0}\int\left\{\frac{\sqrt{1+tg(x)}-1}{t}\right\}^2 dP(x) = \lim_{n\to+\infty}\int F_n(x)\,dP(x) = \int\lim_{n\to+\infty}F_n(x)\,dP(x) = \frac{1}{4}\int g^2(x)\,dP(x). \]
For the second part, we define, for $n\ge n^*$,
\[ G_n(x) = \left\{\frac{f(x,t_n)-1}{t_n}\right\}g(x). \]

It follows that $|G_n(x)|\le g^2(x)$ almost everywhere and, since the score function $g\in L_2(P)$, we know that $\int g^2(x)\,dP(x)<+\infty$. In addition, $G_n$ converges almost everywhere to $\frac{1}{2}g^2(x)$. Now using the Dominated Convergence Theorem of Lebesgue, we can interchange limit and integration and obtain
\[ \lim_{t\to 0}\int\left\{\frac{\sqrt{1+tg(x)}-1}{t}\right\}g(x)\,dP(x) = \lim_{n\to+\infty}\int G_n(x)\,dP(x) = \int\lim_{n\to+\infty}G_n(x)\,dP(x) = \frac{1}{2}\int g^2(x)\,dP(x). \]
Putting everything together, we obtain
\[ \lim_{t\to 0}\int\left\{\frac{\sqrt{1+tg(x)}-1}{t}-\frac{1}{2}g(x)\right\}^2 dP(x) = \int\left\{\frac{\partial f}{\partial t}(x,0)-\frac{1}{2}g(x)\right\}^2 dP(x) = \int\left\{\frac{1}{2}g(x)-\frac{1}{2}g(x)\right\}^2 dP(x) = 0. \]
This ends the proof for a bounded function $g$. The model linear in $g$ does not give a hint what to do with an unbounded $g$. Therefore we consider a one-dimensional exponential family

\[ p_t(x) = c(t)\exp\{tg(x)\}p_0(x), \]

where $g$ is still bounded. To be a proper density, we require some restrictions on the function $c(t)$. First, we require that the model gives $p_0(x)$ at $t=0$, thus $c(0)=1$. The function $c(t)$ is also chosen such that $\int p_t(x)\,d\mu(x)=1$. A further property is that $c'(0)=0$. Indeed, we know that every density function integrates to 1, thus $\int p_t(x)\,d\mu(x)=1$ for all $t$ sufficiently small. Differentiating with respect to $t$ yields
\[ \int c'(t)\exp\{tg(x)\}p_0(x)\,d\mu(x) + \int c(t)\exp\{tg(x)\}g(x)p_0(x)\,d\mu(x) = 0 \]
for every $t$ sufficiently small. Especially, for $t=0$,
\[ 0 = \int c'(0)p_0(x)\,d\mu(x) + \int g(x)p_0(x)\,d\mu(x) = c'(0)+Pg = c'(0). \]

The log likelihood is then $\log p_t(x)=\log c(t)+tg(x)+\log p_0(x)$. From this and the restrictions on $c(t)$, it is clear that $g(x)$ is a score function for this model in the ordinary sense. Again, by Lemma 6.2 or by a direct calculation, it follows that $g$ is also a score function in the sense of (6.1). The calculation is analogous to the previous case. The advantage now is that we get a clue how to construct an appropriate submodel for an unbounded function $g$. The submodels above are not well-defined for an unbounded $g$. However, consider a model of the same structure as the exponential family, $p_t(x)=c(t)k\{tg(x)\}p_0(x)$ for a nonnegative function $k$ with $k(0)=k'(0)=1$ and with $c(t)$ as above. The function $k(x)=2(1+e^{-2x})^{-1}$ is bounded and can be used with any $g$. Because $k\{tg(x)\}$ is now bounded, this defines a proper density for $t$ sufficiently small. To show that $g$ is the corresponding score function, we show that (6.1) is fulfilled. The model is again dominated by $\mu$. We obtain:
\[ \begin{aligned} &\lim_{t\to 0}\int\left[\frac{\sqrt{c(t)k\{tg(x)\}p_0(x)}-\sqrt{p_0(x)}}{t}-\frac{1}{2}g(x)\sqrt{p_0(x)}\right]^2 d\mu(x) \\ &\qquad = \lim_{t\to 0}\int\left\{\frac{\sqrt{c(t)k\{tg(x)\}}-1}{t}-\frac{1}{2}g(x)\right\}^2 dP(x) \\ &\qquad = \int\left\{\frac{\partial\sqrt{c(t)k\{tg(x)\}}}{\partial t}\Big|_{t=0}-\frac{1}{2}g(x)\right\}^2 dP(x) \\ &\qquad = \int\left[\frac{1}{2\sqrt{c(t)k\{tg(x)\}}}\big[c'(t)k\{tg(x)\}+c(t)k'\{tg(x)\}g(x)\big]\Big|_{t=0}-\frac{1}{2}g(x)\right]^2 dP(x) \\ &\qquad = \int\left[\frac{1}{2\sqrt{c(0)k(0)}}\big\{c'(0)k(0)+c(0)k'(0)g(x)\big\}-\frac{1}{2}g(x)\right]^2 dP(x) = 0. \end{aligned} \]

P (ag1 + bg2) = aP g1 + bP g2 = 0, for every a, b ∈ R.

Because L2(P ) is a linear space and by the latter equation, the tangent set is a tangent space.

We end this section with some additional notes. First, we explain why it is justified to end the proof at the point where we only considered bounded functions g. Since we will be interested 130 Chapter 6. Tangent Sets and Efficient Influence Function in the efficient influence function ψ˜ defined as the componentwise projection onto the mean- . P. square closure of , we can take to be the set of all bounded functions g ∈ L (P ) with PP PP 2 . P g = 0. Indeed, the bounded functions are dense in L2(P ) and hence we have that lin PP = H1. Therefore, the efficient influence function remains the same. Secondly, we put this result in contrast with the result obtained in §4.5.1. Suppose we are estimating a k-dimensional function ψ(P ). As seen in §4.5.1, the nonparametric tangent space . k k equals J = Hk and this is a k-replicating space, i.e., Hk = H1, thus we have that J = lin PP . Finally, we clearly see that the calculations made here using Definition 6.1 are very technical and rather difficult. Therefore, we rather use the classical notion of a score function for practical applications and the Definition 6.1 is for theoretical purposes only.

6.3.3 Proportional Hazards Model

Remember the joint density function of an observation x = (t, z) in the proportional hazards model was given by equation (1.8),

\[ (t,z)\mapsto\lambda(t)\,e^{\theta^Tz}\,e^{-e^{\theta^Tz}\Lambda(t)}\,p_Z(z). \]

This example shows how tangent sets are often constructed in practice, in a somewhat loose way. We assume sufficient smoothness such that Lemma 6.2 can be used. The logarithm of the density is
\[ \log p_{T,Z}\{t,z;\theta,\lambda(\cdot),p_Z(\cdot)\} = \log\lambda(t)+\theta^Tz-e^{\theta^Tz}\Lambda(t)+\log p_Z(z). \]
The score function for $\theta$ for the observation can be obtained by differentiating this logarithm,
\[ \dot\ell_{\theta,\Lambda}(x) = \frac{\partial}{\partial\theta}\log p_{T,Z}\{t,z;\theta,\lambda(\cdot),p_Z(\cdot)\} = z - ze^{\theta^Tz}\Lambda(t). \tag{6.18} \]

To obtain a score for $\lambda$, we insert appropriate parametric submodels $s\mapsto\lambda_s$. In Chapter 9, we shall see how this can be done. We then differentiate $\log p_{T,Z}\{t,z;\theta,\lambda_s(\cdot),p_Z(\cdot)\}$ with respect to $s$ at $s=0$. If $a$ is the derivative of $\log\lambda_s$ at $s=0$, then the corresponding score function is
\[ \begin{aligned} \frac{\partial}{\partial s}\Big|_{s=0}\log p_{T,Z}\{t,z;\theta,\lambda_s(\cdot),p_Z(\cdot)\} &= \frac{\partial}{\partial s}\Big|_{s=0}\left\{\log\lambda_s(t)-e^{\theta^Tz}\int_0^t\lambda_s(u)\,du\right\} \\ &= a(t)-e^{\theta^Tz}\int_0^t\frac{\partial\log\lambda_s(u)}{\partial s}\Big|_{s=0}\lambda(u)\,du \\ &= a(t)-e^{\theta^Tz}\int_0^t a(u)\,d\Lambda(u). \end{aligned} \]
For further reference, we write this as
\[ B_{\theta,\Lambda}a(x) = a(t)-e^{\theta^Tz}\int_0^t a(u)\,d\Lambda(u). \tag{6.19} \]
This means that a score for $\lambda$, equivalently a score for $\Lambda$, can be found as an operator working on functions $a$. The concept of a score operator will be studied later on in a more general setting and will then be applied to this particular model. So, do not worry if this notation seems rather odd.
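To make the operator notation (6.19) less odd, here is a minimal Python sketch of how one could evaluate $B_{\theta,\Lambda}a$ at a single observation $x=(t,z)$; the constant baseline hazard, the direction $a(u)=u$ and all names below are hypothetical choices of mine, and the integral is simply approximated by the trapezoidal rule.

    import numpy as np

    def score_operator(a, t, z, theta, lam, grid_size=1000):
        # B_{theta,Lambda} a (x) = a(t) - exp(theta'z) * int_0^t a(u) lam(u) du
        u = np.linspace(0.0, t, grid_size)
        integral = np.trapz(a(u) * lam(u), u)
        return a(t) - np.exp(theta @ z) * integral

    # hypothetical example: constant baseline hazard 0.3 and direction a(u) = u
    theta = np.array([0.5])
    z = np.array([1.0])
    print(score_operator(a=lambda u: u, t=2.0, z=z, theta=theta,
                         lam=lambda u: np.full_like(u, 0.3)))
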

Finally, scores for the density $p_Z$ are $L_2$-functions $b(z)$ with mean zero, because there are no restrictions on the distribution of $Z$ other than that it should define a proper distribution. Thus we can use the result from the previous example, the nonparametric model. We derived scores for the three different components of the model. The tangent space contains the linear span of all these functions. This example shows how scores can be calculated in practice. As we already noticed, when the density functions are sufficiently smooth (e.g., so we may apply Lemma 6.2), the scores as calculated above are also scores in the sense of Definition 6.1.

Chapter 7

Lower Bounds

In the previous chapter, we built the basic theory on semiparametric estimation. At this point, we have good knowledge about the concept of a tangent set and a notion of information covered by semiparametric models. We also introduced a special influence function, the efficient influence function $\tilde\psi_P$. In this chapter, we will see how $\tilde\psi_P$ is related to the efficiency of semiparametric estimators, thus justifying its name. We state a number of theorems giving lower bounds on the asymptotic performance of estimators and make these concrete for the estimation of a parameter $\theta$ in a semiparametric model in a strict sense,

P = {Pθ,η : θ ∈ Θ, η ∈ H},

where $\Theta$ is a subset of $\mathbb{R}^m$ and $H$ is some infinite-dimensional set.

7.1 Why Are Semiparametric Efficiency Bounds so Important?

The importance of semiparametric efficiency bounds is described very well in Newey (1990), [27], p.99. Semiparametric efficiency bounds are of fundamental importance for semiparametric models in several ways. First, such bounds quantify the efficiency loss that can result from a semiparametric rather than a parametric approach. This can be done by comparing the semiparametric efficiency bound, thus the best we can expect from a semiparametric estimator, with the efficiency bound of an alternative parametric model, which is easily calculated from the Fisher information matrix. The extent of this loss is then important for the decision whether or not to use semiparametric models. This can help making a trade-off between reduced efficiency in semiparametric models and lack of robustness to model misspecification in parametric models. Second, these bounds sometimes provide a guide to estimation methods. They give a standard against which the asymptotic efficiency of any particular semiparametric estimator can be measured. In addition, their form can suggest ways of constructing estimators and calculating their limiting distribution. This has already been illustrated in much detail for the restricted moment model (§5.1) and estimating a treatment difference between two treatments in a pretest-posttest study (§5.3). Last but not least, the bounds can rule out the existence of certain types of estimators, e.g., the infinite bound case, which we will discuss briefly at the end of this chapter.


7.2 Parametric Point of View

A lower bound theorem in statistics is an assertion that something, in our case estimation, cannot be done better than in a given way. The best known bound is the Cramér-Rao bound for finite samples for the case of independent sampling from a parametric model

P = {P_θ : θ ∈ Θ ⊂ ℝ},

which is taught in most introductory statistics courses.

Theorem 7.1. Let θ ↦ P_θ be differentiable at θ ∈ ℝ with score function ℓ̇_θ and let T_n = T_n(X_1, ..., X_n) be an unbiased estimator of χ(θ) for a differentiable function χ : ℝ → ℝ. Then, under some regularity conditions,

var_θ(T_n) ≥ χ′(θ)² / (n I_θ), (7.1)

for I_θ = var_θ{ℓ̇_θ(X_1)}, the Fisher information for θ (for one observation).

The Cramér-Rao bound is the number χ′(θ)²/I_θ, which depends solely on the functional χ to be estimated and on the model P = {P_θ : θ ∈ Θ ⊂ ℝ}, through its Fisher information. Rewriting (7.1) gives

n var_θ(T_n) = var_θ(√n T_n) ≥ χ′(θ)² / I_θ.

As we already mentioned, the quantity I_θ is called the information number or Fisher information of an observation. This terminology reflects the fact that the information number gives a bound on the variance of the best unbiased estimator for θ. As the information number gets bigger, and thus we have more information about θ, we have a smaller bound on the variance of the best unbiased estimator. A more general version of the Cramér-Rao bound is also available, in which the parameter θ and the functional χ are higher-dimensional. For instance, let θ be an m-dimensional parameter and let χ be a k-dimensional functional.

Theorem 7.2. Let θ ↦ P_θ be differentiable at θ = [θ_1, ..., θ_m]^T ∈ ℝ^m with score function ℓ̇_θ and let T_n = T_n(X_1, ..., X_n) be an unbiased estimator of χ(θ) = [χ_1(θ), ..., χ_k(θ)]^T for a differentiable function χ : ℝ^m → ℝ^k. Then, under some regularity conditions,

n var_θ(T_n) ≥ χ̇_θ I_θ^{−1} χ̇_θ^T, (7.2)

for I_θ = var_θ{ℓ̇_θ(X_1)}, the Fisher information for θ (for one observation). This inequality between both matrices should be read as usual, in the semi-definite sense.

It turns out that this bound is often not sharp. In the sense of (7.1) or (7.2), this means that there may not exist unbiased estimators T_n for which n times their variance is equal to the bound. But there is some good news: the bound is sharp in an asymptotic sense, as n → +∞. Indeed, the MLE asymptotically attains this bound. For more details about the consistency and the asymptotic efficiency of the MLE, we refer to Casella and Berger (2002), [6], Chapter 10. However, there are some peculiarities about asymptotics, as we will discuss in the next section when we discuss semiparametric efficiency bounds.
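As a small illustration of the bound and of its asymptotic attainment by the MLE, the following Monte Carlo sketch uses an assumed N(θ, 1) model, for which I_θ = 1; the value of θ, the sample size and the number of replications are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 200, 20_000
samples = rng.normal(theta, 1.0, size=(reps, n))         # X_1, ..., X_n i.i.d. N(theta, 1)
mean_est = samples.mean(axis=1)                           # the MLE of theta
median_est = np.median(samples, axis=1)                   # an alternative estimator of theta
print("n * var(mean)   =", n * mean_est.var())            # close to 1 = 1/I_theta (bound attained)
print("n * var(median) =", n * median_est.var())          # close to pi/2, strictly above the bound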

7.3 Semiparametric Efficiency Bounds

Motivated by the considerations above, we now wish to state the deep theorems that allow a precise formulation of what it means to be asymptotically efficient in a semiparametric context. We will first discuss a quite simple and intuitive result that gives an intuitive transition from efficiency in parametric models to semiparametric models, and we will obtain the so-called semiparametric efficiency bound. However, this result will still be vague. Hence, we will motivate why we need more sophisticated theorems to deal with defining efficiency. Finally, we state the deep theorems that allow a precise formulation of what it means to be asymptotically efficient: the LAM theorem and the convolution theorem.

7.3.1 Semiparametric Efficiency Bound

At the beginning of the previous chapter, §6.1, we gave a notion of information covered by semiparametric models. It is time to be more rigorous about it. To motivate the definition of information in the semiparametric set-up, assume for simplicity that the parameter to be estimated is one-dimensional, ψ : P → ℝ : P ↦ ψ(P).

Consider a differentiable parametric submodel t ↦ P_t with score function g at t = 0. It is easy to show that the Fisher information for t in this parametric submodel is equal to the variance of the score function g. By the fundamental properties of score functions, it follows that I_t = P g² = ⟨g, g⟩_P. To define information in this semiparametric context, we said we would only consider functionals ψ that are differentiable in an appropriate sense (see Definition 6.3). In calculating the Cramér-Rao bound for this parametric submodel, we finally see the importance of this restriction and why we limit ourselves to such functionals,

[{∂ψ(P_t)/∂t}|_{t=0}]² / (P g²) = (ψ̇_P g)² / ⟨g, g⟩_P = ⟨ψ̃_P, g⟩_P² / ⟨g, g⟩_P,

where ψ̃_P is the unique efficient influence function. Without the correct definition of differentiability of ψ, we could not make this calculation. Using the same ideology as in §4.2, the supremum of this expression over all submodels, equivalently, over all elements of the tangent set, is a lower bound for estimating ψ(P) given the model P, if the true measure is P. We now prove a very simple, but nonetheless very important lemma.

Lemma 7.1. Suppose that the functional ψ : P → ℝ is differentiable at P relative to a tangent set Ṗ_P. Then

sup_{g ∈ lin Ṗ_P} ⟨ψ̃_P, g⟩_P² / ⟨g, g⟩_P = P ψ̃_P². (7.3)

Proof. This is a consequence of the Cauchy-Schwarz inequality,

⟨ψ̃_P, g⟩_P² ≤ ‖ψ̃_P‖²_{L₂(P)} ‖g‖²_{L₂(P)} = P ψ̃_P² · P g².

Thus, for any g ∈ lin Ṗ_P,

⟨ψ̃_P, g⟩_P² / ⟨g, g⟩_P ≤ P ψ̃_P².

Taking the supremum yields

sup_{g ∈ lin Ṗ_P} ⟨ψ̃_P, g⟩_P² / ⟨g, g⟩_P = P ψ̃_P²,

since ψ̃_P is in the closure of lin Ṗ_P. Indeed, by definition of elements in the closure, there exists a sequence of scores {g_j}_{j=1}^{+∞} ⊂ lin Ṗ_P such that g_j → ψ̃_P as j → +∞ in L₂(P). From the continuity of the inner product, it follows that

⟨ψ̃_P, g_j⟩_P → ⟨ψ̃_P, ψ̃_P⟩_P = P ψ̃_P² and ‖g_j‖²_{L₂(P)} → ‖ψ̃_P‖²_{L₂(P)} as j → +∞. This shows that

⟨ψ̃_P, g_j⟩_P² / ⟨g_j, g_j⟩_P → P ψ̃_P²,

as j → +∞, which concludes the proof.

From this lemma, it is justified to call the number P ψ̃_P² the semiparametric efficiency bound.

Remark 7.1. Note that we cannot equate the supremum with a maximum, because ψ̃_P lies in the closure of lin Ṗ_P and not necessarily in lin Ṗ_P itself. By this we mean that there is not necessarily a differentiable submodel whose score function is the efficient influence function ψ̃_P. However, we can find differentiable submodels with corresponding score function arbitrarily close to the efficient influence function (by definition of closure).
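A small numerical check of Lemma 7.1 on a finite sample space may help; it takes for granted, as derived in §7.3.3 below, that in the nonparametric model the efficient influence function of ψ(P) = P f is f − P f. The distribution P and the function f are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.1, 0.3, 0.2, 0.25, 0.15])      # a distribution P on five points
f = np.array([1.0, -2.0, 0.5, 3.0, 0.0])       # the functional is psi(P) = P f
psi_tilde = f - np.dot(p, f)                   # efficient influence function f - P f
bound = np.dot(p, psi_tilde ** 2)              # the semiparametric efficiency bound P psi_tilde^2

def ratio(g):
    # <psi_tilde, g>_P^2 / <g, g>_P for a mean-zero score g
    return np.dot(p, psi_tilde * g) ** 2 / np.dot(p, g * g)

ratios = []
for _ in range(1000):                          # random mean-zero scores never exceed the bound
    g = rng.normal(size=5)
    g -= np.dot(p, g)                          # center so that P g = 0
    ratios.append(ratio(g))
print(max(ratios) <= bound + 1e-12, bound)
print(np.isclose(ratio(psi_tilde), bound))     # the score g = psi_tilde attains the supremum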

Now the special meaning of the efficient influence function becomes clear. The squared norm P ψ̃_P² of the efficient influence function ψ̃_P plays the role of the smallest asymptotic variance an estimator for ψ(P) can have. The connection of ψ̃_P with efficiency is thus clear. In the next section, we state and discuss some theorems to see in which sense this optimality of smallest asymptotic variance P ψ̃_P² holds. Recall that Lemma 7.1 was based on a one-dimensional parameter ψ(P). A similar result is still available if we are estimating a k-dimensional parameter. Thus, suppose

ψ : P → ℝ^k : P ↦ ψ(P)

is a k-dimensional functional that is differentiable at P relative to a tangent set Ṗ_P. We will transform this functional into a one-dimensional functional. Therefore, take an arbitrary a ∈ ℝ^k, then the functional a^T ψ : P → ℝ is one-dimensional. By the differentiability of the k-dimensional parameter ψ(P), it follows that

∂/∂t a^T ψ(P_t) |_{t=0} = a^T ∂/∂t ψ(P_t) |_{t=0} = a^T ψ̇_P g = a^T ⟨ψ̃_P, g⟩_P = ⟨a^T ψ̃_P, g⟩_P,

where t ↦ P_t is a differentiable path with corresponding score g. Hence, we may apply Lemma 7.1. We obtain that

⟨a^T ψ̃_P, g⟩_P² / ⟨g, g⟩_P ≤ P{(a^T ψ̃_P)²}.

This is equivalent with

a^T ⟨ψ̃_P, g⟩_P ⟨ψ̃_P, g⟩_P^T a / ⟨g, g⟩_P ≤ a^T P(ψ̃_P ψ̃_P^T) a.

We may conclude that ⟨ψ̃_P, g⟩_P ⟨g, g⟩_P^{−1} ⟨ψ̃_P, g⟩_P^T ≤ P(ψ̃_P ψ̃_P^T) in the semi-definite sense, with supremum equal to P(ψ̃_P ψ̃_P^T), where the order relation is again defined by semi-definiteness. The smallest covariance for estimating the k-dimensional parameter ψ(P) is thus given by the covariance matrix P(ψ̃_P ψ̃_P^T) of the k-dimensional efficient influence function. We have proved the following result.

Corollary 7.1. Suppose that the functional ψ : P → ℝ^k is differentiable at P relative to a tangent set Ṗ_P. Then

sup_{g ∈ lin Ṗ_P} { ⟨ψ̃_P, g⟩_P ⟨g, g⟩_P^{−1} ⟨ψ̃_P, g⟩_P^T } = P(ψ̃_P ψ̃_P^T). (7.4)

Hence, it is justified to refer to the matrix P(ψ̃_P ψ̃_P^T) as the semiparametric efficiency bound.
Let us put this result in contrast with the results obtained in Chapter 4. Corollary 7.1 generalizes Theorem 4.2 and Theorem 4.6. If we want to estimate the finite-dimensional component of a semiparametric model in a strict sense, Theorem 4.2 shows that the semiparametric efficiency bound as defined there equals the variance of the efficient influence function, which could always be defined (without assuming the existence of a certain preliminary RAL estimator). In Chapter 4, it is seen that we could also define the semiparametric efficiency bound when it is more natural to represent the parameter of interest as a function of an infinite-dimensional parameter indexing the model. However, we could only show this bound equals the variance of the efficient influence function if we assume a certain preliminary RAL estimator exists, since this assumption was necessary for defining the efficient influence function. This was the content of Theorem 4.6. Thus, we may conclude that, due to the abstract approach, Corollary 7.1 properly unifies and generalizes Theorem 4.2 and Theorem 4.6. In addition, the semiparametric efficiency bound as defined here, the variance of the efficient influence function, coincides with the definition as presented in Chapter 4.

Parametric Model. Looking at the results obtained in §6.3.1 for a parametric model

P = {P_θ : θ ∈ Θ ⊂ ℝ^m},

we deduced that for estimating ψ(P_θ) = χ(θ), the efficient influence function is

ψ̃_{P_θ} = χ̇_θ I_θ^{−1} ℓ̇_θ,

if the Fisher information matrix I_θ is invertible and the map χ is differentiable. Thus, the optimal covariance matrix is

P_θ ψ̃_{P_θ} ψ̃_{P_θ}^T = P_θ( χ̇_θ I_θ^{−1} ℓ̇_θ ℓ̇_θ^T I_θ^{−1} χ̇_θ^T )
= χ̇_θ I_θ^{−1} P_θ( ℓ̇_θ ℓ̇_θ^T ) I_θ^{−1} χ̇_θ^T
= χ̇_θ I_θ^{−1} χ̇_θ^T,

because P_θ( ℓ̇_θ ℓ̇_θ^T ) = I_θ. This is precisely the multivariate Cramér-Rao bound, see (7.2). △
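The following short sketch works this out numerically for an assumed N(µ, σ²) model with θ = (µ, σ²) and the illustrative functional χ(θ) = µ/σ; the bound χ̇_θ I_θ^{−1} χ̇_θ^T then equals 1 + µ²/(2σ²), and the plug-in MLE attains it asymptotically. All numerical values are arbitrary.

import numpy as np

mu, sigma2 = 1.0, 4.0
sigma = np.sqrt(sigma2)
I_theta = np.diag([1.0 / sigma2, 1.0 / (2.0 * sigma2 ** 2)])   # Fisher information for (mu, sigma^2)
chi_dot = np.array([1.0 / sigma, -mu / (2.0 * sigma ** 3)])    # gradient of chi(mu, sigma^2) = mu / sigma
bound = chi_dot @ np.linalg.inv(I_theta) @ chi_dot
print("Cramer-Rao bound:", bound, "  check:", 1.0 + mu ** 2 / (2.0 * sigma2))

rng = np.random.default_rng(2)                                  # Monte Carlo: the plug-in MLE attains the bound
n, reps = 500, 20_000
x = rng.normal(mu, sigma, size=(reps, n))
chi_hat = x.mean(axis=1) / x.std(axis=1)                        # MLE of mu divided by MLE of sigma
print("n * var(chi_hat):", n * chi_hat.var())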

7.3.2 Some Lower Bound Theorems

It is time to give a precise meaning to "smallest covariance", since it is still vague what is meant by this. To avoid confusion, we focus attention on the estimation of the functional ψ : P → ℝ^k : P ↦ ψ(P), with P an arbitrary model, a set of probability measures on the sample space X. We shall state two important theorems regarding this estimation problem. Before we do so, we try to argue why it is no trivial task to state easy theorems about the efficiency and the best limiting distribution of semiparametric estimators in an asymptotic sense. To simplify the situation, we shall assume that for a sequence of estimators T_n, √n{T_n − ψ(P)} converges in distribution under every possible probability measure P ∈ P to some limiting distribution L_P. The subscript P denotes the potential dependence on the specific probability measure P. We speak about the best estimator in terms of the best limiting distribution. But what do we mean by this? The notion of a best limiting distribution is understood in terms of concentration. If the distribution is a priori assumed to be normal, which is the case for asymptotically linear estimators (to which we restricted ourselves in Part II), this is usually translated into consistency and minimum asymptotic variance. Thus, for asymptotically normally distributed estimators T_n, i.e., when L_P = N{µ(P), Σ²(P)}, the asymptotic normality means that eventually T_n is approximately normally distributed with mean ψ(P) + µ(P)/√n and variance Σ²(P)/n. Thus,

T_n(X_1, ..., X_n) ≈ N{ ψ(P) + µ(P)/√n, Σ²(P)/n }. (7.5)

By definition, the reason why we come up with the sequence T_n is estimation of ψ(P). From (7.5), it is clear that optimal choices for the asymptotic mean and variance are µ(P) = 0 and Σ²(P) as small as possible. As a consequence, in terms of concentration, this optimal limiting distribution will be maximally concentrated about zero. Indeed, in the one-dimensional case, the probability of the interval (−a, a) is maximized by this optimal choice. In contrast with Part II, we do not want to make the a priori assumption that the estimators are asymptotically normal, i.e., L_P is not necessarily a normal distribution. Even so, it will turn out (after studying some lower bound theorems, especially the convolution theorem) that normal limiting distributions are best. The exciting thing is that these optimal normal limits will be characterized by the efficient influence function ψ̃_P. Up to now, we have discussed intuitively how to evaluate the asymptotic efficiency of estimators in terms of the optimal limiting distribution by looking at its concentration. There is more than mean and variance to evaluate this concentration. Other examples are (X ∼ L_P)

E(X²) = ∫ x² dL_P(x),   E|X| = ∫ |x| dL_P(x),   E(|X| ∧ a) = ∫ (|x| ∧ a) dL_P(x), (7.6)

where |x| ∧ a denotes the minimum of |x| and a. The limiting distribution L_P of the estimator T_n for ψ(P) is then called good if quantities of this type are small. Let us be more formal about the type of quantities to which we shall restrict ourselves in evaluating the limiting distribution of the estimator T_n. We will work with bowl-shaped, subconvex loss functions.

Definition 7.1. A loss function is any function ℓ : ℝ^k → [0, +∞). It is used to quantify the loss in concentration of an estimator.

Definition 7.2. A bowl-shaped function is a function ℓ : ℝ^k → [0, +∞) such that the sublevel sets {x : ℓ(x) ≤ c} are convex and symmetric about the origin for all c > 0.

Definition 7.3. A function ℓ : ℝ^k → [0, +∞) is called subconvex if ℓ is bowl-shaped and, moreover, the sets {x : ℓ(x) ≤ c} are closed.

Figure 7.1: Examples of bowl-shaped subconvex loss functions, ℓ(x) = x² (dotted), ℓ(x) = |x| (solid) and ℓ(x) = |x| ∧ a (dashed) with a = 2.

Examples of such one-dimensional bowl-shaped subconvex loss functions are given in equation (7.6): ℓ(x) = x², ℓ(x) = |x| or ℓ(x) = |x| ∧ a. These typically have the shape of a bowl (which explains the name), see Figure 7.1. Equation (7.6) shows how these loss functions are used to measure the concentration of the estimator T_n.

Definition 7.4. Consider a loss function ℓ. The mean with respect to the asymptotic distribution L_P of the estimator T_n, ∫ ℓ dL_P, is called the asymptotic risk of the estimator relative to the given loss function ℓ.
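As a tiny illustration of Definition 7.4, the asymptotic risks of the three loss functions above can be approximated by Monte Carlo when the limit distribution is assumed to be L_P = N(0, σ²); the values of σ and a are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
sigma, a = 1.5, 2.0
x = rng.normal(0.0, sigma, size=1_000_000)                      # draws from the assumed limit L_P
print("risk for l(x) = x^2        :", np.mean(x ** 2))          # approx sigma^2
print("risk for l(x) = |x|        :", np.mean(np.abs(x)))       # approx sigma * sqrt(2/pi)
print("risk for l(x) = min(|x|, a):", np.mean(np.minimum(np.abs(x), a)))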

At a given P , the risk function is the average loss that will be incurred if the estimator Tn is used. Since the true distribution P is unknown, we would like to find estimators for which this asymptotic risk is small for all P ∈ P. These risk functions can then be used to compare different estimators. However, the judgement as to which estimator is better may not be so clear since different estimators may have a different quality for different risk functions as we will see below. For more information about loss functions and risk functions, see Casella and Berger (2002), [6], §7.3.4. The following (in)famous example shows that a definition of what constitutes asymptotic optimality is not as straightforward as it might seem.

Example due to Hodges. We consider a simple parametric model. Let X_1, ..., X_n be i.i.d. N(θ, 1) with θ ∈ ℝ. We know that the maximum likelihood estimator (MLE) of θ is given by the sample mean, X̄_n = (1/n) Σ_{i=1}^n X_i, and that

√n(X̄_n − θ) →^{D(θ)} N(0, 1).

We know that the asymptotic variance of X̄_n equals the Cramér-Rao lower bound, so X̄_n is referred to as asymptotically efficient. One of the peculiarities of asymptotic theory is that asymptotically unbiased estimators can be constructed that have asymptotic variance equal to the Cramér-Rao lower bound for most of the parameter values in the model but have smaller asymptotic variance than the Cramér-Rao lower bound for the other parameter values. Such estimators are referred to as super-efficient. We now construct such a super-efficient estimator S_n, given by Hodges in 1951. Define S_n = X̄_n I(|X̄_n| ≥ n^{−1/4}), meaning

S_n = X̄_n   if |X̄_n| ≥ n^{−1/4},
S_n = 0     if |X̄_n| < n^{−1/4}. (7.7)

If the sample mean X̄_n is already close to zero, then it is changed exactly to zero. In the other case it is left unchanged. The truncation point n^{−1/4} has been chosen in such a way that the limit behaviour of S_n is the same as that of X̄_n for every θ ≠ 0, but for θ = 0 there appears to be a great improvement. Indeed, first consider the case that θ ≠ 0. In this case, when n tends to infinity, the mass of the distribution of X̄_n ∼ N(θ, 1/n) moves away from 0. This situation is sketched in Figure 7.2.

Figure 7.2: When θ ≠ 0, P_θ(X̄_n ≠ S_n) → 0 ([35], Fig. 3.1).

This means that P_θ(X̄_n ≠ S_n) → 0 as n → +∞, which is the same as P_θ(S_n = 0) → 0. This follows by an easy calculation,

P_θ(S_n = 0) = P_θ(|X̄_n| ≤ n^{−1/4})
= P_θ(−n^{−1/4} ≤ X̄_n ≤ n^{−1/4})
= P_θ{ √n(−n^{−1/4} − θ) ≤ √n(X̄_n − θ) ≤ √n(n^{−1/4} − θ) }
= Φ(n^{1/4} − √n θ) − Φ(−n^{1/4} − √n θ).

Taking the limit for n and by the continuity of Φ, it follows that

lim_{n→+∞} P_θ(S_n = 0) = Φ(−∞) − Φ(−∞) = 0

(for θ > 0; the case θ < 0 is analogous).

An equivalent way of stating this is that P_θ(X̄_n = S_n) → 1 as n → +∞. Therefore

√n(X̄_n − θ) = √n(S_n − θ) + √n(X̄_n − S_n) = √n(S_n − θ) + o_{P_θ}(1)

and by Slutsky's lemma, we conclude that √n(S_n − θ) →^{D(θ)} N(0, 1).

Now consider the case that θ = 0. In this case, the mass of the distribution of X̄_n will be concentrated in an O(n^{−1/2}) neighbourhood about the origin because X̄_n ∼ N(0, 1/n). Hence, with increasing probability, X̄_n will be within the interval (−n^{−1/4}, n^{−1/4}). This situation is sketched in Figure 7.3.

Figure 7.3: When θ = 0, P_0(|X̄_n| < n^{−1/4}) → 1 ([35], Fig. 3.2).

This can be easily calculated,

P_0(|X̄_n| < n^{−1/4}) = P_0(−n^{−1/4} < X̄_n < n^{−1/4})
= P_0{ −n^{1/4} < √n(X̄_n − 0) < n^{1/4} }
= Φ(n^{1/4}) − Φ(−n^{1/4}).

Taking the limit for n going to infinity and using the continuity of Φ yields

lim_{n→+∞} P_0(|X̄_n| < n^{−1/4}) = Φ(+∞) − Φ(−∞) = 1.

From the latter equation and the definition of S_n, we obtain that P_0(S_n = 0) → 1. Hence P_0(r_n S_n = 0) → 1 and thus r_n(S_n − 0) →^{P_0} 0, or r_n(S_n − 0) →^{D(0)} 0, for every sequence of scaling factors r_n → +∞ (we do not only have convergence at rate √n). This means that the asymptotic variance of S_n is equal to 0 if the true parameter θ is equal to 0.

Our conclusion from this pointwise investigation of the asymptotic distribution of S_n is that

√n(S_n − θ) →^{D(θ)} N(0, 1)   (θ ≠ 0),
r_n S_n →^{D(0)} 0   (θ = 0).
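A short simulation of these pointwise limits is given below; it also evaluates the estimator at the local alternative θ_n = n^{−1/3} that is discussed next. The sample size and the number of replications are arbitrary, and X̄_n is drawn directly from its N(θ, 1/n) distribution.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 10_000, 5_000

def hodges(xbar, n):
    # Hodges' estimator (7.7): keep the sample mean unless it falls below the threshold n^{-1/4}
    return np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)

for theta in [1.0, 0.0, n ** (-1.0 / 3.0)]:
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)   # the sample mean ~ N(theta, 1/n)
    s_n = hodges(xbar, n)
    print(f"theta = {theta:.4f}:  n * MSE(S_n) = {n * np.mean((s_n - theta) ** 2):.3f}")
# Roughly 1 at theta = 1, roughly 0 at theta = 0, and of order n^{1/3} at theta_n = n^{-1/3}.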

At first sight, S_n is an improvement on X̄_n. For every θ ≠ 0 the estimator S_n behaves identically to the sample mean, while for θ = 0, the sequence S_n has an infinite rate of convergence. Super-efficiency may seem like a good property for an estimator to possess. However, upon further investigation it is found that super-efficiency is gained at the expense of poor estimation

in a neighbourhood of zero. Consider the sequence θ_n = n^{−1/3}, which converges to zero, the value at which the estimator S_n is super-efficient. The MLE X̄_n has the property that

√n(X̄_n − θ_n) →^{D(θ_n)} N(0, 1).

However, because X̄_n ∼ N(θ_n, 1/n), it concentrates its mass in an O(n^{−1/2}) neighbourhood about θ_n = n^{−1/3}, which eventually, as n increases, will be completely contained within the range (−n^{−1/4}, n^{−1/4}) with probability converging to one. This situation is sketched in Figure 7.4.

Figure 7.4: When θ_n = n^{−1/3}, P_{θ_n}(S_n = 0) → 1 ([35], Fig. 3.3).

This also follows from an easy calculation,

P_{θ_n}(S_n = 0) = P_{θ_n}(|X̄_n| < n^{−1/4})
= P_{θ_n}(−n^{−1/4} < X̄_n < n^{−1/4})
= P_{θ_n}{ −n^{1/4} − n^{1/6} < √n(X̄_n − n^{−1/3}) < n^{1/4} − n^{1/6} }
= Φ(n^{1/4} − n^{1/6}) − Φ(−n^{1/4} − n^{1/6}).

Taking the limit for n going to infinity and using the continuity of Φ yields

lim_{n→+∞} P_{θ_n}(S_n = 0) = Φ(+∞) − Φ(−∞) = 1.

Therefore P_{θ_n}{ √n(S_n − θ_n) = −√n θ_n } → 1 and, because θ_n = n^{−1/3}, the sequence −n^{1/2} θ_n = −n^{1/6} → −∞. Now we see that √n(S_n − θ_n) diverges to −∞, i.e., there is no convergence at √n-rate and we do not have convergence in probability. This illustrates that the super-efficiency is gained at the expense of poor estimation in a neighbourhood of zero. Another way of making this point clear is to look at the graph of the risk function θ ↦ E_θ{(S_n − θ)²} for three different values of n. This is the mean squared error (MSE) of the estimator S_n. It is well known that E_θ{(S_n − θ)²} = var_θ(S_n) + bias_θ(S_n)². We hope this will be close to the asymptotic variance of S_n, so that we introduce no serious bias in a finite sample. We expect that this risk, suitably scaled by n, is close to 1 if θ ≠ 0 and close to zero if θ = 0, and that it gets closer to these values as n grows bigger. This risk function is shown in Figure 7.5 for different values of n. As we expected, these functions are close to 1 on most of the domain but possess peaks close to zero. As n → +∞, the locations

Figure 7.5: Quadratic risk functions of the Hodges estimator based on the means of samples of size 10 (dashed), 100 (dotted) and 1000 (solid) observations from the N(θ, 1)-distribution ([39], Fig. 8.1).

and widths of the peaks converge to zero but their heights to infinity. This shows clearly that the price we have to pay for the better asymptotic behaviour of S_n at θ = 0 is the erratic behaviour of S_n close to zero. Because the values of θ at which S_n is bad differ with n, the erratic behaviour is not visible in the pointwise limit distribution under fixed θ. △

As we announced, Hodges' example shows that there is no hope for a nontrivial lower bound for the limit distribution of an arbitrary standardized estimator sequence √n{T_n − ψ(P)} for every P, in a uniform sense. When we want to state a theorem without imposing further regularity conditions on the sequence of estimators, the preceding example shows that we should not consider pointwise lower bounds but instead consider lower bounds over shrinking neighbourhoods of the probability measure P. In this manner, we take into account the poor estimation in the neighbourhood of the points of super-efficiency. For every g in a given tangent set Ṗ_P, we write P_{t,g} for a corresponding submodel with score function g along which the functional ψ is differentiable. We now state the first of the deep lower bound theorems.

Theorem 7.3 (Local Asymptotic Minimax, LAM). Let the functional ψ : P → ℝ^k : P ↦ ψ(P) be differentiable at P relative to the tangent set Ṗ_P with efficient influence function ψ̃_P. If Ṗ_P is a convex cone, then, for any estimator sequence T_n and subconvex loss function ℓ : ℝ^k → [0, +∞),

sup_I lim inf_{n→+∞} sup_{g∈I} E_{P_{1/√n,g}} [ ℓ( √n { T_n − ψ(P_{1/√n,g}) } ) ] ≥ ∫ ℓ dN(0, P ψ̃_P ψ̃_P^T). (7.8)

The first supremum is taken over all finite subsets I of the tangent set.

This theorem gives a lower bound, depending only on the model P and the functional ψ to be estimated, for the lim inf of the risk E_P( ℓ[ √n {T_n − ψ(P)} ] ), for an arbitrary estimator T_n. A

best estimator T_n can then be defined as one that attains equality in (7.8) for all subconvex loss functions. If we define optimality in this LAM-sense, it is clear that the best possible limiting distribution of an estimator is a normal distribution characterized by the efficient influence function ψ̃_P. Thus, the name efficient influence function for ψ̃_P really is justified (and this is not the end). However, before we stated this theorem, we noted we would state a lower bound defined over shrinking neighbourhoods of P. This is not immediately clear from the LAM-theorem, although we see a glimpse of it because we consider submodels P_{1/√n,g}. A slightly weaker assertion makes this clearer. We will make use of the variation norm ‖·‖. When P and Q are two probability measures on the measure space (X, A), then

‖P − Q‖ = sup_{B∈A} |P(B) − Q(B)|.

Corollary 7.2. Under the same assumptions as in Theorem 7.3,

inf_{δ>0} lim inf_{n→+∞} sup_{‖Q−P‖<δ} E_Q[ ℓ( √n { T_n − ψ(Q) } ) ] ≥ ∫ ℓ dN(0, P ψ̃_P ψ̃_P^T). (7.9)

Now it is clear what we mean by shrinking neighbourhoods of P: in (7.9), we only take probability measures Q that are sufficiently close to P in the variation norm. Without taking the local supremum risk, both theorems would fail. Let this be clear by thinking of super-efficient estimators such as the example due to Hodges. Thus, we need to be able to deviate from the true probability distribution so that the lower bound is also applicable to non-regular estimators such as super-efficient estimators. Surprisingly, the conclusion is that a normal limit characterized by the efficient influence function ψ̃_P is optimal in this LAM-sense.
Even though the LAM-theorem is applicable to any estimator, there is also bad news. Because we only evaluate a maximum risk, the distinction between two estimator sequences can be blurred. That is, we are taking a supremum over probability distributions that are close to each other with respect to the variation norm, i.e., P and Q such that ‖P − Q‖ < δ. It is then possible that we have two estimator sequences T_{n,1} and T_{n,2} such that T_{n,1} is optimal at the truth P and T_{n,2} is optimal at the alternative Q that is close to P. However, since we are taking a supremum over these shrinking neighbourhoods, the distinction between T_{n,1} and T_{n,2} is not noticed; the distinction between these two estimator sequences is blurred. Is there a way to avoid taking this maximum risk? The good news is that there is a way, but a strong price has to be paid. We are able to avoid taking this maximum risk if we restrict ourselves to so-called regular estimator sequences. Regular estimators have already been introduced in Part II. However, there we restricted ourselves to regular estimators that are in addition asymptotically linear, i.e., RAL estimators. Here we define regular estimators in a general way.

Definition 7.5. An estimator sequence T_n is called regular at P for estimating ψ(P), relative to Ṗ_P, if there exists a probability measure L such that

√n{ T_n − ψ(P_{1/√n,g}) } →^{D(P_{1/√n,g})} L,   for every g ∈ Ṗ_P. (7.10)

In words, the limiting distribution L does not depend on the local data generating process (LDGP). Small perturbations of the true data generating distribution do not affect the limiting distribution of the estimator. This clarifies why we call estimator sequences with this property regular.
From the Portmanteau lemma, we know that X_n →^D X is equivalent with E{ℓ(X_n)} → E{ℓ(X)} for all bounded continuous functions ℓ. In addition, if ℓ is subconvex, applying this to (7.10), the LAM-criterion (7.8) becomes very simple,

∫ ℓ dL ≥ ∫ ℓ dN(0, P ψ̃_P ψ̃_P^T). (7.11)

Again, this shows the optimality of a normal limiting distribution characterized by the efficient influence function ψ̃_P. Fortunately, restricting ourselves to regular estimator sequences yields more information than only a simpler form of the LAM-theorem. We now state a theorem, the convolution theorem, which shows that this discrepancy between limit and lower bound always (under some geometric restrictions on the tangent set Ṗ_P) results from L being more dispersed than the normal measure.

Theorem 7.4 (Convolution). Let the functional ψ : P → ℝ^k : P ↦ ψ(P) be differentiable at P relative to a tangent set Ṗ_P with efficient influence function ψ̃_P. Let T_n be regular at P with limit distribution L. Then the following statements hold:

(i) if Ṗ_P is a cone, then ∫ y y^T dL(y) − P ψ̃_P ψ̃_P^T (take ℓ(y) = y y^T) is nonnegative definite. In words, the asymptotic covariance matrix of every regular estimator sequence T_n with limiting distribution L is bounded below by P ψ̃_P ψ̃_P^T;

(ii) if, in addition, Ṗ_P is a convex cone, then there exists a probability measure M such that L = U + M, where U ∼ N(0, P ψ̃_P ψ̃_P^T) and M ⊥⊥ U, i.e., M is independent of U.

This theorem shows important results. If we only have the property that the tangent set Ṗ_P is a cone, which is certainly true if we work with the maximal tangent set (see Proposition 6.2), we know that for regular estimator sequences, the optimal covariance matrix is indeed the covariance matrix P ψ̃_P ψ̃_P^T of the efficient influence function ψ̃_P. This result coincides with the results obtained in Chapter 4, especially Theorem 4.3 and Theorem 4.5. These theorems state that any RAL estimator for ψ(P) cannot have asymptotic variance smaller than the variance of the efficient influence function. Thus we certainly see that the first part of the convolution theorem is valid for the class of RAL estimators (as it should be). Unfortunately, the theory developed in Part II cannot say anything about the variance of a regular estimator that is not asymptotically linear. Using the abstract approach as presented here, the first part of the convolution theorem yields that any regular estimator (a fortiori any RAL estimator) for ψ(P) (where ψ is differentiable in the sense of Definition 6.3) cannot have asymptotic variance smaller than the variance of the efficient influence function ψ̃_P.
If, in addition, the tangent set is also a convex cone, we have a representation of the limiting distribution L in terms of the optimal limiting distribution N(0, P ψ̃_P ψ̃_P^T), i.e., L = N(0, P ψ̃_P ψ̃_P^T) + M. This result indicates that the limiting distribution is equal to that of a N(0, P ψ̃_P ψ̃_P^T) random variable plus noise.
Let us look at a special case. Suppose M is a normal distribution with zero mean. It follows from independence of N(0, P ψ̃_P ψ̃_P^T) and M that the asymptotic variance of T_n is equal to E(L L^T) = P ψ̃_P ψ̃_P^T + E(M M^T). Thus, in this case, the covariance matrix of L differs from P ψ̃_P ψ̃_P^T by a nonnegative definite matrix E(M M^T), the covariance matrix of M (which is a special case of (i) in the convolution theorem). Comparison of estimators can then be easily obtained by just comparing the covariance matrices of the limiting distributions. This was the way we evaluated the efficiency of RAL estimators in Part II, since these were asymptotically normal with zero mean. By (i) of the convolution theorem, this methodology is justified in a rigorous way. If we apply (ii) to RAL estimators, we see this is actually a generalization of Theorem 4.4. A consequence of Theorem 4.4 is that, if the efficient influence function is well-defined, any influence function ϕ of a RAL estimator can be written as the sum of the efficient influence function ϕ_eff and an element ℓ of the orthogonal complement of the tangent space J. In this case, the RAL estimator with influence function ϕ is asymptotically normal N(0, E{ϕϕ^T}), and this can be written as N(0, E{ϕ_eff ϕ_eff^T}) + N(0, E{ℓℓ^T}), where the independence is implied by the orthogonality ϕ_eff ⊥ ℓ. Thus, for this example, M = N(0, E{ℓℓ^T}).
Finally, note that this theorem holds if Ṗ_P is a tangent space, i.e., closed under taking linear combinations, because a tangent space is certainly a convex cone. In general, we need to be careful that we do not apply this theorem if the tangent set does not have the required shape.

Remark 7.2. We gave both Theorem 7.3 and Theorem 7.4 without proof. Because this thesis is meant to clarify semiparametric theory without making things more complicated than they already are, we skip them. For the interested reader, sketches of these proofs (e.g., how the shape of the tangent set is used) can be found in van der Vaart (1998), [39], §25.3. We have to admit, these proofs are very tough and make use of convergence of experiments, which, for simplicity, we do not mention here. Important to note, the proofs of these theorems also rely on Lemma 6.3, the local asymptotic normality of the log likelihood ratio. This important result helped us to state powerful theorems such as the LAM-theorem and the convolution-theorem.

These theorems originally have their roots in two papers by Hájek, see [12] and [13]. In these two papers, Hájek considered the estimation of the parameter θ in a parametric model {P_θ : θ ∈ Θ ⊂ ℝ^k} and presented the LAM theorem and the convolution theorem for parametric models. Afterwards, many generalizations have been obtained, especially generalizations to semiparametric models. A more detailed discussion of the LAM theorem and the convolution theorem for semiparametric models, together with some generalizations, can be found in van der Vaart (1988), [36], Chapter 2. These theorems are applicable to functionals ψ with values in a Euclidean space. In [36], Chapter 3, these theorems are generalized to functionals ψ with values in some vector space. So far, we have stated two fundamental theorems about the efficiency of semiparametric estimators. Both theorems give the message that the normal distribution with mean zero and covariance matrix P ψ̃_P ψ̃_P^T is the best limiting distribution for an estimator sequence. Unfortunately, we should not take this in a too absolute sense. For instance, in parametric theory, shrinkage estimators, as first invented by Stein in the 1950s, are not regular and hence are not in the realm of the convolution theorem. These are optimal in the LAM-sense for certain loss functions ℓ but they are not asymptotically normal. We briefly introduce the notion of a shrinkage estimator due to Stein.

Shrinkage Estimators. Consider a sample X_1, ..., X_n and assume each X_i has a multivariate normal distribution with mean θ = [θ_1, ..., θ_k]^T ∈ ℝ^k and variance the identity matrix I_k, i.e., X_i ∼ N(θ, I_k) for all i = 1, ..., n, where k denotes the dimension of each X_i. It is important to note that the dimension k of the observations is assumed to be at least 3. This turns out to be essential. We know the MLE for θ is given by the vector of sample means,

X̄_n = (1/n) Σ_{i=1}^n X_i ∈ ℝ^k. (7.12)

Moreover, we know that √n(X̄_n − θ) →^{D(θ)} N(0, I_k) for all θ, which is the best limit distribution a regular estimator can have in this parametric model. The Stein shrinkage estimator is then defined to be the estimator

T_n = X̄_n − (k − 2) X̄_n / (n ‖X̄_n‖²). (7.13)

Recall that ‖x‖² = x^T x. We now investigate the asymptotic distribution of the Stein shrinkage estimator. We consider two different cases.

First consider the case where θ ≠ 0. By the WLLN, we know that X̄_n converges in probability to the mean θ. From this it follows that the second term −(k − 2) X̄_n / (n ‖X̄_n‖²) is O_P(n^{−1}). To obtain this, we need to show that −(k − 2) X̄_n / (n ‖X̄_n‖²) = Y_n n^{−1} with Y_n = O_P(1). This means that for every ε > 0, we need to find a constant M such that

sup_{n∈ℕ} P_θ( ‖ (k − 2) X̄_n / (n ‖X̄_n‖²) ‖ > M/n ) < ε.

This is equivalent to the statement that for every ε > 0, there exists a constant M such that

sup_{n∈ℕ} P_θ( 1/‖X̄_n‖ > M/(k − 2) ) < ε. (7.14)

Since X̄_n →^{P(θ)} θ, we know that X̄_n →^{D(θ)} θ. Using the Continuous Mapping Theorem, we obtain that 1/‖X̄_n‖ →^{D(θ)} 1/‖θ‖ and hence, by Prohorov's Theorem, 1/‖X̄_n‖ = O_P(1), which is equivalent with (7.14). This result implies that √n(T_n − X̄_n) = o_P(1). Indeed,

√n(T_n − X̄_n) = −√n (k − 2) X̄_n / (n ‖X̄_n‖²)
= √n O_P(n^{−1}) = (1/√n) O_P(1) = o_P(1) O_P(1)
= o_P(1).

Finally, we deduce that

√n(T_n − θ) = √n(T_n − X̄_n) + √n(X̄_n − θ)
= o_P(1) + √n(X̄_n − θ) →^{D(θ)} N(0, I_k).

We may conclude that the asymptotic distribution of T_n for θ ≠ 0 is given by N(0, I_k) and hence is independent of the LDGP, which shows that T_n is regular at any θ ≠ 0.
Now consider the case where θ = 0. Consider the LDGP θ_n = h/√n where h is a constant vector in ℝ^k. We see that √n(θ_n − θ) = √n(θ_n − 0) converges to the constant vector h, so this defines a proper LDGP. We have that √n X̄_n →^{D(θ_n)} X where X ∼ N(h, I_k). Using the Continuous Mapping Theorem, we obtain that

√n( T_n − h/√n ) →^{D(θ_n)} T(X) − h ∼ L_{0,h},

where

T(X) = X − (k − 2) X / ‖X‖²

and L_{0,h} is just a notation for the distribution of T(X) − h. We may conclude that the asymptotic distribution of T_n for θ = 0 depends on h and thus on the LDGP; the estimator sequence T_n is not regular at θ = 0. Below, we explicitly show that the dependency on h arises due to the shrinkage. We have

T(X) − h = (X − h) − (k − 2) X / ‖X‖².

Since (X − h) ∼ N(0, I_k), this part of the asymptotic distribution is independent of h. However, the part (k − 2) X/‖X‖² does depend on h. From this we see that the shrinkage part makes the estimator non-regular. Putting the two cases together, we conclude that the Stein shrinkage estimator defined by (7.13) is not a regular estimator. In Lehmann (1983), [20], p. 300, it is proven that the Stein shrinkage estimator has the remarkable property that, for every h ∈ ℝ^k,

E_h{ ‖T(X) − h‖² } < E_h{ ‖X − h‖² } = k.
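A quick Monte Carlo check of this inequality, with an arbitrary dimension k = 5 and an arbitrary vector h, is the following.

import numpy as np

rng = np.random.default_rng(5)
k, reps = 5, 200_000
h = np.array([0.5, -1.0, 0.0, 2.0, 1.0])
x = rng.normal(h, 1.0, size=(reps, k))                          # X ~ N(h, I_k)
t = x - (k - 2) * x / np.sum(x ** 2, axis=1, keepdims=True)     # T(X) = X - (k-2) X / ||X||^2
print("E ||X - h||^2    =", np.mean(np.sum((x - h) ** 2, axis=1)))   # approx k = 5
print("E ||T(X) - h||^2 =", np.mean(np.sum((t - h) ** 2, axis=1)))   # strictly smaller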

It follows that, in terms of joint quadratic loss (MSE) ℓ(x) = ‖x‖², the local limit distributions L_{0,h} of the sequence √n(T_n − h/√n) under θ_n = h/√n are all better than the N(0, I_k)-limit distribution of the best regular estimator sequence, the MLE X̄_n.
As a consequence of the discussion above, we can make the following interesting conclusion. When three or more unrelated parameters are estimated, their joint MSE can be reduced using the Stein shrinkage estimator. By introducing bias, the Stein shrinkage estimator may improve the estimation of a particular component of θ while deteriorating the estimation of other components of θ; in the end, however, the joint MSE is reduced. To illustrate this idea, suppose we are interested in estimating a treatment effect β using a linear model E(Y|X, L) = α + βX + γ^T L, where Y denotes the response variable, X denotes the treatment and L denotes a possibly high-dimensional set of confounders for the relation between X and Y. Since we are not interested in the separate estimation of the parameter α and the different components of the possibly high-dimensional parameter γ, we could consider the non-regular Stein shrinkage estimator for these parameters in order to minimize the joint MSE of α, β and γ. To be a bit more specific, assume L is p-dimensional, i.e., L = [L_1, ..., L_p]^T. Hence, the parameter γ is also p-dimensional, i.e., γ = [γ_1, ..., γ_p]^T. We are interested in the setting where the information size is small relative to p. When we use ordinary least squares for the estimation of the parameters γ_j, we typically obtain large standard errors and large estimates for the parameters γ_j. For this reason, we assume each γ_j ∼ N(0, σ²), an idea coming from Bayesian statistics. This will attenuate the estimates of the parameters γ_j towards zero. △
For more information about this phenomenon, we refer to James and Stein (1961), [15]. The example of shrinkage estimators shows that, depending on the optimality criterion, a normal N(0, P ψ̃_P ψ̃_P^T)-limit distribution need not be optimal. Nevertheless, so far, best regular estimator sequences have been considered best in the general semiparametric theory. This motivates the following definition.

Definition 7.6. An estimator sequence is called asymptotically efficient at P, relative to a given tangent set Ṗ_P, for estimating the differentiable parameter ψ(P), if it is regular at P with limit distribution L = N(0, P ψ̃_P ψ̃_P^T), i.e.,

√n{ T_n − ψ(P_{1/√n,g}) } →^{D(P_{1/√n,g})} N(0, P ψ̃_P ψ̃_P^T),   for every g ∈ Ṗ_P. (7.15)

Thus, if we have an asymptotically efficient estimator sequence T_n at P, we obtain equality in (7.11) for all subconvex loss functions ℓ, and the random variable M in the convolution theorem has a distribution degenerate at zero. The definition of asymptotic efficiency is not absolute; it is defined relative to a given tangent set. When we defined a tangent set, we admitted every collection of scores to be a tangent set. In practice, one hunts for a tangent set and an estimator sequence such that the tangent set is big enough and the estimator sequence efficient enough, so that this estimator sequence is asymptotically efficient relative to this tangent set. As stated in van der Vaart (2002), [40], p. 18, one strongly believes that this is all that needs to be said about the problem. We end our introduction to general efficiency theory with an interesting lemma. It shows that asymptotically efficient estimator sequences relative to a tangent cone can be asymptotically approximated by an average of the efficient influence function evaluated at the observations.

Lemma 7.2. Let the functional ψ : P → ℝ^k : P ↦ ψ(P) be differentiable at P relative to the tangent cone Ṗ_P with efficient influence function ψ̃_P. A sequence of estimators T_n is regular at P with limiting distribution N(0, P ψ̃_P ψ̃_P^T) (asymptotically efficient) if and only if it satisfies

√n{ T_n − ψ(P) } = (1/√n) Σ_{i=1}^n ψ̃_P(X_i) + o_P(1). (7.16)

This lemma also explains the name influence function for ψ̃_P: the term ψ̃_P(X_i) accounts for the influence of the i-th observation on the estimation of ψ(P) via the estimator T_n. Moreover, it justifies again why we call ψ̃_P the efficient influence function.

Remark 7.3. The proof of this lemma is skipped. We refer the interested reader to van der Vaart (1998), [39], §25.3 for a detailed proof, which we skip because it also involves convergence of experiments. Again, as with the previous theorems, it is important to note that the proof makes use of Lemma 6.3, the local asymptotic normality of the log likelihood ratio.

It is worth reflecting on what we have achieved by now. If we are willing to pay the price of restricting ourselves to regular estimators, we have a nice criterion for asymptotic efficiency of an estimator based on an object that always exists if the parameter ψ(P) to be estimated is differentiable in the sense of Definition 6.3: the efficient influence function ψ̃_P. An estimator will be considered best if it is an asymptotically efficient estimator. Note that this criterion for optimality should not be taken in an absolute sense, because it is relative to a given tangent set. Thus, given a tangent set, we have obtained a standard against which we can measure the efficiency of a certain proposed regular semiparametric estimator. The latter lemma yields a nice representation of an asymptotically efficient estimator as a linear combination of the efficient influence function evaluated at the observations. The unfortunate thing is that we are not assured that an asymptotically efficient estimator even exists. We only constructed a way to evaluate the efficiency of a given estimator. Thus, the bound can be well defined without being sharp. There exist examples (see Ritov and Bickel (1987), [30]) where the bound is well defined and finite but no √n-consistent estimator exists. A fortiori, no asymptotically efficient estimator can exist. It is time to give an example of an asymptotically efficient estimator, the empirical distribution in a nonparametric model.

7.3.3 Empirical Distribution

Using the theory we developed so far, it becomes almost trivial to show that, in a pointwise sense, the empirical distribution is an asymptotically efficient estimator of the underlying distribution P of the sample if the distribution P is completely unknown, i.e., we work in a nonparametric model.

Definition 7.7. Let X_1, ..., X_n be a random sample from a probability distribution P on a measurable space (X, A). The empirical distribution is the discrete uniform measure on the observations. We denote it by ℙ_n = (1/n) Σ_{i=1}^n δ_{X_i}, where δ_x is the probability distribution that is degenerate at x, i.e., a probability distribution that puts all its mass on the value x,

δ_x(y) = 1 if y = x, and δ_x(y) = 0 if y ≠ x.

The measure δx is also called the Dirac measure.

The empirical distribution ℙ_n puts mass 1/n at each observation X_1, ..., X_n. Now, given a measurable function f : X → ℝ with P f² = ∫ f² dP < +∞ (thus f ∈ L₂(P)), we write ℙ_n f for the expectation of f under the empirical measure and, as usual, P f is the expectation of f under the true distribution P. Thus,

ℙ_n f = (1/n) Σ_{i=1}^n f(X_i),   P f = ∫ f dP.

In this notation, ℙ_n f is considered as an estimator for P f, so we are estimating the functional ψ : P → ℝ : P ↦ ψ(P) = P f.
In Proposition 6.4, we deduced that the tangent set for a nonparametric model is the set Ṗ_P of all mean-zero (P g = 0) measurable functions g ∈ L₂(P), and Ṗ_P is a tangent space. For a general function f, the functional (or parameter) ψ may not be differentiable relative to the maximal tangent set, but it certainly is differentiable relative to the tangent set consisting of all bounded, measurable functions g with P g = 0, which we denote by Ṗ_P^*. The closure of this tangent set Ṗ_P^* is the maximal tangent set Ṗ_P, because the bounded functions are dense in L₂(P). Hence, working with this smaller set Ṗ_P^* does not change the efficient influence function. Indeed, the efficient influence function is obtained after projecting onto the closure of the linear span of the tangent set. Since the closure of our smaller tangent set Ṗ_P^* is the maximal tangent set Ṗ_P, the efficient influence function is left unchanged. For a bounded function g we can use the one-dimensional submodel dP_t = (1 + t g) dP, since due to the boundedness of g this defines a proper distribution for sufficiently small t. The corresponding parameter is then

ψ(P_t) = P_t f = ∫ f dP_t = ∫ f (1 + t g) dP = P f + t P f g.

In order to use Lemma 7.2, Ṗ_P^* should be a cone, which is the case because Ṗ_P^* is linear. The other assumption is that ψ should be differentiable at P relative to the tangent cone Ṗ_P^*. We now show that this is true:

∂ψ(P_t)/∂t |_{t=0} = P f g.

From this equation, it is clear that the derivative of ψ, in the sense of (6.14), is the map

ψ̇_P : L₂(P) → ℝ : g ↦ ψ̇_P g = P f g = ∫ f g dP.

Hence, the function f is an influence function in the sense in which it is defined in §6.1.¹ To obtain the efficient influence function ψ̃_P, we need to project f onto the closure of lin Ṗ_P^*, which is Ṗ_P. It is easy to check that ψ̃_P = f − P f, because P ψ̃_P = 0, P ψ̃_P² < +∞ (thus ψ̃_P ∈ Ṗ_P) and f − ψ̃_P = P f ⊥ Ṗ_P, because for any g ∈ Ṗ_P,

⟨P f, g⟩ = ∫ P f × g dP = P f × P g = 0.

We conclude that Lemma 7.2 is applicable with efficient influence function ψ̃_P = f − P f. We have that

√n{ ℙ_n f − ψ(P) } = √n{ (1/n) Σ_{i=1}^n f(X_i) − P f }
= (1/√n) Σ_{i=1}^n { f(X_i) − P f }
= (1/√n) Σ_{i=1}^n ψ̃_P(X_i).

By Lemma 7.2, it follows that ℙ_n f is an asymptotically efficient estimator sequence for P f with asymptotic variance P ψ̃_P² = P{(f − P f)²} = P(f²) − (P f)², i.e., ℙ_n f is regular at P with limiting distribution N(0, P ψ̃_P ψ̃_P^T).

Empirical Distribution Function. Let us now investigate a specific choice for the function f. Consider the indicator function f(x) = I(x ∈ A), A ∈ A. Then we estimate P(A) with the estimator

ℙ_n f = (1/n) Σ_{i=1}^n I(X_i ∈ A).

In particular, let X_1, ..., X_n be a random sample with corresponding distribution function F_X on the real line. In addition, take A to be the interval (−∞, x_0] for x_0 ∈ ℝ. Hence, we estimate P(X ≤ x_0) = F_X(x_0). The estimator for this particular x_0 ∈ ℝ is then given by

𝔽_n(x_0) = (1/n) Σ_{i=1}^n I(X_i ≤ x_0).

This is what we call the empirical distribution function. From the derivation above, we know that the empirical distribution function 𝔽_n(x_0) is the asymptotically efficient estimator for F_X(x_0), with efficient influence function ψ̃_P(X) = I(X ≤ x_0) − F_X(x_0), for every x_0 ∈ ℝ. △

¹ Note that f is not an influence function in the sense of Definition 3.2, since f does not necessarily have mean zero.

Sample Mean. Another interesting example arises when we choose f to be the identity function 1. We have that 1 ∈ L₂(P) iff E(X²) < +∞, i.e., iff X has a finite second moment. Indeed,

‖1‖_{L₂(P)} = { ∫ 1(x)² dP(x) }^{1/2} = { ∫ x² dP(x) }^{1/2} = { E(X²) }^{1/2}.

In this case we estimate P1 = ∫ 1(x) dP = ∫ x dP = E(X), which is the mean of X. The nonparametric estimator for P1 is then given by

ℙ_n 1 = (1/n) Σ_{i=1}^n 1(X_i) = (1/n) Σ_{i=1}^n X_i = X̄_n,

the sample mean. From the derivation above, we know that the sample mean ℙ_n 1 = X̄_n is an asymptotically efficient estimator for the mean E(X), with efficient influence function given by ψ̃_P(X) = 1(X) − P1 = X − E(X). △
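Both examples are easy to check by simulation; the sketch below uses an Exponential(1) sample as an arbitrary choice of P and compares n times the variance of the sample mean and of the empirical distribution function at an arbitrary point x_0 with the respective bounds var(X) and F_X(x_0){1 − F_X(x_0)}.

import numpy as np

rng = np.random.default_rng(6)
n, reps, x0 = 2_000, 20_000, 1.0
x = rng.exponential(1.0, size=(reps, n))              # a sample from an assumed P = Exponential(1)
mean_est = x.mean(axis=1)                             # P_n f with f the identity
ecdf_est = np.mean(x <= x0, axis=1)                   # P_n f with f = I(. <= x0)
F_x0 = 1.0 - np.exp(-x0)
print("n * var(mean):", n * mean_est.var(), "  bound var(X) =", 1.0)
print("n * var(ecdf):", n * ecdf_est.var(), "  bound =", F_x0 * (1.0 - F_x0))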

7.4 Semiparametric Models in a Strict Sense

7.4.1 Efficient Score Function and Efficient Information Matrix

We leave the more general theory of lower bound theorems and focus attention again on the semiparametric model in a strict sense,

P = {Pθ,η : θ ∈ Θ, η ∈ H}, (7.17)

with Θ ⊂ ℝ^k an open set and H an arbitrary set, typically of infinite dimension. In this section, we consider estimation of a special (k-dimensional) functional, nonetheless a very common case in practice,

ψ : P → ℝ^k : P_{θ,η} ↦ ψ(P_{θ,η}) = θ. (7.18)

Thus we focus attention on the estimation of the parametric component θ (the parameter of interest) of our model. The nonparametric component η is considered the nuisance parameter. The results we will obtain here will again extend the results obtained in Chapter 4. Our aim is to apply the theory developed in the previous sections and find a more specialized criterion for the efficiency of an estimator T_n for ψ(P_{θ,η}) = θ. Thus, we are looking for the efficient influence function ψ̃_{θ,η} in this special setting, because we have seen that ψ̃_{θ,η} characterizes the lower bounds and the asymptotically efficient estimators in the sense of Definition 7.6. We will express the efficient influence function in terms of the efficient score function and the efficient information matrix.
The efficient influence function ψ̃_{θ,η} can be found by searching for the derivative of the functional ψ (see Definition 6.3) and projecting it onto the tangent space Ṗ_{P_{θ,η}}. To obtain this, we need to consider appropriate parametric submodels, appropriate in the sense of Definition 6.1. As submodels, we use paths of the form t ↦ P_{θ+ta,η_t}, for given paths t ↦ η_t in H and a ∈ ℝ^k. It will be case dependent which paths η_t should be used, depending on the considered semiparametric model. At this point, we are assuming these paths exist. Examples will be given once we have obtained the desired result. Given these differentiable paths t ↦ P_{θ+ta,η_t}, we are interested in what a tangent space Ṗ_{P_{θ,η}} looks like. In doing so, we use our intuition. The score functions for such submodels will typically have the form of a sum of partial derivatives with respect to the parametric component θ and the nonparametric component η. Note again, we assume these score functions exist. When considering a particular example, this should be checked. If ℓ̇_{θ,η} is the ordinary score function for θ in the model where η is fixed (so that we are dealing with an ordinary parametric model), then we expect

∂/∂t log dP_{θ+ta,η_t} |_{t=0} = a^T ℓ̇_{θ,η} + g. (7.19)

The function g has the interpretation of a score function for η when θ is fixed. The function g typically runs through an infinite-dimensional set, e.g., when η represents an unrestricted density (H is a nonparametric model), g will run through the set of all mean-zero functions in L₂(P_{θ,η}). Because the set of all nuisance score functions is so important, we give it a special name.

Definition 7.8. The set of all score functions g for the nuisance parameter η, i.e., nuisance score functions, is called the tangent set for η, i.e., the nuisance tangent set, and is denoted by ηṖ_{P_{θ,η}}.

Next, we examine what it should mean that the functional ψ is differentiable at P relative to the tangent set Ṗ_{P_{θ,η}} consisting of score functions such as (7.19). The parameter ψ(P_{θ+ta,η_t}) = θ + ta is certainly differentiable with respect to t in the ordinary sense, with derivative a. However, to be differentiable at P relative to Ṗ_{P_{θ,η}}, we need something more. By definition, ψ will be differentiable as a parameter on the model if and only if there exists a function ψ̃_{θ,η} such that

a = ∂/∂t ψ(P_{θ+ta,η_t}) |_{t=0} = ⟨ ψ̃_{θ,η}, a^T ℓ̇_{θ,η} + g ⟩_{P_{θ,η}}, (7.20)

for all a ∈ ℝ^k and for all g ∈ ηṖ_{P_{θ,η}}. Now assuming this condition is fulfilled, so that we are dealing with a differentiable parameter, we can make a special choice for a and obtain an interesting result.

Proposition 7.1. Let the functional ψ : P → ℝ^k : P_{θ,η} ↦ ψ(P_{θ,η}) = θ be differentiable at P_{θ,η} relative to the tangent set Ṗ_{P_{θ,η}} consisting of all score functions a^T ℓ̇_{θ,η} + g. Then every influence function ψ̃_{θ,η} is orthogonal to the nuisance tangent set ηṖ_{P_{θ,η}}.

Proof. The result follows immediately by putting a = 0 in (7.20),

0 = ⟨ ψ̃_{θ,η}, g ⟩_{P_{θ,η}}.

Since the inner product of any influence function ψ̃_{θ,η} with an arbitrary g ∈ ηṖ_{P_{θ,η}} is zero, we conclude that ψ̃_{θ,η} ⊥ ηṖ_{P_{θ,η}}.

This is an extension of the result obtained in Chapter 4, see Theorem 4.3, (ii). That this is indeed a generalization follows from the fact that the influence function ϕ of any RAL estimator for θ satisfies (6.14), and this is the assumption Proposition 7.1 relies on. In particular, the efficient influence function, which we denote again by ψ̃_{θ,η}, is orthogonal to the nuisance tangent space ηṖ_{P_{θ,η}}.

Remark 7.4. It is worth paying attention to the dimension of ψ̃_{θ,η}. This is a k-dimensional function. Thus, we should read 0 = ⟨ψ̃_{θ,η}, g⟩_{P_{θ,η}} as a componentwise equality between two vectors,

[0, 0, ..., 0]^T = [ ⟨ψ̃_{θ,η}^1, g⟩_{P_{θ,η}}, ⟨ψ̃_{θ,η}^2, g⟩_{P_{θ,η}}, ..., ⟨ψ̃_{θ,η}^k, g⟩_{P_{θ,η}} ]^T,

where ψ̃_{θ,η} = [ψ̃_{θ,η}^1, ψ̃_{θ,η}^2, ..., ψ̃_{θ,η}^k]^T.

It is time to be more rigorous about the preceding observations. We shall state a lemma that gives an interesting form for the efficient influence function ψ̃_{θ,η}. Before doing so, we define the operator

Π_{θ,η} : L₂(P_{θ,η}) → lin ηṖ_{P_{θ,η}}

to be the orthogonal projection onto the closure of the linear span of the nuisance tangent set in L₂(P_{θ,η}). This operator will be used in the following definition.

Definition 7.9. In constructing the efficient influence function ψ̃_{θ,η}, the following objects will be important.

(i) The efficient score function for θ is ℓ̃_{θ,η} = ℓ̇_{θ,η} − Π_{θ,η} ℓ̇_{θ,η}.

(ii) The efficient information matrix for θ is Ĩ_{θ,η} = P_{θ,η} ℓ̃_{θ,η} ℓ̃_{θ,η}^T.

Remark 7.5. The action of Π_{θ,η} on the k-dimensional vector ℓ̇_{θ,η} is defined componentwise,

Π_{θ,η} ℓ̇_{θ,η} = [Π_{θ,η} ℓ̇_{θ,η}^1, ..., Π_{θ,η} ℓ̇_{θ,η}^k]^T,

where ℓ̇_{θ,η} = [ℓ̇_{θ,η}^1, ..., ℓ̇_{θ,η}^k]^T. This coincides with the definition of the efficient score S_eff in Chapter 4, see Definition 4.6, since the nuisance tangent space Λ as defined in Chapter 4, see Definition 4.5, is a k-replicating space, see Lemma 4.2.

Note that the efficient information matrix Ĩ_{θ,η} is the variance of the efficient score vector, since the efficient score vector has mean zero. We now state the lemma that tells us under which conditions the efficient influence function can be found and how we can compute it.

Lemma 7.3. Suppose that for every a ∈ ℝ^k and every g ∈ ηṖ_{P_{θ,η}} there exists a path t ↦ η_t in H such that

∫ [ ( dP_{θ+ta,η_t}^{1/2} − dP_{θ,η}^{1/2} ) / t − (1/2)(a^T ℓ̇_{θ,η} + g) dP_{θ,η}^{1/2} ]² → 0 as t → 0. (7.21)

If Ĩ_{θ,η} is nonsingular, then the functional ψ : P → ℝ^k : P_{θ,η} ↦ ψ(P_{θ,η}) = θ is differentiable at P_{θ,η} relative to the tangent set Ṗ_{P_{θ,η}} = lin ℓ̇_{θ,η} + ηṖ_{P_{θ,η}}, with efficient influence function ψ̃_{θ,η} = Ĩ_{θ,η}^{−1} ℓ̃_{θ,η}, where lin ℓ̇_{θ,η} = {a^T ℓ̇_{θ,η} : a ∈ ℝ^k}.

Proof. Comparing (7.21) with (6.1) shows that the functions a^T ℓ̇_{θ,η} + g are score functions corresponding to the parametric submodels t ↦ P_{θ+ta,η_t}, and by definition of a tangent set, we conclude that Ṗ_{P_{θ,η}} = lin ℓ̇_{θ,η} + ηṖ_{P_{θ,η}} is a tangent set by assumption. An easy calculation shows that ψ is differentiable at P_{θ,η} relative to Ṗ_{P_{θ,η}} with efficient influence function ψ̃_{θ,η} = Ĩ_{θ,η}^{−1} ℓ̃_{θ,η}. Indeed,

⟨ψ̃_{θ,η}, a^T ℓ̇_{θ,η} + g⟩_{P_{θ,η}} = ⟨Ĩ_{θ,η}^{−1} ℓ̃_{θ,η}, a^T ℓ̇_{θ,η} + g⟩_{P_{θ,η}}
= Ĩ_{θ,η}^{−1} { ⟨ℓ̃_{θ,η}, a^T ℓ̇_{θ,η}⟩_{P_{θ,η}} + ⟨ℓ̃_{θ,η}, g⟩_{P_{θ,η}} }

by the linearity of the inner product. Since ℓ̃_{θ,η} is the residual of ℓ̇_{θ,η} after projecting it onto ηṖ_{P_{θ,η}}, and because g ∈ ηṖ_{P_{θ,η}}, we have ⟨ℓ̃_{θ,η}, g⟩_{P_{θ,η}} = 0. In addition,

Ĩ_{θ,η}^{−1} ⟨ℓ̃_{θ,η}, a^T ℓ̇_{θ,η}⟩_{P_{θ,η}} = Ĩ_{θ,η}^{−1} ∫ ℓ̃_{θ,η} a^T ℓ̇_{θ,η} dP_{θ,η}
= Ĩ_{θ,η}^{−1} ∫ ℓ̃_{θ,η} ℓ̇_{θ,η}^T a dP_{θ,η}

= Ĩ_{θ,η}^{−1} ∫ ℓ̃_{θ,η} { ℓ̇_{θ,η} − Π_{θ,η} ℓ̇_{θ,η} }^T dP_{θ,η} a

because ⟨ℓ̃_{θ,η}, Π_{θ,η} ℓ̇_{θ,η}⟩_{P_{θ,η}} = 0. Thus,

⟨ψ̃_{θ,η}, a^T ℓ̇_{θ,η} + g⟩_{P_{θ,η}} = Ĩ_{θ,η}^{−1} P_{θ,η} ℓ̃_{θ,η} ℓ̃_{θ,η}^T a
= Ĩ_{θ,η}^{−1} Ĩ_{θ,η} a = a

= ∂/∂t ψ(P_{θ+ta,η_t}) |_{t=0}.

We conclude that ψ : P → ℝ^k : P_{θ,η} ↦ ψ(P_{θ,η}) = θ is differentiable at P_{θ,η} relative to the tangent set Ṗ_{P_{θ,η}} = lin ℓ̇_{θ,η} + ηṖ_{P_{θ,η}}, with efficient influence function ψ̃_{θ,η} = Ĩ_{θ,η}^{−1} ℓ̃_{θ,η}. Note that this is clearly the efficient influence function, because it is the unique influence function lying (componentwise) in lin Ṗ_{P_{θ,η}}. Indeed,

Ĩ_{θ,η}^{−1} ℓ̃_{θ,η} = Ĩ_{θ,η}^{−1} ℓ̇_{θ,η} − Ĩ_{θ,η}^{−1} Π_{θ,η} ℓ̇_{θ,η},

where each component of Ĩ_{θ,η}^{−1} ℓ̇_{θ,η} lies in lin ℓ̇_{θ,η} and each component of −Ĩ_{θ,η}^{−1} Π_{θ,η} ℓ̇_{θ,η} lies in lin ηṖ_{P_{θ,η}}.

Remark 7.6. Again it is worth noting the importance of keeping the dimensions of the considered objects in the proof in mind. These are not just scalars, but vectors or matrices. E.g., ⟨ℓ̃_{θ,η}, Π_{θ,η} ℓ̇_{θ,η}⟩_{P_{θ,η}} = 0 is an equality between two (k × k)-matrices,

. .  D˜1 1 E D˜1 k E  `θ,η, Πθ,η`θ,η ··· `θ,η, Πθ,η`θ,η Pθ,η Pθ,η    . .  0 ··· 0  D˜2 1 E D˜2 k E   `θ,η, Πθ,η`θ,η ··· `θ,η, Πθ,η`θ,η   0 ··· 0   Pθ,η Pθ,η  =   ,  . .   . .   . .   . .   . .  . . 0 ··· 0  D˜k 1 E D˜k k E  `θ,η, Πθ,η`θ,η ··· `θ,η, Πθ,η`θ,η Pθ,η Pθ,η

. . . ˜ ˜1 ˜k T 1 k T where `θ,η = [`θ,η,..., `θ,η] and Πθ,η`θ,η= [Πθ,η`θ,η,..., Πθ,η`θ,η] . 156 Chapter 7. Lower Bounds

As a consequence, we obtain a specialized version of Lemma 7.2. The previous lemma showed that the functional $\psi(P_{\theta,\eta}) = \theta$ is differentiable at $P_{\theta,\eta}$ relative to the tangent set $\dot{\mathcal P}_{P_{\theta,\eta}} = \operatorname{lin}\dot\ell_{\theta,\eta} + {}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$. In order to apply Lemma 7.2, the tangent set must be a cone. The part of the tangent set corresponding with the parameter of interest, $\operatorname{lin}\dot\ell_{\theta,\eta}$, is linear, thus a fortiori a cone. The nuisance tangent set ${}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$ is not necessarily a cone; in this general theory this is an assumption and in practice it should be checked. However, it seems that in most cases this is fulfilled, so we are not really imposing serious restrictions.

Corollary 7.3. Suppose the nuisance tangent set ${}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$ is a cone. A sequence of estimators $T_n$ is regular at $P_{\theta,\eta}$ with limiting distribution $N(0, P\tilde\psi_{\theta,\eta}\tilde\psi_{\theta,\eta}^T)$ (asymptotically efficient) if and only if it satisfies
\[
\sqrt{n}(T_n - \theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \tilde I_{\theta,\eta}^{-1}\tilde\ell_{\theta,\eta}(X_i) + o_{P_{\theta,\eta}}(1). \tag{7.22}
\]

This result also coincides with the results obtained in Chapter 4, see Theorem 4.3. Surprisingly, we may conclude that if an asymptotically efficient estimator exists, it is a regular asymptotically linear estimator, a RAL estimator. We now reflect on the new terminology. What happens when we express $P\tilde\psi_{\theta,\eta}\tilde\psi_{\theta,\eta}^T$ in terms of the new terminology? We see that
\[
P\tilde\psi_{\theta,\eta}\tilde\psi_{\theta,\eta}^T
= P\{\tilde I_{\theta,\eta}^{-1}\tilde\ell_{\theta,\eta}(\tilde I_{\theta,\eta}^{-1}\tilde\ell_{\theta,\eta})^T\}
= P\{\tilde I_{\theta,\eta}^{-1}\tilde\ell_{\theta,\eta}\tilde\ell_{\theta,\eta}^T\tilde I_{\theta,\eta}^{-1}\}
= \tilde I_{\theta,\eta}^{-1}P(\tilde\ell_{\theta,\eta}\tilde\ell_{\theta,\eta}^T)\tilde I_{\theta,\eta}^{-1}
= \tilde I_{\theta,\eta}^{-1}
= \left(P\tilde\ell_{\theta,\eta}\tilde\ell_{\theta,\eta}^T\right)^{-1}.
\]

The reason why we call $\tilde\ell_{\theta,\eta}$ the efficient score function and $\tilde I_{\theta,\eta}$ the efficient information matrix is now clear. The variance of the efficient influence function is equal to the inverse of the efficient information matrix and thus equal to the inverse of the variance of the efficient score function. The result obtained in equation (7.22) is very similar to the situation for efficient estimators in parametric models. In the parametric case where we are interested in the full parameter $\theta$, there is no nuisance parameter. Under regularity conditions, the maximum likelihood estimator $\hat\theta_n$ in a parametric model satisfies
\[
\sqrt{n}(\hat\theta_n - \theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n I_\theta^{-1}\dot\ell_\theta(X_i) + o_{P_\theta}(1),
\]
where $I_\theta$ is the ordinary Fisher information matrix and $\dot\ell_\theta$ is the ordinary score function. For more information about regularity conditions and proofs, see van der Vaart (1998), [39], Chapters 5 and 8.

The only difference in the semiparametric setting is that the ordinary score function $\dot\ell_{\theta,\eta}$ is replaced by the efficient score function $\tilde\ell_{\theta,\eta}$ and the Fisher information matrix $I_{\theta,\eta}$ for $\theta$ is replaced by the efficient information matrix $\tilde I_{\theta,\eta}$. This has an intuitive but very insightful explanation. A part of the score function for $\theta$ can also be accounted for by score functions for the nuisance parameter $\eta$, meaning that when the nuisance parameter is unknown, a part of the information for $\theta$ is lost. The exciting thing is that this loss of information for $\theta$ due to the presence of $\eta$ is quantified. The geometry of the tangent sets yields the answer. The orthogonal projection $\Pi_{\theta,\eta}\dot\ell_{\theta,\eta}$ of the score function for $\theta$ onto the nuisance tangent space ${}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$ quantifies this loss. The loss of information for $\theta$ thus corresponds with a loss of the score function for $\theta$: we subtract the orthogonal projection onto ${}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$. When there is no nuisance parameter, there is no nuisance tangent space and thus no loss of information for estimating $\theta$. Let us now apply this to some examples. First we revisit the restricted moment model, which is already discussed in much detail in §5.1. Next, we consider estimation under symmetric location.
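Expansion (7.22) also suggests how an efficient estimator can be constructed in practice: starting from a root-n consistent initial estimate, one Newton-Raphson step along the efficient score yields the required asymptotic linearity. The sketch below is a generic illustration of this idea and not a construction taken from the text; the routines eff_score and eff_information are hypothetical, model-specific functions evaluating $\tilde\ell_{\theta,\eta}$ and $\tilde I_{\theta,\eta}$ at a preliminary estimate.

```python
import numpy as np

def one_step_update(theta_init, data, eff_score, eff_information):
    """One Newton-Raphson step along the efficient score (hypothetical generic routine).

    eff_score(theta, data)       -> (n, k) array of efficient-score evaluations
    eff_information(theta, data) -> (k, k) efficient information matrix estimate
    """
    scores = eff_score(theta_init, data)            # n x k
    info = eff_information(theta_init, data)        # k x k
    return theta_init + np.linalg.solve(info, scores.mean(axis=0))

# Toy usage with a normal location model (no nuisance part), where the efficient
# score reduces to the ordinary score (x - theta) and the information to 1.
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=1000)

theta0 = np.array([np.median(data)])                # any root-n consistent start
theta1 = one_step_update(
    theta0, data,
    eff_score=lambda th, x: (x - th[0]).reshape(-1, 1),
    eff_information=lambda th, x: np.array([[1.0]]),
)
print("initial:", theta0, "one-step:", theta1)
```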

7.4.2 Restricted Moment Model, Revisited

Let us consider the restricted moment model one last time, but now under a slightly different representation and in the notation of this part, as presented in van der Vaart (1998), [39], §25.28. To be more specific, the model is denoted by

\[
Y = g_\theta(X) + e, \qquad E(e\,|\,X) = 0, \tag{7.23}
\]
where we assume $Y$ to be a one-dimensional continuous random variable with dominating measure the ordinary Lebesgue measure $\ell_Y$ and $X$ represents the vector of covariates with dominating measure $\nu_X$. The function $g_\theta(\cdot)$ is a known function depending on the unknown parameter $\theta$, the parameter of interest. Note this is not a location-shift model since we do not assume $e \perp\!\!\!\perp X$. Assume that $(X, e)$ possesses a density $\eta$. This is the nuisance parameter. The observation $(X, Y)$ then has a density $\eta\{x, y - g_\theta(x)\}$, where $\eta$ is essentially only restricted by the conditions of being a proper joint density function of $(X, e)$ with $E(e|X) = 0$. This is equivalent to the conditions $\eta(x, e) \geq 0$ for all $x$ and $e$, $\int\!\!\int \eta(x, e)\,d\nu_X(x)\,d\ell_Y(e) = 1$ and $\int e\,\eta(x, e)\,d\ell_Y(e) = 0$ for all $x$.

Note the representation of the restricted moment model here is slightly different from the representation described in Chapter 5. We now directly work with the joint density function of $e$ and $X$ instead of a factorization in conditional density functions.

Let us first look for the nuisance tangent space. As we have seen for the nonparametric model, every score function $a(X, e) \in L_2(P)$ must have mean zero, i.e., $E\{a(X, e)\} = 0$. This was implied by the condition that $\int\!\!\int \eta(x, e)\,d\nu_X(x)\,d\ell_Y(e) = 1$, but it also follows from Lemma 6.1, where we used the fact that every density function integrates to one to obtain this property of score functions. To see what condition the restriction $\int e\,\eta(x, e)\,d\ell_Y(e) = 0$ for all $x$ implies, we insert appropriate submodels $t \mapsto \eta_t$. This means that $\int e\,\eta_t(x, e)\,d\ell_Y(e) = 0$ for all $t$ in a neighbourhood of zero. Differentiating with respect to $t$, assuming we can interchange integration and differentiation, and evaluating at $t = 0$, we obtain that
\[
0 = \frac{\partial}{\partial t}\Big|_{t=0}\int e\,\eta_t(x, e)\,d\ell_Y(e)
= \int e\,\frac{\partial\eta_t(x, e)}{\partial t}\Big|_{t=0}\,d\ell_Y(e)
= \int e\,\frac{\partial\log\eta_t(x, e)}{\partial t}\Big|_{t=0}\,\eta(x, e)\,d\ell_Y(e).
\]
This implies that
\[
E\{e\,a_{\eta_t}(X, e)\,|\,X\} = \frac{\int e\,a_{\eta_t}(X, e)\,\eta(X, e)\,d\ell_Y(e)}{\int \eta(X, e)\,d\ell_Y(e)} = 0,
\]
where $a_{\eta_t}(X, e) = \frac{\partial\log\eta_t(X, e)}{\partial t}\big|_{t=0}$. This implies that any score function is a function $a(X, e) = a\{X, Y - g_\theta(X)\} \in L_2(P)$ that has mean zero and must satisfy
\[
E\{e\,a(X, e)\,|\,X\} = \frac{\int e\,a(X, e)\,\eta(X, e)\,d\ell_Y(e)}{\int \eta(X, e)\,d\ell_Y(e)} = 0.
\]

This can be checked by using one-dimensional parametric submodels that are linear in the score, $p_t(x, e) = \{1 + t\,a(x, e)\}\,p_0(x, e)$, where $p_0(x, e)$ denotes the truth. Henceforth, the nuisance tangent space for the restricted moment model is
\[
{}_\eta\dot{\mathcal P}_{P_{\theta,\eta}} = \Big\{a(X, e) \in L_2(P_{\theta,\eta}): E\{a(X, e)\} = 0 \text{ and } E\{e\,a(X, e)\,|\,X\} = 0\Big\}. \tag{7.24}
\]
Note this is a similar result as in Chapter 5, see (5.14). Since $E\{e\,a(X, e)\,|\,X\} = 0$ for all values of $X$, we see that $e \perp a(X, e)$. Multiplying $e$ by a function of $X$, i.e., $h(X)$, we see that $E\{e\,h(X)\,a(X, e)\,|\,X\} = 0$. This shows that any score function $a(X, e)$ is in the orthogonal complement of $e\mathcal H_X$, where $\mathcal H_X$ is the space of all square-integrable functions of $X$. As we already know from Chapter 5, this is the orthogonal complement of the nuisance tangent space ${}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$ within the space of all $L_2(P_{\theta,\eta})$-functions with mean zero.

We will be interested in the residual of the ordinary score $\dot\ell_{\theta,\eta}$ after projecting it onto the nuisance tangent space ${}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$, i.e., the efficient score function $\tilde\ell_{\theta,\eta} = \dot\ell_{\theta,\eta} - \Pi_{\theta,\eta}\dot\ell_{\theta,\eta}$. This is equivalent to $\tilde\ell_{\theta,\eta} = \Pi\{\dot\ell_{\theta,\eta}\,|\,e\mathcal H_X\}$. Therefore, we are interested in what the projection onto $e\mathcal H_X$ looks like. Consider an arbitrary function $b(X, e) \in L_2(P_{\theta,\eta})$. Denote the projection of $b(X, e)$ onto $e\mathcal H_X$ by $\Pi\{b(X, e)\,|\,e\mathcal H_X\} = e\,h_0(X)$. By the orthogonality relationship, we obtain that
\[
\begin{aligned}
\Pi\{b(X, e)\,|\,e\mathcal H_X\} = e\,h_0(X)
&\iff E[\{b(X, e) - e\,h_0(X)\}\,e\,h(X)] = 0, \quad \forall h(X) \in \mathcal H_X \\
&\iff E\{b(X, e)\,e\,h(X)\} = E\{e^2 h_0(X)h(X)\}, \quad \forall h(X) \in \mathcal H_X.
\end{aligned}
\]

This can be easily solved and we see that the projection operator onto $e\mathcal H_X$ is given by
\[
\Pi\{b(X, e)\,|\,e\mathcal H_X\} = e\,\frac{E\{b(X, e)\,e\,|\,X\}}{E(e^2\,|\,X)}, \tag{7.25}
\]
where $h_0(X) = E\{b(X, e)\,e\,|\,X\}/E(e^2\,|\,X)$. Before calculating the efficient score function, we need to know the ordinary score. We have that
\[
\dot\ell_{\theta,\eta}(x, y) = \frac{\partial\log\eta\{x, y - g_\theta(x)\}}{\partial\theta} = -\frac{\eta_2(x, e)}{\eta(x, e)}\,\dot g_\theta(x), \tag{7.26}
\]
where $\eta_2(x, e) = \partial\eta(x, e)/\partial e$ is the partial derivative with respect to the second argument and $\dot g_\theta(x) = \partial g_\theta(x)/\partial\theta$. From this we see that the total tangent space is given by
\[
\dot{\mathcal P}_{P_{\theta,\eta}} = \operatorname{lin}\dot\ell_{\theta,\eta} + {}_\eta\dot{\mathcal P}_{P_{\theta,\eta}},
\]
where ${}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$ is defined in (7.24) and
\[
\operatorname{lin}\dot\ell_{\theta,\eta} = \left\{-\frac{\eta_2(x, e)}{\eta(x, e)}\,a^T\dot g_\theta(x): a \in \mathbb{R}^k\right\}.
\]
Applying (7.25) to (7.26) yields the efficient score function,
\[
\begin{aligned}
\tilde\ell_{\theta,\eta}(X, Y) = \Pi\{\dot\ell_{\theta,\eta}(X, Y)\,|\,e\mathcal H_X\}
&= e\,\frac{E\{\dot\ell_{\theta,\eta}(X, Y)\,e\,|\,X\}}{E(e^2\,|\,X)} \\
&= -\frac{e\,\dot g_\theta(X)}{E(e^2\,|\,X)}\,E\left\{\frac{\eta_2(X, e)\,e}{\eta(X, e)}\,\Big|\,X\right\} \\
&= -\frac{e\,\dot g_\theta(X)}{E(e^2\,|\,X)}\int\frac{\eta_2(X, e)\,e}{\eta(X, e)}\,\frac{\eta(X, e)}{\int\eta(X, e')\,d\ell_Y(e')}\,d\ell_Y(e) \\
&= -\frac{e\,\dot g_\theta(X)}{E(e^2\,|\,X)}\,\frac{\int\eta_2(X, e)\,e\,d\ell_Y(e)}{\int\eta(X, e)\,d\ell_Y(e)}.
\end{aligned}
\]

It seems this is our final result for the efficient score. However, we can still make some progress by differentiating the identity $\int e\,\eta(X, e)\,d\ell_Y(e) = 0$ with respect to $\theta$. Assuming we can interchange integration and differentiation, we deduce
\[
0 = \frac{\partial}{\partial\theta}\int e\,\eta(X, e)\,d\ell_Y(e)
= \int\frac{\partial}{\partial\theta}\{e\,\eta(X, e)\}\,d\ell_Y(e)
= -\dot g_\theta(X)\int\eta(X, e)\,d\ell_Y(e) - \dot g_\theta(X)\int\eta_2(X, e)\,e\,d\ell_Y(e).
\]

This shows that $\int\eta(X, e)\,d\ell_Y(e) = -\int\eta_2(X, e)\,e\,d\ell_Y(e)$. Henceforth, we obtain
\[
\tilde\ell_{\theta,\eta}(X, Y) = \frac{e\,\dot g_\theta(X)}{E(e^2\,|\,X)} = \frac{\{Y - g_\theta(X)\}\,\dot g_\theta(X)}{E(e^2\,|\,X)}. \tag{7.27}
\]
Compare this with (5.25) and we see we obtain the exact same result. The one thing that remains to be calculated is the efficient information matrix $\tilde I_{\theta,\eta}$. Using the law of iterated expectations, we obtain
\[
\tilde I_{\theta,\eta} = P_{\theta,\eta}\tilde\ell_{\theta,\eta}\tilde\ell_{\theta,\eta}^T = E\{\tilde\ell_{\theta,\eta}\tilde\ell_{\theta,\eta}^T\}
= E\left[\frac{e^2}{\{E(e^2|X)\}^2}\,\dot g_\theta(X)\,\dot g_\theta(X)^T\right]
= E\left[\frac{\dot g_\theta(X)\,\dot g_\theta(X)^T}{\{E(e^2|X)\}^2}\,E(e^2|X)\right]
= E\left\{\frac{\dot g_\theta(X)\,\dot g_\theta(X)^T}{E(e^2|X)}\right\}. \tag{7.28}
\]

This also coincides with the calculations made in Chapter 5.

Using Lemma 7.3, we know that the parameter $\psi(P_{\theta,\eta}) = \theta$ is differentiable at $P_{\theta,\eta}$ relative to the tangent set $\dot{\mathcal P}_{P_{\theta,\eta}} = \operatorname{lin}\dot\ell_{\theta,\eta} + {}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$ with efficient influence function given by $\tilde\psi_{\theta,\eta} = \tilde I_{\theta,\eta}^{-1}\tilde\ell_{\theta,\eta}$. Using (7.27) and (7.28), we obtain
\[
\tilde\psi_{\theta,\eta} = E\left\{\frac{\dot g_\theta(X)\,\dot g_\theta(X)^T}{E(e^2|X)}\right\}^{-1}\frac{\{Y - g_\theta(X)\}\,\dot g_\theta(X)}{E(e^2|X)}. \tag{7.29}
\]

Comparing (7.29) with (5.26), we see we have obtained the same result.
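To make (7.27)-(7.29) concrete, the sketch below evaluates the efficient score, efficient information and efficient influence function for a simple instance of the restricted moment model. The specific choices (a linear working model $g_\theta(X) = \theta X$, heteroscedastic normal errors, and a known conditional variance function) are hypothetical and made only for illustration; in practice $E(e^2|X)$ would itself have to be modelled or estimated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical instance of Y = g_theta(X) + e with E(e|X) = 0:
# g_theta(x) = theta * x, Var(e|X) = 1 + X^2 (treated as known here, for illustration only).
theta = 2.0
X = rng.uniform(-1, 1, size=n)
cond_var = 1.0 + X**2
e = rng.normal(scale=np.sqrt(cond_var))
Y = theta * X + e

g_dot = X                                     # d g_theta(X) / d theta

# Efficient score (7.27): {Y - g_theta(X)} g_dot(X) / E(e^2 | X).
eff_score = (Y - theta * X) * g_dot / cond_var

# Efficient information (7.28): E{ g_dot g_dot^T / E(e^2|X) } (scalar theta here).
eff_info = np.mean(g_dot**2 / cond_var)

# Efficient influence function (7.29), evaluated at the observations.
eff_influence = eff_score / eff_info

print("efficient information:", eff_info)
print("Monte Carlo variance of efficient influence:", eff_influence.var())
print("semiparametric bound 1/eff_info:", 1 / eff_info)
```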

7.4.3 Symmetric Location

In §1.3.3, we introduced the symmetric location model, i.e., a model that consists of all densities $x \mapsto \eta(x - \theta)$ with $\theta \in \mathbb{R}$ and $\eta$ symmetric about 0 with finite Fisher information for location $I_\theta(\eta)$, i.e.,
\[
I_\theta(\eta) = E\left[\left\{\frac{\partial}{\partial\theta}\ln\eta(X - \theta)\right\}^2\right] = \int\frac{\eta'(z)^2}{\eta(z)}\,dz < +\infty. \tag{7.30}
\]
From this it follows that the observations are sampled from a density that is symmetric about $\theta$, and this semiparametric model turned out to be a semiparametric model in a strict sense,
\[
\mathcal P = \{P_{\theta,\eta}: \theta \in \mathbb{R}, \eta \in H\},
\]
with $H$ the set of all proper absolutely continuous densities symmetric about zero with finite Fisher information for location.

As we already know, to define efficiency in this model, we need to look for the nuisance tangent set. It will turn out this is a linear set, henceforth we can speak of the nuisance tangent space. First note that, by the symmetry, the density can be equivalently written as $\eta(|x - \theta|)$. This follows from the fact that the value of $\eta$ only depends on the distance of $x$ to the point of symmetry $\theta$. No further information is contained in the nonparametric component of the symmetric location model. This implies that any score function for the nuisance parameter $\eta$ must be a function of $|x - \theta|$. Indeed, by inserting appropriate parametric submodels $t \mapsto \eta_t$, we deduce that
\[
\frac{\partial\log\eta_t(|x - \theta|)}{\partial t}\Big|_{t=0} = \frac{1}{\eta(|x - \theta|)}\,\frac{\partial\eta_t(|x - \theta|)}{\partial t}\Big|_{t=0}.
\]
This implies that any score function for the nuisance parameter must be a function $b(|x - \theta|) \in L_2(P_{\theta,\eta})$ that has mean zero². Using the usual argument, we can show that for any function $b(|x - \theta|) \in L_2(P_{\theta,\eta})$ with mean zero, the one-dimensional parametric submodel, linear in the score, $\eta_t(|x - \theta|) = \{1 + t\,b(|x - \theta|)\}\,\eta(|x - \theta|)$ defines a proper submodel with score function $b(|x - \theta|)$. Henceforth, the nuisance tangent space is given by
\[
{}_\eta\dot{\mathcal P}_{P_{\theta,\eta}} = \Big\{b(|X - \theta|) \in L_2(P_{\theta,\eta}): E\{b(|X - \theta|)\} = 0\Big\}. \tag{7.31}
\]
As for the restricted moment model, we are interested in the efficient score function $\tilde\ell_{\theta,\eta} = \dot\ell_{\theta,\eta} - \Pi_{\theta,\eta}\dot\ell_{\theta,\eta}$. Henceforth, we need to calculate the ordinary score for $\theta$, i.e., $\dot\ell_{\theta,\eta}$. Using the chain rule, we obtain
\[
\dot\ell_{\theta,\eta}(x) = \frac{\partial}{\partial\theta}\log\eta(|x - \theta|) = -\frac{\eta'(|x - \theta|)}{\eta(|x - \theta|)}\,\operatorname{sign}(x - \theta).
\]
This can be equivalently written as
\[
\dot\ell_{\theta,\eta}(x) = -\frac{\eta'(x - \theta)}{\eta(|x - \theta|)}. \tag{7.32}
\]
Indeed, since the derivative of a symmetric function is an antisymmetric function, we know that $\eta'$ is antisymmetric, i.e., $-\eta'(x - \theta) = \eta'(\theta - x)$. If $\operatorname{sign}(x - \theta) = 1$, we have $|x - \theta| = x - \theta$, henceforth
\[
\dot\ell_{\theta,\eta}(x) = -\frac{\eta'(|x - \theta|)}{\eta(|x - \theta|)} = -\frac{\eta'(x - \theta)}{\eta(|x - \theta|)}.
\]
If $\operatorname{sign}(x - \theta) = -1$, we have $|x - \theta| = \theta - x$, henceforth, using the antisymmetry of $\eta'$,
\[
\dot\ell_{\theta,\eta}(x) = \frac{\eta'(|x - \theta|)}{\eta(|x - \theta|)} = \frac{\eta'(\theta - x)}{\eta(|x - \theta|)} = -\frac{\eta'(x - \theta)}{\eta(|x - \theta|)}.
\]
We may conclude that the total tangent set is given by
\[
\dot{\mathcal P}_{P_{\theta,\eta}} = \operatorname{lin}\dot\ell_{\theta,\eta} + {}_\eta\dot{\mathcal P}_{P_{\theta,\eta}},
\]
where ${}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$ is defined in (7.31) and
\[
\operatorname{lin}\dot\ell_{\theta,\eta} = \left\{a\,\frac{\eta'(x - \theta)}{\eta(|x - \theta|)}: a \in \mathbb{R}\right\},
\]
where the minus sign is absorbed in the constant $a$. We now prove an interesting lemma that makes the calculation of the efficient score very easy.

²As we already know, this is implied by the fact that any density must integrate to 1, and by Lemma 6.1 we know this is a fundamental property of score functions.

Lemma 7.4. The linear span of the ordinary score function is orthogonal to the nuisance tangent space, i.e.,
\[
\operatorname{lin}\dot\ell_{\theta,\eta} \perp {}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}.
\]

Proof. Take any $b(|X - \theta|) \in {}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$ and $a \in \mathbb{R}$. We then subsequently deduce
\[
\begin{aligned}
P\left\{a\,\frac{\eta'(X - \theta)}{\eta(|X - \theta|)}\,b(|X - \theta|)\right\}
&= a\int_{-\infty}^{+\infty}\frac{\eta'(x - \theta)}{\eta(|x - \theta|)}\,b(|x - \theta|)\,\eta(|x - \theta|)\,dx \\
&= a\int_{-\infty}^{+\infty}\eta'(x - \theta)\,b(|x - \theta|)\,dx \\
&\overset{(i)}{=} a\int_{-\infty}^{+\infty}\eta'(y)\,b(|y|)\,dy \\
&\overset{(ii)}{=} a\left\{\int_{-\infty}^{0}\eta'(y)\,b(-y)\,dy + \int_{0}^{+\infty}\eta'(y)\,b(y)\,dy\right\} \\
&\overset{(iii)}{=} a\left\{\int_{0}^{+\infty}\eta'(-z)\,b(z)\,dz + \int_{0}^{+\infty}\eta'(y)\,b(y)\,dy\right\} \\
&\overset{(iv)}{=} a\left\{-\int_{0}^{+\infty}\eta'(z)\,b(z)\,dz + \int_{0}^{+\infty}\eta'(y)\,b(y)\,dy\right\} = 0.
\end{aligned}
\]

(i) We substitute $y = x - \theta$, so that $dx = dy$.

(ii) For the first integral, b(|y|) = b(−y) since y ≤ 0 and for the second integral, b(|y|) = b(y) since y ≥ 0.

(iii) In the first integral, we substitute $z = -y$, so that $dy = -dz$. The upper bound of the integral remains 0 and the lower bound becomes $+\infty$. Afterwards, we use the minus sign of the differential $-dz$ to interchange the lower and upper bounds.

(iv) We use the antisymmetry of $\eta'$, i.e., $\eta'(-z) = -\eta'(z)$.

From these calculations, we conclude that
\[
\langle a\,\dot\ell_{\theta,\eta}(X),\, b(|X - \theta|)\rangle_{P_{\theta,\eta}} = P\{a\,\dot\ell_{\theta,\eta}(X)\,b(|X - \theta|)\} = 0,
\]
which means that $a\,\dot\ell_{\theta,\eta}(X) \perp b(|X - \theta|)$.
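As a quick sanity check of Lemma 7.4 (not part of the original argument), the snippet below verifies numerically that $\int \eta'(y)\,b(|y|)\,dy \approx 0$ for one particular symmetric density and one particular even, mean-zero function $b$; the standard normal density and $b(|y|) = |y| - E|Y|$ are arbitrary illustrative choices.

```python
import numpy as np
from scipy import integrate

# Symmetric density eta: standard normal; its derivative eta'.
eta = lambda y: np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)
eta_prime = lambda y: -y * eta(y)

# A mean-zero nuisance direction depending on |y| only (illustrative choice).
mean_abs = integrate.quad(lambda y: np.abs(y) * eta(y), -np.inf, np.inf)[0]
b = lambda y: np.abs(y) - mean_abs

# Inner product between the score direction eta'(y)/eta(|y|) and b(|y|) under eta:
# it reduces to the integral of eta'(y) * b(|y|), which should vanish by Lemma 7.4.
inner, _ = integrate.quad(lambda y: eta_prime(y) * b(y), -np.inf, np.inf)
print("inner product (should be ~0):", inner)
```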

This lemma has a very important consequence. The ordinary score function $\dot\ell_{\theta,\eta}$ is automatically orthogonal to the nuisance tangent space ${}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$. Henceforth we obtain $\Pi_{\theta,\eta}\dot\ell_{\theta,\eta} = 0$. From this we conclude that the ordinary score function $\dot\ell_{\theta,\eta}$ coincides with the efficient score function $\tilde\ell_{\theta,\eta}$. The efficient information matrix $\tilde I_{\theta,\eta}$ is then simply given by the Fisher information for location, $I_\theta(\eta)$, defined in equation (7.30). This means that there is no difference in information about $\theta$ whether the shape of the density is known or unknown, as long as it is known to be symmetric. This is, as we have seen in §5.2, what we refer to as adaptive estimation. This quite surprising fact was discovered by Stein in 1956, see [34] for more information, and has been a motivation in the early work on semiparametric models. Several authors worked on defining adaptive estimators for the symmetric location model. A summary approach was given by Bickel (1982) in [4]. This provided a good starting point for extensions to more general models.

It turns out that in the symmetric location model there exist estimator sequences for $\theta$ whose definition does not depend on $\eta$ and that have asymptotic variance equal to $I_\theta^{-1}(\eta)$ under any true $\eta$. Since this relies on empirical processes and the notion of Donsker classes, see for example van der Vaart (1998), [39], Chapter 19, it is not within the scope of this thesis. However, for those familiar with these concepts, we refer to [39], §25.8, for a general discussion on how semiparametric efficient estimators can be constructed using the efficient score function. In particular, in §25.8.1, this is applied to the symmetric location model. This is no trivial task since the efficient score heavily depends on the nuisance parameter $\eta$.

We end this section with the discussion of the efficient influence function. Using Lemma 7.3, we know that the parameter $\psi(P_{\theta,\eta}) = \theta$ is differentiable at $P_{\theta,\eta}$ relative to the tangent set $\dot{\mathcal P}_{P_{\theta,\eta}} = \operatorname{lin}\dot\ell_{\theta,\eta} + {}_\eta\dot{\mathcal P}_{P_{\theta,\eta}}$ with efficient influence function given by $\tilde\psi_{\theta,\eta} = \tilde I_{\theta,\eta}^{-1}\tilde\ell_{\theta,\eta}$. Using the results obtained above, we conclude that the efficient influence function is given by
\[
\tilde\psi_{\theta,\eta}(X) = I_\theta^{-1}(\eta)\,\dot\ell_{\theta,\eta}(X) = -\frac{\eta'(X - \theta)}{\eta(|X - \theta|)}\left[\int\frac{\eta'(z)^2}{\eta(z)}\,dz\right]^{-1}. \tag{7.33}
\]

Now compare the efficient influence function for this semiparametric model with the efficient influence function obtained for a general parametric model, discussed in §6.3.1. Since $\psi(P_{\theta,\eta}) = \theta = \chi(\theta)$, it follows that $\dot\chi_\theta = 1$. Hence, the efficient influence function for the symmetric location model coincides with the efficient influence function for the parametric model in which the full shape of the density $\eta$ is known. This leads to the same conclusion with regard to the information about the location parameter $\theta$.
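As a numerical illustration of (7.30) and (7.33), again not part of the original text, the sketch below computes the Fisher information for location of a standard logistic density (an arbitrary symmetric choice) by quadrature and evaluates the efficient influence function at a few points; for the logistic density the Fisher information for location is known to equal 1/3, which serves as a check.

```python
import numpy as np
from scipy import integrate

# Standard logistic density (symmetric about 0), written in a numerically stable form,
# and its derivative: eta'(z) = -eta(z) * tanh(z/2).  An illustrative choice of eta.
eta = lambda z: 0.25 / np.cosh(z / 2)**2
eta_prime = lambda z: -eta(z) * np.tanh(z / 2)

# Fisher information for location (7.30): integral of eta'(z)^2/eta(z) = eta(z) tanh(z/2)^2.
fisher_loc, _ = integrate.quad(lambda z: eta(z) * np.tanh(z / 2)**2, -np.inf, np.inf)
print("Fisher information for location (logistic, ~1/3):", fisher_loc)

# Efficient influence function (7.33) at theta = 0, on a small grid of observations.
theta = 0.0
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
eff_influence = -eta_prime(x - theta) / eta(np.abs(x - theta)) / fisher_loc
print("efficient influence values:", eff_influence)
```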

7.4.4 The Infinite Bound Case

We end this chapter with an impossibility result due to Chamberlain (1986), [7]. Chamberlain showed that if the semiparametric efficiency bound (i.e., the variance of the efficient influence function $\tilde\psi_P$) is infinitely large (e.g., $\tilde I_{\theta,\eta} = P_{\theta,\eta}\tilde\ell_{\theta,\eta}\tilde\ell_{\theta,\eta}^T$ is singular), then no $\sqrt{n}$-consistent, regular estimator exists. A special case of this result can be seen to hold from the results obtained in Chapter 4. If the efficient information matrix is singular, the variance of the efficient influence function is infinite and, since this is a lower bound for the variance of any RAL estimator, no RAL estimator can exist. More details about this and other impossibility theorems can be found in Newey (1990), [27], §4.1 and §4.2.

Chapter 8

Calculus of Scores

8.1 Introduction

In this chapter we introduce a calculus of scores, which is a useful way of finding the efficient influence function in semiparametric models. The method of finding the efficient influence function of a parameter given in §7.4.1 is the most convenient method if the model can be naturally partitioned into a parameter of interest and a nuisance parameter. For many models, such a partition is impossible or unnatural. This is the case for information loss models, which will be the contents of §9.1 and §9.2. Furthermore, even in semiparametric models it can be worthwhile to derive a more concrete description of the nuisance tangent set, defined in Definition 7.8, and by doing so we will obtain a more specialized formula for the efficient influence function. In addition, we will also deal with the estimation of a real function of the (infinite-dimensional) nuisance parameter. This will be the contents of §9.3. We will do this in terms of a so-called score operator, already introduced in §6.3.3. First, we will consider the case of a statistical model in which the distributions are indexed by a probability measure η contained in another model H. Second, we will generalize this theory to the case of a statistical model in which the distributions are indexed by a parameter η contained in an arbitrary (possibly infinite-dimensional) set H. After this quite abstract theory is developed, in the next chapter we will focus attention on some applications, as already mentioned. First, we will apply the theory of score operators to information loss models. Next, we apply the (generalized) theory of score operators to semiparametric models in a strict sense. In particular, as already mentioned in §6.3.3, we will study Cox's proportional hazards model.

8.2 Score and Information Operators

8.2.1 Semiparametric Models Indexed by a Probability Measure η Contained in a Model H

Consider the case where the model is indexed by a parameter η that is itself a probability measure on some measurable space,

P = {Pη : η ∈ H},

where $H$ is the set of all permitted probability measures $\eta$. We focus attention on the estimation of a $k$-dimensional parameter $\psi(P_\eta) = \chi(\eta)$ for a given function $\chi: H \to \mathbb{R}^k$ on the model $H$ and thus, $\psi: \mathcal P \to \mathbb{R}^k$. To motivate this, we give a simple example.

Missing Data Problem. Suppose we are interested in the mean of a random variable $Y$. Assume the distribution $\eta$ of $Y$ belongs to a class of distributions $H$. However, we do not observe $Y$ completely due to missing data. To indicate whether $Y$ is observed or not, we use a missing data indicator $R$. We have
\[
R = \begin{cases} 0, & \text{if } Y \text{ is not observed}, \\ 1, & \text{if } Y \text{ is observed}. \end{cases}
\]

Hence, the distribution of the observed data, denoted YR, is given by

\[
f(y\,|\,R = 1) = \frac{f(y)\,P(R = 1\,|\,Y = y)}{P(R = 1)}.
\]

Hence, the observed data belongs to the class of density functions $\mathcal P$ described by the latter equation. △
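A small numerical sketch of this motivating example (with entirely hypothetical choices for the distribution of Y and the missingness mechanism) shows how the observed-data density is tilted by the selection probability P(R = 1|Y = y), which is why the complete-case mean differs from E(Y).

```python
import numpy as np
from scipy import stats, integrate
from scipy.special import expit

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical full-data law: Y ~ N(1, 1); hypothetical missingness: P(R=1|Y=y) = expit(y).
Y = rng.normal(loc=1.0, scale=1.0, size=n)
R = rng.binomial(1, expit(Y))

# Mean of the observed data, once empirically and once from the observed-data density
# f(y | R=1) = f(y) P(R=1|Y=y) / P(R=1).
f = lambda y: stats.norm.pdf(y, loc=1.0, scale=1.0)
p_r1 = integrate.quad(lambda y: f(y) * expit(y), -np.inf, np.inf)[0]
mean_obs = integrate.quad(lambda y: y * f(y) * expit(y) / p_r1, -np.inf, np.inf)[0]

print("E(Y), full data          :", Y.mean())
print("E(Y | R=1), empirical    :", Y[R == 1].mean())
print("E(Y | R=1), from density :", mean_obs)
```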

The model $\mathcal P$ defines a way to connect a density $\eta$ from the model $H$ to a density $P_\eta$ from the model $\mathcal P$, i.e., this defines a map $A: \eta \mapsto P_\eta$. Moreover, the model $H$ gives rise to a tangent set $\dot{\mathcal H}_\eta$ at $\eta$. We know the model $\mathcal P$ also gives rise to a tangent set $\dot{\mathcal P}_{P_\eta}$. A natural question is then: can we connect scores in $\dot{\mathcal H}_\eta$ with scores in $\dot{\mathcal P}_{P_\eta}$? To make this precise, we assume that a smooth parametric submodel $t \mapsto \eta_t$ in $H$ induces a smooth parametric submodel $t \mapsto P_{\eta_t}$ in $\mathcal P$. In addition, we assume that the score function $b$ corresponding with the submodel $t \mapsto \eta_t$ and the score function $g$ corresponding with the induced submodel $t \mapsto P_{\eta_t}$ are related by $g = A_\eta b$. Because we defined a tangent set to be any set of scores, the set $A_\eta\dot{\mathcal H}_\eta = \{A_\eta b: b \in \dot{\mathcal H}_\eta\}$ is a tangent set for the model $\mathcal P$ at $P_\eta$. The situation is sketched in the following diagram.
\[
\begin{array}{ccc}
H & \xrightarrow{\;A\;} & \mathcal P \\
\big\downarrow & & \big\downarrow \\
\dot{\mathcal H}_\eta & \xrightarrow{\;A_\eta\;} & A_\eta\dot{\mathcal H}_\eta
\end{array}
\]

The assumptions we made in the preceding display then translate into commutativity of the diagram. Because this map $A_\eta$ will be induced by the map $A$ and it is a map between tangent spaces, we can interpret the map $A_\eta$ as a derivative of the map $A$ at $\eta$. Thus, evaluating the derivative map in a certain element of the model yields a map between the corresponding tangent spaces. Because the linear span of $\dot{\mathcal H}_\eta$ is a linear subspace of the Hilbert space $L_2(\eta)$ and furthermore the image of the map $A_\eta$ is a subspace of the Hilbert space $L_2(P_\eta)$, we can look at this map as an operator between these Hilbert spaces. This leads to the following definition.

Definition 8.1. The operator that makes the diagram above commutative,

\[
A_\eta: \operatorname{lin}\dot{\mathcal H}_\eta \subset L_2(\eta) \to L_2(P_\eta): b \mapsto A_\eta b = g,
\]
turns scores for the model $H$ into scores for the model $\mathcal P$ and is therefore called a score operator.

In addition, we will assume this operator is continuous and linear. This has a serious implication: a lot of theory is available about continuous and linear operators between Hilbert spaces. We will use this opportunity extensively and obtain very interesting results. However, we must admit that at this point it is not clear how we can obtain such a score operator. There is no general recipe to construct a score operator and one has to rely on one's creativity and intuition. Nonetheless, it is possible to construct this score operator, but then we need to look at specific examples. Luckily, in some examples this turns out to be quite straightforward. At a later stage, we will apply the aforementioned ideas in much detail to information loss models (e.g., a missing data problem). But before doing so, we demonstrate how interesting properties of continuous and linear operators between Hilbert spaces, taken from functional analysis, can be used in semiparametric estimation theory.

As announced, we return to the estimation problem $\psi(P_\eta) = \chi(\eta)$. We assume that the function $\chi: H \to \mathbb{R}^k$ is differentiable with influence function $\tilde\chi_\eta$ relative to the tangent set $\dot{\mathcal H}_\eta$, i.e., by Definition 6.3 this means that
\[
\frac{\partial\chi(\eta_t)}{\partial t}\Big|_{t=0} = \langle\tilde\chi_\eta, b\rangle_{L_2(\eta)} = \int\tilde\chi_\eta\, b\, d\eta, \qquad \text{for all } b \in \dot{\mathcal H}_\eta. \tag{8.1}
\]

Next, we express what it means that the function $\psi(P_\eta)$ is differentiable relative to the tangent set $\dot{\mathcal P}_{P_\eta} = A_\eta\dot{\mathcal H}_\eta$ in terms of the differentiability of the function $\chi(\eta)$. By Definition 6.3, the function $\psi: \mathcal P \to \mathbb{R}^k$ is differentiable at $P_\eta$ relative to the tangent set $A_\eta\dot{\mathcal H}_\eta$ if and only if there exists a vector-valued function $\tilde\psi_{P_\eta}$ such that
\[
\langle\tilde\psi_{P_\eta}, A_\eta b\rangle_{L_2(P_\eta)} = \frac{\partial\psi(P_{\eta_t})}{\partial t}\Big|_{t=0} = \frac{\partial\chi(\eta_t)}{\partial t}\Big|_{t=0} = \langle\tilde\chi_\eta, b\rangle_{L_2(\eta)}, \qquad \text{for all } b \in \dot{\mathcal H}_\eta. \tag{8.2}
\]
Using functional analysis, we can rewrite this equation. In Definition 2.14 we introduced the notion of an adjoint operator. We will use this concept to rewrite the latter equation.

Definition 8.2. The operator $A_\eta^*: L_2(P_\eta) \to \overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta$, defined by
\[
\langle h, A_\eta b\rangle_{L_2(P_\eta)} = \langle A_\eta^* h, b\rangle_{L_2(\eta)},
\]
for every $h \in L_2(P_\eta)$ and for every $b \in \dot{\mathcal H}_\eta$, is called the adjoint score operator.

Note that we define $A_\eta^*$ to have as range a subset of $\overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta$, the closed linear span of $\dot{\mathcal H}_\eta$. Why is that? Why can't we define a subset of $\dot{\mathcal H}_\eta$ itself to be the range of $A_\eta^*$? We want $A_\eta^*$ to be the adjoint of $A_\eta: \dot{\mathcal H}_\eta \to L_2(P_\eta)$. To define its adjoint, we should consider an extension $\tilde A_\eta: L_2(\eta) \to L_2(P_\eta)$. Next, we obtain the adjoint $\tilde A_\eta^*: L_2(P_\eta) \to L_2(\eta)$ of this extension. The result $A_\eta^* h$ for $h \in L_2(P_\eta)$ is then obtained as the orthogonal projection of $\tilde A_\eta^* h$ onto $\overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta$. Thus, to be able to use the Projection Theorem, one must project onto a closed subspace of $L_2(\eta)$.

Remark 8.1. For more information about adjoint operators, see §2.3.

Using the adjoint score operator, we see that (8.2) is equivalent to

\[
A_\eta^*\tilde\psi_{P_\eta} = \tilde\chi_\eta. \tag{8.3}
\]

What does this mean? We have translated the problem of differentiability of the function $\psi$ at $P_\eta$ relative to the tangent set $A_\eta\dot{\mathcal H}_\eta$ into an easy-looking equation. This would be similar to a system of equations if $\tilde\chi_\eta$ were a known finite-dimensional vector of numbers, $\tilde\psi_{P_\eta}$ a finite-dimensional vector of unknowns and $A_\eta^*$ a matrix of appropriate dimension with known entries. However, this is not the case. The unknown $\tilde\psi_{P_\eta}$ is a function and can be seen as an infinite-dimensional vector of unknowns. Luckily, functional analysis imposes no limitation on studying the solvability of this equation.

We conclude that the function $\psi(P_\eta) = \chi(\eta)$ is differentiable at $P_\eta$ relative to the tangent set $\dot{\mathcal P}_{P_\eta} = A_\eta\dot{\mathcal H}_\eta$ if and only if this equation can be solved for $\tilde\psi_{P_\eta}$; equivalently, if and only if each coordinate function of $\tilde\chi_\eta$ is contained within the range of the adjoint $A_\eta^*$. Note that the efficient influence function $\tilde\chi_\eta$ is in $\overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta$ by definition. Thus, at first sight, it seems that (8.3) is always solvable. This is not true. There is no certainty that $A_\eta^*$ is onto $\overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta$, not even when it is one-to-one. Solvability of (8.3) is thus a condition.

The critical reader may have noted that the adjoint operator $A_\eta^*$ is defined to be an operator from $L_2(P_\eta)$ to $\overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta$. However, in (8.3) it is seen that $A_\eta^*$ acts on a $k$-dimensional function, while $A_\eta^*$ is defined to act on one-dimensional functions. When we discussed lower bounds, similar considerations arose. So, equation (8.3) should be read componentwise. That is, with $\tilde\psi_{P_\eta} = [\tilde\psi^1_{P_\eta}, \tilde\psi^2_{P_\eta}, \ldots, \tilde\psi^k_{P_\eta}]^T$ and $\tilde\chi_\eta = [\tilde\chi^1_\eta, \tilde\chi^2_\eta, \ldots, \tilde\chi^k_\eta]^T$,
\[
A_\eta^*\tilde\psi_{P_\eta} = \begin{bmatrix} A_\eta^*\tilde\psi^1_{P_\eta} \\ A_\eta^*\tilde\psi^2_{P_\eta} \\ \vdots \\ A_\eta^*\tilde\psi^k_{P_\eta} \end{bmatrix}
= \begin{bmatrix} \tilde\chi^1_\eta \\ \tilde\chi^2_\eta \\ \vdots \\ \tilde\chi^k_\eta \end{bmatrix},
\]
thus $A_\eta^*\tilde\psi^i_{P_\eta} = \tilde\chi^i_\eta$ for $i = 1, \ldots, k$. In what follows, we should keep in mind that the action of $A_\eta^*$ is defined in this componentwise manner. Let us try to find more structure in the solvability of (8.3). We begin with some simple lemmas.

Lemma 8.1. Two solutions $\tilde\psi_{P_\eta}$ of (8.3) can differ only by an element of the kernel $\mathcal N(A_\eta^*) = \{h \in L_2(P_\eta): A_\eta^* h = 0\}$ of $A_\eta^*$, by which we mean that each component of this difference is in $\mathcal N(A_\eta^*)$.

Proof. Suppose $\tilde\psi^{(1)}_{P_\eta}$ and $\tilde\psi^{(2)}_{P_\eta}$ are two solutions of (8.3). Thus, $A_\eta^*\tilde\psi^{(i)}_{P_\eta} = \tilde\chi_\eta$, $i = 1, 2$. Subtracting both equalities and using the linearity of the adjoint operator $A_\eta^*$ yields
\[
[0, 0, \ldots, 0]^T = \tilde\chi_\eta - \tilde\chi_\eta = A_\eta^*\tilde\psi^{(1)}_{P_\eta} - A_\eta^*\tilde\psi^{(2)}_{P_\eta} = A_\eta^*(\tilde\psi^{(1)}_{P_\eta} - \tilde\psi^{(2)}_{P_\eta}).
\]
This means that each component of $\tilde\psi^{(1)}_{P_\eta} - \tilde\psi^{(2)}_{P_\eta}$ is in $\mathcal N(A_\eta^*)$.

Lemma 8.2. The kernel $\mathcal N(A_\eta^*)$, defined in the previous lemma, is equal to the orthogonal complement of the range $\mathcal R(A_\eta) = \{A_\eta b: b \in \operatorname{lin}\dot{\mathcal H}_\eta\}$ of $A_\eta: \operatorname{lin}\dot{\mathcal H}_\eta \to L_2(P_\eta)$, i.e., $\mathcal N(A_\eta^*) = \mathcal R(A_\eta)^\perp$.

Proof. We first show that $\mathcal N(A_\eta^*) \subset \mathcal R(A_\eta)^\perp$. Take any $g \in \mathcal N(A_\eta^*)$, thus $A_\eta^* g = 0$. This means that
\[
\langle g, A_\eta b\rangle_{L_2(P_\eta)} = \langle A_\eta^* g, b\rangle_{L_2(\eta)} = 0.
\]
We obtain that $g \perp A_\eta b$ for all $b \in \operatorname{lin}\dot{\mathcal H}_\eta$, which says that $g \in \mathcal R(A_\eta)^\perp$. We conclude that $\mathcal N(A_\eta^*) \subset \mathcal R(A_\eta)^\perp$.

Next, we show that $\mathcal R(A_\eta)^\perp \subset \mathcal N(A_\eta^*)$. Take any $g \in \mathcal R(A_\eta)^\perp$. Thus, for all $b \in \operatorname{lin}\dot{\mathcal H}_\eta$,
\[
\langle A_\eta^* g, b\rangle_{L_2(\eta)} = \langle g, A_\eta b\rangle_{L_2(P_\eta)} = 0.
\]
Due to the continuity of the inner product, $\langle A_\eta^* g, b\rangle_{L_2(\eta)} = 0$ for all $b \in \overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta$. Indeed, take $b \in \overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta$. Then there exists a sequence $\{b_n\}_{n\geq 1}$ such that $b_n \to b$ as $n \to +\infty$, i.e.,
\[
\|b - b_n\|_{L_2(\eta)} \to 0 \quad \text{as } n \to +\infty.
\]
Then, by the Cauchy-Schwarz inequality,
\[
|\langle A_\eta^* g, b\rangle_{L_2(\eta)} - \langle A_\eta^* g, b_n\rangle_{L_2(\eta)}| = |\langle A_\eta^* g, b - b_n\rangle_{L_2(\eta)}| \leq \|A_\eta^* g\|_{L_2(\eta)}\,\|b - b_n\|_{L_2(\eta)} \to 0
\]
as $n$ tends to infinity. Because $\langle A_\eta^* g, b_n\rangle_{L_2(\eta)} = 0$ for every $n \geq 1$, it follows that $\langle A_\eta^* g, b\rangle_{L_2(\eta)} = 0$. Now, because $A_\eta^* g \in \overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta$, we put $b = A_\eta^* g$. This yields $\langle A_\eta^* g, A_\eta^* g\rangle_{L_2(\eta)} = \|A_\eta^* g\|^2_{L_2(\eta)} = 0$. Thus, by definition of the $L_2(\eta)$-norm, $A_\eta^* g = 0$ in $L_2(\eta)$, i.e., $g \in \mathcal N(A_\eta^*)$. This shows that $\mathcal R(A_\eta)^\perp \subset \mathcal N(A_\eta^*)$.

Combining both results, we obtain that $\mathcal R(A_\eta)^\perp = \mathcal N(A_\eta^*)$.

Remark 8.2. As seen in §2.3 and §2.4, the spaces $\mathcal N(A_\eta^*)$ and $\mathcal R(A_\eta)^\perp$ are closed.

Using these results, we can make an interesting conclusion.

Proposition 8.1. There is at most one solution $\tilde\psi_{P_\eta}$ of (8.3) that is contained within $\mathcal R(A_\eta) = \operatorname{lin} A_\eta\dot{\mathcal H}_\eta$.

Proof. If $\tilde\psi^{(i)}_{P_\eta} \in \mathcal R(A_\eta)$, $i = 1, 2$, then by Lemma 8.1 we know that $\tilde\psi^{(1)}_{P_\eta} - \tilde\psi^{(2)}_{P_\eta}$ is in $\mathcal N(A_\eta^*)$ and by Lemma 8.2 we see that $\tilde\psi^{(1)}_{P_\eta} - \tilde\psi^{(2)}_{P_\eta} \in \mathcal R(A_\eta)^\perp$. In addition, by assumption, we have that $\tilde\psi^{(1)}_{P_\eta} - \tilde\psi^{(2)}_{P_\eta} \in \mathcal R(A_\eta)$. Thus,
\[
\tilde\psi^{(1)}_{P_\eta} - \tilde\psi^{(2)}_{P_\eta} \in \mathcal R(A_\eta) \cap \mathcal R(A_\eta)^\perp = \{0\}.
\]
This shows that $\tilde\psi^{(1)}_{P_\eta} = \tilde\psi^{(2)}_{P_\eta}$.

From (8.2) and (8.3), we have learned that, if (8.3) has a solution, $\psi$ is differentiable at $P_\eta$ relative to the tangent set $A_\eta\dot{\mathcal H}_\eta$. In Chapter 7, we deduced that the efficient influence function $\tilde\psi_{P_\eta}$ can be obtained by projecting any influence function $\tilde\psi^*_{P_\eta}$ onto $\operatorname{lin} A_\eta\dot{\mathcal H}_\eta = \mathcal R(A_\eta)$. Equation (8.3) tells the same story. Let us start with a solution $\tilde\psi^*_{P_\eta}$ of (8.3), thus an arbitrary influence function. After projecting $\tilde\psi^*_{P_\eta}$ onto $\mathcal R(A_\eta)$, we should obtain the unique efficient influence function $\tilde\psi_{P_\eta}$. By the preceding lemmas, it is easy to see that this projection satisfies (8.3). The residual $\tilde\psi^*_{P_\eta} - \tilde\psi_{P_\eta}$ is orthogonal to $\mathcal R(A_\eta)$, thus $\tilde\psi^*_{P_\eta} - \tilde\psi_{P_\eta} \in \mathcal R(A_\eta)^\perp = \mathcal N(A_\eta^*)$. Using the linearity of $A_\eta^*$, we then see that
\[
A_\eta^*\tilde\psi_{P_\eta} = A_\eta^*(\tilde\psi_{P_\eta} - \tilde\psi^*_{P_\eta} + \tilde\psi^*_{P_\eta}) = A_\eta^*\tilde\psi^*_{P_\eta} = \tilde\chi_\eta.
\]

By Proposition 8.1, this is thus the unique solution of (8.3) contained within $\mathcal R(A_\eta)$, i.e., the efficient influence function. This derivation gives a very strict proof of the uniqueness of the efficient influence function $\tilde\psi_{P_\eta}$. Note, however, that we derived this result in a quite restrictive setting: $\eta$ should be a density in the model $H$. In the next section, we will extend the obtained results to a more general setting. First, we will look at an interesting case in which we can write the efficient influence function in an attractive form.

Suppose $\tilde\chi_\eta$ is contained within the smaller range $\mathcal R(A_\eta^* A_\eta) \subset \mathcal R(A_\eta^*)$. This means that there exists a function $b \in \overline{\operatorname{lin}}^k\dot{\mathcal H}_\eta$ such that

\[
A_\eta^* A_\eta b = \tilde\chi_\eta. \tag{8.4}
\]

We can write this in terms of the generalized inverse $(A_\eta^* A_\eta)^-$ as $b = (A_\eta^* A_\eta)^-\tilde\chi_\eta$, which solely means that $b$ is a solution of (8.4). The unique efficient influence function (the unique solution of (8.3) contained within $\mathcal R(A_\eta)$) can then be written in an attractive form,
\[
\tilde\psi_{P_\eta} = A_\eta(A_\eta^* A_\eta)^-\tilde\chi_\eta \in \mathcal R(A_\eta). \tag{8.5}
\]

Just plug this formula into (8.3) to check this. We now look for a sufficient condition for (8.5) to be fulfilled, which is of course only useful if we assume a solution of (8.3) exists. By Proposition 2.8, we know the following statements are equivalent:

(i) $\mathcal R(A_\eta)$ is closed,

(ii) $\mathcal R(A_\eta^*)$ is closed,

(iii) $\mathcal R(A_\eta^* A_\eta)$ is closed.

Moreover, in this case, we know that $\mathcal R(A_\eta^* A_\eta) = \mathcal R(A_\eta^*)$. Thus, if there exists a solution $\tilde\psi_{P_\eta}$ of (8.3), the attractive form (8.5) is available for any functional $\chi$ whenever the range of the score operator $A_\eta$ is closed, because then $\mathcal R(A_\eta^* A_\eta) = \mathcal R(A_\eta^*)$ and $\tilde\chi_\eta \in \mathcal R(A_\eta^*)$ by assumption. Unfortunately, there is no guarantee that $\mathcal R(A_\eta)$ is closed. In practice, this often fails.

Definition 8.3. The operator $A_\eta^* A_\eta$ appearing in (8.4) is called the information operator.

This information operator $A_\eta^* A_\eta$ performs a similar role as the matrix $X^TX$ in the least squares solution of linear regression models, which is closely related to the Fisher information matrix in this model¹. The matrix $X(X^TX)^{-1}X^T$ is then the orthogonal projection onto the column space of $X$. We have noted that the operator $A_\eta^* A_\eta$ performs a similar role as the matrix $X^TX$, so what can we say about the operator $A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^*$? If it exists, it is the orthogonal projection of $L_2(P_\eta)$ onto the range space $\mathcal R(A_\eta)$ of $A_\eta$. These considerations are only useful if we assume that $\mathcal R(A_\eta)$ is closed, in which case we are assured that the form (8.5) is available.

¹Consider the linear model $E(\mathbf Y|\mathbf X) = \mathbf X\beta$, where $\mathbf X = [\mathbf x_1, \mathbf x_2, \ldots, \mathbf x_p] \in \mathbb{R}^{n\times p}$ is the matrix of covariates, $\beta \in \mathbb{R}^p$ the vector of parameters and $\mathbf Y \in \mathbb{R}^n$ the outcome vector. The least squares estimator $\hat\beta$ of $\beta$ is then $\hat\beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf Y$. The predicted values $\hat{\mathbf Y}$ of $\mathbf Y$ are given by $\hat{\mathbf Y} = \mathbf X\hat\beta = \mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf Y$, i.e., the orthogonal projection of $\mathbf Y$ onto the column space of $\mathbf X$; $\mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T$ is the projection operator. The defining equations $\mathbf X^T\mathbf X\hat\beta = \mathbf X^T\mathbf Y$ are referred to as the normal equations. See [25], Chapter 3, especially §3.6, or [29], Chapter 3, for more information.
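The finite-dimensional analogy in the footnote can be verified directly. The snippet below, a standalone illustration with simulated data, computes the least squares fit via the normal equations and checks that the hat matrix $X(X^TX)^{-1}X^T$ is idempotent and symmetric and that the residuals are orthogonal to the column space of X.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3

X = rng.normal(size=(n, p))                    # design matrix
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # normal equations
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix = projection onto col(X)
residual = Y - H @ Y

print("idempotent:", np.allclose(H @ H, H))
print("symmetric :", np.allclose(H, H.T))
print("residual orthogonal to col(X):", np.allclose(X.T @ residual, 0))
```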

Let us assume for the moment that the operator $A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^*$ exists, thus that $A_\eta^* A_\eta$ is continuously invertible. In a minute, we will look for a sufficient condition for this to be satisfied. Let us first prove this is indeed the orthogonal projection operator onto $\mathcal R(A_\eta)$. Note that the projection onto $\mathcal R(A_\eta)$ is well-defined because, by assumption, this is a closed linear subspace of $L_2(P_\eta)$. In Chapter 2, Proposition 2.11, we have seen that an operator is a projection operator iff it is linear, idempotent and self-adjoint. It is easy to show these conditions are satisfied.

(i) Linearity: this is immediately clear because $A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^*$ is a composition of linear operators.

(ii) Idempotent: this means that applying the operator twice yields the same result as applying it once,
\[
\{A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^*\} \circ \{A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^*\}
= A_\eta(A_\eta^* A_\eta)^{-1}(A_\eta^* A_\eta)(A_\eta^* A_\eta)^{-1}A_\eta^*
= A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^*.
\]

(iii) Self-adjoint: using the usual properties of the adjoint, we deduce that
\[
\{A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^*\}^*
= (A_\eta^*)^*\{(A_\eta^* A_\eta)^{-1}\}^*A_\eta^*
= A_\eta\{(A_\eta^* A_\eta)^*\}^{-1}A_\eta^*
= A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^*.
\]

We can also explicitly show that this is an orthogonal projection operator. Consider an arbitrary $g \in L_2(P_\eta)$ and an arbitrary element $A_\eta h$ of $\mathcal R(A_\eta)$, with $h \in \operatorname{lin}\dot{\mathcal H}_\eta$. The orthogonality follows from the following derivation,
\[
\begin{aligned}
\langle g - A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^* g,\, A_\eta h\rangle_{L_2(P_\eta)}
&= \langle g, A_\eta h\rangle_{L_2(P_\eta)} - \langle A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^* g,\, A_\eta h\rangle_{L_2(P_\eta)} \\
&= \langle g, A_\eta h\rangle_{L_2(P_\eta)} - \langle A_\eta^* A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^* g,\, h\rangle_{L_2(\eta)} \\
&= \langle g, A_\eta h\rangle_{L_2(P_\eta)} - \langle A_\eta^* g,\, h\rangle_{L_2(\eta)} \\
&= \langle g, A_\eta h\rangle_{L_2(P_\eta)} - \langle g, A_\eta h\rangle_{L_2(P_\eta)} = 0,
\end{aligned}
\]
and thus we see that $g - A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^* g \perp \mathcal R(A_\eta)$, from which we conclude that $A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^* g$ is the orthogonal projection of $g$ onto $\mathcal R(A_\eta)$.

From (8.5), we see that the unique efficient influence function $\tilde\psi_{P_\eta}$ can be written as
\[
\tilde\psi_{P_\eta} = A_\eta(A_\eta^* A_\eta)^{-1}\tilde\chi_\eta. \tag{8.6}
\]
If $\tilde\psi^*_{P_\eta}$ is an arbitrary solution of (8.3), the efficient influence function can then be written in terms of this orthogonal projection operator,
\[
\tilde\psi_{P_\eta} = A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^*\tilde\psi^*_{P_\eta}; \tag{8.7}
\]
we thus obtain a rigorous way to write the known result: the efficient influence function is the orthogonal projection of any influence function onto the closure of the linear span of the tangent set.

Remark 8.3. Note this is only applicable in this restrictive setting where $\eta$ itself defines a probability measure, contained within the model $H$. This is because we need the efficient influence function $\tilde\chi_\eta$, which we have available since $H$ is a statistical model. In the next section, we shall generalize this to a more general setting.
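A finite-dimensional analogue may help to see what (8.3)-(8.7) assert. In the sketch below (purely illustrative, with an arbitrary matrix standing in for the score operator), the adjoint is the transpose, the information operator is $A^TA$, and the influence function obtained via the generalized inverse is checked to solve the adjoint equation and to coincide with the projection of an arbitrary other solution onto the range of A.

```python
import numpy as np

rng = np.random.default_rng(5)

# Finite-dimensional stand-in: A maps "directions" in R^3 to "scores" in R^6.
A = rng.normal(size=(6, 3))
info = A.T @ A                                  # information operator A* A (invertible here)

# A "chi-tilde" in the range of A^T, so that (8.3) is solvable.
chi = A.T @ rng.normal(size=6)

# Efficient influence function via (8.5)/(8.6).
psi_eff = A @ np.linalg.solve(info, chi)
print("solves adjoint equation (8.3):", np.allclose(A.T @ psi_eff, chi))

# Any other solution differs by an element of N(A^T) = R(A)^perp;
# projecting it onto R(A) with A (A^T A)^{-1} A^T recovers psi_eff, cf. (8.7).
null_vec = rng.normal(size=6)
null_vec -= A @ np.linalg.solve(info, A.T @ null_vec)   # component orthogonal to R(A)
psi_other = psi_eff + null_vec
projector = A @ np.linalg.inv(info) @ A.T
print("projection of other solution equals psi_eff:", np.allclose(projector @ psi_other, psi_eff))
```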

Now we focus attention on the condition under which the existence of this projection operator is assured; thus, when is $A_\eta^* A_\eta$ continuously invertible? We make use of Theorem 2.9 for this purpose. Note that the operator $A_\eta^* A_\eta$ is initially only defined for elements in $\operatorname{lin}\dot{\mathcal H}_\eta$. However, by Theorem 2.6, there exists a continuous prolongation of the linear bounded operator $A_\eta$ on $\operatorname{lin}\dot{\mathcal H}_\eta$ to the closure $\overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta$. Thus, we obtain an operator between two Hilbert spaces (any closed subspace of a Hilbert space is itself a Hilbert space, see Theorem 2.2),
\[
A_\eta^* A_\eta: \overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta \to \overline{\operatorname{lin}}\,\dot{\mathcal H}_\eta: b \mapsto A_\eta^* A_\eta b.
\]

From Theorem 2.10, we thus see that $A_\eta^* A_\eta$ is continuously invertible, i.e., it is one-to-one, onto and has a continuous inverse, iff $\mathcal R(A_\eta)$ is closed and $A_\eta$ is one-to-one. We already assumed that $\mathcal R(A_\eta)$ is closed, so the only extra assumption to be checked is that $A_\eta$ is one-to-one.

It is useful to summarize the obtained results about the solvability of (8.3), and thus the differentiability of $\psi$ at $P_\eta$ relative to the tangent set $A_\eta\dot{\mathcal H}_\eta$:

1. $A_\eta^*\tilde\psi_{P_\eta} = \tilde\chi_\eta$ has a solution iff $\tilde\chi_\eta \in \mathcal R(A_\eta^*)$.

2. The unique efficient influence function $\tilde\psi_{P_\eta}$ can be obtained by projecting any solution of (8.3) onto $\operatorname{lin} A_\eta\dot{\mathcal H}_\eta = \mathcal R(A_\eta)$.

3. Moreover, if $\mathcal R(A_\eta)$ is a closed subset of $L_2(P_\eta)$, then for any functional $\chi$ the solution of (8.3) can be written in the attractive form
\[
\tilde\psi_{P_\eta} = A_\eta(A_\eta^* A_\eta)^-\tilde\chi_\eta,
\]
where $(A_\eta^* A_\eta)^-$ is the generalized inverse of the information operator $A_\eta^* A_\eta$.

4. If in addition $A_\eta$ is one-to-one, the information operator $A_\eta^* A_\eta$ is continuously invertible, $A_\eta(A_\eta^* A_\eta)^{-1}A_\eta^*$ is the orthogonal projection operator from $L_2(P_\eta)$ onto $\mathcal R(A_\eta)$, and the unique influence function can be written as (8.6) or, equivalently, (8.7).

8.2.2 Semiparametric Models Indexed by a Parameter η Contained in an Arbitrary Set H

So far we have assumed that the parameter η is a probability distribution, but this is not necessary. Consider now the more general situation of a statistical model

\[
\mathcal P = \{P_\eta: \eta \in H\}, \tag{8.8}
\]
indexed by a parameter $\eta$ running through an arbitrary (possibly infinite-dimensional) set $H$. Note that a semiparametric model in a strict sense is a model of this type. The parameter $\eta$ in (8.8) should then be taken to be the parameter $(\theta, \eta)$ in (7.17). This situation will be discussed in §9.3 in much detail. The ideas developed in the previous section give us a hint how to extend this theory to the more general setting of (8.8) and which assumptions should be made to obtain similar results. We note why we need these assumptions and how they differ from their analogues discussed earlier.

First of all, we assume there exists a subset $H_\eta$ of a Hilbert space that indexes directions $b$ in which $\eta$ can be approximated within $H$. This set $H_\eta$ will play the role of the tangent set $\dot{\mathcal H}_\eta$ discussed in the previous section. The difference is that the set $H_\eta$ does not necessarily contain score functions. We will also need an analogue of the score operator $A_\eta$ and of the efficient influence function $\tilde\chi_\eta$. In what follows, we discuss how we come up with these analogous objects that will have similar properties. Thus, secondly, we assume that there exists a continuous, linear operator
\[
A_\eta: \operatorname{lin} H_\eta \to L_2(P_\eta): b \mapsto A_\eta b.
\]

This is an operator between two Hilbert spaces. It is reasonable that the operator $A_\eta$ will play the role of the score operator of the previous section, as the notation already revealed. Indeed, we additionally assume that for every $b \in H_\eta$, there exists a path $t \mapsto \eta_t$ such that
\[
\int\left[\frac{dP_{\eta_t}^{1/2} - dP_\eta^{1/2}}{t} - \frac{1}{2}A_\eta b\, dP_\eta^{1/2}\right]^2 \to 0 \tag{8.9}
\]
as $t \to 0$. This means that $t \mapsto P_{\eta_t}$ defines a differentiable path with corresponding score function $A_\eta b$, see Definition 6.1. For this reason, it is clear that $A_\eta$ can be seen as a score operator: it maps elements of $\operatorname{lin} H_\eta$ onto scores of the model $\mathcal P$ defined in (8.8).

Another important object in the previous section was the efficient influence function $\tilde\chi_\eta$, see (8.1), for estimating the functional $\chi: H \to \mathbb{R}^k: \eta \mapsto \chi(\eta)$. In that case, the existence of $\tilde\chi_\eta$ was assured by the assumption that $\chi$ is differentiable at $\eta$ relative to the tangent set $\dot{\mathcal H}_\eta$. We want to define an object with similar properties in the current setting. Note, however, that $\eta$ does not necessarily define a probability measure. Thus, in this abstract setting, the notion of this kind of differentiability, see Definition 6.3, does not exist. Fortunately, we can proceed in a very similar way. So, thirdly, we assume that for the function $\chi: H \to \mathbb{R}^k: \eta \mapsto \chi(\eta)$, there exists a continuous, linear functional
\[
\dot\chi_\eta: \operatorname{lin} H_\eta \to \mathbb{R}^k: b \mapsto \dot\chi_\eta b,
\]
such that for every $b \in H_\eta$, there exists a path $t \mapsto \eta_t$ for which
\[
\frac{\chi(\eta_t) - \chi(\eta)}{t} \to \dot\chi_\eta b \tag{8.10}
\]
as $t \to 0$. Componentwise application of the Riesz Representation Theorem for Hilbert spaces assures us that the derivative has a representation as an inner product,
\[
\dot\chi_\eta b = \langle\tilde\chi_\eta, b\rangle_{H_\eta} \tag{8.11}
\]
for an element $\tilde\chi_\eta \in \overline{\operatorname{lin}}^k H_\eta$.

Remark 8.4. Note that, although this function $\tilde\chi_\eta$ is defined in a very similar way as the efficient influence function of the previous section, it does not have the interpretation of an efficient influence function, for the simple reason that $H_\eta$ is not necessarily a tangent set, i.e., a set containing score functions.

The theory developed in the previous section was made for estimating functions of the type $\psi: \mathcal P \to \mathbb{R}^k: P_\eta \mapsto \psi(P_\eta) = \chi(\eta)$. This remains the same in this setting. E.g., the results we shall obtain here will be applicable to the map (7.18). In that case, we have $\psi(P_{\theta,\eta}) = \chi(\theta, \eta) = \theta$. Before stating a theorem that summarizes the results, we look at what the objects $H_\eta$, $A_\eta$, $\dot\chi_\eta$ and $\tilde\chi_\eta$ look like in a sufficiently smooth parametric model (with adapted notation).

Parametric Model. Consider a parametric model

\[
\mathcal P = \{P_\theta: \theta \in \Theta \subset \mathbb{R}^m\}, \tag{8.12}
\]
where $\Theta$ is an open subset of $\mathbb{R}^m$. It is clear that an arbitrary parametric model fits the scope of (8.8). We take $\eta$ to be $\theta$ and $H$ to be $\Theta$. In this example, we focus on how the objects $H_\theta$, $A_\theta$, $\dot\chi_\theta$ and $\tilde\chi_\theta$ look. Note we adapted the notation because now the parameter $\theta$ indexes the model.

For $H_\theta$ we take the subset of the Hilbert space $\mathbb{R}^m$ (any finite-dimensional inner product space is a Hilbert space, see Proposition 2.3) such that for any $a \in H_\theta \subset \mathbb{R}^m$, $\theta + a \in \Theta$. Thus, more precisely,
\[
H_\theta = \Theta - \theta = \{a \in \mathbb{R}^m: a = \tilde\theta - \theta \text{ for some } \tilde\theta \in \Theta\}.
\]
Next, we need a continuous, linear operator $A_\theta$. The choice for this operator $A_\theta$ is straightforward. Suppose $\dot\ell_\theta$ is the ordinary score function. We already know that, if the model is sufficiently smooth such that the conditions in Lemma 6.2 are satisfied, this ordinary score function satisfies (6.16). Thus, for $A_\theta$, we take
\[
A_\theta: \operatorname{lin} H_\theta \to L_2(P_\theta): a \mapsto A_\theta a = a^T\dot\ell_\theta. \tag{8.13}
\]
From (6.16) and Proposition 6.3 it is then clear that (8.9) is fulfilled. In addition, it is clear that $A_\theta$ is linear. Moreover, from the Cauchy-Schwarz inequality in $\mathbb{R}^m$, it follows that
\[
\|A_\theta a\|^2_{L_2(P_\theta)} = \int (a^T\dot\ell_\theta)^2\,dP_\theta
\leq \|a\|^2_{\mathbb{R}^m}\int |\dot\ell_\theta|^2\,dP_\theta
= \left(\int |\dot\ell_\theta|^2\,dP_\theta\right)\|a\|^2_{\mathbb{R}^m},
\]
which shows that $A_\theta$ is continuous.

Next, we take any function $\chi: \Theta \to \mathbb{R}^k$ that is differentiable in the ordinary sense as a map between the Euclidean spaces $\mathbb{R}^m$ and $\mathbb{R}^k$.

Remark 8.5. In §6.3.1, it is seen that the map $\psi(P_\theta) = \chi(\theta)$ is differentiable at $P_\theta$ relative to a given tangent space if the Fisher information matrix $I_\theta = P_\theta(\dot\ell_\theta\dot\ell_\theta^T)$ is invertible. However, to identify the aforementioned objects, we do not need this assumption. This is only necessary when we are looking for the efficient influence function, see further.

In §6.3.1, we also deduced that, for t sufficiently small,

\[
\frac{\chi(\theta + ta) - \chi(\theta)}{t} \to \dot\chi_\theta a,
\]
where $\dot\chi_\theta$ is the $k \times m$ matrix of partial derivatives, i.e.,
\[
\dot\chi_\theta = \begin{bmatrix} \dot\chi_\theta^{11} & \cdots & \dot\chi_\theta^{1m} \\ \vdots & & \vdots \\ \dot\chi_\theta^{k1} & \cdots & \dot\chi_\theta^{km} \end{bmatrix},
\]
with $\dot\chi_\theta^{ij} = \partial\chi_i/\partial\theta_j$, where $\chi = [\chi_1, \ldots, \chi_k]^T$ and $\theta = [\theta_1, \ldots, \theta_m]^T$. Now write this matrix as
\[
\dot\chi_\theta = \begin{bmatrix} (\dot\chi_\theta^1)^T \\ \vdots \\ (\dot\chi_\theta^k)^T \end{bmatrix},
\]
where $\dot\chi_\theta^i = [\dot\chi_\theta^{i1}, \ldots, \dot\chi_\theta^{im}]^T$ denotes the $i$-th row of $\dot\chi_\theta$ written as a column vector. With this notation, it is clear that
\[
\dot\chi_\theta a = \begin{bmatrix} (\dot\chi_\theta^1)^T a \\ \vdots \\ (\dot\chi_\theta^k)^T a \end{bmatrix}
= \begin{bmatrix} \langle\dot\chi_\theta^1, a\rangle_{\mathbb{R}^m} \\ \vdots \\ \langle\dot\chi_\theta^k, a\rangle_{\mathbb{R}^m} \end{bmatrix}
= \langle\tilde\chi_\theta, a\rangle_{H_\theta},
\]
thus $\tilde\chi_\theta = \dot\chi_\theta \in \overline{\operatorname{lin}}^k H_\theta \subset \mathbb{R}^{k\times m}$.

We have thus derived the objects $H_\theta$, $A_\theta$, $\dot\chi_\theta$ and $\tilde\chi_\theta$ in a parametric model. By this example, we hope that the assumptions we made above now seem very plausible. △

Let us now state an important theorem that summarizes the obtained results.

Theorem 8.1. The map $\psi: \mathcal P \to \mathbb{R}^k$ given by $\psi(P_\eta) = \chi(\eta)$ is differentiable at $P_\eta$ relative to the tangent set $A_\eta H_\eta$ if and only if each coordinate function of $\tilde\chi_\eta$ is contained within the range of $A_\eta^*: L_2(P_\eta) \to \overline{\operatorname{lin}}\, H_\eta$. The efficient influence function $\tilde\psi_{P_\eta}$ satisfies
\[
A_\eta^*\tilde\psi_{P_\eta} = \tilde\chi_\eta. \tag{8.14}
\]
Moreover, if each coordinate function of $\tilde\chi_\eta$ is contained within the range of $A_\eta^* A_\eta: \overline{\operatorname{lin}}\, H_\eta \to \overline{\operatorname{lin}}\, H_\eta$, then it also satisfies
\[
\tilde\psi_{P_\eta} = A_\eta(A_\eta^* A_\eta)^-\tilde\chi_\eta. \tag{8.15}
\]

Proof. We first note that $A_\eta H_\eta = \{A_\eta b: b \in H_\eta\}$ is indeed a tangent set. This follows from (8.9): $A_\eta$ maps every element of $H_\eta$ to a score function $A_\eta b$ with corresponding differentiable path $t \mapsto P_{\eta_t}$.

The assertion that the map $\psi$ is differentiable at $P_\eta$ relative to the tangent set $A_\eta H_\eta$ (with the corresponding submodels $t \mapsto P_{\eta_t}$) if and only if each coordinate function of $\tilde\chi_\eta$ is contained within the range of $A_\eta^*$ readily follows from the derivation leading up to (8.3). Note that translating that derivation to this setting is only a typographical matter (with a slight change of interpretation): just change the set $\dot{\mathcal H}_\eta$ to the set $H_\eta$ and we obtain a proof applicable to this general setting. Maybe there is one peculiarity left. In the derivation leading up to (8.3), we used the fact that $\chi$ is differentiable at $\eta$ relative to the tangent set $\dot{\mathcal H}_\eta$ with efficient influence function $\tilde\chi_\eta$. This assumption should be replaced by (8.10), with $\tilde\chi_\eta$ then defined by (8.11). Now, if $\psi$ is differentiable at $P_\eta$ relative to the tangent set $A_\eta H_\eta$, we know that the efficient influence function exists and satisfies (8.14).

Moreover, it is also trivial to obtain (8.15). If each coordinate function of $\tilde\chi_\eta$ is contained within the range of $A_\eta^* A_\eta$, it follows that there exists an element $b \in \overline{\operatorname{lin}}^k H_\eta$ such that $A_\eta^* A_\eta b = \tilde\chi_\eta$, which means that, in terms of the generalized inverse, we have (8.15). This concludes the proof.

In the previous section, we have seen a lot of properties about the solvability of $A_\eta^*\tilde\psi_{P_\eta} = \tilde\chi_\eta$ and about when the form $\tilde\psi_{P_\eta} = A_\eta(A_\eta^* A_\eta)^-\tilde\chi_\eta$ is available, in terms of properties of the continuous, linear operator $A_\eta$ and its adjoint. Because these results rely on assumptions similar to the ones we have made in this section, they are also applicable to this setting. The only thing that should be changed is the tangent set $\dot{\mathcal H}_\eta$; instead, we have the subset $H_\eta$ of a Hilbert space. These two sets share the same properties. On the other hand, there is some peculiarity about $\tilde\chi_\eta$: in the previous section, this had the interpretation of an efficient influence function, while in this section it does not. However, both objects also share the same properties. Similar considerations apply for the operator $A_\eta$. In short, the assumptions we made in this section about the objects $H_\eta$, $A_\eta$, $\dot\chi_\eta$ and $\tilde\chi_\eta$ are chosen so that all the results obtained in the previous section can be copied to this general setting. That is all that needs to be said.

It is useful to see what the adjoint looks like in a parametric model, and we will investigate the assertions made in Theorem 8.1 for this parametric model.

Parametric Model (continued). We continue the study of the parametric model defined in (8.12). In (8.13), it is seen that $A_\theta a = a^T\dot\ell_\theta$. It is also seen that $\tilde\chi_\theta$ equals the derivative map $\dot\chi_\theta$ (the matrix of partial derivatives). We wrote:
\[
\tilde\chi_\theta = [(\dot\chi_\theta^1)^T, \ldots, (\dot\chi_\theta^k)^T]^T.
\]
First, we want to look for the adjoint of $A_\theta$, $A_\theta^*: L_2(P_\theta) \to \overline{\operatorname{lin}}\, H_\theta$. Note we can assume that $\overline{\operatorname{lin}}\, H_\theta = \mathbb{R}^m$, which means that $\Theta$ is a set in $\mathbb{R}^m$ containing every direction of $\mathbb{R}^m$ such that (the closure of) the linear span is exactly $\mathbb{R}^m$. Intuitively, this has the interpretation that the set $\Theta$ does not yield extra information about the parameters. The adjoint now follows from an easy calculation. Take any $g \in L_2(P_\theta)$ and any $a \in \mathbb{R}^m$; then
\[
\langle g, A_\theta a\rangle_{L_2(P_\theta)} = \langle g, a^T\dot\ell_\theta\rangle_{L_2(P_\theta)} = \int g\, a^T\dot\ell_\theta\, dP_\theta = a^T\!\int g\,\dot\ell_\theta\, dP_\theta = \left\langle\int g\,\dot\ell_\theta\, dP_\theta,\; a\right\rangle_{\mathbb{R}^m} = \langle A_\theta^* g, a\rangle_{\mathbb{R}^m}.
\]
We thus find an easy form for the adjoint operator,
\[
A_\theta^*: L_2(P_\theta) \to \mathbb{R}^m: g \mapsto A_\theta^* g = \int g\,\dot\ell_\theta\, dP_\theta. \tag{8.16}
\]

From Theorem 8.1, we know that $\psi(P_\theta) = \chi(\theta)$ is differentiable at $P_\theta$ relative to the tangent set $A_\theta H_\theta$ if and only if $\dot\chi_\theta^i \in \mathcal R(A_\theta^*)$ for all $i = 1, \ldots, k$. In §6.3.1, it is seen that the occurrence of this differentiability is equivalent to invertibility of the Fisher information matrix $I_\theta = P_\theta\dot\ell_\theta\dot\ell_\theta^T$. Thus, non-singularity of $I_\theta$ should be equivalent to $\dot\chi_\theta^i \in \mathcal R(A_\theta^*)$ for all $i = 1, \ldots, k$. Indeed, put $g$ equal to $(\dot\chi_\theta^i)^T I_\theta^{-1}\dot\ell_\theta \in \operatorname{lin} A_\theta H_\theta$. We obtain
\[
\begin{aligned}
A_\theta^*\{(\dot\chi_\theta^i)^T I_\theta^{-1}\dot\ell_\theta\}
&= \int \{(\dot\chi_\theta^i)^T I_\theta^{-1}\dot\ell_\theta\}\,\dot\ell_\theta\, dP_\theta
= \int \dot\ell_\theta\,\dot\ell_\theta^T I_\theta^{-1}\dot\chi_\theta^i\, dP_\theta \\
&= \left(\int \dot\ell_\theta\,\dot\ell_\theta^T\, dP_\theta\right) I_\theta^{-1}\dot\chi_\theta^i
= \dot\chi_\theta^i.
\end{aligned}
\]
Now write $\tilde\psi_\theta^i = (\dot\chi_\theta^i)^T I_\theta^{-1}\dot\ell_\theta$, which means that $\tilde\psi_{P_\theta} = \tilde\chi_\theta I_\theta^{-1}\dot\ell_\theta$ with $\tilde\psi_{P_\theta} = [\tilde\psi_{P_\theta}^1, \ldots, \tilde\psi_{P_\theta}^k]^T$. Thus, $A_\theta^*\tilde\psi_{P_\theta} = \tilde\chi_\theta$ and $\tilde\psi_{P_\theta} \in \operatorname{lin}^k A_\theta H_\theta$: this is the unique efficient influence function, unique because there is at most one solution of (8.14) contained within $\operatorname{lin}^k A_\theta H_\theta$. This proves the aforementioned equivalence.

There is still one object left to be looked at, the information operator $A_\theta^* A_\theta$. The following calculation justifies this nomenclature. Take any $a \in \mathbb{R}^m$,
\[
A_\theta^* A_\theta a = A_\theta^*(a^T\dot\ell_\theta) = \int (a^T\dot\ell_\theta)\,\dot\ell_\theta\, dP_\theta = \int \dot\ell_\theta\,(\dot\ell_\theta^T a)\, dP_\theta = I_\theta a.
\]
This shows that the abstract information operator $A_\theta^* A_\theta$ equals the Fisher information matrix $I_\theta$, a linear, continuous operator (if $I_\theta$ is invertible, as we assume) on $\mathbb{R}^m$. It is now easy to see that each $\dot\chi_\theta^i$ ($i = 1, \ldots, k$) is contained within the range of $A_\theta^* A_\theta$: just put $a$ equal to $I_\theta^{-1}\dot\chi_\theta^i$. Theorem 8.1 then assures that (8.15) is available,
\[
\tilde\psi_\theta^i = A_\theta(A_\theta^* A_\theta)^{-1}\dot\chi_\theta^i = A_\theta(I_\theta^{-1}\dot\chi_\theta^i) = (I_\theta^{-1}\dot\chi_\theta^i)^T\dot\ell_\theta = (\dot\chi_\theta^i)^T I_\theta^{-1}\dot\ell_\theta.
\]
Again, we obtain $\tilde\psi_{P_\theta} = \tilde\chi_\theta I_\theta^{-1}\dot\ell_\theta$, the unique efficient influence function, see §6.3.1, but now via a more general route than before. △

As mentioned in the previous section, here also, equation (8.14) seems odd at first sight. By definition, every coordinate function of $\tilde\chi_\eta$ is contained within the closed linear span of $H_\eta$, see (8.11). In addition, the operator $A_\eta^*$ maps $L_2(P_\eta)$ to a subset of $\overline{\operatorname{lin}}\, H_\eta$. If $A_\eta^*$ is onto, equation (8.14) is satisfied. However, there are two reasons why $A_\eta^*$ may fail to be onto.

First, it may happen that $\mathcal R(A_\eta^*)$ is a proper subspace of $\overline{\operatorname{lin}}\, H_\eta$. In this case, there exists an element $b \in \mathcal R(A_\eta^*)^\perp \setminus \{0\}$.⁴ In a similar way as in the proof of Lemma 8.2, we deduce that $\mathcal R(A_\eta^*)^\perp = \mathcal N(A_\eta)$. Thus, the kernel $\mathcal N(A_\eta)$ is not trivial, which means that $A_\eta$ is not one-to-one. Therefore, we are assured that there exist two linearly independent directions $b_1$ and $b_2$ for which $A_\eta b_1 = A_\eta b_2$: both directions lead to the same score functions. This means the Fisher information matrix for the corresponding two-dimensional submodel is singular. We give a heuristic argument to show this. Consider the following path in $H$, with $b_1$ and $b_2$ two linearly independent directions in $H_\eta$ with $A_\eta b_1 = A_\eta b_2$,

\[
\eta_{s,t} = (1 + s b_1)(1 + t b_2)\,\eta.
\]

The corresponding submodel is then $(s, t) \mapsto P_{\eta_{s,t}}$. The two-dimensional score function for the parameter $(s, t)$ then becomes
\[
g_{s,t} = \begin{bmatrix} A_\eta b_1 \\ A_\eta b_2 \end{bmatrix}.
\]
From this, we obtain the Fisher information matrix,
\[
I_{s,t} = P\, g_{s,t} g_{s,t}^T
= P\begin{bmatrix} A_\eta b_1 \\ A_\eta b_2 \end{bmatrix}\begin{bmatrix} A_\eta b_1 & A_\eta b_2 \end{bmatrix}
= P\begin{bmatrix} (A_\eta b_1)^2 & A_\eta b_1 A_\eta b_2 \\ A_\eta b_1 A_\eta b_2 & (A_\eta b_2)^2 \end{bmatrix}.
\]

We assumed that $A_\eta b_1 = A_\eta b_2$ and thus we find that
\[
I_{s,t} = P\, g_{s,t} g_{s,t}^T = P\{(A_\eta b_1)^2\}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},
\]
a singular matrix ($\det(I_{s,t}) = 0$). A rough interpretation is that the parameter is not locally identifiable. It is not surprising that we encounter problems when estimating these parameters.

A second reason is that the range space $\mathcal R(A_\eta^*)$ is dense but not closed in $\overline{\operatorname{lin}}\, H_\eta$. In this case, the kernel $\mathcal N(A_\eta)$ is trivial, so this cannot cause any problems. The problem arises when $\tilde\chi_\eta \in \partial\mathcal R(A_\eta^*)$, the border of the range. There exist elements of $\mathcal R(A_\eta^*)$ that are arbitrarily close to $\tilde\chi_\eta$, but (8.14) may still fail. This is harder to understand than the first reason, but it seems that this happens quite often. The following theorem shows that this failure has serious consequences.

⁴See also the more general Proposition 2.10.

Theorem 8.2. With the notation as before, if $\tilde\chi_\eta \notin \mathcal R(A_\eta^*)$, then

(i) there exists no estimator sequence for $\chi(\eta)$ that is regular at $P_\eta$;

(ii) in addition,
\[
\sup_{b \in H_\eta} \frac{\langle\tilde\chi_\eta, b\rangle^2_{L_2(\eta)}}{\|A_\eta b\|^2_{L_2(P_\eta)}} = +\infty. \tag{8.17}
\]

Remark 8.6. We omit the proof of this theorem. For a proof of (i), we refer to van der Vaart (1991), [37]. To prove this, one needs a lot of knowledge about convergence of experiments, which is not within the scope of this thesis. For a proof of (ii), we refer to van der Vaart (2002), [40]. We must admit, the proof is quite technical and long. It makes use of the spectral decomposition of the information operator and spectral calculus. Again, it would be too much to cover these concepts in this thesis. Readers really interested in the proof are referred to Rudin (1974), [32], for more information about spectral theory.

This theorem indeed shows that failure of the condition that $\tilde\chi_\eta$ belongs to the range of $A_\eta^*$ has serious consequences. First, no regular estimator sequence $T_n$ for $\psi(P_\eta) = \chi(\eta)$ exists. In models where this occurs, we do not have asymptotically efficient estimators for such functionals $\psi$ in the sense of Definition 7.6. We are then obligated to enter the realm of non-regular estimators. Second, have a look at (8.17). Does this statement have a nice interpretation? Let us investigate what the ratio means in the situation where $\tilde\chi_\eta \in \mathcal R(A_\eta^*)$. For ease of explanation, we assume that $\chi$ is one-dimensional. The numerator equals
\[
\langle\tilde\chi_\eta, b\rangle^2_{L_2(\eta)} = \langle A_\eta^*\tilde\psi_{P_\eta}, b\rangle^2_{L_2(\eta)} = \langle\tilde\psi_{P_\eta}, A_\eta b\rangle^2_{L_2(P_\eta)}.
\]
By the Cauchy-Schwarz inequality, it follows that
\[
\frac{\langle\tilde\psi_{P_\eta}, A_\eta b\rangle^2_{L_2(P_\eta)}}{\|A_\eta b\|^2_{L_2(P_\eta)}}
\leq \frac{\|\tilde\psi_{P_\eta}\|^2_{L_2(P_\eta)}\|A_\eta b\|^2_{L_2(P_\eta)}}{\|A_\eta b\|^2_{L_2(P_\eta)}}
= \|\tilde\psi_{P_\eta}\|^2_{L_2(P_\eta)},
\]
the variance of the efficient influence function. Thus, in that case
\[
\sup_{b \in H_\eta} \frac{\langle\tilde\chi_\eta, b\rangle^2_{L_2(\eta)}}{\|A_\eta b\|^2_{L_2(P_\eta)}} = \|\tilde\psi_{P_\eta}\|^2_{L_2(P_\eta)} = P_\eta\tilde\psi^2_{P_\eta},
\]
i.e., the considered supremum equals the variance of the efficient influence function, see §7.3.1. For functionals $\chi$ for which $\tilde\chi_\eta \notin \mathcal R(A_\eta^*)$, we cannot refer to $\tilde\psi_{P_\eta}$ as an influence function; it does not exist. Hence, we call it a virtual efficient influence function. An intuitive interpretation of (8.17) is that the variance of this virtual efficient influence function is infinite. This is somewhat comparable to the infinite bound case, see §7.4.4. Theorem 8.2 is also an example of an impossibility theorem.

We end this tough section with the parametric model defined in (8.12) and show that $\mathcal R(A_\theta)$ is closed for $m = 1$ and that $A_\theta^* A_\theta$ is continuously invertible.

Parametric Model (continued, again). We show that $\mathcal R(A_\theta) = \{A_\theta b: b \in \mathbb{R}\}$ is a closed subset of $L_2(P_\theta)$ for $m = 1$. Note we take $\overline{\operatorname{lin}}\, H_\theta$ to be $\mathbb{R}$, which we discussed before. To show that $\mathcal R(A_\theta)$ is closed, we need to show that any convergent sequence in $\mathcal R(A_\theta)$ has its limit in $\mathcal R(A_\theta)$. So, take a sequence $\{g_n\}_{n\geq 1} \subset \mathcal R(A_\theta)$, converging to $g \in L_2(P_\theta)$. Since $g_n \in \mathcal R(A_\theta)$ for all $n \geq 1$, we know that there exists a sequence $\{a_n\}_{n\geq 1} \subset \mathbb{R}$ such that $a_n\dot\ell_\theta = g_n$. The sequence $\{a_n\}_{n\geq 1}$ is a Cauchy sequence in $\mathbb{R}$. Indeed,
\[
\|g_l - g_n\|_{L_2(P_\theta)} = |a_l - a_n|\,\|\dot\ell_\theta\|_{L_2(P_\theta)},
\]
which tends to 0 as $l, n \to +\infty$ since $\{g_n\}_{n\geq 1}$ is a Cauchy sequence in $L_2(P_\theta)$. Because $\mathbb{R}$ is complete, there exists a unique $a \in \mathbb{R}$ such that $a_n \to a$ as $n \to +\infty$. Therefore,
\[
g_n = a_n\dot\ell_\theta \;\longrightarrow\; g = a\,\dot\ell_\theta,
\]
from which we conclude that $g \in \mathcal R(A_\theta)$.

We end this section by noting that $A_\theta^* A_\theta$ is continuously invertible because $I_\theta$ is continuously invertible as a map between Euclidean spaces and $A_\theta^* A_\theta = I_\theta$. △
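As a numerical complement to the parametric-model example (again only an illustration under assumed choices), the sketch below takes a normal location-scale model, approximates the score operator and its adjoint by Monte Carlo averages, and checks that the information operator $A_\theta^*A_\theta$ acts as multiplication by the Fisher information matrix $I_\theta$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000

# Assumed parametric model: X ~ N(mu, sigma^2) with theta = (mu, sigma); "true" values below.
mu, sigma = 0.0, 1.0
X = rng.normal(mu, sigma, size=n)

# Ordinary score vector l_dot_theta(X), an (n, 2) array.
score = np.column_stack([
    (X - mu) / sigma**2,                      # d/dmu log density
    ((X - mu)**2 - sigma**2) / sigma**3,      # d/dsigma log density
])

A = lambda a: score @ a                       # score operator: a -> a^T l_dot, evaluated at the data
A_star = lambda g: score.T @ g / n            # adjoint (8.16): g -> E[g * l_dot], Monte Carlo version

a = np.array([0.3, -1.2])
info_operator_a = A_star(A(a))                # A* A a, should be approximately I_theta a
fisher = np.array([[1 / sigma**2, 0.0],
                   [0.0, 2 / sigma**2]])      # known Fisher information of N(mu, sigma^2)

print("A* A a    :", info_operator_a)
print("I_theta a :", fisher @ a)
```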

Applications to the Calculus of Scores

9.1 Information Loss Models

Suppose we conducted a study in which a typical observation is distributed as a measurable transformation X = m(Y ) of an unobservable variable Y . Assume that the form of m is known and that the distribution η of Y is known to belong to a class H. Not to loose attention in this abstract set-up, we can think at a missing data problem. This yields a natural parametrization of the distribution Pη of X. Indeed, Pη(x) = P(X ≤ x) = P{m(Y ) ≤ x}, which is completely determined by η for fixed m. Note, a natural parametrization in terms of a parameter of interest and a nuisance parameter is not available in this setting. We thus consider the statistical model

P = {Pη : η ∈ H}, where H is a statistical model, describing the distribution of the unobservable variable Y . These so called information loss models clearly fit the scope of §8.2.1. Even though this was a quite restrictive setting, important applications are available. Before we proceed, we investigate what probabilistic constraints the information loss model implies on the distribution of X and Y . The approach we use is slightly different than the approach presented in [40], §3.1.1. As described above, we dispose of a random vector (X,Y ) and we only observe X, a known measurable function m(Y ) of the unobservable Y . Let us denote the joint density function of X and Y by f(x, y). The conditional density function of X given Y is denoted by f ∗(x|y). Now suppose h(y) is the density function corresponding with the probability measure η relative to a dominating measure ν. The joint density function can then be written as f(x, y) = f ∗(x|y)h(y). Since X = m(Y ), the joint density function f ∗(x|y) is one of a very special kind. Indeed, we have  1 if x = m(y), f ∗(x|y) = P(X = x|Y = y) = (9.1) 0 if x 6= m(y). ∗ In short, we write f (x|y) = I{x = m(y)}. Suppose Pη has a density p(x) with respect to a dominating meausre µ. It follows that Z Z Z p(x) = f(x, y)dν(y) = I{x = m(y)}h(y)dν(y) = h(y)dν(y), Mx

179 180 Chapter 9. Applications to the Calculus of Scores

where Mx = {y ∈ Y : x = m(y)}. We denote this by p(x) = Eν{h(Y )|X = x} to remain conform with the notations in [40]. Now we fixed notation, we may proceed. Suppose we dispose of differentiable paths t 7→ η in H with corresponding score functions b. . t This gives rise to a tangent set Hη⊂ L2(η). To apply the theory of score operators, we want that a differentiable submodel t 7→ ηt induces a submodel t 7→ Pηt in P. There is some good news, a nice property of differentiability in quadratic mean is that it is preserved under censoring mechanisms of this type. The bad news is that we have to do a lot of (technical) work to obtain this. It is seen we need to find an operator Aη that maps score functions b(y) of the model H to score functions g(x) = Aηb(x) of the model P. The action of the score operator Aη working on the function b(y) turns out to be just taking the conditional expectation of b(y) given X = x. Intuition. Let us try to obtain some intuition by looking at a parametric model. Suppose the model H is described by a parameter θ, we then have the following relation, Z p(x; θ) = h(y; θ)dν(y). Mx We want to obtain a relation between the score corresponding to p and the score corresponding to h. Hence, we take the logarithm and differentiate with respect to θ, ∂ R ∂ h(y; θ)dν(y) log p(x; θ) = ∂θ Mx . ∂θ R h(y; θ)dν(y) Mx Now assuming we can interchange differentiation and integration, we obtain R Sh,θ(y)h(y; θ)dν(y) S (X = x) = Mx = E {S (Y )|X = x}, p,θ R h(y; θ)dν(y) η h,θ Mx where Sp,θ(X) = ∂ log p(X; θ)/∂θ and Sh,θ(Y ) = ∂ log h(Y ; θ)/∂θ. 4 The score operator is denoted by

Aη : L2(η) → L2(Pη): b(y) 7→ Aηb(x) = Eη{b(Y )|X = x}. (9.2) With the notation introduced above, we can write Z f ∗(x|y) A b(x) = E {b(Y )|X = x} = b(y) dη(y) η η p(x) R b(y)h(y)I{x = m(y)}dν(y) E {b(Y )h(Y )|X = x} = = ν . Eν{h(Y )|X = x} Eν{h(Y )|X = x} This yields an interesting relation and its usefulness will become clear in the proof below. We have Eν{b(Y )h(Y )|X} = Aηb(X)Eν{h(Y )|X} = Aηb(X)p(X). (9.3) Although we will have to do a lot of work to prove the score operator is indeed given by (9.2), the result is quite intuitive. If we consider the score functions b and g corresponding with the submodels t 7→ ηt and t 7→ Pηt , respectively, as the carriers of information about t in the variables Y ∼ ηt and X ∼ Pηt , respectively, then the information contained in the observation X is the information contained in Y reduced through conditioning. This is also the reason why we call this kind of models information loss models, since by only observing X, we lose information about Y and the mathematical way of saying what information that is left in X about Y happens through conditioning. Let us now prove this. The less interested reader can skip the proof since it is very technical. We will make use of the following algebraic lemma. 9.1. Information Loss Models 181

Lemma 9.1. For any real numbers a, b, c and d with a > 0, b/a ≤ ε < 1, c ≤ 0 and a + b + c + d ≥ 0, we have

2 2 2 2 √ √ 1 b 3d 1 b a + b + c + d − a − √ ≥ + 3c + √ − 1 . 2 a a(1 − ε) 1 − ε a

Theorem 9.1. Suppose that {ηt : 0 < t < 1} is a collection of probability measures on a measurable space (Y, B) such that for some measurable function b ∈ L2(η), b : Y → R, ( )2 Z dη1/2 − dη1/2 1 t − bdη1/2 → 0. (9.4) t 2

For a measurable map m : Y → X , let Pη be the distribution of m(Y ) if Y has distribution η and let Aηb(X) be the conditional expectation of b(Y ) given m(Y ) = X as defined in (9.2). Then, ( )2 Z dP 1/2 − dP 1/2 1 ηt η − A bdP 1/2 → 0. (9.5) t 2 η η

Proof. We begin the proof with some notation. Recall that we denoted the density of Y corre- sponding with the probability measure η by h(y) with respect to a dominating measure ν. For simplicity, we assume that the probability measures ηt have densities ht(y) with respect to the same dominating measure ν. Next, we assume b is uniformly bounded1 by M, i.e.,

sup |b(y)| < M. y∈Y

Let us introduce the following function of y,

h1/2(y) − h1/2(y) 1 u (y) = t − b(y)h1/2(y). (9.6) t t 2 2 R 2 From (9.4), we immediately see that νut = ut (y)dν(y) → 0 as t → 0. Above it is seen that Pη possesses a density p(x) = Eν{h(Y )|X = x} with respect to a dominating measure µ. Completely analogous, the probability distributions Pηt possess densities pt(x) = Eν{ht(Y )|X = x} with respect to the dominating measure µ. 1/2 1/2 1/2 Manipulating (9.6) yields ht = h + t(ut + bh /2). Taking squares at both sides gives  1  h = h + tbh + t2u2 + t tu bh1/2 + 2u h1/2 + tb2h . t t t t 4

Evaluating this equation at Y and applying the operator Eν(·|X), we find

2 2 Eν{ht(Y )|X} =Eν{h(Y )|X} + tEν{h(Y )b(Y )|X} + t Eν{ut (Y )|X}  1   + tE tu bh1/2 + 2u h1/2 + tb2h (Y )|X . ν t t 4

Using the definitions of p(X) and pt(X) and using (9.3), we may write

pt(X) = p(X) + tAηb(X)p(X) + ct(X) + dt(X), (9.7)

1In [40], p.30, it is described how to adjust the proof for unbounded functions b(y). 182 Chapter 9. Applications to the Calculus of Scores where 2 2 ct(X) = t Eν{ut (Y )|X},  1   d (X) = tE tu bh1/2 + 2u h1/2 + tb2h (Y )|X . t ν t t 4 2 The next step is to find an upper bound for |dt(X)| to use in further calculations. This will be quite technical and it will not be pretty mathematics. Using the uniform boundedness of b, we find  1  2 |d (X)|2 = t2E tu bh1/2 + 2u h1/2 + tb2h (Y )|X t ν t t 4  1  2 ≤ t2E tu h1/2M + 2u h1/2 + tM 2h (Y )|X ν t t 4 2 2 h 1/2 2 i . t Eν {uth (tM + 1) + tM h}(Y )|X . Rewriting this and using the simple inequality (α + β)2 ≤ 2(α2 + β2) for real numbers α and β, we obtain 2 2 2 h n 1/2 o 2 i |dt(X)| . t Eν (uth )(Y )|X (Mt + 1) + tM Eν{h(Y )|X}  2  2 n 1/2 o 2 2 4 2 . t Eν (uth )(Y )|X (Mt + 1) + t M Eν{h(Y )|X} .

Next we use the Cauchy-Schwarz inequality and the definition of p(X) to obtain 2 2  2 2 2 4  |dt(X)| . t Eν{ut (Y )|X}(Mt + 1) + t M Eν{h(Y )|X} Eν{h(Y )|X} 2  2 2 2 4  . t Eν{ut (Y )|X}(Mt + 1) + t M p(X) p(X). Introduce the set A = {x ∈ X : p(x) > 0} and fix X ∈ A. We now use Lemma 9.1 with a = p(X), b = tAηb(X)p(X), c = ct(X) and d = dt(X). Put ε to be Mt. Since X ∈ A, we have that a > 0. In addition, using (9.3), we have that b tA b(X)p(X) tE {b(Y )h(Y )|X} = η = ν ≤ tM = ε. (9.8) a p(X) p(X) When we take t to be sufficiently small, we have that ε = Mt < 1. It is also clear that c ≥ 0 and in addition a + b + c + d = pt(X) ≥ 0. We conclude all the assumptions of Lemma 9.1 are fulfilled. Hence, n 1 o2 p1/2(X)−p1/2(X) − tA b(X)p1/2(X) t 2 η 2 2 2 2 2 3dt (X) 1 t {Aηb(X)} p (X) ≤ + 3ct(X) + √ − 1 . p(X)(1 − Mt) 1 − Mt p(X) 2 Next, we use the upper bound for dt (X) and Aηb(X)p(X) ≤ Mp(X), this follows from (9.8), n 1 o2 p1/2(X)−p1/2(X) − tA b(X)p1/2(X) t 2 η (Mt + 1)2 t2M 4 t2E {u2(Y )|X} + t2 p(X) + t2E {u2(Y )|X} . ν t (1 − Mt) 1 − Mt ν t 2 2 1 2 + t √ − 1 M p(X). 1 − Mt 9.1. Information Loss Models 183

Dividing both sides through t2 yields

( )2 p1/2(X) − p1/2(X) 1 (Mt + 1)2 t2M 4 t − A b(X)p1/2(X) E {u2(Y )|X} + p(X) t 2 η . ν t (1 − Mt) 1 − Mt 2 2 1 2 + Eν{u (Y )|X} + √ − 1 M p(X). t 1 − Mt

2 Finally, using the fact that νut → 0 as t → 0, we have

2 Z ( 1/2 1/2 ) pt (x) − p (x) 1 1/2 − Aηb(x)p (x) dµ(x) → 0 (9.9) A t 2 as t → 0. To end the proof, we also need that

2 Z ( 1/2 1/2 ) pt (x) − p (x) 1 1/2 − Aηb(x)p (x) dµ(x) → 0 (9.10) Ac t 2

c c as t → 0 where A = {x ∈ X : p(x) = 0}. Since p(x) = 0 for all x ∈ A , (9.10) becomes

Z ( 1/2 )2 pt (x) 1 c dµ(x) = 2 Pηt (A ). Ac t t

c c We have that A is a zero set with respect to the probability measure Pη. Indeed, Pη(A ) = R c 2 c p(x)dµ(x) = 0. We now show that Pη ( ) = o(t ). This follows from (9.7), A t A Z Z c Pηt (A ) = {1 + tAηb(x)}p(x)dµ(x) + {ct(x) + dt(x)}dµ(x) Ac Ac Z = 0 + {ct(x) + dt(x)}dµ(x). Ac We now argue that d (x) = 0 for x ∈ c. Since p(x) = 0, we know that R h(y)dν(y) = 0. t A Mx c Since h(y) ≥ 0, we have that h(y) = 0 for almost every y ∈ Mx given x ∈ A . This implies that c dt(x) = 0 for all x ∈ A (null sets can be ignored). Hence, we see that Z c 2 2 Pηt (A ) = t Eν{ut(Y )|X = x}dµ(x) = o(t ). Ac

1 c R 2 Thus, t2 Pηt (A ) = c Eν{ut(Y )|X = x}dµ(x) → 0 as t → 0 because νut → 0. Henceforth, A c (9.10) also holds. Since A ∪ A = X , we have that

2 Z ( 1/2 1/2 ) pt (x) − p (x) 1 1/2 − Aηb(x)p (x) dµ(x) → 0 (9.11) X t 2

as t → 0. This is (9.5) where pdµ = dPη and ptdµ = dPηt . We may conclude that t 7→ Pηt is differentiable in quadratic mean with corresponding score function Aηb(X). This concludes the proof. 184 Chapter 9. Applications to the Calculus of Scores

The adjoint of the operator Aη : L2(η) → L2(Pη) is now easy to obtain. By Definition 8.2 and the law of iterated expectation, we deduce

hg, A bi = E{g(X)A b(X)} = E[g(X)E {b(Y )|X}] η L2(Pη) η η = E{g(X)b(Y )} = E[Eη{g(X)|Y }b(Y )] = A∗g, b . η L2(η)

Hence we can write

∗ ∗ Aη : L2(Pη) → L2(η): g(x) 7→ Aηg(y) = Eη{g(X)|Y = y}. (9.12)

However, although the score operator Aη is defined for all elements of L2(η), we need its restric- . tion to lin Hη to be in the scope of the theory developed in Chapter 8 and we denote this again ∗ by Aη. The implication was that, when considering the adjoint Aη, this must be the operator . from L2(Pη) to lin Hη⊂ L2(η). Henceforth, as we have seen before, the conditional expectation . Eη{g(X)|Y = y} needs to be followed by the orthogonal projection onto lin Hη.

9.2 Mixture Models

In this section, we will apply the general situation described in the previous section to an interesting example, so called mixture models. Mixture models were already introduced in §1.3.4. We consider two cases below. First we consider the case where the kernel of the mixture model is known, and next we briefly consider the case where the kernel of the mixture belongs to a parametric family.

9.2.1 Mixtures with Known Kernel p(x|z)

Suppose we observe X and suppose X possesses a known conditional density p(x|z) given an unobservable variable Z. If the unobservable Z possesses an unknown probability distribution η ∈ H, the observations are a random sample from the mixture density Z pη(x) = p(x|z)dη(z).

The set H is the infinite-dimensional set of all probability distributions for Z, i.e., the distribu- tion of Z is left unspecified. This kind of model can be seen as an information loss model. Put Y = (X,Z) and m(Y ) = m(X,Z) = X, the first component of Y . Note that, using the notation of the preceding para- graph, dηY (y) = p(x|z)dη(z) where ηY is the joint distribution of X and Z, i.e., the distribution of the random vector Y .

∗ The Score Operator Aη and its Adjoint Aη

. Since we assume the mixing distribution η is completely unknown, a tangent set Hη for η 0 can be taken to be the maximal tangent space {b ∈ L2(η): ηb = 0}, which we denote by L2(η). 9.2. Mixture Models 185

. . . Note that lin Hη= Hη since Hη is linear and closed. Using the knowledge we have obtained in the previous section, we see the score operator becomes

. R b(z)p(x|z)dη(z) A : L0(η) =H → L (P ): b(z) 7→ A b(x) = E{b(Z)|X = x} = . (9.13) η 2 η 2 η η R p(x|z)dη(z)

Note that Aη is a linear and bounded operator. Linearity is trivial to obtain. Using the Cauchy- Schwarz inequality, we obtain

kp(x|·)kL2(η) |Aηb(x)| ≤ kbkL2(η) . pη(x) Using this, we obtain

(Z kp(x|·)k2 )1/2 L2(η) kAηbkL2(Pη) ≤ kbkL2(η) dµ(x) = CkbkL2(η), pη(x) where we assume that (Z kp(x|·)k2 )1/2 C = L2(η) dµ(x) pη(x) is finite. This proves the boundedness of Aη. From Theorem 9.1, we know a tangent set for Pη . . . n 0 o is given by Aη Hη= Aηb : b ∈ L2(η) =Hη . In general, the tangent set Aη Hη is a subset of 0 the maximal tangent set {g ∈ L2(Pη): Pηg = 0} = L2(Pη). This follows by Theorem 9.1 and Lemma 6.1 but this can also be seen from an easy calculation, Z ZZ Pη(Aηb) = Aηb(x)pη(x)dµ(x) = b(z)p(x|z)dη(z)dµ(x) Z Z  = b(z) p(x|z)dµ(x) dη(z)

= ηb = 0.

0 0 0 Hence, we can see the operator Aη as an operator between L (η) and L (Pη), i.e., Aη : L (η) → . 2 2 . 2 0 0 L2(Pη). Since Hη= L2(η) is a linear space and Aη is a linear operator, we see that Aη Hη is also ∗ a linear space. It is easy to see that the adjoint operator Aη is given by

. Z ∗ 0 0 ∗ Aη : L2(Pη) 7→ L2(η) =Hη: g(x) 7→ Aηg(z) = E{g(X)|Z = z} = g(x)p(x|z)dµ(x). (9.14)

This follows from the following calculation,

hg, A bi = E{g(X)A b(X)} = E[g(X)E{b(Z)|X}] η L2(Pη) η = E{g(X)b(Z)} = E[E{g(X)|Z}b(Z)] = A∗g, b . η L2(η)

Exponential Kernel

In what follows, we consider the situation that the kernel p(x|z) belongs to an exponential family, i.e., p(x|z) = c(z)d(x) exp(zT x). Since z is unknown (we cannot observe it), the kernel p(x|z) 186 Chapter 9. Applications to the Calculus of Scores can be seen as it is parameterized through the vector z. Mixtures over exponential families of this type seem to give relatively large models. However, the mixture densities may still possess very special properties. Mixtures over the exponential family p(x|z) = zezxI(x > 0) turn out to be monotone densities and mixtures over the normal location family are extremely smooth. Henceforth, the set of mixtures of this type can be far from equal to the nonparametric model. Nonetheless, although mixtures with exponential kernel may possess very special properties, we will obtain a somewhat surprising result regarding the estimation of the functional ψ(η) = Pηf where f ∈ L2(Pη). We will consider two different cases: (a) the support of η contains a limit point and (b) the support of η does not contain a limit point.

(a) Support of η Contains a Limit Point

In order to say something about the efficiency an estimator for ψ(η) = Pηf can have, we will use the theory of the score operators and the abstract approach will turn out to be very useful. Before doing so, we remind the reader of the definition of a complete family of density functions and this will turn out to be useful in the derivation below.

Definition 9.1. Let {f(t|λ): λ ∈ Λ} be a family of density functions for a statistic T (X) where k Λ ⊂ R an open subset. The family of density functions {f(t|λ): λ ∈ Λ} is called complete if Eλ{g(T )} = 0 for all λ ∈ Λ implies that Pλ{g(T ) = 0} = 1 for all λ ∈ Λ, i.e., g(T ) = 0 almost surely. Equivalently, the statistic T (X) is called a complete statistic.

Lemma 9.2. If the interior of the support of η is nonempty, i.e., the support of η contains a . 0 limit point, then the tangent set Aη Hη is dense in the maximal tangent set L2(Pη), i.e.,

. 0 Aη Hη = L2(Pη).

. Proof. By definition, Aη Hη= R(Aη) where Aη is seen as the linear and bounded operator 0 0 ⊥ between the Hilbert spaces L2(η) and L2(Pη). If we can show that R(Aη) = {0}, then it 0 ⊥ ∗ follows that R(Aη) is dense in L2(Pη). By Lemma 8.2, we know that R(Aη) = N (Aη). Hence, ∗ 0 ∗ we need to show that N (Aη) = {0}. Take any g ∈ L2(Pη) and assume g ∈ N (Aη), i.e., Z ∗ 0 = Aηg(z) = E{g(X)|Z = z} = g(x)p(x|z)dµ(x), a.s. under η,

T ∗ where p(x|z) = c(z)d(x) exp(z x). We note the latter equation is valid a.s. under η since Aηg(z) ∗ is the zero function in L2(η)-sense, i.e., for all z in a set of η-measure 1. Thus, η{Aηg(Z)} = E{g(X)} = 0. Next we use the completeness of the exponential family2. However, a technical condition involving this property is that the interior of the support of η has a limit point, i.e., there is an open set contained within the support of η with non-zero probabililty. Since we assume this, we can use the completeness of the exponential family. Thus, from the definition of completeness, E{g(X)} = 0 implies that Pη{g(X) = 0} = 1, i.e., g(x) = 0 a.s. under Pη. This . ∗ 0 shows that N (Aη) = {0}. Hence, we may conclude that Aη Hη is dense in L2(Pη).

This lemma has an important consequence. Recall we are interested in the estimation of the . 0 functional ψ(η) = Pηf where f ∈ L2(Pη). Since AηHη is linear and dense in L2(Pη) and using

2See for example Theorem 6.2.25, p.288 of [6]. 9.2. Mixture Models 187

Lemma 7.2, we see that the empirical estimators Pηf, for the fixed square integrable function f, are asymptotically efficient estimators for the functional ψ(η) = Pηf. For instance, suppose we are interested in the mean of the observation, in that case f(x) = 1(x) = x and f is square integrable provided that X has finite variance. It follows that the sample mean is asymptotically efficient for estimating the mean of the observations. This means that for estimating a functional such as the mean of the observations in this mixture model, it is of little use to know that the underlying distribution is a mixture, i.e., this additional information does not decrease the asymptotic variance. As noted before, this is quite surprising since mixtures may possess very special properties.

(b) Support of η Does Not Contain a Limit Point

We now consider the case where the support of η does not contain a limit point. If this occurs, the preceding argument fails. However, under similar assumptions, we may reach almost the same conclusion by using a different type of scores, as proposed by van der Vaart. Recall that the set H was the set of all probability distributions for Z. We use the paths

(a,η1) ηt = (1 − at)η + taη1

(a,η1) for any fixed a ≥ 0 and fixed η1 ∈ H. We write this as dηt = (1 − at)dη + tadη1. This is well-defined if we choose t to be sufficiently small such that 0 ≤ at ≤ 1, this assures that (a,η1) (a,η1) dηt ≥ 0. It is also easy to see that dηt integrates to 1, Z Z Z (a,η1) dηt = (1 − at) dη + at dη1 = (1 − at) + at = 1.

The corresponding path p (a,η1) (x) of density functions for X is then given by ηt Z (a,η1) p (a,η1) (x) = p(x|z)dηt (z) ηt Z Z = (1 − ta) p(x|z)dη(z) + ta p(x|z)dη1(z)

= (1 − ta)pη(x) + tapη1 (x).

An easy calculation show that the score in a pointwise sense for this submodel is given by   ∂ pη1 log p (a,η ) (x) = a (x) − 1 . η 1 ∂t t=0 t pη In [40], p.33, it is stated it can be shown this is also a score in mean-squared sense provided that this score is in L2(Pη). Hence, we may conclude that a tangent set for P = {Pη : η ∈ H} is given by .     pη1 PPη = a (x) − 1 : a ≥ 0, η1 ∈ H . (9.15) pη In [40], p.33, is is also stated this tangent set is a convex cone. This is not so difficult to obtain. . Lemma 9.3. The tangent set PPη defined in (9.15) is a convex cone. 188 Chapter 9. Applications to the Calculus of Scores

. Proof. The tangent set PPη is clearly a cone. We now show that it is also a convex cone. Take n o n o pη1 pη2 two paths p 1 and p 2 with scores a1 (x) − 1 and a2 (x) − 1 , respectively (ai ≥ 0 ηt ηt pη pη and ηi ∈ H for i = 1, 2). We need to show that

    . pη1 pη2 λa1 (x) − 1 + (1 − λ)a2 (x) − 1 ∈PPη , pη pη for any λ ∈ (0, 1). i The path in H corresponding with p i is given by η = (1 − tai)η + taiηi for i = 1, 2. Next, ηt t define the path

λ 1 2 ηt = ληt + (1 − λ)ηt = [1 − t{λa1 + (1 − λ)a2}]η + t{λa1η1 + (1 − λ)a2η2}.

Put aλ = {λa1 + (1 − λ)a2} and introduce ηλ = {λa1η1 + (1 − λ)a2η2}/aλ. We have that

Z 1  Z Z  dηλ = λa1 dη1(z) + (1 − λ)a2 dη2(z) = 1. aλ

This shows that ηλ defines a proper distribution, i.e., ηλ ∈ H. Hence, since 0 ≤ taλ ≤ 1, the path λ ηt = (1 − taλ)η + taληλ defines a proper path. The corresponding path of density functions for n o . pηλ x is given by p λ (x) = (1 − taλ)pη(x) + taλpη with corresponding score aλ (x) − 1 ∈PP . ηt λ pη η We end the proof by noting that p λ (x) = λp 1 (x) + (1 − λ)p 2 (x) and ηt ηt ηt

      . pη1 pη2 pηλ λa1 (x) − 1 + (1 − λ)a2 (x) − 1 = aλ (x) − 1 ∈PPη . pη pη pη

. This shows that PPη is a convex cone.

. 0 We want to show that the convex cone PPη is dense in L2(Pη). To do this, we show that . . ⊥ 0 the orthogonal complement PPη = {0}. Thus, take any g ∈ L2(Pη) and g ⊥PPη . By the orthogonality, we have that    pη1 0 = Pη g − 1 = Pη1 g, for all η1 ∈ H. pη

To obtain that g = 0 a.s. under Pη, we need to assume that P = {Pη : η ∈ H} is complete. If we do so, we see that g = 0 a.s. under Pη and hence we obtain the desired result that the . 0 convex cone PP is dense in L (Pη), the maximal tangent set for a nonparametric model for Pη. . η 2 Since PPη is a convex cone, we may apply the lower bound theorems discussed in Chapter 7 and following the same argument as in case (a), we conclude that the empirical estimators Pηf are asymptotically efficient for the estimation of the functional ψ(η) = Pηf where f ∈ L2(Pη).

9.2.2 Semiparametric Mixtures with Kernel pθ(x|z)

In this section, we briefly describe a result that is given in van der Vaart (1998), [39], Example 25.36, p.376-377. In the case of semiparametric mixtures, we assume the kernel pθ(x|z) belongs to a parametric family. Thus, the model for the unobserved data Y = (X,Z) is given by the set 9.3. Semiparametric Models in a Strict Sense 189

of densities pθ(x|z)dη(z) where η ∈ H. This model has scores for both θ and η. Suppose that 0 the path t 7→ ηt is differentiable with score b(z) ∈ L2(η) and that ZZ  2 1/2 1/2 1 . 1/2 p − p (x|z) − aT ` (x|z)p (x|z) dµ(x)dη(z) = o(kak2). θ+a θ 2 θ θ . Under these assumptions, it can be shown that the function aT (x|z)+b(z) is the score function `θ . corresponding with the submodel t 7→ pθ+ta(x|z)dηt(z). If pθ(x|z) is sufficiently smooth, `θ(x|z) coincides with the ordinary score function for θ. Next, by Theorem 9.1, the function . . R T n T o {a `θ (x|z) + b(z)}pθ(x|z)dη(z) Eθ,η a `θ (X|Z) + b(Z)|X = x = R pθ(x|z)dη(z) is a score for the model corresponding to observing X only. Note this is a natural generalization of the ideas put forward in the previous section. If θ is considered to be the parameter of interest, the nuisance tangent set is left unchanged.

9.3 Semiparametric Models in a Strict Sense

Consider a semiparametric model in a strict sense,

P = {Pθ,η : θ ∈ Θ, η ∈ H},

k where Θ ⊂ R , an open set and H an arbitrary set, typically of infinite dimension. These models fit the scope of §8.2.2 where η should now be taken to be (θ, η). These parameters can be perturbed independently. In §7.4.1, it is seen that a score will be . . T typically of the form a `θ,η +g where `θ,η equals the ordinary score function for θ with fixed η k (a ∈ R ) and g has the interpretation of a score function for η when θ is fixed. In the light of the theory described in §8.2.2, we can expect that the score operator now takes the form . T Aθ,η(a, b) = a `θ,η +Bθ,ηb. (9.16) We already encountered the part working on the k-dimensional vector a. This is the score k operator discussed in the example of the parametric model, an operator from to L2(Pθ,η). R . T Just to give it a name, for further reference, we denote this operator by αθ,ηa = a `θ,η. The operator Bθ,η : Hη → L2(Pθ,η) is the score operator for the nuisance parameter. As in the general theory, Hη is a subset of a Hilbert space Hη consisting of directions b to approximate the nuisance parameter η. Thus, the operator defined in (9.16) is defined as the linear operator

k Aθ,η : R × lin Hη → L2(Pθ,η). (9.17)

k 3 k This is an operator defined on a subset R × lin Hη of the direct sum of Hilbert spaces R × Hη. k The space R × Hη is a Hilbert space relative to the inner product T h(a, b), (α, β)i k = a α + hb, βi . R ×Hη Hη

T k The first part a α represents the inner product of the Hilbert space R and hb, βiHη is the inner product of the Hilbert space Hη but we use the subscript Hη to emphasize we are working with 3See Definition 2.7 and Example 2.5 for more information about the direct sum of Hilbert spaces and the associated inner product. 190 Chapter 9. Applications to the Calculus of Scores elements from the (closed linear span/linear span of the) set Hη. Let it be clear this setting fits k the general set-up with R × Hη playing the role of the earlier Hη. After noting these introductory considerations, we find ourselves in a comfortable situation. The general theory of score operators is applicable straight away. We will first focus on finding the efficient influence function for estimating θ. Next, we will focus on finding the efficient influence function for estimating a function of the nuisance parameter η, which despite its name, can also be of interest.

9.3.1 Efficient Influence Function for θ

Estimation of the functional ψ(Pθ,η) = θ already received a lot of attention in §7.4.1. The efficient influence function for estimating θ is expressed in terms of the efficient score function ˜ `θ,η, which is defined as the ordinary score function minus its orthogonal projection onto the . closed linear span of the nuisance tangent set η PPθ,η , the score-space for η. This was a result obtained in Lemma 7.3. We found that

˜ ˜−1 ˜ ψθ,η = Iθ,η `θ,η, (9.18) ˜ with I˜θ,η the efficient information matrix (the variance matrix of `θ,η). Presently, the nuisance . tangent set η PPθ,η is defined to be the range space of the score operator Bθ,η, . η PPθ,η := R(Bθ,η) = Bθ,ηHη.

∗ Assume for the moment the operator Bθ,ηBθ,η is continuously invertible (which fails in many examples). Using the result derived in §8.2.1, or simply using Proposition 2.10 with H1 = Hη ∗ −1 and H2 = L2(Pθ,η), we know that the operator Bθ,η(Bθ,ηBθ,η) Bθ,η is the orthogonal projection . onto lin Bθ,ηHη = lin η PPθ,η . Before, we denoted this with Πθ,η, now we obtained a specific form for this orthogonal projection operator under the aforementioned assumption. Therefore we can write . ˜ ∗ −1 `θ,η = {I − Bθ,η(Bθ,ηBθ,η) Bθ,η} `θ,η, (9.19) where I denotes the identity operator. Let us reflect on the form of (9.19). Because the variance of the efficient influence function, −1 ˜ ˜T  ˜ ˜T  Pθ,ηψθ,ηψθ,η equals the inverse of the variance of the efficient score function, Pθ,η`θ,η`θ,η , its ˜ ˜T magnitude is smallest when the magnitude of Pθ,η`θ,η`θ,η is as large as possible. With as small ˜ ˜T as possible we mean that, for an arbitrary estimator Tn for θ, the matrix var(Tn) − Pθ,ηψθ,ηψθ,η is semi-positive definite. An analogous interpretation is given to as large as possible. Thus, in (9.19), we want to substract as little as possible, therefore the direction . ∗ −1 b = −(Bθ,ηBθ,η) Bθ,η `θ,η is a least favourable direction in H for estimating θ, it contains as little as possible information ˜ about θ. In this way `θ,η contains as much as possible information about θ under the semipara- metric model. If for example θ is one-dimensional, the submodel t 7→ Pθ+t,ηt where ηt approaches η in this direction, has the least information for estimating θ relative to the other parametric ˜ submodels and has score function `θ,η at t = 0. This is what we called the least favourable or hardest submodel. 9.3. Semiparametric Models in a Strict Sense 191

What is new? It seems we did not make a lot of progress introducing these score operators for estimation in semiparametric models in a strict sense. Nonetheless, a specific form of the orthogonal projection onto lin Bθ,ηHη is now available. Introducing the concept of a score oper- ator in the Cox model will yield elegant formulas to evaluate efficiency of estimators. Luckily, we can do something more, as we already announced.

9.3.2 Efficient Influence Function for χ(η)

A function χ(η) of the nuisance parameter can, despite the name, also be of interest. Thus, we consider estimation of the functional

ψ : P → R : Pθ,η 7→ ψ(Pθ,η) = χ(η), where χ : H → R : η 7→ χ(η). The efficient influence function for this parameter can be found from (8.14) or (when available) from (8.15). Therefore, we need the adjoint of Aθ,η : k ∗ k R × lin Hη → L2(Pθ,η), i.e., Aθ,η : L2(Pθ,η) → R × lin Hη and the corresponding information ∗ k k ∗ operator Aθ,ηAθ,η : R × lin Hη → R × lin Hη. By Theorem 2.6, Aθ,ηAθ,η can be extended to k a continuous operator on R × lin Hη. Before deriving these two operators, let us denote the adjoint operator of Bθ,η by ∗ Bθ,η : L2(Pθ,η) → lin Hη.

Note that its existence follows from the assumption that Bθ,η exists, as discussed earlier. Take any g ∈ L2(Pθ,η), then . T hg, Aθ,η(a, b)iL2(Pθ,η) = hg, a `θ,η +Bθ,ηbiL2(Pθ,η) . T = hg, a `θ,ηiL2(Pθ,η) + hg, Bθ,ηbiL2(Pθ,η) . ∗ = hPθ,ηg `θ,η, aiRk + hBθ,ηg, biHη . ∗ = h(P g ,B g), (a, b)i k . θ,η `θ,η θ,η R ×Hη ∗ The second last equality follows from (8.16) and the fact that Bθ,η is defined to be the adjoint of Bθ,η. We conclude that . ∗ ∗ k Aθ,ηg = (Pθ,ηg `θ,η,Bθ,ηg) ∈ R × lin Hη. (9.20)

k The information operator also follows from an easy calculation. Take any (a, b) ∈ R × lin Hη, . ∗ ∗ T Aθ,ηAθ,η(a, b) = Aθ,η(a `θ,η +Bθ,ηb) . . .  T ∗ T  = Pθ,η[{a `θ,η +Bθ,ηb} `θ,η],Bθ,η{a `θ,η +Bθ,ηb} . . . .  T ∗ T ∗  = {Pθ,η `θ,η `θ,η}a + Pθ,η{`θ,η Bθ,ηb}, {Bθ,η `θ,η}a + Bθ,ηBθ,ηb " . #   Iθ,η Pθ,η `θ,η Bθ,η a = . . (9.21) ∗ T ∗ b Bθ,η `θ,η Bθ,ηBθ,η

. ∗ T Remark 9.1. As usual, Bθ,η `θ,η should be read componentwise,

. . . ∗ T h ∗ 1 ∗ k i Bθ,η `θ,η= Bθ,η `θ,η,...,Bθ,η `θ,η . 192 Chapter 9. Applications to the Calculus of Scores

∗ The information operator Aθ,ηAθ,η can thus be seen as a (k + 1) × (k + 1) matrix. However, this should not be taken in a strict sense because this is not a real matrix but a matrix of operators, ∗ e.g., Bθ,ηBθ,η can be seen as an infinite dimensional real matrix if we have a basis for Hη. The first diagonal element of this block-matrix is the ordinary Fisher information matrix for ∗ estimating θ. The other diagonal element Bθ,ηBθ,η is the information operator for estimating η if θ would be known. We now readdress our attention to the estimation of the function χ(η). Assume that the function η 7→ χ(η) is differentiable in the sense of (8.10). This gives rise to a functionχ ˜η ∈ lin Hη. This means that the map (θ, η) 7→ χ∗(θ, η) = χ(η) is differentiable in the sense that for any path k t 7→ ηt with corresponding function b and for any a ∈ R , χ∗(θ + ta, η ) − χ∗(θ, η) χ(η ) − χ(η) t = t t t

= hχ˜η, biHη

= h(0, χ˜ ), (a, b)i k . η R ×Hη

k This shows that (0, χ˜η) ∈ R × lin Hη is the corresponding influence function. Thus, for a real parameter χ(η), equation (8.14) becomes   ∗ ˜ 0 Aθ,ηψPθ,η = . (9.22) χ˜η

Combining (9.22) with (9.20) yields two equations the efficient influence function must satisfy, ˜ . Pθ,ηψPθ,η `θ,η= 0, (9.23a) ∗ ˜ Bθ,ηψPθ,η =χ ˜η. (9.23b)

The first condition is in the light of Proposition 7.1. The efficient influence function for estimating . χ(η) is orthogonal (to each component of) to the score function `θ,η of θ. That is, the efficient . influence function for estimating χ(η) must be orthogonal to lin `θ,η. . Remark 9.2. The set lin `θ,η can be regarded to be the tangent space for the parameter θ (using . the terminology from Part II). The whole tangent space Pθ,η can then be seen as the direct sum of the tangent space for θ and the nuisance tangent space, . . . k Pθ,η= lin `θ,η × η PPθ,η = Aθ,η(R × Hη), this is in the line of the definition of tangent spaces in Part II.

We already know there can be at most one solution of (9.23a) and (9.23b) contained within k the closed linear span of the tangent set lin Aθ,η(R × Hη). The key question now is: how can we come up with a solution of (9.23a) and (9.23b)? To obtain the desired result, we use a pedagogical approach that explains in some steps below, how and why we come up with the proposed formulas. If θ would be known, we can guess from (8.15) that the efficient influence function for χ(η) takes ∗ − the form Bθ,η(Bθ,ηBθ,η) χ˜η. Note we can write down this formula if we assume thatχ ˜(η) is ∗ contained within the range of Bθ,ηBθ,η, which we do. However, θ is unknown. This should be taken into account. We get our inspiration from (9.19). We should substract the orthogonal . ∗ − projection of Bθ,η(Bθ,ηBθ,η) χ˜η onto the tangent space for θ, lin `θ,η. This is a k-dimensional 9.3. Semiparametric Models in a Strict Sense 193

subspace of L2(Pθ,η). As an application of the projection theorem, in §2.4.3, especially equation (2.4), it is seen that the orthogonal projection on such a finite-dimensional space is given by

.  . .T −1 . . . hB (B∗ B )−χ˜ , iT P = hB (B∗ B )−χ˜ , iT I−1 . θ,η θ,η θ,η η `θ,η L2(Pθ,η) θ,η `θ,η `θ,η `θ,η θ,η θ,η θ,η η `θ,η L2(Pθ,η) θ,η `θ,η

Therefore we suggest the following solution, . . ψ˜init = B (B∗ B )−χ˜ − hB (B∗ B )−χ˜ , iT I−1 . Pθ,η θ,η θ,η θ,η η θ,η θ,η θ,η η `θ,η L2(Pθ,η) θ,η `θ,η Note it will turn out this is not the solution we look for. Nonetheless, we give it a try to see where it goes wrong.

Remark 9.3. Because ψ˜init is one-dimensional, we note that ψ˜init = (ψ˜init )T , and thus Pθ,η Pθ,η Pθ,η

.T . ψ˜init = B (B∗ B )−χ˜ − I−1hB (B∗ B )−χ˜ , i . Pθ,η θ,η θ,η θ,η η `θ,η θ,η θ,η θ,η θ,η η `θ,η L2(Pθ,η)

Let us check the first condition (9.23a), . . P ψ˜init = P ψ˜init θ,η Pθ,η `θ,η θ,η `θ,η Pθ,η . n ∗ − o = Pθ,η Bθ,η(Bθ,ηBθ,η) χ˜η `θ,η . . . T −1 ∗ − − Pθ,η `θ,η `θ,η Iθ,η hBθ,η(Bθ,ηBθ,η) χ˜η, `θ,ηiL2(Pθ,η) = 0.

. ˜init This trivially holds because we made ψP to be orthogonal to `θ,η because we substracted the . θ,η orthogonal projection onto lin `θ,η. The problem arises when checking the second equation, n . . o B∗ ψ˜init =χ ˜ − B∗ hB (B∗ B )−χ˜ , iT I−1 . θ,η Pθ,η η θ,η θ,η θ,η θ,η η `θ,η L2(Pθ,η) θ,η `θ,η

To satisfy (9.23b), we want the second piece to be zero. This will be the case if . . hB (B∗ B )−χ˜ , iT I−1 ∈ N (B∗ ). θ,η θ,η θ,η η `θ,η L2(Pθ,η) θ,η `θ,η θ,η

∗ ⊥ ⊥ Since N (Bθ,η) = R(Bθ,η) = (Bθ,ηHη) , this is equivalent with . . hB (B∗ B )−χ˜ , iT I−1 ∈ R(B )⊥. θ,η θ,η θ,η η `θ,η L2(Pθ,η) θ,η `θ,η θ,η . ⊥ We now see the problem, the ordinary score function `θ,η does not necessarily belong to R(Bθ,η) . . ⊥ Thus, do we possess an object, related with `θ,η, contained within R(Bθ,η) ? The answer is positive, the efficient score function is defined to be the residual of the ordinary score function after projecting it onto the nuisance tangent set, in this case, the range space of Bθ,η, thus each ˜ ⊥ component of `θ,η belongs to R(Bθ,η) . This sounds great, therefore we modify our proposed efficient influence function, . ψ˜ = B (B∗ B )−χ˜ − hB (B∗ B )−χ˜ , iT I˜−1`˜ . (9.24) Pθ,η θ,η θ,η θ,η η θ,η θ,η θ,η η `θ,η L2(Pθ,η) θ,η θ,η . ∗ − T −1 ˜ Because hBθ,η(B Bθ,η) χ˜η, `θ,ηi I˜ `θ,η is a linear combination of the components of θ,η L2(Pθ,η) θ,η ˜ ⊥ `θ,η, this belongs to R(Bθ,η) and therefore we conclude that (9.24) satisfies (9.23b). 194 Chapter 9. Applications to the Calculus of Scores

. ˜T ˜ ˜T The question now is, is (9.23a) still satisfied? Luckily it is because Pθ,η `θ,η `θ,η equals Pθ,η`θ,η`θ,η, ˜−1 ˜ the efficient information matrix Iθ,η , since each component of `θ,η is orthogonal to each compo- ˜ . nent of `θ,η− `θ,η. We conclude that (9.24) is the efficient influence function for estimating χ(η). The second part of the efficient influence function for χ(η) is the part that is lost due to the fact that θ is unknown. Since it is orthogonal to the first part (again because it is orthogonal to R(Bθ,η), which contains ∗ − Bθ,η(Bθ,ηBθ,η) χ˜η), it adds a positive contribution to the variance. Indeed,

h i n . o2 P ψ˜2 = P B (B∗ B )−χ˜ 2 + P hB (B∗ B )−χ˜ , iT I˜−1`˜ . θ,η Pθ,η θ,η θ,η θ,η θ,η η θ,η θ,η θ,η θ,η η `θ,η L2(Pθ,η) θ,η θ,η

The second part of the expression for the variance of the efficient influence function quantifies how much variance there must be added to compensate the fact that θ is unknown.

9.3.3 Conclusion

We successfully obtained the efficient influence functions for estimating the parameter of interest, θ, and a one-dimensional function of the nuisance parameter, χ(η).

1. Efficient influence function for θ:

˜θ ˜−1 ˜ ψθ,η = Iθ,η `θ,η . ˜−1 ∗ −1 = Iθ,η {I − Bθ,η(Bθ,ηBθ,η) Bθ,η} `θ,η,

2. Efficient influence function for χ(η):

. ψ˜χ(η) = B (B∗ B )−χ˜ − hB (B∗ B )−χ˜ , iT I˜−1`˜ . Pθ,η θ,η θ,η θ,η η θ,η θ,η θ,η η `θ,η L2(Pθ,η) θ,η θ,η

The theory sketched out above is still quite general. It is time to apply it to an interesting semiparametric model in a strict sense, already introduced in §1.3.5, Cox’s proportional hazards model. In the next section, we will investigate it in a more (complicated) realistic form.

9.4 Cox’s Proportional Hazards Model Under Right Censoring

We now illustrate the general formulas obtained for semiparametric models in a strict sense by explicit calculation for the Cox’s proportional hazards model under right censoring. This model is slightly more complicated than the ordinary Cox’s proportional hazards model, introduced in §1.3.5 due to the presence of the right censoring. The ordinary Cox model will constitute the basis to build the model under right censoring.

9.4.1 Building the Model Under Right Censoring

Right Censoring

Let us first describe what is meant by right censoring. The primary goal is to model the relationship of time to event as a function of the covariates Z. We will refer to the time of event 9.4. Cox’s Proportional Hazards Model Under Right Censoring 195 as the time of death since in many applications where these models are used, the endpoint is time of death. However, any event could be used. Unfortunately, in many clinical trials where survival data are obtained, not all survival data are available for the individuals in the study. In that case, some of the survival data may be right censored. By this we mean that for some individuals, we only know they survived to some time. As an example, we consider a clinical trial. Patients enter the study during some period the study runs and it is self-evident the study ends before all patients have died. Henceforth, a patient who is still alive at the end of the study is called right censored. This is what we refer to as administrative censoring. Another reason for right censoring is patient dropout. In this case, the investigator only knows the patient is still alive at the time the patient dropped out.

Density Function Under Right Censoring

We now describe the model more formally. Let it be clear we typically observe a random sample from the distribution of the variable

X = {T ∧ C,I(T ≤ C),Z}.

The variable T represents a time of death. This time of death is only observed if death occurs before the time C, a censoring event. Otherwise the censoring time C is observed. Therefore we observe the minimum Y = T ∧ C. The variable Z is a k-dimensional vector of covariates. We will assume that T q C|Z, given the covariates Z, the time of death is independent of the time of the censoring event. The variable ∆ = I(T ≤ C) is a dichotomous variable indicating whether we observe T (∆ = 1) or alternatively we observe C (∆ = 0). The couple (Z,T ) follows the standard proportional hazards model (1.8). We will now derive the joint density pY,∆,Z of the variable X = (Y, ∆,Z). We can write

pY,∆,Z {y, δ, z; θ, λ(·)} = pY,∆|Z {y, δ|z; θ, λ(·)}pZ (z), where pZ (z) is the marginal density of Z and is left unspecified. Note it is independent of the unknown parameter θ and the baseline unknown hazard function λ(·). Now we investigate how the conditional density pY,∆|Z looks like. First, consider the case that ∆ = 1. This means we observe T . In this case we have

T θT z e−θ zΛ(y) pY,∆|Z {y, 1|z; θ, λ(·)} = e λ(y)e × P(∆ = 1|Z = z). We investigate the term P(∆ = 1|Z = z),

P(∆ = 1|Z = z) = P(y ≤ C|Z = z) = 1 − P(C < y|Z = z)

= 1 − FC|Z (y− |z).

This calculation needs some comments. The funcion FC|Z (c|z) is the conditional distribution function of the variable C, given the covariates Z. We assume this possesses a density fC|Z (c|z), which we shall use shortly. This distribution and corresponding density function are left unspec- ified. Note it is also independent of θ and λ(·). In the derivation above, we wrote the symbol y−. Why is that? We should clarify this notation,

FC|Z (y− |z) = lim FC|Z (ξ|z), ξ→y < 196 Chapter 9. Applications to the Calculus of Scores

the one sided limit where we approach y from the left. We use this one sided limit because FC|Z is not necessarily continuous. We will discuss this in more detail when we are talking about the assumptions we will impose on this model. Second, consider the case that ∆ = 0. This means we observe C. In this case we have

pY,∆|Z {y, 0|z; θ, λ(·)} = fC|Z (y|z) × P(∆ = 0|Z = z). We investigate the term P(∆ = 0|Z = z), P(∆ = 0|Z = z) = P(y ≤ T |Z = z) = 1 − P(T < y|Z = z)

θT z = S(y|Z = z) = e−e Λ(y). We also give some comments on the latter derivation. The function S(y|Z = z) is the conditional survivor function. It is defined analogously to the marginal survivor function, introduced in §1.3.5. Let it be clear it possesses the same properties, but now using the conditional hazard rate in de derivations made in §1.3.5. Therefore, the last equality is justified. We do not need to take a one sided limit because S(y|Z = z) is assumed to be continuous. We may conclude with the joint density function pY,∆,Z {y, δ, z; θ, λ(·)}, the density function of an observation distributed according to the proportional hazards model under right censoring,

 T δ  T 1−δ θT z −eθ zΛ(y) −eθ zΛ(y) e λ(y)e {1 − FC|Z (y− |z)} e fC|Z (y|z) pZ (z). (9.25)

Due to the right censoring mechanism, the density function is more complicated than the density function (1.8) derived in §1.3.5. If there is no right censoring (i.e., C is not present), (9.25) reduces to (1.8).

Assumptions

Before we derive the score operator, its adjoint and the information operator, we make a number of assumptions.

• We assume the covariate Z is bounded, i.e., ∃M ∈ R such that kZk ≤ M where kZk = 1/2 Pk 2 T i=1 Zi and Z = [Z1,...,Zk] . • Another assumption made on the covariate Z is that the true conditional distribution of T given Z possesses a continuous Lebesgue density, a continuous density function with Lebesgue measure on the real line. • We assume that there exists a finite number τ > 0 such that P(C ≥ τ) = P(C = τ) > 0, which also implies that P(C > τ) = 0. This τ typically represents the end of the survival study. Thus, when the survival study is ended, no later censoring time is possible. However, it is possible that C is smaller than τ because people can leave the survival study before it is ended other than through death. From these considerations, it is clear that FC|Z is not necessarily continuous. • Next we assume that P(T > τ) > 0. This is not an unnatural assumption. It is satisfied if the survival study is stopped at this time τ at which a positive fraction of the individuals is still at risk (alive). Indeed, it would be very unnatural that everyone is dead at the end of the survival study. 9.4. Cox’s Proportional Hazards Model Under Right Censoring 197

• Finally, we assume that, for any measurable function h, the probability that Z 6= h(Y ) is positive; there exists no measurable function that transforms Y into the covariate Z with probability 1.

From the assumptions made above, the function Λ only matters on the interval [0, τ] to calculate the log likelihood. Indeed, when we observe T , then T must be certainly smaller than τ, else we would observe C but C is smaller or equal than τ with probability 1. On the other hand, when we observe C, it certainly is smaller than τ with probability 1 and the value of T does not matter. It is clear that we only make use of Λ(y) for y ∈ [0, τ] in the density function (9.25). Thus, in what follows, we shall identify Λ with its restriction to this interval [0, τ]. In the next section, §9.4.2, we will derive the efficient score function and the efficient information matrix under the assumption that the covariate distribution pZ (z) and the conditional censoring distribution FC|Z (c|z) are known to us. This does not seem very logical. However, in the end, it will turn out the efficient score function and the efficient information matrix are the same if we leave the covariate distribution pZ (z) and the conditional censoring distribution FC|Z (c|z) unspecified. The reason will be that the score for θ is automatically orthogonal to the tangent space corresponding with the covariate distribution pZ (z) and the conditional censoring distribution FC|Z (c|z). This is what we referred to as adaptive estimation. We will describe this in §9.4.3. The results are based on the brief discussion in van der Vaart (2002), [40], p.35-37. However, no reference was made to adaptive estimation and in addition, the discussion in [40] assumes the covariate Z to be one-dimensional. We generalize this to the case where Z is k-dimensional.

9.4.2 Known Covariate Distribution pZ (z) and Conditional Censoring Distri- bution FC|Z (c|z)

Score Operators αθ,Λ and Bθ,Λ for θ and Λ, Respectively

As in §6.3.3, to derive the score functions for θ and Λ, we need the log likelihood of the joint density function pY,∆,Z {y, δ, z; θ, λ(·)}. We write,

log[pY,∆,Z {y, δ, z;θ, λ(·)}] h T θT z i = δ θ z − e Λ(y) + log{λ(y)} + log{1 − FC|Z (y− |z)} (9.26)

h θT z i + (1 − δ) −e Λ(y) + log{fC|Z (y|z)} + log{pZ (z)}.

For further reference, we write x = (y, δ, z). . The ordinary score function `θ,Λ (x) for θ is straightforward. Differentiating (9.26) with respect to θ gives

. ∂ ` (x) = log [p {y, δ, z; θ, λ(·)}] θ,Λ ∂θ Y,∆,Z T T = δz − zδeθ zΛ(y) − (1 − δ)zeθ zΛ(y) T = δz − zeθ zΛ(y). (9.27)

As it is seen before, the score operator for θ then becomes

k T θT z αθ,Λ : R → L2(Pθ,Λ): r 7→ αθ,Λ(r) = r {δz − ze Λ(y)}. (9.28) 198 Chapter 9. Applications to the Calculus of Scores

Finding the score operator Bθ,Λ is a more tedious task. Take any bounded, measurable function a : [0, τ] → R. Note we take a to be defined solely on the interval [0, τ] because we took Λ to be the restriction to this interval. The path defined by dΛt = (1+ta)dΛ defines a submodel passing through Λ at t = 0. By the boundedness of a, we are assured that (1 + ta)dΛ is positive for sufficiently small t. Thus there exists an open set (0, ε) for which this path is defined so we can speak of derivatives with respect to the parameter t. The corresponding path for the baseline hazard function is then dΛ dΛ λ (y) = t = (1 + ta) = (1 + ta)λ(y). t dy dy

The corresponding density function pY,∆,Z {y, δ, z; θ, λt(·)} then becomes δ "  Z y  # θT z θT z e λt(y) exp −e λt(s)ds {1 − FC|Z (y− |z)} 0 1−δ "  Z y  # θT z × exp −e λt(s)ds fC|Z (y|z) pZ (z). 0

The derivative of the log likelihood log pY,∆,Z {y, δ, z; θ, λt(·)} with respect to t equals Z y Z y θT z ∂ ∂ θT z ∂ −δe λt(s)ds + δ log{λt(y)} − (1 − δ)e λt(s)ds. ∂t 0 ∂t ∂t 0 Now assume we can interchange integration and differentiation so the derivative of the log likelihood with respect to t equals y y T Z δa(y) T Z δa(y) −eθ z a(s)λ(s)ds + = −eθ z a(s)dΛ(s) + . 0 1 + ta(y) 0 1 + ta(y) θT z R y Evaluating this derivative at t = 0 yields δa(y) − e 0 a(s)dΛ(s). From this it is clear that the score operator can be viewed as an operator Bθ,Λ : HΛ → L2(Pθ,Λ) where HΛ is the set of all bounded functions in L2(Λ). We want to apply the theory developed in the previous section. However, there is one subtlety left that is easily forgotten. In the theory developed in this chapter, we always assumed the score operators we work with are linear and continuous (i.e., bounded). We should check these assumptions. The proof of the boundedness is not insightful and very technical. It may as well be skipped by the less interested reader. It is just a matter of completeness we show how to obtain it. The boundedness of the covariates Z will play a crucial role. Lemma 9.4. The score operator defined by Z y θT z Bθ,Λ : HΛ → L2(Pθ,Λ): a(y) 7→ Bθ,Λa(x) = δa(y) − e a(s)dΛ(s) 0 is a linear and bounded operator.

Proof. Linearity is trivial to obtain. Take two arbitrary functions a1 and a2 from the set HΛ and take two scalars r1 and r2 from the set of real numbers. We see that Z y θT z Bθ,Λ(r1a1 + r2a2)(x) = δ{r1a1(y) + r2a2(y)} − e {r1a1(s) + r2a2(s)}dΛ(s) 0  Z y   Z y  θT z θT z = r1 δa1(y) − e a1(s)Λ(s) + r2 δa2(y) − e a2(s)Λ(s) 0 0 = r1Bθ,Λa1(x) + r2Bθ,Λa2(x). 9.4. Cox’s Proportional Hazards Model Under Right Censoring 199

The proof that Bθ,Λ is bounded is much more tricky. We want to show that there exists a constant

C (independent of the function a) such that kBθ,ΛakL2(Pθ,Λ) ≤ CkakL2(Λ). We consecutively deduce that Z y θT z |Bθ,Λa(x)| ≤ δ|a(y)| + e a(s)dΛ(s) 0 Z y ≤ δ|a(y)| + ekθkM |a(s)|dΛ(s) 0 Z τ ≤ δ|a(y)| + ekθkM |a(s)| · 1 dΛ(s) 0 Z τ 1/2 ≤ δ|a(y)| + ekθkM |a(s)|2dΛ(s) {Λ(τ)}1/2 0

= δ|a(y)| + C1kakL2(Λ),

kθkM 1/2 where the constant C1 is taken to be e {Λ(τ)} , independent of the function a and kθk = 1/2 Pk  i=1 θi . We used that kZk ≤ M and also the Cauchy-Schwarz inequality. Next, we use the result just obtained followed by the triangle inequality of norms,

kB ak ≤ kδa(y)k + C kak θ,Λ L2(Pθ,Λ) L2(Pθ,Λ) 1 L2(Λ) L2(Pθ,Λ)

= kδa(y)kL2(Pθ,Λ) + C1kakL2(Λ)k1kL2(Pθ,Λ)

= kδa(y)kL2(Pθ,Λ) + C1kakL2(Λ). (9.29) This is still not the desired result. Let us investigate the first part in (9.29). We have that

kδa(y)k2 = hδa(y), δa(y)i L2(Pθ,Λ) L2(Pθ,Λ) Z τ T  2 θT Z −eθ Z Λ(y) = EZ |a(y)| e e {1 − FC|Z (y − |Z)}dΛ(y) + 0. 0 The second part vanishes because δ = 0 for that part and the function a is multiplied by δ. R We use the notation EZ {m(Z)} to denote the integral m(z)pZ (z)dνZ (z) where νZ denotes the dominating measure with respect to Z. The term {1−FC|Z (y −|Z)} is always smaller than or equal to 1 because FC|Z (y −|Z) represents a probability, thus FC|Z (y − |Z) ∈ [0, 1] for any y and Z. We now have

τ Z T θT Z  kδa(y)k2 ≤ E |a(y)|2eθ Z e−e Λ(y)dΛ(y) . L2(Pθ,Λ) Z 0 Using the boundedness of Z yields

τ Z kθkM  kδa(y)k2 ≤ E |a(y)|2ekθkM ee Λ(y)dΛ(y) L2(Pθ,Λ) Z 0 τ Z kθkM = ekθkM |a(y)|2ee Λ(y)dΛ(y). 0 R y The function Λ(y) is defined to be the integral 0 λ(ξ)dξ. Therefore, it is continuous as a function of y. It follows that Λ(y) is bounded over the interval [0, τ], we write supy∈[0,τ] Λ(y) ≤ MΛ. Note that Λ(y) is always positive. Using this, we finally obtain that

kδa(y)k2 ≤ C2kak2 , L2(Pθ,Λ) 2 L2(Λ) 200 Chapter 9. Applications to the Calculus of Scores

1/2  kθkM ekθkM M  with C2 = e e Λ . Putting everything together, we obtain

kBθ,ΛakL2(Pθ,Λ) ≤ C2kakL2(Λ) + C1kakL2(Λ)

= CkakL2(Λ), with C = C1 +C2. This shows that the linear operator Bθ,Λ is bounded and therefore continuous. This concludes the proof.

We now use Theorem 2.6 to extend Bθ,Λ to L2(Λ) since the set of all bounded functions is dense in L2(Λ). We may conclude that the score operator Bθ,Λ is a linear continuous operator defined by Z y θT z Bθ,Λ : L2(Λ) → L2(Pθ,Λ): a(y) 7→ Bθ,Λa(x) = δa(y) − e a(s)dΛ(s). (9.30) 0 For the sake of completeness, we conclude with the total score operator for θ and Λ,

k Aθ,Λ : R × L2(Λ) → L2(Pθ,Λ):(r, a) 7→ Aθ,Λ(r, a) = αθ,Λ(r) + Bθ,Λa. (9.31)

∗ ∗ Adjoint Score Operators αθ,Λ and Bθ,Λ

Let us first consider the adjoint of αθ,Λ. Not much has to be said about it. As we deduced already, the adjoint of αθ,Λ is given by

. ∗ k αθ,Λ : L2(Pθ,Λ) → R : g 7→ Pθ,Λg `θ,Λ . (9.32)

. This seems like a very simple formula. However, the expected value Pθ,Λg `θ,Λ is not that easy. Indeed, for any g(y, δ, z) ∈ L2(Pθ,Λ), we have

. Z T  θT Z θT Z −eθ Z Λ(y) Pθ,Λg `θ,Λ = EZ g(y, 1,Z){Z − Ze Λ(y)}e e {1 − FC|Z (y− |Z)}dΛ(y)

Z T  θT Z −eθ Z Λ(y) + EZ g(y, 0,Z){−Ze Λ(y)}e dFC|Z (y|Z)

R where EZ {m(Z)} denotes the integral m(z)pZ (z)dνZ (z) where νZ denotes the dominating measure with respect to Z.

∗ We now consider the more interesting part, the search for the adjoint Bθ,Λ of the operator Bθ,Λ. We warn the reader who dislikes long and tedious calculations, because the calculations we will make, may give rise to long equations. Do not worry, in the end, after some clever tricks due to van der Vaart, the results will take surprisingly (relative) simple forms. ∗ We are looking for the operator Bθ,Λ : L2(Pθ,Λ) → L2(Λ) defined by

∗ hBθ,Λg, aiL2(Λ) = hg, Bθ,ΛaiL2(Pθ,Λ), for any g ∈ L2(Pθ,Λ) and b ∈ L2(Λ). This means we need to transform the inner product 9.4. Cox’s Proportional Hazards Model Under Right Censoring 201

∗ hg, Bθ,ΛaiL2(Pθ,Λ) to the inner product hBθ,Λg, aiL2(Λ). Let us start with Z

hg, Bθ,ΛaiL2(Pθ,Λ) = g(x)Bθ,Λa(x)dPθ,Λ(x)

Z τ  Z y  T  θT Z θT Z −eθ Z Λ(y) = EZ g(y, 1,Z) a(y) − e a(s)dΛ(s) e e {1 − FC|Z (y− |Z)}dΛ(y) 0 0 Z τ  Z y  T  θT Z −eθ Z Λ(y) + EZ g(y, 0,Z) −e a(s)dΛ(s) e dFC|Z (y|Z) 0 0 Z τ T  θT Z −eθ Z Λ(y) = EZ g(y, 1,Z)a(y)e e {1 − FC|Z (y− |Z)}dΛ(y) (9.33a) 0 Z τ Z y  T  2θT Z −eθ Z Λ(y) − EZ g(y, 1,Z) a(s)dΛ(s) e e {1 − FC|Z (y− |Z)}dΛ(y) (9.33b) 0 0 Z τ Z y  T  θT Z −eθ Z Λ(y) − EZ g(y, 0,Z) a(s)dΛ(s) e e dFC|Z (y|Z) . (9.33c) 0 0

Next, we use Fubini’s theorem4 to change the order of integration in the three terms above. We silently assume the conditions to apply Fubini’s theorem are satisfied. We do not go in to these measure theoretical details. Let us begin with the first part (9.33a). We have

Z τ T  θT Z −eθ Z Λ(y) EZ g(y, 1,Z)a(y)e e {1 − FC|Z (y− |Z)}dΛ(y) 0 Z Z τ T  θT z −eθ zΛ(y) = g(y, 1, z)a(y)e e {1 − FC|Z (y− |z)}dΛ(y) pZ (z)dνZ (z). 0 Using Fubini’s theorem to change the orther of integration, we obtain

Z τ Z T  θT z −eθ zΛ(y) a(y) g(y, 1, z)e e {1 − FC|Z (y− |z)}pZ (z)dνZ (z) dΛ(y) 0 Z τ  T  θT Z −eθ Z Λ(y) = a(y)EZ g(y, 1,Z)e e {1 − FC|Z (y− |Z)} dΛ(y) 0   T   θT Z −eθ Z Λ(y) = EZ g(y, 1,Z)e e {1 − FC|Z (y− |Z)} , a(y) . L2(Λ)

Next we consider the second part (9.33b),

Z τ Z y  T  2θT Z −eθ Z Λ(y) −EZ g(y, 1,Z) a(s)dΛ(s) e e {1 − FC|Z (y− |Z)}dΛ(y) 0 0 Z Z τ Z y  T  2θT z −eθ zΛ(y) = − g(y, 1, z) a(s)dΛ(s) e e {1 − FC|Z (y− |z)}dΛ(y) pZ (z)dνZ (z). 0 0

4 Suppose (X , A, µ1) and (Y, B, µ2) are two complete measure spaces. By complete we mean that if N1 ⊂ N2 and µi(N2) = 0, then N2 is also measurable and µi(N1) = 0 for i = 1, 2. Let f be a measurable function, A ∈ A and B ∈ B . If Z |f(x, y)|dµ1 × µ2(x, y) < +∞, A×B then Z Z  Z Z  Z f(x, y)dµ2(y) dµ1(x) = f(x, y)dµ1(x) dµ2(x) = f(x, y)dµ1 × µ2(x, y). A B B A A×B 202 Chapter 9. Applications to the Calculus of Scores

We manipulate the integral involving the function a, we decude Z y Z τ a(s)dΛ(s) = I(s ≤ y)a(s)dΛ(s). 0 0 Indeed, if s > y, this contributes 0 to the integral. Now, after applying Fubini’s theorem, we obtain

Z τ Z Z τ T  2θT z −eθ zΛ(y) − a(s) g(y, 1, z)I(s ≤ y)e e {1 − FC|Z (y− |z)}dΛ(y) pZ (z)dνZ (z)dΛ(s) 0 0 Z τ Z τ T  2θT Z −eθ Z Λ(y) = − a(s)EZ g(y, 1,Z)I(s ≤ y)e e {1 − FC|Z (y− |Z)}dΛ(y) dΛ(s). 0 0 As a last step, we interchange the names of the integration variables s and y to be conform with the first part. Thus, after all these manipulations, (9.33b) becomes

Z τ Z τ T  2θT Z −eθ Z Λ(s) − a(y)EZ g(s, 1,Z)I(y ≤ s)e e {1 − FC|Z (s− |Z)}dΛ(s) dΛ(y) 0 0  Z τ T   2θT Z −eθ Z Λ(s) = −EZ g(s, 1,Z)I(y ≤ s)e e {1 − FC|Z (s− |Z)}dΛ(s) , a(y) . 0 L2(Λ)

Finally, we deal with the third part (9.33c). We write

Z τ Z y  T  θT Z −eθ Z Λ(y) −EZ g(y, 0,Z) a(s)dΛ(s) e e dFC|Z (y|Z) 0 0 Z Z τ Z y  T  θT z −eθ zΛ(y) = − g(y, 0, z) a(s)dΛ(s) e e dFC|Z (y|z) pZ (z)dνZ (z). 0 0 Analogously as before, the integral involving the function a can be written as Z y Z τ a(s)dΛ(s) = I(s ≤ y)a(s)dΛ(s). 0 0 Again applying Fubini’s theorem yields

Z τ Z Z τ T   θT z −eθ zΛ(y) − a(s) g(y, 0, z)I(s ≤ y)e e dFC|Z (y|z) pZ (z)dνZ (z) dΛ(s) 0 0 Z τ Z τ T  θT Z −eθ Z Λ(y) = − a(s)EZ g(y, 0,Z)I(s ≤ y)e e dFC|Z (y|Z) dΛ(s). 0 0 As with (9.33b), we now also interchange the names of the integration variables s and y. Thus, after all these manipulations, (9.33c) becomes

\begin{align*}
&-\int_0^\tau a(y)\, E_Z\left[ \int_0^\tau g(s,0,Z)\, I(y \le s)\, e^{\theta^T Z} e^{-e^{\theta^T Z}\Lambda(s)}\, dF_{C|Z}(s|Z) \right] d\Lambda(y) \\
&\qquad = \left\langle -E_Z\left[ \int_0^\tau g(s,0,Z)\, I(y \le s)\, e^{\theta^T Z} e^{-e^{\theta^T Z}\Lambda(s)}\, dF_{C|Z}(s|Z) \right],\; a(y) \right\rangle_{L_2(\Lambda)}.
\end{align*}
Collecting the three parts, we obtain the adjoint operator

\[ B_{\theta,\Lambda}^*: L_2(P_{\theta,\Lambda}) \to L_2(\Lambda): g \mapsto B_{\theta,\Lambda}^* g, \tag{9.34} \]

with $B_{\theta,\Lambda}^* g$ given by
\begin{align*}
B_{\theta,\Lambda}^* g(y) &= E_Z\left[ g(y,1,Z)\, e^{\theta^T Z} e^{-e^{\theta^T Z}\Lambda(y)} \{1 - F_{C|Z}(y-|Z)\} \right] \\
&\quad - E_Z\left[ \int_0^\tau g(s,1,Z)\, I(y \le s)\, e^{2\theta^T Z} e^{-e^{\theta^T Z}\Lambda(s)} \{1 - F_{C|Z}(s-|Z)\}\, d\Lambda(s) \right] \\
&\quad - E_Z\left[ \int_0^\tau g(s,0,Z)\, I(y \le s)\, e^{\theta^T Z} e^{-e^{\theta^T Z}\Lambda(s)}\, dF_{C|Z}(s|Z) \right].
\end{align*}

Information Operator $B_{\theta,\Lambda}^* B_{\theta,\Lambda}$ and its Inverse $(B_{\theta,\Lambda}^* B_{\theta,\Lambda})^{-1}$

From a theoretical point of view, we have enough information to calculate the efficient score function $\tilde\ell_{\theta,\Lambda}$. Indeed, from (9.19), we know that

\[ \tilde\ell_{\theta,\Lambda} = \{I - B_{\theta,\Lambda}(B_{\theta,\Lambda}^* B_{\theta,\Lambda})^{-1} B_{\theta,\Lambda}^*\}\, \dot\ell_{\theta,\Lambda} \tag{9.35} \]

and we have knowledge of the operators $B_{\theta,\Lambda}$ and $B_{\theta,\Lambda}^*$ and the score function $\dot\ell_{\theta,\Lambda}$. However, we promised that after the tedious calculations to obtain $B_{\theta,\Lambda}^*$, we would arrive at a simple formula. We see from the above this is certainly not the case. Our ultimate goal is to obtain the efficient influence function, for which we need the efficient score function $\tilde\ell_{\theta,\Lambda}$. Looking at (9.35), we see we only need to know how the adjoint $B_{\theta,\Lambda}^*$ acts on the score function $\dot\ell_{\theta,\Lambda}$, i.e., $B_{\theta,\Lambda}^*\dot\ell_{\theta,\Lambda}$. Once we have this, we need to know how the information operator $B_{\theta,\Lambda}^* B_{\theta,\Lambda}$ acts on it. Because we do not have a priori knowledge about the form of $B_{\theta,\Lambda}^*\dot\ell_{\theta,\Lambda}$, we need a formula for $B_{\theta,\Lambda}^* B_{\theta,\Lambda} a$ for arbitrary $a \in L_2(\Lambda)$. One possibility is to continue our mechanical work by combining the formulas obtained so far. But the present form of the adjoint $B_{\theta,\Lambda}^*$ is not very promising; long and horrible calculations would await us. For that reason we shall not pursue this approach: it is too tedious and not insightful. Instead, we will use a clever trick due to van der Vaart, based on a statistical principle: minus the mean of the Hessian matrix of the log likelihood equals the covariance matrix of the partial derivatives of the log likelihood. The following lemma gives a more precise formulation.

Lemma 9.5. Given probability densities $p_{s,t}: \mathcal{X} \to \mathbb{R}: x \mapsto p_{s,t}(x)$ that depend smoothly on the parameters $(s,t) \in R \subset \mathbb{R}^2$, an open set, with respect to a dominating measure $\mu$, we have
\[ E_{s,t}\left\{ \frac{\partial}{\partial s}\log p_{s,t}(X)\, \frac{\partial}{\partial t}\log p_{s,t}(X) \right\} = -E_{s,t}\left\{ \frac{\partial^2}{\partial s\,\partial t}\log p_{s,t}(X) \right\}, \]
assuming the densities are sufficiently smooth to interchange differentiation and integration.

Proof. Because for any $(s,t) \in R$, $p_{s,t}(x)$ defines a proper density, we know that
\[ \int_{\mathcal{X}} p_{s,t}(x)\, d\mu(x) = 1. \]
Differentiating with respect to $s$ yields

\[ \frac{\partial}{\partial s}\int_{\mathcal{X}} p_{s,t}(x)\, d\mu(x) = 0. \]

Upon interchanging integration and differentiation, we obtain

\[ \int_{\mathcal{X}} \frac{\partial}{\partial s} p_{s,t}(x)\, d\mu(x) = \int_{\mathcal{X}} \frac{\partial}{\partial s}\log p_{s,t}(x)\, p_{s,t}(x)\, d\mu(x) = 0. \]
Next, we take the derivative with respect to $t$. After interchanging integration and differentiation, we get

\[ \int_{\mathcal{X}} \frac{\partial^2}{\partial s\,\partial t}\log p_{s,t}(x)\, p_{s,t}(x)\, d\mu(x) + \int_{\mathcal{X}} \frac{\partial}{\partial s}\log p_{s,t}(x)\, \frac{\partial}{\partial t} p_{s,t}(x)\, d\mu(x) = 0. \]
Using the same trick as before, we see that

\[ \int_{\mathcal{X}} \frac{\partial^2}{\partial s\,\partial t}\log p_{s,t}(x)\, p_{s,t}(x)\, d\mu(x) + \int_{\mathcal{X}} \frac{\partial}{\partial s}\log p_{s,t}(x)\, \frac{\partial}{\partial t}\log p_{s,t}(x)\, p_{s,t}(x)\, d\mu(x) = 0. \]
Writing this in terms of expectations, we finally obtain that
\[ E_{s,t}\left\{ \frac{\partial^2}{\partial s\,\partial t}\log p_{s,t}(X) \right\} + E_{s,t}\left\{ \frac{\partial}{\partial s}\log p_{s,t}(X)\, \frac{\partial}{\partial t}\log p_{s,t}(X) \right\} = 0. \]

Putting the first term to the other side of the equation concludes the proof.
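Lemma 9.5 is also easy to check symbolically in a toy example. The following sketch (Python with sympy; the $N(s+t,1)$ family and all names are illustrative choices of mine, not taken from the text) verifies the identity, with both sides evaluating to 1.

```python
import sympy as sp

x, s, t = sp.symbols('x s t', real=True)

# log-density of the N(s + t, 1) family, a simple smooth two-parameter model
logp = -(x - s - t)**2 / 2 - sp.log(sp.sqrt(2 * sp.pi))
p = sp.exp(logp)

# left-hand side of Lemma 9.5: E[ (d/ds log p)(d/dt log p) ]
lhs = sp.integrate(sp.diff(logp, s) * sp.diff(logp, t) * p, (x, -sp.oo, sp.oo))
# right-hand side: -E[ d^2/(ds dt) log p ]
rhs = -sp.integrate(sp.diff(logp, s, t) * p, (x, -sp.oo, sp.oo))

print(sp.simplify(lhs), sp.simplify(rhs))  # both equal 1
```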

Using this lemma, we will be able to derive quite simple formulas for the information operator $B_{\theta,\Lambda}^* B_{\theta,\Lambda}$ and for the action of the adjoint on the score function, $B_{\theta,\Lambda}^*\dot\ell_{\theta,\Lambda}$. Let us start with the information operator $B_{\theta,\Lambda}^* B_{\theta,\Lambda}$. Take two arbitrary functions $a$ and $b$ from $\mathcal{H}_\Lambda$. By definition of the adjoint, we see that

\[ \langle B_{\theta,\Lambda}^* B_{\theta,\Lambda} a, b \rangle_{L_2(\Lambda)} = \langle B_{\theta,\Lambda} a, B_{\theta,\Lambda} b \rangle_{L_2(P_{\theta,\Lambda})}. \tag{9.36} \]

We shall use the paths $d\Lambda_{s,t} = (1 + sa + tb + stab)\,d\Lambda$ at $(s,t) = (0,0)$. If $s$ and $t$ are sufficiently small, this is always positive, since $a$ and $b$ are bounded. Thus there exists a neighbourhood $(-\varepsilon_s, \varepsilon_s) \times (-\varepsilon_t, \varepsilon_t)$ of $(0,0)$ on which this path is defined, so we can take derivatives with respect to $s$ and $t$. These paths imply differentiable submodels $(s,t) \mapsto P_{\theta,\Lambda_{s,t}}$ at $(s,t) = (0,0)$. Note that $d\Lambda_{s,t} = (1 + sa)(1 + tb)\,d\Lambda = (1 + sa)\,d\Lambda_{0,t} = (1 + tb)\,d\Lambda_{s,0}$.

Remark 9.4. We assumed the functions a and b belong to the set HΛ which is dense in L2(Λ). When we obtain the action of the information operator on these bounded functions, we use Theorem 2.6 to extend this operator to L2(Λ). Keep this in mind, we will not note it again. So do not worry if in the end we write the information operator to be defined on the whole space L2(Λ).

The corresponding path for the baseline hazard function is then

\[ \lambda_{s,t}(y) = \frac{d\Lambda_{s,t}}{dy}(y) = \{1 + sa(y) + tb(y) + st\,a(y)b(y)\}\,\frac{d\Lambda}{dy}(y) = \{1 + sa(y) + tb(y) + st\,a(y)b(y)\}\,\lambda(y). \]

We will manipulate the second inner product in (9.36), which equals $E_{\theta,\Lambda}\{(B_{\theta,\Lambda}a)(B_{\theta,\Lambda}b)\}$. It is easy to see that $B_{\theta,\Lambda}a$ is the ordinary score function with respect to $s$ and $B_{\theta,\Lambda}b$ is the ordinary score function with respect to $t$ of the submodel $(s,t) \mapsto P_{\theta,\Lambda_{s,t}}$ at $(s,t) = (0,0)$. We prove this for the function $a$; the calculation for $b$ is completely analogous.

The log likelihood for this submodel is given by
\begin{align*}
\log p_{\theta,\Lambda_{s,t}}(x) &= \delta\left[ \theta^T z - e^{\theta^T z}\int_0^y \lambda_{s,t}(\xi)\, d\xi + \log\{\lambda_{s,t}(y)\} + \log\{1 - F_{C|Z}(y-|z)\} \right] \\
&\quad + (1-\delta)\left[ -e^{\theta^T z}\int_0^y \lambda_{s,t}(\xi)\, d\xi + \log\{f_{C|Z}(y|z)\} \right] + \log\{p_Z(z)\}.
\end{align*}
Taking the derivative with respect to $s$ and evaluating at $(s,t) = (0,0)$ yields the score for $s$ at $(s,t) = (0,0)$. We obtain that

\[ \frac{\partial}{\partial s}\log p_{\theta,\Lambda_{s,t}}(x) = \delta\, \frac{\partial}{\partial s}\log\{\lambda_{s,t}(y)\} - e^{\theta^T z}\, \frac{\partial}{\partial s}\int_0^y \lambda_{s,t}(\xi)\, d\xi. \]
Interchanging differentiation and integration gives
\[ \frac{\delta\, a(y)\{1 + tb(y)\}}{\{1 + sa(y)\}\{1 + tb(y)\}} - e^{\theta^T z}\int_0^y a(\xi)\{1 + tb(\xi)\}\,\lambda(\xi)\, d\xi = \frac{\delta\, a(y)}{1 + sa(y)} - e^{\theta^T z}\int_0^y a(\xi)\, d\Lambda_{0,t}(\xi). \]
Evaluating at $s = 0$ shows us that

\[ \left.\frac{\partial}{\partial s}\log p_{\theta,\Lambda_{s,t}}(x)\right|_{s=0} = \delta\, a(y) - e^{\theta^T z}\int_0^y a(\xi)\, d\Lambda_{0,t}(\xi) = B_{\theta,\Lambda_{0,t}} a(x). \tag{9.37} \]
This intermediate result will be useful later on. In addition, evaluating (9.37) at $t = 0$ gives

\[ \left.\frac{\partial}{\partial s}\log p_{\theta,\Lambda_{s,t}}(x)\right|_{(0,0)} = B_{\theta,\Lambda} a(x). \tag{9.38} \]

As announced, analogously we can show that

\[ \left.\frac{\partial}{\partial t}\log p_{\theta,\Lambda_{s,t}}(x)\right|_{(0,0)} = B_{\theta,\Lambda} b(x). \tag{9.39} \]

Let us begin with the more interesting calculations. From (9.38) and (9.39), and then by Lemma 9.5, it follows that

\begin{align*}
E_{\theta,\Lambda}\{(B_{\theta,\Lambda}a)(B_{\theta,\Lambda}b)\} &= E_{\theta,\Lambda}\left\{ \left.\frac{\partial}{\partial s}\log p_{\theta,\Lambda_{s,t}}(X)\right|_{(0,0)} \times \left.\frac{\partial}{\partial t}\log p_{\theta,\Lambda_{s,t}}(X)\right|_{(0,0)} \right\} \\
&= -E_{\theta,\Lambda}\left\{ \left.\frac{\partial^2}{\partial s\,\partial t}\log p_{\theta,\Lambda_{s,t}}(X)\right|_{(0,0)} \right\}.
\end{align*}

Rewriting the partial derivative and then using (9.37) yields
\[ -E_{\theta,\Lambda}\left\{ \left.\frac{\partial}{\partial t}\left( \left.\frac{\partial}{\partial s}\log p_{\theta,\Lambda_{s,t}}(X)\right|_{s=0} \right)\right|_{t=0} \right\} = -E_{\theta,\Lambda}\left\{ \left.\frac{\partial}{\partial t} B_{\theta,\Lambda_{0,t}} a(X)\right|_{t=0} \right\}. \]
This derivative is now easy to calculate from (9.37). We already did this kind of calculation, so it is immediately clear that
\[ \frac{\partial}{\partial t}\left\{ \delta\, a(y) - e^{\theta^T z}\int_0^y a(\xi)\, d\Lambda_{0,t}(\xi) \right\} = -e^{\theta^T z}\int_0^y a(\xi)\, b(\xi)\, d\Lambda(\xi). \]

Substituting this in the last but one equation gives
\[ E_{\theta,\Lambda}\left\{ e^{\theta^T Z}\int_0^Y a(\xi)\, b(\xi)\, d\Lambda(\xi) \right\} = E_{\theta,\Lambda}\left\{ e^{\theta^T Z}\int_0^\tau I(\xi \le Y)\, a(\xi)\, b(\xi)\, d\Lambda(\xi) \right\}. \]

Finally using Fubini’s theorem yields the desired result,

\[ \int_0^\tau b(\xi)\, \left[ E_{\theta,\Lambda}\{e^{\theta^T Z} I(\xi \le Y)\}\, a(\xi) \right] d\Lambda(\xi) = \left\langle E_{\theta,\Lambda}\{e^{\theta^T Z} I(\xi \le Y)\}\, a(\xi),\; b(\xi) \right\rangle_{L_2(\Lambda)}. \]

From (9.36) we see that

\[ \langle B_{\theta,\Lambda}^* B_{\theta,\Lambda} a(y),\, b(y) \rangle_{L_2(\Lambda)} = \langle E_{\theta,\Lambda}\{e^{\theta^T Z} I(y \le Y)\}\, a(y),\; b(y) \rangle_{L_2(\Lambda)}. \]

We thus may conclude with the following form for the information operator,

\[ B_{\theta,\Lambda}^* B_{\theta,\Lambda}: L_2(\Lambda) \to L_2(\Lambda): a(y) \mapsto B_{\theta,\Lambda}^* B_{\theta,\Lambda} a(y) = E_{\theta,\Lambda}\{e^{\theta^T Z} I(y \le Y)\}\, a(y). \tag{9.40} \]

Looking at (9.40), we finally see the simple formula as promised. A wonderful thing has happened: after all the tough calculations, the information operator $B_{\theta,\Lambda}^* B_{\theta,\Lambda}$ is a multiplication operator. To know the action of $B_{\theta,\Lambda}^* B_{\theta,\Lambda}$ on a function $a(y)$, just multiply this function by $E_{\theta,\Lambda}\{e^{\theta^T Z} I(y \le Y)\}$; note this multiplier is itself a function of $y$, through the indicator. However, we must admit it is no surprise that $B_{\theta,\Lambda}^* B_{\theta,\Lambda}$ can be written as a multiplication operator. Indeed, a theorem in Hilbert space theory states that every self-adjoint operator on a Hilbert space can be written as a multiplication operator relative to an appropriate coordinate system; multiplication operators are somewhat the analogue of diagonal matrices among operators between finite-dimensional spaces. From (9.40) we see the information operator already takes the form of a multiplication operator. Thus, the wonderful thing is that the information operator is a multiplication operator relative to the original coordinate system. Because our ultimate goal is to calculate the efficient score function (9.35), we need the inverse of the information operator. Luckily, it is easy to invert a multiplication operator. In the present situation, due to the assumptions we have made, the multiplier function $y \mapsto E_{\theta,\Lambda}\{e^{\theta^T Z} I(y \le Y)\}$ is bounded away from zero on $[0, \tau]$. Indeed, take any $y \in [0, \tau]$. By the boundedness of $Z$, $e^{\theta^T Z}$ is certainly different from zero. A problem may only arise from the indicator function $I(y \le Y)$. If $\Delta = 1$, we have $Y = T$; in this case, by the assumption that $P(T > \tau) > 0$, we see that $P(T > y) > 0$ because $y \le \tau$, so with probability different from zero $I(y \le T)$ is different from zero. Now suppose $\Delta = 0$, which means that $Y = C$; in this case, by the assumption that $P(C = \tau) > 0$, we see that $P(C \ge y) > 0$ because $y \le \tau$, so with probability different from zero $I(y \le C)$ is different from zero. We conclude that $E_{\theta,\Lambda}\{e^{\theta^T Z} I(y \le Y)\}$ is bounded away from zero on $[0, \tau]$. By this we are assured the inverse of the information operator exists as a linear, continuous operator, because $E_{\theta,\Lambda}\{e^{\theta^T Z} I(y \le Y)\} \neq 0$ for any $y \in [0, \tau]$. It is given by

\[ (B_{\theta,\Lambda}^* B_{\theta,\Lambda})^{-1}: L_2(\Lambda) \to L_2(\Lambda): a(y) \mapsto (B_{\theta,\Lambda}^* B_{\theta,\Lambda})^{-1} a(y) \tag{9.41} \]
with
\[ (B_{\theta,\Lambda}^* B_{\theta,\Lambda})^{-1} a(y) = \left[ E_{\theta,\Lambda}\{e^{\theta^T Z} I(y \le Y)\} \right]^{-1} a(y). \tag{9.42} \]
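Since (9.40) and (9.42) say that the information operator and its inverse act by multiplication and division with the function $y \mapsto E_{\theta,\Lambda}\{e^{\theta^T Z} I(y \le Y)\}$, they are trivial to implement once that multiplier is estimated. The sketch below is a minimal numerical illustration (Python); the simulated Cox model with unit baseline hazard, the uniform censoring times and all names are illustrative assumptions of mine, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 5000, 0.7

# illustrative data: Cox model with baseline hazard 1, uniform censoring
Z = rng.uniform(-1, 1, n)
T = rng.exponential(1.0 / np.exp(theta * Z))       # hazard e^{theta Z}
C = rng.uniform(0, 2, n)                           # censoring times
Y = np.minimum(T, C)                               # observed follow-up time

def multiplier(y):
    """Empirical estimate of E{exp(theta*Z) I(y <= Y)}."""
    return np.mean(np.exp(theta * Z) * (Y >= y))

def info_operator_inverse(a, y):
    """(B*B)^{-1} a at y: divide by the multiplier function, cf. (9.42)."""
    return a(y) / multiplier(y)

a = lambda y: np.sin(y)                            # an arbitrary test function
for y in (0.1, 0.5, 1.0):
    print(y, info_operator_inverse(a, y))
```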

The Action of the Adjoint Score Operator $B_{\theta,\Lambda}^*$ on the Score Function $\dot\ell_{\theta,\Lambda}$

We will obtain a simple formula for the function $B_{\theta,\Lambda}^*\dot\ell_{\theta,\Lambda} \in L_2(\Lambda)$ using a similar argument as for the information operator $B_{\theta,\Lambda}^* B_{\theta,\Lambda}$; therefore, the calculations are done in less detail. We shall exploit the differentiable paths $(s,t) \mapsto P_{\theta + sr, \Lambda_t}$, $r \in \mathbb{R}^k$, with $d\Lambda_t = (1 + tb)\,d\Lambda$ where $b \in \mathcal{H}_\Lambda$, at $(s,t) = (0,0)$. As we argued before, the boundedness of $b$ implies the path $\Lambda_t$ is well defined and derivatives with respect to $t$ exist. The corresponding path for the baseline hazard function is $\lambda_t(y) = \{1 + tb(y)\}\lambda(y)$. The log likelihood becomes
\begin{align*}
\log p_{\theta + sr, \Lambda_t}(x) &= \delta\left[ (\theta + sr)^T z - e^{(\theta + sr)^T z}\int_0^y \lambda_t(\xi)\, d\xi + \log\{\lambda_t(y)\} + \log\{1 - F_{C|Z}(y-|z)\} \right] \\
&\quad + (1-\delta)\left[ -e^{(\theta + sr)^T z}\int_0^y \lambda_t(\xi)\, d\xi + \log\{f_{C|Z}(y|z)\} \right] + \log\{p_Z(z)\}.
\end{align*}
Taking the derivative with respect to $s$ and evaluating at $(s,t) = (0,0)$ yields the score function $r^T\dot\ell_{\theta,\Lambda}$. Indeed,
\[ \frac{\partial}{\partial s}\log p_{\theta + sr, \Lambda_t}(x) = \delta\, r^T z - r^T z\, e^{(\theta + sr)^T z}\int_0^y \lambda_t(\xi)\, d\xi. \]
Evaluating the latter equation at $(s,t) = (0,0)$ shows that we obtain the score function $r^T\{\delta z - z\, e^{\theta^T z}\Lambda(y)\} = r^T\dot\ell_{\theta,\Lambda}(x)$. Next, taking the derivative of the log likelihood with respect to $t$ and evaluating at $(s,t) = (0,0)$ yields the score function $B_{\theta,\Lambda}b$. Indeed, interchanging integration and differentiation gives
\[ \frac{\partial}{\partial t}\log p_{\theta + sr, \Lambda_t}(x) = \frac{\delta\, b(y)}{1 + tb(y)} - e^{(\theta + sr)^T z}\int_0^y b(\xi)\, d\Lambda(\xi). \]
Evaluating this at $(s,t) = (0,0)$ gives
\[ \delta\, b(y) - e^{\theta^T z}\int_0^y b(\xi)\, d\Lambda(\xi) = B_{\theta,\Lambda} b(x). \]
Using these results and Lemma 9.5 gives
\begin{align*}
r^T\langle B_{\theta,\Lambda}^*\dot\ell_{\theta,\Lambda}, b\rangle_{L_2(\Lambda)} &= \langle B_{\theta,\Lambda}^*\, r^T\dot\ell_{\theta,\Lambda}, b\rangle_{L_2(\Lambda)} \\
&= \langle r^T\dot\ell_{\theta,\Lambda}, B_{\theta,\Lambda} b\rangle_{L_2(P_{\theta,\Lambda})} \\
&= E_{\theta,\Lambda}\left\{ \left.\frac{\partial}{\partial s}\log p_{\theta + sr, \Lambda_t}(X)\right|_{(0,0)} \times \left.\frac{\partial}{\partial t}\log p_{\theta + sr, \Lambda_t}(X)\right|_{(0,0)} \right\} \\
&= -E_{\theta,\Lambda}\left\{ \left.\frac{\partial}{\partial t}\left( \left.\frac{\partial}{\partial s}\log p_{\theta + sr, \Lambda_t}(X)\right|_{s=0} \right)\right|_{t=0} \right\} \\
&= -r^T E_{\theta,\Lambda}\left\{ \left.\frac{\partial}{\partial t}\left( \delta Z - Z e^{\theta^T Z}\int_0^Y \lambda_t(\xi)\, d\xi \right)\right|_{t=0} \right\} \\
&= r^T E_{\theta,\Lambda}\left\{ Z e^{\theta^T Z}\int_0^Y b(\xi)\, d\Lambda(\xi) \right\}.
\end{align*}
Using the same tricks as before (rewriting the integral and using Fubini's theorem), we conclude that
\[ \langle B_{\theta,\Lambda}^*\dot\ell_{\theta,\Lambda}, b\rangle_{L_2(\Lambda)} = \int_0^\tau b(\xi)\, E_{\theta,\Lambda}\{Z e^{\theta^T Z} I(\xi \le Y)\}\, d\Lambda(\xi). \]

After all these calculations, we obtain the desired result,

\[ B_{\theta,\Lambda}^*\dot\ell_{\theta,\Lambda}(y) = E_{\theta,\Lambda}\{Z e^{\theta^T Z} I(y \le Y)\}. \tag{9.43} \]

As promised, this is also a simple formula.

Efficient Score Function $\tilde\ell_{\theta,\Lambda}$

We are ready to derive one of the two most important results of this section, the efficient score function. Starting from (9.35) and successively applying the results derived so far, one obtains
\begin{align*}
\tilde\ell_{\theta,\Lambda}(x) &= \left[ \{I - B_{\theta,\Lambda}(B_{\theta,\Lambda}^* B_{\theta,\Lambda})^{-1} B_{\theta,\Lambda}^*\}\, \dot\ell_{\theta,\Lambda} \right](x) \\
&= \dot\ell_{\theta,\Lambda}(x) - B_{\theta,\Lambda}(B_{\theta,\Lambda}^* B_{\theta,\Lambda})^{-1}\left[ E_{\theta,\Lambda}\{Z e^{\theta^T Z} I(y \le Y)\} \right] \\
&= \delta z - z e^{\theta^T z}\Lambda(y) - B_{\theta,\Lambda}\left[ \frac{E_{\theta,\Lambda}\{Z e^{\theta^T Z} I(y \le Y)\}}{E_{\theta,\Lambda}\{e^{\theta^T Z} I(y \le Y)\}} \right].
\end{align*}

Now introduce two special functions:
\begin{align*}
L_{0,\theta,\Lambda}(y) &= E_{\theta,\Lambda}\{e^{\theta^T Z} I(y \le Y)\}, \\
L_{1,\theta,\Lambda}(y) &= E_{\theta,\Lambda}\{Z e^{\theta^T Z} I(y \le Y)\}.
\end{align*}

It is not difficult to see the efficient score function for $\theta$ is then given by
\[ \tilde\ell_{\theta,\Lambda}(x) = \delta\left\{ z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(y) \right\} - e^{\theta^T z}\int_0^y \left\{ z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\} d\Lambda(t). \tag{9.44} \]
This is a relatively simple and elegant result.
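As an illustration of (9.44), the following sketch (Python; the data-generating mechanism, the grid approximation of the integral and all names are illustrative assumptions of mine) estimates $L_{0,\theta,\Lambda}$ and $L_{1,\theta,\Lambda}$ by empirical averages and evaluates the efficient score for every observation; at the true parameter its sample mean should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta = 2000, 0.5

# illustrative data: baseline hazard lambda(y) = 1, so Lambda(y) = y
Z = rng.uniform(-1, 1, n)
T = rng.exponential(1.0 / np.exp(theta * Z))
C = rng.uniform(0, 2, n)
Y, delta = np.minimum(T, C), (T <= C).astype(float)

grid = np.linspace(0, 2, 201)
dLam = np.diff(grid, prepend=0.0)                  # dLambda(t) = dt here

# empirical L0(t) = E{e^{theta Z} I(t <= Y)} and L1(t) = E{Z e^{theta Z} I(t <= Y)}
at_risk = (Y[:, None] >= grid[None, :])
L0 = np.mean(np.exp(theta * Z)[:, None] * at_risk, axis=0)
L1 = np.mean((Z * np.exp(theta * Z))[:, None] * at_risk, axis=0)
ratio = L1 / np.maximum(L0, 1e-12)

def efficient_score(y, d, z):
    """Efficient score (9.44) for one observation; integral as a Riemann sum."""
    idx = np.searchsorted(grid, y)
    r_y = np.interp(y, grid, ratio)
    integral = np.sum((z - ratio[:idx]) * dLam[:idx])
    return d * (z - r_y) - np.exp(theta * z) * integral

scores = np.array([efficient_score(Y[i], delta[i], Z[i]) for i in range(n)])
print(scores.mean())                               # should be close to zero
```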

Efficient Information Matrix $\tilde I_{\theta,\Lambda}$

We now try to derive the other important result of this section, the efficient information matrix. By definition, the efficient information matrix equals the variance of the efficient score function. The result is given by
\[ \tilde I_{\theta,\Lambda} = E\left[ e^{\theta^T Z}\int_0^\tau \left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(y) \right\}\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(y) \right\}^T G_{\theta,\Lambda}(y|Z)\, d\Lambda(y) \right], \tag{9.45} \]
where $G_{\theta,\Lambda}(y|Z) = P(Y \ge y|Z)$. Since we assumed that $Z$ is not almost surely equal to a function of $Y$, we know the efficient information matrix $\tilde I_{\theta,\Lambda}$ is invertible. We could obtain (9.45) by a direct calculation but, as stated in van der Vaart (2002), [40], this would be long and difficult, so we will not pursue this approach. Instead, we will use martingale theory. Some intermediate results are given in [40]; below, we fill in the details. Those not familiar with the basic concepts of martingale theory and counting processes can skip the following derivation and simply accept the result (9.45). Those really willing to understand the calculations are referred to Andersen et al. (1993), [2], Chapter II for a mathematical approach to martingale theory or, for a more intuitive approach, to Aalen et al. (2008), [1], Chapters 1 and 2. In what follows, we will also need a vector-valued stochastic integral; the basic results remain the same as in the one-dimensional case. This subject is also contained in Chapter II of [2], and in [1] vector-valued stochastic integrals are discussed in Appendix B. Let us define the counting process $\{N(t): 0 \le t \le \tau\}$ where

N(t) = I(T ≤ t).

This is an example of a one-jump process, since the counting process $N(t)$ makes at most one jump, of size one, in the interval $[0, \tau]$. It indicates whether the individual has died or not, jumping from zero to one when death occurs. Next, we define the left-continuous adapted process $\{Y(t): 0 \le t \le \tau\}$ where $Y(t) = I(T \ge t)$. From the general theory of counting processes, we know that $N(t)$ has the compensator $\{C(t): 0 \le t \le \tau\}$,

\[ C(t) = \int_0^t Y(s)\, e^{\theta^T Z}\, d\Lambda(s) = \int_0^t I(T \ge s)\, e^{\theta^T Z}\, d\Lambda(s). \]
Subtracting both stochastic processes gives us the stochastic process $\{M(t): 0 \le t \le \tau\}$,

\[ M(t) = N(t) - C(t) = I(T \le t) - \int_0^t I(T \ge s)\, e^{\theta^T Z}\, d\Lambda(s). \]
The process $M(t)$ is a martingale relative to the filtration $\{\mathcal{F}(t): 0 \le t \le \tau\}$ generated by the random variables $(Z, C)$ and the counting process $N(t)$, i.e.,

E{M(t)|F(s)} = M(s), for all s ∈ [0, t].

The corresponding predictable quadratic variation process is given by
\[ \langle M\rangle(t) = C(t) = \int_0^t I(T \ge s)\, e^{\theta^T Z}\, d\Lambda(s). \]
Note this result makes use of the assumption that $\Lambda$ is continuous. We will now show that the efficient score $\tilde\ell_{\theta,\Lambda}(X)$ given by (9.44) can be written as a stochastic integral with respect to the martingale $M(t)$. Writing
\[ \mathcal{I}(t) = \int_0^t I(u \le C)\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(u) \right\} dM(u), \]
we claim that
\[ \tilde\ell_{\theta,\Lambda}(X) = \mathcal{I}(\tau) = \int_0^\tau I(t \le C)\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\} dM(t). \tag{9.46} \]
Indeed, we have that $dM(t) = dN(t) - Y(t)\, e^{\theta^T Z}\, d\Lambda(t)$. Hence,
\begin{align*}
\tilde\ell_{\theta,\Lambda}(X) &= \int_0^\tau I(t \le C)\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\} dN(t) \\
&\quad - e^{\theta^T Z}\int_0^\tau I(t \le C)\, I(t \le T)\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\} d\Lambda(t).
\end{align*}
The first integral has to be understood as the sum of the values of $I(t \le C)\{Z - (L_{1,\theta,\Lambda}/L_{0,\theta,\Lambda})(t)\}$ at every jump time of the counting process. Hence,
\[ \int_0^\tau I(t \le C)\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\} dN(t) = \delta\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(Y) \right\}. \]

For the second integral, we note that $I(t \le C)\, I(t \le T) = I(t \le C \wedge T) = I(t \le Y)$ and hence the integral can be rewritten as
\[ -e^{\theta^T Z}\int_0^Y \left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\} d\Lambda(t). \]
Putting everything together, we see that we obtain (9.44). Using the representation (9.46) for the efficient score $\tilde\ell_{\theta,\Lambda}(X)$, we can apply one of the basic properties of stochastic integration. Since the integrand $I(t \le C)\{Z - (L_{1,\theta,\Lambda}/L_{0,\theta,\Lambda})(t)\}$ is predictable, because it is known just before time $t$, we know that $\mathcal{I}(t)$ is also a martingale with respect to the filtration $\mathcal{F}(t)$. Hence, we can define the corresponding predictable quadratic variation process $\langle\mathcal{I}\rangle(t)$, which is the compensator of the stochastic process $\mathcal{I}(t)\mathcal{I}^T(t)$, i.e., $\mathcal{I}(t)\mathcal{I}^T(t) - \langle\mathcal{I}\rangle(t)$ is a martingale, zero at time zero. This means that $E\{\mathcal{I}(\tau)\mathcal{I}^T(\tau) - \langle\mathcal{I}\rangle(\tau)\} = 0$, and since $\tilde\ell_{\theta,\Lambda}(X) = \mathcal{I}(\tau)$, we obtain that $E\{\tilde\ell_{\theta,\Lambda}(X)\tilde\ell_{\theta,\Lambda}^T(X)\} = E\{\langle\mathcal{I}\rangle(\tau)\}$. We deduce

\[ \tilde I_{\theta,\Lambda} = E\{\tilde\ell_{\theta,\Lambda}(X)\tilde\ell_{\theta,\Lambda}^T(X)\} = E\{\langle\mathcal{I}\rangle(\tau)\} = E\left[ \int_0^\tau I(t \le C)\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\}\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\}^T d\langle M\rangle(t) \right], \]
where we used another fundamental property of stochastic integration, see for example equation (2.29) in [1]. Using the fact that $d\langle M\rangle(t) = I(t \le T)\, e^{\theta^T Z}\, d\Lambda(t)$ and the law of iterated expectations, we find
\begin{align*}
\tilde I_{\theta,\Lambda} &= E\left[ e^{\theta^T Z}\int_0^\tau I(t \le C)\, I(t \le T)\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\}\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\}^T d\Lambda(t) \right] \\
&= E\left[ e^{\theta^T Z}\int_0^\tau I(t \le C \wedge T)\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\}\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(t) \right\}^T d\Lambda(t) \right] \\
&= E\left[ e^{\theta^T Z}\int_0^\tau E\{I(y \le Y)|Z\}\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(y) \right\}\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(y) \right\}^T d\Lambda(y) \right] \\
&= E\left[ e^{\theta^T Z}\int_0^\tau P(Y \ge y|Z)\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(y) \right\}\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(y) \right\}^T d\Lambda(y) \right] \\
&= E\left[ e^{\theta^T Z}\int_0^\tau \left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(y) \right\}\left\{ Z - \frac{L_{1,\theta,\Lambda}}{L_{0,\theta,\Lambda}}(y) \right\}^T G_{\theta,\Lambda}(y|Z)\, d\Lambda(y) \right].
\end{align*}

This shows that (9.45) is valid.
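As a numerical illustration of (9.45), the right-hand side can be evaluated directly in a toy model where $G_{\theta,\Lambda}(y|Z)$ is known in closed form. The sketch below (Python) is built on illustrative assumptions of mine: a scalar covariate, unit baseline hazard on $[0, \tau]$ with $\tau = 2$, and censoring uniform on $(0, \tau)$ independent of $(T, Z)$, so that $G_{\theta,\Lambda}(y|Z) = e^{-y e^{\theta Z}}(1 - y/\tau)$. With a scalar covariate the matrix reduces to a number, which can be compared with the empirical second moment of the efficient scores from the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, tau = 0.5, 2.0
n_mc, n_grid = 5000, 200

# toy model: Z ~ U(-1, 1), Lambda(y) = y, independent censoring C ~ U(0, tau),
# so G(y|Z) = P(Y >= y | Z) = exp(-y e^{theta Z}) * (1 - y/tau) on [0, tau)
Z = rng.uniform(-1, 1, n_mc)
grid = np.linspace(0, tau, n_grid + 1)[:-1]        # avoid the endpoint, where G = 0
dLam = tau / n_grid

G = np.exp(-grid[None, :] * np.exp(theta * Z)[:, None]) * (1 - grid[None, :] / tau)
L0 = np.mean(np.exp(theta * Z)[:, None] * G, axis=0)
L1 = np.mean((Z * np.exp(theta * Z))[:, None] * G, axis=0)
ratio = L1 / L0

# plug-in evaluation of (9.45): Monte Carlo over Z, Riemann sum over y
integrand = np.exp(theta * Z)[:, None] * (Z[:, None] - ratio[None, :]) ** 2 * G
I_eff = np.mean(np.sum(integrand, axis=1) * dLam)
print(I_eff)
```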

9.4.3 Unknown Covariate Distribution $p_Z(z)$ and Conditional Censoring Distribution $F_{C|Z}(c|z)$, Adaptive Estimation

In the previous section, we calculated the efficient score $\tilde\ell_{\theta,\Lambda}$ and the efficient information matrix $\tilde I_{\theta,\Lambda}$ assuming the covariate distribution $p_Z(z)$ and the conditional censoring distribution $F_{C|Z}(c|z)$, or equivalently the conditional censoring density function $f_{C|Z}(c|z)$, are known to us. However, we do not want to make this assumption. Henceforth, we leave the covariate distribution $p_Z(z)$, the conditional censoring distribution $F_{C|Z}(c|z)$, and thus the conditional censoring density function $f_{C|Z}(c|z)$, completely unspecified. The product $f_{C|Z}(c|z)\, p_Z(z)$ then equals $p_{C,Z}(c,z)$, the joint density function of $C$ and $Z$. The tangent set corresponding to this unknown joint density function is easy to obtain: since $p_{C,Z}(c,z)$ is completely unspecified, it is the tangent set of a nonparametric model, namely the set of all square integrable mean-zero functions of $c$ and $z$. Let us denote this tangent space by $L_2^0(P_{C,Z})$, where $P_{C,Z}$ denotes the probability measure corresponding to the joint density function $p_{C,Z}(c,z)$, i.e.,

\[ L_2^0(P_{C,Z}) = \{\zeta(c,z) \in L_2(P_{C,Z}): P_{C,Z}\,\zeta = 0\}. \]
We want to show that the ordinary score function $\dot\ell_{\theta,\Lambda}$ is orthogonal to $L_2^0(P_{C,Z})$. This can be done by showing that the conditional expectation of $\dot\ell_{\theta,\Lambda}$ given $C$ and $Z$ equals zero. Using martingale theory, this can be shown in a very elegant way. For this purpose, we note the ordinary score function can be written as

\[ \dot\ell_{\theta,\Lambda}(X) = \int_0^\tau I(t \le C)\, Z\, dM(t). \]
Since this again makes use of martingale theory, we refer the interested reader to Tsiatis (2006), [35], §5.2. Henceforth, we may conclude that the efficient score $\tilde\ell_{\theta,\Lambda}$ when the covariate distribution $p_Z(z)$ and the conditional censoring distribution $F_{C|Z}(c|z)$ are unknown to us is the same as when they are known to us. This is what we referred to as adaptive estimation, see §5.2: in the light of efficiency, it does not matter whether the covariate distribution and the conditional censoring distribution are known or not. Hence, the formulas we obtained in the previous section, especially (9.44) and (9.45), remain valid. In [35], §5.2, it is also argued how RAL estimators for $\theta$ can be constructed and how a globally efficient estimator for $\theta$ can be obtained using the efficient score function $\tilde\ell_{\theta,\Lambda}$. A globally efficient estimator for $\theta$ can be found by solving the estimating equations
\[ \sum_{i=1}^n \Delta_i \left\{ Z_i - \frac{\sum_{j=1}^n Z_j \exp\{\theta^T Z_j\}\, I(Y_j \ge Y_i)}{\sum_{j=1}^n \exp\{\theta^T Z_j\}\, I(Y_j \ge Y_i)} \right\} = 0. \]

Note this results in a globally efficient estimator (and not only a locally efficient estimator) since the estimating equations are independent of the nuisance parameters. It turns out this estimator exactly equals the estimator proposed by Cox, obtained by maximizing the partial likelihood, where the notion of a partial likelihood was first introduced by Cox in 1975; see [10] for more information.
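Numerically, the estimating equation above is a low-dimensional root-finding problem. The sketch below (Python; the simulated data and the use of scipy's brentq root finder are illustrative choices of mine, not prescribed by the text) solves it for a scalar covariate and recovers $\theta$.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
n, theta_true = 2000, 0.8

Z = rng.uniform(-1, 1, n)
T = rng.exponential(1.0 / np.exp(theta_true * Z))  # Cox model, baseline hazard 1
C = rng.uniform(0, 2, n)
Y, delta = np.minimum(T, C), (T <= C)

def partial_score(theta):
    """Left-hand side of the estimating equation (the partial-likelihood score)."""
    w = np.exp(theta * Z)
    at_risk = (Y[None, :] >= Y[:, None])           # at_risk[i, j] = I(Y_j >= Y_i)
    num = at_risk @ (Z * w)                        # sum_j Z_j e^{theta Z_j} I(Y_j >= Y_i)
    den = at_risk @ w                              # sum_j e^{theta Z_j} I(Y_j >= Y_i)
    return np.sum(delta * (Z - num / den))

theta_hat = brentq(partial_score, -5.0, 5.0)
print(theta_hat)                                   # should be close to theta_true
```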

Appendices

Appendix A

Fundamentals about Asymptotic Statistics

This appendix provides the basic definitions and properties of asymptotic statistics. These concepts were used extensively throughout this thesis. This summary is based on van der Vaart (1998), [39], Chapter 2 and the lecture notes Vansteelandt (2007-2008), [41], Chapters 5 and 6.

A.1 Stochastic Convergence

A.1.1 Types of Convergence

Definition A.1. Let $\{X_n\}_{n=1}^{+\infty}$ be a sequence of random vectors defined on the probability space $(\Omega, \mathcal{U}, P)$. We then say that $X_n$ converges in probability to a random vector $X$ iff for any $\epsilon > 0$,
\[ \lim_{n\to+\infty} P(\|X_n - X\| > \epsilon) = 0, \]
or equivalently, iff for any $\epsilon > 0$,
\[ \lim_{n\to+\infty} P(\|X_n - X\| \le \epsilon) = 1. \]
This is denoted by $X_n \xrightarrow{P} X$.

Definition A.2. Let $\{X_n\}_{n=1}^{+\infty}$ be a sequence of random vectors defined on the probability space $(\Omega, \mathcal{U}, P)$ with corresponding distribution functions $F_{X_n}(x)$. We then say that $X_n$ converges in distribution to a random vector $X$ with distribution function $F_X(x)$ iff
\[ \lim_{n\to+\infty} F_{X_n}(x) = F_X(x) \]
for any $x$ at which the limit distribution function $F_X(x)$ is continuous. Alternative names are convergence in law or weak convergence. This is denoted by $X_n \xrightarrow{D} X$.


A.1.2 Properties of Stochastic Convergence

A first property states that convergence in probability implies convergence in distribution, i.e., convergence in probability is stronger than convergence in distribution. That is why convergence in distribution is also called weak convergence.

Proposition A.1. If $X_n \xrightarrow{P} X$, then also $X_n \xrightarrow{D} X$.

Note the converse is not true in general, so both types of convergence are not equivalent. However, there is a special case.

Proposition A.2. If $X_n \xrightarrow{D} c$ for a constant $c$, then also $X_n \xrightarrow{P} c$.

Hence, in the case of convergence to a constant, convergence in probability and convergence in distribution are equivalent. The next lemma gives a number of equivalent descriptions of convergence in distribution. Most of them are only useful in proofs. This is also known as the Portmanteau Lemma.

Lemma A.1 (Portmanteau). For any random vectors Xn and X the following statements are equivalent:

(i) $X_n \xrightarrow{D} X$;

(ii) E{f(Xn)} → E{f(X)} for all bounded, continuous functions f;

(iii) E{f(Xn)} → E{f(X)} for all bounded, Lipschitz$^1$ functions f;

(iv) lim inf E{f(Xn)} ≥ E{f(X)} for all nonnegative, continuous functions f;

(v) lim inf P(Xn ∈ G) ≥ P(X ∈ G) for every open set G;

(vi) lim sup P(Xn ∈ F ) ≤ P(X ∈ F ) for every closed set F ;

(vii) P(Xn ∈ B) → P(X ∈ B) for all Borel sets B with P(X ∈ ∂B) = 0, where ∂B is the boundary of B.

To check if a sequence of random vectors Xn converges in probability to a random vector X, the following two inequalities may be very useful.

Theorem A.1 (Markov's Inequality). Let $X$ be a random vector, $p > 0$ and $E\{\|X\|^p\} < +\infty$. Then for every $\epsilon > 0$,
\[ P(\|X\| > \epsilon) \le \frac{E\{\|X\|^p\}}{\epsilon^p}. \]

A special case of Markov’s inequality is Chebyshev’s inequality.

Theorem A.2 (Chebyshev's Inequality). Let $X$ be a random vector with finite variance. Then for every $\epsilon > 0$,
\[ P\{\|X - E(X)\| > \epsilon\} \le \frac{E\{\|X - E(X)\|^2\}}{\epsilon^2}. \]
$^1$A function $f$ is called Lipschitz if there exists a number $L$ such that $|f(x) - f(y)| \le L\|x - y\|$.
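Both inequalities are easy to check by simulation. The following sketch (Python; the Exp(1) example and all names are illustrative choices of mine) compares empirical tail probabilities with the bounds.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(1.0, 100000)                   # Exp(1): mean 1, variance 1
eps = 2.0

# Markov with p = 1 (X is nonnegative): P(X > eps) <= E(X) / eps
print(np.mean(X > eps), X.mean() / eps)
# Chebyshev: P(|X - E(X)| > eps) <= var(X) / eps^2
print(np.mean(np.abs(X - X.mean()) > eps), X.var() / eps**2)
```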

We now state the classical limit theorems together with an extension that turned out to be useful for our purposes.

Theorem A.3 (Weak Law of Large Numbers (WLLN)). Let $\{X_n\}_{n=1}^{+\infty}$ be a sequence of random vectors that are i.i.d. and for which $E(X_n) = \mu$ and $\|\mu\| < +\infty$. Then we have
\[ \bar X_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu. \]

Theorem A.4 (Central Limit Theorem (CLT)). Let $\{X_n\}_{n=1}^{+\infty}$ be a sequence of random vectors that are i.i.d. and for which $E(X_n) = \mu$ and $\|\mu\| < +\infty$. In addition, we need that $E\{\|X_n\|^2\} < +\infty$. The covariance matrix of $X_n$ is denoted $\Sigma = E\{(X_n - \mu)(X_n - \mu)^T\}$. Under these assumptions, we have
\[ \sqrt{n}\,(\bar X_n - \mu) \xrightarrow{D} N(0, \Sigma). \]
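A short simulation (Python; the Exp(1) example and sample sizes are illustrative choices of mine) makes both theorems concrete: sample means concentrate around $\mu = 1$, and $\sqrt{n}(\bar X_n - \mu)$ behaves approximately like a standard normal.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 1000, 5000

samples = rng.exponential(1.0, size=(reps, n))     # Exp(1): mu = 1, sigma^2 = 1
means = samples.mean(axis=1)

print(means.mean(), means.std())                   # WLLN: means cluster around 1
scaled = np.sqrt(n) * (means - 1.0)
print(scaled.mean(), scaled.std())                 # CLT: approximately N(0, 1)
```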

The following theorem is an extension of the ordinary weak law of large numbers and is given in Newey and McFadden (1994), [28], Lemma 4.3.

Theorem A.5 (Uniform WLLN). Let $\{X_n\}_{n=1}^{+\infty}$ be a sequence of random vectors that are i.i.d., suppose $a(X, \theta)$ is continuous at $\theta_0$ with probability one, and suppose there is a neighbourhood $N$ of $\theta_0$ such that $E\{\sup_{\theta \in N}\|a(X, \theta)\|\} < +\infty$. Then for any $\tilde\theta \xrightarrow{P} \theta_0$,
\[ \frac{1}{n}\sum_{i=1}^n a(X_i, \tilde\theta) \xrightarrow{P} E\{a(X, \theta_0)\}. \]

Next, we state some useful rules for the calculation of limiting distributions, known as the Continuous Mapping Theorem and Slutsky’s lemma.

Theorem A.6 (Continuous Mapping Theorem). Let Xn and X be random vectors and let g be a continuous function at every point of a set C such that P(X ∈ C) = 1.

(i) If $X_n \xrightarrow{D} X$, then $g(X_n) \xrightarrow{D} g(X)$;

(ii) If $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$.

Lemma A.2 (Slutsky). Let Xn, X, Yn and Y be random vectors or variables and consider a constant c.

(i) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n + Y_n \xrightarrow{P} X + Y$;

(ii) If $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{D} c$, then $X_n + Y_n \xrightarrow{D} X + c$;

(iii) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n Y_n \xrightarrow{P} XY$;

(iv) If $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{D} c$, then $X_n Y_n \xrightarrow{D} cX$.

We want to end with Prohorov's theorem. For this purpose, we need some additional terminology.

Definition A.3. A random vector $X$ is called tight if for every $\epsilon > 0$ there exists a constant $M$ such that $P(\|X\| > M) < \epsilon$.

Definition A.4. A set of random vectors $\{X_\alpha: \alpha \in \mathcal{A}\}$ is called uniformly tight if the constant $M$ of the preceding definition can be chosen to be the same for every $\alpha \in \mathcal{A}$, i.e., for every $\epsilon > 0$ there exists a constant $M$ such that
\[ \sup_{\alpha \in \mathcal{A}} P(\|X_\alpha\| > M) < \epsilon. \]

The definition of uniformly tight means that there exists a compact set to which all Xα give probability almost one. Therefore, uniformly tight is also called bounded in probability. We now have enough information to state Prohorov’s theorem.

Theorem A.7 (Prohorov). Let Xn and X be random vectors.

(i) If $X_n \xrightarrow{D} X$, then $\{X_n: n \in \mathbb{N}\}$ is uniformly tight;

(ii) If $\{X_n: n \in \mathbb{N}\}$ is uniformly tight, then there exists a subsequence with $X_{n_j} \xrightarrow{D} X$ as $j \to +\infty$.

From Prohorov's theorem, we see that uniform tightness is almost equivalent to convergence in distribution: any sequence that converges in distribution is uniformly tight, and any uniformly tight sequence has a subsequence that converges in distribution.

A.1.3 Stochastic o and O Symbols

In this section, we give some basic rules for the symbols $o_P(1)$ and $O_P(1)$. The symbol $o_P(1)$ is shorthand for a sequence of random vectors that converges to zero in probability. The symbol $O_P(1)$ is shorthand for a sequence of random vectors that is uniformly tight, i.e., bounded in probability. An extension of these symbols is the following. Consider a sequence of random variables $R_n$; then
\[ X_n = o_P(R_n) \text{ means that } X_n = Y_n R_n \text{ with } Y_n \xrightarrow{P} 0, \]
and
\[ X_n = O_P(R_n) \text{ means that } X_n = Y_n R_n \text{ with } Y_n = O_P(1). \]

For deterministic sequences Xn and Rn, the stochastic symbols oP and OP reduce to the usual o and O from the ordinary calculus. Below we state a list of useful rules concerning these stochastic symbols.

(i) oP (1) + oP (1) = oP (1),

(ii) oP (1) + OP (1) = OP (1),

(iii) oP (1)OP (1) = oP (1),

(iv) $\{1 + o_P(1)\}^{-1} = O_P(1)$,

(v) $o_P(R_n) = R_n\, o_P(1)$,

(vi) OP (Rn) = RnOP (1),

(vii) oP {OP (1)} = oP (1).

We end this section with two more complicated, nonetheless, useful rules.

Lemma A.3. Let R be a function with R(0) = 0. Let Xn be a sequence of random vectors with values in the domain of R that converges in probability to zero, i.e., Xn = oP (1). Then, for every p > 0,

(i) if $R(h) = o(\|h\|^p)$ as $h \to 0$, then $R(X_n) = o_P(\|X_n\|^p)$;

(ii) if $R(h) = O(\|h\|^p)$ as $h \to 0$, then $R(X_n) = O_P(\|X_n\|^p)$.

A.2 Asymptotic Properties of Estimators

In this section, we present some basic asymptotic properties an estimator may possess. We begin with a formal definition of a statistic and an estimator.

Definition A.5. Let X1,...,Xn be an i.i.d. sample. The random function g(X1,...,Xn) is then called a statistic. Let x1, . . . , xn be the observed values of the sample. The quantity g(x1, . . . , xn) is called the observed value of the statistic.

Definition A.6. A statistic Tn(X1,...,Xn) that is used to estimate a parameter θ0 is called an estimator of θ0. Let x1, . . . , xn be the observed values of the sample. The value Tn(x1, . . . , xn) is then called an estimation of θ0.

We now state a list of properties an estimator may possess. The set Θ denotes the set of admitted parameter values θ.

Definition A.7. An estimator Tn(X1,...,Xn) of θ is called unbiased iff

Eθ{Tn(X1,...,Xn)} = θ, for all θ ∈ Θ.

Definition A.8. The bias of an estimator Tn(X1,...,Xn) of θ is given by

Bn(θ) = Eθ{Tn(X1,...,Xn)} − θ.

Definition A.9. A sequence of estimators $T_n(X_1,\ldots,X_n)$ of $\theta$ is called asymptotically unbiased iff
\[ \lim_{n\to+\infty} B_n(\theta) = 0, \quad \text{for all } \theta \in \Theta. \]

Definition A.10. A sequence of estimators Tn(X1,...,Xn) of θ is called consistent iff

\[ T_n(X_1,\ldots,X_n) \xrightarrow{P} \theta, \quad \text{for all } \theta \in \Theta. \]

A sufficient condition for consistency is given in the following proposition.

Proposition A.3. If $T_n \equiv T_n(X_1,\ldots,X_n)$ is a sequence of estimators of $\theta$ satisfying

(i) $\lim_{n\to+\infty} \mathrm{var}_\theta(T_n) = 0$,

(ii) $\lim_{n\to+\infty} B_n(\theta) = 0$,

for every $\theta \in \Theta$, then $T_n$ is a consistent sequence of estimators of $\theta$.

We end our list with a formal definition of $r_n$-consistency for scaling factors $r_n \to +\infty$ as $n \to \infty$.

Definition A.11. A sequence of estimators $T_n \equiv T_n(X_1,\ldots,X_n)$ of $\theta$ is called $r_n$-consistent iff there exists a sequence of scaling factors $r_n$, e.g., $r_n = \sqrt{n}$, with $r_n \to +\infty$ as $n \to \infty$, for which
\[ r_n(T_n - \theta) = O_P(1), \quad \text{for all } \theta \in \Theta. \]

The scaling factors $r_n$ should be chosen with some caution. If the scaling factors $r_n$ are chosen too small, the limiting distribution of $r_n(T_n - \theta)$ becomes singular as $n \to +\infty$. If the scaling factors $r_n$ are chosen too big, the limiting distribution of $r_n(T_n - \theta)$ only takes infinite values. The most convenient scaling factors are $r_n = \sqrt{n}$. In this case, the estimator $T_n$ of $\theta$ is called a $\sqrt{n}$-consistent (root-$n$ consistent) estimator. In the light of this concept, we can also define CAN-estimators.

Definition A.12. A sequence of estimators $T_n \equiv T_n(X_1,\ldots,X_n)$ of $\theta$ is called consistent asymptotically normal (CAN) iff there exists a sequence of scaling factors $r_n$ such that
\[ r_n(T_n - \theta) \xrightarrow{D} N\{\mu(\theta), \Sigma(\theta)\}, \quad \text{for all } \theta \in \Theta. \]

Appendix B

Law of Iterated Expectation and Variance

The following rules are very useful for calculating the expected value or variance of a function of multiple random variables. Especially the first one, the law of iterated expectation, is used extensively throughout this thesis.

Theorem B.1 (Law of Iterated Expectation). Let $X$ and $Y$ be two random vectors. Then we have that
\[ E[E\{r(X,Y)|X\}] = E\{r(X,Y)\}, \]
for an arbitrary function $r(x,y)$, assuming all occurring expectations exist.

This can easily be generalized, e.g., let $X$, $Y$ and $Z$ be random vectors. Assuming all occurring expectations exist, we have that
\[ E[E\{r(X,Y,Z)|X,Z\}|Z] = E\{r(X,Y,Z)|Z\}, \]
for an arbitrary function $r(x,y,z)$.

Theorem B.2 (Law of Iterated Variance). Let $X$ and $Y$ be two random vectors. Then we have that
\[ E[\mathrm{var}\{r(X,Y)|X\}] + \mathrm{var}[E\{r(X,Y)|X\}] = \mathrm{var}\{r(X,Y)\}, \]
for an arbitrary function $r(x,y)$, assuming all occurring expectations exist.

This also allows easy generalizations, e.g., let $X$, $Y$ and $Z$ be random vectors. Assuming all occurring expectations exist, we have that
\[ E[\mathrm{var}\{r(X,Y,Z)|X,Z\}|Z] + \mathrm{var}[E\{r(X,Y,Z)|X,Z\}|Z] = \mathrm{var}\{r(X,Y,Z)|Z\}, \]
for an arbitrary function $r(x,y,z)$.
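Both laws are easy to confirm numerically. In the sketch below (Python; the joint distribution of $(X, Y)$ and the choice $r(x,y) = xy$ are illustrative assumptions of mine), both sides of each identity agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000

X = rng.uniform(0, 1, n)
Y = rng.normal(X, 1.0)                             # Y | X ~ N(X, 1)
r = X * Y

# law of iterated expectation: E[E{r|X}] = E[X * E(Y|X)] = E[X^2]
print(r.mean(), np.mean(X * X))
# law of iterated variance: E[var(r|X)] + var[E(r|X)] = var(r)
cond_mean = X * X                                  # E(r|X) = X^2
cond_var = X**2 * 1.0                              # var(r|X) = X^2 * var(Y|X)
print(cond_var.mean() + cond_mean.var(), r.var())
```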


Bibliography

[1] Aalen, O.O., Borgan, O. and Gjessing, H.K., (2008). Survival and Event History Analysis. Springer.

[2] Andersen, P.K., Borgan, O., Gill, R.D. and Keiding, N., (1993). Statistical Models Based on Counting Processes. Springer Series in Statistics.

[3] Begun, J.M., Huang, W.M. and Wellner, J.A., (1983). Information and asymptotic efficiency in parametric-nonparametric models. Annals of Statistics 11, 432-452.

[4] Bickel, P.J., (1982). On adaptive estimation. Annals of Statistics 10, 647-671.

[5] Bickel, P.J., Klaassen, C.A.J., Ritov, Y. and Wellner, J.A., (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore.

[6] Casella, G. and Berger, R.L., (2002). Statistical Inference. Second Edition. Duxbury Advanced Series.

[7] Chamberlain, G., (1986). Asymptotic efficiency in semiparametric models with censoring. Journal of Econometrics 32, 189-218.

[8] Chamberlain, G., (1987). Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics 34, 305-334.

[9] Cox, D.R., (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B 34, 187-220.

[10] Cox, D.R., (1975). Partial likelihood. Biometrika 62, 269-276.

[11] Engle, R.F., Granger, C.W.J., Rice, J. and Weiss, A., (1986). Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 76, 817-823.

[12] Hájek, J., (1970). A characterization of limiting distributions of regular estimators. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 14, 323-330.

[13] Hájek, J., (1972). Local asymptotic minimax and admissibility in estimation. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, Vol. 1, 175-194.

[14] Hampel, F.R., (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association 69, 383-393.


[15] James, W. and Stein, C., (1961). Estimation with Quadratic Loss. Proceedings of the fourth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, vol. 1, 361-379.

[16] Koshevnik, Y.A. and Levit, B.Y., (1976). On a nonparametric analogue of the information matrix. Theory of Probability and its Applications 21, 738-753.

[17] Le Cam, L., (1960). Locally asymptotically normal families of distributions. University of California Publications in Statistics 3, 37-98.

[18] Le Cam, L., (1969). Théorie Asymptotique de la Décision Statistique. Les Presses de l'Université de Montréal.

[19] Le Cam, L., (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York.

[20] Lehmann, E.L., (1983). Theory of Point Estimation. Wiley, New York.

[21] Leon, S., Tsiatis, A.A. and Davidian, M., (2003). Semiparametric estimation of treatment effect in a pretest-posttest study. Biometrics 59, 1046-1055.

[22] Levit, B.Y., (1978). Infinite-dimensional informational lower bounds. Theory of Probability and its Applications 23, 388-394.

[23] Liang, K-Y. and Zeger, S.L., (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.

[24] Luenberger, D.G., (1969). Optimization by Vector Space Methods. John Wiley & Sons, Inc., New York.

[25] McCullagh, P., Nelder, J.A., (1989). Generalized Linear Models. Second Edition. Monographs on Statistics and Applied Probability.

[26] Newey, W.K., (1988). Adaptive estimation of regression models via moment restrictions. Journal of Econometrics 38, 301-339.

[27] Newey, W.K., (1990). Semiparametric Efficiency Bounds. Journal of Applied Econometrics 5, 99-135.

[28] Newey, W.K. and McFadden, D., (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics 4, 2111-2245.

[29] Rao, C.R., Toutenburg, H., (1995). Linear Models, Least Squares and Alternatives, Second Edition. Springer series in statistics.

[30] Ritov, Y., Bickel, P.J., (1987). Achieving information bounds in non and semiparametric models. Technical Report 116, Department of Statistics, University of California, Berkeley.

[31] Robins, J.M., Mark, S.D., Newey, W.K., (1992). Estimating Exposure Effects by Modelling the Expectation of Exposure Conditional on Confounders. Biometrics 48, 479-495.

[32] Rudin, W., (1974). Functional Analysis. McGraw Hill.

[33] Slodička, M., (2009-2010). Applied Functional Analysis. Unpublished course material, Universiteit Gent.

[34] Stein, C., (1956). Efficient nonparametric testing and estimation. Proceedings of the third Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, vol. 1, 187-195.

[35] Tsiatis, A.A, (2006). Semiparametric Theory and Missing Data. Springer series in statistics.

[36] van der Vaart, A.W., (1988). Statistical Estimation in Large Parameter Spaces. CWI Tracts 44. Centrum voor Wiskunde en Informatica, Amsterdam.

[37] van der Vaart, A.W., (1991). An asymptotic representation theorem. International Statis- tical Review, 97-121.

[38] van der Vaart, A.W., (1991). On differentiable functionals. Annals of Statistics 19, 178-204.

[39] van der Vaart, A.W., (1998). Asymptotic Statistics. Cambridge University Press.

[40] van der Vaart, A.W., (2002). Semiparametric statistics. École d'Été de St Flour 1999 (ed.: P. Bernard). Lecture Notes in Mathematics 1781, 331-457. Springer Verlag, Berlin.

[41] Vansteelandt, S., (2007-2008). Kansrekening en Wiskundige Statistiek I. Unpublished course material, Universiteit Gent.

[42] Vermeulen, K., (2008-2009). Meetkundige benadering van de asymptotische efficiëntie van schatters. Bachelor thesis, Universiteit Gent.

[43] Vermeulen, K., Vansteelandt, S., De Neve, J. and Thas, O., (2010). Semiparametric efficient estimation for probabilistic index models. In progress.

[44] Yang, L. and Tsiatis, A.A., (2001). Efficiency study of estimators for a treatment effect in a pretest-posttest trial. The American Statistician 55, 314-321.

Index

adapted process, 205
adaptive estimation, 85
additive semiparametric regression model, 51
adjoint operator, 18
adjoint score operator, 163
administrative censoring, 192
almost everywhere (a.e.), 14, 111
almost surely (a.s.), 111
asymptotic risk, 137
asymptotically efficient estimator, 146
asymptotically linear estimator (ALE), 31
asymptotically unbiased estimator, 215
auxiliary variables, 94
Banach space, 13
bias, 215
bounded functional, 16
bounded in probability, 214
Bounded Inverse Theorem, 20
bounded operator, 17
bowl-shaped function, 137
canonical gradient, 121
Cauchy sequence, 13
Cauchy-Schwartz inequality, 15
censoring event, 192
central limit theorem (CLT), 213
Chebyshev's inequality, 212
compensator, 205
complete family, 183
complete measure space, 198
complete space, 13
complete statistic, 183
conditional hazard rate, 10
consistent asymptotically normal (CAN), 216
consistent estimator, 215
continuous mapping theorem, 213
continuous operator, 17
continuous prolongation, 18
convergence in distribution, 211
convergence in law, 211
convergence in probability, 211
Convolution theorem, 143
counting process, 205
covariance inner product, 30
Cramér-Rao bound, 132
cumulative hazard function, 10
differentiable functional, 119
differentiable path, 110
Dirac measure, 147
direct sum, 16
distance, 21
dominated convergence theorem of Lebesgue, 114
dual space, 16
efficient influence function, 45, 58, 62, 120
efficient information matrix, 152
efficient score, 46
efficient score equations, 82
efficient score function, 152
efficient score, semiparametric, 55
empirical distribution, 147
empirical distribution function, 149
estimation, 215
estimator, 215
filtration, 206
Fisher information matrix, 35
Fisher information of an observation, 132
functional, 16
GEE-estimator, 81
Generalized Estimating Equations, GEE, 81
globally efficient estimator, 54
gradients, 121
hardest submodel, 110
hazard function, 9
Hilbert space, 14
Hodges' estimator, 137

influence function, 31, 119
information loss models, 177
information number, 132
information operator, 166
inner product space, 14
inverse operator, 20
isometry, 17
kernel, 19
law of iterated expectation, 217
law of iterated variance, 217
least favourable submodel, 110
linear model, 5
linear operator, 17
linear variety, 44
Local Asymptotic Minimax (LAM), 141
locally efficient estimator, 54
location-shift regression models, 88
log-linear model, 5
logistic regression model, 5
loss function, 136
m-estimators, 34
Markov's inequality, 212
martingale, 206
maximal tangent set, 111
measurable space, 3
measure space, 3
mixing distribution, 182
mixture density, 182
mixture models, 8, 181
monotone convergence theorem, 114
multiplication operator, 203
Multivariate Pythagorean Theorem, 43
nonparametric model, 4
norm, 13
normed linear space, 13
nuisance parameter, 4, 29, 49
nuisance score function, 150
nuisance tangent set, 150
nuisance tangent space, 35
nuisance tangent space, semiparametric, 54
one-jump process, 205
orthogonal, 14
orthogonal complement, 21
orthogonal projection, 22
orthogonal projection operator, 22
orthonormal, 14
parameter of interest, 4, 29, 49
parametric model, 3
parametric submodel, 50
partial likelihood, 208
partially linear regression, 7
Portmanteau Lemma, 212
pretest-posttest study, 90
probabilistic index models, 99
probability model, 3
probability space, 3
Prohorov's theorem, 214
Projection Lemma, 22
Projection Theorem, 22
proportional hazards model, 9
Pythagorean Theorem, 14
quadratic variation process, 206
RAL estimators, 33
range, 19
regular asymptotically linear estimators, 33
regular estimator, 32, 142
regular parametric submodel, 111
replicating space, 43
restricted moment model, 5
reversed triangle inequality, 13
Riesz' Representation Theorem, 17
right censoring, 191
rn-consistent, 216
root-n consistent, 216
sandwich estimator, 81
score function, 110
score operator, 162, 169
score vector, 33
semi-positive definite, 42
semiparametric efficiency bound, 53, 61, 134, 135
semiparametric model, 4
Slutsky's lemma, 213
smooth parametric submodel, 111
statistic, 215
statistical model, 3
Stein shrinkage estimator, 144
strongly convergent, 17
subconvex function, 137
super-efficient estimator, 32, 138
survival time, 9
survivor function, 9
symmetric location, 8
tangent set, 111
tangent space, 35, 113
tangent space of the parameter of interest, 36
tangent space, semiparametric, 60
tight, 214
time of death, 192
time to event, 191
treatment difference, 90
triangle inequality, 13
unbiased estimator, 215
uniform WLLN, 213
uniformly tight, 214
variation norm, 142
variationally independent, 49
weak convergence, 211
weak law of large numbers (WLLN), 213
weakly convergent, 17
working variance assumption, 84