Functional Data Analysis: Techniques and Applications
R. Todd Ogden and Jeff Goldsmith
March 17, 2014

Outline
Examples, definitions, notation
Display
Smoothing
Functional principal components analysis
Regression with functional predictors and/or responses
What is functional data?
Some examples...
Child height as a function of age.
Knee angle as children go through a gait cycle.
Systolic blood pressure at various ages for 150 subjects.
Examples of the S in Shakespeare’s signature
Reaching motions made by a stroke patient
Curvature and radius of the carotid artery.
Brain images.

Recurring example: DTI
Tract profiles from diffusion tensor imaging
What is functional data?
Something like a definition:
“Observations on subjects that you can imagine as Xi(si), where si is continuous”
Functional notation is conceptual; observations are made on a finite discrete grid.
Some characteristics of functional data
The following are sometimes associated with functional data:
High dimensional
Temporal and/or spatial structure
Interpretability across subject domains
Discretization of functional data
Conceptually, we regard functional data as being defined on a continuum, e.g., Xi(t), 0 ≤ t ≤ 1. In practice, functional data are observed at a finite number of points.
Dense functional data: Often, the observation points form a fine regular grid, i.e., xi = (Xi(1/N), Xi(2/N), ..., Xi(1)): spectral data, imaging data, accelerometry, ...

Sparse functional data: In other situations, the points at which observations are taken are irregular, often random: CD4 count, blood pressure, etc.
In such cases, some kind of interpolation is necessary.
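As a concrete illustration of interpolating sparse, irregularly observed curves onto a common grid, here is a minimal NumPy sketch (the deck's own examples use R; the observation times and values below are invented):

```python
import numpy as np

# Hypothetical sparse observations: each subject measured at irregular points
obs_t = [np.array([0.05, 0.3, 0.7, 0.95]), np.array([0.1, 0.5, 0.9])]
obs_x = [np.array([1.0, 1.4, 1.1, 0.8]), np.array([0.9, 1.3, 1.0])]

common_grid = np.linspace(0, 1, 101)

# Linear interpolation of each subject's curve onto the shared grid;
# values outside a subject's observed range are held at the endpoints
interpolated = np.vstack([np.interp(common_grid, t, x)
                          for t, x in zip(obs_t, obs_x)])
```

Smoother interpolants (splines, kernel smoothers) are often preferable in practice; linear interpolation is just the simplest way to get all curves onto one grid.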
Functional data are technically multivariate data!
Why not just apply multivariate techniques (MANOVA, clustering, multiple regression, etc.)?
Any technique for functional data should take into account the structure of the data — results from multivariate data analyses are generally permutation-invariant, but results from functional data analyses should not be!
Methodological developments in FDA are often extensions of corresponding multivariate techniques.
Functional data are often observed with measurement error
Xi(t) is smooth (and continuously defined) but we observe
xi = (Xi(1/N) + εi1, Xi(2/N) + εi2, ..., Xi(1) + εiN)
It is common to smooth the data before any analysis (a topic we'll revisit soon)
In other situations, accounting for measurement error is built in to the analysis procedure.
Comparison across observations
In order for functional data to be comparable across observations (e.g., across subjects), they must be observed on the same domain, i.e., t must be the same for X1(t) and X2(t). In many cases, this is straightforward:
Spectral data

Problematic for some other situations:
Growth curves (for adolescents, “growth spurts” may not line up)
Brain imaging data (structure is somewhat different from subject to subject)

In such cases it is often possible to register the data, e.g., using landmarks or by warping.
Summary measures for functional data
Suppose we have functional data {Xi(t), t ∈ T}, i = 1, ..., n.

Mean: µ(t) = E Xi(t).
The mean is itself a function

Typically, we assume that the mean is smooth

"Raw" estimator is the sample mean: X̄(t) = (1/n) Σi Xi(t)
A typical estimator of µ would be a smoothed version of X̄(t) (more on this later).
Variance: Σ(s, t) = Cov(X(s), X(t)) = E[(X(s) − µ(s))(X(t) − µ(t))]

This is a (two-dimensional) surface.
"Raw" estimator is the sample covariance: Σ̂(s, t) = Ĉov(Xi(s), Xi(t))
Would need to smooth this as well.
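The raw mean and covariance estimators can be computed directly from a matrix of discretized curves; a minimal NumPy sketch with simulated curves (the grid, signal, and noise level are invented for illustration, and the deck's own code is in R):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 50, 40                      # subjects, grid points
t = np.linspace(0, 1, N)
# Simulated curves: smooth mean plus random amplitude variation plus noise
X = (np.sin(2 * np.pi * t)
     + rng.normal(0, 0.5, (n, 1)) * np.cos(2 * np.pi * t)
     + rng.normal(0, 0.1, (n, N)))

mu_hat = X.mean(axis=0)                # "raw" mean estimate, length N
Sigma_hat = np.cov(X, rowvar=False)    # "raw" covariance surface, N x N
```

Both estimates would typically be smoothed before further use, as noted above.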
[Figure: fractional anisotropy tract profiles vs. distance along tract]
[Figure: estimated covariance surface Cov(s, t)]
Beyond iid functional data
Although the iid case is quite common, other situations are possible:
Multilevel functional data:
  {Xij(t), t ∈ T}, i = 1, ..., n, j = 1, ..., Ji
  Example: repeated motions in gesture data
Longitudinal functional data:
  {Xij(t, vj), t ∈ T}, i = 1, ..., n, j = 1, ..., Ji
  Example: DTI data (multiple clinical visits)
Common problems in functional data analysis
Some issues arise regularly in FDA
Data display and summarization
Smoothing and interpolation
Patterns in variability: principal component analysis
Regression (with functional predictors, outcomes, or both)
Data display
Lots of tools for displaying data
Spaghetti plots
Rainbow plots
3D rainbow plots
Examples for all using DTI data follow; R code is available online
Spaghetti plot

[Figure: spaghetti plot of fractional anisotropy vs. distance along tract]
2D rainbow plot

[Figure: 2D rainbow plot of fractional anisotropy vs. distance along tract]
3D rainbow plot

[Figure: 3D rainbow plot of fractional anisotropy vs. distance along tract, with curves arranged by PASAT score]
Smoothing
Why do we need smoothing?
Data are often observed with error
There’s a need to interpolate to a common grid
How are we going to do smoothing?
Use a known set of basis functions
Regress observed data onto known basis
Some common basis functions: Splines
[Figure: Fourier basis and B-spline basis functions φk(s) on [0, 1]]

Continuous
Easily defined derivatives
Good for smooth data
Some common basis functions: Wavelets
Formed from a single "mother wavelet" function: ψjk(t) = 2^(j/2) ψ(2^j t − k)

Orthonormal basis

Particularly good when there are jumps, spikes, peaks, etc.

Wavelet representation is sparse
Minimize sum of squares
Suppose we want to smooth a curve Yi(t) observed with error. We can use

Ŷi(t) = Σ(k=1..K) ĉik ψk(t).

We only need to estimate the subject-specific scores ĉik; minimize SSEi with respect to the cik, where

SSEi = Σj (Yi(tj) − Σ(k=1..K) cik ψk(tj))^2
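A minimal sketch of this least-squares step for a single curve, using a Fourier basis as the known ψk (the basis choice, grid, and simulated curve are illustrative; the deck's own code is in R):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)
# Noisy curve: smooth signal plus measurement error
y = (np.sin(2 * np.pi * t) + 0.3 * np.cos(4 * np.pi * t)
     + rng.normal(0, 0.2, t.size))

# Known basis: constant plus the first few Fourier pairs, evaluated on the grid
K = 5
Psi = np.column_stack([np.ones_like(t)] +
                      [f(2 * np.pi * (k + 1) * t) for k in range(K)
                       for f in (np.sin, np.cos)])

# Least squares: choose scores c to minimize SSE = ||y - Psi c||^2
c_hat, *_ = np.linalg.lstsq(Psi, y, rcond=None)
y_smooth = Psi @ c_hat
```

The smoothed curve is just the basis expansion evaluated at the estimated scores; the same design matrix serves every curve observed on the same grid.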
Example

[Figure: noisy tract profile and its smooth fit vs. distance along tract]
Tuning
For any curve, many possible smooths are available
Depends on the spline basis
Depends on the number of basis functions
Depends on the estimation procedure

"Tuning" is the process of adjusting the smoother to the data at hand. This is often implicit.
Example

[Figures: the same tract profile smoothed under different tuning choices]
Penalization
Rather than choosing a smoother “by hand”, we could use a lot of basis functions but explicitly penalize “wiggliness”
Leads to a penalized SSE:
SSEi = Σt (Yi(t) − Ψ(t)ci)^2 + λ Pen(Ψ(t)ci)
Common penalties are on the derivatives (enforcing smoothness)
Need to choose tuning parameter λ
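A sketch of the penalized fit in its simplest discretized form: one coefficient per grid point, with a second-difference penalty standing in for the integrated squared second derivative, solved as a linear system (the setup and λ value are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * t) + rng.normal(0, 0.3, t.size)

# Rich "basis": one coefficient per grid point (identity design), with a
# second-difference matrix D approximating the second derivative on the grid
N = t.size
D = np.diff(np.eye(N), n=2, axis=0)   # (N-2) x N second-difference operator
lam = 5.0                             # tuning parameter lambda

# Minimize ||y - f||^2 + lam * ||D f||^2  =>  (I + lam * D'D) f = y
f_hat = np.linalg.solve(np.eye(N) + lam * D.T @ D, y)
```

Larger λ forces smaller second differences (a smoother fit) at the cost of fidelity to the data, which is exactly the tradeoff the tuning parameter controls.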
Data-driven basis
Previous bases don’t depend on the data; only the loadings do.
FPCA gives a “data-driven” basis: it is constructed from the observed data.
Looks pretty similar mathematically:
Ŷi(t) = Σ(k=1..K) ĉik ψk(t).
The difference is that the ψk aren't pre-specified.
So where do the basis functions ψ come from?
Construct covariance matrix Σ
(Remove the main diagonal, which is inflated by measurement-error variance, then smooth)
Spectral decomposition of Σ produces basis functions ψ
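A minimal sketch of these steps (the diagonal-removal and smoothing step is omitted for brevity; the simulated data, with two known directions of variability, are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 200, 50
t = np.linspace(0, 1, N)
# Curves built from two directions of variability plus noise
scores = rng.normal(0, [2.0, 0.7], (n, 2))
X = (scores[:, :1] * np.sin(2 * np.pi * t)
     + scores[:, 1:] * np.cos(2 * np.pi * t)
     + rng.normal(0, 0.1, (n, N)))

# Spectral decomposition of the sample covariance gives the FPCA basis
Sigma = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(Sigma)      # eigh returns ascending order
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
pct_var = evals / evals.sum()             # variance explained by each PC
```

The leading eigenvectors (here, recovering the sine and cosine directions) are the data-driven basis functions; the fraction of variance explained guides the choice of K.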
Some properties of FPCA
The ψ are orthonormal (mutually orthogonal, with unit norm)
Also the most parsimonious basis expansion for a given data set
Basis functions are often interpretable: they describe the major directions of variability in the observed data
Example

[Figure: raw covariance surface Cov(s, t)]
Example

[Figure: smoothed covariance surface Cov(s, t)]
Example

[Figure: first PC for FA (67.9% of variability) and second PC for FA (9.8%), displayed as the mean curve plus and minus each PC]
Data-driven vs pre-specified
Data-driven bases are the most parsimonious for a given dataset, but may not transfer to new data
Data-driven often work better for sparse data (borrowing strength to derive basis functions)
Pre-specified often have better analytical properties (easily computed derivatives, known forms)
Regression modeling with functional data
Scalar on function regression
Function on scalar regression
Function on function regression
Scalar on function regression: Example scenarios
X = temperature (over time) for the year Y = total rainfall for one year
X = NIR spectrum Y = water content of a sample
X = brain image Y = clinical outcome
Example data: DTI
xi(s) = fractional anisotropy along the corticospinal tract
Yi = measure of cognitive function
[Figure: corticospinal tract FA profiles, fractional anisotropy vs. s]
Linear scalar-on-function regression model
Given data ({x1(s), s ∈ S}, Y1), ..., ({xn(s), s ∈ S}, Yn), the scalar-on-function regression model is:

Yi = α + ∫ xi(s) β(s) ds + εi,  i = 1, ..., n

Interpretation of the "coefficient function" β:
Where β(s) > 0, larger values of xi(s) lead to higher predicted Y.
Where β(s) < 0, larger values of xi(s) lead to lower predicted Y.
Where β(s) = 0, xi(s) has no effect on Y.
Coefficient Interpretation

[Figure: observed tract profiles Xi(s), the coefficient function β(s), and the products Xi(s)β(s), whose shaded areas ∫ Xi(s)β(s) ds give each curve's contribution]
Scalar-on-function regression: The need for regularization
But the function xi(s) is only observed at N points!

xi = (xi(1/N), xi(2/N), ..., xi(1))^T
β = (β(1/N), β(2/N), ..., β(1))^T

The model becomes

Yi = α + ∫ xi(s) β(s) ds + εi ≈ α + (1/N) xi^T β + εi

If we're not thinking "functionally", this is like doing regression with n observations and N predictors!
To get reasonable fits, we must regularize in some way.
Basis functions
Possible basis functions: splines, orthogonal polynomials, principal components, wavelets, etc.
Let

xi(s) = Σ(k=1..K) cik ψk(s)
β(s) = Σ(k=1..K) θk ψk(s)

This is now a K-dimensional regression problem.
Scalar-on-function regression: Basis function representation
Yi = α + ∫ xi(s) β(s) ds + εi
   = α + ∫ (Σ(ℓ=1..K) ciℓ ψℓ(s)) (Σ(k=1..K) θk ψk(s)) ds + εi
   = α + Σ(k=1..K) [Σ(ℓ=1..K) ciℓ ∫ ψℓ(s) ψk(s) ds] θk + εi
   = α + Σ(k=1..K) zik θk + εi
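A sketch of this reduction on a discrete grid, where the integral is approximated by a Riemann sum so the scores zik collapse to (1/N) x Ψ (the basis, true coefficients, and simulated data are all invented for illustration; the deck's own software is in R):

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, K = 150, 60, 5
s = np.linspace(0, 1, N)

# Fourier-type basis evaluated on the grid (an illustrative choice of psi_k)
Psi = np.column_stack([np.ones(N)] +
                      [np.sin(2 * np.pi * k * s) for k in range(1, K)])

theta_true = np.array([1.0, -2.0, 0.5, 0.0, 0.0])  # basis coefficients of beta
beta_true = Psi @ theta_true                       # beta on the grid

x = rng.normal(0, 1, (n, N))                       # discretized predictors
# Yi = alpha + integral xi(s) beta(s) ds + eps_i, integral as a Riemann sum
y = 2.0 + (x @ beta_true) / N + rng.normal(0, 0.1, n)

# Scores z_ik = sum_l c_il * integral(psi_l psi_k) reduce to (1/N) x Psi here;
# the functional regression is now ordinary regression of y on z
Z = np.column_stack([np.ones(n), x @ Psi / N])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
alpha_hat, theta_hat = coef[0], coef[1:]
```

The N-dimensional problem has become a (K+1)-dimensional one, which is the whole point of the basis representation.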
How to choose K?
[Figure: estimated coefficient functions β(t) for K = 2, K = 7, and K = 10]
Regularization with roughness penalties
Could choose α and β to minimize
Σ(i=1..n) (Yi − α − ∫ xi(s) β(s) ds)^2 + λ ∫ (β''(s))^2 ds
First term: (proportional to) mean squared error (MSE): measures fidelity to the data (how well the model “fits” the data)
Second term: measures the smoothness of the coefficient function
Example fits with a range of tuning parameters

[Figure: estimated β(t) for several values of λ]
How to choose λ?
The tuning parameter λ controls the tradeoff between these.
If λ is too large, it will result in smooth estimates at the expense of large MSE (underfitting).
If λ is too small, the MSE will be small but the estimated β function will be very wiggly (overfitting).
Neither one of these will provide good "out of sample" predictions.

Could choose λ by cross-validation:

CV(λ) = Σ(i=1..n) (Yi − α̂λ^(−i) − ∫ xi(t) β̂λ^(−i)(t) dt)^2

where the (−i) superscript denotes estimates computed with subject i left out. Choose λ to minimize CV(λ).
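A schematic of cross-validating over a λ grid, using a generic ridge-type penalized estimator in the reduced (score) space as a stand-in for the full functional fit (all names, the penalty matrix, and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 80, 8
Z = rng.normal(0, 1, (n, K))            # stand-in for basis scores
theta_true = np.array([3.0, 1.5, 0.7, 0.3, 0.0, 0.0, 0.0, 0.0])
y = Z @ theta_true + rng.normal(0, 0.5, n)

P = np.eye(K)                           # stand-in penalty matrix (e.g., D'D)

def cv_score(lam):
    """Leave-one-out CV error for the penalized estimator at tuning value lam."""
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        Zi, yi = Z[keep], y[keep]
        theta = np.linalg.solve(Zi.T @ Zi + lam * P, Zi.T @ yi)
        errs.append((y[i] - Z[i] @ theta) ** 2)
    return np.mean(errs)

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lams, key=cv_score)
```

In practice the refits are rarely done by brute force; GCV and REML (mentioned below) avoid the n-fold loop entirely.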
Also: generalized cross-validation (GCV), restricted maximum likelihood (REML), ...

Function on scalar regression: Example scenarios
X = climate zone Y = temperature (over time)
X = age Y = activity level (over time)
X = diagnosis Y = brain image
Canadian weather data
X = region (Arctic, Atlantic, Continental, Pacific); Y = temperature (degrees Celsius) over time

[Figure: temperature curves over the year, by month]

Function on scalar regression
A “functional ANOVA” model:
Yij(s) = µ(s) + αi(s) + εij(s),  i = 1, ..., n
For identifiability, we could constrain Σi αi(s) = 0 for all s.

More generally, given data (x1, {Y1(s), s ∈ S}), ..., (xn, {Yn(s), s ∈ S}), where xi is a p-vector, the function-on-scalar regression model is
Yi(s) = xi^T β(s) + εi(s),
where β(s) = (β1(s), . . . , βp(s)).
Function on scalar regression: data representation
If the functional observations are observed at a grid of points, say, s1,..., sN, then let
Y : n × N = [Yi(sj)], i = 1, ..., n; j = 1, ..., N.

We could also think about expressing the β functions on the same grid, i.e., let

B : p × N = [βi(sj)], i = 1, ..., p; j = 1, ..., N.

Expressing the ε's the same way and writing the X matrix as usual, the discrete version of the model becomes
Y = XB + E.
This has the same form as multivariate analysis of variance (MANOVA).

Function on scalar regression: basis function representation
Given basis functions ψ1(s), ..., ψK(s), we could express

Yi(s) = Σ(k=1..K) cik ψk(s)
βj(s) = Σ(k=1..K) θjk ψk(s)

The model then becomes
C = XΘ + E
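In either the discretized or the basis form, unpenalized fitting reduces to multivariate least squares; a minimal sketch on a grid (the simulated design and coefficient functions are illustrative, not from the Canadian weather data):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, N = 120, 2, 50                 # subjects, covariates, grid points
s = np.linspace(0, 1, N)

# True coefficient functions, one row per covariate
B_true = np.vstack([np.sin(2 * np.pi * s),          # effect of covariate 1
                    0.5 * np.cos(2 * np.pi * s)])   # effect of covariate 2
X = rng.normal(0, 1, (n, p))
E = rng.normal(0, 0.2, (n, N))
Y = X @ B_true + E                   # discrete model Y = XB + E

# Least squares for B, solved simultaneously for every grid point
# (column j of Y gives an ordinary regression for column j of B)
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

This pointwise fit ignores smoothness across s, which is exactly what the roughness penalties on the next slide restore.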
Fitting by penalizing roughness
Could choose β to minimize
Σ(i=1..n) ∫ (Yi(s) − xi^T β(s))^2 ds + λ Σ(j=1..p) ∫ (βj''(s))^2 ds

More generally, in the discretized space, we could minimize

||Y − XB||^2 + λ Σ(j=1..p) Bj^T P Bj,
where Bj is the jth row of B.
Application to Canadian weather data

[Figure: estimated coefficient functions over the year, one panel per regression coefficient]
Function on function regression: Example scenarios
X = temperature (over time) Y = precipitation (over time)
X = fractional anisotropy along corpus callosum tract Y = fractional anisotropy along corticospinal tract
X = hip angle through a gait cycle Y = knee angle through a gait cycle
Function on function regression: the model
Given functional data ({x1(s), s ∈ S}, {Y1(t), t ∈ T}), ..., ({xn(s), s ∈ S}, {Yn(t), t ∈ T}), the model could be expressed

Yi(t) = ∫ β(s, t) xi(s) ds + εi(t)

The coefficient function in this case is a (two-dimensional) surface.
Function on function regression: Example

[Figure: estimated bivariate coefficient surface β(s, t), plotted as a tensor-product smooth]
Software
refund package
fda package
fda.usc package
Stuff we haven’t even mentioned
Inference on functional model parameters
Model selection, model building
Alternative penalties
Model diagnostics and goodness of fit
“Generalized” versions of functional linear models
Hierarchical models for functional data
Supervised/unsupervised classification of functional data
Functional “depth” and functional boxplots
Many other topics . . .
Useful references
Ferraty and Vieu (2006). Nonparametric Functional Data Analysis. Springer.
Ramsay and Silverman (2005). Functional Data Analysis, Second Edition. Springer.
Ramsay and Silverman (2002). Applied Functional Data Analysis. Springer.
Sørensen, Goldsmith, and Sangalli (2013). An introduction with medical applications to functional data analysis. Statistics in Medicine 32:5222-5240.