Help for Mim JC Galati, P Royston & JB Carlin

Title

mim -- A prefix command for analysing and manipulating multiply imputed datasets in Stata

Syntax

mim [, mim_options] : command

mim [, replay_options]

mim_options                 description
------------------------------------------------------------------------
General
* category(fit|manip)       specify whether command is estimation or data manipulation
  noisily                   display output from execution of command within each of the imputed datasets

Estimation (valid only for estimation commands)
  dots                      display progress dots during model fitting
  noindividual              suppress capture of estimation results from each application of command
  seed(#)                   set seed for random number generator prior to each call to command
  storebv                   fills e(b), e(V) etc. with multiple-imputation estimates

Manipulation (valid only for data manipulation commands)
+ sortorder(varlist)        one or more variables that uniquely identify the observations in a given imputed dataset following each execution of command
------------------------------------------------------------------------
* only necessary for estimation and data manipulation commands not listed under Description
+ not valid for append and reshape; MANDATORY for all other data manipulation commands

replay_options              description
------------------------------------------------------------------------
  clearbv                   clears e(b), e(V) etc., but leaves other mim estimates intact
  j(#)                      fills e(b), e(V) etc. with estimates corresponding to imputed dataset #
  reporting_options         level and eform options supported by command
  storebv                   same as for estimation, unless j option is specified
------------------------------------------------------------------------

xi is allowed as a prefix to mim, but not as a prefix to command; see xi.
svy is allowed as a prefix to command; see svy.
version is allowed as a prefix to command; see version.

Description

mim is a prefix command for working with multiply-imputed (MIM) datasets, where command can be any of a wide range of Stata commands. The function that mim performs depends on the category of command passed to mim: estimation, data manipulation, post estimation or utility. A limited range of commands can be used with mim without specifying the category mim_option. These are:

Estimation: regress, mean, proportion, ratio, logistic, logit, ologit, mlogit, probit, oprobit, poisson, glm, binreg, nbreg, gnbreg, blogit, clogit, cnreg, mvreg, rreg, qreg, iqreg, sqreg, bsqreg, stcox, streg, xtgee, xtreg, xtlogit, xtnbreg, xtpoisson, xtmixed, svy:regress, svy:mean, svy:proportion, svy:ratio, svy:logistic, svy:logit, svy:ologit, svy:mlogit, svy:probit, svy:oprobit, svy:poisson

Post Estimation: lincom, testparm, predict

Data Manipulation: reshape, append, merge

Utility: check, genmiss

With few exceptions, command is specified with its full usual syntax. The exceptions are predict, where only the xb, stdp and equation options are permitted, and merge, where only one "using" file is allowed. Also, command may be one of two internal utility commands, check and genmiss, where the required syntaxes are

mim : check [varlist]

mim : genmiss varname

respectively (see Examples - Utility commands for more details regarding these two commands).

Further Stata estimation and data manipulation commands can be used with mim by specifying the mim_option category(mim_type), where mim_type may be either fit for estimation commands or manip for data manipulation commands. Use of mim in this way is at the user's discretion, and the results are not guaranteed. The dataset format used by mim is a stacked format similar to that created by Royston's ice command (if installed). Details of the required dataset format may be found under Remarks - MIM Dataset Format below.

Options

General:

category specifies the type of command that is being passed to mim, either estimation (category fit) or data manipulation (category manip)

noisily specifies that the results of the application of command to each of the individual imputed datasets should be displayed

Estimation:

dots specifies that progress dots should be displayed

noindividual specifies that capture of the estimation results corresponding to the fitting of the given estimation command to each of the individual imputed datasets should be suppressed (see Remarks - Returned Results)

seed specifies an initial value for the random number generator; during model fitting the random number generator is set to this value each time the given estimation command is fitted to one of the imputed datasets

storebv specifies that the standard list of returned results for estimation commands be filled using the multiple-imputation results. In particular this forces the multiple-imputation coefficient and covariance matrix estimates into e(b) and e(V), respectively (where the Li-Raghunathan-Rubin (1991) method of calculating the covariance matrix is used), enabling application at the user's own discretion of Stata post-estimation commands that use these quantities directly (see Examples - Replay of Estimation Results [Advanced] for further details)

Manipulation:

sortorder specifies a list of one or more variables that uniquely identify the observations in each of the datasets in a mim-compatible dataset; for data manipulation, this option must specify a list of variables that together uniquely identify the observations in each dataset AFTER command has been applied to the given dataset (note that varlist cannot include _mi, since the _mj and _mi variables are dropped from each dataset prior to the call to command)

Replay:

clearbv specifies that the additional items returned using the storebv or j options be cleared, but that all other estimation results returned by mim be left intact

j specifies that the standard list of returned results for estimation commands be filled using the estimates from the last fit of an estimation command to the `j'th imputed dataset, and these estimates be replayed

reporting_options specifies level and eform options supported by command

storebv has the same effect as for estimation, unless the j option is specified

(Note that there are no mim_options for mim: predict, mim: check and mim: genmiss)

Remarks - MIM Dataset Format

For a multiply-imputed dataset to be compatible with mim, the dataset must contain:

  a numeric variable called _mj whose values identify the individual dataset to which each observation belongs, and
  a numeric variable called _mi whose values identify the observations within each individual dataset.

Moreover, if the original data with missing values are to be stored in the dta file, then those observations must be identified with the value _mj==0, while imputed datasets are identified using positive _mj values. In particular, the dataset in the stack identified by _mj==0 is ignored for the purpose of model fitting with mim. For convenience, a multiply-imputed dataset satisfying the above requirements is called a MIM dataset.

The requirements above have been kept as simple as possible. They allow a set of multiply-imputed datasets stored in separate files to be stacked into the format required by mim using only the basic data processing commands generate, append and replace. (Nevertheless, for convenience, a dedicated command mimstack has been provided for this purpose.)

An example of a multiply imputed dataset in mim-compatible format is shown below. The original data consist of a completely observed variable y and a variable x with missing values in the 3rd, 4th and 6th observations, and there are 2 imputed copies of the original dataset in the stack.

  _mj   _mi     y         x
  ---------------------------
    0     1   1.1       105
    0     2   9.2       106
    0     3   1.1         .
    0     4   2.3         .
    0     5   7.5       108
    0     6   7.9         .
    1     1   1.1       105
    1     2   9.2       106
    1     3   1.1   109.796
    1     4   2.3   110.456
    1     5   7.5       108
    1     6   7.9   102.243
    2     1   1.1       105
    2     2   9.2       106
    2     3   1.1   107.952
    2     4   2.3   115.968
    2     5   7.5       108
    2     6   7.9   114.479
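As a minimal sketch of the stacking described above, the do-file fragment below builds the example stack from three separate datasets using generate and append. The data are typed in with input only to keep the sketch self-contained; in practice each copy would be read from its own file with use. It is an illustration of the format, not a substitute for mimstack.

```stata
* Original data with missing values: identified by _mj == 0.
clear
input y x
1.1 105
9.2 106
1.1 .
2.3 .
7.5 108
7.9 .
end
generate _mj = 0          // dataset identifier
generate _mi = _n         // observation identifier within the dataset
tempfile stack
save `stack'

* First imputed copy.
clear
input y x
1.1 105
9.2 106
1.1 109.796
2.3 110.456
7.5 108
7.9 102.243
end
generate _mj = 1
generate _mi = _n
append using `stack'
save `stack', replace

* Second imputed copy.
clear
input y x
1.1 105
9.2 106
1.1 107.952
2.3 115.968
7.5 108
7.9 114.479
end
generate _mj = 2
generate _mi = _n
append using `stack'

* Put the stack in order and confirm that _mj and _mi identify every row.
sort _mj _mi
order _mj _mi
isid _mj _mi
list, sepby(_mj) noobs
```

The isid call stops with an error if any (_mj, _mi) pair is duplicated, which is a cheap check to run before passing the stack to mim.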

Remarks - Returned Results

After model fitting mim returns results in e() as follows.

Result                      description
------------------------------------------------------------------------
Multiple-imputation estimates

Matrices

  e(MIM_Q)                  coefficient estimates
  e(MIM_T)                  total covariance matrix estimate
  e(MIM_TLRR)               Li-Raghunathan-Rubin (1991) estimate of total covariance matrix
  e(MIM_W)                  within imputation covariance matrix estimate
  e(MIM_B)                  between imputation covariance matrix estimate
  e(MIM_dfvec)              vector of MI degrees of freedom

Scalars

  e(MIM_dfmin)              minimum of e(MIM_dfvec)
  e(MIM_dfmax)              maximum of e(MIM_dfvec)
  e(MIM_Nmin)               minimum number of observations used in estimation
  e(MIM_Nmax)               maximum number of observations used in estimation

Macros

  e(MIM_m)                  number of imputed datasets used in estimation
  e(MIM_levels)             values of _mj variable used in estimation
  e(MIM_prefix)             value of e(prefix) returned by command
  e(MIM_prefix2)            mim
  e(MIM_cmd)                the name of the estimation command specified in command
  e(MIM_depvar)             value of e(depvar) returned by command
  e(MIM_title)              value of e(title) returned by command
  e(MIM_properties)         value of e(properties) returned by command
  e(MIM_eform)              value of e(eform) returned by command

Additional results (these are returned when storebv option is specified)

  e(b)                      equal to e(MIM_Q)
  e(V)                      equal to e(MIM_TLRR)
  e(N)                      equal to e(MIM_Nmin)
  e(cmd)                    equal to e(MIM_cmd)
  e(depvar)                 equal to e(MIM_depvar)
  e(df_r)                   equal to e(MIM_dfmin)
  e(properties)             equal to e(MIM_properties)

Individual estimates (these are returned whenever noindividual option is omitted)

+ e(MIM_`k'_matrix)         for each matrix returned by the `k'-th application of command
* e(MIM_`k'_scalar)         for each scalar returned by the `k'-th application of command
* e(MIM_`k'_macro)          for each macro returned by the `k'-th application of command

Additional results (these are returned when replaying individual estimates)

! e(MIM_`j'_matrix)         for each matrix returned by the `j'-th application of command
! e(MIM_`j'_scalar)         for each scalar returned by the `j'-th application of command
! e(MIM_`j'_macro)          for each macro returned by the `j'-th application of command

------------------------------------------------------------------------
+ where k varies over the values of e(MIM_levels) and matrix, scalar and macro are the names of e-type results returned by command, of type matrix, scalar or macro, respectively
* a macro or scalar from a subsequent imputed dataset is only returned if its value differs from the corresponding macro or scalar from the first imputed dataset
! where j is the value specified via the j replay option
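The fragment below is a hedged sketch of how the returned results listed above might be picked up after model fitting. It assumes that mim is installed, that mymimdataset1 (an illustrative file name) is a MIM dataset containing y and x1-x4, and that e(MIM_Q) and e(MIM_T) are laid out like e(b) and e(V), which the table does not state explicitly. The matrix names Qhat and That are arbitrary.

```stata
* Fit the model across the imputed datasets.
use mymimdataset1, clear
mim: regress y x1 x2 x3 x4

* Copy the pooled estimates out of e() before another command overwrites them.
matrix Qhat = e(MIM_Q)
matrix That = e(MIM_T)
matrix list Qhat
matrix list That

* Standard error of the first coefficient, taken from the total covariance matrix.
display "s.e. of first coefficient = " sqrt(That[1,1])

* e(MIM_m) and e(MIM_levels) are macros; e(MIM_dfmin) and e(MIM_dfmax) are scalars.
display "imputations used: `e(MIM_m)' (values of _mj: `e(MIM_levels)')"
display "df range: " e(MIM_dfmin) " to " e(MIM_dfmax)
```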

Examples - Model Fitting

When invoked for model fitting, mim applies command to each of the imputed datasets in the current MIM dataset, and then combines the individual estimates using Rubin's rules for multiple-imputation-based inferences. In most cases fitting a statistical model to a multiply-imputed dataset with mim is simply a matter of loading the MIM dataset into Stata and executing the desired estimation command, prefixing it with the mim prefix. Several examples are provided below.

. use mymimdataset1, clear
. mim: regress y x1 x2 x3 x4

. use mymimdataset2, clear
. mim: logistic y x1 x2, coef

. use mymimdataset3, clear
. xi: mim: glm low age lwt i.race smoke ptl ht ui, f(bin) l(logit) le(90)

. use mymimdataset4, clear
. mim: svy: proportion heartatk
. mim: svy: logistic heartatk age weight height
. mim, noi: svy jackknife, nodots: logit highbp height weight age age2 female black, or

. use mymimdataset5, clear
. mim: xtmixed gsp private emp water other unemp || region: R.state, l(90)

Additionally, other Stata estimation commands may be fitted to a MIM dataset using the category option of mim. Two examples are given below.

. use mymimdataset6, clear
. mim, cat(fit): mvprobit (private = years logptax loginc) (vote=years logptax loginc), nolog

. use mymimdataset7, clear
. mim, cat(fit): MyNewCommand y x1 x2
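For reference, Rubin's rules for a single (scalar) parameter can be written in their standard textbook form as follows. The notation is introduced here for illustration only: $\hat{Q}_k$ and $U_k$ denote the estimate and its estimated variance from the $k$-th of $m$ imputed datasets. The quantities $\bar{Q}$, $\bar{W}$, $B$ and $T$ play the roles that the names e(MIM_Q), e(MIM_W), e(MIM_B) and e(MIM_T) suggest, although the exact computations performed by mim (in particular the degrees of freedom) are not spelled out in this help file.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{k=1}^{m}\hat{Q}_k
  && \text{(pooled estimate)}\\
\bar{W} &= \frac{1}{m}\sum_{k=1}^{m}U_k
  && \text{(within-imputation variance)}\\
B &= \frac{1}{m-1}\sum_{k=1}^{m}\bigl(\hat{Q}_k-\bar{Q}\bigr)^{2}
  && \text{(between-imputation variance)}\\
T &= \bar{W}+\Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance)}
\end{align*}
```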

Examples - Data Manipulation

The stacked dataset format used by mim allows simple data manipulation such as generating and replacing variables to be performed using existing Stata commands. More complex data manipulation tasks, particularly those that alter the number of observations in each of the imputed datasets, usually require more detailed programming. For convenience, three common tasks, namely reshaping, appending and merging datasets, can be accomplished by prefixing the relevant command with mim. The first two are straightforward, and in most instances will be applied by simply prefixing the usual syntax with mim.

. use mymimdataset7, clear
. mim: reshape wide income, i(id) j(year)
. mim: reshape long

. use mymimdataset8, clear
. mim: append using mymimdataset9

Merging two mim-compatible datasets requires a little further explanation, since it requires that the sortorder option be specified to mim. This option is necessary so that mim can generate a new _mi variable once merging is complete. For example, suppose that mymimdataset10 is a mim-compatible dataset containing patient details, with each patient having a unique id, and mymimdataset11 is a second stacked dataset containing additional longitudinal measurements on each patient, with each measurement uniquely identified by the two variables id time. Merging these data into a single dataset would usually be accomplished by a match-merge on the id variable. However, once merging is complete, the observations in the merged dataset are determined by the pair of variables id and time. Using mim the merge would be accomplished as follows:

. use mymimdataset10, clear
. mim, sortorder(id time): merge id using mymimdataset11
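One way to confirm that the variables passed to sortorder behave as intended is to test, after the merge, that they identify a single observation within every stacked dataset. The sketch below assumes that mim is installed and reuses the illustrative file and variable names from the example; the isid check is a suggestion and is not something mim requires.

```stata
* Merge the longitudinal measurements onto the patient details.
use mymimdataset10, clear
mim, sortorder(id time): merge id using mymimdataset11

* Within each stacked dataset (_mj), id and time together should pick out
* exactly one observation; isid exits with an error if they do not.
isid _mj id time
```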

Additionally, other Stata commands that manipulate either a single dataset or a master/using pair of datasets may be applied to a multiply-imputed dataset using the category option of mim. This is most likely to be of interest when command is a user-written program designed to accomplish a project-specific task.

. use mymimdataset12, clear
. mim, category(manip) so(id): mystatacmd x1 x2 x3

Examples - Post Estimation

In general Stata's standard postestimation methods cannot be directly applied with multiply-imputed data. Methods relying on likelihood comparisons (lrtest) are not applicable because multiple imputation does not involve calculation of likelihood functions for the data. Furthermore, application of a postestimation command directly to the multiple-imputation estimates will not in general produce valid simultaneous inferences for multiple parameters, since applying Rubin's rules to the vector of parameter estimates and their associated variance-covariance matrices does not work reliably (Li et al, 1991). Performing inferences for target parameters that are scalar (unidimensional) is however easily accomplished using Rubin's rules, and this has enabled us to create multiple-imputation versions of lincom and predict. In addition, we have implemented the method of Li et al (1991) to create a mim-specific version of testparm, which allows the testing of null hypotheses relating to a vector of parameters. Examples of the use of mim: lincom, mim: testparm and mim: predict are given below. For other postestimation tasks see the additional remarks under Examples - Replay of estimation results [Advanced].

. use mymimdataset2, clear
. mim: logit y x1 x2
. mim: lincom x1+2*x2
. mim: testparm _all
. mim: predict xb xbse

Note that when using predict with mim, the syntax requires that standard errors for the linear predictor, if required, be calculated at the same time as the linear predictor (see Syntax for predict, check and genmiss). The reason is that in a multiple-imputation framework, the point estimates are used in calculating estimates for the corresponding standard errors.

Examples - Replay of estimation results [Advanced]

Multiple-imputation estimates may be replayed by simply typing mim at the command line. A level option and any eform options supported by command may be specified during replay.

. use mymimdataset2, clear
. mim: logit y x1 x2
. mim, or l(90)

Multiple-imputation estimates may be copied into e(b), e(V) etc. by specifying the storebv option during replay. In this case the Li-Raghunathan-Rubin (1991) estimate of the variance-covariance matrix is stored in e(V). Note that use of multiple-imputation estimates in this way is at the user's discretion, and validity of the results is not guaranteed. In particular, forcing the multiple-imputation estimates into e(b) and e(V) will allow application of a Stata postestimation command directly to the multiple-imputation estimates, and this may be valid in specific cases, even though it may not be valid in general (see Examples - Post Estimation for additional comments).

. mim, storebv

(Note that the storebv option may also be specified during model fitting.)

Alternatively, by specifying the j option of mim, the estimates corresponding to the application of command to one of the individual imputed datasets may be copied into their usual place in e() (that is, into e(b), e(V) etc.)

. mim, j(1)

This facility to replay individual estimates has been incorporated with extensibility in mind, particularly with regard to postestimation. The most likely application is to loop over the individual estimates, replaying and capturing necessary quantities from each set of results in turn, and then combining these in some way, where the standard approach for simple scalar estimation would be to use Rubin's rules.

. use mymimdataset2, clear
. mim: logit y x1 x2
. local levels `"`e(MIM_levels)'"'
. foreach j of local levels {
.     quietly mim, j(`j')
.     ... apply some postestimation command or capture some stored results here ...
. }
. combine results from individual estimations using Rubin's rules ...
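As one possible way of filling in the loop outlined above, the sketch below pools a single coefficient by hand. It assumes that mim is installed and that, after mim, j(#), the coefficient of interest sits in the first column of e(b) with its variance in e(V)[1,1]. The data are the two-imputation example from Remarks - MIM Dataset Format, typed in so that the fragment is self-contained; the rr_ names are arbitrary and chosen only to avoid clashing with variable names.

```stata
* Two-imputation example stack (original data are stored with _mj == 0).
clear
input _mj _mi y x
0 1 1.1 105
0 2 9.2 106
0 3 1.1 .
0 4 2.3 .
0 5 7.5 108
0 6 7.9 .
1 1 1.1 105
1 2 9.2 106
1 3 1.1 109.796
1 4 2.3 110.456
1 5 7.5 108
1 6 7.9 102.243
2 1 1.1 105
2 2 9.2 106
2 3 1.1 107.952
2 4 2.3 115.968
2 5 7.5 108
2 6 7.9 114.479
end

mim: regress y x
local levels `"`e(MIM_levels)'"'
local m : word count `levels'

* Replay each individual fit and capture the coefficient on x and its variance.
scalar rr_sumq = 0
scalar rr_sumu = 0
foreach j of local levels {
    quietly mim, j(`j')
    matrix rr_b = e(b)
    matrix rr_V = e(V)
    scalar rr_q`j' = rr_b[1,1]
    scalar rr_sumq = rr_sumq + rr_b[1,1]
    scalar rr_sumu = rr_sumu + rr_V[1,1]
}
quietly mim, clearbv

* Rubin's rules for a scalar: pooled estimate, within, between and total variance.
scalar rr_qbar    = rr_sumq/`m'
scalar rr_within  = rr_sumu/`m'
scalar rr_between = 0
foreach j of local levels {
    scalar rr_between = rr_between + (rr_q`j' - rr_qbar)^2/(`m' - 1)
}
scalar rr_total = rr_within + (1 + 1/`m')*rr_between

display "pooled coefficient on x = " rr_qbar
display "pooled standard error   = " sqrt(rr_total)
```

For a built-in quantity such as a regression coefficient, the pooled values could be compared against what mim itself reports; the loop earns its keep when the captured quantity is something mim does not pool directly.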

To avoid inadvertent application of a Stata post-estimation command to estimates copied into e(b), e(V) etc. using either the j or storebv option, the clearbv option is provided to allow the user to clear these estimates when finished (without losing the multiple imputation estimates from memory). It is recommended that the user always make use of this facility.

. mim, clearbv

Examples - Utility commands

The check command provides a detailed integrity check of a multiply imputed dataset in stacked format. The main checks are that non-missing values must be constant across imputed datasets and that all missing values must have been imputed. Note that the utility commands are only applicable when the original dataset with missing values has been included in the stacked dataset (see MIM Dataset Format).

. use mymimdataset12, clear
. mim: check

Alternatively, the check can be restricted to selected variables.

. mim: check x1 x2 x3 x4 x5
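To make the two main checks concrete, the fragment below reproduces them by hand for a single variable using only built-in Stata commands. It is a simplified sketch of the idea, not a description of how mim: check is implemented, and it uses the example stack from Remarks - MIM Dataset Format. The variable name x_obs is arbitrary.

```stata
* Example stack, including the original data as _mj == 0.
clear
input _mj _mi y x
0 1 1.1 105
0 2 9.2 106
0 3 1.1 .
0 4 2.3 .
0 5 7.5 108
0 6 7.9 .
1 1 1.1 105
1 2 9.2 106
1 3 1.1 109.796
1 4 2.3 110.456
1 5 7.5 108
1 6 7.9 102.243
2 1 1.1 105
2 2 9.2 106
2 3 1.1 107.952
2 4 2.3 115.968
2 5 7.5 108
2 6 7.9 114.479
end

* Within each _mi group the _mj == 0 row sorts first, so x[1] is the original value.
bysort _mi (_mj): generate byte x_obs = !missing(x[1])

* Check 1: values observed in the original data are unchanged in every imputed copy.
bysort _mi (_mj): assert x == x[1] if x_obs

* Check 2: every value that was missing originally has been imputed.
assert !missing(x) if _mj > 0
```

Both assert commands are silent when the stack passes and stop with an error otherwise.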

The genmiss command generates a missing indicator variable for a specified variable.

. mim: genmiss x1

In this case the generated indicator variable is called _mim_x1 (and in general the naming convention used is to prefix varname with _mim_).

References

[1] Carlin JB, Li N, Greenwood P and Coffey C. (2003) Tools for analyzing multiple imputed datasets. Stata Journal 3:226-244.
[2] Royston P. (2004) Multiple imputation of missing values. Stata Journal 4:227-241.
[3] Royston P. (2005) Multiple imputation of missing values: update. Stata Journal 5:188-201.
[4] Li KH, Raghunathan TE, Rubin DB. (1991) Large-sample significance levels from multiply-imputed data using moment-based statistics and an F reference distribution. Journal of the American Statistical Association 86:1065-1073.

Also see

Online: help for mimstack