Chapter 1-7. Looping, Collapsing, and Reshaping

Looping through variables

We will use the births.dta dataset for this exercise. After modifying the “change directory”, the “cd” command(s) to match your own directory, we can bring the dataset into Stata using,

clear cd "C:\Documents and Settings\u0032770.SRVR\Desktop\" cd "Biostats & Epi With Stata\datasets & do-files" use births

Suppose you wanted to generate a frequency table, followed by descriptive statistics, for several variables in the dataset.

To loop through variables, we first need to know the order the variables are arranged in the Stata browser. We can do this with the describe command or just look at the order in the Variables window.

describe

Contains data from births.dta obs: 500 Data from 500 births vars: 9 20 Mar 2002 10:07 size: 15,500 (99.9% of memory free) ------storage display value variable name type format label variable label ------id float %9.0g identity number bweight float %8.0g birth weight in grams lowbw byte %9.0g low birth weight gestwks float %9.0g gestation period preterm byte %9.0g pre-term matage byte %8.0g maternal age hyp byte %8.0g hypertens sex byte %8.0g sex of baby sexalph str6 %9s sex coded as string ------Sorted by: id The variables are listed in the order they are left to right if we looked in the data browser or data editor. This is the same order they are shown in the Variables window.

Another useful version of the describe command is “describe , simple”, or ds, which just gives the variable names.

ds

id bweight lowbw gestwks preterm matage hyp sex sexalph

______

Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah Chapter 1-7 (revision 4 Mar 2011) p. 1 School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385

Chapter 1-7 (revision 4 Mar 2011) p. 2 Example 1. looping through entire variable list.

We don’t need to tabulate the id variable, so we will begin with the second variable and end with the last variable. We will use a foreach loop.

foreach x of varlist bweight-sexalph { tab `x' sum `x' }

<-- the “x” can be any name we want—it is just a nickname that will be replaced with actual variable names <-- the keyword “varlist” informs Stata that a variable list will follow <-- the “{“ and “}” tell Stata to execute everything in between for each variable in the varlist. Nothing can follow the opening brace on the first line. <-- the inward slanting single quote and the apostrophe tell Stata to replace what is inside with the name of that thing, not its value. Thus for each variable in turn, the x is replace with the variable name

This foreach loop has the same effect as

tab bweight sum bweight tab lowbw sum lowbw .... tab sexalph sum sexalph

Note: In the do-file editor, the line on the left notifies you that you are defining a loop, and it follows you down the page. You can terminate that action by putting anything but a space after the termininating brace, “}”. So, you might was well make it an asterik, “*”, which is a comment that will not produce an error message when you run it in the do-file (safe to just leave it there). It will look like this,

foreach x of varlist bweight-sexalph { tab `x' sum `x' } *

Chapter 1-7 (revision 4 Mar 2011) p. 3 Example 2. looping through non-adjacent variables

describe

Contains data from births.dta obs: 500 Data from 500 births vars: 9 20 Mar 2002 10:07 size: 15,500 (99.9% of memory free) ------storage display value variable name type format label variable label ------id float %9.0g identity number bweight float %8.0g birth weight in grams lowbw byte %9.0g low birth weight gestwks float %9.0g gestation period preterm byte %9.0g pre-term matage byte %8.0g maternal age hyp byte %8.0g hypertens sex byte %8.0g sex of baby sexalph str6 %9s sex coded as string ------Sorted by: id

This time, let’s only use bweight through preterm plus sex. In the variable list, we can mix and match the “through”, the “-” specifier, with just listing individual variables.

foreach x of varlist bweight-preterm sex { tab `x' sum `x’ }

foreach x of varlist bweight-preterm sex { tab `x' sum `x' }

A feature to help you create this variable list is a macro called “r(varlist)”, which is created, or returned, by the describe or ds command. To see this, use return list after the describe.

ds return list

. ds id bweight lowbw gestwks preterm matage hyp sex sexalph

. return list macros: r(varlist) : "id bweight lowbw gestwks preterm matage hyp sex sexalph"

Chapter 1-7 (revision 4 Mar 2011) p. 4 So, just use,

ds display r(varlist)

id bweight lowbw gestwks preterm matage hyp sex sexalph

Highlight this output with the mouse, and copy-and-paste it into the do-file editor after foreach x of varlist to help you build the loop.

Example 3. Turn scrolling off, saving output to log file, and printing the log file

In the previous two examples, the scrolling feature was bothersome, so we might want to turn it off.

Also, if we have a long list of variables, the beginning of the output will be erased from the Stata Results window memory and we won’t be able to look at it.

Finally, we might want to print the output.

set more off // turn scrolling off log using junk.smcl, replace // start logging to file foreach x of varlist bweight-preterm sex { tab `x' sum `x' } log close // stop logging to a file set more on // turn scrolling back on * print junk.smcl // print log file to the Windows default printer

To look at the output in the log file, click on the scroll icon (4th from left) on the Stata menu bar.

Chapter 1-7 (revision 4 Mar 2011) p. 5 Looping Over Numbers

We will use a simple example that has no research significance, just to learn the looping structure.

Let’s display the numbers 1, 3,4,5, 10, 15, 20, and 100, along with the numbers squared.

We could use,

foreach num of numlist 1 3/5 10(5)20 100 { display `num' _column(10) `num'*`num' } *

1 1 3 9 4 16 5 25 10 100 15 225 20 400 100 10000

The “3/5” gave us every integer between 3 and 5. The “10(5)20” gave us the numbers between 10 and 20, incrementing by 5. The “_column(10)” was tab to column 10.

Other options for the display command can be found using,

help display

A partial list is:

_skip(#) skips # columns _column(#) skips to the #th column _newline goes to a new line _newline(#) skips # lines _continue suppresses automatic newline at end of display command

Chapter 1-7 (revision 4 Mar 2011) p. 6 If you want to loop over a large number of equally spaced numbers, a more efficiently executed looping syntax is,

forvalues x = 1/10 { display `x' _column(10) `x'*`x' }

1 1 2 4 3 9 4 16 5 25 6 36 7 49 8 64 9 81 10 100

Inner and Outer Loops

Looping can be nested to as many levels as you desire. To create a multiplication table for numbers one to three, we could use the following,

* -- multiplication table: attempt 1 * r = row , c = col forvalues r = 1/3 { forvalues c = 1/3 { display "`r' x `c' = " `r'*`c' } }

1 x 1 = 1 1 x 2 = 2 1 x 3 = 3 2 x 1 = 2 2 x 2 = 4 2 x 3 = 6 3 x 1 = 3 3 x 2 = 6 3 x 3 = 9

Notice that the outer loop, with the r counter, gets set to a value. Then the inner loop runs through all of its numbers. Then the outer loop increments to its next value, and so on.

The display always ends with a carriage return, or newline, unless instructed not to.

Chapter 1-7 (revision 4 Mar 2011) p. 7 Let’s put the result in a square matrix.

* -- multiplication table: attempt 2 * r = row , c = col forvalues r = 1/3 { forvalues c = 1/3 { display `r'*`c' " " _continue } }

1 2 3 2 4 6 3 6 9

The suppressing of the newline, using “_continue” remained in effect, even though we went back out to the outer loop.

To fix that,

* -- multiplication table: attempt 3 * r = row , c = col forvalues r = 1/3 { forvalues c = 1/3 { display `r'*`c' " " _continue } display // display nothing goes to next line }

1 2 3 2 4 6 3 6 9

Notice that this square matrix is symmetric around the main diagonal, with the “2 3 6” in the upper right corner and in the lower left corner.

To eliminate the duplicates, so that we display just an upper triangular matrix,

* -- multiplication table (upper triangular matrix): attempt 1 * r = row , c = col forvalues r = 1/3 { forvalues c = `r'/3 { display `r'*`c' " " _continue } display // display nothing goes to next line }

1 2 3 4 6 9

Chapter 1-7 (revision 4 Mar 2011) p. 8 We got the right numbers, it just that we need to tab over to the right appropriately. To do that, we add the following two lines between the forvalues statements:

* -- multiplication table (upper triangular matrix): attempt 2 * r = row , c = col forvalues r = 1/3 { local m=(`r'-1)*2 display _skip(`m') _continue forvalues c = `r'/3 { display `r'*`c' " " _continue } display // display nothing goes to next line }

1 2 3 4 6 9

Here a we used a “local macro”, which can be a scalar or matrix. It only exists inside a program or do-file. If you go out to the Command window and try to use it, it will not be found.

Looping Over Strings

To loop over names, or string constants, that have no embedded spaces, we could use,

foreach x in Joe Mary Bill { display "My name is " "`x'" }

My name is Joe My name is Mary My name is Bill The double quotes are required by the display command to display a string. The inner single special quotes, `x’, mean evaluate the macro and insert it’ contents right at that position, as if you had typed in the contents yourself from the keyboard.

If the list has embedded spaces, use quotes around the strings,

foreach x in "Joe Brown" "Mary Jones" "Bill Smith" { display "My name is " "`x'" }

My name is Joe Brown My name is Mary Jones My name is Bill Smith

Chapter 1-7 (revision 4 Mar 2011) p. 9 While Loops

If we do not know in advance the range of numbers to loop over, a while loop structure is better. It has the syntax:

while exp { stata commands }

The looping continues until the expression is no longer true.

Here is a simple example,

local x=0 while `x' < 5 { local x=`x'+1 display `x' }

1 2 3 4 5

Notice it was necessary to initialize x (assign it a starting value) before entering the while loop. Then, inside the loop, it needs to eventually change to a value that evaluates to false in the while expression; otherwise, the loop will continue indefinitely (called an infinite loop).

Chapter 1-7 (revision 4 Mar 2011) p. 10 Scalars and Local Macros

In the preceding while loop, we used a “local macro”, which was a scalar (a matrix with one element) that exists only in a program or do-file which uses it.

To illustrate

local x=0 while `x' < 5 { local x=`x'+1 display `x' } display "x = " `x'

1 2 3 4 5

. display "x = " `x' x = 5

Now, in the Command window, which is outside of the do-file, execute the following line

display "x = " `x'

. display "x = " `x' x =

.

We see that x did not have a value outside of the do-file. It simply evaluated as a blank macro.

Chapter 1-7 (revision 4 Mar 2011) p. 11 A scalar, on the other hand, exists for the entire Stata session, inside or outside of the do-file or a program. Using a scalar this time,

scalar x=0 while x < 5 { scalar x=x+1 display x } display "x = " x

1 2 3 4 5

. display "x = " x x = 5

From the Command window,

display "x = " x

. display "x = " x x = 5

.

Notice that a scalar does not need the special single quotes around it. Those are used to evaulate a macro. A scalar is actual just a variable with one number. It also continues to exist outside of the do-file, or program, and remains until Stata is exited, or you use,

scalar drop x

The choice between local macros or scalars is largely just a user personal preference.

Chapter 1-7 (revision 4 Mar 2011) p. 12 Working with Multiple Observations Per Subject: Row calculations across several variables (means, sums, etc).

Let’s suppose we have 3 systolic blood pressure readings per patient.

clear input id sex sbp0 sbp15 sbp30 1 0 120 125 128 2 1 123 118 . end list

+------+ | id sex sbp0 sbp15 sbp30 | |------| 1. | 1 0 120 125 128 | 2. | 2 1 123 118 . | +------+

If all we wanted to do was compute a mean of these three blood pressures to use as a new variable in our analysis, we might try:

capture drop meansbp gen meansbp=(sbp0+sbp15+sbp30)/3 list

+------+ | id sex sbp0 sbp15 sbp30 meansbp | |------| 1. | 1 0 120 125 128 124.3333 | 2. | 2 1 123 118 . . | +------+

Notice that Stata used was is called listwise deletion of missing values in this computation. That is, if any of the variables have a missing value, the generated variable is set to missing.

What we want, however, is to compute the mean for the second subject as

gen meansbp=(sbp0+sbp15)/2 if id==2

We wouldn’t to it this way, however. A better approach is to use:

capture drop meansbp egen meansbp = rowmean(sbp0 sbp15 sbp30) // compute row mean omitting missing values list

+------+ | id sex sbp0 sbp15 sbp30 meansbp | |------| 1. | 1 0 120 125 128 124.3333 | 2. | 2 1 123 118 . 120.5 | +------+

This time we get the correct result, where the function rowmean sums the nonmissing observations and divides by the correct number of nonmissing observations. Chapter 1-7 (revision 4 Mar 2011) p. 13 Use help egen to see a long list of such row functions, for computing sums, minimum, maximums, etc.

The egen command (extension to generate) is a generate command that only works using the functions that are specific to this command.

Reshaping the Data Structure

In some statistical packages, such as SPSS, the software expects the data to be arranged in wide data structure. In most other statistical packages, such as Stata, SAS, MLwiN, S-Plus, and Spida, the software expects the data to be arranged in long data structure for some commands.

Wide data structure

id sex sbp0 sbp15 sbp30 1 0 120 125 128 2 1 123 118 .

Long data structure

id time sex sbp 1 0 0 120 1 15 0 125 1 30 0 128 2 1 1 123 2 15 1 118

Generally, Stata only requires the long format if we are using regression models for longitudinal (repeated measurements) data, or if we have several measurements on the same individuals which are not necessarily in any time order.

In Stata, we can switch from one structure to the other using the reshape command.

Chapter 1-7 (revision 4 Mar 2011) p. 14 We will start with the dataset with a wide structure.

clear input id sex sbp0 sbp15 sbp30 1 0 120 125 128 2 1 123 118 . end list

+------+ | id sex sbp0 sbp15 sbp30 | |------| 1. | 1 0 120 125 128 | 2. | 2 1 123 118 . | +------+

To convert this to long structure, we use

reshape long sbp, i(id) j(time) // convert to long structure list

+------+ | id time sex sbp | |------| 1. | 1 0 0 120 | 2. | 1 15 0 125 | 3. | 1 30 0 128 | 4. | 2 0 1 123 | 5. | 2 15 1 118 | |------| 6. | 2 30 1 . | +------+

In this command, the “sbp” is called the stub variable (prefix variable would be a more intuitive name). Stata used the j subscript variable, time, to store the suffix that followed “sbp” in the variable names.

Stata used the i subscript variable to identify what variable identifies each subject. This variable has to contain a unique number across the rows of data, or it will get confused (see box).

It then just duplicated the values of all the other variables in the file and placed them on each newly created row.

Chapter 1-7 (revision 4 Mar 2011) p. 15 What happens if i( ) variable is not unqiue for each observation (each row).

* -- intentional error -- preserve clear input id sex sbp0 sbp15 sbp30 1 0 120 125 128 1 1 123 118 . end list reshape long sbp, i(id) j(time) // will crash because id not unique list restore

. reshape long sbp, i(id) j(time) (note: j = 0 15 30) i=id does not uniquely identify the observations; there are multiple observations with the same value of id. Type "reshape error" for a listing of the problem observations. r(9);

The do-file crashes because Stata needs to have a unique value for every row for the i( ) variable.

You can quickly create such a variable using the “_n” variable, which is Stata’s internal variable for observation number, or row number, in the data editor/browser. Use this in a generate statement before the reshape command, like so:

preserve clear input id sex sbp0 sbp15 sbp30 1 0 120 125 128 1 1 123 118 . end list gen id2 = _n list reshape long sbp, i(id2) j(time) list restore

+------+ | id sex sbp0 sbp15 sbp30 id2 | |------| 1. | 1 0 120 125 128 1 | 2. | 1 1 123 118 . 2 | +------+

Chapter 1-7 (revision 4 Mar 2011) p. 16 With our data currently in long format,

+------+ | id time sex sbp | |------| 1. | 1 0 0 120 | 2. | 1 15 0 125 | 3. | 1 30 0 128 | 4. | 2 0 1 123 | 5. | 2 15 1 118 | |------| 6. | 2 30 1 . | +------+ if we wanted to convert this to wide structure, we would use

reshape wide sbp, i(id) j(time) // convert to wide structure list

+------+ | id sbp0 sbp15 sbp30 sex | |------| 1. | 1 120 125 128 0 | 2. | 2 123 118 . 1 | +------+

This time, Stata took the value in the j( ) subscribe variable and appended it to the prefix we specified in the stub variable, sbp.

Chapter 1-7 (revision 4 Mar 2011) p. 17 Working with Multiple Observations Per Subject: Column calculations across several observations (means, sums, etc).

Put the data back into long structure:

reshape long sbp, i(id) j(time) list

+------+ | id time sbp sex | |------| 1. | 1 0 120 0 | 2. | 1 15 125 0 | 3. | 1 30 128 0 | 4. | 2 0 123 1 | 5. | 2 15 118 1 | |------| 6. | 2 30 . 1 | +------+

This time, we use the following to compute the mean SBP:

capture drop meansbp bysort id: egen meansbp=mean(sbp) list

+------+ | id time sbp sex meansbp | |------| 1. | 1 0 120 0 124.3333 | 2. | 1 15 125 0 124.3333 | 3. | 1 30 128 0 124.3333 | 4. | 2 0 123 1 120.5 | 5. | 2 15 118 1 120.5 | |------| 6. | 2 30 . 1 120.5 | +------+

Notice Stata ignored the missing value and calculated the mean based on the number of nonmissing observations. That is what we wanted.

Chapter 1-7 (revision 4 Mar 2011) p. 18 Tagging multiple observations per subject in long format

Given our data are on six rows (six observations), Stata will think we have a sample size of 6. Actually, we only have a sample size of 2.

+------+ | id time sbp sex meansbp | |------| 1. | 1 0 120 0 124.3333 | 2. | 1 15 125 0 124.3333 | 3. | 1 30 128 0 124.3333 | 4. | 2 0 123 1 120.5 | 5. | 2 15 118 1 120.5 | |------| 6. | 2 30 . 1 120.5 | +------+

If we asked for a frequency table, then, we would get

tab meansbp

meansbp | Freq. Percent Cum. ------+------120.5 | 3 50.00 50.00 124.3333 | 3 50.00 100.00 ------+------Total | 6 100.00

One approach to insuring the correct sample size is used is to create a tag variable.

egen tag = tag(id), missing list

+------+ | id time sbp sex meansbp tag | |------| 1. | 1 0 120 0 124.3333 1 | 2. | 1 15 125 0 124.3333 0 | 3. | 1 30 128 0 124.3333 0 | 4. | 2 0 123 1 120.5 1 | 5. | 2 15 118 1 120.5 0 | |------| 6. | 2 30 . 1 120.5 0 | +------+

The tag function of the egen command tags one observation in each distinct group of the varlist

egen tag = tag(varlist)

The result is 1 or 0, according to whether the observation is tagged, and never missing.

When the missing option is specified,

egen tag = tag(varlist), missing Chapter 1-7 (revision 4 Mar 2011) p. 19 the first observation of a distinct group of varlist is still tagged, even if the group is all missing. To use this tag variable, simply include if tag in any command you want to limit the analysis to the actual sample size.

tab meansbp if tag // limit to tagged observations

meansbp | Freq. Percent Cum. ------+------120.5 | 1 50.00 50.00 124.3333 | 1 50.00 100.00 ------+------Total | 2 100.00

Notice we can use “if tag”, instead of “if tag==1” if we want to, and we still get the same result.

tab meansbp if tag==1 // same thing

meansbp | Freq. Percent Cum. ------+------120.5 | 1 50.00 50.00 124.3333 | 1 50.00 100.00 ------+------Total | 2 100.00

If a variable is given in an if statement, without an expression, stata interprets it as “true” if a nonzero value is found. In this case we get the right result, which we always do with the tag variable. However, you have to be careful with this shortcut for other variables. Stata stores missing values as a “very large number”, or + for all practical purposes. So, if we take this shortcut for a variable that has missing values, it selects those observations as well.

list if sbp // shortcut method list if sbp~=. // better to just use this safer method

. list if sbp // shortcut method

+------+ | id time sex sbp tag | |------| 1. | 1 0 0 120 1 | 2. | 1 15 0 125 0 | 3. | 1 30 0 128 0 | 4. | 2 0 1 123 1 | 5. | 2 15 1 118 0 | |------| 6. | 2 30 1 . 0 | +------+

. list if sbp~=. // better to just use this safer method

+------+ | id time sex sbp tag | |------| 1. | 1 0 0 120 1 | 2. | 1 15 0 125 0 | 3. | 1 30 0 128 0 | 4. | 2 0 1 123 1 | 5. | 2 15 1 118 0 | Chapter 1-7 (revision 4 Mar 2011) p. 20 +------+

Chapter 1-7 (revision 4 Mar 2011) p. 21 Collapsing Long Format

Alternatively, we could simply collape the dataset down to one observation per subject.

Starting over:

clear input id sex sbp0 sbp15 sbp30 1 0 120 125 128 2 1 123 118 . end reshape long sbp, i(id) j(time) list

+------+ | id time sex sbp | |------| 1. | 1 0 0 120 | 2. | 1 15 0 125 | 3. | 1 30 0 128 | 4. | 2 0 1 123 | 5. | 2 15 1 118 | |------| 6. | 2 30 1 . | +------+

We want the ID to go to its unique value time to just drop out of the collapsed dataset sex to go to its unique value sbp to go to its mean

collapse sex (mean) sbp, by(id) list

+------+ | id sex sbp | |------| 1. | 1 0 124.3333 | 2. | 2 1 120.5 | +------+

Variables, such as time, which are not specified on the collapse command are simply dropped.

All variables preceding the ( ) are reduced to their mean by default, so this was the same thing as collapse (mean) sex sbp, by(id)

All variables following a ( ) have the operation performed on them, such as (mean), (sum), or (max).

The by( ) defines “over each unique value” in the by-list.

Note: Be sure you save your dataset before you do this, or at least can easily recreate it using your do-file—alternative, you can use preserve and restore to un-collapse.

Chapter 1-7 (revision 4 Mar 2011) p. 22 Preserve and Restore

If you want to get your data back to the state it was before the collapse, use preserve and restore.

Starting with the original dataset in long format,

clear input id sex sbp0 sbp15 sbp30 1 0 120 125 128 2 1 123 118 . end reshape long sbp, i(id) j(time) list

+------+ | id time sex sbp | |------| 1. | 1 0 0 120 | 2. | 1 15 0 125 | 3. | 1 30 0 128 | 4. | 2 0 1 123 | 5. | 2 15 1 118 | |------| 6. | 2 30 1 . | +------+ This time we will preserve the data before we collapse, and then restore it afterwards.

preserve // save copy of data in this original state collapse sex (mean) sbp, by(id) list restore // restore data into original state list

. preserve // save copy of data in this original state . collapse sex (mean) sbp, by(id) . list

+------+ | id sex sbp | |------| 1. | 1 0 124.3333 | 2. | 2 1 120.5 | +------+

. restore // restore data into original state . list

+------+ | id time sex sbp | |------| 1. | 1 0 0 120 | 2. | 1 15 0 125 | 3. | 1 30 0 128 | 4. | 2 0 1 123 | 5. | 2 15 1 118 | |------| 6. | 2 30 1 . | +------+

Chapter 1-7 (revision 4 Mar 2011) p. 23 Example (graphs with error bars)

One possible use of the collapse command is to draw graphs with error bars.

use births, clear sum gestwks

Variable | Obs Mean Std. Dev. Min Max ------+------gestwks | 490 38.72186 2.314167 24.69 43.16

We see that gestational has two decimal places, so we will need to round it to the nearest week.

If we went to the help for egen, we would not find the round function. To find it, search the help for “functions”. You will see a help screen that looks like:

help functions

[D] functions -- Functions in expressions

Quick references are available for the following types of functions:

+------+ | Type of function | See help | |------+------| | Mathematical functions | math functions | | | | | Probability distributions and | | | density functions | density functions | | | | | Random-number functions | random-number functions | | | | | String functions | string functions | | | | | Programming functions | programming functions | | | | | Date and time functions | dates and times | | | | | Selecting time spans | time-series functions | | | | | Matrix functions | matrix functions | | | | +------+

Click on math functions, which will give you a long list of functions:

Mathematical functions

abs(x) Domain: -8e+307 to 8e+307 Range: 0 to 8e+307 Description: returns the absolute value of x.

. . .

round(x,y) or round(x) Domain x: -8e+307 to 8e+307 Domain y: -8e+307 to 8e+307 Range: -8e+307 to 8e+307 Chapter 1-7 (revision 4 Mar 2011) p. 24 Description: returns x rounded in units of y or x rounded to the nearest integer if the argument y is omitted. All functions in Stata work with the generate, or gen, command, except for the very small list of functions you find listed when you search on egen.

We create a rounded gestational age variable using,

gen rndgestwks=round(gestwks) tab rndgestwks

rndgestwks | Freq. Percent Cum. ------+------25 | 1 0.20 0.20 27 | 2 0.41 0.61 28 | 2 0.41 1.02 31 | 6 1.22 2.24 32 | 3 0.61 2.86 33 | 5 1.02 3.88 34 | 6 1.22 5.10 35 | 9 1.84 6.94 36 | 21 4.29 11.22 37 | 34 6.94 18.16 38 | 84 17.14 35.31 39 | 107 21.84 57.14 40 | 128 26.12 83.27 41 | 66 13.47 96.73 42 | 14 2.86 99.59 43 | 2 0.41 100.00 ------+------Total | 490 100.00 We see that there is too little data for gestational ages outside of the 36 to 41 weeks range, particularly if we produce a graph separately for male and female fetuses.

To make sure we can get back to our original data, since we will immediately return to other analyses, we could type the following in the do-file, and then do our graph work inside this preserve-restore block. (Alternatively, we could just read the data back in later, which is usually just as easy.)

* -- begin graph preserve

restore * -- end graph

Let’s begin by limiting the dataset to weeks 36 to 41.

keep if rndgestwks>=36 & rndgestwks<=41 tab rndgestwks

rndgestwks | Freq. Percent Cum. ------+------36 | 21 4.77 4.77 37 | 34 7.73 12.50 38 | 84 19.09 31.59 39 | 107 24.32 55.91 40 | 128 29.09 85.00 Chapter 1-7 (revision 4 Mar 2011) p. 25 41 | 66 15.00 100.00 ------+------Total | 440 100.00

We are going to apply the same formula for a 95% CI that the t-test output uses, so that we can check our work. Doing this for week 41,

ttest bweight if rndgestwks==41, by(sex) // use to check our work

Two-sample t test with equal variances ------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ------+------1 | 35 3589.057 72.47286 428.7552 3441.775 3736.34 2 | 31 3352.742 62.90985 350.2672 3224.263 3481.221 ------+------combined | 66 3478.061 50.28804 408.542 3377.628 3578.493 ------+------diff | 236.3152 97.15402 42.22775 430.4027 ------diff = mean(1) - mean(2) t = 2.4324 Ho: diff = 0 degrees of freedom = 64

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.9911 Pr(|T| > |t|) = 0.0178 Pr(T > t) = 0.0089

We will use this output momentarily.

Now we collapse the data by sex to get means, standard deviations, and sample sizes, which we will need to compute 95% confidence intervals.

collapse (mean) mbw=bweight (sd) sdbw=bweight /// (count) nbw=bweight, by(rndgestwks sex) list , abbrev(15)

+------+ | sex rndgestwks mbw sdbw nbw | |------| 1. | 1 36 2693.09 348.534 11 | 2. | 2 36 2655.5 532.578 10 | 3. | 1 37 2840.55 496.511 20 | 4. | 2 37 2834.5 332.788 14 | 5. | 1 38 3195.75 529.639 40 | |------| 6. | 2 38 2870.05 450.433 44 | 7. | 1 39 3307.76 453.279 59 | 8. | 2 39 3262.83 309.326 48 | 9. | 1 40 3447.01 500.553 70 | 10. | 2 40 3290.69 417.706 58 | |------| 11. | 1 41 3589.06 428.755 35 | 12. | 2 41 3352.74 350.267 31 | +------+

Notice this time we used variable names to store the results into, rather than just keeping the original variable name, which was done in the preceding example.

Chapter 1-7 (revision 4 Mar 2011) p. 26 Now, we could construct a confidence interval around the mean using the large sample normal approximation confidence interval (CI),

mean  1.96  SE. However, a more appropriate CI is constructed from the t-distribution, rather than the normal distribution (1.96 bounds the middle 95% of the normal distribution). The t-test output uses a CI constructed from the t-distribution. For small sample sizes, the 1.96 approach gives an interval that is too narrow. For infinite sample sizes, the two approaches are identical.

We do this by replacing 1.96, which comes from the standard normal distribution, with the value of the t distribution, for our given sample size, that provides the middle 95% of the area under the t distribution curve.

gen bwlcl=mbw-invttail(nbw-1,0.025)*sdbw/sqrt(nbw) gen bwucl=mbw+invttail(nbw-1,0.025)*sdbw/sqrt(nbw) list , abbrev(15)

+------+ | sex rndgestwks mbw sdbw nbw bwlcl bwucl | |------| 1. | 1 36 2693.09 348.534 11 2458.943 2927.239 | 2. | 2 36 2655.5 532.578 10 2274.517 3036.483 | 3. | 1 37 2840.55 496.511 20 2608.176 3072.925 | 4. | 2 37 2834.5 332.788 14 2642.354 3026.646 | 5. | 1 38 3195.75 529.639 40 3026.363 3365.137 | |------| 6. | 2 38 2870.05 450.433 44 2733.101 3006.99 | 7. | 1 39 3307.76 453.279 59 3189.637 3425.888 | 8. | 2 39 3262.83 309.326 48 3173.014 3352.652 | 9. | 1 40 3447.01 500.553 70 3327.662 3566.367 | 10. | 2 40 3290.69 417.706 58 3180.86 3400.52 | |------| 11. | 1 41 3589.06 428.755 35 3441.775 3736.34 | 12. | 2 41 3352.74 350.267 31 3224.263 3481.221 | +------+ Comparing this to the t-test output for week 41 above,

Two-sample t test with equal variances ------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ------+------1 | 35 3589.057 72.47286 428.7552 3441.775 3736.34 2 | 31 3352.742 62.90985 350.2672 3224.263 3481.221 ------+------combined | 66 3478.061 50.28804 408.542 3377.628 3578.493 ------+------diff | 236.3152 97.15402 42.22775 430.4027 ------diff = mean(1) - mean(2) t = 2.4324 Ho: diff = 0 degrees of freedom = 64

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.9911 Pr(|T| > |t|) = 0.0178 Pr(T > t) = 0.0089

We see that our calculations were correct.

Chapter 1-7 (revision 4 Mar 2011) p. 27 We now have the data set up for constructing the graphs. We could have run a bunch of ttests to get these numbers and then inputted them back in to set up the data for graphing. The collapse way is just more automated.

Chapter 1-7 (revision 4 Mar 2011) p. 28 To get a line graph with 95% CI error bars, we use

capture drop rndgestwks2 gen rndgestwks2=rndgestwks+0.1 if sex==2 sort rndgestwks #delimit ; twoway (line mbw rndgestwks if sex==1, clcolor(blue) lwidth(*1.5)) (line mbw rndgestwks2 if sex==2, clcolor(green) lwidth(*1.5)) (rcap bwucl bwlcl rndgestwks if sex==1, lwidth(*1.5) lcolor(blue) msize(*1.5)) (rcap bwucl bwlcl rndgestwks2 if sex==2, lwidth(*1.5) lcolor(green) msize(*1.5)) , legend(off) xtitle("gestational age (weeks)") ytitle("birthweight (kg)") text(3400 37.5 "Males" ) text(2700 38.75 "Females" ) ; #delimit cr 0 0 0 4 0 0 5 3 Males ) g k (

t h 0 g 0 i 0 e 3 w h t r i b Females 0 0 5 2 0 0 0 2 36 37 38 39 40 41 gestational age (weeks)

Chapter 1-7 (revision 4 Mar 2011) p. 29