
Slide 1

Statistics and Epidemiology

Review of Book 7 SEER Program Self Instructional Manual for Registrars

Cancer Program Outcomes, Quality and Data Utilization Course , Epidemiology and Data Utilization Module

When or where do you use statistics in your everyday operations? Annual reports, data requests? How about evaluating quality or productivity, cancer conference documentation, survey documentation, budgets…just to name a few?

If you stop and think about it, you may see that you are already using many of the statistical techniques discussed in Book 7. You may just not think of yourself as a “statistician”.

This presentation will provide a brief overview of the statistical and epidemiological methods introduced in Book 7: Statistics and Epidemiology for Cancer Registries. You may find it helpful to follow along in your book.

Slide 2

Statistics

Branch of mathematics
Collection, summarization, analysis, interpretation, and presentation of masses of numerical data

We will begin with statistics.

Statistics is the science of gathering and interpreting facts and figures. Statistics can be further described as descriptive or inferential.

Descriptive statistics will be discussed in more detail in the following slides.

Inferential statistics refers to the procedures used to make a statement about a population based upon the results of a sample. Inferential statistics is related to the methods applied in hypothesis testing. Cancer registrars would usually seek assistance from a statistician or researcher before attempting to conduct a study using hypothesis testing.

Slide 3

Statistical Analysis

Summarize the essential features and relationships of the data Reveal the major characteristics of the patient group Determine broad patterns of behavior or tendencies

Statistical Analysis is the process of: -summarizing data -identifying the major characteristics of the patient group based on the data -then using the data and those major characteristics identified to determine broad patterns of behavior of a group of people.

Slide 4

Descriptive Statistics

SEER Book 7, Section B

Descriptive statistics uses numerical summaries to describe an observed distribution.

Study Hint: Have you ever read a definition of a term or phrase and you had to look up the definition of the words contained in the original definition?

Try to think of the principles of statistics as stepping stones. For example, you need to know how to add and subtract before you can do more advanced math. It is important to have a clear understanding of the basic terminology. You will see these terms used again and again as we get into the more complicated side of statistics and epidemiology.

For example: Once you have a clear understanding of the term “mean”, the same principles apply to any other definition or phrase that includes the word “mean”. A mean tumor size or mean survival time is still based on calculating an average.

Slide 5

Shorthand notation

X = value of an observation (X = …)
∑ = sum of the values of X
n = number (count) of observations in our group
n – 1 = degrees of freedom (the number of observations that are free to vary)
X̄ = mean, average value
SD = standard deviation
√ = square root

This slide contains a reference key of symbols that will be used in the formulas on the upcoming slides. It is not necessary to memorize these symbols. It is more for your own use during your independent study time and for understanding the shorthand used in the workbook.

The calculation of standard deviation introduces the term “degrees of freedom”. If you have 10 observations and the sum of the observations is 100, the first 9 (10 - 1) can be any numbers. The last number cannot. The last number has to be the number that, when added to the first nine, equals 100. Therefore, 9 of the numbers are free to vary.
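The degrees-of-freedom idea can be checked with a few lines of Python. This is only a sketch: the nine "free" values below are made up for illustration, and the only figures taken from the notes are the 10 observations and the total of 100.

```python
# Hypothetical illustration: 10 observations that must sum to 100.
free_values = [12, 7, 15, 9, 11, 8, 10, 14, 6]  # any nine numbers we like
required_total = 100

# Once nine values are chosen, the tenth is forced by the required total.
last_value = required_total - sum(free_values)
observations = free_values + [last_value]

degrees_of_freedom = len(observations) - 1
print(last_value, degrees_of_freedom)
```

Changing any of the nine free values changes the last value, but never the count of values that were free to vary.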

Slide 6

How do we summarize a set of data?

Characterize a set of data in terms of:

1) Central values about which the data tend to cluster

2) The amount of spread or the dispersion of the observations

If you will remember, statistics is the collection and summation of masses of numerical data, so… If measurable characteristics among individuals did not vary, describing a set of data would be completed after the first observation.

Example: If everyone’s blood pressure was 110/70, then there would be no need to take a blood pressure reading at each doctor visit.

We will see this concept again when we look at the normal curve.

Slide 7

Measures of Central Tendency

Central values about which the data tend to cluster “Typical” values Example: Average tumor size

Measures of central tendency are important because measurable characteristics (such as age and stage) vary from individual to individual. Therefore, we need to summarize the data in order to analyze the results.

Typical values = what you see most often.

The graphic at the bottom is a visual reminder. When you think of measures of central tendency, think about how the data clusters in the middle of the observations.

Slide 8

Measures of Central Tendency

Widely used measures of central tendency

Mean
Median
Mode

The tools we use to calculate the typical values are the 3 M’s: mean, median, and mode. To help describe mean, median, and mode, we will use an example of five tumor sizes.

Slide 9

Mean

Average (X̄)
Influenced more by extreme values

\[
\bar{X} = \frac{\text{Sum of all values}}{\text{\# of values}} = \frac{\sum X}{n} = \frac{8+5+3+6+3}{5} = \frac{25}{5} = 5
\]

Mean = the statistical verbiage for average.

Q: What is the mean in our example of tumor sizes? A: Add all of the values together (25) and divide by the number of values (5). The mean is 5.

Q: What would happen if we added an extreme value (20) to our set of tumor sizes? A: Add 20 to the list of tumor sizes. Mean = 45 / 6 = 7.5.

As you can see, the mean is influenced more by extreme values than the median and mode are.

Note: It is not necessary to memorize the formulas in this presentation. They are provided only to help with demonstrating and understanding the use of the terms and definitions.

Slide 10

Median

Middle value Sort the observations in order from smallest to largest Stable measure

Unsorted: 8 5 3 6 3
Sorted: 3 3 5 6 8

Median is the 50th percentile – ½ of the values are smaller and ½ are larger

Q: What is the median in our example of tumor sizes? A: First we have to sort them in order from smallest to largest (click), then take the middle value (click). Median is 5 (click).

Q: What would happen to our median if we added an extreme value? Let’s use 20 again. A: Add 20 to the list of tumor sizes - 3,3,5,6,8,20. The middle falls between 5 and 6. If the middle falls between two values, then average the two middle values. Median is now 5.5.

Stable measure – adding extreme values to a set of observations tends to cause only a limited change in the value of the median

Slide 11

Mode

Most frequently seen value

3 3 5 6 8

There may be no modal value: (3, 5, 6, 8)
There may be more than one modal value: (3, 3, 5, 6, 6, 8)

The mode is the value that occurs most frequently. A distribution with two most-common values is called a bimodal distribution.
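The three measures of central tendency can be reproduced with Python's standard `statistics` module. The sketch below uses the five tumor sizes from the slides and then adds the extreme value of 20 to show how each measure responds.

```python
from statistics import mean, median, multimode

tumor_sizes = [8, 5, 3, 6, 3]        # the five tumor sizes from the slides
with_extreme = tumor_sizes + [20]    # same list with the extreme value added

for label, values in (("original", tumor_sizes), ("with 20 added", with_extreme)):
    # multimode returns a list holding every most-frequent value
    print(label, mean(values), median(values), multimode(values))
```

One design note: `multimode` returns every value when all values tie, so for (3, 5, 6, 8) it returns all four numbers, which corresponds to the "no modal value" case on the slide. For (3, 3, 5, 6, 6, 8) it returns both 3 and 6, the bimodal case.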

Slide 12

Measures of Variation

Amount of spread or dispersion of the observations Example: Fluctuation of tumor sizes

Again, the graphic at the bottom is a visual reminder. When you think of measures of variation, think about how the data spreads out along the observations.

Slide 13

Measures of Variation

Widely used measures of variation

Range
Standard Deviation (SD)

The tools we use to measure variation are range and standard deviation.

The standard deviation is a companion to the mean.

Q: What type of measure is the mean? A: Central Tendency (typical values, clustering in the middle)

Q: And, what type of measure is the standard deviation? A: Variation (spread, dispersion)

So, the standard deviation expresses the spread of data about the mean.

Make a mental note of this. You will see this concept again when we talk about the normal curve.

Slide 14

Range

Difference between the highest and lowest values Easiest measure of variation Greatly influenced by extreme values

Highest # - Lowest # = 8 – 3 = 5

Easiest measure…it’s just simple subtraction.

Q: What is the range in our example of tumor sizes? A: The highest number was 8. The lowest number was 3. 8-3 = 5.

Q: What would happen to the range if we added an extreme value. Let’s use 20 again? A: 20 – 3 = 17

As you can see, range is greatly influenced by extreme values.

Study Hint: An exam question may not be worded exactly as it is seen on the slides. It may be asked with a slight twist, so to speak. We’ve talked mostly in terms of measures that are most influenced by extreme values. A good exam question that comes to mind is:
Q: Given a set of values, which is least likely to be influenced by an extreme value? Mean, Median, Mode, Range
A: Median

Slide 15

Standard Deviation

How far the observations tend to vary from the mean
Square root of variance

\[
SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}} = \sqrt{\frac{18}{5-1}} = \sqrt{\frac{18}{4}} = \sqrt{4.5} = 2.12
\]

Standard Deviation….sounds and looks complicated, but the calculation is based on fairly simple mathematics.

Q: First of all, how did we get 18 in the example above? A: Remember, we are still using our example of 5 tumor sizes (8,5,3,6,3) and the symbols used here are from our reference key. Take each value from our list of 5 tumor sizes (X), subtract the mean (which is 5, refer to slide 9 if needed), square it, and add them together.
(8-5)² + (5-5)² + (3-5)² + (6-5)² + (3-5)²
= (3)² + (0)² + (-2)² + (1)² + (-2)²
= 9 + 0 + 4 + 1 + 4
= 18

Q: How did we get the number 4 in the equation? A: This is the degrees of freedom (refer to slide 5 if necessary). The total number of values in our list of tumor sizes was 5. Therefore, n=5. n-1 or 5-1=4.

Variance = 18 divided by 4 = 4.5. Standard deviation is the square root of the variance.

Q: So, what does a standard deviation of 2.12 tell us? A: This tells us that using our example of tumor sizes, our observations will tend to fall within plus or minus 2.12 of the mean. Our mean is 5. So in other words, our observations will tend to be between 2.88 (5 - 2.12) and 7.12 (5 + 2.12).

The numbers in our example ranged from 3 to 8…which corresponds with the standard deviation. Significance of the standard deviation will be discussed later with the normal curve. Remember our measures of central tendency (mean, median, mode) focus on the center of our observations. So the standard deviation describes how the data spreads out on either side of the mean. The more widely the values are spread out, the larger the standard deviation.
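The range and standard deviation calculations above can be traced step by step in Python. This is a minimal sketch that follows the slide's formula with n - 1 in the denominator; the variable names are illustrative only.

```python
from math import sqrt
from statistics import stdev

tumor_sizes = [8, 5, 3, 6, 3]
n = len(tumor_sizes)
mean_size = sum(tumor_sizes) / n

# Range: highest value minus lowest value
data_range = max(tumor_sizes) - min(tumor_sizes)

# Standard deviation: square root of (sum of squared deviations / degrees of freedom)
sum_of_squares = sum((x - mean_size) ** 2 for x in tumor_sizes)
variance = sum_of_squares / (n - 1)
sd = sqrt(variance)

print(data_range, sum_of_squares, variance, round(sd, 2))
```

The standard library's `stdev` function uses the same n - 1 denominator, so it can serve as a cross-check on the hand calculation.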

Q: Why do we use the square root of the variance? A: The variance is a squared number (a number that has been squared). The mean that we calculated is not a squared number. When making comparisons, you want your units to be the same. Therefore, we take the square root of the variance so it will be in the same units as the mean – a number that is not squared.

Slide 16

Kinds of Data

Qualitative: Nominal, Ordinal
Quantitative: Interval, Ratio
Discrete
Continuous

Qualitative (Categories, Classes): Serves to classify data into related groups. It is not a measurement value.

There are two types of qualitative data: Nominal and Ordinal.

Nominal (Classification) – Assigning a name. Has no numerical meaning. Example: male = 1, female = 2, True/False, Yes/No. You can’t add all of the 2’s together to tell you how many female cases you have.

Ordinal (Ranked) – placed into order or rank. Example: Stage 0 – IV, Grade. A higher number represents a different degree of severity, etc.

Quantitative: Actual values or measures

There are also two types of quantitative data: Interval and Ratio.

Interval (Arbitrary zero) – no set starting point. Example: Survival. The starting point depends on when you were diagnosed.

Ratio (Absolute zero) - zero = none, starts at zero. Example: Tumor size, Age, Amount of radiation

Discrete: can have only a particular set of values. Example: A person can own 1 car, but not 1.5 cars.

Continuous: Variable can have more precise values with further refinements of the measuring scale. Example: Approximately 30 years old can be refined to 30 years 4 months, which can be refined to 30 years 4 months 2 days, and so on.

Slide 17

Preparation of Reports

First step is to define the problem

Define the objective and scope Select, Assemble, Present, and Analyze the data

So, we now know the options for summarizing the data, but how do we go about preparing reports about the data? Begin by asking the following questions: What information does the user want? What information is available in the registry? Are the data routinely collected by the registry or will it require a collection of additional data?

The objective and scope is what the purpose of the final data will be (what it will be used for).

Slide 18

Selection of Cases

Determine the cases to be included
Determine the data items
Population
Sample
Random Sample

Selection methods apply to both statistics and epidemiology techniques.

When selecting data, if all cases are not to be used then bias should be avoided. Clearly define the population to be studied. A sample should be a random sample.

Bias: Tendency of a statistical estimate to deviate in one direction from the true value. The selection does not fairly represent the population being studied.
Population: Any set of individuals having some common observable characteristic.
Sample: A subset of the population under study that represents the entire population.
Random Sample: Each individual has an equal and independent chance of being chosen. To select a random sample, a random numbers table or a systematic method (every 5th case) can be used.

Using a random sample will help avoid bias.
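The two selection methods mentioned above, a random draw and a systematic "every 5th case" selection, can be sketched with Python's `random` module. The case numbers, sample size, and seed below are all hypothetical and chosen only for illustration.

```python
import random

case_ids = list(range(1, 101))   # hypothetical accession numbers 1 through 100
rng = random.Random(2024)        # fixed seed only so the sketch is repeatable

# Simple random sample: 20 cases drawn without replacement
simple_random_sample = rng.sample(case_ids, k=20)

# Systematic sample: random starting point, then every 5th case
start = rng.randrange(5)
systematic_sample = case_ids[start::5]

print(sorted(simple_random_sample))
print(systematic_sample)
```

Both approaches pick 20 of the 100 cases here. The random starting point in the systematic method is an added assumption, used so that the first case on the list is not always selected.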

Slide 19

Selection of Cases

Bias: Misclassification, Selection
Sensitivity, Specificity, Positive Predictive Value
Random: Equal, Independent

Bias results from:
Misclassification – Individuals are assigned to the wrong group.
Selection – Individuals do not have the same opportunity to be included.
Confounding – Mixing up the effects from several factors. Cases share certain characteristics or prognostic factors (age, stage of disease).

Screening: A method of searching for occult or early disease. Q: Does screening have to be a formal screening program such as a skin screening that takes place at the mall? A: No, it could be a test that is performed during a routine physical, such as a PSA.

Sensitivity: ability of a test to find diseased people (screening test is positive when the person is sick). This is also known as the true positive rate.
Specificity: ability of a test to find well people (screening test is negative when the person is not sick). This is also known as the true negative rate.

Positive Predictive Value: percent of people who are said to have disease that actually do have disease (the probability that a person has the disease given a positive test result).

And of course, there are opposites to these: False negative: a person who tests as negative but who is actually positive False positive: a person who tests as positive but who is actually negative
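These screening measures are simple ratios of true and false results. The sketch below assumes a hypothetical screening of 1,000 people; the counts are invented, and the function name `screening_measures` is illustrative rather than taken from Book 7.

```python
def screening_measures(tp, fp, fn, tn):
    """Return (sensitivity, specificity, positive predictive value) from test counts."""
    sensitivity = tp / (tp + fn)   # diseased people correctly found by the test
    specificity = tn / (tn + fp)   # well people correctly found by the test
    ppv = tp / (tp + fp)           # positive tests that truly have the disease
    return sensitivity, specificity, ppv


# Hypothetical counts: 100 people with disease, 900 without
sens, spec, ppv = screening_measures(tp=90, fp=50, fn=10, tn=850)
print(f"Sensitivity {sens:.1%}, specificity {spec:.1%}, PPV {ppv:.1%}")
```

With these made-up counts, the test finds most diseased people, yet a noticeable share of positive results are false positives, which is why the positive predictive value is reported separately.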

Equal: each name has an equal chance of being picked.
Independent: selecting a certain name is not affected by the names previously selected.
Examples: drawing names out of a hat and putting each name back in the hat before the next draw (so every name has an equal chance of being selected on the next draw). A random numbers table may also be used.

Slide 20

Assemble the Data

Review for obvious errors or highly unusual cases
Summarize the data
Mutually exclusive categories
Frequency Distributions
Counts
Relative Frequencies

We will take a look at assembling the data using mutually exclusive categories and frequency distributions. But remember, first you should review the report results for obvious errors or highly unusual cases and make any corrections to the data before continuing.

Slide 21

Mutually Exclusive Categories

Mutually exclusive: Case falls into one and only one category. Let’s look at the examples in A, B, C, and D.

Q: What’s wrong (or right) with each of these categories?

A: Ambiguous Q: Where should 2cm be counted?

B: Not clear what the limits of each class are (0-1, 0-2, or 1-2)

C: Appropriate for discrete data only (whole numbers). Q: Where is 1.8 counted? Round up to 2?

D: Most suitable for continuous data when some values could include a decimal value. Q: But what about 10 and above? A: For this to be complete, there should be a 10+ in the last class. This category assumes there is not going to be a tumor size greater than 9.9.
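One way to guarantee that every case falls into exactly one category is to define each class by its lower limit and include an open-ended last class. The sketch below is an assumption-laden illustration: the class limits and labels are hypothetical (they are not the A through D examples on the slide), and sizes are assumed to be recorded to one decimal place.

```python
from bisect import bisect_right

# Hypothetical class limits in centimeters; each class starts at its lower bound
lower_bounds = [0.0, 1.0, 2.0, 5.0, 10.0]
labels = ["0.0-0.9", "1.0-1.9", "2.0-4.9", "5.0-9.9", "10.0+"]


def size_class(size_cm):
    """Assign a tumor size to one and only one class."""
    if size_cm < 0:
        raise ValueError("tumor size cannot be negative")
    # bisect_right counts how many lower bounds are <= size_cm
    return labels[bisect_right(lower_bounds, size_cm) - 1]


for size in (0.4, 1.8, 2.0, 9.9, 12.0):
    print(size, size_class(size))
```

Because each size maps to a single lower bound, a value such as 2.0 cannot be counted twice, and the open-ended last class catches anything at or above 10.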

Slide 22

Frequency Distributions

Absolute Frequency

COUNT of the number of cases Information on magnitude Example: 250 male patients for the year 2000

A frequency distribution is created by setting up categories (may also be called classes, groups, or bins).

An absolute frequency is the count of the number of cases that fall into each category.

When we say “magnitude”, we mean “how many”.

Q: What type of absolute frequency distributions might you provide from your registry data? A: In your annual report, you probably provide a distribution of the major primary sites.

Slide 23

Frequency Distributions

Relative Frequency

PERCENTAGE Proportion of the part to the whole Facilitates comparisons Example: 250 male patients out of 500 total patients = 50% males

\[
\text{Relative frequency} = \frac{\text{\# in each group}}{\text{Total \# in the population}} \times 100
\]

Relative frequency converts the count in the absolute frequency to a percentage.

Example: In your annual report, you may also provide a distribution of the major primary sites comparing them to state and national data.

Q: If your facility saw 240 new lung cases and there were 5000 new lung cases reported in the state and 170,000 new lung cases in the US, would you create a graph that shows these actual numbers (240, 5000, and 170,000)? A: No, that would be a mighty big graph because the numbers would range from 240 to 170,000. And what would it tell you about your data? Not much. By converting each number to a percentage (using the formula on the slide), you can see whether the percentage of lung cases at your facility is high or low compared to state and national data.
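Converting counts to percentages takes only a few lines of Python. This sketch uses the slide's example of 250 male patients out of 500 total; treating the remaining 250 patients as female is an assumption made only to complete the example.

```python
from collections import Counter

# 250 male patients out of 500 total (remaining 250 assumed female for illustration)
sexes = ["male"] * 250 + ["female"] * 250

counts = Counter(sexes)          # absolute frequencies (counts)
total = sum(counts.values())

# Relative frequency = (# in each group / total # in the population) x 100
relative = {group: count / total * 100 for group, count in counts.items()}

for group in counts:
    print(group, counts[group], f"{relative[group]:.0f}%")
```

The same pattern applies to any categorical item, such as primary site or race: count each category, then divide by the total.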

Relative frequency can be applied to any type of measure – primary site, gender, race, etc.

Slide 24

Analyzing the Data

Reviewing the data for completeness, errors, and inconsistencies Deciding on appropriate format for graphs and tables Determining how the data will be grouped

Anytime you generate data, you want to analyze it for accuracy and completeness before you release that information to someone else. Build others’ confidence in your data by making it as accurate as possible.

Slide 25

Presentation of Data

TABLES

GRAPHS

So, now we’ve gathered all of this information, we know how to summarize it using frequencies, central tendencies, variations, etc. Now, how do we share this data with others so that it can be understood accurately?

First of all, it depends on the purpose of the data. Tables would be used for complex, detailed information. Graphs would be used to show relationships in data.

Slide 26

Presentation of Data

Tables and graphs should contain enough information to stand alone without text explanation Title should answer: Who, What, Where, When Should contain certain essential components Significant results and relationships pointed out

Slide 27

Tables

More information can be presented Exact values can be read from a table Less work and cost required in preparation Flexibility is maintained without distortion of data

Slide 28

Basic Components of a Table

Title
Boxhead
Stub Head
Column Headings
Stub
Rows
Footnote:
Source:

Anything in a table (or graph) that cannot be understood by the reader from the title, captions, and/or stub should be explained by a footnote, for example, abbreviations or the types of cases included or excluded. If you use data from outside your institution for comparison purposes, always indicate the source of the data.

Slide 29

Types of Tables

Summary Table
Uses …
Presents specific data for a particular use

Reference Table
Provides complete information
Not intended to be read through

Q: Can you think of an example of a summary table? A: How about a list of patients with stage 3 and 4 breast cancer for year 2004 that contains the name, medical record number, stage, date of diagnosis, and type of surgery.

Q: Can you think of an example of a reference table? A: How about the master patient index for the entire registry? This could be notebooks’ worth of information for larger registries.

Slide 30

Graphs

Attracts attention more readily More easily understood Shows trends or comparisons more vividly More easily remembered

I’m sure you have heard the phrase: A picture is worth a thousand words.

Slide 31

Basic Components of a Graph

Title

Legend
Y-axis (Ordinate)
X-axis (Abscissa)
Scale
Footnote:
Source:

Study Hint: A possible CTR exam question might be: What else are the y-axis and x-axis known as?

Slide 32

Types of Graphs

Bar Graphs Frequency Polygons Line Graphs Pie Charts Scatter Diagrams Pictographs Geographic Maps

We will very briefly discuss the major characteristics of each of these types of graphs.

Slide 33

Bar Graphs

Major Primary Sites for 2000

[Bar graph: Number of Cases (y-axis) by Primary Site (x-axis): Lung, Colon, Prostate, Breast, Melanoma; legend: Year 2000]

Footnote: Analytic cases Source: Cancer Data Services

Q: What else is this data called? A: Absolute frequency distribution (count of cases that falls into each category).

Used to display frequencies, proportions, or percentages. Emphasizes individual amounts (discrete data). One axis is not continuous. Individual heights of each bar represent a whole. Bars of equal width. Space between bars. The value of each bar is independent of the value of other bars.

Slide 34

Histograms

Tumor Size for Breast Cancer, 2000

[Histogram: Number of Cases (y-axis) by Size in centimeters (x-axis)]

64 patients, Class of Case 0-2

Used to present observations for one continuous variable. Sum of the heights of the bars represents all the cases. If we were to add up the values for each category, we would get the total number of patients (64).

Again, bars are of equal width. BUT, there are NO spaces between the bars.

A histogram is a frequency distribution in bar graph form – the total area covered by the graph represents the whole. It is most effective when only one distribution is shown.

Slide 35

Frequency Polygons

Tumor Size for Breast Cancer, 2000

[Frequency polygon: Number of Cases (y-axis) by Size in centimeters (x-axis)]

64 patients, Class of Case 0-2

Histogram in line graph form. Also represents all cases. Created by joining the midpoints at the top of each interval and connecting the dots.

This graph is using the same numbers as we used in the histogram in the previous slide.

Allows you to plot several frequency polygons on the same graph for comparison.

Q: And, what do we want to do if we want to compare values that vary widely? A: We convert them to percentages.

Since the line represents the total distribution, the line always starts and ends with zero.

Slide 36

Cumulative Frequency Polygons (Ogive)

Cumulative Percent of Tumor Size for Breast Cancer, 2000

[Ogive: Cumulative Percent (y-axis) by Size in centimeters (x-axis), with the Median marked at 50%]

Size (centimeters)

Expressed in terms of percentages.

The cumulative frequency for any interval on the scale is the total of the frequencies for that interval + the total for all of the lower intervals.

Using the same numbers as on the frequency polygon and histogram: Group 1 = 27% (17 divided by 64), Group 2 = 63% (27%+36%), Group 3 = 80% (27%+36%+17%), etc.
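The running totals behind a cumulative frequency polygon can be sketched with `itertools.accumulate`. Only the first group's count (17) and the total of 64 patients come from the notes; the remaining counts below are assumptions chosen to be roughly consistent with the percentages quoted above.

```python
from itertools import accumulate

# First count (17) and total (64) are from the notes; 23, 11, 13 are assumed
counts = [17, 23, 11, 13]
total = sum(counts)

# Each cumulative count = this interval's count + all lower intervals
cumulative_counts = list(accumulate(counts))
cumulative_percent = [count / total * 100 for count in cumulative_counts]

for group, percent in enumerate(cumulative_percent, start=1):
    print(f"Group {group}: {percent:.1f}%")
```

The last cumulative value always reaches 100%, and the median can be read off where the curve crosses 50%.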

Slide 37

Line Graphs

Relative Survival Rates by Year of Diagnosis for Prostate Cancer, 1996 - 2000

[Line graph: Percent Surviving (y-axis) by Year of Diagnosis, 1996 to 2000 (x-axis); lines for 1 Year, 2 Year, 3 Year, 4 Year, and 5 Year survival]

Observed Survival by Stage, 1996 - 2000

[Line graph, Semilog Scale: Percent Surviving (y-axis) by Years After Diagnosis (x-axis)]

Used to display trends over time and survival curves. Can display multiple sets of data on one graph (%’s should be used).

Arithmetic scale – absolute numerical difference (how far). Uses equal units of measurement for the intervals on the horizontal and vertical scale.

Semilog – the steeper the line, the greater the rate of change (how fast). Uses equal units of measurement for the intervals on the horizontal scale. Uses unequal units of measurement for the intervals on the vertical scale. Usually used for survival or incidence rates.

Slide 38

Pie Charts

Percentage Distribution of Prostate Cancer by Stage, 2000

[Pie chart: wedges for Stage I, Stage II, Stage III, and Stage IV; wedge labels 47%, 20%, 18%, and 15%]

Another way of showing the component parts of the whole. All %’s must sum to 100%. Uses a circle (360 degrees). Not as appropriate for comparing distributions. Each part is a percent of the total. 1% = 3.6 degrees. Plot clockwise beginning with the largest size of the wedge – if there is NOT a logical order to the values (such as stage).
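The "1% = 3.6 degrees" rule can be applied directly to the four percentages shown on the slide. In this sketch the wedges are left unlabeled, since which percentage belongs to which stage is not stated in the notes.

```python
# Wedge percentages from the slide, largest first
percentages = [47, 20, 18, 15]

# Pie chart percentages must account for the whole
assert sum(percentages) == 100

# A full circle is 360 degrees, so each 1% is 3.6 degrees
degrees = [percent * 360 / 100 for percent in percentages]

for percent, angle in zip(percentages, degrees):
    print(f"{percent}% -> {angle:.1f} degrees")
```

The wedge angles add up to the full 360 degrees of the circle, which is a quick check that no category has been left out.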

Slide 39

Scatter Diagrams

[Scatter diagram: three example patterns labeled Positive, Inverse, and Random]

Means of presenting relationships between two variables. Example: Size of tumor on the X-axis and Depth of invasion on the Y-axis. Individual observations are plotted at the point of intersection of the values of the two variables. If the points tend to form a line at an angle to the axes, there may exist either a positive (upward line) or inverse (downward) relationship. If randomly distributed, there would appear to be no relationship.

Slide 40

Pictographs & Maps

Pictographs use symbols to represent numbers. Such as: 1 out of every 5 women……

Maps: 2 types – dot and shaded. Customarily: Dots represent general effect of density. Shading increases with rate.

Slide 41

Epidemiology

Slide 42

Epidemiology

Branch of medical science Concerned with the study of the distribution of disease in a population (descriptive epidemiology) and the search for the determinants of disease (analytic epidemiology)

Q: What does distribution of disease mean? A: Person, Place, and Time … Who, What, Where, When.

Slide 43

Triad of Disease Occurrence

Agent

Host Environment

Q: What is the basic idea behind this triad – as it relates to epidemiology? A: Disease can be prevented by eliminating any one of the three elements of the triad.

Slide 44

Steps in Epidemiologic Reasoning

Observe a statistical association between an exposure and an endpoint Develop a hypothesis about the relationship Test the hypothesis

Slide 45

Descriptive Epidemiology

SEER Book 7, Section C

Takes into consideration the distribution of disease (cancer) in a population.

Uses variables such as age, race, sex, county of residence. Identifies low and high risk subgroups of the population.

Slide 46

Measurements of Risk

Primary tool for the measurement of risk is called a RATE
Three measures of risk:

1. Incidence Rates
2. Prevalence Rates
3. Mortality Rates

Risk measures the strength of an association between exposure to a particular factor and the risk of a certain outcome.

Q: So, how do we measure risk? A: By calculating morbidity and mortality rates for a population.

Morbidity: Illness rate
Mortality: Death risk
Morbidity and mortality rates are essential for comparison of disease risk.

To bring it down to the bottom line: a rate measures the risk of getting or dying from cancer.

Slide 47

Incidence Rate

OCCURRENCE

Rate of occurrence of NEW cases that are diagnosed during a set time period in a defined population

Basic descriptive tool for epidemiologists for studying possible causes of cancer.

Q: What is an example of how incidence is used in the registry? A: Annual report. For example, the number of new analytic cases diagnosed in year 2004.

Slide 48

Prevalence Rate

PRESENCE OF DISEASE

Quantifies the TOTAL amount of active disease present in a defined population at a particular point in time

May be difficult to measure since it is not always possible to determine whether a person with a prior diagnosis of cancer still has active disease.

Usually based on the TOTAL number of living cases, both new and previously diagnosed (historical prevalence). Higher for older registries with more historical data. True prevalence rate would be much lower for active disease.

Slide 49

Mortality Rate

DEATH

Measures the risk of DEATH for the cause under study in a defined population during a given time period

Slide 50

Calculation of Morbidity and Mortality Rates

Morbidity and mortality rates are based on TWO primary components:

1) A count of the number of disease occurrences or deaths (numerator)

2) The size of the population – number of people at risk of getting the disease (denominator)

\[
\text{Rate} = \frac{\text{\# of events}}{\text{\# in the population at risk}} \times 100{,}000
\]

The next few slides will take you through calculating these rates.

The base is one that is large enough to report the rate in whole numbers. In this case, 100,000 was chosen. The base is usually 100,000 for adults and 1,000,000 for children.

Slide 51

Calculation of Crude Morbidity and Mortality Rates

\[
\text{Incidence Rate} = \frac{\text{\# of NEW cases diagnosed during a given time period}}{\text{\# in the population at risk}} \times 100{,}000
\]

\[
\text{Prevalence Rate} = \frac{\text{\# of TOTAL active cases at a given point in time}}{\text{\# in the population at risk}} \times 100{,}000
\]

\[
\text{Mortality Rate} = \frac{\text{\# of cancer DEATHS during a given time period}}{\text{\# in the population at risk}} \times 100{,}000
\]

To calculate the rates, you have to take into consideration the number of people in the population at risk.

Incidence = the number of new cases diagnosed during a given time period
Incidence Rate = the number of new cases diagnosed during a given time period out of the number in the population at risk

Remember:
- To measure risk, incidence, prevalence, and mortality are expressed in the form of a RATE.
- The calculation of a rate factors in the population at risk.
- Crude rates are based on the entire population and include all cancer sites.
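All three crude rates share the same arithmetic: events divided by the population at risk, multiplied by a base of 100,000. The sketch below shows that calculation in Python; the function name `rate_per_100k` and every count in the example are hypothetical.

```python
def rate_per_100k(events, population_at_risk, base=100_000):
    """Crude rate: number of events per `base` people in the population at risk."""
    if population_at_risk <= 0:
        raise ValueError("population at risk must be positive")
    return events * base / population_at_risk


# Hypothetical county of 900,000 people at risk
incidence_rate = rate_per_100k(events=450, population_at_risk=900_000)    # new cases in one year
prevalence_rate = rate_per_100k(events=2_700, population_at_risk=900_000) # active cases on one date
mortality_rate = rate_per_100k(events=180, population_at_risk=900_000)    # cancer deaths in one year

print(incidence_rate, prevalence_rate, mortality_rate)
```

Only the numerator changes between the three rates. Passing `base=1_000_000` would report the same rate per million instead.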

Slide 52

Calculation of Specific Morbidity and Mortality Rates

Risks for specific diseases in an entire population or a subgroup of the population
Uses the same formulas
The selection is limited
Example: An age-specific rate is specific for persons within a given age group

To calculate specific morbidity and mortality rates, we will use the same formulas; we just limit the selection of cases to include.

Remember, crude rates take into account the entire population and include all cancer sites.

For example: You may want to calculate age-specific rates. This would only include those patients within the age group you specified, e.g., patients age 70-80.

Slide 53

Calculation of Age-adjusted Morbidity and Mortality Rates

Uses a single summary measure that takes into account the differing age distributions among more than one population
Two types:
Direct – Applies a standard set of weights (e.g., the US population) to each population
Indirect – What you would expect if the rates were the same between the study population and the standard population (standardized ratios)
Not a “real” rate

Age-adjusted rates are a little more complicated.

Q: When would you want to use age-adjusted rates? A: Comparisons. When you want to compare the risks between two or more populations that have different age compositions.

For example: Q: In comparing two populations, if one was generally younger, and we calculated, say, the death rate for each of those populations, what would you expect to find? A: The population with a higher proportion of older persons will have a higher crude death rate than a population consisting of predominantly young persons.

Q: Is that very meaningful in our comparison of the cancer risk? A: Probably not. We need to take into account the younger age structure of the other population, which generally has less risk of cancer and death.

A crude incidence rate includes all newly diagnosed cases within a given time period regardless of age. An age-specific incidence rate is specific for persons of a given age group.

Not a real rate – so can’t be used as an indicator of the actual level of risk in either population.
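The direct method can be sketched as a weighted sum of age-specific rates. Everything in this example is hypothetical: the two populations, the three age groups, and the standard weights are invented to show the mechanics and are not taken from Book 7 or from any published standard population.

```python
# All counts are hypothetical: age group -> (cases, population at risk)
younger_population = {"<50": (40, 80_000), "50-69": (60, 15_000), "70+": (75, 5_000)}
older_population = {"<50": (20, 40_000), "50-69": (160, 40_000), "70+": (300, 20_000)}

# Hypothetical standard weights: share of the standard population in each age group
standard_weights = {"<50": 0.65, "50-69": 0.25, "70+": 0.10}


def crude_rate(groups, base=100_000):
    """All cases divided by the entire population, per `base` people."""
    total_cases = sum(cases for cases, _ in groups.values())
    total_population = sum(population for _, population in groups.values())
    return total_cases * base / total_population


def direct_age_adjusted_rate(groups, weights, base=100_000):
    """Weight each age-specific rate by the standard population's age share."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("standard weights must sum to 1")
    return sum(
        weights[age] * cases * base / population
        for age, (cases, population) in groups.items()
    )


for name, groups in (("younger", younger_population), ("older", older_population)):
    adjusted = direct_age_adjusted_rate(groups, standard_weights)
    print(name, crude_rate(groups), round(adjusted, 1))
```

The two made-up populations were built with identical age-specific rates, so their crude rates differ only because of age structure. After direct adjustment the two rates agree, which is the comparison the adjusted rate is meant to support.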

Slide 54

Survival Analysis

SEER Book 7, Section D

Survival analysis is necessary for evaluating patient care.

First determine the purpose of the study and the selection of cases. Follow-up percentage should be greater than 90%. Determine grouping of cases. Determine starting and ending points.

Slide 55

Survival Analysis

Types of measures that can be used:

Survival Time

Survival Rate

Recurrence Rate

We will take a look at each of these in a little more detail.

Slide 56

Survival Time

Average survival time

Median survival time

How long patients tend to live after diagnosis with a certain type of cancer

Average survival time: Q: What else might this be referred to? A: Mean survival time. Average = Mean. To calculate average survival time, we would have to know the length of time until death for each one of the patients. All of the patients would have to be dead. We would then add up all of the survival times and divide by the number of patients.

Disadvantages:
- Sensitive to extreme values
- We are rarely in a situation where all patients have died.

Median Survival Time: Q: Remember the definition of median from before? A: Sort the patients in order from shortest to longest survival. Choose the middle value. Not as affected by extreme values.
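To see why the median is less affected by extreme values, here is a small Python sketch. The survival times are hypothetical and assume all seven patients have died.

```python
from statistics import mean, median

# Hypothetical survival times in months for seven deceased patients.
# One patient lived much longer than the rest (an extreme value).
survival_months = [3, 8, 11, 14, 17, 22, 121]

average_survival = mean(survival_months)    # add up all times, divide by 7
median_survival = median(survival_months)   # middle value after sorting

print(f"Average (mean) survival: {average_survival} months")
print(f"Median survival: {median_survival} months")
```

The single 121-month patient pulls the average up to 28 months, while the median stays at 14 months, which is closer to what most of these patients experienced.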

Slide 57

Survival Rate

Observed: Proportion of persons surviving regardless of cause of death. Methods: Direct, Actuarial, Kaplan-Meier

Adjusted: Counts only those persons known to have died from the cancer being studied

Relative

Three types of survival rates: Observed, Adjusted, and Relative

The observed rate is calculated using either the direct, actuarial, or Kaplan-Meier methods.

Adjusted and Relative: Not just any cancer, but the cancer under study.

Slide 58

Observed Survival Rate

Direct Method:

- % of patients surviving a specified period of time

- Requires that patients are diagnosed early enough to have lived the interval under study

- Does not use information from more recent patients

5-year survival rate = (# surviving for 5 years) / (# at risk for 5 years)

All methods include all causes of death.

For a 5 year survival rate, the patients would have to be diagnosed at least 5 years prior to the study cutoff date.
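The direct method can be sketched as follows. The patients and dates are hypothetical, and the sketch assumes every living patient was followed through the study cutoff date (no one lost to follow-up). Only patients diagnosed at least 5 years before the cutoff are counted as "at risk for 5 years".

```python
from datetime import date

# Hypothetical patients: (date of diagnosis, date of death or None if alive).
patients = [
    (date(1994, 3, 1), date(1996, 6, 15)),
    (date(1994, 9, 10), None),
    (date(1995, 1, 20), date(2000, 8, 1)),
    (date(1995, 7, 5), None),
    (date(1998, 4, 12), None),  # diagnosed too recently to be included
]
cutoff = date(2000, 12, 31)  # hypothetical study cutoff date


def five_years_after(d):
    # Simplification: assumes no diagnosis date falls on February 29.
    return d.replace(year=d.year + 5)


# Denominator: diagnosed early enough to have been observed for 5 years.
at_risk = [p for p in patients if five_years_after(p[0]) <= cutoff]

# Numerator: still alive, or died (of any cause) after the 5-year mark.
survivors = [
    p for p in at_risk
    if p[1] is None or p[1] >= five_years_after(p[0])
]

direct_rate = len(survivors) / len(at_risk)
print(f"{len(survivors)} of {len(at_risk)} survived 5 years: {direct_rate:.0%}")
```

Notice that the 1998 patient contributes nothing to the calculation. That lost information is what the actuarial and Kaplan-Meier methods recover.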

Slide 59

Observed Survival Rate

Actuarial (Life Table) Method:

- Allows you to use information from patients who were diagnosed less than 5 years prior to the study cutoff date

- Uses all patients in the study group (alive, dead, lost to follow-up)

- Acceptable method for larger groups of patients (requires fewer calculations)

Cancer registrars are occasionally called upon to explain survival data. In understanding and explaining how the survival data was calculated, the registrar must first have an understanding of life tables. Life tables are certainly perplexing. In fact, the more frequent use of graphs to demonstrate survival stems from the confusion in interpreting life tables. It is useful to remember that life tables are simply a way of calculating the time period from the date of diagnosis to the date of last contact for large groups of patients. At the end of each interval, the calculations simply represent each patient’s chances of surviving during that interval.

Using the actuarial method to calculate survival rates allows you to use information from your more recent patients.

This allows you to calculate your survival rate for your annual report if you:
• Have less than 5 years of data and
• Use cases that were diagnosed only 1, 2, 3, or 4 years prior to calculating the survival rate.

Slide 60

Observed Survival Rate

Actuarial (Life Table) Method:

Calculates a 1-year survival rate for all patients. For those who survived that 1 year, another 1-year survival is calculated, and so on.

Column H: Cumulative Survival Rates would be what you would graph. Notice that it decreases as you go from year to year. Depending on the cancer registry software, the report to generate survival rates may only produce a life table. The cancer registrar must then use the life table to produce a graph of survival rates.

A: Intervals should be mutually exclusive. These intervals include those surviving less than or up to but not including 1 year, 2 years, etc.
B: Count the number of patients alive at the beginning of each interval. All patients in the study are alive at the start of the study. Autopsy and death certificate cases are not included.
C & D: Use the patient’s vital status and number of years surviving to decide in which column and row that patient should be tabulated. Column C is the number of patients who died during that interval. Column D is the number of patients who were last seen alive during that interval, with no further follow-up. This includes patients lost to follow-up.
E: Takes into account those patients who were alive but without a whole year of observation during a particular interval. Gives them a ½ year of “credit” (E = B - ½ D).
F: Calculates the proportion (%) dying in each interval by dividing the number who died (C) by the adjusted number (E).
G: The proportion surviving the interval. F and G add up to 1 for each row: F = % who died, G = % who survived.
H: Cumulative survival. Multiply the % in G in the current row by the % in H in the previous row.
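The column-by-column steps above can be sketched in Python. The yearly counts are hypothetical, and the dictionary keys A through H are used here only to mirror the life table columns; registry software may label and compute these differently.

```python
# Hypothetical yearly counts (not from Book 7). For each interval:
# (C: died during the interval, D: last seen alive during the interval).
counts = [(20, 0), (10, 10), (5, 10), (5, 10), (0, 10)]


def life_table(alive_at_start, counts):
    rows = []
    cumulative = 1.0  # everyone is alive at the start of the study
    for year, (died, withdrawn) in enumerate(counts):
        effective = alive_at_start - withdrawn / 2    # E: half-year credit
        proportion_dying = died / effective           # F = C / E
        proportion_surviving = 1 - proportion_dying   # G = 1 - F
        cumulative *= proportion_surviving            # H = G x previous H
        rows.append({
            "A": f"{year} to <{year + 1} years",
            "B": alive_at_start, "C": died, "D": withdrawn,
            "E": effective, "F": proportion_dying,
            "G": proportion_surviving, "H": cumulative,
        })
        alive_at_start -= died + withdrawn            # next B = B - C - D
    return rows


table = life_table(100, counts)
for row in table:
    print(f"{row['A']:<16} alive at start: {row['B']:>3}  "
          f"cumulative survival: {row['H'] * 100:.1f}%")
```

With these made-up counts, cumulative survival falls from 80.0% after the first year to about 55.2% at 5 years. The values in column H, multiplied by 100, are the points you would plot.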

To graph survival, plot the numbers (multiply by 100 first) in column H on the y axis and the Interval (column A) as the x axis. With the availability of cancer registry software, survival rates can be created by software. Prior to this, cancer registrars would use the actuarial life table to calculate survival rates for the facility.

Slide 61

Observed Survival Rate

Kaplan-Meier (Product Limit) Method:

- Calculation is done each time a person dies
- Uses information from patients who were diagnosed less than 5 years prior to the study cutoff date
- More exact description of the pattern of survival
- Done with small patient groups
- Recommended for registries that have a computer to do the calculations

Calculation is done each time a person dies instead of by year 1, then year 2, etc.

Kaplan-Meier and Actuarial will usually give very similar results.

It is important to use the same method of calculating survival rates when comparing to someone else’s rates, because the results differ slightly between methods.

Slide 62

Observed Survival Rate

Kaplan-Meier Method:

For 1-, 2-, 3-, 4-, and 5-year survival rates, use the last computed survival rate just previous to the 12-, 24-, 36-, 48-, and 60-month interval.

5 year survival on this table would be taken from month 54 and would be 65%

Again, Column H is what you would graph.

A: Write the survival times in months in order from smallest to largest, ending with 60 for a 5-year survival.
B: B - C - D for the previous row.
C & D: Same as for the actuarial method.
F: Only for those months in which someone died, calculate the proportion dying.
G: F = % dying, G = % surviving.
H: Multiply the % in G in the current row by the % in H in the previous row in which someone died.
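Here is a minimal Python sketch of the Kaplan-Meier steps listed above. The ten follow-up records are hypothetical, and the function names are made up for this illustration. When a death and a withdrawal fall in the same month, the sketch counts the withdrawn patient as still at risk for that death, which is a common convention but an assumption here.

```python
# Hypothetical follow-up records: (months from diagnosis, died?).
# False means the patient was alive at last contact (withdrawn).
follow_up = [
    (5, True), (8, False), (12, True), (20, True), (20, False),
    (33, True), (40, False), (54, True), (60, False), (60, False),
]


def kaplan_meier(records):
    at_risk = len(records)        # column B for the first row
    cumulative = 1.0              # column H starts at 100%
    curve = []                    # (month, cumulative survival) at each death
    for month in sorted({m for m, _ in records}):
        deaths = sum(1 for m, died in records if m == month and died)
        withdrawn = sum(1 for m, died in records if m == month and not died)
        if deaths:                # F and G are only computed when someone dies
            cumulative *= 1 - deaths / at_risk
            curve.append((month, cumulative))
        at_risk -= deaths + withdrawn   # next B = B - C - D
    return curve


def survival_at(curve, month):
    """Last computed cumulative survival at or before the given month."""
    rate = 1.0
    for m, s in curve:
        if m <= month:
            rate = s
    return rate


curve = kaplan_meier(follow_up)
for years in range(1, 6):
    print(f"{years}-year survival: {survival_at(curve, years * 12):.1%}")
```

In this made-up group the last death before 60 months occurs at month 54, so the 5-year survival rate is read from that row (36%).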

Slide 63

Adjusted Survival Rate

Need good cause of death information

Uses the Actuarial or Kaplan-Meier methods for calculating survival

Only counts those known to have died from the cancer being studied

Deaths from causes other than the cancer being studied are tabulated with the withdrawn cases

Good (or current) cause of death information is necessary to calculate adjusted survival rates because it only counts those known to have died from the cancer being studied. In the lessons related to Follow-up, the importance of recording the source of the information and using that information to determine the accuracy of the disease status was discussed. The best source for cause of death information is from the death listings available from most state central cancer registries. This information is based on the actual cause of death coded on the death certificate. It cannot be assumed that a patient who had evidence of cancer died from that cancer.

Slide 64

Relative Survival Rate

For registries that do not have good cause of death information

Obtains expected survival rates from a standard life expectancy table

Ratio of the observed survival rate to the average expected survival rate

Relative survival rate is usually larger than the observed rate.

Relative survival rates can go up and down from one year to the next because they are estimates based on the general population.
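As a quick illustration of the ratio, here is a short Python sketch. Both survival figures are hypothetical; in practice the expected rate would come from a standard life expectancy table for people of the same age and sex.

```python
# Hypothetical 5-year figures (not from Book 7).
observed_5yr = 0.52   # observed survival in the patient group (all causes)
expected_5yr = 0.80   # expected survival of similar people in the general population

# Relative survival = observed survival / expected survival
relative_5yr = observed_5yr / expected_5yr
print(f"Relative 5-year survival: {relative_5yr:.0%}")
```

Because the expected survival is less than 100%, dividing by it gives a relative rate (65% here) that is larger than the observed rate (52%).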

Slide 65

Recurrence Rate

Measured from the time of complete remission until time of recurrence

Measured to the time of recurrence as opposed to measuring to the time of death as with survival rates.

Slide 66

Presenting Survival Results

Graphs

- Starts at 100% because all patients are alive at the beginning of the study

- Bar graphs can be used for presenting mean and median survival times and survival rates for a single period

- Line graphs should be used to present a pattern of change over time

Refer back to the slides on line graphs for the two different types (arithmetic scale and the semilog scale).

Slide 67

Presenting Survival Results

[Line graph: Adjusted Survival Rates for Colorectal Cancer, XYZ Facility Compared to National, 1996-2000. Y axis: Percent Surviving (0 to 100). X axis: Years Surviving (0 to 5). Lines: Facility, National.]

For actuarial survival, a slanted line is used to connect the points. From the life table, this would be the numbers in column H.

This implies that survival changes gradually between the two plotted points. You can use the arithmetic scale to emphasize the numeric change, or the semi-logarithmic scale to emphasize the percent change.

Slide 68

Presenting Survival Results

[Step graph: Observed Kaplan-Meier Survival Rates for Lung Cancer, XYZ Facility, 1996-2000. Y axis: Percent Surviving. X axis: Months after Diagnosis.]

For Kaplan-Meier survival the graph looks like a stair step.

Since a calculation is made every time someone dies, the assumption is made that the survival is constant until the next death occurs.

Slide 69

Presenting Survival Results

Reports

- Must contain more than just the survival rate or time

- Must also contain a complete description of the information that was used to calculate the survival

Slide 70

Analytic Epidemiology

SEER Book 7, Section E

Study of the methods employed in investigating possible determinants (factors or causes) associated with the occurrence of disease.

Slide 71

Analytic Epidemiology

Two general forms:

Observational

Experimental

Slide 72

Cohort Studies

Also called Prospective Studies

Designed to test a specific hypothesis

One group (the cohort) is identified and then followed to observe the development of disease

Analyzed using Relative Risk

[Timeline diagram: Population at Risk and Disease, with an arrow marking the present time]

Go forward in time.

Arrow shows where the present time is.

A group of people (cohort) without disease are identified and characterized by a common experience or exposure (IE: smoking).

The group is then followed forward (prospectively) over a period of time to observe the development of the disease under study.

Example: the occurrence of lung cancer among smokers.
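Cohort studies are commonly analyzed with relative risk: the rate of disease in the exposed group divided by the rate in the unexposed group. The sketch below uses hypothetical counts for smokers and non-smokers; the function name and numbers are for illustration only.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of disease among the exposed divided by risk among the unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# Hypothetical cohort followed forward in time (not real data):
# 90 lung cancers among 10,000 smokers, 10 among 10,000 non-smokers.
rr = relative_risk(90, 10_000, 10, 10_000)
print(f"Relative risk: {rr:.1f}")
```

A relative risk of 1 would mean the two groups developed disease at the same rate. In this made-up cohort the smokers developed lung cancer at about 9 times the rate of the non-smokers.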

Slide 73

Case-Control Studies

Also called Retrospective Studies

Two groups are selected
- One with disease (cases)
- One without disease (controls)

Groups are compared for differences in past exposure to the hypothesized causes of the disease

Analyzed using Odds Ratio

[Timeline diagram: Population at Risk and Disease/Disease Free, with an arrow marking the present time]

Go backward in time.

Arrow shows where the present time is.

Both the exposure and the endpoint have already occurred at the time of the study.
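Case-control studies are commonly analyzed with an odds ratio, which compares the odds of past exposure among the cases to the odds among the controls. The counts below are hypothetical and the function name is made up for this sketch.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


# Hypothetical study (not real data): 100 cases and 100 controls,
# each asked about past exposure to the suspected cause.
or_estimate = odds_ratio(
    cases_exposed=80, cases_unexposed=20,
    controls_exposed=40, controls_unexposed=60,
)
print(f"Odds ratio: {or_estimate:.1f}")
```

An odds ratio of 1 would mean cases and controls were equally likely to have been exposed. Here the odds of past exposure are 6 times higher among the cases.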

Slide 74

Retrospective Cohort Studies

Also called Historical Cohort Study or Retrospective Prospective Study

Review of a set of historical records in which people were already classified into various groups

The current disease status is then observed

[Timeline diagram: Population at Risk and Disease, with an arrow marking the present time]

Arrow shows the present time.

Go back in time to identify the groups.

Then, come back to the present time to observe the groups’ current disease status.

Slide 75

Experimental Studies

Study of the impact on the natural history of a disease by varying some factor which is under the control of the investigator

- Reduce risk factors in high-risk groups

- Screening for early stage of disease

- Clinical Trials

Example: Behavior modification for men at risk for an MI (smoking, high cholesterol, hypertension)

[Timeline diagram: Population at Risk and Disease, with an arrow marking the present time]

Follows the concept of a cohort study.

Arrow shows where the present time is.

A group of people without disease are identified and characterized by a common experience or exposure.

The group is then followed forward (prospectively) over a period of time to observe the development of the disease under study.

The difference is that the investigator varies some factor under their control to see what effect it has.

Slide 76

Statistical Inference

SEER Book 7, Section F

Process of drawing conclusions about a population based on data from limited samples.

Must understand the concept of a in order to apply the technique of .

Slide 77

Normal Distribution

One of the most important frequency distributions in statistics

If you were to measure the variable on every person in the population, you would find the frequency distribution would display a “normal” pattern with most of the measurements near the center of the frequency

Example: Blood pressure, height

Q: When we say “normal pattern with most of the measurements near the center of the frequency”, what do you think of? A: Mean, median and mode.

Q: Remember our discussion of central tendencies and variation? What were our measures of each? A: Central tendencies: mean, median, mode. Variation: range, standard deviation. As promised, here are repeating terms.

Slide 78

Normal Curve

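[Diagram of the Area under a Normal Curve. X axis: standard deviations below (-) or above (+) the mean, from -3 to 3. Areas: 34% on each side of the mean within 1 SD, 13.5% on each side between 1 and 2 SD, and 2.5% in each tail beyond 2 SD.]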

Frequency distributions of the normal population, when plotted, form a curve with most of the observations near the center and fewer observations occurring toward the tails.

Characteristics of a normal curve:
- Symmetrical, bell-shaped
- Mean, median and mode are identical and are located in the center of the curve
- Width of the curve depends on the standard deviation (SD): the spread of values outward from the mean in both directions
- The observations are distributed around the mean such that 95% of the observations lie within ±1.96 SD of the mean
- Example: If the sampling were repeated 100 times, the mean of the population would lie within approximately 2 standard deviations 95% of the time
- How did we get 95%? Under the curve, add 13.5 + 34 + 34 + 13.5 = 95

So, when a variable follows the normal distribution, the normal range is simply the mean value +/- roughly 2 standard deviations.
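The percentages under the normal curve can be checked with Python's standard library. This sketch uses the standard normal distribution (mean 0, SD 1); the helper function name is made up for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Share of observations lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1_sd = area_within(1)        # roughly 34% + 34%
within_1_96_sd = area_within(1.96)  # the 95% used for the "normal range"
within_2_sd = area_within(2)        # slightly more than 95%

print(f"Within 1 SD:    {within_1_sd:.1%}")
print(f"Within 1.96 SD: {within_1_96_sd:.1%}")
print(f"Within 2 SD:    {within_2_sd:.1%}")
```

The exact multiplier for 95% is 1.96, which is why "roughly 2 standard deviations" is used as a convenient rule of thumb.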

Slide 79

Kurtosis

Symmetric

How the data are grouped (dispersed)

[Graph: frequency curves centered on the mean, y axis from 0 to 70. Legend: Normal Skew, Platykurtotic, Leptokurtotic.]

Mesokurtotic (bell-shaped): normal distribution because it occurs most frequently.
Leptokurtotic (leaping): tightly grouped around the mean
Platykurtotic (flat, plate): short sample size, broad distribution

Slide 80

Skewed Distributions

May be asymmetric

Drawn out in an extreme direction

[Graph: frequency curves around the mean, y axis from 0 to 80. Legend: Normal Skew, Positive Skew, Negative Skew.]

The curve may also be skewed.

A distribution is skewed if one of its tails is longer than the other. The yellow line shown has a positive skew. This means that it has a long tail in the positive direction. The distribution with the green line has a negative skew since it has a long tail in the negative direction. Finally, the distribution with the red line is symmetric and has no skew.

Distributions with positive skews are more common than distributions with negative skews. Let’s use an example of income to show how a distribution can have a positive skew. If you were to evaluate the salaries of people in a population, you would find that most make under $40,000 a year. However, some make quite a bit more with a small number making many millions of dollars per year. The mean (or average) salary is higher than what most people make because of the small number of people making a very high salary. However, the majority of people are earning less than $40,000 a year. Therefore, the graph will be clumped around the lower salary with the tail stretching out in the positive direction to account for the small number of people making more than the mean salary.

Negatively skewed distributions can and do occur. An example of a negative skew would be a distribution of test grades on a statistics test where most students did very well but a few did poorly. Such a distribution has a large negative skew.
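The direction of skew can also be put into a number. The sketch below uses one common skewness coefficient (the average cubed distance from the mean, in standard deviation units); this measure is not described in Book 7, and the salaries and test grades are hypothetical.

```python
from statistics import mean, pstdev


def skewness(values):
    """Positive when the long tail is to the right, negative when to the left."""
    m = mean(values)
    sd = pstdev(values)
    return mean(((x - m) / sd) ** 3 for x in values)


# Hypothetical salaries: most under $40,000, one very high earner.
salaries = [28_000, 31_000, 33_000, 35_000, 36_000,
            38_000, 39_000, 45_000, 60_000, 2_000_000]

# Hypothetical test grades: most students did very well, two did poorly.
grades = [95, 92, 90, 88, 94, 91, 93, 89, 45, 52]

print(f"Salary skewness: {skewness(salaries):+.2f}")
print(f"Grade skewness:  {skewness(grades):+.2f}")
```

The salary list gives a positive coefficient (long tail toward high values) and the grade list gives a negative one (long tail toward low values), matching the two examples in the notes.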

Slide 81

Statistical Hypothesis Testing

SEER Book 7, Section G

Part of the epidemiological steps is to test the hypothesis.

Slide 82

Hypothesis

A statement which claims a relationship exists between a study variable and an outcome variable

Example: A new treatment (study variable) results in longer survival (outcome variable)

In a hypothesis, we are claiming that a relationship exists between two variables.

Slide 83

Null Hypothesis

Hypothesis is stated in the opposite way to what is really being investigated Stated as having no difference

Example: The hypothesis that new treatment A is more effective than standard treatment B would be stated as new treatment A is the same as treatment B

The objective of a statistical test is to determine the probability that the resulting statistic represents the finding that the null hypothesis is true.

The researcher wants to reject the null hypothesis in favor of the alternative hypothesis (that there is a “real” difference).

Slide 84

Hypothesis Testing

Requires establishing confidence intervals

Two ways that a “true” result can follow from a statistical test

- For the null hypothesis to be true and the statistical test finds that it is true

- For the null hypothesis to be false and the statistical test finds that it is false

Confidence intervals estimate the range of values that includes the actual population mean.

Slide 85

Hypothesis Testing

Two ways to be wrong:

- Type I (alpha) error: null hypothesis is true and the statistical test finds that it is false
  - Decision point is < 5%

- Type II (beta) error: null hypothesis is false but it is accepted as true
  - Decision point is < 10%

The decision point for a Type I error is lower than for a Type II error because of one of the most serious charges of the Hippocratic oath: that a physician “do no harm”.

A Type I error would cause a physician to accept one treatment as being better than another when in fact it is not.

A Type II error is deemed more acceptable because the result is no change from the current practice.

Slide 86

P Value

Probability that the statistic is the result of a random observation

Chance the hypothesis was interpreted wrongly

p < .05
- Probability of an undesirable result is less than 5 percent
- One observation is statistically significantly different
- Acceptable probability level
- Standard in medical research

The probability that a difference as large as that observed would occur by chance alone.

The assumption is that any outcome is just as likely to occur as any other outcome.

Through much in-depth statistical expertise and testing, the decision point of .05 has been adopted as the standard in medical research.

Slide 87

Common Statistical Tests

t-Test
- Evaluates the difference between two means
- Determines whether the means of two samples represent the same population

z-Test
- Assigns a probability to a particular observation
- Uses sample proportions rather than sample means
- Evaluates the statistically significant difference between two proportions

Chi-Squared Test
- Can handle more than 2 groups
- Uses counts of observed and expected numbers

The z-statistic and t-statistic test for significance about a population mean or a population proportion.

T-Test – Uses quantitative data (interval and ratio). Used to test survival curves.

Chi-Squared Test: Uses qualitative data (nominal and ordinal). When the variables are independent, we are saying that knowledge of one gives us no information about the other variable. When they are dependent, we are saying that knowledge of one variable is helpful in predicting the value of the other variable. One popular method used to check for independence is the chi-squared test of independence. For example: Is level of education related to level of income?
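The education and income question can be worked through as a small chi-squared test of independence. The counts below are hypothetical. The sketch compares observed counts with the counts expected if the two variables were unrelated, then compares the result with 3.841, the standard critical value for p < .05 with 1 degree of freedom (a 2 x 2 table).

```python
# Hypothetical 2 x 2 table of observed counts (not real data).
# Rows: education level. Columns: lower income, higher income.
observed = [
    [30, 20],  # less education
    [15, 35],  # more education
]


def chi_squared(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if the variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    return statistic


CRITICAL_VALUE_DF1_P05 = 3.841  # chi-squared critical value, 1 df, p = .05

statistic = chi_squared(observed)
significant = statistic > CRITICAL_VALUE_DF1_P05
print(f"Chi-squared = {statistic:.2f}, significant at p < .05: {significant}")
```

With these made-up counts the statistic is about 9.09, which is larger than 3.841, so the null hypothesis that education and income are independent would be rejected at the .05 level.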

Slide 88

Additional Resources

American Cancer Society’s Cancer Facts and Figures

AJCC Cancer Staging Manual, 7th Edition

Also, you may want to familiarize yourself with some of the core information that can be found in the American Cancer Society’s Cancer Facts and Figures. These statistics are commonly used to help create the cancer program annual report. It is a great resource for seeing how cancer statistics are used and discussed. It includes such topics as: who is at risk of developing cancer, relative risk, leading sites of new cancer cases by site and sex, and even definitions of selected terms.

AJCC Cancer Staging Manual, 7th Edition provides another resource for survival analysis information in the front part of the manual.