Laboratory Data Management using the SAS® System
Marianne Hack, Covance, Inc., Princeton NJ

ABSTRACT

In terms of volume, for many clinical trials laboratory data comprises a very large portion of all data collected. Add to that the fact that laboratory data is inherently complex, and the management of the data acquisition (electronic and/or data entry), cleaning (synchronization of databases), and reporting (unit conversions, character to numeric conversions, result flagging, etc.) processes can be overwhelming. This paper discusses an approach to building and using an arsenal of specialized SAS program modules in order to standardize the various processes involved in managing and reporting clinical laboratory data.

INTRODUCTION

As a CRO (Contract Research Organization), we are required to be able to work with data from a multitude of sources. Clinical laboratory data is usually integrated into the clinical database via direct data entry or through importation of electronic data files. Lab data rarely arrives ‘analysis-ready’, usually requiring reformatting, data cleaning, conversions, or flagging. To streamline the processing effort prior to analysis, our approach has been to reformat the laboratory data as much as possible into standard data set structures, which then allow for usage of standard SAS data review and reporting modules.

DATA ACQUISITION (MAKING SENSE OUT OF GOBS OF DISSIMILAR DATA)

Many clinical trials utilize a ‘Central Laboratory’ for sample analysis. All investigators ship the test samples to the same laboratory. The advantages of using a central laboratory are consistent testing methodology, equipment calibration, and result reporting methods, as well as consistent data structures on the back end. Data from central laboratories is usually electronically imported into the clinical database.

Some clinical trials may utilize the ‘local’ laboratory scheme, in which investigators take the test samples to a laboratory in their local area. Data from these laboratories may be electronically imported if the laboratory is sophisticated enough, or may be data-entered directly into the database.

Trials may also use a combination of central and local laboratories. The challenge is to assimilate the data into one coherent stream for cleaning and reporting purposes.

MANAGEMENT OF THE DATA VENDORS

The first step is to manage the lab data vendors. This is critical to the development of the processes described below. We have set up data-transmission agreements with several clinical laboratories that outline a standard data file format and contents (see example A). To maximize our ability to efficiently process lab data, these standard agreements are adhered to whenever possible on any studies that utilize these laboratories.

Given that we know what to expect when receiving data files from a particular laboratory data vendor, we can then build standard SAS modules to import and reformat the vendor data. The end result is that a SAS data set (see example B) containing data from Vendor A will be very similar to a SAS data set containing data from Vendor B. These SAS data sets are then ready for the data cleaning stage.

DATA CLEANING

Data review and cleaning of vendor data requires additional steps over and above what one might ordinarily do to review data-entered laboratory data. Commonly called ‘header cleaning’, this involves reconciling the patient identifiers from the vendor database with the clinical database to ensure consistency and completeness of data.

How do we accomplish this? The missing link is a table that we create in the clinical database. This linking table is data-entered from the hardcopy lab report that has been sent from the laboratory to the investigator’s site. This hardcopy report contains all of the test results for a particular patient’s lab sample, and arrives as part of the patient’s Case Report Form (CRF). Selected identifying information from the report is then data-entered along with the other CRF pages.

The data vendor’s own unique sample identifier field is used as the pivot point from which to compare the vendor’s patient identifiers with the clinical database patient identifiers. We have built standard SAS modules that perform this comparison based on the expected data set structures.

DATA DICTIONARY MAPPING

Most laboratories utilize an in-house technique to identify each specific laboratory test; each laboratory has its own method. For example, Vendor A may record hemoglobin as ‘HGB’, whereas Vendor B may record it as ‘R3701’. To allow for standard reporting modules, our approach has been to utilize our own in-house dictionaries. This requires us to set up a mapping from the laboratory’s test identifiers to our in-house test identifiers. This is also done for units of measurement.

Having a standardized set of test and unit identifiers facilitates the building of standard data manipulation and reporting tools.

PREPARING DATA FOR ANALYSIS

Creating reports and statistical summaries of laboratory test results is probably the most complex part of dealing with lab data. On any given study there may be approximately 25 to 30 different tests performed for each patient at each study visit where labs are required. This results in a large amount of data to be summarized. Tests are usually grouped into three main categories (Hematology, Blood Chemistry, and Urinalysis); other specialized groups may be required. Each test category, as well as different tests within each category, may have its own data handling rules.

Chemistry test results are the easiest to manage. The results are usually reported in numeric units. Many of the conventional units (i.e., those used in the USA) follow the International System of Units (S.I.) conventions.

Hematology data may require a bit more work. For example, the White Blood Cell (WBC) Differential is a breakdown of the WBC into different cell types. Some laboratories report the WBC differentials as a percent of the whole (%), while others may report them as absolute counts of cells. Depending on the requirements of any statistical analyses, it may be necessary to convert the absolute counts into a percent of the whole (or vice versa).

Urinalysis data is the most complex data type, and according to many, frequently the least useful! However, deal we must. An issue that arises with urinalysis is the subjective nature of one of the testing methodologies. The urine sample is viewed under a microscope, and the person viewing the sample reports what he/she sees. This may result in a test result that consists of words (such as ‘Cloudy’), a range of numbers (such as 5-10 for a count of cells) or a mix of words and numbers (RBC: 5-10). This may seem fine to a reviewer, but makes for difficult statistical summarization.

SO WHERE DO WE START?

Our approach has been to perform all necessary data manipulations in what we call a derived data set program. For laboratory data, we have a skeleton front-end program that is modified to fit the study. This front-end program then invokes a main macro, which in turn invokes smaller specialized macros. Some of the specialized macros are optional, depending upon parameters set in the front-end program. These steps are described in more detail below.

CONVERTING CHARACTER RESULTS

In order to summarize and report laboratory results, we need to do our best to convert as many of the character results as possible to numeric results (where it makes sense to do so). For instance, you probably wouldn’t convert ‘Cloudy’ to a numeric result (although tests with discrete values such as this can be mapped to integer values based on clinical input). However, results like ‘5-10’ or ‘RBC: 5-10’ can realistically be converted to a number. The conservative method is to take the highest count as your numeric value (i.e., 10 in these two examples). This value of 10 may then be used in any necessary calculations.

INCORPORATING SUBJECTIVE COMMENTS FROM THE INVESTIGATOR

Many studies require the investigator to provide comments on certain laboratory test results. These comments are then collected with the CRF and data-entered into the database. It is then necessary to match these comments up with the appropriate result data from the lab.

REFERENCE RANGE FLAGGING

Each laboratory has its own reference range of what are considered normal values for each test. Any test result above the high limit of the range is flagged as ‘High’ and any result below the low limit of the range is flagged as ‘Low’. Reference ranges are generally set by test, unit, sex, age, and effective date, and possibly by racial origin.

ADJUST > OR < VALUES

Test results that are reported with a greater than (>) or less than (<) sign (e.g., > 100) cannot be used in any statistical summaries unless the character value is converted to a numeric. Our convention has been to add ‘1’ in the decimal place just past the last significant decimal for the test. For example, if the test result has one significant decimal, we would add a value of ‘0.01’ to the numeric part of the result for use in statistical summaries.

UNIT CONVERSIONS

The next step is to attempt to ensure that all results for each particular test across the study are reported in the same unit of measurement. This is necessary for appropriate summarization of a particular test’s results across the study. If multiple laboratories were used for analyzing samples, chances are some unit conversions will be required prior to any analyses. Note that the reference range-flagging step occurs prior to any unit conversions. If unit conversions are necessary, the range limits will be converted as well as the test results. It is more conservative to flag the results prior to converting, since the conversion could make a difference in the flagging.

ABSOLUTE HEMATOLOGY CONVERSIONS

As mentioned earlier, it may be necessary to convert the absolute hematology results. This conversion requires that the sample has the WBC count successfully reported, as it is the basis for converting from % of whole to an absolute count or vice versa.

VISIT SLOTTING / UNSCHEDULED TESTS

Laboratory data is often collected at unscheduled intervals in addition to the protocol-specified visit schedule. The investigator may see a test value that is out of range for a patient and decide to re-perform the test prior to the patient’s next scheduled visit. This is done with the patient’s safety in mind. For analysis purposes, only one test value per patient-visit will be used in the statistical summaries. We refer to this value as the ‘slotted’ value. Our convention is to use the last value (by date and time) as the slotted value for pre-baseline visits, and to use the first value as the slotted value for post-baseline visits.

GET BASELINE VALUES

Frequently, a comparison of a patient’s test result to the result of the same test at the baseline visit is required. If a patient missed a certain lab test at baseline, or if the sample was damaged, it may be necessary to “carry forward” a pre-baseline result to use as the baseline value for that patient. This carry forward is done at the test level rather than at the visit level, so a patient may have some test baseline values coming from the baseline visit, and other test baseline values coming from a screening visit. The baseline value is retained as a variable on all data rows for the patient for each lab test.

GET ENDPOINT VALUES

Carry forward is also used at the test level for finding the endpoint values for each patient. That is, for each patient, the last available post-baseline test result for each test (regardless of what visit it was from) will be considered the endpoint value. The endpoint value is retained as a variable on all data rows for the patient for each lab test.

OTHER RESULT FLAGGING

Special criteria may be applied to the data to add another layer of information to what has been flagged according to the reference ranges. For example, if the study’s patient population is known to have Renal Failure, one would expect all patients to have high Creatinine values according to the reference ranges (since the reference ranges are based on the normal population). In this case it may be useful to add a more refined set of criteria to flag extremely high values. These additional criteria may be called ‘Pre-defined Limits’ or ‘Marked Abnormality Criteria’.

STUDY-SPECIFIC LOGIC

The derived data set allows for any study-specific logic that may be required. This study-specific logic would be built into a module that the derived program will run if it exists.

THE DERIVED DATA SET (IN ORDER OF OPERATIONS)

    * Beginning of derived data set program ;

    * Define study-specific parameters ;
    .
    .
    * Set up necessary data sets ;
    .
    .
    * Invoke main macro ;
    %HDERIVE(parameters...) ;

    * Perform character to numeric conversions ;
    %HCTON ;

    * Add Investigator comments ;
    %HIMPSET ;

    * Get Reference Range and perform flagging ;
    %HNORMAL ;

    * Adjust for < and > results ;
    %HLESGRT ;

    * Perform unit conversions, if necessary ;
    %HCONV ;

    * Convert Absolute Hematology, if necessary ;
    %HHABS ;

    * Select slotted values for statistical summaries ;
    %HSLOTS ;

    * Determine baseline value for each test for each patient ;
    %HBASE ;

    * Determine endpoint value for each test for each patient ;
    %HFINAL ;

    * Flag data values according to pre-defined Marked Abnormality criteria ;
    %&PDLMACRO ;

    * Perform any additional study-specific logic ;
    %SPECIFIC ;

STANDARD REPORTING MODULES

Now that we have a standard-format analysis-ready data set (see example C), we can select the reporting programs required for our study from our set of standard reporting modules. We have an arsenal of SAS programs ready for study-specific customization, which produce a variety of popular summary tables, data listings, and figures.

CONCLUSION

The flexibility attained by creating a modular system of SAS programs has allowed us to standardize and streamline the processing of a most complex and problematic clinical data type.

ACKNOWLEDGMENTS

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Special thanks to Mike Walega and Kathy Leedom for their review and input.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Marianne Hack
Covance, Inc.
210 Carnegie Center
Princeton, NJ 08540
Work Phone: 609.452.4091
Fax: 609.520.1754
Email: [email protected]

EXAMPLES
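As a supplementary illustration, the front-end program and main macro arrangement described under SO WHERE DO WE START? might be sketched as follows. This is only a sketch of the pattern, not the actual modules: the macro names (%X_DERIVE, %X_CTON, %X_CONV, %X_HABS) and the DOCONV= and DOHABS= parameters are hypothetical, and each specialized macro is reduced to a stub that writes a message to the log.

```sas
/* Hypothetical stand-ins for the specialized macros (names are illustrative only) */
%macro x_cton;
   %put NOTE: character to numeric conversion step;
%mend x_cton;

%macro x_conv;
   %put NOTE: unit conversion step;
%mend x_conv;

%macro x_habs;
   %put NOTE: absolute hematology conversion step;
%mend x_habs;

/* Main macro: always runs the required step; optional steps run only on request */
%macro x_derive(doconv=N, dohabs=N);
   %x_cton
   %if %upcase(&doconv) = Y %then %do;
      %x_conv
   %end;
   %if %upcase(&dohabs) = Y %then %do;
      %x_habs
   %end;
%mend x_derive;

/* Front-end program: set the study-specific parameters, then invoke the main macro */
%x_derive(doconv=Y, dohabs=N)
```

With the call shown, the conversion stub and the unit conversion stub would run and the absolute hematology stub would be skipped. Keeping the optional steps behind parameters is what lets one skeleton serve many studies.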

A. RAW DATA

The following is a sample portion of an ASCII file from Covance Central Laboratory Services.

A 30SEP1999 130128 Covance CLS USA COVANCE CDNA
H AAA0001 Protocol Name 854 1004 1004 502 U VJO 22APR1935 1 F
R AAA0001 AD1687 ADT329 ADC08 S No
R AAA0001 HM354 HMT20 N 0.00 x10^3/uL
R AAA0001 HM564 HMT14 HMC01 S Normocytic
R AAA0001 HM574 HMT10 N 0.36 x10^3/uL
R AAA0001 HM574 HMT11 N 0.44 x10^3/uL
R AAA0001 HM574 HMT12 N 0.07 x10^3/uL
R AAA0001 HM574 HMT13 N 192 x10^3/uL
R AAA0001 HM574 HMT2 N 37 %
R AAA0001 HM574 HMT3 N 3.9 x10^6/uL
R AAA0001 HM574 HMT4 N 96 fL
R AAA0001 HM574 HMT40 N 12.7 g/dL
R AAA0001 HM574 HMT7 N 6.96 x10^3/uL
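A file of this kind, in which the first character of each record appears to identify its type (here A, H, and R), could be read along the following lines. This is a minimal sketch and not the import module itself: the real file layout is not documented here, so the sketch assumes simplified, blank-delimited records with far fewer fields, and the data set and variable names (LAB_RAW, VISIT, CRSLT, NRSLT, and so on) are chosen for illustration only.

```sas
/* Sketch: read a record-typed file, carrying sample-level header values   */
/* (H records) onto each result row (R records). The layout is an          */
/* assumption: blank-delimited fields, much simplified from a real file.   */
data lab_raw (drop=rectype);
   infile datalines truncover;
   length rectype $1 specnum $8 visit $4 vistype $1
          grpcode testcode $8 rslttype $1 crslt $20 unit $12;
   retain visit vistype;                 /* keep header values across rows */
   input rectype $ @;                    /* read the type, hold the line   */
   select (rectype);
      when ('H') input specnum $ visit $ vistype $;
      when ('R') do;
         input specnum $ grpcode $ testcode $ rslttype $ crslt $ unit $;
         /* numeric results get a numeric copy; ?? suppresses error notes */
         if rslttype = 'N' then nrslt = input(crslt, ?? best12.);
         output;                         /* only result rows are written   */
      end;
      otherwise;                         /* file header and others: skip   */
   end;
datalines;
A 30SEP1999 130128
H AAA0001 502 U
R AAA0001 HM574 HMT10 N 0.36 x10^3/uL
R AAA0001 HM574 HMT40 N 12.7 g/dL
R AAA0001 HM564 HMT14 S Normocytic
;
run;

proc print data=lab_raw;
run;
```

The trailing @ holds the current record so that a second INPUT statement can read the rest of it once the record type is known, and RETAIN keeps the header values available for every result row that follows.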

B. INITIAL SAS DATA SET

Here is a sample printout of a SAS data set created from the laboratory’s ASCII file.

OBS GENDATE SPECNUM SEQUENCE TIMECR LABNAME ORIGLAB OURNAME CCLSPROT CHARINV CHINVSEC CHARPAT CRANDOM CSCREEN

1 01OCT1999 AAA0001 . 15:46:42 Covance CLS USA COVANCE CDNA Protocol 854 1004 1004
2 01OCT1999 AAA0001 . 15:46:42 Covance CLS USA COVANCE CDNA Protocol 854 1004 1004
3 01OCT1999 AAA0001 . 15:46:42 Covance CLS USA COVANCE CDNA Protocol 854 1004 1004
4 01OCT1999 AAA0001 . 15:46:42 Covance CLS USA COVANCE CDNA Protocol 854 1004 1004

OBS CVIS TUNVIS CPATINIT CDATEBIR CRACE CSEX CDATE TIMECOL DATEREC TIMEREC DATEACT VISTYPE RGENDATE GRPCODE TESTCODE

1 502 AAA 22MAR1960 1 F 25MAR1999 9:30:00 26MAR1999 8:43:00 09APR1999 U 30SEP1999 AD1687 ADT329
2 502 AAA 22MAR1960 1 F 25MAR1999 9:30:00 26MAR1999 8:43:00 09APR1999 U 30SEP1999 HM354 HMT20
3 502 AAA 22MAR1960 1 F 25MAR1999 9:30:00 26MAR1999 8:43:00 09APR1999 U 30SEP1999 HM564 HMT14
4 502 AAA 22MAR1960 1 F 25MAR1999 9:30:00 26MAR1999 8:43:00 09APR1999 U 30SEP1999 HM574 HMT10

OBS RSEQUENC RTIMECR OCCURR RSLTCODE RSLTTYPE RSLTSIGN NRSLT_CV CRSLT_CV UNIT_CV RLOW_CV RHIG_CV ALERT EXCLUS SPONSFLG

1 3 13:01:28 . ADC08 S . No
2 4 13:01:28 . N 0.00 x10^3/uL
3 5 13:01:28 . HMC01 S . Normocytic
4 6 13:01:28 . N 0.36 x10^3/uL

OBS NRSLT_SI CRSLT_SI UNIT_SI RLOW_SI RHIG_SI SAMPCOND D_FILED T_FILED BASELINE BLNDSPON BLNDINV BLNDOTH

1 . No 14329 10:02:29 N N N
2 0.00 GI/L 14329 11:03:25 0.00 N N N
3 . Normocytic 14329 11:03:25 N N N
4 0.36 GI/L 14329 11:03:25 0.40 N N N

C. DERIVED SAS DATA SET

Here is a sample PROC CONTENTS of a derived SAS data set.

 #  Variable  Type  Len  Pos  Format  Label
--------------------------------------------------------------------------------
19  AGE       Num     8  159          Age-calc. rel. to first of vacc or civa
37  BAS       Num     5  241          Baseline period
38  BASE      Num     8  246          Baseline value
41  BASEFLAG  Num     5  261          Flag=1 for Baseline Obs.
40  BHLFLAG   Char    1  260          Baseline Normal Range H/L flag
39  CBASE     Char    6  254          Baseline character value
45  CFINAL    Char    6  280          Endpoint1 character value
31  CONV      Char    1  219          Unit converted flag
21  CSCD      Num     8  168          CLINICALLY SIGNIFICANT
11  CVALUE    Char   50   62          CVALUE
 8  DATEBIR   Num     5   45  DATE9.  Birth date
52  DOSTAT1   Num     5  308          Flag for means on labcode (1=do)
53  DOSTAT2   Num     5  313          Flag for mean diffs on labcode (1=do)
54  DOSTAT3   Num     5  318          Flag for normal counts on labcode (1=do)
55  DOSTAT4   Num     5  323          Flag for pdl counts on labcode (1=do)
17  DRUGSORT  Num     5  146          **** DUMMY DRUGSORT ****
23  ELAPDAY   Num     4  177          Day number relative to rxstart=1
43  FIN       Num     5  267          Endpoint1 period
44  FINAL     Num     8  272          Endpoint1 value
46  FINFLAG   Num     5  286          Flag=1 for Endpoint Obs.
56  FLAG      Char    7  328          Seven flags
 2  GEOLOC    Char    3    8          GEOGRAPHIC LOCATION OF LAB
29  HLFLAG    Char    1  217          Flag comparing result with range
22  IMPFLAG   Char    1  176          Clinically important flag
 3  INVNO     Num     8   11          INVESTIGATOR NUMBER
16  KGLBCD    Num     8  138          KG OR POUNDS
 9  LABCODE   Num     4   50          LABCODE
 7  LABDATE   Num     5   40  DATE9.  Lab date
 1  LABLOC    Num     8    0          LABLOC
 6  LABTIME   Char    5   35          LAB COLLECTION TIME
30  LGFLAG    Char    1  218          Flag for value with >< sign
12  N         Num     5  112          counter for specimens on this date
25  NONORM    Num     4  189          labcode had no normal ranges (1=yes)
49  NOPDL     Num     5  293          labcode had no pdl (1=yes)
36  NOSTFLAG  Char    1  240          Flag for not used in
 4  PATNO     Num     8   19          PATIENT NUMBER
34  PERIOD    Num     5  230          Slotted period
47  PFLAG     Char    1  291          flag for pdl out of range
48  PHLFLAG   Char    1  292          flag for pdl hi or lo
50  PRNTORD   Num     5  298          Order to print labtests
14  RACECD    Num     8  122          ETHNIC ORIGIN
26  RANGEHIG  Num     8  193          RANGE HIGH
27  RANGELOW  Num     8  201          RANGE LOW
42  REPLACEB  Char    1  266          B=non-baseline value used as baseline
13  RPTFLAG   Num     5  117          Repeat labtest flag
18  RXSTART   Num     8  151  DATE9.  Start Date of Vaccine(Study Med)
57  SAMEID    Num     5  335          Counter for obs with same identifiers
24  SEX       Num     8  181          SEX
20  SPANFLAG  Char    1  167          Flag for cvalue (a-b) conv to (b) value
35  STATS     Num     5  235          period# for value used in statistics
33  UNIFLAG   Num     5  225          =1 for diff. units/labcode-all data
51            Num     5  303          =1 for diff. units/labcode-stats data
28  UNITS     Num     8  209          UNITS OF MEASURE
10  VALUE     Num     8   54          VALUE
 5  WEEKNO    Num     8   27          WEEK NUMBER
15  WT        Num     8  130          WEIGHT
32  XSIG      Num     5  220          Max #significant decimals/labcode-unit
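
To show how a few of these variables might be populated, the following sketch walks through three of the steps in the order given in THE DERIVED DATA SET: character to numeric conversion, reference range flagging, and the adjustment of > and < results. It is an illustration under stated assumptions, not the %HCTON, %HNORMAL, or %HLESGRT code. The input rows are invented, and the flag values (‘H’, ‘L’, ‘Y’) are assumed. Flagging a > or < result on its unadjusted numeric part is also an assumption, as is applying the same upward adjustment to both signs; the text describes the adjustment only as an addition.

```sas
/* Invented input rows: result as received (CVALUE), reference range     */
/* limits, and significant decimals for the test (XSIG).                 */
data results;
   infile datalines dlm='|' dsd truncover;
   length cvalue $50;
   input patno cvalue $ rangelow rangehig xsig;
datalines;
1|12.7|11.5|15.5|1
2|> 100|10|40|0
3|RBC: 5-10|0|5|0
4|Cloudy|||0
;
run;

data derived;
   set results;
   length hlflag lgflag spanflag $1;

   /* Step 1: character to numeric conversion */
   if substr(cvalue, 1, 1) in ('<', '>') then do;
      lgflag = substr(cvalue, 1, 1);           /* remember the sign      */
      value  = input(compress(cvalue, '<> '), ?? best12.);
   end;
   else if index(cvalue, '-') > 1 then do;
      /* a range such as 5-10: keep the last (highest) count             */
      value = input(scan(cvalue, -1, '-: '), ?? best12.);
      if value ne . then spanflag = 'Y';       /* assumed flag value     */
   end;
   else value = input(cvalue, ?? best12.);     /* missing if not numeric */

   /* Step 2: reference range flagging, done before any adjustment */
   if n(value, rangelow, rangehig) = 3 then do;
      if value > rangehig then hlflag = 'H';
      else if value < rangelow then hlflag = 'L';
   end;

   /* Step 3: adjust > and < results by 1 in the decimal place just past */
   /* the significant decimals (assumption: same addition for both signs) */
   if lgflag ne ' ' and value ne . then
      value = value + 10**(-(xsig + 1));
run;

proc print data=derived;
run;
```

Under these assumptions, ‘> 100’ would be flagged high against its range and carried as 100.1, ‘RBC: 5-10’ would be carried as 10, and ‘Cloudy’ would be left without a numeric value.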