The Use of Metadata in Creating, Transforming and Transporting Clinical Data

PhUSE 2010 Paper CD08 The Use of Metadata in Creating, Transforming and Transporting Clinical Data Gregory Steffens, ICON Development Solutions, Ellicott City, MD, USA ABSTRACT Creating databases is a significant part of the work required to analyze clinical trials. Data flows from applications that collect and clean data as well as from many other data sources, such as laboratories which supply electronic lab data. This source data is used to create standardized raw data (e.g. SDTM), study analysis databases and on to integrated databases. This represents a great deal of database design, creation, transformation and validation. This paper presents innovative technical solutions that vastly reduce the effort and cost to implement the data flow, while delivering a consistently higher level of quality. The technology is built with base SAS, standardized metadata set structures and the SAS macro language. So, no new software products are required – just effective design and process. It will be seen that standardizing metadata is at least as important as standardizing clinical trial data and that automation can be data standard neutral, not assuming any data standard. Clarification of the objectives of data standards and metadata is necessary in order to measure the success of these standards. Standards are not an end themselves; they are a means to the end of creating analysis with less effort and cost and to attain a high and consistent quality. Data standards, alone, will not meet these efficiency and quality objectives, standard metadata, data automation and good process definition is also required. INTRODUCTION Pharmaceutical companies must address the need to define database structures quickly and easily in order to support such tasks as the following. • support clinical trial analysis databases • import data from central laboratories and CROs • export data to Data Safety Monitoring boards and other data review groups • share data between corporate sites • create integrated databases of multiple study databases • submit data and metadata to the FDA and other regulatory agencies and reviewers Industry-level standards are being defined by CDISC to help address this need and most pharmaceutical companies have had corporate standards governing database structures for many years. However, often these industry and corporate standards are stored in .pdf or MSword documents or in excel worksheets that do not rigorously follow a metadata standard. As such they are accessible only to people reading the documents. Storing this same information in rigorously standardized metadata sets adds much value to the effort put into defining these standards and study data specifications because this information is made available to computer programs that can automate database creation and data transformations. This paper describes the value of standard metadata set structures that are capable of storing specifications for any of the above database requirements and others too. These metadata sets, along with SAS macros and a SAS/AF application that access these metadata sets, supports the specification, creation, importation, exportation, comparison and validation of databases. Further, these metadata sets can be used to export data and metadata to XML format following the CDISC define schema. The metadata set structures can be implemented in any relational database or in SAS itself. The macros support such functions as: • publishing the specification in several formats, including the define.xml and define.html formats • implementing all the data set and variable attributes to the database with a simple macro call (variable name, type, length, label, format, etc.) 1 PhUSE 2010 • listing discrepancies between the data specification and the database, including structure and valid values • sorting the data by primary keys • reordering variables in the PDV according to FDA preferences • creation of user defined format catalogs • creation of character decode variables from code variables • comparison of database structures to standards or to other studies, to assist in enforcing data standards and consistency • Creation of study specifications from data standards or other study specifications • transforming a database from one database structure to another, using map metadata which defines a map between metadatabases As data standards are developed there is great value in documenting these in metadata sets, that are accessible to computer programs, rather than in word or .pdf. The metadata set structures must be standardized and capable of containing descriptions of all the database attributes required to build that database. In summary, when importing data from outside sources, the data specification is implemented in metadata set so that computer programs can automatically compare the data submitted by the lab or CRO to the specification and list discrepancies where the database does not conform to the specification. When specifying study analysis databases, macros assist in the building of the database by automatically adding all the specified variable and data set attributes, such as labels and formats, adding decode variables, sorting the data, building format catalogs, etc. When submitting data to the FDA a thorough specification can be printed from the metadata or exported to XML format, with bookmarks and hyperlinks. STANDARDIZED METADATA The first step is to define a standard metadatabase design and use that standard design in all instances of defining database standards and study database specifications. A metadata design includes two main components – a standard list of attributes of databases and a standard metadata set design used to store those attributes. THE TWO COMPONENTS OF A STANDARD METADATA DESIGN – STANDARD ATTRIBUTES IN A STANDARD METADATABASE STANDARD ATTRIBUTES The first component of a standard metadatabase design is a standard list of attributes that describe the characteristics of databases that are required to be known when creating, validating and reviewing the database. These attributes define database characteristics at different levels of detail, such as attributes of a data set, of each variable, of valid values of variables such as codelists, of descriptions of the variables purpose, of descriptions of the variable’s derivation logic and of a special set of attributes that I call “row-level” metadata. The data set level attributes include a short and a long name of each data set, the data set label, the order the data set should be displayed in a define file, the location of the data set and CDISC attributes required to create a define.xml file such as structure, repeating, is_reference_data, purpose and class. The variable level attributes include a short and long name of each variable, the data set that the variable is included in, the label, type, length, format to assign to the variable, the name of a list of valid values or codelist to associate with the variable, the relationship between code and decode variables, the order to present the variables in a define file and to order the variables in the data set program data vector, the name of a description to attach to the variable, whether the variable is a header variable that should be copied to other data sets, the ability to flag a variable as a supplementary qualifier when creating CDISC data sets and other attributes required to create the define.xml file. It is also useful to include various flags that categorize variables. For example, flagging a variable as a derived variable, or related to primary efficacy endpoints, or safety endpoints support the ability to publish the metadata content for different audiences, by filtering different subsets of variables to publish. A programmer may need all the variables but others may want to focus only on safety or only on efficacy. The valid value attributes include a name of the value list that can be assigned to a variable, the valid value of the code variable and the valid value of a decode variable, the ability to define either a list of discrete values or a list of ranges of values and the order in which to display the values in a define file and in TFLs. It is important to represent the relationship between a valid code value and its decode value as well as the relationship between a value list and 2 PhUSE 2010 the variable it is assigned to. For example, metadata should represent a values list for sex that includes both the information that 1=Male and 2=Female and also the fact that this list is associated with a variable named “sex” as a code variable and “sex_dcd” as a decode variable. The description attributes include the derivation logic to assign to a derived variable, a comment to attach to a variable to explain its source and use and the name of the description. The final category of database attributes to include in a metadata design is row-level metadata. The objective of this level of attributes is to define tall-thin data structures in a robust manner to more fully support automation of data transformations. Some variables in tall-thin data sets, like VSRN, do not have the same attributes in all rows or observations. Different attributes pertain depending on the row and the row types are defined by a parameter variable like VSTESTCD. I call these variables, whose attributes differ in different rows depending on the parameter variable value, parameter related variables (or paramrel variables). The full set of variable level attributes must be assigned to parameter related variables for each value of the parameter variable. Without this, data transformation automation is not possible and a full description of the database is also not possible. I identified the need for row level metadata in the first CDISC pilot project and we are attempting to demonstrate this in the second pilot, but it will require an enhancement to the define.xml schema. These paramrel variables can also be referred to as virtual variables because they require all the definition that a variable requires but they do not exist as separate physical variables in the database.

Load more