GEMS DATA MODEL FOR PUBLISHED MODELS AND CALCULATORS
QSAR WORKBENCH 2019

Copyright Notice

©2018 Dassault Systèmes. All rights reserved. 3DEXPERIENCE, the Compass icon and the 3DS logo, CATIA, SOLIDWORKS, ENOVIA, DELMIA, GEOVIA, 3DSWYM, BIOVIA, IFWE and 3DEXCITE are commercial trademarks or registered trademarks of Dassault Systèmes, a French "société européenne" (Versailles Commercial Register # B 322 306 440), or its subsidiaries in the U.S. and/or other countries. All other trademarks are owned by their respective owners. Use of any Dassault Systèmes or its subsidiaries trademarks is subject to their express written approval.

Acknowledgments and References

To print photographs or files of computational results (figures and/or data) obtained by using Dassault Systèmes software, acknowledge the source in an appropriate format. For example: "Computational results were obtained by using Dassault Systèmes BIOVIA software programs. BIOVIA QSAR Workbench was used to perform the calculations and to generate the graphical results."

Dassault Systèmes may grant permission to republish or reprint its copyrighted materials. Requests should be submitted to Dassault Systèmes Customer Support, either by visiting https://www.3ds.com/support/ and clicking Call us or Submit a request, or by writing to:

Dassault Systèmes Customer Support
10, Rue Marcel Dassault
78140 Vélizy-Villacoublay
FRANCE

Contents

QSAR Workbench GEMS Data Model for Published Models
    Introduction
    Data Model
        QSAR Model Class
        Model Version Class
    Official list of Output Types
    Examples of JSON Strings for Selected Fields
        Training and Test Statistics
        Validation Parameters
        Validation Statistics (for a classification model)
        Build Parameters
Calculator Data Model

QSAR Workbench • GEMS Data Model for Published Models and Calculators | Page i

QSAR Workbench GEMS Data Model for Published Models

Introduction

This document contains a brief description of the data model for statistical models published by the QSAR Workbench application, or more generally for any statistical model imported into the system, whether or not it was developed in QSAR Workbench.

In addition, classes are available in the data model to support calculator components. These are components that compute output properties from a set of inputs, but which were not derived by a process of statistical model building.

Most users and administrators of QSAR Workbench will not need to be familiar with the GEMS data model. This document is primarily useful for developers who want to develop protocols in Pipeline Pilot for importing custom models into the system or for editing attributes of published models and calculators.

Data Model

The basic data model is summarized below.

QSAR Model Class

Field: Description

Name: Globally unique name of the model; it is also the base name of all model outputs. The name must be unique across models and calculators (i.e., a calculator is not allowed to have the same name as a model).
Endpoint: Specific endpoint on which the model is based. It includes a Name, Category, Unit, and Assay Name. A given biological phenomenon (e.g., hERG) might have more than one Endpoint (e.g., IC50 and patch clamp).
… Name: Name of endpoint (e.g., hERG IC50).
… Category: General category of endpoint (e.g., Toxicity).
… Unit: Unit of measure (e.g., molar).
… Assay Name: Name of the assay used to measure the endpoint for the data used to train the model. As there may be more than one assay, this field has multiple cardinality.
Scope: Specifies the scope of data used to build the model and its general scope of applicability. Options are Global or Project.
Project: If Scope is “Project,” this is the name of the project for which the model was built and whose data were used to build the model.
    Note: This refers to a project within your organization, as opposed to a QSAR Workbench project.
Domain Subset: Normally, there can be only one model of a given category for a given endpoint, category, and scope. But there can be cases where models applicable to different compound types (e.g., acids versus amines) are desired at the same scope. Providing a value for this field allows this. For example, one model for a given endpoint could have a Domain Subset value of “Acids” while the other has the value “Amines”.
Category: Specifies whether the endpoint data are modeled as continuous (Regression) or categorical (Classification) in the model.
Response Prep Rules: Optional multi-value field describing any transformations done on the endpoint response prior to building the model.
Data Prep Rules: Optional multi-value field describing how the input data, including any molecular structures, were standardized or pre-processed before building the model.
Description: Free text description.
Therapeutic Area: General therapeutic area to which the model applies (e.g., Oncology, CNS, ...).
    Note: Applies to project models but not to global models.
Supplemental Info: JSON string containing any additional information that could be useful to selected clients (e.g., perhaps an endpoint subcategory for presenting models in a tree).
Workflow: Type of workflow based on data being used to build the model. For structural data, the workflow is specified as Chemistry while for general data, the workflow is specified as Generic. In the case of both types of data, the workflow is specified as Mixed.
Lifecycle: Lifecycle status value (published or retired).
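The naming and uniqueness rules in the QSAR Model class can be sketched in Python as follows. This is an illustration only, not the GEMS schema: the class names, attribute names, and the `find_conflicts` helper are hypothetical, and treating Project as part of the uniqueness key is an assumption that the field descriptions do not confirm.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Endpoint:
    name: str                      # e.g., "hERG IC50"
    category: str                  # e.g., "Toxicity"
    unit: str                      # e.g., "molar"
    assay_names: List[str] = field(default_factory=list)  # multiple cardinality


@dataclass
class QsarModel:
    name: str                      # must be unique across models and calculators
    endpoint: Endpoint
    scope: str                     # "Global" or "Project"
    category: str                  # "Regression" or "Classification"
    project: Optional[str] = None  # only meaningful when scope is "Project"
    domain_subset: Optional[str] = None
    lifecycle: str = "published"   # "published" or "retired"

    def slot(self) -> Tuple:
        # Key under which normally only one model is expected to exist.
        # Including `project` here is an assumption made for this sketch.
        return (self.category, self.endpoint.name, self.endpoint.category,
                self.scope, self.project, self.domain_subset)


def find_conflicts(models: List[QsarModel]) -> List[str]:
    """Report duplicate names and models that compete for the same slot."""
    problems = []
    seen_names = set()
    seen_slots = {}
    for model in models:
        if model.name in seen_names:
            problems.append(f"duplicate name: {model.name}")
        seen_names.add(model.name)
        other = seen_slots.get(model.slot())
        if other is not None:
            problems.append(f"{model.name} occupies the same slot as {other}")
        else:
            seen_slots[model.slot()] = model.name
    return problems
```

With this sketch, two global regression models for the same endpoint coexist only when their Domain Subset values differ (for example, “Acids” and “Amines”).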

Model Version Class

The following table contains a detailed description of each field in the Model Version class.

Field: Description

Author: Multi-valued field containing usernames of authors of this version.
Created Date: Date this version was created.
Version Number: Integer number beginning at 1 for the initial version.
Method: Short string indicating algorithm and implementation, e.g., “PP RP Forest”, “R PLS”.
Descriptor List: Comma-separated list of names of all descriptors used to build this version (CLOB).
Input Descriptor List: Comma-separated list of the names of all generic descriptors used to build this version (CLOB).
Training Set: Attributes of training data used to build this version. Specific attributes are Location, Sample Count, Sample Indices, and Statistics.
… Location: Multi-valued identifier of location or source of training data. Can be a path or paths to files, database locations and queries, or the path to a ksh script used to extract data. For this field, there is no specific format enforced by the QSAR Workbench or GEMS framework. You can make use of it as deemed appropriate by your organization.
… Sample Count: Number of samples (N) in training set.
… Sample Indices: The actual training samples can be a subset of the data indicated by Location. If so, this field contains a comma-separated list of the indices in the range [1, N] indicating the subset used for model training (CLOB).
… Statistics: JSON string containing statistics of this model version as applied to the training set.
Test Set: Attributes of the test data used to evaluate this version. Specific attributes are Location, Sample Count, Sample Indices, and Statistics, with the same meanings as for Training Set.
Validation: Attributes of any validation (e.g., cross-validation) performed as part of the model building process. This is distinct from “test set validation” performed after the model is built and whose attributes fall under Test Set.
… Parameters: JSON string containing parameters of the validation process such as type of validation, number of iterations, etc.
… Statistics: JSON string containing validation results such as cross-validation R-squared, RMS error, etc.
Build Parameters: JSON string containing parameter settings used in building this version of the model. The specific list of parameters varies according to the value of Method.
Component: Attributes of the Pipeline Pilot component encapsulating this version of the model.
… Path: Full path of component within the Pipeline Pilot XMLDB.
… Server: URL indicating the Pipeline Pilot XMLDB endpoint where the component is located.
… Runner: URL indicating the Pipeline Pilot “runner” endpoint for the server on which the component is based.
… Version: Component version number as saved in XMLDB. Will often, but not necessarily, be the same as Version Number.
… XML: Compressed XML of component at publication time.
    Note: This is created from component XML using the PilotScript compress() function.
Workbench Project: Name of QSAR Workbench project in which this model version was developed. If this field is empty, the model was developed outside of QSAR Workbench.
Supplemental Info: JSON string containing any additional information that may be useful to selected clients.
Output: Multi-cardinality set of attributes describing the model’s “published” outputs. This can be a subset of all of the outputs the model is capable of calculating.
… Name: Name of output, typically the model name followed by an underscore and a suffix, except for the “primary” output of the model, which represents a score or response and whose name is typically just the model name.
… Type: Specifies what this output represents. A specific vocabulary needs to be agreed upon, but potential values of this field are:
    Continuous Response
    Score
    Class Probability
    Categorical Response
    Error
    Goodness
    Distance to Training Sample
    Training Sample Property
    Applicability
    Sub-model Names
    Sub-model Values
    Atom Contributions
    The last three of these represent array outputs.
… Parameter Settings: Parameter settings on the model component required to get the specified output.
… Color Rules: JSON string conforming to the rules for specifying coloring of cells in Insight. (See the Conditional Formatting document.) Only the propertyType, range, and outOfRange values need to be specified, as the propertyNames value is given by Output Name. For output of type “Atom Contributions”, the rules govern the coloring of atoms rather than of spreadsheet cells.
Parameter: Multi-cardinality set of attributes describing the model’s parameters which govern the specific outputs produced. One example of this is the Number of Closest parameter, which indicates the number of training set nearest neighbors to return.
… Name: Name of parameter.
… Data Type: string, number, or Boolean.
… Description: Explanation of the purpose and usage of the parameter.
Error Model Included: Indicates whether an error model was included with the model.
Error Statistics: JSON string containing the statistics associated with the error model.
Parent Model Name: The original model name that was published.
Lifecycle: Lifecycle status values are: prototype, current, superseded, or deleted.
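As an illustration of the Sample Indices field described above, the following Python sketch parses a comma-separated list of 1-based indices and checks each value against the stated range [1, N]. The function name `parse_sample_indices` is hypothetical, and the sketch assumes that N is supplied by the caller as the upper bound and that an empty field means no subset was recorded.

```python
from typing import List


def parse_sample_indices(indices_field: str, n: int) -> List[int]:
    """Parse a comma-separated list of 1-based indices and validate the range.

    `n` is the upper bound N of the permitted range [1, N].
    An empty field yields an empty list (assumption: no subset recorded).
    """
    tokens = [tok.strip() for tok in indices_field.split(",") if tok.strip()]
    indices = [int(tok) for tok in tokens]
    out_of_range = [i for i in indices if not 1 <= i <= n]
    if out_of_range:
        raise ValueError(f"indices outside [1, {n}]: {out_of_range}")
    return indices


# Example: a subset of three samples drawn from data with N = 5
subset = parse_sample_indices("1, 3,5", 5)
```

Validating on read is a design choice for this sketch; because the field is stored as free text (CLOB), malformed values would otherwise surface only when the indices are used.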

Official list of Output Types

The following list contains the output type vocabulary entries that are entered into the GEMS schema and are supported by the Get Model Outputs component:

    Actual Response for Training Sample
    Applicability
    Atom Contribution Array
    Categorical Response
    Class Probability
    Class Score
    Continuous Response
    Distance to Training Sample
    Error Estimate
    Goodness Score
    Name of Training Sample
    Predicted Response for Training Sample
    Property of Training Sample
    Standard Deviation
    Submodel Name Array
    Submodel Value Array
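When importing a custom model, it can be useful to check each output's Type against this vocabulary before publishing. The following Python sketch shows one way to do that; the helper `unknown_output_types` and the output names in the example are hypothetical, and the check itself is not part of QSAR Workbench.

```python
OFFICIAL_OUTPUT_TYPES = frozenset({
    "Actual Response for Training Sample",
    "Applicability",
    "Atom Contribution Array",
    "Categorical Response",
    "Class Probability",
    "Class Score",
    "Continuous Response",
    "Distance to Training Sample",
    "Error Estimate",
    "Goodness Score",
    "Name of Training Sample",
    "Predicted Response for Training Sample",
    "Property of Training Sample",
    "Standard Deviation",
    "Submodel Name Array",
    "Submodel Value Array",
})


def unknown_output_types(outputs):
    """Return {output name: type} for every type not in the official list.

    `outputs` maps an output name to its declared Type string.
    """
    return {name: type_ for name, type_ in outputs.items()
            if type_ not in OFFICIAL_OUTPUT_TYPES}


# Hypothetical outputs of a model named "hERG"
declared = {
    "hERG": "Continuous Response",
    "hERG_Error": "Error",             # provisional name, not in the official list
    "hERG_Applicability": "Applicability",
}
rejected = unknown_output_types(declared)
```

Note that some of the potential values listed for the Type field of the Model Version class (for example, “Error” and “Goodness”) differ from the official entries (“Error Estimate” and “Goodness Score”), so an exact-match check such as this one would flag them.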

Examples of JSON Strings for Selected Fields

The following examples show JSON strings currently produced by QSAR Workbench for various fields. One benefit of the JSON format is that the specific attributes included in the string can be changed without the need to change the GEMS schema. The drawback is a reduced ability to directly query for specific values contained within the JSON. Instead, filters must be applied following an initial query.

Training and Test Statistics

Training and test set statistics for a regression model appear as follows:

    {
        "Correlation Coefficient (train)": 0.9752,
        "Determination coefficient (train)": 0.9511,
        "Kendall Tau (train)": 0.9143,
        "Bias (train)": 0.0177,
        "RMSE (train)": 1.6966,
        "SCE (train)": 273.44
    }
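Because values inside a JSON string cannot be queried directly, any filtering on a statistic has to happen after the rows are retrieved. The following Python sketch shows such a post-query filter. The row layout (a dict with `version` and `statistics` keys) and the function name are assumptions for illustration; only the statistic key "RMSE (train)" is taken from the example above.

```python
import json


def filter_by_statistic(rows, key, max_value):
    """Keep rows whose Statistics JSON has a numeric `key` <= max_value.

    `rows` is an iterable of dicts returned by some initial query;
    each dict is assumed to carry the JSON text under "statistics".
    """
    kept = []
    for row in rows:
        stats = json.loads(row["statistics"])
        value = stats.get(key)
        if value is not None and value <= max_value:
            kept.append(row)
    return kept


# Hypothetical result of an initial query for model versions
rows = [
    {"version": 1, "statistics": '{"RMSE (train)": 1.6966, "Bias (train)": 0.0177}'},
    {"version": 2, "statistics": '{"RMSE (train)": 2.4100}'},
    {"version": 3, "statistics": '{"Bias (train)": 0.0100}'},  # key missing
]
good = filter_by_statistic(rows, "RMSE (train)", 2.0)
```

Rows that lack the requested key are dropped rather than treated as errors, since the set of attributes in the JSON can change without a schema change.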

Validation Parameters

    {
        "validation_type": "Bootstrap",
        "validation_iterations": "3",
        "validation_crossvalidation_folds": "4",
        "validation_random_seed": "12345"
    }
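In the Validation Parameters example, the numeric settings are stored as JSON strings rather than JSON numbers. A client that needs numeric values can convert them after parsing, as in this Python sketch. The function name and the choice of which keys to convert are assumptions based solely on the example shown.

```python
import json

# Keys whose values look like integers in the example (assumption)
INTEGER_KEYS = (
    "validation_iterations",
    "validation_crossvalidation_folds",
    "validation_random_seed",
)


def load_validation_parameters(text):
    """Parse the JSON string and convert known integer-valued settings."""
    params = json.loads(text)
    for key in INTEGER_KEYS:
        if key in params:
            params[key] = int(params[key])
    return params


example = """
{
    "validation_type": "Bootstrap",
    "validation_iterations": "3",
    "validation_crossvalidation_folds": "4",
    "validation_random_seed": "12345"
}
"""
params = load_validation_parameters(example)
```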

Validation Statistics (for a classification model)

    {
        "ReferenceClass": [ "c3" ],
        "ROCTraining": [ 0.944827586206896 ],
        "Iteration": [ 1, 2, 3 ],
        "ROCTest": [ 0.868518518518518, 0.977040816326531, 0.969444444444444 ],
        "MeanROCTest": [ 0.938334593096497 ],
        "KappaTest": [ 0.38476190476190475, 0.35714285714285704, 0.2565130260521042 ],
        "SensitivityTest": [ 0.9166666666666666, 1, 1 ],
        "SpecificityTest": [ 0.6444444444444445, 0.7142857142857143, 0.5333333333333333 ],
        "Ntest": [ 57, 63, 53 ],
        [...some data omitted...]
        "ValidationType": "Bootstrap"
    }
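In this example, the per-iteration arrays (ROCTest, KappaTest, and so on) hold one value per entry in Iteration, and MeanROCTest is the arithmetic mean of the ROCTest values. The Python sketch below checks both properties on the fields shown. It uses only the values printed above (the omitted data is left out), and the function name is hypothetical.

```python
import math

# Subset of the example above; omitted fields are not reconstructed
stats = {
    "Iteration": [1, 2, 3],
    "ROCTest": [0.868518518518518, 0.977040816326531, 0.969444444444444],
    "MeanROCTest": [0.938334593096497],
    "Ntest": [57, 63, 53],
}


def check_validation_statistics(stats, tol=1e-9):
    """Return True if array lengths and the reported mean are consistent."""
    n_iter = len(stats["Iteration"])
    per_iteration_keys = ("ROCTest", "Ntest")
    if any(len(stats[key]) != n_iter for key in per_iteration_keys):
        return False
    mean_roc = sum(stats["ROCTest"]) / n_iter
    return math.isclose(mean_roc, stats["MeanROCTest"][0], abs_tol=tol)


consistent = check_validation_statistics(stats)
```

Note that scalar summaries such as MeanROCTest are wrapped in single-element arrays, so clients need to index element 0 to read them.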

Build Parameters

The following unusual, double-escaped format is the “native” one used by QSAR Workbench. Be aware that this format is subject to change in the future.

    [
        "{name=\"PP Bayes\", value=\"\"}",
        "{name=\"PP Bayes Multiselect Learn Options\", value=\"Validate Models\"}",
        "{name=\"PP Bayes Numeric Distance Function\", value=\"Euclidean\"}",
        "{name=\"PP Bayes Multiselect Numeric Scaling\", value=\"Mean-Center and Scale\"}",
        "{name=\"PP Bayes Fingerprint Distance Function\", value=\"Tanimoto\"}",
        "{name=\"PP Bayes Additional Properties\", value=\"\"}",
        "{name=\"PP Bayes Additional Options\", value=\"\"}",
        "{name=\"PP Bayes NumberOfBins\", value=\"10\"}",
        "{name=\"PP Bayes DestinationFolder\", value=\"$(Username)/LearnedProperties\"}",
        "{name=\"PP Bayes Post-Processing Script\", value=\"\"}"
    ]
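The Build Parameters value is a JSON array of strings, and each string in turn encodes one name/value pair in a non-JSON syntax. Reading it therefore takes two steps: a JSON parse, then a pattern match on each element. The Python sketch below is inferred from the example alone; the function name is hypothetical, the regular expression is an assumption about the element syntax, and the format is stated to be subject to change.

```python
import json
import re

# Matches one decoded element such as: {name="PP Bayes NumberOfBins", value="10"}
_PAIR = re.compile(r'\{name="(.*?)", value="(.*)"\}')


def parse_build_parameters(text):
    """Return an ordered dict-like mapping of parameter name to value string."""
    params = {}
    for element in json.loads(text):          # step 1: decode the JSON array
        match = _PAIR.fullmatch(element)      # step 2: decode each pair
        if match is None:
            raise ValueError(f"unrecognized build parameter entry: {element!r}")
        name, value = match.groups()
        params[name] = value                  # assumes parameter names are unique
    return params


example = r'''
[
    "{name=\"PP Bayes\", value=\"\"}",
    "{name=\"PP Bayes Numeric Distance Function\", value=\"Euclidean\"}",
    "{name=\"PP Bayes NumberOfBins\", value=\"10\"}",
    "{name=\"PP Bayes DestinationFolder\", value=\"$(Username)/LearnedProperties\"}"
]
'''
build_params = parse_build_parameters(example)
```

All values are returned as strings, as in the source; converting a setting such as NumberOfBins to a number is left to the caller.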

Calculator Data Model

The data model for calculators is in essence a condensed version of that for statistical models. The basic data model is summarized here, with the fields having identical meanings to those for models:
