University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln UCARE: Undergraduate Creative Activities & UCARE Research Products Research Experiences

Spring 4-15-2016 ChIPathlon: A competitive assessment for regulation tools. Avi Knecht University of Nebraska-Lincoln, [email protected]

Adam Caprez University of Nebraska-Lincoln, [email protected]

Istvan Ladunga University of Nebraska-Lincoln, [email protected]

Follow this and additional works at: http://digitalcommons.unl.edu/ucareresearch Part of the Bioinformatics Commons, Cancer Biology Commons, Other Computer Sciences Commons, and the Software Engineering Commons

Knecht, Avi; Caprez, Adam; and Ladunga, Istvan, "ChIPathlon: A competitive assessment for gene regulation tools." (2016). UCARE Research Products. 76. http://digitalcommons.unl.edu/ucareresearch/76

This Presentation is brought to you for free and open access by the UCARE: Undergraduate Creative Activities & Research Experiences at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in UCARE Research Products by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln. ChIPathlon: a competitive assessment for gene regulation tools Avi Knecht, Adam Caprez, Istvan Ladunga Gene regulation: why do we care?

 When gene regulation of the cell cycle malfunctions, it frequently causes cancer.

 Adult, differentiated cells can be reprogrammed to induced pluripotent stem cells  Which can then be reprogrammed to heart muscle, skin, etc, to repair damaged tissue (to limited extent in clinical practice) Mapping transcription factors & histone modifications to genome

, are regulated by transcription factors and , that bind to specific sequences on the DNA.

 Transcription factors are mapped to the DNA by chromatin immunoprecipitation followed by next-generation sequencing.

Challenges in mapping transcription factors to the genome

 Background correction is an open problem.

 DNA fragments can be much larger than the binding site.

 Sequencing read location does not follow any statistical distribution.

 Over 50 different methods are used for mapping, which produce different results. ChIPathlon

Evaluate the performance of all mapping (peak calling) methods.

To this end, we will develop a scalable and easy to use super computing pipeline to stage data, compare many different peak calling and differential binding site tools, and store all results into a single database. MongoDB

• Works well with large data sets. • Can handle incomplete data. Pegasus

• Too much data for manual processing, need to create workflows. • Built on condor, which is used by a variety of super computing centers. Python

• Many bioinformatics packages already managed in python under bioconda. • Has interfaces for both MongoDB & Pegasus. YAML to Workflows I

Each individual job is defined in a plain text YAML file. YAML to Workflows II

Jobs are chained together by using module YAML files. YAML to Workflows III

Users need to input a file selecting ENCODE experiment id’s for files to process, tools they want to use, and a path to a genome. Conclusion

• The current pipeline handles all downloads, alignment of single or paired end reads, and peak calling. • The modularity of the underlying architecture makes it very easy to add additional tools or processing steps without changing the workflow generation code. • Workflows can be generated for any ENCODE experiment, making this a very versatile pipeline for comparing bioinformatics tools. Database Architecture Workflow I Workflow II Regulation of cRPGs (outer circle) S4XP21L10 L12 S6KC1S18 L26 L41 S24 L19 S16 L8 S19BP1 L6 S29 L27A and mRPGs ML1 S6KA6 ML46 ML54 ML44 ML11 L38 ML52 ML34 SAP58 ML13 ML3 L24 S13 ML37 ML9 L36A (circle next from SA ML19 ML23 L9 S17 ML15 ML41 L29

S11 MS31 NF-E2 SIRT6 ML21 L4 MEF2C PGC1A the outside). S6KB1 MS28 ZNF274 TFIIIC-110 ML16 L13 GR GRP20 TAL1 NF-YB SETDB1 S6KA3 MS2 ML36 L7A HNF4A ATF3 NRF1 USF2 NF-YA BAF155 S4X L10L Yellow circles: RPGs, MS36 EGFP-JUND BRCA1 ML4 S6KA4 ERALPHA GCN5 L22L1 MS16 JUN BCLAF1 ZNF143 HMGN3 SRF NANOG ML22

S14 BHLHE40 CHD2 SMC3 PBX3 L18A MS17 OCT-2 ML17 FOS POL2(B)HA- E2F6 green diamonds: S27 MEF2A EGFP-JUNB L35 EGR-1 GTF2F1 SIX5 MS23 NELFE IRF3 ML32 S28 MAX L30 STAT3 SIN3AK-20 FOSL2 MS6 TBP POL2 GABP ML49 regulators. S9 WHIPAP-2GAMMA RAD21 CCNT2 GATA-2STAT2 L18 SP1 MS24 NRSF ELF1 ML50 S6KL1 HDAC2 YY1 CEBPB GATA-1 L5 E2F1 CTBP2 P300 JUND S6KA1 MS7 NFKB ML48 L39L Icon size is EBF PAX5-C20 USF-1 MAFK TAF1 BATF GTF2B TCF4 ZZZ3 S15 MS34 CTCF ML53 L37 TCF12 MXI1 POU2F2 IRF1 S10 TAF7 L34 proportional to the MS12 AP-2ALPHA MAFF HNF4GRPC155 ML55 RXRA PU.1 FOXA1 S27L L27 MS5 ML2 BCL11A ZNF263 ETS1 HSF1 S19 ZEB1 ELK4 RFX5 L11 INI1 THAP1 number of MS33 ML35 S2 L10A EGFP-FOS ZBTB7A BCL3 SPT20 MS30 ML24 S25 CTCFL FOXA2 L3L GATA3 KAP1 STAT1 BRG1 ERRA regulatory S26 MS11 ML20 L26L1 BAF170 TR4 IRF4 BRF2 S6KA2 MS18B ML42 L23 SP2 PAX5-N19 EGFP-GATA2 ZBTB33 PRDM1 relationships. S15A MS18A ML47 L17 SREBP2 POU5F1 S23 L23A MS9 SREBP1 SUZ12 ML10 FOSL1 BDP1 S6KB2 MS15 EGFP-HDAC8 ML27 L36AL S12 MS10 ML45 LP0 S3 MS26 ML18 LP1 S6KA5 MS25 ML39 L39 S3A MS35 ML38 L15 Note the extreme MS14 ML43 MS18C ML28 S4Y1 MS27 ML33 ML51 L32 S21 L21 S7 L22 S6 LP2 S8 L37A density of regulatory S27A L3 S5 S20 L14 L7L1 L13A relationships.