Supplemental Materials

for

TRACE: factor footprinting using accessibility data

and DNA sequence

Ningxin Ouyang1 and Alan P. Boyle1,2

1Department of Computational Medicine and Bioinformatics, 2Department of Human Genetics

University of Michigan, Ann Arbor, MI

Supplemental Methods

1. Bias-corrections, Ddigestion signals, and derivatives

To generate biased-corrected read counts in a genome with length N 푥 (1.1), we first did theperformed bias correction based on model and bias values reported in He et al. (2014). A separate set of raw reads were also used to count raw digestion signal (푥) since they better conserved the overall read counts number difference among different peaks. Only 5’ loci were used to count digestion signals in both cases.

푥 =(푥 ,…, 푥 ,…, 푥) (1.1)

푥 =(푥 ,…, 푥 ,…, 푥) (1.2)

The raw digestion signals were then normalized by the mean of non-zero read counts inside surrounding 10kb window 푊 (1.3) and smoothed using local regression function LOESS from R with degree of the local polynomial = 2 and a small span so that 30 surrounding points were used in the fitting. The resulting signals were saved as read count file that was used in TRACE.

∙ 푥 = (1.3) ∑∈ ∑∈ ( )

We also applied within-dataset normalization (1.4) on the bias corrected signals by surrounding 10kb window, as well as between-dataset normalization (1.5) based on kth

percentile 푝 and standard deviation 휎 of the ATAC/DNase-seq peak 퐿 or region of interest that

∙ contains 푥. R function LOESS was then applied on normalized signal 푥 for local regression.

∙ 푥 = (1.4) ∑∈ ∑∈ ( )

∙ 푥 = ∙ (1.5) ( )/ The normalized and smoothed bias-corrected signals were then used to calculate derivatives using Savitzky-Golay filter from python package Scipy. This method fits data into a second order polynomial with window size of 9. There values were stored in deviation file and used as input data in TRACE.

2. Hidden Markov Model for footprinting

2.1 Model parameters

Hidden Markov models (HMM) have widely applied in bioinformatics. It can represent probability distributions over sequence of observations 푂 =(푂, 푂,…, 푂) generated by hidden state path 퐻 =(퐻, 퐻,…, 퐻). In TRACE, observations are our processed data and hidden states are different chromatin labels such as footprints and small peaks. Observed features at each genomic position t (푂 =(푂,, 푂.,…, 푂,)) include cut count, derivative of

ATAC/DNase-seq cut counts and PWM score. HMM parameters also include emission distribution 푏(푘) = 푃(푂 = 푘|퐻 = 푖), initial states probability 휋 = 푃(퐻 = 푖) and transition probability 푎 = 푃(퐻 = 푗|퐻 = 푖).

2.2 Model details

Our model was build based on the idea of generalized HMM, in which each motif consists of n states each representing one position in its PWM (n is the length of PWM.). But each state in a motif can only transition to the next state in that motif, and the last state in this motif will transition to state of the small peak, so each motif can still be considered as an individual large state but its parameters at each can be captured separately. There are also fp states representing generalized footprints which do n’ot match any motif in model. Fig. 1B is just shows this in a simplified structure of TRACE model.

Two Background states represent starting and ending positions for each region of interest.

Small peaks that surround footprints are divided into UP, TOP, and DOWN state with one- direction connection. DOWN state will either transit to a footprint state or the end of open chromatin region state (Background state). Only the last state in motif states or start of the region

(Background state) can transit to UP state.

To better predict the functional binding sites, we included two sets of states of each motif to represent active and inactive binding sites. For each , TRACE will differentiate and predict its functional binding sites and those regions with a matching motif but are non’t necessarily bound by that TF.

2.3 Model learning

Given sets of observed features, the Baum–Welch algorithm were was applied to find the maximum likelihood estimate of the parameters 휆 =(푎, 푏(푘), 휋). Forward-backward algorithm is applied here. Forward probability 훼(푖) (2.1) is the probability of observing the partial sequence (푂, 푂,…, 푂) such that the state 퐻 is 푖. Backward probability 훽(푖) (2.2) is the probability of observing the partial sequence (푂, 푂,…, 푂) such that the state 퐻 is 푖.

Forward algorithm induction: (2.1)

 Initialization: 훼(푖) = 휋푏(푂)  Induction: 훼(푖) = ∑ 훼(푗)푎푏(푂)  Termination: 푃(푂|휆) = ∑ 훼(푗)

Backward algorithm induction: (2.2)

 Initialization: 훽(푖) = 1  Induction: 훽(푖) = ∑ 푎 푏(푂)훽(푗)

The Baum–Welch algorithm iteratively updates model parameters 휆. It combines the entire observation 푂 and the forward/backward variables to obtain the probability of being in state 푖 at time 푡 and in state 푗 at time 푡 +1 as 휉(푖, 푗) (2.4) and the probability of being in state 푖 at time 푡 given the observation sequence as 훾(푖) (2.5). 휉(푖, 푗) and 훾(푖) will then be used for calculation

of 푎 (2.6) and 푏(푘) (2.7) to update 휆. These newly estimated 휆 will replace the previous 휆 and then to be used in the next iteration, until the increase of log 푃(푂|휆) is smaller than predetermined small number.

()()() ()()() 휉(푖, 푗) = 푃(퐻 = 푖, 퐻 = 푗|푂, 휆) = = (2.3) (|) ∑ ∑ ()()() 훾(푖) = 푃(퐻 = 푖|푂, 휆) = ∑ 휉(푖, 푗) (2.4)

∑ (,) 푎 = (2.5) ∑ () ∑ () , 푏(푘) = (2.6) ∑ ()

The emission distribution (푏(푘)) (2.1) for each hidden state 퐻 = 푖 is modeled with a multivariance multivariate normal distribution, with D-dimensional mean vector 휇 =

(휇,, 휇,,…, 휇,) and full covariance matrix Σ. D is number of features included as input data.

Emission parameters that are estimated in Baum-Welch algorithm include (휇,Σ) for each state

푖.

( )( )( ) ( ) ( ) 푏 푘 = 푃 푂 = 푘|퐻 = 푖 = 푃(푘|휇,Σ)= 푒 (2.7) () | | ∑ ퟏ() 휇 = (2.8) ∑ ퟏ()

∑() ()ퟏ() Σ = (2.9) ∑ ퟏ()

The Viterbi algorithm wasere implemented to find the best hidden path 퐻 that maximizes the likelihood 푃(퐻|푂, 휆) from observation 푂 and estimated model 휆. It finds the most probable hidden path 퐻 recursively by getting determining 훿(푖) = 푃(퐻, 푂|휆), the highest probability path ending in state 푖, then backtrack to obtain the best path. Since one of the features in TRACE model is PWM score, an optional bit score threshold can be included to reduce the false positive calls. When threshold is provided, 훿(푖) for state i at position t will be set to a minimal value if PWM score at t is below the threshold.

3. Bait motif selection

Besides In addition to the states of the TF of interest, the TRACE model also includes other motifs which serve as bait motifs. Adding bait motifs can potentially reduce the false positive labels of regions with footprint-like digestion patterns but a weak sequence match with the TF of interest, as those regions might have higher sequence preference for bait motifs. This binding competition can potentially increase the accuracy of identifying binding sites.

In order to add useful information in the model, the extra motifs should not have similar binding preference, otherwise they will only contain repetitive sequence information and be treated as same states by TRACE. To ensure all motifs included in the model are different, we obtained hierarchical clustering information of PFMs from JASPAR database using the RSAT matrix- clustering tool (Castro-Mondragon et al. 2017). The motifs at the root of each tree encompasses all the position-specific scoring matrices (PSSMs) of a cluster. Those root alignments are the only PWMs should be added in TRACE model as baits. The root motif from the cluster that contains the TF of interest should also be excluded from the model. To decide which motifs should be added in the model, we scanned each root motif across the genome and ranked their numbers of occurrences. For a N-motif model for a certain TF, the bait motifs will be (N-1) most abundant root motifs from the clusters that don’t contain that TF of interest.

4. Existing footprinting methods evaluation

To assess the performances of existing computational footprinting tools including DeFCoM,

BinDNase, CENTIPEDE, PWM score only, DNase2TF, HINT, FLR, CENTIPEDE, PIQ,

Wellington and original Boyle method, we followed the motif-centric evaluation approach and tested on chr1. For motif-centric methods, candidate binding sites overlapped with DNase-seq peaks were used as testing sites and included in evaluation. For de novo methods, that same group of DNase-seq peaks were selected as testing regions.

For evaluation, all sites and their scores from motif-centric methods are included and assessed. For de novo methods, only those predicted footprints overlapped with at least 10% of a motif site were included; missing candidate binding sites from those input DNase-seq peaks were also included and assigned with a minimum score. We used the default settings for the majority of existing methods. Those ones with modifications or needing clarifications are listed below:.

DeFCoM and BinDNase

Since they are supervised methods, we trained model on chr19 and tested on chr1. Motif sites overlapped with DNase-seq peaks in chr1 and chr19 were used as testing set and training set respectively. We used default parameters as specified by Kähärä and Lähdesmäki (2015) and

Quach and Furey (2016).

CENTIPEDE

We utilized cut signal, PWM score, distance to closest TSS and conservation score as input data.

The other parameters were set the same as described by Pique-Regi et al. (2011).

HINT

We applied HINT on only DNase-seq or ATAC-seq data with default setting, and included the bias correction step. Also, we used newest version of HINT-ATAC for tests involving ATAC- seq data.

Wellington

We used data from double-hit DNase-seq protocol as required by Wellington and set -fdrlimit to

0. For evaluation, the absolute value of its reported scores for each footprint were used.

Boyle method

We have a build-in option in TRACE program to apply Boyle method. We included read counts and slope data and build a 6-state model similar to the model described in Boyle et al. (2011), then used the posterior probabilities in evaluation.

PWM score only

We calculated PWM scores for all motif sites that were included in motif-centric methods evolution. This includes only PWMs that overlap open chromatin sites in the current data set.

Only PWM scores were used in this evolution.

5. ATAC-seq pipeline

The ATAC-seq data for GM12878 were obtained from GSE47753, Omni-ATAC-seq data were obtained from Sequencing Read Archive (SRA) with the BioProject accession PRJNA380283. These data were processed following Kundaje lab's ATAC-seq pipeline.

(https://github.com/kundajelab/atac_dnase_pipelines)

Supplemental Figures

DOWN DOWN 1 1

(

u

n ) 2 2 C

b d

other motifs T F

o n

C C

or u u TOP TOP

F T

n

o background

C d

b

(

18 18 )

19 19 UP UP

Supplemental Figure 1. Detailed schematic of bound and unbound CTCF state in CTCF model.

Circles represent different hidden states including binding sites and peaks, lines with arrows represent transitions between different states. For simplicity, all other motifs, generic footprint states and background are represented by a dashed line circle.

A ATF1 BHLHE40 ELF1 CTCF B NRF1 YY1 GABPA 1.00 1.00 ELF4 MAX JUND ATF7 MAX ELF1 CTCF YY1 MITF FOXK2 MAFK NFYB ATF7 ELF4 ATF1 ATF4 USF1 REST REST MAFF MITF MNT MYC CREM CEBPG EGR1 NR2F6 ZNF24 ZFX ZNF24 JUN MNT ETV1BHLHE40JUND ZFX GABPA ELK1 GATA2 USF2 NFYB USF2 EGR1USF1 0.75 CREB3L1 ATF3 ZNF143 0.75 SP1 NRF1 NR2C2 CREB3L1 MXI1 IRF1 RFX1 MXI1 ZNF740 RFX1 MAFG ESRRA TEAD4 JUNB NFE2 MAFG ATF3 NR2C2 RUNX1 PBX2 ZNF740 CREM CTCFL GMEB1 IRF1 CEBPG CEBPB NFE2 TFDP1 ELK1 MEF2A CEBPB PBX2 ETS1 ZNF143 E2F6 MEF2D CUX1 NR2F1 ZBTB7AJUNB CUX1 NR2F2 SP1 GATA1 ATF4 MAFK MAFF MEIS2 PKNOX1 SOX6 PKNOX1 E2F6 KLF13 CTCFL NR2F1 TFDP1 ETS1 ETV1 GATA1 0.50 SOX6 ZBTB7A 0.50 FOXJ2 TEAD4 TBP IRF2 E2F1 LEF1 NFIC E2F7 TBP GATA2 NFYA NR2F6 NFYA KLF13 NEUROD1 MEIS2 DeFCoM E2F8 E2F8 NEUROD1 BinDNase STAT1 GMEB1 FOXK2 LEF1 ZBED1 RUNX1 ESRRA ETV6 SREBF1 TCF7L2 ZNF263 ETV6 FOXA1 E2F7 STAT1 ZBED1 FOXJ2 HMBOX1 ZNF384 MEF2A ZNF282 CREB3 ZNF384 MEF2D NFIC IRF2 TCF12 ZNF263 TCF12 TEAD2 TEAD2 0.25 0.25 SREBF1 NR2F2 NR4A1 HES1 NFATC3 ZNF282 THAP1 NFATC3 HES1 CREB3 RFX5 THAP1 HINFP NR4A1 HMBOX1 ZBTB33 IRF9 MYBL2 ZBTB33 KLF16 HINFP KLF16 NR3C1 RFX5 0.00 0.00 NR3C1 TCF7L2

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 TRACE TRACE C D 1.00 1.00 CTCF

CTCF NRF1

JUND ATF4 CREM ATF4 RFX1 USF1 ZNF740 EGR1 JUND NRF1 MITF ATF1 USF1 MAFK EGR1 MITF 0.75 ATF7 0.75 MAFK ATF7 ZNF740 CEBPB ZNF24 CEBPG CREM SP1 NFYB MAFF CEBPB GABPA RFX1 USF2 ELF1 MAFG MAFF IRF1 ZNF24 CEBPG USF2 MAX BHLHE40 ATF3 ETV1 GABPA TBP GATA2 CTCFL MAX ELF1 MAFG FOXK2 ATF1 NR2F6 JUN NFYB CTCFL NFYA ZNF263 ELF4 BHLHE40 YY1 IRF2 NFE2 GMEB1 0.50 PBX2 ETV1 MYC 0.50 REST ATF3 NFE2 ELK1 YY1 ZNF143 REST GATA1 MNT ELF4 NFIC KLF13 NFYA NR2F1 CUX1 ZBTB7A JUNB MEF2A MYC Centipede Wellington ZBED1 MEF2D SP1 IRF1 NR3C1 NR2C2 TFDP1 KLF13 TCF7L2 NR2F6 JUNB MNT ESRRA GMEB1 STAT1 NR2F1 E2F6 ZFX MEF2A ZBTB7A CREB3 LEF1 GATA2 TEAD4 NR2C2 MEF2D ETV6 HES1 CREB3L1 TFDP1 ZNF143 JUN ETV6 IRF9 GATA1 TBP IRF2 STAT1 NFATC3 CREB3 E2F1 RUNX1 ESRRA 0.25 E2F6 TEAD4 ELK1 ZFX 0.25 PBX2 ZNF263 SOX6 FOXK2 ETS1 ZBED1 HMBOX1NFIC RFX5 TCF12 TEAD2 HMBOX1 LEF1 CREB3L1 PKNOX1 ETS1 RFX5 MEIS2 MXI1 NR2F2 NEUROD1 TEAD2 ZNF282 E2F7 SREBF1 E2F1 MEIS2 MXI1 ZBTB33 RUNX1 KLF16 E2F8 FOXA1 PKNOX1 TCF12 NEUROD1 FOXJ2 HES1 E2F8 CUX1 HINFP FOXA1 HINFP NR4A1 TCF7L2 ZNF384 KLF16 FOXJ2 THAP1 E2F7 NR2F2 IRF9 ZBTB33 SOX6 0.00 MYBL2 0.00 THAP1 SREBF1 NR3C1 NR4A1 ZNF282 MYBL2

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 TRACE TRACE E F 1.00 NRF1 CREM 1.00

CEBPG USF1 CTCF NRF1 CTCF

CEBPB ELF1 JUND ATF4 JUND RFX1 EGR1 EGR1 GABPA ELF1 MITF ATF1 MITF BHLHE40 MAX USF1 MEF2A NFYB USF2 USF2 RFX1 ATF7 0.75 ATF7 YY1 0.75 YY1 MAFF IRF1 ETV1 ATF1 MEF2D ETV1 BHLHE40 MAFK GABPA MAFK MAFF ZNF24 ELF4 CEBPB ATF3 MNT ELF4 ZNF24 CREM MAFG FOXK2 REST MEF2D CEBPG ZNF740 MYC ZNF740 NFYA SP1 NR2F6 MAX CTCFL ZBTB7A TBP LEF1 NFE2 IRF1 MYC ETS1 ELK1 GATA1 GATA2 MAFG GATA2 CREB3L1 LEF1 NFE2 ZBTB7A NFYB MNT 0.50 0.50 REST GMEB1 PBX2 JUN ATF3 CTCFL NR2F6 HINT TFDP1 IRF2 NR2F1 TFDP1 ZFX SP1 MEF2A ZFX ZNF143

DNase2TF KLF13 ESRRA NR2F1 E2F6 GMEB1 STAT1 E2F7 PBX2 FOXK2 GATA1 JUN E2F6 MXI1 JUNB HMBOX1 IRF2 NR2C2 NFIC CUX1 E2F1 ESRRA ELK1 MXI1 ZNF143 ETV6 NR2C2 MEIS2 TEAD4 NFYA STAT1 NFIC HMBOX1 HES1 CREB3L1 FOXA1 TBP ZNF263 RUNX1 FOXA1 PKNOX1 ETV6 NR4A1 CREB3 E2F8 PKNOX1 0.25 0.25 E2F1 RUNX1 JUNB ETS1 IRF9 TCF12 FOXJ2 ZNF384 FOXJ2 ZBED1 NFATC3 ZNF263 HES1 E2F8 SOX6 KLF13 TEAD4 TCF7L2 RFX5 NR4A1 NEUROD1 IRF9 CREB3 CUX1 NR2F2 SOX6 MEIS2 NR3C1 NR2F2 TEAD2 ZNF282 TEAD2 RFX5TCF12 E2F7 ZBTB33 TCF7L2 ZNF384 KLF16 THAP1 NFATC3 THAP1 SREBF1 MYBL2 ZBED1 HINFP HINFP ZBTB33 MYBL2 KLF16 NR3C1 ZNF282 NEUROD1 0.00 0.00 SREBF1

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 TRACE TRACE G H 1.00 1.00

CTCF ATF4 CTCF RFX1 MITF ATF7 NRF1 ATF7 MAX JUND MITF CEBPG EGR1 USF1 CEBPB ATF4 MNT ZNF24 BHLHE40 JUND 0.75 ZNF24 MAFK NRF1 0.75 EGR1 ELF1 MYC CEBPB MAFF CEBPG BHLHE40 MAFK MAX MNT MAFG ZNF143 MYC ZNF740 YY1 USF1 MAFF ELF4 USF2 MAFG NFYB CTCFL ZNF384 TBP REST USF2 CREM IRF1 ETV1 YY1 JUNB REST NFE2 ATF3 GATA2 NFYB ELF1 CTCFL ZNF740 GABPA TFDP1 CREB3L1 MEF2D RFX1 0.50 E2F6 MXI1 SP1 ATF1 GABPA 0.50 NR2C2 ZBTB7A SP1 PIQ FLR CREB3L1 CREM TCF7L2 PKNOX1 NFYA ELK1 ATF1 CUX1 SREBF1 NR2F6 GATA1 ETV1 ELF4 E2F1 KLF13 NFE2 NR2F1 TFDP1 PBX2 JUN STAT1 NFIC MEF2A IRF1 ESRRA HMBOX1 IRF2 E2F6 ATF3 NR2F6 ZNF143 FOXK2 NFATC3 TEAD2 HES1 ZFX PBX2 FOXK2 ETS1 KLF13 ZBED1 E2F8 NR2F1 MXI1 HMBOX1 PKNOX1 ETS1 NR3C1 ETV6 RUNX1 ZBTB7A GMEB1 JUN ESRRA E2F1 ETV6 JUNB GATA2 0.25 ZBED1 IRF2 0.25 NFIC RUNX1 TCF12 NR4A1 MEIS2 TEAD4 ELK1 E2F8 TEAD2 ZNF263 LEF1 TEAD4 GATA1 TBP NEUROD1 STAT1 CUX1 MYBL2 ZNF282 LEF1 NFYA NR2F2 IRF9 NFATC3 NR2F2 ZNF384 ZFX MEF2A RFX5 NEUROD1 CREB3 CREB3 NR2C2 SOX6 RFX5 SOX6 MEIS2 GMEB1 ZBTB33THAP1 ZNF263 MYBL2 THAP1 HES1 E2F7 FOXA1 E2F7 HINFP IRF9 FOXJ2 HINFP FOXA1 MEF2D KLF16 TCF12 TCF7L2 FOXJ2 KLF16 ZBTB33 0.00 0.00 NR3C1 NR4A1 SREBF1 ZNF282

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 TRACE TRACE Supplemental Figure 2. PR AUC comparison between TRACE and existing methods for binding sites prediction of all TFs tested

A B 0.9 0.9

YY1

GATA1 ELF1 GABPA 0.8 0.8 ELF1 RUNX1 MEF2A TBP REST CREM NEUROD1 ATF1 SREBF1 MNT YY1 CUX1 ZNF143 MAX ATF1 TBP MEIS2 REST JUN ELF4 ZNF143 BHLHE40 MAX NEUROD1 MYC MNT NFYA CREB3L1 MAFK CREB3L1 JUND GATA2 MXI1 MYC USF1 GABPA ATF7 USF2 0.7 0.7 ATF4 RFX1 ELK1 NR2C2 JUND MEF2D NR2F2 CEBPG FOXK2 USF2 TCF7L2 E2F8 MXI1 BHLHE40 USF1 CUX1 NRF1 NFYB RFX1 MAFF NFYB KLF13 SOX6 E2F7 SOX6 ATF7 E2F8 CREM NFE2 NR2C2 FOXJ2JUNB ELF4 ZNF282 HMBOX1 ATF3 E2F1 JUNB GMEB1 PBX2 ESRRA ELK1 ZNF282 PKNOX1 PBX2 IRF2 MITF ETV1NR2F6ZBED1 TEAD4 RUNX1 DeFCoM NFE2 TCF12 IRF1 ETS1 BinDNase ATF3 EGR1 FOXA1 TEAD4 ZFX NR2F1 IRF1 MEIS2 SREBF1 PKNOX1 0.6 NFYA 0.6 ZFX E2F6 NFIC ZBED1 NR4A1 MAFG STAT1 E2F1 CTCF ETS1 GATA1 KLF13 TCF12 ZNF740 GATA2 STAT1 ZNF24 MITF ZNF384 IRF2 LEF1 CTCFL NR4A1 E2F7 ZNF24 ZBTB7A ESRRA EGR1 ZNF263 ETV6 ZBTB7A ZNF740 CTCF FOXJ2 CREB3 RFX5 HINFP ZNF263 THAP1 CEBPG MAFK TEAD2 MAFG ZBTB33NR2F1 CEBPB TEAD2 NR2F6 ETV1 NFIC ZNF384 FOXK2 0.5 0.5 KLF16 SP1 MEF2A CTCFL KLF16 ATF4 MYBL2 CREB3 THAP1 ETV6 NR2F2 LEF1 E2F6 MAFF MEF2D TCF7L2 HINFP TFDP1 IRF9 SP1 NR3C1 GMEB1 ZBTB33 TFDP1 NFATC3 HMBOX1 CEBPB NFATC3 NRF1 HES1 RFX5 HES1 NR3C1 0.4 0.4

0.4 0.5 0.6 0.7 0.8 0.9 0.4 0.5 0.6 0.7 0.8 0.9 TRACE TRACE C D 0.9 0.9

0.8 0.8

TBP CREM ZNF740 ZNF740 CTCF JUND TCF7L2 IRF9 ATF1 CTCF ZBED1 CREB3L1 RFX1 JUN NEUROD1 SOX6 CREM MEF2A NRF1 IRF1 JUN ZNF143 MEF2A 0.7 0.7 ZNF263 ATF3 ATF7 CREB3 ETV1 ATF1 CUX1 CEBPG USF1 ZNF143 NR2F6 USF1 TBP MAFF NFIC NR3C1 JUND USF2 ATF4 MAFK FOXJ2 NR3C1 ATF7 CEBPG FOXK2 BHLHE40 SP1 MAFG IRF2 KLF13 REST HES1 NRF1 NR2F1 MAFF MAFK RFX1 REST PBX2 RUNX1 SREBF1 GATA1 GABPA ELF4 STAT1 RFX5 ATF4 MEF2D JUNB Centipede MITF STAT1 IRF2 MEF2D Wellington NFYA ZBED1 MITF ETV1 BHLHE40 ELF1 0.6 MYC YY1 0.6 ELK1 MXI1 GATA2 USF2 NFE2 GMEB1 NFE2 NR2C2 MEIS2 EGR1 NEUROD1 RFX5 JUNB CEBPB E2F8 MAX ZNF282 MNT ELF1 E2F7 GABPA KLF13 ZFX NFATC3 CREB3 CREB3L1 MAFG ZNF24 ESRRA NR2C2 TEAD4 TEAD4 FOXA1 MAX NFYB MNT MYC HINFP ETV6 E2F6 ZFX ZNF263 ATF3 TCF12 LEF1 PKNOX1 MXI1 TEAD2NR2F2 ETS1 YY1 TCF7L2 NFIC NFYA LEF1 CTCFL ETV6 HINFP NFYB RUNX1 E2F8 KLF16 HMBOX1 SREBF1 0.5 EGR1 NR2F6 0.5 SOX6 FOXA1 CTCFL TFDP1 KLF16 TCF12 ELF4 ZBTB33 NR2F1 ESRRA E2F1 ELK1 IRF1 ETS1 MYBL2 NR4A1 GATA2 ZNF282 ZBTB33 E2F6 ZNF24 ZBTB7A MEIS2 ZBTB7A CEBPB THAP1 E2F7 NR2F2 FOXJ2 CUX1 PKNOX1 GATA1 TFDP1 E2F1 GMEB1 HMBOX1 TEAD2 MYBL2 NR4A1 THAP1 ZNF384 PBX2 FOXK2

0.4 IRF9 SP1 HES1 0.4

0.4 0.5 0.6 0.7 0.8 0.9 0.4 0.5 0.6 0.7 0.8 0.9 TRACE TRACE E F 0.9 0.9

0.8 0.8

TBP

ATF1 CEBPG MEF2A RFX1 CREM MAX MEF2D FOXJ2 NR3C1 ETS1 ATF1 ETV1 E2F7 0.7 USF2 ELF1 0.7 IRF2 CEBPB NEUROD1 ZNF282 IRF1 NR2F2 CUX1 BHLHE40 IRF9 MEF2D USF1 GABPA CREB3L1 HMBOX1 ETV1 NR2F6 NR4A1 LEF1 FOXK2 MEF2A FOXA1 NRF1 IRF1 TCF7L2 MNT CREB3 HMBOX1 BHLHE40 MAFF ELK1 ZNF740 LEF1 GATA2 ATF7 NEUROD1 YY1

CEBPB REST HINT YY1 GATA1 E2F7 RFX5 TCF12 IRF2 MYC NRF1 IRF9 ESRRA CEBPG E2F8 E2F8 MAX DNase2TF NFYA STAT1 STAT1 ATF3 ELF4 PBX2 NFYB JUND MITF MAFK RFX1 ZNF384 TBP 0.6 NFIC 0.6 RUNX1 USF2 ELK1 ZNF384 RFX5 MNT NFE2 ZNF24 MAFK SREBF1 FOXA1 NFATC3 CREB3 NR3C1 ELF1 MAFF ZNF740 CREM ETV6 E2F6 JUNDFOXK2 MAFG ETV6 TEAD4 MXI1 RUNX1 ZNF143 NR2F1 MEIS2PBX2 NR2F1 ESRRA MYC GABPA MAFG ZFX ZNF24 MXI1 USF1 ZFX CTCFL MITF FOXJ2 PKNOX1 NR4A1 ATF7 E2F1 ZNF143 ZNF263 CTCFL NFIC CREB3L1 JUN ELF4 0.5 ZNF263 TFDP1 GMEB1 HES1 ZBED1 SOX6 CUX1 0.5 PKNOX1 HINFP KLF16 TFDP1 GMEB1 TEAD2 JUNB GATA1 HINFP ZBTB7A SP1 CTCF NR2F2 JUN ETS1 TEAD4 TCF7L2 ZBTB33 ATF3 E2F1 SP1 MYBL2 ZBTB33 E2F6 EGR1 TEAD2 MEIS2 JUNB REST GATA2 SOX6 NFE2 ZBTB7A NR2C2 THAP1 CTCF KLF16 NFATC3 THAP1 MYBL2 SREBF1 TCF12 ATF4 EGR1 HES1 ZBED1 0.4 NR2C2 KLF13 NR2F6 0.4 KLF13 NFYA NFYB ZNF282

0.4 0.5 0.6 0.7 0.8 0.9 0.4 0.5 0.6 0.7 0.8 0.9 TRACE TRACE G H 0.9 0.9

0.8 0.8

NEUROD1 NR4A1 TBP

REST ZNF384 SREBF1 ZNF282 TEAD2 E2F7 CTCF JUND ZNF143 TCF7L2 CTCF NEUROD1 CEBPG MEF2D MAFK 0.7 0.7 IRF2 ETV1 RFX1 NR3C1 ETS1 MAX MNT E2F1 NFYA ZBED1 ELF4 MXI1 ZNF740 CEBPB ELF1 NRF1 ATF1 ATF7 YY1 MYC MITF IRF1 ATF7 RUNX1 ELK1 MNT STAT1 NFIC MAFF USF2 KLF13

MXI1 PIQ FLR CREM NR2C2 ATF4 REST ZBED1 USF1 E2F8 MYC ATF3 MITF HES1 USF1 NFATC3 MEF2A ELK1 NFYB CUX1 GATA2 RUNX1 BHLHE40 ZNF24 EGR1 HMBOX1 ZNF143 CREB3L1 MYBL2 MAX 0.6 TCF12 RFX5 HMBOX1 JUNB 0.6 MYBL2 TBP CEBPG FOXA1 ELF1 GABPA TFDP1 CREB3 MAFF MAFK KLF13 PKNOX1 USF2 MAFG CTCFL IRF2 ZNF384 ETS1 BHLHE40 MEIS2 ZBTB7A SP1 CREB3L1 SOX6 FOXK2 GATA1 E2F8 CTCFL HINFP PBX2 SP1 FOXK2 JUND YY1 ZFX E2F6 NFE2 MEIS2 IRF9 IRF9 NR2F6 JUN TEAD4 ELF4 ZNF263 RFX5 NR3C1TEAD4 ZNF282 JUN PKNOX1 E2F1 ETV6 GMEB1 E2F7 MAFG STAT1 KLF16 E2F6 CEBPB TFDP1 ESRRA ETV1 LEF1 JUNB MEF2A TCF7L2 0.5 HINFP 0.5 ETV6 ZBTB33 NFE2 MEF2D NR2F2 ZNF24 NR2F1 ATF4 THAP1 FOXJ2 PBX2 SREBF1 FOXA1 CREM GABPA ZNF263 ZFX NFIC TCF12 GMEB1 THAP1 KLF16 NRF1 NR2C2 NFYA NR2F2 FOXJ2 GATA2 GATA1 NR4A1 ZBTB33 ATF3 NR2F1 ESRRA ZBTB7A EGR1 CREB3 IRF1 NFYB HES1 ATF1 CUX1 SOX6 NFATC3 TEAD2 0.4 0.4 RFX1 NR2F6 ZNF740 LEF1

0.4 0.5 0.6 0.7 0.8 0.9 0.4 0.5 0.6 0.7 0.8 0.9 TRACE TRACE Supplemental Figure 3. ROC pAUC comparison between TRACE and existing methods for binding sites prediction of all TFs tested

Supplemental Figure 4. Numbers of TFs with ChIP-seq data available in (A) A549, (B) H1, (C) K562 and (D) GM12878 among all motifs in JASPAR CORE database (non-redundant). Each pie plot has a total number of 746 TFs, orange protion represents number of TFs has ChIP-seq data in that cell line in ENCODE.

Supplemental Figure 5. Average rank of ROC pAUC across all TFs tested using ATAC-seq data for TRACE, DeFCoM and HINT-ATAC

PR AUC

1.00

BHLHE40 EBF1 ELF1 YY1 SRF CREM CTCF 0.75 USF2 USF1 ZNF143 NFYB PBX3 TBX21 MXI1 ETS1 MAX TCF12 ELK1 PKNOX1 JUND NR2F1 NFATC3 0.50

NFYA PAX5 NRF1 MAFK OMNI−ATAC−seq GABPA ETV6

ZBED1 RELB CEBPB E2F8 0.25 ZBTB33 RFX5 RXRA

NR2C2

0.00

0.00 0.25 0.50 0.75 1.00 DNase−seq

Supplemental Figure 6. DNase-seq and OMNI-ATAC-seq based TRACE performance comparison on PR AUC. A B 1.00 1.00

0.75 Mean 0.75 Mean −10 −10 −5 −5

0.50 −2 0.50 −2 PR AUC PR

PR AUC PR 0 0 D 3 3

0.25 5 0.25 5

0.00 0.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Prevalence Prevalence C D 1.0 0.5

0.9 Mean 0.4 Mean

−10 −10 0.8 −5 0.3 −5 0 0 0.7

2 AUCROC 2

ROC AUCROC 0.2 D 3 3 5 5 0.6 0.1

0.5 0.0

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Prevalence Prevalence E ZNF282: PR AUC = 0.25, ROC AUC = 0.8 F CTCF: PR AUC = 0.94, ROC AUC = 0.84 500 1250

400 1000

300 Label 750 Label

Negative Negative

Count 200 Positive Count 500 Positive

100 250

0 0 −2.5 0.0 2.5 5.0 −4 −2 0 2 4 6 Score Score

Supplemental Figure 7. (A) PR AUC and (C) ROC AUC from simulation test with different prevalence. Scores for positive examples for all simulation were drawn from the same distribution N(5, 3). Each line represents a different negative example distribution with their own mean value, varying from 5 to -10. Black lines represent AUCs from random labels. (B) PR AUC increase and (D) ROC AUC increase from simulation test with different prevalence. (E, F) Score distributions for ZNF282 and NR2C2 as examples of TFs with different level of data imbalance.

Supplemental Figure 8. increase of footprinting methods’ best (A) ROC AUC and (B) ROC pAUC over permutation are not correlated with prevalence. Orange line is from simulation test using positive set from N(10, 8), negative set from N(0, 7).

GM12878 PR AUC RXRA IRF3 0.8 NR2C2 STAT1 0.6 RFX5 E2F4 0.4 E2F8 FOXK2 0.2 NFIC ZEB1 NFYA ZBTB33 NFATC1 JUND ZBED1 TBP NFATC3 REST GABPA PKNOX1 CUX1 MAX PAX5 CEBPB MXI1 RELB ETV6 ZNF384 TCF12 NRF1 PBX3 ETS1 MEF2B ELK1 NR2F1 ZNF143 TBX21 MEF2C CREM NFYB USF2 SRF BHLHE40 USF1 ELF1 YY1 EBF1 CTCF

e eq s se−seq TAC− A revalenc DNa p

Supplemental Figure 9. Heatmap of PR AUC of all TFs tested using DNase-seq and ATAC-seq data, sorted by prevalence.

PBX3

PKNOX1 MAX USF2

USF1 YY1 NFATC3 ZNF143 ELK1 CREM TCF12 ELF1 0.4 ZNF384 ETS1 NFYB MXI1 FOXK2 SRF BHLHE40 ZEB1 MEF2C PR AUC from footprinting CUX1 EBF1 NFYA TBX21 MEF2B −0.2 CEBPB NR2F1 NFATC1 NRF1 GABPA −0.4 RXRA STAT1 JUND PAX5 −0.6 IRF3 TBP RELB −0.8 0.2 ZBTB33

PR AUC PR E2F8 REST

D NFIC NR2C2 ETV6 E2F4 Data ZBED1 CTCF ATAC−seq RFX5 DNase−seq

0.0

0.00 0.25 0.50 0.75 1.00 Prevalence

Supplemental Figure 10. Performance improvement of TRACE model over permutation for each TF in GM12878, colored by its best PR AUC from DNase-seq or ATAC-seq data. Orange line is from simulation test using positive instances drawn from N(12, 9), and negative instances from N(0, 9) to demonstrate expected PR AUC trend as binding prevalence changes.

A PR AUC

0.6 JUN

JUNB 0.4

NR3C1 JUND

Max Increase Max 0.2 CTCF

0.0 0.0 0.2 0.4 0.6 0.8 Prevalence B ROC pAUC

0.6

0.4 Residence Time

long intermediate

Max Increase Max CTCF short 0.2 JUND NR3C1 JUN JUNB

0.0 0.0 0.2 0.4 0.6 0.8 Prevalence C ROC AUC

0.6

0.4 JUNB CTCF JUN JUND NR3C1

Max Increase Max 0.2

0.0 0.0 0.2 0.4 0.6 0.8 Prevalence Supplemental Figure 11. Comparison of footprinting performance increase on TFs with (A) short, (B) intermediate and (C) long residence time.

PR AUC

1.00 YY1 USF1 CTCF ELF1 USF2 EBF1 0.75 ELK1

ETS1 GABPA NRF1 MXI1 MEF2C FOXK2 MEF2B 0.50 PAX5

ATAC−seq NR2F1 ETV6 JUND ZBTB33 MAX NFATC1 NFIC RELB E2F4 STAT1 0.25 NR2C2

RXRA 0.00

0.00 0.25 0.50 0.75 1.00 DNase−seq (low read depth)

Supplemental Figure 12. TRACE performance comparison on DNase-seq and ATAC-seq with comparable level of read depth. DNase-seq data used in this analysis has fewer reads than the one in Figure 3.

Supplemental Table

(A)

1-motif model 10-motif model

Size of Number Computational Number Computational Number training set Memory Memory of states time of states time of cores (kilobases) 71.9 28 <1min 0.21G 310 4min 1.4G 40

180.6 34 1min 0.59G 316 9min 3.5G 40

238.3 52 2min 0.98G 334 17min 4.8G 40

883.1 32 4min 2.8G 296 62min 15.6G 40

1308.6 48 7min 4.2G 316 90min 20.2G 40

(B)

1-motif model 10-motif model

Size of Number Computational Number Computational Number testing set Memory Memory of states time of states time of cores (kilobases) 71.9 28 <1s 0.2G 310 3s 1.1G 40

180.6 34 <1s 0.5G 316 7s 2.9G 40

238.3 52 1s 0.9G 334 9s 4.0G 40

883.1 32 2s 2.3G 296 29s 12.8G 40

1308.6 48 4s 3.5G 316 39s 16.8G 40

Supplemental Table 3. Computational time and memory TRACE requires for (A) training and (B)Viterbi step, with different sizes of model and training set. Each row shows the time and memory needed to finish a TRACE run for an example TF on a certain number of kilobase length of training data, using models with different numbers of hidden sates depending on motif size. CPU: Intel Xeon E5-2696 v4 @ 3.7GHz

References

Castro-Mondragon JA, Jaeger S, Thieffry D, Thomas-Chollier M, van Helden J. 2017. RSAT matrix-clustering: dynamic exploration and redundancy reduction of transcription factor binding motif collections. Nucleic Acids Res 45: e119–e119. https://academic.oup.com/nar/article/45/13/e119/3862068 (Accessed February 19, 2020).