SUPPLEMENTARY RESULTS

Supplementary Table 1. Complete listing of factors determined to bind differentially in modern human and Neanderthal promoteromes.

TF name Wilcoxon stat Wilcoxon pval HGNC ID UniProt ID Description Family

LHX3 442284.5 4.83E-38 6595 Q9UBR4 LIM/homeobox protein Lhx3 homeodomain transcription factor(PC00119)

PROP1 271411.5 3.18E-35 9455 O75360 Homeobox protein prophet of Pit-1

FOXC1 450055.5 3.71E-33 3800 Q12948 Forkhead box protein C1 winged helix/forkhead transcription factor(PC00246)

POU4F3 865739 1.14E-26 9220 Q15319 POU domain, class 4, transcription factor 3

POU4F1 638453.5 2.33E-26 9218 Q01851 POU domain, class 4, transcription factor 1

HOXD8 908880 6.68E-23 5139 P13378 Homeobox protein Hox-D8

FOXB1 510034 9.09E-23 3799 Q99853 Forkhead box protein B1 winged helix/forkhead transcription factor(PC00246)

FOXQ1 134647.5 1.03E-21 20951 Q9C009 Forkhead box protein Q1 winged helix/forkhead transcription factor(PC00246)

POU4F2 1135242 1.39E-19 9219 Q12837 POU domain, class 4, transcription factor 2

PHOX2A 452784.5 1.45E-19 691 O14813 Paired mesoderm homeobox protein 2A homeodomain transcription factor(PC00119)

ARID3B 256994 1.68E-19 14350 Q8IVW6 AT-rich interactive domain-containing protein 3B DNA-binding transcription factor(PC00218)

ONECUT2 313241.5 2.12E-19 8139 O95948 One cut domain family member 2 homeodomain transcription factor(PC00119)

PHOX2B 510851 9.20E-18 9143 Q99453 Paired mesoderm homeobox protein 2B homeodomain transcription factor(PC00119)

PAX7 539656 1.00E-17 8621 P23759 Paired box protein Pax-7

POU3F3 631771 4.21E-17 9216 P20264 POU domain, class 3, transcription factor 3

MAFF 23605 8.03E-17 6780 Q9ULX9 Transcription factor MafF basic leucine zipper transcription factor(PC00056)

POU5F1 (POU5F1::SOX2) 43260 8.11E-16 9221 Q01860 POU domain, class 5, transcription factor 1

SOX2 (POU5F1::SOX2) 43260 8.11E-16 11195 P48431 Transcription factor SOX-2 HMG box transcription factor(PC00024)

POU2F2 898740 2.16E-15 9213 P09086 POU domain, class 2, transcription factor 2

POU3F1 608911.5 7.17E-15 9214 Q03052 POU domain, class 3, transcription factor 1

ONECUT3 273456 1.17E-14 13399 O60422 One cut domain family member 3 homeodomain transcription factor(PC00119)

POU3F2 689576.5 1.92E-14 9215 P20265 POU domain, class 3, transcription factor 2

POU2F1 732151 2.01E-14 9212 P14859 POU domain, class 2, transcription factor 1

FOXC2 296082 1.08E-13 3801 Q99958 Forkhead box protein C2 winged helix/forkhead transcription factor(PC00246)

ARID5A 778558 2.08E-13 17361 Q03989 AT-rich interactive domain-containing protein 5A transcription cofactor(PC00217)

ONECUT1 131454.5 5.31E-12 8138 Q9UBC0 Hepatocyte nuclear factor 6 homeodomain transcription factor(PC00119)

ARID3A 47976.5 1.51E-11 3031 Q99856 AT-rich interactive domain-containing protein 3A DNA-binding transcription factor(PC00218)

HOXC10 243032.5 2.52E-11 5122 Q9NYD6 Homeobox protein Hox-C10

CUX1 586 4.24E-11 2557 P39880 Homeobox protein cut-like 1 homeodomain transcription factor(PC00119)

POU1F1 1122196 4.30E-11 9210 P28069 Pituitary-specific positive transcription factor 1

POU5F1B 488691 2.07E-09 9223 Q06416 Putative POU domain, class 5, transcription factor 1B

MEF2D 437577.5 2.26E-09 6997 Q14814 Myocyte-specific enhancer factor 2D MADS box transcription factor(PC00250)

MEF2A 441549 2.41E-09 6993 Q02078 Myocyte-specific enhancer factor 2A MADS box transcription factor(PC00250)

NKX2-5 773 7.84E-09 2488 P52952 Homeobox protein Nkx-2.5 homeodomain transcription factor(PC00119)

TBX19 4898.5 9.31E-09 11596 O60806 T-box transcription factor TBX19 Rel homology transcription factor(PC00252)

LIN54 10569.5 1.05E-08 25397 Q6MZP7 Protein lin-54 homolog

PAX3 224681 1.12E-08 8617 P23760 Paired box protein Pax-3

IRF1 4225 1.31E-08 6116 P10914 Interferon regulatory factor 1 winged helix/forkhead transcription factor(PC00246)

MECOM 15265.5 3.32E-08 3498 Q03112 Histone-lysine N-methyltransferase MECOM C2H2 transcription factor(PC00248)

HOXD9 86178.5 3.70E-08 5140 P28356 Homeobox protein Hox-D9

SIX3 3398 5.20E-08 10889 O95343 Homeobox protein SIX3 homeodomain transcription factor(PC00119)

MAFK 7199.5 6.53E-08 6782 O60675 Transcription factor MafK basic leucine zipper transcription factor(PC00056)

HOXD11 279435.5 1.28E-07 5134 P31277 Homeobox protein Hox-D11

CDX1 172275 2.60E-07 1805 P47902 Homeobox protein CDX-1 homeodomain transcription factor(PC00119)

POU3F4 533271 2.67E-07 9217 P49335 POU domain, class 3, transcription factor 4

POU6F1 77984 5.00E-07 9224 Q14863 POU domain, class 6, transcription factor 1

HOXA13 61766 6.26E-07 5102 P31271 Homeobox protein Hox-A13

CEBPA 31 1.23E-06 1833 P49715 CCAAT/enhancer-binding protein alpha basic leucine zipper transcription factor(PC00056)

BARHL2 15912 1.48E-06 954 Q9NY43 BarH-like 2 homeobox protein homeodomain transcription factor(PC00119)

UNCX -399 3.70E-06 33194 A6NJT0 Homeobox protein unc-4 homolog

LMX1B 146337.5 3.95E-06 6654 O60663 LIM homeobox transcription factor 1-beta homeodomain transcription factor(PC00119)

MEF2B 542518 4.27E-06 6995 Q02080 Myocyte-specific enhancer factor 2B MADS box transcription factor(PC00250)

OTX2 97 5.41E-06 8522 P32243 Homeobox protein OTX2 homeodomain transcription factor(PC00119)

HOXD13 90020 1.05E-05 5136 P35453 Homeobox protein Hox-D13

HOXA10 172880 2.30E-05 5100 P31260 Homeobox protein Hox-A10

POU2F3 535473 3.18E-05 19864 Q9UKI9 POU domain, class 2, transcription factor 3

HNF1A 434983 5.38E-05 11621 P20823 Hepatocyte nuclear factor 1-alpha DNA-binding transcription factor(PC00218)

HESX1 -684 5.45E-05 4877 Q9UBX0 Homeobox expressed in ES cells 1

HNF1B 294035.5 6.01E-05 11630 P35680 Hepatocyte nuclear factor 1-beta DNA-binding transcription factor(PC00218)

MAFG 1054.5 6.43E-05 6781 O15525 Transcription factor MafG basic leucine zipper transcription factor(PC00056)

IRF2 557 9.28E-05 6117 P14316 Interferon regulatory factor 2 winged helix/forkhead transcription factor(PC00246)

ZNF384 67.5 0.0001402 11955 Q8TF68 Zinc finger protein 384 C2H2 zinc finger transcription factor(PC00248)

FOXA2 3489 0.0001624 5022 Q9Y261 Hepatocyte nuclear factor 3-beta winged helix/forkhead transcription factor(PC00246)

HMX3 50292 0.0003858 5019 A6NHT5 Homeobox protein HMX3

HMX2 46211 0.0005692 5018 A2RU54 Homeobox protein HMX2

NKX6-2 6468 0.0005991 19321 Q9C056 Homeobox protein Nkx-6.2 homeodomain transcription factor(PC00119)

IRF9 1844 0.0007433 6131 Q00978 Interferon regulatory factor 9 winged helix/forkhead transcription factor(PC00246)

NR1H3 (NR1H3::RXRA) 685 0.001769 7966 Q13133 Oxysterols receptor LXR-alpha C4 zinc finger nuclear receptor(PC00169)

RXRA (NR1H3::RXRA) 685 0.001769 10477 P19793 Retinoic acid receptor RXR-alpha C4 zinc finger nuclear receptor(PC00169)

OLIG2 3491.5 0.0018305 9398 Q13516 Oligodendrocyte transcription factor 2 basic helix-loop-helix transcription factor(PC00055)

STAT6 -598 0.001918 11368 P42226 Signal transducer and activator of transcription 6 DNA-binding transcription factor(PC00218)

MEF2C 351 0.0020757 6996 Q06413 Myocyte-specific enhancer factor 2C MADS box transcription factor(PC00250)

FOXF2 29 0.0023199 3810 Q12947 Forkhead box protein F2

VENTX 14790.5 0.0036602 13639 O95231 Homeobox protein VENTX homeodomain transcription factor(PC00119)

NR2E1 1400.5 0.0040203 7973 Q9Y466 Nuclear receptor subfamily 2 group E member 1 C4 zinc finger nuclear receptor(PC00169)

DUXA 1735 0.0043964 32179 A6NLW8 Double homeobox protein A

PITX1 378 0.0045946 9004 P78337 Pituitary homeobox 1

ISL2 132 0.0046249 18524 Q96A47 Insulin enhancer protein ISL-2 homeodomain transcription factor(PC00119)

FOXD3 102497 0.0046328 3804 Q9UJU5 Forkhead box protein D3 winged helix/forkhead transcription factor(PC00246)

TCF7L2 1157.5 0.0051148 11641 Q9NQB0 Transcription factor 7-like 2

EVX1 248 0.0064663 3506 P49640 Homeobox even-skipped homolog protein 1

FOXH1 1345 0.0071219 3814 O75593 Forkhead box protein H1

OTX1 1128.5 0.008041 8521 P32242 Homeobox protein OTX1 homeodomain transcription factor(PC00119)

PBX3 -8 0.0082676 8634 P40426 Pre-B-cell leukemia transcription factor 3 homeodomain transcription factor(PC00119)

CEBPB 2082 0.0082978 1834 P17676 CCAAT/enhancer-binding protein beta basic leucine zipper transcription factor(PC00056)

FOS (FOS::JUN(VAR.2)) -217.5 0.0095211 3796 P01100 Proto-oncogene c-Fos basic leucine zipper transcription factor(PC00056)

JUN (FOS::JUN(VAR.2)) -217.5 0.0095211 6204 P05412 Transcription factor AP-1 basic leucine zipper transcription factor(PC00056)

HOXB5 7268 0.0096238 5116 P09067 Homeobox protein Hox-B5 homeodomain transcription factor(PC00119)

Supplementary Figure 1. Aggregate expression of DB TFs in 100 tissues. FANTOM5 RNA-Seq data were extracted as TPM values, and the 100 tissues with the highest aggregate expression of the differentially binding (DB) TF genes were selected for clustering (Figure 1), resulting in the order of tissues shown here.

Supplementary Table 2. Ontological terms (Biological Process and Disease) associated with the top 100 marker genes of cortical brain cell clusters expressing DB TFs.

Supplementary Figure 2. ROC analysis of model performance in the identification of experimentally verified functional TFBSs – random Ensembl transcripts. ROC analysis was performed using experimentally verified functional TFBSs, as annotated in the ORegAnno/Pleiades/ABS datasets, as true positives; true negatives were random locations in other Ensembl transcripts at the same distance from the TSS as the associated true positive. All ROC curve analyses were performed on TFs which had at least 10 true positives, and 50 true negatives per true positive were used for each analysis. Each true positive/negative segment analyzed was 50 nucleotides long, and the highest TFBS score for the relevant dataset(s) was used for each true positive/negative. (A) Barplot of the frequency of experimental data type in the top 20 performing TFBSFootprinter models. (B) Boxplot of ROC scores for TFBSFootprinter and DeepBind for 14 TFs (left). ROC scores were also calculated using individual experimental metrics to show how well each contributes to the accuracy of the combined model. (C) ROC scores for each individual TF tested, for each primary TFBS prediction model under study. The best scoring model among all is named for each TF (right). TFBSFootprinter best by TF: based on the highest ROC score achieved by some combination of experimental data models. TFBSFootprinter overall best: based on the combination of experimental data models which had the best average ROC score across all TFs analyzed. DeepBind best by TF: based on the higher ROC score of the SELEX or ChIP-Seq DeepBind models.

Supplementary Figure 3. ROC analysis of model performance in the identification of experimentally verified functional TFBSs – random location in same Ensembl transcript. ROC analysis was performed using experimentally verified functional TFBSs, as annotated in the ORegAnno/Pleiades/ABS datasets, as true positives; true negatives were random locations in the same Ensembl transcript as the associated true positive. All ROC curve analyses were performed on TFs which had at least 10 true positives, and 50 true negatives per true positive were used for each analysis. Each true positive/negative segment analyzed was 50 nucleotides long, and the highest TFBS score for the relevant dataset(s) was used for each true positive/negative. (A) Barplot of the frequency of experimental data type in the top 20 performing TFBSFootprinter models. (B) Boxplot of ROC scores for TFBSFootprinter and DeepBind for 14 TFs (left). ROC scores were also calculated using individual experimental metrics to show how well each contributes to the accuracy of the combined model. (C) ROC scores for each individual TF tested, for each primary TFBS prediction model under study. The best scoring model among all is named for each TF (right). TFBSFootprinter best by TF: based on the highest ROC score achieved by some combination of experimental data models. TFBSFootprinter overall best: based on the combination of experimental data models which had the best average ROC score across all TFs analyzed. DeepBind best by TF: based on the higher ROC score of the SELEX or ChIP-Seq DeepBind models.

Supplementary Figure 4. ROC analysis in the identification of TFBS ChIP-Seq peaks – strong and weak binding. ROC curves are presented for the best performing TFBSFootprinter model for each TF.

Supplementary Figure 5. ROC analysis in the identification of TFBS ChIP-Seq peaks – strong and distal binding. ROC curves are presented for the best performing TFBSFootprinter model for each TF.

SUPPLEMENTARY METHODS

Ensembl sequence retrieval

The Ensembl Representational State Transfer (REST) server application programming interface (API) [1] is used by TFBSfootprinter for automated retrieval of a user-defined DNA sequence near the transcription start site of an established Ensembl transcript ID. Annotations for the transcript and Ensembl-defined regulatory regions (e.g. 'promoter flanking region') are also retrieved and mapped in the final output figure.

PWMs

A total of 575 transcription factor PFMs retrieved from the 2018 JASPAR database [2] (http://jaspar.genereg.net/) are used to create PWMs (Eq 1), as described by [3]:

\[ \mathrm{weight}_{\mathrm{binding}} = \sum_{i=1}^{N} -\log_2 \frac{(a_i + b/4)\,/\,(S + b)}{(n_{nuc} + b)\,/\,(l_{bg} + b)} \tag{1} \]

N is the set of nucleotides in the currently scanned sequence; a_i is the number of instances of nucleotide a at position i; b is a pseudocount (set to 0.8 as per [3]); S is the number of sequences describing the motif; n_nuc is the count of the nucleotide in the background sequence; and l_bg is the length of the background sequence. The background frequencies for each nucleotide were set to match those of the human genome as determined previously [4].

Without a score threshold, scoring of a 1,000 nt promoter with all 575 JASPAR TF profiles will produce ~1,150,000 predictions. To address this issue, each of the 575 JASPAR PWMs was used to score the complete human genome (a total of 3,375,096,897,466 TF-DNA binding calculations). Scores for each TF were then used to generate a distribution which allowed pairing scores with p-values (from 10^-1 to 10^-5, or smaller when possible, depending on TF profile length). As a result, an appropriate score threshold can be chosen at the discretion of the user. Computation was performed using the supercomputing resources of CSC – IT Center for Science Ltd. (a non-profit owned by the state of Finland and Finnish higher education institutions).

CAGE peak locations and Spearman correlation of expression values

Cap analysis of gene expression (CAGE) uses sequencing of cDNA generated from RNA to both identify TSSs and quantify their expression levels. The FANTOM project has performed CAGE across the human genome [5] and the results are freely available for download (http://fantom.gsc.riken.jp/data/). Using the genomic locations of FANTOM CAGE peaks, the distances from all human nucleotides to the nearest CAGE peak were calculated. The distribution of these distances was used to generate a log-likelihood score for all observed distances. The CAGE peak locations and distance/log-likelihood score pairings are then used during de novo prediction of TFBSs (Eq 2).

\[ \mathrm{weight}_{\mathrm{CAGE\ distances}} = \sum_{i=0}^{N} -\log_2 P(x \le d_i \mid \mathrm{human\ genome}) \cdot \frac{p_i}{p_{total}} \tag{2} \]

Where N is the number of all CAGE peaks associated with the target gene; d_i is the distance to the current CAGE peak; p_i is the peak count of the current CAGE peak; and p_total is the total peak count for this gene.
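As an illustration of Eq 2, the following minimal sketch computes the CAGE distance weight from an empirical distribution of distances. The function names, the toy background distribution, and the clamping of zero probabilities to a single observation are assumptions made for this example and are not taken from the TFBSfootprinter source code.

```python
import bisect
import math


def lower_tail_probability(sorted_distances, d):
    """Empirical P(x <= d) from a sorted list of genome-wide distances."""
    rank = bisect.bisect_right(sorted_distances, d)
    # Assumption: clamp to one observation so the logarithm stays finite.
    return max(rank, 1) / len(sorted_distances)


def cage_weight(peaks, sorted_distances):
    """Eq 2: sum over peaks of -log2 P(x <= d_i) * p_i / p_total."""
    p_total = sum(count for _, count in peaks)
    weight = 0.0
    for distance, count in peaks:
        prob = lower_tail_probability(sorted_distances, distance)
        weight += -math.log2(prob) * count / p_total
    return weight


# Toy background: distances (nt) from sampled nucleotides to the nearest CAGE peak.
background = sorted([5, 20, 80, 150, 400, 900, 2500, 7000])
# Hypothetical (distance to peak, peak count) pairs for two CAGE peaks of one gene.
peaks = [(20, 30), (400, 10)]
weight = cage_weight(peaks, background)
print(round(weight, 4))
```

A nearby, highly expressed peak dominates the sum, because a short distance has a small lower-tail probability and the peak count ratio scales its contribution.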

Expression data for CAGE peaks associated with the 575 JASPAR TF genes was then combined with expression data for all CAGE peaks to perform a total of 386,652,770 Spearman correlation analyses using the 'spearmanr' function from the SciPy Stats module[6]. Bonferroni correction was performed to account for multiple testing. Due to the size of the analysis, a cutoff correlation magnitude value of 0.3 was used, and all lesser values (-0.3

\[ \mathrm{weight}_{\mathrm{expression\ correlation}} = -\log_2 P(x \ge c_{current} \mid c_{all}) \tag{3} \]

Where c_current is the Spearman correlation between the expression of the target gene and the expression of the TF corresponding to the putative TFBS; and c_all is the distribution of all Spearman correlations between JASPAR TF genes and all genes.
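The correlation step and Eq 3 can be sketched as follows, using the 'spearmanr' function from SciPy named above. The expression values and the background list of correlations are invented for illustration, and applying the 0.3 cutoff to the absolute value of the correlation is an assumption based on the phrase 'correlation magnitude'.

```python
import math

from scipy.stats import spearmanr

CUTOFF = 0.3            # correlation magnitude cutoff stated in the text
N_TESTS = 386_652_770   # number of correlation analyses stated in the text


def correlation_weight(c_current, all_correlations):
    """Eq 3: -log2 of the empirical P(x >= c_current) over all correlations."""
    n_at_least = sum(1 for c in all_correlations if c >= c_current)
    # Assumption: clamp to one observation so the logarithm stays finite.
    return -math.log2(max(n_at_least, 1) / len(all_correlations))


# Hypothetical expression values (TPM) of a TF and a target gene in six tissues.
tf_expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target_expression = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

rho, p_value = spearmanr(tf_expression, target_expression)
p_bonferroni = min(1.0, p_value * N_TESTS)  # Bonferroni correction
keep = abs(rho) >= CUTOFF                   # assumed magnitude filter

# Toy stand-in for the distribution of all retained correlations (c_all).
c_all = [0.31, 0.35, 0.42, 0.5, 0.61, 0.7, 0.83, 0.9]
weight = correlation_weight(rho, c_all)
print(round(rho, 3), keep, weight)
```

With only six tissues the Bonferroni-corrected p-value saturates at 1.0, which shows why a large number of samples is needed for any single correlation to survive correction at this scale.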

Experimental TFBSs compiled by the GTRD

Experimental data on TFBS binding (derived from ChIP-Seq, HT-SELEX, and PBMs) is one of the most direct ways to locate potentially functional TFBSs. While binding alone is not indicative of function, clusters of binding sites have been shown to imply functionality [7, 8]. The GTRD project (gtrd.biouml.org) is the largest comprehensive collection of uniformly processed human and mouse ChIP-Seq peaks and has compiled data from 8,828 experiments extracted from the Gene Expression Omnibus (GEO), Sequence Read Archive (SRA), and Encyclopedia of DNA Elements (ENCODE) databases. One of the outputs of the performed analyses is a set of metaclusters: places where TF binding events cluster together in the human genome. We retrieved the metacluster data (28,524,954 peaks) from the GTRD database (version 18.0) and subsequently mapped the number of overlapping metaclusters for all human nucleotides. The distribution of these overlaps was used to generate a log-likelihood score for all observed overlap counts. The metacluster locations and overlap/log-likelihood score pairings are then used during de novo prediction of TFBSs (Eq 4).

\[ \mathrm{weight}_{\mathrm{metaclusters}} = -\log_2 P(x \ge n_{overlap} \mid D_{\mathrm{human\ genome}}) \tag{4} \]

Where n_overlap is the number of metaclusters overlapped by the current putative TFBS; and D_human genome is the distribution of the number of overlapping metaclusters for every location in the human genome.
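A minimal sketch of Eq 4 is given below: it counts the metaclusters overlapping a putative TFBS and converts the count to a weight using a genome-wide histogram of overlap counts. The half-open interval convention, the function names, and the toy histogram are assumptions for illustration only.

```python
import math
from collections import Counter


def count_overlaps(site_start, site_end, clusters):
    """Number of metaclusters (half-open intervals) overlapping the site."""
    return sum(1 for start, end in clusters if start < site_end and end > site_start)


def upper_tail_weight(n_overlap, overlap_histogram):
    """Eq 4: -log2 P(x >= n_overlap) from a histogram {overlap count: positions}."""
    total = sum(overlap_histogram.values())
    at_least = sum(n for k, n in overlap_histogram.items() if k >= n_overlap)
    # Assumption: clamp to one observation so the logarithm stays finite.
    return -math.log2(max(at_least, 1) / total)


# Hypothetical metacluster intervals and a toy genome-wide overlap histogram.
clusters = [(100, 160), (120, 180), (150, 200), (400, 450)]
histogram = Counter({0: 700, 1: 200, 2: 60, 3: 30, 4: 10})

n_overlap = count_overlaps(140, 155, clusters)
weight = upper_tail_weight(n_overlap, histogram)
print(n_overlap, round(weight, 3))
```

Because most genomic positions overlap few metaclusters, a site covered by several of them falls in the sparse upper tail and receives a large weight.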

ATAC-Seq peaks

Assay for Transposase-Accessible Chromatin using sequencing (ATAC-Seq) is an experimental method for revealing the location of open chromatin [9]. These locations are indicative of genomic regions which, due to their unpacked state, may allow TFs to bind to DNA and subsequently influence transcription. Open chromatin regions have been shown to be useful in the prediction of TFBSs [10]. We retrieved and compiled data from 135 ATAC-Seq experiments stored in the ENCODE project database (www.encodeproject.org) and mapped the distance from all human nucleotides to the nearest ATAC-Seq peak. The distribution of these distances was used to generate a log-likelihood score for all observed distances. The ATAC-Seq peak locations and distance/log-likelihood score pairings are then used during de novo prediction of TFBSs (Eq 5).

\[ \mathrm{weight}_{\mathrm{ATAC\text{-}Seq\ distances}} = \sum_{i=0}^{N} -\log_2 P(x \le d_i \mid D_{\mathrm{human\ genome}}) \tag{5} \]

Where N is the number of ATAC-Seq peaks within the current target region; d_i is the distance to the current ATAC-Seq peak; and D_human genome is the distribution of the distances from all human nucleotides to the nearest ATAC-Seq peak.

eQTLs

The Genotype-Tissue Expression (GTEx) project (gtexportal.org) has performed expression quantitative trait loci (eQTL) analysis on 10,294 samples from 48 tissues from 620 persons (version 7) [11]. This analysis has identified 7,621,511 variant locations in the genome, usually 1-5 nt in length, that affect gene expression. eQTL data was extracted from the GTEx database and used to construct a distribution of the magnitude of effect on gene expression, which was then used to generate log-likelihood scores (Eq 6). Next, we generated a second distribution of the distance from each gene to its variants; the distance was limited to 1,000,000 bp from either end of the transcript, as this is the search area over which GTEx scans for variants affecting expression of each gene. The variant locations, magnitude of effect/log-likelihood score pairings, and distance/log-likelihood score pairings are then used during de novo prediction of TFBSs (Eq 7).

\[ \mathrm{weight}_{\mathrm{eQTL\ magnitude}} = \sum_{i=0}^{N} -\log_2 P(x \ge m_i \mid D_{\mathrm{human\ genome}}) \tag{6} \]

Where N is the number of eQTLs overlapping the current putative TFBS; m_i is the magnitude of effect of an eQTL overlapping the current putative TFBS; and D_human genome is the distribution of all eQTL magnitudes in all human nucleotides.

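\[ \mathrm{weight}_{\mathrm{eQTL\ overlap}} = \sum_{i=0}^{N} -\log_2 P\left(x \ge \left(\frac{l_{tf} \cdot n_v}{l_g - l_{tf}}\right) \cdot \frac{l_t}{l_g} \,\middle|\, D_{\mathrm{human\ genome}}\right) \tag{7} \]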

Where N is the number of eQTLs overlapping the current putative TFBS; l_tf is the length of the current putative TFBS (nucleotides); n_v is the number of variants associated with the target gene; l_g is the length of the current gene (nucleotides); l_t is the length of the target transcript; and D_human genome is the distribution of all eQTL overlaps for nucleotide windows the same length as the current TFBS, in the human genome.

CpG islands

Because methylation of DNA acts as a repressor of transcription, active promoters tend to be unmethylated. When methylated, the cytosine in a CpG dinucleotide can deaminate to thymine, so regions that are methylated tend to lose CpG dinucleotides over time. Therefore, a CpG ratio close to what would be expected by chance is often indicative of an active promoter region [12, 13]. Accordingly, CpG ratios (observed/expected) across a 200 nt window were computed for all human nucleotides. A distribution of these ratios was generated and used to produce log-likelihood scores for each possible ratio. CpG ratio/log-likelihood score pairings are then used during de novo prediction of TFBSs.
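The windowed CpG ratio can be sketched as follows. The text does not state the exact formula for the expected count, so this example assumes the commonly used definition (number of CpG × window length) / (number of C × number of G); the function names and the truncation of windows at sequence ends are likewise assumptions.

```python
def cpg_obs_exp(window):
    """Observed/expected CpG ratio, assumed as (#CpG * length) / (#C * #G)."""
    seq = window.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def sliding_cpg_ratios(sequence, window=200):
    """Ratio for a window centered on each nucleotide (truncated at the ends)."""
    half = window // 2
    ratios = []
    for centre in range(len(sequence)):
        start = max(0, centre - half)
        end = min(len(sequence), centre + half)
        ratios.append(cpg_obs_exp(sequence[start:end]))
    return ratios


# Toy sequence with a CpG-rich start; a 4 nt window keeps the example readable.
ratios = sliding_cpg_ratios("CGCGATATAT", window=4)
print(ratios)
```

In practice the window would be 200 nt as stated above, and the resulting ratios would be looked up in the precomputed genome-wide distribution.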

\[ \mathrm{weight}_{\mathrm{CpG}} = -\log_2 P(x \ge r_{obs/exp} \mid D_{\mathrm{genome}}) \tag{8} \]

Where r_obs/exp is the ratio of observed to expected CpG dinucleotides in a 200 nt window centered on the current putative TFBS; and D_genome is the distribution of r_obs/exp across all nucleotide locations in the target genome.

Conservation of vertebrate DNA

Conservation of DNA across species boundaries is evolutionarily costly and thus implies function. While differential regulation of gene transcription accounts for much of the variation among species [14], there is conservation of regulatory elements across closely related species groups, such as among primates [15]. The Ensembl EPO (Enredo, Pecan, Ortheus) pipeline has created whole-genome multiple sequence alignments for distinct clades of vertebrates [16]. As of Ensembl release 94, whole-genome alignments exist for mammals (70 species), fish (48 species), amniotes (32 species), and sauropsids (7 species). The number of available species is dependent on the current Ensembl release and is continually growing. Sequence conservation analysis has been performed by Ensembl to identify constrained elements for each species in each species group using the genomic evolutionary rate profiling (GERP) tool [17]. For each of the vertebrate species of Ensembl release 94, we have calculated the distance from all nucleotides in the associated species genome to the nearest GERP constrained element, and generated distributions of distances which were used to calculate log-likelihood scores for each distance. GERP element distance/log-likelihood score pairings, for each species, are then used during de novo prediction of TFBSs in the relevant species.

\[ \mathrm{weight}_{\mathrm{conservation}} = -\log_2 P(x \le d_i \mid D_{\mathrm{genome}}) \tag{9} \]

Where d_i is the distance between the current putative TFBS and the nearest conserved element in an alignment of 70 mammalian genomes (GERP); and D_genome is the distribution of distances between all nucleotides in the target genome and the nearest GERP conserved element.

Combined affinity score

A summation of the weight scores from each experimental dataset is then performed for each putative TFBS and is represented as the 'combined affinity score'. For analysis of human sequences this is represented by Eq 10. Due to the limitations of available experimental data for non-human species, the combined affinity score for other vertebrates is currently described by Eq 11. A complete scoring of the promoter regions (1,000 nt) of more than ~80,000 transcripts was used to generate p-values for combined affinity scoring; computation was performed using the supercomputing resources of CSC – IT Center for Science Ltd.

\[ \mathrm{Combined\ Affinity}_{\mathrm{human}} = Eq_1 + Eq_2 + Eq_3 + Eq_4 + Eq_5 + Eq_6 + Eq_7 + Eq_8 + Eq_9 \tag{10} \]

\[ \mathrm{Combined\ Affinity}_{\mathrm{non\text{-}human}} = Eq_1 + Eq_8 + Eq_9 \tag{11} \]
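Eqs 10 and 11 amount to summing a species-dependent subset of the component weights. The sketch below makes that explicit; the component names, the species identifier, and the weight values are hypothetical labels chosen for this example rather than identifiers from the tool.

```python
# Hypothetical labels for the nine component weights (Eqs 1-9).
HUMAN_COMPONENTS = (
    "pwm", "cage", "expression_correlation", "metaclusters", "atac",
    "eqtl_magnitude", "eqtl_overlap", "cpg", "conservation",
)
# Components available for non-human vertebrates (Eqs 1, 8, and 9).
NON_HUMAN_COMPONENTS = ("pwm", "cpg", "conservation")


def combined_affinity(weights, species="homo_sapiens"):
    """Sum the component weights that apply to the given species."""
    if species == "homo_sapiens":
        components = HUMAN_COMPONENTS
    else:
        components = NON_HUMAN_COMPONENTS
    # A missing component contributes nothing to the sum.
    return sum(weights.get(name, 0.0) for name in components)


# Invented weights for a single putative TFBS.
weights = {
    "pwm": 8.0, "cage": 1.5, "expression_correlation": 2.0,
    "metaclusters": 4.5, "atac": 0.5, "eqtl_magnitude": 0.25,
    "eqtl_overlap": 0.25, "cpg": 1.0, "conservation": 3.0,
}
print(combined_affinity(weights), combined_affinity(weights, "mus_musculus"))
```

Because the two sums use different numbers of terms, human and non-human combined affinity scores are not directly comparable.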

Supplementary Figure 6. A diagrammatic representation of the TFBSfootprinter algorithm.

Benchmarking TFBSfootprinter

In order to robustly assess the ability of PWM, TFBSfootprinter, and DeepBind models to identify functional human transcription factor binding sites, several approaches were taken for benchmarking. These included using experimentally validated functional regulatory binding sites as well as ChIP-Seq peaks from the GTRD database (compiled and uniformly processed from ENCODE, GEO, and SRA databases) to serve as true positives. True negatives were also defined using several approaches in order to mitigate biases inherent in each approach.

Experimentally validated functional TFBSs are true positives

Experimentally verified and curated TFBSs belonging to the annotated regulatory binding sites (ABS) [18], ORegAnno [19], and Pleiades promoter project [20] databases were retrieved as curated GFF files from the Pazar database [21]. The TFs associated with each experimental TFBS were identified using a combination of Python scripting and manual parsing, due to deprecated gene naming in some cases. From this data, 504 experimentally validated binding sites affecting gene expression for 20 DeepBind TFs and 607 experimentally validated binding sites affecting gene expression for 25 JASPAR TFs were selected. The selection criterion for the chosen TFs was that they have at least 10 experimentally validated binding sites affecting gene expression. All target sites were converted from hg19 to GRCh38 genomic coordinates using Ensembl REST. Subsequently, 50 bp sequences were retrieved centered on each experimentally validated functional binding site in the human genome, to serve as true positives.

Paired with these true positives were true negatives from three different sources, producing three sets of results. In the first approach, true negative locations were chosen based on distance from TF metaclusters. TF metacluster locations were based on ChIP-Seq data from the GTRD database. Equidistant locations within the largest TF metacluster deserts in the human genome were chosen. Any location which did not have at least 50 bp of non-ambiguous DNA was discarded, resulting in a total of 1,536 true negative locations. In the second approach, for each true positive, 50 locations were chosen in random Ensembl transcripts, at the same distance from the TSS. In the third approach, the 50 locations were chosen within the promoter of the same Ensembl transcript, at least 50 nucleotides away from the true positive.
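To make the scoring of such true positive and true negative sets concrete, the sketch below takes the highest score within each 50-nt segment (as described in the figure legends) and computes the area under the ROC curve as the probability that a true positive outscores a true negative. This is a generic rank-based AUC calculation with invented scores, not the benchmarking code used in the study.

```python
def best_segment_score(site_scores):
    """Highest TFBS score found within one 50-nt segment."""
    return max(site_scores)


def roc_auc(positive_scores, negative_scores):
    """AUC as P(positive score > negative score); ties count as one half."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented per-segment prediction scores; the study pairs each true
# positive with 50 true negatives, fewer are used here for brevity.
positive_segments = [[1.0, 7.5, 3.0], [2.0, 4.0]]
negative_segments = [[0.5, 1.0], [5.0, 2.0], [4.0], [3.0, 0.1]]

positives = [best_segment_score(s) for s in positive_segments]
negatives = [best_segment_score(s) for s in negative_segments]
auc = roc_auc(positives, negatives)
print(auc)
```

An AUC of 0.5 corresponds to random ranking, and 1.0 means every true positive segment scores above every true negative segment.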

For each of the TFBSfootprinter, PWM, and DeepBind methods, the true positive locations were scored using that method and the model for the corresponding TF which had previously been experimentally verified as functional at that location. However, for the set of true negatives, each method scored all locations using all TF models.

ChIP-Seq peaks from GTRD are true positives

ChIP-Seq peaks for 1,392 transcription factors from 15,982 experiments uniformly processed with MACS2 [22] were downloaded from the GTRD database. Using the full complement of Ensembl (release 100) protein-coding transcripts, the ChIP-Seq peaks were reduced to those which occur within the window -1000 to +200 bp relative to an Ensembl transcript TSS. For each JASPAR TF with an identically named match in this subset of GTRD peaks, the top 100 peaks which had a fold-enrichment of ≥50x were kept as true positives and the bottom 100 peaks which had a fold-enrichment of ≤2x were kept as true negatives. Using this set of high- and low-occupancy locations, a 50-bp window centered on the peak summit was defined for analysis with TFBSfootprinter and DeepBind.
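The peak filtering described above can be sketched as three small steps: restrict peaks to the promoter window, split them by fold-enrichment, and define a window around each summit. The dictionary keys, the strand encoding (+1/-1), and the toy peaks are assumptions for illustration and do not reflect the GTRD file format.

```python
def in_promoter_window(summit, tss, strand, upstream=1000, downstream=200):
    """True if the summit lies within -upstream..+downstream of the TSS."""
    offset = (summit - tss) * strand  # strand is +1 or -1
    return -upstream <= offset <= downstream


def select_benchmark_peaks(peaks, n=100, strong=50.0, weak=2.0):
    """Top n peaks with fold-enrichment >= strong, bottom n with <= weak."""
    ranked = sorted(peaks, key=lambda p: p["fold_enrichment"], reverse=True)
    positives = [p for p in ranked if p["fold_enrichment"] >= strong][:n]
    negatives = [p for p in reversed(ranked) if p["fold_enrichment"] <= weak][:n]
    return positives, negatives


def summit_window(summit, width=50):
    """Half-open window of the given width centered on the peak summit."""
    start = summit - width // 2
    return start, start + width


# Invented peaks near a hypothetical forward-strand TSS at position 10,000.
peaks = [
    {"summit": 9500, "fold_enrichment": 80.0},
    {"summit": 9900, "fold_enrichment": 62.0},
    {"summit": 9700, "fold_enrichment": 10.0},
    {"summit": 10100, "fold_enrichment": 1.5},
    {"summit": 10300, "fold_enrichment": 1.1},
]
tss, strand = 10000, 1
promoter_peaks = [p for p in peaks if in_promoter_window(p["summit"], tss, strand)]
positives, negatives = select_benchmark_peaks(promoter_peaks, n=1)
windows = [summit_window(p["summit"]) for p in positives + negatives]
print(windows)
```

Peaks with intermediate fold-enrichment are left out of both sets, which keeps the true positive and true negative classes well separated.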

Supplementary Data References

1. Yates A, Beal K, Keenan S, McLaren W, Pignatelli M, Ritchie GR, Ruffier M, Taylor K, Vullo A, Flicek P: The Ensembl REST API: Ensembl Data for Any Language. Bioinformatics 2014.
2. Mathelier A, Fornes O, Arenillas DJ, Chen CY, Denay G, Lee J, Shi W, Shyr C, Tan G, Worsley-Hunt R et al: JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res 2016, 44(D1):D110-115.
3. Nishida K, Frith MC, Nakai K: Pseudocounts for transcription factor binding sites. Nucleic Acids Res 2009, 37(3):939-944.
4. Yamagishi ME, Shimabukuro AI: Nucleotide frequencies in human genome and fibonacci numbers. Bull Math Biol 2008, 70(3):643-653.
5. FANTOM Consortium and the RIKEN PMI and CLST (DGT), Forrest AR, Kawaji H, Rehli M, Baillie JK, de Hoon MJ, Haberle V, Lassmann T et al: A promoter-level mammalian expression atlas. Nature 2014, 507(7493):462-470.
6. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J et al: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 2020, 17(3):261-272.
7. Cusanovich DA, Pavlovic B, Pritchard JK, Gilad Y: The functional consequences of variation in transcription factor binding. PLoS Genet 2014, 10(3):e1004226.
8. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A 2002, 99(2):757-762.
9. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ: Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding and nucleosome position. Nat Methods 2013, 10(12):1213-1218.
10. Liu S, Zibetti C, Wan J, Wang G, Blackshaw S, Qian J: Assessing the model transferability for prediction of transcription factor binding sites based on chromatin accessibility. BMC Bioinformatics 2017, 18(1):355.
11. GTEx Consortium: The Genotype-Tissue Expression (GTEx) project. Nat Genet 2013, 45(6):580-585.
12. Cohen NM, Kenigsberg E, Tanay A: Primate CpG islands are maintained by heterogeneous evolutionary regimes involving minimal selection. Cell 2011, 145(5):773-786.
13. Long HK, Sims D, Heger A, Blackledge NP, Kutter C, Wright ML, Grutzner F, Odom DT, Patient R, Ponting CP et al: Epigenetic conservation at gene regulatory elements revealed by non-methylated DNA profiling in seven vertebrates. Elife 2013, 2:e00348.
14. Diehl AG, Boyle AP: Conserved and species-specific transcription factor co-binding patterns drive divergent gene regulation in human and mouse. Nucleic Acids Res 2018, 46(4):1878-1894.
15. Shibata Y, Sheffield NC, Fedrigo O, Babbitt CC, Wortham M, Tewari AK, London D, Song L, Lee BK, Iyer VR et al: Extensive evolutionary changes in regulatory element activity during human origins are associated with altered gene expression and positive selection. PLoS Genet 2012, 8(6):e1002789.
16. Paten B, Herrero J, Beal K, Fitzgerald S, Birney E: Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res 2008, 18(11):1814-1828.
17. Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program, Green ED, Batzoglou S, Sidow A: Distribution and intensity of constraint in mammalian genomic sequence. Genome Res 2005, 15(7):901-913.
18. Blanco E, Farre D, Alba MM, Messeguer X, Guigo R: ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Res 2006, 34(Database issue):D63-67.
19. Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, Aerts S, Mahony S, Sleumer MC, Bilenky M, Haeussler M et al: ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res 2008, 36(Database issue):D107-113.
20. Portales-Casamar E, Swanson DJ, Liu L, de Leeuw CN, Banks KG, Ho Sui SJ, Fulton DL, Ali J, Amirabbasi M, Arenillas DJ et al: A regulatory toolbox of MiniPromoters to drive selective expression in the brain. Proc Natl Acad Sci U S A 2010, 107(38):16589-16594.
21. Portales-Casamar E, Kirov S, Lim J, Lithwick S, Swanson MI, Ticoll A, Snoddy J, Wasserman WW: PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol 2007, 8(10):R207.
22. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W et al: Model-based analysis of ChIP-Seq (MACS). Genome Biol 2008, 9(9):R137.