Manuscript submitted to eLife

Generative Modeling of Multi-mapping Reads with mHi-C Advances Analysis of Hi-C Studies

Ye Zheng¹, Ferhat Ay²,³, Sündüz Keleş¹,⁴*

¹Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA; ²La Jolla Institute for Allergy and Immunology, La Jolla, CA 92037, USA; ³School of Medicine, University of California San Diego, La Jolla, CA 92093, USA; ⁴Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, USA

*For correspondence: [email protected] (SK)

Abstract

Current Hi-C analysis approaches are unable to account for reads that align to multiple locations, and hence underestimate biological signal from repetitive regions of genomes. We developed and validated mHi-C, a multi-read mapping strategy to probabilistically allocate Hi-C multi-reads. mHi-C exhibited superior performance over utilizing only uni-reads and heuristic approaches aimed at rescuing multi-reads on benchmarks. Specifically, mHi-C increased the sequencing depth by an average of 20% resulting in higher reproducibility of contact matrices and detected interactions across biological replicates. The impact of the multi-reads on the detection of significant interactions is influenced marginally by the relative contribution of multi-reads to the sequencing depth compared to uni-reads, cis-to-trans ratio of contacts, and the broad data quality as reflected by the proportion of mappable reads of datasets. Computational experiments highlighted that in Hi-C studies with short read lengths, mHi-C rescued multi-reads can emulate the effect of longer reads.