Linkage Disequilibrium Based Genotype Calling from Low- Coverage Shotgun Sequencing Reads Jorge Duitama1, Justin Kennedy1, Sanjiv Dinakar2, Y¨ozen Hern´andez3and Yufeng Wu1, Ion I. M˘andoiu∗1 1Department of Computer Science & Engineering, University of Connecticut,371 Fairfield Rd., Unit 2155, Storrs, CT 06269-2155, USA 2Department of Computer Science, University of Maryland,College Park, Maryland 20742, USA. 3Department of Computer Science, Hunter College,695 Park Avenue, New York, NY 10021, USA. Email: Jorge Duitama -
[email protected]; Justin Kennedy -
[email protected]; Sanjiv Dinakar -
[email protected]; Y¨ozen Hern´andez -
[email protected]; Yufeng Wu -
[email protected]; Ion I. M˘andoiu∗-
[email protected]; ∗Corresponding author Abstract Background: Recent technology advances have enabled sequencing of individual genomes, promising to revolutionize biomedical research. However, deep sequencing remains more expensive than microarrays for performing whole-genome SNP genotyping. Results: In this paper we introduce a new multi-locus statistical model and computationally efficient genotype calling algorithms that integrate shotgun sequencing data with linkage disequilibrium (LD) information extracted from reference population panels such as Hapmap or the 1000 genomes project. Experiments on publicly available 454, Illumina, and ABI SOLiD sequencing datasets suggest that integration of LD information results in genotype calling accuracy comparable to that of microarray platforms from sequencing data of low-coverage. A software package implementing our algorithm, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/GeneSeq/. Conclusions: Integration of LD information leads to significant improvements in genotype calling accuracy compared to prior LD-oblivious methods, rendering low-coverage sequencing as a viable alternative to microarrays for conducting large-scale genome-wide association studies.