Database, 2016, 1–15
doi: 10.1093/database/baw124
Original article

Discrepancies between human DNA, mRNA and protein reference sequences and their relation to single nucleotide variants in the human population

Matsuyuki Shirota1,2,3 and Kengo Kinoshita2,3,4,*

1Graduate School of Medicine, Tohoku University, Sendai, Miyagi 9808575, Japan
2Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi 9808575, Japan
3Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi 9808579, Japan
4Institute for Development, Aging and Cancer, Tohoku University, Sendai, Miyagi 9808575, Japan

*Corresponding author: Tel: +81227957179; Fax: +81227957179; Email:
[email protected]

Citation details: Shirota,M. and Kinoshita,K. Discrepancies between human DNA, mRNA and protein reference sequences and their relation to single nucleotide variants in the human population. Database (2016) Vol. 2016: article ID baw124; doi:10.1093/database/baw124.

Received 5 May 2016; Revised 6 July 2016; Accepted 4 August 2016

Abstract

The protein coding sequences of the human reference genome GRCh38, RefSeq mRNA and UniProt protein databases are sometimes inconsistent with each other, due to polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we comprehensively listed the discordant bases and regions between the GRCh38, RefSeq and UniProt reference sequences, based on the genomic coordinates of GRCh38. By assigning alternative allele frequencies to the discordant bases, we observed that the RefSeq sequences are more likely than GRCh38 and UniProt to represent the major alleles. Because some reference sequences carry minor alleles, functional and structural annotations may be based on alleles that are rare in the human population, thereby biasing these analyses.