<<

microarrays

Article A Mismatch EndoNuclease Array-Based Methodology (MENA) for Identifying Known SNPs or Novel Point

Josep M. Comeron 1,2,*, Jordan Reed 1, Matthew Christie 1, Julia S. Jacobs 1, Jason Dierdorff 1, Daniel F. Eberl 1,2 and John Robert Manak 1,2,3,*

1 Department of Biology, University of Iowa, Iowa City, IA 52242, USA; [email protected] (J.R.); [email protected] (M.C.); [email protected] (J.S.J.); [email protected] (J.D.); [email protected] (D.F.E.) 2 Graduate Program in Genetics, University of Iowa, Iowa City, IA 52242, USA 3 Department of Pediatrics, University of Iowa, Iowa City, IA 52242, USA * Correspondence: [email protected] (J.M.C.); [email protected] (J.R.M.); Tel.: + 319-335-0180 (J.R.M.); Fax: 319-335-1069 (J.R.M.)

Academic Editors: Yuriy Alekseyev and Gang Liu Received: 8 February 2016; Accepted: 30 March 2016; Published: 5 April 2016

Abstract: Accurate and rapid identification or confirmation of single nucleotide polymorphisms (SNPs), point mutations and other human genomic variation facilitates understanding the genetic basis of disease. We have developed a new methodology (called MENA (Mismatch EndoNuclease Array)) pairing DNA mismatch endonuclease enzymology with tiling microarray hybridization in order to genotype both known point mutations (such as SNPs) as well as identify previously undiscovered point mutations and small indels. We show that our assay can rapidly genotype known SNPs in a human genomic DNA sample with 99% accuracy, in addition to identifying novel point mutations and small indels with a false discovery rate as low as 10%. Our technology provides a platform for a variety of applications, including: (1) genotyping known SNPs as well as confirming newly discovered SNPs from whole sequencing analyses; (2) identifying novel point mutations and indels in any genomic region from any organism for which genome sequence information is available; and (3) screening panels of associated with particular diseases and disorders in patient samples to identify causative mutations. As a proof of principle for using MENA to discover novel mutations, we report identification of a novel of the beethoven (btv) in Drosophila, which encodes a ciliary cytoplasmic dynein motor important for auditory mechanosensation.

Keywords: microarray; mismatch; endonuclease; SNP detection; genetic variation; disease

1. Introduction Current methodologies to identify known single nucleotide polymorphisms (SNPs) include microarray analysis (either based on differential hybridization strategies or hybridization coupled with DNA polymerase activity, as in the Affymetrix (Santa Clara, CA, USA) and Illumina (San Diego, CA, USA) platforms, respectively), Taqman® assays (Roche Diagnostics, Indianapolis, IN, USA), MassARRAY®/iPLEX® (Agena Bioscience, San Diego, CA, USA) single-base extension after amplification, mismatch endonuclease-based detection followed by suitable separation methods, or DNA sequencing (either directed or whole-genome, as in medical resequencing or next generation sequencing strategies) [1–9]. However, the aforementioned microarray-based, Taqman®, MassARRAY® and mismatch endonuclease strategies are only able to interrogate SNPs whose identity is already known. Moreover, most of these methods are not easily scalable to query thousands of SNPs at a time

Microarrays 2016, 5, 7; doi:10.3390/microarrays5020007 www.mdpi.com/journal/microarrays Microarrays 2016, 5, 7 2 of 11 with high accuracy. For instance, the mismatch endonuclease-based detection approach involves PCR to amplify target DNA fragments from both and wild-type reference DNA, enzyme treatment to cleave hybrid heteroduplexes, and detection of potential variants using conventional gel electrophoresis or high-performance liquid chromatography (HPLC). On the other hand, direct sequencing of large genomic regions can identify novel SNPs (and other genomic variation) but oftentimes this strategy is time-consuming and can be expensive and inefficient when genotyping known variants. We thus wanted to develop an economic methodology that could effectively identify both types of mutations (novel and previously discovered). Therefore, we combined the enzymatic efficiency and specificity of the mismatch endonuclease strategy with a tiling microarray hybridization strategy to produce a platform, which we call Mismatch EndoNuclease Array (MENA). In particular, we focused on the Surveyor® endonuclease CEL II which is part of the CEL nuclease family derived from celery. CEL II specifically targets heteroduplex DNA (mismatched base pairs and small indels) and produces a double-strand cleavage at the mismatch site [8,10–13]. The general MENA strategy (see Figure1) relies on: (1) hybridizing a Cy3-labeled genomic DNA sample to a tiling microarray that interrogates one or more large genomic regions of interest in a sequenced genome; (2) scanning all features on the array to quantify hybridized DNA samples; (3) treating the array with a mismatch endonuclease (Surveyor®) that scans for and cleaves mismatches in heteroduplex DNA, reducing the length of heteroduplex formed by genomic fragments and array features when mismatches exist; and (4) re-scanning the array to obtain signal intensities post-treatment. Intensities of the post-digestion features are then compared with the intensities of the pre-digestion features. Probes with hybridized DNA including mismatches (SNPs and/or small indels) are expected to show a specific drop in post-treatment signal relative to pre-treatment signal that is more significant than for probes with no mismatches. It is important to note that traditional SNP array methods rely on a subtle reduction in hybridization strength (and thus signal) between DNA fragments from the sample and array probes when mismatches are present relative to perfect match. In contrast, MENA is expected to generate much more pronounced reductions in signal because genomic fragments are end-labeled and Surveyor® treatment will shorten the heteroduplex via double-strand cleavage (acting on the labeled DNA sample as well as on the array feature or probe) and thus liberate signal from the array. Array features exhibiting the greatest relative drop in signal can validate the presence of a known SNP or, alternatively, identify a new candidate genetic variant that can be earmarked for validation using PCR and Sanger sequencing. Here, we report 99% accuracy of the MENA platform in calling previously identified SNPs, as well as successful identification of novel, unknown genetic variation such as single nucleotide variants and indels. Using MENA, we also identify a novel mutation in a dynein motor protein gene required for hearing in , a mutation that was previously missed using standard Sanger sequencing. MENA is thus an effective alternative methodology for mapping unknown mutations localized to a genomic interval. Finally, given the rise in whole genome sequencing efforts, oftentimes with low sequence read coverage, we envision that this platform could play a significant role in SNP confirmation for these studies. Microarrays 2016, 5, 7 3 of 11

Microarrays 2016, 5, 7 3 of 11

Figure 1. General Mismatch EndoNuclease Array (MENA) strategy. Red stars indicate probes with Figure 1.mismatchesGeneral to Mismatch the hybridized EndoNuclease DNA. Lightning Array bolts (MENA) represent strategy. mismatches Red where stars the indicate endonuclease probes with mismatchescleaves to the the duplex hybridized DNA. DNA. Lightning bolts represent mismatches where the endonuclease cleaves the duplex DNA. 2. Materials and Methods 2. Materials and Methods 2.1. Array Designs 2.1. Array DesignsAll designs were devised for the Roche NimbleGen 385K microarray. Design files (ndf and pos) for each array are available on request. Design strategies and genomic resources are articulated in the AllResults designs section. were devised for the Roche NimbleGen 385K microarray. Design files (ndf and pos) for each array are available on request. Design strategies and genomic resources are articulated in the Results2.2. section. Protocol for Labeling DNA, MENA Array Processing One microgram of genomic DNA was labeled with Cy3-coupled random primers as described 2.2. Protocolin the for NimbleGen Labeling DNA,Arrays MENAUser’s Guide-CGH Array Processing Analysis (version 5.1) using the labeling protocol for 385K feature arrays. Sixty-four micrograms of labeled DNA was lyophilized, followed by One microgram of genomic DNA was labeled with Cy3-coupled random primers as described in resuspension in the hybridization buffer including alignment oligo. Sixteen microliters was loaded the NimbleGeninto the array Arrays mixer User’s (the chamber Guide-CGH device Analysis that is affixed (version to the 5.1) array using in which the labeling hybridization protocol of the for 385K featurelabeled arrays. DNA Sixty-four occurs), microgramsand the sample of was labeled hybridized DNA for was up lyophilized,to 3 days. followed by resuspension in the hybridizationThe array buffer was includingwashed in 42 alignment °C Wash oligo.Buffer 1 Sixteen for 2 min, microliters followed by was Wash loaded buffer into 2 for the 1 min, array mixer (the chamberand wash device buffer that 3 for is affixed15 s. The to array the arraywas then in whichdried for hybridization 15–20 s on an ofArrayIt the labeled slide dryer. DNA Arrays occurs), and were scanned on either the Roche NimbleGen MS 200 microarray scanner set to auto gain, or the the sample was hybridized for up to 3 days. Molecular Devices Axon 2.5 uM˝ resolution microarray scanner, and the PMT value was recorded. The arrayNext, was a new washed array inmixer 42 wasC Wash placed Buffer on the 1 dried, for 2 scanned min, followed array, and by the Wash array buffer was placed 2 for 1on min, and wash bufferthe NimbleGen 3 for 15 s.Hybridization The array wasSystem then set driedto 42 °C. for Eighteen 15–20 s microliters on an ArrayIt of the slideSurveyor dryer.® nuclease Arrays were scannedmaster on either mix (see the below) Roche was NimbleGen then added MS to the 200 mixe microarrayr, and the scannerarray was set mixed to auto for up gain, to 40 or min. the The Molecular Devicesarray Axon was 2.5 then uM rewashed resolution at 42 microarray °C in Wash scanner,Buffers 1 andto 3 days the PMTas described value above, was recorded. and dried on the Next,ArrayIt a new slide array dryer. mixer The array was was placed rescanned on the with dried, the same scanned PMT array,setting and as the the first array scan. was Array placed data on the was then processed in NimbleScan as per the NimbleGen Arrays User’s Guide. NimbleGen Hybridization System set to 42 ˝C. Eighteen microliters of the Surveyor® nuclease master Surveyor® nuclease reaction master mix was prepared, as following, using the relevant components mix (seeof below)the Surveyor was® then Mutation added Detection to the Kit mixer, for Standard and the Gel array Electrophoresis was mixed from for upTransgenomic, to 40 min. Inc. The array ˝ was then(now rewashed sold and at registered 42 C in by Wash Integrated Buffers DNA 1 to Tech 3 daysnologies as described(IDT), Inc., above,Coralville, and IA, dried USA): on 2.2 the μL ArrayIt slide dryer. The array was rescanned with the same PMT setting as the first scan. Array data was then processed in NimbleScan as per the NimbleGen Arrays User’s Guide. Surveyor® nuclease reaction master mix was prepared, as following, using the relevant components of the Surveyor® Mutation Detection Kit for Standard Gel Electrophoresis from Transgenomic, Inc. (now sold and registered by Integrated DNA Technologies (IDT), Inc., Coralville, IA, USA): 2.2 µL Surveyor® Nuclease S, 2.2 µL Surveyor® Enhancer S, 15.6 µL Surveyor® Reaction Microarrays 2016, 5, 7 4 of 11

® Buffer (25 mM¨ Tris-HCl, pH 9.0; 50 mM¨ KCl; 10 mM¨ MgCl2). For further dilutions of Surveyor , the Dilution Buffer used was as follows: 50 mM¨ Tris-HCl (pH 7.5), 100 mM¨ KCl, 0.01% Triton X-100, 10 uM¨ ZnCl2, 5% Glycerol.

2.3. Drosophila Strains Flies carrying the l(2)k07109 from the Kiss collection [14] contain a lethal PlacW insertion at polytene location 25F1-2, as well as an incomplete PlacW element that is w- at the Fas3 gene in polytene section 36E. This chromosome fails to complement other btv and deficiencies that define the btv locus [15]. We named this allele btv2. To maintain a non-lethal stock of btv2, we crossed the l(2)k07109 chromosome to a deficiency, Df(2L)cact255rv64, leaving the btv2 mutation hemizygous and the remainder of the chromosome freely recombining. We presume that over time, the deficiency has been lost, with the btv2 allele becoming homozygous. Because the background chromosome on which the Kiss lines were generated is not available, we used a different insertion line from the collection, l(2)k12913 containing a PlacW insertion in the rempA locus [16], as a proxy for the genetic background. To recover this reference DNA, we used flies of genotype rempAk12913/Df(2L)cact255rv64.

3. Results To test the feasibility of our strategy, we first designed a 385K Roche NimbleGen microarray that interrogates 128 human genome SNPs across 41 genes (see Supplemental Table S1). In this original design, the SNPs are interrogated by all four bases at the SNP position using four sets of . Moreover, the SNP position region is tiled at 1 bp resolution, with oligonucleotides designed to position the SNP at every possible position along the oligo, plus 5 extra “buffer” probes on either side of each SNP that do not contain the SNP position. Thus, for a 30 m probe, 5 + 30 + 5 probes per strand and per SNP nucleotide are required for this design strategy, with 5 + 40 + 5 probes per strand per SNP nucleotide required for 40 mer probes, etc. Additionally, both forward and reverse strands are interrogated for each SNP region. This strategy was employed four different ways, using oligonucleotides that were 30, 40, 50 and 60 nucleotides long. Our initial experiments demonstrated that only 60 m probes were able to perform as anticipated under the experimental conditions used (see Experimental Section), so we redesigned the array using only 60 m, and in addition to once again employing the overall design strategy outlined above, we made sure to include four replicates for each probe set in order to test whether increased replication might have an effect on call accuracy. Our MENA array, therefore, was designed to employ three out of the four probe sets (with either an A, G, C or T interrogating a particular SNP, using multiple replicates) to expose a mismatch, with only one set (with multiple replicates) being complementary to the SNP, thus protecting it from the Surveyor® endonuclease. This strategy allowed us to test the efficacy of combining Surveyor® and microarrays to correctly genotype specific nucleotide positions under a wide variety of conditions and different mismatches. We also investigated the genotyping accuracy when different numbers of partially overlapping (tiling) probes are used to infer pre- vs. post-digestion signal in order to maximize sensitivity and reliability. For example, the 128 SNPs were further split into three groups of SNPs of similar size, one that was analyzed at 1 bp resolution, a second that was analyzed at 2 bp resolution, and a third that was analyzed at 3 bp resolution. Based on knowledge generated in our original set of arrays, all probes were designed to be 60 bp (see below). We performed several initial experiments using a previously genotyped CEPH (1362-02) human genomic DNA sample (Supplementary Table S1). We labeled the DNA with Cy3 using the standard Roche NimbleGen labeling protocol (see Experimental Section), and we varied both amount of DNA hybed, length of time hybed, and amount of Surveyor® endonuclease used (Table1). We then compared the pre- and post-digestion signals from the arrays to determine whether we could observe evidence of cleavage, and whether it was discriminate and showed specificity to heteroduplexes with mismatches. Once the pre and post digestion probe signal intensities were determined via the microarray scanner, we used an algorithm which we developed in house to reveal evidence of cleavage and assess whether Microarrays 2016, 5, 7 5 of 11

MENA generated the correct SNP calls (accurate genotyping). We estimated the difference in relative signal due to Surveyor® treatment by measuring the signal pre- and post-treatment for each of the four possible probes, where all nucleotides of the probe are the same except for a single site where each one contains aMicroarrays different 2016 nucleotide., 5, 7 Importantly, we normalized these changes to allow for differences5 of 11 in signal intensity among probes as well as overall reduction in signal after treatment and washing. Our cleavage and assess whether MENA generated the correct SNP calls (accurate genotyping). We measure ofestimated pre- and the post-treatmentdifference in relative change signal indue signal to Surveyor intensity® treatment (“Relative by measuring Signal the Change” signal pre- or RSC) for each nucleotideand post-treatment (RSCi with for ieach= A, of C, the G, four or possible T) is expected probes, where to be all positive nucleotides for of probes the probe with are a the nucleotide with perfectsame match except tofor the a single sample site whilewhere alleach other one co probesntains witha different mismatched nucleotide. nucleotide Importantly, are we expected to generatenormalized negative these RSC changes values, to allow with for the differences sum of in all signal four intensity RSCi equalamong toprobes zero as (see well twoas overall examples in reduction in signal after treatment and washing. Our measure of pre- and post-treatment change in Figure2). In detail, signal intensity (“Relative Signal Change” or RSC) for each nucleotide (RSCi with i = A, C, G,or T) is expected to be positive for probes with a nucleotide with perfect match to the sample while all other probes withRSC mismatchedipi“A,C,G,T nucleotideq “ RSipre are´ expectedRSipost to“ generate pSipre ´ negativeSpreq ´ RSC pSi postvalues,´ S withpost qthe sum of (1) all four RSCi equal to zero (see two examples in Figure 2). In detail, ® where pre- and post- indicate signal intensity pre- and post-treatment with Surveyor , respectively, RSC(,,,) = RS − RS =(S −S) − (S − S) (1) measured in Log2 units. Si indicates the signal for probes with nucleotide i, and S is the average signal for all fourwhere probes. pre- and Note post- that indicate Si would signal intens representity pre- theand post-treatment average signal with among Surveyor all®, respectively, replicate probes in an array.measured The probe in Log with2 units. the Si most indicates positive the signal RSC fori probesis therefore with nucleotide the probe i, and corresponding S is the average signal to the perfect match andfor informsall four probes. ourgenotyping Note that Si would at the represent interrogated the average nucleotide signal among site. all We replicate classify probes the in genotyping an array. The probe with the most positive RSCi is therefore the probe corresponding to the perfect i of a SNPmatch as correct and informs or accurate our genotyping when the at the highest interrogate positived nucleotide value site. among We classify the four the genotypingRSC corresponds of to the nucleotidea SNP withas correct perfect or accurate match when and the correlates highest positive with the value known among SNP the four in the RSC sample. corresponds Note to that if the region isthe tiled nucleotide densely with enough, perfect match RSCi andcan correlates be estimated with the from known multiple SNP in the adjacent sample. Note probes, that if all the of which interrogateregion the is same tiled SNPdensely thus enough, adding RSC robustnessi can be estimated to the from study multiple (see below).adjacent probes, all of which interrogate the same SNP thus adding robustness to the study (see below). Table 1. Experimental conditions to study genotyping accuracy using MENA. Table 1. Experimental conditions to study genotyping accuracy using MENA.

Slide µ µ Slide NumberAmount of of DNA DNA (μ (g)g)Amount Amount of Surveyor of Surveyor (μL) ( TimeL) of Time Surveyor of Surveyor (min) (min) Number 405722 64 2.2 50 405722 64 2.2 50 405442 64 2.2 40 406241405442 64 16 2.2 2.2 40 40 406228406241 16 8 2.2 2.2 40 40 405703406228 8 64 2.2 4.4 40 20 406247405703 64 32 4.4 2.2 20 40 405711406247 32 64 2.2 5.5 40 20 411044405711 64 64 5.5 4.4 20 40 410888411044 64 64 4.4 2.2 40 30 406295410888 64 64 2.2 2.2 30 20 406306406295 64 64 2.2 1.1 20 50 410527406306 64 64 1.1 3.3 50 20 406246410527 64 64 3.3 1.1 20 40 406205 64 1.1 20 406246 64 1.1 40 406205All MENA experiments64 allowed three days of1.1 hybridization. See text for details.20 All MENA experiments allowed three days of hybridization. See text for details.

Figure 2. Relative signal change (RSC) between pre- and post-treatment arrays with Surveyor. Nucleotides (X-axis) indicate the nucleotides interrogated by the probes (not the nucleotides in the probe). In the sample analyzed, rs9451298 and rs9359876 have “T” and “C”, respectively (forward strand). See text for a detailed description of the RSC measure. Red arrows indicate identity of the SNP. Microarrays 2016, 5, 7 6 of 11

Figure 2. Relative signal change (RSC) between pre- and post-treatment arrays with Surveyor. Nucleotides (X-axis) indicate the nucleotides interrogated by the probes (not the nucleotides in the probe). In the sample analyzed, rs9451298 and rs9359876 have “T” and “C”, respectively (forward Microarraysstrand).2016 See, 5, 7text for a detailed description of the RSC measure. Red arrows indicate identity of the6 of 11 SNP.

Our analysis of several experimental and microarray conditions (Table 1)1) allowedallowed usus toto discriminate cleavage and obtain high genotyping accuracy when when using using probes probes 60 60 nucleotides nucleotides long. long. For 60-m probes, most experimental conditions al allowedlowed detection of a robust decrease in signal (negative RSCRSC) for three out of thethe fourfour setssets ofof SNP-interrogatingSNP-interrogating oligonucleotidesoligonucleotides while the fourth set showed a positive RSC,RSC, indicating that this latter set of probes contain a perfect match to the SNP present in the sample. Figure 22 showsshows representativerepresentative resultsresults whenwhen genotypinggenotyping twotwo specificspecific nucleotidenucleotide sites using the MENA approach and 60-m probes. For these reference SNPs located in the BACH2 gene (rs9451298 and rs9359876), probes interrogating the correct nucleotide (T for rs9451298 and C for rs9359876) show a positive RSC while while thethe probesprobes interrogatinginterrogating the other three nucleotides show negative RSCRSC values.values. The analysis of many (up to 27) different combinations of experimental conditions allowed us to obtain a general set of guidelines to obtain high genotyping accuracy when using MENA. First, we determined that that 2.2 2.2 μµLL of of the the Surveyor Surveyor®® nucleasenuclease was was optimal optimal for for achieving achieving the the maximal maximal accuracy accuracy of ofcalls calls (up (up to 99%) to 99%) under under a varied a varied combination combination of amounts of amounts of DNA of DNA and anddigestion digestion times times (Table (Table 1 and1 andFigure Figure 3). The3). use The of use limited of limited (1.1 μL) (1.1 SurveyorµL) Surveyor® gave less® gave accurate less accurateresults particularly results particularly when digestion when digestiontimes were times also werereduced. also reduced.Not surprisingly, Not surprisingly, the use of the higher use of amounts higher amounts of Surveyor of Surveyor® (4.4, or® 5.5(4.4, μL) or 5.5requiredµL) required the reduction the reduction of digestion of digestion time time(20 min (20 mininstead instead of 40 of 40min) min) to to maintain maintain high high genotyping genotyping accuracy, likely to reducereduce unspecificunspecific cleavage. Either 40 or 50 min of digestion were able to achieve 99% accuracy when keeping all other parameters constant, including 64 μµg of DNA hybridized, three days of hybridization hybridization and and 2.2 2.2 μµLL of of Surveyor Surveyor®,® suggesting, suggesting that that the the enzyme enzyme digestion digestion period period could could be besomewhat somewhat flexible. flexible. Additionally, Additionally, as little as little as 8 as μ 8g µofg amplified, of amplified, labeled labeled DNA DNA (see (see Methods) Methods) could could be beused used per per hybridization hybridization to achieve to achieve this this accuracy, accuracy, using using the the same same hybridization hybridization and and digestion digestion time, time, as aswell well as asSurveyor Surveyor amount. amount. Since Since labeling labeling of the of theDNA DNA usually usually results results in an in anamplification amplification of up of upto 70 to 70fold, fold, this this suggests suggests that that as aslittle little as as ~100 ~100 ng ng of of st startingarting genomic genomic DNA DNA might might be be sufficient sufficient to perform our analysis. Figure3 3shows shows genotyping genotyping accuracy accuracy for known for kn SNPsown underSNPs 14 under different 14 experimentaldifferent experimental conditions, varyingconditions, the varying amount the of DNAamount hybed, of DNA concentration hybed, concentration of the Surveyor of the® endonuclease, Surveyor® endonuclease, and the time and of digestionthe time of (Table digestion1). As shown,(Table 1). we As obtained shown, accuracies we obtained of 99% accuracies using the of RSC 99% algorithm using the described RSC algorithm above fordescribed a number above of conditions, for a number thus of indicating conditions, that thus MENA indicating is not only that highlyMENA accurate is not only in genotyping highly accurate SNPs, butin genotyping also that the SNPs, experimental but also that conditions the experimental are fairlyrobust. conditions are fairly robust.

Figure 3. Summary of genotyping accuracy. Genotyping accuracy of MENA under different Figure 3. Summary of genotyping accuracy. Genotyping accuracy of MENA under different experimental conditions. Results Results shown shown after after combining combining all all homozygous homozygous SNPs SNPs for for each each condition. See Table1 1 for for detailed detailed experimental experimental conditions. conditions.

This study also allowed us to draw additional conclusions about array design when using MENA

(Figure4). We observed that interrogating both strands always increases genotyping accuracy relative to doubling the number of probes for a single strand. We also observed that high genotyping accuracy can be obtained with low tiling overlap for probes: 95% accuracy is obtained with only five overlapping Microarrays 2016, 5, 7 7 of 11

This study also allowed us to draw additional conclusions about array design when using MENA (Figure 4). We observed that interrogating both strands always increases genotyping accuracy relative to doubling the number of probes for a single strand. We also observed that high genotyping accuracy can be obtained with low tiling overlap for probes: 95% accuracy is obtained with only five overlapping probes (double strand and four replicates). Finally, we were able to investigate the accuracy in detecting different nucleotide mismatches. We observed that accurate genotyping was lowest (albeit still very high) for probes interrogating “T” variants (with accurate genotyping of 96%). Having shown that MENA is highly efficient for detecting known SNPs, we next wanted to determine whether we could identify novel genomic variation. To this end, in our second array design we had also included the entire ~64 kb region of the human IRF6 gene (including the intergenic space 5′ and 3′ to the gene; Chr1: 208,021,000–208,085,000, hg18), tiling both strands at 2 bp resolution, with two replicates of both strands. In this first attempt to investigate whether MENA could detect candidate novel SNPs in the IRF6 region, we used the 1362-1 CEPH sample and employed MENA under optimal conditions that were identified as indicated above. In this case, probes showing the largest signal reduction (signal change, SC) are candidates for harboring differences relative to the

Microarraysreference 2016(array), 5, 7 genome. Note that for this study we did not include all four possible nucleotides7 of 11at all possible sites across the entire IRF6 region. Instead, we applied a sliding-window approach using probes that match the reference genome to describe SC along the complete region analyzed, and probesfocused (double on the strand probes and showing four replicates). the most Finally, extreme we de werecay able in signal to investigate (most negative the accuracy SC). in As detecting a first differentapproximation, nucleotide we mismatches.used a conservative We observed FDR that of accurate0.1% as genotypingthreshold to was identify lowest putative (albeit still probes very high)harboring for probes genetic interrogating variants in the “T” sample variants analyzed. (with accurate genotyping of 96%).

Figure 4.4. GenotypingGenotyping accuracy with differentdifferent numbernumber ofof probes.probes. PercentagePercentage of genotypinggenotyping accuracy based different number of probes interrogating a given SNP. ResultsResults shownshown afterafter combiningcombining all homozygous SNPs under experimentalexperimental conditions 405722 (see TableTable1 1 and and Figure Figure3 ).3). Because Because probes probes are 6060 nucleotidesnucleotides long,long, analysesanalyses based on probesprobes tiledtiled everyevery threethree nucleotidesnucleotides useuse thethe combinedcombined informationinformation of 20 20 probes, probes, analyses analyses based based on on probes probes tiled tiled every every six six nucleotides nucleotides use use information information of 10 of 10probes, probes, etcetc. .

HavingWhen using shown the that 1362-1 MENA CEPH is highly sample, efficient we iden for detectingtified 18 genomic known SNPs, regions we that next had wanted signal to determinedifference whethersignatures we consistent could identify with novel the genomicpresence variation. of a genetic To this change end,in relative our second to the array reference design wesequence. had also We included designed the primers entire ~64 to kbPCR region the regions of the human from theIRF6 CEPHgene sample (including and the performed intergenic Sanger space 5sequencing1 and 31 to in the order gene; to Chr1: confirm 208,021,000–208,085,000, the presence of novel hg18),mutations. tiling We both were strands able to at identify 2 bp resolution, 13 SNP withvariants two (12 replicates being heterozygous of both strands. and In one this being first attempthomozygous; to investigate Supplemental whether Table MENA S2) and could indels detect in candidate10 of the interrogated novel SNPs regions. in the IRF6 It is region,important we to used note the that 1362-1 the identi CEPHfication sample of andknown employed SNPs is MENAfurther undervalidation optimal that conditionsthese variants that are were in fact identified real. Notably, as indicated we were above. able Into identify this case, not probes only homozygous showing the largestsingle nucleotide signal reduction mutations (signal but change,also hete SC)rozygous are candidates mutations for as harboring well as indels. differences relative to the referenceFinally, (array) we wanted genome. to Notedetermine that for whether this study we co weuld did identify not include an uncharacterized all four possible mutation nucleotides from ata model all possible organism, sites Drosophila across the melanogaster entire IRF6. region.Previously, Instead, Eberl we and applied colleagues a sliding-window identified a gene approach critical usingfor fly probeshearing that called match beethoven the reference (btv) [17,18]. genome Several to describe mutations SC in along this gene the complete were identified; region analyzed,however, 2 andone focusedallele that on failed the probes to complement showing the the most other extreme btv alleles decay (btv in) signal was never (most characterized, negative SC). Asas ainitial first approximation,Sanger sequencing we used efforts a conservative covering several FDR of exonic 0.1% as regions threshold based to identify on early putative annotations probes harboringfailed to genetic variants in the sample analyzed. When using the 1362-1 CEPH sample, we identified 18 genomic regions that had signal difference signatures consistent with the presence of a genetic change relative to the reference sequence. We designed primers to PCR the regions from the CEPH sample and performed Sanger sequencing in order to confirm the presence of novel mutations. We were able to identify 13 SNP variants (12 being heterozygous and one being homozygous; Supplemental Table S2) and indels in 10 of the interrogated regions. It is important to note that the identification of known SNPs is further validation that these variants are in fact real. Notably, we were able to identify not only homozygous single nucleotide mutations but also heterozygous mutations as well as indels. Finally, we wanted to determine whether we could identify an uncharacterized mutation from a model organism, Drosophila melanogaster. Previously, Eberl and colleagues identified a gene critical for fly hearing called beethoven (btv)[17,18]. Several mutations in this gene were identified; however, one allele that failed to complement the other btv alleles (btv2) was never characterized, as initial Sanger sequencing efforts covering several exonic regions based on early annotations failed to identify a likely mutation. The btv2 mutation was discovered as a second-site lesion on a chromosome with a partial Microarrays 20162016,, 55,, 77 88 ofof 1111

identify a likely mutation. The btv2 mutation was discovered as a second-site lesion on a chromosome P-elementwith a partial insertion P-element in the insertionfasciclin IIIin genethe fasciclin (Fas3) nearIII genebtv. ( WeFas3 thus) near considered btv. We thus the possibilityconsidered that the thepossibilitybtv2 lesion that resulted the btv2 fromlesion a resulted “hit-and-run” from a event “hit-and-run” during hybrid event dysgenesis during hybrid in the dysgenesis recovery ofin thethe k07109recoverychromosome. of the k07109 To applychromosome. MENA weTo thusapply generated MENA awe tiling thus array generated design a (OID34126) tiling array in design which we(OID34126) tiled a portion in which of the weDrosophila tiled a portionsecond of chromosomethe Drosophila (dm3, second 2L: chromosome 17,933,592-18,006,679) (dm3, 2L: using17,933,592- 60 m probes18,006,679) with using 3 bp spacing60 m probes (both with strands, 3 bp threespacing total (both replicates strands, per three strand) total and replicates performed per strand) our MENA and assayperformed on DNA our MENA isolated assay from on flies DNA harboring isolated the frombtv 2fliesallele harboring (see Materials the btv2 and allele Methods (see Materials for a more and detailedMethods description) for a more detailed as well asdescription) flies that were as well wild-type as flies forthat the werebtv wild-typelocus. Note for that, the inbtv this locus. case, Note we usedthat, MENAin this case, with we two used samples MENA (mutant with andtwo wild-type)samples (mutant and we and compared wild-type) the reduction and we compared in signal (SC) the ® afterreduction Surveyor in signaltreatment (SC) after for btvSurveyormutant® samplestreatment (SC forbtv2 btv) relative mutant to samples the reduction (SCbtv2 in) relative signal forto thethe wild-typereduction (SCin signalcontrol) for sample. the wild-type This approach, (SCcontrol we) believe,sample. isThis superior approach, to the we one believe, applied is for superior IRF6 because to the canone captureapplied betterfor IRF6 differences because can among capture probes better after differences Surveyor® amongtreatment. probes Probes after showing Surveyor the® treatment. strongest relativeProbes showing reduction the in strongest the btv2 sample relative (see reduction Figure5 )in were the btv thus2 sample candidates (see Figure for harboring 5) were uncharacterized thus candidates SNPsfor harboring or small uncharacterized indel variants. SNPs or small indel variants.

Figure 5. ComparisonComparison of of signal signal change change (SC) (SC) using using MENA MENA between between mutant mutant and and control control samples. samples. SC SCindicates indicates the thereduction reduction in signal in signal after afterSurveyor Surveyor treatment treatment in controls in controls and btv and2 samples.btv2 samples. In red, probes In red, probesshowing showing the strongest the strongest (1%) reduction (1%) reduction in signal in signalin mutant in mutant btv2 samplebtv2 sample relative relative to controls to controls and thus and thusputative putative genetic genetic variants. variants. In black, In black, the probe the probe corresp correspondingonding to a tosingle a single bp deletion bp deletion in in exon 22 of 22 the of thebtv btvgenegene that that results results in ina frameshift a frameshift that that causes causes ea earlyrly stop stop codons. codons. In In blue, blue, all all remaining remaining probes that

werewere interrogatedinterrogated withwith MENA.MENA. SCSC shownshown ininLog Log22 units.units.

Candidate SNPSNP regions regions were were then then targeted targeted for PCRfor PCR and Sangerand Sanger sequencing sequencing (Figure (Figure5; red squares 5; red indicatesquares theindicate 1% of the probes 1% of with probes strongest with reductionstrongest relativereduction to controls,relative to and controls, the black and square the black is the square probe amongstis the probe this amongst 1% that ledthisto 1% identification that led to identificati of the relevanton of the mutation), relevant focusing mutation), on missense,focusing on nonsense missense, or nonsense or frameshift variants. Upon PCR amplification and sequencing of the candidate regions, a frameshift variants. Upon PCR amplification and sequencing of the candidate regions, a single base single base pair deletion mutation was confirmed in exon 22 (transcript btv-RD), at coordinate pair deletion mutation was confirmed in exon 22 (transcript btv-RD), at coordinate 17966613, which, 17966613, which, on the minus strand encoding btv, begins a string of 5 T nucleotides that, in btv2, on the minus strand encoding btv, begins a string of 5 T nucleotides that, in btv2, leave only four Ts. leave only four Ts. This indel results in a frameshift that causes early stop codons, thus substantially This indel results in a frameshift that causes early stop codons, thus substantially shortening the amino shortening the amino acid sequence of the cytoplasmic dynein protein (Figure 6). Thus, our MENA acid sequence of the cytoplasmic dynein protein (Figure6). Thus, our MENA analysis succeeded analysis succeeded in identifying a candidate novel variant that we experimentally confirmed to be in identifying a candidate novel variant that we experimentally confirmed to be a novel single base a novel single base deletion mutation. deletion mutation.

Microarrays 2016, 5, 7 9 of 11

nonsense or frameshift variants. Upon PCR amplification and sequencing of the candidate regions, a single base pair deletion mutation was confirmed in exon 22 (transcript btv-RD), at coordinate 17966613, which, on the minus strand encoding btv, begins a string of 5 T nucleotides that, in btv2, leave only four Ts. This indel results in a frameshift that causes early stop codons, thus substantially shortening the amino acid sequence of the cytoplasmic dynein protein (Figure 6). Thus, our MENA analysis succeeded in identifying a candidate novel variant that we experimentally confirmed to be Microarrays 2016, 5, 7 9 of 11 a novel single base deletion mutation.

Figure 6. IdentificationIdentification of of a anovel novel allele allele of of the the beethovenbeethoven (btv(btv) gene) gene in Drosophila in Drosophila. Results. Results of PCR of PCR and andSanger Sanger sequencing sequencing around around the thecandidate candidate mutation mutation detected detected by by MENA MENA confirm confirm a asingle single bp bp (“A”) deletion in btv2 relative to a (WT) sequence ( A); this deletion maps to exon 22 of btv (red(red rectangle inin B;, red arrow in C), which is located in in chromosome chromosome arm arm 2L 2L (position (position 17966613-17966617); 17966613-17966617); and (D)) the consequences ofof the “A” deletion causing a frameshift and early stop codons. Green letters represent the altered amino acids acids that that are are translated translated as as a a result result of of the the frameshift, frameshift, which which ends ends with with a astop stop codon codon indicated indicated by by the the red red x. x.

4. Discussion Discussion We havehave developed a mutation detection system called MENA (Mismatch EndoNuclease Array) inin whichwhich wewe couple couple mismatch mismatch endonuclease endonuclease enzymology enzymology with with tiling tiling DNA DNA microarrays. microarrays. We believeWe believe this systemthis system will bewill particularly be particularly well-suited well-suited for verification for verification of SNP andof SNP SNV and calls SNV for calls for sequenced genomes atsequenced low coverage at low depths, coverage whether depths, they whether be from they model be from organisms, model organisms, commercially commercially relevant organisms relevant suchorganisms as livestock such as or livestock domesticated or domesticated animals, or anim humans.als, or Importantly, humans. Importantly, number of genomesnumber of sequenced genomes issequenced oftentimes is consideredoftentimes moreconsidered essential more than essential achieving than maximum achieving accuracy maximum in callingaccuracy variants in calling for eachvariants individual for each genome, individual as genome, is often theas is case often in large-scalethe case in humanlarge-scale sequencing human sequencing efforts. Although efforts. suchAlthough a sequencing such a sequencing strategy allows strategy the allows sequencing the sequencing of more genomes, of more genomes, both low andboth high low sequenceand high coveragesequence cancoverage invariably can invariably lead to inaccurate lead to baseinaccurate calling base (the calling latter due (the to latter systematic due to sources systematic of error sources [19], particularlyof error [19], in particularly species with in large species genomes with and larg fewe genomes novel variants, and few such novel as humans. variants, Having such as an humans. accurate andHaving independent an accurate genomic and independent strategy for genomic calling the strate suspectedgy for calling variants the will suspected allow confirmation variants will of allow their identity.confirmation The currentof their Agilent identity. 8-plex The slide,current which Agilen containst 8-plex eight slide, independent which contains microarrays, eight independent for instance, wouldmicroarrays, allow confirmingfor instance, more would than allow 5000 confirming variants per more sample than for5000 less variants than $200, per sample with the for advantage less than that$200, each with array the advantage can be custom that each designed array to can include be custom novel designed SNPs/SNVs to include and genotype novel SNPs/SNVs eight different and genomicgenotype samples eight different per slide. genomic samples per slide. A second utility for MENA is identification of novel variants of interest in a genomic interval which can be tiled on a microarray. Such a strategy led us to identify a loss-of-function variant in the beethoven (btv) gene of Drosophila melanogaster, which is required for fly hearing [17,18]. Importantly, Sanger sequencing of a large genomic interval does not allow quick identification of relevant mutations, and the time required to design appropriate PCR primers, not to mention perform the sequencing, may be prohibitive depending on the size of the interval of interest. With MENA, regions of a genome that are up to approximately 50–100 kp can be assessed for variants in a matter of a few days once Microarrays 2016, 5, 7 10 of 11 the arrays are synthesized. As shown, moreover, MENA can be applied to detect variants relative to a reference genome or to compare mutant and wild-type samples, and both approaches have allowed us to detect novel variants. A third utility for MENA is the screening of panels of genes associated with a particular disease in order to quickly identify known or novel gene variants likely underlying the disease for a particular patient. This strategy could be particularly useful for diseases or disorders associated with large numbers of genes such as metabolic disorders. Moreover, since the newest array platforms contain large numbers of features (approximately one million features for the Agilent platform), a single array design could be used to interrogate disease genes from a number of different disorders, thus minimizing the number of array designs needed. After employing MENA, the researcher would only need to focus on analysis of the relevant genes for his or her disorder of interest. We chose to use Roche NimbleGen microarrays during the planning phases of this study due to their long oligos, which provide increased sensitivity and specificity over short microarrays. Although Roche NimbleGen no longer produces microarrays, Agilent microarrays have equivalent long oligonucleotides (similarly built base by base off the array surface) and oligo density. Thus, it is likely that our strategy will work for the Agilent platform as well. MENA, we propose, can be a rigorous, user-friendly and efficient methodology to simultaneously validate and test the presence of multiple known SNPs and/or detect novel variants using long-oligo array-based hybridization assays.

5. Conclusions We have developed an accurate, efficient low-cost mutation detection methodology (Mismatch EndoNuclease Array, or MENA) that pairs DNA mismatch endonuclease enzymology with tiling microarray hybridization to detect novel or known genomic variation at the level of point mutations or indels, provided the genome or genomic region of interest has been sequenced. Importantly, MENA is an independent genomics methodology that can be used to confirm variants identified by next generation sequencing methodologies, especially in cases where sequencing coverage is low. Since correlating genomic variation with disease is a primary focus of the biomedical research community, new platforms that can aid in correctly characterizing such variation will be in demand.

Supplementary Materials: The following are available online at www.mdpi.com/2076-3905/5/2/7/s1. Acknowledgments: The authors would like to thank Aline Petrin and Jeffrey C. Murray for providing the genotyping information for the Centre d’Etude du Polymorphisme Humain (CEPH) sample used in this study, as well as helpful discussions. This project was supported by the following University of Iowa internal grants to Josep M. Comeron and John Robert Manak: Carver Collaborative Pilot Grant, Institute for Clinical and Translational Science (ICTS) Translational Grant, Grow Iowa Values Fund Seed Grant. The project was also supported by an National Institutes of Health grant to Daniel F. Eberl (R01 DC04848). Author Contributions: Josep M. Comeron and John Robert Manak conceived and designed the experiments; Jordan Reed, Matthew Christie, Julia S. Jacobs, and Jason Dierdorff performed the experiments; Josep M. Comeron, Daniel F. Eberl, and John Robert Manak analyzed the data; Daniel F. Eberl contributed reagents/materials/analysis tools; and Josep M. Comeron and John Robert Manak wrote the paper. Conflicts of Interest: The authors declare no conflict of interest.

References

1. De la Vega, F.M.; Lazaruk, K.D.; Rhodes, M.D.; Wenz, M.H. Assessment of two flexible and compatible SNP genotyping platforms: Taqman SNP genotyping assays and the snplex genotyping system. Mutat. Res. 2005, 573, 111–135. [CrossRef][PubMed] 2. Fan, J.B.; Chee, M.S.; Gunderson, K.L. Highly parallel genomic assays. Nat. Rev. Genet. 2006, 7, 632–644. [CrossRef][PubMed] 3. Gresham, D.; Dunham, M.J.; Botstein, D. Comparing whole genomes using DNA microarrays. Nat. Rev. Genet. 2008, 9, 291–302. [CrossRef][PubMed] 4. Hoheisel, J.D. Microarray technology: Beyond transcript profiling and genotype analysis. Nat. Rev. Genet. 2006, 7, 200–210. [CrossRef][PubMed] Microarrays 2016, 5, 7 11 of 11

5. LaFramboise, T. Single nucleotide arrays: A decade of biological, computational and technological advances. Nucleic Acids Res. 2009, 37, 4181–4193. [CrossRef][PubMed] 6. Trevino, V.; Falciani, F.; Barrera-Saldana, H.A. DNA microarrays: A powerful genomic tool for biomedical and clinical research. Mol. Med. 2007, 13, 527–541. [CrossRef][PubMed] 7. van Dijk, E.L.; Auger, H.; Jaszczyszyn, Y.; Thermes, C. Ten years of next-generation sequencing technology. Trends Genet. TIG 2014, 30, 418–426. [CrossRef][PubMed] 8. Qiu, P.; Shandilya, H.; D‘Alessio, J.M.; O‘Connor, K.; Durocher, J.; Gerard, G.F. Mutation detection using surveyor nuclease. BioTechniques 2004, 36, 702–707. [PubMed] 9. Olcaydu, D.; Harutyunyan, A.; Jager, R.; Berg, T.; Gisslinger, B.; Pabinger, I.; Gisslinger, H.; Kralovics, R. A common JAK2 haplotype confers susceptibility to myeloproliferative neoplasms. Nat. Genet. 2009, 41, 450–454. [CrossRef][PubMed] 10. Bannwarth, S.; Procaccio, V.; Paquis-Flucklinger, V. Surveyor nuclease: A new strategy for a rapid identification of heteroplasmic mitochondrial DNA mutations in patients with respiratory chain defects. Hum. Mutat. 2005, 25, 575–582. [CrossRef][PubMed] 11. Oleykowski, C.A.; Bronson Mullins, C.R.; Godwin, A.K.; Yeung, A.T. Mutation detection using a novel plant endonuclease. Nucleic Acids Res. 1998, 26, 4597–4602. [CrossRef][PubMed] 12. Till, B.J.; Reynolds, S.H.; Greene, E.A.; Codomo, C.A.; Enns, L.C.; Johnson, J.E.; Burtner, C.; Odden, A.R.; Young, K.; Taylor, N.E.; et al. Large-scale discovery of induced point mutations with high-throughput tilling. Genome Res. 2003, 13, 524–530. [CrossRef][PubMed] 13. Yang, B.; Wen, X.; Kodali, N.S.; Oleykowski, C.A.; Miller, C.G.; Kulinski, J.; Besack, D.; Yeung, J.A.; Kowalski, D.; Yeung, A.T. Purification, cloning, and characterization of the CEL I nuclease. Biochemistry 2000, 39, 3533–3541. [CrossRef][PubMed] 14. Spradling, A.C.; Stern, D.; Beaton, A.; Rhem, E.J.; Laverty, T.; Mozden, N.; Misra, S.; Rubin, G.M. The Berkeley Drosophila gene disruption project: Single P-element insertions mutating 25% of vital Drosophila genes. Genet. 1999, 153, 135–177. 15. Mancebo, R.; Zhou, X.; Shillinglaw, W.; Henzel, W.; Macdonald, P.M. BSF binds specifically to the bicoid mRNA 31 untranslated region and contributes to stabilization of bicoid mRNA. Mol. Cell. Biol. 2001, 21, 3462–3471. [CrossRef][PubMed] 16. Lee, E.; Sivan-Loukianova, E.; Eberl, D.F.; Kernan, M.J. An IFT-A protein is required to delimit functionally distinct zones in mechanosensory CILIA. Curr. Biol. CB 2008, 18, 1899–1906. [CrossRef][PubMed] 17. Eberl, D.F.; Hardy, R.W.; Kernan, M.J. Genetically similar transduction mechanisms for touch and hearing in Drosophila. J. Neurosci. 2000, 20, 5981–5988. [PubMed] 18. Eberl, D.F.; Duyk, G.M.; Perrimon, N. A genetic screen for mutations that disrupt an auditory response in Drosophila melanogaster. Proc. Natl. Acad. Sci. USA 1997, 94, 14837–14842. [CrossRef][PubMed] 19. Wall, J.D.; Tang, L.F.; Zerbe, B.; Kvale, M.N.; Kwok, P.Y.; Schaefer, C.; Risch, N. Estimating genotype error rates from high-coverage next-generation sequence data. Genome Res. 2014, 24, 1734–1739. [CrossRef] [PubMed]

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons by Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).