US 2011 0098193A1 (19) United States (12) Patent Application Publication (10) Pub. No.: US 2011/0098193 A1 Kingsmore et al. (43) Pub. Date: Apr. 28, 2011

(54) METHODS AND SYSTEMS FOR MEDICAL type (such as a disease symptom) of the trait, wherein the SEQUENCING ANALYSIS association of the relevant element with the relevant compo nent phenotype identifies the relevant element as an element (76) Inventors: Stephen F. Kingsmore, St. associated with the trait, wherein the relevant component Augustine, FL (US); Callum J. phenotype is a component phenotype having a threshold Bell, Santa Fe, NM (US) value of severity, age of onset, specificity to the trait or dis ease, or a combination, wherein the relevant element is an (21) Appl. No.: 12/910,764 element having a threshold value of importance of the ele ment to homeostasis relevant to the trait, intensity of the (22) Filed: Oct. 22, 2010 perturbation of the element, duration of the effect of the element, or a combination. The disclosed methods are based Related U.S. Application Data on a model of how elements affect complex diseases. The disclosed model is based on the existence of significant (60) Provisional application No. 61/254,115, filed on Oct. genetic and environmental heterogeneity in complex dis 22, 2009. eases. Thus, the specific combinations of genetic and envi ronmental elements that cause disease vary widely among the Publication Classification affected individuals in a cohort. The disclosed model is an (51) Int. Cl. effective, general experimental design and analysis approach C4OB 30/04 (2006.01) for the identification of causal variants in common, complex C4DB 60/2 (2006.01) diseases by medical sequencing. Also disclosed herein are methods of identifying an inherited trait in a subject. The (52) U.S. Cl...... 506/9; 506/39 disclosed methods compare a reference sequence from a Sub ject to a library of sequences that contain each mutation. For (57) ABSTRACT a given mutation, a normal sequence read aligns best to the Disclosed are methods of identifying elements associated normal library sequence. A read having the mutation aligns with a trait, such as a disease. The methods can comprise, for best to the mutant library sequence. The disclosed model and example, identifying the association of a relevant element the disclosed methods based on the model can be used to (such as a genetic variant) with a relevant component pheno generate valuable and useful information.

O

Select a set of samples

102 Extract nucleic acids from samples

1

Perform DNA sequencing

04 Align sequence reads to a reference database 1 05 Identify potential variants for each sample (for example, SNPs) 106 Apply a first subset of rules to the candidate variants 1 Of Apply a second subset of rules to the candidate variants

1 08

Walidate nominated genes

109

Examine association of validated gene variants with component phenotypes Patent Application Publication Apr. 28, 2011 Sheet 1 of 40 US 2011/0098193A1

FIG. 1 FIG. 2 101 201

Select a set of samples Select a set of samples

102

Extract nucleic acids from Extract nucleic acids from samples samples

103 203

Perform DNA sequencing Perform DNA sequencing

104 204 Align Sequence reads to a Align Sequence reads to a reference database reference database 105 205

ldentify potential variants for ldentify potential variants for each sample (for example, each Sample (for example, SNPs) indels) 106 Apply a first subset of rules to Apply a first subset of rules to the candidate variants the Candidate variants 107 Apply a second subset of rules Apply a second subset of rules to the Candidate variants to the Candidate variants

108 208

Validate nominated genes Validate nominated genes

109 209

Examine association of validated Examine association of validated gene variants with component gene variants with component phenotypes phenotypes Patent Application Publication Apr. 28, 2011 Sheet 2 of 40 US 2011/00981.93 A1

FIG. 3

301

Identify an association of a relevant element With a relevant component phenotype of the trait

Wherein the association of the relevant element With the relevant component phenotype identifies the relevant element as an element associated with the trait

Wherein the relevant component phenotype is a component phenotype having a threshold value of severity, age of Onset, specificity to the trait or disease, Or a Combination

Wherein the relevant element is an element having a threshold value of importance of the element to homeostasis relevant to the trait, intensity of the perturbation of the element, duration of the effect of the element, or a Combination Patent Application Publication Apr. 28, 2011 Sheet 3 of 40 US 2011/00981.93 A1

9717

Patent Application Publication Apr. 28, 2011 Sheet 4 of 40 US 2011/00981.93 A1

G09

9ouembæS Spee»-

?seO M??AuÐAO

Patent Application Publication Apr. 28, 2011 Sheet 5 of 40 US 2011/00981.93 A1

Patent Application Publication Apr. 28, 2011 Sheet 6 of 40 US 2011/00981.93 A1

transcript: NM 080426.1 cose: SD1437 Help variant filter A. Restrict where ot least reods Co variont DSNP OnSSNP Oin/del B. Restrict where ot least reads cover position Voront in Coding region C. Restrict where between 2, and % of reads call variant DVOriont Couses premature stop NOTE: A/B=C. UPDATE CLEAR Overview o SNP nSSNP in/del

closeup Obp

----Y - -a -a -a -a -a -a -a --a -

- a 4- - A 4-- - a 4- 4 - - a 4- - a 4

a 4- 4- -a -- 4----- 4- 4- -- 4 4- - a -a - a

4- 4- -a 4- -A

FG.7A Patent Application Publication Apr. 28, 2011 Sheet 7 of 40 US 2011/00981.93 A1

transcript: NM_080426. COSe: SD1437 Help variant filter A. Restrict where ot least 4 reads cat YOriont SNP InsSNP in/del B. Restrict where at least reods COver position Variant in Coding region C. Restrict where between 50 % and % of reads coil variont VOriont Couses premature stop NOTE: A/B=C% UPDATE CLEAR Overview o SNP asSNP in/del

closeup Obp 1551bp

-a a -ara a4- -- - 4--a 4 --a 4-4-4--as-a-- A 4 - - A - A - A - A - 4 - A a - a 4- - a - a 4- -a - a 4- 4 - 4 - - a 4 -la-la all 4l la la 4-l. 4 l 4-l la 4-l la a sea a a sa sea a -a a -a a - a 4- - a -a - a -a - a 4- - A 4 - - A 4- 4- -a -a - a 4- 4- 4 - -a -a -a 4 4- 4 - -A 4 - - a 4- -a - A 4- - A 4 - - a 4- - a 4 + 4- - a 4- - a -a - a 4- - a 4 -a 4- 4 - -A 4- -A 4-4- 4--a 4- -A as a 4-a- 4a aa a -a 4- a -a a 4-a- sa 4s sa sa sy sla as a as sla 4-s - a 4- - a 4- - a -a - a -a - a 4- - a as a 4s as a sa as sea a 4s as sla - a 4- 4- 4 - - a 4- -a - a 4- -a - a 4 -a -a. -a -a -a -a -a -a -a - a 4 4- 4l -la -la 4-l. 4-l la la la la sa sa as as sea as a sla sa sla - a 4- - a 4- - a 4- 4 - - a 4- - a 4 -A - a 4- a 4- -a -a -a -a 4 area rea 4-e 4-e are sea e A v. A 4-see sea ala 4- a - a 4a a ma ma 4- 4- -a -a -a - a -A 4- 4- -a -a 4 sea area 4e are 4- 4 - - a 4 4- - a -a - a 4- -a -la 4 - a 4- 4- 4 - 4- 4 - a 4- -a -a - a 4 4l 4-l -la. 4l la sea as as a ra as

- a 4- - a m a 4m ma 4m.

FIG.7B Patent Application Publication Apr. 28, 2011 Sheet 8 of 40 US 2011/00981.93 A1

FG. 8 30 SID1437 25 o .. SD1438 2O X ...) SD1439 t 15 & 10

& S S S S S is is & S S is is

303

f 8. :::: Number of aligned teads per RefSeq transcript, run Patent Application Publication Apr. 28, 2011 Sheet 9 of 40 US 2011/00981.93 A1

A01A/040

0 Patent Application Publication Apr. 28, 2011 Sheet 10 of 40 US 2011/00981.93 A1

0??V\/\/\/\/010LOLVOLLOLLO\/\/910L\/\/9LVVOOLV?©(VVO 669][99VVVVV?LÐLOLVOLLOLLOVVOLS)LVVOLVVOOLV?©?VVOLO

|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||

Patent Application Publication Apr. 28, 2011 Sheet 12 of 40 US 2011/00981.93 A1

O

FIG. 3 Patent Application Publication Apr. 28, 2011 Sheet 13 of 40 US 2011/00981.93 A1

y Patent Application Publication Apr. 28, 2011 Sheet 14 of 40 US 2011/00981.93 A1

Patent Application Publication Apr. 28, 2011 Sheet 15 of 40 US 2011/00981.93 A1

Process Overview Enter Sample Rock the Samples in to MS -8) Bood Rock the Extract DNA DNA sample QC by NSureSelect

Samples in from blood in hand nanodrop fridge

Roin)drce 7

SPR Corcotenate QC by RainDOnce QC by Sonicote purification- Prep Bibrary Sonicote bioanalyzer errichmenttarget bioanalyzer 3 kb

Pool braries

QC enriched Poolbroies 8. QC by E.--DNA conse--(PSR-Agent yard Bioanalyzer & SPR, Prep library - St. Sequencing & nonodrop) DNA enrichment nanodrop purification 150 bp

Disambiguate Export MP Pogle Run illuminC multiplexed Sequence Quality Scores Run Alpheus Run COverage plot and VC pipeline Indexes dota as files report pipeline Script chart reports

1st Cycle % of redds Run mutation % contro 8 ot binned detection genotypes colled Final report Scripts correctly FG 16 Patent Application Publication Apr. 28, 2011 Sheet 16 of 40 US 2011/00981.93 A1

F.G. 17

A

o t 2s 100on C. S.

*...*** ---- so 360 5oo 1ooo sooo Size (Nucleotides)

B 800

s $30 s

30

3.

Average Fold-Coverage Patent Application Publication Apr. 28, 2011 Sheet 17 of 40 US 2011/00981.93 A1

F.G. 17

& Sequence Depth (G B) :-

SO Average3. 38 Fold-Coverage23. 38 3& :: Patent Application Publication Apr. 28, 2011 Sheet 18 of 40 US 2011/0098193A1

F.G. 18

SOLD3 SB, lumina GAix SBS

infinium Omni 1 Ouad Array

SOiD3 SB, lumina GAlix SBS

infinium Omni1 Ouad Array Patent Application Publication Apr. 28, 2011 Sheet 19 of 40 US 2011/00981.93 A1

Patent Application Publication Apr. 28, 2011 Sheet 20 of 40 US 2011/00981.93 A1

Patent Application Publication Apr. 28, 2011 Sheet 21 of 40 US 2011/00981.93 A1

FG. 20

**** ?

2 35 S.

d 3s .8 F. Sample Sample Patent Application Publication Apr. 28, 2011 Sheet 22 of 40 US 2011/00981.93 A1

FG. 20

Sarge Patent Application Publication Apr. 28, 2011 Sheet 23 of 40 US 2011/00981.93 A1

FG. 21

33% :::::::::::::::::::x:x:x:8:xxx-xx-x-x: specificity

33% . .xx &KY

97%

3.x.

S&

83.3 2: s .. 4 Sequence epth (GB)

3

2 :

.

O 10 20 3O4 SO 80 8O3C OO Allele Frequency (% Patent Application Publication Apr. 28, 2011 Sheet 24 of 40 US 2011/00981.93 A1

FG, 21

8 --- . 8 r . 4-8 . . 2. cooa cosos" to : to ozoa os os is False Positive (1-Specificity) False Positive (1-Specificity) Patent Application Publication Apr. 28, 2011 Sheet 25 of 40

Patent Application Publication Apr. 28, 2011 Sheet 26 of 40 US 2011/00981.93 A1

Patent Application Publication Apr. 28, 2011 Sheet 27 of 40 US 2011/00981.93 A1

Patent Application Publication Apr. 28, 2011 Sheet 28 of 40 US 2011/00981.93 A1

Z-8ZZ"SO|- Patent Application Publication Apr. 28, 2011 Sheet 29 of 40 US 2011/00981.93 A1

F.G. 22C

O 2 3 4 is is Y 8 Carrier burden Patent Application Publication Apr. 28, 2011 Sheet 30 of 40 US 2011/00981.93 A1

FG. 23

8:a:: 88.S. SESSE4: ... i. 3 &S. E4:... i. e.g.: S.

... it is a .8 s :8: Sr.: as sis 3.l. i. 3: Si:. . ii. 33:

&c. : 85.3888 &tta:38tgg3.p.3.383&tigtgg Caatig33.3g 3.85382.5 Raasi. . .343 3838 . . .3333.24. Til I, Eigt. Si

38::::ities: K.8s 38 3. 8: a 3xiii.33. 3.ligasaki. e. gig &

< ...&S3888 taaatgttgcatatastastttggctggiastic gags ago. 3538.33. easi. 11 FFE3 3:38 - F - 3 3313225, CT: , argth. S.

is.... as r &is is: Stasic r is: it art as Irisis

3&t . .

8:t. i. 8.3888 it giggl3 tact:33:3:ttgggg.88itigaga 3.g.: .8538.33 Essi. 33.38 SEPTERE: 3:53: Si Tii, , exagtik: :

k:... as a : 8 is & 3.3 r is...s. irrest s i.e.

s :38: . riri Skict: .3333333 388tgttgata Criticia&taatttggotg83tti gag caaga '853:328 agai: .33333 S&S. E . . . . . 35. 338; i.i. . . argt... &

... is 8 S. E. St. &rs, ri as its i.l. i. its it is i.i.d.s.

383 : S. Patent Application Publication Apr. 28, 2011 Sheet 31 of 40 US 2011/00981.93 A1

|}0001|| \|0||}()0$

|(76769919987699)

Patent Application Publication Apr. 28, 2011 Sheet 32 of 40 US 2011/00981.93 A1

‘)(){}\0!{ 01\\$$)\!

Patent Application Publication Apr. 28, 2011 Sheet 33 of 40 US 2011/00981.93 A1

3

9000ffb0$-?N 0$| |(

y y

y

y y

× y y

y y

Patent Application Publication Apr. 28, 2011 Sheet 36 of 40 US 2011/00981.93 A1

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

|:Álên!) |:ÁJºnò |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Patent Application Publication US 2011/00981.93 A1

Patent Application Publication Apr. 28, 2011 Sheet 39 of 40 US 2011/00981.93 A1

Patent Application Publication Apr. 28, 2011 Sheet 40 of 40 US 2011/00981.93 A1

Sample NA20581 (known heterozygous for 966bp CLN3 deletion mutation) Sequences aligning to mutant reference: 62 Sequences aligning to normal reference: 112 Sample NA20582 (known heterozygous for 966bp CLN3 deletion mutation) Sequences aligning to mutant reference: 38 Sequences aligning to normal reference: 94 Sample NA20584 (known heterozygous for 966bp CLN3 deletion mutation) Sequences aligning to mutant reference: 37 Sequences aligning to normal reference: 68 FIG. 3OC US 2011/0098.193 A1 Apr. 28, 2011

METHODS AND SYSTEMIS FOR MEDICAL value of severity, age of onset, specificity to the trait or dis SEQUENCING ANALYSIS ease, or a combination, wherein the relevant element is an element having a threshold value of importance of the ele BACKGROUND ment to homeostasis relevant to the trait, intensity of the perturbation of the element, duration of the effect of the 0001 Medical sequencing is a new approach to discovery element, or a combination. of the genetic causes of complex disorders. Medical sequenc ing refers to the brute-force sequencing of the genome or 0006. The disclosed methods are based on a model of how transcriptome of individuals affected by a disease or with a elements affect complex diseases. The disclosed model is trait of interest. Dissection of the cause of common, complex based on the existence of significant genetic and environmen traits is anticipated to have an immense impact on the bio talheterogeneity in complex diseases. Thus, the specific com technology, pharmaceutical, diagnostics, healthcare and agri binations of genetic and environmental elements that cause cultural biotech industries. In particular, it is anticipated to disease vary widely among the affected individuals in a result in the identification of novel diagnostic tests, novel cohort. Implications of this model include: (1) comparisons targets for drug development, and novel strategies for breed of candidate variant allele frequencies between affected and ing improved crops and livestock animals. Medical sequenc unaffected cohorts that do not identify statistical differences ing has been made possible by the development of transfor in a complex disease do not exclude that variant from causal mational, next generation DNA sequencing instruments, such ity in individuals within the affected cohort; (2) experimental as those, for example, developed by 454 LifeSciences/Roche designs based upon comparisons of candidate variant allele Diagnostics, Applied Biosystems/Agencourt, Illumina/ frequencies between affected and unaffected cohorts, even if Solexa and Helicos, which instruments are anticipated to undertaken on a large Scale, will fail to disclose causal vari increase the speed and throughput of DNA sequencing by ants in situations where there is a high degree of heterogeneity 3000-fold (to 2 billion base pairs of DNA sequence per instru among individuals in causal elements; and (3) statistical ment per experiment). methods will not give detailed information on a specific indi 0002 Common, conventional approaches to the discovery vidual, which is a key need in personalized medicine and of the genetic basis of complex disorders include the use of medical sequencing. linkage disequilibrium to identify quantitative trait loci in 0007. The disclosed model is an effective, general experi studies of multiple sets of affected pedigrees, candidate gene mental design and analysis approach for the identification of based association studies in cohorts of affected and unaf causal variants in common, complex diseases by medical fected individuals that have been matched for confounding sequencing. The model can utilize various approaches includ factors such as ethnicity, and whole genome genotyping stud ing, but not limited to, one or more of the following: (1) ies in which associations are sought between linkage disequi evaluating associations with component phenotypes (Cp) librium segments (based upon tagging SNP genotypes or rather than diseases (D): a “candidate component phenotype haplotypes), and diagnosis in cohorts of affected and unaf approach; (2) including severity (Sy) and duration (t) when fected individuals that have been matched for confounding evaluating associations with Cp; (3) evaluating associations factors. in individuals and subsets of cohorts in addition to cohorts: 0003. These methods are based on the assumption that (4) evaluating associations in single pedigrees rather than complex disorders share underlying genetic components (i.e., integrating results of several pedigrees; (5) including inten are largely genetically homogeneous). In other words, while sity of the perturbation (I) and t in associations of elements complex diseases result from the cumulative impact of many (E). For medical sequencing, this can mean, for example, genetic factors, those factors are largely the same in individu focusing on non-synonymous variants with large negative als. While this assumption has met with some Success, there BLOSUM (BLOcks of Amino Acid SUbstitution Matrix are numerous cases where this commonality has failed. scores). For medical sequencing this has the further implica Progress in dissecting the genetics of complex disorders tion that evaluations of the transcriptome sequence and abun using these approaches has been slow and limited. Software dance in affected cells or tissues is likely to provide greater systems for DNA sequence variant discovery operating under signal to noise than the genome sequence; (6) following cata this assumption are inadequate for next-generation DNA loging of E, I and t, assemble E into a minimal set of physi sequencing technologies that feature short read lengths, novel ologic or biochemical pathways or networks (P). Seek asso base calling and quality score determination methods, and ciations of resultant P with Cp; and (7) seeking unbiased relatively high error rates. approaches to selection of Cp. For example, seek associations 0004. Therefore, what are needed are systems and meth with Cp that are suggested by P. Further, Cp can vary from ods that overcome the challenges found in the art, some of highly specific to general. Initial associations with Cp can be which are described above. as specific as possible based upon P. 0008. The disclosed model and the disclosed methods SUMMARY based on the model can be used to generate valuable and useful information. Atabasic level, identification of elements 0005 Disclosed are methods of identifying elements asso (such as genetic variants) that are associated with a trait (Such ciated with a trait, such as a disease. The methods can com as a disease or phenotype) provides greater understanding of prise, for example, identifying the association of a relevant traits, diseases and phenotypes. Thus, the disclosed model element (Such as a genetic variant) with a relevant component and methods can be used as research tools. At another level. phenotype (such as a disease symptom) of the trait, wherein the elements associated with traits through use of the dis the association of the relevant element with the relevant com closed model and methods are significant targets for, for ponent phenotype identifies the relevant element as an ele example, drug identification and/or design, therapy identifi ment associated with the trait, wherein the relevant compo cation and/or design, Subject and patient identification, diag nent phenotype is a component phenotype having a threshold nosis, prognosis as they relate to the trait. The disclosed US 2011/0098.193 A1 Apr. 28, 2011

model and methods can identify elements associated with comprise identifying the association of a relevant element traits that are more significant or more likely to be significant with a relevant component phenotype of the trait; to the genesis, maintenance, severity and/or amelioration of 0016 FIG. 4 is a block diagram illustrating an exemplary the trait. The display, output, cataloging, addition to databases operating environment for performing the disclosed method; and the like of elements associated with traits and the asso 0017 FIG. 5 is a block diagram illustrating an exemplary ciation of elements to traits provides useful tools and infor web-based navigation map. Several user-driven query and mation to those identifying, designing and validating drugs, reporting functions can be implemented; therapies, diagnostic methods, prognostic methods in relation 0018 FIG. 6 shows an example of a sequence query inter to traits. face; 0009. Also disclosed are methods of identifying an inher (0019 FIG. 7 illustrates the identification of a coding ited trait in a subject. These methods exploit the simple obser domain (CD) SNP in the C. subunit of the Guanine nucleotide Vation that any sequence, normal or otherwise, matches per binding stimulatory protein (GNAS) using the disclosed fectly with itself Instead of comparing sequence reads from a methods: patient to a general reference genome, the methods of the present invention can create a library of sequences, each of 0020 FIG. 8 is a graph showing the length distribution of which is a perfect match to a known mutation. The library 454 GS20 reads: includes the normal sequence at each mutation position. 0021 FIG. 9 is a graph showing run-to-run variation in Incoming sequence reads are compared to every sequence the RefSeq transcript read counts; library and the best matches are determined For a given muta 0022 FIGS. 10A-C illustrate an example of a novel splice tion, a normal sequence read (i.e., one lacking the mutation) isoform identified with GMAP by an apparent SNP at the aligns best to the normal library sequence. A read having the penultimate base of an alignment; mutation aligns best to the mutant library sequence. 0023 FIG. 11 illustrates an example of a novel splice 0010. It should be understood that elements (such as isoform identified with GMAP by an apparent SNP at the genetic variants) identified using the disclosed model and penultimate base of an alignment; methods can be part of other components or features (such as (0024 FIG. 12 illustrates a GMAP alignment of read the gene in which the genetic variant occurs) and/or related to D9VJ59F02JQMRR (nt 1-109, top) from SID 1438, to SYN other components or features (such as the protein or expres CRIP (NM 006372.3, bottom) showing a nsSNP at nt 30 sion product encoded by the gene in which the genetic variant (yellow, a1384 g) and a novel splice isoform that omits an occurs or a pathway to which the expression product of the 105-bp exon and maintains frame; gene belongs). Such components and features related to iden 0025 FIG. 13 is a graph showing the results of pairwise tified elements can also be used in or for, for example, drug comparisons of the copy numbers of individual transcripts in identification and/or design, therapy identification and/or lymphoblast cell lines from related individuals showed sig design, Subject and patient identification, diagnosis, progno nificant correlation; sis as they relate to the trait. Such components and features 0026 FIGS. 14A-D show the alignment of a reference related to identified elements can also be targets for identify sequence to other various sequences including normal and ing, designing and Validating drugs, therapies, diagnostic mutant Sequences: methods, prognostic methods in relation to traits and/or can 0027 FIGS. 15A-C illustrate the alignment of sequence provide useful tools and information to those identifying, reads to a normal reference and to a mutant reference. designing and validating drugs, therapies, diagnostic meth (0028 FIG. 16 shows the workflow of the comprehensive ods, prognostic methods in relation to traits. carrier Screening test, comprising sample receiving and DNA 0011 Additional advantages are set forth in part in the extraction, target enrichment from DNA samples, multi description which follows or can be learned by practice. The plexed sequencing library preparation, next generation advantages are realized and attained by means of the elements sequencing and bioinformatic analysis. and combinations particularly pointed out in the appended (0029 FIGS. 17A-D shows analytic metrics of multiplexed claims. It is to be understood that both the foregoing general carrier testing by next generation sequencing. description and the following detailed description are exem 0030 FIGS. 18A-B show Venn diagrams of specificity of plary and explanatory only and are not restrictive, as claimed. on-target SNP calls and genotypes in 6 samples. 0031 FIG. 19 shows a decision tree to classify sequence BRIEF DESCRIPTION OF THE DRAWINGS variation and evaluate carrier status. 0012. The accompanying drawings, which are incorpo 0032 FIGS. 20A-G show detection of gross deletion rated in and constitute a part of this specification, illustrate mutations by local reduction in normalized aligned reads. embodiments and together with the description, serve to 0033 FIGS. 21A-D show clinical metrics of multiplexed explain the principles of the methods and systems: carrier testing by next generation sequencing. 0013 FIG. 1 is a block diagram illustrating an exemplary 0034 FIGS. 22A-C show disease mutations and carrier medical sequencing method utilizing, for example, 454 burden in 104 DNA samples. pyrosequencing and Substitution variants in transcriptome 0035 FIG. 23 shows five reads from NA202057 showing sequence data; AGA exon 4, c.488GdC, C163S, chr4:178596912GdC and 0014 FIG. 2 is a block diagram illustrating another exem exon 4, c.482G2-A, R161O, chr4:178596918G-A (black plary medical sequencing method utilizing, for example, 454 arrows). 193 of 400 reads contained these substitution DMs pyrosequencing and indel variants in transcriptome sequence (CM910010 and CM910011). data; 0036 FIG. 24 shows a screen shot of the custom Agilent 0015 FIG. 3 is a block diagram illustrating a method of Sure Select RNA bait for hybrid capture of gene GAA (dis identifying elements associated with a trait, the methods can ease GSD2). US 2011/0098.193 A1 Apr. 28, 2011

0037 FIG. 25 shows a screen shot of the custom Agilent can be performed with any specific embodiment or combina Sure Select RNA bait for hybrid capture of gene HBZ-HBO1 tion of embodiments of the disclosed methods. (disease—thalassemia). 0048. The present methods and systems may be under 0038 FIG. 26 shows a screen shot of the custom Agilent stood more readily by reference to the following detailed Sure Select RNA bait for hybrid capture of gene CLN3 (dis description of preferred embodiments and the Examples ease—Battten). included therein and to the Figures and their previous and 0039 FIG. 27 shows one end offive reads from NA01712 following description. showing ERCC6 exon 17, c.3536delA, Y1179fs, chr10: 5034847 6.delA. I. MODEL 0040 FIG. 28 shows one end offive reads from NA20383 0049 Genetic heterogeneity is a potential cause for the showing CLN3 exon 11, c.1020G>T, E295X, chr16: lack of replication among studies of complex disorders. The 284.01322G>T (black arrow). prevailing assumption has been that there is sufficient homo 0041 FIG. 29 shows one end offive reads from NA16643 geneity in causal elements in individuals affected by a com showing HBB exon 2, c.306G>C, E102D, chr11: mon, complex disease that the comparisons of candidate vari 5204392G>C (Black arrow). ant allele frequencies between affected and unaffected 0042 FIG. 30 shows the strategy for detection of a large cohorts can identify differences based on some inferential deletion mutation in a human genomic DNA sample. measure. This assumption was borne out of Successes in studies of this type. For example, HLA haplotypes show DETAILED DESCRIPTION association with several common, complex diseases. 0043. Before the present methods and systems are dis 0050. However, to uncover the causative genetic compo closed and described, it is to be understood that the methods nents relevant to individual, personalized medicine, a move and systems are not limited to specific synthetic methods, from the statistical to the determinate is desired. Regarding specific components, or to particular compositions, as Such complex diseases, if there is insufficient homogeneity of can, of course, vary. It is also to be understood that the termi causal elements among affected individuals to enable detec nology used herein is for the purpose of describing particular tion of statistical differences, then a move from the statistical embodiments only and is not intended to be limiting. to the determinate is also desired. The disclosed model is 0044 As used in the specification and the appended based on the existence of significant genetic and environmen claims, the singular forms “a,” “an and “the include plural talheterogeneity in complex diseases. Thus, the specific com referents unless the context clearly dictates otherwise. binations of genetic and environmental elements that cause Ranges can be expressed herein as from “about one particu disease vary widely among the affected individuals in a lar value, and/or to “about another particular value. When cohort. Implications of this model include: (1) comparisons Such a range is expressed, another embodiment includes from of candidate variant allele frequencies between affected and the one particular value and/or to the other particular value. unaffected cohorts that do not identify statistical differences Similarly, when values are expressed as approximations, by in a complex disease do not exclude that variant from causal use of the antecedent “about it will be understood that the ity in individuals within the affected cohort; (2) experimental particular value forms another embodiment. It will be further designs based upon comparisons of candidate variant allele understood that the endpoints of each of the ranges are sig frequencies between affected and unaffected cohorts, even if nificant both in relation to the other endpoint, and indepen undertaken on a large Scale, will fail to disclose causal vari dently of the other endpoint. ants in situations where there is a high degree of heterogeneity 0045 “Optional or “optionally’ means that the subse among individuals in causal elements; and (3) statistical quently described event or circumstance may or may not methods will not give detailed information on a specific indi occur, and that the description includes instances where said vidual, which is a key need in personalized medicine and event or circumstance occurs and instances where it does not. medical sequencing. 0046 Throughout the description and the claims of this 0051. The disclosed model is based upon genetic, environ specification, the word “comprise' and variations of the mental and phenotypic heterogeneity in common, complex word, Such as “comprising and “comprises.” means “includ diseases. The model notes that multiple elements (E. . . E) ing but not limited to.” and is not intended to exclude, for can be involved in the causality of a common, complex dis example, other additives, components, integers, or steps. ease (D). These elements can be genetic (G) factors, environ “Exemplary' means “an example of and is not intended to mental (E) factors or combinations thereof. The traditional convey an indication of a preferred or ideal embodiment. approach is to decompose GXE into genetic factors, G (which “Such as is not used in a restrictive sense, but for explanatory can be further decomposed into additive 'a', dominance “d'. purposes. and epistatic 'e' factors), an environment factor “E”, their 0047 Disclosed are components that can be used to per non-linear interaction "GxE', and a noise term "epsilon” form the disclosed methods and systems. These and other (always present in every experiment and every data set). The components are disclosed herein, and it is understood that genetic decomposition can be important because additive when combinations, Subsets, interactions, groups, etc. of genetic variance is heritable, while dominance and epistatic these components are disclosed that while specific reference variance are reconstituted each generation as a result of each of each various individual and collective combinations and individual's unique genome. It is further noted that elements permutation of these may not be explicitly disclosed, each is can have heterogeneous contributions to phenotypes. Thus specifically contemplated and described herein, for all meth elements can be either deleterious (predisposition) or advan ods and systems. This applies to all aspects of this application tageous (protection) in terms of disease development. Fur including, but not limited to, steps in disclosed methods. ther, elements can vary in expressivity and penetrance. It is Thus, if there are a variety of additional steps that can be further noted that some elements can have very specific performed, it is understood that each of these additional steps effects whereas others are pleiotropic. For example, a variant US 2011/0098.193 A1 Apr. 28, 2011

in an enzyme can affect only a single biochemical pathway (E) and occurrence of a common, complex disease in an whereas a variant in a can affect many outbred population can lead to false negative results. pathways. These additive and nonadditive effects can be con 0056. Different elements in any individual can lead to a text dependent. Thus, the model can view Das a phenomenon given effect. Thus, both genocopies and envirocopies exist. that broadly describes the outward phenotype of the combi 0057 Values oft and I can have significant impact on E. natorial consequence of allelic and environmental variations. Thus, strategies that evaluate gene candidacy based upon a The disclosed model utilizes a more general approach that tagged SNP (which can ignore the variables tand I) can yield can seek associations in individuals. It is further noted that the false positive results. magnitude of the effect of an individual element can be 0.058 Sampling of multiple individuals within a single dependent upon at least three variables: pedigree can be highly informative since the number of com 0052 First, the importance of that particular element for binations of possible elements is greatly decreased by laws of maintenance of homeostasis (H) relevant to the disease (D). inheritance. Some elements have minor importance, while others have 0059 While in any individual pedigree there can be a finite major importance. For example, the knockout of a specific number of elements that cause a common complex disease, in gene in a mouse can result in a phenotype that varies between a set of unrelated pedigrees there exist an extraordinarily no effect and embryonic lethality. Thus each element (E. . . large number of possible combinations of E . . . E., that can E) has a specific, contributory role as part of the cause of, or lead to that disease. In turn, while the variance explained by a protection against, a complex disease (H...H.). Second, the given element (E) in an individual pedigree can certainly be intensity of the perturbation of that element (I). For genetic large, the variance between that element and a disease in a set elements, the intensity of the perturbation is dependent upon of unrelated pedigrees is most likely to be very small. Thus the type of variant, the number of copies of variant element or the magnitude of gene expression difference. The types of associations between individual element frequencies (E) and genetic variant include synonymous (which can be further occurrence of a common, complex disease in sets of unrelated categorized into regulatory and non-regulatory SNP and/or pedigrees can lead to false negative results. coding and noncoding SNP) and non-synonymous SNPs 0060 Another implication includes phenotypic heteroge (which can be further categorized by scores such as BLO neity in common, complex diseases. The model notes that SUM score), indels (coding domain and non-coding domain), conventional definitions of common, complex diseases can and whole or partial gene duplications, deletions and rear represent a combination of multiple component phenotypes rangements. The number of copies of a variant genetic ele (Cp . . . Cp), also known as "endophenotypes, that have ment can reflect homozygosity, heterozygosity or hemizy been rather arbitrarily assembled through years of medical gosity. Thus each element (E. . . E) in an individual has a experience and consensus. These component phenotypes can specific and variable intensity (I...I). Third, the duration of be symptoms, signs, diagnostic values, and the like. the effect of the element (t). Environmental elements can be 0061. Given the informal process of inclusion or exclusion acute or chronic in nature. An example is occurrence of skin of Cp in a common, complex disease, the disclosed model cancer following acute exposure to ultraviolet radiation while notes that individual Cp may not always be present in any Sunbathing versus continuous exposure through an outdoor individual case of a common, complex disease (i.e., pheno occupation. Genetic elements can also be acute or chronic in copies exist). Some Cp are present in the vast majority of nature, since many genes are not constitutively expressed but cases (commonly referred to as pathognomonic features), rather under transcriptional and/or post-transcriptional regu whereas others will be present in only a few. Further, some Cp lation. Therefore, a variant genetic element can not necessar are pleiotropic (i.e., present in multiple common, complex ily be expressed in an individual (called “expressivity” for diseases). An example is elevated serum or plasma C reactive within an individual; “penetrance' for occurrence in a popu protein. Other Cp are unique to a single D. An example is lation). Thus each element (E. . . E) in an individual has a auditory hallucinations. Most Cp are anticipated to fit some specific and variable duration of effect (t... t.) that can not where between these extremes (such as giant cell granulomas be constant but that can be a function of the environment. on histology). 0053 Thus, for any given element E, the contribution 0062. The model further notes that for any D, the conven towards causality in a disease can be a function, f, of these tional cluster of Cp that is used for disease definition is inex three factors. Thus: act. It does not include all relevant Cp—but rather a subset that are currently known, established or included in the description of that disease. Furthermore, some Cp may be 0054 and similarly the disease itself can be a function, incorrectly included in the definition of that D. Other Cp may g, of these n elements: have been incorrectly omitted. Thus each Cp (Cp . . . Cp) can have a specific and individual value in the description of Dg(E) the presence of a common, complex disease (D). The set of Cp 0055. This variability has several implications. For that are used for traditional diagnosis may not be complete or example, while in any individual, there are likely to be a finite completely correct. number of elements that cause a common complex disease, in 0063. An implication of the model is that comparisons of an outbred population there exist an extraordinarily large candidate variant allele frequencies between affected and number of possible combinations of E... E., that can lead to unaffected cohorts as defined by D that do not identify statis that disease. In turn, while the variance explained by a given tical differences in a common, complex disease do not element (E) in an individual can certainly be large (i.e., exclude that variant from causality in Cp in individuals within 5-20%), the variance between that element and a disease in an the affected cohort. A further implication is that experimental outbred population is most likely to be very small (i.e., 0.1%). designs based upon comparisons of candidate variant allele Thus, associations between individual element frequencies frequencies between affected and unaffected cohorts as US 2011/0098.193 A1 Apr. 28, 2011

defined by D. can be subject to false negative errors. A more 0073. The convergence of elements into networks and general approach is to seek associations with Cp. pathways is also based upon experience in analysis of gene 0064. The model further notes that the magnitude of the expression profiling experiments, where many disparate tran effect of an individual Cp can be dependent upon two addi Scripts are typically up-regulated or down-regulated in tional variables. One of the variables is the severity of the expression between two states or individuals. Lists of differ perturbation (Sv) of that Cp. For example, one might have a entially expressed genes are typically analyzed by synthesis thrombocytopenia of 100/mm or 50,000/mm of blood. into perturbed networks or pathways in order to understand Auditory hallucinations may have occurred once a year or the principal differences. many times per hour. Thus each Cp (Cp . . . Cp) in an 0074 Another implication of the model is the combination individual with disease has a specific and variable severity of medical sequencing data with genetic, gene and protein (Sv... Sv). expression and metabolite profiling data. The analysis of 0065. The other variable that an individual Cp can be medical sequencing data—a list of genes with putative, dependent upon is the age of onset (A) of that Cp. For physiologically important sequence variation—can be facili example, dementia can occur in young persons or in the tated by integrative approaches that combine medical elderly. The pathophysiology of dementia in young people is sequencing data results with results of other approaches. Such frequently brain tumor. In elderly persons, it is frequently as genetic (linkage) data, gene expression profiling data and Alzheimer's disease or secondary to depression. Thus each proteomic and metabolic profiling data. Cp (Cp . . . Cp,) in an individual has a specific and variable 0075. The disclosed model is an effective, general experi time to onset (A ...A.). mental design and analysis approach for the identification of 0066. Thus, for any given Cp, an effective definition can causal variants in common, complex diseases by medical be a function, h, of these three factors. Thus: sequencing. The model can utilize various approaches includ ing, but not limited to, one or more of the following: (1) evaluating associations with component phenotypes (Cp) 0067 and therefore: rather than diseases (D): a “candidate component phenotype Dg(E)-h(Cpl. Sv. A....) approach; (2) including severity (Sy) and duration (t) when evaluating associations with Cp; (3) evaluating associations 0068 thus mapping causal elements to phenotypic in individuals and subsets of cohorts in addition to cohorts: expression. (4) evaluating associations in single pedigrees rather than 0069 Cp heterogeneity can have several other implica integrating results of several pedigrees; (5) including inten tions including that attempts to find causal elements in studies sity of the perturbation (I) and t in associations of elements predicated on the traditional definitions of common, complex (E). For medical sequencing, this can mean, for example, diseases are likely to be unsuccessful due to the informal focusing on non-synonymous variants with large negative methods whereby Cp have been assembled into conventional BLOSUM scores. For medical sequencing this has the further definitions and by the weightings of SV or t (if any) by which implication that evaluations of the transcriptome sequence Cp have empirically been weighted. Attempts to find solu and abundance in affected cells or tissues is likely to provide tions for individual Cp are more likely to be successful. Fur greater signal to noise than the genome sequence; (6) follow thermore, attempts to find solutions for individual Cp are ing cataloging of E, I and t, assemble E into a minimal set of more likely to be successful if Sv and t values are measured physiologic or biochemical pathways or networks (P). Seek and cut-off values defined prospectively. associations of resultant P with Cp; and (7) seeking unbiased 0070 Additionally, the inclusion/exclusion of traditional approaches to selection of Cp. For example, seek associations Cp are biased by medical experience and consensus. Unbi with Cp that are suggested by P. Further, Cp can vary from ased Cp (Suggested by experimentally-derived values of E or highly specific to general. Initial associations with Cp can be physiologic or biochemical pathways or networks (P)) are as specific as possible based upon P. more likely to show associations. Molecular Cp, such as gene 0076. As noted above, common complex diseases can or protein expression profiles, are an example of phenotypes have heterogeneous descriptions based on informal assembly that are experimentally-derived and likely to be intermediary of component phenotypes into the disease description. Given between gene sequences and organismal traits. this heterogeneity of the features that can be ascribed to a 0071 Another implication is the convergence of elements disease, and because the principles of this model are not into networks and pathways. Genetic and environmental het limited to “diseases” as that term is used in the art, the dis erogeneity in common, complex disorders can be partitioned closed model and methods can be used in connection with by assembly of individual E into physiologic or biochemical “traits.” The term trait, which is further described elsewhere pathways or networks (P). This is based upon the observa herein, is intended to encompass observed features that may tions that: (a) eukaryotic biochemistry is organized into path or may not constitute or be a component of an identified ways and networks of interacting elements. Very few genes disease. Such traits can be medically relevant and can be act in isolation; (b) eukaryotic biochemistry is rather con associated with elements just as diseases can. strained; and (c) challenges to homeostasis typically evoke 0077. The disclosed model and the disclosed methods Stereotyped responses. based on the model can be used to generate valuable and 0072 Thus, common, complex disorders are anticipated useful information. Atabasic level, identification of elements to appear stochastic or indecipherable when considered at the (such as genetic variants) that are associated with a trait (Such level of Edue both to interactions with the genome and to the as a disease or phenotype) provides greater understanding of intrinsic heterogeneity in causality of D. However, it has been traits, diseases and phenotypes. Thus, the disclosed model realized that heterogeneous combinations of individual E and methods can be used as research tools. At another level. converges into a discrete number of P. Linked, non-casual the elements associated with traits through use of the dis variations, in contrast, are not anticipated to converge into P. closed model and methods are significant targets for, for US 2011/0098.193 A1 Apr. 28, 2011 example, drug identification and/or design, therapy identifi identified for each sample in the discovery set (for example, cation and/or design, Subject and patient identification, diag indels). At block 206, a first subset of rules (a first filter) can nosis, prognosis as they relate to the trait. The disclosed be applied to identify candidate variants (for example, vari model and methods can identify elements associated with ants that can be associated with a trait or disease). In this traits that are more significant or more likely to be significant example, the first Subset of rules can comprise one or more of to the genesis, maintenance, severity and/or amelioration of the following: (1) present in >4 sequence reads; (2) present in the trait. The display, output, cataloging, addition to databases >30% reads (assumes frequency is at least heterozygous); (3) and the like of elements associated with traits and the asso absence of homopolymer bases immediately preceding indel ciation of elements to traits provides useful tools and infor (within 5 nucleotides); (4) high quality score at variant base mation to those identifying, designing and validating drugs, (s); (5) present in sequence reads in both orientations (5' to 3' therapies, diagnostic methods, prognostic methods in relation and 3' to 5"); (6) confirm read alignment to reference to traits. sequence; and (7) exclude reference sequence errors by align 0078. The implications of this model can be incorporated ment to a second reference database into the design of an analysis strategy such as the examples I0083. At block 207, a second subset of rules (a second shown in FIG. 1 and FIG. 2. filter) can be applied to the resulting candidate variants in 007.9 FIG. 1 illustrates an exemplary medical sequencing order to prioritize the candidate variants and nominate can method utilizing, for example, 454 pyrosequencing and Sub didate genes. In this example, the second Subset of rules can stitution variants in transcriptome sequence data. At block comprise one or more of the following: (1) coding domain 101, a discovery set of samples can be selected. At block 102. non-synonymous variant; severity of gene lesion (BLOSUM nucleic acids (for example, RNA) can be extracted from the etc.); (3) gene congruence in >1 sample; (4) network or path discovery set of samples. At block 103, DNA sequencing can way congruence in >1 Sample; (5) functional plausibility; (6) be performed (for example, with 454/Roche pyrosequenc chromosomal location congruence with known quantitative ing). The DNA sequencing can result in the generation of trait loci; and (7) congruence with other data types (e.g., gene sequence reads. At block 104, the sequence reads can be or protein expression information). aligned to a reference database (for example, RefSeq with I0084. At block 208, the resulting nominated genes can be MegaBLAST). At block 105, potential variants can be iden validated by re-sequencing the nominated genes in "Discov tified for each sample in the discovery set (for example, ery' & independent “Validation” sample sets. At block 209, SNPs). At block 106, a first subset of rules (a first filter) can be the association of validated gene variants with component applied to identify candidate variants (for example, variants phenotypes can be examined that can be associated with a trait or disease). In this example, the first subset of rules can comprise one or more of the II. EXEMPLARY METHODS following: (1) present in >4 sequence reads; (2) present in 0085 Provided, and illustrated in FIG. 3, are methods of >30% reads (assumes frequency is at least heterozygous); (3) identifying elements associated with a trait, the methods can high quality score at variant base(s); (4) present in sequence comprise identifying the association of a relevant element reads in both orientations (5' to 3' and 3’to 5"); (5) confirm read with a relevant component phenotype of the trait at 301, alignment to reference sequence; and (6) exclude reference wherein the association of the relevant element with the sequence errors by alignment to a second reference database evant component phenotype identifies the relevant element as 0080. At block 107, a second subset of rules (a second an element associated with the trait, wherein the relevant filter) can be applied to the resulting candidate variants in component phenotype is a component phenotype having a order to prioritize the candidate variants and nominate can threshold value of severity, age of onset, specificity to the trait didate genes. In this example, the second Subset of rules can or disease, or a combination at 302, wherein the relevant comprise one or more of the following: (1) coding domain element is an element having a threshold value of importance non-synonymous variant; (2) severity of gene lesion (BLO of the element to homeostasis relevant to the trait, intensity of SUM etc.); (3) gene congruence in >1 sample; (4) network or the perturbation of the element, duration of the effect of the pathway congruence in >1 sample; (5) functional plausibility; element, or a combination at 303. It should be understood that (6) chromosomal location congruence with known quantita the method can include identification of one or multiple ele tive trait loci; and (7) congruence with other data types (e.g., ments, association of one or multiple elements with one or gene or protein expression or metabolite information). multiple traits, use of one or multiple elements, use of one or 0081. At block 108, the resulting nominated genes can be multiple component phenotype, use of one or more relevant validated by re-sequencing the nominated genes in "Discov elements, use of one or more relevant component phenotypes, ery' & independent “Validation” sample sets. At block 109, etc. Such single and multiple components can be used in any the association of validated gene variants with component combination. The model and methods described herein refer phenotypes can be examined to singular elements, traits, component phenotypes, relevant 0082 FIG. 2 illustrates another exemplary medical elements, relevant component phenotypes, etc. merely for sequencing method utilizing, for example, 454 pyrosequenc convenience and to aid understanding. The disclosed methods ing and indel variants in transcriptome sequence data. At can be practiced using any number of these components as block 201, a discovery set of samples can be selected. At can be useful and desired. block 202, nucleic acids (for example, RNA) can be extracted I0086 A trait can be, for example, a disease, a phenotype, from the discovery set of samples. At block 203, DNA a quantitative or qualitative trait, a disease outcome, a disease sequencing can be performed (for example, with 454/Roche Susceptibility, a combination thereof, and the like. As used pyrosequencing). The DNA sequencing can result in the gen herein in connection with the disclosed model and methods, eration of sequence reads. At block 204, the sequence reads trait refers to one or more characteristics of interest in a can be aligned to a reference database (for example, RefSeq, Subject, patient, pedigree, cohort, groups thereof and the like. with MegaBLAST). At block 205, potential variants can be Of particular interest as traits are phenotypes, features and US 2011/0098.193 A1 Apr. 28, 2011

groups of phenotypes and features that characterize, are 0091 Additionally, the association of the relevant element related to, and/or are indicative of diseases and conditions. with the relevant component phenotype can be identified by a Useful traits include single phenotypes, features and the like threshold value of the coincidence of the relevant element and and plural phenotypes, features and the like. A particularly the relevant component phenotype within a set of discovery useful trait is a component phenotype, such as a relevant samples. Threshold value of coincidence can refer to the component phenotype. coincidence (that is, correlation of occurrence/presence) of the element and the component phenotype. Such a coinci 0087. A relevant element can be an element that has a dence can be a basic observation of the disclosed method. The certain threshold significance/weight based on a plurality of significance of this coincidence is enhanced (relative to prior factors. The relevant element can be an element having a methods of associating elements to diseases) by the selection threshold value of for example, importance of the element to of relevant elements and relevant component phenotypes, homeostasis relevant to the trait, intensity of the perturbation based on the plurality of factors as discussed herein. of the element, duration of the effect of the element, or a 0092 Discovery samples can be any sample in which the combination. The relevant element can be, for example, an presence, absence and/or level or amount of an element can be element associated with one or more genetic elements asso assessed. Generally, a set of discovery samples can be ciated with the trait or disease. The one or more genetic selected to allow assessment of the coincidence of component elements can be derived from, for example, DNA sequence phenotypes with elements. For example, a set of discovery data, genetic linkage data, gene expression data, antisense samples can be selected or identified based on principles RNA data, microRNA data, proteomic data, metabolomic described in the disclosed model. The set of discovery data, a combination, and the like. The relevant element can be samples can comprise, for example, samples from a single a relevant genetic element. A relevant component phenotype individual, samples from a single pedigree, Samples from a (also referred to as an endophenotype) can be a component Subset of a single cohort, Samples from a single cohort, phenotype that has a certain threshold significance/weight samples from multiple individuals, samples from multiple based on one or a plurality of factors. The relevant component unrelated individuals, samples from multiple affected sib phenotype can be a component phenotype having a threshold pairs, samples from multiple pedigrees, a combination value of, for example, severity, age of onset, specificity to the thereof, and the like. The set of discovery samples can also trait or disease, or a combination. The relevant component comprise, for example, both affected Samples and unaffected phenotype can be a component phenotype associated with a samples, wherein affected Samples are samples associated network or pathway of interest. The relevant component phe with the relevant component phenotype, wherein unaffected notype can be a component phenotype specific to the network samples are samples not associated with the relevant compo or pathway of interest. nent phenotype. Samples associated with the relevant com 0088. The threshold value can be any useful value (rel ponent phenotype can be samples that exhibit, or that come evant to the parameter involved). The threshold value can be from cells, tissue, or individuals that exhibit, the relevant selected based on the principles described in the disclosed component phenotype. Samples unassociated with the rel model. In general, higher (more rigorous or exclusionary) evant component phenotype can be samples that do not thresholds can provide more significant associations. How exhibit, and that do not come from cells, tissue, or individuals ever, higher threshold values can also limit the number of that exhibit, the relevant component phenotype. The methods elements identified as associated with a trait, thus potentially can further comprise selecting a set of discovery samples, limiting the useful information generated by the disclosed wherein the set of discovery samples consist of samples from methods. Thus, a balance can be sought in setting threshold a single individual, Samples from a single pedigree, Samples values. The nature of a threshold value can depend on the from a Subset of a single cohort, or samples from a single factor or feature being assessed. Thus, for example, a thresh cohort. The relevant element can be selected from variant old value can be a quantitative value (where, for example, the genetic elements identified in the discovery samples. feature can be quantified) or a qualitative value. Such as a 0093. The threshold value of importance of the element to particular form of the feature, for example. homeostasis relevant to the trait or disease can be, for 0089. The disclosed model and methods provide more example, derived from the phenotype of knock-out, transgen accurate and broader-based identification of trait-associated esis, silencing or over-expression of the element in an animal elements by preferentially analyzing relevant component model or cell line; the phenotype of a genetic lesion in the phenotypes and relevant elements. Such relevant component element in a human or model inherited disorder; the pheno phenotypes and relevant elements have, according to the dis type of knock-out, transgenesis, silencing or over-expression closed model, more significance to traits of interest, Such as of an element related to the element in an animal model or cell diseases. By using relevant component phenotypes and rel line; the phenotype of a genetic lesion in an element related to evant elements, the disclosed model and methods reduce or the element in a human or model inherited disorder; knowl eliminate the confounding and obscuring effect less relevant edge of the function of the element in a related species, a phenotypes and elements have to a given trait. This allows combination, and the like. The element related to the element more, and more significant, trait associations to be identified. can be a gene family member or an element with sequence 0090. The association of the relevant element with the similarity to the element. relevant component phenotype can be identified by identify 0094. The threshold value of intensity of the perturbation ing the association of the relevant element with, for example, of the element can be, for example, derived from the type of a network or pathway associated with the relevant component element, the amount or level of the element, or a combination. phenotype. The network or pathway can be associated with The relevant element can be a relevant genetic element, the relevant component phenotype when the relevant compo wherein the type of element is a type of genetic variant, nent phenotype occurs or is affected when the network or wherein the type of genetic element is a regulatory variant, a pathway is altered. non-regulatory variant, a non-synonymous variant, a synony US 2011/0098.193 A1 Apr. 28, 2011

mous variant, a frameshift variant, a variant with a severity limited to, the RefSeq genome database, the transcriptome score at, above, or below a threshold value, a genetic rear database, the GENBANK database, a combination thereof, rangement, a copy number variant, a gene expression differ and the like. The variant genetic elements identified in the ence, an alternative splice isoform, a combination, and the discovery samples can be part of a catalog of variant genetic like. The relevant element can be a relevant genetic element, elements identified in a plurality of sets of discovery samples. wherein the amount or level of the element is the number of The variant genetic elements can be filtered to select candi copies of the relevant genetic element, the magnitude of date variant genetic elements, wherein the variant genetic expression of the genetic element, a combination, and the elements are filtered, for example, by selecting variant like. genetic elements that are present in a threshold number of 0.095 The element can be an environmental condition, and sequence reads, are present in a threshold percentage of the threshold value of duration of the effect of the element can sequence reads, are represented by a threshold read quality be derived, for example, from the duration of an environmen score at variant base(s), are present in sequence reads from in tal condition or the duration of exposure to an environmental a threshold number of strands, are aligned at a threshold level condition. to a reference sequence, are aligned at a threshold level to a 0096. The element can be a genetic element, and the second reference sequence, are variants that do not have bias threshold value of duration of the effect of the element can be ing features bases within a threshold number of nucleotides of derived from, for example, the duration of expression of the the variant, a combination thereof, and the like. genetic element, the expressivity of the genetic element, or a 0102 The candidate variant genetic elements can be pri combination. oritized to select relevant variant genetic elements, wherein 0097. The threshold value of severity of the component the candidate variant genetic elements are prioritized, for phenotype can be derived, for example, from the frequency of example, according to the presence in the candidate variant the component phenotype, the intensity of the component genetic element of a non-synonymous variant in a coding phenotype, the amount of a feature of the component pheno region, the presence of the candidate variant genetic element type, or a combination. in a plurality of samples, the presence of the candidate variant 0098. The threshold value of specificity to the trait or genetic element at a chromosomal location having a quanti disease of the component phenotype can be derived, for tative trait locus associated with the trait or disease, the sever example, from the frequency with which the component phe ity of the putative functional consequence that the candidate notype is present in other traits or diseases, the frequency with variant genetic element represents, association of the candi which the component phenotype is present in the trait or date variant genetic element with a network or pathway in a disease, or a combination. For example, the component phe plurality of samples, association of the candidate variant notype can be not present in other traits or diseases; the genetic element with a network or pathway with which one or component phenotype can be always present in the trait or more other candidate variant genetic elements are associated, disease; the component phenotype can be not present in other the plausibility or presence of a functional relationship traits or diseases and can always be present in the trait or between the candidate variant genetic element and the rel disease; and the like. evant component phenotype, a combination thereof, and the 0099 Embodiments of the methods can further comprise like. selecting an element as the relevant element by assessing, for (0103. The association of a relevant element with a relevant example, the value of importance of the element to homeo component phenotype of the trait or disease can be per Stasis relevant to the trait or disease, intensity of the pertur formed, for example, for a plurality of relevant elements, a bation of the element, duration of the effect of the element, or plurality of relevant component phenotypes of the trait or a combination and comparing the value to the threshold value. disease, or a plurality of relevant elements and a plurality of One skilled in the art recognizes that comparison of the value relevant component phenotypes of the trait or disease. to the threshold value can be successful if the threshold is 0104 Embodiments of the methods can further comprise exceeded or if the threshold is not exceeded. Success can validating the association of the relevant element with the depend upon what the value and the threshold value repre relevant component phenotype. Association of the relevant SentS. element with the relevant component phenotype can be vali 0100. The methods can further comprise selecting a com dated by assessing the association of the relevant element ponent phenotype as the relevant component phenotype by with the relevant component phenotype in one or more sets of assessing the value of clinical features of the phenotype, and validation samples, wherein the set of validation samples is comparing the value to the threshold value. The clinical fea different than the samples from which the relevant element tures of the phenotype can comprise, for example, the value of was selected. The set of validation samples can comprise severity, age of onset, duration, specificity to the phenotype, samples from a single individual, samples from a single pedi response to a treatment or a combination. The methods can gree, samples from a Subset of a single cohort, Samples from further comprise selecting a component phenotype as the a single cohort, samples from multiple individuals, Samples relevant component phenotype by assessing the value of labo from multiple unrelated individuals, samples from multiple ratory features of the phenotype, and comparing the value to affected sib-pairs, samples from multiple pedigrees, a com the threshold value. bination, and the like. 0101 The variant genetic elements can be identified, for 0105. Also disclosed herein are methods of identifying an example, by sequencing nucleic acids from the discovery inherited trait in a Subject, comprising collecting a biological samples and comparing the sequences to one or more refer sample from the Subject; counting sequence reads aligning to ence sequence databases. The comparison can involve, but is normal references; counting sequence reads aligning to not limited to, BLAST alignments, megaBLAST alignments, mutant references; and determining whether the Subject's GMAP alignments, BLAT alignments, a combination, and sample yields more reads aligning to the mutant references the like. The reference sequence database can be, but is not than to the normal references. The biological samples of the US 2011/0098.193 A1 Apr. 28, 2011

disclosed methods are samples that provide viable DNA for aligned at a threshold level to a reference sequence, (6) sequencing, and include, but are not limited to, Sources Such aligned at a threshold level to a second reference sequence, as blood and buccal Smears (7) variants that do not have biasing features bases within a 0106 Disclosed herein are methods of determining the threshold number of nucleotides of the variant, or (8) a com status of a subject with regard to one or more inherited traits bination thereof. comprising assaying a relevant element or elements from a 0110 DNA sequencing can be used to perform the dis sample from the individual, and comparing the values of the closed methods. Comparing the values of the relevant ele relevant element or elements to a reference set or sets. The ment or elements to a reference set of set involves, but is not status of the subject can be (1) unaffected and non-carrier of limited to, BLAST alignments, megaBLAST alignments, the inherited trait, (2) unaffected and carrier of the inherited GMAP alignments, BLAT alignments, or a combination trait, or (3) affected and carrier of the inherited trait. The trait thereof. The reference sequence database is, but not limited is a disease, a phenotype, a quantitative or qualitative trait, a to, the RefSeq genome database, the transcriptome database, disease outcome, or a disease Susceptibility, which disease the GENBANK database, or a combination thereof. In an includes, but is not limited to, a recessive disease. The dis aspect of the present invention, the reference sequence is closed methods can determine the status of 1 or more traits generated based on identified mutants. including, but not limited to, 5, 10, 15, 25, 30, 35, 40, 45,50, 0111. The methods disclosed herein exploit the observa 60, 70, 80,90, 100, 150, 200, 250, 300,350, 400, or 450 traits tion that any sequence, normal or otherwise, matches per from a biological sample. fectly with itself Instead of comparing sequence reads from a 0107. In an aspect of the present invention, the association patient to a general reference genome, the methods of the of the relevant element with the relevant trait is identified by present invention can create a library of sequences, each of a threshold value of the coincidence of the relevant element which is a perfect match to a known mutation. The library and the relevant trait within the sample. The relevant element includes the normal sequence at each mutation position. is a type of genetic variant, wherein the type of genetic ele Incoming sequence reads are compared to every sequence in ment is a regulatory variant, a non-regulatory variant, a non the library and the best matches are determined For a given synonymous variant, a synonymous variant, a frameshift vari mutation, a normal sequence read (i.e., one lacking the muta ant, a variant with a severity Score at, above, or below a tion) aligns best to the normal library sequence. A read having threshold value, a genetic rearrangement, a copy number the mutation aligns best to the mutant library sequence. This variant, a gene expression difference, an alternative splice approach avoids potential biases associated with aligning isoform, a deletion variant, an insertion variant, a transversion sequencing reads to non-exact matching reference sequences. variant, an inversion variant, or a combination thereof. In an The extent of such biases is variable and difficult to eliminate. aspect of the invention, the association of a relevant element 0112 Furthermore, since the Zygosity of a potential muta with a relevant component phenotype of the trait is performed tion is derived from the proportion of reads that contain a for (1) a plurality of relevant elements, (2) a plurality of putative mutation that align divided by the total number of relevant component phenotypes of the trait, or (3) a plurality reads aligning, Such biases can result in mischaracterization of relevant elements and a plurality of relevant component of the Zygosity of a mutation based on sequence analysis. In phenotypes of the trait. an extreme case, a mutation can be entirely missed. In the case 0108. In an aspect of the present invention, comparing the of copy number variants, the invention described herein cor values of the relevant element or elements is performed by rectly identifies the copy number. alignment of the DNA sequences to a reference set or sets of 0113 FIG. 14A shows the reference sequence (R) from a DNA sequences, wherein the reference sets of DNA normal segment of the human PLP1 gene on chromosome X. sequences contain both normal, unaffected DNA sequences FIG. 14B shows the alignment of the reference sequence (R) and mutated, variant DNA sequences. The mutated, variant and a sequence read from a normal chromosome (N). The DNA sequences include the plurality of known variant positions are identical. FIG. 14C shows the alignment for the sequences. The alignment of the DNA sequences to a refer reference sequence and a sequence read from a mutant chro ence set or sets of DNA can be performed under conditions mosome (M). By post-processing the output of the alignment requiring a perfect match between the sample and a member algorithm, the alignment indicates that there is a single mis of the reference set. In an aspect of the present invention, the match (a “C” in the reference sequence and a “T” in the status of the Subject is determined by measuring the ratio of mutant sequence). This represents the standard method by DNA sequences that match the normal, unaffected DNA which the art detects mutations. FIG. 14D shows the methods sequences and the mutated, variant DNA sequences. of the present invention, whereby a library of two references 0109. In the methods disclosed herein, the amount or level (Sequence 1 and Sequence 2) differing at the mutation posi of the element can be the number of copies of the relevant tion is used to detect the mutation. genetic element, the magnitude of expression of the genetic 0114. According to the methods disclosed herein, a element, or a combination thereof. In an aspect of the present sequence read is aligned to both references. The number of invention, the variant genetic elements identified in the dis mismatches between the read and each reference is recorded. covery samples are part of a catalog of variant genetic ele The smaller the number of mismatches, the better the align ments identified in a plurality of sets of discovery samples and ment. In a read with Zero errors, the alignment between a the variant genetic elements can be filtered to select candidate normal read and the normal reference has Zero mismatches. In variant genetic elements. Genetic elements are filtered by a read with Zero errors, the alignment between a mutant read selecting variant genetic elements that are (1) present in a and the mutant reference has Zero mismatches. By recording threshold number of sequence reads, (2) presentina threshold only the best alignment for a read (i.e., the alignment having percentage of sequence reads, (3) represented by a threshold fewest mismatches), each read aligns only once. In other read quality Score at variant base or bases, (4) present in words, mutant reads align to the mutant reference and normal sequence reads from in a threshold number of Strands, (5) reads align to the normal reference. US 2011/0098.193 A1 Apr. 28, 2011

0115 Sequences coming from an individual homozygous counting sequence reads aligning to mutant references; and for the normal nucleotide have all reads aligning to the normal determining a ratio of aligned reads, wherein if the ratio is reference. Sequences coming from an individual homozy greater than a first value the inherited trait is a homozygous gous for the mutant nucleotide have all reads aligning to the mutant, if the ratio is between a second value and a third value mutant reference. Sequences coming from a heterozygous the inherited trait is a heterozygous mutant, and if the ratio is individual have sequence read alignments distributed less than a fourth Value the inherited trait is a homozygous approximately equally between the mutant and normal refer wild-type. In an aspect, in the disclosed methods disclosed, ences. The basis of the carrier detection algorithm focuses on the first value can be 86%, the second value can be 18%, the the counting of sequence reads aligning to the normal refer third value can be 14%, and the fourth value can be 14%. ence and sequence reads aligning to the mutant reference. I0120 In an aspect, disclosed herein are methods of deter 0116. The present method is applicable to any type of mining a status of a Subject with regard to an inherited trait. mutation. A mutant reference sequence that is identical to the The disclosed methods can comprise assaying an element DNA from a mutant chromosome is generated. A mutant from a sample from a subject to determine a subject DNA reference sequence can be referred to as a custom reference. sequence; comparing the Subject DNA sequence to a set of For deletion mutants, generating a mutant reference sequence DNA sequences by alignment wherein the set of DNA is achieved by taking the DNA sequence on either side of the sequences comprises both normal, unaffected DNA deletion and making them into a continuous DNA sequence. sequences and mutated, variant DNA sequences; identifying For example, FIG. 15A shows the alignment between a nor the element as being associated with the inherited trait by the mal sequence of a segment of the human HPRT1 gene and a coincidence of the element and the trait within the sample by mutant sequence having a 17 base pair deletion. The mutant determining a ratio of the Subject DNA sequence that matches reference is created by joining the sequences flanking the normal, unaffected DNA sequences and the mutated variant deletion as indicated. This works for any size of deletion. DNA sequences. 0117 For insertion mutants, the approach for generating a 0.121. In the methods disclosed herein, the status can be mutant reference depends on the size of the insertion. For unaffected and non-carrier of the inherited trait and/or unaf example, when the insertion is smaller than the size of the fected and carrier of the inherited trait and/or affected and sequence read, the approach for generating a mutant refer carrier of the inherited trait. The status of a predetermined ence is identical to the approach used for generating a deletion number of inherited traits can be determined from a sample. mutant. FIG. 15B shows the alignment between a normal The predetermined number can be, for example, from about 1 sequence of a segment of the human ATP7A gene and a to about 5,000. In an aspect, the predetermined number can be mutant sequence having a 5bp insertion. When the insertion up to 500, up to 1000, up to 1500, and the like. is longer than the sequence read, a check for perfect align 0122. In an aspect, the sample can be a blood sample, ment of mutant reads at each border of the insertion occurs. A buccal Smear, saliva, urine, excretions, fecal matter, or tissue sequence read that occurs entirely within the insertion does biopsy. The sample can be any type of sample. The sample not reliably indicate that it is from the mutant. Because that can be formaldehyde fixed, paraffin embedded, Guthrie sequence read can be from a different location in the genome, cards, and the like. at least two custom references are generated. Each custom I0123. In an aspect, in the methods disclosed herein, the reference spans the border between the normal sequence and inherited trait can be a disease, a phenotype, a quantitative or the mutant insertion. Using the DNA from an individual qualitative trait, a disease outcome, a disease Susceptibility, a having the insertion, Some reads can be expected to align biomarker, or a syndrome. In an aspect, the inherited trait can perfectly to each custom reference. The normal reference be recessive, dominant, partially dominant, X-linked, com used in this situation is a segment of normal DNA that spans plex, co-dominant, or multi-factorial. the insertion point. FIG. 15C provides a schematic represen 0.124. In an aspect, the assay of the element can be per tation of the alignment of sequence reads to a normal refer formed by DNA sequencing. In an aspect, the element can be ence (top panel) and to an insertion mutant reference (bottom a genetic element, wherein the type of element can be a type panel). of genetic variant, wherein the type of genetic element can be 0118 Embodiments of the present invention consider the a regulatory variant, a non-regulatory variant, a non-synony introduction of sequencing errors. By setting the parameters mous variant, a synonymous variant, a frameshift variant, a of the alignment algorithm to accept no mismatches, a variant with a severity score at, above, or below a threshold sequence read containing an error is eliminated from further value, agenetic rearrangement, a copy number variant, a gene analysis and aligns to neither the normal or mutant reference. expression difference, an alternative splice isoform, a dele The rare cases when an error transforms the nucleotide at the tion variant, an insertion variant, a transversion variant, an mutation position from normal to mutant or vice versa is the inversion variant, a translocation, or a combination thereof. exception. Embodiments of the present invention detect such The mutated, variant DNA sequences can comprise a plural cases by considering the base quality scores. Bases in error ity of known variant sequences. The alignment can be per frequently have low quality Scores. Perfectly matching reads formed under conditions requiring a perfect match between with a nucleotide at the mutation position having a signifi the subject DNA sequence and a member of the reference set cantly lower quality score than the Surrounding nucleotides of DNA sequences. The element can be a genetic element, are considered Suspect. wherein an amount of the element is a number of copies of the 0119. In an aspect, disclosed herein are methods of iden genetic element, the magnitude of expression of the genetic tifying an inherited trait in a subject. These methods can element, or a combination thereof. Comparing the Subject comprise collecting a biological sample from the Subject DNA sequence to a set of DNA sequences by alignment can comprising a DNA sequence; aligning the DNA sequence to comprise one or more of BLAST alignments, megaBLAST normal reference sequences and mutant reference sequences; alignments, GMAP alignments, BLAT alignments, MAQ counting sequence reads aligning to normal references; alignments, gSNAP alignments, or a combination thereof. US 2011/0098.193 A1 Apr. 28, 2011

The reference set of DNA sequences can comprise one or This exemplary operating environment is only an example of more of the RefSeq genome database, the transcriptome data an operating environment and does not indicate limitation as base, the GENBANK database, or a combination thereof. to the scope of use or functionality of operating environment 0.125. The variant genetic elements can be filtered to select architecture. Neither should the operating environment be candidate variant genetic elements, wherein the variant interpreted as having any dependency or requirement relating genetic elements can be filtered by selecting variant genetic to any one or combination of components illustrated in the elements that are present in a threshold number of sequence exemplary operating environment. One skilled in the art reads, are present in a threshold percentage of sequence reads, appreciates that this is a functional description and that the are represented by a threshold read quality score at variant respective functions can be performed by software, hardware, base(s), are present in sequence reads from in a threshold or a combination of software and hardware. number of strands, are aligned at a threshold level to a refer 0.130. The present methods and systems can be operational ence sequence, are aligned at a threshold level to a second with numerous other general purpose or special purpose com reference sequence, are variants that do not have biasing puting system environments or configurations. Examples of features bases withina threshold number of nucleotides of the well known computing systems, environments, and/or con variant, or a combination thereof. figurations that can be suitable for use with the system and 0126. Also disclosed are systems for identifying an inher method comprise, but are not limited to, personal computers, ited trait in a Subject. The systems can comprise a memory; server computers, laptop devices, and multiprocessor sys and a processor, coupled to the memory, configured for, col tems. Additional examples comprise set top boxes, program lecting a biological sample from the Subject comprising a mable consumer electronics, network PCs, minicomputers, DNA sequence, aligning the DNA sequence to normal refer mainframe computers, distributed computing environments ence sequences and mutant reference sequences, counting that comprise any of the above systems or devices, and the sequence reads aligning to normal references, counting like. sequence reads aligning to mutant references, and determin I0131 Further, one skilled in the art appreciates that the ing a ratio of aligned reads, wherein if the ratio is greater than systems and methods disclosed herein can be implemented a first value the inherited trait is a homozygous mutant, if the via a general-purpose computing device in the form of a ratio is between a second value and a third value the inherited computer 401. The components of the computer 401 can trait is a heterozygous mutant, and if the ratio is less than a comprise, but are not limited to, one or more processors or fourth value the inherited trait is a homozygous wild-type. processing units 403, a system memory 412, and a system bus The first value can be 86%, the second value can be 18%, the 413 that couples various system components including the third value can be 14%, and the fourth value can be 14%. processor 403 to the system memory 412. Comparing aligning the DNA sequence to normal reference 0.132. The system bus 413 represents one or more of sev sequences and mutant reference sequences can comprise one eral possible types of bus structures, including a memory bus or more of BLAST alignments, megaBLAST alignments, or memory controller, a peripheral bus, an accelerated graph GMAP alignments, BLAT alignments, MAQ alignments, ics port, and a processor or local bus using any of a variety of gSNAP alignments, or a combination thereof. The normal bus architectures. By way of example, Such architectures can reference sequences and mutant reference sequences can comprise an Industry Standard Architecture (ISA) bus, a comprise one or more of the RefSeq genome database, the Micro Channel Architecture (MCA) bus, an Enhanced ISA transcriptome database, the GENBANK database, or a com (EISA) bus, a Video Electronics Standards Association bination thereof. (VESA) local bus, an Accelerated Graphics Port (AGP) bus, 0127. In the methods disclosed herein, the parameters of and a Peripheral Component Interconnects (PCI) bus also the alignment algorithm can be set to accept a specified num known as a Mezzanine bus. The bus 413, and all buses speci ber of mismatches. With one allowed mismatch, a mutant fied in this description can also be implemented over a wired read containing a sequencing error has one mismatch com or wireless network connection and each of the Subsystems, pared to the mutant reference and two mismatches compared including the processor 403, a mass storage device 404, an to the normal reference. It aligns best to the mutant reference. operating system 405, analysis software 406, MRS data 407, The same argument applies to relaxation of the parameters to a network adapter 408, system memory 412, an Input/Output allow 2 or more mismatches. Interface 410, a display adapter 409, a display device 411, and 0128. Although the disclosed model and methods include a human machine interface 402, can be contained within one the use of new traits, phenotypes, elements and the like, the or more remote computing devices 414a,b,c at physically disclosed model and methods also represent a new use of the separate locations, connected through buses of this form, in many traits, phenotypes, elements and the like that are known effect implementing a fully distributed system. and used in genetic and disease analysis. The disclosed model I0133. The computer 401 typically comprises a variety of and methods use these traits, phenotypes, elements and the computer readable media. Exemplary readable media can be like in selective and weighted ways as describe herein. Those any available media that is accessible by the computer 401 of skill in the art are aware of many traits, phenotypes, ele and comprises, for example and not meant to be limiting, both ments and the like as well as methods and techniques of their Volatile and non-volatile media, removable and non-remov detection, measurement, assessment. Such traits, phenotypes, able media. The system memory 412 comprises computer elements, methods and techniques can be used with the dis readable media in the form of Volatile memory, Such as ran closed model and methods based on the principles and dom access memory (RAM), and/or non-volatile memory, description herein and Such use is specifically contemplated. such as read only memory (ROM). The system memory 412 typically contains data such as MRS data 407 and/or program III. EXEMPLARY SYSTEMS modules such as operating system 405 and analysis Software 0129 FIG. 4 is a block diagram illustrating an exemplary 406 that are immediately accessible to and/or are presently operating environment for performing the disclosed methods. operated on by the processing unit 403. US 2011/0098.193 A1 Apr. 28, 2011

0134. In another aspect, the computer 401 can also com tional and commonplace in offices, enterprise-wide computer prise other removable/non-removable, volatile/non-volatile networks, intranets, and the Internet 415. computer storage media. By way of example, FIG. 4 illus 0.139. The processing of the disclosed methods and sys trates a mass storage device 404 which can provide non tems can be performed by software components. The dis Volatile storage of computer code, computer readable instruc closed system and method can be described in the general tions, data structures, program modules, and other data for the context of computer-executable instructions, such as program computer 401. For example and not meant to be limiting, a modules, being executed by one or more computers or other mass storage device 404 can be a hard disk, a removable devices. Generally, program modules comprise computer magnetic disk, a removable optical disk, magnetic cassettes code, routines, programs, objects, components, data struc or other magnetic storage devices, flash memory cards, CD tures, etc. that perform particular tasks or implement particu ROM, digital versatile disks (DVD) or other optical storage, lar abstract data types. The disclosed method can also be random access memories (RAM), read only memories practiced in grid-based and distributed computing environ (ROM), electrically erasable programmable read-only ments where tasks are performed by remote processing memory (EEPROM), and the like. devices that are linked through a communications network. In 0135 Optionally, any number of program modules can be a distributed computing environment, program modules can stored on the mass storage device 404, including by way of be located in both local and remote computer storage media example, an operating system 405 and analysis Software 406. including memory storage devices. Each of the operating system 405 and analysis software 406 0140. In one aspect, the methods can be implemented in a (or Some combination thereof) can comprise elements of the Software system that can utilize data management services, an programming and the analysis software 406. MRS data 407 analysis pipeline, and internet-accessible Software for variant can also be stored on the mass storage device 404. MRS data discovery and analysis for ultra-high throughput, next gen 407 can be stored in any of one or more databases known in eration medical re-sequencing (MRS) data with minimal the art. Examples of such databases comprise, DB2R, human manipulation. The Software system cyberinfrastruc Microsoft(R) Access, Microsoft(R) SQL Server, Oracle(R, ture can use an n-tiered architecture design, with a relational mySQL, PostgreSQL, and the like. The databases can be database, middleware and a web server. The data manage centralized or distributed across multiple systems. ment services can include organizing reads into a searchable 0136. In another aspect, the user can enter commands and database, secure access and backups, and data dissemination information into the computer 401 via an input device (not to communities over the internet. The automatic analysis shown). Examples of such input devices comprise, but are not pipeline can be based on pair-wise megaBLAST or GMAP limited to, a keyboard, pointing device (e.g., a “mouse'), a alignments and an Enumeration and Characterization module microphone, a joystick, a scanner, tactile input devices Such designed for identification and characterization of variants. as gloves, and other body coverings, and the like These and The variant pipeline can be agnostic as to the read type or the other input devices can be connected to the processing unit sequence library searched, including RefSeq genome and 403 via a human machine interface 402 that is coupled to the transcriptome databases. system bus 413, but can be connected by other interface and 0141 Data, analysis and results can be delivered to the bus structures, such as a parallel port, game port, an IEEE community using an application server provider implemen 1394 Port (also known as a Firewire port), a serial port, or a tation, eliminating the need for client-side Support of the universal serial bus (USB). Software. Dynamic queries and visualization of read data, 0.137 In yet another aspect, a display device 411 can also variant data and results can be provided with a user interface. be connected to the system bus 413 via an interface, such as a The software system can report, for example, sSNPs, nsSNPs, display adapter 409. It is contemplated that the computer 401 indels, premature stop codons, and splice isoforms. Read can have more than one display adapter 409 and the computer coverage statistics can be reported by gene or transcript, 401 can have more than one display device 411. For example, together with a visualization module based upon an indi a display device can be a monitor, an LCD (Liquid Crystal vidual transcript or genomic segment. As needed, data access Display), or a projector. In addition to the display device 411, can be restricted using security procedures including pass other output peripheral devices can comprise components word protection and HTTPS protocols. Such as speakers (not shown) and a printer (not shown) which 0142. In an aspect, reads can be received in, for example, can be connected to the computer 401 via Input/Output Inter FASTA format with associated quality score numbers. For face 410. Any step and/or result of the methods disclosed can example, 454 quality scores can be Supplied in “pseudo be output in any form known in the art to any output device phred' format (FASTA format with space delimited base 10 (such as a display, printer, speakers, etc.) known in the art. ASCII representations of integers in lieu of base pairs). The 0.138. The computer 401 can operate in a networked envi FASTA headers contain metadata for the sequence including ronment using logical connections to one or more remote an identifier and sample-specific information. The concept of computing devices 414a, b, c. By way of example, a remote a sample can be equivalent to an individual run or a specific computing device can be a personal computer, portable com sample. Data inputs (sequences, lengths and quality scores) puter, a server, a router, a network computer, a peer device or can automatically be parsed and loaded into a single other common network node, and so on. Logical connections tional database table linked to a representation of the sample. between the computer 401 and a remote computing device 0143. In one aspect, the Software system can generate 414a,b,c can be made via a local area network (LAN) and a alignments to the NCBI human genome and RefSeq tran general wide area network (WAN). Such network connec script libraries, which includes both experimentally-verified tions can be through a network adapter 408. A network (NM and NRaccessions) and computationally predicted tran adapter 408 can be implemented in both wired and wireless Scripts (XM and XR accessions). Reference sequence data, environments. Such networking environments are conven location based feature information (e.g. CDS annotations, US 2011/0098.193 A1 Apr. 28, 2011

variation records) and basic feature metadata imported and 0.148. The Details 506 page can present a tabular view of stored in an application specific schema. all gene segment or transcript associated Sequence Reads 0144. In a further aspect, reads and quality data can be 503, pair wise Alignments 505 and a comprehensive read imported and aligned pairwise to sequence libraries using, for overview (Pileup View 504) grouped by case or sample. It can example, MegaBLAST or GMAP. MegaBLAST alignment also provide a table of all variants in cases grouped into SNP, parameters can be adapted from those used to map SNPs to indel and splice variant. For each identified variant, there can the human genome: wordsize can be 14: identity count can be be drill-down links to relevant Sequence Reads 503 and pair >35; expect value filter can be e-10; and low-complexity wise BLAST or GMAP-generated Alignments 505. 0149. The Pileup View 504 is further illustrated in FIG. 7. sequence can not be allowed to seed alignments, but align The Pileup View 504 can display reads from a single sample ments can be allowed to extend through such regions. GMAP aligned against a transcript or genomic segment, along with parameters can be: identity count can be >35 and identity can all nucleotide variants detected in those reads. FIG. 7 illus be >95%. The best-match alignments for reads can be trates the identification of a coding domain (CD) SNP in the imported into the database. All alignments equivalent in qual C. Subunit of the Guanine nucleotide-binding stimulatory pro ity to the best match can be accepted (as in the case of hits to tein (GNAS) using the disclosed methods. GNAS is a schizo shared exons in splice variants). phrenia candidate gene, with a complex imprinted expression 0145 All positions at which a read differs from the aligned pattern, giving rise to maternally, paternally, and biallelically reference sequence can be enumerated. Contiguous indel expressed transcripts that are derived from four alternative events can be treated as single polymorphisms. All occur promoters and 5' exons. The 1884 by GNAS transcript, rences of potential polymorphisms in reads with respect to a NM 080426.1, is indicated by a horizontal line, oriented given position can be unified as a “single polymorphism.” from 5' to 3', from left to right), along with its associated CD with associated Statistics on frequency, alignment quality, (in green). Three hundred and ninety four 454 reads from base quality, and other attributes that can be used to assess the sample 1437 are displayed as arrows aligned against likelihood that the polymorphism is a true variant. Candidate NM 080426.1 whose direction reflects their orientation with variants can be further characterized by type (SNP, indel, respect to the transcript. Variants found in individual reads are splice isoform, stop codon) and as synonymous variant (SV) displayed by hash marks at their relative position on the read. or non-synonymous variant (ns'V). Variants are characterized as synonymous SNPs (SSNPs, 0146 A web-based, user interface can be used to allow blue), nsSNPs (red) and deletions or insertions (black) with data navigation and viewing using a wide variety of paths and respect to individual sequence read alignments. The left panel filters. FIG. 5 illustrates an exemplary web-based navigation displays all putative variants. The right displays variants fil map. Several user-driven query and reporting functions can tered to retain those present in –4 reads, in 30% of reads be implemented. Users can search based upon a gene name or aligned at that position, and in bidirectional reads. One sSNP symbol and view their associated reads. Users can also search (C398T) was retained that was present in seven of thirteen based upon all genes that meet selectable read coverage, reads aligned at that position in sample 1437, nine of eighteen variant frequency, or variant type criteria. FIG. 6 provides an reads in sample 1438 and twenty of twenty-one reads in 1439. exemplary sequence query interface. Alternatively, a list of C398T is validated (dbSNP number rs7121), and the candidate genes, Supplied prospectively, can be used as an homozygous 398T allele has shown association with deficit entry point into the results. Resultant data can be further Schizophrenia. filtered by case, sample or associated read count. Users can 0150. In one aspect, the analysis software 406 can imple search a sample or set of samples. Users can specify the ment any of the methods disclosed. For example, the analysis alignment algorithm and reference database from drop down software 406 can implement a method for determining a lists. The result of the query can be a sortable Candidate Gene candidate biological molecule variant comprising receiving Report 501 table that features, for example, gene symbol biological molecule sequence data, annotating the biological (linked to Gene Detail 502 page), gene description, the tran molecule sequence data wherein the step of annotating results Scripts or genome segments associated with the gene, in identification of a plurality of biological molecules, deter sequencing read count total for all matches, and chromosome mining if the at least one of the plurality of biological mol location. List results can be exportable to Excel and in XML ecules is a potential biological molecule variant of a known and PDF formats. biological molecule, filtering the biological molecule 0147 Once a gene of interest has been selected, the user sequence data to determine if the determined potential bio can have access to a detailed gene information page. This logical molecule variant is a candidate biological molecule page can present gene-centric information, for example, Syn variant, prioritizing the candidate biological molecule Vari onyms, chromosome position and links to cytogenetic maps, ants, and presenting a list of the plurality of the candidate disease association and transcript details at NCBI. For each biological molecule variants. gene, the gene information page can also display the associ 0151. In another aspect, the analysis software 406 can ated transcripts, genomic segments, reads and variants implement a method for determining an association between grouped by case or sample. LinkS can be made available to a biological molecule variant and a component phenotype views of Sequence Reads 503 and the Pileup View 504. The comprising receiving biological molecule sequence data Sequence Reads 503 page can present a textual display of all comprising a plurality of biological molecule variants, deter annotated reads (with read identifier, length and average qual mining a homeostatic effect for at least one of the plurality of ity score) by case number along with the transcript name to biological molecule variants, determining an intensity of per which they map (linked to Alignments 505). In Alignments turbation for the at least one of the plurality of biological 505, each nucleotide in the read can be color coded with the molecule variants, determining a duration of effect for the at base quality Score to enable facile Scanning of overall and least one of the plurality of biological molecule variants, position-specific read quality. compiling the at least one of the plurality of biological mol US 2011/0098.193 A1 Apr. 28, 2011

ecule variants into at least one biological pathway based on 0156 Also disclosed is a recombinant or synthetic the homeostatic effect, the intensity of perturbation, and the polypeptide for the manufacture of reagents for use as thera duration of effect, determining if the at least one biological peutic agents in the treatment of Schizophrenia and/or affec pathway is associated with the component phenotype, and tive psychosis. In particular, disclosed are pharmaceutical presenting a list comprising the plurality of biological mol compositions comprising the recombinant or synthetic ecule variants in the at least one biological pathway associ polypeptide together with a pharmaceutically acceptable car ated with the component phenotype. rier therefor. 0152 For purposes of illustration, application programs 0157 Also disclosed is a method of diagnosing schizo and other executable program components such as the oper phrenia and/or affective psychosis or Susceptibility to Schizo ating system 405 are illustrated herein as discrete blocks, phrenia and/or affective psychosis in an individual or Subject, although it is recognized that such programs and components wherein the method comprises determining if one or more of reside at various times in different storage components of the the GPX, GSPT1 and TKT genes in the individual or subject computing device 401, and are executed by the data processor contains a genetic variation. The genetic variation can be a (s) of the computer. An implementation of analysis Software genetic variation identified as associated with Schizophrenia, 406 can be stored on or transmitted across some form of affective psychosis disorder or both. computer readable media. Computer readable media can be 0158. The methods which can be employed to detect any available media that can be accessed by a computer. By genetic variations are well known to those of skill in the art way of example and not meant to be limiting, computer read and can be detected for example using PCR or in hybridiza able media can comprise "computer storage media.” “Com tion studies using Suitable probes that are designed to span an puter storage media' comprise Volatile and non-volatile, identified mutation site in one or more of the GPX, GSPT1 removable and non-removable media implemented in any and TKT genes, such as the mutations described herein. method or technology for storage of information Such as 0159. Once a particular polymorphism or mutation has computer readable instructions, data structures, program been identified it is possible to determine a particular course modules, or other data. Exemplary computer storage media of treatment. For example the GPX, GSPT1 and TKT genes comprises, but is not limited to, RAM, ROM, EEPROM, flash are implicated in brain glutathione levels. Thus, treatments to memory or other memory technology, CD-ROM, digital ver change brain glutathione levels are contemplated for indi satile disks (DVD) or other optical storage, magnetic cas viduals or subjects determined to have a genetic variation in settes, magnetic tape, magnetic disk storage or other mag one or more of the GPX, GSPT1 and TKT genes. netic storage devices, or any other medium which can be used 0160 Mutations in the gene sequence or controlling ele to store the desired information and which can be accessed by ments of a gene, e.g., the promoter, the enhancer, or both can a computer. have subtle effects such as affecting mRNA splicing, stability, 0153. The methods and systems can employ Artificial activity, and/or control of gene expression levels, which can Intelligence techniques such as machine learning and itera also be determined Also the relative levels of RNA can be tive learning. Examples of Such techniques include, but are determined using for example hybridization or quantitative not limited to, expert Systems, case based reasoning, Baye PCR as a means to determine if the one or more of the GPX, sian networks, behavior based AI, neural networks, fuZZy GSPT1 and TKT genes has been mutated or disrupted. systems, evolutionary computation (e.g., genetic algorithms), 0.161 Moreover the presence and/or levels of one or more Swarm intelligence (e.g., ant algorithms), and hybrid intelli of the GPX, GSPT1 and TKT gene products themselves can gent systems (e.g., Expert inference rules generated through a be assayed by immunological techniques such as radioimmu neural network or production rules from statistical learning). noassay, Western blotting and ELISA using specific antibod ies raised against the gene products. Also disclosed are anti IV. SCHIZOPHRENIA-ASSOCIATED GENES bodies specific for one or more of the GPX, GSPT1 and TKT 0154 Schizophrenia and Bipolar Affective Disorder are gene products and uses thereof in diagnosis and/or therapy. common and debilitating psychiatric disorders. Despite a 0162 Also disclosed are antibodies specific to the dis wealth of information on the epidemiology, neuroanatomy closed GPX, GSPT1 and TKT polypeptides or epitopes and pharmacology of the illness, it is uncertain what molecu thereof. Production and purification of antibodies specific to lar pathways are involved and how impairments in these an antigen is a matter of ordinary skill, and the methods to be affect brain development and neuronal function. Despite an used are clear to those skilled in the art. The term antibodies estimated heritability of 60-80%, very little is known about can include, but is not limited to polyclonal antibodies, mono the number or identity of genes involved in these psychoses. clonal antibodies (mAbs), humanised or chimeric antibodies, Although there has been recent progress in linkage and asso single chain antibodies, Fab fragments, F(ab')2 fragments, ciation studies, especially from genome-wide scans, these fragments produced by a Fab expression library, anti-idio studies have yet to progress from the identification of suscep typic (anti-Id) antibodies, and epitope binding fragments of tibility loci or candidate genes to the full characterization of any of the above. Such antibodies can be used in modulating disease-causing genes (Berrettini, 2000). the expression or activity of the particular polypeptide, or in (O155 Disclosed are the GPX, GSPT1 and TKT genes, detecting said polypeptide in vivo or in vitro. polynucleotide fragments comprising one or more of GPX, 0163. Using the sequences disclosed herein, it is possible GSPT1 and TKT genes or a fragment, derivative or homo to identify related sequences in other animals, such as mam logue thereof, the gene products of the GPX, GSPT1 and TKT mals, with the intention of providing an animal model for genes, polypeptide fragments comprising one or more of the psychiatric disorders associated with the improper function gene product of the GPX, GSPT1 and TKT genes or a frag ing of the disclosed nucleotide sequences and proteins. Once ment, derivative or homologue thereof. It has been discovered identified, the homologous sequences can be manipulated in that genetic variations in the GPX, GSPT1 and TKT genes are several ways known to the skilled person in order to alter the associated with Schizophrenia. functionality of the nucleotide sequences and proteins US 2011/0098.193 A1 Apr. 28, 2011 homologous to the disclosed nucleotide sequences and pro bind to the protein encoded by said polynucleotide fragment teins. For example, “knock-out' animals can be created, that from step (a); (e) establishing whether a ligand has bound to is, the expression of the genes comprising the nucleotide the expressed protein; and (f) optionally isolating and identi sequences homologous to the disclosed nucleotide sequences fying the ligand. As a preferred way of detecting the binding and proteins can be reduced or Substantially eliminated in of the ligand to the expressed protein, also signal transduction order to determine the effects of reducing or substantially capacity can be measured. eliminating the expression of Such genes. Alternatively, ani mals can be created where the expression of the nucleotide 0169 Compounds which activate or inhibit the function of sequences and proteins homologous to the disclosed nucle one or more of the GPX, GSPT1 and TKT gene products can otide sequences and proteins are upregulated, that is, the be employed in therapeutic treatments to activate or inhibit expression of the genes comprising the nucleotide sequences the disclosed polypeptides. homologous to the disclosed nucleotide sequences and pro 0170 Schizophrenia and/or affective psychosis as used teins can be increased in order to determine the effects of herein relates to schizophrenia, as well as other affective increasing the expression of these genes. In addition to these psychoses such as those listed in “The ICD-10 Classification manipulations substitutions, deletions and additions can be of Mental and Behavioural Disorders’ World Health Organi made to the nucleotide sequences encoding the proteins zation, Geneva 1992. Categories F20 to F29 inclusive homologous to the disclosed nucleotide sequences and pro includes Schizophrenia, Schizotypal and delusional disor teins in order to effect changes in the activity of the proteins ders. Categories F30 to F39 inclusive are Mood (affective) to help elucidate the function of domains, amino acids, etc. in disorders that include bipolar affective disorder and depres the proteins. Furthermore, the disclosed sequences can also sive disorder. Mental Retardation is coded F70 to F79 inclu be used to transform animals to the manner described above. sive. The Diagnostic and Statistical Manual of Mental Disor The manipulations described above can also be used to create an animal model of schizophrenia and/or affective psychosis ders, Fourth Edition (DSM-IV). American Psychiatric associated with the improper functioning of the disclosed Association, Washington D.C. 1994. nucleotide sequences and/or proteins in order to evaluate 0171 "Polynucleotide fragment” as used herein refers to a potential agents which can be effective for combating psy chain of nucleotides such as deoxyribose nucleic acid (DNA) chotic disorders, such as Schizophrenia and/or affective psy and transcription products thereof, such as RNA. The poly chosis. nucleotide fragment can be isolated in the sense that it is (0164. Thus, also disclosed are screens for identifying substantially free of biological material with which the whole agents Suitable for preventing and/or treating Schizophrenia genome is normally associated in vivo. The isolated poly and/or affective psychosis associated with disruption or alter nucleotide fragment can be cloned to provide a recombinant ation in the expression of one or more of the GPX, GSPT1 and molecule comprising the polynucleotide fragment. Thus, TKT genes and/or its gene products. Such screens can easily "polynucleotide fragment includes double and single be adapted to be used for the high throughput Screening of stranded DNA, RNA and polynucleotide sequences derived libraries of compounds such as Synthetic, natural or combi therefrom, for example, Subsequences of said fragment and natorial compound libraries. which are of any desirable length. Where a nucleic acid is 0.165 Thus, one or more of the GPX, GSPT1 and TKT single stranded then both a given Strand and a sequence or gene products can be used for the in vivo or in vitro identifi reverse complementary thereto is contemplated. cation of novel ligands or analogs thereof. For this purpose 0172. In general, the term “expression product' or “gene binding studies can be performed with cells transformed with the disclosed nucleotide fragments or an expression vector product” refers to both transcription and translation products comprising a disclosed polynucleotide fragment, said cells of said polynucleotide fragments. When the expression or expressing one or more of the GPX, GSPT1 and TKT gene gene product is a “polypeptide' (i.e., a chain or sequence of products. amino acids displaying a biological activity Substantially (0166 Alternatively also one or more of the GPX, GSPT1 similar (e.g., 98%. 95%, 90%, 80%, 75% activity) to the and TKT gene products as well as ligand-binding domains biological activity of the protein), it does not refer to a specific thereof can be used in an assay for the identification of func length of the product as such. Thus, it should be appreciated tional ligands or analogs for one or more of the GPX, GSPT1 that "polypeptide' encompasses interalia peptides, polypep and TKT gene products. tides and proteins. The polypeptide can be modified in vivo 0167 Methods to determine binding to expressed gene and in vitro, for example by glycosylation, amidation, car products as well as in vitro and in vivo assays to determine boxylation, phosphorylation and/or post-translational cleav biological activity of gene products are well known. In gen age. eral, expressed gene product is contacted with the compound to be tested and binding, stimulation or inhibition of a func V. EXAMPLES tional response is measured. 0168 Thus, also disclosed is a method for identifying 0173 The following examples are put forth so as to pro ligands for one or more of the GPX, GSPT1 and TKT gene vide those of ordinary skill in the art with a complete disclo products, said method comprising the steps of: (a) introduc Sure and description of how the compounds, compositions, ing into a Suitable host cell a polynucleotide fragment one or articles, devices and/or methods claimed herein are made and more of the GPX, GSPT1 and TKT gene products; (b) cul evaluated, and are intended to be purely exemplary and are turing cells under conditions to allow expression of the poly not intended to limit the scope of the methods and systems. nucleotide fragment; (c) optionally isolating the expression Efforts have been made to ensure accuracy with respect to product; (d) bringing the expression product (or the host cell numbers (e.g., amounts, temperature, etc.), but there can be from step (b)) into contact with potential ligands which can an accounting of errors and deviations. Unless indicated oth US 2011/0098.193 A1 Apr. 28, 2011

erwise, parts are parts by weight, temperature is in C. or is at mens of four mesotheliomas (MPM) and two controls. A ambient temperature, and pressure is at or near atmospheric. bioinformatic pipeline was used to identify candidate causal mutations, namely non-synonymous variants (nsSNPs), in A. Mendelian Disorders tumor-expressed genes. Of ~15,000 annotated (RefSeq) genes evaluated in each specimen, 66 genes with previously 0174 The disclosed model notes that: undescribed insSNPs were identified in MPM tumors. g(E1) h(Cpl. ,Sv1,...A.) Genomic resequencing of 19 of these nsSNPs revealed 15 to be germline variants and 4 to represent loss of heterozygosity 0175 For Mendelian disorders, there is typically a single (LOH) in MPM. Resequencing of these 4 genes in 49 addi value for E (the causal gene), H (the impact of the causal gene tional MPM surgical specimens identified one gene (MPM1), on relevant homeostasis), t (the time at which the causal gene that exhibited LOH in a second MPM tumor. No overlap was is expressed) and Cp (a pathognomonic phenotype). observed in other genes with nsSNPs or LOH among MPM tumors. This study agrees with the model described herein, namely that in complex diseases, there is insufficient homo geneity of causal elements among affected individuals to Therefore, for a Mendelian disorder in an individual patient, enable detection of statistical differences. variation in the value of I (the specific variant in the causal gene) determines the value of Sv (phenotype severity) and A D. Schizophrenia (age of onset). This is in agreement with most evidence in Mendelian disorders. For example, the magnitude of triplet i. Example 1 repeat expansions generally is associated with severity and 0180 Medical sequencing was performed on three related age of onset of symptoms. individuals with Schizophrenia, multiple expressed genes were identified with variants in each affected individual. B. Hypertension Schizophrenia is a “complex' disorder in which inherited (0177. Multiple, rare families that exhibited Mendelian elements are believed to be a significant factor. Previous segregation of the phenotype (Cp) of severe hypertension studies have identified some inherited elements but the most were studied to identify single gene mutations (E) that result common, important contributors remain unknown. The dis in a phenotype indistinguishable from that of a common, parate genes (E) identified in affected individuals were found complex disorder—namely hypertension. The majority of the to converge into several discrete pathways (P) that are disor individual genes that had mutations (E) and resulted in the dered in schizophrenia. For example, in the affected proband, hypertension phenotype can be collapsed into a single meta a male Caucasian of Jewish ethnicity, 621341 sequence reads bolic pathway (P). Thus, these studies agree with the model were identified that matched to 15530 genes, non-synony described herein, namely the convergence of distinct Ele mous single nucleotide polymorphisms in the genes glu ments (E) Into Networks and Pathways (P) in causality of tathione peroxidase 1 (GPX1) and glutathione S-transferase common, complex disorders. pi (GSTP1). These amino-acid changes were also identified in the other two, related individuals with schizophrenia. Thus, C. Cancer Some non-synonymous variants in patients with Schizophre nia converge into the glutathione metabolism pathway. 0.178 Recently, researchers undertook medical sequenc 0181. These studies of schizophrenia also exemplified the ing of 13,023 genes in 11 breast and 11 colorectal cancers. concept of Cp, and especially molecular Cp that are Suggested The study revealed that individual tumors accumulate an by the E identified in affected individuals, being informative. average of ~90 mutant genes but that only a subset of these For example, glutathione (GSH) is converted to oxidized contribute to the neoplastic process. Using criteria to delin glutathione (GSSG) through glutathione peroxidase (GPX), eate this subset, the researchers identified 189 genes (11 per tumor) that were mutated at significant frequency. The major and converted back to GSH by glutathione reductase (GR). ity of these genes were not known to be genetically altered in Measurements of GSH, GSSG, GPx and GR in the caudate tumors and were predicted to affect a wide range of cellular region of postmortem brains from Schizophrenic patients and functions, including transcription, adhesion, and invasion. control subjects (with and without other psychiatric disor This study agrees with the model described herein, namely ders) represent molecular Cp that would be of benefit to seek that in complex diseases, there is insufficient homogeneity of associations with variants in GPX1 and GSTP1 candidate causal elements among affected individuals to enable detec genes. For example, significantly lower levels of GSH, GPx, tion of statistical differences. The disclosed model notes that and GR were found in Schizophrenic group than in control there exists significant genetic and environmental heteroge groups without any psychiatric disorders. Concomitantly, a neity in complex diseases. Thus the specific combinations of decreased GSH:GSSG ratio was also found in schizophrenic genetic and environmental elements that cause D vary widely group. Moreover, both GSSG and GR levels were signifi among the affected individuals in a cohort. In agreement with cantly and inversely correlated to age of Schizophrenic this study, experimental designs based upon comparisons of patients, but not control Subjects. candidate variant allele frequencies between affected and unaffected cohorts, even if undertaken on a large scale, fail to ii. Example 2 disclose causal variants in situations where there is a high degree of heterogeneity among individuals in causal ele 0182. Three lymphoblastoid, two lung and four lung can mentS. cer RNA samples were sequenced with 454 technology. The 0179 Another study showed similar findings. Compre disclosed methods were used to comprehensively catalog hensive, shotgun sequencing of tumor transcriptomes of Sur nsV. 350 ug of total RNA was isolated from Epstein-Barr gical specimens from individual mesothelioma tumors, an virus-transformed lymphoblastoid cell lines from a schizo environmentally-induced cancer, was performed. High phrenia pedigree (from the NIGMS Cell Repository panel, throughput pyrosequencing was used to generate 1.6 giga Coriell Institute for Medical Research, Camden, N.J.) and 6 bases of transcriptome sequence from enriched tumor speci lung Surgical specimens. The proband had schizophrenia with US 2011/0098.193 A1 Apr. 28, 2011 primarily negative clinical features (Table 1). His father had polymorphisms and sequence errors, and without using major depression. His sister had anorexia nervosa and schiz probabilistic splice site models. GMAP features a minimal oid personality disorder. The mother (not studied) was not sampling strategy for genomic mapping, oligomer chaining affected. for approximate alignment, Sandwich DP for splice site detec

TABLE 1. Family 176 B Lymphoblastoid Cell Line Characteristics Sample 1437 Sample 1438 Sample 1439 Repository # GMO 1488 GMO 1489 GMO1490 db SNP number 10411 10412 10413 Age 23 YR 55YR 27 YR Gender Male Male Female Race Caucasian Caucasian Caucasian Ethnicity Jewish Jewish Jewish Relation Proband affected father affected sister Symptoms, paralogical thinking: 3 episodes of anorexia nervosa History affective shielding, depression; ECT: since adolescence; splitting of affect from no hypomania more schizoid than content; Suspiciousness; depressed onset age 15; hospitalized ISCN 46, XY n.d. n.d. HLA type Aw26, B16Aw26, B16 Aw26, B16A18, B- Aw26, B16A2, B35

0183 Poly-A+ RNA was prepared using oligo(dT) mag tion, and microexon identification. These features are particu netic beads (PureBiotech, Middlesex, N.J.), and 1st-strand larly useful for alignments of short reads with relatively high cDNA prepared from 5-8 ug of poly(A)-- RNA with 200 pmol base calling error rates. GMAP was also anticipated to be oligo(dT)25 V (V=A, C or G) using 300 U of Superscript II useful in identifying novel splice variants. Synasearch reverse transcriptase (Invitrogen). Second-strand synthesis (Synamatix, Kuala Lumpur, Malaysia) is a novel, rapid alig was performed at 16° C. for 2 h after addition of 10 U of E. ment method. coli DNA ligase, 40 U of E. coli DNA polymerase, and 2U of RNase H (all from Invitrogen). T4 DNA polymerase (5 U) 0185. Computationally, SynaSearch and MegaBLAST was added and incubated for 5 min at 16° C. cDNA was were most efficient in transcript alignments, whereas Syn purified on QIAquick Spin Columns (Qiagen, Valencia, aSearch and GMAP had the best efficiency for genome align Calif.). Single-stranded template DNA (sstDNA) libraries ments (Tables 3, 4). SynaSearch alignments were performed were prepared using the GS20 DNA Library Preparation Kit on a dual Itanium server while the other methods employed a (Roche Applied Science, Indianapolis, Ind.) following the much larger blade cluster. Genome alignments were much manufacturer's recommendations. SStdNA libraries were more computationally intensive than transcript alignments. clonally amplified in a bead-immobilized form using the GMAP aligned the greatest number of reads (82% to the GS20 emPCR kit (Roche Applied Science). sstDNA libraries human transcript database and 97.8% to the genome). The were sequenced on the 454 GS20 instrument. Two runs were greater number of alignments to the genome reflects RefSeq, performed on SID1437 and SID1438, 3 runs on SID1439 having only 40,545 of ~185,000 human transcripts. For tran (56-64 MB sequence: Table 2, FIG. 8), and up to 18 runs on scripts with aligned reads, GMAP provided the greatest each of the lung specimens (1.65 GB). FIG. 8 illustrates length and depth of coverage of the methods evaluated. Mega length distribution of 454 GS20 reads. BLAST and Synamatix performed similarly for these latter metrics, while BLAT was inferior. These comparisons indi TABLE 2 cated GMAP to be the most effective method for alignment of 454 GS20 Statistics 454 reads to the human genome and transcript databases, and that the blade cluster was adequate for pipelining ~1 M reads SID1437 SID1438 SID1439 per day. Number of GS20 runs 2 2 3 Average read length 104 104 103 TABLE 3 Average read quality 25 24 25 Number Of Reads 621,341 536, 463 586, 232 Comparison of alignment methods for mapping 621, 341 454 Number Of Bases 64.9M 56.2M 60.4M reads from SID 1437 BLAT GMAP MegaBLAST Synamatix 0184 Four alignment techniques (MegaBLAST, GMAP. % of reads with 64.7 82.4 66.5 68.5 BLAT and SynaSearch) were evaluated for alignment of 454 transcript match reads from SID1437 to the NCBI human genome and RefSeq, Transcript CPU Time 2.0 15.5 O.S O.9 transcript databases using similar parameters. MegaBLAST (hr) % of reads with genome 88.0 97.8 87.6 96.5 and BLAT are standard methods for for aligning sequences match that differ slightly as a result of sequencing errors. GMAP is Genome CPU Time (hr) 52.3 14.O 171.8 3.2 a recently described algorithm that was developed to align cDNA sequences to a genome in the presence of Substantial US 2011/0098.193 A1 Apr. 28, 2011

0186 MegaBLAST v.2.2.15, BLAT v.32x1, GMAP toid cDNA MRS data (Table 4). -75% of genes were detected V.2006-04-21 were used to align 454 reads with human Ref by >1 aligned readin-300MB of lung cDNA MRS data. Very Seq transcript dB release 16 and human genome release 16, little run-to-run variation was noted in the number of reads and Synasearch V1.3.1 with RefSeq release 19 and human aligning to each gene (r2)-0.995, FIG. 9). FIG. 9 illustrates run-to-run variation in RefSeq transcript read counts. Two genome release 36.1. GMAP, BLAT and MegaBLAST align runs of 454 sequence were aligned to the RefSeq transcript dB ments were performed on a 62-Dual-core Processor Dell with megaBLAST. In the range examined (up to 1.65 GB per 1855 Blade Cluster with 124 GB RAM and 2.4 TB disk. sample type), the number of transcripts with aligned reads Synamatix alignments were performed on a dual Intel Ita and the depth of coverage increased with the quantity of nium 1.5 GHz CPU with 64 GB RAM. Similar figures were MRS. This was true both of lymphoblastoid cell lines and obtained with SID 1438 and SID 1439. lung specimens. These data indicate that 3 GB of MRS per 0187. On the basis of MegaBLAST and GMAP read align sample provide 8x coverage of ~40% of human transcripts ments, it was found that the majority of genes were expressed (Sufficient to unambiguously identify heterozygous nsV. see in lymphoblastoid lines and lung samples. ~55% of genes below) and ~50% of transcripts with 4.x coverage (sufficient were detected by >1 aligned read in ~60 MB of lymphoblas to unambiguously identify heterozygous nsV).

TABLE 4 RefSeq transcript alignment statistics for 454 sequences from lymphoblastoid cell line RNAs

1437 1438 1439 Mega 1437 Mega 1438 Mega 1439 Case, Method BLAST GMAP BLAST GMAP BLAST GMAP

Number of reads 621341 621341 S36463 S36463 S86232 S86232 % reads aligned 72 64 79 61 64 64 to a RefSeq, transcript % RefSeq, 58 53 57 51 57 52 transcripts with 21 aligned read Number of 704662. 211882 SS6910, 1777O2 60492O 1704O7 indels Number of SNPs 28191S 204730 275277 172183 253182 190491 Indel per kb 10.8 3.3 9.9 3.2 1O.O 2.8 SNP per kb 4.3 3.1 4.9 3.1 4.2 3.2

0188 A moderate 3' bias was observed in the distribution of read coverage across transcripts, as anticipated with oligo dT priming. The bias was not, however, sufficiently pro nounced to preclude analysis of 5' regions.

TABLE 5

Schizophrenia Candidate Genes (from www.polygenicpathways.co.uk)

ACE, ADH1B, APOE, ARVCF, ADRA1A, ATN1, AGA, ATXN1, AHI1, AKT1, ALDH3B1, ALK, APC, B3GAT1, BDNF, BRD1, BZRP, CCKAR, CHGB, CHL1, CHN2, CHRNA7, CLDN5, CNP, CNR1, CNTF, COMT CPLX2, CTLA4, DAO, DAOA, DISC1, DLG2, DPYSL2, DRD2, DRD3, DRD4, DRD5, DTNBP1, EGF, ELSPBP1, ENTH, ERBB4, FEZ1, FOXP2, FZD3, GABBR1, GABRB2, GAD1, GALNT7, GCLM, GFRA1, GNAS, GNPAT, GPR78, GRIA1, GRIA4, GRID1, GRIK3, GRIK4, GRIN1, GRIN2A, GRIN2B, GRIN2D, GRM3, GRM4, GRM5, GRM8, GSTM1, HLA-B, HLA-DRB1, HMBS, HOMER1, HP, HRH2, HTR2A, HTR5A, HTR6, HTR7, IL10, IL1B, IL1RN, IL2, IL4, IMPA2, JARID2, KCNN3, KIF2, KLHL1AS, KPNA3, LGI1, LTA, MAG, MAOA, MAP6, MCHR1, MED12, MLC1, MOG, MPZL.1, MTHFR, NAALAD2, NDUFV2, NOS1, NOS1AP, NOTCH4, NPAS3, NPTN, NPY, NQO2, NRG1, NRG3, NTF3, NTNG1, NTNG2, NUMBL, OLIG2, OPRS1, PAH, PAX6, PCM1, PCQAP. PDE4B, PDLIM5, PHOX2B, PICK1, PIK3C3, PIPSK2A, PLA2G4A, PLA2G4B, PLA2G4C, PLP1, PLXNA2, PNOC, PPP3CC, PRODH, PTGS2, RANBP5, RGS4, RHD, RTN4, RTN4R, S100B, SLC15A1, SLC18A1, SLC1A2, SLC6A3, SLC6A4, SNAP29, SOD2, SRR, ST8SIA2, STX1A, SULT4A1, SYN2, SYN3, SYNGR1, TAAR6, TH, TNF, TNXB, TP53, TPH1, TPP2, TUBA8, TYR, UFD1 L, UHMK1, XBP1, YWHAH, ZDHHC8, ZNF74 US 2011/0098.193 A1 Apr. 28, 2011

0189 The expression of schizophrenia candidate genes in lymphoblastoid cells was a concern. 172 schizophrenia can TABLE 7 didate genes were identified by literature searching (Table 5). Variants identified by GMAP alignment of SID 1437 cDNA 454 reads 66-68 candidate genes (40%) had >3 reads aligned by GMAP to the RefSeq transcript dB without and with bioinformatics filters. in the three lymphoblastoid lines. Scaling from 50 MB to 3 GB MRS per sample, this read count is equivalent to 8x Genes with aligned reads Unfiltered Filtered coverage. Thus, ~40% of Schizophrenia candidate genes are With e1 SNP 11.459 (40%). 932 (3%) With 21 coding domain SNP 7595 (26%) 356 (1%) evaluated for nSV by lymphoblastoid transcriptome MRS. With 21 coding domain, synonymous SNP 4933 (17%) 238 (0190. The number of SNPs and indels for reads aligned With 21 non-synonymous SNP (nsSNP) 6891 (24%) 199 With a SNP causing a premature top codon 1660 (6%) 4 with MegaBLAST and GMAP was enumerated for each With 21 indel 11,313 (39%) 313 (1%) sample (Table 4). One effect of the incompleteness of the With 21 coding domain indel 8,372 (29%) S4 RefSeq transcript database was that some MegaBLAST best matches that met criteria for reporting were misalignments. This was not observed with GMAP. Read misalignment gen 0193 Further, distributed characterization of nsV (nsS NPs and CD indels) was undertaken with the disclosed meth erated false positive SNP and indel calls. Other causes of SNP ods, in order to identify a Subset of candidate genes likely to and indel calls were true nucleotide variants, RefSeq database be associated with medically relevant functional changes in errors and 454 basecalling errors. 454 data has a higher base schizophrenia. A second rule set was developed to identify call error rate than conventional Sanger resequencing, par high-likelihood, medically relevant insV (Table 8). These ticularly indel errors adjacent to homopolymer tracts. The rules represent a second set of threshold values for these unfiltered indel rate per kb with MegaBLAST read alignment elements. Particularly important at this stage were inspection was 9.9-10.8 per kb, and for GMAP was 2.8-3.3 per kb. The of the quality of read alignment and BLAST comparison of SNP rate perkb with MegaBLAST was 4.2-4.9 perkb, and for the read to a second database. ~10% of nsSNPs were RefSeq, GMAP was 3.1-3.2 per kb. In contrast, the true SNP rate per transcript database errors and the reads matched perfectly to kb in the human genome is -0.8 per kb and indel rate is the NCBI human genome sequence or, upon translation, to approximately 10-fold less than the SNP rate. These data protein sequence databases. BLOSUM scores were calcu indicated that use of additional filter sets can identify high lated, but were not used to triage candidate genes, since likelihood, true-positive SNPs and indels in MRS data. nsSNPs in complex disorders insSNPs are strongly biased toward less deleterious substitutions. Congruence with 0191 To circumvent the identification of false-positive altered gene or protein expression in brains of patients with nucleotide variants, a rule set was developed for SNP and schizophrenia was ascertained by link-out to the Stanley indel identification in 454 reads (Table 6). These rules repre Medical Research Institute database. Congruence with sent the threshold values of these elements. These filters had altered gene expression is important in view of recent studies been previously validated on a set of -2.5 million 454 reads showing that SNPs are responsible for >84% of genetic varia and 2.465 previously described human SNPs present in 1415 tion in gene expression. Functional plausibility of the candi genes in a human lung RNA sample and it was found that 96% date gene was ascertained by link-outs to OMIM, ENTREZ of known SNPs were detected. Application of these filters via gene and PubMed. Confluence of candidate genes into net the disclosed methods reduced the number of genes with nsV works or pathways was considered highly significant, given by 60-fold. the likelihood of pronounced genetic heterogeneity. Pathway analysis was performed both by evaluation of standard path TABLE 6 way databases, such as KEGG, and also by custom database creation and visualization of interactions among these genes Rules for identification of high-likelihood, true using Ariadne Pathways software (Ariadne Genomics, Rock positive SNPs and indels in 454 transcriptome MRS: ville, Md.). Variant present in 24 reads TABLE 8 Variant present in 230% of reads High quality score at variant base Rules for identification of high-likelihood, medically relevant insV in Present in 5'-->3' and 3'->5" reads transcriptome MRS studies >90% read alignment to reference sequence 0.192 An example of the utility of application of these Exclude reference sequence error by alignment to 2nd reference dB (e.g. if bioinformatic filters is shown in FIG. 7. SNPs were 3-times initial alignment to RefSeq transcript, confirm by alignment to NCBI human genome) more common than indels (Table 7). The relative frequency of BLOSUM62 Score genes with CD sSNP and nsSNP was similar. The frequency insV congruence in parent-child trio, ASP or pedigree Confluence of nsV into network or pathway of genes with SNPs in untranslated regions (UTRs) was Functional plausibility (ENTREZ, OMIM) 2-fold greater than in CDs, in agreement with the lung MRS Chromosomal location with QTL data:8.. nsSNPs causing premature stop codons were rare. CD Congruence with gene or protein expression data (for example, SNPs were 7-fold more common than indels. The ratio of the Stanley dB, and the like) number of reads with wild-type and variant allele nucleotides appeared able to infer homozygosity and heterozygosity, as 0194 Of the 172 schizophrenia candidate genes (Table 5), previously validated. In the pedigree, inheritance patterns of 3 (HLA-B, HLA-DRB1 and KIF2) exhibited ansSNP in the alleles inferred from read-ratios agreed well with identity by proband, and 2 (LTA, UHMK1) had a nsSNP in one of the descent and inheritance rules. other cases. KIF2 contained a novel insSNP (a821g) at all US 2011/0098.193 A1 Apr. 28, 2011 20 aligned reads in SID1437 and SID1439. No reads aligned at glutathione metabolism. Glutathione is the main non-protein this location in SID1438. KIF2 is important in the transport of antioxidant and plays a critical role in protecting neurons membranous organelles and protein complexes on microtu from damage by reactive oxygen species generated by bules and is involved in BDNF-mediated neurite extension. A dopamine metabolism. A large literature exists regarding glu prior study of transmission disequilibrium in a cohort of tathione deficiency in prefrontal cortex in Schizophrenia and affected family samples identified a common two-SNP hap several groups have sought associations between glutathione lotype (rs2289883/rS464058, G/A) that showed a significant metabolism genes or polymorphisms with Schizophrenia and association with schizophrenia, as did a common four-SNP tardive dyskinesia. Mendelian deficiency in glutathione haplotype (P<0.008). metabolism genes results in mental deficiency and psychosis. An interesting follow-up study comprises determining the TABLE 9 association between the endophenotype of prefrontal glu tathione level (measured by NMR spectroscopy) and GPX1 insV identified in three lymphoblastoid lines by GMAP alignment and GSTP1 genotypes. to RefSeq transcript following application of bioinformatics filters 0.197 Also notable were numerous genes involved in syn Genes with aligned reads and filtering aptic vesicle exocytosis (ACTN4, ANXA11, ANXA2. MTDH, SYNCRIP, SNX3). SID1437 SID1438 SID1439 All (0198 Interestingly, two nsV identified by GMAP were e1 nsSNP 199 2O2 252 74 associated with novel splice isoforms (KHSRP. FIG. 10 and SNP-induced premature stop 4 4 6 O FIG. 11, and SYNCRIP. FIG. 12). In the case of KHSRP, the codon nsSNP was an artifact of GMAP-based alignment extension 21 coding domain indel S4 78 123 5 through a hexanucleotide hairpin that was present at the 3' terminus of both exon 19 and intron 19. A novel KHSRP 0.195. Seventy-nine genes had a nsv in all 3 individuals splice isoform was identified that retains intron 19 sequences. (Table 9). Of these, four were RefSeq transcript database The novel SYNCRIP splice isoform omits an exon present in errors. Ten were in highly polymorphic HLA genes, including the established transcript. two in schizophrenia candidate genes HLA-Band HLA-DRB 0199 Since next generation sequencing technologies gen 1. Thirty-one occurred in putative genes that have been iden erate clonal sequences from individual mRNA molecules, tified informatically from the human genome sequence. nsV enumeration of aligned reads permits estimation of the copy within such genes were found to be unreliable due to: i) number of transcripts, splice variants and alleles. As noted uneven coverage (likely misannotation of splice variants), ii) above, the aligned read counts for individual transcripts in a an overabundance of putative SNPs, and/or iii) premature sample showed little run-to-run variation (FIG. 9). Read truncation of alignments. Of the remaining 36 genes, count was affected by the length of the transcript, the fidelity ADRBK1, GSTP1, MTDH, PARP1, PLCG2, PLEK, of alignment, and the repetitiveness of transcript Sub-se SLC25A6, SLC38A1 and SYNCRIP were particularly inter quences. In particular, Some transcripts with repetitive esting since they were related to schizophrenia candidate sequences within the 3' UTR exhibited significant local genes (Table 10). increases in read counts at those regions, as has been described for pyknons and short tandem repeats. Thus, com TABLE 10 parisons of read count-based abundance of different tran Scripts within a sample were not always accurate. However, Genes related to candidates with nsV in SID 1437 comparisons of abundance of a transcript between samples that were based upon read counts were accurate, as previously Related Gene With nsV validated. Pairwise comparisons of the copy numbers of indi Function Candidate Gene in SID 1437 vidual transcripts in lymphoblast cell lines from related indi Glutamate receptor NAALAD2 DPP7 viduals showed significant correlation (FIG. 13, r >0.93) and agonist availability SLC15A1 SKC25A6 allowed identification of transcripts exhibiting large differ PRODH P4HA1 ences in read count between individuals. SLC1A2 SLC38A1 DTNBP1 WAPA (0200 FIG. 10A-C and FIG. 11 illustrate an example of a ENTH FLNA novel splice isoform identified with GMAP by an apparent Synaptic vesicle SNAP29 ACTN4 SNP at the penultimate base of an alignment. FIG. 10A illus exocytosis SYN2 ANXA11, ANXA2 SYN3 MTDH trates GMAP based alignment of SID1437 reads to nucle STX1A SYNCRIP otides 1507-2507 of KHSRP transcript NM 003.685.1, SYNGR1 SNX3 showing a nsSNP in five of twelve reads (red line, a2216c. Plasticity PLXNA2 PLEK inducing a Q to C non-conservative substitution, BLOSUM Cytokine-related PIPSK2A PLCG2 score -1). FIG. 10B illustrates the FASTA-format of the Glutathione GSTM1, GCLM GPX1, GSTP1 GMAP alignment of one of the five cDNA reads containing a Postsynaptic density ADRA1A ADRBK1 insSNP (D93AXQM01ARQC5) to KHSRP transcript MED12 PAPOLA, PAP1, PCB1 NM 003.685.1. Note that only the 3'50 nt of the read aligned MAP6 MARK3 to this transcript. ThensSNP is indicated in yellow, the stop codon in red, and a stable hexanucleotide hairpin in green. 0196. Of 244 genes with an insV in the proband (Table 9), Score=0 bits (50), Identities=50/50 (98%), Strand=+/+. FIG. seven were RefSeq transcript database errors, 71 were in 10C illustrates alignment of the entire read putative genes and twelve were in HLA genes. Twenty-one D93AXQM01ARQC5 to KHSRP intron 19 and exon 20. genes had ansv in the proband that were either close relatives Chr19 nucleotides refer to contig reflNW 927173. of Schizophrenia candidate genes or in the same pathway 1HsCraAADB02 624. The nucleotide that corresponded to (Table 10). Notable were GPX1 and GSTP1, both of which a nsSNP when aligned to NM 003.685.1 shows identity contained known insSNPs (rs1050450 and rs1695 and when aligned against Chr19 (yellow). The stop codon is indi rs179981, respectively). GPX1 and GSTP1 are important in cated in red, a stable hexanucleotide hairpiningreen and exon US 2011/0098.193 A1 Apr. 28, 2011

20 in grey. Score=204 bits (110), Expect=2e-50, Identi #223900 in individuals of Ashkenazi descent). Thus, while ties=100%, Gaps=0%, Strand=+/-. individual Mendelian diseases are uncommon in general 0201 FIG. 11 illustrates the genomic sequence of KHSRP populations, collectively they continue to account for ~20% exon 19 (purple), exon 20 (grey) and the 3' end of intron 19 of infant mortality and ~10% of pediatric hospitalizations. A (blue) which is present in 5 cDNA reads (including framework for development of criteria for comprehensive D93AXQM01ARQC5). Apparent insSNP when aligned to preconception screening can be inferred from an American NM 003.685.1 shows identity when aligned against Chr19 College of Medical Genetics report on expansion of newborn (indicated in yellow). The stop codon is indicated in red and screening for inherited diseases. Criteria included test accu a stable hexanucleotide hairpin in green. Interestingly, the racy, cost of testing, disease severity, highly penetrant reces hairpin sequence flanks the splice donor site of exon 19 and sive inheritance and whether an intervention is available for those identified as carriers. Hitherto, the criterion precluding splice acceptor site of intron 19, indicating a possible mecha extension of preconception screening to most severe reces nism whereby KHSRP can be alternatively spliced to retain sive mutations or general populations has been cost (defined intron 19 sequences. in that report as an overall analytical cost requirement of >S1 (0202 FIG. 12 illustrates a GMAP alignment of read per test per condition). D9VJ59F02JQMRR (nt 1-109, top) from SID 1438, to SYN 0206 Target capture and next generation sequencing have CRIP (NM 006372.3, bottom) showing a nsSNP at nt 30 shown efficacy for resequencing human genomes and (yellow, a 1384g) and a novel splice isoform that omits an exomes, providing an alternative potential paradigm for com 105-bp exon and maintains frame. Consensus splice donor prehensive carrier testing. An average 30-fold depth of cov and acceptor nucleotides are in red. Four reads demonstrated erage can be sufficient for single nucleotide polymorphism thensSNP. Score=0 bits (119), Identities=109/119 (98%). (SNP) and nucleotide insertion ordeletion (indel) detection in 0203. In summary, ~150 MB of shotgun, clonal, cDNA genome research. The validation of these methods for clinical MRS of lymphoblastoid lines from a pedigree with mental utility can be different. Data demonstrating the sensitivity and illness was performed, using approaches developed for a prior specificity of genotyping of disease mutations (DM), particu ~2 GB MRS study in cancer. Automated data pipelining and larly polynucleotide indels, gross insertions and deletions, distributed, facilitated analysis was accomplished using web copy number variations (CNVs) and complex rearrange based cyberinfrastructure. A two-tiered analysis schema ments, is limited. High analytic validity, concordance in identified twenty-one schizophrenia candidate genes that many settings, high-throughput and cost-effectiveness (in showed reasonable accord with current understanding of the cluding sample acquisition and preparation) can be used for molecular pathogenesis of schizophrenia (Table 10). broader population-based carrier screening. Here, the devel opment of a preconception carrier screen for 489 severe E. Carrier Testing recessive childhood disease genes based on target enrichment 0204 Preconception testing of motivated populations for and next generation sequencing that meets most of these recessive disease mutations, together with education and criteria is reported Furthermore, the first assessment of carrier genetic counseling of carriers, can dramatically reduce their burden for severe recessive diseases of childhood is also incidence within a generation. Tay-Sachs disease (TSD: reported. Mendelian Inheritance in Man accession number (OMIM #) 272800), for example, is an autosomal recessive neurodegen 1. Materials and Methods erative disorder with onset of symptoms in infancy and death 0207 i. Disease Choice by two to five years of age. Formerly, the incidence of TSD 0208 Criteria for disease inclusion for preconception was one per 3,600 Ashkenazi births in North America. After screening were broadly based on those for expansion of new forty years of preconception screening in this population, born screening, but with omission of treatment criteria'. however, the incidence of TSD has been reduced by more Thus, very broad coverage of severe childhood diseases and than 90%. While TSD remains untreatable, therapies are mutations was sought to maximize cost-benefit, potential available for many severe recessive diseases of childhood. reduction in disease incidence and adoption. A Perl parser Thus, in addition to disease prevention, preconception testing identified severe childhood recessive disorders with known enables early treatment of high risk pregnancies and affected molecular basis in OMIM. Database and literature searches neonates, which can profoundly diminish disease severity. and expert reviews were performed on resultant diseases. Six 0205 Over the past twenty five years, 1,123 genes that diseases with extreme locus heterogeneity were omitted cause Mendelian recessive diseases have been identified. To (OMIM #209900, #209950, Fanconi anemia, #256000, date, however, preconception carrier testing has been recom #266510, #214100). Diseases were included if mutations mended in the US only for five of these (fragile X syndrome caused severe illness in a proportion of affected children and OMIM #300624 in selected individuals, cystic fibrosis (CF, despite variable inheritance, mitochondrial mutations or low OMIM #2 19700 in Caucasians and TSD, Canavan disease incidence. Mental retardation genes were excluded. 489 OMIM #271900 and familial dysautonomia (OMIM recessive disease genes met these criteria (Table 11).

TABLE 11

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM if Name Symbol Type 300069 #300069 CARDIOMYOPATHY, DILATED, 3A: CMD3A TAZ cardiac 302060 #302060 BARTH SYNDROME: BTHS TAZ cardiac 220400 #2204OOJERVELLAND LANGE-NIELSENSYNDROME KCNQ1 cardiac US 2011/0098.193 A1 Apr. 28, 2011 22

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol Type 208OOO i2O8OOOARTERIAL CAL CIFICATION, GENERALIZED, ENPP1 cardiac OF INFANCY; GACI 611705 #611705 MYOPATHY, EARLY-ONSET, WITH FATAL TTN cardiac CARDIOMYOPATHY 241SSO #241SSO HYPOPLASTICLEFT HEARTSYNDROME GA1 cardiac 255.960 #255960 MYXOMA, INTRACARDIAC PRKAR1A cardiac 22S320 #225320 EHLERS-DANLOS SYNDROME, AUTOSOMAL COL1A2 Cl RECESSIVE, CARDIAC VALVULARFORM 27758O #277580 WAARDENBURG-SHAHSYNDROME EDN3 Cl 8(OS 27758O #277580 WAARDENBURG-SHAHSYNDROME EDNRB Cl 8(OS 27758O #277580 WAARDENBURG-SHAHSYNDROME SOX10 Cl 8(OS 6OOSO1 #6OOSO1 ABCD SYNDROME EDNRB Cl 8(OS 263700 #263700 PORPHYRIA, CONGENITAL UROS Cl 8(OS ERYTHROPOIETIC 2788OO #2788OODE SANCTIS-CACCHIONE SYNDROME ERCC6 Cl 8(OS 2788OO #2788OODE SANCTIS-CACCHIONE SYNDROME XPA Cl 8(OS 109400 BASAL CELL NEVUSSYNDROME; BCNS PTCH1 Cl 8(OS 3OS100 #305100 ECTODERMAL DYSPLASIA, HYPOHIDROTIC, EDA Cl 8(OS X-LINKED XHED 3098O1 MICROPHTHALMIASYNDROMIC 7: MCOPS7 HCCS Cl 8(OS 245660 #24566OLARYNGOONYCHOCUTANEOUS LAMA3 Cl 8(OS SYNDROME; LOCS 2286OO #2286OO FIBROMATOSIS JUVENILE HYALIN E ANTXR2 Cl 8(OS 2292OO #2292OO BRITTLE CORN EASYNDROME; BCS Cl 8(OS 2266OO #2266OO EP S BULLOSA DYSTROPHICA, COL7A1 Cl 8(OS AUTOSO SSIVE: RDE 2266SO #2266SO MAD E s o LY S S BULLOSA, JUNCTIONAL, COL17A1 Cl 8(OS NON-HE 2266SO #2266SO S BULLOSA, JUNCTIONAL, ITGB4 Cl 8(OS NON-HE 2266SO #2266SO S BULLOSA, JUNCTIONAL, LAMA3 Cl 8(OS NON-HE TY P E 2266SO #2266SO S BULLOSA, JUNCTIONAL, LAMB3 Cl 8(OS NON-HE 2266SO #2266SO S BULLOSA, JUNCTIONAL, LAMC2 Cl 8(OS NON-HE 2267OO #2267OO C D S BULLOSA, JUNCTIONAL, LAMA3 Cl 8(OS HERLITZT Y P E 2267OO #226,700 EPIDERMOLYS S BULLOSA, JUNCTIONAL, LAMB3 Cl 8(OS HERLITZT 2267OO #226,700 EPIDERMOLYS S BULLOSA, JUNCTIONAL, LAMC2 Cl 8(OS HERLITZT 242SOO #242500 ICHTHYOSIS CONGENITA, HARLEQUIN ABCA12 Cl 8(OS FETUSTYP 2787OO #2787OOXERODERMAP GMENTOSUM, XPA Cl 8(OS COMPLEMENTATION G ROUPA; XPA 27873O #278730 XERODERMAP GMENTOSUM, ERCC2 Cl 8(OS COMPLEMENTATION G ROUP D; XPD 2.7874O #27874OXERODERMAP GMENTOSUM, DDB2 Cl 8(OS COMPLEMENTATION G ROUPE 27876O #27876O XERODERMAP GMENTOSUM, ERCC4 Cl 8(OS COMPLEMENTATION G ROUPF, XPF 27878O #278780 XERODERMAP GMENTOSUM, ERCCS Cl 8(OS COMPLEMENTATION G ROUP G; XPG 219 100 #219100 CUTIS LAXA, A UTOSOMAL RECESSIVE, EFEMP2 Cl 8(OS TYPE I 219 100 #219100 CUTIS LAXA, A UTOSOMAL RECESSIVE, FBLNS Cl 8(OS TYPE I #601675 TRICHOTHIODYSTROPHY, ERCC2 Cl 8(OS PHOTOSENSITIVE: TTD C #601675 TRICHOTHIODYSTROPHY, ERCC3 Cl 8(OS PHOTOSENSITIVE: TTD C #601675 TRICHOTHIODYSTROPHY, Cl 8(OS PHOTOSENSITIVE: TTD C 2192OO #219200 CUTIS LAXA, A UTOSOMAL RECESSIVE, Cl 8(OS TYPE II 226,730 #226730 EPIDERMOLYS SBULLOSA JUNCTIONALIS ITGA6 Cl 8(OS WITHPYLORICATRESIA 226,730 #226730 EPIDERMOLYS SBULLOSA JUNCTIONALIS ITGB4 Cl 8(OS WITHPYLORICATRESIA 609638 #609638 EPIDERMOLYS S BULLOSA, LETHAL DSP Cl 8(OS ACANTHOLYTIC US 2011/0098.193 A1 Apr. 28, 2011 23

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol e

22S410 #225410 EHLERS-DANLOS SYNDROME, TYPE VII, ADAMTS2 Cl 8(OS AUTOSOMAL RECESSIVE 226670 #226670 EPIDERMOLYSIS BULLOSA SIMPLEX WITH PLEC1 Cl 8(OS MUSCULARDYSTROPHY 242300 #242300 ICHTHYOSIS, LAMELLAR, 1: LI1 TGM1 Cl 8(OS 275210 #275210 TIGHT SKIN CONTRACTURE SYNDROME, LMNA Cl 8(OS LETHAL 275210 #275210 TIGHT SKIN CONTRACTURE SYNDROME, ZMPSTE24 Cl 8(OS LETHAL 6O1706 #6O1706YEMENITE DEAF-BLIND SOX10 Cl 8(OS HYPOPIGMENTATIONSYNDROME #607626 ICHTHYOSIS, LEUKOCYTE VACUOLES, CLDN1 Cl 8(OS ALOPECIA, AND SCLEROSING CHOLANGITIS 6O7655 #6O76SS SKINFRAGILITYWOOLLY HAIR SYNDROME DSP Cl 8(OS 610651 #610651 XERODERMA PIGMENTOSUM, ERCC3 Cl 8(OS COMPLEMENTATION GROUP B; XPB 25798O #257980 ODONTOONYCHODERMAL DYSPLASIA: WNT1OA Cl 8(OS OODD 3.00537 HETEROTOPIA PERIVENTRICULAREHLERS-DANLOS FLNA Cl 8(OS VARIANT BASAL CELL CARCINOMA SUSCEPTIBILITY TO 1: PTCH1 Cl 8(OS BCC1 20808S #208085 ARTHROGRYPOSIS, RENAL DYSFUNCTION, VPS33B (WCOOle AND CHOLESTASIS 306955 #306955 HETEROTAXY, VISCERAL, 1, X-LINKED; ZIC3 (WCOOle HTX1 30O215 #300215 LISSENCEPHALY,X-LINKED, 2 LISX2 ARX (WCOOle 600118 #600118 WARBURG MICROSYNDROME; WARBM RAB3GAP1 (WCOOle 3.00209 #3.00209 SIMPSON-GOLABI-BEHMEL SYNDROME, OFD1 eWe O le TYPE 2 6O1378 #6O1378 CRISPONISYNDROME eWe le 3OO166 MICROPHTHALMIASYNDROMIC 2: MCOPS2 eWe le 222448 #222448 DONNAI-BARROW SYNDROME eWe le 6O7598 #6O7598 LETHAL CONGENITAL CONTRACTURE eWe le SYNDROME 2 608612 #608612 MANDIBULOACRALDYSPLASLAWITH TYPE ZMPSTE24 eWe O le 8. B LIPODYSTROPHY MADB 3095OO #309500 RENPENNING SYNDROME 1: RENS1 eWe le 211750 #211750 CSYNDROME 9 6 eWe le 60SO39 #6OSO39 C-LIKE SYNDROME eWe le 2438OO #243800 JOHANSON-BLIZZARD SYNDROME; JBS eWe le 2704OO #270400 SMITH-LEMLI-OPITZ SYNDROME: SLOS eWe le 31.1300 OTOPALATODIGITAL SYNDROME TYPE I: OPD1 eWe le 2141SO #214150 CEREBROOCULOFACIOSKELETAL EkN 6 eWe le SYNDROME 1: COFS1 311200 OROFACIODIGITAL SYNDROME I: OFD1 FD1 eWe le 61.1561 #611561 MECKEL SYNDROME, TYPE 5: MKS5 PGRIP1L, eWe le 219 OOO #219 OOO FRASERSYNDROME RAS1 eWe le 219 OOO #219 OOO FRASERSYNDROME REM2 eWe le 249000 #249000 MECKEL SYNDROME, TYPE 1: MKS1 MKS1 eWe le 2S3310 #25331 OLETHAL CONGENITAL CONTRACTURE eWe le SYNDROME 1; LCCS1 23668O #23668O HYDROLETHALUS SYNDROME 1 eWe le 200990 #200990 ACROCALLOSAL SYNDROME; ACLS eWe le 25732O #257320 LISSENCEPHALY 2: LIS2 eWe le 308300 INCONTINENTIAPIGMENTI; IP eWe le 30S600 FOCAL DERMAL HYPOPLASLA; FDH eWe le 30O815 CHROMOSOMEXq28 DUPLICATIONSYNDROME eWe le 3OO422 FG SYNDROME 4; FGS4 eWe le 3OO321 FG SYNDROME 2: FGS2 eWe le 3OO472 CORPUS CALLOSUM, AGENESIS OF WITH MENTAL eWe le RETARDATION, OCULAR COLOBOMA, 309000 #309000 LOWE OCULOCEREBRORENAL SYNDROME; (WCOOle 8. OCRL 31 O600 #310600 NORRIE DISEASE; ND NDP (WCOOle 311150 #311150 OPTICOACOUSTICNERVE ATROPHY WITH TIMM8A (WCOOle DEMENTIA 208150 #2O81SO FETALAKINESIA DEFORMATION RAPSN (WCOOle SEQUENCE: FADS CORNELIA DE LANGE SYNDROME 2: CDLS2 SMC1A (WCOOle #302950 CHONDRODYSPLASIA PUNCTATA 1, X ARSE (WCOOle LINKED RECESSIVE: CDPX1 US 2011/0098.193 A1 Apr. 28, 2011 24

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol Type

2151OO #2151OO RHIZOMELIC CHONDRODYSPLASIA PEX7 (WCOOle PUNCTATA, TYPE 1: RCDP1 222600 #2226OODLASTROPHICDYSPLASIA SLC26A2 eWe le 256.050 #256.050 ATELOSTEOGENESIS, TYPE II: AOII SLC26A2 eWe le 2683OO #268300 ROBERTS SYNDROME; RBS ESCO2 eWe le 273.395 #273395 TETRA-AMELIA, AUTOSOMAL RECESSIVE WNT3 eWe le 6O2398 #6O2398 DESMOSTEROLOSIS DHCR 24 eWe le 201OOO i2O10(OO CARPENTERSYNDROME RAB23 eWe le 3093SO MELNICK-NEEDLESSYNDROME: MNS FLNA eWe le 601451 #601451 NEVO SYNDROME PLOD1 eWe le 2S3290 #253290 MULTIPLEPTERYGIUM SYNDROME, CHRNA1 eWe le LETHALTYPE 2S3290 #253290 MULTIPLEPTERYGIUM SYNDROME, CHRND (WCOOle 8. LETHALTYPE 2S3290 #253290 MULTIPLEPTERYGIUM SYNDROME, CHRNG (WCOOle LETHALTYPE 26SOOO #265OOOMULTIPLEPTERYGIUM SYNDROME, (WCOOle ESCOBARVARIANT 6O1186 #6O1186 MICROPHTHALMLA, SYNDROMIC 9; MCOPS9 (WCOOle 253,250 #2S32SOMULIBREYNANISM (WCOOle 24O3OO #24O3OOAUTOIMMUNE POLYEN DOCRINE el ocrine SYNDROME, TYPE I: APS 264700 #264700 VITAMIN D-DEPENDENT RICKETS, TYPE I el ocrine 3O8370 #3O8370 INFERTILEMALE SYNDROME el ocrine 244460 #244460 KENNY-CAFFEY SYNDROME, TYPE 1: KCS el ocrine 2O3800 #203800 ALSTROMSYNDROME: ALMS el ocrine 201710 i2O1710 LIPOID CONGENITALA DRENAL el ocrine HYPERPLASIA 201710 i2O1710 LIPOID CONGENITALA DRENAL el ocrine HYPERPLASIA 2462OO #2462OO DONOHUE SYNDROME el ocrine 262600 #2626OO PITUITARY DWARFISM I el ocrine 262600 #2626OO PITUITARY DWARFISM I el ocrine 262600 #2626OO PITUITARY DWARFISM I el ocrine 262600 #2626OO PITUITARY DWARFISM I POU1F1 el ocrine 27O4SO #27O4SOINSULIN-LIKE GROWTH FACTORI, IGF1 el ocrine RESISTANCETO 275100 #275100 HYPOTHYROIDISM, CONGENITAL, TSHB el ocrine NONGOITROUS, 4; CHNG4 2O1910 +201910 ADRENAL HYPERPLASIA, CONGENITAL, CYP21A2 el ocrine DUETO 21-HYDROXYLASE DEF ICIENCY 300048 INTESTINAL PSEUDOOBSTRUCTION, NEURONAL, FLNA gastro CHRONIC IDIOPATHIC, X-LINKED enterologic 610370 #610370 DLARRHEA4, MALABSORPTIVE, NEUROG3 gastro CONGENITAL enterologic 301040 C-THALASSEMIAMENTAL RETARDATION ATRX hematologic SYNDROME, NONDELETION TYPE, X-LINKEDATRX 26O4(OO #260400 SHWACHMAN-DLAMOND SYNDROME: SDS SBDS (8. ogic 2O2400 BRINOGENEMIA, CONGENITAL FGA (8. ogic 2O2400 BRINOGENEMIA, CONGENITAL FGB (8. ogic 2O2400 i2O24(OOA BRINOGENEMIA, CONGENITAL FGG (8. ogic 2741SO #2741SO THROMBOTIC THROMB OCYTOPENIC ADAMTS13 (8. ogic PURPURA, CONGENITAL, TTP 612304 #612304 THROMBOPHILIA, HEREDITARY, DUE TO PROC hematologic PROTEIN C DEFICIENCY, AUTOSOMAL 2662OO #2662OOPYRUVATEKINASE DEFICIENCY OF RED hematologic CELLS 217090 #217090 PLASMINOGEN DEFICIENCY, TYPE I PLG hematologic 266130 #266.130 GLUTATHIONE SYNTHETASE DEFICIENCY GSS (8. O ogic 604498 #604498AMEGAKARYOCYTIC MPL (8. O ogic THROMBOCYTOPENIA, CONGENITAL; CAMT 1418OO +141800 HEMOGLOBIN--C. LOCUS 1; HBA1 HBA1 (8. ogic 141900 +141900 HEMOGLOBIN-BETA LOCUS; HBB HBB (8. ogic 6O3903 #603903 SICKLE CELLANEMIA HBB (8. ogic 6O2390 #602390 HEMOCHROMATOSIS, JUVENILE: JH HAMP (8. ogic 6O2390 #602390 HEMOCHROMATOSIS, JUVENILE: JH HFE2 (8. ogic 3OO448 C-THALASSEMIA MYELODYSPLASIA SYNDROME; ATRX (8. ogic ATMDS 21S600 #215600 CIRRHOSIS, FAMILLAL KRT18 hepatic 21S600 #215600 CIRRHOSIS, FAMILLAL KRT8 hepatic 107400 +107400 PROTEASE INHIBITOR 1: PI SERPINA1 hepatic US 2011/0098.193 A1 Apr. 28, 2011 25

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol ype

235550 #23SSSO HEPATIC VENOOCCLUSIVE DISEASE WITH SP110 immuno IMMUNODEFICIENCY VODI eficiency 30O240 #30O240 HOYERAAL-HREIDARSSON SYNDROME; DKC1 immuno HHS eficiency 2O8900 #208900 ATAXIA-TELANGIECTASIA: AT ATM immuno eficiency 301 OOO #301.000 WISKOTT-ALDRICH SYNDROME: WAS WAS immuno eficiency 3O4790 #304790 IMMUNODYSREGULATION, FOXP3 immuno POLYENDOCRINOPATHY, AND ENTEROPATHY, X eficiency LINKED; 308240 #308240 LYMPHOPROLIFERATIVE SYNDROME, X- SH2D1A immuno LINKED, 1; XLP1 eficiency 312O60 #31206O PROPERDIN DEFICIENCY,X-LINKED CFP immuno eficiency 300755 #300755 AGAMMAGLOBULINEMLA, X-LINKED XLA BTK immuno eficiency ANHIDROTICECTODERMALDYSPLASIAWITH IKBKG immuno IMMUNODEFICIENCY, OSTEOPETROSIS AND eficiency LYMPHEDEMAOLEDAID 30O291 #30O291 ECTODERMAL DYSPLASIA, HYPOHIDROTIC, IKBKG immuno WITH IMMUNE DEFICIENCY eficiency 312863 #312863 COMBINED IMMUNODEFICIENCY,X- IL2RG immuno LINKED, CIDX eficiency #300400 SEVERE COMBINED IMMUNODEFICIENCY, IL2RG immuno X-LINKED: SCIDX1 eficiency 3O8230 #308230 IMMUNODEFICIENCY WITH HYPER-IgM, CD4OLG immuno TYPE 1: HIGM1 eficiency 1027OO #102700 SEVERE COMBINED IMMUNODEFICIENCY, ADA immuno AUTOSOMAL RECESSIVE, T CELL-NEGATIVE, eficiency 210900 #210900 BLOOMSYNDROME; BLM BLM immuno eficiency 2491OO #249100 FAMILIAL MEDITERRANEAN FEVER: FMF MEFV immuno eficiency 2S1260 #2S1260 NIJMEGEN BREAKAGE SYNDROME NBN immuno eficiency 603554 #603554 OMENNSYNDROME DCLRE1C immuno eficiency 603554 #603554 OMENNSYNDROME RAG1 immuno eficiency 603554 #603554 OMENNSYNDROME RAG2 immuno eficiency 242860 #242860 IMMUNODEFICIENCYCENTROMERIC DNMT3B immuno INSTABILITY-FACIAL ANOMALIES SYNDROME eficiency #607624 GRISCELLISYNDROME, TYPE 2: GS2 RAB27A immuno eficiency 6O1457 #601457 SEVERE COMBINED IMMUNODEFICIENCY, RAG1 immuno AUTOSOMAL RECESSIVE, T CELL-NEGATIVE, eficiency 6O1457 #601457 SEVERE COMBINED IMMUNODEFICIENCY, RAG2 immuno AUTOSOMAL RECESSIVE, T CELL-NEGATIVE, eficiency 25O250 #250250 CARTILAGE-HAIR HYPOPLASIA: CHH RMRP Immuno eficiency 6O1705 #6O1705 T-CELL IMMUNODEFICIENCY, CONGENITAL FOXN1 Immuno ALOPECIA, AND NAIL DYSTROPHY eficiency 214500 CHEDIAK-HIGASHISYNDROME: CHS LYST Immuno eficiency SEVERE COMBINED IMMUNODEFICIENCY, AR, T AK3 Immuno CELL-NEGATIVE, B CELL-POSITIVE, NK CELL eficiency NEGATIVE 261740 #261740 GLYCOGEN STORAGE DISEASE OF HEART, PRKAG2 Metabolic LETHAL CONGENITAL 232400 #232400 GLYCOGEN STORAGE DISEASE III AGL Metabolic 2149SO #214.950 BILE ACID SYNTHESIS DEFECT, AMACR metabolic CONGENITAL, 4 609060 #609060 COMBINED OXIDATIVE PHOSPHORYLATION GFM1 metabolic DEFICIENCY 1: COXPD1 6.10498 #610498 COMBINED OXIDATIVE PHOSPHORYLATION MRPS16 metabolic DEFICIENCY 2: COXPD2 61.1719 #611719 COMBINED OXIDATIVE PHOSPHORYLATION MRPS22 metabolic DEFICIENCY 5; COXPD5 2322OO 2322OO GLYCOGEN STORAGE DISEASE I G6PC3 metabolic 232SOO #232SOO GLYCOGEN STORAGE DISEASE IV GBE1 metabolic US 2011/0098.193 A1 Apr. 28, 2011 26

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol Typ e

215700 #215700 CITRULLINEMIA, CLASSIC ASS1 O ic 230900 #230900 GAUCHERDISEASE, TYPE II GBA O ic 245200 #2452OOKRABBE DISEASE GALC O ic 248SOO #248500 MANNOSIDOSIS, C. B., LYSOSOMAL MAN2B1 O ic 252500 #2S25OOMUCOLIPIDOSIS II of BETA GNPTAB O ic 2S2600 #2526OOMUCOLIPIDOSISIII Cu BETA GNPTAB O ic 252650 #2526SOMUCOLIPIDOSI SIV MCOLN1 O ic 257200 #257200 NIEMANN-PICK DISEASE, TYPE A SMPD1 O ic 257220 #257220 NIEMANN-PICK DISEASE, TYPE C1; NPC1 NPC O ic 26992O #26992OINFANTILESIALICAC D STORAGE SLC17AS O ic DISORDER 604369 #604369 SIALURIA, FINN SH TYPE O ic 6O7625 #607625 NIEMANN-PICK DISEASE, TYPE C2 O ic 608O13 #608013 GAUCHERDISEASE, P ERINATAL LETHAL O ic 2S3200 #2S32OOMUCOPOLYSACC ARIDOSIS TYPE VI O ic 2S3220 #2S322O MUCOPOLYSACC ARIDOSIS TYPE VII O ic 256550 #2S6SSONEURAMINIDA SE DEFICIENCY O ic 23OOOO #23OOOOFUCOSIDOSIS O ic 230600 #230600 GM1-GANGLIOSIDOSI S, TYPE II O ic 252930 #25293OMUCOPOLYSACCH ARIDOSIS TYPE IIIC O ic 611721 #611721 COMBINED SAPOS INDEFICIENCY O ic 23O800 #23O8OO GAUCHERDISEAS E, TYPE I O ic 607616 #607616 NIEMANN-PICKDI SEASE, TYPE B O ic 26S800 #2658OOPYCNODYSOSTOSIS O ic 231 OOO #231 OOO GAUCHERDISEAS E, TYPE III O ic 252900 #252900 MUCOPOLYSACCHARIDOSIS TYPE IIIA O ic 2084OO +2O8400 ASPART YLGLUCOSAMINURIA O ic 6O7014 #6O7014 HURLERSYNDROME O ic 608688 #608688 AICARTRANSFORMYLASEAIMP ATIC O ic CYCLOHYDROLASE, DEFICIENCY OF 604377 #604377 CARDIOENCEPHA LOMYOPATHY, FATAL SCO2 O ic INFANTILE, DUE TO CYTOCHROME c OXIDASE #6OO121 RHIZOMELIC CHONDRODYSPLASIA AGPS O ic PUNCTATA, TYPE 3: RCDP3 271900 #271900 CANAVANDISEASE ASPA O ic 30O816 COMBINED OXIDATIVE PHOSPHORYLATION AIFM1 O ic DEFICIENCY 6 3OO 100 #300100 ADRENOLEUKODYSTROPHY: ALD ABCD1 O ic 213700 #213700 CEREBROTENDINOUSXANTHOMATOSIS CYP27A1 O ic 2SO62O #2SO62O BETA-HYDROXYISOBUTYRYL COA HIBCE O ic DEACYLASE, DEFICIENCY OF 609241 #609241 SCHINDLER DISEASE, TYPE I NAGA O ic 608782 #608782 PYRUVATE DEH YDROGENASE PDP1 O ic PHOSPHATASE DEFICIE NCY #6OS4O7 SEGAWA SYND ROME, AUTOSOMAL TH O ic RECESSIVE #612736 GUANIDINOAC ETATE GAMT O ic METHYLTRANSFERASE DEFICIENCY 7-(a)BETA-HYDROXYST TEROID DEHYDROGENASE X O ic DEFICIENCY PYRUVATE DECARBOXYLASE DEFICIENCY O ic #301SOO FABRY DISEAS O ic #3112SOORNITHINE TRANSCARBAMYLASE O ic DEFICIENCY, HYPERAMMONEMIA DUE TO 2014SO #201450 ACYL-CoA DEHYDROGENASE, MEDIUM O ic CHAIN, DEFICIENCY OF 211600 #211600 CHOLESTASIS, PROGRESSIVE FAMILLAL ATP8B1 O ic INTRAHEPATIC 1: PFIC 21206S #21206S CONGENITAL DISORD ER OF PMM2 O ic GLYCOSYLATION, TYPE 219750 #219750 CYSTINOSIS, ADULT NONNEPHROPATHIC CTNS O ic 2198OO #219800 CYSTINOSIS, N EPHRO PATHIC; CTNS CTNS O ic 23O4OO #23O4(OO GALACTOSEMIA GALT O ic 231 680 #231 680 MULTIPLE ACYL-CoA DEHYDROGENASE ETFA O ic DEFICIENCY MADD 231 680 #231 680 MULTIPLE ACYL-CoA DEHYDROGENASE ETFB O ic DEFICIENCY MADD 231 680 #231 680 MULTIPLE ACYL-CoA DEHYDROGENASE ETFDH O ic DEFICIENCY MADD 232220 #232220 GLYCOGEN STORAGE DISEASE Ib SLC37A4 O ic 2323OO #2323OO GLYCOGEN STORAGE DISEASE II GAA 8. O ic 243 SOO #243500 ISOVALERIC ACIDEMLA; IVA IVD O ic US 2011/0098.193 A1 Apr. 28, 2011 27

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol

2486OO #248600 MAPLE SYRUP URINE DISEASE Type Ia BCKDHA O ic 251 OOO #2S1 OOOMETHYLMALONIC ACIDURIA DUETO MUT O ic METHYLMALONYL-CoA MUTASE DEFICIENCY 2S3260 #2S326O BIOTINIDASE DEFICIENCY BTD O ic 255110 #2SS110 CARNITINE PALMITOYLTRANSFERASE II CPT2 O ic DEFICIENCY, LATE-ONSET 255120 #25512O CARNITINE PALMITOYLTRANSFERASE I CPT1A O ic DEFICIENCY 2585O1 #258501 3-(a)METHYLGLUTACONIC ACIDURIA, TYPE OPA3 O ic III 259900 #25990O HYPEROXALURIA, PRIMARY. TYPE I AGXT O ic 26OOOO #260000 HYPEROXALURIA, PRIMARY. TYPE II GRHPR O ic 271980 #271980 SUCCINIC SEMLALDEHYDE ALDHSA1 O ic DEHYDROGENASE DEFICIENCY 277900 #277900 WILSON DISEASE O ic 6OO649 #6OO649 CARNITINE PALMITOYLTRANSFERASE II O ic DEFICIENCY, INFANTILE 6O2579 #6O2S79 CONGENITAL DISORDER OF MP O ic GLY COSYLATION, TYPE Ib; CDG1B 605899 #605899 GLYCINE ENCEPHALOPATHY: GCE AMT O ic 605899 #605899 GLYCINE ENCEPHALOPATHY: GCE GCSEH O ic 605899 #605899 GLYCINE ENCEPHALOPATHY: GCE GLDC O ic 606812 #606812 FUMARASE DEFICIENCY FH O ic 608836 #608836 CARNITINE PALMITOYLTRANSFERASE II CPT2 O ic DEFICIENCY, LETHAL NEONATAL 6101.98 #6101983-(a)METHYLGLUTACONIC ACIDURIA, TYPE V O ic 610377 #61.0377 MEVALONIC ACIDURIA O ic 250950 #250950 3-(a)METHYLGLUTACONIC ACIDURIA, TYPE I O ic 124OOO #124OOOMITOCHONDRIAL COMPLEXIII DEFICIENCY O ic 124OOO #124OOOMITOCHONDRIAL COMPLEXIII DEFICIENCY O ic 124OOO #124OOOMITOCHONDRIAL COMPLEXIII DEFICIENCY O ic 607091 #607091 CONGENITAL DISORDER OF O ic GLY COSYLATION, TYPE IId: CDG2D 608643 #608643 AROMATICL-AMINOACID DDC O ic DECARBOXYLASE DEFICIENCY 600721 #600721 D-2-(a)HYDROXYGLUTARIC ACIDURIA O ic 210210 #210210 3-(a)METHYLCROTONYL-CoA O ic CARBOXYLASE 2 DEFICIENCY 201475 #201475ACYL-CoA DEHYDROGENASE, VERY LONG ACADVL O ic CHAIN, DEFICIENCY OF 60901S #609015 TRIFUNCTIONAL PROTEINDEFICIENCY HADHA O ic 60901S #609015 TRIFUNCTIONAL PROTEINDEFICIENCY HADHB O ic #6100062-(a)METHYLBUTYRYL-CoA ACADSB O ic DEHYDROGENASE DEFICIENCY 610992 #610992 PHOSPHOSERINE AMINOTRANSFERASE PSAT1 O ic DEFICIENCY 2774OO #2774OOMETHYLMALONIC ACIDURIA AND MMACHC O ic HOMOCYSTINURIA, cblC TYPE 2O1460 #201460 ACYL-CoA DEHYDROGENASE, LONG-CHAIN, ACADL O ic DEFICIENCY OF 22O111 #220111 LEIGH SYNDROME, FRENCH-CANADIAN LRPPRC O ic TYPE; LSFC 261515 #261515 D-BIFUNCTIONAL PROTEINDEFICIENCY O ic 245349 #245349 PYRUVATE DEHYDROGENASE E3-BINDING O ic PROTEINDEFICIENCY 24S400 #245400 LACTIC ACIDOSIS, FATAL INFANTILE SUCLG1 O ic 231530 #231530 3-(a)HYDROXYACYL-CoA DEHYDROGENASE HADH O ic DEFICIENCY 23.7300 #2373OO CARBAMOYL PHOSPHATE SYNTHETASEI CPS1 O ic DEFICIENCY, HYPERAMMONEMIA DUE TO 264470 #26447O PEROXISOMALACYL-CoA OXIDASE ACOX1 O ic DEFICIENCY 26S 120 #265120 SURFACTANT METABOLISM DYSFUNCTION, SFTPB O ic PULMONARY, 1; SMDP1 272,300 #2723OO SULFOCYSTEINURIA SUOX O ic 6O2473 #602473 ENCEPHALOPATHY, ETHYLMALONIC ETHE1 O ic 61OO90 #61009.OPYRIDOXAMINE S-PRIME-PHOSPHATE PNPO O ic OXIDASE DEFICIENCY 6O1847 #6O1847 CHOLESTASIS, PROGRESSIVE FAMILLAL ABCB11 O ic INTRAHEPATIC 2; PFIC2 608799 #608799 CONGENITAL DISORDER OF DPM1 O ic GLY COSYLATION, TYPE Ie; CDG1E US 2011/0098.193 A1 Apr. 28, 2011 28

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol

61OSOS #61OSOS COMBINED OXIDATIVE PHOSPHORYLATION TSFM metabolic DEFICIENCY 3: COXPD3 610768 #610768 CONGENITAL DISORDER OF DOLK GLY COSYLATION, TYPEC Im; CDG1 M 611126 #611126ACYL-CoA DEHYDROGENASE FAMILY, ACAD9 MEMBER 9, DEFICIENCY OF 212O66 #212O66 CONGENITAL OF MGAT2 GLY COSYLATION, 26626S #266265 CONGENITAL O SLC35C1 GLYCOSYLATION, TYP 603147 #603147 CONGENITAL S O R D E R O ALG6 GLYCOSYLATION, TYP 603585 #603585 CONGENITAL O SLC3SA1 GLYCOSYLATION, TYP 6060S6 #6060S6 CONGENITAL O MOGS GLYCOSYLATION, TYP 607330 #60733OLATHOSTEROL o S SCSDL metabolic 60854O #60854O CONGENITAL D S R D E R O ALG1 metabolic GLYCOSYLATION, TYP E o 2362SO #23625OHOMOCYSTINURIA DUETO DEFICIENCY OF MTHFR N(5,10)-METHYLENETETRAHYDROFOLATE 2661SO #2661SOPYRUVATE CA R BOXYLASE DEFICIENCY PC metabolic 2O7900 i2O7900 ARGININOSUCCINIC ACIDURIA ASL metabolic 238,970 #23897O HYPERORNITHINEMIA-HYPERAMMONEMIA SLC25A15 metabolic HOMOCITRULLINURIA SYNDROME 253,270 #2S327O HOLOCARBOXYLASE SYNTHETASE HLCS DEFICIENCY 261600 #261600 PHENYLKETONURIA; PKU PAH metabolic 237310 #237310 N-ACETYLGLUTAMATE SYNTHASE NAGS metabolic DEFICIENCY 212140 #212140 CARNITINE DEFICIENCY, SYSTEMIC SLC22AS PRIMARY: CDSP 251 100 #251100 METHYLMALONIC ACIDURIA, cb|A TYPE MMAA metabolic 2O3750 #2O37SO C-METHYLACETOACETIC ACIDURIA ACAT1 metabolic 219900 #219900 CYSTINOSIS, LATE-ONSET JUVENILE OR CTNS metabolic ADOLESCENT NEPHROPATHICTYPE 23O2OO #23O2OO GALACTOKINASE DEFICI ENCY GALK1. 251110 #251110 METHYLMALONIC ACIDURIA, cb|B TYPE MMAB 608.093 #608.093 CONGENITAL DISORDER OF DPAGT1 l e 8. bolic GLY COSYLATION, TYPE I: CDG1J 232240 #232240 GLYCOGEN STORAGE DISEASE Ic SLC37A4 bolic 229600 +229600 FRUCTOSE INTOLERANCE, HEREDITARY ALDOB bolic 231670 #231670 GLUTARIC ACIDEMIA GCDH bolic 2362OO 2362OOHOMOCYSTINURIA CBS bolic 2486OO #248600 MAPLE SYRUP URINE DISEASE Type III DLD bolic 246450 +246450 3-(a)HYDROXY-3-METHYLGLUTARYL-CoA HMGCL bolic LYASE DEFICIENCY 2486OO 2486OO MAPL E SYRUP URINE DISEASE, CLASSIC, BCKDHB TYPE IB 274270 +274270 DIHYDROPYRIMIDINE DEHYDROGENASE; DPYD DPYD 276,700 +276700 TYROSINEMIA, TYPE I FAH metabolic HYDROXYACYL-CoADEHYDROG ENASEA3 HADHA metabolic KETOACYL-CoATHIOLASEENOYL-CoA HYDRATASE, 603358 #603358 GRACILE SYNDROME 21.2138 21.2138 CARNITINE-ACYLCARNIT TINE TRANSLOCASE DEFICIENCY 30O257 DANON DISEASE LAMP2 metabolic 3.09900 MUCOPOLYSACCHARIDOSIS TYP EII DS metabolic 606612 #606612 MUSCULAR DYSTROPHY, CONGENITAL, 1C: FKRP neurological MDC1C 609528 CEREBRAL DYSGENESIS, NEURO PATHY, SNAP29 neurological ICHTHYOSIS, AND PALMOPLANTAR KERATODERMA 231550 #2315SOACHALASIA-ADDISONIANISM-ALACRIMA AAAS neurological SYNDROME: AAA 25478O #25478O MYOCLONICEPILEPSYO FLAFORA EPM2A neurological 25478O #25478O MYOCLONICEPILEPSYO FLAFORA NHLRC1 neurological 2S4800 #254800MYOCLONICEPILEPSYO FUNVERRICHT CSTB neurological AND LUNDBORG #300067 LISSENCEPHALY,X-LINK ED, 1: LISX1 DCX neurological US 2011/0098.193 A1 Apr. 28, 2011 29

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol Type 30O220 #30O220 MENTAL RETARDATION, X-LINKED, ogica SYNDROMIC 10; #300322 LESCH-NYHANSYNDROME; LNS HPRT1 ogica #3.00352 CREATINE DEFICIENCY SYNDROME, X SLC6A8 ogica LINKED #301835 ARTS SYNDROME: ARTS PRPS1 ogica #3O33SO MASA SYNDROME L1 CAM ogica #304100 CORPUS CALLOSUM, PARTIAL AGENESIS L1 CAM ogica OF, X-LINKED 307OOO #307OOO HYDROCEPHALUS DUE TO CONGENITAL L1 CAM ogica STENOSIS OF AQUEDUCT OF SYLVIUS; HSAS H3O8350 EPILEPT CIC ENCEPHALOPATHY, EARLY ARX ogica INFANTILE, 1 309400 #3094OOMENKE S DISEASE ATP7A ogica 30952O #30952OLUAN FRYNSSYNDROME MED12 ogica 3.1208O #312O8OPELIZA EUS-MERZBACHERDISEASE: PMD PLP1 ogica 312920 #312920 SPASTIC PARAPLEGIA 2, X-LINKED: SPG2 PLP1 ogica 10583.0 #10583 OANGELMAN SYNDROMEAS MECP2 ogica 30O243 #30O243 MENTA L RETARDATION, X-LINKED, SLC9A6 ogica SYNDROMIC, C HRISTIANSON 3OOS23 #3 OOS23 ALLAN-HERNDON-DUDLEY SYNDROME SLC16A2 ogica AHDS i2O67OOANIRID A, CEREBELLARATAXIA, AND PAX6 ogica MENTAL DEFIC ENCY 216550 #216SSO COHEN SYNDROME: COH1 VPS13B ogica 225750 #225750 AICARDI-GOUTIERESSYNDROME 1: AGS1 TREX1 ogica 2521.50 DENUM COFACTORDEFICIENCY MOCS1 ogica 2521.50 DENUM COFACTORDEFICIENCY MOCS2 ogica 212720 #21272O MARTSOLFSYNDROME RAB3GAP2 ogica 241410 #24141 O HYPOPARATHY ROIDISM-RETARDATION TBCE ogica DYSMORPHISM SYNDROME; HRD 2S328O #253280 MUSCL E-EYE-BRAIN DISEASE: MEB FKRP ogica 2S328O #253280 MUSCL E-EYE-BRAIN DISEASE: MEB POMGNT1 ogica 271930 #271930 STRIATONIGRALD EGENERATION, NUP62 ogica INFANTILE: SNDI 312750 RETTSYNDROME; RTT MECP2 ogica NA X-linked mental retardation KIAA2O22 ogica NA X-linked mental retardation NXF5 ogica NA X-linked mental retardation RPL10 ogica NA X-linked mental retardation ZCCHC12 ogica NA X-linked mental retardation ZMYM3 ogica NA Autosomal mental retardation ST3GAL3 ogica NA Autosomal mental retardation ZC3H14 ogica NA Autosomal mental retardation SRDSA3 ogica NA Autosomal mental retardation NSUN2 ogica NA Autosomal mental retardation ZNF526 ogica NA Autosomal mental retardation BOD1 ogica 309548 MENTAL RETARDATION X-LINK E DASSOCIATED AFF2 ogica WITH FRAGILE SITE FRAXE 309530 MENTAL RDATION X-LINK D 1; MRX1 ogica 3O3600 COFFIN-L SYNDROME: CL ogica 30O803 MENTAL DATION X-LINK s ogica RELATED MENTAL TARDATION X-LIN DSYP-RELATED ogica MENTAL ARDATION X-LIN D SYNDROMIC ogica ZDHHC9 ATE MENTAL ET ARDATION AND MI ROCEPHALY CASK ogica WITHP O NT INE AND CEREBELLAR HYPOPLASIA 3OO716 MENTAL RET D 95; MRX95 MAGT1 ogica 300706 MENTAL RET D SYNDROMIC HUWE1 ogica TURNERTYPE 3OO639 MENTAL RETARDATION X-LINK E D WITH CUL4B ogica BRACHYDACTYLY AND MACROGLOSSIA 3OO607 HYPEREKPLEXLA AND EPILE ARHGEF9 ogica 300573 MENTAL R D 92; MRX92 ZNF674 ogica 30O271 MENTAL A. gN X. D 72: MRX72 RAB39B ogica 3OO189 MENTAL N D 90; MRX90 DLG3 ogica 3OOO88 EPILEPSY ED WITH MENTAL PCDH19 ogica RETARDAT ON: 300075 MENTAL s N K E D 19 INCLUDED; RPS6KA3 ogica MRX19 ED MENTAL R ET A RDA CION X-LINKEDN 88; MRX88 AGTR2 ogica US 2011/0098.193 A1 Apr. 28, 2011 30

TABLE 1. 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol Type 31218O MENTAL RETARDATION X-LINKED SYNDROMIC UBE2A ogica UBE2A-RELATED 314995 MENTAL RETARDATION X-LINKED 89; MRX89 ZNF41 ogica 6.13.192 MENTAL RETARDATION AUTOSOMAL RECESSIVE TRAPPC9 ogica 13; MRT13 61.1092 MENTAL RETARDATION AUTOSOMAL RECESSIVE 6: GRIK2 ogica 61.1093 MENTAL RETARDATION AUTOSOMAL RECESSIVE 7: TUSC3 ogica MRTT 2688OO #2688OO SANDHOF FDISEASE HEXB ogica 223.900 #223900 NEUROPATHY, HEREDITARY SENSORY AND IKBKAP ogica AUTONOMIC. TYP E III: HSAN3 133S4O #133540 COCKAYNE SYNDROME, TY B; CSB ERCC6 ogica 204200 #204200 CEROID L. POFUSCINOSIS, N RONAL, 3; CLN3 ogica CLN3 2O4SOO #204SOO CEROID L. POFUSCINOSIS, N E URONAL, 2: TPP1 ogica CLN2 216400 #216400 COCKAYNE SYNDROME, TY PEA; CSA ERCC8 ogica 2488OO #2488OO MARINESCO-SOGRENSYNDROME SIL1 ogica 256730 #256.730 CEROID L. POFUSCINOSIS, N EURONAL, 1: PPT1 ogica CLN1 256731 #2S6731 CEROID L. POFUSCINOSIS, N EURONAL, 5; CLNS ogica CLNS 600143 #600 143 CEROID L. POFUSCINOSIS, N E U RONAL, 8; CLN8 ogica CLN8 6O1780 #6O1780 CEROID L. POFUSCINOSIS, N E U RONAL, 6; CLN6 ogica CLN6 61OOO3 #61OOO3 CEROID POFUSCINOSIS, N E U RONAL, 8, CLN8 ogica NORTHERN EPIL PSYWARLANT 61O127 #61O127 CEROID POFUSCINOSIS, N EURONAL, 10; CTSD ogica CLN10 610951 #610951 CEROID L. POFUSCINOSIS, N EURONAL, 7; MFSD8 ogica CLN7 2O3700 ALPERS DIFFUSE DEGENERATION O F CEREBRAL POLG ogica GRAY MATTER WITH HEPATIC CIRRHOSIS 249900 #24990O METACHROMATICLEUKODYSTROPHYDUE PSAP ogica TO SAPOSIN B DE FICIENCY 271245 #271245 INFANTIL E-ONSET SPINOCE REBELLAR C10ORF2 ogica ATAXIA: IOSCA 608804 #608804 LEUKODYSTROPHY, HYPOMYELINATING, 2 GC2 ogica 61OS32 #610532 LEUKODYSTROPHY, HYPOMYELINATING, 5 FAM126A ogica 2342OO #2342OONEURODEGENERATION WIT HBRAINIRON PANK2 ogica ACCUMULATION 277460 #277460 VITAMINE, FAMILIALISOLATED TTPA ogica DEFICIENCY OF; VED 205100 #205100 AMYOTROPHIC LATERAL SCLEROSIS 2, ALS2 ogica UVENILE: ALS2 2.70550 #270550 SPASTICATAXIA, CHARLEVOIX-SAGUENAY ogica TYPE SACS 606353 #6063S3 PRIMARY LATERAL SCLEROSIS, JUVENILE: ogica PLSJ 61.1067 #611067 SPINALM USCULARATROPH Y, DISTAL, ogica AUTOSOMAL REC ESSIVE, 4; DSMA 270200 #2702OO SOGREN LARSSON SYNDROME: SLS ogica 3OO623 FRAGILE XTREMORATAXIASYNDROME: FXTAS ogica 609560 #609560 MITOCHONDRIAL DNA DEPLETION ogica SYNDROME,MYO PATHICFORM 301830 #301830 SPINALM USCULARATROPH Y,X-LINKED 2: ogica SMAX2 218OOO #218OOOAGENESIS OF THE CORPUS CALLOSUM SLC12A6 ogica WITH PERIPHERA L NEUROPATHY: ACCPN 2S3300 #2S3300 SPINALM USCULARATROPH Y, TYPE ISMA1 SMN ogica 256.030 #2S6030 NEMALIN E MYOPATHY 2; NEM2 NEB ogica 602771 #602771 RIGID SPINE MUSCULAR DYSTROPHY 1: SEPN1 ogica RSMD1 605355 #6053SS NEMALIN E MYOPATHY5; NEM5 TNNT1 ogica 6O4320 #60432O SPINALM USCULARATROPH Y, DISTAL, IGHMBP2 ogica AUTOSOMAL REC ESSIVE, 1; DSMA 2.53550 #2S35SO SPINALM USCULARATROP Y, TYPE II: SMN ogica SMA2 607855 #607855 MUSCULAR DYSTROPHY, CONGENITAL LAMA2 ogica MEROSIN-DEFICIENT, 1A: MDC1A US 2011/0098.193 A1 Apr. 28, 2011 31

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol Type

60884O #608840 MUSCULAR DYSTROPHY, CONGENITAL, LARGE le:O ogica TYPE 1D 2S3400 #2S3400 SPINAL MUSCULARATROPHY, TYPE III: SMN1 le:O ogica SMA3 236670 #236670 WALKER-WARBURG SYNDROME:WWS POMT1 le:O ogica 236670 #236670 WALKER-WARBURG SYNDROME:WWS POMT2 le:O ogica 3OO489 SPINAL MUSCULARATROPHY DISTALX-LINKED 3; ATP7A le:O ogica SMAX3 310200 #310200 MUSCULAR DYSTROPHY, DUCHENNETYPE; le:O ogica 253800 #25380OFUKUYAMA CONGENITAL MUSCULAR ogica DYSTROPHY, FCMD 3104.00 #310400 MYOTUBULAR MYOPATHY 1; MTM1 MTM1 ogica 145900 #14590O HYPERTROPHICNEUROPATHY OF EGR2 ogica DETERINE-SOTTAS. CMT3, CMT4F 145900 #14590O HYPERTROPHICNEUROPATHY OF ogica DETERINE-SOTTAS. CMT3, CMT4F 145900 #14590O HYPERTROPHICNEUROPATHY OF PMP22 ogica DETERINE-SOTTAS. CMT3, CMT4F 145900 #14590O HYPERTROPHICNEUROPATHY OF PRX ogica DETERINE-SOTTAS. CMT3, CMT4F #300004 CORPUS CALLOSUM, AGENESIS OF, WITH ARX ogica ABNORMAL GENITALIA #300673 ENCEPHALOPATHY, NEONATAL SEVERE, MECP2 ogica DUE TO MECP2 MUTATIONS 308930 #308930 LEIGH SYNDROME, X-LINKE PDHA1 ogica 2O892O EARLY-ONSET, WITH APTX ogica OCULOMOTOR APRAXIA AND HYPOALBUMINEMIA: #2SO1OO METACHROMATICLEUKODYSTROPHY ARSA ogica #256.600 NEUROAXONAL DYSTROPHY, INFANTILE: PLA2G6 ogica INAD1 272800 #272800 TAY-SACHS DISEASE: TSD HEXA ogica 604004 #604004 MEGALENCEPHALIC MILC1 ogica LEUKOENCEPHALOPATHY WITHSUBCORTICAL CYSTS; MLC 605253 NEUROPATHY, CONGENITAL HYPOMYELINATING EGR2 ogica CHARCOT-MARIE-TOOTH DISEASE, TYPE 4E 605253 NEUROPATHY, CONGENITAL HYPOMYELINATING MPZ ogica CHARCOT-MARIE-TOOTH DISEASE, TYPE 4E 6O7426 #607426 COENZYME Q10 DEFICIENCY APTX ogica 6O7426 #607426 COENZYME Q10 DEFICIENCY CABC1 ogica 6O7426 #607426 COENZYME Q10 DEFICIENCY COQ2 ogica 6O7426 #607426 COENZYME Q10 DEFICIENCY PDSS1 ogica 6O7426 #607426 COENZYME Q10 DEFICIENCY PDSS2 ogica 608629 #608629 JOUBERT SYNDROME 3: JBTS3 AHI1 ogica 609311 #609311 CHARCOT-MARIE-TOOTH DISEASE, TYPE 4H; FGD4 ogica 609.583 #609583 JOUBERT SYNDROME 4: JBTS4 NPHP1 ogica 6101.88 #610188 JOUBERT SYNDROME 5: JBTS5 CEP290 ogica 610688 #610688 JOUBERT SYNDROME 6; JBTS6 TMEM67 ogica 611722 #611722 KRABBE DISEASE, ATYPICAL DUE TO PSAP ogica SAPOSINADE CIENCY 251880 #251880 MITOC ONDRIAL DNA DEPLETION C10ORF2 ogica SYNDROME, H PATOCEREBRAL FORM 251880 #251880 MITOC ONDRIAL DNA DEPLETION DGUOK ogica SYNDROME, H PATOCEREBRAL FORM 251880 #251880 MITOC ONDRIAL DNA DEPLETION MPV17 ogica SYNDROME, H PATOCEREBRAL FORM 256810 #2S681ONAVAIO NEUROHEPATOPATHY: NN MPV17 ogica 214450 #2144SO GRISCE LLISYNDROME, TYPE 1: GS1 MYOSA ogica 256710 #256.71 OELEJAL DE DISEASE MYOSA ogica 23OSOO #2305.00 GM1-GANGLIOSIDOSIS, TYPE I GLB1 ogica 2S6800 #2S6800 INSENS TIVITY TO PAIN, CONGENITAL, NTRK1 ogica WITHANHIDRO SIS; CIPA 6090S6 #6090S6 AMISH NFANTILEEPILEPSYSYNDROME ST3GALS ogica 609304 #609304 EPILEPTIC ENCEPHALOPATHY, EARLY SLC25A22 ogica INFANTILE, 3 224OSO CEREBELLAR YPOPLASIA AND MENTAL VLDLR ogica RETARDATION WITHOR WITHOUT QUADRUPEDAL 225753 #225753 PONTOCEREBELLAR HYPOPLASIA TYPE 4; TSEN54 ogica PCH4 US 2011/0098.193 A1 Apr. 28, 2011 32

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol Type 277470 #277470 PONTOCEREBELLAR HYPOPLASIA TYPE 2A; TSEN54 neurological 606369 #606369 EPILEPTIC ENCEPHALOPATHY, LENNOX MAPK10 neurological GASTAUT TYPE 61.1726 #611726 EPILEPSY, PROGRESSIVE MYOCLONIC 3; KCTD7 neurological 612164 #612164 EPILEPTIC ENCEPHALOPATHY, EARLY STXBP1 neurological INFANTIL E, 4 30O804 JOUBERT SYNDROME 10; JBTS10 OFD1 neurological 300049 HETEROTOPIA PERIVENTRICULARX-LINKED FLNA neurological DOMINANT 610828 HOLOPROSENCEPHALY 7: HPE7 PTCH1 neurological 217400 #217400 CORNEAL DYSTROPHY AND PERCEPTIVE SLC4A11 ocular DEAFNES 276900 #276900 U SHERSYNDROMETYPEI MYO7A ocular 276901 #276901 U SHERSYNDROME, TYPE IIA, USH2A USH2A ocular 276904 #276904 U SHERSYNDROME, TYPE IC; USH1C USH1C ocular 6O1067 #6O1067 U SHERSYNDROME, TYPE ID; USH1D CDH23 ocular 60S472 #605472U SHERSYNDROME, TYPE IIC; USH2C GPR98 ocular 606943 #606943 U SHERSYNDROME, TYPE IG; USH1G USH1G ocular 30O216 COATS DISEASE NDP ocular 2O378O #203780 ALPORTSYNDROME, AUTOSOMAL COL4A3 Cl3 RECESSIVE #203780 ALPORTSYNDROME, AUTOSOMAL COL4A4 Cl3 RECESSIVE 2632OO #263,200 POLYCYSTIC KIDNEY DISEASE, PKHD1 Cl3 AUTOSOMAL RECESSIVE: ARPKD 6064O7 #6064.07 HYPOTONIA-CYSTINURIA SYNDROME Cl3 6064O7 #6064.07 HYPOTONIA-CYSTINURIA SYNDROME Cl3 609049 #609049 PIERSON SYNDROME Cl3 2412OO #241200 BARTTERSYNDROME, ANTENATAL, TYPE 2 Cl3 256100 #2561OON PHRONOPHTHISIS 1; NPHP1 Cl3 256370 #25637ON PHROTIC SYNDROME, EARLY-ONSET, Cl3 WITHDIF USE MESANGIAL SCLEROSIS 267430 #26743OR ENALTUBULAR DYSGENESIS; RTD Cl3 267430 #26743OR ENALTUBULAR DYSGENESIS; RTD Cl3 267430 #26743OR ENALTUBULAR DYSGENESIS; RTD AGTR1 Cl3 267430 #26743OR ENALTUBULAR DYSGENESIS; RTD Cl3 6O2O88 #602088 N EPHRONOPHTHISIS 2: NPHP2 Cl3 208.540 #2O854OR ENAL-HEPATIC-PANCREATIC DYSPLASIA: NPHP3 Cl3 RHPD 248.190 #248.190 HYPOMAGNESEMIA, RENAL, WITH OCULAR CLDN19 Cl3 INVOLVEMENT 2S63OO EPHROSIS 1, CONGENITAL FINNISH TYPE; NPHS1 Cl3

266900 #266900 SEN OR-LOKENSYNDROME 1: SLSN1 NPHP1 Cl3 6092S4 #6092S4SEN OR-LOKENSYNDROME 5: SLSNS IQCB1 Cl3 610725 #610725N EP H ROTIC SYNDROME, TYPE 3: NPHS3 PLCE1 Cl3 606966 #606966 N EPH RONOPHTHISIS 4: NPHP4 NPHP4 Cl3 6O1678 #601678 BARTTERSYNDROME, ANTENATAL, TYPE 1 SLC12A1 Cl3 600995 #600995N EPHROTIC SYNDROME, STEROID NPHS2 Cl3 RESISTANT, AUTOSOMAL RECESSIVE: SRN1 2643SO #264350 PSEUDOHYPOALDOSTERONISM, TYPE I, SCNN1A Cl3 AUTOSOMAL RECESSIVE: PHA 2643SO #264350 PSEUDOHYPOALDOSTERONISM, TYPE I, Cl3 AUTOSOMAL RECESSIVE: PHA 2643SO #264350 PSEUDOHYPOALDOSTERONISM, TYPE I, AUTOSOMAL RECESSIVE: PHA 2197OO #219700 CYSTIC FIBROSIS; CF CFTR respiratory 6088OO #6088OOSUDDEN INFANT DEATH WITH DYSGENESIS TSPYL1 respiratory OF THE TESTES SYNDROME; SIDDT #265450 PULMONARY VENOOCCLUSIVE DISEASE; BMPR2 respiratory PVOD 2651OO #2651OOPULMONARYALVEOLAR MICROLITHIASIS SLC34A2 respiratory 26S380 #265380 PULMONARY HYPERTENSION, FAMILIAL CPS1 respiratory PERSISTENT OF THE NEWBORN 2674SO ESPIRATORY DISTRESS SYNDROME IN SFTPA1 respiratory PREMATURE INFANTS 2674SO ESPIRATORY DISTRESS SYNDROME IN SFTPB respiratory PREMATURE INFANTS 2674SO ESPIRATORY DISTRESS SYNDROME IN SFTPC respiratory PREMATURE INFANTS US 2011/0098.193 A1 Apr. 28, 2011 33

TABLE 1 1-continued

X-Linked Recessive and Autosomal Recessive Disease Genes

OMIM # Name Symbol 226980 #226980 EPIPHYSEAL DYSPLASIA, MULTIPLE, WITH EARLY ONSET DIABETES MELLITUS 2364.90 #236490 HYALINOSIS, INFANTILE SYSTEMIC ANTXR2 SKCIE8 24.1510 #24151O HYPOPHOSPHATASLA, CHILDHOOD ALPL SKCIE8 600972 #600972 ACHONDROGENESIS, TYPE IB; ACG1B SLC26A2 SKCIE8 610854 #610854 OSTEOGENESIS IMPERFECTA, TYPE IIB CRTAP SKCIE8 241S2O #24152O HYPOPHOSPHATEMIC RICKETS, DMP1 SKCIE8 AUTOSOMAL RECESSIVE 277440 #277440 VITAMIN D-DEPENDENT RICKETS, TYPE II WDR SKCIE8 6O1559 #6O1559 STUVE-WIEDEMANNSYNDROME LIFR SKCIE8 215045 #215045 CHONDRODYSPLASIA, BLOMSTRAND TYPE; PTH1R SKCIE8 BOLD 231 OSO #231 OSO GELEOPHYSIC DYSPLASIA ADAMTSL2 Skeleta 2O7410 #207410 ANTLEY-BIXLERSYNDROME: ABS FGFR2 SKCIE8 215140 HYDROPS-ECTOPIC CALCIFICATION-MOTH-EATEN LBR SKCIE8 SKELETAL DYSPLASIA 25972O OSTEOPETROSIS, AUTOSOMAL RECESSIVE 5: OPTB5 OSTM1 SKCIE8 25973O OSTEOPETROSIS, AUTOSOMAL RECESSIVE 3: OPTB3 CA2 SKCIE8 259770 OSTEOPOROSIS-PSEUDOGLIOMASYNDROME; OPPG LRP5 SKCIE8 277300 SPONDYLOCOSTAL DYSOSTOSIS, AUTOSOMAL DLL3 SKCIE8 RECESSIVE 1: SCDO1 607095 ANAUXETIC DYSPLASIA RMRP SKCIE8 210600 SECKEL, SYNDROME 1 ATR SKCIE8 224410 DYSSEGMENTAL DYSPLASLA, SILVERMAN HSPG2 SKCIE8 HANDMAKERTYPE; DDSH 228930 FIBULARAPLASIA OR HYPOPLASLA, FEMORAL WNTTA SKCIE8 BOWING AND POLY-, SYN-, AND 2597OO OSTEOPETROSIS, AUTOSOMAL RECESSIVE 1: OPTB1 TCIRG1 SKCIE8 259775 RAINE SYNDROME; RNS FAM20C SKCIE8 2692SO SCHNECKENEBECKENDYSPLASIA SLC3SD1 Skeletal 27.682O ULNA AND FIBULA, ABSENCE OF WITH SEVERE WNTTA Skeletal LIMB DEFICIENCY 610915 OSTEOGENESIS IMPERFECTA, TYPE VIII LEPRE1 Skeletal 239000 PAGET DISEASE, JUVENILE TNFRSF11B Skeletal 215150 OTOSPONDYLOMEGAEPIPHYSEAL DYSPLASLA; COL11A2 Skeletal OSMED 215150 OTOSPONDYLOMEGAEPIPHYSEAL DYSPLASLA; COL2A1 Skeletal OSMED

0209 ii. DNA Samples gous, heterozygous, compound heterozygous and hemizy 0210 Target enrichment was performed with 104 DNA gous DM states. Twenty six samples were well-characterized, samples obtained from the Coriell Institute (Camden, N.J.) (Table 13). Seventy six of these were carriers or affected by 37 from “normal individuals, and two had previously under severe, childhood recessive disorders. The latter samples con gone genome sequencing. In Table 13, the following apply: 1 tained 120 known DMs in 34 genes (63 substitutions, 20 refers to SureSelect, library 1, 2 refers to SureSelect, library indels, 13 gross deletions, 19 splicing, 2 regulatory and 3 design 2: 3 refers to Rainlance; 4 refers to Illumina GAIIx complex DMs). These samples also represented homozy SBS; 5 refers to: 53 SBL, and 6 refers to Illumina 6 2000.

Coriell Selection Sequencing DNA.ii Method Method Description OMIM if Gene

NAO282S 4 ADA DEFICIENCY 1027OO ADA NAO282S 4 ADA DEFICIENCY 1027OO ADA NAO2471 6 ADA DEFICIENCY 1027OO ADA NAO2471 6 ADA DEFICIENCY 1027OO ADA NAO2756 6 ADA DEFICIENCY 1027OO ADA NAO2756 6 ADA DEFICIENCY 1027OO ADA NAOS816 6 ADA DEFICIENCY 1027OO ADA NAOS816 6 ADA DEFICIENCY 1027OO ADA NAO2O57 4 ASPARTYLGLUCOSAMINURIA 2O84OO AGA NAO2O57 4 ASPARTYLGLUCOSAMINURIA 2O84OO AGA NA10641 6 SJOGREN-LARSSON 2702OO ALDH3A2 SYNDROME NAOOOS9 1 6 CANAVANDISEASE 271900 ASPA NAO4268 6 CANAVANDISEASE 271900 ASPA US 2011/0098.193 A1 Apr. 28, 2011 34

-continued

NA18929 2 6 CANAVANDISEASE 271900 ASPA NA13669 4, 5, 6 MENKES SYNDROME 3094OO ATP7A NA13672 1 & 2 4, 5, 6 MENKES SYNDROME 3094OO ATP7A NA13668 1 & 2 4, 5, 6 MENKES SYNDROME 3094OO ATP7A NA13674 4, 5, 6 MENKES SYNDROME 3094OO ATP7A NA13675 4, 5, 6 MENKES SYNDROME 3094OO ATP7A NAO 1982 2 6 MENKES SYNDROME 3094OO ATP7A NAO0649 1,3 4 MAPLE SYRUP URINE 248600 BCKDHA DISEASE Type Ia NAO0649 1,3 4 MAPLE SYRUP URINE 248600 BCKDHA DISEASE Type Ia NA18803 1,3 4 CYSTICFIBROSIS 219700 CFTR NA18803 1,3 4 CYSTICFIBROSIS 219700 CFTR NA18668 1,3 4 CYSTICFIBROSIS 219700 CFTR NA18668 1,3 4 CYSTICFIBROSIS 219700 CFTR NA11277 1,3 4 CYSTICFIBROSIS 219700 CFTR NA11496 6 CYSTICFIBROSIS 219700 CFTR NA11472 2 6 CYSTICFIBROSIS 219700 CFTR NA11472 2 6 CYSTICFIBROSIS 219700 CFTR NA2O836 2 6 CYSTICFIBROSIS 219700 CFTR NA13591 1,3 4 CYSTICFIBROSIS 219700 CFTR NA13591 1,3 4 CYSTICFIBROSIS 219700 CFTR NA2O381 1 & 4, 6 NEURONAL CEROID 2042OO CLN3 2, 3 LIPOFUSCINOSIS - 3 NA2O381 1 & 4, 6 NEURONAL CEROID 2042OO CLN3 2, 3 LIPOFUSCINOSIS - 3 NA2O382 1 & 4, 6 NEURONAL CEROID 2042OO CLN3 2, 3 LIPOFUSCINOSIS - 3 NA2O382 1 & 4, 6 NEURONAL CEROID 2042OO CLN3 2, 3 LIPOFUSCINOSIS - 3 NA2O383 1 & 4, 6 NEURONAL CEROID 2042OO CLN3 2, 3 LIPOFUSCINOSIS - 3 NA2O383 1 & 4, 6 NEURONAL CEROID 2042OO CLN3 2, 3 LIPOFUSCINOSIS - 3 NA2O384 1 & 4, 6 NEURONAL CEROID 2042OO CLN3 2, 3 LIPOFUSCINOSIS - 3 NA2O384 1 & 4, 6 NEURONAL CEROID 2042OO CLN3 2, 3 LIPOFUSCINOSIS - 3 NAO3193 2 6 DYSKERATOSIS 3OSOOO DKC1 CONGENITA, X-LINKED NAO4364 2 6 MUSCULAR DYSTROPHY, 310200 DMD DUCHENNETYPE NAOSO22 2 6 MUSCULAR DYSTROPHY, 310200 DMD DUCHENNETYPE NAO3S42 2 6 XERODERMA 27876O ERCC4 PIGMENTOSUM, COMP. GROUPF NAO3S42 2 6 XERODERMA 27876O ERCC4 PIGMENTOSUM, COMP. GROUPF NAO1712 2 6 COCKAYNE SYNDROME, 216400 ERCC6 TYPEB NAO1712 2 6 COCKAYNE SYNDROME, 216400 ERCC6 TYPEB NAO 1464. 1, 3 4 GLYCOGEN STORAGE 2323OO GAA DISEASE II NAO 1464. 1, 3 4 GLYCOGEN STORAGE 2323OO GAA DISEASE II NAO1935 1,3 4 GLYCOGEN STORAGE 2323OO GAA DISEASE II NAO1935 1,3 4 GLYCOGEN STORAGE 2323OO GAA DISEASE II NAOO244 2 6 GLYCOGEN STORAGE 2323OO GAA DISEASE II NAOO244 2 6 GLYCOGEN STORAGE 2323OO GAA DISEASE II NA12932 2 6 GLYCOGEN STORAGE 2323OO GAA DISEASE II NA12932 2 6 GLYCOGEN STORAGE 2323OO GAA DISEASE II NAO1210 2 6 GALACTOSEMIA 23O4OO GALT NA1743S 2 6 GALACTOSEMIA 23O4OO GALT NA1743S 2 6 GALACTOSEMIA 23O4OO GALT NAOO852 2 6 GAUCHERDISEASE, TYPE I 231 OOO GBA NAOO852 2 6 GAUCHERDISEASE, TYPE I 231 OOO GBA NAO4394 2 6 GAUCHERDISEASE, TYPE I 23O800 GBA US 2011/0098.193 A1 Apr. 28, 2011 35

-continued NAO4394 2 6 GAUCHERDISEASE, TYPE I 23O800 GBA NAO1260 2 6 GAUCHERDISEASE, TYPE II 230900 GBA NAO1260 2 6 GAUCHERDISEASE, TYPE II 230900 GBA NAO 1031, 2 6 GAUCHERDISEASE, TYPE 231 OOO GBA III NAOSOO2 2 6 GLUTARIC ACIDEMIAI 231670 GCDH NAOSOO2 2 6 GLUTARIC ACIDEMIAI 231670 GCDH NA16392 2 6 GLUTARIC ACIDEMIAI 231670 GCDH NA02013 1,3 4 MUCOLIPIDOSIS II of B 252500 GNPTAB NA02013 1,3 4 MUCOLIPIDOSIS II of B 252500 GNPTAB NAO3066 2 6 MUCOLIPIDOSIS II of B 252500 GNPTAB NAO3066 2 6 MUCOLIPIDOSIS II of B 252500 GNPTAB NA10798 2 6 HEMOGLOBIN--OLOCUS 1 1418OO B 1 NAO7406 1,3 4 B-PLUS-THALASSEMIA 141900 NAO7406 1,3 4 B-PLUS-THALASSEMIA 141900 NAO7426 1,3 4 B-ZERO-THALASSEMLA 141900 NAO7426 1,3 4 B-ZERO-THALASSEMLA 141900 NAO74O7 6 HEMOGLOBIN-B LOCUS 141900 NAO74O7 6 HEMOGLOBIN-B LOCUS 141900 NA16643 2 6 HEMOGLOBIN-B LOCUS 141900 NAO3575 1,3 4 TAY-SACHS DISEASE 2728OO NAO3575 1,3 4 TAY-SACHS DISEASE 2728OO NAO9787 2 6 TAY-SACHS DISEASE 2728OO NAO 6804 6 LESCEH-NYHANSYNDROME 3OO322 NAO7092 6 LESCEH-NYHANSYNDROME 3OO322 NAO1899 2 6 LESCEH-NYHANSYNDROME 3OO322 NAO929S 2 6 HEREDITARY SENSORY & 223900 AUTONOMICNEUROPATHY 3 NAO929S 2 6 GAUCHERDISEASE, TYPE I 223900 GBA NA02075 1,3 4 CHEDIAK-HIGASHI 214500 LYST SYNDROME NAO336S 6 CHEDIAK-HIGASHI 214500 LYST SYNDROME NA02533 1,3 4 MUCOLIPIDOSIS IV 252650 MCOLN1 NA02533 1,3 4 MUCOLIPIDOSIS IV 252650 MCOLN1 NA16382 1,3 4 RETTSYNDROME 312750 MECP2 NA1754.O 2 6 RETTSYNDROME 312750 MECP2 NA11110 1,3 4 PHENYLKETONURIA 2616OO PAH NA11110 1,3 4 PHENYLKETONURIA 2616OO PA NAOOOO6 2 6 PHENYLKETONURIA 2616OO PA NAOOOO6 2 6 PHENYLKETONURIA 2616OO PA NAO156S 2 6 PHENYLKETONURIA 2616OO PA NAO156S 2 6 PHENYLKETONURIA 2616OO PA NA1343S 4, 5, 6 PELIZAEUS-MERZBACHER 312O8O PL DISEASE NA13434 6 PELIZAEUS-MERZBACHER 312O8O PL C DISEASE NA16081 1,3 4 NEURONAL CEROID 256730 PPT LIPOFUSCINOSIS - 1 NA16081 1,3 4 NEURONAL CEROID 256730 PPT LIPOFUSCINOSIS - 1 NA20379 1,3 4 NEURONAL CEROID 256730 PPT LIPOFUSCINOSIS - 1 NA20379 1,3 4 NEURONAL CEROID 256730 PPT LIPOFUSCINOSIS - 1 NAO358O 2 6 PROTEASE INHIBITOR1 107400 SERPINA1 NAO0879 2 6 MUCOPOLYSACCHARIDOSIS 252900 SGSH TYPE IIIA NAO0879 2 6 MUCOPOLYSACCHARIDOSIS 252900 SGSH TYPE IIIA NAO1881 2 6 MUCOPOLYSACCHARIDOSIS 252900 SGSH TYPE IIIA NAO1881 2 6 MUCOPOLYSACCHARIDOSIS 252900 SGSH TYPE IIIA NAO3813 1 6 SPINAL MUSCULAR 25.3300 SMN1 ATROPHY, TYPE I NA16193 1,3 4 NIEMANN-PICK DISEASE, 6O7616 SMPD1 TYPEB NA16193 1,3 4 NIEMANN-PICK DISEASE, 6O7616 SMPD1 TYPEB NA16193 1,3 4 GAUCHERDISEASE, TYPE I 223900 GBA NAO 1960 6 FAMILIALISOLATED 277460 TTPA DEFICIENCY OF VITAMINE NAO9069 2 6 USHERSYNDROME, TYPE 276904 USH1C IC

US 2011/0098.193 A1 Apr. 28, 2011 40

-continued

NA1SO36 NA1S215 NA15223 NA15224 NA1S236 NA15245 NA15510 twin 0001 twin0101 NA19193 NA19130 NA1912O NA19171 NA18912 NA18517

Discovered Coriell differing HGMD DNA.ii mutation accession # Notes

NAO282S NAO282S NAO2471 NAO2471 NAO2756 NAO2756 NAOS816 phenotypically normal NAOS816 phenotypically normal NAO2O57 misannotated: homozygous non disease causing polymorphism linked with C163 mutation in 98% of cases NAO2O57 misannotated: homozygous NA10641 Detected in 1 read NAOOOS9 clinically affected; second mutation not annotated NAO4268 NA18929 NA13669 NA13672 NA13668 NA13674 NA13675 NAO1982 NAOO649 NAOO649 NA18803 NA18803 NA18668 NA18668 NA11277 NA11496 uniparental disomy NA11472 NA11472 NA2O836 NA13591 NA13591 NA2O381 NA2O381 NA2O382 NA2O382 NA2O383 NA20383 exon 11, CMOO3663 misannotated: correct c.1020G > T, location, different E295X, SNS chr16: 284O1322 G > T NA2O384 US 2011/0098.193 A1 Apr. 28, 2011 41

-continued

NAO 31.93 NAO4364 NAO 5022 No mutation likely de novo; absent in sample (mother of proband) NAO annotated mutation absent (0/130 reads) NAO annotated mutation absent (O/166 reads) NAO 712 exon 17, CD982624 missanotated; actual c.3536delA, mutation 1 bp over Y1179fs, chr10:5034847 6.delA NAO 712 exon 8, unlisted cDNA analysis c.1990C > T, annotated only

NAO 464 NAO 464 exon 17, unlisted clinically affected c.2544delC, p.K849fs, chr17: 7S706649eC

NAO 210 NA1 7435 NA1 7435 Duarte variant (clinically normal) NAO O852 listed Gaucher type III; mutation is type I NAO NAO4394 exon 8, misannotated c.1171G > C, p.V391 L, chr1: 1534.72713G > C NAO4394 NAO 1260 NAO 1260 NAO 1031 NAO SOO2 NAO SOO2 NA1 6392 NAO 2013 NAO 2013 NAO 3066 NAO 3066 NA1 O798 NAO 74O6 NAO 74O6 NAO 7426 NAO 7426 NAO 74O7 NAO 74O7 NA1 6643 exon 2, unlisted misan noted E102D, chr11: S2O4392G > C NAO 3575 NAO 3575 NAO 9787 second mutation not reported NAO 6804 NAO 7092 intron 8, IVS8 + cDNA annotated 1 4delGTAA, only; actual mutation chrX: 133460381. 13346.0384delGTAA is 4bp del NAO 1899 intron 8, IVS8 - misannotated; actual 2A >T, mutation is splice chrX: 133461724A >T site Substitution, transcription restarts at cryptic splice site NAO 92.95 NAO 92.95 US 2011/0098.193 A1 Apr. 28, 2011 42

-continued

Homozygous (20/22 reads) NAO2S33 NA16382 X Dominant NA17540 X Dominant NA11110 NA11110 NAOOOO6 NAOOOO6 NAO1565 NAO1565 NA1343S disease-causing mutation not annotated NA13434 NA16081 NA16081 NA2O379 NA2O379 NAO358O NAOO879 NAO0879 exon 8, CD972442 misannotated: c.1079delC, annotated mutation p.V361fs, absent chr17: 7S799276deC NAO1881 NAO1881 NAO3813 NA16193

synonymous; creates a novel splice site

0211 iii Target Enrichment and Sequencing by Synthesis protocols. Following adapter ligation, Illumina libraries were (SBS) prepared with AMPure bead—(Beckman Coulter, Danvers, 0212 For Illumina GAIIx SBS (San Diego, Calif.), 3 ug Mass.) rather than gel-purification. Library quality was DNA was sonicated by Covaris S2 (Woburn, Mass.) to ~250 assessed by optical density and electrophoresis (Agilent nt using 20% duty cycle, 5 intensity and 200 cycles/burst for 2100, Santa Clara, Calif.). 180 sec. For Illumina HiSeq SEBS, shearing to ~150nt was by 0213 SureSelect enrichment of 6, 8 or 12-plex pooled 10% duty cycle, 5 intensity and 200 cycles/burst for 660 sec. libraries was per Agilent protocols' with 100 ng of custom Barcoded sequencing libraries were made per manufacturer bait library, blocking oligos specific for paired-end sequenc US 2011/0098.193 A1 Apr. 28, 2011 ing libraries and 60 hr. hybridization. Biotinylated RNA value information for SNP calling. Reads with mapping qual library hybrids were recovered with streptavidin beads. ity <8 were discarded by diBayes. A position must have at Enrichment was assessed by quantitative PCR (Life Tech least 2x or 3X coverage to call a homozygous or heterozygous nologies, Foster City, Calif.; CLN3, exon 15, Hs00041388 SNP respectively. The Bioscope small indel pipeline was cn; HPRT1, exon 9, Hs02699975 cm; LYST, exon 5, used with default settings and calls insertions of size s3 bp Hs0292.9596 cn; PLP1, exon 4: Hs01638246 cn) and a non and deletions of sizes 11 bp. In comparisons with SBS, SNP targeted locus (chrX: 77082157, Hs05637993 cm) pre- and and indel calls were further restricted to positions where at post-enrichment. least 4 or 10 reads called a variant. 0214 Rainlance RDT1000 (Lexington, Mass.) target enrichment was as described and used a custom primer 2. Results library: Genomic DNA samples were fragmented by nebuli 0220 i. Disease Inclusion zation to 2-4 kb and 1 lug mixed with all PCR reagents but 0221) The carrier test reported herein considered several primers. Microdroplets containing three primer pairs were factors. Firstly, cost effectiveness was assumed to be critical fused with PCR reagent droplets and amplified. Following for test adoption. The incremental cost associated with emulsion breaking and purification by MinElute column increasing the degree of multiplexing was assumed to (Qiagen, Valencia, Calif.), amplicons were concatenated decrease toward an asymptote. Thus, very broad coverage of overnight at 16° C. and sequencing libraries were prepared. diseases was assumed to offer optimal cost-benefit. Secondly, Sequencing was performed on Illumina GAIIX and comprehensive mutation sets, allele frequencies in popula HiSeq2000 instruments per manufacturer protocols. tions and individual mutation genotype-phenotype relation 0215 iv. Hybrid Capture and Sequencing by Ligation ships have been defined in very few recessive diseases. In (SBL) addition, some studies of CF carrier screening for a few 0216) For SOLiD3 SBL, 3 ug DNA was sheared by Cova common alleles have shown decreased prevalence of tested ris to ~150 nt using 10% duty cycle, 5 intensity and 100 alleles with time, rather than reduced disease incidence. cycles/bursts for 60 sec. Barcoded fragment sequencing These two different lines of evidence indicated that very libraries were made using Life Tehnologies (Carlsbad, Calif.) broad coverage of mutations offered the greatest likelihood of protocols and reagents. Taqman quantitative PCR was used to Substantial reductions in disease incidences with time. assess each library, and an equimolar 6-plex pool was pro Thirdly, physician and patient adoption of screening was duced for enrichment using Agilent SureSelect and a modi assumed to be optimal for the most severe childhood diseases. fied protocol. Prior to enrichment, the 6-plex pool was single Therefore, diseases were chosen can almost certainly change stranded. Furthermore, 1.2 ug pooled DNA with 5 uL (100 family planning by prospective parents or impact ante-, peri ng) custom baits was used for enrichment, with blocking or neo-natal care of high risk pregnancies. Milder recessive oligos specific for SOLiD sequencing libraries and 24 hr. disorders, such as deafness, and adult-onset diseases, such as hybridization. Sequencing was performed on a SOLiD 3 inherited cancer syndromes, were omitted. instrument using one quadrant on a single sequencing slide, 0222 Database and literature searches and expert reviews generating singleton 50 mer reads. were performed on 1,123 diseases with recessive inheritance 0217 v. Sequence Analysis of known molecular basis. Several subordinate requirements 0218. The bioinformatic decision tree for detecting and were gathered: In view of pleiotropy and variable severity, genotyping DMs was predicated on experience with detec tion and genotyping of variants in next generation genome disease genes were included if mutations caused severe ill and chromosome sequences (FIG. 19). Briefly, SBS ness in a proportion of affected children. All but six diseases sequences were aligned to the NCBI reference human that featured genocopies (including variable inheritance and genome sequence (Version 36.3) with GSNAP and scored by mitochondrial mutations) were included. Diseases were not rewarding identities (+1) and penalizing mismatches (-1)and excluded on the basis of low incidence. Diseases for which indels (-1-log(indel-length)). Alignments were retained if large population carrier screens exist were included. Such as covering 295% of the read and scoring 278% of maximum. TSD, hemoglobinopathies and CF. Mental retardation genes Variants were detected with Alpheus using stringent filters were not included in this iteration. 489 X-linked recessive (2.14% and 210 reads calling variants and average quality (XLR) and autosomal recessive (AR) disease genes met these score 220). Allele frequencies of 14-86% were designated criteria (Table 11). heterozygous, and >86% homozygous. Reference genotypes 0223 ii. Technology Selection of SNPs and CNVs mapping within targets were obtained 0224 Array hybridization with allele-specific primer with Illumina Omnil-Quad arrays and GenomeStudio 2010. 1. indel genotypes were confirmed by genomic PCR of <600 extension can be favored for expanded carrier detection due to bp flanking variants and Sanger sequencing. test simplicity, cost, Scalability and accuracy. The majority of 0219 SBL sequence data analysis was performed using carriers can be accounted for by a few mutations, and most Bioscope v1.2. 50 by reads were aligned to NCBI genome DMs must be nucleotide substitutions. Of 215 AR disorders build 36.3 using a seed and extend approach (-mapping). examined, only 87 were assessed to meet these criteria. Most A 25bp seed with up to 2 mismatches is first aligned to the recessive disorders for which a large proportion of burden reference. Extension can proceed in both directions, depend was attributable to a few DMs were limited to specific ethnic ing on the footprint of the seed within the read. During exten groups. Indeed, 286 severe childhood AR diseases encom Sion, each base match receives a score of +1, while mis passed 19,640 known DMs Given that the Human Gene matches get a default score of -2. The alignment with the Mutation Database (HGMD) lists 102,433 disease mutations highest mapping quality value is chosen as the primary align (DMs), a number which is steadily increasing, a fixed-content ment. If 2 or more alignments have the same score then one of method appeared impractical. Other concerns with array them is randomly chosen as the primary alignment. SNPs based screening for recessive disorders were Type 1 errors in were called using the Bioscope diBayes algorithm at medium the absence of confirmatory testing and Type 2 errors for DMs stringency setting. DiBayes is a Bayesian algorithm which other than Substitutions (complex rearrangements, indels or incorporates position and probe errors as well as color quality gross deletions with uncertain boundaries). US 2011/0098.193 A1 Apr. 28, 2011 44

0225. The effectiveness and remarkable decline in cost of exome capture and next generation sequencing for variant TABLE 15-continued detection in genomes and exomes Suggested an alternative Repeat content of 55 exons failing RNA bait design due to potential paradigm for comprehensive carrier testing. Four repetitive sequences. target enrichment and three next generation sequencing methods were preliminarily evaluated for multiplexed carrier Length % of testing. Preliminary experiments indicated that existing pro Type Element Number1 (total nt) Sequence tocols for Agilent SureSelect hybrid capture and Rainlance LTR ERVL 2 276 2.2 micro-droplet PCR but not Febit HybSelect microarray ERVL-MaLR 2 115 O.9 based biochip capture or Olink padlock probe ligation and ERV-ClassI 3 427 3.4 ERV-ClassII O O O PCR yielded consistent target enrichment (data not shown). DNA hAT-Charlie O O O Therefore, detailed workflows were developed for compre TcMar-Tigger 1 78 O.6 hensive carrier testing by hybrid capture or micro-droplet Small RNA O O O PCR, followed by next generation sequencing (FIG. 16). Satellite O O O Baits or primers were designed to capture or amplify 1,978, Simple Repeats 8 479 3.8 041 nucleotides (nt), corresponding to 7,717 segments of 489 Low Complexity 10 494 4 recessive disease genes by hybrid capture and micro-droplet 1: repeats fragmented by insertions or deletions were counted as 1 element PCR, respectively. Targeted were all coding exons and splice 1: repeats fragmented by insertions or deletions were counted site junctions, and intronic, regulatory and untranslated as 1 element regions known to contain DMs. In general, baits for hybrid 0226 iii. Analytic Metrics capture or PCR primers were designed to encompass or flank 0227. An target enrichment protocol can inexpensively DMs, respectively. Primers were also designed to avoid result in at least 30% of nucleotides being on target, which known polymorphisms and minimize non-target nucleotides. corresponded to approximately 500-fold enrichment with ~2 Custom baits or primers were also designed for 11 gross million nt target size. This was achieved with hybrid capture following one round of bait redesign for under-represented deletion DMs for which boundaries had been defined, in order exons and decreased bait representation in over-represented to capture or amplify both the normal and DM alleles (Table exons (Table 12). An ideal target enrichment protocol can also 14). 29,891 120 mer RNA baits were designed to capture of give a narrow distribution of target coverage and without tails 98.7% of targets. 55% of 101 exons that failed bait design or skewness (indicative of minimal enrichment-associated contained repeat sequences (Table 15). 10,280 primer pairs bias). Following hybrid capture, the sequencing library size were designed to amplify 99% of targets. Twenty exons failed distribution was narrow (FIG. 17A). In FIG. 17A, the top primer design by falling outside the amplicon size range of panel shows target enrichment by hybrid capture, and the 200-600 nt. bottom panel shows target enrichment by microdroplet PCR. Size markers are shown at 40 and 8000 nt. FU: fluorescent units. The aligned sequence coverage distribution was uni TABLE 1.5 modal but flat (platykurtic) and right-skewed (FIG. 17B). Repeat content of 55 exons failing RNA bait design due to This implied that hybrid capture can require over-sequencing repetitive sequences. of the majority of targets to recruit a minority of poorly selected targets to adequate coverage. In FIG. 17B, aligned Length % of sequences had quality score 225. As expected, median cov Type Element Number1 (total nt) Sequence erage increased linearly with sequence depth. The proportion SINE Alu 16 2175 17.4 of bases with greater than Zero and 220x coverage increased MIR 8 950 7.6 toward asymptotes at ~99% and ~96%, respectively (Table LINE LINE1 5 779 6.2 12, FIG. 17C). Interestingly, targets with low (s3x) coverage LINE2 O O O were highly reproducible and had high GC content. Table 16. L3;CR1 O O O This indicated that targets failing hybrid capture could be predicted and rescued by individual PCR reactions.

TABLE 12 Sequencing, alignment and coverage statistics for target enrichment and sequencing platforms. Median% Read Median Median Uniquely Sample Enrichment Sequencing Multi- Length Quality Total reads: Aligning Median Total Set Method Method plexing (nt) Score 9, CV Reads nucleotides

1, n = 12 SureSelect GAIIx 12 50 30 9,952,972.5 + 21 94 497,648,625 2, n = 12 SureSelect GAIIx 12 50 30 10,127,721 + 16 95 506,386,025 1 + 2, RainDance GAIIx 12 50 36 9,412,698 + 30 97 470,634,900 n = 24 1 + 2, RainDance GAIIx 12 50 31 12,807,392 + 17 96 640,369,600 n = 12 3, n = 6 SureSelect GAIIx 6 50 30 19,711,735 + 34 95 985,586,750 3, n = 6 SureSelect SOLD 3 6 50 24 16,506,076 + 5 82 825,303,800 4, n = 72 SureSelect 2 HiSeq 8 1493 423 9,273,596 + 24 98 1,390.464,487 5, n = 8 SureSelect HiSeq 8 149 413 9,861,765 + 35 97 1493,946,141 US 2011/0098.193 A1 Apr. 28, 2011 45

TABLE 12-continued Sequencing, alignment and coverage statistics for target enrichment and sequencing platforms. Median Pearson's Median % inton Median Median % Median % Median Median Sample Aligning targeti Fold OX e2OX Coverage it Skewness Set depth 96 CV Enrichment Coverage Coverage 96 CV Coefficient? 1, n = 12 225 13.73 214 4.83 61 2721 O.28 2, n = 12 234 23.0 - 2 358 3.66 8O SO 16 O.19 1 + 2, 196 29.65 462 5.46 86 52.533 O.23 n = 24 1 + 2, 277 2227 346 4.62 88 56 12 0.27 n = 12 3, n = 6 463 17.43 273 18O 86 76 30 3, in 310 1957 3O4 6.08 79 S87 4. l : 2 495 317 4 494 2.33 92 152- 26 5 l 517 28.44 442 2.25 93 139 40 Coefficient of variation (%). 'Pearson's median skewness coefficient 3(mean - median), standard deviation. Following assembly of forward and reverse 130 bp paired reads.

Table 16. Coordinates, genes and GC content of 40 exons with recurrent coverage <3x.

TABLE 16 Coordinates, genes and GC content of 40 exons with recurrent coverage <3X. SureSelect Bate Design Design 2: Design 1: Samples with Samples with <3X Coverage <3X Coverage Gene C hir Start Stop (n = 80) (n = 8) GAA 75689949 75690019 85 97.2% OO.0% PDSS1 270266OO 27026775 8O 97.2% 87.5% HGSNAT 43114748 43114914 83 97.2% OO.0% TTPA 641.60930 64161166 76 97.2% OO.0% AAAS 519875.06 51987764 57 97.2% MTM1 149487704 1494.87770 81 97.2% OO.0% IDUA 97.0784 971.030 78 97.2% 87.5% EFEMP2 65396.729 6539691.6 82 97.2% OO.0% ENPP1 132170848 132171108 79 97.2% OO.0% G6PD 1534281.97 153428427 78 97.2% OO.0% MYOSA SO608268 SO608539 82 97.2% 87.5% CPT1A 68365818 683.65975 79 97.2% OO.0% ST3GALS 8596.9457 859696.68 8O 97.2% OO.0% LIFR 38S92192 38592.505 78 97.2% OO.0% IDUA 986S 19 986732 77 94.4% 87.5% INSR 72448O2 7245O11 8O 94.4% OO.0% D2HGDH 242322702 24.2322783 79 93.1% OO.0% OCRL 128SO1932 128502136 75 87.5% 87.5% ITGB4 712291 10 71229287 78 77.8% OO.0% SLC25A15 40261596 40261799 8O 77.8% 87.5% MMAB 1084836O7 108483705 61 68.1% 87.5% LHX3 1382.34612 138.23482S 77 66.7% 75.0% DLL3 44685294 44685537 79 66.7% PLEC1 145088,547 14508868O 75 65.3% 12.5% 4658SOO4 46585081 72 62.5% 87.5% 1323.09914 1323102O3 79 61.1% 75.0% 433.58794 433S8874 63 SS.6% SO.0% 73243OO6 73243111 58 52.8% 87.5% 2611792 2612271 70 52.8% 75.0% 42713629 42713790 75 52.8% 25.0% D NM T3 B 3O813851 3O814166 79 48.6% 25.0% S974890 59751.18 74 48.6% 25.0% 40010011 40010232 75 40.3% SO.0% 1s 48723O88 48723236 74 38.9% 19 7493S11 7493667 75 36.1% 87.5% 9 133384596 1333,84694 65 34.7% 87.5% US 2011/0098.193 A1 Apr. 28, 2011 46

TABLE 16-continued Coordinates, genes and GC content of 40 exons with recurrent coverage <3X. SureSelect Bate Design Design 2: Design 1: Samples with Samples with <3X Coverage <3X Coverage Gene Chr Start Stop GC 9/o (n = 80) (n = 8) SLC37A4 11 1184O6768 1184O68OO 67 33.3% 87.5% GCSEH 16 79687236 79687481 79 33.3% 100.0% IDUA 4 987132 987258 8O 30.6% 75.0% COL17A1 10 105806722 105806920 68 29.2% 37.5%

0228. Given the need for highly accurate carrier detection, least 50 GB per run and read lengths of at least 50 nt. There 210 uniquely aligned reads of quality score 220 and 2.14% fore, the quality and quantity of sequences from multiplexed, of reads were required to call a variant. The requirement for target-enriched libraries were compared using SBS (GAIIx 210 reads was highly effective for nucleotides with moderate singleton 50 mers) and SBL (SOLiD3 singleton 50 mers: coverage. For heterozygote detection, for example, this was Table 12). SBS- and SBL-derived 50mer sequences (and equivalent to ~20x coverage, which was achieved in ~96% of alignment algorithms) gave similar alignment metrics (Table exons with -2.6 GB of sequence (FIG. 17C). In FIG. 17C, 12). When compared with Infinium array results, specificity target coverage was a function of depth of sequencing across of SNP genotypes by SBS and SBL were very similar (SBS 104 samples and six experiments. The proportion of targets 99.69%, SBL 99.66%, following target enrichment and mul with at least 20x coverage appeared to be useful for quality tiplexed sequencing; FIG. 18). In FIG. 18, target nucleotides assessment. The requirement for 2.14% of reads to call a were enriched by hybrid selection and sequenced by Illumina variant was highly effective for nucleotides with very high GAIIx SBS and SOLiD3 SBL at 6-fold multiplexing. The coverage and was derived from the genotype data discussed samples were also genotyped with Infinium OminOuad1 SNP below. A quality score requirement was important when next arrays. In FIG. 18, the following apply: (A) Comparison of generation sequencing started, but is now largely redundant. SNP calls and genotypes obtained by SBS, SBL and arrays at 0229 Micro-droplet PCR can result in all cognate ampli nucleotides surveyed by all three methods. SNPs were called cons being on target and can induce minimal bias. In practice, if present in 210 uniquely aligning SBS reads, 2.14% of the coverage distribution was narrower than hybrid capture reads and with average quality Score 220. Heterozygotes but with similar right-skewing (FIG. 17D). In FIG. 17D, the were identified if present in 14%-86% of reads. Numbers frequency distribution of target coverage following micro refer to SNP calls. Numbers in brackets refer to SNP geno droplet PCR and 1.49 GB of singleton 50 mer SBS of sample types. (B) Comparison of SNP calls and genotypes obtained NA20379. Aligned sequences had quality score 225. These by SBS, SBL and arrays. SNPs were called if present in 24 results were complicated by ~11% recurrent primer synthesis uniquely aligning SBS reads, 21.4% of reads and with aver failures. This resulted in linear amplification of a subset of age quality Score 220. Heterozygotes were identified if targets, ~5% of target nucleotides with Zero coverage and present in 14%-86% of reads. similar proportion of nucleotides on target to that obtained in 0232 Given approximate parity of throughput and accu the best hybrid capture experiments (-30%; Table 12). racy, consideration was given to optimal read length. Unam Hybrid capture was employed for subsequent studies for rea biguous alignment of short read sequences is typically con Sons of cost. founded by repetitive sequences, which can be irrelevant for 0230 Multiplexing of samples during hybrid selection carrier testing since targets overwhelmingly contained and next generation sequencing had not previously been unique sequences. The number of mismatches tolerated for reported. Six- and twelve-fold multiplexing was achieved by unique alignment of short read sequences is highly con adding molecular bar-codes to adapter sequences. Interfer strained but increases with read length. The majority of dis ence of bar-code nucleotides with hybrid selection did not ease mutations are single nucleotide Substitutions or Small occur appreciably: The stoichiometry of multiplexed pools indels. Comprehensive carrier testing also requires detection was essentially unchanged before and after hybrid selection. of polynucleotide indels, gross insertions, gross deletions and Multiplexed hybrid selection was found to be approximately complex rearrangements. A combination of bioinformatic 10% less effective than singleton selection, as assessed by approaches were used to overcome short read alignment median fold-enrichment. Less than 1% of sequences were shortcomings (FIG. 19). Firstly, with the Illumina HiSeq SEBS discarded at alignment because of bar-code sequence ambi platform, the novel approach of read pair assembly before guity. Therefore, up to 12-fold multiplexing at hybrid selec alignment (99% efficiency) was employed, in order to gener tion and per sequencing lane (equivalent to 96-plex per ate longer reads with high quality scores (148.6+3.8 nt com sequencing flow cell) were used in Subsequent studies to bined read length and increase in nucleotides with quality achieve the targeted cost of 30 from 75% to 83%). This was combined with gen 0231. Several next generation sequencing technologies eration of 150 nt sequencing libraries without gel purification are currently available. Of these, the Illumina sequencing-by by optimization of DNA shearing procedures and use of silica synthesis (SBS) and SOLiD sequencing-by-ligation (SBL) membrane columns. Omission of gel purification was critical platforms are widely disseminated, have throughput of at for scalability of library generation. Secondly, the penalty on US 2011/0098.193 A1 Apr. 28, 2011 47 polynucleotide variants was reduced, rewarding identities exons3 4. c.del349 495del, chrX: 102.9282O7 (+1) and penalizing mismatches (-1) and indels (-1-log(in 102929424del in one (NA13434, red diamond) of eight del-length)). Thirdly, gross deletions were detected either by samples; and (G) absence of gross deletion CG984340 perfect alignment to mutant reference sequences or by local (ERCC6 exon 9, c.1993 2169del, 665 723del, exon 9 del, decreases in normalized coverage (FIG. 20). Seeking perfect chr10:50360915 50360739del) in 72 DNA samples. The alignment to mutant reference sequences obviates low align sample in red (NA01712) was incorrectly annotated to be a ment scores when short reads containing polynucleotide vari compound heterozygote with CG984340 based on cDNA ants are mapped to a normal reference. This was illustrated by sequencing.

TABLE 1.4 Custom Agilent SureSelect RNA baits for hybrid capture of 11 gross deletion DMs with defined boundaries.

Bait D Chr Start Stop Length Disease OMIM # Gene A. 11 4.033883 4034083 200 Immunodeficiency & 60S921 STIM1 autoimmunity B 11 S204606 5204726 120 B thalassemia 141900 HBB C 12 1017582O7 101758.306 99 PKU 2616OO PAH D 16 14318O 143380 200 C. thalassemia 14231 O HBZ E 16 170677 170877 200 C. thalassemia 142240 HBQ1 16 284O4587 284O4987 400 Batten disease 2042OO CLN3 G 16 284.05652 284OS852 200 Batten disease 2042OO CLN3 H 17 75692836 75692.947 111 GSD2 2323OO GAA 19 7492S22 7492722 2OO ML4 2526SO MCOLN1 19 74989S4 7499042 88 ML4 2526SO MCOLN1 K X 133428209 133428418 209 Lesch-Nyhan syn. 3O8000 HPRT1 5 702834O7 70283522 115 SMA1 2S3300 SMN1 M 7 11692SSO3 11692S703 200 CF 219700 CFTR N 7 11694,6582 116946,782 200 CF 219700 CFTR O 7 11703874S 117038869 124 CF 219700 CFTR C 7 117073059 1170732.59 200 CF 219700 CFTR identification of 11 gross deletion DMs for which boundaries 0233 iv. Clinical Metrics had been defined (Table 14). This approach is anticipated to 0234 Based on these strategies of genotyping variants be extensible to gross insertions and complex rearrange identified in next generation genome and chromosome ments. In FIG. 20, the following apply: (A) deletion of CLN3 sequences bioinformatic decision tree for genotyping DMs introns 6-8,966bpdel, exons7-8delandfs, chr16:28.405752 was developed (FIG. 19). Clinical utility of target enrichment, 284.04787del in four known compound heterozygotes SBS sequencing and this decision tree for genotyping DMs (NA20381, NA20382, NA20383 and NA20384, red dia were assessed. SNPs in 26 samples were genotyped both by monds) and one undescribed carrier (NA00006, green dia high density arrays and sequencing. The distribution of read mond) among 72 samples sequenced; (B) heterozygous dele count-based allele frequencies of 92,106 SNP calls was tri tion in HBA1 (chr16:141620 172294del, 30,676 bp deletion modal, with peaks corresponding to homozygous reference from 5' of 2 to 3' of 01 in ALU regions) in one known alleles, heterozygotes and homozygous variant alleles, as (NA10798, red diamond) and one undescribed carriers ascertained by array hybridization (FIG.21B). Optimal geno (NA19193, green diamond) among 72 samples; (C) known typing cut-offs were 14% and 86% (FIG. 21B). With these homozygous deletion of exons 7 and 8 of SMN1 in one of cutoffs and a requirement for 20x coverage and 10 reads of eight samples (NAO3813, red diamond); and (D) detection of quality 220 to call a variant, the accuracy of sequence-based a gross deletion that is a cause of Duchenne muscular dystro SNP genotyping was 98.8%, sensitivity was 94.9% and speci ficity was 99.99%. The positive predictive value (PPV) of phy (OMIMH310200, DMD exon 51-55 del, chrx: sequence-based SNP genotypes was 99.96% and negative 31702000 31555711 del) by reduction in normalized predictive value was 98.5%, as ascertained by array hybrid aligned reads at chrX:31586112. FIGS. 20E-G show 72 ization. As sequence depth increased from 0.7 to 2.7GB, samples, of which one (NAO4364, red diamond) was from an sensitivity increased from 93.9% to 95.6%, while PPV affected male, and another (NA18540, a female JPT/HAN remained ~100% (FIG.21A). Areas under the curve (AUC) of HapMap sample) was determined to carry a deletion that the receiver operating characteristic (ROC) for SNP calls by extends to at least chrX:31860199 (see FIG. 20E). In FIGS. hybrid capture and SBS were calculated. When genotypes in 20E-G, the following apply: (E) An undescribed heterozy 26 samples were compared with genome-wide SNP array gous deletion of DMD 3' exon 44-3' exon 50 (chrX: hybridization, the AUC was 0.97 when either the number or% 32144956-31702228.del) in NA1854.0 (green diamond), a reads calling a SNP was varied (FIG. 21C-D). When the JPT/HAN HapMap sample. This deletion extends from at parameters were combined, the AUC was 0.99. For known least chrX:315861 12 to chrX:31860199 (see FIG. 20D). Substitution, indel, splicing, gross deletion and regulatory Sample NA (red diamond) is the uncharacterized mother of alleles in 76 samples, sensitivity was 100% (113 of 113 an affected son with 3' exon 44-3' exon 50 del, chrX: known alleles; Table 13). The higher sensitivity for detection 32144956-31702228.del; (F) hemizygous deletion in PLP1 of known mutations reflected manual curation. Of note, Sub US 2011/0098.193 A1 Apr. 28, 2011 48 stitutions, indels, splicing mutations and gross deletions 20G). 64 of 138 NAO1712 reads contained a nucleotide sub account for the vast majority (96%) of annotated mutations stitution that created a premature stop codon (Q664X, chr10: 0235. In FIG. 21, the following apply: (A) comparison of 50360741C>T). The other allele of NA01712 had been char 92,128 SNP genotypes by array hybridization with those acterized as a deletion within a homopolymeric repeat (exon obtained by target enrichment, SBS and a bioinformatic deci 17, c.3533delT, Y1179fs, chr10:50348479delT), but instead sion tree in 26 samples. SNPs were called if present in 210 occurred three bases upstream (exon 17, c.3536delA. uniquely aligning reads, 21.4% of reads and average quality Y1179fs, chr10:50348476delA; FIG. 27). NA01464, a CHT score 220. Heterozygotes were identified if present in 14%- for glycogen storage disease, type II (OMIM #232300), which 86% of reads. TP-SNP called and genotyped correctly. had an undefined second mutation, contained a frame-shift TN=Reference genotype called correctly. FN-SNP genotype ing deletion of GAA (exon 17, c.2544delC, p.K849fs, chr17: undercall. FP=SNP genotype overcall. Accuracy=(TP+TN)/ 75706649delC) in 44 of 117 reads. One allele of NA20383, a (TP+FN+TN+FP). Sensitivity=TP/(TP+FN). CHT for neuronal ceroid lipofuscinosis, type 3, had been Specificity=TN/(TN+FP). PPV=TP/(TP+FP). NPV=TN/ characterized as exon 11, c.1020G>A, E295K, chr16: (TN+FN); (B) distribution of allele frequencies of SNP calls 284.01322Gd-A. Instead, however, 193 of 400 reads called a by hybrid capture and SBS in 26 samples. Light blue: het different, more deleterious mutation at that nucleotide erozygotes by array hybridization; (C) receiver operating (c.1020G>T, E295X, chr16:28401322G>T; FIG. 28). One characteristic (ROC) curve of sensitivity and specificity of allele of NA04394, a CHT, was annotated as GBA exon 8, SNP genotypes by hybrid capture and SBS in 26 samples c. 1208Gd-C, S403T, chr1:153472676G>C, but was exon 8, (when compared with array-based genotypes). Genomic c.1171G>C, p.V391L, chr1:1534.72713G>C. NA16643 was regions with less than 20x coverage were excluded. Upon annotated as an HBB exon 2, c.306G>T, E102D, chr11: varying the number of reads calling the SNP, the area under 5204392G>T heterozygote, but 23 of 49 reads called the curve (AUC) was 0.97; and (D) ROC curve of SNP geno c.306G>C, E 102D, chr11:5204392G>C (FIG. 29). Both types by hybrid capture and SBS in 26 samples. Genomic ERCC4 mutations described in CHT NAO3542 were absent regions with less than 20x coverage were excluded. Upon in at least 130 aligning reads. However, the current study used DNA from EBV-transformed cell lines, in which somatic varying the percent reads calling the SNP AUC was 0.97. hypermutation has been noted. In particular ERCC4, a DNA 0236 14 of 113 literature-annotated DMs were either repair gene, is a likely candidate for Somatic mutation. Includ incorrector incomplete (Table 13): Sample NA07092, from a ing these results, the specificity of sequence-based genotyp male with XLR Lesch-Nyhan syndrome (LN, ing of Substitution, indel, gross deletion and splicing DMs OMIMH300322), was characterized as a deletion of HPRT1 exon 8 by cDNA sequencing, but had an explanatory splicing was 100% (97/97). mutation (intron 8, IVS8+1 4delGTAA, chrx: 133460381 0237 Also, FIG. 27 shows one end of five reads from 13346.0384delGTAA; FIG.22A). NA01899, also from a male NAO1712 showing ERCC6 exon 17, c.3536delA, Y1179fs, with LN, was characterized as an exon 9 deletion (c.610 chr10:50348476delA. 94 of 249 reads contained this deletion 626del, H204fs, chrx: 133461726 133461742del) by cINA DM (CD982624). The top lines of doublets are Illumina sequencing but none of 22 reads detected this variant HiSeq assembled reads (following assembly of overlapping whereas 26 of 27 reads detected a splicing mutation of intron paired forward and reverse 130nt reads). The bottom lines are 8 (intron 8, IVS8-2ADT, chrX: 133461724ADT). NAO9545, NCBI reference genome, build 36.3. Colors represent quality from a male with XLR Pelizaeus-Merzbacher disease (PMD, (Q) scores of each nucleotide: Red >30, Orange 20-29; Green OMIMH312080), characterized as a substitution DM (PLP1 10-19; and Blue <10. Reads aligned uniquely to these coor exon 5, c.767C>T, P215S), was found to also feature PLP1 dinates. The top read was of length 237 nt and matched the gene duplication (which is reported in 62% of sporadic PMD minus reference strand at 235 of 237 positions. The second FIG. 22B). One allele of NAO0879, from an affected com read matched the minus strand at 220 of 221 nt. The third read pound heterozygote (CHT) for AR Sanfilippo syndrome A matched the minus strand at 222 of 223 nt. The fourth read (mucopolysaccharidosis IIIA, OMIMH252900) had been matched the plus strand at 212 of 213 nt. The fifth read reported as a conservative substitution DM (exon 6, matched the minus strand at 238 of 239 nt. c.734G>A, R245H, chr17:75,802.210G>A), but was a 0238. In FIG. 28, 193 of 400 reads contained this substi frame-shifting, nucleotide deletion (exon 8, c.1079delC, tution DM (CM003663). The top lines of doublets are Illu p.V361fs, chr17:75799276delC in 72 of 164 reads). mina HiSeq assembled reads (following assembly of overlap NA02057, from a female with aspartylglucosaminuria ping paired forward and reverse 130 nt reads). The bottom (OMIM #208400), characterized as a CHT, was homozygous lines are NCBI reference genome, build 36.3. Colors repre for two adjacent substitutions (AGA exon 4, c.482G-A, sent quality (Q) scores of each nucleotide: Redd-30; Orange R161Q, chr4:178596918G-A and exon 4, c.488G>C, 20-29; Green 10-19; and Blue-10. Reads aligned uniquely to C163S, chr4:178596912GdC in 38 of 39 reads: FIG. 23), of these coordinates. The top read was of length 214 nt and which C163S had been shown to be the DM. In FIG. 24, the matched the minus reference strand at 213 of 214 positions. top lines of doublets are Illumina GAIIx 50 nt reads and the The second read matched the plus strand at 187 of 189 nt. The bottom lines are NCBI reference genome, build 36.3. Colors third read matched the plus strand at 182 of 183 nt. The fourth represent quality (Q) scores of each nucleotide: Red >30; read matched the minus strand at 180 of 181 nt. The fifth read Orange 20-29; and Green 10-19. Reads aligned uniquely to matched the minus strand at 188 of 189 nt. these coordinates. While one allele of NA01712, a CHT with 0239. In FIG. 29, one end of five reads from NA16643 Cockayne syndrome, type B (OMIMH 133540), had been showing HBB exon 2, c.306G>C, E 102D, chr11: characterized by cINA analysis as a deletion of ERCC6 exon 5204392G>C (Black arrow) is shown. 29 of 43 reads con 9 (c. 1993 2169del, p.665 723de1, exon 9 del, chr10: tained this substitution DM. The top lines of doublets are 50360915 50360739del, no decrease in normalized exon 9 Illumina HiSeq assembled reads (following assembly of read number was observed despite over 300x coverage (FIG. overlapping paired forward and reverse 130 nt reads). The US 2011/0098.193 A1 Apr. 28, 2011 49 bottom lines are NCBI reference genome, build 36.3. Colors was attested to by samples from identical twins being nearest represent quality (Q) scores of each nucleotide: Red-30; neighbors, as were two DMS in linkage disequilibrium. Orange 20-29; Green 10-19; and Blues-10. Reads aligned uniquely to these coordinates. 3. Discussion 0240 FIG. 30 shows the strategy for detection of a large 0244. These results indicate that comprehensive popula deletion mutation in a human genomic DNA sample. In (A), tion screening is a technically feasible and cost-effective the region of human chromosome 16 that contains the Ceroid approach to reduce the incidence of severe childhood reces Lipofuscinosis type 3 (CLN3) gene is shown. In the upper sive diseases and ameliorate resultant Suffering. Comprehen panel, a 154 nucleotide sequence from an individual who is a sive carrier screening by target enrichment, next generation heterozygote carrier of a 966 nucleotide mutation in CLN3 is sequencing and bioinformatic analyses was remarkably spe shown. The sequence is a normal sequence and aligns per cific (99.96%). When sequence depth of 2.5 GB per sample fectly to the reference human genome sequence. In the lower was employed, ~95% sensitivity was attained with hybrid panel, numbers refer to nucleotide positions on human chro capture. Since enrichment failures with hybrid capture were mosome 16. The CLN3 gene is shown in green, with exons reproducible, many may be amenable to rescue by individual illustrated by Vertical green bars and introns by grey arrows PCR or probe redesign. Alternatively, micro-droplet PCR illustrating the direction of transcription. In FIG. 30B, the should theoretically achieve sensitivity of ~99%, albeit at region of human chromosome 16 that contains the Ceroid higher cost. The cost of consumables was $218 for the hybrid Lipofuscinosis type 3 (CLN3) gene is shown. A 966 bp region enrichment-based test and S322 for the micro-droplet PCR of the chromosome is indicated by a grey box in the upper test. This excluded capital equipment, manpower, sales, mar panel. The middle panel shows the genomic region following keting and regulatory costs. It also did not account for coun deletion of the 966 bp region which includes introns 6.7 and Seling and other health care provider costs. These aspects— 8 and exons 7 and 8 of CLN3. The lower panel shows perfect facile interpretation of results, physician and public alignment of a 50 nucleotide sequence from an individual education, and training of genetic counselors—are antici who is a heterozygote carrier of a 966 nucleotide mutation in pated to be the most significant hurdles in implementation of CLN3. The sequence is a mutantsequence and aligns per comprehensive carrier screening. Nevertheless, the overall fectly to a synthetic mutant reference sequence. In FIG. 30C, cost of

standards and guidelines for population based cystic fibro R, Baxter S M, Day S. W. Retzel E. F. Management of sis carrier screening. Genetics in Medicine. 3:149-54 high-throughput DNA sequencing projects: Alpheus. J. (2001). Comput. Sci. Syst. Biol. 1, 132-148 (2008). 0279 23. Hale, J. E., Parad, R. B., Comeau, A. M. New 0290 34. Mimault, C., et al. Proteolipoprotein gene analy born screening showing decreasing incidence of cystic sis in 82 patients with sporadic Pelizaeus-Merzbacher Dis fibrosis. N Engl J Med. 358:973-974 (2008). ease: duplications, the major cause of the disease, originate 0280 24. Hedges, D. J., et al. Exome sequencing of a more frequently in male germ cells, but point mutations do multigenerational human pedigree. PLoS One. 4:e8232 not. The Clinical European Network on Brain Dysmyeli (2009). PubMed PMID: 20011588. nating Disease. Am J Hum Genet. 65:360-9 (1999). 0281. 25. Hu H. Wrogemann K. Kalscheuer V. Tzschach 0291 35. Mitchell, J.J., Capua, A., Clow, C., Scriver, C. R. A, Richard H. Haas SA, Menzel C, Bienek M. Froyen G, Twenty-year outcome analysis of genetic screening pro Raynaud M. Von Bokhoven H. Chelly J. Ropers H, Chen grams for Tay-Sachs and beta-thalassemia disease carriers W. Mutation screening in 86 known X-linked mental retar in high schools. Am J Hum Genet. 59:793-8 (1996). dation genes by droplet-based multiplex PCR and massive 0292 36. Myrianthopoulos NC, Aronson SM. Population parallel sequencing. HUGOJ (2010). dynamics of Tay-Sachs disease. I. Reproductive fitness and 0282. 26. Kaback MM.Hexosaminidase A Deficiency. In: selection. Am J Hum Genet. 18:313-27 (1966). PMID: Pagon R A, Bird TC, Dolan C R. Stephens K, editors. 5945951. GeneReviews Internet. Seattle (Wash.): University of 0293 37. Ng, S.B., et al. Exome sequencing identifies the Washington, Seattle: 1993- 1999 Mar. 11 updated 2006 cause of a mendelian disorder. Nat Genet. 42:30-5 (2010). May 19. PMID: 2030.1397 PubMed PMID: 1991.5526. 0283 27. Kaback M M. Population-based genetic screen 0294 38. Ng, S. B., et al. Targeted capture and massively ing for reproductive counseling: the Tay-Sachs disease parallel sequencing of 12 human exomes. Nature 461: model. Eurj Pediatr. 159 Suppl 3:S192-5 (2000). PMID: 272-6 (2009). PubMed PMID: 19684571. 11216898 0295 39. Online Mendelian Inheritance in Man, OMIM 0284. 28. Kim J I, Ju Y S. Park H. Kim S. Lee S. Yi J H, (TM). McKusick-Nathans Institute of Genetic Medicine, Mudge J. Miller NA, Hong D, Bell C J, Kim HS, Chung I Johns Hopkins University (Baltimore, Md.). Available at S. Lee WC, Lee JS, Seo SH, Yun JY, Woo HN, Lee H, Suh http://www.ncbi.nlm.nih.gov/omim/ Accessed Aug. 11, D. Lee S. Kim H.J. Yavartanoo M. Kwak M. Zheng Y. Lee 2010. MK, Park H, Kim JY, Gokcumen O, Mills RE, Zaranek A 0296 40. Population-based carrier screening for single W. Thakuria J, Wu X, Kim RW. Huntley JJ, Luo S, Schroth gene disorders: lessons learned and new opportunities. GP Wu TD, Kim H. Yang KS, Park WY, Kim H. Church Feb. 6-7, 2008. Available at http://www.genome.gov/ GM, Lee C, Kingsmore SF, Seo J. S. A highly-annotated, 27O26048 whole-genome sequence of a Korean Individual. Nature 0297 41. Roach, J. C., et al. Analysis of genetic inherit 460:1011-5 (2009). ance in a family quartet by whole-genome sequencing. 0285 29. Kronn, D., Jansen, V., Ostrer, H. Carrier screen Science 328:636-9 (2010). PubMed PMID: 20220176. ing for cystic fibrosis, Gaucher disease, and Tay-Sachs 0298 42. Srinivasan BS, Evans E A, Flannick J. Patterson disease in the Ashkenazi Jewish population: the first 1000 A S. Chang C C, Pham T. Young S. Kaushal A. Lee J, cases at New York University Medical Center, New York, Jacobson JL, Patrizio P. A universal carrier test for the long N.Y. Arch Intern Med. 158:777-81 (1998). tail of Mendelian disease. Reprod Biomed Online. Aug. 21, 0286 30. Kumar, P. Radhakrishnan, J., Chowdhary, M. 2010. PMID: 2O729.146 A., Giampietro, P. F. Prevalence and patterns of presenta 0299 43. Stenson, P. D., et al. The Human Gene Mutation tion of genetic disorders in a pediatric emergency depart Database: 2008 update. Genome Med 1:13 (2009). ment. Mayo Clin Proc. 76:777-83 (2001). (0300 44. Sugarbaker DJ, Richards W G, Gordon G. J. (0287 31. McConkey, E. Human Genetics: The Molecular Dong L, De Rienzo A, Maulik G, Glickman J. N. Chirieac Revolution. Sudbury, MA: Jones & Bartlett, 1' Edition L R, Hartman M. L., Taillon B E, Du L, Bouffard P. (1993). Kingsmore S F, Miller N A, Farmer A D, Jensen RV. 0288 32. McKernan K. J. Peckham H. E. Costa G. L. Gullans S R, Bueno R. Transcriptome sequencing of McLaughlin SF, Fu Y. Tsung E. F. Clouser C R. Duncan C, malignant pleural mesothelioma tumors. Proc. Natl Acad. Ichikawa J. K. Lee C C, Zhang Z. Ranade SS, Dimalanta E Sci. USA 105, 3521-3526 (2008). T. Hyland FC, Sokolsky TD, Zhang L. Sheridan A, Fu H, (0301 45. Summerer D, Hevroni D, Jain A, Oldenburger O, Hendrickson C L, Li B, Kotler L. Stuart J R, Malek JA, Parker J, Caruso A, Staler C F. Stäler P F, Beier M. A Manning J. M. Antipova A A. Perez DS, Moore M P. flexible and fully integrated system for amplification, Hayashibara KC, Lyons MR, Beaudoin RE, Coleman B detection and genotyping of genomic DNA targets based E. Laptewicz MW, Sannicandro A E, Rhodes MD, Got on microfluidic oligonucleotide arrays. N. Biotechnol. 27. timukkala R K, Yang S. Bafna V. Bashir A. MacBride A, 149-155 (2010). PMID: 20359559 Alkan C, Kidd J. M. Eichler E. E. Reese MG, De LaVega F (0302) 46. Tewhey R. Warner J B, Nakano M, Libby B, M. Blanchard A. P. Sequence and structural variation in a Medkova M. David PH, Kotsopoulos S K, Samuels ML, human genome uncovered by short-read, massively paral Hutchison J. B. Larson J. W. Topol E. J. Weiner M. P. Haris lel ligation sequencing using two-base encoding. Genome mendy O, Olson J. Link D. R. Frazer K A. Microdroplet Res. 19:1527-41 (2009). PMID: 19546169 based PCR enrichment for large-scale targeted sequencing. 0289) 33. Miller NA, Kingsmore S F, Farmer A, Langley Nat Biotechnol. 27:1025-31 (2009). RJ, Mudge J. Crow JA, Gonzalez A.J. Schilkey F D. Kim (0303 47. Wang J, Wang W. Li R, Li Y. Tian G, Goodman RJ, van Velkinburgh J. May G. D. Black C F. Myers MK, L. Fan W. Zhang J, Li J, Zhang J, Guo Y. Feng B, Li H, Lu Utsey J. P. Frost NS, Sugarbaker DJ, Bueno R, Gullans S Y. Fang X, Liang H. Du Z, Li D, Zhao Y. Hu Y. Yang Z. US 2011/0098.193 A1 Apr. 28, 2011

Zheng H. Hellmann I, Inouye M, Pool J. Yi X. Zhao J, Duan (0306 50. Wigderson M, Firon N, Horowitz Z, Wilder S, J, Zhou Y. Qin J. Ma L., Li G, Yang Z, Zhang G, Yang B, Yu Frishberg Y. Reiner O, Horowitz M. C, Liang F, Li W. Li S. LiD, Ni P. Ruan J, Li Q, Zhu H, Liu 0307 Characterization of mutations in Gaucher patients D. Lu Z, Li N. Guo G, Zhang J. Ye J. Fang L, Hao Q, Chen by cDNA cloning. Am J Hum Genet. 44:365-77 (1989). Q, LiangY. SuY. San A, Ping C, Yang S., Chen F. Li Li Zhou PubMed ID: 2464926 K. Zheng H. Ren Y. Yang L, Gao Y. Yang G, Li Z. Feng X, (0308) 51. Zhong N, Martiniuk F, Tzall S. Hirschhorn R, Kristiansen K. Wong G. K. Nielsen R. Durbin R, Bolund L. Identification of a missense mutation in one allele of a Zhang X, Li S. Yang H. Wang J. The diploid genome patient with Pompe disease, and use of endonuclease sequence of an Asian individual. Nature. 456:60-5 (2008). digestion of PCR-amplified RNA to demonstrate lack of PubMed PMID: 1898.7735. mRNA expression from the second allele. Am J Hum 0304 48. Watson, M. S. Lloyd-Puryear, M.A., Mann, M. Genet. 49:635-45 (1991). PubMed ID: 1652892 Y., Rinaldo, P. Howell, R. R. Newborn screening main (0309 52. Zhong N, Wisniewski KE, Kaczmarski A L, Ju report. Genetics in Medicine. 8:12S-252S (2006). W, Xu WM, Xu W. W. Mclendon L. Liu B, Kaczmarski W. 0305. 49. Wheeler DA, Srinivasan M, Egholm M. Shen Y. Sklower Brooks SS, Brown WT. Molecular screening of Chen L. McGuire A, He W. Chen YJ, Makhijani V. Roth G Batten disease: identification of a missense mutation T. Gomes X, Tartaro K, Niazi F, Turcotte CL. Irzyk G. P. (E295K) in the CLN3 gene. Hum Genet. 102:57-62 (1998). Lupski JR, Chinault C. Song X Z. Liu Y. Yuan Y. Nazareth 0310. 53. Emery A E H (ed) Duchenne muscular dystro L. Qin X, Muzny D M, Margulies M. Weinstock G. M. phy. No 15 in: Motulsky A G, Harper PS, Bobrow M, Gibbs R A Rothberg J. M. Nature. 452:872-6 (2008). Scriver C(eds) Oxford monographs on medical genetics. PubMed PMID: 18421352 Oxford University Press, Oxford. 1988.

SEQUENCE LISTING

<16 Os NUMBER OF SEO ID NOS : 174

<21 Os SEQ ID NO 1 &211s LENGTH: 50 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs SEQUENCE: 1 cc.gcaggctic aatgaatcga atgaatgtga acttctt cat citctgaaaaa SO

<21 Os SEQ ID NO 2 &211s LENGTH: 50 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs SEQUENCE: 2 cagcaggctic aatgaatcga atgaatgtga acttctt cat citctgaaaaa SO

<21 Os SEQ ID NO 3 &211s LENGTH: 60 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs SEQUENCE: 3 cc.gc.ccc.cta gtctic ccacc ct tcc ct coc cqtagtgacc aatticcitatic tott coct ct 60

<21 Os SEQ ID NO 4 &211s LENGTH: 60 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note =

US 2011/0098.193 A1 Apr. 28, 2011 55

- Continued titcc tat citc titc cct ct co goacgcticaa tdaatcgaat gaatgtgaac ttctt.ca 1077

<210s, SEQ ID NO 8 &211s LENGTH: 108 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 8 ccacca accq gatgacaaga aaaaaacaga gqcttittgct ttcttgaata tdaagat cac 6 O caaacagotg ccc.cgtaaaa gtgctgtttg tacgcaacct togccaata 108

<210s, SEQ ID NO 9 &211s LENGTH: 115 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 9 ccacca accq gatgacaaga aaaaaaacag gagttittgct ttcttgaata tdaagat cac 6 O aaaa.ca.gctg ccc.ccc ctaa gogtaaaagtg ctgtttgtac gcaac cittgc caata 115

<210s, SEQ ID NO 10 &211s LENGTH: 76 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 10 gttctic cc at ggaatgctitt ccctggcaag gtttgttggct coaac Cttct gtc. catctgc 6 O aaaa.ca.gctg agggga 76

<210s, SEQ ID NO 11 &211s LENGTH: 76 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 11 gttctic cc at ggaatgctitt ccctggcaag gtttgttggct coaac Cttct gtc. catctgc 6 O aaaa.ca.gctg agggga 76

<210s, SEQ ID NO 12 &211s LENGTH: 68 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 12 cc catggaat gctitt.ccctg gcaaggtttgttggct coaac cittctgtc.ca totgcaaaac 6 O US 2011/0098.193 A1 Apr. 28, 2011 56

- Continued agctgagg 68

<210s, SEQ ID NO 13 &211s LENGTH: 76 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note synthetic construct

<4 OOs, SEQUENCE: 13 gttctic cc at ggaatgctitt ccctggcaag gtttgttggct coaac Cttct gtc. catctgc 6 O aaaa.ca.gctg agggga 76

<210s, SEQ ID NO 14 &211s LENGTH: 46 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note synthetic construct

<4 OOs, SEQUENCE: 14 gttctic cc at ggaatgctitt citctggcaag gtttgttggct coaacc 46

<210s, SEQ ID NO 15 &211s LENGTH: 76 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note synthetic construct

<4 OOs, SEQUENCE: 15 gttctic cc at ggaatgctitt ccctggcaag gtttgttggct coaac Cttct gtc. catctgc 6 O aaaa.ca.gctg agggga 76

<210s, SEQ ID NO 16 &211s LENGTH: 76 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note synthetic construct

<4 OOs, SEQUENCE: 16 gttctic cc at ggaatgctitt citctggcaag gtttgttggct coaac Cttct gtc. catctgc 6 O aaaa.ca.gctg agggga 76

<210s, SEQ ID NO 17 &211s LENGTH: 69 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note synthetic construct

<4 OOs, SEQUENCE: 17 gtgaatttct ggattitttitt ttatagoatgtttgttgtcat tagtgaaact ggaaaagcaa. 6 O aatacaaag 69 US 2011/0098.193 A1 Apr. 28, 2011 57

- Continued <210s, SEQ ID NO 18 &211s LENGTH: 52 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 18 gtgaatttct ggattitttitt ttatagtgaa actggaaaag caaaatacaa ag 52

<210s, SEQ ID NO 19 &211s LENGTH: 50 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 19 cittitttitt at tttitt.ccata taagatggag acggit cittitt cittgtgagtic SO

<210s, SEQ ID NO 2 O &211s LENGTH: 55 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = Synthetic construct

<4 OOs, SEQUENCE: 2O cittitttitt at tttitt.ccata taagataaga tiggagacggit cittitt cittgt gag to 55

<210s, SEQ ID NO 21 &211s LENGTH: 61 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 21 ccagacaagt ttgttgtagg atatgcc.ctt gactataatgaat actt cag to attittaat 6 O

9 61

<210s, SEQ ID NO 22 &211s LENGTH: 88 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 22 actt caggga tittggatgta attgcttctt tttct cactic atttittcaaa acacgcataa 6 O aaatttagga aagaga attg ttittct co 88

<210s, SEQ ID NO 23 &211s LENGTH: 90 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct US 2011/0098.193 A1 Apr. 28, 2011 58

- Continued

<4 OOs, SEQUENCE: 23 at actt catg gatttgaatg taattgctitc tttitt ct cac to atttitt.ca aaacacgcat 6 O aaaaatttag gaaagaga at tdttittct co 9 O

<210s, SEQ ID NO 24 &211s LENGTH: 90 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 24 ggatttgaat gtaattgctt citttittctica ct catttittcaaaac acgca taaaaattta 6 O ggaaagagaa ttgttittctic ct tccago ac 9 O

<210s, SEQ ID NO 25 &211s LENGTH: 90 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 25 tgtaggatat gcc cittgact ataatgaata citt cagggat ttgaatgitaa ttgct actitt 6 O ttct cact catttittcaaaa cacgcataaa 9 O

<210s, SEQ ID NO 26 &211s LENGTH: 69 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 26 ccagacaagt ttgttgtagg atatgcc.ctt gactataatgaat actt cag ggatttgaat 6 O gtaattgct 69

<210s, SEQ ID NO 27 &211s LENGTH: 69 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct

<4 OOs, SEQUENCE: 27 ccagacaagt ttgttgtagg atatgcc.ctt gactataatgaat actt cag ggatttgaat 6 O gtaattgct 69

<210s, SEQ ID NO 28 &211s LENGTH: 69 &212s. TYPE: DNA <213> ORGANISM: Artificial Sequence 22 Os. FEATURE: <223> OTHER INFORMATION: Description of Artificial Sequence; note = synthetic construct US 2011/0098.193 A1 Apr. 28, 2011 59

- Continued

SEQUENCE: 28 ccagacaagt ttgttgtagg atatgcc.ctt gactataatgaat actt cag ggatttgaat 6 O gtaattgct 69

SEQ ID NO 29 LENGTH: 90 TYPE: DNA ORGANISM: Artificial Sequence FEATURE: OTHER INFORMATION: Description of Artificial Sequence; note synthetic construct

SEQUENCE: 29 tgtaggatat gcc cittgact ataatgaata citt cagggat ttgaatgitaa ttgcttctitt 6 O ttct cact catttittcaaaa cacgcataaa 9 O

SEQ ID NO 3 O LENGTH: 90 TYPE: DNA ORGANISM: Artificial Sequence FEATURE: OTHER INFORMATION: Description of Artificial Sequence; note synthetic construct

SEQUENCE: 3 O tgtaggatat gcc cttgact ataatgaata citt cagggat ttgaatgtaa ttgcttctitt 6 O ttct cact catttittcaaaa cacgcataaa 9 O

SEQ ID NO 31 LENGTH: 90 TYPE: DNA ORGANISM: Artificial Sequence FEATURE: OTHER INFORMATION: Description of Artificial Sequence; note synthetic construct

SEQUENCE: 31 tgtaggatat gcc cittgact ataatgaata citt cagggat ttgaatgitaa ttgcttctitt 6 O ttct cact catttittcaaaa cacgcataaa 9 O

SEQ ID NO 32 LENGTH: 90 TYPE: DNA ORGANISM: Artificial Sequence FEATURE: OTHER INFORMATION: Description of Artificial Sequence; note synthetic construct

SEQUENCE: 32 tgtaggatat gcc cittgact ataatgaata citt cagggat ttgaatgitaa ttgcttctitt 6 O ttct cact catttittcaaaa cacgcataaa 9 O

SEQ ID NO 33 LENGTH: 90 TYPE: DNA ORGANISM: Artificial Sequence FEATURE: OTHER INFORMATION: Description of Artificial Sequence; note synthetic construct

SEQUENCE: 33