Supporting Information

Ariffin et al. 10.1073/pnas.1417322111 SI Materials and Methods predicted to be deleterious and to preclude p53 function TP53 Database and Anticipation Data. The IARC TP53 database by disrupting the oligomerization domain and preventing the compiles standardized information on all subjects with germ-line formation of high-affinity DNA binding protein complexes. TP53 mutation described in the public literature. This information DNA from five members of a second family (MM family) was includes TP53 mutation identification, tumor topography and analyzed by whole-exome sequencing (WES). This pedigree is morphology, sex, age at diagnosis, number of generations analyzed shown in Fig. 2. The proband carried a TP53 mutation at codon in the pedigree, and position in the pedigree of each subject with 245 (p.G245S) (mutation details) and developed rhabdomyosar- documented cancer status. We used the R16 version of the da- coma at age 2 y. A sister (unconfirmed mutation carrier) died of tabase, updated in November 2012 (p53.iarc.fr/DownloadDataset. medulloblastoma at age 2.5 y. Two other siblings who are carriers aspx), which contains data on 2,450 subjects with the germ-line of the p.G245S mutation (age 8 and 12 y) are yet to manifest TP53 mutation. Data were filtered to select individuals who (i)are a malignancy. The father (mutation carrier) has not developed any part of families matching LFS or LFL criteria or have a defined cancer by the age of 40. The mother is an ascertained noncarrier. familial history (FH) of cancer; (ii) belong to pedigrees with one to four documented generations (families with five generations Whole-Genome and -Exome Sequencing. From the onset of the study were excluded because of either low average age of individuals in (identification of the proband), the family was offered genetic the fifth generation or incomplete ascertainment of tumor di- counseling, and parents were asked to provide signed informed agnosis in the first generation); (iii) have their generational consent for themselves and for their children. For WGS of 13 position in the pedigree clearly identified; and (iv) do not carry subjects from this LFS kindred, a specific consent was provided by the “Brazilian founder” TP53 R337H mutation, a mutation of the father and the mother of the proband (who are acting as partial penetrance, which has been associated with extremely var- caretakers for the children of their deceased sister). For partic- iable familial and nonfamilial tumor patterns. This search identi- ipants who are currently alive, DNA was extracted from pe- fied 1,771 individual records from 294 pedigrees. Families were ripheral white blood cells obtained by venipuncture using the subgrouped in four classes according to the number of generations Qiagen DNA Extraction kit according to the manufacturer’s documented (one to four generations). The mean age at first instructions (QIAamp DNA blood Maxi kit; Qiagen). For de- cancer onset (±SD) and proportion of diagnoses corresponding to ceased patients, archival DNA (preserved at −80 °C) was used. childhood or adult forms of common LFS/LFL cancers in each WGS and WES were performed by the Beijing Genome Institute generation were calculated separately for families with one, two, (BGI) (www.genomics.cn/en/index). WGS was performed using three, or four generations. To detect a possible shift in the severity ABI SOLiD 4.0 with paired-end 50- reads. Primary data of childhood cancer phenotype, the age-dependent accrual of analysis including image analysis, base calling, alignment, and cancer up to age 20 y was plotted for each generation according to variant calling including CNV were performed by BGI using Bio- pedigree structure. Data were analyzed using the standard t test (for scope software. Exome sequencing of the MM family was per- age at first cancer onset) or χ2 test (for distribution of tumor types) formed by BGI using DNA fragments of 150–200bpinlengthand using tools available at vassarstats.net/, without further adjustment enriched for exome with the Agilent SureSelect in Solution. Cap- for age, sex, or region of origin of selected patients and families. tured library was then sequenced on the HisEq. 2000 platform with sequences generated as 90-bp paired-end reads. Basic bioinformatics Patients and Family. Whole-genome sequencing was performed on analysis was performed by BGI. Briefly, adapter and low-quality 13 members of a family of Malay origin matching strict LFS reads were filtered to generate clean data. Burrows–Wheeler criteria (KA family). Briefly, the proband, an 8-y-old girl, first Aligner (BWA) was used to align reads to the reference sequence. presented at the age of 8 mo with embryonal rhabdomyosarcoma SOAPsnp was used to detect SNPs, and SAMtools and the Ge- of the trunk, which was treated by surgery and chemotherapy. nome Analysis Toolkit (GATK) were used to detect SNVs and Seven years later, she developed an intracranial lesion (left tem- insertions/deletions (indels). poroparietal mass), which was histologically proven to be a re- currence of rhabdomyosarcoma, and died at age 9. Among her 5 CNV Calling of WGS Data. CNVs were inferred using depth-of-read siblings, a younger sister developed adrenal cortical carcinoma at coverage, and CNValidator (code.google.com/p/cnvalidator) was 6 mo of age, which was successfully treated. Their mother was used to identify high-confidence CNVs using SNP information. diagnosed with carcinoma of the right breast at age 26, and of left CNVs were initially called using the Find Human CNV tool in breast at age 28, for which she is currently in remission after surgery Bioscope. Briefly, the algorithm divides the arms and chemotherapy. Her sister (aunt of the proband) was diagnosed into windows of size 5 kb. The read-depth coverage was computed with an osteosarcoma of the jaw at age 26 and is deceased. The and corrected for GC content bias. Global normalization was maternal grandmother of the proband was deceased from breast used to compute log ratios using the median of average means of cancer at age 38. all chromosome arms as the expected value. A hidden Markov TP53 mutation was detected using primers and protocols model was used to call the copy-number state of the windows, and recommended by the International Agency for Research on contiguous windows of the same state were merged into CNV Cancer (p53.iarc.fr/Download/TP53_DirectSequencing_IARC.pdf). segments with P value significance. Only CNVs exceeding 10 kb A 6-bp insertion mutation at the second base of codon 334 (nu- in size were called because they must span at least two windows cleotide 17579) in exon 10, causing in-frame insertion of residues to be detected. A smaller P value indicates a greater probability repeating residues 334 (Gly) and 335 (Arg) was identified in the of a predicted CNV to be true. Only CNVs with P value less than following family members: the proband, her affected sister, two 0.05 were retained for further consideration. X younger siblings who have yet to manifest a malignancy, the af- and Y were excluded from the analysis. fected aunt and her two unaffected children, and an unaffected In addition, the CNValidator software (code.google.com/p/ uncle (brother of the mother). Two siblings of the proband and cnvalidator) was used to filter out low-quality CNVs using SNP the father were confirmed noncarriers (Fig. 1). The mutation is information, applying P < 0.01 as threshold. Briefly, rigorous

Ariffin et al. www.pnas.org/cgi/content/short/1417322111 1of6 statistical methods were used to assign P values to CNVs with Male or Female Reference DNA according to sex. Initial data null model consisting of regions of the genome after removing all analysis was done with Agilent Cytogenomics 2.7.70 software us- purported CNVs as well as repeat regions including telomeric and ing Aberration Detection Algorithms (ADM-2 and Fuzzy Zero) pericentric regions. Deletions were filtered using homozygosity with the minimum number of probes that should be present in an and, because the null model has very uniform distribution of het- aberrant region taken to be 3, threshold score of 6.0 (default), and erozygous SNPs averaged over segments greater than 10 kb, the minimum average log ratio to be 0.25. Chromosomes X and Y Poisson model was used to determine P value. Duplications were were excluded from the analysis. filtered by median of ratio of heterozygous SNP reads, and boot- strapping of the null model was used to determine P value. CNV Accrual in WT and Trp53 Knockout Mice. Trp53 WT mice that were 3 mo of age were mated to either trp53 WT mice (at 3 mo, Detection of de Novo CNV from WGS. For the children KA:III:1 to 9 mo, and 12 mo of age), or trp53 heterozygous mice (at 3 mo, III:6, CNVs could be identified as inherited or de novo through 9 mo, and 12 mo of age), or trp53 null mice (at 3 mo of age) pairwise comparison with the CNV results for their parents, II:1 (Dataset S3). Whole embryo of offspring from the crosses were and II:2 (Table S2). Any CNV from a child that overlaps with a taken at 17.5 d for analysis. DNA was extracted from both parents CNV from either parent was considered inherited. Almost all P < (liver) and all offspring and sent for genotyping by The Jackson CNVs that passed CNValidator ( 0.01) were inherited. At Laboratory using the Mouse Diversity Genotyping Array to most one de novo CNV was found in any child. This number is identify any potential de novo CNV in the offsprings. The fathers consistent with a low false-positive rate for CNV detection be- and mothers were either 129SvSL or C57BL/6 but always different cause de novo CNVs are rare per birth and inherited CNVs are SL strains so offsprings were mixed of 129Sv and C57BL/6 strains. detected independently in at least two members of the family. To We identified the set of SNPs that were homozygous and different further assess the significance of candidate de novo CNV de- in the parents so the offspring should be heterozygous. From this tected, a lower P value was applied in filtering with CNValidator. No de novo CNVs passed CNValidator with P < 0.0001. set of SNPs, we computed the B allele frequency (BAF) and the log R ratio (LRR) using PennCNV (www.openbioinformatics.org/ Identification of SNV Segregating in LFS Family. Single nucleotide penncnv/). For the offspring, the BAF should be 0.5 and the LRR variants (SNVs) were called using diBayes, the SNP calling tools should be 0. Any significant deviation from these values may in- of the Bioscope software. SNV calls were confirmed using the dicate a CNV. A de novo germ-line deletion would have BAF close Genome Analysis Toolkit (GATK) (www.broadinstitute.org/gatk/). to 0 or 1 and an LRR that is significantly negative. A de novo germ- Briefly, duplicates were marked with Picard (picard.sourceforge. line duplication would have BAF close to 0.33 or 0.67 and an LRR net/) and then analyzed with GATK to perform local realign- that is significantly positive. We took a sliding window of 31 SNPs ment around indels and base quality score recalibration. SNVs and used the median value to reduce the noise for the BAF and were detected using the GATK Unified Genotyper. These LRR. To calculate the threshold for significance in BAF and LRR, variants were annotated and filtered using ANNOVAR (www. we used the bootstrap approach. We randomized the location of openbioinformatics.org/annovar/). The filters used include the SNPs across the genome and used the sliding windows to get the following: annotate and identify exonic/splicing variants, remove maximum and minimum value of BAF and LRR. We repeated this synonymous variants, use conservation from 46-species alignment, process 100 times and used the distribution of extremal BAF and remove variants in segmental duplication regions, and remove LRR values to estimate threshold values with P value of 0.05. The variants in 1000 Genome Projects and dbSNP. Nine missense BAF and LRR are shown for all mice with the threshold values as mutations and one nonsense mutation were identified and con- horizon lines (Dataset S3). Almost all of the CNVs detected in the firmed by Sanger sequencing. offsprings were inherited. The evidence for inheritance was found in the parent contributing the CNV by looking at the LRR value at Detection of de Novo SNV from WES. MuTect (www.broadinstitute. the CNV location. The second evidence was that multiple pups in org/cancer/cga/mutect) with default parameters was used to de- the same litter or different litters would have the same CNV be- tect point mutations in offsprings not found in parents from cause the parents in these studies are often related by few gen- ’ + + whole-exome sequencing data. The parents binary alignment/ erations. One embryo, batch 1/sample 16 from C57Bl6 trp53 / map (BAM) alignment files had duplicate reads removed, +/+ female 9 mo and 129SvSL trp53 male 3 mo, seemed to have a de merged, and used as control. Because MuTect was originally de- novo duplication in chromosome 16 (Dataset S3). Two siblings veloped for the identification of somatic point mutations in + + (batch 1/samples 3 and 4 from C57Bl6 trp53 / female 3 mo and cancer genomes, we performed additional filtering of calls. The trp53+/+ potential de novo mutations were ranked by the t_lod_fstar value, 129SvSL male 3 mo) seemed to have an identical de novo deletion on chromosome 11 not found in any other pups. Similarly, which represents the log of likelihood of a mutation being real to trp53−/− the likelihood of an event being a sequencing error. We inspected two siblings (batch 3/samples 10 and 11 from 129SvSL trp53+/+ the top candidates for each offspring manually by looking at the female 3 mo and C57Bl6 male 3 mo) had an identical de alignment of reads at those positions in the genome across all novo duplication on chromosome 17 not found in any other pups. These CNVs might be a result of clonal mosaicism of gametes from family members with Integrative Genomics Viewer (IGV). + + the parents. One pup (batch1/sample13 from C57Bl6 trp53 / fe- +/+ CNV Detection from aCGH. DNA from members of the KA family male 9 mo and 129 Sv SL trp53 male 3 mo) had a germ-line (II:1, II:2, III:1, III:4, and III:6) and the MM family (II:6, II:7, trisomy of chromosome 18 (Dataset S3). Two adults, batch − − III:1, III:3, and III:4) were sent to Genotypic Technology for 2/sample 50 (C57Bl6 trp53 / male 3 mo) and batch 3/sample 34 − − aCGH using the Agilent SurePrint G3 Human CGH Microarry (129SvSL trp53 / male 3 mo), seemed to have somatic mosaicism Kit, 1 × 1 M. Each sample was hybridized along with Coriell with a large number of aneuploidy in the liver (Dataset S3).

Ariffin et al. www.pnas.org/cgi/content/short/1417322111 2of6 100 1 2 3 4 80 60 40 20 00 1.1 2.1 2.2 3.1 3.2 3.3 4.1 4.2 4.3 4.4 1 cancer 2 cancers 3 cancers or more

Fig. S1. Families are grouped in categories according to the number of documented generations (one, two, three, or four generations, gray bars at the top of the graph). In each group, generations are numbered, 1 being the oldest documented generation. Histograms show the proportion of patients with one, two, and three or more cancers diagnosed at any age. Only families with a single generation affected show a significantly higher number of subjects with morethan one diagnosis (distribution in families with one generation versus the sum of most recent generations of families with two, three, or four generations(P < 0.001; χ2 analysis). The proportion of patients with more than one diagnosis is comparable in all families with two or more documented generations, with no tendency toward increased multiple diagnoses with successive generations (comparison between generation 1 and generation 2 in families with two gen- erations, P = 0.6004, Fisher’s exact test; comparison between generation 1 and generation 3 in families with three generations, P = 0.0581, Fisher’s exact test).

Fig. S2. Southern blot assessment of telomere length in members of the KA family using TeloTAGGG kit (Roche). CNC, carrier with no cancer; CC, carrier with cancer.

Ariffin et al. www.pnas.org/cgi/content/short/1417322111 3of6 Table S1. Haplotypes of TP53 in family KA Carriers Carriers without with cancer cancer Noncarriers Haplotype Mother Father Position Ref. II:2 II:1 III:4 III:1 III:5 III:6 III:2 III:3 M1 M2 P1 P2

7503347 G GG GA GG GA GG GA GA GA G G G A 7511548 G GG GC GG GC GG GC GC GC G G G C 7511654 C TT CT TT CT TT CT CT CT T T T C 7511655 A GG AG GG AG GG AG AG AG G G G A 7511681 G GG GA GG GA GG GA GA GA G G G A 7511703 T TT TC TT TC TT TC TC TC T T T C 7512177 G GG GA GG GA GG GA GA GA G G G A 7518132 A AC AC AA AC AA AC CC CC A C A C 7518152 G GA GA GG GA GG GA AA AA G A G A 7521849 A AG AG AA AG AA AG GG GG A G A G 7523808 G GA GA GG GA GG GA AA AA G A G A 7525125 T TC TC TT TC TT TC CC CC T C T C 7528624 T TC TC TT TC TT TC CC CC T C T C 7529195 A AT AT AA AT AA AT TT TT A T A T 7529285 A AG AG AA AG AA AG GG GG A G A G 7529913 T TC TT TT TT TT TT TC TC T C T T 7533098 C CG CC CC CC CC CC CG CG C G C G 7533285 C CT CT CC CT CC CT TT TT C T C T 7533505 G GA GA GG GA GG GA AA AA G A G A 7539985 C CG CG CC CG CC CG GG GG C G C G

Twenty SNPs of high confidence and informative for phasing of TP53 region are shown with the genotypes of members in the family and their haplotypes. The TP53 mutation is found on haplotype M1.

Table S2. Number of de novo and inherited deletions and duplications using different P value thresholds from CNValidator for CNV showing no significant difference between siblings in the KA family who are carriers with early cancer (III:1 and III:4), carriers without early cancer (III:5 and III:6), and noncarriers (III:2 and III:3) De novo De novo Inherited Inherited Total Child deletion duplication deletion duplication CNV

P < 0.01 III:1 1 0 67 55 123 III:2 0 1 58 59 118 III:3 0 0 58 56 114 III:4 0 0 71 66 137 III:5 0 0 64 61 125 III:6 1 0 68 53 122 P < 0.001 III:1 1 0 60 17 78 III:2 0 0 56 23 79 III:3 0 0 56 18 74 III:4 0 0 66 24 90 III:5 0 0 62 22 84 III:6 0 0 62 19 81 P < 0.0001 III:1 0 0 56 8 64 III:2 0 0 54 7 61 III:3 0 0 50 4 54 III:4 0 0 62 3 65 III:5 0 0 58 5 63 III:6 0 0 59 1 60

Ariffin et al. www.pnas.org/cgi/content/short/1417322111 4of6 Table S3. Candidate genetic modifiers for anticipation in family KA Nucleotide Amino acid Accession Encoded p53 Variant change change no. protein Function interaction

RPS6KA1 c.G1661A p.R554H NM_002953 Ribosomal Control of cell No known interaction protein S6 kinase growth and alpha-1 differentiation SCMH1 c.A45C p.K15N NM_001172219 Human sex DNA binding and No direct interaction, comb on mid-leg sequence-specific but forms part of homolog 1 DNA binding the PRC1 complex transcription factor regulated by p53 activity CACNA1E c.G6347A p.R2116H NM_001205293 Calcium channel, Calcium-dependent No direct voltage- processes including protein–protein dependent, expression, interaction but R type, alpha cell division and CACNA1E contains 1E subunit cell death p53 binding site LOC729059 c.G189T p.L63F NM_001242521 Novel Novel Novel uncharacterized uncharacterized uncharacterized protein protein protein TTC12 c.G585C p.E195D NM_017868 Tetratricopeptide Protein binding No known interaction repeat domain 12 ACSS3 c.T482C p.V161A NM_024560 Acyl-CoA Activation of No known interaction synthetase acetate for lipid short-chain synthesis or energy family member 3 generation PAPD4 c.G1178T p.G393V NM_001114394 PAP associated Cytoplasmic poly(A) No direct domain RNA polymerase and interaction but containing 4 metal ion binding regulates miR-122 stability to control p53 protein levels EPHA5 c.C712T p.R238X NM_004439 Ephrin type-A Receptor tyrosine Direct protein–protein receptor 5 kinase implicated interaction in regulating development ZFYVE16 c.C565G p.Q189E NM_014733 Zinc finger, Regulate endosomal Direct protein–protein FYVE domain membrane trafficking interaction containing 16 and forms part of various signaling pathways LRIG3 c.A364G p.N122D NM_153377 Leucine-rich DNA ligase integral Correlated repeats and to DNA replication expression in Ig-like domains 3 and repair cancers but no known interaction CEP350* c.T1404G p.I468M NM_014810 Centrosomal Required for Direct protein–protein (de novo protein 350 kDa anchoring of interaction variant found microtubules to in III:4) centrosomes

Rare paternally inherited variants found in the two carrier children with cancer (III:1 and III:4) but not in the carrier children without cancer (III:5 and III:6). One de novo mutation found in III:4. All variants were validated by Sanger sequencing. All these variants have not been reported as polymorphisms in the dbSNP and 1000 Genomes Project databases.

Dataset S1. CNVs of members of the KA family from aCGH

Dataset S1

Dataset S2. CNVs of members of the MM family from aCGH

Dataset S2

Ariffin et al. www.pnas.org/cgi/content/short/1417322111 5of6 Dataset S3. Genotypes and CNVs of mice ordered by their batch and sample number. Parents followed by their offsprings are grouped together, and their genotypes and age at mating (for parents) are shown. DNA of parents were taken from liver whereas DNA of offsprings were extracted from whole embryos at 17.5 d old. CNVs of mice are ordered by batch and sample number. Upper shows the B allele frequency (BAF) whereas Lower shows the log R ratio (LRR). Chromosomes are ordered in ascending numerical order and shown in different colors

Dataset S3

Ariffin et al. www.pnas.org/cgi/content/short/1417322111 6of6