Supplementary Materials: Non-invasive analysis of intestinal development in preterm and term infants using RNA-Sequencing

Jason M. Knight1,3, Laurie A. Davidson2,3, Damir Herman6**, Camilia R. Martin7, Jennifer S. Goldsby2,3, Ivan V. Ivanov5, Sharon M. Donovan8 and Robert S. Chapkin2,3,4*

1Department of Electrical Engineering, Texas A&M University, College Station, TX, 2Department of Nutrition & Food Science, Texas A&M University, College Station, TX, 3Center for Translational Environmental Health Research, Texas A&M University, College Station, TX, 4Department of Veterinary Integrated Biosciences, Texas A&M University, College Station, TX, 5Department of Veterinary Physiology and Pharmacology, Texas A&M University, College Station, TX, 6Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR, 7Department of Neonatology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, 8Department of Food Science & Human Nutrition, University of Illinois, Urbana, IL.

*Address correspondence to Dr. Robert S. Chapkin, Center for Translational & Environmental Health Research, MS 2253, Texas A&M University, College Station, TX 77843-2253, USA; Tel: +1-979-845-0419; Fax: +1-979-458-3704; E-mail: [email protected]

**Current address: Ayasdi, 4400 Bohannon Drive, Suite #200, Menlo Park, CA 94025

Supplementary Figure 1: Integrative Genomics Viewer (IGV) was used to visualize the reads mapped to the APOA4 gene for preterm infant sample 3 (top) and term infant sample 3 (bottom). The 3' bias is evident: the vast majority of reads map to or near the 3' UTR, on the left side of the annotated region of the hg19 reference genome.

Supplementary Figure 2: Correlation scatter plots for RNA-Seq and qPCR for 11 differentially expressed genes across the six individual samples. Genes with zero mapped reads are not displayed. The average slope is 0.869, the average Spearman correlation coefficient is 0.59, and the average Pearson correlation coefficient is 0.57. These correlations are lower than the 0.7 Spearman and 0.8 Pearson correlation coefficients seen in the MAQC dataset [1], which is not surprising given the more diverse and challenging nature of fecal samples. For comparison, they are similar to the correlations typically observed between RNA-Seq and microarray data on the same MAQC dataset (0.62 – 0.75 Pearson) and higher than the correlations between RNA-Seq and protein measurements (0.24 – 0.36) [2].

[1] Li, Bo, and Colin N. Dewey. "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC Bioinformatics 12.1 (2011): 323.
[2] Fu, Xing, et al. "Estimating accuracy of RNA-Seq and microarrays with proteomics." BMC Genomics 10.1 (2009): 161.

Supplementary Figure 3: Correlation scatter plots among all six individual samples and their smoothed FPKM distributions. In addition, violin plots of FPKM show reasonable uniformity among overall normalized expression intensities.

Supplementary Figure 4: Experimental design documenting sample isolation, sequencing and mapping.

Supplementary Figure 5: Volcano plot showing 188 differentially expressed genes at p-values < 0.05. Given the noisy nature of the data and the limited number of samples, q-values (p-values adjusted for multiple testing) were not used in the differential expression (DE) selection criteria.

Supplementary Figure 6: A Spearman correlation heatmap comparing the individual samples with the pooled term sample.

Supplementary Table 1: Read Statistics.

Sample       Total-reads  Human-reads  ERCC-reads  Genes 1 or more reads  Genes RPKM >1  Mito     Ribosomal  Microbial  Viral   Fungi   Protozoa
Preterm 1    43983566     503552       41132697    4021                   4525           125690   104        39862644   117847  91035   81383
Preterm 2    48841110     409748       44862225    2768                   3432           112459   176        43331472   105658  80809   74945
Preterm 3    40462297     15368537     27886375    13379                  8596           670003   2410       26175696   193080  144557  184535
Term 1       54179788     7303948      42719602    11527                  7266           1084048  2074       41780304   176463  145378  190937
Term 2       63701346     710728       58601378    5036                   5187           257341   163        56749538   134459  101926  95394
Term 3       51699388     384088       48161868    3274                   4095           124769   82         46670080   117468  66361   58089
Term Pooled  32005113     1615385      0           5182                   4049

Term Pooled: the source lists three further values (20542641, 165409, 184987) whose column assignment is not recoverable.

Supplementary Table 2: qPCR Ct values used for validation of differentially expressed genes.

qPCR Ct values
Sample     ABCC5   APOA4   CASP1   DYNLL1  NFKBIA  PLIN2   PPAP2A  RPS16   SCNN1A  SLC2A1  TMSB4X
Preterm 1  40.00   34.29   40.00   40.00   37.66   40.00   40.00   40.00   40.00   40.00   37.52
Preterm 2  37.39   31.69   34.90   35.24   31.11   35.21   36.30   40.00   36.25   37.21   28.78
Preterm 3  33.56   23.57   27.64   27.05   25.95   25.71   29.46   34.32   31.95   29.30   21.92
Term 1     37.38   33.37   38.17   33.16   33.74   35.88   35.10   38.74   34.62   36.04   28.86
Term 2     30.75   34.07   29.72   27.13   28.95   29.26   28.50   34.47   28.44   31.02   22.56
Term 3     40.00   40.00   40.00   36.26   36.79   40.00   40.00   40.00   36.80   38.77   31.70

log2(fold change)
Assay      ABCC5   APOA4   CASP1   DYNLL1  NFKBIA  PLIN2   PPAP2A  RPS16   SCNN1A  SLC2A1  TMSB4X
qPCR       2.16    -3.21   -2.30   1.63    -1.53   -1.56   2.38    -2.25   3.61    -1.58   1.24
RNA-Seq    1.20    -5.71   -1.53   2.17    -1.32   -1.15   0.98    0.63    3.04    0.49    1.96

Supplementary Table 3: Preterm metadata.

Sample Name  Gestational Age (weeks)  Date of birth
Preterm 1    32.6                     4/6/2010
Preterm 2    30.2                     4/15/2010
Preterm 3    27.5                     5/3/2010
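As a rough cross-check of the log2(fold change) rows in Supplementary Table 2, the following Python sketch compares the direction and magnitude of the qPCR and RNA-Seq fold changes across the 11 genes. This is an illustration only: it correlates fold changes across genes, which is a different calculation from the per-gene, across-sample correlations summarized in Supplementary Figure 2. The helper names `pearson` and `discordant_genes` are hypothetical.

```python
import math

# log2(fold change) values copied from Supplementary Table 2.
GENES = ["ABCC5", "APOA4", "CASP1", "DYNLL1", "NFKBIA", "PLIN2",
         "PPAP2A", "RPS16", "SCNN1A", "SLC2A1", "TMSB4X"]
QPCR = [2.16, -3.21, -2.30, 1.63, -1.53, -1.56, 2.38, -2.25, 3.61, -1.58, 1.24]
RNASEQ = [1.20, -5.71, -1.53, 2.17, -1.32, -1.15, 0.98, 0.63, 3.04, 0.49, 1.96]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def discordant_genes(genes, x, y):
    """Genes whose fold change has opposite signs in the two assays."""
    return [g for g, a, b in zip(genes, x, y) if a * b < 0]


r = pearson(QPCR, RNASEQ)
discordant = discordant_genes(GENES, QPCR, RNASEQ)
print(f"Pearson r across genes: {r:.2f}")
print("Opposite direction in qPCR vs RNA-Seq:", discordant)
```

Listing the sign-discordant genes separately is a deliberate choice: a single correlation coefficient can hide the few genes for which the two assays disagree on the direction of change.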

Supplementary Table 4: Term metadata. The samples labeled Pooled were aggregated with the individual samples to obtain the pooled sample.

Name    Gestational Age (weeks)  Date of birth  Diet     Ethnicity         Gender
Term 1  39.714                   7/14/2006      Breast   Caucasian         Male
Term 2  40                       4/25/2006      Breast   Caucasian         Female
Term 3  40                       7/7/2008       Formula  Caucasian         Male
Pooled  41                       5/30/2006      Breast   Caucasian         Male
Pooled  39.857                   6/4/2006       Breast   Caucasian         Male
Pooled  41.429                   6/30/2006      Breast   Caucasian         Male
Pooled  39                       5/18/2006      Breast   Asian/Caucasian   Female
Pooled  39.571                   2/11/2007      Breast   Caucasian         Male
Pooled  40                       4/6/2007       Breast   Caucasian         Female
Pooled  39.286                   5/16/2007      Breast   Caucasian         Female
Pooled  38.714                   10/18/2007     Breast   Caucasian         Male
Pooled  39.714                   12/27/2006     Formula  Caucasian         Male
Pooled  40.714                   8/30/2007      Formula  Caucasian         Male
Pooled  39.857                   10/9/2007      Formula  Caucasian         Female
Pooled  39.857                   10/18/2007     Formula  Caucasian         Female
Pooled  39.714                   7/23/2008      Formula  Caucasian         Male
Pooled  40                       9/26/2008      Formula  African-American  Male

Supplemental Methods Appendix – Reference Genomes

RefSeq

RefSeq version 59 [ftp://ftp.ncbi.nlm.nih.gov/refseq/release/] was used to obtain the complete DNA sequences (not the raw WGS shotgun repositories) for each organism group below. Statistics for each of these data repositories are available at ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/archive/RefSeq-release57.catalog.gz.

The following datasets were acquired:

• Mitochondria

• Fungi

• Viral

• Protozoa
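The download step for these four groups can be sketched in Python as below. The `wget -r -A` form is the one given for fungi in this appendix. Applying the same file-name pattern to the other groups, and the directory names `mitochondrion`, `viral`, and `protozoa`, are assumptions about the RefSeq release layout rather than commands stated in the text. The function `wget_command` is a hypothetical helper that only builds the command and does not run it.

```python
import shlex

REFSEQ_RELEASE = "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/"

# Assumed directory names; only the fungi path and the mitochondrion
# file name (mitochondrion.1.1.genomic.fna) appear in the text.
GROUPS = ["mitochondrion", "fungi", "viral", "protozoa"]


def wget_command(group):
    """Build a recursive wget call for one group's genomic FASTA archives."""
    pattern = f"{group}.*.genomic.fna.gz"
    return ["wget", "-r", "-A", pattern, f"{REFSEQ_RELEASE}{group}/"]


for group in GROUPS:
    # shlex.join quotes the glob pattern so the shell does not expand it.
    print(shlex.join(wget_command(group)))
```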

Reference indices were generated with:

SNAP: snap index ../mitochondrion.1.1.genomic.fna . -s 17 -t5 -O200
STAR: STAR --runMode genomeGenerate --genomeDir $(pwd) --genomeFastaFiles ../mitochondrion.1.1.genomic.fna --runThreadN 16
Bowtie2: bowtie2-build ../mitochondrion.1.1.genomic.fna genome
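Because the same three index-building commands recur for several reference sets in this appendix, they can be generated from one place. The Python sketch below assembles the SNAP, STAR, and Bowtie2 command lines exactly as written above for a given FASTA file. The function name `index_commands` and its parameters are hypothetical, and the commands are printed rather than executed.

```python
import os
import shlex


def index_commands(fasta, out_dir=".", snap_opts=("-s", "17", "-t5", "-O200"),
                   star_threads=16, star_extra=()):
    """Return the SNAP, STAR, and Bowtie2 index commands as argument lists."""
    return {
        "SNAP": ["snap", "index", fasta, out_dir, *snap_opts],
        "STAR": ["STAR", "--runMode", "genomeGenerate",
                 # Absolute path of out_dir, mirroring $(pwd) in the text.
                 "--genomeDir", os.path.abspath(out_dir),
                 "--genomeFastaFiles", fasta,
                 "--runThreadN", str(star_threads), *star_extra],
        "Bowtie2": ["bowtie2-build", fasta, "genome"],
    }


commands = index_commands("../mitochondrion.1.1.genomic.fna")
for tool, argv in commands.items():
    # To actually build an index, pass argv to subprocess.run(argv, check=True).
    print(f"{tool}: {shlex.join(argv)}")
```

For the larger references, the extra STAR option used elsewhere in this appendix can be supplied as `star_extra=("--genomeChrBinNbits", "10")`, and the SNAP options can be overridden through `snap_opts`.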

Microbial:

PatricBRC and RefSeq were used to assemble the microbial genomes. PatricBRC genomes can be obtained with:

wget -r -c -A "*PATRIC.ffn" ftp://ftp.patricbrc.org/patric2/genomes/

and RefSeq genomes can be obtained with:

wget -r -c -A "microbial.*genomic.fna.gz" ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/

This is roughly 16,000 taxids and 30 Gb of genomic nucleotides. The size of the microbial dataset necessitated the use of BWA to build the reference and align against it:

bwa index -a bwtsw ../micro-meta.fasta
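The text does not say how the PatricBRC and RefSeq downloads were combined into the single file ../micro-meta.fasta. One plausible approach, sketched below in Python, is plain concatenation of the plain-text and gzip-compressed FASTA files while tallying records and nucleotides. Both the concatenation strategy and the helper names `open_fasta` and `merge_fasta` are assumptions.

```python
import gzip
from pathlib import Path


def open_fasta(path):
    """Open a FASTA file as text, transparently handling gzip compression."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def merge_fasta(inputs, output):
    """Concatenate FASTA files into `output`; return (records, nucleotides)."""
    records = nucleotides = 0
    with open(output, "w") as out:
        for path in inputs:
            with open_fasta(path) as handle:
                last = "\n"
                for line in handle:
                    if line.startswith(">"):
                        records += 1
                    else:
                        nucleotides += len(line.strip())
                    out.write(line)
                    last = line
                # Inputs without a trailing newline would otherwise fuse the
                # next file's header onto the last sequence line.
                if not last.endswith("\n"):
                    out.write("\n")
    return records, nucleotides
```

A call such as `merge_fasta(sorted(Path("downloads").rglob("*.ffn")) + sorted(Path("downloads").rglob("*.fna.gz")), "micro-meta.fasta")` would then produce the merged reference, where `downloads` is a placeholder for the wget output directory.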

Fungi:

RefSeq was used to acquire the fungi database with:

wget -r -A 'fungi.*.genomic.fna.gz' ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/

which has approximately 2.3 Gb of data. Reference indices were generated using:

SNAP: snap index ../meta-microbe-5-25.fasta . -t15 -O100
STAR: STAR --runMode genomeGenerate --genomeDir $(pwd) --genomeFastaFiles ../meta-microbe-5-25.fasta --runThreadN 16 --genomeChrBinNbits 10
Bowtie2: bowtie2-build ../meta-microbe-5-25.fasta genome

Ribosomal:

From the Silva database, release 111, we obtained the small subunit (SSU) and large subunit (LSU) FASTA files in their truncated, non-redundant (NR) form. We chose the reference dataset rather than the complete dataset to keep the reference genome size low enough for use with typical aligners.

The GreenGenes database was not used because its website lacked sufficient metadata and information; Silva was used instead.

Reference Indexes:

SNAP: snap index ../silva-111.fasta . -s 17 -t5 -O200
STAR: STAR --runMode genomeGenerate --genomeDir $(pwd) --genomeFastaFiles ../silva-111.fasta --runThreadN 16 --genomeChrBinNbits 10
Bowtie2: bowtie2-build ../SSURef_111_NR_tax_silva_trunc.fasta,../LSURef_111_tax_silva_trunc.fasta genome

ERCC

Using the sequence information available from Ambion at http://tools.invitrogen.com/downloads/ERCC92.fa, we generated a STAR reference and aligned reads against it without spliced mapping.

iGenomes

Human reference sequences, annotations, and Bowtie2 indices are available from the Illumina iGenomes project and are linked from the TopHat website. The hg19 iGenome was placed in the /data/mnt/igenomes folder on sequencer.tamu.edu and was used for the analysis.

In addition, upon publication our automated analysis pipeline code will be made available at http://github.com/chapkinlab.