Phylogenetic Microarrays

Oleg Paliy, Vijay Shankar and Marketa Sagova-Mareckova (Chapter 9)

Abstract

Environmental microbial communities are known to be highly diverse, often comprising hundreds or thousands of different species. The great complexity of these populations, together with the fastidious nature of many of the microorganisms, makes culture-based techniques both inefficient and challenging for studying these communities. The analyses of such communities are best accomplished by the use of high-throughput molecular methods such as phylogenetic microarrays and next generation sequencing. Phylogenetic microarrays have recently become a popular tool for the compositional analysis of complex microbial communities, owing to their ability to provide simultaneous quantitative measurements of many community members. This chapter describes the currently available phylogenetic microarrays used in the interrogation of complex microbial communities, the technology used to construct the arrays, as well as several key features that distinguish them from other approaches. We also discuss optimization strategies for the development and usage of phylogenetic microarrays as well as data analysis techniques and available options.

Introduction

Microbes inhabit diverse environments. Some of these environments include the human intestinal tract and skin, soil, roots, leaf and bark surfaces of plants, ocean waters, deep sea vents, and air. The ecosystems of such environments are populated by communities of microorganisms, rather than by individual species, and often contain hundreds and even thousands of different microbial members. Many of these communities play pivotal roles in ecosystem processes such as energy flow, elemental cycling, and biomass production. Energy and nutrients in these systems are processed by intricate networks of metabolic pathways through multiple community members (Duncan et al., 2004; Belenguer et al., 2006; Flint et al., 2008; De Vuyst and Leroy, 2011). The sheer complexity of such networks and the difficulty involved in culturing the individual members of these communities have challenged researchers who have tried to gain a clearer understanding of these interactions. Recent advances in molecular technologies have significantly simplified the analysis of these communities because they remove the need to culture and grow community members individually. Some of the currently available molecular techniques include high-throughput sequencing (discussed in Chapter 8 of this book), terminal restriction fragment length polymorphism (discussed in Chapter 6), chequerboard DNA–DNA hybridization, quantitative real-time PCR, fluorescence in situ hybridization, and phylogenetic microarrays. Phylogenetic interrogation of small subunit ribosomal RNA (SSU rRNA) molecules using these techniques has led to considerable progress in our understanding of community structure and dynamics of various microbial ecosystems (Suau, 2003; Sekirov et al., 2010). Phylogenetic microarrays, one of the more popular choices among these techniques, have been successfully used to quantitatively profile a variety of microbial communities, including the gastrointestinal tract, sewage sludge, soil, and air (Brodie et al., 2007; Nemir et al., 2010; Val-Moraes et al., 2011; Rigsbee et al., 2012).

Date: 18:35 Friday 29 November 2013 UNCORRECTED PROOF File: Bioinformatics and Data Analysis 2P 208 | Paliy et al.

Although gene expression analysis was the original motivation behind the development of microarrays, their versatility has allowed researchers to adapt this technology for other uses, including phylogenetic analysis. Several types of microarrays have been developed to characterize the composition and function of microbial communities, including community genome arrays, functional gene arrays, and phylogenetic microarrays. Community genome arrays are constructed using whole-genomic DNA isolated from species in pure culture. They allow detection of individual species and strains in simple and complex communities. Functional gene arrays include probes to genes encoding important enzymes involved in various metabolic processes and are useful for monitoring physiological changes in microbial communities (Waldron et al., 2009; Xie et al., 2010). A good example of a functional gene array is the GeoChip, which contains tens of thousands of oligonucleotide probes for genes involved in biogeochemical cycling of carbon, nitrogen, phosphorus, and sulfur, for genes involved in metal and antibiotic resistance, and for genes coding proteins involved in bioremediation of organic compounds (Zhou et al., 2011). Phylogenetic oligonucleotide microarrays (phyloarrays) contain probes complementary to well conserved and ubiquitous gene sequences (usually the SSU rRNA gene) and are primarily used for the analysis of microbial community composition and variability (Paliy and Agans, 2012). Among different array types, phyloarrays are currently the most popular owing to the availability of a large set of near-full length SSU rRNA sequences deposited in NCBI, EMBL, RDP, and Greengenes databases (see also Chapter 7, ‘Repositories of 16S rRNA gene sequences and taxonomies’).

The first recognized phylogenetic microarray, developed by Guschin et al. (1997), was capable of detecting select genera of nitrifying bacteria. Since then, significant advances have been made with phylogenetic microarrays to improve the breadth of detection (total number of different groups detected), thereby increasing their versatility. Progress has also been made to increase the sensitivity and specificity of phylogenetic microarrays (Hazen et al., 2010; Paliy and Agans, 2012). In this chapter, we will discuss the current developments in the technology, optimization of usage, applications, and potential future trends in the use of phylogenetic microarrays.

Current phylogenetic microarrays

The high-throughput and quantitative nature of phylogenetic microarrays makes them an excellent solution for researchers who seek to determine the composition of their microbial community of interest. Some key features that distinguish different phylogenetic microarrays are the choices of phylogenetic markers utilized for probe design and the experimental platform used to host these probes (Paliy and Agans, 2012). A gene or a group of genes that is ubiquitously present among all or at least the majority of species of interest often makes the best target for phylogenetic analysis. Markers already in use that fit these criteria include the SSU rRNA gene (16S in prokaryotes and 18S in eukaryotes), the large ribosomal subunit RNA gene (23S and 28S, respectively), genes coding for the heat shock proteins GroEL and GroES and for ribosomal proteins such as protein S1 (Martens et al., 2007), and, in the case of methanogens, the mcrA gene, which encodes methyl coenzyme-M reductase (Luton et al., 2002). The SSU rRNA gene is currently the most popular choice in part because it can be fully and selectively amplified from total genomic DNA with a set of primers complementary to the conserved regions at the beginning and the end of the gene. Note, however, that the 16S rRNA gene has substantial limitations as a taxonomic marker when attempting to discriminate between closely related taxa, i.e. below the level. This is due to a high level of conservation of this gene sequence across bacterial taxa (Naum et al., 2008). As an alternative to the rRNA gene, apart from using the genes mentioned above, one can also utilize more specific metabolic genes for a particular community of interest. For example, to study methanotrophs, the methane monooxygenase (pmoA) gene can be used (Bodrossy et al., 2003; Stralis-Pavese et al., 2011), while the nifH gene, coding for a component of the nitrogenase protein complex, can be utilized to profile nitrogen-fixing diazotrophic populations (Zhang et al., 2007).
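The idea of selectively amplifying a marker gene with primers that bind conserved regions at its two ends can be illustrated with a small in-silico sketch. This is a minimal illustration only, not a procedure described in the chapter: the function names are hypothetical, the primer strings used in the usage note are illustrative assumptions, and exact matching with IUPAC ambiguity codes stands in for real primer annealing, which also tolerates some mismatches.

```python
# Minimal sketch (assumption, not the chapter's method): find a forward primer
# and the binding site of a reverse primer in a template, and return the
# region that PCR would amplify between them.

# IUPAC nucleotide codes: each primer symbol maps to the bases it may match.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "N": "ACGT",
}
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "M": "K", "K": "M",
    "S": "S", "W": "W", "N": "N",
}


def reverse_complement(seq):
    """Return the reverse complement of a sequence (IUPAC codes allowed)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_site(template, primer, start=0):
    """Return the first index >= start where primer matches template, or -1."""
    for i in range(start, len(template) - len(primer) + 1):
        if all(template[i + j] in IUPAC[sym] for j, sym in enumerate(primer)):
            return i
    return -1


def in_silico_amplicon(template, forward, reverse):
    """Return the template region spanned by the primer pair, or None.

    The reverse primer anneals to the opposite strand, so its site on the
    given strand is the reverse complement of the primer sequence.
    """
    fwd_pos = find_site(template, forward)
    if fwd_pos < 0:
        return None
    rev_site = reverse_complement(reverse)
    rev_pos = find_site(template, rev_site, fwd_pos + len(forward))
    if rev_pos < 0:
        return None
    return template[fwd_pos:rev_pos + len(rev_site)]
```

For example, calling `in_silico_amplicon(template, "AGAGTTTGATCMTGGCTCAG", "GGTTACCTTGTTACGACTT")` on a template that contains both binding sites returns the stretch from the start of the forward site to the end of the reverse site, and returns `None` when either site is missing.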


A typical design process for a microarray specific to a particular ecosystem or community usually involves the acquisition of 16S rRNA genes from members of that community (through clone library sequencing, for example) and subsequent selection of regions within the genes for probe design. Region selection can either be done manually, based on the availability of unique fragments in the hypervariable regions of the 16S rRNA sequence, or by using mathematical algorithms. Several software solutions such as ARB, GoArray and PhylArray exist to facilitate this process and provide an optimized automated design of microarray probes (Ludwig et al., 2004; Rimour et al., 2005; Militon et al., 2007).

Several technologies are available for the construction of phylogenetic microarrays. A currently popular choice, developed by Affymetrix, Inc. (USA), is to build arrays by probe chemical synthesis through photolithography. In this technique, oligonucleotide probes are directly synthesized on the array glass surface, one nucleotide at a time, using light activation and masking plates. In each round, a light mask is applied to the surface of the array, which allows only specific growing oligo sequences to incorporate a particular new nucleotide. After many rounds of masking and nucleotide addition through light activation (typical oligonucleotide length is 20–25 nt), the desired probes are constructed to generate a high-density microarray (Pease et al., 1994). Although expensive to produce compared to other available techniques, the Affymetrix arrays are consistent between batches, have high probe density on the array surface, and display low technical variability (Zakharkin et al., 2005).

In contrast, some laboratories prefer to create ‘in-house’ glass slide microarrays, where fully constructed oligonucleotide or DNA probes are deposited onto array surfaces using fine-point needles and robotics. The oligo or DNA probes are made and stored in solution, and each individual probe is deposited onto a specific glass surface location (spot) as a small drop. The drops are dried, the probes are subsequently attached to the glass surface, and the microarray is ready for use (Goldmann and Gonzalez, 2000). Membrane surfaces are sometimes used instead of the usual glass slide surface. This microarray construction approach allows for a high level of customization and adaptation. Because no metal masks are required, the array design can be updated frequently, and only a limited number of the arrays can be created at any given time. One of the commercial microarray manufacturers, Agilent, Inc. (USA), uses the process of ink-jet printing to print as many as 185,000 features onto a 1 × 3 inch slide. Recently, microelectrodes have also been used to construct high-density arrays, where probes are aligned and concentrated on the array surface using electrical charges applied to specific sections of the array. This technique reduces the amount of time and labour required for the construction of microarrays. More importantly, because oligonucleotide molecules can be concentrated on a small region of the array, this technique permits the use of lower amounts of the target fragments that are added to the microarray during hybridization (Heller et al., 2000). With the exception of photolithographic synthesis, where each oligonucleotide probe is anchored onto the surface prior to synthesis, the other microarray construction techniques require binding of probes to the array surface. Often, the oligonucleotide probes are chemically bonded onto the microarray surface covered with a coat of silane containing an active functional group (Chiu et al., 2003).

In recent years, significant improvements have been achieved in the design of phylogenetic microarrays, including improvements in the breadth of detection, sensitivity, and specificity. Table 9.1 lists some of the currently available phylogenetic microarrays together with their design parameters and targeted communities. The original phylogenetic microarray designed by Guschin et al. was capable of detecting a few genera of nitrifying bacteria (Guschin et al., 1997). The breadth of detection was expanded on the microarray developed by Wang and colleagues to include 40 predominant members of human gut microbiota (Wang et al., 2004). The current leader in the total number of potentially detectable groups, the third generation (G3) PhyloChip array, has been designed to detect as many bacterial phylotypes as possible (Brodie et al., 2006; Hazen et al., 2010). This microarray is based on the Affymetrix GeneChip technology and contains 1.1 million


Table 9.1 A selection of current phylogenetic microarrays

| Array name | Target community | Resolution | Technology | Detectable groups | Reference |
|---|---|---|---|---|---|
| PhyloChip | All prokaryotes | Varied | Photolithography | 9000 phylotypes (G2) | Brodie et al. (2006) |
| | | Species | | 50,000 phylotypes (G3) | Hazen et al. (2010) |
| Microbiota Array | Human intestinal biota | Species | Photolithography | 775 phylotypes | Paliy et al. (2009) |
| HOMIM | Human oral biota | Species | Aldehyde slide | 272 phylotypes | Preza et al. (2009) |
| V-Chip | Human vaginal biota | Varied | Activated polymer slide | 350 groups | Dols et al. (2011) |
| TCE Chip | Soil biota | Varied | Aldehyde slide | 742 groups | Nemir et al. (2010) |
| EcoChip | Sewage sludge biota | Species | Amine slide | 1,560 phylotypes | Val-Moraes et al. (2011) |
| RHC-PhyloChip | Activated sludge biota | Varied | Aldehyde slide | 79 groups | Hesselsoe et al. (2009) |
| Genome Proxy | Marine biota | Species | Poly-L-lysine slide | 14 phylotypes | Rich et al. (2008) |

25-mer probes arranged in a grid of 1,008 rows by columns, with an approximate probe density of 10,000 molecules per µm². The array is capable of detecting approximately 50,000 phylotypes (the previous version of the array, G2, contained 500,000 probes and was able to detect approximately 9000 phylotypes) (Brodie et al., 2006). This increase in the breadth of detection allows for a wide range of applications, evidenced by the recent use of PhyloChip in profiling coastal salt marsh, coral, and several human-associated microbial communities (Cox et al., 2010; Lemon et al., 2010; Wu et al., 2010b; Deangelis et al., 2011; Mendes et al., 2011).

The growing interest in the human-associated microbiota has led to the development of several microarrays designed to detect and profile specific human microbial communities. The Microbiota Array, also based on the Affymetrix photolithography technology, was designed to profile microbiota of the human gastrointestinal tract. The array contains 16,223 probes, with multiple probe sets allowing detection and quantification of 775 different human intestinal microbial phylotypes. Each probe set detects a single phylotype (also called operational taxonomic unit or phylogenetic species) and contains between 5 and 11 different probes to that phylotype’s 16S rRNA sequences. The Microbiota Array also takes advantage of the Affymetrix microarray design to contain, for each interrogated phylotype, both perfect match probes (which provide target quantification) and mismatch probes (which estimate the amount of cross-hybridization that is removed during normalization of probe signals). This phyloarray can detect phylotypes that are present at an overall community abundance of less than 0.001% (Paliy et al., 2009). To date the Microbiota Array has been used successfully to accurately profile the microbial communities of the distal gut in healthy adults, adolescents, and adolescents with irritable bowel syndrome (Agans et al., 2011; Rigsbee et al., 2012).

The HOMIM (Human Oral Microbial Identification Microarray), an aldehyde-coated glass-slide microarray, was designed to detect 272 microbial phylotypes from the human oral cavity through the interrogation of the 16S rRNA gene. The reverse capture probes in this array consist of 18–20 nucleotides complementary to the target sequence with a spacer sequence of eight thymidines and a 5′-(C6)-amine-modified base for attachment to the slide. The oligonucleotide probes are printed onto a 25 mm × 76 mm aldehyde slide. Each array is separated into five sections to facilitate the parallel processing of five samples, making the overall process more cost effective (Preza et al., 2009). This array has been an effective tool in detecting and profiling the oral microbiota in multiple studies, spanning several disease states as well as examining oral microbiota in healthy hosts (Preza et al., 2009; Docktor et al., 2012; Luo et al., 2012).

The V-Chip, also called the vaginal microbiota-representing microarray, is another spotted microarray that utilizes polymer-coated slides to house oligonucleotide probes. The array is constructed by employing a high precision robotic dispenser with fine-point quill pins to deliver oligonucleotide probes onto a slide surface. The probes contain a 5′-NH2-C6 terminal region that is used in the probe attachment. The array surface is coated with a proprietary activated polymer that is responsible for the binding of the probes to the array. The V-Chip array contains a total of 459 probes allowing for the detection of 350 vaginal microbial groups that are spread across multiple taxonomic levels (from species to order level) (Dols et al., 2011). This phylogenetic microarray was designed to profile human vaginal microbiota, and has demonstrated its effectiveness as a diagnostic tool for profiling changes in microbial communities in diseased states such as bacterial vaginosis (Dols et al., 2011).

Several microarrays targeting different soil microbial communities have also been recently developed. A prototype microarray composed of 122 oligonucleotide probes 20 to 25 nt in length was designed to target known microbes from plant rhizospheres, which mostly included representative taxa of Alphaproteobacteria at various taxonomic levels from phyla to species. This microarray was utilized to compare maize rhizospheres and bulk soil samples (Sanguin et al., 2006). This array was further expanded to include 1033 probes targeting specific rhizosphere bacteria known for plant growth promoting or disease suppressive characteristics. It was capable of discriminating between disease suppressive and disease conducive soils for tobacco black root rot (Kyselkova et al., 2009) and wheat take-all disease (Schreiner et al., 2010). A subset of probes from this microarray (113 oligonucleotide probes targeting Actinobacteria, particularly genera known for production of secondary metabolites) was employed in a spatial–temporal study of Actinobacteria in a waterlogged forest (Kopecky et al., 2011). Finally, the same microarray additionally enriched with Gammaproteobacteria and Pseudomonas probes has recently been used to assess microbial community structure perturbation as a result of exposure to 1 ppm of trichloroethylene. Microbial groups specifically sensitive to the trichloroethylene addition were determined (Nemir et al., 2010).

The EcoChip, an alternative soil microbiota phyloarray, was developed based on the 16S rRNA clone libraries obtained from different soil types. The clones were chosen from a bank of metagenomic DNA from soil microorganisms. The PCR amplicons (300 to 1000 bp long) were used in replicates for the microarray construction. PCR products were printed on glass slides treated with aminosilane. In total, the EcoChip contains 1,560 distinct partial 16S rRNA gene fragments from soil microorganisms; 43 partial sequences of 18S rRNA genes from fungi were printed to serve as a negative control. This microarray was able to distinguish bacterial communities between various soil sites and could determine the effect of sewage sludge addition on the respective soil bacterial community (Val-Moraes et al., 2011).

Uncultivated microbial phylotypes and their close relatives from marine environments can also be studied with phylogenetic microarrays. To construct a prototype Genome Proxy microarray, probe sets to 14 of the sequenced genome fragments and to genomic regions of the cultivated cyanobacterium Prochlorococcus MED4 were designed. Genome fragments consisted of sequenced clones from large-insert genomic libraries from microbial communities in Monterey Bay, the Hawaii Ocean Time-series station ALOHA, and Antarctic coastal waters. Each probe set contained multiple 70-mers, each targeting an individual open reading frame, distributed along a 40–160 kbp contiguous genomic region. This prototype array correctly identified the presence or absence of the target organisms and their relatives in laboratory mixes, with negligible cross-hybridization to organisms with ≤ 75% genomic identity (Rich et al., 2008). Furthermore, this microarray can be used for tracking microbial community and population changes in marine environments over time to provide a higher-resolution understanding of the dynamics of marine microbial communities (Rich et al., 2008).


An ‘isotope’ microarray approach has been developed to allow the measurement of incorporation of labelled substrate into the rRNAs of community members. For this purpose, a 16S rRNA-targeting microarray, RHC-PhyloChip, consisting of 79 nested oligonucleotide probes to most cultured and uncultured Rhodocyclales, was used. The diversity and ecophysiology of Rhodocyclales in activated sludge from a full-scale wastewater treatment plant were analysed. RHC-PhyloChip analysis was performed with fluorescently labelled and fragmented RNA from each activated sludge subsample that was incubated with ¹⁴CO₂ and allylthiourea under different conditions. An activity and substrate-utilization profile of the different Rhodocyclales groups in the activated sludge was created to distinguish between the active and dormant communities (Hesselsoe et al., 2009).

There are several features to take into account when comparing different phylogenetic microarrays. As seen in Table 9.1, microarrays differ in the technology used. The Microbiota Array and the PhyloChip were developed using photolithographic synthesis, which has several advantages including a high degree of efficiency, uniformity, and probe density (Brodie et al., 2006; Paliy et al., 2009). The Affymetrix platform takes advantage of the high probe density to allow these arrays to contain multiple probes per target (phylotype) as well as to enable allocation of mismatch probes that provide means to adjust for target cross-hybridization (Rigsbee et al., 2011). On the other hand, ink-jet and fine-point needle printing allow for cost-effective production and modification of microarrays since expensive tools such as photolithographic masks are not required. Printing on glass slides is still considered the most cost-efficient method currently available. However, the drawback of this type of array manufacturing is the loss of uniformity; therefore, these arrays require more extensive validation tests before they are ready for application.

Phylogenetic microarrays are also distinguished based on their resolution. In order to achieve the degree of resolution seen with Sanger sequencing, a species- or OTU- (operational taxonomic unit) level specificity is required. Profiling communities at this depth allows researchers to understand species-level interactions such as metabolic interdependencies and co-pathogenicity. In many cases the ability of microarrays to measure phylotype abundance is dependent on the complexity of the target community, and several of the currently available microarrays are capable of profiling microbial communities at the phylotype level (Table 9.1). Breadth of detection is yet another variable that differentiates phylogenetic microarrays (Paliy and Agans, 2012). The PhyloChip is an excellent example of a phyloarray specifically designed to detect as many microbial phylotypes as possible across the bacterial and archaeal domains. Its detection breadth makes this phyloarray very versatile, enabling its usage in many environmental and clinical studies. The downside to this type of design strategy is the potential for a high number of false positives due to off-target hybridizations induced by the high number of probes (Midgley et al., 2012). The issue of false positives and cross-hybridization can be ameliorated by optimizing the probe selection process and by assigning strict criteria for signal presence, though a complete resolution of the problem is very difficult. In contrast to such a design, phylogenetic microarrays designed for specific communities, such as the Microbiota Array and EcoChip, benefit from the reduced cross-hybridization potential to provide robust estimates of community structure, while maintaining the ability to discriminate different communities with similar efficiency (Kyselkova et al., 2009). The most powerful microarrays might be those that target a very particular microbial community or microbial taxonomic group (Genome Proxy array or RHC-PhyloChip) and thus can be employed to directly test a specific hypothesis.

Phylogenetic microarrays based on non-traditional techniques have also been described in several reports. For example, a DNA microarray based on a fragment ligation reaction has been developed by Candela et al. (2010). The microarray design involves the use of pairs of oligonucleotides complementary to the adjacent regions of each target sequence. One of the oligonucleotides contains a 5′-fluorescent label and the other has a unique ‘zip-code’ sequence. The oligonucleotide pair is ligated together only in the presence of the complementary target sequence binding to both oligos. Since the ligation is carried out by a highly selective ligase enzyme, a high level of probe specificity can be achieved with the use of this approach. The quantification of the fluorescently labelled ligated products is accomplished by the use of a specially designed ‘universal’ detection array that houses probes complementary to the tag (‘zip-code’) sequences present within the ligated products. These universal arrays allow for uniform hybridization conditions and for the use of different ligation probe sets unique to each interrogated community, which enables flexible experimental design. A prototype ligation array developed by Candela and co-workers is capable of quantifying 30 groups of human intestinal microbiota, and the array was used to profile the faecal microbiota of several young adults (Candela et al., 2010). Another non-traditional microarray, referred to as the restriction site tagged microarray, was developed by Zabarovsky et al. (2003). The array design was accomplished by developing tag sequences that are complementary to the regions flanking the recognition site of a rare-cutting restriction enzyme. A set of these tags represents a ‘passport’ for a particular phylotype. In the experimental protocol, genomic DNA is first digested by the restriction enzyme and is allowed to hybridize to tag sequences on the array. Quantification of the hybridization is accomplished through detection of the labelled products. Phylotype differentiation is achieved by constructing a custom microarray containing ‘passport’ sequences complementary to the enzyme site flanking regions from each phylotype genome. This type of array design allows for the differentiation of even closely related phylotypes. Finally, large subunit ribosomal RNA gene based phylogenetic microarrays have also been developed successfully (Mitterer et al., 2004; Yoo et al., 2009). For example, Mitterer et al. (2004) developed a custom glass-slide array that contained genus- and species-specific solid phase primers targeting a single variable region of the 23S rRNA gene. Using universal primers, genomic DNA from environmental samples was subjected to PCR amplification on the glass slide. The generated PCR products were allowed to bind to the group-specific primers for subsequent elongation accompanied by the incorporation of biotin labelled nucleotides (Mitterer et al., 2004). Quantification was based on fluorescence scanning of the hybridized probe–target pairs. This array was successfully used to identify bacterial communities in cervical swab samples at a high resolution (Mitterer et al., 2004).

Optimization of phylogenetic microarrays

Phylogenetic microarrays provide several advantages over some of the other currently available techniques used for profiling microbial communities. These include a highly quantitative nature of the acquired data, an ability to analyse one sample at a time, a short processing time, and an opportunity for multi-probe interrogation of each community member (Paliy and Agans, 2012). Phylogenetic microarrays can be used to identify taxa that vary in abundance by over five orders of magnitude (Roh et al., 2010). In addition, due to a frequent hierarchical organization of microarray probes, the precision of identification is relatively high, and different taxonomic levels of probe targets enable a more comprehensive view of the community structure. Although these attractive features make phylogenetic microarrays a viable option for phylogenetic analysis, there are also some limitations to the technology that must be addressed. Firstly, phylogenetic microarrays typically do not allow for the detection of novel phylotypes. They are only capable of detecting and quantifying phylotypes to which they contain probes. Secondly, microarrays are technically demanding to design, use, and analyse, and thus require rigorous testing, validation, and optimization (Hashsham et al., 2004). To help with the second limitation, a number of approaches that improve the robustness of microarray data have been developed and are discussed below.

Optimization of probe design and hybridization

The design of phylogenetic microarrays requires extensive knowledge and experience in probe selection. A lack of a rigorous probe selection process can lead to issues such as a high level of fragment cross-hybridization, which can result in inaccurate or biased community profiles. There are several variables that control the probe–target


hybridization process and the subsequent estimation of signal. One such variable, the size of the probe oligonucleotide or DNA fragment, has a large influence on the hybridization behaviour. In general, the length of the probe is positively correlated with hybridization chance (sensitivity) and is negatively correlated with hybridization specificity (Suzuki et al., 2007). Selecting short probes can lead to high specificity but at the cost of low hybridization sensitivity. On the other hand, picking long probes can increase the sensitivity of detection, but risks hybridization of smaller unrelated fragments to each probe. An ideal probe length provides a balance between high sensitivity and high specificity. Oligonucleotides of lengths between 20 and 30 nucleotides are generally selected in many phylogenetic microarray designs (Brodie et al., 2006; Paliy et al., 2009). The melting temperature of each probe–target duplex (Tm) is another important variable that should be taken into consideration when designing probes. Since the hybridization efficiency at any given temperature depends on the sequence Tm, it is important to constrain the melting temperatures of all of the probes to a relatively narrow range (He et al., 2005). The resulting consistency will reduce probe hybridization bias due to Tm variability, thereby increasing the validity of the acquired signals.

While designing probes for phylogenetic microarrays, it is also important to consider the optimal choice of probe targets. Most phylogenetic microarrays use the SSU rRNA gene for identification and taxonomic analysis of community members. While much of the 16S rRNA gene sequence is highly conserved, the gene contains nine sections commonly referred to as the ‘hypervariable’ (V) regions that display considerable sequence variability among different microbes (Chakravorty et al., 2007) (Fig. 9.1; see also Chapter 7, ‘Marker gene experiments’). Phylogenetic studies tend to exploit the variability within these regions for the detection and identification of microbial members within the analysed community. Many hypervariable regions are flanked by conserved sequences, allowing the use of ‘universal’ primers for the amplification of these regions from most microbial species. The degree of sequence variability varies among different V regions as shown in Fig. 9.1. As a result, the regions differ in their ability to distinguish among microbial phylotypes, and some regions (V3, V6) are slightly better suited to resolve closely related microbial species (Chakravorty et al., 2007). This characteristic emphasizes the need for careful consideration of probe target selection within the 16S rRNA gene. For example, community analysis using a microarray with probes to only a single hypervariable region has the potential to introduce a bias in the microbial community profile. It is generally considered good practice to design probes to multiple hypervariable regions, since such a design strategy can adjust for region-specific levels of variability and any potential hybridization biases.

General strategies for optimizing the design of probes have been previously considered by Letowski and colleagues (Letowski et al., 2004). In that study, the authors explored the effects of sequence mismatch on the destabilization of the probe–target hybridization at different fragment GC% and at different temperatures. One of the objectives of the study was to determine an optimal method for designing probes to closely related target sequences. To obtain quantitative results, the authors designed probes that differed in the number and distribution of mismatches. The probe specificities were determined and compared at various hybridization temperatures. The main conclusion of the study was that the greatest destabilization effect was achieved when

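[Figure 9.1 graphic: map of the 16S rRNA gene with sequence entropy shaded on a gradient from conserved to variable. Variable regions (E. coli nucleotide positions): V1 62–101, V2 180–220, V3 405–495, V4 592–650, V5 825–860, V6 997–1043, V7 1117–1156, V8 1234–1294, V9 1412–1488.]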

Figure 9.1 Sequence conservation and variability of 16S ribosomal RNA gene in prokaryotes. Sequence entropy is displayed using a gradient scale as shown in the legend. Positions of the variable regions (V1–V9, nucleotide positions are displayed for Escherichia coli 16S rRNA sequence) and sequence entropy values are based on the information from Ashelford et al. (2005).
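The two probe-level constraints discussed in this section, a length of 20–30 nucleotides and melting temperatures confined to a narrow range, can be illustrated with a minimal sketch. It is not the procedure used by any of the cited array designs. The function names, the 5 °C window, and the choice of the simple GC-content Tm approximation are all assumptions made for illustration; real designs typically rely on nearest-neighbour thermodynamics and also check specificity against a sequence database.

```python
from statistics import median


def gc_count(seq: str) -> int:
    """Number of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return seq.count("G") + seq.count("C")


def basic_tm(seq: str) -> float:
    """Approximate melting temperature in degrees Celsius.

    Assumption: the simple GC-content formula for oligonucleotides longer
    than 13 nt, Tm = 64.9 + 41 * (G + C - 16.4) / N. It ignores salt,
    probe concentration, and nearest-neighbour effects.
    """
    return 64.9 + 41.0 * (gc_count(seq) - 16.4) / len(seq)


def filter_probes(candidates, min_len=20, max_len=30, tm_window=5.0):
    """Keep probes of acceptable length whose Tm lies in a narrow window.

    The window is centred on the median Tm of the length-filtered
    candidates; tm_window is its total width (illustrative default).
    """
    sized = [p for p in candidates if min_len <= len(p) <= max_len]
    if not sized:
        return []
    centre = median(basic_tm(p) for p in sized)
    half = tm_window / 2.0
    return [p for p in sized if abs(basic_tm(p) - centre) <= half]
```

Centring the window on the median rather than on a fixed target temperature is a design choice of this sketch: it keeps the largest mutually compatible group of candidates without requiring the hybridization temperature to be fixed in advance.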

mismatches were distributed across the entire sequence of the probe. From that observation the authors inferred that in order to achieve optimal specificity when designing probes to closely related sequences, it is important to choose probes such that the variability is spread along the probe length(s) (Letowski et al., 2004). Conversely, variability concentrated towards the terminal regions of the probes showed greatly reduced specificity and therefore should be avoided. This study also confirmed previous reports of the dependence of the hybridization temperature on the GC% of the probes. In general, optimal specificity was achieved when the hybridization temperature correlated positively with the probe GC% (Letowski et al., 2004).

Hybridization specificity is also dependent on other parameters such as orientation of the immobilized probe, steric hindrance against binding, and secondary structure formation in target molecules. The influence of these parameters on the hybridization specificity, as well as methods to curtail their negative impacts, have been introduced and discussed by Peplies et al. (2003). Probe orientation was tested using variants of select probes immobilized by either their 5′ or 3′ ends. The hybridization of these probes to their target revealed a higher annealing efficiency for the 3′ immobilized probes. The reduction in the hybridization efficiency of the 5′ immobilized probes was likely due to the occurrence of steric hindrance, as the target has to bind the probe with its 3′ end facing the array surface. Note that a potential presence of secondary and tertiary structures in the target molecules can complicate the interpretation of these results. The effects of such steric hindrance can be mitigated by the use of a spacer sequence positioned between the array surface and the target-specific sequence of the probe. Indeed, Peplies and co-workers determined that there was a linear positive relationship between hybridization signal intensity and the length of the spacer sequence, indicating that longer spacer sequences significantly reduce steric hindrance (Peplies et al., 2003). Lastly, the use of helper oligonucleotides can resolve secondary and tertiary structures of the target molecules. Helper oligonucleotides are unlabelled sequences designed to bind adjacent to the probe’s binding site on the target molecule. By binding to the target molecule, the helper oligonucleotides prevent the target molecule from binding to itself, thereby increasing the efficiency of probe–target hybridization. Other optimization strategies, such as selective calibration for particular probes to recover false-negatives and improving specificity through signal-limiting parameters, can also be applied (Peplies et al., 2003).

Optimization of sample preparation

Methods to improve the experimental procedures for the use of phylogenetic microarrays have been described. A study by Salonen et al. illustrated and compared several methods for the extraction of genomic DNA from faecal samples (Salonen et al., 2010). Interestingly, the study found that the method used for the extraction of the genomic DNA from environmental samples had an effect on the compositional analysis of the community, and thus it is important to choose an extraction method that accurately reflects the actual community composition and provides efficient PCR amplification. This study proposed to use DNA quality, amount extracted, and community composition analysis as criteria for selecting and statistically authenticating an optimal method of genomic DNA extraction. The main conclusion from the comparison of methods was that the repeated bead beating approach to cell breakdown performed significantly better than the other methods, likely because it is generally more universal than alternative enzyme and chemical-based techniques (Salonen et al., 2010). The bead beating method was capable of uncovering certain groups of microbes, such as the methanogenic Archaea and some Gram-positive bacteria, that remained undetected when other commonly utilized extraction protocols were employed. As an alternative to the bead beating protocol, a recently developed pressure cycling technology can be utilized. In this approach, microbial or tissue samples are sealed in high-density tubes and are subjected to repeated rounds of high-low pressure fluctuations (Tao et al., 2006). This process not only leads to the breakdown of cells, but can also separate proteins, lipids, and DNA based on their hydrophobicity and ionic properties. Pressure cycling technology was shown to also reduce the


effect of PCR inhibitors (see below), presumably because of the separation of the inhibitors and nucleic acids into different phases (Tao et al., 2006).

A study of microbial community composition typically involves subjection of DNA collected from the community to rounds of target gene (such as 16S rRNA) specific PCR amplification. The goal of this approach is to selectively enrich the DNA pool with the fragments of interest, since 16S rRNA genes, for example, constitute less than 0.5% of total genomic DNA in most microorganisms. In the case of the 16S rRNA gene, primers that bind to universally conserved regions at the start and at the end of the gene, or flanking one or several variable regions, are used. Methods such as phylogenetic microarrays and next-generation sequencing are then employed to determine the composition of the amplified library. It is important to keep in mind that environmental communities are composed of a large number of individual phylotypes with sequence differences in the interrogated target gene. Thus, any PCR amplification of such a mixture of sequences is multi-template, and it has the potential to introduce a skew in the composition of the amplified PCR library compared to that of the original DNA mixture (Polz and Cavanaugh, 1998). Several causes have been proposed to explain this often observed deviation, which include the difference in the template GC% leading to unequal denaturation of template–product pairs during the melting step of the PCR reaction, the higher binding efficiency of the GC-rich variants of the degenerate primer mixtures used to amplify fragments, and the re-annealing of high abundance templates during the annealing step that results in the selection against major templates (Polz and Cavanaugh, 1998). In addition, carrying out a successful PCR reaction is always difficult for the genomic DNA obtained from environmental samples due to the presence of PCR inhibitors extracted during the DNA isolation process. Faecal material, for example, contains bile salts and complex polysaccharides that are known to inhibit DNA polymerase activity (Lantz et al., 1997; Monteiro et al., 1997). Isolation of high quality DNA from soil presents even greater challenges: not only is an efficient lysis of microbial cells challenging, but the presence of humic acid inhibits most enzymatic reactions (Rock et al., 2010). The problems with PCR inhibitors often necessitate the use of lower amounts of the starting DNA material in the amplification reactions in order to dilute the inhibitor concentration below a critical level.

Possible approaches to mitigate such PCR bias have been recently considered by Paliy and Foy (2011). In this study, mathematical modelling of the multi-template PCR amplification of 16S ribosomal RNA genes, as well as detection of the PCR products by phylogenetic microarray, was used in conjunction with experimentally determined parameters to define optimal amplification conditions that lead to accurate estimations of phylotype levels. One of the most important conclusions from that study was that both the detection and the accuracy of species abundance estimations depended heavily on the number of PCR amplification cycles used. The model predicted that the improvements in the detection and accuracy reached optima between 15 and 20 cycles of PCR amplification. Because of the unequal amplification rate for different templates in the mixture, the accuracy of community composition estimates was negatively affected when DNA was subjected to more than 20 cycles of amplification; at that point gradually increasing PCR bias outpaced any further improvements in phylotype detection (Paliy and Foy, 2011). Modelling the presence of PCR inhibitors in the samples showed that the use of more than 50 ng of starting DNA was detrimental to the overall reaction yield and to the accuracy of phylotype detection and abundance estimates. With higher starting amounts, the higher levels of inhibitors caused a significant reduction in the amplification efficiency, and thus more amplification cycles were needed to reach an appropriate reaction yield, which in turn led to a higher PCR bias. Furthermore, the detection and accuracy of phylotype abundance estimates correlated positively with sample-wide PCR amplification rate but related negatively to the sample template-to-template PCR bias and community complexity (Paliy and Foy, 2011). Although this model was developed based on the simulated interrogation of human intestinal microbiota community and subsequent detection by the Microbiota Array, it can be easily modified to simulate the analysis

of other communities, other available or novel microarray designs as well as other PCR amplification protocols.

Optimization of data normalization

In order to draw accurate conclusions regarding microbial profiles, raw signal values measured by each microarray have to be normalized and adjusted, so that a valid comparison of signals among multiple samples and arrays can be performed (Fujita et al., 2006). One goal of such signal normalization is to account for technical variability during sample preparation and microarray hybridization that can lead to systematic variations in measured signals. The objective of normalization is therefore to reduce the technical systematic variability among arrays so that it is easier to discern patterns or changes in microbial profiles across arrays. Many different methods of microarray data normalization have been developed over the years, and these approaches are generally applicable to the analysis of phylogenetic microarray data. The best choice of method often depends on the microarray technology used, the type of study, and the error or systematic variation present in the raw data. An interested reader is encouraged to refer to the study by Choe and colleagues, who compared the efficiency of different methods of microarray data normalization (Choe et al., 2005).

In general, the data normalization procedure encompasses background correction (subtraction of background noise and non-specific general probe binding), subtraction of mismatch and control probe signals where applicable (for example, mismatch probes are used in Affymetrix microarray designs), adjustment of signal distribution within each array to match those of other arrays in the set (across-array normalization), and summation or averaging of signals from multiple probes targeting the same sequence in order to obtain a single estimate of sequence abundance. Examples of software that run these normalizations semi-automatically include Dchip (Corradi et al., 2008), Affymetrix-developed Expression Console (part of the Affymetrix analysis suite), and the commercially available GeneSpring software suite (Agilent, Inc.). For users who desire control of each step of the process, the freely available R-based Bioconductor package allows separate definitions of each normalization step. The authors have used an online-implemented version of this package, accessible through the CARMAweb service (Rainer et al., 2006), to successfully normalize Affymetrix and glass slide microarrays.

One type of error that is often present in the phylogenetic microarray data is the occurrence of signal due to off-target fragment hybridization, i.e. cross-hybridization. This issue is especially problematic for 16S rRNA gene based phylogenetic analysis because most probes on such microarrays interrogate a single highly conserved molecule, and thus many fragments in the mixture are likely to possess significant sequence similarity, which leads to increased off-target hybridization and cross-hybridization signal. Without an appropriate method to adjust for cross-hybridization, acquiring accurate estimates of community members’ abundances becomes challenging. Microarrays based on Affymetrix design (Microbiota Array, PhyloChip) include a mismatch probe for each interrogating probe. These mismatch probes provide an estimate of potential cross-hybridization that can be removed from the probe set signal estimate during data processing. The situation is more difficult for the designs where such mismatch probes are not incorporated. Several methods have been explored recently to correct for such fragment cross-hybridization. One such approach, described by Rigsbee et al. (2011), involved the use of an algorithm for the correction of cross-hybridization of 16S rRNA gene targets among different phylotypes. In this method, the model was first built to estimate the measured total signal for each probeset as a combination of true signal from target–probe hybridization and false signal from cross-hybridizing fragments (Rigsbee et al., 2011). To provide model parameters, the levels of cross-hybridization for different phylotypes were acquired from validation experiments for the Microbiota Array. These cross-hybridization estimates were subsequently incorporated into an adjustment algorithm to calculate true signal from total signal. The resulting true signal was then used instead of the total signal for phylotype abundance calculations. This algorithm was successfully applied to phylogenetic data acquired with the Microbiota Array, and the adjusted values were shown to be more consistent with other


estimates of microbial community compositions acquired with alternative molecular techniques (Rigsbee et al., 2011).

Rigsbee and co-authors also introduced a second algorithm to adjust the normalized signal values for the estimated number of 16S rRNA gene copies per phylotype genome (Rigsbee et al., 2011). Since different bacterial species are known to contain a broad range of ribosomal RNA-encoding gene copies per genome (between 1 and 15), the measured true signal of a phylotype represents both its abundance and the total number of 16S rRNA gene copies it contains (for most species, 16S rRNA genes within the same organism have nucleotide sequence identity of ≥ 98% and thus would be expected to bind to the same probeset on the microarray) (Rigsbee et al., 2011; Kembel et al., 2012). The known numbers of 16S rRNA gene copies for the various microbial species can be acquired from publicly accessible databases such as rrnDB and NCBI. Adjusting the phylotype signal value by the estimated number of 16S rRNA gene copies allowed for a more accurate inference of each phylotype abundance (Rigsbee et al., 2011).

Improvements in data analysis

Similar to data normalization approaches, standard microarray data analysis tools can be utilized successfully to analyse phylogenetic microarray data. The approaches include various ways to visualize data with heat maps (see Fig. 9.2), box plots, and scatter plots, as well as clustering of different taxonomical groups based on their abundance among samples (Rajilic-Stojanovic et al., 2009; Agans et al., 2011; Rigsbee et al., 2011). Because in many cases abundances of individual taxa are defined relative to the overall community population, such relative abundance data are often presented in stacked columns, stacked

[Figure 9.2 graphic: heat map with samples kIBS01–kIBS22 and kHLT01–kHLT22 as columns; relative abundance shown on a gradient from low to high.]

[Figure 9.2 labels. Relative abundance scale: 0%, 0.1%, 1%, 5%, 25%. Phyla shown: Proteobacteria, Firmicutes, Actinobacteria, Bacteroidetes. Average relative abundance of the 12 most abundant genera (kIBS / kHLT): Clostridium 2.6% / 2.9%; Anaerotruncus 2.6% / 3.4%; Faecalibacterium 9.5% / 9.1%; Subdoligranulum 2.7% / 2.8%; Lachnospira 3.0% / 3.2%; Roseburia 6.0% / 5.4%; Ruminococcus 23.0% / 21.2%; Eubacterium 4.4% / 4.1%; Papillibacter 5.9% / 5.7%; Streptococcus 3.1% / 2.8%; Bifidobacterium 6.8% / 8.6%; Bacteroides 5.7% / 6.1%.]

Figure 9.2 Distribution of bacterial relative abundances among samples obtained from healthy children (kHLT) and children diagnosed with IBS (kIBS). Different samples are plotted as columns; microbial genera are plotted as rows. Relative abundances of each genus are displayed using a gradient scale as shown in the legend. Sample designation is shown at the top of each column. Vertical line separates kIBS and kHLT samples. The 12 most abundant genera are displayed on the right side; genus assignments to the four most abundant phyla are shown on the left side of the image. Numbers represent relative average abundance of each genus in kIBS and kHLT samples, respectively. The figure was first published in American Journal of Gastroenterology, issue 107, 2012 (Rigsbee et al., 2012), produced by Nature Publishing Group, a division of Macmillan Publishers Limited.
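The copy-number adjustment and the conversion to relative abundance described in this section amount to simple arithmetic, sketched below. This is an illustration only and not the published algorithm of Rigsbee et al. (2011): the function names are hypothetical, and the copy numbers would in practice be looked up in a resource such as rrnDB.

```python
def adjust_for_copy_number(signals, copies_per_genome):
    """Divide each phylotype signal by its 16S rRNA gene copy number.

    signals: mapping of phylotype name to normalized signal.
    copies_per_genome: mapping of phylotype name to estimated number of
    16S rRNA gene copies per genome (hypothetical values in this sketch).
    """
    adjusted = {}
    for phylotype, signal in signals.items():
        copies = copies_per_genome[phylotype]
        if copies <= 0:
            raise ValueError(f"invalid copy number for {phylotype}")
        adjusted[phylotype] = signal / copies
    return adjusted


def relative_abundance(values):
    """Scale values so that they sum to 1 across the community."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("total signal must be positive")
    return {name: value / total for name, value in values.items()}
```

For example, two phylotypes with equal signal but 1 and 4 gene copies per genome would be assigned relative abundances of 0.8 and 0.2 after adjustment, rather than 0.5 each.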

bars, or pie charts (Wu et al., 2010a; Rigsbee et al., 2012). To assess if different types of samples can be separated based on their community composition, data dimensionality reduction methods such as principal components analysis can be used (Nemir et al., 2010; Agans et al., 2011; Kopecky et al., 2011).

Recently, several studies have explored methods to improve analysis procedures associated with phylogenetic microarrays. A unique feature of the phylogenetic microarray data is the ability to link the presence and abundance of each sequence to the placement of the corresponding species on the phylogenetic tree. This information allows researchers to estimate community ecological parameters such as diversity, richness, and evenness, and to assess the sample separation that takes into account phylogenetic identity of community members (Hazen et al., 2010). For example, Hamady et al. described improvements in ecological beta diversity analysis of microarray data using phylogenetic information (Hamady et al., 2010). The approach incorporated evolutionary relationships between taxa to calculate phylogenetic beta diversity, a metric that is used to compare diversity among communities. This type of analysis can uncover underlying patterns of change in diversity that only become evident when phylogenetic relationships are taken into account. The authors developed an online tool, Fast UniFrac, which uses phylogenetic information in conjunction with multivariate statistics to assess if the examined communities are significantly different and to characterize phylogenies of the taxa that are responsible for the differences among communities (Hamady et al., 2010).

Another study, by Schatz and colleagues, introduced a stand-alone software package for the analysis of signal values from the PhyloChip microarray (Schatz et al., 2010). This software, called PhyloTrac, is capable of identifying and quantifying microbial community members from the environmental samples that were interrogated using the PhyloChip microarray. One of the several advantages of this software is the all-inclusive nature of the application. It contains all the necessary dependencies, such as phylogenetic information for assignment of taxonomy, normalization procedures, microarray design information, etc., within the package. This feature provides researchers with an efficient way to go from raw microarray data to comprehensive compositional analysis in a single step. Furthermore, PhyloTrac offers a user-friendly interface for the display of the community composition and , which allows for synchronized selection of OTUs across multiple modes of data visualization as well as for filtering of OTUs using any of the standard distance metrics.

Phylogenetic microarray applications

Phylogenetic microarrays have been utilized to successfully carry out many different studies that interrogated a diverse set of microbial environments. These included human associated niches such as the gastrointestinal, oral, and vaginal tracts, as well as communities from ocean waters, soil, and sewage. Examples of such high-throughput analyses using phylogenetic microarrays are discussed in this section.

The Microbiota Array

The faecal microbiome of healthy adolescents and adolescents with diarrhoea-predominant irritable bowel syndrome (IBS) was profiled recently in a study by Rigsbee et al. (2012). The objective of the study was to assess the differences in the faecal microbiota profile between the two groups and to potentially identify putative associations among different microbial members. This study took advantage of the quantitative nature of the Microbiota Array to compare relative abundances among the interrogated samples at several taxonomic levels. Microarray data were confirmed with high-throughput 454-based pyrosequencing and fluorescence in situ hybridization (FISH). The study showed that the overall structure of the faecal microbiomes was generally similar between healthy and IBS adolescents. In both groups, the phylum Firmicutes was the most abundant, followed by Actinobacteria and Bacteroidetes, with members of these three phyla cumulatively constituting 91% of the overall community composition on average (Fig. 9.2). At the genus level, the relative fractions of the abundant genera in the microbial communities were also similar between


the two groups; the polysaccharide-degrading members of the genus Ruminococcus were the most abundant (Rigsbee et al., 2012).

Some distinct differences in the microbial profiles were observed at lower taxonomic levels (genus and species). More specifically, the array detected lower levels of the genus Bifidobacterium but higher levels of genera Lactobacillus, Veillonella, and Prevotella in adolescents with IBS, an observation that is consistent with several other reports (Rigsbee et al., 2012). The array also allowed for the characterization of a set of phylotypes that was present in all or most samples. Such a set of phylotypes can be referred to as the core microbiome of that niche, which is often thought to play important roles in the community functional capacity including inter-species and host–microbial interactions. In the combined set of adolescent faecal samples, the array identified a core microbiome of 55 phylotypes. This core microbiome was dominated by genus Ruminococcus; members of genera Bacteroides, Clostridium, Faecalibacterium, Roseburia, and Streptococcus were also present (Rigsbee et al., 2012).

In order to identify putative associations among microbial members, a non-parametric correlation matrix was constructed using the abundance levels of the various genera across all samples. Such relationships can represent potential metabolic interdependencies, where the end-products of metabolism of some community members become energy and carbon sources for other members. The study identified a large number of statistically significant relationships among the genera, which is consistent with our current understanding of the intricate nature of metabolic networks among the community members in the intestinal ecosystem. As an example, abundance of members of genus Veillonella correlated with the largest number of other genera, probably because the members of this genus participate in the metabolic cross-feeding pathways (Chalmers et al., 2008). Specifically, V. parvula cannot degrade complex or even simple sugars available in the colon and relies on the use of intermediary end-products of carbohydrate fermentation (such as lactate, pyruvate, and fumarate) released by other gut microbes (Gronow et al., 2009). A physical association between Veillonella and Streptococcus was also observed in dental plaque (Chalmers et al., 2008).

PhyloChip

The G2 version of the PhyloChip was utilized to analyse watershed microbial communities in an attempt to characterize the sensitivity of these communities to perturbations in the environment (Wu et al., 2010a). Three different watershed communities (creek, lagoon, and ocean) were sampled from a coastal area that was known to be prone to faecal contamination. Aside from these environmental samples, faecal samples were also profiled in this study to obtain a direct comparison of community membership. Multi-response permutation procedure using Bray–Curtis diversity distances among the communities revealed significant differences among the four communities. Furthermore, non-parametric multidimensional scaling ordination was successful in separating samples based on their collection site for the majority of the analysed samples (Wu et al., 2010a). Environmental factors were also measured at the sampled sites in order to correlate them with the microbial profiles. Interestingly, among all the measured environmental variables, salinity had the greatest effect on the community composition, evidenced by the fact that in non-parametric multidimensional scaling ordination, lagoon samples that clustered with the creek group had salinity levels that resembled those of the creek samples. Specific effects of the environmental factors on the microbial communities were observed at the class level among the four habitats. Of the classes that showed the greatest variability among habitats, Bacilli, Bacteroidetes, and Clostridia were found to have higher relative abundances in faecal samples compared to the creek, lagoon, or marine samples. Conversely, Alphaproteobacteria were found at a lower relative abundance in faecal samples than in the environmental samples. A set of 503 phylotypes, found to be ubiquitous in faecal samples but not in the environmental samples, was used as a means to determine which collection sites were prone to heavy faecal contamination (Wu et al., 2010a).

The G3 version of the PhyloChip was used to profile marine microbial communities affected by oil plumes released during the Deepwater Horizon

oil spill (Hazen et al., 2010). The objective of the study was to characterize the unique features of the communities sampled from deep-sea oil plumes. The 16S rRNA microarray analysis showed that the communities underwent compositional and structural changes upon contact with the oil. Multidimensional scaling ordination using the Bray–Curtis beta diversity distance metric was able to differentiate bacterial and archaeal communities from plume and non-plume samples. Since all other factors were not significantly different between the sampled communities, this suggested that changes in microbial community profiles were due to the direct response of the microbes to the existence of oil in the environment. The PhyloChip uncovered a total of 951 individual bacterial taxa spread across 62 phyla from the analysed oil-plume samples. When compared to the non-plume samples, 16 bacterial taxa were found to be significantly enriched in the oil plume samples. All 16 of these taxa belonged to Gammaproteobacteria and most had representative members capable of degrading various hydrocarbons. The bacterial taxa enriched through the presence of oil included a significant number of psychrophilic and psychro-tolerant phylotypes similar to those that have been identified in cold deep-sea ecosystems (Hazen et al., 2010).

HOMIM

The oral microbiota-specific HOMIM array was employed to assess the microbiota profile in the saliva of healthy children and children with dental caries (Luo et al., 2012). The objective of this research project was to determine microbial biomarkers for the onset of dental caries in mixed dentition and to characterize the community profile of the microbial disease. In total, the study identified 86 phylotypes as well as eight clusters of closely related phylotypes. In agreement with several sequencing studies, the microbial community of the saliva was found to be dominated by the phyla Firmicutes and . The overall relative contribution of different phyla to the total microbial abundance was similar in both sample groups with the exception of the TM7 phylum, which was only detected in the caries-active group. A higher microbial diversity, with 89 detected species, was observed in communities from the caries-active group, compared to the caries-free healthy group that contained on average 59 species. This suggested a shift in microbial community structure in response to the change from a healthy to a diseased oral environment. Examining the relative abundances at the genus level revealed that genus Streptococcus was the most abundant, followed by Prevotella and Selenomonas (Luo et al., 2012).

Surprisingly, at the phylotype level and in contrast with several previous reports, cariogenic species such as Streptococcus mutans and members of the cariogenic genus Lactobacillus were not highly prevalent in the caries-active group (Luo et al., 2012). Interestingly, these cariogenic groups were substituted by the high prevalence of other streptococci. Examples of phylotypes that were differentially abundant between the two groups included species of Leptotrichia, which were found only in caries-active patients, and Granulicatella sp. and Rothia dentocariosa, which were found at much higher abundance in healthy children. There was a much greater number of phylotypes unique to the caries-active group compared to those unique to the healthy group, likely due to the higher community diversity seen in the caries-active group. A member of the genus Fusobacterium, Fusobacterium nucleatum, was found to be prevalent in all oral samples, which the authors attributed to the key role this species plays in the establishment of microbial communities in naturally forming dental plaques (Luo et al., 2012).

V-Chip

The vaginal microbiota of African women with or without bacterial vaginosis (BV) was examined by Dols et al. (2011) through the use of the vaginal microbiota-representing microarray (V-Chip). The goal of the study was to first test the ability of the microarray to successfully detect microbes found at high prevalence in BV, and to characterize the profiles of the vaginal microbial communities in women in the study group. The microarray results showed that women who were negative for BV had a high prevalence of various species of Lactobacillus, a genus that includes many members considered beneficial to human health. The number of detected microbial groups was significantly higher in the BV women than in those with

Date: 18:35 Friday 29 November 2013 UNCORRECTED PROOF File: Bioinformatics and Data Analysis 2P 222 | Paliy et al.

normal vaginal microbiota. BV-positive women harboured a much larger set of known microbial pathogens as well as more complex microbiota than women from BV-negative or intermediate groups. The microarray data also indicated that high prevalence of HIV in many cases correlated with high prevalence of BV. At a species level, the study revealed that Gardnerella vaginalis and Atopobium vaginae co-occurred in nearly 70% of the women, suggesting potential microbial interaction(s) between these species towards pathogenesis. The presence of Gardnerella was also associated with the presence of Leptotrichia and Prevotella species. Notably, while previous reports found Gardnerella vaginalis to be generally associated with BV diagnosis, this species was also present in 24% of BV-negative women profiled in this study. Thus, the microarray data did not support the previous use of the presence of this organism as a diagnostic tool for BV. Instead, the authors proposed to employ the co-occurrence of Gardnerella vaginalis and other pathogens such as Atopobium vaginae as a criterion for the diagnosis of BV (Dols et al., 2011).

EcoChip
The EcoChip was used to determine the impact of sewage sludge on soil bacterial communities (Val-Moraes et al., 2011). In general, a relatively high variation in community structure was observed from the beginning to the end of the experiment that likely reflected seasonal changes. Consistent with previous reports, microarray data revealed that soil communities were dominated by members from the phylum […], followed by those of Firmicutes, Proteobacteria, and Actinobacteria. Significant alterations in […] were observed when bacterial communities were compared before and after sludge application. Sludge amendment containing 25 kg N/ha favoured an increase in the number of members of Acidobacteria, Alphaproteobacteria, Bacteroidetes, Deltaproteobacteria, Firmicutes, […], and […], while Actinobacteria, […], and some Proteobacteria were the most diminished in sludge amendments of 200 kg N/ha. Members of the Epsilonproteobacteria and […] were found only in the samples treated with high doses of sludge. The levels of Epsilonproteobacteria correlated well with the levels of sulfate present in the analysed soil, an observation that is consistent with previous reports that claim the presence of Epsilonproteobacteria in sulfate-rich environments such as deep-sea vents (Val-Moraes et al., 2011).

RHC-PhyloChip
A composite microarray-based fingerprint of the Rhodocyclales community present in activated sludge was created with the help of the RHC-PhyloChip (Hesselsoe et al., 2009). Separate microarray hybridization patterns obtained with the fragments after either Rhodocyclales-selective or general 16S rRNA gene-based PCR amplifications were merged to provide an overall community view. These merged microarray hybridization results indicated the presence of bacteria belonging to or related to the Sterolibacterium lineage, the ‘Candidatus Accumulibacter’ cluster, and the genera Quadricoccus, Thauera and Zoogloea. A parallel cloning-sequencing approach provided a validation of the microarray capability to detect uncultured members of Rhodocyclales. A separate RHC-PhyloChip was hybridized with fluorescently labelled and fragmented RNA from each activated sludge subsample. Radioactive signals on the microarray indicated that bacteria represented by several cloned sequences were active under all conditions tested, while other Rhodocyclales groups, for which specific probes were present on the RHC-PhyloChip, displayed more specialized substrate incorporation behaviours. For example, the genus Zoogloea was detectable after oxic incubation with butyrate and propionate, but not with toluene (Hesselsoe et al., 2009).

ActinoChip
The actinobacterial community of a waterlogged forest soil was analysed by an Actinobacteria-specific microarray (Kopecky et al., 2011). The goal of the study was to follow bacterial communities at a previously studied site with respect to differences between soil horizons and seasons. The PCA analysis of the microarray data was able to distinguish between communities of the lower and upper horizons along the first ordination axis
(PC1, 49% of dataset variance explained), and the summer and winter communities (especially for the upper horizon) along the second ordination axis (PC2, 10% variance explained), indicating a higher effect of the horizon than season on actinobacterial community composition. The differences between horizons were mostly caused by much higher signals from the Mycobacterium probes in the upper horizon, while the differences between the seasons were due to the signals of probes targeting the genera Asanoa and Brevibacterium (higher in winter), and Mobiluncus and Saccharomonospora (higher in summer). The upper horizon soil appeared to be mostly influenced by organic matter content in winter and soil moisture in summer, based on the PCA-IV (instrumental variables) analysis (Kopecky et al., 2011).

TCE Chip
Soils contaminated with trichloroethylene (TCE) were examined in response to different doses of fresh TCE amendments at four concentrations (1 ppb, 100 ppb, 1 ppm and 25 ppm) after exposure of 2 h, 2 days, 14 days, 35 days, and 151 days in a study by Nemir and others (Nemir et al., 2010). Changes in bacterial communities were determined with the TCE Chip. TCE presence in the microcosms for only 2 h was sufficient to elicit changes in microbial composition. It was possible to discriminate bacterial communities containing either 1 ppm or 10 ppm TCE from samples treated with lower TCE concentrations. This trend continued over time, with visible separation between contaminated and control samples. After 151 days, however, the community structure regained homogeneity across concentrations. There was no significant difference between wet and dry negative controls tested at the 2 h and 151 day time points, showing that the effect of adding water to the samples was negligible when compared to the effect of adding TCE. An apparent threshold at which the microbial community structure was significantly affected was determined to be at a TCE concentration of about 1 ppm. Bacterial taxa associated with TCE contamination included, among others, Planctomycetes, Acidobacteria, and various groups of Proteobacteria (Nemir et al., 2010).

Future trends and outlook
High-throughput techniques such as phylogenetic microarrays and next-generation sequencing provide us with extensive knowledge regarding the composition of complex microbial communities. This knowledge enables us to understand which members are present in the community as well as to predict their potential role. Examples of the phyloarray applications that have been described in the previous section of this chapter highlight a multitude of questions that can be answered through the use of phylogenetic microarrays. A diverse set of microbial communities that include those found in human-associated niches such as gut, airways, and vaginal canal, as well as environmental ecosystems such as marine, soil, and sewage sludge, have been analysed qualitatively and quantitatively by phylogenetic microarrays. The intricate nature of the microarray design process and the extensive validation procedures have been limiting factors towards the wider use of phylogenetic microarrays. Nonetheless, there already exists an assortment of phylogenetic microarrays capable of analysing a variety of microbial ecosystems (see Table 9.1). The improvements in cost efficiency and the highly quantitative nature of phyloarrays make them an excellent choice for high-throughput compositional analysis of microbial communities. A particularly attractive application is the use of both phylogenetic microarrays and next-generation sequencing for the analysis of the same microbial community (Ahn et al., 2011; Crielaard et al., 2011; van den Bogert et al., 2011; Rigsbee et al., 2012). The phyloarrays provide quantitative data for the comparison of abundances across groups of samples, while the 16S rRNA amplicon sequencing allows for the identification of novel members of the community.

The future trends in the use of phylogenetic microarrays are likely to be defined by a shift towards integrative approaches to community analysis. Current studies have helped us understand the composition of microbial communities. Using this information in combination with new molecular tools, future studies will likely focus on the interactions among members of the microbial communities as well as between microbiota and the environment. There is also a growing interest
in understanding the link(s) between the function and the activity of microbiota in various environmental niches or disease states. In integrative approaches, the use of phylogenetic microarrays can be augmented with other high-throughput methods such as metabolomics, meta-genomics, meta-transcriptomics, and meta-proteomics to construct a more comprehensive model of the analysed community (Klaassens et al., 2007; Booijink et al., 2010; Martin et al., 2010; see also Chapter 7). A combination of these techniques would allow us to determine the profile of the community composition, total gene content, and expression levels of the genes and proteins, and we would be able to relate these data to the metabolite profiles of the environment and community members. Such an approach will enable us to understand the intricate relationships and the roles the members of the microbiota play within different microbial ecosystems.

Thanks to the advancements in technology and our knowledge of microbial communities, several enhancements to the design and use of phylogenetic microarrays can also be conceived. Programs such as the Human Microbiome Project (Peterson et al., 2009) and the MetaHIT initiative (Qin et al., 2010) have made available a substantial number of genome sequences of human-associated microbiota members. The availability of such resources has given rise to the possibility of designing phylogenetic detection arrays based on functionally conserved genes such as groEL, rpoB, gyrA and tufA (Loy and Bodrossy, 2006). Specific pathogen detection arrays have the potential to play a vital role in the field of microbial forensics for the rapid detection and identification of pathogens in the environment. Furthermore, phylogenetic microarrays can also be designed to contain probes to functional genes to enable simultaneous analysis of community structure and function (Louis and Flint, 2007). In a clinical setting, phylogenetic microarrays can be used as diagnostic tools, where their ability to detect human-associated microbiota members at a species level in a relatively short period of time can help in the diagnosis of various pathological states and rapid selection of treatment procedures that are most likely to succeed (Loy and Bodrossy, 2006).

References
Agans, R., Rigsbee, L., Kenche, H., Michail, S., Khamis, H.J., and Paliy, O. (2011). Distal gut microbiota of adolescent children is different from that of adults. FEMS Microbiol. Ecol. 77, 404–412.
Ahn, J., Yang, L., Paster, B.J., Ganly, I., Morris, L., Pei, Z., and Hayes, R.B. (2011). Oral microbiome profiles: 16S rRNA pyrosequencing and microarray assay comparison. PLoS One 6, e22788.
Ashelford, K.E., Chuzhanova, N.A., Fry, J.C., Jones, A.J., and Weightman, A.J. (2005). At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl. Environ. Microbiol. 71, 7724–7736.
Belenguer, A., Duncan, S.H., Calder, A.G., Holtrop, G., Louis, P., Lobley, G.E., and Flint, H.J. (2006). Two routes of metabolic cross-feeding between Bifidobacterium adolescentis and butyrate-producing anaerobes from the human gut. Appl. Environ. Microbiol. 72, 3593–3599.
Bodrossy, L., Stralis-Pavese, N., Murrell, J.C., Radajewski, S., Weilharter, A., and Sessitsch, A. (2003). Development and validation of a diagnostic microbial microarray for methanotrophs. Environ. Microbiol. 5, 566–582.
van den Bogert, B., de Vos, W.M., Zoetendal, E.G., and Kleerebezem, M. (2011). Microarray analysis and barcoded pyrosequencing provide consistent microbial profiles depending on the source of human intestinal samples. Appl. Environ. Microbiol. 77, 2071–2080.
Booijink, C.C., Boekhorst, J., Zoetendal, E.G., Smidt, H., Kleerebezem, M., and de Vos, W.M. (2010). Metatranscriptome analysis of the human fecal microbiota reveals subject-specific expression profiles, with genes encoding proteins involved in carbohydrate metabolism being dominantly expressed. Appl. Environ. Microbiol. 76, 5533–5540.
Brodie, E.L., DeSantis, T.Z., Joyner, D.C., Baek, S.M., Larsen, J.T., Andersen, G.L., Hazen, T.C., Richardson, P.M., Herman, D.J., Tokunaga, T.K., et al. (2006). Application of a high-density oligonucleotide microarray approach to study bacterial population dynamics during uranium reduction and reoxidation. Appl. Environ. Microbiol. 72, 6288–6298.
Brodie, E.L., DeSantis, T.Z., Parker, J.P., Zubietta, I.X., Piceno, Y.M., and Andersen, G.L. (2007). Urban aerosols harbor diverse and dynamic bacterial populations. Proc. Natl. Acad. Sci. U.S.A. 104, 299–304.
Candela, M., Consolandi, C., Severgnini, M., Biagi, E., Castiglioni, B., Vitali, B., De Bellis, G., and Brigidi, P. (2010). High taxonomic level fingerprint of the human intestinal microbiota by ligase detection reaction-universal array approach. BMC Microbiol. 10, 116.
Chakravorty, S., Helb, D., Burday, M., Connell, N., and Alland, D. (2007). A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J. Microbiol. Methods 69, 330–339.
Chalmers, N.I., Palmer, R.J., Jr., Cisar, J.O., and Kolenbrander, P.E. (2008). Characterization of a Streptococcus sp.-Veillonella sp. community
micromanipulated from dental plaque. J. Bacteriol. 190, 8145–8154.
Chiu, S.K., Hsu, M., Ku, W.C., Tu, C.Y., Tseng, Y.T., Lau, W.K., Yan, R.Y., Ma, J.T., and Tzeng, C.M. (2003). Synergistic effects of epoxy- and amine-silanes on microarray DNA immobilization and hybridization. Biochem. J. 374, 625–632.
Choe, S.E., Boutros, M., Michelson, A.M., Church, G.M., and Halfon, M.S. (2005). Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol. 6, R16.
Corradi, L., Fato, M., Porro, I., Scaglione, S., and Torterolo, L. (2008). A Web-based and Grid-enabled dChip version for the analysis of large sets of gene expression data. BMC Bioinformat. 9, 480.
Cox, M.J., Allgaier, M., Taylor, B., Baek, M.S., Huang, Y.J., Daly, R.A., Karaoz, U., Andersen, G.L., Brown, R., Fujimura, K.E., et al. (2010). Airway microbiota and pathogen abundance in age-stratified cystic fibrosis patients. PLoS One 5, e11044.
Crielaard, W., Zaura, E., Schuller, A.A., Huse, S.M., Montijn, R.C., and Keijser, B.J. (2011). Exploring the oral microbiota of children at various developmental stages of their dentition in the relation to their oral health. BMC Med. Genomics 4, 22.
Deangelis, K.M., Allgaier, M., Chavarria, Y., Fortney, J.L., Hugenholtz, P., Simmons, B., Sublette, K., Silver, W.L., and Hazen, T.C. (2011). Characterization of trapped lignin-degrading microbes in tropical forest soil. PLoS One 6, e19306.
De Vuyst, L., and Leroy, F. (2011). Cross-feeding between bifidobacteria and butyrate-producing colon bacteria explains bifdobacterial competitiveness, butyrate production, and gas production. Int. J. Food Microbiol. 149, 73–80.
Docktor, M.J., Paster, B.J., Abramowicz, S., Ingram, J., Wang, Y.E., Correll, M., Jiang, H., Cotton, S.L., Kokaras, A.S., and Bousvaros, A. (2012). Alterations in diversity of the oral microbiome in pediatric inflammatory bowel disease. Inflamm. Bowel Dis. 18, 935–942.
Dols, J.A., Smit, P.W., Kort, R., Reid, G., Schuren, F.H., Tempelman, H., Bontekoe, T.R., Korporaal, H., and Boon, M.E. (2011). Microarray-based identification of clinically relevant vaginal bacteria in relation to bacterial vaginosis. Am. J. Obstet. Gynecol. 204, 305.e301–307.
Duncan, S.H., Louis, P., and Flint, H.J. (2004). Lactate-utilizing bacteria, isolated from human feces, that produce butyrate as a major fermentation product. Appl. Environ. Microbiol. 70, 5810–5817.
Flint, H.J., Bayer, E.A., Rincon, M.T., Lamed, R., and White, B.A. (2008). Polysaccharide utilization by gut bacteria: potential for new insights from genomic analysis. Nat. Rev. 6, 121–131.
Fujita, A., Sato, J.R., Rodrigues Lde, O., Ferreira, C.E., and Sogayar, M.C. (2006). Evaluating different methods of microarray data normalization. BMC Bioinformat. 7, 469.
Goldmann, T., and Gonzalez, J.S. (2000). DNA-printing: utilization of a standard inkjet printer for the transfer of nucleic acids to solid supports. J. Biochem. Biophys. Methods 42, 105–110.
Gronow, S., Welnitz, S., Lapidus, A., Nolan, M., Ivanova, N., Glavina Del Rio, T., Copeland, A., Chen, F., Tice, H., Pitluck, S., et al. (2009). Complete genome sequence of Veillonella parvula type strain (Te3). Stand. Genomic Sci. 2, 57–65.
Guschin, D.Y., Mobarry, B.K., Proudnikov, D., Stahl, D.A., Rittmann, B.E., and Mirzabekov, A.D. (1997). Oligonucleotide microchips as genosensors for determinative and environmental studies in microbiology. Appl. Environ. Microbiol. 63, 2397–2402.
Hamady, M., Lozupone, C., and Knight, R. (2010). Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J. 4, 17–27.
Hashsham, S.A., Wick, L.M., Rouillard, J.M., Gulari, E., and Tiedje, J.M. (2004). Potential of DNA microarrays for developing parallel detection tools (PDTs) for microorganisms relevant to biodefense and related research needs. Biosens. Bioelectron. 20, 668–683.
Hazen, T.C., Dubinsky, E.A., DeSantis, T.Z., Andersen, G.L., Piceno, Y.M., Singh, N., Jansson, J.K., Probst, A., Borglin, S.E., Fortney, J.L., et al. (2010). Deep-sea oil plume enriches indigenous oil-degrading bacteria. Science 330, 204–208.
He, Z., Wu, L., Fields, M.W., and Zhou, J. (2005). Use of microarrays with different probe sizes for monitoring gene expression. Appl. Environ. Microbiol. 71, 5154–5162.
Heller, M.J., Forster, A.H., and Tu, E. (2000). Active microeletronic chip devices which utilize controlled electrophoretic fields for multiplex DNA hybridization and other genomic applications. Electrophoresis 21, 157–164.
Hesselsoe, M., Fureder, S., Schloter, M., Bodrossy, L., Iversen, N., Roslev, P., Nielsen, P.H., Wagner, M., and Loy, A. (2009). Isotope array analysis of Rhodocyclales uncovers functional redundancy and versatility in an activated sludge. ISME J. 3, 1349–1364.
Kembel, S.W., Wu, M., Eisen, J.A., and Green, J.L. (2012). Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Comp. Biol. 8, e1002743.
Klaassens, E.S., de Vos, W.M., and Vaughan, E.E. (2007). Metaproteomics approach to study the functionality of the microbiota in the human infant gastrointestinal tract. Appl. Environ. Microbiol. 73, 1388–1392.
Kopecky, J., Kyselkova, M., Omelka, M., Cermak, L., Novotna, J., Grundmann, G.L., Moenne-Loccoz, Y., and Sagova-Mareckova, M. (2011). Actinobacterial community dominated by a distinct clade in acidic soil of a waterlogged deciduous forest. FEMS Microbiol. Ecol. 78, 386–394.
Kyselkova, M., Kopecky, J., Frapolli, M., Defago, G., Sagova-Mareckova, M., Grundmann, G.L., and Moenne-Loccoz, Y. (2009). Comparison of rhizobacterial community composition in soil suppressive or conducive to tobacco black root rot disease. ISME J. 3, 1127–1138.
Lantz, P., Matsson, M., Wadstrom, T., and Radstrom, P. (1997). Removal of PCR inhibitors from human faecal samples through the use of an aqueous two-phase system for sample preparation prior to PCR. J. Microbiol. Methods 28, 159–167.
Lemon, K.P., Klepac-Ceraj, V., Schiffer, H.K., Brodie, E.L., Lynch, S.V., and Kolter, R. (2010). Comparative analyses of the bacterial microbiota of the human nostril and oropharynx. mBio 1, e00129–00110.
Letowski, J., Brousseau, R., and Masson, L. (2004). Designing better probes: effect of probe size, mismatch position and number on hybridization in DNA oligonucleotide microarrays. J. Microbiol. Methods 57, 269–278.
Louis, P., and Flint, H.J. (2007). Development of a semi-quantitative degenerate real-time PCR-based assay for estimation of numbers of butyryl-coenzyme A (CoA) CoA transferase genes in complex bacterial samples. Appl. Environ. Microbiol. 73, 2009–2012.
Loy, A., and Bodrossy, L. (2006). Highly parallel microbial diagnostics using oligonucleotide microarrays. Clin. Chim. Acta 363, 106–119.
Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, Buchner, A., Lai, T., Steppi, S., Jobb, G., et al. (2004). ARB: a software environment for sequence data. Nucleic Acids Res. 32, 1363–1371.
Luo, A.H., Yang, D.Q., Xin, B.C., Paster, B.J., and Qin, J. (2012). Microbial profiles in saliva from children with and without caries in mixed dentition. Oral Dis. 18, 595–601.
Luton, P.E., Wayne, J.M., Sharp, R.J., and Riley, P.W. (2002). The mcrA gene as an alternative to 16S rRNA in the phylogenetic analysis of methanogen populations in landfill. Microbiology 148, 3521–3530.
Martens, M., Weidner, S., Linke, B., de Vos, P., Gillis, M., and Willems, A. (2007). A prototype taxonomic microarray targeting the rpsA housekeeping gene permits species identification within the rhizobial genus Ensifer. Syst. Appl. Microbiol. 30, 390–400.
Martin, F.P., Sprenger, N., Montoliu, I., Rezzi, S., Kochhar, S., and Nicholson, J.K. (2010). Dietary modulation of gut functional ecology studied by fecal metabonomics. J. Proteome Res. 9, 5284–5295.
Mendes, R., Kruijt, M., de Bruijn, I., Dekkers, E., van der Voort, M., Schneider, J.H., Piceno, Y.M., DeSantis, T.Z., Andersen, G.L., Bakker, P.A., et al. (2011). Deciphering the rhizosphere microbiome for disease-suppressive bacteria. Science 332, 1097–1100.
Midgley, D.J., Greenfield, P., Shaw, J.M., Oytam, Y., Li, D., Kerr, C.A., and Hendry, P. (2012). Reanalysis and simulation suggest a phylogenetic microarray does not accurately profile microbial communities. PLoS One 7, e33875.
Militon, C., Rimour, S., Missaoui, M., Biderre, C., Barra, V., Hill, D., Mone, A., Gagne, G., Meier, H., Peyretaillade, E., et al. (2007). PhylArray: phylogenetic probe design algorithm for microarray. Bioinformatics 23, 2550–2557.
Mitterer, G., Huber, M., Leidinger, E., Kirisits, C., Lubitz, W., Mueller, M.W., and Schmidt, W.M. (2004). Microarray-based identification of bacteria in clinical samples by solid-phase PCR amplification of 23S ribosomal DNA sequences. J. Clin. Microbiol. 42, 1048–1057.
Monteiro, L., Bonnemaison, D., Vekris, A., Petry, K.G., Bonnet, J., Vidal, R., Cabrita, J., and Megraud, F. (1997). Complex polysaccharides as PCR inhibitors in feces: Helicobacter pylori model. J. Clin. Microbiol. 35, 995–998.
Naum, M., Brown, E.W., and Mason-Gamer, R.J. (2008). Is 16S rDNA a reliable phylogenetic marker to characterize relationships below the family level in the enterobacteriaceae? J. Mol. Evol. 66, 630–642.
Nemir, A., David, M.M., Perrussel, R., Sapkota, A., Simonet, P., Monier, J.M., and Vogel, T.M. (2010). Comparative phylogenetic microarray analysis of microbial communities in TCE-contaminated soils. Chemosphere 80, 600–607.
Paliy, O., and Agans, R. (2012). Application of phylogenetic microarrays to interrogation of human microbiota. FEMS Microbiol. Ecol. 79, 2–11.
Paliy, O., and Foy, B. (2011). Mathematical modeling of 16S ribosomal DNA amplification reveals optimal conditions for the interrogation of complex microbial communities with phylogenetic microarrays. Bioinformatics 27, 2134–2140.
Paliy, O., Kenche, H., Abernathy, F., and Michail, S. (2009). High-throughput quantitative analysis of the human intestinal microbiota with a phylogenetic microarray. Appl. Environ. Microbiol. 75, 3572–3579.
Pease, A.C., Solas, D., Sullivan, E.J., Cronin, M.T., Holmes, C.P., and Fodor, S.P. (1994). Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc. Natl. Acad. Sci. U.S.A. 91, 5022–5026.
Peplies, J., Glockner, F.O., and Amann, R. (2003). Optimization strategies for DNA microarray-based detection of bacteria with 16S rRNA-targeting oligonucleotide probes. Appl. Environ. Microbiol. 69, 1397–1407.
Peterson, J., Garges, S., Giovanni, M., McInnes, P., Wang, L., Schloss, J.A., Bonazzi, V., McEwen, J.E., Wetterstrand, K.A., Deal, C., et al. (2009). The NIH Human Microbiome Project. Genome Res. 19, 2317–2323.
Polz, M.F., and Cavanaugh, C.M. (1998). Bias in template-to-product ratios in multitemplate PCR. Appl. Environ. Microbiol. 64, 3724–3730.
Preza, D., Olsen, I., Willumsen, T., Boches, S.K., Cotton, S.L., Grinde, B., and Paster, B.J. (2009). Microarray analysis of the microflora of root caries in elderly. Eur. J. Clin. Microbiol. Infect. Dis. 28, 509–517.
Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., et al. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65.
Rainer, J., Sanchez-Cabo, F., Stocker, G., Sturn, A., and Trajanoski, Z. (2006). CARMAweb: comprehensive R- and bioconductor-based web service for microarray data analysis. Nucleic Acids Res. 34, W498–503.
Rajilic-Stojanovic, M., Heilig, H.G., Molenaar, D., Kajander, K., Surakka, A., Smidt, H., and de Vos,
W.M. (2009). Development and application of the human intestinal tract chip, a phylogenetic microarray: analysis of universally conserved phylotypes in the abundant microbiota of young and elderly adults. Environ. Microbiol. 11, 1736–1751.
Rich, V.I., Konstantinidis, K., and DeLong, E.F. (2008). Design and testing of ‘genome-proxy’ microarrays to profile marine microbial communities. Environ. Microbiol. 10, 506–521.
Rigsbee, L., Agans, R., Foy, B.D., and Paliy, O. (2011). Optimizing the analysis of human intestinal microbiota with phylogenetic microarray. FEMS Microbiol. Ecol. 75, 332–342.
Rigsbee, L., Agans, R., Shankar, V., Kenche, H., Khamis, H.J., Michail, S., and Paliy, O. (2012). Quantitative profiling of gut microbiota of children with diarrhea-predominant Irritable Bowel Syndrome. Am. J. Gastroenterol. 107, 1740–1751.
Rimour, S., Hill, D., Militon, C., and Peyret, P. (2005). GoArrays: highly dynamic and efficient microarray probe design. Bioinformatics 21, 1094–1103.
Rock, C., Alum, A., and Abbaszadegan, M. (2010). PCR inhibitor levels in concentrates of biosolid samples predicted by a new method based on excitation-emission matrix spectroscopy. Appl. Environ. Microbiol. 76, 8102–8109.
Roh, S.W., Abell, G.C., Kim, K.H., Nam, Y.D., and Bae, J.W. (2010). Comparing microarrays and next-generation sequencing technologies for microbial ecology research. Trends Biotechnol. 28, 291–299.
Salonen, A., Nikkila, J., Jalanka-Tuovinen, J., Immonen, O., Rajilic-Stojanovic, M., Kekkonen, R.A., Palva, A., and de Vos, W.M. (2010). Comparative analysis of fecal DNA extraction methods with phylogenetic microarray: effective recovery of bacterial and archaeal DNA using mechanical cell lysis. J. Microbiol. Methods 81, 127–134.
Sanguin, H., Remenant, B., Dechesne, A., Thioulouse, J., Vogel, T.M., Nesme, X., Moenne-Loccoz, Y., and Grundmann, G.L. (2006). Potential of a 16S rRNA-based taxonomic microarray for analyzing the rhizosphere effects of maize on Agrobacterium spp. and bacterial communities. Appl. Environ. Microbiol. 72, 4302–4312.
Schatz, M.C., Phillippy, A.M., Gajer, P., DeSantis, T.Z., Andersen, G.L., and Ravel, J. (2010). Integrated microbial survey analysis of prokaryotic communities for the PhyloChip microarray. Appl. Environ. Microbiol. 76, 5636–5638.
Schreiner, K., Hagn, A., Kyselkova, M., Moenne-Loccoz, Y., Welzl, G., Munch, J.C., and Schloter, M. (2010). Comparison of barley succession and take-all disease as environmental factors shaping the rhizobacterial community during take-all decline. Appl. Environ. Microbiol. 76, 4703–4712.
Sekirov, I., Russell, S.L., Antunes, L.C., and Finlay, B.B. (2010). Gut microbiota in health and disease. Physiol. Rev. 90, 859–904.
Stralis-Pavese, N., Abell, G.C., Sessitsch, A., and Bodrossy, L. (2011). Analysis of methanotroph community composition using a pmoA-based microbial diagnostic microarray. Nat. Protoc. 6, 609–624.
Suau, A. (2003). Molecular tools to investigate intestinal bacterial communities. J. Pediatr. Gastroenterol. Nutr. 37, 222–224.
Suzuki, S., Ono, N., Furusawa, C., Kashiwagi, A., and Yomo, T. (2007). Experimental optimization of probe length to increase the sequence specificity of high-density oligonucleotide microarrays. BMC Genomics 8, 373.
Tao, F., Li, C., Smejkal, G., Lazarev, A., Lawerence, N., and Schumacher, R. (2006). Pressure Cycling Technology (PCT) Applications in Extraction of Biomolecules from Challenging Biological Samples. High Pressure Biosci. Biotechnol. 1, 166–173.
Val-Moraes, S., Marcondes, J., Carareto Alves, L., and Lemos, E. (2011). Impact of sewage sludge on the soil bacterial communities by DNA microarray analysis. World J. Microbiol. Biotechnol. 27, 1997–2003.
Waldron, P.J., Wu, L., Van Nostrand, J.D., Schadt, C.W., He, Z., Watson, D.B., Jardine, P.M., Palumbo, A.V., Hazen, T.C., and Zhou, J. (2009). Functional gene array-based analysis of microbial community structure in groundwaters with a gradient of contaminant levels. Environ. Sci. Technol. 43, 3529–3534.
Wang, R.F., Beggs, M.L., Erickson, B.D., and Cerniglia, C.E. (2004). DNA microarray analysis of predominant human intestinal bacteria in fecal samples. Mol. Cell Probes 18, 223–234.
Wu, C.H., Sercu, B., Van de Werfhorst, L.C., Wong, J., DeSantis, T.Z., Brodie, E.L., Hazen, T.C., Holden, P.A., and Andersen, G.L. (2010a). Characterization of coastal urban watershed bacterial communities leads to alternative community-based indicators. PLoS One 5, e11285.
Wu, X., Ma, C., Han, L., Nawaz, M., Gao, F., Zhang, X., Yu, P., Zhao, C., Li, L., Zhou, A., et al. (2010b). Molecular Characterisation of the Faecal Microbiota in Patients with Type II Diabetes. Curr. Microbiol. 61, 69–78.
Xie, J., He, Z., Liu, X., Liu, X., Van Nostrand, J.D., Deng, Y., Wu, L., Zhou, J., and Qiu, G. (2010). GeoChip-based analysis of the functional gene diversity and metabolic potential of microbial communities in acid mine drainage. Appl. Environ. Microbiol. 77, 991–999.
Yoo, S.M., Lee, S.Y., Chang, K.H., Yoo, S.Y., Yoo, N.C., Keum, K.C., Yoo, W.M., Kim, J.M., and Choi, J.Y. (2009). High-throughput identification of clinically important bacterial pathogens using DNA microarray. Mol. Cell Probes 23, 171–177.
Zabarovsky, E.R., Petrenko, L., Protopopov, A., Vorontsova, O., Kutsenko, A.S., Zhao, Y., Kilosanidze, G., Zabarovska, V., Rakhmanaliev, E., Pettersson, B., et al. (2003). Restriction site tagged (RST) microarrays: a novel technique to study the species composition of complex microbial systems. Nucleic Acids Res. 31, e95.
Zakharkin, S.O., Kim, K., Mehta, T., Chen, L., Barnes, S., Scheirer, K.E., Parrish, R.S., Allison, D.B., and Page, G.P. (2005). Sources of variation in Affymetrix microarray experiments. BMC Bioinformat. 6, 214.
Zhang, L., Hurek, T., and Reinhold-Hurek, B. (2007). Zhou, J.Z., He, Z.L., Van Nostrand, J.D., and Deng, Y. A nifH-based oligonucleotide microarray for func- (2011). Development and applications of functional tional diagnostics of nitrogen-fixing microorganisms. gene microarrays in the analysis of the functional Microb. Ecol. 53, 456–470. diversity, composition, and structure of microbial communities. Front Environ. Sci. En. 5, 1–20.


10 Genetic Barcoding of Bacteria and its Microbiology and Biotechnology Applications

Oleg N. Reva, Wai Y. Chan, Oliver K.I. Bezuidt, Svitlana V. Lapa, Larisa A. Safronova, Lilija V. Avdeeva and Rainer Borriss

Abstract

A wide variety of genetic data about organisms of interest has become available with the advent of next-generation sequencing (NGS). For many potential new users, processing the huge amount of genetic data generated by NGS and using this information to resolve practical questions remains a challenge. Genetic barcoding of microorganisms is the first obvious area where NGS has met the requirements of applied microbiology. In general, barcoding in microbiology is a comparative genomic approach to differentiating between species or strains that are hard to distinguish by traditional methods. In this chapter, we introduce the conceptual background of bacterial barcoding and present several basic bioinformatics tools and approaches that provide solutions to NGS data handling. While working with a putative industrial strain or a potentially hazardous pathogen, the following questions arise: (i) is this strain unique and, if so, what makes it unique genetically or practically speaking; (ii) how can it be detected in the environment; and (iii) are there any genetic markers for its extraordinary activity? The possibility of barcoding whole bacterial communities is considered, and both the benefits and the limitations of traditional 16S rRNA-based barcoding and multi-locus sequence typing are discussed.

The history of barcoding of microorganisms

When Antonie van Leeuwenhoek looked through his miracle microscope for the first time, he was amazed by the multiformity of the unseen world that he had discovered (van Zuylen, 1981). In fact, the creatures that had impressed Leeuwenhoek's imagination were protists, that is, single-celled organisms, not the usually uniform rod and coccal bacterial cells, which do not look so impressive. After the invention of the microscope and the introduction of techniques of bacterial cultivation on solid growth media by Robert Koch and Koch's assistant J. R. Petri (Weiss, 2005), the characterization of bacteria by the morphology of the colony became common practice. Very soon it became obvious that the morphology of bacterial cells and colonies is not a robust taxonomic property and that the real dimension of bacterial versatility lies in their biochemistry. One of the very first biochemical tests used in microbiology was Gram staining. This technique was developed by H.C. Gram in 1884 for the identification of pathogenic bacteria, particularly for Typhus bacilli (Gram, 1884). Subsequently, many diagnostic tests have been developed for the identification of various bacterial species. A comprehensive, regularly updated overview of all the identification procedures used in microbiology since 1923 has been published in Bergey's Manual of Determinative Bacteriology (visit the Springer website www.springer.com for the latest issues of the manual). A common belief among researchers was that a larger set of diagnostic tests would provide more reliable species identification. New approaches based on the comparison of multiple independent tests were developed and termed numeric taxonomy. Fuelled by the advances in computer technologies in the early 1970s, the concept of numeric taxonomy reflected a general conceptual shift in science. As descriptive and narrative diagnostic tests used


in bacteriology gave way to digital rows of data designed for arithmetic and computer-based processing, the concept of numeric taxonomy was first developed, introduced and further elaborated in works by Sokal and Sneath (1963) and Sneath and Sokal (1973). This concept was used for barcoding in microbiology, where the first barcode was based on sets of biochemical tests. The early concept of barcoding was related to the phenetic approach to bacterial classification proposed by Sokal and Sneath (1963). In contrast to the cladistic approach, which uses sets of hierarchical diagnostic tests for a dichotomous branching of organisms at the different taxonomic levels, phenetics searches for similarities between organisms by comparing the patterns of multiple independent variables, i.e. barcodes. The advent of numeric taxonomy required the development of new multiplex facilities to generate massive datasets of biochemical traits, and also new approaches to deal with these enormous arrays of data. Long before the introduction of sequencing techniques, numeric taxonomy had blazed a trail for bioinformatics as a new scientific discipline.

The numeric taxonomy approach was challenged (i) by the problem of standardizing experimental conditions, which at a later stage was resolved to some extent by the introduction of highly standardized commercial analytical profile index (API) systems; (ii) by the biochemical versatility of bacterial species, which hindered correct species identification even by commercial test systems (Inglis et al., 1998); and (iii), to an even greater extent, by the extraordinary plasticity of microorganisms: bacteria evolve rapidly under the pressure of changed environmental conditions, and the resulting variants may differ significantly from the parent organisms. Typical examples are small colony variants of pathogenic bacteria, which evolve rapidly and are often associated with chronic bacteraemia and long-term persistence in host cells (Proctor et al., 2006). Advancements in molecular biology and gene amplification have changed the paradigm of bacterial taxonomy from operating with rows of experimental data to comparative studies of molecular residues in biopolymers, i.e. DNA and protein molecules.

To conclude, barcodes may be defined as 400–800 bp DNA fragments serving as unambiguous species identifiers, whereas DNA barcoding is an approach for rapid species identification based on DNA sequences (Kress and Erickson, 2008). DNA barcoding in bacteriology was pioneered using 16S rRNA sequences as the taxonomic markers (Weisburg et al., 1991), followed by the use of several other housekeeping protein-coding gene sequences as potential barcodes (Case et al., 1997). For eukaryotes, the internal transcribed spacer (ITS) region of the nuclear ribosomal DNA was proposed as a genetic species marker for fungi (Nilsson et al., 2008), whereas the mitochondrial gene cytochrome c oxidase I (COI) was established as a universal DNA barcode for animals (Hebert et al., 2003). In this chapter we discuss the applications of molecular biology and sequencing techniques for bacterial species identification and barcoding of individual organisms.

16S ribosomal RNA sequence – a universal barcode of bacterial species

The introduction of the polymerase chain reaction (PCR) amplification technique by Mullis (1993) revolutionized molecular biology. PCR gave scientists the ability to obtain multiple copies of precisely selected DNA fragments. Both the quantity and the quality of the PCR product can be further analysed by electrophoresis, direct sequencing and other alternative methods. PCR amplification has become easy to standardize and yields reproducible results in different laboratories. The gene encoding 16S rRNA, a component of the small ribosomal subunit, was found to be a universal target for phylogenetic studies (Pace, 1997). The application of this method aided in classifying bacteria at levels from prokaryotic domains to individual strains (Woese and Fox, 1977; Dalevi et al., 2007). Notably, this method provided researchers with a universal genetic barcode for bacteria. The 16S rRNA gene is highly conserved in Archaea and eubacteria and allows the construction of universal primers that flank several informative variable regions (Coenye and Vandamme, 2003). 16S rRNA remains one of the most sequenced DNA fragments for species identification (see Chapter 8). For its unprecedented
