University of Akureyri, School of Business and Science - Biotechnology

Courses LOK1126 and LOK1226

Project title: Characterization of cathelicidin family members in Rock Ptarmigan (Lagopus muta)

Project period: January to May 2017
Student: Hallgrímur Steinsson
Supervisor: Kristinn Pétur Magnússon
Print run: Electronic, plus three printed copies
Number of pages: 53
Number of appendices: 1
Accompanying material: None
Publication and usage rights: Open project

Declarations

"I declare that I alone am the author of this project and that it is the result of my own research"

______Hallgrímur Steinsson, 210878-5649

"It is hereby confirmed that this project, in my judgment, fulfils the requirements for examination in courses LOK1126 and LOK1226"

______Kristinn P. Magnússon, supervisor


Abstract

Cathelicidins are a class of antimicrobial peptides expressed in vertebrate species as part of the innate immune system. The aim of this thesis was to resolve the genomic organization of the cathelicidin gene cluster in rock ptarmigan (Lagopus muta), to predict the amino acid sequence of the mature peptides and to analyze their expression. To locate the cathelicidin genes, chicken (Gallus gallus) sequences were used in a BLAST search against a novel draft genome of rock ptarmigan. The draft genome was subsequently used to design primers for PCR and sequencing in order to obtain the entire cathelicidin cluster. Characterization of the cluster in rock ptarmigan revealed orthologues of all four cathelicidin genes found in chicken and turkey (Meleagris gallopavo), namely CATHL1, CATH2, CATH3 and CATHB1, flanked by KLH18 and TBRG4 in the same order on chromosome 2. The genes map to a 15 kb region, which is of similar size in chicken. The quality of the sequence in the region is good except for two minor gaps of ~100 bp. The sequence contained the exons coding for the mature peptide of all four cathelicidin genes. RT-qPCR analysis of RNA isolated from eight tissues revealed that all cathelicidins were expressed. Translation of the open reading frames revealed amino acid substitutions in all four genes in rock ptarmigan. CATHL1, CATH2 and CATH3 showed the greatest similarity to chicken, whereas CATHB1 was most similar to turkey.

Keywords: Antibacterial peptides, genomics, RNA, PCR, RT-qPCR.


Acknowledgements

At the end of the road it is fitting to pause and thank those who lent a helping hand and gave the encouragement and inspiration to take on a great deal of work in adulthood in order to seek knowledge and broaden one's horizons. I am very fortunate to have grown up in a family whose members are constantly sharing interesting things with one another and seeking new knowledge on all sorts of subjects. I remember that about a year before I applied to the University of Akureyri, my brother Jón pointed me to an article about our symbiotic bacteria, which sparked further reading on biology, including The Red Queen by Matt Ridley. When I decided, after a long wait, to finally pursue my studies, I chose biotechnology over fisheries science, which would have been the obvious choice given my background. No programme could have suited my interests better, and I am grateful to the staff of the department for their good work. In particular I want to express my gratitude to Kristinn Pétur Magnússon, professor at the Faculty of Natural Resource Sciences, for his guidance in the project, his support and encouragement, and for kindling my lifelong interest in molecular genetics. Mary Mavrikidi has my thanks for her part in the research work; her professionalism and enjoyable company were invaluable. My wife, Auðbjörg Halla Jóhannsdóttir, has my thanks for her great support, patience and proofreading through three years of study alongside full-time work. My daughters Unnur Birna, Hrafnhildur and Anna Steinunn deserve praise for their great patience and support, and I look forward to being able to spend more time with them once I graduate. My parents have my gratitude for never giving up on encouraging me to go to university and learn more. Finally, I want to thank my fellow students for their company. A great atmosphere formed in the group, along with many good memories.

Vestmannaeyjar, 10 April 2017
Hallgrímur Steinsson


Útdráttur (Icelandic abstract)

Cathelicidins are antimicrobial peptides in vertebrates that form part of the vertebrate innate immune system. The aim of the project was to characterize the cathelicidin gene region in rock ptarmigan and to predict the amino acid sequence of the peptides and their expression. Four cathelicidin proteins are expressed in chicken (Gallus gallus), and those sequences were used to locate the genes in rock ptarmigan (Lagopus muta) using a recently assembled genome. The genome was then used to design PCR primers so that the cathelicidin region could be sequenced in full. Characterization of the cathelicidin region revealed that all the cathelicidin genes found in chicken and turkey (Meleagris gallopavo) are also present in rock ptarmigan. The genes are CATHL1, CATH2, CATH3 and CATHB1, flanked by the genes KLH18 and TBRG4 in the same order on chromosome 2. The genes lie within a 15 kb region, which is the same size as in chicken. The quality of the sequencing is good apart from two gaps of ~100 bp. The results included the exons for all the active peptides of the cathelicidin genes. RT-qPCR revealed that all the cathelicidin genes were expressed in the eight tissues examined. Translation of the genes' reading frames showed changes in all four cathelicidin genes relative to chicken and turkey. CATHL1, CATH2 and CATH3 showed the greatest similarity to chicken, whereas CATHB1 was most similar to turkey.

Keywords: Antimicrobial peptides, genomics, RNA, PCR, RT-qPCR.


Contents

1 Introduction
2 Background
2.1 Antimicrobial peptides
2.2 Cathelicidins
2.3 Expression and function of avian cathelicidins
2.4 Research methods and data analysis in molecular genetics
2.4.1 PCR
2.4.2 Gel electrophoresis
2.4.3 Sequencing methods
2.4.4 Bioinformatics analysis of data
3 Materials and methods
4 Results
4.1 PCR results
4.2 Sequencing results
4.3 Sequence analysis
5 Discussion
6 Conclusions
7 References


Index of figures and tables

Figure 1 - A schematic diagram of avian cathelicidin genes (Cheng et al., 2015). Arrows indicate the orientation of the genes and * marks pseudogenes.
Figure 2 - A DNA electrophoresis gel with PCR results and ladders, loaded with SYBR Safe.
Figure 3 - Genomic region in chicken containing cathelicidin genes (Yates et al., 2015).
Figure 4 - A visualization of the results from the PrimerDesign-M primer selection (HIV Sequence Database, 2017).
Figure 5 - Mixed results from first PCR run.
Figure 6 - Bands covering CATH3-CATH2 region in Lagopus muta in top row lanes 17-19.
Figure 7 - PrimerDesign-M designed primers used for PCR giving good quality products for most of the CATH region ranging from KLH18 to TBRG4.
Figure 8 - Visualization from CodonCode Aligner of alignment using the Sanger sequences from the study against the draft genome of Lagopus muta.
Figure 9 - Alignment of chicken cathelicidins against rock ptarmigan cathelicidins.
Figure 10 - Maximum likelihood tree constructed from avian CATH2 sequences aligned against the CATH2 gene from rock ptarmigan (Tamura et al., 2007).
Figure 11 - Visualization of the cathelicidin genes from L. muta. Red: negative charge, blue: positive charge, yellow: hydrophobic, green: polar.
Figure 12 - Amplification plot from StepOne Software v. 2.1 from Applied Biosystems.
Figure 13 - Optimal phylogenetic tree using maximum likelihood analysis of 12S and ND2 DNA sequence for 36 galliform taxa (Dimcheff et al., 2001).
Figure 14 - Amino acid sequence of the four cathelicidin mature antimicrobial peptides in chicken, turkey and rock ptarmigan. Amino acids marked with yellow differ from the chicken peptide sequence.


Table 1 - PCR program for Q5 hot DNA polymerase
Table 2 - Cathelicidin genomic region in Lagopus muta, PCR amplification products
Table 3 - Program for reverse transcription in thermal cycler using Reverse Transcription Kit from Applied Biosystems
Table 4 - FGENESH gene prediction results on contig assembled in CodonCode Aligner
Table 5 - Results from pepcalc.com (2015) showing characteristics of ptarmigan cathelicidins
Table 6 - Expression levels given as cycle threshold (CT) in eight tissue samples from rock ptarmigan, measured relative to GADPH2, a glycolysis housekeeping gene.

Appendix
Table 7 - Primers used in amplification and sequencing of PCR products from ptarmigan cathelicidin genomic region
Table 8 - List of samples sent for sequencing at Macrogen
Table 9 - RT-PCR primers
Table 10 - cDNA mastermix recipe
Table 11 - Primers selected with PrimerDesign-M for the region containing cathelicidin genes in L. muta. Tm is melting temperature, and start and stop positions refer to positions on the 14393 bp region flanked by KLH18 and TBRG4.


Abbreviations and concepts

CATH - An abbreviation for cathelicidin genes, usually preceding a number or a further label for the gene.
cDNA - Complementary DNA, created by reverse transcribing RNA. In effect this creates a library of transcripts from a tissue sample.
KLH18 - Kelch like family member 18, a coding gene adjacent to the region under investigation in the study.
MEGA - Molecular evolutionary genetics analysis. Software for analyzing and aligning genetic and protein sequences.
RNA-seq - Next generation sequencing of cDNA transcriptomes.
PCR - Polymerase chain reaction, a method used to amplify genetic sequences of interest for further analysis or sequencing.
qPCR - Quantitative PCR, used to measure amounts of nucleotide sequences in a heterogeneous sample.
TBRG4 - Transforming growth factor beta regulator 4, a protein coding gene adjacent to the genetic region under investigation.
bp - base pairs (DNA)
kb - kilo bases (DNA)


1 Introduction

Life has evolved over billions of years and produced millions of species competing for resources, survival and procreation. For higher animals this means competition with other members of the same species for resources such as suitable habitats and mating partners, but also a battle against other species. The Red Queen hypothesis is a theory in evolutionary biology that explains evolution in terms of competitiveness. It takes its name from a scene in Lewis Carroll's Through the Looking-Glass, where the Red Queen must keep running just to keep up with an ever-moving world. Applied to animals, it means they must evolve at the same rate as their pathogens, parasites and predators or risk becoming extinct (Van Valen, 1974). For birds such as the ptarmigan species investigated in this study, it relates to survival in the face of falcons preying on rock ptarmigans, two species with closely linked population cycles (Nielsen, 2011). It also means they must evolve defense mechanisms effective against the microorganisms common in their diet and environment. Antimicrobial peptides are an excellent example of one such mechanism.

The immune system of animals can broadly be divided into two categories, the adaptive immune system and the innate immune system, which complement each other. The innate immune system generally responds quickly to infections at a very early stage, while the adaptive immune system uses specialized cells to custom design effector molecules and highly specialized T-cell receptors to deal with threats the initial innate response is unable to contain. The molecules of the innate immune system mediate their effect by exploiting attributes common to microorganisms but absent from plants and animals. From a pharmacological perspective, molecules of the innate immune system also have advantages such as beneficial modulating effects on immune responses and a smaller chance of autoimmune effects (Finlay & Hancock, 2004; Zasloff, 2002). Antibiotic resistance in bacteria is increasing at an alarming rate, and the number of resistant gram-negative strains, which contain lipopolysaccharides (LPS) in their cell wall, is growing faster than that of gram-positive strains (Kumarasamy et al., 2010). These developments make the antimicrobial peptides of the innate immune system an attractive area of research for new therapeutic options.


The explosion of progress in sequencing technology has made the investigation of genetic factors relating to innate immunity both possible and affordable. It also allows their expression to be characterized through a variety of molecular biology techniques such as qPCR and RNA-seq. The foundation of this study is two novel draft genomes for the closely related sister species rock ptarmigan (Lagopus muta) and willow ptarmigan (Lagopus lagopus), generated with next generation sequencing (NGS) technology (Kozma et al., 2016). I also had access to tissue samples from rock ptarmigan collected at Mývatn, Iceland, in November 2016.

Aims of the study:
- To resolve the genomic organization of the cathelicidin gene cluster in rock ptarmigan.
- To predict the amino acid sequence of the mature peptides of the cathelicidin genes.
- To study the expression of the cathelicidin genes in selected tissues.

Objectives of the study:
- To locate the cathelicidin gene cluster in a rock ptarmigan draft genome with the help of chicken sequences.
- To obtain the complete sequence of the entire cathelicidin gene cluster with PCR and sequencing.
- To isolate RNA from selected tissues and perform qPCR expression analysis of the cathelicidin genes.

This knowledge could provide interesting avenues for further research, as each species' antimicrobial peptides are tailored to the pathogens the species is likely to encounter in its habitat, which gives a good chance of finding novel peptides (Zasloff, 2002). Another possible benefit of the research is an improved understanding of immune system function in ptarmigan species.


2 Background

2.1 Antimicrobial peptides

Multicellular organisms use a variety of strategies to protect themselves from invasion by microbes. Higher animals such as humans have evolved leukocytes and specialized organs that confer adaptive immunity, but most animals lack an adaptive immune system and must rely on other strategies to survive. These strategies include creating environments hostile to microbes or barriers that microbes are unable to penetrate. If microbes get past these first lines of defense, they are attacked by antimicrobial peptides secreted by epithelial or circulating cells (Zasloff, 2002). The importance of these peptides has been demonstrated in knockout experiments in mice, where a single knockout mutation in the gene for Cramp, a cathelicidin, results in high susceptibility to, and severe infections by, Streptococcus bacteria compared to wild type mice (Zasloff, 2002). The importance of antimicrobial peptides is further exemplified by Kostmann's disease in humans. Kostmann's is a congenital disease resulting in deficiency of the cathelicidin LL-37. It was fatal until the advent of antibiotics and is still a severe disease characterized by recurrent infections. This highlights the importance of antimicrobial peptides such as cathelicidins in mediating immunity to microorganisms (Pütsep et al., 2002).

In vertebrate species antimicrobial peptides provide a quick response to invading pathogens, but in many invertebrate species they play a major role in the immune system. A great variety of these genes have been found in plants and animals. To date the Antimicrobial Peptide Database contains 2786 antimicrobial peptides from six kingdoms, with plants and animals contributing 2426 peptides (Wang et al., 2016). Antimicrobial peptides are generally short, 15-45 amino acids long, and have a positive net charge. In most instances they are also able to kill bacteria faster than the bacteria can grow. This effect is mediated by the nature of the peptide: its positive charge attracts it to the negatively charged cell wall of bacteria, where its amphipathic nature is believed to destabilize the membrane and cause lysis of the bacteria. Eukaryotic cells are not susceptible to this effect because their membranes contain cholesterol, which protects the cells against the peptides. Additionally, the outside of the eukaryotic cell membrane has no net charge, which leaves weak hydrophobic interactions as the only force attracting antimicrobial peptides (Boman, 2003; Zasloff, 2002).
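The two properties named above, a length of 15-45 residues and a positive net charge, can be checked for any peptide sequence with a few lines of Python. This is a minimal sketch under a stated assumption: lysine and arginine count as +1, aspartate and glutamate as -1, and histidine and the termini are ignored, so the result is only a rough estimate of the charge at neutral pH. The function names and the example sequence are hypothetical and are not taken from the study.

```python
# Rough net charge of a peptide at neutral pH, obtained by counting charged residues.
# Assumption: Lys (K) and Arg (R) count as +1, Asp (D) and Glu (E) as -1;
# His and the N/C termini are ignored, so this is only an approximation.

POSITIVE = set("KR")
NEGATIVE = set("DE")


def net_charge(peptide: str) -> int:
    """Return the approximate net charge of a peptide in one-letter code."""
    seq = peptide.upper()
    positives = sum(aa in POSITIVE for aa in seq)
    negatives = sum(aa in NEGATIVE for aa in seq)
    return positives - negatives


def looks_like_typical_amp(peptide: str) -> bool:
    """Check the two properties from the text: 15-45 residues and a positive net charge."""
    return 15 <= len(peptide) <= 45 and net_charge(peptide) > 0


if __name__ == "__main__":
    toy = "GKRLKKALDEIAKRLLKG"  # hypothetical sequence, not a real cathelicidin
    print(len(toy), net_charge(toy), looks_like_typical_amp(toy))
```

A dedicated peptide property calculator would additionally account for pKa values and pH; the counting rule here only illustrates why lysine- and arginine-rich peptides end up cationic.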


It is worth noting that while bacteria use several mechanisms to escape antibiotics, only a few bacterial species have developed mechanisms that protect them from antimicrobial peptides. These strategies are, predictably, the use of proteases to break down the peptides and modification of the cell wall to reduce its affinity for the peptides (Zasloff, 2002). Many antimicrobial peptides secreted by animal cells, such as cathelicidins and defensins, kill both gram-negative and gram-positive bacteria at concentrations of 1-10 µM. The cathelicidins secreted in chicken have a stronger bactericidal effect than the human cathelicidin LL-37. Other effects of LL-37 have also been researched, such as toll-like receptor (TLR) activation and effects on leukocyte activation, migration and differentiation. Knowledge of these effects is limited for avian cathelicidins, but TLR activation and immune activation have been reported, as well as an interesting role in modulating the effect of lipopolysaccharide (LPS) on macrophages (Cuperus et al., 2013; Yu et al., 2015).


2.2 Cathelicidins

There are several types of antimicrobial peptides, and they have been classified by many different schemes: structure, internal bonds, origin and targets. Cathelicidins are short peptides with a characteristic cathelin-like domain from which their name derives. They are first formed as prepropeptides with a signal sequence which is cleaved off inside the cell before the propeptide is secreted. After secretion, a serine protease cleaves off the cathelin-like pro-domain, resulting in the mature peptide. The mature peptides are diverse, α-helical, cationic peptides with an amphipathic structure (Cuperus et al., 2013).

Figure 1 – A schematic diagram of avian cathelicidin genes (Cheng et al., 2015). Arrows indicate the orientation of the genes and * point out pseudogenes.

Chicken has four cathelicidin genes with highly conserved signal and cathelin domains. Two of these genes, CATH1 and CATH3, are highly similar, suggesting a recent duplication event (Cuperus et al., 2013). Several studies have looked at the cathelicidin genes in different avian species, including quail, Japanese quail, emperor penguin, rock pigeon, peregrine falcon, turkey and chicken. CATH genes have been shown to be highly conserved among avian species, but the gene order differs between lineages. In most birds the region is ordered KLH18 - CATH2 - CATH3 - CATHB1 - TBRG4, whereas in Galliformes the order is KLH18 - CATH3 - CATH2 - CATHB1 - CATHL1 - TBRG4. Figure 1 illustrates the varied arrangements of cathelicidins in avian species. Clustering is characteristic of immune system genes, other examples being the major histocompatibility complex and the immunoglobulin genes. It has been suggested that this is likely due to the importance of coordinated expression of related loci in the genome (Cheng et al., 2015).

All four chicken cathelicidin genes have the four-exon, three-intron structure typical of mammalian cathelicidin genes. The first three exons encode the untranslated region, the signal domain and the cathelin domain, while the last exon codes for the mature peptide. The mature peptide sequence and its similarity to other orthologues is of interest to this study (Zhang & Sunkara, 2014).

An interesting quality that has been observed in cathelicidin genes is a negative correlation between the charge of the propeptide and that of the mature peptide. The same correlation has been observed for mammalian defensins, antimicrobial peptides that are also present in birds. It can be explained by the need to prevent autocytotoxicity, where a positively charged mature peptide might attack the negatively charged interior of the eukaryotic cell. Balancing the anionic propiece against the cationic mature peptide keeps the overall charge neutral and prevents this (Cheng et al., 2015; Michaelson et al., 1992). Such interactions are typical of the constraints on the evolution of the cathelicidin genes, where changes in the sequence cannot disrupt efficient translation, secretion or trafficking of the peptides without risking a loss of fitness in the individual (Zasloff, 2002). However, there is a strong propensity for diversity in cathelicidins, as a small change in the sequence can dramatically alter biological activity. It is likely that diversity in antimicrobial peptides gives some individuals immunity to threats against which the wild-type peptides of the species were ineffective. The selective pressures on avian cathelicidin genes have been investigated and show strong negative selection overall, with only 4% of the sites under positive selection (Cheng et al., 2015; Zasloff, 2002).


2.3 Expression and function of avian cathelicidins

Although the expression of cathelicidins in mammals has been well studied, their expression in birds is less well known. Expression is regulated by many stimuli, among which infection is the predominant factor. Different methods have been used for profiling the expression of the genes, and it has proven to differ greatly between tissues. The highest constitutive expression has been found in bone marrow for CATH1, CATH2 and CATH3 but in the bursa for CATH-B1. CATH-B1 appears to have a role in guarding against infection at mucosal interfaces (Goitsuka et al., 2007). Quantitative PCR performed on chicken tissue indicates that the highest level of expression is found in lymphoid organs and sexual organs, while a lower level is found in gastrointestinal organs (Yacoub et al., 2016). Tissue-specific analysis in the Japanese quail (Coturnix japonica) has shown similar expression, with little constitutive expression outside of bone marrow and the bursa of Fabricius, both centers of immune system development (Ishige et al., 2017).

A study of pigeon cathelicidins has revealed that CATH2 modulates the LPS inflammatory response to infections by gram-negative bacteria. It does so by binding to LPS and blocking its binding site for toll-like receptors (TLR), which prevents a signal cascade that would result in the secretion of various pro-inflammatory cytokines of the immune system. Molecules with this kind of effect are of obvious clinical interest, as sepsis is an example of a very serious condition believed to be caused by LPS activation of macrophages (Finlay & Hancock, 2004; Yu et al., 2015). Minimum inhibitory concentration (MIC) tests of the activity of CATH1, CATH2 and CATH3 from chicken indicate that CATH1 and CATH3 have a lower MIC than CATH2, indicating a stronger bactericidal effect (Yacoub et al., 2016). The antibacterial effect of CATHB1 has been measured to be lower than that of the other chicken cathelicidins (Lee et al., 2016).

2.4 Research methods and data analysis in molecular genetics

A variety of methods are used in this study, from classic methods like PCR and gel electrophoresis to the more recent RNA-seq, which was made possible by advances in sequencing technology in the last ten years. Analysis of the data was done using bioinformatics software and web resources like Ensembl and NCBI. These methods are reviewed briefly here.


2.4.1 PCR

The importance of the polymerase chain reaction (PCR) as a method in genetics and biotechnology is hard to overstate. Before the heat-stable Taq polymerase was introduced into the method developed by Kary Mullis, amplification of genetic material was a time-consuming and unreliable process: the Klenow fragment of E. coli DNA polymerase I had to be added before each cycle, an error-prone process (Saiki et al., 1988). PCR relies on primers complementary to short (~20 bp) regions flanking the genomic sequence to be amplified. Every cycle starts by denaturing the DNA at a high temperature, after which the temperature is lowered to allow the primers to anneal to the separated strands. This is followed by an extension step in which the heat-resistant Taq polymerase fills in the region between the two primers. Taq polymerase is isolated from the thermophilic bacterium Thermus aquaticus and is not destroyed by the denaturing step, which is done at 95°C. A mixture of genomic material, primers, polymerase, a buffer solution with Mg2+ and an equimolar mixture of nucleotides is therefore all that is required to amplify the genomic region by a factor of millions, depending on the number of cycles (Metzker & Caskey, 2009; Saiki et al., 1988).

By selecting a pair of primers flanking a region of interest in a genome, the region can be amplified, which allows easy sequencing with the Sanger method. Other uses of PCR include examining chromosome rearrangements, searching for sequence variation, cloning, detection of pathogens and phylogenetic analysis (Metzker & Caskey, 2009; Saiki et al., 1988).

The choice of polymerase is important, as polymerases differ greatly in fidelity. Two examples of DNA polymerases suitable for PCR and superior in fidelity to Taq are Phusion and Q5 (New England Biolabs, US). These polymerases have >50 and >100 times higher fidelity than Taq, respectively, meaning fewer errors when copying DNA, which is critical in the preparation of DNA for sequencing. The higher fidelity is achieved through a structural arrangement that slows the incorporation of erroneous nucleotides, which increases the likelihood that they dislodge and are replaced by the correct one. Another important quality is an exonuclease proofreading mechanism that removes mispaired nucleotides so they can be repaired. In a study done with Sanger sequencing, only two errors were detected in 440,000 sequenced bases of product from the Q5 DNA polymerase (Pezza et al., 2014).

One method derived from PCR is real-time qPCR, a quantitative method used for measuring transcription. It can be custom designed for many different purposes and can give results either relatively, as a difference in expression between samples or tissues, or absolutely, calculated from a standard curve. Real-time qPCR amplifies a short sequence, usually part of a cDNA whose abundance is proportional to the expression of a gene of interest, in order to quantify the initial amount of that cDNA in the sample. This is done primarily using fluorescent dyes such as SYBR Green or the TaqMan system. Each system has its advantages: SYBR Green is cheaper but less specific than TaqMan, because the dye binds to any double-stranded DNA. This makes it important for the researcher to check for unintended amplification of other products by melting curve analysis or gel analysis (Applied Biosystems, 2010; Ponchel et al., 2003).
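The numbers in this section follow from simple arithmetic, which the short Python sketch below makes explicit. It rests on the idealized assumption that every cycle multiplies the template by (1 + efficiency), with an efficiency of 1.0 meaning perfect doubling. The function names are hypothetical, and the relative expression formula is the generic cycle-threshold comparison against a reference gene, not necessarily the exact calculation used later in the thesis.

```python
# Back-of-the-envelope PCR and qPCR arithmetic.
# Assumption: each cycle multiplies the template by (1 + efficiency);
# efficiency = 1.0 corresponds to perfect doubling, which real reactions only approach.


def amplification_factor(cycles: int, efficiency: float = 1.0) -> float:
    """Theoretical fold amplification after a given number of cycles."""
    return (1.0 + efficiency) ** cycles


def errors_per_base(errors: int, bases_sequenced: int) -> float:
    """Observed errors per sequenced base (not a per-cycle polymerase error rate)."""
    return errors / bases_sequenced


def relative_expression(ct_target: float, ct_reference: float,
                        efficiency: float = 1.0) -> float:
    """Amount of target relative to a reference gene from cycle thresholds (CT).

    A lower CT means more starting template, so the ratio is
    (1 + efficiency) ** -(ct_target - ct_reference).
    """
    delta_ct = ct_target - ct_reference
    return (1.0 + efficiency) ** (-delta_ct)


if __name__ == "__main__":
    print(amplification_factor(30))        # fold amplification after 30 ideal cycles
    print(errors_per_base(2, 440_000))     # the two errors in 440,000 bases quoted above
    print(relative_expression(25.0, 22.0)) # hypothetical CT values, for illustration only
```

Twenty ideal cycles already give about a million-fold amplification, which is why the text speaks of amplification "by a factor of millions depending on the number of cycles".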

2.4.2 Gel electrophoresis

Heterogeneous samples in biological research must be subjected to various methods in order to visualize and quantify nanomolar concentrations of macromolecules such as DNA. One method that has established itself as among the most important in molecular biology research is the separation of molecules by size in polysaccharide gels by means of an electric field (Applied Biosystems, n.d.; Thorne, 1966). The method relies on a matrix, routinely made of the polysaccharide agarose mixed in water and heated to form a homogeneous gel, which functions as a sieve with uniform pore size, so that molecules pulled through it move at a speed inversely related to their size. As a result, molecules of the same size travel the same distance in an electric field created by electrodes connected to a power supply, and are separated from slower-moving larger molecules and faster-moving smaller molecules. A sample can be compared to a DNA ladder, a premade mixture of DNA molecules of various known sizes at fixed concentrations sold by lab supply companies, to estimate the size and amount of DNA in the sample being electrophoresed (Brody & Kern, 2004).
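Comparing a band to the ladder can be done by eye, but the idea can also be sketched numerically. The Python example below assumes that, within the range covered by the ladder, the logarithm of fragment size falls roughly linearly with migration distance, which is a common rule of thumb rather than something stated in the sources cited here. The ladder values and function name are hypothetical.

```python
import math
from bisect import bisect_left

# Estimate the size of a DNA band from its migration distance by comparing it
# to a ladder. Assumption: log10(size) changes linearly with distance between
# neighbouring ladder bands, which is only an approximation of real gels.


def estimate_size_bp(distance_mm, ladder):
    """Interpolate fragment size (bp) from a list of (distance_mm, size_bp) ladder bands."""
    points = sorted(ladder)  # ascending distance, i.e. descending fragment size
    distances = [d for d, _ in points]
    if not distances[0] <= distance_mm <= distances[-1]:
        raise ValueError("band lies outside the range covered by the ladder")
    i = bisect_left(distances, distance_mm)
    if distances[i] == distance_mm:
        return float(points[i][1])
    (d0, s0), (d1, s1) = points[i - 1], points[i]
    fraction = (distance_mm - d0) / (d1 - d0)
    log_size = math.log10(s0) + fraction * (math.log10(s1) - math.log10(s0))
    return 10 ** log_size


if __name__ == "__main__":
    # Hypothetical ladder measurements: (migration distance in mm, size in bp)
    ladder = [(12.0, 3000), (18.0, 2000), (26.0, 1000), (35.0, 500)]
    print(round(estimate_size_bp(22.0, ladder)))
```

Bands outside the ladder range raise an error instead of being extrapolated, since the log-linear relationship is least reliable at the extremes of a gel.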


To conduct the electric current, the gel must be placed in a solution containing enough ions to carry the charge and to keep the pH of the medium relatively constant during the process. For this purpose a TBE buffer, consisting of Tris, boric acid and EDTA, has gained the greatest popularity for electrophoresis of nucleic acids, but other buffer mixtures can be used as well. To visualize the bands formed during the separation, a dye must be added. This is routinely done by staining the gel with ethidium bromide, or by adding a dye such as SYBR Safe to the gel, which binds to the DNA molecules and makes them visible without a separate staining step (Brody & Kern, 2004). With this method a PCR reaction can be evaluated with regard to whether it is suitable for sequencing. One such gel is shown in figure 2.

Figure 2 - A DNA electrophoresis gel with PCR results and ladders, loaded with SYBR Safe.

2.4.3 Sequencing methods Many giant leaps were made during the 20. century in genetics but one of the most important was the ability to sequence DNA. DNA is a double strand of complementary bases of four different chemical monomers called nucleotides. Four bases are utilized in DNA, two being pyrimidines (C&T) and two purines (A&G), these bases form hydrogen bonds A to T and C to G. The bases are protected by a hydrophilic phosphate-deoxyribose backbone which has a negative charge. These chemical properties are utilized for the investigation of genetic material. The negative charge is used to separate polymers of different sizes being drawn through an electric field. The double strand is unzipped by controlling the surrounding temperature – important in PCR - and by incorporating nucleotides of different structure an elongation of DNA can be halted (Nelson & Cox, 2000). The Sanger sequencing method was the first method to reach widespread use and relies on the use of modified ribose sugars for the nucleotides. By removing a oxygen atom from the 3‘end of the phosphate-deoxyribose backbone a dideoxy ribose sugar is created which prevents further elongation leaving a product with one of these dideoxy bases at the 3´ end. By labeling these dideoxy molecules by different fluorescent dyes each polymer has a

10 specific length and a color to recognize the last nucleotide by. This method has been automated and perfected over the last 40 years and due to its high reliability and the convenience of using it along with PCR to produce a sequencing template it is still widely used (Sanger, Nicklen & Coulson, 1977). The sheer size of genomic material and the cost per sequenced has driven the further research of new technologies for sequencing and resulted in great improvements in cost, speed and reliability. It has also resulted in new research methods being born, one of which is RNA-seq – a revolutionary method in transcriptomics (van Dijk et al., 2014; Wang et al., 2009). Various platforms have competed in the new technologies, commonly referred to as next-generation sequencing, but the Illumina platform is of interest as it is cheapest per base pair sequenced and has the highest throughput per run and is currently the leading technology (van Dijk et al., 2014). Briefly the Illumina platform utilizes solid-phase amplification where adapters are ligated to the randomly fragmented genetic material to be sequenced prior to being amplified by bridge amplification creating millions of molecular clusters. These clusters are then sequenced by cyclic reversible termination which uses fluorescently modified nucleotides detected by two lasers. One nucleotide is added each time, unbound nucleotides washed away, the image is then analyzed to record which nucleotide base was incorporated before the fluorescent dye is cleaved off along with an inhibiting group that prevented the addition of more than one base (Metzker, 2010). RNA-seq is one of the methods made possible by next-generation sequencing technologies. It gives both quantitative and qualitative results about the transcriptome of the tissue being analyzed and uses next-generation sequencing of complementary DNA (cDNA) made by reverse transcription of RNA. 
As the RNA has not undergone any amplification prior to sequencing, levels of expression can be inferred from the data, along with which isoforms of genes with multiple exons are present. Furthermore, the method is free of many of the problems that plagued earlier methods for analyzing transcription, such as the need for prior knowledge of genome sequences, high background levels due to cross-hybridization and a limited range of detection due to background and saturation of signals. By contrast, RNA-seq can detect up to a 9000-fold difference in transcription, does not have to be mapped to an existing reference genome, and its results can be quantified absolutely, allowing for comparison between studies and samples (Wang et al., 2009).

2.4.4 Bioinformatics analysis of data

Because of the sheer volume of data collected in most genetics research, the only practical strategy is to use powerful computers to compare and align sequences. The resources available to researchers in genetics are powerful and free of charge; these include large databases of non-redundant sequences for structural RNA, protein-coding transcripts and non-coding RNA (Pruitt et al., 2014). To align query sequences of nucleotides or amino acids against these databases it is necessary to reduce the number of computations performed. For this purpose heuristic methods have been designed that are not guaranteed to provide the optimal alignment but provide a good alignment that often closely matches the alignments produced by exhaustive methods such as Smith-Waterman (Slater & Birney, 2005). One widely used strategy is to break sequences down into shorter sequences, called words, which are used to search against large databases. Matches are then used to extend the alignment and a scoring system is used to evaluate the goodness of fit for different matches, keeping track of significant matches. An important distinction is made between local and global alignments: local alignments search for the highest-scoring alignments between two sequences that cannot be improved by extending or shortening either aligned segment, while global alignments measure how well two whole sequences fit together. By scoring similarity, the BLAST algorithm is thus a valuable tool for finding orthologous sequences, using one organism as a template for searching the unannotated genome of a related organism (Altschul et al., 1990).
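
The idea of a local alignment score can be illustrated with a minimal sketch of the exhaustive Smith-Waterman recurrence, which heuristic tools such as BLAST approximate. The function name and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration only; they are not the parameters used by BLAST or in this study.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring scheme (assumed): linear gap penalty, no traceback.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in sequence b
            left = curr[j - 1] + gap  # gap in sequence a
            # A local alignment may restart anywhere, so scores never drop below 0
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


# Two hypothetical sequences sharing only the segment "ACGT"
print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

Because every cell is floored at zero, the score reflects only the best matching segment, which is what distinguishes a local from a global alignment.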
The discovery of the molecular structure of DNA by Rosalind Franklin, James Watson & Francis Crick in 1953 resulted in great leaps in the understanding of how genetic information is translated into polypeptides which perform metabolic and structural tasks in living creatures. It also suggested to researchers how small changes in the genetic code could result in amino acid substitutions with implications for evolution on a molecular scale. Pauling & Zuckerkandl (1963) recognized this and put forward the molecular clock hypothesis, which states that by analyzing the amino acid substitutions for the same gene in two different organisms the time since their common ancestor may be inferred. This evolutionary basis is useful in the research of newly assembled genomes because well-researched and annotated genomes can be used to locate genes of interest, which often bear great similarity in sequence and in arrangement on their chromosome. Since coding sequences are usually well conserved, as mutations tend to have a negative effect on evolutionary fitness, they are suitable for selecting sequences to blast against a newly assembled genome from a related species (Eyre-Walker & Keightley, 2007). This makes Ensembl an ideal resource for the study of vertebrate genomes, as it is designed for easy access to annotated genomes of chordates and key model organisms (Yates et al., 2016).

Chicken has been used to study genetics for a long time and some of the terms commonly used in genetics, such as alleles, genetic linkage and epistasis, were indeed coined in research on chicken (Burt, 2007). The chicken has in recent years been extensively studied and sequenced, resulting in one of the best known avian genomes. The current assembly used on Ensembl was released in December 2015 and consists of 34 chromosomes, one linkage group and 15,411 unplaced scaffolds. It has 18,346 coding genes, 6,492 non-coding genes and 38,118 gene transcripts (Aken et al., 2016). A BLAST search for genes known from chicken can be used to find sequences that are likely to be well conserved, and these sequences can then be blasted locally against an unannotated assembly such as the newly assembled ptarmigan genomes. The results from these searches can be used to design primers for PCR and full sequencing of the genomic region of interest. Various web resources are available for increasing the likelihood of success, such as Primer3Plus, a web tool for selecting viable primers from query sequences. Primer3Plus selects for a variety of criteria such as melting temperature and length, and against the likelihood of hairpin formation or primer-dimers, which can ruin any chance of success (Untergasser et al., 2007).
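
The kind of screening that primer selection involves can be sketched with two simple measures, GC fraction and a rough melting temperature. The sketch below uses the Wallace rule (2 °C per A/T, 4 °C per G/C), a rule of thumb for short oligonucleotides; it is an assumed stand-in for illustration and not the thermodynamic model that Primer3Plus applies. The function names, the thresholds and the example primer are hypothetical.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in °C by the Wallace rule (rule of thumb only)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def passes_basic_checks(primer, min_len=18, max_len=25, min_tm=55, max_tm=65):
    """Length and Tm window check; thresholds are assumed, not taken from the study."""
    return (min_len <= len(primer) <= max_len
            and min_tm <= wallace_tm(primer) <= max_tm)


candidate = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer, not a primer from appendix I
print(gc_fraction(candidate), wallace_tm(candidate), passes_basic_checks(candidate))
```

A real design tool additionally screens for hairpins and primer-dimers, which this sketch does not attempt.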

3 Materials and methods

Newly assembled draft genomes of Lagopus muta and Lagopus lagopus (Kozma et al., 2016) were mounted into CLC Genomics Workbench (Qiagen, Aarhus, Denmark). To search the draft genomes with genes from fully annotated genomes, the Ensembl database was used to choose the genome with the best characterized set of cathelicidin genes (Yates et al., 2015). The chicken genome contains four cathelicidin genes, named CATHL1, CATH2, CATH3 and CATHB1, which lie within a short genomic span (~7.5 kb) on chromosome 2. The cDNA sequences for each gene were downloaded from the Ensembl database and blasted locally in CLC Genomics Workbench against the draft genome of the Rock Ptarmigan (Lagopus muta). This yielded hits for CATH3 and CATH2 but not for CATHL1 and CATHB1 (Yates et al., 2015). The same genes gave hits when blasted against the Willow Ptarmigan (Lagopus lagopus). As the evolutionary distance between these two species is very small, 6 million years as estimated by the TimeTree database (Hedges et al., 2015), and the Rock Ptarmigan genome was very incomplete in the genomic area of interest, it was assumed that both species contain the same cathelicidin genes, and primers were designed to sequence the whole genomic region.

Figure 3 - Genomic region in chicken containing cathelicidin genes (Yates et al., 2015).

Figure 3 illustrates the genomic area to be sequenced. Primers were selected from coding sequences inside genes in the area, including genes on either side of the cathelicidin genes, to ensure that the full area containing the genes along with their regulatory elements would be covered. Primer3Plus was used to select primers from coding sequences, and the reverse complement was used for sequences on the reverse strand (Untergasser et al., 2007). Of the four cathelicidin genes, only CATH3 is on the forward strand in chicken; the others are on the reverse strand.
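
Taking the reverse complement of a sequence on the reverse strand can be sketched in a few lines. The function name and the example sequence are illustrative assumptions; the sketch handles only A, C, G, T and N.

```python
# Translation table mapping each base to its complement (upper and lower case)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical fragment, not a sequence from the study
print(reverse_complement("AAGCTTN"))
```

Applying the function twice returns the original sequence, which is a quick sanity check when preparing reverse-strand primers.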

Primers were redesigned and reordered using the genomic sequence mounted in CLC Genomics Workbench as a template, and PCR was repeated for the different regions under investigation until products of sufficient quality and the correct size had been acquired for all regions. One such PCR program, run with the Q5 Hot Start High-Fidelity DNA polymerase (New England Biolabs, US), is listed in table 1.

Table 1 – PCR program for Q5 hot DNA polymerase

Step                    Temperature    Time length
1 Initial warm up       98°C           30 seconds
2 Denaturation          98°C           10 seconds
3 Annealing             68°C           20 seconds
4 Extension             72°C           2:30
Repeat 2-4 34 times
5 Final extension       72°C           10:00
6 Storage               4°C            forever

Products of the PCR reactions were separated by length with gel electrophoresis to check their length and purity prior to sequencing. For all the genomic DNA samples a 1% Invitrogen UltraPure agarose gel in 0.5X TBE buffer was used and the samples were run for one hour at 110 V; 1 µL of SYBR Safe was used for small gels and 2 µL for larger gels. The cathelicidin region was divided into five separate regions for amplification by PCR, which are listed in table 2. DNA used for the PCR reactions was isolated from tissue samples with accession no. LM-12-040, the same male bird that was used for the whole genome sequencing (WGS), and no. LM-16-R01 from the Icelandic Institute of Natural History. The birds were caught in North-East Iceland.

Table 2 – Cathelicidin genomic region in Lagopus muta, PCR amplification products

Genes flanking region    Location in L. muta genome                   Length of area
KLH18 – CATH3            Chromosome 2 (Chr. 2) 3974785 – 3978724      3939 base pairs (bp)
CATH3 – CATH2            Chr. 2 3978655 – 3979625                     970 bp
CATH2 – CATHB1           Chr. 2 3979540 – 3982556                     3016 bp
CATHB1 – CATHL1          Chr. 2 3981492 – 3984222                     2730 bp
CATHL1 – TBRG4           Chr. 2 3983227 – 3987256                     3468 bp
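
The arithmetic behind table 2 can be checked with a short sketch that computes each span as end minus start and the overlap between consecutive amplification regions. The tuple layout and function names are assumptions for illustration; the coordinates and listed lengths are copied from the table. For the last row the coordinates span 4029 bp while the table lists 3468 bp; the sketch only surfaces that difference and makes no claim about which figure is the intended one.

```python
# (label, start, end, listed length in bp), copied from table 2
regions = [
    ("KLH18-CATH3", 3974785, 3978724, 3939),
    ("CATH3-CATH2", 3978655, 3979625, 970),
    ("CATH2-CATHB1", 3979540, 3982556, 3016),
    ("CATHB1-CATHL1", 3981492, 3984222, 2730),
    ("CATHL1-TBRG4", 3983227, 3987256, 3468),
]


def span(start, end):
    """Length of a region computed as end minus start, as in the table."""
    return end - start


def overlaps(rows):
    """Overlap in bp between each region and the one that follows it."""
    return [prev[2] - nxt[1] for prev, nxt in zip(rows, rows[1:])]


for label, start, end, listed in regions:
    computed = span(start, end)
    note = "" if computed == listed else "  <- differs from listed length"
    print(f"{label}: {computed} bp (listed {listed} bp){note}")

print("Overlaps between consecutive regions:", overlaps(regions))
```

Positive overlaps between neighbouring regions are what allow the five products to be joined into one continuous sequence.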

Three different primer orders were placed with Macrogen for a total of 13 primer pairs, as some of the initial primers yielded no product or gave a PCR with a large amount of side products which would have been unsuitable for sequencing (see appendix I). Some of these extra primers were nevertheless used, and the PCR products were sent for sequencing at Macrogen using an EZ-Bag kit with 48 samples (for a list of PCR products and primers see appendix I). For the sequencing reaction, template and primer are mixed together, taking care to adjust for the size of the PCR product being used: a longer PCR product needs more volume relative to the primers. Each sample of the EZ-Bag order contains 10 µL (Macrogen, n.d.).

Primers were designed for RT-PCR of the cathelicidin genes and of suitable housekeeping genes for measuring relative expression. The sequences were selected from exons of the CATH genes as well as the housekeeping genes HPRT (encoding an enzyme important for purine salvage) and GAPDH (part of glycolysis). The intended products of the primers ranged from 80 to 110 bp in length (see appendix I). Two primer pairs were selected for GAPDH to increase the chances of getting at least one good housekeeping gene for comparison in the RT-PCR reaction. Care was taken to use matching sequences from the genome mounted in CLC Genomics Workbench for the cathelicidin genes, and primers were selected with Primer3Plus for suitable amplicon length (Untergasser et al., 2007). A PCR reaction was performed against genomic DNA to test the RT-PCR primers and the resulting products were separated by size using gel electrophoresis in a 1% Invitrogen UltraPure agarose gel loaded with SYBR Safe dye and containing 0.5X TBE. A 50 bp ladder from NEB was used for comparison and the samples were run for 40 minutes in the gel.

For extraction of RNA, tissue samples from the bird with accession no.
LM-16-R01 of the Icelandic Institute of Natural History were used; the tissue was stored in RNAlater (Ambion, US) prior to extraction. Tissue from eight organs was used: brain, kidney, heart, ovaries, muscle, liver, left testicle and spleen. An Agencourt RNAdvance Tissue kit for total RNA isolation from tissue was used to extract RNA. The tissue was homogenized mechanically and put in lysis buffer prior to digestion with proteinase K. The samples were then loaded with magnetic beads and subjected to wash steps prior to digestion of all genomic DNA with RNase-free DNase (Thermo Fisher #EN0521). During the wash steps the beads were held in place with a magnet so that the RNA-covered magnetic beads would not be removed. Finally, the RNA was eluted with 50 µL of nuclease-free water per sample. Following the RNA extraction, 10 µL was used for cDNA synthesis using a High-Capacity cDNA Reverse Transcription kit from Applied Biosystems. A cDNA master mix was made (see appendix I) and 10 µL of sample was mixed with 10 µL of master mix. The samples were then loaded into a thermal cycler for reverse transcription followed by heat inactivation of the reverse transcriptase enzyme, see table 3.

Table 3 – Program for reverse transcription in thermal cycler using Reverse Transcription Kit from Applied Biosystems

Step    Temperature    Time
1       25°C           10 minutes
2       37°C           120 minutes
3       85°C           5 minutes
4       4°C            Forever

To complete the sequence of the genome further PCR was needed along with purification of products using gel electrophoresis in order to reduce interference from PCR artifacts in the sequencing. Another strategy was also utilized to create primers for the whole region in question using PrimerDesign-M, a webtool specially designed for primer walking across genomes, creating 16 primer pairs at regular intervals for products ranging from 1400-1600 bp (Brodin et al., 2013; Yoon & Leitner, 2014). To accomplish this a region ranging from a KLH18 match on the Lagopus muta chromosome 2 to a TBRG4 match was exported in fasta format. The region contained 14393 bp of which 6517 bp were unidentified.
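
The principle of primer walking, covering a region with evenly spaced, overlapping amplicons, can be sketched as follows. This is not the PrimerDesign-M algorithm, which works from alignments and primer properties; the function name and the fixed amplicon length of 1500 bp are assumptions, while the region length (14393 bp) and the number of primer pairs (16) are taken from the text.

```python
def tile_amplicons(region_length, n_amplicons, amplicon_length):
    """Return (start, end) coordinates of evenly spaced overlapping amplicons.

    Coordinates are 0-based with an exclusive end; the first amplicon starts
    at 0 and the last one ends at region_length.
    """
    if n_amplicons * amplicon_length < region_length:
        raise ValueError("amplicons too short to cover the region")
    if n_amplicons == 1:
        return [(0, amplicon_length)]
    # Distance between consecutive amplicon starts
    step = (region_length - amplicon_length) / (n_amplicons - 1)
    tiles = []
    for i in range(n_amplicons):
        start = round(i * step)
        tiles.append((start, start + amplicon_length))
    return tiles


tiles = tile_amplicons(region_length=14393, n_amplicons=16, amplicon_length=1500)
for number, (start, end) in enumerate(tiles, 1):
    print(f"amplicon {number}: {start}-{end}")
```

With these numbers neighbouring amplicons overlap substantially, which gives redundancy if an individual PCR fails.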

Figure 4 - A visualization of the results from the PrimerDesign-M primer selection (HIV sequence database, 2017)

This sequence was loaded into the PrimerDesign-M webtool on the HIV sequence database and primers were selected for each region with a melting temperature close to 60°C; primers showing self-complementarity or capable of dimer formation were discarded. This was tested using the Multiple Primer Analyzer on the Thermo Fisher website (n.d.). Figure 4 shows the primer pairs and visualizes the region and its unidentified stretches, where the complexity line (green) maxes out. Sixteen primer pairs were ordered from Macrogen for PCR and sequencing from L. muta samples (appendix I). The primers from PrimerDesign-M were then used for PCR against the genomic DNA and the resulting products were electrophoresed to test for purity by the same methods as before. To obtain better sequences from the PCR products already in place, some products were electrophoresed, purified from the agarose gel and sent for sequencing along with the PCR products from the PrimerDesign-M primers.

The sequences were mounted in the program CodonCode Aligner (CodonCode Corporation, Dedham, MA) together with the exported sequence for the region ranging from KLH18 to TBRG4, which contains the CATH genes in Lagopus muta. CodonCode Aligner was used to trim the ends of the sequences, maximizing the region with an error rate <0.1, and to discard all sequences shorter than 25 bp or with fewer than 50 bases with a Phred score of at least 20 (99% accurate). The sequences were aligned in CodonCode Aligner and the resulting sequence was used for further bioinformatics analysis against sequences found online in the Ensembl and NCBI databases. Further analysis was conducted using FGENESH, a gene prediction algorithm, as well as MEGA, a molecular evolutionary genetics program useful for aligning sequences (Solovyev et al., 2006; Tamura et al., 2007).

A preliminary expression analysis was performed on eight tissues from Rock Ptarmigan using custom-made primer pairs for the four cathelicidin genes and a housekeeping gene. A qPCR run was performed on cDNA on an Applied Biosystems StepOne Real-Time PCR system, with GAPDH as the housekeeping gene, using Luna Universal qPCR Master Mix (NEB, US) containing SYBR Green dye. Results were collected and analyzed in StepOne software v. 2.1.

4 Results

In order to characterize the cathelicidin cluster in the Rock Ptarmigan genome I designed multiple primer pairs, tested polymerases with various conditions for optimization of PCR and sequencing. Because of the similarity of the genes, the risk is amplification of multiple regions resulting in multiple PCR products. The genomic characterization was performed in several PCR / sequencing attempts. New primer pairs were designed and subjected to PCR and sequencing based on results from the previous attempts.

4.1 Optimization of PCR

In the first attempt, I designed primers based on conserved regions in the chicken cathelicidin genes, which gave poor results (see figure 5). The PCR generated fuzzy and multiple bands, giving a low chance of good sequencing results. Two of the primer pairs gave weak bands of the correct length and were used for sequencing.

Figure 5 - Mixed results from first PCR run

The second run of PCR, performed with primer pairs selected from the L. muta draft genome, also gave poor results; however, one band covering the region between CATH3 and CATH2 gave sequencing results, see figure 6.

Figure 6 - Bands covering CATH3-CATH2 region in Lagopus muta in top row lanes 17-19.

Although clear bands appeared for another region, they were shorter than intended and not suitable for sequencing. For the third run it was deemed likely that some regions had been adequately covered by the initial sequencing run, so only three new primer pairs were used. The PCR gave multiple bands, and as time did not allow further purification of bands of the correct size or repetition of PCR steps, these products were sequenced as they were, resulting in low-quality sequences. Figure 2 shows the impure products of the third PCR run.

Figure 7 - PrimerDesign-M designed primers used for PCR, giving good quality products for most of the CATH region ranging from KLH18 to TBRG4.

The final attempt, using PrimerDesign-M, proved successful, resulting in PCR products giving strong bands of the expected size (figure 7).

4.2 Sequencing results

The first batch of sequencing results obtained from Macrogen yielded 2472 new base pairs for L. muta that were missing in the draft genome sequences. Fewer than half of the samples, 19 of the 48 sent for sequencing, yielded more than 100 bp with a quality value of >20, which indicates high-quality sequence. Mounted into CodonCode Aligner (CodonCode Corporation, Dedham, MA), these results filled in some new areas. The second and third sequencing plates yielded much more and resulted in a contig covering almost the whole region containing the four cathelicidin genes. Two small gaps, one 180 bp long located 2200 bp into the region and one 101 bp long located 8400 bp into the region, remained for later completion. Figure 8 illustrates the final assembly of the genomic area by CodonCode Aligner.

Figure 8 – Visualization from CodonCode Aligner of the alignment of the Sanger sequences from the study against the draft genome of Lagopus muta.

The assembled contig is 14995 bp long, with a total of 8109 G or C bases (54%) and 5589 A or T bases (37%); 1297 bp of the region (8.6%) were not covered by Sanger sequencing. Of the bases in the sequence, 11947 (79.7%) had a quality score above 20 and 11231 (74.9%) a score above 30. A Q20 score corresponds to a 99% probability that the base call is correct, while Q30 corresponds to 99.9%. CodonCode Aligner assigns a quality score of 15 to imported sequences, but the coverage of the draft genome was 100-fold, so the true quality of the draft sequence, and hence of the assembly, is likely to be much higher than the Q15 that CodonCode assigns.
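
The relationship between a Phred quality score and base-call accuracy, Q = -10·log10(P) where P is the error probability, and the composition percentages quoted for the contig can be reproduced with a short sketch. The function and variable names are illustrative assumptions; the counts are those given in the text.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


for q in (15, 20, 30):
    accuracy = 1 - phred_to_error_probability(q)
    print(f"Q{q}: accuracy {accuracy:.2%}")

# Contig composition as reported in the text
contig_length = 14995
counts = {"G or C": 8109, "A or T": 5589, "not covered": 1297}
for label, count in counts.items():
    print(f"{label}: {count} bp ({count / contig_length:.1%})")
```

The three counts add up to the contig length, which confirms that the categories are exhaustive.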

4.3 Sequence analysis

Using the full assembled contig as a query in a discontiguous BLAST search against bird sequences in the NCBI database gave matches with low expect values (10^-50 or less) and high identities for all four cathelicidin genes. Strong matches came up for chicken, Japanese quail, turkey, pheasant and rock dove, and all four cathelicidin genes were strongly represented in the BLAST results. Comparison with the fully annotated chicken genome in the Ensembl database showed the same order of the genes as in chicken, turkey and Japanese quail: KLH18, CATH3, CATH2, CATHB1, CATHL1, TBRG4. The proteins predicted by FGENESH, a gene prediction algorithm (see appendix I), were blasted against chicken sequences at NCBI, giving protein matches for all four cathelicidin genes as well as two genes upstream of the region. The BLAST matches for the six proteins predicted by FGENESH are listed in table 4.

Table 4 – FGENESH gene prediction results on contig assembled in CodonCode Aligner

Prediction   Nr. of exons   Starting position   End position   Length in amino acids   Top BLAST hit                                    E value   Identity %
1            3              2815                3324           94                      CATH1 precursor                                  4e-32     61%
2            5              4336                5563           261                     CATHB1 precursor                                 3e-100    67%
3            2              5828                7047           68                      CATH2                                            3e-21     53%
4            4              8266                9416           130                     CATH3                                            1e-72     82%
5            4              10310               11004          154                     K/Na hyperpol. activated nuc. gated channel 4    7e-68     90%
6            1              13783               14031          82                      KLH18                                            5e-51     98%

Using cathelicidin mRNA sequences as a template to look for the cathelin domain of cathelicidin, two such regions were found by aligning them in MEGA with MUSCLE. These were at positions corresponding to the CATH2 and CATH3 BLAST results and were accordingly on different strands of the DNA. CLC Genomics Workbench offers tblastn (searching translated nucleotide databases using a protein query). BLAST queries with mature peptides from chicken (appendix I) as search strings gave good matches for all cathelicidins. The mature peptides are highly conserved between chicken and rock ptarmigan, as visualized in figure 9, which shows in green the sequences from the assembled contig aligned against the chicken sequences. Only one amino acid is substituted in each of CATHL1 and CATH2, there are two changes in CATH3 and eight changes in CATHB1. CLC visualizes the quality of the match by how bright the green color of the hit is.

Figure 9 - Alignment of chicken cathelicidins against rock ptarmigan cathelicidins

To further test the validity of the genome sequences for Rock Ptarmigan, the contig sequence containing the CATH2 homolog was aligned in MEGA against five sequences for the gene selected from the NCBI database from avian species. A maximum likelihood tree was constructed showing the evolutionary history of the gene, see figure 10. The maximum likelihood tree places L.muta firmly in the middle of the tree.

Figure 10 - Maximum likelihood tree constructed from avian CATH2 sequences aligned against the CATH2 gene from rock ptarmigan (Tamura et al., 2007)

The amphipathic nature of cathelicidins, essentially a mixture of hydrophobic residues (yellow) and positively charged residues, is easily noticeable in figure 11. Charged residues are colored: H, K and R carry a positive charge (blue) while D and E carry a negative charge (red). The peptides have a strikingly high positive charge while also containing many highly hydrophobic amino acid residues such as tryptophan (W), tyrosine (Y) and phenylalanine (F).

Figure 11 - Visualization of the cathelicidin genes from L.muta. Red – negative charge, blue - positive charge, yellow - hydrophobic, green - polar.

Using an online peptide calculator, pepcalc.com, the charge and solubility of the peptides can be calculated. Table 5 gives the results for the rock ptarmigan cathelicidin mature antimicrobial peptides (pepcalc.com, 2015).

Table 5 – Results from pepcalc.com (2015) showing characteristics of ptarmigan cathelicidins

Peptide    Net charge at pH 7    Estimated solubility    Molecular weight
CATHL1     +8                    Good                    3127 g/mol
CATH2      +10                   Good                    3804 g/mol
CATHB1     +9                    Good                    5092 g/mol
CATH3      +6                    Poor                    3194 g/mol

As CATH3 is adjacent to an unsequenced area further results will have to clarify the nature of the peptide in rock ptarmigan.

4.4 Expression of CATH genes in Rock Ptarmigan

The most extensive and reliable expression analysis is obtained by studying transcriptomes. Most of the RNA I isolated from the eight available tissues of the Rock Ptarmigan has been sent for RNA sequencing at Íslensk Erfðagreining ehf (Reykjavík, Iceland) and is awaiting processing. However, a single qPCR run for the eight tissue samples gave high levels of expression for all CATH genes in all the tested tissue samples. Further analysis is required to reliably quantify the expression. The preliminary expression analysis was performed using custom-made primer pairs for the four cathelicidin genes and a housekeeping gene for Rock Ptarmigan. The qPCR run was performed on cDNA on an Applied Biosystems StepOne Real-Time PCR system, with GAPDH as the housekeeping gene, using Luna Universal qPCR Master Mix (NEB, US) containing SYBR Green dye. Results were collected and analyzed in StepOne software v. 2.1. Only raw results are shown (table 6). Lack of RNA hampered proper design of the expression analysis, which lacks standards and triplicates for each tissue sample.

Table 6 – Expression levels given as cycle threshold (Ct) in eight tissue samples from rock ptarmigan, measured relative to GAPDH2, a glycolysis housekeeping gene.

Tissue     GAPDH2    CATH3    CATH2    CATHB1    CATHL1
Heart      17.16     19.19    18.23    18.53     19.36
Kidney     15.49     18.28    17.29    17.39     17.51
Brain      15.38     19.49    18.30    18.40     18.59
Ovaries    13.66     19.14    17.52    17.82     18.27
Muscle     29.09     23.41    22.45    22.83     23.98
Liver      17.85     20.10    18.59    18.73     19.59
Testis     28.08     29.30    19.72    20.82     22.34
Spleen     18.31     21.83    19.81    19.44     21.15

Table 6 illustrates these high levels, where a low value for the cycle threshold (Ct) implies a high initial amount of genetic material complementary to the qPCR primers. As GAPDH2, the housekeeping gene, is constitutively expressed, any value close to it would imply high expression of the gene. In table 6 no value is higher than 30, while generally any Ct of <29 is considered a strong positive. Figure 12 shows the amplification plot, where the real-time amplification measured from fluorescence is visualized.

Figure 12 - Amplification plot from StepOne software v. 2.1 from Applied Biosystems
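
How raw Ct values are commonly compared against a housekeeping gene can be sketched with the ΔCt calculation, where ΔCt = Ct(target) - Ct(reference) and the relative amount is 2^(-ΔCt). The sketch assumes perfect doubling in every cycle (100% amplification efficiency), which was not verified here, and it uses the single unreplicated run from table 6, so the output is illustrative rather than a quantified result of the study. The dictionary layout and function names are assumptions.

```python
# Raw Ct values from table 6; "reference" is the housekeeping-gene column
ct_values = {
    "Heart":   {"reference": 17.16, "CATH3": 19.19, "CATH2": 18.23, "CATHB1": 18.53, "CATHL1": 19.36},
    "Kidney":  {"reference": 15.49, "CATH3": 18.28, "CATH2": 17.29, "CATHB1": 17.39, "CATHL1": 17.51},
    "Brain":   {"reference": 15.38, "CATH3": 19.49, "CATH2": 18.30, "CATHB1": 18.40, "CATHL1": 18.59},
    "Ovaries": {"reference": 13.66, "CATH3": 19.14, "CATH2": 17.52, "CATHB1": 17.82, "CATHL1": 18.27},
    "Muscle":  {"reference": 29.09, "CATH3": 23.41, "CATH2": 22.45, "CATHB1": 22.83, "CATHL1": 23.98},
    "Liver":   {"reference": 17.85, "CATH3": 20.10, "CATH2": 18.59, "CATHB1": 18.73, "CATHL1": 19.59},
    "Testis":  {"reference": 28.08, "CATH3": 29.30, "CATH2": 19.72, "CATHB1": 20.82, "CATHL1": 22.34},
    "Spleen":  {"reference": 18.31, "CATH3": 21.83, "CATH2": 19.81, "CATHB1": 19.44, "CATHL1": 21.15},
}


def delta_ct(ct_target, ct_reference):
    """Difference in threshold cycle between target and reference gene."""
    return ct_target - ct_reference


def relative_expression(ct_target, ct_reference):
    """Target amount relative to the reference, assuming doubling per cycle."""
    return 2 ** -delta_ct(ct_target, ct_reference)


for tissue, row in ct_values.items():
    ref = row["reference"]
    summary = ", ".join(
        f"{gene} dCt={delta_ct(ct, ref):+.2f}"
        for gene, ct in row.items() if gene != "reference"
    )
    print(f"{tissue}: {summary}")
```

A negative ΔCt, as seen for muscle, means the target crossed the threshold earlier than the reference gene in that sample.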

5 Discussion

A genomic region spanning 15 kb was assembled from a draft genome and sequences obtained with Sanger sequencing, giving almost total coverage of the area. Furthermore, the order, location and arrangement of the genes were confirmed in Lagopus muta using the novel draft genome. RNA was extracted from eight tissues and used to create cDNA for qPCR analysis of the expression of the four genes. The preliminary expression analysis shows expression of the four cathelicidins, CATHL1, CATH2, CATH3 and CATHB1, in all tissue samples tested. Ongoing transcriptome analysis will provide further validation of expression.

Four separate runs of PCR were needed to produce enough products for sequencing to get this far with the region, and difficulties with PCR of large regions caused problems in the study. In the end a change of tactics, setting up for smaller PCR products, yielded most of the bases that were recovered in the Sanger sequencing process. A total of 157 sequencing samples were sent to Macrogen, and the problems with impure PCR products resulting in very poor sequences were to some extent alleviated by the alignment software CodonCode Aligner. The analysis of the genomic region was complicated by the fact that the genes are very similar. This likely contributed to the PCR products of confounding length: many of the primers may have found complementary regions aside from their intended targets, creating a complex mixture of PCR products with multiple bands and ruining any realistic chance of getting high-quality sequences from those reactions. It was also worrying that only two of the four genes in the contig assembled from the data contained a highly conserved cathelin region. Why the CATHL1 and CATHB1 gene sequences in the contig did not contain such a region is puzzling and gave rise to doubts about the integrity of the assembly.
However, the good matches found for all the mature antimicrobial peptide exons leave little doubt about the arrangement of the genes. Given the pitfalls in primer design and the difficulties in performing PCR on genomic material for products around 3 kb, this was an ambitious thesis project for an undergraduate. The results from this project will however prove valuable preparation for further analysis of the region, which will hopefully close the remaining gaps and repair any errors still in the contig. To this end, I have designed primer pairs for closing the gaps and improving the quality of the cathelicidin cluster sequence.

In an evolutionary context, it is interesting to align the amino acid sequences of the four mature antibacterial cathelicidin peptides of Rock Ptarmigan, turkey and chicken. The phylogeny of the avian subfamily Tetraoninae (grouse and ptarmigan), a Holarctic group in the order Galliformes, has been investigated using mitochondrial 12S and ND2 sequence data (Dimcheff et al., 2002). The phylogeny shows that the ptarmigans are more closely related to turkey than to chicken (fig. 13).

Figure 13 – Optimal phylogenetic tree using maximum likelihood analysis of 12S and ND2 DNA sequences for 36 galliform taxa (Dimcheff et al., 2002).

The comparison of peptide sequences to chicken gave an interesting view of the genes: there is very little variation in the peptide sequence for CATHL1, CATH2 or CATH3, while it is much higher in CATHB1 (figure 14). From an evolutionary viewpoint this would suggest that the CATHB1 gene is under selection. The other cathelicidins may have evolved to a point where most changes to the amino acid sequence would result in a loss of fitness to the individual and therefore be subject to negative selection. The CATHB1 peptide in Rock Ptarmigan is strikingly more similar to that of turkey than to that of chicken, indicating different selective forces in evolution. It is intriguing to wonder whether a different lifestyle or infectious burden could have caused the changes.

Figure 14 - Amino acid sequence of the four cathelicidin mature antimicrobial peptides in chicken, turkey and Rock Ptarmigan. Amino acids marked with yellow differ from the chicken peptide sequence.

The results from the qPCR experiment are interesting in that they show that the expression of antimicrobial peptides is ubiquitous in the L. muta tissues tested. However, as the results are only preliminary, little can be asserted conclusively on the subject. RNA-seq results will be forthcoming and should give a good idea of the relative expression of the genes in different tissues. Another interesting avenue for further research would be to test the antimicrobial peptides from L. muta for their ability to inhibit the growth of microorganisms, an important clinical consideration in the age of antibiotic-resistant bacterial strains.

6 Conclusions

In this project, I have characterized the genomic organization of the cathelicidin cluster in rock ptarmigan, revealing orthologues of all four cathelicidin genes found in chicken and turkey (Meleagris gallopavo), namely CATHL1, CATH2, CATH3 and CATHB1, flanked by KLH18 and TBRG4, in the same order on chromosome 2. The >14 kb region is fully sequenced except for two small gaps of 101 bp and 180 bp, which can easily be covered by sequencing two PCR products. Translation of the open reading frames of the cathelicidins revealed substitutions in all four genes in rock ptarmigan, with CATHL1, CATH2 and CATH3 showing the greatest similarity to chicken but CATHB1 to turkey. The region can readily be subjected to a more thorough analysis of gene organization regarding exons, introns, promoters and untranslated regions when results from the transcriptomes become available. Furthermore, preliminary results indicate that the sister species of Rock Ptarmigan, Willow Ptarmigan, harbours essentially the same cathelicidin gene cluster organization. Further analysis of Willow Ptarmigan is a subject for another detailed study similar to this one. The RNA I extracted from the eight tissues will also provide important results from the RNA-seq, currently pending, giving a full transcriptome for the rock ptarmigan for the first time; these results can provide research material for many further studies. I am confident that my thesis work together with the transcriptome results will contribute to a scientific paper suitable for an international peer-reviewed scientific journal. Novel techniques for combating microbial pathogens are an important subject of study, and nature is an endless reservoir of opportunities for study and possible discovery of important compounds capable of improving our ability to fight infections.
The study of genetics also gives perspective to evolutionary forces and further research into cathelicidins and other anti-microbial peptide genes might answer interesting questions about factors affecting the survival of rock ptarmigan.

7 References

Aken, B. L., Ayling, S., Barrell, D., Clarke, L., Curwen, V., Fairley, S., ... & Howe, K. (2016). The Ensembl gene annotation system. Database, 2016. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410. Applied Biosystems (2010). Applied Biosystems StepOne™ and StepOnePlus™ Real-Time PCR Systems Reagent Guide. Retrieved 25. march 2017 from http://www.mbl.edu/jbpc/files/2014/05/ABI_StepOnePlus_qPCR_ReagentGuide.pdf Applied Biosystems (n.d.). Quantitative analysis of the PCR reaction. Retrieved 21. march 2017 from http://tools.thermofisher.com/content/sfs/brochures/cms_042487.pdf Boman, H. G. (2003). Antibacterial peptides: basic facts and emerging concepts. Journal of internal medicine, 254(3), 197-215. Brodin, J., Krishnamoorthy, M., Athreya, G., Fischer, W., Hraber, P., Gleasner, C., ... & Leitner, T. (2013). A multiple-alignment based primer design algorithm for genetically highly variable DNA targets. BMC bioinformatics, 14(1), 255. Brody, J. R. & Kern, S. E. (2004). History and principles of conductive media for standard DNA electrophoresis. Analytical biochemistry, 333(1), 1-13. Burt, D. W. (2007). Emergence of the chicken as a model organism: implications for agriculture and biology. Poultry science, 86(7), 1460-1471. Cheng, Y., Prickett, M. D., Gutowska, W., Kuo, R., Belov, K., & Burt, D. W. (2015). Evolution of the avian β-defensin and cathelicidin genes. BMC evolutionary biology, 15(1), 188. Cuperus, T., Coorens, M., van Dijk, A., & Haagsman, H. P. (2013). Avian host defense peptides. Developmental & Comparative Immunology, 41(3), 352-369. Dimcheff, D.E., Drovetski, S.V., Mindell, D.P. (2002). Phylogeny of Tetraoninae and other galliform birds using mitochondrial 12S and ND2 genes. Mol Phylogenet Evol. 24(2):203-15 Eyre-Walker, A. & Keightley, P. D. (2007). The distribution of fitness effects of new mutations. Nature Reviews Genetics, 8(8), 610-618. 
Finlay, B. B. & Hancock, R. E. (2004). Can innate immunity be enhanced to treat microbial infections? Nature Reviews Microbiology, 2(6), 497-504.

32

Goitsuka, R., Chen-lo, H. C., Benyon, L., Asano, Y., Kitamura, D., & Cooper, M. D. (2007). Chicken cathelicidin-B1, an antimicrobial guardian at the mucosal M cell gateway. Proceedings of the National Academy of Sciences, 104(38), 15063-15068. Hedges, S.B., Marin, J., Suleski, M., Paymer, M. & Kumar, S. (2015). Tree of Life Reveals Clock-Like Speciation and Diversification. Mol Biol Evol. 32: 835-845. HIV sequence database (2017). Primer Design-M results. Retrieved 20. march 2017 from https://www.hiv.lanl.gov/tmp/PRIMER_DESIGN/bw5UHtY2W8/out.html Ishige, T., Hara, H., Hirano, T., Kono, T., & Hanzawa, K. (2017). Characterization of the cathelicidin cluster in the Japanese quail (Coturnix japonica). Animal Science Journal. Kozma, R., Melsted, P., Magnússon, K. P., & Höglund, J. (2016). Looking into the past–the reaction of three grouse species to climate change over the last million years using whole genome sequences. Molecular ecology, 25(2), 570-580. Kumarasamy, K. K., Toleman, M. A., Walsh, T. R., Bagaria, J., Butt, F., Balakrishnan, R., ... & Krishnan, P. (2010). Emergence of a new antibiotic resistance mechanism in India, Pakistan, and the UK: a molecular, biological, and epidemiological study. The Lancet infectious diseases, 10(9), 597-602. Lee, M. O., Jang, H. J., Rengaraj, D., Yang, S. Y., Han, J. Y., Lamont, S. J., & Womack, J. E. (2016). Tissue expression and antibacterial activity of host defense peptides in chicken. BMC veterinary research, 12(1), 231. Macrogen (n.d.). All about EZ-Bag. Retrieved 28. march 2017 from http://dna.macrogen.com/ezseq/ezbag.html Metzker, M. L. (2010). Sequencing technologies—the next generation. Nature reviews genetics, 11(1), 31-46. Michaelson, D., Rayner, J., Couto, M., & Ganz, T. (1992). Cationic defensins arise from charge-neutralized propeptides: a mechanism for avoiding leukocyte autocytotoxicity? Journal of leukocyte biology, 51(6), 634-639. Nelson, D.L. & Cox, M.M. (2000). Lehninger Principles of Biochemistry. 
New York: Worth. Nielsen, O. K. (2011). Gyrfalcon population and reproduction in relation to Rock Ptarmigan numbers in Iceland. Gyrfalcons and Ptarmigans in a Changing World (The Peregrine Fund, Boise, ID). Pauling, L., & Zuckerkandl, E. (1963). Chemical paleogenetics. Acta chem. scand, 17, S9- S16. Pepcalc.com (2015). Pepcalc.com – Peptide property calculator. Retrieved 9. april 2017 from pepcalc.com. 33

Pezza, J. A., Kucera, R., & Sun, L. (2014). Polymerase fidelity: what is it, and what does it mean for your PCR. New England Biolabs. Ponchel, F., Toomes, C., Bransfield, K., Leong, F. T., Douglas, S. H., Field, S. L., ... & Robinson, P. A. (2003). Real-time PCR based on SYBR-Green I fluorescence: an alternative to the TaqMan assay for a relative quantification of gene rearrangements, gene amplifications and micro gene deletions. BMC biotechnology, 3(1), 18. Pruitt, K. D., Brown, G. R., Hiatt, S. M., Thibaud-Nissen, F., Astashyn, A., Ermolaeva, O., ... & Murphy, M. R. (2014). RefSeq: an update on mammalian reference sequences. Nucleic acids research, 42(D1), D756-D763.Pütsep, K., Carlsson, G., Boman, H. G., & Andersson, M. (2002). Deficiency of antibacterial peptides in patients with morbus Kostmann: an observation study. The Lancet, 360(9340), 1144-1149. Saiki, R. K., Gelfand, D. H., Stoffel, S., Scharf, S. J., & Higuchi, R. (1988). Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science, 239(4839), 487. Sanger, F., Nicklen, S., & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the national academy of sciences, 74(12), 5463-5467. Slater, G. S. C., & Birney, E. (2005). Automated generation of heuristics for biological sequence comparison. BMC bioinformatics, 6(1), 31. Solovyev, V., Kosarev, P., Seledsov, I., & Vorobyev, D. (2006). Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome biology, 7(1) Tamura, K., Dudley, J., Nei, M., & Kumar, S. (2007). MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0. Molecular biology and evolution, 24(8), 1596-1599. Thermo-Fisher (n.d.). Multiple Primer Analyzer. Retrieved 20. march 2017 from https://www.thermofisher.com/is/en/home/brands/thermo-scientific/molecular- biology/molecular-biology-learning-center/molecular-biology-resource- library/thermo-scientific-web-tools/multiple-primer-analyzer.html Thorne, H. 
V. (1967). Electrophoretic characterization and fractionation of polyoma virus DNA. Journal of molecular biology, 24(2), 203-211. Untergasser, A., Nijveen, H., Rao, X., Bisseling, T., Geurts, R., & Leunissen, J. A. (2007). Primer3Plus, an enhanced web interface to Primer3. Nucleic acids research, 35(suppl 2), W71-W74. Yacoub, H. A., Elazzazy, A. M., Mahmoud, M. M., Baeshen, M. N., Al-Maghrabi, O. A., Alkarim, S., ... & Uversky, V. N. (2016). Chicken cathelicidins as potent intrinsically 34

disordered biocides with antimicrobial activity against infectious pathogens. Developmental & Comparative Immunology, 65, 8-24. Yates, A., Akanni, W., Amode, M. R., Barrell, D., Billis, K., Carvalho-Silva, D., ... & Girón, C. G. (2015). Ensembl 2016. Nucleic acids research, gkv1157. Yu, H., Lu, Y., Qiao, X., Wei, L., Fu, T., Cai, S., ... & Wang, Y. (2015). Novel cathelicidins from pigeon highlights evolutionary convergence in avain cathelicidins and functions in modulation of innate immunity. Scientific reports, 5, 11082. van Dijk, E. L., Auger, H., Jaszczyszyn, Y., & Thermes, C. (2014). Ten years of next- generation sequencing technology. Trends in genetics, 30(9), 418-426. Van Valen, L. (1974). Molecular evolution as predicted by natural selection. Journal of molecular evolution, 3(2), 89-101. Yoon, H., & Leitner, T. (2015). PrimerDesign-M: a multiple-alignment based multiple- primer design tool for walking across variable genomes. Bioinformatics, 31(9), 1472- 1474. Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews genetics, 10(1), 57-63. Wang, G., Li, X., & Wang, Z. (2016). APD3: the antimicrobial peptide database as a tool for research and education. Nucleic acids research, 44(D1), D1087-D1093. Zasloff, M. (2002). Antimicrobial peptides of multicellular organisms. Nature, 415(6870), 389-395. Zhang, G., & Sunkara, L. T. (2014). Avian antimicrobial host defense peptides: from biology to therapeutic applications. Pharmaceuticals, 7(3), 220-247.

Appendix I

Table 7 – Primers used in amplification and sequencing of PCR products from ptarmigan cathelicidin genomic region

Primer pair  Forward primer sequence    Reverse primer sequence  Genome selected from  Region amplified
1            TCTTTGCCTTCCTGCCTGAC       GCTCTGGGCATGGCTCATTT     Chicken               KLH18 – CATH3
2            GTCAAGCGCTTCTGGCCG         CGTCGATCTGAGCACTCTGC     Chicken               CATH3 – CATH2
3            CCTGGATGGTGATGGTGACC       GATGGATCCACACCAGCTGG     Chicken               CATH2 – CATHB1
4            GAGTGCTGGTGACGTTCAGA       GGCTGTGGACTCCTACAACC     Chicken               CATHB1 – CATHL1
5            CGCCCGGTAGAGGTTGTATC       GTTCACCCAGCTGGAAGAGT     Chicken               CATHL1 – TBRG4
6            AAGATGTTCCAGGGCTGTGC       TTGTAGAGGTTGATGCCCGC     L. muta               1900 bp upstream – CATH3
7            GTCAAGCGCTTCTGGCCG         CAGCCCGTTCTCGTCCAG       L. muta               CATH2 – CATHB1
8            GGACACTTGGTGACAGCCA        CTCCCAGGATGTCCTCTTGC     L. muta               CATHB1 – 1600 bp downstream
9            GGGAAGGCATGGGGTAGC         CTCCCTGCACAACCTCAACT     L. muta               Upstream – CATHL1
10           GTCAAGCGCTTCTGGCCG         CAGCCCGTTCTCGTCCAG       L. muta               CATHB1 – CATHL1
11           GGACACTTGGTGACAGCCA        CTCCCTGCACAACCTCAACT     L. muta               CATHB1 – CATHL1
12           GTGATGCTGACCTTCGGGC        CATGAGCAGGATGTGGCCC      L. muta               CATH2 – CATHB1
13           CGAGCATATCTGGCTATAAATGTGG  TTGTAGAGGTTGATGCCCGC     L. muta               KLH18 – CATH3

Table 8 – List of samples sent for sequencing at Macrogen

Sample  1          2          3          4          5          6          7
Primer  LM F 2     LM F 2     LL F 2     LM R 2     LM R 2     LL R 2     LM F 5
µL      5/5        5/5        5/5        5/5        5/5        5/5        7/3

Sample  8          9          10         11         12         13         14
Primer  LM F 5     LL F 5     LM R 5     LM R 5     LL R 5     LM F 10    LM F 10
µL      7/3        7/3        7/3        7/3        7/3        5/5        5/5

Sample  15         16         17         18         19         20         21
Primer  LL F 10    LM R 10    LM R 10    LL R 10    LM F 11    LL F 11    LM R 11
µL      5/5        5/5        5/5        5/5        5/5        5/5        5/5

Sample  22         23         24         25         26         27         28
Primer  LL R 11    LM R 8     LL R 8     LM F 9     LL F 9     LM F 12    LM F 12
µL      5/5        5/5 (11)   5/5 (11)   5/5 (11)   5/5 (11)   6/4        6/4

Sample  29         30         31         32         33         34         35
Primer  LL F 12    LM R 12    LM R 12    LL R 12    LM R 4     LM R 4     LL R 4
µL      6/4        6/4        6/4        6/4        6/4 (12)   6/4 (12)   6/4 (12)

Sample  36         37         38         39         40         41         42
Primer  LM F 13    LM F 13    LL F 13    LM R 13    LM R 13    LL R 13    LM F 6
µL      6/4        6/4        6/4        6/4        6/4        6/4        6/4 (13)

Sample  43         44         45         46         47         48
Primer  LM F 6     LL F 6     LM F 11    LM R 11    LM R 8     LM F 9
µL      6/4 (13)   6/4 (13)   5.5/2      5.5/2      5.5/2      5.5/2

The sequencing sample list gives the sample number in the top line. In the second line, LM is Lagopus muta and LL is Lagopus lagopus, F is the forward primer and R is the reverse primer, and the number identifies the primer pair from which the primer comes. The last line gives the µL of PCR product to µL of primer, for a total of 10 µL in each sample.

Table 9 – RT-PCR primers

Gene    Forward primer        Reverse primer         Amplicon length
GAPDH   AAGGCTGTGGGGAAAGTCAT  GCAGGTCAGGTCAACAACAG   99 bp
GAPDH   CCACATGGCATCCAAGGAGT  GAACTGAGCGGTGGTGAAGA   101 bp
HPRT    ACGGGGAAGCAGAAGTACAA  ACCAGAGTTGAAGCCAGTGA   110 bp
CATH3   GTCAAGCGCTTCTGGCCG    GCTCATTTCCTCCTGATGGCT  89 bp
CATH2   GTGATGCTGACCTTCGGGC   CAGCCCGTTCTCGTCCAG     80 bp
CATHB1  CGTGGAGTGCTGGTGATGTT  GAGTGGTGGGATGGCATCAG   86 bp
CATHL1  CTTCTTCTTGATGGCCCGGT  ACAACGCTTCTCCCCACAG    103 bp

Table 10 – cDNA mastermix recipe

10X RT Buffer                      18 µL
25X dNTP mix                       7.2 µL
10X RT random primers              18 µL
MultiScribe reverse transcriptase  9 µL
Water                              37.8 µL
Total volume                       90 µL
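The volumes in Table 10 can be checked with a few lines of arithmetic. The following is a minimal Python sketch and not part of the original protocol. The figure of nine reactions is an assumption (for example eight tissue samples plus one spare), since the number of reactions is not stated here, and the function names are hypothetical.

```python
# Volumes in microlitres, copied from Table 10.
MASTER_MIX_UL = {
    "10X RT Buffer": 18.0,
    "25X dNTP mix": 7.2,
    "10X RT random primers": 18.0,
    "MultiScribe reverse transcriptase": 9.0,
    "Water": 37.8,
}

# ASSUMPTION: the mix was prepared for nine reactions; not stated in the table.
N_REACTIONS = 9


def total_volume(mix):
    """Sum of all component volumes, rounded to 0.1 µL."""
    return round(sum(mix.values()), 1)


def per_reaction(mix, n_reactions):
    """Volume of each component that ends up in a single reaction."""
    return {name: round(vol / n_reactions, 2) for name, vol in mix.items()}


total = total_volume(MASTER_MIX_UL)
single = per_reaction(MASTER_MIX_UL, N_REACTIONS)

print(f"Total master mix: {total} µL")
for name, vol in single.items():
    print(f"  {name}: {vol} µL per reaction")
```

The components add up to the stated total of 90 µL. Under the nine-reaction assumption, each reaction would receive 10 µL of master mix.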

Table 11 – Primers selected with Primer Design-M for region containing Cathelicidin genes in L.muta. Tm is melting temperature and start and stop positions refer to position on 14393 bp region flanked by KLH18 and TBRG4.

Region  Primer  Direction  Sequence                Tm (°C)   Start    Stop
1       1F      forward    CGAGATGTACGACCCGGAGAC     63.87     178     198
1       1R      reverse    GAGAAGTACGGGTTTGGGTGGG    64.61    1791    1770
2       2F      forward    TTTTTGGGTTGTTTCAGCTTC     57.30     739     759
2       2R      reverse    GCCCCACATTTCTTCTGAG       58.00    2291    2273
3       3F      forward    ACTCCCAGCTCGGGATTTTAAC    62.14    1803    1824
3       3R      reverse    GCGTGCCCCAGTCAAACTTC      64.17    3354    3335
4       4F      forward    TTTCCTGTGCGACCTGCCC       64.36    2772    2790
4       4R      reverse    CTCCATCGCATTGCGGTCTGG     65.54    4347    4327
5       5F      forward    GAGTACAAGCCGTGTGGTTTGG    63.45    4133    4154
5       5R      reverse    CCAGAAGCGCTTGACGCGG       65.95    5771    5753
6       6F      forward    AAATGGCTTGTGGGAATGCG      61.13    4192    4211
6       6R      reverse    GTAGAGGTTGATGCCCGCAG      62.89    5822    5803
7       7F      forward    GCTGCGGGCATCAACCTCTAC     65.74    5802    5822
7       7R      reverse    TGGCGACCACTCCTCACAGGG     67.72    7295    7275
8       8F      forward    CCTAATGGCATCACTCCCC       59.25    6456    6474
8       8R      reverse    GATATCATGGGGACACCCTG      59.43    8004    7985
9       9F      forward    ATCGACGCCCTGTGAGGAG       63.30    7268    7286
9       9R      reverse    AAGGTTCCCTCGTCACCCC       62.97    8795    8777
10      10F     forward    CTCAGGCCGTTGGTTGTAG       60.22    8226    8244
10      10R     reverse    TCCTCGTGCACTGGAAATCC      61.46    9855    9836
11      11F     forward    TTTTGTGACCCTCGGGGTG       61.87    8764    8782
11      11R     reverse    GCCTCCATTATATCTCCGTGC     59.97   10290   10270
12      12F     forward    GCCTGCAACCTCCTCCTCTGC     67.22    9665    9685
12      12R     reverse    ATCATGGAGACGCGGTGCCAG     66.87   11298   11278
13      13F     forward    TCGTGCTTAGGAACACCATG      59.60   10681   10700
13      13R     reverse    GAGCTGAACAAACCCAGTG       58.46   12314   12296
14      14F     forward    CTGCAACTACAGTCCTGATGAC    60.58   10911   10932
14      14R     reverse    TATGAAACAGGAGACCGGG       57.71   12538   12520
15      15F     forward    TGAGGGCACTGGGTTTGTTC      62.43   12290   12309
15      15R     reverse    CCATTAACTCAGAGCTGCCAAG    61.24   13822   13801
16      16F     forward    GACACCCGCATACACTCCTG      62.52   12979   12998
16      16R     reverse    TTGGGGAGGCTGGAGGAGG       64.95   14035   14017
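Because the start and stop positions in Table 11 share one coordinate system, the amplicon sizes and the overall coverage of the primer walk can be derived from them directly. The Python sketch below is illustrative only. It assumes that each amplicon runs from the start of the forward primer to the start of the reverse primer (its 5' end on the opposite strand), and the function names are hypothetical.

```python
# (region, forward primer start, reverse primer start), copied from Table 11.
AMPLICONS = [
    (1, 178, 1791), (2, 739, 2291), (3, 1803, 3354), (4, 2772, 4347),
    (5, 4133, 5771), (6, 4192, 5822), (7, 5802, 7295), (8, 6456, 8004),
    (9, 7268, 8795), (10, 8226, 9855), (11, 8764, 10290), (12, 9665, 11298),
    (13, 10681, 12314), (14, 10911, 12538), (15, 12290, 13822),
    (16, 12979, 14035),
]


def amplicon_length(start, end):
    """Length in bp of an amplicon with inclusive 1-based coordinates."""
    return end - start + 1


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Extends (or lies inside) the interval collected so far.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


lengths = {region: amplicon_length(s, e) for region, s, e in AMPLICONS}
covered = merge_intervals([(s, e) for _, s, e in AMPLICONS])

for region, length in lengths.items():
    print(f"Region {region}: {length} bp")
print("Covered intervals:", covered)
```

With the coordinates as listed, the sixteen designed amplicons merge into a single interval from position 178 to 14035, so neighbouring products overlap and the design leaves no uncovered stretch between them.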

FGENESH results for cathelicidin contig

FGENESH 2.6 Prediction of potential genes in Aves genomic DNA
Time : Sat Apr 8 09:18:23 2017
Seq name: Contig1
Length of sequence: 14062
Number of predicted genes 6: in +chain 3, in -chain 3.
Number of predicted exons 19: in +chain 10, in -chain 9.
Positions of predicted genes and exons: Variant 1 from 1, Score:150.592920

G Str   Feature   Start      End   Score         ORF         Len

1  +       TSS     2696            -3.24
1  +     1 CDSf    2815 -   2928   -2.76     2815 -   2928   114
1  +     2 CDSi    3056 -   3139    6.19     3056 -   3139    84
1  +     3 CDSl    3238 -   3324   17.57     3238 -   3324    87
1  +       PolA    3374             1.87

2  +       TSS     4197            -7.94
2  +     1 CDSf    4336 -   4388   -2.62     4336 -   4386    51
2  +     2 CDSi    4558 -   4960   14.57     4559 -   4960   402
2  +     3 CDSi    5061 -   5184    1.34     5061 -   5183   123
2  +     4 CDSi    5274 -   5356   10.64     5276 -   5356    81
2  +     5 CDSl    5441 -   5563    9.29     5441 -   5563   123
2  +       PolA    5597             1.87

3  +       TSS     5785            -1.44
3  +     1 CDSf    5828 -   5906    2.90     5828 -   5905    78
3  +     2 CDSl    6920 -   7047    8.36     6922 -   7047   126
3  +       PolA    7530             1.87

4  -       PolA    8220             1.87
4  -     1 CDSl    8266 -   8346   -6.97     8266 -   8346    81
4  -     2 CDSi    8587 -   8670   13.02     8587 -   8670    84
4  -     3 CDSi    8753 -   8860   17.36     8753 -   8860   108
4  -     4 CDSf    9297 -   9416   26.76     9297 -   9416   120
4  -       TSS     9427            -2.84

5  -       PolA   10224             1.87
5  -     1 CDSl   10310 -  10399   -6.31    10310 -  10399    90
5  -     2 CDSi   10479 -  10565   12.12    10479 -  10565    87
5  -     3 CDSi   10633 -  10743   10.56    10633 -  10743   111
5  -     4 CDSf   10828 -  11004   32.96    10828 -  11004   177
5  -       TSS    12034            -3.34

6  -       PolA   12947            -4.73
6  -     1 CDSo   13783 -  14031   24.87    13783 -  14031   249

Predicted protein(s):
>FGENESH: 1 3 exon (s) 2815 - 3324 94 aa, chain +
MGNSKAVANGATLSVGGHYGRLSSLRKLKIILRGTRCQIIKDCTAPVVLQSGRAAFD
VTCVDSMADAVRVKRYWPLVIRTVVAGYNLYRAIKKK
>FGENESH: 2 5 exon (s) 4336 - 5563 261 aa, chain +
MVEMGGFVDPLRGWAAPRAVTLGLDVSAAPGLDGSIPPGLDGSIPPGLDGSIPPGLD
GSIPPGLDGSIPPGLDGSIPPGLGGSTPTGLDGSITPKLDGPITPKLDGSISPSWPWRWP
TTYVDAILAAVRLLNQKISGPCTLRLRAAQPQPGWAGTLEASAGRVLPSRYARRPPG
LPGGGCGVRTPTGEDPSHGGHGSAAWAPCPRNSNPTPEGRCRPLRVQPIRNWWTRIR
EWWDGIRKRLRQRGPFYVRGRLNITSTPRP
>FGENESH: 3 2 exon (s) 5828 - 7047 68 aa, chain +
MASCWVLVLALLGGACAFPAPTELPSGVDLNTLRALNFTIMETECVPRAQTPIDDCD
FKENGVRTGGL
>FGENESH: 4 4 exon (s) 8266 - 9416 130 aa, chain -
MASCWVLVLALLGGACALPAPLGYPQALAQAVDSYNQRPENVQLSSLHNLNFTIME
TRCQARSGAQLDSCEFKEDGLVKDCAAPVVLQGGRAAFDNTCVDSMADRFWPLVP
VAINTVAAGINLYKAIRRK
>FGENESH: 5 4 exon (s) 10310 - 11004 154 aa, chain -
MAVLLLAVLLLAVLLPTAPSSTAPLPPTPRELARTVLEAHGRDAGSGLRLLKLQGVT
RTKFDWGTHFTINLTAREISCPAGPTAPRGAACRARPGQQIQHCVAQISVFAFLPDVP
LSLLECSRQTPSSSGQPRSRSRHSPAAPRAIGLREPAPS
>FGENESH: 6 1 exon (s) 13783 - 14031 82 aa, chain -
MAEVYSSVADQWYLIVPMNTRRSRVSLVANCGRLYAVGGYDGQSNLSSVEMYDPE
TNRWTFMAPMVCPEGGVGVGCIPLLTI
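The FGENESH exon table and the predicted proteins can be cross-checked against each other. The CDS segments of a gene should add up to a multiple of three nucleotides and, if the final codon is a stop codon, to three times the protein length plus three. The Python sketch below performs that check. The coordinates and protein lengths are copied from the output above, while the stop-codon convention is an assumption about how FGENESH reports the end of the last CDS segment, and the function names are hypothetical.

```python
# CDS segments (Start, End) and protein length (aa) per predicted gene,
# copied from the FGENESH output above.
PREDICTED_GENES = {
    1: {"cds": [(2815, 2928), (3056, 3139), (3238, 3324)], "aa": 94},
    2: {"cds": [(4336, 4388), (4558, 4960), (5061, 5184),
                (5274, 5356), (5441, 5563)], "aa": 261},
    3: {"cds": [(5828, 5906), (6920, 7047)], "aa": 68},
    4: {"cds": [(8266, 8346), (8587, 8670), (8753, 8860),
                (9297, 9416)], "aa": 130},
    5: {"cds": [(10310, 10399), (10479, 10565), (10633, 10743),
                (10828, 11004)], "aa": 154},
    6: {"cds": [(13783, 14031)], "aa": 82},
}


def coding_length(cds_segments):
    """Total coding nucleotides over inclusive (start, end) segments."""
    return sum(end - start + 1 for start, end in cds_segments)


def is_consistent(gene):
    """True if the CDS length equals 3 * (protein length + 1).

    ASSUMPTION: the last CDS segment includes the stop codon, which is
    not counted in the reported protein length.
    """
    nt = coding_length(gene["cds"])
    return nt % 3 == 0 and nt // 3 == gene["aa"] + 1


for number, gene in PREDICTED_GENES.items():
    nt = coding_length(gene["cds"])
    print(f"Gene {number}: {nt} nt coding, {gene['aa']} aa, "
          f"consistent={is_consistent(gene)}")
```

A gene that fails this check would point to a transcription error in the table or to an incomplete gene model, so the check is a cheap sanity test before the exon boundaries are compared with transcriptome data.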

Antimicrobial mature peptide of CATH genes in chicken used to align against contig
>CATHL1_GALGA
RVKRVWPLVIRTVIAGYNLYRAIKKK
>CATHL2_GALGA
LVQRGRFGRFLRKIRRFRPKVTITIQGSARFG
>CATHL3_GALGA
RVKRFWPLVPVAINTVAAGINLYKAIRRK
>CATHB1_GALGA
PIRNWWIRIWEWLNGIRKRLRQRSPFYVRGHLNVTSTPQP
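A simple way to see how a chicken mature peptide compares with a predicted ptarmigan protein is to slide the peptide along the protein without gaps and keep the position with the fewest mismatches. The Python sketch below does this for the chicken CATHL1 peptide and FGENESH predicted protein 1. It is an illustration only and not the alignment method used in the thesis. Pairing protein 1 with CATHL1 is an assumption based on their sequence similarity, and the function names are hypothetical.

```python
# Chicken mature peptide, copied from the list above.
CATHL1_GALGA = "RVKRVWPLVIRTVIAGYNLYRAIKKK"

# FGENESH predicted protein 1 (94 aa), copied from the output above.
PTARMIGAN_PROTEIN_1 = (
    "MGNSKAVANGATLSVGGHYGRLSSLRKLKIILRGTRCQIIKDCTAPVVLQSGRAAFD"
    "VTCVDSMADAVRVKRYWPLVIRTVVAGYNLYRAIKKK"
)


def best_ungapped_match(peptide, protein):
    """Return (offset, mismatches) of the best gap-free placement.

    The offset is 0-based; ties go to the first (leftmost) placement.
    """
    best = None
    for offset in range(len(protein) - len(peptide) + 1):
        window = protein[offset:offset + len(peptide)]
        mismatches = sum(a != b for a, b in zip(peptide, window))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


def substitutions(peptide, protein, offset):
    """List (peptide position, peptide residue, protein residue), 1-based."""
    window = protein[offset:offset + len(peptide)]
    return [(i + 1, a, b)
            for i, (a, b) in enumerate(zip(peptide, window)) if a != b]


offset, mismatches = best_ungapped_match(CATHL1_GALGA, PTARMIGAN_PROTEIN_1)
subs = substitutions(CATHL1_GALGA, PTARMIGAN_PROTEIN_1, offset)

print(f"Best offset: {offset}, mismatches: {mismatches}")
for position, chicken, ptarmigan in subs:
    print(f"  position {position}: {chicken} -> {ptarmigan}")
```

For predicted protein 1 the best placement is at the C-terminus and shows two differences from the chicken CATHL1 peptide, V to Y at peptide position 5 and I to V at position 14. These come from the automated FGENESH translation and need not match the curated sequences shown in Figure 14.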
