AND    ! 

"#$ % & '()*  +%

+  ,-.,-/,0 +  121,..0-10- ! 3 4 33!!3 ,,,1/                                                                 

                                                                                                                                                                                                                                                                                              

                           !"#  $%   #  $# &'()$   $*+,'-./ $

  

         "Por la ciencia, como por el arte, se va al mismo sitio: a la verdad" Gregorio Marañón Madrid, 19-05-1887 - Madrid, 27-03-1960

List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Martínez Barrio, Á., Soeria-Atmadja, D., Nister, A., Gustafsson, M.G., Hammerling, U., Bongcam-Rudloff, E. (2007) EVALLER: a web server for in silico assessment of potential allergenicity. Nucleic Acids Research, 35(Web Server issue):W694-700. II Martínez Barrio, Á.∗, Lagercrantz, E.∗, Sperber, G.O., Blomberg, J., Bongcam-Rudloff, E. (2009) Annotation and visualization of endogenous retroviral sequences using the Distributed Annotation System (DAS) and eBioX. BMC , 10(Suppl 6):S18. III Martínez Barrio, Á., Xu, F., Lagercrantz, E., Bongcam-Rudloff, E. (2009) GeneFinder: In silico positional cloning of trait . Manuscript. IV Martínez Barrio, Á., Ekerljung, M., Jern, P., Benachenhou, F., Sperber, G.O., Bongcam-Rudloff, E., Blomberg, J., Andersson, G. (2009) Data mining of the dog genome reveals novel Canine Endogenous Retro- viruses (CfERVs). Manuscript.

∗ Authors contributed equally.

I is an open access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by- nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

II is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits un- restricted use, distribution, and reproduction in any medium, provided the original work is properly cited. List of Additional Papers

I Martínez Barrio, Á.∗, Eriksson, O.∗, Badhai, J., Fröjmark, A.-S., Bongcam-Rudloff, E., Dahl, N., Schuster, J. (2009) Resequencing and Analysis of the Diamond-Blackfan Anemia Disease Locus RPS19. PLoS ONE 4(7): e6172.

∗ Authors contributed equally. Contents

1 Introduction ...... 11 2 Background ...... 13 2.1 The immune system ...... 16 2.2 The central dogma of molecular biology ...... 20 2.3 Phenotypic traits and association studies ...... 27 2.4 Retroviruses and their endogenization ...... 31 2.5 Bioinformatics ...... 40 2.6 Statistics and Biology in a complex interplay scenario ...... 41 2.7 The -omics age...... 43 3 Aims of this thesis ...... 45 4 Methods ...... 47 4.1 Sequence alignments ...... 47 4.2 Classification, hypothesis testing and evaluation ...... 53 4.3 Phylogenetic analysis ...... 58 4.4 Databases ...... 60 4.5 Information sources, annotations and ontologies ...... 62 4.6 Web services in bioinformatics ...... 66 4.7 RetroTectorc ...... 69 5 Present investigations ...... 71 5.1 Paper I: EVALLER - a web server for in silico assessment of potential protein allergenicity...... 71 5.1.1 Future prospects ...... 73 5.2 Paper II: Annotation and visualization of endogenous retroviral sequences using the Distributed Annotation System (DAS) and eBioX ...... 73 5.2.1 Future prospects ...... 74 5.3 Paper III: GeneFinder - in silico positional cloning of trait genes 75 5.3.1 Future prospects ...... 76 5.4 Paper IV: Data mining of the dog genome reveals novel Canine Endogenous Retroviruses (CfERVs) ...... 77 5.4.1 Future prospects ...... 79 6 Sammanfattning på Svenska ...... 81 6.1 Backgrund ...... 81 6.2 Sammanfattning av studierna ...... 81 7 Acknowledgements ...... 83 References ...... 87

Abbreviations

API Application Programming Interface DAS Distributed Annotation System FLAP Filtered Length-adjusted Allergen Peptides DNA Deoxyribonucleic acid ERV Endogenous RetroVirus FN False Negative FP False Positive FPR False Positive Rate GO Ontology HGNC HUGO Gene Nomenclature Committee LINE Long INterspersed Element LTR Long Terminal Repeat mRNA messenger RNA OMIM Online Mendelian Inheritance in Man Pol Polymerase protein REST Representational State Transfer RNA Ribonucleic acid RT Reverse Transcriptase SINE Short INterspersed Element SOA Service Oriented Architectures SOAP Simple Object Access Protocol SQL Structured Query Language TFR Transposon Free Region TFBS Transcription Factor Binding Site WSDL Web Services Description Language XML Extensible Markup Language XSD XML Schema Definition

9

1. Introduction

The lapse of time between the almost manually assembled C. elegans [42] and the first human genome draft [118] became the provenance of what today is a common nexus among the Life Sciences fields, a newborn Science called Bioinformatics. Bioinformatics is a highly inter-disciplinary field which is evolving at a exponential pace due to the recent advent of a plethora of new technologies bringing an equally exponential quantity of data. Unfortunately, although the majority of this research data is publicly available, its analysis becomes un- feasible because it may be placed in legacy systems or it may belong to small groups without the skills to publicly offer it. At the same time, the sole anal- ysis of experimental data has turned out not to be significant anymore unless it is collated with the available knowledge sources interconnected across this abstraction we call Internet. These different data sources are enriching our analyses until a turning point where research can be based on expensive pub- lic data that it is generated at big biomedical institutions. However, integration of tangled knowledge sources is not an easy task and can be quite cumbersome due to the large number of resources to be dedicated for the efficient accom- plishment of the information technology (IT) management. Recently, several Networks of Excellence led by a combination of tech- nology and biology experts were launched to contribute and develop a pan- European distributed infrastructure for Bioinformatics. Their success has been measured by the establishment of many public and openly released projects that have matured in order to provide means of data integration, annotation, distribution as well as analysis by enacting remotely synchronized and high- performance services. By collaborating in one of these European consortiums called EMBRACE [4], I had the opportunity to participate in several research projects and learn some of the latest information technologies to help accelerate distributed Bioinformatics. The creation of new tools and data sources was driven by an expanding set of test problems and the goal was that they would be easily distributed and integrable as well as possess a biomedical significance. Some of them are explained in this thesis. The following chapters of this thesis introduce the reader to the common and varied Bioinformatics terminology by describing the background of the study as well as the methods utilized. In every paper, a different information

11 architecture is devised to achieve the presented results. Specifically, the first paper is a machine learning approach to allergenicity assessment implemented in a web server which interacts with a specialized trained vector classifier to recognize linear allergen descriptors in . The second paper creates a data resource for human endogenous retroviruses. Thereafter, these proviruses are visualized and correlated with previously reported human genome regions deemed to be transposons-free. The third paper presents GeneFinder, a dedi- cated pipeline application which queries a mixture of biological data providers and ultimately ranks the collated information to anlyze regions of association to genetic diseases. We made the application available through a web service and tested it with the ridgeless phenotype in dogs mapped by a genome-wide association study [232]. Finally, a fourth paper, presents a prototype pipeline used to characterize previously unknown groups of endogenous retroviruses in dogs and their integrational landscape.

12 2. Background

Life in all organisms on Earth is characterised by the capability of respond- ing to stimuli, autonomous reproduction, growth, the ability to show some form of development, and maintenance of homeostasis in order to regulate its internal environment while maintaining a stable, constant condition. The term organism describes any living entity composed by a single (uni- cellular) or multiple cells (multicellular) grouped into specialized tissues and organs. The development of sexual reproduction in unicellular organisms is thought to have rised the development of multicellular life during the late- Precambrian evolutionary explosion [249] with the earliest Metazoan from Ediacaran deposits in Australia [152]. All organisms are classified by the science of taxonomy by means of a taxonomical classification, that based on the Linnaean hierarchical system, seeks to describe the diversity of organisms by naming and grouping them on the basis of similarities. On the other side of the tree of life, viruses are not considered to be related to any living organism because they are acellular and do not possess their own metabolism. Therefore, viruses require a host’s cell to parasitically attach to, then penetrate into the cytoplasm, uncoat from their viral capsid, release their genetic information (DNA or RNA) and produce copies of it. After assem- bling new viral particles, viruses are released from the host cell by acquiring a modified lipid envelope from the host’s plasma membrane to propagate. Viruses infect all cellular life, from vertebrates [44], invertebrates [46], pro- tozoans [286], plants [109] or even bacteria [164] and other viruses [158, 256]. Satellite viruses can only replicate when a cell has been already infected by another virus [158] whereas other viruses need the structures created by a partner virus to propagate [300]. But in general, every cell type have their own specific viruses that often infect only a specific species [66]. The mechanisms viruses use to cause disease largely depends on the viral species. At the cellular level, cell lysis is the breaking, opening and subsequent death of the cell. In multicellular organisms, when a sufficient number of cells die, the whole organism will start to suffer the effects. Although viruses may disrupt healthy homeostasis, most viruses co-exist harmlessly in their host and cause no signs or symptoms of disease [66]. Viral transmission events can be vertical when parents infect their progeny or horizontal between members of a population or across species, this be- ing the most common mechanism of propagation. Horizontal transmission

13 can occur by sharing infected cells (e.g. blood transfusions), liquids (e.g. saliva, water), food, viruses in aerosols suspension or by insects (also known as vectors) carrying infectious stages in the host’s blood to new hosts after biting or sucking. An epidemia occurs when an outbreak of unusually high proportion of cases in a population, community or region. If outbreaks are spread worldwide they are called pandemics [237]. The 1918 Spanish flu pan- demic, was caused by an unusually severe and deadly influenza A (H1N1) virus. Conservative estimates say it killed 40-50 million people [215]. The pandemia lasted until June 1920. The source was possibly a newly emerged virus from a swine or avian host of a mutated H1N1 virus. Many people died within the first few days after infection, and others died of complications later. Nearly half of those who died were young, healthy adults. Influenza A (H1N1) viruses still circulate today after being introduced again into the human pop- ulation in the 1970s. As a rough comparison, one of the most prominent chal- lenges and scourges of our century, HIV, has infected 38.6 million people worldwide [237]. The Joint United Nations Programme on HIV/AIDS (UN- AIDS) and the World Health Organization (WHO) estimate that AIDS has killed more than 25 million people since it was first recognised on June 5, 1981 [186]. Epidemiological control measures based on knowledge of strain virus trans- mission and incubation period are used to break the chain of infection in pop- ulations during outbreaks of viral diseases [237]. It is of vital importance to identify the virus and to find the source, or sources, of the outbreak. Con- trol measures include not only vaccines or antiviral drugs but also sanita- tion, disinfection, quarantine isolation after exposure, and sometimes slaugh- tering. Thousands of cattle were slaughtered to control the outbreak of foot and mouth disease in 2001 in Great Britain, and in the summer of 1989 in the south-western region of Spain and Portugal, horse livestock suffered one of the most virulents outbreaks of African horse sickness. Incubation periods range from a few days to weeks while the infection causes no signs or symp- toms. In the end of this period, an infected individual is contagious and can infect other persons or animals, meaning the population size in a certain area is an important factor that contributes to the speed of the spreading. Scenarios with chronic infections where the individual conceals the characteristic symp- toms retain all the potential of evolutionary contagion. Viruses are known to be important pathogens for livestock [96] or companion animals such as cats, dogs, and horses, that if infected, are susceptible to propagate serious viral infections. A life-long infection remains latent into the carriers that will serve as a reservoir of infectious particles. The origin of viruses remains unclear. Specific molecular techniques have been created to investigate how they arose [51, 172]. However, viruses do not form fossils and the lack of ancient viral genetic material makes it difficult to study their evolutionary path. There are three explanatory hypotheses as to the origins of viruses [66, 237, 266]: the regressive/degeneracy hypothesis,

14 Figure 2.1: Transmission electron microscope image of a recreated 1918 influenza virus. Dr. Terrence Tumpey, member of the National Center for Infectious Diseases (NCID), recreated the 1918 influenza virus from the supernatant of a 1918-infected Madin-Darby Canine Kidney (MDCK) cell culture 18 hours after infection in order to identify the characteristics that made this organism such a deadly pathogen. Research efforts such as this, enables researchers to develop new vaccines and treatments for future pandemic influenza viruses. Courtesy of Dr. Terrence Tumpey/Cynthia Gold- smith from the Centers for Disease Control and Prevention. Public Health Image Li- brary (PHIL), with identification number #8243.

the cellular origin/vagrancy hypothesis, and the coevolution hypothesis. Com- putational analysis of viral and host sequences increases our understanding of the evolutionary relationships between different viruses and may help to iden- tify the ancestors of modern viruses. However, such analyses have not helped to ascertain which of these hypotheses are correct but it seems unlikely that all currently known viruses have evolved from a common ancestor. Viruses may have arisen in numerous times during different bursts of replication by either one or more mechanisms [66]. In 1962, André Lwoff, Robert Horne, and Paul Tournier devised a hierarchi- cal system [181] to group viruses based on phylum, class, order, family, genus, and species according to their shared properties (not those of their hosts) and the type of nucleic acid in their genomes [180]. Later the International Com- mittee on Taxonomy of Viruses (ICTV) was formed. This committee devel- oped the current classification system and foresee important viral properties to maintain family uniformity. The Nobel Prize-winning biologist David Bal- timore devised the Baltimore classification system [23]. Up to date, the ICTV

15 classification system is used in conjunction with the Baltimore classification system in modern virus classification [62, 187, 274].

2.1 The immune system The host and pathogen (e.g. viruses, bacteria and parasites) relationship is a type of molecular ’arms race’ leading to constant for adap- tation and counter adaptation. This phenomenon supports the "Red Queen" hypothesis [173, 275] where the probability of extinction within families of related organisms would be at random due to the counteraction of selective advantages among species sharing the same environmental system. Organisms have developed a repertoire of identification and killing mech- anisms against contingencies such as viruses, bacteria and other pathogens. The immune system has been divided into an innate immune system, and ac- quired or adaptive immune system characteristic in vertebrates, the latter of which is further divided into humoral and cellular components. The composi- tion of the immune system is actually cellular in nature but it is not associated with any specific organ because they are embedded or circulate in various tis- sues throughout the body. The humoral (antibody) response is produced by the interaction between antibodies and antigens. Antibodies (also known as immunoglobulins, abbreviated Ig) are gamma globulin proteins released from a certain class of immune cells (B lymphocytes). Antigens (Antibody Gener- ators) are defined as anything that elicits generation of antibodies. The unique part of the antigen recognized by an antibody is called an epitope. The di- versity of antibodies is generated by random combinations of a set of gene segments that encode different antigen binding sites (or paratopes) [171], and can recognize an equally wide diversity of antigens. V(D)J recombination, also known as somatic recombination, is the term for this mechanism of ran- dom selection and assembly [197]. Human antibody molecules (and B cell receptors) comprise heavy and light chains with both constant (C) and vari- able (V) regions that are encoded by three types of genes. 1. Heavy chain gene - located on chromosome 14 2. Light chain kappa (κ) gene - located on chromosome 2 3. Light chain lambda (λ) gene - located on chromosome 22 Multiple genes for the variable regions are encoded in the human genome that contain three distinct types of segments. For example, the immunoglob- ulin heavy chain region contains 65 Variable (V) genes plus 27 Diversity (D) genes and 6 Joining (J) genes [166]. The light chains also possess numerous V and J genes, but do not have D genes. By the mechanism of DNA rearrange- ment of these regional genes it is possible to generate an enormous antibody repertoire: 65 V genes x 27 D genes x 6 J genes = 10530 combinations, with additional variability through light chains. A typical human B cell will have

16 50,000 to 100,000 antibodies bound to its surface [296] but it will react specif- ically to one determined antigen. Susumu Tonegawa, a Japanese scientist, was awarded the Nobel Prize for Physiology or Medicine in 1987 for his discovery of the V(D)J recombination, the genetic principle for generation of antibody diversity. Antibodies appear in two different forms: soluble and membrane- bound [212]. Surface immunoglobulins (sIg) are attached to the membrane of the B cells by their transmembrane region, and form part of the B cell receptor (BCR), which triggers B cell activation. Soluble antibodies are the secreted form of Ig and lack the transmembrane region, enabling them to be released into the bloodstream and body cavities. Antibodies can come in different varieties known as isotypes or classes. These classes are defined by the type of heavy chain present. There are five types of mammalian Ig heavy chain denoted by the Greek letters: α, δ, ε, γ, and μ [197] and these define the five antibody isotypes known as IgA, IgD, IgE, IgG and IgM, respectively [227]. Antibody genes can re-organize in a process called class switching that changes the base of the heavy chain to another, creating a different isotype of the antibody that retains the antigen- specific variable region. Random in the antibody genes (i.e. somatic hypermutations) help to create further diversity [65, 185]. Allelic exclusion (see section 2.3) is another mechanism used by the B cell to ensure that each B cell only express antibodies with one antigen specificity. Antibodies biological properties, functional locations and ability to deal with different antigens, are depicted in the table 2.1. Pathogen detection is complicated as these can evolve rapidly and need to be distinguished from the organism’s own healthy cells and tissues. Tolerance (i.e. immunological non-reactivity) is developed to an antigen resulting from a previous exposure to the same antigen. While the most important form of tolerance is non-reactivity to self antigens, it is possible to induce tolerance to non-self antigens. When an antigen induces tolerance, it is termed tolerogen. In extreme cases, the immune system can be the cause of many immunolog- ical disorders under malfunctioning conditions. Diseases caused by immune disorders fall into two categories: immunodeficiency (parts of the system fail to respond. e.g. chronic granulomatous disease and AIDS) and autoimmu- nity (the immune system respond attacking its own host. e.g. Hashimoto’s thyroiditis, rheumatoid arthritis, diabetes mellitus type 1 and systemic lupus erythematosus). Other immune system disorders include various hypersensitivities, in which the system responds inappropriately to harmless compounds. Allergy is an unwanted immunological response to foreign substances which is commonly referred to a type I hypersensitivity reaction mediated by the immunoglobulin E (IgE) antibody [122]. IgE antibody reaction is produced after contact with foreign substances such as pollen, certain foods, house dust-mites and animal saliva. Allergen, herein and within a biomolecular allergologic context, refers

17 Table 2.1: Antibody isotypes of mammals (Woof and Burton, 2004) [298]

Name Types Description IgA 2 Found in mucosal areas, such as the gut, respiratory tract and urogenital tract, and prevents colonization by pathogens. Also found in saliva, tears, and breast milk. IgD 1 Functions mainly as an antigen receptor on B cells that have not been exposed to antigens. It has been shown to activate basophils and mast cells to produce antimicrobial factors. IgE 1 Binds to allergens and triggers histamine release from mast cells and basophils, and is involved in allergy. Also protects against parasitic worms. IgG 4 In its four forms, provides the majority of antibody-based immunity against invading pathogens. The only antibody capable of crossing the placenta to give passive immunity to fetus. IgM 1 Expressed on the surface of B cells and in a secreted form with very high avidity. Eliminates pathogens in the early stages of B cell mediated (humoral) immunity before there is sufficient IgG.

to the actual molecules (proteins) binding to the IgE antibody. Although rela- tively few proteins are allergens, most allergens are proteins.

Known allergens and IgE cross-reactivity Allergy is a major health problem in many affluent countries and seems to be steadily increasing during the recent decades [101, 114, 282]. Patients’ immune systems may respond inappropriately to apparently harmless com- pounds showing a variety of symptoms (asthma, allergic rhinitis, rhinocon- junctivitis, atopic eczema, angioedema, abdominal pain) [144] or respond with severe reactions (anaphylactic shock) leading to respiratory and circu- latory failures, and can pose life-threatening conditions [98]. The estimated prevalence of food allergy among populations within the EU ranges from 2.5% to 3.2% [236] and as much as 8% in the pediatric population [40]. On the other hand, pollen allergy is estimated to have raised by up to 40% [58]. Several hundreds of allergen proteins, described in literature, have been characterized and assembled in searchable on-line databases. Most of them belong to relatively few protein families. Among them, lipocalins [80, 279], serum albumins [97], proteases [264] and tropomyosins [15] are known

18 indoor allergenic sources (e.g. pets, house dust mites and cockroaches). Tropomyosins are also allergens of several shellfish and therefore, cross-reactivity between these protein sources may happen [223, 297]. The main pollen allergen families are expansins, calcium-binding proteins, profilins and Bet v 1 homologues [225]. Some of these families, which are common among plant food sources, are the main basis for pollen-fruit cross-reactivity. Additional plant food allergen families include the relatives to the prolamin superfamily (2S seed storage albumins, cereal inhibitors of trypsin and α-amylase and non-specific lipid transfer proteins) [126] and the cupin superfamily (7S and 11S seed storage globulins) [37]. Incomplete allergens have the ability to release inflammatory mediatiors through cross-reactive IgE binding [14, 277, 290]. The IgE cross-reactivity of a suspicious allergen protein can be assessed by a variety of immunochemical, biochemical and immunological methods [218]. Identification of specific IgE- antibodies against two separate allergen preparations in many individuals may indicate that they share similar proteins, and thereby may cross-react. In vitro tests are those based on activation of sensitized basophils to assess their abil- ity to release of histamine [270]. Conversely, assessing de novo sensitization is a much more difficult task and includes analysis of in vitro pepsin degrada- tion [21, 263] since several food allergens are resistant to digestive proteases. Animal models for allergy are also increasingly being reported [150].

GMOs and allergenicity risk assessment The advent of genetic engineering and its adoption by the agricultural indus- try has resulted in the development of genetically modified organisms (GMOs) with genetic changes introduced in their genome. Among them, food industry modified crops include such genetic features as bacterial resistance or better energy absorption. Soybean, corn, canola, and cotton seed oil were put into the market in the early 1990s. Animal products have also been engineered. Researchers have developed genetically modified (GM) pigs that are able to absorb plant phosphorus more efficiently [11] than ordinary un-modified pigs to produce omega-3 fatty acids through the expression of a roundworm gene [159]. Objections to these modified pigs have been generally controver- sial due to misleading objectives. It is unclear if their aim is to benefit the environment or simply to reduce costs in the production chain [159]. Like- wise, it is often unclear if these make beneficial contributions to human health or are solely used to obtain derived economical benefits to the owners of the patents [85, 136]. Development of GMOs are subjected to intellectual prop- erty rights which adds criticism to the perceived safety issues and ecological concerns. Currently, there are a small number of food species with a genetically modi- fied version, especially plants, cultivated in vast amounts of land. The accumu- lated surface area of land cultivated with GMOs for the period 1996 to 2008

19 exceeded 800 million hectares. It took 10 years to reach the first 400 million hectares but only three years to reach the second 400 million. In recent years, developing countries are rapidly adopting GMO cultivars. Of the 25 countries planting biotech crops in 2008, 15 were developing and 10 industrial. A total of 90% of the total of 13.3 million beneficiary biotech farmers in 2008, (up from 12 million in 2007), were small and resource-poor farmers from devel- oping countries; only one million were large farmers from industrial countries such as the USA and Canada and developing countries such as Argentina and Brazil [120]. In 1996, Pioneer Hi-Breed [12], the second largest U.S. producer of hy- brid seeds for agriculture after Monsanto, introduced the 2S albumin from the Brazil nut (methionine and rich) into their soybean to improve the nutritional quality of a variety of soybeans [Glycine max (L.) Merr.] which in a natural state are compromised as a foodstuff by a relative deficiency of me- thionine in the seeds. A study tested the allergenicity (by radioallergosorbent testing, immunoblotting, and skin-prick testing) of the transgenic soybean and showed how a pool of individuals allergic to Brazil nuts became also allergic to the new GM soybean and that the transgenic soybean increased their sen- sitization to Brazil nuts [203]. The protein was a major Brazil nut allergen, now known as Bere1, which resulted in the termination of the GM soybean development [192]. The potential to transfer proteins causative of type-I hy- persensitivity reactions into GM foods, among others, increased the legislative interest in protecting the food chain. Consequently, evaluation and labelling compliancy to specific regulations [60, 72, 79] is required to receive approval to market GM food in the European Union (EU). Several of the test assays mentioned above were compiled into multi-procedural assessment schemes. The 2001 FAO/WHO report on allergenicity assessment of foods derived from biotechnology [81] proposed a revised decision tree which recommended increase the rigor by in silico methodology called sequence homology in Figure 2.2. Some background to this procedure is given in section 5.1.

2.2 The central dogma of molecular biology After 10 years of experiments with inheritance in pea plants in the 1860s, Gregor Mendel, the considered "father of modern genetics", hypothesized a factor of inheritance and the rules that conveyed traits from parent to off- spring. Hugo de Vries in 1889, probably unaware of Mendel’s work, coined the term pangen for "the smallest particle representing one hereditary charac- teristic" [283]. Wilhelm Johannsen, a Danish botanist, plant physiologist and geneticist, abbreviated this term to gene ("gen" in Danish and German) two decades later [132].

20 Source of gene (allergic)

Sequence homology Sequence homology

Specific serum screen Targeted serum screen

Pepsin resistance Likely to be and animal models allergenic

+/++//+ +/-+/-- -/--/- - High Low Probability of allergenicity

Figure 2.2: Revised decision tree assessment for allergenicity of GM crops set forth by a FAO/WHO consultancy in 2001. Based on the tree depicted in Metcalfe [192]

The genetic information in most celled organisms is stored in DNA. The central dogma of molecular biology, as stated by Francis Crick in two dif- ferent publications [52, 53], is the directional flow of detailed, residue-by- residue, sequence information from one polymer molecule to another. The process would start by transcribing deoxyribonucleic acid (DNA) into ribonu- cleic acid (RNA) and thereafter translate that product into proteins. The DNA coding block is composed of a base, a sugar deoxiribose and three phosphate groups. The nucleotide bases, adenine (A), cytosine (C), guanine (G) and thymine (T), are attached to each sugar in the backbone and linked to a complementary nucleotide by hydrogen bonds shaping the antiparallel strands of the characteristic DNA double-helix (see Figure 2.3). In prokaryotes, the majority of genes are located to a single chromosome of circular DNA, while DNA helices in eukaryotes are densely packed into a DNA-protein complex called chromatin into several chromosomes. Many species carry more than one copy of their chromosomes within each of their somatic cells. For in- stance, humans with two copies are called diploid. Therefore, each somatic cell contains two copies of a gene, one inherited from each parent. Specific re- production cells, called gametes, contain only one copy after a process called

21 meiosis. Meiosis facilitates stable sexual reproduction as well as increasing genetic diversity by producing novel combinations.

Figure 2.3: The structure of B-DNA in space filling format (left) and with the DNA helix represented as tubes (right). Content released at the Open University under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Licence.

The total non-homologous set of chromosomes in an organism or cell is usually referred as its genome. The genome in eukaryotes is protected in- side the cell nucleus which in turn is separated by its nuclear membrane from the surrounding cytoplasm and other parts of the cell. Cells also present a membrane arranged in the form of a phospholipid bilayer which controls the transfer of material in and out the cell whilst sensing the surroundings. Inside cells’ nuclei, the "coding" portion of a gene’s DNA (exons) determine what the gene does, and the "non-coding" sequence determines when it is active (or expressed). Upon expression, the coding and non-coding sequences are copied by the RNA polymerase II (pol II) in a process called transcription, producing a primary RNA (pre-mRNA) copy of the gene’s information. In prokaryotes, transcription occurs in the cytoplasm; in eukaryotes, transcrip- tion necessarily occurs in the nucleus.This primary transcript must undergo post-transcriptional modifications before being exported to the cytoplasm for . In a modification unique to eukaryotes, exons are spliced out of introns by the spliceosome.Acap molecule and a poly-A tail are added to the 5’ and 3’ ends respectively to protect the messenger RNA (mRNA) from degradation while it is transported into the cell’s cytoplasm. The amount of mRNA synthesized by a gene (or a pool of them) is referred to as its expres-

22 sion level or transcriptome and it can be measured in laboratory experiments such as microarrays or transcriptome analysis. The region of the chromosome at which a particular gene is located is called its locus. Genes that appear together at one specific locus may appear on dif- ferent chromosomes in another species. But if the order among genes is con- served, these different chromosomal parts are said to be syntenic. This may or not be caused by genetic linkage. The concept of genetic linkage is introduced in section 2.3. An mRNA can direct the synthesis of proteins via its in a pro- cess called translation. The extant -and poorly understood- alternative splicing mechanisms result in different mRNAs being generated from the same gene having different RNA sequences and thus coding for different proteins. Trans- lation into proteins is carried out by ribosomes which are large complexes of RNA and protein responsible of adding new amino acids to a growing polypeptide chain joined by peptide bonds. The genetic code of the mRNA is read three at a time, in units called codons, via interactions with specialized RNA molecules called transfer RNA (tRNA). Each tRNA has a characteristic anticodon: three unpaired bases complementary to the codon it reads. There are 64 possible codons (four possible nucleotides at each of three positions, hence 43 possible codons) and only 20 standard amino acids; hence the code is redundant and multiple codons can specify the same . There is a correspondence between codons and amino acids which is nearly universal among all known living organisms. During the translation process, when the tRNA is base-paired to its complementary codon in the mRNA strand, the ribosome ligates the amino acid cargo that the tRNA cova- lently carries in its 3’ terminal site to the growing polypeptide chain, which is then synthesized from amino terminus (N- or NH- terminal) to carboxyl terminus (C- or CO- terminal). During and after polypetide synthesis, the new protein folds to its active three-dimensional structure before it can carry out its cellular function. In the transcription regulation orchestration, the major effectors are a group of proteins known as transcription factors (TFs). TFs regulate gene expression levels in the control of transcription of DNA into mRNA. TFs are proteins that bind to specific DNA sequences upstream of the genes. TFs can activate, re- press or enhance transcription by interacting with the transcriptional machin- ery and by altering the structure of the chromatin. The analysis of promoter regions upstream of genes is fundamental to understand their regulation. TFs can be classified depending on the binding areas they show affinity but also on the distance to the transcription start site (TSS) like basal or distal promoters. Since the expression and activity of the protein TFs is in turn regulated by other components of the cell, the whole system can be seen as a regulatory network, where genes (and their products) may regulate each other. Transcription in eukaryotes is performed by three kinds of RNA polymerases: pol I, pol II and pol III. Protein coding genes are transcribed

23 into mRNA by pol II, whereas pol I and pol III transcribe genes coding for ribosomal RNA (rRNA), transfer RNA (tRNA) and other small RNAs. The polI II pre-initation complex (PIC) is a holoenzyme assembled from more than 60 polypeptides, including 12 subunits of the enzyme pol II, a 20-subunit complex called the Mediator, and various general transcription factors (GTFs). Few components of the PIC recognize specific DNA sequences of the core promoters, and it consists of one or more sequence elements such as the TATA box, BRE (TFIIB recognition element), Initiator and DPE (Downstream promoter element), but none of them are essential as it will be explained in the next section. Transcription starts when TFIID recognizes a specific promoter site and possibly the active site for pol II (Initiator element, Inr). TFIID is a com- plex that consists of the TATA box binding protein (TBP) and several TBP- associated factors (TAFs). Thereafter, the rest of the PIC is assembled around TFIID. Once the complex is stable, a docking site is provided for TFIIF which binds together with the pol II enzyme. The largest subunit of pol II has a unique CTD (C-terminal domain) tail which is a target for phosphorylation. The complex is positioned so that the Inr site of pol II is close to the TSS, ap- proximately 30-90 bp downstream of the TATA box. When the last two GTFs, TFIIE and TFIIH, have been recruited, promoter melting opens up the DNA and the template strand can be accessed. TFIIE stimulates a protein kinase activity of TFIIH, which phosphorylates the CTD tail. The CTD tail phospho- rylation triggers transcription elongation by breaking the interaction of the pol II catalytic complex and releasing pol II with TFIIF. When the transcription is complete, the CTD tail is dephosphorylated and the cycle can start over again. In the first round of transcription, an important limiting factor of the reaction rate is the binding of TFIID to the TATA box. In subsequent rounds of transcription, when TFIID is already bound to the DNA, recruitment of components of the PIC may become the rate limiting factor. Basal pol II transcription occur normally at very low rates without the help of additional TFs. Single TF or complex of TFs can enhance transcription by several fold or conversely, inhibit transcription by acting as repressors. TFs are mostly highly cell- and time-specific and only regulate a given set of genes. TFs recognize specific DNA sequences, called motifs (see section 4.1). It has already been explained how the PIC recognizes the TATA box and the Inr ele- ment binding close to the TSS. In a similar way, gene-specific TFs recognize short DNA sequences, typically 5-15 bp, and often degenerated. This means that a TF does not recognize a unique sequence motif, but a set of similar DNA sequences. Such motifs are organized in enhancers or groups of motifs that together orchestrate transcription of distant genes and can be located far away of the TSS, reaching close to millions of bp in higher eukaryotes. One of the most striking cases to date is the enhancer 780 kb upstream of human DACH1 [202], implicated in developmental regulation.

24 Proteins are made up of functional or structural independent units called protein domains. One domain may appear in a variety of evolutionarily re- lated proteins. Domains present high variation in length. Because of their sta- bility, domains can be "swapped" by genetic engineering between one pro- tein and another to make chimeric proteins. The TFs functional domains are, among others, DNA-binding and activation/repression domains. In addition, many TFs also have structural dimerization domains which enables them to form complexes with other TFs. The most conserved domains are the DNA- binding domains which allow to classify TFs into different protein families. The most common (and one of the shortest) DNA-binding domain is the zinc finger domain with a combination of cysteine and histidine residues. They can be classified by the type and order of these zinc coordinating residues (e.g. Cys2His2, Cys4, and Cys6). This domain consist of approximately 30 amino acids stabilized by metal (zinc) ions and disulfide bridges. Zinc fingers typ- ically have several zinc finger domains in tandem, each of them interacting with three consecutive nucleotides. Therefore, if zinc fingers are dimerized, the length of the bound DNA motif is six, nine or twelve bp. In contrast, activation/repression domains show low sequence similarity and cannot be used to group TFs into protein families. The variety of acti- vation/repression domains in TFs could be implicated with the many differ- ent mechanisms by which regulation of transcription is affected. The activa- tion functions and related mechanisms played by the TFs can be divided into two main groups: enhancers and repressors. TFs have been shown to interact mainly with TFIID but also with other components of the PIC, commonly, through intermediates. These TF co-factors are indeed those which interact directly with the PIC. Other mechanism used by TFs to activate transcription is remove proteins that prevent transcription from the promoter regions. Some of these TFs can bind with high affinity and then remove inhibitory proteins through different mechanisms. Generally, other proteins compete with the pol II machinery for the binding regions in the promoters. The mechanisms of repression is to prevent the assembly of the PIC either by blocking access of TFIID to the DNA or some other component to the assembly of the PIC. Co- factors can also be recruited in this case to prevent the basal pol II factors to gain access to the DNA. Sometimes, this includes modifications to the struc- ture of the genomic chromatin by making the packing proteins, the histones, bind more tightly to DNA and thereby prevent pol II access to it.

The altered state of the central dogma In both Crick’s papers, a possible transition between RNA and DNA was rep- resented by a dashed arrow. The idea of reverse transcription was devised by one of its main proponents, Howard Temin, by which retroviruses would generate a double-stranded DNA copy of their single-stranded viral RNA genome [260]. Although it was very unpopular in the beginning as it con-

25 tradicted the central dogma of molecular biology, the independent discovery of the enzyme responsible for reverse transcription by Howard Temin and David Baltimore [261, 262], contributed to the idea of reverse transcription became generally accepted. The enzyme discovered, named reverse transcrip- tase (RT), is error prone when transcribing RNA into DNA since it has no proofreading ability. This high error rate allows mutations to accumulate at an accelerated rate relative to proofread forms of replication with DNA poly- merases. RT functionality is also important in eukaryotes retrotransposons and telomerases. Retroviruses are introduced in section 2.4. In addition to packaging the DNA into the nucleus, chromatin has an im- portant function in regulation of gene expression. Chromatin is a complex in which DNA is packaged together with special proteins called histones. There are four core histones, H2A, H2B, H3 and H4, all of them which possess a globular domain and a positively charged N-terminal tail. These four histones associate in nucleosomes which are the basic units for the geometry adopted by chromatin. Nucleosomes are comprised of 147 bp of DNA wrapped twice around an octamer of two molecules of each four core histones. Their N- terminal tails interact with the negatively charged DNA backbone strength- ening the interaction of the ensemble. Core nucleosomes are separated by a distance of 10-80 bp of linker DNA. Nucleosome positioning is cell type spe- cific [34, 224, 292] and the particular folding of this structure at a certain time alters nuclear DNA accessibility to TFs. For example, highly transcribed genes usually display a nucleosome-depleted region in their promoters com- pared to promoter regions of genes with lower expression [30, 162, 163]. Reg- ulation of transcription has been also shown to occur next to re-localization of genes within the cell nucleus upon certain conditions [16]. Therefore, chro- matin is not static but can be remodelled in several ways. One of the main control mechanisms is the covalent modification of N-terminal histone tails such as acetylations and methylations. A large number of modifications have been shown to affect nucleosome occupancy, chromatin structure and gene ex- pression [155]. The broad term epigenetics is used to refer to those changes in gene expression caused by mechanisms other than changes in the underlying DNA sequence, It should be noted that afore mentioned long-range cis-regulatory elements and their target genes may not always be far from each other in the cell, be- cause chromatin packaging can bring together elements that would be distant in a linear display of the genome sequence. Furthermore, systematic exami- nation of transcriptional regulation has yielded new understanding about tran- scription start sites [76]. Several genome-wide studies which aimed to char- acterize core promoters [45, 151] (reviewed by Sandelin et al. [234]) have revealed that the transcription model with a defined TATA box locus in mam- malian genes can also be found with the model of multiple promoters and TSSs offering alternative regions to initiate the transcription in most of the mammalian genes and thus generating a greater and more complex diversity

26 in the mammalian transcriptome and proteome. Many of these yet undiscov- ered alternative TSSs may exist closer to cis-regulatory elements located far upstream of currently known TSSs [183]. Genome transcription is not the only mechanism for passing on information within the cells. Intracellular signaling is also mediated through the levels of different metabolites, by post-translational modifications and by protein localization. Therefore, the information processing inside a cell can be viewed as a signaling network. Gene expression can thus be viewed as a function, where the combination of signals determines the transcription rate. The Mendelian concept of a gene has been eroded in recent years [216]. At least, the generally accepted definition of a gene that is "a section of DNA coding for a particular protein" is changing toward "a union of genomic se- quences encoding a coherent set of potentially overlapping functional prod- ucts" idea [95]. The new concept is now focused into functional products (ei- ther proteins or RNA) rather than a DNA locus. Recent observations of new fused proteins stemming from adjacent genes that should code for two sepa- rate protein products [214] or alternatively, proteins composed of exons from far away regions and even different chromosomes [138] have contributed to this new paradigm. In this new definition, the control regions of a gene do not necessarily have to be close to the coding sequence on the linear molecule or even on the same chromosome [247].

2.3 Phenotypic traits and association studies All organisms have many genes controlling several different biological traits, some of which are immediately observable, such as eye color or number of limbs, and some are not, such as blood type or increased risk for specific dis- eases. A phenotypic trait is the presence or absence of any observable charac- teristic or a quantifiable biochemical measurement in an organism. Following the Mendelian principle of independent assortment, each of the two copies from each parental gene(s) related to a certain trait will sort independently into gametes. Variants of a single gene are called alleles, and differences in alleles may give rise to differences in traits. A gene’s most common allele is called the wild-type allele, and rare alleles are called mutants. However, this does not imply that the wild-type allele is the ancestor from which the mutants are descended. Furthermore, which allele an organism inherits for one trait is unrelated to the allele inherited for another trait. But in fact, this is only true for genes that do not reside on the same chromosome, or are located very far from one another on the same chromosome. The closer two genes lie on the same chromosome, the more closely they will be associated in gametes and the more often they will appear together. In relation to the phenotype, the genotype of an organism is the inherited set of instructions. In zygotic organisms, embryos are usually the result of a fertil-

27 ization event between two haploid cells -an ovum from a female and a sperm cell from a male- which carry the maternal and the paternal allelic informa- tion. In diploid organisms (i.e. species with two homologous copies of each chromosome), haploid gametes are generated during meiosis. In one stage dur- ing meiosis, the two parental alleles will recombine their generic material in a crossing-over event between paternal and maternal chromatids. This event will shape the entire haplotype (from "haploid genotype") of the individual, a combination of alleles that are transmitted together on the same chromo- some. Allelic haplotypes are associated by genetic linkage. Alleles that don’t assort independently contain genetic information that is inherited jointly be- cause they may be physically linked to each other and tend to stay together. Genes that are very close to each other essentially would never separate be- cause it is extremely unlikely that a crossover point will occur between them. This is the basis of the concept known as genetic linkage. DNA replication is very accurate, the error rate per site is as low as 10−6 to 10−10 in eukaryotes [287]. But rare, spontaneous alterations in the base se- quence of a particular gene arise from various sources, such as errors in DNA replication and the aftermath of DNA damage. These errors are called muta- tions. Single nucleotides may be changed (substitution), removed (deletions) or added (insertion) in the DNA genome sequence. Single nucleotide poly- morphisms (SNPs) may fall within coding sequences of genes, non-coding regions of genes, or in the intergenic regions between genes. Due to the re- dundancy of the genetic code, some SNPs in protein-coding genes are called silent when they do not produce any change in the amino acid sequence of the protein. For example, the U ↔ C in the third nucleotide of the codon UCU has no effect on the final protein as the codon UCC code for serine as well. The other two types of point mutations, missense and nonsense, code respectively for a different amino acid and for a stop codon that can truncate the protein. Insertions and deletions (indels) of a number of nucleotides that is not evenly divisible by three may originate mutations that shift the reading frame (frameshift) or alter the splicing of the mRNA (splice site mutation), resulting in a completely different translation from the original. Mutation and SNPs are somehow interchangeably terms. But polymor- phism is used when the rarest allele occurs in a frequency > 1% in the population, and mutation is broader in the sense that gathers also other type of changes in the form of large structural rearrangements such as segmental variations (insertions, deletions and duplications), tranlocations, inversions. Segmental variation, in particular, is referred to a loss of heterozygosity (loss of one of the alleles) which may in turn cause loss-of-function by altering the dosage of the genes located within them. Mutations carrying phenotypic effects are often neutral or deleterious to the organism, but sometimes they may render themselves benefitial to an or- ganism’s fitness.

28 In population genetics, mutations propagated to the next generation can lead to variations within a species’ population. The existence of phenotype differ- ences in individuals with genotype variations is the basis for quantitative ge- netics. Analysis of Quantitative Trait Loci (QTL) deals with the methodology and study of quantitative traits in order to unravel the role and interactions that regions of the genome and phenotypic factors may play in the government of the trait(s) of interest. A Mendelian trait is that controlled by a single locus and shows a simple Mendelian inheritance pattern. Although it is rare for the variants in a single gene to have clearly distinguishable phenotypic effects, certain well-defined traits are in fact controlled by single genetic loci. A deleterious mutation in a single gene can cause a disease which will be inherited according to Mendel’s laws (segregation and independent assortment). Examples include sickle-cell anemia, Tay-Sachs disease and cystic fibrosis. A disease controlled by a single gene contrasts with a multi-factorial dis- ease, like arthritis, different cancers, autoimmune diseases, allergies and car- diovascular diseases, which are originated from the combination of deter- mined genetical risk factors within a catalytic surrounding environment which make an individual prone to manifest a disease. Those diseases are inherited in a non-Mendelian style. In Mendelian inheritance it is assumed that genes from maternal and pa- ternal chromosomes contribute equally to human development. One of the first exceptions to this law is X-chromosome inactivation in females, which silences gene expression from one of the two X chromosomes [92]. A fur- ther exception is genomic imprinting in which genes may show preferen- tial expression from one of the alleles [176, 294], either the allele inherited from the mother (e.g. H19 or CDKN1C) or the allele inherited from the fa- ther (e.g. IGF2). Variations in gene sequence and expression in combination with the surrounding environment underlie much of the species’ differences. Histone modifications have been proposed as one of the several mechanisms underlying such epigenetic effects that do not involve changes of the DNA sequence and are inherited over rounds of cell divisions or even across gener- ations [229]. If this transmission of the chromatin state is passed by histones themselves [107, 201] or other processes such as DNA methylation [124] re- mains unclear. Paramutation, a more restricted class of epigenetic variation, point to a new role of RNA as determinant and inducer of hereditary varia- tion. Strikingly, RNA-mediated non-Mendelian inheritance of an epigenetic change has been demonstrated in the mouse [226] and plants [57]. Moreover, the transcriptional regulation machinery is known to be frequently involved in both Mendelian [154] and complex dis- eases [209, 239, 291]. Often causative, the combination of genetics with environmental factors is truly what reflects into the phenotypic features of an organism during the evolution of the individual [272].

29 Several experiments can be done in the grounds of morphology, developmental, biochemical, behavioural or physiological properties in selected individuals. New hybrid populations subjected to controlled cross-mating can be obtained for special purposes like research. The first filial generation seeds/plants or animal offspring resulting from a cross mating of distinctly different parental types is designated as F1 hybrids. The second cross from offspring(s) resulting from a cross of the members of F1 generation is called F2, and so on [125]. Then, combinations of alleles or genetic markers may occur more or less frequently in the new hybrid populations than would be expected from a random formation of haplotypes based on the extant allele frequencies of the population. This situation is defined as linkage disequilibrium (LD). The level of linkage disequilibrium is influenced by a number of factors including genetic linkage, the rate of recombination, the rate of mutation, genetic drift, non-random mating, geographic isolation, selection, and population structure. Another methodology, genome-wide association studies (GWA studies, or GWAS), measures association of genetic markers across a given genome locus to a given disease state or an observable trait. GWAS normally require two groups of participants, cases and controls, to draw statistical tests and find significant disease associated areas of the genome. Thereby, certain genetically influenced traits can be studied through "artifi- cially" bred populations using different types of studies. And for the purpose of creating such population structures, domestic animals are a convenient an- imal model.

The dog’s role as a model organism Domestic dogs have a tremendous potential to unravel and increase our un- derstanding of the mechanisms behind genetic risk factors leading to complex diseases [19]. The common environment shared with our nearest household pets makes dogs an interesting model for translational medicine. The previ- ous reason complemented with dogs increased risk for genetic diseases and their characteristic genetic architecture [169] makes dogs an ideal compara- tive model mammal in which to study human disorders. Dogs have originated from the grey wolf (Canis lupus) and are considered the first animals that were domesticated [211]. This is thought to have occurred in China less than 16,300 years ago. Domestic dogs have since then been selectively bred for hunting, guarding and herding livestock, and as companion animals. The im- mense morphological phenotypic variation among modern dogs has accumu- lated as the result of the breeding process during the last 200 years [140, 289]. This evolutionarily short and extremely selective process gave origin to a sec- ond genetic bottleneck which followed the first species domestication causing a decrease in genetic variation within breeds and an unwanted increased fre- quency of disease mutations. However, this reduced genetic variation within

30 breeds is an advantage when studying the genetics underlying disease. Long haplotype blocks (500 kb to 1 Mb) and extensive linkage disequilibrium (LD) are shared within breeds [169]. Interbreed comparison takes advantage of the short shared haplotype blocks in LD to narrow down the region of association by using a denser set of markers. Often, areas in linkage or association studies cannot be reduced due to the size of the linkage block. In the case of associa- tion studies, random distribution of each parental haplotype is of importance in the interval of the study to allow fine scale mapping. The strategy for per- forming whole genome-wide association studies (GWAS) in dogs is outlined in [140] and [139].

2.4 Retroviruses and their endogenization Retroviruses have challenged and infected all orders of eukaryotic life dur- ing evolution. Some serious health problems have originated by retroviruses, such as in the HIV case, or linked to them. Unrelated serious diseases such as prostate cancer [269] and chronic fatigue syndrome (CFS) [178] have been ob- served and associated with retroviruses of type xenotropic murine leukemia- related virus (XMRV) [147]. Likewise, breast cancer in humans induced by mouse mammary tumor virus (MMTV), a murine-specific retrovirus, has been observed after cross-species transfer [190]. In all these diseases, the cause- consequence implication for the viral infection remains to be established. The retroviral genome is contained in a viral capsid which in turn is en- closed into a lipid bilayer. The retrovirus life cycle entails the transforma- tion of the positive sense single-stranded RNA (ssRNA) dimer to DNA. The transformed DNA integrates into the host genome of the infected cell leav- ing a provirus about 7-11 kb in size. Therefore, retroviruses are important insertional mutagens subjected also to recombinations, rearrangements and segmental duplications [146]. The reverse transcription step is performed by the reverse transcriptase contained in the Pol protein that is carried within the retrovirus. The viral proteins are called Group antigen (Gag), Protease (Pro), Polymerase (Pol), and Envelope (Env). The coding regions for these proteins are flanked on both 5’- and 3’-ends by long terminal repeats (LTRs) (Figure 2.4) which are identical at the time of integration. In humans, integra- tions of retroviruses (HERVs) are estimated to constitute about 7%-8% of the genome [118], with 98,000 elements and fragments [27]. The infectious and replicative cycle of a retrovirus (Figure 2.5) begins with the binding of surface protein (SU) to one of the cell receptors which forces transmembrane protein (TM) to come in a close contact with the cell mem- brane. Following the cell and virus fusion, the virion core is released into the cytoplasm where the genome of the retrovirus is reverse transcribed into a double-stranded DNA clone using the ssRNA like a template (Table 2.2). The preintegration complex (PIC) including the retroviral DNA and integrase

31 Short Short inverted Transcriptional start inverted repeat repeat U3 R U5 TATAA TG CA AATAAA Transcription Polyadenilation factor signal binding sites AAA...

Figure 2.4: Provirus structure. Based on the splice donor-acceptor sites, putative mRNA transcription for the retrovirus with simple replication strategies is shown. Pu- tative viral genes with open reading frames: Gag (MA, matrix; CA, capsid; NC, nucle- ocapsid); Pro (PR, protease); Pol (RT, reverse transcriptase; IN, integrase); Env (SU, surface protein; TM, transmembrane protein). Other parts contained: PBS, primer binding site; SD, splice donor; SA, splice acceptor; PPT, polypurine tract. Apart from these parts, the viral genome contains also the ψ element that is involved in the pack- ing of the viral genome (not shown). Based on a excerpt from Figure 1 in Jern and Coffin [129].

(IN) along with some cellular factors is formed, transported into the nucleus and subsequently integrated into the host’s genome by the viral encoded IN. Retroviral transcription starts at the 5’ U3-R junction and the 3’ polyadenyla- tion site if placed at the 3’ R-U5 junction (Figure 2.4). The major splice donor site downstream of the primer binding site (PBS) is used for generation of subgenomic mRNAs, including env. After translation, the polyproteins Gag and Gag-Pro-Pol localize to the cell membrane into which the Env protein attaches by its C-terminal transmembrane domain. Assemblage of a complex of unprocessed polyproteins, a dimer of the progeny and a Gag-coating cap- sid occurs close to the attached Env followed by budding of the virion from the cell surface. As the virion matures, the polyproteins are cleaved into func- tional subunits within the capsid and thus the infection continues spreading to neighboring cells. Surprisingly, not all retroviruses dispose to a life threatening condition. Retroviruses may have played significant evolutionary side effects. Many of the mammalian viruses may have acquired genes from their hosts during their evolution as the captured genes provide a selective advantage to the virus.

32 Table 2.2: Steps involved in the reverse transcription of retroviral RNA to DNA (Cann [43], Coffin [50], Menéndez-Arias and Berkhout [191], Nassal [198]) Step Stage 1 Located immediately downstream of the U5 region, a specific cellular tRNA complementary to the 18-nucleotide long PBS acts as a primer, hybridizes to the virus ssRNA genome and provides an hydroxyl group for initiation of reverse transcription. Reverse transcription starts in the 3’ → 5’ end direction. Complementary DNA (cDNA) to the U5 and R region of the viral RNA is syn- thesized by RT. The U5 is a non-coding unique sequence in the 5’ end. The R region is a direct repeat found at both ends of the RNA molecule 2 A domain on the RT enzyme called RNAse H degrades the 5’ end of the RNA which removes the U5 and R region 3 The primer then is translocated to the 3’ end of the viral genome and the newly synthesized DNA strand hybridizes to the comple- mentary R region on the RNA . 4 The first strand of complementary DNA (cDNA) is extended 5 The majority of viral RNA is degraded by RNAse H except the polypurine tract (PPT) site 6 Once the first complementary strand is completed, the second strand synthesis is initiated using the PPT as a primer. Synthesis of second strand of ssDNA is initiated from the 3’ end of the template PPT. tRNA is necessary to obtain the synthesis of the complementary PBS 7 tRNA is degraded 8 After another translocation, the PBS synthesized in the second strand hybridizes with the complementary PBS on the first strand 8 Synthesis of both strands is completed by the DNAP function of reverse transcriptase. Both double stranded DNA (dsDNA) ends have U3-R-U5 sequences, so called long terminal repat se- quences (3’LTR and 5’LTR, respectively). LTRs mediate integra- tion of the retroviral DNA into another region of the host genome by virtue of the enzyme integrase

33 Figure 2.5: Life-cycle and structure of the retrovirus model: HIV. This image was originally created by Daniel Beyer and distributed in Wikimedia Commons (http://commons.wikimedia.org/) under the Creative Commons Attribution Share- Alike 3.0 License.

However, syncitin is one of the few examples of co-adaptation between viruses and the host but may be the only extant case where viral genes have been found to serve an important function in the physiology of the mammalian host. Syn- cytin is believed to mediate placental cytotrophoblast fusion in vivo, and thus play an important role in human placental morphogenesis [193]. Another ex- ample of a likely positive selected retrovirus may be telomerase [56, 74]. In- dicators such as the RNA molecule carried by the telomere repair RT enzyme may suggest the retroviral origin of telomerases. Furthermore, retroviruses not always play a destructive mechanism in the genome organization. They may integrate and repair double-strand breaks, promote transduction of DNA from their 5’ or 3’ end to other locations, cre- ate chimeric retrogenes, integrate and transcribe within coding regions, affect transcription via antisense promoter or premature cleavage with its poly(A) signal [146]. Of course, there remains a potential to disrupt genes by muta- genic insertions and produce defective and/or truncated products [145]. One of few examples of chimerism with a reported phenotypic effect is the re-

34 cently acquired retrogene encoding fibroblast growth factor 4 (Fgf4), strongly associated with chondrodysplasia, a short-legged phenotype in at least 20 dog breeds [213]. Domestication has allowed the fixation of a 5 kb insert con- taining the retrogene in 19 of the 20 breeds tested. Gene duplication through retrotransposition must acquire a new promoter, likely providing with a differ- ent expression profile. To accomplish this, retrogenes often borrow contextual regulatory elements from neighboring genes or from other transposable ele- ments [84]. In the particular case of complete retroviruses, after integration into the genome they are bordered by short direct repeats of host DNA and LTR sequences of about 500-600 nucleotides. In these LTRs there are contained strong cell-specific transcriptional regulatory elements such as enhancers, promoters, hormone responsive elements, and polyadenylation signals (Figure 2.4) to promote their own transcription. But also alteration of neighboring genes could occur. LTRs are commonly found upstream of genes in antisense orientation or downstream in sense orientation [50]. One of the most well known cases of tissue-specific promoter is the expression of amylase in the human salivary glands by an integration of HERV-E in reverse orientation upstream of the gene [233]. Solitary LTRs, much more abundant than their cognate proviral ancestors, can retain promoter activity [64, 179] and sometimes lead to polyadenylation of spliced chromosomal transcripts [137]. Because the retroviral LTRs potential to influence transcriptional activity of the neighbouring genes and thus induce impaired functions of the host, LTRs are underrepresented within and in the vicinity of genes [189]. This seems to be result of a selective process such a distribution is not observed in recently integrated proviruses [41]. New proviruses can integrate in the vicinity of known proto-oncogenes and produce alteration of their expression which may be involved in the onco- genesis [121]. Despite of the presence of retroviral proteins in the diseased cells, no genetic proof of causality has been obtained [59]. The altered lev- els of expresion may be confusing its role as etiological agent. Proviruses may be transcriptionally activated by a change in the state of the host cell. For example, HIV seropositive patients experimented a reactivation of HERV- K(HML2) [91] and HERV-W Env expression was found to be upregulated in demyelinating brain tissues of MS patients [20]. Occasionally, endogenizations from ectopical retroviral infections can be passed on to germ line cells. This lead to transmission of the integrated provirus to the offspring in a Mendelian fashion. These viruses are known as endogenous retroviruses (ERVs) [253]. ERVs are present in all eukaryotic forms of life and in most species only as a record of an infectious agent. They are highly diverse and mutated although despite the millions of years since the integration occurred, some ERVs still exhibit open reading frames (ORFs) making possible protein translation. There are some concerns with ERVs having ORFs coding for proteins. Porcine endogenous retroviruses

35 (PERVs) have demonstrated infectious capacity in some cases for human, rat, and mouse cells after pig islet xenotransplantation [295], [235], [271]. Insertional mutagens with the Gag-Pro-Pol reading frames open are the master copies of the intracisternal A particles (IAPs), a family of autonomous betaretrovirus-like LTR-retrotransposons, in mice. The proteins provide with all the enzymatic and structural proteins neccessary for the IAPs to autonomously transpose in heterologous cells [63]. Likewise, in the case of murine ETn (Early Transposon) elements, known to be among the most active mobile sequences, "master" MusD LTR-retrotransposons are used to trans-mobilize. Similarly to IAPs, MusD "master" copies are able to autonomously trigger retrotransposition of marked ETn elements with presence of coding gag, pro and pol genes [228]. With time, ERVs may become fixed in a population because there is a se- lection process for those that are nearly neutral or beneficial. The presence of both complete replication competent infectious ERVs as well as a majority of defective ERVs with various degrees of mutation load is evidence that retro- viral transmission has been and continues to be an evolutionary process with potential functional implications for the host. The fact that most ERVs are incomplete with multiple inactivating mutations indicate that most of the inte- grations occurred early during eukaryotic evolution. It also implies that hosts that have co-evolved with ERVs over a long time and have developed protec- tion mechanisms such as cosuppresion [293], CpG methylation [36] and cyti- dine deamination [78]. From the information contained in the genomes today, it is undetermined when the first endogenization occurred. Estimations of the ERV age is based on the calculated LTR divergence if both LTRs have been identified. Assuming a neutral nucleotide substitution rate of 0.2% per million years (my) for retrotransposons [71, 168], a limit of 5% divergence would contain any integration that occurred around 12.5 million years ago (mya) whereas an integration with a 10% divergence would have occurred around 25 mya. Noticeably, with a 50% divergence limit for a reliable LTR nucleotide sequence recognition, ERVs older than 125 mya cannot be found in current genomes. Therefore, estimation of the time of integration for ERVs is lost in the late Mesozoic era. ERVs and transposable elements (TEs) that escaped genome silencing and became co-opted by their host genome [194, 281, 301] show frequently a particular promoter configuration. They usually share a pro- moter with a neighbouring gene in a head-to-head or bidirectional arrange- ment with CpG-associated regulatory regions [135]. Co-opted LTRs can affect splicing by inserting splice sites [137] or driving the expression of an aberrant transcript like an upstream promoter [156]. If an ERV is integrated into an intron, in the sense orientation, it may introduce new canonical splice sites leading to exonization [127]. Therefore, ERVs are selectively favoured in the antisense orientation and where integrated within introns they are less likely to have functional splice or poly(A) sites, and their access may be blocked by the synthesis of transcripts from the retrovirus LTR.

36 Beyond their influence over the transcriptional landscape of genomes, en- dogenous proviruses may play significant roles in the organization and geom- etry of the host genome structure. ERV-mediated recombination events have likely had a profound effect in genome re-shuffling. Different types of re- combinations can occur between identical sequences from the same or related elements. There are at least four types of recombinations commonly observed: 1. The most common is homologous recombination between two LTRs ex- cluding the cognate proviral genes and leaving a solitary LTR in the locus. Solitary LTRs are accounted up to 10 to 100 times more frequently than their retroviral counterparts [253]. Solitary LTR formation should occur soon after integration before mutations accumulate when still the LTR are highly identical [28]. 2. Homologous recombination between two proviruses in the same orienta- tion on the same chromosome result in the loss of viral and genetic se- quence between recombination sites. When the proviruses are in opposite orientation, the result is an inversion of the chromosomal region. 3. Recombination between proviral 3’ and 5’ LTRs on sister chromatids orig- inate tandem proviruses on one of the chromatids and a solitary LTR on the other. Tandem proviruses are flanked by LTRs while sharing one central LTR. 4. The only nonreciprocal exchange of sequences without proviral loss, re- sults in a functional conversion of the targeted proviral sequence within the same or related elements [168]. Such recombination events are not unique to ERVs. Repeat elements of the same size and distribution can be subjected to the same exact mechanisms of chromosomal rearrangement. A particular rearrangement may theoretically render together an protooncogene and a retrovirus which would place the for- mer under strong cellular transcriptional control signals resulting in aberrant expression and thus with the potential to induce various malignancies. Even if mutated, ERVs can contribute to replication of other proviruses or to related exogenous viruses. The dimer copackaged in every retroviral cap- sid makes possible complementation or recombination of the RNA genome copies to occur [25, 300]. Each of the proviruses copackaged should provide the functional proteins that are defective in the other for the complementation to effectively work. AKR, a high leukemic strain of mice, result as a recombi- nation of a murine leukemia virus (MLV) strain with at least two other endoge- nous MLVs [254]. Proviruses capable of complementing disrupted proviruses in trans can induce their trans-mobilization and thus, contribute to an atten- uated trans-infection by acting as midwife elements [130]. The small group of HERV-Fc elements [29], intermediate to the larger HERV-H and HERV-F groups, could manifest "midwife" properties as proviruses in the family con- serve, despite of their age implied by the LTR divergence, almost intact gag and pol, and intact pro and env in humans. In an example of heterogeneity be-

37 tween viruses, the hepatitis delta virus (HDV) of humans has a small circular RNA genome, similar to retroviruses, but a protein coat derived from hepatitis B virus because HDV is defective on it. HDV cannot produce its own capsid and replicate without the help of hepatitis B virus [237]. Therefore, regarding Fc elements, they are hypothesized to be a source for capsid particles [130]. In section 5.4, we present our results for an in silico detailed search for this kind of elements. For a more complete review background on endogenous retroviruses, see Jern and Coffin [129] and Blikstad et al. [32].

Endogenous retroviruses classification ERVs are part of the transposable elements category. This is a generic term referring to both DNA sequences and retroelements that can be excised and reinserted at another site. The term retroelements describes any sequence that can reverse transcribe itself or is dependent on reverse transcription, integrate and replicate. Retroelements include ERVs, retrotransposons (lacking an env gene), retroposons, and retrosequences [146, 219]. Among the most remark- able elements are LINEs and SINEs, non-LTR retrotransposons, for their high number presence in eukaryotic genomes. For example, in the human genome there are about 20,000-40,000 LINEs (roughly 21% of the genome), and about 1,500,000 SINEs (approximately 13% of the human genome) [99, 118]. These retrotransposon-derived repeats are also commonly referred to as long and short interspersed repeats respectively. LINEs code for the enzyme RT, and many LINEs also code for an endonuclease (e.g. RNase H) [206]. The 5’ UTR contains the promoter sequence, while the 3’ UTR contains the polyadenyla- tion signal (AATAAA) and the poly-A tail (Figure 2.4). SINEs do not en- code a functional RT protein and rely on LINE elements for retrotransposi- tion. The most common SINEs in primates are called Alu sequences. Trans- mobilization of SINEs (including Alu elements in primates) probably occurs by SINEs "piggy-backing" the RT encoded by LINEs. SINEs and LINEs of many species share homologous sequences at the 3’ end, upstream of the poly(A) site, which may interact upon mobilization. Trans-mobilization of eel SINEs by LINEs has been demonstrated in cultured human cells [134]. A meticulous classification effort was made by David Baltimore to devise his own classification system [261] for viruses. The Baltimore classification is based on the mechanism of viral mRNA production [23]. As already ex- plained, viruses generate mRNAs from their genomes to produce proteins and replicate themselves and different mechanisms are used to achieve this in each virus family. Viral genomes may be single-stranded (ss) or double-stranded (ds), RNA or DNA, and may or may not use reverse transcriptase (RT). Addi- tionally, ssRNA viruses may be either sense (+) or antisense (-). The Baltimore classification places viruses into seven groups. All members of Group VI use virally encoded RT to produce DNA from the initial virion RNA genome. This

38 DNA is often integrated into the host genome, as in the case of retroviruses (family Retroviridae) and pseudoviruses (family Pseudoviridae), where it is replicated and transcribed by the host. In the earlier attempts, morphological features observed in electron mi- croscopy were used to classify retroviruses. Exogenous retroviruses were clas- sified into four types [50]: • A-types Intracisternal A particles (IAPs). Non-enveloped, immature parti- cles which can only be detected and trans-mobilize inside cells. They are believed to originate from endogenous retrovirus-like elements. • B-types Mouse Mammary Tumor Virus (MMTV) is the class model. They are enveloped and form extracellular particles with a round, condensed, acentric core and prominent envelope spikes. • C-types Represented by Murine Leukemia Virus (MLV) for mammalian and Avian Leukosis Virus (ALV) for avian retroviruses. With similar morphology to B-types, they show sometimes an angular central core and barely visible spikes instead. • D-types Mason-Pfizer Monkey Virus (MPMV). Slightly larger, possess a bar shaped core and less protuberant envelope spikes. At that time, exceptions such as lentiviruses (e.g. HIV-1), with their distinct morphology of a cone shaped core and more prominent spikes, were difficult to classify in the four types above although they resembled the C-type in their way of budding [50]. With the advent of sequencing of retroviruses, the classification was up- dated to incorporate the seven genera used today: alpha, beta, gamma, delta, epsilon, lenti and spumaretrovirus. For each of them, there is a particular (or several) retrovirus model(s) for that genus (Table 2.3).

Table 2.3: RNA retrovirus genera classification.

Genus Type Model Virus Organization Major Human groups Alpha Avian type-C Avian Leukemia Virus (ALV) Simple Beta Mammalian type-B, D Mouse Mammary tumor Virus (MMTV) Simple/Complex HERV-K(HML) Gamma Mammalian type-C Mouse Leukemia Virus(MLV) Simple HERV-E, H(F), I, R(ERV3), S, T, W(ERV9) Delta Type-C like Bovine Leukemia Virus (BLV) Complex HTLV-1, 2 Epsilon Type-C Walleye Dermal Sarcoma Virus (WDSV) Complex Lenti Human Immunodeficiency Virus (HIV) Complex HIV-1, 2 Spuma Type-C like Human Spuma Retrovirus (HSR) Complex HSRV, HFV, HERV-L

Endogenous retroviruses are not formally included in this classification sys- tem, and are broadly classified into three classes, on the basis of relatedness to exogenous genera: • Class I are most similar to the gammaretroviruses • Class II are most similar to the betaretroviruses and alpharetroviruses • Class III are most similar to the spumaviruses However, ERVs have been found to date related to each of the seven exoge- nous retroviral genera in the majority of eukaryotic organisms.

39 Recent advances in phylogenetic classification of ERVs have suggested to use different retroviral traits to pinpoint ERV evolutionary differences, even to classify ERVs within a specific genus. A study [131] successfully applied these to human and chicken ERVs. Very recently, Blomberg and colleagues have proposed a novel classification system for ERVs [33] where they argue that classification should be based on a combination of similarity, structural features, (inferred) function, and previous nomenclature.

Retroviral applications Viruses have been of historical importance to the study of molecular and cel- lular biology. They provide simple systems that can be used to manipulate and investigate the functions of cells [266]. For example, viruses have en- lightened basic mechanisms of molecular genetics, such as DNA replication, transcription, RNA processing, translation, protein transport, and immunol- ogy. The study and use of viruses have provided valuable information about every aspect of cell biology [177] and even influenced new medical branches like oncology. The discovery of the reverse transcriptase enzyme implied one of the largest paradigm changes in Biology. It greatly improved knowledge in the area of Molecular Biology, as, along with other enzymes, allowed scientists to clone, sequence, and characterise DNA. Nowadays, RT is a standard enzyme in molecular biology applications and is commonly applied to studies of RNA in laboratory techniques such as cDNA synthesis and reverse transcription poly- merase chain reaction (RT-PCR). The classical PCR technique can only be applied to DNA strands, but, with the advent of RT, RNA can be transcribed into DNA, thus making PCR analysis of RNA molecules a reality. Transduction is a common tool used by molecular biologists to introduce a foreign gene into a host cell’s genome. Geneticists often use defective viruses as vectors to introduce genes into cells that they are studying. This is a use- ful technique for making the cell produce a foreign substance, or to study the effect of introducing a new gene into the genome. In similar fashion, vi- rotherapy, a promising field in the treatment of cancer and in gene therapy, use viruses as vectors to treat various diseases and deliver therapeutical drugs or chemicals, as they can target specific cells and DNA.

2.5 Bioinformatics Bioinformatics is the application of information technology to the field of molecular biology. It is the name given to mathematical and computing ap- proaches used to gain understanding of biological processes. The term was coined in 1979 by Paulien Hogeweg for the study of informatic processes in biotic systems [115] but Dr. (1925-1983) is credited

40 today as a founder of the field of Bioinformatics. In her PhD project she pio- neered in the use of computers in chemistry and biology. She was particularly interested in the possibility of deducing the evolutionary history of the biolog- ical kingdoms, phyla, and other taxa from protein sequence alignments. Bioin- formatics has broaden since her contributions to many new fields of expertise. Major research efforts nowadays include , gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies, and the modeling of evolution [111, 123, 208, 243, 250, 276, 280].

2.6 Statistics and Biology in a complex interplay scenario The scenario of mechanistic interplay that Biology tries to describe [100], where the relation of cause-effect is influenced by several factors, requires a modelling science like Statistics in order to explain, categorize, extract knowl- edge of hidden features or just offer a numerical description of the observa- tions. Sir Ronald A. Fisher, creator of the foundations for modern statistical science [103], studied the second generation cross (F2-cross) ratio in the re- sults of Mendel when still his experimental results were disputed. Fisher ad- verted them as extremely biased towards the 3 to 1 ratio. However, Mendel’s laws were never questioned as they have been already reproduced and proven as valid but Fisher’s vision of Biology, being one of the greatest geneticists and evolutionists, paved the way to the production of several of the most strik- ing statistical theorems (hypothesis testing, linear discriminant analysis, max- imun likelihood, permutation test, Fisher’s exact test just to quote some) that are still commonly used in Bioinformatics.

Machine Learning With the advent of biochemical and biomedical data analysis, several disci- plines were left out from the aforementioned definition of Bioinformatics. Recently, with the emergence of new technological advances in both scientific or industrial fields, the application of Statistics to the extraction of knowledge from huge quantities of data gave rise to a new discipline called Machine Learning. In a biological jargon, the goal of Machine Learning is to under- stand the relationship between the observations collected and the experimental results obtained. One of the biggest challenges for Machine Learning is to deal with the problem that Richard Bellman referred to as the curse of dimensionality, [26]. The curse of dimensionality originated by the increase of feature space (or mathematical dimensions) compared to the sparcity of observations. This is a

41 very recurrent problem in the biomedical field where the number of samples is low compared to the number of medical measures taken.

Supervised and Unsupervised learning Supervised learning is probably the most practised task where an expert tunes the different system parameters in the presence of an outcome variable that guides the whole learning process. The distinction in the output type has led to a naming convention for the prediction tasks: regression when quantitative and classification when qualitative outcomes are predicted. Some statistical methods are best suited to quantitative inputs and some most naturally for qualitative or even both. In the end, qualitative variables may be encoded nu- merically which poses that both quantitative and qualitative problems can be viewed as a function approximation task. Conversely, the unsupervised learning problem gets features or inputs clas- sified without supervision assuming there is a hidden theoretical relationship between them. Data output here is organized in clusters or hierarchical trees, but no numerical measurements are obtained. Among the varied existing models, linear models have been the mainstay of statistics for the past 30 years and they are primary analytic tools for under- standing the relationship between different inputs in data assuming linearity. They provide simple capabilities for prediction and classification and still lin- ear regression models are often used for biological data analyses. However, the traditional linear models may fail in real situations: real life effects are obviously not always linear. All the models have several complexity parameters that have to be deter- mined. In general, we would like to set the values of those so the error of our predictor model gets minimized. This represents often a multivariate prob- lem for which an optimal solution cannot be found computationally. When the search space is large enough to make unrealistic the evaluation of all the possible solutions, a local optimun (minima) can be found by a search of an initial solution followed by iterative searches for improvement of the candi- date solution. A further improvement is to restart this iterative search several times from different initial candidates at random. In order to use statistical learning methods, the observations (data samples) need to be described in a homogeneous way. In other words, common fea- tures should be consistent in the scale used for their measure. Biological ex- periments are not free from sources of errors either. In handling experimental bias, statistics can also help us to try to normalize the error bias in our data knowing their likely distribution (e.g. normalization on two channel array ex- periments of Cy5 and Cy3 values, detection of miscalling bases among true SNPs in a sequencer file, etc.). Most machine learning algorithms require in- puts in a certain scale of values and observations that contain the same number of features.

42 Model selection and validation When tuning different models, there are two goals we may have in mind. One is a model selection stage where performance of different models is estimated in order to choose the best one, another is a model assessment stage needed to determine the quality of the model in terms of significance and generalization capability. In other words, we need to assess the generalization performance of our model with the prediction error on new data to know how good it is constructed and have a measure of the quality. An estimate of the error can be used to infer comparisons between different models performance when these model belong to the same class. For this purpose, a common validation ap- proach in a data-rich situation is to divide the dataset randomly into three parts: training, validation and test sets. The training set is used to fit the mod- els; the validation set is used to measure the estimated prediction error (Err) accounted for model selection; finally, the test set is used to assess the gener- alization error (Errτ ) of the final chosen model. A typical split might be 50% of data for training, and each 25% of the rest for validation and testing. For situations where the data is insufficient to split in three sets, as in most of the biomedical observations, other kind of methods are used. These are methods to approximate the validation step either analytically (e.g. Akaike informa- tion criterion, AIC; bayesian information criterion, BIC; minimum descrip- tion length, MDL; structural risk minimization, SRM) or by efficient sample re-use (cross-validation or bootstrap). The process to choose the model parameters is guided by the minimization of the average test error (or expected prediction error, Err), drawn from the different test error measured (Errτ ). Most methods can obtain reasonable es- timates of the expected error at least but error estimation is not easy in cases of data paucity.

2.7 The -omics age The finishing of the Caenorhabditis elegans genome sequence in 1998 [42], supposed a definitive impulse for the Human Genome Project that had started in 1990 [288] and concluded successfully with a first draft of the first hu- man genome in 2001 [118] and a finished sequence in 2004 [119]. During the past decade the rapid overlapping developments in molecular research technologies paralleled with developments in information technologies which combined have produced a tremendous flood of information. This revolution wouldn’t have been possible without the advances listed in Table 2.4 among many others. In the -omics age of today’s molecular biology science (ge- nomics, transcriptomics, proteomics, interactomics, allergonomics, etc.), any study would not be feasible without the aid of Bioinformatics.

43 Table 2.4: Technological milestones that helped to the advancement of Life Sciences.

Year Discovere(s) Discovery(ies) Inventor(s) Invention(s) Developer(s) Technology(ies) 1965 Margaret Dayhoff Atlas of Protein Sequences 1966 National Library of Medicine development of Medline 1968 U.S. Advanced Research Projects development of Internet Agency 1972 R. Tomlinson development of email 1980 P.J. Stoehr and G.N. Cameron The EMBL Nucleotide Sequence Database, EMBL, Germany [252] 1982 W. Goad development of GenBank 1988 National Library of Medicine founding of the National Center for Biotechnology Information (NCBI) and the creation of public databases and systems for their use 1990 S.F. Altschul and D.J. Lipman development of the BLAST algorithm [17] 1991 T. Berners-Lee and R. Cailliau development of the World Wide Web (WWW), CERN 1997 National Center for Biotechnology In- development of PubMed formation 1998 L. Page, S. Brin development of Google

The usage of -omics techniques has grown exponentially in the past five years and shows no signs of slowdown. Result of this growth is that the num- ber of sources of products, services, and information has increased to the point that keeping track of (or locating) the numerous providers has become ex- tremely time consuming. The use of Bioinformatics today is entailed to the creation and advancement of biological databases, algorithms, computational and statistical techniques. To formally or practically solve these problems, Bioinformatics must rely in techniques learnt from fields like Mathematics, Computer Science or Statistics in a truly inter-disciplinary manner. However, with the advent of paradigm changes, often preceded by a change of technol- ogy, new niches for innovation are created so Bioinformatics paves its own way into the realm of biological knowledge.

44 3. Aims of this thesis

• Regarding the EMBRACE project objectives, the creation of newly dis- tributed tools that could aid in the advancement of biological research. • Bridging the EMBRACE project with relevant investigations, to support the solution of different test cases with a biological purpose. • Develop a web-based system to provide a state-of-the-art risk assessment procedure for the analysis of proteins with the potential to trigger allergy sensitization and cross-reactivity. • Develop a solution to catalogue and ease the analysis of ERV dataset(s) and allow the automatic retrieval of such integrations in the human genome. • Use GeneFinder, one of the tools developed for the EMBRACE project, to analyze results from genome-wide association studies, and to re-analyze one such study as a proof-of-principle. • To develop a detailed catalogue and display the canine endogenous retro- viruses (CfERV) integrational landscape. Investigate the possible influence proviruses may have on gene regulation and the role of relevant related viral families on the evolution of the species.

45

4. Methods

Some of methods used thorough the Papers presented accompanying this the- sis are described herein within brief self-explanatory sections to understand our reasons to use them.

4.1 Sequence alignments With the overwhelming high-availability of molecular sequences in the -omics age, sequence alignment has become one of the most powerful tools for the molecular biologist. The arrangement of sequences of DNA, RNA or protein in order to identify regions of similarity and conservation may provide insights into the functional, structural or evolutionary role of the sequences. The foun- dation for sequence alignment resides in the conservation principle. This is a basic tenant in Molecular Biology saying that being conserved through evo- lution may well be an indication of active functionality [148]. Therefore, if two sequences descended from a common ancestor, i.e. they are homologous, their alignments would reflect a reasonable degree of similarity interpreting that mismatches on them may occur as a result from evolutionary forces intro- ducing both point mutations and gaps (indels, insertional or deleterious mu- tations) in one or both lineages. However, convergent evolution is an evolu- tionary event that can randomly produce similarity between two proteins (or sequences) that evolutionarily are unrelated. Based on this similarity regions, the proteins may erroneously be considered as homologues just because they perform a similar function or fold into similar structures. Sequence alignment requires an algorithm to resolve the ordering of two (or several) sequences because due to their length and variability cannot be solely aligned under human supervision. Occasionally, a human supervised filtering step where heuristics are applied manually is needed after the final re- sults are obtained. A great variety of computational approaches to the problem have been researched. As a result of that studies, methods like dynamic pro- gramming, which perform an exhaustive search of solutions, heuristic algo- rithms, and probabilistic methods have been designed for large-scale database searches. We introduce in the following subsections some theory about the al- gorithmic approach. In the last step of a sequence alignment procedure is the scoring system which evaluates all possible alignments to decide an optimal

47 Figure 4.1: A sequence alignment, produced by ClustalW, of two human zinc finger proteins, identified on the left by GenBank accession number. Amino acids are colour- coded by their properties. Red: small, hydrophobic, aromatic, not Y. Blue: acidic. Magenta: basic. Green: hydroxyl, amine, amide, basic. Gray: others. Alignment keys are "*": identical residues, ":": conserved substitutions (same colour group), and ".": semi-conserved substitution (similar shapes)

one based on the highest score. The scored system is usually a scoring function usually in the form of substitution matrices between residues. For instance, there are two most used scoring matrices for protein sequence alignment: the Point Accepted Mutation (PAM) [61] and BLOcks of amino acid (BLOSUM) [110] encode evolutionary approximations of particular amino acid mutation rates or empirically derived substitution probabilities. The obtained score will reflect the biological or statistical observations about known sequences. In order to restrict the number of insertions and deletions, introduced by gaps in the alignment, because it would span the space of solu- tions endlessly, two additional parameters are used: the gap opening penalty and the gap extension penalty. Both penalties are negative values that will be added by the scoring procedure whenever new gaps are introduced. Two sequences producing an alignment of 20% or less identity such as some viruses in Paper IV are considered to drain into the Twilight Zone [69]. Twi- light Zone alignments may appear plausible to the human eye, but cannot be statistically supported as significant. Alignments over 25% identity at amino acid levels are said to reflect a certain protein relationship but in fact, below a 50% identity it becomes increasingly difficult to establish reliable relation- ships.

Pairwise alignments In section 2.1, we already described how the 2001 FAO/WHO decision tree in- cluded an in silico methodology based on sequence similarity to detect poten- tial de novo allergenic proteins based on already described reactive allergens in genetically modified food. Sequence alignments generally fall into two categories: global or local alignments. Pairwise alignments can find the best-matching between two se- quences at a time. Then, a global strategy tries to make an optimization of the alignment along the entire length of both the sequences. Local alignments, in contrast, try to identify regions of similarity within the sequences. However, a

48 local alignment of long sequences will be problematic to optimize due to the existence of multiple regions with high similarity. Dynamic programming algorithms for global and local alignment were de- vised by Saul Needleman and Christian Wunsch in 1970 [200], respective Temple Smith and Michael Waterman in 1981 [242] for the latter. The identi- fication of a good scoring function is the relevant step and it is often dependant on empirical knowledge. Given a scoring function, dynamic programming al- gorithms are then able to find the optimal alignment between two or more sequences although its computational cost discourages its use with large num- bers of or extremely long sequences.

a) b) c) f(i-1,j-1) f(i,j-1) 0 AGGTG T C AGGTG T C A 1 0 -1 -2 -3 -4 -5 000000 00 -d A 0 s(ri,rj) G 0 2 1 0 -1 -2 -3 10000 00 G -1 1 3 2 1 0 -1 G 0 021 0010 f(i-1,j) f(i,j) -d T -2 0 2 4 3 2 1 G 0 013 2 1 10 C -3 -1 1 3 4 3 2 T 0 002 4 3 2 1 C 0 0013 2 133

AGGTG T C AGGTG T C AGG T-- C AGG T

AGGTG T C AGG T-- C

Figure 4.2: a) Forward algorithm of the Needleman and Wunsch algorithm to re- cursively compute the entries of the alignment matrix. The grey box represents the additional parcel of the Smith and Waterman algorithm. b) Example of an alignment matrix and the back-tracing algorithm used to find the best-scoring global alignment. In the Needleman-Wunsch algorithm, the back-tracing reconstruction of the align- ment goes from the low right corner until the top left of the matrix. c) Example of an alignment matrix and the back-tracing algorithm used to find the best-scoring local alignment. In the Smith-Waterman algorithm, the back-tracing starts at the highest scoring value in the lowest row. The back-tracing reconstruction stops when a 0 value is found. Adapted from Balding et al. [22]

Sequence database alignments To facilitate searching against a large number of sequences to find local sim- ilarities for a specific sequence, word methods, also known as k-tuple meth- ods, were devised. The methods in this category are heuristic methods which implies that an optimal alignment is not guaranteed to be found but reason- able alignments will be done in computationally more efficient manner than with dynamic programming. After indexing a large number of sequences in an optimized dictionary database, these methods will match significant series of short, non-overlapping subsequences ("words" of k-letters) in the query sequence against these indexes of the target sequences, leaving out in this

49 step a large proportion of them. With this type of indexing, many low scor- ing matches are eliminated from being used as anchors for the alignment. The final step for closing the alignment, is an extension of these k-word matches sideways proceeding like in a local alignment until there is no more significant sequence to match. Word methods are very popular and used tools in database search. FASTA [170], the BLAST family [17] or BLAT [149] are the principal homology search tools for a bioinformatician. Lower values of k makes FASTA more sensitive but slower. BLAST is relatively faster as the algorithm starts with two diagonal seeds, or high-scoring segment pairs (HSPs), at a certain distance and extends them sideways using a modified Smith and Waterman algorithm. When the sequence database size increases, BLAST outperforms FASTA. In addition, BLAST can estimate the likelihood for a particular alignment to occur by chance given both the sequence composition and size of the database interrogated. The output parameters are very useful to ascertain if a score is just originated by chance or to compare the relatedness of various sequences. BLAT, used in Paper IV, is faster than BLAST because it keeps an index into the computer’s memory of all non-overlapping 11-mers except for those heavily involved in repeats. BLAT with DNA sequences of 40 bases or more quickly find alignments that have more than 95% identity. It may miss diver- gent or short sequences. On proteins, BLAT uses 4-mers rather than 11-mers, and finds good alignments for protein sequences longer than 20 amino acids of 80% and greater similarity. At a protein level, BLAT can give much better picture of gene families (paralogs). However, BLAST and PSI-BLAST [18], a BLAST-like tool to find distant relatives based on iterative protein profiles, can find much more remote matches.

Multiple sequence alignments Multiple sequence alignment is a natural extension to pairwise alignment to identify regions across a group of sequences. This strat- egy is normally used to ascertain the possible evolutionary relation between them. Sequence conservation is commonly used together with structural and functional information to discover e.g. the catalytic active sites of enzymes, nuclear localization signals in TFs or any other functional motifs in protein domains. In 1994 the most popular multiple alignment program was published, ClustalW [265], based on the progressive algorithm from Feng and Doolittle [83]. These class of algorithms add progressively less related sequences to an initial alignment composed of the most similar sequences with help of a guided tree as calculated from a matrix of pairwise alignments. Thus, alignment is sensitized to the first generation of pairwise matrix (p-distances). Most progressive algorithms rely on that finding the identities

50 by using the most similar sequences first, will minimize the alignment errors. New gaps are discouraged with penalties to reduce their presence as well as benefit established identities. Moreover, the initial choice of sequences for the alignment is based on weights assigned to the sequences from the branch length of the guiding tree. The guiding tree is built in ClustalW using the neighbor-joining method [231]. ClustalW has been the monopolizing alignment tool for many years due to the inexistence of other alternatives or lack of published work referencing any of its competitors. In the verge of the new millennium, new algorithms became available which outperform ClustalW in both time and accuracy. A summary of a benchmark published in protein alignments [205] is shown in Table 4.1. Iterative methods improve in the heavy dependence that the progressive algorithms show in the pairwise alignments calculated for the guided tree ma- trix. The strategy in iterative methods is therefore to optimize the initial global alignment by realigning randomly selected sequence subsets. The selection procedure of sequence subsets is reviewed in Hirosawa et al. [113]. After, the newly aligned subsets will be realigned back into the multiple sequence align- ment. This alignment, in turn, will be used in the next iteration. Muscle [73], a multiple alignment program with progressive and iterative strategy, was used due to its reasonable speed performance and its iterative search of a better lo- cal optima which lays a longer, stretched alignment for our groups of ERVs in Paper IV.

Table 4.1: Evaluation of different MSA methods. The running times are normalized to Mafft FFT-NS-2, the fastest of the methods. Adapted from Nuin et al. [205]. Program Year Accuracy Time (s) Simprot BAliBASE ClustalW [265] 1994 0.789 0.702 22 Dialign 2.2 [195] 1999 0.755 0.667 53 T-Coffee [204] 2000 0.836 0.735 1274 POA [161] 2002 0.752 0.608 9 Mafft FFT-NS-2 [142] 2002 0.839 0.701 1 Muscle [73] 2004 0.830 0.731 4 Mafft L-NS-i [143] 2005 0.865 0.758 16 ProbCons [68] 2005 0.867 0.762 354 Dialign-T [255] 2005 0.775 0.670 41 Kalign [160] 2005 0.803 0.708 3

51 Sequence motifs The majority of file formats for sequence alignment output are text-based. Many of them are specific for an alignment program. Aligned sequences are usually represented within an alignment matrix where the arranged rows rep- resent the sequences and symbols would appear in the columns if two residues are aligned. Inserted gaps between the residues are represented by ’–’ so that identical or similar characters are aligned in successive columns. And at the bottom of the alignment matrix, an extra row of symbols indicate the grade of similarity status between the two aligned sequence: "*", identical; ":", conserved substitutions (same colour group); ".": semi-conserved substitution (similar shapes) (see Figure 4.1). In the database searches file output, this bot- tom row is substituted by a middle row between the query sequence and the target where "|" is indication of similarity. For multiple sequences alignments, the last row may show in each column the consensus residue as determined per aligned column. There is also a large variety of programs used for visual- ization of alignments. The majority of them use colors to display information about the properties of the individual sequence elements e.g. nucleotide base or amino acid residue. The color of the latter often indicates also physico- chemical properties. The bioinformatics application used in Paper II, created by one of the members of our group, is commonly our choice of visualization tool. One of the most time-consuming tasks for a bioinformatician is the conver- sion of the different output formats between alignment programs. Nowadays, there are many bioinformatics programmatic interfaces (APIs) to handle this conversions as a well as web tools. Although it would be much simpler if the creators of the programs would pursue their integration and release convert- ers or web services to do that unpleasant task. We present several distributed interfaces to implement such converters in section 4.6. From a multiple sequence alignment, short conserved sequence motifs can be constructed. Highly conserved regions are isolated from the global mul- tiple sequence alignment and a profile matrix (i.e: position weight matrix, PWM) is constructed. For example, it is assumed that co-regulated genes usually contain the same DNA motifs (i.e: transcription factor binding sites, TFBS) in their promoters. DNA fragments bound by the same TFs are also expected to contain conserved motifs. The profile matrix of such an alignment is like the scoring matrices of the dynamic programming algorithm but the fre- quency counts are derived from the conserved region. The frequency counts can be normalized if desired. This newly formed matrices are used to look for new occurrences of sequences. There are several specialized databases that store PWMs of short aligned motifs (e.g. BLOCKS, Pfam, PRINTS, Prosite, etc.) or information of conserved protein 3D-modules (e.g. Class- Architecture-Topology-Homology (CATH) database, Structural Classification of Proteins (SCOP)).

52 To better visualize motifs we visually explored sequence logos in Paper IV from interesting candidate regions. The technique represents in a graphical logo the aligned sequences in which the degree of similarity is symbolized by the height of the nucleotide or amino acid letter [54].

Dot-matrix methods Dot plots are likely the oldest visual representation when comparing two se- quences [182]. The dot-matrix approaches to sequence alignment produce qualitative and simple results to analyze but time-consuming on a large scale. It is very easy from a dot-matrix plot to identify sequence features such as insertions, deletions, repeats, or inverted repeats. The dot-matrix plot, with a sequence in the top row and another in the left column, places a dot at position (i, j) if the character in column i in the first sequence is the same as character in row j in the second sequence. A dot-matrix plot between two sequences which are closely related will thus produce a single line in the diagonal of the matrix. There are some implementations that may vary the size or inten- sity of the dot to stress the degree of similarity between the two characters and more eleborated strategies using sliding windows and a threshold value to consider a match between two windows. In Paper IV we used a combination of these two in order to assess duplications in the ERV sequences analyzed. Retroviral sequences are plotted against themselves and related sequences to find regions that share significant similarities. A protein that is internally du- plicated by recombination will display as lines parallel to the main diagonal or perpendicular if an inversion occurred.

4.2 Classification, hypothesis testing and evaluation Phylogenetic trees, an unsupervised classification method based in a scoring matrix, was used in Paper IV. For a detailed description of their functionality and its application in this thesis, consult section 4.3. Moreover, unsupervised learning algorithms were used in Paper I for the preselection procedure of allergen templates used to match the amino acid query sequences. Although in Paper I we based our study in a support vector machine (SVM) classifier methodology to separate the decision boundary between the two classes allergen and non-allergen proteins or amino acids. SMVs are linear classifiers generalizing an optimal separating hyperplane. In classification problems, an optimal separating affine set (or hyperplane) can separate two classes and maximize this separation distance to the closest point from either the class (see V.Vapnik, pp. 33-41 in [90]). The novelty of SVMs is that it allows for a certain overlap between classes but minimizes the extent of this overlap which represents an improved discriminant model

53 for a realistic distribution of observations (see Figure 4.3). For a complete description of SVMs, see Hastie et al. [108].

xT β + β = 0 a) 0 b) T x β + β0 = 0

ε∗ 4 ε∗ 5 ∗ 1 ε ε M = 3 1 β M = ε∗ β 1 ε∗ 2 margin margin

1 M = β = 1 M β

Figure 4.3: Support vector machine classifiers. The decision boundary is an optimal separating hyperplane represented by a solid line, while broken lines bound the maxi- 2 mal margin of width 2M = β a) shows the separable case maximizing the marginal separation M. b) shows the non-separable case. The points indicated by an arrow are ε∗ = ε on their wrong side of their margin by an amount j M j; points on the correct side ε∗ = ∑ε ≤ ∑ε∗ have j 0. Here, the margin is maximized subjected to i constant. Hence, j is the total distance of points on the wrong side of their margin.

To understand what is the significance of a statistic test result, we will now introduce the hypothesis testing procedure. Statistical hypothesis tells us the significance of experimental results and provide us with valuable procedures to infer this knowledge from our observations. In the broad sense of the term, any test which would measure ability or potential could be subjected to in- clusion in this category. The procedure decides if a statistical hypothesis (e.g. the population mean μ = 10), known as the null hypothesis (denoted by H0 or NH), should be accepted or rejected in favor of another (e.g. μ > 10), the alternative hypothesis (H1 or AH). The decision is based on the value of a test statistic calculated from our observations. The possible values obtained for this statistic are divided into an acceptance region and a critical (or rejec- tion) region that contains the extreme values of the probabilistic distribution. To choose the formation of the critical region, it should be taken into account the nature of the alternative hypothesis H1. The critical region determines that if H0 is true, then there is a low probability α of the statistic falling into that extreme part of the distribution so the statistic is better explained if H1 is true. The critical region then consists of one set of extreme values of the test statis- tic (called a tail), and the corresponding test is a one-tail (or one-tailed). When the statistic falls into the critical region represented by the tail, then it is said to be significant at significance level α, and H0 is rejected in favor of H1.On the other hand, H1 may simply be ’μ = 10’ leaving an alternative two-tail test. The critical region then consist of two sets of tails on either side of the accep- tance region. If H0 is true in this case, the test statistic falls with probability

54 1 α 2 in each tail. Finally, the exact value of the test statistic having a value as extreme or more extreme than the calculated value x(ˆ P(Y|X ≥x))ˆ can be ob- tained. This is the p-value of the calculated statistic and corresponds to the exact level of significance at a particular level α. Statistical inference was used to test for the existence of dependence rela- tionship between the population of proviruses found and other input groups such as chromosomal length, number of genes (coding/non-coding) in Paper IV. For a collection of pairs of values (x,y) from two random (normal) vari- ables X and Y, the product moment correlation coefficient is defined as

Cov(X,Y) ρ =  (4.1) Var(X)Var(Y)

Covariance is, in this sense, a linear gauge of dependence. Covariance is sometimes treated like a measure of linear dependence between two random (normal) variables. When the covariance is normalized, one obtains a correla- tion matrix. If a perfect linear relationship between X and Y exists, then ρ = ±1. If the variables are totally independent then ρ = 0, but the converse is however not necessarily true. A confidence interval for ρ can be constructed using Fisher’s z-transformation. The Pearson product-moment correlation coefficient (PMCC, typically de- noted by r) is a test of goodness of fit for the best possible linear function describing the relation between two samples of two populations. If X and Y have a bivariate normal distribution, then the coefficient r of a random sample is an estimate of the population’s coefficient ρ. Based on a sample of paired data (Xi,Yi), the sample Pearson coefficient can be obtained as:

n ∑ = (X − X¯ )(Y −Y¯ ) r =  i 1 i  i (4.2) ∑n ( − ¯ )2 ∑n ( − ¯ )2 i=1 Xi X i=1 Yi Y

or in the form of products of the standard scores:    1 n X − X¯ Y −Y¯ r = ∑ i i (4.3) n i=1 sx sy

where X¯ is the sample mean and sx the sample standard deviation. The goodness of fit is used to describe the discrepancy between observed values and the values expected under the fitted model. There are several meth- ods to calculate goodness of fit. For example, in regression analysis, the co- efficient of determination (R2) and the lack-of-fit sum of squares can be used. Goodness of fit measures can be used in statistical hypothesis testing, e.g.to test for normality of residuals, to test whether two samples are drawn from identical distributions (Kolmogorov-Smirnov test), or whether outcome fre-

55 quencies follow a specified distribution (Pearson’s chi-square test). In Pa- per IV, we focused in testing for independence of two samples’ distribution (r = 0) and from the R2 value obtained, we calculate a confidence value (i.e: p-value). Note that acceptance of a null hypothesis does not necessarily mean that it is true, only that we have insufficient evidence to be confidence of rejecting it. Likewise, rejection of a null hypothesis only mean that there is strong evidence that it might be false. The structure of hypothesis testing is such that the null hypothesis H0 may be rejected when it is in fact true. This is known as a Type I error or an error of the first kind. The probability of a Type I error depends on the chosen significance level as it is equal to α. Furthermore, it is also possible for H0 to be accepted when it is in fact false. This is known as a Type II error or an error of the second kind. It depends on the alternative hypothesis H1 denoted by β for a particular H1.The probability of correctly rejecting H0 when it is false is 1 − β which is called the power of the test. For a two-class problem a dummy-variable approach can be taken and code our category space via binary Y values. Normal regression procedures can now be done within a numerical space. When linear regression is used, for instance, the model function don’t need to be positive so we can get strange artifacts that we might be suspicious about using as an estimate of the model generalization capabilities. In this situation, instead of the error measures of our classificator in a test set, there are alternative measures to assess the performance of binary classifiers. The observations that belong to the category we want to classify are defined as true positives (TP) and the remainder as true negatives (TN). As we stated before, a H0 may be rejected when in fact it is true and these cases are the false negatives (FN). If H0 is accepted when in fact it is false comprises the false positives (FP) rate. In biomedical classification jargon, the terms sensitivity and specificity are often used to characterize the power of our classifier.

• sensitivity (also known as recall) – the probability of predicting disease given true state is disease. TP Sensitivity = (4.4) TP+ FN

• specificity – the probability of predicting non-disease given true state is non-disease. TN Specificity = (4.5) FP+ TN

In biomedical classification problems, misclassification can delve into se- rious consequences. To account for this, usually, a matrix of equal losses is employed to introduce conservative decisions in the classification function. It

56 is probably worse to predict a FN (non-disease when it is true disease) than a FP (disease when it is true non-disease). Observation weighing tries to alter the prior probability on the classes. By varying the relative penalization of the class observed we increase the sensitivity and decrease the specificity of the rule, or vice-versa. A good model can detect many true features (it has high sensitivity or recall) and generates few false positives (it has high specificity or precision). The following equation for precision can be stated using the parameters above: TP Precision = (4.6) TP+ FP

Sensitivity and precision cannot be optimized at the same time. Our goal is to maximize TP and TN while reducing FP and FN but we risk overfitting our model so it retains poor generalization capabilities to detect new features that are not similar to the most common but still true positive. Therefore, a useful joint performance measure can be formed from this two in order to compare classifiers. The F-score: 2xPrecisionxSensitivity F-score = (4.7) Precision + Sensitivity

Noticeably, the p-value of a test at a certain critical value α is the exact significance of the False Positive Rate (FPR). Therefore, a p-value of 0.05 implies a 5% chance of falsely rejecting the H0. This means that if we perform a test n = 10,000 times, we expect to have 500 false positives (αxn)ata5% FPR. The fraction of false positives is defined as: FP FPR = (4.8) FP+ TN

While FPR determines the rate of classifying false positives in each test (i.e: error accumulated for every performed test when multiple testing), the False Discovery Rate (FDR) is the rate at which we include false positives among those we find to be significant predictions (TP). FP FDR = (4.9) TP+ FP

The concepts afore mentioned are summarized in form of a table (Table 4.2) to give a better graphical understanding of the relationships between them. As we have explained, it is rarely possible to find an all-time good classi- fier which is both sensitive and highly specific at the same time. Therefore, to plot visually the trade-off between these two values, a receiver operating characteristic curve (ROC) is used to assess the significance of the classifica- tion output (Paper I). This type of curve is a commonly used plot for assessing

57 Table 4.2: A confusion matrix. Additionally, the relationships among the terms that are discussed in this section are presented. - True Positive, - True Negative, - False Positive, - False Negative, - Negative Predictive Value, - False Positive Rate. For fur- ther explanations read the text of the section.

True Positive Negative FP Positive TP → Precision (Type I error, p-value) Predicted FN Negative TN (Type II error) ↓↓ Sensitivity (=Recall) Specificity ↓ FPR the model’s tradeoff between sensitivity (i.e: fraction of true positives or TPR) and specificity (or its inverse, FPR) rates as we vary the parameters of a classi- fication rule. The area under the ROC curve is sometimes called the c-statistic and it is used as a quantitative summary of the performance among different methods [257] (Figure 4.4). It can be shown that the area under the curve is equivalent to the Mann-Whitney U statistic (or Wilcoxon rank-sum test), for the median difference between the prediction scores in the two groups [106]. A proof that a ROC curve is independent of the class distribution or error cost was given in Provost et al. [222]. Metrics such as accuracy and precision will change if F-scores (4.7) are used as measures for a comparative assessment and the distribution is altered. Conversely, given the ROC is separated from the cost context [222], we can just decide to set a cutoff based on the accepted number of errors.

4.3 Phylogenetic analysis is the study of relatedness between entities such as species or genes. In genetics, phylogenetics tries to infer the evolutionary history based on the shared history available, i.e. the changes and similarities in the molec- ular data. Therefore, phylogenetics uses sequence alignments to measure evo- lutionary change by utilizing a score function. Phylogenetic trees are the rep- resentation of the particular inferred event history in the evolution of two or more sequences. Interpretation of phylogenetic trees are used frequently to as- certain evolutionary relationships between orthologous genes represented in the genomes of divergent species. The study of homology across species and paralogy within species are of utmost importance to draw conclusions about

58 Figure 4.4: A ROC curve graphical plot. Grey points represent classifiers: random, perfect, hypothetical classifier C and its inverse C’. The C’ classifier is a result of inverting decisions made by C. Two hypothetical ROC curves are shown. The ROC A curve represents better classificatory quality than the ROC B curve. The point (0,1) corresponds to a perfect classifier and the point (1,0) represents a classifier that it is always wrong.

genome evolution. For this purpose, it is assumed that evolutionary distance is reflected in the degree of divergence between each sequence. A high sequence identity percentage would suggest that a young recent common ancestor exists whilst a low identity would suggest the opposite instead, i.e. an older ances- try. This approximation reflects the "" hypothesis saying that the coalescence time since two genes diverged can be calculated and used to extrapolate evolutionary history assuming a constant evolutionary change in both mutation and selection rates across lineages (i.e. random drift). Complex phylogeny methods account also for the rates of DNA repair or the possible distinct functional conservation of specific genomic regions, and even allow to vary the evolutionary rate on each branch of the . These methods produce better estimates of coalescence times generally. Alignments are therefore used to draw evolutionary relationships by helping to construct phylogenetic trees [82]. Iterative and progressive multiple align- ment techniques are used in Paper IV to produce a phylogenetic tree based on the Neighbor-joining (NJ) method [231] as implemented in ClustalW [265]. The NJ method transform the sequences into pairwise distances to weigh the differences between the sequences and thus, weigh the lengths of the branches.

59 It follows the minimum evolution criterion in a star decomposition of the tree to find the best initial tree and therefore, it is convenient to generate the first tree in a heuristic search. Heuristics are commonly used in phylogenetic tree construction due to the optimal tree choice is a NP-hard [82] problem. Phylogenetic analysis is a classificatory problem in the field of unsuper- vised learning. The expert biologist just have to wait until the tree is con- structed to analyze the results. Although we admit that alignment adjustments and other choices of parameters make the process more interactive. Bootstrap- ping is an approach to assess quality of classification. How can be we com- pletely sure that our genes cluster (or not) together? In phylogenetic boot- strapping, all the sequences are drawn with replacement from the original training set and the phylogeny is reconstructed. A number of training pseudo- replicate trees are constructed with repetition (let’s say n = 1000 like in Paper IV). All the trees resulting from a bootstrap replicate are evaluated usually by a consensus tree. Consensus trees represent common features among two or more replicate trees. Consensus trees, in the mathematical sense, are not hypotheses of a phylogeny as are the initial (or replicate) trees, but rather a summary of such hypotheses. Therefore, consensus trees should not be inter- preted as primary hypotheses but just to assess classification. There are two most widespread consensus methods, the strict consensus and majority rule consensus. In the end of the replication bootstrap procedure, the final results estimations are not simple numbers but a number of trees evaluated usually by a majority-rule consensus tree. The majority rule consensus includes in the final bootstrapped result those clades that are present in more than 50% of the replicate trees. The edges of the final bootstrapped tree are usually labeled with the percentage of the replicate trees in which they are present. Interpretation of the bootstrap values is not comparable to statistical significance values used in statistical tests. All bootstrap proportions below 50% are nonsensical as they are leftovers from the other branches showing up in high proportion of replicates.

4.4 Databases Flat files is the most common storage facility in a computer. Constrained by size, files offer a sequential access to information. Simple writing-reading policies and non-hierarchical organization have been drawbacks that prevented the general use of files as resources for organizing information in a structured way. To cover the weaknesses of flat file storage, databases, emerged as inte- grated hierarchical collection of records. Databases classification could be is- sued based on the type of content they provide but other classifications meth- ods rather look at the database model. Between the existing models, the rela- tional model is still the most frequently used today.

60 Relational databases In 1970, Edgar Frank Codd, a British computer scientist, invented the rela- tional model for database management while working for IBM [49]. The rela- tional model is the theoretical basis for the relational databases of today. Codd proposed a set of thirteen rules (numbered zero to twelve and therefore, called Codd’s 12 rules) designed to define what is required from a database man- agement system in order for it to be considered relational [47, 48]. Although IBM failed to developed a commercial product first, many other companies, such as Oracle, converted the relational model foundation into the wealth of their businesses. Codd produced the 12 rules as part of a personal campaign to prevent his vision of the relational database being adulterated, as database vendors scrambled in the early 1980s to repackage existing products with a relational cover. Most of the bioinformatics annotation systems of today are relational (see section 4.5). We use a relational database manager system to store the results of GeneFinder (Paper III) and RetroTectorc (Paper II and IV).

Other database systems There are several different types of databases in the database jargon: analyt- ical, operational, distributed, federated, document-oriented, object-oriented, real-time, etc. But recently, the database system that has captured most of the attention when applied to Life Sciences data is the warehouse systems. The concept of data warehousing dates back to the late 1980s when IBM re- searchers Barry Devlin and Paul Murphy developed the "business data ware- house". The idea behind the data warehouse system is the old log systems for organization of warehouses where all data was stored client/subject-oriented, and kept stable as more data was added but data was never removed [117]. This enables management to gain a consistent picture of the business [116]. In a data warehouse, the data extracted from the various operational databases of an organization becomes the central source of data and it is screened, edited, standardized and integrated so that it can be used by managers and other end- user professionals throughout an organization. Data warehouses are charac- terized by being slow to insert into but fast to retrieve from because their management of keys and star-like geometry. BioMart [2, 102], the data ware- house that have been successfully applied in Life Sciences, provides all these features (see section 4.5) and facilitates the conversion from existing rela- tional databases to a BioMart-compliant architecture. Therefore, an explosion of "bio-data marts" have recently occurred due to their usefulness, friendliness at different levels of user expertise, seamless integration of data, as well as re- mote access to data through special web services protocols (see section 4.6).

61 Other systems Nowadays, with the appearance of high-performance systems improving both calculation and storage equipments, the requirements for speed have increased at the same pace as data volume. Old relational database systems perform their logical operations and reliable transactions not efficiently enough for the latest systems capabilities and requirements. At the same time, many of the newly developed database systems integrate images that are either stored in a large amount of space or rendered in real-time, needing for a high-performance computing facility and storage and retrieval systems that are able to cope with the user requirements. Storage of data in binary files could be a temporary solution which could give a decent retrieval performance but still the data is kept in large and encoded file volumes. New storage systems are increasingly under development (see the NoSQL movement [5]) due to the requirements of the big companies and the services offered by them. Cloud computing is a very recent concept to refer to the abstraction from the users over the underly- ing computer infrastructure connected like an Internet "cloud". It could be the panacea for the future of computer storage as long as the quality of data, secu- rity and privacy is guaranteed. Cloud computing enables users and developers to utilize services without knowledge of, expertise with, nor control over the technology infrastructure that supports them. It is, almost literally, operating the service in a cloud.

4.5 Information sources, annotations and ontologies Knowledge of biology has been subjected to storage since in 1965 Margaret Dayhoff published her Atlas of Protein Sequences. She collected all the known protein sequences and, as a service to the scientific community, made them available to others in 1965 in this small volume, which contained sequence information on 65 proteins. It served as a valuable reference work for scientists all over the world and laid the foundation for use of sequence information. After a certain molecular entity such a sequence, a protein structure or a cell is known, annotation is the process of marking features of biological im- portance on it. The first genome annotation software system was designed in 1995 by Dr. Owen White, for the analysis of the genome of the first organism to be sequenced, the bacterium Haemophilus influenzae. Public availability of databases and submission of published data is of ut- most importance for the research community in order to facilitate continuation studies and a necessary service. Several of the databases described in the next subsections are integrated or queried within the studies presented in the Papers of this thesis.

62 Sequence databases The main sequence databases are hosted simultaneously at three institutions around the world which synchronize on a daily basis. GenBank is hosted at the NCBI (USA), EMBL at the EBI (UK) and DDBJ at the CIB-DDBJ (Japan). GenBank resources can be seamlessly accessed using eutils [199]. Entrez Pro- gramming Utilities are tools that provide access to Entrez data outside of the regular web query interface at the NCBI web site. Eutils are used in Paper III.

Genome browers: Ensembl and UCSC EnsEMBL [87], one of the biggest successful projects from the European Bioinformatics Institute (EBI) at Hinxton (UK), produces, maintains and releases (in a database format) automatic annotation of selected eukaryotic genomes. The UCSC genome browser [157] is hosted at the University of California at Santa Cruz (UCSC), and is by far, the largest counterpart to EnsEBML. The novelty with both of the systems, is their centralization effort toward the storage of all kinds of Life Sciences data, which produces an enormous growth in their databases size and therefore it becomes not practical to reproduce them in local servers, and the highly-specific and automatized annotation pipelines created by their teams. They use known genes, mRNAs, EST, and protein data to improve genome annotation and obviously, annotate new genes in species being sequenced de novo. Noticeably, this annotation technique improves by far other renowned ab initio detection tools (e.g. Genescan [39]). Both of the browsers release their data publicly available and offer it in different downloadable formats as well as provide the user with new ways to control and manage the data. Available at EnsEMBL there are DAS data sources for every species annotated. EnsEMBL can act also as a DAS-enabled client [77] of external data sources, which we exploit to demonstrate our DAS- ERV server capabilities in Paper II. There are also a programmatic interface (API) and a public database server available. Conversely, UCSC offers lit- tle programmatic help to integrate their public data by external applications. Their efforts have taken the direction toward addressing their own systems of data retrieval [141] or collaborating with web systems like Galaxy [259]. Their DAS system implements an outdated protocol and offers old-versioned data. Therefore, for Paper IV we have to download their pre-calculated data using [141] or accessing their open MySQL server by issuing directly struc- tured query language (SQL) commands.

Uniprot As described in [268], UniProt maintains a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase. Extending their resource, there are cross-references and querying interfaces

63 freely accessible to the scientific community. For example, we integrated their DAS server in Paper II. UniProt is produced by the UniProt Consortium which jointly merges efforts of groups from the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. UniProt is updated and distributed every three weeks and can be accessed for searches or downloaded at UniProt [8].

Biomart BioMart Central Portal [2, 102] offers a free data-warehouse solution to ac- cess a wide array of freely available biological databases. These include major biomolecular sequence, pathway and annotation databases such as Ensembl, Uniprot, Reactome, HGNC, Wormbase and PRIDE, which are integrated in a seamless and friendly data-warehouse way. Moreover, the web server fea- tures programmatic access through a Perl API as well as RESTful and SOAP oriented web services. We use these interfaces to query, up to date, Ensembl, Uniprot, and HGNC data in Paper III.

Gene ontology Ontologies are controlled vocabularies to represent and define concepts as well as their relationships of specific domains of knowledge. In Life Sciences, Open Biomedical Ontologies (OBO) [241], is the largest effort to create vari- ous controlled vocabularies to be shared across different biomedical domains, applications and projects. But the best renowned ontology in Life Sciences is Gene Ontology (GO) [94]. GO provides consistent descriptions of gene products in different databases by means of three different structured vocabularies that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. GO is used in both Paper III and Paper IV to try to provide a consistent and explanatory classification of the genes under study.

HGNC The variety of databases and identifiers created to store data coming from as many biological sources, results in a plethora of names to call the same piece of sequence. In order to standardize these issues, a non-profit body, called the HUGO Gene Nomenclature Committee (HGNC), was created to regulate the

64 adoption of gene names and symbols (short-form abbreviation) in the human genome, and store them in a database [38].

PubMed PubMed is one of the most useful database services for the research commu- nity. Started in 1997 at the NCBI as a continuation to the Medline project (started in 1966) by the US National Library of Medicine. It includes nowa- days over 18 million citations from MEDLINE and other life science journals.

Disease databases Online Mendelian Inheritance in Man (OMIM) is a curated effort to create a database of human genes and genetic disorders developed by Victor McKusick and administrated by staff at Johns Hopkins University, Baltimore, MD [105]. Among various sources of information, it describes throughly genes whose involvement in several Mendelian traits causes disease. OMIM is integrated as a data source within GeneFinder in Paper III. Conversely, Online Mendelian Inheritance in Animals (OMIA) [165] com- plements OMIM in the sense that it is a database of genes, inherited disorders and traits but specialized in more than 135 animal species (other than human and mouse). OMIA is authored by Professor Frank Nicholas of the Univer- sity of Sydney, Australia, with help from many people over the years. The database contains textual information and references, as well as links to rele- vant PubMed and Gene records at the NCBI.

Allergen-specific databases Public allergen databases are available containing different numbers of molecules listed as allergens. The related information about every allergen varies from a simple identifier of the protein in a general protein database such as Uniprot to structural-domain information of the molecule or experimentally verified IgE epitopes. The allergen database created for Paper I contained 762 amino acid se- quences which were used to train, validate and test the SVM classifier. The sequences were retrieved from six publicly available sources: the list man- tained by the IUIS Allergen Nomenclature Sub-Committee [13], AllergenOn- line [112], The Allergen Database, The Allergen Sequence Database [93], The Protall Database and Allergome [184]. The records were curated and only deposited if they had related published reports. Sequences shorter than 100 amino acids were discarded to avoid incorporating truncated proteins devoid of sensitizing or cross-reactive parts. The total dataset and the different subsets of every source were made publicly available at [174].

65 The allergen database was updated after our Paper I study was published. The current allergen data set is compiled from the AllergenOnline repository on allergen proteins [10], based on the January 2007 release. A written permis- sion to use AllergenOnline for the purpose of updating the detection algorithm devised for the Paper I was kindly provided by Prof. Richard Goodman, Uni- versity of Nebraska, Food Allergy Research & Resource Program, Lincoln, NE, USA. An updated system for the version of the risk assessment system presented herein was revised and publicly released at [175].

Repeat oriented sources The RepeatMasker [240] track at the UCSC genome browser [141], which builds on Repbase [133], was downloaded for Paper IV in order to study in- terspersed elements (i.e: LINEs, SINEs) relation against ERVs in dog. Pre- Masked genomes are provided courtesy of the UCSC Genome Bioinformat- ics group. RepeatMasker, a tool for repeat annotation, makes use of Repbase which is a service of the Genetic Information Research Institute. Repbase is a comprehensive annotation catalogue of repetitive element consensus se- quences. Repbase drawback is that annotates repeated sequences by following a nonsystematic nomenclature and containing rather little retroviral informa- tion but our study was not affected. Retroviral sequences are often fragmented after a Repbase classification so it is hard to see the whole proviral structure although this seems to be caused by a RepetMasker artifact which fragment models into domain annotations and thus create a many-to-one relationship with Repbase full-length entries [88]. However, Repbase is a useful reposi- tory once these limitations are understood.

4.6 Web services in bioinformatics A broad array of technologies specialized for data distribution such as XML Remote Procedure Call (XML-RPC), Distributed Annotation System (DAS), Representational State Transfer (REST), Simple Object Access Protocol (SOAP) and Extensible Messaging and Presence Protocol (XMPP) have emerged recently because the decentralized nature of the sources of Biomedical data [251] and have developed biomedical services under the umbrella of several large projects like EMBRACE [4] or BioSapiens [1]. For Paper II, we used DAS [70] as a variant to REST because its simplicity, wide support [86, 221] and restrictive well-formedness in its XML responses to all kinds of life sciences data [220]. Both REST and DAS could be seen as a subset of XML-RPC which is in turn a subset of its successor SOAP. Protocols that are based on HTTP (e.g. REST, DAS, SOAP and XML-RPC) allow for easy communication through proxies and firewalls compared to pre- vious remote procedure protocols (e.g. CORBA and Java-RPC), which favors

66 integration of resources between organizations. DAS is a good option to re- place most of the data stored in flat files and databases because it is easy to implement and there are already many public clients, servers and libraries. Therefore DAS is well adapted for serving pre-calculated data, while SOAP is frequently used to request remote calculation of data and message exchange. The EMBRACE technical recommendation [75] adopted SOAP as model for a communication protocol and message exchange. The data types in SOAP are based on fixed contracts between servers, using Web Service Description Lan- guage (WSDL) and XML Schema Definition (XSD) (see next section). SOAP specifies a strict envelope/header/body message structure which is serialized by a client peer and deserialized at the server side. Because of its verbosity in the XML format, SOAP can result in a performance reduction when trans- mitting large messages. Typical use cases in bioinformatics are big genome sequences or analysis results. In general, the usage of XML syntax in mes- sages has both advantages and drawbacks. Whereas it facilitates error detec- tion and avoids interoperability problems such as byte-order (endianness), the processing time is increased in the serialize and deserialize steps and the logic to achieve the conversion can be cumbersome. Several options to overcome these problems are under discussion and several alternatives are emerging; at- tachments, MTOM [284] or external links are possible improvements for the performance when large data files are transmitted. On the other hand, the stateless nature of the HTTP protocol may provoke time out responses when requests take too long to process, i.e. a typical grid use case of very intensive and lasting calculations. Asynchrony is necessary in those situations but the inherent statelessness of every request requires to implement some kind of polling mechanism to ensure that the results are re- trieved back. An asynchronous response for DAS which also allows longer lasting computations and polling for results has been implemented within the Dasobert Java library [3]. A protocol based on HTTP cannot rely on it to manage state information, which in turn must be handled by the application layer, and until now DAS, REST or SOAP cannot be considered inherently asynchronous. A recent advance in the sense of guaranteeing asynchronous web services in the distribution of data, has been the development of an open source XMPP implementation by a group at Uppsala University toward its use with Chemoinformatics applications [285]. XMPP was used initially as an instant messaging protocol which usage has been broadened to the whole area of message oriented middleware. XMPP is an open standard promoted by several private companies. The standard describes a flexible and extensible message passing mechanism which can be secured and encrypted. Other technologies, such as VoIP and file transfer signaling, have been based on XMPP. Presence data is scarcely sent during the execution of distributed bioinformatics applications and instead users could be provided with real-time information about service availability. In addition, XMPP could

67 avoid wsdl negotiation due to the fact that end-point services provide their own XML-Schemas for sending/receiving information.

XSD and WSDL An XML Schema describes the structure of an XML document in a rich, strong data-typed and extensible manner. XML Schema is an XML-based al- ternative to DTD. An XML schema describes the structure of an XML docu- ment so we use it to support the description of our data types and documents in Paper III. The XML Schema language is also referred to as XML Schema Definition (XSD). WSDL (Web Services Description Language) is an XML-based language for describing Web services and how to access them. WSDL is a document written in XML. The document describes any Web service like the GeneFinder server in Paper III. by specifying its location and the operations (or methods) the service exposes. Programmers can use WSDLs as a type of contracts be- tween their programs and the services they are exploiting. This two techniques, recommended by EMBRACE WP3 [75], enable for automatic validation of syntactic correctness of the messages.

Web Services Repositories The EMBRACE consortium has proposed the approach of Service Oriented Architectures (SOA) and web-services as defined by the W3C [9] to increase the general interoperability of the resources. Service-oriented architectures define standard interfaces and protocols that allow developers to encapsu- late information tools as services. SOA is an attempt to develop yet another means for software module integration. Rather than defining an Application Programming Interface (API), SOA defines the interface in terms of protocols and functionality. An endpoint is the entry point for such an SOA implemen- tation. Web services can implement a service-oriented architecture by playing one of two roles: service provider or service consumer. Implementations of SOAs using web services standards can be achieved by using a plethora of web service standards as explained in this section before). One of the main drawbacks with SOAP, DAS and REST technologies is that users must be aware of the existence and location of the services before they are consumed. The EMBRACE contribution to improve web service discov- ery was to release publicly the EMBRACE registry [217] which is a collec- tion of pointers to (mainly European) bioinformatics web-services, all freely available. Other parallel contributions from various consortiums have been the BioMOBY central service registry (BioMOBY Central) for BioMOBY- like services. The main advantage in this repository is that web services are described based in service description ontologies. Finally, the DAS central registry [67, 221], which we exploit in Paper II, is a central repository for

68 DAS resource discovery with possibilities for automatization of such a task. It provides information on various DAS sources and their annotated data types, grouping of these sources into coordinate systems, as well as validation and testing of DAS sources. Currently it contains a listing of more than 450 DAS sources.

4.7 RetroTector c The program RetroTectorc [246] was used to screen the dog (canFam2) genome [169] in Paper IV and the human (hg18) genome in Paper II. RetroTectorc attempts to discover presumably old integrations by recognizing well described retroviral motifs attenuated under the evolutionary clock. Motifs found by RetroTectorc are, among others, LTRs, the primer binding site (PBS), motifs for three structural proteins that form the inner part of the virion: MA (matrix), CA (capsid), NC (nucleocapsid), the polyprotein cleaving protease (PR), reverse transcriptase (RT), integrase (IN), dUTPase (DU) recognizably related to cellular and herpesvirus dUTPases acquired by different classes of retroviruses, surface unit (SU), transmembrane protein (TM) and the poly purine tract (PPT) that escapes the RNase H degradation during reverse transcription. Following the motif discovery stage, RetroTectorc annotates, collects and stores different statistics about retroviral identified chains and putein sequences into a database (see Section 4.4).

69

5. Present investigations

This chapter summarizes the findings in the four papers next to some back- ground for the specific projects and some recent updates of the current status. Moreover, future prospects are described after the presentation of each paper.

5.1 Paper I: EVALLER - a web server for in silico assessment of potential protein allergenicity. Background The main application for computational risk assessment of allergenicity is to prevent unintentional introduction of allergen-encoding transgenes in ge- netically modified (GM) food crops. Computational screening for potential allergenicity/IgE-cross reactivity is both expedient and inexpensive, which may help to reduce laboratory costs. Moreover, many of the reported algo- rithms can accomplish good discriminations between known allergens and presumed non-allergens. The aim with these methods is to identify features inherent in the allergens amino acid sequences that are important for IgE-cross reactivity (IgE-epitopes) and possibly also de novo sensitization (T cell epi- topes, preferably specific for TH2). It is important to acknowledge that bioin- formatics and machine learning procedures can improve the performance of the existing current FAO/WHO criteria mentioned earlier (sequence identity > 35% over 80 amino acids (aa) or at least one identical motif with peptide length of six aa long) to match known allergens. The FAO/WHO criteria may generate many false positives, e.g. the six aa criterion, although many of the reported classifiers improving it may as well be overly optimistic in their pre- dictions. The majority of hitherto characterized IgE-epitopes are linear (continuous) peptides. The cross-reaction of sensitized individuals to another allergen is be- lieved to depend largely on the properties inherent to the protein that caused elicitation which are mostly discontinuous. Therefore, our approach consists in mimicking one or several structural parts of an IgE-epitope because struc- tural epitopes can also be piecewise linear (i.e: composed of several scattered short linear peptides) when linearly displayed, which then are brought together in the structural formation of the allergen protein.

71 Results and Discussion TM Paper I describes the essential features of the EVALLER web server, which is aimed at bioinformatics risk assessment of proteins in the context of aller- genicity (type I hypersensitivity). TM EVALLER core algorithm (DFLAP) is built on principles and directions described in Soeria-Atmadja et al. [245]. DFLAP is an SVM algorithm trained from two different datasets. The first dataset is a filtering procedure designed to select among a large range of overlapping allergen-derived peptide motifs, by comparison of segmented allergens from a large set of proteins with low probability of being allergenic. The second dataset utilizes selected feature motifs assembled from alignments of FLAPs (Filtered Length-adjusted Aller- gen Peptides) from a dedicated sets of allergen proteins. The decision system, built on top of this trained SVM which is implemented with an open source library, permits interrogation of query amino acid se- quences for allergenicity potential through assessment of their identity values matching a FLAP set, as assembled by the aforementioned filtering procedure. Validation test procedures have demonstrated high selectivity of the DFLAP system within protein families known to hold both allergens and non-allergens. Moreover, a low proportion of the SwissProt repository was assigned as allergens, indicating a low false discovery rate. Generally, it is difficult to rank allergen detection algorithms based on performance since they have been designed and evaluated their performance by means of different datasets. Estimation of the ratio of allergens in the entire SwissProt database have become the accepted procedure to compare the assessment of specificity between different methods due to the scarcity of reliable test sets with putative non-allergenic proteins. TM The simplest EVALLER output level involves a scoring output and a four- category classification ("Presumptive non-allergen", "Equivocal but tentative non-allergen", "Equivocal but tentative allergen" and "Presumptive allergen") to inform on the degree of assessment certainty. To further support risk assess- TM ment in allergology EVALLER also provides a more comprehensive synop- sis of its assessment on a query protein’s potential allergenicity. To the best TM of our knowledge, EVALLER is the only application with textual/graphical output including information on the best match(es) against allergen-related motifs, including their respective location in the query proteins. This feature should be most relevant for allergens, but is also allowed to operate on proteins falling in any of the two categories that signify low probability of allergenic- ity, as an extended support to the researcher in ambiguous cases. The latter feature involves colour-coded graphic representation of amino acid sequences matches to known allergens. The interface is dynamic and enables examina- tion of detailed motifs through zoomed views. TM Finally, we benchmarked EVALLER against two other servers (implementing the FAO/WHO criterion and a variant of SVMs combined

72 with physicochemical properties). For this test, we used several potential allergens from the polcalcin family and other homologues of the calmodulins and calmodulin-like protein families for the presumable non-allergenic TM category. EVALLER performed satisfactorily being both sensitive and specific respect to the current knowledge in this protein dataset. However, unequivocal conclusions on the likely non-allergenic proteins were difficult to ascertain.

5.1.1 Future prospects SVMs application for allergenic risk assessment classification seems to gain attention from the community [196]. However, we would like to make our system more modular to allow testing of new machine learning classification algorithms. The AllergoInformatics group in Uppsala is committed to continue updat- TM ing our risk assessment application with new developments of EVALLER . TM There is already an EVALLER 2.0 system [35, 175] based in synchronous web services, and a forthcoming version based on XMPP [285] with asyn- chronous web services which will allow for multiple-protein testing. We plan also to improve the core algorithm performance which is executing sequen- tially using a MATLAB engine at the moment. The amount of allergens with experimentally determined tertiary structure is much lower with respect to those with a known amino acid sequence. There- fore, new efforts toward experimentally determine allergen structures should be taken. But most of the computational risk assessment protocols have been based on similarity searches at the amino acid sequence level. We should work toward combining information from structures and linear amino acid se- quences [89] that can lead to the discovery of new epitopes, possibly leading to increased specificity when cross-reactive compounds are analyzed.

5.2 Paper II: Annotation and visualization of endogenous retroviral sequences using the Distributed Annotation System (DAS) and eBioX Background To date, efforts to create detailed ERV catalogues are almost non-existing, restricted to few species, and the quality of the data is certainly not very de- scriptive regarding the structure of the provirus and its different parts. Namely, some of the existing efforts reported are HERVd [210], RetroSearch [7], which compiles only annotated HERV ORFs in the human genome [278], as well as the RepeatMasker [240] track at the UCSC genome browser [157], which

73 builds on RepBase [133] annotations. RepBase is a repeat-based text database and its main drawback is that it annotates repeated sequences by following a non-systematic nomenclature and containing rather little retroviral informa- tion. With focus on this study, for example, RepBase annotated retroviral se- quences are often fragmented making it difficult to observe the whole proviral structure. The Distributed Annotation System (DAS) is a widely used lightweight net- work protocol for sharing biological information [6, 31, 128, 207]. It has been proposed by the BioSapiens Network of Excellence technology group [1] as their technological choice to distribute data. We thought that the combination of both the protocol and our quality ERV annotations in a service (DAS-ERV) would contribute to the distribution of endogenous retrovirus data belonging different species and also benefit differ- ent research groups that would be able to remotely consult it.

Results and Discussion We have extended a DAS server implementation called ProServer [86], with an special module to support human endogenous retroviral annotation stored in an underlying MySQL relational database. The visualization style was cat- egorized into types according to the different parts that a retrovirus may incor- porate. To test the well-formedness of the protocol messages and the integra- tion with a standard DAS client, we added our data source to the EnsEMBL genome browser and our application called eBioX. We next used this engi- neered system with a program to correlate the transposon-free regions (TFRs) in the human genome published by Simons et al. [238]. From the 856 regions that they estimated to be TFRs separated by their lengths in 10 kb and 5 kb windows, we observed 15 HERVs which overlapped over as many as 5 kb TFRs annotated regions distributed in 7 different chromosomes. All the 10 kb regions were found to be HERV-free. Simons et al. [238] based their study on RepeatMasker annotations extracted at the UCSC genome browser [157], ev- idencing the existing under-annotation of ERVs in such resources. Therefore, we demonstrated the power of our system and our analysis to distribute and integrate annotations as well as accelerate research. Apart from the study of TFRs presented here, the DAS-ERV service has proven to be useful as we have used it extensively in Paper IV to retrieve ERV data while comparing some of the proviruses found in dog with human data.

5.2.1 Future prospects We demonstrated the DAS-ERV service usability, performance and structured query interface on retrieval of pre-calculated data in a distributed way. Our plans are to release forthcoming species genome annotations with this system (Paper IV). This may be followed by the implementation of a sequence server

74 where retroviral sequences can be retrieved next to their annotations. This would help tremendously research and future phylogenetic studies by con- tributing with a distributed and curated source for quality assessed retroviral sequences. Experimentally, DAS was effectively used to collate ERV and TFR data. With a main genome browser implementing this technology, and the support of a large European consortium, we therefore look forward to the establish- ment of DAS as a mature technology in the bioinformatics world. Further- more, our bioinformatics rich-client application, eBioX, is now capable with this new DAS extension of accessing distributed data to perform data analyses locally.

5.3 Paper III: GeneFinder - in silico positional cloning of trait genes Background Positional cloning of trait genes is a very laborious task. Studies can narrow down to an area of association or linkage up to several hundreds of kb which contain not an obvious mutation or candidate gene for the trait controlling a phenotype. Random recombination of haplotypes across members of the pop- ulation is necessary for delineating better the area responsible for the trait but it may not be possible and or feasible. At the same time, the rapid growth in the amount of information related to gene function accumulated across dif- ferent organisms is increasing the difficulty in managing and retrieving that information from disperse resources. Not to mention the increase in complex- ity when the number of areas to investigate are multiple in cases of complex diseases. For that reason, Prof. Leif Andersson at Uppsala University pro- posed the creation of a special purpose tool that would aid in the compilation of data relevant to genetic studies and help to ascertain the possible causative regions as part of one of the initial test cases to the EMBRACE’s test case work package (WP4) board. WP4 addressed the problem by contributing with an user-friendly application called GeneFinder which we deem to be flexible to extend with new functionalities.

Results and Discussion GeneFinder is an application based on a client-server model. The server part is a pipeline which schedules and calls in a synchronized way to the different chosen resources holding information about e.g. gene function, disease condi- tions, gene expression, latest biomedical reports; by exploiting different web service technologies. A user can both enact the pipeline and retrieve the results from a web site implementing the client interface to GeneFinder. The server

75 interacts in turn with this web client through a web service implemented fol- lowing the recommendations of EMBRACE’s WP3 for web services remote procedure calls which is a basic polling mechanism. Once the pipeline ex- ecution is completed, and all the data from the different resources have been gathered, a ranking procedure of the genes present in the trait locus will be per- formed. This is a specially devised ranking algorithm which uses the fact that biological resources reference each other. The idea behind it is that a larger number of references to a gene confer it with a greater importance in the locus area. The textual information can also be mined accordingly to the tissue and keywords the user has previously established. This will improve certain results if the terms occurred at a higher frequency. The ranking algorithm is normal- ized in a logarithmic scale to a 0-10 range for the better understanding of the scored output. When the ranking task is finished, every result is deposited in the database the web client has access to, then the web client will display the results to the user. A web questionnaire system has also been developed to provide feedback to the user. GeneFinder was tested in the ridgeless phenotype locus reported by Salmon Hillbertz et al. [232]. With the resources that GeneFinder currently query, we were unable to find a structural mutation such as the causative of the ridgeless phenotype which is a 133 kb duplication that includes four complete genes and the 3’ end of another. However, GeneFinder was able to highly rank two of the three fibroblast growth factor genes (FGF3, FGF4 and FGF19) included in the area. Noticeably, the highest ranking gene, CCND1, is unimportant for the gene dosage alteration in the locus because it is truncated. Thus, the main information about the disease locus was efficiently retrieved by the application. Performance statistics are also kept in a special database. Averaged run- time for GeneFinder can be estimated to a minute per every 20 kb of sequence to be searched, if all available resources are selected in the query.

5.3.1 Future prospects GeneFinder server part is going to be re-implemented as a threaded applica- tion. This is a convenient technology change because it will make GeneFinder independent of any pipeline manager software or vendor. Furthermore, it will improve the portability of the application. The ranking system has been a good choice due to speed reasons and sim- plicity but we do not discard to use new machine learning algorithms to im- prove the way the data is combined and knowledge is extracted for narrowing down the causative mutation in the region. New ways of visualizing ranking can be implemented. The web client is user-friendly but text oriented. Star diagrams could be used to improve the understanding of the different contributions from the different resources to the final score.

76 Regarding visualization, new ways of visualizing the locus are going to be implemented as this is one of the frequently requested feature by users. Also, GeneFinder users have requested a number of new resources to be added if possible, namely, candidate SNPs, pathway information, protein-protein in- teraction and repeats found. These data sources will be extended following a strict order of prioritization.

5.4 Paper IV: Data mining of the dog genome reveals novel Canine Endogenous Retroviruses (CfERVs) Background In Paper IV we present a detailed catalogue of the Canis familiaris endoge- nous retroviruses (CfERVs) in the dog genome. This analysis is based in anno- tations of the only available high-quality dog genome sequence from a female boxer (canFam2, [169]) by an specialized application called RetroTectorc . The dog has emerged as a powerful comparative model for identification of phenotypic traits including genetic disease [139], and such a data source was missing, to the best of our knowledge, within the research community. We choose to annotate 200 kb surrounding the CfERVs found because LTRs have previously been observed functioning as promoters and enhancers up to dis- tances of 100 kb [24, 293].

Results and Discussion After a restrictive filtering of RetroTectorc annotations in order to reduce the false discovery rate, we found 407 proviruses with non-uniform distri- bution both along and across chromosomes. Some regions were essentially devoid of CfERVs whereas other regions had large numbers of integrations. As expected, the numbers show correlation with chromosomal length and also CfERVs were overrepresented in intergenic areas but to our surprise, telomeres were devoid of integrations which could be a sign that endogenous retroviruses could participate in a special mechanism of chromatin silencing other than those known for constitutive heterochromatin [104, 258]. There- after, we performed an analysis of the integrational landscape of the whole set of CfERVs to find the possible transcriptional aberrations in which CfERVs could be involved and to discover mechanisms of protection in the host. We also separated our CfERV dataset into three different groups depending on age since integration based on the calculated LTR divergence assuming genetic drift and identical flanking LTRs at the time of integration. Old CfERVs were preferentially located in antisense position of the surrounding genes when CfERVs were integrated closer than 70-80 kb up or downstream from them. Intermediately aged CfERVs exhibited a similar integration pattern except that

77 a large area expanding 30 kb downstream of the CfERVs had an increased number of protein-coding genes in sense direction. Young CfERVs presented a more interlaced pattern of integration regarding to genes in both the sense and antisense direction but with the promoter area of genes located up to 20 kb downstream from CfERVs in the same direction. This demonstrated the importance for the organism’s selective benefit to select against these integra- tions from the vicinity of genes due to the likely aberrant role they may have on transcription. The distribution of the CfERVs discovered between the already known genera was similar to other analyzed mammals (e.g. human, chimpanzee, macaque, mouse, opossum) but strikingly, the numbers were much lower (ap- proximately one fifth compared to the amount of human ERVs) suggesting a mechanism of protection from infections, or maybe purging from integrations, has been acting against ERVs during the evolution of canid species as it has already been observed before for the integrational neighborhood of CfERVs. After the initial clustering of Pol putative proteins [33, 188, 267, 273, 299] to observe the distribution of retroviral genera in dog, we brought our at- tention to the HERV-Fc-like group of proviruses that our analysis had re- vealed. HERV-Fc-like elements are interesting because their possible contri- bution with some of the proteins in trans- necessary to mobilization such as Gag [130]. This Fc-like group had been previously found as being primate- specific [29, 130], intermediate to ERV-F/H, in a low copy number distri- bution, and the protein translation frame maintained in numerous genes de- spite of age of integration. Next to phylogenetic reconstruction, we found four groups which we named CfERV-Fc1 to -Fc4, in reverse order to the estimated age of the youngest representative inside each cluster according to the LTR di- vergences. There were also two proviruses with no obvious common ancestral relation and a group of seven which we were unable to cluster with certainty. If the possibility of horizontal transfer between species is discarded, we could only predict that the ancestor of canids (and likely their superfamilies) suffered different periods of infection from external viral strains belonging to the Fc-like group followed by what we denominate "bursts of amplification" upon endogenization that may have led to several of the copies we found, for example, in the CfERV-Fc1 cluster. We could phylogenetically recognize six different "strains" among the HERV-Fc-like chains integrated in the dog genome, a prediction is also consistent with obtained LTR phylogenies based on nucleotide alignments which placed in time of integration of these CfERVs previously to any of the known integrations in the primate lineage. An hori- zontal transfer in the canid to primate direction could not formally be dis- carded but it had very low support based on local similarities to the HERV-Fc- like Pol sequences at hand.

78 5.4.1 Future prospects There are a number of ramifications this project could generate. Obviously, as we have shown the integration landscape of endogenous retroviruses (CfERVs) in the reference boxer genome with indications that the dog has been effective in purging ERVs from its genome, we would like to explore the mechanisms behind this finding. Furthermore, we deem to say that the dog genome is more prone to retrotransposition after an infection, at least for some CfERV groups as we have revealed for CfERV-Fc-like groups. With these results combined, we would like to investigate the mechanisms underlying the proviral purging, if it is linked to domestication, related to recombination, or population spread, by re-sequencing samples from a pool of breeds along with wild specimens for selected CfERVs. We expect to observe varied integrational polymorphisms as well as they may present haplotype variation, specifically, for the youngest elements. We also propose that the integrational landscape of ERVs in dog reflects that only proviruses with mostly neutral effects are kept. We demonstrated how, within evolution, integrations are displaced toward transcriptionally neutral regions of the genome. However, some selectively beneficial integrations may have been maintained. Analysis of endogenous retroviruses may be consid- ered as "viral archeology" because ERVs are the only traces of many ancient retrovirus strains, so we are expectantly looking to the new array of species being sequenced. We expect that our results will be a useful resource in future genetic disease- association studies. We have also developed the application pipeline to ana- lyze ERV data after the upstream annotation by RetroTectorc , facilitating its interpretation by researchers. We will continue annotating the Bos taurus genome. Work toward the distribution and release of this pipeline is an ongo- ing process. At the same time, we plan to release the dog data following the protocol presented in Paper II.

79

6. Sammanfattning på Svenska

6.1 Backgrund Den senaste tidens utveckling av ny storskalig sekveneringsteknologi har rev- olutionerat den biomedicinska forskningen och ökningen av ny biologisk data kommer att explodera det närmaste decenniet. Detta kommer bland annat att leda till nya genombrott inom forskningen kring våra folksjukdomar och även accelerera utvecklingen av bättre och mer sjukdomsresistenta husdjur och växter. Utvecklingen medför också nya problem; nämligen lagring, analys och in- tegrering av all biologisk data som produceras. Arbeten som presenteras i denna avhandling är en del i strävan att ta fram bättre, integrerade och användarvänliga bioinformatiska lösningar som gör det möjligt för forskare i livsvetenskaperna att förstå och analysera allt flöde av data som produceras.

6.2 Sammanfattning av studierna I denna avhandling redovisas utvecklingen och valideringen av flera bioinfor- matiska metoder och verktyg framtagna för att genomföra analys av biologisk data. I varje av de genomförda artiklarna användes olika protokoll och data- arkitekturer för att lösa specifika biologiska problem. I den första av artiklarna används en "Machine learning" metod för att identifiera peptider från föda som skulle kunna orsaka livsmedelsallergier hos människan. Systemet bygger på en klassificeringsmetod som känner igen peptider som nära besläktade med peptider redan beskrivna som allergener. Hela systemet har gjorts publikt tillgängligt via ett egenutvecklat webserver-system. I artikel nummer två beskrivs ett användarvänligt system som via ett grafiskt gränsnitt gör det möjligt att visualisera och rapportera endogena retrovirus i transposon-fria regioner i det humana genomet. Detta görs genom att hämta data från till exempel en "Genome browser" som Ensembl för att sedan visu- alisera resultaten i det eget utvecklat bioinformatiska programmet: eBioX. Den tredje artikeln presenterar framtagandet av en bioinformatisk lösning till ett vanligt biologiskt problem. Artikeln beskriver GeneFinder som är ett

81 frågesystem som tar fram data från flera olika källor, väger betydelsen av re- sultaten och presenterar en rapport som är lätt att förstå. Systemet baseras på en av de senaste teknologierna som rekommenderas av ledande bioinforma- tiska nätverk: web services. GeneFinder är speciellt framtaget för att samla all, från biologiska databaser, känd information från en väldefinierad kromosom- region från flera olika arter. För att validera användbarheten av GeneFinder testades systemet med data från en redan genomförd "genome wide associa- tion" studie på hundar som uppvisar "ridgeless" fenotypen [232]. Slutligen, den sista artikeln redogör för en ny och tidigare inte beskriven familj av hund endogena retrovirus. Arbetet presenterar också en detaljerad katalog av alla i studien upptäckta pro-virus och deras transkription med den enda publika hund genomet som bas.

82 7. Acknowledgements

This work was performed at the Linnaeus Centre for Bioinformatics, Uppsala University and the Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences. The research presented in this thesis was supported by grants from the European Commissions FP6 Programme (the EMBRACE project, LHSG-CT-2004-512092). Portions of this work were par- tially funded by the Swedish National Food Administration. I would like to express my sincere gratitude to the following people who have supported me academically during these years: Dr. Erik Bongcam-Rudloff, my main supervisor, my deepest kudos for choosing me as his first doktorand. He has been a ’magnifique’ supervisor. I couldn’t have make it with anyone else. He was a great leader for the project. I want to thank him and ask him to keep awake his vision, his ’spicy’ dis- cussions and the Bongcam-Rudloff group. I also want to thank for something that people don’t appreciate and I do a lot. Thanks for showing me academia from other perspectives. Thanks for unveiling to me the non-spoken language of how things work inside an institution called University. Prof. Jan Komorowski, my co-supervisor and director of the LCB, thanks for allowing to take part in this doktorand program. It was really pleasant trip to sail through the first stage of my academic formation. Thanks for memo- rable conferences in the snow with the rest of the LCB. Prof. Göran Andersson, my ’de facto’ third supervisor. Thanks so much. You have been a refreshing wave of Mediterranean sea in the last part of my PhD project. You easy going way of being. The feeling I have discussing with you as an equal peer and not a superior. I really appreciated that and specially, how much I have learnt from you. For listening carefully. Prof. Jonas Blomberg, the first guru of this acknowledgements, the retro- viral guru. I am extremely thankful for everything you have taught me about ERVs and how you push me hard in every meeting we have. It keeps me try- ing and wanting to learn more. Without RetroTector, a big part of my thesis wouldn’t have been possible. Göran O. Sperber, the RetroTector developer with Prof.Blomberg. Thanks so much for showing me your patience when I was still a Master Student and thanks a lot also for helping me with you incredible good corrections and for providing a great tool like RetroTector.

83 Prof. Ulf Hammerling, the second guru, thanks for your mastership of the allergenic world just at the same level than that of your English language. It has been of sum pleasure to discuss about allergy with you with a morning coffee at SLV. I hope you will continue with Daniel and Mats pioneering in the world of machine learning applied to allergy. Prof. Mats Gustafsson, the third and final guru, the guru of Statistics in the Uppsala region. Thanks for taking me as your student and all the assistance to Daniel during the final part of the EVALLER project. It is really energizing to listen your lectures. I have enjoyed the course you taught us from start to stop. I expect to keep learning. Prof. Leif Andersson, for the GeneFinder idea and having a great taste to find places where to organize conferences. I hope you can invite me to the next one even if I am finished my PhD time. Dr. Daniel Soeria-Atmadja, thanks for starting the EVALLERproject with Ulf. I expect we keep pushing it and my congrats for you new parental status and being the best toast master I have ever seen. Dr. Patric Jern, because the future of ERVs in Uppsala is with you and I hope to have the chance of learning a lot from you. Gunilla Nyberg Olsson. If I have someday a position. I want someone like you working with me. Can you clone yourself? Thanks for being so organized, helpful, funny, relaxed and bringing light to LCB. Also, I want to thank the new administrators at LCB-ICM, specially Sigrid Pettersson, she has been really helpful in the last few months. Prof. Anders Sjöberg. Thanks for having such a great impact for the future of LCB, trying to make a better place to work and personally helping me. Sorry for not listening carefully your wise advices. To the groups of people and tenures that went through LCB, specially, Ör- jan Calborg’s (sp. thanks for bringing Dom to Sweden), David’s Ardell. Wish your group things be well and research will be good. It would have been great to share more time in the same department. All the old postdocs & researchers and the old personal at the depart- ment; specially Volodya and Ole for football and billiard, I was sad when you left; the whole IT-department, you are great for helping so promptly always. Special thanks to Nils-Einar for EMBnet and taking care also of Erik, Emil for football and laughs, and Gustavo for sharing a passion with me in white shirt. I want to see the Spanish flag now! All the old Masters from LCB when I was a Master myself (I think I am still one). I really miss that times in A8 that will never come back. It was simply the best LCB times ever. Special thanks to Anders Nister for starting the EVALLER project with DASARP and EBioForms. For beers at nations and good laughs at the depart- ment and the gym. All the PhDs at LCB of yesterday, today and tomorrow. You are the simply great fellows and scientists. I don’t think nobody has realized about the mind

84 potential that has gone through our place. Well, may be someone. I just want to thank you for being great fellows and respecting me for being a strange person. I am just stupid to realize how much time I have lost to the end of the journey. I wish you the best in your future careers. Dr. Marcin Kierczak, LCB alumnus and my room ’camarada’ during most of this travel. Thanks for being the first Polish who tries seriously to learn Spanish. Seriously but joking... I think people is in research because they are curious to see. But some researchers dare to climb mountains and you are one of those. Keep going but realize when you have to come back. And keep aside Martyna. You are great together. Dr. Piff och Dr. Puff, you know who is who, don’t you? See you at my party. And I hope Bioclipse will be something great. You are driving it firmly. And you better keep being such great friends. Daniel Edsgärd, for working so hard in the EVALLER 2.0 project. To Marie Ekerljung (did I spell it right?) for starting the CfERV project and having a never-ending faith about retroviruses. Thanks to Ruoyo for the ancient money. It brought me luck. And finally, a big applause to three great researches (in an alphabetical or- der): José María Álvarez Castro, Álvaro Rada, and Mike C. Zody. Thanks for inspiration.

And now some personal acknowledgements: Erik Bongcam-Rudloff. Thanks for being my step father in Sweden for a while, for helping me to learn the language and showing me the idiosyncrasy of Sweden, for giving me this great living experience in summary. Erik is an almost-all-the-time gentleman and I am thankful we have travelled together as friends also. It has been such a great experience. But now we know at least that we cannot travel to the airport in the same car or even in the same train. Just human habits. But if you find him in the korridor he will tell you everything with his personal charm. Göran Andersson. Thanks for your incredible sense of humour. It is simply of the finest quality. I recommend it. I can see that 14-year old whiskey a bit nearer now. I have to add that I had my best conference dinner ever with these two gentleman and a third unknown one in St. Malo. It was such a great moment. If you seem to be in the same conference with them, please, choose their table. To the Bongcam-Rudloff group. In general. Now that it is growing so much I don’t want to miss anybody. It was a pity not to be more often in Ultuna, but you know, I am lazy. To lilla Erik for helping me so much and because he is a great guy to work with. I think we would make an awesome group together. And because he has found a great girlfriend.

85 To Feifeiiiiiii for being part of my projects, support and living with me for a month when she was homeless and broken-hearted. I think she has found a great boyfriend also. Keep going with you PhD! To everybody I have shared flat with. Thanks for standing me. To my Iê capoeira group (Raizes de África) and their founders. Thanks for support, specially after a hard day at work and re-energizing me, and i hope you realize also about the great the social work with kids and adults you are doing in Uppsala (kudos till Mia and Sereia!), Stockholm, Nörrtalje and Malmö. Thanks for passing me something I love to do. Axé! To my friends. I just love you. I don’t want to personalize in anybody because it is all of you. Thanks for being here and there. For being you in good and bad times and of course, for helping me when I most needed it. You are just great and I am crying. Seriously. I miss you and I will continue missing you. You are my second best treasure.

To my very first best treasure. A mi familia. Lo siento por estar tan lejos. Os he echado de menos todo este tiempo y no veo que vaya a ser mucho mejor en el futuro. Pero quiero que sepaís que aunque familia corta (y bajitos todos) sois la mejor familia del mundo. Gracias por darme la vida. No podría haber estado en ninguna otra. Estoy seguro que hoy todos mis abuelos son las personas más felices del mundo, allá donde estén.

I cannot believe this is finish now...

86 Bibliography

[1] The BioSapiens project homepage. URL http://www.biosapiens.info/.

[2] BioMart Central Portal. URL http://www.biomart.org/.

[3] Dasobert DAS client library. URL http://www.spice-3d.org/dasobert/.

[4] EMBRACE Network of Excellence. URL http://www.embracegrid.info/.

[5] NoSQL - Wikipedia, the free encyclopedia. URL http://en.wikipedia.org/wiki/ NoSQL.

[6] PeppeR – Graphical 3D-EM DAS Client. URL http://biocomp.cnb.uam.es/das/ PeppeR/.

[7] RetroSearch: HERV database (Annotated HERV ORFs in the human genome). URL http:// www.daimi.au.dk/~biopv/herv/.

[8] UniProt. URL http://www.uniprot.org.

[9] Web Services Architecture. URL http://www.w3.org/TR/ws-arch/.

[10] Allergen Online - University of Nebraska, Lincoln, NE, USA. URL http://www. allergenonline.org/.

[11] Guelph Transgenic Pig Research Program. URL http://www.uoguelph.ca/enviropig/.

[12] Pioneer Hi-Bred. URL http://www.pioneer.com/.

[13] Allergen nomenclature. IUIS/WHO Allergen Nomenclature Subcommittee. Bull World Health Or- gan, 72(5):797–806, 1994.

[14] Aalberse RC, Akkerdaas J, & van Ree R. Cross-reactivity of IgE antibodies to allergens. Allergy, 56(6):478–90, Jun 2001.

[15] Aki T, Kodama T, Fujikawa A, Miura K, et al. Immunochemical characterization of recombinant and native tropomyosins as a new allergen from the house dust mite, Dermatophagoides farinae. J Allergy Clin Immunol, 96(1):74–83, Jul 1995.

[16] Alfredsson-Timmins J, Kristell C, Henningson F, Lyckman S, & Bjerling P. Reorganization of chromatin is an early response to nitrogen starvation in Schizosaccharomyces pombe. Chromosoma, 118(1):99–112, Feb 2009. doi: 10.1007/s00412-008-0180-6.

[17] Altschul SF, Gish W, Miller W, Myers EW, & Lipman DJ. Basic local alignment search tool. J Mol Biol, 215(3):403–10, Oct 1990. doi: 10.1006/jmbi.1990.9999.

[18] Altschul SF, Madden TL, Schäffer AA, Zhang J, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–402, Sep 1997.

[19] Andersson L. Genome-wide association analysis in domestic animals: a powerful approach for genetic dissection of trait loci. Genetica, 136(2):341–9, Jun 2009. doi: 10.1007/s10709-008-9312-4.

87 [20] Antony JM, van Marle G, Opii W, Butterfield DA, et al. Human endogenous retrovirus glycoprotein- mediated induction of redox reactants causes oligodendrocyte death and demyelination. Nat Neu- rosci, 7(10):1088–95, Oct 2004. doi: 10.1038/nn1319.

[21] Astwood JD & Fuchs RL. Allergenicity of foods derived from transgenic plants. Monogr Allergy, 32:105–20, 1996.

[22] Balding DJ, Bishop MJ, & Cannings C. Handbook of statistical genetics. John Wiley & Sons, Chichester, England, 3rd ed edition, 2007. ISBN 9780470058305 (cloth : alk. paper). URL http: //www.loc.gov/catdir/toc/ecip0713/2007010263.html.

[23] Baltimore D. The strategy of RNA viruses. Harvey Lect, 70 Series:57–74, 1974.

[24] Bartholomew C & Ihle JN. Retroviral insertions 90 kilobases proximal to the Evi-1 myeloid trans- forming gene activate transcription from the normal promoter. Mol Cell Biol, 11(4):1820–8, Apr 1991.

[25] Beemon K, Duesberg P, & Vogt P. Evidence for crossing-over between avian tumor viruses based on analysis of viral RNAs. Proc Natl Acad SciUSA, 71(10):4254–8, Oct 1974.

[26] Bellman RE. Adaptive control processes: a guided tour. Princeton University Press, Princeton, N.J., 1961.

[27] Belshaw R, Pereira V, Katzourakis A, Talbot G, et al. Long-term reinfection of the human genome by endogenous retroviruses. Proc Natl Acad Sci U S A, 101(14):4894–9, Apr 2004. doi: 10.1073/ pnas.0307800101.

[28] Belshaw R, Watson J, Katzourakis A, Howe A, et al. Rate of recombinational deletion among human endogenous retroviruses. J Virol, 81(17):9437–42, Sep 2007. doi: 10.1128/JVI.02216-06.

[29] Bénit L, Calteau A, & Heidmann T. Characterization of the low-copy HERV-Fc family: evidence for recent integrations in primates of elements with coding envelope genes. Virology, 312(1):159–68, Jul 2003.

[30] Bernstein BE, Liu CL, Humphrey EL, Perlstein EO, & Schreiber SL. Global nucleosome occupancy in yeast. Genome Biol, 5(9):R62, 2004. doi: 10.1186/gb-2004-5-9-r62.

[31] Blankenburg H, Finn RD, Prlic´ A, Jenkinson AM, et al. DASMI: exchanging, annotating and assessing molecular interaction data. Bioinformatics, 25(10):1321–8, May 2009. doi: 10.1093/ bioinformatics/btp142.

[32] Blikstad V, Benachenhou F, Sperber GO, & Blomberg J. Evolution of human endogenous retroviral sequences: a conceptual account. Cell Mol Life Sci, 65(21):3348–65, Nov 2008. doi: 10.1007/ s00018-008-8495-2.

[33] Blomberg J, Benachenhou F, Blikstad V, Sperber G, & Mayer J. Classification and nomenclature of endogenous retroviral sequences (ERVs): problems and recommendations. Gene, 448(2):115–23, Dec 2009. doi: 10.1016/j.gene.2009.06.007.

[34] Blosser TR, Yang JG, Stone MD, Narlikar GJ, & Zhuang X. Dynamics of nucleosome remodelling by individual ACF complexes. Nature, 462(7276):1022–7, Dec 2009. doi: 10.1038/nature08627.

[35] Bongcam-Rudloff E, Edsgärd D, Martinez Barrio A, Soeria-Atmadja D, et al. A guide to Evaller (2.0) web server: A new tool for in silico testing of protein allergenicity. EMBnet.news, 13(4): 32–37, 2007.

[36] Bourc’his D & Bestor TH. Meiotic catastrophe and retrotransposon reactivation in male germ cells lacking Dnmt3L. Nature, 431(7004):96–9, Sep 2004. doi: 10.1038/nature02886.

88 [37] Breiteneder H & Clare Mills EN. Plant food allergens–structural and functional aspects of aller- genicity. Biotechnol Adv, 23(6):395–9, Sep 2005. doi: 10.1016/j.biotechadv.2005.05.004.

[38] Bruford EA, Lush MJ, Wright MW, Sneddon TP, et al. The HGNC Database in 2008: a resource for the human genome. Nucleic Acids Res, 36(Database issue):D445–8, Jan 2008. doi: 10.1093/nar/ gkm881.

[39] Burge C & Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol, 268(1):78–94, Apr 1997. doi: 10.1006/jmbi.1997.0951.

[40] Burks AW & Sampson H. Food allergies in children. Curr Probl Pediatr, 23(6):230–52, Jul 1993.

[41] Buzdin A, Ustyugova S, Khodosevich K, Mamedov I, et al. Human-specific subfamilies of HERV-K (HML-2) long terminal repeats: three master genes were active simultaneously during branching of hominoid lineages. Genomics, 81(2):149–56, Feb 2003.

[42] C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282(5396):2012–8, Dec 1998.

[43] Cann A. Principles of molecular virology. Elsevier Academic Press, Amsterdam, 4th ed edition, 2005. ISBN 0120887878 (standard ed.). URL http://www.loc.gov/catdir/ enhancements/fy0626/2005279896-d.html.

[44] Carmichael LE. An annotated historical account of canine parvovirus. J Vet Med B Infect Dis Vet Public Health, 52(7-8):303–11, 2005. doi: 10.1111/j.1439-0450.2005.00868.x.

[45] Carninci P, Sandelin A, Lenhard B, Katayama S, et al. Genome-wide analysis of mammalian pro- moter architecture and evolution. Nat Genet, 38(6):626–35, Jun 2006. doi: 10.1038/ng1789.

[46] Chen Y, Zhao Y, Hammond J, Hsu Ht, et al. Multiple virus infections in the honey bee and genome divergence of honey bee viruses. J Invertebr Pathol, 87(2-3):84–93, 2004. doi: 10.1016/j.jip.2004. 07.005.

[47] Codd EF. Is Your DBMS Really Relational? ComputerWorld, October 14 1985.

[48] Codd EF. Does Your DBMS Run By the Rules? ComputerWorld, October 21 1985.

[49] Codd EF. A relational model of data for large shared data banks. 1970. MD Comput, 15(3):162–6, 1998.

[50] Coffin JM. Retroviruses. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, 2002. ISBN 0879696508.

[51] Creager ANH & Morgan GJ. After the double helix: Rosalind Franklin’s research on Tobacco mosaic virus. Isis, 99(2):239–72, Jun 2008.

[52] Crick F. Central dogma of molecular biology. Nature, 227(5258):561–3, Aug 1970.

[53] Crick FH. On protein synthesis. Symp Soc Exp Biol, 12:138–63, 1958.

[54] Crooks GE, Hon G, Chandonia JM, & Brenner SE. WebLogo: a sequence logo generator. Genome Res, 14(6):1188–90, Jun 2004. doi: 10.1101/gr.849004.

[55] Cui J, Han LY, Li H, Ung CY, et al. Computer prediction of allergen proteins from sequence- derived protein structural and physicochemical properties. Mol Immunol, 44(4):514–20, Jan 2007. doi: 10.1016/j.molimm.2006.02.010.

[56] Curcio MJ & Belfort M. The beginning of the end: links between ancient retroelements and modern telomerases. Proc Natl Acad Sci U S A, 104(22):9107–8, May 2007. doi: 10.1073/pnas.0703224104.

89 [57] Cuzin F, Grandjean V,& Rassoulzadegan M. Inherited variation at the epigenetic level: paramutation from the plant to the mouse. Curr Opin Genet Dev, 18(2):193–6, Apr 2008. doi: 10.1016/j.gde.2007. 12.004.

[58] D’Amato G, Cecchi L, Bonini S, Nunes C, et al. Allergenic pollen and pollen allergy in Europe. Allergy, 62(9):976–90, Sep 2007. doi: 10.1111/j.1398-9995.2007.01393.x.

[59] Datta SK, Owen FL, Womack JE, & Riblet RJ. Analysis of recombinant inbred lines derived from "autoimmune" (NZB) and "high leukemia" (C58) strains: independent multigenic systems control B cell hyperactivity, retrovirus expression, and autoimmunity. J Immunol, 129(4):1539–44, Oct 1982.

[60] Davies HV. GM organisms and the EU regulatory environment: allergenicity as a risk component. Proc Nutr Soc, 64(4):481–6, Nov 2005.

[61] Dayhoff MO. Protein segment dictionary 78: from the Atlas of protein sequence and structure, volume 5, and supplements 1, 2, and 3. National Biomedical Research Foundation, Silver Spring, Md., 1978. ISBN 0912466081.

[62] de Villiers EM, Fauquet C, Broker TR, Bernard HU, & zur Hausen H. Classification of papillo- maviruses. Virology, 324(1):17–27, Jun 2004. doi: 10.1016/j.virol.2004.03.033.

[63] Dewannieux M, Dupressoir A, Harper F, Pierron G, & Heidmann T. Identification of autonomous IAP LTR retrotransposons mobile in mammalian cells. Nat Genet, 36(5):534–9, May 2004. doi: 10.1038/ng1353.

[64] Di Cristofano A, Strazzullo M, Longo L, & La Mantia G. Characterization and genomic map- ping of the ZNF80 locus: expression of this zinc-finger gene is driven by a solitary LTR of ERV9 endogenous retroviral family. Nucleic Acids Res, 23(15):2823–30, Aug 1995.

[65] Diaz M & Casali P. Somatic immunoglobulin hypermutation. Curr Opin Immunol, 14(2):235–40, Apr 2002.

[66] Dimmock NJ, Easton AJ, & Leppard K. Introduction to modern virology. Blackwell Pub., Malden, MA, 6th ed edition, 2007. ISBN 1405136456 (pbk. : alk. paper). URL http://www.loc.gov/ catdir/toc/ecip0610/2006009426.html.

[67] Distributed Annotation System consortium. DAS registration server. URL http://www. dasregistry.org/.

[68] Do CB, Mahabhashyam MSP, Brudno M, & Batzoglou S. ProbCons: Probabilistic consistency- based multiple sequence alignment. Genome Res, 15(2):330–40, Feb 2005. doi: 10.1101/gr. 2821705.

[69] Doolittle RF. Of urfs and orfs: a primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley, CA, 1986. ISBN 0935702547 (pbk.).

[70] Dowell RD, Jokerst RM, Day A, Eddy SR, & Stein L. The distributed annotation system. BMC Bioinformatics, 2:7, 2001.

[71] Drake JW & Holland JJ. Mutation rates among RNA viruses. Proc Natl Acad Sci U S A, 96(24): 13910–3, Nov 1999.

[72] EC. Regulation (EC) No 1823/2003 of the European Parliament and of the Council of 22 September on genetically modified food and feed. Official J Eur Union, (L268):1–23, 2003.

[73] Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 32(5):1792–7, 2004. doi: 10.1093/nar/gkh340.

[74] Eickbush TH. Telomerase and retrotransposons: which came first? Science, 277(5328):911–2, Aug 1997.

90 [75] EMBRACE WP3. D3.1.1 Report on Technology Survey. URL http://www.embracegrid. info/files/pub/products/D3.1.1_v2.0.doc.

[76] ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, Jun 2007. doi: 10. 1038/nature05874.

[77] EnsEMBL. How Ensembl uses DAS. URL http://www.ensembl.org/info/data/ ensembl_das.html.

[78] Esnault C, Heidmann O, Delebecque F, Dewannieux M, et al. APOBEC3G cytidine deaminase inhibits retrotransposition of endogenous retroviruses. Nature, 433(7024):430–3, Jan 2005. doi: 10.1038/nature03238.

[79] EU. Regulation (EC) No 1830/2003 of the European Parliament and of the Council of 22 September 2003 concerning the traceability and labelling of genetically modified organisms and amending Directive 2001/18/EC. Official J Eur Union, (L268):24–28, 2003.

[80] Fahlbusch B, Rudeschko O, Schlott B, Henzgen M, et al. Further characterization of IgE-binding antigens from guinea pig hair as new members of the lipocalin family. Allergy, 58(7):629–34, Jul 2003.

[81] FAO/WHO. Evaluation of allergenicity of genetically modified foods. Joint FAO/WHO Expert Consultation on Allergenicity of Foods Derived from Biotechnology. Technical report, FAO/WHO, Rome, Italy, 2001.

[82] Felsenstein J. Inferring phylogenies. Sinauer Associates, Sunderland, Mass., 2004. ISBN 0878931775 (pbk.). URL http://www.loc.gov/catdir/toc/ecip043/ 2003008942.html.

[83] Feng DF & Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol, 25(4):351–60, 1987.

[84] Feschotte C. Transposable elements and the evolution of regulatory networks. Nat Rev Genet, 9(5): 397–405, May 2008. doi: 10.1038/nrg2337.

[85] Fiester A. Why the omega-3 piggy should not go to market. Nat Biotechnol, 24(12):1472–3, Dec 2006. doi: 10.1038/nbt1206-1472.

[86] Finn RD, Stalker JW, Jackson DK, Kulesha E, et al. ProServer: a simple, extensible Perl DAS server. Bioinformatics, 23(12):1568–70, Jun 2007. doi: 10.1093/bioinformatics/btl650.

[87] Flicek P, Aken BL, Ballester B, Beal K, et al. Ensembl’s 10th year. Nucleic Acids Res, 38(Database issue):D557–62, Jan 2010. doi: 10.1093/nar/gkp972.

[88] for Systems Biology I. RepeatMasker Frequently Asked Questions. URL http://www. repeatmasker.org/faq.html.

[89] Furmonaviciene R, Sutton BJ, Glaser F, Laughton CA, et al. An attempt to define allergen-specific molecular surface features: a bioinformatic approach. Bioinformatics, 21(23):4201–4, Dec 2005. doi: 10.1093/bioinformatics/bti700.

[90] Gammerman A. Computational learning and probabilistic reasoning. Wiley, Chichester, 1996. ISBN 0471962791. URL http://www.loc.gov/catdir/description/wiley031/ 96177151.html.

[91] Garrison KE, Jones RB, Meiklejohn DA, Anwar N, et al. T cell responses to human endogenous retroviruses in HIV-1 infection. PLoS Pathog, 3(11):e165, Nov 2007. doi: 10.1371/journal.ppat. 0030165.

91 [92] Gartler SM & Goldman MA. Biology of the X chromosome. Curr Opin Pediatr, 13(4):340–5, Aug 2001.

[93] Gendel SM. Sequence databases for assessing the potential allergenicity of proteins used in trans- genic foods. Adv Food Nutr Res, 42:63–92, 1998.

[94] Gene Ontology Consortium. The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res, 38(Database issue):D331–5, Jan 2010. doi: 10.1093/nar/gkp1018.

[95] Gerstein MB, Bruce C, Rozowsky JS, Zheng D, et al. What is a gene, post-ENCODE? History and updated definition. Genome Res, 17(6):669–81, Jun 2007. doi: 10.1101/gr.6339607.

[96] Goris N, Vandenbussche F, & De Clercq K. Potential of antiviral therapy and prophylaxis for controlling RNA viral infections of livestock. Antiviral Res, 78(1):170–8, Apr 2008. doi: 10.1016/ j.antiviral.2007.10.003.

[97] Goubran Botros H, Gregoire C, Rabillon J, David B, & Dandeu JP. Cross-antigenicity of horse serum albumin with dog and cat albumins: study of three short peptides with significant inhibitory activity towards specific human IgE and IgG antibodies. Immunology, 88(3):340–7, Jul 1996.

[98] Gould HJ, Sutton BJ, Beavil AJ, Beavil RL, et al. The biology of IGE and the basis of allergic disease. Annu Rev Immunol, 21:579–628, 2003. doi: 10.1146/annurev.immunol.21.120601.141103.

[99] Griffiths AJF. Introduction to genetic analysis. W.H. Freeman and Co., New York, 9th ed edi- tion, 2008. ISBN 9780716768876. URL http://www.loc.gov/catdir/toc/fy0713/ 2006936450.html.

[100] Grimaldi DA & Engel MS. Why Descriptive Science Still Matters. BioScience, 57(8):646–647, 2007. doi: 10.1641/B570802.

[101] Gupta R, Sheikh A, Strachan DP, & Anderson HR. Time trends in allergic disorders in the UK. Thorax, 62(1):91–6, Jan 2007. doi: 10.1136/thx.2004.038844.

[102] Haider S, Ballester B, Smedley D, Zhang J, et al. BioMart Central Portal–unified access to biological data. Nucleic Acids Res, 37(Web Server issue):W23–7, Jul 2009. doi: 10.1093/nar/gkp265.

[103] Hald A. A history of mathematical statistics from 1750 to 1930. Wiley, New York, 1998. ISBN 0471179124 (alk. paper). URL http://www.loc.gov/catdir/bios/wiley041/ 97019513.html.

[104] Hall IM, Shankaranarayana GD, Noma KI, Ayoub N, et al. Establishment and maintenance of a heterochromatin domain. Science, 297(5590):2232–7, Sep 2002. doi: 10.1126/science.1076466.

[105] Hamosh A, Scott AF, Amberger JS, Bocchini CA, & McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res,33 (Database issue):D514–7, Jan 2005. doi: 10.1093/nar/gki033.

[106] Hanley JA & McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, Apr 1982.

[107] Hansen KH, Bracken AP, Pasini D, Dietrich N, et al. A model for transmission of the H3K27me3 epigenetic mark. Nat Cell Biol, 10(11):1291–300, Nov 2008. doi: 10.1038/ncb1787.

[108] Hastie T, Tibshirani R, & Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2nd edition, 2009.

[109] Heinlein M. The spread of tobacco mosaic virus infection: insights into the cellular mechanism of RNA transport. Cell Mol Life Sci, 59(1):58–82, Jan 2002.

92 [110] Henikoff S & Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad SciUSA, 89(22):10915–9, Nov 1992.

[111] Henikoff S, Greene EA, Pietrokovski S, Bork P, et al. Gene families: the taxonomy of protein paralogs and chimeras. Science, 278(5338):609–14, Oct 1997.

[112] Hileman RE, Silvanovich A, Goodman RE, Rice EA, et al. Bioinformatic methods for allergenicity assessment using a comprehensive allergen database. Int Arch Allergy Immunol, 128(4):280–91, Aug 2002.

[113] Hirosawa M, Totoki Y, Hoshida M, & Ishikawa M. Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci, 11(1):13–8, Feb 1995.

[114] Hjern A. Chapter 5.8: major public health problems - allergic disorders. Scand J Public Health Suppl, 67:125–31, Jun 2006. doi: 10.1080/14034950600677139.

[115] Hogeweg P & Hesper B. Interactive instruction on population interactions. Comput Biol Med, 8(4): 319–27, 1978.

[116] Inmon WH. Data warehousing for e-business. John Wiley & Sons, New York, 2001. ISBN 0471415790 (pbk. : alk. paper). URL http://www.loc.gov/catdir/bios/wiley043/ 2001046535.html.

[117] Inmon WH. Building the data warehouse. Wiley, Indianapolis, Ind., 4th ed edition, 2005. ISBN 9780764599446 (pbk.). URL http://www.loc.gov/catdir/enhancements/fy0645/ 2006273491-d.html.

[118] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, Feb 2001. doi: 10.1038/35057062.

[119] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–45, Oct 2004. doi: 10.1038/nature03001.

[120] ISAAA. ISAAA Brief 39-2008: Executive Summary. Brief Report 39, ISAAA, 2008.

[121] Izui S, Elder JH, McConahey PJ, & Dixon FJ. Identification of retroviral gp70 and anti-gp70 an- tibodies involved in circulating immune complexes in NZB X NZW mice. J Exp Med, 153(5): 1151–60, May 1981.

[122] Jackson M. Allergy: the making of a modern plague. Clin Exp Allergy, 31(11):1665–71, Nov 2001.

[123] Jacob F. Evolution and tinkering. Science, 196(4295):1161–6, Jun 1977.

[124] Jaenisch R & Bird A. Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat Genet, 33 Suppl:245–54, Mar 2003. doi: 10.1038/ng1089.

[125] Jameson JL. Principles of molecular medicine. Humana Press, Totowa, N.J., 1998. ISBN 0896035298 (alk. paper).

[126] Jenkins JA, Griffiths-Jones S, Shewry PR, Breiteneder H, & Mills ENC. Structural relatedness of plant food allergens with specific reference to cross-reactive allergens: an in silico analysis. J Allergy Clin Immunol, 115(1):163–70, Jan 2005. doi: 10.1016/j.jaci.2004.10.026.

[127] Jenkins NA, Copeland NG, Taylor BA, & Lee BK. Dilute (d) coat colour mutation of DBA/2J mice is associated with the site of integration of an ecotropic MuLV genome. Nature, 293(5831):370–4, Oct 1981.

[128] Jenkinson AM, Albrecht M, Birney E, Blankenburg H, et al. Integrating biological data - the Distributed Annotation System. BMC Bioinformatics, 9 Suppl 8:S3, Jul 2008. doi: 10.1186/ 1471-2105-9-S8-S3.

93 [129] Jern P & Coffin J. Effects of retroviruses on host genome function. Annual review of genetics, 42: 709–732, 2008.

[130] Jern P, Sperber GO, & Blomberg J. Definition and variation of human endogenous retrovirus H. Virology, 327(1):93–110, Sep 2004. doi: 10.1016/j.virol.2004.06.023.

[131] Jern P, Sperber GO, & Blomberg J. Use of endogenous retroviral sequences (ERVs) and structural markers for retroviral phylogenetic inference and taxonomy. Retrovirology, 2:50, 2005. doi: 10. 1186/1742-4690-2-50.

[132] Johannsen W. Elemente der exakten erblichkeitslehre: mit grundzügen der biologischen variation- sstatistik. G. Fischer, Jena, 3. deutsche, neubearb. aufl edition, 1926.

[133] Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res, 110(1-4):462–7, 2005. doi: 10.1159/000084979.

[134] Kajikawa M & Okada N. LINEs mobilize SINEs in the eel through a shared 3’ sequence. Cell, 111 (3):433–44, Nov 2002.

[135] Kalitsis P & Saffery R. Inherent promoter bidirectionality facilitates maintenance of sequence in- tegrity and transcription of parasitic DNA in mammalian genomes. BMC Genomics, 10:498, 2009. doi: 10.1186/1471-2164-10-498.

[136] Kang JX & Leaf A. Why the omega-3 piggy should go to market. Nat Biotechnol, 25(5):505–6; author reply 506, May 2007. doi: 10.1038/nbt0507-505.

[137] Kapitonov VV & Jurka J. The long terminal repeat of an endogenous retrovirus induces alternative splicing and encodes an additional carboxy-terminal sequence in the human leptin receptor. J Mol Evol, 48(2):248–51, Feb 1999.

[138] Kapranov P, Drenkow J, Cheng J, Long J, et al. Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res, 15(7):987–97, Jul 2005. doi: 10.1101/gr.3455305.

[139] Karlsson EK & Lindblad-Toh K. Leader of the pack: gene mapping in dogs and other model organ- isms. Nat Rev Genet, 9(9):713–25, Sep 2008. doi: 10.1038/nrg2382.

[140] Karlsson EK, Baranowska I, Wade CM, Salmon Hillbertz NHC, et al. Efficient mapping of mendelian traits in dogs through genome-wide association. Nat Genet, 39(11):1321–8, Nov 2007. doi: 10.1038/ng.2007.10.

[141] Karolchik D, Hinrichs AS, Furey TS, Roskin KM, et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res, 32(Database issue):D493–6, Jan 2004. doi: 10.1093/nar/gkh103.

[142] Katoh K, Misawa K, Kuma Ki, & Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res, 30(14):3059–66, Jul 2002.

[143] Katoh K, Kuma Ki, Toh H, & Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res, 33(2):511–8, 2005. doi: 10.1093/nar/gki198.

[144] Kay AB. Allergy and allergic diseases. Second of two parts. N Engl J Med, 344(2):109–13, Jan 2001.

[145] Kazazian HH, Jr. Mobile elements and disease. Curr Opin Genet Dev, 8(3):343–50, Jun 1998.

[146] Kazazian HH, Jr. Mobile elements: drivers of genome evolution. Science, 303(5664):1626–32, Mar 2004. doi: 10.1126/science.1089670.

[147] Kean S. Virology. Chronic fatigue and prostate cancer: a retroviral connection? Science, 326(5950): 215, Oct 2009. doi: 10.1126/science.326_215a.

94 [148] Keane J, Nicoll J, Kim S, Wu DM, et al. Conservation of structure and function between human and murine IL-16. J Immunol, 160(12):5945–54, Jun 1998.

[149] Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res, 12(4):656–64, Apr 2002. doi: 10.1101/gr.229202.ArticlepublishedonlinebeforeMarch2002.

[150] Kimber I, Dearman RJ, Penninks AH, Knippels LMJ, et al. Assessment of protein allergenicity on the basis of immune reactivity: animal models. Environ Health Perspect, 111(8):1125–30, Jun 2003.

[151] Kimura K, Wakamatsu A, Suzuki Y, Ota T, et al. Diversification of transcriptional modulation: large- scale identification and characterization of putative alternative promoters of human genes. Genome Res, 16(1):55–65, Jan 2006. doi: 10.1101/gr.4039406.

[152] Knoll AH. The early evolution of eukaryotes: a geological perspective. Science, 256(5057):622–7, May 1992.

[153] Kong W, Tan TS, Tham L, & Choo KW. Improved prediction of allergenicity by combination of multiple sequence motifs. In Silico Biol, 7(1):77–86, 2007.

[154] Konopka G, Bomar JM, Winden K, Coppola G, et al. Human-specific transcriptional regula- tion of CNS development genes by FOXP2. Nature, 462(7270):213–7, Nov 2009. doi: 10.1038/ nature08549.

[155] Kouzarides T. Chromatin modifications and their function. Cell, 128(4):693–705, Feb 2007. doi: 10.1016/j.cell.2007.02.005.

[156] Kowalski PE, Freeman JD, & Mager DL. Intergenic splicing between a HERV-H endogenous retro- virus and two adjacent human genes. Genomics, 57(3):371–9, May 1999. doi: 10.1006/geno.1999. 5787.

[157] Kuhn RM, Karolchik D, Zweig AS, Wang T, et al. The UCSC Genome Browser Database: update 2009. Nucleic Acids Res, 37(Database issue):D755–61, Jan 2009. doi: 10.1093/nar/gkn875. URL http://genome.ucsc.edu/.

[158] La Scola B, Desnues C, Pagnier I, Robert C, et al. The virophage as a unique parasite of the giant mimivirus. Nature, 455(7209):100–4, Sep 2008. doi: 10.1038/nature07218.

[159] Lai L, Kang JX, Li R, Wang J, et al. Generation of cloned transgenic pigs rich in omega-3 fatty acids. Nat Biotechnol, 24(4):435–6, Apr 2006. doi: 10.1038/nbt1198.

[160] Lassmann T & Sonnhammer ELL. Kalign–an accurate and fast multiple sequence alignment algo- rithm. BMC Bioinformatics, 6:298, 2005. doi: 10.1186/1471-2105-6-298.

[161] Lee C, Grasso C, & Sharlow MF. Multiple sequence alignment using partial order graphs. Bioin- formatics, 18(3):452–64, Mar 2002.

[162] Lee CK, Shibata Y, Rao B, Strahl BD, & Lieb JD. Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat Genet, 36(8):900–5, Aug 2004. doi: 10.1038/ng1400.

[163] Lee W, Tillo D, Bray N, Morse RH, et al. A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet, 39(10):1235–44, Oct 2007. doi: 10.1038/ng2117.

[164] Leiman PG, Kanamaru S, Mesyanzhinov VV, Arisaka F, & Rossmann MG. Structure and mor- phogenesis of bacteriophage T4. Cell Mol Life Sci, 60(11):2356–70, Nov 2003. doi: 10.1007/ s00018-003-3072-1.

[165] Lenffer J, Nicholas FW, Castle K, Rao A, et al. OMIA (Online Mendelian Inheritance in Animals): an enhanced platform and integration into the Entrez search interface at NCBI. Nucleic Acids Res, 34(Database issue):D599–601, Jan 2006. doi: 10.1093/nar/gkj152.

95 [166] Li A, Rue M, Zhou J, Wang H, et al. Utilization of Ig heavy chain variable, diversity, and join- ing gene segments in children with B-lineage acute lymphoblastic leukemia: implications for the mechanisms of VDJ recombination and for pathogenesis. Blood, 103(12):4602–9, Jun 2004. doi: 10.1182/blood-2003-11-3857.

[167] Li KB, Issac P, & Krishnan A. Predicting allergenic proteins using wavelet transform. Bioinformat- ics, 20(16):2572–8, Nov 2004. doi: 10.1093/bioinformatics/bth286.

[168] Li WH. Molecular evolution. Sinauer Associates, Sunderland, Mass., 1997. ISBN 0878934634 (cloth).

[169] Lindblad-Toh K, Wade CM, Mikkelsen TS, Karlsson EK, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438(7069):803–19, Dec 2005. doi: 10.1038/nature04338.

[170] Lipman DJ & Pearson WR. Rapid and sensitive protein similarity searches. Science, 227(4693): 1435–41, Mar 1985.

[171] Litman GW, Rast JP, Shamblott MJ, Haire RN, et al. Phylogenetic diversification of immunoglob- ulin genes and the antibody repertoire. Mol Biol Evol, 10(1):60–72, Jan 1993.

[172] Liu Y, Nickle DC, Shriner D, Jensen MA, et al. Molecular clock-like evolution of human immun- odeficiency virus type 1. Virology, 329(1):101–8, Nov 2004. doi: 10.1016/j.virol.2004.08.014.

[173] Lively CM & Dybdahl MF. Parasite adaptation to locally common host genotypes. Nature, 405 (6787):679–81, Jun 2000. doi: 10.1038/35015069.

[174] Livsmedelsverket. Computational Detection of Allergenic Proteins Reaches a New Level of Ac- curacy with In Silico Variable-length Peptide Extraction and Machine Learning. URL http: //www.slv.se/en-gb/Group1/Food-Safety/Allergens/.

[175] Livsmedelverket. e-Testing of protein allergenicity. URL http://www.slv.se/en-gb/ Group1/Food-Safety/e-Testing-of-protein-allergenicity/.

[176] Lo HS, Wang Z, Hu Y, Yang HH, Gere S, et al. Allelic variation in gene expression is common in the human genome. Genome Res, 13(8):1855–62, Aug 2003. doi: 10.1101/gr.1006603.

[177] Lodish HF. Molecular cell biology. W.H. Freeman, New York, 6th ed edition, 2008. ISBN 0716743663. URL http://www.loc.gov/catdir/toc/ecip0710/2007006188. html.

[178] Lombardi VC, Ruscetti FW, Das Gupta J, Pfost MA, et al. Detection of an infectious retrovirus, XMRV, in blood cells of patients with chronic fatigue syndrome. Science, 326(5952):585–9, Oct 2009. doi: 10.1126/science.1179052.

[179] Long Q, Bengra C, Li C, Kutlar F, & Tuan D. A long terminal repeat of the human endogenous retrovirus ERV-9 is located in the 5’ boundary area of the human beta-globin locus control region. Genomics, 54(3):542–55, Dec 1998. doi: 10.1006/geno.1998.5608.

[180] Lwoff A, Horne R, & Tournier P. A system of viruses. Cold Spring Harb Symp Quant Biol, 27: 51–5, 1962.

[181] Lwoff A, Horne RW, & Tournier P. [A virus system.]. C R Hebd Seances Acad Sci, 254:4225–7, Jun 1962.

[182] Maizel JV, Jr & Lenk RP. Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proc Natl Acad SciUSA, 78(12):7665–9, Dec 1981.

96 [183] Manak JR, Dike S, Sementchenko V, Kapranov P, et al. Biological function of unannotated tran- scription during the early development of Drosophila melanogaster. Nat Genet, 38(10):1151–8, Oct 2006. doi: 10.1038/ng1875.

[184] Mari A, Scala E, Palazzo P, Ridolfi S, et al. Bioinformatics applied to allergy: allergen databases, from collecting sequence information to data integration. The Allergome platform as a model. Cell Immunol, 244(2):97–100, Dec 2006. doi: 10.1016/j.cellimm.2007.02.012.

[185] Market E & Papavasiliou FN. V(D)J recombination and the evolution of the adaptive immune system. PLoS Biol, 1(1):E16, Oct 2003. doi: 10.1371/journal.pbio.0000016.

[186] Mawar N, Saha S, Pandit A, & Mahajan U. The third phase of HIV pandemic: social consequences of HIV/AIDS stigma & discrimination & future needs. Indian J Med Res, 122(6):471–84, Dec 2005.

[187] Mayo MA. Developments in plant virus taxonomy since the publication of the 6th ICTV Report. International Committee on Taxonomy of Viruses. Arch Virol, 144(8):1659–66, 1999.

[188] Medstrand P & Blomberg J. Characterization of novel reverse transcriptase encoding human en- dogenous retroviral sequences similar to type A and type B retroviruses: differential transcription in normal human tissues. J Virol, 67(11):6778–87, Nov 1993.

[189] Medstrand P, van de Lagemaat LN, & Mager DL. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res, 12(10):1483–95, Oct 2002. doi: 10.1101/gr.388902.

[190] Melana SM, Nepomnaschy I, Sakalian M, Abbott A, et al. Characterization of viral particles isolated from primary cultures of human breast cancer cells. Cancer Res, 67(18):8960–5, Sep 2007. doi: 10.1158/0008-5472.CAN-06-3892.

[191] Menéndez-Arias L & Berkhout B. Retroviral reverse transcription. Virus Res, 134(1-2):1–3, Jun 2008. doi: 10.1016/j.virusres.2008.01.009.

[192] Metcalfe DD. Genetically modified crops and allergenicity. Nat Immunol, 6(9):857–60, Sep 2005. doi: 10.1038/ni0905-857.

[193] Mi S, Lee X, Li X, Veldman GM, et al. Syncytin is a captive retroviral envelope protein involved in human placental morphogenesis. Nature, 403(6771):785–9, Feb 2000. doi: 10.1038/35001608.

[194] Miller WJ, McDonald JF, Nouaud D, & Anxolabéhère D. Molecular domestication–more than a sporadic episode in evolution. Genetica, 107(1-3):197–207, 1999.

[195] Morgenstern B. DIALIGN 2: improvement of the segment-to-segment approach to multiple se- quence alignment. Bioinformatics, 15(3):211–8, Mar 1999.

[196] Muh HC, Tong JC, & Tammi MT. Allerhunter: a svm-pairwise system for assessment of allergenic- ity and allergic cross-reactivity in proteins. PLoS One, 4(6):e5861, 2009. doi: 10.1371/journal.pone. 0005861.

[197] Murphy KP, Travers P, Walport M, & Janeway C. Janeway’s immunobiology. Garland Science, New York, 7th ed. edition, 2008. ISBN 0815341237 (978-0-8153-4123-9). URL http://www. loc.gov/catdir/toc/ecip079/2007002499.html.

[198] Nassal M. Hepatitis B viruses: reverse transcription a different way. Virus Res, 134(1-2):235–49, Jun 2008. doi: 10.1016/j.virusres.2007.12.024.

[199] NCBI. NCBI Entrez Utilities. URL http://eutils.ncbi.nlm.nih.gov/.

[200] Needleman SB & Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48(3):443–53, Mar 1970.

97 [201] Ng RK & Gurdon JB. Epigenetic memory of an active gene state depends on histone H3.3 incor- poration into chromatin in the absence of transcription. Nat Cell Biol, 10(1):102–9, Jan 2008. doi: 10.1038/ncb1674.

[202] Nobrega MA, Ovcharenko I, Afzal V, & Rubin EM. Scanning human gene deserts for long-range enhancers. Science, 302(5644):413, Oct 2003. doi: 10.1126/science.1088328.

[203] Nordlee JA, Taylor SL, Townsend JA, Thomas LA, & Bush RK. Identification of a Brazil-nut allergen in transgenic soybeans. N Engl J Med, 334(11):688–92, Mar 1996.

[204] Notredame C, Higgins DG, & Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol, 302(1):205–17, Sep 2000. doi: 10.1006/jmbi.2000.4042.

[205] Nuin PAS, Wang Z, & Tillier ERM. The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics, 7:471, 2006. doi: 10.1186/1471-2105-7-471.

[206] Ohshima K & Okada N. SINEs and LINEs: symbionts of eukaryotic genomes with a common tail. Cytogenet Genome Res, 110(1-4):475–90, 2005. doi: 10.1159/000084981.

[207] Olason PI. Integrating protein annotation resources through the Distributed Annotation System. Nucleic Acids Res, 33(Web Server issue):W468–70, Jul 2005. doi: 10.1093/nar/gki463.

[208] Ouzounis CA & Valencia A. Early bioinformatics: the birth of a discipline–a personal view. Bioin- formatics, 19(17):2176–90, Nov 2003.

[209] Owen KR & McCarthy MI. Genetics of type 2 diabetes. Curr Opin Genet Dev, 17(3):239–44, Jun 2007. doi: 10.1016/j.gde.2007.04.003.

[210] Paces J, Pavlícek A, Zika R, Kapitonov VV, et al. HERVd: the Human Endogenous RetroViruses Database: update. Nucleic Acids Res, 32(Database issue):D50, Jan 2004. doi: 10.1093/nar/gkh075.

[211] Pang JF, Kluetsch C, Zou XJ, Zhang Ab, et al. mtDNA data indicate a single origin for dogs south of Yangtze River, less than 16,300 years ago, from numerous wolves. Mol Biol Evol, 26(12):2849–64, Dec 2009. doi: 10.1093/molbev/msp195.

[212] Parham P & Janeway C. The immune system. Garland Science, London, 3rd ed edition, 2009. ISBN 9780815341468.

[213] Parker HG, VonHoldt BM, Quignon P, Margulies EH, et al. An expressed fgf4 retrogene is associ- ated with breed-defining chondrodysplasia in domestic dogs. Science, 325(5943):995–8, Aug 2009. doi: 10.1126/science.1173275.

[214] Parra G, Reymond A, Dabbouseh N, Dermitzakis ET, et al. Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res, 16(1):37–44, Jan 2006. doi: 10.1101/gr.4145906.

[215] Patterson KD & Pyle GF. The geography and mortality of the 1918 influenza pandemic. Bull Hist Med, 65(1):4–21, 1991.

[216] Pennisi E. Genomics. DNA study forces rethink of what it means to be a gene. Science, 316(5831): 1556–7, Jun 2007. doi: 10.1126/science.316.5831.1556.

[217] Pettifer S, Thorne D, McDermott P, Attwood T, et al. An active registry for bioinformatics web services. Bioinformatics, 25(16):2090–1, Aug 2009. doi: 10.1093/bioinformatics/btp329.

[218] Poms RE, Klein CL, & Anklam E. Methods for allergen analysis in food: a review. Food Addit Contam, 21(1):1–31, Jan 2004. doi: 10.1080/02652030310001620423.

[219] Prak ET & Kazazian HH, Jr. Mobile elements and the human genome. Nat Rev Genet, 1(2):134–44, Nov 2000.

98 [220] Prlic´ A, Down TA, & Hubbard TJP. Adding some SPICE to DAS. Bioinformatics, 21 Suppl 2: ii40–1, Sep 2005. doi: 10.1093/bioinformatics/bti1106.

[221] Prlic´ A, Down TA, Kulesha E, Finn RD, et al. Integrating sequence and structural biology with DAS. BMC Bioinformatics, 8:333, 2007. doi: 10.1186/1471-2105-8-333.

[222] Provost F, Fawcett T, & Kohavi R. The case against accuracy estimation for comparing classifiers. In Kaufmann M, editor, Proceedings of the Fifteenth International Conference on Machine Learning, San Francisco, 1998.

[223] Purohit A, Shao J, Degreef JM, van Leeuwen A, et al. Role of tropomyosin as a cross-reacting allergen in sensitization to cockroach in patients from Martinique (French Caribbean island) with a respiratory allergy to mite and a food allergy to crab and shrimp. Eur Ann Allergy Clin Immunol,39 (3):85–8, Mar 2007.

[224] Racki LR, Yang JG, Naber N, Partensky PD, et al. The chromatin remodeller ACF acts as a dimeric motor to space nucleosomes. Nature, 462(7276):1016–21, Dec 2009. doi: 10.1038/nature08621.

[225] Radauer C & Breiteneder H. Pollen allergens are restricted to few protein families and show distinct patterns of species distribution. J Allergy Clin Immunol, 117(1):141–7, Jan 2006. doi: 10.1016/j. jaci.2005.09.010.

[226] Rassoulzadegan M, Grandjean V, Gounon P, Vincent S, et al. RNA-mediated non-mendelian in- heritance of an epigenetic change in the mouse. Nature, 441(7092):469–74, May 2006. doi: 10.1038/nature04674.

[227] Rhoades R. Human physiology. Thomson Learning, Philadelphia, PA, 4th ed edition, 2002. ISBN 0030321298.

[228] Ribet D, Dewannieux M, & Heidmann T. An active murine transposon family pair: retrotransposi- tion of "master" MusD copies and ETn trans-mobilization. Genome Res, 14(11):2261–7, Nov 2004. doi: 10.1101/gr.2924904.

[229] Russo VEA, Martienssen RA, & Riggs AD. Epigenetic mechanisms of gene regulation, volume 32. Cold Spring Harbor Laboratory Press, Plainview, N.Y., 1996. ISBN 0879694904.

[230] Saha S & Raghava GPS. AlgPred: prediction of allergenic proteins and mapping of IgE epitopes. Nucleic Acids Res, 34(Web Server issue):W202–9, Jul 2006. doi: 10.1093/nar/gkl343.

[231] Saitou N & Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 4(4):406–25, Jul 1987.

[232] Salmon Hillbertz NHC, Isaksson M, Karlsson EK, Hellmén E, et al. Duplication of FGF3, FGF4, FGF19 and ORAOV1 causes hair ridge and predisposition to dermoid sinus in Ridgeback dogs. Nat Genet, 39(11):1318–20, Nov 2007. doi: 10.1038/ng.2007.4.

[233] Samuelson LC, Wiebauer K, Snow CM, & Meisler MH. Retroviral and pseudogene insertion sites reveal the lineage of human salivary and pancreatic amylase genes from a single gene during primate evolution. Mol Cell Biol, 10(6):2513–20, Jun 1990.

[234] Sandelin A, Carninci P, Lenhard B, Ponjavic J, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet, 8(6):424–36, Jun 2007. doi: 10.1038/nrg2026.

[235] Schmidt P, Forsman A, Andersson G, Blomberg J, & Korsgren O. Pig islet xenotransplantation: activation of porcine endogenous retrovirus in the immediate post-transplantation period. Xeno- transplantation, 12(6):450–6, Nov 2005. doi: 10.1111/j.1399-3089.2005.00244.x.

[236] SCP. The ocurrence of severe food allergies in the EU. Report of experts participating in task 7.2. Technical report, Scientific Cooperation Programme in the EU, 1998.

99 [237] Shors T. Understanding viruses. Jones and Bartlett Publishers, Sudbury, Mass., 2009. ISBN 9780763729325 (alk. paper). URL http://www.loc.gov/catdir/toc/ecip0717/ 2007019380.html.

[238] Simons C, Makunin IV, Pheasant M, & Mattick JS. Maintenance of transposon-free regions through- out vertebrate evolution. BMC Genomics, 8:470, 2007. doi: 10.1186/1471-2164-8-470.

[239] Sladek R, Rocheleau G, Rung J, Dina C, et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445(7130):881–5, Feb 2007. doi: 10.1038/nature05616.

[240] Smit A, Hubley R, & Green P. RepeatMasker Open-3.0, 1996-2004. URL http://www. repeatmasker.org/.

[241] Smith B, Ashburner M, Rosse C, Bard J, et al. The OBO Foundry: coordinated evolution of on- tologies to support biomedical data integration. Nat Biotechnol, 25(11):1251–5, Nov 2007. doi: 10.1038/nbt1346.

[242] Smith TF & Waterman MS. Identification of common molecular subsequences. J Mol Biol, 147(1): 195–7, Mar 1981.

[243] Snel B, Huynen MA, & Dutilh BE. Genome trees and the nature of genome evolution. Annu Rev Microbiol, 59:191–209, 2005. doi: 10.1146/annurev.micro.59.030804.121233.

[244] Soeria-Atmadja D. Novel Computational Analyses of Allergens for Improved Allergenicity Risk Assessment and Characterization of IgE Reactivity Relationships. PhD thesis, Uppsala University, Sweden, 2008.

[245] Soeria-Atmadja D, Lundell T, Gustafsson MG, & Hammerling U. Computational detection of aller- genic proteins attains a new level of accuracy with in silico variable-length peptide extraction and machine learning. Nucleic Acids Res, 34(13):3779–93, 2006.

[246] Sperber GO, Airola T, Jern P, & Blomberg J. Automated recognition of retroviral sequences in genomic data–RetroTector. Nucleic Acids Res, 35(15):4964–76, 2007. doi: 10.1093/nar/gkm515.

[247] Spilianakis CG, Lalioti MD, Town T, Lee GR, & Flavell RA. Interchromosomal associations be- tween alternatively expressed loci. Nature, 435(7042):637–45, Jun 2005. doi: 10.1038/nature03574.

[248] Stadler MB & Stadler BM. Allergenicity prediction by protein sequence. FASEB J, 17(9):1141–3, Jun 2003. doi: 10.1096/fj.02-1052fje.

[249] Stanley SM. An Ecological Theory for the Sudden Origin of Multicellular Life in the Late Precam- brian. Proc Natl Acad SciUSA, 70(5):1486–1489, May 1973.

[250] Stein L. Genome annotation: from sequence to biology. Nat Rev Genet, 2(7):493–503, Jul 2001. doi: 10.1038/35080529.

[251] Stein L. Creating a bioinformatics nation. Nature, 417(6885):119–20, May 2002. doi: 10.1038/ 417119a.

[252] Stoesser G, Sterk P, Tuli MA, Stoehr PJ, & Cameron GN. The EMBL Nucleotide Sequence Database. Nucleic Acids Res, 25(1):7–14, Jan 1997.

[253] Stoye JP. Endogenous retroviruses: still active after all these years? Curr Biol, 11(22):R914–6, Nov 2001.

[254] Stoye JP, Moroni C, & Coffin JM. Virological events leading to spontaneous AKR thymomas. J Virol, 65(3):1273–85, Mar 1991.

100 [255] Subramanian AR, Weyer-Menkhoff J, Kaufmann M, & Morgenstern B. DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics, 6:66, 2005. doi: 10.1186/1471-2105-6-66.

[256] Sun S, La Scola B, Bowman VD, Ryan CM, et al. Structural Studies of the Sputnik Virophage. J Virol, Nov 2009. doi: 10.1128/JVI.01957-09.

[257] Swets JA. Measuring the accuracy of diagnostic systems. Science, 240(4857):1285–93, Jun 1988.

[258] Talbert PB & Henikoff S. Spreading of silent chromatin: inaction at a distance. Nat Rev Genet,7 (10):793–803, Oct 2006. doi: 10.1038/nrg1920.

[259] Taylor J, Schenck I, Blankenberg D, & Nekrutenko A. Using galaxy to perform large-scale interactive data analyses. Curr Protoc Bioinformatics, Chapter 10:Unit 10.5, Sep 2007. doi: 10.1002/0471250953.bi1005s19.

[260] Temin HM. HOMOLOGY BETWEEN RNA FROM ROUS SARCOMA VIROUS AND DNA FROM ROUS SARCOMA VIRUS-INFECTED CELLS. Proc Natl Acad Sci U S A, 52:323–9, Aug 1964.

[261] Temin HM & Baltimore D. RNA-directed DNA synthesis and RNA tumor viruses. Adv Virus Res, 17:129–86, 1972.

[262] Temin HM & Mizutani S. RNA-dependent DNA polymerase in virions of Rous sarcoma virus. Nature, 226(5252):1211–3, Jun 1970.

[263] Thomas K, Aalbers M, Bannon GA, Bartels M, et al. A multi-laboratory evaluation of a common in vitro pepsin digestion assay protocol used in assessing the safety of novel proteins. Regul Toxicol Pharmacol, 39(2):87–98, Apr 2004. doi: 10.1016/j.yrtph.2003.11.003.

[264] Thomas WR, Smith WA, Hales BJ, Mills KL, & O’Brien RM. Characterization and immunobiology of house dust mite allergens. Int Arch Allergy Immunol, 129(1):1–18, Sep 2002.

[265] Thompson JD, Higgins DG, & Gibson TJ. CLUSTAL W: improving the sensitivity of progres- sive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22(22):4673–80, Nov 1994.

[266] Topley WWC, Wilson GS, Collier LH, Balows A, et al. Topley & Wilson’s microbiology and microbial infections. Arnold, London, 9th ed. edition, 1998. ISBN 0340614706 (set). URL http://www.loc.gov/catdir/enhancements/fy0606/98137864-d.html.

[267] Tristem M. Identification and characterization of novel human endogenous retrovirus families by phylogenetic screening of the human genome mapping project database. J Virol, 74(8):3715–30, Apr 2000.

[268] UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res,38 (Database issue):D142–8, Jan 2010. doi: 10.1093/nar/gkp846.

[269] Urisman A, Molinaro RJ, Fischer N, Plummer SJ, et al. Identification of a novel Gammaretrovirus in prostate tumors of patients homozygous for R462Q RNASEL variant. PLoS Pathog, 2(3):e25, Mar 2006. doi: 10.1371/journal.ppat.0020025.

[270] Valent P, Hauswirth AW, Natter S, Sperr WR, et al. Assays for measuring in vitro basophil activation induced by recombinant allergens. Methods, 32(3):265–70, Mar 2004. doi: 10.1016/j.ymeth.2003. 08.006.

[271] van der Laan LJ, Lockey C, Griffeth BC, Frasier FS, Wilson CA, et al. Infection by porcine endoge- nous retrovirus after islet xenotransplantation in SCID mice. Nature, 407(6800):90–4, Sep 2000. doi: 10.1038/35024089.

101 [272] van Laere AS, Nguyen M, Braunschweig M, Nezer C, et al. A regulatory mutation in IGF2 causes a major QTL effect on muscle growth in the pig. Nature, 425(6960):832–6, Oct 2003. doi: 10.1038/ nature02064.

[273] van Regenmortel MHV. Virus taxonomy: classification and nomenclature of viruses : seventh report of the International Committee on Taxonomy of Viruses. Academic Press, San Diego, 2000. ISBN 0123702003 (alk. paper). URL http://www.loc.gov/catdir/toc/els033/ 99065267.html.

[274] van Regenmortel MHV & Mahy BWJ. Emerging issues in virus taxonomy. Emerg Infect Dis,10 (1):8–13, Jan 2004.

[275] van Valen L. A new evolutionary law. Evolutionary Theory, 1:1–30, 1973.

[276] Vastrik I, D’Eustachio P, Schmidt E, Joshi-Tope G, et al. Reactome: a knowledge base of biologic pathways and processes. Genome Biol, 8(3):R39, 2007. doi: 10.1186/gb-2007-8-3-r39.

[277] Vieths S, Scheurer S, & Ballmer-Weber B. Current understanding of cross-reactivity of food aller- gens and pollen. Ann N Y Acad Sci, 964:47–68, May 2002.

[278] Villesen P, Aagaard L, Wiuf C, & Pedersen FS. Identification of endogenous retroviral reading frames in the human genome. Retrovirology, 1:32, 2004. doi: 10.1186/1742-4690-1-32.

[279] Virtanen T. Lipocalin allergens. Allergy, 56 Suppl 67:48–51, 2001.

[280] Visel A, Rubin EM, & Pennacchio LA. Genomic views of distant-acting enhancers. Nature, 461 (7261):199–205, Sep 2009. doi: 10.1038/nature08451.

[281] Volff JN. Turning junk into gold: domestication of transposable elements and the creation of new genes in eukaryotes. Bioessays, 28(9):913–22, Sep 2006. doi: 10.1002/bies.20452.

[282] von Mutius E, Weiland SK, Fritzsch C, Duhme H, & Keil U. Increasing prevalence of hay fever and atopy among children in Leipzig, East Germany. Lancet, 351(9106):862–6, Mar 1998. doi: 10.1016/S0140-6736(97)10100-3.

[283] Vries Hd & Gager CS. Intracellular pangenesis, including a paper on fertilization and hybridiza- tion. The Open court publishing co., Chicago, 1910.

[284] W3C. SOAP Message Transmission Optimization Mechanism. URL http://www.w3.org/ TR/soap12-mtom/.

[285] Wagener J, Spjuth O, Willighagen EL, & Wikberg JES. XMPP for cloud computing in bioinformat- ics supporting discovery and invocation of asynchronous web services. BMC Bioinformatics, 10: 279, 2009. doi: 10.1186/1471-2105-10-279.

[286] Wang AL & Wang CC. Viruses of the protozoa. Annu Rev Microbiol, 45:251–63, 1991. doi: 10.1146/annurev.mi.45.100191.001343.

[287] Watson JD. Molecular biology of the gene. Pearson/Benjamin Cummings, San Francisco, 6th ed edition, 2008. ISBN 9780805395921 (cloth : alk. paper). URL http://www.loc.gov/ catdir/toc/ecip0727/2007038009.html.

[288] Watson JD & Cook-Deegan RM. Origins of the Human Genome Project. FASEB J, 5(1):8–11, Jan 1991.

[289] Wayne RK & Ostrander EA. Lessons learned from the dog genome. Trends Genet, 23(11):557–67, Nov 2007. doi: 10.1016/j.tig.2007.08.013.

[290] Weber RW. Cross-reactivity of plant and animal allergens. Clin Rev Allergy Immunol, 21(2-3): 153–202, Oct 2001. doi: 10.1385/CRIAI:21:2-3:153.

102 [291] Weedon MN. The importance of TCF7L2. Diabet Med, 24(10):1062–6, Oct 2007. doi: 10.1111/j. 1464-5491.2007.02258.x.

[292] Wegel E, Koumproglou R, Shaw P, & Osbourn A. Cell Type-Specific Chromatin Decondensation of a Metabolic Gene Cluster in Oats. Plant Cell, Dec 2009. doi: 10.1105/tpc.109.072124.

[293] Whitelaw E & Martin DI. Retrotransposons as epigenetic mediators of phenotypic variation in mammals. Nat Genet, 27(4):361–5, Apr 2001. doi: 10.1038/86850.

[294] Wilkinson LS, Davies W, & Isles AR. Genomic imprinting effects on brain development and func- tion. Nat Rev Neurosci, 8(11):832–43, Nov 2007. doi: 10.1038/nrn2235.

[295] Wilson CA. Porcine endogenous retroviruses and xenotransplantation. Cell Mol Life Sci, 65(21): 3399–412, Nov 2008. doi: 10.1007/s00018-008-8498-z.

[296] Wintrobe MM & Greer JP. Wintrobe’s clinical hematology. Wolters Kluwer Health/Lippincott Williams and Wilkins, Philadelphia, 12th ed. edition, 2009. ISBN 9780781765077 (case : alk. paper). URL http://www.loc.gov/catdir/toc/ecip0826/2008036157.html.

[297] Witteman AM, Akkerdaas JH, van Leeuwen J, van der Zee JS, & Aalberse RC. Identification of a cross-reactive allergen (presumably tropomyosin) in shrimp, mite and insects. Int Arch Allergy Immunol, 105(1):56–61, Sep 1994.

[298] Woof JM & Burton DR. Human antibody-Fc receptor interactions illuminated by crystal structures. Nat Rev Immunol, 4(2):89–99, Feb 2004.

[299] Xiong Y & Eickbush TH. Origin and evolution of retroelements based upon their reverse transcrip- tase sequences. EMBO J, 9(10):3353–62, Oct 1990.

[300] Yin PD & Hu WS. RNAs from genetically distinct retroviruses can copackage and exchange genetic information in vivo. J Virol, 71(8):6237–42, Aug 1997.

[301] Zdobnov EM, Campillos M, Harrington ED, Torrents D, & Bork P. Protein coding potential of retroviruses and other transposable elements in vertebrate genomes. Nucleic Acids Res, 33(3):946– 54, 2005. doi: 10.1093/nar/gki236.

103  5    5                 ) 3 '6 7  6 8!        '6  

      6 8!        '6   5  5    !!   !   !4  9  :    6       ;  < : 6 6 4  :6  6 !       4!      6!6 6   7   =6   !   5  7     6 8!        '6  9 >   ? ! /. 6   : !4 6 !  6    @=6   !   5  7     6 8!        '6  A9B

7  4!  3 !4   9!!9 ! 3 4 33!!3 ,,,1/