
INTERACTIVE PEDIGREE PLOTTER FOR GENETIC ANALYSIS

Sveinn Már Ásgeirsson

Master of Science

June 2019

School of Science and Engineering

Reykjavík University

M.Sc Thesis

Interactive Pedigree Plotter for Genetic Analysis

by

Sveinn Már Ásgeirsson

Thesis of 30 ECTS credits submitted to the School of Science and Engineering at Reykjavík University in partial fulfillment of the requirements for the degree of Master of Science (M.Sc) in Biomedical Engineering

June 2019

Supervisor:

Bjarni V. Halldórsson, Supervisor Associate Professor, Reykjavík University, Iceland

Guðbjörn F. Jónsson, Supervisor Software Developer, deCODE, Iceland

Examiner:

Páll Melsted, Examiner Professor, University of Iceland, Iceland

Copyright Sveinn Már Ásgeirsson June 2019

Interactive Pedigree Plotter for Genetic Analysis

Sveinn Már Ásgeirsson

June 2019

Abstract

Diseases and traits are often caused by mutations in the genome. A very useful way to try to determine how these mutations occur and are inherited in families is by looking at a pedigree. A pedigree is a diagram that depicts the biological relationship between an individual, its ancestors and other relatives. It is often used to look at the genetic transmission of genetic disorders. The purpose of a pedigree is to have a visually easy-to-read chart that depicts a certain characteristic or disorder in a family. It can be used for a physical characteristic like having a widow's peak or attached earlobes, or a genetic disorder like colorblindness or Huntington's disease.

With technological advances in genetics over the last two decades, such as improved genotyping methods, e.g. moving from microsatellites and other single-marker genotyping methods to chip genotyping and whole genome sequencing, researchers at deCODE now have an immense volume of data to work with. This gives them a much better and more accurate understanding of infrequent variations in the DNA that may be a contributing factor to rare diseases. One method to analyze these rare variants is to use pedigrees, but because of the low penetrance of these variants, the researcher may have to draw very large pedigrees, of hundreds or even thousands of individuals, just to understand the inheritance of the variant.

This thesis reviews a pedigree plotter, Interactive Pedigree Plotter (IPP), designed by the author, which specializes in large pedigrees; both drawing them and working with them. The “interactive” refers to allowing the user to, for example, collapse/expand, move and delete certain parts of the descendant tree; a feature that becomes important with increasing pedigree size. The IPP also offers various attribute features such as arbitrary text attributes and multiple symbols, giving the user good tools to distinguish individuals from one another as well as being able to have multiple phenotypes, e.g. several cancer types.

To summarize, IPP is a pedigree plotter that is well equipped to handle large and complex pedigrees, enabling researchers to study rare variants.

This thesis reviews IPP and its features, and then compares it to several similar pedigree plotters to see whether existing pedigree plotters are sufficiently advanced for the genetic analyses being done at deCODE Genetics.

Gagnvirkur Fjölskylduteiknari fyrir Erfðarannsóknir (Interactive Pedigree Plotter for Genetic Analysis)

Sveinn Már Ásgeirsson

June 2019

Útdráttur (Icelandic abstract, translated)

Diseases and hereditary traits often arise from mutations in the genome. One very useful way to try to understand how mutations arise and how they are inherited in families is to look at a pedigree. Pedigrees are diagrams that describe the biological relationship between individuals, their ancestors and other relatives. Pedigrees are often used to examine the genetic transmission of variants that cause genetic disorders. The purpose of a pedigree is to provide an easy-to-read diagram that shows a certain characteristic trait or disease in a family. It can be used for various physical traits (e.g. a widow's peak or attached earlobes), or for genetic disorders such as colorblindness or Huntington's disease.

There have been great technological advances in genetics over the last two decades with regard to genotyping, e.g. the shift from microsatellites and other single-marker genotyping methods to chip genotyping and whole genome sequencing, which give scientists at deCODE Genetics (Íslensk Erfðagreining) a much greater volume of genetic data to work with. This gives a much more accurate and better understanding of rare variants in the genome that could possibly contribute to rare diseases. One way to analyze these rare variants is to use pedigrees, but because of their low penetrance, the scientist may need to draw very large pedigrees, of a hundred to a thousand individuals, just to be able to understand the inheritance of the variant.

This thesis examines a pedigree plotter, Gagnvirkur Fjölskylduteiknari (GFT), designed by the author specifically to handle large pedigrees; both drawing them and working with them. “Interactivity” refers to enabling the user to, for example, collapse/expand, move and/or delete certain branches of the tree, and this interactivity becomes ever more important as pedigrees grow. GFT also offers a variety of attributes, such as text attributes whose text the user chooses freely, as well as symbol attributes. This gives the user good tools to tell individuals in the tree apart and to show several phenotypes in a single tree, for example several cancer phenotypes.

To summarize, GFT is a pedigree plotter that is well equipped to cope with large and complex pedigrees, which enables scientists to study rare variants.

This thesis examines and describes GFT and its features, and then compares it with similar pedigree plotters to see whether the existing pedigree plotters are sufficiently advanced to be usable for genetic research at deCODE Genetics.

Interactive Pedigree Plotter for Genetic Analysis

Sveinn Már Ásgeirsson

Thesis of 30 ECTS credits submitted to the School of Science and Engineering at Reykjavík University in partial fulfillment of the requirements for the degree of Master of Science (M.Sc) in Biomedical Engineering

June 2019

Student:

Sveinn Már Ásgeirsson

Supervisor:

Bjarni V. Halldórsson

Guðbjörn F. Jónsson

Examiner:

Páll Melsted

I dedicate this thesis to my daughter, Alba Rós Sveinsdóttir.

Acknowledgements

This work was funded by deCODE Genetics.

Preface

The programming of the software and the writing of this dissertation were done solely by the author, Sveinn Már Ásgeirsson. The design and the idea of the software were developed in collaboration with several people, including Gísli Másson, Guðbjörn F. Jónsson, Birgir Pálsson and Hreinn Stefánsson.

Contents

Acknowledgements

Preface

Contents

List of Figures

List of Tables

1 Introduction

2 Background
  2.1 Pedigree
  2.2 Genotype
  2.3 Phenotype
  2.4 Transmission genetics
  2.5 Penetrance
  2.6 Genetic marker (variation)
    2.6.1 SNPs
    2.6.2 Allele
    2.6.3 Common variants
    2.6.4 Rare variants
  2.7 Genotyping methods
    2.7.1 Microsatellite genotyping
    2.7.2 Chip genotyping
    2.7.3 Whole genome sequencing
  2.8 Linkage analysis
  2.9 GWAS
  2.10 Haplotype
    2.10.1 Phasing
    2.10.2 Parental origin
    2.10.3 Long-range phasing

3 Related Work
  3.1 Online Pedigree Designers
    3.1.1 Medical Pedigree
    3.1.2 Progeny Pedigree Tool
    3.1.3 Genial Pedigree Draw
  3.2 Stand-alone Pedigree Plotters
    3.2.1 HaploPainter
    3.2.2 CraneFoot
    3.2.3 Madeline

4 Methods
  4.1 Incentive
  4.2 Implementation
  4.3 Necessary requirements
    4.3.1 High drawing speed
    4.3.2 Handling complex family patterns
    4.3.3 Interaction

5 Results
  5.1 Pedigree Report file
    5.1.1 First line
    5.1.2 Pedigree report columns
      5.1.2.1 PN
      5.1.2.2 and
      5.1.2.3 Sex
      5.1.2.4 Yob and Yod
      5.1.2.5 Affstatus
  5.2 Attributes file
    5.2.1 Text attributes
    5.2.2 Symbols
    5.2.3 Haplotypes
  5.3 Layout algorithm
  5.4 Interactive features
    5.4.1 Deleting
    5.4.2 Collapsing and expanding
    5.4.3 Moving
    5.4.4 Shrinking and stretching
  5.5 Comparison
    5.5.1 Complex families
      5.5.1.1 Multiple spouses
      5.5.1.2 Simple consanguinity
      5.5.1.3 Complex consanguinity
      5.5.1.4 have spouses that also have parents in the pedigree
      5.5.1.5 Partners belong to different generations
      5.5.1.6 Pedigree made up from many smaller families
      5.5.1.7 Single connection
    5.5.2 Drawing speed
    5.5.3 Interaction

6 Future work
  6.1 Haplotypes
  6.2 Delete
  6.3 Add
  6.4 Export
  6.5 Double consanguinity line
  6.6 Recalculation

7 Summary and conclusion

8 Discussion

Bibliography

List of Figures

2.1 Standard set of pedigree symbols and an example of pedigree.
2.2 Image showing Genotype
2.3 Image showing Phenotype
2.4 Two trio phasing examples for one marker.

3.1 Medical Pedigree.
3.2 The Progeny Pedigree Tool offers to start with a small family of maximum four generations, and the user can then add more nodes to the pedigree afterwards as well as adding text attributes and symbols.
3.3 Genial Pedigree Draw
3.4 Example of HaploPainter in action.
3.5 Examples of how CraneFoot pedigrees can become hard to read.
3.6 Examples of using CraneFoot with very big pedigrees.
3.7 Example of the hybrid approach used to avoid line crossings.
3.8 Example of Madeline in action. [22]

5.1 Pedigree Example for PRE.
5.2 Example of attribute dialog without and with pedigree report 5.2 added.
5.3 Pedigree Example for attributes
5.4 Example of applying text attributes as tooltip.
5.5 Symbols for cancer
5.6 Symbols for autism
5.7 having kids forms a circle
5.8 Individual with two spouses
5.9 Individual with two spouses
5.10 Mates from different generations
5.11 Mates all with parents
5.12 Two connected families
5.13 Simple pedigree where a branch of family members that are unaffected is deleted from the pedigree.
5.14 Simple pedigree where a branch of family members is collapsed.
5.15 A bigger pedigree of 65 individuals where all links are collapsed.
5.16 A very cluttered pedigree fixed by moving nodes around.
5.17 A very cluttered pedigree fixed by stretching it.
5.18 Multiple spouses for IPP.
5.19 Multiple spouses for HaploPainter.
5.20 Multiple spouses for CraneFoot.
5.21 Multiple spouses for Madeline.
5.22 Simple consanguinity for IPP.
5.23 Simple consanguinity for HaploPainter.
5.24 Simple consanguinity for CraneFoot.
5.25 Simple consanguinity for Madeline.
5.26 Complex consanguinity for IPP.
5.27 Complex consanguinity for HaploPainter.
5.28 Old Complex consanguinity for HaploPainter.
5.29 Complex consanguinity for CraneFoot.
5.30 Complex consanguinity for Madeline.
5.31 Spouses have parents also for IPP.
5.32 Spouses have parents also for HaploPainter.
5.33 Spouses have parents also for CraneFoot.
5.34 Spouses have parents also for Madeline.
5.35 Mates from different generations problem for IPP.
5.36 Mates from different generations problem for HaploPainter.
5.37 Mates from different generations problem for CraneFoot.
5.38 Mates from different generations problem for Madeline.
5.39 Many families problem for IPP.
5.40 Many families problem for HaploPainter.
5.41 Many families problem for CraneFoot.
5.42 Many families problem for Madeline.
5.43 Single parent problem for IPP.
5.44 Single parent problem for HaploPainter.
5.45 Single parent problem for CraneFoot.
5.46 Single parent problem for Madeline.

6.1 Haplotypes with recombination

List of Tables

2.1 Example of haplotypes.

5.1 Summary of Pedigree Report file structure components.
5.2 Summary of all symbols that could be used in autism family pedigree 5.6.
5.3 Comparison of the IPP to similar programs regarding complex family patterns.
5.4 Comparison of the IPP and similar pedigree plotters on drawing speed.
5.5 Comparison of the IPP to similar programs regarding interaction features.

Listings

5.1 Example of a Pedigree Report file.
5.2 Example of an attribute file made for the Pedigree Report file in listing 5.1.
5.3 Attribute file for pedigree report 5.1 (figure 5.5).
5.4 Attribute file for figure 5.6.
5.5 Attribute file where user wants no symbols - only symbol texts. In that case he just skips the “Symbol” column.
5.6 Attribute file where user wants some symbols - but not all. In that case he would leave out the values (“empty string”) where he wants no symbol.

Chapter 1

Introduction

Over the years, data collection has increased immensely due to technical advances in the field of genetics. One of the reasons is the constant development of genotyping methods over the last two decades. At deCODE Genetics, these developments include going from microsatellite genotyping to chip genotyping, as well as whole genome sequencing. Because of this, researchers have a much better and more accurate understanding of infrequent variations in the DNA and are thus able to study rarer variants than before. But with this data growth, it becomes more important to have advanced tools to analyse the data in a practical way.

One tool often used for analysis is a pedigree diagram. A pedigree is a diagram that depicts the biological relationship between an organism and its relatives. One purpose of a pedigree is to have an easy-to-read chart that depicts the transmission of a disease-causing mutation in a family. It can be used for a phenotype like having a widow's peak or attached earlobes, or a genetic disorder like colorblindness or Huntington's disease. For example, a researcher could draw a pedigree of a family that is considered to have a higher genetic risk of being diagnosed with a certain rare disease. The researcher could then use the pedigree to look at whether and how certain mutations are being inherited from one family member to another. This could give the researcher information about, for example, whether the mutation is likely to be inherited from a common ancestor.

This thesis reviews an all-in-one pedigree plotter, Interactive Pedigree Plotter (IPP), designed by the author, specially created to handle large sets of data while still remaining user-friendly, well-functioning and robust. As mentioned at the beginning of the chapter, greater data volume has enabled researchers to better understand rare variants, and studying rare variants requires a lot of genealogy data to get a bigger picture of the genetic transmission and the low penetrance in a family. When studying these rare variants it becomes very helpful to be able to draw large pedigrees from the genealogy data, which could range from hundreds to thousands of individuals, helping the researcher to study the genetic transmission in a visual manner. Drawing large pedigrees can be difficult, first and foremost because of the high number of nodes that need coordinates, but also because of the increasing likelihood of complex family patterns arising, such as interrelations and other patterns that cause line crossings in the pedigree, making it very confusing.

This is where the IPP comes in. First of all, the layout algorithm used in IPP is able to calculate coordinates for large data sets with up to several thousand individuals, as well as optimizing the coordinates when complex family patterns arise, doing recalculations to ensure that the resulting pedigree is easy to read.

One feature that becomes very important with increased pedigree size is being able to interact with the pedigree. The interactiveness of the pedigree allows the user to, for example, collapse/expand, move and delete certain parts of the descendant tree that are not of interest to the user. This becomes very useful because it can be daunting to have to navigate through several thousand individuals. These interaction features mostly help the researcher to narrow down on the variation of interest, to reduce clutter, or to remove individuals or branches from the pedigree that are irrelevant, for example, a whole family branch where none of the individuals show the variation of interest.

The IPP also offers various attribute features such as text attributes like name or year of birth/death, as well as arbitrary text attributes. Another attribute feature is symbols, giving the user good tools to distinguish individuals from one another as well as being able to have multiple phenotypes, e.g. several cancer types. One attribute feature that is currently a work in progress is the possibility of drawing haplotypes. Haplotypes are sets of markers on a single chromosome that tend to be inherited together (see chapter 2.10). The idea is that the researcher would be able to see haplotypes of genotyped individuals in the pedigree for a chosen chromosome location. This location might have several markers, but one of them could, for example, be a marker that is a possible cause of a certain genetic disorder. Then, looking at the pedigree, the researcher is able to see whether and how these haplotypes are inherited, that is, the actual genetic transmission of these variants of interest, and thus follow the haplotypes to better study how the markers are inherited.

The IPP will be used at deCODE Genetics to handle large sets of genealogy data interactively. Headquartered in Reykjavik, Iceland, deCODE Genetics is a global leader in analyzing and understanding the genome. Using their unique expertise and population resources, deCODE has discovered key genetic risk factors for dozens of common diseases ranging from cardiovascular disease to cancer. [1] The company is unique in that it has a large amount of genealogical information to work with in research.

The main features of the software are detailed in the first four sections of chapter 5. Then, the fifth section compares the IPP to similar pedigree plotters reviewed in chapter 3.2. There, the question of whether existing pedigree plotters are sufficiently advanced for genetic analyses such as those being done at deCODE Genetics is answered.

Chapter 2

Background

This chapter gives a brief description of several genetics terms used in the following chapters, starting with a short general description of what a pedigree is and then moving on to more specific concepts used in genetics.

2.1 Pedigree

A pedigree is a diagram that depicts the biological relationship between an organism and its relatives, the organism being most commonly a human, but pedigrees are also used for race horses, show dogs and more. It is often used to look at the genetic transmission of genetic disorders following Mendelian inheritance principles, for example whether the mutation causing the genetic disorder is dominant or recessive. The purpose of a pedigree is to have a visually easy-to-read chart that depicts a certain characteristic or disorder in a family. It can be used for a physical characteristic like having a widow's peak or attached earlobes, or a genetic disorder like colorblindness or Huntington's disease. The first known uses of pedigrees date back to the 15th century. The word pedigree comes from the Middle English word pedegrue, from Anglo-French pé de grue, literally meaning crane's foot, referencing the lines in the pedigree resembling the foot of a crane. [2] Pedigrees use a standard set of symbols to make them easier to understand:

• Males are represented by squares, while females are represented by circles. If the sex is unknown, diamonds are used.

• Parents are connected by horizontal lines, and vertical lines stemming from horizontal lines lead to the symbols for their offspring.

• The generations are often marked with Roman numerals, with I being the first generation, II being the children of the first generation, III being the grandchildren of the first generation, etc. Sometimes, each individual is also numbered with Arabic numerals.

• If an individual in a pedigree has the phenotype in question, the individual is marked with a filled symbol.

• If an individual is deceased, it is shown with a line diagonally through the symbol.

• Pedigrees are often constructed from one individual with the genetic disorder in question. This individual is called the proband and is often marked with an arrow.

• Consanguinity - when parents are blood related - is marked with a double horizontal mating line.

• Twins are marked either with a triangle when identical, or with a triangle with a strike-through when not identical.
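The per-individual conventions in the list above can be summarized as a small lookup. The following Python sketch is only an illustration: the `Individual` record, its field names and the `symbol_for` helper are assumptions made here and are not taken from IPP or any other plotter. Relationship markings (mating lines, consanguinity, twins) are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Individual:
    """Hypothetical record for one pedigree member (illustrative field names)."""
    ident: str
    sex: Optional[str] = None   # "M", "F", or None when unknown
    affected: bool = False      # shows the phenotype in question
    deceased: bool = False
    proband: bool = False


def symbol_for(person: Individual) -> dict:
    """Map an individual to the standard drawing conventions listed above."""
    # Square for males, circle for females, diamond when the sex is unknown.
    shape = {"M": "square", "F": "circle"}.get(person.sex, "diamond")
    return {
        "shape": shape,
        "filled": person.affected,         # filled symbol = has the phenotype
        "diagonal_line": person.deceased,  # line through the symbol = deceased
        "arrow": person.proband,           # arrow marks the proband
    }


print(symbol_for(Individual("II-3", sex="F", affected=True, proband=True)))
```

Keeping the drawing decision in one function like this separates the genealogy data from how it is rendered, which is one possible design among many.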

(a) Standard set of symbols used in pedigrees. [3] (b) Pedigree example where some standard symbols are used. [4]

Figure 2.1: Standard set of pedigree symbols and an example of pedigree.

2.2 Genotype

A genotype is a part of the genetic makeup of an organism, which determines a characteristic (its phenotype). The Danish botanist, plant physiologist and geneticist W. Johannsen invented the term genotype when studying the inheritance of seed size in self-fertilized lines of beans, which led him to realize the necessity of distinguishing between the appearance of an organism and its genetic constitution. [5]

Figure 2.2: Here the relation between genotype and phenotype is illustrated, using a Punnett square, for the character of petal colour in a pea plant. The letters B and b represent alleles for colour and the pictures show the resultant flowers. [6]

2.3 Phenotype

Phenotypes are the observable properties of an organism, produced by the genotype in conjunction with the environment. In a more restricted sense, a phenotype is used for the effect a gene produces, in comparison with its mutant alleles, on the morphology of the organism in which it resides. Some genes control the behavior of the organism, which in turn generates an artefact outside the body. The Danish botanist, plant physiologist and geneticist W. Johannsen invented the term phenotype when studying the inheritance of seed size in self-fertilized lines of beans, which led him to realize the necessity of distinguishing between the appearance of an organism and its genetic constitution. [5]

Figure 2.3: The shells of individuals within the bivalve mollusk species Donax variabilis show diverse coloration and patterning in their phenotypes. [7]

2.4 Transmission genetics

Transmission genetics is the part of genetics concerning the mechanisms involved in the transfer of genes from parents to offspring. [5]

2.5 Penetrance

Penetrance is the proportion of individuals of a specified genotype that show the expected phenotype under a defined set of environmental conditions. For example, if all individuals carrying a dominant mutant gene show the mutant phenotype, the gene is said to show complete penetrance. [5]
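As a worked illustration of this definition, penetrance can be computed as a simple proportion. The sketch below is a minimal example; the function name and the counts used in it are hypothetical and do not come from any data set in this thesis.

```python
def penetrance(carriers_with_phenotype: int, carriers_total: int) -> float:
    """Proportion of carriers of a genotype that show the expected phenotype."""
    if carriers_total <= 0:
        raise ValueError("need at least one carrier of the genotype")
    if not 0 <= carriers_with_phenotype <= carriers_total:
        raise ValueError("affected carriers must be between 0 and the total")
    return carriers_with_phenotype / carriers_total


# Hypothetical counts: all 12 carriers affected means complete penetrance.
print(penetrance(12, 12))  # 1.0
# Hypothetical low-penetrance variant: 3 of 40 carriers affected.
print(penetrance(3, 40))   # 0.075
```

With a penetrance this low, most carriers in any single nuclear family are unaffected, which is why large pedigrees are needed to follow such a variant.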

2.6 Genetic marker (variation)

Genetic markers are variations at some given position in the genome.

2.6.1 SNPs

SNPs, or single nucleotide polymorphisms, are variations of a single nucleotide at a given position in the genomes of a population. Some human SNPs may be involved in genetic diseases, but most probably are not. [5]

2.6.2 Allele

An allele is one of the different forms that a marker may take. For example, a SNP has two alleles, which differ in the base pair at that position.

2.6.3 Common variants

Common variants are variants (e.g. SNPs) that are common in a certain population (e.g. a family). For example, a common variant associated with a certain disease is found very frequently when studying the disease.

2.6.4 Rare variants

Rare variants are variants (e.g. SNPs) that are rarely found in a certain population (e.g. a family). For example, a rare variant associated with a certain disease is found only rarely when studying the disease. These markers are more likely to be responsible for rare diseases.

2.7 Genotyping methods

Genotyping is the measurement of genetic variations. The measurement is most often done on blood, tissue or saliva samples.

2.7.1 Microsatellite genotyping

Microsatellites are mono-, di-, tri-, tetra- or pentanucleotide tandem repeats (with a repeat unit 1-5 base pairs long) that are interspersed throughout the genome. Of the mononucleotide repeats, runs of A and T are very common and together account for about 10 Mb, or 0.3% of the nuclear genome. Of dinucleotide repeats, arrays of CA repeats are most common, accounting for 0.5% of the genome. Trinucleotide and tetranucleotide tandem repeats are comparatively rare, but are often highly polymorphic. Microsatellite genotyping is the method of choice for many linkage analysis studies, especially genome scans designed to identify genes underlying common diseases. [8] [9] [10]
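To illustrate what such a tandem repeat looks like in sequence data, the following Python sketch scans a DNA string for repeat units of 1 to 5 bases. It is not a genotyping method; the function name, the `min_repeats` threshold and the example sequence are assumptions chosen purely for illustration.

```python
import re


def find_microsatellites(sequence: str, min_repeats: int = 4):
    """Return (start, motif, repeat_count) for tandem repeats of 1-5 bases."""
    # ([ACGT]{1,5}?) lazily captures the shortest repeat unit, and the
    # backreference \1 requires it to occur at least min_repeats times in a row.
    pattern = re.compile(r"([ACGT]{1,5}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Hypothetical sequence containing five CA repeats starting at index 3.
print(find_microsatellites("GGTCACACACACATTG"))  # [(3, 'CA', 5)]
```

Individuals typically differ in the number of repeat units at such a locus, and that count is what microsatellite genotyping measures.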

2.7.2 Chip genotyping

Chip genotyping is the measurement of genetic variation in a population using chips that are configured to look for certain markers in the genome. In the field of genetic epidemiology, these markers are often SNPs or small insertions or deletions; variations of interest that might be associated with a certain disease. These chips usually contain about 1M markers across the whole genome.

2.7.3 Whole genome sequencing

Whole genome sequencing, or WGS, is where the complete DNA sequence of an individual is analyzed in a single run. WGS opens up many possibilities in genetic research because of the extensive volume of data it produces, enabling researchers, with the help of computer power, to study rare variations that microsatellite or chip genotyping miss.

2.8 Linkage analysis

Linkage analysis is the analysis of the greater association in inheritance of two or more non-allelic genes than is to be expected from independent assortment. Variants are linked because they reside on the same chromosome. [5]

2.9 GWAS

A genome-wide association study, or GWAS, is a study of common genetic variation across the entire human genome designed to identify genetic association with observable traits, e.g. a disease like psoriasis. It involves testing a high number of variations across the whole genome to identify variants that are associated with the trait.

2.10 Haplotype

A haplotype (also known as a signature, a DNA signature, or a genetic signature) is a set of markers on a single chromosome that tend to be inherited together. A haplotype can refer to a combination of alleles or to a set of SNPs. The term is a contraction of haploid genotype. [5] [11]

Table 2.1: Example of haplotypes.

        A/A      A/T               T/T
C/C     AC AC    AC TC             TC TC
C/G     AC AG    AC TG or AG TC    TC TG
G/G     AG AG    AG TG             TG TG

Table 2.1 is an example of haplotypes where the first locus/marker has three possible genotypes: G/G, G/C and C/C, and the second locus/marker also has three possible genotypes: A/A, A/T and T/T. For a given individual, there are then four possible haplotypes for these two markers: AC, AG, TC or TG (ten possible haplotype pairs: AC AC, AC TC, TC TC, AC AG, AC TG, AG TC, TC TG, AG AG, AG TG, TG TG).
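The counts in table 2.1 can be checked by enumeration. The following is a minimal Python sketch, assuming two biallelic markers with alleles A/T and C/G; all variable and function names are illustrative only.

```python
from itertools import combinations_with_replacement, product

# Alleles of the two markers, in the order they appear in a haplotype string.
alleles_at = ("A", "T")
alleles_cg = ("C", "G")

# Every combination of one allele per marker is a possible haplotype.
haplotypes = ["".join(h) for h in product(alleles_at, alleles_cg)]
print(haplotypes)  # ['AC', 'AG', 'TC', 'TG']

# An individual carries an unordered pair of haplotypes.
pairs = list(combinations_with_replacement(haplotypes, 2))
print(len(pairs))  # 10


def genotypes_of(pair):
    """Unordered genotype at each marker implied by a haplotype pair."""
    first, second = pair
    return tuple("/".join(sorted((first[i], second[i]))) for i in range(2))


# Group haplotype pairs by the genotypes that would actually be observed.
by_genotype = {}
for pair in pairs:
    by_genotype.setdefault(genotypes_of(pair), []).append(pair)

# Only the double heterozygote is compatible with more than one pair.
ambiguous = {g: p for g, p in by_genotype.items() if len(p) > 1}
print(ambiguous)  # {('A/T', 'C/G'): [('AC', 'TG'), ('AG', 'TC')]}
```

The single ambiguous entry corresponds to the middle cell of table 2.1, which is the case that phasing has to resolve.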

2.10.1 Phasing

Phasing, or haplotype estimation, is the method of estimating which alleles in a genotype are inherited together to form a haplotype. For the haplotype example above (table 2.1), phasing would be to estimate which haplotype pair in the middle “is the right one” (AC TG or AG TC).

2.10.2 Parental origin

Parental origin, or parent of origin estimation, is used to determine whether alleles are on maternal or paternal chromosomes, on both or neither. In other words, it determines which allele is inherited from the mother and which is inherited from the father. If parental origin is known, haplotypes can easily be read from that data. A simple case of parent of origin estimation is when genotypes are available for a child and both parents, sometimes called trio phasing. In that case it is very easy to determine the origin, except when all are heterozygotes. [12] Let us look at an example of trio phasing:

(a) Parental origin known. (b) Parental origin unknown.

Figure 2.4: Two trio phasing examples for one marker.

In the trio phasing example above, figure 2.4a is an example where the father has the genotype A/G, the mother has the genotype G/G and the child has A/G. In that case, it is very easy to see that the child's A-allele comes from the father (paternal) and the G-allele comes from the mother (maternal), thus parent of origin can easily be determined. Then, figure 2.4b shows a case where the father, mother and child are all heterozygotes (with the genotype A/G). In that case, it is impossible to determine the parental origin from the genotype alone, because the child's A-allele and G-allele could each be either maternal or paternal.
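The two cases in figure 2.4 can be expressed as a small rule for a single marker. The sketch below is a simplified illustration, assuming genotypes written as strings such as "A/G", no genotyping errors and no new mutations; the function name `trio_phase` is hypothetical and this is not the phasing method used at deCODE.

```python
def trio_phase(father: str, mother: str, child: str):
    """Return (paternal_allele, maternal_allele) for the child at one marker.

    Returns None when the parental origin cannot be determined from the
    genotypes alone, e.g. when father, mother and child are all heterozygotes.
    """
    father_alleles = father.split("/")
    mother_alleles = mother.split("/")
    first, second = child.split("/")

    # Collect every assignment of the child's alleles that the parents allow.
    options = set()
    if first in father_alleles and second in mother_alleles:
        options.add((first, second))
    if second in father_alleles and first in mother_alleles:
        options.add((second, first))

    if not options:
        raise ValueError("genotypes are not consistent with the trio")
    return options.pop() if len(options) == 1 else None


print(trio_phase("A/G", "G/G", "A/G"))  # ('A', 'G'), as in figure 2.4a
print(trio_phase("A/G", "A/G", "A/G"))  # None, as in figure 2.4b
```

Applying the same rule marker by marker gives the child's paternal and maternal haplotypes wherever the origin is unambiguous.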

2.10.3 Long-range phasing

Long-range phasing is an advanced phasing method developed at deCODE, which has enabled researchers to phase bigger chromosome regions by making use of close and distant relatives, as well as closely linked markers:

“General phasing methods can only phase a small number of SNPs effectively and become unreliable when applied to SNPs spanning many linkage disequilibrium (LD) blocks. LRP, however, is able to phase more than 1,000 SNPs simultaneously. Moreover, haplotypes that are identical by descent (IBD) between close and distant relatives, for example, those separated by ten meioses or more, can often be reliably detected. This method is particularly powerful in studies of the inheritance of recurrent mutations and fine-scale recombinations in large sample sets. A further extension of the method allows us to impute long haplotypes for individuals who are not genotyped.” [13]

Chapter 3

Related Work

This chapter reviews several software programs that are able to draw pedigrees in one way or another. General concepts of the programs are discussed here, while chapter 5 gives a much more detailed analysis of the features the programs have to offer and compares them to the Interactive Pedigree Plotter.

3.1 Online Pedigree Designers

The most common type of pedigree plotter is the online pedigree designer. Pedigree designers differ from the Interactive Pedigree Plotter in that the user cannot give the web application a pre-existing file of individuals to draw, but instead has to design and build up the pedigree diagram from scratch. There is therefore no optimizing algorithm behind the application that decides which coordinates are best for the pedigree as a whole. These kinds of designers also offer either no interaction or very limited interaction. For these reasons, these types of pedigree applications are not really comparable to the Interactive Pedigree Plotter.

3.1.1 Medical Pedigree Medical Pedigree is an online pedigree designer where the user manually writes the data one individual by one (see figure 3.1a). The text attributes are limited to Id, Name, Date of Birth, Date of Death, Age and Comments. No symbols are available. The resulting pedi- gree is just an image so there is no interaction with the diagram. As figure 3.1b shows, the resulting pedigree gets cluttered and because there is no possibility of interaction, the user cannot dissolve the clutter. [14] 3.1. ONLINE PEDIGREE DESIGNERS 11

(a) Data setup. User manually fills in the fields with some basic information like year of birth/death, gender, condition and more. Then, the user can choose to add that information as attributes to the pedigree.

(b) Resulting pedigree from the data in (a).

Figure 3.1: Medical Pedigree.

3.1.2 Progeny Pedigree Tool

Progeny Pedigree Tool is far more advanced than Medical Pedigree. It is also a designer, so the user starts by building up the pedigree one individual at a time from scratch, or by starting with a small family of at most four generations and building on from there (see figure 3.2). The user can then add a limited number of attributes such as Name, Age and three custom attributes of the user’s choosing. The user can choose from 10 symbol attributes. The user can interact with the pedigree by moving, adding and deleting nodes. [15]

(a) Choosing proband gender.

(b) Building small starting family.

(c) Resulting pedigree from (b) with few added text attributes and symbols.

Figure 3.2: The Progeny Pedigree Tool offers to start with a small family of maximum four generations, and the user can then add more nodes to the pedigree afterwards as well as adding text attributes and symbols.

3.1.3 Genial Pedigree Draw

Genial Pedigree Draw is the most advanced online pedigree designer of the ones the author researched. As with all designers, the user has to build the pedigree from scratch, although the team behind the software seems to have an import tool in beta release. The user can then add multiple symbols and other text attributes to the nodes and also interact with the pedigree, e.g. delete and add nodes, but it seems to be impossible to move the nodes. [16]

Figure 3.3: Genial Pedigree Draw example.

3.2 Stand-alone Pedigree Plotters

The stand-alone pedigree plotters are more like the Interactive Pedigree Plotter in the sense that they take a pre-existing genealogy file (for example .PRE, .data or .txt) to draw from, instead of having the user design the pedigree one individual at a time. The following sections review the pedigree plotters that are used in the comparison in chapter 5.5.

3.2.1 HaploPainter

HaploPainter is a good pedigree plotter aimed at bioinformaticians, medical researchers and genetic counselors, specifically designed to be a user-friendly drawing application with special features for easy visualization of complex haplotype information. [17] The HaploPainter main features are: [18]

• Import of pedigree data from common linkage format, HaploPainter format or MySQL, PostgreSQL or Oracle data tables.

• Support Unicode encoded pedigree files.

• Import of haplotype data from Merlin, Simwalk2, Genehunter and Allegro.

• Symbol nomenclature based on the Pedigree Standardization Task Force.

• Full drag and drop for symbols.

• Manually and automatic selection of loop breaking.

• Export high quality graphic to SVG, Postscript, PDF, PNG.

• Several drawing style features.

• Command line mode for creating graphics in a shell environment.

• Modify your pedigree within the graphical user interface.

• Written platform independent in Perl.

(a) HaploPainter pedigree example. [19] (b) HaploPainter pedigree example. [17]

Figure 3.4: Example of HaploPainter in action.

3.2.2 CraneFoot

CraneFoot is designed to be a lightweight pedigree plotter for geneticists, with a focus on quality before quantity. CraneFoot uses a deterministic and efficient drawing algorithm, especially for pedigrees under 1000 nodes. CraneFoot is command-line software and as such is not very well suited for users who are accustomed to graphical user interfaces such as that of the Interactive Pedigree Plotter. CraneFoot writes the pedigree information to a PostScript file, which can then be opened by various image editing programs. The resulting pedigree is therefore an image, which offers no real-time interaction such as moving nodes. [20] [21]

The CraneFoot pedigrees are somewhat strangely drawn and can get very hard to read with increasing size. In figure 3.5, the lines between the individuals mean that they are the same person, so if the pedigree has many families, e.g. in figure 3.5a where the children of 1.1m and 1.2f all have spouses that also have parents in the pedigree, it becomes very difficult to read.

(a) Simple CraneFoot pedigree example. Not hard to read. [3]

(b) More complex pedigree. At this size it is much harder to read.

Figure 3.5: Examples of how CraneFoot pedigrees can become hard to read.

Another big disadvantage of CraneFoot with bigger pedigrees concerns viewing the pedigree. As mentioned above, the pedigree information is written to a PostScript file, which can be opened easily in web browsers such as Chrome, but Chrome won’t allow zooming far in (see figure 3.6a). One solution is to open the PostScript file in a more advanced image application like GIMP. In GIMP, the user can zoom very far into the picture, but then the resolution has to be set very high in order to distinguish the text under the nodes as well as the nodes themselves (see figure 3.6b). It should be mentioned that there are imaging applications that can render that many pixels per inch in a shorter amount of time, but GIMP is generally considered good imaging software.

(a) Maximum zoom in a web browser.

(b) Maximum zoom in GIMP. In order to get this resolution, the pixels per inch had to be set to 8000, which took 1 hour and 17 minutes to draw.

Figure 3.6: Examples of using CraneFoot with very big pedigrees.

3.2.3 Madeline

Madeline was designed to draw large and complex pedigrees that are easy to read: “For complex pedigrees Madeline uses a hybrid algorithm in which consanguineous loops are drawn as cyclic graphs whenever possible, but the algorithm resorts to acyclic graphs when matings can no longer be connected without line crossings. A similar hybrid approach is used to avoid line crossings for matings between far-flung descendants of different founding groups.” It is important to avoid line crossings because they make the pedigree harder to read.

Figure 3.7: Example of the hybrid approach used to avoid line crossings.

Madeline reads input files specified on the command line and generates pedigree drawings without user interaction. It writes the pedigree information to a scalable vector graphics (SVG) format which can then be viewed in browsers with native SVG rendering support or vector graphics editors such as Inkscape. [22]

(a) Madeline pedigree example. (b) Madeline pedigree example.

Figure 3.8: Example of Madeline in action. [22]

Chapter 4

Methods

This chapter will cover the design process behind the Interactive Pedigree Plotter (from now on: IPP), that is, what the incentive is behind creating IPP and also what the necessary requirements are in order for it to be usable for genetic analyses at deCODE Genetics.

4.1 Incentive

Drawing pedigrees has been an important analysis tool at deCODE since the company was founded in 1996, starting with drawing pedigrees by hand and moving on to pedigree software called Cyrillic. Around 2002, a software developer working for deCODE Genetics, Jósep Valur Guðlaugsson, reworked a pedigree plotter made a few years earlier and gave it the simple name Pedigree Plotter (from now on, PP). The PP was built inside Disease Miner, a graphical software package used for genetic analysis and more. The incentive behind the PP was to help with linkage analysis (see 2.8), allowing researchers to study the inheritance of genes and genetic markers in a visual manner. At that time, deCODE was using microsatellite genotyping (see 2.7.1) and other single marker genotyping methods to measure genetic variations in the DNA. Those genotyping methods provided limited data volume, but at the time deCODE was mostly studying common variants (see 2.6.3), which do not require as much data as rare variants.

Fast forward to the present: deCODE has moved away from microsatellites and other single marker genotyping methods to chip genotyping (see 2.7.2) and whole genome sequencing (WGS, see 2.7.3). With these genotyping methods, researchers are now able to look at a large number of single nucleotide variations in the genome (SNPs) instead of just a few as before, making it easier to narrow down on mutations that are, for example, associated with certain diseases. With these new measurement techniques, genotyping is done with much higher throughput, resulting in a huge growth in data volume. This gives researchers enough data to start looking more into rare variants. That is where the IPP comes in. There is a need for a more advanced pedigree plotter that can respond to this increased data volume and handle larger pedigrees than before. Larger pedigrees are harder to read, so researchers need to be able to interact with the diagram so that they can collapse or delete branches that are not of interest, narrowing down on the variant of interest.

To summarize, the incentive was therefore to create a pedigree plotter that is better equipped to handle larger and more complex pedigrees, enabling research of rare variants.

4.2 Implementation

The IPP was created by the author, Sveinn Már Ásgeirsson. At first, the plan was to rework PP and add more features on top of it, but after reviewing the code it was decided that it would take too long to get to know the code and architecture behind it, especially since Jósep is no longer working at deCODE. It was therefore decided that IPP would be built from scratch, reusing some of the core functions and features from PP.

The IPP uses the same layout algorithm as PP, designed by Guðbjörn F. Jónsson, since this layout algorithm is efficient and handles large pedigrees relatively well, and is therefore sufficiently good for IPP. [23] Further details about the algorithm are in chapter 5.3.

Since PP was created, much has changed in computer programming that makes it more efficient, such as new and improved libraries and core language features like lambda functions. The IPP was written in the programming language Java, just like PP, but IPP uses JavaFX instead of Java Swing for the graphical user interface (GUI) of the pedigree plotter. JavaFX has several advantages over Swing. One is that it draws using D3D and ES2 shaders, which run directly on the graphics card, handing the GPU much of the heavy lifting instead of leaving it solely to the CPU as in Swing. These shaders suit pedigrees well, because pedigrees are to a large extent made out of shapes (rectangles for males, ovals for females and diamonds for unknown gender, as well as mating and descendant lines showing familial relations), texts (e.g. text attributes) and fills (e.g. symbols and whether an individual is affected), and JavaFX uses the direct rendering shaders for simple objects like rectangles, ovals and simple lines and fills. Text is rendered with a grayscale bitmap cache. Yet another advantage is that JavaFX uses a scene graph and therefore keeps track of what has changed in the GUI, so it can render only a subset of the scene graph nodes. This proves very useful for larger pedigrees when the user changes just a few nodes, e.g. applying attributes to or moving three individuals out of 3000: JavaFX is able to update only those three nodes instead of redrawing the whole content, as is needed in Swing. [24]

4.3 Necessary requirements

There are certain requirements that the IPP needs to fulfill in order for it to be adequate and useful for genetic analyses being done at deCODE Genetics.

As mentioned in chapter 4.1, around 2002 deCODE was using a pedigree plotter for linkage analysis, studying common variations. For those kinds of genetic analyses, small to medium sized pedigrees (20-400 individuals) are sufficient. It is quite trivial to draw small pedigrees, with regard to both time and complexity. Small pedigrees include few nodes and objects overall, and smaller families tend to be less complex, although that is not always the case. Today, deCODE no longer does linkage analyses and has moved towards studying rare variants. In order to study rare variants, researchers need much more genealogy data to be able to grasp the rarity of the inheritance, because the variation might not show up often in a family. This results in larger pedigrees.

Next sections will detail some of these necessary requirements.

4.3.1 High drawing speed

The high drawing speed requirement is the ability to draw big pedigrees in a reasonable amount of time, at least pedigrees ranging from 1 to 3000 individuals. Larger pedigrees most often take longer to draw than smaller ones, although size is not the only variable (see 4.3.2), because drawing big pedigrees creates more work for the algorithm, that is, more individuals/nodes for the algorithm to assign coordinates to. That being said, although the algorithm is the bottleneck in the majority of cases, it also takes a little time to draw the content.

4.3.2 Handling complex family patterns

The requirement to handle complex family patterns is first and foremost the software’s ability to understand complex family patterns, such as complex interrelations, and draw them correctly. It also covers drawing family patterns such as an individual having many spouses, mates belonging to different generations or mates being far away from each other (where line crossings become a problem) in such a way that the pedigree does not become hard to read. The algorithm takes care of this by doing appropriate recalculations each time a non-standard family pattern arises. Smaller pedigrees, say 300 individuals, should in most cases take a few seconds, but if the same pedigree has a lot of complex family patterns, these few seconds can multiply fast because of all the recalculations needed.
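One of the patterns named above, an individual with many mates, can be spotted directly from the father and mother recorded for each child. The Python sketch below is only an illustration and not part of the IPP: the dictionary layout, the use of "0" for an unknown parent and the function name `mates_by_individual` are assumptions made for the example.

```python
def mates_by_individual(parents):
    """Map each individual to the set of mates they have children with.

    `parents` maps a child's identifier to a (father, mother) pair, with "0"
    standing for an unknown parent (hypothetical in-memory layout).
    """
    mates = {}
    for father, mother in parents.values():
        # Only a child with two known parents tells us about a mating.
        if father != "0" and mother != "0":
            mates.setdefault(father, set()).add(mother)
            mates.setdefault(mother, set()).add(father)
    return mates


parents = {
    "m1": ("0", "0"),
    "f1": ("0", "0"),
    "f2": ("0", "0"),
    "c1": ("m1", "f1"),
    "c2": ("m1", "f2"),
}
mates = mates_by_individual(parents)
many_mates = sorted(pn for pn, found in mates.items() if len(found) > 1)
print(many_mates)  # ['m1']
```

A count of such individuals gives a rough idea of how many non-standard patterns a pedigree contains before it is drawn.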

4.3.3 Interaction

After the two requirements discussed above have been met and a pedigree of, say, 2500 individuals is drawn, researchers need to be able to interact with the diagram, because it can be daunting to navigate through 2500 individuals. The interaction requirements mostly help researchers narrow down on the variation of interest or reduce clutter by moving and/or removing irrelevant individuals/branches from the pedigree, for example a whole family branch where none of the individuals show the variation of interest. Another feature to reduce clutter is the collapse/expand feature, where the user can collapse branches (see 5.4.2). The IPP would in the majority of cases be used for descendant trees where all individuals have a common ancestor, so this becomes very helpful.

Chapter 5

Results

In this chapter the resulting software, Interactive Pedigree Plotter (from now on called IPP), will be reviewed and then compared to other software programs that were discussed in chapter 3.2. Key structural components of the two types of files the IPP accepts are detailed in next two sections:

1. The first is the Pedigree Report (.PRE), or the PRE file. The pedigree report is a required file that has all the genealogy information of individuals that is needed to determine the layout of the pedigree. The pedigree report also has some optional information like year of birth/death and affection status.

2. The second is the Attribute File (.TXT). The attribute file is an optional file that should have all information about the attributes the user wants in the pedigree, such as text attributes, symbols and more.

In the future, the number of file types that IPP accepts may well increase, for example when the haplotype attribute feature has been implemented.

The third section covers the layout algorithm used in IPP. An efficient layout algorithm is extremely important for a good pedigree plotter, both for the pedigree to be easy to read and for being able to draw thousands of individuals. The fourth section details the interactive features of IPP. Being able to interact with the pedigree is important, especially with increasing pedigree size and complexity; the reasons for that are discussed both in chapter 4.3.3 and later in this chapter. The fifth section is the comparison between the IPP and the pedigree plotters reviewed in chapter 3.2. The comparison takes into account requirements that are considered necessary for good pedigree drawing software, as well as interaction features that help researchers with genetic analyses.

The IPP is not yet at its final version, so some future improvements are mentioned in chapter 6.

5.1 Pedigree Report file

As mentioned in chapter 1, the IPP will strictly be used as analysis software at deCODE Genetics, as a way of studying common and rare variants in individuals. This means that the data structure of the file provided for parsing is specifically designed to fit the data format deCODE Genetics follows. The IPP accepts so-called PRE files, or Pedigree Report files. There are several mandatory rules on how the PRE file has to be constructed, and a few others that are nice to have to get the most out of the software. The following sections detail these data structure guidelines.

Table 5.1: Summary of Pedigree Report file structure components.

Component                    Mandatory  Short description
First line with pound sign   Yes        First line in file should always start with pound sign.
PN                           Yes        PN, or personal/patient number of individual, is the identifier of the individual.
Father                       Yes        Father of individual.
Mother                       Yes        Mother of individual.
Sex                          Yes        Gender of individual.
Yob                          No         Year of Birth.
Yod                          No         Year of Death.
Affstatus                    No         Affected (2), unaffected (1) or unknown affection status (0).

5.1.1 First line

The first line in the Pedigree Report file has to begin with a pound sign (hash sign, #) so that the IPP knows that the line is the header line. After the pound sign, the column names are listed in a tab-delimited manner.

5.1.2 Pedigree report columns

In order to construct the pedigree, some columns are mandatory while others are not (see table 5.1), and the program will raise an error if mandatory columns are not present. The naming convention for the columns is also strict.

1  #Family  PN      Father  Mother  Sex  Yob   Yod   Affstatus
2  Family1  child1  father  mother  2    1990  0     1
3  Family1  child2  father  mother  1    1993  0     2
4  Family1  father  0       0       1    1970  0     2
5  Family1  mother  0       0       2    1973  2016  1

Listing 5.1: Example of a Pedigree Report file.
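The header rules in sections 5.1.1 and 5.1.2 can be illustrated with a short parsing sketch. It is written in Python for brevity and is not the IPP’s own code; the function name `read_pedigree_report` and the error messages are assumptions made for the example, while the mandatory column names come from table 5.1.

```python
# Mandatory column names as listed in table 5.1.
MANDATORY_COLUMNS = ("PN", "Father", "Mother", "Sex")


def read_pedigree_report(text):
    """Parse tab-delimited Pedigree Report text into a list of row dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("#"):
        raise ValueError("first line must start with a pound sign (#)")
    # Drop the pound sign, then split the header into column names.
    header = lines[0][1:].split("\t")
    missing = [name for name in MANDATORY_COLUMNS if name not in header]
    if missing:
        raise ValueError("missing mandatory columns: " + ", ".join(missing))
    # Every remaining line becomes a dict keyed by column name.
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


listing_5_1 = (
    "#Family\tPN\tFather\tMother\tSex\tYob\tYod\tAffstatus\n"
    "Family1\tchild1\tfather\tmother\t2\t1990\t0\t1\n"
    "Family1\tchild2\tfather\tmother\t1\t1993\t0\t2\n"
    "Family1\tfather\t0\t0\t1\t1970\t0\t2\n"
    "Family1\tmother\t0\t0\t2\t1973\t2016\t1\n"
)
rows = read_pedigree_report(listing_5_1)
print(rows[0]["PN"], rows[0]["Sex"])  # child1 2
```

All values are kept as strings here, since the PN can be a string of any sort.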

5.1.2.1 PN

The PN column represents an identifier of each individual in the pedigree, and can be a string of any sort. The reason for this naming convention is that, in order to protect the personal information of donors and obey the ethics rules and regulations set by The Icelandic Data Protection Authority, deCODE encrypts all personal identities into so-called PNs, which stands for personal number or patient number: an individual identifier, 7 characters long. PNs serve as identification for any person who is in the deCODE database. A PN identifies individuals who have donated samples or are otherwise participating in projects. The values for this column need to be consistent throughout the file, i.e. if one name is used on one line in the file, the same name has to be used throughout the file when referring to the same individual.

Figure 5.1: Pedigree generated from listing 5.1.

5.1.2.2 Father and Mother

The Father and Mother columns are mandatory because they are used by the layout algorithm to construct family relations. The values should be, as the column names imply, the father and mother of the individual, but a value can also be 0 if the parent is unknown or simply not included in the pedigree (see listing 5.1 and figure 5.1). The same goes for these columns as for PN: the naming has to be consistent. In addition, if an individual X has the father Y and mother Z, then Y and Z have to exist as individuals in the file. Listing 5.1 shows the right way, but if e.g. line 4 were missing, the IPP would raise an error.
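The consistency rule for parents can be checked in a few lines. The following Python sketch is an illustration under stated assumptions, not the check the IPP performs: rows are assumed to be dictionaries keyed by the column names of listing 5.1, and `find_missing_parents` is a hypothetical helper name.

```python
def find_missing_parents(rows):
    """Return (child, role, parent) for every parent that has no row of its own."""
    known = {row["PN"] for row in rows}
    problems = []
    for row in rows:
        for role in ("Father", "Mother"):
            parent = row[role]
            # "0" means unknown or not included, so it is never an error.
            if parent != "0" and parent not in known:
                problems.append((row["PN"], role, parent))
    return problems


# Listing 5.1 with line 4 (the father's own row) left out.
rows = [
    {"PN": "child1", "Father": "father", "Mother": "mother"},
    {"PN": "child2", "Father": "father", "Mother": "mother"},
    {"PN": "mother", "Father": "0", "Mother": "0"},
]
problems = find_missing_parents(rows)
print(problems)
# [('child1', 'Father', 'father'), ('child2', 'Father', 'father')]
```

Reporting every offending row, rather than stopping at the first one, makes it easier to repair a large file in one pass.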

5.1.2.3 Sex

The Sex column is mandatory because, as mentioned in chapter 2.1, pedigrees use a standard set of symbols to make them easier to understand. Three of those symbols represent the sex of the individual: a square for male, a circle for female and a diamond for unknown sex. The values for this column are represented with numbers, where:

• 0: Unknown

• 1: Male

• 2: Female

5.1.2.4 Yob and Yod

The Yob (Year of birth) and Yod (Year of death) columns are, as the names imply, the year of birth and year of death of the individuals. These columns are not mandatory, but if the user wants this information in the pedigree, the columns must be included in the Pedigree Report file (see listing 5.1 and figure 5.1).

5.1.2.5 Affstatus

The Affstatus column is not mandatory but is helpful when looking at genetic transmission. For that, the user needs to know which individuals in the pedigree have the variant of interest and which do not. As mentioned in chapter 2.1, if an individual in a pedigree has the phenotype in question, the individual is marked with a filled symbol (see listing 5.1 and figure 5.1). The values for this column are represented with numbers, where:

• 0: Unknown affection status

• 1: Unaffected

• 2: Affected

Another way to represent affection is with symbol attributes, see chapter 5.2.2 for more details on that.
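The numeric codes of the Sex and Affstatus columns translate into the node shape and fill described in sections 5.1.2.3 and 5.1.2.5. The Python sketch below shows that mapping; the names `SEX_SHAPES`, `AFFSTATUS` and `node_style` are illustrative assumptions, and treating a missing Affstatus as unknown is a choice made for the example.

```python
# Code-to-meaning tables taken from the Sex and Affstatus sections.
SEX_SHAPES = {"0": "diamond", "1": "square", "2": "circle"}
AFFSTATUS = {"0": "unknown", "1": "unaffected", "2": "affected"}


def node_style(sex_code, affstatus_code="0"):
    """Return (shape, filled) for one individual.

    Only affected individuals get a filled symbol.
    """
    shape = SEX_SHAPES[sex_code]
    filled = AFFSTATUS[affstatus_code] == "affected"
    return shape, filled


# The father in listing 5.1: Sex 1, Affstatus 2.
print(node_style("1", "2"))  # ('square', True)
# The mother in listing 5.1: Sex 2, Affstatus 1.
print(node_style("2", "1"))  # ('circle', False)
```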

5.2 Attributes file

One key feature of the IPP is the option of adding different types of attributes to nodes. Attributes are very useful when the user wants to attach extra information to the nodes and/or distinguish one node from another. Attribute data is passed to the IPP via an attribute file (see listings 5.2, 5.3 and 5.4 for examples). The only strict part of the attribute file’s data structure is the first line: as in the Pedigree Report file, the first line in the attribute file has to begin with a pound sign and then list the column names in a tab-delimited manner. The next sections cover the types of attributes IPP offers.

5.2.1 Text attributes

#PN     Sequenced  CancerType  Something
child1  Yes        None        Yes
child2  Yes        Colon       Unknown
father  Yes        Colon       Unknown
mother  No         None        Unknown

Listing 5.2: Example of an attribute file made for the Pedigree Report file in listing 5.1.

Text attributes enable the user to attach extra information to the nodes. The column names are completely arbitrary and appear in the attribute dialog as they are named in the file. As figure 5.3 shows, the attributes appear under the nodes in a “Column name: value” manner.
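The “Column name: value” presentation can be sketched as follows. This Python example is illustrative only: it assumes, as in listing 5.2, that the identifier column is named PN, and `text_attribute_labels` is a hypothetical function name rather than anything in the IPP.

```python
def text_attribute_labels(text):
    """Build the label lines shown under each node from attribute file text."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("#"):
        raise ValueError("first line must start with a pound sign (#)")
    header = lines[0][1:].split("\t")
    labels = {}
    for line in lines[1:]:
        record = dict(zip(header, line.split("\t")))
        pn = record.pop("PN")  # assumed name of the identifier column
        # Remaining columns keep their file order and arbitrary names.
        labels[pn] = [f"{name}: {value}" for name, value in record.items()]
    return labels


listing_5_2 = (
    "#PN\tSequenced\tCancerType\tSomething\n"
    "child1\tYes\tNone\tYes\n"
    "child2\tYes\tColon\tUnknown\n"
    "father\tYes\tColon\tUnknown\n"
    "mother\tNo\tNone\tUnknown\n"
)
labels = text_attribute_labels(listing_5_2)
print(labels["child2"])
# ['Sequenced: Yes', 'CancerType: Colon', 'Something: Unknown']
```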

(a) No attribute file added. IPP gets the Name and Year of Birth/Death information from the Pedigree Report (see listing 5.1).

(b) Attribute file 5.2 is added.

Figure 5.2: Example of attribute dialog without and with attribute file 5.2 added.

Figure 5.3: Example of pedigree generated from pedigree report 5.1 and attribute file 5.2 when attributes are turned on.

In the example above, the user could be studying the transmission of cancer in a family, and the attributes are specifying which type of cancer the individual has. Also, the user could be interested in whether the individual is sequenced or not, so he has a text attribute for that. Another option is to enable the attributes so that they appear as a tooltip on the nodes but not as text attributes under the nodes:

(a) Attribute file 5.2 added and some text attributes set as tooltip (purple minus) and others as visible text attributes (green checkmark).

(b) Pedigree when configurations from 5.4a are applied.

Figure 5.4: Example of applying text attributes as tooltip.

One important thing to mention is that IPP also offers to import another attribute file and either merge it with the existing one or overwrite it.

5.2.2 Symbols

Symbols in pedigrees are very useful for genetic analysis, e.g. when dealing with individuals that have multiple diagnoses, to express sub-phenotypes and/or mark severity scores. The IPP offers 11 types of symbols in 15 different colors, which, if calculations are correct, means 1365 symbols. With symbols you can add a special symbol text which may serve as a further detail on what the symbol means (see figures 5.5 and 5.6).

The symbols and symbol texts are read from the attribute file in the same way as the text attributes, except that the column names are not arbitrary. In order for the IPP to recognize the symbols, the column names have to be exactly “Symbol” and “SymbolText”. Values for the symbol text are arbitrary strings of the user’s choosing, but for the symbol it is a little more complicated. The symbols follow a color map which is defined internally. It uses hexadecimal digits, with 15 values from 0 to E, where:

• 0: Transparent (no color)

• 1: Black

• 2: Blue

• 3: Red

• 4: Light green

• 5: Yellow

• 6: Magenta

• 7: Light blue

• 8: Gray

• 9: Dark blue

• A: Brown

• B: Green

• C: Violet

• D: Beige

• E: Cyan

These color values are then joined together in a 4 digit string, where:

• First digit: Top left quadrant

• Second digit: Top right quadrant

• Third digit: Bottom left quadrant

• Fourth digit: Bottom right quadrant

So if the user wanted a symbol with a yellow top left, magenta top right, light blue bottom left and gray bottom right quadrant, the value would be 5678.

Now let’s look at examples of how symbols could be used. One example is a researcher studying cancer who draws a pedigree of a family where there is an obvious genetic risk of being diagnosed with cancer. It might be the case that not all individuals in the pedigree are diagnosed with the same type of cancer. In that case, the researcher could use symbols where each symbol represents a type of cancer.
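The encoding of a symbol value can be made concrete with a small decoder. The Python sketch below follows the color map and quadrant order given above; the names `COLORS`, `QUADRANTS` and `decode_symbol` are assumptions for the example, and rejecting anything other than four listed digits is a choice made here, not documented IPP behavior.

```python
# Color map and quadrant order as listed in this section.
COLORS = {
    "0": "transparent", "1": "black", "2": "blue", "3": "red",
    "4": "light green", "5": "yellow", "6": "magenta", "7": "light blue",
    "8": "gray", "9": "dark blue", "A": "brown", "B": "green",
    "C": "violet", "D": "beige", "E": "cyan",
}
QUADRANTS = ("top left", "top right", "bottom left", "bottom right")


def decode_symbol(code):
    """Translate a 4 digit symbol value into a color per quadrant."""
    if len(code) != 4 or any(digit not in COLORS for digit in code):
        raise ValueError("symbol value must be four digits in the range 0-E")
    return {quadrant: COLORS[digit] for quadrant, digit in zip(QUADRANTS, code)}


print(decode_symbol("5678"))
# {'top left': 'yellow', 'top right': 'magenta',
#  'bottom left': 'light blue', 'bottom right': 'gray'}
```

A value such as 00B0 therefore colors only the bottom left quadrant green and leaves the other three transparent.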

#PN     Symbol  SymbolText
child1  1234    Breast
child2  5678    Colon
father
mother  1234    Breast

Listing 5.3: Attribute file for pedigree report 5.1 (figure 5.5).

Figure 5.5: Example of using symbols and symbol text to show cancer inheritance.

Figure 5.5 shows a small example of a family where the mother might have the 999del5 mutation in the BRCA2 gene. The mutation is inherited by the children, where the daughter gets breast cancer, but the same mutation also increases the likelihood of prostate cancer in males, so the mutation also affects the son. These symbols and symbol texts are defined in attribute file 5.3. The symbols used are exaggerated color-wise; most usages would be more like in figure 5.6.

With multiple diagnoses, the user would use symbols in a similar way, that is, either each symbol would represent multiple types or individuals would get multiple symbols where each symbol represents a cancer type.

A more specific example of symbol usage is a researcher studying autism. There is often a scale on which individuals with autism get a score, and that score depends on how severe the autism is. In that case, the researcher could draw a family with a high genetic risk of autism and then mark affected individuals with symbols, where each symbol represents an interval on the scale.

#PN   Symbol  SymbolText
1-1m
1-2f
2-1m
2-2f
3-1m  3333    >100a
3-6f
3-2f
3-7m
4-5m  00B0    40-59c
4-7f  BBBB    >100c
4-1m
4-2f  000B    60-79c
2-3m
2-4f
3-3m  3300    80-99a
3-4f  3000    <20a
3-5m
4-3m  B000    <20c
4-4f  0B00    20-39c

Listing 5.4: Attribute file for figure 5.6.

Figure 5.6: Family with a genetic risk of autism.

In the example above, there are two severity scores, one for adults and one for children. For that reason, red is used for adult scoring and green for children scoring. The meanings of all symbols are in table 5.2.

Table 5.2: Summary of all symbols that could be used in autism family pedigree 5.6.

Score   Child     Adult
<20     (symbol)  (symbol)
20-39   (symbol)  (symbol)
40-59   (symbol)  (symbol)
60-79   (symbol)  (symbol)
80-99   (symbol)  (symbol)
>100    (symbol)  (symbol)

One thing to mention is that symbols and symbol texts are independent of each other, so it is possible to have only symbols with no symbol texts, and vice versa. Examples of how the attribute files could look are listed below.

#PN     SymbolText
child1  Breast
child2  Colon
father
mother  Breast

Listing 5.5: Attribute file where user wants no symbols - only symbol texts. In that case he just skips the “Symbol” column.

#PN     Symbol  SymbolText
child1          Breast
child2  5678    Colon
father
mother          Breast

Listing 5.6: Attribute file where user wants some symbols - but not all. In that case he would leave out the values (“empty string”) where he wants no symbol.

5.2.3 Haplotypes

Being able to draw haplotypes of individuals is a work in progress. That feature is detailed in chapter 6.1.

5.3 Layout algorithm

When drawing families of considerable size and complexity, an efficient algorithm is crucial. Calculating coordinates for small and simple descendant pedigrees is quite trivial, but for the genetic analysis being done at deCODE, e.g. studying rare variants, the pedigrees tend to get quite big, and with increasing pedigree size, interrelations and other non-trivial family relations have a higher chance of occurring. It is important that the algorithm is able to handle pedigrees of all sizes and complexity in a reasonable amount of time, resulting in an easy-to-read diagram for the researcher. The algorithm used for the IPP was originally designed because the available algorithms were only able to handle small to medium size families, and also because they were not able to distribute coordinates in an efficient manner when the pedigrees became complex. Given a Pedigree Report to work with, the algorithm determines the most convenient coordinates for nodes in the pedigree.

Here are a few common family patterns that can increase the complexity of the pedigrees, and examples of how the algorithm solves them:

• Cousins have children together, which forms a circle in the pedigree:

Figure 5.7: Cousins (red arrows) have kids, which forms a circle in the pedigree drawing.

• Individuals have children with more than one mate:

Figure 5.8: A man (red arrow) has children with two different women (blue arrows). Here it is important where the two women are drawn and how the two families branch down based on that.

Figure 5.9: A man (red arrow) has children with four different women (blue arrows). The algorithm assigns coordinates in a “star-manner”, where the women are drawn in a star around the man. This is a very good way to solve this problem when the pedigree is full of individuals with many spouses.

• Mates belong to different generations:

Figure 5.10: Mates (red arrows) are from different generations. The IPP solves this by elongating the descendant line to 5.1m.

• Group of siblings have mates that also have parents in the pedigree:

Figure 5.11: Three siblings (marked 1, 2 and 3), all with spouses that have parents in the pedigree (color coded). There is no perfect way to solve this because the pedigree gets cluttered very fast, but the algorithm tries to fit all the small families in a readable manner.

• A family is put together from several smaller families with one or more connections between them:

Figure 5.12: A woman (red arrow) has children with two different men (blue arrows), but those men belong to two different families.

All of the examples above can get significantly more complex.

The pedigree layout algorithm used in the IPP was created by Guðbjörn Freyr Jónsson, software developer at deCODE Genetics. It is based on an algorithm for hierarchical or layered drawings of directed graphs, but with many improvements that take into account the special structure of pedigrees. First, the nodes are assigned a rank, which corresponds to determining the generation to which each individual belongs. Second, the nodes in each layer (individuals in each generation) are ordered to minimize the number of crossings. Finally, coordinates are assigned to each node, starting from the bottom.

The rank assignment is an optimization problem where the stretching of the legs is minimized. The ordering is a heuristic that iterates several times going up and down the graph, trying to untangle the branches. One major improvement was to collapse simple descendant or ancestor branches into single strands before doing the ordering. The coordinate assignment works from the bottom, placing the nodes with fixed spacing, but then shifting the nodes as needed to make space when placing the nodes at the next level above. [23]
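To give a feel for the first step, rank assignment, the Python sketch below uses plain longest-path layering: founders get rank 0 and every other individual is placed one generation below the deeper of its parents. This is a deliberately simplified stand-in and not the optimization used in the IPP; the function name `assign_generations` and the dictionary layout are assumptions, and the input is assumed to be free of cycles with every named parent present.

```python
def assign_generations(parents):
    """Assign a generation number to every individual.

    `parents` maps an identifier to a (father, mother) pair, with "0" for an
    unknown parent. Founders get generation 0.
    """
    ranks = {}

    def rank(pn):
        if pn not in ranks:
            known = [p for p in parents[pn] if p != "0"]
            # One below the deepest known parent; founders end up at 0.
            ranks[pn] = 1 + max((rank(p) for p in known), default=-1)
        return ranks[pn]

    for pn in parents:
        rank(pn)
    return ranks


parents = {
    "father": ("0", "0"),
    "mother": ("0", "0"),
    "child1": ("father", "mother"),
    "child2": ("father", "mother"),
    "inlaw": ("0", "0"),
    "grandchild": ("child2", "inlaw"),
}
ranks = assign_generations(parents)
print(ranks["child2"], ranks["inlaw"], ranks["grandchild"])  # 1 0 2
```

In this output the in-law lands in generation 0 although her mate child2 is in generation 1. Avoiding that kind of stretched link is presumably what the optimization described above is for, which is why a naive layering like this one is not enough on its own.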

5.4 Interactive features

One of the key features of the IPP is the ability to interact with the pedigree. As mentioned in chapter 5.3, the software is able to draw several thousand nodes in a fairly short amount of time, and it will often be used to do so. Drawing such a big family does not necessarily mean that all of its members are of interest to the researcher; rather, the researcher might want to draw the whole family and then narrow down on the mutation of interest. With the high level of interactivity that the IPP offers, it is fairly easy to manipulate the pedigree. This section covers deleting, collapsing, expanding, moving and stretching.

5.4.1 Deleting

In some cases, whole family branches are not affected by the mutation of interest. In that case, the researcher might want to remove them completely from the pedigree. When the branch is deleted, it cannot be retrieved.

(a) Family with an unaffected branch. (b) Branch deleted.

Figure 5.13: Simple pedigree where a branch of family members that are unaffected is deleted from the pedigree.

5.4.2 Collapsing and expanding

Collapsing and expanding can be used, for example, when the user does not want to remove a branch from the pedigree completely, but rather collapse it to simplify the pedigree and expand it again later. Unlike deleted branches, collapsed branches can be retrieved by expanding them. Hovering over the collapsed node shows the total number of collapsed individuals and the number of descendants.
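The two numbers shown when hovering over a collapsed node could be computed along the following lines. This is a minimal sketch under stated assumptions, not the IPP implementation: the function name `collapsed_summary` and the `children` and `spouses` mappings are hypothetical, and it is assumed that the total count covers the descendants plus their spouses.

```python
def collapsed_summary(root, children, spouses):
    """Count what collapsing the branch below `root` would hide.

    `children` and `spouses` map an individual to a list of individuals
    (hypothetical input format).
    """
    descendants = set()
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in descendants:  # guards against revisiting in loops
                descendants.add(child)
                stack.append(child)

    # Assumption: spouses of hidden descendants are hidden with them.
    hidden = set(descendants)
    for person in descendants:
        hidden.update(spouses.get(person, ()))
    hidden.discard(root)  # the collapsed node itself stays visible

    return {"individuals": len(hidden), "descendants": len(descendants)}


children = {"a": ["b", "c"], "b": ["d"]}
spouses = {"b": ["s"]}
summary = collapsed_summary("a", children, spouses)
```

Expanding the node again only requires the same traversal to mark the hidden individuals as visible, which is why collapsing, unlike deleting, loses no information.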

(a) A simple pedigree. (b) A simple pedigree with a collapsed branch.

Figure 5.14: Simple pedigree where a branch of family members is collapsed.

The user can also collapse every single family link in the pedigree, and then traverse down the branches one by one, from top to bottom. This might be very useful when the pedigrees get bigger.

(a) A pedigree with 65 individuals.

(b) Collapse all button. Expand all button above. (c) The pedigree with all links collapsed.

(d) Traversing down branch. (e) Traversing further down.

Figure 5.15: A bigger pedigree of 65 individuals where all links are collapsed.

5.4.3 Moving

Moving nodes around is very useful, especially when text attributes, symbols and/or haplotypes are shown on only some of the nodes. In that case, the pedigree tends to get cluttered and the user might want to drag only the nodes that have attributes to clear the clutter.

(a) A very cluttered pedigree. (b) Same pedigree after nodes have been moved around to reduce the clutter.

Figure 5.16: A very cluttered pedigree fixed by moving nodes around.

5.4.4 Shrinking and stretching

Similar to moving nodes, simply stretching or shrinking the pedigree can be enough to sort out the clutter. This is especially useful if all nodes have attributes.

(a) A very cluttered pedigree.

(b) The shrink and stretch buttons. (c) Same pedigree after being stretched to reduce the clutter.

Figure 5.17: A very cluttered pedigree fixed by stretching it.

5.5 Comparison

This section compares the IPP with similar software programs. Chapter 3 discussed all related software applications, but the comparison only includes the stand-alone applications discussed in chapter 3.2. For a pedigree plotter to be sufficient for deCODE's line of research, it needs to meet certain requirements. All of those requirements are taken into account in the comparison in the following sections.

5.5.1 Complex families

The first comparison concerns family pattern complexity. The parameters for that comparison are the ones discussed in chapter 5.3, plus a few more. It is very important for the genetic analyses done at deCODE that the pedigree plotter is able to handle non-standard family patterns and draw them in an efficient manner. All pedigree plotters were given the same genealogy data to work with, but in some cases its format had to be changed for the software application to understand it; for example, gender is sometimes denoted with numbers (1 for male and 2 for female) and sometimes with letters (m for male and f for female). The following complexity problems will be tested, in this order:

1. Individuals with multiple spouses.

2. Simple consanguinity.

3. Complex consanguinity.

4. Siblings have partners that also have parents in the pedigree.

5. Partners belong to different generations.

6. A pedigree is put together from several smaller families with one or more connections between them all.

7. Individual only has one parent.
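The format differences mentioned before the list (gender coded as 1/2 for some plotters and m/f for others) can be bridged with a small normalization step before the data is handed to each plotter. The following is a minimal sketch; the name `normalize_sex` and the choice of `m`/`f` as the common form are assumptions made for illustration.

```python
# Accepted spellings mapped to one common form (assumption: "m"/"f").
SEX_CODES = {"1": "m", "2": "f", "m": "m", "f": "f"}


def normalize_sex(value):
    """Map 1/2 or m/f (any case, surrounding spaces allowed) to "m"/"f"."""
    key = str(value).strip().lower()
    if key not in SEX_CODES:
        raise ValueError(f"unknown sex code: {value!r}")
    return SEX_CODES[key]
```

Rejecting unknown codes instead of guessing keeps a malformed genealogy file from silently producing a wrong pedigree.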

Scoring is given such that correctly drawn and ideal gives 1 (X), correctly drawn but not ideal gives 0.5 (.) and failed gives 0 (-).

Table 5.3: Comparison of the IPP to similar programs regarding complex family patterns.

Problem   IPP         HaploPainter   CraneFoot   Madeline
1         X           .              .           X
2         .           X              .           X
3         .           X              .           X
4         X           X              .           .
5         X           X              .           .
6         X           X              .           .
7         X           -              -           -

Total     86% (6/7)   79% (5.5/7)    43% (3/7)   64% (4.5/7)
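The totals in Table 5.3 follow directly from the scoring rule. As a worked check, the short Python sketch below recomputes them from the symbols in the table; the string encoding of each column and the use of `-` for a failed (blank) cell are choices made here for illustration.

```python
# Points per symbol; "-" stands for a failed (blank) cell.
POINTS = {"X": 1.0, ".": 0.5, "-": 0.0}

# One character per complexity problem (1 to 7), copied from Table 5.3.
RESULTS = {
    "IPP": "X..XXXX",
    "HaploPainter": ".XXXXX-",
    "CraneFoot": "......-",
    "Madeline": "XXX...-",
}


def score(marks):
    """Return (points, percentage rounded to the nearest integer)."""
    total = sum(POINTS[m] for m in marks)
    return total, round(100 * total / len(marks))


totals = {name: score(marks) for name, marks in RESULTS.items()}
```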

The following sections go into further detail on the results.

5.5.1.1 Multiple spouses

Individuals having children with several different partners is something that the pedigree plotter needs to be able to handle, especially because the probability of individuals having more than one spouse increases with pedigree size. Having six spouses is not something that comes up often, but it is a good complexity problem for seeing how the algorithm fits all the spouses and in what manner.

Figure 5.18: IPP pedigree for the “multiple spouses” complexity problem.

Result: Correctly drawn and ideal. Comments: The IPP solves the multiple spouses problem by drawing the spouses in a “star” manner. This becomes very efficient and easy to read when pedigrees grow to several thousand individuals and many individuals have multiple spouses.

Figure 5.19: HaploPainter pedigree for the “multiple spouses” complexity problem.

Result: Correctly drawn but not ideal. Comments: HaploPainter draws it in a manner that looks like the women are having children together. The resulting pedigree drawn by HaploPainter is not sufficient for deCODE standards, nor for pedigree drawing standards in general.

Figure 5.20: CraneFoot pedigree for the “multiple spouses” complexity problem.

Result: Correctly drawn but not ideal. Comments: Although this layout would never work for bigger pedigrees, it is correctly drawn.

Figure 5.21: Madeline pedigree for the “multiple spouses” complexity problem.

Result: Correctly drawn but not ideal. Comments: Although this layout would never work for bigger pedigrees, it is correctly drawn.

5.5.1.2 Simple consanguinity

This type of consanguinity is rather common, so it is important that the pedigree plotter can handle those cases. The complexity problem tested for was cousins having a child together.

Figure 5.22: IPP pedigree for the “simple consanguinity” complexity problem.

Result: Correctly drawn but not ideal. Comments: Missing traditional double line to show consanguinity between 5.1m and 5.2f.

Figure 5.23: HaploPainter pedigree for the “simple consanguinity” complexity problem.

Result: Correctly drawn and ideal. Comments: None.

Figure 5.24: CraneFoot pedigree for the “simple consanguinity” complexity problem.

Result: Correctly drawn but not ideal. Comments: Very cheap way to avoid complications.

Figure 5.25: Madeline pedigree for the “simple consanguinity” complexity problem.

Result: Correctly drawn and ideal. Comments: It is a little odd that 6.1m is drawn so far to the right instead of in the middle.

5.5.1.3 Complex consanguinity

The complexity problem tested for was siblings having a child together, but one of the siblings also had a child with another partner not related to the family.

Figure 5.26: IPP pedigree for the “complex consanguinity” complexity problem.

Result: Correctly drawn but not ideal. Comments: Missing traditional double line to show consanguinity between 6.1f and 6.2m.

Figure 5.27: HaploPainter pedigree for the “complex consanguinity” complexity problem.

Result: Correctly drawn and ideal. Comments: None, but it should be noted that the complex consanguinity problem was changed to remove unnecessary generations (see figure 5.28). Before the change, HaploPainter failed because it showed consanguinity between the wrong individuals, that is, between 6.2m and 6.3m instead of 6.1f and 6.2m:

Figure 5.28: Old complex consanguinity problem where HaploPainter failed.

Figure 5.29: CraneFoot pedigree for the “complex consanguinity” complexity problem.

Result: Correctly drawn but not ideal. Comments: Cheap way to avoid complications.

Figure 5.30: Madeline pedigree for the “complex consanguinity” complexity problem.

Result: Correctly drawn and ideal. Comments: None.

5.5.1.4 Siblings have spouses that also have parents in the pedigree

The complexity problem tested for was when siblings all have spouses that also have families in the pedigree. It is hard for an algorithm to draw this efficiently in a way that is easy to read.

Figure 5.31: IPP pedigree for the “spouses of siblings that also have parents” complexity problem.

Result: Correctly drawn and ideal. Comments: None.

Figure 5.32: HaploPainter pedigree for the “spouses of siblings that also have parents” complexity problem.

Result: Correctly drawn and ideal. Comments: None.

Figure 5.33: CraneFoot pedigree for the “spouses of siblings that also have parents” complexity problem.

Result: Correctly drawn but not ideal. Comments: This hybrid approach CraneFoot uses would become somewhat confusing with bigger pedigrees.

Figure 5.34: Madeline pedigree for the “spouses of siblings that also have parents” complexity problem.

Result: Correctly drawn but not ideal. Comments: These substitute nodes are not ideal. This would become somewhat confusing for bigger pedigrees.

5.5.1.5 Partners belong to different generations

Figure 5.35: IPP pedigree for the “mates from different generations” complexity problem.

Result: Correctly drawn and ideal. Comments: Missing traditional double line to show consanguinity between 5.1m and 5.2f.

Figure 5.36: HaploPainter pedigree for the “mates from different generations” complexity problem.

Result: Correctly drawn and ideal. Comments: None.

Figure 5.37: CraneFoot pedigree for the “mates from different generations” complexity problem.

Result: Correctly drawn but not ideal. Comments: A somewhat cheap way to solve this simple problem.

Figure 5.38: Madeline pedigree for the “mates from different generations” complexity problem.

Result: Correctly drawn but not ideal. Comments: This is exactly the problem described in chapter 5.3 as one of the difficulties for pedigree drawing algorithms.

5.5.1.6 Pedigree made up from many smaller families

A big family, made up of three smaller families, connected together with one connection between them. The author made the connections between the smaller families so that they were not between multiple spouses, because that has already been addressed in the multiple spouses problem, but rather random connections between them.

Figure 5.39: IPP pedigree for the “many small families make up one big” complexity problem.

Result: Correctly drawn and ideal. Comments: None.

Figure 5.40: HaploPainter pedigree for the “many small families make up one big” complexity problem.

Result: Correctly drawn and ideal. Comments: None.

Figure 5.41: CraneFoot pedigree for the “many small families make up one big” complexity problem.

Result: Correctly drawn but not ideal. Comments: None.

Figure 5.42: Madeline pedigree for the “many small families make up one big” complexity problem.

Result: Correctly drawn but not ideal. Comments: It uses the “A” as a substitute instead of fitting the smaller families in an efficient way. For this problem it should not be needed. Also, the line marked with the red arrows is unnecessarily long.

5.5.1.7 Single parent connection

In many cases, the genealogy data does not have information about the partner. In other cases, the partner is not of any interest to the researcher. For those reasons, it is very important that the pedigree plotter is able to handle those kinds of cases.

Figure 5.43: IPP pedigree for the “single parent” complexity problem.

Result: Correctly drawn and ideal. Comments: None.

Figure 5.44: HaploPainter pedigree for the “single parent” complexity problem.

Result: Failed. Comments: Does not support drawing single parents.

Figure 5.45: CraneFoot pedigree for the “single parent” complexity problem.

Result: Failed. Comments: Does not support drawing single parents.

Figure 5.46: Madeline pedigree for the “single parent” complexity problem.

Result: Failed. Comments: Draws a dummy spouse to be able to handle the problem, although this is not normally how Madeline adds a substitute node. Unlike CraneFoot and HaploPainter, it does not raise an error, but the ideal solution is to be able to draw single parents without a dummy spouse.

5.5.2 Drawing speed

The drawing speed is rather hard to compare between applications. To get the most accurate and relevant results, real genealogy data of the kind the IPP needs to be able to handle was used. However, chapter 5.5.1 addressed a few typical complexity problems that are trivial for the IPP but for which other pedigree plotters gave wrong or non-ideal results. The problem is that strange family patterns often arise as the genealogy data grows, so not all of the pedigree plotters being tested can handle those cases. The drawing speed is nevertheless measured on genealogy data, with the number of individuals in the pedigree as the parameter.

The minimum requirement for the drawing speed comparison to be valid is, first and foremost, that the resulting pedigree is drawn, and also that its resolution is good enough for the user to read text attributes such as names under nodes and to distinguish between nodes. Whether the result is exactly right is not considered here, but it should be noted that wrong results cannot be tolerated for genetic analysis at deCODE.

Table 5.4: Comparison of the IPP and similar pedigree plotters on drawing speed.

Individuals   IPP      CraneFoot             HaploPainter          Madeline
2726          52 sec   1 hour 17 min 56 sec  CRASH                 6 sec
1181          8 sec    27 min 33 sec         CRASH                 5 sec
695           3 sec    6 min 56 sec          1 hour 15 min 26 sec  5 sec
246           3 sec    1 min 35 sec          5 sec                 5 sec
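Because the timings in Table 5.4 mix hours, minutes and seconds, it helps to convert them to a single unit before comparing plotters. The sketch below does this in Python; the function name `to_seconds` is hypothetical and the parser only understands the unit spellings that appear in the table.

```python
import re

# Unit spellings used in Table 5.4, in seconds.
UNIT_SECONDS = {"hour": 3600, "h": 3600, "min": 60, "sec": 1}


def to_seconds(text):
    """Convert a duration such as "1hour 17min 56sec" to seconds."""
    # "hour" is listed before "h" so the longer spelling is matched first.
    parts = re.findall(r"(\d+)\s*(hour|h|min|sec)", text)
    if not parts:
        raise ValueError(f"no duration found in {text!r}")
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


# Timings for the pedigree of 2726 individuals, copied from Table 5.4.
ipp = to_seconds("52sec")
cranefoot = to_seconds("1hour 17min 56sec")
slowdown = cranefoot / ipp  # how many times longer CraneFoot took
```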

5.5.3 Interaction

The interaction features tested are all features considered necessary for a very efficient pedigree plotter, which the IPP strives to be: one that allows the researcher to easily navigate through the pedigree and interact with it, making it easier to narrow down on the variation of interest.

Table 5.5: Comparison of the IPP to similar programs regarding interaction features.

Feature                     IPP          CraneFoot    HaploPainter   Madeline
Affected                    X            X            X              X
Deceased                    X            X            X              X
Moving/dragging             X            -            X              -
Stretching/shrinking        X            -            -              -
Deletion                    X            -            X              -
Adding                      -            -            X              -
Collapse                    X            -            -              X
Expand                      X            -            -              -
Arbitrary text attributes   X            X            -              X
Symbol attributes           X            X            X              X
Haplotype attributes        -            -            X              -

Total                       82% (9/11)   36% (4/11)   64% (7/11)     45% (5/11)

CraneFoot and Madeline lack most of the interaction features because their pedigrees are images with no interaction ability whatsoever. The “Collapse” feature in Madeline is not an ideal implementation because the user has to define beforehand, in the genealogy file, which nodes should be collapsed together. If the user wants to expand them later, the data file has to be edited and the whole pedigree reloaded.

Chapter 6

Future work

Although the Interactive Pedigree Plotter has the majority of its features completed, there are many small tasks and a few bigger tasks left to be done. The most crucial features that have not been completed are detailed in this chapter.

6.1 Haplotypes

The haplotype attribute feature gives the researcher advanced tools to study the genetic transmission of variants very thoroughly. Haplotypes are a symbolic representation of markers on a single chromosome that are inherited together (see 2.10 for details on haplotypes). These markers are locations of variations in the DNA, such as SNPs, small insertions and deletions, and/or microsatellites. The idea is that the researcher would be able to see the haplotypes of genotyped individuals in the pedigree (see figure 6.1) for a chosen chromosome location. This location could, for example, include a marker that is a possible cause of a certain disease. Looking at the pedigree, the researcher is then able to see whether and how these haplotypes are inherited, that is, the actual genetic transmission of the variants of interest, and can follow the haplotypes to better study how the markers are inherited.

As mentioned, haplotypes would only be available for genotyped data. The genotyped data then has to be phased in order to determine the haplotypes (see 2.10.1 for details on phasing). In reality, though, not all individuals are genotyped, and thus it is not possible to get their haplotypes directly. That is where so-called long-range phasing comes in (see 2.10.3 for details on long-range phasing). Long-range phasing is an advanced phasing method developed at deCODE, which has enabled researchers to phase bigger chromosome regions by making use of close and distant relatives, as well as closely linked markers.

A rough example of how this could look is shown in figure 6.1. It could be the case that Ind12 is not genotyped and therefore has an undetermined haplotype. That should result in undetermined haplotypes for individuals Ind13 and Ind14, and therefore also Ind16, since the haplotypes for Ind13 and Ind14 rely on both their parents being genotyped. With the help of long-range phasing, however, the haplotypes for these individuals can be determined. The same goes for Ind8 and his children.

Figure 6.1: Example of a pedigree with haplotype attributes beneath.

In the figure above, a pedigree is shown with haplotype attributes from each parent; father chromosomes to the left and mother chromosomes to the right. In reality, individuals do not always inherit full chromosomes from their parents, but sometimes a recombination of the maternal or paternal chromosome pair, as can be seen by how the colors mix.
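The color mixing described above can be illustrated with a toy model of recombination. The Python sketch below is only an illustration and is not part of the IPP: a chromosome is represented as a string with one letter per marker, and the function name `recombine` and the idea of passing explicit crossover positions are assumptions made for this example.

```python
def recombine(first, second, crossovers):
    """Build the chromosome a parent passes on to a child.

    `first` and `second` are the parent's two chromosomes as equal-length
    strings (one character per marker). Copying starts from `first` and
    switches to the other chromosome at every index in `crossovers`.
    """
    if len(first) != len(second):
        raise ValueError("both chromosomes must cover the same markers")
    sources = (first, second)
    switch_points = set(crossovers)
    current = 0
    inherited = []
    for index in range(len(first)):
        if index in switch_points:
            current = 1 - current  # cross over to the other chromosome
        inherited.append(sources[current][index])
    return "".join(inherited)


# "A" and "B" play the role of the two colors of one parent in figure 6.1.
passed_on = recombine("AAAAAA", "BBBBBB", crossovers=[2, 4])
```

With no crossovers the child receives one full chromosome unchanged, which corresponds to a single-colored bar in the figure.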

It is important to note that researchers are already able to draw haplotypes in the former pedigree plotter, mentioned in chapter 4.1. The main difference is that PP can only draw haplotypes based on a small set of chosen markers, while the IPP will be used for drawing large haplotypes for a given genomic region, where these haplotypes will be based on the long-range phasing data. It is also important that the resulting implementation offers a fairly easy way to choose the genomic region of interest by panning or zooming, making it easy for the user to change the region.

6.2 Delete

The user should be able to dynamically delete individual pedigree nodes, e.g. spouses or children. This would, however, require recalculating the coordinates for the whole pedigree, which might be an expensive operation for big pedigrees. It is already possible to delete a whole branch (see 5.4.1), but this feature concerns single nodes.

6.3 Add

The user should be able to dynamically add nodes to the pedigree, e.g. a spouse or children. This would, however, require recalculating the coordinates for the whole pedigree, which might be an expensive operation for big pedigrees.

6.4 Export

It is important to be able to export the pedigree as it is. The user might have moved nodes, collapsed branches, deleted nodes, added attributes and so on, and all of this should be preserved in the export. The user should preferably have a few export options: exporting as PDF, exporting as an image (just a snapshot of the panel where the pedigree is drawn), and an export similar to the “Save as...” feature most applications have. The “Save as” export would have to include the coordinates of all nodes at the time the pedigree is saved, as well as all information about the attributes that have been added and which attributes are turned on. This proved to be a complicated task, so it was postponed in favor of other, more important features.
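One possible shape for the “Save as” export described above is a JSON document holding the state of every node. The sketch below is an assumption about how this could be done, not a description of the IPP: the function names and the field names `x`, `y`, `collapsed` and `attributes` are hypothetical.

```python
import json


def export_state(nodes):
    """Serialize node coordinates, collapse state and attributes to JSON."""
    # A version field makes it possible to change the format later.
    return json.dumps({"version": 1, "nodes": nodes}, indent=2, sort_keys=True)


def import_state(text):
    """Restore the node dictionary written by export_state."""
    return json.loads(text)["nodes"]


# Hypothetical state of two nodes after the user has moved and collapsed them.
nodes = {
    "Ind12": {"x": 120.0, "y": 40.0, "collapsed": False,
              "attributes": {"affected": True, "name": "Ind12"}},
    "Ind13": {"x": 80.0, "y": 110.0, "collapsed": True,
              "attributes": {"affected": False, "name": "Ind13"}},
}
saved = export_state(nodes)
restored = import_state(saved)
```

Storing the coordinates explicitly means the layout algorithm does not have to be rerun on load, so manual adjustments made by the user survive a save and reload.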

6.5 Double consanguinity line

As mentioned in chapter 2.1, one of the symbols in the standard set of pedigree symbols is the consanguinity line: when mates are blood related, their relationship is marked with a double horizontal line. This is not hard to implement for small pedigrees, but because the Interactive Pedigree Plotter will be used a lot for big pedigrees, it has to be thought through carefully. In theory, the program could check each time it draws a marriage line whether the line should be double or not. That would take very little time for small pedigrees but would take much more time as the pedigree size increases. Probably the best solution is to draw everything without double marriage lines, and then iterate once through the whole pedigree and replace the marriage lines with double ones where needed.
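The single pass over all marriage lines proposed above could look roughly like the following Python sketch. It is an assumption about one way to do it, not the planned IPP implementation: the names `consanguineous_pairs`, `parents` and `mates` are hypothetical, and two mates are treated as blood related when their ancestor sets (including themselves) overlap.

```python
from functools import lru_cache


def consanguineous_pairs(parents, mates):
    """Return the mate pairs whose marriage line should be drawn double.

    `parents` maps an individual to a tuple of its parents and `mates` is
    a list of (a, b) pairs (hypothetical input formats).
    """

    @lru_cache(maxsize=None)
    def ancestors(person):
        # Cached, so each individual's ancestor set is built only once
        # no matter how many marriage lines are checked.
        found = frozenset()
        for parent in parents.get(person, ()):
            found |= {parent} | ancestors(parent)
        return found

    def related(a, b):
        return bool((ancestors(a) | {a}) & (ancestors(b) | {b}))

    return [(a, b) for a, b in mates if related(a, b)]


# Two cousins (c1 and c2) have a child together.
parents = {
    "p1": ("g1", "g2"), "p2": ("g1", "g2"),
    "c1": ("p1", "s1"), "c2": ("p2", "s2"),
}
mates = [("g1", "g2"), ("p1", "s1"), ("p2", "s2"), ("c1", "c2")]
double_lines = consanguineous_pairs(parents, mates)
```

Caching the ancestor sets is what keeps the pass cheap for big pedigrees, at the cost of the memory needed to hold one set per individual.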

6.6 Recalculation

A feature mentioned in chapter 5.4.2 is the “Collapse all” feature, where every single family link is collapsed. For pedigrees that have more than 500 individuals, the idea was that the pedigree would start in the collapse-all state, that is, with every single link collapsed. The motive was that a pedigree with 500 individuals or more is too big to display all at once, so it is better for the researcher to get the pedigree collapsed and then traverse down the path of interest. This has been implemented. The problem is that, at the moment, the layout algorithm assigns coordinates to all nodes as if the pedigree were fully expanded, and the IPP collapses all the links afterwards, so the first generation node(s) are drawn as if there were a full pedigree below them. If the descendant tree stretches really far to the sides from the first generation to the last, the collapsed pedigree is positioned poorly, with the first generation node(s) far to the right and not visible when the pedigree is displayed. Ideally, after each expand, the pedigree should recalculate itself as if only the nodes that are actually visible were present, making the pedigree narrower and better positioned.

Chapter 7

Summary and conclusion

With technological advances in genetics, such as improved precision and the ability to harvest large quantities of genealogical information in a much shorter time than before, e.g. by moving from microsatellite genotyping to chip genotyping and whole genome sequencing, researchers have been moving from common variants towards studying rare variants in individuals. This was not nearly as feasible 20 years ago. Studying rare variants requires a lot of familial data to get a bigger picture of the genetic transmission and penetrance in a family. When studying these rare variants, it becomes very helpful to be able to visualize the transmission, and that is where the IPP comes in. The IPP answers the need for a good pedigree plotter that is able to draw diagrams of several thousand individuals in a short amount of time and lets the user interact with the resulting pedigree in a variety of ways, all while remaining robust and user friendly: an all-in-one pedigree plotter. Pedigree plotters are not a new technology, so there are several applications that are able to draw pedigrees. These applications were reviewed in chapter 3 and then compared to the IPP in chapter 5.5.

The first comparison (see 5.5.1) tested whether these pedigree plotters were able to solve certain complexity problems regarding non-trivial familial relations, such as complex consanguineous relations or individuals having multiple spouses. Most of the time, the pedigree plotters produced correct results, but the resulting layout was often not ideal: the way the diagrams were drawn would not work well for pedigrees of several thousand individuals, which the IPP is intended for, because they would quickly become cluttered and hard to read. The IPP had the highest score, 86%. Of the plotters it was compared to, HaploPainter scored highest with 79%, followed by Madeline with 64%.

The second comparison (see 5.5.2) measured the time it took to draw a pedigree. The minimum requirement for the drawing speed comparison to be valid was, first and foremost, that the pedigree plotter is able to draw the pedigree, and second, that the resolution of the pedigree is good enough for the user to read text attributes such as names under nodes. For this specific comparison, whether the family relations are drawn in an ideal way was not considered, but chapter 5.5.1 showed that in many cases, such as mates belonging to different generations and individuals having multiple spouses, the algorithms of the compared plotters did not produce the ideal solution; at least not one sufficient for deCODE's genetic analysis. It should be noted that, to get the most relevant results, real genealogy data was used, but for the compared plotters it had to be modified so that no single parents were present, just so that they could read the data and draw it. This is not an ideal solution, since pedigrees with both single parents and two parents are common in the genealogy data, especially as the size increases. The comparison is therefore by no means fair, since the resulting pedigrees from most of the compared plotters do not meet the required standards. Nevertheless, the overall result was that Madeline is by far the fastest, with the IPP coming in second. HaploPainter did not handle medium to big pedigrees but was able to draw small pedigrees with around 250 individuals. The CraneFoot plotter was fast in assigning coordinates and building the PostScript file, but when GIMP was used to produce the pedigree of e.g. 2726 nodes, a resolution of 8000 pixels per inch had to be set (width: 66.080, height: 93.600), which was the minimum resolution at which the text attributes under the nodes could be read. Preferably, the resolution should have been even higher, because it was just barely high enough to distinguish letters.

The third comparison (see 5.5.3) concerned interaction features. Being able to draw big pedigrees in a short amount of time is not enough, because the user needs to be able to interact with the pedigree by moving nodes around, deleting them and adding attributes to them, to name a few. A big incentive for creating the IPP was the ability to collapse and expand nodes and the ability to add haplotype attributes to the nodes. The former is possible but the latter is a work in progress. Only one other pedigree plotter, Madeline, was able to collapse nodes. The problem with the collapse feature in Madeline is that it is not really an interaction, because the nodes to be collapsed are defined up front when creating the data. As discussed in chapter 3.2.3, Madeline reads input files specified on the command line and generates pedigree drawings without user interaction. Madeline then writes the pedigree to a scalable vector graphics (SVG) file, so nodes cannot be collapsed and expanded after the diagram has been drawn. The IPP had the highest score, 82%. Of the plotters it was compared to, HaploPainter scored highest with 64%, followed by Madeline with 45%.

To summarize, there are a few points that are worth mentioning:

1. The pedigree plotters used in the comparison are often highly specialized in some requirements or features and therefore lacking in others. Madeline, for example, draws pedigrees incredibly fast and in an easy-to-read manner but has no interaction features. HaploPainter has the highest score among the compared plotters in the interaction comparison and is able to draw haplotypes, but crashes or takes extremely long when given bigger sets of data. None of the applications seem to have it all, which is what the IPP is striving for.

2. All of the pedigree plotters used in the comparison except HaploPainter used non-standard pedigree methods to solve the more complex family patterns, either by creating substitute nodes and marking them and the real node like Madeline does (e.g. figure 5.34), or by connecting the substitute node and the real node with special connection lines like CraneFoot does (e.g. figure 5.33), although Madeline does a lot less of it. This makes the topology much easier for the algorithm but makes the pedigree very confusing and hard to read as the pedigree size increases.

3. Madeline clearly uses a very good algorithm because it is able to assign coordinates very fast and the resulting pedigrees look very good. It would be interesting to know what the algorithm looks like and how it works.

4. Of the pedigree plotters used in the comparison, HaploPainter is probably the closest to what the IPP is striving for. It draws the pedigrees in a standard way instead of using substitute nodes like Madeline and CraneFoot. Once the pedigree is drawn, the user is able to interact with it by moving, adding, deleting and more, as well as adding attributes like haplotypes and symbols. What is lacking is the algorithm, which is not able to handle pedigrees larger than 250 individuals and does not solve some complex family patterns in an ideal manner (see figure 5.27 for example).

5. Some features are still lacking in the IPP, such as adding, deleting and haplotype attributes (see chapter 6 for details). They will be implemented in the next few months.

Chapter 8

Discussion

It is safe to say that the Interactive Pedigree Plotter is on the right path to becoming a highly advanced pedigree plotter that will prove very useful in genetic analyses. After evaluating other similar pedigree plotters, the IPP seems to be at the forefront when it comes to ideal solutions for big and complex pedigrees. It is possible, though, that other companies similar to deCODE Genetics that do genetic analyses have developed pedigree plotters and have not released them for others to use.

Bibliography

[1] deCODE genetics | a global leader in human genetics. [Online]. Available: https: //www.decode.com/ (visited on 04/26/2019). [2] D. Merriam-Webster, Dictionary by Merriam-Webster: America’s most-trusted on- line dictionary, en, 1828. [Online]. Available: https://www.merriam-webster. com/ (visited on 04/26/2019). [3] Edward H. Trager, Ritu Khanna, and Adrian Marrs, Comparison of pedigree draw- ing programs, 2006. [Online]. Available: https://madeline.med.umich.edu/ madeline/comparisons/ (visited on 05/05/2019). [4] PDQ Cancer Genetics Editorial Board, “Cancer Genetics Risk Assessment and Counseling (PDQ®): Health Professional Version”, eng, in PDQ Cancer Informa- tion Summaries, Bethesda (MD): National Cancer Institute (US), 2002. [Online]. Available: http://www.ncbi.nlm.nih.gov/books/NBK65817/ (visited on 04/26/2019). [5] R. C. King and W. C. Stansfield, A Dictionary of Genetics (Sixth Edition). Oxford University Press, 2002, vol. 6. [6] M. P. Ball, Genotype picture, en, Page Version ID: 889379459, Mar. 2019. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Genotype& oldid=889379459 (visited on 04/26/2019). [7] Debivort, Phenotype picture, en, Page Version ID: 891143030, Apr. 2019. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Phenotype& oldid=891143030 (visited on 04/26/2019). [8] T. Strachan, Human molecular genetics. New York: Wiley, 1999, isbn: 1-85996- 202-5. [9] P. H. Byers, The American Journal of Human Genetics. San Francisco, California: The University of Chicago Press, 1999. [10] C. Kwok and K. Schmitt, “Microsatellite Genotyping”, en, in Molecular Genetic Epidemiology — A Laboratory Perspective, ser. Principles and Practice, I. N. M. Day, Ed., Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 55–85, isbn: 978-3-642-56207-5. doi: 10.1007/978-3-642-56207-5_3. [Online]. Available: https://doi.org/10.1007/978-3-642-56207-5_3 (visited on 05/18/2019). 
[11] International Society of Genetic Genealogy Wiki, Haplotype - ISOGG Wiki, Genetic Genealogy Wiki, May 2018. [Online]. Available: https://isogg.org/wiki/ Haplotype (visited on 05/22/2019). [12] ——, Phasing - ISOGG Wiki, Genetic Genealogy Wiki, Dec. 2018. [Online]. Avail- able: https://isogg.org/wiki/Phasing (visited on 05/22/2019). 68 BIBLIOGRAPHY

[13] A. Kong, G. Masson, M. L. Frigge, A. Gylfason, P. Zusmanovich, G. Thorleifsson, P. I. Olason, A. Ingason, S. Steinberg, T. Rafnar, P. Sulem, M. Mouy, F. Jonsson, U. Thorsteinsdottir, D. F. Gudbjartsson, H. Stefansson, and K. Stefansson, “Detection of sharing by descent, long-range phasing and haplotype imputation”, en, Nature Genetics, vol. 40, no. 9, pp. 1068–1075, Sep. 2008, issn: 1546-1718. doi: 10.1038/ ng.216. [Online]. Available: https://www.nature.com/articles/ng.216 (visited on 05/22/2019). [14] B. Veytsman and L. Akhmadeeva, Medical Pedigrees, May 2010. [Online]. Avail- able: http://pedigrees.varphi.com/cgi-bin/pedigree.cgi (visited on 04/30/2019). [15] Progeny Genetics LLC, Free , Pedigree Chart Online - Progeny, 1996. [Online]. Available: https://www.progenygenetics.com/online-pedigree/ (visited on 04/27/2019). [16] G. G. S. Ltd, Genial Pedigree Draw - Automated Pedigree Drawing Software, en, 2014. [Online]. Available: http : / / www . pedigreedraw . com/ (visited on 04/27/2019). [17] H. Thiele and P. Nürnberg, “HaploPainter: A tool for drawing pedigrees with com- plex haplotypes”, en, Bioinformatics, vol. 21, no. 8, pp. 1730–1732, Apr. 2005, issn: 1367-4803. doi: 10.1093/bioinformatics/bth488. [Online]. Available: https: //academic.oup.com/bioinformatics/article/21/8/1730/248867 (visited on 04/27/2019). [18] ——, HaploPainter (about), About HaploPainter, 2004. [Online]. Available: http: //haplopainter.sourceforge.net/about.html (visited on 06/02/2019). [19] ——, HaploPainter (screenshots), HaploPainter screenshots, 2004. [Online]. Avail- able: http://haplopainter.sourceforge.net/screenshots.html (visited on 06/02/2019). [20] V.-P. Mäkinen, M. Parkkonen, M. Wessman, P.-H. Groop, T. Kanninen, and K. Kaski, “High-throughput pedigree drawing”, En, European Journal of Human Genetics, vol. 13, no. 8, p. 987, Aug. 2005, issn: 1476-5438. doi: 10 . 1038 / sj . ejhg . 5201430. [Online]. 
Available: https://www.nature.com/articles/5201430 (visited on 04/30/2019). [21] FinnDiane Study Group and Folkhälsan Research Center, CraneFoot | FinnDiane, en-US, 2009. [Online]. Available: http://www.finndiane.fi/software/ cranefoot/ (visited on 04/30/2019). [22] Edward H. Trager, Ritu Khanna, and Adrian Marrs, Madeline 2.0 Drawing Engine, About Madeline 2.0 pedigree plotter, 2006. [Online]. Available: https://madelin e.med.umich.edu/madeline/index.php (visited on 05/05/2019). [23] Guðbjörn Freyr Jónsson, “Að teikna fjölskyldur”, IS, Tímarit um raunvísindi og stærðfræði, vol. 2, no. 2, p. 8, Jul. 2004. [Online]. Available: http : / / raust . is/2004/2/17/ (visited on 05/02/2019). [24] W. Jackson, Pro Java 9 games development : leveraging the JavaFX APIs. Berkeley, CA: Apress, 2017, isbn: 978-1-4842-0974-5.
