Drawing DNA Sequence Networks

Julia Olivieri

Motivation

Datasets

● Caryophyllaceae (16 sequences): e.g. Arenaria congesta; S. Fior et al. 2006
● Castilleja (19 sequences): e.g. Castilleja applegatei; J. R. Bennett and S. Mathews 2006
● Cactaceae (82 sequences): e.g. Mammillaria ???; R. Nyffeler 2002

Method for Comparing Sequences
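The alignment cost on this slide depends on a substitution penalty σ and a gap penalty g. As a rough sketch, one common way to compute such a cost is the standard global-alignment dynamic program, minimizing total penalty. The function name `alignment_cost` and the default penalty values are assumptions for illustration; the slides do not state which values or which alignment variant were used.

```python
def alignment_cost(s, t, sigma=1, g=2):
    """Minimum global alignment cost of sequences s and t.

    sigma: cost of aligning two different nucleotides (assumed value).
    g: cost of aligning a nucleotide with a gap (assumed value).
    """
    # prev[j] = cost of aligning the empty prefix of s with t[:j]
    prev = [j * g for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * g]  # s[:i] aligned against the empty prefix of t
        for j in range(1, len(t) + 1):
            pair = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else sigma)
            cur.append(min(pair, prev[j] + g, cur[j - 1] + g))
        prev = cur
    return prev[-1]


print(alignment_cost("AAAAA", "ATATA"))
```

Only one row of the table is kept at a time, so memory grows with the length of the shorter loop dimension rather than with the product of both lengths.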

Alignment cost, $a_{ij}$

σ = substitution penalty

g = gap penalty

The Problem

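$D: S \to P$ (a drawing $D$ assigns each sequence in $S$ to a point in $P$)

$a_{ij} = d_{D(i),D(j)}$ (the alignment cost equals the distance between the assigned points)

Ideally

Ideal isn’t always possible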
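To see why an exact drawing can fail for the four example sequences on this slide (AAAAA, ATATA, TTTTT, AAATT), the sketch below uses the number of mismatched positions as a stand-in for the alignment cost. That choice is an assumption: it corresponds to a substitution penalty of 1 with no gaps, which the slides do not confirm.

```python
from itertools import combinations

seqs = {1: "AAAAA", 2: "ATATA", 3: "TTTTT", 4: "AAATT"}


def mismatches(s, t):
    """Number of differing positions (equal-length sequences, no gaps)."""
    return sum(x != y for x, y in zip(s, t))


# a[(i, j)] for i < j: assumed alignment cost between sequences i and j
a = {(i, j): mismatches(seqs[i], seqs[j]) for i, j in combinations(seqs, 2)}


def cost(i, j):
    return a[(min(i, j), max(i, j))]


def on_segment(i, k, j):
    """True if an exact drawing must put k on the segment from i to j."""
    return cost(i, k) + cost(k, j) == cost(i, j)


# Sequences 2 and 4 are both forced onto the segment between 1 and 3,
# at the same distance from 1, so they would share a single point...
forced_same_point = (
    on_segment(1, 2, 3) and on_segment(1, 4, 3) and cost(1, 2) == cost(1, 4)
)
# ...but their cost is not zero, so no exact Euclidean drawing exists.
print(forced_same_point, cost(2, 4))
```

Under this assumed cost the argument holds in any number of dimensions, since the triangle-equality condition pins both points to the same position on a line.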

1. AAAAA
2. ATATA
3. TTTTT
4. AAATT

Quadratic Assignment Problem

QAP: n facilities

n locations to place those facilities

Flow between every pair of facilities

Our Problem: n sequences

n locations to place those sequences

Nucleotide sequence similarity between every pair of sequences
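For reference, the textbook (Koopmans-Beckmann) form of the QAP objective sums flow times distance over all pairs. The sketch below evaluates that objective and brute-forces a tiny instance. The numbers are made up, and exactly how sequence similarity is turned into a "flow" in this work is not stated on the slide, so treat this as the generic formulation only.

```python
from itertools import permutations


def qap_cost(flow, dist, assignment):
    """Sum of flow[i][j] * dist[assignment[i]][assignment[j]] over all pairs.

    assignment[i] is the location given to facility (or sequence) i.
    """
    n = len(assignment)
    return sum(
        flow[i][j] * dist[assignment[i]][assignment[j]]
        for i in range(n)
        for j in range(n)
    )


# Toy instance (made-up numbers): 3 facilities, 3 locations on a line.
flow = [[0, 5, 1],
        [5, 0, 2],
        [1, 2, 0]]
dist = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]

# Brute force is only feasible for tiny n: there are n! assignments.
best = min(permutations(range(3)), key=lambda p: qap_cost(flow, dist, p))
print(best, qap_cost(flow, dist, best))
```

The factorial growth of the search space is what motivates the heuristics on the following slides.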

Ordering Cost

For each $s \in S$, with $|S| = n$:

$l_s = u_1, \dots, u_n$ such that $a_{s,u_i} \le a_{s,u_{i+1}}$

$L_s = v_1, \dots, v_n$ such that $d_{D(s),D(v_i)} \le d_{D(s),D(v_{i+1})}$
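The two lists for a sequence s (one sorted by alignment cost, one sorted by drawn distance) can be computed as below. How the slide then compares the lists is not recoverable from the text, so `ordering_cost` uses an assumed rule: it counts the positions at which the two lists disagree. The function names, the tie-breaking by index, and the example points are all hypothetical.

```python
import math


def orderings(s, a, points, D):
    """Return (l_s, L_s) for sequence s.

    l_s: all sequences sorted by alignment cost a[s][u].
    L_s: all sequences sorted by drawn distance from D[s] to D[v].
    Ties are broken by sequence index (an arbitrary choice).
    """
    n = len(a)
    l_s = sorted(range(n), key=lambda u: a[s][u])
    L_s = sorted(range(n), key=lambda v: math.dist(points[D[s]], points[D[v]]))
    return l_s, L_s


def ordering_cost(a, points, D):
    """ASSUMED cost: positions where l_s and L_s disagree, summed over s."""
    total = 0
    for s in range(len(a)):
        l_s, L_s = orderings(s, a, points, D)
        total += sum(u != v for u, v in zip(l_s, L_s))
    return total


# Mismatch counts between AAAAA, ATATA, TTTTT, AAATT (indices 0 to 3)
a = [[0, 2, 5, 2],
     [2, 0, 3, 2],
     [5, 3, 0, 3],
     [2, 2, 3, 0]]
points = [(0, 0), (2, 0), (5, 0), (2, 1)]  # hypothetical candidate points
D = [0, 1, 2, 3]                           # sequence i sits on point D[i]
print(ordering_cost(a, points, D))
```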

[Figure labels: $u_i$, $v_j$]

Point Placement

Heuristic: Random Assignment
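A minimal sketch of random assignment: draw r random drawings and keep the cheapest. The cost function is left abstract here, and the names `random_assignment` and `chance_of_optimum` are hypothetical. The second function is one formula consistent with the 2.72% figure quoted for |P| = 10 and r = 100,000; it assumes every sequence-to-point assignment among the |P|! possibilities is equally likely and that exactly one of them is optimal.

```python
import math
import random


def random_assignment(n_seqs, n_points, cost, r=10_000, seed=0):
    """Try r random drawings and return (best_cost, best_drawing).

    A drawing is a tuple D where D[i] is the point index given to sequence i.
    cost is any function mapping a drawing to a number.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(r):
        D = tuple(rng.sample(range(n_points), n_seqs))  # distinct points
        c = cost(D)
        if best is None or c < best[0]:
            best = (c, D)
    return best


def chance_of_optimum(n_points, r):
    """Chance that r independent uniform drawings hit one specific optimum."""
    return 1 - (1 - 1 / math.factorial(n_points)) ** r


print(f"{chance_of_optimum(10, 100_000):.2%}")
```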

● Choose random drawings and save the drawing with the lowest cost
● r is the number of runs (r = 10,000)
● Assuming one optimal solution, the chance of finding it is $1 - \left(1 - \frac{1}{|P|!}\right)^{r}$
● If |P| = 10 and r = 100,000, the chance of finding the optimum is 2.72%

Heuristic: Greedy Assignment

Create a drawing by assigning sequences to points one at a time, each time to the lowest-cost point by the Euclidean cost
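The greedy step above can be sketched as follows. The exact form of the "Euclidean cost" is not given on the slide, so this example assumes it is the squared gap between alignment cost and drawn distance, summed over the sequences already placed. The function name, the placement order, and the example points are hypothetical.

```python
import math


def greedy_assignment(a, points, order=None):
    """Place sequences one at a time on the cheapest free point.

    a[i][j]: alignment cost between sequences i and j.
    points: candidate (x, y) locations, at least as many as sequences.
    order: order in which sequences are placed (default 0, 1, 2, ...).
    Returns D with D[i] = index of the point given to sequence i.
    """
    order = list(range(len(a))) if order is None else list(order)
    D = {}
    free = set(range(len(points)))

    def placement_cost(s, p):
        # ASSUMED "Euclidean cost": squared gap between alignment cost and
        # drawn distance to every sequence placed so far.
        return sum(
            (a[s][u] - math.dist(points[p], points[D[u]])) ** 2 for u in D
        )

    for s in order:
        # sorted() makes ties resolve to the lowest point index
        p = min(sorted(free), key=lambda q: placement_cost(s, q))
        D[s] = p
        free.remove(p)
    return [D[i] for i in range(len(a))]


a = [[0, 2, 5, 2],
     [2, 0, 3, 2],
     [5, 3, 0, 3],
     [2, 2, 3, 0]]
points = [(0, 0), (2, 0), (5, 0), (2, 1)]
print(greedy_assignment(a, points))
```

Because each choice is final, the result depends on the order in which sequences are placed, which is why `order` is exposed as a parameter.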

Ordering method

Heuristic: Hill Climbing
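A sketch of hill climbing over 2-swaps: each run evaluates every exchange of two entries of the drawing and takes the cheapest. Repeating runs until no swap lowers the cost is an assumption about the stopping rule, which the slide does not spell out. The cost function is left abstract and the function names are hypothetical.

```python
from itertools import combinations


def best_two_swap(D, cost):
    """Return (drawing, cost) for the cheapest drawing one 2-swap away from D."""
    best_D, best_c = None, float("inf")
    for i, j in combinations(range(len(D)), 2):
        cand = list(D)
        cand[i], cand[j] = cand[j], cand[i]
        c = cost(cand)
        if c < best_c:
            best_D, best_c = cand, c
    return best_D, best_c


def hill_climb(D, cost):
    """Apply the minimum-cost 2-swap repeatedly until nothing improves."""
    D, cur = list(D), cost(D)
    while True:
        cand, c = best_two_swap(D, cost)
        if cand is None or c >= cur:
            return D, cur
        D, cur = cand, c


# Number of 2-swaps examined per run for n = 16
print(len(list(combinations(range(16), 2))))
```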

2-swap:

[Figure: (1, 4, 2, 3) becomes (1, 3, 2, 4)]

One run: find the minimum-cost 2-swap

For n = 16: 120 2-swaps, 1120 3-swaps

Heuristic: Simulated Annealing

Temperature T > 1, ratio q ∈ (0,1), runs r

[Diagram: a 2-swap turns drawing D (cost cur) into D’ (cost new)]

If new < cur, D = D’

Else, choose b ∈ (0,1)

If b < $e^{(\mathrm{cur} - \mathrm{new})/T}$, D = D’

If not, D does not change

T = qT

Continue until T < 1
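The loop above can be sketched as follows. Two details are assumptions: r is read here as the number of 2-swap proposals tried at each temperature, and the function also remembers the best drawing seen so far, which the slide does not mention. The cost function is left abstract, and the default values of T and q are placeholders.

```python
import math
import random


def simulated_annealing(D, cost, T=100.0, q=0.95, r=1, seed=0):
    """Anneal a drawing D (list of point indices) using random 2-swaps.

    T: starting temperature (> 1); q: cooling ratio in (0, 1).
    r: proposals tried at each temperature (ASSUMED meaning of "runs r").
    """
    rng = random.Random(seed)
    D, cur = list(D), cost(D)
    best_D, best = list(D), cur
    while T >= 1:
        for _ in range(r):
            i, j = rng.sample(range(len(D)), 2)
            cand = list(D)
            cand[i], cand[j] = cand[j], cand[i]
            new = cost(cand)
            # Always accept improvements; accept a worse drawing with
            # probability exp((cur - new) / T).
            if new < cur or rng.random() < math.exp((cur - new) / T):
                D, cur = cand, new
                if cur < best:
                    best_D, best = list(D), cur
        T *= q  # cool down
    return best_D, best
```

With T = 100 and q = 0.95 the temperature drops below 1 after 90 cooling steps, so larger T or q closer to 1 buys more proposals at the price of run time.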

Results: Caryophyllaceae

[Figure panels: Random: 1 run; Greedy: ordered; Hill climbing; Random: 10,000 runs; Greedy: unordered; Simulated annealing]

Results: Castilleja

[Figure panels: Random: 1 run; Greedy: ordered; Hill climbing; Random: 10,000 runs; Greedy: unordered; Simulated annealing]

Results: Cactaceae

[Figure panels: Random: 1 run; Greedy: ordered; Hill climbing; Random: 10,000 runs; Greedy: unordered; Simulated annealing]

Results Table

Time Crunch: 1000 seconds

175432 99412

Time Crunch: 60 seconds

120136

Further Directions

● Larger datasets
● Combining heuristics
● Figuring out the dimension of graphs
● Continuous technique
● Biological interpretations of images

Acknowledgements

Math Department

Bob Bosch

Friends and family

Audience!

Works Cited

[1] Rodney J. Dyer and John D. Nason. Population graphs: the graph theoretic shape of genetic structure. Molecular Ecology, 13:1713–1727, 2004.

[2] Ashesh Nandy, Marissa Harle, and Subhash C. Basak. Mathematical descriptors of DNA sequences: development and applications. Archive of Organic Chemistry, 9:211–238, 2006.

[3] Jonathan R. Bennett and Sarah Mathews. Phylogeny of the parasitic family Orobanchaceae inferred from phytochrome A. American Journal of Botany, 93(7):1039–1051, 2006.

[4] Simone Fior, Per Ola Karis, Gabriele Casazza, Luigi Minuto, and Francesco Sala. Molecular phylogeny of the Caryophyllaceae (Caryophyllales) inferred from chloroplast matK and nuclear rDNA ITS sequences. American Journal of Botany, 93(3):399–411, 2006.

[5] Reto Nyffeler. Phylogenetic relationships in the cactus family (Cactaceae) based on evidence from trnK/matK and trnL-trnF sequences. American Journal of Botany, 89(2):312–326, 2002.

[6] Harold William Rickett. Wild Flowers of the United States, volume 1. Hinkhouse Inc, New York, New York, 1966.

[7] Samuel F. Brockington, Ya Yang, Fernando Gandia-Herrero, Sarah Covshoff, Julian M. Hibberd, Rowan F. Sage, Gane K. S. Wong, Michael J. Moore, and Stephen A. Smith. Lineage-specific gene radiations underlie the evolution of novel betalain pigmentation in Caryophyllales. New Phytologist, 207(4):1170–1180, 2015.

[8] National Library of Medicine. BLAST, 2016. http://blast.ncbi.nlm.nih.gov/Blast.cgi.

[9] István Miklós. Introduction to Algorithms in Bioinformatics. Budapest, Hungary, 2016. http://www.renyi.hu/ miklosi/AlgorithmsOfBioinformatics.pdf.

[10] Eranda Cela. The Quadratic Assignment Problem: Theory and Algorithms. Springer Science+Business Media, B.V., Dordrecht, Holland, 1998.

[11] Rainer E. Burkard. Handbook of Combinatorial Optimization. Springer Reference, Media, New York, 2013.

[12] Zbigniew Michalewicz and David B. Fogel. How to Solve it: Modern Heuristics. Springer, Berlin, Germany, 2002.