<<

AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

A Graph Theoretical Approach to DNA Fragment Assembly

Jonathan Kaptcianos‡ Saint Michael’s College Colchester, Vermont 05439 USA

Received: September 19, 2007 Accepted: April 1, 2008

ABSTRACT

Built on prior work on sequencing by hybridization, fragment assembly is a newly explored method of determining whether or not a reassembled strand of DNA matches the original strand. One particular way to analyze this method is by using concepts from graph . By constructing data models based on these ideas, it is possible to come to various conclusions about the original problem regarding reassembled strands of DNA. In this paper we will detail this approach to DNA fragment assembly and present some related graph theoretical proofs in the process, including the BEST theorem. Further, we will explore the Eulerian superpath problem and its role in aiding fragment assembly, in addition to other recent applications of in the of .

I. Introduction both of which are used in counting Eulerian circuits in a . In recent years, scientists and Further, this paper will briefly survey researchers have emphasized DNA the current state of DNA sequencing and sequencing and fragment assembly with the fragment assembly, such as the problems hopes of enhancing their abilities to that arise, and how they are addressed reconstruct full strands of DNA based on the using graph theory. We will see how graphs pieces of data they are able to record. are being constructed using DNA data, and Complications arise with fragment assembly how these graphs inform the problem. Our due to imperfect data sets. Strands are primary emphasis falls on the Eulerian often riddled with repeats and come in superpath problem of Pevzner, Tang and varying sizes. As a result, configuring the Waterman in [1-2], which is a more recent image of the original genome is not as easy attempt to simplify DNA graphs through a as fitting one puzzle piece into the next. series of transformations on the graph. We will discuss a recently Such transformations include different edge discovered approach to DNA fragment detachments for cases of single and multiple assembly that uses components from graph edges in a directed Eulerian graph. theory, including de Bruijn graphs and The concluding section discusses Eulerian circuits, to successfully the decomposition of DNA sequencing with compensate for some of these errors. nanopores, as detailed in Bokhari and Sauer Specifically, this approach, originating from [3], and the role that graph theory plays in work done with sequencing by hybridization, resolving problems that arise. This is a consists of constructing various directed more recent application of graph theory graphs based on provided DNA data, and being put to use in the field of bioinformatics. counting possible Eulerian circuits in these graphs. II. Graph Theory Definitions This paper will provide some preliminary background information from Graph theory is a field of graph theory. In addition to several which has many applications in definitions, we will prove two theorems, the the world of science. We assume the reader BEST theorem and the theorem, is familiar with the basic foundations of graph theory, such as that found in Tucker ‡ [email protected]

1 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

[4], and West [5]. The basic terminology here follows that of West [5]. 01 11 One can move through the elements 011 of a graph in several different ways. A walk is any journey through a graph which produces a list of vertices and edges such that an edge connects the vertices on either 101 side of it. A trail is a walk where no edges 010 110 are repeated. A closed trail is a circuit, and the list of vertices is expressed in cyclic order. Further, a is a trail in which no vertices are repeated such that the vertices 10 can be listed in consecutive order as one is Figure 1. A de Bruijn Graph for the adjacent to the next. In the case of a sequence “0110101” with fragments of directed graph, the edges of a trail or circuit length 4. must be consistently oriented.

A trail or circuit which traces every edge in a graph once and exactly once is the directed edge between two vertices considered Eulerian. Eulerian trails and represent the respective triple. circuits are allowed to pass through vertices in a graph more than once. The number of Eulerian circuits in a graph can be computed III. DNA Sequencing and Fragment with a formula generated from the BEST Assembly theorem, which will be proven later in this paper. For simplicity, consider DNA A final definition from the field of sequencing to be much like the process of graph theory which we use in this paper is constructing a children’s puzzle, as that of de Bruijn graphs. A de Bruijn graph expressed by Pevzner, Tang, and is a directed graph with vertices that Waterman [1]. The only difference lies in represents sequences of symbols from an the fact that scientists are dealing with alphabet, and edges that indicate where the hundreds of pieces of varying sizes, and sequence may overlap, as defined in de some pieces are exactly identical to one Bruijn [6]. another. The “puzzle pieces” are various The construction of this graph is strands of DNA reads and the completed dependent on the of -length pieces, or puzzle is what the entire original strand, or fragments, from the particular sequence at genome, looks like. The provided hand. Each is labeled by a fragment information on this aspect of Bioinformatics of length l – 1 and the directed edge existing comes from various sections of the text by between two vertices represents one of the Jones and Pevzner [7]. l-length fragments. Specifically, the first Specifically, DNA fragment symbol in the fragment comes from the assembly is a lab technique which vertex that sends the edge, and similarly, determines the configuration of an entire the last symbol comes from the receiving genome given a series of viewable reads vertex. Thus, the remaining symbols that from that particular genome. Since a full both vertices contain are found labeling the strand of DNA consists of millions of edges. nucleotides, there is no way to visibly For example, we construct a de capture the structure in its entirety. With Bruijn graph for the sequence “0110101” today’s technology, scientists and using fragments of length l = 3. The four researchers are able to look at fragments of triples that are present in the sequence are DNA anywhere from 500 to 1200 011, 110, 101 (which appears twice), and nucleotides long. These fragments all 010. Figure 1 shows the de Bruijn graph consist of the assortment of the four types of that coincides with this sequence. Notice bases which make up DNA: adenine (A), the vertices represent the first and last thymine (T), guanine (G), and cytosine (C). subsequence of length 2 in each triple, and There could be multiple ways to reconstruct

2 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

GT CC CG CCG

GTG TGT GCC CGC

ATG TGC GCA AT TG GC CA

Figure 2. A de Bruijn graph for the sequence ATGTGCCGCA.

the original strand out of the fragment pieces problem of repeats. Previous approaches but only one of them is correct. for doing so followed the “overlap-layout- There are many problems and consensus” . The first step complications that arise in fragment involves all possible reads and assembly which make it much more difficult finding any overlapping. This is done by than simply constructing a puzzle. Within looking at the beginning sequence of one the portrayal of the viewable reads there read and the ending of another. The layout exists a rate in which errors are produced by step is the construction of the strand, and it sequencing machines. Anywhere from 1% proves to be the most difficult. Keeping to 3% of DNA reads resulting from repeats in mind, an attempt is made to find sequencing may contain errors and do not the order of reads along a full strand of DNA appropriately represent a part of the full and put them together. The last strand. of this algorithm is deriving how the Further, another problem lies within sequence will appear based on the layout the fact that DNA is a double-stranded produced in the previous step. The structure. Based on the work of Watson and approach we will discuss in this paper Crick [8], we know that with each single abandons the traditional outline of this three strand of DNA there exists a complement, step process, and in doing so, becomes a as A matches with T, and C is matter of . complementary with G. Thus, upon looking at reads from a strand of DNA, we are IV. Resembling Strands of DNA with de unable to determine whether or not a Bruijn Graphs fragment is coming from the desired strand, or its complement. Sequencing by hybridization is a lab While the error rates and technique which similarly looks at fragments complement confusions do hinder some of DNA, known as DNA arrays, but is not to steps in DNA sequencing and fragment be confused with fragment assembly, as assembly, the major area of concern and they are different. Specifically, sequencing confusion comes from repeated sequences. by hybridization (SBH) relies mostly on the The human genome has a large number of binding of an unknown target fragment of sequences that repeat multiple times. DNA with an array of shorter synthetic Further, if a repeating sequence is larger fragments known as probes. These probes than the size of the viewable reads, it would are anywhere between 8 and 30 nucleotides make construction of the genome almost in length, and they bind to the target based impossible. For example, consider a on the Watson-Crick complements, as noted particular genome which has a repeating in Pevzner [9]. sequence spanning a stretch of 2000 However, it was a graph theory nucleotides. No sequencing machines approach to SBH by Idury and Waterman today can view a read of this length, so [10] that introduced similar approaches in there is no way to ever see the repeat in its the field of DNA fragment assembly. In their entirety. article, the authors defined the rules to Most attempts at correcting errors in construct a directed graph based on a fragment assembly aim to attack the sequence of DNA. More specifically, these

3 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1 rules were for constructing a de Bruijn graph based on DNA pieces of k-length. The graph had vertices labeled by fragment of GT CG length k-1, and edges labeled by fragments of length k, which connected two adjacent vertices in the sequence. Figure 2 shows a de Bruijn graph for the simple sequence ATGTGCCGCA, as it AT TG GC CA was used by Idury and Waterman [10]. Here, the desired fragments of DNA are of length 3. Thus, the vertices are labeled by GG fragments of length 2 and edges labeled by fragments of length 3, representing the particular triple from the sequence. Notice Figure 3. The Graph for the set of reads S. how there is only one possible Eulerian trail in the graph above, as this is just a simple case. Various work done by Pevzner in [9, With that, there is a theorem used to count 11] has developed these ideas further. Eulerian circuits in a directed graph, known Now, we will construct another de as the BEST theorem, which we will prove, Bruijn graph using the same rules as above. followed by another related theorem and Given the set of fragments S = {ATG, TGG, proof. TGC, GTG, GGC, GCA, GCG, CGT}, the graph would similarly possess vertices of length 2 and edges of length 3. However, V. The BEST Theorem the graph produced for this set of reads is more complex than the previous graph. There exists a formula for Specifically, there are two different closed computation of Eulerian circuits in a directed trails through the entire graph, producing the graph. Named after its inventors, de Bruijn, sequences ATGGCGTGCA and van Aardenne-Ehrenfest, Smith, and Tutte, ATGCGTGGCA. See Figure 3, and note [12-13], the BEST Theorem reads as follows: that the extra dotted edge is added to connect the first and last vertices AT and CA, Theorem 5.1. Given a connected directed creating an Eulerian digraph. graph G and set of vertices V(G) = {ν1,…, νn} As previously mentioned, this all of even , the number of Eulerian approach to DNA fragment assembly, circuits |s(G)| is expressed as the following, referred to by some as the approach, creates a question of probability. where |ti(G)| is the number of spanning With this example above, we see that this trees rooted towards any vertex νi in G: given set of fragments constructs two different sequences, based on there being n + )( = i ( ) vdGtGs j − )!1)(( . two differing Eulerian circuits. Therefore, ∏ j = 1 there is a ½ possibility that we will resemble the actual genome, depending on which The following proof for this theorem is Eulerian circuit we follow. Since we added indicated in Bollobás [14]. the extra edge making the graph closed, we are able to count Eulerian circuits as Proof: We are given a directed G opposed to trails. with a vertex set V(G) = {ν1,…, νn} and the More generally, when we are given outdegree and indegree of a vertex νi being complete and errorless data, we find that the + - equivalent, which is denoted d (νi) = d (νi). probability of reconstructing the original If these conditions are met, then we know strand of DNA given a set of fragments is: that G has at least one directed Euler circuit, 1 by a theorem found in Bollobás [14]. Further, let’s assign E as the set of total # of Eulerian circuits in the de Bruijn graph directed Euler circuits, and Ei as the set of

4 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

v v2 v1 v2 1

v v 4 v3 4 v3

Figure 4. By using the Eulerian ν1ν2ν4ν2ν3ν4ν1 in graph G, this produces the oriented toward ν1 indicated in bold in the graph on the right.

directed Euler trails starting and ending with Since here we are claiming that vertex νi. Since the number of times that T ∈T1, we must prove that a) T is indeed a each Eulerian circuit will pass through vertex tree, and that b) T is oriented towards the νi is equivalent to indegree (or outdegree, vertex ν1. equivalently) of νi, then the number of Euler To show (a) we first assume that T trails starting with νi can be expressed as contains a cycle C, which would not contain

ν1. Also, since C is a cycle, each vertex in = + (ν ) = − ()ν EdEdE i i i . C has an outdegree of 1. Assume that edge . eg is the last edge of Euler trail S on cycle C, The above equation states that the number which connects the vertices νg and νh. If of Eulerian trails starting and ending with νi this is the case, then S is arriving back to νh is equivalent to the either the indegree or despite having left it for the last time after outdegree of νi multiplied by the total traveling through edge eh previously. This number of Eulerian circuits in G. contradicts the corresponding map, which Now, we let Ti be the set of states that an edge ej follows the last visit to spanning trees oriented towards νi and use it vertex νj, and will not return to νj again. in the mapping function fi : Ei →Ti. This map Therefore, T is a tree. has an input of the set of Euler trails starting Now, we prove (b) by asking and ending with in and an output of the νi G whether or not T is oriented towards ν1. We set of spanning trees oriented towards νi in answer this by first assuming that T contains G. For simple notation in this proof, we’ll a path νkνk-1…ν1. The edge between use i = 1, so f1: E1→T1. vertices ν2 and ν1 is e2, as there is no e1. Given S as an Euler trail from the Further, the edge between vertices ν3 and set E1, this map constructs a set of edges ej ν2 could be either e2 or e3, but since e2 has in the following manner. The first edge in S already been assigned, it must be e3. As we exiting ν1 is not drawn, meaning that j = proceed in this same manner between all 2,…,n for each edge ej. Further, each edge vertices in the path, we this prove that the ej appears as S passes through a vertex νj path νkνk-1…ν1 is indeed oriented towards for the last time and doesn’t return to it again the vertex ν1. in the Eulerian trail. The result from this Thus, the digraph T is in the set of map is a directed tree T, with vertices ν1,…, spanning trees oriented towards vertex νi in νn and edges e2,…,en that is in the set of the original graph G. Therefore, we can spanning trees oriented towards vertex v1. conclude that T ∈T1 and continue with this Figure 3 shows an example of this map proof. based on a specified Eulerian cycle in G.

5 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

n To get the desired map f1: E1→T1, we must −1 ()= dTf + ()ν ! d + ()ν − !1 . 1 i ∏ j = 2 ()j set f 1(S)=T. Now, working backwards, the set -1(T) describes the all the Euler trails f 1 The factorial of the outdegree of vertex ν1 about vertex ν1 that give rise to the considers number of different ways the Euler spanning tree T. trail can begin (which explains why the size Thus, using the fact that f1(S) = T of E1 is greater than E), and the product and S∈E1, we can draw an Euler trail S in factorial considers that there is one less the following manner. We begin at an edge available outgoing edge at vertex νj as the which leaves vertex ν1. Once we return to Euler trail S passes through. ν1, we leave it through another unused Furthermore, upon focusing on the edge, which also must exit ν1; when there entire set of Eulerian circuits E and are no untouched edges exiting ν1, the trail disregarding a set vertex νi to start from, the ends. Now, upon arriving at νj (where j > 1), number of said circuits is as follows: leave this vertex by an edge that is not ej n and also not in the tree T; once these edges = dTE + (ν )− !1 . 1 ∏ j = 1 ( j ) have been exhausted, then leave νj through the edge ej. Here, the product factorial for outdegrees of Based on the fact that the indegree all vertices in G is multiplied by the total and outdegrees of any vertex νj in G are number of spanning trees T1 oriented equivalent, we have successfully found an towards the vertex ν1. Euler trail S which is in the set E1 such that With this series of equations, we f1(S) = T. Note that, for some spanning have thus proven and verified the accuracy trees T, there will be more than one Euler of the formula which is associated with the trail S that can be produced. BEST theorem. ■ Using the graph G as an example, The BEST theorem relies on the we show the process of using the spanning number of spanning trees which exist in a tree T (as shown darkened in Figure 3) to particular Eulerian digraph. We now need a construct the Euler trails S that gives rise theorem which enables us to determine the with it. Start at the edge which leaves ν1 number of spanning trees in a directed Eulerian graph. This theorem has been towards ν2. Once at ν2, we must use the referred to as the matrix tree theorem, and a edge towards ν4, as the one towards ν3 is the “e ” edge used in the tree T. Notice at full understanding of its proof requires some j further definitions. The following terms and ν4, there are two outgoing edges, since the definitions follow from texts by Fleischner one towards is also part of T, we must ν1 [15-16], Bollobas [14], and Pevzner [9]. use the edge going back to v . Having 2 Given a directed Eulerian graph G, it returned to ν2, the only remaining edge is is possible to construct various matrices the one from T, so we use that. The same which encode information about the follows for , as the only remaining edge, ν4 properties of the graph. One matrix, the also from T, completes the Eulerian trail S. , lists the number of edges The primary concept in this process between various vertices in the graph. It is is as follows: once we enter any vertex in an n × n matrix where n is the total number of the graph, we exit that vertex by a non-tree edge for as long as possible, until we are vertices in G. If aij is the number of directed only left with a tree edge. Continuing to do edges that leave a vertex νi and enter a so for each vertex produces an Eulerian trail vertex νj in G, then aij is the i,j-th entry of the in the graph. matrix. The i,i-th entries, which are the In general, we can express the diagonal of the matrix, will be zero unless G number of Euler trails S that can be has loops. produced with a given spanning tree T -1 Recall the graph G from Figure 3 oriented towards νi (denoted as | f 1 (T)|) as that we just used previous. Its adjacency follows: matrix is denoted here as A(G).

6 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

⎛ 0010 ⎞ ⎜ ⎟ This matrix is central to the matrix ⎜ 1100 ⎟ tree theorem. There are certain A G)( = . characteristics that all Laplacian matrices of ⎜ 1000 ⎟ ⎜ ⎟ Eulerian digraphs possess. First, the ⎜ ⎟ ⎝ 0011 ⎠ diagonal still represents the indegree of each vertex in G, having only removed any loops. In our example there are none, thus Notice the following: a) ν1 sends an edge the diagonal stays the same. Also, for each to ν2, hence a 1 in the 1,2-entry; b) ν2 sends edge that exists from vertex νi to νj, there is edges to ν3 and ν4, hence 1’s in the 2,3 and a -1 in the i,j-th entry of the matrix, where i 2,4 entries; c) ν3 sends an edge to ν4, hence and j are not equal. a 1 in the 3,4-entry; and d) sends edges ν4 Another characteristic of the to ν1 and ν2, hence 1’s in the 4,1 and 4,2 is that columns and rows entries. Also, since there are no multiple sum to zero. The reason for this is clear. edges of the same direction in G, there are Each column i accounts for two quantities. only 1’s in the entries. Further, the diagonal One of which, on the diagonal, is the contains all zeros because there are no indegree of vertex νi, excluding loops. The loops in G. second is the number of edges that are A second matrix that can be received by νi from all other vertices in G, constructed from a graph is the diagonal whose values here are negative. Thus, it is matrix. This matrix gives the indegrees of reasonable that by adding these together, all each vertex in the graph along its diagonal. edges that enter vertex νi are being All other entries must be zero. Using the accounted for, which produces zero for the same graph G, its D(G) is as balanced digraphs we consider here. shown: On the other hand, each row i accounts for two different quantities. The ⎛ 0001 ⎞ first is the number of edges that leave vertex ⎜ ⎟ νi, excluding loops, and the second is the ⎜ 0020 ⎟ number of edges that are received by all D G)( = ⎜ ⎟ . 0100 other vertices from νi. These values, ⎜ ⎟ however, are all negative. Since we know ⎜ 2000 ⎟ G ⎝ ⎠ is an Eulerian digraph, the number of edges entering and number of edges exiting any A third matrix combines both aspects of vertex are equivalent. Because of this, we these two matrices A(G) and D(G) that we know that by adding the indegree of vertex have described above. Some graph νi, which is positive, to the summation of the theorists will refer to this as the Kirchoff edges which leave it, which is negative, will matrix, as seen in Fleischner [16], while always result in zero for any L(G) where G is others will call it the combinatorial Laplacian an Eulerian digraph. matrix, as seen in Bollobás [14]. For the We now recall the cofactor of a sake of this paper, will we call it the matrix. This value is found by removing the Laplacian matrix, and denote it L(G). i-th row and i-th column of a given matrix, Specifically, this matrix is the result of and then taking the determinant of the subtracting A(G) from D(G). remaining matrix. We denote this as Once again, using our graph G from follows, where k is the i-th cofactor: Figure 3 as an example, the Laplacian

matrix L(G) is as shown below: k = det ,ii L()G .

We have included the L(G) matrix in the ⎛ − 0011 ⎞ ⎜ ⎟ formula above because, for the sake of this ⎜ −− 1120 ⎟ paper and ensuing theorem and proof, we L G)( = . will be considering the cofactors of the ⎜ −1100 ⎟ ⎜ ⎟ Laplacian matrices of various directed ⎜ ⎟ ⎝ −− 2011 ⎠ Eulerian graphs.

7 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

G G-v1v2 G/v1v2

v1 v2 v1 v2 v12

v3 v4 v3 v4 v4 v3

Figure 5. The graphs G, G-ν1ν2, and G/ν1ν2.

As an example, consider the Laplacian i (Gt ) = ,ii L G)(det . matrix for the graph G from Figure 3. The cofactor at the first row and first column is The following proof for the matrix the determinant of the matrix below: tree theorem was constructed with the aid of Bollobás [14], which proves a similar ⎛ −− 112 ⎞ theorem involving undirected and weighted ⎜ ⎟ graphs for electrical networks. ⎜ −110 ⎟ . ⎜ ⎟ ⎝ 201 ⎠ Proof: We will use induction on the number of edges m in a graph for this proof. Taking the determinant of this matrix Further, we will assume that the given graph produces the value of k = 2. Further, if we G is connected, as an unconnected graph found the other three cofactors from L(G), has zero spanning trees, and that m is we would find all of them to be 2; the reason greater than one, as a graph with one edge for this stems back to the fact that all rows only has one spanning tree. and columns sum to zero, as this will hold Now, given two vertices ν1 and ν2 true for all such matrices. Proof of this can that are adjacent to each other in G, we will be found in Fleischner [16]. It is important to produce two new graphs from G which have note that matrices whose rows and columns less than m edges. The first is created by do not add to zero may have different removing all the directed edges that leave cofactors depending on which rows and vertex ν1 and enter ν2 in G, whose set we columns are removed from the original express as . This produces the graph matrix. This information allows us to move ν1ν2 on to a theorem which is pivotal for full G-ν1ν2. To make the second graph we understanding of the former BEST theorem contract these edges ν1ν2, producing a and other topics involving Eulerian graphs. graph G/ν1ν2. In this case, the vertices ν1 As stated earlier, this theorem is referred to and ν2 are fused, creating a new vertex ν12. as the matrix tree theorem. Also, the edges previously coming into each former vertex from a particular vertex νi are Theorem 5.2. Given a directed graph G with joined as one edge, with a weight totaling the set of vertices V(G) = {ν1,…, νn} and a the summation of these individual edges. Further, all edges exiting each removed set of spanning trees ti(G) oriented towards vertex into vi are joined and weighted in the the vertex νi, then │ti(G)│ is equal to the same manner. Note that both graphs G-ν1ν2 cofactor of L(G) on the i-th row and i-th and / will have less than m edges. column. That is, G ν1ν2 Figure 5 depicts an example of producing these two new graphs from the

8 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

graph G. Notice how G-ν1ν2 just removes spanning trees oriented towards νi in the the single directed edge from ν1 to ν2, and graph where ν1ν2 has been removed. We G/ν1ν2 contracts the edge, leaving a multiple know this to be G/ν1ν2. edge weighted at 2 going from the vertex ν4 By considering the various spanning trees oriented towards any of the four into the new vertex ν12, which we express with two arrows on the one drawn edge. vertices in our example graph G, we find this This compensates for the fact that ν4 sent equality to be true. Taking the vertex ν4, for an edge to each of ν1 and ν2 in the original example, we see there are two spanning graph G. There are no trees oriented towards it. The first contains the directed edges ν1ν2, ν2ν3, and ν3ν4, and leaving ν12, as there were no vertices in G the second contains the edges ν1ν2, ν2ν4, that received an edge from both ν1 and ν2. As the new graphs are produced, and ν3ν4. With the right side of our formula the following relationship on the number of and the graph G -ν1ν2, we notice there are no spanning trees oriented towards ν4. spanning trees is established, where a12 is Therefore, both trees must be produced the total number of edges ν1ν2 that existed from the second term of our formula above. in G: Using both of the multiple edges at ν4ν2, we are able to construct two sets of trees which i () i ()GtGt νν 21 +−= (Gta / νν 21112 ) . touch all three vertices and are oriented towards . Multiplying this by the number ν4 Specifically, the number of spanning trees in of edges ν1ν2 we removed, 1, we produce G oriented towards vertex νi is equal to such the final value 2. This is indeed equivalent to the number of spanning trees spanning trees in G-ν1ν2 added to the in our original graph G oriented towards the product of such spanning trees in G/ν1ν2 vertex ν4. and the number of directed edges ν1ν2. The Now that we have established a first term on the right side takes into account relationship for the spanning trees in G, all the spanning trees oriented toward νi -ν1ν2, and /ν1ν2, we turn to the second where the edges ν1ν2 have been removed. G G Therefore the second term must consider all part of this proof. We do so by constructing the spanning trees that do have the edges a similar relationship with the cofactors of each graph’s Laplacian matrices, which is as ν1ν2 in them. Because of this, we can represent those edges by themselves with follows: a12. Removing this amount leaves all the

det ,ii L()G = det , ii L(G −ν ν ) + ,1221 ii L(Ga /det ν ν 21 ).

In order to complete this proof, we must same manner, by looking at each individual prove that the formula above will always be component, starting first with the Laplacian true. By doing this we can establish equality matrices of each graph. between the elements on the right side of As we stated prior to this proof, a this equation and those from the right side of Laplacian matrix for a directed graph the earlier equation based on |ti(G)|. enumerates the directed edges from a Keeping in mind that we are proving this vertex νi to a vertex νj, which is represented theorem via induction on the number of as a negative number in the i,j entry of the edges m, notice that these inequalities are matrix. We will denote this as –ai,j, where i ≠ based on graphs of less than m edges, j. On the diagonal in L(G) are the indegrees Like the equation in the first part of of each respective vertex in G, where di is at this proof, the among the cofactors the i,i entry and is the indegree of vertex . uses the same summation with the graphs νi Thus L(G) for a general directed graph G G-ν1ν2 and G/ν1ν2, and the number of edges with n vertices can be expressed as follows: ν1ν2 removed from the original graph G. Because of this, we can approach it in the

9 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

L()G = Now, upon removing the all the directed edges ν1ν2 in our general graph, the ⎛ −−− aaad − a ⎞ ⎜ 1 12 13 14 L 1n ⎟ respective Laplacian Matrix would be as shown below, keeping in mind that a12 is the ⎜− 21 2 23 −− aada 24 − a 2n ⎟ L number of directed edges that existed from ⎜ −− − adaa − a ⎟ ⎜ 31 32 3 34 L 3n ⎟ ν1 to ν2: ⎜ 41 42 −−− 43 daaa 4 M ⎟ ⎜ ⎟ ⎜ MMM MO ⎟ ⎜ ⎟ ⎝ n1 n2 −−− aaa n3 LL d n ⎠

⎛ −−− aaad − a ⎞ ⎜ 1 12 13 14 L 1n ⎟ ⎜ 12221 23 −−−− aaada 24 L − a2n ⎟ ⎜ −− − adaa − a ⎟ ⎜ 31 32 3 34 L 3n ⎟ . L()G νν 21 =− ⎜ 41 42 −−− 43 daaa 4 M ⎟ ⎜ ⎟ ⎜ MMM MO ⎟ ⎜ ⎟ ⎝ n1 n2 −−− aaa n3 LL dn ⎠

Notice how only two components of this the same premise. matrix are different from the original matrix The third Laplacian matrix, for the L(G), at the 1,2 and 2,2 entries, which graph G/ν1ν2, will be slightly different. Since represent the directed edges from vertices the number of vertices is being reduced from ν1 to ν2, and the indegree of ν2, the original graph G, it follows that L(G/ν1ν2) respectively. The indegree of ν2 will be will be of a smaller . Specifically, decreased as the edges coming from ν1, it will be of dimension (n-1)×(n-1), as two expressed as a12, have been removed. vertices in G are being fused, producing just Further, the value in the 1,2 entry will clearly one vertex. become zero under

⎛ ( ) ( +−+−−+− aaaaadad ) ( +− aa )⎞ ⎜ 122211 2313 2414 L 21 nn ⎟ ⎜ ()+− aa 3231 d3 − a34 L − a3n ⎟ L G νν )/( = ⎜ ()+− aa − a d − a ⎟ 21 ⎜ 4241 43 4 4n ⎟ ⎜ M M MO ⎟ ⎜ ⎟ ⎝ ()+− aa nn 21 − an3 − an4 L dn ⎠

Comparing this matrix to the previous matrices. Once again, this previous one, we see that the 1,1 entry is coincides with the actual graph. The edges the summation of the entries from the 1,1, that a vertex νi sent or received from both ν1 1,2, 2,1, and 2,2 entries in L(G-ν1ν2). This and ν2 are joined to create a multiple edge is reasonable, as these vertices have been adjacent to the new matrix ν12. Thus, these fused the directed edges from ν1 to ν2 entries in the first row and column for the where removed. Further, for the rest of the matrix L(G/ν1ν2) are indeed a summation of first row and first column, we see each entry such edges. The rest of the matrix remains is the summation of the respective entries the same as in the previous two matrices. from the first two rows and columns frm the

10 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

Now that we have established the found upon removing the first row and first form of these matrices for each specific column of each respective Laplacian matrix, graph and how they are related, we now which will be referred to as k1. Specifically, must focus the cofactors of each. For each k1 can be found with the following: notational sake, we will consider the cofactor

⎛ −− aad − a ⎞ ⎜ 2 23 24 L 2n ⎟ ⎜ − 32 3 − ada 34 − a3n ⎟ L (1) ()Gk =det ⎜ −− daa − a ⎟ 1 ⎜ 42 43 4 4n ⎟ ⎜ MM MO ⎟ ⎜ ⎟ ⎝ n 2 n3 −−− n 4 L daaa n ⎠

⎛ −−− aaad − a ⎞ ⎜ 122 23 24 L 2n ⎟ ⎜ − 32 3 − ada 34 L − a3n ⎟ (2) ()Gk νν =− det ⎜ −− daa − a ⎟ 1 21 ⎜ 42 43 4 4n ⎟ ⎜ MM MO ⎟ ⎜ ⎟ ⎝ n2 n3 −−− n4 L daaa n ⎠ and

⎛ − ad − a ⎞ ⎜ 3 34 L 3n ⎟ (3). ⎜ − 43 da 4 − a4n ⎟ Gk νν =det/ 1 ()21 ⎜ ⎟ M MO ⎜ ⎟ ⎝ n3 −− n 4 L daa n ⎠

Upon computing these values, we from the last term needed in finding the immediately notice a relationship that can be determinant. Using the rules for finding used to simplify the notation. If we call the determinants from linear verifies entire right side of (3) D0, it can then this. Notice that the terms in brackets below become one of a series of such are exactly identical in the first two determinants in computing the first k1 value. equations. Following D0 are D1,…,Dn where n is taken

1 ()= dGk ⋅D02 −[(− a ⋅D123 )+(− a ⋅D224 )−K− (− a2 ⋅Dnn )]

1 ()()Gk ν ν 21 −=− ad ⋅D0122 −[(− a ⋅D123 )+(− a ⋅D224 )(−K a2 ⋅−− Dnn )]

(Gk /ν ν 211 ) = D0

If we use these values in their respective we find both sides to be equal. Thus, the spots from the initial equation and simplify, following equation will always be true:

11 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

. A similar argument shows that, for any i, 1 ()1 (ν ν 21 )+−= 112 (GkaGkGk /ν ν 21 )

. ,ii L G = det)(det ,ii L( −ν ν 21 )+ ,12 ii L(GaG /det ν ν 21 )

In establishing this equality, we have graph G, we were able to find the following completed our proof of the matrix tree equality knowing that the number of theorem. Since this proof was based on spanning trees and cofactors for two graphs induction on the number of edges m in the with less than m edges were equivalent.

. i ()ν ν 21 +− 12 i ()GtaGt ν ν 21 = det/ , ii L( −ν ν 21 )+ ,12 ii L GaG ν ν 21 )/(det

And hence, a directed graph representing fragments from the strand of DNA. An Eulerian circuit in this graph completes a full sequence that i ()Gt = det , ii L ()G can be drawn from this data. Finding the Therefore, the number of spanning trees for total number of Eulerian circuits in such a a directed Eulerian graph G is indeed graph is known as the Eulerian path equivalent to the cofactor of its Laplacian problem. matrix. ■ However, in cases of erroneous Having completed these two proofs data with repeats and fragments of varying and derived formulas from each, we can lengths, the typical Eulerian path problem now state the following: Given a de Bruijn becomes much more difficult. Fragments graph G generated from a set of DNA range from anywhere from 100 to 300 nucleotides in length, and attempting to fragments, the probability of correctly compose a de Bruijn graph with 20-tuples assembling the original strand is produces a graph with millions of vertices 1 and edges. Clearly a graph of this nature is . not easy to work with. n + i ()∏ j = 1 ()dGt ()ν j − !1 In two articles by Pevzner, Tang, and Waterman [1-2], the authors detailed a I. The Eulerian Superpath Problem computer package which handles graphs of this kind. This program, known as EULER, The Eulerian superpath problem takes erroneous and incomplete data, and (referred to here as the ESP) aims to do the simplifies it based on a series of cuts and following: Given an Eulerian graph G with a detachments on the graph. The random set of paths P in the graph, find an Eulerian DNA fragments of various lengths are represented as individual paths found circuit in G that contains all components of P throughout the graph. These as subpaths. transformations on graphs in EULER are the We noted earlier in this paper that components of what the authors define as the technique of using graph theory for DNA the Eulerian superpath problem. Note that fragment assembly is referred to be some the original Eulerian path problem is a case authors as the Eulerian path approach. of the ESP, in which each path is However, in the terminology we have represented by one edge. adopted, a path does not repeat edges or The approach to solving this vertices, so it cannot always complete an problem is based on maintaining Eulerian journey through a graph. Thus, for equivalence on the graph and system of the rest of this paper, this approach really G refers to Eulerian trails and circuits, as paths P from one transformation to the next. opposed to paths. More specifically, the resulting graph and Given error free data and perfect paths following the first transformation, pieces of DNA data, we are able to compose represented as (G1, P1), corresponds

12 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

a a b b c c

AT ATG TG TGT GT AT ATGT GT TG CTG TGC

CT GC CTG TGC GC CT

Figure 6. An x,y-detachment connecting the edges ATG and TGT.

equally with the original set (G, P). TG would safely be pulled out and the edges Therefore, once the final k-th transformation ATG and TGT would become a new, longer edge, representing the piece ATGT. Notice is completed, the resulting set (Gk, Pk) is a how the resulting graph has fewer edges, much neater representation of ( , ). It is G P and the degree of vertex TG has similarly with this final set that we have transformed been reduced. Further, now all paths which the ESP to just an Eulerian path problem, contained both edges ATG and TGT just and the Eulerian circuits that we find with contain edge ATGT, the paths which ended (Gk, Pk) match what would be found in (G, at ATG now do so at ATGT, and the paths P). The first transformation detailed by which began with TGT now do so at ATGT. Pevzner, Tang, and Waterman [1-2] is the The second detachment described x,y-detachment, and it can be performed in concerns a graph G which contains multiple the event that there are no multiple edges in edges, which is referred to as a x,y1- the given graph G. Please note that our detachment. Here, we are given a multiple notation differs from that of the respective edge x1x2, and the edges x2x3 and x2x4 references, as it maintains the notation we which are separately adjacent to and have been using throughout this paper. If following it. We know that there will be an we have two adjacent edges x1x2 and x2x3 in Eulerian circuit which visits the edge x1x2 on G, and three complete sets of paths from P, exactly two occasions, once moving to x2x3, one which contains both edges and once to x2x4. consecutively, one which ends with x1x2, and Much like the previous detachment, one which begins with x2x3, then this can the x,y1-detachment similarly depends on be sufficiently executed. Specifically, the complete sets of paths. The first set is all vertex connecting the two edges, x2, is the paths which contain one copy of the pulled away, bringing together a new edge, multiple edge x1x2 followed consecutively by x1x3. This transformation removes an edge x2x3. The second and third sets are those from the graph, and makes the vertex ν2 of which end at the multiple edge x1x2, and lesser degree. The new graph is still those which begin with the edge x2x3. equivalent to the original, but now Now this case becomes slightly computations are more simplified as a result more complicated, as it is dependent on of these transformations. whether or not there are any paths that Figure 6 details this x,y-detachment satisfy the second set. Specifically, if there using a portion of a simple de Bruijn graph are no paths which end on x1x2, then we can constructed with DNA data. In the event safely remove one copy of x1x2, and create a that we had three complete sets of various new edge x1x3, which would further reduce fragments, one set which contained the the degree of the vertex x2. However, if edges ATG and TGT consecutively (a), one there is a set of paths which end on x1x2, set which ended at ATG (b), and one set then this detachment cannot safely be which began with TGT (c), then the vertex made,

13 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

a GT c GT a c ATGT TGT AT ATG TG AT TG ATG TGC TGC GC GC

Figure 7. A successful x,y1-detachment at the multiple edge ATG.

because we cannot assume that the desired with DNA data. Recall the graph in Figure 8, Eulerian circuit will move to x2 x 3 or x 2 x 4 next. which represents a DNA de Bruijn graph that Once again, we will look at this has two possible Eulerian circuits (given the detachment using a portion of a simplified additional edge which connects the first and de Bruijn graph with DNA data in Figure 7. last vertices). If we are given this graph, Since there is a multiple edge ATG, we and a series of DNA fragments of varying know this segment must repeat at some size, we can use the different fragments to point in the entire sequence. Here, we have determine which Eulerian circuit represents a complete set of DNA fragments which the actual sequence. contain one copy of ATG followed Suppose we have the following consecutively by TGT (a), and such a set fragments: (1) ATGG, (2) TGGCG, (3) which begins at the TGT (c). For the sake of GGCGTGG, and (4) CGTGCA. With these this example we will say there are no such fragments, there are two critical single- fragments which stop at ATG, hence why edged detachments that we can make, and there is no depiction of the paths (b) in our still maintain equality with the original graph. figure. Now, since these conditions are met, This transformation on the graph is shown in we can create a new edge ATGT connecting Figure 8. the vertices AT and GT, leaving one copy of The first detachment takes place at ATG. Like the first detachment, this lessens the vertex TG. Fragment (1) travels from AT the number of edges in our graph and to GG through vertex TG, whereas reduces the degree of the vertex TG. fragments (3) and (4) travel through TG from Notice, once again, how all the paths in the GT and then move on to GC. Because of set which contained ATG and TGT now this, we can safely create a new edge contain ATGT, and such paths from the set between the vertices AT and GG, and not which began with TGT now begin with disrupt any of the paths in doing so. ATGT. The second critical detachment The discussion of the ESP takes place at the vertex GC. Here, both continues in the articles by Pevzner, Tang, fragment (2) and (3) pass through GC from and Waterman [1-2] with further cases of GG and then to CG. Further, fragment (4) the problem, including an analysis on passes through GC after from TG, and consistency between sets of paths. concludes at CA. Because of this, we are Further, the authors show another able to make another detachment which transformation on a graph, known as the x- does not affect any of the paths by creating cut, which allows us to completely remove a new edge connecting vertices TG and CA. an edge from certain sets of paths in our Recall that the edges from the graph. original graph represent fragments of length In the interest of understanding how 3. But with these two detachments, the new the ESP works with a , we edges formed, ATGG and TGCA, are of will demonstrate an example of detachments length 4. With these two transformations on on a complete de Bruijn graph constructed our graph, it is now clear that the fragments

14 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

GT CG

AT TG GC CA TGCA

ATGG GG

Figure 8. The graph which is produced after two detachments on the graph in Figure 3.

belong to the sequence ATGGCGTGCA, as graph has 126 repeats, of which further that is the only Eulerian circuit that remains. improvements can be implemented as It is also important to recognize that shown by Pevzner and Tang [17]. Here the the ESP has been found successful in authors introduce the concepts of dealing realistic contexts. Specifically, Pevzner, with “double-barreled data” and a new Tang, and Waterman [1-2] apply the algorithm, EULER-DB, which reduces this components from ESP and their EULER number of repeats significantly, also based program and put it towards a project on similar ideas. involving the sequencing of the Neisseria In more recent years, other authors meningitidis [NM] genome, referred to as the have continued to gain a better NM project. Throughout the course of these understanding of the ESP and how it can two articles and further in Pevzner and Tang further be used. Zheng, Shi, and Zhou [18] [17], the authors detail the impact that these used the concepts of the problem which concepts have when dealing with a bacterial enabled them to create an algorithm which genome that is rather difficult to reconstruct. would store all data found in a particular The NM genome, successfully DNA de Bruijn graph. This proves to be reconstructed without errors, is represented crucial because these graphs are very large, in a de Bruijn graph that has 4,039,248 and as a result are both time and memory vertices, where each vertex is of length 20. consuming. However, construction of a graph based on Additionally, as of last year, Zheng, the real provided DNA fragments originally Leong, and Tang [19] addressed the contains 9,474,411 vertices. Following the “spectral alignment problem” which was error-correction procedure on repetitive and similarly developed by Pevzner, Tang, and faulty data detailed by Pevzner, Tang, and Waterman [1-2] along with EULER and the Waterman [1-2], this number is reduced ESP. Here, the authors propose an significantly to 4,081,857 vertices. Thus, algorithm regarding coverage at the position there are still more than forty thousand on various fragments of DNA, which they vertices that separate the graph with refer to as the “coverage sensitive experimental data from the graph alignment” algorithm. representing the actual genome. The Eulerian superpath approach Using the concepts behind the ESP, on DNA fragment assembly doesn’t the experimental graph of the NM genome necessarily eliminate all the discrepancies is able to be simplified. Specifically, 18,026 about the original construction of the of the 18,962 edges of length 21 in this genome, but just makes it easier to work graph are considered resolvable based on with. With this approach and the EULER the detachments and decompositions as package, scientists and researchers are shown in the paper. The resulting de Bruijn able to consider large groups of edges,

15 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

vertices, and paths as a significantly smaller successfully with long and erroneous number of elements, which enables easier strands of DNA. They addressed the computation of the Eulerian circuits in the problem of complements and orientation by graphs. considering all four variations of a substring as four different paths, all of which being contained in the graph. II. DNA Sequencing with Nanopores Specifically, their de Bruijn DNA graph G must contain the paths P1,…,P4 In more recent years, concepts in such that the union of all four paths is graph theory have been used to aid other equivalent to G, denoted P1UP2UP3UP4 = G. problems which occur in the field of Further, no two paths have any equivalence bioinformatics. One of which can be found in their intersections, as ∩ = Ø for all i, j, in Bokhari and Sauer [3]. Here, the authors Pi Pj where i ≠ j. This property was referred to by use de Bruijn graphs representing DNA data Bokhari and Sauer [3] as E4, and the and the Eulerian path approach upon 4 determining an algorithm which aids the algorithm contains a function E (G) that is process of DNA sequencing with nanopores. agreed upon if the DNA graph G maintains Nanopore devices are used in DNA these identities. Testing these paths with a sequencing with the hopes of enabling large permutation further determines whether or and complete strands of DNA to be read not that are, in fact, complements and completely. These nanopores are a few reversals of each other. If this holds true, in diameter, and single stranded DNA then the strand being examined comes from passes through them by optical or electrical an authentic sequence of DNA data. means. Through this passage, each In the cases of repetitive data, it individual base of the strand is identified, must first be determined whether or not the producing a seemingly endless chain of A, longest repeat in a set of DNA data is longer T, G, and C. than the available substrings. If this is the Bokhari and Sauer [3] identify and case, then the algorithm is unable to address the problems that occur in this form produce a graph matching the E4 identity. of sequencing, which are much similarly to Also, in the case of the Eulerian path those that we saw previous. The first issue approach, repeats are defined to be lies on the fact that an entire single strand sequences that match each other exactly. rarely makes it through the nanopore in one Thus, two long substrings which are 99% piece. Rather, substrings of lengths around identical are not considered to be repeats 100,000 bases in length are produced. upon constructing the de Bruijn graph, and Further, as discussed earlier, the problem of the Eulerian path approach still functions the coinciding Watson-Crick complement properly with the provided data. causes confusion regarding whether or not This graph decomposition algorithm the original strand or its coinciding was tested by Bokhari and Sauer [3] on the complement is being passed through the Mycoplasma pneumoniae genome, which is nanopore. Similarly, there another issue 816,394 bases in length. Upon sequencing involving the specified orientation of single with a nanopore device, this algorithm was strand as it passes through. Upon passage able to produce the full sequence of this and decomposition into substrings, the bacterium in less than an hour. direction (3’—5’ or 5’—3’) of the DNA is lost. The final problem is that of erroneous data. III. Conclusions Not only may the DNA sequence itself be riddled with insertions, deletions, mutations, Various concepts from graph theory and repeats, but the nanopores themselves are used for applications in many different may also produce errors amidst the passing fields of study. In this paper, we of a strand. demonstrated how de Bruijn graphs and In formulating an algorithm to Eulerian circuits are used to resolve some eliminate the discrepancies that coincide problems in DNA sequencing and fragment with these issues, the authors used the assembly. Further, it was the formula Eulerian path approach and provided new derived from both the BEST and matrix tree variations which allowed it to still be used theorems that allowed us to arrive at a

16 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

constant formula which can be used to count was also made possible through the the number of Eulerian circuits in a de Bruijn National Security Agency. graph which has been constructed from DNA data. Such circuits dictate the REFERENCES probability of constructing the desired original strand of DNA based on a collection 1. Pevzner, P.A.; Tang, H.; Waterman, of fragments from that strand. M.S. “An Eulerian Path Approach to The progress being made with the DNA Fragment Assembly,” Eulerian superpath problem and EULER Proceedings of the National Academy of package shows that scientists and Sciences. 98 (2001), no. 17, 9748- researchers are continuing to find ways to 9753. resolve problems that arise in fragment 2. Pevzner, P.A.; Tang, H.; Waterman, assembly, especially those dealing with M.S. “A New Approach to Fragment erroneous data sets. Hopefully, these Assembly in DNA Sequencing,” transformations on directed Eulerian graphs RECOMB 01. (2001), 256-267. represent the first of many steps which can 3. Bokhari, S.H.; Sauer, J.R. “A Parallel be taken to simplify the extraordinarily Graph Decomposition Algorithm for complex world of DNA sequencing, as more DNA Sequencing with Nanopores,” of the like will come with time. Further, it is Bioinformatics. 21 (2005), no. 7, 889- imperative to recognize that the field of 896. bioinformatics is changing on a daily basis, 4. Tucker, A. Applied . 5th as seen in the previous section. Much of Edition. (John Wiley & Sons, Inc., this development is triggered with new Danvers, Massachusetts, 2007). advancements in technology and 5. West, D. Introduction to Graph Theory. intelligence. In addition to the sources 2nd Edition. (Prentice Hall, Upper discussed in this paper, the reader is Saddle River, New Jersey, 2001). encouraged to look at other aspects of 6. de Bruijn, N.G. “A Combinatorial bioinformatics in which this approach has Problem,” Koninklijke Nederlandse been involved. This includes, but is certainly Akademie V. Wetenschappen. 49 not limited to, the following articles: (1946), 758-764. Alekseyev and Pevzner [20], Myers [21], 7. Jones, N.C.; Pevzner, P.A. An Intro- and Raphael, Zhi, Tang, and Pevzner [22]. duction to Bioinformatics . Progress that has been achieved in (The MIT Press, Cambridge, this field through the use of graph theoretical Massachusetts, 2004). concepts simply opens the door for more 8. Watson, J.D.; Crick, F.H.C. “Molecular opportunities. Ultimately, problems with Structure of Nucleic Acids,” American seemingly never-ending lines of partially Journal of Psychiatry. 160 (2003), corrupted DNA data can be simplified into 623-624 various cases of a graph and paths 9. Pevzner, P.A. Computational Molecular contained within that graph. It is with these : An Algorithmic Approach. (MIT tools that the more complicated cases that Press, Cambridge, Massachusetts, may arrive can be approached with the 2000). hopes of more clarity and success. 10. Idury, R.M.; Waterman, M.S. “A New Algorithm for DNA Sequence ACKNOWLEDGEMENTS Assembly,” Journal of Computation Biology. 2 (1995), no. 2, 291-306. Support made possible by the Vermont 11. Pevzner, P.A. “DNA Physical Mapping Genetics Network through Grant Number and Alternating Eulerian Cycles in P20 RR16462 from the INBRE Program of Colored Graphs,” Algorithmica. 13 the National Center for Research Resources (1995), no. 1-2, 77-105. (NCRR), a component of the National 12. van Aardenne Ehrenfest, T.; de Bruijn, Institutes of Health (NIH) is acknowledged. N.G. "Circuits and Trees in Oriented The contents of this research report are Linear Graphs,” Simon Stevin. 28 solely the responsibility of the author and do (1951). 203-217. not necessarily represent the official views of NCRR or NIH. Support for this research

17 AMERICAN JOURNAL OF UNDERGRADUATE RESEARCH VOL. 7, NO. 1

13. Tutte, W.T.; Smith, C.A.B. “On Unicursal Journal of Bioinformatics. 5 (2004), 91- Paths in a Network of Degree 4,” Amer. 101. Math. Monthly. 48 (1941). 233-237. 19. Zheng, J.; Leong, H.W.; Tang, H. “An 14. Bollobás, B. Graduate Texts in Improved Algorithm for Error Correction Mathematics: Modern Graph Theory. of Reads in DNA Fragment Assembly,” (Springer-Verlag, New York, NY, 1998). RECOMB 07. 2007. 15. Fleischner, H. “Eulerian Graphs and 20. Alekseyev, M.A; Pevzner, P.A. “Colored Related Topics,” Part 1, Volume 1. de Bruijn graphs and the Genome Annals of Discrete Mathematcs. 45 Halving Problem,” IEEE/ACM (1990). Transactions on 16. Fleischner, H. “Eulerian Graphs and and Bioinformatics. 4 (2007), no. 1, 98- Related Topics” Part 1, Volume 2. 107. Annals of Discrete Mathematcs. 50 21. Myers, E.W. “The Fragment Assembly (1990). ,” Bioinformatics 2005. 21 17. Pevzner, P.A.; Tang, H. "Fragment (2005), Supplement 2, ii79-ii85. Assembly with Double-barreled Data,” 22. Raphael, B.; Zhi, D.; Tang, H.; Pevzner, Bioinformatics. 17 (2001), S225-S233. P.A. “A Novel Method for Multiple 18. Zheng, W.; Shi W.; Zhou W. “A Parallel Alignment of Sequences with Repeated DNA Fragment Assembly Algorithm and Shuffled Elements,” Genome Based on Eulerian Superpath,” Online Research. 14 (2004), 2336-2346.

St. Michael’s College Colchester, Vermont USA www.smcvt.edu/

It is the mission of Saint Michael’s College to contribute through higher education to the enhancement of the human person and to the advancement of human culture in the light of the Catholic faith.

18