Efficient Approximation of DNA Hybridisation Using Deep Learning


David Buterez
Department of Computer Science and Technology
University of Cambridge, Cambridge, UK

Abstract. Deoxyribonucleic acid (DNA) has shown great promise in enabling computational applications, most notably in the fields of DNA data storage and DNA computing. The former exploits the natural properties of DNA, such as high storage density and longevity, for the archival of digital information, while the latter aims to use the interactivity of DNA to encode computations. Recently, the two paradigms were jointly used to formulate the near-data processing concept for DNA databases, where the computations are performed directly on the stored data. The fundamental, low-level operation that DNA naturally possesses is that of hybridisation, also called annealing, of complementary sequences. Information is encoded as DNA strands, which will naturally bind in solution, thus enabling search and pattern-matching capabilities. Being able to control and predict the process of hybridisation is crucial for the ambitious future of the so-called Hybrid Molecular-Electronic Computing. Current tools are, however, limited in terms of throughput and applicability to large-scale problems. In this work, we present the first comprehensive study of machine learning methods applied to the task of predicting DNA hybridisation. For this purpose, we introduce a synthetic hybridisation dataset of over 2.5 million data points, enabling the use of a wide range of machine learning algorithms, including the latest in deep learning. Depending on the hardware, the proposed models provide a reduction in inference time ranging from one to over two orders of magnitude compared to the state-of-the-art, while retaining high fidelity. We then discuss the integration of our methods in modern, scalable workflows.
The implementation is available at: https://github.com/davidbuterez/dna-hyb-deep-learning

Keywords: DNA hybridisation · Annealing · DNA storage · DNA computing · Deep learning

arXiv:2102.10131v1 [cs.LG] 19 Feb 2021

Author note: Work conceptualised and partially executed while at Imperial College London.

Introduction and motivation

In this introductory section we place our work in context, by discussing the innovations and challenges brought by digital data processing in DNA. We emphasise the importance of DNA hybridisation, the mechanism enabling these technologies, and how this study delivers scalable, future-ready workflows.

DNA storage

The use of DNA to facilitate computation is an active area of research, dating back to 1994 when Leonard Adleman solved a seven-node instance of the Hamiltonian path problem, using the toolbox of (DNA) molecular biology [1]. The algorithm uses DNA sequences to represent nodes and edges in the graph and relies on hybridisation to detect paths which visit every node exactly once. The resulting DNA sequences (the algorithm output) can be read using polymerase chain reaction (PCR). This pioneering work opened the gate to many interesting questions: can molecular machines be used to solve intractable problems? Is DNA suitable for long-term storage of digital information? More recently, are such methods scalable in the era of Big Data?

The focus has gradually changed from solving difficult computational problems to exploiting other desirable properties of DNA, leading to the development of DNA digital data storage. The field has seen rapid advances in recent years [22], [7], [29], [39], [10], [27], [5], [11], [30] and industry-ready solutions are currently in development. It is now generally accepted that the amount of digital data is doubling at least every two years. Predictions from Seagate [35] estimate that this quantity, called the Global Datasphere, will grow from 33 ZB (Zettabytes) in 2018 to 175 ZB by 2025.
Furthermore, the dominant storage media are “traditional”, with 59% of the storage capacity expected to come from hard disk drives and 26% from flash technologies. Synthetic DNA has been argued to be an attractive storage medium, for at least three prominent reasons [9]:

1. Density – The theoretical maximum information density of DNA is 10^18 B/mm^3. Comparatively, this is an increase of seven orders of magnitude over tape storage (Figure 1a).
2. Durability – Depending on storage conditions, DNA can be preserved for at least a few hundred years. Fossil studies reveal that DNA has a half-life of 521 years [3]. In appropriate conditions, researchers estimate that digital information can be recovered from DNA stored at the Global Seed Vault (at -18°C) after over 1 million years [22].
3. Future-proofing – Next-generation sequencing and technologies such as Oxford Nanopore have made reading and writing DNA more accessible than ever. There has been exponential progress in manipulating DNA, surpassing even Moore’s Law (Figure 1b). Furthermore, since DNA is the fundamental building block of life, it will be relevant for as long as humans exist.

Fig. 1: DNA storage compared to traditional systems. (a) Density and lifetime comparison with mainstream storage media [9]. (b) Reading and writing DNA against evolution of transistors [8].

Near-data processing

Clearly, DNA storage and computing have the potential to bring breakthroughs in the way we process and store data. The objective of this work is to develop methods that make near-data processing applications viable at a large scale. The near-data processing philosophy is to bring the computing substrate closer to the storage substrate, concretely by encoding meaningful interactions using DNA hybridisation.

Database query operations on DNA

In the OligoArchive project [4], the authors present a processing technique capable of performing SQL operations such as selection, projection and join directly on the DNA molecules.
More specifically, arbitrary tables can be encoded in DNA by designing a strand that represents the attribute value prepended by hashed values of the table, attribute name and primary key, supplemented by error correcting codes. The selection database operation can then be implemented by PCR, as it is sensitive enough to allow single sequences of DNA to be selected in a background of billions of irrelevant molecules. This enables searching by attribute value. More interestingly, the join operator makes direct use of hybridisation. Intuitively, if the attributes of two records match, then a duplex product will form. This implements the equi-join operation. The encoding of the attributes should be orthogonal for different values, to avoid unintended operations. However, the case of an imperfect match is still interesting for fuzzy matching and similarity search. The scale of the experiments in [4] is small: a PostgreSQL database of 12 KB was successfully encoded and read back. Future projects will clearly benefit from a larger, high-quality sequence design tool.

DNA encodings for similarity search

The objective of similarity search is to retrieve documents from a database that are similar in terms of content to a search query. The state-of-the-art solution is to encode the items into feature vectors that can be compared with metrics like Euclidean distance (similar documents are close in the feature space). With rich data (e.g. images) the feature vectors tend to have hundreds of dimensions, hindering even the fastest algorithms, in the so-called “curse of dimensionality”. The near-data processing paradigm can be applied to the similarity search problem: the database stores information in DNA molecules and the similarity search is performed by means of DNA hybridisation. The work of Stewart et al. [38] proposes a novel encoding for information.
Each database element is associated with a unique ID and a feature vector. The database itself does not store the data, but the ID can be used to retrieve the information from another location. The ID and feature sequences are placed on the same DNA strand. In Figure 2, FP, IP and RP are sequencing primers (forward, internal and reverse, respectively), coloured in blue. d(T) denotes the ID sequence, and f(T) is the feature sequence, coloured in orange. Similarly, the features of the query strand are encoded in the f(Q) region. This is designed to match (imperfectly) with f(T), so its reverse complement, denoted by f(Q)∗, is used. To help the (unstable) hybridisation, a reverse complement of the first six bases of the reverse primer RP is added to the query strand: RP[:6]∗. With this design, similar targets and queries will bind together. They can then be amplified by PCR (using FP and RP) and sequenced.

Fig. 2: Strand designs in Stewart et al. [38]

The other novel contribution of [38] is training a neural network that learns to associate a query with a target if and only if the query and the target represent similar images. Note that the network is trained to find appropriate encodings for pairs of images based on their semantic features, i.e. it does not predict hybridisation probabilities. Rather, it uses an approximation for thermodynamic analysis that is differentiable (necessary for backpropagation) in the form of a modified sigmoid function. Unfortunately, the estimation is far from perfect. The authors remark that an important future direction is a more accurate approximation for thermodynamic yield.

To summarise, we identify two applications for such thermodynamic prediction techniques:

• Orthogonal datasets – Traditionally, orthogonal DNA sequences have been used to build DNA barcodes, unique sequences that are used for identification. More recently, orthogonal sequences are required to enable random-access in DNA databases through PCR. More specifically, each piece of information stored in DNA molecules requires a short, unique identification sequence. Uniqueness is crucial: one of the goals is to avoid cross-talk. Naturally, larger databases require a higher number of orthogonal sequences.
• Similar datasets – The concept of near-data processing introduced the need for similar sets of sequences. The idea is to store closely-related entities (by some distance metric) in similar DNA molecules, which are part of the database.
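
The target and query strand layout described above for Stewart et al. [38] can be made concrete with a small sketch. The Python below is illustrative only and is not taken from [38] or from the accompanying repository: the function names, all example sequences, and the ordering of the two regions on the query strand (RP[:6]∗ followed by f(Q)∗, obtained by reverse-complementing f(Q) joined with RP[:6]) are assumptions made for the example.

```python
# Illustrative sketch only: every sequence below is made up, and the region
# ordering on the query strand is an assumption, not taken from [38].

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def build_target(fp: str, id_seq: str, ip: str, feature: str, rp: str) -> str:
    """Concatenate FP, d(T), IP, f(T) and RP into one target strand."""
    return fp + id_seq + ip + feature + rp


def build_query(query_feature: str, rp: str) -> str:
    """Reverse-complement f(Q) + RP[:6], giving RP[:6]* followed by f(Q)*."""
    return reverse_complement(query_feature + rp[:6])


def count_mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    rp = "GGATCCAA"     # hypothetical reverse primer RP
    f_t = "ACGTACGTAC"  # hypothetical target feature region f(T)
    f_q = "ACGTACGAAC"  # hypothetical query feature region f(Q)

    query = build_query(f_q, rp)
    print("query strand:", query)
    print("feature mismatches:", count_mismatches(f_t, f_q))
```

The mismatch count here is only a crude stand-in for similarity between f(T) and f(Q). It says nothing about thermodynamic yield, which is the quantity that the prediction techniques discussed in this work aim to approximate.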
