Hitseq 2017 Proceedings
Total Page:16
File Type:pdf, Size:1020Kb
HiTSeq 2017 Proceedings Prague, Czech Republic July 24-25, 2017 http://www.hitseq.org Organizers: Can Alkan Bilkent University, Bilkent, Ankara, Turkey E-mail: [email protected] Valentina Boeva Institut Cochin/INSERM/CNRS, Paris, France E-mail: [email protected] Ana Conesa University of Florida, Gainesville, Florida, USA E-mail: [email protected] Francisco M. De La Vega, D.Sc. Stanford University, and TOMA Biosciences, USA. E-mail: [email protected] Dirk Evers Molecular Health GmbH, Heidelberg, Germany E-mail: [email protected] Kjong Lehmann Memorial Sloan-Kettering Cancer Center. New York, NY, USA E-mail: [email protected] Gunnar R¨atsch Memorial Sloan-Kettering Cancer Center. New York, NY, USA E-mail: [email protected] Chromatin Accessibility Prediction via Convolutional Long Short-Term Memory Networks with k-mer Embedding Keywords: chromatin accessibility, k-mer embedding, long short-term memory, deep learning Abstract: Motivation: Experimental techniques for measuring chromatin accessibility are ex- pensive and time consuming, appealing for the development of computational methods to precisely predict open chromatin regions from DNA sequences. Along this direction, existing computational methods fall into two classes: one based on handcrafted k-mer features and the other based on convolutional neural networks. Although both categories have shown good performance in spe- cific applications thus far, there still lacks a comprehensive framework to integrate useful k-mer co-occurrence information with recent advances in deep learning. Method and results: We fill this gap by addressing the problem of chromatin accessibility pre- diction with a convolutional Long Short-Term Memory (LSTM) network with k-mer embedding. We first split DNA sequences into k-mers and pre-train k-mer embedding vectors based on the co-occurrence matrix of k-mers by using an unsupervised representation learning approach. We then construct a supervised deep learning architecture comprised of an embedding layer, three convolutional layers and a Bidirectional LSTM (BLSTM) layer for feature learning and classifica- tion. We demonstrate that our method gains high-quality fixed-length features from variable-length sequences and consistently outperforms baseline methods. We show that k-mer embedding can ef- fectively enhance model performance by exploring different embedding strategies. We also prove the efficacy of both the convolution and the BLSTM layers by comparing two variations of the network architecture. We confirm the robustness of our model to hyper-parameters by performing sensitivity analysis. We hope our method can eventually reinforce our understanding of employing deep learning in genomic studies and shed light on research regarding mechanisms of chromatin accessibility. Availability and implementation: The source code can be downloaded from https://github.com/ minxueric/ismb2017 lstm. Authors: first name last name email country organization corresponding? Xu Min [email protected] China Tsinghua University Wanwen Zeng [email protected] China Tsinghua University Ning Chen [email protected] China Tsinghua University Ting Chen [email protected] China Tsinghua University X Rui Jiang [email protected] China Tsinghua University X Discovery and genotyping of novel sequence insertions in many sequenced indi- viduals Keywords: high throughput sequencing, novel sequence insertions, structural variation Abstract: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computa- tional difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Addition- ally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. There are only a handful of algo- rithms that are specifically developed for novel sequence insertion discovery that can bypass the need for the whole genome de novo assembly. Still, most such algorithms rely on high depth of coverage, and to our knowledge there is only one method (PopIns) that can use multi-sampledata to "collectively" obtain very high coverage dataset to accurately find insertions common in a given population. Here we present Pamir, a new algorithm to efficiently and accurately discover and geno- type novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous vs. homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. Availability. Pamir is available at https://github.com/vpc-ccg/pamir *Contact. [email protected], [email protected] Authors: first name last name email country organization corresponding? Pınar Kavak [email protected] Turkey Department of Computer Engineering, Bo˘gazi¸ciUni- versity, Istanbul,_ Turkey Yen-Yi Lin [email protected] Canada School of Computing Science, Simon Fraser Univer- sity, Burnaby, BC Ibrahim Numanagi´c [email protected] Canada School of Computing Science, Simon Fraser Univer- sity, Burnaby, BC Hossein Asghari [email protected] Canada School of Computing Science, Simon Fraser Univer- sity, Burnaby, BC Tunga G¨ung¨or [email protected] Turkey Department of Computer Engineering, Bo˘gazi¸ciUni- versity, Istanbul,_ Turkey Can Alkan [email protected] Turkey Department of Computer Engineering, Bilkent Uni- X versity, Ankara, Turkey Faraz Hach [email protected] Canada School of Computing Science, Simon Fraser Univer- X sity, Burnaby, BC HopLand: Single-cell pseudotime recovery using continuous Hopfield network based modeling of Waddington's epigenetic landscape Keywords: single-cell, RNA-seq, Waddington's epigenetic landscape, pseudotime, Hopfield net- work, embryonic development Abstract: Motivation: The interpretation of transcriptome dynamics in single-cell data, espe- cially pseudotime estimation, could help understand the transition of gene expression profiles. The recovery of pseudotime increases the temporal resolution of single-cell transcriptional data, but is challenging due to the high variability in gene expression between individual cells. Here, we intro- duce HopLand, a pseudotime recovery method using continuous Hopfield network to map cells in a Waddington's epigenetic landscape. It reveals from the single-cell data the combinatorial regulatory interactions of genes that control the dynamic progression through successive cellular states. Results: We applied HopLand to different types of single-cell transcriptome data. It achieved high accuracies of pseudotime prediction compared to existing methods. Moreover, a kinetic model can be extracted from each dataset. Through the analysis of such a model, we identified key genes and regulatory interactions driving the transition of cell states. Therefore, our method has the potential to generate fundamental insights into cell fate regulation. Availability and implementation: The Matlab implementation of HopLand is available at https://github.com/NetLand-NTU/HopLand. Availability: http://www.ntu.edu.sg/home/zhengjie/ Authors: first name last name email country organization corresponding? Jing Guo [email protected] Singapore Nanyang Technological University Jie Zheng [email protected] Singapore Nanyang Technological University X Improving the performance of minimizers and winnowing schemes Keywords: minimizers, winnowing, k-mer Abstract: Motivation: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a se- quence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g., too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues. Results: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence. Availability: