Project Notes: Project Title: Speeding up Genome Sequence Alignment Using Volunteer Computing Name: Jay Nagpaul

Note Well: There are NO SHORTCUTS to reading journal articles and taking notes from them. Comprehension is paramount. You will most likely need to read it several times, so set aside enough time in your schedule.

Contents:
Important Notes
Knowledge Gaps
Literature Search Parameters
Article #0 Notes: Title
Article #1 Notes: A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language
Article #2 Notes: Efficient genomic read alignment in an in-memory database
Article #3 Notes: Faster and More Accurate Sequence Alignment with SNAP
Article #4 Notes: The Cost of Sequencing a Human Genome
Article #5 Notes: Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures
Article #6 Notes: System and method for grid and cloud computing
Article #7 Notes: Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity
Article #8 Notes: bíogo: a simple high-performance bioinformatics toolkit for the Go language
Article #9 Notes: Best Practices for Scientific Computing
Article #10 Notes: Biopython: freely available Python tools for computational molecular biology and bioinformatics
Article #11 Notes: BioJava: an open-source framework for bioinformatics
Article #12 Notes: On the performance and design of BioSequences compared to the Seq language
Article #13 Notes: Title

Important Notes:

● Rust would be a great candidate for the project (Article: 7; Date resolved: 10/8/2020)
  ○ Memory safe
  ○ Fast
  ○ Concurrent

Knowledge Gaps:

This list provides a brief overview of the major knowledge gaps for this project, how they were resolved, and where to find the information.
Table columns: Knowledge Gap; Resolved By; Information is located; Date resolved (no entries recorded)

Literature Search Parameters:

These searches were performed between (Start Date of reading) and XX/XX/2019.
List of keywords and databases used during this project. Each entry gives the database/search engine followed by the keywords; the "Summary of search" column is blank.

Google Patents: bowtie2
Arxiv: Faster and More Accurate Sequence Alignment with SNAP
Google scholar: Genome sequencing
Google scholar: bowtie
Google scholar: biogo

Article #0 Notes: Title

Article notes should be on separate sheets
KEEP THIS BLANK AND USE AS A TEMPLATE

Source Title:
Source citation (APA Format):
Original URL:
Source type:
Keywords:
Summary of key points (include methodology):
Research Question/Problem/Need:
Important Figures:
Notes:
Cited references to follow up on:
Follow up Questions:

Article #1 Notes: A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language

Source Title: A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language
Source citation (APA Format): Zargarbashi, S. S. H., & Babaali, B. (2019). A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language. https://arxiv.org/abs/1910.00330v1
Original URL: https://arxiv.org/abs/1910.00330v1
Source type: ArXiv paper
Keywords: Alzheimer’s, Machine Learning, Computer Science, Language

Summary of key points (include methodology):
- Early diagnosis of Alzheimer’s is crucial
- Current tests are costly, slow, and time-consuming
- Enter: machine learning
- Algorithm which analyzes the spoken language of the patient
  - Looking for semantic errors, irregular tone/acoustics, and other irregularities
- 3 models were tested:
  - I-vector & x-vector
    - Operated on the sound itself
    - No semantic understanding
  - N-gram
    - Operates on word semantics and ordering
- Combination of the 3 models diagnosed with 83.6% accuracy

Research Question/Problem/Need: How can we diagnose Alzheimer’s earlier for more people?

Important Figures: N/A

Notes:
- Specific details on the algo were hard to grasp. <- possible knowledge gap?
- Article results were quite light. <- Inconclusive research?
- Takes up low resources
- Able to be implemented in a variety of languages
- This is especially promising for my initial idea
- This model isn’t specific to vocal analysis
- The actual details of the model can be incorporated in other data

Cited references to follow up on:
P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–479, 1992. <- Possible explanation for the algo details

Follow up Questions:
What are the specific details of the algorithm?
How effective is this today?
What is the false-positive percentage? Answered in results: 83.6% success rate
How general purpose are the vector models?
Are these vocal samples public?
Why were random noise samples added to the data?
Is the model overfitted? 83.6% seems like an unrealistic level of accuracy.

Article #2 Notes: Efficient genomic read alignment in an in-memory database

Source Title: Efficient genomic read alignment in an in-memory database
Source citation (APA Format): Plattner, H., Schapranow, M., & Ziegler, E. (2014). European Patent No. EP2759952A1. Munich, Germany: European Patent Office.
Original URL: https://patents.google.com/patent/EP2759952A1/en
Source type: Patent
Keywords: Search term: Bowtie2, bioinformatics, possible project

Summary of key points (include methodology):
● Software for genomic alignment
  ○ Faster than existing solutions
● Sequencing becoming more ubiquitous in modern research
● Patented approach is faster and cheaper
● Processes in parallel!
  ○ Validation for my idea
● Goes on to describe details about the design itself
  ○ Not necessary for my project specifically
● Chunks data
  ○ Chunks are processed concurrently
● Uses a query system
  ○ I.e., the new development strategy of Rust’s RLS
  ○ Based on in-memory database infrastructure (IMDB)
● Distributes using multiple computer cores
  ○ And possibly multiple computers

Research Question/Problem/Need: Genome alignment is an expensive and slow process. Can we develop a better alternative?

Important Figures:

Notes:
● Super pleased about this patent
  ○ Verifies my assumption that parallelization would lead to beneficial improvements in genome alignments
● Paper also mentioned Bowtie2 as a possible tool which could work with their software
● Truly no unique ideas anymore
  ○ Will have to consult with Sontheimer for other ideas
  ○ Although this one isn’t open source

Cited references to follow up on:
LANGMEAD, B.; SALZBERG, S.L.: "Fast gapped-read alignment with Bowtie 2", NATURE METHODS, vol. 9, no. 4, April 2012 (2012-04-01), pages 357-9, XP002715401, DOI: 10.1038/nmeth.1923
STEVE HOFFMANN ET AL: "Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures", PLOS COMPUTATIONAL BIOLOGY, vol. 5, no. 9, 11 September 2009 (2009-09-11), pages e1000502, XP055070487, DOI: 10.1371/journal.pcbi.1000502
ZAHARIA, M.; BOLOSKY, W.J.; CURTIS, K.; FOX, A.; PATTERSON, D.; SHENKER, S.; STOICA, I.; KARP, R.M.; SITTLER, T., FASTER AND MORE ACCURATE SEQUENCE ALIGNMENT WITH SNAP, November 2011 (2011-11-01)

Follow up Questions:
● Parallelization is a technique I’m familiar with when applied to projects such as these
  ○ This is the first time I’m hearing about IMDB
  ○ Is it a common use case?
    ■ (look into badger - GoLang)
    ■ ANSWERED https://docs.oracle.com/en/database/oracle/oracle-database/19/vldbg/inmemory-parallel-exec.html
      ● Yes, it’s an underlying model behind many RDBMSs today
● How does this system differ from Bowtie’s?
● The patent repeatedly alludes to a new algorithm: their secret sauce
  ○ Claims it takes advantage of shortcuts, but is simple
  ○ Would a project I develop be focused more on new algorithmic design rather than utilizing modern techniques?
    ■ Similar but not quite the same
      ● i.e., rewriting algorithms vs. implementing a PVC network

Article #3 Notes: Faster and More Accurate Sequence Alignment with SNAP

Source Title: Faster and More Accurate Sequence Alignment with SNAP
Source citation (APA Format): Zaharia, M., Bolosky, W. J., Curtis, K., Fox, A., Patterson, D., Shenker, S., Stoica, I., Karp, R. M., & Sittler, T. (2011). Faster and more accurate sequence alignment with SNAP. ArXiv:1111.5572 [Cs, q-Bio]. http://arxiv.org/abs/1111.5572
Original URL: https://arxiv.org/abs/1111.5572
Source type: ArXiv Paper
Keywords: Bowtie2, Sequence alignment, optimization

Summary of key points (include methodology):
● New sequence aligner (alternative to Bowtie2)
● Uses new algorithm
  ○ Not Burrows-Wheeler based (B2’s alg)
● 10-100x speed up
  ○ Cheaper to run
    ■ $2 AWS unit
● Tested on "m2.4xlarge" Amazon EC2 machine with 68 GB RAM.
● Optimized for more error-prone data
  ○ A notable area which is weak in existing tools

Research Question/Problem/Need: How can we speed up nucleotide alignment while preserving a high accuracy percentage?

Important Figures: Can see the sheer magnitude of improvement

Notes:
● GitHub: https://github.com/amplab/snap
  ○ Hm, seems less popular than Bowtie2
    ■ Odd considering its enormous speed improvements
● Optimized for both short and long read alignments
● Results necessarily show an improvement in accuracy
  ○ BWA: 93.0% reads, 0.05% error, 662 reads/s
  ○ SNAP: 94.1% reads, 0.05% error, 34100 reads/s
  ○ Will this hold in all cases?
● 40 GitHub issues
  ○ Large number of duplicate bug reports
  ○ Paper is light on flaws of SNAP?
● Last paragraph agrees with my suspicions of how to approach the project
  ○ Reconsidering the algorithms at the core of such projects can lead to interesting speed-ups and benefits

Cited references to follow up on:
R. Li, C. Yu, Y. Li, T.-W. W. Lam, S.-M. M. Yiu, K. Kristiansen, and J. Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, Aug. 2009.
Z. Ning, A. J. Cox, and J. C. Mullikin. SSAHA: A fast search method for large DNA databases. Genome Research, 11(10):1725–1729, 2001.
NHGRI data on DNA Sequencing Costs. http://www.genome.gov/sequencingcosts/.

Follow up Questions:
● Why isn’t SNAP more popular if it’s such a considerable upgrade over Bowtie2?
  ○ ANSWERED
    ■ GitHub reveals a shocking number of bugs
    ■ 40 open issues vs. Bowtie2’s 7.
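
To put the throughput figures recorded in the Article #3 notes in perspective, the following is a minimal Python sketch that computes the speed-up implied by the two reads/s values. The function name `speedup` and the constant names are illustrative assumptions, not anything from the SNAP paper; the only inputs are the two numbers written down in the notes (662 reads/s for BWA, 34100 reads/s for SNAP).

```python
def speedup(baseline_reads_per_s: float, candidate_reads_per_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_reads_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_reads_per_s / baseline_reads_per_s


# Throughput values exactly as recorded in the notes (hypothetical constant names).
BWA_READS_PER_S = 662
SNAP_READS_PER_S = 34100

ratio = speedup(BWA_READS_PER_S, SNAP_READS_PER_S)
print(f"SNAP vs BWA: about {ratio:.1f}x faster")
```

For these two numbers the ratio works out to a little over 51x, which falls inside the 10-100x range noted in the summary. It says nothing about whether the result holds for other datasets or machines.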
