Project Notes
Project Title: Speeding up genome using volunteer computing
Name: Jay Nagpaul

Note Well: There are NO SHORT-cuts to reading journal articles and taking notes from them. Comprehension is paramount. You will most likely need to read each article several times, so set aside enough time in your schedule.

Contents:

Important Notes

Knowledge Gaps

Literature Search Parameters

Article #0 Notes: Title

Article #1 Notes: A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language

Article #2 Notes: Efficient genomic read alignment in an in-memory database

Article #3 Notes: Faster and More Accurate Sequence Alignment with SNAP

Article #4 Notes: The Cost of Sequencing a Human Genome

Article #5 Notes: Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

Article #6 Notes: System and method for grid and cloud computing

Article #7 Notes: Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity

Article #8 Notes: bíogo: a simple high-performance bioinformatics toolkit for the Go language

Article #9 Notes: Best Practices for Scientific Computing

Article #10 Notes: Biopython: freely available Python tools for computational molecular biology and bioinformatics

Article #11 Notes: BioJava: an open-source framework for bioinformatics

Article #12 Notes: On the performance and design of BioSequences compared to the Seq language

Article #13 Notes: Title

Nagpaul 1


Important Notes:

Notes | Article | Date resolved

Rust would be a great candidate for the project (memory safe, fast, concurrent) | 7 | 10/8/2020


Knowledge Gaps:

This list provides a brief overview of the major knowledge gaps for this project, how they were resolved and where to find the information.

Knowledge Gap | Resolved By | Information is located | Date resolved


Literature Search Parameters:

These searches were performed between (Start Date of reading) and XX/XX/2019. List of keywords and databases used during this project.

Database/search engine | Keywords | Summary of search

Google Patents | bowtie2 |

arXiv | Faster and More Accurate Sequence Alignment with SNAP |

Google Scholar | Genome sequencing |

Google Scholar | bowtie |

Google Scholar | biogo |


Article #0 Notes: Title

Article notes should be on separate sheets

KEEP THIS BLANK AND USE AS A TEMPLATE

Source Title

Source citation (APA Format)

Original URL

Source type

Keywords

Summary of key points (include methodology)

Research Question/Problem/ Need

Important Figures

Notes

Cited references to follow up on

Follow up Questions


Article #1 Notes: A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language

Article notes should be on separate sheets

Source Title A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language

Source citation (APA Format) Zargarbashi, S. S. H., & Babaali, B. (2019). A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language. https://arxiv.org/abs/1910.00330v1

Original URL https://arxiv.org/abs/1910.00330v1

Source type ArXiv paper

Keywords Alzheimer’s, Machine Learning, Computer Science, Language

Summary of key points (include methodology)
- Early diagnosis of Alzheimer’s is crucial
- Current tests are costly and time-consuming
- Enter: machine learning
  - Algorithm analyzes the patient’s spoken language
  - Looks for semantic errors, irregular tone/acoustics, and other irregularities
- 3 models were tested:
  - I-vector & x-vector
    - Operate on the sound itself
    - No semantic understanding
  - N-gram
    - Operates on word semantics and ordering
- Combination of the 3 models diagnosed with 83.6% accuracy
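The n-gram model in the summary works on word order. As a rough illustration only (this is not the authors' model), the sketch below counts word bigrams in a transcript. The helper name `ngram_counts` and the sample sentence are made up for the example.

```python
from collections import Counter


def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count word n-grams in a transcript (hypothetical helper, not from the paper)."""
    words = text.lower().split()
    # Slide a window of n words across the transcript.
    grams = (tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


# Made-up sample sentence, used only to show the shape of the output.
counts = ngram_counts("the cat sat on the mat the cat slept")
print(counts.most_common(3))
```

A classifier would then use counts like these as features; how the paper combines them with the acoustic vectors is not covered here.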

Research Question/Problem/Need How can we diagnose Alzheimer’s earlier for more people?

Important Figures N/A

Notes
- Specific details on the algo were hard to grasp. <- possible knowledge gap?
- Article results were quite light. <- Inconclusive research?
- Uses few resources
- Able to be implemented in a variety of languages
- This is especially promising for my initial idea
- This model isn’t specific to vocal analysis
  - The actual details of the model can be incorporated in other data

Cited references to follow up on P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–479, 1992. <- Possible explanation for the algo details

Follow up Questions
- What are the specific details of the algorithm?
- How effective is this today?
- What is the false positive percentage?
  - Answered in results: 83.6% success rate
- How general purpose are the vector models?
- Are these vocal samples public?
- Why were random noise samples added to the data?
- Is the model overfitted? 83.6% seems like an unrealistic level of accuracy.

Article #2 Notes: Efficient genomic read alignment in an in-memory database

Source Title Efficient genomic read alignment in an in-memory database

Source citation (APA Format) Plattner, H., Schapranow, M., & Ziegler, E. (2014). European Patent No. EP2759952A1. Munich, Germany: European Patent Office.

Original URL https://patents.google.com/patent/EP2759952A1/en

Source type Patent

Keywords Search term: Bowtie2, bioinformatics, possible project

Summary of key points (include methodology)
● Software for genomic alignment
  ○ Faster than existing solutions
● Sequencing becoming more ubiquitous in modern research
● The patented system sequences faster and cheaper
● Processes in parallel!
  ○ Validation for my idea
● Goes on to describe details about the design itself
  ○ Not necessary for my project specifically
● Chunks data
  ○ Chunks are processed concurrently
● Uses a query system
  ○ Similar to the new query-based development strategy of Rust’s RLS
  ○ Based on in-memory database infrastructure (IMDB)
● Distributes using multiple computer cores
  ○ And possibly multiple computers
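The "chunk the data, process chunks concurrently" idea can be sketched in a few lines. This is only an illustration under assumptions: the patent's actual design is not reproduced, counting G/C bases stands in for real alignment work, and the names `chunked`, `count_gc`, and `parallel_gc` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_gc(reads):
    """Stand-in for per-chunk work: count G and C bases in a chunk of reads."""
    return sum(read.count("G") + read.count("C") for read in reads)


def parallel_gc(reads, chunk_size=2, workers=4):
    """Process each chunk in a worker thread, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_gc, chunked(reads, chunk_size)))


reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT"]
print(parallel_gc(reads))
```

Threads are used here only to show the structure. For CPU-bound work in Python, a process pool or a compiled language would be needed to see a real speedup.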

Research Question/Problem/Need Genome alignment is an expensive and slow process. Can we develop a better alternative?


Important Figures

Notes
● Super pleased about this patent
  ○ Verifies my assumption that parallelization would lead to beneficial improvements in genome alignments
● Patent also mentions Bowtie2 as a possible tool which could work with their software
● Truly no unique ideas anymore
  ○ Will have to consult with Sontheimer for other ideas
  ○ Although this one isn’t open source

Cited references to follow up on

Langmead, B.; Salzberg, S. L.: "Fast gapped-read alignment with Bowtie 2", Nature Methods, vol. 9, no. 4, April 2012, pages 357-359, DOI: 10.1038/nmeth.1923

Hoffmann, S., et al.: "Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures", PLoS Computational Biology, vol. 5, no. 9, 11 September 2009, e1000502, DOI: 10.1371/journal.pcbi.1000502

Zaharia, M.; Bolosky, W. J.; Curtis, K.; Fox, A.; Patterson, D.; Shenker, S.; Stoica, I.; Karp, R. M.; Sittler, T.: "Faster and More Accurate Sequence Alignment with SNAP", November 2011

Follow up Questions
● Parallelization is a technique I’m familiar with when applied to projects such as these
  ○ This is the first time I’m hearing about IMDB
  ○ Is it a common use case?
    ■ (look into badger - GoLang)
    ■ ANSWERED https://docs.oracle.com/en/database/oracle/oracle-database/19/vldbg/inmemory-parallel-exec.html
      ● Yes, it’s an underlying model behind many RDBMSs today
● How does this system differ from Bowtie’s?
● The patent repeatedly alludes to a new algorithm: their secret sauce
  ○ Claims it takes advantage of shortcuts, but is simple
  ○ Would a project I develop be focused more on new algorithmic design rather than utilizing modern techniques?
    ■ Similar but not quite the same
      ● i.e. rewriting algorithms vs implementing PVC network


Article #3 Notes: Faster and More Accurate Sequence Alignment with SNAP

Source Title Faster and More Accurate Sequence Alignment with SNAP

Source citation (APA Format) Zaharia, M., Bolosky, W. J., Curtis, K., Fox, A., Patterson, D., Shenker, S., Stoica, I., Karp, R. M., & Sittler, T. (2011). Faster and more accurate sequence alignment with SNAP. ArXiv:1111.5572 [Cs, q-Bio]. http://arxiv.org/abs/1111.5572

Original URL https://arxiv.org/abs/1111.5572

Source type ArXiv Paper

Keywords Bowtie2, Sequence alignment, optimization

Summary of key points (include methodology)
● New sequence aligner (alternative to Bowtie2)
● Uses new algorithm
  ○ Not the Burrows-Wheeler approach used by BWA and Bowtie2
● 10-100x speed up
  ○ Cheaper to run
    ■ $2 AWS unit
● Tested on "m2.4xlarge" Amazon EC2 machine with 68 GB RAM
● Optimized for more error-prone data
  ○ A notable weak area in existing tools

Research Question/Problem/Need How can we speed up nucleotide alignment while preserving a high accuracy percentage?

Important Figures

Can see the sheer magnitude of improvement

Notes
● GitHub: https://github.com/amplab/snap
  ○ Hm, seems less popular than Bowtie2
    ■ Odd considering its enormous speed improvements
● Optimized for both short and long read alignments
● Results do show an improvement in accuracy
  ○ BWA: 93.0% reads, 0.05% error, 662 reads/s
  ○ SNAP: 94.1% reads, 0.05% error, 34100 reads/s
  ○ Will this hold in all cases?
● 40 GitHub issues
  ○ Large amount of duplicate bug reports
  ○ Paper is light on flaws of SNAP?
● Last paragraph confirms my suspicions about how to approach the project
  ○ Reconsidering the algorithms at the core of such projects can lead to interesting speed ups and benefits
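The two throughput figures recorded in the notes imply a specific speedup, which can be checked against the paper's 10-100x claim with a line of arithmetic. The variable names are just for this sketch.

```python
# Throughput figures copied from the notes above (reads per second).
bwa_reads_per_s = 662
snap_reads_per_s = 34100

# Speedup is the ratio of the two throughputs.
speedup = snap_reads_per_s / bwa_reads_per_s
print(f"SNAP is about {speedup:.1f}x faster than BWA on this benchmark")
```

The ratio falls inside the 10-100x range quoted in the summary, at identical error rate and slightly higher percentage of reads aligned.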

Cited references to follow up on R. Li, C. Yu, Y. Li, T.-W. W. Lam, S.-M. M. Yiu, K. Kristiansen, and J. Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, Aug. 2009.

Z. Ning, A. J. Cox, and J. C. Mullikin. SSAHA: A fast search method for large DNA databases. Genome Research, 11(10):1725–1729, 2001.

NHGRI data on DNA Sequencing Costs. http://www.genome.gov/sequencingcosts/.

Follow up Questions
● Why isn’t SNAP more popular if it’s such a considerable upgrade over Bowtie2?
  ○ ANSWERED
    ■ GitHub reveals a shocking number of bugs
    ■ 40 open issues vs Bowtie2’s 7
● How important is the accuracy benefit of SNAP?
  ○ I remember discussing this during my time in lab.
    ■ * Possible S.L talking point.


Article #4 Notes: The Cost of Sequencing a Human Genome

Source Title The Cost of Sequencing a Human Genome

Source citation (APA Format) The cost of sequencing a human genome. (n.d.). Genome.Gov. Retrieved September 27, 2020, from https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

Original URL https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

Source type Scientific Article

Keywords Genome, sequencing, bowtie2

Summary of key points (include methodology)
● First genome cost $300 million to sequence
  ○ Finished in 2003
  ○ Total experimentation cost ~$3 billion to get to this point
● 2006
  ○ Could now generate for $25 million
  ○ Vast amount of tools built
    ■ Still in use today!
● Today
  ○ Process
    ■ Generate draft sequence
    ■ Compare to 6 billion base pairs
  ○ 2015 - $4000 -> $1000

Research Question/Problem/Need How much does it cost to sequence a human genome, and how has this cost evolved throughout the years?


Important Figures

*Note the scale on the left

Notes
● This article was fairly sparing on details
  ○ Not too much technical information
  ○ Did have one extremely useful bit of information
  ○ List of items involved in the genetic sequencing process
    ■ Reagents
    ■ Consumables
    ■ DNA-sequencing instruments
    ■ Certain computer equipment
    ■ Other equipment
    ■ Laboratory pipeline development
    ■ Laboratory information management systems
    ■ Initial data processing
    ■ Submission of data to public databases
    ■ Project management
    ■ Utilities
    ■ Other indirect costs
    ■ Labor
    ■ Administration
  ○ All of these are possible optimization points in my own project

Cited references to follow up on N/A

Follow up Questions
What has been the leading factor in the speed ups throughout the years?
● ANSWERED https://www.broadinstitute.org/blog/opinionome-can-dna-sequencing-get-any-faster-and-cheaper
  ○ Primarily due to speedups in the hardware portion
  ○ But authorities agree there is plenty of room to grow.

Article #5 Notes: Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

Source Title Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

Source citation (APA Format) Hoffmann, S., Otto, C., Kurtz, S., Sharma, C. M., Khaitovich, P., Vogel, J., Stadler, P. F., & Hackermüller, J. (2009). Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Computational Biology, 5(9). https://doi.org/10.1371/journal.pcbi.1000502

Original URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2730575/

Source type Journal article (NCBI)

Keywords Algorithms, genome, sequencing

Summary of key points (include methodology)
● Optimizes short read matching
● Accounts for indels while seed searching
  ○ Indels = insertions / deletions
● Uses suffix trees!
  ○ For each suffix of the read
    ■ Find the longest prefix match
      ● If long enough to be unique (ish)
        ○ On/off target style
      ● Check all these positions to verify alignment
    ■ Increases sensitivity for no-match cases
  ○ Example: ACTGACTG
    ■ If the 2nd suffix has a match of length 4, i.e. CTGA,
    ■ then the 3rd suffix has a longest prefix match of at least 3 (TGA)
    ■ So the longest prefix match of the next suffix can be determined without rematching
● Benchmark
  ○ 500 000 simulated reads
    ■ Length 35 bp
    ■ Sampled 50 MB from human genome
  ○ Results in important figures
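To make the "longest prefix match per suffix" step concrete, here is a naive Python sketch. It is not the paper's index-based implementation: it uses plain substring search instead of a suffix structure, and the reference string `GGCTGATT` and the function names are invented for the example. It only demonstrates the property from the ACTGACTG example: if one suffix matches with length L, the next suffix matches with length at least L - 1.

```python
def longest_prefix_match(suffix: str, reference: str) -> int:
    """Length of the longest prefix of `suffix` that occurs anywhere in `reference`."""
    length = 0
    # Extend the prefix one base at a time while it still occurs in the reference.
    while length < len(suffix) and suffix[: length + 1] in reference:
        length += 1
    return length


def suffix_match_lengths(read: str, reference: str) -> list:
    """Longest prefix match length for every suffix of the read (naive rescan)."""
    return [longest_prefix_match(read[i:], reference) for i in range(len(read))]


reference = "GGCTGATT"  # made-up reference, not from the paper
read = "ACTGACTG"       # read from the example in the notes
lengths = suffix_match_lengths(read, reference)
print(lengths)

# A match of length L at suffix i implies a match of at least L - 1 at suffix i + 1,
# which is why an index-based method can skip rematching those bases.
assert all(b >= a - 1 for a, b in zip(lengths, lengths[1:]))
```

With this reference the 2nd suffix matches CTGA (length 4) and the 3rd matches TGA (length 3), mirroring the example above.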

Research Question/Problem/Need How can we speed up the matching of short genetic sequences using a new algorithm?

Important Figures

Visual diagram of algorithm

Benchmark results

Notes
● Another example of an algorithm yielding impressive speedups
  ○ Huge speedups for short reads
    ■ (95% vs 71%)
● Algorithm was actually quite easy to understand
  ○ Visual very useful
  ○ Important if I ever need to convey my own algorithms
    ■ Animated visuals?
● No standard benchmark
  ○ All benchmark against standard Bowtie2
    ■ None against SNAP etc.
● Just as with SNAP
  ○ Bowtie2 much more popular
● One takeaway from this article: data correctness is the most important output that should be tuned in my own project.

Cited references to follow up on Prüfer K, Stenzel U, Dannemann M, Green RE, Lachmann M, et al. PatMaN: rapid alignment of short sequences to large databases. Bioinformatics. 2008;24:1530–1531

^PatMaN! <- Used this before at S.L

Follow up Questions
Is Bowtie2 so slow because it is the most correct out of the existing tools?
● ANSWERED http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
  ○ Hmm, shows an option to choose between speed/correctness
    ■ Are these studies taking advantage of the CLI flags to benefit their own tool?
Optimization methods that don’t sacrifice correctness?


Article #6 Notes: System and method for grid and cloud computing

Source Title System and method for grid and cloud computing

Source citation (APA Format) Kortschak, R. D., & Adelson, D. L. (2014). bíogo: A simple high-performance bioinformatics toolkit for the Go language. BioRxiv, 005033. https://doi.org/10.1101/005033

Original URL https://patents.google.com/patent/US8230070B2/en?q=Folding@Home&oq=Folding@Home

Source type Patent

Keywords Volunteer computing, algorithms

Summary of key points (include methodology)
● Grid computing is a valid method for speeding up applications
● Goes into existing tools
  ○ They all work using a hive architecture
  ○ All workers communicate back to the central server (queen) for information
● Their solution allows for inter-worker communication
● System overview:
  ○ Node
    ■ Computer
      ● Process
      ● Process etc.
● Each node is independent
● Nodes can communicate with each other without a dispatching server
● Support for a wide variety of programming techniques and languages
● Implements a MapReduce model
● Uses another query-based system
● Nodes are split into pods
  ○ So still needs a central server to connect
  ○ Just not to actually handle data transfer
● Implemented on .NET using Mono
● Tasks are functionally pure
  ○ Deterministic Input -> Output
● Node processes are split into individual workers processing jobs independently and concurrently
  ○ These workers also communicate between themselves
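The MapReduce model and the "functionally pure tasks" point can be shown together in a small sketch. This is an assumption-laden illustration, not the patent's .NET/Mono implementation: the task (counting 2-mers per read) and the names `map_read` and `reduce_counts` are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_read(read: str, k: int = 2) -> Counter:
    """Map step: a pure function from one read to its k-mer counts."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results; the order of merging does not matter."""
    return left + right


reads = ["ACGT", "CGTA", "ACAC"]
# Each map call depends only on its input, so the calls could run on different nodes.
total = reduce(reduce_counts, map(map_read, reads), Counter())
print(total)
```

Because both steps are deterministic and side-effect free, any worker can redo a task and get the same answer, which is what makes this style convenient for unreliable volunteer machines.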

Research Question/Problem/Need How can we improve existing grid computing solutions?

Important Figures


Notes
● Benefits exponentially decay as workers are added
● Terminology a little dated (master -> slave) (hive)
● Language of the text was odd?
  ○ Kept restating the same aspects over and over
  ○ Figures were useful however
    ■ Show how different types of tasks vary in the benefits from parallelization
    ■ I believe sequence alignment will be safe however, as new improvements are devoted to parallelizing it, thus showing its benefits
● Programming language not too important
  ○ Weren’t using too many domain-specific features
● This is the second time a query architecture has come up
  ○ Definitely something to look into when designing my own system
● Increased latency in the message processing kills any benefits from parallelization
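One common way to reason about diminishing returns from extra workers is Amdahl's law, S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. The patent notes above do not name this model, so it is brought in here purely as an assumption, and the 90% parallel fraction is an arbitrary example value.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: upper bound on speedup with `workers` parallel workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed example: 90% of the job parallelizes perfectly.
for workers in (1, 2, 4, 8, 16):
    print(workers, round(amdahl_speedup(0.9, workers), 2))
```

Each doubling of workers buys less than the previous one, and the speedup can never exceed 1 / (1 - p). Message latency, noted in the last bullet, would reduce the real figure further.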

Cited references to follow up on C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, C. Kozyrakis, Evaluating MapReduce for Multi-core and Multiprocessor Systems, Proceedings of the 13th Intl. Symposium on High-Performance Computer Architecture (HPCA), Phoenix, Ariz., February 2007.

Follow up Questions
● Definitely look more into MapReduce?
● Next article should be something on the query architecture itself?
  ○ Look into RLS system?


Article #7 Notes: Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity

Source Title Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity

Source citation (APA Format) Misale, C. (2014). Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity. 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 578–585. https://doi.org/10.1109/PDP.2014.50

Original URL https://ieeexplore.ieee.org/document/6787333

Source type Conference paper

Keywords Bowtie2, sequence alignment, optimization

Summary of key points (include methodology)
● Alignment is often memory-bottlenecked
● Modified Bowtie2
  ○ Concurrency switched from thread pool
  ○ To “master-worker” task delegation
● BW2 uses Burrows Wheeler Transform (BWT)
● Already concurrent
● Existing threading -> FastFlow, a parallel programming framework
  ○ Hive pattern
  ○ Implemented in C++
  ○ 4 core concepts
    ■ Layered design for supporting local
    ■ Well-implemented algos (speed)
    ■ Parallelism
    ■ Skeleton/pattern programming model
● Significant speed up
  ○ B/c before:
    ■ Parallelized threads often had idle time
  ○ After:
    ■ Using the hive pattern
    ■ Each thread is constantly in use and has jobs to complete
  ○ Speedups largely due to the generic FastFlow
  ○ Claim: Its programming model is superior to the existing model in BW2
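The master-worker delegation described above can be sketched with Python's standard library. This is only a structural illustration: `queue.Queue` uses locks internally, so it is not the lock-less approach FastFlow takes, and computing a read's length stands in for real alignment. The names `worker` and `run_master_worker` are hypothetical.

```python
import queue
import threading


def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Pull reads from the shared task queue until a None sentinel arrives."""
    while True:
        read = tasks.get()
        if read is None:  # sentinel: the master has no more work
            break
        results.put((read, len(read)))  # stand-in for aligning the read


def run_master_worker(reads, n_workers=3):
    """Master: start workers, hand out reads, then collect every result."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for read in reads:
        tasks.put(read)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return dict(results.get() for _ in range(len(reads)))


print(run_master_worker(["ACGT", "GG", "TTTTT"]))
```

Workers take a new task as soon as they finish the previous one, which is the property the notes credit for removing idle time.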

Research Question/Problem/Need How can we improve Bowtie2’s speed by improving its concurrency model?

Important Figures

Notes
● Bowtie2 is a lot of code
  ○ Working on it alone would be a large project
● Benefits of volunteer computing would largely be constrained to lower resource usage
  ○ Not necessarily a bad thing.
  ○ Still has applications
● Looking at the “bio” package in the Crates.io infrastructure
  ○ Rust seems like a good choice for the project
    ■ Memory safe
    ■ Concurrent
    ■ Fast

Cited references to follow up on n/a

Next paper: https://www.biorxiv.org/content/10.1101/005033v1

Follow up Questions Thread pinning(?)


Article #8 Notes: bíogo: a simple high-performance bioinformatics toolkit for the Go language

Source Title bíogo: a simple high-performance bioinformatics toolkit for the Go language

Source citation (APA Format) Kortschak, R. D., & Adelson, D. L. (2014). bíogo: A simple high-performance bioinformatics toolkit for the Go language. BioRxiv, 005033. https://doi.org/10.1101/005033

Original URL https://www.biorxiv.org/content/10.1101/005033v1

Source type Journal Paper

Keywords Bioinformatics, tools, Go

Summary of key points (include methodology)
● Bioinformatic toolkits are becoming more ubiquitous
  ○ Users have to choose between
    ■ Speed
    ■ Ease of use
● There are fast toolkits, but the languages they are built on are often low level and cumbersome for scientists to write in
  ○ C, Fortran, etc.
  ○ “Written for machines not humans”
● Toolkits in Python, Ruby are easy to use
  ○ But slow
● Fell out of favor in modern times as requirements for processing increase
  ○ Higher volume of data
● Go is a good fit to be a goldilocks language
  ○ Simple
  ○ Fast
    ■ Fast compile times
  ○ Compiled
  ○ Concurrent
    ■ Important in bioinformatics
  ○ Good language tools
    ■ Documentation
    ■ Testing
    ■ IDE environment
● Easy interaction with C code
  ○ Brings a larger library ecosystem
● Bíogo
  ○ Can read and write sequences
  ○ Support for FASTA, FASTQ, BAM, BED, GTF
● Remote BLAST entries
● Pure Go implementation of genome-scale pairwise local sequence alignment tool, PALS
  ○ Comparable speed to C implementation
  ○ Allows parallelism for further speed improvements
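For a sense of what "reading sequences" involves at the simplest level, here is a minimal FASTA reader in Python. It is a sketch under assumptions, not bíogo's API (which is written in Go and is not reproduced here): the function name `parse_fasta` and the example records are invented, and real toolkits handle many more edge cases.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record name: sequence} mapping."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the record name is the first word after '>'.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: append to the current record.
            records[name] += line
    return records


example = [">seq1 demo", "ACGT", "TTAA", ">seq2", "GGCC"]
print(parse_fasta(example))
```

The same function works on an open file handle, since iterating a text file yields its lines.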

Research Question/Problem/Need How can a bioinformatic toolkit with both speed and development ease be created?

Important Figures N/A

Notes
● I agree with a lot of the findings of the paper
● But..
  ○ I believe it was a paper more on justifying the use of Go rather than reasons to pick Go
  ○ If it were giving reasons/benefits I believe a comparison to other languages would be applicable
● The reasons provided go along with my programming language evaluation
  ○ Although the article does fail to mention the community’s niche for the language
  ○ Which is more important because of the library ecosystem developed around that niche.

Cited references to follow up on Cock, P. J. A. et al. 2009. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25: 1422–3.

Wilson, G. et al. 2014. Best practices for scientific computing. PLoS Biol., 12: e1001745.

Follow up Questions How widely used is Go for scientific computing?


Article #9 Notes: Best Practices for Scientific Computing

Source Title Best Practices for Scientific Computing

Source citation (APA Format) Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., Haddock, S. H. D., Huff, K. D., Mitchell, I. M., Plumbley, M. D., Waugh, B., White, E. P., & Wilson, P. (2014). Best Practices for Scientific Computing. PLOS Biology, 12(1), e1001745. https://doi.org/10.1371/journal.pbio.1001745

Original URL https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

Source type Journal paper

Keywords scientific computing, system design

Summary of key points (include methodology)
● 1. Optimize for people, not computers.
  ○ Low cognitive overhead
  ○ Consistent, unique, and meaningful var names + formatting
● 2. Automate tasks
  ○ Use a build tool to automate workflows
● 3. Make many small changes
  ○ Work in small steps with frequent feedback and course correction
  ○ Use a version control system
● 4. Abstract and modularize code
  ○ Re-use code instead of rewriting it
● 5. Plan for mistakes.
  ○ Assertions
  ○ Unit testing library.
  ○ Turn bugs into test cases.
  ○ Use symbolic debuggers
● 6. Optimize software only after it works correctly.
  ○ Use a profiler to identify bottlenecks
  ○ Write code in the highest-level language possible.
● 7. Document design and purpose, not mechanics
  ○ Document interfaces and reasons, not implementations
● 8. Collaborate.
  ○ Review before repo merge
  ○ Pair programming
  ○ Track issues
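Practice 5 (assertions, a unit testing library, turning bugs into test cases) is easy to show in code. The sketch below is a made-up example, not taken from the paper: `reverse_complement` is a hypothetical function, and the "lowercase input" bug in the second test is an invented scenario for illustration.

```python
import unittest

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    seq = seq.upper()
    # Assertion: fail loudly on unexpected input instead of producing garbage.
    assert set(seq) <= set(COMPLEMENT), f"unexpected base in {seq!r}"
    return "".join(COMPLEMENT[base] for base in reversed(seq))


class ReverseComplementTest(unittest.TestCase):
    def test_known_value(self):
        self.assertEqual(reverse_complement("AACG"), "CGTT")

    def test_lowercase_regression(self):
        # Invented scenario: a bug report about lowercase input, kept as a test case.
        self.assertEqual(reverse_complement("aacg"), "CGTT")
```

Saved as a module, the tests can be run with `python -m unittest`, which also fits practice 2 (automate tasks).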

Research Question/Problem/Need What are the best practices for scientific computing to ensure clean maintainable code?

Important Figures N/A

Notes
● A lot of the “rules” provided aren’t really specific to scientific computing
  ○ But CS in general
  ○ Still serves as a useful checklist to hold my own programs to.

Cited references to follow up on Haddock S, Dunn C (2010) Practical computing for biologists. Sunderland (Massachusetts): Sinauer Associates

Follow up Questions Any differences on writing code targeting biologists versus other computer scientists?


Article #10 Notes: Biopython: freely available Python tools for computational molecular biology and bioinformatics

Source Title Biopython: freely available Python tools for computational molecular biology and bioinformatics

Source citation (APA Format) Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., & de Hoon, M. J. L. (2009). Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England), 25(11), 1422–1423. https://doi.org/10.1093/bioinformatics/btp163

Original URL https://pubmed.ncbi.nlm.nih.gov/19304878/

Source type Journal paper

Keywords Bioinformatics, tools, Python

Summary of key points (include methodology)
● Why Python?
  ○ High level language
  ○ Interop with C, C++, Fortran
  ○ High quality numerical library
● Features
  ○ Sequence representation
    ■ Support for
      ● FASTA
      ● GenBank
      ● EMBL
      ● Swiss
      ● Clustal
      ● PHYLIP
      ● Stockholm
      ● Nexus
  ○ Call NCBI BLAST

Research Question/Problem/Need What is Biopython?

Important Figures N/A

Notes
● Extremely light on details, short paper
● Second tool found with interop with C, C++, Fortran mentioned as a key benefit
  ○ Julia has a good interop story
  ○ Simpler than the one in Go as well
● Appears to be based on BioPerl.

Cited references to follow up on Holland RCG, et al. BioJava: an open-source framework for bioinformatics. Bioinformatics. 2008;24:2096–2097

Follow up Questions Why is Biopython the de facto library?


Article #11 Notes: BioJava: an open-source framework for bioinformatics

Source Title BioJava: an open-source framework for bioinformatics

Source citation (APA Format) Holland, R. C. G., Down, T. A., Pocock, M., Prlić, A., Huen, D., James, K., Foisy, S., Dräger, A., Yates, A., Heuer, M., & Schreiber, M. J. (2008). BioJava: An open-source framework for bioinformatics. Bioinformatics, 24(18), 2096–2097. https://doi.org/10.1093/bioinformatics/btn397

Original URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2530884/

Source type Journal paper

Keywords Bioinformatics, tools, Java

Summary of key points (include methodology)
● Goal is to abstract and reuse common biological tasks
● Open source
● Estimated 47 person-years of effort
● Main features
  ○ (1) nucleotide and amino acid alphabets
  ○ (2) BLAST parser
  ○ (3) sequence I/O
  ○ (4) dynamic programming
  ○ (5) structure I/O and manipulation
  ○ (6) sequence manipulation
  ○ (7) genetic algorithms
  ○ (8) statistical distributions
  ○ (9) graphical user interfaces
  ○ (10) serialization to databases
● Sequence (symbolic alphabet) API is the core of BioJava
  ○ Can load sequence from multiple formats
  ○ Serves as the base of the other main features
  ○ Sequence annotation
● Future goals:
  ○ Become an all-encompassing tool for bioinformatic Java work

Research Question/Problem/Need How can we write code which can be reused and save time for Java bioinformatics?


Important Figures

Notes
● Seems like the most mature of the libraries researched
● Support for a large number of functions
● The Bio* toolkits have a large amount of interoperability with existing tools
  ○ Needed as bioinformatics is quite slow on the CS side of things
  ○ Maintenance is difficult
  ○ No mention of the specific libraries however
  ○ Specifically numerical libraries, which Java is not well known for

Cited references to follow up on N/A

Follow up Questions What is the numerical ecosystem like in Java?


Article #12 Notes: On the performance and design of BioSequences compared to the Seq language

Source Title On the performance and design of BioSequences compared to the Seq language

Source citation (APA Format) Ward, B. J. (2020, January 23). On the performance and design of BioSequences compared to the Seq language. BioJulia. https://BioJulia.github.io/post/seq-lang/

Original URL https://biojulia.net/post/seq-lang/

Source type Journal Paper Review

Keywords Julia, bioinformatics, tools

Summary of key points (include methodology)
● Different authors published a paper regarding details on Seq
  ○ A domain-specific language (DSL) for bioinformatics
● Benchmarks between languages are tricky
  ○ Need to ensure both implementations are idiomatic
● Seq benchmarks actually well done
  ○ Few changes, to make more fair between Seq vs. BioJulia (BJ)
● Seq allowed several errors to pass unnoticed
● However Seq was much faster
● Turns out that the difference is from the underlying library BioSequences.jl
  ○ BJ utilizes BioSequences in a manner that optimizes for RAM usage rather than time allotted for the task
  ○ Also verifies the sequences and ensures memory safety
    ■ Seq does neither
● BioSequences was replaced with a “seq-style” model
  ○ Performance was then equal to / greater than Seq
● Bioinformatics does not need a DSL
  ○ The existing ecosystem is just as important if not more important than the bioinformatics work itself
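The trade-off between verifying input and raw speed can be illustrated with a small timing sketch. This is Python, not Julia or Seq, and it is not the benchmark from the post: `make_seq_checked` and `make_seq_unchecked` are hypothetical constructors, and the measured times will vary by machine.

```python
import timeit

VALID_BASES = frozenset("ACGT")


def make_seq_checked(raw: str) -> str:
    """Validate every symbol before accepting the sequence (safer, slower)."""
    if not set(raw) <= VALID_BASES:
        raise ValueError("invalid nucleotide in sequence")
    return raw


def make_seq_unchecked(raw: str) -> str:
    """Accept the sequence as-is (faster, but errors pass unnoticed)."""
    return raw


raw = "ACGT" * 1000
checked = timeit.timeit(lambda: make_seq_checked(raw), number=1000)
unchecked = timeit.timeit(lambda: make_seq_unchecked(raw), number=1000)
print(f"checked: {checked:.4f}s  unchecked: {unchecked:.4f}s")
```

A benchmark comparing the two constructors would favor the unchecked one while measuring different guarantees, which is the fairness concern raised in the summary.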

Research Question/Problem/Need How was the programming language Seq faring so well in its benchmarks?


Important Figures

Notes
● Can always learn from competing tools
● Benchmarks are trickier than first thought
  ○ More time may need to be allotted in the Gantt chart
  ○ Verification is a more valued characteristic than raw speed
  ○ This falls in line with the guidelines for scientific computing
● I’m inclined to agree with the BioJulia developers
  ○ A DSL is outdated and removes access to an entire ecosystem of existing well-written tools
● Seq definitely not an option for my work
● Julia sounds like a good fit based on preliminary research.

Cited references to follow up on N/A

Follow up Questions Are there any specific tasks where raw speed is valued more?


Article #13 Notes: Title

Article notes should be on separate sheets

KEEP THIS BLANK AND USE AS A TEMPLATE

Source Title

Source citation (APA Format)

Original URL

Source type

Keywords

Summary of key points (include methodology)

Research Question/Problem/ Need

Important Figures

Notes

Cited references to follow up on

Follow up Questions