Project Notes:
Project Title: Speeding up genome sequence alignment using volunteer computing
Name: Jay Nagpaul
Note Well: There are NO SHORT-cuts to reading journal articles and taking notes from them. Comprehension is paramount. You will most likely need to read it several times so set aside enough time in your schedule.
Contents:
Important Notes:
Knowledge Gaps:
Literature Search Parameters:
Article #0 Notes: Title
Article #1 Notes: A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language
Article #2 Notes: Efficient genomic read alignment in an in-memory database
Article #3 Notes: Faster and More Accurate Sequence Alignment with SNAP
Article #4 Notes: The Cost of Sequencing a Human Genome
Article #5 Notes: Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures
Article #6 Notes: System and method for grid and cloud computing
Article #7 Notes: Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity
Article #8 Notes: bíogo: a simple high-performance bioinformatics toolkit for the Go language
Article #9 Notes: Best Practices for Scientific Computing
Article #10 Notes: Biopython: freely available Python tools for computational molecular biology and bioinformatics
Article #11 Notes: BioJava: an open-source framework for bioinformatics
Article #12 Notes: On the performance and design of BioSequences compared to the Seq language
Article #13 Notes: Title
Nagpaul 1
Important Notes:
Notes | Article | Date resolved
● Rust would be a great candidate for the project (memory safe, fast, concurrent) | 7 | 10/8/2020
Knowledge Gaps:
This list provides a brief overview of the major knowledge gaps for this project, how they were resolved and where to find the information.
Knowledge Gap | Resolved By | Information is located | Date resolved
Literature Search Parameters:
These searches were performed between (Start Date of reading) and XX/XX/2019. List of keywords and databases used during this project.
Database/search engine | Keywords | Summary of search
Google Patents | bowtie2 |
arXiv | Faster and More Accurate Sequence Alignment with SNAP |
Google Scholar | Genome sequencing |
Google Scholar | bowtie |
Google Scholar | biogo |
Article #0 Notes: Title
Article notes should be on separate sheets
KEEP THIS BLANK AND USE AS A TEMPLATE
Source Title
Source citation (APA Format)
Original URL
Source type
Keywords
Summary of key points (include methodology)
Research Question/Problem/Need
Important Figures
Notes
Cited references to follow up on
Follow up Questions
Article #1 Notes: A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language
Article notes should be on separate sheets
Source Title A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language
Source citation (APA Format) Zargarbashi, S. S. H., & Babaali, B. (2019). A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language. https://arxiv.org/abs/1910.00330v1
Original URL https://arxiv.org/abs/1910.00330v1
Source type ArXiv paper
Keywords Alzheimer’s, Machine Learning, Computer Science, Language
Summary of key points (include methodology)
- Alzheimer’s early diagnosis crucial
- Current tests are costly, slow, and time-consuming
- Enter: machine learning
- Algorithm which analyzes spoken language of patient
  - Looking for semantic errors, irregular tone/acoustics, and other irregularities
- 3 models were tested:
  - I-vector & x-vector
    - Operated on the sound itself
    - No semantic understanding
  - N-gram
    - Operates on word semantics and ordering
- Combination of 3 models diagnosed with 83.6% accuracy
Research Question/Problem/Need How can we diagnose Alzheimer’s earlier for more people?
Important Figures N/A
Notes
- Specific details on the algo were hard to grasp. <- possible knowledge gap?
- Article results were quite light. <- Inconclusive research?
- Takes up low resources
- Able to be implemented in a variety of languages
- This is especially promising for my initial idea
- This model isn’t specific to vocal analysis
  - The actual details of the model can be incorporated in other data
Cited references to follow up on
P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–479, 1992. <- Possible explanation for the algo details
Follow up Questions
- What are the specific details of the algorithm?
- How effective is this today?
- What is the false positives percentage?
  - Answered in results: 83.6% success rate
- How general purpose are the vector models?
- Are these vocal samples public?
- Why were random noise samples added to the data?
- Is the model overfitted? 83.6% seems like an unrealistic level of accuracy.
Article #2 Notes: Efficient genomic read alignment in an in-memory database
Source Title Efficient genomic read alignment in an in-memory database
Source citation (APA Format) Plattner, H., Schapranow, M., & Ziegler, E. (2014). European Patent No. EP2759952A1. Munich, Germany: European Patent Office.
Original URL https://patents.google.com/patent/EP2759952A1/en
Source type Patent
Keywords Search term: Bowtie2, bioinformatics, possible project
Summary of key points (include methodology)
● Software for genomic alignment
  ○ Faster than existing solutions
● Sequencing becoming more ubiquitous in modern research
● Patented approach is faster and cheaper
● Processes in parallel!
  ○ Validation for my idea
● Goes on to describe details about the design itself
  ○ Not necessary for my project specifically
● Chunks data
  ○ Chunks are processed concurrently
● Uses a query system
  ○ I.e. Rust’s RLS new development strategy
  ○ Based on in-memory database infrastructure (IMDB)
● Distributes using multiple computer cores
  ○ And possibly multiple computers
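The "chunks are processed concurrently" point can be illustrated with a minimal Python sketch. This is not the patent's method: `align_chunk` is a hypothetical stand-in that only does exact substring search, and the chunk size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def align_chunk(reference, reads):
    """Hypothetical stand-in for an aligner: exact-match position, or -1."""
    return [(read, reference.find(read)) for read in reads]


def align_all(reference, reads, chunk_size=2, workers=4):
    """Hand every chunk to a worker pool and merge results in input order."""
    chunks = chunked(reads, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves the order of the submitted chunks.
        per_chunk = list(pool.map(lambda c: align_chunk(reference, c), chunks))
    return [hit for hits in per_chunk for hit in hits]
```

Threads keep the sketch simple, but in CPython they do not run CPU-bound work in parallel; swapping in a process pool (or separate machines) would be the step toward the multi-core distribution the patent describes.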
Research Question/Problem/Need Genome alignment is an expensive and slow process. Can we develop a better alternative?
Important Figures
Notes
● Super pleased about this patent
  ○ Verifies my assumption that parallelization would lead to beneficial improvements in genome alignments
● Paper also mentioned Bowtie2 as a possible tool which could work with their software
● Truly no unique ideas anymore
  ○ Will have to consult with Sontheimer for other ideas
  ○ Although this one isn’t open source
Cited references to follow up on
Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359. doi:10.1038/nmeth.1923
Hoffmann, S., et al. (2009). Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Computational Biology, 5(9), e1000502. doi:10.1371/journal.pcbi.1000502
Zaharia, M., Bolosky, W. J., Curtis, K., Fox, A., Patterson, D., Shenker, S., Stoica, I., Karp, R. M., & Sittler, T. (2011, November). Faster and more accurate sequence alignment with SNAP.
Follow up Questions
● Parallelization is a technique I’m familiar with when applied to projects such as these
  ○ This is the first time I’m hearing about IMDB
  ○ Is it a common use case?
    ■ (look into badger - GoLang)
    ■ ANSWERED https://docs.oracle.com/en/database/oracle/oracle-database/19/vldbg/inmemory-parallel-exec.html
      ● Yes, it’s an underlying model behind many RDBMSs today
● How does this system differ from Bowtie’s?
● The patent repeatedly alludes to a new algorithm: their secret sauce
  ○ Claims it takes advantage of shortcuts, but is simple
  ○ Would a project I develop be focused more on new algorithmic design rather than utilizing modern techniques?
    ■ Similar but not quite the same
      ● i.e. rewriting algorithms vs implementing PVC network
Article #3 Notes: Faster and More Accurate Sequence Alignment with SNAP
Source Title Faster and More Accurate Sequence Alignment with SNAP
Source citation (APA Format) Zaharia, M., Bolosky, W. J., Curtis, K., Fox, A., Patterson, D., Shenker, S., Stoica, I., Karp, R. M., & Sittler, T. (2011). Faster and more accurate sequence alignment with SNAP. ArXiv:1111.5572 [Cs, q-Bio]. http://arxiv.org/abs/1111.5572
Original URL https://arxiv.org/abs/1111.5572
Source type ArXiv Paper
Keywords Bowtie2, Sequence alignment, optimization
Summary of key points (include methodology)
● New sequence aligner (alternative to Bowtie2)
● Uses new algorithm
  ○ Not BWT-based (the approach behind Bowtie2 and BWA)
● 10-100x speed up
  ○ Cheaper to run
    ■ $2 AWS unit
● Tested on "m2.4xlarge" Amazon EC2 machine with 68 GB RAM.
● Optimized for more error prone data
  ○ A notable area which is weak in existing tools
Research Question/Problem/Need How can we speed up nucleotide alignment while preserving a high accuracy percentage?
Important Figures
Can see the sheer magnitude of improvement
Notes
● GitHub: https://github.com/amplab/snap
  ○ Hm, seems less popular than bowtie2
    ■ Odd considering its enormous speed improvements
● Optimized for both short and long read alignments
● Results show an improvement in accuracy
  ○ BWA: 93.0% reads, 0.05% error, 662 reads/s
  ○ SNAP: 94.1% reads, 0.05% error, 34100 reads/s
  ○ Will this hold in all cases?
● 40 GitHub issues
  ○ Large amount of duplicate bug reports
  ○ Paper is light on flaws of SNAP?
● Last paragraph agrees with my suspicions of how to approach project
  ○ Reconsidering the algorithms at the core of such projects can lead to interesting speed ups and benefits
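To put the throughput figures in these notes on one scale, here is a small worked calculation in Python. The reads/s numbers are the ones recorded above; the 1,000,000-read workload is a hypothetical size chosen only for illustration.

```python
# Throughput figures as recorded in the notes above.
bwa_reads_per_s = 662
snap_reads_per_s = 34100

# Relative speedup of SNAP over BWA on this benchmark.
speedup = snap_reads_per_s / bwa_reads_per_s

# Hypothetical workload size, used only to make the gap concrete.
reads = 1_000_000
bwa_seconds = reads / bwa_reads_per_s
snap_seconds = reads / snap_reads_per_s

print(f"speedup: {speedup:.1f}x")
print(f"BWA: {bwa_seconds:.0f} s, SNAP: {snap_seconds:.0f} s for {reads} reads")
```

The ratio comes out at roughly 51x, which sits inside the 10-100x range the paper claims.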
Cited references to follow up on
R. Li, C. Yu, Y. Li, T.-W. W. Lam, S.-M. M. Yiu, K. Kristiansen, and J. Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, Aug. 2009.
Z. Ning, A. J. Cox, and J. C. Mullikin. SSAHA: A fast search method for large DNA databases. Genome Research, 11(10):1725–1729, 2001.
NHGRI data on DNA Sequencing Costs. http://www.genome.gov/sequencingcosts/.
Follow up Questions
● Why isn’t SNAP more popular if it’s such a considerable upgrade over bowtie2?
  ○ ANSWERED
    ■ GitHub reveals a shocking number of bugs
    ■ 40 open issues vs Bowtie2’s 7.
● How important is the accuracy benefit of SNAP?
  ○ I remember discussing this during my time in lab.
    ■ * Possible S.L talking point.
Article #4 Notes: The Cost of Sequencing a Human Genome
Source Title The Cost of Sequencing a Human Genome
Source citation (APA Format) The cost of sequencing a human genome. (n.d.). Genome.Gov. Retrieved September 27, 2020, from https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost
Original URL https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost
Source type Scientific Article
Keywords Genome, sequencing, bowtie2
Summary of key points (include methodology)
● First genome cost $300 million to sequence
  ○ Finished in 2003
  ○ Total experimentation cost ~$3 billion to get to this point
● 2006
  ○ Could now generate for $25 million
  ○ Vast amount of tools built
    ■ Still in use today!
● Today
  ○ Process
    ■ Generate draft sequence
    ■ Compare to 6 billion base pairs
  ○ 2015 - $4000 -> $1000
Research Question/Problem/Need How much does it cost to sequence a human genome, and how has this cost evolved throughout the years?
Important Figures
*Note the scale on the left
Notes
● This article was fairly sparing on details
  ○ Not too much technical information
  ○ Did have one extremely useful bit of information
  ○ System of items involved in the genetic sequencing process
    ■ Reagents
    ■ Consumables
    ■ DNA-sequencing instruments
    ■ Certain computer equipment
    ■ Other equipment
    ■ Laboratory pipeline development
    ■ Laboratory information management systems
    ■ Initial data processing
    ■ Submission of data to public databases
    ■ Project management
    ■ Utilities
    ■ Other indirect costs
    ■ Labor
    ■ Administration
  ○ All of these are possible optimization points in my own project
Cited references to follow up on N/A
Follow up Questions
What has been the leading factor in the speed ups throughout the years?
● ANSWERED https://www.broadinstitute.org/blog/opinionome-can-dna-sequencing-get-any-faster-and-cheaper
  ○ Primarily due to speedups in the hardware portion
  ○ But authorities agree there is plenty of room to grow.
Article #5 Notes: Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures
Source Title Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures
Source citation (APA Format) Hoffmann, S., Otto, C., Kurtz, S., Sharma, C. M., Khaitovich, P., Vogel, J., Stadler, P. F., & Hackermüller, J. (2009). Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Computational Biology, 5(9). https://doi.org/10.1371/journal.pcbi.1000502
Original URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2730575/
Source type Journal article (NCBI)
Keywords Algorithms, genome, sequencing
Summary of key points (include methodology)
● Optimizes short read matching
● Accounts for indels while seed searching
  ○ inserts / deletes terminology
● Uses suffix-based index structures (enhanced suffix arrays)!
  ○ For each suffix of the read
    ■ Find its longest prefix match in the reference
      ● If long enough to be unique (ish)
        ○ On/off target style
      ● Check all these positions to verify alignment.
    ■ Increases sensitivity for no-match cases
  ○ Example: read ACTGACTG
    ■ If the 2nd suffix has a match of length 4, i.e. CTGA,
    ■ then the third suffix has a longest prefix match ≥ 3 (TGA)
    ■ So determine the longest prefix match of the next suffix without rematching
● Benchmark
  ○ 500 000 simulated reads
    ■ length 35 bp
    ■ sampled 50 MB from human genome
  ○ Results in important figures
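The "without rematching" trick from the ACTGACTG example can be sketched in Python. This is only an illustration of the idea in these notes, not the paper's implementation: it uses plain substring search where the paper uses an index structure, and the reference string in the usage note is made up.

```python
def longest_prefix_matches(read, reference):
    """For each suffix of `read`, the length of its longest prefix found in `reference`."""
    lengths = []
    carry = 0  # lower bound inherited from the previous suffix
    for start in range(len(read)):
        # If the previous suffix matched L characters, this suffix is
        # guaranteed to match at least L - 1, so those are not rechecked.
        length = carry
        while (start + length < len(read)
               and read[start:start + length + 1] in reference):
            length += 1
        lengths.append(length)
        carry = max(length - 1, 0)
    return lengths
```

With the hypothetical reference `GGCTGATT`, `longest_prefix_matches("ACTGACTG", "GGCTGATT")` gives 4 for the second suffix (CTGA) and 3 for the third (TGA), mirroring the example above.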
Research Question/Problem/Need How can we speed up the matching of short genetic sequences using a new algorithm?
Important Figures
Visual diagram of algorithm
Benchmark results
Notes
● Another example of an algorithm yielding impressive speedups
  ○ Huge speedups for short reads
    ■ (95% vs 71%)
● Algorithm was actually quite easy to understand
  ○ Visual very useful
  ○ Important if I ever need to convey my own algorithms
    ■ Animated visuals?
● No standard benchmark
  ○ All benchmark against standard Bowtie2
    ■ None against SNAP etc.
● Just as with SNAP
  ○ Bowtie2 much more popular
● One takeaway from this article: data correctness is the most important output that should be tuned in my own project.
Cited references to follow up on
Prüfer K, Stenzel U, Dannemann M, Green RE, Lachmann M, et al. PatMaN: rapid alignment of short sequences to large databases. Bioinformatics. 2008;24:1530–1531
^PatMaN! <- Used this before at S.L
Follow up Questions
Is Bowtie2 so slow because it is the most correct out of the existing tools?
● ANSWERED http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
  ○ Hmm, shows an option to choose between speed/correctness
    ■ Are these studies taking advantage of the CLI flags to benefit their own tool?
Optimization methods that don’t sacrifice correctness?
Article #6 Notes: System and method for grid and cloud computing
Source Title System and method for grid and cloud computing
Source citation (APA Format) Kortschak, R. D., & Adelson, D. L. (2014). bíogo: A simple high-performance bioinformatics toolkit for the Go language. BioRxiv, 005033. https://doi.org/10.1101/005033
Original URL https://patents.google.com/patent/US8230070B2/en?q=Folding@Home&oq=Folding@Home
Source type Patent
Keywords Volunteer computing, algorithms
Summary of key points (include methodology)
● Grid computing is a valid method for speeding up applications
● Goes into existing tools
  ○ They all work using a hive architecture
  ○ All workers communicate back to the central server (queen) for information
● Their solution allows for inter-worker communication
● System overview:
  ○ Node
    ■ Computer
      ● Process
      ● Process etc.
● Each node is independent
● Can communicate with each other without a dispatching server
● Support for a wide variety of programming techniques and languages
● Implements a MapReduce model
● Uses another query based system
● Nodes are split into pods
  ○ So still needs a central server to connect
  ○ Just not to actually handle data transfer
● Implemented on .NET using Mono
● Tasks are functionally pure
  ○ Deterministic Input -> Output
● Node processes are split into individual workers processing jobs independently and concurrently
  ○ These workers also communicate with each other
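The combination of a MapReduce model with functionally pure tasks can be sketched in Python as follows. The k-mer counting job and all function names are assumptions picked for illustration; they are not taken from the patent.

```python
from collections import Counter
from functools import reduce


def count_kmers(sequence, k=2):
    """Pure map task: the same input always yields the same k-mer counts."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def merge_counts(left, right):
    """Reduce step: combine two partial results without mutating either."""
    return left + right


def map_reduce(chunks, k=2):
    """Run the map task on every chunk, then fold the partial results together."""
    partials = map(lambda chunk: count_kmers(chunk, k), chunks)
    return reduce(merge_counts, partials, Counter())
```

Because each map task depends only on its own input, the chunks could be sent to any node in any order. Note that this sketch ignores k-mers that span a chunk boundary.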
Research Question/Problem/Need How can we improve existing grid computing solutions?
Important Figures
Notes
● Benefits exponentially decay as workers are added
● Terminology a little dated (master -> slave) (hive)
● Language of the text was odd?
  ○ Kept restating the same aspects over and over
  ○ Figures were useful however
    ■ Shows how different types of tasks vary in the benefits from parallelization
    ■ I believe sequence alignment will be safe however, as new improvements are devoted to parallelizing, thus showing its benefits
● Programming language not too important
  ○ Weren’t using too many domain-specific features
● This is the second time a query architecture has come up
  ○ Definitely something to look into when designing my own system
● Increased latency in the message processing kills any benefits from parallelization
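One standard way to reason about shrinking gains as workers are added is Amdahl's law. It is brought in here as an assumption; the patent's own figures are not reproduced, and the 90% parallel fraction is a hypothetical value.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup when only part of the job can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical job in which 90% of the work parallelizes.
gains = [round(amdahl_speedup(0.9, n), 2) for n in (1, 2, 4, 8, 16)]
print(gains)
```

Each doubling of workers buys less than the previous one, and the speedup can never exceed 1 / (1 - 0.9) = 10 under this model. Message latency, which the notes single out, would lower the curve further and is not modeled here.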
Cited references to follow up on
C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, C. Kozyrakis, Evaluating MapReduce for Multi-core and Multiprocessor Systems, Proceedings of the 13th Intl. Symposium on High-Performance Computer Architecture (HPCA), Phoenix, Ariz., February 2007.
Follow up Questions
● Definitely look more into MapReduce?
● Next article should be something on the query architecture itself?
  ○ Look into RLS system?
Article #7 Notes: Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity
Source Title Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity
Source citation (APA Format) Misale, C. (2014). Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity. 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 578–585. https://doi.org/10.1109/PDP.2014.50
Original URL https://ieeexplore.ieee.org/document/6787333
Source type Journal Paper
Keywords Bowtie2, sequence alignment, optimization
Summary of key points (include methodology)
● Alignment is often memory-bottlenecked
● Modified Bowtie2
  ○ Concurrency switched from thread pool
  ○ To “master-worker” task delegation
● BW2 uses Burrows Wheeler Transform (BWT)
● Already concurrent
● Original threading -> FastFlow, a parallel programming framework
  ○ Hive pattern
  ○ Implemented in C++
  ○ 4 core concepts
    ■ Layered design for supporting local
    ■ Well-implemented algos (speed)
    ■ Parallelism
    ■ Skeleton/pattern programming model
● Significant speed up
  ○ B/c before:
    ■ Parallelized threads often had idle time
  ○ After:
    ■ Using the hive pattern
    ■ Each thread is constantly in use and has jobs to complete
  ○ Speedups largely due to the generic FastFlow library
  ○ Claim: Its programming model is superior to existing model in BW2
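A minimal Python sketch of master-worker task delegation, where each worker pulls its next job the moment it is free. This is not FastFlow and not lock-less: Python's `queue.Queue` uses locks internally, and `run_master_worker` and `handle` are hypothetical names.

```python
import queue
import threading


def run_master_worker(tasks, handle, n_workers=3):
    """Master enqueues tasks; each worker takes the next one as soon as it is free."""
    todo = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = todo.get()
            if item is None:  # sentinel: no more work for this worker
                break
            index, task = item
            outcome = handle(task)
            with lock:
                results[index] = outcome

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for item in enumerate(tasks):  # the master hands out the work
        todo.put(item)
    for _ in threads:              # one sentinel per worker
        todo.put(None)
    for thread in threads:
        thread.join()
    return [results[i] for i in range(len(tasks))]
```

For example, `run_master_worker(["ACGT", "GG", "A"], len)` returns the read lengths in input order. No worker sits idle while the queue still holds tasks, which is the property the notes credit for the speedup.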
Research Question/Problem/Need How can we improve Bowtie2’s speed by improving its concurrency model?
Important Figures
Notes
● Bowtie2 is a lot of code
  ○ Working on it alone would be a large project
● Benefits of volunteer computing would largely be constrained to lower resource usage
  ○ Not necessarily a bad thing.
  ○ Still has applications
● Looking at the “bio” package in the Crates.io infrastructure
  ○ Rust seems like a good choice for the project
    ■ Memory safe
    ■ Concurrent
    ■ Fast
Cited references to follow up on n/a
Next paper: https://www.biorxiv.org/content/10.1101/005033v1
Follow up Questions Thread pinning(?)
Article #8 Notes: bíogo: a simple high-performance bioinformatics toolkit for the Go language
Source Title bíogo: a simple high-performance bioinformatics toolkit for the Go language
Source citation (APA Format) Kortschak, R. D., & Adelson, D. L. (2014). bíogo: A simple high-performance bioinformatics toolkit for the Go language. BioRxiv, 005033. https://doi.org/10.1101/005033
Original URL https://www.biorxiv.org/content/10.1101/005033v1
Source type Journal Paper
Keywords Bioinformatics, tools, Go
Summary of key points (include methodology)
● Bioinformatic toolkits are becoming more ubiquitous
  ○ Users have to choose between
    ■ Speed
    ■ Ease of use
● There are fast toolkits, but the languages they are built on are often low level and cumbersome for scientists to write in
  ○ C, Java, Fortran, etc.
  ○ “Written for machines not humans”
● Toolkits in Python, Perl, Ruby are easy to use
  ○ But slow
● Fell out of favor in modern times as requirements for processing increase
  ○ Higher volume of data
● Go is a good fit to be a goldilocks language
  ○ Simple
  ○ Fast
    ■ Fast compile times
  ○ Compiled
  ○ Concurrent
    ■ Important in bioinformatics
  ○ Good language tools
    ■ Documentation
    ■ Testing
    ■ IDE environment
● Easy interaction with C code
  ○ Brings a larger library ecosystem
● Bíogo
  ○ Can read and write sequences
  ○ Support for FASTA, FASTQ, BAM, BED, GTF
● Remote BLAST entries
● Pure Go implementation of genome-scale pairwise local sequence alignment tool, PALS
  ○ Comparable speed to C implementation
  ○ Allows parallelism for further speed improvements
Research Question/Problem/Need How can a bioinformatic toolkit with both speed and development ease be created?
Important Figures N/A
Notes
● I agree with a lot of the findings of the paper
● But..
  ○ I believe it was a paper more on justifying the use of golang rather than reasons to pick golang
  ○ If it were giving reasons/benefits I believe a comparison to other languages would be applicable
● The reasons provided go along with my programming language evaluation
  ○ Although the article does fail to mention the community’s niche for the language
  ○ Which is more important because of the library ecosystem developed around that niche.
Cited references to follow up on
Cock, P. J. A. et al. 2009. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25: 1422–3.
Wilson, G. et al. 2014. Best practices for scientific computing. PLoS Biol., 12: e1001745.
Follow up Questions How widely used is Go for scientific computing?
Article #9 Notes: Best Practices for Scientific Computing
Source Title Best Practices for Scientific Computing
Source citation (APA Format) Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., Haddock, S. H. D., Huff, K. D., Mitchell, I. M., Plumbley, M. D., Waugh, B., White, E. P., & Wilson, P. (2014). Best Practices for Scientific Computing. PLOS Biology, 12(1), e1001745. https://doi.org/10.1371/journal.pbio.1001745
Original URL https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745
Source type Journal paper
Keywords scientific computing, system design
Summary of key points (include methodology)
● 1. Optimize for people not computers.
  ○ Low cognitive overhead
  ○ Consistent, unique, and meaningful var names + formatting
● 2. Automate tasks
  ○ Use a build tool to automate workflows
● 3. Make many small changes
  ○ Work in small steps with frequent feedback and course correction
  ○ Use a version control system
● 4. Abstract and modularize code
  ○ Re-use code instead of rewriting it
● 5. Plan for mistakes.
  ○ Assertions
  ○ Unit testing library.
  ○ Turn bugs into test cases.
  ○ Use symbolic debuggers
● 6. Optimize software only after it works correctly.
  ○ Use a profiler to identify bottlenecks
  ○ Write code in the highest-level language possible.
● 7. Document design and purpose, not mechanics
  ○ Document interfaces and reasons, not implementations
● 8. Collaborate.
  ○ Review before repo merge
  ○ Pair programming
  ○ Track issues
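Practice 5 (assertions, unit tests, turning bugs into test cases) might look like the Python sketch below. The function and the "past bug" are hypothetical examples, not taken from the paper.

```python
import unittest

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(sequence):
    """Return the reverse complement of an uppercase DNA string."""
    # Assertion: state the assumption so bad input fails loudly and early.
    assert set(sequence) <= set(COMPLEMENT), "unexpected base in sequence"
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


class ReverseComplementTest(unittest.TestCase):
    def test_known_value(self):
        self.assertEqual(reverse_complement("AACG"), "CGTT")

    def test_empty_sequence_regression(self):
        # A (hypothetical) past bug report kept as a permanent test case.
        self.assertEqual(reverse_complement(""), "")
```

The tests can be run with `python -m unittest` from the directory containing the file.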
Research Question/Problem/Need What are the best practices for scientific computing to ensure clean maintainable code?
Important Figures N/A
Notes
● A lot of the “rules” provided aren’t really specific to scientific computing
  ○ But CS in general
  ○ Still serves as a useful checklist to hold my own programs to.
Cited references to follow up on
Haddock S, Dunn C (2010) Practical computing for biologists. Sunderland (Massachusetts): Sinauer Associates
Follow up Questions Any differences in writing code targeting biologists versus other computer scientists?
Article #10 Notes: Biopython: freely available Python tools for computational molecular biology and bioinformatics
Source Title Biopython: freely available Python tools for computational molecular biology and bioinformatics
Source citation (APA Format) Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., & de Hoon, M. J. L. (2009). Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England), 25(11), 1422–1423. https://doi.org/10.1093/bioinformatics/btp163
Original URL https://pubmed.ncbi.nlm.nih.gov/19304878/
Source type Journal paper
Keywords Bioinformatics, tools, Python
Summary of key points (include methodology)
● Why Python?
  ○ High level language
  ○ Interop with C, C++, Fortran
  ○ High quality numerical library
● Features
  ○ Sequence representation
    ■ Support for
      ● FASTA
      ● GenBank
      ● EMBL
      ● Swiss-Prot
      ● Clustal
      ● PHYLIP
      ● Stockholm
      ● NEXUS
  ○ Call NCBI BLAST
Research Question/Problem/Need What is Biopython?
Important Figures N/A
Notes
● Extremely light on details, short paper
● Second tool found with interop with C, C++, Fortran mentioned as a key benefit
  ○ Julia has a good interop story
  ○ Simpler than the one in GoLang as well
● Appears to be based on BioPerl.
Cited references to follow up on
Holland RCG, et al. BioJava: an open-source framework for bioinformatics. Bioinformatics. 2008;24:2096–2097
Follow up Questions Why is BioPython the de facto library?
Article #11 Notes: BioJava: an open-source framework for bioinformatics
Source Title BioJava: an open-source framework for bioinformatics
Source citation (APA Format) Holland, R. C. G., Down, T. A., Pocock, M., Prlić, A., Huen, D., James, K., Foisy, S., Dräger, A., Yates, A., Heuer, M., & Schreiber, M. J. (2008). BioJava: An open-source framework for bioinformatics. Bioinformatics, 24(18), 2096–2097. https://doi.org/10.1093/bioinformatics/btn397
Original URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2530884/
Source type Journal paper
Keywords Bioinformatics, tools, Java
Summary of key points (include methodology)
● Goal is to abstract and reuse common biological tasks
● Open source
● Estimated 47 person-years of effort
● Main features
  ○ (1) nucleotide and amino acid alphabets
  ○ (2) BLAST parser
  ○ (3) sequence I/O
  ○ (4) dynamic programming
  ○ (5) structure I/O and manipulation
  ○ (6) sequence manipulation
  ○ (7) genetic algorithms
  ○ (8) statistical distributions
  ○ (9) graphical user interfaces
  ○ (10) serialization to databases
● Sequence (symbolic alphabet) API is the core of BioJava
  ○ Can load sequence from multiple formats
  ○ Serves as the base of the other 10 main features
  ○ Sequence annotation
● Future goals:
  ○ Become an all-encompassing tool for bioinformatic Java work
Research Question/Problem/Need How can we write code which can be reused and save time for Java bioinformatics?
Important Figures
Notes
● Seems like the most mature of the libraries researched
● Support for a large amount of functions
● The Bio* toolkits have a large amount of interoperability with existing tools
  ○ Needed as bioinformatics is quite slow on the CS side of things
  ○ Maintenance is difficult
  ○ No mention of the specific libraries however
  ○ Specifically numerical libraries which Java is not well known for
Cited references to follow up on N/A
Follow up Questions What is the numerical ecosystem like in Java?
Article #12 Notes: On the performance and design of BioSequences compared to the Seq language
Source Title On the performance and design of BioSequences compared to the Seq language
Source citation (APA Format) Ward, B. J. (2020, January 23). On the performance and design of BioSequences compared to the Seq language. BioJulia. https://BioJulia.github.io/post/seq-lang/
Original URL https://biojulia.net/post/seq-lang/
Source type Journal Paper Review
Keywords Julia, bioinformatics, tools
Summary of key points (include methodology)
● Different authors published a paper regarding details on Seq
● A domain-specific language (DSL) for bioinformatics
● Benchmarks between languages are tricky
● Need to ensure both implementations are idiomatic
● Seq benchmarks actually well done
● Few changes, to make more fair between Seq vs. BioJulia (BJ)
● Seq allowed several errors to pass unnoticed
● However Seq was much faster
● Turns out that the difference is from the underlying library BioSequences.jl
  ○ BJ utilizes BioSequences in a manner that optimizes for RAM usage rather than time allotted for the task
  ○ Also verifies the sequences and ensures memory safety
● Seq does neither
● BioSequences was replaced with a “seq-style” model
  ○ Performance was then equal to / greater than Seq
● Bioinformatics does not need a DSL
  ○ The existing ecosystem is just as important if not more important than the bioinformatics work itself
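The trade-off in these notes (optimizing for RAM and verifying input, at some cost in time) can be illustrated with a Python sketch that packs DNA into 2 bits per base. This is an assumption-laden illustration, not how BioSequences.jl or Seq is actually implemented.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(sequence):
    """Validate a DNA string and pack it into 2 bits per base (4 bases per byte)."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        if base not in CODES:  # the verification step a faster tool might skip
            raise ValueError(f"invalid base {base!r} at position {i}")
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(packed), len(sequence)


def unpack(packed, length):
    """Recover the original string from the packed bytes."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )
```

The packed form uses a quarter of the memory of one byte per base, but every access pays for shifting and masking, and every base is checked on the way in. Skipping those checks would be faster, at the price of letting errors pass unnoticed.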
Research Question/Problem/Need How was the programming language Seq faring so well in its benchmarks?
Important Figures
Notes
● Can always learn from competing tools
● Benchmarks are trickier than first thought
  ○ More time may need to be allotted in the Gantt chart
  ○ Verification is a more valued characteristic than raw speed
  ○ This falls in line with the guidelines for scientific computing
● I’m inclined to agree with the BioJulia developers
  ○ A DSL is outdated and removes access to an entire ecosystem of existing well-written tools
● Seq definitely not an option for my work
● Julia sounds like a good fit based on preliminary research.
Cited references to follow up on N/A
Follow up Questions Are there any specific tasks where raw speed is valued more?
Article #13 Notes: Title
Article notes should be on separate sheets
KEEP THIS BLANK AND USE AS A TEMPLATE
Source Title
Source citation (APA Format)
Original URL
Source type
Keywords
Summary of key points (include methodology)
Research Question/Problem/Need
Important Figures
Notes
Cited references to follow up on
Follow up Questions