Project Ideas

Below is a list, with short summaries, of some of the projects that were proposed last year. The number in parentheses after each title is the number of people in that group. Some of these proposals were very ambitious, and many groups succeeded in achieving their personal as well as project goals. The objective of the project is to give you the opportunity to delve into an area of genomics or computational biology that excites you. This year we are encouraging the formation of medium-sized groups (3-6 people). In this way we hope you will be able to exploit your mutual interests and complementary skill sets. Please use the Discussion Forum on the Bphys101 website to find classmates with similar project interests.

Protein Interactions (4) We are interested in the end of the information transfer process in a cell: namely, the 3-D structures of proteins and their interactions. The goal of the project is to be able to analyze a given fold with unknown binding propensities and predict its interactions with other proteins or ligands based on structural and mutational analysis.

Exploring Nucleic Acid Sequence Space via Computation (1) Explore the distribution of nucleic acid sequences that can bind ATP and their abundance in sequence space. Design a set of computer programs to calculate the probability that a randomly chosen sequence could, upon mutation, converge on the previously discovered RNA ATP aptamer motif. Evaluating this probability at different mutagenesis levels gives an overall convergence potential for that sequence.
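
One way such a program might be organized is as a Monte Carlo estimate: mutate the starting sequence many times at a given per-base rate and count how often the motif appears. The sketch below is only an illustration under stated assumptions. The function names are hypothetical, the mutation model (independent substitutions, no insertions or deletions) is an assumption, and the motif passed in is a caller-supplied placeholder, not the actual ATP aptamer motif, which also depends on secondary structure.

```python
import random

BASES = "ACGU"

def mutate(seq, rate, rng):
    """Replace each base, independently with probability `rate`, by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def convergence_probability(seq, motif, rate, trials, rng):
    """Fraction of mutated copies of `seq` that contain `motif` as a substring."""
    hits = sum(motif in mutate(seq, rate, rng) for _ in range(trials))
    return hits / trials

def convergence_profile(seq, motif, rates, trials=2000, seed=0):
    """Estimated convergence probability at each mutagenesis level in `rates`."""
    rng = random.Random(seed)
    return {rate: convergence_probability(seq, motif, rate, trials, rng)
            for rate in rates}
```

Calling `convergence_profile(seq, motif, [0.01, 0.05, 0.2])` returns one estimate per mutagenesis level. How those estimates are combined into a single convergence potential is left open here, as it is in the proposal.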

Mitochondrial Genome Phylogeny (3) In this proposal we aim to use two novel algorithms to study the evolutionary relationships among fully sequenced mitochondrial genomes and to compare these methods to current models of mitochondrial phylogeny.

Malaria Vaccine Development --- From Genome Sequence to DNA Vaccine (1) 400 million people suffer from malaria every year worldwide. This disease kills a child every 12 seconds. No vaccine is currently available. Increasing drug resistance in the parasites has worsened the global malaria situation. Two of the 14 chromosomes of Plasmodium falciparum (7% of the whole genome), the most malignant malaria-causing pathogen, have been sequenced. However, the conventional vaccine development strategy makes little use of the sequence data. This proposal aims to develop effective malaria DNA vaccines by integrating bioinformatic tools into the process of vaccine construction.

Functional Analysis of Bromodomain Motifs (1) My project involves the bromodomain, a protein sequence motif of approximately 100 amino acids. The bromodomain is present in a number of proteins involved in nucleosome remodeling and histone acetylation and deacetylation. It also appears in a number of proteins whose functions are poorly understood. Recently, the structure of the p300/CBP-associated factor (P/CAF) bromodomain was solved by NMR. The structure is a left-handed 4-helix bundle with a pocket that appears to specifically bind acetylated lysine. The specificity of the pocket and the large number of bromodomain motifs associated with histone- and chromatin-modifying proteins suggest that bromodomains might target proteins to DNA stretches by recognizing chemical features such as acetylation. I would like to see if this model can be tested computationally.

Extracting Relevant Genetic Regulatory Networks from RNA Expression Data (3) In recent years, the development of methods for acquiring large-scale gene expression data has enabled the investigation of the regulatory circuitry of genetic systems. Some theoretical work has focused on inferring Boolean network models from the expression data. Due to the lack of sufficient data, these network extraction methods have largely been applied to model networks as a proof of concept. Work on experimental data has gone only as far as using a correlative measure (such as mutual information) to determine which genes might share wiring; it has not been used to extract actual wiring diagrams. We plan to review the state of the art in genetic network extraction and apply it, as a test case, to newly available clinically relevant RNA expression data.
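
As an illustration of the correlative measure mentioned above, the following sketch computes the mutual information between two expression profiles after reducing each to on/off states. It is a minimal example, not the method of any particular paper: the function names are hypothetical, and binarizing around the median is an assumed discretization rule chosen only because it suits a Boolean-network view of the data.

```python
import math
from collections import Counter

def binarize(values):
    """Discretize an expression profile to 0/1 around its median (an assumed rule)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    return [1 if v > median else 0 for v in values]

def mutual_information(x, y):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in pxy.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi
```

Gene pairs whose binarized profiles score high are candidates for shared wiring. As the proposal notes, a high score says nothing by itself about the direction or logic of the connection.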

Neural Network Development (3) The power of neural networks to model behavior at the level of proteins is increasing as more structural data become available. The ability of neural networks to predict the secondary structure of proteins (primarily alpha helix, beta sheet, and coiled coils, as well as other minor motifs) was first presented by Martin Karplus in 1989. Since that time, some research has focused on optimizing this model by considering correlations between patterns of amino acid substitution and local protein structure, introducing thermodynamic considerations, and evolving statistical methods to predict behavior more accurately. The network architecture of most models designed to predict protein secondary structure tends to be simple: a three-layer, feed-forward network trained by backpropagation. Review of models developed in the last decade to predict secondary structure: We intend to analyze the development of network models used to predict secondary structure and focus on the following: (a) the difficulty in predicting beta-sheet patterns; (b) comparison to spectroscopic methods, such as vibrational or electronic circular dichroism; (c) the basis for the increase in predictive power of network models (from 60% in 1989 to 75% currently). Development of network models: We intend to develop a model that will address some of the issues raised in the first goal. Our general plan is to take an existing public domain network and modify it to investigate new approaches to improving network performance. We will then define training and testing sets from online databases of protein sequences whose structures have been solved.
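
For readers unfamiliar with the architecture named above, here is a rough sketch of a three-layer (input, hidden, output), feed-forward network trained by backpropagation on a sliding window of residues. It is not the public domain network the group refers to. The class and function names, the window encoding, the sigmoid hidden layer, the softmax output over three classes, and all parameter values are assumptions made for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = "HEC"  # helix, strand, coil (an assumed three-state labeling)

def encode_window(seq, center, half_width):
    """One-hot encode the residues around `center`; positions off the ends stay all-zero."""
    vec = []
    for pos in range(center - half_width, center + half_width + 1):
        one_hot = [0.0] * len(AMINO_ACIDS)
        if 0 <= pos < len(seq):
            one_hot[AMINO_ACIDS.index(seq[pos])] = 1.0
        vec.extend(one_hot)
    return vec

def make_examples(seq, labels, half_width):
    """Pair each residue's window encoding with the index of its class label."""
    return [(encode_window(seq, i, half_width), CLASSES.index(labels[i]))
            for i in range(len(seq))]

class ThreeLayerNet:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        """Return (hidden activations, class probabilities) for one input vector."""
        hidden = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
                  for row, b in zip(self.w1, self.b1)]
        logits = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(self.w2, self.b2)]
        top = max(logits)
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return hidden, [e / total for e in exps]

    def train_step(self, x, target, lr):
        """One backpropagation update; returns the cross-entropy loss before the update."""
        hidden, probs = self.forward(x)
        # Gradient of softmax cross-entropy with respect to the output logits.
        d_out = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        # Propagate the error back through the sigmoid hidden layer.
        d_hidden = [h * (1.0 - h) * sum(d_out[k] * self.w2[k][j] for k in range(len(d_out)))
                    for j, h in enumerate(hidden)]
        for k, dk in enumerate(d_out):
            for j, h in enumerate(hidden):
                self.w2[k][j] -= lr * dk * h
            self.b2[k] -= lr * dk
        for j, dj in enumerate(d_hidden):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dj * xi
            self.b1[j] -= lr * dj
        return -math.log(probs[target])

def train(net, examples, epochs=100, lr=0.5):
    """Run per-example gradient descent; return the mean loss of each epoch."""
    history = []
    for _ in range(epochs):
        losses = [net.train_step(x, t, lr) for x, t in examples]
        history.append(sum(losses) / len(losses))
    return history
```

With a window of five residues (`half_width=2`) the input layer has 100 units. A real experiment would draw `seq` and `labels` from solved structures and hold out a separate test set, as the proposal describes.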

Evolutionary Relationship Between Genes (1) I am interested in learning how rates of evolution may vary among groups of genes. A phylogeny for 17 microbial species can be inferred from neutral DNA sequence, such as silent codon positions within homologous genes. However, it would be interesting to see if co-regulated groups show similar rates of evolution. In other words, are co-regulated genes constrained by evolution in a similar manner? Perhaps some functional groups are more constrained than others.

DNA Computing (1) For my project, I would like to take an in-depth look at the prospects of DNA computing. One goal I would like to pursue is to come up with methods to solve other NP-complete problems using DNA. If we can find an efficient method for solving some NP-complete problem, it will allow us to solve all other NP-complete problems. This is because any other NP-complete problem can be transformed, in polynomial time, into the type of problem that we can solve efficiently using DNA computing.
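
The transformation argument can be made concrete with a small example. The sketch below reduces Independent Set to Clique by taking the complement graph, which takes polynomial time. The pair of problems is chosen only because the reduction is short, and the brute-force `has_clique` is a stand-in for whatever efficient solver (DNA-based or otherwise) one might have; neither is taken from the proposal, and all function names are hypothetical.

```python
from itertools import combinations

def complement(n, edges):
    """Edges of the complement of an undirected graph on vertices 0..n-1."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u, v in combinations(range(n), 2)
            if frozenset((u, v)) not in present]

def has_clique(n, edges, k):
    """Brute-force stand-in for the solver: are there k mutually adjacent vertices?"""
    present = {frozenset(e) for e in edges}
    return any(all(frozenset(pair) in present for pair in combinations(group, 2))
               for group in combinations(range(n), k))

def has_independent_set(n, edges, k):
    """Decide Independent Set by reducing it to Clique on the complement graph."""
    return has_clique(n, complement(n, edges), k)
```

Only `has_clique` does exponential work here. Replacing it with an efficient solver would make `has_independent_set` efficient as well, which is the point of the argument.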

Determination of Transposon Insertion (1) Given a complete bacterial genome sequence, determine a set of restriction enzyme digestions that will allow the approximate location of a transposon insertion to be identified when a Southern blot of genomic DNA digested with those enzymes is probed with the transposon.
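
A first step toward this goal is predicting, for a candidate insertion position, the band each digest would show on the blot. The sketch below does that under several simplifying assumptions that are not in the proposal: the genome is treated as linear, each cut is placed at the start of the recognition site instead of at the true cleavage position, and the transposon is assumed to contain no sites for the enzymes used. The three recognition sequences are the standard ones for these enzymes; the function names are hypothetical.

```python
ENZYME_SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def cut_positions(genome, site):
    """Start index of every occurrence of `site` (overlapping matches included)."""
    positions, start = [], genome.find(site)
    while start != -1:
        positions.append(start)
        start = genome.find(site, start + 1)
    return positions

def fragment_containing(genome, site, insertion):
    """(start, end) of the linear-digest fragment that spans position `insertion`."""
    bounds = [0] + cut_positions(genome, site) + [len(genome)]
    for left, right in zip(bounds, bounds[1:]):
        if left <= insertion < right:
            return left, right
    raise ValueError("insertion lies outside the genome")

def expected_band_sizes(genome, insertion, transposon_length):
    """Predicted band size per enzyme for a transposon inserted at `insertion`."""
    sizes = {}
    for name, site in ENZYME_SITES.items():
        left, right = fragment_containing(genome, site, insertion)
        sizes[name] = (right - left) + transposon_length
    return sizes
```

Computing `expected_band_sizes` for every candidate position shows which positions a given enzyme set can tell apart: two positions are distinguishable if at least one enzyme gives them different band sizes.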

Search for Processivity Domain in T5 Polymerase (1) As my research project for the class, I would like to employ the SAM software suite (Sequence Alignment and Modeling) to construct my own Hidden Markov Model of processivity domains and DNA binding regions. I will then test both the N and C termini of my sequence to determine if a statistically significant relationship exists. The results of these experiments would then be compared against two other methods for determining distant homologies: PHI-BLAST and the UCLA-DOE fold recognition program. PHI-BLAST accepts a seed motif and then searches for other sequences that possess a similar motif: I will perform a PROSITE motif search on my N and C termini and then use the results as starting points for the PHI-BLAST search. The fold recognition server takes a sequence as input and returns statistically related protein folds; often, protein regions that share similar folds have homologous function.
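
To show what a PROSITE motif search involves at its simplest, the sketch below translates a PROSITE-style pattern into a regular expression and scans a sequence with it. It is an illustration only, not the PROSITE scanning software: the function names are hypothetical, and it handles the common syntax elements (`x`, `[..]`, `{..}`, repeat counts, and the `<` and `>` anchors) but not rarer cases such as an anchor inside square brackets.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a Python regex string."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):        # x(3) or x(2,4) style repetition
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                            # a single residue or a [..] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def find_motifs(pattern, sequence):
    """Return (start, matched_text) for each match, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]
```

For example, `find_motifs("N-{P}-[ST]-{P}", terminal_sequence)` lists candidate N-glycosylation sites; hits of this kind are what would be passed on as seeds.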

AlignACE on Human Sequences (6) Many genes in the human genome do not have published upstream promoter sequences. First, we will use the transcription factor database TRANSFAC. One section of this database contains lists of genes for which TF binding sites have been identified, implying that their promoters have been sequenced. A Microsoft Access database and/or script will be written to make a list of such human genes and automatically locate the GenBank entries for their promoter sequences. Upstream sequence will also be obtained by predicting the location of genes in the 500 Mbp of published genome using sequence analysis. A client script will be written to run ORF searches on the GRAIL web site. It will have to break the contigs up in a sensible way to deal with GRAIL’s size limitations (which I am not clear about as of yet). This will build a table of ORF initiation sites, and the putative promoter regions will be extracted from the original contig files. The ORFs will be sent to a BLAST query to obtain accession numbers. The accession numbers, the ORF nucleotide sequences, and the promoter regions will be assembled into a database for the rest of the project. (Ron) Another script will be written to use known mRNA sequences to find the genomic location of the gene in the 500 Mbp of published genome. It will take an accession number as input, send it through GenBank to retrieve the FASTA-formatted sequence, send the sequence through BLAST (using the human genome as the database) to determine its location in the genome, and use this location to extract its upstream region.
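
Two of the local steps in this pipeline, breaking a contig into pieces and extracting the region upstream of an ORF, can be sketched without touching any web service. The code below is an illustration under assumptions: the function names are hypothetical, `max_len` and `overlap` are placeholders because GRAIL's actual size limit is not stated, and coordinates are taken to be 0-based and end-exclusive.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def split_contig(contig, max_len, overlap):
    """Yield (offset, chunk) pieces no longer than max_len that overlap by `overlap` bases."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    for offset in range(0, max(len(contig) - overlap, 1), step):
        yield offset, contig[offset:offset + max_len]

def upstream_region(contig, orf_start, orf_end, strand, length):
    """Up to `length` bases 5' of an ORF occupying contig[orf_start:orf_end]."""
    if strand == "+":
        return contig[max(orf_start - length, 0):orf_start]
    # On the minus strand the start codon sits at the high-coordinate end,
    # so the upstream region is read from the other strand.
    return reverse_complement(contig[orf_end:orf_end + length])
```

Keeping each chunk's offset makes it possible to map ORF positions reported for a chunk back to contig coordinates before calling `upstream_region`. Overlapping the chunks reduces the chance that an ORF is missed because it straddles a boundary.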

Genetic Regulatory Networks (1) I was hoping to explore the ways that one could use genomics tools, like microarrays, to study genetic regulatory networks. One approach might be to use the data from arrays to explore how to alter gene expression without actually “knowing” exactly what is going on in every part of the network. That is, given expression state A, can we introduce some combination of genes to get to state B?

Other Possible Project Ideas

1) Generate a novel visual or computational representation of nucleic acid or protein sequences and show how this aids in their analysis.

2) Based on sequence data and the abundant biological information that is available to you, construct a minimal genome for life and explain how you arrived at this answer. Is this smaller than that of known organisms?

3) Given all the proteins that have been shown to interact and bind to each other, what is the largest complex that can be constructed, and what possible function could it have?

4) Many technological advances come by making things faster, smaller (higher density) and better. Formulate a design based on published technologies that can improve processes such as sequencing, theoretically, by two orders of magnitude.

5) Is there anything in your own line of research that could benefit from a computational approach?