Title: "Implementation of a PRRSV Strain Database" – NPB #04-118

Investigator: Kay S. Faaberg
Institution: University of Minnesota
Co-Investigators: James E. Collins, Ernest F. Retzel
Date Received: August 1, 2006

Abstract: We describe the establishment of a PRRSV ORF5 database (http://prrsv.ahc.umn.edu/) to fulfill one objective of the PRRS Initiative of the National Pork Board. Nucleotide sequence data for PRRSV ORF5, which encodes one key viral protein against which neutralizing antibodies are generated, were assembled from our diagnostic laboratory’s private collection of field samples. Searchable webpages were developed so that any user can go, in a few steps, from an isolate sequence to alignments and phylogenetic dendrograms that show nucleotide or amino acid relatedness to that input sequence. The database can be used for understanding PRRSV variation, for epidemiological studies, and for selection of vaccine candidates. We believe that this database will eventually be the only website in the world from which to obtain recent as well as archival, complete information that is critical to PRRS elimination.

Introduction: This project was proposed to fulfill the stated directive of the PRRS Initiative to implement a National PRRSV Sequence Database. The organization that oversees the development, care and maintenance of the database is the Center for Computational Genomics and Bioinformatics at the University of Minnesota. The Center is not affiliated with any diagnostic laboratory or department, but is a fee-for-service facility that possesses high-throughput computational resources and has developed databases for a number of nationwide initiatives. The database is now freely available to all PRRS researchers, veterinarians and producers for web-based queries concerning relationships to other sequences, RFLP analysis, year and state of isolation, and other related research-based endeavors.
In addition, the database can be used to align sequences and obtain a phylogenetic dendrogram (tree) of the alignment. The database is available at http://prrsv.ahc.umn.edu.

Stated Objectives from original proposal:

Objective 1. Develop an advanced relational database to store all PRRSV sequence data

Objective 2. Develop an interactive, web-based query tool to provide information on an isolate such as year and state of isolation, the sequence and related sequences, derive the RFLP pattern for ORF5, determine predicted glycosylation, and perform other advanced queries

Materials and Methods: The Center for Computational Genomics and Bioinformatics (CCGB, http://www.ccgb.umn.edu/) is the computational genomics facility for the University of Minnesota and numerous other institutions. The facility consists of 8,000 sq. ft. of space, including office, training and controlled-environment machine room space. The database servers are professionally managed by Center staff and total more than 95 CPUs of integrated, high-performance computational resources. These clustered resources include 2 Sun Fire V880, 7 Sun E450, E420R or E3000 Enterprise servers, 2 Sun E250 Enterprise Servers, 5 Dell Intel/Linux Compute Servers, 9 Intel/Linux Compute Servers (Beowulf), 1 Compaq ES40 Alpha Compute Server, 1 Sun Blade Server and 1 TimeLogic Genomics Supercomputer. Approximately 24 TB of active mass storage is available, as well as 16 TB of robotic tape storage management. Center researchers are also involved in the development and implementation of high-throughput computational tools. Available software suites include a wide variety of commercial and public domain tools for molecular biology, genomics, microarray analysis, statistics, visualization, data mining and relational database management systems. The facility was thus well suited to house and maintain the PRRSV Sequence Database.

The UMN private database consisted of 1979 individual PRRSV sequences, predominantly open reading frame (ORF) 5.
The aim of this project was to migrate the database to a newer format, MySQL, which enables relational database management and complex queries. A database expert, programmer Trevor Wennblom of the CCGB, has been instrumental in database creation, web development and curation. Many software tools were used in this work, as described below.

Results:

Objective 1: Develop an advanced relational database to store all PRRSV sequence data.

The project PI and Trevor Wennblom worked to import the existing 1979 sequences into MySQL. This was a crucial step for assuring database integrity. It involved many months of searching through old records, cross-checking the date and place of origin of each field sample, and ensuring that only one copy of each sequence was retained. Each sequence was given a unique identifier, e.g. PRRSV00001, so that the origin of a field isolate sequence is not immediately recognizable to the database user, which protects producer confidentiality. Approximately 1000 new sequences were imported from the Minnesota Veterinary Diagnostic Laboratory. We requested sequences from other institutions, and although other veterinary diagnostic laboratories agreed, no sequences were deposited. We also developed a cross-reference file, known only to the database manager and the PI, that links the unique identifier to the diagnostic case number so as not to lose potentially important clinical information. The basic information shared for each field sample includes the year of isolation, the U.S. state or foreign country where the isolate was obtained, and the nucleotide sequence.

We originally planned to make the database a private domain, accessible to other diagnostic laboratories and to related PRRS producers, veterinarians and investigators only after they had contributed the field isolate sequences in their separate databases. However, after discussions with co-investigator James Collins, we concluded that restricting access to the PRRSV database was not needed at that time.
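The import steps described above (keeping one copy of each sequence, assigning a masked identifier, and holding the case number in a separate cross-reference) can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual schema: Python's built-in sqlite3 module stands in for MySQL, and the table names, column names and the function names open_db and add_isolate are hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the two illustrative tables (hypothetical schema, SQLite stand-in for MySQL)."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS isolate (
            n        INTEGER PRIMARY KEY AUTOINCREMENT,
            prrsv_id TEXT UNIQUE,            -- masked public identifier
            year     INTEGER,
            origin   TEXT,                   -- U.S. state or foreign country
            sequence TEXT UNIQUE NOT NULL    -- one copy of each sequence
        );
        -- Private cross-reference, kept apart from the public table.
        CREATE TABLE IF NOT EXISTS case_xref (
            prrsv_id    TEXT PRIMARY KEY REFERENCES isolate(prrsv_id),
            case_number TEXT NOT NULL
        );
    """)
    return conn


def add_isolate(conn, sequence, year, origin, case_number):
    """Insert a sequence once and return its identifier, e.g. PRRSV00001."""
    # Normalize so that whitespace or letter case does not hide a duplicate.
    seq = "".join(sequence.split()).upper()
    row = conn.execute(
        "SELECT prrsv_id FROM isolate WHERE sequence = ?", (seq,)
    ).fetchone()
    if row:
        return row[0]  # already stored: reuse the existing identifier
    cur = conn.execute(
        "INSERT INTO isolate (year, origin, sequence) VALUES (?, ?, ?)",
        (year, origin, seq),
    )
    prrsv_id = f"PRRSV{cur.lastrowid:05d}"
    conn.execute(
        "UPDATE isolate SET prrsv_id = ? WHERE n = ?", (prrsv_id, cur.lastrowid)
    )
    conn.execute(
        "INSERT INTO case_xref (prrsv_id, case_number) VALUES (?, ?)",
        (prrsv_id, case_number),
    )
    conn.commit()
    return prrsv_id
```

In this sketch the public web queries would read only the isolate table, while the case_xref table would be readable only by the database manager and the PI.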
The complete information can be retrieved only by personnel knowledgeable in the MySQL query language, and access to it is restricted to authorized users. We did not have time in the funding cycle to implement the final aim of this objective, namely producing a direct link to GenBank (so that newly published PRRSV sequences are available in the CCGB domain). This task was much harder than originally anticipated. In the short term, we downloaded the published ORF5 sequences from GenBank. We anticipate completing the linking process in the next few months.

Objective 2. Develop an interactive, web-based query tool to provide information on an isolate such as year and state of isolation, the sequence and related sequences, derive the RFLP pattern for ORF5, determine predicted glycosylation, and perform other advanced queries.

Using much of the hardware, software and personnel available at CCGB, we developed a universally accessible webpage that describes the multistate project and the partnerships that contribute to the PRRSV Database, including the National Pork Board and the USDA PRRS Initiative (http://prrsv.ahc.umn.edu). The home page lists several tabbed sections: “Home”, “Sequence Database”, “BLAST and Phylogenetic Analysis”, “Community”, “Feedback” and “About”. Explanatory text for each section is given, including direct links to the NPB and the USDA.

Figure 1

The database (under the Sequence Database tab) consisted, as of June 2005, of 3930 PRRSV isolate nucleotide sequences from 1989-2004, most generated from Minnesota Veterinary Diagnostic Laboratory (MVDL) submission requests. Columns (left to right) show the unique PRRSV ID number, the date and U.S. state/country, year of isolation, the starting 15 nucleotides of the sequence (which expands to include the entire sequence when highlighted), percent identity to Ingelvac and Ingelvac ATP, RFLP cut site information, the number of the GP5 amino acid (5) where N-glycosylation is predicted, names of early isolates, other notes, and GenBank ID. The RFLP cut site information and the N-glycosylation site prediction are derived automatically from the input sequence by computer programs that Trevor Wennblom has written. Figure 1 shows the web interface to the MySQL database. The database is searchable, so that information can be sorted according to the question an investigator or veterinarian might pose. Clicking directly on the 15-nucleotide sequence generates a new page that lists all pertinent data documented in the database for that isolate, in a form that can be selected and copied to the user's desktop if desired.

The “Blast” tab allows the user to compare a nucleotide sequence from the database, an uploaded sequence file, a PRRSV ID or a sequence name to the sequences in the PRRSV database. Once the BLAST output is displayed, the user can select from 2 to 25 sequences for further analysis. Alternatively, the selected sequences can be downloaded in “fasta” format, which is recognized by most alignment programs.

We used the open-source software ClustalW (1), Phylip DNAML (2), and Jalview (3)/PFAAT (4) for alignment, intensive phylogeny study and viewing, respectively. ClustalW analysis compares and reorganizes the list of nucleotide sequences according to standard alignment parameters (Gap Opening: 10.00, Gap Extension: 0.20). The ClustalW output is a multiple alignment file that can be produced in Phylip format and can be copied and pasted into other software if desired. Phylip, one of the best methods of determining phylogeny, uses a maximum likelihood algorithm to assign nucleotide sequences to a phylogenetic dendrogram.
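The site-prediction programs themselves are not shown in this report. As a rough illustration of the N-glycosylation prediction step mentioned above, the sketch below translates an ORF5 nucleotide sequence with the standard genetic code and reports the 1-based positions of asparagine residues that sit in an N-X-S/T sequon, where X is any residue except proline. The use of this sequon rule is an assumption about how such a prediction is commonly made, and the function names translate and predict_n_glycosylation are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# N followed by any residue except proline, then S or T.
# The lookahead keeps the match one residue long, so overlapping sequons are found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def translate(dna):
    """Translate a nucleotide sequence in frame 1; unknown codons become 'X'."""
    dna = "".join(dna.split()).upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[i:i + 3], "X")
        if residue == "*":  # stop codon ends the open reading frame
            break
        protein.append(residue)
    return "".join(protein)


def predict_n_glycosylation(dna):
    """Return 1-based amino acid positions of predicted N-glycosylation sites."""
    return [m.start() + 1 for m in SEQUON.finditer(translate(dna))]
```

A sequon scan of this kind only flags candidate sites; it says nothing about whether a given site is actually glycosylated in the virus.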
The Phylip output at present is a simple text file and does not have the flexibility and options of many commercial software packages, nor can you obtain
