Title: "Implementation of a PRRSV Strain Database" – NPB #04-118

Investigator: Kay S. Faaberg

Institution: University of Minnesota

Co-Investigators: James E. Collins, Ernest F. Retzel

Date Received: August 1, 2006

Abstract:

We describe the establishment of a PRRSV ORF5 database (http://prrsv.ahc.umn.edu/) to fulfill one objective of the PRRS Initiative of the National Pork Board. Nucleotide sequence data for PRRSV ORF5, which encodes GP5, a key viral protein against which neutralizing antibodies are generated, were assembled from our diagnostic laboratory’s private collection of field samples. Searchable webpages were developed so that any user can go, in minimal steps, from an isolate sequence to alignments and phylogenetic dendrograms that show nucleotide or amino acid relatedness to that input sequence. The database can be used for understanding PRRSV variation, epidemiological studies, and selection of vaccine candidates. We believe that this database will eventually be the only website in the world offering recent as well as archival, complete information that is critical to PRRS elimination.

Introduction: This project was proposed to fulfill the stated directive of the PRRS Initiative to implement a National PRRSV Sequence Database. The organization that oversees the development, care and maintenance of the database is the Center for Computational Genomics and Bioinformatics (CCGB) at the University of Minnesota. The Center is not affiliated with any diagnostic laboratory or department; it is a fee-for-service facility that possesses high-throughput computational resources and has developed databases for a number of nationwide initiatives. The database is now freely available to all PRRS researchers, veterinarians and producers for web-based queries concerning relationships to other sequences, RFLP analysis, year and state of isolation, and other related research-based endeavors. In addition, the database can be used to align sequences and obtain a phylogenetic dendrogram (tree) of the alignment. The database is available at http://prrsv.ahc.umn.edu.

Stated Objectives from original proposal:

Objective 1. Develop an advanced relational database to store all PRRSV sequence data.

Objective 2. Develop an interactive, web-based query tool to provide information on an isolate, such as year and state of isolation, the sequence and related sequences; derive the RFLP pattern for ORF5; determine predicted glycosylation; and perform other advanced queries.

Materials and Methods: The CCGB is the computational genomics facility for the University of Minnesota and numerous other institutions. The facility consists of 8,000 sq. ft. of space, including office, training and controlled-environment machine room space. The database servers are professionally managed by Center staff and total more than 95 CPUs of integrated, high-performance computational resources. These clustered resources include 2 Sun Fire V880, 7 Sun E450, E420R or E3000 Enterprise servers, 2 Sun E250 Enterprise Servers, 5 Dell Intel/Linux Compute Servers, 9 Intel/Linux Compute Servers (Beowulf), 1 Compaq ES40 Alpha Compute Server, 1 Sun Blade Server and 1 TimeLogic Genomics Supercomputer. Approximately 24 TB of active mass storage is available, as well as 16 TB of robotic tape storage management. Center researchers are also involved in the development and implementation of high-throughput computational tools. Available software suites include a wide variety of commercial and public domain tools for molecular biology, genomics, microarray analysis, statistics, visualization, data mining and relational database management systems. The facility was thus well suited to house and maintain the PRRSV Sequence Database (http://www.ccgb.umn.edu/).

The UMN private database consisted of 1,979 individual PRRSV sequences, predominantly open reading frame (ORF) 5. This project was to update the database to a newer format, MySQL, which enables relational database management and complex queries. A database expert, programmer Trevor Wennblom of the CCGB, has been instrumental in database creation, web development and curation. Many software tools were used to enable his work, as mentioned below.

Results: Objective 1: Develop an advanced relational database to store all PRRSV sequence data. The project PI and Trevor Wennblom worked to import the existing 1,979 sequences into MySQL. This was a crucial step for assuring database integrity. It involved many months of searching through old records, cross-checking the date and place of origin of each field sample, and assuring that only one copy of each sequence was retained. Each sequence was given a unique identifier, e.g. PRRSV00001, so that the field isolate cannot be immediately recognized by the database user, in order to protect producer confidentiality. Approximately 1,000 new sequences were imported from the Minnesota Veterinary Diagnostic Laboratory. We requested sequences from other institutions, and although other veterinary diagnostic laboratories agreed, no sequences were deposited. We also developed a cross-reference file, known only to the database manager and the PI, that links the unique identifier to the diagnostic case number so as not to lose potentially important clinical information.
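The identifier and cross-reference scheme described above can be illustrated with a minimal sketch. It is not the project's actual schema: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table names, column names and case numbers are hypothetical. It also assumes that duplicates are detected by exact sequence match.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a public table and a private one."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Public table: only the masked identifier and basic sample data.
        CREATE TABLE isolate (
            prrsv_id TEXT PRIMARY KEY,
            year     INTEGER,
            origin   TEXT,
            orf5_nt  TEXT NOT NULL UNIQUE
        );
        -- Private cross-reference: identifier to diagnostic case number.
        CREATE TABLE case_xref (
            prrsv_id    TEXT PRIMARY KEY REFERENCES isolate(prrsv_id),
            case_number TEXT NOT NULL
        );
    """)
    return conn


def add_isolate(conn, year, origin, orf5_nt, case_number):
    """Store a sequence once and return its PRRSVnnnnn identifier."""
    seq = orf5_nt.upper()
    row = conn.execute(
        "SELECT prrsv_id FROM isolate WHERE orf5_nt = ?", (seq,)
    ).fetchone()
    if row:  # exact duplicate: reuse the existing identifier
        return row[0]
    count = conn.execute("SELECT COUNT(*) FROM isolate").fetchone()[0]
    prrsv_id = f"PRRSV{count + 1:05d}"
    conn.execute(
        "INSERT INTO isolate VALUES (?, ?, ?, ?)", (prrsv_id, year, origin, seq)
    )
    conn.execute("INSERT INTO case_xref VALUES (?, ?)", (prrsv_id, case_number))
    return prrsv_id
```

Keeping the case number in a separate table means that queries against the public table never expose it, which mirrors the confidentiality goal stated above.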
Basic information shared on each field sample includes the year of isolation, the U.S. state or foreign country where the isolate was obtained, and the nucleotide sequence. We originally planned to make the database a private domain, accessible to other diagnostic laboratories and to related PRRS producers, veterinarians and investigators only after they had contributed the field isolate sequences in their separate databases. However, after discussions with James Collins, co-investigator, we concluded that restricting access to the PRRSV database was not needed at that time. The complete information can be retrieved only by authorized personnel knowledgeable in the MySQL query language. We did not have time in the funding cycle to implement the final aim of this objective, namely producing a direct link to GenBank (so that newly published PRRSV sequences are available in the CCGB Domain). This task was much harder than originally anticipated. In the short term, we downloaded the published ORF5 sequences from GenBank. We anticipate completing the linking process in the next few months.

Objective 2. Develop an interactive, web-based query tool to provide information on an isolate, such as year and state of isolation, the sequence and related sequences; derive the RFLP pattern for ORF5; determine predicted glycosylation; and perform other advanced queries. Using much of the hardware, software and personnel available at CCGB, we developed a universally accessible webpage that describes the multistate project and the partnerships that contribute to the PRRSV Database, including the National Pork Board and the USDA PRRS Initiative (http://prrsv.ahc.umn.edu). The home page lists several tabbed sections: “Home”, “Sequence Database”, “BLAST and Phylogenetic Analysis”, “Community”, “Feedback” and “About”. Explanatory text for each section is given, including direct links to the NPB and the USDA.

Figure 1

As of June 2005, the database (under the “Sequence Database” tab) consists of 3,930 PRRSV isolate nucleotide sequences from 1989-2004, most generated from Minnesota Veterinary Diagnostic Laboratory (MVDL) submission requests. The columns (left to right) show the unique PRRSV ID number, the date and U.S. state/country, the year of isolation, the first 15 nucleotides of the sequence (which expands to the entire sequence when highlighted), percent identity to Ingelvac and Ingelvac ATP, RFLP cut site information, the number of the GP5 amino acid (5) where N-glycosylation is predicted, names of early isolates, other notes, and the GenBank ID. The RFLP cut site information and the N-glycosylation sites are predicted automatically from the input sequence by computer programs that Trevor Wennblom has written. Figure 1 shows the web interface to the MySQL database. The database is searchable, so that information can be sorted to suit the question an investigator or veterinarian might pose. Clicking directly on the 15-nucleotide sequence generates a new page that lists all pertinent data documented in the database, in a form that can be selected and copied to the user’s desktop if desired. The “Blast” tab allows the user to compare a nucleotide sequence from the database, an uploaded sequence file, a PRRSV ID or a sequence name to the sequences in the PRRSV database. Once the BLAST output is displayed, one can select from 2 to 25 sequences for further analysis. Alternatively, one can download the selected sequences in “fasta” format, which most alignment programs recognize. We used the open-source software ClustalW (1), Phylip DNAML (2), and Jalview (3)/PFAAT (4) for alignment, in-depth phylogenetic analysis and viewing, respectively. ClustalW compares and reorders the list of nucleotide sequences according to standard alignment parameters (Gap Opening: 10.00, Gap Extension: 0.20). The ClustalW output is a multiple-alignment file, optionally in Phylip format, which can be copied and pasted into other computer software if desired.
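The automatic RFLP and N-glycosylation predictions mentioned above can be illustrated with a minimal sketch. This is not the program written for the database. It assumes the common N-X-S/T sequon rule (X is any residue except proline) for N-glycosylation, and an enzyme panel of MluI, HincII and SacII; this report does not name the enzymes, so the panel is an assumption. Only recognition-site start positions are reported, not a coded RFLP pattern.

```python
import re

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumed enzyme panel (not named in this report); Y = C/T, R = A/G.
ENZYMES = {"MluI": "ACGCGT", "HincII": "GTYRAC", "SacII": "CCGCGG"}
IUPAC = {"Y": "[CT]", "R": "[AG]"}


def translate(nt):
    """Translate from the first base, stopping at the first stop codon."""
    nt = nt.upper().replace("U", "T")
    protein = []
    for i in range(0, len(nt) - len(nt) % 3, 3):
        aa = CODON_TABLE.get(nt[i:i + 3], "X")  # X marks ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def glycosylation_sites(protein):
    """1-based positions of N in N-X-S/T sequons where X is not proline."""
    # The lookahead lets overlapping sequons be reported separately.
    return [m.start() + 1 for m in re.finditer(r"(?=N[^P][ST])", protein)]


def restriction_sites(nt, enzymes=ENZYMES):
    """1-based start positions of each enzyme's recognition site."""
    nt = nt.upper()
    found = {}
    for name, site in enzymes.items():
        pattern = "".join(IUPAC.get(ch, ch) for ch in site)
        found[name] = [m.start() + 1 for m in re.finditer(f"(?={pattern})", nt)]
    return found
```

For example, `glycosylation_sites(translate("ATGAACGGTACCAACCCGAGCAATGCTTCATAA"))` returns `[2, 8]`: the asparagine at position 5 is skipped because it is followed by proline. The three recognition sites are palindromic, so scanning one strand is sufficient.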
Phylip, a widely used phylogeny package, applies a maximum likelihood algorithm (DNAML) to place nucleotide sequences in a phylogenetic dendrogram. The output at present is a simple text file and does not have the flexibility and options of many commercial software packages, nor can one obtain a sequence distance matrix. If the Phylip option is deselected, a ClustalW alignment file is given; the user can adjust the file parameters to obtain alternate alignments if desired, and can color code the sequences to mine for additional information. Lastly, the chosen nucleotide sequences or predicted amino acid sequences are placed into the alignment editors Jalview or PFAAT (Protein Family Alignment Annotation Tool), which allow users to view the alignment in an optimal, user-defined way and to produce standard dendrogram trees directly from the alignment. Jalview and PFAAT also enable the user to export alignments in various formats for downstream use.

Next, there is a “Community” tab, which holds information about PRRSV and can also be used to communicate with the website. This section is built upon server software (Instiki, a Wiki clone) that allows users to freely create and edit Web page content (but not the PRRSV relational database) using any Web browser (5, 6). Next, there is a “Feedback” tab. Although no one has yet used this form of communication with the developers of the PRRSV database, users can easily pose questions or concerns on this page. Finally, the “About” tab has a link (Site Technology) to an explanation of the specific software used in construction of the web site. In brief, Ruby (7), along with BioRuby (8), served as the underlying programming language; Rails (9), a web framework written in Ruby, enabled web site creation; and RMagick (10) allowed bioinformatics graphics to be added.
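Two operations from the Results above, the percent identity column and the “fasta” download, can be illustrated with a minimal sketch. The site itself was built in Ruby; this example uses Python only for illustration. It assumes the two sequences are already aligned to equal length and that columns containing a gap are ignored, which is one of several possible definitions of percent identity and may differ from the one the database uses.

```python
def percent_identity(a, b):
    """Percent of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip alignment columns where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

With these definitions, `percent_identity("ATGC", "ATGA")` returns `75.0`.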

Discussion:

We have successfully generated a searchable, user-friendly PRRSV ORF5 database that is accessible worldwide. Veterinarians, swine producers and researchers alike have immediate access to PRRSV ORF5 sequence data that is not available in this form elsewhere in the world. We have received many favorable comments on the website, and several people from Asia, Canada and the USA have requested additional data mining. The user can go from an input isolate sequence to alignments and dendrograms in minimal steps. One key next step is to broaden the database to include limited field surveillance data, as proposed in the funded grant NPB #05-162. The other is to acquire nucleotide sequences from other diagnostic laboratories or private databases, not only in the US and Canada but as far afield as Asia and Europe.

The PRRSV database is directed towards understanding how variable the ORF5 sequence is, not only between European-like and North American-like isolates, but also within the two types. We also believe that this information is beneficial to novel vaccine targeting, drug-design selection, and understanding of where in GP5 variation is tolerated and where it is not. In addition, swine producers and veterinarians as well as researchers can access data concerning when an isolate sequence first appeared, where and when it was last identified, and how unique the sequence is likely to be. We anticipate expanding the database to hold field isolate surveillance data and combining it with other known databases for additional uses.

Lay Interpretation:

A PRRSV ORF5 database, accessible worldwide, was created for swine producers, practitioners and researchers (http://prrsv.ahc.umn.edu/). This database was established to fulfill one objective of the PRRS Initiative of the National Pork Board. PRRSV ORF5 nucleotide sequence data generated by diagnostic laboratories on field samples were assembled and placed into a searchable webpage, such that any user can go, in minimal steps, from an input isolate sequence to alignments and phylogenetic dendrograms that show nucleotide or amino acid relatedness. It is the database of choice for selection of vaccine candidates, understanding PRRSV variation, epidemiological queries and other requests. In the future, as the database is expanded to include information from other laboratories, we envision this database as the only website in the world to obtain timely information that is critical to PRRS elimination.

One abstract describing the PRRSVdb has been published: Faaberg, K. S., and Wennblom, T. 2005. The PRRSV nucleotide sequence database. 2005 International PRRS Symposium, No. 5, Saint Louis, MO, 2005.

References:

1. Thompson, J. D., Higgins, D. G. and Gibson, T. J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680.
2. Felsenstein, et al. (http://evolution.genetics.washington.edu/phylip.html)
3. Clamp, M., Cuff, J., Searle, S. M. and Barton, G. J. 2004. The Jalview Java Alignment Editor. Bioinformatics 20, 426-427. (http://www.jalview.org/)
4. Johnson, J. M., Mason, K., Moallemi, C., Xi, H., Somaroo, S. and Huang, E. S. 2003. Protein Family Alignment Annotation Tool. Bioinformatics 19, 544-545. (http://pfaat.sourceforge.net/)
5. Leuf, B. and Cunningham, W. (http://wiki.org/)
6. Griesser, A., et al. (http://instiki.org/)
7. Maeda, S., et al. (http://www.ruby-lang.org/)
8. http://bioruby.org/
9. Heinemeier Hansson, D., et al. (http://www.rubyonrails.com/)
10. Hunter, T. P. (http://rmagick.rubyforge.org/)
