Designing Custom Gene Expression Arrays Using eArray
eArray - custom microarray design web tool
Gene Expression design tutorial
March 2007

Designing Custom GX Arrays Using eArray 5.0
Probe Design for Custom GX Arrays
Print what you want when you want it

eArray Overview: Array Creation Process

eArray provides an easy way of managing the array creation process.

[Workflow diagram: Search Agilent Probes or Upload Probes; Create Probe Groups; Create Microarray Designs; Submit to Manufacturing; Order Arrays; Download Design Files. A "We are here!" marker shows the current step.]

Outline

• Access your target organism’s transcriptome sequence from NCBI or another source.
• Format the data as needed into standard FASTA format.
• In eArray, select a design methodology and parameters.
• Upload the transcriptome file.
• Download the results from eArray.
• Review the probe statistic results for possible culling.
• Create a probe group.
• Create the microarray design.

Access the Transcriptome of Your Target Organism

Example: Anopheles gambiae (African malaria mosquito) UniGene sequences downloaded from NCBI, UniGene Build #35.

Retrieve the file Gs.seq.uniq.gz from the NCBI UniGene FTP repository (where Gs = Genus species).

There are multiple links to the UniGene FTP site on the NCBI server. The most direct way is to paste the link ftp://ftp.ncbi.nih.gov/repository/UniGene/ into your web browser. Alternatively, in the lower left corner of the NCBI homepage, http://www.ncbi.nlm.nih.gov/, you will find a link named “FTP sites”. Click it and you will be directed to a page listing all FTP sites. The “UniGene” link leads to the collection of all available UniGene sets.

If you want more information about the genome of your research subject, you can start from the page http://www.ncbi.nlm.nih.gov/Genomes/, which contains comprehensive information.
A set of UniGene sequences for a specific species can be downloaded from the NCBI FTP website. For a particular species, several sequence databases are available, such as dbSNP, Gene and UniGene. Double-click the UniGene link; it leads to all available genome sequences.

In the case of Xenopus laevis, double-click the link Xl.seq.uniq.gz to start the download.

The folder for each species contains various files, including a summary of this type:

UniGene Build #67
Zea mays

Sequences Included in UniGene
Known genes are from GenBank 28 Dec 2007
ESTs are from dbEST through 28 Dec 2007
  8807 mRNAs
  0 Models
  8827 HTC
  185206 EST, 3'reads
  212182 EST, 5'reads
  520642 EST,
  935664 total sequences in clusters

Final Number of Clusters (sets)
  57668 sets total
  6183 sets contain at least one mRNA
  8118 sets contain at least one HTC sequence
  56734 sets contain at least one EST
  5293 sets contain both mRNAs and ESTs

Convert the UniGene File to a Simple FASTA Format

• The .uniq file has hidden code that will result in an upload error in eArray.
• To strip the file to a simple FASTA format, use the freeware Cygwin.
• It provides a simulated Linux-like environment on Windows.
• Download it from www.cygwin.com.
• If you run into any installation problem, check your firewall settings.

• The compressed file can be extracted using WinZip or another extraction tool.
• The extracted file has the file extension .seq.uniq.
• It is a plain-text file that can be opened with Notepad.
• Although it looks like FASTA format, it has embedded Unix code.

Snapshots of Cygwin

Make sure the file Xl.seq.uniq is saved in the current directory when you open Cygwin. Use the command cd x:<return> to change your current directory to drive x or directory x. Then use the command ls<return> to list all files in this directory.
After you find the file Xl.seq.uniq, type the command: unix2dos Xl.seq.uniq <return> (unix2dos converts Unix line endings to DOS/Windows line endings). When the conversion is done, you will see Xl.seq.uniq:done

The sequence is now in FASTA format and ready for upload.

FASTA File Limitations for Gene Expression Designs

• The target FASTA files should be < 300 MB.
• Any target transcripts that exceed 120K bp will be discarded.
• The transcriptome (similarity) FASTA files should be < 1.4 GB.
• Any sequences exceeding 1M bp in the genome file will be discarded.

Now we’re ready to log in to eArray: https://earray.chem.agilent.com/earray/login.do

eArray Home Page

From the home page wizard, choose: Create a microarray from Target Transcripts

Design Probes for Gene Expression

Tools >> GE Probe Design

Select Design Methodology

The Base Composition probe design method is recommended for eukaryotes, and Tm matching for prokaryotes.

• Base Composition: designs probes suited to Agilent protocols for most eukaryotes (standard).
• Tm matching: designs probes for prokaryotes, based on a target melting temperature.

Previously completed jobs are also shown on this page.

Upload Target File

To ensure that your design has the most recent annotation, designate the UniGene download from NCBI, or your own self-developed UniGene set, as the transcriptome by choosing “Use Target File as Transcriptome”. See publication 5989-0750EN: EST Assembly for the Creation of Oligonucleotide Probe Targets.

- Click next and be patient -

Parameters for Probe Selection: Anopheles gambiae

• In this design we request three 60-mer oligos per target sequence.
• A 3’ bias is used, since we’ll use oligo dT priming for labelling.
• Probes should be in the sense orientation (to hybridise to an antisense labelled cRNA target).
• We want the best possible probes in this case, rather than having them evenly distributed.
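Before uploading, the target FASTA limits listed earlier (file under 300 MB, transcripts no longer than 120K bp) can be checked locally. The following is a minimal sketch, not part of eArray: the function names are hypothetical, "MB" is assumed to mean 2**20 bytes, and "120K bp" is assumed to mean 120,000 bases.

```python
import os

# Assumptions: 1 MB = 2**20 bytes and 120K bp = 120,000 bases.
MAX_TARGET_FILE_BYTES = 300 * 1024 * 1024
MAX_TARGET_SEQ_BP = 120_000


def fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in FASTA lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        yield header, length


def oversized_records(lines, max_bp=MAX_TARGET_SEQ_BP):
    """Return headers of records longer than max_bp bases."""
    return [h for h, n in fasta_lengths(lines) if n > max_bp]


def check_target_file(path):
    """Return (file_size_ok, headers_of_oversized_records) for a FASTA file."""
    size_ok = os.path.getsize(path) < MAX_TARGET_FILE_BYTES
    with open(path) as handle:
        too_long = oversized_records(handle)
    return size_ok, too_long
```

For example, check_target_file("Xl.seq.uniq") would report whether the converted file is under the size limit and which transcripts eArray would be expected to discard for length.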
Upload Target File

My target file contains UniGene sequences in FASTA format that were retrieved from NCBI. The quality of the probes generated by eArray depends on the quality of the input sequences. If there are redundancies, or sequences chopped up by bad base calls, the output will contain cross-hybridising probes or probes whose design is constrained by short sequence fragments. See publication 5989-0750EN: EST Assembly for the Creation of Oligonucleotide Probe Targets.

- Click next and be patient -

Submission Complete

Probe design may take a few hours to complete.

Retrieve Results

Retrieve the results from the same place where you submitted the design job: Tools >> GE Probe Design

- Click to open -

Review Probe Design Summary

This is the summary.

Create a Probe Group in eArray

Note: This step is preliminary to the later task of filtering your probe group to eliminate probes predicted to have poor performance characteristics.

Download Probe Design Results

No, really! Hold Ctrl until the download window pops up and you’ve designated where to save the download.

Extract and Save the Results

Review the Results in Excel

The file MOST_tdt contains the probe sequences and information about their computationally predicted performance. For instance, one wants probes that are within a certain Tm range, have similar GC content, do not cross-hybridise to other probes or to other targets, and do not form secondary structures that prevent them from hybridising to their target sequences.

Probe Statistics

Probe statistics are the calculated results for the following parameters for selected probes. They are shown when you click Show Statistics and Sequence, either on the probes Search Results page or in the listing of probes in the Probe Group report.
Statistic: G%
Description: Percent of 'G' characters in the probe sequence string.
Example: Agilent expression probe A_12_P119943 is 60 bases long and contains 11 G's; G% = 18.33%.

Statistic: C%
Description: Percent of 'C' characters in the probe sequence string.
Example: Same probe as above: 16 C's; C% = 26.67%.

Statistic: T%
Description: Percent of 'T' characters in the probe sequence string.
Example: Same probe as above: 20 T's; T% = 33.33%.

Statistic: A%
Description: Percent of 'A' characters in the probe sequence string.
Example: Same probe as above: 13 A's; A% = 21.67%.

Statistic: GC%
Description: Percent of 'G' and 'C' characters in the probe sequence string.
Example: Same probe as above: 11 G's, 16 C's; GC% = 45%.

Statistic: PolyX
Description: The longest homopolymeric run (longest run of one base in the probe sequence string), represented as the number of characters in that substring.
Example: ATTAGTTTATG has a PolyX of 3 because it contains a run of three T's. Same probe as above: contains a run of 5 T's; therefore, PolyX = 5.

Statistic: FivePrimeAs
Description: A string of poly A sequences on the 5' end.
Example: Same probe as above: FivePrimeAs = 0.

Statistic: Tm
Description: For self-complementary oligonucleotide duplexes, the temperature at which half of the strands are in the double-helical state and half are in the random-coil state.
Example: Same probe as above: Tm = 264.89.

Statistic: BC_Score
Description: Base Composition Score: a numeric value that defines the "quality" of the probe based upon its base composition and distribution. BC 1 is the best; BC 4 is the worst.
Example: Same probe as above: BC_Score = BC_1, because A% <= 60; G% <= 35; C% <= 30; T% <= 60; GC% >= 35 and <= 45; A% - C% >= -15 and <= 30; T% - G% >= -15 and <= 30; PolyX <= 6; FivePrimeAs <= 3. For BC 2: A% <= 70; G% <= 40; C% <= 35; T% <= 70; GC% >= 35 and <= 45; A% - C% >= -15 and <= 30; T% - G% >= -15 and <= 30; PolyX <= 6; FivePrimeAs <= 3.
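The sequence-based statistics above can be recomputed from a probe string, which is useful when culling probes in the downloaded results. The following is a minimal sketch rather than eArray's own implementation: the function names are hypothetical, FivePrimeAs is assumed to be the count of consecutive A's at the start (5' end) of the sequence, and only the BC 1 and BC 2 thresholds stated above are encoded, so any probe meeting neither returns None.

```python
from itertools import groupby


def base_percentages(seq):
    """Return G%, C%, T%, A% and GC% for a non-empty probe sequence."""
    seq = seq.upper()
    n = len(seq)
    counts = {b: seq.count(b) for b in "GCTA"}
    pct = {b: 100.0 * counts[b] / n for b in "GCTA"}
    # GC% is computed from the raw counts to avoid rounding drift.
    pct["GC"] = 100.0 * (counts["G"] + counts["C"]) / n
    return pct


def poly_x(seq):
    """Length of the longest run of a single repeated base."""
    return max(len(list(run)) for _, run in groupby(seq.upper()))


def five_prime_as(seq):
    """Number of consecutive A's at the 5' end (assumed interpretation)."""
    seq = seq.upper()
    return len(seq) - len(seq.lstrip("A"))


# Per-base upper limits copied from the BC 1 and BC 2 criteria above.
BC_LIMITS = {
    "BC_1": {"A": 60, "G": 35, "C": 30, "T": 60},
    "BC_2": {"A": 70, "G": 40, "C": 35, "T": 70},
}


def bc_score(seq):
    """Return 'BC_1', 'BC_2', or None if neither set of criteria is met."""
    p = base_percentages(seq)
    shared = (
        35 <= p["GC"] <= 45
        and -15 <= p["A"] - p["C"] <= 30
        and -15 <= p["T"] - p["G"] <= 30
        and poly_x(seq) <= 6
        and five_prime_as(seq) <= 3
    )
    if not shared:
        return None
    for name, limits in BC_LIMITS.items():
        if all(p[b] <= limits[b] for b in "AGCT"):
            return name
    return None


print(poly_x("ATTAGTTTATG"))  # 3, matching the PolyX example above
```

Tm is not recomputed here because the source does not state the formula eArray uses.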