Soybean IVT Array Annotation
Sequences used for BLAST came from the Affymetrix Soybean target sequences. Sequence information can be obtained directly from Affymetrix. The Affymetrix Soybean target sequence was based on the NCBI Unigene Build 13 (November, 2003). Probe design was based on the NCBI Unigene Build as well as the Affymetrix in-house clustering algorithm. Affymetrix in-house clustering probes are designated with the prefix "GmaAffx".
BLASTX analysis was carried out using soybean target sequences searched against all Arabidopsis proteins (TAIR ATH1_pep_cm_20040228). We filtered out any results with an e-value greater than 1e-02. Because one Soybean sequence can hit many different Arabidopsis sequences, we selected the top Arabidopsis hit from each BLAST result as the corresponding Arabidopsis sequence. The e-value for that hit is displayed in the annotation file. Therefore, each Soybean probe set has an associated Arabidopsis annotation (if available), and the e-value indicates the degree of homology between the Soybean and Arabidopsis sequences.
In cases where no Arabidopsis hit was identified (~9000 Soybean probe sets did not have homology to any Arabidopsis protein), we searched the Soybean sequence with BLAST against Rice Proteins (Build #2 from TIGR) and the NCBI non-redundant protein database. We annotated only the Soybean probe sets; features from H. glycines or P. sojae that are on the GeneChip were not annotated.
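The e-value filter and top-hit selection described above can be sketched roughly as follows. This is a minimal illustration, not the script we used: the Hit record, the best_hits function, and the example IDs are hypothetical, the cutoff assumes "e-02" means 1e-02, and "top hit" is approximated here as the hit with the lowest e-value.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One BLASTX hit (hypothetical record layout for this sketch)."""
    query: str     # soybean probe set / target sequence ID
    subject: str   # Arabidopsis protein ID
    evalue: float


EVALUE_CUTOFF = 1e-2  # assumed reading of the "e-02" cutoff


def best_hits(hits: Iterable[Hit], cutoff: float = EVALUE_CUTOFF) -> Dict[str, Hit]:
    """Drop hits with e-value > cutoff, then keep the lowest-e-value hit per query."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > cutoff:
            continue  # weak hit, filtered out
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return best


# Made-up example data; the IDs are placeholders, not real annotations.
example = [
    Hit("Gma.1.1.S1_at", "AT1G01010.1", 3e-40),
    Hit("Gma.1.1.S1_at", "AT5G05000.1", 2e-12),
    Hit("GmaAffx.2.1.S1_at", "AT2G02020.1", 0.5),
]
result = best_hits(example)
```

Queries that are absent from the result (the second probe set in the example) are the ones that would go on to the Rice and non-redundant database searches.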
Recently, we annotated the soybean GeneChip against the draft soybean genome sequence (Phytozome.net).
ANNOTATION UPDATE:
Sept. 25, 2009 - We mapped individual probes to soybean predicted gene models (generated by the Department of Energy (DOE) Joint Genome Institute, Glyma version 1.01, released April 7, 2009) using BLASTN (≥ 23/25 nucleotide identity) to associate soybean array probe sets with soybean gene models. Probe sets in which at least 9 out of 11 probes map to the same genomic locus are represented in the files below. Probe sets that did not meet these criteria (i.e., ≥ 23/25 nucleotide identity and ≥ 9/11 probes per probe set) were not included in the files below. We split the results into two files based on the prediction confidence of the soybean gene models (ftp://ftp.jgi-psf.org/pub/JGI_data/Glycine_max/Glyma1/annotation/highConfidence/Glyma1_highConfidence.transcriptList). Click the files below to download the association of Soybean array probe sets and Soybean gene models.
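The two criteria above (≥ 23/25 identical nucleotides per probe, ≥ 9 probes per probe set at the same locus) can be sketched as follows. This is only an illustration under stated assumptions: the input tuple layout, the function name assign_probe_sets, and the example IDs are hypothetical, and the sketch reports every gene model that passes rather than resolving ties.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (probe_set_id, probe_id, gene_model_id, identical_bases_out_of_25)
Alignment = Tuple[str, str, str, int]


def assign_probe_sets(
    alignments: Iterable[Alignment],
    min_identical: int = 23,
    min_probes: int = 9,
) -> Dict[str, List[str]]:
    """Return, per probe set, the gene models matched by at least min_probes probes."""
    # probe set -> gene model -> set of distinct probes that pass the identity cutoff
    probes_at_locus = defaultdict(lambda: defaultdict(set))
    for probe_set, probe_id, gene_model, identical in alignments:
        if identical >= min_identical:
            probes_at_locus[probe_set][gene_model].add(probe_id)

    assigned: Dict[str, List[str]] = {}
    for probe_set, loci in probes_at_locus.items():
        passing = sorted(g for g, probes in loci.items() if len(probes) >= min_probes)
        if passing:  # probe sets with no passing gene model are left out
            assigned[probe_set] = passing
    return assigned


# Made-up example: PS_A has 10 good probes and 1 weak one; PS_B has only 8 good probes.
alignments = (
    [("PS_A", f"probe{i}", "GeneModel_1", 25) for i in range(10)]
    + [("PS_A", "probe10", "GeneModel_1", 22)]
    + [("PS_B", f"probe{i}", "GeneModel_2", 24) for i in range(8)]
)
assignments = assign_probe_sets(alignments)
```

In this example only PS_A is assigned; PS_B falls short of the 9-probe threshold and would be excluded from the output files.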
Feb. 1, 2009 - We updated the annotation of the soybean array information based on information from TAIR 7.0, TIGR, and Peking Transcription Factor databases as of October 2007. The updated information is available from the following link.
Click here to download the Soybean GeneChip annotation file (Updated Oct. 2007)
Click here to download a summary of the Soybean array annotation (Updated Oct. 2007)
Distribution of All Probe Sets on the Soybean Array (2007)
Soybean Whole Transcript Genome Array
Motivation: The first-generation Affymetrix Soybean Genome GeneChip was designed by the Soybean Consortium using publicly available soybean full-length cDNAs and ESTs. The array consists of 37,000 probe sets interrogating ~25,000 distinct genes/transcripts. With the release of the whole genome sequence of soybean (available at Phytozome.net), we decided to create a new Soybean Genome GeneChip that would interrogate all the genes in the genome.
Design: The design of this new array differs from that of the original Soybean GeneChip. In the original array, probes were selected to correspond to the 600 bases at the 3' end of the transcript or cDNA. In the new array, probes were selected to span every exon of the predicted gene models/transcripts where possible, covering the entire length of the gene/transcript. This approach allows interrogation of the full transcript (from 5' to 3') and can help determine exon usage in splice variants that may be differentially used in specific tissues or compartments. Probes were selected to interrogate one transcript only, although some probes might interrogate multiple transcripts (if no unique probes could be obtained for that exon region).
Note: This array was designed for studying both Soybean and Medicago (i.e. a Legume array). There are sequences on the array corresponding to Medicago cDNAs. However, our main focus will be on the Soybean sequences on the array.
Sequence Data: All sequence data used to design probes on the array were obtained from the Department of Energy - Joint Genome Institute (DOE-JGI) web site (phytozome: http://phytozome.net). Probes were designed from the first draft assembly of the soybean genome (version 1.0). Probe selection algorithms were written by Christopher Davies and Brant Wong (Affymetrix, Inc.).
The array was designed as a collaboration between our lab (Goldberg Lab) and Affymetrix, with advice and suggestions from other members of the soybean community, including Randy Shoemaker.
Library Files
Below are the library files for this array.
Soybean SENSE Library File - Coming Soon!
This library file is the original library file intended for use with the array. All users should download this library file.
Soybean ANTISENSE Library File [Click Here to Download]
This library file was generated for a small set of custom-designed antisense WT arrays and should be used for those arrays only.
Hybridization Program
To wash, stain, and scan arrays, follow the GeneChip Expression Analysis Technical Manual (Section 2: Eukaryotic Sample and Array Processing), using the fluidics protocol EuKGE-WS2v5_450 for the wash and stain steps.
Arabidopsis ATH1 Array Annotation
The Arabidopsis ATH1 array was annotated in 2003 using all the publicly available resources at the time. To keep up with the growing amount of information generated in the four years since that annotation, we decided to re-annotate the ATH1 array in parallel with the soybean genome array.
The strategy for the re-annotation of the ATH1 array is as follows:
1. We updated the descriptions for each probe set on the array using TAIR Affy array descriptions (affy_ATH1_array_elements-2007-5-2.txt). The description file was downloaded from the TAIR web site: ftp://ftp.arabidopsis.org/home/tair/Microarrays. Descriptions were based on the latest release of the Arabidopsis genome TAIR 7 (released 04-11-07).
Note from TAIR: The mapping to the TAIR7 Transcripts was performed using the BLASTN program with e-value cutoff < 9.9e-6. For the 25-mer oligo probes used on the Affy chips, the required match length to achieve this e-value is 23 or more identical nucleotides. To assign a probe set to a given locus, at least 9 of the probes included in the probe set were required to match a transcript at that locus. Otherwise, the probe set was not assigned a locus and was given the description "no match".
2. In addition to updating the descriptions for each probe set, we also updated gene ontology (GO) information provided by Affymetrix.
3. We gathered information about putative transcription factors from several publicly available TF databases for Arabidopsis, including:
AGRIS - Arabidopsis Gene Regulatory Information Server (http://arabidopsis.med.ohio-state.edu/)
DATF - Database of Arabidopsis Transcription Factors (http://datf.cbi.pku.edu.cn/)
RARTF - Riken Arabidopsis Transcription Factor Database (http://rarge.gsc.riken.jp/rartf/)
ArabTFDB - Arabidopsis Transcription Factor Database (http://arabtfdb.bio.uni-potsdam.de/v1.1/)
Transcription factors and transcription factor families were associated with each probe set on the array. Information obtained from points 1-3 was compiled into an annotation file containing the 2003 ATH1 annotations. Transcription factors were automatically updated based on the information obtained from the databases in point 3.
4. We focused on probe sets that were previously assigned to the "unclassified" category. The rationale is that many of the sequences in the "unclassified" category might have updated information that can be used to re-assign them to a different category. Sequences previously assigned to categories such as "protein synthesis" or "metabolism" most likely will not change. Therefore, we first focused on re-assigning the 11,145 probe sets classified as "unclassified" in 2003.
5. After the "unclassified" category was re-examined, we decided to re-examine all 22,746 probe sets on the array for consistent assignment of functional categories. We sorted all the probe sets by their description and made sure that probe sets with similar descriptions were assigned the same functional category.
6. We further examined the "unclassified" category, which is divided into three groups as follows:
- Unclassified - hypothetical proteins with no cDNA support
- Unclassified - hypothetical proteins with cDNA support
- Unclassified - proteins with unknown function
To distinguish the different sequences within the unclassified category, we downloaded several files from the TAIR site, including:
- TAIR7_protein_coding_no_transcript_support_09_30_07
- TAIR7_protein_coding_with_transcript_support_09_30_07
- TAIR7_unknown_proteins_no_transcript_support_09_30_07
- TAIR7_proteins_of_undefined_function_03_07
- TAIR7_unknown_proteins_03_07
- TAIR7_locus_type
These files were compiled into one main table listing all the transcripts detected and/or predicted in the Arabidopsis genome. This list helps distinguish if a sequence has cDNA support, represents a pseudogene/transposon, or is unknown. These files help re-assign the probe sets into appropriate unclassified categories.
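Compiling several locus lists into one main table, as described above, might look roughly like the sketch below. It is an assumption-laden illustration: the function name compile_locus_table, the shortened list labels, and the locus IDs are hypothetical placeholders, and the sketch only records which lists each locus appears in. It does not encode how we mapped those lists onto the three unclassified groups.

```python
from typing import Dict, Iterable, Mapping


def compile_locus_table(lists: Mapping[str, Iterable[str]]) -> Dict[str, Dict[str, bool]]:
    """Merge named locus lists into one table: locus -> {list name: is the locus in it?}."""
    members = {name: set(ids) for name, ids in lists.items()}
    all_loci = sorted(set().union(*members.values()))
    return {
        locus: {name: locus in ids for name, ids in members.items()}
        for locus in all_loci
    }


# Made-up example; labels and locus IDs are placeholders, not TAIR file contents.
lists = {
    "with_transcript_support": ["AT1G00001", "AT1G00002"],
    "no_transcript_support": ["AT1G00003"],
    "unknown_protein": ["AT1G00002", "AT1G00003"],
}
table = compile_locus_table(lists)
```

Each row of the resulting table can then be inspected to decide, for example, whether a locus has transcript support and whether it is listed as an unknown protein.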
Download
The updated information is available from the following link.

