Using 454 Sequencing to Survey the Transcriptome of Soybean Seeds Containing Globular-Stage Embryos

Background

The Affymetrix soybean genome array is being used to study the activity of genes in different compartments of the soybean seed at various stages of development (see the Browse link). The soybean array was designed using publicly available ESTs (click here for more details about the soybean array). Most of these ESTs originate from reproductive and vegetative organs, and very few come from libraries constructed from soybean seeds throughout development. As a result, genes active during many stages of seed development are most likely under-represented on the array. To uncover additional genes active during soybean seed development, we carried out a pilot study using high-throughput 454 sequencing (454 Life Sciences) to survey the transcriptome of a globular-stage soybean seed. We generated ~900,000 reads with an average length of 200 bases in one run, representing a deep sampling of the globular-stage seed transcriptome.

Methods

Soybean plants were grown in the UCLA Plant Growth Center with a 16:8 light-dark cycle at 22°C. Total RNA isolated from soybean seeds containing globular-stage embryos was subjected to two rounds of poly(A) selection using a Dynabeads oligo(dT) system (Invitrogen). Complementary DNA (cDNA) was generated from 500 ng of poly(A)-selected mRNA using Superscript II reverse transcriptase (Invitrogen) and the SMART IV and CDSIII/3' PCR primers from a SMART PCR cDNA Library Construction kit (Clontech). The first-strand cDNA was then amplified with 5' PCR and 3' PCR primers for 20 cycles. The cDNA sample was digested with the restriction enzyme Sfi I (New England Biolabs) and then size-fractionated over a CHROMA-SPIN column (Clontech). Fractions containing Sfi I-digested cDNA larger than 400 bp were pooled and precipitated. Five micrograms of the size-selected, Sfi I-digested cDNA sample were sent to Roche 454 Life Sciences for sequencing.

Raw sequences were processed by trimming primer sequences using SeqClean (http://compbio.dfci.harvard.edu/tgi/software/) (parameter -l 25). Processed sequences were then filtered for rRNA sequences using an rRNA database and blastn (parameter -e 0.01).
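The two filtering steps above can be sketched in Python. This is a minimal illustration, not the pipeline that was actually run: it assumes that -l 25 acts as a minimum length for trimmed reads, that the blastn results were written in the standard 12-column tabular format (query ID in column 1, E-value in column 11), and that any read with an rRNA hit at or below the E-value cutoff is discarded. The function name filter_reads and the toy data are hypothetical.

```python
def filter_reads(reads, blast_rows, min_len=25, max_evalue=0.01):
    """Keep reads that are at least min_len bases long and have no rRNA hit.

    reads      -- dict mapping read ID to trimmed sequence
    blast_rows -- iterable of tab-separated lines in 12-column tabular
                  BLAST format (assumed layout: query ID first, E-value 11th)
    """
    rrna_ids = set()
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        if float(fields[10]) <= max_evalue:
            rrna_ids.add(fields[0])  # this read matches the rRNA database

    return {
        read_id: seq
        for read_id, seq in reads.items()
        if len(seq) >= min_len and read_id not in rrna_ids
    }


# Toy data (hypothetical): one good read, one short read, one rRNA read.
example_reads = {
    "read1": "ACGT" * 10,  # 40 bases, no rRNA hit -> kept
    "read2": "ACGT" * 5,   # 20 bases -> removed as too short
    "read3": "GGCC" * 10,  # 40 bases, rRNA hit -> removed
}
example_hits = [
    "read3\trRNA_18S\t98.0\t40\t1\t0\t1\t40\t100\t139\t1e-15\t75.0",
    "read1\trRNA_18S\t85.0\t12\t2\t0\t1\t12\t300\t311\t5.0\t18.0",
]

kept = filter_reads(example_reads, example_hits)
print(sorted(kept))
```

In this toy run, read1 is retained even though it appears in the hit table, because its E-value of 5.0 is above the 0.01 cutoff.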

Results

The globular-stage 454 EST sequences are currently being analyzed and compared against all the sequences represented on the soybean array. These data will determine the extent to which genes on the soybean genome array represent those active in globular-stage seeds. This dataset contains the largest collection of soybean seed ESTs, and was used, in part, by the JGI to help identify and annotate gene models in the soybean genome (http://www.phytozome.net/soybean).

The major conclusions to date are:

  • Ninety-nine percent of the 898,728 globular-stage seed ESTs map to the soybean genome, and ~94% of the ESTs map to 20,000 gene models.

  • Approximately 30% of each 450,000 EST sequencing run represents unique ESTs (i.e., not found in the other run), suggesting that a larger number of EST sequences are required to uncover all soybean globular-stage seed mRNAs (data not shown). That is, the globular-stage soybean seed mRNA population contains >20,000 diverse transcripts.

  • A comparison of the globular-stage seed mRNAs detected using EST sequencing and GeneChip technologies indicates that most diverse mRNAs detected by the GeneChip are also detected using 454 sequencing technology. EST sequencing, however, uncovers a large set of transcripts (~10,000) that is not detected using the GeneChip. This indicates that the cDNA-based GeneChip underestimates the number of genes active within soybean seed RNA populations, and is the primary reason for having a new soybean whole-genome GeneChip constructed.

  • Using both 454 sequencing and GeneChip technologies, we estimate that there are at least 25,000 diverse mRNAs present in a soybean globular-stage seed, and that the total number of genes required for the differentiation of all soybean seed compartments, regions, and tissues across development (i.e., genes required to "make a soybean seed") will be larger.
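For a sense of scale, the mapping percentages in the first conclusion can be converted into approximate read counts. The short Python sketch below does only that arithmetic; the percentages are the rounded figures quoted above (the ~94% value is itself approximate), so the resulting counts are estimates, not reported results. The helper name approx_count is hypothetical.

```python
TOTAL_ESTS = 898_728  # globular-stage seed ESTs, as stated above


def approx_count(total, percent):
    """Convert a rounded percentage of the total into an approximate count."""
    return round(total * percent / 100)


mapped_to_genome = approx_count(TOTAL_ESTS, 99)       # ESTs mapping to the genome
mapped_to_gene_models = approx_count(TOTAL_ESTS, 94)  # ESTs mapping to gene models

print(mapped_to_genome)       # 889741
print(mapped_to_gene_models)  # 844804
```

On these estimates, roughly 45,000 ESTs map to the genome but fall outside the 20,000 gene models.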

Raw, unprocessed EST sequences generated from this study have been submitted to the GenBank Short Read Archive (SRA) division (http://www.ncbi.nlm.nih.gov/Traces/sra/). The GenBank SRA division is a repository for next-generation sequencing data, including 454, Solexa, and ABI SOLiD. The 454 SFF file containing the raw sequence and sequence quality information can be accessed through the SRA web site under accession number SRA001022. The data file can also be accessed directly through FTP (ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead).

We also submitted our processed EST sequences (see Methods above for details) to the GenBank dbEST division (http://www.ncbi.nlm.nih.gov/dbEST/index.html) under accession numbers FK265369 to FK668879 and GD660863 to GE139779. At the same time, we provide these 454 sequences here for rapid data release to the soybean community. The 454 sequence files are very large and may take time to download. For Windows users, right-click on the link and select "Save Linked File As". For Mac users, hold down CTRL while clicking on the link and select "Save Linked File As".