Summary of All Project GenBank Submissions

Dataset        Methylome    mRNA Transcriptome    smRNA Transcriptome
Total Reads    ~12.0B       ~4.4B                 ~2.7B
Total Bases    ~1,195Gb     ~250Gb                ~136Gb
No. Datasets   52           153                   81

Dataset            GENECHIP
No. Compartments   76
No. GeneChip       166
No. Datasets       76

Note: B - Billion; Gb - Gigabases

All GenBank Submissions Categorized By Project

Please click on the GEO accession number below to download the data from GenBank.

BS DNA-SEQ (Methylome Profiling of Soybean Seed Development Using Next-Generation Sequencing)

Study Dataset No. Reads No. Bases GEO Accessions
Methylation Changes During Soybean Seed Development
Globular Stage Seeds 297M 27.9Gb GSM852274
Cotyledon Stage Seeds 276M 27.6Gb GSM1392187
Early Maturation Stage Seeds 268M 25.4Gb GSM852275
Mid-Maturation Stage Seeds (BR1) 284M 28.4Gb GSM2417602
Mid-Maturation Stage Seeds (BR2) 52M 5.2Gb GSM2417603
Late-Maturation Stage Seeds (BR1) 144M 14.4Gb GSM852278
Late-Maturation Stage Seeds (BR2) 194M 19.4Gb GSM1707337
Late-Maturation Stage Seeds (BR3) 233M 23.3Gb GSM1707338
Pre-dormancy Stage 1 Seeds 182M 18.2Gb GSM1707336
Pre-dormancy Stage 2 Seeds 190M 19.0Gb GSM1707335
Dry Seeds (BR1) 382M 38.2Gb GSM852279
Dry Seeds (BR2) 195M 19.5Gb GSM1707334
6DAI Cotyledon (BR1) 151M 12.9Gb GSM1008197
6DAI Cotyledon (BR2) 206M 20.6Gb GSM1707341
6DAI Cotyledon (BR3) 198M 19.8Gb GSM1707342
6DAI Seedling (BR1) 207M 17.9Gb GSM1008196
6DAI Seedling (BR2) 211M 21.1Gb GSM1707339
6DAI Seedling (BR3) 193M 19.3Gb GSM1707340
Methylation Changes in Soybean Early Maturation Seed Compartments Using LCM
Seed Coat Parenchyma 334M 33.4Gb GSM1388205
Seed Coat Palisade 286M 28.6Gb GSM1388204
Cotyledon Abaxial Parenchyma 196M 19.6Gb GSM929624
Cotyledon Adaxial Parenchyma 245M 24.5Gb GSM929625
Axis Parenchyma 195M 19.5Gb GSM1397517
Axis Plumule 205M 20.5Gb GSM1397518
Axis Root Tip 218M 21.8Gb GSM1397519
Axis Vascular 190M 19.0Gb GSM1397520
Seed Coat 213M 21.3Gb GSM2417600
Embryonic Cotyledon 206M 20.6Gb GSM2417601
Methylation Changes in Soybean Early Maturation Seed Parts
Embryonic Cotyledons 393M 39.3Gb GSM929633
Embryonic Axis 350M 35.0Gb GSM929634
Seed Coat (BR1) 294M 29.4Gb GSM2417691
Seed Coat (BR2) 263M 26.3Gb GSM2417692
Methylation Changes in Soybean Mid-Maturation Seed Parts
Embryonic Cotyledons 273M 27.3Gb GSM1008126
Embryonic Axis 457M 45.7Gb GSM1023628
Seed Coat 250M 25.0Gb GSM1008125
Methylation Changes in Soybean Cotyledon Stage Seed Parts
Embryonic Cotyledons 276M 27.6Gb GSM1388176
Embryonic Axis 275M 27.5Gb GSM1388177
Seed Coat 270M 27.0Gb GSM1388175

Note: BR - Biological Replicate; LCM - Laser Microdissection

BS DNA-SEQ (Methylome Profiling of Arabidopsis Seed Development Using Next-Generation Sequencing)

Study Dataset No. Reads No. Bases GEO Accessions
Methylation Changes in Mature Green Stage Seed Parts (GSE57755)
Embryo 123M 12.3Gb GSM1388112
Seed Coat 172M 17.2Gb GSM1388111
Methylation Changes in Arabidopsis Seed Development (GSE68132)
Globular Stage Seeds 206M 20.6Gb GSM1664380
Linear Cotyledon Stage Seeds 191M 19.1Gb GSM1664381
Mature Green Stage Seeds 189M 18.9Gb GSM1664382
Post Mature Green Stage Seeds 227M 22.7Gb GSM1664383
Dry Seeds 215M 21.5Gb GSM1664384
Leaf 170M 17.0Gb GSM1664385
Methylation Profile of Post Mature Green Stage Seed and Dry Seed from Arabidopsis ddcc Mutant (GSE68131)
ddcc mutant dry seeds 232M 23.2Gb GSM1664376
ddcc mutant post mature green seeds 191M 19.1Gb GSM2319711
ddcc mutant leaf 228M 22.8Gb GSM1664377
wild type dry seeds 220M 22.0Gb GSM1664378
wild type post mature green seeds 188M 18.8Gb GSM2319712
wild type leaf 234M 23.4Gb GSM1664379

RNA-SEQ (Transcriptome Profiling of Soybean and Arabidopsis Seed Development Using Next-Generation Sequencing)

Study Dataset No. Reads No. Bases GEO Accessions
Transcriptome Profiling of the Soybean Life Cycle
Globular Stage Seeds 89M 6.8Gb GSM721725
Heart Stage Seeds 40M 3.0Gb GSM721726
Cotyledon Stage Seeds 52M 4.0Gb GSM721727
Early Maturation Stage Seeds 123M 9.3Gb GSM721728
Dry Seeds 42M 3.2Gb GSM721729
Trifoliate leaves 45M 3.4Gb GSM721730
Roots 40M 3.0Gb GSM721731
Stems 19M 1.4Gb GSM721732
Floral Buds 59M 4.5Gb GSM721733
Whole seedlings six days after imbibition 33M 2.5Gb GSM721734
Transcriptome Profiling of Soybean Seed Compartments Using LCM
Globular Stage Embryo Proper 74M 5.6Gb GSM721717
Globular Stage Suspensor 68M 5.2Gb GSM721718
Early Maturation Seed Coat Parenchyma 73M 5.5Gb GSM721719
Transcriptome Profiling of Soybean Embryonic Cotyledon Before and After Germination
Mid-Maturation Cotyledon 49M 3.7Gb GSM721277
Late-Maturation Cotyledon 83M 6.3Gb GSM721278
Seedling Cotyledon 54M 4.1Gb GSM721280
Transcriptome Profiling of Soybean Early Maturation Seed Parts
Embryonic Cotyledons 24M 2.4Gb GSM1213856
Embryonic Axis 25M 2.5Gb GSM1213857
Seed Coat 21M 2.1Gb GSM1213855
Transcriptome Profiling of Soybean Seed Compartments at Early Maturation Stage Using LCM
Axis Epidermis (3 BRs) 88M 4.5Gb GSM1398252/GSM1398253/GSM1398254
Axis Stele (3 BRs) 82M 4.1Gb GSM1398262/GSM1398263/GSM1398264
Axis Vascular (3 BRs) 91M 4.5Gb GSM1123207/GSM1123208/GSM1398265
Axis Parenchyma (3 BRs) 102M 5.1Gb GSM1123204/GSM1123205/GSM1123206
Plumule (3 BRs) 75M 3.8Gb GSM1398255/GSM1398256/GSM1398257
Root Tip (3 BRs) 131M 6.6Gb GSM1123218/GSM1123219/GSM1398258
Shoot Meristem (3 BRs) 86M 4.2Gb GSM1398259/GSM1398260/GSM1398261
Cotyledon Abaxial Parenchyma (3 BRs) 121M 6.0Gb GSM1398266/GSM1398267/GSM1398268
Cotyledon Adaxial Parenchyma (3 BRs) 101M 5.1Gb GSM1398272/GSM1398273/GSM1398274
Cotyledon Abaxial Epidermis (3 BRs) 110M 5.5Gb GSM1123209/GSM1123210/GSM1123211
Cotyledon Adaxial Epidermis (3 BRs) 119M 5.9Gb GSM1398269/GSM1398270/GSM1398271
Cotyledon Vascular Bundle (3 BRs) 83M 4.2Gb GSM1123212/GSM1123213/GSM1123214
Endosperm (3 BRs) 96M 4.9Gb GSM1123214/GSM1398276/GSM1398277
Hilum (3 BRs) 118M 5.8Gb GSM1123215/GSM1123216/GSM1123217
Seed Coat Parenchyma (3 BRs) 70M 3.5Gb GSM1398279/GSM1398280/GSM1123225
Seed Coat Hourglass (3 BRs) 105M 5.2Gb GSM1123220/GSM1123221/GSM1123222
Seed Coat Palisade (3 BRs) 88M 4.4Gb GSM1123223/GSM1123224/GSM1398278
Transcriptome Profiling of Soybean Seed Compartments at Cotyledon Stage Using LCM
Embryo Proper (3 BRs) 74M 3.8Gb GSM1385450/GSM1385451/GSM1385452
Cotyledon (3 BRs) 98M 5.0Gb GSM1385456/GSM1385457/GSM1385458
Axis (3 BRs) 59M 3.0Gb GSM1385453/GSM1385454/GSM1385455
Endosperm (3 BRs) 64M 3.2Gb GSM1385459/GSM1385460/GSM1385461
Seed Coat Endothelium (3 BRs) 70M 3.5Gb GSM1385462/GSM1385463/GSM1385464
Seed Coat Inner Integument (3 BRs) 64M 3.2Gb GSM1385471/GSM1385472/GSM1385473
Seed Coat Outer Integument (3 BRs) 53M 2.7Gb GSM1385475/GSM1385476/GSM1385477
Seed Coat Hilum (3 BRs) 65M 3.3Gb GSM1385468/GSM1385469/GSM1385460
Seed Coat Epidermis (3 BRs) 70M 3.5Gb GSM1385465/GSM1385466/GSM1385467
Suspensor (3 BRs) 58M 2.9Gb GSM1385477/GSM1385478/GSM1385479
Transcriptome Profiling of Soybean Seed Compartments at Heart Stage Using LCM
Embryo Proper (3 BRs) 56M 2.8Gb GSM1380799/GSM1380800/GSM1380801
Endosperm (3 BRs) 53M 2.7Gb GSM1380802/GSM1380803/GSM1380804
Seed Coat Endothelium (3 BRs) 59M 3.0Gb GSM1380805/GSM1380806/GSM1380807
Seed Coat Epidermis (3 BRs) 73M 3.7Gb GSM1380808/GSM1380809/GSM1380810
Seed Coat Hilum (3 BRs) 63M 3.2Gb GSM1380811/GSM1380812/GSM1380813
Seed Coat Inner Integument (3 BRs) 54M 2.7Gb GSM1380814/GSM1380815/GSM1380816
Seed Coat Outer Integument (3 BRs) 53M 2.7Gb GSM1380817/GSM1380818/GSM1380819
Suspensor (3 BRs) 55M 2.8Gb GSM1380820/GSM1380821/GSM1380822
Transcriptome Profiling of Soybean Seed Compartments at Globular Stage Using LCM
Embryo Proper (3 BRs) 65M 3.3Gb GSM1380774/GSM1380775/GSM1380776
Endosperm (3 BRs) 72M 3.7Gb GSM1380777/GSM1380778/GSM1380779
Seed Coat Endothelium (3 BRs) 63M 3.2Gb GSM1380780/GSM1380781/GSM1380782
Seed Coat Epidermis (3 BRs) 62M 3.1Gb GSM1380783/GSM1380784/GSM1380785
Seed Coat Hilum (3 BRs) 71M 3.6Gb GSM1380786/GSM1380787/GSM1380788
Seed Coat Inner Integument (4 BRs) 99M 5.0Gb GSM1380789/GSM1380790/GSM1380791/GSM1380792
Seed Coat Outer Integument (3 BRs) 66M 3.3Gb GSM1380793/GSM1380794/GSM1380795
Suspensor (3 BRs) 60M 3.0Gb GSM1380796/GSM1380797/GSM1380798
Transcriptome Profiling of Post Mature Green Seeds of Arabidopsis ddcc mutant and wild-type
ddcc mutant post mature green seeds (BR1) 11M 511.9Mb GSM2024779
ddcc mutant post mature green seeds (BR2) 7M 333Mb GSM2024780
wild type post mature green seeds (BR1) 11M 568Mb GSM2024781
wild type post mature green seeds (BR2) 11M 499Mb GSM2024782

Note: LCM - Laser Microdissection; BR - Biological Replicate

SMALL RNA-SEQ (Small RNA Profiling During Soybean Seed Development)

Study Dataset No. Reads No. Bases GEO Accessions
Small RNA Profiling of Soybean Early Maturation Seed Parts
Embryonic Cotyledons 53M 2.7Gb GSM1213859
Embryonic Axis 53M 2.7Gb GSM1213860
Seed Coat 53M 2.7Gb GSM1213858
Small RNA Profiling of Soybean Seed Compartments at Early Maturation Stage Using LCM
Axis Epidermis (2 BRs) 35M 1.8Gb GSM1397256/GSM1397257
Axis Stele (2 BRs) 43M 2.1Gb GSM1397266/GSM1397267
Axis Vascular (2 BRs) 72M 3.6Gb GSM1397268/GSM1397269
Axis Parenchyma (2 BRs) 36M 1.8Gb GSM1397260/GSM1397261
Plumule (2 BRs) 52M 2.6Gb GSM1397258/GSM1397259
Root Tip (2 BRs) 82M 4.1Gb GSM1397262/GSM1397263
Shoot Meristem (2 BRs) 73M 3.6Gb GSM1397264/GSM1397265
Cotyledon Abaxial Parenchyma (2 BRs) 54M 2.7Gb GSM1397272/GSM1397273
Cotyledon Adaxial Parenchyma (2 BRs) 53M 2.7Gb GSM1397276/GSM1397277
Cotyledon Abaxial Epidermis (2 BRs) 59M 2.9Gb GSM1397270/GSM1397271
Cotyledon Adaxial Epidermis (2 BRs) 59M 2.9Gb GSM1397274/GSM1397275
Cotyledon Vascular Bundle (2 BRs) 80M 4.0Gb GSM1397278/GSM1397279
Endosperm (2 BRs) 115M 5.8Gb GSM1397280/GSM1397281
Hilum (2 BRs) 89M 4.4Gb GSM1397284/GSM1397285
Seed Coat Parenchyma (2 BRs) 36M 1.8Gb GSM1397288/GSM1397289
Seed Coat Hourglass (2 BRs) 60M 3.0Gb GSM1397282/GSM1397283
Seed Coat Palisade (2 BRs) 58M 2.9Gb GSM1397286/GSM1397287
Small RNA Profiling of Soybean Seed Compartments at Cotyledon Stage Using LCM
Embryo Proper (2 BRs) 84M 4.2Gb GSM1396335/GSM1396336
Embryonic Axis (2 BRs) 27M 1.3Gb GSM1396337/GSM1396338
Embryonic Cotyledon (2 BRs) 40M 2.0Gb GSM1396339/GSM1396340
Endosperm (2 BRs) 107M 5.3Gb GSM1396341/GSM1396342
Inner Integument (2 BRs) 115M 5.7Gb GSM1396347/GSM1396348
Outer Integument (2 BRs) 58M 2.9Gb GSM1396349/GSM1396350
Hilum (2 BRs) 44M 2.2Gb GSM1396345/GSM1396346
Seed Coat Epidermis (2 BRs) 84M 4.1Gb GSM1396343/GSM1396344
Small RNA Profiling of Soybean Seed Compartments at Heart Stage Using LCM
Embryo Proper (2 BRs) 51M 2.5Gb GSM1396278/GSM1396279
Suspensor (2 BRs) 82M 4.1Gb GSM1396288/GSM1396289
Endosperm (2 BRs) 53M 2.6Gb GSM1396280/GSM1396281
Endothelium (2 BRs) 97M 4.8Gb GSM1396276/GSM1396277
Inner Integument (2 BRs) 62M 3.1Gb GSM1396284/GSM1396285
Outer Integument (2 BRs) 63M 3.1Gb GSM1396286/GSM1396287
Hilum (2 BRs) 103M 5.1Gb GSM1695750/GSM1695751
Seed Coat Epidermis (2 BRs) 69M 3.4Gb GSM1396282/GSM1396283
Small RNA Profiling of Soybean Seed Compartments at Globular Stage Using LCM
Embryo Proper (2 BRs) 60M 3.0Gb GSM1695746/GSM1695747
Endosperm (2 BRs) 57M 2.8Gb GSM1394967/GSM1394968
Inner Integument (2 BRs) 77M 3.8Gb GSM1394971/GSM1394972
Outer Integument (2 BRs) 58M 2.9Gb GSM1394973/GSM1394974
Hilum (2 BRs) 73M 3.6Gb GSM1394969/GSM1394970
Epidermis (2 BRs) 54M 2.7Gb GSM1695748/GSM1695749

Note: LCM - Laser Microdissection; BR - Biological Replicate

GENECHIP (Transcriptome Profiling of Seed Development Using GeneChip Arrays)

Study Seed Stage No. Compartments Studied No. GeneChip Experiments GEO Series Accessions
Transcriptome Profiling of Soybean Seed Compartments Using LCM
Globular 8 24 GSE6414
Heart 8 19 GSE7511
Cotyledon 8 16 GSE7881
Early Maturation 16 32 GSE8112
Transcriptome Profiling of Arabidopsis Seed Compartments Using LCM
Pre-globular 6 12 GSE12402
Globular 7 15 GSE11262
Heart 6 14 GSE15160
Linear Cotyledon 6 12 GSE12403
Bending Cotyledon 5 10 GSE20039
Mature Green 6 12 GSE15165

Note: LCM - Laser Microdissection


This section provides software used or developed for the analysis of large datasets.


Click here to download the latest version (Chipenrich-1.42a.jar)

We modified the ChipEnrich software program (Orlando et al., 2009) to identify GO terms, metabolic pathways, transcription factor families, and DNA sequence motifs overrepresented in coexpressed gene sets and to discover potential transcriptional modules. This Java program was originally developed to identify significantly enriched GO terms (2009 download) and transcription factor families from gene lists. Significance of enrichment is reported as p values calculated from the hypergeometric distribution (Gadbury et al., 2009) using the Apache Commons Math library. The following functions were added to ChipEnrich:

Metabolic pathway enrichment analysis: Genes represented on the ATH1 GeneChip were annotated according to metabolic pathways described in the PATHWAYS database from AraCyc (2008 download). Enrichment was defined as the ratio of (i) the number of AGI locus identifiers in the query list annotated as belonging to a pathway to (ii) the number of AGI locus identifiers associated with the pathway on the GeneChip, compared with the ratio of (iii) the total number of AGI locus identifiers present in the query list to (iv) the total number of AGI locus identifiers present on the GeneChip.
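As a rough illustration of the calculation described above, the following Python sketch computes the enrichment ratio from the four counts (i) to (iv) and an upper-tail hypergeometric p value. ChipEnrich itself is a Java program that uses the Apache Commons Math library; this sketch relies only on the Python standard library, and the function names and example counts are hypothetical.

```python
from math import comb


def enrichment_ratio(k, K, n, N):
    """Ratio (i)/(ii) compared with ratio (iii)/(iv).

    k: query-list loci annotated to the pathway        (i)
    K: loci on the GeneChip annotated to the pathway   (ii)
    n: total loci in the query list                    (iii)
    N: total loci on the GeneChip                      (iv)
    """
    return (k / K) / (n / N)


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n loci are drawn without replacement from N,
    of which K belong to the pathway."""
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts, for illustration only.
ratio = enrichment_ratio(k=5, K=10, n=50, N=1000)
p_value = hypergeom_upper_tail(k=5, K=10, n=50, N=1000)
print(f"enrichment ratio = {ratio:.1f}, p = {p_value:.3g}")
```

A pathway would be reported as enriched when the p value falls below the chosen threshold.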

DNA motif enrichment analysis: Gene sets were analyzed to identify enriched DNA sequence motifs known to interact with TFs (Arabidopsis Gene Regulatory Information Server, August 2009) that are located in the region 1 kb upstream of the gene's transcription start site (TAIR9), as described by others (O'Connor et al., 2005; Vandepoele et al., 2009). The background distribution was determined by identifying DNA motifs for all genes represented as singletons on the Arabidopsis ATH1 GeneChip. Statistical enrichment (p value < 0.001) was determined for each gene list using the hypergeometric distribution. Enriched DNA sequence motifs were also identified among genes overrepresented for a GO term within a gene list.
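The counting step behind this analysis can be sketched in Python as follows. The sketch assumes that upstream sequences are already available as plain strings keyed by gene identifier and that a motif is an exact string searched on both strands; the gene names, sequences, and motif below are made up for illustration, and real motifs with degenerate positions would need a pattern-matching step not shown here.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def genes_with_motif(promoters, motif):
    """Genes whose upstream sequence contains the motif on either strand."""
    motif = motif.upper()
    rc = reverse_complement(motif)
    return {
        gene
        for gene, seq in promoters.items()
        if motif in seq.upper() or rc in seq.upper()
    }


def motif_counts(promoters, query_genes, motif):
    """Return (k, K, n, N): motif-containing genes in the query list,
    motif-containing genes in the background, query size, background size."""
    hits = genes_with_motif(promoters, motif)
    query = promoters.keys() & set(query_genes)
    return len(hits & query), len(hits), len(query), len(promoters)


# Toy background of three genes with 10-nt "promoters" (hypothetical data).
promoters = {
    "geneA": "TTACGTGTAA",  # motif on the forward strand
    "geneB": "GGACACGTCC",  # motif on the reverse strand
    "geneC": "AAAAAAAAAA",  # no motif
}
print(motif_counts(promoters, ["geneA", "geneC"], "ACGTGT"))
```

The four counts returned are the inputs a hypergeometric test needs to decide whether the motif is overrepresented in the query list relative to the background.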

Putative Transcriptional Modules: To discover putative transcriptional modules, we associated significantly enriched DNA sequence motifs with transcription factors known or predicted to bind the motifs. We used known interactions between transcription factors and DNA motifs specified in AtcisDB (Davuluri et al., 2003) and defined by others in the literature and assumed that transcription factors of a particular family bind to the same DNA motif (Brady et al., 2007). Two variations of this approach were used. In the first approach, we associated DNA motifs significantly enriched within a coexpressed gene set with their cognate TFs that were included in the coexpressed gene set. In the second, we identified DNA motifs that were significantly enriched for genes corresponding to an overrepresented GO term and associated coexpressed TFs known or predicted to bind the enriched DNA motifs. Using ChipEnrich, overrepresented GO terms, DNA motifs, and their associated TFs were compiled into two Cytoscape-compatible files that were used as network and node attribute files, and the modules were visualized with Cytoscape.
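A minimal Python sketch of the first approach is shown below. It assumes three lookup tables that are not part of the original description: a TF-to-family map, a motif-to-binding-families map, and a set of experimentally determined (TF, motif) pairs. All names and data are hypothetical. A TF in the gene set is linked to an enriched motif as "known" when the pair is documented, and as "predicted" when only its family is documented to bind the motif.

```python
def motif_tf_edges(enriched_motifs, gene_set, tf_family, motif_families, known_pairs):
    """Link enriched motifs to TFs present in the same coexpressed gene set.

    Returns a list of (tf, motif, kind) tuples where kind is "known" for an
    experimentally determined pair and "predicted" for a family-level inference.
    """
    edges = []
    for motif in sorted(enriched_motifs):
        families = motif_families.get(motif, set())
        for gene in sorted(gene_set):
            if (gene, motif) in known_pairs:
                edges.append((gene, motif, "known"))
            elif tf_family.get(gene) in families:
                # Assumption from the text: members of a TF family bind the same motif.
                edges.append((gene, motif, "predicted"))
    return edges


# Hypothetical inputs for illustration.
edges = motif_tf_edges(
    enriched_motifs={"G-box"},
    gene_set={"tf1", "tf2", "gene3"},
    tf_family={"tf1": "bZIP", "tf2": "bZIP", "tf9": "MYB"},
    motif_families={"G-box": {"bZIP"}},
    known_pairs={("tf1", "G-box")},
)
for edge in edges:
    print(edge)
```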

Data Output: Outputs are summarized in a text file, significant.txt, in which the gene set name is in the first column, enriched GO terms, DNA motifs, or transcription factor families are listed in the second column, and p values indicating the significance of enrichment are given in the third column. Each new enriched category is set in a new row. If a DNA motif is significantly overrepresented within a gene list (p < 0.001), it is also determined if the motif is enriched among genes significantly overrepresented for a GO term (p < 0.001). In the significant.txt file, the overrepresented GO terms are listed in the first column, enriched DNA motifs are in the second column, and p values are in the third column. TFs in the gene list (first column) that are predicted or known to bind with enriched DNA motifs (second column) are also listed in the file. A separate node attribute file is also provided from ChipEnrich that describes whether a node (first column of significant.txt file) is a pattern, GO term, DNA motif, transcription factor family, or transcription factor.
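For readers who want to post-process the output, the following Python sketch reads a three-column file laid out as described above and keeps rows below a p value cutoff. It assumes the file is tab-delimited, which the description does not state, and it skips rows whose third column is not numeric (for example, TF-to-motif rows without a p value). The function name and sample rows are hypothetical.

```python
import csv
import io


def read_significant(handle, alpha=0.001):
    """Return (source, enriched_category, p_value) rows with p_value < alpha."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if len(record) < 3:
            continue
        try:
            p_value = float(record[2])
        except ValueError:
            continue  # header line or row without a numeric p value
        if p_value < alpha:
            rows.append((record[0], record[1], p_value))
    return rows


# Made-up content standing in for significant.txt.
sample = "setA\tGO:0009793\t1e-5\nsetA\tG-box\t0.01\ntf1\tG-box\t\n"
for row in read_significant(io.StringIO(sample)):
    print(row)
```

With a real file, the same function can be called on an open file handle, for example `read_significant(open("significant.txt", newline=""))`.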

The significant.txt file is designed to be used as the network file for the network graphing software Cytoscape (version 2.6.3). The node.txt file is used as the attributes file (Cline et al., 2007). P values are also imported with the network file as edge attributes. For visualization purposes, a thicker line represents a lower p value, a dashed line represents a TF with a predicted binding interaction, and a solid red edge represents an experimentally determined TF-DNA motif interaction.


  • Brady, S.M., Orlando, D.A., Lee, J.Y., Wang, J.Y., Koch, J., Dinneny, J.R., Mace, D., Ohler, U., and Benfey, P.N. (2007). A high-resolution root spatiotemporal map reveals dominant expression patterns. Science 318, 801-806.
  • Cline, M.S., Smoot, M., Cerami, E., Kuchinsky, A., Landys, N., Workman, C., Christmas, R., Avila-Campilo, I., Creech, M., Gross, B., et al. (2007). Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2, 2366-2382.
  • Davuluri, R.V., Sun, H., Palaniswamy, S.K., Matthews, N., Molina, C., Kurtz, M., and Grotewold, E. (2003). AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinform 4, 25.
  • Gadbury, G.L., Garrett, K.A., and Allison, D.B. (2009). Challenges and approaches to statistical design and inference in high-dimensional investigations. Methods Mol Biol 553, 181-206.
  • O'Connor, T.R., Dyreson, C., and Wyrick, J.J. (2005). Athena: a resource for rapid visualization and systematic analysis of Arabidopsis promoter sequences. Bioinformatics 21, 4411-4413.
  • Orlando, D.A., Brady, S.M., Koch, J.D., Dinneny, J.R., and Benfey, P.N. (2009). Manipulating large-scale Arabidopsis microarray expression data: identifying dominant expression patterns and biological process enrichment. Methods Mol Biol 553, 57-77.
  • Vandepoele, K., Quimbaya, M., Casneuf, T., De Veylder, L., and Van de Peer, Y. (2009). Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and coexpression networks. Plant Physiol 150, 535-546.

A Soybean Seed Transcription Factor RNAi Knock-Out Collection

To study the functions of transcription factor genes active during soybean seed development, we collaborated with Dr. David Somers (Monsanto) to generate a collection of soybean seed RNAi knock-out lines. We used the CaMV 35S gene promoter to generate RNAi lines for 63 transcription factor genes that are expressed in specific seed regions at the globular, heart, cotyledon, and early-maturation stages of development. RNAi transgenes were integrated into the soybean genome, and R0 lines containing a single RNAi transgene were isolated and grown to maturity in the greenhouse. The developing R1 and R2 seed populations were screened for abnormalities in seed and vegetative development. A preliminary screen yielded three lines with significant phenotypes. The remainder appeared similar to wild type under our screening conditions. The table below lists each transcription factor gene knock-out line with its seed expression profile and RNAi phenotype.


This work has been published in Plant Physiology (click here to read the manuscript). If you have used our data, please cite the manuscript using the following reference:

John Danzer, Eric Mellott, Anhthu Q. Bui, Brandon H. Le, Patrick Martin, Meryl Hashimoto, Jeanett Perez-Lesher, Min Chen, Julie M. Pelletier, David A. Somers, Robert B. Goldberg, and John J. Harada. Down-Regulating the Expression of 53 Soybean Transcription Factor Genes Uncovers a Role for SPEECHLESS in Initiating Stomatal Cell Lineages during Embryo Development. Plant Physiology 2015 Jul;168(3):1025-35. doi: 10.1104/pp.15.00432.

Phenotypes of Some RNAi Knock-Out Lines

A Complete Summary of RNAi Knock-Out Lines

Click here to see the abbreviation of stages and compartments.

To find the expression profile of a target gene during seed development, first go to the "Browse Soybean mRNAs Profiling Database" page. Next, type the target gene name (e.g., Glyma04g41710) in the "Predicted Gene Model ID" window and click the "Submit Query" button to search the database. Lastly, on the search results page, click on the probe set corresponding to the target gene to view the expression profile.

Soybean IVT Array Annotation

Sequences used for BLAST came from the Affymetrix Soybean target sequences. Sequence information can be obtained directly from Affymetrix. The Affymetrix Soybean target sequence was based on the NCBI Unigene Build 13 (November, 2003). Probe design was based on the NCBI Unigene Build as well as the Affymetrix in-house clustering algorithm. Affymetrix in-house clustering probes are designated with the prefix "GmaAffx".

BLASTX analysis was carried out using soybean target sequences searched against all Arabidopsis proteins (TAIR ATH1_pep_cm_20040228). We filtered out any results with an e-value greater than 1e-02. Because one soybean sequence can hit many different Arabidopsis sequences, we selected the top Arabidopsis hit from each BLAST result as the corresponding Arabidopsis sequence. The e-value for that hit is displayed in the annotation file. Therefore, each soybean probe set has an associated Arabidopsis annotation (if available) and an e-value indicating the degree of homology between the soybean and Arabidopsis sequences. In cases where no Arabidopsis hit was identified (~9000 soybean probe sets did not have homology to any Arabidopsis proteins), we searched the soybean sequence with BLAST against rice proteins (Build #2 from TIGR) and the NCBI non-redundant protein database. We annotated only the soybean probe sets; features from H. glycines and P. sojae that are on the GeneChip were not annotated.
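The filtering and top-hit selection can be sketched in Python as below. The sketch assumes hits have already been parsed into (query, subject, e-value) tuples and takes "top hit" to mean the hit with the lowest e-value, which is an interpretation rather than something the text spells out. Identifiers in the example are hypothetical.

```python
def top_hits(hits, max_evalue=1e-2):
    """Keep the best-scoring subject per query among hits passing the cutoff.

    hits: iterable of (query_id, subject_id, evalue) tuples.
    Returns {query_id: (subject_id, evalue)}.
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue  # discard weak hits, as in the filtering step above
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical BLASTX results for two probe sets.
hits = [
    ("probeset1", "AT_hit_1", 1e-30),
    ("probeset1", "AT_hit_2", 1e-50),
    ("probeset2", "AT_hit_3", 0.5),
]
annotated = top_hits(hits)
unannotated = {query for query, _, _ in hits} - annotated.keys()
print(annotated)
print(unannotated)
```

Queries left in `unannotated` correspond to the probe sets that would then be searched against the rice and non-redundant protein databases.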

Recently, we annotated the soybean GeneChip to the draft soybean genome sequences.


Sept. 25, 2009 - We mapped individual probes to soybean predicted gene models (generated by the Department of Energy (DOE) Joint Genome Institute, Glyma version 1.01, released April 7, 2009) using BLASTN (≥ 23/25 nucleotide identity) to associate soybean array probe sets with soybean gene models. Probe sets in which at least 9 of 11 probes map to the same genomic locus are represented in the files below. Probe sets that did not meet these criteria (i.e., ≥ 23/25 nucleotide identity and ≥ 9/11 probes per probe set) were not included. We split the results into two files based on the confidence of prediction of the soybean gene models. Click the files below to download the association of soybean array probe sets and soybean gene models.
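The two criteria above can be expressed as a short Python sketch. It assumes the BLASTN results have been reduced to (probe ID, locus, number of identical nucleotides) tuples per probe set; this data layout, the function name, and the identifiers in the example are hypothetical.

```python
def assign_probe_sets(probe_hits, min_identical=23, min_probes=9):
    """Assign probe sets to loci supported by enough well-matching probes.

    probe_hits: {probe_set_id: [(probe_id, locus_id, n_identical), ...]}
    Returns {probe_set_id: [locus_id, ...]} for probe sets meeting both criteria.
    """
    assigned = {}
    for probe_set, hits in probe_hits.items():
        probes_per_locus = {}
        for probe_id, locus, n_identical in hits:
            if n_identical >= min_identical:  # at least 23 of 25 nucleotides identical
                probes_per_locus.setdefault(locus, set()).add(probe_id)
        passing = sorted(
            locus
            for locus, probes in probes_per_locus.items()
            if len(probes) >= min_probes  # at least 9 of the 11 probes
        )
        if passing:
            assigned[probe_set] = passing
    return assigned


# Hypothetical example: ps1 passes (9 good probes), ps2 fails (only 8).
example = {
    "ps1": [(f"p{i}", "locusA", 25) for i in range(9)]
    + [(f"p{i}", "locusA", 22) for i in range(9, 11)],
    "ps2": [(f"q{i}", "locusB", 24) for i in range(8)],
}
print(assign_probe_sets(example))
```

Counting distinct probe IDs per locus, rather than raw hits, keeps a probe that matches the same locus more than once from being counted twice.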

Feb. 1, 2009 - We updated the annotation of the soybean array information based on information from TAIR 7.0, TIGR, and Peking Transcription Factor databases as of October 2007. The updated information is available from the following link.

Distribution of All Probe Sets on the Soybean Array (2007)

Soybean Whole Transcript Genome Array


We created this Soybean Whole Transcript (WT) Array to interrogate all the genes in the genome. The first generation Affymetrix Soybean Genome array was designed by the Soybean Consortium using publicly available soybean full-length cDNAs and ESTs. The Soybean Genome array consists of 37,000 probe sets interrogating ~25,000 distinct genes/transcripts. The release of the whole genome sequence of soybean allowed the creation of an array that can survey all the genes (both high and low confidence gene models) in the genome [Schmutz et al., Nature 463 pp. 178-83 (2010)].


The design of the Soybean WT array is different from the Soybean Genome array. For the Soybean Genome array, probes were selected to correspond to the 3’ end of the transcript or cDNA. However, for the Soybean WT array, probes were selected to span every exon of the predicted gene models/transcripts, if possible. This approach allows for the interrogation of the transcript (from 5’ to 3’) and can help determine exon usage in different splice variants that may be differentially expressed in specific tissues or compartments. For more information regarding this array design, please see the references available from Affymetrix.

Note: This array was designed for studying both Soybean and Medicago (i.e. a Legume array). There are sequences on the array corresponding to Medicago cDNAs. However, our main focus will be on the Soybean sequences on the array.

Sequence Data:

All sequence data used to design probes on the array were obtained from the Department of Energy - Joint Genome Institute (DOE-JGI) web site (Phytozome). Probes were designed from the first draft assembly of the soybean genome (version 1.0). The probe selection algorithm was developed by Christopher Davies and Brant Wong at Affymetrix.

Publication Acknowledgement:

The array was designed through a collaboration between our lab (Goldberg Lab) and Affymetrix, with advice and suggestions from other members of the soybean community, including Randy Shoemaker.

Please acknowledge the following people for the design of this array:

Goldberg Lab: Bob Goldberg, Brandon Le, Chen Cheng, Min Chen, and Anhthu Bui

Affymetrix: Gene Tanimoto, Christopher Davies, Stan Trask, Brant Wong, Eric Schell, Xue Mei Zhou, and Patricia Chan

Files for Download

[Probe Association File]

We've created a text file that correlates Affymetrix probe ID with associated probe sequence, gene and exon information, etc.

Probe Association File: [Click Here to Download]

[Soybean SENSE WT Array]

This array design is available to the general public and can be purchased through Affymetrix.

Library File: [Click Here to Download]

Labeling Protocol: Check the Affymetrix Website for labeling and hybridization kits [Go to Affymetrix Website]

[Soybean ANTISENSE WT Array]

This array was created for our lab and is a custom-designed antisense WT array. Please use the library file and protocols listed below for this array only.

Library File: [Click Here to Download]

Labeling Protocols:

  • Labeling Protocol One: Nugen Ovation Pico WTA System

    Click on the link to go to the product web site [Link]

  • Labeling Protocol Two: Ambion WT Expression Kit with Affymetrix Second Strand cDNA Synthesis

    [Click Here to Download Protocol]

This labeling protocol is presented as is and is not regularly supported by the Affymetrix Technical Support team. This method requires an Ambion WT Expression kit, Affymetrix Fragmentation and Terminal Labeling kit, and second strand cDNA synthesis reagents from vendors provided in the attached protocol. For this protocol, you will generate cRNA using the Ambion WT Expression kit (up to Day2 Workflow, Step2). After cRNA synthesis, you will use the Affymetrix protocol (starting on page 9) to make the second cycle cDNA and terminally-labeled targets.

[Hybridization Program]

To wash, stain, and scan the array, follow the GeneChip Expression Analysis Technical Manual (Section 2: Eukaryotic Sample and Array Processing), using the fluidics protocol EuKGE-WS2v5_450 for the wash and stain procedures.

Arabidopsis ATH1 Array Annotation

The Arabidopsis ATH1 array was annotated in 2003 using all the publicly available resources at the time. In order to keep up with the increasing amount of information generated within the past four years since the annotation of the ATH1 array, we decided to re-annotate the ATH1 array in parallel with the soybean genome array.

The strategy for the re-annotation of the ATH1 array is as follows:

1. We updated the descriptions for each probe set on the array using TAIR Affy array descriptions (affy_ATH1_array_elements-2007-5-2.txt). The description file was downloaded from the TAIR web site. Descriptions were based on the latest release of the Arabidopsis genome, TAIR 7 (released 04-11-07).

Note from TAIR: The mapping to the TAIR7 Transcripts was performed using the BLASTN program with e-value cutoff < 9.9e-6. For the 25-mer oligo probes used on the Affy chips, the required match length to achieve this e-value is 23 or more identical nucleotides. To assign a probe set to a given locus, at least 9 of the probes included in the probe set were required to match a transcript at that locus. Otherwise, the probe set was not assigned a locus and was given the description "no match".

2. In addition to updating the descriptions for each probe set, we also updated gene ontology (GO) information provided by Affymetrix.

3. We gathered information about putative transcription factors from many publicly available TF databases for Arabidopsis.

Transcription factors and transcription factor families were associated with each probe set on the array. Information obtained from points 1-3 was compiled into an annotation file containing the 2003 ATH1 annotations. Transcription factors were automatically updated based on the information obtained from the databases in point 3.

4. We focused on probe sets that were previously assigned to the "unclassified" category. The rationale is that many of the sequences in the "unclassified" category might have updated information that can be used to re-assign them to a different category. Sequences previously assigned to categories such as "protein synthesis" or "metabolism" most likely will not change. Therefore, we first focused on re-assigning the 11,145 probe sets classified as "unclassified" in 2003.

5. After the "unclassified" category was re-examined, we decided to re-examine all 22,746 probe sets on the array for consistent assignment of functional categories. We sorted all the probe sets by their description and made sure that probe sets with similar descriptions were assigned the same functional category.

6. We further examined the "unclassified" category that is divided into three groups as follows:

  • Unclassified - hypothetical proteins with no cDNA support
  • Unclassified - hypothetical proteins with cDNA support
  • Unclassified - proteins with unknown function

To distinguish the different sequences within the unclassified category, we downloaded several files from the TAIR site, including:

  • TAIR7_protein_coding_no_transcript_support_09_30_07
  • TAIR7_protein_coding_with_transcript_support_09_30_07
  • TAIR7_unknown_proteins_no_transcript_support_09_30_07
  • TAIR7_proteins_of_undefined_function_03_07
  • TAIR7_unknown_proteins_03_07
  • TAIR7_locus_type

These files were compiled into one main table listing all the transcripts detected and/or predicted in the Arabidopsis genome. This list helps distinguish if a sequence has cDNA support, represents a pseudogene/transposon, or is unknown. These files help re-assign the probe sets into appropriate unclassified categories.
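As an illustration of the consistency check in step 5, the following Python sketch groups probe sets by description and reports descriptions that carry more than one functional category. It treats "similar" as identical after trimming whitespace and ignoring case, which is a simplification of the manual review described above; the function name and example rows are hypothetical.

```python
from collections import defaultdict


def inconsistent_descriptions(annotations):
    """Find descriptions assigned to more than one functional category.

    annotations: iterable of (probe_set_id, description, category) tuples.
    Returns {normalized_description: [category, ...]} for conflicting cases.
    """
    categories = defaultdict(set)
    for _probe_set, description, category in annotations:
        # Normalization rule is an assumption: trim and lower-case only.
        categories[description.strip().lower()].add(category)
    return {
        description: sorted(found)
        for description, found in categories.items()
        if len(found) > 1
    }


# Hypothetical annotation rows.
rows = [
    ("ps_a", "Expressed protein", "unclassified"),
    ("ps_b", "expressed protein ", "metabolism"),
    ("ps_c", "protein kinase", "signal transduction"),
]
print(inconsistent_descriptions(rows))
```

Descriptions flagged this way would still need manual review to decide which category is correct.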


The updated information is available from the following link.

Distribution of All Probe Sets on the Arabidopsis ATH1 Array (2007)


Presentations Relevant to This Project

Bob Goldberg

  • 2012
    • The Food Dialogues - A Public Conversation, Los Angeles, CA and New York, NY (2012)
    • International Congress of Plant Molecular Biology, Jeju, Korea (2012)
  • 2011
    • University of Arizona, Tucson, AZ (2011) [Download pdf]
    • Plant and Animal Genome Conference Affymetrix Workshop, San Diego, CA (2011)
  • 2010
    • 13th Biennial Molecular & Cellular Biology of the Soybean Conference, Durham, NC (2010) [Download pdf]
    • Dow AgroSciences Distinguished Scientist Lecture Series, Indianapolis, IN (2010)
    • Plant and Animal Genome Conference, San Diego, CA (2010)
  • 2009
    • HHMI Professors Meeting, Washington, DC (2009)
    • UCLA Parents Weekend - Keynote Speaker, UCLA, Los Angeles, CA (2009)
    • UCLA Science Board Lecture, UCLA, Los Angeles, CA (2009)
  • 2008
    • Ohio University MCDB Retreat Research Lecture, Athens, OH (2008)
    • Genome Canada's 3rd International Conference, Vancouver, CANADA (2008)
    • International Conference on Legume Genetics and Genomics, Puerto Vallarta, MEXICO (2008)
    • Faculty Science Research Colloquium Lecturer, UCLA, Los Angeles, CA (2008) [Download pdf]
    • Ueli Wobus at 65 Seed Biology Symposium, Gatersleben, GERMANY (2008) [Download pdf]
    • XX International Congress on Sexual Plant Reproduction, Brasilia, BRAZIL (2008) [Download pdf]
    • New Phytologist Symposium, Mount Hood, OR (2008) [Download pdf]
    • Peers Undergraduate Orientation Research Lecture, UCLA, Los Angeles, CA (2008) [Download pdf]
    • University of Saskatchewan Seed Development Symposium, Saskatoon, Saskatchewan, CANADA (2008)
    • Plant and Animal Genome Conference, San Diego, CA (2008)

John Harada

  • 2013
    • Biotechnology Center in Southern Taiwan, Tainan, TAIWAN (2013)
    • Agricultural Biotechnology Research Center, Academia Sinica, Taipei, TAIWAN (2013)
    • Brazilian Symposium on Plant Molecular Genetics, Bento Gonçalves, BRAZIL (2013)
    • Plant Cell and Molecular Biology Conference, Suzhou, CHINA (2013)
    • Plant Biology 2013, American Society of Plant Biologists, Providence, RI (2013)
    • American Society of Plant Biologists, Providence, RI - Minisymposium Speaker and MAC Symposium Co-Organizer (2013)
  • 2012
    • Donald Danforth Plant Science Center, St. Louis, MO (2012)
    • Syngenta Global Seed Care Institute Symposium, Basel, SWITZERLAND (2012)
    • American Society of Plant Biologists, Austin, TX - MAC Symposium Organizer (2012)
    • University of New Mexico, Albuquerque, NM (2012)
    • American Society of Plant Biologists Minority Affairs Committee Professional Development Workshop - Workshop Presenter, Albuquerque, NM (2012)
    • SACNAS Conference - Scientific Symposium Organizer, Seattle, WA (2012)
    • International Symposium on Biocatalysis and Biotechnology - Keynote Speaker, Sonoma, CA (2012)
    • Annual Biomedical Research Conference for Minority Students - Leader for Networking Session, San Jose, CA (2012)
    • National Chung-Hsing University, Taichung, TAIWAN (2012)
  • 2011
    • University of Alberta, Edmonton, AB, CANADA (2011)
    • American Society of Plant Biologists Minority Affairs Committee Professional Development Workshop - Workshop Presenter, Cal Poly Pomona, CA (2011)
    • 10th International Society for Seed Science Conference, Salvador, BRAZIL (2011)
    • University of Massachusetts, Boston, MA (2011)
    • American Society of Plant Biologists, Minneapolis, MN - MAC Symposium Organizer (2011)
    • Noble Foundation, Ardmore, OK (2011)
    • SACNAS Conference - Scientific Symposium Organizer, San Jose, CA (2011)
    • NAIST Symposium, Nara, JAPAN (2011)
    • National Chung-Hsing University, Taichung, TAIWAN (2011)
  • 2010
    • American Society of Plant Biologists Minority Affairs Committee Professional Development Workshop - Workshop Presenter, Bowie State University, MD (2010)
    • American Society of Plant Biologists Western Section Meeting, Washington State University, Pullman, WA (2010)
    • American Society of Plant Biologists, Honolulu, HI - MAC Symposium Organizer (2010)
    • NSF Plant Genome Awardees Meeting, Arlington, VA (2010)
    • SACNAS Conference, Anaheim, CA - Scientific Symposium Organizer (2010)
    • University of California, Los Angeles, CA (2009)
    • National Chung-Hsing University, Taichung, TAIWAN (2010)
  • 2009
    • Trilateral Conference, National Chung-Hsing University, TAIWAN (2009)
    • Academia Sinica, Taipei, TAIWAN (2009)
    • American Society of Plant Biologists, Honolulu, HI - MAC Symposium Organizer (2009)
    • SACNAS Conference, Dallas, TX - Scientific Symposium Organizer (2009)
    • International Congress of Plant Molecular Biology, St. Louis, MO (2009)
    • XIII National Congress of Plant Molecular Biology and 6th Mexico-USA Symposium, Guanajuato, MEXICO (2009)
  • 2008
    • College of Biological Science, Peking University, Beijing, CHINA (2008)
    • Institute of Genetics and Development, Chinese Academy of Sciences, Beijing, CHINA (2008)
    • Institute of Botany, Chinese Academy of Sciences, Beijing, CHINA (2008)
    • Sonoma State University, Rohnert Park, CA (2008)
    • International Congress on Sexual Plant Reproduction, Brasilia, BRAZIL (2008)
    • BASF, Research Triangle Park, NC (2008)
    • Monsanto, AgraCetus Campus, Middleton, WI (2008)
    • University of Missouri, Columbia, MO (2008)
    • Texas A&M University, College Station, TX (2008)
    • University of Arizona, Tucson, AZ (2008)
  • 2007
    • National Chung-Hsing University, Taichung City, TAIWAN (2007)
    • Academia Sinica, Taipei, TAIWAN (2007)
    • National University of Taiwan, Taipei, TAIWAN (2007)

Miscellaneous Videos

These movies are best viewed in QuickTime. Click here to download QuickTime. To download a video, Mac users should hold CTRL and click the link; PC users should right-click the link and select the download option.

  • Seed Development Movie (2008) Developed by Brandon Le and Bob Goldberg [Download video]

  • Laser-capture microdissection of Arabidopsis seed compartments [Download video]

  • Laser-capture microdissection of soybean seed compartments [Download video]