2025
SOYGEN3: Building capacity to increase soybean genetic gain for yield through combining genomics-assisted breeding with characterization of future environments (year 3 of 3)
Category:
Sustainable Production
Keywords:
(none assigned)
Lead Principal Investigator:
Aaron Lorenz, University of Minnesota
Co-Principal Investigators:
Asheesh Singh, Iowa State University
William Schapaugh, Kansas State University
Dechun Wang, Michigan State University
Carrie Miranda, North Dakota State University
Katy M Rainey, Purdue University
Leah McHale, The Ohio State University
Eliana Monteverde Dominguez, University of Illinois
Matthew Hudson, University of Illinois at Urbana-Champaign
Nicolas Frederico Martin, University of Illinois at Urbana-Champaign
Andrew Scaboo, University of Missouri
George Graef, University of Nebraska
David Hyten, University of Nebraska at Lincoln
Rex Nelson, USDA/ARS-Iowa State University
+12 More
Project Code:
59010
Contributing Organization (Checkoff):
Institution Funded:
Brief Project Summary:
SOYGEN3 aims to enhance soybean breeding by integrating genomics, phenomics, and environmental modeling. In its final year, the project focuses on genomic selection tools, predictive models for future environments, and structural variant analysis. Key achievements include genotyping 4,000+ breeding lines, launching multi-location yield trials, and identifying 470,000+ structural variants. Expected outcomes include improved genomic tools, better yield stability, and superior soybean germplasm. The project enhances breeding efficiency, ensuring continued genetic gain and economic benefits for U.S. soybean producers. Leveraging existing funding, SOYGEN3 advances public breeding programs with cutting-edge genomic selection and environmental characterization strategies.
Unique Keywords:
#advanced methods in plant breeding, #cultivar-by-environment interactions, #genomic prediction, #yield
Information And Results
Project Summary

SOYGEN3: Building Capacity to Increase Soybean Genetic Gain for Future Environments

Project Overview

SOYGEN3 is a three-year initiative aimed at enhancing soybean genetic gain by integrating genomics-assisted breeding with environmental characterization. This initiative, in its third and final year, is designed to address the challenges of genotype-by-environment interactions, improve yield stability, and develop predictive models for future environments. The project involves a collaboration of multiple universities and institutions across the North Central region.

Project Justification and Rationale
Soybean is a critical crop with high global demand driven by its use in food, feed, and renewable fuel production. Since the 1940s, scientific breeding efforts have significantly improved yield, expanded production regions, and developed varieties with defensive traits. However, genotype-by-environment interactions complicate breeding efforts, requiring broader field testing across diverse environmental conditions. The SOYGEN initiative seeks to address these challenges by leveraging genomics, phenomics, and environmental data to enhance predictive breeding methodologies.

Key Objectives
1. Enhancing Genomics-Assisted Breeding:
o Develop and implement genomic selection tools in public breeding programs.
o Utilize genome-wide markers for genotyping advanced breeding lines.
o Integrate low-pass sequencing technology to generate cost-effective genomic data.
o Establish user-friendly software applications for genomic selection.

2. Predicting Cultivar Performance in Future Environments:
o Conduct multi-environment trials with 1,200 diverse breeding lines.
o Characterize environmental conditions and model genotype-by-environment interactions.
o Utilize UAV imagery to assess canopy development and growth rates.
o Develop predictive models connecting genotype, phenotype, and environmental data.

3. Structural Variant Analysis for Genomic Prediction:
o Sequence 41 SoyNAM founder lines to identify structural variants.
o Evaluate their influence on seed yield, composition, and adaptability.
o Improve genomic prediction models by incorporating structural variant data.

Progress to Date
Significant advancements have been made over the first two years, including:
• Genotyping and Data Management:
o Over 4,000 breeding lines genotyped with genome-wide markers.
o Development of a public genomic selection application integrated with SoyBase.
• Yield Trials and Environmental Modeling:
o Multi-location yield trials initiated with 1,200 elite lines.
o Collection of environmental data to refine predictive models.
• Structural Variant Discovery:
o Identification of over 470,000 structural variants using advanced sequencing tools.
o Initiation of pangenome sequencing for key soybean lines.

Expected Outcomes and Deliverables
• Publicly available genomic selection tools for soybean breeding programs.
• New knowledge on genotype-by-environment interactions and improved predictive models.
• Identification of structural variants impacting yield and seed composition.
• Development of superior soybean germplasm adapted to future environmental conditions.

Economic Impact
This project supports U.S. soybean competitiveness by ensuring continued genetic gain, improving yield stability, and equipping future breeders with cutting-edge genomic tools. The integration of advanced genomic and environmental modeling approaches will enhance breeding efficiency and profitability for soybean producers.

Budget Considerations
The project leverages existing breeding infrastructure and funding sources from multiple institutions. The final year will focus on optimizing resource use to complete genomic analysis, conduct large-scale trials, and refine predictive models for future breeding applications.

Project Objectives

1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.
2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environment characterization.
3. Discover structural variants and test whether modelling structural variants improves genomic predictions for yield and seed composition.

Project Deliverables

The following will be delivered upon completion of this three-year project:
1. Publicly available resources and tools for soybean breeders to implement cost-effective genomic prediction in their programs.
2. Publicly available knowledge on genetic control of genotype-by-environment interaction in soybean, and improved models for prediction of breeding line performance in new environments. Knowledge will be made available through open-access publications, presentations at scientific meetings, and presentations to the seed industry.
3. Identification of important structural variants that control seed yield and composition, and publication of knowledge on any benefit of explicitly modeling structural variants for predicting breeding line performance.
4. Enhanced germplasm and superior varieties, better adapted to future environmental conditions, developed through adoption of genomics-assisted breeding techniques.

Progress Of Work

Updated April 28, 2025:
Objective 1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.

Objective 1 can be divided into five sub-objectives to report specific, significant progress during these past six months.
Sub-obj 1: Continue to genotype with genome-wide and trait-targeted markers all new breeding lines entered in the Northern Uniform Soybean Tests

As reported in the last progress report, 511 new breeding lines submitted to regional trials were grown. Tissue was collected and DNA was extracted from the tissue. We sent DNA of the 511 breeding lines to David Hyten at UNL for genotyping via skim sequencing. Previously we sent samples to a private service vendor, but because of dramatic price increases, we decided to work with co-PI Hyten, who could provide very similar data for 70% of the cost. This delayed data delivery, but we are confident the data will be in hand by May. We have now genotyped over 4,000 advanced breeding lines entered into the public regional trials, creating an impressive resource that helps current and future soybean breeders and geneticists connect genotype to phenotype and develop genomics-assisted breeding resources.

Data from the 2024 NUST trials was collected this past fall. We formatted the data and sent it to Rex Nelson at SoyBase, where it will be uploaded soon.

A manuscript on this work has been submitted to the scientific journal Crop Science. It was recently accepted pending revision. We are currently editing the manuscript for final acceptance.

1) Wartha, C.A., B. Campbell, V. Ramasubramanian, L. Nice, ….19 authors….A.J. Lorenz*. 2025. Genomic analysis and predictive modeling in the Northern Uniform Soybean Tests. Crop Science (Accepted pending revision).

Sub-obj 2: Enable individual public breeding programs to test and use genomic prediction
Originally, this project explicitly funded the integration of genomic prediction into the public soybean breeding pipelines to expedite yield improvement. Because of budget cuts, the funding for this part of the project was removed. Nevertheless, this project has continued to instigate and enable several public programs to start using genomic prediction routinely. Below are some highlights from individual program reports that are part of the SOYGEN initiative.

University of Nebraska
Aiming to generate new recombinant populations with high yield and resistance to biotic stress, the UNL soybean breeding program conducted a genomic selection analysis following the 2024 field trials, utilizing phenotypic datasets from multiple years and locations to train the prediction model. This dataset was formed by UNL lines that belonged to elite populations designed to carry resistance alleles to the Soybean Cyst Nematode (Heterodera glycines) for the rhg-1a//rhg-1a, Rhg-2//Rhg-2, and Rhg-4//Rhg-4 genes. These lines have been evaluated in field trials since 2022. In addition, the Northern Uniform Soybean Tests (NUST) yield datasets from 2012, 2018, 2019, and 2020 were added to train the model. Lines from both datasets have been extensively tested in maturity 2 and maturity 3 locations in Nebraska and surrounding states. UNL lines were genotyped with molecular inversion probes (MIPs), and the NUST lines were genotyped with the 6K SNP chip. Genotypes were imputed and filtered accordingly. For the analyses, yield values were adjusted for the experimental design model and the best linear unbiased estimates (BLUEs) were used as input for the genomic selection analyses. These analyses accounted for the genotype-by-environment interaction (GEI), considering that complex, non-linear interactions between lines and environments regularly occur in the soybean breeding context. The GS4PB R Shiny App (previously known as SOYGEN2 R Shiny App) and its code were used to run the analyses.
Seven lines were selected by the breeder following the analyses. These lines were highly ranked based on their GEBV from the genomic prediction analyses. In addition, they contain at least one allele of interest for the three mentioned genes, and some of these lines are homozygous for two loci of significant interest (rhg-1a//rhg-1a, Rhg-4//Rhg-4).
These seven lines were selected as parents for eight new combinations in the UNL Winter Nursery Crossing Block project, conducted in Puerto Rico between January and April 2025. It is expected that F1 plants of these populations will be planted and genotyped in Nebraska in June 2025. As future directions, these seven new populations will be advanced, and their selected progenies will be tested in multi-environment field trials to identify superior lines for yield and resistance to Soybean Cyst Nematode. Additionally, using SOYGEN2 yield datasets in both regular and sparse genomic selection designs will enable the UNL Soybean Breeding Program to efficiently select superior lines.
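
As an illustration of the step in which BLUEs are used to train a model and candidate lines are ranked by GEBV, the following is a minimal Python sketch of ridge-regression marker-effect estimation (an RR-BLUP-style model). It is not the GS4PB code and does not model GEI; the function names, the simulated 0/2 marker matrix, and the shrinkage value lam=1.0 are all assumptions made for the example.

```python
import numpy as np

def fit_ridge_marker_effects(Z, y, lam):
    """Estimate marker effects by ridge regression on centered markers."""
    marker_means = Z.mean(axis=0)
    Zc = Z - marker_means                      # center each marker column
    mu = y.mean()                              # overall mean of the BLUEs
    m = Z.shape[1]
    # Solve (Zc'Zc + lam * I) u = Zc'(y - mu) for the marker effects u.
    u = np.linalg.solve(Zc.T @ Zc + lam * np.eye(m), Zc.T @ (y - mu))
    return mu, marker_means, u

def predict_gebv(Z_new, marker_means, u):
    """GEBV of new lines: centered genotypes times estimated marker effects."""
    return (Z_new - marker_means) @ u

# Simulated data (hypothetical): inbred lines coded 0/2 at 50 markers.
rng = np.random.default_rng(0)
n_train, n_cand, n_markers = 200, 50, 50
Z_all = rng.choice([0.0, 2.0], size=(n_train + n_cand, n_markers))
true_effects = rng.normal(0.0, 1.0, n_markers)
true_value = Z_all @ true_effects
# "BLUEs" of the training lines: true genetic value plus residual noise.
blues = true_value[:n_train] + rng.normal(0.0, 0.5, n_train)

mu, means, u_hat = fit_ridge_marker_effects(Z_all[:n_train], blues, lam=1.0)
gebv = predict_gebv(Z_all[n_train:], means, u_hat)
ranking = np.argsort(-gebv)                    # candidate indices, best first
accuracy = np.corrcoef(gebv, true_value[n_train:])[0, 1]
```

In practice the shrinkage parameter would be estimated from variance components rather than fixed, and a breeder would combine the GEBV ranking with other criteria such as the SCN resistance alleles mentioned above.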

University of Missouri
Andrew Scaboo’s lab is diving into the collected data from SOYGEN2 in the genomic selection experiment. This experiment tested genomic prediction versus phenotypic selection versus random selection at the University of Minnesota, North Dakota State University, University of Illinois, and University of Missouri. The selection treatments we applied in the original experiment were not as successful as we had hoped. Currently, we are analyzing the data to learn why the genomic selection treatment was not as successful as anticipated, and how we can better understand and utilize it in the future. Because this multi-institutional dataset is large and complex, we are first developing the analysis framework and treatments using the Missouri data only. Several initial analyses were described in the last progress report. During this last reporting period, we extensively evaluated the effect of genotype imputation. Figure 1 shows that the methods of genotype imputation implemented in the “GS4PB” application improve prediction accuracy overall. This indicates that these methods will be powerful approaches for improving the cost-effectiveness of genomic prediction for driving genetic gain in yield.

Figure 1. (See attached document)

University of Minnesota
As described in the last report, the UMN soybean breeding program has refined its GS pipeline and tested it extensively on the UMN Preliminary Yield Trials (PYT) data. PYT 2023 progeny population lines were assayed using a 1K low-density (LD) genotyping assay, and parents of PYT23 lines from the crossing block were assayed using a low-pass sequencing platform to generate high-density (HD) variant data. The 50K SoySNP Chip subset of the HD data set was used as the parental reference panel to impute the 1K LD set to the 50K HD set (~30K SNPs after QC). We used this imputed data to make predictions using genomic prediction models that include GxE interaction effects. In the summer of 2024 we planted a trial including lines selected using genomic prediction and phenotypic selection. The trial was successfully planted at six Minnesota locations, and every location yielded good data at harvest.
During the past reporting period, we analyzed the data to compare the phenotypic selection versus the genomic selections. For each treatment, we selected both high and low yielding lines. Selecting low yielding lines is important as it tests our ability to identify the poorly yielding lines. This ensures we are not devoting precious field phenotyping resources towards lines that are predicted to have low yield.
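
The idea behind imputing low-density progeny genotypes up to a high-density panel using genotyped parents can be shown with a deliberately simplified Python sketch. It is a toy stand-in, not the method implemented in the UMN pipeline or in GS4PB (the report does not describe the algorithm). It assumes a fully inbred progeny line from a biparental cross, 0/2 genotype coding, one chromosome, and sorted marker positions; the function name and all data are hypothetical.

```python
import numpy as np

def impute_from_parents(ld_pos, ld_geno, hd_pos, parent_a, parent_b):
    """Fill high-density genotypes of one inbred progeny line.

    ld_pos, ld_geno: sorted positions and 0/2 genotypes at low-density markers
    hd_pos: sorted positions of high-density markers (must include ld_pos)
    parent_a, parent_b: 0/2 genotypes of the two parents at hd_pos
    Returns an array over hd_pos, with np.nan where parental origin is ambiguous.
    """
    hd_index = {p: i for i, p in enumerate(hd_pos)}
    # Parental origin at each informative LD marker (where the parents differ).
    anchors = []
    for p, g in zip(ld_pos, ld_geno):
        i = hd_index[p]
        if parent_a[i] == parent_b[i]:
            continue                      # uninformative marker
        if g == parent_a[i]:
            anchors.append((p, "A"))
        elif g == parent_b[i]:
            anchors.append((p, "B"))      # other values (e.g. errors) are skipped
    imputed = np.full(len(hd_pos), np.nan)
    for i, p in enumerate(hd_pos):
        if parent_a[i] == parent_b[i]:
            imputed[i] = parent_a[i]      # both parents agree: allele is known
            continue
        left = [o for q, o in anchors if q <= p]
        right = [o for q, o in anchors if q >= p]
        origins = set()
        if left:
            origins.add(left[-1])         # nearest informative marker on the left
        if right:
            origins.add(right[0])         # nearest informative marker on the right
        if len(origins) == 1:             # flanking markers agree on the parent
            origin = origins.pop()
            imputed[i] = parent_a[i] if origin == "A" else parent_b[i]
    return imputed

hd_pos = [10, 20, 30, 40, 50, 60, 70]
parent_a = [0, 2, 0, 0, 2, 0, 0]
parent_b = [2, 2, 2, 2, 0, 2, 2]
imputed = impute_from_parents([10, 40, 70], [0, 0, 2], hd_pos, parent_a, parent_b)
```

Markers between flanking low-density markers that trace to different parents are left missing here because a recombination must lie somewhere in that interval. Production imputation tools instead model recombination probabilistically and return a most likely genotype.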

Sub-obj 3: Development of a genomic prediction R-Shiny app for easy implementation of GS for breeders.
The first version of this application is now complete. We have submitted a peer-reviewed article to the journal Plant Genome.

1) Ramasubramanian, V., C. Wartha, L. Singh, P. Vitale, S. Ru, and A.J. Lorenz*. 2025. GS4PB: An R/Shiny Application to Facilitate a Genomic Selection Pipeline for Plant Breeding. Plant Genome (Submitted).

This manuscript describes the development of the application, the components, how it can be used to execute genomic selection for plant breeding, and how it can be accessed. This application is freely available to the public, and will enable plant breeders to implement genomic selection.

Objective 2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environmental characterization.

Three main tasks are associated with this objective.
1. Conduct a large, multi-institutional, multi-environment trial to create a “genotype-by-environment interaction” dataset for soybean that soybean researchers can use to identify methods to improve genomic prediction methodology for yield.
2. Define environmental variables driving genotype-by-environment interactions that can be used to enhance prediction models.
3. Quantify biomass non-destructively so that crop growth can be evaluated in relation to certain environmental stressors.

Progress on each task:

1. For this objective, we are conducting a multi-environment, multi-institutional coordinated performance trial of 1200 diverse breeding lines. Each breeding line will be phenotyped for several agronomic and phenological traits, and each will be genotyped using low pass resequencing technologies. Detailed environmental data for each growing location in each year will be collected and analyzed. The ultimate goal is to better predict the interactions between the environment and genotype. If successful, we will leverage genomic data, phenotype data, and environmental data to predict how new breeding lines may perform in environments that a producer is most likely to encounter.
Last summer these breeding lines were successfully grown and phenotyped for yield at 21 locations. Data has been delivered to the University of Minnesota in a centralized database. Plans were laid for scanning all 12,400 samples with NIR to measure protein and oil. We worked with Perkin-Elmer to develop a protocol to standardize instruments across universities. Standardization samples were collected across universities to represent diversity in germplasm and growing conditions. From UMN, samples were sent to collaborators for scanning on their instruments to create the standardization file. We anticipate that the samples will be scanned for protein and oil content after planting, when time allows.

Another major activity was the design and packaging of the 2025 yield trials. All seeds have been delivered to 2025 field locations, and 90% of the packaging is complete. Field designs have been completed and sent to cooperators. All is on schedule for a successful 2025 planting.

Genotype data from the Hyten lab was collected and returned on April 11. The genotype data includes 15 million molecular markers (SNPs) imputed from skim sequencing data, with a 50K subset available. The SNPs were mapped to the latest Williams82 genome version, Wm82a6v1. Analysis of these data will proceed over the coming months.

2. We completed the development of the Seasonal Characterization Engine (SCE). The SCE is a specialized tool integrated within the R environment, designed to streamline the analysis of trial data using the Agricultural Production Systems Simulator (APSIM) model. Users begin by uploading trial data specifying key parameters such as location, latitude, longitude, and maturity. Predefined input files, such as the soybean seed composition test, facilitate accurate data setup.

After data upload, users select a crop model template (e.g., soybean or maize) and specify maturity handling methods. Users then choose appropriate weather and soil databases, considering geographic coverage to avoid analysis errors.

The APSIM model generates environmental variables specific to different crop growing stages, providing a detailed characterization of conditions throughout the growing season. Upon initiating the analysis, the SCE provides real-time process updates via the R console.

Analytical results are available through multiple visualization tools:
1. Results Viewer: Offers box plot visualizations of selected variables, downloadable individually or collectively.
2. Heat Maps: Enables detailed inspection of seasonal and environmental covariates by growth stage, comparing specific parameters within consistent genetic maturity groups.
3. Trial Similarities: Presents correlation heat maps illustrating seasonal profile similarities among trial sites, complemented by dendrograms to clarify site relationships. Outputs are readily exportable for further analysis.
4. Thermal Time and Precipitation: Summarizes accumulated thermal time and precipitation over growing seasons, offering comparative analyses between and within sites across multiple years. These insights support understanding climate variability and trends.
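
To make the "Thermal Time and Precipitation" and "Trial Similarities" summaries concrete, here is a small Python sketch. It is not the SCE or APSIM code: it uses a plain growing-degree-day approximation with an assumed base temperature of 10 °C, whereas APSIM computes thermal time with its own crop-specific response. Function names and weather values are hypothetical.

```python
import numpy as np

def accumulated_thermal_time(tmin, tmax, t_base=10.0):
    """Cumulative growing degree days from daily min/max temperature (deg C)."""
    tmean = (np.asarray(tmin, float) + np.asarray(tmax, float)) / 2.0
    daily = np.clip(tmean - t_base, 0.0, None)   # no negative contributions
    return np.cumsum(daily)

def site_similarity(profiles):
    """Pearson correlation between sites; rows are sites, columns are covariates."""
    return np.corrcoef(np.asarray(profiles, float))

# Four days of made-up weather for one site.
tmin = [8.0, 12.0, 15.0, 5.0]
tmax = [18.0, 24.0, 29.0, 11.0]
precip = [0.0, 5.5, 0.0, 12.0]

thermal_time = accumulated_thermal_time(tmin, tmax)   # 3, 11, 23, 23
cum_precip = np.cumsum(precip)                        # 0, 5.5, 5.5, 17.5

# Daily thermal-time profiles for three made-up sites; site 2 is site 1 scaled.
profiles = [[3.0, 8.0, 12.0, 0.0],
            [6.0, 16.0, 24.0, 0.0],
            [12.0, 8.0, 3.0, 0.0]]
similarity = site_similarity(profiles)
```

A correlation matrix like `similarity` is the kind of input a heat map and dendrogram would be drawn from; the clustering step itself is omitted here.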

3. During the 2024 season, the Rainey Lab at Purdue University focused on phenotyping soybean crops using high-resolution UAS imagery from 10 SOYGEN sites to extract canopy traits. A data management system has been set up at Purdue to help SOYGEN collaborators with secure storage of UAS image data, efficient data sharing, and access to processed plot-level outputs. To date, key canopy traits, including spectral, textural, and structural features, have been extracted from 6,320 two-row plots covering about 1,200 genotypes across V3 to R5 growth stages. The processed plot-level data includes labeled and curated RGB image clips, binary masks, and 3D point cloud data. This growing dataset provides valuable resources for soybean breeders, supporting the development of advanced AI/ML algorithms and trait derivation.

For biomass prediction, we established ‘Calibration Plots’ — 8-row plots planted adjacent to the SOYGEN plots to enable nondestructive biomass sampling. A representative subset of 36 lines from the SOYGEN3 GEI panel, with three replications, was used in this experiment. Ground biomass sampling was performed on rows 2, 3, and 4, while rows 6 and 7 were used for UAS-based data collection and harvest yield measurements. UAS flights were conducted within one day before or after each ground sampling event, and from these flights, we extracted several key drone-derived metrics, including canopy coverage, canopy height, green leaf index, and an array of structural and textural features.

In addition to the calibration dataset, we incorporated our previous ‘Public Biomass’ experiments (2021–2023), which utilized public soybean breeding germplasm from the north-central U.S. region.

Biomass prediction was approached using two modeling strategies:
1. A simplified model using two robust drone-derived predictors — canopy coverage and canopy height — both known to exhibit strong correlations with biomass.
2. A comprehensive model leveraging a broader suite of structural and textural features, capturing more detailed aspects of plant architecture and canopy structure.

Machine learning models developed with these approaches achieved high performance, with R² values reaching 0.89 and RMSE as low as 68.70, demonstrating strong predictive accuracy. Our current focus is on refining these models by minimizing time-dependent biases, enabling reliable biomass predictions on any given day regardless of growth stage or sampling date.
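
For readers unfamiliar with how R² and RMSE are obtained for such a model, the following Python sketch fits the two-predictor "simplified" form (canopy coverage and canopy height) and evaluates it on held-out plots. It uses ordinary least squares as a stand-in for the machine learning models, which the report does not specify. The data are simulated in arbitrary units, so the resulting numbers are unrelated to the values reported above.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def r2_rmse(y_true, y_pred):
    """Coefficient of determination and root mean squared error."""
    resid = y_true - y_pred
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot, float(np.sqrt(np.mean(resid ** 2)))

# Simulated plots (hypothetical): coverage as a fraction, height in metres.
rng = np.random.default_rng(1)
n = 120
coverage = rng.uniform(0.2, 1.0, n)
height = rng.uniform(0.2, 1.0, n)
biomass = 50.0 + 300.0 * coverage + 400.0 * height + rng.normal(0.0, 20.0, n)

X = np.column_stack([coverage, height])
train, test = slice(0, 80), slice(80, n)      # simple hold-out split
coef = fit_linear(X[train], biomass[train])
r2, rmse = r2_rmse(biomass[test], predict(coef, X[test]))
```

Evaluating on plots not used for fitting, as done here, is what allows R² and RMSE to be read as predictive accuracy rather than goodness of fit.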

Structural features derived from 3D models, like canopy volume, and the top and side canopy geometry, are also good indicators for biomass. We anticipate that the marker data will facilitate the development of a biomass prediction model. We expanded the ground truth biomass dataset in 2024, and plan to do so again in 2025.
Preliminary analysis has been conducted on yield estimation using image-derived features. The results show promising potential for predicting yield, with an R² value around 0.45. To improve prediction accuracy, efforts will focus on expanding the dataset and incorporating additional data such as weather variables, soil characteristics, and genetic marker information. The integration of expanded data sources and advanced modeling techniques is anticipated to significantly strengthen the robustness and predictive power of the models.

Status of 2024 season UAS data collection and processing
See attached report.

Objective 3. Discover structural variants (SVs) and test whether modelling structural variants improves genomic predictions for yield and seed composition.

Recent improvements in our SV calling pipeline have significantly enhanced the accuracy and resolution of SV detection. By integrating five distinct SV callers, we are now able to identify a greater number and diversity of structural variants with higher confidence. These improvements were implemented using the Williams 82 version 6 (a6) reference genome, a more complete and accurate assembly with no scaffolds compared to its predecessors. GWAS was conducted using both the SVs and SNPs identified from the a6 reference genome, yielding highly significant associations between structural variants and key agronomic traits. These traits include, but are not limited to, yield, stress resistance, and plant architecture. The use of a refined SV dataset has reduced background noise and increased the power to detect true associations. Gene Ontology (GO) analysis of significant GWAS markers located within gene models revealed that many of the associated SVs and SNPs influence genes known to control trait development and physiological processes. These functional annotations provide valuable biological insights into how structural variation contributes to phenotypic diversity in soybeans.
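
One common way to combine calls from several SV callers is to keep a variant only when multiple callers report overlapping calls of the same type. The Python sketch below illustrates that idea with a reciprocal-overlap rule. The report does not state how the five callers were integrated, so the 50% overlap threshold, the two-caller minimum, the caller names, and the coordinates are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SV:
    chrom: str
    start: int
    end: int
    svtype: str   # e.g. "DEL", "DUP"
    caller: str

def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions; 0 if chromosome or type differ."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    inter = min(a.end, b.end) - max(a.start, b.start)
    if inter <= 0:
        return 0.0
    return min(inter / (a.end - a.start), inter / (b.end - b.start))

def consensus_calls(calls, min_callers=2, min_overlap=0.5):
    """Greedily cluster calls against each cluster's first call, then keep
    clusters reported by at least min_callers distinct callers."""
    clusters = []
    for sv in sorted(calls, key=lambda s: (s.chrom, s.start, s.end)):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], sv) >= min_overlap:
                cluster.append(sv)
                break
        else:
            clusters.append([sv])         # no match: start a new cluster
    kept = []
    for cluster in clusters:
        n_callers = len({sv.caller for sv in cluster})
        if n_callers >= min_callers:
            seed = cluster[0]
            kept.append((seed.chrom, seed.start, seed.end, seed.svtype, n_callers))
    return kept

calls = [
    SV("Gm01", 1000, 2000, "DEL", "callerA"),
    SV("Gm01", 1050, 2010, "DEL", "callerB"),
    SV("Gm01", 1000, 1990, "DEL", "callerC"),
    SV("Gm02", 5000, 5600, "DUP", "callerA"),   # single caller: dropped
    SV("Gm01", 1000, 2000, "DUP", "callerD"),   # different type: dropped
]
consensus = consensus_calls(calls)
```

Here the three deletion calls on Gm01 collapse into one consensus variant supported by three callers, while the two singleton calls are discarded. Raising `min_callers` trades sensitivity for confidence.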

Comparative analyses demonstrate that the use of the Williams 82 a6 reference genome produces more meaningful GWAS results than the earlier version 4 (a4). GWAS performed with a4, which contains more unresolved scaffolds, tends to yield less consistent and potentially spurious associations. This underscores the importance of reference genome quality in structural variation-based GWAS.

Our findings highlight the critical role of high-quality reference genomes and comprehensive SV calling pipelines in conducting accurate and biologically relevant GWAS. The integration of multiple SV callers, coupled with the Williams 82 a6 reference, has led to improved detection of trait-associated structural variants. These results demonstrate that unresolved genomic regions can compromise GWAS outcomes, potentially leading to random or misleading associations.


Updated March 31, 2026:
Objective 1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.

Objective 1 can be broken down into five sub-objectives for which we can report specific and significant progress during these past six months.
Sub-obj 1: Continue to genotype with genome-wide and trait-targeted markers all new breeding lines entered in the Northern Uniform Soybean Tests
Progress: 267 new breeding lines entered into the Northern Uniform Regional trials were grown in the field and sampled for DNA. DNA extraction has been conducted and we are in the process of shipping samples for genotyping. These will contribute to the large genotype dataset we have amassed as part of this project.

Data from the 2024 NUST trials has been uploaded to SoyBase.
A manuscript on this work, which we have pursued for the last several years, has been published in Crop Science:

Wartha, C. A., Campbell, B. W., Ramasubramanian, V., Nice, L., Brock, A., Cai, G., Eskandari, M. M., Graef, G., Hudson, M. E., Hyten, D., Mahan, A. L., Martin, N. F., McHale, L., Miranda, C., Dominguez, E. M., Nelson, R., Rainey, K., Rajcan, I., Scaboo, A., … Lorenz, A. J. (2025). Genomic analysis and predictive modeling in the Northern Uniform Soybean Tests. Crop Science, 65(5), e70138.

Sub-obj 2: Enable individual public breeding programs to test and use genomic prediction
Originally, this project explicitly funded the integration of genomic prediction into the public soybean breeding pipelines to expedite yield improvement. Because of budget cuts, the funding for this part of the project was removed. Nevertheless, this project has continued to instigate and enable several public programs to start using genomic prediction routinely.
There are no new activities to report for this period beyond those described in the last report.

Sub-obj 3: Development of a genomic prediction R-Shiny app for easy implementation of GS for breeders.
The first version of this application is now complete. The journal article reporting and describing this tool has been accepted for publication.

1) Ramasubramanian, V., C. Wartha, L. Singh, P. Vitale, S. Ru, and A.J. Lorenz*. 2025. GS4PB: An R/Shiny Application to Facilitate a Genomic Selection Pipeline for Plant Breeding. Plant Genome (In press).
This manuscript describes the development of the application, the components, how it can be used to execute genomic selection for plant breeding, and how it can be accessed. This application is freely available to the public, and will enable plant breeders to implement genomic selection.

In addition to the tool description, we also reported on how using this tool resulted in better selections than phenotypic selection. When making phenotypic selections in 2024 and validating them in 2025, we found that several lines that would have been culled using phenotypic selection were actually among the best-performing lines in 2025. Genomic selection did not make this mistake and culled the correct lines. This test helps validate the tool we developed and should help convince users to adopt it. We are designing additional features for the tool to help soybean breeders implement genomic selection and increase the effectiveness of their breeding programs.


Sub-obj 4: Adopting and advancing BreedBase for storage of information for soybean genomic prediction.
This database is currently working for this project, so there is nothing new to report here. We have been uploading all our genotypic data used for this project to this database.

Sub-obj 5: Connect target and training populations using imputation that leverages pedigree relationships and enhance this capacity by inclusion of this method in the software application.
As mentioned in the last report, this sub-objective has been completed. We have implemented these methods in our software application GS4PB.

Objective 2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environmental characterization.
Three main tasks are associated with this objective.
1. Conducting a large, multi-institutional, multi-environment trial to create a “genotype-by-environment interaction” dataset for soybean that current and future soybean researchers can use to identify methods to improve genomic prediction methodology for yield.
2. Define environmental variables driving genotype-by-environment interactions that can be used to enhance prediction models.
3. Quantify biomass non-destructively so that crop growth can be evaluated in relation to certain environmental stressors.

Progress on each task is reported in order below:
1. Conducting a large, multi-institutional, multi-environment trial to create a “genotype-by-environment interaction” dataset for soybean that current and future soybean researchers can use to identify methods to improve genomic prediction methodology for yield.
The major activity this past reporting period was growing another season of the GxE trials described in the last report. All seeds were packaged, shipped, and planted at each of the 21 locations. Data templates were distributed, and we are receiving data back from cooperators this fall. As far as we know, we only lost two of the 21 locations, and we anticipate good data from 19 locations, totalling nearly 40 environments of data for this project, creating a powerful dataset for us to explore GxE genomic prediction modeling.

2. Define environmental variables driving genotype-by-environment interactions that can be used to enhance prediction models.
As described in the last report, this objective has been completed. We are working to use this tool for GxE modeling in the next version of SOYGEN.

3. Quantify biomass non-destructively so that crop growth can be evaluated in relation to certain environmental stressors.

During the 2024 and 2025 seasons, the Rainey Lab at Purdue University has focused on high-resolution UAS-based phenotyping of soybean plots across 10 SOYGEN sites to extract canopy-level traits. A centralized data management system was developed at Purdue to support SOYGEN collaborators with secure UAS data storage, efficient data sharing, and streamlined access to processed plot-level outputs.
Beginning in 2024, key canopy traits, including textural and structural features, were extracted from 6,320 two-row plots representing approximately 1,200 genotypes, covering the V3 to R5 growth stages. Processing for the 2025 season is currently underway. The resulting dataset includes labeled and curated RGB image clips, binary masks, and 3D point cloud data. This expanding resource provides a valuable foundation for soybean breeders, supporting the development of advanced AI/ML algorithms and improved trait derivation.

Biomass Calibration Experiments:
To enable non-destructive biomass estimation, we established dedicated Calibration Plots adjacent to the SOYGEN trials:
Calibration 1 (2024–2025)
• Included 36 representative lines from the SOYGEN3 GEI panel, with three replications.
• Ground biomass measurements were collected from subsections of rows 2, 3, and 4.
• Rows 6 and 7 were used for UAS imaging and harvest yield.
• Stand counts of the sampled area were recorded (2025).
• UAS flights occurred within one day before or after each ground sampling event.
Calibration 2 (2025)
• Included 20 representative lines, also with three replications.
• Ground biomass sampling was conducted on the entirety of rows 1–5.
• Rows 6 and 7 were harvested for yield.
• Stand counts of the rows were recorded.
• UAS imaging covered both sampled rows and yield rows.
• Flights were conducted within one day before ground sampling.

The Calibration 2 design was particularly effective in reducing time-dependent biases, enabling reliable biomass predictions across growth stages and sampling dates.
For both experiments, we extracted several key drone-derived metrics, including canopy coverage, canopy height, green leaf index (GLI), and a range of textural and structural features.
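
As a rough illustration of two of the metrics named above, the sketch below computes canopy coverage from a binary mask and a mean GLI over canopy pixels. It assumes the commonly used definition GLI = (2G - R - B) / (2G + R + B); the report does not state which formula, thresholds, or software were used, and the function names and pixel values here are hypothetical.

```python
def green_leaf_index(r, g, b):
    """GLI for one pixel, assuming the common definition (2G - R - B) / (2G + R + B)."""
    denom = 2 * g + r + b
    return 0.0 if denom == 0 else (2 * g - r - b) / denom


def canopy_coverage(mask):
    """Fraction of plot pixels flagged as canopy (1) in a binary mask (list of rows)."""
    total = sum(len(row) for row in mask)
    canopy = sum(sum(row) for row in mask)
    return canopy / total if total else 0.0


def mean_canopy_gli(rgb, mask):
    """Mean GLI over canopy pixels only; rgb is a list of rows of (R, G, B) tuples."""
    values = [
        green_leaf_index(*pixel)
        for rgb_row, mask_row in zip(rgb, mask)
        for pixel, is_canopy in zip(rgb_row, mask_row)
        if is_canopy
    ]
    return sum(values) / len(values) if values else 0.0


# Toy 2 x 2 plot clip: two canopy pixels and two soil pixels (made-up values).
mask = [[1, 0], [1, 0]]
rgb = [[(40, 120, 40), (120, 100, 80)], [(50, 100, 50), (110, 90, 70)]]
print(canopy_coverage(mask))
print(mean_canopy_gli(rgb, mask))
```

Restricting GLI to masked canopy pixels keeps soil background from diluting the index early in the season, which is one reason a binary mask is usually stored alongside each RGB clip.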

Biomass Prediction Modeling:

Two modeling strategies were implemented:
1. Simplified Model
Utilized three robust predictors (canopy coverage, canopy height, and GLI) known to correlate strongly with biomass.

2. Comprehensive Model
Incorporated an expanded set of structural and textural features to capture finer details of plant architecture.
Machine-learning models built using these approaches demonstrated strong predictive performance, achieving R² up to 0.89 with RMSE as low as 68.70.
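
For readers less familiar with the two accuracy measures quoted above, the following minimal sketch shows how R² and RMSE are conventionally computed from observed and predicted biomass. The numbers are invented for illustration, and the report does not say whether R² was computed this way or as a squared correlation, so treat this as one standard definition rather than the project's exact procedure.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical ground-sampled vs. drone-predicted biomass for four plots.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
print(rmse(observed, predicted))
print(r_squared(observed, predicted))
```

Because RMSE carries the units of the biomass measurement, it is only interpretable alongside those units and the range of the observed values.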

Application to Collaborator Sites and Yield Estimation
The enhanced biomass prediction model has been applied to multiple collaborator sites, where predicted biomass showed strong correlations with harvested yield, reaching up to 0.82. We are also investigating yield estimation directly from predicted biomass.
In parallel, we are integrating genomic data to perform GWAS for identifying both stage-specific and season-wide significant SNPs, with additional downstream genomic analyses underway.

Objective 3. Discover structural variants (SVs) and test whether modelling structural variants improves genomic predictions for yield and seed composition.
The Hudson laboratory is mostly concerned with the use of soybean genome data to improve how efficiently and effectively we can breed new soybean varieties. To check how much the quality of the soybean reference genome (the master copy of Williams 82 generated by the Department of Energy (DOE) Joint Genome Institute (JGI)) affects our results, we ran a number of tests comparing two versions, a4 and a6. The older a4 version was the best available until recently, while a6 is the newest and most complete. Because our project has uncovered extensive structural variation in SoyNAM, we want to determine how much that variation affects important traits in the plants, and how much of the variation in those traits we can attribute to genetic differences we can measure (heritability), for use in genomic selection and molecular breeding.

We compared how each version affected three things:
1. Phylogenetic trees (family trees for the SoyNAM population) that show how similar the lines in the population are at the DNA level. Do the structural variants and genome versions affect our analysis of how closely related the plants are?

2. Principal Components (PCs) and kinship matrices, which are also tools that describe how closely related different plants are. However, we use these tools directly to control our genome-wide association studies (GWAS), and they affect our predictions of which genes are likely to control the important traits.

3. GWAS, which find links between genetic variation (SNPs, which are small single-letter differences in DNA, and/or SVs, which are large rearrangements or insertions/deletions) and traits such as yield or disease resistance, but which require accurate data on kinship between the individuals in the population.
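
To make the idea of a kinship matrix concrete, here is a minimal sketch of one widely used estimator, the VanRaden genomic relationship matrix, computed from allele dosages. The report does not say which kinship estimator the Hudson laboratory used, so this choice, the function name, and the toy genotypes are assumptions for illustration only.

```python
def vanraden_kinship(genotypes):
    """Genomic relationship matrix G = Z Z' / (2 * sum(p * (1 - p))).

    genotypes: list of lines, each a list of allele dosages coded 0, 1, or 2.
    Z is the dosage matrix centered by twice the allele frequency of each marker.
    """
    n_lines = len(genotypes)
    n_markers = len(genotypes[0])
    # Frequency of the counted allele at each marker.
    freqs = [
        sum(line[j] for line in genotypes) / (2 * n_lines) for j in range(n_markers)
    ]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic")
    centered = [
        [line[j] - 2 * freqs[j] for j in range(n_markers)] for line in genotypes
    ]
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / scale for row_j in centered]
        for row_i in centered
    ]


# Four inbred lines, three markers: lines 0 and 1 are identical, as are 2 and 3.
toy_genotypes = [[0, 0, 2], [0, 0, 2], [2, 2, 0], [2, 2, 0]]
kinship = vanraden_kinship(toy_genotypes)
for row in kinship:
    print(row)
```

The same function can be fed SNP dosages, SV presence/absence calls coded as dosages, or both, which is how different marker types can lead to different pictures of relatedness.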
When we built phylogenetic trees using each of the two genome versions, we used just the SNPs, just the SVs, and both together.
For a4, the trees made from SVs and from SNPs didn’t match very well, showing that gaps and errors in the older genome caused unstable results.
For a6, the trees based on SVs looked stable and consistent when compared with a4, while trees based on small changes (SNPs) still varied more. This means that SVs may be more likely to capture real biological relationships that are not easily thrown off by small assembly errors.
Next, we tested every possible combination of:
• The version of the genome (a4 or a6),
• The type of genetic differences (SNPs, SVs, or both), and
• The relationship data (PCs and kinship matrices).
We judged accuracy by checking how well each GWAS result matched previously confirmed genetic regions, QTLs (quantitative trait loci), which are known to affect traits in the same soybean population.
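
The comparison described above can be sketched as a grid of settings scored by how many known QTLs each GWAS recovers. This is a hedged illustration: the 1 Mb matching window, the interpretation of the third factor as the assembly from which the PCs and kinship matrix were derived, and all names and coordinates are assumptions, not details taken from the report.

```python
from itertools import product


def qtl_recall(gwas_hits, known_qtls, window_bp=1_000_000):
    """Fraction of known QTLs with at least one GWAS hit within window_bp.

    gwas_hits and known_qtls are lists of (chromosome, position_bp) tuples.
    """
    if not known_qtls:
        return 0.0
    found = sum(
        any(
            chrom == q_chrom and abs(pos - q_pos) <= window_bp
            for chrom, pos in gwas_hits
        )
        for q_chrom, q_pos in known_qtls
    )
    return found / len(known_qtls)


# Every combination of assembly, marker type, and relationship-data source.
assemblies = ["a4", "a6"]
variant_sets = ["SNP", "SV", "SNP+SV"]
relationship_sources = ["a4", "a6"]  # assumed meaning of "relationship data"
settings = list(product(assemblies, variant_sets, relationship_sources))
print(len(settings))

# Made-up example: one of two known QTLs is recovered within 500 kb.
known = [("Gm01", 1_000_000), ("Gm05", 5_000_000)]
hits = [("Gm01", 1_400_000), ("Gm05", 9_000_000)]
print(qtl_recall(hits, known, window_bp=500_000))
```

In practice each entry of `settings` would be run through the same GWAS pipeline and ranked by recall, ideally together with a count of hits that fall outside any known QTL.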
The best matches came from using the newer a6 genome for everything—its SNPs, SVs, and combined data—along with relationship data also based on a6. These tests found more of the known true QTLs, showing that the higher-quality a6 genome gives cleaner, more realistic results. Using older a4 data (the best available for all previous analyses of SoyNAM) introduced small mistakes because of its missing or mis-assembled parts.
By fine-tuning how we use the a6 genome and how we include SVs, we have improved the heritability (the portion of trait differences explained by the genetic variation we can measure) and sensitivity of our soybean studies for several traits. The heritability is directly related to our ability to breed for enhanced traits, as well as to build genomic selection methods.

Our second project looks at how unbalanced structural changes in DNA—such as extra or missing pieces—affect the overall size of the genome. This helps double-check our SV results and the overall pangenome.
To study this, we have built a new way to estimate genome size (how much total inherited DNA is in the nucleus of each soybean line) that isn’t easily thrown off by sequencing errors.
We use a k-mer–based genome size estimation (GSz) method. A k-mer is a short DNA sequence that helps estimate total DNA content by counting how often certain patterns appear. Using this method, we found much more variation in genome size among the SoyNAM parent plants than we expected.
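
For orientation, the sketch below shows the classical k-mer approach to genome size: count k-mers in the reads, build a depth histogram, discard very low-depth k-mers as likely sequencing errors, and divide the remaining k-mer total by the peak depth. This is the textbook estimator, not the improved GSz method described in this report; it ignores reverse complements and repeat handling, and the error threshold and histogram values are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def depth_histogram(kmer_counts):
    """Map each depth (times a k-mer was seen) to the number of distinct k-mers."""
    return Counter(kmer_counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Classical estimate: total k-mers at or above min_depth, divided by peak depth."""
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Made-up histogram: 50 error k-mers seen once, and a main peak at depth 10.
toy_histogram = Counter({1: 50, 9: 5, 10: 100, 11: 5})
print(estimate_genome_size(toy_histogram))
```

The estimate depends on where the error cutoff is drawn and on how repeats inflate high-depth counts, which is the kind of sensitivity a more careful method has to address.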
Genome size is known to influence many basic cell traits across eukaryotes. This means the differences in GSz among SoyNAM parents may represent a new layer of population structure (the genetic organization of the population), which is very important both for accurately locating the loci that control traits and for genomic prediction of traits. Early tests show that genome size differences seem to be linked with important plant traits like yield, seed protein and oil levels, and fiber content. We are cautious with these findings because they are new, and older estimation methods can be thrown off by sequencing errors or repeated DNA regions. To reduce these problems, we developed a new GSz method that is much less sensitive to sequencing mistakes and that properly accounts for repeats. First results in a model species look promising, and we plan to apply this improved method to the SoyNAM lines soon.
If this approach works, it will add a second and independent way to describe how populations differ genetically, helping us build better models of population structure and improving genetic studies that depend on those models.

Although these two projects started separately, they point to the same big idea: different large-scale DNA processes, shaped by different evolutionary forces, each describe part of how populations are structured. Ignoring any one of these signals could lead to wrong estimates of genetic relationships and errors in later analyses. Our work has already improved our estimation of relationships and thus the heritability estimates of important traits. Genome size differences also seem to be linked to seed oil and protein levels, meaning that variation in genome size might help predict seed quality in future breeding work.

Previously, as part of this project, we obtained substantial in-kind support: DOE JGI is internally funding the sequencing of a soybean pangenome of several hundred lines, including all of the SoyNAM, from samples we are supplying. Each of these lines will have its own reference genome, which should be of similar quality to the a6 version of Williams 82. Nothing like this pangenome has been created before, in industry or academia, and our results above from the a4 vs a6 comparison indicate that the pangenome will have a very large impact on our ability to improve soybean varieties, and on the speed with which we can do it. This year we obtained the first dozen or so genome assemblies for this pangenome, and so far the quality of these assemblies looks to be similar to that of a6. We expect to complete at least three hundred genomes, at a cost of several million dollars of DOE internal funding, as part of this project.

Final Project Results

Updated May 5, 2026:
Final Report: SOYGEN3 (Year 3 of 3)
Building capacity to increase soybean genetic gain in future environments through genomics-assisted breeding and environmental characterization
Overview
Over the three-year duration of SOYGEN3, the project made substantial progress toward advancing genomics-assisted breeding in soybean, with a particular emphasis on increasing genetic gain under dynamic and unpredictable environmental conditions. By integrating genomic data, multi-environment phenotyping, and environmental characterization, the project delivered a suite of datasets, tools, and analytical frameworks that are now being actively used in public breeding programs. The final year was especially important, as it brought together earlier investments in data generation and tool development into validated systems that demonstrate clear value for breeding decisions. Collectively, these outcomes position public soybean breeding programs to more effectively select superior varieties adapted to both current and future production environments.

Below we list our progress against each Key Performance Indicator (KPI).

1. Approximately 425 – 450 new advanced breeding lines entered into the Northern Uniform Soybean Tests (NUST) genotyped each year using low-pass sequencing technology.

Outcome: Successfully completed. Over the past three years, we generated genome-wide marker data for every breeding line entered into NUST, for a total of 1,035 lines genotyped during this project period. Data from these lines have been stored in Soybeanbase and used to build genome-wide prediction models.

2. The “SOYGEN Genomic Selection” application and/or Breedbase installation at SoyBase.org is fully operational and used by the majority of participating soybean breeding programs for managing genomic selection programs.

Outcome: The database and tool development has been successful. We estimate that some project collaborators, but not yet most, are using these tools on a regular basis. We are working with individual projects to increase utilization and hope to continue making progress on this. The “SOYGEN Genomic Selection” application was completed and renamed “GS4PB”. We published a peer-reviewed manuscript on it in The Plant Genome (https://doi.org/10.1002/tpg2.70150) last fall. It has already been downloaded over 1,500 times, and we have received inquiries from public breeders across the country on its use. We are continuing to improve this tool by adding features such as AI-based guides on how to use it. As outlined in the publication, we have shown that the genomic prediction workflow implemented in this tool performs at least as well as (and often better than) phenotypic selection, opening up many avenues for breeders to use genomic prediction in their programs.



Figure 1. Screenshot of the GS4PB app page where users can load data using a graphical user interface. The steps implemented in the app are available in a menu on the left of the screen.


Also in line with the KPI, Soybeanbase is fully functional. We have also continued to adopt and deposit data into Soybeanbase, a version of BreedBase that stores genome-wide marker data in a structured and organized way, tracking metadata and allowing easy inquiry of genome-wide marker data status. Data is imported and exported in standardized formats, easing the workflow of genomic prediction and allowing predictions to be made more quickly and effectively.




Figure 2. Screenshot of the Soybeanbase query page for easily accessing structured and standardized genome-wide marker data.

3. Low density genotyping-to-high density genotyping imputation leveraging pedigrees integrated into tools described in KPI #2.

Outcome: This has also been successfully completed. As described in the GS4PB manuscript (https://doi.org/10.1002/tpg2.70150), we implemented methods from the package AlphaPlantImpute, which can use either population-based or pedigree-based approaches to marker imputation, effectively going from low-density markers to high-density markers. The challenge of using AlphaPlantImpute is that it is only available in Python, making it less accessible to many users. Integrating it into GS4PB allows genomic prediction practitioners to seamlessly incorporate this powerful genotype imputation method into their streamlined data workflow.

4. By the end of the third year, the majority of participating breeding programs are using genomic selection and developed GEI models routinely.

Outcome: The majority of SOYGEN cooperators are using genomic selection in their programs, and the resulting tools and data developed by this project have helped expedite this. For example, the University of Nebraska has been exploring a “sparse-testing approach” using genomic data in their selections, helping to save phenotyping resources.

5. GEI panels are selected, genotyped, and phenotyped at characterized environments, and full dataset is provided to SOYGEN team members. These data will subsequently be made publicly available at SoyBase.org.

Outcome: All GEI panels of lines (4 RM sets, ~300 lines per set) were successfully selected, genotyped, and phenotyped in two years as indicated by the KPI. In total, approximately 1,200 unique breeding lines were genotyped using “skim sequencing”, from which we extracted 36,000 SNPs corresponding to the standard 50K SNP assay. The lines were phenotyped at 21 locations in 2024 and 19 locations in 2025 for a total of 40 unique environments. Soil samples were collected at each location, and weather data have been obtained from national weather databases. All data have been curated and deposited in our internal database. We applied a yield adjustment for maturity date and a spatial analysis of the data to estimate breeding line effects per environment. Data reliability measures were excellent at 90% of the environments, and quality control indicators such as flower color and pubescence color indicate that the data were collected correctly. We are now beginning the third year of this study which, when complete, will create an unprecedented soybean breeding dataset that both public and private breeding programs can use to model genomic prediction for genotype-by-environment interactions.
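
The report does not describe how the yield adjustment for maturity date was done. One simple possibility, sketched here purely as an assumption, is a within-environment covariate adjustment: regress yield on maturity date and remove the fitted maturity effect, so that lines are compared as if they all matured on the average date. The function name and values are hypothetical.

```python
def adjust_yield_for_maturity(yields, maturity_days):
    """Remove the linear effect of maturity date from yield within one environment.

    Returns yields shifted to the mean maturity date of the environment.
    """
    n = len(yields)
    mean_yield = sum(yields) / n
    mean_maturity = sum(maturity_days) / n
    sxx = sum((m - mean_maturity) ** 2 for m in maturity_days)
    if sxx == 0:
        # All lines matured on the same day, so there is nothing to adjust.
        return list(yields)
    sxy = sum(
        (m - mean_maturity) * (y - mean_yield)
        for m, y in zip(maturity_days, yields)
    )
    slope = sxy / sxx  # change in yield per day of later maturity
    return [y - slope * (m - mean_maturity) for y, m in zip(yields, maturity_days)]


# Made-up plot data: later-maturing lines yield more in this toy environment.
toy_yields = [50.0, 55.0, 60.0]
toy_maturity = [100.0, 105.0, 110.0]
print(adjust_yield_for_maturity(toy_yields, toy_maturity))
```

In a full analysis this covariate would more likely be fitted jointly with the spatial terms in a mixed model rather than as a separate preprocessing step.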


6. Model developed that can predict line performance in new environments based on environmental data, and model performance exceeds baseline model in accuracy of prediction.

Outcome: This milestone is still in progress, as we decided to add a third year of data collection to the project. Our first step was to implement the methods of de los Campos et al. (2020), as described in the proposal. De los Campos et al. (2020) proposed a novel framework for predicting GxE interactions and assessing the uncertainty and stability of breeding line performance across diverse environmental conditions. We implemented these simulations using the NUST genotype and phenotype data described above. By simulating possible environmental scenarios in over 100,000 environments for each location, and modelling interactions between genotypes and environmental conditions, we predicted the performance of each line in each potential environment. The predictions were then validated using observed performance in the years 2020-24 (years left out of the training set). Unfortunately, we did not see an improvement over baseline models, and we believe this is because we are not adequately modelling the environmental variables. Lovepreet Singh, a graduate student supported by SOYGEN, is spending the next couple of months in a research lab focusing on modelling genotype-by-environment interactions, and we expect to apply what we learn from that experience to the SOYGEN dataset to improve predictive soybean breeding.

With respect to the SOYGEN GxE data we have collected as described above, we have begun to test and analyze genomic prediction models. Using a leave-one-line-out cross-validation analysis, we determined that our ability to predict performance was quite good, with predictive ability ranging from 0.61 (RM Set 1) to 0.75 (RM Set 2). This indicates that our genotype and phenotype data are, as mentioned above, of good quality. However, we found that when we use data across a diverse range of environments to build a predictive model, our predictive ability falls off dramatically. This indicates that there is room to develop models capturing genotype-by-environment interactions to improve our ability to predict and thus more effectively use genomic prediction for soybean breeding.
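
The leave-one-line-out procedure can be sketched as follows. The report does not name the prediction model, so ridge regression on marker dosages is swapped in here as a simple stand-in, and predictive ability is taken to be the Pearson correlation between predicted and observed values. The code is deliberately dependency-free, and the markers, phenotypes, and penalty value are invented.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ridge_fit(x_rows, y, lam):
    """Ridge regression on centered data; returns (intercept, marker effects)."""
    n, p = len(x_rows), len(x_rows[0])
    x_mean = [sum(row[j] for row in x_rows) / n for j in range(p)]
    y_mean = sum(y) / n
    xc = [[row[j] - x_mean[j] for j in range(p)] for row in x_rows]
    yc = [v - y_mean for v in y]
    # Normal equations: (X'X + lam * I) beta = X'y
    xtx = [
        [sum(r[i] * r[j] for r in xc) + (lam if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]
    xty = [sum(r[i] * v for r, v in zip(xc, yc)) for i in range(p)]
    beta = solve(xtx, xty)
    intercept = y_mean - sum(b * mu for b, mu in zip(beta, x_mean))
    return intercept, beta


def leave_one_line_out(x_rows, y, lam=1.0):
    """Predict each line from a model trained on all other lines."""
    preds = []
    for i in range(len(y)):
        train_x = x_rows[:i] + x_rows[i + 1:]
        train_y = y[:i] + y[i + 1:]
        intercept, beta = ridge_fit(train_x, train_y, lam)
        preds.append(intercept + sum(b * g for b, g in zip(beta, x_rows[i])))
    return preds


def pearson(a, b):
    """Pearson correlation, used here as the measure of predictive ability."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Toy data: six lines, two markers, phenotype exactly linear in the markers.
markers = [[0, 0], [2, 0], [0, 2], [2, 2], [1, 0], [0, 1]]
phenotypes = [3 + 1.5 * a - 0.5 * b for a, b in markers]
predictions = leave_one_line_out(markers, phenotypes, lam=1e-6)
print(pearson(predictions, phenotypes))
```

On this noise-free toy the correlation is essentially 1. With real data, leaving out whole environments instead of single lines is the harder test, and that is where the drop in predictive ability described above shows up.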



7. Forty-one SoyNAM founders are sequenced at 30x depth and structural variants scored and projected to NAM progenies.

For the 41 SoyNAM founders, we obtained high-quality DNA from dark-adapted greenhouse-grown plants and sequenced the whole genomes with Illumina technology to very high depth and quality. We have now used several structural variant (SV) caller tools in tandem to identify SVs within the soybean NAM parents' dataset. Specifically, Sentieon has revealed approximately 470,000 unfiltered SVs, Delly has identified about 35,000 unfiltered SVs, and CNVnator has detected approximately 4,000 unfiltered copy number variations (CNVs).

Meanwhile, the SOYGEN group used the core of our preliminary data and research plans to lead a proposal to DOE JGI to produce a high-quality, reference-grade pangenome consisting of 400 reference-quality soybean genomes, each of similar quality to the current Williams 82 reference genome.

8. Structural variants are associated with publicly available seed yield and composition data from multi-environment trials, and genomic prediction models using structural variants are tested against baseline models.

The Hudson laboratory has so far analyzed data from the Soybean NAM population to study how much of the genetic variation for yield is accounted for by structural variants. To date, they have found that, when the newest genome assembly of soybean is used, modelling structural variants accounts for more genetic variation in yield than single nucleotide polymorphism data alone. This suggests that further investigation of modelling structural variation could help improve the ability to predict yield in breeding programs.


Figure 4. Genetic variation for yield accounted for by simple molecular markers (SNPs) versus structural variations (SV). The value represented by the teal bars indicates that adding information on structural variations improved ability to explain genetic variation for yield, and thus could help enhance predictive ability of genomic prediction.

To enable further discovery of structural variation, the Hudson lab is building a pangenome resource consisting of many hundreds of soybean genome assemblies. This resource will help this project determine the importance of structural variation for yield and genotype-by-environment interactions in soybean.


Laymen’s summary

Over the past several years, a large team of soybean breeders and researchers across the Midwest has been working together to improve how new soybean varieties are developed. The goal of this project is to help breeders make better decisions earlier in the process, so that higher-yielding and more stable soybean varieties reach farmers faster.

One of the biggest accomplishments of the project has been the development of a massive dataset that combines field performance with detailed DNA information. Researchers have now collected genetic data on more than 4,000 soybean breeding lines and paired that with decades of yield and agronomic data from regional trials. This gives breeders a much clearer picture of which genetic traits lead to strong performance in the field.

Using this information, the team has developed tools that allow breeders to predict how a soybean line will perform based on its DNA alone. Traditionally, breeders rely heavily on field trials to decide which lines to advance or discard. However, results from this project show that DNA-based predictions can perform just as well as traditional methods—and in some cases better. In particular, genomic selection has proven useful for identifying poor-performing lines early, which helps breeders avoid spending time and resources testing lines that are unlikely to succeed. At the same time, it helps prevent good lines from being accidentally discarded too soon.

To better understand how soybean varieties perform under different conditions, the project established large coordinated field trials across the Midwest. Around 1,200 breeding lines were grown at more than 20 locations per year, generating data from nearly 40 different environments across multiple growing seasons. These trials capture a wide range of weather and soil conditions, which is critical because soybean performance often depends heavily on the environment. By studying these differences, researchers are working to identify varieties that are either broadly stable or well suited to specific regions.

In addition to traditional field measurements, the project has incorporated new technologies such as drone-based imaging. These tools allow researchers to monitor crop growth throughout the season by measuring canopy coverage, plant height, and overall biomass. The results have been promising, showing strong relationships between these measurements and final yield. This approach has the potential to provide faster, cheaper, and more detailed information about crop performance during the growing season.

Another important part of the project focuses on better understanding environmental conditions and how they interact with genetics. Researchers have developed tools to track weather patterns, soil characteristics, and crop growth stages, allowing them to connect environmental conditions with yield outcomes. This information will help improve predictions of how new soybean varieties will perform in future growing seasons.

Finally, the project is exploring new types of genetic variation beyond standard DNA markers. These larger structural differences in the genome appear to play an important role in traits like yield and stress tolerance. Early results suggest that including this information could improve the accuracy of prediction models and provide new insights into how soybean traits are controlled.

Overall, this work is helping to modernize soybean breeding by combining field data, genetic information, and environmental analysis. While much of this work happens behind the scenes, the long-term benefit for farmers will be the development of soybean varieties that yield more consistently, perform better under stress, and are better adapted to local growing conditions.

Benefit To Soybean Farmers

Soybean breeding has a large impact on the efficiency and profitability of agriculture through the development of high yielding new varieties with critical defensive traits and enhanced seed composition. Ensuring that such programs (both private and public) are using state-of-the-art technologies to drive genetic gain in the face of changing environments and narrowing genetic diversity will contribute to continual development and release of ever better varieties. Additionally, these efforts help to educate future agricultural scientists and soybean breeders that are best prepared to enter the seed industry and develop impactful future products for farmers, keeping the North Central region competitive in soybean production.
