2023
SOYGEN3: Building capacity to increase soybean genetic gain in future environments for seed yield and composition through combining genomics-assisted breeding with environmental characterization
Category:
Sustainable Production
Keywords:
Genetics, Genomics
Lead Principal Investigator:
Aaron Lorenz, University of Minnesota
Co-Principal Investigators:
Asheesh Singh, Iowa State University
William Schapaugh, Kansas State University
Dechun Wang, Michigan State University
Carrie Miranda, North Dakota State University
Katy M Rainey, Purdue University
Leah McHale, The Ohio State University
Brian Diers, University of Illinois at Urbana-Champaign
Matthew Hudson, University of Illinois at Urbana-Champaign
Nicolas Frederico Martin, University of Illinois at Urbana-Champaign
Andrew Scaboo, University of Missouri
Grover Shannon, University of Missouri
George Graef, University of Nebraska
David Hyten, University of Nebraska at Lincoln
Adam Davis, USDA/ARS-University of Illinois
+13 More
Project Code:
Contributing Organization (Checkoff):
Leveraged Funding (Non-Checkoff):
The proposed work would not be possible without other sources of funding supporting the extensive field trials. For example, the participants of the Uniform Tests grow the trials with funding from other sources. For the 2023 test, if two reps are planted at each location, almost 11,000 plots would be grown (this is an underestimate as three reps are grown at some locations). Assuming it costs $30 to prepare, plant, grow, and harvest a yield plot, this would be a contribution of about $330,000. In total, project co-investigators receive more than $1M/yr in funding related to this project, most from QSSBs. In addition to this funding, breeders (Shannon, Diers, Graef, Martin, Rainey, Lorenz, McHale, Scaboo, Singh, Wang, Miranda) also participate in United Soybean Board-funded projects for breeding for improved seed quality and composition.
Institution Funded:
Brief Project Summary:
The overall goal of this project is to advance genomics-assisted breeding for the development of soybean varieties improved for both yield and composition. The team will accomplish this by developing better breeding methods and furthering routine implementation of genomic prediction in public soybean breeding programs. Project objectives include: continuing to develop and enhance genomics-assisted breeding resources and tools to facilitate public breeding programs; developing and testing methods for predicting cultivar performance in target environments through genomics-assisted breeding models, phenomics, and environment characterization; discovering and testing structural variants for improved genomic predictions for yield and seed composition.
Key Beneficiaries:
#breeders, #farmers, #geneticists
Unique Keywords:
#breeding & genetics, #environmental adaptation, #genetics, #genomic prediction, #germplasm, #soybean breeding, #trials, #yield
Information And Results
Project Summary

Demand for soybeans is extraordinarily high and is expected to remain high, being driven by demand for soybean oil as a renewable fuel feedstock, demand for protein, and production disruptions across the world. To meet the global demand for food and fuel without cultivating marginal and sensitive land, U.S. soybean farmers need to maximize yield per acre in the face of rapidly changing environments. A key driver of U.S. soybean yield since the 1940s has been the development of new varieties through plant breeding. After the implementation of scientific soybean breeding in the 1930s and 1940s in the U.S., soybean yields have dramatically increased, regions of production have expanded, varieties with defensive traits have been developed, and seed composition has been altered to meet various premium-based specialty markets. These foregoing facts show that soybean breeding is a powerful activity capable of transforming the agricultural landscape and making U.S. farms more competitive and profitable.

Despite these successes of soybean breeding, formidable challenges remain. One major challenge is the ubiquity of genotype-by-environment interactions. Genotype-by-environment interactions occur when varieties that do relatively well in some environments perform relatively poorly in other environments. This phenomenon slows progress in developing broadly adapted varieties and necessitates more field testing across years and locations. It commonly occurs across years, which is particularly frustrating to the breeder. The timespan of a variety development program (7-10 years from cross to variety release), combined with genotype-by-environment interaction and climate change, effectively makes it necessary for breeders to breed for future environments, not necessarily the ones they are testing in now. On the other hand, genotype-by-environment interactions can be viewed as an opportunity to develop locally adapted varieties if they can be sufficiently exploited through well-defined target environments and their characterization for purposes of prediction.
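
To make the idea of a crossover interaction concrete, here is a minimal sketch using made-up yields (not project data) for two hypothetical varieties in two hypothetical environments. It computes the interaction term of the standard two-way decomposition, cell value minus variety mean minus environment mean plus grand mean, and shows the rank change between environments.

```python
# Hypothetical yields (bu/ac) for illustration only; not data from this project.
yields = {
    "VarietyA": {"Env1": 60.0, "Env2": 50.0},
    "VarietyB": {"Env1": 55.0, "Env2": 58.0},
}


def interaction_effects(table):
    """Return the genotype-by-environment interaction term for each cell."""
    varieties = list(table)
    envs = list(next(iter(table.values())))
    n_cells = len(varieties) * len(envs)
    grand = sum(table[v][e] for v in varieties for e in envs) / n_cells
    var_mean = {v: sum(table[v].values()) / len(envs) for v in varieties}
    env_mean = {e: sum(table[v][e] for v in varieties) / len(varieties) for e in envs}
    return {
        (v, e): table[v][e] - var_mean[v] - env_mean[e] + grand
        for v in varieties
        for e in envs
    }


def ranking(table, env):
    """Varieties ordered from highest to lowest yield in one environment."""
    return sorted(table, key=lambda v: table[v][env], reverse=True)


gxe = interaction_effects(yields)
print(ranking(yields, "Env1"))  # VarietyA ranks first in Env1
print(ranking(yields, "Env2"))  # VarietyB ranks first in Env2
print(gxe)
```

If the interaction terms were all zero, the two varieties would keep the same rank and the same yield difference in every environment; nonzero terms of opposite sign, as here, are what produce the rank change.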

Advances in DNA sequencing and the science of genomics have been revolutionizing crop breeding for more than a decade now, making it easier to identify genes underlying economically important traits, search for useful genetic diversity, and make faster and more effective selections through “genomic selection”. Genomic prediction and selection is a breeding method in which line selection and advancement decisions are made on the basis of genomic data only, allowing breeders to save time and resources. Numerous scientific articles have been published on the development and optimization of genomics-assisted breeding techniques. However, implementation in actual breeding programs still lags, especially in public-sector programs and small- to mid-size industry programs. Since the inception of the SOYGEN (Science Optimized Yield Gains across ENvironments) initiative funded by NCSRP, we have made a concerted effort to develop the resources and tools needed to implement genomics-assisted breeding techniques. The SOYGEN network consists of all public soybean breeding programs located in the North Central region along with key collaborators in the areas of genomics, genotyping technology, and precision agriculture (Figure 1). We have compiled and curated existing variety performance data from our regional trial network and deposited them in a relational and searchable database from which data can be easily retrieved for analysis (https://soybase.org/ncsrp/queryportal/). We collectively genotyped nearly 3,300 advanced elite breeding lines entered in our regional trial network with genome-wide markers and developed the low-cost, low-density DNA marker technology necessary for conducting cost-effective genomic selection. To help use these genotypic data in making breeding decisions, we developed workflows and analysis tools (Figure 2).
During the course of this initiative we have made over 10,000 genomic predictions, predicted the cross value of over 1.2 million potential cross combinations, and dramatically increased our genotyping capacity. These advancements have been used to facilitate rapid-cycling genomic selection to increase genetic gain per year, to select among early-generation progenies at the plant row stage to increase program efficiency, and to identify parental combinations expected to create promising breeding populations in terms of average performance and variation.
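
As a rough illustration of what a genomic prediction and a predicted cross mean involve, the sketch below assumes marker effects have already been estimated from a training set. The effects, line names, and genotype codes are hypothetical, and the mid-parent value used here covers only the expected average of a cross under an additive model; it is not the project's cross-value method, which also considers variation within the cross.

```python
from itertools import combinations

# Hypothetical marker effects and genotypes (coded -1/0/1), for illustration only.
marker_effects = [0.5, -0.2, 0.1, 0.3]
genotypes = {
    "Line1": [1, 1, -1, 1],
    "Line2": [-1, 1, 1, 1],
    "Line3": [1, -1, 1, -1],
}


def gebv(markers, effects):
    """Genomic estimated breeding value: sum of marker code times marker effect."""
    return sum(m * a for m, a in zip(markers, effects))


def predicted_cross_means(values):
    """Mid-parent predicted value for every pairwise cross (additive model)."""
    return {
        (p1, p2): (values[p1] + values[p2]) / 2
        for p1, p2 in combinations(sorted(values), 2)
    }


gebvs = {line: gebv(m, marker_effects) for line, m in genotypes.items()}
cross_means = predicted_cross_means(gebvs)
best_cross = max(cross_means, key=cross_means.get)

print(gebvs)
print(best_cross)
```

With n candidate parents there are n(n-1)/2 possible crosses, which is why the number of evaluated combinations grows so quickly relative to the number of lines genotyped.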

Despite this progress, there is still work to do to fully infuse genomics-assisted breeding into public soybean breeding programs. There are three new challenges we would like to tackle to advance genomics-assisted breeding in soybean: 1) Collect and model extremely dense “low pass sequencing” data, project sequence data onto breeding populations, and use it routinely in breeding programs; 2) Predict performance of varieties in future environments through modelling genotype-by-environment interaction effects and environmental parameters to improve varietal stability, increase efficiency, and more effectively develop varieties for future environments and local adaptation; 3) Use structural variant data to enhance genomic predictions and connect yield stability to underlying genetics. We have deliberately chosen these objectives because they are not only major questions facing public programs but are also major questions facing large multi-national companies striving to leverage genomics to deliver new higher yielding products more rapidly and effectively to farmers. Such companies (large, mid-sized, and small) look to public programs to investigate such questions of general interest that sometimes involve high-risk, high-reward experimentation (see letters of support).

Accomplishing the foregoing objectives will advance soybean breeding methodology to help ensure continued genetic gain is made for yield, defensive traits, and seed composition well into the future. Findings from our studies will be published in peer-reviewed open-access journals so that the knowledge we generate is available to everyone in the soybean seed industry. Findings will also be integrated into our current public programs to enhance their effectiveness and efficiency. Finally, keeping our public programs on the leading edge of breeding technology contributes to graduate and undergraduate education and thus produces future plant breeders, geneticists, and other agricultural scientists well equipped to join the seed industry and create ever higher yielding soybean varieties.


Project Objectives

Goal: The overall goal of this project is to advance genomics-assisted breeding for the development of future superior soybean varieties improved for both yield and composition. We will accomplish this using a multipronged approach including developing better breeding methods and furthering routine implementation of genomic prediction in actual public soybean breeding programs.
Our overall goal can be broken down into three interrelated objectives:

1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.

2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environment characterization.

3. Discover structural variants and test whether modelling structural variants improves genomic predictions for yield and seed composition.

Project Deliverables

The following will be delivered upon completion of this three-year project:
1. Publicly available resources and tools for soybean breeders to implement cost-effective genomic prediction in their programs.

2. Publicly available knowledge on genetic control of genotype-by-environment interaction in soybean, and improved models for prediction of breeding line performance in new environments. Knowledge will be made available through open-access publications, presentations at scientific meetings, and presentations to the seed industry.

3. Identification of important structural variants that control seed yield and composition, and publication of knowledge on any benefit of explicitly modeling structural variants for predicting breeding line performance.

4. Enhanced germplasm and superior varieties, better adapted to future environmental conditions, developed through adoption of genomics-assisted breeding techniques.

Progress Of Work

Updated April 26, 2023:
Goal: The overall goal of this project is to advance genomics-assisted breeding for the development of future superior soybean varieties improved for both yield and composition. We will accomplish this using a multipronged approach including developing better breeding methods and furthering routine implementation of genomic prediction in actual public soybean breeding programs.

Our overall goal can be broken down into three interrelated objectives:

1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.

The SOYGEN3 project officially kicked off this past reporting period. We held several meetings to refine goals and share information on use of markers and genomic selection in our own variety development programs. There are four specific tasks we are undertaking to advance this goal:

a. Collection of low-pass resequencing data on all advanced breeding lines entering the Northern Uniform Soybean Tests (NUST) and SCN NUST. This past year, we collected tissue on 611 new advanced breeding lines and submitted them for low-pass sequencing. This returned around 3 million SNP markers for each breeding line. We received the data last fall and have been working on processing these data and importing them into our new database, Soybeanbase. Seeds have been received for the 2024 cohort of NUST and NUST SCN lines. We are packaging those for planting and summer genotyping.

b. We have continued to deposit data and work on the Soybeanbase database with the Breeding Insight Onramp team (https://soybeanbase.breedinginsight.net/). This database now holds all the 6K genotype data collected on the breeding lines entered into our regional trial networks. We are preparing a publication reporting the analyses of these genotype data. This manuscript should be submitted in the coming months and will make the availability of these data widely known to the larger community, so that they become an important resource for soybean breeders and geneticists. The UMN Breeding group has also deposited internal genotyping data into this database. It is our hope that once expertise within one group is established, we can share that expertise with the other breeding groups.

c. A workflow and software application have been developed to streamline the processing and analysis steps needed to enable genomics-assisted breeding. Once standardized data files are entered, a practitioner can walk through the GS process using a graphical user interface. This should democratize the use of genomic prediction for soybean breeding and allow breeders to run quick analyses to make selections on short deadlines. See this link for a demo of the application, and the screenshot of the workflow below.

d. In the past, we conducted a wide-scale test to compare genomic selection to phenotypic selection. This experiment is being repeated in 2023 for another season of data collection. For the new SOYGEN3 project, each breeder will genotype ~1000 breeding lines in their program and attempt to use genomic selection that best fits their current program and needs. This will start in the summer of 2023.



2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environment characterization.

a. For this objective, we are conducting a multi-environment, multi-institutional coordinated performance trial of 1200 diverse breeding lines. Each breeding line will be phenotyped for several agronomic and phenological traits, and each will be genotyped using low pass re-sequencing technologies. Detailed environmental data for each growing location in each year will be collected and analyzed. The ultimate goal is to better predict the interactions between the environment and genotype. If we are successful, we will be able to leverage genomic data, phenotype data, and environmental data to predict how new breeding lines may perform in future environments that a producer is most likely to encounter.
This project is just getting off the ground. The first step is to identify the breeding lines making up the panel and increase their seed in common growing environments to remove any effects related to seed source. This past funding period, we identified 300 breeding lines per relative maturity (RM) grouping. There are four RM groupings, for a total of 1200 breeding lines. Breeders were given listings of lines and are currently exchanging seeds in order to plant small seed increases that will produce at least three pounds of seed. These seeds will be used for performance trials in 2024. We have designated seven hubs for increasing seed. ND and MN will increase seeds of the 0.5 – 1.5 RM set, IA and NE will increase seeds of the 1.5 – 2.5 RM set, IN and NE will increase seeds of the 2.5 – 3.5 set, and KS and MO will increase seeds of the 3.5 – 4.5 set. Each set is assigned two seed increase locations to protect against natural disasters such as drought and hail. Plans for collecting tissue and phenotyping of these seed increase plots are currently being discussed.
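
The panel layout described above can be written down as a small data structure, which makes the counts easy to check. This is only a bookkeeping sketch; the variable names are ours, and the hub assignments are copied from the text as listed (NE appears for two RM sets, which is what gives seven distinct hubs across eight assignments).

```python
# Seed-increase assignments as listed in the report text; names are illustrative.
LINES_PER_SET = 300

seed_increase_hubs = {
    "0.5-1.5": ["ND", "MN"],
    "1.5-2.5": ["IA", "NE"],
    "2.5-3.5": ["IN", "NE"],
    "3.5-4.5": ["KS", "MO"],
}

# Total panel size: four RM sets of 300 lines each.
total_lines = LINES_PER_SET * len(seed_increase_hubs)

# Distinct hub locations across all RM sets.
distinct_hubs = sorted({hub for hubs in seed_increase_hubs.values() for hub in hubs})

# Every RM set should have exactly two increase locations as a safeguard.
every_set_duplicated = all(len(hubs) == 2 for hubs in seed_increase_hubs.values())

print(total_lines, len(distinct_hubs), distinct_hubs, every_set_duplicated)
```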


3. Discover structural variants and test whether modelling structural variants improves genomic predictions for yield and seed composition.

a. For this objective, we will apply long-read sequencing technologies to discover structural variants in soybean. We will perform this on the Soybean Nested Association Mapping population, which is a widely used population on which quality phenotype and molecular marker data is available. This will allow us to determine if knowledge of genomic structural variation has any value for predicting phenotype from genomic data.
During this reporting period, co-PI Hudson obtained seed of the original 41 SoyNAM founders (original sources used to make the crosses). These seeds were planted in the greenhouse and tissue was collected for DNA isolation. DNA isolation is in progress, and will be sent out for sequencing during the next reporting period.


Updated November 1, 2023:
Objective 1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.


These past six months we have continued to develop the Northern Uniform Soybean Tests (NUST) database by sampling 560 NUST lines in the field this past year. Samples were submitted to a genotyping vendor for trait-targeted genotyping, and DNA extraction is ongoing so that high-quality DNA can be submitted for high-density genotyping via re-sequencing. Previously collected genotype data have been uploaded into our new community database made possible by the collaboration between our project (SOYGEN) and the Breeding Insight team (https://soybeanbase.breedinginsight.net/). Data on over 2540 genotypes assayed with the BARCSoySNP6K have been uploaded to the database. We have spent the last 6 months filtering the Gencove resequencing results (~3 million SNPs) down to the standard community 50K SNP set to create a reasonably sized dataset for uploading to Soybeanbase. This is nearly complete and will be uploaded in the coming days.
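
The filtering step described above, reducing a dense resequencing call set to a fixed community marker panel, can be sketched as follows. This is a minimal illustration under our own assumptions: marker IDs, line names, and genotype codes are hypothetical, and the real workflow presumably operates on VCF-scale files with dedicated tools rather than Python dictionaries.

```python
MISSING = None  # placeholder for a panel marker with no call in the dense data


def subset_to_panel(calls_by_line, panel_ids):
    """Keep only panel markers, in panel order; absent markers become MISSING."""
    return {
        line: [calls.get(marker, MISSING) for marker in panel_ids]
        for line, calls in calls_by_line.items()
    }


def marker_call_rates(subset, panel_ids):
    """Fraction of lines with a non-missing call at each panel marker."""
    n_lines = len(subset)
    return {
        marker: sum(1 for calls in subset.values() if calls[i] is not MISSING) / n_lines
        for i, marker in enumerate(panel_ids)
    }


# Hypothetical dense calls keyed by "chromosome:position" (0/2 = homozygous classes).
dense_calls = {
    "LineA": {"Gm01:1000": 0, "Gm01:1500": 2, "Gm01:2200": 2, "Gm02:300": 0},
    "LineB": {"Gm01:1000": 2, "Gm01:1500": 2, "Gm02:300": 0, "Gm02:900": 2},
}
panel = ["Gm01:1000", "Gm01:2200", "Gm02:300"]  # hypothetical target panel

panel_calls = subset_to_panel(dense_calls, panel)
rates = marker_call_rates(panel_calls, panel)
print(panel_calls)
print(rates)
```

Checking per-marker call rates after subsetting is a simple way to see which panel positions are poorly covered by low-pass data before the reduced set is uploaded.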

We have also continued to compile and organize the NUST phenotypic data into a format useful for uploading to our SOYGEN database housing historical phenotypic data (https://soybase.org/ncsrp/queryportal/). Data from 1993 to 2021 have been uploaded. These past 6 months we compiled the 2022 data into a form that can be easily uploaded and are working with Rex Nelson (SoyBase curator) to upload these data.

To maximize the usefulness of the data resources we are creating, streamlined workflows and applications are needed to enable genomics-assisted breeding in the context of actual cultivar development programs. We are making progress towards this through the development of a streamlined workflow implemented through an accompanying R Shiny application. A more general description and screenshots of the workflow and application are available in the proposal. During the last reporting period we made progress towards this goal through the following activities:

i) Testing several methods developed both in-house and by other research groups that validate parentage of breeding lines and breeding populations. This is useful as a quality control step and helps facilitate imputation.

ii) As part of our goal to reduce genotyping costs, we’ve tested methods that can impute low-density (LD) marker sets in progeny lines up to a high-density (HD) panel by using pedigree information from parental or related lines that have been assayed using an HD panel. An accurate imputation method of this kind has the potential to cut genotyping costs significantly when thousands of lines are assayed every year. Based on the excellent speed and accuracy of imputation reported for AlphaPlantImpute (API) in the literature, we’ve tested both the AlphaPlantImpute2 pedigree-based and population-based methods and compared them with LDKNNI (Linkage Disequilibrium K Nearest Neighbors Imputation), a population-based imputation method. If the parental information is provided in the input, AlphaPlantImpute2 uses genotypic information of HD parental lines to construct phased libraries and uses a hidden Markov model (HMM) to impute genotypic scores of progeny lines. If the founder information is not provided, it constructs phased libraries from the HD set, builds a cluster of surrogate parents based on non-discrepancy at homozygous sites, and uses this cluster to impute genotypic scores in progeny lines. The API methods showed better performance in our tests with simulated and NUST data sets. We hope to integrate this Python-based implementation into our R pipeline and test for stability of performance in the coming weeks.

iii) We also answered some questions on the optimal size of the LD marker set for accurate imputation and examined the impact of such imputation on cross-validation prediction accuracies. Preliminary results from these simulation studies will be presented as a poster at the Annual Tri-Societies meeting (Oct. 28 - Nov. 2, 2023).

iv) We tested several GxE models for integration into the genomic prediction pipeline; these will be integrated into the app in the coming months. The GxE models we intend to integrate also incorporate environmental data in the form of high-dimensional environmental covariates to boost the accuracy of prediction in novel environments. This has the potential to be extended to predicting performance in future environments.

v) We are also exploring the deployment of our software application on Cyverse Discovery and Virtual Interactive Computing Environment (DE-VICE) (https://cyverse.org/) for routine analysis and to make it available to our collaborators. Currently, we can store large amounts of data in the data store on Cyverse and make it accessible to the public. The aim is to provide access to the large low-pass sequencing data set we’ve collected on thousands of lines.

vi) We presented posters and talks on the genomic selection pipeline at two conferences this year, PAG and the Soy 2023 Molecular Biology and Genetics Conference. Several groups working on genomic selection were interested in deploying pipelines similar to the one we’ve developed, so we are preparing a manuscript describing the pipeline and the software application and hope to publish it in the coming months.
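
To illustrate the general idea behind the LD-to-HD imputation discussed in item ii) above, here is a deliberately simplified nearest-neighbor sketch. It is not AlphaPlantImpute2 and not LD-kNNi: it ignores phasing, recombination, the HMM, and LD-based weighting, and simply copies unobserved HD markers from the reference line(s) that best match the progeny at the LD markers. All line names and genotypes are hypothetical.

```python
from collections import Counter


def mismatches(progeny_ld, reference_hd, ld_index):
    """Count LD-marker mismatches between a progeny and one HD reference line."""
    return sum(
        1 for pos, call in zip(ld_index, progeny_ld) if reference_hd[pos] != call
    )


def impute_to_hd(progeny_ld, ld_index, references, k=1):
    """Fill HD markers from the k references closest at the LD markers.

    Observed LD calls are kept as-is; other positions take the majority call
    among the k neighbors (ties go to the call seen first, i.e. the closest).
    """
    ranked = sorted(
        references,
        key=lambda name: (mismatches(progeny_ld, references[name], ld_index), name),
    )
    neighbors = [references[name] for name in ranked[:k]]
    observed = dict(zip(ld_index, progeny_ld))
    imputed = []
    for pos in range(len(neighbors[0])):
        if pos in observed:
            imputed.append(observed[pos])
        else:
            votes = Counter(ref[pos] for ref in neighbors)
            imputed.append(votes.most_common(1)[0][0])
    return imputed


# Hypothetical HD reference lines (6 markers, coded 0/2 for inbred lines).
references = {
    "ParentA": [0, 0, 2, 2, 0, 2],
    "ParentB": [2, 2, 0, 0, 2, 0],
    "Sib1": [0, 0, 2, 0, 2, 0],
}
ld_index = [0, 2, 4]    # positions of the LD markers within the HD panel
progeny_ld = [0, 2, 0]  # progeny calls at those LD markers only

hd_calls = impute_to_hd(progeny_ld, ld_index, references, k=1)
print(hd_calls)
```

In this toy case the progeny matches ParentA at all three LD markers, so the remaining positions are copied from ParentA. Real methods improve on this by modeling haplotype segments, so that a recombinant progeny can copy different regions from different parents.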

We’ve uploaded genotypic data for more than 12,100 lines produced with various genotyping protocols, including BARCSoySNP6K, Agriplex Soy 1K, 1K MIPs, SoyNAM6K chip, and Gencove low-pass sequencing assays, each with a different number of SNPs. Collectively, the assayed information stored in the database comes to more than 100M genotypic data scores. Currently, our group is leading this effort to store and manage genotypic data on Soybeanbase, and we’ve invited several SOYGEN collaborators to start using it. We held a meeting with several SOYGEN3 collaborators on June 8 to demonstrate the use of Soybeanbase for data management and to describe the procedure for uploading genotyping protocols and methods. We also demonstrated the trials data management to Jacqueline Campbell and Marta, who are exploring the use of Soybeanbase for managing SUST trial data. More recently, we held a meeting with Steven Schnebly and Avi Kaler from Mcclintockseeds, who intend to use the publicly available NUST data set for genomic prediction in their cultivar development.
Genotyping Protocol              No. of Markers    No. of Genotypes
BARCSoySNP6K                     6000              2589
Agriplex Soy 1K SNP Panel        1242              2976
1K MIPS SNP Set                  995               359
SoyNAM6K                         4292              5188
Gencove Low Pass Seq (50K)       42500             1303
Gencove Low Pass Seq (50K)       4598              1303
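
As a quick consistency check on the "more than 100M genotypic data scores" figure, the marker and genotype counts listed above can be multiplied and summed. This sketch takes the listed numbers at face value and assumes every genotype has a score at every marker in its protocol (no missing calls), so it gives an upper bound on stored scores rather than an exact database count.

```python
# (protocol, number of markers, number of genotypes), as listed in the report.
protocols = [
    ("BARCSoySNP6K", 6000, 2589),
    ("Agriplex Soy 1K SNP Panel", 1242, 2976),
    ("1K MIPS SNP Set", 995, 359),
    ("SoyNAM6K", 4292, 5188),
    ("Gencove Low Pass Seq (50K)", 42500, 1303),
    ("Gencove Low Pass Seq (50K)", 4598, 1303),
]

# Scores per protocol = markers x genotypes, assuming a complete matrix.
scores = [markers * genotypes for _, markers, genotypes in protocols]
total_scores = sum(scores)

print(f"{total_scores:,} genotypic data scores")
```

Under that assumption the total comes to roughly 103 million, in line with the figure quoted in the text.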

Finally, another task of this grant was to enable public breeders to execute genomic prediction. We are building on two important assumptions here: 1) It is best to work through problems of executing genomic prediction together as part of this SOYGEN network, where we can share experiences and expertise, and 2) It is simply time to do it, whether the execution is ideal or not. Proceeding with execution will help us learn from our mistakes and create the workflows and expertise to eventually institutionalize these methods in our day-to-day programs. Across the SOYGEN network, over 7000 breeding lines have been genotyped. Individual breeding programs are at different stages in how they are using these data for prediction. Each program provided a report on how they are genotyping and using GS. We have established a monthly meeting to discuss how each program can move forward with GS. At this point, because of budget cuts to the SOYGEN program, we cannot commit to maintaining this as the funding is not there. Nevertheless, we will still try to leverage this network so we can advance GS in our programs together.


Objective 2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environmental characterization.

For this objective, we are conducting a multi-environment, multi-institutional coordinated performance trial of 1200 diverse breeding lines. Each breeding line will be phenotyped for several agronomic and phenological traits, and each will be genotyped using low pass re-sequencing technologies. Detailed environmental data for each growing location in each year will be collected and analyzed. The ultimate goal is to better predict the interactions between the environment and genotype. If we are successful, we will be able to leverage genomic data, phenotype data, and environmental data to predict how new breeding lines may perform in future environments that a producer is most likely to encounter.
The last report focused on the selection of each MG set consisting of 300 breeding lines. During the last reporting period, the main goal was to plant, manage, and harvest small seed increases for each MG set at two locations. We planted seven seed increase hubs as follows: ND and MN for the 0.5 – 1.5 RM set; IA and NE for the 1.5 – 2.5 RM set; IN and NE for the 2.5 – 3.5 set; and KS and MO for the 3.5 – 4.5 set. All locations were successfully planted and established except for the MO location, which experienced extreme drought in the early season, preventing good stand establishment. Flower color notes, pubescence color notes, and maturity dates were taken at most locations if time allowed. Some locations, such as Kansas State, experienced severe drought, and maturity date there was simply correlated with water availability, so they ceased taking maturity dates. Harvest has been completed for most locations. We expect good seed returns for at least 250 lines at each location. Once harvest is completed, we will evaluate morphological data (flower, pubescence, hilum color) and maturity date to QC the seed increases and panels, and determine how seed will be organized and shipped to yield trial sites in 2024.


Objective 3. Discover structural variants and test whether modelling structural variants improves genomic predictions for yield and seed composition.

For the sequences of the 41 SoyNAM founders, we have obtained high quality DNA from dark-adapted greenhouse-grown plants, and sequenced the whole genomes with Illumina technology to very high depth and quality. Structural variant calling is currently ongoing using the computer facilities at Illinois and will take at least two months to complete. Meanwhile, the SOYGEN group used the core of our preliminary data and research plans to lead a proposal to DOE JGI to produce a high-quality, reference-grade pangenome consisting of 400 reference-quality soybean genomes, each of similar quality to the current Williams 82 reference genome. This proposal was approved and the award was announced recently. Going forward, this will greatly expand the power of SOYGEN to identify the most desirable haplotypes in different environments across the North Central region and thus accelerate soybean breeding and improvement.

Final Project Results

Benefit To Soybean Farmers

Soybean breeding has a large impact on the efficiency and profitability of agriculture through the development of high yielding new varieties with critical defensive traits and enhanced seed composition. Ensuring that such programs (both private and public) are using state-of-the-art technologies to drive genetic gain in the face of changing environments and narrowing genetic diversity will contribute to continual development and release of ever better varieties. Additionally, these efforts help to educate future agricultural scientists and soybean breeders that are best prepared to enter the seed industry and develop impactful future products for farmers, keeping the North Central region competitive in soybean production.

The United Soybean Research Retention policy will display final reports with the project once completed, but working files will be purged after three years and financial information after seven years. All pertinent information is in the final report; if you want more information, please contact the project lead at your state soybean organization or the principal investigator listed on the project.