Increasing the rate of genetic gain for yield in soybean breeding programs
Sustainable Production
Lead Principal Investigator:
Leah McHale, The Ohio State University
Co-Principal Investigators:
William Beavis, Iowa State University
David Hyten, Iowa State University
Asheesh Singh, Iowa State University
William Schapaugh, Kansas State University
Dechun Wang, Michigan State University
Jianxin Ma, Purdue University
Katy M Rainey, Purdue University
Brian Diers, University of Illinois at Urbana-Champaign
Matthew Hudson, University of Illinois at Urbana-Champaign
Aaron Lorenz, University of Minnesota
Pengyin Chen, University of Missouri
Andrew Scaboo, University of Missouri
George Graef, University of Nebraska
Steven Clough, USDA/ARS-University of Illinois
+13 More
Project Code:
Contributing Organization (Checkoff):
Institution Funded:
Brief Project Summary:

There are several possible targets for improving soybean yield if the genetic gain equation is considered. These targets include increasing the selection intensity, increasing measurement accuracy, increasing genetic diversity and additive genetic variance, and decreasing the amount of time required for each breeding cycle. Through coordinated activities across 12 breeding programs in the North Central region, the project objectives address one or more of the target areas. The final objective is aimed at developing a metric to accurately assess realized genetic gain for yield on an annual basis.
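The genetic gain equation referred to here is commonly written as gain per year = (i × r × σA) / L, where i is selection intensity, r is selection accuracy, σA is the additive genetic standard deviation, and L is the length of the breeding cycle in years. The minimal sketch below shows how each target area enters the calculation. All input values are hypothetical and do not come from this project.

```python
def genetic_gain_per_year(selection_intensity, accuracy, additive_sd, cycle_years):
    """Expected genetic gain per year: (i * r * sigma_A) / L."""
    if cycle_years <= 0:
        raise ValueError("cycle_years must be positive")
    return selection_intensity * accuracy * additive_sd / cycle_years


# Hypothetical baseline program (illustrative values only, not project data).
baseline = genetic_gain_per_year(
    selection_intensity=1.4,  # standardized selection differential
    accuracy=0.5,             # correlation between predicted and true breeding value
    additive_sd=3.0,          # additive genetic standard deviation, in yield units
    cycle_years=7.0,          # years per breeding cycle
)

# Halving the cycle length doubles the expected gain per year.
faster = genetic_gain_per_year(1.4, 0.5, 3.0, 3.5)

print(f"baseline gain per year: {baseline:.3f}")
print(f"gain per year with a halved cycle: {faster:.3f}")
```

Because the terms multiply, a modest improvement in several of them at once can compound, which is why the objectives address more than one target area.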

Key Benefactors:
farmers, geneticists, breeders

Information And Results
Project Deliverables

Objective 1:
• Observed data, selection information, pedigree and plot layout (range-row information), and shipment of seed for planting for breeders at 11 locations.
• Overall data management plan, preliminary analytical pipeline implemented and coordinated by the Rainey lab.
• For each breeder’s lines, ranking for yield breeding value, maturity prediction and a metric of diversity. Lines selected using additional sources of information may provide higher rank-order correlation with the performance of preliminary yield trials.
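Rank-order correlation between a progeny-row ranking and later preliminary yield trial performance can be measured with Spearman's coefficient. The sketch below is a generic standard-library implementation, not the project's analytical pipeline, and the line scores and yields are made up for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: model scores for five progeny rows and the yields the
# same lines later achieved in a preliminary yield trial.
predicted_score = [3.1, 2.4, 4.0, 1.8, 3.6]
trial_yield = [52.0, 47.0, 55.0, 49.0, 58.0]

print(f"rank-order correlation: {spearman(predicted_score, trial_yield):.2f}")
```

A higher coefficient for one selection model than another, on the same lines, would indicate that its ranking agrees more closely with trial performance.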

Objective 2:
• A community resource for genomic prediction consisting of a set of soybean lines that can be used to establish genomic prediction to help expedite genetic gain for yield.
• Novel inexpensive and rapid genotyping methods that can be used for genomic prediction and selection.
• Genotype imputation methods and tools to increase marker density and allow connection of data sets collected using diverse genotyping platforms.

Objective 3:
• Databased information on success of parent combinations.
• Method development for selection of parents using genomic prediction models developed in Objective 2.
• High-quality, multi-environment yield and other agronomic performance data for 500 PIs in the USDA Soybean Germplasm Collection.
• Identify yield-marker genotype relationships based on association mapping results from the extensive, high-quality yield dataset.
• Develop predictive model(s) that allow selection of superior high-yield genotypes from the USDA germplasm collection.
• Incorporate high-throughput phenotype data, plant developmental data, and environment data in the models.
• Public use of data, documentation of results.
• Candidate yield-conferring haplotypes from exotic germplasm.
• High yielding lines derived from wild soybean and G. tomentella.
• High-quality high throughput SNV and structural variants matrix for WGS panel.
• A list of candidate genomic regions and/or haplotypes associated with yield-related traits.
• Validated molecular markers for selection of yield component traits.

Objective 4:
• Short videos describing history and future developments of genetic gain to non-experts.
• Establish objective criteria for evaluating methods that estimate realized genetic gains.
• Deposit public, commercial, and simulation data resources with agreed-upon nomenclature and format rules in a single shared file server.
• Simulation software for generating yields of potential varieties in various stages of field trials.
• Establish the potential range of field and laboratory resources that may be needed to realize RGG of 50% greater than current RGG.

Final Project Results

Updated May 4, 2020:
OBJECTIVE 1: Increasing selection intensity and decreasing non-genetic sources of variability through improved progeny row testing

We conducted selections on progeny rows for several breeding programs using models that incorporate pedigree selection, spatial adjustment, and UAS canopy or other phenotypes. We processed UAS imagery for canopy cover and color for Schapaugh, Wang, Lorenz, and Rainey, and we attempted it for Scaboo. Our preliminary yield tests based on selections from 2018 have been harvested. This is the second preliminary yield test based on the progeny row selection models implemented in this study.

1.e. Deliverables.
• We have reported observed data, reported breeder selection information, and reported pedigree and plot layout (range-row information) for breeders at 10 locations.
• Implementation of selection models unique to each breeder’s needs has been achieved.

1.f. Key Performance Indicators or performance measures (year 3).
• Additional data were collected on progeny rows in 10 programs. All programs reporting have collected additional phenotypic data for selection on progeny rows for two years.
• Selections completed before harvest for programs electing not to use yield data.
• Preliminary yield trials, organized by each breeder to test selection accuracy for the 10 breeding programs, are ongoing for the 2019 season; this indicator was achieved in 2018.
• A selection accuracy assessment based on three years of data, a manuscript, and reports are in preparation.

OBJECTIVE 2: Increasing selection coefficient and decreasing length of breeding cycle through genomic selection

The Lorenz lab has continued to curate phenotypic data and available metadata on the Uniform Northern Regional trials from 1993–2017. These data, and associated genotypic data on nearly 2000 lines, were made available to SoyBase for public databasing/availability. Phenotypic data from 2018 have been compiled, QC’ed, and added to the training set. Training models including the 2018 data and new genotype data have been created.

A manuscript on the publicly available data from the Uniform Northern Regional Trials will include preliminary analyses on population structure, genomic distribution of allelic variation within and between breeding programs/maturity groups, association analyses using historically collected phenotypes, and initial assessments of prediction accuracy using the URT data. A. Lorenz gave a presentation at the 2019 Soybean Breeder’s Workshop describing this resource.
Beyond the genomic selection work, we have also used the URT data to perform an association analysis on IDC tolerance. We detected a strong association on chr 5. Results from this analysis were combined with other fine mapping work, and a manuscript is currently in review at Plant Genome.

DNA extracted from 1500 UMN breeding lines in preliminary yield trials was sent to the Hyten Lab at University of Nebraska for genotyping using 1000 SNPs selected from the URT genotype dataset described above. The Hyten lab did extensive testing of multiple genotyping methods, specifically MIP protocol parameters, to try to reduce the time needed to perform the protocol. Currently the hybridization and extension steps are performed separately and can take over 24 hours. They also tested all these conditions with both CTAB-extracted DNA and the nanoparticle DNA extraction method, as well as various oligo:DNA ratios. Overall, the CTAB-extracted DNA outperformed the nanoparticle-extracted DNA for all conditions tested.

Based on our current implementations, the cost of the MIPs reaction is down to $4.88. The CTAB DNA extraction estimate was increased to $1.43 per reaction due to the need to include the cost of using service equipment for the extraction. Because of the high cost of CTAB extraction, we will continue to increase testing of the nanoparticle DNA extraction to try to reach the goal of getting DNA extraction costs below $0.50.
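As a quick check of the per-sample arithmetic, the sketch below adds the reported MIPs reaction cost to the reported CTAB extraction cost, and to the stated extraction cost goal. The variable names are illustrative. The only inputs are the three dollar figures quoted in this report.

```python
from decimal import Decimal

# Figures quoted in this report (US dollars per sample).
MIP_REACTION_COST = Decimal("4.88")
CTAB_EXTRACTION_COST = Decimal("1.43")
EXTRACTION_COST_GOAL = Decimal("0.50")  # the goal is to get below this value

# Current per-sample cost with CTAB extraction.
current_total = MIP_REACTION_COST + CTAB_EXTRACTION_COST

# Upper bound on the per-sample cost if the extraction goal is met.
goal_total_upper_bound = MIP_REACTION_COST + EXTRACTION_COST_GOAL

print(f"current total per sample: ${current_total}")
print(f"total per sample if extraction drops below goal: under ${goal_total_upper_bound}")
```

Decimal is used instead of float so that the currency sums are exact.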

We have run the 1k probe set on two NAM populations to help assess our accuracy of calling heterozygotes using the MIPs protocol. The data produced look good and are currently being analyzed for SNP calling.

2.e. Deliverables.
• A community resource for genomic prediction consisting of a set of soybean lines that can be used to establish genomic prediction to help expedite genetic gain for yield has been expanded as described above.
• Initial dataset has been delivered to SoyBase for posting. A manuscript is in preparation describing this dataset. We are continuing to build on this database by collecting and genotyping the 2018 URT entries and 2019 URT entries.
• A novel, inexpensive, and rapid genotyping method has been developed that can be used for genomic prediction and selection. From the different conditions tested, we have been able to reduce the time it takes to prepare MIPs reactions for sequencing by more than half, increasing throughput. We have also continued to demonstrate that the lower reaction volumes and enzyme concentrations produce good data, helping to get the cost of the MIPs reaction under $5.
• Towards the development of genotype imputation methods, we have published a cluster-flexible structural variant calling workflow using Cortex-var on GitHub; completed whole-genome SNP and short indel calls via the Sentieon Haplotyper algorithm for 481 samples; completed whole-genome structural variant calling via Cortex-var for 481 samples; and completed whole-genome structural variant calling via the Sentieon DNAscope algorithm.

2.f. Key Performance Indicators or performance measures (year 3).
• Demonstrated ability to leverage historical URT data for making genomic predictions in soybean.
— We have shown this using cross validation within the URT dataset. We are collecting genotype data on entries in prelim yield trials. Once this is collected, we will assess our ability to predict new breeding lines.
• A GBS method that can genotype 200-1000 markers with less than 10% missing data and greater than 95% accuracy has been developed and implemented on 770 soybean lines with the MIPs 1k probe set.
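The two genotyping thresholds named in this indicator (less than 10% missing data, greater than 95% accuracy) can be checked against a reference call set as sketched below. This is an illustrative calculation only. The 0/1/2 genotype coding, the use of None for a missing call, and the example matrices are assumptions, not the project's data or software.

```python
def missing_rate(calls):
    """Fraction of genotype calls that are missing (None)."""
    total = sum(len(row) for row in calls)
    n_missing = sum(1 for row in calls for call in row if call is None)
    return n_missing / total


def concordance(calls, reference):
    """Fraction of non-missing calls that match the reference genotype."""
    compared = 0
    matched = 0
    for call_row, ref_row in zip(calls, reference):
        for call, ref in zip(call_row, ref_row):
            if call is None or ref is None:
                continue  # skip positions without a call in either set
            compared += 1
            if call == ref:
                matched += 1
    return matched / compared


def meets_thresholds(calls, reference, max_missing=0.10, min_accuracy=0.95):
    """True if missing data is below max_missing and accuracy exceeds min_accuracy."""
    return missing_rate(calls) < max_missing and concordance(calls, reference) > min_accuracy


# Hypothetical data: two lines x ten markers, coded 0/1/2 (copies of one allele).
reference = [
    [0, 1, 2, 0, 2, 1, 0, 2, 2, 0],
    [2, 2, 0, 1, 0, 0, 2, 1, 0, 2],
]
calls = [
    [0, 1, 2, 0, None, 1, 0, 2, 2, 0],
    [2, 2, 0, 1, 0, 0, 2, 1, 0, 2],
]

print(f"missing rate: {missing_rate(calls):.2%}")
print(f"concordance: {concordance(calls, reference):.2%}")
print(f"meets thresholds: {meets_thresholds(calls, reference)}")
```

Heterozygote calling accuracy could be examined the same way by restricting the comparison to positions where the reference genotype is 1.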

OBJECTIVE 3: Increasing additive genetic variance

We obtained multi-sensor data on 250 accessions in field tests at locations in NE, KS, IA, and MO representing MGs I, II, III, and IV. This served as a test run during 2018 to prepare for our planned image and sensor data collection from all North Central locations during the 2019 season. Based on our experience from the 2018 plots, we constructed a pheno cart specifically for the plot layout of the NCSRP experiments so our data collection will be efficient and our processing time reduced. We have run protein and oil NIRS on seed samples from all plots received from all cooperators for the 2018 tests. Seed for 2019 (year 2 of field testing) was organized, distributed, and planted by cooperators. Due to the unique weather conditions of 2019, of the 16 locations grown this year, four locations were unable to be planted or were considered un-harvestable due to weather damage or weed pressure. The confirmation population was selected based on genomic estimated breeding values derived from the sampled individuals (training population) from the USDA soybean germplasm collection. We performed genome-wide association analyses and subsequently identified QTL for yield, plant height, maturity, seed weight and lodging data using the set of 500 accessions (training population) for which we had previously collected data.

A few of our G. max x G. tomentella and G. max x G. soja derived lines continued to show good yield potential in 2018 field tests, with some material being the highest yielding across several tests. Complex and simple F2 or backcross lines derived from crosses with wild soybean (Glycine soja) or G. tomentella have yields comparable to checks over three years. It has been determined that some of the G. max x G. soja derived lines have 18-24% of the DNA of G. soja, but we have not yet been able to confidently determine if any G. tomentella nuclear DNA is present in the G. max x G. tomentella derived lines. Genomic sequence analysis of G. tomentella derived lines has failed to yield evidence of G. tomentella introgression.

The Ma lab analyzed the haplotypes of genomic regions surrounding previously identified maturity genes E1, E2, E3, E4, E9, and J, as well as genes associated with branching angle and canopy coverage, in ~800 re-sequenced soybean germplasm accessions and is in the process of designing PCR-based molecular markers for precise evaluation of the 240 PIs included in this project. We examined genetic variation of the major maturity genes in the ~800 lines. The information has confirmed causal mutations and is useful for the design of individual maturity gene-specific markers for more accurate marker-assisted selection. We have examined genetic variation at the gene locus (GmBa1) controlling branching angle and canopy coverage in ~800 lines with available re-sequencing data. Our data indicate that the causal mutation occurred in the promoter region. A construct overexpressing GmBa1 has been made, and validation of the gene’s function is in progress.

We have made crosses between two high-yielding RILs derived from G. soja and two elite lines to dissect the yield QTLs from G. soja that have been mapped on chromosomes 8 and 11. We are continuing functional validation of domestication-related traits, including growth habit and leaf shape and size, by transformation. We have investigated the distribution of a reciprocal chromosomal translocation in representative G. max and G. soja populations. Our data suggest that such a translocation played a major role in isolating domesticated soybean from its wild relatives.

Through genomic analyses of conventional and alternative germplasm pools at ancestral and elite stages of breeding development, we have identified, and tracked across populations, the genomic changes caused by selection for yield in the alternative and conventional gene pools.

3.e. Deliverables.
• High-quality, multi-environment yield and other agronomic performance data for 500 PIs in the USDA Soybean Germplasm Collection. Some cooperators in this test have used some of the PIs in their current diversity program crosses.
• Identified yield-marker genotype relationships based on association mapping results from the extensive, high-quality yield dataset. We performed GWA analysis on the complete set of data for yield, plant height, maturity (days after planting), seed weight, and lodging. The seed composition dataset is being finalized and will be analyzed shortly. In general, the entire set of 500 accessions identified QTL better than any single sampling method. There may be a unique QTL for yield that was identified in the CLU sample that did not show up in the other samples or overall.
• Developed predictive model(s) that allow selection of superior high-yield genotypes from the USDA germplasm collection. Results show that the SSD sampling method more effectively reflects the total genetic variation in the Soybean Germplasm Collection than Random or Cluster sampling groups. Furthermore, the variance for most traits measured is nearly double in the SSD sample vs. the other sampling groups. For genomic prediction models, cross validation results showed that the SSD sampling group performed better than RAN and CLU for predicting yield, and at least as well as when the entire set of 500 accessions was used for prediction. These results support our hypothesis and verification that the SSD sampling method using genotype information is an efficient and effective way to sample the germplasm collection. This has important implications for use of the Soybean Germplasm Collection and other germplasm collections around the world.
• Incorporated high-throughput phenotype data, plant developmental data, and environment data in the models.
We collected multi-sensor phenotype information on the NCSRP germplasm sampling validation plots in NE, MN, KS, IA, MO, IN and IL at two stages of development during the 2019 season: (1) near V5 and (2) just prior to R5. Results from our prior work and that of others show that image and spectral data from these two growth stages have the highest correlations with yield. We have some good video from these trips that can be used for presentation, education, and PR in the future for our SOYGEN objectives. Once the 2019 data are received after harvest, we will begin compiling and analyzing the data from the validation study, including the high-throughput phenotype information.
• High yielding lines derived from crosses with G. tomentella have been identified; however, no introgressions from G. tomentella have been identified.
• Multiple lines derived from wild soybean (G. soja) F2 crosses with Williams 82 or BC1 crosses with Williams 82 yield 80-97% and 89-95% of the checks over three years, respectively.
• We have putatively identified causal mutations for maturity genes.
• We have putatively localized the causal mutation in the GmBa1 gene for branching angle to the promoter region.
• A set of molecular markers that can be used for trait selection and evaluation has been designed and is now under validation.
• A scientific manuscript that reports the contribution of wild type soybeans (G. soja) as a source of genomic diversity in soybean elite lines is in preparation.
• A scientific manuscript that reports the candidate regions under selection for yield in the alternative and conventional gene pools, and the potential application of these results for increasing yield in elite lines, is in preparation.
• Workflow scripts for studying population genomics in soybeans using SNP array data have been developed. Using this resource, the researcher/breeder will be able to select samples in populations based on the results of genetic diversity, structure, genetic association, and selection analysis.
• We have identified the direction of selection for each candidate haplotype in elite populations. This result might help increase the efficiency of selecting parental combinations and decrease the length of the breeding cycle.

3.f. Key Performance Indicators or performance measures (year 3).
• High quality yield and seed composition data on 500 PIs from the USDA Soybean Germplasm Collection from 14 environments, seven environments in each of two years.
• Preliminary model to predict yield and seed composition on PIs from the USDA Soybean Germplasm Collection. One or more potential yield-conferring haplotypes identified from exotic sources used to select parent lines for yield improvement. Models were developed using each sampling group alone as well as the complete dataset, and cross-validation was used to test effectiveness. Genotype information improved the models, but including GxE effects added little to the prediction. Model results and predictions were shared with all cooperators.
• Tentative identification of lines derived from wild soybean that can be used as parents in variety development programs. Yield testing shows that this is possible (see above).
• Determination of introgression of G. tomentella DNA in lines derived from G. max x G. tomentella. All evidence so far, based on thorough sequence analyses, indicates that the G. max x G. tomentella 2n=40 derived line 12ST4-5 has no G. tomentella nuclear DNA introgressed in its genome. We still have more analyses to perform before being more conclusive, and these analyses should be completed before this grant expires.
• Molecular markers for precise tagging of five maturity genes and a newly identified branching angle/canopy coverage gene.
• A detailed list of 18 candidate haplotypes/genomic regions underlying yield-related traits that are under selection in the elite soybean lines from conventional and alternative pools has been identified.
• A list of Alternative Elite Lines belonging to MG III and MG IV that contain the combination of most favorable haplotypes has been identified; these lines could be used to improve performance when designing parental combinations for soybean breeding crosses.

OBJECTIVE 4: Development of a metric to estimate genetic gains on an annual basis

Other than infrequent direct comparisons of historical varieties in common gardens (a.k.a. decade studies), there are no more than five published methods for evaluating genetic gains from annual field trials, and all of these recognize the difficult problem of removing non-genetic (agro-environmental) sources of variability. Of these, only two have proposed using genotypes from adjacent years to adjust for annual non-genetic sources of variability. Two years ago, we published an EM algorithm that used check varieties to estimate non-genetic sources of variability in staged commercial field trials. In addition, we developed a mixed model approach in which entries common among years are used to obtain shrunken (BLUP) values for non-genetic environment-year combinations; these values will be used as non-genetic covariates in assessing genetic gains for yield using data from the uniform trials.
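The idea of using entries shared between adjacent years to separate non-genetic year effects from genetic change can be illustrated with a much simpler calculation than the mixed model described here. The sketch below is not the project's method and performs no BLUP shrinkage. It chains the mean yield difference of common entries between consecutive years into a cumulative year effect, then subtracts that effect from each year's trial mean. All entry names and yields are invented.

```python
def year_effects_from_common_entries(trials):
    """Estimate each year's non-genetic effect relative to the first year.

    trials maps year -> {entry name: yield}. Consecutive years must share
    at least one entry; the mean change in the shared entries is attributed
    to the environment, because their genetics did not change.
    """
    years = sorted(trials)
    effects = {years[0]: 0.0}
    for prev, cur in zip(years, years[1:]):
        common = set(trials[prev]) & set(trials[cur])
        if not common:
            raise ValueError(f"no common entries between {prev} and {cur}")
        shift = sum(trials[cur][e] - trials[prev][e] for e in common) / len(common)
        effects[cur] = effects[prev] + shift
    return effects


def adjusted_year_means(trials):
    """Trial mean per year after removing the estimated year effect."""
    effects = year_effects_from_common_entries(trials)
    return {
        year: sum(entries.values()) / len(entries) - effects[year]
        for year, entries in trials.items()
    }


# Hypothetical trials: entry B is shared by 2017 and 2018, entry C by 2018 and 2019.
trials = {
    2017: {"A": 50.0, "B": 52.0},
    2018: {"B": 57.0, "C": 59.0},  # a favorable year inflates all yields
    2019: {"C": 51.0, "D": 53.0},  # an unfavorable year depresses all yields
}

print(year_effects_from_common_entries(trials))
print(adjusted_year_means(trials))
```

In this toy example the raw yearly means rise and then fall, while the adjusted means increase steadily, which is the pattern an estimator of realized genetic gain needs to recover.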

In order to estimate the realized genetic gain (RGG) on an annual basis, simulation software based on the AlphaSimR package was implemented in the R environment. The simulation pipeline simulates 14 years of multi-environment trials (MET) of a soybean breeding program. The parameters of the simulation, such as heritability, number of genotypes and locations, selection intensity, etc., will be based on real public breeding programs. These parameters were obtained in interviews with Brian Diers, Aaron Lorenz, George Graef, and Leah McHale. The main goal of the simulation is to have the true genetic values of genotypes across generations and years, and therefore to be able to estimate the RGG. The following six models were also implemented in the R environment to estimate the RGG: (i) meta-analysis for yield correction due to a reference year; (ii) linear regression with unadjusted averages of genotypes; (iii) linear mixed model with simple effects; (iv) linear mixed model with a contrast of checks against regular genotypes; (v) three-way linear mixed model with subsequent simple linear regression of adjusted means of genotypes and years; and (vi) GBLUP model. We are working on criteria for evaluating the models.
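The value of knowing true genetic values in a simulation can be shown with a toy example. The sketch below is written in plain Python and does not use AlphaSimR or reproduce the project's pipeline. It simulates 14 yearly cohorts whose true genetic mean rises by a fixed amount, adds a random year effect and plot error, and then compares the slope of the true genetic means with the slope from a regression on unadjusted yearly averages, in the spirit of model (ii). Every parameter value is hypothetical.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_program(n_years=14, n_lines=200, true_gain=0.5,
                     genetic_sd=2.0, year_sd=3.0, error_sd=4.0, seed=1):
    """Return (years, true genetic means, unadjusted phenotypic means)."""
    rng = random.Random(seed)
    true_means = []
    pheno_means = []
    for year in range(n_years):
        year_effect = rng.gauss(0.0, year_sd)  # non-genetic effect shared by all plots
        genetic = [50.0 + true_gain * year + rng.gauss(0.0, genetic_sd)
                   for _ in range(n_lines)]
        pheno = [g + year_effect + rng.gauss(0.0, error_sd) for g in genetic]
        true_means.append(sum(genetic) / n_lines)
        pheno_means.append(sum(pheno) / n_lines)
    return list(range(n_years)), true_means, pheno_means


years, true_means, pheno_means = simulate_program()
true_rgg = ols_slope(years, true_means)     # known only because this is a simulation
naive_rgg = ols_slope(years, pheno_means)   # what unadjusted averages would suggest

print(f"slope of true genetic means: {true_rgg:.3f}")
print(f"slope of unadjusted phenotypic means: {naive_rgg:.3f}")
```

The gap between the two slopes comes mainly from the year effects, which is the source of bias that the more elaborate models in the list aim to remove.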

4.e. Deliverables.
• One 10-minute video describing what genetic gain is has been developed and delivered to NCSRP.
• For the production of a second short video describing history and future developments of genetic gain to non-experts, interview questions for farmers in January/February 2020 were developed.
• Sources of bias in evaluating genetic and non-genetic sources of variability have been identified in both direct comparison and long term field trials. The latter are being addressed with the UT data from commercial and public plant breeders.
• Simulations of genetic gain using genomic and phenotypic selection in soybean breeding population structures were documented with R Markdown and are being evaluated before being deposited in a public repository.
• Recently released simulation software (AlphaSimR) was identified and evaluated for generating yields of potential varieties in various stages of field trials.

4.f. Key Performance Indicators or performance measures (year 3).
• One 10-minute video describing what genetic gain is has been developed and delivered to NCSRP.
• Three manuscripts are in preparation to document the establishment of four objective criteria for evaluating methods that estimate realized genetic gains.
• Publicly available simulation software was identified that is sufficiently flexible for a skilled graduate student to generate yields of potential varieties from multiple (>1) families grown in multiple (>1) environments in at least one stage of field trials.

This project has focused on increasing the rate of gain in soybean breeding programs through improving selection accuracy, increasing the strength of selection, decreasing the time it takes to go through a single cycle of selection, increasing the genetic diversity of breeding populations, and developing a metric to assess improvements. Here we highlight some key findings from each of these foci.

The progeny row stage of selection in breeding programs is often considered the most inefficient. Breeders have insufficient seed for replicated yield plots, but have a large number of unique lines from which the best must be selected for ongoing field trials. Selections are sometimes based on inaccurate yields, carried out "by eye", or made by other less-than-ideal methods. Here we investigated whether adding characters such as reproductive period (time to maturity minus time to flowering) or canopy coverage measured by UAV, and/or applying spatial statistics and pedigree information to models, could improve the accuracy of selection. Three years of field data (representing two rounds of selection from progeny rows) were collected across nine North Central breeding programs, and the accuracy of selection was improved. Final results and a manuscript describing these results are in preparation.

By using molecular markers distributed throughout the genome, breeders can evaluate a larger number of individuals and select more extreme individuals, thus increasing the strength of selection. In addition, by bypassing time-consuming and costly field evaluations (requiring multiple generations of inbreeding and seed increase), evaluation by genome-wide molecular markers (called genomic selection) can reduce the time per cycle of selection. In this project, we were able to database the public Uniform Trials and use these data to create models for genomic selection. We were also able to develop an inexpensive and high-throughput genotyping platform (1000 molecular markers targeted to the diversity in public breeding programs, available for under $5, or under $6.50 if DNA isolation costs are included). Both of these make genomic selection available and feasible for public breeding programs.

Increasing genetic diversity of our breeding programs will increase our ability to genetically improve soybean; however, the increase in diversity needs to be carried out in an informed manner so that poor genetic backgrounds do not decrease overall yields. To do this, we implemented a number of methods. We characterized 500 accessions from the USDA Soybean Collection and evaluated how well genetic models developed from these accessions could predict the yield and agronomic characters of other accessions in the collection. Characterization included yield and agronomic traits evaluated over two years across the North Central region. This represents one of the largest (if not the largest) and most in-depth characterizations of the USDA Soybean Collection and has informed our knowledge of how best to sample germplasm collections in general. We investigated the potential of accessions of the wild soybean relatives Glycine soja and Glycine tomentella to donate alleles to soybean to improve yield and found that while Glycine soja likely possessed valuable alleles to increase yield, there was no evidence that the more distant Glycine tomentella donated valuable yield alleles in our selected materials. Key genes for maturity, branching, and other domestication traits were characterized and markers designed for them. These markers will be useful to speed the introduction of exotic alleles into elite, adapted genetic backgrounds. Finally, we identified alleles potentially associated with yield from an alternative germplasm pool. These alleles could represent valuable alleles that are absent from our current elite cultivars or provide selection targets to make the introduction of exotic alleles into elite, adapted genetic backgrounds more efficient.

In an effort to be able to quantify changes and, hopefully, improvements in genetic gain, simulation software was evaluated for generating yields of varieties in various stages of field trials. In order to inform audiences on the concept of genetic gain, informational videos have been developed and released to NCSRP, as well as presented to broad audiences.
