Development of integrative big-data driven technologies for soybean product development
Sustainable Production
Data analysis, Data Management, Enabling technology, Genetics, Genomics
Parent Project:
This is the first year of this project.
Lead Principal Investigator:
Yong-qiang Charles, USDA-ARS
Co-Principal Investigators:
Project Code:
Contributing Organization (Checkoff):
Institution Funded:
Brief Project Summary:
In the past decade, my laboratory has devoted tremendous effort to consolidating, organizing, cleaning, and analyzing more than 40 terabytes (TB) of data, including 5,000 genomes, roughly 3,900 transcriptomes, and millions of phenotypic data points. We have been developing a suite of innovative data-mining methodologies and strategies, integrating approaches from multiple disciplines, to discover and validate several important trait genes and new genetic resources. In this proposed research, we will continue to develop new data-mining tools and generate new resources, which will be transferred to the soybean community in close collaboration with Soybase, KB Common/SoyKB, and other public database teams.
Information And Results
Final Project Results

Progress Summary: With the advance of high-throughput technologies and the increasing recognition of their impact on soybean research and product development, researchers worldwide have invested heavily and generated a massive amount of genomic, genetic, and phenotypic data. This provides unprecedented power and opportunities to address many issues that the US soybean industry faces. Unfortunately, analyzing and using this data has become a major challenge for most soybean researchers, because it requires a lot of tedious work, powerful computational infrastructure, new technical skills and research methodologies, and new tools to organize, analyze, and mine the data. Over the past decade, we have devoted significant effort to developing large-scale data analysis and data-mining technologies and new uses of such large-scale datasets for soybean improvement. Applying these technologies and datasets, we have successfully discovered causative genes and alleles underlying the quantitative trait loci (QTLs) for traits important to US soybean farmers. The objectives of the project are to consolidate and analyze the data generated by researchers worldwide into user-friendly data resources, and to integrate the multidisciplinary data-driven approaches that we developed in the past decade into a robust big-data-driven technology platform for the US soybean community. The project is progressing well and was ahead of all milestones (see Detailed Progress) in FY2023. We applied our big-data analysis pipeline to consolidate and analyze publicly available genome sequencing data from ~12,000 soybean accessions and ~8,000 transcriptome sequencing datasets. We generated several datasets covering DNA variants and the expression of all genes under different treatments.
Using a set of new data-mining strategies, we identified a large set of QTLs, putative causative genes and alleles, and functional markers, and inferred genetic and gene networks for soybean genetic improvement. We devoted significant effort to deploying data sources and data technologies to the soybean community and providing technical assistance in using the data. We made eight oral and poster presentations at the Cellular and Molecular Biology of the Soybean conference, the Plant and Animal Genome Conference, and the Soybean Breeder Workshop to promote the use of the data resources and technologies. We also provided technical assistance, consultation, and our existing data-driven technologies to US companies such as Impossible Foods, as well as universities and research institutes, for a variety of uses. We are finishing multiple publications to deploy our new data sources and technologies to the soybean community. Our earlier publication of 1,500 soybean genomes and the accompanying dataset, released in Soybase and Ag Data Commons, has been accessed over 2,000 times. With the increasing use of large-scale data in soybean research, we expect use of the new data sources and data-driven approaches from the project to increase significantly, with a great impact on US soybean agriculture and industry.

Benefit To Soybean Farmers

The project takes advantage of emerging and transformative big-data science technologies and integrates them with multidisciplinary scientific approaches to develop, validate, and deploy a versatile big-data enabling technology platform. This platform aims to translate the huge amount of genetic, genomic, and phenotypic data generated by scientists worldwide (representing at least one billion US dollars of accumulated investment over the past decades) into US soybean research and product development. The platform should provide cost-effective approaches and greatly increase the efficiency of US soybean research and product development in both the public and private sectors, eventually improving US farmers' returns on investment. As the significance of big-data-driven technologies is increasingly recognized, the amount of biological data continues to grow exponentially, and big-data-driven approaches will be an important part of next-generation biological research. However, there is a severe shortage of data scientists who know how to turn the data into knowledge and develop new solutions to the problems that the US soybean industry faces. The project will train young soybean data scientists to lead our "next-generation" soybean research and product development. In addition, the project will facilitate the rapid integration of big-data technologies into US soybean research and product development, which is critical for the US soybean industry to maintain its leading position in the world.

Under the United Soybean Research Retention policy, final reports are displayed with the project once it is completed, but working files are purged after three years and financial information after seven years. All pertinent information is in the final report; for more information, please contact the project lead at your state soybean organization or the principal investigator listed on the project.