Projects

Fate of carbon in soils: Using self-supervised language models on sequence data from subsurface microbial communities

PI: Woodward W. Fischer (Division of Geological and Planetary Sciences)
SASE: Pippa Richter, Scholar

Approximately one-third of the CO2 emitted from fossil fuel burning and land use change has been drawn out of the atmosphere and fixed into organic matter by the biosphere. As CO2 increases in the atmosphere and the climate warms, the biosphere is becoming more productive. This increase in primary productivity has outpaced that of respiration; the net effect is to lock CO2 away into organic matter, somewhere, somehow. Moreover, this enhanced CO2 removal by the biosphere has been taking place for more than a hundred years. That carbon is not simply stored in trees or algae; much of it is present as particles that are transported across the globe and bleed into sedimentary deposits (soils, river floodplains, deltas, and coastal sediments), with significant potential for permanent storage over societal timescales. The cadence of this sequestration is controlled by the convolution of physical, chemical, and ecological processes in the biosphere with the processes that produce, transport, and bury soil and sediment. This raises a central question: when carbon is fixed by the biosphere and enters soil and sedimentary reservoirs, can we know quantitatively if and when we will see it in the atmosphere again?

Subsurface microbial communities are the key catalysts controlling carbon transformations, and the taxonomic and genomic composition of these communities provides a valuable time-integrated measure of the environmental conditions present in soils and sediments. Recent breakthroughs make it possible to access the genomic content of soil and sediment microbes in a high-throughput fashion that is both rapid and inexpensive. The challenge is that the volume of data recovered is massive and complex. A promising way to learn about soil and sediment conditions, however, is through the metabolic states of the microbes that live there; information about soil O2 content, for example, could be gleaned from the relative quantities of aerobic and anaerobic bacteria in a sample. Put simply, can we figure out from sequence data who is an aerobe and who isn’t?

Protein language models offer a particularly promising approach for deducing physical and chemical conditions in the subsurface from the microbial metabolic states captured within the large genomic datasets that are routinely generated from global field sites. The Fischer team is collaborating with the Schmidt Academy to develop software that geobiologists and environmental scientists can use to accurately determine key characteristics of microbial metabolism using protein language models. This approach takes advantage of the tens of millions of dollars that AI research teams have already spent training language models to represent proteins.
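To make the idea concrete, here is a minimal sketch in Python. It is not the project's software. It assumes each protein sequence has already been converted to a fixed-length embedding vector by some pretrained protein language model; the 3-dimensional vectors, the labels, and the function names below are all invented for illustration. A nearest-centroid rule stands in for whatever classifier the team actually uses.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def fit_centroids(embeddings, labels):
    """Group embeddings by label and compute one centroid per label."""
    by_label = {}
    for vec, lab in zip(embeddings, labels):
        by_label.setdefault(lab, []).append(vec)
    return {lab: centroid(vecs) for lab, vecs in by_label.items()}


def predict(centroids, vec):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], vec))


# Hypothetical toy embeddings; real ones would come from a protein
# language model and have hundreds or thousands of dimensions.
train_embeddings = [
    [0.9, 0.1, 0.2],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.7],
    [0.2, 0.8, 0.9],
]
train_labels = ["aerobe", "aerobe", "anaerobe", "anaerobe"]

centroids = fit_centroids(train_embeddings, train_labels)

# An embedding for an organism that was not used to fit the centroids.
held_out = [0.85, 0.15, 0.3]
prediction = predict(centroids, held_out)
print(prediction)
```

The point of the sketch is the division of labor: the expensive language model is reused as a fixed feature extractor, and only a small, cheap classifier is fit to the metabolic labels.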

Left: Low-dimensional embeddings of ~25,000 ferredoxin protein sequences from a diverse suite of microbial taxa. The embeddings encompass a wide array of structural information, some of which is related to whether an organism is an aerobe or an anaerobe. This metabolic information is critical to determining the cadence of carbon cycling in the subsurface. Right: Receiver operating characteristic (ROC) curve indicating that the classifier built on language model embeddings shows remarkable skill in predicting the metabolic states of microbes hidden from the model.
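
For readers unfamiliar with ROC curves, the following sketch shows how one can be computed from classifier scores on held-out organisms. The labels and scores are made up for illustration (1 marks one class, say aerobe, and 0 the other) and are not the results shown in the figure; the function names are likewise hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    The decision threshold is swept from the highest score to the lowest;
    tied scores are grouped so they contribute a single point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical held-out labels and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

points = roc_points(labels, scores)
auc_value = auc(points)
print(points)
print(auc_value)
```

An area of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes, so the closer the curve hugs the upper-left corner, the more skillful the classifier.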