Projects

Sequencing Viral Datasets

Cloud-based platform for management and analysis of next-generation sequencing viral mutant datasets

PI: Viviana Gradinaru (Division of Biology and Biological Engineering)
SASE: Umesh Padia, Scholar

Adeno-associated viruses (AAVs) are widely used gene delivery vectors due to their ability to transduce dividing and non-dividing cells, their long-term persistence, and their low immunogenicity. However, natural AAV serotypes transduce only a limited set of tissues and cell types. Directed evolution has been used to engineer recombinant AAVs that target specific cell types and tissues, leveraging next-generation sequencing data. The deluge of data from these deep sequencing experiments has brought about data management and analysis challenges for which no commercially available solutions currently exist. Furthermore, classical approaches to analyzing data from directed evolution rely heavily on manual inspection and often overlook patterns present in the larger datasets. To address these challenges, we developed robust cloud-based software that provides central management for next-generation sequencing data, extracts variants, performs structural modeling, and can be extended to incorporate machine learning models that make predictions for variants with specific properties.

The software exposes a database of the millions of variants discovered over the course of many experiments in the lab for on-the-fly searching, filtering, and analysis. It is composed of a set of discrete, interconnected components: a modern web user interface implemented in JavaScript with React, a relational database, a distributed task queue, task workers, and a Django-based API. This architecture allows computationally intensive tasks such as alignments, structural modeling, and machine learning to scale from a single machine to hundreds of machines with minimal configuration. The software is also cloud-provider agnostic and can be run locally. It automatically imports and manages sequencing data from several commercial and in-house sequencing providers. When data is imported, sequence quality metrics are automatically generated and presented to the user. Variants are extracted by performing pairwise alignments between the natural serotype and the sequencing reads. The variants are then encoded into embeddings, grouped into families, and analyzed further using downstream tools. We use the Rosetta software libraries to perform comparative modeling simulations on selected variants.
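To make the variant-extraction step concrete, the following is a minimal sketch of pairwise alignment between a reference sequence and a read, written in plain Python. It is not the project's implementation: the function names (`global_align`, `call_differences`), the scoring values, and the use of a textbook Needleman-Wunsch global alignment are all assumptions made for illustration, and the sequences in the usage note are toy strings rather than real AAV sequences. A production pipeline would more likely rely on an optimized alignment library.

```python
from typing import List, Tuple


def global_align(ref: str, read: str, match: int = 1,
                 mismatch: int = -1, gap: int = -2) -> Tuple[str, str]:
    """Needleman-Wunsch global alignment (illustrative scoring values).

    Returns the reference and the read as gapped strings of equal length.
    """
    n, m = len(ref), len(read)
    # score[i][j] = best score aligning ref[:i] with read[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == read[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in the read
                              score[i][j - 1] + gap)      # gap in the reference

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_ref: List[str] = []
    aligned_read: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if ref[i - 1] == read[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned_ref.append(ref[i - 1])
            aligned_read.append(read[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1])
            aligned_read.append("-")
            i -= 1
        else:
            aligned_ref.append("-")
            aligned_read.append(read[j - 1])
            j -= 1
    return "".join(reversed(aligned_ref)), "".join(reversed(aligned_read))


def call_differences(ref: str, read: str) -> List[Tuple[int, str, str]]:
    """List (reference_position, reference_symbol, read_symbol) differences.

    Positions are 1-based counts of reference residues consumed so far, so an
    insertion is reported at the position of the residue that precedes it,
    with '-' as the reference symbol.
    """
    aligned_ref, aligned_read = global_align(ref, read)
    differences = []
    position = 0
    for r, q in zip(aligned_ref, aligned_read):
        if r != "-":
            position += 1
        if r != q:
            differences.append((position, r, q))
    return differences
```

With these toy inputs, `call_differences("ACGTACGT", "ACGAACGT")` reports a single substitution at reference position 4, and `call_differences("AAACCC", "AAAGGGCCC")` reports three inserted `G` residues after position 3. The set of differences for a read is what this sketch treats as a "variant"; how the actual software defines and stores variants is not specified in the text above.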

We developed extension support for PyTorch-based machine learning models to generate novel variants with desirable properties and to select candidate variants for additional rounds of optimization and characterization. This software is a general tool for simple, scalable, and centralized analysis of next-generation sequencing data for protein engineering by directed evolution, and it could be generalized in the future to other projects with large-scale deep sequencing datasets.

Over the past year, the lab has adopted the software in its experiments on the directed evolution of proteins. We presented the work at the American Society of Gene and Cell Therapy (ASGCT, https://www.asgct.org/) conference in May 2020.

Figure 1 - AAV Variant Count / Enrichment table from a directed evolution experiment

Figure 2 - Structural Modeling of an AAV variant that transduces the human central nervous system.

Figure 3 - Automatic sequence motif identification

Figure 4 - Automated AA covariation analysis between virus insert positions.

Figure 5 - VR Rendering of variant "families", i.e. variants in a sample that are biochemically related.