A comprehensive map of the SARS-CoV-2 genome

Scientists have determined the virus’ protein-coding gene set and analyzed new mutations’ likelihood of helping the virus adapt.


Last year, scientists succeeded in sequencing the full genome of SARS-CoV-2. At that point, many of its genes were already known, but the full complement of protein-coding genes was unresolved.

After performing a broad comparative genomics study, MIT scientists have determined the virus’ protein-coding gene set. They say the result is the most accurate and complete gene annotation of the SARS-CoV-2 genome to date.

They confirmed several protein-coding genes and found that a few others that had been suggested as genes do not code for any proteins.

This comparative genomics approach is powerful enough to help scientists understand the real functional protein-coding content of this enormously important genome.

Scientists studied almost 2,000 mutations that occur in different SARS-CoV-2 isolates. Doing this allowed them to rate how significant those mutations might be in changing the virus’s ability to sidestep the immune system.

Previously, scientists had developed computational techniques to perform such analysis.

The same techniques were also used to compare the human genome with the genomes of other mammals. The methods are based on analyzing whether specific DNA or RNA bases are conserved between species and on comparing their patterns of evolution over time.
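To make the idea of conservation concrete, here is a minimal Python sketch. It is not the study's actual method. It assumes a small, made-up alignment of RNA sequences (the name toy_alignment and the function column_conservation are hypothetical) and simply measures, position by position, what fraction of the sequences share the most common base.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """For each alignment column, return the fraction of sequences
    that carry the most common base at that position."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        column = [seq[i] for seq in aligned_seqs]
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Hypothetical toy alignment: made-up sequences, not real viral data.
toy_alignment = [
    "AUGGCUAAC",
    "AUGGCAAAC",
    "AUGGCGAAU",
    "AUGGCCAAC",
]

scores = column_conservation(toy_alignment)
for position, score in enumerate(scores):
    print(position, score)
```

In this toy example, the only positions that vary are the third bases of three-letter codons, where a change often leaves the encoded amino acid unchanged. Patterns of that kind are one signal comparative methods can use to tell protein-coding regions from other sequence.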

In addition to the five protein-coding genes that are already well established, scientists confirmed six more protein-coding genes in the SARS-CoV-2 genome. They also discovered that the region that encodes a gene called ORF3a encodes an additional gene, which they named ORF3c. The gene has RNA bases that overlap with ORF3a but occur in a different reading frame.
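A reading frame is simply the choice of where to start grouping RNA bases into three-letter codons. The short Python sketch below uses a made-up sequence (toy_rna is hypothetical and is not the ORF3a region) to show how the same stretch of bases yields entirely different codons when the starting point shifts by one base, which is how two genes can share the same bases.

```python
def codons(seq, frame):
    """Split seq into complete three-base codons, starting at offset
    `frame` (0, 1, or 2). Leftover bases at the end are dropped."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1, or 2")
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


# Hypothetical toy sequence: made up for illustration only.
toy_rna = "AUGGCAUGCCUAAGU"

frame0 = codons(toy_rna, 0)
frame1 = codons(toy_rna, 1)

print("frame 0:", frame0)
print("frame 1:", frame1)
```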

Five other regions that had been proposed as possible genes do not encode functional proteins. The scientists also ruled out the possibility that there are more conserved protein-coding genes yet to be discovered.

Irwin Jungreis, a lead author of the study and a CSAIL research scientist, said, “We analyzed the entire genome and are very confident that there are no other conserved protein-coding genes. Experimental studies are needed to figure out the functions of the uncharacterized genes, and by determining which ones are real, we allow other researchers to focus their attention on those genes rather than spend their time on something that doesn’t even get translated into protein.”

Manolis Kellis, the senior author of the study, and his colleagues found that, in most cases, genes that evolved rapidly for long periods before the current pandemic have continued to do so, and those that tended to evolve slowly have maintained that trend. However, the scientists also identified exceptions to these patterns, which may shed light on how the virus has evolved as it has adapted to its new human host.

In one such exception, scientists identified a region of the nucleocapsid protein that had many more mutations than expected from its historical evolution patterns.

Kellis said, “This protein region is also classified as a target of human B cells. Therefore, mutations in that region may help the virus evade the human immune system.”

“The most accelerated region in the entire genome of SARS-CoV-2 is sitting smack in the middle of this nucleocapsid protein. We speculate that those variants that don’t mutate that region get recognized by the human immune system and eliminated. In contrast, those variants that randomly accumulate mutations in that region are better able to evade the human immune system and remain in circulation.”

Scientists also analyzed mutations that have arisen in variants of concern, such as the B.1.1.7 strain from England, the P.1 strain from Brazil, and the B.1.351 strain from South Africa. Many of the mutations that make those variants more dangerous are found in the spike protein and help the virus spread faster and avoid the immune system. However, each of those variants carries other mutations as well.

Jungreis said, “Each of those variants has more than 20 other mutations, and it’s important to know which of those are likely to be doing something and which aren’t. So, we used our comparative genomics evidence to get a first-pass guess at which of these are likely to be important based on which ones were in conserved positions.”

“This data could help other scientists focus their attention on the mutations that appear most likely to have significant effects on the virus’ infectivity.”
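The first-pass ranking Jungreis describes can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the study's pipeline: the positions and conservation scores in toy_conservation are invented, and the function rank_by_conservation is hypothetical. The idea is that a mutation at a position that has stayed the same across related viruses is a more likely candidate for having an effect than one at a position that changes freely.

```python
def rank_by_conservation(mutation_positions, conservation):
    """Order mutation positions from the most conserved site to the
    least conserved site, using a position -> score lookup."""
    return sorted(
        mutation_positions,
        key=lambda pos: conservation[pos],
        reverse=True,
    )


# Hypothetical data: invented positions and scores, not real results.
toy_conservation = {101: 0.98, 250: 0.40, 377: 0.85, 512: 0.15}
toy_mutations = [250, 512, 101, 377]

ranked = rank_by_conservation(toy_mutations, toy_conservation)
print(ranked)
```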

Kellis said, “We can now go and study the evolutionary context of these variants and understand how the current pandemic fits in that larger history. For strains that have many mutations, we can see which of these mutations are likely to be host-specific adaptations and which mutations are perhaps nothing to write home about.”

Journal Reference:
  1. Jungreis, I., Sealfon, R. & Kellis, M. SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nat Commun 12, 2642 (2021). DOI: 10.1038/s41467-021-22905-7

