Scientists with the Human Pangenome Reference Consortium have made groundbreaking progress in characterizing the fraction of human DNA that varies between individuals. They have assembled genomic sequences of 47 people from around the world into a so-called pangenome in which more than 99 percent of each sequence is rendered with high accuracy.
For two decades, scientists have relied on the human reference genome as a standard against which to compare other genetic data. Thanks to this reference genome, it has been possible to identify genes implicated in specific diseases and trace the evolution of human traits, among other things.
However, it has always been a flawed tool: 70 percent of its data came from a single man of predominantly African-European background whose DNA was sequenced during the Human Genome Project. As a result, it reveals very little about the genetic differences that set individuals apart, creating an inherent bias in biomedical data that is believed to be responsible for some of the health disparities affecting patients today.
In this study, scientists found that the layered sequences revealed nearly 120 million DNA base pairs that were previously unseen.
Rockefeller University’s Erich D. Jarvis, one of the primary investigators, said, “This complex genomic collection represents significantly more accurate human genetic diversity than has ever been captured before. With a greater breadth and depth of genetic data at their disposal and greater quality of genome assemblies, researchers can refine their understanding of the link between genes and disease traits and accelerate clinical research.”
The Human Pangenome Reference Consortium (HPRC), a government-funded collaboration between more than a dozen research institutions in the United States and Europe, was launched in 2019 to address the shortcomings of the reference genome. At that time, Jarvis, one of the consortium’s leaders, was honing advanced sequencing and computational methods through the Vertebrate Genomes Project.
That project aims to sequence all 70,000 vertebrate species. The consortium set out to apply the same advances to reveal the variation within a single vertebrate: Homo sapiens.
To collect samples, the researchers turned to the 1000 Genomes Project, a public database of sequenced human genomes. Most of the samples come from Africa, home to the planet’s greatest human genetic diversity.
However, to broaden the gene pool, the scientists needed to produce sharper, clearer sequences of each individual. Methods developed by participants in the Vertebrate Genomes Project and other consortia were applied to solve this long-standing technical problem in the field.
Since each person receives one genome from each parent, we all have a diploid genome, with two copies of each chromosome. When a person’s genome is sequenced, it can be difficult to tell apart the DNA inherited from each parent. Older methods and algorithms frequently make mistakes when merging the parental genetic data for an individual, producing a hazy picture.
“The differences between mom’s and dad’s chromosomes are bigger than most people realize,” Jarvis says. “Mom may have 20 copies of a gene, and dad only two.”
With so many genomes represented in a pangenome, that cloudiness threatened to develop into a thunderstorm of confusion. Therefore, the HPRC homed in on a method that Adam Phillippy and Sergey Koren developed at the National Institutes of Health, based on parent-child “trios”: a mother, a father, and a child whose genomes had all been sequenced.
Using the data from the mother and father to clear up the lines of inheritance, they obtained a higher-quality sequence for the child, which was then used for pangenome analysis.
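The idea of using parental data to sort out a child's two inherited genomes can be illustrated with a toy sketch. This is not the consortium's pipeline: the reads, the k-mer length, and the function names (`kmers`, `parent_specific_kmers`, `bin_child_reads`) are all hypothetical, and real sequencing reads are far longer and noisier. The sketch assumes that short subsequences (k-mers) found in only one parent can be used to label each of the child's reads as maternal or paternal.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(mother_reads, father_reads, k):
    """Return (k-mers seen only in the mother, k-mers seen only in the father)."""
    mom = set().union(*(kmers(r, k) for r in mother_reads))
    dad = set().union(*(kmers(r, k) for r in father_reads))
    return mom - dad, dad - mom


def bin_child_reads(child_reads, mom_only, dad_only, k):
    """Assign each child read to the parent whose unique k-mers it shares most."""
    bins = {"maternal": [], "paternal": [], "unassigned": []}
    for read in child_reads:
        read_kmers = kmers(read, k)
        mom_hits = len(read_kmers & mom_only)
        dad_hits = len(read_kmers & dad_only)
        if mom_hits > dad_hits:
            bins["maternal"].append(read)
        elif dad_hits > mom_hits:
            bins["paternal"].append(read)
        else:
            # No parent-specific signal (or a tie): leave the read unassigned.
            bins["unassigned"].append(read)
    return bins


# Toy data (made up): the parents differ at a single position.
K = 4
mother_reads = ["ACGTACGTTAGC"]
father_reads = ["ACGTACGATAGC"]
child_reads = ["TACGTTAG", "TACGATAG", "ACGTAC"]

mom_only, dad_only = parent_specific_kmers(mother_reads, father_reads, K)
bins = bin_child_reads(child_reads, mom_only, dad_only, K)
print(bins)
# {'maternal': ['TACGTTAG'], 'paternal': ['TACGATAG'], 'unassigned': ['ACGTAC']}
```

Once the reads are sorted this way, each group can be assembled separately, which is what makes it possible to obtain two distinct sequences per child rather than one blurred mixture.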
Their analysis of 47 people yielded 94 distinct genome sequences, two per person (one for each parental set of chromosomes), plus the Y sex chromosome in males.
Later, using advanced computational techniques, they aligned and layered the 94 sequences. Of the 120 million DNA base pairs that were previously unseen, or that sit in a different place than the previous reference recorded, about 90 million are the result of structural variations: differences in a person’s DNA that arise when chunks of chromosomes are moved, deleted, inverted, or duplicated.
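To make the four kinds of structural variation concrete, here is a minimal sketch that applies each one to a short made-up sequence. The helper names (`delete`, `duplicate`, `invert`, `move`) and the 12-letter `reference` string are illustrative assumptions, not anything from the study; real structural variants typically span far longer stretches of DNA. The last lines work out the share of the newly found base pairs attributed to structural variation, using the article's rounded figures.

```python
def delete(seq, start, end):
    """Remove the segment seq[start:end]."""
    return seq[:start] + seq[end:]


def duplicate(seq, start, end, copies=2):
    """Replace seq[start:end] with `copies` tandem copies of itself."""
    return seq[:start] + seq[start:end] * copies + seq[end:]


def invert(seq, start, end):
    """Replace seq[start:end] with its reverse complement."""
    complement = str.maketrans("ACGT", "TGCA")
    flipped = seq[start:end].translate(complement)[::-1]
    return seq[:start] + flipped + seq[end:]


def move(seq, start, end, dest):
    """Cut seq[start:end] and reinsert it at index `dest` of what remains."""
    segment = seq[start:end]
    rest = seq[:start] + seq[end:]
    return rest[:dest] + segment + rest[dest:]


reference = "AAACCCGGGTTT"  # made-up toy sequence

print(delete(reference, 3, 6))     # AAAGGGTTT
print(duplicate(reference, 3, 6))  # AAACCCCCCGGGTTT
print(invert(reference, 0, 6))     # GGGTTTGGGTTT
print(move(reference, 0, 3, 6))    # CCCGGGAAATTT

# Rounded figures from the article: about 90 million of roughly
# 120 million newly found base pairs come from structural variation.
structural_share = 90_000_000 / 120_000_000
print(f"{structural_share:.0%}")   # 75%
```

On these rounded numbers, roughly three quarters of the newly catalogued sequence stems from rearrangements of this kind rather than from single-letter changes.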
Jarvis noted, “It’s an important discovery because studies in recent years have established that structural variants play a major role in human health and population-specific diversity. They can dramatically affect trait differences, disease, and gene function. With so many new ones identified, there’s going to be a lot of discoveries that weren’t possible before.”
The team has also uncovered surprising new characteristics of centromeres, which lie at the cruxes of chromosomes and conduct cell division, pulling apart as cells duplicate. Mutations in centromeres can lead to cancers and other diseases.
The current 47-person pangenome is just a starting point, however. The HPRC’s ultimate goal is to produce high-quality, nearly error-free genomes from at least 350 individuals from diverse populations by mid-2024, a milestone that would make it possible to capture rare alleles that confer important adaptive traits.
- Liao, WW., Asri, M., Ebler, J., et al. A draft human pangenome reference. Nature 617, 312–324 (2023). DOI: 10.1038/s41586-023-05896-x