The Power of Next Generation Sequencing in the Human Genome Project

It’s been two decades since the scientists involved in the Human Genome Project released the first human genome sequence. This development came hot on the heels of Celera Genomics’ use of data from the project, which saw the company cut the cost of sequencing by some margin. But technological limitations meant there were gaps in the first sequence of the human genome. These limitations stemmed from regions of DNA consisting of long sections of repeated base pairs, which hamper the assembly of multiple short reads, each consisting of the same repetitive base pairs, into one contiguous sequence. As a result, when the scientists completed their first draft of the sequence, approximately 15% was missing.

Over the last 20 years, scientists have invested huge volumes of time and resources in an attempt to fill these gaps. However, when scientists completed a new sequence in 2013 and patched this in 2019, they were still lacking 8% of the full sequence. Challenges surrounding sequencing complex sections of DNA like heterochromatin meant they still couldn’t fill all the gaps.

Jumping forward to 2021, geneticists from the Telomere-to-Telomere (T2T) Consortium leveraged new technologies to complete even more of the human genome sequence. Now, only the Y chromosome is left.

Here, we’ll take a look at why and how the T2T Consortium employed long-read next-generation sequencing technologies to sequence the human genome and how the Consortium’s efforts could fill the remaining gaps in the genome.

The T2T Consortium

The T2T Consortium is an international association of approximately 30 institutions. Evan E Eichler (University of Washington School of Medicine, WA, USA), Karen Miga (University of California, Santa Cruz, CA, USA), and Adam Phillippy (National Human Genome Research Institute, MD, USA) launched the Consortium to research “unmappable” centromere regions.

May 2021 saw the T2T Consortium release a preprint entitled “The complete sequence of a human genome,” which details how the Consortium has sequenced the remaining gaps in the human genome using long-read sequencing technologies. The Consortium’s scientists added nearly 200 million DNA base pairs and 115 protein-coding genes to the human genome sequence, triggering a 4.5% rise in the number of base pairs and a 0.4% rise in the number of protein-coding genes. The number of these genes has climbed to 19,969.

The Power of Long-Read Technologies

The T2T Consortium utilized some of the newest next-generation sequencing technologies from Oxford Nanopore and Pacific Biosciences to finish their latest draft of the human genome sequence, which is termed T2T-CHM13. With these technologies, the Consortium’s scientists sequenced stretches of DNA that contained up to 20,000 base pairs each in parallel.

Although older long-read sequencing techniques can be prone to errors, Pacific Biosciences’ latest long-read sequencing technology makes it possible to recognize minute variations in long stretches of repeated sequences, thereby making repetitive, long chromosome segments tractable. On the other hand, Oxford Nanopore’s platform enables scientists to capture a variety of modifications to DNA that modulate gene expression, allowing scientists to map genome-wide “epigenetic tags.”

These technologies were key to the scientists’ progress because short-read sequencing techniques only allow scientists to sequence a few hundred base pairs at a time. After short-read sequencing, researchers have to reassemble the resulting fragments as they would a jigsaw.

Plus, although the reads that short-read technologies produce are accurate, they aren’t long enough to unambiguously map highly repetitive genomic sequences (like the telomeres that cap chromosome ends and the centromeres that coordinate the partitioning of newly replicated DNA during cell division). Conversely, long-read sequencing technologies produce longer stretches of DNA, which are easier to fit together as they’re more likely to contain overlapping sequences.
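To make the repeat problem concrete, here is a minimal Python sketch. It is an illustration only, not the Consortium’s pipeline: the toy genome, the flank and repeat strings, and the function names `count_placements` and `ambiguous_fraction` are all assumptions made for this example, and it treats reads as error-free, which real reads are not. It asks a simple question: of all the reads of a given length that could be drawn from the toy genome, what fraction match more than one position?

```python
def count_placements(genome: str, read: str) -> int:
    """Count every position (overlapping matches included) where read occurs in genome."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        # Step forward by one base so overlapping matches are also counted.
        start = genome.find(read, start + 1)
    return count


def ambiguous_fraction(genome: str, read_length: int) -> float:
    """Fraction of all possible error-free reads of this length that match more than once."""
    reads = [
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    ]
    ambiguous = sum(1 for read in reads if count_placements(genome, read) > 1)
    return ambiguous / len(reads)


# Hypothetical toy genome: a tandem repeat sitting between two unique flanks.
LEFT_FLANK = "ACGTTGCA"
REPEAT_ARRAY = "GATTACA" * 6  # stands in for a repetitive region
RIGHT_FLANK = "TTCGGCAT"
TOY_GENOME = LEFT_FLANK + REPEAT_ARRAY + RIGHT_FLANK

if __name__ == "__main__":
    for length in (5, 50):
        fraction = ambiguous_fraction(TOY_GENOME, length)
        print(f"read length {length}: {fraction:.0%} of reads match more than one place")
```

In this toy setup, most 5-base reads fall entirely inside the repeat and so match several positions, whereas every 50-base read is longer than the repeat array and reaches into a unique flank, which pins it to a single position. Real genomes are far messier, but the same logic explains why read length matters for centromeres and telomeres.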

How the T2T Consortium Sequenced the Human Genome

Rather than taking DNA from a live human, the Consortium’s scientists utilized a cell line from a complete hydatidiform mole, a type of tissue that forms in a human when a sperm fertilizes an egg that doesn’t have a nucleus. This meant that the cells contained two full sets of chromosomes, both from the same individual, thereby avoiding the issue of matching up sequencing reads with the correct haplotype.

The sperm involved, however, contained an X rather than a Y chromosome. This meant that the sequence didn’t cover the Y chromosome, which stimulates male biological development. Furthermore, the Consortium estimated that approximately 0.3% of the sequence could contain faults resulting from difficulties carrying out quality control checks on problematic areas of the genome and challenges linked to passaged cell lines.
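The haplotype point above can be illustrated with a short Python sketch. The sequences and the function name `heterozygous_sites` are hypothetical and chosen only for this example. In an ordinary genome, the two chromosome copies differ at some positions, so reads covering those positions have to be assigned to one copy or the other. When both copies are identical, as in the cell line described above, there is nothing to tell apart.

```python
def heterozygous_sites(copy_a: str, copy_b: str) -> list:
    """Return the positions at which two aligned chromosome copies differ."""
    if len(copy_a) != len(copy_b):
        raise ValueError("copies must be the same length to compare position by position")
    return [i for i, (a, b) in enumerate(zip(copy_a, copy_b)) if a != b]


# Hypothetical ten-base stretches standing in for two parental copies.
MATERNAL_COPY = "ACGTACGTAC"
PATERNAL_COPY = "ACGTTCGTAA"

if __name__ == "__main__":
    # Typical genome: one copy from each parent, so some sites differ.
    print(heterozygous_sites(MATERNAL_COPY, PATERNAL_COPY))
    # Mole-derived cell line: both copies are the same, so no sites differ.
    print(heterozygous_sites(PATERNAL_COPY, PATERNAL_COPY))
```

With no differing sites, every read can be treated as coming from a single underlying sequence, which is the simplification the Consortium gained by choosing this cell line.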

The Future of the Human Genome Sequence

Now, the Consortium is attempting to sequence the Y chromosome using the same method described above. The scientists are planning to sequence a genome with chromosomes from two parents and are working with the Human Pangenome Reference Consortium to sequence more than 300 genomes on a global scale. They will use the T2T-CHM13 sequence as a reference to understand which parts of the genome typically differ among individuals.

Meanwhile, next-generation sequencing technologies, tools, and resources continue to develop. As scientists extend their efforts to fill the remaining gaps in the human genome sequence and identify connections between newly sequenced areas and human diseases, the infrastructure to help them is growing. Human genome sequencing could soon become a mainstream practice.

BioTechniques, the highly acclaimed life sciences journal, closely follows updates surrounding the human genome sequence and publishes insights on its multimedia website.

Since BioTechniques published its first issue in 1983, the life sciences journal has built a reputation as a leading resource for scientists from a range of disciplines (such as chemistry, physics, plant and agricultural science, climate science, and computer science) to learn about the repeatability of new methods that may contribute to the future of science and medicine. Readers can pick up a multitude of insights from the journal and the BioTechniques website, where they will find articles, infographics, eBooks, podcasts, webinars, videos, and a platform for industry discussions.