Revolutionizing Healthcare with High-throughput Sequencing
by: Chun Chan1* and Kendric Wang2*
Author Affiliations: Bioinformatics Training Program, 1Centre for Lymphoid Cancers, University of British Columbia, Vancouver, BC, Canada, and 2Vancouver Prostate Centre, University of British Columbia, Vancouver, BC, Canada
*These authors contributed equally to this work
Biological research and biomedical discovery are being greatly accelerated by the emergence of high-throughput sequencing (HTS) technologies. Translating the terabytes of DNA sequence data generated by these technologies into knowledge that can be applied in the clinic requires novel bioinformatics analyses and the development of complex computational and statistical methods. In this article, we describe how the sequencing and subsequent bioinformatics analysis of whole human genomes promises to revolutionize the diagnosis, prognosis, and treatment of disease.
The core functioning and development of all living organisms are determined by a set of genetic instructions encoded in DNA. This set of instructions is written as a sequence of letters, where each letter can be one of A, C, T or G. Within DNA sequences, segments called genes specify the structure of proteins, which perform the important cellular functions in our body.
In humans, there are approximately three billion DNA letters, comprising 20,000-25,000 genes, which collectively make up the human genome. Although >99% of DNA is the same between any two individuals, variation in one’s genome can alter protein functions and lead to differences in physical traits between individuals, such as height and hair color. In the extreme case, these variations may even result in disease. Historically, it was too expensive and time-consuming to identify all the letters in a genome through the process known as sequencing. Only recently, with the genome-wide sequence information provided by HTS, has it become possible to efficiently read the entire genome of many individuals in a population, compare their genomes, and detect genomic variations that are associated with disease.
In general, HTS refers to the large-scale sequencing of a source genome. First, the source genome is sheared into many small DNA fragments (Figure 1A). Each fragment is then sequenced; the sequence of DNA letters identified for each fragment is referred to as a sequence read, or simply a read (Figure 1B). Since each read represents a short segment of the source genome, putting the reads back into the correct order is like assembling a jigsaw puzzle with millions of pieces, which is a challenging task. To guide this process, sophisticated computational methods are used to align the sequence of each read to the sequence of the reference genome, an approximate representation of the typical human genome (Figure 1C). This alignment step allows us to assemble the correct source genome sequence from the millions of small sequence reads. Subsequently, the source genome can be compared against the reference genome in order to identify variations. Three notable types of genomic variation can be identified in this manner: single-letter variations, copy number differences and gene fusions.
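To make the alignment step concrete, here is a minimal Python sketch. It assumes a toy reference and reads that differ from it only by substitutions, and it simply tries every position and keeps the one with the fewest mismatches. The function names and sequences are hypothetical, and real aligners rely on indexed data structures and handle insertions and deletions, so this illustrates the idea rather than the methods actually used.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference):
    """Return (position, mismatches) for the best ungapped placement of a read.

    Every start position in the reference is tried; the first position with
    the fewest mismatching letters wins.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(read, window)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


# Toy data (made up for illustration only).
reference = "ACGTTAGCCGATAGGCTA"
for read in ["GTTAGC", "TAGCCC"]:
    print(read, align_read(read, reference))
```

The first read matches the reference exactly at position 2, while the second is placed at position 4 with one mismatching letter, which is the kind of difference the comparison against the reference is meant to expose.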
First, individuals may possess variations in single DNA letters within genes. In some cases, such a variation is a harmful mutation that disrupts the proper functioning of the corresponding protein and increases the risk of developing disease (Figure 1D and 1G). For example, women with certain single-letter variations in the BRCA1 or BRCA2 genes have a 5-fold increased risk of developing breast cancer compared to those without the variations (1). Once identified, these single-letter variations can serve as diagnostic markers of an individual’s risk for specific diseases.
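A rough Python sketch of how single-letter variations might be read off aligned reads follows. It assumes each read has already been placed on the reference without gaps, stacks the reads into a per-position tally, and reports positions where the most common letter disagrees with the reference. The function name, the `min_depth` threshold and the sequences are illustrative assumptions; production variant callers use statistical models of sequencing error rather than a simple majority vote.

```python
from collections import Counter, defaultdict


def call_single_letter_variants(reference, aligned_reads, min_depth=2):
    """Report (position, reference_letter, observed_letter) tuples.

    aligned_reads is a list of (start_position, read_sequence) pairs,
    assumed to be ungapped placements on the reference.
    """
    pileup = defaultdict(Counter)
    for start, sequence in aligned_reads:
        for offset, letter in enumerate(sequence):
            pileup[start + offset][letter] += 1

    variants = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        letter, _ = counts.most_common(1)[0]
        # Require enough overlapping reads before trusting a difference.
        if depth >= min_depth and letter != reference[pos]:
            variants.append((pos, reference[pos], letter))
    return variants


# Toy data (made up): three overlapping reads all carry A instead of T at position 4.
reference = "ACGTTAGCCG"
aligned_reads = [(0, "ACGTAAG"), (2, "GTAAGC"), (3, "TAAGCCG")]
print(call_single_letter_variants(reference, aligned_reads))
```

Requiring several reads to agree is a simple guard against mistaking a one-off sequencing error for a genuine variation.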
A second type of variation that can be detected is a difference in the abundance of genes, referred to as a copy number difference (Figure 1E). These variations can be found by counting the reads that align to a given gene and comparing that count to the read counts for other genes. Variations in gene abundance have been linked to the pathogenesis of autoimmune, inflammatory, and neurological disorders, as well as cancer progression (2). In the latter case, copy number differences can indicate a good or poor prognosis and can help determine treatment options (Figure 1H).
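The read-counting idea can be sketched in a few lines of Python. This sketch assumes each read is summarised by its aligned start position and each gene by a half-open interval on the reference. Dividing counts by gene length and comparing against the median gene are assumptions added here so that genes of different sizes are comparable; they are not steps stated in the article. Gene names and coordinates are hypothetical.

```python
from statistics import median


def reads_per_gene(genes, read_starts):
    """Count the reads whose start position falls inside each gene interval."""
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


def relative_copy_number(genes, read_starts):
    """Return each gene's read density relative to the median gene.

    Assumes the median density is non-zero, i.e. most genes have reads.
    """
    counts = reads_per_gene(genes, read_starts)
    density = {
        name: counts[name] / (end - start)
        for name, (start, end) in genes.items()
    }
    baseline = median(density.values())
    return {name: value / baseline for name, value in density.items()}


# Toy data (made up): geneB attracts twice as many reads as its neighbours.
genes = {"geneA": (0, 100), "geneB": (100, 200), "geneC": (200, 300)}
read_starts = (
    list(range(0, 100, 10)) + list(range(100, 200, 5)) + list(range(200, 300, 10))
)
print(reads_per_gene(genes, read_starts))
print(relative_copy_number(genes, read_starts))
```

A ratio well above 1 hints at extra copies of a gene and a ratio well below 1 hints at a loss, although real analyses must also correct for biases in how evenly the genome is sequenced.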
A third type of variation that can be detected using HTS data is called a gene fusion (Figure 1F). A gene fusion occurs when there are breaks in the genome and two normally separate parts of the genome are joined together. In the rare event that the breaks occur within genes, the two genes can be joined to one another. Reads whose alignment is split between the two genes (known as split reads) can be used to detect these gene fusion events. Identifying gene fusions can facilitate the development of novel targeted therapies for treating disease (Figure 1I). For example, over 90% of chronic myeloid leukemia cases are characterised by the fusion of the BCR and ABL genes. The fused gene codes for a chimeric protein that is specifically targeted by the drug Gleevec (3).
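As an illustration of the split-read idea, the following Python sketch checks whether the start of a read matches one gene and the rest matches another. It assumes exact matches and two short made-up sequences standing in for the genes; `find_split` and `min_anchor` are hypothetical names, and real fusion detection tools tolerate mismatches and search across the whole genome.

```python
def find_split(read, gene_a, gene_b, min_anchor=4):
    """Look for a split point where the read spans both genes.

    Returns (split_index, position_in_gene_a, position_in_gene_b) when the
    read's prefix occurs in gene_a and its suffix occurs in gene_b, with at
    least min_anchor letters on each side; returns None otherwise.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        pos_a = gene_a.find(read[:split])
        pos_b = gene_b.find(read[split:])
        if pos_a != -1 and pos_b != -1:
            return split, pos_a, pos_b
    return None


# Toy sequences (made up for illustration only).
gene_a = "ACGTACGGAT"
gene_b = "TTCAGGCTAA"

fusion_read = "CGGATTTCAG"  # end of gene_a joined to start of gene_b
normal_read = "ACGTACGG"    # lies entirely within gene_a

print(find_split(fusion_read, gene_a, gene_b))
print(find_split(normal_read, gene_a, gene_b))
```

Requiring a minimum number of letters on each side of the split reduces the chance that a few letters match the second gene purely by coincidence.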
HTS technology promises to revolutionize discoveries in biological and biomedical research. Large-scale sequencing efforts, such as the 1000 Genomes Project (4) and the International Cancer Genome Consortium (5), will provide the genome sequences of thousands of individuals and help uncover genomic variations associated with diseases such as cancer. Undoubtedly, identification of these variations will lead to novel avenues for diagnosis, prognosis and treatment, and will ultimately improve patient care.
- BRCA1 & BRCA2: Cancer Risk and Genetic Testing. National Cancer Institute. Retrieved from http://www.cancer.gov/cancertopics/factsheet/Risk/BRCA
- Fanciulli M, Petretto E and Aitman T (2010). Gene copy number variation and common human disease. Clinical Genetics, 77: 201-213.
- Imatinib mesylate (Gleevec). National Cancer Institute. Retrieved from http://www.cancer.gov/clinicaltrials/conducting/gleevec
- 1000 Genomes: A Deep Catalogue of Human Genetic Variation. Retrieved from http://www.1000genomes.org/
- International Cancer Genome