Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus.
Genet Epidemiol. 2021 04; 45(3):316-323.

Abstract

Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic since the initial genome sequence of the virus was released on the open access Virological website (http://virological.org/) early on January 11. We use the published data on the single-stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among the many important research questions currently being investigated, one pertains to the genetic characterization and classification of the virus. We analyze the nucleotide sequences and geographic information of a subset of 7,640 SARS-CoV-2 patients without missing entries in the GISAID database. Instead of modeling the mutation rate, applying phylogenetic tree approaches, and so forth, we use a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index. Our analysis of the SARS-CoV-2 genome data illustrates the geographic and chronological progression of the virus, starting from the first cases that were observed in China to the current wave of cases in Europe and North America. This is in line with a phylogenetic analysis which we use to contrast our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. It is the subject of ongoing research to examine whether the genetic subgroup could be related to disease outcome and its potential implications for vaccine development.
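
The pipeline named in the abstract (a pairwise Jaccard similarity matrix followed by principal component analysis) can be illustrated with a minimal sketch. Several details here are assumptions, not taken from the abstract: each genome is encoded as a 0/1 vector marking the loci where it differs from a reference sequence, two genomes with no mutations at all are given similarity 1, and the components are taken as the leading eigenvectors of the uncentered similarity matrix. The function names and the toy sequences are hypothetical.

```python
import numpy as np


def mutation_indicators(sequences, reference):
    """Return a 0/1 matrix: entry (i, j) is 1 where sequence i differs
    from the reference at locus j.

    Assumption: sequences are already aligned to the reference.
    """
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("all sequences must match the reference length")
    ref = np.array(list(reference))
    return np.array([np.array(list(seq)) != ref for seq in sequences],
                    dtype=np.int64)


def jaccard_similarity_matrix(X):
    """Jaccard index |A and B| / |A or B| for every pair of rows of X."""
    intersection = X @ X.T                     # shared mutated loci
    counts = X.sum(axis=1)                     # mutated loci per sample
    union = counts[:, None] + counts[None, :] - intersection
    S = np.ones(union.shape, dtype=float)      # assumed value when union is empty
    np.divide(intersection, union, out=S, where=union > 0)
    return S


def leading_components(S, k=2):
    """Top-k eigenvalues and eigenvectors of the symmetric matrix S."""
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]


# Toy data (hypothetical): two samples share a mutation at locus 0,
# two others share mutations at loci 8 and 9.
reference = "ACGTACGTAC"
sequences = [
    "TCGTACGTAC",
    "TCGAACGTAC",
    "ACGTACGTTG",
    "ACGTACCTTG",
]

X = mutation_indicators(sequences, reference)
S = jaccard_similarity_matrix(X)
eigvals, eigvecs = leading_components(S, k=2)
```

In this toy input the two pairs of samples share no mutations, so the similarity matrix is block diagonal and each of the two leading eigenvectors is supported on one pair only. Plotting the rows of `eigvecs` would then show two separated groups, which is the kind of cluster structure the abstract describes at a much larger scale.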

Author Affiliations

Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA.
Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA. Department of Medical Consilience, Graduate School, Dankook University, Yongin-si, South Korea.
Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA.
Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA.

Pub Type(s)

Journal Article
Research Support, N.I.H., Extramural

Language

eng

PubMed ID

33415739

Citation

Hahn, Georg, et al. "Unsupervised Cluster Analysis of SARS-CoV-2 Genomes Reflects Its Geographic Progression and Identifies Distinct Genetic Subgroups of SARS-CoV-2 Virus." Genetic Epidemiology, vol. 45, no. 3, 2021, pp. 316-323.
Hahn G, Lee S, Weiss ST, et al. Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus. Genet Epidemiol. 2021;45(3):316-323.
Hahn, G., Lee, S., Weiss, S. T., & Lange, C. (2021). Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus. Genetic Epidemiology, 45(3), 316-323. https://doi.org/10.1002/gepi.22373
Hahn G, et al. Unsupervised Cluster Analysis of SARS-CoV-2 Genomes Reflects Its Geographic Progression and Identifies Distinct Genetic Subgroups of SARS-CoV-2 Virus. Genet Epidemiol. 2021;45(3):316-323. PubMed PMID: 33415739.
TY - JOUR
T1 - Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus.
AU - Hahn, Georg
AU - Lee, Sanghun
AU - Weiss, Scott T
AU - Lange, Christoph
Y1 - 2021/01/08/
PY - 2020/11/19/revised
PY - 2020/07/03/received
PY - 2020/11/20/accepted
PY - 2021/1/9/pubmed
PY - 2021/4/7/medline
PY - 2021/1/8/entrez
KW - SARS-CoV-2
KW - clustering
KW - covid
KW - jaccard
SP - 316
EP - 323
JF - Genetic epidemiology
JO - Genet Epidemiol
VL - 45
IS - 3
SN - 1098-2272
UR - https://www.unboundmedicine.com/medline/citation/33415739/Unsupervised_cluster_analysis_of_SARS_CoV_2_genomes_reflects_its_geographic_progression_and_identifies_distinct_genetic_subgroups_of_SARS_CoV_2_virus_
DB - PRIME
DP - Unbound Medicine
ER -