Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest.

Abstract

BACKGROUND

Genomic information allows population relatedness to be inferred and selected genes to be identified. Single nucleotide polymorphism microarray (SNP-chip) data, a proxy for genome composition, contains patterns in allele order and proportion. These patterns can be quantified by compression efficiency (CE). In principle, the composition of an entire genome can be represented by a CE number quantifying allele representation and order.
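As a rough illustration of how a CE number might be computed, the sketch below compresses a string of genotype codes with DEFLATE (via Python's `zlib`) and reports the fraction of bytes saved. The 0/1/2 genotype encoding, the function name `compression_efficiency`, and the definition of CE as one minus the compressed-to-raw size ratio are assumptions made for this example; the abstract does not state the authors' exact formula.

```python
import random
import zlib


def compression_efficiency(genotypes: str, level: int = 9) -> float:
    """Return 1 - compressed_size / raw_size for a genotype string.

    Assumption: genotypes are encoded one ASCII character per SNP
    (e.g. '0', '1', '2' for the count of minor alleles).
    """
    raw = genotypes.encode("ascii")
    if not raw:
        raise ValueError("genotype string must not be empty")
    compressed = zlib.compress(raw, level)  # zlib wraps a DEFLATE stream
    return 1.0 - len(compressed) / len(raw)


# Synthetic data only: a long run of identical calls versus shuffled calls.
rng = random.Random(0)
homozygous_run = "0" * 2000
mixed_genotypes = "".join(rng.choice("012") for _ in range(2000))

ce_run = compression_efficiency(homozygous_run)
ce_mixed = compression_efficiency(mixed_genotypes)
print(f"CE of homozygous run: {ce_run:.3f}")
print(f"CE of mixed genotypes: {ce_mixed:.3f}")
```

Under these assumptions a repetitive genotype string yields a higher CE than a string with no order, which is the sense in which CE reflects both allele proportion and allele order.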

RESULTS

We applied a compression algorithm (DEFLATE) to genome-wide high-density SNP data from 4,155 human, 1,800 cattle, 1,222 sheep, 81 dog and 49 mouse samples. All human ethnic groups can be clustered by CE, and the clusters recover the phylogeography obtained from traditional fixation index (FST) analyses. CE analysis of other mammals segregates samples by breed or species, and is sensitive to admixture and past effective population size. This clustering is a consequence of individual patterns such as runs of homozygosity. Intriguingly, a related approach can also be used to identify genomic loci that show population-specific CE segregation. A high-resolution CE 'sliding window' scan across the human genome, organised at the population level, revealed genes known to be under evolutionary pressure. These include SLC24A5 (European and Gujarati Indian skin pigmentation), HERC2 (European eye color), LCT (European and Maasai milk digestion) and EDAR (Asian hair thickness). We also identified a set of previously unidentified loci with high population-specific CE scores, including the chromatin remodeler SCMH1 in Africans and EDA2R in Asians. Closer inspection reveals that these prioritised genomic regions do not correspond to simple runs of homozygosity but rather to compositionally complex regions that are shared by many individuals of a given population. Unlike FST, CE analyses do not require ab initio population comparisons and are amenable to the hemizygous X chromosome.
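The 'sliding window' scan can be sketched along the same lines: compute a CE score for each fixed-width window of SNPs and look for windows that stand out. This is a minimal single-sample sketch on synthetic data. The window and step sizes, the 0/1/2 encoding, the CE definition (one minus the compressed-to-raw size ratio) and the name `window_ce` are all assumptions, and the population-level organisation described above is not reproduced here.

```python
import random
import zlib
from typing import List, Tuple


def window_ce(genotypes: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (start_index, CE) for each full window along a genotype string."""
    scores = []
    for start in range(0, len(genotypes) - window + 1, step):
        chunk = genotypes[start:start + window].encode("ascii")
        compressed = zlib.compress(chunk, 9)  # DEFLATE via zlib
        scores.append((start, 1.0 - len(compressed) / len(chunk)))
    return scores


# Synthetic chromosome: shuffled calls with one highly ordered region
# occupying SNP indices 1000 to 1499.
rng = random.Random(0)
background_left = "".join(rng.choice("012") for _ in range(1000))
ordered_region = "0" * 500
background_right = "".join(rng.choice("012") for _ in range(1000))
chromosome = background_left + ordered_region + background_right

scores = window_ce(chromosome, window=250, step=125)
best_start, best_ce = max(scores, key=lambda item: item[1])
print(f"highest-CE window starts at SNP index {best_start}")
```

In a real scan the per-window scores would be compared across individuals grouped by population, so that a window is flagged only when its CE pattern is shared within one population and differs from the others.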

CONCLUSIONS

We conclude with a discussion of the implications of CE for a complex systems science view of genome evolution. CE allows one to clearly visualise the evolution of individual genomes and populations through a formal, mathematically rigorous information space. Overall, CE makes a set of biological predictions, some of which are unique and await functional validation.

    Computational and Systems Biology, CSIRO Animal, Food and Health Sciences, St Lucia, Brisbane, QLD 4067, Australia. r.taft@imb.uq.edu.au.

    Source

    BMC Bioinformatics 15: 2014 Mar 07, pg 66

    MeSH

    Animals
    Cattle
    Cluster Analysis
    Data Compression
    Databases, Genetic
    Dogs
    Evolution, Molecular
    Genome
    Genomics
    Humans
    Mice
    Phylogeography
    Polymorphism, Single Nucleotide
    Sheep

    Pub Type(s)

    Journal Article
    Research Support, Non-U.S. Gov't

    Language

    eng

    PubMed ID

    24606587