Research

Project A6

Exploiting information from the variation in allele frequencies for genome-wide data

Lead Partners: 8: GT (SME, Wageningen); 3: LMU (University of Munich)

State-of-the-Art Problem and its Solution

For population genetic data with very large numbers of loci, it should be possible to use variation in allele frequency among samples to distinguish loci affected by disruptive selection or selective sweeps from loci carrying neutral variants. However, full-data methods based on likelihood functions derived from advanced population genetics models, such as the coalescent and ancestral recombination graphs, will be too time-consuming for these large datasets. GT’s high-throughput sequencing pipeline for horticultural and agricultural crops produces such genome-wide SNP data. The ESRs will therefore implement efficient analytical procedures for identifying candidate loci of commercial value.

Objectives

· Develop a computationally efficient method to scan polymorphisms identified by the GT pipeline, making use of variation at a hierarchy of levels (crop, orchard, cultivar, etc.).

· Implement the analyses for general use by releasing a free package (GNU General Public License, GPL) in the open-source statistical language R.

· Evaluate organisms relevant to GT, based on high-throughput data.

Methodology

The ESRs will develop a collection of summary statistics reflecting parameters (effective population sizes, dates of foundation, etc.) that, in turn, can be used to model the frequency data. The optimisation process will use procedures from statistics and machine learning, including partial least squares, regularisation and kernel-based feature selection. In an approach similar to the fitting of a mixed model, the ESRs will fit probability distributions to the empirical distributions across loci. It will thus be necessary to estimate only a few hyperparameters for each category of nuisance parameter, instead of one parameter per locus. The models fitted in this way will serve as null models, and outliers from the fitted distributions will indicate loci of interest, for example those that have responded to agricultural selection for traits of commercial value.
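The general idea of a shared null model with outlier detection can be illustrated with a minimal sketch. It is not the project's method: it swaps in a classical Lewontin-Krakauer-style test, in which a single genome-wide differentiation value (here called f_bar) plays the role of the shared hyperparameter and each locus is compared against a chi-square null. The function names, the input layout (one list of sample allele frequencies per locus), the threshold alpha and the restriction to an odd number of samples are all assumptions made for illustration. Python is used here only for the sketch; the project's own package is planned for R.

```python
import math


def locus_components(freqs):
    """Return (variance among samples, p_bar * (1 - p_bar)) for one locus."""
    n = len(freqs)
    p_bar = sum(freqs) / n
    var = sum((p - p_bar) ** 2 for p in freqs) / (n - 1)
    return var, p_bar * (1.0 - p_bar)


def chi2_sf_even(x, df):
    """Chi-square survival function, closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only supports positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def scan_loci(freq_table, alpha=0.01):
    """Flag loci whose differentiation is extreme under a shared null.

    freq_table: one list of sample allele frequencies per locus
    (hypothetical layout; every locus is assumed to be polymorphic).
    """
    parts = [locus_components(f) for f in freq_table]
    total_var = sum(v for v, _ in parts)
    total_het = sum(d for _, d in parts)
    if total_var == 0 or total_het == 0:
        raise ValueError("no variation among samples to model")
    # One shared hyperparameter for all loci, instead of one per locus.
    f_bar = total_var / total_het
    df = len(freq_table[0]) - 1
    results = []
    for idx, (v, d) in enumerate(parts):
        fst = v / d if d > 0 else 0.0
        # Under the assumed null, df * fst / f_bar is roughly chi-square(df).
        p_value = chi2_sf_even(df * fst / f_bar, df)
        results.append((idx, fst, p_value))
    outliers = [idx for idx, _, p in results if p < alpha]
    return f_bar, results, outliers


if __name__ == "__main__":
    # Made-up frequencies: 20 weakly differentiated loci plus one extreme locus.
    neutral = [[0.48, 0.50, 0.52, 0.49, 0.51] for _ in range(20)]
    table = neutral + [[0.1, 0.9, 0.2, 0.8, 0.5]]
    f_bar, results, outliers = scan_loci(table)
    print("shared f_bar:", round(f_bar, 4))
    print("candidate loci:", outliers)
```

Note that in this simple version the extreme locus itself inflates f_bar; a more careful procedure would estimate the null from robust or trimmed statistics, and would account for hierarchical structure among samples.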

ESRs training by research

The GT-based ESR will lead on data acquisition and explore fully parameterized models on a subset of the data, to investigate the efficiency of different comparisons, e.g. which level of the crop hierarchy of population subdivision is most informative.

The LMU-based ESR will lead on developing the full data analysis using the hyperparameter approximations. The two ESRs will collaborate on the final analysis and software release.


Job Description 10
Job Description 11