Project A7

Selective sweep data and next-generation sequencing data analysis

Lead Partners: 6: Era7 (SME, Granada); 3: LMU (University of Munich)

State-of-the-Art: Problem and its Solution
Current methods can model the genetic patterns in the loci adjacent to one undergoing selection. Genome-wide data of the sort analysed by Era7’s cloud computing platform would, however, be expected to reveal multiple ‘footprints’ of this sort around the genome, as several loci with related functions respond to the same selective regime. The ESRs will therefore develop methods to interpret such multiple patterns and to distinguish them from similar patterns that have other causes.

New methods for indexing sequences are fundamental for managing, accessing and analysing massive sequences from NGS projects. In this project we will develop new methods for optimal indexing of massive sequences based on weighted automata.
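
To make the indexing task concrete, the following is a minimal sketch of a much simpler technique, a plain k-mer inverted index, and not the weighted-automata method that the project will develop. The function names (`build_kmer_index`, `find_exact`), the toy reads and the choice of k = 4 are all illustrative assumptions.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every k-mer to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_exact(index, sequences, query, k):
    """Locate exact occurrences of query, seeding with its first k-mer."""
    if len(query) < k:
        raise ValueError("query must be at least k characters long")
    hits = []
    for seq_id, pos in index.get(query[:k], []):
        # Verify the full query at each seed position.
        if sequences[seq_id].startswith(query, pos):
            hits.append((seq_id, pos))
    return hits


# Hypothetical toy reads, for illustration only.
reads = {"read1": "ACGTACGTGA", "read2": "TTACGTGACC"}
index = build_kmer_index(reads, k=4)
hits = find_exact(index, reads, "ACGTGA", k=4)
print(hits)  # [('read1', 4), ('read2', 2)]
```

An index of this kind stores every k-mer position explicitly, so its memory use grows with the total sequence length; that cost is one reason more compact index structures are of interest for massive NGS data sets.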


  • Design new methods for massive indexing of NGS sequence data | Fellow 1 - Era7 bioinformatics
  • Design a new system to extract data from pathways and transcriptional networks for massive sets of proteins, based on bio4j | Fellow 1 - Era7 bioinformatics
  • Develop a statistical test and a Bayesian inference method that is sensitive to (soft and hard) selective sweeps at several genes that belong to the same pathway or are closely related in regulatory networks | Fellow 2 - LMU Munich
  • Develop methods that combine functional evidence of genes that are closely connected in regulatory or transcriptional networks, and supply versions for Era7’s pipelines | Fellow 2 - LMU Munich
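
As a point of comparison for the pathway-aware test in the objectives above, a common baseline is a hypergeometric enrichment test: given a set of candidate sweep genes, it asks whether more of them fall in one pathway than chance would predict. The sketch below is only that baseline, under the assumption that genes are sampled independently; it ignores linkage, gene length and overlapping groups, which the proposed methods are meant to address. The function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_genome, n_pathway, n_candidates, n_overlap):
    """P(X >= n_overlap) when n_candidates genes are drawn without
    replacement from n_genome genes, n_pathway of which are in the pathway."""
    total = comb(n_genome, n_candidates)
    upper = min(n_pathway, n_candidates)
    tail = sum(
        comb(n_pathway, x) * comb(n_genome - n_pathway, n_candidates - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 genes, 5 in the pathway, 4 candidate sweep
# genes, 3 of which belong to the pathway.
p_value = hypergeom_upper_tail(20, 5, 4, 3)
print(round(p_value, 4))  # 0.032
```

A small p-value here says only that the candidate set overlaps the pathway more than expected under independent sampling; it does not by itself separate selection from demographic or mutational causes.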

The ESRs will extend the multivariate mixed-effects models used in QTL mapping by adding putative explanatory variables that are sensitive to groups of genetic loci known to be involved in the same pathway or to be co-regulated. Variable selection is not straightforward because many different groups of loci, including overlapping groups, are candidates for inclusion. They will analyse these methods mathematically and by simulation of genetic networks to characterise the ramification of correlated effects throughout the network. They will develop a system for classifying such relationships based on bio4j, which is developed on a new database paradigm (neo4j) to integrate efficiently data from DNA sequence and genome databases (NCBI databases), protein databases (UniProt), taxonomy databases and the Gene Ontology database.
For sweep mapping, the observation that several genomic regions have reduced genetic variability might indicate the passage of a selective sweep, but would be much more meaningful if each region were to include genes that are known to belong to a certain pathway. We will use the logic developed for the QTL mapping to construct Hidden Markov Models of the change in genetic variability along the genome. Markov Chain Monte Carlo (MCMC) methods will be needed to include population genetic phenomena, including selection. Fortunately, this approach is readily parallelised and hence can exploit Era7’s cloud computing infrastructure. The most promising methods for exploiting the network data will be integrated into cloud computing based workflows and pipelines for data analysis.
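
The Hidden Markov Model idea for sweep mapping can be illustrated with a minimal two-state sketch. It assumes that the genome has been cut into windows, that diversity in each window has been discretised to "high" or "low", and that the hidden state is either "neutral" or "sweep". All probabilities below are made-up illustrative values, not estimates, and the sketch uses standard Viterbi decoding rather than the MCMC machinery that the project envisages.

```python
import math

STATES = ("neutral", "sweep")

# Illustrative parameters only (assumptions, not fitted to any data).
START = {"neutral": 0.9, "sweep": 0.1}
TRANS = {
    "neutral": {"neutral": 0.9, "sweep": 0.1},
    "sweep": {"neutral": 0.2, "sweep": 0.8},
}
EMIT = {
    "neutral": {"high": 0.8, "low": 0.2},
    "sweep": {"high": 0.1, "low": 0.9},
}


def viterbi(observations):
    """Return the most probable hidden-state path, computed in log space."""
    first = observations[0]
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}]
    back = []
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            current[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            pointers[s] = best
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical diversity calls for seven consecutive genomic windows.
windows = ["high", "high", "low", "low", "low", "high", "high"]
path = viterbi(windows)
print(path)
```

With these illustrative parameters the decoded path labels the three low-diversity windows as "sweep" and the rest as "neutral". A pathway-aware version would, for example, let the prior probability of the sweep state depend on whether a window contains genes from the pathway of interest.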
Improvements to assembly and mapping algorithms would initially be in the greatest demand because the majority of current problems are centered on these tasks; the new indexing methods/systems will play a critical role in those improvements.

ESRs training by research
Fellow 1 will be based at Era7 bioinformatics, Granada, Spain, with a six-month outplacement at LMU Munich. He/She will lead on the development of new methods for massive indexing of sequences and on bio4j-related development (see the related objective above). He/She will also work on the analysis of pathways and transcriptional networks, and on the parallelisation of the methods developed by Fellow 2. The main supervisor of this fellow will be Dr Raquel Tobes, Director of Research and Development at Era7 bioinformatics. For more detail, see the corresponding job description.

Fellow 2 will be based at LMU Munich, with a six-month outplacement at Era7 bioinformatics. He/She will lead on the implementation of mixed-effects models and the sweep mapping.

Joint work: The two fellows will collaborate in supplying versions of the methods developed by Fellow 2 for Era7’s pipelines.

Job Descriptions