The main goal of the GenScale project is to develop scalable methods and software for processing genomic data. Our research is motivated by the fast development of sequencing technologies, especially next-generation sequencing (NGS) and third-generation sequencing (TGS). NGS provides up to billions of very short (a few hundred base pairs, bp) DNA fragments of high quality, called short reads, while TGS provides millions of long (thousands to millions of bp) DNA fragments of lower quality, called long reads. Synthetic long reads, or linked-reads, are another technology that combines the high quality and low cost of short-read sequencing with long-range information, by adding barcodes that tag reads originating from the same long DNA fragment. All these sequencing data raise very challenging problems both in bioinformatics and in computer science: recent sequencing machines generate terabytes of DNA sequences, to which time-consuming processes must be applied to extract useful and relevant information.
A large panel of biological questions can be investigated using genomic data. A complete project includes DNA extraction from one or several living organisms, sequencing with high-throughput machines, and finally the design of methods and development of bioinformatics pipelines to answer the initial question. Such pipelines are made of pre-processing steps (quality control and data cleaning), core functions transforming these data into genomic objects, on which GenScale's main expertise is focused (genome assembly, variant discovery such as SNPs and structural variations, sequence annotation, sequence comparison, etc.), and sometimes further integration steps that help interpret and gain knowledge from the data by incorporating other sources of semantic information.
The challenge for GenScale is to develop scalable algorithms able to absorb the daily flow of sequenced DNA that tends to congest bioinformatics computing centers. To achieve this goal, our strategy is to work on both space and time scalability. Space scalability relies on the design of optimized, low-memory-footprint data structures able to capture all the useful information contained in sequencing datasets. The idea is to represent tera- or petabytes of raw data in a very concise way, so that their analysis completely fits into a computer's memory. Time scalability means that the execution of the algorithms must be linear with respect to the size of the problem or, at least, must last a reasonable amount of time. In this respect, parallelism is a complementary technique for increasing scalability.
A second important objective of GenScale is to create and maintain permanent partnerships with life science research groups. Collaboration with genomics research teams is of crucial importance for validating our tools and for keeping abreast of developments in this extremely dynamic field. Our approach is to actively participate, with our partners, in solving biological problems and to get involved in a few challenging genomic projects.
GenScale research is organized along four main axes:
The aim of this axis is to create and disseminate efficient data structures for representing the mass of genomic data generated by sequencing machines. This is necessary because processing large genomes, such as those of mammals or plants, or the multiple genomes contained in a single metagenomic sample, requires significant computing resources and a powerful memory configuration. Advances in TGS technologies also bring new challenges for representing and searching information in sequencing data with high error rates.
Part of our research focuses on k-mer representations (words of length k), the elementary objects from which most of these data structures, such as the de Bruijn graph, are built.
A related research direction is the indexing of large sets of objects 8. A typical, but non-exclusive, need is to annotate the nodes of the de Bruijn graph, that is, potentially billions of items. Again, indexing structures with a very low memory footprint are mandatory to manage such a large quantity of objects 9.
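To make these objects concrete, here is a deliberately naive sketch in plain Python (an illustration only, not one of the team's optimized, low-memory structures): it collects canonical k-mers from a few reads and answers the successor query that underlies a de Bruijn graph.

```python
# Minimal, illustrative k-mer set with a de Bruijn successor query.
# This plain-Python version is NOT memory-efficient; the team's work is
# precisely about replacing such hash sets with compact structures.

def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    comp = str.maketrans("ACGT", "TGCA")
    rc = kmer.translate(comp)[::-1]
    return min(kmer, rc)

def kmer_set(reads, k):
    """Collect all canonical k-mers occurring in the reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(canonical(read[i:i + k]))
    return kmers

def successors(kmers, kmer):
    """Successors of a de Bruijn node: k-mers overlapping the current one by k-1."""
    return [kmer[1:] + c for c in "ACGT" if canonical(kmer[1:] + c) in kmers]

if __name__ == "__main__":
    reads = ["ACGTACGTGACG", "CGTGACGTT"]
    ks = kmer_set(reads, k=5)
    print(len(ks), "distinct canonical 5-mers")
    print(successors(ks, "ACGTG"))
```

A hash set such as the one used here grows far too large for real datasets, which is precisely why compact k-mer representations and indexes are a research topic in themselves.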
The main goal of the GenScale team is to develop optimized tools dedicated to genomic data processing. Optimization can be seen both in terms of space (low memory footprint) and in terms of time (fast execution time). The first point is mainly related to the advanced data structures presented in the previous section (axis 1). The second relies on new algorithms and, when possible, implementations on parallel architectures (axis 3).
We do not aim to cover the vast panel of software related to genomic data processing. We particularly focus on the following areas:
This third axis investigates a complementary way to increase the performance and scalability of genomic data processing. There are many levels of parallelism that can be used and/or combined to reduce the execution time of very time-consuming bioinformatics processes. A first level is the parallel nature of today's processors, which house several cores. A second level is the grid structure present in all bioinformatics centers or in the cloud. These two levels are generally combined: a node of a grid is often a multicore system. Another possibility is to work with processing-in-memory (PIM) boards or to add hardware accelerators to a processor, a GPU board being a good example.
GenScale does not do explicit research on parallelism; it exploits the capacity of computing resources to support parallelism. The problem is addressed in two different directions. The first is an engineering approach that relies on existing parallel frameworks, such as multithreading or MapReduce techniques, to implement our algorithms 5. The second is a parallel algorithmic approach: during the development step, the algorithms are constrained by parallel criteria 3. This is particularly true for parallel algorithms targeting hardware accelerators.
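As a simple illustration of the first, engineering-level approach, the hedged sketch below distributes reads over worker processes with Python's standard multiprocessing module and merges the partial k-mer counts, in the spirit of the MapReduce style mentioned above (a generic toy, not the team's actual implementation).

```python
# Toy MapReduce-style parallel k-mer counting with the standard library.
# Map: each worker counts k-mers in its chunk of reads.
# Reduce: partial counters are merged in the main process.

from collections import Counter
from multiprocessing import Pool

K = 5

def count_kmers(reads):
    """Map step: count k-mers in a chunk of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - K + 1):
            counts[read[i:i + K]] += 1
    return counts

def parallel_kmer_counts(reads, workers=4):
    """Split reads into chunks, count in parallel, then merge (reduce)."""
    chunks = [reads[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partial = pool.map(count_kmers, chunks)
    total = Counter()
    for c in partial:
        total.update(c)
    return total

if __name__ == "__main__":
    reads = ["ACGTACGTGACG", "CGTGACGTT", "TTACGTACGA"] * 100
    counts = parallel_kmer_counts(reads)
    print(counts.most_common(3))
```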
Sequencing data are intensively used in many life science projects. Thus, the methodologies developed by the GenScale group are applied to a large panel of life science domains. Most of these applications face specific methodological issues that the team addresses by developing new tools or adapting existing ones. Such collaborations therefore lead to novel methodological developments that can be directly evaluated on real biological data and often produce novel biological results. In most cases, we also participate in the data analyses and their interpretation in terms of biological findings.
Furthermore, GenScale actively creates and maintains permanent partnerships with several local, national, or international groups, which carry applications for the tools developed by the team and are able to give valuable and relevant feedback.
Today, sequencing data are intensively used in many life science projects. The methodologies developed by the GenScale group are generic approaches that can be applied to a large panel of domains, such as health, agronomy, or the environment. The next sections briefly describe examples of our activity in these different domains.
Genetic and cancer disease diagnostics: Genetic diseases are caused by particular mutations in the genome that alter important cell processes. Similarly, cancer comes from changes in the DNA molecules that alter cell behavior, causing uncontrollable growth and malignancy. Pointing out genes with mutations helps in identifying the disease and in prescribing the right drugs. Thus, DNA from individual patients is sequenced with the aim of detecting potential mutations that may be linked to the patient's disease. Bioinformatics analysis can be based on the detection of SNPs (Single Nucleotide Polymorphisms) from a set of predefined target genes. One can also scan the complete genome and report all kinds of mutations, including complex ones such as large insertions or deletions, that could be associated with genetic diseases or cancer.
Insect genomics: Insects represent major crop pests, justifying the need for control strategies to limit population outbreaks and the dissemination of the plant viruses they frequently transmit. Several issues are investigated through the analysis and comparison of their genomes: understanding their phenotypic plasticity, such as changes in their reproduction mode, identifying the genomic sources of adaptation to their host plant and of ecological speciation, and understanding the relationships with their bacterial symbiotic communities 6.
Improving plant breeding: Such projects aim at identifying favorable alleles at loci contributing to phenotypic variation, characterizing polymorphism at the functional level, and providing robust multi-locus SNP-based predictors of the breeding value of agronomical traits under polygenic control. The underlying bioinformatics processing is the detection of informative zones (QTLs) on the plant genomes.
Food quality control: One way to check whether food is contaminated with bacteria is to extract DNA from a product and identify the different strains it contains. This can now be done quickly with low-cost sequencing technologies such as the MinION sequencer from Oxford Nanopore Technologies.
Ocean biodiversity: The metagenomic analysis of seawater samples provides an original way to study ocean ecosystems. Through the biodiversity analysis of different ocean spots, many biological questions can be addressed, such as plankton biodiversity and its role, for example, in CO2 sequestration.
Through its long-term collaboration with INRAE IGEPP, GenScale is involved in various genomic projects in the field of agricultural research. In particular, we participate in the genome assembly and analyses of some major agricultural pests and their natural enemies, such as parasitoids. The long-term objective of these genomic studies is to develop control strategies to limit population outbreaks and the dissemination of the plant viruses they frequently transmit, while reducing the use of phytosanitary products.
All current computing platforms follow the von Neumann architecture principles, originated in the 1940s, which separate computing units (CPU) from memory and storage. Processing-in-memory (PIM) is expected to fundamentally change the way we design computers in the near future. These technologies couple processing capability tightly with memory and storage devices. Instead of bringing all data into a centralized processor, far away from the data storage and bottlenecked by the latency (time to access) and bandwidth (data transfer throughput) of this storage, as well as by the energy required to transfer and process the data, in-memory computing processes the data directly where it resides. This avoids most data movement and can improve the performance and energy efficiency of massive data processing by orders of magnitude. This technology is currently being tested in GenScale with an innovative memory component developed by the UPMEM company. Several genomic algorithms have been parallelized on UPMEM systems, and we demonstrated significant energy gains compared to FPGA or GPU accelerators: for comparable performance (in terms of execution time) on large-scale genomics applications, UPMEM PIM systems consume 3 to 5 times less energy.
On October 10, the L'Oréal Foundation and UNESCO honored 35 young female doctoral and post-doctoral researchers with the Prix “Jeunes Talents France 2023 Pour les Femmes et la Science”. Garance Gourdel, one of the laureates, was a doctoral student in the team. This prize highlights her research on algorithms to improve DNA reading, creating and analyzing new algorithms for processing and storing large volumes of data such as those generated by sequencing. She defended her thesis in October at ENS Paris.
We presented an original method for structural variant genotyping at the prestigious international bioinformatics conference ISMB/ECCB 2023. In addition to the talk, the paper was published in the journal Bioinformatics 20.
Advances in sequencing technologies have revealed the prevalence and importance of structural variations (deletions, duplications, inversions, or rearrangements of DNA segments), which cover 5 to 10 times more bases in the genome than the point mutations commonly analyzed. Over the last five years, the rise of third-generation sequencing (long reads) has made it possible to characterise and catalogue the full range of SVs in many model organisms, such as human. The next step to fully understand variations in populations and associate them with phenotypes consists in assessing the presence or absence of known variants in numerous newly sequenced individuals: this is the genotyping problem. In this work, we proposed SVJedi-graph, the first structural variant genotyping method dedicated to long-read data that relies on a variation graph to represent all alleles of all variants in a single data structure. We showed that this graph model prevents the bias toward the reference alleles and maintains high genotyping accuracy whatever the proximity of the variants, contrary to other state-of-the-art genotypers.
Approximate membership query (AMQ) data structures are widely used for indexing the presence of elements drawn from a large set. To represent the counts of indexed elements, AMQ data structures can be generalized into "counting AMQ" data structures, such as counting Bloom filters. However, counting AMQ data structures suffer from false-positive and overestimated calls. In this work, we propose a novel strategy, called fimpera, that reduces the false-positive rate and the overestimation rate of any counting AMQ data structure indexing k-mers (words of length k) from a set of sequences, along with their abundances. Applied to a counting Bloom filter, fimpera decreases its false-positive rate by an order of magnitude while reducing the number of overestimated calls. Furthermore, fimpera lowers the average difference between the overestimated calls and the ground truth, and it slightly decreases the query time. fimpera does not require any modification of the original counting AMQ data structure, does not generate false-negative calls, and causes no memory overhead. The only drawback is that fimpera yields a negligible number of a new kind of false-positive and overestimated calls 18.
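To make the notion of a counting AMQ concrete, here is a minimal counting Bloom filter sketched in plain Python (an illustration of the kind of structure fimpera builds upon, not fimpera itself): because distinct elements can share all their cells, a query may return an overestimated count, which is precisely the kind of error fimpera reduces.

```python
# Minimal counting Bloom filter (illustrative only).
# Each element increments a few cells; a query returns the minimum of its cells,
# which can only overestimate the true count (no false negatives).

import hashlib

class CountingBloomFilter:
    def __init__(self, size=1 << 16, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.cells = [0] * size

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item: str, count: int = 1):
        for pos in self._positions(item):
            self.cells[pos] += count

    def query(self, item: str) -> int:
        # Minimum over the cells: always >= true count, possibly overestimated.
        return min(self.cells[pos] for pos in self._positions(item))

if __name__ == "__main__":
    cbf = CountingBloomFilter()
    cbf.add("ACGTGACGT", 3)
    print(cbf.query("ACGTGACGT"))   # at least 3
    print(cbf.query("TTTTTTTTT"))   # 0 in most cases, >0 if all cells collide
```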
Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We developed kmindex (7.1.2), an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false-positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. We demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server “Ocean Read Atlas”, which enables real-time queries on the Tara Oceans dataset 37.
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections makes it effectively impossible to search these data using tools such as BLAST and its successors. To address this challenge, we developed a technique called phylogenetic compression, which uses evolutionary history to guide compression and to efficiently search large collections of microbial genomes using existing algorithms and data structures 32. We showed that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes.
Efficiently managing large DNA datasets necessitates the development of highly effective sequence compression techniques to reduce storage and computational requirements. Here, we explore the potential of a lossy compression technique, Mapping-friendly Sequence Reductions (MSRs), a generalization of homopolymer compression, to improve the accuracy of alignment tools. Essentially, MSRs deterministically transform sequences into shorter counterparts, in such a way that if an original query and a target sequence align, their reduced forms align as well. While homopolymer compression is one example of an MSR, numerous others exist. These rapid computations yield lossy representations of the original sequences. Notably, the reduced sequences can be stored, aligned, assembled, and indexed much like regular sequences. MSRs could be used to improve the efficiency of taxonomic classification tools by indexing and querying reduced sequences. Our experiments with a mixture of 10 E. coli strains demonstrate that this approach can yield greater precision than indexing and querying a reduced portion of k-mers. Other tasks could benefit from sequence reduction, such as mapping, genome assembly, and structural variant detection 26.
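As a concrete example, the sketch below implements homopolymer compression, the simplest MSR mentioned above (other MSRs are different deterministic reductions, applied in the same way to both queries and targets).

```python
# Homopolymer compression: collapse each run of identical nucleotides into one.
# It is a lossy, deterministic reduction that shortens sequences while
# preserving alignability between a query and a target.

from itertools import groupby

def homopolymer_compress(seq: str) -> str:
    """Collapse each run of identical characters to a single character."""
    return "".join(base for base, _ in groupby(seq))

if __name__ == "__main__":
    print(homopolymer_compress("AAACGGGTTA"))  # -> "ACGTA"
```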
Despite the implementation of strict laboratory protocols for controlling ancient DNA contamination, samples remain highly vulnerable to environmental contamination. Such contamination can significantly alter microbial composition, leading to inaccurate conclusions in downstream analyses. In the framework of the co-supervision of a PhD student at the Institut Pasteur (Paris, France), we contributed to work on two methods addressing this problem, described below.
The first method is based on the construction of a k-mer matrix 8 which stores the presence/absence of k-mers across multiple samples of different well-characterised sources. Such a matrix is then used to predict the proportion of each source in unknown input samples.
The second method, aKmerBroom, retains, from a contaminated input read set, the reads likely to belong to a specific source of interest. On synthetic data, it achieves over 89.53% sensitivity and 94.00% specificity. On real datasets, aKmerBroom shows higher read retention (+60% on average) than competing methods.
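The k-mer matrix idea behind the first method can be illustrated with the following hedged toy (rows are k-mers, columns are known sources; the actual matrix construction 8 and the proportion estimation are far more involved and operate at a much larger scale).

```python
# Toy k-mer presence/absence matrix across labelled source samples.
# Real tools build such matrices for billions of k-mers with compact encodings.

K = 4

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_matrix(sources):
    """sources: dict source_name -> list of sequences. Returns (kmer_list, matrix)."""
    per_source = {name: set().union(*(kmers(s) for s in seqs))
                  for name, seqs in sources.items()}
    all_kmers = sorted(set().union(*per_source.values()))
    matrix = {km: [int(km in per_source[name]) for name in sources] for km in all_kmers}
    return all_kmers, matrix

if __name__ == "__main__":
    sources = {"soil": ["ACGTACGT"], "oral": ["TTGCAACG"], "gut": ["ACGTTTGC"]}
    all_kmers, matrix = build_matrix(sources)
    for km in all_kmers[:5]:
        print(km, matrix[km])   # presence/absence profile across the three sources
```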
Squares (fragments of the form xx, i.e., two consecutive occurrences of the same string x) are among the most classical repetitive structures studied in combinatorics on words and text algorithms.
The fundamental question considered in algorithms on strings is that of indexing, that is, preprocessing a given string for specific queries. By now, we have a number of efficient solutions for this problem when the queries ask for an exact occurrence of a given pattern.
Recently, Bille et al. 45 introduced a variant of such queries, called gapped consecutive occurrences, in which a query consists of two patterns P1 and P2 together with a gap range [a, b], and asks for occurrences of P1 and P2 that appear at a distance within [a, b] of each other.
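Under the simplified reading given above, and leaving aside both the consecutiveness condition and the efficiency requirements that are the whole point of the cited works, a naive baseline for such queries can be written as follows; indexing structures aim to answer them without scanning the text.

```python
# Naive answer to a gapped occurrence query (simplified reading):
# report pairs of occurrences of p1 and p2 whose distance falls in [a, b].

def occurrences(text: str, pattern: str):
    """All starting positions of pattern in text (overlaps allowed)."""
    pos, start = [], text.find(pattern)
    while start != -1:
        pos.append(start)
        start = text.find(pattern, start + 1)
    return pos

def gapped_occurrences(text, p1, p2, a, b):
    """Quadratic-time baseline; indexing structures answer this much faster."""
    occ1, occ2 = occurrences(text, p1), occurrences(text, p2)
    return [(i, j) for i in occ1 for j in occ2 if a <= j - i <= b]

if __name__ == "__main__":
    text = "ACGTTACGGTACGT"
    print(gapped_occurrences(text, "ACG", "GT", 2, 6))  # [(0, 2), (5, 8), (10, 12)]
```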
The popularity of
Scaffolding is an intermediate stage of fragment assembly. It consists in orienting and ordering the contigs obtained from the assembly of the sequencing reads. In the general case, the problem has been widely studied using distance data between the contigs. Here we focus on a scaffolding method dedicated to chloroplast genomes. As these genomes are small, circular, and contain few repeats, numerous approaches have been proposed to assemble them, but their specificities have not been sufficiently exploited. We give a new formulation of scaffolding in the case of chloroplast genomes as a discrete optimisation problem, which we prove to be NP-complete. It does not require distance information; instead, it takes a genomic-region view, with priority given to scaffolding the repeats first. In this way, we encode the multimeric-forms issue in order to retrieve the several genome forms that can coexist in the same chloroplast cell. In addition, we provide an Integer Linear Program (ILP) to obtain exact solutions, which we implement in the Python3 package khloraascaf. We test it on synthetic data to investigate its performance and its robustness against several chosen difficulties. While the scaffolding problem is traditionally defined with distance data, we show that it is possible to avoid them in the case of the well-studied circular chloroplast genomes. The presented results show that the region view seems to be sufficient to scaffold the repeats 34.
Local assembly consists in reconstructing a sequence of interest from a sample of sequencing reads without having to assemble the entire genome, which is time- and labor-intensive. This is particularly useful when studying a locus of interest, for gap-filling in draft assemblies, as well as for alternative allele reconstruction of large insertion variants. Whereas linked-read technologies have a great potential to assemble specific loci, as they provide long-range information while maintaining the power and accuracy of short-read sequencing, there is a lack of local assembly tools for linked-read data.
We present MTG-Link (7.1.6), a novel local assembly tool dedicated to linked-reads. The originality of the method lies in its read subsampling step, which takes advantage of the barcode information contained in linked-reads mapped in the flanking regions of each targeted locus. Our approach then relies on our tool MindTheGap 10 to perform the local assembly of each locus with the read subsets. MTG-Link tests different parameter values for gap-filling, followed by an automatic qualitative evaluation of the assembly.
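The barcode-based subsampling step can be pictured with the hedged toy below (not MTG-Link's actual code): barcodes observed in reads mapped to the flanking regions of a target locus are collected, and only the reads carrying those barcodes are passed to the local assembly.

```python
# Toy barcode-based read subsampling for the local assembly of one target locus.
# Read records: (read_id, barcode, mapped_region), where mapped_region may be None.

def subsample_reads(reads, flanking_regions):
    """Keep every read whose barcode is seen in reads mapped to the flanking regions."""
    selected_barcodes = {barcode for _, barcode, region in reads
                         if region in flanking_regions}
    return [read_id for read_id, barcode, _ in reads if barcode in selected_barcodes]

if __name__ == "__main__":
    reads = [
        ("r1", "BX:Z:AAAC", "left_flank"),
        ("r2", "BX:Z:AAAC", None),          # same molecule as r1: kept
        ("r3", "BX:Z:GGTA", "right_flank"),
        ("r4", "BX:Z:TTTG", None),          # barcode never seen in the flanks: discarded
    ]
    print(subsample_reads(reads, {"left_flank", "right_flank"}))  # ['r1', 'r2', 'r3']
```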
We validated our approach on several datasets from different linked-read technologies. We show that MTG-Link is able to successfully assemble large sequences, up to dozens of kb. We also demonstrate that the read subsampling step of MTG-Link considerably improves the local assembly of specific loci compared to other existing short-read local assembly tools.
Furthermore, MTG-Link was able to fully characterize large insertion variants in a human genome and improved the contiguity of a 1.3 Mb locus of biological interest in several individual genomes of the mimetic butterfly Heliconius numata 15.
Long-read assemblers struggle to distinguish closely related strains of the same species and collapse them into a single sequence. This is very limiting when analysing a metagenome, as different strains can have important functional differences. We have designed a new methodology, implemented in a software tool called HairSplitter (7.1.15), which recovers the strains from a strain-oblivious assembly and long reads. The originality of the method lies in a custom variant-calling step that works with erroneous reads and separates an unknown number of haplotypes. On simulated datasets, we show that HairSplitter significantly outperforms the state of the art when dealing with metagenomes containing many strains of the same species 25, 40.
We also propose an alternative approach to the strain separation problem using Integer Linear Programming (ILP). We introduce a strain-separation module, strainMiner, and integrate it into an established pipeline to create strain-separated assemblies from sequencing data. Across simulated and real experiments encompassing a wide range of sequencing error rates (5-12%), our tool consistently compared favorably to the state of the art in terms of assembly quality and strain reconstruction. Moreover, strainMiner substantially cuts down the computational burden of strain-level assembly compared to published software by leveraging the powerful Gurobi solver. We think the new methodological ideas presented in this paper will help democratize strain-separated assembly 27.
One of the problems in Structural Variant (SV) analysis is the genotyping of variants. It consists in estimating the presence or absence of a set of known variants in a newly sequenced individual. Our team previously released SVJedi, one of the first SV genotypers dedicated to long read data. The method is based on linear representations of the allelic sequences of each SV. While this is very efficient for distant SVs, the method fails to genotype some closely located or overlapping SVs. To overcome this limitation, we present a novel approach, SVJedi-graph (7.1.5), which uses a sequence graph instead of linear sequences to represent the SVs.
In our method, we build a variation graph to represent in a single data structure all the alleles of a set of SVs. The long reads are mapped on the variation graph, and the resulting alignments that cover allele-specific edges in the graph are used to estimate the most likely genotype for each SV. Running SVJedi-graph on simulated sets of close and overlapping deletions showed that this graph model prevents the bias toward the reference alleles and maintains high genotyping accuracy whatever the SV proximity, contrary to other state-of-the-art genotypers. On the human gold-standard HG002 dataset, SVJedi-graph obtained the best performance, genotyping 99.5% of the high-confidence SV callset with an accuracy of 95% in less than 30 min 20.
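A highly simplified sketch of the genotype estimation principle is given below (a toy allele-balance rule, not SVJedi-graph's actual statistical model): read alignments covering allele-specific edges are counted per allele, and the genotype is derived from their ratio.

```python
# Toy genotype call from counts of reads covering allele-specific edges.
# ref_support / alt_support: numbers of read alignments covering edges specific
# to the reference and alternative alleles of one SV. Thresholds are illustrative.

def call_genotype(ref_support, alt_support, min_reads=3, het_low=0.2, het_high=0.8):
    """Return '0/0', '0/1', '1/1' or './.' from a simple allele-balance rule."""
    total = ref_support + alt_support
    if total < min_reads:
        return "./."                      # not enough informative reads
    alt_ratio = alt_support / total
    if alt_ratio < het_low:
        return "0/0"
    if alt_ratio > het_high:
        return "1/1"
    return "0/1"

if __name__ == "__main__":
    print(call_genotype(12, 1))   # 0/0
    print(call_genotype(7, 8))    # 0/1
    print(call_genotype(0, 15))   # 1/1
```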
A variation graph is a data structure that aims to represent variations among a collection of genomes. It is a sequence graph in which each genome is embedded as a path, the successive nodes along the path corresponding to successive segments of the associated genome sequence. Shared subpaths correspond to genomic regions shared between genomes, and divergent paths to variations: this structure captures inversions, insertions, deletions, and substitutions. The construction of a variation graph from a collection of chromosome-size genome sequences is a difficult task that is generally addressed using heuristics, such as those implemented in the state-of-the-art pangenome graph builders minigraph-cactus and pggb. The question that arises is to what extent the construction method influences the resulting graph, and therefore to what extent the resulting graph reflects genuine genomic variations. We propose to address this question by constructing an edition script between two variation graphs built from the same set of genomes, which provides a measure of similarity and, more importantly, enables the identification of discordant regions between the two graphs. We proceed by comparing, for each genome, the two corresponding paths in the two graphs, which correspond to two possibly different segmentations of the same genomic sequence. For each interval defined by the nodes of the genome's path in the first graph, we define a set of relations with the nodes of the second graph (equalities, prefix and suffix overlaps, etc.), which allows us to compute how many elementary operations, such as fusions and divisions of nodes, are required to go from one graph to the other. We tested our method on variation graphs constructed with minigraph-cactus and pggb from both simulated data and a real dataset made of 15 yeast telomere-to-telomere phased genome assemblies. We showed that two graphs built with the same tool, minigraph-cactus, but with different incorporation orders of the genomes can be more different from one another than two graphs built with the two different tools. We also showed that our distance makes it possible to pinpoint and visualize the specific areas of the graph and genomes that are impacted by the changes in segmentation. The method is implemented in a Python tool named Pancat (7.1.13) 33, 24.
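The comparison of two segmentations of the same genome can be pictured with the hedged toy below (not Pancat's actual algorithm): from the node lengths along the genome's path in each graph, intervals on the genome are derived, and each pair of intersecting intervals is classified (equality, inclusion, partial overlap).

```python
# Classify relations between two segmentations of the same genome sequence.
# Each segmentation is the list of node lengths along the genome's path in a graph.

def intervals(node_lengths):
    """Turn successive node lengths into (start, end) intervals on the genome."""
    result, pos = [], 0
    for length in node_lengths:
        result.append((pos, pos + length))
        pos += length
    return result

def classify(i1, i2):
    (s1, e1), (s2, e2) = i1, i2
    if (s1, e1) == (s2, e2):
        return "equality"
    if (s1 >= s2 and e1 <= e2) or (s2 >= s1 and e2 <= e1):
        return "inclusion"
    return "partial overlap"

def compare_segmentations(lengths_a, lengths_b):
    rels = []
    for ia in intervals(lengths_a):
        for ib in intervals(lengths_b):
            if ia[0] < ib[1] and ib[0] < ia[1]:       # intervals intersect
                rels.append((ia, ib, classify(ia, ib)))
    return rels

if __name__ == "__main__":
    # The same 12 bp genome segmented differently by two graphs.
    for rel in compare_segmentations([4, 4, 4], [4, 8]):
        print(rel)
```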
Genomic regions under positive selection harbor variation linked for example to adaptation. Most tools for detecting positively selected variants have computational resource requirements rendering them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows–Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of “big data” genomics: a combinatorial core coupled with statistical inference in closed form 16.
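The positional Burrows–Wheeler transform at the core of the pattern-matching step can be sketched as follows (the basic construction of the positional prefix arrays, without the divergence arrays, given here as a simplified illustration rather than the team's implementation): at each site, haplotypes are kept sorted by their reversed prefixes, so that haplotypes sharing long matches ending at that site become adjacent.

```python
# Positional prefix arrays of the PBWT (basic construction, no divergence arrays).
# X is a list of binary haplotypes of equal length. After processing site k,
# haplotypes are ordered by their reversed prefixes X[i][:k+1][::-1], so that
# haplotypes sharing long matches ending at site k are adjacent in the ordering.

def pbwt_prefix_arrays(X):
    n_sites = len(X[0])
    order = list(range(len(X)))          # initial ordering of haplotype indices
    arrays = []
    for k in range(n_sites):
        zeros = [i for i in order if X[i][k] == 0]
        ones = [i for i in order if X[i][k] == 1]
        order = zeros + ones             # stable split by the allele at site k
        arrays.append(order[:])
    return arrays

if __name__ == "__main__":
    haplotypes = [
        [0, 1, 0, 1],
        [1, 1, 0, 0],
        [0, 1, 0, 0],
        [1, 0, 1, 0],
    ]
    for k, a in enumerate(pbwt_prefix_arrays(haplotypes)):
        print(f"site {k}: {a}")
```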
We have developed a method based on a dynamic sliding window encoding (DSWE) for storing encrypted data in DNA form, taking into account biological constraints and the prohibited nucleotide motifs used for data indexing. Its originality is twofold. First, it takes advantage of variable-length DNA codewords to avoid homopolymers longer than
In the absence of a DNA template, the ab initio production of long double-stranded DNA molecules of predefined sequences is particularly challenging. The DNA synthesis step remains a bottleneck for many applications, such as the functional assessment of ancestral genes, the analysis of alternative splicing, or DNA-based data storage. We worked on a fully in vitro protocol to generate very long double-stranded DNA molecules, starting from commercially available short DNA blocks, in less than 3 days. This innovative application of Golden Gate assembly allowed us to streamline the assembly process and produce a 24 kb long DNA molecule storing part of the Declaration of Human and Citizen Rights. The DNA molecule produced can be readily cloned into a suitable host/vector system for amplification and selection 36.
Processing-in-memory (PIM) consists of processing capabilities tightly coupled with the main memory. Instead of bringing all data into a centralized processor, far away from the data storage, in-memory computing processes the data directly where it resides, suppressing most data movements and thereby greatly improving the performance of massive data applications, potentially by orders of magnitude. NGS data analysis falls squarely within the application domains where PIM can strongly accelerate the main time-consuming software in genomics and metagenomics. More specifically, mapping algorithms, intensive sequence comparison algorithms, or bank searching, for example, can greatly benefit from the parallel nature of the PIM concept.
In the framework of the European BioPIM project, we have studied, from a parallelism point of view, a number of data structures used extensively in genomics software, in order to assess the benefits of implementing them on PIM architectures. The following data structures have been studied: Bloom filters, the Burrows–Wheeler transform, and hash tables. A detailed report on these evaluations is currently available on the GenoPIM project website.
The programming model for processing-in-memory is not yet well defined. The question of automatic (or semi-automatic) parallelization remains open. The tools available for programming a complete application on a PIM-equipped architecture are relatively low-level. We are working on the design of a C++ programming environment to unify the programming of CPU and PIM memory processing.
Memory components based on PIM principles have been developed by UPMEM, a young startup founded in 2015. The company has designed an innovative DRAM Processing Unit (DPU), a RISC processor integrated directly into the memory chip, on the DRAM die. We are using a UPMEM PIM server equipped with 160 GB of PIM memory and 256 GB of legacy memory to carry out full-scale experiments on the implementation of several genomic software tools for long DNA comparison, bacterial genome comparison, protein sequence alignment, data compression, and sorting algorithms.
Initial experiments show that, for certain applications, it is possible to achieve a 20-fold acceleration compared with standard multicore platforms.
We developed an automated pipeline, Mapler (7.1.14), to assess the performance of long-read metagenome assemblers, with a special focus on tools dedicated to high-fidelity long reads such as PacBio HiFi reads. It compares five assembly tools, namely metaMDBG, metaFlye, hifiasm-meta, OPERA-MS, and miniasm, and uses various evaluation metrics: reference-based, reference-free, and binning-based. We applied this pipeline to several real metagenome datasets of various complexities, including publicly available mock communities with well-characterized species content and tunnel-culture soil metagenomes with high and unknown species complexity. We showed that the different evaluation metrics are complementary and that high-fidelity long-read data allow drastic improvements in the number of good-quality Metagenome-Assembled Genomes (MAGs) obtained, with respect to low-fidelity long reads or short-read data, with metaMDBG outperforming the other HiFi-dedicated assemblers. However, for soil metagenomes, most reconstructed MAGs remain of low quality, because of the very large number of species in these microbiomes and the probable under-sampling of this diversity by this type of sequencing 41.
In this book chapter, we review the different bioinformatics analyses that can be performed on metagenomics and metatranscriptomics data. We present the differences between this type of data and standard genomics data and highlight the methodological challenges that arise from them. We then give an overview of the different methodological approaches and tools used to perform various analyses, such as taxonomic annotation, genome assembly and binning, and de novo comparative genomics 28.
Through its long-term collaboration with INRAE IGEPP, and its support to the BioInformatics of Agroecosystems Arthropods (BIPAA) platform, GenScale is involved in various genomic and transcriptomic projects in the field of agricultural research. We participated in the genome assembly and analyses of some major agricultural pests and their natural enemies. In particular, we performed a genome-wide identification of lncRNAs associated with viral infection in the lepidopteran pest Spodoptera frugiperda 19 and participated in a detailed study of a genomic region linked to reproductive mode variation in the pea aphid 17.
In most cases, the genomes and their annotations were hosted in the BIPAA information system, allowing collaborative curation of various sets of genes and leading to novel biological findings 42.
In the framework of a former ANR project (SpecRep, 2014-2019), we worked on the de novo genome assembly of several ithomiine butterflies. Due to their high heterozygosity level and to sequencing data of variable quality, this was a challenging task, and we tested numerous assembly tools. Finally, this work led to the generation of high-quality, chromosome-scale genome assemblies for two Melinaea species, M. marsaeus and M. menophilus, and a draft genome of the species Ithomia salapia. We obtained genomes with sizes ranging from 396 Mb to 503 Mb across the three species, and scaffold N50 values of 40.5 Mb and 23.2 Mb for the two chromosome-scale assemblies. Various genomics and comparative genomics analyses were performed and notably revealed independent gene expansions in ithomiines, particularly in gustatory receptor genes.
These three genomes constitute the first reference genomes for ithomiine butterflies (Nymphalidae: Danainae), which represent the largest known radiation of Müllerian mimetic butterflies and numerically dominate mimetic butterfly communities. They are therefore a valuable addition and a welcome point of comparison with existing biological models such as Heliconius, and will enable further understanding of the mechanisms of mimicry and adaptation in butterflies 14.
In some species the Y is a tiny chromosome, but the dioecious plant Silene latifolia has a giant 550 Mb Y chromosome, which had remained unsequenced so far. We participated in a collaborative project that sequenced and obtained a high-quality male S. latifolia genome. In particular, we took part in the comparative analysis of the sex chromosomes with outgroups, which showed that the Y is surprisingly rearranged and degenerated for an 11-MY-old system. Recombination suppression between X and Y extended in a stepwise process and triggered a massive accumulation of repeats on the Y, as well as in the non-recombining pericentromeric region of the X, leading to giant sex chromosomes 38.
ALPACA project on cordis.europa.eu
Genomes are strings over the letters A, C, G, T, which represent nucleotides, the building blocks of DNA. In view of the ultra-large amounts of genome sequence data emerging from ever more numerous and technologically rapidly advancing sequencing devices (the amounts of sequencing data accrued are meanwhile reaching the exabyte scale), the driving, urgent question is: how can we arrange and analyze these data masses in a formally rigorous, computationally efficient, and biomedically rewarding manner?
Graph-based data structures have been shown to have disruptive benefits over traditional sequence-based structures when representing pan-genomes, that is, sufficiently large, evolutionarily coherent collections of genomes. This idea has its immediate justification in the laws of genetics: evolutionarily closely related genomes differ only by relatively small numbers of letters, while sharing the majority of their sequence content. Graph-based pan-genome representations, which remove redundancies without discarding individual differences, therefore make utmost sense. In this project, we will put this shift of paradigms, from sequence-based to graph-based representations of genomes, into full effect. As a result, we can expect a wealth of practically relevant advantages, among which the arrangement, analysis, compression, integration, and exploitation of genome data are the most fundamental points. In addition, we will also open up a significant source of inspiration for computer science itself.
For realizing our goals, our network will (i) decisively strengthen and form new ties in the emerging community of computational pan-genomics, (ii) perform research on all relevant frontiers, aiming at significant computational advances at the level of important breakthroughs, and (iii) boost relevant knowledge exchange between academia and industry. Last but not least, in doing so, we will train a new, “paradigm-shift-aware” generation of computational genomics researchers.
Description: The storage of information on DNA requires setting up complex biotechnological processes that introduce non-negligible noise during the reading and writing processes. Synthesis, sequencing, storage, or manipulation of DNA can introduce errors that can jeopardize the integrity of the stored data. From an information processing point of view, DNA storage can then be seen as a noisy channel for which appropriate codes must be defined. The first challenge of MoleculArXiv-PC2 is to identify coding schemes that efficiently correct the different errors introduced at each biotechnological step, under its specific constraints.
A major advantage of storing information on DNA, besides durability, is its very high density, which allows a huge amount of data to be stored in a compact manner. Chunks of data stored in the same container must imperatively be indexed in order to reconstruct the original information. The same indexes can also act as a filter for the selective reading of a subgroup of sequences. Current DNA synthesis technologies produce short DNA fragments, which strongly limits the useful information that can be carried by each fragment, since a significant part of the DNA sequence is reserved for its identification. A second challenge is to design efficient indexing schemes that allow selective queries on subgroups of data while optimizing the useful information in each fragment.
Third generation sequencing technologies are becoming central in the DNA storage process. They are easy to implement and have the ability to adapt to different polymers. The quality of analysis of the resulting sequencing data will depend on the implementation of new noise models, which will improve the quality of the data coding and decoding. A challenge will be to design algorithms for third generation sequencing data that incorporate known structures of the encoded information.
Description: To address the constraints of climate change while meeting agroecological objectives, one approach is to efficiently characterize previously untapped genetic diversity stored in ex situ and in situ collections before its utilization in selection. This will be conducted in the AgroDiv project for major animal (rabbits, bees, trout, chickens, pigs, goats, sheep, cattle, etc.) and plant (wheat, corn, sunflower, melon, cabbage, turnip, apricot tree, peas, fava beans, alfalfa, tomatoes, eggplants, apple trees, cherry trees, peach trees, grapevines, etc.) species in French agriculture. The project will thus use and develop cutting-edge genomics and genetics approaches to deeply characterize biological material and evaluate its potential value for future use in the context of agroecological transition and climate change.
The GenScale team is involved in two of the six working axes of the project. First, we aim at developing efficient and user-friendly indexing and search engines to exploit omics data at a broad scale. The key idea is to mine publicly available omics and genomic data, as well as those generated within this project. This encompasses new algorithmic methods and optimized implementations, as well as their large-scale application. This work will start in early 2024. Second, we will develop novel algorithms and tools for characterizing and genotyping structural variations in pangenome graphs built from the genomic resources generated by the project.
Description: MISTIC connects the INRAE's extensive expertise in experimental crop culture systems with Inria's expertise in computation and artificial intelligence, with the goal of developing tools for modeling the microbiomes of crop plants using a systems approach. The microbial communities found on roots and leaves constitute the “dark matter” in the universe of crop plants, hard to observe but absolutely fundamental. The aim of the project is to develop new tools for analyzing multi-omics data, and new spatio-temporal models of microbial communities in crops.
GenScale’s task is to develop new metagenome assembly tools for these complex communities, taking advantage of novel accurate long-read technologies.
Description: The aim of the project is to build a coherent e-infrastructure supporting data management in line with FAIR and open science principles. It will complete and improve the connection between the data production, management and analysis services of the genomics and bioinformatics platforms and the biological resource centers, all linked to the work environments of the research units. It will ensure the connection with the data management services of the phenotyping infrastructures.
GenScale is involved in the integration and representation of "omics" data with graph data structures (WorkPackage 2), as well as in the assembly and annotation of several plant and animal genomes and in the building of pangenome graphs (WorkPackage 3).
Description: The Divalps project aims at better understanding how populations adapt to changes in their environment, and in particular climatic and biotic changes with altitude. Here, we focus on a complex of butterfly species distributed along the alpine altitudinal gradient. We will analyse the genomes of butterflies in contact zones to identify introgressions and rearrangements between taxa.
GenScale’s task is to develop new efficient methods for detecting and representing the genomic diversity among this species complex. We will focus in particular on Structural Variants and genome graph representations.
Partners: Inria teams: Dyliss, Zenith, Taran.
External partners are CEA-GenoScope, Elixir, Pasteur Institute, Inria Challenge OceanIA, CEA-CNRGH, and Mediterranean Institute of Oceanography.
Description: Genomic data enable critical advances in medicine, ecology, ocean monitoring, and agronomy. Precious sequencing data accumulate exponentially in public genomic data banks such as the ENA. A major limitation is that it is impossible to query these data in their entirety (petabytes of sequences).
In this context the project aims to provide a novel global search engine making it possible to query nucleotidic sequences against the vast amount of publicly available genomic data. The central algorithmic idea of a genomic search engine is to index and query small exact words (hundreds of billions over millions of datasets), as well as the associated metadata.