Gemma - Multi-Omics Toolbox
In this project, we will implement a multi-omics toolbox: a documented software package for analyzing the multi-omics GEMMA data. The toolbox includes pipeline modules for each omics layer (genomics, epigenomics, metagenomics, metabolomics, and immune profiling), comprehensive quality control (QC) routines, and integrative analysis methods (mid-integration strategies such as MOFA/mixOmics, with optional late-integration reporting).
The toolbox enables reproducible preprocessing and integrative analysis of multi-omics data in local computing environments (HPC or institutional servers). This solution ensures that sensitive data remain secure and never leave local infrastructures, while guaranteeing reproducibility and providing scalable routes for large datasets.
The toolbox covers both the necessary analyses to address the scientific aims and clear step-by-step documentation with example datasets, enabling colleagues to reproduce workflows independently.
Advantages of the Toolbox Approach
- Reproducibility: documented scripts, containers, and workflow managers ensure that analyses can be replicated across environments
- Data security: patient-sensitive data stay on local infrastructure and never need to be uploaded to external servers
- Resource efficiency: avoids maintaining an online platform while still allowing scalable computations on institutional systems
- Ease of maintenance: pipelines can be updated and released through version control (e.g., GitHub/GitLab), with changelogs and environment specifications
- Analysing data: the data will be linked to the toolbox once the results are published in scientific articles, and will be shared via dedicated data repositories (e.g., ENA, SRA, GEO, EGA, MGnify) designed for this purpose
- User support: documentation, tutorials, and examples facilitate the further use of the tools
Risk Mitigation
- Usability: clear README, step-by-step instructions
- Dependencies: will be given in instructions
- Computation needs: resource requirements will be documented
- Integration methods: toolbox includes modules for both mid-integration (combined modeling) and late-integration (result combination), depending on the scientific question
- Sustainability: reproducibility supported via version control, documented updates, and maintained environments
Gemma Toolbox as a Platform
A Gemma toolbox-based solution is preferable for sensitive biomedical data because it prioritizes reproducibility, scalability, and data protection. By combining pipelines, workflow managers, and comprehensive documentation, the toolbox constitutes a sufficient and efficient multi-omics platform for the proposed research.
Integration Strategies within the Toolbox
The toolbox will support multiple integration strategies depending on the research question and data availability. Late integration is implemented by processing each omics layer independently (e.g., identifying significant variants, detecting differentially methylated regions, profiling microbial differences) and combining the results at the interpretation stage through pathway enrichment, network analyses, or meta-analyses. This approach is straightforward, robust to technical differences across datasets, and provides biologically interpretable outcomes.
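As a rough illustration of the meta-analysis route to late integration, the sketch below combines per-layer p-values for a single pathway using Fisher's method. The choice of Fisher's method, the function name `fisher_combined_pvalue`, and the p-values are assumptions made for this example and are not taken from the toolbox. Fisher's method also assumes the tests are independent, which may not hold when all layers are measured on the same samples.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis (k = number of tests).
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    # For an even number of degrees of freedom (2k), the chi-square survival
    # function has a closed form: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one pathway, one per independently analysed layer.
layer_pvalues = {"genomics": 0.04, "epigenomics": 0.10, "metagenomics": 0.03}
combined = fisher_combined_pvalue(list(layer_pvalues.values()))
print(f"combined p-value: {combined:.4g}")
```

Each layer is analysed on its own and only the summary statistics meet at the end, which is what makes late integration robust to technical differences between datasets.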
The other option is mid integration, in which the different omics layers are fed into a joint statistical or machine learning model (e.g., MOFA, mixOmics) that identifies shared latent factors across data types while preserving their specific structures. This approach is more powerful for uncovering cross-omics interactions and mechanistic links.
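To make the idea of a shared latent factor concrete, here is a deliberately simplified sketch. It is neither MOFA nor mixOmics, which fit richer models. It standardizes each feature, down-weights each block by the square root of its feature count so that no layer dominates, concatenates the blocks, and extracts the first principal component by power iteration. All names and numbers are hypothetical.

```python
import math


def standardize_columns(block):
    """Center each feature (column) and scale it to unit sample variance."""
    n = len(block)
    scaled_cols = []
    for col in zip(*block):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        scaled_cols.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def first_shared_factor(blocks, n_iter=200):
    """Return unit-norm sample scores on the first joint component.

    blocks: list of samples-by-features matrices sharing the same sample order.
    """
    n = len(blocks[0])
    joint = [[] for _ in range(n)]
    for block in blocks:
        scaled = standardize_columns(block)
        weight = 1.0 / math.sqrt(len(scaled[0]))  # equalize block influence
        for i, row in enumerate(scaled):
            joint[i].extend(v * weight for v in row)
    # Sample-by-sample Gram matrix of the weighted, concatenated data.
    gram = [[sum(a * b for a, b in zip(joint[i], joint[j])) for j in range(n)]
            for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(n_iter):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:
            raise ValueError("data have no variance")
        vec = [v / norm for v in new]
    return vec


# Hypothetical toy data: 6 samples, two layers that share one gradient.
microbiome = [[1.0, 9.1], [2.1, 8.0], [2.9, 7.2], [4.2, 5.9], [5.0, 5.1], [6.1, 3.9]]
methylation = [[0.10, 0.52, 0.3], [0.22, 0.61, 0.7], [0.29, 0.68, 0.2],
               [0.41, 0.83, 0.6], [0.50, 0.88, 0.4], [0.62, 1.01, 0.5]]
factor = first_shared_factor([microbiome, methylation])
print([round(v, 3) for v in factor])
```

The sign of the returned factor is arbitrary, so only the relative ordering of the samples along it is meaningful.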
With both late and mid integration modules, the toolbox ensures flexibility: researchers can choose simpler workflows when appropriate, but also apply state-of-the-art integrative modeling for comprehensive analyses.
We will use a mid-integration strategy to integrate microbiome, methylation, and genome data, followed by a late-integration strategy to associate metabolomics, immunoprofiling, and proteome data with the other omics measurements.
Available Tools
TOOL TYPE | TITLE | LEAD PARTNER | COMPLETENESS |
---|---|---|---|
Data analysis, Script | OMICS INTEGRATION FOR PRECLINICAL GEMMA DATA | Tampere University, INRAE | 100 % |
Data analysis, Script | GEMMA WHOLE GENOME SEQUENCING DATA PROCESSING | Tampere University, CNR-ITB | 90 % (delivery DEC 2025) |
Data analysis, Script | BIOMARKER-BASED POLYGENIC RISK SCORE FOR GEMMA GENOMES | Tampere University, CNR-ITB | 90 % (delivery DEC 2025) |
Dataset | A NETWORK OF MOLECULAR AND FUNCTIONAL INTERACTIONS TO ANALYSE GEMMA OMICS DATASETS | CNR-ITB | 80 % (delivery DEC 2025) |
Data analysis, Script | NETWORK-BASED MULTI-OMICS INTEGRATION TO PRIORITIZE FEATURES IN GEMMA OMICS DATASETS | CNR-ITB | 90 % (delivery DEC 2025) |
Data analysis, Script | ASSESSMENT OF FUNCTIONAL SIMILARITY AMONG BIOMARKERS | CNR-ITB | 80 % (delivery DEC 2025) |
OMICS INTEGRATION FOR PRECLINICAL GEMMA DATA
We developed a computational pipeline with accompanying scripts for the multi-omics integration of data from GEMMA's preclinical FMT mouse experiments.
GEMMA WHOLE GENOME SEQUENCING DATA PROCESSING
We developed a computational pipeline with accompanying scripts for the alignment of paired-end Illumina sequencing data and subsequent variant calling in the GEMMA project.
BIOMARKER-BASED POLYGENIC RISK SCORE FOR GEMMA GENOMES
We developed a computational pipeline with accompanying scripts for the construction of biomarker-informed polygenic risk scores (bioPRS) using genotypes called from GEMMA WGS. Standard PRS variants and effects are taken from the Grove et al. (2019) study.
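For readers unfamiliar with polygenic risk scores, the sketch below shows the standard additive form only: a weighted sum of effect-allele dosages over the variants present in both the genotype data and the effect-size table. How the biomarker information enters the bioPRS is not described here, so it is not modelled. The variant identifiers, effect sizes, and function name are hypothetical placeholders, not values from Grove et al.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Additive PRS for one individual.

    dosages: variant id -> effect-allele count (0, 1, or 2).
    effect_sizes: variant id -> per-allele effect (e.g., log odds ratio).
    Returns the score and the number of variants that contributed.
    """
    shared = [variant for variant in effect_sizes if variant in dosages]
    if not shared:
        raise ValueError("no overlapping variants between genotypes and effects")
    score = sum(effect_sizes[v] * dosages[v] for v in shared)
    return score, len(shared)


# Hypothetical inputs: placeholder variant ids and made-up effect sizes.
effects = {"variant_A": 0.12, "variant_B": -0.05, "variant_C": 0.30}
genotype = {"variant_A": 2, "variant_B": 1, "variant_D": 0}
score, n_used = polygenic_risk_score(genotype, effects)
print(f"PRS = {score:.2f} from {n_used} variants")
```

Reporting the number of contributing variants alongside the score helps spot individuals whose genotype calls overlap poorly with the effect-size table.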
A NETWORK OF MOLECULAR AND FUNCTIONAL INTERACTIONS TO ANALYSE GEMMA OMICS DATASETS
We developed a network of molecular and functional interactions to analyse gene-related, metabolite-related and microbiota species-related input scores derived from GEMMA omics datasets.
NETWORK-BASED MULTI-OMICS INTEGRATION TO PRIORITIZE FEATURES IN GEMMA OMICS DATASETS
We developed a pipeline for the integrative analysis of gene-related, metabolite-related and microbiota species-related data.
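The algorithm behind this pipeline is not specified here. One common family of network-based prioritization methods is network propagation, and the sketch below shows a random walk with restart as an example of that family, under the assumption of an undirected, unweighted interaction network. Node names, edges, and input scores are hypothetical.

```python
def random_walk_with_restart(edges, seed_scores, restart=0.5, n_iter=100):
    """Propagate input scores over an undirected network.

    edges: iterable of (node_a, node_b) pairs.
    seed_scores: node -> non-negative input score (e.g., from an omics layer).
    Returns node -> propagated score; the scores sum to 1.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    total = sum(seed_scores.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one network node needs a positive input score")
    start = {n: seed_scores.get(n, 0.0) / total for n in nodes}
    current = dict(start)
    for _ in range(n_iter):
        spread = {n: 0.0 for n in nodes}
        for n in nodes:
            share = current[n] / len(neighbors[n])  # split evenly among neighbors
            for m in neighbors[n]:
                spread[m] += share
        current = {n: (1 - restart) * spread[n] + restart * start[n] for n in nodes}
    return current


# Hypothetical network linking genes, a metabolite, and a microbial species.
edges = [("gene_1", "metabolite_1"), ("metabolite_1", "species_1"),
         ("species_1", "gene_2"), ("gene_2", "gene_3")]
scores = random_walk_with_restart(edges, {"gene_1": 1.0})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Features close to high-scoring inputs in the network receive higher propagated scores, which is the intuition behind using network context to prioritize features across omics layers.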
ASSESSMENT OF FUNCTIONAL SIMILARITY AMONG BIOMARKERS
We developed a pipeline to assess the similarity among biomarkers. The approach uses molecular and functional interactions, as well as molecular pathways, to estimate the functional similarity between novel biomarkers and existing biomarkers.
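As a minimal illustration of pathway-based functional similarity, the sketch below scores a novel biomarker against known ones using the Jaccard index of their pathway annotations. The Jaccard index is an assumption chosen for simplicity, not necessarily the measure used in the pipeline, and the biomarker and pathway names are placeholders.

```python
def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_known_biomarkers(novel_pathways, known_biomarkers):
    """Rank known biomarkers by pathway overlap with a novel biomarker."""
    scored = {name: jaccard_similarity(novel_pathways, pathways)
              for name, pathways in known_biomarkers.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical pathway annotations.
novel = {"pathway_1", "pathway_2", "pathway_3"}
known = {
    "known_biomarker_A": {"pathway_2", "pathway_3", "pathway_4"},
    "known_biomarker_B": {"pathway_5"},
}
ranking = rank_known_biomarkers(novel, known)
print(ranking)
```

The same function works for neighbor sets in an interaction network in place of pathway sets, which would cover the interaction-based part of the comparison.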