Persephone¶
- analyseWBMs(mWBMPath, fluxPath, rxnList, varargin)[source]¶
analyseWBMs predicts the maximal fluxes for a list of user-defined reactions (rxnList). All predicted are further described in analyseWBMsol.m.
- USAGE
[FBA_results, pathsToFilesForStatistics] = analyseWBMs (mWBMPath, fluxPath, rxnList)
INPUTS: mWBMPath Path (character array) to WBMs fluxPath Path to directory where the results will be stored rxnList Cell array of VMH metabolites to investigate.
Example: rxnList = {‘DM_trp_L[bc], DM_met_L[bc],’Brain_trp_L[c], Heart_met_L[x]}. Note that demand reactions are automatically added if they are not present in the models.
OPTIONAL INPUTS numWorkersOptimization Number of workers that will perform FBA in parallel. Note
that more workers does not necessarily make the function faster. It is generally not recommended to set numWorkersOptimization equal to the number of available cores (see: feature(‘numCores’)) as linear solvers can already support multi-core linear optimisation, thus resulting in unnessessary overhead. On computers with 8 cores or less, it is recommended to set numWorkersOptimization to 1. On a computer with 36 cores, an optimal configuration of numWorkersOptimization=6 was found.
- saveFullRes Boolean (true/false) indicating if all the complete .v, .y.
, and .w vectors are stored in the result. Default = true. It is recommended to set saveFullRes
paramFluxProcessing Structured array with optional parameters:
.NumericalRounding defines how much the predicted flux values are rounded. A defined value of 1e-6 means that a flux value of 2 + 2.3e-8 is rounded to 2. A flux value of 0 + 1e-15 would be rounded to exactly zero. This rounding factor will also be applied to the shadow price values. If microbiome relative abundance data is provided, the same rounding factor will be applied to the relative abundance data.
Default parameterisation: - paramFluxProcessing.NumericalRounding = 1e-6;
Example: - paramFluxProcessing.NumericalRounding = 1e-6;
paramFluxProcessing.NumericalRounding = 1e-6;
.RxnRemovalCutoff defines the minimal number of samples for which a unique reaction flux could be obtained, before removing the reaction for further analysis. This parameter can be expressed as * fraction: the fraction of samples with unique values, * SD: the standard deviation across samples, and * count: the counted number of unique values. If microbiome relative abundance data is provided, the same removal cutoff factor will be applied to the relative abundance data.
Default parameterisation: - paramFluxProcessing.RxnRemovalCutoff = {‘fraction’,0.1};
Examples: - paramFluxProcessing.RxnRemovalCutoff = {‘fraction’,0.1}; - paramFluxProcessing.RxnRemovalCutoff = {‘SD’,1}; - paramFluxProcessing.RxnRemovalCutoff = {‘count’,30};
paramFluxProcessing.RxnRemovalCutoff = {‘fraction’,0.1};
.RxnEquivalenceThreshold defines the minimal threshold of when functionally identical flux values are predicted, and are thus part of the same linear pathways. The threshold for functional equivalence is expressed as the R2 (r-squared) value after performing a simple linear regression between two reactions.
Default parameterisation: - paramFluxProcessing.RxnEquivalenceThreshold = 0.999;
Example: - paramFluxProcessing.RxnEquivalenceThreshold = 0.999;
paramFluxProcessing.RxnEquivalenceThreshold = 0.999;
.fluxMicrobeCorrelationType defines the method for correlating the predicted fluxes with microbial relative abundances. Note that this metric is not used if mWBMs are not present. The available correlation types are: * regression_r2: the R2 (r-squared) value from pairwised linear regression on the predicted fluxes against microbial relative abundances. * spearman_rho: the correlation coefficient, rho obtained from pairwise Spearman nonparametric correlations between predicted fluxes and microbial relative abundances.
Default parameterisation: - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’;
Examples: - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’; - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘spearman_rho’;
- fluxAnalysisPath: Character array with path to directory where all
results will be saved (Default = pwd)
- Solver: Validated solvers: ‘cplex_direct’,’ibm_cplex’
‘tomlab_cplex’, ‘gurobi’, ‘mosek’
OUTPUT results Structured array with FBA fluxes, solver statistics, and
paths to the flux table and raw flux results
- analyseWBMsol(fluxPath, paramFluxProcessing, fluxAnalysisPath)[source]¶
analyseWBMsol loads the FBA solutions produced in analyseWBMs.m, prepares the FBA results for further analyses, and produces summary statistics into the flux results.
- The function contains six parts:
PART 1: Loading the flux solutions and corresponding shadow prices
PART 2: Converting the microbial biomass shadow prices
associated with each optimised reaction flux to human readable tables and producing statistics on microbial influences on the flux results. - PART 3: Isolate the microbial component of the fluxes by subtracting the germ-free fluxes in a sex specific manner. In this part, descriptive statistics are also obtained on the fluxes across samples and metabolites are removed based on a user defined threshold (see thresholds input) - PART 4: In this part, microbe-flux associations defined by the species biomass shadow prices (PART 2) are further quantified by performing Spearman correlations on the fluxes and microbe relative abundances. - PART 5: Reactions with identical or near identical fluxes across samples (user defined) are grouped and collapsed into a single result. - PART 6: Save all results
- USAGE
[FBA_results, pathsToFilesForStatistics] = analyseWBMsol (fluxPath,paramFluxProcessing, fluxAnalysisPath)
INPUTS: fluxPath Character array with path to .mat files produced in
optimiseRxnMultipleWBM.m
paramFluxProcessing Structured array with optional parameters:
.numericalRounding defines how much the predicted flux values are rounded. A defined value of 1e-6 means that a flux value of 2 + 2.3e-8 is rounded to 2. A flux value of 0 + 1e-15 would be rounded to exactly zero. This rounding factor will also be applied to the shadow price values. If microbiome relative abundance data is provided, the same rounding factor will be applied to the relative abundance data.
Default parameterisation: - paramFluxProcessing.numericalRounding = 1e-6;
Example: - paramFluxProcessing.numericalRounding = 1e-6;
paramFluxProcessing.numericalRounding = 1e-6;
.rxnRemovalCutoff defines the minimal number of samples for which a unique reaction flux could be obtained, before removing the reaction for further analysis. This parameter can be expressed as * fraction: the fraction of samples with unique values, * SD: the standard deviation across samples, and * count: the counted number of unique values. If microbiome relative abundance data is provided, the same removal cutoff factor will be applied to the relative abundance data.
Default parameterisation: - paramFluxProcessing.rxnRemovalCutoff = {‘fraction’,0.1};
Examples: - paramFluxProcessing.rxnRemovalCutoff = {‘fraction’,0.1}; - paramFluxProcessing.rxnRemovalCutoff = {‘SD’,1}; - paramFluxProcessing.rxnRemovalCutoff = {‘count’,30};
paramFluxProcessing.rxnRemovalCutoff = {‘fraction’,0.1}; RxnEquivalenceThreshold .rxnEquivalenceThreshold defines the minimal threshold of when functionally identical flux values are predicted, and are thus part of the same linear pathways. The threshold for functional equivalence is expressed as the R2 (r-squared) value after performing a simple linear regression between two reactions.
Default parameterisation: - paramFluxProcessing.rxnEquivalenceThreshold = 0.999;
Example: - paramFluxProcessing.rxnEquivalenceThreshold = 0.999;
paramFluxProcessing.rxnEquivalenceThreshold = 0.999;
.fluxMicrobeCorrelationMetric defines the method for correlating the predicted fluxes with microbial relative abundances. Note that this metric is not used if mWBMs are not present. The available correlation types are: * regression_r2: the R2 (r-squared) value from pairwised linear regression on the predicted fluxes against microbial relative abundances. * spearman_rho: the correlation coefficient, rho obtained from pairwise Spearman nonparametric correlations between predicted fluxes and microbial relative abundances.
Default parameterisation: - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’;
Examples: - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’; - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘spearman_rho’;
- fluxAnalysisPath
Character array with path to directory where all results will be saved.
- AUTHOR:
Tim Hensen, July 2024
Jonas Widder, 11/2024 (small bugfixes)
- createBatchMWBM(mgpipePath, mWBMPath, metadataPath, varargin)[source]¶
This function creates personalised host-microbiome WBM models (mWBMs) by joining microbiome community models with unpersonalised WBM models in a sex-specific manner. The models are parameterised on a predefined diet.
- USAGE
[modelStats, summaryStats, dietInfo, dietGrowthStats] = createBatchMWBM (mgpipePath, mWBMPath, metadataPath)
INPUTS mgpipePath: Path to microbiome community models created by the
microbiome modelling toolbox.
- mWBMPath Path to directory where the HM models are
saved
OPTIONAL INPUTS Diet Diet option: ‘EUAverageDiet’ (default) numWorkersCreation Number of cores used for parallelisation.
Default = 4.
- checkFeasibility Flag (true/false) to run the
ensureHMfeasibility.m function and check if the generated models can grow on the diet. Default = true.
- wbmDirectory Path to directory with user defined WBMs.
This function supports one user adapted male and one user adapted female WBM or personalised germ free WBMs for each sample with a microbiome community model. Default is empty. If wbmDirectory is empty, Harvey/Harvetta version 1.04c are used.
OUTPUTS modelStats Table with summary statistics on the generated WBMs:
gender, number of reactions, metabolites,constrainsts, and taxa.
- summaryStats Table with the mean and SD of the model
statistics in the modelStats variable.
- dietInfo Table with diet growth information.
models can grow on the given diet.
dietGrowthStats Table with statistics on
Authors: Tim Hensen, 2023, 2024
- createForestPlot(estimates, ci, names, pValues, plotTitle, xTitle, hideLegend)[source]¶
createForestPlot generates a forest plot to display confidence intervals for estimates.
- USAGE
createForestPlot (estimates, ci, names, pValues, plotTitle, xTitle, hideLegend)
- INPUTS
estimates - Vector of estimates (e.g., effect sizes or log fold changes)
ci - Matrix of confidence intervals [n x 2] with lower and upper bounds in columns.
names - Cell array of labels for each data point.
pValues - Vector of p-values corresponding to each estimate.
plotTitle - String, title of the plot.
xTitle - String, label for the x-axis (typically “Effect Size” or “Log Fold Change”)
hideLegend - Logical, if true, hides the legend (default is false)
- createMWBM(microbiota_model, WBM_model, Diet, saveDir)[source]¶
This function creates personalised host-microbiome WBM models by joining microbiome community models with unpersonalised WBM models. The models are parameterised on a predefined diet.
Example: modelHM = createMWBM(microbiota_model, WBM_model, ‘EUAverageDiet’)
INPUTS microbiota model: Microbiome community model created by the
microbiome modelling toolbox.
- WBM_model: Harvey or Harvetta whole-body metabolic
model
Diet Diet option: ‘EUAverageDiet’ (default)
Output modelHM: Personalised host-microbiome WBM model
- Authors: Tim Hensen, Bronson Weston, 2022
Tim Hensen, expanded WBM parameterisation July 2024 Tim Hensen, improved mWBM annotation November 2024
- createVulcanoPlot(estimates, pValues, names, plotTitle, xTitle, yTitle)[source]¶
createVulcanoPlot generates a volcano plot to visualize the relationship between regression estimates and p-values.
- USAGE
createVulcanoPlot (estimates, pValues, names, plotTitle, xTitle, yTitle)
- INPUTS
estimates - [vector] Regression estimates for each data point (e.g., effect sizes or log fold changes)
pValues - [vector] p-values corresponding to each estimate.
names - [cell array] Labels or names for each data point.
plotTitle - [string] Title of the plot.
xTitle - [string] Label for the x-axis (e.g., “Effect Size” or “Log Fold Change”)
yTitle - [string] Label for the y-axis (e.g., “-log10(p-value)”)
- OUTPUTS
None - This function generates a volcano plot in the current figure.
Example
- createVulcanoPlot(estimates, pValues, {‘Metabolite1’, ‘Metabolite2’}, …
‘Volcano Plot’, ‘Log Fold Change’, ‘-log10(p-value)’)
- ensureWBMfeasibility(mWBMPath, varargin)[source]¶
This function checks for the WBMs in the specified mWBMPath to grow on the given diet, finds a diet that makes all WBMs feasible and propagates this diet onto the models. The functions works as follows: 1) Load the WBMs from the directory and test which models are feasible on the diet. 2) If any WBM is not feasible, collect all infeasible WBMs, open all diet reactions, and test if the WBMs can grow on any diet. If one or more WBMs cannot grow when opening all diet reactions, the statistics are collected and the function stops. The user will need to debug the WBMs manually. 3) If all previously infeasible WBMs can grow when all diet reactions are opened,missing diet components are searched using getMissingDietModelHM. If getMissingDietModelHM cannot find a diet after one function call, this function is run again until a hard limit of 10 iterations is reached. If this limit is reached, no feasible diet can be found for all models and the user will need to manually debug the infeasible models. If missing diet components are found in one WBMs, these component are automatically propagated to the other WBMs infeasible on the original diet before searching for more missing diet components 4) If a new diet is found for which all previously infeasible WBMs can grow, the updated diet is propagated to all models in the mWBMPath. 5) The updated diet is stored as an output variable for this function and for each model, its feasibility on the original and updated diet is recorded.
- USAGE
[dietInfo, dietGrowthStats] = ensureHMfeasibility (mWBMPath, Diet)
INPUTS mWBMPath Path to directory where the HM models are saved Diet Diet option: ‘EUAverageDiet’ (default)
OUTPUTS dietInfo Structured variable containing the missing diet
components and the updated diet propagated to all models. This variable is empty if no updated diet could be found.
- dietGrowthStats Table indicating which WBMs could initially not grow on
the diet, which WBMs could not grow on any diet, and which WBMs could not grow on an updated diet.
Authors
Tim Hensen, 2024
- fdrBHadjustment(p)[source]¶
- IMPORTANT; This function was partially generated by chatGPT o3-mini.
The code is manually tested and altered with bug fixes. This function performs identical to running q = mafdr(p, ‘BHFDR’, true) using the Matlab bioinformatics toolbox.
FDR_BH Performs Benjamini-Hochberg FDR correction on a vector of p-values.
q = FDR_BH(p) returns the FDR-corrected p-values q for the input vector p.
- Input:
p - a vector of p-values (assumed to be in the interval [0, 1])
- Output:
q - a vector of FDR corrected p-values, in the same order as p.
- The function implements the procedure described in:
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing.
- Example:
pvals = [0.01, 0.04, 0.03, 0.002, 0.07]; qvals = fdr_bh(pvals);
- findPvalCategories(pValues)[source]¶
This function inputs a list of p values and outputs the variables pValueGroups and colourGroups. A maximum of three groups are defined, FDR<0.05, P<0.05, and P>0.05.
- USAGE
[groupsToPlot, colours] = findPvalCategories (pValues)
INPUT pValues
OUTPUTS groupsToPlot Table that categorises the indices of the p-value list. colours Cell array with numerical encodings for the
category-associated colours.
- generateStackedBarPlot_PhylumMARScoverage(input_relAbundances_preMapping, input_relAbundances_postMapping, saveDir, varargin)[source]¶
Generates stacked bar plots from mean relative abundances of phyla pre- & post-mapping to a model-database in MARS.
- INPUTS
input_relAbundances_preMapping – [chars/string] Path to table which contains phyla taxa and their mean relative abundances across samples pre-mapped to a model-database (original dataset). This spreadsheet is a standard output generated in MARS.
input_relAbundances_postMapping – [chars/string] Path to table which contains phyla taxa and their mean relative abundances across samples post-mapped to a model-database (filtered dataset). This spreadsheet is a standard output generated in MARS.
saveDir – [chars/string] Path to the directory where the stacked bar plot should be saved.
mappingDatabase_name – [chars/string] The name of the model-database used for mapping in MARS, which will be displayed in the title of the graph. Defaults to ‘’.
- AUTHOR:
Jonas Widder, 12/2024 & 01/2025
- initPersephone(resultPath, paths)[source]¶
- Initializes the Persephone pipeline by:
Checking for MATLAB toolbox dependencies.
Processing and validating the metadata file.
Cross-referencing metadata with microbiome data (if provided).
Setting up directory structures for Persephone results.
- USAGE
[initialised, paths, statToolboxInstalled] = initPersephone (resultPath, metadataPath, readsTablePath)
- INPUTS
resultPath - (char)
metadataPath - (char) Path to the metadata file (.csv or .xlsx)
readsTablePath - (char)
- OUTPUTS
initialised - (logical)
paths - (struct)
statToolboxInstalled - (logical)
updatedMetadataPath - (char)
- REQUIREMENTS:
MATLAB R2020b or newer.
Parallel Computing Toolbox (mandatory).
Statistics and Machine Learning Toolbox (recommended for analysis).
Notes
Metadata must contain a column of sample IDs and a column of sex information.
Metadata and reads table will be synchronized to ensure consistent samples across analyses.
Any discrepancies in variable naming (e.g., “ID”, “Sex”) are automatically resolved.
Processed metadata is saved to an updated file.
Example
resultPath = ‘./results’; metadataPath = ‘./data/metadata.csv’; readsTablePath = ‘./data/readsTable.csv’; [initialised, paths, statToolboxInstalled] = initPersephone(resultPath, metadataPath, readsTablePath);
- AUTHOR:
Tim Hensen, January 2025
Bram Nap, July 2024: Automatic creation of output folders
- loadMinimalWBM(modPath)[source]¶
This function loads the smallest possible combination of WBM model fields needed to perform FBA. Loading the minimal model can decrease loading times by 6X.
Author: Tim Hensen, November 2024
- performRegressions(data, metadata, formula, exponentiateLogOdds)[source]¶
This functions performs linear or logistic regressions on either flux data or microbial relative abundances. The function supports control variables & moderators in the regression formulae.
- USAGE
results = performRegressions (data,metadata,formula)
INPUT data: Table with flux or microbiome data. The first column must be the ID
column.
metadata: Metadata table. The first column must be the ID column. formula: Must contain either “Flux or “relative_abundance” as predictor. exponentiateLogOdds: Logical variable on if a logistic regression estimate should be exponentiately to obtain the odds ratios.
- performStatsPersephone(statPath, pathToProcessedFluxes, metadataPath, response, varargin)[source]¶
The performStatistics function is part of the PSCM toolbox. This function performs regression analyses on processed fluxes and gut microbiome relative abundance data in large cohort studies. This function needs a minimal sample size of 50 samples. This function performs different statistical analyses depending on the user defined inputs.
If the response variable is binary and no confounders are given, wilcoxon tests will be performed for the fluxes and microbiome abundances. If the response variable is binary and confounders are given, multiple logistic regressions will be performed. If the response variable is continuous and no confounders are given, simple linear regressions will be performed. If the response variable is continuous and confounders are given, multiple linear regressions will be performed.
The fluxes are z-transformed before statistical testing. The microbiome relative abundances are log transformed before testing. If a moderator variable is given and a regression is performed, a moderation analysis on the moderator with the predictor will be performed.
- USAGE
performStatistics (statPath, pathToProcessedFluxes, pathToWbmRelAbundances, metadataPath, response, confounders, moderator, microbeCutoff,threshold)
INPUT statPath Path (character array) to working directory pathToProcessedFluxes Path to processed flux data pathToWbmRelAbundances Path to microbial relative abundances metadataPath Path to metadata file response Name of response variables (char or string)
OPTIONAL INPUTS confounders Cell array with the names of confounding variables to
be included. Default = empty.
- microbeCutoff Cutoff threshold for the number of samples a
microbe needs to present to be analysed. Default = 0.1. (10%)
- persWBM(metadataPath, varargin)[source]¶
This function takes a table of physiological parameters for an individual or multiple individuals (listed in inputs) and adjusts the paramteres of a provided WBM or Harvey/Harvetta to create a personalised WBM
The calculation of physiological parameteres will be performed based on the available data e.g., if the user provides Cardiac output, this given value is assigned, if the user provides stroked volume and no CO, CO is calculated based on SV. If neither values are provided, CO will be calculated based on XXX. All details of parameter calculation are detailed in the personalised model output and in the excel file provided by this function.
INPUTS
REQUIRED: metadata Can be a cell array or a table or a path to an
excel file. In any case, this should contain a list with a minimum of three column and option for additional columns for each individual for which a model should be created:
- persWBMmetabolomics(sex, metabolomicParams, varargin)[source]¶
This function takes a table of metabolomic parameters for an individual or multiple individuals (listed in inputs) and adjusts the paramteres of a provided WBM or Harvey/Harvetta to create a personalised WBM
The calculation of metabolomic parameteres will be performed based on the available data
INPUTS
REQUIRED: metabolomicParamters Can be a cell array or a table or a path to an
excel file. In any case, this should contain a list with a minimum of three columns and option for additional columns for each individual for which a model should be created:
|”compartment” |”blood” |”urine” | |”unit” |”mg/dL” | “mg/dL” | |”Individual1” |180 |50 | |”Individual2” |150 |66 |
Sex must be provided for each model. The compartment must be specified and must be one of the following: csf, u, blParameters that can be personalised include: All of those metabolites which are found in the *for each the default unit used by the model is provided. For x, y, z- unit conversion will be performed for all known common units of measurement.
OPTIONAL: WBM/(s) User can provide a WBM (Whole Body Metabolic Model) or a path to multiple models.
If no model is provided, either Harvey or Harvetta will be loaded from the COBRA toolbox, depending on sex in provided physiological data. If multiple models are provided, the model ID must match the model name provided in the physiological data.
- resPath Path on which to store personalised model and
other outputs. Default = current directory
- Diet Diet in the form of text file or named .mat
file from the COBRA toolbox (default = EUAverageDiet) ** This function can also be run in isolation but if a user wants to personalise both physiological and metabolomic- this function should alway be run and NOT persWBMmetabolomics in isolation!
OUTPUTS
- iWBM Model with updated physiological paramteres
(stored as “persModelName.mat”). All updated paramteres are described in model.IndividualisedParameters
- controlWBM This is a WBM with no personalised
adjustments. When a WBM or multiple WBMs have been given as inputs, the controlWBMs are exact copies of those. When Harvey or Harvetta are used, controlWBM is copies of Harvey/Harvetta.
- persParameters Excel file with details of the updated
parameter and how it was calculated
author: Anna Sheehy November 2024 Step One: Read in the available data, check all data is valid Define the input parser
- readMetadataForPersephone(metadataPath)[source]¶
- DESCRIPTION:
This function loads a metadata file into a MATLAB table. It accounts for: - Preserving the full variable names by disabling automatic truncation. - Capturing variable units if provided in the second row of the file. - Storing the captured units in the ‘VariableUnits’ property of the table.
- USAGE
metadata = readMetadataForPersephone (metadataPath)
- INPUTS
metadataPath – A string specifying the path to the metadata file (in CSV or XLSX format).
- OUTPUTS
metadata – A MATLAB table containing the processed metadata. Variable units, if present in the second row of the file, are stored in the table’s ‘VariableUnits’ property.
Notes
If variable names exceed 64 characters, they are truncated, but the true variable names are preserved in the ‘VariableDescriptions’ property.
Warnings related to variable name truncation are suppressed to avoid unnecessary console output.
AUTHORS
Tim Hensen, January 2025
- runMars(pythonPath, marsRepoPath, cutoffMARS, outputExtensionMARS, flagLoneSpecies, taxaSplit, removeCladeExtensionsFromTaxa, whichModelDatabase, userDatabase_path, sample_read_counts_cutoff, readsTablePath, taxaTable, outputPathMARS)[source]¶
This function integrates MATLAB with the MARS Python repository to process microbiome taxonomy and read abundance data. The MARS pipeline maps joined microbiome data to a metabolic model database for downstream analysis.
- INPUTS
pythonPath char with path to the Python executable (e.g., Anaconda installation)
marsRepoPath char with path to the directory containing the MARS pipeline repository.
cutoffMARS double with relative abundance cutoff threshold below which relative abundances – are treated as zero. Defaul is 1e-6.
outputExtensionMARS char with desired file format for output files (e.g., ‘csv’, ‘txt’)
flagLoneSpecies logical that indicates if genus name is omitted in species names – (e.g., “copri” instead of “Prevotella copri”).
taxaSplit char with delimiter used to separate taxonomic levels.
removeCladeExtensionsFromTaxa Logical which specifies whether clade name extensions should be removed – from microbiome taxa for compatibility with AGORA2 models.
whichModelDatabase char that specifies the metabolic model database to use. – Options: ‘AGORA2’, ‘APOLLO’, ‘full_db’, ‘user_db’ (requires ‘userDatabase_path’).
userDatabase_path char with path to a user-defined model database file, required only if ‘whichModelDatabase’ is set to ‘user_db’.
sample_read_counts_cutoff double variable. Threshold for read counts below which samples are excluded. – Only applies if ‘readsTablePath’ contains absolute read counts.
readsTablePath char with path to the file containing microbiome taxonomy and read abundance data.
taxaTable char with path to the file containing taxonomic assignments. – Required only if taxonomy is not included in ‘readsTablePath’.
- OUTPUTS
The function does not return variables but writes processed results
to the specified output directory in the MARS pipeline.
- DEPENDENCIES:
Python and the MARS repository must be installed and accessible.
COBRA Toolbox for MATLAB must be properly set up if further integration with COBRA models is required.
Notes
Ensure ‘pythonPath’ is set to the correct Python executable version and environment containing the required MARS dependencies.
If ‘readsTablePath’ already includes taxonomy assignments, ‘taxaTable’ is optional.
- AUTHOR: Tim Hensen, January 2025
modified by Jonas Widder, January 2025 (added generateStackedBarPlot_PhylumMARScoverage.m)
- runNonparametricTests(data, metadata, predictor, response)[source]¶
Performs non-parametric statistical tests (Wilcoxon or Kruskal-Wallis) on metabolic data
- USAGE
resultTable = runNonparametricTests (data, metadata, predictor, response)
- INPUTS
data – (table) m x n table containing: * m samples * n reactions/taxa with flux or abundance values * First column must contain sample IDs
metadata – (table) Sample metadata containing: * Sample IDs in first column * Response variable in separate column
predictor – (char) Name of predictor variable (‘Flux’ or ‘relative_abundance’)
response – (char) Name of response variable in metadata
- OUTPUTS
resultTable – (table) Statistical results with columns: * name: Reaction/Taxa identifier * groups: Cell array of group names * n: Array of group sizes * statistic: Test statistic (Wilcoxon or Kruskal-Wallis) * p-value: Unadjusted p-value * FDR: Benjamini-Hochberg adjusted p-value * effectSize: Effect size (r for Wilcoxon, η² for Kruskal-Wallis) * confidence: 95% confidence interval (only for binary comparisons)
Example
% Binary comparison (Wilcoxon test) resultTable = runNonparametricTests(fluxData, metadata, ‘Flux’, ‘Disease’)
% Multiple group comparison (Kruskal-Wallis test) resultTable = runNonparametricTests(abundanceData, metadata, ‘relative_abundance’, ‘Treatment’)
Note
Automatically selects appropriate test based on number of groups
Data is automatically normalized before testing
Missing values (NaN) are removed before analysis
Multiple testing correction uses Benjamini-Hochberg FDR
Minimum group size of 3 samples is required for valid testing
- runPersephone(configPath)[source]¶
This pipeline runs throught the entire process of mapping taxonomy assigned reads to AGORA2/APOLLO models to create relative abundance files through the MARS pipeline. The relative panSpecies abundance file is then used to create microbiome models through mgpipe (Microbiome Modelling Toolbox PMID:xx). Before HM models are created there is an option to personalise the WBMs with physiological or metabolomic data through the xx function. The HM creation can then either connect the microbiomes with base-WBMs based on the sex associated with the sample ID in the metadata file or connect it with a personalised WBM. Based on a list of reactions, the HM models will be solved and the flux results will be processed and various analyses such as shadow price and regression are performed. Then based on the make-up of the study used to run this pipeline, mixed-effect linear regression modelling can be used for various response variables while taking into account the effect of predictors. All response and predictor variables need to be present in the metadata file. By default the code will do perform two runs where in one flux data is added as an additional predictor and in the other the relative abundance of panSpecies models. Finally the results are visualised in various plots. The user can indicate which parts of the pipeline should be run with various “flags” described in the input variables below. If the code or matlab stops unexpectedly, parts of pipeline that are already performed will be skipped. If the entire pipeline should be rerun - delete the progress.mat file. If only certain parts of the pipeline should be rerun, open the progress.mat file and alter change the value of the relevant flags to false or 0.
Most of the inputs are optional as the pipeline is designed that parts of the analysis could be skipped. This however means that some of the optional variables are actually required variables in order to run a certain part of the pipeline. Behind the input description indication if the variable is required is given.
- Usage
fullPipeline()
Inputs: :Required inputs: * resultPath – A string with the path to the directory where all
results should be stored
metadataPath – A string with the path to the metadata file.
- Optional inputs
solver – String, indicates which solver is to be used for solving WBMs. OPTIONAL, defaults to GLPK. It is highly recommended to use gurobi or cplex
- SeqC:
- paths.seqC.flagSeqC: Logical variable indicating if bioinformatic processing
of sequencing data is performed by SeqC.
- paths.seqC.repoPathSeqC: REQUIRED. String specifying the directory where the
SeqC repository is located.
- paths.seqC.outputPathSeqC: String specifying the folder where the final output
of SeqC is stored.
- paths.seqC.fileIDSeqC: REQUIRED. String specifying the file name containing
sample IDs for FASTQ files (e.g., sample_id.txt).
- paths.seqC.procKeepSeqC: Logical variable indicating if intermediary outputs are
retained (e.g., post-QC FASTQ). If false, only the final MARS output is kept, and all intermediary content is deleted once SeqC completes.
- paths.seqC.maxMemSeqC: Numeric value specifying the maximum amount of memory
allowed in gigabytes.
- paths.seqC.maxCpuSeqC: Numeric value specifying the maximum number of CPU threads
allowed for processing.
- paths.seqC.maxProcSeqC: Numeric value specifying the maximum number of parallel
processes allowed.
- paths.seqC.debugSeqC: Logical variable indicating if additional debugging
messages should be included in the log.
- MARS:
- paths.Mars.flagMars: Boolean, indicates if Part 2 of the pipeline
MARS should be run. OPTIONAL, Defaults to true
- marsRepoPath: String to the directory where the mars-pipeline
repository is stored. REQUIRED
- pythonPath: String to the python.exe file. How to obtain
the path is described in .txt. REQUIRED
- OTUTable: String to the file where OTUs are matched to
taxonomic assignments. OPTIONAL if the taxonomic assignments are already in the readsTablePath. REQUIRED if not.
- readsTablePath: String to the file where the reads are assigned
to either OTUs or taxonomic assignments. REQUIRED
- outputPathMARS: String to the directory where MARS should store
the results. OPTIONAL, defaults to [resultPath, filesep, ‘ResultMARS’].
- cutoff: Numeric value under which relative abundances
are considered 0. OPTIONAL, defaults to 1e-6.
- outputExtensionMARS: String with the desired file extension of the
saved outputs. OPTIONAL, defaults to ‘csv’.
- flagLoneSpecies: A boolean to indicate if the genus name is in
the name of the species e.g. Prevotella copri. If genus name is in species name set to false. Otherwise set to true. OPTIONAL, defaults to false.
- taxaSplit: String with the delimiter used to separate
taxonomic levels. OPTIONAL, defaults to ‘; ‘.
- MARS analysis
- marsStats: Boolean indicates if Part 2.5 of the pipeline
MARS descriptive statistics should be run. OPTIONAL, defaults to true
- microbiomePath: String with the path to the file where the
reads and taxonomic assignments are combined. Could be the same file as readsTablePath. OPTIONAL, default to [Ask TIMHuls to Create].
- relAbunFilePath: String with the path to the relative abundance
species file, an output from MARS. OPTIONAL, defaults to [outputPathMARS, filesep, ‘present’, filesep, ‘present_species.’, outputExtensionMARS] if MARS is run with this pipeline. If run online, the path needs to be specified explicitly.
- analysisDirMARS: Strin with the path where the descriptive
statistics of the MARS output should be stored. OPTIONAL, default to [outputPathMARS, filesep, ‘Analysis’].
- MgPipe:
- paths.mgPipe.flagMgPipe: Boolean, indicates if Part 3 of the pipeline:
MgPipe/microbiome model creation should be run. Defaults to true. OPTIONAL, defaults to true
- panModelsPath: String with path to the directory with the
panSpecies models. REQUIRED
- relAbunFilePath: String with the path to the relative abundance
species file, an output from MARS. OPTIONAL, defaults to [outputPathMARS, filesep, ‘present’, filesep, ‘present_species.’, outputExtensionMARS] if MARS is run with this pipeline. If run online, the path needs to be specified explicitly.
- computeProfiles: Boolean, indicates of netSecretion and
netUptake are calculated for the microbiome models via fastFVA. OPTIONAL, defaults to false
- numWorkersCreation: Numeric value, amount of workers that are to be
used to create microbiome models. OPTIONAL, default to use all available cores.
- mgpipeDir: String with the path to where the results of
MgPipe should be saved. OPTIONAL, defaults to [resultPath, filesep, ‘ResultMgPipe’];
- WBM Personalisation:
- paths.persWBM.flagPersonalizeBoolean, indicates if Part 4 of the pipeline:
WBM personalisation should be run. OPTIONAL, defaults to false.
- personalisedWBMDir: String with the path to the location where the
personalised WBM are stored. OPTIONAL, defaults to [resultPath, filesep, ‘personalisedWBMs’].
- diet: String or nx2 character array defining the diet
to constrain WBM models with. OPTIONAL, defaults to EUAverageDiet.
- HM Creation:
- paths.mWBM.flagMWBMCreation: Boolean, indicates if Part 5 of the pipeline:
HM creation and descriptive statistics should be run. OPTIONAL, defaults to true
- mgpipeDir: String with the path to the location where the
community microbiome models generated from MgPipe are stored. OPTIONAL, if this pipeline is used to run MgPipe it defaults to [mgpipeDir, filesep, ‘Diet’]. If MgPipe was used outside of this function state the path explicitly.
- hmDir: The path where the HM models will be stored.
OPTIONAL, defaults to [resultPath, filesep, ‘HMmodels’].
- diet: String or nx2 character array defining the diet
to constrain HM models with. OPTIONAL, defaults to EUAverageDiet.
- numWorkersCreation: Numeric value, amount of workers that are to be
used to create HM WBMs. OPTIONAL, default to use all available cores.
- Flux balance analysis:
- paths.fba.flagFBA: Boolean, indicates if Part 6 of the pipeline:
FBA should be run. OPTIONAL, defaults to true
- hmDir: String with the path to the directory with the
models to use. OPTIONAL, if HM models were made it defaults to hmDirectory used in part 4. If HM models were created outside this function, the path has to be explicitly stated.
- fluxDir: String with the path where the flux results
should be stored. OPTIONAL, defaults to [resultPath, filesep, ‘fluxResults’].
- rxnList: Character array contain reaction IDs for
reactions that are to be optimised. If the user wants to solve non-existing demand reactions, simply add DM_ infront of the desired metabolite and add to the rxnList variable. REQUIRED.
- mgpipeDir: String with the path to the relative abundance
species file, an output from MARS. OPTIONAL, defaults to [outputPathMARS, filesep, ‘present’, filesep, ‘present_species.’, outputExtensionMARS] if MARS is run with this pipeline. If run online, the path needs to be specified explicitly.
- thresholds: 1x3 numerical vector. OPTIONAL, defaults to
[90, 90, 0.999]
thresholds(1): “metabolite removal threshold”
This threshold indicates the maximum allowable number of duplicate flux results between samples expressed in the percentage of total samples. Reactions that exceed this threshold will be removed.
thresholds(2): “Microbe presence threshold”
This threshold indicates the maximum allowable number of samples where a microbe is absent expressed by the percentage of total samples. Microbes that exceed this threshold will be removed from the analysis.
thresholds(3): “Reaction grouping threshold”
This threshold indicates the minimal pairwise Spearman correlation between fluxes across the population where reactions are grouped and handled as one result. ADD TEXT 4th input for std dev.
- numWorkersOptmisation: Numeric value, amount of workers that are to be
used to solve the WBM models. OPTIONAL, default to 1.
- saveFullRes: Boolean, indicates if the complete .v, .y., and
.w vectors are stored in the result. OPTIONAL, defaults to true. It is recommended to set saveFullRes.
- fluxAnlsysDir: String with the path where the analyses of the
flux results are stored. OPTIONAL, defaults to [fluxDir, filesep, ‘fluxAnalysis’]
- Statistical analysis:
- paths.stats.flagStatistics: Boolean, indicates if Part 7 of the pipeline:
statistical analysis should be run. OPTIONAL, defaults to true
- statDir: String with the path where the results of the
statistical analyses and the figures should be saved. OPTIONAL, defaults to [resultPath, filesep, ‘statisticsResults’].
- fluxDir: String with the path to the location of the
analysed flux results. OPTIONAL, defaults to [resultPath, filesep, ‘fluxResults’, filesep, ‘fluxAnalysis’].
- microbiomePath: String with the path to the relative abundance
panSpecies file generated by MARS. OPTIONAL, defaults to [resultPath, filesep, resultMars, filesep, present,filesep, ‘present_species.csv’]
- response: A character or string array with variables
found in the metdata file that the user wants to use as response for statistical analysis. REQUIRED
- confounders: A character or string sarray with variables
found in the metadata file that are used as confounders in the statistical analyses. OPTIONAL, defaults to an empty cell array.
- moderationAnalysis: Boolean that indicates if a moderation analysis
will be performed. A moderation analysis can only be performed if confounders. OPTIOANL, defaults to false.
- microbeCutoff: Numeric value, threshold for the number of
samples a microbe needs to present in to be to be analysed. The same variable as thresholds(2) OPTIONAL, defaults to 0.1
- threshold: Numeric value used to find the correct instance
of the ‘processed_fluxes_Thr_x_xxx file, where the threshold value is x_xxx. The same value as thresholds (3). OPTIONAL, defaults to 0.999.
Outputs:
- Example (minimal required input to run all sections):
fullPipeline(resultPath,metadataPath,’marsRepoPath’, marsRepoPath,… ‘pythonPath’, pythonPath, ‘OTUTable’, OTUTable, ‘readsTablePath’ readsTablePath,… ‘panModelsPath’, panModelsPath, ‘rxnList’, rxnList, ‘response’, response)
Example (MARS is run online and is skipped in the this pipeline):
- runSeqC(repoPathSeqC, outputPathSeqC, fileIDSeqC, procKeepSeqC, maxMemSeqC, maxCpuSeqC, maxProcSeqC, debugSeqC, readsTablePath, outputPathMARS, outputExtensionMARS, relAbunFilePath, sample_read_counts_cutoff, cutoffMARS, OTUTable, flagLoneSpecies, taxaSplit, removeCladeExtensionsFromTaxa, whichModelDatabase, userDatabase_path, taxaTable)[source]¶
- ======================================================================================================#
Title: SeqC as Flux Pipeline MATLAB Wrapper Author: Wiley Barton Modified code sources:
matlab script structure: Tim Hensen, runMars.m, 2025.01 assistance and reference from a generative AI model [ChatGPT](https://chatgpt.com/)
clean-up and improved readability
Last Modified: 2025.02.09 Part of: Persephone Pipeline
- Description:
This function builds and runs the SeqC Docker image from a MATLAB environment. It ensures necessary databases exist and coordinates execution with the MARS pipeline.
- Inputs:
repoPathSeqC (char) : Path to the SeqC repository
outputPathSeqC (char) : Path for SeqC output
fileIDSeqC (char) : Unique identifier for file processing
procKeepSeqC (logical) : Keep all files (true/false)
maxMemSeqC (int) : Maximum memory allocation for SeqC
maxCpuSeqC (int) : Maximum CPU allocation for SeqC
maxProcSeqC (int) : Maximum processes for SeqC
debugSeqC (logical) : Enable debug mode (true/false)
…
- Dependencies:
MATLAB
Docker installed and accessible in the system path
- ======================================================================================================#
Determine Operating System
- validatePersephoneInputs(paths, resultPath)[source]¶
Validates the inputs provided in the paths structure for the Persephone pipeline. This function ensures that all required fields, flags, and parameters are present, appropriately formatted, and meet the expected constraints. Validation is performed across multiple pipeline components, including MARS, mgPipe, WBM personalization, HM creation, FBA, and statistics.
- USAGE
validated = validatePersephoneInput (paths)
- INPUT
paths Structure containing all required input fields for the Persephone – pipeline. Fields include: - General flags and paths (e.g., flagSeqC, resultPath) - Component-specific flags and parameters (e.g., flagMars, marsRepoPath, flagMgPipe, metadataPath, etc.)
- OUTPUT
validated Logical scalar indicating whether the validation succeeded – (true) or failed (false).
Notes
Ensure all required flags and fields in the paths structure are defined prior to invoking this function. Undefined or mismatched fields will result in validation failure.
This function is designed to be modular and extendable for future additions to the Persephone pipeline.
- AUTHOR:
Tim Hensen, January 2025
See also: validateattributes, validatestring, COBRA Toolbox