Persephone¶

analyseWBMs(mWBMPath, fluxPath, rxnList, varargin)[source]¶

analyseWBMs predicts the optimal fluxes for a list of user-defined reactions (rxnList). All predicted are further described in analyseWBMsol.m.

USAGE:: [FBA_results, pathsToFilesForStatistics] = analyseWBMs (mWBMPath, fluxPath, rxnList)

INPUTS: mWBMPath Path (character array) to WBMs fluxPath Path to directory where the results will be stored rxnList Cell array of VMH metabolites to investigate.

Example: rxnList = {‘DM_trp_L[bc], DM_met_L[bc],’Brain_trp_L[c], Heart_met_L[x]}. Note that demand reactions are automatically added if they are not present in the models.

OPTIONAL INPUTS rxnSense Character array containing either ‘max’ or ‘min’

to specify the sense of the objective. Option to specify differently for each objective- to do so provide character array the exact length of rxnList. (OPTIONAL, Default = ‘max’).

numWorkersOptimization Number of workers that will perform FBA in parallel. Note: that more workers does not necessarily make the function faster. It is generally not recommended to set numWorkersOptimization equal to the number of available cores (see: feature(‘numCores’)) as linear solvers can already support multi-core linear optimisation, thus resulting in unnessessary overhead. On computers with 8 cores or less, it is recommended to set numWorkersOptimization to 1. On a computer with 36 cores, an optimal configuration of numWorkersOptimization=6 was found.
saveFullRes Boolean (true/false) indicating if all the complete .v, .y.: , and .w vectors are stored in the result. Default = true. It is recommended to set saveFullRes

paramFluxProcessing Structured array with optional parameters:

.NumericalRounding defines how much the predicted flux values are rounded. A defined value of 1e-6 means that a flux value of 2 + 2.3e-8 is rounded to 2. A flux value of 0 + 1e-15 would be rounded to exactly zero. This rounding factor will also be applied to the shadow price values. If microbiome relative abundance data is provided, the same rounding factor will be applied to the relative abundance data.

Default parameterisation: - paramFluxProcessing.NumericalRounding = 1e-6;

Example: - paramFluxProcessing.NumericalRounding = 1e-6;

paramFluxProcessing.NumericalRounding = 1e-6;

.RxnRemovalCutoff defines the minimal number of samples for which a unique reaction flux could be obtained, before removing the reaction for further analysis. This parameter can be expressed as * fraction: the fraction of samples with unique values, * SD: the standard deviation across samples, and * count: the counted number of unique values. If microbiome relative abundance data is provided, the same removal cutoff factor will be applied to the relative abundance data.

Default parameterisation: - paramFluxProcessing.RxnRemovalCutoff = {‘fraction’,0.1};

Examples: - paramFluxProcessing.RxnRemovalCutoff = {‘fraction’,0.1}; - paramFluxProcessing.RxnRemovalCutoff = {‘SD’,1}; - paramFluxProcessing.RxnRemovalCutoff = {‘count’,30};

paramFluxProcessing.RxnRemovalCutoff = {‘fraction’,0.1};

.RxnEquivalenceThreshold defines the minimal threshold of when functionally identical flux values are predicted, and are thus part of the same linear pathways. The threshold for functional equivalence is expressed as the R2 (r-squared) value after performing a simple linear regression between two reactions.

Default parameterisation: - paramFluxProcessing.RxnEquivalenceThreshold = 0.999;

Example: - paramFluxProcessing.RxnEquivalenceThreshold = 0.999;

paramFluxProcessing.RxnEquivalenceThreshold = 0.999;

.fluxMicrobeCorrelationType defines the method for correlating the predicted fluxes with microbial relative abundances. Note that this metric is not used if mWBMs are not present. The available correlation types are: * regression_r2: the R2 (r-squared) value from pairwised linear regression on the predicted fluxes against microbial relative abundances. * spearman_rho: the correlation coefficient, rho obtained from pairwise Spearman nonparametric correlations between predicted fluxes and microbial relative abundances.

Default parameterisation: - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’;

Examples: - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’; - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘spearman_rho’;

fluxAnalysisPath: Character array with path to directory where all: results will be saved (Default = pwd)
Solver: Validated solvers: ‘cplex_direct’,’ibm_cplex’: ‘tomlab_cplex’, ‘gurobi’, ‘mosek’

OUTPUT results Structured array with FBA fluxes, solver statistics, and

paths to the flux table and raw flux results

analyseWBMsol(fluxPath, paramFluxProcessing, fluxAnalysisPath)[source]¶

analyseWBMsol loads the FBA solutions produced in analyseWBMs.m, prepares the FBA results for further analyses, and produces summary statistics into the flux results.

The function contains six parts:

PART 1: Loading the flux solutions and corresponding shadow prices
PART 2: Converting the microbial biomass shadow prices

associated with each optimised reaction flux to human readable tables and producing statistics on microbial influences on the flux results. - PART 3: Isolate the microbial component of the fluxes by subtracting the germ-free fluxes in a sex specific manner. In this part, descriptive statistics are also obtained on the fluxes across samples and metabolites are removed based on a user defined threshold (see thresholds input) - PART 4: In this part, microbe-flux associations defined by the species biomass shadow prices (PART 2) are further quantified by performing Spearman correlations on the fluxes and microbe relative abundances. - PART 5: Reactions with identical or near identical fluxes across samples (user defined) are grouped and collapsed into a single result. - PART 6: Save all results

USAGE:: [FBA_results, pathsToFilesForStatistics] = analyseWBMsol (fluxPath,paramFluxProcessing, fluxAnalysisPath)

INPUTS: fluxPath Character array with path to .mat files produced in

optimiseRxnMultipleWBM.m

paramFluxProcessing Structured array with optional parameters:

.numericalRounding defines how much the predicted flux values are rounded. A defined value of 1e-6 means that a flux value of 2 + 2.3e-8 is rounded to 2. A flux value of 0 + 1e-15 would be rounded to exactly zero. This rounding factor will also be applied to the shadow price values. If microbiome relative abundance data is provided, the same rounding factor will be applied to the relative abundance data.

Default parameterisation: - paramFluxProcessing.numericalRounding = 1e-6;

Example: - paramFluxProcessing.numericalRounding = 1e-6;

paramFluxProcessing.numericalRounding = 1e-6;

.rxnRemovalCutoff defines the minimal number of samples for which a unique reaction flux could be obtained, before removing the reaction for further analysis. This parameter can be expressed as * fraction: the fraction of samples with unique values, * SD: the standard deviation across samples, and * count: the counted number of unique values. If microbiome relative abundance data is provided, the same removal cutoff factor will be applied to the relative abundance data.

Default parameterisation: - paramFluxProcessing.rxnRemovalCutoff = {‘fraction’,0.1};

Examples: - paramFluxProcessing.rxnRemovalCutoff = {‘fraction’,0.1}; - paramFluxProcessing.rxnRemovalCutoff = {‘SD’,1}; - paramFluxProcessing.rxnRemovalCutoff = {‘count’,30};

paramFluxProcessing.rxnRemovalCutoff = {‘fraction’,0.1}; RxnEquivalenceThreshold .rxnEquivalenceThreshold defines the minimal threshold of when functionally identical flux values are predicted, and are thus part of the same linear pathways. The threshold for functional equivalence is expressed as the R2 (r-squared) value after performing a simple linear regression between two reactions.

Default parameterisation: - paramFluxProcessing.rxnEquivalenceThreshold = 0.999;

Example: - paramFluxProcessing.rxnEquivalenceThreshold = 0.999;

paramFluxProcessing.rxnEquivalenceThreshold = 0.999;

.fluxMicrobeCorrelationMetric defines the method for correlating the predicted fluxes with microbial relative abundances. Note that this metric is not used if mWBMs are not present. The available correlation types are: * regression_r2: the R2 (r-squared) value from pairwised linear regression on the predicted fluxes against microbial relative abundances. * spearman_rho: the correlation coefficient, rho obtained from pairwise Spearman nonparametric correlations between predicted fluxes and microbial relative abundances.

Default parameterisation: - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’;

Examples: - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘regression_r2’; - paramFluxProcessing.fluxMicrobeCorrelationMetric = ‘spearman_rho’;

fluxAnalysisPath

Character array with path to directory where all results will be saved.

AUTHOR:

Tim Hensen, July 2024
Jonas Widder, 11/2024 (small bugfixes)

createBatchMWBM(mgpipePath, mWBMPath, metadataPath, varargin)[source]¶

This function creates personalised host-microbiome WBM models (mWBMs) by joining microbiome community models with unpersonalised WBM models in a sex-specific manner. The models are parameterised on a predefined diet.

USAGE:: [modelStats, summaryStats, dietInfo, dietGrowthStats] = createBatchMWBM (mgpipePath, mWBMPath, metadataPath)

INPUTS mgpipePath: Path to microbiome community models created by the

microbiome modelling toolbox.

mWBMPath Path to directory where the HM models are: saved

OPTIONAL INPUTS Diet Diet option: ‘EUAverageDiet’ (default) numWorkersCreation Number of cores used for parallelisation.

Default = 4.

checkFeasibility Flag (true/false) to run the: ensureHMfeasibility.m function and check if the generated models can grow on the diet. Default = true.
wbmDirectory Path to directory with user defined WBMs.: This function supports one user adapted male and one user adapted female WBM or personalised germ free WBMs for each sample with a microbiome community model. Default is empty. If wbmDirectory is empty, Harvey/Harvetta version 1.04c are used.

OUTPUTS modelStats Table with summary statistics on the generated WBMs:

gender, number of reactions, metabolites,constrainsts, and taxa.

summaryStats Table with the mean and SD of the model: statistics in the modelStats variable.
dietInfo Table with diet growth information.: models can grow on the given diet.

dietGrowthStats Table with statistics on

Authors: Tim Hensen, 2023, 2024

createForestPlot(estimates, ci, names, pValues, plotTitle, xTitle, hideLegend)[source]¶

createForestPlot generates a forest plot to display confidence intervals for estimates.

USAGE:

createForestPlot (estimates, ci, names, pValues, plotTitle, xTitle, hideLegend)

INPUTS:

estimates - Vector of estimates (e.g., effect sizes or log fold changes)
ci - Matrix of confidence intervals [n x 2] with lower and upper bounds in columns.
names - Cell array of labels for each data point.
pValues - Vector of p-values corresponding to each estimate.
plotTitle - String, title of the plot.
xTitle - String, label for the x-axis (typically “Effect Size” or “Log Fold Change”)
hideLegend - Logical, if true, hides the legend (default is false)

createMWBM(microbiota_model, WBM_model, Diet, saveDir)[source]¶

This function creates personalised host-microbiome WBM models by joining microbiome community models with unpersonalised WBM models. The models are parameterised on a predefined diet.

Example: modelHM = createMWBM(microbiota_model, WBM_model, ‘EUAverageDiet’)

INPUTS microbiota model: Microbiome community model created by the

microbiome modelling toolbox.

WBM_model: Harvey or Harvetta whole-body metabolic: model

Diet Diet option: ‘EUAverageDiet’ (default)

Output modelHM: Personalised host-microbiome WBM model

Authors: Tim Hensen, Bronson Weston, 2022: Tim Hensen, expanded WBM parameterisation July 2024 Tim Hensen, improved mWBM annotation November 2024

createVolcanoPlot(estimates, pValues, names, plotTitle, xTitle, yTitle)[source]¶

createVolcanoPlot generates a volcano plot to visualize the relationship between regression estimates and p-values.

USAGE:

createVolcanoPlot (estimates, pValues, names, plotTitle, xTitle, yTitle)

INPUTS:

estimates - [vector] Regression estimates for each data point (e.g., effect sizes or log fold changes)
pValues - [vector] p-values corresponding to each estimate.
names - [cell array] Labels or names for each data point.
plotTitle - [string] Title of the plot.
xTitle - [string] Label for the x-axis (e.g., “Effect Size” or “Log Fold Change”)
yTitle - [string] Label for the y-axis (e.g., “-log10(p-value)”)

OUTPUTS:

None - This function generates a volcano plot in the current figure.

Example

createVolcanoPlot(estimates, pValues, {‘Metabolite1’, ‘Metabolite2’}, …: ‘Volcano Plot’, ‘Log Fold Change’, ‘-log10(p-value)’)

ensureWBMfeasibility(mWBMPath, varargin)[source]¶

This function checks for the WBMs in the specified mWBMPath to grow on the given diet, finds a diet that makes all WBMs feasible and propagates this diet onto the models. The functions works as follows: 1) Load the WBMs from the directory and test which models are feasible on the diet. 2) If any WBM is not feasible, collect all infeasible WBMs, open all diet reactions, and test if the WBMs can grow on any diet. If one or more WBMs cannot grow when opening all diet reactions, the statistics are collected and the function stops. The user will need to debug the WBMs manually. 3) If all previously infeasible WBMs can grow when all diet reactions are opened,missing diet components are searched using getMissingDietModelHM. If getMissingDietModelHM cannot find a diet after one function call, this function is run again until a hard limit of 10 iterations is reached. If this limit is reached, no feasible diet can be found for all models and the user will need to manually debug the infeasible models. If missing diet components are found in one WBMs, these component are automatically propagated to the other WBMs infeasible on the original diet before searching for more missing diet components 4) If a new diet is found for which all previously infeasible WBMs can grow, the updated diet is propagated to all models in the mWBMPath. 5) The updated diet is stored as an output variable for this function and for each model, its feasibility on the original and updated diet is recorded.

USAGE:: [dietInfo, dietGrowthStats] = ensureHMfeasibility (mWBMPath, Diet)

INPUTS mWBMPath Path to directory where the HM models are saved Diet Diet option: ‘EUAverageDiet’ (default)

OUTPUTS dietInfo Structured variable containing the missing diet

components and the updated diet propagated to all models. This variable is empty if no updated diet could be found.

dietGrowthStats Table indicating which WBMs could initially not grow on: the diet, which WBMs could not grow on any diet, and which WBMs could not grow on an updated diet.

Authors

Tim Hensen, 2024
Bram Nap, 07-2025 remove parfor functionality. Time save is not

massive and allows the function to be used for other purposes besides Persephone.

fdrBHadjustment(p)[source]¶

IMPORTANT; This function was partially generated by chatGPT o3-mini.: The code is manually tested and altered with bug fixes. This function performs identical to running q = mafdr(p, ‘BHFDR’, true) using the Matlab bioinformatics toolbox.

FDR_BH Performs Benjamini-Hochberg FDR correction on a vector of p-values.

q = FDR_BH(p) returns the FDR-corrected p-values q for the input vector p.

Input:
p - a vector of p-values (assumed to be in the interval [0, 1])

Output:
q - a vector of FDR corrected p-values, in the same order as p.

The function implements the procedure described in:
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing.

Example:
pvals = [0.01, 0.04, 0.03, 0.002, 0.07]; qvals = fdr_bh(pvals);

findPvalCategories(pValues)[source]¶

This function inputs a list of p values and outputs the variables pValueGroups and colourGroups. A maximum of three groups are defined, FDR<0.05, P<0.05, and P>0.05.

USAGE:: [groupsToPlot, colours] = findPvalCategories (pValues)

INPUT pValues

OUTPUTS groupsToPlot Table that categorises the indices of the p-value list. colours Cell array with numerical encodings for the

category-associated colours.

generateStackedBarPlot_PhylumMARScoverage(input_relAbundances_preMapping, input_relAbundances_postMapping, saveDir, varargin)[source]¶

Generates stacked bar plots from mean relative abundances of phyla pre- & post-mapping to a model-database in MARS.

INPUTS:

input_relAbundances_preMapping – [chars/string] Path to table which contains phyla taxa and their mean relative abundances across samples pre-mapped to a model-database (original dataset). This spreadsheet is a standard output generated in MARS.
input_relAbundances_postMapping – [chars/string] Path to table which contains phyla taxa and their mean relative abundances across samples post-mapped to a model-database (filtered dataset). This spreadsheet is a standard output generated in MARS.
saveDir – [chars/string] Path to the directory where the stacked bar plot should be saved.
mappingDatabase_name – [chars/string] The name of the model-database used for mapping in MARS, which will be displayed in the title of the graph. Defaults to ‘’.

AUTHOR:

Jonas Widder, 12/2024 & 01/2025

getMissingDietModelHM(modelHM, missingDietComponents, testInitialFeasibility)[source]¶

This function determines missing dietary compounds in host-microbiome WBMs that are infeasible (it assumes that the WBMs and the microbiome models individually are feasible). This function first tests whether the model is feasible (which can be skipped by setting testInitialFeasibility = 0), if infeasible, all diet exchange reactions that are not already active will be opened and the feasibility will be tested. If infeasible, the function stops - then there is no dietary solution (of course it could be that active diet constraints are limited but they will not be tested with this script). Subsequently, diet exchanges are randomly closed (first in batches of 50, then 10, then 5, then 1). If the modelHM remains feasible this batch as well as all other 0 diet fluxes in the fba solution will be deemed not necessary for feasibility, if infeasible the batch set will be kept and tested for in the next step. when the batch set size is 1, each remaining diet exchange will be tested for individually. While still being slow, this approach is much faster then testing each diet exchange individually. As the diet exchanges are selected randomly for each batch, running the function twice may not result in the same final set of missingDietComponents

INPUT modelHM Host-microbiome model structure missingDietComponents list of diet exchange reactions that are known

to be missing or that have been identified in a previous run of this function

testInitialFeasibility default 1

initPersephone(resultPath, paths)[source]¶

Initializes the Persephone pipeline by:

Checking for MATLAB toolbox dependencies.
Processing and validating the metadata file.
Cross-referencing metadata with microbiome data (if provided).
Setting up directory structures for Persephone results.

USAGE:

[initialised, paths, statToolboxInstalled] = initPersephone (resultPath, metadataPath, readsTablePath)

INPUTS:

resultPath - (char)
metadataPath - (char) Path to the metadata file (.csv or .xlsx)
readsTablePath - (char)

OUTPUTS:

initialised - (logical)
paths - (struct)
statToolboxInstalled - (logical)
updatedMetadataPath - (char)

REQUIREMENTS:

MATLAB R2020b or newer.
Parallel Computing Toolbox (mandatory).
Statistics and Machine Learning Toolbox (recommended for analysis).

Notes

Metadata must contain a column of sample IDs and a column of sex information.
Metadata and reads table will be synchronized to ensure consistent samples across analyses.
Any discrepancies in variable naming (e.g., “ID”, “Sex”) are automatically resolved.
Processed metadata is saved to an updated file.

Example

resultPath = ‘./results’; metadataPath = ‘./data/metadata.csv’; readsTablePath = ‘./data/readsTable.csv’; [initialised, paths, statToolboxInstalled] = initPersephone(resultPath, metadataPath, readsTablePath);

AUTHOR:

Tim Hensen, January 2025
Bram Nap, July 2024: Automatic creation of output folders

loadMinimalWBM(modPath)[source]¶

This function loads the smallest possible combination of WBM model fields needed to perform FBA. Loading the minimal model can decrease loading times by 6X.

Author: Tim Hensen, November 2024

performRegressions(data, metadata, formula, exponentiateLogOdds)[source]¶

This functions performs linear or logistic regressions on either flux data or microbial relative abundances. The function supports control variables & moderators in the regression formulae.

USAGE:: results = performRegressions (data,metadata,formula)

INPUT data: Table with flux or microbiome data. The first column must be the ID

column.

metadata: Metadata table. The first column must be the ID column. formula: Must contain either “Flux or “relative_abundance” as predictor. exponentiateLogOdds: Logical variable on if a logistic regression estimate should be exponentiately to obtain the odds ratios.

performStatsPersephone(statPath, pathToProcessedFluxes, metadataPath, response, varargin)[source]¶

The performStatistics function is part of the PSCM toolbox. This function performs regression analyses on processed fluxes and gut microbiome relative abundance data in large cohort studies. This function needs a minimal sample size of 50 samples. This function performs different statistical analyses depending on the user defined inputs.

If the response variable is binary and no confounders are given, wilcoxon tests will be performed for the fluxes and microbiome abundances. If the response variable is binary and confounders are given, multiple logistic regressions will be performed. If the response variable is continuous and no confounders are given, simple linear regressions will be performed. If the response variable is continuous and confounders are given, multiple linear regressions will be performed.

The fluxes are z-transformed before statistical testing. The microbiome relative abundances are log transformed before testing. If a moderator variable is given and a regression is performed, a moderation analysis on the moderator with the predictor will be performed.

USAGE:: performStatistics (statPath, pathToProcessedFluxes, pathToWbmRelAbundances, metadataPath, response, confounders, moderator, microbeCutoff,threshold)

INPUT statPath Path (character array) to working directory pathToProcessedFluxes Path to processed flux data pathToWbmRelAbundances Path to microbial relative abundances metadataPath Path to metadata file response Name of response variables (char or string)

OPTIONAL INPUTS confounders Cell array with the names of confounding variables to

be included. Default = empty.

microbeCutoff Cutoff threshold for the number of samples a: microbe needs to present to be analysed. Default = 0.1. (10%)

persWBM(metadata, varargin)[source]¶

This function takes a table of physiological parameters for an individual or multiple individuals (listed in inputs) and adjusts the paramteres of a provided WBM or Harvey/Harvetta to create a personalised WBM

The calculation of physiological parameteres will be performed based on the available data e.g., if the user provides Cardiac output, this given value is assigned, if the user provides stroked volume and no CO, CO is calculated based on SV. If neither values are provided, CO will be calculated based on XXX. All details of parameter calculation are detailed in the personalised model output and in the excel file provided by this function.

INPUTS

REQUIRED: metadata Can be a struct (for personalising one model)

or a path to an excel file (for batch personalisation). If it is a structure, it should follow the format of the individualised parameters struct obtained from running: > sex = “male”; OR > sex = “female”; AND > standardPhysiolDefaultParameters;

persWBMmetabolomics(sex, metabolomicParams, varargin)[source]¶

This function takes a table of metabolomic parameters for an individual or multiple individuals (listed in inputs) and adjusts the paramteres of a provided WBM or Harvey/Harvetta to create a personalised WBM

The calculation of metabolomic parameteres will be performed based on the available data

INPUTS

REQUIRED: metabolomicParamters Can be a cell array or a table or a path to an

excel file. In any case, this should contain a list with a minimum of three columns and option for additional columns for each individual for which a model should be created:

|”compartment” |”blood” |”urine” | |”unit” |”mg/dL” | “mg/dL” | |”Individual1” |180 |50 | |”Individual2” |150 |66 |

Sex must be provided for each model. The compartment must be specified and must be one of the following: csf, u, blParameters that can be personalised include: All of those metabolites which are found in the *for each the default unit used by the model is provided. For x, y, z- unit conversion will be performed for all known common units of measurement.

OPTIONAL: WBM/(s) User can provide a WBM (Whole Body Metabolic Model) or a path to multiple models.

If no model is provided, either Harvey or Harvetta will be loaded from the COBRA toolbox, depending on sex in provided physiological data. If multiple models are provided, the model ID must match the model name provided in the physiological data.

resPath Path on which to store personalised model and: other outputs. Default = current directory
Diet Diet in the form of text file or named .mat: file from the COBRA toolbox (default = EUAverageDiet) ** This function can also be run in isolation but if a user wants to personalise both physiological and metabolomic- this function should alway be run and NOT persWBMmetabolomics in isolation!

OUTPUTS

iWBM Model with updated physiological paramteres: (stored as “persModelName.mat”). All updated paramteres are described in model.IndividualisedParameters
controlWBM This is a WBM with no personalised: adjustments. When a WBM or multiple WBMs have been given as inputs, the controlWBMs are exact copies of those. When Harvey or Harvetta are used, controlWBM is copies of Harvey/Harvetta.
persParameters Excel file with details of the updated: parameter and how it was calculated

author: Anna Sheehy November 2024 Step One: Read in the available data, check all data is valid Define the input parser

readMetadataForPersephone(metadataPath)[source]¶

DESCRIPTION:: This function loads a metadata file into a MATLAB table. It accounts for: - Preserving the full variable names by disabling automatic truncation. - Capturing variable units if provided in the second row of the file. - Storing the captured units in the ‘VariableUnits’ property of the table.

USAGE:: metadata = readMetadataForPersephone (metadataPath)
INPUTS:: metadataPath – A string specifying the path to the metadata file (in CSV or XLSX format).
OUTPUTS:: metadata – A MATLAB table containing the processed metadata. Variable units, if present in the second row of the file, are stored in the table’s ‘VariableUnits’ property.

Notes

If variable names exceed 64 characters, they are truncated, but the true variable names are preserved in the ‘VariableDescriptions’ property.
Warnings related to variable name truncation are suppressed to avoid unnecessary console output.

AUTHORS

Tim Hensen, January 2025

runMars(pythonPath, marsRepoPath, cutoffMARS, outputExtensionMARS, flagLoneSpecies, taxaSplit, removeCladeExtensionsFromTaxa, whichModelDatabase, userDatabase_path, sample_read_counts_cutoff, readsTablePath, taxaTable, outputPathMARS)[source]¶

This function integrates MATLAB with the MARS Python repository to process microbiome taxonomy and read abundance data. The MARS pipeline maps joined microbiome data to a metabolic model database for downstream analysis.

INPUTS:

pythonPath char with path to the Python executable (e.g., Anaconda installation)
marsRepoPath char with path to the directory containing the MARS pipeline repository.
cutoffMARS double with relative abundance cutoff threshold below which relative abundances – are treated as zero. Defaul is 1e-6.
outputExtensionMARS char with desired file format for output files (e.g., ‘csv’, ‘txt’)
flagLoneSpecies logical that indicates if genus name is omitted in species names – (e.g., “copri” instead of “Prevotella copri”).
taxaSplit char with delimiter used to separate taxonomic levels.
removeCladeExtensionsFromTaxa Logical which specifies whether clade name extensions should be removed – from microbiome taxa for compatibility with AGORA2 models.
whichModelDatabase char that specifies the metabolic model database to use. – Options: ‘AGORA2’, ‘APOLLO’, ‘full_db’, ‘user_db’ (requires ‘userDatabase_path’).
userDatabase_path char with path to a user-defined model database file, required only if ‘whichModelDatabase’ is set to ‘user_db’.
sample_read_counts_cutoff double variable. Threshold for read counts below which samples are excluded. – Only applies if ‘readsTablePath’ contains absolute read counts.
readsTablePath char with path to the file containing microbiome taxonomy and read abundance data.
taxaTable char with path to the file containing taxonomic assignments. – Required only if taxonomy is not included in ‘readsTablePath’.

OUTPUTS:

The function does not return variables but writes processed results
to the specified output directory in the MARS pipeline.

DEPENDENCIES:

Python and the MARS repository must be installed and accessible.
COBRA Toolbox for MATLAB must be properly set up if further integration with COBRA models is required.

Notes

Ensure ‘pythonPath’ is set to the correct Python executable version and environment containing the required MARS dependencies.
If ‘readsTablePath’ already includes taxonomy assignments, ‘taxaTable’ is optional.

AUTHOR: Tim Hensen, January 2025: modified by Jonas Widder, January 2025 (added generateStackedBarPlot_PhylumMARScoverage.m)

runNonparametricTests(data, metadata, predictor, response)[source]¶

Performs non-parametric statistical tests (Wilcoxon or Kruskal-Wallis) on metabolic data

USAGE:

resultTable = runNonparametricTests (data, metadata, predictor, response)

INPUTS:

data – (table) m x n table containing: * m samples * n reactions/taxa with flux or abundance values * First column must contain sample IDs
metadata – (table) Sample metadata containing: * Sample IDs in first column * Response variable in separate column
predictor – (char) Name of predictor variable (‘Flux’ or ‘relative_abundance’)
response – (char) Name of response variable in metadata

OUTPUTS:

resultTable – (table) Statistical results with columns: * name: Reaction/Taxa identifier * groups: Cell array of group names * n: Array of group sizes * statistic: Test statistic (Wilcoxon or Kruskal-Wallis) * p-value: Unadjusted p-value * FDR: Benjamini-Hochberg adjusted p-value * effectSize: Effect size (r for Wilcoxon, η² for Kruskal-Wallis) * confidence: 95% confidence interval (only for binary comparisons)

Example

% Binary comparison (Wilcoxon test) resultTable = runNonparametricTests(fluxData, metadata, ‘Flux’, ‘Disease’)

% Multiple group comparison (Kruskal-Wallis test) resultTable = runNonparametricTests(abundanceData, metadata, ‘relative_abundance’, ‘Treatment’)

Note

Automatically selects appropriate test based on number of groups
Data is automatically normalized before testing
Missing values (NaN) are removed before analysis
Multiple testing correction uses Benjamini-Hochberg FDR
Minimum group size of 3 samples is required for valid testing

runPersephone(configPath)[source]¶

This pipeline runs throught the entire process of mapping taxonomy assigned reads to AGORA2/APOLLO models to create relative abundance files through the MARS pipeline. The relative panSpecies abundance file is then used to create microbiome models through mgpipe (Microbiome Modelling Toolbox PMID:xx). Before HM models are created there is an option to personalise the WBMs with physiological or metabolomic data through the xx function. The HM creation can then either connect the microbiomes with base-WBMs based on the sex associated with the sample ID in the metadata file or connect it with a personalised WBM. Based on a list of reactions, the HM models will be solved and the flux results will be processed and various analyses such as shadow price and regression are performed. Then based on the make-up of the study used to run this pipeline, mixed-effect linear regression modelling can be used for various response variables while taking into account the effect of predictors. All response and predictor variables need to be present in the metadata file. By default the code will do perform two runs where in one flux data is added as an additional predictor and in the other the relative abundance of panSpecies models. Finally the results are visualised in various plots. The user can indicate which parts of the pipeline should be run with various “flags” described in the input variables below. If the code or matlab stops unexpectedly, parts of pipeline that are already performed will be skipped. If the entire pipeline should be rerun - delete the progress.mat file. If only certain parts of the pipeline should be rerun, open the progress.mat file and alter change the value of the relevant flags to false or 0.

Most of the inputs are optional as the pipeline is designed that parts of the analysis could be skipped. This however means that some of the optional variables are actually required variables in order to run a certain part of the pipeline. Behind the input description indication if the variable is required is given.

Usage:: fullPipeline()

Inputs: :Required inputs: * resultPath – A string with the path to the directory where all

results should be stored

metadataPath – A string with the path to the metadata file.

Optional inputs:: solver – String, indicates which solver is to be used for solving WBMs. OPTIONAL, defaults to GLPK. It is highly recommended to use gurobi or cplex

SeqC:

paths.seqC.flagSeqC: Logical variable indicating if bioinformatic processing: of sequencing data is performed by SeqC.
paths.seqC.repoPathSeqC: REQUIRED. String specifying the directory where the: SeqC repository is located.
paths.seqC.outputPathSeqC: String specifying the folder where the final output: of SeqC is stored.
paths.seqC.fileIDSeqC: REQUIRED. String specifying the file name containing: sample IDs for FASTQ files (e.g., sample_id.txt).
paths.seqC.procKeepSeqC: Logical variable indicating if intermediary outputs are: retained (e.g., post-QC FASTQ). If false, only the final MARS output is kept, and all intermediary content is deleted once SeqC completes.
paths.seqC.maxMemSeqC: Numeric value specifying the maximum amount of memory: allowed in gigabytes.
paths.seqC.maxCpuSeqC: Numeric value specifying the maximum number of CPU threads: allowed for processing.
paths.seqC.maxProcSeqC: Numeric value specifying the maximum number of parallel: processes allowed.
paths.seqC.debugSeqC: Logical variable indicating if additional debugging: messages should be included in the log.

MARS:

paths.Mars.flagMars: Boolean, indicates if Part 2 of the pipeline: MARS should be run. OPTIONAL, Defaults to true
marsRepoPath: String to the directory where the mars-pipeline: repository is stored. REQUIRED
pythonPath: String to the python.exe file. How to obtain: the path is described in .txt. REQUIRED
OTUTable: String to the file where OTUs are matched to: taxonomic assignments. OPTIONAL if the taxonomic assignments are already in the readsTablePath. REQUIRED if not.
readsTablePath: String to the file where the reads are assigned: to either OTUs or taxonomic assignments. REQUIRED
outputPathMARS: String to the directory where MARS should store: the results. OPTIONAL, defaults to [resultPath, filesep, ‘ResultMARS’].
cutoff: Numeric value under which relative abundances: are considered 0. OPTIONAL, defaults to 1e-6.
outputExtensionMARS: String with the desired file extension of the: saved outputs. OPTIONAL, defaults to ‘csv’.
flagLoneSpecies: A boolean to indicate if the genus name is in: the name of the species e.g. Prevotella copri. If genus name is in species name set to false. Otherwise set to true. OPTIONAL, defaults to false.
taxaSplit: String with the delimiter used to separate: taxonomic levels. OPTIONAL, defaults to ‘; ‘.

MARS analysis

marsStats: Boolean indicates if Part 2.5 of the pipeline: MARS descriptive statistics should be run. OPTIONAL, defaults to true
microbiomePath: String with the path to the file where the: reads and taxonomic assignments are combined. Could be the same file as readsTablePath. OPTIONAL, default to [Ask TIMHuls to Create].
relAbunFilePath: String with the path to the relative abundance: species file, an output from MARS. OPTIONAL, defaults to [outputPathMARS, filesep, ‘present’, filesep, ‘present_species.’, outputExtensionMARS] if MARS is run with this pipeline. If run online, the path needs to be specified explicitly.
analysisDirMARS: Strin with the path where the descriptive: statistics of the MARS output should be stored. OPTIONAL, default to [outputPathMARS, filesep, ‘Analysis’].

MgPipe:

paths.mgPipe.flagMgPipe: Boolean, indicates if Part 3 of the pipeline:: MgPipe/microbiome model creation should be run. Defaults to true. OPTIONAL, defaults to true
panModelsPath: String with path to the directory with the: panSpecies models. REQUIRED
relAbunFilePath: String with the path to the relative abundance: species file, an output from MARS. OPTIONAL, defaults to [outputPathMARS, filesep, ‘present’, filesep, ‘present_species.’, outputExtensionMARS] if MARS is run with this pipeline. If run online, the path needs to be specified explicitly.
computeProfiles: Boolean, indicates of netSecretion and: netUptake are calculated for the microbiome models via fastFVA. OPTIONAL, defaults to false
numWorkersCreation: Numeric value, amount of workers that are to be: used to create microbiome models. OPTIONAL, default to use all available cores.
mgpipeDir: String with the path to where the results of: MgPipe should be saved. OPTIONAL, defaults to [resultPath, filesep, ‘ResultMgPipe’];

WBM Personalisation:

paths.persWBM.flagPersonalizeBoolean, indicates if Part 4 of the pipeline:: WBM personalisation should be run. OPTIONAL, defaults to false.
personalisedWBMDir: String with the path to the location where the: personalised WBM are stored. OPTIONAL, defaults to [resultPath, filesep, ‘personalisedWBMs’].
diet: String or nx2 character array defining the diet: to constrain WBM models with. OPTIONAL, defaults to EUAverageDiet.

HM Creation:

paths.mWBM.flagMWBMCreation: Boolean, indicates if Part 5 of the pipeline:: HM creation and descriptive statistics should be run. OPTIONAL, defaults to true
mgpipeDir: String with the path to the location where the: community microbiome models generated from MgPipe are stored. OPTIONAL, if this pipeline is used to run MgPipe it defaults to [mgpipeDir, filesep, ‘Diet’]. If MgPipe was used outside of this function state the path explicitly.
hmDir: The path where the HM models will be stored.: OPTIONAL, defaults to [resultPath, filesep, ‘HMmodels’].
diet: String or nx2 character array defining the diet: to constrain HM models with. OPTIONAL, defaults to EUAverageDiet.
numWorkersCreation: Numeric value, amount of workers that are to be: used to create HM WBMs. OPTIONAL, default to use all available cores.

Flux balance analysis:

paths.fba.flagFBA: Boolean, indicates if Part 6 of the pipeline:
FBA should be run. OPTIONAL, defaults to true

hmDir: String with the path to the directory with the
models to use. OPTIONAL, if HM models were made it defaults to hmDirectory used in part 4. If HM models were created outside this function, the path has to be explicitly stated.

fluxDir: String with the path where the flux results
should be stored. OPTIONAL, defaults to [resultPath, filesep, ‘fluxResults’].

rxnList: Character array containing reaction IDs for
reactions that are to be optimised. If the user wants to solve non-existing demand reactions, simply add DM_ infront of the desired metabolite and add to the rxnList variable. REQUIRED.

rxnSense: Character array containing either ‘max’ or ‘min’
to specificy the sense of the objective. Option to specifcy for each objective- character array should then be exact length of rxnList. (OPTIONAL, Default = ‘max’).

mgpipeDir: String with the path to the relative abundance
species file, an output from MARS. OPTIONAL, defaults to [outputPathMARS, filesep, ‘present’, filesep, ‘present_species.’, outputExtensionMARS] if MARS is run with this pipeline. If run online, the path needs to be specified explicitly.

thresholds: 1x3 numerical vector. OPTIONAL, defaults to
[90, 90, 0.999]

thresholds(1): “metabolite removal threshold”

This threshold indicates the maximum allowable number of duplicate flux results between samples expressed in the percentage of total samples. Reactions that exceed this threshold will be removed.

thresholds(2): “Microbe presence threshold”

This threshold indicates the maximum allowable number of samples where a microbe is absent expressed by the percentage of total samples. Microbes that exceed this threshold will be removed from the analysis.

thresholds(3): “Reaction grouping threshold”

This threshold indicates the minimal pairwise Spearman correlation between fluxes across the population where reactions are grouped and handled as one result. ADD TEXT 4th input for std dev.

numWorkersOptmisation: Numeric value, amount of workers that are to be

used to solve the WBM models. OPTIONAL, default to 1.

saveFullRes: Boolean, indicates if the complete .v, .y., and: .w vectors are stored in the result. OPTIONAL, defaults to true. It is recommended to set saveFullRes.
fluxAnlsysDir: String with the path where the analyses of the: flux results are stored. OPTIONAL, defaults to [fluxDir, filesep, ‘fluxAnalysis’]

Statistical analysis:

paths.stats.flagStatistics: Boolean, indicates if Part 7 of the pipeline:: statistical analysis should be run. OPTIONAL, defaults to true
statDir: String with the path where the results of the: statistical analyses and the figures should be saved. OPTIONAL, defaults to [resultPath, filesep, ‘statisticsResults’].
fluxDir: String with the path to the location of the: analysed flux results. OPTIONAL, defaults to [resultPath, filesep, ‘fluxResults’, filesep, ‘fluxAnalysis’].
microbiomePath: String with the path to the relative abundance: panSpecies file generated by MARS. OPTIONAL, defaults to [resultPath, filesep, resultMars, filesep, present,filesep, ‘present_species.csv’]
response: A character or string array with variables: found in the metdata file that the user wants to use as response for statistical analysis. REQUIRED
confounders: A character or string sarray with variables: found in the metadata file that are used as confounders in the statistical analyses. OPTIONAL, defaults to an empty cell array.
moderationAnalysis: Boolean that indicates if a moderation analysis: will be performed. A moderation analysis can only be performed if confounders. OPTIOANL, defaults to false.
microbeCutoff: Numeric value, threshold for the number of: samples a microbe needs to present in to be to be analysed. The same variable as thresholds(2) OPTIONAL, defaults to 0.1
threshold: Numeric value used to find the correct instance: of the ‘processed_fluxes_Thr_x_xxx file, where the threshold value is x_xxx. The same value as thresholds (3). OPTIONAL, defaults to 0.999.

Outputs:

Example (minimal required input to run all sections):: fullPipeline(resultPath,metadataPath,’marsRepoPath’, marsRepoPath,… ‘pythonPath’, pythonPath, ‘OTUTable’, OTUTable, ‘readsTablePath’ readsTablePath,… ‘panModelsPath’, panModelsPath, ‘rxnList’, rxnList, ‘response’, response)

Example (MARS is run online and is skipped in the this pipeline):

runSeqC(repoPathSeqC, outputPathSeqC, fileIDSeqC, procKeepSeqC, maxMemSeqC, maxCpuSeqC, maxProcSeqC, debugSeqC, readsTablePath, outputPathMARS, outputExtensionMARS, relAbunFilePath, sample_read_counts_cutoff, cutoffMARS, OTUTable, flagLoneSpecies, taxaSplit, removeCladeExtensionsFromTaxa, whichModelDatabase, userDatabase_path, taxaTable)[source]¶

======================================================================================================#

Title: SeqC as Flux Pipeline MATLAB Wrapper Author: Wiley Barton Modified code sources:

matlab script structure: Tim Hensen, runMars.m, 2025.01 assistance and reference from a generative AI model [ChatGPT](https://chatgpt.com/)

clean-up and improved readability

Last Modified: 2025.02.09 Part of: Persephone Pipeline

Description:

This function builds and runs the SeqC Docker image from a MATLAB environment. It ensures necessary databases exist and coordinates execution with the MARS pipeline.

Inputs:

repoPathSeqC (char) : Path to the SeqC repository
outputPathSeqC (char) : Path for SeqC output
fileIDSeqC (char) : Unique identifier for file processing
procKeepSeqC (logical) : Keep all files (true/false)
maxMemSeqC (int) : Maximum memory allocation for SeqC
maxCpuSeqC (int) : Maximum CPU allocation for SeqC
maxProcSeqC (int) : Maximum processes for SeqC
debugSeqC (logical) : Enable debug mode (true/false)

…

Dependencies:

MATLAB
Docker installed and accessible in the system path

======================================================================================================#

Determine Operating System

validatePersephoneInputs(paths, resultPath)[source]¶

Validates the inputs provided in the paths structure for the Persephone pipeline. This function ensures that all required fields, flags, and parameters are present, appropriately formatted, and meet the expected constraints. Validation is performed across multiple pipeline components, including MARS, mgPipe, WBM personalization, HM creation, FBA, and statistics.

USAGE:: validated = validatePersephoneInput (paths)
INPUT:: paths Structure containing all required input fields for the Persephone – pipeline. Fields include: - General flags and paths (e.g., flagSeqC, resultPath) - Component-specific flags and parameters (e.g., flagMars, marsRepoPath, flagMgPipe, metadataPath, etc.)
OUTPUT:: validated Logical scalar indicating whether the validation succeeded – (true) or failed (false).

Notes

Ensure all required flags and fields in the paths structure are defined prior to invoking this function. Undefined or mismatched fields will result in validation failure.
This function is designed to be modular and extendable for future additions to the Persephone pipeline.

AUTHOR:: Tim Hensen, January 2025

See also: validateattributes, validatestring, COBRA Toolbox