Additional functions

CreateClusterAndRAtables

Creates clusters and their respective relative abundance tables.

combineFluxResults(directory1, directory2, resultdirectory, set_regexp)

This function merges and prunes the FBA solutions from two runs of the optimiseRxnMultipleWBMs.m function. Reaction fluxes, FBA statistics, and shadow prices are concatenated. If the sample filenames differ from the standard format, the regular expression must be adapted via set_regexp.

USAGE

combineFluxResults (directory1, directory2, resultdirectory, set_regexp)

INPUTS
  • directory1 [char array] Directory with flux solutions from the first run

  • directory2 [char array] Directory with flux solutions from the second run

  • resultdirectory [char array] Path to an empty folder where the combined fluxes will be saved

OPTIONAL INPUT
  • set_regexp [char array] Alternative regular expression, for cases where the sample filenames differ in style from the standard optimiseRxnMultipleWBMs.m output

Authors

  • Tim Hensen, 2024

  • modified by Jonas Widder, 10/2024 & 11/2024 (function can now also merge dirs with unequal number of samples + added set_regexp option)

completeSpeciesFolder(agoraPath, panPath)

Some species in AGORA2 only have one strain. These strains are not moved to the pan model folder, so some species are not captured in the models. Copying the reconstructions of species with only a single strain into the species folder solves this problem.

USAGE

completeSpeciesFolder (strainPath, panSpeciesPath)

INPUTS
  • strainPath Path to folder with strain reconstructions

  • panSpeciesPath Path to folder with pan species models

convertVMHIDName(metNames, VMHIDs, suggestSimilar)

FOUR FUNCTIONS:

  1. Retrieve metabolite IDs corresponding to the given metabolite names AND/OR

  2. Retrieve metabolite names corresponding to a given metabolite ID or some reactions

  3. Convert metabolite transport reactions, e.g. DM_glc_d[bc] or EX_glc_D[c], to the metabolite name, e.g. D-Glucose

  4. Suggest similar names for provided metabolite names that are not found in the database

Inputs
  • metNames - Either a cell array of metabolite names (strings) and/or metabolite transport reactions (can be mixed) for which IDs are required, or a flag (0) indicating that this step should be skipped.

  • VMHIDs - Either a cell array of metabolite IDs (strings) for which names are required, or a flag (0) indicating that this step should be skipped.

  • suggestSimilar - Flag indicating whether to generate suggestions (1) or not (0)

Outputs
  • foundVMHIDs - Cell array of metabolite IDs corresponding to the input names.

  • foundMetNames - Cell array of metabolite names corresponding to the input IDs.

  • similarMets - Cell array of possible matches for each metabolite name that was not found (only when searching by name).

Other requirements: COBRA toolbox installation (and paths set)

EXAMPLE OF USE:

VMHIDs = {'DM_gam[bc]'; 'malttr'};
metNames = {'D-glucose', 'fructose', 'carbon'};
% metNames = false;
% VMHIDs = false;
% suggestSimilar = false;
suggestSimilar = true;

[foundVMHIDs, foundMetNames, similarMets] = convertVMHIDName(metNames, VMHIDs, suggestSimilar);

Author: - Anna Sheehy & Tim Hensen - 18/07/2024

dimensionalityReductionAndMultivariateAnalysis(measuresTable, metadataTable, varOfInterest, results_path, varargin)

Performs dimensionality reduction of high-dimensional measures (e.g. microbiome relative abundances or reaction relative abundances) by RPCA following data preprocessing, or of beta-diversity measures by PCoA, with the aim to:

  1. Find whether there are general differences between groups of a metadata variable of interest (e.g. disease vs ctrl status), in case the variable is categorical.

  2. Identify variables from the measures (e.g. microbial taxa, reactions) which contribute the most to the first principal component (PC1) of RPCA and therefore to its explained variance.

  3. Perform linear regression on PC1 ~ metadata variable (e.g. Sex, disease vs ctrl status) to find metadata variables which might be important confounders in follow-up analysis, in case they are significantly correlated with PC1 from RPCA/PCoA and explain a lot of its variance.

INPUTS
  • measuresTable – [table] Contains high-dimensional measures (e.g. microbiome relative abundances or reaction relative abundances), with columns = samples & rows = measured groups (e.g. taxa/reactions).

  • metadataTable – [table] Contains metadata information for samples (e.g. sex), with columns = variables (e.g. Sex) & rows = samples.

  • varOfInterest – [string] Variable (e.g. Sex or disease status) contained in metadata.

  • results_path – [string] Directory path, where results should be stored (figures & statistical results in spreadsheet format).

  • varargin – Optional inputs, listed below.

  • numLoadings – [numeric] Number of PC loadings to display in the plot of the strongest feature contributions to the PC. Defaults to 15 loadings.

  • inputDataType – [chars/string] Specifies whether the data input is of type “abundance” or “betaDiversityMatrix”, which results in alternative processing routes (the input is treated as case-insensitive). Defaults to “abundance”.

  • PCofInterest – [numeric] Principal component/principal coordinate of interest on which the analysis will be performed. Defaults to PC 1.

OUTPUTS

Tables and plots are written to the directory at results_path.

Authors

  • Jonas Widder, 12/2024 & 01/2025
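The three aims above can be illustrated with a minimal base-MATLAB sketch. It is not the function's implementation: ordinary PCA via SVD stands in for RPCA, the data are random toy values, and the variable names (X, group, scores, beta) are hypothetical.

```matlab
% Toy data: 4 features (rows) x 6 samples (columns), matching the documented layout.
rng(1);
X = rand(4, 6);
group = [0 0 0 1 1 1]';           % hypothetical binary metadata variable

% Ordinary PCA via SVD on column-centred data (a stand-in for RPCA).
Xc = X' - mean(X', 1);            % samples x features
[U, S, V] = svd(Xc, 'econ');
scores = U * S;                   % sample coordinates on each PC
explained = diag(S).^2 / sum(diag(S).^2);

% Features ranked by their absolute loading on PC1.
[~, loadingOrder] = sort(abs(V(:, 1)), 'descend');

% Ordinary least squares for PC1 ~ group (intercept and slope).
design = [ones(numel(group), 1), group];
beta = design \ scores(:, 1);
```

Here loadingOrder lists the features contributing most to PC1, and beta(2) is the estimated shift in PC1 between the two groups.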

downloadAGORA2(directory)

Downloads and unpacks AGORA2.

INPUT
  • directory Directory indicating where to download AGORA2

OUTPUT
  • AGORA2_dir Directory to AGORA2 folder

Author: Tim Hensen, 2024

filterMetabolitesNotPresentInWBMmodel(metabolitesOfInterest, WBM_compartment)

Filters a table of metabolites for their presence in selected compartment(s) of the unpersonalized Harvey and Harvetta WBM models and returns the present and absent metabolites in separate tables. This ensures that all metabolites of interest are actually present in the models, so that fluxes can be calculated for them.

findOptimalCoreCount(modelDir, solver)

This function finds the optimal number of workers for the HM models being investigated.

INPUT
  • modelPath Path to folder with COBRA models

OPTIONAL INPUT
  • subSetSize Size of the random subset of models used for testing

OUTPUT
  • fig Figure showing the average speedup factor for each tested configuration of workers

generateAGORA2MappingStatTable

generatePanAGORA2database()

Create lookup file for checking which reactions and metabolites are present in which AGORA2 strains

OUTPUT lookupFilePath Path to the generated lookup file

Authors: Tim Hensen, 2024

generatePanDatabase(inputDir)

Create lookup file for checking which reactions and metabolites are present in which AGORA2 models

OUTPUT lookupFilePath Path to the generated lookup file

Authors: Tim Hensen, 2024

generateStackedBarPlot(input_relAbundances, saveDir)

Generates stacked bar plots from relative abundances of taxa for single or multiple samples.

INPUTS
  • input_relAbundances – [table] Contains taxa and their relative abundances for all samples. Requires column ‘Taxon’ and one or more sample columns.

  • saveDir – [chars/string] Path to the directory where the stacked bar plot should be saved.

AUTHOR:
  • Jonas Widder, 12/2024 & 01/2025
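As a rough illustration of the documented input layout, the sketch below builds a small table with a 'Taxon' column and two sample columns and draws one stacked bar per sample using base MATLAB. The taxa, values, and output filename are made up, and the plotting code is an assumption about how such a figure could be produced, not the function's own code.

```matlab
% Hypothetical input table: a 'Taxon' column plus one column per sample.
input_relAbundances = table({'Bacteroides'; 'Prevotella'; 'Other'}, ...
    [0.5; 0.3; 0.2], [0.2; 0.2; 0.6], ...
    'VariableNames', {'Taxon', 'Sample1', 'Sample2'});

sampleNames = input_relAbundances.Properties.VariableNames(2:end);
vals = input_relAbundances{:, 2:end}';   % rows = samples, columns = taxa

fig = figure('Visible', 'off');
bar(vals, 'stacked');                    % one stacked bar per sample
set(gca, 'XTick', 1:numel(sampleNames), 'XTickLabel', sampleNames);
ylabel('Relative abundance');
legend(input_relAbundances.Taxon, 'Location', 'eastoutside');

saveDir = tempdir;                       % placeholder for a real output folder
saveas(fig, fullfile(saveDir, 'stackedBarPlot_example.png'));
close(fig);
```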

getDirectorySize(dirPath)

Title: Directory disk use calculator

Author: Wiley Barton

Modified code sources:
  • assistance and reference from a generative AI model [ChatGPT](https://chatgpt.com/)

  • clean-up and improved readability

Last Modified: 2025.01.29

Part of: Persephone Pipeline

Description:

This function determines the size of a selected directory

Inputs:
  • dirPath (char) : Path to the directory whose size is determined

Dependencies:
  • MATLAB

  • Docker installed and accessible in the system path

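One way to determine a directory's size in plain MATLAB is to list all files recursively and sum their byte counts. The sketch below assumes MATLAB R2016b or later (for the `**` wildcard in dir) and uses a throwaway temporary folder; it is an illustration and not necessarily how getDirectorySize works internally.

```matlab
% Create a temporary directory containing one 100-byte file.
dirPath = tempname;
mkdir(dirPath);
fid = fopen(fullfile(dirPath, 'a.txt'), 'w');
fwrite(fid, repmat('x', 1, 100));        % written as 100 single bytes
fclose(fid);

% List everything below dirPath recursively and drop folder entries.
entries = dir(fullfile(dirPath, '**', '*'));
entries = entries(~[entries.isdir]);

% Total size in bytes.
totalBytes = sum([entries.bytes]);

% Remove the temporary directory again.
rmdir(dirPath, 's');
```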

getMicrobeFluxMappingStats

INPUT: saveDirStats

getPanSpeciesMetProdCapacity

Create lookup file for checking which reactions and metabolites are present in which AGORA2 taxa

OUTPUT lookupFilePath Path to the generated lookup file

Authors: Tim Hensen, 2024

getVMHID(mets, suggest)

getVMHID - Retrieve metabolite IDs corresponding to the given metabolite names.

Inputs
  • mets - Cell array of metabolite names (strings)

  • suggest - Flag indicating whether to generate suggestions (1) or not (0)

Outputs
  • metIDs - Cell array of metabolite IDs corresponding to the input names.

  • suggestedMets - Cell array of possible matches for each unfound metabolite name.

Example

metaboliteNames = {'glucose', 'fructose'};
[metIDs, suggestedMets] = getVMHID(metaboliteNames, 1);

Other requirements: COBRA toolbox installation and initialisation

Author: - Anna Sheehy - 16/07/2024

microbiomeMappingStats(rawPath, marsPath, saveDir, metadataPath)

Function for obtaining statistics on AGORA2 mapping.

INPUT
  • rawPath Path to the unfiltered microbiome data

  • marsPath Path to the mapped microbiome data

  • saveDir Path to the folder where the results are saved

physiologicalConstraintsHMDBbasedTEMP(model, IndividualParameters, ExclList, Type, InputData, Biofluid, setDefault, ExclMet, ExclMetAbbr)

This function applies constraints to the whole-body metabolic model. Metabolite concentrations have to be given in uM and organ weights in g. Please note that reaction-specific constraints, which have been derived from the literature, are applied at the end of the function.

USAGE

modelConstraint = physiologicalConstraintsHMDBbased (model, IndividualParameters, ExclList, Type, InputData, Biofluid, setDefault, ExclMet, ExclMetAbbr)

INPUTS
  • model Model structure

  • IndividualParameters Structure containing physiological parameters, as generated in standardPhysiolDefaultParameters

  • ExclList List of reaction(s) to which no updated bound should be assigned

  • Type Input type: either ‘xlsx’ (default), which loads ‘Parsed_hmdbConc.xlsx’, or ‘direct’. If ‘direct’, InputData must be provided

  • InputData First column corresponds to the VMH IDs of metabolites, second column to data points (will be set as lb and ub)

  • Biofluid ‘all’ (default if Type is ‘xlsx’). For ‘direct’: ‘bc’, ‘u’, ‘csf’

  • setDefault If the input data does not contain concentration information for a given metabolite, default concentration ranges will be used to calculate the constraints (default: 1). Note that the default metabolite concentration ranges are specified in IndividualParameters for the different biofluid compartments.

  • ExclMet Specify whether certain metabolites, and thus their associated reactions, should be excluded from the constraint application (default: 0)

  • ExclMetAbbr List of metabolites that should be excluded

OUTPUT
  • modelConstraint Model structure with updated constraints

Ines Thiele, 2015-2019

plotAbsentTaxaEffectOnMARScoverage(mars_preprocessedInput, absentTaxa_abundanceMetrics, readCounts, results_path, varargin)

Based on MARS mapping input, this function generates a plot which visualizes how much of an effect the addition of currently unmapped taxa to the microbiome community model would have in terms of read coverage, starting from the most abundant taxa.

INPUTS
  • mars_preprocessedInput – [table] MARS output “preprocessed_input” which contains read counts per pre-mapped taxa.

  • absentTaxa_abundanceMetrics – [table] MARS output listing all unmapped taxa together with summary statistics on their relative abundance across samples (mean relative abundance is of importance for the function).

  • readCounts – [table] Original data table containing read counts per taxa.

  • results_path – [string] Directory path, where results should be stored (figure).

  • numAbsentTaxaToInvestigate – [numerical] Number of unmapped taxa whose effect should be tested for & plotted. Optional, defaults to the full list of all unmapped taxa.

Authors

  • Jonas Widder, 11/2024 & 01/2025
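The idea of adding unmapped taxa in order of abundance can be sketched as a cumulative sum. The numbers and variable names below are invented for illustration, and the actual function may compute coverage differently (for example from read counts rather than mean relative abundances).

```matlab
% Hypothetical values: fraction of reads already covered by mapped taxa,
% and mean relative abundances of four currently unmapped taxa.
mappedCoverage = 0.80;
absentMeanRelAbund = [0.02 0.08 0.01 0.05];

% Add unmapped taxa one at a time, starting with the most abundant.
sortedAbund = sort(absentMeanRelAbund, 'descend');
coverage = mappedCoverage + cumsum(sortedAbund);

% coverage(k) is the coverage after adding the k most abundant unmapped taxa.
```

Plotting coverage against 1:numel(coverage) then shows how quickly the gain from each additional taxon levels off.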

runStatisticsOnModerationAnalysisResults(data, metadata, formula, regressionResults, moderationThreshold_usePValue, moderationThreshold, saveDir)

Filters regression results from the moderation analysis for significantly correlating metabolite fluxes/bacterial taxa. It then stratifies the filtered flux/relative abundance data by the moderator and performs a new statistical analysis on the stratified data. Note: the moderator needs to be categorical.

INPUTS
  • data – [Table] Processed flux/relative abundances data.

  • metadata – [Table] Metadata containing ID & pot. additional variables (confounders, moderators)

  • formula – [String] Regression formula in Wilkinson notation.

  • regressionResults – [Struct] Structure containing tables for flux & rel. abundances regression results.

  • moderationThreshold_usePValue – [Boolean] Whether the cutoff threshold is applied to the p-value (true) or to the FDR (false). Default = true.

  • moderationThreshold – [Numerical] Cutoff threshold: the maximal FDR value (or p-value, depending on moderationThreshold_usePValue) from the moderation analysis that a metabolite/bacterial taxon may have in order to be included in the subsequent analysis of stratified fluxes/taxa. Default = 0.05 (5%).

  • saveDir – [Character array] Path to working directory.

OUTPUT

statResults – [Struct] Structure containing tables of regression results on the moderator-stratified data for the significant hits from the initial moderation analysis regressions. Will be empty if the regression does not contain flux or relative abundance.

AUTHOR:

Jonas Widder, 11/2024

setResultPath(solutionDir)

Function for creating a common path to all flux results

slimDownFBAresults(FBAsolutionDir)

This function prunes FBA solution results obtained in optimiseRxnMultipleWBM.m and saves the slimmed-down solution results in a new folder. The function first creates a new folder and generates paths for the flux results in that folder. Then, only the following data is loaded: ‘rxns’, ‘ID’, ‘sex’, ‘f’, and ‘stat’. If microbiome data was available, ‘speciesBIO’, ‘shadowPriceBIO’, and ‘relAbundances’ are loaded as well. Finally, the solutions are saved to the new paths.

INPUT FBAsolutionDir Character array with path to FBA solutions.

OUTPUT smallFBAsolutionPaths Path to slimmed down FBA results

AUTHOR: Tim Hensen, October 2024
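The selective loading step can be sketched with MATLAB's load and save functions. The example below writes a toy .mat file to a temporary location; the variable contents and the extra bigMatrix field are hypothetical, and the real function's folder handling is not reproduced.

```matlab
% Write a toy solution file with the documented fields plus one large extra field.
srcFile = [tempname '.mat'];
rxns = {'EX_glc_D[u]'};  ID = 'sample1';  sex = 'female';  %#ok<NASGU>
f = 1.0;  stat = 1;  bigMatrix = zeros(100);               %#ok<NASGU>
save(srcFile, 'rxns', 'ID', 'sex', 'f', 'stat', 'bigMatrix');

% Fields to keep; the microbiome fields are only loaded when present.
keep = {'rxns', 'ID', 'sex', 'f', 'stat', ...
        'speciesBIO', 'shadowPriceBIO', 'relAbundances'};
available = who('-file', srcFile);
toLoad = intersect(keep, available);

% Load only the selected variables and save them to a new, smaller file.
slim = load(srcFile, toLoad{:});
dstFile = [tempname '.mat'];
save(dstFile, '-struct', 'slim');

% Inspect the result, then remove the temporary files.
savedVars = who('-file', dstFile);
delete(srcFile);
delete(dstFile);
```

Checking the file contents with who before loading avoids warnings about missing variables when no microbiome data was stored.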

validateDietPath(Diet, resPath)

validateDietPath checks if ‘Diet’ is a valid COBRA Toolbox diet or file path and either loads the corresponding data or saves it to a text file.