Workflows

Workflow Constructor

Construct inferelator workflows from preprocessing, postprocessing, and regression modules

inferelator.workflow.inferelator_workflow(regression=<class 'inferelator.regression.base_regression._RegressionWorkflowMixin'>, workflow=<class 'inferelator.workflow.WorkflowBase'>)

Create and instantiate an Inferelator workflow.

Parameters:
  • regression (str, RegressionWorkflow subclass) –

    A class object which implements the run_regression and run_bootstrap methods for a specific regression strategy. This can be provided as a string.

    “base” loads a non-functional regression stub.

    “bbsr” loads Bayesian Best Subset Regression.

    “elasticnet” loads Elastic Net Regression.

    “sklearn” loads scikit-learn Regression.

    “stars” loads the StARS stability Regression.

    “amusr” loads AMuSR Regression. This requires the multitask workflow.

    “bbsr-by-task” loads Bayesian Best Subset Regression for multiple tasks. This requires the multitask workflow.

    “elasticnet-by-task” loads Elastic Net Regression for multiple tasks. This requires the multitask workflow.

    Defaults to “base”.

  • workflow (str, WorkflowBase subclass) –

    A class object which implements the necessary data loading and preprocessing to create design & response data for the regression strategy, and then the postprocessing to turn regression betas into a network. This can be provided as a string.

    “base” loads a non-functional workflow stub.

    “tfa” loads the TFA-based workflow.

    “single-cell” loads the Single Cell TFA-based workflow.

    “multitask” loads the multitask workflow.

    Defaults to “base”.

Returns:

This returns an initialized object which combines the regression strategy with the preprocessing/postprocessing workflow. Settings can then be assigned to this object, and it can be run with .run().

Return type:

Workflow instance

Common Workflow

class inferelator.workflow.WorkflowBaseLoader

WorkflowBaseLoader is the class to load raw data. It does no processing; it only takes data from files.

append_to_path(var_name, to_append)

Add a string to an existing path variable

Parameters:
  • var_name (str) – The name of the path variable (input_dir or output_dir)
  • to_append (str) – The path to join to the end of the existing path variable
print_file_loading_arguments(file_name)

Print the settings that will be used to load a given file name.

Parameters:file_name (str) – The name of the variable containing the file name (from set_file_properties)
set_expression_file(tsv=None, hdf5=None, h5ad=None, tenx_path=None, mtx=None, mtx_barcode=None, mtx_feature=None, h5_layer=None)

Set the type of expression data file. Current loaders include TSV, hdf5, h5ad (AnnData), and MTX sparse files. Only one of these loaders can be used; passing arguments for multiple loaders will raise a ValueError.

Parameters:
  • tsv (str, optional) – A path to a TSV (or tsv.gz) file which can be loaded by pandas.read_csv()
  • hdf5 (str, optional) – A path to a hdf5 file which can be loaded by pandas.HDFStore
  • h5ad (str, optional) – A path to an AnnData hd5 file
  • tenx_path (Path, optional) – A path to the folder containing the 10x mtx, barcode, and feature files
  • mtx (str, optional) – A path to an mtx file
  • mtx_barcode (str, optional) – A path to a list of observation names (e.g. barcodes) for the mtx file
  • mtx_feature (str, optional) – A path to a list of gene names for the mtx file
  • h5_layer (str, optional) – The layer (in an AnnData h5) or the store key (in an hdf5) file to use. Defaults to using the first key.
set_file_loading_arguments(file_name, **kwargs)

Update the settings for a given file name. By default we assume all files can be read in as TSV files. Any arguments provided here will be passed to pandas.read_csv() for the file name provided.

set_file_loading_arguments("expression_matrix_file", sep=",") will read the expression_matrix_file as a CSV.

Parameters:
  • file_name (str) – The name of the variable containing the file name (from set_file_properties)
  • kwargs – Arguments to be passed to pandas.read_csv()
set_file_paths(input_dir=None, output_dir=None, expression_matrix_file=None, tf_names_file=None, meta_data_file=None, priors_file=None, gold_standard_file=None, gene_metadata_file=None, gene_names_file=None)

Set the file paths necessary for the inferelator to run

Parameters:
  • input_dir (str) – A path containing the input files
  • output_dir (str, optional) – A path to put the output files
  • expression_matrix_file (str) – Path to the expression data. If set here, this expression file will be assumed to be a TSV file. Use set_expression_file() for other file types
  • meta_data_file (str, optional) – Path to the meta data TSV file
  • tf_names_file (str) – Path to a list of regulator names to include in the model
  • priors_file (str) – Path to a prior data file TSV file [Genes x Regulators]
  • gold_standard_file (str) – Path to a gold standard data TSV file [Genes x Regulators]
  • gene_metadata_file (str, optional) – Path to a genes annotation file
  • gene_names_file (str, optional) – Path to a list of genes to include in the model (optional)
set_file_properties(extract_metadata_from_expression_matrix=None, expression_matrix_metadata=None, expression_matrix_columns_are_genes=None, gene_list_index=None, metadata_handler=None)

Set properties associated with the input data files

Parameters:
  • extract_metadata_from_expression_matrix (bool, optional) – A boolean flag that should be set to True if there is non-expression data in the expression matrix. If True, expression_matrix_metadata must be provided. Defaults to False.
  • expression_matrix_metadata (list(str), optional) – A list of columns which, if provided, will be removed from the expression matrix file and kept as metadata. Defaults to None.
  • expression_matrix_columns_are_genes (bool, optional) – A boolean flag indicating the orientation of the expression matrix. False reads the expression matrix as genes on rows, samples on columns. True reads the expression matrix as samples on rows, genes on columns. Defaults to False.
  • gene_list_index (str, optional) – The column name in the gene metadata file which corresponds to the gene labels in the expression and prior data files. Defaults to None. Must be provided if gene_metadata_file was provided to set_file_paths().
  • metadata_handler (str) – A string which identifies the specific metadata parsing method to use. Options are “branching” or “nonbranching”. Defaults to “branching”.
set_network_data_flags(use_no_prior=None, use_no_gold_standard=None)

Set flags to skip using existing network data. Note that these flags will be ignored if network data is provided.

Parameters:
  • use_no_prior (bool) – Flag to indicate the inferelator should be run without existing prior data. Will create a mock prior with no information. Highly inadvisable. Defaults to False
  • use_no_gold_standard (bool) – Flag to indicate the inferelator should be run without existing gold standard data. Will create a mock gold standard with no information. Highly inadvisable. Defaults to False
class inferelator.workflow.WorkflowBase

WorkflowBase handles crossvalidation, shuffling, and validating priors and gold standards

run()

Execute the workflow after all configuration is complete.

set_crossvalidation_parameters(split_gold_standard_for_crossvalidation=None, cv_split_ratio=None, cv_split_axis=None)

Set parameters for crossvalidation.

Parameters:
  • split_gold_standard_for_crossvalidation (bool) – Boolean flag indicating if the gold standard should be split. Must be set to True for other crossvalidation settings to have an effect. Defaults to False.
  • cv_split_ratio (float) – The proportion of the gold standard which should be retained for scoring. The rest will be used to train the model. Must be set between 0 and 1.
  • cv_split_axis (int, None) – How to split the gold standard. If 0, split genes; this will take all the data for certain genes and keep it in the gold standard. These genes will be removed from the prior. If 1, split regulators; this will take all the data for certain regulators and keep it in the gold standard. These regulators will be removed from the prior. Splitting regulators is inadvisable. If None, the prior will be replaced with a downsampled gold standard. Setting this to 0 is generally the best choice. Defaults to None.
static set_output_file_names(network_file_name='', confidence_file_name='', nonzero_coefficient_file_name='', pdf_curve_file_name='', curve_data_file_name='')

Set output file names. File names that end in ‘.gz’ will be gzipped.

Parameters:
  • network_file_name (str) – Long-format network TSV file with TF->Gene edge information. Default is “network.tsv”.
  • confidence_file_name (str) – Genes x TFs TSV with confidence scores for each edge. Default is “combined_confidences.tsv”
  • nonzero_coefficient_file_name (str) – Genes x TFs TSV with the number of non-zero model coefficients for each edge. Default is None (this file is not produced).
  • pdf_curve_file_name (str) – PDF file with plotted curve(s). Default is “combined_metrics.pdf”.
  • curve_data_file_name (str) – TSV file with the data used to plot curves. Default is None (this file is not produced).
set_postprocessing_parameters(gold_standard_filter_method=None, metric=None)

Set parameters for the postprocessing engine

Parameters:
  • gold_standard_filter_method (str) – A flag that determines if the gold standard should be shrunk to the size of the produced model. “overlap” will only score on overlap between the gold standard and the inferred gene regulatory network. “keep_all_gold_standard” will score on the entire gold standard. Defaults to “keep_all_gold_standard”.
  • metric (str) – The model metric to use for scoring. Supports “precision-recall”, “mcc”, “f1”, and “combined”. Defaults to “combined”.
set_run_parameters(num_bootstraps=None, random_seed=None, use_mkl=None)

Set parameters used during runtime

Parameters:
  • num_bootstraps (int) – The number of bootstraps to run. Defaults to 2.
  • random_seed (int) – The random number seed to use. Defaults to 42.
  • use_mkl (bool) – A flag to indicate if the Intel MKL library should be used for matrix multiplication
set_shuffle_parameters(shuffle_prior_axis=None, make_data_noise=None)

Set parameters for shuffling labels on a prior axis. This is useful to establish a baseline.

Parameters:
  • shuffle_prior_axis (int, None) – The axis for shuffling prior labels. 0 shuffles gene labels. 1 shuffles regulator labels. None means labels will not be shuffled. Defaults to None.
  • make_data_noise (bool) – Replace loaded data with simulated data that is entirely random. Types are retained; integer data remains integer and float data remains float. Gene distributions are centered around the mean expression of each gene in the original data, but are otherwise random.

Transcription Factor Activity (TFA) Workflow

Implementation for the Transcription Factor Activity (TFA) based Inferelator workflow. This workflow also has a design driver which will incorporate timecourse data. This is the standard workflow for most applications.

class inferelator.tfa_workflow.TFAWorkFlow

Bases: inferelator.workflow.WorkflowBase

TFAWorkFlow runs the timecourse driver and the TFA driver prior to regression.

run()

Execute the workflow after all configuration is complete.

set_design_settings(timecourse_response_driver=True, delTmin=None, delTmax=None, tau=None)

Set the parameters used in the timecourse design-response driver.

Parameters:
  • timecourse_response_driver (bool) – A flag to indicate that the timecourse calculations should be performed. If set False, no other timecourse settings will have any effect. Defaults to True.
  • delTmin (int, float) – The minimum allowed time difference between timepoints to model as a time series. Provide in the same units as the metadata time column (usually minutes). Defaults to 0.
  • delTmax (int, float) – The maximum allowed time difference between timepoints to model as a time series. Provide in the same units as the metadata time column (usually minutes). Defaults to 120.
  • tau (int, float) – The tau parameter. Provide in the same units as the metadata time column (usually minutes). Defaults to 45.
set_tfa(tfa_driver=None, tfa_output_file=None, tfa_input_file=None, tfa_input_file_type=None)

Perform or skip the TFA calculations; by default the design matrix will be transcription factor activity. If this is called with tfa_driver = False, the design matrix will be transcription factor expression. It is not necessary to call this function unless setting tfa_driver = False.

Parameters:
  • tfa_driver (bool) – A flag to indicate that the TFA calculations should be performed. Defaults to True.
  • tfa_output_file (str, optional) – A path to a TSV file which will be created with the calculated TFAs. Note that this file may contain TF expression if the TFA cannot be calculated for that TF. If None, no output file will be produced. Defaults to None.
  • tfa_input_file (str, optional) – A path to a TFA file which will be loaded and used in place of activity calculations. If set, all other TFA-related settings will be irrelevant. The TSV file MUST be Samples x TFA. If None, the inferelator will calculate TFA. Defaults to None.
  • tfa_input_file_type (str, optional) – A string which identifies the file type. Accepts “tsv” and “h5ad”. If None, the file is assumed to be a TSV. Defaults to None.

Single-Cell Workflow

Run Single Cell Network Inference. This is the same network inference with some extra preprocessing functionality.

class inferelator.single_cell_workflow.SingleCellWorkflow

Bases: inferelator.tfa_workflow.TFAWorkFlow

SingleCellWorkflow has some additional preprocessing prior to calculating TFA and running regression

add_preprocess_step(fun, **kwargs)

Add a preprocessing step after count filtering but before calculating TFA or regression.

Parameters:
  • fun (str, preprocessing.single_cell function) –

    Preprocessing function. Can be provided as a string or as a function in preprocessing.single_cell.

    “log10” will take the log10 of pseudocounts.

    “ln” will take the natural log of pseudocounts.

    “log2” will take the log2 of pseudocounts.

    “fft” will do the Freeman-Tukey transform.

  • kwargs – Additional arguments to the preprocessing function
run()

Execute the workflow after all configuration is complete.

set_count_minimum(count_minimum=None)

Set the minimum count value for each gene (averaged over all samples)

Parameters:count_minimum (float) – The mean expression value which is required to retain a gene for modeling. Data that has already been normalized should probably be filtered during normalization, not now. Defaults to None (disabled).

Multi-Task AMuSR Workflow

Run Multitask Network Inference with TFA-AMuSR.

class inferelator.amusr_workflow.MultitaskLearningWorkflow

Bases: inferelator.single_cell_workflow.SingleCellWorkflow

Class that implements multitask learning. Handles loading and validation of multiple data packages.

create_task(task_name=None, input_dir=None, expression_matrix_file=None, meta_data_file=None, tf_names_file=None, priors_file=None, gene_names_file=None, gene_metadata_file=None, workflow_type='single-cell', **kwargs)

Create a task object and set any arguments to this function as attributes of that task object. TaskData objects are stored internally in _task_objects.

Parameters:
  • task_name (str) – A descriptive name for this task
  • input_dir (str) – A path containing the input files
  • expression_matrix_file (str) – Path to the expression data
  • meta_data_file (str, optional) – Path to the meta data
  • tf_names_file (str) – Path to a list of regulator names to include in the model
  • priors_file (str) – Path to a prior data file
  • gene_metadata_file (str, optional) – Path to a genes annotation file
  • gene_names_file (str, optional) – Path to a list of genes to include in the model (optional)
  • workflow_type (str, inferelator.BaseWorkflow subclass) – The type of workflow for data preprocessing. “tfa” uses the TFA workflow, “single-cell” uses the Single-Cell TFA workflow
  • kwargs – Any additional arguments are assigned to the task object.
Returns:

Returns a task reference which can be additionally modified by calling any valid Workflow function to set task parameters

Return type:

TaskData instance

set_task_filters(regulator_expression_filter=None, target_expression_filter=None)

Set the filtering criteria for regulators and targets between tasks

Parameters:
  • regulator_expression_filter (str, optional) – “union” includes regulators which are present in any task, “intersection” includes regulators which are present in all tasks
  • target_expression_filter (str, optional) – “union” includes targets which are present in any task, “intersection” includes targets which are present in all tasks

Cross-Validation Workflow Wrapper

This is a manager which will take an Inferelator workflow and repeatedly run it with different parameters. This is implemented using deep copies; it is therefore memory-intensive.

class inferelator.crossvalidation_workflow.CrossValidationManager(workflow_object=None)

Bases: object

Crossvalidate an Inferelator Workflow

__init__(workflow_object=None)

Create a new CrossValidationManager instance and give it a workflow

Parameters:workflow_object (Workflow) – The workflow to run crossvalidation with
add_gridsearch_parameter(param_name, param_vector)

Set a parameter to search through by exhaustive grid search

Parameters:
  • param_name (str) – The workflow parameter to change for each run
  • param_vector (iterable) – An iterable with values to use for the parameter
add_grouping_dropin(metadata_column_name, group_size=None, seed=42)

Run modeling on each group (defined by a metadata column) individually.

Parameters:
  • metadata_column_name (str) – Metadata column which has different values for each group
  • group_size (int, None) – The maximum size of each group. Groups will be downsampled to the same size if this is not set to None. Default is None.
  • seed (int) – The random seed to use for the group downsampling (this is not the same as the seed passed to the workflow)
add_grouping_dropout(metadata_column_name, group_size=None, seed=42)

Drop each group (defined by a metadata column) and run modeling on all of the other groups.

Parameters:
  • metadata_column_name (str) – Metadata column which has different values for each group
  • group_size (int, None) – The maximum size of each group. Groups will be downsampled to the same size if this is not set to None. Default is None.
  • seed (int) – The random seed to use for the group downsampling (this is not the same as the seed passed to the workflow)
add_size_subsampling(size_vector, stratified_column_name=None, with_replacement=False, seed=42)

Resample expression data to a ratio of the original data.

Parameters:
  • size_vector (iterable(floats)) – An iterable with numeric ratios for downsampling. These values must be between 0 and 1.
  • stratified_column_name (str, None) – Set this to stratify sampling (to maintain group size ratios). If None, do not maintain group size ratios. Default is None.
  • with_replacement (bool) – Do sampling with or without replacement. Defaults to False
  • seed (int) – The random seed to use when selecting observations (this is not the same as the seed passed to the workflow)