Workflows

Workflow Constructor

Construct inferelator workflows from preprocessing, postprocessing, and regression modules

inferelator.workflow.inferelator_workflow(regression=<class 'inferelator.regression.base_regression._RegressionWorkflowMixin'>, workflow=<class 'inferelator.workflows.workflow_base.WorkflowBase'>)

Create and instantiate an Inferelator workflow.

Parameters:

regression (str, RegressionWorkflow subclass) –
A class object which implements the run_regression and run_bootstrap methods for a specific regression strategy. This can be provided as a string.

“base” loads a non-functional regression stub.

“bbsr” loads Bayesian Best Subset Regression.

“elasticnet” loads Elastic Net Regression.

“sklearn” loads scikit-learn Regression.

“stars” loads the StARS stability Regression.

“amusr” loads AMuSR Regression. This requires the multitask workflow.

Defaults to “base”.
workflow (str, WorkflowBase subclass) –
A class object which implements the necessary data loading and preprocessing to create design & response data for the regression strategy, and then the postprocessing to turn regression betas into a network. This can be provided as a string.

“base” loads a non-functional workflow stub.

“tfa” loads the TFA-based workflow.

“single-cell” loads the Single Cell TFA-based workflow.

“multitask” loads the multitask workflow.

Defaults to “base”.

Returns:

This returns an initialized object which combines the regression workflow and the preprocessing/postprocessing workflow. Settings can then be assigned to this object, and it can be run with .run().

Return type:

Workflow instance
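One way to picture the returned object is as an instance of a class that inherits from both a regression class and a workflow class, each of which may be looked up by string key. The sketch below is a toy illustration of that pattern only. The class names, the two registries, and build_workflow are hypothetical and are not the inferelator implementation.

```python
class BaseRegression:
    """Hypothetical stand-in for a regression strategy."""

    def run_regression(self):
        raise NotImplementedError("base is a non-functional stub")


class ToyRegression(BaseRegression):
    """Hypothetical regression that returns a placeholder result."""

    def run_regression(self):
        return "betas"


class BaseWorkflow:
    """Hypothetical stand-in for data loading and postprocessing."""

    def run(self):
        # Regress, then turn the result into a (placeholder) network
        return "network from " + self.run_regression()


# Hypothetical string-to-class registries
REGRESSIONS = {"base": BaseRegression, "toy": ToyRegression}
WORKFLOWS = {"base": BaseWorkflow}


def build_workflow(regression="base", workflow="base"):
    """Combine a regression class and a workflow class, then instantiate."""
    regression_cls = REGRESSIONS[regression] if isinstance(regression, str) else regression
    workflow_cls = WORKFLOWS[workflow] if isinstance(workflow, str) else workflow

    # The combined class inherits methods from both parents
    combined = type("CombinedWorkflow", (regression_cls, workflow_cls), {})
    return combined()


worker = build_workflow(regression="toy", workflow="base")
print(worker.run())
```

Because the regression class comes first in the list of bases, its run_regression is the one that the workflow's run method ends up calling.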

Common Workflow

class inferelator.workflow.WorkflowBase

WorkflowBase handles crossvalidation, shuffling, and validating priors and gold standards

run(): Execute workflow, after all configuration.

set_crossvalidation_parameters(split_gold_standard_for_crossvalidation=None, cv_split_ratio=None, cv_split_axis=None)

Set parameters for crossvalidation.

Parameters:

split_gold_standard_for_crossvalidation (bool) – Boolean flag indicating if the gold standard should be split. Must be set to True for other crossvalidation settings to have an effect. Defaults to False.
cv_split_ratio (float) – The proportion of the gold standard which should be retained for scoring. The rest will be used to train the model. Must be set between 0 and 1.
cv_split_axis (int, None) –
How to split the gold standard.

If 0, split genes; this will take all the data for certain genes and keep it in the gold standard. These genes will be removed from the prior.

If 1, split regulators; this will take all the data for certain regulators and keep it in the gold standard. These regulators will be removed from the prior. Splitting regulators is inadvisable.

If None, the prior will be replaced with a downsampled gold standard.

Setting this to 0 is generally the best choice. Defaults to None.
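The gene-axis split (cv_split_axis=0) can be sketched with plain dictionaries. This is an illustration of the behavior described above under stated assumptions, not the library's code: the split_by_gene function, the dict-of-sets data layout, and the rounding rule are all hypothetical.

```python
import random


def split_by_gene(prior, gold_standard, cv_split_ratio, seed=42):
    """Hold out a fraction of gold standard genes for scoring (axis 0).

    prior and gold_standard map gene name -> set of regulator names
    (a hypothetical layout chosen for this sketch).
    """
    if not 0 <= cv_split_ratio <= 1:
        raise ValueError("cv_split_ratio must be between 0 and 1")

    genes = sorted(gold_standard)
    random.Random(seed).shuffle(genes)
    held_out = set(genes[:round(len(genes) * cv_split_ratio)])

    # Held-out genes stay in the gold standard used for scoring ...
    scoring = {g: gold_standard[g] for g in held_out}
    # ... and are removed from the prior used for training
    training_prior = {g: r for g, r in prior.items() if g not in held_out}
    return training_prior, scoring


genes = [f"gene{i}" for i in range(10)]
gold_standard = {g: {"TF1"} for g in genes}
prior = {g: {"TF1", "TF2"} for g in genes}
training_prior, scoring = split_by_gene(prior, gold_standard, cv_split_ratio=0.2)
```

With ten genes and a ratio of 0.2, two genes are kept for scoring and the remaining eight stay in the training prior, so no gene is used for both.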

static set_output_file_names(network_file_name='', confidence_file_name='', nonzero_coefficient_file_name='', pdf_curve_file_name='', curve_data_file_name='', model_h5_file_name='')

Set output file names. File names that end in ‘.gz’ will be gzipped. Set any file name to None to prevent that file from being generated.

Parameters:

network_file_name (str) – Long-format network TSV file with TF->Gene edge information. Default is “network.tsv”.
confidence_file_name (str) – Genes x TFs TSV with confidence scores for each edge. Default is “combined_confidences.tsv”
nonzero_coefficient_file_name (str) – Genes x TFs TSV with the non-zero model coefficients for each edge. Default is “model_coefficients.tsv.gz”
pdf_curve_file_name (str) – PDF file with plotted curve(s). Default is “combined_metrics.pdf”.
curve_data_file_name (str) – TSV file with the data used to plot curves. Default is None (this file is not produced).
model_h5_file_name (str) – H5 file with model priors, coefficients, and run parameters saved
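The file name conventions above (a ‘.gz’ suffix triggers compression, None suppresses the file) can be sketched with the standard library. The write_output helper is hypothetical and only illustrates the convention; it is not how the inferelator writes its results.

```python
import gzip
import os
import tempfile


def write_output(text, file_name, output_dir):
    """Write text to output_dir/file_name.

    Returns None without writing if file_name is None; gzips the file
    if file_name ends in '.gz'.
    """
    if file_name is None:
        return None

    path = os.path.join(output_dir, file_name)
    opener = gzip.open if file_name.endswith(".gz") else open
    with opener(path, "wt") as handle:
        handle.write(text)
    return path


with tempfile.TemporaryDirectory() as tmp:
    skipped = write_output("a\tb\n", None, tmp)
    gz_path = write_output("a\tb\n", "model_coefficients.tsv.gz", tmp)
    # Read the compressed file back to confirm the round trip
    with gzip.open(gz_path, "rt") as handle:
        round_trip = handle.read()
```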

set_postprocessing_parameters(gold_standard_filter_method=None, metric=None)

Set parameters for the postprocessing engine

Parameters:

gold_standard_filter_method (str) – A flag that determines if the gold standard should be shrunk to the size of the produced model. “overlap” will only score on overlap between the gold standard and the inferred gene regulatory network. “keep_all_gold_standard” will score on the entire gold standard. Defaults to “keep_all_gold_standard”.
metric (str) – The model metric to use for scoring. Supports “precision-recall”, “mcc”, “f1”, and “combined”. Defaults to “combined”.
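For reference, the F1 score and the Matthews correlation coefficient (MCC) have standard definitions in terms of confusion-matrix counts, sketched below. This shows only the textbook formulas; how the inferelator ranks edges, chooses a confidence threshold, or builds the “combined” metric is not shown here, and the function names are hypothetical.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 if the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# A perfect prediction on 5 true edges and 5 true non-edges
perfect_f1 = f1_score(tp=5, fp=0, fn=0)
perfect_mcc = mcc(tp=5, fp=0, fn=0, tn=5)
```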

set_run_parameters(num_bootstraps=None, random_seed=None, use_mkl=None, use_numba=None)

Set parameters used during runtime

Parameters:

num_bootstraps (int) – The number of bootstraps to run. Defaults to 2.
random_seed (int) – The random number seed to use. Defaults to 42.
use_mkl (bool) – A flag to indicate if the Intel MKL library should be used for matrix multiplication. Defaults to False.
use_numba (bool) – A flag to indicate if numba should be used to accelerate the calculations. Requires numba to be installed if set. Currently only accelerates AMuSR regression. Defaults to True.

set_shuffle_parameters(shuffle_prior_axis=None, make_data_noise=None, add_prior_noise=None)

Set parameters for shuffling labels on a prior axis. This is useful to establish a baseline.

Parameters:

shuffle_prior_axis (int, None) – The axis for shuffling prior labels. 0 shuffles gene labels. 1 shuffles regulator labels. None means labels will not be shuffled. Defaults to None.
make_data_noise (bool, None) – Replace loaded data with simulated data that is entirely random. This retains type; integer data remains integer, float remains float. Gene distributions should be centered around the mean of gene expression in the original data, but are otherwise random.
add_prior_noise (numeric, None) – Add random edges to the prior data. This is a numeric value between 0 and 1 such that 0 adds no edges, 1 sets every edge in the prior to True, 0.1 sets 10% of the edges in the prior to True, and so on. Note that this will binarize the prior if it is not already binary.
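Shuffling labels on the gene axis (shuffle_prior_axis=0) keeps every row of the prior intact but reassigns which gene each row belongs to. The sketch below illustrates that idea on a dictionary; the shuffle_gene_labels function and the dict-of-tuples layout are hypothetical, not the library's implementation.

```python
import random


def shuffle_gene_labels(prior, seed=42):
    """Reassign row (gene) labels at random; row contents are unchanged.

    prior maps gene name -> tuple of values, one per regulator.
    """
    genes = list(prior)
    shuffled = genes[:]
    random.Random(seed).shuffle(shuffled)
    return {new: prior[old] for old, new in zip(genes, shuffled)}


prior = {
    "geneA": (1, 0, 0),
    "geneB": (0, 1, 0),
    "geneC": (0, 0, 1),
    "geneD": (1, 1, 0),
}
shuffled_prior = shuffle_gene_labels(prior)
```

The set of gene labels and the set of rows are both preserved; only the pairing between them changes, which is what makes the shuffled prior a useful baseline.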

Transcription Factor Activity (TFA) Workflow

Implementation for the Transcription Factor Activity (TFA) based Inferelator workflow.

This workflow also has a design driver which will incorporate timecourse data.

This is the standard workflow for most applications.

class inferelator.tfa_workflow.TFAWorkFlow

Bases: WorkflowBase

TFAWorkFlow runs the timecourse driver and the TFA driver prior to regression.

run(): Execute workflow, after all configuration.

set_design_settings(timecourse_response_driver=None, delTmin=None, delTmax=None, tau=None)

Set the parameters used in the timecourse design-response driver.

Parameters:

timecourse_response_driver (bool) – A flag to indicate that the timecourse calculations should be performed. If set False, no other timecourse settings will have any effect. Defaults to True.
delTmin (int, float) – The minimum allowed time difference between timepoints to model as a time series. Provide in the same units as the metadata time column (usually minutes). Defaults to 0.
delTmax (int, float) – The maximum allowed time difference between timepoints to model as a time series. Provide in the same units as the metadata time column (usually minutes). Defaults to 120.
tau (int, float) – The tau parameter. Provide in the same units as the metadata time column (usually minutes). Defaults to 45.
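The role of delTmin and delTmax can be sketched as a filter on the gap between consecutive timepoints. This is a simplified illustration: it assumes a single sorted series of times and inclusive bounds, both of which are assumptions, and the usable_time_pairs function is hypothetical. The use of tau is not shown.

```python
def usable_time_pairs(times, delTmin=0, delTmax=120):
    """Return consecutive (earlier, later) timepoints whose gap is allowed.

    Pairs with a gap outside [delTmin, delTmax] are not modeled as a
    time series in this sketch.
    """
    ordered = sorted(times)
    pairs = []
    for earlier, later in zip(ordered, ordered[1:]):
        if delTmin <= later - earlier <= delTmax:
            pairs.append((earlier, later))
    return pairs


# Times in minutes; the 170 minute gap between 30 and 200 is too long
pairs = usable_time_pairs([0, 15, 30, 200, 260])
```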

set_tfa(tfa_driver=None, tfa_output_file=None, tfa_input_file=None, tfa_input_file_type=None)

Perform or skip the TFA calculations; by default the design matrix will be transcription factor activity. If this is called with tfa_driver = False, the design matrix will be transcription factor expression. It is not necessary to call this function unless setting tfa_driver = False.

Parameters:

tfa_driver (bool) – A flag to indicate that the TFA calculations should be performed. Defaults to True.
tfa_output_file (str, optional) – A path to a TSV file which will be created with the calculated TFAs. Note that this file may contain TF expression if the TFA cannot be calculated for that TF. If None, no output file will be produced. Defaults to None.
tfa_input_file – A path to a TFA file which will be loaded and used in place of activity calculations. If set, all TFA-related settings will be irrelevant. TSV file MUST be Samples X TFA. If None, the inferelator will calculate TFA. Defaults to None.
tfa_input_file_type – A string which identifies file type. Accepts “tsv” and “h5ad”. If None, assume the file is a TSV. Defaults to None.

Single-Cell Workflow

Run Single Cell Network Inference. This is the same TFA network inference with some extra preprocessing functionality.

class inferelator.single_cell_workflow.SingleCellWorkflow

Bases: TFAWorkFlow

SingleCellWorkflow has some additional preprocessing prior to calculating TFA and running regression

add_preprocess_step(fun, **kwargs)

Add a preprocessing step after count filtering but before calculating TFA or regression.

Parameters:

fun – Preprocessing function. Can be provided as a string or as a function in preprocessing.single_cell.

“log10” will take the log10 of pseudocounts

“ln” will take the natural log of pseudocounts

“log2” will take the log2 of pseudocounts

“fft” will do the Freeman-Tukey transform

kwargs – Additional arguments to the preprocessing function
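The string options can be pictured as a lookup table of element-wise transforms. The sketch below assumes that “pseudocounts” means count + 1 and uses the textbook Freeman-Tukey form sqrt(x) + sqrt(x + 1); the library's exact formula may differ (for example by a constant offset). The TRANSFORMS table and the preprocess function are hypothetical.

```python
import math

# Hypothetical lookup of string keys to element-wise transforms
TRANSFORMS = {
    "log10": lambda x: math.log10(x + 1),
    "ln": lambda x: math.log1p(x),
    "log2": lambda x: math.log2(x + 1),
    "fft": lambda x: math.sqrt(x) + math.sqrt(x + 1),
}


def preprocess(counts, fun, **kwargs):
    """Apply a transform, given by name or as a callable, to each count."""
    func = TRANSFORMS[fun] if isinstance(fun, str) else fun
    return [func(x, **kwargs) for x in counts]


counts = [0, 1, 9]
log2_counts = preprocess(counts, "log2")
# A callable with an extra keyword argument is also accepted
scaled = preprocess(counts, lambda x, factor: x * factor, factor=2)
```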

run(): Execute workflow, after all configuration.

set_count_minimum(count_minimum=None)

Set the minimum count value for each gene (averaged over all samples)

Parameters:

count_minimum (float) – The mean expression value which is required to retain a gene for modeling. Data that has already been normalized should probably be filtered during normalization, not now. Defaults to None (disabled).
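The filter amounts to comparing each gene's mean count across samples against the threshold. A minimal sketch follows; the filter_genes_by_mean function, the dict-of-lists layout, and the choice of >= rather than > at the boundary are assumptions.

```python
from statistics import mean


def filter_genes_by_mean(counts_by_gene, count_minimum=None):
    """Keep genes whose mean count over all samples meets the minimum.

    counts_by_gene maps gene name -> list of counts, one per sample.
    None disables filtering.
    """
    if count_minimum is None:
        return dict(counts_by_gene)
    return {
        gene: counts
        for gene, counts in counts_by_gene.items()
        if mean(counts) >= count_minimum
    }


counts_by_gene = {"geneA": [0, 0, 1], "geneB": [2, 4, 6]}
kept = filter_genes_by_mean(counts_by_gene, count_minimum=1)
```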

Multi-Task AMuSR Workflow

Run Multitask Network Inference with TFA-AMuSR.

class inferelator.amusr_workflow.MultitaskLearningWorkflow

Bases: SingleCellWorkflow

Class that implements multitask learning. Handles loading and validation of multiple data packages

create_task(task_name=None, input_dir=None, expression_matrix_file=None, meta_data_file=None, tf_names_file=None, priors_file=None, gold_standard_file=None, gene_names_file=None, gene_metadata_file=None, workflow_type='single-cell', **kwargs)

Create a task object and set any arguments to this function as attributes of that task object. TaskData objects are stored internally in _task_objects.

Parameters:

task_name (str) – A descriptive name for this task
input_dir (str) – A path containing the input files
expression_matrix_file (str) – Path to the expression data
meta_data_file (str, optional) – Path to the meta data
tf_names_file (str) – Path to a list of regulator names to include in the model
priors_file (str) – Path to a prior data file
gene_metadata_file (str, optional) – Path to a genes annotation file
gene_names_file (str, optional) – Path to a list of genes to include in the model
workflow_type (str, inferelator.BaseWorkflow subclass) – The type of workflow for data preprocessing. “tfa” uses the TFA workflow, “single-cell” uses the Single-Cell TFA workflow
kwargs – Any additional arguments are assigned to the task object

Returns:

Returns a task reference which can be additionally modified by calling any valid Workflow function to set task parameters

Return type:

TaskData instance

set_task_filters(regulator_expression_filter=None, target_expression_filter=None)

Set the filtering criteria for regulators and targets between tasks

Parameters:

regulator_expression_filter (str, optional) – “union” includes regulators which are present in any task, “intersection” includes regulators which are present in all tasks
target_expression_filter (str, optional) – “union” includes targets which are present in any task, “intersection” includes targets which are present in all tasks
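The “union” and “intersection” options correspond directly to set operations over the names present in each task. A minimal sketch, with a hypothetical filter_across_tasks function and made-up task and regulator names:

```python
def filter_across_tasks(names_by_task, how="intersection"):
    """Combine per-task name lists by set union or set intersection."""
    sets = [set(names) for names in names_by_task.values()]
    if not sets:
        return set()
    if how == "union":
        return set.union(*sets)
    if how == "intersection":
        return set.intersection(*sets)
    raise ValueError("how must be 'union' or 'intersection'")


regulators_by_task = {
    "task1": ["TF1", "TF2", "TF3"],
    "task2": ["TF2", "TF3", "TF4"],
}
in_any_task = filter_across_tasks(regulators_by_task, how="union")
in_all_tasks = filter_across_tasks(regulators_by_task, how="intersection")
```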

Cross-Validation Workflow Wrapper

This is a manager which will take an Inferelator workflow and repeatedly run it with different parameters. This is implemented using deep copies; it is therefore memory-intensive.

class inferelator.crossvalidation_workflow.CrossValidationManager(workflow_object=None)

Bases: object

Crossvalidate an Inferelator Workflow

__init__(workflow_object=None)

Create a new CrossValidationManager instance and give it a workflow

Parameters:

workflow_object (Workflow) – The workflow to run crossvalidation with

add_gridsearch_parameter(param_name, param_vector)

Set a parameter to search through by exhaustive grid search

Parameters:

param_name (str) – The workflow parameter to change for each run
param_vector (iterable) – An iterable with values to use for the parameter
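An exhaustive grid search runs once for every combination of the registered parameter values. The sketch below enumerates those combinations with itertools.product; the grid function is hypothetical, the parameter names are used only as example dictionary keys, and how the manager copies and runs the workflow for each combination is not shown.

```python
from itertools import product


def grid(params):
    """Yield one dict of settings per combination of parameter values.

    params maps parameter name -> iterable of values to try.
    """
    names = list(params)
    for values in product(*(params[name] for name in names)):
        yield dict(zip(names, values))


search_space = {
    "cv_split_ratio": [0.2, 0.5],
    "num_bootstraps": [2, 5, 10],
}
combinations = list(grid(search_space))
```

The number of runs is the product of the vector lengths (here 2 x 3 = 6), which is why adding parameters to a grid search grows the run time quickly.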

add_grouping_dropin(metadata_column_name, group_size=None, seed=42)

Run modeling on each group (defined by a metadata column) individually.

Parameters:

metadata_column_name (str) – Metadata column which has different values for each group
group_size (int, None) – The maximum size of each group. Groups will be downsampled to the same size if this is not set to None. Default is None.
seed (int) – The random seed to use for the group downsampling (this is not the same as the seed passed to the workflow)

add_grouping_dropout(metadata_column_name, group_size=None, seed=42)

Drop each group (defined by a metadata column) and run modeling on all of the other groups.

Parameters:

metadata_column_name (str) – Metadata column which has different values for each group
group_size (int, None) – The maximum size of each group. Groups will be downsampled to the same size if this is not set to None. Default is None.
seed (int) – The random seed to use for the group downsampling (this is not the same as the seed passed to the workflow)
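The difference between drop-in and drop-out can be sketched as two ways of selecting observations per group. The functions below are hypothetical and operate on a list of metadata rows; downsampling with group_size is not shown.

```python
def group_indices(metadata, column):
    """Map each group value to the indices of its observations."""
    groups = {}
    for i, row in enumerate(metadata):
        groups.setdefault(row[column], []).append(i)
    return groups


def dropin_runs(metadata, column):
    """One run per group, using only that group's observations."""
    return group_indices(metadata, column)


def dropout_runs(metadata, column):
    """One run per group, using every observation outside that group."""
    runs = {}
    for name, members in group_indices(metadata, column).items():
        excluded = set(members)
        runs[name] = [i for i in range(len(metadata)) if i not in excluded]
    return runs


metadata = [{"cond": "A"}, {"cond": "A"}, {"cond": "B"}, {"cond": "C"}]
dropin = dropin_runs(metadata, "cond")
dropout = dropout_runs(metadata, "cond")
```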

add_size_subsampling(size_vector, stratified_column_name=None, with_replacement=False, seed=42, size_sample_only=None)

Resample expression data to a ratio of the original data.

Parameters:

size_vector (iterable(floats)) – An iterable with numeric ratios for downsampling. These values must be between 0 and 1.
stratified_column_name (str, None) – Set this to stratify sampling (to maintain group size ratios). If None, do not maintain group size ratios. Default is None.
with_replacement (bool) – Do sampling with or without replacement. Defaults to False
seed (int) – The random seed to use when selecting observations (this is not the same as the seed passed to the workflow)
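Stratified subsampling draws the same ratio from every group so that relative group sizes are preserved. A minimal sketch using the standard library follows; the subsample_indices function and its rounding rule are assumptions, and it illustrates only the stratified_column_name, with_replacement, and seed options.

```python
import random
from collections import defaultdict


def subsample_indices(n_obs, ratio, strata=None, with_replacement=False, seed=42):
    """Pick observation indices at the given ratio, optionally per stratum.

    strata is a list with one group label per observation, or None to
    sample from all observations together.
    """
    if not 0 <= ratio <= 1:
        raise ValueError("ratio must be between 0 and 1")

    rng = random.Random(seed)
    groups = defaultdict(list)
    for i in range(n_obs):
        groups[None if strata is None else strata[i]].append(i)

    chosen = []
    for members in groups.values():
        k = round(len(members) * ratio)
        if with_replacement:
            chosen.extend(rng.choices(members, k=k))
        else:
            chosen.extend(rng.sample(members, k))
    return sorted(chosen)


labels = ["a"] * 8 + ["b"] * 4
picked = subsample_indices(len(labels), ratio=0.5, strata=labels)
picked_labels = [labels[i] for i in picked]
```

Sampling half of this data with stratification keeps the 2:1 ratio between the two groups, whereas unstratified sampling would only preserve it on average.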