Workflows
Workflow Constructor
Construct inferelator workflows from preprocessing, postprocessing, and regression modules
- inferelator.workflow.inferelator_workflow(regression=<class 'inferelator.regression.base_regression._RegressionWorkflowMixin'>, workflow=<class 'inferelator.workflows.workflow_base.WorkflowBase'>)
Create and instantiate an Inferelator workflow.
- Parameters:
regression (str, RegressionWorkflow subclass) –
A class object which implements the run_regression and run_bootstrap methods for a specific regression strategy. This can be provided as a string.
“base” loads a non-functional regression stub.
“bbsr” loads Bayesian Best Subset Regression.
“elasticnet” loads Elastic Net Regression.
“sklearn” loads scikit-learn Regression.
“stars” loads the StARS stability Regression.
“amusr” loads AMuSR Regression. This requires the multitask workflow.
Defaults to “base”.
workflow (str, WorkflowBase subclass) –
A class object which implements the necessary data loading and preprocessing to create design & response data for the regression strategy, and then the postprocessing to turn regression betas into a network. This can be provided as a string.
“base” loads a non-functional workflow stub.
“tfa” loads the TFA-based workflow.
“single-cell” loads the Single Cell TFA-based workflow.
“multitask” loads the multitask workflow.
Defaults to “base”.
- Returns:
This returns an initialized object which has both the regression workflow and the preprocessing/postprocessing workflow. This object can then have settings assigned to it, and can be run with .run()
- Return type:
Workflow instance
Common Workflow
- class inferelator.workflow.WorkflowBase
WorkflowBase handles crossvalidation, shuffling, and validating priors and gold standards
- run()
Execute workflow, after all configuration.
- set_crossvalidation_parameters(split_gold_standard_for_crossvalidation=None, cv_split_ratio=None, cv_split_axis=None)
Set parameters for crossvalidation.
- Parameters:
split_gold_standard_for_crossvalidation (bool) – Boolean flag indicating if the gold standard should be split. Must be set to True for other crossvalidation settings to have an effect. Defaults to False.
cv_split_ratio (float) – The proportion of the gold standard which should be retained for scoring. The rest will be used to train the model. Must be set between 0 and 1.
cv_split_axis (int, None) –
How to split the gold standard.
If 0, split genes; this will take all the data for certain genes and keep it in the gold standard. These genes will be removed from the prior.
If 1, split regulators; this will take all the data for certain regulators and keep it in the gold standard. These regulators will be removed from the prior. Splitting regulators is inadvisable.
If None, the prior will be replaced with a downsampled gold standard.
Setting this to 0 is generally the best choice. Defaults to None.
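The gene split (cv_split_axis = 0) can be pictured with a small sketch. This is an illustration of the idea only, not the package's implementation: `split_genes` is a hypothetical function, genes are plain strings, and the rounding rule for the number of held-out genes is an assumption.

```python
import random


def split_genes(gold_standard_genes, prior_genes, cv_split_ratio, seed=42):
    """Illustrative axis-0 split: hold out a fraction of genes for scoring."""
    rng = random.Random(seed)
    genes = sorted(gold_standard_genes)
    # Rounding to the nearest whole gene is an assumption of this sketch.
    n_holdout = int(round(len(genes) * cv_split_ratio))
    holdout = set(rng.sample(genes, n_holdout))
    # Held-out genes stay in the gold standard and are removed from the prior.
    scoring_gold_standard = holdout
    training_prior = set(prior_genes) - holdout
    return scoring_gold_standard, training_prior


gs, prior = split_genes(
    ["g1", "g2", "g3", "g4", "g5"],
    ["g1", "g2", "g3", "g4", "g5", "g6"],
    cv_split_ratio=0.2,
)
```

The held-out genes and the training prior never overlap, which is what keeps the scoring set independent of the data used to fit the model.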
- static set_output_file_names(network_file_name='', confidence_file_name='', nonzero_coefficient_file_name='', pdf_curve_file_name='', curve_data_file_name='', model_h5_file_name='')
Set output file names. File names that end in ‘.gz’ will be gzipped. Set any file name to None to prevent it from being generated.
- Parameters:
network_file_name (str) – Long-format network TSV file with TF->Gene edge information. Default is “network.tsv”.
confidence_file_name (str) – Genes x TFs TSV with confidence scores for each edge. Default is “combined_confidences.tsv”.
nonzero_coefficient_file_name (str) – Genes x TFs TSV with the non-zero model coefficients for each edge. Default is “model_coefficients.tsv.gz”.
pdf_curve_file_name (str) – PDF file with plotted curve(s). Default is “combined_metrics.pdf”.
curve_data_file_name (str) – TSV file with the data used to plot curves. Default is None (this file is not produced).
model_h5_file_name (str) – H5 file with model priors, coefficients, and run parameters saved.
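The two naming rules (a ‘.gz’ suffix triggers compression, None suppresses the file) can be sketched with the standard library. `write_output` is a hypothetical helper written for this illustration and the file contents are placeholders; the package's own writer is not shown here.

```python
import gzip
import os
import tempfile


def write_output(path, text):
    """Illustrative writer: skip None, gzip names ending in '.gz'."""
    if path is None:
        return False  # None means the file is not produced
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt") as handle:
        handle.write(text)
    return True


out_dir = tempfile.mkdtemp()
plain = os.path.join(out_dir, "network.tsv")
zipped = os.path.join(out_dir, "model_coefficients.tsv.gz")

# Placeholder content; real output files have their own columns.
wrote_plain = write_output(plain, "regulator\ttarget\n")
wrote_zipped = write_output(zipped, "regulator\ttarget\n")
wrote_none = write_output(None, "ignored")
```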
- set_postprocessing_parameters(gold_standard_filter_method=None, metric=None)
Set parameters for the postprocessing engine
- Parameters:
gold_standard_filter_method (str) – A flag that determines if the gold standard should be shrunk to the size of the produced model. “overlap” will only score on overlap between the gold standard and the inferred gene regulatory network. “keep_all_gold_standard” will score on the entire gold standard. Defaults to “keep_all_gold_standard”.
metric (str) – The model metric to use for scoring. Supports “precision-recall”, “mcc”, “f1”, and “combined”. Defaults to “combined”.
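For reference, the F1 and MCC metrics named above follow standard definitions based on confusion-matrix counts. The sketch below uses those textbook formulas with hypothetical counts; it does not show how the package thresholds confidence scores or combines metrics, and it does not guard against zero denominators.

```python
import math


def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a thresholded network scored against a gold standard.
example_f1 = f1_score(tp=8, fp=2, fn=2)
example_mcc = mcc(tp=8, tn=8, fp=2, fn=2)
```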
- set_run_parameters(num_bootstraps=None, random_seed=None, use_mkl=None, use_numba=None)
Set parameters used during runtime
- Parameters:
num_bootstraps (int) – The number of bootstraps to run. Defaults to 2.
random_seed (int) – The random number seed to use. Defaults to 42.
use_mkl (bool) – A flag to indicate if the Intel MKL library should be used for matrix multiplication. Defaults to False.
use_numba (bool) – A flag to indicate if numba should be used to accelerate the calculations. Requires numba to be installed if set. Currently only accelerates AMuSR regression. Defaults to True.
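The interaction between num_bootstraps and random_seed can be illustrated with a generic bootstrap: each bootstrap draws sample indices with replacement, and a fixed seed makes the draws repeatable. `bootstrap_indices` is a hypothetical function for this sketch; the package's own resampling code may differ.

```python
import random


def bootstrap_indices(num_samples, num_bootstraps=2, random_seed=42):
    """Illustrative: draw sample indices with replacement, once per bootstrap."""
    rng = random.Random(random_seed)
    return [
        [rng.randrange(num_samples) for _ in range(num_samples)]
        for _ in range(num_bootstraps)
    ]


# The same seed gives the same resamples on every call.
first = bootstrap_indices(num_samples=5)
second = bootstrap_indices(num_samples=5)
```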
- set_shuffle_parameters(shuffle_prior_axis=None, make_data_noise=None, add_prior_noise=None)
Set parameters for shuffling labels on a prior axis. This is useful to establish a baseline.
- Parameters:
shuffle_prior_axis (int, None) – The axis for shuffling prior labels. 0 shuffles gene labels. 1 shuffles regulator labels. None means labels will not be shuffled. Defaults to None.
make_data_noise (bool, None) – Replace loaded data with simulated data that is entirely random. This retains type; integer data remains integer, float remains float. Gene distributions should be centered around the mean of gene expression in the original data, but are otherwise random.
add_prior_noise (numeric, None) – Add random edges to the prior data. This is a numeric value between 0 and 1 such that 0 adds no edges, 1 sets every edge in the prior to True, 0.1 sets 10% of the edges in the prior to True, and so on. Note that this will binarize the prior if it is not already binary.
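Shuffling gene labels (shuffle_prior_axis = 0) keeps every row of prior edges intact and only changes which gene each row is attached to. A minimal sketch, assuming a hypothetical prior stored as a dict from gene to its set of regulators:

```python
import random


def shuffle_gene_labels(prior, seed=42):
    """Illustrative axis-0 shuffle: reassign each row of edges to a random gene."""
    rng = random.Random(seed)
    genes = list(prior)
    shuffled = genes[:]
    rng.shuffle(shuffled)
    # Rows keep their contents; only the gene label attached to each row changes.
    return {new: prior[old] for old, new in zip(genes, shuffled)}


# Hypothetical prior: gene -> set of regulators with a prior edge.
prior = {"g1": {"tfA"}, "g2": {"tfB"}, "g3": {"tfA", "tfB"}, "g4": set()}
shuffled_prior = shuffle_gene_labels(prior)
```

Because the number of edges per regulator is unchanged, a model built on the shuffled prior gives a baseline that differs from the real run only in the gene assignments.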
Transcription Factor Activity (TFA) Workflow
Implementation for the Transcription Factor Activity (TFA) based Inferelator workflow.
This workflow also has a design driver which will incorporate timecourse data.
This is the standard workflow for most applications.
- class inferelator.tfa_workflow.TFAWorkFlow
Bases: WorkflowBase
TFAWorkFlow runs the timecourse driver and the TFA driver prior to regression.
- run()
Execute workflow, after all configuration.
- set_design_settings(timecourse_response_driver=None, delTmin=None, delTmax=None, tau=None)
Set the parameters used in the timecourse design-response driver.
- Parameters:
timecourse_response_driver (bool) – A flag to indicate that the timecourse calculations should be performed. If set False, no other timecourse settings will have any effect. Defaults to True.
delTmin (int, float) – The minimum allowed time difference between timepoints to model as a time series. Provide in the same units as the metadata time column (usually minutes). Defaults to 0.
delTmax (int, float) – The maximum allowed time difference between timepoints to model as a time series. Provide in the same units as the metadata time column (usually minutes). Defaults to 120.
tau (int, float) – The tau parameter. Provide in the same units as the metadata time column (usually minutes). Defaults to 45.
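The effect of delTmin and delTmax can be sketched as a filter on the gap between consecutive timepoints. This is an illustration only: `timecourse_pairs` is hypothetical, it treats all times as one series, and the inclusive bounds are an assumption. How the package identifies which samples belong to the same series is not covered here.

```python
def timecourse_pairs(times, delTmin=0, delTmax=120):
    """Illustrative: pair consecutive timepoints whose gap is within bounds."""
    ordered = sorted(times)
    pairs = []
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        # Inclusive bounds are an assumption of this sketch.
        if delTmin <= gap <= delTmax:
            pairs.append((previous, current))
    return pairs


# Hypothetical sampling times in minutes; the 60 -> 240 gap exceeds delTmax.
pairs = timecourse_pairs([0, 15, 30, 60, 240])
```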
- set_tfa(tfa_driver=None, tfa_output_file=None, tfa_input_file=None, tfa_input_file_type=None)
Perform or skip the TFA calculations; by default the design matrix will be transcription factor activity. If this is called with tfa_driver = False, the design matrix will be transcription factor expression. It is not necessary to call this function unless setting tfa_driver = False.
- Parameters:
tfa_driver (bool) – A flag to indicate that the TFA calculations should be performed. Defaults to True.
tfa_output_file (str, optional) – A path to a TSV file which will be created with the calculated TFAs. Note that this file may contain TF expression if the TFA cannot be calculated for that TF. If None, no output file will be produced. Defaults to None.
tfa_input_file (str, optional) – A path to a TFA file which will be loaded and used in place of activity calculations. If set, all TFA-related settings will be irrelevant. TSV file MUST be Samples X TFA. If None, the inferelator will calculate TFA. Defaults to None.
tfa_input_file_type (str, optional) – A string which identifies file type. Accepts “tsv” and “h5ad”. If None, assume the file is a TSV. Defaults to None.
Single-Cell Workflow
Run Single Cell Network Inference. This is the same TFA network inference with some extra preprocessing functionality.
- class inferelator.single_cell_workflow.SingleCellWorkflow
Bases: TFAWorkFlow
SingleCellWorkflow has some additional preprocessing prior to calculating TFA and running regression
- add_preprocess_step(fun, **kwargs)
Add a preprocessing step after count filtering but before calculating TFA or regression.
- Parameters:
fun (str, callable) – Preprocessing function. Can be provided as a string or as a function in preprocessing.single_cell.
“log10” will take the log10 of pseudocounts.
“ln” will take the natural log of pseudocounts.
“log2” will take the log2 of pseudocounts.
“fft” will do the Freeman-Tukey transform.
kwargs – Additional arguments to the preprocessing function.
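The named transforms can be written out as plain functions. The sketch below rests on two assumptions: that “pseudocount” means the raw count plus one, and that the Freeman-Tukey transform takes its textbook form sqrt(x) + sqrt(x + 1). The functions in preprocessing.single_cell may use a different variant.

```python
import math

# Assumption: "pseudocount" means the raw count plus one.
TRANSFORMS = {
    "log10": lambda x: math.log10(x + 1),
    "ln": lambda x: math.log1p(x),
    "log2": lambda x: math.log2(x + 1),
    # Textbook Freeman-Tukey form; the package's exact variant may differ.
    "fft": lambda x: math.sqrt(x) + math.sqrt(x + 1),
}


def apply_transform(name, counts):
    """Apply one named transform to a list of non-negative counts."""
    return [TRANSFORMS[name](value) for value in counts]


log2_values = apply_transform("log2", [0, 1, 3])
```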
- run()
Execute workflow, after all configuration.
- set_count_minimum(count_minimum=None)
Set the minimum count value for each gene (averaged over all samples)
- Parameters:
count_minimum (float) – The mean expression value which is required to retain a gene for modeling. Data that has already been normalized should probably be filtered during normalization, not now. Defaults to None (disabled).
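The count filter amounts to comparing each gene's mean count across samples against the threshold. A minimal sketch with a hypothetical `filter_genes_by_mean` function; whether the package keeps genes exactly at the threshold is an assumption here.

```python
def filter_genes_by_mean(counts_by_gene, count_minimum=None):
    """Illustrative: keep genes whose mean count reaches the minimum."""
    if count_minimum is None:
        return dict(counts_by_gene)  # filtering disabled
    return {
        gene: counts
        for gene, counts in counts_by_gene.items()
        # Using >= is an assumption; the package may use a strict comparison.
        if sum(counts) / len(counts) >= count_minimum
    }


# Hypothetical counts: gene -> counts across four cells.
counts = {"g1": [0, 0, 1, 0], "g2": [2, 3, 1, 2], "g3": [0, 0, 0, 0]}
kept = filter_genes_by_mean(counts, count_minimum=0.5)
```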
Multi-Task AMuSR Workflow
Run Multitask Network Inference with TFA-AMuSR.
- class inferelator.amusr_workflow.MultitaskLearningWorkflow
Bases: SingleCellWorkflow
Class that implements multitask learning. Handles loading and validation of multiple data packages
- create_task(task_name=None, input_dir=None, expression_matrix_file=None, meta_data_file=None, tf_names_file=None, priors_file=None, gold_standard_file=None, gene_names_file=None, gene_metadata_file=None, workflow_type='single-cell', **kwargs)
Create a task object and set any arguments to this function as attributes of that task object. TaskData objects are stored internally in _task_objects.
- Parameters:
task_name (str) – A descriptive name for this task
input_dir (str) – A path containing the input files
expression_matrix_file (str) – Path to the expression data
meta_data_file (str, optional) – Path to the meta data
tf_names_file (str) – Path to a list of regulator names to include in the model
priors_file (str) – Path to a prior data file
gene_metadata_file (str, optional) – Path to a genes annotation file
gene_names_file (str, optional) – Path to a list of genes to include in the model
workflow_type (str, inferelator.BaseWorkflow subclass) – The type of workflow for data preprocessing. “tfa” uses the TFA workflow, “single-cell” uses the Single-Cell TFA workflow
kwargs – Any additional arguments are assigned to the task object
- Returns:
Returns a task reference which can be additionally modified by calling any valid Workflow function to set task parameters
- Return type:
TaskData instance
- set_task_filters(regulator_expression_filter=None, target_expression_filter=None)
Set the filtering criteria for regulators and targets between tasks
- Parameters:
regulator_expression_filter (str, optional) – “union” includes regulators which are present in any task, “intersection” includes regulators which are present in all tasks
target_expression_filter (str, optional) – “union” includes targets which are present in any task, “intersection” includes targets which are present in all tasks
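The “union” and “intersection” options correspond directly to set operations over the names present in each task. A minimal sketch with a hypothetical `filter_across_tasks` function and made-up regulator names:

```python
def filter_across_tasks(names_by_task, how="intersection"):
    """Combine per-task name sets with "union" or "intersection"."""
    sets = [set(names) for names in names_by_task.values()]
    if how == "union":
        return set.union(*sets)
    if how == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"Unknown filter: {how!r}")


# Hypothetical regulators observed in two tasks.
regulators = {"task_1": ["tfA", "tfB"], "task_2": ["tfB", "tfC"]}
in_any = filter_across_tasks(regulators, "union")
in_all = filter_across_tasks(regulators, "intersection")
```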
Cross-Validation Workflow Wrapper
This is a manager which will take an Inferelator workflow and repeatedly run it with different parameters. This is implemented using deep copies; it is therefore memory-intensive.
- class inferelator.crossvalidation_workflow.CrossValidationManager(workflow_object=None)
Bases: object
Crossvalidate an Inferelator Workflow
- __init__(workflow_object=None)
Create a new CrossValidationManager instance and give it a workflow
- Parameters:
workflow_object (Workflow) – The workflow to run crossvalidation with
- add_gridsearch_parameter(param_name, param_vector)
Set a parameter to search through by exhaustive grid search
- Parameters:
param_name (str) – The workflow parameter to change for each run
param_vector (iterable) – An iterable with values to use for the parameter
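An exhaustive grid search runs once for every combination of the registered parameter values. The sketch below enumerates those combinations with `itertools.product`; `grid_combinations` is a hypothetical function, and using num_bootstraps and cv_split_ratio as param_name values is an assumption made for illustration.

```python
from itertools import product


def grid_combinations(grid):
    """Yield one {param_name: value} dict per point of an exhaustive grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Parameter names come from setters documented above; values are hypothetical.
grid = {"num_bootstraps": [2, 5], "cv_split_ratio": [0.1, 0.2, 0.5]}
runs = list(grid_combinations(grid))
```

The number of runs is the product of the vector lengths (here 2 x 3 = 6), which matters because each run works on a deep copy of the workflow.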
- add_grouping_dropin(metadata_column_name, group_size=None, seed=42)
Run modeling on each group (defined by a metadata column) individually.
- Parameters:
metadata_column_name (str) – Metadata column which has different values for each group
group_size (int, None) – The maximum size of each group. Groups will be downsampled to the same size if this is not set to None. Default is None.
seed (int) – The random seed to use for the group downsampling (this is not the same as the seed passed to the workflow)
- add_grouping_dropout(metadata_column_name, group_size=None, seed=42)
Drop each group (defined by a metadata column) and run modeling on all of the other groups.
- Parameters:
metadata_column_name (str) – Metadata column which has different values for each group
group_size (int, None) – The maximum size of each group. Groups will be downsampled to the same size if this is not set to None. Default is None.
seed (int) – The random seed to use for the group downsampling (this is not the same as the seed passed to the workflow)
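The difference between drop-in and dropout is which samples each run keeps. A minimal sketch, assuming a hypothetical metadata mapping from sample to group label; the optional group_size downsampling is omitted.

```python
def grouping_runs(group_by_sample, mode):
    """Illustrative: list the samples used in each drop-in or dropout run."""
    groups = sorted(set(group_by_sample.values()))
    runs = {}
    for group in groups:
        if mode == "dropin":
            # Model one group on its own.
            keep = [s for s, g in group_by_sample.items() if g == group]
        elif mode == "dropout":
            # Model every group except this one.
            keep = [s for s, g in group_by_sample.items() if g != group]
        else:
            raise ValueError(f"Unknown mode: {mode!r}")
        runs[group] = keep
    return runs


# Hypothetical metadata column: sample -> group label.
metadata = {"s1": "A", "s2": "A", "s3": "B", "s4": "C"}
dropin = grouping_runs(metadata, "dropin")
dropout = grouping_runs(metadata, "dropout")
```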
- add_size_subsampling(size_vector, stratified_column_name=None, with_replacement=False, seed=42, size_sample_only=None)
Resample expression data to a ratio of the original data.
- Parameters:
size_vector (iterable(floats)) – An iterable with numeric ratios for downsampling. These values must be between 0 and 1.
stratified_column_name (str, None) – Set this to stratify sampling (to maintain group size ratios). If None, do not maintain group size ratios. Default is None.
with_replacement (bool) – Do sampling with or without replacement. Defaults to False
seed (int) – The random seed to use when selecting observations (this is not the same as the seed passed to the workflow)
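Stratified subsampling draws the requested ratio from each group separately, so the groups keep their relative sizes. The following is a sketch of that idea with a hypothetical `subsample` function; the rounding rule and the sample names are assumptions, and the package's own sampling code may differ.

```python
import random


def subsample(samples, ratio, strata=None, with_replacement=False, seed=42):
    """Illustrative: resample to `ratio` of the data, optionally per stratum."""
    rng = random.Random(seed)

    def draw(pool):
        # Rounding to the nearest whole sample is an assumption of this sketch.
        n = int(round(len(pool) * ratio))
        if with_replacement:
            return [rng.choice(pool) for _ in range(n)]
        return rng.sample(pool, n)

    if strata is None:
        return draw(list(samples))
    # Sample each group separately so group size ratios are maintained.
    picked = []
    for label in sorted(set(strata.values())):
        picked.extend(draw([s for s in samples if strata[s] == label]))
    return picked


# Hypothetical data: eight samples in group A, four in group B.
samples = [f"s{i}" for i in range(12)]
strata = {s: ("A" if i < 8 else "B") for i, s in enumerate(samples)}
half = subsample(samples, 0.5, strata=strata)
```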