run_pathfindR is the wrapper function for the pathfindR workflow

run_pathfindR(
input,
gene_sets = "KEGG",
min_gset_size = 10,
max_gset_size = 300,
custom_genes = NULL,
custom_descriptions = NULL,
pin_name_path = "Biogrid",
p_val_threshold = 0.05,
visualize_enriched_terms = TRUE,
max_to_plot = 10,
convert2alias = TRUE,
enrichment_threshold = 0.05,
search_method = "GR",
use_all_positives = FALSE,
saTemp0 = 1,
saTemp1 = 0.01,
saIter = 10000,
gaPop = 400,
gaIter = 200,
gaCrossover = 1,
gaMut = 0,
grMaxDepth = 1,
grSearchDepth = 1,
grOverlap = 0.5,
grSubNum = 1000,
iterations = 10,
n_processes = NULL,
score_quan_thr = 0.8,
sig_gene_thr = 0.02,
plot_enrichment_chart = TRUE,
output_dir = "pathfindR_Results",
list_active_snw_genes = FALSE,
silent_option = TRUE
)

## Arguments

input

the input data that pathfindR uses. The input must be a data frame with three columns:

1. Gene Symbol

2. Change value, e.g. log(fold change) (OPTIONAL)

3. p value, e.g. adjusted p value associated with differential expression
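As a minimal sketch of the expected layout, the data frame below uses base R only. The column names and all values are made up for illustration; the description above constrains the column order and content, not the names.

```r
# Hypothetical differential-expression results; the values are invented
input_df <- data.frame(
  Gene.symbol = c("TP53", "BRCA1", "EGFR", "MYC"),
  logFC       = c(1.8, -2.1, 0.9, -0.4),
  adj.P.Val   = c(0.001, 0.0004, 0.03, 0.2),
  stringsAsFactors = FALSE
)

# The change column is optional, so a two-column variant keeps only
# the gene symbols and the p values
input_no_change <- input_df[, c("Gene.symbol", "adj.P.Val")]

str(input_df)
```

Either `input_df` or `input_no_change` could then be passed as `input`.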

gene_sets

Name of the gene sets to be used for enrichment analysis. Available gene sets are "KEGG", "Reactome", "BioCarta", "GO-All", "GO-BP", "GO-CC", "GO-MF", "cell_markers", "mmu_KEGG" or "Custom". If "Custom", the arguments custom_genes and custom_descriptions must be specified. (Default = "KEGG")

min_gset_size

minimum number of genes a term must contain (default = 10)

max_gset_size

maximum number of genes a term may contain (default = 300)

custom_genes

a list containing the genes involved in each custom term. Each element is a vector of gene symbols located in the given custom term. Names should correspond to the IDs of the custom terms.

custom_descriptions

A vector containing the descriptions for each custom term. Names of the vector should correspond to the IDs of the custom terms.
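The shape of the two custom arguments can be sketched as follows. The term IDs, gene memberships and descriptions are hypothetical, and real gene sets would typically be much larger than these toy ones.

```r
# Hypothetical custom terms: a named list of gene-symbol vectors
custom_genes <- list(
  TERM_A = c("TP53", "MDM2", "CDKN1A"),
  TERM_B = c("EGFR", "KRAS", "BRAF", "MAPK1")
)

# Hypothetical descriptions: a named character vector with the same IDs
custom_descriptions <- c(
  TERM_A = "Illustrative term A",
  TERM_B = "Illustrative term B"
)

# Both objects should be keyed by the same term IDs
stopifnot(setequal(names(custom_genes), names(custom_descriptions)))

# Not run: the corresponding call would look like
# run_pathfindR(input_df, gene_sets = "Custom",
#               custom_genes = custom_genes,
#               custom_descriptions = custom_descriptions)
```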

pin_name_path

Name of the chosen PIN or absolute/path/to/PIN.sif. If PIN name, must be one of c("Biogrid", "STRING", "GeneMania", "IntAct", "KEGG", "mmu_STRING"). If path/to/PIN.sif, the file must comply with the PIN specifications. (Default = "Biogrid")

p_val_threshold

the p value threshold to use when filtering the input data frame. Must be a numeric value between 0 and 1. (default = 0.05)

visualize_enriched_terms

Boolean value to indicate whether or not to create diagrams for enriched terms (default = TRUE)

max_to_plot

(necessary only if gene_sets = "KEGG" and visualize_enriched_terms = TRUE) The number of top hsa KEGG pathways to visualize. If NULL, all are visualized (default = 10)

convert2alias

boolean to indicate whether or not to convert gene symbols in the input that are not found in the PIN to an alias symbol found in the PIN (default = TRUE). IMPORTANT NOTE: the conversion uses human gene symbols/alias symbols.

enrichment_threshold

adjusted-p value threshold used when filtering enrichment results (default = 0.05)

correction method to be used for adjusting p-values. (default = "bonferroni")

search_method

algorithm to use when performing active subnetwork search. Options are greedy search (GR), simulated annealing (SA) or genetic algorithm (GA) for the search (default = "GR").

use_all_positives

if TRUE: in GA, adds an individual with all positive nodes. In SA, initializes candidate solution with all positive nodes. (default = FALSE)

saTemp0

Initial temperature for SA (default = 1.0)

saTemp1

Final temperature for SA (default = 0.01)

saIter

Iteration number for SA (default = 10000)

gaPop

Population size for GA (default = 400)

gaIter

Iteration number for GA (default = 200)

Number of threads to be used in GA (default = 5)

gaCrossover

Applies crossover with the given probability in GA (default = 1, i.e. always perform crossover)

gaMut

For GA, applies mutation with given mutation rate (default = 0, i.e. mutation off)

grMaxDepth

Sets max depth in greedy search, 0 for no limit (default = 1)

grSearchDepth

Search depth in greedy search (default = 1)

grOverlap

Overlap threshold for results of greedy search (default = 0.5)

grSubNum

Number of subnetworks to be presented in the results (default = 1000)

iterations

number of iterations for active subnetwork search and enrichment analyses (Default = 10)

n_processes

optional argument for specifying the number of processes used by foreach. If not specified, the function determines this automatically (Default = NULL. Gets set to 1 for Genetic Algorithm)

score_quan_thr

active subnetwork score quantile threshold. Must be between 0 and 1 or set to -1 for not filtering. (Default = 0.8)

sig_gene_thr

threshold for the minimum proportion of significant genes in the subnetwork (Default = 0.02). If the number of genes to use as the threshold is calculated to be < 2 (e.g. 50 significant genes x 0.01 = 0.5), the threshold number is set to 2.
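The two subnetwork filters controlled by score_quan_thr and sig_gene_thr can be illustrated with a small base R sketch. This is not the package's implementation: the subnetworks, scores and gene names are hypothetical, and the exact comparisons (strictly above the score quantile, at or above the gene-count threshold) are assumptions.

```r
# Hypothetical active subnetworks, their scores and significant input genes
subnetworks <- list(
  snw1 = c("A", "B", "C", "D"),
  snw2 = c("A", "E"),
  snw3 = c("F", "G", "H"),
  snw4 = c("B", "C", "I"),
  snw5 = c("J", "K")
)
scores <- c(snw1 = 9.1, snw2 = 3.2, snw3 = 1.5, snw4 = 7.4, snw5 = 0.8)
signif_genes <- c("A", "B", "C", "E", "F")

score_quan_thr <- 0.5  # lower than the default 0.8 so the toy data keeps two subnetworks
sig_gene_thr <- 0.02

# Step 1: score quantile filter (assumed to be skipped when set to -1)
if (score_quan_thr != -1) {
  score_cutoff <- unname(stats::quantile(scores, score_quan_thr))
  subnetworks <- subnetworks[scores > score_cutoff]
}

# Step 2: minimum number of significant genes, never allowed below 2
min_sig_genes <- max(2, sig_gene_thr * length(signif_genes))
n_sig <- vapply(subnetworks, function(g) sum(g %in% signif_genes), integer(1))
kept <- subnetworks[n_sig >= min_sig_genes]

names(kept)
```

Here 0.02 x 5 significant genes gives 0.1, so the threshold is raised to 2, matching the rule described for sig_gene_thr.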

plot_enrichment_chart

boolean value. If TRUE, a bubble chart displaying the enrichment results is plotted. (default = TRUE)

output_dir

the directory to be created where the output and intermediate files are saved (default = "pathfindR_Results")

list_active_snw_genes

boolean value indicating whether or not to report the non-significant active subnetwork genes for the active subnetwork which was enriched for the given term with the lowest p value (default = FALSE)

silent_option

boolean value indicating whether to print the messages to the console (FALSE) or not (TRUE, in which case they are written to a temporary file) during active subnetwork search (default = TRUE). This option was added because console messages are printed out of order during parallel runs.

## Value

Data frame of pathfindR enrichment results. Columns are:

ID

ID of the enriched term

Term_Description

Description of the enriched term

Fold_Enrichment

Fold enrichment value for the enriched term (Calculated using ONLY the input genes)

occurrence

the number of iterations in which the given term was found to be enriched, over all iterations

support

the median support (proportion of active subnetworks leading to enrichment within an iteration) over all iterations

lowest_p

the lowest adjusted-p value of the given term over all iterations

highest_p

the highest adjusted-p value of the given term over all iterations

non_Signif_Snw_Genes (OPTIONAL)

the non-significant active subnetwork genes, comma-separated

Up_regulated

the up-regulated genes (as determined by change value > 0, if the change column was provided) in the input involved in the given term's gene set, comma-separated. If the change column was not provided, all affected genes are listed here.

Down_regulated

the down-regulated genes (as determined by change value < 0, if the change column was provided) in the input involved in the given term's gene set, comma-separated

The function also creates an HTML report with the pathfindR enrichment results linked to the visualizations of the enriched terms in addition to the table of converted gene symbols. This report can be found in "output_dir/results.html" under the current working directory.

By default, a bubble chart of the top 10 enrichment results is plotted. The x-axis corresponds to fold enrichment values while the y-axis indicates the enriched terms. Sizes of the bubbles indicate the number of significant genes in the given terms. Color indicates the -log10(lowest-p) value; the redder the bubble, the more significant the enriched term. See enrichment_chart.

## Details

This function takes in a data frame consisting of Gene Symbol, log-fold-change and adjusted-p values. After input testing, any gene symbols that are not in the PIN are converted to alias symbols if the alias is in the PIN. Next, active subnetwork search is performed. Enrichment analysis is then performed using the genes in each of the active subnetworks. Terms with adjusted-p values higher than enrichment_threshold are discarded, and the lowest adjusted-p value (over all subnetworks) for each term is kept. This process of active subnetwork search and enrichment is repeated for a selected number of iterations, which are run in parallel. Over all iterations, the lowest and the highest adjusted-p values, as well as the number of occurrences, are reported for each enriched term.
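The final aggregation over iterations can be sketched in base R as follows. This is only an illustration of the reported quantities, not the package's code (see summarize_enrichment_results for that); the per-iteration table, its column names and its values are hypothetical.

```r
# Hypothetical per-iteration results: one row per enriched term per iteration
iter_res <- data.frame(
  iteration = c(1, 1, 2, 2, 3),
  ID        = c("T1", "T2", "T1", "T2", "T1"),
  adj_p     = c(0.010, 0.040, 0.002, 0.030, 0.020),
  support   = c(0.20, 0.05, 0.30, 0.10, 0.25),
  stringsAsFactors = FALSE
)

# One summary row per term
summary_df <- data.frame(ID = sort(unique(iter_res$ID)), stringsAsFactors = FALSE)
ids <- summary_df$ID

# Number of iterations in which each term was enriched
summary_df$occurrence <- as.vector(table(iter_res$ID)[ids])
# Median support over the iterations in which the term appeared
summary_df$support <- as.vector(tapply(iter_res$support, iter_res$ID, median)[ids])
# Lowest and highest adjusted p values over all iterations
summary_df$lowest_p <- as.vector(tapply(iter_res$adj_p, iter_res$ID, min)[ids])
summary_df$highest_p <- as.vector(tapply(iter_res$adj_p, iter_res$ID, max)[ids])

summary_df
```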

## Warning

Depending on the protein interaction network, the algorithm and the number of iterations you choose, the "active subnetwork search + enrichment" component of run_pathfindR may take a long time to finish.

## See also

input_testing for input testing, input_processing for input processing, active_snw_search for active subnetwork search and subnetwork filtering, enrichment_analyses for enrichment analysis (using the active subnetworks), summarize_enrichment_results for summarizing the active-subnetwork-oriented enrichment results, annotate_term_genes for annotation of affected genes in the given gene sets, visualize_terms for visualization of enriched terms, enrichment_chart for a visual summary of the pathfindR enrichment results, foreach for details on parallel execution of looping constructs, cluster_enriched_terms for clustering the resulting enriched terms and partitioning into clusters.
if (FALSE) {