Create Data Frame of Features for Driver Gene Prioritization

create_features_df(
  annovar_csv_path,
  scna_df,
  phenolyzer_annotated_gene_list_path,
  batch_analysis = FALSE,
  prep_phenolyzer_input = FALSE,
  log2_ratio_threshold = 0.25,
  gene_overlap_threshold = 25,
  MCR_overlap_threshold = 25,
  hotspot_threshold = 5L,
  log2_hom_loss_threshold = -1,
  verbose = TRUE,
  na.string = "."
)

## Arguments

annovar_csv_path

path to 'ANNOVAR' csv output file

scna_df

the SCNA segments data frame. Must contain the columns: chr (chromosome the segment is located in), start (start position of the segment), end (end position of the segment), and log2ratio (log2 ratio of the segment)

phenolyzer_annotated_gene_list_path

path to 'phenolyzer' "annotated_gene_list" file

batch_analysis

boolean to indicate whether to perform batch analysis (TRUE) or personalized analysis (FALSE, default). If TRUE, a column named 'tumor_id' should be present in both the ANNOVAR csv and the SCNA table.

prep_phenolyzer_input

boolean to indicate whether or not to create a vector of genes for use as the input of 'phenolyzer' (default = FALSE). If TRUE, the features data frame is not created and instead the vector of gene symbols (union of all genes for which scores are available) is returned.

log2_ratio_threshold

the log2 ratio threshold for keeping high-confidence SCNA events (default = 0.25)

gene_overlap_threshold

the percentage threshold for the overlap between a segment and a transcript (default = 25). A transcript is assigned a segment's SCNA event only if the segment overlaps the transcript by more than this threshold.

MCR_overlap_threshold

the percentage threshold for the overlap between a gene and an MCR region (default = 25). A gene is assigned the SCNA density of an MCR only if the gene overlaps the MCR by more than this threshold.

hotspot_threshold

to determine hotspot genes, the (integer) threshold for the minimum number of cases with a certain mutation in COSMIC (default = 5)

log2_hom_loss_threshold

to determine double-hit events, the log2 threshold for identifying homozygous loss events (default = -1)

verbose

boolean controlling verbosity (default = TRUE)

na.string

string that was used to indicate when a score is not available during annotation with ANNOVAR (default = ".")
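As a rough illustration of the expected `scna_df` layout and of how a percentage overlap threshold such as `gene_overlap_threshold` might be applied, the sketch below builds a toy segments table and computes how much of a hypothetical transcript each segment covers. The coordinates, the helper `percent_overlap`, and the use of the absolute log2 ratio for filtering are assumptions made for illustration; they are not taken from the package's internal implementation.

```r
# Toy SCNA segments table with the four required columns (invented values)
scna_df <- data.frame(
  chr = c("1", "1", "2"),
  start = c(1000, 5000, 2000),
  end = c(4000, 9000, 8000),
  log2ratio = c(0.8, -0.1, -1.2),
  stringsAsFactors = FALSE
)

# Assumption: "high-confidence" means |log2ratio| exceeds the threshold
log2_ratio_threshold <- 0.25
high_conf <- scna_df[abs(scna_df$log2ratio) > log2_ratio_threshold, ]

# Hypothetical helper: percentage of a transcript covered by a segment
# (1-based, inclusive coordinates)
percent_overlap <- function(seg_start, seg_end, tx_start, tx_end) {
  overlap <- pmin(seg_end, tx_end) - pmax(seg_start, tx_start) + 1
  overlap <- pmax(overlap, 0)  # no overlap gives 0, not a negative length
  100 * overlap / (tx_end - tx_start + 1)
}

# Hypothetical transcript on chromosome 1 spanning positions 3001-5000
tx_start <- 3001
tx_end <- 5000
chr1 <- high_conf[high_conf$chr == "1", ]
pct <- percent_overlap(chr1$start, chr1$end, tx_start, tx_end)

pct       # percentage of the transcript covered by each chr1 segment
pct > 25  # comparison against a threshold of 25, as in gene_overlap_threshold
```

In this toy case only the first and third segments pass the log2 ratio filter, and the remaining chromosome 1 segment covers half of the hypothetical transcript, so it would exceed a 25 percent threshold.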

## Value

If prep_phenolyzer_input=FALSE (default), a data frame of features for prioritizing cancer driver genes (gene_symbol as the first column and 26 other columns containing features). If prep_phenolyzer_input=TRUE, the function returns a vector of gene symbols (union of all gene symbols for which scores are available) to be used as the input for performing 'phenolyzer' analysis.
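The two return modes suggest a two-step workflow: first collect the gene symbols, run 'phenolyzer' on them externally, then build the features data frame. A minimal sketch of the first step follows, reusing the example files shipped with the package. It assumes that `phenolyzer_annotated_gene_list_path` can be left out when `prep_phenolyzer_input = TRUE` (the 'phenolyzer' output does not exist yet at that point), and the output file name is hypothetical.

```r
# Run only if the driveR package is installed
if (requireNamespace("driveR", quietly = TRUE)) {
  path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                  package = "driveR")

  # Step 1: return the vector of gene symbols instead of the features table.
  # Assumption: the phenolyzer path argument may be omitted in this mode.
  phenolyzer_genes <- driveR::create_features_df(
    annovar_csv_path = path2annovar_csv,
    scna_df = driveR::example_scna_table,
    prep_phenolyzer_input = TRUE
  )

  # Write one gene symbol per line for use as 'phenolyzer' input
  # (hypothetical file name, written to a temporary directory)
  gene_file <- file.path(tempdir(), "phenolyzer_input_genes.txt")
  writeLines(phenolyzer_genes, gene_file)
}
```

After running 'phenolyzer' on that gene list, the resulting "annotated_gene_list" file would be passed as `phenolyzer_annotated_gene_list_path` in a second call with `prep_phenolyzer_input = FALSE`.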

The features data frame contains the following columns:

gene_symbol

HGNC gene symbol

metaprediction_score

the maximum metapredictor (coding) impact score for the gene

noncoding_score

the maximum non-coding PHRED-scaled CADD score for the gene

scna_score

SCNA proxy score. SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located

hotspot_double_hit

boolean indicating whether the gene is a hotspot gene (indication of oncogenes) or subject to double-hit (indication of tumor-suppressor genes)

phenolyzer_score

'phenolyzer' score for the gene

hsa03320

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04010

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04020

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04024

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04060

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04066

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04110

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04115

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04150

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04151

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04210

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04310

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04330

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04340

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04350

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04370

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04510

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04512

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04520

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04630

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04915

boolean indicating whether or not the gene takes part in this KEGG pathway
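Because the KEGG membership columns share the `hsa` prefix, they can be selected by name pattern. The sketch below uses a small stand-in data frame with invented values and placeholder gene names (not output from the package) to count how many of the listed pathways each gene belongs to; the column `n_pathways` is a hypothetical addition for illustration.

```r
# Toy stand-in for the features data frame (values invented for illustration)
features_df <- data.frame(
  gene_symbol = c("GENE_A", "GENE_B", "GENE_C"),
  metaprediction_score = c(0.91, 0.12, NA),
  hotspot_double_hit = c(TRUE, FALSE, FALSE),
  hsa04110 = c(TRUE, FALSE, FALSE),
  hsa04115 = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Select the KEGG pathway membership columns by their "hsa" + 5 digits names
kegg_cols <- grep("^hsa[0-9]{5}$", names(features_df), value = TRUE)

# Count pathway memberships per gene (TRUE counts as 1)
features_df$n_pathways <- rowSums(features_df[, kegg_cols, drop = FALSE])

# Genes that take part in at least one of the pathways
features_df[features_df$n_pathways > 0, c("gene_symbol", "n_pathways")]
```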

## See also

prioritize_driver_genes for prioritizing cancer driver genes

## Examples

# \donttest{
path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                package = "driveR")
path2phenolyzer_out <- system.file("extdata/example.annotated_gene_list",
                                   package = "driveR")
features_df <- create_features_df(annovar_csv_path = path2annovar_csv,
                                  scna_df = example_scna_table,
                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out)
#> Predicting impact of coding variants
#> Predicting impact of non-coding variants
#> Determining gene-level SCNAs (This may take a while)
#>
#> Scoring SCNA events
#> Determining hotspot/double-hit genes
#> Parsing 'phenolyzer' gene scores
#> Assessing memberships to KEGG - cancer-related pathways
# }