Create Data Frame of Features for Driver Gene Prioritization

create_features_df(
  annovar_csv_path,
  scna_segs_df,
  scna_genes_df,
  phenolyzer_annotated_gene_list_path,
  batch_analysis = FALSE,
  prep_phenolyzer_input = FALSE,
  build = "GRCh37",
  log2_ratio_threshold = 0.25,
  gene_overlap_threshold = 25,
  MCR_overlap_threshold = 25,
  hotspot_threshold = 5L,
  log2_hom_loss_threshold = -1,
  verbose = TRUE,
  na.string = "."
)

Arguments

annovar_csv_path

path to 'ANNOVAR' csv output file

scna_segs_df

the SCNA segments data frame. Must contain:

chr

chromosome the segment is located in

start

start position of the segment

end

end position of the segment

log2ratio

log2 ratio of the segment

scna_genes_df

data frame of gene-level SCNAs (can be the output of create_gene_level_scna_df)

phenolyzer_annotated_gene_list_path

path to 'phenolyzer' 'annotated_gene_list' file

batch_analysis

boolean to indicate whether to perform batch analysis (TRUE) or personalized analysis (FALSE, default). If TRUE, a column named 'tumor_id' should be present in both the ANNOVAR csv and the SCNA table.

prep_phenolyzer_input

boolean to indicate whether or not to create a vector of genes for use as the input of 'phenolyzer' (default = FALSE). If TRUE, the features data frame is not created and instead the vector of gene symbols (union of all genes for which scores are available) is returned.

build

genome build for the SCNA segments data frame (default = 'GRCh37')

log2_ratio_threshold

the log2 ratio threshold for keeping high-confidence SCNA events (default = 0.25)

gene_overlap_threshold

the percentage threshold for the overlap between a segment and a transcript (default = 25). A transcript is assigned the segment's SCNA event only if the segment overlaps the transcript by more than this threshold.

MCR_overlap_threshold

the percentage threshold for the overlap between a gene and an MCR region (default = 25). A gene is assigned the SCNA density of the MCR only if the gene overlaps the MCR region by more than this threshold.

hotspot_threshold

to determine hotspot genes, the (integer) threshold for the minimum number of cases with a given mutation in COSMIC (default = 5)

log2_hom_loss_threshold

to determine double-hit events, the log2 threshold for identifying homozygous loss events (default = -1).

verbose

boolean controlling verbosity (default = TRUE)

na.string

string that was used to indicate when a score is not available during annotation with ANNOVAR (default = '.')
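
The input layout and the thresholds described above can be sketched in base R. This is a minimal illustration, not the package's internal code: the toy values, the 'tumor_id' labels, and the helper percent_overlap() are hypothetical, and it assumes that the log2 ratio threshold is applied to the absolute log2 ratio, that homozygous loss means a log2 ratio below log2_hom_loss_threshold, and that overlap is measured as a percentage of the transcript's length.

```r
# Toy SCNA segments table with the four required columns
# (all values are made up for illustration)
scna_segs_df <- data.frame(
  chr = c("1", "1", "7", "17"),
  start = c(1000000, 5000000, 55000000, 7500000),
  end = c(2000000, 6000000, 55300000, 7700000),
  log2ratio = c(0.10, 0.80, 1.50, -1.30),
  stringsAsFactors = FALSE
)

# For batch analysis, a 'tumor_id' column is also expected (hypothetical IDs)
scna_segs_batch_df <- data.frame(
  tumor_id = c("T1", "T1", "T2", "T2"),
  scna_segs_df,
  stringsAsFactors = FALSE
)

# Assumption: a segment is high-confidence when |log2ratio| exceeds the threshold
log2_ratio_threshold <- 0.25
high_conf_df <- scna_segs_df[abs(scna_segs_df$log2ratio) > log2_ratio_threshold, ]

# Assumption: homozygous loss means log2ratio below the threshold
log2_hom_loss_threshold <- -1
hom_loss_df <- scna_segs_df[scna_segs_df$log2ratio < log2_hom_loss_threshold, ]

# Hypothetical helper: percentage of a feature (e.g. a transcript)
# that is covered by a segment, using inclusive coordinates
percent_overlap <- function(seg_start, seg_end, feat_start, feat_end) {
  overlap_len <- max(0, min(seg_end, feat_end) - max(seg_start, feat_start) + 1)
  100 * overlap_len / (feat_end - feat_start + 1)
}

# The segment covers positions 151..200 of a transcript spanning 101..200
pct <- percent_overlap(151, 400, 101, 200)
gene_overlap_threshold <- 25
passes <- pct > gene_overlap_threshold
```

With these toy values, three segments pass the log2 ratio filter, one segment qualifies as a homozygous loss, and the transcript is covered by 50 percent, which is above the default threshold of 25.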

Value

If prep_phenolyzer_input=FALSE (default), a data frame of features for prioritizing cancer driver genes (gene_symbol as the first column and 26 other columns containing features). If prep_phenolyzer_input=TRUE, the function returns a vector of gene symbols (union of all gene symbols for which scores are available) to be used as the input for performing 'phenolyzer' analysis.

The features data frame contains the following columns:

gene_symbol

HGNC gene symbol

metaprediction_score

the maximum metapredictor (coding) impact score for the gene

noncoding_score

the maximum non-coding PHRED-scaled CADD score for the gene

scna_score

SCNA proxy score: the SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located

hotspot_double_hit

boolean indicating whether the gene is a hotspot gene (indication of oncogenes) or subject to double-hit (indication of tumor-suppressor genes)

phenolyzer_score

'phenolyzer' score for the gene

hsa03320

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04010

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04020

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04024

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04060

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04066

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04110

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04115

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04150

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04151

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04210

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04310

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04330

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04340

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04350

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04370

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04510

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04512

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04520

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04630

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04915

boolean indicating whether or not the gene takes part in this KEGG pathway
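
The column layout listed above can be checked programmatically before passing a features data frame on. The following is a minimal base R sketch; check_features_df() is a hypothetical helper and not part of the package, and the toy data frame only mimics the column names.

```r
# KEGG pathway columns, as listed in the Value section
kegg_cols <- c(
  "hsa03320", "hsa04010", "hsa04020", "hsa04024", "hsa04060", "hsa04066",
  "hsa04110", "hsa04115", "hsa04150", "hsa04151", "hsa04210", "hsa04310",
  "hsa04330", "hsa04340", "hsa04350", "hsa04370", "hsa04510", "hsa04512",
  "hsa04520", "hsa04630", "hsa04915"
)

# gene_symbol plus 26 feature columns
expected_cols <- c(
  "gene_symbol", "metaprediction_score", "noncoding_score", "scna_score",
  "hotspot_double_hit", "phenolyzer_score", kegg_cols
)

# Hypothetical helper: stop with an informative message if a column is
# missing or if gene_symbol is not the first column
check_features_df <- function(df) {
  missing_cols <- setdiff(expected_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (names(df)[1] != "gene_symbol") {
    stop("'gene_symbol' must be the first column")
  }
  invisible(TRUE)
}

# Toy one-row data frame that only carries the expected column names
toy_df <- as.data.frame(
  matrix(NA, nrow = 1, ncol = length(expected_cols),
         dimnames = list(NULL, expected_cols))
)
check_features_df(toy_df)
```

The check only inspects column names; it does not validate column types or value ranges.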

See also

prioritize_driver_genes for prioritizing cancer driver genes

Examples

# \donttest{
path2annovar_csv <- system.file('extdata/example.hg19_multianno.csv',
                                package = 'driveR')
path2phenolyzer_out <- system.file('extdata/example.annotated_gene_list',
                                   package = 'driveR')
features_df <- create_features_df(annovar_csv_path = path2annovar_csv,
                                  scna_segs_df = example_scna_table,
                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out)
#> Predicting impact of coding variants
#> Predicting impact of non-coding variants
#> Determining gene-level SCNAs (This may take a while)
#> 
#> Scoring SCNA events
#> Determining hotspot/double-hit genes
#> Parsing 'phenolyzer' gene scores
#> Assessing memberships to KEGG - cancer-related pathways
# }
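
When prep_phenolyzer_input = TRUE, the function returns a vector of gene symbols instead of the features data frame. The sketch below shows one way such a vector could be saved as a one-gene-per-line text file for 'phenolyzer'. It rests on several assumptions: write_gene_list() is a hypothetical helper, the call is assumed to need only the ANNOVAR csv and the SCNA segments in this mode, example_scna_table is assumed to be accessible as driveR::example_scna_table, and the placeholder symbols are used only so that the sketch also runs without the package installed.

```r
# Hypothetical helper: write unique gene symbols, one per line
write_gene_list <- function(genes, path) {
  writeLines(unique(as.character(genes)), con = path)
  invisible(path)
}

if (requireNamespace("driveR", quietly = TRUE)) {
  path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                  package = "driveR")
  # Assumption: only these inputs are required when preparing phenolyzer input
  phenolyzer_genes <- driveR::create_features_df(
    annovar_csv_path = path2annovar_csv,
    scna_segs_df = driveR::example_scna_table,
    prep_phenolyzer_input = TRUE
  )
} else {
  # Placeholder gene symbols so the sketch runs without driveR
  phenolyzer_genes <- c("TP53", "KRAS", "EGFR")
}

# Save to a temporary file; replace with a real path as needed
out_path <- write_gene_list(phenolyzer_genes, tempfile(fileext = ".txt"))
```

The resulting file can then be supplied as the gene list input when running 'phenolyzer', whose 'annotated_gene_list' output is passed back via phenolyzer_annotated_gene_list_path.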