Create Data Frame of Features for Driver Gene Prioritization

create_features_df(
  annovar_csv_path,
  scna_segs_df,
  scna_genes_df,
  phenolyzer_annotated_gene_list_path,
  batch_analysis = FALSE,
  prep_phenolyzer_input = FALSE,
  build = "GRCh37",
  log2_ratio_threshold = 0.25,
  gene_overlap_threshold = 25,
  MCR_overlap_threshold = 25,
  hotspot_threshold = 5L,
  log2_hom_loss_threshold = -1,
  verbose = TRUE,
  na.string = "."
)

Arguments

annovar_csv_path

path to 'ANNOVAR' csv output file

scna_segs_df

the SCNA segments data frame. Must contain:

chr: chromosome the segment is located in
start: start position of the segment
end: end position of the segment
log2ratio: log₂ ratio of the segment
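As a minimal sketch, the required columns can be checked before the call. The segment values below are made up for illustration, and `required_cols` / `missing_cols` are hypothetical helper names, not part of the package.

```r
# Hypothetical toy segments; the values are invented for illustration only
scna_segs_df <- data.frame(
  chr = c("1", "1", "17"),
  start = c(1000000L, 5000000L, 7500000L),
  end = c(2000000L, 6500000L, 7700000L),
  log2ratio = c(0.8, -0.1, -1.4),
  stringsAsFactors = FALSE
)

# Stop early if any column listed above is absent
required_cols <- c("chr", "start", "end", "log2ratio")
missing_cols <- setdiff(required_cols, colnames(scna_segs_df))
if (length(missing_cols) > 0) {
  stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
}
```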

scna_genes_df

data frame of gene-level SCNAs (can be the output of create_gene_level_scna_df)

phenolyzer_annotated_gene_list_path

path to 'phenolyzer' 'annotated_gene_list' file

batch_analysis

boolean to indicate whether to perform batch analysis (TRUE) or personalized analysis (FALSE, default). If TRUE, a column named 'tumor_id' must be present in both the ANNOVAR csv and the SCNA table.
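A small pre-flight check for the 'tumor_id' requirement might look like the sketch below. Both data frames and the helper `has_tumor_id()` are hypothetical and only illustrate the check. They are not part of the package.

```r
# Hypothetical helper: TRUE if the data frame carries a 'tumor_id' column
has_tumor_id <- function(df) "tumor_id" %in% colnames(df)

# Invented stand-ins for the ANNOVAR table and the SCNA table
annovar_df <- data.frame(tumor_id = c("T1", "T2"), Chr = c("1", "17"))
scna_df <- data.frame(chr = "1", start = 1L, end = 100L, log2ratio = 0.5)

# Batch analysis needs the column in both inputs
ready_for_batch <- has_tumor_id(annovar_df) && has_tumor_id(scna_df)
```

Here `ready_for_batch` is FALSE because the toy SCNA table lacks 'tumor_id'.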

prep_phenolyzer_input

boolean to indicate whether or not to create a vector of genes for use as the input of 'phenolyzer' (default = FALSE). If TRUE, the features data frame is not created and instead the vector of gene symbols (union of all genes for which scores are available) is returned.

build

genome build for the SCNA segments data frame (default = 'GRCh37')

log2_ratio_threshold

the log₂ ratio threshold for keeping high-confidence SCNA events (default = 0.25)
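The effect of this threshold can be sketched as a simple filter. This assumes the threshold is applied to the absolute log₂ ratio, so that gains and losses are both kept. The page does not state this, so treat it as an assumption. The toy values are invented.

```r
# Invented toy segments
segs <- data.frame(
  chr = c("1", "1", "17", "X"),
  log2ratio = c(0.8, -0.1, -1.4, 0.2)
)

log2_ratio_threshold <- 0.25

# Assumption: the comparison uses the absolute value of the log2 ratio
kept <- segs[abs(segs$log2ratio) > log2_ratio_threshold, ]
```

Under this assumption, only the segments with log₂ ratios of 0.8 and -1.4 remain in `kept`.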

gene_overlap_threshold

the percentage threshold for the overlap between a segment and a transcript (default = 25). A transcript is assigned the segment's SCNA event only if the segment overlaps the transcript by more than this threshold.
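To make the percentage concrete, the sketch below computes an overlap for one segment and one transcript. It assumes 1-based inclusive coordinates and that the percentage is taken relative to the transcript length. Neither is confirmed on this page. `percent_overlap()` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: percentage of the transcript covered by the segment
# Assumes 1-based, inclusive coordinates on the same chromosome
percent_overlap <- function(seg_start, seg_end, tx_start, tx_end) {
  overlap_bp <- max(0, min(seg_end, tx_end) - max(seg_start, tx_start) + 1)
  100 * overlap_bp / (tx_end - tx_start + 1)
}

gene_overlap_threshold <- 25

# Transcript spans 1001-2000 (1000 bp); segment spans 1501-3000
pct <- percent_overlap(1501, 3000, 1001, 2000)

# TRUE when the transcript would inherit the segment's SCNA event
assigned <- pct > gene_overlap_threshold
```

In this toy case the segment covers 500 of the transcript's 1000 bp, so `pct` is 50 and the transcript passes the default threshold.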

MCR_overlap_threshold

the percentage threshold for the overlap between a gene and an MCR region (default = 25). A gene is assigned the SCNA density of the MCR only if the gene overlaps the MCR region by more than this threshold.

hotspot_threshold

to determine hotspot genes, the (integer) threshold for the minimum number of cases with a certain mutation in COSMIC (default = 5)

log2_hom_loss_threshold

to determine double-hit events, the log₂ threshold for identifying homozygous loss events (default = -1).

verbose

boolean controlling verbosity (default = TRUE)

na.string

string that was used to indicate when a score is not available during annotation with ANNOVAR (default = '.')
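The role of this marker can be illustrated with base R's `read.csv()`. The two-row CSV and its column names below are invented for illustration. The function's internal parsing may differ.

```r
# Invented two-row stand-in for an ANNOVAR csv; "." marks a missing score
csv_text <- "Gene.refGene,CADD_phred\nTP53,25.3\nKRAS,.\n"

# Treat "." as NA so the score column is parsed as numeric
annovar_df <- read.csv(text = csv_text, na.strings = ".",
                       stringsAsFactors = FALSE)

# Which rows lack a score
missing_score <- is.na(annovar_df$CADD_phred)
```

Without `na.strings = "."`, the score column would be read as character because of the "." entry.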

Value

If prep_phenolyzer_input=FALSE (default), a data frame of features for prioritizing cancer driver genes (gene_symbol as the first column and 26 other columns containing features). If prep_phenolyzer_input=TRUE, the function returns a vector of gene symbols (union of all gene symbols for which scores are available) to be used as the input for performing 'phenolyzer' analysis.

The features data frame contains the following columns:

gene_symbol: HGNC gene symbol
metaprediction_score: the maximum metapredictor (coding) impact score for the gene
noncoding_score: the maximum non-coding PHRED-scaled CADD score for the gene
scna_score: SCNA proxy score. SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located
hotspot_double_hit: boolean indicating whether the gene is a hotspot gene (indication of oncogenes) or subject to double-hit (indication of tumor-suppressor genes)
phenolyzer_score: 'phenolyzer' score for the gene
hsa03320: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04010: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04020: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04024: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04060: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04066: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04110: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04115: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04150: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04151: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04210: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04310: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04330: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04340: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04350: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04370: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04510: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04512: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04520: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04630: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04915: boolean indicating whether or not the gene takes part in this KEGG pathway
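As a sketch of working with this layout, the KEGG membership columns can be selected by their 'hsa' prefix and summarized per gene. The mock data frame below holds only a few of the columns listed above, with invented gene names and values. `n_kegg_pathways` is a hypothetical derived column, not part of the returned data frame.

```r
# Mock features data frame with a subset of the documented columns
features_df <- data.frame(
  gene_symbol = c("GENE_A", "GENE_B"),
  metaprediction_score = c(0.9, 0.2),
  hsa04110 = c(TRUE, FALSE),
  hsa04115 = c(TRUE, FALSE),
  hsa04151 = c(FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Pick the pathway membership columns by name pattern
kegg_cols <- grep("^hsa[0-9]{5}$", colnames(features_df), value = TRUE)

# Count how many of the listed pathways each gene belongs to
features_df$n_kegg_pathways <- rowSums(features_df[, kegg_cols, drop = FALSE])
```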

Examples

# \donttest{
path2annovar_csv <- system.file('extdata/example.hg19_multianno.csv',
                                package = 'driveR')
path2phenolyzer_out <- system.file('extdata/example.annotated_gene_list',
                                   package = 'driveR')
features_df <- create_features_df(annovar_csv_path = path2annovar_csv,
                                  scna_segs_df = example_scna_table,
                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out)
#> Predicting impact of coding variants
#> Predicting impact of non-coding variants
#> Determining gene-level SCNAs (This may take a while)
#> 
#> Scoring SCNA events
#> Determining hotspot/double-hit genes
#> Parsing 'phenolyzer' gene scores
#> Assessing memberships to KEGG - cancer-related pathways
# }
