R/core_functions.R
create_features_df.Rd
Create Data Frame of Features for Driver Gene Prioritization
create_features_df(
annovar_csv_path,
scna_segs_df,
scna_genes_df,
phenolyzer_annotated_gene_list_path,
batch_analysis = FALSE,
prep_phenolyzer_input = FALSE,
build = "GRCh37",
log2_ratio_threshold = 0.25,
gene_overlap_threshold = 25,
MCR_overlap_threshold = 25,
hotspot_threshold = 5L,
log2_hom_loss_threshold = -1,
verbose = TRUE,
na.string = "."
)
path to 'ANNOVAR' csv output file
the SCNA segments data frame. Must contain:
chromosome the segment is located in
start position of the segment
end position of the segment
log2 ratio of the segment
data frame of gene-level SCNAs (can be the output of create_gene_level_scna_df)
path to 'phenolyzer' 'annotated_gene_list' file
boolean indicating whether to perform batch analysis (TRUE) or personalized analysis (FALSE, default). If TRUE, a column named 'tumor_id' should be present in both the ANNOVAR csv and the SCNA table.
boolean indicating whether to create a vector of genes for use as the input of 'phenolyzer' (default = FALSE). If TRUE, the features data frame is not created and instead the vector of gene symbols (union of all genes for which scores are available) is returned.
genome build for the SCNA segments data frame (default = 'GRCh37')
the log2 ratio threshold for keeping high-confidence SCNA events (default = 0.25)
the percentage threshold for the overlap between a segment and a transcript (default = 25). A transcript is assigned the segment's SCNA event only if the segment overlaps the transcript by more than this threshold.
the percentage threshold for the overlap between a gene and an MCR region (default = 25). A gene is assigned the SCNA density of the MCR only if the gene overlaps the MCR region by more than this threshold.
to determine hotspot genes, the (integer) threshold for the minimum number of cases with a certain mutation in COSMIC (default = 5)
to determine double-hit events, the log2 threshold for identifying homozygous loss events (default = -1).
boolean controlling verbosity (default = TRUE)
string that was used to indicate when a score is not available during annotation with ANNOVAR (default = '.')
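To make the threshold arguments more concrete, the following base-R sketch applies them to a toy segments table. The column names (`chr`, `start`, `end`, `log2ratio`), the toy transcript coordinates, the comparison against the absolute log2 ratio, and the use of transcript length as the denominator of the overlap percentage are all assumptions made for illustration; the package's internal implementation may differ.

```r
# Toy SCNA segments table; column names are assumed for illustration
segs <- data.frame(
  chr = c("1", "1", "2"),
  start = c(1000, 5000, 2000),
  end = c(4000, 9000, 8000),
  log2ratio = c(0.10, 0.80, -1.50)
)

# Defaults taken from the usage section above
log2_ratio_threshold <- 0.25
gene_overlap_threshold <- 25
log2_hom_loss_threshold <- -1

# Keep high-confidence events (assumption: the absolute log2 ratio is compared)
high_conf <- segs[abs(segs$log2ratio) > log2_ratio_threshold, ]

# Hypothetical transcript on chromosome 1 spanning positions 8000-10000
tx_chr <- "1"
tx_start <- 8000
tx_end <- 10000

# Percentage of the transcript covered by each remaining segment
# (set to 0 for segments on a different chromosome)
overlap_bp <- pmax(0, pmin(high_conf$end, tx_end) - pmax(high_conf$start, tx_start))
overlap_bp[high_conf$chr != tx_chr] <- 0
overlap_pct <- 100 * overlap_bp / (tx_end - tx_start)

# The transcript inherits a segment's event only above the overlap threshold
assigned <- high_conf[overlap_pct > gene_overlap_threshold, ]

# Candidate homozygous losses (assumption: log2 ratio below the threshold)
hom_loss <- high_conf[high_conf$log2ratio < log2_hom_loss_threshold, ]
```

In this toy case the first segment is dropped by the log2 ratio filter, the second covers half of the hypothetical transcript and is therefore assigned to it, and the third is flagged as a candidate homozygous loss.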
If prep_phenolyzer_input=FALSE (default), a data frame of features for prioritizing cancer driver genes (gene_symbol as the first column and 26 other columns containing features). If prep_phenolyzer_input=TRUE, the function returns a vector of gene symbols (union of all gene symbols for which scores are available) to be used as the input for performing 'phenolyzer' analysis.
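When the gene-symbol vector is requested, it typically has to be saved to a file before it can be handed to 'phenolyzer'. A minimal base-R sketch of that step follows; the gene symbols are a hypothetical stand-in for the returned vector, and the one-symbol-per-line layout is an assumption rather than a documented requirement.

```r
# Stand-in for the vector returned when prep_phenolyzer_input = TRUE
# (hypothetical gene symbols, for illustration only)
gene_symbols <- c("TP53", "KRAS", "EGFR")

# Write one symbol per line (assumed input layout) to a temporary file
input_path <- tempfile(fileext = ".txt")
writeLines(gene_symbols, input_path)

# Read the file back to confirm its contents
written <- readLines(input_path)
```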
The features data frame contains the following columns:
HGNC gene symbol
the maximum metapredictor (coding) impact score for the gene
the maximum non-coding PHRED-scaled CADD score for the gene
SCNA proxy score. SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located
boolean indicating whether the gene is a hotspot gene (indication of oncogenes) or subject to double-hit (indication of tumor-suppressor genes)
'phenolyzer' score for the gene
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
boolean indicating whether or not the gene takes part in this KEGG pathway
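Because the pathway columns are booleans, they can be summarized with ordinary base-R operations. The sketch below counts pathway memberships per gene on a toy data frame; apart from `gene_symbol`, the column names (`pathway_1`, `pathway_2`) and the gene labels are hypothetical placeholders, not the names used in the actual output.

```r
# Toy stand-in for the features data frame; only gene_symbol is a real
# column name, the pathway columns are hypothetical placeholders
features_toy <- data.frame(
  gene_symbol = c("GENE_A", "GENE_B", "GENE_C"),
  pathway_1 = c(TRUE, TRUE, FALSE),
  pathway_2 = c(FALSE, TRUE, FALSE)
)

# Select the boolean pathway columns by their (hypothetical) prefix
pathway_cols <- grep("^pathway_", names(features_toy), value = TRUE)

# Number of pathways each gene takes part in
n_pathways <- rowSums(features_toy[, pathway_cols, drop = FALSE])
names(n_pathways) <- features_toy$gene_symbol
```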
prioritize_driver_genes for prioritizing cancer driver genes
# \donttest{
path2annovar_csv <- system.file('extdata/example.hg19_multianno.csv',
                                package = 'driveR')
path2phenolyzer_out <- system.file('extdata/example.annotated_gene_list',
                                   package = 'driveR')
features_df <- create_features_df(annovar_csv_path = path2annovar_csv,
                                  scna_segs_df = example_scna_table,
                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out)
#> Predicting impact of coding variants
#> Predicting impact of non-coding variants
#> Determining gene-level SCNAs (This may take a while)
#>
#> Scoring SCNA events
#> Determining hotspot/double-hit genes
#> Parsing 'phenolyzer' gene scores
#> Assessing memberships to KEGG - cancer-related pathways
# }