---
title: "How to use driveR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How to use driveR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Overview

Cancer genomes contain large numbers of somatic alterations but few genes drive tumor development. Identifying cancer driver genes is critical for precision oncology. Most of current approaches either identify driver genes based on mutational recurrence or using estimated scores predicting the functional consequences of mutations. 

`driveR` is a tool for personalized or batch analysis of genomic data for driver gene prioritization by combining genomic information and prior biological knowledge. As features, driveR uses coding impact metaprediction scores, non-coding impact scores, somatic copy number alteration scores, hotspot gene/double-hit gene condition, 'phenolyzer' gene scores and memberships to cancer-related KEGG pathways. It uses these features to estimate cancer-type-specific probabilities for each gene of being a cancer driver using the related task of a multi-task learning classification model.

> The package supports GRCh37 (default) and GRCh38 genomic positions, which can be specified via the `build` argument.

The method is described in detail in _Ülgen E, Sezerman OU. driveR: a novel method for prioritizing cancer driver genes using somatic genomics data. BMC Bioinformatics. 2021 May 24;22(1):263.[https://doi.org/10.1186/s12859-021-04203-7](https://doi.org/10.1186/s12859-021-04203-7)_ 

Below are some example usage cases for `driveR`:

# Predicting Impact of Coding Variants

For scoring the impact of coding variants, we devised a metapredictor model that utilizes impact prediction scores from 12 different tools: SIFT, PolyPhen-2 (HumDiv scores), LRT, MutationTaster, Mutation Assessor, FATHMM, GERP++, PhyloP, CADD, VEST, SiPhy and DANN. Annotations for all these tools are available in dbNFSP via ANNOVAR. We provide with the package 2 example (shortened) ANNOVAR outputs (see next sections): 

```{r metapred1}
library(driveR)
path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                 package = "driveR")
```

We can calculate impact scores for all coding variants in this ANNOVAR file via `predict_coding_impact()`:

```{r metapred2}
metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv)
head(metaprediction_df)
```

By default, `predict_coding_impact()` keeps only the maximal score per gene. This behavior can be altered by setting `keep_highest_score = FALSE`:

```{r metapred3, eval=FALSE}
metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv, 
                                           keep_highest_score = FALSE)
```

Also by default, `predict_coding_impact()` keeps only the first gene symbol for genes where multiple symbols were annotated. This behavior can be altered by setting `keep_single_symbol = FALSE`:

```{r metapred4, eval=FALSE}
metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv, 
                                           keep_single_symbol = FALSE)
```

The default `na.string` (string that was used to indicate when a score is not available during annotation with ANNOVAR) is ".", this can be altered via:

```{r metapred5, eval=FALSE}
metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv, 
                                           na.string = "NA")
```

# Driver Gene Prioritization - Personalized Analysis

Below, a step-by-step work flow for `driveR` for an individual tumor sample is provided. Note that some steps require operations outside of R. The example data provided within the package are for a lung adenocarcinoma patient studied in Imielinski et al. [^1]

[^1]: Imielinski M, Greulich H, Kaplan B, et al. Oncogenic and sorafenib-sensitive ARAF mutations in lung adenocarcinoma. J Clin Invest. 2014;124(4):1582-6.

## Load the package and prepare input data

```{r setup, eval=FALSE}
library(driveR)
```

As input, `create_features_df()`, the function used to create the features table for driver gene prioritization, requires the following:

- `annovar_csv_path`: the `/path/to/ANNOVAR/csv/output/file`
- `scna_df`: a data frame containing SCNA segments, containing the columns "chr", "start", "end" and "log2ratio"
- `phenolyzer_annotated_gene_list_path`: the `/path/to/phenolyzer/annotated_gene_list/output/file`

### ANNOVAR CSV output

We first run ANNOVAR. An example command provided below:

```{bash, eval=FALSE}
table_annovar.pl example.avinput ~/annovar/humandb/ -buildver hg19 -out /path/to/output -remove -protocol refGene,cytoBand,exac03,avsnp150,dbnsfp30a,cosmic92_coding,cosmic92_noncoding -operation gx,r,f,f,f,f,f -nastring . -csvout -polish
```

The required filters are, as listed in the above command, `refGene,cytoBand,exac03,avsnp147,dbnsfp30a,cosmic91_coding,cosmic91_noncoding`.

With the package, an example (modified) ANNOVAR csv output file is available:

```{r example_av_csv}
path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                 package = "driveR")
```

### SCNA table

Next, we prepare a SCNA data frame (GRCh37 or GRCh38). Again, an example data frame (GRCh37) is provided within the package:

```{r example_scna}
head(example_scna_table)
```

### phenolyzer output

Finally, for the phenolyzer annotated_gene_list output file, we first obtain the genes to be scored using `create_features_df` and setting `prep_phenolyzer_input=TRUE`:

```{r phenolyzer_input,eval=FALSE}
phenolyzer_genes <- create_features_df(annovar_csv_path = path2annovar_csv,
                                       scna_df = example_scna_table,
                                       prep_phenolyzer_input = TRUE,
                                       build = "GRCh37")
```

Next, we save these genes to be used as input for phenolyzer:
```{r save_phenolyzer_input, eval=FALSE}
write.table(x = data.frame(gene = phenolyzer_genes), 
            file = "input_genes.txt", 
            row.names = FALSE, col.names = FALSE, quote = FALSE)
```

We create another text file, named `phenolyzer_disease.txt`, containing the phenotype (i.e. cancer type. in this case "lung adenocarcinoma") to be used for scoring with phenolyzer. An example command for running phenolyzer is provided below:

```{bash, eval=FALSE}
perl ~/phenolyzer/disease_annotation.pl -f phenolyzer_disease.txt -prediction -phenotype -logistic --gene input_genes.txt -out phenolyzer_out/example
```

This should produce, among other outputs, the annotated_gene_list output file. An example annotated_gene_list output file is provided within the package:
```{r example_phenoylzer}
path2phenolyzer_out <- system.file("extdata/example.annotated_gene_list",
                                    package = "driveR")
```

## Create the features data frame

After creating the necessary input data (as detailed above), we simply run `create_features_df()` to obtain the features data frame:

```{r features_df, eval=FALSE}
features_df <- create_features_df(annovar_csv_path = path2annovar_csv,
                                  scna_df = example_scna_table, 
                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out,
                                  build = "GRCh37")
```

```{r features_df_load, echo=FALSE}
features_df <- example_features_table
```

## Prioritize cancer driver genes

Finally, we can prioritize cancer driver genes using `prioritize_driver_genes()`. For this function, `cancer_type` can be of the short names for the 21 different cancer types that was used to train the multi-task learning model utilized in this package:

```{r cancer_types}
knitr::kable(MTL_submodel_descriptions)
```

Below is the example run for the lung adenocarcinoma patient:

```{r driver_prob}
driver_prob_df <- prioritize_driver_genes(features_df = features_df,
                                          cancer_type = "LUAD")
head(driver_prob_df, 10)
```

The function returns a data frame of genes with probabilities of being cancer driver genes (using the selected sub-model of a multi-task learning model trained using 21 different cancer types) and the prediction of driver genes based on a cancer type specific probability threshold. In the above example, the top 10 genes contain recognizable cancer driver genes.

# Driver Gene Prioritization - Batch Analysis

Below, a step-by-step work flow for `driveR` for a cohort of tumor samples is provided. Note that some steps require operations outside of R. The example data provided within the package are for 10 randomly-selected samples from The Cancer Genome Atlas (TCGA) program's LAML (Acute Myeloid Leukemia) cohort.

## Load the package and prepare input data

```{r setup2, eval = FALSE}
library(driveR)
```

As before, as input, `create_features_df()`, the function used to create the features table for driver gene prioritization, requires the following:

- `annovar_csv_path`: the `/path/to/ANNOVAR/csv/output/file`
- `scna_df`: a data frame containing SCNA segments, containing the columns "chr", "start", "end", "log2ratio" and **"tumor_id"**
- `phenolyzer_annotated_gene_list_path`: the `/path/to/phenolyzer/annotated_gene_list/output/file`

### ANNOVAR CSV output

The required filters are again, as listed in the example ANNOVAR command above, `refGene,cytoBand,exac03,avsnp147,dbnsfp30a,cosmic91_coding,cosmic91_noncoding`. Additionally, for cohort-level analysis, a column named `tumor_id` is required, containing tumor id of each variant.

With the package, an example cohort-level (modified) ANNOVAR csv output file is available:

```{r example_av_csv2}
path2annovar_csv <- system.file("extdata/example_cohort.hg19_multianno.csv",
                                 package = "driveR")
```

### SCNA table

Next, we prepare a SCNA data frame (GRCh37 is required). Again, an example data frame is provided within the package:

```{r example_scna2}
head(example_cohort_scna_table)
```

Again, for cohort-level analysis, it's crucial to have a column named `tumor_id` is required, containing tumor id for each SCNA segment.

### phenolyzer output

Finally, for the phenolyzer annotated_gene_list output file, we first obtain the genes to be scored using `create_features_df` and setting `prep_phenolyzer_input=TRUE`:

```{r phenolyzer_input2, eval=FALSE}
phenolyzer_genes <- create_features_df(annovar_csv_path = path2annovar_csv,
                                       scna_df = example_cohort_scna_table,
                                       prep_phenolyzer_input = TRUE,
                                       batch_analysis = TRUE)
```

Next, we save these genes to be used as input for phenolyzer:
```{r save_phenolyzer_input2, eval=FALSE}
write.table(x = data.frame(gene = phenolyzer_genes), 
            file = "input_genes.txt", 
            row.names = FALSE, col.names = FALSE, quote = FALSE)
```

We create another text file, named `phenolyzer_disease.txt`, containing the phenotype (i.e. cancer type. in this case "acute myeloid leukemia") to be used for scoring with phenolyzer. An example command for running phenolyzer is provided below:

```{bash, eval=FALSE}
perl ~/phenolyzer/disease_annotation.pl -f phenolyzer_disease.txt -prediction -phenotype -logistic --gene input_genes.txt -out phenolyzer_out/example
```

This should produce, among other outputs, the annotated_gene_list output file. An example annotated_gene_list output file for the cohort-level data is provided within the package:
```{r example_phenoylzer2}
path2phenolyzer_out <- system.file("extdata/example_cohort.annotated_gene_list",
                                    package = "driveR")
```

## Create the features data frame

After creating the necessary input data (as detailed above), we simply run `create_features_df()` to obtain the features data frame. Notice that we set `batch_analysis = TRUE`. When `batch_analysis = TRUE`, both the ANNOVAR output and the SCNA table should have a column named 'tumor_id'.

```{r features_df2, eval=FALSE}
features_df <- create_features_df(annovar_csv_path = path2annovar_csv,
                                  scna_df = example_cohort_scna_table, 
                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out,
                                  batch_analysis = TRUE)
```

```{r features_df_load2, echo=FALSE}
features_df <- example_cohort_features_table
```

## Prioritize cancer driver genes

Finally, we can prioritize cancer driver genes using `prioritize_driver_genes()`. 

Below is the example run for the acute myeloid leukemia cohort:

```{r driver_prob2}
driver_prob_df <- prioritize_driver_genes(features_df = features_df,
                                          cancer_type = "LAML")
head(driver_prob_df, 10)
```

Once again, in the above example, the top 10 genes contain recognizable cancer driver genes.