Introduction to rosace

Jingyou Rao

Installation

To install rosace start R and first install devtools by typing:

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

and install rosace by typing:

devtools::install_github("pimentellab/rosace")

If you have cloned the git repository locally, navigate to the rosace folder and type:

devtools::install(".") 

Next load rosace with:

library("rosace")

Example

Read count data

To explain how to use rosace we provide an example based on the OCT1 drug cytotoxicity screen (https://www.biorxiv.org/content/10.1101/2023.06.06.543963v1.full).

The screen has three replicates. The count matrix is stored in “data/oct1.rda”.

data("oct1")
key <- "1SM73"
type <- "growth"

Create Rosace object

First, load the count file into the Assay object. Each replicate in the experiment will form an Assay object and will share the same “key” (1SM73).

assay1 <- CreateAssayObject(counts = as.matrix(oct1_rep1[2:ncol(oct1_rep1)]),
                            var.names = oct1_rep1$hgvs,
                            key = key, rep = 1, type = type)
## Warning: filtering 77 variants that have more than 0.5 missing data
## Warning: filtering 187 variants that have less than 20 count
assay2 <- CreateAssayObject(counts = as.matrix(oct1_rep2[2:ncol(oct1_rep2)]),
                            var.names = oct1_rep2$hgvs,
                            key = key, rep = 2, type = type)
## Warning: filtering 57 variants that have more than 0.5 missing data
## Warning: filtering 113 variants that have less than 20 count
assay3 <- CreateAssayObject(counts = as.matrix(oct1_rep3[2:ncol(oct1_rep3)]),
                            var.names = oct1_rep3$hgvs,
                            key = key, rep = 3, type = type)
## Warning: filtering 115 variants that have more than 0.5 missing data
## Warning: filtering 238 variants that have less than 20 count

Next create Rosace object by adding three Assay objects together.

rosace <- CreateRosaceObject(object = assay1)
rosace <- AddAssayData(object = rosace, assay = assay2)
rosace <- AddAssayData(object = rosace, assay = assay3)
GetAssayName(rosace)
## [1] "1SM73_1" "1SM73_2" "1SM73_3"

Preprocessing

‘CreateAssayObject’ calls function ‘FilterData’ to filter the variants with more than ‘na.rmax’% of NAs and less than ‘min.count’ total counts by default. But we might want to filter more variants later.

rosace <- FilterData(rosace, key = key, na.rmax = 0.5, min.count = 20)

Then we will impute the NA data either by K-Nearest Neighbor method or fill the NA with 0.

rosace <- ImputeData(rosace, key = key, impute.method = "knn", na.rmax = 0.5)
# rosace <- ImputeData(rosace, key = key, impute.method = "zero")

With a complete count matrix in the Assay object, we will normalize the data by either a list of wild-type variants or by the total count at the time point. We recommend using the wild-type normalization because it will align the wild-type variant score to be 0 for hypothesis testing.

rosace <- NormalizeData(rosace, key = key,
                        normalization.method = "wt", 
                        wt.var.names = c("_wt"), wt.rm = TRUE)
# rosace <- NormalizeData(rosace, key = key, normalization.method = "total")

After the Assay objects are all normalized, we can integrate three replicates into an Assays object stored in the “combined.assay” slot of the Rosace object.

rosace <- IntegrateData(object = rosace, key = key)
GetAssaySetName(rosace)
## [1] "1SM73"

Naive method: simple linear regression (optional)

rosace <- runSLR(rosace, name = key, type = "AssaySet")
rosace <- runSLR(rosace, name = paste(key, "1", sep = "_"), type = "Assay")
# rosace <- runSLR(rosace, name = paste(key, "2", sep = "_"), type = "Assay")
# rosace <- runSLR(rosace, name = paste(key, "3", sep = "_"), type = "Assay")

Process variants’ meta data (provide your own function)

Provide your own function for parsing “hgvs” into position, mutation, wildtype, and type of mutation.

library("dplyr")
colnames(rosace@var.data)
## [1] "variants"
rosace@var.data <- rosace@var.data %>%
  mutate(tmp = substr(variants, 4, nchar(variants) - 1),
         position = as.numeric(gsub("[[:alpha:]]", "", tmp)),
         wildtype = substr(tmp, 1, 1),
         tmp = substr(tmp, 2, nchar(tmp)),
         mutation = gsub("[[:digit:]]", "", tmp)) %>%
  dplyr::select(-tmp)

func_map <- function(wt, mut) {
  if (nchar(wt) == 0) {
    return("NA")
  }
  
  if (wt == mut) {
    return("synonymous")
  } else if (mut == "del") {
    return("deletion")
  } else {
    return("missense")
  }
}

rosace@var.data <- rosace@var.data %>%
  rowwise() %>%
  mutate(type = func_map(wildtype, mutation)) %>%
  ungroup()
head(rosace@var.data)
## # A tibble: 6 × 5
##   variants  position wildtype mutation type      
##   <chr>        <dbl> <chr>    <chr>    <chr>     
## 1 _wt             NA ""       ""       NA        
## 2 p.(A107A)      107 "A"      "A"      synonymous
## 3 p.(A107C)      107 "A"      "C"      missense  
## 4 p.(A107D)      107 "A"      "D"      missense  
## 5 p.(A107E)      107 "A"      "E"      missense  
## 6 p.(A107F)      107 "A"      "F"      missense

Run Rosace

There are three ways to run Rosace model:
1) Provide no position or control column. The model would treat each variant independently but with shrinkage on the variance component.
2) Provide position column only. In addition, the model would group variants together with the amino acid position information provided.
3) Provide both position and control column (recommended). In addition, the model would group all control variants together into one position label.

For the purpose of this vignette, we’ve reduced the number of variants to 100 with brute force before running Rosace, since running on the full dataset might take an hour.

# running on an Assay (one replicate)
rosace@assays$`1SM73_1`@norm.counts <- rosace@assays$`1SM73_1`@norm.counts[1:100, ]
rosace@assays$`1SM73_1`@norm.var.names <- rosace@assays$`1SM73_1`@norm.var.names[1:100]
rosace <- RunRosace(object = rosace,
                    name = "1SM73_1",
                    type = "Assay",
                    savedir = "../tests/results/stan/assay/",
                    pos.col = "position",
                    ctrl.col = "type",
                    ctrl.name = "synonymous",
                    install = TRUE,
                    cmdstan_ver = "2.35.0")
# running on an AssaySet (all three replicates)
rosace@assay.sets$`1SM73`@raw.counts <- rosace@assay.sets$`1SM73`@raw.counts[1:100, ]
rosace@assay.sets$`1SM73`@combined.counts <- rosace@assay.sets$`1SM73`@combined.counts[1:100, ]
rosace@assay.sets$`1SM73`@var.names <- rosace@assay.sets$`1SM73`@var.names[1:100]
rosace <- RunRosace(object = rosace,
                    name = "1SM73",
                    type = "AssaySet",
                    savedir = "../tests/results/stan/assayset/",
                    pos.col = "position",
                    ctrl.col = "type",
                    ctrl.name = "synonymous",
                    install = FALSE)

Update v1.1: Users can now group the nonsense/stop mutations as an additional option. The OCT1 data doesn’t include nonsense mutations, so we show an example of grouping deletion mutations.

# running on an AssaySet (all three replicates)
# rosace <- RunRosace(object = rosace,
#                     name = "1SM73",
#                     type = "AssaySet",
#                     savedir = "../tests/results/stan/assayset_stop/",
#                     pos.col = "position",
#                     ctrl.col = "type",
#                     ctrl.name = "synonymous",
#                     stop.col = "type",
#                     stop.name = "deletion",
#                     install = FALSE)

The Rosace results can be retrieved using the OutputScore function. First, check the name of your Rosace run:

names(rosace@scores)
## [1] "1SM73_SLR"      "1SM73_1_SLR"    "1SM73_1_ROSACE" "1SM73_ROSACE"

Then extract the scores data, which includes the variant information, functional score (mean), standard deviation (sd), and the local false sign discovery rate (lfsr) associated with the score.

scores.data <- OutputScore(rosace, pos.info = FALSE, name = "1SM73_ROSACE")
head(scores.data)
##    variants position wildtype mutation       type       mean        sd
## 1 p.(A107A)      107        A        A synonymous -0.2749303 0.2851470
## 2 p.(A107C)      107        A        C   missense  0.1090313 0.2518722
## 3 p.(A107D)      107        A        D   missense  0.4202352 0.1788397
## 4 p.(A107E)      107        A        E   missense -0.6184181 0.2812731
## 5 p.(A107F)      107        A        F   missense -0.1424143 0.2625147
## 6 p.(A107G)      107        A        G   missense -1.0350811 0.2624148
##           lfsr     lfsr.neg    lfsr.pos test.neg test.pos   label
## 1 0.1674802320 0.1674802320 0.832519768    FALSE    FALSE Neutral
## 2 0.3325496798 0.6674503202 0.332549680    FALSE    FALSE Neutral
## 3 0.0093920833 0.9906079167 0.009392083    FALSE     TRUE     Pos
## 4 0.0139517968 0.0139517968 0.986048203     TRUE    FALSE     Neg
## 5 0.2937369314 0.2937369314 0.706263069    FALSE    FALSE Neutral
## 6 0.0000399923 0.0000399923 0.999960008     TRUE    FALSE     Neg

The functional score represents the slope of the linear regression applied to normalized counts across time points, which serves as a measure of cell growth. A positive score indicates the mutation has a a gain of function (GOF) effect, while a negative score indicates a loss of function (LOF) effect. The Rosace model learns a distribution of scores and reports the mean for each variant.

The lfsr, or local false sign rate, offers a Bayesian perspective for estimating uncertainty in the sign of the score. For a negative mean score, lfsr estimates the probability it could truly be positive, and vice versa. The below section illustrates how to use score and lfsr to identify LOF and GOF mutations.

Interpret Rosace Results

To have a more comprehensive understanding and clearer view of the score distribution, we’ve made available a precomputed object based on the complete OCT1 dataset. This may help you draw parallel to your own data. libra

# obtain the data 
data("oct1_rosace_scored")

# extract scores 
scores.data <- OutputScore(oct1_rosace_scored, name = "1SM73_ROSACE", sig.test = 0.05)
head(scores.data)
##    variants position wildtype mutation       type       mean        sd
## 1 p.(A107A)      107        A        A synonymous -0.2766760 0.2337295
## 2 p.(A107C)      107        A        C   missense  0.1195266 0.2269136
## 3 p.(A107D)      107        A        D   missense  0.3834164 0.2090132
## 4 p.(A107E)      107        A        E   missense -0.6452607 0.2549238
## 5 p.(A107F)      107        A        F   missense -0.1401713 0.2644013
## 6 p.(A107G)      107        A        G   missense -1.0386441 0.2606083
##           lfsr     lfsr.neg   lfsr.pos test.neg test.pos   label
## 1 1.182571e-01 1.182571e-01 0.88174285    FALSE    FALSE Neutral
## 2 2.991837e-01 7.008163e-01 0.29918374    FALSE    FALSE Neutral
## 3 3.329643e-02 9.667036e-01 0.03329643    FALSE    FALSE Neutral
## 4 5.683809e-03 5.683809e-03 0.99431619     TRUE    FALSE     Neg
## 5 2.980054e-01 2.980054e-01 0.70199463    FALSE    FALSE Neutral
## 6 3.367477e-05 3.367477e-05 0.99996633     TRUE    FALSE     Neg

Data visualization

First, we can visualize the distribution of scores across mutation types with the scoreDensity function.

scoreDensity(scores.data, 
             hist = FALSE,
             savedir = "../tests/results/stan/assayset_full/plot/", 
             name = "DensityPlot_1SM73")

For other visualizations including visualizing position-wise distributions, refer to the Visualizing functional score results vignette.

Hypothesis testing

A common goal in DMS experiments is to determine which mutations cause a GOF or a LOF. One natural approach is to conduct hypothesis testing to check whether a score is significantly different from 0. lfsr stands for local false sign rate, and is defined as the minimum of the density on the left of 0 (lfsr.pos) and on the right of 0 (lfsr.neg). The OutputScore function automatically does a two-sided test based on the lfsr with threshold 0.05: if lfsr.neg < 0.05/2, we are confident that the mass of the distribution is on the left side of 0 and label the variant “Neg”, and vice versa.

For the OCT1 data, positive scores represent loss-of-function and negative scores represent gain-of-function. You can update the label to “LOF”, “Neutral”, and “GOF” (instead of “Neg”, “Neutral”, and “Pos”) by the following function.

scores.data <- scores.data %>%
  mutate(label = case_when(
    label == "Neg" ~ "GOF",
    label == "Pos" ~ "LOF",
    label == "Neutral" ~ "Neutral"
  ))

Keep in mind that setting a threshold is a somewhat counter-Bayesian approach. The threshold should be adaptable, depending on the nature of the experiment and distribution of the data. You may be more or less conservative with thresholding based on the context and prior knowledge.

table(scores.data$type, scores.data$label)
##             
##               GOF  LOF Neutral
##   deletion     53  368     107
##   missense   2841 3758    3662
##   synonymous  103   26     387

In the growth screen, hypothesis testing for negative scores is much harder than that for the positive scores because random dropout of a variant is common, even for the synonymous mutations, especially when the mass of LOF variant score is shifting towards positive (faster growth rate than synonymous). As a result, it is expected to see more false discovery in the “GOF” bin in the OCT1 data. One might adjust the threshold to be more stringent when discovering “Neg” variants, and looser when discovering “Pos” variants.

Rank the variants

Sometimes, we want to select the most “LOF” or “GOF” variants for validation. Our most confident set of variants are those that have a high absolute value of the score and small lfsr close to 0.

For example, the 5 most “LOF” and “GOF” variants in the OCT1 data can be found by the function below.

# LOF
scores.data %>% arrange(desc(mean), lfsr.pos) %>% head(5)
##      variants position wildtype mutation     type     mean        sd
## 1 p.(N156del)      156        N      del deletion 2.624989 0.2537912
## 2   p.(L211R)      211        L        R missense 2.586086 0.2678294
## 3   p.(Y127Q)      127        Y        Q missense 2.581790 0.2009293
## 4   p.(V466D)      466        V        D missense 2.485453 0.2608994
## 5    p.(L25K)       25        L        K missense 2.441372 0.1932449
##           lfsr lfsr.neg     lfsr.pos test.neg test.pos label
## 1 2.248601e-25        1 2.248601e-25    FALSE     TRUE   LOF
## 2 2.324369e-22        1 2.324369e-22    FALSE     TRUE   LOF
## 3 4.342803e-38        1 4.342803e-38    FALSE     TRUE   LOF
## 4 8.135244e-22        1 8.135244e-22    FALSE     TRUE   LOF
## 5 6.894090e-37        1 6.894090e-37    FALSE     TRUE   LOF
# GOF
scores.data %>% arrange(mean, lfsr.neg) %>% head(5)
##    variants position wildtype mutation     type      mean        sd
## 1 p.(F460H)      460        F        H missense -2.169316 0.2889278
## 2  p.(P79G)       79        P        G missense -1.914219 0.2858500
## 3 p.(P120A)      120        P        A missense -1.803614 0.2948415
## 4 p.(Y279G)      279        Y        G missense -1.795650 0.2917418
## 5 p.(V103H)      103        V        H missense -1.791982 0.2895312
##           lfsr     lfsr.neg lfsr.pos test.neg test.pos label
## 1 2.998224e-14 2.998224e-14        1     TRUE    FALSE   GOF
## 2 1.066723e-11 1.066723e-11        1     TRUE    FALSE   GOF
## 3 4.760717e-10 4.760717e-10        1     TRUE    FALSE   GOF
## 4 3.755610e-10 3.755610e-10        1     TRUE    FALSE   GOF
## 5 3.022497e-10 3.022497e-10        1     TRUE    FALSE   GOF

Position-level estiamte

When running with the position mode (default), Rosace infers position-specific mean estimate (“phi”) and position-specific variance estimate (“sigma2”). The hypothesis testing columns can be interpreted the same way as the ones in the variant-specific section.

If the user includes synonymous and/or missense mutations when running Rosace, it would exclude those variants from the position-level estimates.

scores.data.list <- OutputScore(oct1_rosace_scored, pos.info = TRUE, name = "1SM73_ROSACE", sig.test = 0.05)
scores.var <- scores.data.list$df_variant
scores.pos <- scores.data.list$df_position
head(scores.pos)
## # A tibble: 6 × 11
##     pos phi_mean phi_sd sigma2_mean sigma2_sd lfsr.neg lfsr.pos     lfsr
##   <dbl>    <dbl>  <dbl>       <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
## 1     2   -0.273  0.173       0.488     0.196   0.0577 9.42e- 1 5.77e- 2
## 2     3   -0.209  0.153       0.364     0.147   0.0860 9.14e- 1 8.60e- 2
## 3     4    0.469  0.176       0.586     0.225   0.996  3.77e- 3 3.77e- 3
## 4     5    1.25   0.147       0.349     0.138   1      8.55e-18 8.55e-18
## 5     6   -0.145  0.155       0.397     0.154   0.174  8.26e- 1 1.74e- 1
## 6     7    0.469  0.218       0.911     0.341   0.985  1.55e- 2 1.55e- 2
## # ℹ 3 more variables: test.neg <lgl>, test.pos <lgl>, label <chr>