Dealing with Multiple Mutations

Riley Xin

Overview

In the introduction, we consider an experiment primarily focusing on single amino acid substitutions, with each variant being a point mutation. Rosace can also be used to analyze data with variants containing multiple mutations at various positions. You can either (1) run Rosace without positional information or (2) provide a positional grouping to allow information to be shared across similar variants. Since the number of variants with multiple mutations at the exact same positions might be small, (2) requires using heuristics to group variants at similar positions together. We demonstrate two clustering methods using the BRCA1 RING phage display data provided by Enrich2.

library(dplyr)
library("rosace")

Load data

We’ve processed the BRCA1 RING data to include nucleotide-level and amino acid-level mutation positions, as well as wildtypes, mutations, and mutation types. The amino acid-level positional label will be used for grouping.

data("BRCA1R")

Create Rosace object

Next, create a Rosace object and run data imputation, normalization, and normalization. The BRCA1 phage experiment contains 6 replicates.

key <- "BRCA1R"
type <- "growth"
assay1 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(18:23)),
                            var.names = BRCA1R.data$hgvs_nt,
                            key = key, rep = 1, type = type)
assay2 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(24:29)),
                            var.names = BRCA1R.data$hgvs_nt,
                            key = key, rep = 2, type = type)
assay3 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(30:35)),
                            var.names = BRCA1R.data$hgvs_nt,
                            key = key, rep = 3, type = type)
assay4 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(36:41)),
                            var.names = BRCA1R.data$hgvs_nt,
                            key = key, rep = 4, type = type)
assay5 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(42:47)),
                            var.names = BRCA1R.data$hgvs_nt,
                            key = key, rep = 5, type = type)
assay6 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(48:53)),
                            var.names = BRCA1R.data$hgvs_nt,
                            key = key, rep = 6, type = type)

rosace <- CreateRosaceObject(object = assay1)
rosace <- AddAssayData(object = rosace, assay = assay2)
rosace <- AddAssayData(object = rosace, assay = assay3)
rosace <- AddAssayData(object = rosace, assay = assay4)
rosace <- AddAssayData(object = rosace, assay = assay5)
rosace <- AddAssayData(object = rosace, assay = assay6)
GetAssayName(rosace)

rosace <- ImputeData(rosace, key = key, impute.method = "zero")
rosace <- NormalizeData(rosace, key = key,
                        normalization.method = "wt",
                        wt.var.names = "_wt",
                        wt.rm = TRUE)
rosace <- IntegrateData(object = rosace, key = key)

Run Rosace without positional information

Add variant data including hgvs name, aa_pos_mut, wildtype, mutation, and mutation type to the Rosace object. Running Rosace without specifying the ‘pos.col’ argument will compute variant scores without leveraging positional information. For demonstration purposes, only the first 100 variants are used.

rosace@var.data <- rosace@var.data %>% left_join(BRCA1R.data[, 2:7], by = c("variants" = "hgvs_nt"))
head(rosace@var.data)
rosace@assay.sets$`BRCA1R`@raw.counts <- rosace@assay.sets$`BRCA1R`@raw.counts[1:100, ]
rosace@assay.sets$`BRCA1R`@combined.counts <- rosace@assay.sets$`BRCA1R`@combined.counts[1:100, ]
rosace@assay.sets$`BRCA1R`@var.names <- rosace@assay.sets$`BRCA1R`@var.names[1:100]
rosace <- RunRosace(object = rosace,
                    name = "BRCA1R",
                    type = "AssaySet",
                    savedir = "../tests/results/stan/assayset/",
                    ctrl.col = "type",
                    ctrl.name = "synonymous",
                    install = FALSE)

Run Rosace with positional information

Running Rosace with positional information requires additional grouping when there are not enough variants at the exact same positions (less than 10). One option is to use unsupurvised clustering on the positions. First, create a position matrix for variants with multiple mutations:

pos.mat <- matrix(unlist(lapply(lapply(strsplit(BRCA1R.data$aa_pos_mut, "\\."), as.numeric), function(x){
   c(x,rep(0,10-length(x)))
})), ncol = 10, byrow = TRUE)

rownames(pos.mat) <- seq(1, dim(BRCA1R.data)[1])
pos.mat <- pos.mat[pos.mat[, 2] != 0,]

Then, consider clustering algorithms such as kmeans or scclust. Kmeans requires specifying the number of groups, while scclust allows one to specify the minimum number of elements in a group.

# group variants using kmeans
clust_kmeans <- kmeans(pos.mat, centers = 500)
clust.res <- clust_kmeans$cluster
# group variants using scclust
install.packages("scclust")
library("scclust")
dist <- distances(data.frame(pos.mat, idx = rownames(pos.mat)),
                  id_variable = "idx",
                  dist_variables = paste0("X", seq(1, 10)))

# optimizing the clusters while ensuring a minimum of 10 variants in each cluster 
clust.res <- sc_clustering(dist, 10)

Label position groups for both single mutations (from original position) and multiple mutations (from clustering assignment) in a column ‘pos_group’.

BRCA1R.data <- BRCA1R.data %>% 
  mutate(pos_group = paste0("S",aa_pos_mut), .after = 4)
BRCA1R.data[rownames(pos.mat),]$pos_group <- paste0("M", clust.res)

Then, run Rosace with the ‘pos.col’ argument being ‘pos_group’:

rosace@var.data <- rosace@var.data %>% left_join(BRCA1R.data[, 2:8], by = c("variants" = "hgvs_nt"))
head(rosace@var.data)
rosace@assay.sets$`BRCA1R`@raw.counts <- rosace@assay.sets$`BRCA1R`@raw.counts[1:100, ]
rosace@assay.sets$`BRCA1R`@combined.counts <- rosace@assay.sets$`BRCA1R`@combined.counts[1:100, ]
rosace@assay.sets$`BRCA1R`@var.names <- rosace@assay.sets$`BRCA1R`@var.names[1:100]
rosace <- RunRosace(object = rosace,
                    name = "BRCA1R",
                    type = "AssaySet",
                    savedir = "../tests/results/stan/assayset/",
                    pos.col = "pos_group",
                    ctrl.col = "type",
                    ctrl.name = "synonymous",
                    install = FALSE)

Finally, get the variant scores and proceed with downstream analysis.

scores.data.list<- OutputScore(rosace, pos.info = TRUE, name = "BRCA1R_ROSACE1", sig.test = 0.05)
scores.pos <- scores.data.list$df_position