In the introduction, we consider an experiment primarily focusing on
single amino acid substitutions, with each variant being a point
mutation. Rosace
can also be used to analyze data with
variants containing multiple mutations at various positions. You can
either (1) run Rosace without positional information or (2) provide a
positional grouping to allow information to be shared across similar
variants. Since the number of variants with multiple mutations at the
exact same positions might be small, (2) requires using heuristics to
group variants at similar positions together. We demonstrate two
clustering methods using the BRCA1
RING phage display data provided by Enrich2
.
We’ve processed the BRCA1 RING data to include nucleotide-level and amino acid-level mutation positions, as well as wildtypes, mutations, and mutation types. The amino acid-level positional label will be used for grouping.
Next, create a Rosace object and run data imputation, normalization, and normalization. The BRCA1 phage experiment contains 6 replicates.
key <- "BRCA1R"
type <- "growth"
assay1 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(18:23)),
var.names = BRCA1R.data$hgvs_nt,
key = key, rep = 1, type = type)
assay2 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(24:29)),
var.names = BRCA1R.data$hgvs_nt,
key = key, rep = 2, type = type)
assay3 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(30:35)),
var.names = BRCA1R.data$hgvs_nt,
key = key, rep = 3, type = type)
assay4 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(36:41)),
var.names = BRCA1R.data$hgvs_nt,
key = key, rep = 4, type = type)
assay5 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(42:47)),
var.names = BRCA1R.data$hgvs_nt,
key = key, rep = 5, type = type)
assay6 <- CreateAssayObject(counts = as.matrix(BRCA1R.data %>% select(48:53)),
var.names = BRCA1R.data$hgvs_nt,
key = key, rep = 6, type = type)
rosace <- CreateRosaceObject(object = assay1)
rosace <- AddAssayData(object = rosace, assay = assay2)
rosace <- AddAssayData(object = rosace, assay = assay3)
rosace <- AddAssayData(object = rosace, assay = assay4)
rosace <- AddAssayData(object = rosace, assay = assay5)
rosace <- AddAssayData(object = rosace, assay = assay6)
GetAssayName(rosace)
rosace <- ImputeData(rosace, key = key, impute.method = "zero")
rosace <- NormalizeData(rosace, key = key,
normalization.method = "wt",
wt.var.names = "_wt",
wt.rm = TRUE)
rosace <- IntegrateData(object = rosace, key = key)
Add variant data including hgvs name, aa_pos_mut, wildtype, mutation, and mutation type to the Rosace object. Running Rosace without specifying the ‘pos.col’ argument will compute variant scores without leveraging positional information. For demonstration purposes, only the first 100 variants are used.
rosace@var.data <- rosace@var.data %>% left_join(BRCA1R.data[, 2:7], by = c("variants" = "hgvs_nt"))
head(rosace@var.data)
rosace@assay.sets$`BRCA1R`@raw.counts <- rosace@assay.sets$`BRCA1R`@raw.counts[1:100, ]
rosace@assay.sets$`BRCA1R`@combined.counts <- rosace@assay.sets$`BRCA1R`@combined.counts[1:100, ]
rosace@assay.sets$`BRCA1R`@var.names <- rosace@assay.sets$`BRCA1R`@var.names[1:100]
rosace <- RunRosace(object = rosace,
name = "BRCA1R",
type = "AssaySet",
savedir = "../tests/results/stan/assayset/",
ctrl.col = "type",
ctrl.name = "synonymous",
install = FALSE)
Running Rosace with positional information requires additional grouping when there are not enough variants at the exact same positions (less than 10). One option is to use unsupurvised clustering on the positions. First, create a position matrix for variants with multiple mutations:
pos.mat <- matrix(unlist(lapply(lapply(strsplit(BRCA1R.data$aa_pos_mut, "\\."), as.numeric), function(x){
c(x,rep(0,10-length(x)))
})), ncol = 10, byrow = TRUE)
rownames(pos.mat) <- seq(1, dim(BRCA1R.data)[1])
pos.mat <- pos.mat[pos.mat[, 2] != 0,]
Then, consider clustering algorithms such as kmeans
or scclust.
Kmeans requires specifying the number of groups, while
scclust
allows one to specify the minimum number of
elements in a group.
# group variants using kmeans
clust_kmeans <- kmeans(pos.mat, centers = 500)
clust.res <- clust_kmeans$cluster
library("scclust")
dist <- distances(data.frame(pos.mat, idx = rownames(pos.mat)),
id_variable = "idx",
dist_variables = paste0("X", seq(1, 10)))
# optimizing the clusters while ensuring a minimum of 10 variants in each cluster
clust.res <- sc_clustering(dist, 10)
Label position groups for both single mutations (from original position) and multiple mutations (from clustering assignment) in a column ‘pos_group’.
BRCA1R.data <- BRCA1R.data %>%
mutate(pos_group = paste0("S",aa_pos_mut), .after = 4)
BRCA1R.data[rownames(pos.mat),]$pos_group <- paste0("M", clust.res)
Then, run Rosace with the ‘pos.col’ argument being ‘pos_group’:
rosace@var.data <- rosace@var.data %>% left_join(BRCA1R.data[, 2:8], by = c("variants" = "hgvs_nt"))
head(rosace@var.data)
rosace@assay.sets$`BRCA1R`@raw.counts <- rosace@assay.sets$`BRCA1R`@raw.counts[1:100, ]
rosace@assay.sets$`BRCA1R`@combined.counts <- rosace@assay.sets$`BRCA1R`@combined.counts[1:100, ]
rosace@assay.sets$`BRCA1R`@var.names <- rosace@assay.sets$`BRCA1R`@var.names[1:100]
rosace <- RunRosace(object = rosace,
name = "BRCA1R",
type = "AssaySet",
savedir = "../tests/results/stan/assayset/",
pos.col = "pos_group",
ctrl.col = "type",
ctrl.name = "synonymous",
install = FALSE)
Finally, get the variant scores and proceed with downstream analysis.
scores.data.list<- OutputScore(rosace, pos.info = TRUE, name = "BRCA1R_ROSACE1", sig.test = 0.05)
scores.pos <- scores.data.list$df_position