We developed Rosette to simulate cell count data for growth-based or binding-based DMS screens. Rosette estimates summary statistics, including sequencing count dispersion and variant score distributions, from real data. You can use it to infer distributional properties from your own data and to generate simulated data with a desired noise level, sequencing depth, and so on.
Again, we use the oct1 dataset as our example.
Rosette will learn the distributional properties of variant scores from score estimates. The score estimates can be naive (e.g. simple linear regression) or more complicated (e.g. rosace).
Two dispersion parameters, the dispersion of the sequencing count and the dispersion of the variant library, are calculated from the raw counts. The former measures how much variability in variant representation there is before and during sequencing, and the latter measures how much variability in variant representation there is before cell selection. Both dispersion parameters are inferred automatically when “CreateRosetteObject” is called.
## [1] 10.31088
## [1] 8.114383
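As a rough illustration of what a count dispersion parameter captures, the sketch below simulates overdispersed counts and recovers the dispersion with a method-of-moments estimate. It assumes a negative binomial model with variance mu + mu^2 / size. That model is an assumption made for illustration and is not necessarily the estimator Rosette uses internally. All names (true_mu, true_size, size_hat) are hypothetical.

```r
set.seed(42)

# Hypothetical setup: 1e5 counts with mean 100 and dispersion (size) 10.
# Under the negative binomial, Var = mu + mu^2 / size, so a smaller size
# means noisier counts.
true_mu   <- 100
true_size <- 10
counts <- rnbinom(1e5, mu = true_mu, size = true_size)

# Method-of-moments estimate: solve Var = mu + mu^2 / size for size.
mu_hat   <- mean(counts)
var_hat  <- var(counts)
size_hat <- mu_hat^2 / (var_hat - mu_hat)

print(size_hat)
```

With this many simulated counts, the estimate should land close to the true value of 10. With few variants or low counts, a moment estimate like this becomes unstable.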
To account for similar functional effects among mutants (substitutions, insertions, or deletions of amino acids), we group them into mutant groups using hierarchical clustering.
hclust <- HclustMutant(rosette, save.plot = FALSE)
rosette <- GenMutLabel(rosette, hclust = hclust, Gmut = 4, save.plot = FALSE)
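The idea behind the mutant grouping can be sketched with base R alone. The example below is a minimal sketch on made-up data, assuming each mutant type is summarized by a vector of scores across positions. It uses stats::hclust and stats::cutree rather than the Rosette functions above, whose internals may differ. The matrix score_mat, the Ward linkage, and the reading of Gmut as the number of groups are assumptions for illustration.

```r
set.seed(1)

# Hypothetical data: 20 mutant types (rows) scored at 50 positions (columns).
# Rows 1-10 are shifted down to mimic mutants with stronger average effects.
score_mat <- matrix(rnorm(20 * 50), nrow = 20,
                    dimnames = list(paste0("mut", 1:20), NULL))
score_mat[1:10, ] <- score_mat[1:10, ] - 2

# Cluster mutant types by Euclidean distance between their score profiles.
tree <- stats::hclust(stats::dist(score_mat), method = "ward.D2")

# Cut the tree into 4 groups (assumed here to play the role of Gmut = 4).
mut_group <- stats::cutree(tree, k = 4)
print(table(mut_group))
```

Cutting the tree at a fixed number of groups keeps the number of score distributions to estimate small, at the cost of merging mutants whose effects differ only slightly.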
Within each mutant group, a variant can have a neutral, loss-of-function, or gain-of-function effect. We therefore categorize the variants into three groups and estimate the score distribution parameters for each group.
rosette <- GenVarLabel(rosette, Gvar = 2)
PlotScoreHist(rosette, var.group = TRUE, mut.group = TRUE)
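Estimating score distribution parameters per variant group amounts to summarizing the scores within each label. The sketch below uses simulated scores and a two-cluster k-means split as a stand-in for the variant labels. Rosette's own labeling procedure and fitted distribution may differ, and the names scores, var_group, and params are hypothetical.

```r
set.seed(7)

# Hypothetical scores: 300 near-neutral variants and 100 loss-of-function ones.
scores <- c(rnorm(300, mean = 0, sd = 0.3), rnorm(100, mean = -2, sd = 0.5))

# Stand-in labeling: split the scores into 2 groups with k-means.
var_group <- kmeans(scores, centers = 2, nstart = 10)$cluster

# Per-group mean, standard deviation, and size.
params <- data.frame(
  group = sort(unique(var_group)),
  mean  = as.numeric(tapply(scores, var_group, mean)),
  sd    = as.numeric(tapply(scores, var_group, sd)),
  n     = as.numeric(table(var_group))
)
print(params)
```

Per-group means and standard deviations like these are the kind of summary a simulator can later draw new variant scores from.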
Next, create a config file with the other user-defined properties for the simulation, such as the number of rounds and replicates, experiment type (growth or binding), wild-type effect (binding) or doubling rate (growth), sequencing depth, shrinkage factor for library or sequencing dispersion, and simulation mode (clean or with replication error).