scPipeline_module_overview.Rmd
In this article we overview each scPipeline module, and provide details on how to specify parameters for analysis. For each module, we provide a description, example reports and documentation for implementation (i.e., parameter specification).
Prior to running any scPipeline module, please visit our getting started with scPipeline article to setup up scPipeline.
In general, our convention is to store all analysis parameters in a
list named parameter.list
at the start of the script.
# example
parameter.list <- list(input.file = "../data/demo.rds", ..., print.inline = F, save.flag = T)
In all scPipeline modules, parameter.list
is already
specified, so users need only to configure the analysis parameters
(e.g., specify input data, set cluster resolution, etc.) prior to
running the module. In general, the default settings will yield
reasonable results, however a little time spent optimizing parameters
will often yield better outcomes.
Initial appraisal of scRNA-seq count matrix data and overview of quality control metrics, data filtering, and normalization. Generates initial UMAP clusters, and PCA analyses. A preprocessed seurat object is saved as output.
Quality control: Cells are filtered based on
genes/cell recovered and fraction of counts belonging to mitochondrial
genes. Injured/broken cells are characterized by low gene recovery and
high mitochondrial content whereas doublets are characterized by
extremely high gene recovery. Based on these behaviors, high quality
non-injured single cells can be retained for downstream analysis using
the gene.upperlimit
, gene.lowerlimit
, and
mt.upperlimit
analysis parameters.
Count Normalization: Count matrices are normalized using SCTransform. In brief, the count data are modeled using regularized negative binomial regression to remove the influence of technical covariates, including sequencing depth and mitochondrial content. The resulting model residuals are treated as normalized count data for downstream analysis.
Dimensional Reduction: Following normalization,
select features (i.e., genes) are used to reduce the dimensionality of
the data set. This effectively eases downstream computational burden,
increases the signal-to-noise and allows for data visualization. We
perform principal component analysis and use the top N principal
components that explain 90% (user-defined;
pca.var.threshold
) variance for Louvain clustering and UMAP
embedding. The UMAP
algorithm embeds reduced data (e.g., principal components) into a lower
dimensional representation while preserving the global structure of the
transcriptomic data to facilitate visualization.
For all the example reports shown here, datasets were subsampled to 10,000 cells and minimal cell filtering was performed due to original count matrices having been cleaned by authors.
Inputs
One of the following data formats must be provided as input:
barcodes.tsv
, genes.tsv
,
matrix.mtx
Outputs
Parameter | Description | Argument |
---|---|---|
input.object | path to seurat object | character |
input.cell_ranger | path to folder containing cell ranger output | character |
input.matrix | path gene x cell count matrix (.tsv, .csv format) | character |
save.flag | save Seurat object from current analysis | logical |
output.object.directory | path to folder in which to save Seurat object | character |
output.object.file | name of file to which Seurat object is saved. Must have “.Rdata” suffix (e.g., ‘object.Rdata’) | character |
subsample.n | value specifying how many cells to sub sample. Set to NA if none. | numeric |
cluster.resolution | value of the resolution parameter passed to
Seurat::FindClusters(...) . Use a value above (below) 1.0 if
you want to obtain a larger (smaller) number of communities. Default is
1. |
numeric |
gene.upperlimit | Upper limit of genes per cell. Value between [0, Inf], but more than
gene.lowerlimit . Any cells exceeding this threshold are
filtered out. Default is 9000. |
numeric |
gene.lowerlimit | Lower limit of genes per cell. Value between [0, Inf], but less than
gene.upperlimit . Any cells below this threshold are
filtered out. Default is 200. |
numeric |
mt.upperlimit | Upper limit of mitochondrial content per cell. Value between [0, 100]. Cells exceeding this threshold are filtered out. Default is 10. | numeric |
unmatched.rate.filter.flag | Logical specifying whether to filtered cells using unmatched rates. Default is TRUE. | |
vars2regress | Variables to regress out during scaling step. Default is ‘percent.mt’. | vector of character(s) |
correct.artifact | identify artifact cells and omit from dimensional reduction. Default is TRUE. | logical |
feature.select.method | Feature selection method for dimensional reduction. | - hvg (default): highly-variable genes - deviance |
scale.method | Expression scaling method. | - sct (default): SCTransform residuals - null_residuals |
pca.method | principal component analysis method. | - pca (default) - glmpca |
pca.component.select.method | method for selecting number of principal components for clustering and UMAP embedding. | - cum_var (default) - elbow |
pca.var.threshold | value between [0, 1] specifying cumulative variance explained by
pca. Ignored if pca.component.select.method is not
‘cum_var’. |
numeric |
conserve.memory | If TRUE, the residual matrix for all genes computed using SCTransform is never created in full; recommended for large data sets, but will take longer to run. Default is FALSE | logical |
species.filter.flag | Logical specifying whether to filter species. Default is FALSE. | logical |
species.include | Specify which species to retain in analysis. Ignored if
species.filter.flag is FALSE |
- Hs: human - Mm: murine |
nworkers | Number of workers used for parallel implementation of SCTransform. Default is 1. | numeric |
save.pdf | Logical specifying whether to save figures and tables in separate directory. | logical |
print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical |
developer | Developer mode. Default is FALSE. | logical |
Batch correction and integration of independent experiments.
Generates a new Seurat integrated
assay using one of four
approaches:
See Stuart et al. (2019). Cell for a description of rPCA and CCA, Polański et al. (2020). Bioinformatics for BBKNN, and Hie et al. (2019). Nature Biotechnology for Scanorama.
Our selection of integration methods is based on the benchmarking efforts published by Luecken et al. (2021). Nature Methods
The primary aim of data integration is to identify common cell
populations between datasets. However, because many integration methods
involve non-linear warping, many statistical assumptions are broken in
the process and the count matrix in the integrated
assay is
not appropriate for downstream differential gene expression. If
downstream differential expression (DEG) analyses are intended, use
counts from the data
slot in the SCT
assay;
Data are processed using PrepSCTFindMarkers()
thereby
ensuring that counts have been corrected for varying sequencing-depth
across data sets.
In addition to specifying the parameter.list
(analysis
parameters), the Integration Module additionally requires a separate
list, named input.data
, to specify the data sets used for
integration. See below for details and examples.
Parameter | Description | Argument |
---|---|---|
save.file | name of output file | character |
integration.feature.select.method | method used to select integration features | - hvg (default): highly-variable genes - deviance |
integration.k.filter | number of neighbors (k) to use when filtering anchors. Default is
200. Passed to Seurat::FindIntegrationAnchors(...) . Ignored
if integration.method = “BBKNN” or “Scanorama”. |
numeric |
integration.k.anchor | number of neighbors (k) to use when picking anchors. Dictates
strength of alignment. 5 is recommended, 20 suggested for stronger
alignments. Passed to Seurat::FindIntegrationAnchors(...) .
Ignored if integration.method = “BBKNN” or “Scanorama”. |
numeric |
integration.k.weight | number of neighbors to consider when weighting anchors. Passed to
Seurat::IntegrateData(...) . Default is 100. Ignored if
integration.method = “BBKNN” or “Scanorama”. |
numeric |
integration.n.genes | number of genes used for integration. 1000-3000 recommended. | numeric |
integration.method | integration method. | - rPCA - CCA - BBKNN - Scanorama |
integration.limit.memory | Specify whether to limit max memory allocated during integration. Default is T. Ignored if integration.method = “BBKNN” or “Scanorama”. | logical |
integration.max.memory | max memory allocated during integration, in terms of Gb. Ignored if integration.limit.memory = F. Default is 150. Ignored if integration.method = “BBKNN” or “Scanorama”. | numeric |
integration.cluster.resolution | value of the resolution parameter passed to
Seurat::FindClusters(...) , which is run after data
integration. Use a value above (below) 1.0 if you want to obtain a
larger (smaller) number of communities. Default is 1. |
numeric |
integration.compute.diversity | Compute change in population diversity before and after integration. Default is T | logical |
pca.var.threshold | proportion of variance used for clustering and UMAP embedding. 0.9 is recommended. | numeric [0,1] |
correct.artifact | omit artifact genes. Default is F | logical |
vars2regress | Variables to regress out during scaling step. Default is ‘percent.mt’. | vector of character(s) |
correct.artifact | identify artifact genes (i.e., genes detected in only one sample) and omit from integration step. Recommended only if integrated datasets have similar cellular compositions. Default is F | logical |
n.workers | number of workers to use for parallel implementation. Provided as list, with each named entry corresponding to different task (e.g., n.workers = list(import = 4, integration = 1)) | list |
print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical |
save.pdf | Logical specifying whether to save figures and tables in separate directory. | logical |
update.log | update analysis log. Ignored if developer = F. | logical |
rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical |
developer | Developer mode. Default is FALSE. | logical |
The input.data
list is a named list, with each entry
corresponding to a separate input dataset. For a basic use, given 2 or
more seurat objects for integration, we can specify the path to the
input data using the file
argument as follows:
# basic input specification
input.data <- list(
dataset1 = list(file = "dataset1.Rdata"), # path to .Rdata file containing seurat object named 'so'
dataset2 = list(file = "dataset2.Rdata"),
dataset3 = list(file = "dataset3.Rdata")
)
Alternatively, if a given data set contains multiple batches of data,
we can partition the seurat object into subsets which will then be
handled as independent datasets in the downstream integration step. This
can be accomplished using the integrate.by
argument, which
specifies which feature to use to split the seurat object. The feature
specified by integrate.by
must be present in the metadata
of the corresponding seurat object. In the example below, we split
dataset 1 using the Barcode
feature prior to
integration:
# spliting seurat object into subset prior to integration
input.data <- list(
dataset1 = list(
file = "dataset1.Rdata", # path to .Rdata file containing seurat object named 'so'
integrate.by = "Barcode" # meta data features specifying sample labels
),
dataset2 = list(
file = "dataset2.Rdata"
),
dataset3 = list(
file = "dataset3.Rdata"
)
)
Users can also filter their inputs prior to integration using the
meta.feature.name
and meta.feature.filter
arguments. For example, if dataset 1 consists of multiple cell
types, and we are interested in only integrating the immune compartment,
we can set meta.feature.filter
to “lymphoid|myeloid” and
meta.feature.name
to “cell.type”, with the latter
corresponding to the meta data column present in the seurat object:
# filtering seurat object prior to integration
input.data <- list(
dataset1 = list(
file = "dataset1.Rdata", # path to .Rdata file containing seurat object named 'so'
meta.feature.name = "cell.type", # meta data column used to filter data
meta.feature.filter = "lymphoid|myeloid" # features to include for downstream integration
),
dataset2 = list(
file = "dataset2.Rdata"
),
dataset3 = list(
file = "dataset3.Rdata"
)
)
Note that the filter uses non-exact matching, so please be aware that substrings may unintentionally filter out data of interest if incorrectly specified. For example, “cell_type_1” will filter “cell_type_1”, “cell_type_10”, “cell_type_11”, etc. A few additional examples for filtering:
Clustering is performed at several candidate resolutions, and the optimal resolution is informed using:
Clustering is performed using the Seurat implementation of
FindNeighbors()
and FindClusters()
. In brief,
a cell X cell shared nearest neighbor (SNN) graph is first constructed
by determining the neighborhood overlap (Jaccard Index) between every
cell and its kth nearest neighbor. Then, the SNN graph is
clustered using the Louvain algorithm. The main parameter affecting the
number of clusters generated is the resolution
. Higher
resolutions yield more clusters, whereas lower resolution yield fewer
clusters.
Inputs
so
.Outputs
Parameter | Description | Argument |
---|---|---|
input.file | path to seurat object | character |
cluster.resolution | range of resolution values used for evaluating candidate cluster configurations. Recommended = c(0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 3) | vector of numeric(s) |
lfc.threshold | log-fold change threshold used in differential-expression analysis. 0.5 is recommended. | numeric |
fdr.threshold | fdr-value significance threshold. Value between [0,1]. Default is 0.05. | numeric |
group.singletons | Group singletons into nearest cluster. If FALSE, assign all
singletons to a “singleton” group. For certain difficult-to-cluster data
sets, group.singletons = FALSE may yield faster results.
Default is FALSE |
logical |
print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical |
subsample_factor | Value between [0,1] specifying fraction of cells to subsample. Use 1 for no subsampling. | numeric |
subsample_n | value specifying how many cells to sub sample. Set to NA if none. | numeric |
n.workers | number of workers used for parallel implementation.
parallel::detectCores() is recommended. |
numeric |
only.pos | only included genes with logFC > 0. Default is TRUE. | logical |
save.pdf | save figures and tables | logical |
update.log | update analysis log. Ignored if developer = F. | logical |
rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical |
developer | Developer mode. Default is FALSE. | logical |
Annotation of cell populations identified by unsupervised clustering. Cell-type markers are scored using the Miko scoring pipeline.
Prior to running this module, we recommend running the cluster optimization module to identify the appropriate resolution for clustering.
Inputs
so
.Outputs
Parameter | Description | Argument |
---|---|---|
input.file | path to seurat object | character |
external.markers | cell-type markers used for annotation | named list; names are cell-types, entries are genes. E.g., list(cell_A = c(“gene1”, “gene2”, “gene3”), cell_B = c(“gene4”, “gene5”, “gene6”)) |
internal.markers | name of scMiko internal gene set(s). Run
names(geneSets) for list of available gene sets. |
character |
marker.filter | filter marker sets using grepl pattern matching. E.g.,
Manno2021 for all La Manno 2021 atlas-derived markers in
Cell_Catalogy stored in scMiko geneSets . |
character |
coherence.threshold | value of minimum gene set coherence. Gene sets with coherence scores below this threshold are disqualified from consideration. Value must be between [0,1]. Default is 0.8. | numeric |
p.threshold | p-value significance threshold. Value between [0,1]. Default is 0.05. | numeric |
do.center | center scores by null model predictions. FALSE is recommended | logical |
frequent.filter.threshold | path to seurat object | character |
filter.frequent.fliers | path to seurat object | character |
min.representation | minimum fraction of genes required for gene set consideration. If fraction of genes present in seurat object is below this threshold, gene set is disqualified. Value between [0,1]. Default is 0.5. | numeric |
cluster.resolution | value of the resolution parameter passed to
Seurat::FindClusters(...) . Use a value above (below) 1.0 if
you want to obtain a larger (smaller) number of communities. Default is
1. |
numeric |
fdr.correction | apply Benjamini & Hochberg correction | logical |
clean.clusters | filter cells that exceed 3 median absolute deviations from cluster center in UMAP coordinate space. Default is FALSE. | logical |
max.cells | max number of cells used in Miko scoring pipeline. Default is 20000. | numeric |
subsample_factor | value specifying fraction of cells to sub sample. Value between [0,1] | numeric |
min.geneset.size | maximal gene set size. Default is 200. | numeric |
max.geneset.size | minimal gene set size. Default is 3. | numeric |
pathway.db | database used for functional annotation of gene sets | - GO - Bader |
n.workers | number of workers used for parallel implementation. Default is
parallel::detectCores() . |
numeric |
save.pdf | save figures and tables. | logical |
show.top.n.gene.panels | max number of gene sets per cluster to show in report. Default is 5 | numeric |
max.gene.panels.allowed | max number of overall gene sets to show in report. Default is 75 | numeric |
print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical |
update.log | update analysis log. Ignored if developer = F. | logical |
rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical |
developer | Developer mode. Default is FALSE. | logical |
Gene program/module identification. Three independent methods are implemented to identify gene programs based on gene co-expression profiles:
Gene set enrichment is performed on each gene program to facilitate functional annotation.
Inputs
so
.Outputs
Parameter | Description | Argument |
---|---|---|
input.file | path to seurat object | character |
cluster.resolution | Value of the resolution parameter, use a value above 1.0 if you want to obtain a larger number of communities. | numeric |
subsample_factor | Value between [0,1] specifying fraction of cells to subsample. Use 1 for no subsampling. | numeric |
subsample_n | value specifying how many cells to sub sample. Set to NA if none. | numeric |
subset.data | Subset Seurat object according to specified meta data field. Only specified meta data entries are retained, while remaining of data is omitted. Set as “no.subset” for not subsetting. | data frame specifying which field (df$field ) to subset
on, and which field entries to retain (df$subgroups ) |
batch.feature | meta data field to use for batch correction (set to NA if unspecified) | character |
feature.selection.method | method used to select features for module detection analysis. | - expr: expression-based criterion - hvg: highly-variable genes - deviance: deviance-based criterion |
max.features | max number of features to include in module detection analysis (NA if unspecified) | |
min.pct | minimal feature expression (applicable only if
select.method = “expr”). Value between [0,1], use 0.1 for
large network and 0.5 for small network. |
numeric |
data.type | Data representation used for module detection | - pearson: based on SCTransform (recommended) - deviance: based on multinomial null model |
purity.optimization.step.size | value of step size used for optimizing module purity. 0.5 recommended. | numeric |
filter.parameters | specify which clusters to include or omit | named list with 2 sets of entries; ‘include’ entries specify which
clusters to retain; ‘omit’ entries specify which clusters to omit.
E.g., list(include = c(0,1,2), omit = c(3,4,5)) |
general.weight.by.var | Weight feature embeddings by the variance of each PC. Default is FALSE. | logical |
general.pca.cum.sum | cumulative variance explained by PCs used for SSN analysis. Value between (0,1], 0.9 is recommended. | numeric |
general.umap.knn | number of neighboring points used in local approximations of manifold structure. Larger values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50. | numeric |
general.cluster.purity | target purity of network modules. Value between [0,1], 0.8 is recommended. | logical |
general.scale.free.topology | apply scale-free transform to shared nearest neighbor graph. Default is TRUE. | logical |
general.ica | perform independent component analysis (ICA). | logical |
general.nmf | perform non-negative matrix factorization (NMF). | logical |
general.nmf.k | number of latent factors to compute with NMF. Can be a range of values (e.g., c(5, 10, 15)) | vector of numeric(s) |
general.robust.pca | use robust PCA for dimensional reduction (warning: computationally intensive) | logical |
n.workers | number of workers used for parallel implementations. | numeric |
print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical |
save.pdf | save figures and tables | logical |
update.log | update analysis log. Ignored if developer = F. | logical |
rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical |
developer | Developer mode. Default is FALSE. | logical |
Given a query gene or gene set, scRNAseq expression is evaluated and gene similarity metrics are computed to identify associated genes.
Inputs
so
.Outputs
Parameter | Description | Argument |
---|---|---|
input.file | path to seurat object | character |
features | features used for analysis. If feature.format = 1,
features must be vector of features present in seurat
object (e.g., c(“PRRX1”, “TCF4”)). If feature.format =
2, features must be name of column spreadsheet containing
features (e.g., “HALLMARK_genes”). If feature.format =
3, features argument is ignored. |
|
feature.format | format of features provided by user. | - 0: user-provided genes - 1: features are stored in spreadsheet. - 2: discovery mode. |
feature.csv.path | Path to spreadsheet containing features. Required if using
feature.format = 1. For intended use, features in
spreadsheet must be organized in columns that can be referenced by the
column header using the features argument. |
character |
discovery.method | Differential expression algorithm used in discovery mode
(feature.format = 2) |
- Wilcoxon: Identifies DEGs using Wilcoxon Rank Sum test genes - CDI: Identifies DEGs using co-dependency test - Gini: Identifies group-specific features using gini inequality index |
discovery.pct.min | Features that are expressed below this value are omitted from
differential expression analysis. Range between [0,1], 0.1 is
recommended. Required only if feature.format = 2. |
numeric |
discovery.group.by | Name of object meta data field to group cells by when performing
differential expression. E.g., “seurat_clusters”. Required only if
feature.format = 2. |
character |
discovery.top.n | path to seurat object | character |
discovery.max.n | minimum fraction of genes required for gene set consideration. If fraction of genes present in seurat object is below this threshold, gene set is disqualified. Value between [0,1]. Default is 0.5. | numeric |
as.module | aggregate features into single meta-program. Default is FALSE. | logical |
cor.method | feature association method | - pearson: Pearson correlation - spearman: Spearman correlation - rho_p: Proportionality constant - cdi: co-dependency index |
cor.top.n | Number of top associated features to show. 70 is recommended. | numeric |
cor.which.data | aslot to use for feature association analysis. “data” is recommended. | - data - scale |
cor.min.expr | Features that are expressed below this value are omitted from feature association analysis. Range between [0,1], 0.025-0.1 is recommended. | numeric |
general.cluster.resolution | Value of the resolution parameter, use a value above 1.0 if you want to obtain a larger number of communities. | numeric |
general.subsample.factor | Value between [0,1] specifying fraction of cells to subsample. Use 1 for no subsampling. | numeric |
general.subset | Subset Seurat object according to specified meta data field. Only specified meta data entries are retained, while remaining of data is omitted. Set as “no.subset” for not subsetting. | data frame specifying which field (df$field ) to subset
on, and which field entries to retain (df$subgroups ) |
general.clean.clusters | omit cells exceeding 3 median absolute deviations from cluster center in UMAP coordinate space. | logical |
general.save.pdf | save figures and tables | logical |
general.print.inline | print figures after each code chunk. Recommended if running script chunk-by-chunk. Default is FALSE. | logical |
general.rprofile.dir | look for input file in path specified in .Rprofile. Default is FALSE. | logical |
general.barcode.recode | Named list where names are new labels, and entries are old labels.
If is provided for old labels, treated as wild card. e.g. list(in.vitro
= c(“invitro), in.vivo = NA). See
scMiko::recordBarcode(...) for details. |
Named list where names are new labels, and entries are old labels. |
expression.do.hex | show UMAP expression using schex. Default is FALSE | logical |
n.workers | number of workers used for parallel implementation. Set to 1 for non-parallel run. | numeric |
developer | Developer mode. Default is FALSE. | logical |
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Canada.utf8 LC_CTYPE=English_Canada.utf8
## [3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Canada.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.13 knitr_1.39 magrittr_2.0.3 R6_2.5.1
## [5] ragg_1.2.2 rlang_1.0.2 fastmap_1.1.0 stringr_1.4.0
## [9] tools_4.2.0 xfun_0.31 cli_3.3.0 jquerylib_0.1.4
## [13] htmltools_0.5.2 systemfonts_1.0.4 yaml_2.3.5 digest_0.6.29
## [17] rprojroot_2.0.3 pkgdown_2.0.3 textshaping_0.3.6 purrr_0.3.4
## [21] formatR_1.12 sass_0.4.1 fs_1.5.2 memoise_2.0.1
## [25] cachem_1.0.6 evaluate_0.15 rmarkdown_2.14 stringi_1.7.6
## [29] compiler_4.2.0 bslib_0.3.1 desc_1.4.1 jsonlite_1.8.0