cdQuPath Vignette: overview

Introduction

This package presents a fundamental workflow for spatial phenotyping, cell type annotation, multi-nucleus detection, quality control, and calculating cell neighborhoods from whole slide images of ultra-high multiplexed tissue captured by ‘Akoya Biosciences’ assays. The workflow leverages QuPath, StarDist, Cellpose, Seurat, and hoodscanR.

The upstream code are available in the folder of extdata. The detailed documentation will be provide in following sections.

The cdQuPath aims to provide a easy to use tool for downstream analysis with Seurat and hoodscanR for quality control and neighborhoods analysis.

Quick start

Here is an example using cdQuPath.

Installation

First, please download and install QuPath (v0.4.0 or newer). The cell segmentation requires the cellpose and stardist extension for QuPath. Don’t forget to set the Python path for the Cellpose extension.

Then, download a copy of the Groovy script to your local machine. There are two scripts available: cellpose_stardist_multiNucleis_underGUI.groovy and ExportCellDetectionMeasurement.groovy.

git clone https://github.com/jianhong/cdQuPath.git
cd cdQuPath/inst/extdata/scripts
ls

Then, install cdQuPath and other packages required to run the examples.

library(BiocManager)
BiocManager::install("jianhong/cdQuPath")

Step 1. Cell segmentation

This step involves merging the results from Cellpose and Stardist analyses.

Navigate to line 25 of the cellpose_stardist_multiNucleis_underGUI.groovy script and update the path of the stardistModel variable to match the file path of your cell_seg_model.pb. Adjust the parameters from line 26 to line 38 accordingly. You may want to change the cytoChannels and measurementPercentileCutoff according to your observations. The distanceCutoff is the maximal cutoff distance of nucleus for multiple nucleus cells.

Then, select a region in the opened TIFF view window in QuPath. Start with a small region when testing the script, and expand it once you achieve the desired results. Please note that if you observe regional biases, you may want to exclude the biases regions. If it is un-avoidable, you may want to normalize the signals. We will discuss this later.

Finally, run the cellpose_stardist_multiNucleis_underGUI.groovy script by clicking the Run button in the Script Editor.

Step 2. Export the cell segmentation

The script ExportCellDetectionMeasurement.groovy is designed to export cell segmentation data. Upon export, two files are generated and saved in a folder prefixed with measurementsExport.

The .tsv file is compatible with createSeuratObj.R.
The .geojson file is intended for reloading into QuPath.

These exported files contain comprehensive information, including cell area, locations, marker signal statistics, and nucleus classification. The cell locations are particularly useful for neighborhood analysis.

Step 3. Work in R

Load library

library(dplyr)
library(cdQuPath)
library(Seurat)
library(ggplot2)
library(dittoSeq)
library(hoodscanR)
library(SpatialExperiment)

Create a Seurat object from output `tsv` file.

tsv <- system.file("extdata", "sample.qptiff.tsv.gz",
                   package = "cdQuPath",
                   mustWork = TRUE)
seu <- createSeuratObj(tsv, useValue = 'Median',
                       markerLocations = CodexPredefined$markerLocations)
seu

## An object of class Seurat 
## 600 features across 1093 samples within 25 assays 
## Active assay: RNA (24 features, 0 variable features)
##  1 layer present: counts
##  24 other assays present: Nucleus_Mean, Nucleus_Median, Nucleus_Min, Nucleus_Max, Nucleus_Std.Dev., Nucleus_Variance, Cytoplasm_Mean, Cytoplasm_Median, Cytoplasm_Min, Cytoplasm_Max, Cytoplasm_Std.Dev., Cytoplasm_Variance, Membrane_Mean, Membrane_Median, Membrane_Min, Membrane_Max, Membrane_Std.Dev., Membrane_Variance, Cell_Mean, Cell_Median, Cell_Min, Cell_Max, Cell_Std.Dev., Cell_Variance
##  1 dimensional reduction calculated: pos

Quality Control

The conventional method for evaluating dynamic range involves computing a signal-to-background (SNR) ratio. This is done by dividing the average intensity of the top 20 brightest cells by the average intensity of the weakest 10% of cells. An SNR value of 10 or higher is indicative of reliable image analysis. It is recommended that the SNR falls within the range of (10, Inf], typically exceeding 100. An SNR below 3 suggests that the antibodies are performing poorly.

## please make sure the default assay is 'RNA'
DefaultAssay(seu)

## [1] "RNA"

dr <- dynamicRanges(seu)
dr

## $brightest
##  DAPI.01    CD163    CD127     sox9     CD14      CD4      mpo     cd68 
##  109.625   13.300   11.600   24.300   26.875   52.150   10.000  238.750 
##     cd45    foxp3     lcp1      pd1      cd8    hif1a     ki67     cd19 
##   19.700    8.750   18.350    7.400   10.525   30.575   80.625   17.650 
##    cd206    znf90     asma      cd3     cd86 hist3h2a     cd34    CD11c 
##   12.550   34.850   20.875    2.150   11.925  105.600   50.000   42.000 
## 
## $weakest
##   DAPI.01     CD163     CD127      sox9      CD14       CD4       mpo      cd68 
## 47.686992  2.148882  2.790650  0.000000  2.899390  5.625508  3.911585 69.067073 
##      cd45     foxp3      lcp1       pd1       cd8     hif1a      ki67      cd19 
##  4.411077  3.632622  6.510163  2.408537  3.322663  2.966463  3.584858 10.713415 
##     cd206     znf90      asma       cd3      cd86  hist3h2a      cd34     CD11c 
##  6.490854 17.210874  0.000000  0.000000  9.333841 23.114329 10.100610 23.494411 
## 
## $dynamic_range
##   DAPI.01     CD163     CD127      sox9      CD14       CD4       mpo      cd68 
##  2.298845  6.189265  4.156737       Inf  9.269190  9.270274  2.556508  3.456785 
##      cd45     foxp3      lcp1       pd1       cd8     hif1a      ki67      cd19 
##  4.466029  2.408728  2.818670  3.072405  3.167640 10.306886 22.490432  1.647467 
##     cd206     znf90      asma       cd3      cd86  hist3h2a      cd34     CD11c 
##  1.933490  2.024883       Inf       Inf  1.277609  4.568595  4.950196  1.787659

## markers may not work;
## Please note that DAPI.01 is used for cell detection.
avoid_markers <- 
  names(dr$dynamic_range)[dr$dynamic_range<3 | is.na(dr$dynamic_range)]
avoid_markers

## [1] "DAPI.01" "mpo"     "foxp3"   "lcp1"    "cd19"    "cd206"   "znf90"  
## [8] "cd86"    "CD11c"

Verify the cell area and nucleus area for multi-nuclei.

# Visualize QC metrics as a violin plot
colnames(seu[[]])

##  [1] "orig.ident"              "nCount_RNA"             
##  [3] "nFeature_RNA"            "Cell.classification"    
##  [5] "Cell.X"                  "Cell.Y"                 
##  [7] "Nucleus.Area.um.2"       "Nucleus.Length.um"      
##  [9] "Nucleus.Circularity"     "Nucleus.Solidity"       
## [11] "Nucleus.Max.diameter.um" "Nucleus.Min.diameter.um"
## [13] "Cell.Area.um.2"          "Cell.Length.um"         
## [15] "Cell.Circularity"        "Cell.Solidity"          
## [17] "Cell.Max.diameter.um"    "Cell.Min.diameter.um"   
## [19] "Nucleus.Cell.area.ratio" "Cell.isMultiNucleis"

VlnPlot(seu,
        features = c("Cell.Area.um.2",
                     "Nucleus.Area.um.2",
                     'Nucleus.Cell.area.ratio'),
        split.by = 'Cell.isMultiNucleis',
        pt.size = ifelse(ncol(seu)>500, 0, 1),
        ncol = 3)

## if you want to save the plot, try
## ggsave(filename)

VlnPlot(seu, features = c('cd68', 'CD11c', 'cd34'),
        pt.size = ifelse(ncol(seu)>500, 0, 1),
        ncol = 3)

Now, we can proceed with filtering the cells. Upon observation, we notice that the average area of single-nucleus cells is below 500. Therefore, we establish the cutoff for cell area at approximately three times the size of single-nucleus cells. Additionally, if the nucleus-to-cell area ratio is excessively high, it suggests potential issues with the cell area detection process. We will filter out those cells.

keep <- seu$Cell.Area.um.2<1500 & seu$Nucleus.Cell.area.ratio < .9
seu <- subset(seu,
              cells = colnames(seu)[keep],
              features = names(dr$dynamic_range)[!is.na(dr$dynamic_range)])
seu

## An object of class Seurat 
## 600 features across 1088 samples within 25 assays 
## Active assay: RNA (24 features, 0 variable features)
##  1 layer present: counts
##  24 other assays present: Nucleus_Mean, Nucleus_Median, Nucleus_Min, Nucleus_Max, Nucleus_Std.Dev., Nucleus_Variance, Cytoplasm_Mean, Cytoplasm_Median, Cytoplasm_Min, Cytoplasm_Max, Cytoplasm_Std.Dev., Cytoplasm_Variance, Membrane_Mean, Membrane_Median, Membrane_Min, Membrane_Max, Membrane_Std.Dev., Membrane_Variance, Cell_Mean, Cell_Median, Cell_Min, Cell_Max, Cell_Std.Dev., Cell_Variance
##  1 dimensional reduction calculated: pos

Normalizing the data

The initial query is whether we are observing abnormally high or low intensity regions in specific sub-regions. If these variations are deemed unrelated to tissue-specific characteristics, it might be advisable to normalize each marker channel accordingly. The positional data of cells is contained within the reduction matrix labeled ‘pos’.

nrow(seu)

## [1] 24

suppressMessages(
  FeaturePlot(seu, slot='count',
            features = rownames(seu)[1],
            reduction = 'pos') + scale_y_reverse()
)

You may attempt to normalize the irregular sub-regions using LOESS (Local Polynomial Regression Fitting) with the loessSmooth function.

The next step involves cell-wise normalization. The available normalization methods are ‘LogNormalize’, ‘CLR’ (centered log ratio), ‘RC’ (CPM), and ‘zscore’, which can be applied using the normData function. The normalization method except for ‘zscore’ is borrowed from Seurat package.

seu <- normData(seu, normalizationMethod='zscore')

Identification of highly variable markers (marker selection)

In the background, this step will execute the FindVariableFeatures function and exclude the ‘avoidMarkers’ from the list of variable markers. Please ensure that ‘nFeatures’ is smaller than the total number of available markers.

seu <- findVarMarkers(
    seu,
    avoidMarkers=c('DAPI.01', 'hist3h2a'),
    nFeatures = 10L)
VariableFeatures(seu)

## [1] "ki67"  "CD4"   "cd34"  "hif1a" "mpo"   "znf90" "cd19"  "CD127" "CD14"

plot1 <- VariableFeaturePlot(seu)
LabelPoints(plot = plot1, points = VariableFeatures(seu), repel = TRUE)

Scaling the data and do cluster analysis

The following steps outline the standard process of single-cell analysis in the Seurat package. You may adjust the process accordingly to suit your needs.

all.markers <- rownames(seu)
seu <- ScaleData(seu, features = all.markers)
seu <- RunPCA(seu, features = VariableFeatures(seu))
dim <- seq.int(ncol(Embeddings(seu$pca)))
VizDimLoadings(seu, dims = dim,
               nfeatures=length(VariableFeatures(seu)),
               reduction = "pca")

seu <- FindNeighbors(seu, dims = dim)
seu <- FindClusters(seu, resolution = 0.9)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 1088
## Number of edges: 32836
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7546
## Number of communities: 10
## Elapsed time: 0 seconds

seu <- RunUMAP(seu, reduction = "pca", dims = dim)
pt.size <- ifelse(ncol(seu)>500, 0, 2)
DimPlot(seu,
        reduction = 'umap', label = TRUE,
        pt.size = pt.size) + scale_y_reverse()

## check the distribution of the clusters in spatial position.
ncol <- ceiling(sqrt(length(unique(seu$seurat_clusters))))
DimPlot(seu,
        reduction = 'pos', label = FALSE,
        split.by = 'seurat_clusters',
        ncol = ncol, pt.size = pt.size) + scale_y_reverse()

## check the nucleis distribution
dittoBarPlot(seu,
             var = 'Cell.isMultiNucleis',
             group.by='seurat_clusters')

DimPlot(seu, reduction = 'pos', label = TRUE,
        split.by = 'Cell.isMultiNucleis',
        pt.size = pt.size) + scale_y_reverse()

VlnPlot(seu,
        features = c('Nucleus.Area.um.2',
                     'Cell.Area.um.2',
                     'Nucleus.Cell.area.ratio'),
        pt.size = pt.size)

## Dot plot
markers <- unique(FindAllMarkers(seu)$gene)
DotPlot(seu, features = markers, scale = FALSE) + coord_flip()

Rescale data using Gaussian Mixture Model

In the downstream analysis, cell types are detected by fitting a multiple-class Gaussian Mixture Model (GMM) to the re-scaled data. The raw intensities are converted to probabilities, and a cutoff of 0.5 is set for cell type assignment.

Please run fitGMM for each sub-region of the image if the background of the sub-regions are not identical.

library(future.apply)
plan(multisession) # for multiple process.
set.seed(123) ## for model fit
seu <- fitGMM(seu)
DotPlot(seu, features = markers, assay = 'GMM', scale = FALSE) + coord_flip()

Correlation heatmap

The correlation heatmap will display the signal similarity among different markers. The distance will be calculated using the stats::dist function. Users can experiment with different inputs such as z-scales, GMM data, and so on.

DefaultAssay(seu)

## [1] "RNA"

p <- markerCorrelation(seu)
names(p)

## [1] "counts"     "data"       "scale.data"

## plot the correlations for data, here it is Z-scale data
p[['data']]

## use GMM rescaled data to plot the correlation
DefaultAssay(seu) <- 'GMM'
p_GMM <- markerCorrelation(seu)
p_GMM[['data']]

## 
DefaultAssay(seu) <- 'RNA'

Assign cell types

The phenotype will utilize the re-scaled data obtained through GMM. Two methods are available for assigning cell types: ‘Rank’ and ‘Boolean’. With the ‘Rank’ option, cell types are assigned based on the rank of probabilities for the positive markers. On the other hand, the ‘Boolean’ option assigns cell types according to the order of the classifier for the positive markers.

The metadata ‘celltype’ stores the assigned cell types for each Seurat cluster. Additionally, the metadata ‘celltype_all’ contains all possible cell types for each cell, determined by a cutoff of 0.5. Finally, the metadata ‘celltype_first’ represents the initial assignment.

seu <- detectCellType(seu)
table(seu$celltype)

## 
##              B CELLS    Cytotoxic T Cells          Endothelial 
##                  178                  310                   87 
##        M1 Macrophage             MONOCYTE Proliferation Marker 
##                  318                   47                  148

## check the cell type annotation by clusters
DimPlot(seu, group.by = 'celltype')

DimPlot(seu, group.by = 'celltype', reduction = 'pos') + scale_y_reverse()

dittoBarPlot(seu,
             var = 'celltype',
             group.by='Cell.isMultiNucleis')

## check the first annotation of each cell, not restricted by seurat_cluster
DimPlot(seu, group.by = 'celltype_first', reduction = 'pos') + scale_y_reverse()

dittoBarPlot(seu,
             var = 'celltype_first',
             group.by='Cell.isMultiNucleis')

## check the signals for each celltype
Idents(seu) <- 'celltype'
DotPlot(seu, features = rownames(seu), scale = FALSE) + coord_flip() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

## export the celltype
ct <- data.frame(Object.ID=colnames(seu),
                 x=seu$Cell.X,
                 y=seu$Cell.Y,
                 CellType=seu$celltype)
head(ct, n=2)

##                                                                 Object.ID
## 8d5ccdac-31da-4930-81b6-dd9f272c10d7 8d5ccdac-31da-4930-81b6-dd9f272c10d7
## b21deea4-2d9c-4e0b-9c83-591622b33a9a b21deea4-2d9c-4e0b-9c83-591622b33a9a
##                                             x        y      CellType
## 8d5ccdac-31da-4930-81b6-dd9f272c10d7 13142.86 8012.715 M1 Macrophage
## b21deea4-2d9c-4e0b-9c83-591622b33a9a 13142.86 8012.715 M1 Macrophage

## if you see postfix like _1, then run
ct$Object.ID <- sub('_\\d+$', '', ct$Object.ID)
## output to same folder: celltype_csv <- sub('qptiff.tsv', 'celltype.csv', tsv)
celltype_csv <- tempfile(fileext = '.csv')
write.csv(ct, file = celltype_csv, row.names = FALSE)
## after exporting, you can load the celltypes back to QuPath by running the script load_celltype.groovy in the scripts folder.

Merge multiple Seurat objects if necessary

If you have datasets from multiple sub-regions, you can process each one individually as shown above and then merge them. Occasionally, some images may not have the same background across different regions. To address this, you can detect the cells in each sub-region with a similar background or baseline and then perform GMM rescaling for each sub-region. Once this is done, you can combine the analyzed results to create a merged dataset.

tsv2 <- system.file("extdata", "sample2.qptiff.tsv.gz",
                   package = "cdQuPath",
                   mustWork = TRUE)
seu2 <- createSeuratObj(tsv2, useValue = 'Median',
                       markerLocations = CodexPredefined$markerLocations)
seu2

## An object of class Seurat 
## 600 features across 282 samples within 25 assays 
## Active assay: RNA (24 features, 0 variable features)
##  1 layer present: counts
##  24 other assays present: Nucleus_Mean, Nucleus_Median, Nucleus_Min, Nucleus_Max, Nucleus_Std.Dev., Nucleus_Variance, Cytoplasm_Mean, Cytoplasm_Median, Cytoplasm_Min, Cytoplasm_Max, Cytoplasm_Std.Dev., Cytoplasm_Variance, Membrane_Mean, Membrane_Median, Membrane_Min, Membrane_Max, Membrane_Std.Dev., Membrane_Variance, Cell_Mean, Cell_Median, Cell_Min, Cell_Max, Cell_Std.Dev., Cell_Variance
##  1 dimensional reduction calculated: pos

keep <- seu2$Cell.Area.um.2<1500 & seu2$Nucleus.Cell.area.ratio < .9
table(keep)

## keep
## FALSE  TRUE 
##     1   281

seu2 <- subset(seu2,
              cells = colnames(seu2)[keep])
seu2 <- normData(seu2, normalizationMethod='zscore')
seu2 <- findVarMarkers(
    seu2,
    avoidMarkers=c('DAPI.01', 'hist3h2a'),
    nFeatures = 10L)
seu2 <- ScaleData(seu2, features = rownames(seu2))
seu2 <- RunPCA(seu2, features = VariableFeatures( seu2))
dim <- seq.int(ncol(Embeddings(seu2$pca)))
seu2 <- FindNeighbors(seu2, dims = dim)
seu2 <- FindClusters(seu2, resolution = 0.9)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 281
## Number of edges: 8053
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.6137
## Number of communities: 5
## Elapsed time: 0 seconds

seu2 <- fitGMM(seu2)
seu2 <- detectCellType(seu2)
seu2$pca <- NULL # to advoid different pca columns error.
seu$orig.ident <- 'region1'
seu2$orig.ident <- 'region2'
mergedSeu <- merge(x=seu, y=list(seu2), merge.dr = TRUE)
mergedSeu

## An object of class Seurat 
## 624 features across 1369 samples within 26 assays 
## Active assay: RNA (24 features, 8 variable features)
##  6 layers present: counts.1, counts.2, data.1, scale.data.1, data.2, scale.data.2
##  25 other assays present: Nucleus_Mean, Nucleus_Median, Nucleus_Min, Nucleus_Max, Nucleus_Std.Dev., Nucleus_Variance, Cytoplasm_Mean, Cytoplasm_Median, Cytoplasm_Min, Cytoplasm_Max, Cytoplasm_Std.Dev., Cytoplasm_Variance, Membrane_Mean, Membrane_Median, Membrane_Min, Membrane_Max, Membrane_Std.Dev., Membrane_Variance, Cell_Mean, Cell_Median, Cell_Min, Cell_Max, Cell_Std.Dev., Cell_Variance, GMM
##  3 dimensional reductions calculated: pos, pca, umap

Assays(mergedSeu)

## An object of class "SimpleAssays"
## Slot "data":
## List of length 1

p <- dittoBarPlot(mergedSeu,
             var = 'celltype_first',
             group.by='orig.ident')
p

## pie chart
plotData <- p$data %>%
  dplyr::arrange(desc(label)) %>%
  dplyr::group_by(grouping) %>%
  dplyr::mutate(y_pos = cumsum(percent) - 0.5*percent)
ggplot(plotData, aes(x="", y = percent, fill = label)) +
  geom_bar(stat="identity", width = 1) +
  geom_text(aes(x=1.7, y=y_pos,
                label=paste0(round(percent*100, digits = 1), '%')),
            color = 'black') +
  coord_polar(theta = 'y') +
  facet_wrap(~grouping) +
  theme_void() +
  scale_fill_manual(values = dittoColors())

FeaturePlot(mergedSeu, slot='count',
            features = rownames(mergedSeu)[1],
            reduction = 'pos',
            ncol = 2) + scale_y_reverse()

DimPlot(mergedSeu, group.by = 'celltype', reduction = 'pos') + scale_y_reverse()

## use GMM rescaled data to plot the correlation
DefaultAssay(mergedSeu) <- 'GMM'
p_GMM <- markerCorrelation(mergedSeu, layers = 'data')
p_GMM[['data']]

## 
DefaultAssay(mergedSeu) <- 'RNA'

Detect the celltype by classifier

The simplest classifier is a list of markers.

CodexPredefined$classifier

##            Helper T cells             M1 Macrophage      Cartilage tumor cell 
##                     "CD4"                    "cd68"                    "sox9" 
##      Proliferation Marker             Tumor/Myeloid                  MONOCYTE 
##                    "ki67"                    "lcp1"                    "CD14" 
##             M2 MACROPHAGE             M2 MACROPHAGE                       HSC 
##                   "cd206"                   "CD163"                    "cd45" 
##         Cytotoxic T Cells mDC/cDC - Dendritic cells                   B CELLS 
##                     "cd8"                   "CD11c"                    "cd19" 
##               Endothelial       Pericyte/Fibroblast 
##                    "cd34"                    "asma"

You can also try to detect the cell type by multiple markers with weight for each celltype.

CodexPredefined$classifier2

## $positive
## $positive$`B cell`
## $positive$`B cell`$symbol
## [1] "cd45" "cd19"
## 
## $positive$`B cell`$weight
## [1] 0.5 0.5
## 
## 
## $positive$`T cell`
## $positive$`T cell`$symbol
## [1] "CD4"
## 
## $positive$`T cell`$weight
## [1] 1
## 
## 
## $positive$CD4_TFH
## $positive$CD4_TFH$symbol
## [1] "CD4" "pd1"
## 
## $positive$CD4_TFH$weight
## [1] 0.5 0.5
## 
## 
## $positive$`T-reg`
## $positive$`T-reg`$symbol
## [1] "CD4"   "foxp3" "pd1"  
## 
## $positive$`T-reg`$weight
## [1] 0.3333333 0.3333333 0.3333333
## 
## 
## $positive$`CD8 T-cell`
## $positive$`CD8 T-cell`$symbol
## [1] "cd8"
## 
## $positive$`CD8 T-cell`$weight
## [1] 1
## 
## 
## $positive$`Dendritic cell`
## $positive$`Dendritic cell`$symbol
## [1] "CD11c"
## 
## $positive$`Dendritic cell`$weight
## [1] 1
## 
## 
## $positive$NK
## $positive$NK$symbol
## [1] "cd57"
## 
## $positive$NK$weight
## [1] 1
## 
## 
## $positive$Macrophage
## $positive$Macrophage$symbol
## [1] "cd68"
## 
## $positive$Macrophage$weight
## [1] 1
## 
## 
## $positive$`M2 macrophage`
## $positive$`M2 macrophage`$symbol
## [1] "cd68"  "CD163"
## 
## $positive$`M2 macrophage`$weight
## [1] 0.5 0.5
## 
## 
## $positive$HSC
## $positive$HSC$symbol
## [1] "cd45"
## 
## $positive$HSC$weight
## [1] 1
## 
## 
## $positive$`cd45+lcp1`
## $positive$`cd45+lcp1`$symbol
## [1] "cd45" "lcp1"
## 
## $positive$`cd45+lcp1`$weight
## [1] 0.5 0.5
## 
## 
## $positive$Monocyte
## $positive$Monocyte$symbol
## [1] "CD14"
## 
## $positive$Monocyte$weight
## [1] 1
## 
## 
## $positive$Stromal
## $positive$Stromal$symbol
## [1] "cd34" "asma"
## 
## $positive$Stromal$weight
## [1] 0.5 0.5
## 
## 
## $positive$Tumour
## $positive$Tumour$symbol
## [1] "sox9" "ki67"
## 
## $positive$Tumour$weight
## [1] 0.5 0.5
## 
## 
## 
## $negative
## $negative$Tumour
## $negative$Tumour$symbol
## [1] "cd45"
## 
## $negative$Tumour$weight
## [1] 1

mergedSeu <- detectCellType(mergedSeu,
                            classifier = CodexPredefined$classifier2,
                            celltypeColumnName = 'celltype2')
dittoBarPlot(mergedSeu,
             var = 'celltype2',
             group.by='Cell.isMultiNucleis')

You can also try to detect the cell type by a generalized k-Nearest Neighbors Classifier.

classifier <- readRDS(system.file('extdata', 'classifier.CSA.res0.3.rds', 
                      package='cdQuPath'))
## this request lots of memory.
mergedSeu <- detectCellType(mergedSeu,
               classifier=classifier,
               celltypeColumnName='celltypeByKNN')
table(mergedSeu$celltypeByKNN)

## 
## Chondroblastic    Endothelial   Fibroblastic     Mast cells        Myeloid 
##              0            226              5            661             30 
##   Osteoblastic   Osteoclastic      Pericytes  Proliferating     T/NK cells 
##             37            221              1             27            161

dittoBarPlot(mergedSeu,
             var = 'celltypeByKNN',
             group.by='orig.ident')

Align the single cell RNAseq data to Codex data by KNN classifier

Below is sample code to create a classifier for your data and align single-cell RNA-seq data with Codex data.

Codex data has a limited number of cell markers compared to the 10,000+ features detected by single-cell RNA-seq. To align the scRNA-seq matrix with the Codex matrix, it is crucial to minimize the effects of sparsity, noise, and collinearity among individual cells. This is achieved by performing distance calculations using pseudobulk analysis, which aggregates gene expression from individual cells into group-level values. By treating each group as a single cell, this approach reduces heterogeneity within cell clusters.

The resulting alignment matrix enables the prediction of transcriptomes using Codex spatial information.

## read the protein-gene map from the Codex (protein level) to
## transcriptome (gene level)
markers <- readRDS(system.file('extdata', 'markers.name.map.rds',
                                package='cdQuPath'))
tail(markers)

## $asma
## [1] "ACTA2"
## 
## $ki67
## [1] "MKI67"
## 
## $cd19
## [1] "CD19"
## 
## $cd3
## [1] "CD3E" "CD3D" "CD3G"
## 
## $cd34
## [1] "CD34"
## 
## $CD11c
## [1] "ITGAX"

## load the scRNA-seq data
# scRNAseqSeu <- readRDS('int.rds')
## do cluster, set higher resolution to get more class
# scRNAseqSeu <- FindNeighbors(scRNAseqSeu, dims = 1:20)
## set a big resolution to split the cells into a big number of clusters.
# scRNAseqSeu <- FindClusters(scRNAseqSeu, resolution = 100)
## create a pseudobulking for each cluster
# pseudo <- AggregateExpression(scRNAseqSeu, assays = "RNA",
#                               return.seurat = TRUE,
#                               group.by = c("seurat_clusters"))
pseudo <- readRDS(system.file('extdata', 'pseudobulk.scRNAseq.rds',
                              package='cdQuPath'))
library(future.apply)
plan(multisession)
## will cost several mins
classifier <- buildClassifierFromSeurat(pseudo,
                                        markerList = markers)
alignedCodex <- alignScRNAseqToCodex(codexSeu = mergedSeu,
                                      scSeu = pseudo,
                                      classifier = classifier)
## plot any gene expression with DarkTheme()
FeaturePlot(alignedCodex, slot='data',
            features = rownames(pseudo)[1],
            reduction = 'pos',
            ncol = 2) + scale_y_reverse() + DarkTheme()

Detect the celltype by other tools

Here we will show the sample code to detect the cell types by CELESTA.

if (!require("CELESTA", quietly = TRUE)){
  BiocManager::install("plevritis-lab/CELESTA")
}
library(CELESTA)
library(zeallot)
library(spdep)
prior_marker_info <- matrix(0,
                            nrow = length(CodexPredefined$classifier2$positive),
                            ncol = nrow(mergedSeu),
                            dimnames = list(
                              names(CodexPredefined$classifier2$positive),
                              rownames(mergedSeu)
                            ))
for(i in seq_along(CodexPredefined$classifier2$positive)){
  prior_marker_info[names(CodexPredefined$classifier2$positive)[i],
                    CodexPredefined$classifier2$positive[[i]]$symbol[
                      CodexPredefined$classifier2$positive[[i]]$symbol %in%
                        colnames(prior_marker_info)
                    ]] <- 1
}
## reorder the celltype by lineage level
prior_marker_info <- prior_marker_info[
  c('Tumour', 'Stromal', 'HSC',
    'NK', 'B cell', 'T cell', 'Monocyte',
    'CD4_TFH', 'CD8 T-cell', 'T-reg',
    'Macrophage', 'M2 macrophage', 'cd45+lcp1', 'Dendritic cell'), ]
prior_marker_info <- cbind(
  celltype = rownames(prior_marker_info),
  Lineage_level=paste(c(1, 1, 1,
                        2, 2, 2, 2,
                        3, 3, 3,
                        3, 3, 3, 3), ## lineage level
                      c(0, 0, 0,
                        3, 3, 3, 3,
                        6, 6, 6,
                        7, 7, 7, 7), ## lineage ancestor 
                      c(1, 2, 3,
                        4, 5, 6, 7,
                        8, 9, 10,
                        11, 12, 13, 14), ## index
                      sep='_'),
  data.frame(prior_marker_info))
mergedSeu <- JoinLayers(mergedSeu)
imageData <- GetAssayData(mergedSeu, assay = 'RNA', layer = 'data')
imageData <- cbind(mergedSeu[[]], t(imageData))
colnames(imageData)[colnames(imageData)=='Cell.X'] <- 'X'
colnames(imageData)[colnames(imageData)=='Cell.Y'] <- 'Y'
CelestaObj <- CreateCelestaObject(
  project_title = "test",
  prior_marker_info = prior_marker_info,
  imaging_data = imageData)
CelestaObj <- FilterCells(
  CelestaObj, 
  high_marker_threshold=0.9,
  low_marker_threshold=0.1)
CelestaObj <- AssignCells(
  CelestaObj,
  max_iteration=10,
  cell_change_threshold=0.01)
mergedSeu$celltypeCELESTA <- 
  CelestaObj@final_cell_type_assignment[, 'Final cell type']
dittoBarPlot(mergedSeu,
             var = 'celltypeCELESTA',
             group.by='orig.ident')

Neighborhoods analysis

After annotating cell types, we can assess whether these annotations exhibit spatial enrichment and conduct analyses of cellular neighborhoods throughout the tissue. Here we borrow the power of hoodscanR to identifying interactions between spatial communities.

clusterK <- 12
spe <- neighborhoodsAna(mergedSeu, ## if no merging step, use `seu`.
                        clusterK = clusterK,
                        k = 10, ## maximal 10 neightbors
                        anno_col = 'celltype_first')
# Plot the metrics for probability matrix
plotTissue(spe, color = entropy) +
    scale_color_gradientn(colors=CodexPredefined$heatColor)

plotTissue(spe, color = perplexity) +
    scale_color_gradientn(colors=CodexPredefined$heatColor)

# Plot the clusters for the probability matrix
plotTissue(spe, color = clusters)

# Plot heatmap for neighbourhood analysis
plotColocal(spe, pm_cols = unique(spe$celltype_first))

# Plot probability distribution
plotProbDist(spe, pm_cols = unique(spe$celltype_first), 
             by_cluster = TRUE, plot_all = TRUE, 
             show_clusters = as.character(seq(clusterK)))

Co-occurrence analysis

The Jaccard Index is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard Index is defined as: $\frac{cooccurent\_celltype\_A\_and\_B}{count\_celltype\_A+count\_celltype\_B-cooccurent\_celltype\_A\_and\_B}$ .

This method are based on simple yet intuitive measures, such as counting the number of nearest neighbors of every cell and inspecting the proportion of different cell categories in a given Euclidean (2 norm aka $L_2$ ) distance ( $\sqrt{\sum_{i}{(x_i - y_i})^2}$ , the shortest distance).

## here we set two times of cell length as maximal gaps to check 
## the cell co-location
maxgap <- max(mergedSeu$Cell.Length.um) * 2
ji <- JaccardIndex(mergedSeu, maxRadius = maxgap, anno_col='celltype_first')
plotJaccardIndex(ji)

The co-occurrence score gives us an indication on whether given two factors such as cell cluster, cell type, and etc, co-occur with each other at specific distances across the tissue. The co-occurrence score is defined as: $\frac{p(factor1|factor2)}{p(factor1)*p(factor2)}$ .

maxgap <- max(mergedSeu$Cell.Length.um) * 50
## precalculate cell distance to save time because we will use same maxgap for
## both coOccurrence and spatialScore
celldist <- findNearCellsByRadius(
      seu=mergedSeu,
      maxRadius=maxgap)
cooc <- coOccurrence(mergedSeu, maxRadius=maxgap, bin=50,
                     anno_col='celltype',
                     celldist=celldist)
ggplot(data.frame(x=seq.int(nrow(cooc))*20-10, y=cooc[, 'cell proportion']),
       aes(x, y)) + geom_line() + theme_classic() +
  labs(x='distance', y='proportion')

plotData <- reshape2::melt(cooc[, -1])
ggplot(plotData, aes(x=Var1, y=Var2, fill = value))+
  geom_tile() + labs(x='bin', y='') +
  scale_fill_gradient(low = "white", high = "red")

The p-value of co-occurence can be calculated based on permutation test or a more rigorous framework, the homogeneous multitype Poisson point process (PPP) or complete spatial randomness and independence (CSRI) by picking a specific value of the radius (r). To perform the analysis, you can refer to R-package SpicyR or SpaceANOVA.

The spatial score define the distance ratio among three factors. The spatial score is defined as: $\frac{celltypeCenter\_distance\_to\_nearest\_celltypeRight}{celltypeCenter\_distance\_to\_nearest\_celltypeLeft}$ . The high value indicates the left celltype is closer than the right celltype.

## need to increase the neighbors number
ss <- spatialScore(mergedSeu,
                   celltypeCenter = "Macrophage",
                   celltypeLeft = "T cell",
                   celltypeRight = "Monocyte",
                   anno_col = 'celltype_first',
                   maxRadius=maxgap,
                   celldist=celldist)
head(ss)

##                                   cell   dist.L   dist.R spatialScore
## 1 8d5ccdac-31da-4930-81b6-dd9f272c10d7 68.92733 474.2253     6.880076
## 2 b21deea4-2d9c-4e0b-9c83-591622b33a9a 68.92733 474.2253     6.880076
## 3 64279019-198a-4a87-a829-3f67c19e3e97 68.92733 474.2253     6.880076
## 4 f61663dd-f8a0-4da1-8bae-dcf4146032b3 68.92733 474.2253     6.880076
## 5 78a932ae-bb7c-428c-8329-eef8c1f29c18 68.92733 474.2253     6.880076
## 6 a5b179c6-c0fc-4952-89c6-cfc942a135a3 78.08983 352.7442     4.517159

boxplot(ss$spatialScore)

Session Info

sessionInfo()

R version 4.4.1 (2024-06-14) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 22.04.4 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC tzcode source: system (glibc)

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages: [1] future.apply_1.11.2 future_1.34.0
[3] dplyr_1.1.4 SpatialExperiment_1.15.1
[5] SingleCellExperiment_1.27.2 SummarizedExperiment_1.35.1 [7] Biobase_2.65.1 GenomicRanges_1.57.1
[9] GenomeInfoDb_1.41.1 IRanges_2.39.2
[11] S4Vectors_0.43.2 BiocGenerics_0.51.1
[13] MatrixGenerics_1.17.0 matrixStats_1.4.1
[15] hoodscanR_1.3.3 dittoSeq_1.17.0
[17] ggplot2_3.5.1 Seurat_5.1.0
[19] SeuratObject_5.0.2 sp_2.1-4
[21] cdQuPath_0.99.0.8 BiocStyle_2.33.1

loaded via a namespace (and not attached): [1] RcppAnnoy_0.0.22 splines_4.4.1 later_1.3.2
[4] tibble_3.2.1 polyclip_1.10-7 fastDummies_1.7.4
[7] lifecycle_1.0.4 doParallel_1.0.17 globals_0.16.3
[10] lattice_0.22-6 MASS_7.3-61 magrittr_2.0.3
[13] plotly_4.10.4 sass_0.4.9 rmarkdown_2.28
[16] jquerylib_0.1.4 yaml_2.3.10 httpuv_1.6.15
[19] sctransform_0.4.1 spam_2.10-0 spatstat.sparse_3.1-0
[22] reticulate_1.39.0 cowplot_1.1.3 pbapply_1.7-2
[25] RColorBrewer_1.1-3 abind_1.4-5 zlibbioc_1.51.1
[28] Rtsne_0.17 purrr_1.0.2 circlize_0.4.16
[31] GenomeInfoDbData_1.2.12 ggrepel_0.9.6 irlba_2.3.5.1
[34] listenv_0.9.1 spatstat.utils_3.1-0 pheatmap_1.0.12
[37] goftest_1.2-3 RSpectra_0.16-2 spatstat.random_3.3-1
[40] fitdistrplus_1.2-1 parallelly_1.38.0 pkgdown_2.1.0
[43] leiden_0.4.3.1 codetools_0.2-20 DelayedArray_0.31.11
[46] shape_1.4.6.1 tidyselect_1.2.1 UCSC.utils_1.1.0
[49] farver_2.1.2 spatstat.explore_3.3-2 jsonlite_1.8.8
[52] GetoptLong_1.0.5 e1071_1.7-14 progressr_0.14.0
[55] iterators_1.0.14 ggridges_0.5.6 survival_3.7-0
[58] systemfonts_1.1.0 foreach_1.5.2 Rmixmod_2.1.10
[61] tools_4.4.1 ragg_1.3.3 ica_1.0-3
[64] Rcpp_1.0.13 glue_1.7.0 gridExtra_2.3
[67] SparseArray_1.5.34 xfun_0.47 withr_3.0.1
[70] BiocManager_1.30.25 fastmap_1.2.0 fansi_1.0.6
[73] digest_0.6.37 R6_2.5.1 mime_0.12
[76] textshaping_0.4.0 colorspace_2.1-1 scattermore_1.2
[79] tensor_1.5 spatstat.data_3.1-2 utf8_1.2.4
[82] tidyr_1.3.1 generics_0.1.3 data.table_1.16.0
[85] class_7.3-22 httr_1.4.7 htmlwidgets_1.6.4
[88] S4Arrays_1.5.7 uwot_0.2.2 pkgconfig_2.0.3
[91] scico_1.5.0 gtable_0.3.5 ComplexHeatmap_2.21.0
[94] lmtest_0.9-40 XVector_0.45.0 htmltools_0.5.8.1
[97] dotCall64_1.1-1 bookdown_0.40 clue_0.3-65
[100] scales_1.3.0 png_0.1-8 spatstat.univar_3.0-1
[103] knitr_1.48 reshape2_1.4.4 rjson_0.2.22
[106] nlme_3.1-166 GlobalOptions_0.1.2 proxy_0.4-27
[109] cachem_1.1.0 zoo_1.8-12 stringr_1.5.1
[112] KernSmooth_2.23-24 parallel_4.4.1 miniUI_0.1.1.1
[115] desc_1.4.3 pillar_1.9.0 grid_4.4.1
[118] vctrs_0.6.5 RANN_2.6.2 promises_1.3.0
[121] xtable_1.8-4 cluster_2.1.6 evaluate_0.24.0
[124] magick_2.8.4 cli_3.6.3 compiler_4.4.1
[127] rlang_1.1.4 crayon_1.5.3 labeling_0.4.3
[130] mclust_6.1.1 plyr_1.8.9 fs_1.6.4
[133] stringi_1.8.4 viridisLite_0.4.2 deldir_2.0-4
[136] munsell_0.5.1 lazyeval_0.2.2 spatstat.geom_3.3-2
[139] Matrix_1.7-0 RcppHNSW_0.6.0 patchwork_1.2.0
[142] shiny_1.9.1 highr_0.11 ROCR_1.0-11
[145] igraph_2.0.3 bslib_0.8.0

Jianhong Ou

12 September 2024

Introduction

Quick start

Installation

Step 1. Cell segmentation

Step 2. Export the cell segmentation

Step 3. Work in R

Load library

Create a Seurat object from output `tsv` file.

Quality Control

Normalizing the data

Identification of highly variable markers (marker selection)

Scaling the data and do cluster analysis

Rescale data using Gaussian Mixture Model

Correlation heatmap

Assign cell types

Merge multiple Seurat objects if necessary

Detect the celltype by classifier

Align the single cell RNAseq data to Codex data by KNN classifier

Detect the celltype by other tools

Neighborhoods analysis

Co-occurrence analysis

Session Info

cdQuPath Vignette: overview

Jianhong Ou

12 September 2024

Introduction

Quick start

Installation

Step 1. Cell segmentation

Step 2. Export the cell segmentation

Step 3. Work in R

Load library

Create a Seurat object from output tsv file.

Quality Control

Normalizing the data

Identification of highly variable markers (marker selection)

Scaling the data and do cluster analysis

Rescale data using Gaussian Mixture Model

Correlation heatmap

Assign cell types

Merge multiple Seurat objects if necessary

Detect the celltype by classifier

Align the single cell RNAseq data to Codex data by KNN classifier

Detect the celltype by other tools

Neighborhoods analysis

Co-occurrence analysis

Session Info

Create a Seurat object from output `tsv` file.