Tutorial 7: Compatibility with other single-cell libraries ( Seurat, Signac, and SnapATAC)
Overview
The minimal requirement to perform an Enhlink analysis is a cell-by-peak matrix and the locations of targets, such as promoters, in the analyzed genome. Additionally, Enhlink can optionally utilize cluster information, cell-by-gene expression matrices, and other cell metadata.
In this tutorial, we will demonstrate how to extract these data from various single-cell chromatin analysis frameworks, such as Signac/Seurat, SnapATAC, or ScanPy, and launch an Enhlink analysis. For simplicity, we will not use any advanced functionalities of Enhlink, such as the -covariates, -clusters, or -secondOrder options. However, these features can be easily incorporated if needed.
Signac / Seurat
These R packages were developped by the Satija’s lab and provide comprehensive solutions to analyze single-cell RNA-seq and ATAC-seq datasets. Seurat objects are enriched wrapper toward a sparse cell x gene matrix that can be easily converted to be used by Enhlink. In contrast, Signac objects are more complex and contain additional data such as sequence data.
Saving Signac object to sparse matrix
First, let’s load some example dataset with Signac, as described here and assuming Signac is installed correctly.
library('Signac')
library('Seurat')
counts <- Read10X_h5(filename = "atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
file = "atac_v1_pbmc_10k_singlecell.csv",
header = TRUE,
row.names = 1
)
chrom_assay <- CreateChromatinAssay(
counts = counts,
sep = c(":", "-"),
fragments = 'atac_v1_pbmc_10k_fragments.tsv.gz',
min.cells = 10,
min.features = 200
)
pbmc <- CreateSeuratObject(
counts = chrom_assay,
assay = "peaks",
meta.data = metadata
)
The cell x peak sparse matrix defined in pbmc@assays$peaks@counts or pbmc@assays$peaks@data can then be simply saved using the writeMM function from the Matrix library. The row and column indexes can then be saved using the table.write function.
library(Matrix)
# Write the sparse matrix in the MTX matrix format
writeMM(obj = pbmc@assays$peaks@counts, file = "pbmc.counts.mtx")
# Write the cell ID tags (the columns)
write.table(colnames(pbmc@assays$peaks@counts),
file = "pbmc.counts.colnames", quote = FALSE, row.names = FALSE, col.names = FALSE)
# Write the chromosome peak coordinates (the rows)
# We need to format the peak coordinates to be tab separated (chrID, start, stop)
write.table(gsub("-", "\t", rownames(pbmc@assays$peaks@counts)),
file = "pbmc.counts.rownames",
quote = FALSE, row.names = FALSE, col.names = FALSE)
Processing the data with Enhlink
To process these data, we will use our custom GTF file indicating the locations of the promoters in the GrCh38 human reference genome. This reference file works here since the data were originally aligned on the Human hg38 genome assembly.
wget http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/metadata/Homo_sapiens.GRCh38.99.TSS.2K.bed
We are now ready to launch Enhlink. Please note that we are using the cellRanger matrix format since matrix indices start at 1 and the first column correspond to feature indices and the second to cell indices.
enhlink \
-mat pbmc.counts.mtx \
-xgi pbmc.counts.colnames \
-ygi pbmc.counts.rownames \
-out pbmc.counts.out \
-format cellRanger \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-threads 8
Note that additional clustering or metadata can be available within the Seurat and Signac objects. To incorporate them into Enhlink’s computation, it is required to generate a two-columns TSV file for clustering:
Processing a multi-omics dataset from Seurat and Signac R objects
In the case we have an associated gene expression sparse matrix in a Seurat object, we can use a similar technic to save it. As an example, we will use a single-cell multi-omics ATAC/RNA-seq dataset previously used as an example by the authors of Signac.
We first need to download the fragment files and the ATAC-seq and RNA-seq matrices.
# multiome
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi
We then need to load the snATAC-seq and snRNA-seq matrices using Seurat and Signac.
library('Signac')
library('Seurat')
counts <- Read10X_h5("pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
pbmc <- CreateSeuratObject(
counts = counts$`Gene Expression`,
assay = "RNA"
)
pbmc[["ATAC"]] <- CreateChromatinAssay(
counts = counts$Peaks,
sep = c(":", "-"),
fragments = "pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz",
)
The pbmc is a R Signac object containing both the ATAC and the RNA matrices from the dataset processing. We can easily save these matrices from the assays instance:
library(Matrix)
# Write the RNA sparse matrix in the MTX matrix format
writeMM(obj = pbmc@assays$RNA@layers$count, file = "/mnt/pbmc.multi.RNA.mtx")
# Write the cell ID tags (the columns)
write.table(colnames(pbmc@assays$RNA),
file = "/mnt/pbmc.multi.RNA.colnames", quote = FALSE, row.names = FALSE, col.names = FALSE)
# Write the gene symbols (the rows)
write.table(rownames(pbmc@assays$RNA),
file = "/mnt/pbmc.multi.RNA.rownames",
quote = FALSE, row.names = FALSE, col.names = FALSE)
# Write the RNA sparse matrix in the MTX matrix format
writeMM(obj = pbmc@assays$ATAC@counts, file = "/mnt/pbmc.multi.ATAC.mtx")
# Write the cell ID tags (the columns)
write.table(colnames(pbmc@assays$ATAC),
file = "/mnt/pbmc.multi.ATAC.colnames", quote = FALSE, row.names = FALSE, col.names = FALSE)
# Write the chromosome peak coordinates (the rows)
# We need to format the peak coordinates to be tab separated (chrID, start, stop)
write.table(gsub("-", "\t", rownames(pbmc@assays$ATAC)),
file = "/mnt/pbmc.multi.ATAC.rownames",
quote = FALSE, row.names = FALSE, col.names = FALSE)
Processing the data with Enhlink
We can now launch Enhlink similarly to the first analysis. However, we will now infer links with enhancer activities correlated with gene expression from the second matrix. Again, we need to use "cellRanger" as value for -format (see above).
enhlink \
-mat pbmc.multi.ATAC.mtx \
-xgi pbmc.multi.ATAC.colnames \
-ygi pbmc.multi.ATAC.rownames \
-mat2 pbmc.multi.RNA.mtx \
-xgi2 pbmc.multi.RNA.colnames \
-ygi2 pbmc.multi.RNA.rownames \
-out pbmc.multiome.out \
-format cellRanger \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-threads 6
Analysing Enhlink’s linkage with Signac
We can now load and format links generated by Enhlink to bee compatible with Signac / Cicero. We can use as an example the co-accessible network inference tutorial from Signac. In this tutorial, the authors use Cicero to infer the linkage prior to visualizing different genomic regions. Here, we will simply substitute Cicero’s linkage with Enhlink results.
# Read enhlink results
EnConns = read.table("pbmc.multiome.out/all.bedpe", sep = "\t", stringsAsFactors = FALSE)
# Number of links
nbLinks = dim(EnConns)[1]
# Create a connection data frame
fconns <- data.frame(
Peak1 = rep(NA, nbLinks),
Peak2 = rep(NA, nbLinks),
coaccess = rep(NA, nbLinks),
stringsAsFactors = FALSE
)
# Format Peaks and linkage according to Cicero outputs
fconns$Peak1 = paste(paste(conn$V1, conn$V2, sep="-"), conn$V3, sep="-")
fconns$Peak2 = paste(paste(conn$V4, conn$V5, sep="-"), conn$V6, sep="-")
fconns$coaccess = conn$V9
# We are now ready to use Signac functions dedicated for Linkage
links = ConnectionsToLinks(conns = fconns)
# Using functions from Cicero's library
library(cicero)
ccans = generate_ccans(conns)
# Signac's function
links <- ConnectionsToLinks(conns = fconns, ccans = ccans)
# Plotting specific region
library(ggplot2)
library(patchwork)
# Untested on pbmc.
Links(pbmc) <- links
CoveragePlot(pbmc, region = "chr1-40189344-40252549")
Processing SnapATAC2 data
SnapATAC2 is a Python pipeline to process single-cell ATAC-seq and multi-omics data. SnapATAC2 was developed using ScanPy as backbone and shares numerous similarities with its API. Similarly to the process used with Seurat (see above), we can create a cell x peak sparse matrix and an optional associated cell x gene sparse matrix, save them in an appropriate format, and process them with Enhlink.
import snapatac2 as snap
# this example matrix is created following SnapATAC2 tutorial
data = snap.pp.import_data(
snap.datasets.pbmc500(downsample=True),
chrom_sizes=snap.genome.hg38,
sorted_by_barcode=False)
peak_mat = snap.pp.make_peak_matrix(
data, peak_file=snap.datasets.cre_HEA())
# Now we can save the sparse matrix using the same mechanism described in tutorial 4
from scipy.io import mmwrite
import csv
#write the matrix
mmwrite('peak_mat.snapATAC2.mtx', peak_mat.X.T)
# write the cell barcodes
peak_mat.obs.index.to_frame().to_csv('peak_mat.snapATAC2.barcodes.tsv.gz',
header=None, sep="\t", index=None)
# write the peak IDs tab separated
peak_mat.var["chrID"] = peak_mat.var.index.map(lambda x:x.split(":")[0])
peak_mat.var["start"] = peak_mat.var.index.map(lambda x:x.split(":")[1].split('-')[0])
peak_mat.var["stop"] = peak_mat.var.index.map(lambda x:x.split(":")[1].split('-')[1])
peak_mat.var["start"] = peak_mat.var["start"].astype('int')
peak_mat.var["stop"] = peak_mat.var["stop"].astype('int')
peak_mat.var.to_csv(
'peak_mat.snapATAC2.features.tsv.gz',
header=None, sep="\t", index=None
)
# Now let's create a cell x gene expression associated sparse matrix
gene_mat = snap.pp.make_gene_matrix(data, gene_anno=snap.genome.hg38)
mmwrite('gene_mat.snapATAC2.mtx', gene_mat.X.T)
# write the cell barcodes
gene_mat.obs.index.to_frame().to_csv('gene_mat.snapATAC2.barcodes.tsv.gz',
header=None, sep="\t", index=None)
# write the gene IDs tab separated
gene_mat.var.index \
.to_frame() \
.to_csv(
'gene_mat.snapATAC2.features.tsv.gz',
header=None, sep="\t", index=None,
)
Note that a similar approach can be used to save any matrix form a ScanPy or EpiScanPy object.
We are now ready to infer the linkage with Enhlink. Note that we will use both matrices to infer enhancers correlated with the gene expressions of the second matrix. However we could have only used the peak matrix.
enhlink \
-mat peak_mat.snapATAC2.mtx \
-xgi peak_mat.snapATAC2.barcodes.tsv.gz \
-ygi peak_mat.snapATAC2.features.tsv.gz \
-mat2 gene_mat.snapATAC2.mtx \
-xgi2 gene_mat.snapATAC2.barcodes.tsv.gz \
-ygi2 gene_mat.snapATAC2.features.tsv.gz \
-out snapATA2.enhlink.out \
-format cellRanger \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-threads 6
Note that additional clustering or metadata can be available within the data .obs instance, which is a dataframe containing the cell metadata. To incorporate them into Enhlink’s computation, it is required to generate a two-columns TSV file for clustering:
# Let's create a categorical covariate indicating either high or low nb. of fragments
data.obs["high_nb_fragment"] = data.obs["n_fragment"] > data.obs["n_fragment"].median()
# save the covariates file
data.obs.loc[:, ["high_nb_fragment"]].to_csv(
'snapATAC2.covariates.tsv.gz', sep="\t")
Launching a new Enhlink’s analysis with covariates:
enhlink \
-mat peak_mat.snapATAC2.mtx \
-xgi peak_mat.snapATAC2.barcodes.tsv.gz \
-ygi peak_mat.snapATAC2.features.tsv.gz \
-mat2 gene_mat.snapATAC2.mtx \
-xgi2 gene_mat.snapATAC2.barcodes.tsv.gz \
-ygi2 gene_mat.snapATAC2.features.tsv.gz \
-covariates snapATAC2.covariates.tsv.gz \
-secondOrder \
-out snapATA2.enhlink.out \
-format cellRanger \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-threads 6