# Tutorial 7: Compatibility with other single-cell libraries ( Seurat, Signac, and SnapATAC) ## Overview The minimal requirement to perform an Enhlink analysis is a cell-by-peak matrix and the locations of targets, such as promoters, in the analyzed genome. Additionally, Enhlink can optionally utilize cluster information, cell-by-gene expression matrices, and other cell metadata. In this tutorial, we will demonstrate how to extract these data from various single-cell chromatin analysis frameworks, such as Signac/Seurat, SnapATAC, or ScanPy, and launch an Enhlink analysis. For simplicity, we will not use any advanced functionalities of Enhlink, such as the -covariates, -clusters, or -secondOrder options. However, these features can be easily incorporated if needed. ## Signac / Seurat These R packages were developped by the [Satija's lab ](https://github.com/satijalab) and provide comprehensive solutions to analyze single-cell RNA-seq and ATAC-seq datasets. Seurat objects are enriched wrapper toward a sparse cell x gene matrix that can be easily converted to be used by Enhlink. In contrast, Signac objects are more complex and contain additional data such as sequence data. ### Saving Signac object to sparse matrix First, let's load some example dataset with Signac, as described [here](https://stuartlab.org/signac/articles/pbmc_vignette) and assuming Signac is installed correctly. ```R library('Signac') library('Seurat') counts <- Read10X_h5(filename = "atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5") metadata <- read.csv( file = "atac_v1_pbmc_10k_singlecell.csv", header = TRUE, row.names = 1 ) chrom_assay <- CreateChromatinAssay( counts = counts, sep = c(":", "-"), fragments = 'atac_v1_pbmc_10k_fragments.tsv.gz', min.cells = 10, min.features = 200 ) pbmc <- CreateSeuratObject( counts = chrom_assay, assay = "peaks", meta.data = metadata ) ``` The cell x peak sparse matrix defined in `pbmc@assays$peaks@counts` or `pbmc@assays$peaks@data` can then be simply saved using the `writeMM` function from the `Matrix` library. The row and column indexes can then be saved using the `table.write` function. ```R library(Matrix) # Write the sparse matrix in the MTX matrix format writeMM(obj = pbmc@assays$peaks@counts, file = "pbmc.counts.mtx") # Write the cell ID tags (the columns) write.table(colnames(pbmc@assays$peaks@counts), file = "pbmc.counts.colnames", quote = FALSE, row.names = FALSE, col.names = FALSE) # Write the chromosome peak coordinates (the rows) # We need to format the peak coordinates to be tab separated (chrID, start, stop) write.table(gsub("-", "\t", rownames(pbmc@assays$peaks@counts)), file = "pbmc.counts.rownames", quote = FALSE, row.names = FALSE, col.names = FALSE) ``` ### Processing the data with Enhlink To process these data, we will use our custom GTF file indicating the locations of the promoters in the GrCh38 human reference genome. This reference file works here since the data were originally aligned on the Human hg38 genome assembly. ```bash wget http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/metadata/Homo_sapiens.GRCh38.99.TSS.2K.bed ``` We are now ready to launch Enhlink. Please note that we are using the `cellRanger` matrix format since matrix indices start at 1 and the first column correspond to feature indices and the second to cell indices. ```bash enhlink \ -mat pbmc.counts.mtx \ -xgi pbmc.counts.colnames \ -ygi pbmc.counts.rownames \ -out pbmc.counts.out \ -format cellRanger \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -threads 8 ``` Note that additional clustering or metadata can be available within the Seurat and Signac objects. To incorporate them into Enhlink's computation, it is required to generate a two-columns TSV file for clustering: or a n-columns file (with header) for *categorical* covariates: ... (see other [tutorials](https://enhlinktools.readthedocs.io/en/latest/tutorial_1_atac_only.html)). ### Processing a multi-omics dataset from Seurat and Signac R objects In the case we have an associated gene expression sparse matrix in a Seurat object, we can use a similar technic to save it. As an example, we will use a single-cell multi-omics ATAC/RNA-seq [dataset](https://stuartlab.org/signac/articles/integrate_atac.html) previously used as an example by the authors of Signac. We first need to download the fragment files and the ATAC-seq and RNA-seq matrices. ```bash # multiome wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5 wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi ``` We then need to load the snATAC-seq and snRNA-seq matrices using Seurat and Signac. ```R library('Signac') library('Seurat') counts <- Read10X_h5("pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5") pbmc <- CreateSeuratObject( counts = counts$`Gene Expression`, assay = "RNA" ) pbmc[["ATAC"]] <- CreateChromatinAssay( counts = counts$Peaks, sep = c(":", "-"), fragments = "pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz", ) ``` The `pbmc` is a R Signac object containing both the ATAC and the RNA matrices from the dataset processing. We can easily save these matrices from the assays instance: ```R library(Matrix) # Write the RNA sparse matrix in the MTX matrix format writeMM(obj = pbmc@assays$RNA@layers$count, file = "/mnt/pbmc.multi.RNA.mtx") # Write the cell ID tags (the columns) write.table(colnames(pbmc@assays$RNA), file = "/mnt/pbmc.multi.RNA.colnames", quote = FALSE, row.names = FALSE, col.names = FALSE) # Write the gene symbols (the rows) write.table(rownames(pbmc@assays$RNA), file = "/mnt/pbmc.multi.RNA.rownames", quote = FALSE, row.names = FALSE, col.names = FALSE) # Write the RNA sparse matrix in the MTX matrix format writeMM(obj = pbmc@assays$ATAC@counts, file = "/mnt/pbmc.multi.ATAC.mtx") # Write the cell ID tags (the columns) write.table(colnames(pbmc@assays$ATAC), file = "/mnt/pbmc.multi.ATAC.colnames", quote = FALSE, row.names = FALSE, col.names = FALSE) # Write the chromosome peak coordinates (the rows) # We need to format the peak coordinates to be tab separated (chrID, start, stop) write.table(gsub("-", "\t", rownames(pbmc@assays$ATAC)), file = "/mnt/pbmc.multi.ATAC.rownames", quote = FALSE, row.names = FALSE, col.names = FALSE) ``` ### Processing the data with Enhlink We can now launch Enhlink similarly to the first analysis. However, we will now infer links with enhancer activities correlated with gene expression from the second matrix. Again, we need to use `"cellRanger"` as value for `-format` (see above). ```bash enhlink \ -mat pbmc.multi.ATAC.mtx \ -xgi pbmc.multi.ATAC.colnames \ -ygi pbmc.multi.ATAC.rownames \ -mat2 pbmc.multi.RNA.mtx \ -xgi2 pbmc.multi.RNA.colnames \ -ygi2 pbmc.multi.RNA.rownames \ -out pbmc.multiome.out \ -format cellRanger \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -threads 6 ``` ### Analysing Enhlink's linkage with Signac We can now load and format links generated by Enhlink to bee compatible with Signac / Cicero. We can use as an example the [co-accessible network inference tutorial](https://stuartlab.org/signac/articles/cicero) from Signac. In this tutorial, the authors use [Cicero](https://cole-trapnell-lab.github.io/cicero-release/docs_m3/) to infer the linkage prior to visualizing different genomic regions. Here, we will simply substitute Cicero's linkage with Enhlink results. ```R # Read enhlink results EnConns = read.table("pbmc.multiome.out/all.bedpe", sep = "\t", stringsAsFactors = FALSE) # Number of links nbLinks = dim(EnConns)[1] # Create a connection data frame fconns <- data.frame( Peak1 = rep(NA, nbLinks), Peak2 = rep(NA, nbLinks), coaccess = rep(NA, nbLinks), stringsAsFactors = FALSE ) # Format Peaks and linkage according to Cicero outputs fconns$Peak1 = paste(paste(conn$V1, conn$V2, sep="-"), conn$V3, sep="-") fconns$Peak2 = paste(paste(conn$V4, conn$V5, sep="-"), conn$V6, sep="-") fconns$coaccess = conn$V9 # We are now ready to use Signac functions dedicated for Linkage links = ConnectionsToLinks(conns = fconns) # Using functions from Cicero's library library(cicero) ccans = generate_ccans(conns) # Signac's function links <- ConnectionsToLinks(conns = fconns, ccans = ccans) # Plotting specific region library(ggplot2) library(patchwork) # Untested on pbmc. Links(pbmc) <- links CoveragePlot(pbmc, region = "chr1-40189344-40252549") ``` ## Processing SnapATAC2 data SnapATAC2 is a Python pipeline to process single-cell ATAC-seq and multi-omics data. SnapATAC2 was developed using [ScanPy](https://scanpy.readthedocs.io/en/stable/) as backbone and shares numerous similarities with its API. Similarly to the process used with Seurat (see above), we can create a cell x peak sparse matrix and an optional associated cell x gene sparse matrix, save them in an appropriate format, and process them with Enhlink. ```python import snapatac2 as snap # this example matrix is created following SnapATAC2 tutorial data = snap.pp.import_data( snap.datasets.pbmc500(downsample=True), chrom_sizes=snap.genome.hg38, sorted_by_barcode=False) peak_mat = snap.pp.make_peak_matrix( data, peak_file=snap.datasets.cre_HEA()) # Now we can save the sparse matrix using the same mechanism described in tutorial 4 from scipy.io import mmwrite import csv #write the matrix mmwrite('peak_mat.snapATAC2.mtx', peak_mat.X.T) # write the cell barcodes peak_mat.obs.index.to_frame().to_csv('peak_mat.snapATAC2.barcodes.tsv.gz', header=None, sep="\t", index=None) # write the peak IDs tab separated peak_mat.var["chrID"] = peak_mat.var.index.map(lambda x:x.split(":")[0]) peak_mat.var["start"] = peak_mat.var.index.map(lambda x:x.split(":")[1].split('-')[0]) peak_mat.var["stop"] = peak_mat.var.index.map(lambda x:x.split(":")[1].split('-')[1]) peak_mat.var["start"] = peak_mat.var["start"].astype('int') peak_mat.var["stop"] = peak_mat.var["stop"].astype('int') peak_mat.var.to_csv( 'peak_mat.snapATAC2.features.tsv.gz', header=None, sep="\t", index=None ) # Now let's create a cell x gene expression associated sparse matrix gene_mat = snap.pp.make_gene_matrix(data, gene_anno=snap.genome.hg38) mmwrite('gene_mat.snapATAC2.mtx', gene_mat.X.T) # write the cell barcodes gene_mat.obs.index.to_frame().to_csv('gene_mat.snapATAC2.barcodes.tsv.gz', header=None, sep="\t", index=None) # write the gene IDs tab separated gene_mat.var.index \ .to_frame() \ .to_csv( 'gene_mat.snapATAC2.features.tsv.gz', header=None, sep="\t", index=None, ) ``` Note that a similar approach can be used to save any matrix form a [ScanPy](https://scanpy.readthedocs.io/en/stable/) or [EpiScanPy](https://episcanpy.readthedocs.io/en/latest/) object. We are now ready to infer the linkage with Enhlink. Note that we will use both matrices to infer enhancers correlated with the gene expressions of the second matrix. However we could have only used the peak matrix. ```bash enhlink \ -mat peak_mat.snapATAC2.mtx \ -xgi peak_mat.snapATAC2.barcodes.tsv.gz \ -ygi peak_mat.snapATAC2.features.tsv.gz \ -mat2 gene_mat.snapATAC2.mtx \ -xgi2 gene_mat.snapATAC2.barcodes.tsv.gz \ -ygi2 gene_mat.snapATAC2.features.tsv.gz \ -out snapATA2.enhlink.out \ -format cellRanger \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -threads 6 ``` Note that additional clustering or metadata can be available within the `data` `.obs` instance, which is a dataframe containing the cell metadata. To incorporate them into Enhlink's computation, it is required to generate a two-columns TSV file for clustering: or a n-columns file (with header) for *categorical* covariates: ... (see other tutorials). ```python # Let's create a categorical covariate indicating either high or low nb. of fragments data.obs["high_nb_fragment"] = data.obs["n_fragment"] > data.obs["n_fragment"].median() # save the covariates file data.obs.loc[:, ["high_nb_fragment"]].to_csv( 'snapATAC2.covariates.tsv.gz', sep="\t") ``` Launching a new Enhlink's analysis with covariates: ```bash enhlink \ -mat peak_mat.snapATAC2.mtx \ -xgi peak_mat.snapATAC2.barcodes.tsv.gz \ -ygi peak_mat.snapATAC2.features.tsv.gz \ -mat2 gene_mat.snapATAC2.mtx \ -xgi2 gene_mat.snapATAC2.barcodes.tsv.gz \ -ygi2 gene_mat.snapATAC2.features.tsv.gz \ -covariates snapATAC2.covariates.tsv.gz \ -secondOrder \ -out snapATA2.enhlink.out \ -format cellRanger \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -threads 6 ```