Tutorial 1: Simple processing of snATAC-seq data
In this first tutorial we will process some example single-cell ATAC-seq data from one of the 10X-genomics datasets available online. We will:
Download the single-cell MTX matrix (sparse matrix format) folder from the 10X genomics processed files obtained after cellRanger processing
Infer the Enhlinks “linkage” for all the cells and all the genes
Infer the Enhlinks “linkage” for each clusters
Infer the Enhlinks “linkage” and identify significant “cluster-specific” links
Process all the cells
Download the scATAC-seq MTX matrix
We first need to download the MTX matrix from the 10X example dataset: 10k Human PBMCs, ATAC v2, Chromium Controller that is accessible here: https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-controller-2-standard
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.tar.gz
tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.tar.gz
We then need to download a GTF file that describes the promoter locations in the HG38 :
wget http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/metadata/Homo_sapiens.GRCh38.99.TSS.2K.bed
simple processing with Enhlink
We are now ready to launch a simple linkage inference with Enhlink using all the cells and for all the promoters defined in the GTF files.
First, we need to go to the folder containg the MTX matrix and the GFT files. We should then have these files inside
tree
# output:
.
├── Homo_sapiens.GRCh38.99.TSS.2K.bed
└── filtered_peak_bc_matrix
├── barcodes.tsv
├── matrix.mtx
└── peaks.bed
Since Homo_sapiens.GRCh38.99.TSS.2K.bed as more than 60,000 genes, we will only process 500 random genes extracted from this file, as a proof of concept:
# Please use gshuf instead of shuf (frome coreutils) on OSX
cut -f4 Homo_sapiens.GRCh38.99.TSS.2K.bed|sort|uniq|shuf -n 500 > 500_rand_genes.txt
Now, let’s launch Enhlink with the following options:
-targets [optional] which defines the file containing the subset of promoters of interest. If not provided, Enhlink will process the 60K promoters defined in Homo_sapiens.GRCh38.99.TSS.2K.bed
-threads which defines the number of CPUs to use for each gene
-format which refers to the format of the input matrix
-rmPeaksInTargets which prevent inferring promoter-promoter linkage
-xgi is the barcode index of the MTX matrix (in the cellRanger format, this corresponds to the columns index)
-ygi is the peak index (as a BED file) of the MTX matrix (in the cellRanger format, this corresponds to the rows index)
enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
-xgi ./filtered_peak_bc_matrix/barcodes.tsv \
-ygi ./filtered_peak_bc_matrix/peaks.bed \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-out ./output_tutorial_1_1/ \
-threads 6 \
-targets 500_rand_genes.txt \
-format cellRanger \
-rmPeaksInTargets
This process should take about a minute to complete and the results are stored in the /output_tutorial_1_1/ folder:
tree /output_tutorial_1_1/
# output
./output_tutorial_1_1/
└── all.bedpe
head ./output_tutorial_1_1/all.bedpe
# EnhLink (version: 0.18.6)
##### PARAMETERS ####
# maxDepth: 2 n_boot: 100 maxNbFeatures: 4 minMatSize: 100 threshold: 0.050000 minLeafSize: 10 downsample: 0 maxFeatType:all neighborhood: 250000 merging_cutoff: 1500 rmPeaksInPromoter: true isFloatMat: false onlyPositiveLink: false secondOrder: false threads: 6 secondOrderMaxFeat: 2
# time: 2023-02-17 16:21:10
# input matrix: ./filtered_peak_bc_matrix/matrix.mtx
# xgi index: ./filtered_peak_bc_matrix/barcodes.tsv
# ygi index: ./filtered_peak_bc_matrix/peaks.bed
# mat file: ./filtered_peak_bc_matrix/matrix.mtx
# cluster file file:
# promoter file: Homo_sapiens.GRCh38.99.TSS.2K.bed
# cluster ID: all
#chrID1 start stop chrID2 start stop FDR.Adj.Pval Score Adj.Score geneID
chr17 75077203 75078080 chr17 75058620 75062620 1.36e-04 2.19e-06 1.07e-05 ac111186.2
chr17 74981894 74982787 chr17 75058620 75062620 6.97e-08 7.57e-06 6.12e-05 ac111186.2
chr17 75154195 75155099 chr17 75058620 75062620 5.60e-08 9.77e-06 2.25e-05 ac111186.2
chr17 75036398 75037311 chr17 75058620 75062620 0.00e+00 1.16e-04 3.56e-04 ac111186.2
chr21 41973808 41974736 chr21 41765089 41769089 1.82e-02 7.43e-07 3.78e-07 ripk4
chr21 41816424 41817163 chr21 41765089 41769089 1.67e-10 3.25e-06 9.46e-07 ripk4
chr21 41895604 41896499 chr21 41765089 41769089 0.00e+00 2.11e-05 4.96e-05 ripk4
chr21 41926204 41927091 chr21 41765089 41769089 0.00e+00 1.23e-05 1.48e-05 ripk4
Processing for different clusters
Now let’s download the default clustering analysis performed by cell ranger:
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_analysis.tar.gz
tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_analysis.tar.gz
As a proof of concept we will use the K-mean clusters for K=3 analysis/clustering/kmeans_3_clusters/clusters.csv. We first need to create a .tsv file instead of a .csv, which is straightforward:
# Option 1
sed 's/,/\t/g' analysis/clustering/kmeans_3_clusters/clusters.csv > analysis/clustering/kmeans_3_clusters/clusters.tsv
# option 2 (better for OSX)
awk '{FS=",";OFS="\t"}{print $1,$2}' analysis/clustering/kmeans_3_clusters/clusters.csv > analysis/clustering/kmeans_3_clusters/clusters.tsv
Next, we just have to add the -clusters variable to Enhlink:
enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
-xgi ./filtered_peak_bc_matrix/barcodes.tsv \
-ygi ./filtered_peak_bc_matrix/peaks.bed \
-clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-out ./output_tutorial_1_2/ \
-threads 6 \
-targets 500_rand_genes.txt \
-format cellRanger \
-rmPeaksInTargets
The results folder has one bedpe per cluster defined in the clusters.tsv file:
tree ./output_tutorial_1_2/
# output
./output_tutorial_1_2/
├── 1.bedpe
├── 2.bedpe
└── 3.bedpe
Including covariates in the model
When inferring enhancer-promoter links with Enhlink, it is often judiciary to include associated covariates in the models to prevent batch effect and to identify covariate-specific links. Cell-covariates of interest depend of the context but can be for example, the library IDs (if multiple runs are merged together), the sex, the condition, or even the cell-type/cluster ID. Since this dataset has no obvious cell covariates to download, we will use the previous cluster IDs as covariate as a proof of concept, and identify the cluster-specific links.
This analysis is different than the one above because it identified the links of each cluster using the subset of cells specific to each cluster. In the previous analysis, a link can be found in multiple clusters. Also, the last analysis is useful to find links specific to small clusters/cell-types among much larger subpopulations because it focused only on the subset of cells.
Please verify that the covariates file contains a header naming the different covariates. The covariates file should have the following structure:
covariateName1 covariateName2 ... covariateNameJ
val_1_1 val_2_1 val_J_1
val_1_2 val_2_2 val_J_2
...
The values must not be integer or float values but rather categorical. Let’s launch this new analysis
enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
-xgi ./filtered_peak_bc_matrix/barcodes.tsv \
-ygi ./filtered_peak_bc_matrix/peaks.bed \
-covariates ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
-secondOrder \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-out ./output_tutorial_1_3/ \
-threads 6 \
-targets 500_rand_genes.txt \
-format cellRanger \
-rmPeaksInTargets
This analysis should take couple of minutes. The -covariate option allowed to add the binarized covariates into the model. Instead of inferring p = f(P), with p the promoter vector and P the matrix formed with the surrounding peak vectors of p, we inferred p = f(P,C) with C the matrix of binarized covariates. The option allowed -secondOrder to perform additional analyzes to identify covariate-specific links at the expense of more computations.
The results folder has the following files:
tree ./output_tutorial_1_3/
# output
./output_tutorial_1_3/
├── all.2ndOrder.bedpe
├── all.batch.tsv
└── all.bedpe
The file all.batch.tsv lists the covariates associated with each promoter and the file all.2ndOrder.bedpe describes the covariates associated to each links
head -5 output_tutorial_1_3/all.batch.tsv
# output
Cluster_2 chr10 103394626 103398626 3.82e-04 8.77e-05 1.63e-04 pdcd11
Cluster_2 chr7 38176222 38180222 0.00e+00 8.27e-04 1.34e-03 stard3nl
Cluster_3 chr7 74888610 74892610 7.29e-03 5.99e-06 3.58e-06 stag3l2
Cluster_1 chr7 74888610 74892610 2.57e-05 1.75e-05 5.68e-06 stag3l2
head -20 output_tutorial_1_3/all.2ndOrder.bedpe
# Output
# EnhLink (version: 0.18.6)
##### PARAMETERS ####
# maxDepth: 2 n_boot: 100 maxNbFeatures: 4 minMatSize: 100 threshold: 0.050000 minLeafSize: 10 downsample: 0 maxFeatType:all neighborhood: 250000 merging_cutoff: 1500 rmPeaksInPromoter: true isFloatMat: false onlyPositiveLink: false secondOrder: true threads: 6 secondOrderMaxFeat: 2
# time: 2023-02-20 12:45:26
# input matrix: ./filtered_peak_bc_matrix/matrix.mtx
# xgi index: ./filtered_peak_bc_matrix/barcodes.tsv
# ygi index: ./filtered_peak_bc_matrix/peaks.bed
# mat file: ./filtered_peak_bc_matrix/matrix.mtx
# cluster file file:
# promoter file: Homo_sapiens.GRCh38.99.TSS.2K.bed
# metadata: ./analysis/clustering/kmeans_3_clusters/clusters.tsv
# cluster ID: all
##### Second order link-covariate interactions ####
#chrID1 start stop chrID2 start stop Covariate FDR.Adj.Pval Score Adj.Score geneID
chr10 103551405 103552275 chr10 103394626 103398626 Cluster_2 0.00e+00 5.63e-04 1.04e-03 pdcd11
chr10 103485434 103486330 chr10 103394626 103398626 Cluster_2 9.32e-06 1.67e-07 3.09e-07 pdcd11
chr7 38107239 38108115 chr7 38176222 38180222 Cluster_2 0.00e+00 1.77e-05 2.87e-05 stard3nl
chr7 38215499 38216377 chr7 38176222 38180222 Cluster_2 1.72e-02 2.00e-06 3.23e-06 stard3nl
chr7 38305247 38306139 chr7 38176222 38180222 Cluster_2 0.00e+00 6.68e-05 1.08e-04 stard3nl
chr7 38394425 38395335 chr7 38176222 38180222 Cluster_2 0.00e+00 1.61e-05 2.62e-05 stard3nl
The FDR.Adj.Pval, Score, and Adj.Score are scores linked to the covariate-specific inference and not the link by itself.
Class imbalance with covariates
Enhlink provides an option called -uniformSample
that can be used when analyzing datasets with class imbalance in covariates. This option helps to ensure that each bootstrap contains a more uniform distribution of cells from each class by taking into account the distribution of the covariates.
For instance, if one cluster represents 80% of cells and the other two clusters only represent 10%, the -uniformSample
option will be beneficial to avoid a fraction of bootstrap samples having too few samples of the underrepresented clusters.
However, it is important to exercise caution when using this option, especially if one class is severely underrepresented. If a class represents less than nbCellsTotal/nbCovariate
, the number of cells (N
) sampled will be larger than the class size (S
), resulting in a larger number of unique cells (Nu
) being sampled than with a bootstrap procedure (66% on average). For instance, if N
is twice the size of S
, Nu
will reach 90%, and 95% for three times the size of S
, which could decrease the variability of the unique cells sampled.