Tutorial 1: Simple processing of snATAC-seq data

In this first tutorial we will process some example single-cell ATAC-seq data from one of the 10X-genomics datasets available online. We will:

  1. Download the single-cell MTX matrix (sparse matrix format) folder from the 10X genomics processed files obtained after cellRanger processing

  2. Infer the Enhlinks “linkage” for all the cells and all the genes

  3. Infer the Enhlinks “linkage” for each clusters

  4. Infer the Enhlinks “linkage” and identify significant “cluster-specific” links

Process all the cells

Download the scATAC-seq MTX matrix

We first need to download the MTX matrix from the 10X example dataset: 10k Human PBMCs, ATAC v2, Chromium Controller that is accessible here: https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-controller-2-standard

wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.tar.gz

tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.tar.gz

We then need to download a GTF file that describes the promoter locations in the HG38 :

wget http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/metadata/Homo_sapiens.GRCh38.99.TSS.2K.bed

Processing for different clusters

Now let’s download the default clustering analysis performed by cell ranger:

wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_analysis.tar.gz

tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_analysis.tar.gz

As a proof of concept we will use the K-mean clusters for K=3 analysis/clustering/kmeans_3_clusters/clusters.csv. We first need to create a .tsv file instead of a .csv, which is straightforward:

# Option 1
sed 's/,/\t/g' analysis/clustering/kmeans_3_clusters/clusters.csv > analysis/clustering/kmeans_3_clusters/clusters.tsv

# option 2 (better for OSX)
awk '{FS=",";OFS="\t"}{print $1,$2}' analysis/clustering/kmeans_3_clusters/clusters.csv > analysis/clustering/kmeans_3_clusters/clusters.tsv

Next, we just have to add the -clusters variable to Enhlink:

enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
    -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
    -ygi ./filtered_peak_bc_matrix/peaks.bed \
    -clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out ./output_tutorial_1_2/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets

The results folder has one bedpe per cluster defined in the clusters.tsv file:

tree ./output_tutorial_1_2/

# output

./output_tutorial_1_2/
├── 1.bedpe
├── 2.bedpe
└── 3.bedpe

Including covariates in the model

When inferring enhancer-promoter links with Enhlink, it is often judiciary to include associated covariates in the models to prevent batch effect and to identify covariate-specific links. Cell-covariates of interest depend of the context but can be for example, the library IDs (if multiple runs are merged together), the sex, the condition, or even the cell-type/cluster ID. Since this dataset has no obvious cell covariates to download, we will use the previous cluster IDs as covariate as a proof of concept, and identify the cluster-specific links.

This analysis is different than the one above because it identified the links of each cluster using the subset of cells specific to each cluster. In the previous analysis, a link can be found in multiple clusters. Also, the last analysis is useful to find links specific to small clusters/cell-types among much larger subpopulations because it focused only on the subset of cells.

Please verify that the covariates file contains a header naming the different covariates. The covariates file should have the following structure:

covariateName1    covariateName2    ...    covariateNameJ
val_1_1           val_2_1                  val_J_1
val_1_2           val_2_2                  val_J_2
...

The values must not be integer or float values but rather categorical. Let’s launch this new analysis

enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
    -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
    -ygi ./filtered_peak_bc_matrix/peaks.bed \
    -covariates ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
    -secondOrder \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out ./output_tutorial_1_3/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets

This analysis should take couple of minutes. The -covariate option allowed to add the binarized covariates into the model. Instead of inferring p = f(P), with p the promoter vector and P the matrix formed with the surrounding peak vectors of p, we inferred p = f(P,C) with C the matrix of binarized covariates. The option allowed -secondOrder to perform additional analyzes to identify covariate-specific links at the expense of more computations.

The results folder has the following files:

tree ./output_tutorial_1_3/

# output

./output_tutorial_1_3/
├── all.2ndOrder.bedpe
├── all.batch.tsv
└── all.bedpe

The file all.batch.tsv lists the covariates associated with each promoter and the file all.2ndOrder.bedpe describes the covariates associated to each links

head -5 output_tutorial_1_3/all.batch.tsv

# output

Cluster_2       chr10   103394626       103398626       3.82e-04        8.77e-05        1.63e-04        pdcd11
Cluster_2       chr7    38176222        38180222        0.00e+00        8.27e-04        1.34e-03        stard3nl
Cluster_3       chr7    74888610        74892610        7.29e-03        5.99e-06        3.58e-06        stag3l2
Cluster_1       chr7    74888610        74892610        2.57e-05        1.75e-05        5.68e-06        stag3l2

head -20 output_tutorial_1_3/all.2ndOrder.bedpe

# Output

# EnhLink (version: 0.18.6)
##### PARAMETERS ####
#  maxDepth: 2 n_boot: 100 maxNbFeatures: 4 minMatSize: 100 threshold: 0.050000 minLeafSize: 10 downsample: 0 maxFeatType:all neighborhood: 250000 merging_cutoff: 1500 rmPeaksInPromoter: true  isFloatMat: false onlyPositiveLink: false secondOrder: true threads: 6 secondOrderMaxFeat: 2
# time: 2023-02-20 12:45:26
# input matrix: ./filtered_peak_bc_matrix/matrix.mtx
# xgi index: ./filtered_peak_bc_matrix/barcodes.tsv
# ygi index: ./filtered_peak_bc_matrix/peaks.bed
# mat file: ./filtered_peak_bc_matrix/matrix.mtx
# cluster file file:
# promoter file: Homo_sapiens.GRCh38.99.TSS.2K.bed
# metadata: ./analysis/clustering/kmeans_3_clusters/clusters.tsv
# cluster ID: all
##### Second order link-covariate interactions ####
#chrID1 start   stop    chrID2  start   stop    Covariate       FDR.Adj.Pval    Score   Adj.Score       geneID
chr10   103551405       103552275       chr10   103394626       103398626       Cluster_2       0.00e+00        5.63e-04        1.04e-03        pdcd11
chr10   103485434       103486330       chr10   103394626       103398626       Cluster_2       9.32e-06        1.67e-07        3.09e-07        pdcd11
chr7    38107239        38108115        chr7    38176222        38180222        Cluster_2       0.00e+00        1.77e-05        2.87e-05        stard3nl
chr7    38215499        38216377        chr7    38176222        38180222        Cluster_2       1.72e-02        2.00e-06        3.23e-06        stard3nl
chr7    38305247        38306139        chr7    38176222        38180222        Cluster_2       0.00e+00        6.68e-05        1.08e-04        stard3nl
chr7    38394425        38395335        chr7    38176222        38180222        Cluster_2       0.00e+00        1.61e-05        2.62e-05        stard3nl

The FDR.Adj.Pval, Score, and Adj.Score are scores linked to the covariate-specific inference and not the link by itself.

Class imbalance with covariates

Enhlink provides an option called -uniformSample that can be used when analyzing datasets with class imbalance in covariates. This option helps to ensure that each bootstrap contains a more uniform distribution of cells from each class by taking into account the distribution of the covariates.

For instance, if one cluster represents 80% of cells and the other two clusters only represent 10%, the -uniformSample option will be beneficial to avoid a fraction of bootstrap samples having too few samples of the underrepresented clusters.

However, it is important to exercise caution when using this option, especially if one class is severely underrepresented. If a class represents less than nbCellsTotal/nbCovariate, the number of cells (N) sampled will be larger than the class size (S), resulting in a larger number of unique cells (Nu) being sampled than with a bootstrap procedure (66% on average). For instance, if N is twice the size of S, Nu will reach 90%, and 95% for three times the size of S, which could decrease the variability of the unique cells sampled.