Tutorial 1: Simple processing of snATAC-seq data

In this first tutorial we will process some example single-cell ATAC-seq data from one of the 10X-genomics datasets available online. We will:

Download the single-cell MTX matrix (sparse matrix format) folder from the 10X genomics processed files obtained after cellRanger processing
Infer the Enhlinks “linkage” for all the cells and all the genes
Infer the Enhlinks “linkage” for each clusters
Infer the Enhlinks “linkage” and identify significant “cluster-specific” links

Process all the cells

Download the scATAC-seq MTX matrix

We first need to download the MTX matrix from the 10X example dataset: 10k Human PBMCs, ATAC v2, Chromium Controller that is accessible here: https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-controller-2-standard

wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.tar.gz

tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.tar.gz

We then need to download a GTF file that describes the promoter locations in the HG38 :

wget http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/metadata/Homo_sapiens.GRCh38.99.TSS.2K.bed

simple processing with Enhlink

We are now ready to launch a simple linkage inference with Enhlink using all the cells and for all the promoters defined in the GTF files.

First, we need to go to the folder containg the MTX matrix and the GFT files. We should then have these files inside

tree

# output:

.
├── Homo_sapiens.GRCh38.99.TSS.2K.bed
└── filtered_peak_bc_matrix
    ├── barcodes.tsv
    ├── matrix.mtx
    └── peaks.bed

Since Homo_sapiens.GRCh38.99.TSS.2K.bed as more than 60,000 genes, we will only process 500 random genes extracted from this file, as a proof of concept:

# Please use gshuf instead of shuf (frome coreutils) on OSX
cut -f4 Homo_sapiens.GRCh38.99.TSS.2K.bed|sort|uniq|shuf -n 500 > 500_rand_genes.txt

Now, let’s launch Enhlink with the following options:

-targets [optional] which defines the file containing the subset of promoters of interest. If not provided, Enhlink will process the 60K promoters defined in Homo_sapiens.GRCh38.99.TSS.2K.bed
-threads which defines the number of CPUs to use for each gene
-format which refers to the format of the input matrix
-rmPeaksInTargets which prevent inferring promoter-promoter linkage
-xgi is the barcode index of the MTX matrix (in the cellRanger format, this corresponds to the columns index)
-ygi is the peak index (as a BED file) of the MTX matrix (in the cellRanger format, this corresponds to the rows index)

enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
        -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
        -ygi ./filtered_peak_bc_matrix/peaks.bed \
        -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
        -out ./output_tutorial_1_1/ \
        -threads 6 \
        -targets 500_rand_genes.txt \
        -format cellRanger \
        -rmPeaksInTargets

This process should take about a minute to complete and the results are stored in the /output_tutorial_1_1/ folder:

tree /output_tutorial_1_1/

# output
./output_tutorial_1_1/
└── all.bedpe

head ./output_tutorial_1_1/all.bedpe

# EnhLink (version: 0.18.6)
##### PARAMETERS ####
#  maxDepth: 2 n_boot: 100 maxNbFeatures: 4 minMatSize: 100 threshold: 0.050000 minLeafSize: 10 downsample: 0 maxFeatType:all neighborhood: 250000 merging_cutoff: 1500 rmPeaksInPromoter: true  isFloatMat: false onlyPositiveLink: false secondOrder: false threads: 6 secondOrderMaxFeat: 2
# time: 2023-02-17 16:21:10
# input matrix: ./filtered_peak_bc_matrix/matrix.mtx
# xgi index: ./filtered_peak_bc_matrix/barcodes.tsv
# ygi index: ./filtered_peak_bc_matrix/peaks.bed
# mat file: ./filtered_peak_bc_matrix/matrix.mtx
# cluster file file:
# promoter file: Homo_sapiens.GRCh38.99.TSS.2K.bed
# cluster ID: all
#chrID1 start   stop    chrID2  start   stop    FDR.Adj.Pval    Score   Adj.Score       geneID
chr17   75077203        75078080        chr17   75058620        75062620        1.36e-04        2.19e-06        1.07e-05        ac111186.2
chr17   74981894        74982787        chr17   75058620        75062620        6.97e-08        7.57e-06        6.12e-05        ac111186.2
chr17   75154195        75155099        chr17   75058620        75062620        5.60e-08        9.77e-06        2.25e-05        ac111186.2
chr17   75036398        75037311        chr17   75058620        75062620        0.00e+00        1.16e-04        3.56e-04        ac111186.2
chr21   41973808        41974736        chr21   41765089        41769089        1.82e-02        7.43e-07        3.78e-07        ripk4
chr21   41816424        41817163        chr21   41765089        41769089        1.67e-10        3.25e-06        9.46e-07        ripk4
chr21   41895604        41896499        chr21   41765089        41769089        0.00e+00        2.11e-05        4.96e-05        ripk4
chr21   41926204        41927091        chr21   41765089        41769089        0.00e+00        1.23e-05        1.48e-05        ripk4

Processing for different clusters

Now let’s download the default clustering analysis performed by cell ranger:

wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_analysis.tar.gz

tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_analysis.tar.gz

As a proof of concept we will use the K-mean clusters for K=3 analysis/clustering/kmeans_3_clusters/clusters.csv. We first need to create a .tsv file instead of a .csv, which is straightforward:

# Option 1
sed 's/,/\t/g' analysis/clustering/kmeans_3_clusters/clusters.csv > analysis/clustering/kmeans_3_clusters/clusters.tsv

# option 2 (better for OSX)
awk '{FS=",";OFS="\t"}{print $1,$2}' analysis/clustering/kmeans_3_clusters/clusters.csv > analysis/clustering/kmeans_3_clusters/clusters.tsv

Next, we just have to add the -clusters variable to Enhlink:

enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
    -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
    -ygi ./filtered_peak_bc_matrix/peaks.bed \
    -clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out ./output_tutorial_1_2/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets

The results folder has one bedpe per cluster defined in the clusters.tsv file:

tree ./output_tutorial_1_2/

# output

./output_tutorial_1_2/
├── 1.bedpe
├── 2.bedpe
└── 3.bedpe

Including covariates in the model

When inferring enhancer-promoter links with Enhlink, it is often judiciary to include associated covariates in the models to prevent batch effect and to identify covariate-specific links. Cell-covariates of interest depend of the context but can be for example, the library IDs (if multiple runs are merged together), the sex, the condition, or even the cell-type/cluster ID. Since this dataset has no obvious cell covariates to download, we will use the previous cluster IDs as covariate as a proof of concept, and identify the cluster-specific links.

This analysis is different than the one above because it identified the links of each cluster using the subset of cells specific to each cluster. In the previous analysis, a link can be found in multiple clusters. Also, the last analysis is useful to find links specific to small clusters/cell-types among much larger subpopulations because it focused only on the subset of cells.

Please verify that the covariates file contains a header naming the different covariates. The covariates file should have the following structure:

covariateName1    covariateName2    ...    covariateNameJ
val_1_1           val_2_1                  val_J_1
val_1_2           val_2_2                  val_J_2
...

The values must not be integer or float values but rather categorical. Let’s launch this new analysis

enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \
    -xgi ./filtered_peak_bc_matrix/barcodes.tsv \
    -ygi ./filtered_peak_bc_matrix/peaks.bed \
    -covariates ./analysis/clustering/kmeans_3_clusters/clusters.tsv \
    -secondOrder \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out ./output_tutorial_1_3/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets

This analysis should take couple of minutes. The -covariate option allowed to add the binarized covariates into the model. Instead of inferring p = f(P), with p the promoter vector and P the matrix formed with the surrounding peak vectors of p, we inferred p = f(P,C) with C the matrix of binarized covariates. The option allowed -secondOrder to perform additional analyzes to identify covariate-specific links at the expense of more computations.

The results folder has the following files:

tree ./output_tutorial_1_3/

# output

./output_tutorial_1_3/
├── all.2ndOrder.bedpe
├── all.batch.tsv
└── all.bedpe

The file all.batch.tsv lists the covariates associated with each promoter and the file all.2ndOrder.bedpe describes the covariates associated to each links

head -5 output_tutorial_1_3/all.batch.tsv

# output

Cluster_2       chr10   103394626       103398626       3.82e-04        8.77e-05        1.63e-04        pdcd11
Cluster_2       chr7    38176222        38180222        0.00e+00        8.27e-04        1.34e-03        stard3nl
Cluster_3       chr7    74888610        74892610        7.29e-03        5.99e-06        3.58e-06        stag3l2
Cluster_1       chr7    74888610        74892610        2.57e-05        1.75e-05        5.68e-06        stag3l2

head -20 output_tutorial_1_3/all.2ndOrder.bedpe

# Output

# EnhLink (version: 0.18.6)
##### PARAMETERS ####
#  maxDepth: 2 n_boot: 100 maxNbFeatures: 4 minMatSize: 100 threshold: 0.050000 minLeafSize: 10 downsample: 0 maxFeatType:all neighborhood: 250000 merging_cutoff: 1500 rmPeaksInPromoter: true  isFloatMat: false onlyPositiveLink: false secondOrder: true threads: 6 secondOrderMaxFeat: 2
# time: 2023-02-20 12:45:26
# input matrix: ./filtered_peak_bc_matrix/matrix.mtx
# xgi index: ./filtered_peak_bc_matrix/barcodes.tsv
# ygi index: ./filtered_peak_bc_matrix/peaks.bed
# mat file: ./filtered_peak_bc_matrix/matrix.mtx
# cluster file file:
# promoter file: Homo_sapiens.GRCh38.99.TSS.2K.bed
# metadata: ./analysis/clustering/kmeans_3_clusters/clusters.tsv
# cluster ID: all
##### Second order link-covariate interactions ####
#chrID1 start   stop    chrID2  start   stop    Covariate       FDR.Adj.Pval    Score   Adj.Score       geneID
chr10   103551405       103552275       chr10   103394626       103398626       Cluster_2       0.00e+00        5.63e-04        1.04e-03        pdcd11
chr10   103485434       103486330       chr10   103394626       103398626       Cluster_2       9.32e-06        1.67e-07        3.09e-07        pdcd11
chr7    38107239        38108115        chr7    38176222        38180222        Cluster_2       0.00e+00        1.77e-05        2.87e-05        stard3nl
chr7    38215499        38216377        chr7    38176222        38180222        Cluster_2       1.72e-02        2.00e-06        3.23e-06        stard3nl
chr7    38305247        38306139        chr7    38176222        38180222        Cluster_2       0.00e+00        6.68e-05        1.08e-04        stard3nl
chr7    38394425        38395335        chr7    38176222        38180222        Cluster_2       0.00e+00        1.61e-05        2.62e-05        stard3nl

The FDR.Adj.Pval, Score, and Adj.Score are scores linked to the covariate-specific inference and not the link by itself.

Class imbalance with covariates

Enhlink provides an option called -uniformSample that can be used when analyzing datasets with class imbalance in covariates. This option helps to ensure that each bootstrap contains a more uniform distribution of cells from each class by taking into account the distribution of the covariates.

For instance, if one cluster represents 80% of cells and the other two clusters only represent 10%, the -uniformSample option will be beneficial to avoid a fraction of bootstrap samples having too few samples of the underrepresented clusters.

However, it is important to exercise caution when using this option, especially if one class is severely underrepresented. If a class represents less than nbCellsTotal/nbCovariate, the number of cells (N) sampled will be larger than the class size (S), resulting in a larger number of unique cells (Nu) being sampled than with a bootstrap procedure (66% on average). For instance, if N is twice the size of S, Nu will reach 90%, and 95% for three times the size of S, which could decrease the variability of the unique cells sampled.