# Tutorial 1: Simple processing of snATAC-seq data In this first tutorial we will process some example single-cell ATAC-seq data from one of the 10X-genomics datasets available online. We will: 1. Download the single-cell MTX matrix (sparse matrix format) folder from the 10X genomics processed files obtained after cellRanger processing 2. Infer the Enhlinks "linkage" for all the cells and all the genes 3. Infer the Enhlinks "linkage" for each clusters 4. Infer the Enhlinks "linkage" and identify significant "cluster-specific" links ## Process all the cells ### Download the scATAC-seq MTX matrix We first need to download the MTX matrix from the 10X example dataset: _10k Human PBMCs, ATAC v2, Chromium Controller_ that is accessible here: [https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-controller-2-standard](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-controller-2-standard) ```bash wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.tar.gz tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.tar.gz ``` We then need to download a GTF file that describes the promoter locations in the HG38 : ```bash wget http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/metadata/Homo_sapiens.GRCh38.99.TSS.2K.bed ``` ### simple processing with Enhlink We are now ready to launch a simple linkage inference with Enhlink to link the enhancers of the MTX matrix to the target promoters defined in the GTF files. First, we need to go to the folder containg the MTX matrix and the GFT files. We should have these files inside ```bash tree # output: . ├── Homo_sapiens.GRCh38.99.TSS.2K.bed └── filtered_peak_bc_matrix ├── barcodes.tsv ├── matrix.mtx └── peaks.bed ``` Since _Homo_sapiens.GRCh38.99.TSS.2K.bed_ has more than 60,000 genes, we will only process 500 random genes extracted from this file, as a proof of concept: ```bash # Please use gshuf instead of shuf (frome coreutils) on OSX cut -f4 Homo_sapiens.GRCh38.99.TSS.2K.bed|sort|uniq|shuf -n 500 > 500_rand_genes.txt ``` We will launch Enhlink with the following options: - _-targets_ [optional] which defines the file containing the subset of promoters of interest. If not provided, Enhlink will process the 60K promoters defined in Homo_sapiens.GRCh38.99.TSS.2K.bed - _-threads_ which defines the number of CPUs to use for each gene - _-format_ which refers to the format of the input matrix - _-rmPeaksInTargets_ which prevent inferring promoter-promoter linkage - _-xgi_ is the barcode index of the MTX matrix (in the cellRanger format, this corresponds to the columns index) - _-ygi_ is the peak index (as a BED file) of the MTX matrix (in the cellRanger format, this corresponds to the rows index) ```bash enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \ -xgi ./filtered_peak_bc_matrix/barcodes.tsv \ -ygi ./filtered_peak_bc_matrix/peaks.bed \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -out ./output_tutorial_1_1/ \ -threads 6 \ -targets 500_rand_genes.txt \ -format cellRanger \ -rmPeaksInTargets ``` This process should take about a minute to complete and the results are stored in the _/output_tutorial_1_1/_ folder: ``` tree /output_tutorial_1_1/ # output ./output_tutorial_1_1/ └── all.bedpe head ./output_tutorial_1_1/all.bedpe # EnhLink (version: 0.18.6) ##### PARAMETERS #### # maxDepth: 2 n_boot: 100 maxNbFeatures: 4 minMatSize: 100 threshold: 0.050000 minLeafSize: 10 downsample: 0 maxFeatType:all neighborhood: 250000 merging_cutoff: 1500 rmPeaksInPromoter: true isFloatMat: false onlyPositiveLink: false secondOrder: false threads: 6 secondOrderMaxFeat: 2 # time: 2023-02-17 16:21:10 # input matrix: ./filtered_peak_bc_matrix/matrix.mtx # xgi index: ./filtered_peak_bc_matrix/barcodes.tsv # ygi index: ./filtered_peak_bc_matrix/peaks.bed # mat file: ./filtered_peak_bc_matrix/matrix.mtx # cluster file file: # promoter file: Homo_sapiens.GRCh38.99.TSS.2K.bed # cluster ID: all #chrID1 start stop chrID2 start stop FDR.Adj.Pval Score Adj.Score geneID chr17 75077203 75078080 chr17 75058620 75062620 1.36e-04 2.19e-06 1.07e-05 ac111186.2 chr17 74981894 74982787 chr17 75058620 75062620 6.97e-08 7.57e-06 6.12e-05 ac111186.2 chr17 75154195 75155099 chr17 75058620 75062620 5.60e-08 9.77e-06 2.25e-05 ac111186.2 chr17 75036398 75037311 chr17 75058620 75062620 0.00e+00 1.16e-04 3.56e-04 ac111186.2 chr21 41973808 41974736 chr21 41765089 41769089 1.82e-02 7.43e-07 3.78e-07 ripk4 chr21 41816424 41817163 chr21 41765089 41769089 1.67e-10 3.25e-06 9.46e-07 ripk4 chr21 41895604 41896499 chr21 41765089 41769089 0.00e+00 2.11e-05 4.96e-05 ripk4 chr21 41926204 41927091 chr21 41765089 41769089 0.00e+00 1.23e-05 1.48e-05 ripk4 ``` ### Processing for different clusters We can now perform an independent linkage analysis for each clusters. As an example, let's download the default clustering analysis performed by cell ranger: ```bash wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_analysis.tar.gz tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_analysis.tar.gz ``` We will use the K-mean clusters for K=3 _analysis/clustering/kmeans_3_clusters/clusters.csv_. We first need to create a .tsv file instead of a .csv, which is straightforward: ```bash # Option 1 sed 's/,/\t/g' analysis/clustering/kmeans_3_clusters/clusters.csv > analysis/clustering/kmeans_3_clusters/clusters.tsv # option 2 (better for OSX) awk '{FS=",";OFS="\t"}{print $1,$2}' analysis/clustering/kmeans_3_clusters/clusters.csv > analysis/clustering/kmeans_3_clusters/clusters.tsv ``` Next, we just have to add the _-clusters_ variable to Enhlink: ```bash enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \ -xgi ./filtered_peak_bc_matrix/barcodes.tsv \ -ygi ./filtered_peak_bc_matrix/peaks.bed \ -clusters ./analysis/clustering/kmeans_3_clusters/clusters.tsv \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -out ./output_tutorial_1_2/ \ -threads 6 \ -targets 500_rand_genes.txt \ -format cellRanger \ -rmPeaksInTargets ``` The results folder has one bedpe per cluster defined in the clusters.tsv file: ```bash tree ./output_tutorial_1_2/ # output ./output_tutorial_1_2/ ├── 1.bedpe ├── 2.bedpe └── 3.bedpe ``` ### Including covariates in the model It is often judiciary to include, when possible, the covariates associated with each cells in the Enhlink analysis. This helps prevent batch effect and allows to identify covariate-specific links. Cell-covariates of interest depend of the context but can be for example, the library IDs (if multiple runs are merged together), the sex, the condition, or even the cell-type/cluster ID. Since this dataset has no obvious cell covariates to download, we will use the previous cluster IDs as covariate as a proof of concept, and identify the cluster-specific links. This analysis is different than the one above because it identified the links of each cluster using the subset of cells specific to each cluster. In the previous analysis, a link can be found in multiple clusters. Also, the last analysis is useful to find links specific to small clusters/cell-types among much larger subpopulations because it focused only on the subset of cells. Please *verify* that the covariates file contains a *header* naming the different covariates. The covariates file should have the following structure: ```bash covariateName1 covariateName2 ... covariateNameJ val_1_1 val_2_1 val_J_1 val_1_2 val_2_2 val_J_2 ... ``` The values must not be *integer* or *float* but rather be categorical. Let's launch this new analysis ```bash enhlink -mat ./filtered_peak_bc_matrix/matrix.mtx \ -xgi ./filtered_peak_bc_matrix/barcodes.tsv \ -ygi ./filtered_peak_bc_matrix/peaks.bed \ -covariates ./analysis/clustering/kmeans_3_clusters/clusters.tsv \ -secondOrder \ -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \ -out ./output_tutorial_1_3/ \ -threads 6 \ -targets 500_rand_genes.txt \ -format cellRanger \ -rmPeaksInTargets ``` This analysis should take couple of minutes. The _-covariate_ option allowed to add the binarized covariates into the model. Instead of inferring _p = f(P)_, with p the promoter vector and P the matrix formed with the surrounding peak vectors of p, we inferred _p = f(P,C)_ with C the matrix of binarized covariates. The option allowed _-secondOrder_ to perform additional analyzes to identify covariate-specific links at the expense of more computations. The results folder has the following files: ```bash tree ./output_tutorial_1_3/ # output ./output_tutorial_1_3/ ├── all.2ndOrder.bedpe ├── all.batch.tsv └── all.bedpe ``` The file _all.batch.tsv_ lists the covariates associated with each promoter and the file _all.2ndOrder.bedpe_contains an additional column: _Covariate_, defining the covariate associated with each link. ```bash head -5 output_tutorial_1_3/all.batch.tsv # output Cluster_2 chr10 103394626 103398626 3.82e-04 8.77e-05 1.63e-04 pdcd11 Cluster_2 chr7 38176222 38180222 0.00e+00 8.27e-04 1.34e-03 stard3nl Cluster_3 chr7 74888610 74892610 7.29e-03 5.99e-06 3.58e-06 stag3l2 Cluster_1 chr7 74888610 74892610 2.57e-05 1.75e-05 5.68e-06 stag3l2 head -20 output_tutorial_1_3/all.2ndOrder.bedpe # Output # EnhLink (version: 0.18.6) ##### PARAMETERS #### # maxDepth: 2 n_boot: 100 maxNbFeatures: 4 minMatSize: 100 threshold: 0.050000 minLeafSize: 10 downsample: 0 maxFeatType:all neighborhood: 250000 merging_cutoff: 1500 rmPeaksInPromoter: true isFloatMat: false onlyPositiveLink: false secondOrder: true threads: 6 secondOrderMaxFeat: 2 # time: 2023-02-20 12:45:26 # input matrix: ./filtered_peak_bc_matrix/matrix.mtx # xgi index: ./filtered_peak_bc_matrix/barcodes.tsv # ygi index: ./filtered_peak_bc_matrix/peaks.bed # mat file: ./filtered_peak_bc_matrix/matrix.mtx # cluster file file: # promoter file: Homo_sapiens.GRCh38.99.TSS.2K.bed # metadata: ./analysis/clustering/kmeans_3_clusters/clusters.tsv # cluster ID: all ##### Second order link-covariate interactions #### #chrID1 start stop chrID2 start stop Covariate FDR.Adj.Pval Score Adj.Score geneID chr10 103551405 103552275 chr10 103394626 103398626 Cluster_2 0.00e+00 5.63e-04 1.04e-03 pdcd11 chr10 103485434 103486330 chr10 103394626 103398626 Cluster_2 9.32e-06 1.67e-07 3.09e-07 pdcd11 chr7 38107239 38108115 chr7 38176222 38180222 Cluster_2 0.00e+00 1.77e-05 2.87e-05 stard3nl chr7 38215499 38216377 chr7 38176222 38180222 Cluster_2 1.72e-02 2.00e-06 3.23e-06 stard3nl chr7 38305247 38306139 chr7 38176222 38180222 Cluster_2 0.00e+00 6.68e-05 1.08e-04 stard3nl chr7 38394425 38395335 chr7 38176222 38180222 Cluster_2 0.00e+00 1.61e-05 2.62e-05 stard3nl ``` Note that in this file, the FDR.Adj.Pval, Score, and Adj.Score represent scores pertaining to covariate-link inference rather than the links themselves ### Class imbalance with covariates Enhlink provides an option called `-uniformSample` that can be used when analyzing datasets with class imbalance in covariates. This option helps to ensure that each bootstrap contains a more uniform distribution of cells from each class by taking into account the distribution of the covariates. For instance, if one cluster represents 80% of cells and the other two clusters only represent 10%, the `-uniformSample` option will be beneficial to avoid a fraction of bootstrap samples having too few samples of the underrepresented clusters. However, it is important to exercise caution when using this option, especially if one class is severely underrepresented. If a class represents less than `nbCellsTotal/nbCovariate`, the number of cells (`N`) sampled will be larger than the class size (`S`), resulting in a larger number of unique cells (`Nu`) being sampled than with a bootstrap procedure (66% on average). For instance, if `N` is twice the size of `S`, `Nu` will reach 90%, and 95% for three times the size of `S`, which could decrease the variability of the unique cells sampled.