Tutorial 4: Processing multi-omics data with Enhlink

In this tutorial, we will show how to take advantage of multi-omics single-cell data with Enhlink. The advantage of these datasets is to generate features in multiple omics at the single-cell resolution. With regards to linkage inference, Enhlink allows to to first infer linkages in each omic independently and then to intersect the results and identify the common links. these shared links are logically better true candidates since the correlations is observed in multiple modalities. For example with scATAC/RNA-seq dataset, we can search enhancers coaccessible with promoters and also correalted with their respective gene-expression.

Case study using a multi-omic RNA/ATAC PBMC from a Healthy Donor dataset

As an example, we will use the same dataset that in the tutorial 3, which is a multi-omic RNA/ATAC of the example available datasets of 10XGenomics.

scRNA matrix

We first need to download the filtered_feature_bc_matrix the cell x gene expression MTX matrix and the corresponding indexes

wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz

# decompressing the file
tar -xvf pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz

The matrix is in the Cell-Ranger mtx format see tutorial 3 which is a 3-columns numerical matrix with the feature indexes in the first column, the barcode index in the second columns and having the indexes starting at 1.

scATAC matrix

Unfortunately, the Cell-Ranger processing of this dataset does not provide the matrix equivalent for the scATAC-seq data. However, it does provide the fragments and the peak file. As we have seen in tutorial 3, these files are enough to create a cell x peak matrix:

# downloading fragments and peak files:
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_fragments.tsv.gz

# Creating a cell x peak matrix using the filtered barcodes of the scRNA-seq analysis using ATACMatUtils

ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
    -xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi ./pbmc_unsorted_3k_atac_peaks.bed \
    -out ./filtered_feature_atac_matrix \
    -format cellRanger

Inferring linkage

To infer enhancers-promoter linkage, we need to download a GTF file indicating where the promoters are located in the reference genome (see tutorial 1).

wget http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/metadata/Homo_sapiens.GRCh38.99.TSS.2K.bed

Let’s also download the secondary analysis of cell Ranger which provides clustering and also linkage.

wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_analysis.tar.gz
tar -xvf pbmc_unsorted_3k_analysis.tar.gz

As an example, we will use the Gene expression K-Means clustering with K=3 as reference clustering that identified the cell types and convert it to tsv (required by Enhlink). Again, we don’t necessarly need clusters but, in case of multiple cell-type, it is better to infer linkages for all the cell types independently.

First, we need to convert the CSV cluster file into a tab separated file (tsv). We can do it using the sed command.

# On Linux
sed 's/,/\t/g' analysis/clustering/gex/kmeans_3_clusters/clusters.csv > analysis/clustering/gex/kmeans_3_clusters/clusters.tsv

# On OSX
sed -e $'s/,/\t/g' analysis/clustering/gex/kmeans_3_clusters/clusters.csv > analysis/clustering/gex/kmeans_3_clusters/clusters.tsv

We will also use a random list of 500 genes instead of processing all of the 60K genes of Homo_sapiens.GRCh38.99.TSS.2K.bed as a proof of concept.

# Please use gshuf instead of shuf (frome coreutils) on OSX
cut -f4 Homo_sapiens.GRCh38.99.TSS.2K.bed|sort|uniq|shuf -n 500 > 500_rand_genes.txt

-ygi needs a bed file so we needs to convert the features.tsv.gz into a bed file:

# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat ./filtered_feature_atac_matrix/features.tsv.gz|cut -f1|sed 's/_/\t/g' > ./filtered_feature_atac_matrix/features.bed

Now we are ready to use Enhlink for linkage inference.

ATAC-seq

First, we will infer enhancers-promoter coaccessibilities similarly than in tutorial 1.

enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_4_1/ \
    -threads 6 \
    -targets 500_rand_genes.txt \
    -format cellRanger \
    -rmPeaksInTargets

here the option -rmPeaksInTargets is optional and prevent to infer promoter-promoter links.

RNA-seq

We need to reformat the features.tsv.gz for the Gene expression matrix since we used the gene symbols as gene IDs in the promoter and genes files.

# the gene symbols are the second columns of the features file

# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat ./filtered_feature_bc_matrix/features.tsv.gz|cut -f2 > ./filtered_feature_bc_matrix/gene_symbols.tsv

We can now infers enhancers-gene links by identifying the ones with accessibility correlated with gene expression.

enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
    -xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
    -ygi ./filtered_feature_atac_matrix/features.bed \
    -mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
    -xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
    -out ./output_tutorial_4_2/ \
    -threads 6 \
    -isExpr \
    -targets 500_rand_genes.txt \
    -format cellRanger

Note that the -isExpr flag is used to define the second matrix (./filtered_feature_bc_matrix/matrix.mtx.gz) as a float matrix, rather than a boolean matrix. When using this option, Enhlink dichotomizes each values of a given feature using the mean of the non null elements of that given gene as threshold. If we would not have used this option, all the non-null zero would have been considered as 1 and the absent values as 0, which in the case of scRNA-seq could still have been enough to infer meaningfull links.

Intersect ATAC-seq and RNA-seq linkages

Now that we have linkages in both modalities, we can intersect them and identify the shared links. We can use enhtools for that purpose:

enhtools -intersect \
   -in ./output_tutorial_4_1/ \
   -in2 ./output_tutorial_4_2/ \
   -out output_tutorial_4_3/

The -intersect command is designed to first find pairs of bedpe files from the -in and -in2 folder having a matching file name and output the intersecting links for each pair. It will output the precision, recall and f1 scores in the stdout and will also create a stats.tsv file for each cluster. the stats.tsv files breakdown the scores for each gene. Let’s have a look to how many links we captured in the three experiments:

for fil in `ls ./output_tutorial_4_*/*.bedpe`
do
	nb=`cat ${fil}|grep -v "#"|wc -l`
	echo "${fil} nb links: ${nb}"
done

## Output:
            # ATAC
./output_tutorial_4_1/1.bedpe nb links: 523
./output_tutorial_4_1/2.bedpe nb links: 585
./output_tutorial_4_1/3.bedpe nb links: 382
            # RNA
./output_tutorial_4_2/1.bedpe nb links: 222
./output_tutorial_4_2/2.bedpe nb links: 159
./output_tutorial_4_2/3.bedpe nb links: 114
            # shared
./output_tutorial_4_3/1.bedpe nb links: 35
./output_tutorial_4_3/2.bedpe nb links: 28
./output_tutorial_4_3/3.bedpe nb links: 11

As we can see, The highest number of links are captured with the ATAC data alone, while RNA+ATAC produce around a third of the number of links, depending of the set of genes used. This is not surprising since correlations between ATAC and RNA are noisier than enhancer / promoter accessibility correlation, since latency and other regulatory mechanisms will affect gene expression. In this case, the number of shared links between the two modalidies is only a fraction of the original numbers. These links indicate robust enhancer-promoter candidates since both promoter accessibility and gene expression are correlated to enhancer accessibility.

increasing the sensitivity of ATAC+RNA linkage

To capture more RNA+ATAC links, we can try to increase the sensitivity of Enhlink by either increasing the pvalue-cutoff threshold (which is already at 0.05) or by increasing the expected number of explanatory features per models or the number of trees (the -n_boot option) in the ensemble model (see Tutorial 5). By doing so, we will decrease the specificity (.e. more noise) but increase the sensitivity (e.g. number of true positive) of the results. Since we wanto intersect with ATAC links, it might be OK to do so. We will describe how to change the hyperparameters and adapt the accuracy of Enhlink in the next section.

Intersecting with 10XGenomics linkage

enhtools -intersect looks for perfect matches between pairs of files of two folders. Alternatively, we could have used enhtools -intersect2 that look for perfect matches between two files or enhtools -intersect3 that looks for intersection based on imperfect matches. In the latter case enhtools looks for intersection between each of the two regions of the bedpe files without necessary a perfect match. Let’s use it to intersect our results with the Cell-RAnger linkage analysis:

mkdir ./output_tutorial_4_4/

enhtools -intersect3 -in ./output_tutorial_4_3/1.bedpe -in2 ./analysis/feature_linkage/feature_linkage.bedpe -out ./output_tutorial_4_4/1.bedpe
enhtools -intersect3 -in ./output_tutorial_4_3/2.bedpe -in2 ./analysis/feature_linkage/feature_linkage.bedpe -out ./output_tutorial_4_4/2.bedpe
enhtools -intersect3 -in ./output_tutorial_4_3/3.bedpe -in2 ./analysis/feature_linkage/feature_linkage.bedpe -out ./output_tutorial_4_4/3.bedpe

Conclusion

Overall, ATAC-seq alone is enough to find robust co-accessibility links that have better association than random. To further refine these links, it is useful to infer links from a second omics and identify the fraction of shared linkage in both omics. However, it is good to keep in mind that inferring links from ATAC+RNA only might be noisier than ATAC alone.