Tutorial 4: Processing multi-omics data with Enhlink
In this tutorial, we will show how to take advantage of multi-omics single-cell data with Enhlink. The advantage of these datasets is to generate features in multiple omics at the single-cell resolution. With regards to linkage inference, Enhlink allows to to first infer linkages in each omic independently and then to intersect the results and identify the common links. these shared links are logically better true candidates since the correlations is observed in multiple modalities. For example with scATAC/RNA-seq dataset, we want to search enhancers coaccessible with promoters and also correalted with their respective gene-expression.
Case study using a multi-omic RNA/ATAC PBMC from a Healthy Donor dataset
As an example, we will use the same dataset that in the tutorial 3, which is a multi-omic RNA/ATAC of the example available datasets of 10XGenomics.
scRNA matrix
We first need to download the filtered_feature_bc_matrix the cell x gene expression MTX matrix and the corresponding indexes
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz
# decompressing the file
tar -xvf pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz
The matrix is in the Cell-Ranger mtx format see tutorial 3 which is a 3-columns numerical matrix with the feature indexes in the first column, the barcode index in the second columns and having the indexes starting at 1.
scATAC matrix
Unfortunately, the Cell-Ranger processing of this dataset does not provide the matrix equivalent for the scATAC-seq data. However, it does provide the fragments and the peak file. As we have seen in tutorial 3, these files are enough to create a cell x peak matrix:
# downloading fragments and peak files:
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_fragments.tsv.gz
# Creating a cell x peak matrix using the filtered barcodes of the scRNA-seq analysis using ATACMatUtils
ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
-xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
-ygi ./pbmc_unsorted_3k_atac_peaks.bed \
-out ./filtered_feature_atac_matrix \
-format cellRanger
Inferring linkage
To infer enhancers-promoter linkage, we need to download a GTF file indicating where the promoters are located in the reference genome (see tutorial 1).
wget http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/metadata/Homo_sapiens.GRCh38.99.TSS.2K.bed
Let’s also download the secondary analysis of cell Ranger which provides clustering and also linkage.
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_analysis.tar.gz
tar -xvf pbmc_unsorted_3k_analysis.tar.gz
As an example, we will use the Gene expression K-Means clustering with K=3 as reference clustering that identified the cell types and convert it to tsv (required by Enhlink). Again, we don’t necessarly need clusters but, in case of multiple cell-type, it is better to infer linkages for all the cell types independently.
First, we need to convert the CSV cluster file into a tab separated file (tsv). We can do it using the sed
command.
# On Linux
sed 's/,/\t/g' analysis/clustering/gex/kmeans_3_clusters/clusters.csv > analysis/clustering/gex/kmeans_3_clusters/clusters.tsv
# On OSX
sed -e $'s/,/\t/g' analysis/clustering/gex/kmeans_3_clusters/clusters.csv > analysis/clustering/gex/kmeans_3_clusters/clusters.tsv
We will also use a random list of 500 genes instead of processing all of the 60K genes of Homo_sapiens.GRCh38.99.TSS.2K.bed as a proof of concept.
# Please use gshuf instead of shuf (frome coreutils) on OSX
cut -f4 Homo_sapiens.GRCh38.99.TSS.2K.bed|sort|uniq|shuf -n 500 > 500_rand_genes.txt
-ygi
needs a bed file so we needs to convert the features.tsv.gz
into a bed file:
# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat ./filtered_feature_atac_matrix/features.tsv.gz|cut -f1|sed 's/_/\t/g' > ./filtered_feature_atac_matrix/features.bed
Now we are ready to use Enhlink for linkage inference.
ATAC-seq
First, we will infer enhancers-promoter coaccessibilities similarly than in tutorial 1.
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_4_1/ \
-threads 6 \
-targets 500_rand_genes.txt \
-format cellRanger \
-rmPeaksInTargets
here the option -rmPeaksInTargets
is optional and prevent to infer promoter-promoter links.
RNA-seq
We need to reformat the features.tsv.gz for the Gene expression matrix since we used the gene symbols as gene IDs in the promoter and genes files.
# the gene symbols are the second columns of the features file
# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat ./filtered_feature_bc_matrix/features.tsv.gz|cut -f2 > ./filtered_feature_bc_matrix/gene_symbols.tsv
We can now infers enhancers-gene links by identifying the ones with accessibility linked to gene expression.
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
-xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
-ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_4_2/ \
-threads 6 \
-isExpr \
-targets 500_rand_genes.txt \
-format cellRanger
the -isExpr
flag is used to define the second matrix (./filtered_feature_bc_matrix/matrix.mtx.gz
) as a float value matrix, rather than boolean matrix. In this case, Enhlink dichotomizes each values of a given gene feature using the mean of the non null elements of that given gene as threshold. If we would not have used this option, all the non-null zero would have been considered as 1 and the absent values as 0, which in the case of scRNA-seq could still have been enough to infer meaningfull links.
Intersect ATAC-seq and RNA-seq linkages
Now that we have linkages in both modalities, we can intersect them and see=identify the shared links. We can use enhtools
for that purpose:
enhtools -intersect \
-in ./output_tutorial_4_1/ \
-in2 ./output_tutorial_4_2/ \
-out output_tutorial_4_3/
The -intersect
command is designed to find pairs of bedpe files from -in
and -in2
with the same name and find the intersecting links for each pair. It will output the precision, recall and f1 scores in the stdout
and will also create a stats.tsv
file for each cluster. the stats.tsv files breakdown the scores for each gene. Let’s have a look to how many links we captured in the three experiments:
for fil in `ls ./output_tutorial_4_*/*.bedpe`
do
nb=`cat ${fil}|grep -v "#"|wc -l`
echo "${fil} nb links: ${nb}"
done
## Output:
# ATAC
./output_tutorial_4_1/1.bedpe nb links: 523
./output_tutorial_4_1/2.bedpe nb links: 585
./output_tutorial_4_1/3.bedpe nb links: 382
# RNA
./output_tutorial_4_2/1.bedpe nb links: 222
./output_tutorial_4_2/2.bedpe nb links: 159
./output_tutorial_4_2/3.bedpe nb links: 114
# shared
./output_tutorial_4_3/1.bedpe nb links: 35
./output_tutorial_4_3/2.bedpe nb links: 28
./output_tutorial_4_3/3.bedpe nb links: 11
As we can see, The highest number of links are captured with the ATAC data alone, while RNA+ATAC produce around a third of the number of links, depending of the set of genes used. This is not surprising since correlations between ATAC and RNA are noisier than enhancer / promoter accessibility correlation, since latency and other regulatory mechanisms will affect gene expression. In this case, the number of shared links between the two modalidies is only a fraction of the original numbers. These links indicate robust enhancer-promoter candidates since both promoter accessibility and gene expression are correlated to enhancer accessibility.
increasing the sensitivity of ATAC+RNA linkage
To capture more RNA+ATAC links, we can try to increase the sensitivity of Enhlink by either increasing the pvalue-cutoff threshold (which is already at 0.05) or by increasing the expected number of explanatory features per models or the number of trees (the -n_boot
option) in the ensemble model (see Tutorial 5). By doing so, we will decrease the specificity (.e. more noise) but increase the sensitivity (e.g. number of true positive) of the results. Since we wanto intersect with ATAC links, it might be OK to do so. We will describe how to change the hyperparameters and adapt the accuracy of Enhlink in the next section.
Intersecting with 10XGenomics linkage
enhtools -intersect
looks for perfect matches between pairs of files of two folders. Alternatively, we could have used enhtools -intersect2
that look for perfect matches between two files or enhtools -intersect3
that looks for intersection based on imperfect matches. In the latter case enhtools
looks for intersection between each of the two regions of the bedpe files without necessary a perfect match. Let’s use it to intersect our results with the Cell-RAnger linkage analysis:
mkdir ./output_tutorial_4_4/
enhtools -intersect3 -in ./output_tutorial_4_3/1.bedpe -in2 ./analysis/feature_linkage/feature_linkage.bedpe -out ./output_tutorial_4_4/1.bedpe
enhtools -intersect3 -in ./output_tutorial_4_3/2.bedpe -in2 ./analysis/feature_linkage/feature_linkage.bedpe -out ./output_tutorial_4_4/2.bedpe
enhtools -intersect3 -in ./output_tutorial_4_3/3.bedpe -in2 ./analysis/feature_linkage/feature_linkage.bedpe -out ./output_tutorial_4_4/3.bedpe
Conclusion
Overall, ATAC-seq alone is enough to find robust co-accessibility links that have better association than random. To further refine these links, it is useful to infer links from a second omics and identify the fraction of shared linkage in both omics. However, it is good to keep in mind that inferring links from ATAC+RNA only might be noisier than ATAC alone.