Tutorial 5: Enhlink advanced functionalities
in this tutorial, we will describe some of the advanced functionalities of Enhlink that can be used to regularize the model and adjust the sensitivity / specificity, perform a distal analysis, and adjust computational speed.
Using an external feature matrix with Enhlink.
We saw in the previous tutorials, how to create an cell x peak matrix and how to directly use CellRanger to obtain an input matrix for Enhlink. Enhlink will iterate through the target regions of the file defined in the -promoter
option, create a target vector for each of the promoter, and use the surrounding features as explanatory variables. however, Enhlink can accept a second feature matrix that can be used to directly extract the target vectors.
For example let’s reprocess the example of the tutorial 4 and create a cell x feature matrix that will be used as a feature matrix instead.
# 1rst Example from the tutorial 4
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_4_1/ \
-threads 6 \
-targets 500_rand_genes.txt \
-format cellRanger \
-rmPeaksInTargets
Creation of a cell x feature matrix
We will use ATACMatUtils
from the ATACDemultiplex package (see tutorial 3) to create a cell x promoter matrix for this dataset:
ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
-xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
-ygi ./Homo_sapiens.GRCh38.99.TSS.2K.bed \
-use_symbol \
-norm_type count \
-out ./filtered_promoter_matrix \
-format cellRanger
Here, the -use_norm count
defines the values to be simply the read count at each promoter (see tutorial 3).
We can now use this second matrix as feature matrix with Enhlink:
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-mat2 ./filtered_promoter_matrix/matrix.mtx.gz \
-xgi2 ./filtered_promoter_matrix//barcodes.tsv.gz \
-ygi2 ./filtered_promoter_matrix/features.tsv.gz \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_5_1/ \
-threads 6 \
-targets 500_rand_genes.txt \
-format cellRanger \
-rmPeaksInTargets
In this command, the value of the second matrix doesn’t matter since Enhlink dichotomize between non null (>=1) vs null (==0) values. To take into account the numerical values with Enhlink, we can use the -isExpr
option which will dichotomize the values based on the mean of the non-null elements. For example let’s try to create an promoter activity score matrix using the logRPM normalisation and use these scores as numerical values:
ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
-xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
-ygi ./Homo_sapiens.GRCh38.99.TSS.2K.bed \
-use_symbol \
-norm_type logrpm \
-out ./filtered_promoter_matrix_logrpm \
-format cellRanger
Now, let’s process the values of this matrix a numerical values with the -isExpr
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-mat2 ./filtered_promoter_matrix_logrpm/matrix.mtx.gz \
-xgi2 ./filtered_promoter_matrix_logrpm/barcodes.tsv.gz \
-ygi2 ./filtered_promoter_matrix_logrpm/features.tsv.gz \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_5_2/ \
-threads 6 \
-isExpr \
-targets 500_rand_genes.txt \
-format cellRanger \
-rmPeaksInTargets
As a final note on using an additional matrix, since a gene can have multiple promoter region defined in the promoter file and Enhlink doesn’t know which of these promoter regions refers to the value of a given gene defined in the feature matrix, Enhlink will indiscriminately assigns the same link for all the promoter region of the gene.
Regularization and adjusting sensitivity / specificity
In the second example of the tutorial 4, we saw that less linked were inferred from te ATAC+RNA matrices compared to the ATAC matrix alone (which was expected). We can try to increase these numbers by increasing the sensitivity of Enhlonk at the expense of lowering the specificity. One way to do that is to increase the maximal number of explanatory features per models (-max_features
which is 4 by default). We also need to increase the maximum depth of each tree (-depth
which is 2 by default) since these two parameters are linked: at a depth d, there would be no more than 2^d features.
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
-xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
-ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_5_2/ \
-threads 6 \
-depth 3 \
-max_features 8 \
-isExpr \
-targets 500_rand_genes.txt \
-format cellRanger
# Number of lines in each output file:
wc -l output_tutorial_4_2/*.bedpe
# output
251 output_tutorial_4_2/1.bedpe
153 output_tutorial_4_2/2.bedpe
123 output_tutorial_4_2/3.bedpe
527 total
wc -l output_tutorial_5_2/*.bedpe
# output
345 output_tutorial_5_2/1.bedpe
258 output_tutorial_5_2/2.bedpe
167 output_tutorial_5_2/3.bedpe
770 total
Even though we doubled -max_features
, the number of links captured didn’t double since each link inferred by Enhlink has a p-value inferred from the neighborhood and this p-value needs to pass the cutoff threshold to be significant.
We can also increase the forest size (-n_boot
which is 100 per default) to be able to detect smaller associations at the expense of more computations:
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
-xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
-ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_5_3/ \
-n_boot 200 \
-threads 6 \
-depth 3 \
-max_features 8 \
-isExpr \
-targets 500_rand_genes.txt \
-format cellRanger
$ wc -l output_tutorial_5_3/*.bedpe
510 output_tutorial_5_3/1.bedpe
428 output_tutorial_5_3/2.bedpe
296 output_tutorial_5_3/3.bedpe
1234 total
However, we need to insure that the number of background features is large enough to properly compute the p-values. The background features are defined as the set of neighbor features that goes into the model. We assume that the majority of them have do not have an association with the target region, which allows to infer a p-value using their score distribution. If this number is too low (defined by -min_matsize
which is set as 100 by default), Enhlink will create artificial features from the existing ones. Thus, increasing -min_matsize
will limit the amount of false positives at the expense of more computations. Another solution to increase the number of background features is to increase the neighborhood size (-neighborhood
set as 2.5e5 as default) of each gene, defined in number of base pairs. however, the latter case will produce links between more distal regions.
# More precise Enhlink analysis compared to the default parameters
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
-xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
-ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_5_4/ \
-n_boot 400 \
-min_matsize 200 \
-threads 6 \
-depth 3 \
-max_features 8 \
-isExpr \
-targets 500_rand_genes.txt \
-format cellRanger
# This analysis took 186 seconds to complete (31 seconds with the defaults parameters)
wc -l output_tutorial_5_4/*.bedpe
487 output_tutorial_5_4/1.bedpe
384 output_tutorial_5_4/2.bedpe
285 output_tutorial_5_4/3.bedpe
1156 total
Accuracy on the 2nd order analysis
Enhlink can optionally perform a second order analysis ( -secondOrder
) to identify the covariates-specific links. This process is controlled with the same parameters except -secondOrderMaxFeat
replacing -min_matsize
and set as 2 as default.
Improving the speed of the method
Processing 40,000 genes, for multiple cell types, for a large number of cells together with the 2nd order analysis can be quite a long process. The speed of Enhlink can be improved using the -downsample
option, controlling the number of cells randomly sampled for each tree, and the -maxFeatType
option controlling the maximum of features to be considered. By default, all the features and all the cells are used. However, we showed in the paper that as low as 1000 cells can produce good performances. -maxFeatType
can take an integer value (number of features), a float (proportion of the total number of features), or can be set up as “log”, “sqrt”, or “all” (default). We recommand to not use less than 66% of the features, at least with the default forest size (100). Using a smaller subset of features can be usefull for very large forest size, similarly to the classic random forest approach which randomly picks log(nb. features) at each tree.
# More precise Enhlink analysis with cell (1000 per tree) and feature (66% per tree) downsample.
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-mat2 ./filtered_feature_bc_matrix/matrix.mtx.gz \
-xgi2 ./filtered_feature_bc_matrix/barcodes.tsv.gz \
-ygi2 ./filtered_feature_bc_matrix/gene_symbols.tsv \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_5_5/ \
-n_boot 400 \
-min_matsize 200 \
-maxFeatType 0.66 \
-downsample 1000 \
-threads 6 \
-depth 3 \
-max_features 8 \
-isExpr \
-targets 500_rand_genes.txt \
-format cellRanger
# This analysis took 103 seconds to complete
wc -l output_tutorial_5_4/*.bedpe
487 output_tutorial_5_4/1.bedpe
384 output_tutorial_5_4/2.bedpe
285 output_tutorial_5_4/3.bedpe
1156 total
Conclusion
Overall, -max_features
in an intuitive parameter that controls Enhlink sensitivity / specificity and can be seen as the expected number of associations for a given target. -n_boot
is less intuitive but can be seen as the number of trials Enhlink will perform. if enough trials are performed, Enhlink will have enough visibility to detect lower associations. However, due to the high-dimensionality of the data, a lot of features will present some degrees of associations without any biological implication. Thus, it is still important to keep some level of regulation and / or to merge multiple omics together. In that regards, -min_matsize
can always be increased to insure enough background features but will slow down the process for a large number of targer regions. Also, the expected accuracy can be estimated using simulated data (See tutorial 2). Another regularisation mechanism is to randomly subsample the number of features / cells at each trial with -downsample
and -maxFeatType
. This will also increase the speed of the method.
Distal interactions inference
We saw that -neighborhood
controls the neighborhood size in base pairs of each target region. If we put neighborhood to 0, Enhlink will include all the features, except those within the target region in the model. In this case, -min_matsize
becomes useless since the number of features is large enough in theory. Also, it makes sense to increase the sensitivity of Enhlink by increasing -max_features
, -depth
, and -n_boot
. This will also drastically increase the computational time of the model, so it may be worthwhile to only perform the distal analysis for a handful of genes.
First, let’s select 5 genes differentially expressed from cluster 1 using the Cell Ranger analysis:
cat ./analysis/clustering/gex/kmeans_3_clusters/differential_expression.csv|sed 's/,/\t/g'|cut -f2,3,5|grep -v "Feature"|sort -n -k3|head -5|cut -f1 > top_5_genes_cluster1.txt
Now, let’s perform a distal analysis on these 5 genes. We will increase -n_boot
to 250 and -max_features
to 8 to get more sensitivity and we will limit to the cluster 1 subsample.
First let’s isolate the barcodes from cluster 1 only.
cat analysis/clustering/gex/kmeans_3_clusters/clusters.tsv|grep -E "(1$|Cluster$)" > cluster1_subset.tsv
Now, let’s launch the analysis. Notice the -xgi_subset
option.
# Distal analysis
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-mat2 ./filtered_promoter_matrix/matrix.mtx.gz \
-xgi2 ./filtered_promoter_matrix//barcodes.tsv.gz \
-ygi2 ./filtered_promoter_matrix/features.tsv.gz \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_5_6/ \
-threads 6 \
-n_boot 250 \
-depth 3 \
-xgi_subset cluster1_subset.tsv \
-max_features 8 \
-neighborhood 0 \
-targets top_5_genes_cluster1.txt \
-format cellRanger \
-rmPeaksInTargets
The analysis took 411 seconds to finish with my computer and identified 55 links. Now, it would be interesting to find enhancers shared between multiple genes. So, let’s see if we can find some:
cat output_tutorial_5_6/1.bedpe|grep -v "#"|cut -f1,2,3,10|sort|uniq|cut -f1,2,3|sort|uniq -c|sort -k1|tail
# The first grep remove the comment lines
# the first cut identifies enhancer region with their geneID
# The sort|uniq sort them and forces to have only one combination of enhancer / geneID
# The second cut, extract the enhancer regions only
# We then sort these regions one more time and count the number of occurences with uniq -c
# We then sort by number of occurences and plot the tail of the sort, which corresponds to the highest occurences
## Output
1 chr7 98281064 98281964
1 chr8 100309714 100310630
1 chr8 140464272 140465151
2 chr1 24642630 24643558
2 chr19 2061075 2061954
2 chr19 42283565 42284564
2 chr19 45584194 45585165
2 chr19 56140532 56141465
2 chr19 680242 681195
3 chr5 179698324 179699220
As we can see, we are already able to identify 7 enhancers shared by more than one gene of the top 5 signature genes of Cluster 1. Now, we have to be careful and examine these regions more in details to see if they are not experimental artifacts which could be widely accessibles and reflect for example the number of reads. These examinations can be done by visualizing the bigwig, the bedpe, and the peak files. We will cover that aspect in the next section.
Discard overaccessible OCRs
As described in the original publication, it may be necessary to remove overacessible OCRs from this type of analysis because thse regions may reflect some aspects of the experimental process such as batch ID or read depth. To do so, we can for example sort the features by mean and remove the top 5 percents. We can easily do it with Python and plot the histogram of the mean accessibility.
from scipy.sparse import load_npz
import pandas as pd
import numpy as np
import pylab as plt
import seaborn as sns
npz = "./filtered_feature_atac_matrix/matrix.mtx.gz"
ygi = "./filtered_feature_atac_matrix/features.bed"
outfig = "./enh.mean.dist.pdf"
out = "./peak.subset.mean.ygi"
M = load_npz(npz)
M = M.tocsr().T
M.data = M.data.astype('bool')
dfy = pd.read_csv(ygi, sep="\t", header=None)
dfy.columns = ['chr', 'start', 'stop']
dfy['peak'] = dfy['chr'] + ":" + dfy['start'].astype('str') + "-" + dfy['stop'].astype('str')
mean = np.asarray(M.mean(axis=0)).reshape(-1)
meani = np.flip(np.argsort(mean))
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
# Plot histogram to see the accessibility mean distribution
sns.histplot(
x=mean,
element='step',
common_norm=False,
cumulative=False,
log_scale=False,
multiple='layer',
stat='probability',
palette="viridis",
ax=ax,
)
ax.grid(False)
ax.set_facecolor("white")
plt.show()
fig.savefig(outfig)
# Define the cutoff as 5%
cutoff = int(float(len(meani)) * 0.05)
# Remove the top 5 percents of most accessible features
dfy.iloc[meani[cutoff:], [0,1,2]].to_csv(out, sep="\t", header=None, index=None)
We can now use ./peak.subset.mean.ygi
to ignore the top 5 percents of the most accessible features
# Distal analysis
enhlink -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-ygi_subset ./filtered_feature_atac_matrix/features.bed \
-mat2 ./filtered_promoter_matrix/matrix.mtx.gz \
-xgi2 ./filtered_promoter_matrix//barcodes.tsv.gz \
-ygi2 ./filtered_promoter_matrix/features.tsv.gz \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_5_7/ \
-threads 6 \
-n_boot 250 \
-depth 3 \
-xgi_subset cluster1_subset.tsv \
-max_features 8 \
-neighborhood 0 \
-targets top_5_genes_cluster1.txt \
-format cellRanger \
-rmPeaksInTargets
using Enhgrid to distribute computation, test the robustness of results, and improve speed
As we saw in tutorial 2 enhgrid
allows to process arrays of hyperparameters. This is the same as launching enhlink
for each combinations of parameters, however, enhlink
would need to load the input matrices and indexes each time which can represent a tremendous amount of computational time. Also, enhgrid
has the -splitTargetList
option whicj allows to distribute the gene list on multiple processes and optimize the computational time. For example, if we have 12 CPUs, we could distribute the gene list on 3 processes with 4 threads each. This would be faster than running one process with 12 threads.
Let’s apply enhgrid
to the example 1 of tutorial 4:
enhgrid -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_5_7/ \
-threads 3 \
-processes 2 \
-targets 500_rand_genes.txt \
-format cellRanger \
-splitTargetList \
-rmPeaksInTargets
We can also use enhgrid to see how robust are Enhlink results by excecuting multiple runs of the same analysis with the -repetition
option. Let’s apply this to the barcodes from cluster 1
enhgrid -mat ./filtered_feature_atac_matrix/matrix.mtx.gz \
-xgi ./filtered_feature_atac_matrix/barcodes.tsv.gz \
-ygi ./filtered_feature_atac_matrix/features.bed \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-clusters analysis/clustering/gex/kmeans_3_clusters/clusters.tsv \
-out ./output_tutorial_5_8/ \
-threads 3 \
-xgi_subset cluster1_subset.tsv \
-repetition 6 \
-processes 3 \
-targets 500_rand_genes.txt \
-format cellRanger \
-splitTargetList \
-rmPeaksInTargets
# The output folder contains the following files:
./output_tutorial_5_8
├── rep0.1.bedpe
├── rep1.1.bedpe
├── rep2.1.bedpe
├── rep3.1.bedpe
├── rep4.1.bedpe
├── rep5.1.bedpe
└── timestamp.log
Now, let see how much links from the first repetitions are retrieved in the other 5 repetitions:
reference="./output_tutorial_5_8/rep0.1.bedpe"
for bedpe in `ls ./output_tutorial_5_8/*.bedpe`
do
nbLinks=`enhtools -intersect3 -in ${bedpe} -in2 ${reference} -stdout|wc -l`
echo "#### file: ${bedpe} #### number of links shared with rep0.1.bedpe: ${nbLinks}"
done
### Output
#### file: ./output_tutorial_5_8/rep0.1.bedpe #### number of links shared with rep0.1.bedpe: 522
#### file: ./output_tutorial_5_8/rep1.1.bedpe #### number of links shared with rep0.1.bedpe: 458
#### file: ./output_tutorial_5_8/rep2.1.bedpe #### number of links shared with rep0.1.bedpe: 436
#### file: ./output_tutorial_5_8/rep3.1.bedpe #### number of links shared with rep0.1.bedpe: 452
#### file: ./output_tutorial_5_8/rep4.1.bedpe #### number of links shared with rep0.1.bedpe: 452
#### file: ./output_tutorial_5_8/rep5.1.bedpe #### number of links shared with rep0.1.bedpe: 458
Also not formal metrics, these numbers can be derived into similarity scores, such as the Rand score.
Filter links based on TAD domains
Topologically associated domains (TAD) define genomic regions whose DNA sequences preferentially contact each other. Thus, such regions can be provided to refine Enlink’s results and exclude links which don’t fall within the boundary of one of these regions.
Using
enhtools
, it is easy to filter a bedpe file with the-filter
option by providing a reference BED file containing genomic regions:
enhtools -filter -in ./output_tutorial_5_1/1.bedpe -bed TAD.bed -out ./output_tutorial_5_1/1.filtered.bedpe
Class imbalance with covariates
To address class imbalance in covariate analysis (when the option -covariates
is used), consider using the -uniformSample
option in Enhlink. This option ensures a more uniform distribution of the covariates at each bootstrap by iteratively adding cells according to their distribution.
For instance, if one cluster represents 80% of cells and the other two clusters only represent 10%, the -uniformSample
option will be beneficial to avoid a fraction of bootstrap samples having too few samples of the underrepresented clusters.
However, it is important to exercise caution when using this option, especially if one class is severely underrepresented. If a class represents less than nbCellsTotal/nbCovariate
, the number of cells (N
) sampled will be larger than the class size (S
), resulting in a larger number of unique cells (Nu
) being sampled than with a bootstrap procedure (66% on average). For instance, if N
is twice the size of S
, Nu
will reach 90%, and 95% for three times the size of S
, which could decrease the variability of the unique cells sampled.