Introduction and installation
Enhlink is a novel computational approach that utilizes single-cell signals to infer linkages between regulatory sequences such as enhancers and promoters. To accomplish this, Enhlink employs an ensemble strategy that integrates cell covariates and produces robust p-values for any link and covariate-specific link.
Designed to leverage multi-omic datasets, Enhlink incorporates complementary omic measurements such as gene expression to infer enhancer-promoter links. By modeling the surrounding enhancers and covariates as predictive features of promoter accessibility (or gene expression), Enhlink utilizes a modified Information Gain score, a random forest framework, and a bootstrap procedure to estimate significant features associated with a given promoter. Additionally, Enhlink performs a second-order analysis, if desired, that identifies linkages specific to a given covariate.
Enhlink is not limited to proximal enhancers but rather can infer distal enhancer-gene linkages. Finally, Enhlink incorporates a simulation workflow designed using experimentally validated enhancer-promoter signals, which improves prediction accuracy.
Enhlink softwares
Enhlink is an analytical framework developed in Go (https://go.dev/), and compiled into three executables:
enhlink
enhgrid
enhtools
The command line manual and arguments of each executable can be accessed using the -h flag, (e.g. enhlink -h). enhlink is the main executable that launches the Enhlink pipeline, while enhgrid allows distributing the processing of a list of genes into multiple CPUs and launch a range of input values for all the hyperparameters accepting a numerical value. enhgrid is useful for, for example, automatizing a grid-search approach by trying a combination of multiple hyperparameters or for testing different noise levels. enhtools intersect results from multiple runs and output either the common or unique links of a particular run. It also computes the accuracy between two runs (F-score, Precision, Recall). The source code can be easily compiled using a go compiler and the Unix/linux executables can be downloaded here: (https://gitlab.com/Grouumf/enhlinktools)
Installation
Using precompiled executables
we precompiled the three executables for enhlink, enhgrid, and enhtools and made them available in an associated figshare project. They are compatible with either Linux x86_64 or OSX arm64 platforms.
Using go installation mechanisms
Install a a golang compiler (if not existing)
Download binaries: https://golang.org/dl/
Configure $GOPATH/$GOBIN
#In .bashrc or .zshrc
export GOROOT=$HOME/go # or wherever is you go folder
export GOBIN=$HOME/go/local/bin # or wherever is your local bin folder for go exectuable
export GOPATH=$HOME/go/code/:$HOME/code
PATH=$GOPATH:$GOROOT:$PATH
PATH=$HOME/go/bin/:$GOBIN:$PATH
source your init file
source ~/.bashrc
Install the packages
# go version >= 1.18. This might work too with older go versions
## Install enhlink
go install gitlab.com/Grouumf/enhlinktools/enhlink@latest
## Install enhgrid
go install gitlab.com/Grouumf/enhlinktools/enhgrid@latest
## Install enhtools
go install gitlab.com/Grouumf/enhlinktools/enhtools@latest
Installation (from repository)
git clone https://gitlab.com/Grouumf/enhlinktools.git
cd ./enhlinktools/enhlink
go install .
cd ../enhgrid/
go install .
cd ../enhtools/
go install .
Alternatively, one can use go build
to direcly build the executable into the same folder
git clone https://gitlab.com/Grouumf/enhlinktools.git
cd ./enhlinktools/enhlink
# this command compile the executable into the local directory
go build .
cd ../enhgrid/
go build .
cd ../enhtools/
go build .
enhlink
Simple processing from cell-ranger ATAC processed files
In its simpliest form, Enhlink requires a sparse matrix (-mat) with de cell barcode index (-xgi), the peak index (-ygi) as a bed file, and a fourth-columns bed file (-promoter) indicating the promoter (or target) regions with the promoter name as a 4th columns. The command below will infer links for all the regions defined in the promoter file and using the +/-250kb surrounding peaks of each target region as features.
enhlink -mat matrix.mtx \
-xgi barcodes.tsv \
-ygi peaks.bed \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-out output_example_1/ \
-threads 6 \
-format cellRanger
Please have a look to the different tutorial sections to see more examples.
Tutorial 1 describes the processing of one (ATAC) cell x peak matrix
Tutorial 2 describes how to use simulation to estimate the expected accuracy at a given target region
Tutorial 3 describes how to create a sparse cell x peak matrix compatible with Enhlink
Tutorial 4 describes how to process a multi-omics from one of the single-cell RNA/ATAC-seq datasets of 10X Genomics
Tutorial 5 describes Enhlink’s advance functionalities, such as distal interaction and hyperparameters selection
Tutorial 6 describes how to visualize the links and how to create bigwig files.
Arguments
Here are enhlink’s arguments.
#################### Module to Link enhancers to promoters ########################
enhlink inferes enhancer / promoter co-accessibilities (links) using random forests of ID3 trees and Information gain.
<<<<<<<<<<<<<<<<<<<< WARNING >>>>>>>>>>>>>>>>>>>>
As of March 20 2024, Enhlink v0.21.0,
we Changed some of Enhlink's parameters names
for clarity and consistency purpose.
Below are the list of changes:
(version < 0.21.0) -> (version >= 0.21.0)
cluster -> clusters
promoter -> gtf
genes -> targets
gene -> target
isGeneExpr -> isExpr
rmPeaksInPromoter -> rmPeaksInTargets
<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>
enhlink main inputs are:
a) a (cell x peak) sparse matrix,
b) a 4-columns promoter TSV file <chrID, start, stop, geneID> ,
c) an optional (cell x gene) sparse matrix if the gene activity cannot be inferred from the peaks of the the first matrix and the promoter regions. This matrix can either be interpreted as boolean (e.g. the promoter of a given gene is either accessible or not for a given cell), or as a float matrix using the -isExpr option, which reflects the gene expression (for example in a context of a scATAC-seq/RNA-seq multi-omic study)
In addition, covariates (cell x covariates) and clusters (cell x clusterID) TSV file can be provided.
Finally, multiple optional parameters can be set to fine tune the speed, accuracies, and range of the models.
USAGE:
enhlink -mat <file> -xgi <file> -ygi <file> -promoter <file> -out <path> -tag <string> \
[-mat2 <file> -xgi2 <file> -ygi2 <file>] \ # IF PASSING A GENE MATRIX FILE
[-target <string>] \ # IF FOCUSING ON ONE TARGET
[-targets <file>] \ # IF FOCUSING ON A LIST OF TARGETS
[-isExpr] # IF THE GENE MATRIX IS A GENE EXPRESSION MATRIX
[-covariates <file> -xgi_subset <file> -ygi_subset <file> -cluster <file>] \ # OPTIONAL
[-downsample <int> -threads <int> -n_boot <int> -depth <int> -max_features <int> -threshold <float> min_matsize <int> -min_leafsize <int> -merging_cutoff <int> -neighborhood <int> -format {coo, mtx} -keep_sparse -maxFeatType <string/int/float> -rmPeaksInPromoter -linkType {"all", "positive", "negative"} -secondOrder -ignoreEnhancerWeight --uniformSampling] \ # OPTIONAL
-clusters value
Clusters file
-covariates value
optional covariate dense TSV matrix (row:cell x col:covariates). a column header with the name of the covariates must be the first line and the cell IDs must be the first column. Only categorical covariates can be used for now.
-depth int
Max tree depth. If negative, no max tree depth will be used (To use the maximum depth, set to: -1) (default 2)
-downsample int
Downsample the number of samples used for each bootstrap if > 0 and if > len(cluster)
-format string
Input matrix format (default "coo")
-gtf value
BED File containing the genomic coordinates (three first columns) of each target and the target ID (e.g. gene symbol)
-ignoreEnhancerWeight
Ignore Enhancers weight (the ratio of accessibility) in the computation of the modified Information Gain. Useful to not overweight widely accessible enhancers when performing distal inference with neighborhood very large (or == 0)
-isExpr
Use a expression float matrix for mat2 instead of a boolean matrix
-keep_sparse
Keep the main ColMat matrix sparse in the memory. Usefull for memory reason if neighboorhood is very wide (or ==0), but slow down computation (default: false)
-lambda1 float
Lambda parameter of a poisson distribution, that controls the amount of dropouts of the simulated variables (default 1.8)
-lambda2 float
Lambda parameter of a poisson distribution, that controls the amount of false positives of the simulated variables (default 0.05)
-mat value
Input matrix file
-mat2 value
Input expression matrix file
-maxFeatType string
Maximum of features to be considered for a given tree. accepted value: {"all", "sqrt", "log"}. can be an Int (number of features) or a float between 0 and 1 (fraction of features) (default "all")
-max_features int
Maximum number of explanatory features per bootstrap model. If set, only K features are kept for each model based on the coefficient, even if the number of non null coefs is larger (default 4)
-merging_cutoff int
Cutoff (bp) for merging close target peak (default: 1500) (default 1500)
-min_leafsize int
Min size of a tree leaf (default: 10) (default 10)
-min_matsize int
Min matrix size to use for computation. If too small, create random features from the surrounding enhancers (default: 0) (default 100)
-n_boot int
Number of boostrap regressions to perform for each target (default 100)
-nb_sim_features int
if set to K and K > 0, Perform a simulation on K features, using the other variables as background and a poisson distribution to control the amount of noise in the simulated variables. The accuracy metrics (TPR, TNR, FPR, FNR), are reported for each target in a separated file (default: 0)
-neighborhood int
Neighborhood regions in (+/-) BP to explore (default 2.5e5) (default 250000)
-linkType string
Which links to keep {"all", "positive", "negative"}. Positive and negative links refers to positively or negatively correlated linkages (default "all")
-onlySim
Only perform simulation.
-out string
Output directory
-rmPeaksInTargets
For each target model, Remove peaks which are within the target boundaries
-secondOrder
Identify the covariates associated with each inferred enhancer-target links
-secondOrderMaxFeat int
Maximum number of explanatory features per bootstrap model. If set, only K features are kept for each model based on the coefficient, even if the number of non null coefs is larger (default 2)
-tag string
Output files tag
-target string
Perform computation for one single target (e.g. one gene)
-targets value
File containing list of targets (e.g. gene IDs) to use for computation
-threads int
Number of internal threads (default: 2) (default 2)
-threshold float
P-value threshold (default: 0.05) (default 0.05)
-uniformSampling
Randomly sample the cells to have an uniform covariate distribution for each bootstrap. Needs a covariate matrix
-version
Show current version
-xgi value
row index for input mat
-xgi2 value
row index for input expression mat
-xgi_subset value
Subset of xgi (cells) to use for computation
-ygi value
column index for input mat
-ygi2 value
column index for input expression mat
-ygi_subset value
Subset of ygi (peaks) to use for computation
enhgrid
enhgrid can substitute enhlink and produce the same exact results. However, enhgrid is useful to parallelize the processing of a large number of genes with the processes options. With the example below and having 6 x 3 = 18 CPUs available we can use the following commands:
enhgrid -mat matrix.mtx \
-xgi barcodes.tsv \
-ygi peaks.bed \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-out output_example_3/ \
-threads 6 \
-processes 3 \
-splitTargetList \
-format cellRanger
This would be faster than using enhlink 18 threads since the threads are used for one target region at the time.
Another reason to use enhgrid is when ones want to try multiple hyperparameters values, such as lambda1 and lambda2 described in Tutorial 2. For example, below we use enhgrid to generate linkages for different neighborhood sizes (in kb) and different forest sizes:
# this is just an example but if you launch it, it will take a long time to compute.
enhgrid -mat matrix.mtx \
-xgi barcodes.tsv \
-ygi peaks.bed \
-gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
-out output_example_3/ \
-threads 6 \
-n_boot 50,100,150 \
-neighborhood 250000,500000,1e6 \
-processes 3 \
-format cellRanger
enhgrid will generate output files for each combinations of -neighborhood and -nboot.
Here are enhgrid’s arguments.
#################### Module to Link enhancers to promoters ########################
enhgrid performs enhlink on multiple processes for a range of hyperparameter values. enhgrid generates output files for each hyperparameter combination. The following parameters can accept multiple values:
-downsample <int>
-n_boot <int>
-depth <int>
-max_features <int>
-secondOrderMaxFeat <int>
-threshold <float>
-min_matsize <int>
-min_leafsize <int>
-merging_cutoff <int>
-neighborhood <int>
-maxFeatType <string/int/float>
-lambda1 <float>
-lambda2 <float>
-threads <int>
Multiple values can be passed as input using either comma or space: for example -depth 2,3,4 or -depth "2 3 4"
Enhgrid can accept the exact same parameters than Enhlink with additional functionalities:
## Parameters unique to enhgrid:
-randomNTargets <int> which allows to pick, for each grid iteration, N tatgets at random from the index and process them instead of the full list of targets
-repetition <int> Number of repetition to be performed for each iteration (default: 1)
-processes <int> Number of Enhlink processes to be launched in parallel (default: 1)
-splitTargetList Split the list of genes through the n processes
<<<<<<<<<<<<<<<<<<<< WARNING >>>>>>>>>>>>>>>>>>>>
As of March 20 2024, Enhlink v0.21.0,
we Changed some of Enhgrid's parameters names
for clarity and consistency purpose.
Below are the list of changes:
(version < 0.21.0) -> (version >= 0.21.0)
cluster -> clusters
promoter -> gtf
genes -> targets
gene -> target
isGeneExpr -> isExpr
rmPeaksInPromoter -> rmPeaksInTargets
splitGeneList -> splitTargetList
randomNGenes -> randomNTargets
<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>
USAGE:
enhgrid -mat <file> -xgi <file> -ygi <file> -gtf <file> -out <path> -tag <string> \
[-mat2 <file> -xgi2 <file> -ygi2 <file>] \ # IF PASSING A GENE MATRIX FILE
[-target <string>] \ # IF FOCUSING ON ONE TARGET
[-targets <file>] \ # IF FOCUSING ON A LIST OF TARGETS
[-isExpr] # IF THE GENE MATRIX IS A GENE EXPRESSION MATRIX
[-covariates <file> -xgi_subset <file> -ygi_subset <file> -clusters <file>] \ # OPTIONAL
[-downsample <int> -threads <int> -n_boot <int> -depth <int> -max_features <int> -threshold <float> min_matsize <int> -min_leafsize <int> -merging_cutoff <int> -neighborhood <int> -format {coo, mtx} -keep_sparse -maxFeatType <string/int/float> -rmPeaksInTargets -linkType {"all", "positive", "negative"} -secondOrder -ignoreEnhancerWeight -uniformSampling] \ # OPTIONAL
[-randomGenes <int> -repetition <int> -processes <int> --splitTargetList] # OPTIONAL and specific to enhgrid
-clusters value
Clusters file
-covariates value
optional covariate dense TSV matrix (row:cell x col:covariates). a column header with the name of the covariates must be the first line and the cell IDs must be the first column. Only categorical covariates can be used for now.
-depth string
Max tree depth. If negative, no max tree depth will be used (default: -1) (default "2")
-downsample string
Downsample the number of samples used for each bootstrap if > 0 and if > len(cluster) (default "0")
-format string
Input matrix format (default "coo")
-gtf value
BED File containing the genomic coordinates (three first columns) of each target and the target ID (e.g. gene symbol)
-ignoreEnhancerWeight
Ignore Enhancers weight (the ratio of accessibility) in the computation of the modified Information Gain. Useful to not overweight widely accessible enhancers when performing distal inference with neighborhood very large (or == 0)
-isExpr
Use an expression float matrix for -mat2
-keep_sparse
Keep the main ColMat matrix sparse in the memory. Usefull for memory reason if neighboorhood is very wide (or ==0), but slow down computation (default: false)
-lambda1 string
Lambda parameter of a poisson distribution, that controls the amount of dropouts of the simulated variables (default "1.8")
-lambda2 string
Lambda parameter of a poisson distribution, that controls the amount of false positives of the simulated variables (default "0.05")
-mat value
Input matrix file
-mat2 value
Input target matrix file, defining the value of each target for each cell
-maxFeatType string
Maximum of features to be considered for a given tree. accepted value: {"all", "sqrt", "log"}. can be an Int (number of features) or a float between 0 and 1 (fraction of features) (default "all")
-max_features string
Maximum number of explanatory features per bootstrap model. If set only K features are kept for each model based on the coefficient, even if the number of non null coefs is larger (default "4")
-merging_cutoff int
Cutoff (bp) for merging close promoter peak (default: 1500) (default 1500)
-min_leafsize string
Min size of a tree leaf (default: 10) (default "10")
-min_matsize string
Min matrix size to use for computation. If too small, create random features from the surrounding enhancers (default: 0) (default "100")
-n_boot string
Number of boostrap regressions to perform for each target (default "100")
-nb_sim_features string
if set to K and K > 0, Perform a simulation on K features, using the other variables as backgrou
nd and a poisson distribution to control the amount of noise in the simulated variables. The accuracy me
trics (TPR, TNR, FPR, FNR), are reported for each promoter in a separated file (default: 0) (default "0"
)
-neighborhood string
Neighborhood regions in (+/-) BP to explore (default 2.5e5) (default "250000")
-linkType string
Which links to keep {"all", "positive", "negative"}. Positive and negative links refers to positively or negatively correlated linkages (default "all")
-onlySim
Only perform simulation.
-out string
Output directory
-processes int
Number of Enhlink processes to be launched in parallel. The total number of cpus used would be t
hreads x processes (default 1)
-randomNTargets int
Number of random targets to draft and process for each iteration instead of analyzing the full s
et of targets
-repetition int
Number of repetition to be performed for each iteration (default 1)
-rmPeaksInTargets
For each target model, Remove peaks which are within the target boundaries
-secondOrder
Identify the covariates associated with each inferred enhancer-target links
-secondOrderMaxFeat string
Maximum number of explanatory features per bootstrap model. If set, only K features are kept for
each model based on the coefficient, even if the number of non null coefs is larger (default "2")
-splitTargetList
Split the list of targets through the n processes
-tag string
Output files tag
-target string
Perform computation for one single target
-targets value
File containing a subset of targets to focus on for computation
-threads string
Number of internal threads (default: 2) (default "2")
-threshold string
P-value threshold (default: 0.05) (default "0.05")
-uniformSampling
Randomly sample the cells to have an uniform covariate distribution for each bootstrap. Needs a
covariate matrix
-version
Show current version
-xgi value
row index for input mat
-xgi2 value
row index for input target mat
-xgi_subset value
Subset of xgi (cells) to use for computation
-ygi value
column index for input mat
-ygi2 value
column index for input target mat
-ygi_subset value
Subset of ygi (peaks) to use for computation
enhtools
enhtools can compute intersections and differences between bedpe files, and filter a bedpe file according to genomic regions defined in a bed file:
# -intersect3 is based on unperfect matches between file1 and file2
enthools -intersect3 -in file1.bedpe -in2 file2.bedpe -out file3.bedpe
# file4 will contains links from file1 that don't intersect links from file2
enthools -diff -in file1.bedpe -in2 file2.bedpe -out file4.bedpe
# file1.filtered.bedpe will contains links from file1 that are also within one of the region of TAD.bed
enhtools -filter -in file1.bedpe -bed TAD.bed -out file1.filtered.bedpe
Here are enhtools’s arguments.
#################### Post processing of enhlink results ########################
enhtools interescts bedpe files and compute accuracy metrics (TPR, FPR, F-score...)
USAGE:
* with -intersect (Intersection of output directory obtained from enhlink)
enhtools -intersect -in <directory> -in2 <directory> -out <directory> (optional: -tag/tag2/outtag <string> -scorePos/pvalPos/geneIDPos <string> -mergeScore {left, right, mean} -prec <float> )
* with -intersect2 (Intersection of two bedpe files)
enhtools -intersect2 -in <bedpe file> -in2 <bedpe file> -out <directory> (optional: -tag/tag2/outtag <string> -scorePos/pvalPos/geneIDPos <string> -mergeScore {left, right, mean} -prec <float> -stdout)
* with -intersect3 (Intersect input bedpe files, based on unperfect matches. A match occurs if both regions of the 2 bedpe files intersect respectively)
enhtools -intersect3 -in <fileRef> -in2 <fileTarget> -out <file> (optional: -stdout)
* with -intersect4 (Intersect input bedpe files, based on unperfect matches. A match occurs if at least one of two regions of the first bedpe file intersects with one of the two regions of the second bedpe file)
The metadata columns to annotate each match are defined with the scorePos, pvalPos, and geneIDPos options.
enhtools -intersect4 -in <fileRef> -in2 <fileTarget> -out <file> (optional: -stdout -scorePos -pvalPos -geneIDPos)
* with -intersectBed (Intersect bedpe with bed, at least one bedpe region of intersect one of the BED region)
Matching type defined by the -maching option ("either", "left", "right", "both") controls which bedpe region should have a match.
enhtools -intersectBed -in <bedpe file> -bed <bed file> -out <file> (optional: -matching -stdout -scorePos -pvalPos -geneIDPos)
* with diff (Difference between (file1 - file2) input bedpe files, based on unperfect matches. A match occurs if both regions of the 2 bedpe files intersect respectively)
enhtools -diff -in <fileRef> -in2 <fileTarget> -out <file> (optional: -stdout)
* with -filter (filter links from bedpe file (-in) which are not within at least one of the region defined in (-bed). If -diff is added, only the links not within at least one region of -bed will be outputed. -filter is well adapted for filtering links not in TAD regions, defined in a BED file
enhtools -filter -in <bedbe file> -bed <bed file> (optional: -stdout -diff)
TIPS:
scorePos/pvalPos/geneIDPos can be used to se different column IDs for in and in2! Use the ":" to delimitate the seprators for the files from in and in2
-bed string
BED file used to filter bedpe links
-diff
Output the difference between the bedpes given by -in and -in2 with -intersect3
-filter
filter links from bedpe file (-in) which are not within at least one of the region defined in (-bed). If -diff is added, only the links not within at least one region of -bed will be outputed. -filter is well adapted for filtering links not in TAD regions, defined in a BED file
-geneIDPos string
column ID(s) from bedpe file1 (in1), starting from 0, that define the gene ID.if genePos different between -in and -in2, use the following format "<posLeft>:<posRight>", example: "8:9"
-in string
Input results directory (-intersect) or input bedpe file (-intersect2)
-in2 string
2nd Input results directory (-intersect) or input bedpe file (-intersect2)
-intersect
intersect the outputs of enhlink by retaining the intersection of the results
-intersect2
intersect two bedpe files
-intersect3
Intersect input bedpe files, based on unperfect matches. A match occurs if both regions of the 2 bedpe files intersect respectively
-intersect4
Intersect input bedpe files, based on unperfect matches. A match occurs if at least one of two regions of the first bedpe file intersects with one of the two regions of the second bedpe file
-intersectBed
Intersect bedpe with bed (at least one bedpe region of intersect one of the BED region)
-matching string
matching type for -intersectBed {"either" (default) "left" (left region of the bed), "right" (right region), "both" (both regions)} (default "either")
-mergeScore string
Strategy to merge scores: {left, right, mean} (default "left")
-out string
output (directory / file)
-outtag string
Output files tag
-prec int
Float precision (default 4)
-pvalPos string
column ID(s) from bedpe file1 (in1), starting from 0, indicating the pvalues. if pvalPos different between -in and -in2, use the following format "<posLeft>:<posRight>", example: "7:6"
-scorePos string
column ID(s) from bedpe file1 (in1), starting from 0, indicating the scores. if scorePos different between -in and -in2, use the following format "<posLeft>:<posRight>", example: "8:9"
-stdout
write to stdout
-tag string
Input files tag
-tag2 string
2nd input files tag
-version
Show current version