Introduction and installation

Enhlink is a novel computational approach that utilizes single-cell signals to infer linkages between regulatory sequences such as enhancers and promoters. To accomplish this, Enhlink employs an ensemble strategy that integrates cell covariates and produces robust p-values for any link and covariate-specific link.

Designed to leverage multi-omic datasets, Enhlink incorporates complementary omic measurements such as gene expression to infer enhancer-promoter links. By modeling the surrounding enhancers and covariates as predictive features of promoter accessibility (or gene expression), Enhlink utilizes a modified Information Gain score, a random forest framework, and a bootstrap procedure to estimate significant features associated with a given promoter. Additionally, Enhlink performs a second-order analysis, if desired, that identifies linkages specific to a given covariate.

Enhlink is not limited to proximal enhancers but rather can infer distal enhancer-gene linkages. Finally, Enhlink incorporates a simulation workflow designed using experimentally validated enhancer-promoter signals, which improves prediction accuracy.

Enhlink softwares

Enhlink is an analytical framework developed in Go (https://go.dev/), and compiled into three executables:

enhlink
enhgrid
enhtools

The command line manual and arguments of each executable can be accessed using the -h flag, (e.g. enhlink -h). enhlink is the main executable that launches the Enhlink pipeline, while enhgrid allows distributing the processing of a list of genes into multiple CPUs and launch a range of input values for all the hyperparameters accepting a numerical value. enhgrid is useful for, for example, automatizing a grid-search approach by trying a combination of multiple hyperparameters or for testing different noise levels. enhtools intersect results from multiple runs and output either the common or unique links of a particular run. It also computes the accuracy between two runs (F-score, Precision, Recall). The source code can be easily compiled using a go compiler and the Unix/linux executables can be downloaded here: (https://gitlab.com/Grouumf/enhlinktools)

Installation

Using precompiled executables

we precompiled the three executables for enhlink, enhgrid, and enhtools and made them available in an associated figshare project. They are compatible with either Linux x86_64 or OSX arm64 platforms.

Using go installation mechanisms

Install a a golang compiler (if not existing)

Download binaries: https://golang.org/dl/
Configure $GOPATH/$GOBIN

#In .bashrc or .zshrc
export GOROOT=$HOME/go # or wherever is you go folder
export GOBIN=$HOME/go/local/bin # or wherever is your local bin folder for go exectuable
export GOPATH=$HOME/go/code/:$HOME/code

PATH=$GOPATH:$GOROOT:$PATH
PATH=$HOME/go/bin/:$GOBIN:$PATH

source your init file source ~/.bashrc

Install the packages

# go version >= 1.18. This might work too with older go versions
## Install enhlink
go install gitlab.com/Grouumf/enhlinktools/enhlink@latest
## Install enhgrid
go install gitlab.com/Grouumf/enhlinktools/enhgrid@latest
## Install enhtools
go install gitlab.com/Grouumf/enhlinktools/enhtools@latest

Installation (from repository)

git clone https://gitlab.com/Grouumf/enhlinktools.git
cd ./enhlinktools/enhlink
go install .
cd ../enhgrid/
go install .
cd ../enhtools/
go install .

Alternatively, one can use go build to direcly build the executable into the same folder

git clone https://gitlab.com/Grouumf/enhlinktools.git
cd ./enhlinktools/enhlink
# this command compile the executable into the local directory
go build .
cd ../enhgrid/
go build .
cd ../enhtools/
go build .

enhlink

Simple processing from cell-ranger ATAC processed files

In its simpliest form, Enhlink requires a sparse matrix (-mat) with de cell barcode index (-xgi), the peak index (-ygi) as a bed file, and a fourth-columns bed file (-promoter) indicating the promoter (or target) regions with the promoter name as a 4th columns. The command below will infer links for all the regions defined in the promoter file and using the +/-250kb surrounding peaks of each target region as features.

enhlink -mat matrix.mtx \
    -xgi barcodes.tsv \
    -ygi peaks.bed \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out output_example_1/ \
    -threads 6 \
    -format cellRanger

Please have a look to the different tutorial sections to see more examples.

Tutorial 1 describes the processing of one (ATAC) cell x peak matrix
Tutorial 2 describes how to use simulation to estimate the expected accuracy at a given target region
Tutorial 3 describes how to create a sparse cell x peak matrix compatible with Enhlink
Tutorial 4 describes how to process a multi-omics from one of the single-cell RNA/ATAC-seq datasets of 10X Genomics
Tutorial 5 describes Enhlink’s advance functionalities, such as distal interaction and hyperparameters selection
Tutorial 6 describes how to visualize the links and how to create bigwig files.

Arguments

Here are enhlink’s arguments.

#################### Module to Link enhancers to promoters ########################
enhlink inferes enhancer / promoter co-accessibilities (links) using random forests of ID3 trees and Information gain.

<<<<<<<<<<<<<<<<<<<< WARNING >>>>>>>>>>>>>>>>>>>>
As of March 20 2024, Enhlink v0.21.0,
we Changed some of Enhlink's parameters names
 for clarity and consistency purpose.

Below are the list of changes:
(version < 0.21.0) ->     (version >= 0.21.0)
cluster            ->     clusters
promoter           ->     gtf
genes              ->     targets
gene               ->     target
isGeneExpr         ->     isExpr
rmPeaksInPromoter  ->     rmPeaksInTargets
<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>

enhlink main inputs are:
    a) a (cell x peak) sparse matrix,
    b) a 4-columns promoter TSV file <chrID, start, stop, geneID> ,
    c) an optional (cell x gene) sparse matrix if the gene activity cannot be inferred from the peaks of the the first matrix and the promoter regions. This matrix can either be interpreted as boolean (e.g. the promoter of a given gene is either accessible or not for a given cell), or as a float matrix using the -isExpr option, which reflects the gene expression (for example in a context of a scATAC-seq/RNA-seq multi-omic study)

In addition, covariates (cell x covariates) and clusters (cell x clusterID) TSV file can be provided.
Finally, multiple optional parameters can be set to fine tune the speed, accuracies, and range of the models.

USAGE:
     enhlink -mat <file> -xgi <file> -ygi <file> -promoter <file> -out <path> -tag <string> \
             [-mat2 <file> -xgi2 <file> -ygi2 <file>] \  # IF PASSING A GENE MATRIX FILE
             [-target <string>] \ # IF FOCUSING ON ONE TARGET
             [-targets <file>] \ # IF FOCUSING ON A LIST OF TARGETS
             [-isExpr] # IF THE GENE MATRIX IS A GENE EXPRESSION MATRIX
             [-covariates <file> -xgi_subset <file>  -ygi_subset <file> -cluster <file>] \ # OPTIONAL
             [-downsample <int> -threads <int> -n_boot <int> -depth <int> -max_features <int> -threshold <float> min_matsize <int> -min_leafsize <int> -merging_cutoff <int> -neighborhood <int> -format {coo, mtx} -keep_sparse -maxFeatType <string/int/float> -rmPeaksInPromoter -linkType {"all", "positive", "negative"} -secondOrder -ignoreEnhancerWeight --uniformSampling] \ # OPTIONAL

  -clusters value
        Clusters file
  -covariates value
        optional covariate dense TSV matrix (row:cell x col:covariates). a column header with the name of the covariates must be the first line and the cell IDs must be the first column. Only categorical covariates can be used for now.
  -depth int
        Max tree depth. If negative, no max tree depth will be used (To use the maximum depth, set to: -1) (default 2)
  -downsample int
        Downsample the number of samples used for each bootstrap if > 0 and if > len(cluster)
  -format string
        Input matrix format (default "coo")
  -gtf value
        BED File containing the genomic coordinates (three first columns) of each target and the target ID (e.g. gene symbol)
  -ignoreEnhancerWeight
        Ignore Enhancers weight (the ratio of accessibility) in the computation of the modified Information Gain. Useful to not overweight widely accessible enhancers when performing distal inference with neighborhood very large (or == 0)
  -isExpr
        Use a expression float matrix for mat2 instead of a boolean matrix
  -keep_sparse
        Keep the main ColMat matrix sparse in the memory. Usefull for memory reason if neighboorhood is very wide (or ==0), but slow down computation (default: false)
  -lambda1 float
        Lambda parameter of a poisson distribution, that controls the amount of dropouts of the simulated variables  (default 1.8)
  -lambda2 float
        Lambda parameter of a poisson distribution, that controls the amount of false positives of the simulated variables (default 0.05)
  -mat value
        Input matrix file
  -mat2 value
        Input expression matrix file
  -maxFeatType string
        Maximum of features to be considered for a given tree. accepted value: {"all", "sqrt", "log"}. can be an Int (number of features) or a float between 0 and 1 (fraction of features) (default "all")
  -max_features int
        Maximum number of explanatory features per bootstrap model. If set, only K features are kept for each model based on the coefficient, even if the number of non null coefs is larger (default 4)
  -merging_cutoff int
        Cutoff (bp) for merging close target peak (default: 1500) (default 1500)
  -min_leafsize int
        Min size of a tree leaf (default: 10) (default 10)
  -min_matsize int
        Min  matrix size to use for computation. If too small, create random features from the surrounding enhancers (default: 0) (default 100)
  -n_boot int
        Number of boostrap regressions to perform for each target (default 100)
  -nb_sim_features int
        if set to K and K > 0, Perform a simulation on K features, using the other variables as background and a poisson distribution to control the amount of noise in the simulated variables. The accuracy metrics (TPR, TNR, FPR, FNR), are reported for each target in a separated file (default: 0)
  -neighborhood int
        Neighborhood regions in (+/-) BP to explore (default 2.5e5) (default 250000)
  -linkType string
        Which links to keep {"all", "positive", "negative"}. Positive and negative links refers to positively or negatively correlated linkages (default "all")
  -onlySim
        Only perform simulation.
  -out string
        Output directory
  -rmPeaksInTargets
        For each target model, Remove peaks which are within the target boundaries
  -secondOrder
        Identify the covariates associated with each inferred enhancer-target links
  -secondOrderMaxFeat int
        Maximum number of explanatory features per bootstrap model. If set, only K features are kept for each model based on the coefficient, even if the number of non null coefs is larger (default 2)
  -tag string
        Output files tag
  -target string
        Perform computation for one single target (e.g. one gene)
  -targets value
        File containing list of targets (e.g. gene IDs) to use for computation
  -threads int
        Number of internal threads (default: 2) (default 2)
  -threshold float
        P-value threshold (default: 0.05) (default 0.05)
  -uniformSampling
        Randomly sample the cells to have an uniform covariate distribution for each bootstrap. Needs a covariate matrix
  -version
        Show current version
  -xgi value
        row index for input mat
  -xgi2 value
        row index for input expression mat
  -xgi_subset value
        Subset of xgi (cells) to use for computation
  -ygi value
        column index for input mat
  -ygi2 value
        column index for input expression mat
  -ygi_subset value
        Subset of ygi (peaks) to use for computation

enhgrid

enhgrid can substitute enhlink and produce the same exact results. However, enhgrid is useful to parallelize the processing of a large number of genes with the processes options. With the example below and having 6 x 3 = 18 CPUs available we can use the following commands:

enhgrid -mat matrix.mtx \
    -xgi barcodes.tsv \
    -ygi peaks.bed \
    -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
    -out output_example_3/ \
    -threads 6 \
    -processes 3 \
    -splitTargetList \
    -format cellRanger

This would be faster than using enhlink 18 threads since the threads are used for one target region at the time.

Another reason to use enhgrid is when ones want to try multiple hyperparameters values, such as lambda1 and lambda2 described in Tutorial 2. For example, below we use enhgrid to generate linkages for different neighborhood sizes (in kb) and different forest sizes:

# this is just an example but if you launch it, it will take a long time to compute.
enhgrid -mat matrix.mtx \
        -xgi barcodes.tsv \
        -ygi peaks.bed \
        -gtf Homo_sapiens.GRCh38.99.TSS.2K.bed \
        -out output_example_3/ \
        -threads 6 \
        -n_boot 50,100,150 \
        -neighborhood 250000,500000,1e6 \
        -processes 3 \
        -format cellRanger

enhgrid will generate output files for each combinations of -neighborhood and -nboot.

Here are enhgrid’s arguments.

#################### Module to Link enhancers to promoters ########################
enhgrid performs enhlink on multiple processes for a range of hyperparameter values. enhgrid generates output files for each hyperparameter combination.  The following parameters can accept multiple values:

    -downsample <int>
    -n_boot <int>
    -depth <int>
    -max_features <int>
    -secondOrderMaxFeat <int>
    -threshold <float>
    -min_matsize <int>
    -min_leafsize <int>
    -merging_cutoff <int>
    -neighborhood <int>
    -maxFeatType <string/int/float>
    -lambda1 <float>
    -lambda2 <float>
    -threads <int>

Multiple values can be passed as input using either comma or space: for example -depth 2,3,4 or -depth "2 3 4"

Enhgrid can accept the exact same parameters than Enhlink with additional functionalities:

## Parameters unique to enhgrid:
-randomNTargets <int>     which allows to pick, for each grid iteration, N tatgets at random from the index and process them instead of the full list of targets
-repetition <int>     Number of repetition to be performed for each iteration (default: 1)
-processes <int>  Number of Enhlink processes to be launched in parallel (default: 1)
-splitTargetList Split the list of genes through the n processes

<<<<<<<<<<<<<<<<<<<< WARNING >>>>>>>>>>>>>>>>>>>>
As of March 20 2024, Enhlink v0.21.0,
we Changed some of Enhgrid's parameters names
 for clarity and consistency purpose.

Below are the list of changes:
(version < 0.21.0) ->     (version >= 0.21.0)
cluster            ->     clusters
promoter           ->     gtf
genes              ->     targets
gene               ->     target
isGeneExpr         ->     isExpr
rmPeaksInPromoter  ->     rmPeaksInTargets
splitGeneList      ->     splitTargetList
randomNGenes       ->     randomNTargets
<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>

USAGE:
     enhgrid -mat <file> -xgi <file> -ygi <file> -gtf <file> -out <path> -tag <string> \
             [-mat2 <file> -xgi2 <file> -ygi2 <file>] \  # IF PASSING A GENE MATRIX FILE
             [-target <string>] \ # IF FOCUSING ON ONE TARGET
             [-targets <file>] \ # IF FOCUSING ON A LIST OF TARGETS
             [-isExpr] # IF THE GENE MATRIX IS A GENE EXPRESSION MATRIX
             [-covariates <file> -xgi_subset <file>  -ygi_subset <file> -clusters <file>] \ # OPTIONAL
             [-downsample <int> -threads <int> -n_boot <int> -depth <int> -max_features <int> -threshold <float> min_matsize <int> -min_leafsize <int> -merging_cutoff <int> -neighborhood <int> -format {coo, mtx} -keep_sparse -maxFeatType <string/int/float> -rmPeaksInTargets -linkType {"all", "positive", "negative"} -secondOrder -ignoreEnhancerWeight -uniformSampling] \ # OPTIONAL
             [-randomGenes <int> -repetition <int> -processes <int> --splitTargetList] # OPTIONAL and specific to enhgrid

  -clusters value
        Clusters file
  -covariates value
        optional covariate dense TSV matrix (row:cell x col:covariates). a column header with the name of the covariates must be the first line and the cell IDs must be the first column. Only categorical covariates can be used for now.
  -depth string
        Max tree depth. If negative, no max tree depth will be used (default: -1) (default "2")
  -downsample string
        Downsample the number of samples used for each bootstrap if > 0 and if > len(cluster) (default "0")
  -format string
        Input matrix format (default "coo")
  -gtf value
        BED File containing the genomic coordinates (three first columns) of each target and the target ID (e.g. gene symbol)
  -ignoreEnhancerWeight
        Ignore Enhancers weight (the ratio of accessibility) in the computation of the modified Information Gain. Useful to not overweight widely accessible enhancers when performing distal inference with neighborhood very large (or == 0)
  -isExpr
        Use an expression float matrix for -mat2
  -keep_sparse
        Keep the main ColMat matrix sparse in the memory. Usefull for memory reason if neighboorhood is very wide (or ==0), but slow down computation (default: false)
  -lambda1 string
        Lambda parameter of a poisson distribution, that controls the amount of dropouts of the simulated variables (default "1.8")
  -lambda2 string
        Lambda parameter of a poisson distribution, that controls the amount of false positives of the simulated variables (default "0.05")
  -mat value
        Input matrix file
  -mat2 value
        Input target matrix file, defining the value of each target for each cell
  -maxFeatType string
        Maximum of features to be considered for a given tree. accepted value: {"all", "sqrt", "log"}. can be an Int (number of features) or a float between 0 and 1 (fraction of features) (default "all")
  -max_features string
        Maximum number of explanatory features per bootstrap model. If set only K features are kept for each model based on the coefficient, even if the number of non null coefs is larger (default "4")
  -merging_cutoff int
        Cutoff (bp) for merging close promoter peak (default: 1500) (default 1500)
  -min_leafsize string
        Min size of a tree leaf (default: 10) (default "10")
  -min_matsize string
        Min  matrix size to use for computation. If too small, create random features from the surrounding enhancers (default: 0) (default "100")
  -n_boot string
        Number of boostrap regressions to perform for each target (default "100")
  -nb_sim_features string
        if set to K and K > 0, Perform a simulation on K features, using the other variables as backgrou
nd and a poisson distribution to control the amount of noise in the simulated variables. The accuracy me
trics (TPR, TNR, FPR, FNR), are reported for each promoter in a separated file (default: 0) (default "0"
)
  -neighborhood string
        Neighborhood regions in (+/-) BP to explore (default 2.5e5) (default "250000")
  -linkType string
        Which links to keep {"all", "positive", "negative"}. Positive and negative links refers to positively or negatively correlated linkages (default "all")
  -onlySim
        Only perform simulation.
  -out string
        Output directory
  -processes int
        Number of Enhlink processes to be launched in parallel. The total number of cpus used would be t
hreads x processes (default 1)
  -randomNTargets int
        Number of random targets to draft and process for each iteration instead of analyzing the full s
et of targets
  -repetition int
        Number of repetition to be performed for each iteration (default 1)
  -rmPeaksInTargets
        For each target model, Remove peaks which are within the target boundaries
  -secondOrder
        Identify the covariates associated with each inferred enhancer-target links
  -secondOrderMaxFeat string
        Maximum number of explanatory features per bootstrap model. If set, only K features are kept for
 each model based on the coefficient, even if the number of non null coefs is larger (default "2")
  -splitTargetList
        Split the list of targets through the n processes
  -tag string
        Output files tag
  -target string
        Perform computation for one single target
  -targets value
        File containing a subset of targets to focus on for computation
  -threads string
        Number of internal threads (default: 2) (default "2")
  -threshold string
        P-value threshold (default: 0.05) (default "0.05")
  -uniformSampling
        Randomly sample the cells to have an uniform covariate distribution for each bootstrap. Needs a
covariate matrix
  -version
        Show current version
  -xgi value
        row index for input mat
  -xgi2 value
        row index for input target mat
  -xgi_subset value
        Subset of xgi (cells) to use for computation
  -ygi value
        column index for input mat
  -ygi2 value
        column index for input target mat
  -ygi_subset value
        Subset of ygi (peaks) to use for computation

enhtools

enhtools can compute intersections and differences between bedpe files, and filter a bedpe file according to genomic regions defined in a bed file:

# -intersect3 is based on unperfect matches between file1 and file2
enthools -intersect3 -in file1.bedpe -in2 file2.bedpe -out file3.bedpe

# file4 will contains links from file1 that don't intersect links from file2
enthools -diff -in file1.bedpe -in2 file2.bedpe -out file4.bedpe

# file1.filtered.bedpe will contains links from file1 that are also within one of the region of TAD.bed
enhtools -filter -in file1.bedpe -bed TAD.bed -out file1.filtered.bedpe

Here are enhtools’s arguments.

#################### Post processing of enhlink results ########################
enhtools interescts bedpe files and compute accuracy metrics (TPR, FPR, F-score...)

USAGE:
* with -intersect (Intersection of output directory obtained from enhlink)
enhtools -intersect -in <directory> -in2 <directory> -out <directory> (optional: -tag/tag2/outtag <string> -scorePos/pvalPos/geneIDPos <string> -mergeScore {left, right, mean} -prec <float> )

* with -intersect2 (Intersection of two bedpe files)
enhtools -intersect2 -in <bedpe file> -in2 <bedpe file> -out <directory> (optional: -tag/tag2/outtag <string> -scorePos/pvalPos/geneIDPos <string> -mergeScore {left, right, mean} -prec <float> -stdout)

* with -intersect3 (Intersect input bedpe files, based on unperfect matches. A match occurs if both regions of the 2 bedpe files intersect respectively)
enhtools -intersect3 -in <fileRef>  -in2 <fileTarget> -out <file> (optional:  -stdout)

* with -intersect4 (Intersect input bedpe files, based on unperfect matches. A match occurs if at least one of two regions of the first bedpe file intersects with one of the two regions of the second bedpe file)
The metadata columns to annotate each match are defined with the scorePos, pvalPos, and geneIDPos options.
enhtools -intersect4 -in <fileRef>  -in2 <fileTarget> -out <file> (optional: -stdout -scorePos -pvalPos -geneIDPos)

* with -intersectBed (Intersect bedpe with bed, at least one bedpe region of intersect one of the BED region)
Matching type defined by the -maching option ("either", "left", "right", "both") controls which bedpe region should have a match.
enhtools -intersectBed -in <bedpe file>  -bed <bed file> -out <file> (optional: -matching -stdout -scorePos -pvalPos -geneIDPos)

* with diff (Difference between (file1 - file2) input bedpe files, based on unperfect matches. A match occurs if both regions of the 2 bedpe files intersect respectively)
enhtools -diff -in <fileRef>  -in2 <fileTarget> -out <file> (optional: -stdout)

* with -filter (filter links from bedpe file (-in) which are not within at least one of the region defined in (-bed). If -diff is added, only the links not within at least one region of -bed will be outputed. -filter is well adapted for filtering links not in TAD regions, defined in a BED file
enhtools -filter -in <bedbe file> -bed <bed file> (optional: -stdout -diff)

TIPS:
scorePos/pvalPos/geneIDPos can be used to se  different column IDs for in and in2! Use the ":" to delimitate the seprators for the files from in and in2

  -bed string
        BED file used to filter bedpe links
  -diff
        Output the difference between the bedpes given by -in and -in2 with -intersect3
  -filter
        filter links from bedpe file (-in) which are not within at least one of the region defined in (-bed). If -diff is added, only the links not within at least one region of -bed will be outputed. -filter is well adapted for filtering links not in TAD regions, defined in a BED file
  -geneIDPos string
         column ID(s) from bedpe file1 (in1), starting from 0, that define the gene ID.if genePos different between -in and -in2, use the following format "<posLeft>:<posRight>", example: "8:9"
  -in string
        Input results directory (-intersect) or input bedpe file (-intersect2)
  -in2 string
        2nd Input results directory (-intersect) or input bedpe file (-intersect2)
  -intersect
        intersect the outputs of enhlink by retaining the intersection of the results
  -intersect2
        intersect two bedpe files
  -intersect3
        Intersect input bedpe files, based on unperfect matches. A match occurs if both regions of the 2 bedpe files intersect respectively
  -intersect4
        Intersect input bedpe files, based on unperfect matches. A match occurs if at least one of two regions of the first bedpe file  intersects with one of the two regions of the second bedpe file
  -intersectBed
        Intersect bedpe with bed (at least one bedpe region of intersect one of the BED region)
  -matching string
        matching type for -intersectBed  {"either" (default) "left" (left region of the bed), "right" (right region), "both" (both regions)} (default "either")
  -mergeScore string
        Strategy to merge scores: {left, right, mean} (default "left")
  -out string
        output (directory / file)
  -outtag string
        Output files tag
  -prec int
        Float precision (default 4)
  -pvalPos string
         column ID(s) from bedpe file1 (in1), starting from 0, indicating the pvalues. if pvalPos different between -in and -in2, use the following format "<posLeft>:<posRight>", example: "7:6"
  -scorePos string
         column ID(s) from bedpe file1 (in1), starting from 0, indicating the scores. if scorePos different between -in and -in2, use the following format "<posLeft>:<posRight>", example: "8:9"
  -stdout
        write to stdout
  -tag string
        Input files tag
  -tag2 string
        2nd input  files tag
  -version
        Show current version