# Tutorial 6: Generating simulated matrices for method comparison

## Overview

In this tutorial, we will generate a simulated matrix that compares different methods for identifying target regions and linked enhancers. We will begin by creating a cells x peaks matrix using an existing input matrix for background data. We will then simulate additional data for the target regions and linked enhancers that will be included alongside the background data. The simulated matrix can then be tested with Enhlink to try to retrieve the enhancers of the simulated regions.

## Steps

### loading the matrix
First, we need to load an existing cells x peaks sparse matrix with the Python language.
The two required files are the sparse matrix `npz` python file and the columns index, defining the peak regions as a bed file.

```python
from scipy.sparse import load_npz
import pandas as pd

npz = "tmp.repl1_CUBE_ISLET_V1.delta.v2_peak.npz"

mat = load_npz(npz)

ygi = "peaks.index.bed"

dfy = pd.read_csv(ygi, sep="\t", header=None)

dfy.columns = ['chr', 'start', 'stop']

```

### intersecting the peak index with regerence GTF file
Secondly, we need a promoter GTF file defining the promoter regions of each gene.
We then need to find the peaks intersecting these promoter regions and annotate them with the corresponding gene symbols. We can do this using `ATACAnnotateRegions` from the `ATACdemultiplex` package.

It can be installed with the following command:

```bash
go install gitlab.com/Grouumf/ATACdemultiplex/ATACAnnotateRegions@latest
```

Now, let's create a file containing all the peaks from the index file interesting the GTF promoter file:

```bash
YGI="peaks.index.bed"
PROMOTER="promoters.gtf"
OUT="intersecting_peaks.promoter.bed"

ATACAnnotateRegions -bed ${YGI} -ref ${PROMOTER} -ignore -stdout|sort|uniq > -out ${OUT}

head ${OUT}

## output of the first lines:

chr1    10006926        10007925        Ppp1r42
chr1    10008609        10009608        Ppp1r42
chr1    10011702        10012701        Ppp1r42
chr1    10013940        10014939        Ppp1r42
chr1    10037419        10038418        Cops5
chr1    10037419        10038418        Cspp1
```


We can now load this file in python and define the set of peaks that

```python
ygi_inter = "/home/poirio/data2/ENHLINK/SIMMAT/intersecting_peaks.promoter.bed"
dfyi = pd.read_csv(ygi_inter, sep="\t", header=None)

dfyi.columns = ['chr', 'start', 'stop', 'geneID']

dfyi['peak'] = dfyi['chr'] + "_" + dfyi['start'].astype('str') + "_" + dfyi['stop'].astype('str')

dfy_series = dfy['chr'] + "_" + dfy['start'].astype('str') + "_" + dfy['stop'].astype('str')

dfy_index = {a:i for i,a in enumerate(dfy_series)}

gene_index = {}
used_peaks = set()

for i in dfyi.index:
    peak_str = dfyi.loc[i, 'peak']
    peak_id = dfy_index[peak_str]

    if peak_str in used_peaks:
        continue

    gene_index[dfyi.loc[i, 'geneID']] = peak_id

    used_peaks.add(peak_str)
```

### simulation function

Now, we need to define our simulation function that a) shuffle a promoter b) copy it and introduces noise for each simulated enhancer associated with it. It is controlled by two hyperparameters defining the ratios of false positives and false negatives in the simulated enhancers.

```python
import numpy as np

def simulate_ygi_enh(ygi, nb_enh, noise1=1.8, noise2=0.05):
    ygir = ygi.copy()
    np.random.shuffle(ygir)

    enhr_array = []
    for _ in range (nb_enh):

        enhr = ygir.copy()
        for i, v  in enumerate(enhr):
            p1 = np.random.poisson(lam=noise1)
            p2 = np.random.poisson(lam=noise2)

            if p1 > 0 and v > 0:
                enhr[i] = 0
            elif p2 > 0 and v == 0:
                enhr[i] = 1

        enhr_array.append(enhr)

    return ygir, np.asarray(enhr_array)
```

### looping through genes and create simulated promoter and enhancers

We can select random genes from the list and create simulated enhancers for each selected gene. Then, we add these enhancers and promoters to the original sparse matrix as output, along with the column index of the new matrix, the list of simulated enhancers and their associated simulated promoter reference (bedpe), and the list of simulated promoters with their genomic coordinates

```bash
from scipy.sparse import hstack
from scipy.sparse import csr_matrix
from sys import stdout

import numpy as np

from scipy.sparse import save_npz

### Output files ###
fout = "simulated_matrix.npz"
fout_promoter = "simulated_matrix.promoter.tsv"
fout_feat = "simulated_matrix.enhancer.bedpe"
fout_ygi = "simulated_matrix.index.bed"

# Number of random genes to draw
nb_sim_genes = 400

# Number of simulated enhancers to draw for each genes (between 2 and 7)
range_nb_sim_features = (2, 8)

# Initialisation of object
sim_promoters = pd.DataFrame()
sim_features = pd.DataFrame()
dfy2 = dfy.copy()

current_index = dfy.shape[1]

rand_genes = np.random.choice(list(gene_index.keys()), size=nb_sim_genes)
rand_ids = np.array([gene_index[a] for a in rand_genes])

print("loading matrices...")
mat = mat.tocsr()

mat2 = mat.tolil()

print("Creating simulated variables...")
feat_array = None

used_ygis = set()
```

looping for each gene selected, simulate a promoter for a given peak within a promoter region, replace the value the promoter with the simulated promoter, and simulate enhancers linked to that promoter. Each peaks and promoter can only be used once.

```python
for i, rand_id in enumerate(rand_ids):
    stdout.write("\r#### feat nb. {0}".format(i))
    stdout.flush()

    nb_sim = np.random.randint(range_nb_sim_features[0], range_nb_sim_features[1])

    vect_y = np.asarray(mat[:, rand_id].todense()).T[0]

    chosen_prom = dfy.loc[rand_id].copy()
    chosen_prom["geneID"] = i
    sim_promoters = sim_promoters.append(chosen_prom)

    vect_y, array_r = simulate_ygi_enh(vect_y, nb_sim, noise1=1.25, noise2=0.15)
    mat2[:, rand_id] = vect_y

    if feat_array is None:
        feat_array = array_r
    else:
        feat_array = np.vstack([feat_array, array_r])

    for j in range(nb_sim):
        sim_y = chosen_prom.copy()

        sim_y['start'] += 10000 + j
        sim_y['stop'] = sim_y['start'] + 10 + j

        while tuple(sim_y[['chr', 'start', 'stop']]) in used_ygis:
            sim_y['start'] += 1

        used_ygis.add(tuple(sim_y[['chr', 'start', 'stop']]))

        dfy2 = dfy2.append(sim_y[['chr', 'start', 'stop']])
        sim_y.index = ['chr2', 'start2', 'stop2', 'geneID']
        sim_features = sim_features.append(pd.concat([chosen_prom, sim_y]))
```

### Saving the matrix and indexes

Finally, we stack the simulated promoters and enhancers with the original matrices, compute the accessibility mean of each simulated promoter,  and save the indexes

```python
feat_array = csr_matrix(feat_array)

mat2 = hstack([mat2, feat_array.T])

prom_mean = mat[:, rand_ids].mean(axis=0)
enh_mean = mat2.tocsr()[:, -feat_array.shape[0]:].mean(axis=0)
print("\nPromoters mean: {0} std:{1}".format(prom_mean.mean(), prom_mean.std()))
print("Simulated enhancers mean: {0} std:{1}".format(enh_mean.mean(), enh_mean.std()))

for key in ['start', 'stop']:
    sim_promoters[key] = sim_promoters[key].astype('int')
    dfy2[key] = dfy2[key].astype('int')

for key in ['start', 'stop', 'start2', 'stop2', 'geneID']:
    sim_features[key] = sim_features[key].astype('int')

sim_features['geneID'] = 'simGeneID' + sim_features['geneID'].astype('str')
sim_promoters['geneID'] = 'simGeneID' + sim_promoters['geneID'].astype('int').astype('str')

print("\n\nsaving files...")
sim_promoters.to_csv(fout_promoter, sep="\t", header=None, index=None)
print("file written: {0}".format(fout_promoter))

sim_features = sim_features.loc[:, ['chr', 'start', 'stop', 'chr2', 'start2', 'stop2', 'geneID']]

sim_features.to_csv(fout_feat, sep="\t", index=None)
print("file written: {0}".format(fout_feat))

dfy2.to_csv(fout_ygi, sep="\t", header=None, index=None)
print("file written: {0}".format(fout_ygi))

print("saving mat...")
save_npz(fout, mat2)
print("file written: {0}".format(fout))
```

## Using Enhlink to process the simulated matrix

To process the simulated matrix with Enhlink, we have to first convert the matrix to a COO (COOrdinate formate) matrix, since Enhlink cannot read a numpy sparse matrix. A COO matrix is a three-columns file indicating the x and y coordinates and the values of each entry of the sparse matrix. To do so, we can use the following script: [npz_to_COO_mat](https://gitlab.com/Grouumf/enhlinktools/-/blob/master/scripts/npz_to_COO_mat)

```bash
python ./npz_to_COO_mat --npz simulated_matrix.npz --out simulated_matrix.coo.gz
```

We can then call Enhlink to infer enhancers on the specific simulated promoter regions:

```bash
COO="simulated_matrix.coo.gz"
XGI="simulated_matrix.xgi" # these are the row indexes
YGI="simulated_matrix.index.bed"
OUT="./simulated_matrix_enhlink/"
PROMOTER="simulated_matrix.promoter.tsv"


mkdir -p ${OUT}

enhlink \
    -mat ${COO} \
    -xgi ${XGI} \
    -ygi ${YGI} \
    -out ${OUT} \
    -rmPeaksInTargets \
    -gtf ${PROMOTER} \
    -threads 8
```