Tutorial 6: Generating simulated matrices for method comparison

Overview

In this tutorial, we will generate a simulated matrix that compares different methods for identifying target regions and linked enhancers. We will begin by creating a cells x peaks matrix using an existing input matrix for background data. We will then simulate additional data for the target regions and linked enhancers that will be included alongside the background data. The simulated matrix can then be tested with Enhlink to try to retrieve the enhancers of the simulated regions.

Steps

loading the matrix

First, we need to load an existing cells x peaks sparse matrix with the Python language. The two required files are the sparse matrix npz python file and the columns index, defining the peak regions as a bed file.

from scipy.sparse import load_npz
import pandas as pd

npz = "tmp.repl1_CUBE_ISLET_V1.delta.v2_peak.npz"

mat = load_npz(npz)

ygi = "peaks.index.bed"

dfy = pd.read_csv(ygi, sep="\t", header=None)

dfy.columns = ['chr', 'start', 'stop']

intersecting the peak index with regerence GTF file

Secondly, we need a promoter GTF file defining the promoter regions of each gene. We then need to find the peaks intersecting these promoter regions and annotate them with the corresponding gene symbols. We can do this using ATACAnnotateRegions from the ATACdemultiplex package.

It can be installed with the following command:

go install gitlab.com/Grouumf/ATACdemultiplex/ATACAnnotateRegions@latest

Now, let’s create a file containing all the peaks from the index file interesting the GTF promoter file:

YGI="peaks.index.bed"
PROMOTER="promoters.gtf"
OUT="intersecting_peaks.promoter.bed"

ATACAnnotateRegions -bed ${YGI} -ref ${PROMOTER} -ignore -stdout|sort|uniq > -out ${OUT}

head ${OUT}

## output of the first lines:

chr1    10006926        10007925        Ppp1r42
chr1    10008609        10009608        Ppp1r42
chr1    10011702        10012701        Ppp1r42
chr1    10013940        10014939        Ppp1r42
chr1    10037419        10038418        Cops5
chr1    10037419        10038418        Cspp1

We can now load this file in python and define the set of peaks that

ygi_inter = "/home/poirio/data2/ENHLINK/SIMMAT/intersecting_peaks.promoter.bed"
dfyi = pd.read_csv(ygi_inter, sep="\t", header=None)

dfyi.columns = ['chr', 'start', 'stop', 'geneID']

dfyi['peak'] = dfyi['chr'] + "_" + dfyi['start'].astype('str') + "_" + dfyi['stop'].astype('str')

dfy_series = dfy['chr'] + "_" + dfy['start'].astype('str') + "_" + dfy['stop'].astype('str')

dfy_index = {a:i for i,a in enumerate(dfy_series)}

gene_index = {}
used_peaks = set()

for i in dfyi.index:
    peak_str = dfyi.loc[i, 'peak']
    peak_id = dfy_index[peak_str]

    if peak_str in used_peaks:
        continue

    gene_index[dfyi.loc[i, 'geneID']] = peak_id

    used_peaks.add(peak_str)

simulation function

Now, we need to define our simulation function that a) shuffle a promoter b) copy it and introduces noise for each simulated enhancer associated with it. It is controlled by two hyperparameters defining the ratios of false positives and false negatives in the simulated enhancers.

import numpy as np

def simulate_ygi_enh(ygi, nb_enh, noise1=1.8, noise2=0.05):
    ygir = ygi.copy()
    np.random.shuffle(ygir)

    enhr_array = []
    for _ in range (nb_enh):

        enhr = ygir.copy()
        for i, v  in enumerate(enhr):
            p1 = np.random.poisson(lam=noise1)
            p2 = np.random.poisson(lam=noise2)

            if p1 > 0 and v > 0:
                enhr[i] = 0
            elif p2 > 0 and v == 0:
                enhr[i] = 1

        enhr_array.append(enhr)

    return ygir, np.asarray(enhr_array)

looping through genes and create simulated promoter and enhancers

We can select random genes from the list and create simulated enhancers for each selected gene. Then, we add these enhancers and promoters to the original sparse matrix as output, along with the column index of the new matrix, the list of simulated enhancers and their associated simulated promoter reference (bedpe), and the list of simulated promoters with their genomic coordinates

from scipy.sparse import hstack
from scipy.sparse import csr_matrix
from sys import stdout

import numpy as np

from scipy.sparse import save_npz

### Output files ###
fout = "simulated_matrix.npz"
fout_promoter = "simulated_matrix.promoter.tsv"
fout_feat = "simulated_matrix.enhancer.bedpe"
fout_ygi = "simulated_matrix.index.bed"

# Number of random genes to draw
nb_sim_genes = 400

# Number of simulated enhancers to draw for each genes (between 2 and 7)
range_nb_sim_features = (2, 8)

# Initialisation of object
sim_promoters = pd.DataFrame()
sim_features = pd.DataFrame()
dfy2 = dfy.copy()

current_index = dfy.shape[1]

rand_genes = np.random.choice(list(gene_index.keys()), size=nb_sim_genes)
rand_ids = np.array([gene_index[a] for a in rand_genes])

print("loading matrices...")
mat = mat.tocsr()

mat2 = mat.tolil()

print("Creating simulated variables...")
feat_array = None

used_ygis = set()

looping for each gene selected, simulate a promoter for a given peak within a promoter region, replace the value the promoter with the simulated promoter, and simulate enhancers linked to that promoter. Each peaks and promoter can only be used once.

for i, rand_id in enumerate(rand_ids):
    stdout.write("\r#### feat nb. {0}".format(i))
    stdout.flush()

    nb_sim = np.random.randint(range_nb_sim_features[0], range_nb_sim_features[1])

    vect_y = np.asarray(mat[:, rand_id].todense()).T[0]

    chosen_prom = dfy.loc[rand_id].copy()
    chosen_prom["geneID"] = i
    sim_promoters = sim_promoters.append(chosen_prom)

    vect_y, array_r = simulate_ygi_enh(vect_y, nb_sim, noise1=1.25, noise2=0.15)
    mat2[:, rand_id] = vect_y

    if feat_array is None:
        feat_array = array_r
    else:
        feat_array = np.vstack([feat_array, array_r])

    for j in range(nb_sim):
        sim_y = chosen_prom.copy()

        sim_y['start'] += 10000 + j
        sim_y['stop'] = sim_y['start'] + 10 + j

        while tuple(sim_y[['chr', 'start', 'stop']]) in used_ygis:
            sim_y['start'] += 1

        used_ygis.add(tuple(sim_y[['chr', 'start', 'stop']]))

        dfy2 = dfy2.append(sim_y[['chr', 'start', 'stop']])
        sim_y.index = ['chr2', 'start2', 'stop2', 'geneID']
        sim_features = sim_features.append(pd.concat([chosen_prom, sim_y]))

Saving the matrix and indexes

Finally, we stack the simulated promoters and enhancers with the original matrices, compute the accessibility mean of each simulated promoter, and save the indexes

feat_array = csr_matrix(feat_array)

mat2 = hstack([mat2, feat_array.T])

prom_mean = mat[:, rand_ids].mean(axis=0)
enh_mean = mat2.tocsr()[:, -feat_array.shape[0]:].mean(axis=0)
print("\nPromoters mean: {0} std:{1}".format(prom_mean.mean(), prom_mean.std()))
print("Simulated enhancers mean: {0} std:{1}".format(enh_mean.mean(), enh_mean.std()))

for key in ['start', 'stop']:
    sim_promoters[key] = sim_promoters[key].astype('int')
    dfy2[key] = dfy2[key].astype('int')

for key in ['start', 'stop', 'start2', 'stop2', 'geneID']:
    sim_features[key] = sim_features[key].astype('int')

sim_features['geneID'] = 'simGeneID' + sim_features['geneID'].astype('str')
sim_promoters['geneID'] = 'simGeneID' + sim_promoters['geneID'].astype('int').astype('str')

print("\n\nsaving files...")
sim_promoters.to_csv(fout_promoter, sep="\t", header=None, index=None)
print("file written: {0}".format(fout_promoter))

sim_features = sim_features.loc[:, ['chr', 'start', 'stop', 'chr2', 'start2', 'stop2', 'geneID']]

sim_features.to_csv(fout_feat, sep="\t", index=None)
print("file written: {0}".format(fout_feat))

dfy2.to_csv(fout_ygi, sep="\t", header=None, index=None)
print("file written: {0}".format(fout_ygi))

print("saving mat...")
save_npz(fout, mat2)
print("file written: {0}".format(fout))