Tutorial 3: Create sparse matrices compatible with Enhlink

In this tutorial, we will describe how to create matrices that can be used with Enhlink.

recognized matrices format

Enhink currently accepts three types of matrix format:

the COOrdinate format (COO) which is similar to the scipy COO format consists of a three-columns tab separated .tsv file with the two first columns being the cell and the peak index and the last columns the values. The matrix file needs to have two index files describing the cell barcodes and the features, which can be either a bed file with genomic regions or just feature names such as gene IDs. The COO matrix has the indexes starting at 0.

# These are the first lines of dummy matrices used as illustration

head -3 example_atac_coo_mat.tsv

# output:

0    10    1
2    0     1
3    30    1

head -3 example_barcodes.txt

# output:
TCGAAAAA-1
TCGCCCCC-1
TCGGGGGG-1

head -3 example_peaks.bed

# output:

chr1    100000    101000
chr2    100000    101000
chr3    100000    101000

In the example above, the second line of the matrix 2 1 1 refers to TCGGGGGG-1 and chr1 100000 101000 having a value of 1.

the Matrix Market File (MTX) is described here and is very similar to the COO format.the MTX matrix comes with a header and the first line describes the dimensions of the matrix (i.e. the number of cells and features). The MTX matrix also has the indexes starting at 0.

head -4 example_atac_mtx_mat.tsv

# output:

%%MatrixMarket matrix coordinate real general
%
100  500   2022
0    10    1
2    0     1
3    30    1

The matrix above is similar to the COO except the header and the first line: 100 500 2022, indicating the number of cells (100), the number of features (500), and the sum of all the values (2022). These numbers are made up for the example.

The cellRanger format is a MTX matrix with the feature indexes in the first column, the barcode index in the second columns and has the indexes starting at 1 instead of 0.

Please notice the how the matrix above differs from the first MTX matrix.

head -4 example_atac_cellRanger.tsv

# output:

%%MatrixMarket matrix coordinate real general
%
500   100  2022
11    1    1
1     2    1
31    4    1

Create a cell x peak sparse matrix from a fragment file

Fragments files are BED-formatted files where the fourth column corresponds to a cell barcode. These files are one of the standard output files generated by the Cell Ranger single-cell ATAC-seq data processing pipeline. Each line in the file represents a sequencing read that has been assigned to a specific cell barcode.

First, let’s download a fragments file from one of the example available datasets of 10XGenomics:

we will use the multi-omic RNA/ATAC PBMC from a Healthy Donor - No Cell Sorting (3k) dataset.

# Downloading fragment file
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_fragments.tsv.gz

# preview the fragments file without the header lines starting with #
# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat pbmc_unsorted_3k_atac_fragments.tsv.gz|grep -v "#"|head

# output

chr1    10073   10431   CAAGGGAGTAAGCTCA-1      1
chr1    10085   10279   ACAGGTAAGCTTTGGG-1      1
chr1    10085   10335   GGTTAATGTACTTCAC-1      1
chr1    10091   10339   TGTTGGCCAAGGAATC-1      1
chr1    10150   10192   CCCTTAATCATCCTCA-1      2

# The last column corresponds to the read orientation and has no importance here

Then, we need to download (or to infer) peaks or genomic regions from this fragments file. If we want to compute ourself the peaks, we can use for example the MACS2 software. In the case of ATAC-seq reads, the peaks are which regions that corresponds to open chromatin regions. Alternatively, Cell-Ranger already computed these peaks for us, so let’s just use their file:

# Downloading the peaks file
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_peaks.bed

Finally we need the barcodes file. We will use the one from the filtered feature matrix for the gene expression because we know that these barcodes are considered as valid by Cell-Ranger. Also, we will be able to use both matrices in the next tutorial:

wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz

# decompressing the file
tar -xvf pbmc_unsorted_3k/pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz

To create the matrix we can use the ATACMatutils software from a 2021 snATAC-seq study.

To install it, we can directly download the executable if we have a linux x86_64 architecture or install it with go:

# option 1: install it with go
go install gitlab.com/Grouumf/ATACdemultiplex/ATACMatUtils@latest

# option 2: download the executable
wget https://gitlab.com/Grouumf/ATACdemultiplex/-/raw/master/bin/ATACMatUtils
chmod +x ./ATACMatUtils # give exec permission
cp ./ATACMatUtils /usr/local/bin # optionally put the exec in the global path

More info can be found here.

We can now easily create a cell x peak sparse matrix:

ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
    -xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi ./pbmc_unsorted_3k_atac_peaks.bed \
    -out ./filtered_feature_atac_matrix \
    -format cellRanger

this should only take a minute and creates a new folder with the mtx matrix and the two index files:

ls ./filtered_feature_atac_matrix

# output:

filtered_feature_atac_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz

The matrix above is boolean, e.g. the value is alwasy 1. in other words, a peak is either accessible or not for a given cell. It is also possible to create a numerical matrices with the -use_count and the -norm_type options of ATACMatUtils (see ATACMatUtils -h). For example, we will bellow create a numerical matrix that can be used as Gene expression matrix with Enhlink (See next tutorial).

# the following command creates a  matrix having the read count as numerical value for each promoter
ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
    -xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
    -ygi ./Homo_sapiens.GRCh38.99.TSS.2K.bed  \
    -out ./promoter_atac_matrix \
    -use_symbol \
    -use_count \
    -norm_type count \
    -format cellRanger

# display the first lines of the matrix
# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat promoter_atac_matrix/matrix.mtx.gz|head

# output

%%MatrixMarket matrix coordinate real general
%
60642   3009    15321203
59922   1       2
18545   1       3
29792   1       1
17524   1       2

# display the first lines of the feature files
# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat promoter_atac_matrix/features.tsv.gz|head

A1BG    A1BG    count
A1BG-AS1        A1BG-AS1        count
A1CF    A1CF    count
A2M     A2M     count
A2M-AS1 A2M-AS1 count

Other norms can be used such as simple, rpm, logrpm, fpkm, and logfpkm (see ATACMatUtils -h).

Python frameworks

Transform a Python sparse matrix to a COO/MTX file.

Sparse matrices are manipulated using the scipy.sparse library. To save a python sparse matrix into a MTX format, simply use the scipy.io package.

from scipy.io import mmwrite
import numpy as np
from scipy.sparse import coo_matrix

# dummy matrix
M = np.random.randint(0, 2, (10, 40))
Ms = coo_matrix(M)

mmwrite('dummy_mtx_matrix.mtx', Ms)

Transform a scanpy object to MTX

a scanpy object stores the data in structure linked to a sparse matrix and a dataframe with the barcode and feature indexes

# Here is an example for converting an AnnData object to a sparse matrix with the index files.

import scanpy as sc
# Example dataset
adata = sc.datasets.ebi_expression_atlas("E-MTAB-4888")

# To be conform with the cellRanger format, we have to use the transpose of the matrix
mmwrite('E-MTAB-4888.sparse_matrix.mtx', adata.X.T)
adata.obs.index.to_frame().to_csv('barcodes.tsv.gz', header=None, sep="\t", index=None)
adata.var.index.to_frame().to_csv('features.tsv.gz', header=None, sep="\t", index=None)

read MTX/COO matrices to Python

Reading mtx matrices in Python is straightforward with the mmread function of the scipy.io library. scanpy is also able to directly read a MTX matrix from cellRanger. Finally, here is a code to convert COO matrices to Python.

Concert NPZ sparse matrix and index into a dataframe

Python sparse matrix are .npz files can be read with the scipy sparse library. Since the row and column indexes of this matrix are numerical, a npz matrix requires two index files (xgi and ygi), describing the row and column features.

from scipy.sparse import load_npz
import pandas as pd
import numpy as np

fmat = "python.sparse_matrix.npz"
fxindex = "python.sparse_matrix.xindex.xgi"
fyindex = "python.sparse_matrix.xindex.xgi"

# Load the matrix
M = load_npz(fmat)
# Load the index and convert them into an array
xgi = np.array([x.strip() for x in open(fxindex)])
ygi = np.array([x.strip() for y in open(fyindex)])

# Create a pandas dataframe
df = pd.DataFrame.sparse.from_spmatrix(M, index=xgi, columns=ygi)

R frameworks

Saving matrices

The R-devel library contains the equivalent of the scipy.io functions. Furthermore, the library write10xCounts seems to be suited for saving and reading genomic matrices and data into the cellRanger format, but we didn’t test it at this point.

Reading matrices

reading COO matrices and converting them into R sparse matrices is straightforward using the R sparseMatrix object from the Matrix library, since it takes the same 3 columns as input. here is an example code that can be used for this task.