Tutorial 3: Create sparse matrices compatible with Enhlink
In this tutorial, we will describe how to create matrices that can be used with Enhlink.
recognized matrices format
Enhink currently accepts three types of matrix format:
the COOrdinate format (COO) which is similar to the scipy COO format consists of a three-columns tab separated .tsv file with the two first columns being the cell and the peak index and the last columns the values. The matrix file needs to have two index files describing the cell barcodes and the features, which can be either a bed file with genomic regions or just feature names such as gene IDs. The COO matrix has the indexes starting at 0.
# These are the first lines of dummy matrices used as illustration
head -3 example_atac_coo_mat.tsv
# output:
0 10 1
2 0 1
3 30 1
head -3 example_barcodes.txt
# output:
TCGAAAAA-1
TCGCCCCC-1
TCGGGGGG-1
head -3 example_peaks.bed
# output:
chr1 100000 101000
chr2 100000 101000
chr3 100000 101000
In the example above, the second line of the matrix 2 1 1
refers to TCGGGGGG-1
and chr1 100000 101000
having a value of 1.
the Matrix Market File (MTX) is described here and is very similar to the COO format.the MTX matrix comes with a header and the first line describes the dimensions of the matrix (i.e. the number of cells and features). The MTX matrix also has the indexes starting at 0.
head -4 example_atac_mtx_mat.tsv
# output:
%%MatrixMarket matrix coordinate real general
%
100 500 2022
0 10 1
2 0 1
3 30 1
The matrix above is similar to the COO except the header and the first line: 100 500 2022
, indicating the number of cells (100), the number of features (500), and the sum of all the values (2022). These numbers are made up for the example.
The cellRanger format is a MTX matrix with the feature indexes in the first column, the barcode index in the second columns and has the indexes starting at 1 instead of 0.
head -4 example_atac_cellRanger.tsv
# output:
%%MatrixMarket matrix coordinate real general
%
500 100 2022
11 1 1
1 2 1
31 4 1
Please note the how the matrix above differs from the first MTX matrix.
Create a cell x peak sparse matrix from a fragment file
Fragments files are bed files with the 4th column being a cell barcode. They are one of the standard output files of the Cell-Ranger single-cell ATAC-seq processing pipelines. Each line represents a read assigned to a specific barcode.
First, let’s download a fragments file from one of the example available datasets of 10XGenomics:
we will use the multi-omic RNA/ATAC PBMC from a Healthy Donor - No Cell Sorting (3k) dataset.
# Downloading fragment file
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_fragments.tsv.gz
# preview the fragments file without the header lines starting with #
# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat pbmc_unsorted_3k_atac_fragments.tsv.gz|grep -v "#"|head
# output
chr1 10073 10431 CAAGGGAGTAAGCTCA-1 1
chr1 10085 10279 ACAGGTAAGCTTTGGG-1 1
chr1 10085 10335 GGTTAATGTACTTCAC-1 1
chr1 10091 10339 TGTTGGCCAAGGAATC-1 1
chr1 10150 10192 CCCTTAATCATCCTCA-1 2
# The last column corresponds to the read orientation and has no importance here
Then, we need to download (or to infer) peaks or genomic regions from this fragments file. If we want to compute ourself the peaks, we can use for example the MACS2 software. In the case of ATAC-seq reads, the peaks are which regions that corresponds to open chromatin regions. Alternatively, Cell-Ranger already computed these peaks for us, so let’s just use their file:
# Downloading the peaks file
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_atac_peaks.bed
Finally we need the barcodes file. We will use the one from the filtered feature matrix for the gene expression because we know that these barcodes are considered as valid by Cell-Ranger. Also, we will be able to use both matrices in the next tutorial:
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_unsorted_3k/pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz
# decompressing the file
tar -xvf pbmc_unsorted_3k/pbmc_unsorted_3k_filtered_feature_bc_matrix.tar.gz
To create the matrix we can use the ATACMatutils software from a 2021 snATAC-seq study.
To install it, we can directly download the executable if we have a linux x86_64 architecture or install it with go:
# option 1: install it with go
go install gitlab.com/Grouumf/ATACdemultiplex/ATACMatUtils@latest
# option 2: download the executable
wget https://gitlab.com/Grouumf/ATACdemultiplex/-/raw/master/bin/ATACMatUtils
chmod +x ./ATACMatUtils # give exec permission
cp ./ATACMatUtils /usr/local/bin # optionally put the exec in the global path
More info can be found here.
We can now easily create a cell x peak sparse matrix:
ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
-xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
-ygi ./pbmc_unsorted_3k_atac_peaks.bed \
-out ./filtered_feature_atac_matrix \
-format cellRanger
this should only take a minute and creates a new folder with the mtx matrix and the two index files:
ls ./filtered_feature_atac_matrix
# output:
filtered_feature_atac_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
The matrix above is boolean, e.g. the value is alwasy 1. in other words, a peak is either accessible or not for a given cell. It is also possible to create a numerical matrices with the -use_count
and the -norm_type
options of ATACMatUtils
(see ATACMatUtils -h). For example, we will bellow create a numerical matrix that can be used as Gene expression
matrix with Enhlink (See next tutorial).
# the following command creates a matrix having the read count as numerical value for each promoter
ATACMatUtils -bed ./pbmc_unsorted_3k_atac_fragments.tsv.gz \
-xgi ./filtered_feature_bc_matrix/barcodes.tsv.gz \
-ygi ./Homo_sapiens.GRCh38.99.TSS.2K.bed \
-out ./promoter_atac_matrix \
-use_symbol \
-use_count \
-norm_type count \
-format cellRanger
# display the first lines of the matrix
# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat promoter_atac_matrix/matrix.mtx.gz|head
# output
%%MatrixMarket matrix coordinate real general
%
60642 3009 15321203
59922 1 2
18545 1 3
29792 1 1
17524 1 2
# display the first lines of the feature files
# Please use gzcat instead of zcat (frome coreutils) on OSX
zcat promoter_atac_matrix/features.tsv.gz|head
A1BG A1BG count
A1BG-AS1 A1BG-AS1 count
A1CF A1CF count
A2M A2M count
A2M-AS1 A2M-AS1 count
Other norms can be used such as simple, rpm, logrpm, fpkm, and logfpkm (see ATACMatUtils -h).
Python frameworks
Transform a Python sparse matrix to a COO/MTX file.
Sparse matrices are manipulated using the scipy.sparse library.
To save a python sparse matrix into a MTX format, simply use the scipy.io
package.
from scipy.io import mmwrite
import numpy as np
from scipy.sparse import coo_matrix
# dummy matrix
M = np.random.randint(0, 2, (10, 40))
Ms = coo_matrix(M)
mmwrite('dummy_mtx_matrix.mtx', Ms)
Transform a scanpy object to MTX
a scanpy object stores the data in structure linked to a sparse matrix and a dataframe with the barcode and feature indexes
# Here is an example for converting an AnnData object to a sparse matrix with the index files.
import scanpy as sc
# Example dataset
adata = sc.datasets.ebi_expression_atlas("E-MTAB-4888")
mmwrite('E-MTAB-4888.sparse_matrix.mtx', adata.X)
adata.obs.index.to_frame().to_csv('barcodes.tsv.gz', header=None, sep="\t", index=None)
adata.var.index.to_frame().to_csv('features.tsv.gz', header=None, sep="\t", index=None)
read MTX/COO matrices to Python
Reading mtx matrices in Python is straightforward with the mmread
function of the scipy.io
library.
scanpy
is also able to directly read a MTX matrix from cellRanger. Finally, here is a code to convert COO matrices to Python.
Concert NPZ sparse matrix and index into a dataframe
Python sparse matrix are .npz
files can be read with the scipy sparse library. Since the row and column indexes of this matrix are numerical, a npz matrix requires two index files (xgi and ygi), describing the row and column features.
from scipy.sparse import load_npz
import pandas as pd
import numpy as np
fmat = "python.sparse_matrix.npz"
fxindex = "python.sparse_matrix.xindex.xgi"
fyindex = "python.sparse_matrix.xindex.xgi"
# Load the matrix
M = load_npz(fmat)
# Load the index and convert them into an array
xgi = np.array([x.strip() for x in open(fxindex)])
ygi = np.array([x.strip() for y in open(fyindex)])
# Create a pandas dataframe
df = pd.DataFrame.sparse.from_spmatrix(M, index=xgi, columns=ygi)
R frameworks
Saving matrices
The R-devel
library contains the equivalent of the scipy.io
functions. Furthermore, the library write10xCounts seems to be suited for saving and reading genomic matrices and data into the cellRanger format, but we didn’t test it at this point.
Reading matrices
reading COO matrices and converting them into R sparse matrices is straightforward using the R sparseMatrix
object from the Matrix
library, since it takes the same 3 columns as input. here is an example code that can be used for this task.