Welcome to scGEAToolbox’s documentation!
Single-cell RNA sequencing (scRNA-seq) technology has revolutionized research in the biomedical sciences. It provides an unprecedented level of resolution across individual cells for studying cell heterogeneity and gene expression variability. Analyzing scRNA-seq data is challenging, however, because the data are sparse and high-dimensional. scGEAToolbox is a MATLAB toolbox for scRNA-seq data analysis. It contains a comprehensive set of functions for data normalization, feature selection, batch correction, imputation, cell clustering, trajectory/pseudotime analysis, and network construction, which can be combined into custom workflows. While most of the functions are implemented in native MATLAB, wrapper functions are provided to allow users to call third-party tools developed in MATLAB or other languages. Furthermore, scGEAToolbox is equipped with sophisticated graphical user interfaces (GUIs), making it an easy-to-use application for quick data processing.
Official Websites and Social Networks
Please visit the official website of scGEAToolbox for further information.
Quick installation
Run the following code in MATLAB:
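tic
disp('Installing scGEAToolbox...')
% Download the main branch from GitHub and extract it into the current folder
unzip('https://github.com/jamesjcai/scGEAToolbox/archive/main.zip');
addpath('./scGEAToolbox-main');
toc
% Confirm that the main entry point can now be found
if exist('scgeatool.m','file')
    disp('scGEAToolbox installed!')
else
    warning('scgeatool.m was not found on the MATLAB path.')
end
% Save the updated search path to pathdef.m in the user folder
savepath(fullfile(userpath,'pathdef.m'));
% Alternative: save to the current pathdef.m location
% savepath;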
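To confirm that the toolbox folder really is on the MATLAB path, a quick check along the following lines may help. This is a minimal sketch: the list of function names is simply taken from the examples in this documentation, and the variable names are illustrative.

```matlab
% Function names used elsewhere in this documentation
fnames = {'scgeatool','sc_readmtxfile','sc_qcfilter','sc_selectg','sc_tsne'};

% exist(name,'file') returns 2 when name resolves to a file such as an .m file
onpath = cellfun(@(f) exist(f,'file') == 2, fnames);

for k = 1:numel(fnames)
    if onpath(k)
        fprintf('%-16s found\n', fnames{k});
    else
        fprintf('%-16s NOT found (check addpath)\n', fnames{k});
    end
end
```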
Getting Started
Run the following code in MATLAB to start SCGEATOOL:
scgeatool
SCGEATOOL
SCGEATOOL is a lightweight and blazing fast MATLAB application that provides interactive visualization functionality to analyze single-cell transcriptomic data. SCGEATOOL allows you to easily interrogate different views of your scRNA-seq data to quickly gain insights into the underlying biology.
Overview
In MATLAB, the scgeatool function starts SCGEATOOL to visualize an SCE object. Below are links to several case studies and examples that use the scgeatool function to explore scRNA-seq data. All examples below are publicly available through GitHub.
Using SCGEATOOL to explore
For a quick exploratory data analysis using the scgeatool function:
cdgea;
load example_data/testXgs.mat
scgeatool(X,g,s)
where X is the expression matrix, g is the list of genes, and s contains the embedding coordinates.
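The three inputs need to agree in size. The sketch below builds small synthetic stand-ins (not real data) and checks the layout that the examples in this documentation suggest: genes in rows and cells in columns of X, one gene name per row of X, and one embedding row per cell. The sizes and variable contents are hypothetical.

```matlab
rng(1);                                  % reproducible random numbers
nGenes = 50; nCells = 30;                % hypothetical sizes
X = randi([0 5], nGenes, nCells);        % counts: genes in rows, cells in columns
g = "Gene" + (1:nGenes)';                % nGenes-by-1 string array of gene names
s = randn(nCells, 3);                    % 3-D embedding: one row per cell

% Consistency checks before passing the variables on
assert(size(X,1) == numel(g),  'X needs one row per gene in g');
assert(size(X,2) == size(s,1), 'X needs one column per row of s');
```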
You can also load an example SCE (SingleCellExperiment object) variable using the following code:
cdgea;
load example_data/testSce.mat
scgeatool(sce)
If everything goes right, you will see the main interface of SCGEATOOL.
Making scRNA-seq data into SCE
The SingleCellExperiment (SCE) class stores scRNA-seq data and associated variables. To make an SCE object, you need at least two variables: \(X\) and \(g\), the gene expression matrix and the gene list, respectively. Embedding coordinates \(s\) can be passed as a third argument, as in the example below.
cdgea;
load example_data/testXgs.mat
sce=SingleCellExperiment(X,g,s);
scgeatool(sce)
SCGEATOOL standalone for Windows
SCGEATOOL standalone is a lightweight and blazing fast desktop application that provides interactive visualization functionality to analyze single-cell transcriptomic data. SCGEATOOL allows you to easily interrogate different views of your scRNA-seq data to quickly gain insights into the underlying biology. SCGEATOOL is a pre-compiled standalone application developed in MATLAB. Pre-compiled standalone releases are meant for those environments without access to MATLAB licenses. Standalone releases provide access to all of the functionality of the SCGEATOOL standard MATLAB release encapsulated in a single application. SCGEATOOL is open-sourced to allow you to experience the added flexibility and speed of the MATLAB environment when needed.
Code Formulas
Example code for common tasks.
Import 10x Genomics files
In the 10x Genomics folder, there are three files, namely, matrix.mtx, features.tsv (or genes.tsv) and barcodes.tsv. Here is how to import them:
mtxf='GSM3535276_AXLN1_matrix.mtx';
genf='GSM3535276_AXLN1_genes.tsv';
bcdf='GSM3535276_AXLN1_barcodes.tsv';
[X,genelist,barcodelist]=sc_readmtxfile(mtxf,genf,bcdf,2);
If barcodes.tsv is not available, use the following:
mtxf='GSM3535276_AXLN1_matrix.mtx';
genf='GSM3535276_AXLN1_genes.tsv';
[X,g]=sc_readmtxfile(mtxf,genf,[],2);
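For orientation, matrix.mtx is a Matrix Market coordinate file: comment lines starting with %, one line giving the numbers of rows, columns, and nonzero entries, then one "row column value" triplet per line. The sketch below writes a tiny file of that form to a temporary location and reads it back with plain MATLAB. It illustrates the file format only and is not meant to describe how sc_readmtxfile is implemented; the toy values are made up.

```matlab
% Write a tiny 3-row x 2-column Matrix Market file (toy values)
mtxf = [tempname '.mtx'];
fid = fopen(mtxf, 'w');
fprintf(fid, '%%%%MatrixMarket matrix coordinate integer general\n');
fprintf(fid, '3 2 3\n');                 % rows, columns, number of nonzero entries
fprintf(fid, '1 1 5\n2 1 1\n3 2 7\n');   % row, column, value triplets
fclose(fid);

% Read it back: skip comment lines, parse the size line, then the triplets
fid = fopen(mtxf, 'r');
tline = fgetl(fid);
while ischar(tline) && startsWith(tline, '%')
    tline = fgetl(fid);
end
dims = sscanf(tline, '%d %d %d');               % [nRows; nCols; nNonzero]
ijv  = fscanf(fid, '%f %f %f', [3 dims(3)])';   % one triplet per row
fclose(fid);
delete(mtxf);

% Assemble the sparse matrix from the triplets
X = sparse(ijv(:,1), ijv(:,2), ijv(:,3), dims(1), dims(2));
disp(full(X))
```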
Process expression matrix, X and gene list, g
Here is an example of raw data processing.
[X,g,b]=sc_readmtxfile('matrix.mtx','features.tsv','barcodes.tsv',2);
[X,g]=sc_qcfilter(X,g);
[X,g]=sc_selectg(X,g,1,0.05);
[s]=sc_tsne(X);
scgeatool(X,g,s)
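The gene-selection step can be pictured as a simple filter. The sketch below keeps genes detected in at least 5% of cells. Reading the arguments 1 and 0.05 of sc_selectg as a minimum count and a minimum fraction of cells is an assumption, and the toy data and variable names are illustrative.

```matlab
% Toy data: 4 genes x 40 cells
X = zeros(4, 40);
X(1,:)   = 3;       % gene 1: detected in every cell
X(2,1:4) = 1;       % gene 2: detected in 4 of 40 cells (10%)
X(3,1)   = 9;       % gene 3: detected in 1 of 40 cells (2.5%)
                    % gene 4: never detected
g = ["G1"; "G2"; "G3"; "G4"];

minCount = 1;       % a gene counts as detected in a cell if its count is >= minCount
minFrac  = 0.05;    % required fraction of cells in which the gene is detected

keep = mean(X >= minCount, 2) >= minFrac;   % logical index over genes
X = X(keep, :);
g = g(keep);
disp(g')
```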
t-SNE embedding of cells using highly variable genes (HVGs)
[~,Xhvg]=sc_hvg(X,g);
[s]=sc_tsne(Xhvg(1:2000,:));
scgeatool(X,g,s)
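As a rough illustration of what ranking genes by variability can look like, the sketch below orders genes by their variance-to-mean ratio and keeps the top ones, capping the number at the number of genes available. This simple ratio is a stand-in chosen for brevity; the method actually used by sc_hvg is not described here and may well differ. All data are synthetic.

```matlab
rng(2);
nCells = 200;
X = [randi([0 1], 3, nCells);          % 3 low-variability genes (values 0 or 1)
     randi([0 1], 2, nCells) * 20];    % 2 high-variability genes (values 0 or 20)
g = ["low1"; "low2"; "low3"; "high1"; "high2"];

mu  = mean(X, 2);                      % per-gene mean across cells
v   = var(X, 0, 2);                    % per-gene variance across cells
vmr = v ./ max(mu, eps);               % variance-to-mean ratio, guarding against mu = 0

[~, idx] = sort(vmr, 'descend');       % most variable genes first
nTop = min(2, numel(idx));             % cap at the number of genes available
Xhvg = X(idx(1:nTop), :);
ghvg = g(idx(1:nTop));
disp(ghvg')
```

The same kind of cap applies to the example above: indexing Xhvg(1:2000,:) raises an error when fewer than 2000 genes remain, and min(2000,size(Xhvg,1)) is one way to guard against that.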
An example pipeline for raw data processing
[X,g]=sc_readmtxfile('matrix.mtx','features.tsv');
[X,g]=sc_qcfilter(X,g); % basic QC
[X,g]=sc_selectg(X,g,1,0.05); % select genes expressed in at least 5% of cells
[~,Xhvg]=sc_hvg(X,g); % identify highly variable genes (HVGs)
[s]=sc_tsne(Xhvg(1:2000,:)); % using expression of top 2000 HVGs for tSNE
sce=SingleCellExperiment(X,g,s); % make SCE class
sce=sce.estimatepotency(2); % estimate differentiation potency (1-human; 2-mouse)
sce=sce.estimatecellcycle; % estimate cell cycle phase
id=sc_cluster_s(s,10); % clustering on tSNE coordinates using k-means
sce.c_cluster_id=id; % assigning cluster Ids to SCE class
scgeatool(sce) % visualize cells
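The clustering step groups cells by their embedding coordinates. A minimal sketch with MATLAB's kmeans function (Statistics and Machine Learning Toolbox) on synthetic 2-D coordinates is shown below. Whether sc_cluster_s calls kmeans in this way is an assumption based on the comment in the pipeline above.

```matlab
rng(3);
% Synthetic embedding: three well-separated groups of 50 cells each
s = [randn(50,2); randn(50,2) + 10; randn(50,2) - 10];

k  = 3;                                  % number of clusters
id = kmeans(s, k, 'Replicates', 5);      % cluster index (1..k) for each cell

% Number of cells assigned to each cluster
clusterSizes = accumarray(id, 1)';
disp(clusterSizes)
```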
An example pipeline for processing 10x data folder
The following code assumes that the .m file containing it is in the folder ./filtered_feature_bc_matrix, and that this folder contains three files: matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz.
[X,genelist,celllist]=sc_read10xdir(pwd);
sce=SingleCellExperiment(X,genelist);
sce.c_cell_id=celllist;
sce=sce.qcfilter;
sce=sce.estimatecellcycle;
sce=sce.estimatepotency("mouse");
sce=sce.embedcells('tSNE',true);
save clean_data sce -v7.3
scgeatool(sce)
Merge two data sets (WT and KO)
load WT/clean_data.mat sce
sce_wt=sce;
load KO/clean_data.mat sce
sce_ko=sce;
sce=sc_mergesces({sce_wt,sce_ko},'union'); % use parameter 'union' or 'intersect' to merge genes
sce.c=sce.c_batch_id;
scgeatool(sce) % blue - WT and red - KO
You may want to re-compute tSNE coordinates after merging.
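The 'union' and 'intersect' options differ in which genes survive the merge. The sketch below illustrates the idea on two toy matrices with plain MATLAB set operations. Filling genes that are missing from one data set with zeros in the union case is an assumption about how such a merge is commonly done, not a statement about the internals of sc_mergesces. All variable names are illustrative.

```matlab
% Toy data: genes in rows, cells in columns
g_wt = ["A"; "B"; "C"];   X_wt = [1 2; 3 4; 5 6];   % 2 WT cells
g_ko = ["B"; "C"; "D"];   X_ko = [7; 8; 9];         % 1 KO cell
n_wt = size(X_wt, 2);     n_ko = size(X_ko, 2);

% 'intersect': keep only genes present in both data sets
[g_int, i_wt, i_ko] = intersect(g_wt, g_ko, 'stable');
X_int = [X_wt(i_wt,:), X_ko(i_ko,:)];

% 'union': keep all genes, filling genes absent from a data set with zeros
g_uni = union(g_wt, g_ko, 'stable');
[~, r_wt] = ismember(g_wt, g_uni);       % rows of the merged matrix for WT genes
[~, r_ko] = ismember(g_ko, g_uni);       % rows of the merged matrix for KO genes
X_uni = zeros(numel(g_uni), n_wt + n_ko);
X_uni(r_wt, 1:n_wt)     = X_wt;
X_uni(r_ko, n_wt+1:end) = X_ko;

% Batch label for each cell, in column order
batch = [repmat("WT", n_wt, 1); repmat("KO", n_ko, 1)];
```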
Case Studies and Tutorials
Download 10x Genomics data files from GEO
From the GEO database, we obtain the FTP links to the data files we need. Here we use a data set from sample GSM3535276 as an example (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3535276). The sample is human AXLN1 lymphatic endothelial cells.
Supplementary file | Size | Download | File type/resource |
GSM3535276_AXLN1_barcodes.tsv.gz | 33.6 Kb | (ftp)(http) | TSV |
GSM3535276_AXLN1_genes.tsv.gz | 251.2 Kb | (ftp)(http) | TSV |
GSM3535276_AXLN1_matrix.mtx.gz | 45.8 Mb | (ftp)(http) | MTX |
We can use the gunzip function to download and unzip the files directly.
gunzip('https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3535nnn/GSM3535276/suppl/GSM3535276_AXLN1_matrix.mtx.gz');
gunzip('https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3535nnn/GSM3535276/suppl/GSM3535276_AXLN1_genes.tsv.gz');
We can then use the code below to import data into MATLAB.
[X,g]=sc_readmtxfile('GSM3535276_AXLN1_matrix.mtx','GSM3535276_AXLN1_genes.tsv');
scgeatool(X,g)
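The download URLs follow a regular pattern: the folder name replaces the last three digits of the accession with "nnn". The sketch below builds the URLs for all three supplementary files listed in the table above from the accession and file prefix. The doDownload flag and the variable names are illustrative, and the barcodes URL is inferred from the same pattern rather than copied from the GEO page.

```matlab
acc    = "GSM3535276";
prefix = "GSM3535276_AXLN1_";
files  = prefix + ["matrix.mtx.gz", "genes.tsv.gz", "barcodes.tsv.gz"];

% Folder stub: accession with its last three digits replaced by 'nnn'
stub = extractBefore(acc, strlength(acc) - 2) + "nnn";
base = "https://ftp.ncbi.nlm.nih.gov/geo/samples/" + stub + "/" + acc + "/suppl/";
urls = base + files;
disp(urls')

doDownload = false;          % set to true to actually fetch and unzip the files
if doDownload
    for k = 1:numel(urls)
        gunzip(char(urls(k)));   % downloads and decompresses into the current folder
    end
end
```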
Process downloaded 10x Genomics data files
A 10x Genomics data folder should contain matrix.mtx and genes.tsv. Here is the command-line code for raw data processing.
[X,g]=sc_readmtxfile('matrix.mtx','genes.tsv');
[X,g]=sc_qcfilter(X,g);
[X,g]=sc_selectg(X,g,1,0.05);
[s]=sc_tsne(X);
scgeatool(X,g,s)
Download Drop-seq data files from GEO
From the GEO database, we obtain the FTP links to the data files we need. Here we use a data set from sample GSM3036814 as an example (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3036814). The sample is mouse lung cells.
Supplementary file | Size | Download | File type/resource |
GSM3036814_Control_6_Mouse_lung_digital_gene_expression_6000.dge.txt.gz | 1.7 Mb | (ftp)(http) | TXT |
We can use the gunzip function to download and unzip the file directly.
gunzip('https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3036nnn/GSM3036814/suppl/GSM3036814_Control_6_Mouse_lung_digital_gene_expression_6000.dge.txt.gz')
We can then use the code below to import data into MATLAB.
[X,g]=sc_readtsvfile('GSM3036814_Control_6_Mouse_lung_digital_gene_expression_6000.dge.txt');
[X,g]=sc_qcfilter(X,g);
[X,g]=sc_selectg(X,g,1,0.05);
[s]=sc_tsne(X);
scgeatool(X,g,s)
Import Seurat RData
For example, suppose we want to read the files from https://www.synapse.org/#!Synapse:syn22855256. They are described as follows: pbmc_discovery_v1.RData and pbmc_replication_v1.RData are Seurat objects containing the gene expression raw counts and log-normalized data, the phenotype Label (“CI” for MCI, “C” for control), and the inferred cell identity of the discovery and replication cohorts, respectively. The R code below exports the raw counts and metadata of the discovery cohort:
library(Seurat)
library(Matrix)
load('pbmc_discovery_v1.RData')
countMatrix <- pbmc_discovery@assays$RNA@counts
writeMM(obj = countMatrix, file = 'matrix.mtx')
writeLines(text = rownames(countMatrix), con = 'features.tsv')
writeLines(text = colnames(countMatrix), con = 'barcodes.tsv')
metadata <- pbmc_discovery@meta.data
write.csv(x = metadata, file = 'metadata.csv', quote = FALSE)
After exporting the Seurat object data into these files, you can use MATLAB to read them:
[X,genelist,barcodelist]=sc_readmtxfile('matrix.mtx','features.tsv','barcodes.tsv',1);
sce=SingleCellExperiment(X,genelist);
T=readtable('metadata.csv');
c=string(T.Label);
sce.c_batch_id=c;
scgeatool(sce)
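Assigning the labels directly relies on the metadata rows being in the same order as the columns of X. Where that is in doubt, the rows can be matched to the barcodes first. The sketch below shows one way to do this on toy data. The column name Barcode is hypothetical (the name that readtable gives to the row-name column written by write.csv may differ), and the barcodes are made up.

```matlab
% Toy stand-ins: barcodes in matrix order, metadata in a different order
barcodelist = ["AAAC"; "AAAG"; "AACT"];
T = table(["AACT"; "AAAC"; "AAAG"], ["CI"; "C"; "CI"], ...
          'VariableNames', {'Barcode', 'Label'});

% For each barcode, find the matching metadata row
[found, idx] = ismember(barcodelist, T.Barcode);
assert(all(found), 'Some barcodes have no metadata row');

c = T.Label(idx);            % labels reordered to match the columns of X
disp(c')
```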
Import data from a TSV/Excel file
If your scRNA-seq data is in an Excel file, save it as a TSV or CSV file in a format like this:
genes X1 X2 X3 X4 X5 X6 X7 X8 X9
NOC2L 1 1 2 3 3 2 0 1 3
HES4 50 15 19 50 8 87 23 25 29
ISG15 279 312 425 180 406 408 335 403 398
AGRN 3 4 9 5 2 3 8 8 9
SDF4 2 2 4 0 5 0 4 2 5
B3GALT6 2 1 0 0 1 0 1 1 0
UBE2J2 1 2 3 1 1 1 6 3 4
SCNN1D 0 1 0 0 0 0 0 0 0
ACAP3 1 3 1 0 1 0 0 1 0
Then you can use the function sc_readtsvfile to import the data. Here is an example:
cdgea;
[X,g]=sc_readtsvfile('example_data/GSM3204304_P_P_Expr.csv');
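To make the expected layout concrete, the sketch below writes the first rows of the table shown above to a temporary tab-delimited file and reads it back with MATLAB's readtable, splitting it into a gene list and a numeric matrix. This is an illustration in plain MATLAB under the assumption of a tab delimiter; it is not meant to describe how sc_readtsvfile works internally.

```matlab
% Write a small tab-delimited file: a header row, then one row per gene
tsvf = [tempname '.tsv'];
fid = fopen(tsvf, 'w');
fprintf(fid, 'genes\tX1\tX2\tX3\n');
fprintf(fid, 'NOC2L\t1\t1\t2\n');
fprintf(fid, 'HES4\t50\t15\t19\n');
fprintf(fid, 'ISG15\t279\t312\t425\n');
fclose(fid);

% Read it back as a table
T = readtable(tsvf, 'FileType', 'text', 'Delimiter', '\t', ...
              'ReadVariableNames', true);
delete(tsvf);

g = string(T{:,1});                                  % gene names (first column)
X = T{:,2:end};                                      % expression matrix: genes x cells
cellids = string(T.Properties.VariableNames(2:end)); % column headers as cell ids
disp(X)
```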
Visualize data in 6D
cdgea;
load example_data/example10xdata.mat
% s=sc_tsne(X,6,false,true);
s=s_tsne6; % using pre-computed 6-d embedding S_TSNE6
gui.sc_multiembeddings(s(:,1:3),s(:,4:6));
This displays the 6-D embedding as two 3-D views, one for dimensions 1-3 and one for dimensions 4-6.
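For readers who want to see the idea without the toolbox, a 6-column embedding can be shown as two side-by-side 3-D scatter plots with built-in MATLAB graphics. The sketch below uses random numbers in place of a real embedding and is not the implementation of gui.sc_multiembeddings.

```matlab
rng(4);
s = randn(100, 6);                       % synthetic stand-in for a 6-column embedding

fig = figure;
ax1 = subplot(1, 2, 1);                  % left panel: first three dimensions
scatter3(s(:,1), s(:,2), s(:,3), 10, 'filled');
title('dimensions 1-3');

ax2 = subplot(1, 2, 2);                  % right panel: last three dimensions
scatter3(s(:,4), s(:,5), s(:,6), 10, 'filled');
title('dimensions 4-6');
```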
YouTube Playlist
You might also want to take a look at scGEAToolbox in action: see the YouTube playlist
Using SCGEATOOL to explore scRNA-seq data stored as SCE class
Label cell type interactively with SCGEATOOL
Slack Channel
Visit the dedicated Slack Channel if you have questions or to report bugs.
Twitter Hashtag
If you create amazing visualizations using scGEAToolbox and you want to tweet them, remember to include the #scGEAToolbox hashtag: we will be happy to retweet.
MATLAB Central
You might be interested in taking a look at the File Exchange.
A1: The scGEAToolbox Paper
Cai JJ, “scGEAToolbox: a Matlab toolbox for single-cell RNA sequencing data analysis,” Bioinformatics, btz830, (2019).
A2: Papers Citing scGEAToolbox
A3: License Agreement
MIT License
Copyright (c) 2021 James Cai
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.