scfetch - Access and Format Single-cell RNA-seq Datasets from Public Resources
scfetch is designed to accelerate users download and
prepare single-cell datasets from public resources. It can be used
to:
GEO/SRA,
foramt fastq files to standard style that can be
identified by 10x softwares (e.g. CellRanger).GEO/SRA,
support downloading original 10x generated bam files (with
custom tags) and normal bam files, and convert bam
files to fastq files.GEO,
PanglanDB and UCSC Cell Browser, load
the downnloaded matrix to Seurat.Zeenodo and
CELLxGENE.SeuratObject, AnnData,
SingleCellExperiment,
CellDataSet/cell_data_set and loom).scfetch is an R package distributed as part of the CRAN. To install the package,
start R and enter:
# install via CRAN (v0.5.0) # old version, it's better to install via Github
install.packages("scfetch")
# if you install from CRAN, you should install the following packages
# install.packages("devtools") #In case you have not installed it.
devtools::install_github("alexvpickering/GEOfastq") # download fastq
devtools::install_github("cellgeni/sceasy") # format conversion
devtools::install_github("mojaveazure/seurat-disk") # format conversion
devtools::install_github("satijalab/seurat-wrappers") # format conversion
# install via Github (v0.5.0)
devtools::install_github("showteeth/scfetch")
For data structures conversion, scfetch
requires several python pcakages, you can install with:
# install python packages
conda install -c bioconda loompy anndata
# or
pip install anndata loompy
In general, it is recommended to install from Github repository (update more timely).
Once scfetch is installed, it can be loaded by the
following command.
library("scfetch")
Since the downloading process is time-consuming, we provide the commands used to illustrate the usage.
For fastq files stored in SRA, scfetch can extract
sample information and run number with GEO accession number or users can
also provide a dataframe contains the run number of interested
samples.
Extract all samples under GSE130636 and the platform is
GPL20301 (use platform = NULL for all
platforms):
GSE130636.runs <- ExtractRun(acce = "GSE130636", platform = "GPL20301")
With the dataframe contains gsm and run number, scfetch
will download sra files using prefetch. The returned result
is a dataframe contains failed runs. If not NULL, users can
re-run DownloadSRA by setting gsm.df to the
returned result.
# a small test
GSE130636.runs <- GSE130636.runs[GSE130636.runs$run %in% c("SRR9004346", "SRR9004351"), ]
# download, you may need to set prefetch.path
out.folder <- tempdir()
GSE130636.down <- DownloadSRA(
gsm.df = GSE130636.runs,
out.folder = out.folder
)
# GSE130636.down is null or dataframe contains failed runs
The out.folder structure will be:
gsm_number/run_number.
After obtaining the sra files, scfetch provides function
SplitSRA to split sra files to fastq files using
parallel-fastq-dump (parallel, fastest and gzip output),
fasterq-dump (parallel, fast but unzipped output) and
fastq-dump (slowest and gzip output).
For fastqs generated with 10x Genomics, SplitSRA can
identify read1, read2 and index files and format the read1 and read2 to
10x required format (sample1_S1_L001_R1_001.fastq.gz and
sample1_S1_L001_R2_001.fastq.gz). In detail, the file with
read length 26 or 28 is considered as read1, the files with read length
8 or 10 are considered as index files and the remain file is considered
as read2. The read length rules is from Sequencing
Requirements for Single Cell 3’ and Sequencing
Requirements for Single Cell V(D)J.
The returned result is a vector of failed sra files. If not
NULL, users can re-run SplitSRA by setting
sra.path to the returned result.
# parallel-fastq-dump requires sratools.path
# you may need to set split.cmd.path and sratools.path
sra.folder <- tempdir()
GSE130636.split <- SplitSRA(
sra.folder = sra.folder,
fastq.type = "10x", split.cmd.threads = 4
)
scfetch can extract sample information and run number
with GEO accession number or users can also provide a dataframe contains
the run number of interested samples.
GSE138266.runs <- ExtractRun(acce = "GSE138266", platform = "GPL18573")
With the dataframe contains gsm and run number, scfetch
provides DownloadBam to download bam files using
prefetch. It suooorts 10x generated bam files and normal
bam files.
prefetch, scfetch adds
--type TenX to make sure the downloaded bam files contain
these tags.DownloadBam will
download sra files first and then convert sra files to bam files with
sam-dump. After testing the efficiency of
prefetch + sam-dump and sam-dump,
the former is much faster than the latter (52G sra and 72G bam
files):# # use prefetch to download sra file
# prefetch -X 60G SRR1976036
# # real 117m26.334s
# # user 16m42.062s
# # sys 3m28.295s
# # use sam-dump to convert sra to bam
# time (sam-dump SRR1976036.sra | samtools view -bS - -o SRR1976036.bam)
# # real 536m2.721s
# # user 749m41.421s
# # sys 20m49.069s
# use sam-dump to download bam directly
# time (sam-dump SRR1976036 | samtools view -bS - -o SRR1976036.bam)
# # more than 36hrs only get ~3G bam files, too slow
The returned result is a dataframe containing failed runs (either
failed to download sra files or failed to convert to bam files for
normal bam; failed to download bam files for 10x generated bam). If not
NULL, users can re-run DownloadBam by setting
gsm.df to the returned result. The following is an example
to download 10x generated bam file:
# a small test
GSE138266.runs <- GSE138266.runs[GSE138266.runs$run %in% c("SRR10211566"), ]
# download, you may need to set prefetch.path
out.folder <- tempdir()
GSE138266.down <- DownloadBam(
gsm.df = GSE138266.runs,
out.folder = out.folder
)
# GSE138266.down is null or dataframe contains failed runs
The out.folder structure will be:
gsm_number/run_number.
With downloaded bam files, scfetch provides function
Bam2Fastq to convert bam files to fastq files. For bam
files generated from 10x softwares, Bam2Fastq utilizes
bamtofastq tool developed by 10x Genomics.
The returned result is a vector of bam files failed to convert to
fastq files. If not NULL, users can re-run
Bam2Fastq by setting bam.path to the returned
result.
bam.folder <- tempdir()
# you may need to set bamtofastq.path and bamtofastq.paras
GSE138266.convert <- Bam2Fastq(
bam.folder = bam.folder
)
scfetch provides functions for users to download
count matrices and annotations
(e.g. cell type annotation and composition) from GEO and some
single-cell databases (e.g. PanglaoDB and UCSC Cell Browser).
scfetch also supports loading the downloaded data to
Seurat.
Until now, the public resources supported and the returned results:
| Resources | URL | Download Type | Returned results |
|---|---|---|---|
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | count matrix | SeuratObject or count matrix for bulk RNA-seq |
| PanglaoDB | https://panglaodb.se/index.html | count matrix | SeuratObject |
| UCSC Cell Browser | https://cells.ucsc.edu/ | count matrix | SeuratObject |
GEO is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. It provides a very convenient way for users to explore and select interested scRNA-seq datasets.
scfetch provides ExtractGEOMeta to extract
sample metadata, including sample title, source name/tissue,
description, cell type, treatment, paper title, paper abstract,
organism, protocol, data processing methods, et al.
# extract metadata of specified platform
GSE200257.meta <- ExtractGEOMeta(acce = "GSE200257", platform = "GPL24676")
# set VROOM_CONNECTION_SIZE to avoid error: Error: The size of the connection buffer (786432) was not large enough
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 60)
# extract metadata of all platforms
GSE94820.meta <- ExtractGEOMeta(acce = "GSE94820", platform = NULL)
After manually check the extracted metadata, users can
download count matrix and load the count
matrix to Seurat with ParseGEO.
For count matrix, ParseGEO supports downloading the
matrix from supplementary files and extracting from
ExpressionSet, users can control the source by specifying
down.supp or detecting automatically (ParseGEO
will extract the count matrix from ExpressionSet first, if
the count matrix is NULL or contains non-integer values,
ParseGEO will download supplementary files). While the
supplementary files have two main types: single count matrix file
containing all cells and CellRanger-style outputs (barcode, matrix,
feature/gene), users are required to choose the type of supplementary
files with supp.type.
With the count matrix, ParseGEO will load the matrix to
Seurat automatically. If multiple samples available, users can choose to
merge the SeuratObject with merge.
# for cellranger output
out.folder <- tempdir()
GSE200257.seu <- ParseGEO(
acce = "GSE200257", platform = NULL, supp.idx = 1, down.supp = TRUE, supp.type = "10x",
out.folder = out.folder
)
# for count matrix, no need to specify out.folder, download count matrix to tmp folder
GSE94820.seu <- ParseGEO(acce = "GSE94820", platform = NULL, supp.idx = 1, down.supp = TRUE, supp.type = "count")
For bulk RNA-seq, set
data.type = "bulk" in ParseGEO, this will
return count matrix.
PanglaoDB is a database
which contains scRNA-seq datasets from mouse and human. Up to now, it
contains 5,586,348 cells from 1368 datasets
(1063 from Mus musculus and 305 from Homo sapiens). It has well
organized metadata for every dataset, including tissue, protocol,
species, number of cells and cell type annotation (computationally
identified). Daniel Osorio has developed rPanglaoDB to access
PanglaoDB data, the
functions of scfetch here are based on rPanglaoDB.
Since PanglaoDB is no
longer maintained, scfetch has cached all metadata and cell
type composition and use these cached data by default to accelerate,
users can access the cached data with PanglaoDBMeta (all
metadata) and PanglaoDBComposition (all cell type
composition).
scfetch provides StatDBAttribute to summary
attributes of PanglaoDB:
# use cached metadata
StatDBAttribute(df = PanglaoDBMeta, filter = c("species", "protocol"), database = "PanglaoDB")
scfetch provides ExtractPanglaoDBMeta to
select interested datasets with specified species,
protocol, tissue and cell
number (The available values of these attributes can be
obtained with StatDBAttribute). User can also choose to
whether to add cell type annotation to every dataset with
show.cell.type.
scfetch uses cached metadata and cell type composition
by default, users can change this by setting
local.data = FALSE.
hsa.meta <- ExtractPanglaoDBMeta(
species = "Homo sapiens", protocol = c("Smart-seq2", "10x chromium"),
show.cell.type = TRUE, cell.num = c(1000, 2000)
)
scfetch provides
ExtractPanglaoDBComposition to extract cell type annotation
and composition (use cached data by default to accelerate, users can
change this by setting local.data = FALSE).
hsa.composition <- ExtractPanglaoDBComposition(species = "Homo sapiens", protocol = c("Smart-seq2", "10x chromium"))
After manually check the extracted metadata, scfetch
provides ParsePanglaoDB to download count
matrix and load the count matrix to Seurat.
With available cell type annotation, uses can filter datasets without
specified cell type with cell.type. Users can also
include/exclude cells expressing specified genes with
include.gene/exclude.gene.
With the count matrix, ParsePanglaoDB will load the
matrix to Seurat automatically. If multiple datasets available, users
can choose to merge the SeuratObject with merge.
# small test
hsa.seu <- ParsePanglaoDB(hsa.meta[1:3, ], merge = TRUE)
The UCSC Cell Browser is a web-based tool that allows scientists to interactively visualize scRNA-seq datasets. It contains 1040 single cell datasets from 17 different species. And, it is organized with the hierarchical structure, which can help users quickly locate the datasets they are interested in.
scfetch provides ShowCBDatasets to show all
available datasets. Due to the large number of datasets,
ShowCBDatasets enables users to perform lazy load
of dataset json files instead of downloading the json files online
(time-consuming!!!). This lazy load requires users to provide
json.folder to save json files and set
lazy = TRUE (for the first time of run,
ShowCBDatasets will download current json files to
json.folder, for next time of run, with
lazy = TRUE, ShowCBDatasets will load the
downloaded json files from json.folder.). And,
ShowCBDatasets supports updating the local datasets with
update = TRUE.
json.folder <- tempdir()
# first time run, the json files are stored under json.folder
# ucsc.cb.samples = ShowCBDatasets(lazy = TRUE, json.folder = json.folder, update = TRUE)
# second time run, load the downloaded json files
ucsc.cb.samples <- ShowCBDatasets(lazy = TRUE, json.folder = json.folder, update = FALSE)
# always read online
# ucsc.cb.samples = ShowCBDatasets(lazy = FALSE)
The number of datasets and all available species:
# the number of datasets
nrow(ucsc.cb.samples)
# available species
unique(unlist(sapply(unique(gsub(pattern = "\\|parent", replacement = "", x = ucsc.cb.samples$organisms)), function(x) {
unlist(strsplit(x = x, split = ", "))
})))
scfetch provides StatDBAttribute to summary
attributes of UCSC Cell
Browser:
StatDBAttribute(df = ucsc.cb.samples, filter = c("organism", "organ"), database = "UCSC")
scfetch provides ExtractCBDatasets to
filter metadata with collection,
sub-collection, organ, disease
status, organism, project and
cell number (The available values of these attributes
can be obtained with StatDBAttribute except cell
number). All attributes except cell number support fuzzy match
with fuzzy.match, this is useful when selecting
datasets.
hbb.sample.df <- ExtractCBDatasets(all.samples.df = ucsc.cb.samples, organ = c("brain", "blood"), organism = "Human (H. sapiens)", cell.num = c(1000, 2000))
scfetch provides ExtractCBComposition to
extract cell type annotation and composition.
hbb.sample.ct <- ExtractCBComposition(json.folder = json.folder, sample.df = hbb.sample.df)
After manually check the extracted metadata, scfetch
provides ParseCBDatasets to load the online count
matrix to Seurat. All the attributes available in
ExtractCBDatasets are also same here. Please note that the
loading process provided by ParseCBDatasets will load the
online count matrix instead of downloading it to local. If multiple
datasets available, users can choose to merge the SeuratObject with
merge.
hbb.sample.seu <- ParseCBDatasets(sample.df = hbb.sample.df)
scfetch provides functions for users to download
processed single-cell RNA-seq data from Zenodo, CELLxGENE and Human Cell Atlas, including
RDS, RData, h5ad,
h5, loom objects.
Until now, the public resources supported and the returned results:
| Resources | URL | Download Type | Returned results |
|---|---|---|---|
| Zenodo | https://zenodo.org/ | count matrix, rds, rdata, h5ad, et al. | NULL or failed datasets |
| CELLxGENE | https://cellxgene.cziscience.com/ | rds, h5ad | NULL or failed datasets |
| Human Cell Atlas | https://www.humancellatlas.org/ | rds, rdata, h5, h5ad, loom | NULL or failed projects |
Zenodo contains various types of
processed objects, such as SeuratObject which has been
clustered and annotated, AnnData which contains processed
results generated by scanpy.
scfetch provides ExtractZenodoMeta to
extract dataset metadata, including dataset title, description,
available files and corresponding md5. Please note that when the dataset
is restricted access, the returned dataframe will be empty.
# single doi
zebrafish.df <- ExtractZenodoMeta(doi = "10.5281/zenodo.7243603")
# vector dois
multi.dois <- ExtractZenodoMeta(doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"))
After manually check the extracted metadata, users can
download the specified objects with
ParseZenodo. The downloaded objects are controlled by
file.ext and the provided object formats should be
in lower case (e.g. rds/rdata/h5ad).
The returned result is a dataframe containing failed objects. If not
NULL, users can re-run ParseZenodo by setting
doi.df to the returned result.
out.folder <- tempdir()
multi.dois.parse <- ParseZenodo(
doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"),
file.ext = c("rdata", "rds"), out.folder = out.folder
)
The CELLxGENE is a
web server contains 910 single-cell datasets, users can
explore, download and upload own datasets. The downloaded datasets
provided by CELLxGENE
have two formats: h5ad (AnnData v0.8) and
rds (Seurat v4).
scfetch provides ShowCELLxGENEDatasets to
extract dataset metadata, including dataset title, description, contact,
organism, ethnicity, sex, tissue, disease, assay, suspension type, cell
type, et al.
# all available datasets
all.cellxgene.datasets <- ShowCELLxGENEDatasets()
scfetch provides StatDBAttribute to summary
attributes of CELLxGENE:
StatDBAttribute(df = all.cellxgene.datasets, filter = c("organism", "sex"), database = "CELLxGENE")
scfetch provides ExtractCELLxGENEMeta to
filter dataset metadata, the available values of attributes can be
obtained with StatDBAttribute except cell
number:
# human 10x v2 and v3 datasets
human.10x.cellxgene.meta <- ExtractCELLxGENEMeta(
all.samples.df = all.cellxgene.datasets,
assay = c("10x 3' v2", "10x 3' v3"), organism = "Homo sapiens"
)
After manually check the extracted metadata, users can
download the specified objects with
ParseCELLxGENE. The downloaded objects are controlled by
file.ext (choose from "rds" and
"h5ad").
The returned result is a dataframe containing failed datasets. If not
NULL, users can re-run ParseCELLxGENE by
setting meta to the returned result.
out.folder <- tempdir()
ParseCELLxGENE(
meta = human.10x.cellxgene.meta[1:5, ], file.ext = "rds",
out.folder = out.folder
)
There are many tools have been developed to process scRNA-seq data,
such as Scanpy,
Seurat, scran
and Monocle.
These tools have their own objects, such as Anndata of
Scanpy, SeuratObject of Seurat,
SingleCellExperiment of scran and
CellDataSet/cell_data_set of
Monocle2/Monocle3. There are also some file
format designed for large omics datasets, such as loom. To perform a comprehensive scRNA-seq
data analysis, we usually need to combine multiple tools, which means we
need to perform object conversion frequently. To facilitate user
analysis of scRNA-seq data, scfetch provides multiple
functions to perform object conversion between widely used tools and
formats. The object conversion implemented in scfetch has
two main advantages:
SeuratDisk to convert SeuratObject to loom,
use zellkonverter to perform conversion between
SingleCellExperiment and Anndata. When there
is no such tools, we use sceasy to perform conversion.# library
library(Seurat) # pbmc_small
library(scRNAseq) # seger
SeuratObject:
# object
pbmc_small
SingleCellExperiment:
seger <- scRNAseq::SegerstolpePancreasData()
Here, we will convert SeuratObject to
SingleCellExperiment,
CellDataSet/cell_data_set,
Anndata, loom.
The conversion is performed with functions implemented in
Seurat:
sce.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", to = "SCE")
To CellDataSet (The conversion is performed with
functions implemented in Seurat):
# BiocManager::install("monocle") # reuqire monocle
cds.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", reduction = "tsne", to = "CellDataSet")
To cell_data_set (The conversion is performed with
functions implemented in SeuratWrappers):
# remotes::install_github('cole-trapnell-lab/monocle3') # reuqire monocle3
cds3.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", to = "cell_data_set")
AnnData is a Python object, reticulate is
used to communicate between Python and R. User should create a Python
environment which contains anndata package and specify the
environment path with conda.path to ensure the exact usage
of this environment.
The conversion is performed with functions implemented in
sceasy:
# remove pbmc_small.h5ad first
anndata.file <- tempfile(pattern = "pbmc_small_", fileext = ".h5ad")
# you may need to set conda.path
ExportSeurat(
seu.obj = pbmc_small, assay = "RNA", to = "AnnData",
anndata.file = anndata.file
)
The conversion is performed with functions implemented in
SeuratDisk:
loom.file <- tempfile(pattern = "pbmc_small_", fileext = ".loom")
ExportSeurat(
seu.obj = pbmc_small, assay = "RNA", to = "loom",
loom.file = loom.file
)
The conversion is performed with functions implemented in
Seurat:
seu.obj.sce <- ImportSeurat(obj = sce.obj, from = "SCE", count.assay = "counts", data.assay = "logcounts", assay = "RNA")
CellDataSet to SeuratObject (The conversion
is performed with functions implemented in Seurat):
seu.obj.cds <- ImportSeurat(obj = cds.obj, from = "CellDataSet", count.assay = "counts", assay = "RNA")
cell_data_set to SeuratObject (The
conversion is performed with functions implemented in
Seurat):
seu.obj.cds3 <- ImportSeurat(obj = cds3.obj, from = "cell_data_set", count.assay = "counts", data.assay = "logcounts", assay = "RNA")
AnnData is a Python object, reticulate is
used to communicate between Python and R. User should create a Python
environment which contains anndata package and specify the
environment path with conda.path to ensure the exact usage
of this environment.
The conversion is performed with functions implemented in
sceasy:
# you may need to set conda.path
seu.obj.h5ad <- ImportSeurat(
anndata.file = anndata.file, from = "AnnData", assay = "RNA"
)
The conversion is performed with functions implemented in
SeuratDisk and Seurat:
# loom will lose reduction
seu.obj.loom <- ImportSeurat(loom.file = loom.file, from = "loom")
The conversion is performed with functions implemented in
zellkonverter.
# remove seger.h5ad first
seger.anndata.file <- tempfile(pattern = "seger_", fileext = ".h5ad")
SCEAnnData(
from = "SingleCellExperiment", to = "AnnData", sce = seger, X_name = "counts",
anndata.file = seger.anndata.file
)
seger.anndata <- SCEAnnData(
from = "AnnData", to = "SingleCellExperiment",
anndata.file = seger.anndata.file
)
The conversion is performed with functions implemented in
LoomExperiment.
# remove seger.loom first
seger.loom.file <- tempfile(pattern = "seger_", fileext = ".loom")
SCELoom(
from = "SingleCellExperiment", to = "loom", sce = seger,
loom.file = seger.loom.file
)
seger.loom <- SCELoom(
from = "loom", to = "SingleCellExperiment",
loom.file = seger.loom.file
)