DeepMapper, Public scRNA-seq Data Sources for Verification / Benchmarking

Curated, link-verified (HTTP 200, see "Verified" column) public single-cell RNA-seq datasets that back each DeepMapper case study. Every URL below was checked with an HTTP HEAD/GET on 2026-06-20 unless explicitly marked UNVERIFIED.

Ground-truth (GT) legend:
- FACS = cells were antibody/bead-sorted before sequencing → label is experimentally enforced (strongest GT).
- Design = condition label is the experimental design (stim/ctrl, fresh/frozen) → reliable GT for that axis, not for cell type.
- Author = manual/marker-based annotation by original authors (good, but not orthogonal GT).


TL;DR ranking for verifying the CD4 subtype finding (case 1)

| Rank | Dataset | Why | GT |
|---|---|---|---|
| 1 | 10x Zheng 2017 FACS-sorted CD4 subsets (helper / Treg / naive / memory) | This is the DeepMapper "CD4 subset data by 10X"; four populations are antibody-sorted, so the class label is physical, not inferred | FACS |
| 2 | 10x Zheng 2017 FACS-sorted CD8 subsets (cytotoxic / naive cytotoxic) | Same source, covers case 2 | FACS |
| 3 | all_pure_select_11types.rds (Zheng pre-packaged 11 sorted sub-pops) | One 687 KB file = all sorted pops + labels, instant load | FACS |
| 4 | Kang 2018 GSE96583 (IFN-β stim vs ctrl) | Case 3 ground-truth condition axis | Design |
| 5 | 10x Fresh-68k vs Frozen PBMC (Donor A) | Case 4 fresh/stored axis | Design |
| 6 | scanpy pbmc68k_reduced / pbmc3k | Quick smoke-test / general benchmark | Author (68k) / none (3k) |

Case 1 & 2, CD4+ and CD8+ T-cell subtypes (Zheng et al. 2017, FACS-sorted)

Citation: Zheng GXY et al. "Massively parallel digital transcriptional profiling of single cells." Nature Communications 8:14049 (2017).

Ground truth: FACS / bead-enriched sort = physical label. Each population was immuno-sorted before droplet capture, so the cell-type label is the sort gate, not a downstream cluster call. This is the strongest GT available for cell-type classification and is exactly the "CD4+ subset data by 10X" the DeepMapper analyses use.

Format: 10x CellRanger filtered_gene_bc_matrices (matrix.mtx + genes.tsv + barcodes.tsv) inside a .tar.gz. Genes ≈ 32,738 (hg19 reference) before filtering.

Direct download URL pattern (all verified 200 OK, 2026-06-20): https://cf.10xgenomics.com/samples/cell/<slug>/<slug>_filtered_gene_bc_matrices.tar.gz
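The pattern above can be expanded programmatically. The following is a minimal Python sketch that assumes only the URL template and the slugs listed in this document; `matrix_url` and the two slug lists are illustrative names, not part of any 10x tooling.

```python
# Base of the download pattern quoted above.
BASE = "https://cf.10xgenomics.com/samples/cell"

# Slugs as listed in the CD4 / CD8 tables of this document.
CD4_SLUGS = ["cd4_t_helper", "regulatory_t", "naive_t", "memory_t"]
CD8_SLUGS = ["cytotoxic_t", "naive_cytotoxic"]


def matrix_url(slug: str) -> str:
    """Return the filtered-matrix tarball URL for one sample slug."""
    return f"{BASE}/{slug}/{slug}_filtered_gene_bc_matrices.tar.gz"


# One URL per sorted population, in table order.
urls = [matrix_url(slug) for slug in CD4_SLUGS + CD8_SLUGS]
for url in urls:
    print(url)
```

The slug appears twice in each URL (directory and file name), which is why a single helper is less error-prone than hand-editing the wget line.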

CD4 subsets (case 1: 4 classes)

| Population (sort gate) | slug | ~Cells | tar.gz size | Verified |
|---|---|---|---|---|
| CD4+ Helper T | cd4_t_helper | ~11,213 | 21.0 MB | ✅ 200 |
| CD4+/CD25+ Regulatory T (Treg) | regulatory_t | ~10,263 | 19.3 MB | ✅ 200 |
| CD4+/CD45RA+/CD25− Naïve T | naive_t | ~10,479 | 17.6 MB | ✅ 200 |
| CD4+/CD45RO+ Memory T | memory_t | ~10,224 | 20.0 MB | ✅ 200 |

CD8 subsets (case 2: 2 classes)

| Population (sort gate) | slug | ~Cells | tar.gz size | Verified |
|---|---|---|---|---|
| CD8+ Cytotoxic T | cytotoxic_t | ~10,209 | 20.0 MB | ✅ 200 |
| CD8+/CD45RA+ Naïve Cytotoxic T | naive_cytotoxic | ~11,953 | 20.9 MB | ✅ 200 |

Cell counts are the published Zheng-2017 sorted-population sizes (post-CellRanger filtering); treat as approximate (±a few hundred depending on filtering). The byte sizes and HTTP status were directly verified.
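Because the sorted populations are roughly balanced, the chance-level accuracy for each case can be stated before any model is run. The sketch below uses the approximate cell counts from the tables above (CD4 total 42,179; CD8 total 22,162). `majority_baseline` is an illustrative helper, and the result shifts slightly if your filtering changes the counts.

```python
# Approximate post-filtering cell counts, copied from the tables above.
CD4_COUNTS = {
    "cd4_t_helper": 11213,
    "regulatory_t": 10263,
    "naive_t": 10479,
    "memory_t": 10224,
}
CD8_COUNTS = {"cytotoxic_t": 10209, "naive_cytotoxic": 11953}


def majority_baseline(counts: dict) -> float:
    """Accuracy of always predicting the largest class."""
    total = sum(counts.values())
    return max(counts.values()) / total


for name, counts in (("CD4 (case 1)", CD4_COUNTS), ("CD8 (case 2)", CD8_COUNTS)):
    total = sum(counts.values())
    print(f"{name}: {total} cells, majority baseline {majority_baseline(counts):.3f}")
```

A classifier for case 1 therefore needs to clear roughly 27% accuracy, and one for case 2 roughly 54%, before it says anything beyond the class prior.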

Other sorted PBMC populations from the same release (useful for multi-class GT / negatives), all verified 200:

| Population | slug | tar.gz |
|---|---|---|
| CD19+ B cells | b_cells | 18.0 MB |
| CD14+ Monocytes | cd14_monocytes | 4.2 MB |
| CD56+ NK cells | cd56_nk | 20.0 MB |
| CD34+ cells | cd34 | 38.0 MB |

How to get it (one-liner per population):

wget https://cf.10xgenomics.com/samples/cell/cd4_t_helper/cd4_t_helper_filtered_gene_bc_matrices.tar.gz
# swap slug: regulatory_t | naive_t | memory_t | cytotoxic_t | naive_cytotoxic
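After downloading, the approximate cell counts in the tables can be checked without unpacking the archive or installing an analysis stack. This is a sketch using only the Python standard library. It assumes the tarball contains a member whose name ends in `barcodes.tsv` (one barcode per line, as described under Format) and makes no assumption about the directory names inside the archive. `count_cells` is an illustrative name.

```python
import tarfile


def count_cells(tar_path: str) -> int:
    """Count non-empty lines of the barcodes.tsv member inside a 10x .tar.gz."""
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Match by suffix so the internal directory layout does not matter.
            if member.isfile() and member.name.endswith("barcodes.tsv"):
                handle = tar.extractfile(member)
                return sum(1 for line in handle if line.strip())
    raise FileNotFoundError(f"no barcodes.tsv found in {tar_path}")
```

Usage, once the helper archive from the wget line is on disk: `count_cells("cd4_t_helper_filtered_gene_bc_matrices.tar.gz")` should land near the ~11,213 listed above if the table is accurate.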

Pre-packaged mirror (fastest path to labelled CD4/CD8 subsets)

The official 10x analysis repo ships the sorted populations already merged with their sort labels:

| File | Contents | Size | URL | Verified |
|---|---|---|---|---|
| all_pure_select_11types.rds | 11 sorted sub-populations + meta/labels | 687 KB (703,532 B) | https://cf.10xgenomics.com/samples/cell/pbmc68k_rds/all_pure_select_11types.rds | ✅ 200 |
| all_pure_pbmc_data.rds | Full expression of the 10 bead-enriched samples | 1.5 GB | https://cf.10xgenomics.com/samples/cell/pbmc68k_rds/all_pure_pbmc_data.rds | ⚠️ pattern confirmed live; size not byte-checked |
| pbmc68k_data.rds | ~68k fresh PBMC expression (the PBMC68k set) | ~77 MB (80,452,142 B) | https://cf.10xgenomics.com/samples/cell/pbmc68k_rds/pbmc68k_data.rds | ✅ 200 |

Repo / README (download instructions): https://github.com/10XGenomics/single-cell-3prime-paper (pbmc68k_analysis/README.md), ✅ reachable.

# R, fastest route to labelled CD4/CD8 sorted subsets:
download.file("https://cf.10xgenomics.com/samples/cell/pbmc68k_rds/all_pure_select_11types.rds",
              "all_pure_select_11types.rds")
x <- readRDS("all_pure_select_11types.rds")   # contains the 11 sorted sub-populations + labels

Case 3, Stimulated vs unstimulated T cells

Reference numbers: stimulated 4,400 cells / 16,394 genes; unstimulated 5,092 cells / 16,612 genes.

Citation: Kang HM et al. "Multiplexed droplet single-cell RNA-sequencing using natural genetic variation." Nature Biotechnology 36:89-94 (2018).

Accession: GSE96583 (SRA SRP102802). ~15,000 PBMCs, two conditions: control vs IFN-β-stimulated (6 h).

Ground truth: Design (stim/ctrl is the experimental condition). Author cell-type annotations are also provided in the *.tsne.df.tsv metadata.

| File | Size | Direct URL | Verified |
|---|---|---|---|
| GSE96583_RAW.tar (per-sample 10x mtx) | 72.7 MB (76,195,840 B) | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE96nnn/GSE96583/suppl/GSE96583_RAW.tar | ✅ 200 |
| same via GEO portal | 72.7 MB | https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE96583&format=file | ✅ 200 |
| GSE96583_batch2.total.tsne.df.tsv.gz (cell labels + condition) | 738.6 KB | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE96nnn/GSE96583/suppl/GSE96583_batch2.total.tsne.df.tsv.gz | ⚠️ listed on GEO; not byte-checked |
| GSE96583_batch2.genes.tsv.gz | 270.6 KB | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE96nnn/GSE96583/suppl/GSE96583_batch2.genes.tsv.gz | ⚠️ listed on GEO; not byte-checked |

GEO landing page: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96583, ✅ verified, confirms Kang 2018.

Pre-packaged loaders (same data, easiest):

# pertpy (Python), bundles the Kang IFN-β PBMC set with stim/ctrl + cell_type labels:
import pertpy as pt
adata = pt.dt.kang_2018()        # AnnData with .obs['condition'] (ctrl/stimulated) + .obs['cell_type']

# Seurat / SeuratData (R). SeuratData is distributed via GitHub rather than CRAN:
# remotes::install_github("satijalab/seurat-data"); SeuratData::InstallData("ifnb")
library(ifnb.SeuratData); data("ifnb")   # ifnb$stim = "CTRL"/"STIM"

⚠️ The pertpy.dt.kang_2018() and SeuratData::ifnb loader names are widely documented but were not executed or verified in this environment; confirm the exact function and column names against your installed package version. The underlying GEO data IS verified.

Closer match to the ~4,400 / ~5,092 counts (anti-CD3/CD28 T-cell activation)

The Kang set is IFN-β-stimulated PBMCs (~15k cells), not a perfect match to the reference ~4,400/~5,092 split. If DeepMapper used a T-cell-activation (anti-CD3/CD28) stim-vs-rest set with those exact counts, the most likely public source is a resting-vs-stimulated purified T cell 10x study (e.g. the "stimulated/frozen human PBMC" deep-sequencing benchmark in Scientific Data 2023, s41597-023-02348-z). UNVERIFIED, accession not pinned down. Recommend the author confirm the exact sub-cohort; the 4,400/5,092 numbers strongly suggest a single 10x lane each, which is consistent with one activation experiment rather than the multiplexed Kang design.
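One way to test a candidate dataset against the reference numbers is to compare its matrix shape per condition. The sketch below is illustrative only: `matches_reference` is a hypothetical helper, and the 2% relative tolerance is an arbitrary assumption intended to absorb small filtering differences, not a value taken from DeepMapper.

```python
# Reference (cells, genes) per condition, as quoted at the top of this case.
REFERENCE = {
    "stimulated": (4400, 16394),
    "unstimulated": (5092, 16612),
}


def matches_reference(n_cells: int, n_genes: int, condition: str,
                      rel_tol: float = 0.02) -> bool:
    """True if both dimensions are within rel_tol of the reference shape."""
    ref_cells, ref_genes = REFERENCE[condition]
    cells_ok = abs(n_cells - ref_cells) <= rel_tol * ref_cells
    genes_ok = abs(n_genes - ref_genes) <= rel_tol * ref_genes
    return cells_ok and genes_ok


# The two reference conditions together hold 9,492 cells, well below ~15,000.
print(sum(cells for cells, _ in REFERENCE.values()))
```

With an AnnData object this would be called as `matches_reference(adata.n_obs, adata.n_vars, "stimulated")` after subsetting to one condition.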


Case 4, Fresh vs 24h-old / stored cells (viability / storage)

Same Zheng-2017 1.1.0 release; a matched fresh and frozen aliquot of one donor, the canonical fresh-vs-stored benchmark. Ground truth: Design (fresh vs frozen is the experimental axis). Format: 10x mtx .tar.gz.

| Sample | slug | tar.gz | Verified |
|---|---|---|---|
| Fresh 68k PBMC (Donor A) | fresh_68k_pbmc_donor_a | 124.4 MB (124,442,812 B) | ✅ 200 |
| Frozen PBMC (Donor A) | frozen_pbmc_donor_a | 7.5 MB (7,452,623 B) | ✅ 200 |

wget https://cf.10xgenomics.com/samples/cell/fresh_68k_pbmc_donor_a/fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz
wget https://cf.10xgenomics.com/samples/cell/frozen_pbmc_donor_a/frozen_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz
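
When the fresh and frozen matrices are merged, the design label has to travel with every cell, and raw 10x barcodes can repeat across samples. A minimal sketch of that bookkeeping follows. `tag_cells` is a hypothetical helper and the barcodes are toy placeholders, not values read from the real archives.

```python
def tag_cells(barcodes, condition):
    """Pair each barcode with its design label and make the cell ID unique per sample."""
    return [(f"{condition}:{barcode}", condition) for barcode in barcodes]


# Toy barcodes (placeholders); the first one deliberately appears in both samples.
fresh = tag_cells(["AAACATACAACCAC-1", "AAACATTGAGCTAC-1"], "fresh")
frozen = tag_cells(["AAACATACAACCAC-1"], "frozen")

merged = fresh + frozen
cell_ids = [cell_id for cell_id, _ in merged]

# After prefixing, the shared raw barcode no longer collides.
assert len(cell_ids) == len(set(cell_ids))
```

The second element of each pair is the Design ground truth for case 4; it is known from which archive the cell came, not inferred from expression.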

Time-delay (fresh vs N-hour-old) literature sources

For a fresh vs 24h-delayed-processing axis specifically (as opposed to frozen), the published benchmarks are:
- Genome Biology 2017 "Single-cell transcriptome conservation in cryopreserved cells and tissues", https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1171-9 (✅ page reachable; the GEO accession is in the paper but not pinned here, so the accession is UNVERIFIED).
- "Effects of Cryopreservation and Thawing on Single-Cell Transcriptomes of Human T Cells" (PMC7458795), relevant to T cells specifically; accession UNVERIFIED.

If "24h-old" is literal (delayed processing, not freezing), confirm the exact GEO accession from the DeepMapper methods; the 10x fresh/frozen pair above is the closest single-download GT.


Case 5, General benchmarking sets with ground truth

| Dataset | Loader / URL | Cells × Genes | GT | Verified |
|---|---|---|---|---|
| PBMC3k (10x healthy donor) | import scanpy as sc; sc.datasets.pbmc3k(), or https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz (7.6 MB) | 2,700 × 32,738 | None built-in (annotate via tutorial); sc.datasets.pbmc3k_processed() adds louvain labels | ✅ 200 (tar) |
| PBMC68k (Zheng 2017) | sc.datasets.pbmc68k_reduced() (subset 700×765 w/ labels) or pbmc68k_data.rds above | reduced 700 × 765 (full ~68k) | Author; label key bulk_labels (reduced set) | ✅ (rds 200) |
| Kang 2018 ifnb | pertpy.dt.kang_2018() / SeuratData ifnb / GSE96583 | ~15,000 × ~35k | Design (stim/ctrl) + Author cell_type | ✅ GEO 200; loader UNVERIFIED |
| Tabula Muris | https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues/27733 | ~100k mouse | Author (FACS plate + droplet) | ⚠️ UNVERIFIED (figshare project id) |
| Tabula Sapiens | https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219 / CELLxGENE | ~500k human | Author | ⚠️ UNVERIFIED |
| CellTypist Immune_All (training/annotation models) | https://www.celltypist.org/models | n/a | Author-curated | ⚠️ UNVERIFIED |
| scIB integration benchmark sets | https://github.com/theislab/scib (Immune_ALL_human etc.) | varies | Author | ⚠️ UNVERIFIED |

import scanpy as sc
adata3k  = sc.datasets.pbmc3k()              # 2700 x 32738, raw
adata3kp = sc.datasets.pbmc3k_processed()    # adds 'louvain' cell-type labels
adata68k = sc.datasets.pbmc68k_reduced()     # 700 x 765; .obs['bulk_labels'] = GT

scanpy.datasets loader names/dimensions verified against scanpy docs; not executed locally.


What is verified vs not

  • Fully verified (HTTP 200 + byte size): all 6 Zheng CD4/CD8 sorted tar.gz; b_cells / cd14_monocytes / cd56_nk / cd34; pbmc3k/4k/8k; fresh & frozen Donor A; pbmc68k_data.rds; all_pure_select_11types.rds; GSE96583_RAW.tar (both mirrors).
  • Reachable but byte size not re-checked: all_pure_pbmc_data.rds (1.5 GB), GSE96583 per-batch .tsv.gz metadata files, GenomeBiology cryo paper page.
  • UNVERIFIED (names from docs/literature, not executed/pinned here): pertpy.dt.kang_2018(), SeuratData::ifnb, Tabula Muris/Sapiens figshare IDs, CellTypist model URLs, scIB sets, the exact 4,400/5,092 T-cell-activation accession, and the fresh-vs-24h delayed-processing GEO accession.
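
The three tiers above correspond to a simple check per URL: HTTP status first, then byte size where an expected size is known. The sketch below shows one way to express that with the Python standard library. It is not the script used to produce this document; `head` and `classify` are illustrative names, and `head` needs network access.

```python
import urllib.request


def head(url: str, timeout: float = 30.0):
    """Send an HTTP HEAD request; return (status, content_length or None).

    Note: urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
        return response.status, (int(length) if length is not None else None)


def classify(status: int, size=None, expected_size=None) -> str:
    """Map a status / size pair onto the tiers used in this section."""
    if status != 200:
        return "UNVERIFIED"
    if size is None or expected_size is None:
        return "reachable, size not checked"
    return "verified" if size == expected_size else "size mismatch"


# Example with the byte size listed for all_pure_select_11types.rds (703,532 B).
print(classify(200, 703532, expected_size=703532))
```

Separating the network call from the classification keeps the tier logic testable offline; some servers omit Content-Length on HEAD, which is why a missing size is treated as "reachable" rather than as a failure.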

See scripts/download_data.sh for an executable downloader covering all verified URLs.