API reference¶
Generated from the source docstrings. The pure modules (transform, accumulate,
config, earlystop) import with the standard library only; the heavier modules load
torch and numpy lazily.
Configuration¶
pydeepmapper.config ¶
Configuration for a DeepMapper run, plain dataclasses, no heavy deps.
Importing this module is cheap (stdlib only); torch/timm are only touched when a backbone is actually built. The defaults encode two project decisions:
- Minimal augmentation. Pseudo-images are simpler and far less variable than
natural photos, and geometric augmentations (flips/rotations/crops) would scramble
the fixed feature->pixel layout. So augmentation is OFF by default; see
docs/design.md and the benchmark that verifies this.
- Small backbone first. The default backbone is a small CNN; the hypothesis that small, locality-biased models are faster and better here than large data-hungry ones is something the benchmark is set up to verify.
AugmentConfig dataclass ¶
Augmentation policy. Default = identity (no augmentation).
Source code in pydeepmapper/config.py
BackboneSpec dataclass ¶
Which model sits behind the pseudo-image. CNN <-> ViT <-> autoencoder swap happens here and nowhere else (see backbones.build).
Source code in pydeepmapper/config.py
DeepMapperConfig dataclass ¶
A full iterate-N-accumulate run.
Source code in pydeepmapper/config.py
Runner¶
pydeepmapper.runner ¶
The DeepMapper iterate-N-accumulate orchestrator.
Generalises the legacy deepmap(X, y, num_passes, min_accuracy, max_tries) loop:
each pass draws a fresh feature->pixel arrangement (seeded permutation), builds
pseudo-images, trains the swappable backbone, gates on held-out accuracy, then
attributes and back-projects to per-feature findings. After N accepted passes the
pure accumulators (accumulate.py) produce the robust, stability-scored result.
The numeric heavy-lifting needs torch; the accumulation is the pure, contracted core and is unit-tested without torch. Importing this module is cheap.
Findings dataclass ¶
Accumulated per-feature result of a run.
Source code in pydeepmapper/runner.py
ranking ¶
Top features by selection frequency (then median importance), as
(name_or_index, selection_frequency, median_importance) tuples.
Source code in pydeepmapper/runner.py
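The ordering described for ranking can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `rank_features`, its arguments, and the index tie-break are assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple, Union


def rank_features(
    selection_frequency: Sequence[float],
    median_importance: Sequence[float],
    names: Optional[Sequence[str]] = None,
    top: Optional[int] = None,
) -> List[Tuple[Union[str, int], float, float]]:
    """Sort features by selection frequency, then median importance (both descending)."""
    if len(selection_frequency) != len(median_importance):
        raise ValueError("inputs must have one entry per feature")
    order = sorted(
        range(len(selection_frequency)),
        # Negate so that larger values sort first; the index breaks exact ties.
        key=lambda i: (-selection_frequency[i], -median_importance[i], i),
    )
    if top is not None:
        order = order[:top]
    return [
        (names[i] if names is not None else i, selection_frequency[i], median_importance[i])
        for i in order
    ]
```

When no names are supplied, the feature index stands in for the name, which matches the `name_or_index` element of the tuples described above.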
run ¶
Execute the full iterate-N-accumulate pipeline. Returns a Findings instance.
X: (n_samples, n_features) array-like. y: integer class labels.
Source code in pydeepmapper/runner.py
Linear baseline¶
pydeepmapper.linear_baseline ¶
Deterministic linear backbone for DeepMapper: penalised logistic regression on the FULL feature set, with no imagification and no multi-pass accumulation.
Why this exists (the methodological point): for a linear classifier the iterate-N-accumulate machinery is redundant. (i) A flatten+linear model is permutation-equivariant, so the feature->pixel arrangement is irrelevant and averaging over random arrangements removes nothing. (ii) Logistic regression is convex, so a deterministic solver (lbfgs run to convergence) returns a unique optimum; every pass would be identical, so N=1 is exact. (iii) Integrated gradients on a linear model equal coef*(x - baseline), so the attribution ranking is exactly the (standardised) coefficients and no gradient sampling is needed.
Imagification and iterate-N-accumulate are therefore for arrangement-SENSITIVE backbones (CNN/ViT); the linear analysis needs neither. This module makes that concrete and gives an exact, reproducible baseline.
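Points (i) and (iii) can be checked numerically with a few lines of standard-library Python. The sketch below uses a single-class linear score with made-up weights; the helper names `linear_logit` and `integrated_gradients` are hypothetical and are not part of pydeepmapper.

```python
import random
from typing import List, Sequence


def linear_logit(w: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """A single-class linear score f(x) = w . x + bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + bias


def integrated_gradients(
    w: Sequence[float], x: Sequence[float], baseline: Sequence[float], steps: int = 32
) -> List[float]:
    """Riemann-sum integrated gradients for the linear score above."""
    summed = [0.0] * len(w)
    for _ in range(steps):
        # The gradient of a linear score is w at every point on the straight
        # path from baseline to x, so each step contributes the same vector.
        summed = [s + wi for s, wi in zip(summed, w)]
    return [(xi - bi) * s / steps for xi, bi, s in zip(x, baseline, summed)]


# Illustrative values only.
w = [0.5, -2.0, 1.5]
x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

# (iii) Integrated gradients match the closed form coef * (x - baseline).
closed_form = [wi * (xi - bi) for wi, xi, bi in zip(w, x, baseline)]
ig = integrated_gradients(w, x, baseline)

# (i) Rearranging the features, with their weights moved along with them,
# leaves the score unchanged.
perm = list(range(len(x)))
random.Random(0).shuffle(perm)
score = linear_logit(w, 0.1, x)
score_permuted = linear_logit([w[p] for p in perm], 0.1, [x[p] for p in perm])
```

Because the gradient is constant along the path, the number of integration steps does not affect the result, which is why no gradient sampling is needed for the linear case.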
fit ¶
Fit the deterministic linear backbone. Returns (clf, scaler, ranking) where ranking is a list of (feature_name_or_index, importance) sorted descending. Importance = max_class |standardised coef|.
Source code in pydeepmapper/linear_baseline.py
Data loading¶
pydeepmapper.io ¶
Data intake for real scRNA-seq benchmarks (AnnData / .h5ad).
Turns an AnnData into the plain (X, y, var_names, label_names) the DeepMapper
runner consumes, and provides the preprocessing parity helpers the paper cases need:
log-normalization, highly-variable-gene selection (to measure what filtering costs),
and the gene-intersection of two datasets (the stim-vs-unstim "common genes only"
case, C5). scanpy/anndata are imported lazily, so importing this module is cheap.
See docs/paper-cases.md.
Dataset dataclass ¶
A benchmark-ready matrix: cells × genes, integer labels, names.
Source code in pydeepmapper/io.py
load_h5ad ¶
load_h5ad(path: str, label_key: str, normalize: bool = True, target_sum: float = 10000.0) -> Dataset
Load an AnnData .h5ad file into a Dataset. label_key is the obs column
holding the ground-truth class. normalize applies normalize_total + log1p
(skip if the matrix is already normalized).
Source code in pydeepmapper/io.py
load_10x_populations ¶
load_10x_populations(specs, normalize: bool = True, target_sum: float = 10000.0, n_per_class: Optional[int] = None, seed: int = 0) -> Dataset
Load several FACS-sorted 10x populations and merge into one labeled Dataset.
specs: list of (mtx_dir, label), e.g. the Zheng CD4 subsets. Cells from each
are tagged with label (the ground-truth FACS class); matrices are concatenated on
shared genes (these share the same 10x reference, so genes align). Optionally subsample
n_per_class cells per population (to cap big runs). No gene filtering is applied; that is the point.
Source code in pydeepmapper/io.py
highly_variable_subset ¶
Return the HVG-filtered dataset + the kept gene indices, used to measure what dimension reduction discards (run DeepMapper on full vs HVG, see if attribution genes survive selection).
Source code in pydeepmapper/io.py
intersect_genes ¶
Restrict two datasets to their shared genes (C5 "common genes only"). Returns the two restricted datasets + the shared gene list, both column-aligned to that order.
Source code in pydeepmapper/io.py
exclusive_genes ¶
Genes unique to each dataset (the "trivially separating" features in C5).
Source code in pydeepmapper/io.py
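The shared/exclusive gene split used for the C5 case can be illustrated on plain gene-name lists. This is a sketch under assumptions: the helper names are hypothetical, and keeping the shared genes in the first dataset's order is a choice made for the example, not a documented behaviour of intersect_genes.

```python
from typing import List, Sequence, Tuple


def shared_genes(genes_a: Sequence[str], genes_b: Sequence[str]) -> List[str]:
    """Genes present in both datasets, kept in the order of genes_a (an assumption)."""
    in_b = set(genes_b)
    return [g for g in genes_a if g in in_b]


def column_indices(genes: Sequence[str], keep: Sequence[str]) -> List[int]:
    """Column positions that reorder a matrix with columns `genes` to `keep` order."""
    position = {g: i for i, g in enumerate(genes)}
    return [position[g] for g in keep]


def exclusive(genes_a: Sequence[str], genes_b: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Genes found only in the first dataset and only in the second."""
    set_a, set_b = set(genes_a), set(genes_b)
    only_a = [g for g in genes_a if g not in set_b]
    only_b = [g for g in genes_b if g not in set_a]
    return only_a, only_b
```

Indexing each matrix with its own `column_indices` result yields two matrices whose columns refer to the same genes in the same order, which is what column alignment means here.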
export_for_seurat ¶
Write counts.csv (cells × genes) + labels.csv (cell,label) for
bench/seurat_pipeline.R. Returns outdir. Cell ids are cell0..cellN.
Source code in pydeepmapper/io.py
Evaluation¶
pydeepmapper.evaluate ¶
Held-out evaluation with arrangement-ensembling, for confusion matrices & metrics.
The iterate-N runner (runner.run) accumulates attribution; this module accumulates
predictions. It fixes ONE stratified train/test split, trains the backbone under N
arrangements, and ensembles the softmax over arrangements into a final prediction, the
principled DeepMapper class call. Returns y_true / y_pred / y_proba + a sklearn report,
ready for plots.py. Needs torch (the pure core stays torch-free).
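A small sketch of ensembling class probabilities over arrangements follows. It assumes the ensemble is a plain mean of the per-arrangement softmax outputs followed by an argmax; the source does not state the exact combination rule, and the function name `ensemble_predictions` is hypothetical.

```python
from typing import List, Sequence, Tuple

Probabilities = Sequence[Sequence[float]]  # shape: [n_samples][n_classes]


def ensemble_predictions(per_arrangement: Sequence[Probabilities]) -> Tuple[List[List[float]], List[int]]:
    """Average softmax outputs over arrangements, then take the arg-max class."""
    if not per_arrangement:
        raise ValueError("need at least one arrangement")
    n_arrangements = len(per_arrangement)
    n_samples = len(per_arrangement[0])
    n_classes = len(per_arrangement[0][0])
    mean_proba = [
        [sum(p[s][c] for p in per_arrangement) / n_arrangements for c in range(n_classes)]
        for s in range(n_samples)
    ]
    # The predicted class of each sample is the index of its largest mean probability.
    y_pred = [max(range(n_classes), key=row.__getitem__) for row in mean_proba]
    return mean_proba, y_pred
```

All arrangements must score the same held-out samples in the same order, which is why the train/test split is fixed once before the arrangements are trained.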
Attribution¶
pydeepmapper.attribution ¶
Per-pixel attribution -> per-feature findings.
One attribute() seam, several methods:
CNN : integrated_gradients (default), saliency, occlusion [Captum]
ViT : attention_rollout, chefer [attention-based]
The method computes a per-pixel importance map for a batch, then
transform.pixels_to_features back-projects it through the arrangement's inverse
permutation to a per-feature vector (the clean 1-pixel-per-feature gather). torch /
captum are imported lazily; a missing dependency raises a clear error.
feature_importances ¶
feature_importances(model, X_images, targets, method: str, perm: Optional[Sequence[int]], n_features: int, baseline=None, ig_steps: int = 32, ig_internal_batch: int = 16)
Return a per-feature importance vector (length n_features) for method.
X_images: a (batch, C, H, W) tensor on the model's device. targets:
per-sample class indices. Importances are averaged over the batch, then
back-projected to feature space via the arrangement's inverse permutation.
ig_steps / ig_internal_batch bound Integrated-Gradients memory.
Source code in pydeepmapper/attribution.py
Backbones¶
pydeepmapper.backbones ¶
Swappable backbones, CNN <-> ViT <-> autoencoder behind one interface.
The whole point of the revision: the model is a plug-in. build(spec, num_classes)
returns a torch.nn.Module for any registered kind; the rest of the pipeline
(runner, attribution) never names a concrete architecture.
torch / timm are imported lazily inside the builders, so importing this module is
free and the pure core stays torch-independent. If a backbone is requested without
its dependency installed, a clear BackboneUnavailable is raised.
Registry
- cnn_small: a tiny 3-conv net. DEFAULT. Small + locality-biased = fast & strong
here (the hypothesis the benchmark verifies).
- resnet18: torchvision ResNet-18 (legacy DeepMapper parity baseline).
- vit_cct: Compact Convolutional Transformer (top ViT pick for pseudo-images).
- timm:
Non-neural classifiers (sklearn LogReg-L1 / random forest / gradient boosting) are NOT registered here: integrated-gradients attribution needs a differentiable model, so they would require a separate permutation-importance attribution path (future work).
BackboneUnavailable ¶
build ¶
Build the backbone for spec. kind is a registry key or timm:<name>.
Source code in pydeepmapper/backbones.py
Early stopping¶
pydeepmapper.earlystop ¶
Plateau early-stopping, the pure decision logic (no torch).
Reproduces the Keras EarlyStopping(monitor, patience, min_delta,
restore_best_weights=True) behaviour that the original TF DeepMapper used and
that was lost in the PyTorch port: train until the monitored held-out metric
stops improving for patience epochs, then restore the best-epoch weights.
A min_epochs floor keeps the run out of the undertrained zone where
attribution is unreliable (the 12-epoch 88% artefact).
Kept torch-free and side-effect-free so it is unit-tested without a GPU, the
same split the project uses for accumulate.
EarlyStopper dataclass ¶
Track a monitored metric and decide when training has plateaued.
Call step once per epoch with the held-out value. It records whether
this epoch was a new best (self.improved, the caller saves weights then)
and returns whether training should stop. mode='min' for a loss,
'max' for an accuracy.
Source code in pydeepmapper/earlystop.py
step ¶
Record value at epoch (0-based); return True to stop now.
Before min_epochs (the warmup) we neither record a best epoch nor
count patience, so the restored best-epoch model has always trained at
least min_epochs. The floor gates which epoch we keep, not only when
we stop: on an easy task whose validation loss bottoms out very early,
this keeps attribution off a barely-trained model.
Source code in pydeepmapper/earlystop.py
is_improvement ¶
True if value beats best by more than min_delta (pure).
Source code in pydeepmapper/earlystop.py
should_stop ¶
True if training has plateaued: past the min_epochs floor AND
patience consecutive non-improving epochs have elapsed (pure).
Source code in pydeepmapper/earlystop.py
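The plateau logic described in this section can be sketched as a small torch-free dataclass. This is an illustration only: the class name `PlateauStopper`, its field names and defaults, and the reading of the warmup as `epoch + 1 < min_epochs` (with 0-based epochs) are assumptions, not the library's definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlateauStopper:
    patience: int = 5
    min_delta: float = 0.0
    min_epochs: int = 0
    mode: str = "min"  # "min" for a loss, "max" for an accuracy
    best: Optional[float] = None
    best_epoch: Optional[int] = None
    bad_epochs: int = 0
    improved: bool = False

    def is_improvement(self, value: float) -> bool:
        """True if value beats the best so far by more than min_delta."""
        if self.best is None:
            return True
        if self.mode == "min":
            return value < self.best - self.min_delta
        return value > self.best + self.min_delta

    def step(self, value: float, epoch: int) -> bool:
        """Record value at a 0-based epoch; return True when training should stop."""
        self.improved = False
        if epoch + 1 < self.min_epochs:
            # Warmup: neither record a best epoch nor count patience.
            return False
        if self.is_improvement(value):
            self.best, self.best_epoch, self.bad_epochs = value, epoch, 0
            self.improved = True  # the caller would save weights here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop the caller would save a checkpoint whenever `improved` is set and reload that checkpoint once `step` returns True, which reproduces the restore-best-weights behaviour.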
Core transform¶
pydeepmapper.transform ¶
DeepMapper feature->pixel transform, the PURE, contracted core.
A feature vector is laid out into a small dim x dim pseudo-image by padding to
a perfect square and reshaping. Different arrangements come from a seeded
permutation of the feature order; per-pixel attributions are projected back to
feature space through the inverse permutation (a clean 1-pixel-per-feature gather, DeepMapper's differentiating property).
The scalar/list functions here are
stdlib-only and deterministic, so they are testable without numpy or torch. Only
the array builders (to_images / pixels_to_features) need numpy, imported
lazily so importing this module never requires it.
square_dim ¶
Smallest square side dim with dim*dim >= n_features + buffer.
Mirrors the canonical DeepMapper.map sizing. Minimal by construction:
(dim-1)**2 < n_features + buffer <= dim**2.
Source code in pydeepmapper/transform.py
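The sizing rule amounts to an integer ceiling square root. A minimal sketch, using the hypothetical name `smallest_square_side` rather than the library's own function, could look like this:

```python
import math


def smallest_square_side(n_features: int, buffer: int = 0) -> int:
    """Smallest dim with dim * dim >= n_features + buffer."""
    total = n_features + buffer
    if total <= 0:
        raise ValueError("n_features + buffer must be positive")
    root = math.isqrt(total)  # floor of the exact square root, no float rounding
    return root if root * root == total else root + 1
```

Using `math.isqrt` instead of `math.sqrt` avoids floating-point rounding for very large feature counts, so the minimality property holds exactly.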
image_side ¶
permutation ¶
A deterministic, seeded permutation of range(n) (one arrangement).
Same seed always yields the same permutation (reproducible runs). Uses the
same random.Random(seed).shuffle mechanism as the canonical
DeepMapper.shuffle_to_seed.
Source code in pydeepmapper/transform.py
inverse_permutation ¶
Inverse of a permutation: inv[perm[i]] == i and perm[inv[i]] == i.
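A seeded permutation and its inverse can be sketched with the standard library alone. The function names below are hypothetical; only the `random.Random(seed).shuffle` mechanism is taken from the description above.

```python
import random
from typing import List, Sequence


def seeded_permutation(n: int, seed: int) -> List[int]:
    """A reproducible permutation of range(n): one arrangement per seed."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)  # private generator, global state untouched
    return perm


def invert(perm: Sequence[int]) -> List[int]:
    """Inverse permutation: inv[perm[i]] == i for every i."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv
```

Because each call builds its own `random.Random` instance, two runs with the same seed give the same arrangement regardless of any other random calls in the program.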
to_images ¶
Lay a (n_samples, n_features) matrix out as (n_samples, dim, dim, 1).
If perm is given, features are reordered by it first (the arrangement);
feature perm[j] lands at flattened pixel j. Padding to dim*dim is
zeros. (Deviation from the legacy code: we pad to exactly dim*dim rather
than using np.resize, which silently repeats data when a buffer is set.)
Source code in pydeepmapper/transform.py
pixels_to_features ¶
Project a per-pixel attribution image back to per-feature importances.
Inverse of the to_images layout: flattened pixel j (for j <
n_features) carries feature perm[j], so out[perm[j]] = a[j]. Padding
pixels are dropped. With perm=None the identity arrangement is assumed.
Source code in pydeepmapper/transform.py
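The layout and its inverse can be illustrated for a single sample with plain lists. This sketch keeps the image flattened and uses hypothetical helper names; the library's array builders work on whole matrices with numpy and are not reproduced here.

```python
from typing import List, Optional, Sequence


def to_flat_image(features: Sequence[float], dim: int, perm: Optional[Sequence[int]] = None) -> List[float]:
    """Lay one feature vector out as dim*dim pixels: feature perm[j] lands at pixel j."""
    n = len(features)
    if dim * dim < n:
        raise ValueError("image too small for the feature vector")
    order = perm if perm is not None else range(n)
    pixels = [features[order[j]] for j in range(n)]
    pixels += [0.0] * (dim * dim - n)  # zero padding up to exactly dim*dim
    return pixels


def flat_pixels_to_features(pixels: Sequence[float], n_features: int, perm: Optional[Sequence[int]] = None) -> List[float]:
    """Back-project per-pixel values: out[perm[j]] = pixels[j]; padding is dropped."""
    out = [0.0] * n_features
    for j in range(n_features):
        out[perm[j] if perm is not None else j] = pixels[j]
    return out
```

Applying the second function to the output of the first recovers the original vector, which is the one-pixel-per-feature property that makes attributions map back to features without ambiguity.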
Accumulation¶
pydeepmapper.accumulate ¶
Iterate-N accumulation, the PURE aggregation of per-pass findings.
DeepMapper runs the arrange -> train -> attribute loop N times and accumulates
per-feature findings into robust, stability-scored attributions. These functions
turn a list of per-pass results into the final ranking + stability statistics.
No torch, no IO, stdlib only, so they are testable in isolation.
These functions are property-tested in isolation (selection frequency = the stability-selection statistic; median importance; Borda mean rank; the per-pass min_accuracy quality gate).
accept_pass ¶
ranks_from_importances ¶
Convert an importance vector to ranks (rank 1 = most important).
Ties are broken by original index (stable), so the result is a permutation of
1..n and feeds mean_rank / top-k selection directly.
Source code in pydeepmapper/accumulate.py
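The rank conversion with index tie-breaking can be sketched as below. The function name `importance_ranks` is hypothetical and the code is an illustration of the stated rule, not the library source.

```python
from typing import List, Sequence


def importance_ranks(importances: Sequence[float]) -> List[int]:
    """Rank 1 = most important; equal importances keep their original index order."""
    order = sorted(range(len(importances)), key=lambda i: (-importances[i], i))
    ranks = [0] * len(importances)
    for position, feature in enumerate(order, start=1):
        ranks[feature] = position
    return ranks
```

Because every feature receives a distinct rank, the output is always a permutation of 1..n, even when several importances are identical.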
top_k_set ¶
Indices of the k most important features in one pass (descending).
selection_frequency ¶
Per-feature fraction of passes that selected it (the stability statistic).
top_sets: one selected-index list per accepted pass. Returns a value in
[0, 1] per feature: how often it landed in the top-k. Empty input -> all 0.
Source code in pydeepmapper/accumulate.py
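Top-k selection and the selection-frequency statistic can be sketched together. The names `top_k_indices` and `selection_fraction` are hypothetical, and breaking ties by index in the top-k step is an assumption made for the example.

```python
from typing import List, Sequence


def top_k_indices(importances: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest importances, most important first."""
    order = sorted(range(len(importances)), key=lambda i: (-importances[i], i))
    return order[:k]


def selection_fraction(top_sets: Sequence[Sequence[int]], n_features: int) -> List[float]:
    """Fraction of accepted passes in which each feature was selected."""
    if not top_sets:
        return [0.0] * n_features  # empty input gives all zeros
    counts = [0] * n_features
    for selected in top_sets:
        for i in set(selected):  # count each feature at most once per pass
            counts[i] += 1
    return [c / len(top_sets) for c in counts]
```

A feature with frequency 1.0 was in the top-k under every arrangement, while a feature near 0 was selected only occasionally and is more likely an artefact of one particular layout.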
mean_rank ¶
Borda aggregation: per-feature mean rank across passes (lower = better).
Source code in pydeepmapper/accumulate.py
median_importance ¶
Per-feature median importance across passes (outlier-robust central estimate).
Source code in pydeepmapper/accumulate.py
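Both aggregations reduce a list of per-pass vectors column by column. A minimal standard-library sketch, with hypothetical function names, is:

```python
from statistics import mean, median
from typing import List, Sequence


def borda_mean_rank(rank_lists: Sequence[Sequence[int]]) -> List[float]:
    """Per-feature mean rank across passes (lower = better)."""
    return [mean(column) for column in zip(*rank_lists)]


def per_feature_median(importance_lists: Sequence[Sequence[float]]) -> List[float]:
    """Per-feature median importance across passes."""
    return [median(column) for column in zip(*importance_lists)]
```

The median is the outlier-robust choice here: one pass with an unusually large attribution for a feature moves the median far less than it would move the mean.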
stability_selection_bound ¶
Meinshausen-Bühlmann expected-false-positives bound E[V] <= q^2 / ((2*pi_thr - 1) * F).
pi_thr: selection-frequency threshold (must be > 0.5). q: average number
of features selected per pass. F: total number of features. Returns the upper bound on expected false positives.
Source code in pydeepmapper/accumulate.py
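As a worked example of the bound, the sketch below evaluates q^2 / ((2*pi_thr - 1) * n_features). The function name and argument names are hypothetical; the last argument is the total number of features, as in the original Meinshausen-Bühlmann theorem.

```python
def expected_false_positive_bound(q: float, pi_thr: float, n_features: int) -> float:
    """Upper bound on the expected number of falsely selected features."""
    if not 0.5 < pi_thr <= 1.0:
        raise ValueError("pi_thr must lie in (0.5, 1]")
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return (q * q) / ((2.0 * pi_thr - 1.0) * n_features)
```

With q = 10 features selected per pass on average, a threshold of 0.9, and 1000 features, the bound is 100 / (0.8 * 1000) = 0.125 expected false positives. Raising the threshold or selecting fewer features per pass tightens it.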
De-novo assembly¶
pydeepmapper.hybrid_assembly ¶
De-novo-hybrid assembly, recover NON-DOCUMENTED transcripts and fold them into the expression "sc table" so DeepMapper sees features a reference-only pipeline discards.
Pipeline (validated on tissue-resident vs circulating memory T-cell bulk RNA, where the top discriminators turned out to be novel DENOVO transcripts):
reads --align(reference)--> UNALIGNED reads (the non-documented fraction)
--pool--> Trinity de-novo assembly --> novel transcripts, tagged DENOVO_*
hybrid_ref = reference_cdna + DENOVO -> Salmon index + quant -> counts matrix
build_sc_table(...) -> transcripts × samples, each transcript flagged is_denovo
This is an optional DeepMapper data-ingestion path: instead of starting from a given matrix, build the matrix (incl. non-documented features) from raw reads. The PURE core below (namespacing + table assembly) is property-tested; the heavy tool orchestration (HISAT2/Trinity/Salmon/biopython) is gated behind dependency checks and requirements-bio.txt so importing this module stays cheap.
BioToolUnavailable ¶
is_denovo ¶
tag_denovo ¶
Namespace de-novo transcript ids with DENOVO_ (idempotent, never double-tags),
so they cannot collide with reference ids in the hybrid reference.
Source code in pydeepmapper/hybrid_assembly.py
merge_reference_and_denovo ¶
Combined transcript id list for the hybrid reference: reference ids untouched, de-novo ids tagged. Asserts (via the contract) that the namespaces don't collide.
Source code in pydeepmapper/hybrid_assembly.py
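Idempotent tagging and a collision-checked merge can be sketched as follows. The helper names are hypothetical, and raising ValueError on a collision is an assumption; the source only says the non-collision property is asserted via a contract.

```python
from typing import List, Sequence

DENOVO_PREFIX = "DENOVO_"


def has_denovo_tag(transcript_id: str) -> bool:
    return transcript_id.startswith(DENOVO_PREFIX)


def add_denovo_tag(transcript_id: str) -> str:
    """Prefix an id with DENOVO_ unless it already carries the prefix."""
    return transcript_id if has_denovo_tag(transcript_id) else DENOVO_PREFIX + transcript_id


def merge_ids(reference_ids: Sequence[str], denovo_ids: Sequence[str]) -> List[str]:
    """Reference ids untouched, de-novo ids tagged, no id shared between the two."""
    tagged = [add_denovo_tag(t) for t in denovo_ids]
    clash = set(reference_ids) & set(tagged)
    if clash:
        raise ValueError(f"reference ids collide with de-novo namespace: {sorted(clash)}")
    return list(reference_ids) + tagged
```

Idempotence matters when a step is re-run on an already tagged FASTA: tagging twice must not produce ids such as DENOVO_DENOVO_x.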
build_sc_table ¶
Assemble the expression "sc table" (transcripts × samples) from per-sample
transcript->count maps (e.g. one Salmon quant.sf per sample).
Returns a pandas DataFrame indexed by transcript, one column per sample (column order =
insertion order of per_sample_counts), missing transcripts filled 0. An extra boolean
column is NOT added here (keep the matrix numeric); use is_denovo on the index.
Source code in pydeepmapper/hybrid_assembly.py
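The table assembly can be illustrated without pandas, using a dict of rows in place of the DataFrame. This is a sketch under assumptions: the function name is hypothetical, and ordering rows by first appearance is a choice made for the example rather than a documented behaviour.

```python
from typing import Dict, List, Mapping, Tuple


def assemble_table(
    per_sample_counts: Mapping[str, Mapping[str, float]]
) -> Tuple[Dict[str, List[float]], List[str]]:
    """Return ({transcript: [count per sample]}, sample order); missing counts become 0."""
    samples = list(per_sample_counts)  # column order = insertion order
    transcripts: List[str] = []
    seen = set()
    for counts in per_sample_counts.values():
        for transcript in counts:
            if transcript not in seen:
                seen.add(transcript)
                transcripts.append(transcript)
    table = {
        t: [float(per_sample_counts[s].get(t, 0.0)) for s in samples]
        for t in transcripts
    }
    return table, samples
```

Keeping the table purely numeric and deriving the de-novo flag from the transcript id means the matrix can be passed on to the runner without dropping a non-numeric column first.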
denovo_fraction ¶
Fraction of ids that are de-novo, a quick diagnostic of how much non-documented signal the hybrid step recovered.
Source code in pydeepmapper/hybrid_assembly.py
align_capture_unaligned ¶
align_capture_unaligned(reads_fastq: str, hisat2_index: str, out_dir: str, strandness: str = 'R') -> str
Align reads_fastq to the reference with HISAT2 and capture the UNALIGNED reads, the non-documented fraction. Returns the path to the unaligned FASTA.
Source code in pydeepmapper/hybrid_assembly.py
denovo_assemble ¶
denovo_assemble(pooled_unaligned_fasta: str, out_dir: str, max_memory: str = '50G', cpu: int = 16) -> str
Trinity de-novo assembly of the pooled unaligned reads. Returns the path to the DENOVO-tagged FASTA (ready to concatenate into the hybrid reference).
Source code in pydeepmapper/hybrid_assembly.py
build_hybrid_reference ¶
build_hybrid_reference(reference_fasta: str, denovo_fasta: str, out_fasta: str, salmon_index_dir: str, kmer: int = 15) -> str
Concatenate reference + DENOVO into the hybrid reference and build a Salmon index. Returns the index directory.
Source code in pydeepmapper/hybrid_assembly.py
quantify ¶
quantify(reads_by_sample: Mapping[str, str], salmon_index_dir: str, out_dir: str, threads: int = 16) -> Dict[str, str]
Salmon quant each sample against the hybrid index. Returns {sample: quant.sf path}.
Source code in pydeepmapper/hybrid_assembly.py
read_quant ¶
Read a Salmon quant.sf into {transcript: NumReads}.
Source code in pydeepmapper/hybrid_assembly.py
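A quant.sf file is a tab-separated table whose header includes the Name and NumReads columns, so reading it needs only the csv module. The sketch below parses from a string to stay self-contained; the function name and the sample ids are hypothetical.

```python
import csv
import io
from typing import Dict


def parse_quant_sf(text: str) -> Dict[str, float]:
    """Map each transcript Name to its NumReads value."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Name"]: float(row["NumReads"]) for row in reader}


# Illustrative content only, in the Salmon quant.sf column layout.
example = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "T1\t1000\t850.0\t12.5\t40.0\n"
    "DENOVO_c1\t500\t350.0\t3.0\t7.0\n"
)
counts = parse_quant_sf(example)
```

To read from disk, the same reader can be wrapped around an open file handle instead of `io.StringIO`.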
run_hybrid_assembly ¶
run_hybrid_assembly(reads_by_sample: Mapping[str, str], hisat2_index: str, reference_fasta: str, work_dir: str) -> 'object'
End-to-end orchestration → the expression sc table (transcripts × samples), with the recovered DENOVO transcripts included. Each step is idempotent enough to resume.