An Lab · Korea University

Single-Cell Foundation Models: Scale, Skepticism, and the Epistemology of Data Production

2026-04-16

single-cell · foundation-model · perturbation · virtual-embryo · data-production

In 2023, Nature Methods published a piece called "How to build a virtual embryo." It was the year stem cell-based embryo models were named Method of the Year, and the article cautiously introduced the concept of a computational embryo — a digital embryo you could perturb genetically, expose to drugs, and simulate outcomes. Ina Sonnen said it was "definitely a long-term endeavor," and Berna Sozen said "we will never reach perfection, but it could help us understand otherwise inaccessible developmental stages." The vision was clear, but there was no concrete path.

Three years later, in 2026, Cao, Lu, and Qiu published a perspective in the same journal that drew that path in considerable detail. A 26-million human fetal cell atlas, the MOSTA spatial atlas, 3D human embryo reconstruction, cell embeddings from Geneformer and scGPT, flow-matching-based trajectory modeling, vector field simulations from Spateo. What was "someday possible" in 2023 became "here is the data and the model, and here is how" in 2026. What happened in between was two years of single-cell foundation models.

Predicting Answers as Imagined

We are actively pursuing similar research ourselves, though it is hard to define precisely what dimension this work occupies. Traditional biology would define a problem and then examine data within that frame. Now — strange as the analogy may sound — we predict answers as we imagine them.

Scale Becomes the Default

Many single-cell foundation model (FM) methods have appeared since the beginning of this year. The first thing that stands out is scale. CellFM trained 800M parameters on 100 million cells. TEDDY pulled 116 million cells from CELLXGENE. scPRINT-2 collected 350 million cells across 16 species. C2S-Scale pushed to 27 billion parameters and was the first in the single-cell field to systematically report scaling laws. Tahoe-x1 incorporated perturbation data from 50 cancer cell lines and 1,100 drugs directly into its 3-billion-parameter training, and MaxToki extracted roughly one trillion tokens from 175 million cells to learn along the temporal axis. Geneformer's 30 million cells were considered large just two years ago — today's corpora are an order of magnitude beyond that. Collecting data and scaling models is no longer the bottleneck; it is becoming the baseline.
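Reporting a scaling law, in practice, means fitting a power law between model or data size and evaluation loss. A minimal sketch of that fit — with entirely made-up numbers, not C2S-Scale's actual results — looks like this:

```python
import numpy as np

# Hypothetical (parameter count, eval loss) pairs; the numbers are
# illustrative only, not values reported by any of the papers above.
params = np.array([1e8, 1e9, 3e9, 2.7e10])
loss = np.array([2.10, 1.75, 1.60, 1.42])

# A scaling law L(N) = a * N**b is a straight line in log-log space:
# log L = log a + b * log N, so a linear fit recovers the exponent.
b, log_a = np.polyfit(np.log(params), np.log(loss), 1)
a = np.exp(log_a)

# A negative exponent means loss falls predictably with scale.
print(f"L(N) ~ {a:.2f} * N^({b:.3f})")
```

The interesting question for the field is not whether such a fit exists on training loss, but whether it holds on the downstream tasks discussed below.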

Skeptical Evaluation of Scale

At the same time scale has been climbing, papers have appeared that directly question whether that scale actually produces value. Kedzierska et al. evaluated Geneformer and scGPT in zero-shot settings and found cases where they performed worse than simply selecting highly variable genes. Models trained on tens of millions of cells lost to the simplest baseline of all: HVG selection. Csendes et al. evaluated scGPT and scFoundation on Perturb-seq data, and a train-mean predictor — using the training set average as the prediction — won. Wei et al. ran 27 methods across 29 datasets and concluded that FMs need large amounts of fine-tuning data to beat baselines. These results all say the same thing: the hypothesis that "training big works by itself" is not supported by the evidence so far. Scale-up and skeptical evaluation are happening in parallel — that is the real state of the field this year. Our own internal evaluations reach similar conclusions.
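The two baselines these papers use are worth seeing concretely. A toy sketch, with simulated numbers standing in for Perturb-seq effect matrices (nothing here reproduces the cited evaluations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a Perturb-seq effect matrix: rows are perturbations,
# columns are genes, entries are mean log-fold changes (all simulated).
train = rng.normal(size=(50, 200))   # 50 perturbations seen in training
test = rng.normal(size=(10, 200))    # 10 held-out perturbations

# Train-mean predictor: emit the average training effect for every
# unseen perturbation, ignoring which perturbation was asked about.
train_mean = train.mean(axis=0)
pred = np.broadcast_to(train_mean, test.shape)
mse = float(np.mean((pred - test) ** 2))

# HVG-style baseline: represent cells by the top-k most variable genes,
# with no learned model at all.
k = 50
hvg_idx = np.argsort(train.var(axis=0))[-k:]
hvg_features = test[:, hvg_idx]      # (10, 50) representation

print(f"train-mean MSE: {mse:.3f}, HVG feature shape: {hvg_features.shape}")
```

That a predictor this trivial can beat billion-parameter models is exactly why it has become the field's standard sanity check.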

The Proving Ground for FMs: Perturbation Prediction

Where, then, must FMs prove their value? Not in cell type annotation, but in perturbation prediction. Cell type annotation is largely a solved problem, and labeling does not strictly require a model trained on hundreds of millions of cells. Perturbation, on the other hand — what happens to a cell when you knock out this gene, how does the transcriptome shift when you apply this drug — is causal, difficult, and useful when solved. Tahoe-x1 incorporating response data for 1,100 drugs directly into training; Pershad et al. building a closed loop of model prediction, experimental validation, and retraining to triple the positive predictive value (PPV) for T-cell activation; Hediyeh-zadeh et al. applying continual learning to a 300-patient colorectal cancer atlas to elucidate drug resistance mechanisms — these all sit on this trajectory. Tejada-Lapuerta et al.'s causal ML perspective provides the theoretical grounding: correlation-based models only work within the training distribution, and generalizing to new conditions requires causal structure.
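The closed-loop pattern attributed to Pershad et al. can be sketched as a simple algorithm. Every class and function name below is a hypothetical placeholder, not their actual code:

```python
class ToyModel:
    """Stand-in for an FM that scores perturbations and can be fine-tuned."""
    def __init__(self):
        self.memory = {}

    def predict_effect(self, gene):
        # Fall back to a crude prior for genes never measured before.
        return self.memory.get(gene, float(len(gene)))

    def finetune(self, results):
        # "Retraining" here just memorizes measured outcomes.
        self.memory.update(results)

def closed_loop(model, candidates, rounds, batch_size, run_experiment):
    validated = []
    for _ in range(rounds):
        # 1. Rank remaining candidate perturbations by predicted effect.
        ranked = sorted(candidates, key=model.predict_effect, reverse=True)
        batch = ranked[:batch_size]
        # 2. Validate the top-ranked perturbations experimentally.
        results = {g: run_experiment(g) for g in batch}
        validated.extend(results.items())
        # 3. Feed measured outcomes back into the model.
        model.finetune(results)
        candidates = [g for g in candidates if g not in batch]
    return validated

genes = ["CD28", "LCK", "ZAP70", "NFKB1", "IL2RA", "CTLA4"]
picks = closed_loop(ToyModel(), genes, rounds=2, batch_size=2,
                    run_experiment=lambda g: float(len(g)))
print([g for g, _ in picks])
```

The point of the loop is that each round's experimental budget is spent where the model is most confident, and each round's measurements shrink the gap between prediction and reality.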

Architecture Shifts: Encoding Biology into Structure

A transition is visible on the architecture side as well. Beyond simply feeding expression values into Transformers, serious attempts to translate biological prior knowledge into structural constraints have emerged this year. HEIMDALL decomposed the single-cell FM tokenizer along three axes — gene identity, expression encoding, and cell composition — and reimplemented five major models within this framework. The conclusion: when the distribution resembles training data, the tokenizer makes no difference; differences only emerge under distribution shift. This means the tokenizer can become a bottleneck at precisely the moment the FM's value should be realized. GREmLN addressed the fundamental problem that scRNA data has no gene ordering by embedding gene regulatory network topology into the attention mechanism via spectral graph signal processing. Tripso grouped genes into gene programs first, creating a three-level hierarchy of gene, gene program, and cell. All three papers point in the same direction: feeding expression values into a Transformer is not enough — biology's structure must be woven into the architecture itself.
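The spectral idea behind approaches like GREmLN can be illustrated in a few lines. This is a generic graph-Laplacian sketch on a made-up toy network, not GREmLN's actual implementation: low-frequency eigenvectors of the GRN's normalized Laplacian give each gene a position that respects regulatory proximity, which a Transformer can consume in place of the sequence ordering scRNA data lacks.

```python
import numpy as np

# Hypothetical 6-gene regulatory network (undirected, for simplicity).
n_genes = 6
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
A = np.zeros((n_genes, n_genes))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(n_genes) - D_inv_sqrt @ A @ D_inv_sqrt

# eigh returns eigenvalues in ascending order; the smallest nonzero
# modes vary smoothly over the network, so nearby genes get nearby codes.
eigvals, eigvecs = np.linalg.eigh(L)
k = 3
gene_pos_embedding = eigvecs[:, 1:k + 1]  # skip the trivial constant mode

# These vectors can be added to (or concatenated with) gene token
# embeddings before attention, encoding topology rather than order.
print(gene_pos_embedding.shape)
```

Whether the structure enters through the tokenizer, the positional code, or the attention mask differs across the three papers, but the shared move is the same: make the architecture aware of biology the data alone does not spell out.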

The Slow Side Is Not the Model but How We Make Data

Yet watching all these developments, one thought keeps drawing me back. If causal prediction is to be the proving ground for FMs, the models alone cannot be what changes. The data production side must change too. This is something I keep wondering about — building bio foundation models is not, by itself, that attractive to the pharmaceutical industry. What matters more is which problems to solve and how. That requires substantial research, starting with the production of standardized, well-defined data against which biological intuition can be tested. If the target is disease, the data must control for genetic background and clinical information within specific disease cohorts. The homogeneous pipelines that large companies maintain are well-suited for this kind of data production. Yet this approach and philosophy seem largely absent domestically, in contrast to what we see abroad. The same logic applies to the utilization of organoids.

The papers currently producing the most meaningful results in perturbation prediction — Tahoe-x1, Pershad's closed loop, Hediyeh-zadeh's continual learning — all use systematically designed perturbation data. It is because datasets like Tahoe-100M exist, combining 50 cell lines with 1,100 drugs, that models can learn causality at all. The future that the 2026 Virtual Embryo perspective envisions — simulating how morphogenesis changes when you perturb a gene — is also unreachable by models alone without matching data. Systematic measurement of developmental perturbations, quantification of phenotypes, and structured recording of even failed experiments must come first. If this transition does not happen in traditional biology labs, FMs can scale indefinitely and still lack the causal structure needed to learn from.

Ultimately, this is not a problem of models but of whether the epistemology of data production changes. Laboratories that produce data suited for recursive reasoning — data generated within loops where hypotheses are formed, perturbations applied, outcomes measured, and results feed back into the next hypothesis — need to multiply before this field can deliver on its promises. Scaling models is already happening fast. The slow side is not the model. It is how we make the data.

An Lab · AI for Nature
School of Biosystems and Biomedical Sciences
Korea University, Seoul, Republic of Korea
© 2026 An Lab, Korea University · Inspired by Biology · Driven by AI