PAPER 03 - Generalization Ladder

How do we know a generative model is not just copying close cousins?

Drug-discovery ML can look impressive when the test molecules sit close to the training set. This paper sets up a ladder of progressively harder tests so we can separate real prediction from near-neighbor lookup. The honest answer is mixed but useful: one pilot system extrapolates to held-out building blocks, another mostly fills gaps inside a dense library, and the hardest scaffold-hop test is now precisely defined.

Paper 03 · 11 pages · April 2026 · CC-BY 4.0
[Figure: The ladder, model vs. nearest-neighbor retrieval, ZEL031 / THP1. Rungs: L1 unseen tuple; L2 building-block holdout; L3 multi-axis (+0.168, 90% wins); L4 neighborhood; L5 cross-library (next experiment). Series: retrieval, model.]
Hero figure - The ladder separates easy gap-filling from harder held-chemistry prediction.
TL;DR

The question is simple to ask and hard to test fairly: when the chemistry changes, does the model still help? In ZEL031 / THP1 it beat nearest-neighbor retrieval even when whole chemical building blocks were held out. In ZEL024 / HEK293 it mostly filled gaps inside a very dense library. Cross-library scaffold transfer is the hardest case and is not solved yet; the value of the pilot is that it defines the prospective experiment needed to test it cleanly.

Why it matters

Investors hear a lot of claims that ML can predict chemistry. The follow-up question is usually missing: what kind of new chemistry? A new combination of familiar parts is one thing. A molecule built from parts the model has never seen is harder. A new scaffold family is harder still.

The ladder makes that distinction explicit. At every rung, the model is compared against a nearest-neighbor baseline that just copies the closest measured compound. A model only earns credit when it beats that baseline.

What we did

We built five tests, L1 to L5. L1, the easiest, holds out new combinations (tuples) of familiar building blocks. L2 holds out individual building blocks, L3 holds out multiple building-block positions at once, and L4 holds out whole chemical neighborhoods. L5, the hardest, asks whether a model trained on one library family can transfer to another in the same cell line.
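The building-block holdouts can be sketched as a split function. This is a minimal illustration, not the benchmark code that ships with the release: the tuple encoding of a compound, the name `building_block_holdout`, and the choice to drop compounds that mix seen and held-out blocks are all assumptions.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: one building-block ID per library position.
Compound = Tuple[str, ...]


def building_block_holdout(
    compounds: List[Compound],
    held_out: Dict[int, Set[str]],
) -> Tuple[List[Compound], List[Compound]]:
    """Split compounds so held-out blocks never appear in training.

    held_out maps a position index to the block IDs withheld at that position.
    One position gives a single-axis holdout; several give a multi-axis one.
    """
    if not held_out:
        raise ValueError("held_out must name at least one position")
    train, test = [], []
    for compound in compounds:
        hits = [compound[pos] in blocks for pos, blocks in held_out.items()]
        if all(hits):
            test.append(compound)   # unseen block at every held position
        elif not any(hits):
            train.append(compound)  # no held-out block anywhere
        # Mixed compounds are dropped so neither set leaks into the other.
    return train, test


library = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]
train, test = building_block_holdout(library, {0: {"A2"}, 1: {"B2"}})
print(train)  # [('A1', 'B1')]
print(test)   # [('A2', 'B2')]
```

Dropping the mixed compounds costs data but keeps the test strict: every test compound is new at every held position, and no held-out block is ever seen in training.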

The nearest-neighbor baseline runs at every rung and copies the measured response of the closest training compound. It is the simplest reasonable thing the model has to beat.
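A minimal sketch of such a baseline follows. The paper summary does not say which similarity measure defines "closest", so Tanimoto similarity over sets of fingerprint bits is an assumption here, and the function names and toy data are hypothetical.

```python
from typing import FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbor_predict(
    query: FrozenSet[int],
    training: List[Tuple[FrozenSet[int], float]],
) -> float:
    """Return the measured response of the most similar training compound."""
    if not training:
        raise ValueError("training set is empty")
    _, response = max(training, key=lambda item: tanimoto(query, item[0]))
    return response


# Toy training set: (fingerprint bits, measured response). Values are made up.
training = [
    (frozenset({1, 2, 3}), 0.80),
    (frozenset({7, 8, 9}), 0.10),
]
print(nearest_neighbor_predict(frozenset({1, 2, 4}), training))  # 0.8
```

The baseline has no parameters to fit, so any gap between it and the model reflects what the model learned beyond copying the nearest measurement.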

What we found

The ladder separates useful claims from inflated ones.

FINDING 01

One system shows prediction beyond close-copy lookup.

In ZEL031 / THP1 the model still helped when held-out compounds used building-block identities the model had never seen. It beat nearest-neighbor retrieval on about 90% of test compounds, which is the cleanest extrapolation result in the pilot.

+0.168 · Model advantage over lookup

FINDING 02

Simpler holdouts agree with the harder one.

Holding out one building-block position at a time, the same system kept beating retrieval. The harder multi-axis result is more credible because the simpler holdouts behave the same way.

89-91% · Compounds where the model beat lookup

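A per-compound comparison of this kind can be sketched as below. The summary does not state which metric underlies the reported advantage or how a "win" is scored, so absolute error per compound is used here as a stand-in assumption; the function name and the toy numbers are hypothetical.

```python
from typing import List, Tuple


def compare_to_retrieval(
    truth: List[float],
    model_pred: List[float],
    retrieval_pred: List[float],
) -> Tuple[float, float]:
    """Return (win_rate, mean_error_reduction) of the model over retrieval."""
    if not (len(truth) == len(model_pred) == len(retrieval_pred)) or not truth:
        raise ValueError("inputs must be non-empty and equal in length")
    wins = 0
    total_reduction = 0.0
    for y, m, r in zip(truth, model_pred, retrieval_pred):
        model_err = abs(y - m)
        retrieval_err = abs(y - r)
        wins += model_err < retrieval_err  # ties do not count as wins
        total_reduction += retrieval_err - model_err
    n = len(truth)
    return wins / n, total_reduction / n


# Toy values only, chosen to be easy to check by hand.
win_rate, reduction = compare_to_retrieval(
    truth=[1.0, 0.0, 0.5, 0.25],
    model_pred=[0.75, 0.0, 0.5, 1.0],
    retrieval_pred=[0.5, 0.5, 1.0, 0.5],
)
print(win_rate)   # 0.75
print(reduction)  # 0.1875
```

Reporting both numbers guards against two failure modes: a high win rate built on tiny margins, and a large average gain driven by a handful of compounds.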
FINDING 03

A dense library gives a different but useful answer.

ZEL024 / HEK293 is sampled so densely that many held-out compounds still have very close analogs in the training set. The model is excellent at filling gaps inside that explored design space. We should not call that the same thing as extrapolation onto novel chemistry.

13,769 / 13,944 · Observed tuples (saturated grid)

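The arithmetic behind "saturated" is short. The sketch below assumes the second number, 13,944, is the full grid of possible tuples; the summary does not say so explicitly.

```python
observed = 13_769  # observed tuples, as reported above
possible = 13_944  # assumed to be the full grid of possible tuples

coverage = observed / possible
missing = possible - observed

print(f"coverage: {coverage:.1%}")   # coverage: 98.7%
print(f"missing tuples: {missing}")  # missing tuples: 175
```

With so few gaps in the grid, almost any held-out tuple has measured neighbors that differ at a single position, which is why a win here reads as gap-filling.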
FINDING 04

Holding out chemical neighborhoods is supportive.

When whole chemical neighborhoods inside a library were held out, the model still beat retrieval in the main systems. This suggests the signal is not just local analog copying.

+0.236 · Neighborhood-holdout gain

FINDING 05

The hardest scaffold-hop question is not solved yet.

Cross-library scaffold-family transfer was weak in the retrospective tests. We treat that as a boundary to investigate, not a failure of the story. The paper specifies a clean prospective chip run that will settle this directly.

~1,000 wells · Prospective scaffold-hop test

What this enables

A model is only as useful as its boundary.

A chemistry-to-phenotype model only earns its keep when the next chemistry is genuinely different. The ladder marks where the current pilot already helps, where it is mainly filling gaps inside a known space, and where the next experiment has to settle the question. The benchmark will keep telling the next chip what it actually has to prove.

Access

Preprint, data, and analysis repo.

Public release. The L1 to L5 splits, baselines, and benchmark code ship with the data bundle on Zenodo. The ladder is intended as a reusable benchmark; we welcome submissions that beat it.