Drug-discovery ML can look impressive when the test molecules sit close to the training set. This paper sets up a ladder of progressively harder tests so we can separate real prediction from near-neighbor lookup. The honest answer is mixed and useful: one pilot system shows extrapolation to held-out building blocks, another mostly fills gaps inside a dense library, and the hardest scaffold-hop test is now precisely defined.
The question is simple to ask and hard to test fairly: when the chemistry changes, does the model still help? In ZEL031 / THP1 it beat nearest-neighbor retrieval even when whole chemical building blocks were held out. In ZEL024 / HEK293 it mostly filled gaps inside a very dense library. Cross-library scaffold transfer is the hardest case and is not solved yet; the value of the pilot is that it defines the prospective experiment needed to test it cleanly.
Investors hear a lot of claims that ML can predict chemistry. The follow-up question is usually missing: what kind of new chemistry. A new combination of familiar parts is one thing. A molecule built from parts the model has never seen is harder. A new scaffold family is harder still.
The ladder makes that distinction explicit. At every rung, the model is compared against a nearest-neighbor baseline that just copies the closest measured compound. A model only earns credit when it beats that baseline.
We built five tests. The easiest holds out new combinations of familiar building blocks. The next ones hold out individual building blocks, then multiple building-block positions, then whole chemical neighborhoods. The hardest asks whether a model trained on one library family can transfer to another in the same cell line.
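To make the building-block rungs concrete, here is a minimal sketch of a held-out building-block split. It assumes each compound can be written as a tuple with one building-block ID per position; that encoding, the function names, and the toy library are illustrative assumptions, not the paper's released split code.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical encoding: one building-block ID per position, e.g. ("A1", "B2").
Compound = Tuple[str, ...]


def split_by_held_out_blocks(
    compounds: Sequence[Compound],
    held_out: Dict[int, Set[str]],
) -> Tuple[List[Compound], List[Compound]]:
    """Send every compound that uses a held-out block to the test set.

    `held_out` maps a position index to the block IDs withheld at that
    position. One position gives a single-axis holdout; several positions
    give a multi-axis holdout.
    """
    train: List[Compound] = []
    test: List[Compound] = []
    for compound in compounds:
        uses_held_out = any(
            compound[pos] in blocks for pos, blocks in held_out.items()
        )
        (test if uses_held_out else train).append(compound)
    return train, test


def unseen_positions(train: Sequence[Compound], compound: Compound) -> List[int]:
    """Return the positions where the compound's block never occurs in training."""
    seen = [{c[pos] for c in train} for pos in range(len(compound))]
    return [pos for pos, block in enumerate(compound) if block not in seen[pos]]


# Toy 3 x 2 library; hold out block "A3" at position 0.
library = [(a, b) for a in ("A1", "A2", "A3") for b in ("B1", "B2")]
train, test = split_by_held_out_blocks(library, {0: {"A3"}})
print(test)                                   # [('A3', 'B1'), ('A3', 'B2')]
print(unseen_positions(train, ("A3", "B1")))  # [0]
```

A check like `unseen_positions` is useful as a sanity test on any split: on the easiest rung it should return an empty list for every test compound, and on the held-out building-block rungs it should not.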
The nearest-neighbor baseline runs at every rung and copies the measured response of the closest training compound. It is the simplest reasonable thing the model has to beat.
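The baseline described above can be sketched in a few lines. This version assumes compounds are represented as sets of fingerprint feature IDs and that closeness is measured by Tanimoto similarity; the text does not say which representation or similarity measure the paper uses, so both are assumptions made for illustration.

```python
from typing import FrozenSet, Sequence, Tuple

Fingerprint = FrozenSet[int]  # assumed representation: set of "on" feature IDs


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbor_predict(
    train: Sequence[Tuple[Fingerprint, float]],
    query: Fingerprint,
) -> Tuple[float, float]:
    """Copy the measured response of the most similar training compound.

    Returns (predicted_response, similarity_to_that_neighbor).
    """
    if not train:
        raise ValueError("training set is empty")
    best_fp, best_response = max(train, key=lambda item: tanimoto(item[0], query))
    return best_response, tanimoto(best_fp, query)


# Toy data: two measured compounds and one query.
measured = [
    (frozenset({1, 2, 3}), 0.8),
    (frozenset({7, 8}), -0.2),
]
prediction, similarity = nearest_neighbor_predict(measured, frozenset({1, 2, 4}))
print(prediction, similarity)  # 0.8 0.5
```

Returning the similarity alongside the prediction makes it easy to see how far each test compound sits from its closest measured neighbor.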
In ZEL031 / THP1 the model still helped when held-out compounds used building-block identities the model had never seen. It beat nearest-neighbor retrieval on the majority of test compounds, which is the cleanest extrapolation result in the pilot.
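One way to read "beat retrieval on the majority of test compounds" is as a per-compound win rate: the share of held-out compounds where the model's error is strictly smaller than the baseline's. The sketch below assumes absolute error on a single scalar response, which may differ from the metric used in the paper, and the numbers are made up for illustration.

```python
from typing import Sequence


def win_rate(
    y_true: Sequence[float],
    y_model: Sequence[float],
    y_baseline: Sequence[float],
) -> float:
    """Fraction of compounds where the model's absolute error beats the baseline's."""
    if not (len(y_true) == len(y_model) == len(y_baseline)):
        raise ValueError("all inputs must have the same length")
    if not y_true:
        raise ValueError("no test compounds")
    wins = sum(
        abs(m - t) < abs(b - t)
        for t, m, b in zip(y_true, y_model, y_baseline)
    )
    return wins / len(y_true)


# Illustrative values only, not results from the paper.
measured = [1.0, 0.0, 0.5, 0.2]
model_pred = [0.9, 0.1, 0.0, 0.2]
baseline_pred = [0.5, 0.3, 0.4, 0.6]
print(win_rate(measured, model_pred, baseline_pred))  # 0.75
```

A win rate above 0.5 corresponds to the "majority" claim; ties count against the model here because the comparison is strict.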
Holding out one building-block position at a time, the same system kept beating retrieval. The harder multi-axis result is more credible because the simpler holdouts behave the same way.
ZEL024 / HEK293 is sampled so densely that many held-out compounds still have very close analogs in the training set. The model is excellent at filling gaps inside that explored design space. We should not call that the same thing as extrapolation onto novel chemistry.
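A simple way to quantify how dense a library is around its held-out compounds is to count how many of them have a training compound that differs at only one building-block position. The sketch below assumes the same hypothetical tuple-of-block-IDs encoding and a "differs at most one position" definition of a close analog; neither comes from the paper.

```python
from typing import Sequence, Tuple

Compound = Tuple[str, ...]  # hypothetical: one building-block ID per position


def positions_differing(a: Compound, b: Compound) -> int:
    """Number of positions at which two compounds use different blocks."""
    if len(a) != len(b):
        raise ValueError("compounds must have the same number of positions")
    return sum(x != y for x, y in zip(a, b))


def fraction_with_close_analog(
    train: Sequence[Compound],
    test: Sequence[Compound],
    max_diff: int = 1,
) -> float:
    """Share of test compounds with a training compound within `max_diff` positions."""
    if not test:
        raise ValueError("no test compounds")
    covered = sum(
        any(positions_differing(t, c) <= max_diff for c in train)
        for t in test
    )
    return covered / len(test)


# Toy example: the first test compound has a one-position analog, the second has none.
train_set = [("A1", "B1", "C1"), ("A1", "B2", "C1"), ("A2", "B1", "C2")]
test_set = [("A1", "B1", "C2"), ("A3", "B3", "C3")]
print(fraction_with_close_analog(train_set, test_set))  # 0.5
```

A value near 1.0 would indicate that a held-out set is mostly testing interpolation inside explored space rather than extrapolation.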
When whole chemical neighborhoods inside a library were held out, the model still beat retrieval in the main systems. So the signal cannot be explained by local analog copying alone.
Cross-library scaffold-family transfer was weak in the retrospective tests. We treat that as a boundary to investigate, not a failure of the story. The paper specifies a clean prospective chip run that will settle this directly.
A chemistry-to-phenotype model only earns its keep when the next chemistry is genuinely different. The ladder marks where the current pilot already helps, where it is mainly filling gaps inside a known space, and where the next experiment has to settle the question. It also spells out what the next chip run actually has to prove.
Public release. The splits for the five rungs (L1 to L5), the baselines, and the benchmark code ship with the data bundle on Zenodo. The ladder is intended as a reusable benchmark; we welcome submissions that beat the baselines on it.