Paper 03 - Generalization Ladder - Z-Screen Pilot Release

TL;DR

The question is simple to ask and hard to test fairly: when the chemistry changes, does the model still help? In ZEL031 / THP1, the model beat nearest-neighbor retrieval even when whole chemical building blocks were held out. In ZEL024 / HEK293, it mostly filled gaps inside a very dense library. Cross-library scaffold transfer is the hardest case and is not solved yet; the value of the pilot is that it defines the prospective experiment needed to test it cleanly.

Why it matters

Investors hear a lot of claims that ML can predict chemistry. The follow-up question is usually missing: what kind of new chemistry? A new combination of familiar parts is one thing. A molecule built from parts the model has never seen is harder. A new scaffold family is harder still.

The ladder makes that distinction explicit. At every rung, the model is compared against a nearest-neighbor baseline that just copies the closest measured compound. A model only earns credit when it beats that baseline.

What we did

We built five tests. The easiest holds out new combinations of familiar building blocks. The next ones hold out individual building blocks, then multiple building-block positions, then whole chemical neighborhoods. The hardest asks whether a model trained on one library family can transfer to another in the same cell line.
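The building-block rungs can be illustrated with a small split function. This is a minimal sketch, not the pilot's actual split code: it assumes each compound is a tuple with one building-block ID per position, and the function name, block IDs, and the choice to drop partially unseen compounds are all illustrative assumptions.

```python
def building_block_holdout(compounds, held_out):
    """Split compounds so held-out building blocks never appear in training.

    compounds: list of tuples, one building-block ID per position (assumed layout).
    held_out:  dict mapping position index -> set of block IDs to hold out.
    """
    train, test = [], []
    for compound in compounds:
        # Positions at which this compound uses a held-out block.
        hits = [pos for pos, blocks in held_out.items() if compound[pos] in blocks]
        if not hits:
            train.append(compound)   # only familiar blocks
        elif len(hits) == len(held_out):
            test.append(compound)    # unseen block at every held-out position
        # Compounds unseen at only some positions are dropped from both sets.
    return train, test


# Toy two-position library (hypothetical block IDs).
library = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]

# Single-position holdout: block "A2" at position 0 is never seen in training.
single_train, single_test = building_block_holdout(library, {0: {"A2"}})

# Multi-position holdout: test compounds are unseen at both positions.
multi_train, multi_test = building_block_holdout(library, {0: {"A2"}, 1: {"B2"}})
```

In this sketch, dropping the partially unseen compounds keeps the multi-position test strict: every test compound is new at every held-out position, and no held-out block leaks into training.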

The nearest-neighbor baseline runs at every rung and copies the measured response of the closest training compound. It is the simplest reasonable thing the model has to beat.
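A baseline of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not the released benchmark code: it represents each compound as a set of feature IDs and uses Tanimoto similarity to define "closest", neither of which is specified in this summary.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of feature IDs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_neighbor_predict(query, train):
    """Copy the measured response of the training compound closest to query.

    train: list of (feature_set, measured_response) pairs (assumed layout).
    """
    _, response = max(train, key=lambda item: tanimoto(query, item[0]))
    return response


# Toy training set with hypothetical feature sets and responses.
train = [
    (frozenset({1, 2, 3}), 0.10),
    (frozenset({2, 3, 4, 5}), 0.80),
    (frozenset({7, 8}), -0.30),
]

# The query shares 3 of 4 features with the second compound, so its
# response is the one that gets copied.
prediction = nearest_neighbor_predict(frozenset({2, 3, 4}), train)
```

The baseline has no trainable parameters, so any gain the model shows over it cannot be explained by simply having a very similar measured compound in the training set.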

What we found

The ladder separates useful claims from inflated ones.

FINDING 01

One system shows prediction beyond close-copy lookup.

In ZEL031 / THP1 the model still helped when held-out compounds used building-block identities the model had never seen. It beat nearest-neighbor retrieval on the majority of test compounds, which is the cleanest extrapolation result in the pilot.

+0.168: Model advantage over lookup

FINDING 02

Simpler holdouts agree with the harder one.

When one building-block position was held out at a time, the same system kept beating retrieval. The harder multi-axis result is more credible because the simpler holdouts behave the same way.

89 - 91%: Compounds where model beat lookup
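One plausible way to compute a "share of compounds where the model beat lookup" figure is a per-compound win rate. The sketch below assumes that "beat" means a strictly smaller absolute error than the nearest-neighbor baseline on the same compound; the paper's exact definition may differ, and all values shown are made up for illustration.

```python
def win_rate(truth, model_pred, baseline_pred):
    """Fraction of compounds where the model's absolute error is strictly
    smaller than the baseline's (assumed definition of a 'win')."""
    wins = sum(
        abs(m - t) < abs(b - t)
        for t, m, b in zip(truth, model_pred, baseline_pred)
    )
    return wins / len(truth)


# Made-up measurements and predictions for four held-out compounds.
truth = [1.0, 0.0, 0.5, 0.2]
model_pred = [0.9, 0.1, 0.9, 0.2]
baseline_pred = [0.5, 0.4, 0.6, 0.6]

rate = win_rate(truth, model_pred, baseline_pred)  # model wins on 3 of 4
```

A per-compound win rate complements an average gain: it shows whether the advantage is spread across most compounds or driven by a few large wins.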

FINDING 03

A dense library gives a different but useful answer.

ZEL024 / HEK293 is sampled so densely that many held-out compounds still have very close analogs in the training set. The model is excellent at filling gaps inside that explored design space. We should not call that the same thing as extrapolation onto novel chemistry.

13,769 / 13,944 (about 98.7%): Observed tuples, saturated grid

FINDING 04

Neighborhood holdouts support the result.

When whole chemical neighborhoods inside a library were held out, the model still beat retrieval in the main systems. So the signal is not coming only from local analog copying.

+0.236: Neighborhood-holdout gain

FINDING 05

The hardest scaffold-hop question is not solved yet.

Cross-library scaffold-family transfer was weak in the retrospective tests. We treat that as a boundary to investigate, not as a failure of the approach. The paper specifies a clean prospective chip run designed to test this directly.

~1,000 wells: Prospective scaffold-hop test

What this enables

A model is only as useful as its boundary.

A chemistry-to-phenotype model only earns its keep when the next chemistry is genuinely different. The ladder marks where the current pilot already helps, where it is mainly filling gaps inside a known space, and where the next experiment has to settle the question. The benchmark will keep spelling out what each new chip run actually has to prove.

Access

Preprint, data, and analysis repo.

Public release. The L1 to L5 splits, baselines, and benchmark code ship with the data bundle on Zenodo. The ladder is intended as a reusable benchmark; we welcome submissions that beat it.

Preprint PDF

Paper 03 - 11 pages

Splits + benchmarks on Zenodo

L1 to L5 splits - reusable benchmark