Booktest was built on two decades of data science work. It has been used to support R&D on numerous LLM, ML, NLP, and information retrieval projects, as well as more traditional software engineering.
It was partly inspired by earlier examples (kudos to Ferenc), but mostly by real pain around ML QA: how to get regression testing, transparency, and fast iteration cycles at the same time.
In systems where correctness is fuzzy, evaluation is expensive, and changes have non-local effects, a failing test without diagnostics often raises more questions than it answers. Left unsolved, that is a painful combination.
Booktest is now on its third or fourth iteration of the same idea, and by now it addresses the most common needs and problems in this space.
It is a review-driven regression testing approach that captures system behavior as readable artifacts, so humans can see, review, and reason about regressions instead of fighting tooling.
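To make the idea concrete, here is a minimal sketch of review-driven snapshot testing. This is illustrative only, not booktest's actual API: the `books/` directory layout, `run_review_test`, and `render_report` are assumptions for the example. A test renders its behavior into a readable markdown report, which is diffed against a previously reviewed snapshot, and the diff itself is what a human reviews.

```python
# Conceptual sketch of review-driven snapshot testing (not booktest's actual API).
# File layout and helper names here are illustrative assumptions.
import difflib
from pathlib import Path

def run_review_test(name: str, render_report) -> None:
    out_dir = Path("books/out")
    snapshot = Path("books/accepted") / f"{name}.md"
    out_dir.mkdir(parents=True, exist_ok=True)

    report = render_report()                       # human-readable markdown
    (out_dir / f"{name}.md").write_text(report)

    if not snapshot.exists():
        raise AssertionError(f"no accepted snapshot for {name}; review and accept the new report")

    diff = list(difflib.unified_diff(
        snapshot.read_text().splitlines(),
        report.splitlines(),
        fromfile="accepted", tofile="current", lineterm=""))
    if diff:
        # the diff itself is the diagnostic a reviewer reads
        raise AssertionError("\n".join(diff))
```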
This approach has been used in production for testing ML/NLP systems processing large volumes of data, and we’ve now open-sourced it.
I'm curious whether this matches others’ experience, and how people handle this today.
interesting. cost is the main blocker for us with ai-evaluating-ai. we run ~10k concurrent agents and doubling the inference volume for verification isn't viable. does this support sampling or caching so we can keep the token usage under control?
One of the key techniques is snapshotting the LLM (or any HTTP) request. This means that if the inputs don't change, the LLM will not be called. This also snapshots/caches the LLM verification steps.
This doesn't just save costs; its main goal was to force determinism and save time. For limited changes, only the new or changed tests may need to be rerun against the LLM. CI typically doesn't have LLM API keys and only reruns against the snapshots, with zero cost and no delays.
LLM operations tend to be notoriously slow, and at least on our side we are often more interested in how our code interacts with the LLM. Having the LLM fully snapshotted makes iterating on the code delightfully fast.
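For the record, here is a minimal sketch of the snapshotting pattern itself (illustrative only, not booktest's actual API; the `requests` client and the `books/snapshots` directory are assumptions): responses are cached on disk keyed by a hash of the request, so unchanged inputs never hit the endpoint again.

```python
# Sketch of request snapshotting: cache responses keyed by a hash of the request.
# Illustrative only; not booktest's actual API.
import hashlib
import json
from pathlib import Path

import requests  # assumed HTTP client

SNAPSHOT_DIR = Path("books/snapshots")

def snapshot_post(url: str, payload: dict) -> dict:
    key = hashlib.sha256(json.dumps({"url": url, "payload": payload},
                                    sort_keys=True).encode()).hexdigest()
    path = SNAPSHOT_DIR / f"{key}.json"
    if path.exists():                              # replay: deterministic, zero cost
        return json.loads(path.read_text())
    response = requests.post(url, json=payload, timeout=60).json()
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response, indent=2))  # record for later runs / CI
    return response
```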
If you want to do sampling, it can be implemented in the test code. Booktest is a bit like pytest in the sense that the heavy lifting of the actual testing logic is left to the developer. Many LLM test suites are more opinionated, but also more intrusive in that sense.
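As an example of what that sampling might look like in your own test code (a sketch under my assumptions, not a booktest feature), you can pick a deterministic subset of prompts so the verification cost stays bounded while the selection stays stable across runs:

```python
# Sketch: deterministic sampling of test cases in user-written test code.
import hashlib

def sample_cases(cases: list[str], rate: float = 0.05) -> list[str]:
    def bucket(case: str) -> float:
        h = hashlib.sha256(case.encode()).hexdigest()
        return int(h[:8], 16) / 0xFFFFFFFF    # stable value in [0, 1]
    return [c for c in cases if bucket(c) <= rate]

# e.g. only ~5% of prompts get an LLM verification step on each run,
# and the same prompts are picked every time.
```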
snapshotting makes sense for CI but i'd worry about the storage footprint with high-variance inputs. across 10k agents our prompts diverge significantly so managing that much state might become its own bottleneck. do you have any data on the overhead for high-entropy workloads?