Benchmarks
Benchmarks are a way to find the best version of your agent based on a quantitative comparison of the performance, cost and latency of each version.
In order to benchmark an AI Feature, you need to have two things:
At least two saved versions of the AI Feature on the same Schema. To save a version, locate a run on the Playground or Runs page and select the "Save" button. This saves the parameters (instructions, temperature, etc.) and model combination used for that run.
Reviewed runs (we recommend starting with 10-20 reviews, depending on the complexity of your AI Feature). AI Feature runs are reviewed by a human reviewer, and reviewed runs are added to the Reviews dataset. You can learn more about how to review runs.
After creating a review dataset and saving the versions of your AI Feature that you want to benchmark, open the Benchmark page in WorkflowAI's sidebar and select the versions you want to compare. The content of your review dataset is automatically applied to all selected versions so that they are all evaluated against the same criteria.
Version accuracy is based on the human reviews left on runs, supplemented by AI-powered reviews. The process works as follows:
When benchmarking a version, the selected version is run on all the inputs present in the Reviews dataset.
Using the human reviews from the dataset as a baseline, AI-powered reviews are added to evaluate any runs of the benchmarked version that don't yet have a human review.
The number of correct and incorrect runs, based on both the human reviews and the AI-powered reviews, is used to calculate the accuracy of the version.
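As a back-of-the-envelope illustration (not WorkflowAI's actual implementation), the sketch below assumes each run carries an optional human verdict and an optional AI verdict, and computes accuracy as the share of reviewed runs judged correct; the `Run` shape and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Run:
    """One run of a benchmarked version on a review-dataset input (hypothetical shape)."""
    human_review: Optional[bool] = None  # True = correct, False = incorrect, None = not reviewed
    ai_review: Optional[bool] = None     # AI-powered review, used only when no human review exists

def version_accuracy(runs: list[Run]) -> float:
    """Share of runs judged correct, preferring human reviews over AI-powered ones."""
    verdicts = [
        run.human_review if run.human_review is not None else run.ai_review
        for run in runs
    ]
    verdicts = [v for v in verdicts if v is not None]  # ignore runs with no review at all
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Example: 8 correct and 2 incorrect reviews -> 80% accuracy
runs = [Run(human_review=True)] * 5 + [Run(ai_review=True)] * 3 + [Run(ai_review=False)] * 2
print(f"{version_accuracy(runs):.0%}")  # 80%
```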
Price is calculated from the number of tokens used in the version's runs and the cost of those tokens for the selected model.
Latency is calculated based on the time it takes for each of the version's runs to complete.
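For intuition only, here is a minimal sketch of how per-run cost and latency might roll up into a version's benchmark numbers; the token prices, field names, and simple averaging are assumptions rather than WorkflowAI's published formulas.

```python
# Hypothetical per-run metrics for one version (token counts and timings are made up).
runs = [
    {"prompt_tokens": 1_200, "completion_tokens": 300, "duration_s": 2.1},
    {"prompt_tokens": 1_150, "completion_tokens": 280, "duration_s": 1.8},
    {"prompt_tokens": 1_300, "completion_tokens": 350, "duration_s": 2.4},
]

# Assumed prices for the selected model, expressed per 1M tokens (illustrative values).
PROMPT_PRICE_PER_1M = 0.15      # USD
COMPLETION_PRICE_PER_1M = 0.60  # USD

def run_cost(run: dict) -> float:
    """Cost of a single run: tokens used multiplied by the model's token prices."""
    return (
        run["prompt_tokens"] * PROMPT_PRICE_PER_1M / 1_000_000
        + run["completion_tokens"] * COMPLETION_PRICE_PER_1M / 1_000_000
    )

avg_cost = sum(run_cost(r) for r in runs) / len(runs)
avg_latency = sum(r["duration_s"] for r in runs) / len(runs)
print(f"avg cost per run: ${avg_cost:.6f}, avg latency: {avg_latency:.1f}s")
```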
If there are inputs that you want to ensure are included when benchmarking a version, all you need to do is review at least one run of that input. Once an input has been reviewed, it will automatically be added to your Reviews dataset and used when benchmarking.
We're actively working on an even faster way to evaluate new models, but in the meantime, here is the process we currently recommend:
To quickly benchmark a new model:
Locate the feature you want to test the new model on.
Make sure the schema selected in the header matches the schema you currently have deployed.
Go to the Versions page and locate your currently deployed version (you will recognize it by the environment icon(s) next to the version number).
Hover over the version, select Clone, and then select the new model.
Go to the Benchmark page and select both the currently deployed version and the new version you just created.
Note: in order to ensure that a benchmark generates a fair and accurate comparison, it's important that you have a large enough evaluation dataset.