Advanced Testing Using Evaluations
Learn how to bulk generate, run, and analyze test cases efficiently to validate your AI agents' behavior across multiple scenarios.
The Evaluations feature in MindStudio allows you to test AI workflows at scale using autogenerated or manually defined test cases. This is especially helpful for validating workflows like moderation filters, where consistent logic must be applied across many inputs.
Manually testing workflows via the preview debugger becomes inefficient as the number of test cases grows. Evaluations allow you to:
Autogenerate test cases with AI
Specify expected outputs
Run tests in bulk
Compare actual vs. expected results
Use fuzzy matching for flexible validation
In this example, an AI workflow is designed to detect spam comments and flag violations based on defined community guidelines. The workflow takes in a comment via a launch variable and outputs:
A boolean indicating whether it's spam
An array of flags indicating types of violations
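To make that shape concrete, here is a minimal sketch of the workflow's input and output, assuming the launch variable is named comment and the output fields are named is_spam and flags (the exact names depend on how your workflow is configured):

```python
# Minimal sketch of the spam-detection workflow's I/O contract.
# Field names ("comment", "is_spam", "flags") are assumptions for illustration.
example_input = {"comment": "Buy cheap followers!!! Click this link now."}

example_output = {
    "is_spam": True,    # boolean: does the comment violate the guidelines?
    "flags": ["spam"],  # array of violation categories that apply
}
```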
Navigate to the top-level "Evaluations" tab in your project.
Click New Test Case to add a test manually, or use Autogenerate to let AI create test cases for you.
Enter guidance such as “generate five test cases that are in violation of our guidelines.”
AI will produce sample comments with the correct input structure.
Add expected results (e.g., "is_spam": true, "flags": ["hateful", "off-topic"]).
Click Run All to test all cases in parallel.
MindStudio will show which tests pass or fail based on comparison with expected results.
Each test can be inspected in the debugger.
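For illustration, this first batch of test cases and expected results might be organized roughly like the sketch below; the comment text, field names, and flag labels are assumptions rather than values MindStudio generates:

```python
# Illustrative test cases for comments that violate the guidelines,
# each pairing a sample input with the expected workflow output.
violating_cases = [
    {
        "input": {"comment": "You people are all idiots, get lost."},
        "expected": {"is_spam": True, "flags": ["hateful"]},
    },
    {
        "input": {"comment": "Cheap watches!!! Click my link now."},
        "expected": {"is_spam": True, "flags": ["spam", "off-topic"]},
    },
]
```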
Repeat the process with a new prompt: “generate five comments not in violation.”
Provide expected results (e.g., "is_spam": false, "flags": []).
Run the new set and verify accuracy.
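The second batch reuses the same structure with the expectations inverted, for example:

```python
# Illustrative test cases for comments that do NOT violate the guidelines;
# the expected output is is_spam = False with an empty flags array.
clean_cases = [
    {
        "input": {"comment": "Great write-up, thanks for sharing!"},
        "expected": {"is_spam": False, "flags": []},
    },
    {
        "input": {"comment": "Could you cover moderation queues next?"},
        "expected": {"is_spam": False, "flags": []},
    },
]
```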
MindStudio supports two types of result matching:
Literal Match: Requires the actual output to exactly match the expected value.
Fuzzy Match: Allows minor differences or variations in phrasing. Useful for outputs with dynamic AI wording.
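Conceptually, the two modes behave roughly like the sketch below. This is a simplification to show the idea, not MindStudio's actual comparison logic, and the normalization rules are assumptions:

```python
def normalize(text: str) -> str:
    # Lowercase, drop periods, and collapse whitespace for lenient comparison.
    return " ".join(text.lower().replace(".", "").split())

def literal_match(actual: str, expected: str) -> bool:
    # Passes only when the actual output exactly equals the expected value.
    return actual == expected

def fuzzy_match(actual: str, expected: str) -> bool:
    # Tolerates minor variations in casing, punctuation, and surrounding wording.
    return normalize(expected) in normalize(actual)

print(literal_match('{"is_spam": true}', '{"is_spam": true}'))            # True
print(fuzzy_match("Yes, this comment is spam.", "this comment is spam"))  # True
```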
With Evaluations, you can:
Run many test cases at once
Easily edit and rerun failing cases
Debug individual results
Improve the reliability of your AI workflows
Evaluations are a key tool for ensuring your AI behaves as expected at scale. Whether you're building content filters, classifiers, or other deterministic logic, this feature helps you confidently validate your workflows.