Advanced Testing Using Evaluations
Learn how to bulk generate, run, and analyze test cases efficiently to validate your AI agents' behavior across multiple scenarios.
The Evaluations feature in MindStudio allows you to test AI workflows at scale using autogenerated or manually defined test cases. This is especially helpful for validating workflows like moderation filters, where consistent logic must be applied across many inputs.
Manually testing workflows via the preview debugger becomes inefficient as the number of test cases grows. Evaluations allow you to:
Autogenerate test cases with AI
Specify expected outputs
Run tests in bulk
Compare actual vs. expected results
Use fuzzy matching for flexible validation
In this example, an AI workflow is designed to detect spam comments and flag violations based on defined community guidelines. The workflow takes in a comment via a launch variable and outputs:
A boolean indicating whether it's spam
An array of flags indicating types of violations
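To make that shape concrete, here is a minimal sketch of the workflow's input and output, assuming the launch variable is named comment and the output fields are named is_spam and flags (the exact names depend on how your workflow is configured):

```python
# Minimal sketch of the spam-detection workflow's I/O contract.
# Field names ("comment", "is_spam", "flags") are assumptions for illustration.
example_input = {"comment": "Buy cheap followers!!! Click this link now."}

example_output = {
    "is_spam": True,    # boolean: does the comment violate the guidelines?
    "flags": ["spam"],  # array of violation categories that apply
}
```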
Navigate to the top-level "Evaluations" tab in your project.
Click New Test Case to add a test manually, or use Autogenerate to let AI create test cases for you.
Enter guidance such as “generate five test cases that are in violation of our guidelines.”
AI will produce sample comments with the correct input structure.
Add expected results (e.g., "is_spam": true, "flags": ["hateful", "off-topic"]).
Click Run All to test all cases in parallel.
MindStudio will show which tests pass or fail based on comparison with expected results.
Each test can be inspected in the debugger.
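For illustration, this first batch of test cases and expected results might be organized roughly like the sketch below; the comment text, field names, and flag labels are assumptions rather than values MindStudio generates:

```python
# Illustrative test cases for comments that violate the guidelines,
# each pairing a sample input with the expected workflow output.
violating_cases = [
    {
        "input": {"comment": "You people are all idiots, get lost."},
        "expected": {"is_spam": True, "flags": ["hateful"]},
    },
    {
        "input": {"comment": "Cheap watches!!! Click my link now."},
        "expected": {"is_spam": True, "flags": ["spam", "off-topic"]},
    },
]
```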
Repeat the process with a new prompt: “generate five comments not in violation.”
Provide expected results (e.g., "is_spam": false, "flags": []).
Run the new set and verify accuracy.
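The second batch reuses the same structure with the expectations inverted, for example:

```python
# Illustrative test cases for comments that do NOT violate the guidelines;
# the expected output is is_spam = False with an empty flags array.
clean_cases = [
    {
        "input": {"comment": "Great write-up, thanks for sharing!"},
        "expected": {"is_spam": False, "flags": []},
    },
    {
        "input": {"comment": "Could you cover moderation queues next?"},
        "expected": {"is_spam": False, "flags": []},
    },
]
```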
MindStudio supports two types of result matching:
Literal Match: Requires the actual output to exactly match the expected value.
Fuzzy Match: Allows minor differences or variations in phrasing. Useful for outputs with dynamic AI wording.
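Conceptually, the two modes behave roughly like the sketch below. This is a simplification to show the idea, not MindStudio's actual comparison logic, and the normalization rules are assumptions:

```python
def normalize(text: str) -> str:
    # Lowercase, drop periods, and collapse whitespace for lenient comparison.
    return " ".join(text.lower().replace(".", "").split())

def literal_match(actual: str, expected: str) -> bool:
    # Passes only when the actual output exactly equals the expected value.
    return actual == expected

def fuzzy_match(actual: str, expected: str) -> bool:
    # Tolerates minor variations in casing, punctuation, and surrounding wording.
    return normalize(expected) in normalize(actual)

print(literal_match('{"is_spam": true}', '{"is_spam": true}'))            # True
print(fuzzy_match("Yes, this comment is spam.", "this comment is spam"))  # True
```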
With Evaluations, you can:
Run many test cases at once
Easily edit and rerun failing cases
Debug individual results
Improve the reliability of your AI workflows
Evaluations are a key tool for ensuring your AI behaves as expected at scale. Whether you're building content filters, classifiers, or other deterministic logic, this feature helps you confidently validate your workflows.