Power Platform Community / Forums / Copilot Studio / Error in Evaluations w...
Copilot Studio
Answered

Error in Evaluations with "Compare Meaning" Test Type.

Posted by 24

Hi everyone,

I’m encountering an issue while running Evaluations in Microsoft Copilot Studio.

When I run an evaluation using the “Compare meaning” test type, about half of the test cases throw an error with the following message in the Evaluation pane:

“Something went wrong while evaluating this test case.”

For this particular question, the evaluation details panel does not display the usual fields such as:

  • Agent Response
  • Knowledge sources cited
  • Topics

These sections are completely missing for the errored test case, while other test cases in the same evaluation run normally (they either fail due to low semantic similarity or pass based on the threshold).

What I’ve checked so far:

  • The evaluation is configured with the Compare meaning test type.
  • Other test cases in the same evaluation run successfully.
  • The issue seems isolated to specific questions.

I’ve attached two screenshots:

  • The evaluation pane showing the missing Agent Response, Knowledge sources cited, and Topics sections.

Has anyone encountered this issue before, or know what might cause a test case to fail evaluation in this way? Any suggestions on what to check or troubleshoot would be greatly appreciated.

Thanks!

 
Screenshot 2026-03-10 094219.png
  • Verified answer
    Sayali (Microsoft Employee)
    Hello,
    The “Something went wrong while evaluating this test case” error in Copilot Studio occurs when the evaluation engine cannot generate an agent response before running the “Compare meaning” evaluation. The evaluation process first runs the agent with the test question, captures the response, and then performs semantic similarity scoring. If the agent fails to produce a response, the system cannot generate agent response details, knowledge citations, or topic data, so the test case enters an error state instead of pass/fail.

    This typically happens due to content safety filtering, runtime limits during evaluation (lower tokens, shorter timeouts), tool or connector failures, or complex orchestration paths involving multiple topics or flows. These issues may only appear during evaluation because it runs under stricter constraints than normal chat. As a result, the UI hides response-related sections since no response object exists.

    In practice, the issue can often be resolved by testing the question manually in Test chat, simplifying the expected response, temporarily disabling tools or flows, or narrowing broad knowledge queries. Microsoft has acknowledged that evaluation currently does not expose detailed runtime errors, so all such failures appear as the same generic error message.

     
     
     

     
  • SR-24020711-0
    @Sayali Thank you for providing the facts and potential resolutions so promptly. Appreciate your help!
  • Suggested answer
    Sayali (Microsoft Employee)
    Hello,
    If the response was helpful, could you please share your valuable feedback?
    Your feedback is important to us. Please rate us:

    🤩 Excellent 🙂 Good 😐 Average 🙁 Needs Improvement 😠 Poor
  • AS-02041910-0
    I am currently facing the same issue, but I can see the agent response being generated. As you can see, I have multiple scores to test: for a particular question, General Quality shows a Pass/Fail and Text Similarity shows a score value, but the Compare meaning metric enters an error state. I could not find any resolution or explanation for this behaviour.

    One thing to point out: I see this error state for random questions and random metrics on every run.

    What might cause a test case to error during evaluation in this way? Any suggestions on what to check or troubleshoot would be greatly appreciated. Thanks!
    download - 1.png
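The run-then-score flow described in the verified answer (the agent is run first, and "Compare meaning" scoring only happens if a response object exists) can be sketched roughly as below. This is a hypothetical illustration, not the actual Copilot Studio implementation: the function names, the word-overlap similarity stand-in, and the 0.8 threshold are all assumptions made for the example.

```python
def semantic_similarity(a: str, b: str) -> float:
    # Stand-in for a real semantic scorer: simple word-overlap (Jaccard).
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def evaluate_test_case(run_agent, question: str, expected: str,
                       threshold: float = 0.8) -> dict:
    """Run the agent first; 'Compare meaning' scoring happens only if a response exists."""
    try:
        # The agent call may fail for the reasons listed above: content safety
        # filtering, tighter runtime limits, or tool/connector errors.
        response = run_agent(question)
    except Exception:
        response = None
    if not response:
        # With no response object there is nothing to score and nothing to show
        # (no Agent Response, citations, or Topics), so the case errors out.
        return {"status": "error",
                "message": "Something went wrong while evaluating this test case."}
    score = semantic_similarity(response, expected)
    return {"status": "pass" if score >= threshold else "fail", "score": score}

# A healthy agent passes; an agent that times out reproduces the error state.
ok = evaluate_test_case(lambda q: "the capital of France is Paris",
                        "What is the capital of France?",
                        "the capital of France is Paris")
bad = evaluate_test_case(lambda q: (_ for _ in ()).throw(TimeoutError()),
                         "broad knowledge query", "anything")
```

This also matches the observed UI behaviour: pass/fail cases carry a score, while the error case carries only the generic message, since no response was ever captured.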

