This issue is unrelated to Adaptive Card rendering and instead occurs in the Agent Evaluation runtime (Copilot Studio / AI Foundry Test panel). The key signal is that agent responses generate successfully, and metrics like General Quality and Text Similarity pass, while Compare Meaning fails randomly across runs.

This happens because Compare Meaning is a non-deterministic, LLM-based semantic judge that performs an additional model call using the question, expected answer, and agent response. Failures typically occur due to token limits, complex or structured responses (JSON, tables, code blocks), timeouts, or parsing issues. Since these vary per response, errors appear random.

Conclusion: The agent is functioning correctly. The failures are caused by evaluation model invocation limits or formatting issues, not by your agent logic.
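To make the failure pattern concrete, here is a minimal, hypothetical sketch of how an LLM-based semantic judge of this kind typically works. This is not the actual Copilot Studio / AI Foundry implementation; the function names, prompt, and token budget below are illustrative assumptions only, but they show why long or structured responses and unparseable judge output lead to intermittent failures.

```python
# Hypothetical sketch of an LLM-based "semantic judge" evaluation step.
# NOT the Copilot Studio / AI Foundry implementation; names, prompt, and
# limits are illustrative assumptions.

import json

MAX_PROMPT_TOKENS = 4000  # assumed judge context budget (illustrative)


def call_judge_model(prompt: str) -> str:
    """Placeholder for the extra model call the judge makes.

    A real judge would call an LLM endpoint here; its output is
    non-deterministic and is not guaranteed to be valid JSON.
    """
    raise NotImplementedError("wire this to your model endpoint")


def compare_meaning(question: str, expected: str, actual: str) -> bool:
    """Ask a judge model whether 'actual' means the same as 'expected'."""
    prompt = (
        'You are an evaluator. Reply with JSON {"equivalent": true|false}.\n'
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Agent response: {actual}\n"
    )

    # Failure mode 1: long or heavily structured agent responses (JSON,
    # tables, code blocks) can push the judge prompt past its token budget.
    if len(prompt) // 4 > MAX_PROMPT_TOKENS:  # rough token estimate
        raise RuntimeError("judge prompt exceeds token limit")

    raw = call_judge_model(prompt)

    # Failure mode 2: the judge's free-form output may not parse; because
    # the call is non-deterministic, the same test case can pass on one
    # run and fail on the next.
    try:
        return bool(json.loads(raw)["equivalent"])
    except (json.JSONDecodeError, KeyError) as exc:
        raise RuntimeError(f"could not parse judge verdict: {raw!r}") from exc
```

Because the verdict depends on a fresh model call and a parse of its free-form output, both error paths above can appear or disappear between otherwise identical runs, which matches the random Compare Meaning failures described here.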