
Randomness of error state for evaluation in Copilot Studio

I am currently facing the same issue, but I can see the agent response being generated. As you can see in the screenshot, I have multiple metrics to test. For a particular question, General Quality returns a Pass/Fail and Text Similarity returns a score value, but the Compare Meaning metric enters an error state. I have not been able to find any resolution or explanation for this behaviour.

One thing to point out: I get this error state for random questions and random metrics on every run.
 
What might cause a test case to error during evaluation in this way? Any suggestions on what to check or how to troubleshoot would be greatly appreciated. Thanks!
[Screenshot: download - 1.png]
  • Suggested answer
    rezarizvii
    Hi, hope you are doing well.
     
    This is typically a semantic evaluation failure (model-side), not a logic issue in your agent.
     
    Could you maybe try the following (a quick sketch of the first four checks follows this list):
    • Logging both the expected and actual responses for the failed rows
    • Trimming/cleaning the text (no HTML, no extra formatting)
    • Ensuring neither side is null/empty
    • Keeping responses reasonably sized (not huge blobs)
    • Re-running the same dataset; if different rows fail each time, it is likely transient/model-side
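    Assuming you can export the evaluation rows to a CSV, here is a minimal Python sketch of that pre-run check. The file name and the question/expected/actual/status column names are placeholders, so adjust them to whatever your export actually contains:

```python
# Minimal pre-run check over exported evaluation rows.
# File name and column names are placeholders for whatever your export contains.
import csv
import html
import re

MAX_CHARS = 4000  # rough guard against very large responses; tune as needed

def clean(text: str) -> str:
    """Strip HTML tags, unescape entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text or "")
    return re.sub(r"\s+", " ", html.unescape(text)).strip()

with open("evaluation_rows.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f), start=1):
        expected = clean(row.get("expected", ""))
        actual = clean(row.get("actual", ""))
        if not expected or not actual:
            print(f"Row {i}: expected or actual response is empty")
        elif max(len(expected), len(actual)) > MAX_CHARS:
            print(f"Row {i}: very long response ({len(actual)} chars), consider trimming")
        if row.get("status", "").strip().lower() == "error":
            print(f"Row {i} errored\n  expected: {expected[:120]}\n  actual:   {actual[:120]}")
```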
     
    If this reply helped you in any way, please give it a Like 💜, and if it resolved your issue, please mark it as the Verified Answer ✅.
  • Suggested answer
    Sayali (Microsoft Employee)
    Hello,

    This issue is unrelated to Adaptive Card rendering and instead occurs in the Agent Evaluation runtime (Copilot Studio / AI Foundry Test panel). The key signal is that agent responses generate successfully, and metrics like General Quality and Text Similarity pass, while Compare Meaning fails randomly across runs.

    This happens because Compare Meaning is a non-deterministic, LLM-based semantic judge that performs an additional model call using the question, expected answer, and agent response. Failures typically occur due to token limits, complex/structured responses (JSON, tables, code blocks), timeouts, or parsing issues. Since these vary per response, errors appear randomly.

    Conclusion: The agent is functioning correctly — the failures are caused by evaluation model invocation limits or formatting issues, not by your agent logic.
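    For illustration only: the built-in Compare Meaning judge cannot be configured or wrapped by you, but if you reproduce a similar semantic check with your own LLM call, adding a retry with backoff shows why transient timeouts or parsing failures surface as seemingly random errors when nothing retries them. The judge callable below is hypothetical and stands in for that extra model call.

```python
# Illustrative sketch, not the actual Copilot Studio evaluation code.
# "judge" stands in for any LLM-backed semantic comparison; it receives the same
# three inputs Compare Meaning uses: question, expected answer, agent response.
import time
from typing import Callable

def judge_with_retry(judge: Callable[[str, str, str], float],
                     question: str, expected: str, actual: str,
                     retries: int = 3, base_delay: float = 2.0) -> float:
    """Call a semantic-judge function, retrying transient errors with exponential backoff."""
    for attempt in range(retries):
        try:
            return judge(question, expected, actual)
        except Exception:
            # Timeouts, rate limits, or unparseable judge output land here.
            if attempt == retries - 1:
                raise  # surface the error only after the final attempt
            time.sleep(base_delay * (2 ** attempt))
```

    Inside the Test panel the practical equivalent is simply re-running the evaluation; rows that errored on one pass will often evaluate normally on the next.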

     
     
