This issue is unrelated to Adaptive Card rendering and instead occurs in the Agent Evaluation runtime (Copilot Studio / AI Foundry Test panel). The key signal is that agent responses generate successfully, and metrics like General Quality and Text Similarity pass, while Compare Meaning fails randomly across runs.

This happens because Compare Meaning is a non-deterministic, LLM-based semantic judge that performs an additional model call using the question, expected answer, and agent response. Failures typically occur due to token limits, complex or structured responses (JSON, tables, code blocks), timeouts, or parsing issues. Since these vary per response, errors appear random.

Conclusion: The agent is functioning correctly. The failures are caused by evaluation model invocation limits or formatting issues, not by your agent logic.
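To make the failure pattern concrete, here is a minimal, hypothetical sketch of how an LLM-based semantic judge of this kind typically works. This is not the actual Copilot Studio / AI Foundry implementation; the function names, prompt, and token budget below are illustrative assumptions only, but they show why long or structured responses and unparseable judge output lead to intermittent failures.

```python
# Hypothetical sketch of an LLM-based "semantic judge" evaluation step.
# NOT the Copilot Studio / AI Foundry implementation; names, prompt, and
# limits are illustrative assumptions.

import json

MAX_PROMPT_TOKENS = 4000  # assumed judge context budget (illustrative)


def call_judge_model(prompt: str) -> str:
    """Placeholder for the extra model call the judge makes.

    A real judge would call an LLM endpoint here; its output is
    non-deterministic and is not guaranteed to be valid JSON.
    """
    raise NotImplementedError("wire this to your model endpoint")


def compare_meaning(question: str, expected: str, actual: str) -> bool:
    """Ask a judge model whether 'actual' means the same as 'expected'."""
    prompt = (
        'You are an evaluator. Reply with JSON {"equivalent": true|false}.\n'
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Agent response: {actual}\n"
    )

    # Failure mode 1: long or heavily structured agent responses (JSON,
    # tables, code blocks) can push the judge prompt past its token budget.
    if len(prompt) // 4 > MAX_PROMPT_TOKENS:  # rough token estimate
        raise RuntimeError("judge prompt exceeds token limit")

    raw = call_judge_model(prompt)

    # Failure mode 2: the judge's free-form output may not parse; because
    # the call is non-deterministic, the same test case can pass on one
    # run and fail on the next.
    try:
        return bool(json.loads(raw)["equivalent"])
    except (json.JSONDecodeError, KeyError) as exc:
        raise RuntimeError(f"could not parse judge verdict: {raw!r}") from exc
```

Because the verdict depends on a fresh model call and a parse of its free-form output, both error paths above can appear or disappear between otherwise identical runs, which matches the random Compare Meaning failures described here.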