This issue is unrelated to Adaptive Card rendering. It occurs in the Agent Evaluation runtime (Copilot Studio / AI Foundry Test panel). The key signal is that agent responses generate successfully and metrics such as General Quality and Text Similarity pass, while Compare Meaning fails intermittently across runs.

Compare Meaning is a non-deterministic, LLM-based semantic judge: it makes an additional model call using the question, the expected answer, and the agent response. That extra call typically fails because of token limits, complex or structured responses (JSON, tables, code blocks), timeouts, or parsing issues. Because these conditions vary from response to response, the errors appear random.

Conclusion: The agent is functioning correctly. The failures are caused by evaluation model invocation limits or formatting issues, not by your agent logic.
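If you want to check whether the failing runs line up with the causes listed above, one option is to export the agent responses and scan them locally for length and structured content. The following is only a rough sketch under stated assumptions: `risk_flags` and `MAX_CHARS` are hypothetical names, the 4000-character threshold is an arbitrary placeholder (the evaluation model's actual limit is not stated here), and none of this calls a Copilot Studio or AI Foundry API.

```python
import json
import re

# Hypothetical threshold: the real limit of the evaluation model is not
# known from this thread, so adjust it to whatever you observe.
MAX_CHARS = 4000

# Matches a Markdown table separator row such as |---|---| or ---|:---:
TABLE_SEPARATOR = re.compile(
    r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)+\|?\s*$",
    re.MULTILINE,
)


def risk_flags(response: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Return reasons a response might be hard for an LLM-based judge to score."""
    flags = []
    if len(response) > max_chars:
        flags.append("long")
    if "```" in response:
        flags.append("code_block")
    if TABLE_SEPARATOR.search(response):
        flags.append("table")
    stripped = response.strip()
    # Only try to parse when the text looks like a JSON object or array.
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            flags.append("json")
        except json.JSONDecodeError:
            pass
    return flags


if __name__ == "__main__":
    samples = [
        "The answer is 42.",
        '{"status": "ok", "items": [1, 2, 3]}',
        "| a | b |\n|---|---|\n| 1 | 2 |",
    ]
    for text in samples:
        print(risk_flags(text))
```

If the responses that fail Compare Meaning are mostly the ones that get flagged, that supports the explanation above. If unflagged, short plain-text responses fail just as often, timeouts or judge-side parsing are the more likely cause.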