Copilot Studio

Test answers accuracy


Hi all,

Has anyone tested the accuracy of a RAG Copilot? We're working with a few dozen documents, and manual testing is too labor-intensive. Is there a systematic way to ensure that the answers provided by Copilot match our expected answers? We've created a set of Q&A pairs for each document (considered the correct answers) and want to evaluate Copilot's performance. Any insights or methods would be greatly appreciated. Thanks!

  • Artur Stepniak (Moderator) replied:
    Hello,
     
    I assume that you're using Azure AI Studio. As you probably know, a lot is changing day by day, and I see that Microsoft has already prepared a tool for testing:
     
     
     
    Have you tried it? Nevertheless, you could still test the deployment by using any programming language and assessing the output. Just as an example in Python:
     
    import unittest

    # Requires the OpenAI Python SDK v1+ (pip install openai); the older
    # openai.ChatCompletion.create interface has been removed.
    from openai import OpenAI

    # The client can also read the key from the OPENAI_API_KEY environment variable.
    client = OpenAI(api_key="your_api_key_here")

    def get_llm_response(prompt, model="gpt-4"):
        """
        Get the LLM response for a given prompt.

        :param prompt: The prompt to send to the LLM
        :param model: The model to use (default is "gpt-4")
        :return: The LLM response as a string
        """
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
        )
        return response.choices[0].message.content.strip()

    class TestLLMOutput(unittest.TestCase):
        """Test cases for validating LLM output."""

        def test_capital_of_france(self):
            """Test if the LLM correctly identifies the capital of France."""
            prompt = "What is the capital of France?"
            expected_response = "The capital of France is Paris."
            response = get_llm_response(prompt)
            self.assertEqual(response, expected_response, f"LLM response: {response}")

        def test_addition(self):
            """Test if the LLM correctly performs addition."""
            prompt = "What is 2 + 2?"
            expected_response = "2 + 2 equals 4."
            response = get_llm_response(prompt)
            self.assertEqual(response, expected_response, f"LLM response: {response}")

        def test_greeting(self):
            """Test if the LLM returns a polite greeting."""
            prompt = "Say hello to the user."
            expected_response = "Hello, how can I assist you today?"
            response = get_llm_response(prompt)
            self.assertEqual(response, expected_response, f"LLM response: {response}")

    if __name__ == "__main__":
        unittest.main()
     
    You'd need to make sure that the response verification is not based on exact equality (assertEqual): due to the nature of LLMs, the answer can be slightly different each time, so a strict comparison would fail. :-)
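    For example, you could score similarity against the gold answer and assert a threshold instead of exact equality. A minimal sketch using only the standard library (the 0.7 threshold and the helper names are just illustrative; in practice you might prefer embedding-based semantic similarity or an LLM-as-judge approach):

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity in [0, 1] using difflib's SequenceMatcher."""
    return difflib.SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def check_answer(response: str, expected: str, threshold: float = 0.7) -> bool:
    """Pass if the response is close enough to the expected (gold) answer."""
    return similarity(response, expected) >= threshold

# A paraphrased answer still passes; a completely different one does not.
print(check_answer("The capital of France is Paris, of course.",
                   "The capital of France is Paris."))  # True
print(check_answer("I don't know.",
                   "The capital of France is Paris."))  # False
```

    With something like this in place, the assertEqual calls in the tests above become assertTrue(check_answer(response, expected_response)).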
     
    In case of any other questions, let me know. If the answer is correct, mark it as a solution, so that others can benefit from it.
     
    Best regards,
     
    Artur Stepniak

