Testing considerations for LLM tasks

LLM tasks are a use case of LLMs that focuses on performing specific operations on textual inputs, rather than on conversational interaction as in chatbots. A simple example is ‘entity extraction’: identifying and classifying specific data in a textual input. More concretely, consider extracting city names from a textual input, which will serve as our primary example throughout this article. (We wrote more about LLM tasks in a previous blog post.)

Sometimes we want to (or should) integrate these tasks into our program flow, just like traditional functional APIs. In such cases, we’d want to ensure that the task behaves, and keeps behaving, as expected, just as we would for other functional APIs in our system.

For the traditional case, we have well-established testing practices. However, testing LLMs is different in many ways. In this post, we will explore the testing considerations for LLM tasks: the similarities between them and conventional software testing practices, how to overcome related challenges, and how to leverage their advantages.

We will focus on structured and/or deterministic tasks—ones intended to return predictable responses, either exact or schema-based.

Challenges in testing LLMs

Costly

Using an external provider for our LLMs usually involves a usage-based pricing model, which can become expensive quickly. For this reason, we naturally want to run our LLM tests only as often as necessary, and we would prefer not to include them in the test suite that runs on every CI/CD pipeline execution.

Changes without warning

When using an external provider’s general API, the model (or components of its stack) may change, often without any notice, either before or after the change. These changes can affect our prompt’s behavior and performance, potentially leading to incorrect results.

To mitigate this risk, it is important to run our tests not only when our prompt changes but also periodically. This can help to at least maintain confidence in the consistent performance of our task, or let us know if it breaks.

Advantages in testing LLM structured tasks (vs other LLM use cases)

LLM tasks have several distinct features compared to open-ended or conversational use cases. These features offer advantages regarding the testing process and flow.

Intended determinism

When using LLMs to perform tasks, in most cases we know the exact expected output for any given input, so the call is deterministic by intent, even if the model does not always comply. For example, in the case of city name extraction, we can easily tell the expected output for any defined input.

This determinism allows us to test LLM tasks in the same way we would test any other deterministic function in our code. One effective method is to build sample test datasets with known outputs, enabling easy comparisons.

By leveraging the deterministic nature of these tasks, we can implement straightforward testing procedures, ensuring the reliability and accuracy of our task’s performance.
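As a minimal sketch of such a sample dataset, the Python snippet below compares actual outputs against known ones. The sample texts, the `run_samples` helper, and the `fake_extract_cities` stand-in are all hypothetical; in practice the stand-in would be replaced by the function that actually calls the model.

```python
from typing import Callable, List, Tuple

# Hypothetical sample dataset: (input text, expected city names).
SAMPLES: List[Tuple[str, List[str]]] = [
    ("I'm flying from Paris to Rome next week.", ["Paris", "Rome"]),
    ("No travel plans this month.", []),
    ("We moved to Berlin last year.", ["Berlin"]),
]


def run_samples(extract_cities: Callable[[str], List[str]], samples=SAMPLES):
    """Return the samples whose actual output differs from the expected one."""
    failures = []
    for text, expected in samples:
        actual = extract_cities(text)
        # Compare sorted copies so that ordering differences are not failures.
        if sorted(actual) != sorted(expected):
            failures.append({"input": text, "expected": expected, "actual": actual})
    return failures


def fake_extract_cities(text: str) -> List[str]:
    """Stand-in for the real LLM call, so the sketch runs offline."""
    known = ["Paris", "Rome", "Berlin"]
    return [city for city in known if city in text]


if __name__ == "__main__":
    print(run_samples(fake_extract_cities))
```

Returning the failing samples, rather than a single pass/fail flag, makes it easier to see which inputs a prompt change broke.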

Structured output

The output of LLM tasks is naturally structured, so our program can work with it on the other end. The structure can be entirely defined by us (by instructing the model to respond in a specific JSON schema for example), or partially defined by the provider, such as in function calling conventions.

Structured outputs simplify the evaluation process, reducing it to straightforward verification and validation, and removing some of the complexities involved with the evaluation of open-ended responses.
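For illustration, suppose we instructed the model to answer with a JSON object of the form {"cities": [...]}. That shape, and the `parse_cities_response` helper, are assumptions made for this sketch rather than a provider convention. Verification then reduces to a few standard-library checks:

```python
import json
from typing import List


def parse_cities_response(raw: str) -> List[str]:
    """Validate a raw model response against the assumed {"cities": [...]} shape."""
    # json.loads raises json.JSONDecodeError (a subclass of ValueError) on invalid JSON.
    data = json.loads(raw)
    if not isinstance(data, dict) or "cities" not in data:
        raise ValueError("response must be a JSON object with a 'cities' key")
    cities = data["cities"]
    if not isinstance(cities, list) or not all(isinstance(c, str) for c in cities):
        raise ValueError("'cities' must be a list of strings")
    return cities
```

Any response that is not valid JSON, or does not match the shape, surfaces as a `ValueError`, which a test can treat as a failed sample.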

Isn’t affected by code changes

Prompts can serve as integral components of our program flow. However, while they may be managed within our code repository, they are usually developed and committed independently, or with only limited logic handling when the structure changes.

As a result, prompts remain largely unaffected by most changes in the codebase, so their tests do not need to run in every CI/CD flow alongside the other unit or integration tests.

Good practices

Testing principles

Minimal
Test as little as possible at each lifecycle phase to maintain development velocity and avoid unnecessary costs.
Selective
Avoid testing when there are no affecting changes, to prevent unnecessary costs.
Continuous
Ensure ongoing performance, and be aware of potentially breaking backend changes, or newly discovered unsupported use cases.
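One possible way to apply the Selective principle is to fingerprint the inputs that affect the task from our side, and to run the paid tests only when the fingerprint changes. The sketch below assumes those inputs are the prompt text and the model name; the function names are hypothetical, and where the last tested fingerprint is stored is left open.

```python
import hashlib
from typing import Optional


def prompt_fingerprint(prompt: str, model: str) -> str:
    """Hash the inputs that can change the task's behavior from our side."""
    payload = f"{model}\n{prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def should_run_llm_tests(prompt: str, model: str, last_tested: Optional[str]) -> bool:
    """Run the tests only if this prompt/model pair has not been tested yet."""
    return prompt_fingerprint(prompt, model) != last_tested
```

This check covers only changes we make ourselves. Changes on the provider's side are not visible in the fingerprint, which is why periodic runs are still needed.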

Development

Ongoing

When developing our prompt we typically iterate on it while testing with practical, applicative cases, and adjusting it to work as expected.

LLM prompts are “fragile” by nature, and small changes can sometimes have significant effects. Therefore, it’s important to make sure that new versions don’t introduce regressions. This can be achieved by collecting the samples used during ongoing development and running them against every new version.

Testing for regression is especially important when we want to extend our prompt to support a new “feature” (for example, adding the county name to the extracted city from our city name entity extraction example), as broader changes are more prone to breaking behavior.
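A regression check can be sketched as a comparison between the recorded outputs of two prompt versions on the same samples. The `find_regressions` helper and the data layout (a dictionary from input text to output) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def find_regressions(
    samples: List[Tuple[str, List[str]]],
    old_results: Dict[str, List[str]],
    new_results: Dict[str, List[str]],
) -> List[str]:
    """Return inputs the old prompt version got right and the new one gets wrong."""
    regressions = []
    for text, expected in samples:
        old_ok = old_results.get(text) == expected
        new_ok = new_results.get(text) == expected
        if old_ok and not new_ok:
            regressions.append(text)
    return regressions
```

Samples that both versions fail are deliberately not reported here: they are known gaps rather than regressions, and can be tracked separately.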

Pre Deployment

When the prompt version development is complete, it is good practice to conduct thorough regression testing and run the new version on a broader, deeper dataset to ensure robustness.

While in production

Periodic testing

As mentioned before, providers can sometimes change a model’s backend without notice. Therefore, it is crucial to ensure our prompts are resilient to these changes.

A good practice is to periodically test our current prompt with a dataset of practical examples. This might not completely prevent regressions in production, but it will at least alert us when they occur, allowing us to adjust our prompt or quickly roll back to an earlier version if it performs better under the new changes.
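A periodic check along these lines could compute the pass rate on a fixed dataset and raise an alert when it drops below a known baseline. The `check_health` helper, the baseline, and the tolerance value are hypothetical, and the scheduling itself (a cron job, for instance) is outside the sketch.

```python
from typing import Callable, Dict, List, Tuple


def pass_rate(
    samples: List[Tuple[str, List[str]]],
    extract_cities: Callable[[str], List[str]],
) -> float:
    """Fraction of samples whose output exactly matches the expected list."""
    passed = sum(1 for text, expected in samples if extract_cities(text) == expected)
    return passed / len(samples)


def check_health(samples, extract_cities, baseline: float, tolerance: float = 0.05) -> Dict:
    """Flag an alert when the pass rate falls more than `tolerance` below the baseline."""
    rate = pass_rate(samples, extract_cities)
    return {"pass_rate": rate, "alert": rate < baseline - tolerance}
```

The same check can be run against an earlier prompt version to decide whether rolling back would help.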

Performance Monitoring

Naturally, we can’t predict all of the ways users will interact with our program, so it is important to monitor the performance of our tasks so we can adjust for currently unknown cases if needed.

This can be done at the application level, by trying to infer when users were dissatisfied with the result, for example when they loop back to a previous stage.

Another simple and commonly used method is to present users with a feedback mechanism, like a ‘👎’ button, to rate the result. Even ChatGPT uses this!

In some applications, this kind of feedback must be incorporated. From our recurring example: there’s no point in planning a user’s trip if we misidentify the city they plan to go to.

By collecting and analyzing this feedback, we can identify “missed” inputs and improve our prompt accordingly. This feedback can also be used to enrich our test datasets for future prompt versions.
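As a rough sketch of that loop, negative feedback records can be turned into candidate test samples that still need a human to fill in the expected output. The record fields ("input", "output", "rating") and the `collect_candidates` helper are hypothetical.

```python
from typing import Dict, Iterable, List


def collect_candidates(feedback_log: Iterable[Dict], existing_inputs: Iterable[str]) -> List[Dict]:
    """Turn thumbs-down records into new, not-yet-labeled test samples."""
    seen = set(existing_inputs)
    candidates = []
    for record in feedback_log:
        if record["rating"] == "down" and record["input"] not in seen:
            seen.add(record["input"])  # avoid duplicates within the log as well
            candidates.append(
                {"input": record["input"], "actual": record["output"], "expected": None}
            )
    return candidates
```

The expected value is left empty on purpose: a thumbs-down tells us the output was wrong, not what the correct output should have been.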

Transitioning to a new model

Providers frequently launch new and improved models, or cheaper alternatives, that we might consider transitioning to. In some cases, we might even consider switching to a weaker model if it’s more cost-effective and still performs adequately.

When transitioning to a new model, it is essential to test our prompt with our most thorough datasets to ensure it performs well. This thorough testing will help us identify any potential issues and ensure the new model meets our performance requirements before fully committing to the transition.
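To compare candidates before a transition, the same dataset can be run against each model and the pass rates placed side by side. In this sketch each model is represented by a hypothetical extractor function, and the required pass rate is a value we choose ourselves.

```python
from typing import Callable, Dict, List, Tuple


def evaluate_models(
    samples: List[Tuple[str, List[str]]],
    extractors: Dict[str, Callable[[str], List[str]]],
) -> Dict[str, float]:
    """Return {model_name: pass_rate} for each candidate on the same dataset."""
    report = {}
    for name, extract_cities in extractors.items():
        passed = sum(1 for text, expected in samples if extract_cities(text) == expected)
        report[name] = passed / len(samples)
    return report


def acceptable_models(report: Dict[str, float], required_pass_rate: float) -> List[str]:
    """Names of the models that meet our requirement, sorted alphabetically."""
    return sorted(name for name, rate in report.items() if rate >= required_pass_rate)
```

Choosing among the acceptable models, for example by cost, is then a separate decision from whether a model is good enough for the task.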

Conclusion

LLM tasks are very similar to traditional functional APIs, but they’re not as robust as plain code. As a result, developing and testing them introduce new challenges compared to traditional software testing. However, they are still easier to test than many other LLM applications.

The practices described in this article aim to address these challenges and leverage these advantages. By following them, we can develop LLM tasks more effectively and integrate them into our applications with confidence.

Promptotype

Promptotype is a development and testing platform specifically designed for LLM tasks. It offers dedicated solutions to implement most of the patterns and best practices discussed in this article. You’re welcome to check it out!

Until next time, happy prompting!
Ram from Promptotype