An overlooked game-changing use case of LLMs for software developers

Introduction

OpenAI's revolution has transformed the landscape of natural language processing, marking a significant leap in the capabilities of conversational AI. ChatGPT has demonstrated an unprecedented understanding of context, finding applications across a wide range of fields. This innovation signifies a pivotal moment in the evolution of AI, showcasing the potential for advanced language models to enhance communication and problem-solving in diverse contexts.

LLMs have countless use cases- literally! To demonstrate one of them, the introduction above was written by an LLM (only to make a point, not because one was needed).
Some of these use cases are highly relevant to us- software developers.

Context-specific chatbots, writing code for us, and assisting with planning, testing, and design are some of the software-related use cases that get discussed a lot. However, there’s one very strong use case that imho isn’t discussed as much as it should be, and it’s very relevant to many software developers, especially those who don’t have a lot of experience with AI/ML technologies- LLM Structured Tasks.

What are LLM Structured Tasks?

I have to admit I might have made up that name, but it’s a real thing, I promise. At a very high level, it’s a task done by an LLM that can be integrated into our program’s flow- basically, a function written in (somewhat) natural language instead of code. The task is implemented by instructing the LLM to do some work on some input.

A basic example would be entity extraction: we can instruct the model to extract all names of < countries/people/companies/else > from the provided text input. This is already useful, but we can add follow-up “work” too: return the continents of the mentioned countries, return the stock symbols of the companies, return the sentiment of the text as one of 5 values: …, and anything else you can think of.
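To make this concrete, here’s a minimal sketch of what such a task could look like as a “function”, assuming the OpenAI Python SDK; the model name, prompt wording, and output shape are placeholders, not a prescribed implementation:

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK (v1.x)

client = OpenAI()

# The "function body" is a natural-language instruction plus a required output shape.
EXTRACTION_INSTRUCTION = (
    "Extract every country mentioned in the user's text. "
    "For each country, also return its continent. "
    'Respond with JSON only, shaped like: '
    '{"countries": [{"name": "...", "continent": "..."}]}'
)

def extract_countries(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder model name
        response_format={"type": "json_object"},  # ask for JSON mode
        messages=[
            {"role": "system", "content": EXTRACTION_INSTRUCTION},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(extract_countries("The deal was signed between France and Japan."))
```

The rest of the program can then treat extract_countries like any other function that returns a dict.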

The task’s work is often deterministic, with one correct output for every input. It’s also potentially complex enough to return multiple values, so the output needs to be structured (as JSON, for example) for our program to be able to work with it.
The ‘structured’ part is not trivial for an LLM that’s generally built to output free text.

LLM function calling can also be considered a structured task- having the LLM decide whether and how to call our functions, and output the arguments in a defined structure.
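As a rough illustration, this is what that might look like with the OpenAI chat completions tools parameter; the get_stock_price function and its schema are hypothetical:

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK (v1.x)

client = OpenAI()

# A hypothetical function we want the model to decide whether (and how) to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest stock price for a company's ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {
                "symbol": {"type": "string", "description": "Ticker symbol, e.g. AAPL"},
            },
            "required": ["symbol"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "How is Apple's stock doing today?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model chose to call a function
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)  # structured arguments, e.g. {"symbol": "AAPL"}
    print(call.function.name, args)
```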

What's so game-changing about it?

Imagine what it would be like to implement any of the examples mentioned above without highly accessible LLMs: the simplest one (entity extraction) would require setting up dedicated, complex ML systems, and the more complicated ones whole pipelines of them- each requiring many, many hours of design, development, and testing.

Now, with that high accessibility (and continually dropping prices!), these kinds of tasks can be easily developed and tested to a reasonable quality, sometimes in as little as a few hours- try it yourself!

The challenges in engineering LLM structured tasks

Semantic Ambiguity

LLMs don’t always respond exactly the way we expect them to. A prompt instruction can seem very clear to a human yet produce unexpected output from the model, and sometimes even worse- a total hallucination. This is often due to the model's "creativity", which helps a lot in other LLM use cases (in other words, some of it is by design) but can be detrimental here, where the result has a concrete applicative goal.

Structure Fragility

LLMs’ creativity can affect even simpler things, such as JSON validity (OpenAI doesn’t guarantee valid JSON even when using the dedicated JSON mode). Responding with exactly the attributes we requested is even harder. Needless to say, such errors can make it impossible for a program to work properly.
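One common mitigation is to validate the response before the rest of the program touches it, and fail (or retry) loudly when the structure is off. A minimal sketch, where the required keys are whatever our task’s contract defines:

```python
import json

REQUIRED_KEYS = {"countries"}  # whatever attributes our task's contract defines

def parse_structured_response(raw: str) -> dict:
    """Validate the model's raw output before the rest of the program uses it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"Model returned invalid JSON: {err}") from err

    if not isinstance(data, dict):
        raise ValueError("Model returned JSON that is not an object")

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Model response is missing attributes: {missing}")
    return data
```

In practice, a retry (or a fallback path) on such failures is usually worth adding as well.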

Development Iteration Fragility

Our prompt usually keeps evolving- we sometimes find queries it doesn’t handle well and want to fix that, or we want to support new “features” by extending it.

The problem is that even small adjustments to the prompt can have unexpected effects. We can often find ourselves adding a few words to correctly support a failed query, only to discover that other, unrelated queries (possibly even much simpler ones) break as a result of the change.

Model Stability

When using LLMs’ public APIs, some of the underlying models (or at least parts of their stack) occasionally change under the hood, often without notice. This can have detrimental effects on the tasks we’re performing by breaking them.

The good news

LLMs are strong tools, and the structured-tasks use case can be very powerful and stable with the right development process and habits. The even better news is that these processes and habits are practically the same ones good software engineering requires anyway: careful development, thorough testing on a diverse set of samples, and reasonable monitoring to make sure everything works as we expect it to.

Ambiguity and fragility challenges can all be addressed with careful development and basic testing on a small set of query samples with expected outputs- similar to unit tests.
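For example, a handful of fixed samples with their expected structured outputs can be run as ordinary unit tests; this sketch assumes pytest and the hypothetical extract_countries() function from the earlier example:

```python
import pytest

from my_tasks import extract_countries  # hypothetical module holding the earlier sketch

# A small fixed set of query samples with their expected structured outputs.
SAMPLES = [
    ("The deal was signed between France and Japan.",
     {"countries": [{"name": "France", "continent": "Europe"},
                    {"name": "Japan", "continent": "Asia"}]}),
    ("Our quarterly numbers look great.",
     {"countries": []}),
]

@pytest.mark.parametrize("text,expected", SAMPLES)
def test_extract_countries(text, expected):
    assert extract_countries(text) == expected
```

Exact-equality assertions like these can be brittle (the model may legitimately return entities in a different order), so comparing normalized or sorted structures, or validating against a schema, is often more robust.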

Stability challenges can be addressed by testing on a broader set of use cases, along with periodic scheduled testing- similar to system tests.

Promptotype- The platform for { structured } prompt engineering

Promptotype is a prompt development platform focusing on this specific use case of LLM structured tasks.

It provides you with an extended playground- letting you define templated prompts, function calling (if relevant), and model configuration.

While developing in the playground, you can test your prompts on query inputs (the values for the templated prompt’s variables) by comparing the response with a defined expected response: a JSON value or schema, or a function calling value. Creating the expected response in the first place can even be done semi-automatically: run your query through the model, then simply adjust the returned response.

The more important feature is the ability to define whole collections of queries and run them all at once with every adjustment iteration.
You can also define scheduled periodic runs of these collections to ensure everything keeps performing as expected.

This set of features helps overcome all four of the challenges mentioned above.

You’re more than welcome to sign up and check out Promptotype for free right now!
https://www.promptotype.io

Coming Soon

We’re planning a few follow-up blog posts. A few major ones that come to mind: development challenges with a demo product example, considerations of when and how to test your prompts (spoiler: it doesn’t fit perfectly into existing CI/CD testing methodologies), and techniques for using AI to improve your prompts or models.

Signed-up Promptotype users get email updates, so you’re more than welcome to sign up, even just to follow along (it’s free).

Until next time- happy prompting!
Ram from Promptotype