How To Safely Test Your LLM Prompt Changes


Large language models are the greatest epiphany creators among developers today. Every developer who has played around with an LLM has some ingenious idea about how they can disrupt entire industries. Minds run amok; amongst all the excitement, caffeine, and determination to bring about change lie ideas that a thousand people have already thought of but couldn’t put into production.

“It’s easy to make something cool with LLMs, but very hard to make something production-ready with them” – Chip Huyen

What do we mean by couldn’t put into production?

The consensus is that integrating an LLM into an application is a fairly straightforward process nowadays. Select a model, create prompts, and then build an application around it. People tend to gloss over the fact that most large language models have a mind of their own. They have been trained on a million different topics and could become unruly at any moment.

It is hard to guarantee what a large language model will generate. It would be weird if a chatbot built for a pharmacy started responding to medication questions with how yoga is a brilliant alternative to heart medication.

You might say that with the correct methods of prompt engineering and testing, the model will not hallucinate. In reality, even if a team of prompt engineers is tinkering with prompts and analysing the model outputs, only a handful of the gazillion possible scenarios can be tested. This means that a prompt that proves effective on the (small number of) tested cases may underperform significantly in a live production environment, particularly when faced with real user queries.

Testing a language model is an extraordinarily complicated process. In fact, finding a prompt that works well is itself a confusing and painstaking exercise. There are a multitude of scenarios to check, and different accuracy metrics might give disparate inferences about how the model is behaving. Developers thrive on immediate results and instant gratification, something that is not achievable when testing LLMs... until now.

UpTrain is an open-source toolkit designed to test large language models (LLMs) effectively. Its purpose is to ensure the reliable performance of your LLM applications by assessing various aspects such as correctness, structural integrity, bias, and hallucination in their responses. With UpTrain, you can conduct systematic experiments using multiple prompts, evaluating them across a diverse test set and quantifying their performance. This allows you to select the best prompt without relying solely on manual efforts.
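
To make this concrete, here is a minimal sketch of what such an evaluation can look like with UpTrain's open-source Python package. The class and check names (`EvalLLM`, `Evals.FACTUAL_ACCURACY`, `Evals.RESPONSE_RELEVANCE`) and the field keys are assumptions based on the repository at the time of writing and may differ between versions; treat it as an illustration rather than a reference.

```python
# A minimal sketch, assuming UpTrain's `EvalLLM`/`Evals` interface;
# class names, check names, and field keys may differ between versions.
from uptrain import EvalLLM, Evals

eval_llm = EvalLLM(openai_api_key="sk-...")  # evaluations are graded by an LLM

test_cases = [
    {
        "question": "Can I replace my heart medication with yoga?",
        "context": "Prescribed heart medication should never be replaced without a doctor's advice.",
        "response": "No. Yoga may complement your treatment, but it is not a substitute for prescribed medication.",
    },
]

results = eval_llm.evaluate(
    data=test_cases,
    checks=[Evals.FACTUAL_ACCURACY, Evals.RESPONSE_RELEVANCE],
)
print(results)  # per-row scores and explanations for each check
```

The same idea scales to hundreds of rows and several prompt variants, which is exactly what the dashboard described below automates.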

Moreover, UpTrain offers the capability to validate your model’s responses in a production environment. If any of the checks fail, you can modify the responses in real-time, ensuring their compliance. Additionally, you can monitor your model’s performance while it is in production.
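
Conceptually, that production-time safety net boils down to a validate-and-retry loop like the sketch below. The wrapper itself is generic Python; the `generate` and `passes_check` callables are hypothetical placeholders for your model call and an UpTrain-style check (hallucination, politeness, and so on), not names from the UpTrain API.

```python
# Illustrative only: a generic validate-and-retry wrapper around an LLM call.
# `generate` and `passes_check` are placeholders for your model call and an
# UpTrain-style check; neither name comes from the UpTrain API.
from typing import Callable


def safe_respond(
    query: str,
    generate: Callable[[str], str],
    passes_check: Callable[[str, str], bool],
    max_retries: int = 2,
    fallback: str = "Sorry, I can't answer that reliably right now.",
) -> str:
    """Return the first response that passes validation, else a safe fallback."""
    for _ in range(max_retries + 1):
        response = generate(query)
        if passes_check(query, response):
            return response
    return fallback
```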

By leveraging the comprehensive suite of LLM evaluation tools provided by UpTrain, including standard NLP metrics, model grading checks, embeddings-based checks, and human-assisted checks, you can confidently make changes to your LLM applications. This eliminates the risk of incorrect or faulty prompts being deployed in production. Now, let’s get a basic understanding of how UpTrain functions.

The above image is a screenshot of the UpTrain Dashboard. Notice how the main menu has four sections:

  1. Dataset section (to upload datasets)

  2. Experiments section (the various prompts to the model are specified here)

  3. Checks section (allows us to specify UpTrain Checks)

  4. Results section (the visualisations and inferences are obtained here)

Each section gives you a different set of tools for building evaluations and creating data visualisations.

Dataset Section:

The “Dataset” section is where we provide the data used to create test prompts for large language models. This data could be anything: documents, movie scripts, questions, emails, links, numbers, etc. All the user has to do is upload a JSON file with the required data, and UpTrain will figure out how to work with it.

For the sake of demonstration, we have chosen a dataset of documents and queries about those documents. Each row has the following information: question, document title, document link, and document text. We want to see how various models answer each question by quoting text from the corresponding document. Once we’ve uploaded our JSON file, we move to the Experiments section.
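
As a rough illustration, the uploaded file could look something like the snippet below. The key names are an assumption on our part; UpTrain simply works with whatever fields the JSON contains.

```python
# A minimal sketch of the kind of JSON file uploaded in the Dataset section.
# The key names are assumptions; use whatever field names your data already has.
import json

rows = [
    {
        "question": "What does the toolkit evaluate?",
        "document_title": "Getting started with UpTrain",
        "document_link": "https://example.com/docs/getting-started",
        "document_text": "UpTrain is an open-source toolkit to evaluate LLM applications ...",
    },
]

with open("eval_dataset.json", "w") as f:
    json.dump(rows, f, indent=2)
```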

Experiments Section:

The Experiments section of UpTrain provides an interface for defining multiple models and prompts to experiment with. Being able to select several models, such as GPT-4, Claude, and ChatGPT, lets us analyse the strengths and weaknesses of each model on our specific dataset.

We can also experiment with multiple prompts. To simplify prompt creation, we can define a prompt template along with a range of options for its variables.

For our use case, we are going with gpt-3.5-turbo and gpt-4. Next, we create a prompt template. If we specify variables that aren’t present in the uploaded data, the UpTrain dashboard creates a text box where we can enter the required values. Once everything is filled in, UpTrain runs all the prompts and retrieves the generated outputs. After that, we move on to the Checks section.
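
Under the hood, running the experiment amounts to filling the template with each row's fields and querying each selected model. The sketch below does this with the current OpenAI Python SDK and template variables that mirror our dataset fields; the dashboard handles all of this for us.

```python
# Illustrative only: one prompt template, two models, one dataset row.
# The template variables ({question}, {document_text}) mirror our dataset fields.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "Answer the question using only quotes from the document.\n"
    "Document: {document_text}\n"
    "Question: {question}"
)

row = {
    "question": "What does the toolkit evaluate?",
    "document_text": "UpTrain is an open-source toolkit to evaluate LLM applications ...",
}

for model in ["gpt-3.5-turbo", "gpt-4"]:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(**row)}],
    )
    print(model, "->", completion.choices[0].message.content)
```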

Checks Section:

What is an UpTrain Check?

An UpTrain Check helps us specify what type of tests we want to run and the data visualisations we want to obtain.

An UpTrain Check takes three arguments:

  1. Name: the name of the check

  2. Operators: the operators to run when the check is executed

  3. Plots: the plots to generate when the check is executed

In the Checks section, we select the different dimensions of the model responses we want to compare. UpTrain has pre-built checks for hallucination, embedding similarity, grammar correctness, politeness, ROUGE score, etc. Once we select a check, more options for configuring its visualisation appear. After selecting all the required options, the check is generated below. We can add any number of checks, each with the desired visualisations: histograms, bar plots, line charts, and so on.
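
In code, a check mirrors the three arguments described above: a name, the operators to run, and the plots to draw. The sketch below is only illustrative; the import paths, operator and plot class names, and their parameters are assumptions that may differ from the UpTrain version you are using, so consult the docs before copying it.

```python
# A rough sketch of a Check, assuming hypothetical operator/plot names
# (RougeScore, Histogram) and column parameters; verify against the UpTrain docs.
from uptrain.framework import Check
from uptrain.operators import RougeScore, Histogram

quote_overlap_check = Check(
    name="quote_overlap_with_document",
    operators=[
        RougeScore(
            col_in_generated="response",    # the model's answer
            col_in_source="document_text",  # the text it was asked to quote from
            col_out="rouge_score",
        )
    ],
    plots=[Histogram(x="rouge_score")],
)
```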

With this, we have completed all the steps to create our visualisations. The only thing left to do is obtain the results.

Results Section:

The Results section is where we view the visualisations we created. UpTrain runs every prompt-model combination over the dataset and provides the outputs. The image below is the visualisation obtained from the check we created.

Now, keep in mind that we only created one check. What would happen if we created multiple checks? We would simply get more visualisations. UpTrain provides a drop-down menu to view all of them.
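
If you prefer to crunch the numbers yourself, the final selection step is just an aggregation over the per-row scores. The snippet below is purely illustrative, with made-up scores and column names; the dashboard performs an equivalent comparison for you.

```python
# Illustrative only: choosing the best (model, prompt) pair from per-row scores.
# The scores and column names here are made up for demonstration purposes.
import pandas as pd

results = pd.DataFrame([
    {"model": "gpt-3.5-turbo", "prompt": "v1", "rouge_score": 0.42},
    {"model": "gpt-3.5-turbo", "prompt": "v2", "rouge_score": 0.55},
    {"model": "gpt-4",         "prompt": "v1", "rouge_score": 0.61},
    {"model": "gpt-4",         "prompt": "v2", "rouge_score": 0.74},
])

summary = (
    results.groupby(["model", "prompt"])["rouge_score"]
    .mean()
    .sort_values(ascending=False)
)
print(summary)  # the top row is the strongest combination on this check
```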

In conclusion, UpTrain gives developers an efficient and confident pathway for making changes to their LLM applications. By leveraging UpTrain, developers can streamline the testing process and ship prompt changes quickly and safely. In the upcoming articles, we will delve into a detailed exploration of the features offered by UpTrain. Stay tuned.