Limiting Hallucinations with Generative AI

Published on January 8, 2025

Here at GenServ, we have developed and deployed dozens of generative AI agents across multiple industries. A common reservation we hear about the adoption of generative AI is the risk of hallucinations, or an AI model “making something up”. The possibilities with this technology are almost endless, but there's often a hesitation about using it within an application because of the question: What if it's wrong?

When generative AI produces something that isn’t grounded in fact, or in the context it’s given, it’s considered a “hallucination.” Unfortunately, hallucinations are a byproduct of the underlying architecture of these tools; they can be common and are extremely difficult to eliminate entirely.

However, with the right system design, we can pretty easily mitigate the impact of these occurrences on end users and (I would argue) make Gen AI both beneficial and safe to use.

How Often Do LLMs Hallucinate?

Before we get into this, though, it's worth spending some time on real-world numbers for hallucinations. At GenServ, we have the benefit of being able to test many different workflows using generative AI, and we track hallucination rates very closely.

BidScore: Evaluation, Classification, and Transformation Agents

One of our products, BidScore, automates the grading and evaluation of responses to RFPs. To do this, it orchestrates a dozen AI agents that analyze, extract, and evaluate different aspects of the submission against the RFP requirements.

We benchmark the results that we get from each of these agents and have measured the hallucination rate over a period of several testing cycles. BidScore is a great example for this because over the course of constructing the grading criteria for an RFP and then evaluating submissions, you have to orchestrate nearly a thousand calls to large language models. Understanding which ones are not grounded in fact is critical to having a product that works and can be trusted.

We find that hallucination rates are between 0.2% and 0.5% across all of the AI calls involved in evaluating submissions. We define “hallucination” more broadly for BidScore than a typical application would: we check whether an answer given by an LLM is one that we would give ourselves. If it isn’t, even if it’s grounded in the information provided, we count it as an inaccuracy. This means that what most people consider a “hallucination” (the LLM simply making something up) occurs well below that 0.2-0.5% range.
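
To make this concrete, here's a minimal sketch of how an inaccuracy rate like this could be computed from a benchmark of agent answers against the answers we would give ourselves. The record format and the compare_answers helper are illustrative placeholders, not BidScore's actual tooling.

```python
# A minimal sketch of tracking a hallucination/inaccuracy rate over a benchmark.
# The record format and compare_answers helper are illustrative assumptions.

def compare_answers(agent_answer: str, reference_answer: str) -> bool:
    """Return True when the agent's answer matches the answer we would give.
    In practice this could be an exact match, a rubric check, or a human review."""
    return agent_answer.strip().lower() == reference_answer.strip().lower()

def inaccuracy_rate(benchmark: list[dict]) -> float:
    """benchmark is a list of {"agent_answer": ..., "reference_answer": ...} records."""
    if not benchmark:
        return 0.0
    misses = sum(
        not compare_answers(r["agent_answer"], r["reference_answer"])
        for r in benchmark
    )
    return misses / len(benchmark)
```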

How Do You Prevent Hallucinations?

The short answer is you can't. But there are several very reliable techniques for reducing the likelihood of hallucinations. These are common patterns that we build into all of our human-in-the-loop workflows created through GenServ.

1. Do Not Rely on the Model's Internal Knowledge

We'll start with what we consider the most important technique: not relying on an LLM's internal knowledge. Every LLM is trained on a vast base of information, far more than any human will consume in a lifetime, which is why these models are so good at answering a wide array of questions without specific training. However, because of this vast knowledge base, they are also very poor at saying “I don’t know” when they really don’t know something.

We don't even try to lean on that internal knowledge. With extremely rare exceptions, every agent we build has specific instructions to reference only the information we give it to perform the task at hand. There's an obvious trade-off with cost, because you have to pass in more information to ensure the model has the context it needs, but the benefit is a much, much lower rate of hallucinations.

Example: BidScore

Let's take an example to see what we mean: to ensure that our evaluation and grading agents do not make up information when evaluating a response to an RFP, we give them the relevant pages of the submission as well as the grading criteria for the RFP when we ask for a specific grade. We then instruct the LLM to use only the information provided when producing that grade.

This provides two benefits:

  1. We can know exactly which sections of the submission are referenced when producing a grade.
  2. If information is not provided in the call to the LLM, it is considered irrelevant.
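
To illustrate, here's a simplified sketch of how a grading prompt can be assembled so the model sees only the material we supply. The function name and prompt wording are illustrative, not BidScore's actual prompts.

```python
# A simplified sketch of grounding a grading call in only the material we supply.
# The function name and prompt wording are illustrative, not BidScore's prompts.

def build_grading_prompt(criterion: str, submission_pages: list[str]) -> list[dict]:
    """Assemble chat messages containing everything the model is allowed to use."""
    context = "\n\n".join(
        f"[Submission page {i + 1}]\n{page}" for i, page in enumerate(submission_pages)
    )
    system = (
        "You are grading a response to an RFP. Use ONLY the grading criterion and "
        "submission pages provided. If the information needed to assign a grade is "
        "not present, say so instead of guessing."
    )
    user = (
        f"Grading criterion:\n{criterion}\n\n"
        f"Submission pages:\n{context}\n\n"
        "Return a grade and the page numbers you relied on."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```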

Example: Drafting Legal Documents

This is probably a better example of what we mean by avoiding the use of internal knowledge. Several of our customers draft documents that are contractual or legal in nature and rely on very specific definitions of concepts. Instead of relying on the LLM's internal knowledge of these definitions, in most cases we provide the definitions ourselves and tell the agent explicitly how to reference them and what they mean for the task, so that they are used correctly.
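
As a rough sketch of what this looks like in practice, the snippet below injects the required definitions directly into the prompt rather than trusting the model to know them. The defined terms and wording are placeholders, not a customer's actual definitions.

```python
# A rough sketch of supplying contractual definitions in the prompt instead of
# relying on the model's internal knowledge. Terms and wording are placeholders.

DEFINITIONS = {
    "Effective Date": "The date on which the last party signs this agreement.",
    "Confidential Information": "Any non-public information disclosed by either party.",
}

def build_drafting_prompt(instructions: str) -> str:
    """Build a drafting prompt that carries the definitions the agent must use."""
    defined_terms = "\n".join(
        f"- {term}: {meaning}" for term, meaning in DEFINITIONS.items()
    )
    return (
        "Draft the requested clause using the defined terms below. Use each term "
        "exactly as defined; do not substitute your own definitions.\n\n"
        f"Defined terms:\n{defined_terms}\n\n"
        f"Drafting instructions:\n{instructions}"
    )
```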

2. Produce Justifications with Answers

The second technique we use to ensure the accuracy of our agents' output is to return a justification along with every answer given to users. This can be done in a variety of ways, but our typical path is to produce an answer, then hand that answer and the context used to produce it back to an agent (sometimes the same one, often a different one) and ask whether the answer is grounded in the information given. If it's not, we do one of two things:

  1. We may retry the call with the same information to see if we get a different answer. If we do, we move on and have prevented a hallucination from getting to our users.
  2. If instead we get the same answer or a similar hallucination, we will provide that back to the user, but we'll tell them that we couldn't verify the information used to produce this answer and we recommend that they check it.
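
Here's a minimal sketch of that verify-and-retry loop. The prompt wording, the return format, and the idea of passing in call_llm (any function that sends a prompt to your LLM provider and returns its text reply) are illustrative assumptions, not GenServ's exact implementation.

```python
# A minimal sketch of the verify-and-retry pattern described above.
# call_llm stands in for whatever LLM client you use; the prompt wording and
# return format are illustrative assumptions.

from typing import Callable

def is_grounded(call_llm: Callable[[str], str], answer: str, context: str) -> bool:
    """Ask a (possibly different) agent whether the answer relies only on the context."""
    verdict = call_llm(
        "Does the answer below rely only on the context provided? "
        "Reply with YES or NO.\n\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    )
    return verdict.strip().upper().startswith("YES")

def answer_with_verification(call_llm: Callable[[str], str], prompt: str, context: str) -> dict:
    answer = call_llm(prompt)
    if is_grounded(call_llm, answer, context):
        return {"answer": answer, "verified": True}

    # Retry once with the same information; a different, grounded answer wins.
    retry = call_llm(prompt)
    if is_grounded(call_llm, retry, context):
        return {"answer": retry, "verified": True}

    # Otherwise surface the answer, but flag it so the user knows to check it.
    return {
        "answer": retry,
        "verified": False,
        "note": "We couldn't verify the information used to produce this answer; please check it.",
    }
```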

The second point is especially important because it highlights the need to let humans verify and validate the information they're given. This is critical to the design of a good system using LLMs because it ensures that responsibility still lies with someone at your company using the software.

3. Provide Citations with Answers

This is similar to providing a justification, but there's a nuanced difference, and citations are arguably more valuable for some agents, such as analysis and extraction agents.

Citations allow the user to view the source documentation used to produce an answer. For example, in BidScore, the grade and justification for a specific submission link to the submission pages containing the information referenced to produce the answer. A user can open those pages immediately within the application, read the source document, and check for themselves that the information is accurate.

A similar example is contract extraction, where specific terms and stipulations within contracts are extracted and saved to a data source. As part of extracting the information, you can reference the specific pages those terms came from, which lets users pull up the source material very quickly and confirm the extraction is correct. In general, extraction agents are both the easiest to check for accuracy and the most accurate out of the box. There's a lot of flexibility in how you surface these citations.
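
As one way to picture this, the sketch below asks an extraction agent to return the source pages alongside each extracted term, so a reviewer can jump straight to the cited pages. The JSON schema and prompt wording are illustrative assumptions, not a specific product's format.

```python
# A rough sketch of extraction with citations. The schema and prompt wording
# are placeholders, not a specific product's format.

import json

EXTRACTION_PROMPT = (
    "Extract every payment term from the contract pages below. "
    "Return a JSON array of objects shaped like "
    '{"term": "...", "value": "...", "source_pages": [1, 2]}. '
    "Only include terms that appear in the pages provided."
)

def parse_extraction(raw_response: str) -> list[dict]:
    """Parse the model's JSON reply and keep only records that carry citations."""
    records = json.loads(raw_response)
    # source_pages lets a reviewer open the cited pages and confirm each term.
    return [r for r in records if r.get("source_pages")]
```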

4. Follow Prompting Best Practices

This is something I'd consider a given for working with LLMs, but it's worth mentioning because it is critically important to reducing hallucinations. There are a few aspects to highlight here when constructing prompts to avoid hallucinations:

  • Tell the model not to make things up. If this sounds like it shouldn't work, think again: it does, and a lot of people forget to do it when they're trying to get very accurate outputs from an LLM. If you tell the model not to make something up, and are explicit and specific about how it should construct an answer, you're going to get higher accuracy.
  • Use a low temperature. Temperature can be thought of as the level of creativity you'd like a language model to use when producing an answer. It's set when you make a call to an LLM; though it isn't exposed in ChatGPT or a general chat interface, it's available within all GenServ agents and is something we set through API calls (see the sketch after this list). Generally, if you're doing something like summarization or extraction, you want the temperature to be very low, because you don't want a lot of creativity in pulling information out of the content; this results in more deterministic output. For tasks that benefit from more creativity, such as creative writing, a mid-to-high temperature gives good results.
  • Limit the content you provide. This may seem counter to our first point about not relying on internal knowledge, but the more content you provide in a single prompt, the higher the likelihood that the important pieces get lost and the model just makes something up. The most effective prompt gives just enough information to reach an accurate, well-grounded outcome.
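
To make the temperature and “don't make things up” points concrete, here's a minimal sketch using the OpenAI Python SDK as one example. The model name, prompt wording, and document variable are placeholders, and most provider SDKs expose an equivalent temperature parameter.

```python
# A minimal sketch of a low-temperature call with an explicit instruction not to
# make things up. Model name and prompt wording are placeholders.

from openai import OpenAI

client = OpenAI()

document_text = "..."  # the content you want summarized

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    temperature=0.1,       # low temperature for summarization/extraction tasks
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize using only the document provided. Do not make anything "
                "up; if something is not in the document, say that it isn't there."
            ),
        },
        {"role": "user", "content": "Document:\n" + document_text},
    ],
)

print(response.choices[0].message.content)
```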

The Reality of Gen AI

It turns out that when you follow these best practices and have a thoughtfully designed system incorporating generative AI, the risk and impact of hallucinations are extremely low. Ideally, you build a process that uses generative AI to speed up a human performing a task. Part of the design should be that if you detect a hallucination, or a likely one, you make it obvious to the user and let them perform the work they would have been doing anyway. In that scenario, you don't gain anything, but you also don't lose anything.

Given that the hallucination rates we typically see are so low, we find that the net gain from using these tools far outweighs the risk hallucinations pose.

We’ll follow up with some future content showing some practical designs for keeping humans in the loop.

Ready to Transform Your Business with Custom AI Agents?

Let's identify your high-impact AI opportunities