The Secret to Conversational AI: How to Build Effective LLM Chatbots with RAG

by Renato Durrer

Engagement Lead

10 min. read

Each time Generative AI models like GPT-4, Claude, or Gemini hallucinate or give inaccurate answers, the world questions whether these LLM-powered chatbots can truly deliver value.

Even so, building LLM chatbots is one of the most popular GenAI applications. To address some of the LLMs' limitations, companies have turned to Retrieval Augmented Generation (RAG), a framework that ensures access to accurate information by retrieving it from multiple sources.

The Real Foundation of RAG Chatbots

Most companies see RAG as an easy, exciting, and quick fix for how customers and employees retrieve knowledge. And that's precisely where two of the biggest misconceptions about building a RAG chatbot hide: assuming that it's heavily based on LLMs and that it simply works well.

Knowledge retrieval is about knowing where to find the information you need. It is a key step in developing RAG chatbots, and it happens before a large language model is even involved. The LLM is only one part of the RAG architecture, primarily responsible for reasoning over the retrieved data and generating responses.

Let's see what RAG architecture for an AI chatbot looks like and how to ensure that these common misconceptions don’t stand in the way of a successful solution.

A GenAI Architecture for a Conversational Agent

The Retrieval Augmented Generation (RAG) architecture is a framework designed to augment the capabilities of conversational agents. At its core, a RAG architecture involves five key components:

  1. Data Ingestion

  2. Knowledge Base Creation

  3. Document Retrieval

  4. LLM Reasoning

  5. User Interface

Each of these steps is equally important to the overall performance of the RAG chatbot. Below, we'll explore all the steps to guide you through creating an efficient and reliable knowledge retrieval process.

Step 1: Data Ingestion

How to create a document library comprising the most relevant and up-to-date documents?

Companies can have their information spread across many sources. The main goal of data ingestion is to consolidate all sources of information into a document library. This centralized, cloud-based storage system can include various types of data, such as documents, Confluence pages, database entries, or files like PowerPoint presentations.

The first step is to locate all the source systems and build custom connectors that ingest their contents into the cloud storage. These connectors keep the source records in sync with the information accessible to the RAG system. Data ingestion is a pure data engineering problem.

Consider a scenario where product information is spread over 20,000 technical documents. This is a common situation involving extensive technical documentation of products.

Each time a material or product specification in one of those documents changes, the document library must be updated with the newest version to ensure the RAG user has access to the latest file with the most recent information.

To ensure data security and privacy, access policies determining which users can access specific information must be clearly defined in the source systems and reflected in the document library. It involves setting access management controls at the document level.

Whenever a file is processed or a source is added, these access policies and permission settings must be propagated into the document library. Implementing user-level access management often requires unique configurations to ensure authorized access, protection, and compliance, as it does not come out of the box.
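
To make this concrete, here is a minimal sketch of what such a sync could look like. It assumes a hypothetical SourceDocument type and uses a plain dictionary as a stand-in for the cloud storage client; the real connector code depends entirely on your source systems and storage provider.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class SourceDocument:
        """A document as exposed by a source system (e.g., Confluence or a file share)."""
        doc_id: str
        content: bytes
        last_modified: datetime
        allowed_groups: list[str]  # access policy defined in the source system

    def sync_to_document_library(source_docs: list[SourceDocument], library: dict) -> None:
        """Upsert source documents into the document library, propagating access rights."""
        for doc in source_docs:
            existing = library.get(doc.doc_id)
            # Only re-ingest documents that changed since the last sync.
            if existing is None or existing["last_modified"] < doc.last_modified:
                library[doc.doc_id] = {
                    "content": doc.content,
                    "last_modified": doc.last_modified,
                    # Access policies travel with the document so the retrieval
                    # layer can filter results per user later on.
                    "allowed_groups": doc.allowed_groups,
                }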

Step 2: Knowledge Base Creation

What is the best representation of the knowledge stored in the document library?

The objective of the knowledge base creation step is to set up a central repository of easily searchable and digestible information. It involves breaking each document into smaller sections, called chunks, each becoming an entry in the knowledge base (a vector database). Most often, the chunks are not only stored as text but also converted to a machine-readable (vector) representation. The main goal of this step is to find the best representation of the documents within the document library, making it a data science problem.

The process begins with two main questions: how to split the documents into chunks and what representation works best for these chunks. Another critical decision is how much information to provide to the model at once. Choosing the best chunking strategy depends on your data, can even vary across sources, and requires experience and experimentation.

Different representation methods and chunking strategies have different price tags and implementation costs. Some services, such as Azure AI Document Intelligence, may offer better performance but come at a hefty price, whereas other chunking strategies may be more affordable but less performant. The approach to these questions varies depending on the scale, whether you're dealing with a million documents or just ten thousand.

By the end of this step, you will have a vector database. Each entry in the database corresponds to one chunk of the original document, containing both a vector representation of the chunk's content and metadata (e.g., source location, access rights, time of creation, etc.).

The content of the vector database has a direct impact on the chatbot's ability to respond to queries accurately. This underscores our responsibility for choosing the right chunking strategy and representation methods, as they determine the performance of the RAG system and, ultimately, how well it can serve its purpose.
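
As an illustration, the sketch below splits a document into fixed-size, overlapping chunks and turns each chunk into a knowledge base entry with a vector and metadata. The embed function is a toy placeholder for a real embedding model, and the character-based chunking is only one of many possible strategies.

    from datetime import datetime

    def embed(text: str) -> list[float]:
        # Placeholder: replace with a real embedding model (OpenAI, Cohere, open source, ...).
        # It only returns a toy vector so the sketch runs end to end.
        return [float(len(text)), float(sum(map(ord, text)) % 997)]

    def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
        """Naive fixed-size chunking with overlap; production strategies are often
        sentence- or section-aware."""
        step = chunk_size - overlap
        return [text[start:start + chunk_size] for start in range(0, len(text), step)]

    def build_entries(doc_id: str, text: str, allowed_groups: list[str]) -> list[dict]:
        """Turn one document into vector database entries: embedding plus metadata per chunk."""
        return [
            {
                "id": f"{doc_id}-{i}",
                "text": chunk,
                "vector": embed(chunk),  # machine-readable representation
                "metadata": {
                    "source": doc_id,
                    "allowed_groups": allowed_groups,
                    "created_at": datetime.utcnow().isoformat(),
                },
            }
            for i, chunk in enumerate(chunk_text(text))
        ]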

Step 3: Document Retrieval

How to find the right document chunks for a given search query?

The main goal of this stage is to retrieve the document chunk(s) containing the searched information. So, the first step is to convert the search query into a vector and then find chunks with similar representations.

Sometimes, it’s difficult for RAG systems to distinguish between similar information. When dealing with documents with similar structures and content, filtering by additional metadata, such as the creation date, the type of test, and the experiment date, can be helpful. Although adding metadata is not a requirement, it can refine search results and help users access the right information.

For example, a chemical company conducting many experiments might have similar reports except for minor differences, such as using compound B instead of compound A. But if we knew which compound was present in the document, we could instruct the agent to look only for documents that contain compound B, greatly increasing the accuracy of the retrieved knowledge.

Another challenge in document retrieval is determining the optimal number of chunks to retrieve. Due to computational complexity, it's impractical to compare all chunks with a given search query. The goal is to identify the relevant chunks with high confidence before sending them to the LLM to avoid unnecessary costs. But it's not only about costs. Feeding the model too many irrelevant chunks increases the chance that it extracts the wrong information, and it slows down response generation. On the other hand, with too few chunks, relevant information may be missed. Therefore, we need to balance the level of approximation, cost, and speed so that the RAG system remains efficient while providing accurate results.
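
The sketch below illustrates the core idea under simplified assumptions: it reuses the toy embed function from the previous sketch, scores entries with plain cosine similarity, applies an exact-match metadata filter, and keeps only the top-k chunks above a confidence threshold. Production vector databases do this with approximate nearest-neighbour indexes rather than brute-force comparison.

    import math

    def cosine_similarity(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def retrieve(query: str, entries: list[dict], metadata_filter: dict,
                 top_k: int = 5, min_score: float = 0.7) -> list[dict]:
        """Return the most similar chunks that satisfy the metadata filter."""
        query_vector = embed(query)  # same embedding model as used for the chunks
        candidates = [
            e for e in entries
            if all(e["metadata"].get(k) == v for k, v in metadata_filter.items())
        ]
        scored = sorted(
            ((cosine_similarity(query_vector, e["vector"]), e) for e in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        # Keep only high-confidence chunks to limit cost and noise downstream.
        return [e for score, e in scored[:top_k] if score >= min_score]

    # e.g., retrieve("solubility of compound B", entries, {"compound": "B"}),
    # assuming a hypothetical "compound" metadata field was added during ingestion.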

Step 4: LLM Reasoning

What combination of model, information, and prompting strategy leads to the best results?

At this stage, an LLM consumes the provided information and generates a response, but several factors influence its performance. This is an AI problem.

Choosing the optimal model, information, and prompting strategy is a complex task that presents a series of AI challenges. One of these challenges is the delicate balance between price and performance. Costs can vary significantly based on the model chosen, such as GPT-3.5 Turbo or GPT-4o. Additionally, the number of tokens consumed influences the cost. Token consumption depends on factors like chunk size, the number of chunks fed to the model, and specific model instructions. Effective cost management may involve defining output limits since input and output tokens incur charges.
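
As a rough illustration of token-based budgeting, the sketch below estimates an upper bound on the cost of a single request. The tokenizer choice and the prices in the usage comment are illustrative only; real prices differ per model and change over time.

    import tiktoken  # OpenAI's tokenizer; other model families use other tokenizers

    def estimate_request_cost(prompt: str, max_output_tokens: int,
                              input_price_per_1k: float, output_price_per_1k: float) -> float:
        """Rough upper bound on the cost of one LLM call, in the currency of the prices given."""
        encoding = tiktoken.get_encoding("cl100k_base")
        input_tokens = len(encoding.encode(prompt))
        return (input_tokens / 1000) * input_price_per_1k \
            + (max_output_tokens / 1000) * output_price_per_1k

    # Illustrative usage with made-up prices per 1,000 tokens:
    # estimate_request_cost(prompt, max_output_tokens=500,
    #                       input_price_per_1k=0.005, output_price_per_1k=0.015)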

Since LLM outputs are free text, not structured data, evaluating their performance and whether the responses are correct might require additional mechanisms, including other AI models.

Another challenge is dealing with contradictory information. The model may encounter conflicting data, making it difficult to provide accurate answers. What can be done? If the primary reason for conflicting data is outdated source documents, for example, prioritizing more recent documents can help mitigate the issue.
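
One way to implement such a recency preference, sketched below with the knowledge base entries from earlier: before prompting the model, keep only the most recent chunk per topic. The "topic" metadata field used here is hypothetical; any key that identifies documents covering the same subject would do.

    from datetime import datetime

    def keep_most_recent(chunks: list[dict]) -> list[dict]:
        """Among chunks that cover the same subject, keep only the most recently created one."""
        latest: dict[str, dict] = {}
        for chunk in chunks:
            key = chunk["metadata"].get("topic", chunk["metadata"]["source"])
            created = datetime.fromisoformat(chunk["metadata"]["created_at"])
            current = latest.get(key)
            if current is None or created > datetime.fromisoformat(
                    current["metadata"]["created_at"]):
                latest[key] = chunk
        return list(latest.values())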

Achieving optimal results also depends on crafting the right prompt. The structure of a prompt can influence the quality and relevance of a model's response. A well-designed prompt should be clear, concise, and contextually appropriate, guiding the LLM to generate the desired output. Tailoring prompts to specific user groups and data sets may also be necessary, as different audiences and contexts require distinct approaches. A poorly constructed prompt can lead to vague, incomplete, or off-topic responses.
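
A common structure bundles clear instructions, the retrieved chunks as labelled context, and the user's question. The template below is one possible starting point rather than a universal recipe; in practice it is tuned per use case and audience.

    def build_prompt(question: str, chunks: list[dict]) -> str:
        """Assemble a grounded prompt: instructions, retrieved context, then the question."""
        context = "\n\n".join(
            f"[Source: {c['metadata']['source']}]\n{c['text']}" for c in chunks
        )
        return (
            "You are an assistant that answers questions using only the context below.\n"
            "If the answer is not in the context, say that you don't know.\n"
            "Cite the source of every fact you use.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )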

Step 5: User Interface

How to best surface the information to the user?

At this stage, the focus is on creating an intuitive experience for users. They should be able to enter prompts, receive responses, and see a list of relevant source documents with ease. This is a software engineering problem.

One of the primary challenges with user experience is minimizing the time it takes for users to find the information they need. The interface should quickly and efficiently display relevant data. The system must also perform well under high user load and handle concurrent queries without compromising speed or reliability.

Over time, user feedback on the relevance and accuracy of responses and retrieved documents can help refine the system and optimize the ranking and retrieval algorithms. Feedback mechanisms can be as simple as thumbs-up/thumbs-down buttons that let users mark their satisfaction with a response.
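
Capturing that signal can be as simple as logging one structured record per response, which can later feed evaluation and re-ranking. A minimal sketch (in production the records would go to a database or analytics pipeline rather than an in-memory list):

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class ResponseFeedback:
        """One thumbs-up/thumbs-down rating for a chatbot response."""
        query: str
        answer: str
        retrieved_sources: list[str]
        thumbs_up: bool
        timestamp: datetime = field(default_factory=datetime.utcnow)

    feedback_log: list[ResponseFeedback] = []

    def record_feedback(query: str, answer: str, sources: list[str], thumbs_up: bool) -> None:
        feedback_log.append(ResponseFeedback(query, answer, sources, thumbs_up))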

If the answer isn't satisfactory, users need access to alternative strategies. The interface should offer additional filtering options or the ability to forward the query to an expert for assistance. This ensures that users always have a pathway to find the information they need, even if the initial response isn't perfect, building a sense of security and trust.

The user interface must also be compatible with existing business workflows and user experience expectations. People are accustomed to their workflows, so any innovation will require changing their routines. It’s often overlooked that effective change management, through training and support, can encourage a team to adopt new tools.

We've frequently noticed a big difference in how an educated user base uses a RAG system compared to an untrained one. Educating users on prompting strategies, such as when to use the system, what information to provide, and how to phrase queries, can significantly enhance productivity.

How to Know if the RAG Chatbot is Delivering on its Promise?

Evaluating the success of your RAG-based chatbot involves understanding how the system changes the way users deal with information.

A successful RAG-based chatbot finds information reliably and quickly. It becomes unusable or unnecessary if it fails in either reliability or speed.

Let's examine both.

Speed: how much faster can users find information with the RAG-powered chatbot?

The increase in speed can be quantified by tracking the average time it takes users to locate specific information before and after implementing the RAG-based solution.

For instance, if users are, on average, 50% faster, multiplying this improvement by the number of users and the frequency of their searches will quantify the overall efficiency gains.
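
A back-of-the-envelope calculation, with purely made-up numbers, shows how quickly this adds up:

    # Illustrative efficiency estimate; every number here is an assumption.
    users = 200                       # employees using the chatbot
    searches_per_user_per_week = 15
    minutes_per_search_before = 10
    speedup = 0.5                     # users are 50% faster with the RAG chatbot

    minutes_saved_per_week = users * searches_per_user_per_week \
        * minutes_per_search_before * speedup
    hours_saved_per_year = minutes_saved_per_week * 52 / 60
    print(f"~{hours_saved_per_year:,.0f} hours saved per year")  # ~13,000 hours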

Naturally, the next question is how your team will use their spare time. Efficiency gains shouldn't be assessed only in terms of time saved but also in how that time is reinvested in activities that add more value to the organization.

  • Faster information retrieval could mean allocating extra time to more complex, strategic tasks that require critical thinking and creativity, such as collaborating with the product team to suggest improvements based on recurring customer service issues.

  • Keeping workforce costs flat in areas experiencing high growth allows the same team to handle a larger workload.

  • Responding faster to client requests results in shorter proposal turnaround times and lower acquisition costs, leading to higher sales volume.

Reliability: does the RAG chatbot provide access to information that was previously unavailable or hard to find?

The reliability of the system can be assessed by comparing the number of successful retrievals against the old system. You can also evaluate whether the information obtained today is more accurate or comprehensive.

However, true business value comes from using improved access to information to drive better decision-making, respond quicker to requests, or reduce errors. It means finding the right product information and using it to highlight its benefits during the sales process. It means accessing the proper experiment reports to suggest improvements to experiments or the next best actions based on accurate information.

Consider the cost of errors as well. Analyze the impact of generating wrong information, including the effects on business decisions, customer satisfaction, and any necessary rework to correct mistakes.

It's time to shift our perspective from merely acquiring information to actively using it to create tangible business outcomes.

The Power of Thoughtful RAG Implementation

While the advancements in RAG technology are impressive, building reliable RAG-based chatbots remains a complex process that requires a thoughtful approach.

The key to success lies in the meticulous execution of the initial steps: data ingestion, knowledge base creation, and document retrieval - everything involved in finding the right source of information and then effectively surfacing it to the user. Even the best LLM cannot provide correct results without accurate knowledge retrieval.

With the many document retrieval implementations we have done, we know how to navigate the complexities of information access. Our expertise allows us to implement alternative strategies that benefit users and provide the guidance LLMs need to add value in any scenario.

As long as a solid foundation is in place, users get information faster and more reliably. It empowers the Conversational AI to respond with depth and precision, turning uncertainty into confidence in a system users can rely on.
