How to Build an LLM-Powered Customer Support Chatbot with RAG
Generic LLM chatbots hallucinate. They answer confidently with wrong information because they lack context about your business. The solution is RAG (Retrieval-Augmented Generation) — grounding the LLM in your own knowledge base. Here's how Softotic builds this for clients.
What Is RAG?
RAG = Retrieval-Augmented Generation.
Instead of asking the LLM to answer from memory (which invites hallucinations), you:
- Retrieve the most relevant documents from your knowledge base.
- Include them in the LLM's prompt as context.
- Have the LLM generate its answer based only on that retrieved context.
Result: accurate, specific, verifiable answers grounded in your business data.
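The three steps above can be sketched in a few lines of plain Python. `retrieve` and `llm` here are stand-ins for the real components (vector search and GPT-4o), passed in as functions so the pattern is visible on its own:

```python
def answer_with_rag(question, retrieve, llm):
    """Minimal RAG sketch: `retrieve` returns a list of text chunks,
    `llm` takes a prompt string and returns a completion string."""
    chunks = retrieve(question)                      # 1. Retrieve
    context = "\n\n".join(chunks)                    # 2. Include as context
    prompt = (
        "Answer using ONLY the context below. "
        "If the answer isn't in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)                               # 3. Generate

# Toy usage with stub components (a one-document "knowledge base"
# and an "LLM" that just echoes the context back):
kb = ["Refunds are processed within 5 business days."]
print(answer_with_rag(
    "How long do refunds take?",
    retrieve=lambda q: kb,
    llm=lambda p: p.split("Context:\n")[1].split("\n\n")[0],
))
```

The real pipeline below swaps the stubs for embeddings, Pinecone, and GPT-4o, but the shape stays the same.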
Architecture Overview
```
User message
    ↓
[Query Embedding]  (OpenAI / sentence-transformers)
    ↓
[Vector Search in Pinecone]  →  Top-K relevant chunks
    ↓
[Prompt Construction]  =  System prompt + Context chunks + User message
    ↓
[OpenAI GPT-4o] generates response
    ↓
[Confidence check]  →  if low confidence: escalate to human
    ↓
Response to user
```
Step 1: Build Your Knowledge Base
Index your knowledge base into a vector database.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Load docs (could be PDF, markdown, website scrape)
docs = load_documents("./knowledge_base/")

# Split into overlapping chunks so context isn't cut mid-thought
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed and store in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore.from_documents(chunks, embeddings, index_name="support-kb")
```
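`load_documents` above is a placeholder. In practice you'd use LangChain's loaders (e.g. `DirectoryLoader` with `PyPDFLoader` for PDFs), but a stdlib-only stand-in for markdown/text files shows the expected shape — objects carrying `page_content` and `metadata`:

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Doc:
    """Minimal document shape: text plus metadata (e.g. source path)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_documents(path: str, patterns=("*.md", "*.txt")):
    """Load every matching file under `path` as a Doc.
    The glob patterns here are an assumption; adjust for your content."""
    docs = []
    for pattern in patterns:
        for f in sorted(Path(path).rglob(pattern)):
            docs.append(Doc(f.read_text(encoding="utf-8"), {"source": str(f)}))
    return docs
```

Keeping the source path in `metadata` is what later lets the bot cite which document an answer came from.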
Step 2: Build the RAG Chain
```python
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# output_key="answer" tells the memory which output to store when the
# chain returns more than one key (answer + source documents)
memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True, output_key="answer"
)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True,  # needed to surface sources in the API response
    verbose=False,
)
```
Step 3: FastAPI Endpoint
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    response = chain.invoke({"question": req.message})
    # Escalation trigger: model returned a low-confidence signal
    should_escalate = needs_human(response["answer"])
    return {
        "answer": response["answer"],
        "escalate": should_escalate,
        "sources": [doc.metadata for doc in response.get("source_documents", [])],
    }
```
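The endpoint calls `needs_human`, which isn't defined above. A simple heuristic version looks like this — the marker phrases are illustrative, and real systems often combine this check with retrieval-score thresholds or a dedicated classifier:

```python
LOW_CONFIDENCE_MARKERS = (
    "i'm not sure",
    "i don't know",
    "i do not have that information",
    "contact support",
)

def needs_human(answer: str) -> bool:
    """Heuristic escalation check: flag answers where the model
    signals uncertainty about the retrieved context."""
    lowered = answer.lower()
    return any(marker in lowered for marker in LOW_CONFIDENCE_MARKERS)
```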
Step 4: Human Escalation
This is a critical feature that's often overlooked. When the bot says "I'm not sure" or the user asks for a human, escalate:
- Flag the session as escalated in the database.
- Alert live agents via WebSocket or notification.
- Show the full conversation history to the agent.
- Agent takes over; user sees "You're now connected to a support agent."
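As a minimal in-process sketch of that flow (the dicts stand in for a real database and a WebSocket/notification channel, and the function names are illustrative):

```python
import time

sessions: dict = {}       # session_id -> state; stand-in for a database table
agent_queue: list = []    # stand-in for a WebSocket push / agent notification

def mark_escalated(session_id: str, history: list) -> None:
    """Flag the session and hand the full transcript to a live agent."""
    sessions[session_id] = {"escalated": True, "escalated_at": time.time()}
    agent_queue.append({"session_id": session_id, "history": history})

def handoff_message(session_id: str) -> str:
    """What the user sees once an agent picks up the session."""
    return "You're now connected to a support agent."
```

In production, the flag would live alongside the session record so the `/chat` endpoint can route subsequent messages to the agent instead of the LLM.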
Step 5: Multi-Channel Integration
- Web widget: React component connects to the `/chat` API via WebSocket.
- WhatsApp: WhatsApp Business API webhook → your chat API → response via WhatsApp.
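Both channels can share the same core `/chat` endpoint if each one has a small adapter that normalizes its inbound payload. A sketch (the field names for the WhatsApp payload are illustrative, not the actual WhatsApp Business API webhook schema):

```python
def normalize_inbound(channel: str, payload: dict) -> dict:
    """Map a channel-specific payload to the common /chat request shape
    (session_id + message)."""
    if channel == "web":
        return {"session_id": payload["session_id"], "message": payload["message"]}
    if channel == "whatsapp":
        # Illustrative fields only; consult the WhatsApp Business API
        # docs for the real webhook schema.
        return {"session_id": payload["from"], "message": payload["text"]}
    raise ValueError(f"unknown channel: {channel}")
```

Using the sender's phone number as the session ID is one common choice for WhatsApp, since it's stable across messages.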
Production Considerations
- Session management: Store chat history in Redis with TTL.
- Rate limiting: Per-IP and per-session to prevent abuse.
- Content filtering: Validate inputs to prevent prompt injection.
- Logging: Log all conversations for quality review and model fine-tuning.
- Monitoring: Track average response latency, escalation rate, user satisfaction.
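To make the rate-limiting point concrete, per-session limiting can be a small sliding window. This sketch keeps state in memory; in production the counters would live in Redis so they survive restarts and work across workers:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per key
    (key = session ID or client IP)."""

    def __init__(self, limit: int = 10, window: float = 60.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(deque)

    def allow(self, key: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Drop timestamps that have aged out of the window
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

The endpoint would call `allow(req.session_id)` before invoking the chain and return HTTP 429 on refusal.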
Keeping the Knowledge Base Fresh
Set up a pipeline to re-index when content changes:
- Webhook from your CMS triggers re-ingestion
- Weekly full re-index as a scheduled job
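One way to keep the webhook-triggered path cheap is to hash each document and only re-chunk and re-embed the ones that changed. A sketch, where the `stored_hashes` dict stands in for a metadata table next to the vector index:

```python
import hashlib

def changed_docs(docs: dict, stored_hashes: dict) -> list:
    """Given {doc_id: text}, return the IDs whose content hash differs
    from the stored hash (new or modified docs), updating the store."""
    changed = []
    for doc_id, text in docs.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != h:
            changed.append(doc_id)
            stored_hashes[doc_id] = h
    return changed
```

The weekly full re-index then serves as a safety net for anything the incremental path missed (e.g. deletions).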
Conclusion
A RAG-based customer support bot, built properly, reduces support volume by 60–80% while maintaining accuracy. The critical success factor is a well-structured, comprehensive knowledge base.
Ready to add AI support to your product? Softotic's AI team can build it.