RAG for AI Chatbots: Complete Guide to Retrieval-Augmented Generation in 2026
Retrieval-Augmented Generation (RAG) transforms generic AI chatbots into domain experts. By connecting your chatbot to custom knowledge bases, you sharply reduce hallucinations and deliver accurate, contextual responses grounded in your own content. This guide covers RAG architecture, implementation patterns, and production best practices.
What is RAG and Why Does It Matter?
The Problem with Vanilla LLMs
Standard LLMs have critical limitations for business chatbots:
- Knowledge cutoff: Training data becomes stale
- Hallucinations: Confident but incorrect answers
- No proprietary knowledge: Can't access your docs, products, or policies
- Generic responses: Lack company-specific context
How RAG Solves These Problems
RAG combines retrieval systems with generative AI:
User Query → Retrieve Relevant Documents → Augment Prompt → Generate Response
| Aspect | Without RAG | With RAG |
|---|---|---|
| Knowledge | Training data only | Your custom data |
| Accuracy | Prone to hallucination | Grounded in sources |
| Updates | Requires retraining | Update docs anytime |
| Citations | Cannot cite sources | Links to source docs |
| Cost | Fine-tuning expensive | Retrieval is cheap |
RAG vs Fine-Tuning
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Setup time | Hours | Days/weeks |
| Cost | Low (retrieval) | High (GPU training) |
| Updates | Instant | Retrain required |
| Accuracy | High with good docs | Can overfit |
| Best for | Facts, docs, policies | Style, behavior |
RAG Architecture Deep Dive
Core Components
┌─────────────────────────────────────────────────────────────┐
│ RAG Pipeline │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐│
│ │ Query │───▶│ Embedding│───▶│ Vector │───▶│ Top-K ││
│ │ │ │ Model │ │ Search │ │ Docs ││
│ └──────────┘ └──────────┘ └──────────┘ └───┬───┘│
│ │ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ Response │◀───│ LLM │◀───│ Augmented│◀──────┘ │
│ │ │ │ │ │ Prompt │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Component Breakdown
- Document Ingestion: Load and preprocess source documents
- Chunking: Split documents into semantic segments
- Embedding: Convert text to vector representations
- Vector Store: Index embeddings for fast similarity search
- Retrieval: Find relevant chunks for a query
- Augmentation: Inject context into LLM prompt
- Generation: Produce grounded response
Setting Up Your RAG Pipeline
Prerequisites
npm install openai @pinecone-database/pinecone langchain
# or
pip install openai pinecone langchain chromadb
Step 1: Document Loading
// lib/documentLoader.ts
import { readFile, readdir } from 'fs/promises'
import { join } from 'path'
export interface Document {
id: string
content: string
metadata: {
source: string
title?: string
category?: string
}
}
export async function loadDocuments(dirPath: string): Promise<Document[]> {
const documents: Document[] = []
const files = await readdir(dirPath)
for (const file of files) {
if (!file.endsWith('.md') && !file.endsWith('.txt')) continue
const filePath = join(dirPath, file)
const content = await readFile(filePath, 'utf-8')
documents.push({
id: file.replace(/\.[^.]+$/, ''),
content,
metadata: {
source: filePath,
title: extractTitle(content),
},
})
}
return documents
}
function extractTitle(content: string): string {
const match = content.match(/^#\s+(.+)$/m)
return match ? match[1] : 'Untitled'
}
Step 2: Text Chunking
Chunking strategy significantly impacts retrieval quality.
// lib/chunker.ts
import type { Document } from './documentLoader'
interface Chunk {
id: string
content: string
metadata: {
documentId: string
chunkIndex: number
source: string
}
}
interface ChunkingOptions {
chunkSize: number
chunkOverlap: number
separators?: string[]
}
export function chunkDocument(
doc: Document,
options: ChunkingOptions
): Chunk[] {
const { chunkSize, chunkOverlap, separators = ['\n\n', '\n', '. ', ' '] } = options
const chunks: Chunk[] = []
// Recursive character text splitter logic
const text = doc.content
const segments = splitRecursively(text, separators, chunkSize)
let currentChunk = ''
let chunkIndex = 0
for (const segment of segments) {
if (currentChunk.length + segment.length > chunkSize) {
if (currentChunk) {
chunks.push({
id: `${doc.id}-chunk-${chunkIndex}`,
content: currentChunk.trim(),
metadata: {
documentId: doc.id,
chunkIndex,
source: doc.metadata.source,
},
})
chunkIndex++
// Keep overlap from end of current chunk
const overlapStart = Math.max(0, currentChunk.length - chunkOverlap)
currentChunk = currentChunk.slice(overlapStart)
}
}
currentChunk += segment
}
// Don't forget the last chunk
if (currentChunk.trim()) {
chunks.push({
id: `${doc.id}-chunk-${chunkIndex}`,
content: currentChunk.trim(),
metadata: {
documentId: doc.id,
chunkIndex,
source: doc.metadata.source,
},
})
}
return chunks
}
function splitRecursively(
text: string,
separators: string[],
chunkSize: number
): string[] {
if (!separators.length || text.length <= chunkSize) {
return [text]
}
const separator = separators[0]
const parts = text.split(separator)
const result: string[] = []
for (const part of parts) {
if (part.length <= chunkSize) {
result.push(part + separator)
} else {
// Recursively split with next separator
result.push(...splitRecursively(part, separators.slice(1), chunkSize))
}
}
return result
}
Chunking Strategies Comparison
| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed | 500 tokens | 50 | General text |
| Semantic | Variable | 0 | Structured docs |
| Sentence | 3-5 sentences | 1 | Conversations |
| Paragraph | Natural | 0 | Articles |
| Recursive | 1000 chars | 200 | Mixed content |
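Note that the table mixes units: some rows are sized in tokens, others in characters. The chunker above works in characters, and a rough rule of thumb for English text is about 4 characters per token, so a 500-token target is roughly a 2,000-character chunkSize. A small helper if you prefer to think in tokens (the ratio is a heuristic; use a tokenizer such as tiktoken for exact counts):
// Convert a token budget to an approximate character budget for the chunker above.
// ~4 characters per token is a common heuristic for English text; it is not exact.
function charBudgetForTokens(tokenBudget: number, charsPerToken = 4): number {
  return tokenBudget * charsPerToken
}
// e.g. a 500-token chunk target maps to a chunkSize of roughly 2000 characters
const chunkSize = charBudgetForTokens(500)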
Step 3: Generate Embeddings
// lib/embeddings.ts
import OpenAI from 'openai'
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
})
export async function generateEmbedding(text: string): Promise<number[]> {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
})
return response.data[0].embedding
}
export async function generateEmbeddings(
texts: string[]
): Promise<number[][]> {
// Batch for efficiency (max 2048 inputs per request)
const batchSize = 100
const embeddings: number[][] = []
for (let i = 0; i < texts.length; i += batchSize) {
const batch = texts.slice(i, i + batchSize)
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: batch,
})
embeddings.push(...response.data.map(d => d.embedding))
}
return embeddings
}
Embedding Models Comparison
| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | Good | $0.02/1M |
| text-embedding-3-large | 3072 | Medium | Best | $0.13/1M |
| text-embedding-ada-002 | 1536 | Fast | Good | $0.10/1M |
| Cohere embed-v3 | 1024 | Fast | Good | $0.10/1M |
| Local (e5-small) | 384 | Fastest | OK | Free |
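Whichever model you choose, retrieval works by comparing the query vector against every stored chunk vector, typically with cosine similarity. The vector databases below do this for you at scale; a minimal reference implementation just makes the math concrete:
// Cosine similarity between two embedding vectors of the same dimension
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}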
Step 4: Vector Store Setup
Using Pinecone
// lib/vectorStore/pinecone.ts
import { Pinecone } from '@pinecone-database/pinecone'
const pinecone = new Pinecone({
apiKey: process.env.PINECONE_API_KEY!,
})
const index = pinecone.index('chatbot-knowledge')
interface VectorRecord {
id: string
values: number[]
metadata: Record<string, any>
}
export async function upsertVectors(records: VectorRecord[]) {
// Pinecone batch limit is 100
const batchSize = 100
for (let i = 0; i < records.length; i += batchSize) {
const batch = records.slice(i, i + batchSize)
await index.upsert(batch)
}
}
export async function queryVectors(
embedding: number[],
topK: number = 5,
filter?: Record<string, any>
): Promise<Array<{ id: string; score: number; metadata: any }>> {
const results = await index.query({
vector: embedding,
topK,
filter,
includeMetadata: true,
})
return results.matches?.map(match => ({
id: match.id,
score: match.score || 0,
metadata: match.metadata,
})) || []
}
export async function deleteVectors(ids: string[]) {
await index.deleteMany(ids)
}
Using ChromaDB (Local/Self-Hosted)
// lib/vectorStore/chroma.ts
import { ChromaClient, Collection } from 'chromadb'
const client = new ChromaClient({
path: process.env.CHROMA_URL || 'http://localhost:8000',
})
let collection: Collection
export async function initCollection() {
collection = await client.getOrCreateCollection({
name: 'chatbot-knowledge',
metadata: { 'hnsw:space': 'cosine' },
})
}
export async function addDocuments(
ids: string[],
embeddings: number[][],
documents: string[],
metadatas: Record<string, any>[]
) {
await collection.add({
ids,
embeddings,
documents,
metadatas,
})
}
export async function queryDocuments(
queryEmbedding: number[],
nResults: number = 5,
whereFilter?: Record<string, any>
) {
const results = await collection.query({
queryEmbeddings: [queryEmbedding],
nResults,
where: whereFilter,
})
return results.ids[0].map((id, i) => ({
id,
document: results.documents?.[0]?.[i],
metadata: results.metadatas?.[0]?.[i],
distance: results.distances?.[0]?.[i],
}))
}
Vector Database Comparison
| Database | Hosting | Scale | Latency | Best For |
|---|---|---|---|---|
| Pinecone | Managed | Billions | <50ms | Production |
| Weaviate | Both | Millions | <100ms | Hybrid search |
| ChromaDB | Self-host | Millions | <50ms | Development |
| Qdrant | Both | Billions | <50ms | Open source |
| pgvector | Self-host | Millions | <100ms | Postgres users |
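If you already run Postgres, the pgvector row above is often the pragmatic choice: no extra infrastructure, and similarity search lives next to your relational data. A minimal query sketch using the node-postgres (pg) client follows; the table name, columns, and DATABASE_URL are illustrative, and pgvector's <=> operator is cosine distance:
// lib/vectorStore/pgvector.ts — sketch only; assumes the pgvector extension and a table like:
//   CREATE TABLE chunks (id text PRIMARY KEY, content text, embedding vector(1536));
import { Pool } from 'pg'

const pool = new Pool({ connectionString: process.env.DATABASE_URL })

export async function queryVectors(embedding: number[], topK: number = 5) {
  // <=> is pgvector's cosine distance operator, so 1 - distance gives a similarity score
  const { rows } = await pool.query(
    `SELECT id, content, 1 - (embedding <=> $1::vector) AS score
     FROM chunks
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(embedding), topK]
  )
  return rows as Array<{ id: string; content: string; score: number }>
}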
Building the RAG Chatbot
Complete Ingestion Pipeline
// scripts/ingest.ts
import { loadDocuments } from '../lib/documentLoader'
import { chunkDocument } from '../lib/chunker'
import { generateEmbeddings } from '../lib/embeddings'
import { upsertVectors } from '../lib/vectorStore/pinecone'
async function ingestKnowledgeBase() {
console.log('Loading documents...')
const documents = await loadDocuments('./knowledge-base')
console.log(`Loaded ${documents.length} documents`)
console.log('Chunking documents...')
const allChunks = documents.flatMap(doc =>
chunkDocument(doc, {
chunkSize: 1000,
chunkOverlap: 200,
})
)
console.log(`Created ${allChunks.length} chunks`)
console.log('Generating embeddings...')
const embeddings = await generateEmbeddings(
allChunks.map(c => c.content)
)
console.log('Upserting to vector store...')
const records = allChunks.map((chunk, i) => ({
id: chunk.id,
values: embeddings[i],
metadata: {
...chunk.metadata,
content: chunk.content,
},
}))
await upsertVectors(records)
console.log('Ingestion complete!')
}
ingestKnowledgeBase()
RAG Query Pipeline
// lib/ragPipeline.ts
import { generateEmbedding } from './embeddings'
import { queryVectors } from './vectorStore/pinecone'
import OpenAI from 'openai'
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
})
export interface RAGResponse {
answer: string
sources: Array<{
id: string
content: string
score: number
}>
}
export async function queryRAG(
question: string,
options: {
topK?: number
temperature?: number
systemPrompt?: string
} = {}
): Promise<RAGResponse> {
const { topK = 5, temperature = 0.7, systemPrompt } = options
// Step 1: Embed the question
const questionEmbedding = await generateEmbedding(question)
// Step 2: Retrieve relevant chunks
const results = await queryVectors(questionEmbedding, topK)
// Step 3: Build context from retrieved chunks
const context = results
.map((r, i) => `[${i + 1}] ${r.metadata.content}`)
.join('\n\n')
// Step 4: Generate response with context
const messages: OpenAI.ChatCompletionMessageParam[] = [
{
role: 'system',
content: systemPrompt || `You are a helpful assistant. Answer questions based on the provided context.
If the context doesn't contain relevant information, say so.
Always cite your sources using [1], [2], etc.
Context:
${context}`,
},
{
role: 'user',
content: question,
},
]
const completion = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages,
temperature,
})
return {
answer: completion.choices[0].message.content || '',
sources: results.map(r => ({
id: r.id,
content: r.metadata.content,
score: r.score,
})),
}
}
API Endpoint
// app/api/chat/route.ts (Next.js)
import { NextRequest, NextResponse } from 'next/server'
import { queryRAG } from '@/lib/ragPipeline'
export async function POST(req: NextRequest) {
try {
const { question, conversationHistory } = await req.json()
if (!question) {
return NextResponse.json(
{ error: 'Question is required' },
{ status: 400 }
)
}
const response = await queryRAG(question, {
topK: 5,
temperature: 0.7,
systemPrompt: `You are a customer support assistant for our company.
Use the provided context to answer questions accurately.
If you're unsure, ask for clarification rather than guessing.
Always be polite and professional.`,
})
return NextResponse.json(response)
} catch (error) {
console.error('RAG error:', error)
return NextResponse.json(
{ error: 'Internal server error' },
{ status: 500 }
)
}
}
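On the client, the chat UI simply POSTs a question to this route and renders the answer and sources. A sketch (streaming and error states omitted):
// Client-side call from the chat UI
async function askChatbot(question: string) {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ question }),
  })
  if (!res.ok) throw new Error('Chat request failed')
  return (await res.json()) as {
    answer: string
    sources: Array<{ id: string; content: string; score: number }>
  }
}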
Advanced RAG Techniques
Hybrid Search (Dense + Sparse)
Combine semantic similarity with keyword matching.
// lib/hybridSearch.ts
import { generateEmbedding } from './embeddings'
import { queryVectors } from './vectorStore/pinecone'
import { bm25Search } from './bm25' // keyword-search helper — a minimal sketch follows this function
interface HybridResult {
id: string
content: string
denseScore: number
sparseScore: number
combinedScore: number
}
export async function hybridSearch(
query: string,
topK: number = 5,
alpha: number = 0.5 // Weight for dense vs sparse
): Promise<HybridResult[]> {
// Dense search (semantic)
const queryEmbedding = await generateEmbedding(query)
const denseResults = await queryVectors(queryEmbedding, topK * 2)
// Sparse search (BM25 keyword matching)
const sparseResults = await bm25Search(query, topK * 2)
// Reciprocal Rank Fusion
const scores = new Map<string, { dense: number; sparse: number }>()
denseResults.forEach((r, rank) => {
const existing = scores.get(r.id) || { dense: 0, sparse: 0 }
existing.dense = 1 / (rank + 60) // RRF constant = 60
scores.set(r.id, existing)
})
sparseResults.forEach((r, rank) => {
const existing = scores.get(r.id) || { dense: 0, sparse: 0 }
existing.sparse = 1 / (rank + 60)
scores.set(r.id, existing)
})
// Combine and sort
const combined = Array.from(scores.entries()).map(([id, s]) => ({
id,
content: denseResults.find(r => r.id === id)?.metadata.content ||
sparseResults.find(r => r.id === id)?.content || '',
denseScore: s.dense,
sparseScore: s.sparse,
combinedScore: alpha * s.dense + (1 - alpha) * s.sparse,
}))
return combined
.sort((a, b) => b.combinedScore - a.combinedScore)
.slice(0, topK)
}
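The bm25Search helper above is a keyword scorer. In production you would usually back it with a search engine (Elasticsearch, OpenSearch) or your vector database's sparse index; for small corpora, a minimal in-memory BM25 works. The sketch below uses the conventional defaults k1 = 1.5 and b = 0.75 (file name and constants are illustrative):
// lib/bm25.ts — minimal in-memory BM25 sketch for the bm25Search helper used above
interface BM25Doc {
  id: string
  content: string
}

const K1 = 1.5
const B = 0.75

let corpus: Array<BM25Doc & { tokens: string[] }> = []

const tokenize = (text: string) => text.toLowerCase().match(/[a-z0-9]+/g) || []

// Call once at ingestion time with your chunk ids and contents
export function loadBM25Corpus(docs: BM25Doc[]) {
  corpus = docs.map(d => ({ ...d, tokens: tokenize(d.content) }))
}

export function bm25Search(query: string, topK: number = 10) {
  const avgLen = corpus.reduce((sum, d) => sum + d.tokens.length, 0) / (corpus.length || 1)
  const terms = tokenize(query)

  const scored = corpus.map(doc => {
    let score = 0
    for (const term of terms) {
      const tf = doc.tokens.filter(t => t === term).length
      if (!tf) continue
      const df = corpus.filter(d => d.tokens.includes(term)).length
      const idf = Math.log(1 + (corpus.length - df + 0.5) / (df + 0.5))
      score += (idf * tf * (K1 + 1)) / (tf + K1 * (1 - B + (B * doc.tokens.length) / avgLen))
    }
    return { id: doc.id, content: doc.content, score }
  })

  return scored.sort((a, b) => b.score - a.score).slice(0, topK)
}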
Query Rewriting
Improve retrieval by reformulating queries.
// lib/queryRewriter.ts
import OpenAI from 'openai'

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
})
export async function rewriteQuery(
originalQuery: string,
conversationHistory: Array<{ role: string; content: string }>
): Promise<string[]> {
const completion = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages: [
{
role: 'system',
content: `You are a query rewriting assistant. Given a user question and conversation history,
generate 3 different versions of the query that might help retrieve relevant information.
Return only the queries, one per line.`,
},
{
role: 'user',
content: `Conversation history:
${conversationHistory.map(m => `${m.role}: ${m.content}`).join('\n')}
Original query: ${originalQuery}
Generate 3 rewritten queries:`,
},
],
temperature: 0.7,
})
const rewritten = completion.choices[0].message.content || ''
return rewritten.split('\n').filter(q => q.trim())
}
Re-ranking Retrieved Results
Use a cross-encoder for better relevance scoring.
// lib/reranker.ts
interface RerankedResult {
id: string
content: string
originalScore: number
rerankedScore: number
}
export async function rerankResults(
query: string,
results: Array<{ id: string; content: string; score: number }>,
topK: number = 3
): Promise<RerankedResult[]> {
// Use Cohere rerank API or a cross-encoder model
const response = await fetch('https://api.cohere.ai/v1/rerank', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.COHERE_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: 'rerank-english-v3.0',
query,
documents: results.map(r => r.content),
top_n: topK,
}),
})
const data = await response.json()
return data.results.map((r: any) => ({
id: results[r.index].id,
content: results[r.index].content,
originalScore: results[r.index].score,
rerankedScore: r.relevance_score,
}))
}
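A common way to wire this in is to over-retrieve from the vector store and let the cross-encoder pick the few chunks that actually go into the prompt. A sketch (the 20-candidate pool and top-3 cut are illustrative):
// lib/retrieveReranked.ts — over-retrieve, then rerank down to the chunks sent to the LLM
import { generateEmbedding } from './embeddings'
import { queryVectors } from './vectorStore/pinecone'
import { rerankResults } from './reranker'

export async function retrieveReranked(question: string) {
  // Pull a wide candidate pool cheaply with vector search
  const candidates = await queryVectors(await generateEmbedding(question), 20)
  // Let the cross-encoder re-score and keep only the best few
  return rerankResults(
    question,
    candidates.map(c => ({ id: c.id, score: c.score, content: c.metadata.content })),
    3
  )
}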
Multi-Query RAG
Retrieve from multiple query perspectives.
// lib/multiQueryRAG.ts
import { generateEmbedding } from './embeddings'
import { queryVectors } from './vectorStore/pinecone'
import type { RAGResponse } from './ragPipeline'
// generateQueryVariations and generateRAGResponse are helpers sketched at the end of this section
export async function multiQueryRAG(
question: string,
options: { topK?: number } = {}
): Promise<RAGResponse> {
const { topK = 5 } = options
// Generate multiple query perspectives
const queries = await generateQueryVariations(question)
// Retrieve for each query
const allResults: Map<string, { content: string; scores: number[] }> = new Map()
for (const query of queries) {
const embedding = await generateEmbedding(query)
const results = await queryVectors(embedding, topK)
results.forEach(r => {
const existing = allResults.get(r.id)
if (existing) {
existing.scores.push(r.score)
} else {
allResults.set(r.id, {
content: r.metadata.content,
scores: [r.score],
})
}
})
}
// Aggregate scores (average)
const aggregated = Array.from(allResults.entries())
.map(([id, data]) => ({
id,
content: data.content,
score: data.scores.reduce((a, b) => a + b) / data.scores.length,
}))
.sort((a, b) => b.score - a.score)
.slice(0, topK)
// Generate response with aggregated context
return generateRAGResponse(question, aggregated)
}
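multiQueryRAG leans on two helpers. generateQueryVariations can simply reuse the rewriteQuery pattern from the previous section, and generateRAGResponse is the prompt-building and completion half of queryRAG, factored out so it accepts pre-retrieved chunks (queryRAGSafe in the error-handling section below reuses it the same way, after mapping its results into this shape). A sketch of both:
// Helpers assumed by multiQueryRAG — sketches; adapt names and paths to your project
import OpenAI from 'openai'
import { rewriteQuery } from './queryRewriter'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

// Reuse the query-rewriting prompt, keeping the original question in the mix
async function generateQueryVariations(question: string): Promise<string[]> {
  const variations = await rewriteQuery(question, [])
  return [question, ...variations]
}

// The generation half of queryRAG, operating on already-retrieved chunks
async function generateRAGResponse(
  question: string,
  chunks: Array<{ id: string; content: string; score: number }>
): Promise<RAGResponse> {
  const context = chunks.map((c, i) => `[${i + 1}] ${c.content}`).join('\n\n')
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    messages: [
      {
        role: 'system',
        content: `You are a helpful assistant. Answer from the provided context and cite sources as [1], [2], etc.\n\nContext:\n${context}`,
      },
      { role: 'user', content: question },
    ],
  })
  return {
    answer: completion.choices[0].message.content || '',
    sources: chunks,
  }
}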
Production Considerations
Caching Layer
// lib/cache.ts
import type { RAGResponse } from './ragPipeline'
import { Redis } from '@upstash/redis'
const redis = new Redis({
url: process.env.UPSTASH_REDIS_URL!,
token: process.env.UPSTASH_REDIS_TOKEN!,
})
const CACHE_TTL = 3600 // 1 hour
export async function getCachedResponse(
query: string
): Promise<RAGResponse | null> {
const cacheKey = `rag:${hashQuery(query)}`
const cached = await redis.get<RAGResponse>(cacheKey)
return cached
}
export async function setCachedResponse(
query: string,
response: RAGResponse
): Promise<void> {
const cacheKey = `rag:${hashQuery(query)}`
await redis.set(cacheKey, response, { ex: CACHE_TTL })
}
function hashQuery(query: string): string {
// Simple hash for cache key
let hash = 0
for (let i = 0; i < query.length; i++) {
const char = query.charCodeAt(i)
hash = ((hash << 5) - hash) + char
hash = hash & hash
}
return hash.toString(36)
}
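Wiring the cache around the pipeline is then a thin wrapper. Note that hashQuery is an exact-match key, so normalizing casing and whitespace first improves the hit rate. A sketch:
// lib/cachedRag.ts — check the cache before running the full RAG pipeline
import { queryRAG, type RAGResponse } from './ragPipeline'
import { getCachedResponse, setCachedResponse } from './cache'

export async function queryRAGCached(question: string): Promise<RAGResponse> {
  const normalized = question.trim().toLowerCase() // exact-match cache, so normalize first
  const cached = await getCachedResponse(normalized)
  if (cached) return cached

  const response = await queryRAG(question)
  await setCachedResponse(normalized, response)
  return response
}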
Monitoring and Analytics
// lib/analytics.ts
interface RAGMetrics {
queryId: string
question: string
retrievalLatency: number
generationLatency: number
totalLatency: number
topKScores: number[]
responseLength: number
citationsUsed: number
}
export async function trackRAGQuery(metrics: RAGMetrics) {
// Send to your analytics service
await fetch(process.env.ANALYTICS_ENDPOINT!, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
event: 'rag_query',
timestamp: new Date().toISOString(),
...metrics,
}),
})
}
// Usage in RAG pipeline — retrieveDocuments and generateResponse stand in for
// the retrieval and generation steps shown earlier in the pipeline code
export async function queryRAGWithMetrics(question: string) {
const queryId = crypto.randomUUID()
const startTime = Date.now()
const retrievalStart = Date.now()
const results = await retrieveDocuments(question)
const retrievalLatency = Date.now() - retrievalStart
const generationStart = Date.now()
const response = await generateResponse(question, results)
const generationLatency = Date.now() - generationStart
await trackRAGQuery({
queryId,
question,
retrievalLatency,
generationLatency,
totalLatency: Date.now() - startTime,
topKScores: results.map(r => r.score),
responseLength: response.answer.length,
citationsUsed: (response.answer.match(/\[\d+\]/g) || []).length,
})
return response
}
Error Handling
// lib/ragPipeline.ts
class RAGError extends Error {
constructor(
message: string,
public code: string,
public recoverable: boolean
) {
super(message)
this.name = 'RAGError'
}
}
export async function queryRAGSafe(question: string): Promise<RAGResponse> {
try {
// Validate input
if (!question || question.length > 2000) {
throw new RAGError(
'Invalid question length',
'INVALID_INPUT',
false
)
}
// Attempt retrieval
let results
try {
results = await queryVectors(await generateEmbedding(question), 5)
} catch (e) {
throw new RAGError(
'Vector store unavailable',
'RETRIEVAL_FAILED',
true
)
}
// Check retrieval quality
if (!results.length || results[0].score < 0.3) {
// Fallback to general response
return {
answer: "I don't have specific information about that in my knowledge base. Could you rephrase your question or ask about something else?",
sources: [],
}
}
// Generate response
return await generateRAGResponse(question, results)
} catch (error) {
if (error instanceof RAGError && error.recoverable) {
// Return graceful fallback
return {
answer: "I'm having trouble accessing my knowledge base right now. Please try again in a moment.",
sources: [],
}
}
throw error
}
}
Evaluation and Testing
Retrieval Quality Metrics
// tests/evaluation.ts
import { queryRAG } from '../lib/ragPipeline'
interface RetrievalMetrics {
precision: number
recall: number
mrr: number // Mean Reciprocal Rank
ndcg: number // Normalized Discounted Cumulative Gain
}
interface TestCase {
question: string
relevantDocIds: string[]
}
export async function evaluateRetrieval(
testCases: TestCase[],
topK: number = 5
): Promise<RetrievalMetrics> {
let totalPrecision = 0
let totalRecall = 0
let totalMRR = 0
for (const testCase of testCases) {
const results = await queryRAG(testCase.question, { topK })
const retrievedIds = results.sources.map(s => s.id)
// Precision@K
const relevantRetrieved = retrievedIds.filter(id =>
testCase.relevantDocIds.includes(id)
).length
totalPrecision += relevantRetrieved / topK
// Recall@K
totalRecall += relevantRetrieved / testCase.relevantDocIds.length
// MRR
const firstRelevantRank = retrievedIds.findIndex(id =>
testCase.relevantDocIds.includes(id)
)
if (firstRelevantRank !== -1) {
totalMRR += 1 / (firstRelevantRank + 1)
}
}
const n = testCases.length
return {
precision: totalPrecision / n,
recall: totalRecall / n,
mrr: totalMRR / n,
ndcg: 0, // Implement if needed
}
}
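If you want to fill in the ndcg field above, a binary-relevance nDCG@K is only a few lines (a sketch; graded relevance would need real gain values instead of 0/1):
// Binary-relevance nDCG@K: DCG of the actual ranking divided by the ideal DCG
function ndcgAtK(retrievedIds: string[], relevantIds: string[], k: number): number {
  const gains = retrievedIds.slice(0, k).map(id => (relevantIds.includes(id) ? 1 : 0))
  const dcg = gains.reduce((sum, gain, i) => sum + gain / Math.log2(i + 2), 0)
  let idcg = 0
  for (let i = 0; i < Math.min(k, relevantIds.length); i++) {
    idcg += 1 / Math.log2(i + 2)
  }
  return idcg === 0 ? 0 : dcg / idcg
}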
End-to-End Testing
// tests/rag.test.ts
import { describe, it, expect } from 'vitest'
import { queryRAG } from '../lib/ragPipeline'
describe('RAG Pipeline', () => {
it('returns relevant answer for known question', async () => {
const response = await queryRAG('What is your return policy?')
expect(response.answer).toBeTruthy()
expect(response.sources.length).toBeGreaterThan(0)
expect(response.answer.toLowerCase()).toContain('return')
})
it('cites sources correctly', async () => {
const response = await queryRAG('How do I reset my password?')
// Check for citation markers
const citations = response.answer.match(/\[\d+\]/g) || []
expect(citations.length).toBeGreaterThan(0)
// Verify cited sources exist
citations.forEach(citation => {
const num = parseInt(citation.replace(/[\[\]]/g, ''))
expect(num).toBeLessThanOrEqual(response.sources.length)
})
})
it('handles unknown topics gracefully', async () => {
const response = await queryRAG('What is the meaning of life?')
expect(response.answer).toBeTruthy()
// Should indicate lack of relevant information
expect(
response.answer.toLowerCase().includes("don't have") ||
response.sources.length === 0 ||
response.sources[0].score < 0.5
).toBe(true)
})
})
Best Practices Summary
Chunking
- Use 500-1000 token chunks for most use cases
- Maintain 10-20% overlap between chunks
- Preserve semantic boundaries (paragraphs, sections)
- Include metadata (source, title, date) with each chunk
Retrieval
- Start with top-5 results, adjust based on precision
- Use hybrid search for better recall
- Implement re-ranking for precision-critical applications
- Cache common queries
Generation
- Keep context under model's effective window
- Use clear source attribution instructions
- Implement confidence thresholds
- Fallback gracefully when retrieval fails
Operations
- Monitor retrieval latency and quality
- Set up alerts for degraded performance
- Regularly evaluate with test cases
- Update knowledge base incrementally
RAG transforms chatbots from generic assistants into domain experts. By grounding responses in your actual documentation, you deliver accurate, trustworthy answers that build user confidence and reduce support burden.


