AI Product Manager's Essential Course: RAG (Final)

by baoshi.rao
    As the final installment of this series, this article will dissect the underlying logic and implementation path of RAG from a product perspective, helping you truly leverage it as a 'cognitive engine' rather than just a 'search plugin.'

    1. Introduction

    Starting from the essence of RAG and the orchestration of the RAG pipeline, this article explores the drawbacks of going without RAG, the benefits of adopting it, and the key considerations at each step of a RAG system.

    • Essence of RAG: The decoupling and re-orchestration of knowledge retrieval and language generation.
    • RAG Pipeline Orchestration: Every step, from offline knowledge base construction to online retrieval-augmented generation.

    2. Decoupling Retrieval and Generation

    RAG splits what is an inseparable, implicit process inside a single large model into two independently optimizable modules. The decoupling not only separates the retrieval and generation functions but also separates knowledge updates from model training.

    2.1 Drawbacks of Not Using RAG

    Without RAG, large models must store all knowledge within their parameters. However, knowledge changes far faster than models can be retrained; updates can arrive within hours or even minutes. Retraining the model for every update is costly and may introduce new issues.

    a) Long Training Cycles and High Costs: Each knowledge update may require retraining or fine-tuning the entire model, a time-consuming and expensive process. RAG avoids frequent retraining by storing knowledge in an external database and letting the model retrieve relevant information on demand (e.g., via vector retrieval based on semantic similarity), thereby reducing costs. A minimal sketch of this retrieve-on-demand idea follows this list.

    b) Data Compliance and Noise Introduction: New data may contain non-compliant content or noise, which, if used directly for training, becomes embedded in the model parameters. RAG allows stricter quality control and compliance checks on the external knowledge base, mitigating risks.
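
    To make the retrieve-on-demand mechanism concrete, here is a minimal sketch of semantic-similarity retrieval using cosine similarity over toy vectors. In a real system the vectors come from an embedding model; numpy is an assumed dependency:

    import numpy as np

    # Toy knowledge base: pretend these rows came from an embedding model.
    doc_vectors = np.array([
        [0.9, 0.1, 0.0],   # chunk 0: "2024 pricing policy"
        [0.1, 0.8, 0.1],   # chunk 1: "refund workflow"
        [0.0, 0.2, 0.9],   # chunk 2: "API rate limits"
    ])

    def cosine_top_k(query_vec, matrix, k=1):
        # Normalize rows and query so the dot product equals cosine similarity.
        m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
        q = query_vec / np.linalg.norm(query_vec)
        return np.argsort(m @ q)[::-1][:k]

    query = np.array([0.85, 0.15, 0.05])      # embedding of "how much does it cost?"
    print(cosine_top_k(query, doc_vectors))   # -> [0], the pricing chunk

    Updating the knowledge base means re-embedding only the changed chunks; the model's parameters never move.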

    2.2 Benefits of Using RAG

    With RAG, knowledge updates don’t require frequent model parameter changes, as knowledge is stored externally. This offers several advantages:

    a) Reduced Resource Consumption: Models typically encode knowledge directly into their parameters; more knowledge means more parameters and higher computational demands, requiring more resources for deployment and inference (e.g., high-performance servers and GPUs for loading parameters, and significant power for inference). Offloading knowledge to an external store lets a smaller model serve the same queries.

    b) Flexible Knowledge Management: Easily add, delete, or modify information in the knowledge base without altering or retraining the model (see the sketch after this list). This adaptability keeps the system current with evolving knowledge needs, such as the latest industry reports, regulatory updates, or product information.

    c) Traceability of Information: Large models are often seen as black boxes, making it hard to understand their reasoning. RAG enables tracing the sources of generated answers, verifying the reliability of information, and providing concrete evidence for human decision-making in critical fields like healthcare, finance, and law.
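
    To illustrate point b), here is a toy sketch of an external store whose contents can be edited freely. KnowledgeStore and its chunk IDs are hypothetical names for illustration, not a real library:

    import numpy as np

    class KnowledgeStore:
        # Toy external knowledge base: edits never touch model weights.
        def __init__(self):
            self.chunks = {}   # chunk_id -> (text, vector)

        def upsert(self, chunk_id, text, vector):
            self.chunks[chunk_id] = (text, np.asarray(vector, dtype=float))

        def delete(self, chunk_id):
            self.chunks.pop(chunk_id, None)

    store = KnowledgeStore()
    store.upsert("policy-v1", "Refunds within 30 days.", [0.1, 0.9])
    # A regulation changed: overwrite the chunk; no retraining required.
    store.upsert("policy-v1", "Refunds within 14 days.", [0.2, 0.8])
    store.delete("obsolete-chunk")

    Production systems delegate this role to a vector database, but the contract is the same: knowledge lives outside the model.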

    3. RAG Pipeline Orchestration

    The pipeline divides into two phases: offline knowledge base construction and online retrieval-augmented generation.

    a) Data Collection: Gather raw data from various sources and preprocess it, addressing parsing strategies for different formats (Word, PDF, PPT, cross-page tables).

    b) Data Chunking: Split long documents into semantically coherent text chunks, considering strategies for structured vs. unstructured data and long vs. short segments.

    c) Vector Embedding: Convert text chunks into vectors using an embedding model that maps text into a vector space; this step involves model selection.

    d) Storage and Indexing: Store vectorized text chunks in a vector database and organize them (indexing) for efficient retrieval. This involves choosing index structures (summary index, tree index, knowledge graph, vector index).

    Note: For vector indexing, the same embedding model must be used for both offline text vectorization and online query encoding to ensure they reside in the same vector space.

    e) Retrieval and Reranking: Vectorize user queries, perform similarity searches in the database, and rank the results to identify the top-K most relevant chunks (e.g., top-3) as references.

    f) Answer Generation: The model generates a well-founded response based on the query and references, guided by prompt design.

    Note: Post-processing is needed before returning answers to ensure readability, credibility, and filtering of unnecessary information.

    For clarity, the steps can be summarized as: Data Collection → Chunking → Embedding & Indexing → Retrieval & Reranking → Answer Generation.
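
    The data flow can be sketched end to end. Every function below is a hypothetical stand-in (the embedding is a deliberately crude length-based stub), meant only to show how the phases connect and why the same embed() must serve both of them:

    def collect(paths):
        # Stand-in for format-aware parsing (Word, PDF, PPT, tables...).
        return ["raw text of " + p for p in paths]

    def chunk(docs, size=200):
        return [d[i:i + size] for d in docs for i in range(0, len(d), size)]

    def embed(texts):
        # Stub embedding; real systems call an embedding model here.
        return [[float(len(t))] for t in texts]

    def build_index(chunks):
        return list(zip(chunks, embed(chunks)))          # offline phase

    def retrieve(query, idx, k=3):
        q = embed([query])[0]                            # the SAME embed() as offline
        return sorted(idx, key=lambda cv: abs(cv[1][0] - q[0]))[:k]

    def generate(query, refs):
        # Stand-in for the prompt-guided LLM call in step f).
        return f"Answer to {query!r} grounded in {len(refs)} reference chunk(s)."

    kb = build_index(chunk(collect(["handbook.pdf"])))   # offline, done once
    print(generate("refund window?", retrieve("refund window?", kb)))  # online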

    3.1 Data Collection

    Extract and preprocess documents (parsing varies by format), ensuring:

    a) Completeness: No valuable information (text, images, tables) is omitted.

    b) Accuracy: Avoid errors (e.g., OCR mistakes).

    c) Contextual Integrity: Preserve relationships (e.g., hierarchical structures, image-caption links).

    d) Metadata Integrity: Include titles, tags, sizes, timestamps, etc., to aid system management.
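
    As one illustration of points a) through d), the sketch below extracts text and metadata from a PDF with the pypdf library (an assumed dependency; the article does not prescribe tooling, and scanned PDFs would additionally need OCR):

    from pypdf import PdfReader

    reader = PdfReader("report.pdf")            # hypothetical input file
    pages = [page.extract_text() or "" for page in reader.pages]

    record = {
        "text": "\n".join(pages),
        # Metadata integrity (point d): keep title, size, timestamps with the text.
        "title": reader.metadata.title if reader.metadata else None,
        "source": "report.pdf",
        "num_pages": len(reader.pages),
    }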

    3.2 Data Chunking

    Chunking breaks long texts into manageable segments. Improper chunking causes problems at both extremes:

    • Oversized Chunks: Reduce retrieval precision, increase costs, and risk truncation.
    • Undersized Chunks: Lose context (e.g., separating arguments from evidence), impairing comprehension.

    Chunk size depends on:

    a) Data structure (e.g., academic papers vs. video scripts).

    b) Query complexity and length.

    c) Embedding model compatibility.

    d) Computational costs and latency.

    Token Slicing (Sliding Window): A fixed-length 'window' slides over the text, cutting chunks that overlap (e.g., by 10-20%) to preserve context across boundaries (see the sketch after this list).

    Sentence Slicing: Uses punctuation (e.g., periods, question marks) to split text.

    Recursive Splitting: Prioritizes separators (paragraphs → sentences → spaces) until chunks meet size limits.

    Specialized Chunking: For structured formats (Markdown, LaTeX), leveraging their syntax.

    Semantic Chunking: Uses LLMs for precise, context-aware splits (higher accuracy, higher cost).

    Hybrid Approach: Combines semantic chunking with sliding windows for balance.
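
    A minimal sketch of the sliding-window strategy, using a crude whitespace 'tokenizer' for brevity (real pipelines tokenize with the embedding model's tokenizer):

    def sliding_window_chunks(tokens, size=200, overlap=30):
        # Fixed-size windows; ~15% overlap carries context across chunk borders.
        step = size - overlap
        return [tokens[i:i + size]
                for i in range(0, max(len(tokens) - overlap, 1), step)]

    text = "RAG decouples retrieval from generation. " * 50
    tokens = text.split()
    chunks = sliding_window_chunks(tokens, size=40, overlap=6)
    print(len(chunks), chunks[0][-6:] == chunks[1][:6])   # 8 True: overlap preserved

    The other strategies swap out the splitting rule (punctuation, recursive separators, document syntax, or an LLM) while keeping the same chunks-in, chunks-out shape.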

    3.3 Indexing

    Indexes organize vectorized chunks for fast retrieval, akin to a book's table of contents. Common types:

    a) Summary Index: Sequential node linking.

    b) Tree Index: Hierarchical node organization.

    c) Keyword Index: Maps keywords to nodes (less efficient for multi-keyword queries).

    d) Vector Index: Stores vectors for similarity searches (most popular).

    e) Knowledge Graph Index: Represents relationships between concepts.

    f) Multi-Representation Index: Uses summaries or parent-child chunks for richer context.
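
    For the vector index (type d), here is a minimal sketch with the faiss library, an assumed choice; any vector database offers an equivalent API. Normalizing the vectors makes the inner-product index score by cosine similarity:

    import numpy as np
    import faiss

    dim = 384                                  # must match the embedding model
    vectors = np.random.rand(1000, dim).astype("float32")   # stand-in embeddings
    faiss.normalize_L2(vectors)

    index = faiss.IndexFlatIP(dim)             # exact inner-product (flat) index
    index.add(vectors)

    query = np.random.rand(1, dim).astype("float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 3)       # scores and IDs of the top-3 chunks
    print(ids[0], scores[0])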

    3.4 Retrieval & Reranking

    Traditional search may miss intent or key details. RAG retrieves context for the model but must avoid overwhelming it (e.g., by sending all top-50 results). Reranking (e.g., with a dedicated reranker model) refines the candidates down to the top-5 or top-3.

    Note: While embeddings capture semantic similarity, reranking provides finer relevance scoring.
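
    A sketch of this two-stage pattern using a cross-encoder from the sentence-transformers library (an assumed dependency; the checkpoint named below is a commonly used public reranker, not one mandated by the article):

    from sentence_transformers import CrossEncoder

    query = "What is the refund window?"
    # Hypothetical first-stage results, e.g., the top-50 hits from the vector index.
    candidates = ["Refunds within 14 days.", "Shipping takes 3 days.", "Careers page."]

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c) for c in candidates])

    # Keep only the highest-scoring chunks (here up to top-3) as references.
    top = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:3]]

    The cross-encoder reads the query and chunk together, so it scores relevance more finely than embedding similarity, at the cost of running once per candidate.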

    3.5 Answer Generation

    After retrieval, models assemble answers using prompts. A basic RAG prompt template:

    References (retrieved chunks): xxx
    User Query: xxx
    Answer the query based on the references.
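
    A minimal sketch of filling that template in code; the numbered [n] markers are an assumption added here to support the traceability point from section 2.2:

    def build_prompt(query, references):
        # Number the chunks so the answer can cite its sources.
        refs = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(references))
        return (f"References (retrieved chunks):\n{refs}\n\n"
                f"User Query: {query}\n\n"
                "Answer the query based on the references, citing [n] where used.")

    prompt = build_prompt("What is the refund window?",
                          ["Refunds within 14 days.", "Exchanges within 30 days."])
    # 'prompt' is then sent to the LLM; post-process the reply before returning it.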
    

    For advanced templates, see AI Product Manager's Essential Course: RAG (2).

    4. Conclusion

    RAG technology continues to evolve, unlocking new possibilities. Practical implementations will run into issues; troubleshooting requires working backward from the symptom and considering the problem from multiple angles. For guidance on data sourcing and error handling, refer to AI Product Manager's Essential Course: RAG (1).
