Scaling Compute for RAG
RAG systems fundamentally trade off compute and accuracy. Most people working on RAG systems don’t realize this trade-off exists, and the default trade-offs don’t make sense for most applications. I hope this post convinces you to think about your own trade-offs, and provides some seeds of inspiration for how to use compute to improve accuracy in your RAG systems.
The Usual RAG Suspects
Most guides on how to do RAG look something like this:
At index time
- Information comes in
- Chunk text
- Embed chunks
At query time
- Embed query
- Compare query with chunks
- (Maybe) Use a re-ranker
- Give N chunks to an LLM to answer query
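To make the baseline concrete, here is a minimal sketch of that pipeline. It assumes placeholder `embed` and `llm` callables standing in for whatever embedding model and LLM you use, and the “vector database” is just a list scored by cosine similarity:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_index(documents, embed, chunk_size=500):
    """Index time: naively chunk every document by length and embed each chunk."""
    index = []
    for doc in documents:
        for i in range(0, len(doc), chunk_size):
            chunk = doc[i:i + chunk_size]
            index.append({"text": chunk, "vector": embed(chunk)})
    return index

def answer(query, index, embed, llm, top_k=5):
    """Query time: embed the query, take the top-k chunks, hand them to the LLM."""
    q = embed(query)
    ranked = sorted(index, key=lambda c: cosine(q, c["vector"]), reverse=True)
    context = "\n\n".join(c["text"] for c in ranked[:top_k])
    return llm(f"Answer using only this context:\n\n{context}\n\nQuestion: {query}")
```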
This is fine as a starting point. It is popular because it works reasonably well as a demo to convince stakeholders that they should invest in building a RAG pipeline. There may be some rough edges, but surely it’s just a matter of time and tweaks before it will be production-ready. The harsh reality is that I’ve seen teams spend months running experiments, tweaking embedding models and vector databases, only to wonder why the promised improvements never quite materialize.
This seemingly standard approach, while widely adopted, implicitly optimizes for compute rather than accuracy. We’re doing everything possible to serve answers with as little latency as possible. This makes perfect sense if you are serving a chatbot for millions of users, but there is a vast landscape of applications that don’t need sub-second latency. There are many industries - whether it’s legal due diligence, medical diagnosis, or scientific research - where accuracy is far more valuable than speed, and users are comfortable waiting longer and paying more for reliable results. In these fields, a lawyer reviewing complex contracts, a radiologist analyzing medical imaging, or a researcher studying climate models would gladly trade minutes of waiting time for significantly more accurate and dependable outputs.
Let’s take a moment to understand why the default pipeline doesn’t work for many cases. Embeddings excel at surfacing similar concepts, but struggle when highly specific information is required. For instance, in a document covering multiple topics, embeddings might struggle to match against queries for exact quotes, specific factual claims, or detailed technical information.
But more importantly, getting the right information rarely works with just a single query - users often need to do multiple searches, look at related information, and refine what they’re actually looking for. Think about how you research something complex - you don’t just ask one perfect question to Google (or an LLM) and get everything you need. You explore, build context, and gradually figure out what you actually need to know. The default RAG pipeline completely misses this reality by assuming users can perfectly express what they need in a single shot.
Burning Compute (and Trees) for Accuracy
What if your application doesn’t have a stringent latency requirement, and you can spend more compute to improve accuracy? Consider the most extreme scenario: no indexing at all, with everything processed on the fly at query time. Instead of building an index, we take a large language model and run every document in our corpus through it at query time to determine whether it is relevant. Of course, this comes at a significant compute cost. But can we quantify it?
Let’s run the numbers. If you have 10,000 documents in your corpus, each with an average of 10,000 tokens, that’s 100 million tokens. Running each document through an LLM like Gemini 1.5 Flash 8b ($0.0375 per 1 million tokens, 4,000 requests/min) would cost $3.75 and take ~3 minutes. As mentioned earlier, there are many industries that would be willing to pay that cost for a more accurate answer. I’m not actually advocating burning trees in such a wasteful manner, although it’s quickly becoming less of a terrible idea as models become more efficient. But consider this the extreme of what’s possible.
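As a sketch of what that looks like (not a recommendation), brute force is almost embarrassingly simple. `llm` is again a stand-in for your model call, the relevance and synthesis prompts are purely illustrative, and rate limits, retries, and cost tracking are left out:

```python
from concurrent.futures import ThreadPoolExecutor

def brute_force_answer(query, documents, llm, max_workers=50):
    """No index at all: ask the LLM about every document at query time."""
    def judge(doc):
        verdict = llm(
            f"Does this document contain information relevant to the question "
            f"'{query}'? Reply RELEVANT or IRRELEVANT, then a one-line reason.\n\n{doc}"
        )
        return doc if verdict.strip().upper().startswith("RELEVANT") else None

    # Fan out across the entire corpus in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        relevant = [d for d in pool.map(judge, documents) if d is not None]

    context = "\n\n---\n\n".join(relevant)
    return llm(f"Using only these documents, answer: {query}\n\n{context}")
```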
So what can we do to trade off compute and accuracy more effectively? A RAG pipeline has two distinct stages: index time and query time. During index time, the data is processed to make it ready for future queries. Query time begins when a query is submitted to the system. While query-time performance is especially critical since users are actively waiting for results, index-time processing can’t be excessively slow either if the system is continuously ingesting new data. The key insight is that we can strategically invest computational resources during both stages to enhance accuracy.
Index Time: Massage Your Data
At index time, the goal is to spend compute upfront to transform data into forms that make it more discoverable and queryable later. Rather than simply chunking and embedding text, we want to extract rich semantic information and create multiple representations that capture different aspects of the content.
Maintaining Structure
Preserving the inherent organization of your data is crucial for providing context and enabling more precise retrieval. This allows your system to understand not just what is being said, but where it’s being said and how it relates to other parts of your data.
Hierarchy: Recognize and preserve the natural organization of your data, from high-level structures like folders and files to granular elements like pages and sections.
Formatting Structure: Capture formatting cues that organize data (headings, tables, lists, citations). They are valuable signals for understanding the data.
Chunking: Even with the expanding context windows of modern LLMs, strategically splitting long documents into chunks remains vital for precise retrieval. Instead of naively splitting by length, split based on the hierarchy and semantic context of the data.
Extract metadata: Capture page numbers, publication dates, authors, document types, and any other relevant metadata. This information can be used to efficiently filter the data for certain queries.
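To make the chunking and metadata points concrete, here is a minimal sketch of hierarchy-aware chunking for markdown-ish documents: split on headings rather than raw length, and carry the section path plus any document-level metadata along with each chunk. The field names are illustrative:

```python
import re

def chunk_by_headings(doc_text, doc_meta):
    """Split on markdown headings and keep the heading path as chunk metadata."""
    chunks, path, current = [], [], []

    def flush():
        text = "\n".join(current).strip()
        if text:
            chunks.append({
                "text": text,
                "section_path": " > ".join(path),
                **doc_meta,  # e.g. author, publication date, document type, page
            })

    for line in doc_text.splitlines():
        heading = re.match(r"^(#+)\s+(.*)", line)
        if heading:
            flush()
            current = []
            level, title = len(heading.group(1)), heading.group(2)
            path = path[:level - 1] + [title]
        else:
            current.append(line)
    flush()
    return chunks
```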
Information Extraction
We want to enrich our data. This is mostly about figuring out what schema makes sense for your domain, and what is worth extracting ahead of time to allow for efficient filtering; a minimal sketch of what this can look like follows the list.
Keywords/Topics: Extract key terms and overarching themes present in the text. Ideally, align these with a predefined taxonomy for consistency. This allows for topical searches and filtering of documents based on pre-identified areas of interest.
Entities: Extract the names of entities relevant to your domain. The most common are companies, people, and locations, but this will differ by field (e.g., for software: functions, classes, variables; for healthcare: diseases, medications, procedures; for legal: laws, court cases, regulations)
Entity Linking: Linking entities to a database resolves ambiguity and makes search time cleaner, since you have already decided whether a given mention refers to the entity you care about. (Entity linking could be an essay on its own: Which database(s) of entities do you use? How do you handle different entities with similar or identical names? How do you detect duplicate entities across databases, yet merge partial information into a single super record? How do you handle shipping containers which are named after popular singers? Okay - the last one was oddly specific; if you couldn’t tell by now, I worked on this problem extensively at Handshakes)
Relationship extraction: If your data has inherent relationships (e.g., company ownership, function calls), extract these relationships to build a knowledge graph. This allows for graph-based queries and a deeper understanding of how different pieces of information connect.
Events: Extract specific events that happen to entities in your data together with the metadata. This allows your system to answer questions that are temporal in nature.
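What this looks like in practice is mostly prompt-and-schema design. Here is a hedged sketch for a company-news-style domain: ask an LLM for a fixed JSON structure per chunk and store the result alongside the text. The schema and the `llm` placeholder are assumptions, not a prescription:

```python
import json

def extraction_prompt(text):
    return (
        "Extract the following from the text and reply with JSON only:\n"
        "- keywords: list of key terms and topics\n"
        "- entities: list of {name, type}, where type is company, person, or location\n"
        "- relationships: list of {source, relation, target}\n"
        "- events: list of {description, date, entities}\n\n"
        "Text:\n" + text
    )

def extract_structured(chunk_text, llm):
    """One extraction pass over a chunk; returns a dict, or None on malformed JSON."""
    raw = llm(extraction_prompt(chunk_text))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None  # in practice: retry, or ask the model to repair its own output
```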
Note: These examples come from my experience building systems for understanding companies and people. The most effective information extraction strategies will be those specifically designed to capture the nuances and key data points within your particular domain.
Representations
The goal of creating diverse representations is to create different views through which the data can be queried. Since we can’t perfectly predict how a query might come in, this increases the surface area for a good match. This is particularly crucial for bridging the modality gap between the user’s textual query and potentially non-textual information within your documents.
Some methods to create alternate representations:
Summary: Generate condensed versions of data, capturing the main ideas. This works especially well with code - convert code snippets into natural language explaining its functionality, purpose, and algorithmic intent. This makes code discoverable through conceptual searches rather than just keyword matching. This is what we do at Cartograph to determine efficiently where certain functionality is in a codebase.
Hypothetical questions (Reverse HyDE): Based on the data, generate a set of plausible questions that the content could answer. This anticipates queries and creates a direct link between the data and potential information needs, even if the query phrasing differs. This is particularly useful if documents provide specific answers to questions.
Key Claims/Facts: Isolate and explicitly state the core factual statements or claims made within the document. This creates a searchable repository of atomic facts that can be directly matched against specific information requests. Ideal for knowledge bases or documents containing definitive statements.
Image: For images, generate textual descriptions detailing the visual content and its relevance to the surrounding text. This makes the image discoverable through text search, and the original image can still be passed to a vision LLM when generating the answer.
Tables: For tabular data, generate textual descriptions that explain the table’s purpose, columns, and key insights. This makes the table content searchable through natural language queries.
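Here is a minimal sketch of the first two representations (summaries and reverse-HyDE questions), again with `llm` and `embed` as placeholders. Each derived representation is indexed separately but carries a pointer back to its source chunk:

```python
def build_representations(chunk_text, chunk_id, llm, embed, n_questions=3):
    """Create summary and hypothetical-question views of a single chunk."""
    reps = [{"kind": "original", "text": chunk_text}]
    reps.append({
        "kind": "summary",
        "text": llm(f"Summarize the following in 2-3 sentences:\n\n{chunk_text}"),
    })
    questions = llm(
        f"Write {n_questions} questions that the following text answers, "
        f"one per line:\n\n{chunk_text}"
    )
    reps += [
        {"kind": "hypothetical_question", "text": q.strip()}
        for q in questions.splitlines() if q.strip()
    ]
    # Every representation keeps provenance back to the source chunk.
    return [
        {**rep, "source_chunk_id": chunk_id, "vector": embed(rep["text"])}
        for rep in reps
    ]
```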
When managing multiple representations of the same data, indexing complexity increases significantly. The solution lies in robust provenance tracking - maintaining clear links between derived representations and their original sources. This enables effective deduplication by identifying when different representations refer to the same underlying information. Furthermore, it allows for intelligent ranking, where content referenced by multiple representations can be weighted more heavily to indicate higher relevance. Finally, it ensures transparency by maintaining a clear chain of citations back to the original sources when generating responses.
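One way to use that provenance at query time, sketched under the same assumptions (and assuming unit-normalized embeddings so a dot product works as similarity): retrieve over all representations, then collapse the hits back to their source chunks so that content surfaced through several views ranks higher:

```python
from collections import defaultdict
import numpy as np

def retrieve_with_provenance(query, rep_index, embed, top_k=5, candidates=50):
    """Search derived representations, then deduplicate and score by source chunk."""
    q = embed(query)  # assumed to be unit-normalized
    scored = sorted(rep_index, key=lambda r: float(np.dot(q, r["vector"])), reverse=True)
    chunk_scores = defaultdict(float)
    for rep in scored[:candidates]:
        # A chunk hit through multiple representations accumulates score.
        chunk_scores[rep["source_chunk_id"]] += float(np.dot(q, rep["vector"]))
    # Returns chunk ids; resolve back to the original text (with citations) for the LLM.
    return sorted(chunk_scores, key=chunk_scores.get, reverse=True)[:top_k]
```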
The key here is to strike the right balance by creating enough representations to capture different information aspects without overwhelming the system with noise and computational cost.
Query Time: Surfing the Latent Space
During query time, the goal is to search and synthesize information until the right answer is found (or we run out of time).
Some methods to improve query performance (ordered by increasing cost and complexity):
Consistency: Send multiple requests to the same model at a higher temperature (self-consistency), or send the same request to different models. Compare the results to find a consensus, or use another model to pick one if required. This works best for questions that have a definitive answer.
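A rough sketch of the self-consistency variant, for questions with short, definitive answers. It assumes the `llm` placeholder accepts a temperature argument, and uses the simplest possible consensus rule, majority vote:

```python
from collections import Counter

def self_consistent_answer(prompt, llm, n_samples=5, temperature=0.8):
    """Sample several answers and return the most common one with an agreement score."""
    answers = [llm(prompt, temperature=temperature).strip() for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n_samples  # crude confidence from agreement
```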
Distribution shift: Before even trying to search, use an LLM to shift the distribution of the input query to better match the corpus being searched. This can mean improving the sentence structure (query rewriting), or even generating a fake document that you expect to match documents in the corpus (HyDE).
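A sketch of both variants, with illustrative prompts: rewrite the query directly, or generate a hypothetical document and search with its embedding instead of the query’s (HyDE):

```python
def rewrite_query(query, llm):
    """Query rewriting: rephrase the query to better match the corpus vocabulary."""
    return llm(
        "Rewrite this search query as a clear, specific question, using the "
        f"vocabulary a technical document on the topic would use:\n\n{query}"
    )

def hyde_vector(query, llm, embed):
    """HyDE: embed a hypothetical answer document instead of the raw query."""
    fake_doc = llm(f"Write a short passage that would answer this question:\n\n{query}")
    return embed(fake_doc)
```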
Graph Traversal: Leverage the knowledge graph built at index time: use search to find the initial relevant nodes, then traverse the graph to uncover related information. This works extremely well for domains such as corporate structures (tracing ownership and subsidiary information) and code (at Cartograph, we create a graph representation of a codebase and use it to understand dependencies between code).
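The retrieve-then-traverse pattern can be sketched like this, assuming the seed node ids come from an initial vector or keyword search and the knowledge graph is stored as a simple dict of neighbors (a generic sketch, not how Cartograph implements it):

```python
from collections import deque

def expand_from_seeds(seed_ids, graph, max_hops=2):
    """Breadth-first expansion from retrieved seed nodes over an adjacency map."""
    seen = set(seed_ids)
    frontier = deque((s, 0) for s in seed_ids)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen  # node ids whose attached content gets passed to the LLM
```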
Multi-Pass RAG: Instead of a single retrieval step, we can have the LLM analyze initial results and generate follow-up queries to dive deeper. This might include clarifying questions or new queries based on information ingested in the previous retrieval step. In WalkingRAG, retrieval is used to find the initial information, but if the returned results point to other pages/diagrams/citations, the LLM requests that information too, effectively “walking” to the answer. This iterative process mirrors how humans research complex topics.
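A sketch of that loop in the spirit of WalkingRAG (not its actual implementation): retrieve, let the LLM either answer or ask for what is still missing, and repeat up to a budget. `retrieve` is assumed to return a list of text snippets:

```python
def multi_pass_answer(query, retrieve, llm, max_passes=4):
    """Iteratively retrieve until the LLM says it has enough context to answer."""
    gathered, request = [], query
    for _ in range(max_passes):
        gathered.extend(retrieve(request))
        decision = llm(
            "You are answering: " + query + "\n\n"
            "Context so far:\n" + "\n\n".join(gathered) + "\n\n"
            "If you can answer from the context, reply 'ANSWER: <answer>'. "
            "Otherwise reply 'SEARCH: <follow-up query>' describing what is "
            "still missing (e.g. a cited page, figure, or section)."
        )
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        if decision.startswith("SEARCH:"):
            request = decision[len("SEARCH:"):].strip()
    # Budget exhausted: answer with whatever has been gathered so far.
    return llm("Answer as best you can: " + query + "\n\nContext:\n" + "\n\n".join(gathered))
```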
Test-Time Training: This is admittedly a little forward looking but recent research suggests it might be possible. Imagine a scenario where a user submits a query, and the system then dedicates resources to training itself on the most relevant information related to that query over a period of, say, 24 hours. Upon the user’s return, they would interact with a model significantly more knowledgeable and adept at addressing their specific needs. Looking ahead, as model training becomes more efficient and hardware costs decrease, this type of adaptive specialization could become more common. We might see systems that maintain a general knowledge base but develop temporary expertise ‘branches’ based on user needs, effectively creating personalized expert systems on demand.
Ideally, the system should intelligently allocate compute based on the perceived difficulty and have a mechanism to short-circuit when it’s confident in its answer.
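One naive way to frame that allocation, as a sketch: try the cheap pipeline first, have a model grade its own draft, and only escalate to the expensive strategies above when confidence is low. The self-grading step here is deliberately simplistic:

```python
def answer_with_budget(query, cheap_answer, expensive_answer, llm, threshold=0.8):
    """Escalate from a cheap pipeline to an expensive one only when needed."""
    draft = cheap_answer(query)
    verdict = llm(
        f"Question: {query}\nDraft answer: {draft}\n"
        "On a scale from 0 to 1, how confident are you that the draft is correct "
        "and complete? Reply with only a number."
    )
    try:
        confidence = float(verdict.strip())
    except ValueError:
        confidence = 0.0  # unparseable grade: assume low confidence and escalate
    return draft if confidence >= threshold else expensive_answer(query)
```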
Conclusion: Choose Your Own RAG Adventure
The methods above are not meant to be prescriptive but rather a toolkit of inspiration. The specific implementations would be dictated by the domain and the questions the RAG system must handle. Consider how an expert in the field would research and synthesize answers to guide the system design.
This post is an invitation to be more deliberate and strategic in your RAG design. Ask yourself:
- What is the true cost of inaccuracy for my application? Is it a minor inconvenience, or a critical failure?
- What latency requirements do I actually have? Is sub-second response a necessity, or a nice-to-have?
The key takeaway is this: By consciously investing more compute during index time to enrich your data and during query time to perform more sophisticated searches, you can unlock significantly higher levels of accuracy.
Thanks to Teo Si-Yan and Martin Andrews for reading drafts of this and providing suggestions.