RAG Cold Start: Building a Useful System When You Have No Data

TL;DR A RAG system needs data to answer questions, but data only accumulates as the system gets used. Cold-start strategy is what bridges the gap from empty to useful.

#rag #cold-start #bootstrapping #indexing #data

Table of Contents

Prioritizing Data Sources
Graceful Degradation
Advanced Indexing Strategies
Data Quality > Data Quantity
The Bottom Line
References

🌏 中文版

RAG systems have a chicken-and-egg problem: no data means no answers, but data only accumulates through actual use.

“What routes are at Longdong?” — if Longdong’s route data hasn’t been indexed yet, the system can only say “no relevant data found.” That’s a useless answer, and users will likely give up immediately.

Cold-start strategy addresses this: how do you make a system genuinely useful in the early stages when data is sparse?

Prioritizing Data Sources

Not all data is equally hard to obtain. Sorted by difficulty:

First priority: structured database data

If the system already has a business database, index that data first. A climbing community’s database has routes and crag information — these can be converted into documents and indexed directly:

async function bootstrapFromDatabase(env: Env): Promise<void> {
  // Batch-index route data from the database
  const routes = await db.select().from(routesTable).all();

  for (const batch of chunk(routes, 50)) {
    await Promise.all(
      batch.map(route => indexDocument({
        id: `route-${route.id}`,
        content: formatRouteAsDocument(route),
        metadata: {
          type: 'route',
          crag_id: route.cragId,
          grade_numeric: route.gradeNumeric,
          route_type: route.routeType,
        },
      }, env))
    );
  }
}

function formatRouteAsDocument(route: Route): string {
  return `
    Route name: ${route.name}
    Crag: ${route.cragName}
    Grade: ${route.grade}
    Type: ${route.routeType}
    Description: ${route.description ?? ''}
    Notes: ${route.notes ?? ''}
  `.trim();
}

A database with a few hundred routes can complete the initial index in minutes.

Second priority: scraping public data

The climbing community has public resources (8a.nu, theCrag, Mountain Project) that can be scraped for baseline data. Watch out for licensing — confirm the terms of service permit this kind of use.

Third priority: LLM-generated synthetic data

If you genuinely have nothing, you can use an LLM to generate “seed knowledge”:

async function generateSeedKnowledge(topic: string): Promise<string[]> {
  const response = await llm.generate({
    prompt: `
      Generate 10 pieces of climbing knowledge about "${topic}".
      Write them in the style of a climbing guidebook description, 50–100 words each.
      This content will serve as the initial seed for the knowledge base.
    `,
  });

  return parseBulletPoints(response);
}

// Generate general climbing knowledge
await generateSeedKnowledge("sport climbing fundamentals");
await generateSeedKnowledge("trad protection systems");
await generateSeedKnowledge("bouldering for beginners");

Synthetic data isn’t as good as real data, but it’s better than an empty knowledge base. These entries get replaced by real data over time.

Graceful Degradation

When data is sparse, the system needs to “acknowledge its limits while still being helpful”:

async function handleSparseContext(
  query: string,
  retrievedDocs: Document[],
  ctx: PipelineContext
): Promise<string> {

  if (retrievedDocs.length === 0) {
    // No data at all: answer using LLM general knowledge, with a disclaimer
    return generateWithDisclaimer(query, ctx, {
      disclaimer: "⚠️ No relevant data found. The following answer is based on general climbing knowledge — please verify independently.",
    });
  }

  if (retrievedDocs.length < 3) {
    // Very little data: answer but flag limited coverage
    return generateWithDisclaimer(query, ctx, {
      disclaimer: "ℹ️ Limited data available. The following answer may be incomplete.",
    });
  }

  // Normal response
  return generate(query, retrievedDocs, ctx);
}

Letting users know about the system’s data situation is more honest and more useful than silently failing or hallucinating.

Advanced Indexing Strategies

Incremental indexing — index new data immediately, don’t wait for batch runs:

// When a new route is added to the database, index it right away
async function onRouteCreated(route: Route, ctx: ExecutionContext) {
  ctx.waitUntil(indexDocument(routeToDocument(route), env));
}

ctx.waitUntil() lets indexing continue after the response returns, without blocking the main request.

Index popular data first — when there’s a lot of data, start with what’s most likely to be queried:

// Sort by view count, index popular routes first
const hotRoutes = await db
  .select()
  .from(routes)
  .orderBy(desc(routes.viewCount))
  .limit(500)
  .all();

await batchIndex(hotRoutes);

Query-driven indexing — when someone searches for something that isn’t indexed yet, log it:

// When a search returns no results, record the query
if (searchResults.length === 0) {
  await logMissingQuery(query, env);
  // Periodically analyze the missing-query log to prioritize what data to add next
}

Data Quality > Data Quantity

A common cold-start mistake: rushing to index a large volume of low-quality data.

100 high-quality, well-structured route descriptions are more useful than 1,000 incomplete records with nothing but a name and grade. The quality ceiling of a RAG system is set by data quality, not data volume.

Good data should have:

Complete descriptions (not just names and grades)
Clean metadata (type, location, numeric grade values)
Accurate information (no mislabeled grades or incorrect location data)

The Bottom Line

Cold start is an engineering problem, not a RAG algorithm problem. The solution is pragmatic: index your existing structured data first, design graceful degradation so users get a reasonable experience even when data is sparse, then let usage drive continuous data growth.

Most importantly: don’t wait until your data is complete before going live. Ship → users use it → gaps surface → fill the gaps. That loop beats trying to build a complete database upfront.