Query: 'I just sent Beauty in the Mirror 5.11b — recommend routes of similar difficulty.' The results came back full of routes with similar-sounding names, not similar grades. Root cause: dense embeddings compress multiple attributes into a single vector, and the rarity of the route name drowns out the grade signal. The fix: three layers of defense — metadata pre-filtering, query rewriting, and score fusion.
D1 is Cloudflare's serverless SQLite database that binds directly to Workers, supports full SQL (JOINs, transactions), and handles automatic backups. It's well-suited for small-to-medium relational data needs — NobodyClimb uses it as its primary database.
KV is Cloudflare's globally distributed key-value store. Reads are served from the nearest edge node with extremely low latency. It's ideal for caching, feature flags, and ephemeral data — but writes are eventually consistent.
R2 is Cloudflare's object storage service — S3-compatible API, zero egress fees, and native Workers binding. Stop worrying about bandwidth bills for media-heavy applications.
Cloudflare Workers uses V8 Isolates instead of containers — no cold starts, global edge deployment, and direct access to D1, R2, KV, and AI via Bindings. Great for APIs, SSR, and lightweight backends; not suited for long-running tasks.
Hono is a web framework designed specifically for edge runtimes like Cloudflare Workers, Deno, and Bun. It's an order of magnitude lighter than Express, natively supports Web Standard APIs, and is the go-to choice for edge environments.
@opennextjs/cloudflare enables Next.js 15 App Router deployments on Cloudflare Workers — dynamic SSR runs in a Worker, static assets are served from Cloudflare Assets. Zero server management, but with clear feature limitations.
Upgraded action-maker from hardcoded mock data to live Cloudflare Workers AI generation. The architecture splits into Worker (AI only), Server (data storage), and Frontend (orchestration). Hit two gotchas along the way: Qwen3's thinking block and the Workers AI response format.
RAG doesn't have to be a rigid three-step process. It's a set of steps that can be dynamically enabled, skipped, or reordered. Pipeline as Code lets the system adapt its behavior without redeployment.
LLM generation takes 3-5 seconds, and waiting for the full response before displaying it makes for a terrible experience. SSE pushes tokens as they're generated, reducing time-to-first-character from 5 seconds to under 1 second.
Limiting request count alone is not enough — a single long query can consume ten times the tokens of a normal one. Dual quotas (request count + token count) are what truly control costs.