
Vespa! Cloud Infrastructure for Search Applications!

Multi-node Vespa on GKE! Vector + Keyword ranking! Autoscaling Search, Document Processing, and Content Replication!

Config servers (3) · Content nodes (2) · Query containers (2) · Feed containers (2) · Admin node (1)

Started locally on Docker → migrated to GKE using Vespa’s tutorials

Vespa in Production

Why Vespa

Vespa is one of the most mature and performant vector-aware search engines in production today — originally developed at Yahoo and now powering AI search at scale for global companies. It natively combines text and vector retrieval, supports streaming writes, online re-ranking, and large-scale low-latency inference in a single platform.

Trusted by Yahoo, Perplexity, Vinted, FARFETCH, Onyx (Danswer Cloud), Splore.
Combines ANN (approximate nearest neighbor) retrieval with BM25 keyword ranking, dynamic filtering, and tensor computation for real-time personalization.
Designed for multi-node scaling — horizontal sharding, coverage tracking, and replica rebalancing are built into the serving layer.
Fully open-source and Kubernetes-ready, with deep documentation at docs.vespa.ai.

Vector + Keyword Hybrid · Streaming Writes · Re-ranking Pipelines · On-cluster ML Inference · Horizontal Scaling · Open Source

Tip: Keep ANN profiles separated by embedding dimension (384 vs 1536) to avoid input shape mix-ups.

Cluster Diagram

Cluster Topology (Kubernetes view)

Control plane (Config/Admin) · Data plane (Stateless/Content) · Services (LB/Headless) · PVCs · Ports & health

Diagrams: Vespa cluster roles and the concrete GKE deployment, split into config/admin (control plane), content (stateful), and query/feed (stateless). Internal traffic uses the headless Service vespa-internal; edge traffic enters through vespa-query and vespa-feed on port 8080.

Schema Explorer

Schema & Rank Profiles

schema arxiv2 {
  document arxiv2 {
    field arxiv_id type string {
      indexing: attribute | summary
      attribute: fast-search
    }
    # BM25 is enabled per field with "index: enable-bm25"
    field title type string {
      indexing: index | summary
      index: enable-bm25
    }
    field abstract type string {
      indexing: index | summary
      index: enable-bm25
    }
    field categories type array<string> {
      indexing: attribute | summary
      attribute: fast-search
    }
    field family type string {
      indexing: attribute | summary
      attribute: fast-search
    }

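    # distance-metric belongs inside the attribute block.
    # No HNSW index is declared here, so nearestNeighbor runs as exact search;
    # approximate search requires "indexing: attribute | index".
    field title_Embedding_1 type tensor<bfloat16>(d0[384]) {
      indexing: attribute
      attribute {
        fast-search
        distance-metric: angular
      }
    }
    field Summarized_Abstract_Embedding_1 type tensor<bfloat16>(d0[384]) {
      indexing: attribute
      attribute {
        fast-search
        distance-metric: angular
      }
    }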
  }

  fieldset default { fields: title, abstract }

  rank-profile keyword inherits default {
    first-phase { expression: bm25(title) + bm25(abstract) }
  }

  rank-profile ann_title_1 inherits default {
    inputs { query(query_embedding) tensor<bfloat16>(d0[384]) }
    first-phase { expression: closeness(field, title_Embedding_1) }
  }

  rank-profile ann_summary_1 inherits default {
    inputs { query(query_embedding) tensor<bfloat16>(d0[384]) }
    first-phase { expression: closeness(field, Summarized_Abstract_Embedding_1) }
  }

  rank-profile ann_multi_1 inherits default {
    inputs {
      query(query_embedding) tensor<bfloat16>(d0[384])
      query(w_title) double
      query(w_abs) double
    }
    first-phase {
      expression {
        query(w_title) * closeness(field, title_Embedding_1)
        + query(w_abs) * closeness(field, Summarized_Abstract_Embedding_1)
      }
    }
  }

  # Fine-tune style profile example.
  # weight1 and weight2 must exist as attribute fields; they are not declared in the document above.
  rank-profile ann_fine_tune_1 inherits default {
    inputs { query(query_embedding) tensor<bfloat16>(d0[384]) }
    first-phase {
      expression {
        closeness(field, title_Embedding_1)
        + pow(max(0, closeness(field, Summarized_Abstract_Embedding_1)), attribute(weight1)) * attribute(weight2)
      }
    }
  }
}

Highlights

bm25 on title and abstract for lexical recall.
fast-search attributes on arxiv_id, categories, family.
Multiple bf16 vector fields (384-d) with angular distance.
Multi-vector fusion (ann_multi_1) with query-time weights.
Fine-tune variant (ann_fine_tune_1) with attribute-based boost.
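
To make the query-time weights in ann_multi_1 concrete, here is a minimal Python sketch that builds a JSON body for Vespa's /search/ endpoint. The function name, the default weights, and the targetHits value are illustrative assumptions, not part of the deployment described here; the field, profile, and input names come from the schema above.

```python
EMBEDDING_DIM = 384  # matches tensor<bfloat16>(d0[384]) in the schema


def build_multi_vector_query(embedding, w_title=0.5, w_abs=0.5, target_hits=100, hits=10):
    """Build a JSON body for POST /search/ using the ann_multi_1 profile.

    Hypothetical helper: default weights and target_hits are placeholders.
    """
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(embedding)}")

    # One nearestNeighbor operator per vector field, OR-ed together so a
    # document retrieved through either field reaches first-phase ranking.
    yql = (
        "select * from arxiv2 where "
        f"({{targetHits:{int(target_hits)}}}nearestNeighbor(title_Embedding_1, query_embedding)) or "
        f"({{targetHits:{int(target_hits)}}}nearestNeighbor(Summarized_Abstract_Embedding_1, query_embedding))"
    )
    return {
        "yql": yql,
        "ranking": "ann_multi_1",
        "hits": hits,
        "input.query(query_embedding)": [float(x) for x in embedding],
        "input.query(w_title)": float(w_title),
        "input.query(w_abs)": float(w_abs),
    }
```

The returned dict can be posted as JSON with any HTTP client. Shifting w_title and w_abs changes the blend per request without redeploying the schema.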

Quick Checks

After feeding a doc, GET /document/v1/arxiv2/arxiv2/docid/<id> should return it.
Using ANN profiles? Provide input.query(query_embedding) with shape 384.
Keep 384-dim vs 1536-dim profiles separate to avoid shape mix-ups.
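
The first check is easy to get wrong when an ID contains a slash, as old-style arXiv IDs do. The sketch below builds the /document/v1 path with the user-specified part percent-encoded, along with the matching full document ID. The helper names are hypothetical, and the namespace and document type are assumed to both be arxiv2, as in the path above.

```python
from urllib.parse import quote


def document_v1_path(doc_id, namespace="arxiv2", doctype="arxiv2"):
    """Return the /document/v1 path for a document, percent-encoding each segment."""
    parts = [quote(namespace, safe=""), quote(doctype, safe=""), quote(doc_id, safe="")]
    return "/document/v1/{}/{}/docid/{}".format(*parts)


def full_document_id(doc_id, namespace="arxiv2", doctype="arxiv2"):
    """Return the Vespa document ID string (id:<namespace>:<doctype>::<user-id>)."""
    return f"id:{namespace}:{doctype}::{doc_id}"
```

For example, document_v1_path("hep-th/9901001") keeps the slash from being read as a path separator.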

Scaling & Monitoring

Scaling & Latency (observed)

Scaling: DQW 0.5 — Throughput/latency when doc–query work is balanced (DQW≈0.5).

Scaling: DQW 0.9 — Throughput-driven regime (DQW≈0.9): scale-out lifts QPS significantly.

Latency vs docs per node — P95 latency vs documents per node (two query mixes).

Latency vs throughput — P95 latency vs throughput (24 vCPU flavor) — knee near ~2.2k QPS.

Scaling

Horizontal: increase replicas for vespa-query and vespa-feed to raise QPS/ingest.
Vertical: grow container resources and content memory if paging/latency spikes.
Autobalancing: the content cluster redistributes buckets after a scaling event; coverage briefly below 100% is normal while this runs.
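
One way to watch for the temporary coverage dip is to read the coverage object in each query response. The sketch below assumes Vespa's default JSON result format, where root.coverage carries coverage, full, nodes and, when applicable, a degraded map; the function name and the shape of the returned summary are illustrative.

```python
def coverage_summary(response):
    """Summarise the coverage section of a parsed Vespa query response (a dict)."""
    cov = response.get("root", {}).get("coverage", {})
    degraded = cov.get("degraded", {})
    return {
        "percent": cov.get("coverage"),
        "full": bool(cov.get("full", False)),
        "nodes": cov.get("nodes"),
        # Names of the degradation flags that are set to true, if any
        "degraded_reasons": sorted(k for k, v in degraded.items() if v),
    }


def is_fully_covered(response):
    """True only when the response reports full coverage and no degradation."""
    summary = coverage_summary(response)
    return summary["full"] and not summary["degraded_reasons"]
```

Alerting only when is_fully_covered stays false for longer than a rebalance normally takes avoids paging on the expected short dips.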

Monitoring

Pod health

cfg-0 · configserver · ● up
cfg-1 · configserver · ● up
cfg-2 · configserver · ● up
content-0 · content · ● up
content-1 · content · ● up
query-0 · query · ● up
query-1 · query · ● up
feed-0 · feed · ● up
feed-1 · feed · ● up


K8s resource snapshot

admin: 1 Gi
content (each): 1 Gi
query (each): 1.5 Gi
feed (each): 1.5 Gi

Hook up Prometheus/Grafana to track p95/p99 latency, feed throughput, and disk usage.
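
Until dashboards are in place, the same latency percentiles can be computed client-side from load-test samples. This is a minimal sketch using the nearest-rank definition, with integer percentiles and integer arithmetic to avoid floating-point rounding. The function name is hypothetical, and Prometheus histogram quantiles interpolate, so values may differ slightly.

```python
def nearest_rank_percentile(samples_ms, p):
    """Return the p-th percentile (integer p in 1..100) using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not isinstance(p, int) or not 1 <= p <= 100:
        raise ValueError("p must be an integer between 1 and 100")
    ordered = sorted(samples_ms)
    # Integer ceiling of p * n / 100 gives the 1-based rank
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Calling nearest_rank_percentile(latencies, 95) and nearest_rank_percentile(latencies, 99) on per-request latencies gives the two figures tracked above.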

Research & Bench Notes

Vector DB feature matrix — Feature comparison snapshot (context from a recent survey https://superlinked.com/vector-db-comparison).

Premium Vespa Features

Multi-vector search & fusion: rank profiles can combine several embeddings (e.g., title + abstract) with query-time weights.
Tensor expression language: closeness/dot products, feature mixing, and custom first-phase/second-phase ranking on-cluster.
On-cluster ML inference: run models alongside serving for low-latency re-ranking and personalization.
True hybrid retrieval: ANN + BM25 + filters in a single query plan with coverage/degradation reporting.
Streaming writes & consistency: online ingest while maintaining search availability and bucket rebalancing.
Schema-driven performance: fast-search attributes, typed tensors (incl. bfloat16), and per-field distance metrics.
Rich filtering/YQL: structured predicates with grouping (facets), range filters, and parent-child references in place of general joins, all without leaving the serving tier.
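
As a sketch of the hybrid retrieval point above, the following Python helper builds a YQL string that combines keyword matching (userQuery()), a nearestNeighbor operator on title_Embedding_1, and an optional filter on categories. The helper name, the targetHits default, and the category validation pattern are assumptions for illustration. The profiles listed earlier each score a single signal, so a profile that mixes bm25 and closeness would be needed to rank on both.

```python
import re

# Conservative pattern for category values such as "cs.IR" (assumed format)
_CATEGORY = re.compile(r"^[A-Za-z0-9.\-]+$")


def build_hybrid_yql(category=None, target_hits=100):
    """Return YQL matching keyword OR nearest-neighbor hits, optionally filtered by category."""
    match = (
        "(userQuery() or "
        f"({{targetHits:{int(target_hits)}}}nearestNeighbor(title_Embedding_1, query_embedding)))"
    )
    if category is None:
        return f"select * from arxiv2 where {match}"
    if not _CATEGORY.match(category):
        raise ValueError(f"unexpected category value: {category!r}")
    return f'select * from arxiv2 where {match} and categories contains "{category}"'
```

The user's text goes in the separate query request parameter, which userQuery() picks up, and the embedding is passed as input.query(query_embedding).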

The Evaluation Stack Part I: Auto Labeler

Using AI to generate relevance labels for search evaluation.
