🟢 Open to roles · Full Stack ML Data Scientist

Vespa! Cloud Infrastructure for Search Applications!

Multi-node Vespa on GKE! Vector + Keyword ranking! Autoscaling Search, Document Processing, and Content Replication!

Config servers (3) · Content nodes (2) · Query containers (2) · Feed containers (2) · Admin node (1)
Started locally on Docker → migrated to GKE using Vespa’s tutorials

Vespa in Production

Why Vespa

Vespa is one of the most mature and performant vector-aware search engines in production today — originally developed at Yahoo and now powering AI search at scale for global companies. It natively combines text and vector retrieval, supports streaming writes, online re-ranking, and large-scale low-latency inference in a single platform.

  • Trusted by Yahoo, Perplexity, Vinted, FARFETCH, Onyx (Danswer Cloud), Splore.
  • Combines ANN (approximate nearest neighbor) retrieval with BM25 keyword ranking, dynamic filtering, and tensor computation for real-time personalization.
  • Designed for multi-node scaling — horizontal sharding, coverage tracking, and replica rebalancing are built into the serving layer.
  • Fully open-source and Kubernetes-ready, with deep documentation at docs.vespa.ai.
Vector + Keyword Hybrid · Streaming Writes · Re-ranking Pipelines · On-cluster ML Inference · Horizontal Scaling · Open Source
Tip: Keep ANN profiles separated by embedding dimension (384 vs 1536) to avoid input shape mix-ups.

Cluster Diagram

Cluster Topology (Kubernetes view)
Control plane (Config/Admin) · Data plane (Stateless/Content) · Services (LB/Headless) · PVCs · Ports & health
[Figure: Vespa cluster diagram]
[Figure: Vespa cluster deployment on GKE]
Diagrams: Vespa cluster roles and the concrete GKE deployment: config/admin (control plane), content (stateful), and query/feed (stateless). The headless Service vespa-internal carries internal traffic; the edge is exposed through the vespa-query and vespa-feed Services on port 8080.

Schema Explorer

Schema & Rank Profiles
schema arxiv2 {
  document arxiv2 {
    field arxiv_id type string {
      indexing: attribute | summary
      attribute: fast-search
    }
    field title type string {
      indexing: index | summary
      index: enable-bm25
    }
    field abstract type string {
      indexing: index | summary
      index: enable-bm25
    }
    field categories type array<string> {
      indexing: attribute | summary
      attribute: fast-search
    }
    field family type string {
      indexing: attribute | summary
      attribute: fast-search
    }

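    field title_Embedding_1 type tensor<bfloat16>(d0[384]) {
      # "index" builds an HNSW graph so nearestNeighbor can run approximately;
      # use "indexing: attribute" alone for exact (brute-force) search.
      indexing: attribute | index
      attribute {
        distance-metric: angular
      }
    }
    field Summarized_Abstract_Embedding_1 type tensor<bfloat16>(d0[384]) {
      indexing: attribute | index
      attribute {
        distance-metric: angular
      }
    }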
  }

  fieldset default {
    fields: title, abstract
  }

  rank-profile keyword inherits default {
    first-phase {
      expression: bm25(title) + bm25(abstract)
    }
  }

  rank-profile ann_title_1 inherits default {
    inputs {
      query(query_embedding) tensor<bfloat16>(d0[384])
    }
    first-phase {
      expression: closeness(field, title_Embedding_1)
    }
  }

  rank-profile ann_summary_1 inherits default {
    inputs {
      query(query_embedding) tensor<bfloat16>(d0[384])
    }
    first-phase {
      expression: closeness(field, Summarized_Abstract_Embedding_1)
    }
  }

  rank-profile ann_multi_1 inherits default {
    inputs {
      query(query_embedding) tensor<bfloat16>(d0[384])
      query(w_title) double
      query(w_abs) double
    }
    first-phase {
      expression {
        query(w_title) * closeness(field, title_Embedding_1)
        + query(w_abs) * closeness(field, Summarized_Abstract_Embedding_1)
      }
    }
  }

  # Fine-tune style profile example.
  # attribute(weight1) and attribute(weight2) must be numeric attribute fields
  # on the document; they are not declared in the schema shown here.
  rank-profile ann_fine_tune_1 inherits default {
    inputs {
      query(query_embedding) tensor<bfloat16>(d0[384])
    }
    first-phase {
      expression {
        closeness(field, title_Embedding_1)
        + pow(max(0, closeness(field, Summarized_Abstract_Embedding_1)), attribute(weight1)) * attribute(weight2)
      }
    }
  }
}
Highlights
  • bm25 on title and abstract for lexical recall.
  • fast-search attributes on arxiv_id, categories, family.
  • Multiple bf16 vector fields (384-d) with angular distance.
  • Multi-vector fusion (ann_multi_1) with query-time weights.
  • Fine-tune variant (ann_fine_tune_1) with attribute-based boost.
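A minimal sketch of how a client might build a request body for the ann_multi_1 profile. The function name, the default weights, and the targetHits value are illustrative assumptions; the field names, profile name, and query inputs come from the schema above. The two nearestNeighbor operators are OR-ed so a document reachable through either embedding field can be ranked.

```python
EMBEDDING_DIM = 384  # matches tensor<bfloat16>(d0[384]) in the schema


def build_multi_vector_query(embedding, w_title=0.5, w_abs=0.5,
                             target_hits=100, hits=10):
    """Build a JSON body for POST /search/ using the ann_multi_1 profile.

    The weights, target_hits and hits defaults are illustrative assumptions.
    """
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} values, got {len(embedding)}")
    # One nearestNeighbor operator per embedding field, sharing one query tensor.
    yql = (
        "select * from arxiv2 where "
        f"({{targetHits:{target_hits}}}"
        "nearestNeighbor(title_Embedding_1, query_embedding)) or "
        f"({{targetHits:{target_hits}}}"
        "nearestNeighbor(Summarized_Abstract_Embedding_1, query_embedding))"
    )
    return {
        "yql": yql,
        "ranking.profile": "ann_multi_1",
        "input.query(query_embedding)": [float(x) for x in embedding],
        "input.query(w_title)": w_title,
        "input.query(w_abs)": w_abs,
        "hits": hits,
    }
```

The returned dict can be sent as JSON to the query endpoint; shifting w_title and w_abs at query time changes the fusion without redeploying the schema.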
Quick Checks
  • After feeding a doc, GET /document/v1/arxiv2/arxiv2/docid/<id> should return it.
  • Using ANN profiles? Provide input.query(query_embedding) with shape 384.
  • Keep 384-dim vs 1536-dim profiles separate to avoid shape mix-ups.
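The checks above can be sketched as two small helpers: one builds the Document v1 URL for a fed document, the other rejects an embedding whose length does not match the profile it is sent to. The helper names and the PROFILE_DIMS table are assumptions for illustration; only 384-d profiles exist in the schema shown, so a 1536-d profile would have to be added to the table by hand.

```python
from urllib.parse import quote

# Rank profiles from the schema, keyed to the embedding size they accept.
# Assumption: a 1536-d profile, if one is defined, gets its own entry here.
PROFILE_DIMS = {
    "ann_title_1": 384,
    "ann_summary_1": 384,
    "ann_multi_1": 384,
    "ann_fine_tune_1": 384,
}


def document_url(base_url, doc_id, namespace="arxiv2", doctype="arxiv2"):
    """Return the /document/v1 URL used to read back a fed document."""
    return (f"{base_url.rstrip('/')}/document/v1/{namespace}/{doctype}"
            f"/docid/{quote(doc_id, safe='')}")


def ann_inputs(profile, embedding):
    """Return query parameters for an ANN profile, or raise on a shape mismatch."""
    expected = PROFILE_DIMS[profile]
    if len(embedding) != expected:
        raise ValueError(
            f"{profile} expects {expected} dimensions, got {len(embedding)}")
    return {
        "ranking.profile": profile,
        "input.query(query_embedding)": list(embedding),
    }
```

Failing fast on the client keeps a 1536-d vector from ever reaching a 384-d profile, which is easier to debug than a server-side tensor type error.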

Scaling & Monitoring

Scaling & Latency (observed)
[Figure: Scaling, DQW 0.5] Throughput/latency when doc–query work is balanced (DQW≈0.5).
[Figure: Scaling, DQW 0.9] Throughput-driven regime (DQW≈0.9): scale-out lifts QPS significantly.
[Figure: Latency vs docs per node] P95 latency vs documents per node (two query mixes).
[Figure: Latency vs throughput] P95 latency vs throughput (24 vCPU flavor), with a knee near ~2.2k QPS.
Scaling
  • Horizontal: increase replicas for vespa-query and vespa-feed to raise QPS/ingest.
  • Vertical: grow container resources and content memory if paging/latency spikes.
  • Autobalancing: content cluster rebalances buckets after scale; brief <100% coverage is normal.
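One way to watch for the brief coverage dips mentioned above is to inspect the coverage block of each query response. This is a sketch that assumes the default JSON result layout (root.coverage with coverage, full, and an optional degraded object); the function name is made up, and the field names are worth verifying against a real response from the cluster.

```python
def coverage_status(response):
    """Summarize the coverage block of a Vespa JSON query response.

    Assumes the default result format: root.coverage.coverage (percent),
    root.coverage.full (bool) and an optional root.coverage.degraded object
    whose true-valued keys name the reasons for partial coverage.
    """
    coverage = response.get("root", {}).get("coverage", {})
    degraded = coverage.get("degraded", {})
    reasons = sorted(key for key, value in degraded.items() if value is True)
    return {
        "percent": coverage.get("coverage"),
        "full": bool(coverage.get("full", False)),
        "reasons": reasons,
    }
```

A load test can log this summary next to latency so that a coverage dip during rebalancing is not mistaken for a ranking regression.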
Monitoring
Pod health
  • cfg-0: configserver · up
  • cfg-1: configserver · up
  • cfg-2: configserver · up
  • content-0: content · up
  • content-1: content · up
  • query-0: query · up
  • query-1: query · up
  • feed-0: feed · up
  • feed-1: feed · up
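A pod-health view like the one above could be driven by polling each service's /state/v1/health endpoint, which reports a status code of "up" when the service is healthy. The sketch below is an assumption about how such a poller might look, not the code behind this page; host names and ports are placeholders (8080 for the query/feed containers and 19071 for config servers are the usual defaults).

```python
import json
from urllib.request import urlopen


def is_up(payload):
    """Return True if a /state/v1/health payload reports status code 'up'."""
    return payload.get("status", {}).get("code") == "up"


def check_pod(host, port, timeout=2.0):
    """Poll one pod's health endpoint; any network or parse error counts as down."""
    url = f"http://{host}:{port}/state/v1/health"
    try:
        with urlopen(url, timeout=timeout) as resp:
            return is_up(json.load(resp))
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError; bad JSON is ValueError.
        return False
```

For example, check_pod("query-0.vespa-internal", 8080) would probe the first query container, assuming pod DNS names resolve through the headless Service.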
K8s resource snapshot
  • admin: 1 Gi
  • content (each): 1 Gi
  • query (each): 1.5 Gi
  • feed (each): 1.5 Gi
Hook up Prometheus/Grafana to track p95/p99 latency, feed throughput, and disk usage.
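When checking dashboard numbers against raw benchmark logs, it helps to be explicit about how a percentile is computed. A minimal sketch using the nearest-rank definition follows; this is one common convention, and Prometheus or Grafana may interpolate differently, so small discrepancies are expected.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the chosen sample in the sorted list.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return the p95 and p99 of a list of per-query latencies in milliseconds."""
    return {"p95": percentile(latencies_ms, 95),
            "p99": percentile(latencies_ms, 99)}
```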

Research & Bench Notes

Vector DB feature matrix
Feature comparison snapshot (context from a recent survey https://superlinked.com/vector-db-comparison).
Premium Vespa Features
  • Multi-vector search & fusion: rank profiles can combine several embeddings (e.g., title + abstract) with query-time weights.
  • Tensor expression language: closeness/dot products, feature mixing, and custom first-phase/second-phase ranking on-cluster.
  • On-cluster ML inference: run models alongside serving for low-latency re-ranking and personalization.
  • True hybrid retrieval: ANN + BM25 + filters in a single query plan with coverage/degradation reporting.
  • Streaming writes & consistency: online ingest while maintaining search availability and bucket rebalancing.
  • Schema-driven performance: fast-search attributes, typed tensors (incl. bfloat16), and per-field distance metrics.
  • Rich filtering/YQL: structured predicates, range filters, and grouping (facets) in the serving tier; parent/child references stand in for query-time joins, which YQL does not support.
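As an illustration of the hybrid point above, a single request can combine keyword matching over the default fieldset, a nearestNeighbor operator, and an attribute filter. This is a sketch under assumptions: the function name and defaults are invented, and the rank profile name "hybrid" is hypothetical, since none of the profiles in the schema above mixes bm25 with closeness.

```python
def build_hybrid_query(text, embedding, category=None,
                       target_hits=100, hits=10, profile="hybrid"):
    """Build a JSON body for POST /search/ mixing keyword, ANN and a filter.

    "hybrid" is a hypothetical rank profile name; substitute one that
    exists in the deployed schema.
    """
    where = (
        "(userQuery() or "
        f"({{targetHits:{target_hits}}}"
        "nearestNeighbor(title_Embedding_1, query_embedding)))"
    )
    if category is not None:
        # Escape backslashes and quotes before embedding the value in YQL.
        safe = category.replace("\\", "\\\\").replace('"', '\\"')
        where += f' and categories contains "{safe}"'
    return {
        "yql": f"select * from arxiv2 where {where}",
        "query": text,  # consumed by userQuery()
        "ranking.profile": profile,
        "input.query(query_embedding)": list(embedding),
        "hits": hits,
    }
```

The filter applies to both the keyword and the ANN branch, so a category restriction never has to be re-applied client-side.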