
Monitoring & Security (Reliability, Scaling, Guardrails)

Ensure the Vespa + FastAPI + OpenAI stack runs safely, predictably, and transparently once deployed beyond localhost. This page summarizes how the live system is observed, protected, and governed.

GKE · Grafana / Prometheus · Cloud Logging · Cloud Armor · JWT + OIDC · Secret Manager
Infrastructure Overview
Where everything runs
Kubernetes (GKE)
Autoscaled serving + labeling
  • Vespa content + container nodes
  • FastAPI pods (search, labeling, demos)
  • Cloud SQL sidecars for Postgres
Networking
Ingress + Firewall
  • HTTPS Ingress (Managed Cert)
  • Cloud Armor WAF / rules
  • Internal Services for Vespa/DB
Storage & Logs
Durable + auditable
  • GCS for datasets & artifacts
  • Cloud Logging (Stackdriver)
  • Object versioning + backups
Monitoring / Dashboards
Real-time visibility
  • Prometheus scrape → Grafana
  • Alert Policies → Slack/PagerDuty
  • Cloud Logging → error analytics
Identity / Access
Zero-trust model
  • Service Accounts (least privilege)
  • API tokens per role
  • Admin via OIDC + 2FA
Pod Health & Performance
Latency, throughput, and autoscaling
Alerting
  • Cluster Summary: CPU/Memory per nodepool, pod status.
  • Vespa Query Latency: p50/p95/p99 by rank profile.
  • Feed Throughput: docs/sec + retries per content node.
  • Autoscaler Events: node adds/removes vs load.
  • FastAPI: requests/sec + error rate by endpoint.
Alerts: p95 > 500 ms, restarts > 3/10 min, CPU > 85% sustained.
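The three alert thresholds can be read as a simple rule check over one evaluation window. The sketch below is illustrative only: the function names `percentile` and `evaluate_alerts`, the input shapes, and the nearest-rank percentile method are assumptions rather than the deployed alert policies, and "sustained" is interpreted as every CPU sample in the window exceeding 85%.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def evaluate_alerts(latencies_ms, restarts_last_10_min, cpu_window_pct):
    """Return the names of the rules that fire for one evaluation window."""
    fired = []
    if percentile(latencies_ms, 95) > 500:      # p95 > 500 ms
        fired.append("latency_p95")
    if restarts_last_10_min > 3:                # restarts > 3 per 10 min
        fired.append("pod_restarts")
    # Assumption: "sustained" means no sample in the window dips to 85% or below.
    if cpu_window_pct and min(cpu_window_pct) > 85:
        fired.append("cpu_sustained")
    return fired
```

In a real deployment these rules would more likely live in Prometheus alerting rules or Cloud Monitoring alert policies; the Python form only makes the thresholds explicit.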
Grafana (live)
Provide a public or auth-proxied Grafana URL via NEXT_PUBLIC_GRAFANA_URL to render an embed here.
Security Controls
Perimeter → AuthZ → Data privacy
1) Perimeter
  • Cloud Armor WAF (IP reputation, geo)
  • Rate limit (e.g., 3 req/min per token)
  • hCaptcha/reCAPTCHA on public forms
  • HTTPS-only via Managed Cert
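The per-token rate limit in the list above could be enforced in the application layer with a sliding window. This is a minimal in-memory sketch, assuming the example figure of 3 requests per 60 seconds; the class name `SlidingWindowLimiter` and the injectable `clock` are hypothetical, and a multi-pod deployment would need shared state (for example Redis) or a limit applied at the Cloud Armor layer instead.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per token within any window_seconds span."""

    def __init__(self, max_requests=3, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                      # injectable so tests can fake time
        self._hits = defaultdict(deque)         # token -> timestamps of accepted requests

    def allow(self, token):
        now = self.clock()
        hits = self._hits[token]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A FastAPI dependency could call `allow(token)` and respond with HTTP 429 when it returns `False`.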
2) AuthN/AuthZ
  • JWT bearer tokens for raters/admins
  • Anonymous read endpoints w/ caps
  • Admin via GCP OIDC + 2FA
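To make the bearer-token check concrete, here is a standard-library sketch of verifying a JWT's signature, expiry, and role. It assumes HS256 with a shared secret and a custom `role` claim; neither is confirmed by this page, and an OIDC-backed deployment would more likely validate RS256 tokens against the provider's published keys using a maintained JWT library. `sign_token` exists only so the example is self-contained.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url_decode(part):
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def sign_token(claims, secret):
    """Build an HS256 JWT (helper for the example, not a production issuer)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"

def verify_token(token, secret, required_role, now=None):
    """Return the claims if signature, expiry, and role all check out, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError:                          # malformed base64, JSON, or part count
        return None
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return None
    if header.get("alg") != "HS256":            # never accept "none" or other algorithms
        return None
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    now = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return None
    if claims.get("role") != required_role:     # "role" is an assumed claim name
        return None
    return claims
```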
3) Data Integrity & Privacy
  • PII-stripped logs, hashed rater IDs
  • Nightly GCS backups + versioning
  • Secrets in Secret Manager
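One way to realize "PII-stripped logs, hashed rater IDs" is to scrub each record before it is written. The sketch below is an assumption-laden illustration: the record fields `rater_id` and `message`, the email-only redaction pattern, and the use of a keyed HMAC-SHA256 (rather than a bare hash) are all hypothetical choices, with the key assumed to come from Secret Manager.

```python
import hashlib
import hmac
import re

# Deliberately simple pattern; real PII scrubbing usually covers more than emails.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def hash_rater_id(rater_id, key):
    """Keyed hash so IDs stay joinable across logs but cannot be brute-forced without the key."""
    return hmac.new(key, rater_id.encode("utf-8"), hashlib.sha256).hexdigest()

def scrub_log_record(record, key):
    """Return a copy of the record with the rater ID hashed and emails redacted."""
    clean = dict(record)
    if "rater_id" in clean:
        clean["rater_id"] = hash_rater_id(str(clean["rater_id"]), key)
    if "message" in clean:
        clean["message"] = EMAIL_RE.sub("[redacted-email]", str(clean["message"]))
    return clean
```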
AI Use Guardrails
Transparency, reversibility, bias checks
Transparency & Opt-Out
GPT models assist in auto-labeling & embeddings; no personal data is processed or stored. Users may pause auto-labeling to stay manual.
Ethical Guidelines
  • Explainability (“Show relevance math” tooltips)
  • Reversibility (versioned, revertible adjustments)
  • Non-manipulative UX (clear consent)
  • Bias checks (weekly category imbalance scan)
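The weekly category imbalance scan could be as simple as comparing label counts. In this sketch the function name `imbalance_report` and the 5:1 flagging ratio are assumptions chosen for illustration; the page does not state which metric or threshold the actual scan uses.

```python
from collections import Counter

def imbalance_report(labels, max_ratio=5.0):
    """Summarize label counts and flag when the largest/smallest class ratio exceeds max_ratio."""
    counts = Counter(labels)
    if not counts:
        return {"counts": {}, "ratio": None, "flagged": False}
    ratio = max(counts.values()) / min(counts.values())
    return {"counts": dict(counts), "ratio": ratio, "flagged": ratio > max_ratio}
```

Note that this only sees categories that received at least one label; a category with zero labels would need to be checked against the known category list.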
Scaling Reliability & Failover
How we grow and recover
Component | Scaling | Recovery
Vespa content nodes | HPA (1–5) + replica sync | Warm replica failover
FastAPI containers | Autoscale (GKE/Cloud Run) | 0→N cold start
Cloud SQL (Postgres) | Managed HA | Point-in-time restore
GCS buckets | Multi-region | Immutable history
Grafana/Prometheus | StatefulSet + PVC | Snapshot restore job
Load test (Locust) sustained ~3,000 req/min under default limits; bursts absorbed by queue buffering.
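For the HPA (1–5) entry in the table, the core scaling rule documented for the Kubernetes Horizontal Pod Autoscaler is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured bounds. The sketch below applies that rule with the 1–5 range from the table; the metric values are hypothetical, and the real controller's tolerance band and stabilization window are intentionally left out.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=5):
    """Core HPA rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 2 replicas averaging 90% CPU against a 60% target would scale to 3, while any load implying more than 5 replicas is capped at 5.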
Bot Detection & Abuse Mitigation
Keep signals human and trustworthy
  • Behavior heuristics (keypress timing entropy)
  • Consensus API: ≤1 vote/sec per session
  • Honeytoken queries as controls
  • Violations → token suspension + Cloud Armor quarantine
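The keypress timing heuristic above can be illustrated with Shannon entropy over binned inter-key intervals: scripted input tends to produce near-identical gaps and therefore low entropy. This is a sketch under assumptions; the 20 ms bin width, the 1-bit threshold, and the function names are hypothetical, not the values used by the live system.

```python
import math
from collections import Counter

def timing_entropy(intervals_ms, bin_ms=20):
    """Shannon entropy in bits of inter-keypress intervals after binning."""
    if not intervals_ms:
        return 0.0
    bins = Counter(int(interval // bin_ms) for interval in intervals_ms)
    total = len(intervals_ms)
    return -sum((n / total) * math.log2(n / total) for n in bins.values())

def looks_scripted(intervals_ms, min_entropy_bits=1.0):
    """Flag sessions whose typing rhythm is suspiciously uniform."""
    return timing_entropy(intervals_ms) < min_entropy_bits
```

A single low-entropy reading is weak evidence on its own (short inputs also score low), so a signal like this is better combined with the vote-rate cap and honeytoken controls than used alone.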
Compliance & Logging
Retention, export, verification
  • GDPR/CCPA disclosure + data export endpoint
  • 30-day user-interaction logs; aggregate metrics kept indefinitely
  • Access logs SHA-256 signed and verified hourly
  • Periodic vulnerability scans via Web Security Scanner (formerly Cloud Security Scanner)
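SHA-256 alone produces a digest, not a signature, so "SHA-256 signed" access logs most plausibly means a keyed construction. The sketch below assumes HMAC-SHA256 over each hourly batch, chained to the previous batch's signature so that deleting or reordering batches is detectable; the function names, the batch granularity, and the chaining are assumptions, not a description of the deployed job.

```python
import hashlib
import hmac

def sign_batch(lines, key, previous_signature=""):
    """HMAC-SHA256 over a batch of log lines, chained to the previous batch's signature."""
    mac = hmac.new(key, previous_signature.encode("ascii"), hashlib.sha256)
    for line in lines:
        mac.update(line.encode("utf-8") + b"\n")
    return mac.hexdigest()

def verify_batch(lines, key, signature, previous_signature=""):
    """Recompute the signature and compare in constant time."""
    expected = sign_batch(lines, key, previous_signature)
    return hmac.compare_digest(expected, signature)
```

An hourly verifier would walk the batches in order, feeding each verified signature in as `previous_signature` for the next.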
Grafana Embed
Pod CPU + Query latency
Add NEXT_PUBLIC_GRAFANA_URL to render a live panel here.
Firewall & Request Path
Browser → Ingress → Cloud Armor → FastAPI → Vespa
Bot Detection Flow
Ethical AI Statement
“Our system amplifies human expertise, not replaces it.” We keep users in the loop, show how scores are computed, and allow reversions of automated decisions.