
3-Fold Augmentation for Vector Documents

Transforming public arXiv metadata into vector documents with summaries, keywords, context labels, and embeddings—ready for vector search and content filtering.

Document Parsing · Generative Prompting · Personalized Signals · Embeddings · Distributed Process Automation
  • Avg processed: 186 items/min
  • Shards in flight: 8
  • Latest batch:
Clean inputs. Smarter outputs. At scale.

Before / After JSON Diff


Prompt “Flipbook”

Extract research problem, main findings, key terms, and a readable abstract summary.
Messages (system/user)
[
  {"role": "system", "content": "You are an expert at summarizing academia and peer-reviewed journals."},
  {"role": "system", "content": "Given the paper’s key sections, output the following fields:"},
  {"role": "system", "content": "1. Research_Problem: One sentence summary of the problem"},
  {"role": "system", "content": "2. Main_Findings: The most important result"},
  {"role": "system", "content": "3. Key_Terms: list of technical terms or concepts"},
  {"role": "system", "content": "4. Summarized_Abstract: Clear, readable abstract"},
  {"role": "user", "content": "Title: {title}\nAbstract: {abstract}\nIntroduction: {introduction}\nMethods: {methods}\nResults: {results}\nConclusion: {conclusion}"}
]
Paste into your API call’s messages array. Placeholders like {title} are templated server-side.
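The server-side templating step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: `render_messages` and the sample paper fields are hypothetical names, and the messages array is abbreviated from the full prompt above.

```python
def render_messages(messages, paper):
    """Return a copy of `messages` with {placeholders} filled from `paper`.

    Assumes placeholders use str.format syntax, matching the prompt above.
    """
    return [
        {"role": m["role"], "content": m["content"].format(**paper)}
        for m in messages
    ]

# Abbreviated messages array (see the full prompt above).
messages = [
    {"role": "system", "content": "You are an expert at summarizing academia and peer-reviewed journals."},
    {"role": "user", "content": "Title: {title}\nAbstract: {abstract}"},
]

# Illustrative paper fields; in the pipeline these come from parsed arXiv metadata.
paper = {"title": "Attention Is All You Need", "abstract": "We propose the Transformer..."}

filled = render_messages(messages, paper)
```

The filled list can then be passed as the `messages` argument of a chat-completions call; system messages carry no placeholders and pass through unchanged.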

Quick Embeddings Explainer

sentence-transformers/all-MiniLM-L6-v2
openai/text-embedding-3-small (via /api/embeddings)
cosine(Query A, Query B) — MiniLM
cosine(Query A, Query B) — OpenAI
Values shown are the first 5 dimensions of each normalized vector (MiniLM runs in-browser via Xenova; OpenAI calls go through your proxy); cosine similarity is computed over the full vectors. Use cosine similarity for search.
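Because the vectors are normalized, cosine similarity reduces to a plain dot product. A minimal pure-Python sketch (the 3-dimensional values are illustrative, not real embedding outputs):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

# Toy 3-dim "embeddings" for two queries (real models emit 384+ dims).
q_a = normalize([0.2, 0.1, 0.4])
q_b = normalize([0.2, 0.0, 0.5])

sim = cosine(q_a, q_b)
```

In production the same dot product is what the vector index evaluates at query time over the full-dimensional vectors.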

Embedding Pipeline Diagram

Data Processing Flowchart
Raw arXiv (GCS JSON/CSV) → Parse & Clean → Prompted Enrichment → Embeddings (Sentence-Transformers & OpenAI) → Artifact Build (enriched JSON) → Vespa Feed
Stateless workers · idempotent runs · retries on 5xx · rate-limited model calls · chunk long abstracts when needed.
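The retry-on-5xx behavior can be sketched as exponential backoff with jitter. A minimal example under stated assumptions: `ServerError` and `call_with_retries` are hypothetical names standing in for the worker's HTTP error handling, not the pipeline's actual code.

```python
import random
import time

class ServerError(Exception):
    """Stands in for an HTTP 5xx response from a model endpoint."""

def call_with_retries(fn, max_attempts=5, base_delay=0.5):
    """Call `fn`, retrying on ServerError with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ServerError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)

# Simulate a model call that returns two 503s before succeeding.
attempts = {"n": 0}

def flaky_model_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ServerError("503 Service Unavailable")
    return "ok"

result = call_with_retries(flaky_model_call, base_delay=0.01)
```

Combined with stateless workers, a failed shard can simply be re-submitted; the backoff keeps rate-limited model endpoints from being hammered during outages.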

Parallelization & Sharding

X documents (raw batch) → Y nodes (shards) → DB load / index
  • Split batch: assign each of Y nodes a deterministic slice (shard).
  • Each node processes its slice independently (stateless workers).
  • On completion, results are merged and loaded to the DB / index.
  • Design for retries: re-running the same shard must be idempotent.
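The deterministic slice assignment above can be implemented by hashing each document id. A sketch, assuming string ids; `shard_for` is an illustrative helper, not the pipeline's code. Note the use of `hashlib.md5` rather than Python's built-in `hash()`, which is salted per process and would break the "same shard on re-run" guarantee.

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Deterministically map a document id to a shard in [0, num_shards)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Illustrative batch of arXiv-style ids split across Y = 8 nodes.
batch = ["arxiv:2101.00001", "arxiv:2101.00002", "arxiv:2101.00003"]
shards = {doc: shard_for(doc, 8) for doc in batch}
```

Because the mapping depends only on the id and the shard count, any worker (or a retry of the same worker) computes the same slice, which is what makes re-runs idempotent.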
Coming Up: Vector DB Vespa
See our Vespa DB in GKE and why we chose it!
Vespa! →
🎧 Audio Guide: Page 3 · Processing and Vectors 🎧