
Evaluation (Metrics & Comparisons)

Comparative Visualizations of Precision@k, Recall@k, MAP, NDCG for Various Search Algorithms and Embedders using our verified relevance labels

Analyzing Search Results at Levels of K

In search, the top results carry most of the value, and understandably so. If the document someone actually needs is buried like a needle in a haystack of results, they will criticize the search and may never find the answer they came for. That’s why we grade our systems primarily on performance at small k (e.g., 3, 5, 10), which is tantamount to asking: “How good are the first results you show?”

  • Low-k focus. We emphasize precision and ranking quality for the first few hits—this is what users see and trust.
  • Breadth vs. sharpness. Precision@k tends to drop as k grows; Recall@k rises—use both to balance accuracy and coverage.
  • Order matters. MAP and NDCG reward getting relevant items early; two lists with the same members can earn very different scores depending on their order (a quick illustration follows the figure below).
Diagram: with K=5 we grade only the first five results—because that’s what most users actually inspect.
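To make the ordering point concrete, here is a small illustrative sketch in Python (the toy labels and helper functions are ours, not the production code): the same five documents, scored under two orderings.

    import math

    def average_precision(rels):
        """AP for one query: average the precision measured at each relevant hit."""
        hits, score = 0, 0.0
        for rank, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                score += hits / rank
        return score / hits if hits else 0.0

    def ndcg(rels):
        """NDCG for one query with binary relevance and a log2 rank discount."""
        dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))
        ideal = sorted(rels, reverse=True)
        idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
        return dcg / idcg if idcg else 0.0

    good_order = [1, 1, 0, 0, 1]  # relevant documents near the top
    bad_order = [0, 0, 1, 1, 1]   # same documents, relevant ones pushed down

    print(average_precision(good_order), ndcg(good_order))  # ≈ 0.87, 0.95
    print(average_precision(bad_order), ndcg(bad_order))    # ≈ 0.48, 0.62
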
Precision@k
What fraction of the top-k results are truly relevant?
Precision@k = (# relevant in top k) / k
Higher Precision@k ⇒ fewer junk results up top.
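A minimal sketch of the computation, assuming binary 0/1 relevance labels already ordered by rank (the helper is illustrative, not from our codebase):

    def precision_at_k(rels, k):
        """Fraction of the top-k results that are relevant; rels is rank-ordered 0/1."""
        return sum(rels[:k]) / k

    precision_at_k([1, 0, 1, 1, 0, 1], k=5)  # -> 0.6: three of the first five are relevant
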
Recall@k
What fraction of all relevant items did we surface by rank k?
Recall@k = (# relevant retrieved up to k) / (total # relevant)
Useful to check coverage; often rises with k.
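The matching coverage sketch; the denominator is the total number of relevant documents for the query, not k (again, an illustrative helper):

    def recall_at_k(rels, k, total_relevant):
        """Fraction of all relevant documents that appear in the top k."""
        if total_relevant == 0:
            return 0.0
        return sum(rels[:k]) / total_relevant

    recall_at_k([1, 0, 1, 1, 0, 1], k=5, total_relevant=6)  # -> 0.5
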
MAP (Mean Average Precision)
Averages precision at each relevant hit, then averages over queries.
MAP = mean_q( avg_i Precision@rank(relevant_i) )
Rewards placing relevant items early and consistently.
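A sketch of the two-level average, assuming one binary relevance list per query (names are illustrative):

    def average_precision(rels):
        """Average the precision measured at the rank of each relevant hit."""
        hits, score = 0, 0.0
        for rank, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                score += hits / rank  # precision@rank at this relevant hit
        return score / hits if hits else 0.0

    def mean_average_precision(per_query_rels):
        """Average AP across queries."""
        return sum(average_precision(r) for r in per_query_rels) / len(per_query_rels)

    mean_average_precision([[1, 0, 1, 0, 0], [0, 1, 1, 0, 0]])  # -> (0.833 + 0.583) / 2 ≈ 0.71
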
NDCG (Normalized DCG)
Credit for relevant items with logarithmic discount by rank, normalized by the ideal ranking.
NDCG@k = DCG@k / IDCG@k   where   DCG@k = Σ_{i=1..k} (rel_i / log2(i+1))
Sensitive to ordering; comparable across queries via normalization.
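A corresponding sketch for NDCG@k, using graded relevance and the log2 discount from the formula above (illustrative, not library code):

    import math

    def ndcg_at_k(rels, k):
        """DCG@k over IDCG@k, where rels are graded relevance scores in rank order."""
        def dcg(scores):
            return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(scores, start=1))
        ideal = sorted(rels, reverse=True)
        idcg = dcg(ideal[:k])
        return dcg(rels[:k]) / idcg if idcg else 0.0

    ndcg_at_k([3, 2, 0, 1, 0], k=5)  # ≈ 0.99: near-ideal ordering
    ndcg_at_k([0, 1, 0, 2, 3], k=5)  # ≈ 0.56: the best documents rank last
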
Metrics overview
Figure (assets/eval/Eval_viz.png): evaluation metrics (P@k, Recall@k, MAP, NDCG) for multiple runs.
This static visualization mirrors the offline plot; use the leaderboard below for exact values.
Metrics are loaded from summary.csv.

Leaderboard

Run | k | P@k | Recall@k | MAP | NDCG | n_queries
Tip: sort by metric in your CSV export to find top runs at each k.
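For example, with pandas over the summary.csv mentioned above (the column names are assumed to match the leaderboard header):

    import pandas as pd

    # Assumed columns: run, k, p_at_k, recall_at_k, map, ndcg, n_queries
    df = pd.read_csv("summary.csv")

    # Top three runs by NDCG at each cutoff k
    top_by_ndcg = (
        df.sort_values(["k", "ndcg"], ascending=[True, False])
          .groupby("k")
          .head(3)
    )
    print(top_by_ndcg[["run", "k", "ndcg", "map"]])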

Per-Query Drilldown

ann_summary_2
  • Graph Neural Nets for EHR (2509.111)
  • RAG over Clinical Notes (2509.112)
  • Message Passing Tricks (2509.113)
  • Temporal GNN Benchmarks (2509.114)
  • Contrastive Pretraining (2509.115)
ann_title_1
  • GNN Survey (2509.211)
  • Indexing Tricks (2509.212)
  • GNN for Images (2509.213)
  • EHR Time Series (2509.214)
  • Transformers for EHR (2509.215)
Metrics for this query
p_at_k | recall_at_k | map | ndcg
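If the per-query metrics are also exported, the drilldown can be reproduced offline; this sketch assumes a hypothetical per_query.csv with query_id and run columns next to the four metric columns:

    import pandas as pd

    # Hypothetical export: one row per (query_id, run) with p_at_k, recall_at_k, map, ndcg
    per_query = pd.read_csv("per_query.csv")

    # Compare runs on a single query (substitute the query id shown in the drilldown)
    one_query = per_query[per_query["query_id"] == "<query id>"]
    print(one_query.set_index("run")[["p_at_k", "recall_at_k", "map", "ndcg"]])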

Interpretation

Reading the Aggregates

Across metrics and cutoffs, one run comes out on top most often. In our current snapshot, that’s ann_summary_2, the run that uses the OpenAI embedding on document summaries. Practically, it returns the most relevant items near the top, and its advantage is most visible on rank-sensitive metrics like NDCG and on breadth-sensitive Recall@10.

  • Why summaries help: compact, information-dense text captures the core semantics better than titles alone; the embedding sees more signal per token.
  • Why OpenAI embeddings: sometimes a model performs better because it was trained on language closer to the task. It may also be a matter of capacity; OpenAI’s embedder represents text in 1536 dimensions versus 384 for our sentence-transformer, leaving more room to capture distinct facets of meaning in vector search.
  • Trade-offs: a title-only profile can sometimes be a more direct semantic representation than a large body of text, which is why we test across features as well as models to make sure we end up with the best search.

🎧 Audio Guide: Page 7 · Evaluation (Metrics & Comparisons) 🎧
Demo! User Search and Eval!
See for yourself how labels affect metrics!
What will you search? →