
Collaborative Human Safety Net!

Humans verify the AI’s auto-labels. A lightweight queue serves the next best item to rate, we update a Beta–Bernoulli posterior per pair, and stop asking once confidence crosses a threshold.

AI prior from Auto-Labeler → Algorithmic Priority Queue → Embedded Labelling UI → Beta–Bernoulli posterior → Standardized Labels
All "Hands on" Deck! (Trust but Verify AI output... Efficiently)

The auto-labeler gives a fast prior probability of relevance for each (query × document) pair. But models can drift or reflect systemic bias. We trust but verify with a human review loop that is focused, fair, and efficient.

  • Human-in-the-loop for fairness: People review edge cases and correct bias; notes are optional but encouraged.
  • Prioritization, not exhaustion: a queue surfaces the most decision-shaping items first (likely positives to confirm, borderline pairs, and so on).
  • Mathematical stopping: we update a Beta–Bernoulli posterior per pair; once confidence crosses a threshold (e.g., ≥0.85 relevant or ≤0.20 irrelevant), we stop asking (see the sketch after this list). That cuts review from “everyone checks everything” to minimal ethical human involvement with strong guarantees.
  • Accountability: each vote carries rater weight × rater confidence; all events are audit-logged.
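
A minimal sketch of that update, assuming the auto-labeler’s prior p0 is converted into Beta pseudo-counts of a fixed total strength, and each vote contributes rater weight × rater confidence as fractional evidence. The prior strength of 2.0 and the class and field names are illustrative, not the app’s actual implementation.

```python
from dataclasses import dataclass

@dataclass
class PairPosterior:
    """Running Beta–Bernoulli belief that a (query × document) pair is relevant."""
    alpha: float  # accumulated evidence for "relevant"
    beta: float   # accumulated evidence for "irrelevant"

    @classmethod
    def from_prior(cls, p0: float, strength: float = 2.0) -> "PairPosterior":
        # Treat the AI prior as `strength` pseudo-observations split by p0.
        return cls(alpha=p0 * strength, beta=(1.0 - p0) * strength)

    def add_vote(self, relevant: bool, weight: float, confidence: float) -> None:
        # Each vote carries rater weight × rater confidence as fractional evidence.
        evidence = weight * confidence
        if relevant:
            self.alpha += evidence
        else:
            self.beta += evidence

    @property
    def post_p(self) -> float:
        # Posterior mean probability that the pair is relevant.
        return self.alpha / (self.alpha + self.beta)

    def settled(self, hi: float = 0.85, lo: float = 0.20) -> bool:
        # Stop soliciting votes once confidence crosses either threshold.
        return self.post_p >= hi or self.post_p <= lo
```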

The result is a high-confidence, bias-aware ground truth for evaluation and future rankers.

Voting Interface

Contact us for a Rater Token, or try the Rater Token dev-dave-1 for temporary testing.

The Voting Queue

Ongoing Prioritization Logic

The app decides what to show next so humans spend time where it matters most. Instead of checking every single pair, the queue focuses attention on the examples that will most improve the model’s confidence and fairness.

  1. Confirm strong positives first: We start with examples the model thinks are likely relevant but not yet settled. When humans agree, we reinforce trust in what the AI already does well.
  2. Resolve the borderlines: Next come the pairs the model is most uncertain about, the “gray zone.” These are the most valuable for teaching the system where its confidence dips or bias creeps in.
  3. Fill in unseen cases: Items that haven’t been rated yet are brought in to balance coverage and ensure no topic or category is ignored.
  4. Work inward from the edges: Once clear cases are confirmed, remaining pairs are reviewed in order of how close they are to that uncertain middle, until all are confidently settled.

The system automatically skips anything that’s already settled, or that you’ve already rated or skipped. Of the remaining items, you’ll be shown the one with the highest priority; the sketch below illustrates one way that ordering could work.
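
A minimal sketch of that ordering, assuming each candidate pair carries its current posterior (post_p), its vote count, and whether the current rater has already rated or skipped it. The tier cutoffs (0.6 for “likely positive”, ±0.10 for borderline) are illustrative assumptions, not the app’s actual values.

```python
from typing import Optional, Tuple

def queue_key(post_p: float, n_votes: int,
              settled: bool, seen_by_rater: bool) -> Optional[Tuple[int, float]]:
    """Smaller keys are served first; None means 'never show this pair'."""
    if settled or seen_by_rater:
        return None                       # skip settled / already rated or skipped
    if post_p >= 0.6:
        return (0, 1.0 - post_p)          # 1) confirm likely positives, strongest first
    if abs(post_p - 0.5) < 0.10:
        return (1, abs(post_p - 0.5))     # 2) resolve the borderline "gray zone"
    if n_votes == 0:
        return (2, 0.0)                   # 3) fill in unseen cases for coverage
    return (3, -abs(post_p - 0.5))        # 4) work inward from the edges

# The next item served is simply the candidate with the smallest non-None key.
```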

Settled with Confidence

Collaboratively Finalizing the Labels

Every search-document pair starts with a prediction from the auto-labeler — a probability indicating how likely it is to be relevant. Think of this as the AI’s “first opinion.”

Then, humans get to work solidifying these labels with a greater degree of certainty. Each person’s label counts a little more or less depending on their experience (weight) and how sure they feel (confidence). The system blends these human decisions with the AI’s first guess, much like updating an opinion after hearing from trusted friends.

After every label, the system updates its confidence in whether the query and document are relevant to one another. When that confidence becomes very strong (usually above 0.85 for “yes” or below 0.20 for “no”), the system decides it’s sure enough and stops soliciting input.

  • Start: The AI gives each pair a starting score — its best guess.
  • Human votes: Each vote nudges that score up or down, depending on the voter’s weight and confidence.
  • Combine & update: The app keeps track of the running score after every vote.
  • Settle: Once we’re confident enough (above 0.85 or below 0.20), the pair is done and attention moves to items that still need it (worked through in the example below).
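
To make those four steps concrete, here is the loop run end to end with the PairPosterior sketch from earlier (all numbers hypothetical):

```python
# Hypothetical walk-through: AI prior 0.70, then two agreeing human votes.
pair = PairPosterior.from_prior(p0=0.70)                   # Start: the AI's best guess
pair.add_vote(relevant=True, weight=1.8, confidence=0.90)  # a high-weight rater agrees
print(round(pair.post_p, 3), pair.settled())               # 0.834 False -> still open

pair.add_vote(relevant=True, weight=1.2, confidence=0.88)  # a second rater agrees
print(round(pair.post_p, 3), pair.settled())               # 0.872 True -> settled, stop asking
```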

With AI, no one has to label everything. The model learns where humans agree, humans focus where the model struggles, and together they reach a stable and trustworthy ground truth.

Live Stats

Total pairs: 8,500
Settled pairs: 6,171 (72.6%)
Unsettled pairs: 2,329
Avg posterior (post_p): 0.859
Zero-vote settled: 6,171
Borderline (|post_p − 0.5| < 0.10, unsettled): 1
Note: Many items are currently settled by prior alone (zero votes). Tighten thresholds or require a minimum vote count if you want more human verification.
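
One way to act on that note, sketched with illustrative values: tighten the thresholds and require a minimum number of human votes before a pair may count as settled.

```python
# Hypothetical tightened settle rule: a pair cannot settle on the prior alone.
def is_settled(post_p: float, n_votes: int,
               hi: float = 0.90, lo: float = 0.15, min_votes: int = 2) -> bool:
    return n_votes >= min_votes and (post_p >= hi or post_p <= lo)
```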
Current posterior distribution (post_p)
0.0–0.2: 17 (0.2%) · 0.2–0.4: 0 (0%) · 0.4–0.6: 1 (0%) · 0.6–0.8: 2,320 (27.3%) · 0.8–1.0: 6,162 (72.5%)
Snapshot from consensus_scores. Most pairs sit confidently in the 0.8–1.0 band, reflecting strong priors. Consider raising the positive threshold or requiring a minimum number of votes if you want more human validation before “settled”. Total pairs: 8,500.
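
For reference, the banded counts above can be recomputed from consensus_scores with a few lines of pandas; the inline DataFrame below is a stand-in for the real table, which is assumed to expose a post_p column.

```python
import pandas as pd

# Stand-in for `SELECT post_p FROM consensus_scores`; swap in the real source.
scores = pd.DataFrame({"post_p": [0.12, 0.55, 0.71, 0.83, 0.91, 0.97]})

bands = pd.cut(scores["post_p"], bins=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
               include_lowest=True)
print(bands.value_counts(sort=False, normalize=True))  # share of pairs per band
```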
Rater weights & activity
RaterWeightVotesAgreementAvg confidence
alice1.842086%91%
bob1.230581%88%
chris1.018878%85%
Note: this table could be made editable for admin-side weight tuning.

Privacy & Safety

  • Access requires tokens; limit to your allow-listed domain(s).
  • Optional IP allow-list; all actions audit-logged.
  • Skips prevent repeatedly showing content outside a rater’s expertise.
Next: Search Evaluation → evaluating search algorithms and finding the best search models with our labels.