Collaborative Human Safety Net!
Humans verify the AI’s auto-labels. A lightweight queue serves the next best item to rate, we update a Beta–Bernoulli posterior per pair, and stop asking once confidence crosses a threshold.
The auto-labeler gives a fast prior probability of relevance for each (query × document) pair. But models can drift or reflect systemic bias. We trust but verify with a human review loop that is focused, fair, and efficient.
- Human-in-the-loop for fairness: People review edge cases and correct bias; notes are optional but encouraged.
- Prioritization, not exhaustion: A queue surfaces the most decision-shaping items first (likely positives to confirm, borderline pairs, and so on).
- Mathematical stopping: We update a Beta–Bernoulli posterior per pair; once confidence crosses a threshold (e.g., ≥0.85 relevant or ≤0.20 irrelevant), we stop asking. That cuts review from “everyone checks everything” down to minimal, well-targeted human involvement with strong guarantees.
- Accountability: each vote carries rater weight × rater confidence; all events are audit-logged.
The result is a high-confidence, bias-aware ground truth for evaluation and future rankers.
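The per-pair stopping rule described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the class name, the default uniform Beta(1, 1) prior, and the convention of folding each vote in as weight × confidence pseudo-observations are all assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class PairPosterior:
    """Beta-Bernoulli posterior for one (query, document) pair."""
    alpha: float = 1.0  # pseudo-count of evidence for "relevant"
    beta: float = 1.0   # pseudo-count of evidence for "irrelevant"

    def add_vote(self, relevant: bool, weight: float, confidence: float) -> None:
        # Each vote counts as weight * confidence pseudo-observations,
        # so experienced, sure raters move the posterior more.
        mass = weight * confidence
        if relevant:
            self.alpha += mass
        else:
            self.beta += mass

    @property
    def p_relevant(self) -> float:
        # Posterior mean probability that the pair is relevant.
        return self.alpha / (self.alpha + self.beta)

    def settled(self, hi: float = 0.85, lo: float = 0.20) -> bool:
        # Stop asking once confidence crosses either threshold.
        p = self.p_relevant
        return p >= hi or p <= lo
```

For example, a "relevant" vote from a rater with weight 1.8 and confidence 0.9 adds 1.62 pseudo-observations to `alpha`; a few agreeing votes push `p_relevant` past 0.85 and the pair stops appearing in the queue.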
Voting Interface
The Voting Queue
The app decides what to show next so humans spend time where it matters most. Instead of checking every single pair, the queue focuses attention on the examples that will most improve the model’s confidence and fairness.
- 1. Confirm strong positives first: We start with examples the model thinks are likely relevant but not yet settled. When humans agree, we reinforce trust in what the AI already does well.
- 2. Resolve the borderlines: Next come the pairs the model is most uncertain about — the “gray zone.” These are the most valuable for teaching the system where its confidence dips or bias creeps in.
- 3. Fill in unseen cases: Items that haven’t been rated yet are brought in to balance coverage and ensure no topic or category is ignored.
- 4. Work inward from the edges: Once clear cases are confirmed, remaining pairs are reviewed in order of how close they are to that uncertain middle, until all are confidently settled.
The system automatically skips anything that’s already settled, or that you have already rated or skipped. Of the remaining items, you see the one with the highest priority.
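One way to implement that ordering is a tiered sort. The sketch below is illustrative only: the function name, the tier cutoffs (0.7 for “likely positive”, 0.4–0.6 for the gray zone), and the tuple shape of each candidate are assumptions, not the app’s actual values.

```python
def next_item(pairs, settled_ids, seen_ids):
    """Pick the highest-priority unsettled pair for this rater.

    pairs: iterable of (pair_id, p, num_votes), where p is the current
    posterior probability of relevance for that pair.
    settled_ids / seen_ids: pairs already settled, or already
    rated/skipped by this rater.
    """
    def priority(item):
        _pair_id, p, num_votes = item
        if p >= 0.7:
            tier = 0          # 1. likely positives to confirm
        elif 0.4 <= p <= 0.6:
            tier = 1          # 2. borderline "gray zone"
        elif num_votes == 0:
            tier = 2          # 3. unseen cases, for coverage
        else:
            tier = 3          # 4. work inward from the edges
        # Within a tier, closer to 0.5 means more uncertain,
        # so it comes first.
        return (tier, abs(p - 0.5))

    candidates = [it for it in pairs
                  if it[0] not in settled_ids and it[0] not in seen_ids]
    return min(candidates, key=priority, default=None)
```

Sorting by `(tier, abs(p - 0.5))` gives exactly the behavior described: confirm likely positives, then borderlines, then unrated items, then everything else working inward from the edges.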
Settled with Confidence
Every search-document pair starts with a prediction from the auto-labeler — a probability indicating how likely it is to be relevant. Think of this as the AI’s “first opinion.”
Then humans get to work, refining these labels toward greater certainty. Each person’s label counts a little more or less depending on their experience (weight) and how sure they feel (confidence). The system blends these human decisions with the AI’s first guess, much like updating an opinion after hearing from trusted friends.
After every label, the system updates its confidence in whether the query and document are relevant to one another. When that confidence becomes very strong (usually above 0.85 for “yes” or below 0.20 for “no”), the system decides it’s sure enough and stops asking for input.
- Start: The AI gives each pair a starting score — its best guess.
- Human votes: Each vote nudges that score up or down, depending on the voter’s weight and confidence.
- Combine & update: The app keeps track of the running score after every vote.
- Settle: Once we’re confident enough (above 0.85 or below 0.20), the pair is marked settled and attention moves to the pairs that still need votes.
With AI, no one has to label everything. The model learns where humans agree, humans focus where the model struggles, and together they reach a stable and trustworthy ground truth.
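One simple way to seed this process is to convert the auto-labeler’s starting score into prior pseudo-counts, so that human votes and the AI’s first guess live in the same Beta–Bernoulli currency. This is a sketch of one possible scheme: the function name and the `strength` parameter (how many pseudo-observations the prior is worth) are illustrative assumptions, not documented parameters of the app.

```python
def prior_from_auto_label(p0: float, strength: float = 2.0) -> tuple[float, float]:
    """Turn the auto-labeler's probability p0 into a Beta(alpha0, beta0) prior.

    strength: how many pseudo-observations the AI's first guess is worth.
    Larger values make the prior harder for human votes to move.
    """
    alpha0 = 1.0 + strength * p0          # prior evidence for "relevant"
    beta0 = 1.0 + strength * (1.0 - p0)   # prior evidence for "irrelevant"
    return alpha0, beta0
```

With `strength=2.0`, an auto-label of 0.8 yields Beta(2.6, 1.4), a prior mean of 0.65: the AI’s guess nudges the starting point without settling the pair on its own, so human votes still decide the outcome.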
Live Stats
consensus_scores: Most pairs sit confidently in the 0.8–1.0 band, reflecting strong priors. Consider raising the positive threshold or requiring a minimum number of votes if you want more human validation before “settled”. Total pairs: 8,500.

| Rater | Weight | Votes | Agreement | Avg confidence |
|---|---|---|---|---|
| alice | 1.8 | 420 | 86% | 91% |
| bob | 1.2 | 305 | 81% | 88% |
| chris | 1.0 | 188 | 78% | 85% |
Privacy & Safety
- Access requires tokens; limit to your allow-listed domain(s).
- Optional IP allow-list; all actions audit-logged.
- Skips prevent repeatedly showing content outside a rater’s expertise.