Building Semantic XOR Ensembles: Logging, Bias-Proof Judging, and Iterative Model Ratings for Multi-LLM Systems
Large-language-model (LLM) ensembles promise higher factual accuracy and richer answers than any single model, but only if the pipeline is designed to measure and mitigate bias while capturing the data needed for continuous improvement.
This article describes an end-to-end pattern built around semantic XOR merging, independent judging, and rigorous logging.
1. Conceptual backdrop
Element | Purpose |
---|---|
Candidate models (A, B, …) | Generate alternative answers to the same prompt. |
Semantic XOR | A large-model prompt that asks: “Return only the ideas present in answer A or answer B, but not both, preserving factual context.” The result surfaces disagreements, omissions, and unique insights line by line, forming a “delta” that a synthesis agent can weave back into a single draft (see the sketch after this table). |
Judge model | Ranks the drafts and produces the final synthesis using the XOR delta as a checklist. |
Iteration loop | The merged answer can be re-fed to each generator with “Improve or rebut?” to harvest further refinements. |
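A minimal sketch of the full flow, assuming a hypothetical `call_llm(model, prompt)` wrapper around whatever provider SDKs you use; the model names and prompt wording are illustrative, not a prescribed API:

```python
# Minimal sketch of the XOR-merge flow. `call_llm` is a placeholder for your
# provider wrapper; model names and prompts are illustrative only.

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wrap your provider SDK here")

XOR_PROMPT = (
    "Return only the ideas present in answer A or answer B, but not both, "
    "preserving factual context.\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}"
)

def ensemble_answer(question: str) -> str:
    # 1. Candidate models answer the same prompt independently.
    answer_a = call_llm("model-a", question)
    answer_b = call_llm("model-b", question)

    # 2. Semantic XOR isolates disagreements, omissions, and unique insights.
    delta = call_llm("large-model", XOR_PROMPT.format(a=answer_a, b=answer_b))

    # 3. The judge synthesises the final draft, using the XOR delta as a checklist.
    synthesis_prompt = (
        f"Question: {question}\n\nDraft A:\n{answer_a}\n\nDraft B:\n{answer_b}\n\n"
        f"Checklist of unique or disputed points:\n{delta}\n\n"
        "Write the single best answer, addressing every checklist item."
    )
    final = call_llm("judge-model", synthesis_prompt)

    # 4. Iteration loop: re-feed the merged answer to each generator and collect
    #    rebuttals for a further merge pass (logging omitted here).
    rebuttals = [call_llm(m, f"Improve or rebut?\n\n{final}") for m in ("model-a", "model-b")]
    return final
```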
2. Why the judge must be (mostly) independent
Research shows that an LLM acting as a judge gives noticeably higher scores to prose it might have produced itself (self-preference bias), because such text has lower internal perplexity for that model [1][2].
A cross-vendor judge (e.g., Gemini judging GPT-4o vs Claude-3) cuts that bias to ≤5 percentage points, whereas “siblings” from the same vendor still show 10–25 pp bias. Blind answer order and occasional cross-vendor audits provide most of the benefit at minimal cost.
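A cheap first line of defence is to blind and shuffle answers before they reach the judge. A sketch, reusing the hypothetical `call_llm` wrapper from the pipeline sketch above:

```python
import random

# `call_llm` is the same provider-wrapper placeholder as in the earlier sketch.
def blind_judge(question: str, answers: dict[str, str], judge_model: str) -> str:
    """Shuffle answer order and hide model names so the judge cannot key on either."""
    items = list(answers.items())            # [(model_name, answer_text), ...]
    random.shuffle(items)                    # removes position bias over many runs
    labels = [chr(ord("A") + i) for i in range(len(items))]

    blinded = "\n\n".join(f"Answer {lab}:\n{text}" for lab, (_, text) in zip(labels, items))
    prompt = f"Question: {question}\n\n{blinded}\n\nWhich answer is best? Reply with a single label."
    verdict = call_llm(judge_model, prompt)

    # Map the blinded label back to the real model name for logging.
    label_to_model = {lab: model for lab, (model, _) in zip(labels, items)}
    return label_to_model.get(verdict.strip().upper()[:1], "unparsed")
```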
3. Minimal logging schema
Column | Example | Why it matters |
---|---|---|
run_id | UUID | Tie every table together |
timestamp_utc, user_locale | 2025-06-15 13:02Z, en_CA | Latency and regional patterns
Prompt meta | prompt_hash, domain_tag, task_type | Slice by topic (“legal-QA”, “code-gen”)
Model meta | model_name, provider, version, persona, temperature, top_p | Identifies the true winners
Output stats | tokens_out, time_ms, api_cost_$ | Cost/performance optimisation
Judge scores | overall_pref, plus 1–5 ratings for factuality, relevance, completeness, style, citation | Granular diagnostics
Contribution map | {"A": 0.57, "B": 0.43} | Detects dead-weight models
final_quality | Judge re-score or explicit human rating | Validates that merging helps
User feedback | thumbs-up/down, edits | Ground-truth reality check |
Bias flags | judge_family , answer_order , self_pref_flag | Weekly bias audit |
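One way to materialise this schema is one flat JSON record per run; the helper below and every field value are illustrative, with keys mirroring the columns above:

```python
import json
import time
import uuid

def log_run(path: str, record: dict) -> None:
    """Append one flat JSON record per run; every key maps to a column above."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_run("runs.jsonl", {
    "run_id": str(uuid.uuid4()),
    "timestamp_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "user_locale": "en_CA",
    "prompt_hash": "sha256-of-prompt", "domain_tag": "legal-QA", "task_type": "qa",
    "model_name": "model-a", "provider": "vendor-x", "version": "2025-06",
    "persona": "concise-lawyer", "temperature": 0.2, "top_p": 0.95,
    "tokens_out": 512, "time_ms": 1840, "api_cost_$": 0.004,
    "overall_pref": 1, "factuality": 5, "relevance": 4,
    "completeness": 4, "style": 4, "citation": 3,
    "contribution_map": {"A": 0.57, "B": 0.43},
    "final_quality": 4, "user_feedback": "thumbs_up",
    "judge_family": "vendor-y", "answer_order": "BA", "self_pref_flag": 0,
})
```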
4. From raw logs to a living leaderboard
- Pairwise win rate: each prompt is a match; winner = 1, loser = 0, tie = 0.5.
- Elo / TrueSkill: update ratings per match (Chatbot Arena uses this scheme [3][4]); a minimal Elo update sketch follows at the end of this section.
- Dashboards: plot Elo over time by model and by persona; alert if a model’s 7-day Elo drops or its cost per accepted token rises.
- Self-preference monitor:
-- Among judgements where the judge's own model family also produced an answer
-- (self_pref_flag = 1), measure how often that family's answer wins.
-- Assumes win_model records the winner's model family, matching judge_family values.
SELECT judge_family,
       AVG(CASE WHEN win_model = judge_family THEN 1 ELSE 0 END) AS self_pref_rate
FROM judgements
WHERE self_pref_flag = 1
GROUP BY judge_family;
Fire a warning if the rate drifts upward.
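For the Elo bullet above, here is a minimal update rule, with a conventional K-factor of 32 and a 1000-point starting rating (defaults chosen for illustration, not mandated by the pipeline):

```python
def update_elo(ratings: dict[str, float], player_a: str, player_b: str,
               score_a: float, k: float = 32.0) -> None:
    """Update both players in place. score_a is 1 (A wins), 0 (A loses) or 0.5 (tie)."""
    ra = ratings.setdefault(player_a, 1000.0)
    rb = ratings.setdefault(player_b, 1000.0)
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[player_a] = ra + k * (score_a - expected_a)
    ratings[player_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

# Replay logged matches in timestamp order; each (model x persona) pair is a player.
ratings: dict[str, float] = {}
update_elo(ratings, "model-a/concise-lawyer", "model-b/friendly-teacher", score_a=1.0)
```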
5. Cost-savvy judging patterns
Pattern | Trigger | Added cost | Benefit |
---|---|---|---|
Blind & shuffle | Always | $0 | Removes position + style bias |
Rotating judge pool | Day-to-day | +1 call/run | Limits any single judge’s quirks |
Tiered review | Cheap SLM screens schema & toxicity first; only nuanced tasks reach the big judge (sketch after this table) | −20–40 % | Big model only when needed |
Periodic audit | 5 % nightly sample | +5 % overall | Early detection of drift |
Self-consistency | Two judge seeds disagree | Extra call only on escalation | Saves money on easy calls |
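A sketch combining the tiered-review and self-consistency rows; `cheap_screen` and `cheap_judge` are stand-ins for a small, inexpensive model, and the final escalation reuses the hypothetical `call_llm` wrapper from the earlier sketch:

```python
import random

def cheap_screen(answer: str) -> bool:
    """Stand-in for a small model that checks schema and toxicity."""
    return bool(answer.strip())  # placeholder: accept any non-empty answer

def cheap_judge(question: str, a: str, b: str, seed: int) -> str:
    """Stand-in for a low-cost judge; a real one would call a small LLM per seed."""
    return random.Random(hash((question, seed)) & 0xFFFF).choice(["A", "B"])

def tiered_verdict(question: str, a: str, b: str) -> str:
    # Tier 0: cheap screening rejects malformed or unsafe answers outright.
    if not cheap_screen(a):
        return "B"
    if not cheap_screen(b):
        return "A"
    # Tier 1: self-consistency check with two seeds on the cheap judge.
    v1, v2 = cheap_judge(question, a, b, seed=1), cheap_judge(question, a, b, seed=2)
    if v1 == v2:
        return v1  # easy call: no expensive model needed
    # Tier 2: only disagreements escalate to the big judge (call_llm placeholder).
    prompt = f"Question: {question}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}\n\nWhich is better? Reply A or B."
    return call_llm("big-judge", prompt).strip().upper()[:1]
```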
6. Scaling to >2 models or >1 persona
Treat every (model × persona) pair as a separate “player” in the Elo table.
The semantic XOR still works; it simply produces a multi-way delta. Empirically, marginal gains diminish beyond three diverse players, but specialisation pays off: a “concise-lawyer” persona often beats a “friendly-teacher” persona on legal tasks even on the same base model.
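A small illustration of that bookkeeping, with hypothetical helpers: keying both matches and wins by (domain, player) lets specialisation show up per task type alongside the global Elo.

```python
from collections import defaultdict

def player_key(model: str, persona: str) -> str:
    # Each (model x persona) pair becomes its own "player" on the leaderboard.
    return f"{model}/{persona}"

# Per-domain win counts reveal specialisation (e.g. legal-QA vs code-gen).
wins: dict[tuple[str, str], int] = defaultdict(int)
matches: dict[tuple[str, str], int] = defaultdict(int)

def record_match(domain: str, winner: str, loser: str) -> None:
    for player in (winner, loser):
        matches[(domain, player)] += 1
    wins[(domain, winner)] += 1

record_match("legal-QA",
             player_key("base-model", "concise-lawyer"),
             player_key("base-model", "friendly-teacher"))
win_rate = wins[("legal-QA", "base-model/concise-lawyer")] / matches[("legal-QA", "base-model/concise-lawyer")]
```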
Key takeaways
- Semantic XOR isolates novel insights instead of merging redundantly, giving the synthesis agent a precise to-do list.
- Independent or rotating judges are mandatory; sibling models still favour themselves.
- Log everything—even unused metadata becomes gold once patterns emerge.
- Elo-style ratings turn raw win/loss data into an intuitive leaderboard for budget decisions.
Random fact: Arpad Elo built his original chess-rating tables on a hand-cranked mechanical calculator long before computers, yet the same math now ranks modern language models battling in millisecond-scale inference windows.