
Building Semantic XOR Ensembles: Logging, Bias-Proof Judging, and Iterative Model Ratings for Multi-LLM Systems


Large-language-model (LLM) ensembles promise higher factual accuracy and richer answers than any single model, but only if the pipeline is designed to measure and mitigate bias while capturing the data needed for continuous improvement.
This article describes an end-to-end pattern built around semantic XOR merging, independent judging, and rigorous logging.


1. Conceptual backdrop

Element | Purpose
--------|--------
Candidate models (A, B, …) | Generate alternative answers to the same prompt.
Semantic XOR | A large-model prompt that asks: “Return only the ideas present in answer A or answer B, but not both, preserving factual context.” The result surfaces disagreements, omissions, and unique insights line by line, forming a “delta” that a synthesis agent can weave back into a single draft.
Judge model | Ranks the drafts and produces the final synthesis, using the XOR delta as a checklist.
Iteration loop | The merged answer can be re-fed to each generator with “Improve or rebut?” to harvest further refinements.
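
As a concrete illustration, here is a minimal sketch of the Semantic XOR step from the table above. It assumes a generic complete(prompt) helper that wraps whichever large model performs the merge; the helper name and exact prompt wording are illustrative, not a fixed API.

def semantic_xor(answer_a, answer_b, complete):
    """Ask a large model for the symmetric difference of two answers.

    `complete` is any callable that sends a prompt to an LLM and returns text.
    """
    prompt = (
        "Return only the ideas present in answer A or answer B, but not both, "
        "preserving factual context. List each idea on its own line.\n\n"
        f"Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )
    return complete(prompt)

# The resulting delta becomes the synthesis agent's checklist, e.g.:
# delta = semantic_xor(draft_a, draft_b, complete)
# final = complete(f"Merge both drafts, covering every point below:\n{delta}")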

2. Why the judge must be (mostly) independent

Research shows that an LLM acting as judge gives noticeably higher scores to prose it might have produced itself—self-preference bias—because such text has lower internal perplexity [1][2].
A cross-vendor judge (e.g., Gemini judging GPT-4o vs Claude-3) cuts that bias to ≤5 percentage points, whereas “siblings” from the same vendor still show 10–25 pp of bias. Blinding answer order and running occasional cross-vendor audits provide most of the benefit at minimal cost.
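
A small sketch of the blinding step, assuming candidate answers arrive as a dict keyed by model name; the shuffled order is returned so it can be logged as answer_order for the weekly bias audit (the function and field names are illustrative).

import random

def blind_pair(answers, seed=None):
    """Shuffle candidate answers and hide model names before judging."""
    rng = random.Random(seed)
    order = list(answers)                      # e.g. ["gpt-4o", "claude-3"]
    rng.shuffle(order)
    blinded = {f"Answer {i + 1}": answers[model]
               for i, model in enumerate(order)}
    return blinded, order                      # log `order` as answer_order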


3. Minimal logging schema

Column | Example | Why it matters
-------|---------|---------------
run_id | UUID | Tie every table together
timestamp_utc, user_locale | 2025-06-15 13:02Z, en_CA | Latency and regional patterns
Prompt meta | prompt_hash, domain_tag, task_type | Slice by topic (“legal-QA”, “code-gen”)
Model meta | model_name, provider, version, persona, temperature, top_p | Identifies the true winners
Output stats | tokens_out, time_ms, api_cost_$ | Cost/performance optimisation
Judge scores | overall_pref, plus 1–5 ratings for factuality, relevance, completeness, style, citation | Granular diagnostics
Contribution map | {"A": 0.57, "B": 0.43} | Detects dead-weight models
final_quality | Judge re-score or explicit human rating | Validates that merging helps
User feedback | thumbs-up/down, edits | Ground-truth reality check
Bias flags | judge_family, answer_order, self_pref_flag | Weekly bias audit
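
One way to materialise the schema is a single flat record per run. The sketch below uses a Python dataclass whose field names mirror the table above (the exact types and defaults are assumptions), so each run can be dumped as one JSON row.

from dataclasses import dataclass, field, asdict

@dataclass
class RunLog:
    # identity and context
    run_id: str
    timestamp_utc: str
    user_locale: str
    # prompt meta
    prompt_hash: str
    domain_tag: str
    task_type: str
    # model meta
    model_name: str
    provider: str
    version: str
    persona: str
    temperature: float
    top_p: float
    # output stats
    tokens_out: int
    time_ms: int
    api_cost_usd: float
    # judge scores: overall preference plus 1-5 sub-scores
    judge_scores: dict = field(default_factory=dict)
    # contribution map, e.g. {"A": 0.57, "B": 0.43}
    contribution_map: dict = field(default_factory=dict)
    final_quality: float = 0.0
    user_feedback: str = ""
    # bias flags
    judge_family: str = ""
    answer_order: str = ""
    self_pref_flag: int = 0

# json.dumps(asdict(run_log)) yields one line per run for any warehouse loader.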

4. From raw logs to a living leaderboard

  1. Pairwise win rate: each prompt is a match; winner = 1, loser = 0, tie = 0.5.
  2. Elo / TrueSkill: update ratings per match (Chatbot Arena uses this scheme [3][4]); a minimal update sketch follows this list.
  3. Dashboards: plot Elo-over-time by model and by persona; alert if 7-day Elo drops or cost per accepted token rises.
  4. Self-preference monitor:
-- Share of flagged matches won by a model from the judge's own family
-- (assumes win_model is recorded at the family level, matching judge_family)
SELECT judge_family,
       AVG(CASE WHEN win_model = judge_family THEN 1 ELSE 0 END) AS self_pref_rate
FROM judgements
WHERE self_pref_flag = 1
GROUP BY judge_family;

Fire a warning if the rate drifts upward.
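
The Elo update from step 2 fits in a few lines; the K-factor and 1000-point starting rating below are conventional defaults, not values prescribed by Chatbot Arena.

def elo_update(r_a, r_b, score_a, k=32):
    """Update two ratings after one match; score_a is 1, 0, or 0.5 for player A."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Example: both players start at 1000 and A wins the first match:
# elo_update(1000, 1000, 1.0)  ->  (1016.0, 984.0)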


5. Cost-savvy judging patterns

Pattern | Trigger | Added cost | Benefit
--------|---------|------------|--------
Blind & shuffle | Always | $0 | Removes position + style bias
Rotating judge pool | Day-to-day | +1 call/run | Limits any single judge’s quirks
Tiered review | Cheap SLM passes schema & toxicity; big judge for nuanced tasks | −20–40 % | Big model only when needed
Periodic audit | 5 % nightly sample | +5 % overall | Early detection of drift
Self-consistency | Judge disagrees with itself (two seeds) | Escalate selectively | Saves money on easy calls
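
Combining the tiered-review and self-consistency rows gives a simple router: a cheap screen runs on everything, the small judge votes twice, and only disagreements reach the expensive judge. The screen, judge_cheap, and judge_big callables below are placeholders, not a specific API.

def tiered_judge(draft, screen, judge_cheap, judge_big):
    """Route cheap checks first; escalate to the big judge only when needed."""
    # 1. Cheap gate: schema and toxicity screening by a small model.
    if not screen(draft):
        return {"verdict": "reject", "judge": "screen"}
    # 2. Self-consistency: ask the cheap judge twice with different seeds.
    first = judge_cheap(draft, seed=1)
    second = judge_cheap(draft, seed=2)
    if first == second:
        return {"verdict": first, "judge": "cheap"}   # agreement: no escalation
    # 3. Disagreement: escalate to the expensive judge.
    return {"verdict": judge_big(draft), "judge": "big"}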

6. Scaling to >2 models or >1 persona

Treat every (model × persona) pair as a separate “player” in the Elo table.
The semantic XOR still works; it simply produces a multi-way delta. Empirically, marginal gains diminish beyond three diverse players, but specialisation pays off: a “concise-lawyer” persona often beats a “friendly-teacher” persona on legal tasks, even on the same base model.
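
As a small sketch of the bookkeeping, each (model, persona) pair can simply become its own key in the ratings table, reusing the elo_update helper sketched in section 4; the key format is just a convention.

ratings = {}   # player key -> Elo rating

def player(model, persona):
    """Each (model, persona) pair is a separate leaderboard entry."""
    return f"{model}::{persona}"

def record_match(winner, loser, tie=False):
    a = ratings.setdefault(winner, 1000.0)
    b = ratings.setdefault(loser, 1000.0)
    ratings[winner], ratings[loser] = elo_update(a, b, 0.5 if tie else 1.0)

# record_match(player("gpt-4o", "concise-lawyer"),
#              player("claude-3", "friendly-teacher"))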


Key takeaways

  • Semantic XOR isolates novel insights instead of merging redundantly, giving the synthesis agent a precise to-do list.
  • Independent or rotating judges are mandatory; sibling models still favour themselves.
  • Log everything—even unused metadata becomes gold once patterns emerge.
  • Elo-style ratings turn raw win/loss data into an intuitive leaderboard for budget decisions.

Random fact: Arpad Elo built his original chess-rating tables on a hand-cranked mechanical calculator long before computers—yet the same math now ranks modern language models battling in 3-millisecond inference windows.


References

  1. https://openreview.net/forum?id=Ns8zGZ0lmM
  2. https://arxiv.org/abs/2410.21819
  3. https://lmsys.org/blog/2023-05-03-arena/
  4. https://arxiv.org/abs/2403.04132
  5. https://www.wired.com/story/ai-bias-spreading-stereotypes-across-languages-and-cultures-margaret-mitchell
