Consensus Verification: Six Models, One Verdict
Every AI verification service uses one model to check claims. We use six.
Single-model verification has a structural problem: when the model is wrong, it's often wrong in the same direction on both generation and verification. A student grading their own exam. The failure modes are correlated — same training data, same blind spots, same confident mistakes.
Consensus verification breaks the correlation. Six models, six sets of training data, six independent evaluations. Plus one source none of them have: real-time ground truth from Yahoo Finance, FRED, EIA, and SEC EDGAR.
Six AI models guess. VEROQ knows.
How It Works
One API call. Six models run in parallel. You get back per-model verdicts, a consensus score, and any dissenting opinions with corrections.
```bash
curl -X POST https://api.veroq.ai/api/v1/verify/consensus \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"claim": "NVIDIA Q4 2024 revenue was $22.1 billion"}'
```
Response
```json
{
  "consensus_verdict": "majority_supported",
  "consensus_score": 0.833,
  "confidence": 0.89,
  "models_queried": 6,
  "verdicts": [
    { "label": "Claude", "verdict": "supported", "confidence": 0.94 },
    { "label": "GPT", "verdict": "supported", "confidence": 0.91 },
    { "label": "Gemini", "verdict": "supported", "confidence": 0.88 },
    { "label": "DeepSeek", "verdict": "supported", "confidence": 0.85 },
    { "label": "Llama", "verdict": "supported", "confidence": 0.79 },
    { "label": "Grok", "verdict": "contradicted", "confidence": 0.82,
      "correction": "Q4 FY2025 revenue was $22.1B, not Q4 2024" }
  ],
  "dissent": [{ "label": "Grok", "verdict": "contradicted", ... }],
  "receipt_id": "vc_abc123"
}
```
The Seven Sources
| Provider | Model | Why It's There |
|---|---|---|
| Anthropic | Claude | Strong reasoning, conservative on uncertain claims |
| OpenAI | GPT | Broad knowledge, high recall on factual data |
| Google | Gemini | Strong on structured data, dates, numbers |
| DeepSeek | DeepSeek | Different training corpus, catches Western-centric blind spots |
| Meta (Groq) | Llama | Open-weight model, fast inference, different failure modes |
| xAI | Grok | Real-time X/Twitter data, catches stale claims |
| VEROQ | Live Data | Yahoo, FRED, EIA, SEC EDGAR — real-time ground truth |
All seven sources run in parallel, so total latency is that of the slowest model, not the sum. Cost is ~$0.002 per claim, and results are cached for 30 minutes, so repeated checks are free.
Why Dissent Matters
The consensus score tells you how many models agree. But the dissent array is where the real value is. When 5 models say “supported” and one says “contradicted,” that correction is almost always worth reading.
Claim: “NVIDIA beat Q4 estimates by 12%”
5 models: SUPPORTED
Grok (DISSENT): “Beat estimates by 8.6%, not 12%. The 12% figure is the YoY revenue growth rate, not the earnings beat.”
This is a subtle error that a single-model verifier would miss — the claim is close enough to pass a similarity check, but the specific number is wrong. Cross-model consensus catches it because the models are wrong in different ways.
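As a rough sketch of how a score like 0.833 arises (5 of 6 models agreeing), here is one way to aggregate the `verdicts` array from the response above. The actual server-side aggregation may weight by confidence or differ in other ways; this is purely illustrative.

```python
from collections import Counter

def summarize(verdicts):
    """Compute a simple majority-based consensus score and collect dissent.

    `verdicts` mirrors the response shape shown earlier; this is an
    illustration, not VEROQ's actual aggregation logic.
    """
    counts = Counter(v["verdict"] for v in verdicts)
    top_verdict, top_count = counts.most_common(1)[0]
    score = round(top_count / len(verdicts), 3)
    dissent = [v for v in verdicts if v["verdict"] != top_verdict]
    return top_verdict, score, dissent

verdicts = [
    {"label": "Claude", "verdict": "supported", "confidence": 0.94},
    {"label": "GPT", "verdict": "supported", "confidence": 0.91},
    {"label": "Gemini", "verdict": "supported", "confidence": 0.88},
    {"label": "DeepSeek", "verdict": "supported", "confidence": 0.85},
    {"label": "Llama", "verdict": "supported", "confidence": 0.79},
    {"label": "Grok", "verdict": "contradicted", "confidence": 0.82},
]
top, score, dissent = summarize(verdicts)
# top == "supported", score == 0.833, dissent holds the Grok verdict
```

Note that a naive majority count treats every model equally; the dissenting verdict survives in the output either way, which is the behavior the `dissent` array exists to preserve.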
The Ground Truth Advantage
There's a category of error that no amount of multi-model consensus catches: stale data. Every LLM has a knowledge cutoff. Ask six models “What is NVIDIA trading at?” and you get six versions of “I don't have real-time data.”
Claim: “NVIDIA is trading at $280”
Claude, GPT, Gemini, DeepSeek, Llama, Grok: UNVERIFIABLE
VEROQ Live Data: CONTRADICTED — actual price $258.30, checked 2 minutes ago
VEROQ is the 7th source in consensus verification. It checks claims against live data from Yahoo Finance (prices, fundamentals), FRED (economic indicators), EIA (energy data), and SEC EDGAR (filings, insider trades). When the models can't verify, VEROQ often can — because it has the actual number, not a training-data approximation.
This is the difference between “consensus among guesses” and “consensus plus ground truth.”
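Given the response shape shown earlier, pulling out the live-data verdict when the models all return "unverifiable" is a one-pass scan. The `"Live Data"` label is an assumption based on the sources table above, not a documented field value.

```python
def live_data_verdict(verdicts):
    """Return the VEROQ live-data source's verdict, if present.

    Assumes the live-data source appears in the verdicts array with
    label "Live Data" (per the sources table) -- an assumption, not
    a documented contract.
    """
    for v in verdicts:
        if v["label"] == "Live Data":
            return v
    return None

verdicts = [
    {"label": "Claude", "verdict": "unverifiable", "confidence": 0.30},
    {"label": "GPT", "verdict": "unverifiable", "confidence": 0.28},
    {"label": "Live Data", "verdict": "contradicted", "confidence": 0.99,
     "correction": "actual price $258.30"},
]
hit = live_data_verdict(verdicts)
# hit["verdict"] == "contradicted" -- ground truth overrides the guesses
```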
Using It
Python SDK
```python
from veroq import Veroq

client = Veroq()

# Consensus verification on a single claim
result = client.verify.consensus(
    claim="Tesla delivered 1.8M vehicles in 2024"
)
print(result.consensus_verdict)  # "majority_supported"
print(result.consensus_score)    # 0.833
print(result.dissent)            # models that disagree

# Shield any LLM output with consensus
from veroq import shield

result = shield(
    llm_output,
    consensus=True  # use 6-model consensus
)
```
As an Agent Guardrail
```python
from agents import Agent, Runner
from veroq_agentmesh import veroq_consensus_guardrail

agent = Agent(
    name="Analyst",
    output_guardrails=[veroq_consensus_guardrail(
        min_models=4,
        block_on_dissent=True
    )],
)

# Blocks if any model contradicts a claim
result = await Runner.run(agent, "Summarize AAPL earnings")
```
When to Use Consensus vs Standard Verification
| Scenario | Recommended |
|---|---|
| Real-time agent responses | Standard /verify (faster, cheaper) |
| Financial claims in reports | Consensus (highest accuracy) |
| CI/CD pipeline checks | Consensus (catch subtle errors before deploy) |
| Customer-facing content | Consensus (dissent catches nuance) |
| Internal research notes | Standard /verify (good enough) |
| Regulatory/compliance | Consensus (auditable multi-model trail) |
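The table above can be folded into a small routing helper. The scenario keys here are illustrative, and the paths are inferred from this post (`/api/v1/verify/consensus` from the curl example; a standard `/api/v1/verify` is assumed from the "Standard /verify" shorthand).

```python
# Scenarios that warrant the higher-accuracy consensus endpoint,
# per the table above (keys are illustrative, not an API contract).
CONSENSUS_SCENARIOS = {
    "financial_report", "ci_cd", "customer_facing", "compliance",
}

def pick_endpoint(scenario: str) -> str:
    """Route a scenario key to a verification endpoint path."""
    if scenario in CONSENSUS_SCENARIOS:
        return "/api/v1/verify/consensus"
    return "/api/v1/verify"  # standard: faster and cheaper

print(pick_endpoint("compliance"))     # /api/v1/verify/consensus
print(pick_endpoint("research_note"))  # /api/v1/verify
```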
Consensus verification is live now. Free tier includes 1,000 credits/month.
- Try it in the playground — paste a claim, see 6 models respond
- API documentation — POST /verify/consensus
- Shield SDK — open source on GitHub