ModylBench · open meeting-agent eval
The benchmark for the meeting-native model frontier labs haven’t built yet.
Every existing benchmark grades a model doing one modality at a time. SWE-bench grades code. tau-bench grades tool-use. GAIA grades general assistance. GDPval and OccuBench grade one-shot work products. Omni-modal benchmarks grade audio-to-audio or video-to-text. ModylBench grades the joint distribution — speech, visible state, structured artifact edits, and intent signals, emitted synchronously from one forward pass. That is the regime real meetings live in. No current frontier model passes it natively; today's systems stitch it together from modular endpoints. The benchmark names the architectural gap.
Model-architecture benchmark. Not a system benchmark.
v1 scenarios grade orchestrated stacks because that is what exists today. The scoring surface is intentionally joint — cross-modal consistency, synchronous artifact construction, embedded intent, graceful stream degradation — so the ceiling of orchestrated systems becomes visible, and the architectural leap to joint-distribution models becomes necessary.
v3 grades models that emit audio + video + CRDT ops + intent + provenance from one forward pass. v1 is the specification; v3 is the grading surface for the models the specification forces. What Modyl does today in orchestrated software is what Modyl becomes in native model architecture — and the benchmark is how we measure the gap.
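To make that target concrete, a joint-distribution model would emit every channel as one time-aligned frame per step rather than as separate endpoint calls. A minimal sketch of such a frame follows; the field names and types are illustrative assumptions, not the ModylBench schema.

```python
# Illustrative only: field names and types are assumptions, not the benchmark schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class JointOutputFrame:
    t_ms: int                 # shared clock; every channel aligns to it
    audio: bytes              # speech for this slice of the session
    video: Optional[bytes]    # visible state (avatar / shared screen), if rendered
    crdt_ops: list[dict] = field(default_factory=list)   # structured artifact edits
    intent: str = ""                  # what the model believes the user wants right now
    intent_confidence: float = 0.0    # calibrated scalar, later checked against crdt_ops
    provenance: list[str] = field(default_factory=list)  # inputs each op is attributed to
```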
Where it sits
Six one-shot benchmarks. One joint-distribution benchmark.
Speech + visible state + artifact edits + intent + provenance — emitted synchronously in one scenario
Multi-turn live-session: Yes
What it measures
Eleven dimensions. Journey × destination.
Most benchmarks score the final artifact. ModylBench scores the way the agent got there — turn-taking, presence, interruption handling — and the artifact together. Weights are public: journey 0.4, destination 0.6.
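A minimal sketch of that weighting, assuming each group is the simple mean of its per-dimension scores; only the 0.4 / 0.6 split comes from the text above, and the dimension names are placeholders.

```python
# Only the 0.4 / 0.6 weights come from the benchmark text; dimension names and the
# per-group mean are illustrative assumptions.
JOURNEY_WEIGHT = 0.4
DESTINATION_WEIGHT = 0.6

def scenario_score(journey: dict[str, float], destination: dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into one scenario score."""
    j = sum(journey.values()) / len(journey)
    d = sum(destination.values()) / len(destination)
    return JOURNEY_WEIGHT * j + DESTINATION_WEIGHT * d

# An agent with a strong artifact but poor live-session behavior still loses substantial credit.
print(scenario_score(
    journey={"turn_taking": 0.2, "presence": 0.4, "interruption_handling": 0.3},
    destination={"artifact_quality": 0.9, "cross_modal_consistency": 0.8},
))  # -> 0.4 * 0.3 + 0.6 * 0.85 = 0.63
```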
Intent is not just what the user says. It’s what the user means across every stream, simultaneously.
Existing intent-detection benchmarks treat intent as a classification label over one stream — usually text transcripts or GUI screenshots. ModylBench treats intent as a synchronized channel on both the input side (inferred jointly from voice prosody, gaze, gesture, screen state, canvas activity, and prior CRDT edits) and the output side (emitted by the agent as a calibrated scalar, then checked for coherence against actual operations).
Gap: Step-level labels, not synchronous joint-stream inference.
ClauseCompose (arXiv:2603.28929)
Compositional multi-intent detection. 95.7 EM on unseen pairs, 91.1 on unseen triples via clause-factorized decoding
Gap: Text-only. Compositional generalization demonstrated, but not in a live multi-stream setting.
ModylBench (input side)
Omni-intent: infer jointly from audio + video + screen + canvas + prior CRDT state + silence pattern. Composition across verticals tested at ≥3-intent depth.
ModylBench (output side)
Emit intent + confidence scalars as a first-class synchronized output channel, graded against actual emitted operations. A model that says 'summarize only' but emits a delete op fails even if the final artifact looks plausible.
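A minimal sketch of that output-side coherence check, with assumed op types and intent labels; the rule itself (stated intent must agree with the operations actually emitted) is the one described above.

```python
# Op types and intent labels are illustrative assumptions; only the coherence rule
# comes from the benchmark description.
MUTATING_OPS = {"insert", "delete", "overwrite", "move"}

def intent_coherent(stated_intent: str, emitted_ops: list[dict]) -> bool:
    """Fail any turn whose declared intent excludes mutation but whose ops mutate the artifact."""
    if stated_intent == "summarize_only":
        return not any(op.get("type") in MUTATING_OPS for op in emitted_ops)
    return True  # a full validator would have a rule per intent label

# The failure case from the text: 'summarize only' declared, but a delete op emitted.
assert not intent_coherent("summarize_only", [{"type": "delete", "target": "section-2"}])
```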
Where the frontier breaks
Recent research is loud on one point: modality in isolation is not the regime that matters.
ModylBench is built in full knowledge of the counter-benchmarks below. Each one surfaces a failure mode that disappears — or becomes sharper — when a model is graded on the joint distribution instead of on a single stream.
Berkeley MAST (arXiv:2503.13657)
1,642 annotated traces across 7 MAS frameworks. 41–86.7% failure rates. 79% coordination-caused, not capability.
Modyl: ModylBench injects MAST-identified failure modes (spec drift, role ambiguity, premature termination, verification failure) and scores room-mediated recovery.
DeliberationBench (arXiv:2601.08835)
Free-form multi-LLM deliberation underperforms best-single-agent by 6× at 1.5–2.5× higher cost.
Modyl: ModylBench measures rooms, not deliberation. Turn-taking is structural. Human is the decision node, not a tie-breaker.
AgentRewardBench (arXiv:2504.08942)
1,302 expert-labeled agent trajectories; no single LLM judge achieves consistent high precision across success / side-effects / repetition cycles.
Modyl: ModylBench mitigates judge drift with pessimistic-consensus multi-judge scoring plus deterministic validators for structured channels (CRDT ops, patches, spreadsheet cells); a minimal sketch follows this table. Judge audit is part of the benchmark, not assumed infrastructure.
Modyl: Modyl's orchestrator-worker + room model is the MAST mitigation pattern, not another instance of the failure.
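A minimal sketch of that scoring split, assuming structured channels pass or fail a deterministic validator and free-form channels take the worst score across judges; function and field names are illustrative, not the benchmark's API.

```python
# Illustrative only: pessimistic consensus takes the minimum across judges, so one
# optimistic judge cannot inflate a channel; structured channels bypass judges entirely.
def pessimistic_consensus(judge_scores: list[float]) -> float:
    """The channel score is the worst score any judge assigned."""
    return min(judge_scores)

def score_channel(kind: str, judge_scores: list[float] | None = None,
                  validator_passed: bool | None = None) -> float:
    if kind == "structured":   # CRDT ops, patches, spreadsheet cells: replay and compare
        return 1.0 if validator_passed else 0.0
    return pessimistic_consensus(judge_scores)

print(score_channel("structured", validator_passed=True))       # -> 1.0
print(score_channel("freeform", judge_scores=[0.9, 0.4, 0.7]))   # -> 0.4
```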
Status · Tier-3 headroom
Honest state. No mystery numbers.
ModylBench is a Tier-3 headroom benchmark in Sarah Sachs’s framing (Notion, Latent Space 2026-04-15). Tier-3 evals are tuned so current frontier models fail most scenarios — the point is to reveal capability movement, not gate launches. Baselines-pending is a feature, not a liability. We expect pass rates below 30% on v1 and below 10% on v3 when that ships. When baselines publish, they publish with full harness metadata (Claude Code, Codex CLI, Gemini CLI, Qwen3.5-Omni) — not just a model name. Meeting-agent performance is a function of harness + model, not either alone.
SHIPPED: v1 scenarios — six professional verticals, ~300 scripted turns each, 29 adversarial edge cases, fully synthetic (no public meeting traces used in scenario generation — zero contamination risk). Apache 2.0 on HuggingFace.
IN PROGRESS: v1 baselines — harness-inclusive runs via Claude Code, Codex CLI, Gemini CLI, Qwen3.5-Omni. Tier-3 headroom discipline: expected pass rates below 30%. Publishes to leaderboard when first stable run completes.
FUTURE WORK: v3 — joint-distribution grading surface for models that emit audio + video + CRDT ops + intent + provenance from one forward pass. See strategic-direction doc.
PENDING: arXiv paper. Targeted for post-baseline-stability publication.
Use it. Break it. Beat the baselines.
ModylBench is Apache 2.0. Run your harness. Submit results. The room will tell you what your agent can’t do yet.
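A hypothetical quick-start for running your own harness; the dataset id and split below are guesses (only the Apache 2.0 HuggingFace release is stated above), so check the published dataset card before relying on them.

```python
# Hypothetical: the dataset id "modyl/modylbench-v1" and the split name are assumptions;
# only the fact that v1 scenarios ship Apache 2.0 on HuggingFace comes from this page.
from datasets import load_dataset

scenarios = load_dataset("modyl/modylbench-v1", split="test")  # assumed id / split
for scenario in scenarios.select(range(1)):
    # Each scenario scripts a multi-turn live session for one professional vertical;
    # feed its turns to your harness and score the emitted channels per the rubric.
    print(sorted(scenario.keys()))
```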