
One judge, many readers

How should AI agents read a company's files, and how should they write down what a company learns? We built a realistic knowledge drive, seeded it with seven traps, and ran eight agent architectures against it. Five ways to read. Three ways to file. Every run measured for speed, computation, and quality against a sealed answer key. This note explains every concept from zero and ends with an architecture you can adopt.

Jul 5, 2026 · By the Nexus team · 30 min read
Diagram of the one judge, many readers architecture: a shared drive of documents feeding reader agents feeding one judge
Research note · 46-document test corpus · 8 architectures · 7 planted traps · July 2026
Abstract

Organizations are starting to let AI agents work directly on their document stores. Two jobs matter: finding and using what the company already knows (we call this retrieval), and writing down what was just learned so the next person or agent can find it (consolidation). We benchmarked five retrieval strategies and three consolidation strategies on a fictional but realistic 46-document company drive, holding the model, the files, and the questions constant. Only the organization of the agents changed. It changed everything. A single reading agent was flawless on a focused brief at the lowest cost of the whole suite. The same agent asked to map six areas at once missed a third of the traps, including a wrong contract figure, and recovered every one of them when given reading assistants. Splitting one brief across three rival agents tripled computation and bought nothing. On the filing side, a resident librarian that archived learnings within minutes of each event captured 100 percent of them, while end-of-project filing is structurally limited by the quality of the handover memo. The recommendation is a simple shape: one accountable agent per job, assistants below it when the job gets broad, and streaming capture for long projects.

At a glance
  • Thinking is the bottleneck, not fetching. An agent that cut its file operations from 76 to 27 got 31 percent slower. Speed lives in the reasoning steps.
  • Breadth erodes judgment. One agent reading everything itself scored 4 of 6. The same agent delegating the reading to assistants scored 6 of 6, plus a bonus catch, and finished 18 percent faster.
  • Filing lives or dies on timing. Streamed filing captured every learning within 4 to 9 minutes of the event. Batch filing is only ever as good as the memo someone remembers to write.

Recommendation: one accountable agent per job. Reading assistants below it when the job spans more than a few areas. A resident librarian for long projects. Never rival agents on one brief.

1 Why this matters

Every company runs on a shared drive. Client folders, meeting notes, templates, price lists, the deck from last quarter. When an AI agent joins the team, that drive becomes its memory. The agent is only as good as its ability to find the right file, trust the right number, and respect the right boundary.

Getting retrieval wrong is not a hypothetical. It looks like a proposal quoting $6.5M when the signed figure is $6M. A confidential client named in an outward-facing deck. A "we have no information on this person" answer when a whole folder about them exists. These are quiet failures. Nobody sees them happen. Someone finds them later.

Getting consolidation wrong is quieter still. A team finishes a six-week project. The method they invented, the template they refined, the questions they wished they had asked on day one: none of it gets written down anywhere findable. The next project starts from zero. Companies pay for the same lesson twice, three times, forever.

The instinctive fix for both problems is a better model. Our benchmark says the bigger lever is cheaper and closer to hand: the organization of the agents. Same model, same files, same questions. We only changed who does what: one agent or several, assistants or rivals, filing at the end or filing as you go. Quality moved from 4 of 6 traps caught to 6 of 6. Cost moved by a factor of three. Time moved by 45 percent. The org chart of the agents mattered more than anything else we could have tuned.

2 The concepts, from zero

This section assumes no background. If you work with agents daily, skip to section 3.

2.1 A language model and its working memory

A language model is software that reads text and writes text. It has no persistent memory of its own. Everything it knows about your situation must sit in its context window: a fixed-size working memory that holds the instructions, the documents it has read so far, and its own notes.

Picture a desk. The desk holds roughly a few hundred pages. Anything the model needs must be on the desk. And here is the property that drives this whole study: a crowded desk degrades judgment. As the desk fills, the model still reads everything, but it starts to miss the small things: the footnote that contradicts the headline number, the label that says "confidential". This is not a defect of one product. It is how the technology behaves today, across vendors.

Text is metered in tokens, roughly three quarters of a word each. Tokens are how usage is billed, so "tokens spent" is the cost line in every chart below.

2.2 Agents

An agent is a language model given tools and a goal. Tools are small verbs: list a folder, open a file, write a file, search. The agent works in turns: think, act with a tool, look at the result, think again. A run is a chain of these turns.

Two consequences follow. First, every tool result lands on the desk and stays there, so an agent that opens forty files has a very full desk by file forty. Second, each turn takes wall-clock time, because thinking is the slow part. Remember that: it explains one of our most counterintuitive results.
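The turn loop can be sketched in a few lines. This is a minimal illustration, not the benchmark harness: `run_agent`, `scripted_think`, and the two file names are hypothetical stand-ins, and the "model" here is a scripted function rather than a real language model. What it shows is the first consequence above: every tool result is appended to the desk and never leaves it.

```python
def run_agent(think, tools, goal, max_turns=20):
    """Think, act with a tool, look at the result, think again."""
    desk = [f"GOAL: {goal}"]              # the context window: everything stays here
    for _ in range(max_turns):
        action = think(desk)              # one reasoning turn (the slow part in a real agent)
        if action["tool"] == "finish":
            return action["arg"], desk
        result = tools[action["tool"]](action["arg"])
        desk.append(f"{action['tool']}({action['arg']}) -> {result}")
    return None, desk

# Hypothetical two-file drive and a scripted stand-in for the model.
FILES = {"contract.md": "Signed value: $6M", "notes.md": "Discussed $6.5M"}

def scripted_think(desk):
    opened = sum(1 for line in desk if line.startswith("open_file"))
    names = sorted(FILES)
    if opened < len(names):
        return {"tool": "open_file", "arg": names[opened]}
    return {"tool": "finish", "arg": f"read {opened} files"}

answer, desk = run_agent(scripted_think, {"open_file": FILES.__getitem__}, "find the contract value")
```

After the run, `desk` holds the goal plus one line per opened file. Scale that to forty files and the desk is crowded by construction.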

2.3 Assistant agents

An agent can start another agent: a subagent, or as we will call it here, an assistant. The assistant gets its own clean desk, does a bounded piece of work, and reports back a summary. The lead agent pays only for the summary on its own desk, not for everything the assistant read.

This is exactly how a good chief of staff protects an executive. The executive does not read forty reports. The executive reads eight one-page summaries written by people who each read five reports carefully. The judgment stays sharp because the desk stays clear. Whether this actually works for agents, and when it stops being worth the money, is what sections 5 and 6 measure.
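The chief-of-staff picture reduces to simple accounting. The sketch below uses illustrative token sizes (3,000 per report, 500 per summary) that are assumptions, not measurements from the benchmark, and it ignores instructions and the agents' own notes. `desk_load` is a hypothetical helper name.

```python
def desk_load(n_reports, tokens_per_report, n_assistants=0, tokens_per_summary=0):
    """Return (tokens on the lead's desk, tokens read across all agents)."""
    reading = n_reports * tokens_per_report
    if n_assistants == 0:
        return reading, reading               # solo: every report lands on the lead's desk
    summaries = n_assistants * tokens_per_summary
    return summaries, reading + summaries     # delegated: the lead sees only the summaries

# Assumed sizes for illustration only.
solo = desk_load(40, 3_000)
delegated = desk_load(40, 3_000, n_assistants=8, tokens_per_summary=500)
```

Under these assumptions the lead's desk drops from 120,000 tokens to 4,000, while the total read across all agents rises slightly. Delegation buys a clear desk, not a smaller bill.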

2.4 Model classes

Models come in sizes, and we will stay vendor-neutral: Opus-like (largest, most capable, most expensive), Sonnet-like (mid-size, capable, the workhorse class), and Haiku-like (small, fast, cheap). Every agent in this benchmark ran on a Sonnet-like model, for one reason: in earlier production use, Haiku-like agents in the lead role degenerated on precisely these jobs (empty results, runaway loops, missed folders). We fixed the class so the benchmark isolates architecture. Section 10 lists the model-class questions we deliberately left open.

2.5 The two jobs: retrieval and consolidation

Retrieval is the reading job. A request arrives ("draft the renewal summary for this client"), and an agent must find what the drive knows: the right figures, the right template, the right sensitivities. The output is a dossier: a findings file listing, for each thing the request needed, what was found, where, and with what confidence, or an honest "not found".

Consolidation is the writing job. A project just produced learnings: a method, a template, decisions, artifacts. An agent must file them into the drive so they are findable later: create or extend the right entries, copy the artifacts in, update the catalog, cross-link related material. The librarian metaphor holds for both: retrieval is the reading room, consolidation is the archiving desk.
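One way to picture a dossier entry is as a small record with an explicit "not found" state. The field names, the confidence labels, and the file path below are hypothetical; the benchmark's actual dossier format is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    need: str                           # what the request asked for
    found: bool                         # False means an honest "not found"
    content: Optional[str] = None       # what was found
    source_path: Optional[str] = None   # where it was found
    confidence: Optional[str] = None    # assumed labels: "high", "medium", "low"

    def render(self) -> str:
        if not self.found:
            return f"{self.need}: NOT FOUND"
        return f"{self.need}: {self.content} [{self.source_path}, confidence {self.confidence}]"

dossier = [
    Finding("contract value", True, "$6M", "clients/northwind/contract.md", "high"),
    Finding("customer reference story", False),
]
lines = [finding.render() for finding in dossier]
```

Making "not found" a first-class value, rather than an empty string, leaves the agent no blank to fill with an invention.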

3 Experimental setup

3.1 The test drive

We built a fictional company from scratch: Meridian Analytics, a consultancy, with a confidential client, Northwind Telecom. The drive holds 46 documents organized the way real drives are: a folder per client, a folder per key person, project folders, meeting notes, templates, canonical company documents, and a catalog file (an index) that lists what lives where. All names and figures are fictional. Everything else is real: real agents, real file operations, the same safety tooling used in production, run on the production toolchain and fully reverted afterwards.

Why fictional? Because scoring requires knowing the truth. We wrote a sealed answer key before any run: for each information need, the exact file that answers it and the exact content that counts as correct. No run was scored by opinion.

3.2 The seven traps

A benchmark that only asks "did the agent find the file" measures too little. Real drives are messy in specific, recurring ways. So we planted seven traps, each modeling a documented real-world failure. A strategy scores a point when its written output handles the trap correctly.

T1 · The split identity

Planted: Knowledge about a key person exists only as a folder of small files (profile, preferences, meeting history), not as one document.

Pass: The agent assembles the person from the folder.

Real-world failure: "We have no information on this person," while a whole folder sits there.

T2 · The stale catalog

Planted: The drive's index lists a briefing document with an outdated description that no longer matches the file.

Pass: The agent notices and reports the stale row instead of trusting it.

Real-world failure: Decisions made from a catalog that lies.

T3 · The confidential client

Planted: The client record is marked confidential, with a neutral outward label ("a European telecom operator") to use in anything external.

Pass: The agent uses the real name internally and the neutral label for outward-facing content.

Real-world failure: A confidential name in a public deck.

T4 · The exact template

Planted: A message template whose value is its exact wording.

Pass: The agent reproduces it verbatim in its findings, or points to the exact file.

Real-world failure: A paraphrase that quietly loses the approved wording.

T5 · The conflicting number

Planted: The source-of-truth contract file says $6M. An older meeting note says $6.5M. Both are on the drive.

Pass: The agent uses $6M and flags the conflict.

Real-world failure: The wrong figure travels silently into a proposal.

T6 · The missing data

Planted: The brief asks for a customer reference story that does not exist anywhere on the drive.

Pass: An explicit "not found," ideally with the nearest grounded alternative.

Real-world failure: The agent invents one. This is the failure people call hallucination.

T7 · The useful neighbor

Planted: An adjacent asset (a ready client-deck route) is relevant to the brief but was not asked for.

Pass: The agent surfaces it alongside the answers.

Real-world failure: Tunnel vision: technically answered, practically unhelpful.

+ · The bonus we did not plant

Two files disagreed about the company's office list. We had not planted this: it was a genuine inconsistency in our own fixture, which we confirmed afterwards. Catching it is scored as a bonus, because it signals reading depth beyond the checklist.

3.3 The workloads

Reading strategies faced one of two briefs, because breadth is itself a variable:

  • The focused brief: seven specific needs for a client deliverable (figures, template, sensitivities, a person, a reference story). Deep, narrow.
  • The audit brief: map six areas of the drive and report what exists, what is stale, and what is missing. Shallow, broad.

Filing strategies all faced the same commission: archive a just-finished project (a client workshop deliverable) so that a future colleague, human or agent, can find and reuse everything. The commission is scored on seven checks: a method entry filed; artifacts archived as exact copies; recurring patterns raised (each grounded in at least two documented instances); a reusable playbook including the intake questions to ask next time; catalog and cross-references updated; private and shared material kept separate; and every write independently verified against the drive listing.

3.4 Scoring

Binary per trap, from written outputs only, against the sealed key. The six core traps (T1 to T5 and T7) apply to every reading run, so reading quality is reported as x of 6. The honesty check T6 is reported separately, scored where the run made an explicit presence-or-absence determination. Bonus catches are reported but kept out of the denominator.
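The scoring rule is small enough to write down. The sketch below is one reading of it, with a hypothetical `score_run` helper: core traps are counted out of six, T6 is reported on its own line, and bonus catches never enter the denominator. The two example inputs follow the R1 and R3 columns of Table 1.

```python
CORE_TRAPS = ("T1", "T2", "T3", "T4", "T5", "T7")

def score_run(results, bonus_catches=0):
    """results maps trap id -> True (pass) or False (miss); T6 may be None (not scored)."""
    caught = sum(1 for trap in CORE_TRAPS if results[trap])
    t6 = results.get("T6")
    return {
        "core": f"{caught} / {len(CORE_TRAPS)}",
        "T6": "not scored" if t6 is None else ("pass" if t6 else "fail"),
        "bonus": bonus_catches,   # reported, but kept out of the denominator
    }

r1 = score_run({"T1": True, "T2": True, "T3": True, "T4": True, "T5": True, "T7": True, "T6": None})
r3 = score_run({"T1": True, "T2": False, "T3": True, "T4": True, "T5": False, "T7": True, "T6": True})
```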

3.5 Measurement

Three quantities per run. Time: wall-clock from dispatch to final result, from the harness's run records.¹ Computation: tokens spent, metered exactly for every single-agent run and for each lead agent; assistant fleets and long-lived runs are estimated from transcript volume and always shown hatched, with ranges.² Quality: the trap score, plus the commission checks for filing.

3.6 Controls and fairness

  • Same drive, same model class, same production tooling, same day, for every run.
  • The answer key was written before any run and never modified.
  • The parallel-writer filing strategy received a second, comparable project, so it could not lean on entries the first archivist had already filed.
  • Verification is independent: we re-listed the drive ourselves and byte-compared archived copies against originals. An agent saying "done" was never accepted as evidence.
  • All test workspaces were deleted afterwards; the real drive was diffed before and after to confirm zero residue.
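The independent verification step can be approximated with nothing but the standard library. This is a sketch of the idea, not the tooling used in the benchmark: `verify_copies` and the file names are hypothetical, and a temporary directory stands in for the drive.

```python
import tempfile
from pathlib import Path

def verify_copies(pairs):
    """Check each (original, archived) pair on disk; never trust a 'done' report."""
    failures = []
    for original, archived in pairs:
        if not archived.is_file():
            failures.append((archived.name, "missing from drive listing"))
        elif original.read_bytes() != archived.read_bytes():
            failures.append((archived.name, "bytes differ from original"))
    return failures

# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "deck.bin").write_bytes(b"\x00\x01workshop deck")
    (root / "deck_copy.bin").write_bytes(b"\x00\x01workshop deck")
    (root / "deck_edited.bin").write_bytes(b"\x00\x01workshop deck v2")
    failures = verify_copies([
        (root / "deck.bin", root / "deck_copy.bin"),
        (root / "deck.bin", root / "deck_edited.bin"),
        (root / "deck.bin", root / "deck_never_written.bin"),
    ])
```

The exact copy passes; the edited copy and the file that was claimed but never written are both reported.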

4 The strategies

4.1 Five ways to read

R1 The solo librarian (baseline). One agent, the focused brief. It navigates the drive step by step: catalog first, then the folders that matter, opening files one at a time. It writes its dossier as it goes, so a partial result exists at every moment. This mirrors how a careful analyst works.

R2 The solo librarian with batched fetching. Identical, except it grabs many files per turn instead of one, to cut round trips. The hypothesis: fewer operations, faster run.

R3 The solo auditor. One agent, the audit brief, reading everything itself. The hypothesis: a 46-document drive fits on one desk, so one careful reader can map it all.

R4 The librarian with reading assistants. One lead agent on the audit brief, but it does not read the drive itself. It sends assistants, each to one area, each returning a structured summary: what is there, what is stale, what deserves a closer look. The lead reads summaries, follows up where warranted (it ran two waves: 14 assistants total), and keeps its own desk clear for judging. The hypothesis: the clear desk preserves quality that breadth would otherwise erode.

R5 The sharded brief. Take the focused brief, split it into three thematic slices, give each slice to an independent librarian running in parallel, and merge the three dossiers afterwards. The hypothesis: three agents, one third of the work each, three times faster.
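The R4 shape, with parallelism below a single judge, can be sketched with a thread pool. Everything here is a stand-in: `read_area` is a hypothetical assistant that would be a model call in practice, and the toy `drive` dictionary is not the benchmark fixture. The structural point is that assistants return short summaries and one lead decides what deserves a follow-up.

```python
from concurrent.futures import ThreadPoolExecutor

def read_area(area, files):
    """Hypothetical assistant: reads one area on its own desk, returns a short summary."""
    stale = [name for name, meta in files.items() if meta.get("stale")]
    return {"area": area, "file_count": len(files), "stale": stale}

def lead_audit(drive):
    # Assistants read in parallel, one per area.
    with ThreadPoolExecutor(max_workers=len(drive)) as pool:
        summaries = list(pool.map(lambda item: read_area(*item), drive.items()))
    # Judgment stays with the single lead: it sees summaries, not raw files.
    follow_ups = [s["area"] for s in summaries if s["stale"]]
    return summaries, follow_ups

# Toy drive for illustration only.
drive = {
    "clients": {"briefing.md": {"stale": True}, "contract.md": {}},
    "templates": {"message.md": {}},
}
summaries, follow_ups = lead_audit(drive)
```

R5 differs in where the merge happens: there is no `lead_audit` owning the whole, so reconciling the three dossiers falls to whoever asked.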

4.2 Three ways to file

C1 The end-of-project archivist (baseline). When the project ships, one agent receives a complete inventory: what was built, what was decided, which artifacts exist and where. It reads the drive to see what already exists, then files everything: entries, exact artifact copies, patterns, the playbook, catalog updates. One agent, one desk, full accountability.

C2 The archivist with parallel writers. The same commission, but the archivist plans the filing, hands each cluster of entries to one of four writer assistants working in parallel, then reviews every draft, fixes defects, files, and verifies. The hypothesis: writing in parallel is faster, and the review layer catches assistant mistakes.

C3 The resident librarian (streaming). No end-of-project moment. A long-lived agent sits alongside the project and receives updates as things happen: "we decided X", "template Y is final", "artifact Z shipped". It files each update within minutes, keeps the catalog current mid-project, and runs one synthesis pass at ship time to capture what the individual events compose into. The hypothesis: what kills end-of-project filing is not the filing, it is everything that never makes it into the handover memo. Streaming removes the memo.
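A resident librarian is, structurally, a long-lived object that files each update on arrival and keeps the catalog in step. The class below is a toy sketch under that reading: the method names, paths, and descriptions are hypothetical, and a dictionary stands in for the drive.

```python
from dataclasses import dataclass, field

@dataclass
class ResidentLibrarian:
    entries: dict = field(default_factory=dict)   # path -> content
    catalog: list = field(default_factory=list)   # index rows: (path, description)

    def file_update(self, path, content, description):
        """File one update as it arrives and keep the catalog current."""
        is_new = path not in self.entries
        self.entries[path] = content
        if is_new:
            self.catalog.append((path, description))
        else:
            # Refresh the existing row so the index never goes stale.
            self.catalog = [(p, description if p == path else d) for p, d in self.catalog]

    def synthesize(self, path):
        """Ship-time pass: file what the individual updates compose into."""
        body = "\n".join(f"- {p}: {d}" for p, d in self.catalog)
        self.file_update(path, body, "ship-time synthesis of streamed updates")

librarian = ResidentLibrarian()
librarian.file_update("decisions/pricing.md", "We decided X", "decision: X")
librarian.file_update("templates/intake.md", "draft wording", "template Y, draft")
librarian.file_update("templates/intake.md", "final wording", "template Y, final")
librarian.synthesize("playbook.md")
```

Note that no handover memo appears anywhere in this flow; the catalog is already current when the synthesis pass runs.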

5 Results: reading

The headline: the cheapest run was flawless, the broadest run was the worst, and assistants fixed exactly what breadth broke. The full matrix first, then the story.

Table 1 · Trap matrix, all reading strategies

| Trap | R1 solo | R2 batched | R3 auditor | R4 assistants | R5 sharded |
|---|---|---|---|---|---|
| T1 split identity | caught | caught | caught | caught | caught |
| T2 stale catalog | caught | missed | missed | caught | caught |
| T3 confidential client | caught | caught | caught | caught | caught |
| T4 exact template | caught | caught | caught | caught | caught |
| T5 conflicting number | caught | caught | missed | caught | caught |
| T7 useful neighbor | caught | caught | caught | caught | caught |
| T6 missing-data honesty | not scored | caught | caught | caught | caught |
| Bonus (unplanted catch) | · | bonus | · | bonus | · |
| Core score | 6 / 6 | 5 / 6 | 4 / 6 | 6 / 6 | 6 / 6 |

Legend: caught, missed, bonus (a catch beyond the planted set).

Cell details:
  • T1 split identity. R1: assembled the person from her folder (profile, preferences, meeting history). R2: assembled the person from the folder despite batched reading. R3: found the folder and assembled the person. R4: a reader summarized the person folder; the lead assembled her correctly. R5: the shard owning this need assembled her from the folder.
  • T2 stale catalog. R1: noticed the index description no longer matched the file and reported the stale row. R2: the one miss of this run; the stale row was never reported. R3: trusted the catalog as written; the stale row went unreported. R4: caught; its readers report stale rows by contract, and the dossier flagged the mismatch. R5: the merged dossiers reported the stale row.
  • T3 confidential client. R1: used the real name internally, the neutral outward label for anything external. R2, R3, R4: real name internally, neutral label outward. R5: subtlest handling of the suite; real name inside the client's own deliverable, neutral label everywhere else.
  • T4 exact template. R1: reproduced the template verbatim, wording intact. R2 to R5: verbatim, wording intact.
  • T5 conflicting number. R1: used $6M from the contract and flagged the $6.5M meeting note as a conflict. R2: used $6M and flagged the conflict. R3: absorbed $6.5M from the older note without flagging it; the costliest miss of the suite. R4: used $6M and flagged the $6.5M decoy. R5: used $6M and flagged the conflict.
  • T7 useful neighbor. R1: surfaced the ready client-deck route the brief never asked for. R2 to R5: surfaced the adjacent client-deck route unprompted.
  • T6 missing-data honesty. R1: not scored; it declined to invent the story and offered a grounded alternative, but made no exhaustive absence claim. R2: explicit not found, backed by a full-text scan of the drive. R3, R4: explicit not found. R5: explicit not found, with a warning about the premise.
  • Bonus (unplanted catch). R1: no unplanted catches this run. R2: caught two files disagreeing about the office list, a genuine inconsistency in our fixture. R3: none. R4: caught the office-list contradiction between two canonical files. R5: none.
Figure 1. Traps caught, out of the six core traps. The solo auditor (R3) is the only strategy that silently absorbed the wrong contract figure (T5). Assistants (R4) restored the perfect score on the same audit brief.

T6 for R1: the run handled the underlying need correctly (it declined to invent the missing story and substituted a grounded alternative) but did not make the explicit exhaustive absence claim the key requires, so we excluded it rather than judge it by opinion.

Figure 2 data (time to complete, by strategy and brief):
  • R1 solo, focused brief: 238 s · 76 tool calls · 154k tokens · 6 of 6. Catalog first, then targeted reads, dossier written as it went.
  • R2 batched, focused brief: 312 s, 31 percent slower · 27 calls · 166k tokens · 5 of 6 plus a bonus. Fewer, heavier turns: batching cut round trips and bought nothing.
  • R3 auditor, audit brief: 307 s · 35 calls · 169k tokens · 4 of 6. Six areas on one desk: the judgment misses landed here.
  • R4 assistants, audit brief: 253 s, 18 percent faster than the inline audit · 24 lead calls plus 14 assistants in 2 waves · 6 of 6 plus a bonus.
  • R5 sharded, focused brief split 3 ways: 345 s (the slowest of the three shards, and the caller waits for all of them) · 68 calls · 447k tokens · 6 of 6 after a hand merge.
Figure 2. Time to complete. The sharded run's time is the slowest of its three parallel agents, since the caller waits for all of them. Three agents in parallel finished 45 percent later than one agent alone (R1).
Figure 3 data:
  • Tool and file operations: R1 76 (one file per turn, many small steps); R2 27 (many files grabbed per batched turn). Operations down 64 percent.
  • Seconds to complete: R1 238 s end to end; R2 312 s (each batched turn carried heavier reasoning; thinking, not fetching, is the clock). Time up 31 percent.
Figure 3. The fetch paradox. R2 cut file operations by nearly two thirds and still finished 31 percent slower than R1. Each of its turns carried heavier reasoning, and reasoning turns, not file operations, are where the time goes.
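The two percentages behind the fetch paradox follow directly from the measured counts. A short check, using only figures reported above (`pct_change` is a helper name introduced here):

```python
def pct_change(before, after):
    """Signed percentage change from before to after."""
    return (after - before) / before * 100

ops_change = pct_change(76, 27)      # file operations, R1 -> R2
time_change = pct_change(238, 312)   # seconds to complete, R1 -> R2
```

Operations fall by about 64.5 percent while time rises by about 31.1 percent, which is where the rounded 64 and 31 come from.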
Figure 4 data (tokens spent; R4 is estimated from transcripts, all others metered):
  • R1 solo: 154k tokens, metered exactly by the harness.
  • R2 batched: 166k tokens, metered.
  • R3 auditor: 169k tokens, metered.
  • R4 assistants: 400 to 480k, estimated. Lead agent metered at 160k; 14 assistant transcripts add an estimated 240 to 320k more: about 2.5 to 3 times a solo run.
  • R5 sharded: 447k tokens metered across the three shards: 2.9 times the solo baseline, much of it defensive re-reading.
Figure 4. Computation. The two multi-agent strategies cost roughly 2.5 to 3 times a solo run. R4 spends that on 14 assistant reading passes that bought quality. R5 spends it on three agents defensively re-reading the same drive.
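The multiples quoted for the two multi-agent strategies can be reproduced from the token counts above. The variable names are introduced here for illustration; all figures are in thousands of tokens and come from the runs as reported.

```python
SOLO_BASELINE = 154                    # R1, metered

r4_lead = 160                          # R4 lead agent, metered
r4_assistants = (240, 320)             # 14 assistant transcripts, estimated range
r4_total = tuple(r4_lead + extra for extra in r4_assistants)

r4_multiple = tuple(round(total / SOLO_BASELINE, 1) for total in r4_total)
r5_multiple = round(447 / SOLO_BASELINE, 1)   # R5, metered across three shards
```

R4 lands between 2.6 and 3.1 times the solo baseline and R5 at 2.9 times, consistent with "roughly 2.5 to 3 times" in the caption.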
Figure 5 data (traps caught, of 6, against computation in tokens):
  • R1 · 6 of 6 at 154k tokens and 238 s: dominates the focused job. Strongest value.
  • R2 · 5 of 6 plus a bonus at 166k: quality sideways, time worse.
  • R3 · 4 of 6 at 169k: the crowded desk, measured.
  • R4 · 6 of 6 plus a bonus at an estimated 400 to 480k: the only perfect audit. Buys audit quality.
  • R5 · 6 of 6 at 447k measured: R4's price for a job R1 does alone. Same price, no benefit.
Figure 5. Quality against computation. R1 dominates the focused job. R4 is the only way we found to buy a perfect audit. R5 pays R4's price for a job R1 does alone. Stars mark bonus catches.

What happened in each run

R1 · Keep as default. Flawless. Six of six, including flagging the $6.5M decoy and assembling the split-identity person from her folder. It refused to invent the missing reference story and offered the nearest grounded alternative instead. Lowest cost of the suite: 154k tokens, 238 seconds. This is the reference standard, and it is cheap.

R2 · Reject. The mechanism worked: 76 operations became 27. The outcome did not: 31 percent slower, and it never reported the stale catalog row. It did catch the unplanted office-list contradiction, a real bonus. Net: quality sideways, time worse. Fetching was never the bottleneck.

R3 · Reject for audits. The degradation case, and the most instructive failure. Same agent, same drive, same tooling as R1. Given six areas instead of seven needs, it missed the stale catalog row and, more seriously, absorbed the $6.5M figure without flagging the conflict. Its own coverage report claimed no gaps: it was confident precisely where it was weakest. This is the crowded desk, measured.

R4 · Adopt for breadth. Everything R3 missed, caught: the stale row, the money conflict, plus the unplanted bonus, with an honest map of what its assistants had visited and skipped. And 18 percent faster than R3, because assistants read in parallel while the lead judged. Cost: roughly 2.5 to 3 times a solo run. That is the price of a clear desk, and on audit work it bought the only perfect score available.

R5 · Reject. The merged output scored six of six, and one shard produced the suite's subtlest confidentiality reasoning (the client's real name is fine inside their own deliverable, the neutral label everywhere else). But the economics are indefensible: 2.9 times the tokens, measured, and the slowest wall-clock of all five runs. Why: each shard, unsure what the others would cover, defensively re-explored the whole drive (one shard enumerated nearly all 46 documents for its 3-need slice). And the three dossiers still had to be reconciled by the requester: overlap, duplication, one shard even misreading a leftover temp folder as drive content. Splitting the brief split the accountability, and every shard paid for the whole drive anyway.

6 Results: filing

All three strategies passed the seven-check commission. The differences live in time, cost, defects, and, decisively, in what reaches the archivist at all.

Table 2 · Filing strategies, measured
| Strategy | Wall time | Files filed | Pace | Computation | Commission | Verdict |
|---|---|---|---|---|---|---|
| C1 end-of-project archivist | 13.4 min | about 16 | about 50 s / file | 237k metered² | 7 / 7, zero defects | Keep as default |
| C2 archivist + 4 writers | about 30 min | 26 | about 70 s / file | 2 to 2.5× C1, estimated | 7 / 7, one writer defect caught and fixed | Reserve for jumbo runs |
| C3 resident librarian | about 55 min, spread³ | all streamed items | filed 4 to 9 min after each event | 1 to 1.5× C1, estimated | 7 / 7, and 100 percent capture | Adopt for long projects |
Figure 6 data (filing time; active filing, ship-time synthesis, and elapsed project time that is not blocked):
  • C1 archivist: 13.4 minutes for about 16 files: about 50 seconds per filed document, zero defects.
  • C2 + 4 writers: about 30 minutes for 26 files at about 70 seconds each: coordination overhead exceeded the parallel gain.
  • C3 resident: about 55 minutes elapsed. Four streamed updates filed as they arrived, each 4 to 9 minutes after the event, then a ship-time synthesis of about 20 minutes that files what the individual events compose into. The project continues in the gaps.
Figure 6. Filing time. C3's 55 minutes are elapsed, not blocking: work continues between updates, and positions are schematic. On a per-file basis C1 is also the fastest writer, at about 50 seconds per filed document against about 70 for the parallel-writer setup, whose coordination overhead exceeded its parallel gain at this scale.
How long does a learning wait before it is findable?
  • Batch (C1, C2): the learning exists only in memory or scattered notes until someone writes the handover memo. It waits for the project to end and is filed at project end: only what the memo remembers arrives.
  • Streaming (C3): each event goes straight to the librarian and is findable minutes later, filed 4 to 9 minutes after the event; nothing depends on end-of-project recall.

Schematic, not measured: in the benchmark the batch archivists ran minutes after the project ended. In real projects the wait is the project's full duration, and anything missing from the handover memo never arrives at all.
Figure 7. The capture argument. Batch filing depends on a complete handover at the end. Streaming removes the dependency: each learning is archived while it is fresh, by an agent that was told about it the moment it happened.

What happened in each run

C1 · Keep as default. A clean, complete run: the method entry, three artifacts archived as exact byte-for-byte copies, three recurring patterns each grounded in two documented instances, the playbook with its intake questions, catalog and cross-links updated, private and shared material correctly separated, every write verified against the drive listing. After filing its report it noticed one folder summary missing a required field and fixed it unprompted. Nothing to criticize, with one structural caveat: it was handed a complete inventory. Its quality is downstream of that memo.

C2 · Reserve for jumbo runs. The architecture proved safe: no two writers collided, the archivist's review caught a writer's one-way cross-reference and added the missing return link, and its discipline on existing entries was immaculate (it extended existing patterns with new instances rather than duplicating them, and deliberately left accurate entries untouched). But everything the writers saved, the archivist repaid in packet assembly and in reviewing 20 drafted bodies: about 30 minutes against 13.4, on more files but slower per file, at 2 to 2.5 times the computation. The review layer earned its keep exactly once. At this scale, once is not enough. At 50-plus entries, where a single desk would crowd, this shape should win; that is the reserve case.

C3 · Adopt for long projects. The best filing behavior we observed. Every streamed update was filed within 4 to 9 minutes, in its own increment, so the archive was current mid-project rather than after it. Two artifacts were archived by direct file copy, byte-identical, without the content ever passing through the agent's working memory (a detail with real cost implications: copying is free of desk space). Mid-project, one topic outgrew its single file and the librarian promoted it to a folder, updating every reference: zero broken links afterwards. At ship time its synthesis pass filed what the events composed into, including an explicit decision to archive the full example set rather than a silent sample. And it flagged six items honestly rather than guessing, including a genuine inconsistency in our own fixture. Its costs are operational: reports arrive in bursts that someone must glance at, and it needs a per-run staging discipline (early on, it staged files inside another run's folder; a sibling agent briefly mistook them for drive content).

7 What we learned

Law 1 · Thinking dominates fetching

R2 cut file operations by 64 percent and lost 31 percent in time. At this scale, an agent's clock is spent in reasoning turns, not in waiting for files. Do not buy I/O optimizations; buy fewer, better reasoning steps. Any pitch that promises speed through parallel fetching, without changing who does the thinking, is optimizing the cheap part.

Law 2 · Working memory is a budget, and breadth spends it

R1 and R3 are the same agent. Depth produced 6 of 6. Breadth produced 4 of 6, with the misses landing on exactly the judgment-heavy traps: staleness and a numeric conflict. Worse, the broad run was confident: its coverage report claimed no gaps. When a job widens, do not widen the reader. Keep the judge's desk clear and delegate the reading. That is Law 2's constructive half, and R4 is its proof: same brief as R3, perfect score, faster.

Law 3 · Accountability does not shard

R4 and R5 both used multiple agents and cost about the same. One was the best audit of the suite; the other was waste. The difference is the shape. In R4, assistants report to one accountable judge. In R5, three peers each own a fragment, nobody owns the whole, so each hedges by re-reading everything, and the unowned merge lands back on the requester. Parallelism belongs below the point of judgment, never beside it.

Law 4 · For filing, capture beats architecture

All three filing strategies filed well. The decisive difference is upstream: batch filing can only archive what survives into the handover memo, and in real projects the memo is written from memory, late, by someone tired. The resident librarian never faces that filter: it filed 100 percent of what it was told, minutes after each event. If your projects run days or weeks, the timing of capture matters more than any other property of the filing system.

Law 5 · Review layers pay only at scale

C2's writer pool was safe and its review caught one real defect. It also doubled the cost and the wall time of a job one agent did cleanly. A defect-catching layer is insurance: price it against the volume of work it protects. Our data says the premium starts paying somewhere past the scale of a dozen entries, not before.

And one for free · Fidelity survives when it is a contract

The verbatim-template trap (T4) passed in all five reading runs, and both artifact-copying filing runs delivered byte-identical copies. In every case the system carried an explicit rule: exact wording is exact, copies are byte-for-byte, paraphrase equals failure. Make fidelity an explicit contract and it holds under any architecture. Leave it implicit and it becomes a style choice.
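
One way to make the byte-for-byte rule checkable is to copy at the file level and compare the two files directly afterwards. The sketch below assumes local files and uses hypothetical helper names (`is_byte_identical`, `archive_by_copy`); it is not the verification code used for this note.

```python
import shutil
from pathlib import Path


def is_byte_identical(original: Path, archived: Path, chunk_size: int = 65536) -> bool:
    """Return True only if both files match down to the last byte."""
    if original.stat().st_size != archived.stat().st_size:
        return False
    with original.open("rb") as a, archived.open("rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files exhausted at the same point
                return True


def archive_by_copy(original: Path, destination: Path) -> bool:
    """Copy the file without reading it into an agent's context, then verify."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(original, destination)
    return is_byte_identical(original, destination)
```

Treating the return value of `archive_by_copy` as the only acceptable evidence, rather than an agent's report that it copied the file, is what turns fidelity from a style choice into a contract.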

8 Recommendations

The playbook, in one table. It amounts to a single shape: one judge, many readers.

Table 3 · Which architecture, when
| Situation | Architecture | Grounds |
| --- | --- | --- |
| A focused question or brief | One Sonnet-like agent, writing its findings file as it works | Cheapest run of the suite was also flawless (R1) |
| A broad audit, more than three areas | The same agent, delegating reading to assistants and judging their summaries | Only perfect audit, and the fastest one (R4) |
| Tempted to split one brief across parallel peer agents | Do not | 2.9× the cost, slowest run, merge burden lands on you (R5) |
| Tempted to parallelize fetching for speed | Do not | Operations fell 64 percent, time rose 31 percent (R2) |
| A short task's learnings need filing | One archivist at the end, handed an inventory | 13.4 minutes, 7 of 7, zero defects (C1) |
| A long project's learnings need filing | A resident librarian, updated as events happen, plus a ship-time synthesis | 100 percent capture at minutes of freshness (C3) |
| A filing job of 50-plus entries | Archivist plus a writer pool, archivist reviews everything | Proven safe with a working review layer, priced for scale (C2) |
| Always, everywhere | Independent verification: re-list the drive, byte-compare copies, treat agent self-reports as claims | Every quality number above was established this way |
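
The table can also be read as a small decision function. The sketch below encodes its rows under stated assumptions: the function names are hypothetical, the "more than three areas" threshold is the note's rough figure rather than a measured one, and the rule that a long project takes precedence over entry count is our assumption, since the table does not say how those two rows combine.

```python
def choose_reading(areas: int) -> str:
    """Pick a reading architecture from the number of areas in the brief."""
    if areas < 1:
        raise ValueError("a brief covers at least one area")
    # Rough threshold from the table: delegate above three areas.
    if areas > 3:
        return "judge with reading assistants (R4)"
    return "single agent writing its findings file (R1)"


def choose_filing(long_project: bool, entries: int) -> str:
    """Pick a filing architecture from project length and job size."""
    # Assumption: project length is checked before entry count.
    if long_project:
        return "resident librarian plus ship-time synthesis (C3)"
    if entries >= 50:
        return "archivist plus writer pool, archivist reviews everything (C2)"
    return "one archivist at the end, handed an inventory (C1)"
```

Neither function ever returns sharding or parallel fetching, which mirrors the two "Do not" rows.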

Two supporting rules. First, findings live in a file, not a chat message: every reading agent here wrote its dossier to disk as it worked, which means a partial result always exists, progress is observable mid-run, and the result survives any interruption. Second, the judge is Sonnet-class at minimum. Assistants do bounded, structured work and are the natural place to try smaller models, but we did not isolate that variable here (see section 10).
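
The first rule is easy to sketch. The example below assumes a JSON Lines file on local disk and a hypothetical `Dossier` class; the field names follow the glossary definition (need, what was found, where, confidence, or an explicit "not found") but the exact format used in the benchmark is not specified in this note.

```python
import json
from pathlib import Path
from typing import Optional


class Dossier:
    """Findings file written incrementally, one JSON object per line."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.path.touch()  # an empty dossier exists from the first moment

    def record(self, need: str, found: bool,
               source: Optional[str] = None,
               confidence: Optional[str] = None) -> None:
        """Append one finding; the file is closed (and flushed) on every call."""
        entry = {
            "need": need,
            "status": "found" if found else "not found",
            "source": source,
            "confidence": confidence,
        }
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(entry) + "\n")

    def entries(self) -> list[dict]:
        """Read back everything recorded so far, for example mid-run."""
        with self.path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
```

Because each call appends and closes the file, an interrupted run leaves every finding recorded up to that point, and anyone can call `entries()` on the same path to observe progress.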

9 Limitations

  • One run per strategy. Agents are nondeterministic; a rerun would move the numbers by some amount. We therefore treat only large gaps as signal (a two-trap quality difference, a 2 to 3× cost difference) and make no claims on small ones.
  • One drive, one scale. Forty-six documents. At thousands of documents a single-context audit stops being possible at all, which strengthens the delegation argument but is extrapolation, not measurement.
  • Token accounting is exact only where metered. Solo runs and lead agents are metered. Assistant fleets and long-lived runs are estimated from transcript volume; every estimate is marked, charted hatched, and given as a range.
  • The traps are ours. Catch rates measure performance on seven failure classes we consider representative. They are not a general quality score.
  • One model class, one platform, one day. Architecture effects this large seem unlikely to invert on another stack, but we did not test that.
  • The batch archivist was flattered. C1 received a complete, well-formed inventory. Real projects rarely produce one, and the streaming case rests partly on that observation from production use, which motivated this study, rather than on the benchmark alone. What the benchmark itself establishes: streaming matched batch quality while removing the dependency on the memo.
  • Supervision cost is not quantified. The resident librarian's periodic reports demand a sliver of ongoing attention that we did not measure.
  • Two operational incidents occurred and were corrected in-run: one shard misread a leftover temp staging folder as drive content, and the streaming librarian staged files inside another run's folder. Both inform the staging-discipline recommendation; neither affected scores.

10 Further work

  • Model-class isolation. Rerun R4 with Haiku-like assistants under a Sonnet-like judge, and the judge itself at Opus-like, to price the quality ladder.
  • Scale sweep. The same suite at 500 and 5,000 documents, where we expect delegation to shift from an option to a necessity.
  • Variance. Five runs per arm to put intervals around every number in this note.
  • The breadth trigger. We recommend delegating above roughly three areas; the exact threshold deserves its own measurement.
  • Supervision cost. Quantify the attention the resident librarian consumes, so its capture benefit can be priced fairly.

11 Glossary

Agent
A language model given tools (list, read, write, search) and a goal, working in think-act-observe turns.
Answer key (ground truth)
The sealed, pre-written record of the correct answer to every planted question, used for scoring.
Assistant (subagent)
An agent started by another agent. It has its own clean working memory, does a bounded task, and reports a summary back.
Batch vs streaming filing
Batch: archive everything once, at the end, from a handover. Streaming: archive each learning within minutes of the event, continuously.
Byte-exact copy
An archived file identical to its original down to the last byte. The standard we hold artifact archiving to, verified by comparison.
Consolidation
The writing job: filing what was just learned into the shared drive so it is findable later.
Context window (working memory)
The fixed-size memory holding everything a model currently knows about the task. The desk. Crowding it degrades judgment.
Dossier
The findings file a reading agent writes as it works: per need, what was found, where, and with what confidence, or an explicit "not found".
Fan-out
One lead agent dispatching several assistants in parallel, each on a bounded slice, all reporting back to the lead.
Index (catalog)
The drive's table of contents: a file listing what lives where. Useful, and capable of being stale (trap T2).
Model class
Size tier of a language model: Opus-like (largest), Sonnet-like (mid, the workhorse here), Haiku-like (small and fast).
Retrieval
The reading job: finding and correctly using what the drive already knows, sensitivities included.
Sharding
Splitting one job across parallel peer agents, each owning a fragment. Distinct from fan-out: no single judge. Rejected by this benchmark.
Token
The billing unit of model text, roughly three quarters of a word. "Computation" in this note is tokens spent.
Trap
A planted, realistic failure opportunity with a known correct behavior, used to score quality objectively.
Verbatim fidelity
Reproducing content exactly rather than paraphrasing it. Held everywhere it was stated as an explicit contract.
Wall-clock time
Real elapsed time from dispatch to final result, as a person would experience the wait.
  1. Wall-clock times come from the harness's run records (dispatch to final result). The resident librarian's timings come from message and file timestamps, since it runs as a long-lived teammate rather than a single call.
  2. Token metering covers every single-agent run and every lead agent exactly. Assistant fleets and long-lived runs do not report a single usage figure; we estimate from transcript volume (for the parallel-writer run: lead 1.0 MB, four writers 0.6 MB combined, one reading assistant 0.3 MB) and report ranges, hatched in every chart. C1's figure covers the archivist itself; its one reading assistant was not separately metered.
  3. Elapsed, not blocking: the project continues between updates. Four updates were filed in 4 to 9 minutes each; the ship-time synthesis took about 20 minutes.
Method note: all runs executed on production tooling against a fictional corpus, scored against a sealed answer key, independently verified, and fully reverted.
July 2026