When you ask a chatbot about a 200-page document, it does much more work than you'd expect — and the work grows brutally with length. We're building an attention mechanism that reads smartly instead of reading everything, and we've made three breakthroughs along the way.
Every time an AI reads, it compares each word to every other word it's seen. With 100 words that's 10,000 comparisons. Easy. But double the length, and the work quadruples. Triple the length, nine times the work.
For a short message, this is invisible. For a long document, it becomes the bottleneck — slower responses, higher costs, and a hard ceiling on how much text the AI can actually pay attention to at once.
A 100-word essay? Trivial. A 10,000-word short story? One hundred million word-pairs to compare. A whole novel? Out of reach for current models without major workarounds.
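To see the scaling in one glance, here's the arithmetic as a few lines of Python (a toy count, not a model):

```python
# Toy arithmetic: all-pairs attention does n x n comparisons.
for n_words in (100, 200, 300, 10_000):
    print(f"{n_words:>6} words -> {n_words * n_words:>13,} comparisons")
```

Double the words (100 to 200), and 10,000 comparisons become 40,000. Triple them, 90,000. That's the quadratic wall.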
You don't re-read every page every time you finish a sentence. You glance back at the last paragraph for context, and if you forget who someone is, you flip back to where they were introduced. That's the entire idea.
Our solution copies that pattern. The AI reads nearby all the time (cheap), and only flips back to distant pages when something triggers a memory (expensive, but rare). Done well, this turns a quadratic cost into something close to linear.
Glance at the last few pages, every step. Cheap. Fast. Catches recent context.
When something nearby triggers a distant memory, jump back to the right page. Costs more, so only do it when needed.
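Here's a minimal sketch of that reading loop in NumPy. The page summaries, the threshold, and the names (`read_step`, `jump_threshold`) are our illustrative assumptions, not the real system's internals:

```python
import numpy as np

def read_step(query, pages, recent, jump_threshold=0.8):
    """One reading step.

    query:  (d,) vector for the current word.
    pages:  list of (summary_vec, word_vecs) for distant pages.
    recent: (w, d) word vectors for the local window; always read.
    Returns the word vectors this step attends to.
    """
    attended = [recent]                    # cheap: always glance nearby
    q = query / np.linalg.norm(query)
    for summary, words in pages:           # one score per page, not per word
        if float(summary @ q) > jump_threshold:
            attended.append(words)         # rare, expensive: flip back
    return np.concatenate(attended, axis=0)
```

Because the distant scan touches one summary per page instead of every word, the per-step cost stays small, and the full-page reads stay rare.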
It sounds simple. But making it actually work means answering three hard questions: how does the AI decide which distant page is relevant? Once it flips there, how should it read the page? And how do you train any of this without hand-labeling everything?
The rest of this article is how we answered each one.
When the AI is deciding which distant page to flip to, "relevant" can mean very different things. Sometimes the relevant info is the overall vibe of a page. Sometimes it's one weird word that stands out. Sometimes it's a specific phrase that lines up with what's being asked.
No single way of measuring "relevance" works in all three cases. Our first attempt used just the average — like skimming a one-line summary of every page. It missed pages where the answer was a single distinctive word buried in mostly-irrelevant text.
Instead of one scoring method, we use three — each good at spotting a different kind of relevance. Then we combine their picks.
The first librarian skims page averages. Good when the relevant info is spread across many sentences.
The second hunts for standout words. Good when the answer is one rare or distinctive token.
The third reads for meaning. Good when the answer is on-topic but otherwise unremarkable.
Each librarian picks their own pages independently. The AI flips to all of them, removes duplicates, and reads. The third librarian is the most expensive, so we only call them in when the first two disagree — saving work on most queries.
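A sketch of the three scorers and the lazy third call, in NumPy. Each page is an array of word vectors; the third scorer here is a stand-in (a soft maximum over word matches), since we're not describing our actual expensive scorer, and "disagree" is simplified to "no overlap in picks":

```python
import numpy as np

def top_k(scores, k):
    return set(np.argsort(scores)[-k:].tolist())

def mean_librarian(query, pages, k):
    # Relevance spread across many sentences: score the page average.
    return top_k([page.mean(axis=0) @ query for page in pages], k)

def max_librarian(query, pages, k):
    # One rare, distinctive token: score the best single word.
    return top_k([float((page @ query).max()) for page in pages], k)

def semantic_librarian(query, pages, k):
    # Stand-in for the expensive third scorer: a soft max over all
    # word matches (an assumption, not the real method).
    return top_k([float(np.logaddexp.reduce(page @ query)) for page in pages], k)

def pick_pages(query, pages, k=3):
    a = mean_librarian(query, pages, k)
    b = max_librarian(query, pages, k)
    picks = a | b                  # combine and deduplicate
    if not (a & b):                # the cheap two fully disagree:
        picks |= semantic_librarian(query, pages, k)  # call in the third
    return picks
```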
Once the AI picks the right page, it has to actually read it. Our first instinct was to keep using the page summary — it's already there, why waste time reading the whole thing?
Big mistake. A summary tells you "Chapter 3 is about a dog." It doesn't help if you need the dog's actual name. The summary is fine for finding the right page. It's terrible for remembering what was on it.
Local-only reading can't reach distant pages at all. The detail is just gone.
Summary-only reading knows it's "the chapter about the dog," but the actual details are blurred.
Our approach: once we know the right page, look at the actual words.
So our rule is: summaries help us pick. Actual pages give us the answer. Two different jobs, two different tools.
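The two-jobs rule as a sketch: cheap summary vectors do the picking, real word vectors do the answering. Shapes and names here are illustrative:

```python
import numpy as np

def answer_from(query, summaries, pages, k=2):
    """Summaries help us pick; actual pages give us the answer.

    summaries: (n_pages, d) one vector per page (the one-line summary).
    pages:     list of (words_per_page, d) arrays of real word vectors.
    """
    # Job 1: find the right pages using cheap summaries.
    picked = np.argsort(summaries @ query)[-k:]
    # Job 2: read the actual words on those pages, not the summaries.
    words = np.concatenate([pages[i] for i in picked], axis=0)
    scores = words @ query
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    return weights @ words                    # readout over real tokens
```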
Once the basic system worked on small puzzles, we tested it inside a real AI model. The result was a shock — even when our system flipped to the right page, the model only got the answer right about 30% of the time. Barely better than guessing.
We ran a test: what if we cheated, and force-fed the right page to the model directly? Accuracy jumped to 88%. So the page-finding wasn't the problem. Something between "having the right page" and "writing the right answer" was broken.
Imagine the question is "What's Alice's favorite color?" and the AI flips to a page that says "... Alice → blue ...". The AI's attention naturally lights up on the word "Alice" — that's what we asked about. So the AI reads the word "Alice" and writes back "Alice" as its answer.
It found the right page. It even matched the right word. It just read that word back instead of the one beside it. The clue was the next word over: "blue."
The fix: when the AI finds a matching word on a flipped-to page, it reads the next word over instead of the match itself. Match "Alice", pick up "blue."
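In code, the fix is one line: shift the values the attention reads by one position relative to the words it matches. A minimal NumPy version (the wrap-around at the last position is a toy simplification):

```python
import numpy as np

def shifted_read(query, words, shift=1):
    """Match on a word, but read the word `shift` positions later.

    words: (n, d) token vectors in document order.
    Matching 'Alice' with shift=1 picks up the neighboring 'blue'.
    """
    scores = words @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Roll the *values* so position i hands back the vector at i + shift.
    shifted_values = np.roll(words, -shift, axis=0)
    return weights @ shifted_values
```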
From 32% to 82%. That's not a tweak — that's a different model. And the breakthrough wasn't a fancier algorithm or more compute. It was realizing the AI was reading the wrong word.
Real text doesn't always put the answer right next to the key. Sometimes there's punctuation. Sometimes the answer is two words over. So we taught the AI to learn the spacing by example — +1, +2, or wherever — instead of hard-coding "next word."
"Read what you matched" — the original failure mode.
"Alice → blue" — adjacent.
"Alice, blue" — comma in between.
When we trained on text where the spacing varied, the AI learned which to use, per situation. Accuracy held up around 82% even with shifting layouts.
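One way to implement "learn the spacing by example" is a small learnable weight per candidate offset, mixing the shifted readouts; training pushes the weights onto whichever spacings the data actually uses. The parameterization below is our assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def learned_offset_read(query, words, offset_logits):
    """Mix readouts over candidate offsets (0, +1, +2, ...).

    offset_logits: learnable scores, one per candidate offset.
    """
    match = softmax(words @ query)               # where the key matched
    readout = np.zeros(words.shape[1])
    for offset, w in enumerate(softmax(offset_logits)):
        values = np.roll(words, -offset, axis=0)  # read `offset` words over
        readout += w * (match @ values)
    return readout
```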
The first breakthrough required a lot of hand-holding: telling the AI which page was right, where the answer sat, exactly which word to read. That works in a research lab. It doesn't scale to real text from the internet, where there are no labels.
So we tried something interesting. We trained one model with all the labels — call it the tutor. Then we used the tutor to teach a second model — the relay — without showing the relay the labels directly. The relay just watched the tutor's choices. Then the relay teaches a student. Labels stop at the tutor.
Pure chaining had a problem: each generation got slightly worse — a kind of telephone game effect. Lessons drifted.
The fix: when training the relay, mostly imitate the tutor — but every so often, sneak in a real human-labeled example to keep the relay grounded. We call it an anchor. Like checking your map every now and then to make sure you haven't drifted off course.
The anchored version actually beat the original tutor. Sometimes a student watching a careful teacher learns better than the teacher learned from the textbook directly.
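The anchor trick in miniature: most steps the relay imitates the tutor's output distribution, but at a small rate it trains on a real human-labeled example instead. `anchor_rate` and the exact losses are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(p_target, p_model, eps=1e-9):
    return -float(np.sum(p_target * np.log(p_model + eps)))

def relay_loss(relay_probs, tutor_probs, true_label, anchor_rate=0.1):
    """One training step for the relay.

    Mostly imitate the tutor; occasionally ground on a human label
    so errors don't compound down the chain (the telephone-game fix).
    """
    if rng.random() < anchor_rate and true_label is not None:
        # Anchor step: a real labeled example keeps the relay on course.
        onehot = np.zeros_like(relay_probs)
        onehot[true_label] = 1.0
        return cross_entropy(onehot, relay_probs)
    # Imitation step: match the tutor's choice distribution.
    return cross_entropy(tutor_probs, relay_probs)
```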
Once the model worked, the next question was: can it do less work without getting much worse? Some questions are easy and don't need much flipping back. Others are hard. We wanted to give the AI the option.
So we trained it to handle different effort levels. Full power: ask all the librarians. Half power: ask half of them. Low power: ask just one. The surprise: training across all three levels made every level a little better.
"Half power" gets the biggest jump from this kind of training: 87% accuracy with half the work. That's the new sweet spot.
Why does training for harder conditions improve performance under easy conditions too? Probably the same reason a student who studied for a hard test does fine on the easy version. Practicing the harder case forces you to actually learn the underlying skill, not memorize a shortcut.
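A sketch of what the multi-level setup can look like: sample an effort level per training step and run with only that many librarians active. Mapping "half power" to two of three scorers is our simplification:

```python
import numpy as np

rng = np.random.default_rng(0)

# Effort level -> how many librarians get asked.
EFFORT_LEVELS = {"full": 3, "half": 2, "low": 1}

def pick_pages_at_effort(query, pages, scorers, level):
    """Run only as many scorers as the effort level allows, then merge picks."""
    picks = set()
    for scorer in scorers[: EFFORT_LEVELS[level]]:
        picks |= scorer(query, pages)
    return picks

def sample_level():
    """One random level per training step, so a single model
    practices answering under every budget."""
    return rng.choice(list(EFFORT_LEVELS))
```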
Here's what our system does, every time it has to read a long document. The foundation: glance locally at every step, let the librarians nominate distant pages, and read the actual words on any page it flips to. On top of that sit the three breakthroughs: reading the word next to the match, the anchored teaching chain, and training across effort levels.
A lot has moved over the last few months. Some questions are partially answered. Others are still open. Here's where we are.
So far our test cases are name-and-color-style pairs; the structure is too clean. Real documents ramble, carry ambiguity, and scatter multiple pieces of relevant information. The next milestone is showing the same approach works on actual long documents and chat transcripts.
Our chain of teachers reduced how many human labels we need, but didn't eliminate them. Can we get the chain started with even fewer? Or train the first tutor from a tiny example set and let the rest follow?
The tricks we use in the lab are hard to translate into the kind of code that runs on the chips inside ChatGPT-style models. We have a sketch of what such code should look like, but we haven't built it yet. This is the biggest open engineering question.
The effort dial works when we set it by hand. Ideally the AI itself would know when a question needs more effort and when it doesn't, turning the dial up or down on its own. We've tried a few versions of this. None work well yet.