AI source selection — how a model picks 3 cites out of 50

Quick Facts

Parameter	Value
Candidate pool size per prompt	30–80 URLs across our tracked Perplexity and ChatGPT runs
Cites in the final answer	2–4 (median 3)
Drop-off after 12 months without an update	Around 60% of the tracked pages that lost a top-3 cite lost it after dateModified aged past a year
Same-domain double-cite rate	Under 7% across 240 measured answers — the model actively spreads citations
Snippet-first pages vs intro-first pages	Snippet-first won the citation 4.1× more often in matched pairs
Effective title pattern	Entity + category + qualifier — e.g. 'Answerly Agency — AEO services for B2B SaaS'

A prompt enters the retriever. The retriever pulls 30 to 80 candidate URLs. The LLM reads, ranks, quotes 2 to 4. The other 50-ish sit in the pool and never make it into the answer the user reads.

That last sentence is the entire AEO (answer engine optimization) problem. You can rank in the pool and never get cited. You can lose to a smaller, sharper page that knows what the model is reading for.

This is what we have observed across 240 tracked answers over the last 90 days — six signals that explain which URLs make it through, and how to engineer for each one. Not a theory paper. Notes from the spreadsheet.

The retrieval-then-rerank model

Five engines, five slightly different stacks, but the rough shape is the same. The retriever (a hybrid of classical lexical search and an embedding model) pulls a candidate set. A second pass — the LLM itself or a smaller reranker — reads the candidates and picks the few it will quote.

Two consequences worth keeping in front of you.

First, classical SEO gets you into the candidate pool. It rarely gets you into the answer on its own. A page that ranks 4th on Google but is hard to extract from will sit in the pool and watch a page ranked 11th win the cite.

Second, the reranker reads snippets, not whole pages. In our log captures the reranker is usually working with 300–800 tokens per candidate. So the question is not “is the answer on my page” — it is “is the answer in the first 600 tokens, structured so the model can lift it cleanly.”
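A rough way to audit this on your own pages is to check where the answer lands. The sketch below is a minimal illustration, not how any engine works internally. It assumes whitespace-separated words stand in for tokens (real tokenizers usually count more tokens than words, so the check is optimistic), and the helper name `answer_within_budget` and the 600 default are ours, taken from the paragraph above.

```python
import string


def _words(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    cleaned = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in cleaned if w]


def answer_within_budget(page_text: str, answer: str, budget: int = 600) -> bool:
    """Return True if the first occurrence of `answer` ends inside the
    first `budget` words of the page.

    Assumption: words approximate tokens. This is a proxy, not a tokenizer.
    """
    words = _words(page_text)
    target = _words(answer)
    if not target:
        return False
    # Slide over the page looking for the answer as a contiguous word run.
    for start in range(len(words) - len(target) + 1):
        if words[start:start + len(target)] == target:
            return start + len(target) <= budget
    return False  # the answer phrase does not appear at all
```

Run it against the extracted body text of a page with the sentence you want quoted as `answer`. A False result means the answer is either missing or sits too deep for a truncated snippet to catch.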

Signal 1 — snippet extractability

The single biggest pattern we measure. In matched pairs — two pages on similar topics, similar authority, both in the candidate pool — the snippet-first version won the citation 4.1× more often than the intro-first version.

Snippet-first means the direct answer to the prompt sits in the first paragraph, 25–40 words, no preamble. Then a Quick Facts table or a short definition list. Then depth.

The version that loses is the one that opens with “In recent years, AI search has changed the way users discover information…” — and the actual answer is buried below an H2 four paragraphs in.

This pattern matters more than authority. We watched a no-name 11-month-old domain win citations against a Forbes column on the same prompt because the no-name page opened with the answer and Forbes opened with a hook.
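The 25 to 40 word opening is easy to lint. Here is a minimal sketch, assuming paragraphs are separated by blank lines and that the first non-empty block is the real opening paragraph rather than a title or navigation text. The function name `opening_is_snippet_first` is hypothetical, and the check only measures length. It cannot tell whether the paragraph actually answers the prompt.

```python
def opening_is_snippet_first(page_text: str,
                             min_words: int = 25,
                             max_words: int = 40) -> bool:
    """Check that the first paragraph falls in the 25-40 word range.

    Assumption: blank lines separate paragraphs in `page_text`.
    """
    paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    first_len = len(paragraphs[0].split())
    return min_words <= first_len <= max_words
```

A page that fails this check is worth a manual look: either the opening is a hook, or the answer is spread across several paragraphs.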

Signal 2 — entity match in title and H1

Titles do double work. They are the candidate’s first line of pitch to the reranker, and they are also how the LLM resolves which entity the page is about.

The pattern that wins in our data — title contains the brand name + the category. “Answerly Agency — AEO services for B2B SaaS” beats “AEO Services Explained” on commercial-intent prompts, even when the second page has more backlinks.

Why — because the reranker is doing entity resolution in parallel with relevance scoring. If the user asked “best AEO agency for SaaS” the model wants to surface entities (brands, products), not generic explainers. A title that names the entity wins the slot.

For non-commercial prompts the rule inverts — explainers beat brands. “What is answer engine optimization” gets won by a page titled “What is Answer Engine Optimization (AEO)?” not a page titled “Answerly Agency — AEO services.”
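The two rules above can be written down as a simple title check. This is a sketch under stated assumptions: the intent labels "commercial" and "explainer" are our own shorthand, the function name is hypothetical, and brand detection is a plain case-insensitive substring match.

```python
def title_fits_intent(title: str, brand: str, intent: str) -> bool:
    """Return True if the title names the brand on commercial prompts
    and leaves it out on explainer prompts.

    `intent` is a hypothetical label: "commercial" or "explainer".
    """
    has_brand = brand.lower() in title.lower()
    if intent == "commercial":
        return has_brand
    if intent == "explainer":
        return not has_brand
    raise ValueError(f"unknown intent: {intent!r}")
```

With the titles from this section, the Answerly Agency title passes for a commercial prompt and "AEO Services Explained" does not, while "What is Answer Engine Optimization (AEO)?" passes for an explainer prompt.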

Signal 3 — recency

We tracked dateModified ages across the 240 answers. Roughly 60% of pages that lost their cite position lost it after their dateModified aged past 12 months.

Two engines weight this hardest — Gemini and Perplexity. Both seem to penalise stale dates aggressively. ChatGPT and Claude are more lenient on evergreen content, but even they drop pages out of the top 3 after 18–24 months without a refresh.

What counts as a refresh — actual edits, not a touched dateModified field. We tested both. Pages that only had their dateModified bumped (no body edits) recovered only briefly. Pages that got real edits — a new example, a refreshed stat, a “what changed” section — held position.

The half-life details on this are in our citation half-life study. Short version — quarterly refresh, visible date stamp, real edits.
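To catch pages before they cross the 12-month line, a small age check on dateModified is enough. A minimal sketch, assuming the value is an ISO 8601 date string (YYYY-MM-DD) already pulled from the page's structured data. The function names and the 12-month default are illustrative.

```python
from datetime import date


def months_since_modified(date_modified: str, today: date) -> int:
    """Whole months elapsed since an ISO 8601 `dateModified` (YYYY-MM-DD)."""
    modified = date.fromisoformat(date_modified)
    months = (today.year - modified.year) * 12 + (today.month - modified.month)
    if today.day < modified.day:
        months -= 1  # the current month has not fully elapsed yet
    return months


def needs_refresh(date_modified: str, today: date,
                  max_age_months: int = 12) -> bool:
    """Flag a page whose dateModified has reached the age threshold."""
    return months_since_modified(date_modified, today) >= max_age_months
```

This only flags candidates. It reads the field, so it cannot tell a real edit from a touched date, and the pages it flags still need substantive changes.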

Signal 4 — source-set diversity

A constraint, not a continuous signal. In 240 measured answers, the model picked two URLs from the same domain in fewer than 7% of cases. Three from the same domain — never.

So if your client domain is already in the answer with one URL, a second URL from the same domain is competing against itself, not against the rest of the field. Optimising five pages for the same prompt cluster does not get you five cites. It gets you the same one cite, occasionally rotated.

The practical implication — pick the single best page per prompt cluster and pour effort into it, instead of spreading across five. We have re-routed client roadmaps after seeing this — fewer URLs, more depth per URL.
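If you log the cited URLs per answer, the double-cite rate is a few lines of code. The sketch below assumes each answer is stored as a list of cited URLs. It treats the hostname minus a leading "www." as the domain, which is a simplification: subdomains such as blog.example.com count as separate domains here.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Hostname with a leading 'www.' removed (simplified domain key)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def double_cite_rate(answers: list[list[str]]) -> float:
    """Share of answers that cite two or more URLs from the same domain."""
    if not answers:
        return 0.0
    repeated = 0
    for cited_urls in answers:
        counts = Counter(domain_of(u) for u in cited_urls)
        if any(n >= 2 for n in counts.values()):
            repeated += 1
    return repeated / len(answers)
```

Computing this on your own logs tells you whether the diversity constraint holds for your prompt set before you commit to a one-page-per-cluster roadmap.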

Signal 5 — contradiction avoidance

A subtler signal we noticed across about 30 cases. When two candidates make contradictory claims, the model either picks the candidate that aligns with the consensus of the rest of the pool, or it picks neither and hedges.

The lesson — being contrarian for traffic backfires in AI search. A page that argues a non-consensus number (“AEO citations decay in 7 days”) will be skipped if the rest of the pool clusters around a different number (18 days median in our data). The model does not want to cite something that disagrees with five other candidates.

This does not mean give up on a sharp opinion. It means anchor your sharp opinion to numbers and methodology that the model can verify against other sources, so you are extending the consensus, not breaking it.

Signal 6 — domain authority, but weighted differently

Authority still matters. It is just not the dominant signal it is in classical search.

What we see — authority works as a tiebreaker. When two candidates have similar snippet quality, similar recency, similar entity match, the model picks the higher-authority domain. When the lower-authority page has a clearly better snippet, authority loses.

So a Forbes column with a buried answer loses to a sharp blog post on a small domain. But two equally sharp posts — Forbes wins.
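The tiebreaker behaviour can be pictured as a two-step comparison. This is a toy model of what we observe, not a description of any engine's reranker. The scores, the `tie_margin` threshold, and the `Candidate` fields are all hypothetical, and recency and entity match are folded out to keep the sketch to one axis.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    snippet_quality: float  # hypothetical 0-1 score for extractability
    authority: float        # hypothetical 0-1 domain authority score


def pick(a: Candidate, b: Candidate, tie_margin: float = 0.1) -> Candidate:
    """Prefer the clearly better snippet; use authority only on a near-tie."""
    gap = a.snippet_quality - b.snippet_quality
    if abs(gap) > tie_margin:
        return a if gap > 0 else b
    return a if a.authority >= b.authority else b
```

Feed it a high-authority page with a weak snippet against a small blog with a strong one and the blog wins. Give both the same snippet score and the high-authority page wins.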

There is also a category effect. For YMYL prompts (medical, legal, financial) authority gets weighted noticeably higher, and a small blog has to be very sharp to displace a regulator-published source.

How to engineer for selection

Stop optimising for the candidate pool. You are probably already there. Optimise for the rerank pass.

Front-load the answer. First paragraph, 25–40 words, direct answer to the dominant prompt for that page. Then the structure can breathe.

Use a title that names the entity for commercial prompts and the concept for explainer prompts. Match the title to the prompt intent the page is targeting.

Quarterly refresh with real edits and a visible date stamp. Not a touched field. Edits that the model can detect as substance.

One page per prompt cluster. Not five. Pick the best one, pour the effort in.

Anchor sharp opinions to verifiable numbers. Sharp is good. Disagreeing with the rest of the pool without evidence is suicide for citation.

What this means for measurement

If you are tracking ranking in the candidate pool but not citation in the final answer, you are measuring the wrong thing. The two correlate loosely — high pool rank lifts your citation odds, but plenty of low-pool-rank pages win cites and plenty of high-pool-rank pages lose them.

The metric that matters is binary: for each tracked prompt, on each engine, is the URL in the cited set or not? Log it daily. Aggregate weekly into “actively cited in the last 7 days.”

This is also the only metric that will not lie to you. Pool rank looks like progress when nothing is changing in the answer the user actually sees.
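The daily log and the 7-day rollup can be sketched in a few lines. This assumes one boolean per prompt, engine, and day, kept in a plain dictionary. The `CitationLog` shape and the function name are illustrative, not a description of our logger.

```python
from datetime import date, timedelta

# One entry per (prompt, engine, day): was our URL in the cited set?
CitationLog = dict[tuple[str, str, date], bool]


def actively_cited(log: CitationLog, prompt: str, engine: str,
                   as_of: date, window_days: int = 7) -> bool:
    """True if the URL was cited at least once in the trailing window.

    The window covers `as_of` and the `window_days - 1` days before it.
    Days with no log entry count as not cited.
    """
    for offset in range(window_days):
        day = as_of - timedelta(days=offset)
        if log.get((prompt, engine, day), False):
            return True
    return False
```

Missing days are treated as misses on purpose, so a broken logger shows up as a drop in citations instead of hiding behind stale data.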

If you want the full measurement stack — the prompts, the platforms, the loggers, the survival curves — start with measuring AI citations and then the AEO checklist for the structural work.

The retrieval pool gets you considered. The six signals above get you cited.
