Four labs, one Sunday school
Ask Claude, GPT-4o, Grok, and DeepSeek about a minority writing system and they will send it to church.
Researchers benchmarked four frontier AI models against a specific class of question: what are minority and historically marginalized writing systems used for in the world today?
Claude, GPT-4o, Grok, and DeepSeek all over-attribute religious function to these writing systems, at a rate 4.1x above base expectation. Across all four model families, “used for religion” alone accounts for 43.6% of convergent errors.
NormSense crystallized cluster a7d88dc5 on this finding this week. Nine observations. Adoption 0.72, which for a fresh finding is high.
Four independent labs. Different training pipelines. Different alignment approaches. Different fine-tuning teams. Different corporate cultures. Different countries of headquarters. And the same wrong answer, in the same magnitude, for the same class of question.
Writing systems have many uses. Legal contracts. Personal correspondence. Commercial signage. Newspapers. Song lyrics. Political speech. Social media. The models systematically compress this range toward religious usage for the minority scripts and not for the majority ones. A user asking about the Latin alphabet gets an answer covering commerce, law, literature, and daily life. A user asking about, for example, a marginalized regional script gets an answer skewed toward liturgical function.
Where does the convergence come from?
The likely explanation is that all four models are downstream of the same English-language corpus of descriptions of minority writing systems. That corpus overrepresents religious use because religious use is the reason those descriptions were written in the first place. Missionary linguistics. Bible translation records. Comparative religion scholarship.
The English-language written record of minority scripts is heavily weighted toward the contexts in which speakers of English were paying attention to them.
Each model is faithfully representing what its training data says. The training data is faithfully representing what English-language scholarship on these scripts has emphasized. The gap between the training data and the world is where the convergent error lives.
Four models drawing from a common pool converge on the pool’s blind spots.
Cross-model agreement
People use cross-model corroboration as a verification technique. Ask Claude. Ask GPT-4o. Ask Grok. If they agree, treat that as consensus.
The technique assumes the models are independent samples of the possible answers. On that assumption, if four differently-trained models arrive at the same answer, that answer is more likely to be correct than if only one did.
The technique breaks down when the models share upstream sources. Four models drawing from overlapping training corpora, using similar architectures, evaluated against similar benchmarks, and fine-tuned against similar preference data are not independent samples. They are correlated draws from the same underlying distribution. Their agreement reflects shared exposure, not shared truth.
When users treat correlated draws as independent, cross-model agreement functions the way rumor propagation functions in human networks. Four sources agree because they all heard the same thing.
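A toy calculation makes the difference concrete. This is a sketch under stated assumptions, not a model of the benchmark: the error rate of 0.3, the `shared` mixing weight, and the function name are all hypothetical, and wrong answers are assumed to land on one dominant wrong answer (here, "religion"). With probability `shared`, every model repeats one common upstream draw. Otherwise each model errs on its own.

```python
def posterior_correct_given_unanimous(error_rate: float, n_models: int, shared: float) -> float:
    """P(answer is right | all n models agree), under a toy mixture model.

    Assumptions (illustrative only):
      - each model is wrong with probability `error_rate`
      - with probability `shared`, all models repeat a single upstream draw
      - otherwise the models err independently
      - wrong answers coincide on one dominant wrong answer
    """
    p_right = 1.0 - error_rate
    # Probability that every model gives the right answer.
    all_right = shared * p_right + (1.0 - shared) * p_right ** n_models
    # Probability that every model gives the same wrong answer.
    all_wrong = shared * error_rate + (1.0 - shared) * error_rate ** n_models
    return all_right / (all_right + all_wrong)


# Sweep from fully independent models to fully shared upstream.
for shared in (0.0, 0.5, 1.0):
    p = posterior_correct_given_unanimous(error_rate=0.3, n_models=4, shared=shared)
    print(f"shared={shared:.1f}  P(right | 4 agree)={p:.2f}")
```

With these made-up numbers, four truly independent models agreeing would put the answer at roughly 0.97. With a fully shared upstream, the same unanimous agreement is worth 0.70, which is exactly what one model alone was worth.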
The finding on minority writing systems is one measurable instance. Other research this week documented that large language models deployed for governance analysis produce confidently fabricated answers for countries underrepresented in training data. Same shape. Different topic.
What this means for practitioners
Cross-model agreement is not verification when the ground truth is thin in mainstream English-language training data.
Whether a question sits in the affected zone is hard to know in advance. It is not about whether the topic is controversial. It is about whether the topic is well-covered in the corpora. Minority writing systems happen to be a case where researchers measured the effect.
Treat cross-model consensus as suggestive, not conclusive, when the topic touches populations, geographies, or subject areas historically under-documented in English-language sources. Four confident wrong answers are worth about the same as one confident wrong answer.
The models aren’t lying. The corpus is quietly incomplete. The models are telling the truth about the corpus.
— Zach, see you in the cluster pages


