Why AI believes Crimea is Russian — and what to do about it

A topographic map of Ukraine and the Black Sea. (NASA / U.S. Department of State / Getty Images)

Ivan Dobrovolsky

Staff software engineer

Not so long ago, Anthropic, one of the leaders in the global AI market and the creator of Claude, published its largest study on what people expect from AI. It is based on 80,000 conversations across 159 countries, with a world map as the central element.

One detail that caught my eye was that it showed Ukraine without Crimea.

Among the respondents to its survey were many Ukrainian voices. For example, a soldier said that in the most difficult moments of his service, it was his "AI friends" who helped him keep from giving up. Another person, writing from a combat zone, said he studies with AI at night because he cannot sleep due to constant shelling.

Their stories were placed next to a map showing that part of the Ukrainian land belongs to Russia.

And this problem is not unique to Anthropic.

A map of Ukraine with incorrect borders that exclude Russian-occupied Crimea, as displayed on Anthropic's website. (Screenshot: Anthropic; Highlight: The Kyiv Independent)

I work in Silicon Valley, where I create AI products every day. A few months ago, I needed an interactive map of Europe. This is a fairly simple task that requires only a few lines of code or a one-sentence prompt in a code-generation tool like Claude.

As an engineer, I care about scalability, best practices, and code optimization, but as a Ukrainian, I always check whether Ukrainian sovereignty is represented correctly, and seeing Ukraine without Crimea was a turning point for me.

I repeated the prompt with ChatGPT, then with Gemini, and, sure enough, the result was the same: all three LLMs showed Crimea in Russia's colors.

The easiest thing would be to just blame AI hallucinations, a common problem in LLMs due to their probabilistic nature. But I couldn't help wondering: why do all three of the most popular AI models show Crimea as Russian by default, when the entire world condemned the annexation of Crimea?

I spent the next few months researching the scale of this infrastructural "contamination" across 16 AI models (from ChatGPT to xAI) at all levels, including large-scale analyses of training data, AI response variance, and web search.

To categorize those levels of AI behavior in maps and scientific sources, I had to investigate how Russia "infects" the root of the entire system with propaganda — the digital infrastructure that becomes the training data for all modern AI systems.

It all starts with geography and metadata

When an AI model generates a map, it pulls data from Natural Earth, the largest open geodatabase that underpins almost all mapping services. Crimea is automatically marked as SOVEREIGNT="Russia" there. I found the correct Ukrainian ISO 3166-2 code, UA-43, in the adjacent line. Natural Earth's own Disputed Boundaries Policy page explains this "de facto" approach.
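For readers who want to see what such a mismatch looks like in practice, here is a minimal sketch of a consistency check. It assumes map features arrive as simple property records. The SOVEREIGNT field and the UA-43 code come from the text above, while the "iso_3166_2" field name, the sample records, and the lookup table are illustrative assumptions rather than a copy of the real dataset.

```python
# Hypothetical sample records shaped like map-feature properties.
# Field name "iso_3166_2" and all sample values are assumptions made
# for illustration; only SOVEREIGNT and UA-43 come from the article.
SAMPLE_FEATURES = [
    {"name": "Crimea", "SOVEREIGNT": "Russia", "iso_3166_2": "UA-43"},
    {"name": "Kyiv", "SOVEREIGNT": "Ukraine", "iso_3166_2": "UA-30"},
    {"name": "Bavaria", "SOVEREIGNT": "Germany", "iso_3166_2": "DE-BY"},
]

# Minimal lookup from the ISO 3166-2 country prefix to a sovereign name.
ISO_PREFIX_TO_SOVEREIGN = {"UA": "Ukraine", "DE": "Germany", "RU": "Russia"}


def find_sovereignty_mismatches(features):
    """Return (name, labeled, expected) for records whose sovereign label
    disagrees with the country prefix of their ISO 3166-2 code."""
    mismatches = []
    for props in features:
        prefix = props.get("iso_3166_2", "").split("-", 1)[0]
        expected = ISO_PREFIX_TO_SOVEREIGN.get(prefix)
        # Skip records whose prefix is not in the lookup table.
        if expected is not None and props.get("SOVEREIGNT") != expected:
            mismatches.append((props["name"], props.get("SOVEREIGNT"), expected))
    return mismatches


if __name__ == "__main__":
    for name, labeled, expected in find_sovereignty_mismatches(SAMPLE_FEATURES):
        print(f"{name}: labeled {labeled}, ISO code implies {expected}")
```

A check like this could run before a map is rendered, so that a record carrying a Ukrainian ISO code and a Russian sovereignty label is caught rather than silently drawn.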

Just the four most popular digital map packages that work with its data are downloaded more than 20 million times a week.

They in turn are the basis for other code packages: like Lego pieces, programmers add them to their products so they don't have to write everything from scratch.

For example, instead of spending months developing their own map, a developer takes a ready-to-use one from a package. There are more than 7,000 packages that depend on D3, about 7,500 on ECharts, and more than 1,000 on Leaflet. Because of this, SOVEREIGNT="Russia" automatically propagates throughout the ecosystem.
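The propagation described above can be sketched as a walk over a dependency graph. This is only an illustration: the package names and the graph are made up, and a real analysis would read dependency data from a package registry.

```python
from collections import deque

# Hypothetical reverse-dependency graph: each key is a package, each value
# lists the packages that depend on it directly. All names are invented.
DEPENDENTS = {
    "geo-dataset": ["map-lib-a", "map-lib-b"],
    "map-lib-a": ["dashboard-kit", "charts-plugin"],
    "map-lib-b": ["charts-plugin"],
    "dashboard-kit": ["startup-app"],
    "charts-plugin": [],
    "startup-app": [],
}


def affected_packages(source, dependents):
    """Breadth-first search for every package that inherits data from
    `source`, directly or through other packages."""
    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    print(sorted(affected_packages("geo-dataset", DEPENDENTS)))
```

In this toy graph, a single label in the dataset reaches five packages, including one that never touches the dataset directly.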

You can check this in just five minutes. For example, after a neutral prompt, "make a dashboard," both Lovable and Claude Code automatically pulled in Natural Earth data. Both assigned Crimea to Russia and counted its population in Russia's statistics.

Next: Training datasets are heavily contaminated

Geodata is just the tip of the iceberg. Beneath it lie the gigantic text corpora used to train LLMs. One of the best-known datasets is C4 by Google, which was used to train almost all early LLMs and is still used to train modern models such as LLaMA.

Because of this, "Russian Crimea" ended up in the raw database, from which AI takes disinformation as fact, "infecting" the entire system.

I created a specialized data signal filter and applied it to all 34.1 million C4 documents (UA, RU, and EN shards) that mention the peninsula. In almost 900,000 cases (2.61%), it is tied to Russian addresses: "Republic of Crimea" or "Simferopol, Russian Federation."
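The filter itself is not described in detail here, so the following is only a rough sketch of how a filter of this kind might work, not the one actually used. The place-name list, the two patterns, and the sample documents are all assumptions chosen for illustration.

```python
import re

# Assumed place names used to detect any mention of the peninsula.
CRIMEA_MENTION = re.compile(
    r"\b(Crimea|Simferopol|Sevastopol|Yalta)\b", re.IGNORECASE
)

# Assumed signals of Russian framing: "Republic of Crimea" without the
# Ukrainian "Autonomous" prefix, or a Crimean place followed by Russia.
RUSSIAN_FRAMING = [
    re.compile(r"(?<!Autonomous )Republic of Crimea", re.IGNORECASE),
    re.compile(
        r"\b(Crimea|Simferopol|Sevastopol|Yalta),\s*(Russian Federation|Russia)\b",
        re.IGNORECASE,
    ),
]


def classify(text):
    """Label a document as no_mention, russian_framing, or other_mention."""
    if not CRIMEA_MENTION.search(text):
        return "no_mention"
    if any(pattern.search(text) for pattern in RUSSIAN_FRAMING):
        return "russian_framing"
    return "other_mention"


def framing_share(documents):
    """Percentage of Crimea-mentioning documents that use Russian framing."""
    labels = [classify(doc) for doc in documents]
    mentions = [label for label in labels if label != "no_mention"]
    if not mentions:
        return 0.0
    return 100.0 * mentions.count("russian_framing") / len(mentions)


# Invented sample documents, for illustration only.
SAMPLE_DOCS = [
    "Book a hotel in Simferopol, Russian Federation.",
    "The Autonomous Republic of Crimea is part of Ukraine.",
    "Weather forecast for Yalta, Republic of Crimea.",
    "Crimea is a peninsula on the Black Sea.",
    "A recipe for borscht.",
]

if __name__ == "__main__":
    print(f"{framing_share(SAMPLE_DOCS):.1f}% of mentions use Russian framing")
```

As a sanity check on the figures above, 2.61% of 34.1 million documents is roughly 890,000, which matches "almost 900,000."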

Russian President Vladimir Putin addresses the crowd during a rally and a concert celebrating the 10th anniversary of Russia's annexation of Crimea, Ukraine, at Red Square in Moscow, Russia, on March 18, 2024. (Natalia Kolesnikova/AFP via Getty Images)

It would seem that sanctions should eradicate this. However, state media and proxy sites account for less than 5% of traffic. The remaining 95.3% comes from everyday web services: university registers, hotels, bank directories, weather forecasts, and so on. Sanctions cannot block weather forecasts.

But what surprised me most was what I saw in academic publications. After scanning over 91,000 open articles, I found about 1,600 that mentioned Russian-occupied Crimea. Until 2014, the occupation name "Republic of Crimea" did not exist in academic metadata at all. By 2021, it had become as common as the Ukrainian "Autonomous Republic of Crimea."

Reputable publishers such as Wiley, Elsevier, and IOP also effectively legitimize the annexation. In the metadata of CrossRef (the largest registrar of scholarly articles), I found 161 articles with a Russian affiliation. None of them can be corrected, as CrossRef does not allow for mass correction of such records.
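To make the metadata problem concrete, here is a small sketch that scans article records for affiliations placing a Crimean city in Russia. The records below are invented, and the nested author, affiliation, and name layout is an assumption about how such metadata is commonly structured, not a statement about any specific registry response.

```python
import re

# Invented sample records; DOIs, names, and affiliations are placeholders.
SAMPLE_RECORDS = [
    {"DOI": "10.0000/example-1", "author": [
        {"family": "Author A", "affiliation": [
            {"name": "Example University, Simferopol, Russian Federation"}]}]},
    {"DOI": "10.0000/example-2", "author": [
        {"family": "Author B", "affiliation": [
            {"name": "Example Institute, Simferopol, Ukraine"}]}]},
    {"DOI": "10.0000/example-3", "author": [
        {"family": "Author C", "affiliation": []}]},
]

# Assumed list of Crimean place names and an assumed Russia label pattern.
CRIMEAN_PLACE = re.compile(r"\b(Crimea|Simferopol|Sevastopol|Yalta|Kerch)\b", re.IGNORECASE)
RUSSIA_LABEL = re.compile(r"\bRussia(n Federation)?\b", re.IGNORECASE)


def flagged_dois(records):
    """Return DOIs of records where any author affiliation names a Crimean
    place together with a Russia label."""
    flagged = []
    for record in records:
        for author in record.get("author", []):
            names = [aff.get("name", "") for aff in author.get("affiliation", [])]
            if any(CRIMEAN_PLACE.search(n) and RUSSIA_LABEL.search(n) for n in names):
                flagged.append(record["DOI"])
                break  # one matching author is enough to flag the record
    return flagged


if __name__ == "__main__":
    print(flagged_dois(SAMPLE_RECORDS))
```

A scan like this only finds the records. Fixing them is a separate problem.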

Each such publication permanently fixes "Simferopol, Russian Federation" in the scientific archives. As a result, a doctor or scientist at a conference may cite statistics from Western Europe in which Ukraine is missing a whole peninsula, and will hardly notice it.

What can be done?

According to the European External Action Service, AI was used in 27% of recorded disinformation cases in 2025. According to DFRLab, the Russian networks Portal Kombat and Pravda have been purposefully "feeding" AI with Wikipedia fakes for years to avoid sanctions and reach a Western audience.

The transparency requirements in the EU AI Act are a first step towards legitimizing AI's role in society.

However, the root of the "virus" lies in the infrastructure itself. For years, Russia has infected the internet wherever it can reach, and developers have copied it into their software.

The Crimea example exposes the weakness of existing AI safety protocols. To restore full Ukrainian sovereignty in digital infrastructure, compliance with international law must be strictly enforced for three groups: open-source contributors who publish geodata databases and map packages; academic publishers who accept papers without checking their sovereignty claims; and AI labs, which must take responsibility for filtering training corpora before feeding them to the machine learning algorithms that bake misinformation into model weights.

Editor's note: The opinions expressed in the op-ed section are those of the authors and do not purport to reflect the views of the Kyiv Independent.

Ivan Dobrovolsky

Ivan Dobrovolsky is a Staff Software Engineer based in California, specializing in building large-scale AI systems. He holds an MS in AI & ML Engineering and is actively engaged in research spanning agentic architectures, multilingual NLP, and AI safety.