Non-Latin Character Detection

When indexing domain names, NetAtlas automatically detects whether a domain contains characters outside the Latin alphabet. This helps us classify domains by language and region, identify internationalised domain names (IDNs), and flag content that is unlikely to be Polish or English.
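
IDNs are often stored in their ASCII (Punycode) form, where every non-ASCII label starts with "xn--". A script check can only see the original characters after that form has been decoded. How NetAtlas handles this step is not described here; the following is a minimal sketch using Python's standard codecs, and the function name to_unicode is hypothetical.

```python
def to_unicode(domain: str) -> str:
    """Decode Punycode ("xn--") labels of a domain back to Unicode."""
    labels = []
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            # Strip the ACE prefix and decode the rest with the stdlib codec.
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)


# Build an ASCII form with the stdlib "idna" codec, then decode it again.
ascii_form = "пример.example".encode("idna").decode("ascii")
unicode_form = to_unicode(ascii_form)
```

Labels without the "xn--" prefix pass through unchanged, so the helper can be applied to every domain before detection.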

Detection is split into two independent checks: Cyrillic script (handled by dedicated per-language validators) and all other non-Latin scripts (covered by the Unicode range pattern described below).

Cyrillic script

Cyrillic is treated separately because the script is shared by many languages with different indexing relevance. We apply individual validators for each of the main Cyrillic-script countries and languages:

Russian — the largest Cyrillic-script web presence; detected by vocabulary and domain patterns specific to Russian
Ukrainian — distinguished from Russian by characteristic letter combinations and vocabulary
Belarusian — overlaps with both Russian and Ukrainian but has distinct spelling patterns, such as the letter ў
Bulgarian — Cyrillic script with South Slavic vocabulary; often co-occurs with .bg domains
Serbian / Macedonian — Cyrillic variants of South Slavic languages, sometimes mixed with Latin script
Kazakh / Kyrgyz / Uzbek / Tajik — Central Asian languages that use or have recently used Cyrillic
Mongolian — uses Cyrillic for its standard written form

Because Cyrillic detection is handled per-language, domains are not simply flagged as "has Cyrillic" — they are attributed to a specific language or country where possible.
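
The validators themselves rely on vocabulary and domain patterns that are not specified here. As one illustrative signal only, some letters appear in just one or a few Cyrillic alphabets and can hint at a language. The table and the function name below are assumptions made for this sketch, not the NetAtlas validator logic, and several of these letters also occur in languages that are not listed.

```python
from typing import Optional

# Letters that are strong hints for one language (illustrative assumption).
DISTINCTIVE_LETTERS = {
    "Ukrainian": set("\u0457\u0454\u0491"),   # ї є ґ
    "Serbian": set("\u0452\u045b"),           # ђ ћ
    "Macedonian": set("\u0453\u045c\u0455"),  # ѓ ќ ѕ
    "Kazakh": set("\u04b1"),                  # ұ
}


def guess_cyrillic_language(text: str) -> Optional[str]:
    """Return the first language whose hint letters occur in text, else None."""
    letters = set(text.lower())
    for language, hints in DISTINCTIVE_LETTERS.items():
        if letters & hints:
            return language
    return None


print(guess_cyrillic_language("київ"))    # contains ї
print(guess_cyrillic_language("москва"))  # no distinctive letter
```

A result of None does not mean "Russian": text without a distinctive letter still needs the vocabulary-based checks to be attributed.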

Other non-Latin scripts

All remaining non-Latin characters are matched by a single Unicode range pattern. The ranges are grouped below by script family, with the languages and regions they cover.
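
A single range pattern of this kind can be sketched as one regular-expression character class built from (start, end) code point pairs. The list below holds only a sample of the documented ranges, and the names RANGES, build_pattern, and count_non_latin are hypothetical.

```python
import re

# A few of the documented ranges; a full pattern would list them all.
RANGES = [
    (0x0370, 0x03FF),    # Greek and Coptic
    (0x0590, 0x05FF),    # Hebrew
    (0x0600, 0x06FF),    # Arabic
    (0x0900, 0x097F),    # Devanagari
    (0x3040, 0x309F),    # Hiragana
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xAC00, 0xD7AF),    # Hangul
    (0x20000, 0x2CEAF),  # CJK extensions beyond the Basic Multilingual Plane
]


def build_pattern(ranges):
    """Compile one character class covering every (start, end) range."""
    parts = "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges)
    return re.compile(f"[{parts}]")


NON_LATIN = build_pattern(RANGES)


def count_non_latin(text: str) -> int:
    """Count code points in text that fall inside any listed range."""
    return len(NON_LATIN.findall(text))


print(count_non_latin("example"))  # 0
print(count_non_latin("日本語"))    # 3
```

Writing the escapes in the eight-digit \U form keeps ranges above U+FFFF, such as the CJK extensions, in the same class as the others.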

CJK — Chinese, Japanese, Korean

CJK Unified Ideographs U+4E00–U+9FFF — core Chinese characters, used in Mandarin, Cantonese, Japanese (kanji), and Korean (hanja)
CJK Compatibility Ideographs U+F900–U+FAFF — compatibility duplicates of CJK characters
CJK Extension A U+3400–U+4DBF — rare and historical Chinese characters
CJK Extensions B–E U+20000–U+2CEAF — very rare, archaic, and specialist characters (Extensions F and G lie outside this range)
CJK Compatibility Ideographs Supplement U+2F800–U+2FA1F — additional compatibility forms
Hiragana U+3040–U+309F — Japanese syllabary used for native words and grammar
Katakana U+30A0–U+30FF — Japanese syllabary used for foreign loanwords
Hangul Syllables U+AC00–U+D7AF — precomposed Korean syllable blocks; individual Hangul jamo lie outside this range

Greek

Greek and Coptic U+0370–U+03FF — modern Greek alphabet; also contains Coptic letters
Greek Extended U+1F00–U+1FFF — polytonic Greek used in classical texts and academic publishing

Arabic and related scripts

Arabic U+0600–U+06FF — covers Arabic, Persian (Farsi), Urdu, Pashto, Kurdish (Sorani), Uyghur, and others
Arabic Supplement U+0750–U+077F — additional letters for African and South Asian languages written in Arabic script
Arabic Extended-A U+08A0–U+08FF — additional marks and letters for extended Arabic orthographies
Arabic Presentation Forms-A U+FB50–U+FDFF — contextual and ligature forms used in digital typography
Arabic Presentation Forms-B U+FE70–U+FEFF — further presentation forms and compatibility characters

Semitic and Middle Eastern scripts

Hebrew U+0590–U+05FF — Hebrew and Yiddish
Armenian U+0530–U+058F — Armenian language
Syriac U+0700–U+074F — Syriac (Aramaic-based script used in Assyrian and Chaldean communities)
Thaana U+0780–U+07BF — Maldivian (Dhivehi)
N'Ko U+07C0–U+07FF — N'Ko script used for Manding languages in West Africa

South Asian scripts (Brahmic family)

Devanagari U+0900–U+097F — Hindi, Nepali, Marathi, Sanskrit
Bengali U+0980–U+09FF — Bengali and Assamese
Gurmukhi U+0A00–U+0A7F — Punjabi
Gujarati U+0A80–U+0AFF — Gujarati language
Oriya / Odia U+0B00–U+0B7F — Odia language (Odisha, India)
Tamil U+0B80–U+0BFF — Tamil language (India, Sri Lanka, Singapore)
Telugu U+0C00–U+0C7F — Telugu language (Andhra Pradesh, Telangana)
Kannada U+0C80–U+0CFF — Kannada language (Karnataka)
Malayalam U+0D00–U+0D7F — Malayalam language (Kerala)
Sinhala U+0D80–U+0DFF — Sinhala language (Sri Lanka)

Southeast Asian scripts

Thai U+0E00–U+0E7F — Thai language
Lao U+0E80–U+0EFF — Lao language
Vietnamese Extended U+1EA0–U+1EF9 — precomposed letters with diacritics used in Vietnamese; part of the Latin Extended Additional block, so Latin-based but outside the basic Latin range
Khmer U+1780–U+17FF — Khmer script (Cambodia)
Khmer Symbols U+19E0–U+19FF — lunar date symbols used in Khmer
Myanmar U+1000–U+109F — Burmese and related languages
Myanmar Extended-A U+AA60–U+AA7F — additional Myanmar characters for minority languages
Myanmar Extended-B U+A9E0–U+A9FF — further extensions for Shan, Mon, and other languages
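
The Vietnamese range above contains precomposed letters only, so text that arrives in decomposed form (a base letter followed by combining marks) would not match it. Whether NetAtlas normalises input before matching is not stated here; the sketch below assumes NFC normalisation as one way to handle this, and has_vietnamese is a hypothetical name.

```python
import re
import unicodedata

# The documented Vietnamese range of precomposed letters.
VIETNAMESE = re.compile("[\u1EA0-\u1EF9]")


def has_vietnamese(text: str) -> bool:
    """Check for precomposed Vietnamese letters after NFC normalisation."""
    # NFC composes base letters and combining marks into single code points.
    return bool(VIETNAMESE.search(unicodedata.normalize("NFC", text)))


# "việt" with the accented letter split into a base letter plus combining marks.
decomposed = unicodedata.normalize("NFD", "vi\u1EC7t")

print(VIETNAMESE.search(decomposed) is None)  # True: the raw pattern misses it
print(has_vietnamese(decomposed))             # True: matched after NFC
```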

Philippine scripts

Tagalog (Baybayin) U+1700–U+171F — pre-colonial script of the Tagalog people
Hanunoo U+1720–U+173F — script of the Hanunoo people (Mindoro island)
Buhid U+1740–U+175F — script of the Buhid people (Mindoro island)
Tagbanwa U+1760–U+177F — script of the Tagbanwa people (Palawan island)

Georgian

Georgian (Mkhedruli / Asomtavruli) U+10A0–U+10FF — Georgian language
Georgian Supplement U+2D00–U+2D2F — Nuskhuri ecclesiastical script
Georgian Extended (Mtavruli) U+1C90–U+1CBF — uppercase Mkhedruli introduced in Unicode 11

Ethiopic

Ethiopic U+1200–U+137F — Amharic, Tigrinya, and other Ethiopian languages
Ethiopic Supplement U+1380–U+139F — additional syllables for minority languages
Ethiopic Extended U+2D80–U+2DDF — further extensions for Sebatbeit and other languages
Ethiopic Extended-A U+AB00–U+AB2F — additions for Gamo-Gofa-Dawro and related languages

Other scripts

Tibetan U+0F00–U+0FFF — Tibetan language and Buddhist texts
Mongolian U+1800–U+18AF — traditional Mongolian vertical script (distinct from Cyrillic Mongolian)
Tifinagh U+2D30–U+2D7F — Berber languages of North Africa (Tamazight, Tuareg)
Cherokee U+13A0–U+13FF — Cherokee syllabary (Eastern North America)
Cherokee Supplement U+AB70–U+ABBF — lowercase Cherokee letters added in Unicode 8
Canadian Aboriginal Syllabics U+1400–U+167F — Cree, Ojibwe, Inuktitut, and other indigenous Canadian languages
Ogham U+1680–U+169F — early medieval Irish script
Runic U+16A0–U+16FF — Germanic runic alphabets
Yi Syllables U+A000–U+A48F — Yi language (Sichuan and Yunnan, China)
Yi Radicals U+A490–U+A4CF — component forms used in Yi script

Emoji

Emoji are detected by a separate pattern and counted independently. They are excluded from the non-Latin character count because emoji can appear in any language context and do not reliably indicate a non-Latin script domain. The emoji ranges covered are:

Miscellaneous Symbols and Pictographs U+1F300–U+1F5FF — weather, nature, objects, places
Emoticons U+1F600–U+1F64F — face and person emoji
Transport and Map Symbols U+1F680–U+1F6FF — vehicles, signs, infrastructure
Supplemental Symbols and Pictographs U+1F900–U+1F9FF — extended emoji added in later Unicode versions
Miscellaneous Symbols U+2600–U+26FF — astrological, meteorological, and other symbols
Dingbats U+2700–U+27BF — decorative symbols and arrows
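
Counting emoji independently can be sketched with a second character class built from the ranges above, applied alongside the non-Latin pattern. The NON_LATIN pattern here is a two-range stand-in for the full one, and count_characters is a hypothetical name.

```python
import re

EMOJI = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
    "]"
)

# Stand-in for the non-Latin pattern: only two of the documented ranges.
NON_LATIN = re.compile("[\u0370-\u03FF\u4E00-\u9FFF]")


def count_characters(text: str) -> dict:
    """Count emoji and non-Latin code points separately."""
    return {
        "emoji": len(EMOJI.findall(text)),
        "non_latin": len(NON_LATIN.findall(text)),
    }


print(count_characters("東京\U0001F680"))  # {'emoji': 1, 'non_latin': 2}
```

This sketch counts code points rather than visible glyphs, so an emoji written as a multi-code-point sequence can be counted more than once.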