When indexing domain names, NetAtlas automatically detects whether a domain contains characters outside the Latin alphabet. This helps us classify domains by language and region, identify internationalised domain names (IDNs), and flag content that is unlikely to be Polish or English.

Detection is split into two independent checks: Cyrillic script (handled by dedicated per-language validators) and all other non-Latin scripts (covered by the Unicode range pattern described below).
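
IDNs usually appear in DNS data in their ASCII-compatible form (labels starting with "xn--"), so a script check presumably has to run on the decoded Unicode form. The sketch below shows one way to do that with Python's built-in "idna" codec (IDNA 2003). The helper name to_unicode is hypothetical, and whether NetAtlas decodes domains this way is an assumption.

```python
def to_unicode(domain: str) -> str:
    """Decode any xn-- labels to Unicode; return the input unchanged on failure."""
    try:
        # The built-in "idna" codec decodes each dot-separated label in turn.
        return domain.encode("ascii").decode("idna")
    except UnicodeError:
        # Already non-ASCII, or not valid Punycode: leave the domain as it is.
        return domain


if __name__ == "__main__":
    print(to_unicode("example.com"))
    print(to_unicode("xn--e1afmkfd.xn--p1ai"))
```

Plain ASCII domains pass through unchanged, so the helper can be applied to every domain before the script checks run.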

Cyrillic script

Cyrillic is treated separately because the script is shared by many languages with different indexing relevance. We apply individual validators for each of the main Cyrillic-script countries and languages:

  • Russian — the largest Cyrillic-script web presence; detected by vocabulary and domain patterns specific to Russian
  • Ukrainian — distinguished from Russian by characteristic letter combinations and vocabulary
  • Belarusian — overlaps with both Russian and Ukrainian but has distinct orthographic patterns (for example, the letter ў)
  • Bulgarian — Cyrillic script with South Slavic vocabulary; often co-occurs with .bg domains
  • Serbian / Macedonian — Cyrillic variants of South Slavic languages, sometimes mixed with Latin script
  • Kazakh / Kyrgyz / Uzbek / Tajik — Central Asian languages that use or have recently used Cyrillic
  • Mongolian — uses Cyrillic for its standard written form

Because Cyrillic detection is handled per-language, domains are not simply flagged as "has Cyrillic" — they are attributed to a specific language or country where possible.
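
To illustrate the idea of attribution rather than a simple "has Cyrillic" flag, here is a minimal sketch based only on distinctive letters. The letter sets, the language codes, and the function name are assumptions made for this example; the validators described above also rely on vocabulary and domain patterns, which the sketch does not attempt.

```python
# Hypothetical letter hints, not the actual NetAtlas validator rules.
DISTINCTIVE_LETTERS = {
    "uk": set("їєґ"),       # typical of Ukrainian
    "be": set("ў"),         # typical of Belarusian
    "sr/mk": set("јљњџ"),   # shared by Serbian and Macedonian
}


def guess_cyrillic_language(label: str):
    """Return a language hint, 'cyrillic-unattributed', or None if no Cyrillic."""
    text = label.lower()
    # Basic Cyrillic block only (U+0400-U+04FF).
    if not any("\u0400" <= ch <= "\u04FF" for ch in text):
        return None
    chars = set(text)
    for lang, letters in DISTINCTIVE_LETTERS.items():
        if chars & letters:
            return lang
    # Cyrillic is present, but no distinctive letter was found.
    return "cyrillic-unattributed"


if __name__ == "__main__":
    for label in ("київ", "слоўнік", "србија", "москва", "example"):
        print(label, guess_cyrillic_language(label))
```

A label with no distinctive letter stays unattributed in this sketch; a real validator would fall back on vocabulary or the top-level domain at that point.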

Other non-Latin scripts

All remaining non-Latin characters are matched by a single Unicode range pattern. The ranges are grouped below by script family, with the languages and regions they cover.

CJK — Chinese, Japanese, Korean

  • CJK Unified Ideographs U+4E00–U+9FFF — core Chinese characters, used in Mandarin, Cantonese, Japanese (kanji), and Korean (hanja)
  • CJK Compatibility Ideographs U+F900–U+FAFF — compatibility duplicates of CJK characters
  • CJK Extension A U+3400–U+4DBF — rare and historical Chinese characters
  • CJK Extensions B–E U+20000–U+2CEAF — very rare, archaic, and specialist characters (Extensions F and G lie above U+2CEAF and are not covered by this range)
  • CJK Compatibility Ideographs Supplement U+2F800–U+2FA1F — additional compatibility forms
  • Hiragana U+3040–U+309F — Japanese syllabary used for native words and grammar
  • Katakana U+30A0–U+30FF — Japanese syllabary used for foreign loanwords
  • Hangul U+AC00–U+D7AF — Korean alphabet (syllable blocks)
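
Because Han ideographs are shared between Chinese, Japanese, and Korean, the kana and Hangul ranges are what make a finer attribution possible. The following is a minimal sketch of that reasoning; the function name and the returned labels are hypothetical, and it does not describe how NetAtlas itself classifies CJK domains.

```python
def classify_cjk(text: str):
    """Give a rough language hint for text containing CJK characters."""
    # Hiragana and Katakana are adjacent blocks: U+3040-U+30FF.
    has_kana = any("\u3040" <= ch <= "\u30FF" for ch in text)
    has_hangul = any("\uAC00" <= ch <= "\uD7AF" for ch in text)
    # Core ideographs plus Extension A.
    has_han = any(
        "\u4E00" <= ch <= "\u9FFF" or "\u3400" <= ch <= "\u4DBF" for ch in text
    )
    if has_kana:
        return "ja"        # kana is used only for Japanese
    if has_hangul:
        return "ko"
    if has_han:
        return "han-only"  # ambiguous: Chinese, or Japanese written in kanji only
    return None


if __name__ == "__main__":
    for sample in ("東京タワー", "東京", "한국", "example"):
        print(sample, classify_cjk(sample))
```

Text made up of ideographs alone is deliberately left as "han-only", since script ranges by themselves cannot separate Chinese from kanji-only Japanese.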

Greek

  • Greek and Coptic U+0370–U+03FF — modern Greek alphabet; also contains Coptic letters
  • Greek Extended U+1F00–U+1FFF — polytonic Greek used in classical texts and academic publishing

Arabic and related scripts

  • Arabic U+0600–U+06FF — covers Arabic, Persian (Farsi), Urdu, Pashto, Kurdish (Sorani), Uyghur, and others
  • Arabic Supplement U+0750–U+077F — additional letters for African and Asian languages written in the Arabic script
  • Arabic Extended-A U+08A0–U+08FF — additional marks and letters for extended Arabic orthographies
  • Arabic Presentation Forms-A U+FB50–U+FDFF — contextual and ligature forms used in digital typography
  • Arabic Presentation Forms-B U+FE70–U+FEFF — further presentation forms and compatibility characters

Semitic and Middle Eastern scripts

  • Hebrew U+0590–U+05FF — Hebrew and Yiddish
  • Armenian U+0530–U+058F — Armenian language
  • Syriac U+0700–U+074F — Syriac (Aramaic-based script used in Assyrian and Chaldean communities)
  • Thaana U+0780–U+07BF — Maldivian (Dhivehi)
  • N'Ko U+07C0–U+07FF — N'Ko script used for Manding languages in West Africa

South Asian scripts (Brahmic family)

  • Devanagari U+0900–U+097F — Hindi, Nepali, Marathi, Sanskrit
  • Bengali U+0980–U+09FF — Bengali and Assamese
  • Gurmukhi U+0A00–U+0A7F — Punjabi
  • Gujarati U+0A80–U+0AFF — Gujarati language
  • Oriya / Odia U+0B00–U+0B7F — Odia language (Odisha, India)
  • Tamil U+0B80–U+0BFF — Tamil language (India, Sri Lanka, Singapore)
  • Telugu U+0C00–U+0C7F — Telugu language (Andhra Pradesh, Telangana)
  • Kannada U+0C80–U+0CFF — Kannada language (Karnataka)
  • Malayalam U+0D00–U+0D7F — Malayalam language (Kerala)
  • Sinhala U+0D80–U+0DFF — Sinhala language (Sri Lanka)

Southeast Asian scripts

  • Thai U+0E00–U+0E7F — Thai language
  • Lao U+0E80–U+0EFF — Lao language
  • Vietnamese Extended U+1EA0–U+1EF9 — precomposed letters with diacritics used in Vietnamese (Latin-based, from the Latin Extended Additional block, but outside the basic Latin range)
  • Khmer U+1780–U+17FF — Khmer script (Cambodia)
  • Khmer Symbols U+19E0–U+19FF — lunar date symbols used in Khmer
  • Myanmar U+1000–U+109F — Burmese and related languages
  • Myanmar Extended-A U+AA60–U+AA7F — additional Myanmar characters for minority languages
  • Myanmar Extended-B U+A9E0–U+A9FF — further extensions for Shan, Mon, and other languages

Philippine scripts

  • Tagalog (Baybayin) U+1700–U+171F — pre-colonial script of the Tagalog people
  • Hanunoo U+1720–U+173F — script of the Hanunoo people (Mindoro island)
  • Buhid U+1740–U+175F — script of the Buhid people (Mindoro island)
  • Tagbanwa U+1760–U+177F — script of the Tagbanwa people (Palawan island)

Georgian

  • Georgian (Mkhedruli / Asomtavruli) U+10A0–U+10FF — Georgian language
  • Georgian Supplement U+2D00–U+2D2F — Nuskhuri ecclesiastical script
  • Georgian Extended (Mtavruli) U+1C90–U+1CBF — uppercase Mkhedruli introduced in Unicode 11

Ethiopic

  • Ethiopic U+1200–U+137F — Amharic, Tigrinya, and other Ethiopian languages
  • Ethiopic Supplement U+1380–U+139F — additional syllables for minority languages
  • Ethiopic Extended U+2D80–U+2DDF — further syllables for Sebatbeit and other languages
  • Ethiopic Extended-A U+AB00–U+AB2F — additions for Gamo-Gofa-Dawro and related languages

Other scripts

  • Tibetan U+0F00–U+0FFF — Tibetan language and Buddhist texts
  • Mongolian U+1800–U+18AF — traditional Mongolian vertical script (distinct from Cyrillic Mongolian)
  • Tifinagh U+2D30–U+2D7F — Berber languages of North Africa (Tamazight, Tuareg)
  • Cherokee U+13A0–U+13FF — Cherokee syllabary (Eastern North America)
  • Cherokee Supplement U+AB70–U+ABBF — lowercase Cherokee letters added in Unicode 8
  • Canadian Aboriginal Syllabics U+1400–U+167F — Cree, Ojibwe, Inuktitut, and other indigenous Canadian languages
  • Ogham U+1680–U+169F — early medieval Irish script
  • Runic U+16A0–U+16FF — Germanic runic alphabets
  • Yi Syllables U+A000–U+A48F — Yi language (Sichuan and Yunnan, China)
  • Yi Radicals U+A490–U+A4CF — component forms used in Yi script
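
The range lists above lend themselves to being compiled into a single character class. The sketch below shows one way to do that from a table of code point ranges. Only a subset of the ranges is included, and the names NON_LATIN_RANGES, build_char_class, and count_non_latin are assumptions for this example rather than NetAtlas identifiers.

```python
import re

# A subset of the ranges listed above, as (first, last) code points.
NON_LATIN_RANGES = [
    (0x0370, 0x03FF),    # Greek and Coptic
    (0x0590, 0x05FF),    # Hebrew
    (0x0600, 0x06FF),    # Arabic
    (0x0900, 0x097F),    # Devanagari
    (0x0E00, 0x0E7F),    # Thai
    (0x3040, 0x309F),    # Hiragana
    (0x30A0, 0x30FF),    # Katakana
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xAC00, 0xD7AF),    # Hangul
    (0x20000, 0x2CEAF),  # CJK extensions in the supplementary planes
]


def build_char_class(ranges):
    """Turn (first, last) pairs into a regex character class string."""
    # \UXXXXXXXX escapes keep the pattern readable and work for any plane.
    parts = [f"\\U{first:08X}-\\U{last:08X}" for first, last in ranges]
    return "[" + "".join(parts) + "]"


NON_LATIN_RE = re.compile(build_char_class(NON_LATIN_RANGES))


def count_non_latin(text: str) -> int:
    """Count characters that fall inside any of the listed ranges."""
    return len(NON_LATIN_RE.findall(text))


if __name__ == "__main__":
    for sample in ("日本語.jp", "한국", "café.example"):
        print(sample, count_non_latin(sample))
```

Keeping the ranges in a table rather than in a hand-written pattern makes it easier to check each entry against the Unicode block list and to add or remove a script later.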

Emoji

Emoji are detected by a separate pattern and counted independently. They are excluded from the non-Latin character count because emoji can appear in any language context and do not reliably indicate a non-Latin script domain. The emoji ranges covered are:

  • Miscellaneous Symbols and Pictographs U+1F300–U+1F5FF — weather, nature, objects, places
  • Emoticons U+1F600–U+1F64F — face and person emoji
  • Transport and Map Symbols U+1F680–U+1F6FF — vehicles, signs, infrastructure
  • Supplemental Symbols and Pictographs U+1F900–U+1F9FF — extended emoji added in later Unicode versions
  • Miscellaneous Symbols U+2600–U+26FF — astrological, meteorological, and other symbols
  • Dingbats U+2700–U+27BF — decorative symbols and arrows
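
As a minimal sketch of counting emoji independently, the six ranges above can be placed in their own character class. The names EMOJI_RE and count_emoji are hypothetical, and the actual NetAtlas pattern may differ.

```python
import re

# The six emoji ranges listed above, in one character class.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
    "]"
)


def count_emoji(text: str) -> int:
    """Count code points that fall inside the emoji ranges."""
    return len(EMOJI_RE.findall(text))


if __name__ == "__main__":
    print(count_emoji("i\u2764pizza\U0001F355"))  # heart (Dingbats) + pizza slice
    print(count_emoji("example"))
```

This sketch counts individual code points, so an emoji built from several joined code points is counted once per matching code point, and modifiers outside these ranges (such as variation selectors) are not counted at all.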