When indexing domain names, NetAtlas automatically detects whether a domain contains characters outside the Latin alphabet. This helps us classify domains by language and region, identify internationalised domain names (IDNs), and flag content that is unlikely to be Polish or English.
Detection is split into two independent checks: Cyrillic script (handled by dedicated per-language validators) and all other non-Latin scripts (covered by the Unicode range pattern described below).
Cyrillic is treated separately because the script is shared by many languages with different indexing relevance. We apply individual validators for each of the main Cyrillic-script countries and languages:
.bg domains
Because Cyrillic detection is handled per language, domains are not simply flagged as "has Cyrillic"; they are attributed to a specific language or country where possible.
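To make the attribution idea concrete, here is a minimal sketch of how a per-language check might work. It is not NetAtlas's actual validator logic, which this section does not describe. The function name `attribute_cyrillic`, the marker-letter table, and the fallback labels are all hypothetical; the marker letters are simply characters that appear in one alphabet but not in the others listed. Bulgarian has no such distinguishing letters, so the sketch falls back to the .bg suffix.

```python
import re

# Basic Cyrillic block; enough for this illustration.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

# Hypothetical marker letters (an assumption for this sketch):
# each set holds letters that distinguish one alphabet from the others here.
MARKERS = {
    "uk": set("ґєї"),  # Ukrainian
    "be": set("ў"),    # Belarusian
    "sr": set("ђћ"),   # Serbian
    "mk": set("ѓќѕ"),  # Macedonian
}


def attribute_cyrillic(domain):
    """Return a language code, 'unattributed', or None if there is no Cyrillic."""
    name = domain.lower()
    if not CYRILLIC.search(name):
        return None
    letters = set(name)
    for lang, marks in MARKERS.items():
        if letters & marks:
            return lang
    # No distinguishing letter found: fall back to the country-code suffix.
    if name.rsplit(".", 1)[-1] == "bg":
        return "bg"
    return "unattributed"
```

A domain with Cyrillic characters but no marker letter and no recognised suffix stays "unattributed" rather than being guessed.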
All remaining non-Latin characters are matched by a single Unicode range pattern. The ranges are grouped below by script family, with the languages and regions they cover.
U+4E00–U+9FFF: core Chinese characters, used in Mandarin, Cantonese, Japanese (kanji), and Korean (hanja)
U+F900–U+FAFF: compatibility duplicates of CJK characters
U+3400–U+4DBF: rare and historical Chinese characters
U+20000–U+2CEAF: very rare, archaic, and specialist characters
U+2F800–U+2FA1F: additional compatibility forms
U+3040–U+309F: Japanese syllabary used for native words and grammar
U+30A0–U+30FF: Japanese syllabary used for foreign loanwords
U+AC00–U+D7AF: Korean alphabet (syllable blocks)
U+0370–U+03FF: modern Greek alphabet; also contains Coptic letters
U+1F00–U+1FFF: polytonic Greek used in classical texts and academic publishing
U+0600–U+06FF: covers Arabic, Persian (Farsi), Urdu, Pashto, Kurdish (Sorani), Uyghur, and others
U+0750–U+077F: additional letters for African Arabic dialects
U+08A0–U+08FF: additional marks and letters for extended Arabic orthographies
U+FB50–U+FDFF: contextual and ligature forms used in digital typography
U+FE70–U+FEFF: further presentation forms and compatibility characters
U+0590–U+05FF: Hebrew and Yiddish
U+0530–U+058F: Armenian language
U+0700–U+074F: Syriac (Aramaic-based script used in Assyrian and Chaldean communities)
U+0780–U+07BF: Maldivian (Dhivehi)
U+07C0–U+07FF: N'Ko script used for Manding languages in West Africa
U+0900–U+097F: Hindi, Nepali, Marathi, Sanskrit
U+0980–U+09FF: Bengali and Assamese
U+0A00–U+0A7F: Punjabi
U+0A80–U+0AFF: Gujarati language
U+0B00–U+0B7F: Odia language (Odisha, India)
U+0B80–U+0BFF: Tamil language (India, Sri Lanka, Singapore)
U+0C00–U+0C7F: Telugu language (Andhra Pradesh, Telangana)
U+0C80–U+0CFF: Kannada language (Karnataka)
U+0D00–U+0D7F: Malayalam language (Kerala)
U+0D80–U+0DFF: Sinhala language (Sri Lanka)
U+0E00–U+0E7F: Thai language
U+0E80–U+0EFF: Lao language
U+1EA0–U+1EF9: precomposed diacritic characters used in Vietnamese (Latin-based but outside the basic Latin range)
U+1780–U+17FF: Khmer script (Cambodia)
U+19E0–U+19FF: lunar date symbols used in Khmer
U+1000–U+109F: Burmese and related languages
U+AA60–U+AA7F: additional Myanmar characters for minority languages
U+A9E0–U+A9FF: further extensions for Shan, Mon, and other scripts
U+1700–U+171F: pre-colonial script of the Tagalog people
U+1720–U+173F: script of the Hanunoo people (Mindoro island)
U+1740–U+175F: script of the Buhid people (Mindoro island)
U+1760–U+177F: script of the Tagbanwa people (Palawan island)
U+10A0–U+10FF: Georgian language
U+2D00–U+2D2F: Nuskhuri ecclesiastical script
U+1C90–U+1CBF: uppercase Mkhedruli introduced in Unicode 11
U+1200–U+137F: Amharic, Tigrinya, and other Ethiopian languages
U+1380–U+139F: additional syllables for minority languages
U+2D80–U+2DDF: further extensions for Sebatbeit and other scripts
U+AB00–U+AB2F: additions for Gamo-Gofa-Dawro and related languages
U+0F00–U+0FFF: Tibetan language and Buddhist texts
U+1800–U+18AF: traditional Mongolian vertical script (distinct from Cyrillic Mongolian)
U+2D30–U+2D7F: Berber languages of North Africa (Tamazight, Tuareg)
U+13A0–U+13FF: Cherokee syllabary (Eastern North America)
U+AB70–U+ABBF: lowercase Cherokee letters added in Unicode 8
U+1400–U+167F: Cree, Ojibwe, Inuktitut, and other indigenous Canadian languages
U+1680–U+169F: early medieval Irish script
U+16A0–U+16FF: Germanic runic alphabets
U+A000–U+A48F: Yi language (Sichuan and Yunnan, China)
U+A490–U+A4CF: component forms used in Yi script
Emoji are detected by a separate pattern and counted independently. They are excluded from the non-Latin character count because emoji can appear in any language context and do not reliably indicate a non-Latin script domain. The emoji ranges covered are:
U+1F300–U+1F5FF: weather, nature, objects, places
U+1F600–U+1F64F: face and person emoji
U+1F680–U+1F6FF: vehicles, signs, infrastructure
U+1F900–U+1F9FF: extended emoji added in later Unicode versions
U+2600–U+26FF: astrological, meteorological, and other symbols
U+2700–U+27BF: decorative symbols and arrows
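As an illustration of how range lists like these can be turned into patterns, the sketch below builds one character class for non-Latin scripts and a separate one for emoji, then counts matches for each independently. It is an assumption-laden example, not the pattern NetAtlas runs: the names `NON_LATIN_RANGES`, `EMOJI_RANGES`, `build_pattern`, and `count_characters` are hypothetical, only a subset of the non-Latin ranges is included for brevity, and the domain is assumed to be already decoded from Punycode to Unicode.

```python
import re

# Illustrative subset of the non-Latin ranges listed above;
# a full pattern would include every range.
NON_LATIN_RANGES = [
    (0x4E00, 0x9FFF),    # core Chinese characters
    (0x20000, 0x2CEAF),  # very rare and archaic characters
    (0x3040, 0x309F),    # Japanese syllabary (native words)
    (0x30A0, 0x30FF),    # Japanese syllabary (loanwords)
    (0xAC00, 0xD7AF),    # Korean syllable blocks
    (0x0370, 0x03FF),    # Greek
    (0x0600, 0x06FF),    # Arabic
    (0x0590, 0x05FF),    # Hebrew
    (0x0900, 0x097F),    # Hindi, Nepali, Marathi, Sanskrit
    (0x0E00, 0x0E7F),    # Thai
]

# The six emoji ranges listed above.
EMOJI_RANGES = [
    (0x1F300, 0x1F5FF),
    (0x1F600, 0x1F64F),
    (0x1F680, 0x1F6FF),
    (0x1F900, 0x1F9FF),
    (0x2600, 0x26FF),
    (0x2700, 0x27BF),
]


def build_pattern(ranges):
    """Compile a character class matching any code point in the given ranges."""
    body = "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges)
    return re.compile(f"[{body}]")


NON_LATIN = build_pattern(NON_LATIN_RANGES)
EMOJI = build_pattern(EMOJI_RANGES)


def count_characters(domain):
    """Return (non_latin_count, emoji_count) for a Unicode domain string."""
    return len(NON_LATIN.findall(domain)), len(EMOJI.findall(domain))
```

Keeping the two patterns separate means an emoji-only domain reports zero non-Latin characters, which matches the rule that emoji are counted independently.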