
Unicode Broke Our System

The Interview Question

"We launched in Japan and everything broke. Users can't log in, searches return wrong results, and some usernames display as question marks. What happened and how do we fix it?"

Asked at: Any company going international, especially Stripe, Shopify, Google

Time to solve: 25-30 minutes

Difficulty: ⭐⭐⭐ (Mid-Senior)


Clarifying Questions to Ask

  1. "What database and encoding?" → MySQL latin1 is a common culprit
  2. "Where do we see the issues?" → Storage, display, search, or comparison?
  3. "Are older records affected or just new ones?" → Identifies when the problem started
  4. "What languages specifically?" → Japanese, Chinese, and emoji each fail differently
  5. "Is password/login affected?" → Security implications

The Problem Visualized

User types:      η”°δΈ­ε€ͺιƒŽ (Tanaka Taro)
Database stores: ç”°ä¸­å¤ªéƒŽ (mojibake - garbled text)
Display shows:   ??? (replacement characters)
Search returns:  nothing (can't match)

Common Unicode Failures

Failure 1: Wrong Database Encoding

-- Check current encoding
SHOW VARIABLES LIKE 'character_set%';

-- Common bad result:
-- character_set_database: latin1 ← CAN'T STORE UNICODE!
-- character_set_connection: latin1

-- Correct encoding:
-- character_set_database: utf8mb4 ← Supports all Unicode + emoji

Fix: Migrate to utf8mb4

-- 1. Backup first!
mysqldump -u root -p mydb > backup.sql

-- 2. Convert database
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- 3. Convert each table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- 4. Fix connection encoding
SET NAMES 'utf8mb4';

-- In application config:
# database_url: mysql://user:pass@host/db?charset=utf8mb4

⚠️ Warning: MySQL utf8 is NOT full UTF-8! It is limited to 3 bytes per character (no emoji). Use utf8mb4.

Failure 2: Application Encoding Mismatch

# πŸ”΄ BAD: Default encoding might not be UTF-8
with open('data.txt', 'r') as f:
    data = f.read()  # uses the platform default encoding

# βœ… GOOD: Explicit UTF-8
with open('data.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# πŸ”΄ BAD: String to bytes without naming the encoding
data = "η”°δΈ­ε€ͺιƒŽ"
bytes_data = data.encode()  # defaults to UTF-8 in Python 3, but be explicit

# βœ… GOOD: Explicit encoding
bytes_data = data.encode('utf-8')

Failure 3: HTTP Encoding Issues

# πŸ”΄ BAD: Missing content-type encoding
response = make_response(json.dumps(data))
# Browser might guess the wrong encoding

# βœ… GOOD: Explicit encoding in headers
response = make_response(json.dumps(data))
response.headers['Content-Type'] = 'application/json; charset=utf-8'
<!-- πŸ”΄ BAD: No meta charset -->
<html>
<head>
  <title>My Page</title>
</head>

<!-- βœ… GOOD: UTF-8 declared early -->
<html>
<head>
  <meta charset="utf-8">
  <title>My Page</title>
</head>

Failure 4: String Comparison Failures

# These look the same but are different!

# Composed form (NFC)
name1 = "cafΓ©"        # Γ© as a single character (U+00E9)

# Decomposed form (NFD)
name2 = "cafe\u0301"  # e + combining accent (U+0065 U+0301), renders as "cafΓ©"

name1 == name2  # False! 😱

len(name1)  # 4
len(name2)  # 5

# βœ… Solution: Normalize before comparing
import unicodedata

def normalize(s):
    return unicodedata.normalize('NFC', s)

normalize(name1) == normalize(name2)  # True βœ…

Failure 5: Case-Insensitive Comparison

# This works for ASCII...
'HELLO'.lower() == 'hello'  # True

# But fails for other scripts!
'Δ°'.lower() == 'i'   # False! (Turkish dotted I lowercases to 'i' plus a combining dot)
'ß'.lower() == 'ss'  # False! (German sharp s)

# βœ… Solution: Use casefold() for case-insensitive comparison
'HELLO'.casefold() == 'hello'.casefold()  # True
'ß'.casefold() == 'ss'  # True!

# Note: Python's str methods are locale-independent; fully correct
# Turkish casing needs a locale-aware library such as PyICU.

Failure 6: Search and Collation

-- Japanese has multiple scripts whose characters are "equivalent"
-- Hiragana: あ
-- Katakana: γ‚’
-- Both represent the "a" sound

-- πŸ”΄ BAD: Binary collation won't match
SELECT * FROM users WHERE name LIKE '%γ‚’%';
-- Won't find 'あいう' even though it is semantically equivalent

-- βœ… GOOD: Use a locale-appropriate collation
-- For Japanese: utf8mb4_ja_0900_as_cs (MySQL 8.0.17+)
-- For Chinese: utf8mb4_zh_0900_as_cs
ALTER TABLE users
    MODIFY name VARCHAR(255)
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_ja_0900_as_cs;

Failure 7: String Length Miscalculation

# Characters vs bytes vs grapheme clusters

text = "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦"  # Family emoji

len(text)                  # 7 (Python code points: 4 emoji + 3 ZWJ)
len(text.encode('utf-8'))  # 25 (bytes)

# Actual "visible characters" = 1 (grapheme cluster)

import grapheme  # third-party: pip install grapheme
grapheme.length(text)  # 1 βœ…

# πŸ”΄ This breaks displays:
text[:1]  # "πŸ‘¨" - shows just the man, broken family!

# βœ… Proper truncation:
grapheme.slice(text, 0, 1)  # "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦" - the full family
-- Database column too short?
-- VARCHAR(50) means 50 characters, but in bytes that is up to:
-- - ASCII: 50 bytes
-- - Japanese: 150 bytes (3 bytes/char)
-- - Emoji: 200 bytes (4 bytes/char)

-- βœ… Use an appropriate size
CREATE TABLE users (
    name VARCHAR(255) CHARACTER SET utf8mb4
    -- Can hold 255 characters regardless of language
);

Failure 8: Security - Homograph Attack

# These look identical but are different:
url1 = "apple.com"  # Latin letters
url2 = "Π°pple.com"  # First 'a' is Cyrillic Π° (U+0430)

url1 == url2  # False - phishing attack!

# βœ… Solution: Convert domains to Punycode (ASCII) before comparing
import idna  # third-party: pip install idna

def to_ascii_domain(domain):
    try:
        return idna.encode(domain).decode('ascii')
    except idna.IDNAError:
        raise ValueError("Invalid characters in domain")

to_ascii_domain("Π°pple.com")  # 'xn--pple-43d.com' - the lookalike is exposed

# Note: idna alone does not flag mixed scripts; compare the ASCII forms,
# or add a confusables check (Unicode TR39) for stricter validation.

Unicode Checklist for Launch

database:
- [ ] Database encoding is utf8mb4 (not utf8)
- [ ] Connection charset is utf8mb4
- [ ] Collation matches target languages
- [ ] Column sizes account for multibyte characters

application:
- [ ] All file reads/writes specify encoding='utf-8'
- [ ] HTTP responses include charset=utf-8
- [ ] HTML pages have <meta charset="utf-8">
- [ ] JSON serialization uses ensure_ascii=False

data_handling:
- [ ] Strings normalized (NFC) before comparison
- [ ] Use casefold() for case-insensitive comparison
- [ ] Truncation respects grapheme clusters
- [ ] Sorting uses locale-aware collation

security:
- [ ] URLs checked for homograph attacks
- [ ] Usernames restricted to safe character sets
- [ ] Input validation uses Unicode-aware regex

Database Migration Script

# migrate_to_utf8mb4.py

import pymysql  # third-party: pip install pymysql
from tqdm import tqdm  # third-party: pip install tqdm

def migrate_database(db_config):
    conn = pymysql.connect(**db_config)
    cursor = conn.cursor()

    # 1. Get all tables
    cursor.execute("SHOW TABLES")
    tables = [row[0] for row in cursor.fetchall()]

    # 2. Convert the database default
    db_name = db_config['database']
    cursor.execute(f"""
        ALTER DATABASE `{db_name}`
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_unicode_ci
    """)

    # 3. Convert each table
    for table in tqdm(tables, desc="Converting tables"):
        try:
            cursor.execute(f"""
                ALTER TABLE `{table}`
                CONVERT TO CHARACTER SET utf8mb4
                COLLATE utf8mb4_unicode_ci
            """)
            conn.commit()
        except Exception as e:
            print(f"Error converting {table}: {e}")
            conn.rollback()

    # 4. Verify
    cursor.execute("SHOW VARIABLES LIKE 'character_set%'")
    print("\nNew settings:")
    for name, value in cursor.fetchall():
        print(f"  {name}: {value}")

    conn.close()

Quick Reference

| Problem | Solution |
| --- | --- |
| Garbled text (mojibake) | Fix database encoding to utf8mb4 |
| Characters show as ? | Fix connection charset |
| Comparison fails | Normalize with NFC before comparing |
| Case comparison fails | Use casefold() instead of lower() |
| Emoji crashes | Use utf8mb4, not utf8 |
| Truncation breaks emoji | Use grapheme-aware truncation |
| Security bypass | Normalize and validate usernames |

Key Takeaways

  1. UTF-8 everywhere - Database, connection, HTTP, files
  2. MySQL utf8 ≠ UTF-8 - Use utf8mb4
  3. Normalize before comparing - Use NFC consistently for storage and comparison
  4. Graphemes ≠ characters ≠ bytes - Know the difference
  5. Test with real data - Japanese, Chinese, emoji, RTL languages
  6. Security implications - Homograph attacks are real

Testing tip: Always test with: η”°δΈ­ε€ͺιƒŽ, MΓΌller, Ω, πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦, مرحبا
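Those sample strings slot directly into an automated smoke test. A minimal sketch; the identity round trip below stands in for whatever save-and-load path your own system has:

```python
import unicodedata

# CJK, Latin diacritics, Greek, ZWJ emoji, and RTL Arabic
SAMPLES = ["η”°δΈ­ε€ͺιƒŽ", "MΓΌller", "Ω", "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦", "مرحبا"]

for s in SAMPLES:
    # UTF-8 round trip must be lossless
    assert s.encode("utf-8").decode("utf-8") == s
    # NFC-normalized comparison must survive decomposition
    assert unicodedata.normalize("NFC", unicodedata.normalize("NFD", s)) == s
```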