Unicode Broke Our System
The Interview Question
"We launched in Japan and everything broke. Users can't log in, searches return wrong results, and some usernames display as question marks. What happened and how do we fix it?"
Asked at: Any company going international, especially Stripe, Shopify, Google
Time to solve: 25-30 minutes
Difficulty: ★★★ (Mid-Senior)
Clarifying Questions to Ask
- "What database and encoding?" → MySQL latin1 is a common culprit
- "Where do we see the issues?" → Storage, display, search, or comparison?
- "Are older records affected or just new ones?" → Identifies when the problem started
- "What languages specifically?" → Japanese, Chinese, and emoji fail in different ways
- "Is password/login affected?" → Security implications
The Problem Visualized
User types: 田中太郎 (Tanaka Taro)
Database stores: ç”°ä¸­å¤ªéƒŽ (mojibake - garbled text)
Display shows: ??? (replacement characters)
Search returns: nothing (can't match)
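The garbling above is easy to reproduce: UTF-8 bytes decoded as latin-1 turn each 3-byte kanji into three junk characters. A minimal sketch:

```python
text = "田中太郎"
# Simulate a latin1 connection misreading the UTF-8 bytes
garbled = text.encode("utf-8").decode("latin-1")
len(text)     # 4 characters
len(garbled)  # 12 - three latin-1 characters per 3-byte kanji
```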
Common Unicode Failures
Failure 1: Wrong Database Encoding
-- Check current encoding
SHOW VARIABLES LIKE 'character_set%';
-- Common bad result:
-- character_set_database: latin1  ← CAN'T STORE UNICODE!
-- character_set_connection: latin1
-- Correct encoding:
-- character_set_database: utf8mb4  ← Supports all Unicode + emoji
Fix: Migrate to utf8mb4
-- 1. Backup first!
mysqldump -u root -p mydb > backup.sql
-- 2. Convert database
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- 3. Convert each table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- 4. Fix connection encoding
SET NAMES 'utf8mb4';
-- In application config:
# database_url: mysql://user:pass@host/db?charset=utf8mb4
⚠️ Warning: MySQL utf8 is NOT full UTF-8! It stores at most 3 bytes per character (no emoji). Use utf8mb4.
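If garbled rows are already in the table, a single latin1-misdecode can often be reversed in the application before re-saving. A minimal sketch (assumes the text was garbled exactly once; `fix_mojibake` is an illustrative helper, not a library function):

```python
def fix_mojibake(s: str) -> str:
    """Reverse one round of UTF-8 bytes misread as latin-1."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s  # not simple mojibake - leave unchanged

garbled = "田中太郎".encode("utf-8").decode("latin-1")
fix_mojibake(garbled)  # "田中太郎"
fix_mojibake("hello")  # "hello" - clean text passes through
```

Note that MySQL's latin1 is actually closer to Windows cp1252, so some real-world strings need `encode("cp1252")` instead.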
Failure 2: Application Encoding Mismatch
# 🔴 BAD: Default encoding might not be UTF-8
with open('data.txt', 'r') as f:
    data = f.read()  # Uses the system default encoding

# ✅ GOOD: Explicit UTF-8
with open('data.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# 🔴 BAD: String to bytes without an explicit encoding
data = "田中太郎"
bytes_data = data.encode()  # Defaults to UTF-8 in Python 3, but implicit

# ✅ GOOD: Explicit encoding
bytes_data = data.encode('utf-8')
Failure 3: HTTP Encoding Issues
# 🔴 BAD: Missing content-type encoding
response = make_response(json.dumps(data))
# Browser might guess the wrong encoding

# ✅ GOOD: Explicit encoding in headers
response = make_response(json.dumps(data))
response.headers['Content-Type'] = 'application/json; charset=utf-8'
<!-- 🔴 BAD: No meta charset -->
<html>
<head>
  <title>My Page</title>
</head>

<!-- ✅ GOOD: UTF-8 declared early -->
<html>
<head>
  <meta charset="utf-8">
  <title>My Page</title>
</head>
Failure 4: String Comparison Failures
# These look the same but are different!
# Composed form (NFC)
name1 = "café"  # é as a single character (U+00E9)
# Decomposed form (NFD)
name2 = "cafe\u0301"  # e + combining accent (U+0065 U+0301)

name1 == name2  # False! 😱
len(name1)  # 4
len(name2)  # 5

# ✅ Solution: Normalize before comparing
import unicodedata

def normalize(s):
    return unicodedata.normalize('NFC', s)

normalize(name1) == normalize(name2)  # True ✅
Failure 5: Case-Insensitive Comparison
# This works for ASCII...
'HELLO'.lower() == 'hello'  # True

# But fails for other scripts!
'İ'.lower() == 'i'   # False! (Turkish dotted İ lowercases to 'i' plus a combining dot)
'ß'.lower() == 'ss'  # False! (German sharp s - lower() leaves it as 'ß')

# ✅ Solution: Use casefold() for case-insensitive comparison
# (note: locale.setlocale() does NOT change how str.lower() behaves)
'HELLO'.casefold() == 'hello'.casefold()  # True
'ß'.casefold() == 'ss'  # True!
Failure 6: Search and Collation
-- Japanese has multiple scripts whose characters are "equivalent"
-- Hiragana: あ
-- Katakana: ア
-- Both represent the "a" sound

-- 🔴 BAD: Binary collation won't match across scripts
SELECT * FROM users WHERE name LIKE '%ア%';
-- Won't find hiragana text even though it is semantically equivalent

-- ✅ GOOD: Use a locale-appropriate collation (MySQL 8.0+)
-- For Japanese: utf8mb4_ja_0900_as_cs
-- For Chinese: utf8mb4_zh_0900_as_cs
ALTER TABLE users
  MODIFY name VARCHAR(255)
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_ja_0900_as_cs;
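Collation handles hiragana/katakana equivalence, but half-width katakana (ｱ) arriving from legacy input is a separate problem. NFKC normalization folds it to full-width before the data reaches the database; a short sketch:

```python
import unicodedata

def normalize_for_search(s: str) -> str:
    # NFKC folds half-width katakana (ｱ) to full-width (ア)
    # and full-width ASCII (Ａ) to plain ASCII (A)
    return unicodedata.normalize("NFKC", s)

normalize_for_search("ｱ")   # "ア"
normalize_for_search("Ａ")  # "A"
```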
Failure 7: String Length Miscalculation
# Characters vs bytes vs grapheme clusters
text = "👨‍👩‍👧‍👦"  # Family emoji: 4 code points joined by 3 zero-width joiners
len(text)  # 7 (Python counts code points)
len(text.encode('utf-8'))  # 25 (bytes)
# Actual "visible characters" = 1 (grapheme cluster)

import grapheme  # pip install grapheme
grapheme.length(text)  # 1 ✅

# 🔴 This breaks displays:
text[:1]  # "👨" - shows just the man, broken family!

# ✅ Proper truncation:
grapheme.slice(text, 0, 1)  # "👨‍👩‍👧‍👦" - the full family
-- Database columns can also be too short
-- VARCHAR(50) means 50 characters, but:
-- - ASCII: up to 50 bytes
-- - Japanese: up to 150 bytes (3 bytes/char)
-- - Emoji: up to 200 bytes (4 bytes/char)

-- ✅ Use an appropriate size
CREATE TABLE users (
  name VARCHAR(255) CHARACTER SET utf8mb4
  -- Can hold 255 characters regardless of language
);
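The byte math in the comments above can be checked directly (the sample strings are illustrative):

```python
samples = {
    "ascii": "a" * 50,      # 1 byte/char in UTF-8
    "japanese": "あ" * 50,  # 3 bytes/char
    "emoji": "😀" * 50,     # 4 bytes/char
}
for name, s in samples.items():
    print(f"{name}: {len(s)} chars, {len(s.encode('utf-8'))} bytes")
# ascii: 50 chars, 50 bytes
# japanese: 50 chars, 150 bytes
# emoji: 50 chars, 200 bytes
```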
Failure 8: Security - Homograph Attack
# These look identical but are different:
url1 = "apple.com"  # Latin letters
url2 = "аpple.com"  # First 'a' is Cyrillic а (U+0430)
url1 == url2  # False - phishing attack!

# ✅ Solution: reject labels that mix scripts, and use Punycode
# (idna.encode() turns "аpple.com" into an obvious xn--... ASCII form
# for logging/display; it does not reject mixed scripts by itself,
# so check explicitly)
import unicodedata

def has_mixed_scripts(label):
    # First word of the Unicode character name is the script: LATIN, CYRILLIC, ...
    scripts = {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}
    return len(scripts) > 1

has_mixed_scripts("apple")   # False
has_mixed_scripts("аpple")   # True - reject or warn
Unicode Checklist for Launch
database:
- [ ] Database encoding is utf8mb4 (not utf8)
- [ ] Connection charset is utf8mb4
- [ ] Collation matches target languages
- [ ] Column sizes account for multibyte characters
application:
- [ ] All file reads/writes specify encoding='utf-8'
- [ ] HTTP responses include charset=utf-8
- [ ] HTML pages have <meta charset="utf-8">
- [ ] JSON serialization uses ensure_ascii=False
data_handling:
- [ ] Strings normalized (NFC) before comparison
- [ ] Use casefold() for case-insensitive comparison
- [ ] Truncation respects grapheme clusters
- [ ] Sorting uses locale-aware collation
security:
- [ ] URLs checked for homograph attacks
- [ ] Usernames restricted to safe character sets
- [ ] Input validation uses unicode-aware regex
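Several checklist items can be smoke-tested in CI with a handful of adversarial strings (the sample list and helper names here are illustrative):

```python
import unicodedata

SAMPLES = ["田中太郎", "Müller", "Ω", "👨‍👩‍👧‍👦", "مرحبا"]

def roundtrips_utf8(s: str) -> bool:
    return s.encode("utf-8").decode("utf-8") == s

def stored_form(s: str) -> str:
    # Normalize to NFC before storage so later comparisons are stable
    return unicodedata.normalize("NFC", s)

assert all(roundtrips_utf8(s) for s in SAMPLES)
assert stored_form("cafe\u0301") == "caf\u00e9"  # NFD input becomes composed NFC
```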
Database Migration Script
# migrate_to_utf8mb4.py
import pymysql
from tqdm import tqdm

def migrate_database(db_config):
    conn = pymysql.connect(**db_config)
    cursor = conn.cursor()

    # 1. Get all tables
    cursor.execute("SHOW TABLES")
    tables = [row[0] for row in cursor.fetchall()]

    # 2. Convert database default charset
    db_name = db_config['database']
    cursor.execute(f"""
        ALTER DATABASE `{db_name}`
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_unicode_ci
    """)

    # 3. Convert each table
    for table in tqdm(tables, desc="Converting tables"):
        try:
            cursor.execute(f"""
                ALTER TABLE `{table}`
                CONVERT TO CHARACTER SET utf8mb4
                COLLATE utf8mb4_unicode_ci
            """)
            conn.commit()
        except Exception as e:
            print(f"Error converting {table}: {e}")
            conn.rollback()

    # 4. Verify
    cursor.execute("SHOW VARIABLES LIKE 'character_set%'")
    print("\nNew settings:")
    for row in cursor.fetchall():
        print(f"  {row[0]}: {row[1]}")
    conn.close()
Quick Reference

| Problem | Solution |
|---|---|
| Garbled text (mojibake) | Fix database encoding to utf8mb4 |
| Characters show as ? | Fix connection charset |
| Comparison fails | Normalize with NFC before comparing |
| Case comparison fails | Use casefold() instead of lower() |
| Emoji crashes | Use utf8mb4, not utf8 |
| Truncation breaks emoji | Use grapheme-aware truncation |
| Security bypass | Normalize and validate usernames |
Key Takeaways
- UTF-8 everywhere - database, connection, HTTP, files
- MySQL utf8 ≠ UTF-8 - use utf8mb4
- Normalize before comparing - NFC is the usual choice for storage and comparison
- Graphemes ≠ characters ≠ bytes - know the difference
- Test with real data - Japanese, Chinese, emoji, RTL languages
- Security implications - homograph attacks are real
Testing tip: Always test with: 田中太郎, Müller, Ω, 👨‍👩‍👧‍👦, مرحبا