
Unicode Broke Our System

The Interview Question

"We launched in Japan and everything broke. Users can't log in, searches return wrong results, and some usernames display as question marks. What happened and how do we fix it?"

Asked at: Any company going international, especially Stripe, Shopify, Google

Time to solve: 25-30 minutes

Difficulty: ⭐⭐⭐ (Mid-Senior)


Clarifying Questions to Ask

  1. "What database and encoding?" → MySQL latin1 is a common culprit
  2. "Where do we see the issues?" → Storage, display, search, or comparison?
  3. "Are older records affected or just new ones?" → Identifies when the problem started
  4. "What languages specifically?" → Japanese, Chinese, and emoji each fail differently
  5. "Is password/login affected?" → Security implications

The Problem Visualized

User types:      η”°δΈ­ε€ͺιƒŽ (Tanaka Taro)
Database stores: ç”°ä¸­å¤ªéƒŽ (mojibake - garbled text)
Display shows:   ??? (replacement characters)
Search returns:  nothing (can't match)

Common Unicode Failures

Failure 1: Wrong Database Encoding

-- Check current encoding
SHOW VARIABLES LIKE 'character_set%';

-- Common bad result:
-- character_set_database: latin1 ← CAN'T STORE UNICODE!
-- character_set_connection: latin1

-- Correct encoding:
-- character_set_database: utf8mb4 ← Supports all Unicode + emoji

Fix: Migrate to utf8mb4

-- 1. Backup first!
mysqldump -u root -p mydb > backup.sql

-- 2. Convert database
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- 3. Convert each table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- 4. Fix connection encoding
SET NAMES 'utf8mb4';

-- In application config:
# database_url: mysql://user:pass@host/db?charset=utf8mb4

⚠️ Warning: MySQL utf8 is NOT full UTF-8! It is limited to 3 bytes per character (no emoji). Use utf8mb4.

Failure 2: Application Encoding Mismatch

# πŸ”΄ BAD: Default encoding might not be UTF-8
with open('data.txt', 'r') as f:
    data = f.read()  # uses the platform default encoding

# βœ… GOOD: Explicit UTF-8
with open('data.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# πŸ”΄ BAD: String to bytes without naming the encoding
data = "η”°δΈ­ε€ͺιƒŽ"
bytes_data = data.encode()  # defaults to UTF-8 in Python 3, but be explicit

# βœ… GOOD: Explicit encoding
bytes_data = data.encode('utf-8')

Failure 3: HTTP Encoding Issues

# πŸ”΄ BAD: Missing content-type encoding
response = make_response(json.dumps(data))
# Browser might guess the wrong encoding

# βœ… GOOD: Explicit encoding in headers
response = make_response(json.dumps(data))
response.headers['Content-Type'] = 'application/json; charset=utf-8'
<!-- πŸ”΄ BAD: No meta charset -->
<html>
<head>
  <title>My Page</title>
</head>

<!-- βœ… GOOD: UTF-8 declared early -->
<html>
<head>
  <meta charset="utf-8">
  <title>My Page</title>
</head>

Failure 4: String Comparison Failures

# These look the same but are different!

# Composed form (NFC)
name1 = "cafΓ©"        # Γ© as a single character (U+00E9)

# Decomposed form (NFD)
name2 = "cafe\u0301"  # e + combining accent (U+0065 U+0301), renders as "cafΓ©"

name1 == name2  # False! 😱

len(name1)  # 4
len(name2)  # 5

# βœ… Solution: Normalize before comparing
import unicodedata

def normalize(s):
    return unicodedata.normalize('NFC', s)

normalize(name1) == normalize(name2)  # True βœ…

Failure 5: Case-Insensitive Comparison

# This works for ASCII...
'HELLO'.lower() == 'hello'  # True

# But fails for other scripts!
'Δ°'.lower() == 'i'   # False! (Turkish dotted I lowercases to 'i' plus a combining dot)
'ß'.lower() == 'ss'  # False! (German sharp s)

# βœ… Solution: Use casefold() for case-insensitive comparison
'HELLO'.casefold() == 'hello'.casefold()  # True
'ß'.casefold() == 'ss'  # True!

# Note: Python's str methods are locale-independent; fully correct
# Turkish casing needs a locale-aware library such as PyICU.

Failure 6: Search and Collation

-- Japanese has multiple scripts whose characters are "equivalent"
-- Hiragana: あ
-- Katakana: γ‚’
-- Both represent the "a" sound

-- πŸ”΄ BAD: Binary collation won't match
SELECT * FROM users WHERE name LIKE '%γ‚’%';
-- Won't find 'あいう' even though it is semantically equivalent

-- βœ… GOOD: Use a locale-appropriate collation
-- For Japanese: utf8mb4_ja_0900_as_cs (MySQL 8.0.17+)
-- For Chinese: utf8mb4_zh_0900_as_cs
ALTER TABLE users
    MODIFY name VARCHAR(255)
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_ja_0900_as_cs;

Failure 7: String Length Miscalculation

# Characters vs bytes vs grapheme clusters

text = "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦"  # Family emoji

len(text)                  # 7 (Python code points: 4 emoji + 3 ZWJ)
len(text.encode('utf-8'))  # 25 (bytes)

# Actual "visible characters" = 1 (grapheme cluster)

import grapheme  # third-party: pip install grapheme
grapheme.length(text)  # 1 βœ…

# πŸ”΄ This breaks displays:
text[:1]  # "πŸ‘¨" - shows just the man, broken family!

# βœ… Proper truncation:
grapheme.slice(text, 0, 1)  # "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦" - the full family
-- Database column too short?
-- VARCHAR(50) means 50 characters, but in bytes that is up to:
-- - ASCII: 50 bytes
-- - Japanese: 150 bytes (3 bytes/char)
-- - Emoji: 200 bytes (4 bytes/char)

-- βœ… Use an appropriate size
CREATE TABLE users (
    name VARCHAR(255) CHARACTER SET utf8mb4
    -- Can hold 255 characters regardless of language
);

Failure 8: Security - Homograph Attack

# These look identical but are different:
url1 = "apple.com"  # Latin letters
url2 = "Π°pple.com"  # First 'a' is Cyrillic Π° (U+0430)

url1 == url2  # False - phishing attack!

# βœ… Solution: Convert domains to Punycode (ASCII) before comparing
import idna  # third-party: pip install idna

def to_ascii_domain(domain):
    try:
        return idna.encode(domain).decode('ascii')
    except idna.IDNAError:
        raise ValueError("Invalid characters in domain")

to_ascii_domain("Π°pple.com")  # 'xn--pple-43d.com' - the lookalike is exposed

# Note: idna alone does not flag mixed scripts; compare the ASCII forms,
# or add a confusables check (Unicode TR39) for stricter validation.

Unicode Checklist for Launch

database:
- [ ] Database encoding is utf8mb4 (not utf8)
- [ ] Connection charset is utf8mb4
- [ ] Collation matches target languages
- [ ] Column sizes account for multibyte characters

application:
- [ ] All file reads/writes specify encoding='utf-8'
- [ ] HTTP responses include charset=utf-8
- [ ] HTML pages have <meta charset="utf-8">
- [ ] JSON serialization uses ensure_ascii=False

data_handling:
- [ ] Strings normalized (NFC) before comparison
- [ ] Use casefold() for case-insensitive comparison
- [ ] Truncation respects grapheme clusters
- [ ] Sorting uses locale-aware collation

security:
- [ ] URLs checked for homograph attacks
- [ ] Usernames restricted to safe character sets
- [ ] Input validation uses Unicode-aware regex

Database Migration Script

# migrate_to_utf8mb4.py

import pymysql  # third-party: pip install pymysql
from tqdm import tqdm  # third-party: pip install tqdm

def migrate_database(db_config):
    conn = pymysql.connect(**db_config)
    cursor = conn.cursor()

    # 1. Get all tables
    cursor.execute("SHOW TABLES")
    tables = [row[0] for row in cursor.fetchall()]

    # 2. Convert the database default
    db_name = db_config['database']
    cursor.execute(f"""
        ALTER DATABASE `{db_name}`
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_unicode_ci
    """)

    # 3. Convert each table
    for table in tqdm(tables, desc="Converting tables"):
        try:
            cursor.execute(f"""
                ALTER TABLE `{table}`
                CONVERT TO CHARACTER SET utf8mb4
                COLLATE utf8mb4_unicode_ci
            """)
            conn.commit()
        except Exception as e:
            print(f"Error converting {table}: {e}")
            conn.rollback()

    # 4. Verify
    cursor.execute("SHOW VARIABLES LIKE 'character_set%'")
    print("\nNew settings:")
    for name, value in cursor.fetchall():
        print(f"  {name}: {value}")

    conn.close()

Quick Reference

| Problem | Solution |
| --- | --- |
| Garbled text (mojibake) | Fix database encoding to utf8mb4 |
| Characters show as ? | Fix connection charset |
| Comparison fails | Normalize with NFC before comparing |
| Case comparison fails | Use casefold() instead of lower() |
| Emoji crashes | Use utf8mb4, not utf8 |
| Truncation breaks emoji | Use grapheme-aware truncation |
| Security bypass | Normalize and validate usernames |

Key Takeaways

  1. UTF-8 everywhere - Database, connection, HTTP, files
  2. MySQL utf8 ≠ UTF-8 - Use utf8mb4
  3. Normalize before comparing - Use NFC consistently for storage and comparison
  4. Graphemes ≠ characters ≠ bytes - Know the difference
  5. Test with real data - Japanese, Chinese, emoji, RTL languages
  6. Security implications - Homograph attacks are real

Testing tip: Always test with: η”°δΈ­ε€ͺιƒŽ, MΓΌller, Ω, πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦, مرحبا
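Those sample strings slot directly into an automated smoke test. A minimal sketch; the identity round trip below stands in for whatever save-and-load path your own system has:

```python
import unicodedata

# CJK, Latin diacritics, Greek, ZWJ emoji, and RTL Arabic
SAMPLES = ["η”°δΈ­ε€ͺιƒŽ", "MΓΌller", "Ω", "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦", "مرحبا"]

for s in SAMPLES:
    # UTF-8 round trip must be lossless
    assert s.encode("utf-8").decode("utf-8") == s
    # NFC-normalized comparison must survive decomposition
    assert unicodedata.normalize("NFC", unicodedata.normalize("NFD", s)) == s
```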