Design Search Engine
Problem
Design a web search engine like Google.
Requirements
- Crawl web pages
- Index content
- Search with ranking
- Scale: Billions of web pages
Key Components
1. Web Crawler
Goal: Download billions of web pages.
Algorithm (BFS):
from collections import deque

frontier = deque()  # URLs to crawl (FIFO order gives BFS)
frontier.append("https://example.com")
visited = set()
while frontier:
    url = frontier.popleft()
    if url in visited:
        continue
    page = download(url)
    visited.add(url)
    # Extract links and enqueue only unvisited ones
    for link in parse_links(page):
        if link not in visited:
            frontier.append(link)
Challenges:
- Politeness: Don't overload servers (rate limit per domain)
- Robots.txt: Respect crawl rules
- De-duplication: Don't crawl the same page twice (URL hash)
Scale: 1000s of crawlers in parallel
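The politeness and robots.txt rules above can be sketched with Python's standard library. A minimal sketch, assuming a fixed per-domain delay; the `CRAWL_DELAY` value, user-agent string, and helper names are illustrative, not part of the design:

```python
import time
from collections import defaultdict
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

CRAWL_DELAY = 1.0              # assumed seconds between hits to one domain
last_hit = defaultdict(float)  # domain → timestamp of the last request

def allowed(url, user_agent="MyCrawler"):
    """Check robots.txt before fetching (in practice, cache the parser per domain)."""
    parts = urlparse(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # network call; a real crawler caches this per domain
    return rp.can_fetch(user_agent, url)

def wait_for_domain(url):
    """Per-domain rate limit: sleep until CRAWL_DELAY has elapsed since the last hit."""
    domain = urlparse(url).netloc
    elapsed = time.time() - last_hit[domain]
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_hit[domain] = time.time()
```

Each crawler worker would call `allowed(url)` and `wait_for_domain(url)` before `download(url)`; at scale the per-domain timestamps live in a shared store rather than local memory.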
2. Indexer
Goal: Build inverted index (word → documents).
from collections import defaultdict

inverted_index = defaultdict(list)  # word → list of doc IDs

# Parse document
doc_id = "page123"
content = "the quick brown fox"

# Tokenize
words = content.split()  # ["the", "quick", "brown", "fox"]

# Build inverted index
for word in words:
    inverted_index[word].append(doc_id)

# Result:
# "quick" → [page123, page456, ...]
# "brown" → [page123, page789, ...]
Storage: Elasticsearch (distributed, sharded)
3. Ranking (PageRank)
Idea: Important pages are linked to by other important pages.
PageRank(A) = (1-d) + d × Σ (PageRank(Ti) / C(Ti))
Where:
- d = damping factor (0.85)
- Ti = pages linking to A
- C(Ti) = number of outbound links from Ti
Example:
Page A is linked by Page B (PageRank = 0.5, 2 outbound links):
PageRank(A) = 0.15 + 0.85 × (0.5 / 2) = 0.3625
Modern: Google uses 200+ ranking signals (PageRank is just one of them)
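The formula above can be applied iteratively over the whole link graph until scores stabilize. A minimal sketch; the three-page graph and iteration count are made up for illustration:

```python
def pagerank(links, d=0.85, iterations=20):
    """links: page → list of pages it links to. Returns page → score.

    Uses the per-page form PR(A) = (1-d) + d * Σ PR(Ti)/C(Ti) from above
    (scores sum to roughly N, not 1)."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {p: 1 - d for p in pages}
        for page, targets in links.items():
            if not targets:
                continue  # dangling page: contributes nothing in this sketch
            share = d * pr[page] / len(targets)  # C(Ti) = outbound link count
            for t in targets:
                new[t] += share
        pr = new
    return pr

# B links to A and C (so C(B) = 2); A and C have no outbound links
graph = {"B": ["A", "C"], "A": [], "C": []}
scores = pagerank(graph)
# B has no inlinks, so PR(B) converges to 1 - d = 0.15
```

Note the difference from the worked example above: there PR(B) was given as 0.5, whereas the iteration converges PR(B) to 0.15 because nothing links to B.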
Search Query
def search(query):
    # 1. Tokenize query
    words = tokenize(query)  # "brown fox" → ["brown", "fox"]
    # 2. Look up each word in the inverted index
    results = set()
    for word in words:
        results.update(inverted_index.get(word, []))
    # 3. Rank results (PageRank + relevance score)
    ranked = rank_results(results, query)
    # 4. Return top 10
    return ranked[:10]
Autocomplete
Trie data structure:
Input: "goo"
Trie lookup: "google", "good", "goose"
Return top 5 by popularity
Storage: Redis (pre-computed popular queries)
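The trie lookup above can be sketched as follows; the word list and popularity counts are illustrative:

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char → TrieNode
        self.count = 0      # query popularity; 0 = not a complete query

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, count):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count = count

    def autocomplete(self, prefix, k=5):
        """Walk to the prefix node, collect completions, return top k by popularity."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        def collect(n, word):
            if n.count:
                results.append((n.count, word))
            for ch, child in n.children.items():
                collect(child, word + ch)
        collect(node, prefix)
        return [w for _, w in sorted(results, reverse=True)[:k]]

trie = Trie()
for word, count in [("google", 100), ("good", 80), ("goose", 20)]:
    trie.insert(word, count)
trie.autocomplete("goo")  # → ["google", "good", "goose"]
```

In production the top-k list is usually pre-computed and cached at each trie node (hence the Redis layer above), so a lookup avoids walking the whole subtree per keystroke.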
Scalability
- Crawling: Distributed (1000s of crawlers)
- Indexing: MapReduce (parallel processing)
- Storage: Sharded by document ID
- Query: Distributed search (query all shards, merge results)
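The "query all shards, merge results" step in the last bullet can be sketched as a k-way merge; the shard hit lists and scores are invented for illustration:

```python
import heapq
import itertools

def merge_shard_results(shard_results, k=10):
    """shard_results: one hit list per shard, each sorted by score descending.
    Lazily merge into a global top-k without sorting all hits."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[1], reverse=True)
    return list(itertools.islice(merged, k))

# Each shard ran the query locally and returned its own top hits
shard1 = [("doc3", 0.9), ("doc1", 0.4)]
shard2 = [("doc7", 0.8), ("doc2", 0.5)]
top = merge_shard_results([shard1, shard2], k=3)
# → [("doc3", 0.9), ("doc7", 0.8), ("doc2", 0.5)]
```

Because each shard only returns its local top-k, the coordinator merges a few small sorted lists instead of re-ranking billions of documents.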
Next: Cheat Sheets - Quick reference for interviews.