Design Search Engine
Problem
Design a web search engine like Google.
Requirements
- Crawl web pages
- Index content
- Search with ranking
- Scale: Billions of web pages
Key Components
1. Web Crawler
Goal: Download billions of web pages.
Algorithm (BFS):
from collections import deque

frontier = deque()  # URLs to crawl (FIFO order gives BFS)
frontier.append("https://example.com")
visited = set()
while frontier:
    url = frontier.popleft()
    if url in visited:
        continue
    page = download(url)
    visited.add(url)
    # Extract links and enqueue only unvisited ones
    for link in parse_links(page):
        if link not in visited:
            frontier.append(link)
Challenges:
- Politeness: Don't overload servers (rate limit per domain)
- Robots.txt: Respect crawl rules
- De-duplication: Don't crawl the same page twice (URL hash)
Scale: 1000s of crawlers in parallel
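The politeness and robots.txt rules above can be sketched with Python's standard library. A minimal sketch, assuming a fixed per-domain delay; the `CRAWL_DELAY` value, user-agent string, and helper names are illustrative, not part of the design:

```python
import time
from collections import defaultdict
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

CRAWL_DELAY = 1.0              # assumed seconds between hits to one domain
last_hit = defaultdict(float)  # domain → timestamp of the last request

def allowed(url, user_agent="MyCrawler"):
    """Check robots.txt before fetching (in practice, cache the parser per domain)."""
    parts = urlparse(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # network call; a real crawler caches this per domain
    return rp.can_fetch(user_agent, url)

def wait_for_domain(url):
    """Per-domain rate limit: sleep until CRAWL_DELAY has elapsed since the last hit."""
    domain = urlparse(url).netloc
    elapsed = time.time() - last_hit[domain]
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_hit[domain] = time.time()
```

Each crawler worker would call `allowed(url)` and `wait_for_domain(url)` before `download(url)`; at scale the per-domain timestamps live in a shared store rather than local memory.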
2. Indexer
Goal: Build inverted index (word → documents).
from collections import defaultdict

inverted_index = defaultdict(list)  # word → list of doc IDs

# Parse document
doc_id = "page123"
content = "the quick brown fox"

# Tokenize
words = content.split()  # ["the", "quick", "brown", "fox"]

# Build inverted index
for word in words:
    inverted_index[word].append(doc_id)

# Result:
# "quick" → [page123, page456, ...]
# "brown" → [page123, page789, ...]
Storage: Elasticsearch (distributed, sharded)
3. Ranking (PageRank)
Idea: Important pages are linked to by other important pages.
PageRank(A) = (1-d) + d × Σ (PageRank(Ti) / C(Ti))
Where:
- d = damping factor (0.85)
- Ti = pages linking to A
- C(Ti) = number of outbound links from Ti
Example:
Page A is linked by Page B (PageRank = 0.5, 2 outbound links):
PageRank(A) = 0.15 + 0.85 × (0.5 / 2) = 0.3625
Modern: Google uses 200+ ranking signals (PageRank is just one of them)
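The formula above can be applied iteratively over the whole link graph until scores stabilize. A minimal sketch; the three-page graph and iteration count are made up for illustration:

```python
def pagerank(links, d=0.85, iterations=20):
    """links: page → list of pages it links to. Returns page → score.

    Uses the per-page form PR(A) = (1-d) + d * Σ PR(Ti)/C(Ti) from above
    (scores sum to roughly N, not 1)."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {p: 1 - d for p in pages}
        for page, targets in links.items():
            if not targets:
                continue  # dangling page: contributes nothing in this sketch
            share = d * pr[page] / len(targets)  # C(Ti) = outbound link count
            for t in targets:
                new[t] += share
        pr = new
    return pr

# B links to A and C (so C(B) = 2); A and C have no outbound links
graph = {"B": ["A", "C"], "A": [], "C": []}
scores = pagerank(graph)
# B has no inlinks, so PR(B) converges to 1 - d = 0.15
```

Note the difference from the worked example above: there PR(B) was given as 0.5, whereas the iteration converges PR(B) to 0.15 because nothing links to B.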
Search Query
def search(query):
    # 1. Tokenize query
    words = tokenize(query)  # "brown fox" → ["brown", "fox"]
    # 2. Look up each word in the inverted index
    results = set()
    for word in words:
        results.update(inverted_index.get(word, []))
    # 3. Rank results (PageRank + relevance score)
    ranked = rank_results(results, query)
    # 4. Return top 10
    return ranked[:10]
Autocomplete
Trie data structure:
Input: "goo"
Trie lookup: "google", "good", "goose"
Return top 5 by popularity
Storage: Redis (pre-computed popular queries)
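The trie lookup above can be sketched as follows; the word list and popularity counts are illustrative:

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char → TrieNode
        self.count = 0      # query popularity; 0 = not a complete query

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, count):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count = count

    def autocomplete(self, prefix, k=5):
        """Walk to the prefix node, collect completions, return top k by popularity."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        def collect(n, word):
            if n.count:
                results.append((n.count, word))
            for ch, child in n.children.items():
                collect(child, word + ch)
        collect(node, prefix)
        return [w for _, w in sorted(results, reverse=True)[:k]]

trie = Trie()
for word, count in [("google", 100), ("good", 80), ("goose", 20)]:
    trie.insert(word, count)
trie.autocomplete("goo")  # → ["google", "good", "goose"]
```

In production the top-k list is usually pre-computed and cached at each trie node (hence the Redis layer above), so a lookup avoids walking the whole subtree per keystroke.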
Scalability
- Crawling: Distributed (1000s of crawlers)
- Indexing: MapReduce (parallel processing)
- Storage: Sharded by document ID
- Query: Distributed search (query all shards, merge results)
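The "query all shards, merge results" step in the last bullet can be sketched as a k-way merge; the shard hit lists and scores are invented for illustration:

```python
import heapq
import itertools

def merge_shard_results(shard_results, k=10):
    """shard_results: one hit list per shard, each sorted by score descending.
    Lazily merge into a global top-k without sorting all hits."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[1], reverse=True)
    return list(itertools.islice(merged, k))

# Each shard ran the query locally and returned its own top hits
shard1 = [("doc3", 0.9), ("doc1", 0.4)]
shard2 = [("doc7", 0.8), ("doc2", 0.5)]
top = merge_shard_results([shard1, shard2], k=3)
# → [("doc3", 0.9), ("doc7", 0.8), ("doc2", 0.5)]
```

Because each shard only returns its local top-k, the coordinator merges a few small sorted lists instead of re-ranking billions of documents.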
Next: Cheat Sheets - Quick reference for interviews.