Design Search Engine

Problem

Design a web search engine like Google.

Requirements

  • Crawl web pages
  • Index content
  • Search with ranking
  • Scale: Billions of web pages

Key Components

1. Web Crawler

Goal: Download billions of web pages.

Algorithm (BFS):

from collections import deque

# download() and parse_links() are placeholders for the fetch and
# HTML-parsing steps.
frontier = deque()                # URLs to crawl (FIFO → BFS order)
frontier.append("https://example.com")
visited = set()

while frontier:
    url = frontier.popleft()

    if url in visited:
        continue

    page = download(url)          # fetch the page (HTTP GET)
    visited.add(url)

    # Extract links and enqueue only unseen ones
    for link in parse_links(page):
        if link not in visited:
            frontier.append(link)

Challenges:

  • Politeness: Don't overload servers (rate limit per domain)
  • Robots.txt: Respect crawl rules
  • De-duplication: Don't crawl same page twice (URL hash)

Scale: 1000s of crawlers in parallel
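The politeness rule above can be sketched as a per-domain rate limiter. This is a minimal single-process sketch; `CRAWL_DELAY`, `last_fetch`, and the stub `download()` are illustrative names, not a production design (a distributed crawler would track per-domain state centrally).

```python
import time
from urllib.parse import urlparse

CRAWL_DELAY = 1.0    # seconds between requests to the same domain
last_fetch = {}      # domain → timestamp of the last request


def download(url):
    # Placeholder fetch; a real crawler would issue an HTTP GET here
    return f"<html>{url}</html>"


def polite_download(url):
    domain = urlparse(url).netloc
    elapsed = time.time() - last_fetch.get(domain, 0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)  # wait out the remainder
    last_fetch[domain] = time.time()
    return download(url)
```

Each crawler worker calls `polite_download` instead of `download`, so no single domain is hit more often than once per `CRAWL_DELAY` seconds.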

2. Indexer

Goal: Build inverted index (word → documents).

from collections import defaultdict

inverted_index = defaultdict(list)  # word → doc IDs containing it

# Parse document
doc_id = "page123"
content = "the quick brown fox"

# Tokenize (naive whitespace split; real indexers also lowercase,
# strip punctuation, and drop stop words)
words = content.split()  # ["the", "quick", "brown", "fox"]

# Build inverted index
for word in words:
    inverted_index[word].append(doc_id)

# Result:
# "quick" → [page123, page456, ...]
# "brown" → [page123, page789, ...]

Storage: Elasticsearch (distributed, sharded)

3. Ranking (PageRank)

Idea: Important pages are linked to by other important pages.

PageRank(A) = (1-d) + d × Σ (PageRank(Ti) / C(Ti))

Where:
- d = damping factor (0.85)
- Ti = pages linking to A
- C(Ti) = number of outbound links from Ti

Example:

Page A is linked to by Page B (PageRank = 0.5, 2 outbound links):
PageRank(A) = 0.15 + 0.85 × (0.5 / 2) = 0.3625
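The formula above is usually computed iteratively until the ranks converge. A minimal sketch, assuming an in-memory link graph where `links` maps each page to the pages it links out to (all names illustrative):

```python
def pagerank(links, d=0.85, iterations=20):
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # initial rank for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum contributions PR(Ti) / C(Ti) from each page Ti
            # that links to this page
            incoming = sum(
                pr[src] / len(outs)
                for src, outs in links.items()
                if page in outs
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr
```

For a two-page cycle (A → B, B → A) the ranks stabilize at 1.0 each, since every page receives exactly what it gives away.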

Modern note: Google uses 200+ ranking signals; PageRank is only one of them.

Search Query

def search(query):
    # 1. Tokenize query
    words = tokenize(query)  # "brown fox" → ["brown", "fox"]

    # 2. Look up each word in the inverted index
    results = set()
    for word in words:
        docs = inverted_index[word]
        results.update(docs)

    # 3. Rank results (PageRank + relevance score)
    ranked = rank_results(results, query)

    # 4. Return top 10
    return ranked[:10]

Autocomplete

Trie data structure:

Input: "goo"
Trie lookup: "google", "good", "goose"
Return top 5 by popularity

Storage: Redis (pre-computed popular queries)
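The trie lookup above can be sketched as follows. This is a minimal in-memory version; a production autocomplete would store popularity counts at each node and return only the top few completions:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char → child node
        self.word = None     # set when a complete query ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def starts_with(self, prefix):
        # Walk down to the prefix node, then collect all words below it
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [node]
        while stack:
            n = stack.pop()
            if n.word:
                results.append(n.word)
            stack.extend(n.children.values())
        return results
```

With "google", "good", and "goose" inserted, `starts_with("goo")` returns all three; ranking them by popularity is the final step.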

Scalability

  • Crawling: Distributed (1000s of crawlers)
  • Indexing: MapReduce (parallel processing)
  • Storage: Sharded by document ID
  • Query: Distributed search (query all shards, merge results)
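The query step above is a scatter-gather: fan the query out to every index shard, then merge the partial results. A minimal sketch, assuming each shard object exposes a `search(query, k)` method returning `(score, doc_id)` pairs (names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor


def distributed_search(shards, query, k=10):
    # Scatter: ask every shard for its local top-k hits in parallel
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda shard: shard.search(query, k), shards)

    # Gather: merge the partial lists and keep the global top-k by score
    merged = [hit for part in partials for hit in part]
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:k]
```

Because each shard returns only its top k, the merge step handles at most `k × num_shards` candidates regardless of total index size.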

Next: Cheat Sheets - Quick reference for interviews.