Knuth-Morris-Pratt Algorithm
Knuth-Morris-Pratt Algorithm
Category
Famous Algorithms
Difficulty
Very Hard
Problem Statement
Given a text string of length n and a pattern string of length m, find all occurrences of the pattern within the text. The KMP algorithm achieves this in O(n + m) time by preprocessing the pattern to build a failure function (also called the partial match table or prefix function) that allows the search to skip redundant comparisons.
Intuition
In a naive search, when a mismatch occurs after matching several characters, we restart the comparison from the next text position and the beginning of the pattern. This wastes information about characters already matched. KMP observes that the pattern itself contains information about which positions are worth retrying. The failure function lps[i] (longest proper prefix which is also a suffix) tells us: "if a mismatch happens at pattern position i, jump back to position lps[i-1] in the pattern instead of restarting from 0." The text pointer never moves backward.
Approach
Phase 1: Build the failure function (LPS array)
- Create array
lpsof sizem, initialized to 0.lps[i]= length of the longest proper prefix ofpattern[0..i]that is also a suffix. - Use two pointers:
length = 0(length of previous longest prefix-suffix) andi = 1. - While
i < m:- If
pattern[i] == pattern[length]: setlps[i] = length + 1, increment bothiandlength. - Else if
length > 0: setlength = lps[length - 1](fall back, do not incrementi). - Else:
lps[i] = 0, incrementi.
- If
Phase 2: Search using the LPS array
- Use two pointers:
ifor text,jfor pattern, both starting at 0. - While
i < n:- If
text[i] == pattern[j]: increment bothiandj. - If
j == m: match found at indexi - m. Setj = lps[j - 1]to continue searching. - Else if
i < nandtext[i] != pattern[j]:- If
j > 0: setj = lps[j - 1]. - Else: increment
i.
- If
- If
Pseudocode
function buildLPS(pattern):
m = length(pattern)
lps = array of size m, initialized to 0
length = 0
i = 1
while i < m:
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else if length > 0:
length = lps[length - 1]
else:
lps[i] = 0
i += 1
return lps
function KMPSearch(text, pattern):
n = length(text)
m = length(pattern)
if m == 0:
return []
lps = buildLPS(pattern)
matches = []
i = 0 // text pointer
j = 0 // pattern pointer
while i < n:
if text[i] == pattern[j]:
i += 1
j += 1
if j == m:
matches.append(i - m)
j = lps[j - 1]
else if i < n and text[i] != pattern[j]:
if j > 0:
j = lps[j - 1]
else:
i += 1
return matches
Time & Space Complexity
- Time: O(n + m). Building the LPS array takes O(m). The search phase takes O(n). In both phases, while there appear to be nested loops, each pointer advances or the other decreases, and total advances are bounded.
- Space: O(m) for the LPS array (plus O(k) if storing
kmatch positions).
Key Insights
- The text pointer
inever moves backward. This is the fundamental advantage over naive search and is what guarantees O(n) search time. - The LPS array encodes the automaton transitions of the pattern.
lps[j-1]answers: "given thatpattern[0..j-1]matched butpattern[j]did not, what is the longest prefix of the pattern that is still a valid match?" - Building the LPS array is essentially running KMP on the pattern against itself.
- KMP is deterministic and makes exactly
ncomparisons in the search phase (plus at mostnfallback operations), giving a tight 2n upper bound on comparisons. - Unlike Boyer-Moore, KMP does not skip characters in the text; it is optimal for streaming input where you cannot look ahead.
Edge Cases
- Empty pattern: Every position is a match (or return empty, depending on convention).
- Pattern longer than text: No match possible.
- Pattern equals text: One match at index 0.
- Pattern of all identical characters (e.g., "aaaa"): The LPS array is
[0, 1, 2, 3]. Searching in "aaaaaa" correctly finds overlapping matches. - No match exists: The algorithm completes in O(n + m) and returns an empty list.
- Single character pattern: Degenerates to a simple linear scan.
- Overlapping matches (e.g., pattern "aba" in "ababa"): KMP correctly finds matches at indices 0 and 2 because the LPS fallback enables detecting overlaps.