Knuth-Morris-Pratt Algorithm

Knuth-Morris-Pratt Algorithm

Category

Famous Algorithms

Difficulty

Very Hard

Problem Statement

Given a text string of length n and a pattern string of length m, find all occurrences of the pattern within the text. The KMP algorithm achieves this in O(n + m) time by preprocessing the pattern to build a failure function (also called the partial match table or prefix function) that allows the search to skip redundant comparisons.

Intuition

In a naive search, when a mismatch occurs after matching several characters, we restart the comparison from the next text position and the beginning of the pattern. This wastes information about characters already matched. KMP observes that the pattern itself contains information about which positions are worth retrying. The failure function lps[i] (longest proper prefix which is also a suffix) tells us: "if a mismatch happens at pattern position i, jump back to position lps[i-1] in the pattern instead of restarting from 0." The text pointer never moves backward.

Approach

Phase 1: Build the failure function (LPS array)

  1. Create array lps of size m, initialized to 0. lps[i] = length of the longest proper prefix of pattern[0..i] that is also a suffix.
  2. Use two pointers: length = 0 (length of previous longest prefix-suffix) and i = 1.
  3. While i < m:
    • If pattern[i] == pattern[length]: set lps[i] = length + 1, increment both i and length.
    • Else if length > 0: set length = lps[length - 1] (fall back, do not increment i).
    • Else: lps[i] = 0, increment i.

Phase 2: Search using the LPS array

  1. Use two pointers: i for text, j for pattern, both starting at 0.
  2. While i < n:
    • If text[i] == pattern[j]: increment both i and j.
    • If j == m: match found at index i - m. Set j = lps[j - 1] to continue searching.
    • Else if i < n and text[i] != pattern[j]:
      • If j > 0: set j = lps[j - 1].
      • Else: increment i.

Pseudocode

function buildLPS(pattern):
    m = length(pattern)
    lps = array of size m, initialized to 0
    length = 0
    i = 1

    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else if length > 0:
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1

    return lps

function KMPSearch(text, pattern):
    n = length(text)
    m = length(pattern)
    if m == 0:
        return []

    lps = buildLPS(pattern)
    matches = []
    i = 0  // text pointer
    j = 0  // pattern pointer

    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
        if j == m:
            matches.append(i - m)
            j = lps[j - 1]
        else if i < n and text[i] != pattern[j]:
            if j > 0:
                j = lps[j - 1]
            else:
                i += 1

    return matches

Time & Space Complexity

  • Time: O(n + m). Building the LPS array takes O(m). The search phase takes O(n). In both phases, while there appear to be nested loops, each pointer advances or the other decreases, and total advances are bounded.
  • Space: O(m) for the LPS array (plus O(k) if storing k match positions).

Key Insights

  • The text pointer i never moves backward. This is the fundamental advantage over naive search and is what guarantees O(n) search time.
  • The LPS array encodes the automaton transitions of the pattern. lps[j-1] answers: "given that pattern[0..j-1] matched but pattern[j] did not, what is the longest prefix of the pattern that is still a valid match?"
  • Building the LPS array is essentially running KMP on the pattern against itself.
  • KMP is deterministic and makes exactly n comparisons in the search phase (plus at most n fallback operations), giving a tight 2n upper bound on comparisons.
  • Unlike Boyer-Moore, KMP does not skip characters in the text; it is optimal for streaming input where you cannot look ahead.

Edge Cases

  • Empty pattern: Every position is a match (or return empty, depending on convention).
  • Pattern longer than text: No match possible.
  • Pattern equals text: One match at index 0.
  • Pattern of all identical characters (e.g., "aaaa"): The LPS array is [0, 1, 2, 3]. Searching in "aaaaaa" correctly finds overlapping matches.
  • No match exists: The algorithm completes in O(n + m) and returns an empty list.
  • Single character pattern: Degenerates to a simple linear scan.
  • Overlapping matches (e.g., pattern "aba" in "ababa"): KMP correctly finds matches at indices 0 and 2 because the LPS fallback enables detecting overlaps.