String Matching Algorithms

Write summaries of KMP Algorithm, Boyer-Moore string-search algorithm, Rabin-Karp Algorithm

Problem Statement

Given a search string and a pattern , we want to see if the pattern can be found in the string .

Naive Approach

The Naive implementation of the search algorithm will compare the pattern against all possible substrings of that length in the search string .

Therefore, the worst case time complexity is where the complete search space represents performing a comparison search in across all substrings in time.

This gives a time complexity which is equal to the product of the two strings.

However, there are different observations, either in preprocessing, or in the searching method that can reduce this quadratic time complexity to a linear one.

The rest of this article is going to discuss three separate approaches to this problem.

Rabin-Karp Approach

Overview

The Rabin-Karp Algorithm reduces the worst-case time complexity for string matching by using a hash function to perform an approximate check for each positions. Only if an approximate check is found, then an exact comparison can be done.

This can clearly be seen in the following example.

Example

As an example, given the string: and a substring , the Rabin-Karp Algorithm will perform the following computations.

Given a hash function , the .

  1. hash(“LEE”)=“a1”. Don’t need to perform a full check of the substrings.
  2. hash(“EET”)=“a2”. No Check needed.
  3. hash(“ETC”)=“a3”. No Check needed.
  4. hash(“TCO”)=X. The result of the hash of this substring to the pattern is equivalent. Compare in linear time these two strings for a match. Yes, there is a match found.
  5. hash(“COD”)=“a4”. No Check needed.
  6. hash(“ODE”)=“a5”. No Check needed.

An analysis of this approach shows that the expected time of the algorithm is linear in the length of the search string .

Analysis & Considerations

The Rabin-Karp Algorithm uses the notion of hashing to reduce the complete search space by removing the need for performing full substring comparisons on every possible substring.

There are two necessary considerations to ensure this algorithm maintains its improvement in time:

  1. Hashing — This algorithm introduces the need to hash the pattern string as well as every substring of that length. If the hashing algorithm performs this work in time, then the resulting algorithm will run in time. The solution to this is using a Rolling Hash algorithm, which computes in constant time a given hash as the algorithm rolls through the search string.

This Algorithm introduces the need to hash the pattern string as well as every possible substring. This means that in order to maintain a time improvement, this

  1. Hash

Uses hashing to find an exact match of a pattern string in a text.

Uses a rolling hash to quickly filter out positions of text that cannot match the pattern, then checks for match at the remaining positions.

A Hash Function is a function which converts every string into a numeric value, called the hash value.

For example, hash(“hello”)=5. If two strings are equal, their hash values will also be equal.

A well designed Hash Function will ensure that the opposite is approximately true — The has value of two different strings will produce different results.

Rabin-Karp therefore goes through the search text and computes at each position the hash value of a string starting at that position with the same length as the pattern. If the hash values are equal, then the algorithm will perform a full comparison at that position.

However, it is important to note that, as explained in the Naive Approach section, that the comparison search of two strings is . Therefore, the hash function needs to be a computation that is computed faster than in linear time.

That is why the concept of a “Rolling Hash” is important. This asserts that the computation of one string can be computed in linear time from the string immediately preceding it .

If this can be achieved, then this Rabin-Karp approach can reduce the expected time from to . If a poorly designed rolling hash function is used, then while the hashing can be performed in constant time, the algorithm will be checking the strings for which there is a hash collision in time.

Applications

KMP Approach

Reduces worst-case time for string matching by extracting more information from each mismatch, allowing them to skip over positions of the text that are guaranteed not to match.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store