String Matching Algorithms

Write summaries of KMP Algorithm, Boyer-Moore string-search algorithm, Rabin-Karp Algorithm

Problem Statement

Naive Approach

Therefore, the worst case time complexity is O(M*N) where the complete search space represents performing a comparison search in O(M) across all substrings in O(N) time.

This gives a time complexity which is equal to the product of the two strings.

However, there are different observations, either in preprocessing, or in the searching method that can reduce this quadratic time complexity to a linear one.

The rest of this article is going to discuss three separate approaches to this problem.

Rabin-Karp Approach

Overview

This can clearly be seen in the following example.

Example

Given a hash function hash(), the hash("TCO")=X.

  1. hash(“LEE”)=“a1”. Don’t need to perform a full check of the substrings.
  2. hash(“EET”)=“a2”. No Check needed.
  3. hash(“ETC”)=“a3”. No Check needed.
  4. hash(“TCO”)=X. The result of the hash of this substring to the pattern is equivalent. Compare in linear time O(M) these two strings for a match. Yes, there is a match found.
  5. hash(“COD”)=“a4”. No Check needed.
  6. hash(“ODE”)=“a5”. No Check needed.

An analysis of this approach shows that the expected time of the algorithm is linear in the length of the search string O(N).

Analysis & Considerations

There are two necessary considerations to ensure this algorithm maintains its improvement in time:

  1. Hashing — This algorithm introduces the need to hash the pattern string as well as every substring of that length. If the hashing algorithm performs this work in O(M) time, then the resulting algorithm will run in O(M*N) time. The solution to this is using a Rolling Hash algorithm, which computes in constant time a given hash as the algorithm rolls through the search string.

This Algorithm introduces the need to hash the pattern string as well as every possible substring. This means that in order to maintain a time improvement, this

  1. Hash

Uses hashing to find an exact match of a pattern string in a text.

Uses a rolling hash to quickly filter out positions of text that cannot match the pattern, then checks for match at the remaining positions.

A Hash Function is a function which converts every string into a numeric value, called the hash value.

For example, hash(“hello”)=5. If two strings are equal, their hash values will also be equal.

A well designed Hash Function will ensure that the opposite is approximately true — The has value of two different strings will produce different results.

Rabin-Karp therefore goes through the search text and computes at each position the hash value of a string starting at that position with the same length as the pattern. If the hash values are equal, then the algorithm will perform a full comparison at that position.

However, it is important to note that, as explained in the Naive Approach section, that the comparison search of two strings is O(M). Therefore, the hash function needs to be a computation that is computed faster than in linear time.

That is why the concept of a “Rolling Hash” is important. This asserts that the computation of one string str[i,j] can be computed in linear time from the string immediately preceding it str[i-1,j-1].

If this can be achieved, then this Rabin-Karp approach can reduce the expected time from O(M*N) to O(N). If a poorly designed rolling hash function is used, then while the hashing can be performed in constant time, the algorithm will be checking the strings for which there is a hash collision in O(M) time.

Applications

KMP Approach

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store