String Matching Algorithms
Write summaries of KMP Algorithm, Boyer-Moore string-search algorithm, Rabin-Karp Algorithm
Problem Statement
Given a search string N='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
and a pattern M='DEF'
, we want to see if the pattern M
can be found in the string N
.
Naive Approach
The Naive implementation of the search algorithm will compare the pattern M
against all possible substrings of that length in the search string N
.
Therefore, the worst case time complexity is O(M*N)
where the complete search space represents performing a comparison search in O(M)
across all substrings in O(N)
time.
This gives a time complexity which is equal to the product of the two strings.
However, there are different observations, either in preprocessing, or in the searching method that can reduce this quadratic time complexity to a linear one.
The rest of this article is going to discuss three separate approaches to this problem.
Rabin-Karp Approach
Overview
The Rabin-Karp Algorithm reduces the worst-case time complexity for string matching by using a hash function to perform an approximate check for each positions. Only if an approximate check is found, then an exact comparison can be done.
This can clearly be seen in the following example.
Example
As an example, given the string: LEETCODE
and a substring TCO
, the Rabin-Karp Algorithm will perform the following computations.
Given a hash function hash()
, the hash("TCO")=X
.
- hash(“LEE”)=“a1”. Don’t need to perform a full check of the substrings.
- hash(“EET”)=“a2”. No Check needed.
- hash(“ETC”)=“a3”. No Check needed.
- hash(“TCO”)=X. The result of the hash of this substring to the pattern is equivalent. Compare in linear time
O(M)
these two strings for a match. Yes, there is a match found. - hash(“COD”)=“a4”. No Check needed.
- hash(“ODE”)=“a5”. No Check needed.
An analysis of this approach shows that the expected time of the algorithm is linear in the length of the search string O(N)
.
Analysis & Considerations
The Rabin-Karp Algorithm uses the notion of hashing to reduce the complete search space by removing the need for performing full substring comparisons on every possible substring.
There are two necessary considerations to ensure this algorithm maintains its improvement in time:
- Hashing — This algorithm introduces the need to hash the pattern string as well as every substring of that length. If the hashing algorithm performs this work in
O(M)
time, then the resulting algorithm will run inO(M*N)
time. The solution to this is using a Rolling Hash algorithm, which computes in constant time a given hash as the algorithm rolls through the search string.
This Algorithm introduces the need to hash the pattern string as well as every possible substring. This means that in order to maintain a time improvement, this
- Hash
Uses hashing to find an exact match of a pattern string in a text.
Uses a rolling hash to quickly filter out positions of text that cannot match the pattern, then checks for match at the remaining positions.
A Hash Function is a function which converts every string into a numeric value, called the hash value.
For example, hash(“hello”)=5. If two strings are equal, their hash values will also be equal.
A well designed Hash Function will ensure that the opposite is approximately true — The has value of two different strings will produce different results.
Rabin-Karp therefore goes through the search text and computes at each position the hash value of a string starting at that position with the same length as the pattern. If the hash values are equal, then the algorithm will perform a full comparison at that position.
However, it is important to note that, as explained in the Naive Approach section, that the comparison search of two strings is O(M)
. Therefore, the hash function needs to be a computation that is computed faster than in linear time.
That is why the concept of a “Rolling Hash” is important. This asserts that the computation of one string str[i,j]
can be computed in linear time from the string immediately preceding it str[i-1,j-1]
.
If this can be achieved, then this Rabin-Karp approach can reduce the expected time from O(M*N)
to O(N)
. If a poorly designed rolling hash function is used, then while the hashing can be performed in constant time, the algorithm will be checking the strings for which there is a hash collision in O(M)
time.
Applications
KMP Approach
Reduces worst-case time for string matching by extracting more information from each mismatch, allowing them to skip over positions of the text that are guaranteed not to match.