strings3

Version 09/03/2018

3 Substring Search

Given a (relatively long) string s and a (relatively short) string p, finding a copy of p in s is the "substring search" problem. The result of the search is, by convention, the index of the first instance of p in s (if successful) or the length of s (if unsuccessful). p is called the search string (or sometimes search pattern) and s is the search space (or sometimes search text). It should be clear that solutions to the substring search problem have many important applications embedded in our digital infrastructure.

Generally the substring search problem is defined over any alphabet. We will mostly discuss the problem in the context of alphabets that are subsets of EXTENDED_ASCII, and whenever generalization to more general alphabets is not straightforward we will discuss the issues.

3.1 Conventional Algorithms

Several useful substring search algorithms follow the "conventional pattern" as follows:

Starting at the beginning of both s and p, check whether the characters in p and s are equal.
If all characters in p are equal to those in s, return the location in s, otherwise slide p forward under s and start checking again.
If we get to a place in s where there are not enough characters left to match up with p, return the length of s.

The "conventional" algorithms follow this solution outline, and they differ mainly in the way it is determined where in s to start a new character-by-character check. For all of these, let n be the length of s and m be the length of p.

3.2 Brute Force

The brute force version of these algorithms is a model on which the others are built.


size_t BruteForce (const char* p, size_t m, const char* s, size_t n)
{
  for (size_t i = 0; i <= n - m; ++i)
  {
    size_t j;
    for (j = 0; j < m; ++j)  // check pattern match loop
    {
      if (s[i+j] != p[j]) // match failure at p[j] (exit loop early)
        break;      
    }
    if (j == m)   // match success (loop runs to completion)
      return i;   // starting index of pattern check loop
  }
  return n;    // outer loop runs to completion with no match
}

The BF algorithm has two drawbacks.

First note that the worst-case runtime of BruteForce is Ω(m*n), which happens when the inner loop runs to completion on all or most characters in s. For random strings this is extremely unlikely, but it can happen, and is a realistic possibility when the search space contains long substrings of repeated characters. Clearly we would like to improve on this.

The second issue is that the algorithm requires "backing up" in the search string, which happens whenever we break from the inner loop and start the match loop over at the next index in s. You can visualize the "backup" process by thinking of the pattern string underneath the search space and checking characters as they are lined up vertically:

  i
ABCADBCABCABCABCABCDABCABCDABC ...
ABCD

(Convention here is that the gray data has not yet been seen by the algorithm.)
The check fails at p[2] != s[2] so we slide p forward one place and start over:

 i
ABCADBCABCABCABCABCDABCABCDABC ...
 ABCD

Note that we left off at s[2] = C but pick up again at s[1] = B ... a "backup".

BruteForce can be refactored to make the backup explicit, and also to provide a more convenient code platform to expand to other algorithms. A straightforward substitution of variables k = i+j (k is the current location in s) facilitates a change of view from that of the pattern to that of the search space and makes the "backup" in s explicit:

size_t BruteForce (const char* p, size_t m, const char* s, size_t n)
{
  size_t i, j;
  for (i = 0, j = 0; i < n && j < m; ++i)
  {
    if (s[i] == p[j])
    {
      ++j;
    }
    else 
    {
      i -= j;  // backup in s is now explicit here
      j = 0;
    }
  }
  if (j == m)
    return i - m;
  return n;
}

(where we have substituted i for k).

Backing up in the search space can be inconvenient, such as for example when we are searching an input stream for a pattern.

So there are two desireable properties we seek in a substring search algorithm that improve on BruteForce:

Worst case runtime on the order of n + m
Backup in the search space not required

Note that we could not expect worst case runtime to be smaller than Ω(n + m), because we would need to check every character in p to be certain of success and every character in s to be certain of failure.

3.3 KMP

The KMP [Donald Knuth, James Morrison, Vaughan Pratt] substring search algorithm achieves both goals. The basic idea is straightforward: At the point at which we find a mis-match, we can use knowledge of the pattern and the portion of the search space already examined to calculate where in the pattern to restart. Here is an example:

Suppose we have

        i
... AAAABABCADBCABCABCABCABCDABCABCDABC ...
    AAAAC

Then when we find the mismatch s[i] != p[j] we can restart the check at s[i+1],p[0] because we have just discovered that s[i] != p[k] for k in 0..j. Thus instead of "backing up" in s by making i smaller we can increment i and position p appropriately:

         i
... AAAABABCADBCABCABCABCABCDABCABCDABC ...
         AAAAC

Of course, how p is positioned depends on the data. Here is another example:

        i
... ABABABABCADBCABCABCABCABCDABCABCDABC ...  [mismatch]
... ABABC

results in the following steps to success:

         i
... ABABABABCADBCABCABCABCABCDABCABCDABC ...  [match]
      ABABC
          i
... ABABABABCADBCABCABCABCABCDABCABCDABC ...  [mismatch]
      ABABC
           i
... ABABABABCADBCABCABCABCABCDABCABCDABC ...  [match]
        ABABC
            i
... ABABABABCADBCABCABCABCABCDABCABCDABC ...  [match - success]
        ABABC

Generally, KMP takes advantage of the sub-patterns in p to calculate a new value for j to work with an increment of i. The result is that either we have encountered a sub-pattern and so have a partial match already verified (as in the second example above) or we can leap forward to start a new match loop at the new value for i (as in the first example). It is not difficult to believe that calculating the change in j is somewhat complicated.

3.3.1 KMP Original Recipe

There are two versions of the kmp, both introduced in the original paper. The first constructs a mapping

dfa_ : A×P -> { 0 1 ... p }

(where A is the alphabet and p is the size of the pattern) such that: if c is the character at s[i] and j is the current check position in the pattern then d(c,j) is the pattern index where the match-check should (re)start.

This mapping is effectively a deterministic finite automaton. Virtually all of the search processing logic is captured in the dfa, making the search itself a "simple" loop.

Alphabet a;
Matrix <size_t> dfa_(0,0);

size_t Search_dfa (const char* p, size_t m, const char* s, size_t n)
{
  Build_dfa(p,a);
  size_t i, j;
  for (i = 0, j = 0; i < n && j < m; ++i)
  {
    size_t c = (size_t)(s[i]); // only for readability of the next line
    j = dfa_[c][j];
  }
  if (j == m)
    return i - m;
  return n;
}

void Build_dfa(const char* p, Alphabet a)
{
  size_t r = a.size, m = p.length;
  dfa_.SetSize(r,m,0);  // note: init all values to 0
  dfa_[p[0]][0] = 1;
  for (size_t i = 0, j = 1; j < m; ++j)
  {
    for (size_t c = 0; c < r; ++c)
      dfa_[c][j] = dfa_[c][i];
    dfa_[p[j]][j] = j+1;
    i = dfa_[p[j]][i];
  } 
}

Demo: See notes_support/fkmp.x for a demonstration executable. This program requires string command line arguments search_pattern, search_space and accepts a third argument unsigned verbosity that controls 3 levels of output information: 0 = silent, 1 = proof, 2 = trace. (Entering the command without arguments will produce a reminder prompt.) The following is a recommended run (copy/paste to your linux prompt):

fkmp.x ABABBABBBABBB ABABAABABABABBABABBAABABBABABABBABBABABBABBBABABBABBBAABABBABBBABABABBABBBABBABABBABBBABBBABABBABBBABBBBABAB 2 > trace
more trace

The trace shows the increasingly partially successful attempts to match the pattern up to complete success.

Construction of the mapping dfa_ is extremely clever, and a proof that dfa_ correctly produces the optimal place to re-start the match loop requires a considerable amount of expository work to be comprehensible. Once dfa_ is constructed, however, it is clear by inspection that KMP Search has runtime Θ(n). It is also clear that Build_dfa has runtime Θ(mr). There are two ways to view the package as an algorithm:

If we want to search for a fixed pattern p in a bag of unkown texts (or a continuous stream of input), we think of Build_dfa as a pre-processing step with runtime Θ(mr) and Search_dfa as a search algorithm with runtime Θ(n).
If we want to do a single search for a pattern p in a text s, we think of Build_dfa and Search_dfa as a two-step search algorithm, and the total search runtime is Θ(mr) + Θ(n) = Θ(mr + n).

For a fixed size alphabet, running the algorithm as a single search for a pattern p in a text s, the runtime is Θ(m + n).

(discussion of dfa and proofs)

The only potential drawback to Original KMP is the requirement that the dfa is defined for all characters in the alphabet. Two consequences of this are that construction of the dfa has runtime Θ(mr) (where r is the alphabet size) and dfa_ requires space +Θ(mr). Back in the day of Knuth, Morris, & Pratt this was not an issue, because 256 was an upper bound for r. But now we have UNICODE16 with r = 65536.

3.3.2 Improved KMP

Looking at the KMP Search loop, it is clear that whenever c = toIndex(s[i]) is not a character in the pattern p, we need to restart the check-loop at p[0], in other words 0 = dfa_[c][j]. For a large alphabet and a pattern that uses only a few characters, using this rule instead of computing dfa_[c][j] for all c not represented in p can save significant time constructing dfa_ as well as space storing it while the algorithm is running. K,M, & P found a way around this with a re-worked match algorithm and a smaller "partial match table" that returns the length of the longest possible proper initial segment of p which is also a segment of the substring ending at p[i - 1]. The partial match function is a mapping

pmt_ : P -> { 0 1 2 ... p }

with significantly smaller domain than dfa_. The search algorithm logic is a little more complex due to the lack of direct information on a mismatch:

Vector<size_t> pmt_(0);

size_t Search_pmt (const char* p, size_t m, const char* s, size_t n) const
{
  Build_pmt(p);
  size_t i = 0, j = 0;
  while (i+j < n)
  {
    if (p[j] == s[i+j])  // match
    {
      if (j == m - 1)    // if at end of p
      {
        return i;          // return location of match
      }
      ++j;               // go to next char
    }
    else
    {
      if (pmt_[j] < m)
      {
        i = i + j - pmt_[j];
        j = pmt_[j];
      }
      else
      {
        j = 0;
        ++i;
      }
    } 
  }
  return n;
}

void Build_pmt(const char* p, size_t m)
{
  pmt_.SetSize(m,0);  // note: init all values to 0
  pmt_[0] = m;  // signals mismatch at p[0]
  pmt_[1] = 0;
  size_t c = 0, j = 2;
  while (j < m)
  {
    if (p[j-1] == p[c])
    {
      ++c;
      pmt_[j] = c;
      ++j;
    }
    else if (c > 0 && c < m)
    {
      c = pmt_[c];
    }
    else
    {
      pmt_[j] = 0;
      ++j;
    }
  }
}

Demo: See notes_support/fkmp.x for a demonstration executable. The trace facility is built around the dfa_ solution, but both dfa_ and pmt_ versions are calculated.

3.3.3 Classic v New

The two versions of KMP search have many characteristics in common, but there are differences that may be important in selecting one version over the other. Some of these are self-evident and some are quite subtle. Here is a summary, using r = size of alphabet, m = size of pattern, and n = size of search space:

Both Search_dfa and Search_pmt have runtime Θ(n). This is fairly straightforward to prove for Search_dfa, but less so for Search_pmt. We will return to this assertion later.
Nevertheless, Search_dfa is faster in practice. It has much simpler code: a simple loop that executes unconditionally and with only a pair of array accesses and no logical tests or branching in the loop body. Even if the main loops in the two algorithms execute the same number of times, the more complex logic in the loop body of Search_pmt will slow down execution.
Search_dfa requires space +Θ(rm), whereas Search_pmt requires space +Θ(m): Search_dfa requires more by a factor of r.
Build_dfa is slower than Build_pmt by a factor of r.

Generally, then, the larger the alphabet the more favorable the new version is over the classic. But for small alphabets, such as binary [important for many things such as forensics and espionage] and DNA [important in biomedical computing], the classic form is probably the way to go.

3.3.4 Runtime

3.3.5 Correctness

3.3.6 References

The original KMP paper was published in 1977 a couple of years after discovery [1]. The dfa version above is exposed in excellent detail in the Sedgewick lecture notes [2] and latest Algorithms text [3]. The pmt version is discussed in the Cormen at al text [4].

[1] Knuth, Donald; Morris, James H., Jr; Pratt, Vaughan (1977). "Fast pattern matching in strings". SIAM Journal on Computing 6 (2): 323-350. (Available here.)

[2] Sedgewick, Robert (2012). "Lecture slides 5.3: Substring Search". Princeton University.

[3] Sedgewick, Robert; Wayne, Kevin (2011). Algorithms 4e. Pearson / Addison-Wesley.

[4] Cormen, T.H.; Leiserson, C.E; Rivest, R.L.; and Stein, C. (2009). Introduction to Algorithms 3e. MIT Press, Cambridge, MA.

3.4 Rabin-Karp

A complete departure from the conventional substring search algorithms was introduced by Richard Karp and Michael Rabin in 1987. As with many game-changing ideas, the concept is quite simple: Suppose we have a "fingerprint" mechanism FP for strings that can be calculated efficiently. Then we can calculate the fingerprint of a search string p and compare it with the fingerprints of successive substrings in the text s and if we find a substring with matching fingerprint, return the starting index of the substring. The general structure of the algorithm is then:

let w[k] represent the "rolling window" sub-string of s with characters s[k] ... s[k+m-1] 
for (k = 0; k < n-m; ++k)
{
  if (FP(w[k]) == FP(p))
    return k;
}
return n;

This disruptive idea immediately brings two practical questions:

How do we know that strings with the same fingerpint are equal?
How can we calculate fingerprints efficiently?

The answer to question 1 is - we cannot make such a guarantee, but we can assert the conclusion to be correct with very low probability of error. Moreover, we can always verify the conclusion with a simple loop of length m. We will return to these concepts after detailing an answer to the second question.

The "fingerprint" of a string will be a special hash function that, once calculated for w[0], can be updated for w[1], w[2], ..., w[n-m] in constant time.

The FP of p and w[0] are needed to make the first test. Then each subsequent iteration of the loop runs in constant time. So we can assert that the best case runtime is Ω(m) and the worst case runtime is O(n). In addition there is only constant space overhead and the Monte Carlo version of the algorithm requires no backing up in the search space. (The verification loop does require backing up from s[k+m-1] to s[k].)

3.4.1 Modular Hashing So-called modular hashing is fairly straightforward to calculate and has proven to have excellent "random" characteristics, producing Analysis distributions for hash tables that closely align with the theoretical best case of binary distributions. For strings this function is calculated at each character, as follows: unsigned long P = (a prime number); unsigned long R = (size of alphabet); unsigned long Hash (const char* s) { unsigned long hash = 0; for (k = 0; k < Length(s); ++k) { hash = (hash*R + (unsigned)s[k]) % P; } return hash; } Note that if we write a string s as a base-R number, we have: int(s) = int(s[k-1]) + int(s[k-2])*R + int(s[k-3])*R² + ... + int(s[0])*R^k-1 Because the modulo operation distributes over addition and multiplication: (a + b) mod p = (a mod p) + (b mod p) and (a * b) mod p = (a mod p) * (b mod p) we can conclude that Hash(s) = int(s) % P For hash tables, P would be the number of buckets in the table, and so approximately the size of the table, in order to have small load factor. For the current application, we will not build a table, so we may take P to be large. The values calculated are distributed in the interval [0,P), and assuming they are uniformly distributed, the probability that two different strings have the same hash value is 1/P. We take P to be a prime on the order of 10²⁰ to achieve probability of error (hash collision) to be 10^-20 = 0.00000000000000000001. 3.4.2 Rolling Fingerprint Update Now consider the "window" substring w at position k in a text s. w would have the following representation as a base R number: w[k] = s[k+m-1] + s[k+m-2]*R + s[k+m-3]*R² + ... + s[k]*R^m-1 and the next window at position k+1 is w[k+1] = s[k+m] + s[k+m-1]*R + s[k+m-2]*R² + ... + s[k+1]*R^m-1 Inspecting these we obtain w[k]*R - s[k]*R^m = s[k+m-1]*R + s[k+m-2]*R² + s[k+m-3]*R³ + ... + s[k+1]*R^m-1 and w[k+1] - s[k+m] = s[k+m-1]*R + s[k+m-2]*R² + s[k+m-3]*R³ + ... + s[k+1]*R^m-1 which leads to w[k+1] = w[k]*R - s[k]*R^m + s[k+m] = (w[k] - s[k]*R^m-1)*R + s[k+m] Because the modulo operation distributes, the same formula applies to the hash values: hash(w[k+1]) = ((hash(w[k]) - s[k]*R^m-1)*R + s[k+m]) % P which provides the constant time update of the hash values of the rolling window in the string. 3.4.3 Rabin-Karp Algorithm Applying the reasoning established above, we can now fill in details of the algorithm. The search first computes the fingerprint of the initial string in s - to get started, there is no way to "roll" the calculation - and checks for a match at this initial position. Without a match, the algorithm continues with a rolling update of the text hash and a compare. The logic ensures that if "vegas" is true then Verify is called before returning an index. const uint64_t alphabet_size = 256; // template argument const uint64_t prime = 268435399; // = fsu::PrimeBelow(0x0FFFFFFF); uint64_t RM; // holds calculation of R^{m-1} mod prime unit64_t pathash; // pattern fingerprint size_t Search (const char* p, const char* s, bool vegas) const { size_t m = strlen(p); size_t n = strlen(s); uint64_t r = alphabet_size; uint64_t pathash = Hash(p,m,prime); uint64_t txthash = Hash(s,m,prime); if (txthash == pathash) { if (!vegas || Verify(s,0)) return 0; } for (size_t i = m; i < n; ++i) { txthash = (txthash + prime - ((RM*(uint64_t)s[i-m]) % prime)) % prime; // rolling hash step 1 txthash = (txthash*r + (uint64_t)s[i]) % prime; // rolling hash step 2 if (txthash == pathash) { if (vegas && !Verify(s,i-m+1)) continue; return i-m+1; } } return n; } This is the calculation of modular hashing for strings: uint64_t Hash(const char* s, size_t length, uint64_t prime) { uint64_t hash = 0; for (size_t i = 0; i < length; ++i) { hash = (r * hash + (uint64_t)s[i]) % prime; } return hash; } The Verify function just checks each character for a match: bool Verify(const char* p, const char* s, size_t loc) // no bounds checks! { for (size_t i = 0; i < strlen(p); ++i) { if (p[i] != s[i + loc]) { std::cout << " ** RK: match verification failure at s[" << loc << "]\n"; return 0; } } std::cout << " ** RK: match verified\n"; return 1; } And as usual there is a pre-processing phase to set up the search: void Build_rk(const char* pattern) { unit64_t r = alphabet_size; uint64_t m = strlen(pattern); pathash = Hash(pattern,m,prime); RM = 1; for (size_t i = 1; i < m; ++i) RM = (RM * r ) % prime; } (You may notice extra "mod prime" operations and an occasional "+ prime" in the above. These do not change the mathematical outcome, but serve to keep the intermediate values of the calculation in the range of unsigned 64-bit integers.) Demo: See notes_support/frk.x for a demonstration executable. This program requires string command line arguments (1) search_pattern and (2) search_space. It accepts two more optional arguments (3) unsigned verbose that controls 3 levels of output information: 0 = silent, 1 = proof, 2 = dump, and (4) bool dump that indicates whether to run in Monte Carlo or Las Vegas mode. (Entering the command without arguments will produce a reminder prompt.) 3.4.4 Monte Carlo v Las Vegas As already emphasized, Rabin-Karp search can run in either Monte Carlo or Las Vegas mode. The distinctions are: Monte Carlo. The search returns a result based purely on whether fingerprints match. The result has very high probability of being correct, and the runtime is guaranteed to be O(n). The algorithm requires no backing up in the search text. Las Vegas. The search returns an index where fingerprints are the same only after a match is verified by character comparison. The result is guaranteed correct, and the runtime has very high probability of being O(n). However, the worst case runtime of Ω(m*n) is possible. The algorithm requires backing up m places in the search text for each verification step. Neither version requires more than a constant amount of runspace. 3.4.5 Other Advantages of Rabin-Karp We already mentioned that Rabin-Karp search can be re-engineered to work in multidimensional spaces, such as searching for a rectangular pattern in a larger digital image. This can be useful, for example, in finding steganographically hidden text in images, in addition to finding repeated patterns in an image. Rabin-Karp also lends itself to the problem of searching for a set of patterns. For example, suppose we want to check whether a text s contains sentences plagiarized from other sources. We may have a large number of such sentences to check. Implementing that kind of search is straightforward and efficient using RK. First calculate the fingerprints of all of the strings you want to search for, and place all of these signatures in a hash table patterns with key = fingerprint and data = pattern_string. Then in the search algorithm check whether the rolling signature is a key in patterns and if so retrieve the associated pattern. Library search engines also use this technique. 3.4.6 References The original Rabin-Karp paper was published in 1987 [1]. It is an early and famous example of a Randomized Algorithm. The Sedgwick lectures notes [2], 4th edition of Agorithms [3], and the Cormen text [4] are all excellent resources for more details. [1] Karp, Richard M.; Rabin, Michael O. (March 1987). "Efficient randomized pattern-matching algorithms". IBM Journal of Research and Development 31 (2): 249-260. [2] Sedgewick, Robert (2012). "Lecture slides 5.3: Substring Search". Princeton University. [3] Sedgewick, Robert; Wayne, Kevin (2011). Algorithms 4e. Pearson / Addison-Wesley. [4] Cormen, T.H.; Leiserson, C.E; Rivest, R.L.; and Stein, C. (2009). Introduction to Algorithms 3e. MIT Press, Cambridge, MA. 3.5 Boyer-Moore The Las vegas version of Rabin-Karp requires "backing up" in the input string, which can be a disadvantage under some circumstances. However, if we accept backing up, the speed of KMP can be improved significantly. 3.6 Classes The following prototype classes have been tested for efficacy and are used in the demo executables fkmp.x and frk.x. template <size_t R> class KMP { public: KMP () : pattern_(nullptr), plength_(0), alength_(R), dfa_(0,0) {} KMP (const char* p) : pattern_(nullptr), plength_(0), alength_(R), dfa_(0,0) { Init (p); } void Init (const char* p); size_t Search (const char* s, bool demo = 0) const; void Dump (std::ostream& os = std::cout) const; private: // data char* pattern_; size_t plength_; size_t alength_; fsu::Matrix dfa_; private: // method void Build_dfa(); }; The KMP template parameter (above) is used to provide the alphabet size. In a more general implementation this parameter would be an alphabet class. KMP_pmt (below) does not need the alphabet size, but might well need access to an alphabet in a future development. class KMP_pmt { public: KMP_pmt () : pattern_(nullptr), plength_(0), pmt_(0) {} KMP_pmt (const char* p) : pattern_(nullptr), plength_(0), pmt_(0) { Init (p); } void Init (const char* p); size_t Search (const char* s, bool demo = 0) const; void Dump (std::ostream& os = std::cout) const; private: // data char* pattern_; size_t plength_; fsu::Vector pmt_; private: // method void Build_pmt(); }; Two template parameters are used to provide Rabin-Karp with an alphabet size and a large prime number. Again, in a more complete development we would replace the alphabet size with a complete alphabet class. template <size_t R, size_t P> // alphabet size, prime number class RabinKarp { public: RabinKarp() : pattern_(nullptr), plength_(0), alength_(R), pathash_(0), prime_(prime), RM(1) {} RabinKarp(const char* p) : pattern_(nullptr), plength_(0), alength_(R), pathash_(0), prime_(prime), RM(1) { Init(p); } void Init (const char* p); size_t Search (const char* s, bool vegas = 0) const; void Dump (std::ostream& os = std::cout) const; private: // data char* pattern_; // p = pattern uint64_t plength_; // m = length of patternn uint64_t alength_; // R = size of alphabet uint64_t pathash_; // hash value of pattern uint64_t prime_; // Q = prime divisor used in hash function uint64_t RM_; // R^{m-1} % Q private: // methods uint64_t Hash (const char* s, size_t length) const; bool Verify (const char* s, size_t loc) const; }; Rabin-Karp can be upgraded significantly, first to an Alphabet template parameter and then to a general pattern search tool.