2 String Sorts

The order operator< defined for strings implements "dictionary" or lexicographical order, based on the numerical index order of the characters. In the worst case, x < y will entail examining all of the characters in the two strings x and y, an Ω(W) operation (where W is the length of the strings). One of the general inquiries in this chapter is whether efficiencies can be gained in sorting and searching collections of string keys by using the constant-time order operator on the character set itself. We look at three sort algorithms adapted specifically to strings.

2.1 LSD String Sort

LSD (Least-Significant-Digit) string sort: assume that the N keys to be sorted are strings, all of the same length W, and that the alphabet has size R. (The default we are accustomed to is R = 256, where strings consist of extended ASCII characters.) Note that we can sort a collection of characters using counting sort. The general idea for LSD string sort is to apply counting sort on the characters at each position, beginning with the "least significant" (highest index = W-1) character and working up to the "most significant" (smallest index = 0) character. Because counting sort is stable, each pass preserves the order established by the earlier passes on the less significant positions, so the final result will be a sort of all of the keys.

This is exactly what we did implementing byte_sort, except that we were sorting actual numbers, which consisted of at most 8 bytes, so our byte_sort looked like this:

    mask = 255 = 0x00000000000000FF;
    for i = 0 ... 7
        apply counting sort to (keys & mask);
        mask = mask << 8;

In other words we just apply counting sort to the right-most byte, then shift over 8 bits, and repeat until we have moved the mask all the way to the left-most byte. If we think of a string as a "base R" number, LSD sort does the same thing to string keys, except that we have much bigger "numbers" that cannot be represented numerically in the machine. Therefore we must maintain the symbolic representation as strings of "digits".
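The passes described above can be sketched for fixed-length keys using only the standard library. This is a minimal illustration, not the course implementation: the name lsd_fixed is hypothetical, keys are assumed to be std::string objects that all have length exactly W, and R is fixed at 256.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: LSD sort of strings that all have length W, alphabet size R = 256.
// lsd_fixed is a hypothetical name, not part of the course library.
void lsd_fixed(std::vector<std::string>& a, std::size_t W)
{
  const std::size_t R = 256;
  std::vector<std::string> b(a.size());          // scratch array, reused on every pass
  for (std::size_t d = W; d-- > 0; )             // positions W-1 down to 0
  {
    std::vector<std::size_t> c(R + 1, 0);        // counters, offset by one
    for (const std::string& s : a)
    {
      assert(s.size() == W);                     // precondition: fixed-length keys
      ++c[1 + static_cast<unsigned char>(s[d])];
    }
    for (std::size_t r = 1; r <= R; ++r)         // c[r] = number of keys with character < r
      c[r] += c[r - 1];
    for (const std::string& s : a)               // stable distribution into b
      b[c[static_cast<unsigned char>(s[d])]++] = s;
    a.swap(b);                                   // result of this pass is now in a
  }
}
```

Each pass is one counting sort keyed on a single character position, so the whole sort makes W passes over the N keys.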
LSD can be adapted with almost no extra effort when the String class used to house strings has a built-in null-terminator, à la C-strings. In fact, all that is needed is an Element(index) method that returns the null character whenever there is no character at that index.

2.1.1 LSD Pitfalls
2.1.2 Bit, Byte, and Word Sorts

You can refresh your perspective by experimenting with notes_support/sortspy.x. Note that bit_sort is beaten by several generic sorts for most data. Note also that counting_sort is very fast, but it becomes impractical when the maximum "spread" of individual number values in the data is too large. This is due to the locally declared array of size k = 1 + max_spread in the implementation of counting_sort (which has k as a parameter). Bit_sort, byte_sort, word_sort, and LSD string sort all get around this limitation by considering the data one component [bit, byte, word, or character] at a time and looping through the components from least to most significant.

Byte_sort is exactly LSD string sort on 8-character extended ASCII strings. Word_sort is exactly LSD string sort on 4-character UNICODE16 strings.

Note that the runtimes for byte_sort are doubled when data is processed using variables of type uint64_t, even though the input data is restricted to be bounded by UINT32_MAX. For strings, interpret this as having to sort at all character positions even when all of the strings have the same 4-character prefix.

2.1.3 LSD Cost Estimates

LSD string sort makes one counting_sort pass per character position, and each pass is Θ(N+R), where N is the number of strings and R is the size of the alphabet. With W characters per string, the total runtime is Θ((N+R)*W). Note also that the space overhead is the number of keys plus the size of the alphabet: +Θ(N+R).

2.1.4 A Generic Hollerith Mapping

LSD string sort relies on counting_sort as discussed earlier. The version g_counting_sort below takes counting_sort all the way to a generic algorithm. It's a post-modern version of the original card sorting machine invented by Herman Hollerith, whose Tabulating Machine Company later became part of IBM. There is no explicit assumption needed on the element types being processed, and the counting_sort algorithm is re-phrased as a kind of permutation mapping an input range to an output range.
The place where numbers enter the picture is via the function object f. This may seem abstract/esoteric, but it is very useful. We will illustrate with ByteSort (for integers) and LSD (for strings).

    template < class I , class J , class F >
    void g_counting_sort(I source_beg, I source_end, J dest_beg, size_t R, F f)
    // Pre:  I,J are iterator types with the same ElementType
    //       destination range is at least as large as source range
    //       f maps ElementType to int values in the range [0,R)
    // Post: source range is unchanged
    //       dest range is a stable f-sorted permutation of source range
    //       I.e., i < j ==> f(B[i]) <= f(B[j])
    //       and relative order of f-equal elements is preserved
    {
      size_t * c = new size_t[R+1];                  // declare counter array
      for (size_t r = 0; r <= R; ++r)                // initialize counters to 0
        c[r] = 0;
      for (I i = source_beg; i != source_end; ++i)   // count instances of f(t) == r offset by one
        ++c[1+f(*i)];                                // c[r+1] = number of a's that map to r
      for (size_t r = 1; r <= R; ++r)                // accumulate instance counts
        c[r] += c[r-1];                              // c[r+1] = number of a's that map to 0 .. r
                                                     // c[r]   = number of a's that map to < r
      for (I i = source_beg; i != source_end; ++i)   // map a -> b
      {
        dest_beg[c[f(*i)]] = *i;
        ++c[f(*i)];
      }
      delete [] c;                                   // release the counter array
    }

(We have migrated from the Cormen-like implementation to a Sedgewick-like implementation to get all the loops running in the same direction.) Note that the implementing code indexes the destination with dest_beg[...], so it implicitly requires a random access iterator or pointer there; the source range is only traversed forward, twice.

2.1.5 Byte Sort

Notice that counting_sort permutes the input range by stably ordering the elements according to the function object f. The trick in making practical use of counting_sort is in finding a family of "mask-like" function objects that serve to isolate small/manageable components of the input data.
Consider for example the following function class:

    template <typename N>
    class Byte
    {
    public:
      N operator () (N n)
      {
        return ((n >> offset_) & 0xFF);      // the byte at the offset location
      }
      Byte() : offset_(static_cast<N>(0x00)) {}
      void SetByte(unsigned char i)
      {
        offset_ = static_cast<N>(i << 3);    // bit offset of the ith byte = i*8
      }
    private:
      N offset_;
    };

If b is a Byte<unsigned long> object, b(n) returns the ith byte of n embedded as the right-most byte in an N object. The offset i is set by the method SetByte. With these two helpers we can now write complete code for ByteSort:

    template <typename N>
    void byte_sort (N* A, size_t n)
    {
      N* B = new N [n];
      fsu::Byte<N> b;
      size_t numBytes = sizeof(N);
      for (size_t i = 0; i < numBytes; ++i)
      {
        b.SetByte(i);                           // byte i will be isolated with a mask
        fsu::g_counting_sort(A,A+n,B,256,b);    // call the Hollerith mapping
        fsu::Swap(A,B);                         // swap pointers
      }
      if (numBytes % 2 != 0)                    // odd pass count: result is in the scratch block
      {
        fsu::Swap(A,B);                         // A = caller's block, B = scratch block
        for (size_t j = 0; j < n; ++j)          // copy the result back for the caller
          A[j] = B[j];
      }
      delete [] B;                              // B is always the scratch block here
    }

Note that we are swapping addresses of memory blocks, which is much more efficient than data copy (Θ(1) v Θ(n)). Because A is passed by value, the caller still holds the address of the original block. When sizeof(N) is even, the even number of swaps leaves the final result in that original block and B pointing at the scratch block, so delete [] B releases exactly what was allocated. For a type with an odd number of bytes, the final result lands in the scratch block while B points at the caller's block, so one more swap and a single Θ(n) copy back are needed before the scratch block is deleted.
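For experimenting outside the course library, the same idea can be sketched with only the standard library. This is an illustration under stated assumptions, not the fsu implementation: counting_pass and byte_sort_std are hypothetical names, the keys are held in a std::vector, and N is assumed to be an unsigned integer type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Sketch only: one stable counting pass keyed on the byte at bit offset `shift`.
// counting_pass and byte_sort_std are hypothetical names, not course library code.
template <typename N>
void counting_pass(const std::vector<N>& src, std::vector<N>& dst, unsigned shift)
{
  assert(dst.size() >= src.size());               // destination must be large enough
  std::size_t c[257] = {0};                       // counters for R = 256, offset by one
  for (N x : src) ++c[1 + ((x >> shift) & 0xFF)];
  for (std::size_t r = 1; r <= 256; ++r) c[r] += c[r - 1];
  for (N x : src) dst[c[(x >> shift) & 0xFF]++] = x;
}

template <typename N>
void byte_sort_std(std::vector<N>& a)
{
  static_assert(std::is_unsigned<N>::value, "sketch assumes an unsigned integer type");
  std::vector<N> b(a.size());                     // scratch array
  for (std::size_t i = 0; i < sizeof(N); ++i)     // least to most significant byte
  {
    counting_pass(a, b, static_cast<unsigned>(8 * i));
    a.swap(b);                                    // constant-time swap of the two buffers
  }
}
```

Because the vector is passed by reference and swapped as a whole, the result always ends up in the caller's vector, whether sizeof(N) is even or odd.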
2.1.6 LSD String Sort

Quite analogous to ByteSort, the following function class combines with g_counting_sort to implement LSD string sort:

    class IndexValue
    {
    public:
      size_t operator() ( const fsu::String& s )
      {
        // go through unsigned char so that characters above 127 stay in [0,256)
        return static_cast<size_t>(static_cast<unsigned char>(s.Element(index_)));
      }
      IndexValue () : index_(0) {}
      void SetIndex (size_t i) { index_ = i; }
    private:
      size_t index_;
    };

    void LSD (fsu::Vector<fsu::String>& a, size_t L, size_t R)
    {
      fsu::Vector<fsu::String> b(a);      // scratch vector of the same size
      IndexValue iv;
      for (size_t d = L + 1; d > 0; )     // d runs from L down to 0
      {
        --d;
        iv.SetIndex(d);
        g_counting_sort(a.Begin(), a.End(), b.Begin(), R, iv);
        a.Swap(b);
      }
    }

LSD string sort applies to strings of varying length without any fuss with the help of the IndexValue function class and the fact that the null-character '\0' comes before any other character in the character set. (Recall that the fsu::String method s.Element(i) returns the character at i if i is in range and '\0' otherwise, so that an IndexValue object iv set to index i returns iv(s) = 0 when i ≥ s.Length().)

2.2 MSD String Sort

Given the "right" notion of string object, the LSD approach adapts to string keys of varying length, but wide variation in string length can lead to inefficiencies. For example, suppose we have many keys of length 6 and one of length 100. Then the main loop in LSD would run 100 times and produce no meaningful change in the array on the first 94 iterations. Like ByteSort, LSD is a strictly "run-to-completion" process. It can be fast, but it cannot stop early: every character position is processed regardless of the data.

MSD (most significant digit first) uses a recursive approach: an application of counting_sort to the first (left-most = most significant) character organizes the array of keys into subarrays, one for each value of the leading character; then a recursive call on each of these subarrays completes the sort of the array. Note that after each application of counting_sort there are R recursive calls, where R is the size of the alphabet. The (maximum possible) depth of the recursions is the string length W.

2.2.1 MSD Pitfalls
2.2.2 MSD Cost Estimates

The time and space costs for MSD are not as simple to calculate as those of LSD, because the algorithm is recursive and its behavior varies with the characteristics of the set of strings being sorted. For random strings, the following can be proved [from Sedgewick/Wayne].

Proposition. Let N be the number of strings to be sorted, R the number of characters in the alphabet, W the maximum length of the strings, and w the average length of the strings. Then:
2.3 Three-Way String QuickSort

This version of string sort is modelled on 3-way quick sort. The idea is to adapt quick_sort_3w to apply to the leading (left-most, index 0) character in the vector of strings, using the Alphabet::operator<. This will then re-organize the strings into three ranges: those with leading character less than the pivot character, those with leading character equal to the pivot character, and those with leading character greater than the pivot character. Then apply the same algorithm recursively to each of these three sub-ranges, with the middle range considering the second character instead of the first. The following example illustrates the process:
The three ranges are color coded blue, red, and green, with the red color omitted from the first letter in the middle range. The above illustrates a run to completion. But of course the algorithm does not proceed left to right uniformly as in the illustration; rather, recursive calls are made first on the blue range, then the red range, then the green range. The pivot element in each range is underscored. The illustration terminates the process when the range size is <= 1. In the actual implementation a cutoff to insertion sort should happen when the range size is small, but higher than 1. The illustration also ignores possible permutations within elements making up the three ranges.

String Sort Project

Implement the three string sorts discussed above: LSD, MSD, and SQS3w, applying the optimizations discussed. Using collections of strings of various data characteristics, test these algorithms against the optimized generic sorts. The goal is to find recommendations of sorts to use, by data characteristic.
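As a possible starting point for the SQS3w part, the three-way partitioning of Section 2.3 can be sketched as follows. This is a bare illustration, not a finished solution: the names char_at and sqs3w are hypothetical, keys are std::string objects assumed to contain no embedded '\0', the pivot is simply the first key in the range, and the insertion sort cutoff is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Character of s at position d, or 0 past the end (mirrors the Element convention).
// char_at and sqs3w are hypothetical names used only for this sketch.
inline unsigned char char_at(const std::string& s, std::size_t d)
{
  return d < s.size() ? static_cast<unsigned char>(s[d]) : 0;
}

// Sort a[lo,hi), assuming all keys in the range agree on their first d characters.
void sqs3w(std::vector<std::string>& a, std::size_t lo, std::size_t hi, std::size_t d)
{
  assert(lo <= hi && hi <= a.size());
  if (hi - lo <= 1) return;                      // ranges of size 0 or 1 are sorted
  const unsigned char pivot = char_at(a[lo], d);
  // Invariant: a[lo,lt) < pivot, a[lt,i) == pivot, a[gt,hi) > pivot (at position d)
  std::size_t lt = lo, i = lo + 1, gt = hi;
  while (i < gt)
  {
    const unsigned char c = char_at(a[i], d);
    if      (c < pivot) std::swap(a[lt++], a[i++]);
    else if (c > pivot) std::swap(a[i], a[--gt]);
    else                ++i;
  }
  sqs3w(a, lo, lt, d);                           // "less" range, same character position
  if (pivot != 0) sqs3w(a, lt, gt, d + 1);       // "equal" range, next character position
  sqs3w(a, gt, hi, d);                           // "greater" range, same character position
}

void sqs3w(std::vector<std::string>& a) { sqs3w(a, 0, a.size(), 0); }
```

When the pivot character is 0, every key in the middle range has ended at position d, so those keys are equal and no further recursion is needed on them.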