Sorting algorithms, "sorts" for short, are among the most classic and also most important algorithms in computing. Many sorts actually pre-date machine computing. Moreover, in spite of the intense interest and research into sorts over the years, new knowledge is still being discovered in the area.
In this chapter we look at representative sorts in several categories. The classics are used whenever appropriate for our examples. In some cases we develop the sort all the way to a generic algorithm, and in others we discuss the algorithm only at the level of pseudo-code or even at the conceptual level.
A sort is an algorithm that receives a sequential collection of data as input and results in a permutation of the input data such that the elements are in non-decreasing order. The classification of sorts breaks along a number of dichotomies: comparison-based or not, in-place or not, stable or not, and internal (in memory) or external (on files).
We start with relatively simple O(n²) sort algorithms, progress to three classic faster sorts, and conclude with a discussion of special purpose (non-key-comparison) sorts.
This algorithm may well be the most naturally occurring sort, and it certainly predates any notion of computation by machine. It is the algorithm that arises from the task of putting things away one at a time while maintaining sorted order during the process: whenever a new item is put away, it is placed in the correct location to maintain sorted order. This would be the algorithm you would use, for example, to place a book on a shelf in order of author's name.
The algorithm may be realized in several ways. For example, one could move things out of the input container into an output container and always insert in sorted order. A convenient output container for this would be a list. Since we already have a sorted list container class, the following loop performs the essential tasks:
// ordered list insertion sort
// T = element type
// C c = input container
sorted_list<T> L;
for (C::Iterator i = c.Begin(); i != c.End(); ++i)
  L.Insert(*i);
// L is a sorted list of all elements of c
The list L contains a copy of the data in the input container, permuted into sorted order. The original container is left intact.
To get the elements one would need a traversal of L. For example, if one wanted the elements placed back into the original container, add code like:
C::Iterator i;
sorted_list<T>::Iterator j;
for (i = c.Begin(), j = L.Begin(); j != L.End(); ++i, ++j)
{
  *i = *j;
}
and then let the list L go out of scope. This algorithm is not in-place, because the amount of extra space used is Θ(size). However in a case where you want to preserve the original unsorted copy of the data, you would need space for the sorted copy in any case, so this non-in-place technique would serve nicely.
The worst case runtime of this algorithm is the sum, over the input, of the worst case runtimes of the calls L.Insert(t), each of which is Θ(k) when the list has size k. Hence the worst case runtime for the algorithm is
1 + 2 + ... + n = Θ(n²)
where n = size of input.
Note that we could replace the list L with any other sorted associative container:
// ordered set insertion sort
// T = element type
// C c = input container
// P p = predicate object used for order
sorted_set<T,P> S;
for (C::Iterator i = c.Begin(); i != c.End(); ++i)
  S.Insert(*i);
// S is a sorted container of all elements of c
Clearly this is still not an in-place sort. However the runtime could be improved over the ordered list case by choosing a set implementation in which the insertion operation has better runtime, say Θ(log k) when the set has size k, resulting in a worst case runtime for the sort of
log 1 + log 2 + ... + log n = Θ(n log n)
where n = size of input. (Set implementations where S.Insert(t) has runtime Θ(log S.Size()) are discussed in the chapter on associative binary trees.)
Exercise. Prove these two formulas:
1 + 2 + ... + n = Θ(n²)
log 1 + log 2 + ... + log n = Θ(n log n)
You may use any references helpful.
The insertion sort process can be made in-place for certain kinds of input containers. The key idea is to build a sorted initial range of the container by progressively pushing the next element into the initial subrange. This in-place version is the classic Insertion Sort. Here is pseudo code designed for an array:
// Classic Insertion Sort / array
// T = element type
T A[n];     // array of size n
T t;        // local variable
int i, j;   // loop control variables
for (i = 1; i < n; ++i)
{
  t = A[i];
  for (j = i; j > 0 && t < A[j - 1]; --j)
    A[j] = A[j - 1];
  A[j] = t;
}
The body of the outer loop finds the correct location for A[i] in the initial range A[0..i] using the assumption (loop invariant) that the range A[0..i-1] is already sorted. The search loop also moves the data up one index to make room for t = A[i] when its place is found. Note that the inner loop body is executed only until the place for t = A[i] is found. The worst case runtime for the inner loop is Θ(i), but the best case run time, which occurs when the input data is already sorted, is constant. Thus the worst case run time for Classic Insertion Sort is
1 + 2 + ... + n = Θ(n²)
while the best case run time is
1 + 1 + ... + 1 = Θ(n)
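As a small illustration (this example is ours, not from the notes), here is the state of the array after each pass of the outer loop of Classic Insertion Sort applied to [5 2 4 3]:

start:      [5 2 4 3]
after i=1:  [2 5 4 3]   // t = 2 shifted past 5
after i=2:  [2 4 5 3]   // t = 4 shifted past 5
after i=3:  [2 3 4 5]   // t = 3 shifted past 5 and 4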
One final observation on Insertion Sort is that it can be re-written as a generic algorithm:
// Classic Insertion Sort / bidirectional iterators
template < class BidirectionalIterator >
void g_insertion_sort (BidirectionalIterator beg, BidirectionalIterator end)
{
  BidirectionalIterator i, j, k;
  typename BidirectionalIterator::ValueType t;
  for (i = beg; i != end; ++i)
  {
    t = *i;
    for (k = i, j = k--; j != beg && t < *k; --j, --k)
      *j = *k;
    *j = t;
  }
}
The runtime analysis for the array case works as well for the generic case. Note that Classic Generic Insertion Sort is applicable to TVector, TDeque, and TList containers, since these all are supported by bidirectional iterators. However, because there is no definition of ValueType for plain pointers, the generic algorithm doesn't work for plain arrays. A special case must be made for arrays.
These algorithms can of course use a non-default order operator (passed in as a predicate object) for the order criterion:
// Classic Insertion Sort / bidirectional iterators
template < class BidirectionalIterator , class Comparator >
void g_insertion_sort (BidirectionalIterator beg, BidirectionalIterator end, const Comparator& cmp)
{
  BidirectionalIterator i, j, k;
  typename BidirectionalIterator::ValueType t;
  for (i = beg; i != end; ++i)
  {
    t = *i;
    for (k = i, j = k--; j != beg && cmp(t,*k); --j, --k)
      *j = *k;
    *j = t;
  }
}
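As a short usage sketch (the predicate class name is ours, and we assume TVector<int> supplies the iterator interface used throughout these notes), the predicate version can sort a vector into descending order:

// descending order predicate (illustrative name)
struct GreaterThan
{
  bool operator() (int a, int b) const { return a > b; }
};

TVector<int> v;
// ... fill v ...
g_insertion_sort (v.Begin(), v.End(), GreaterThan());  // v is now in descending order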
Exercise. Show that classic generic insertion sort is in-place
and stable.
Exercise. Code two versions of classic insertion sort
applicable to arrays, one for default order and one for predicate order.
Selection sort is another algorithm that pre-dates machine computing. The idea is basic and simple: find the smallest item in the range and swap it into the first position. Then find the second smallest item and swap it into the second position, and so on until the range is exhausted. This process amounts to a loop that swaps the smallest item in the range [k,n) to place k, for k = 0, ..., n -1.
template < class ForwardIterator , class Comparator >
void g_selection_sort (ForwardIterator beg, ForwardIterator end, Comparator cmp)
{
  ForwardIterator i, j, k;
  for (i = beg; i != end; ++i)
  {
    k = i;
    for (j = i; j != end; ++j)
      if (cmp(*j , *k))
        k = j;
    swap (*i, *k);
  }
}
Selection sort is also in-place, but it is not stable. Even though we always select the first occurrence of the smallest remaining item, the swap may interchange the positions of equal elements. For example:
[1 2 8_1 8_2 4] -> [1 2 4 8_2 8_1]
Here the smallest element in the tail [8_1 8_2 4] is 4, which is swapped with 8_1, moving it to the right of 8_2.
Selection sort has the advantage over insertion sort that it does not make as many ValueType assignments and it requires a weaker category of iterator. It has the disadvantage that its best case run time is Θ(n²), because neither of the two loops in the algorithm body has a data-dependent early exit. In fact, there are always exactly n(n + 1)/2 calls to the comparison operator on input of size n, independent of the order of the input.

Exercise. Show that the number of ValueType assignments in selection sort is Θ(n) while the number of ValueType assignments in classic insertion sort is Θ(n²) in the worst case.
Exercise. Show that the number of calls to the comparison operator in selection sort is always n(n + 1)/2, while that number is data-dependent in classic insertion sort, averaging n(n + 1)/4 for random data and approaching n for nearly sorted data.
Heapsort uses the abstract tree model of a random access range of values that is described in detail in our first chapter on trees. All the heavy lifting needed for heapsort is done in that chapter, where the algorithms g_push_heap() and g_pop_heap() were derived. Please review that material before proceeding here. Heapsort, along with the push/pop heap algorithms and their use in implementing priority queues, was invented by J.W.J. Williams in 1964.
Heapsort consists of two simple loops executed sequentially: the first organizes the range into a heap one element at a time, and the second swaps the largest remaining element of the heap to the highest unfilled location in the range, working backwards until the heap is exhausted:
template <class RAIter, class Pred>
void g_heap_sort (RAIter beg, RAIter end, const Pred& LessThan)
{
  if (end - beg <= 1) return;
  size_t size = end - beg, k;
  // push elements onto heap one at a time
  for (k = 0; k < size; ++k)
    g_push_heap(beg, beg + (k + 1), LessThan);
  // keep popping largest remaining element to end of remaining range
  for (k = size; k > 1; --k)
    g_pop_heap(beg, beg + k, LessThan);
}
Because the push and pop algorithms run in place, heapsort is an in place sort. The runtime of heapsort is the sum of the runtimes of the calls to push and pop:
(p_0 + p_1 + ... + p_(n-1)) + (q_n + q_(n-1) + ... + q_2)

where p_k = Θ(g_push_heap(beg, beg + (k + 1), LessThan)) and q_k = Θ(g_pop_heap(beg, beg + k, LessThan)). We have shown that
Θ(g_push_heap(beg, beg + k, LessThan)) = Θ(log k) and
Θ(g_pop_heap(beg, beg + k, LessThan)) = Θ(log k)
Therefore the run time of heapsort is
2(log 1 + log 2 + ... + log n) = 2Θ(n log n) = Θ(n log n)
About the only bad news for heapsort is that it is not stable.
Exercise. Develop a generic sort with the following header:
template <class ForIter, class Pred> void g_PQ_sort (ForIter beg, ForIter end, const Pred& LessThan)
g_PQ_sort() should be a version of sort by insertion, using a priority
queue as the temporary receptacle of elements of the range [beg,
end). The iterator class should be a forward iterator. Estimate the runtime
(depending on the runtimes of the PQ operations) and investigate the stability
of g_PQ_sort().
Stop to think: what if I need to sort a file that is too large for computer memory? Here is a practical way to proceed: (1) break the file F into pieces F[0,1], ..., F[0,m], each small enough to fit in memory; (2) sort each piece in memory (with any fast in-memory sort) and write it back out as a sorted file; (3) repeatedly merge pairs of these sorted files until only one sorted file remains.
This process depends on the merge algorithm, which we have already developed as a generic algorithm (a small usage sketch follows the list below):
template < class I1 , class I2 , class J >
void g_set_merge (I1 b1, I1 e1, I2 b2, I2 e2, J d)

where:
- I1 and I2 are input iterator types
- J is an output iterator type
- Input ranges are [b1,e1) and [b2,e2)
- Output range starts at d
- Merge runtime = Θ(n), where n is the total number of elements in both input ranges
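Here is the usage sketch promised above, merging two sorted arrays into a third; we assume only the semantics listed in the bullets:

int a[] = {1, 3, 5, 7};
int b[] = {2, 3, 8};
int c[7];                          // must be large enough for both input ranges
g_set_merge (a, a+4, b, b+3, c);   // c is now {1, 2, 3, 3, 5, 7, 8}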
This process is a sort that works for large files. It is a straightforward coding process to associate input iterators to files open for read and output iterators to files open for write and apply g_set_merge() to accomplish the merge steps in the outline above. If you need to sort such large files on a regular basis, this process could be developed into a general file sorting framework. As we see below, the runtime of the sort is Θ(n log n), where n is the size of the original file.
We conclude this section with a runtime analysis of the loop process described by step 3. Again for simplicity of exposition, assume that the original breakup of F is into sets of equal size s, that is, each F[0,i] consists of s elements. Here is a more formal statement of that loop:
// algorithm for merging m sorted sets of size s
// input: sorted sets F[0,1] ... F[0,m]
// output: sorted set F
// assumption: m is a power of 2
numsets = m;
setsize = s;
numits  = 0;
while (numsets > 1)
{
  // 1: "numsets" is the number of sets to merge
  // 2: "setsize" is the size of the sets to be merged
  numsets = numsets/2;
  numits  = numits + 1;
  for (j = 1; j <= numsets; ++j)
    F[numits,j] = merge(F[numits-1,2j-1], F[numits-1,2j]);
  setsize = setsize * 2;
}
return F[numits,1];
First consider the runtime cost of the merge operation in the inner for loop. As we observed above, the cost of a merge is asymptotically the number of elements in the sets being merged, or the size of the result, which is setsize. Therefore the cost of the entire inner for loop is the sum
setsize + ... + setsize = numsets * setsize = the total number of elements
which is independent of which iteration of the outer while loop we are in. Denote this total number of elements by n. It follows that the cost of the algorithm is
n + ... + n = numits * n
that is, the total number of elements times the number of iterations of the outer loop. The outer loop terminates when
numits = log₂ m
so the outer loop body executes log m times, and the runtime of the entire algorithm is Θ(n log m), where n is the total number of elements and m is the number of sets to be merged.
Exercise. Show that the while loop in the algorithm for
merging m sorted sets terminates after log₂ m iterations.
Classic merge sort is shown in this slide. Note that it follows the outline given previously for sorting files, except it is applied to a range of values in internal memory, and the initial items to be merged are singletons rather than largish data sets. Its most convenient description is with a recursive body on an array A. The two parameters are necessary to delimit the range of values for the recursive calls.
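Since the slide itself is not reproduced here, the following sketch shows the recursive body being described (details may differ from the slide):

void merge_sort(T* A, size_t p, size_t r)
// recursively sorts the range A[p,r)
{
  if (r - p > 1)
  {
    size_t q = p + (r - p)/2;  // midpoint of the range
    merge_sort(A, p, q);       // sort the left half  A[p,q)
    merge_sort(A, q, r);       // sort the right half A[q,r)
    merge(A, p, q, r);         // merge the two sorted halves
  }
}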
The function merge called after the two recursive calls can be implemented using generic algorithms already defined. The following is actual code:
void merge(T* A, size_t p, size_t q, size_t r)
// pre:  0 <= p <= q <= r
//       A is an array of type T
//       A is defined for the range [p,r)
//       A[p,q) and A[q,r) are each sorted ranges
// post: A[p,r) is a sorted range
{
  T B [r-p];                           // temp space for merged copy of A
  g_set_merge(A+p, A+q, A+q, A+r, B);  // merge the two parts of A to B
  g_copy(B, B+(r-p), A+p);             // copy B back to A[p,r)
}
Clearly the runtime of classic merge sort is Θ(n log n), using the arguments given above for file sort. While classic merge sort is not in-place, it is stable. Thus there are a few circumstances when classic merge sort might be the algorithm of choice.
Exercise. Explain why classic merge sort is stable. (Hint: it depends on choices made implementing g_set_merge().)
Exercise. Suppose we want to make classic merge sort "in place" by replacing the call to merge() with
g_set_merge(A+p, A+q, A+q, A+r, A+p);
Will this work? Explain.
The runspace difficulty can be worked around when storage is not required to be contiguous, as in a linked list. MergeSort provides the golden fleece of comparison sorts for such structures: in-place, stable, and optimal asymptotic runtime. For this reason, the List container is given its own Sort method, and the typical implementation is one form or another of MergeSort. Here is pseudocode for MergeSort in a mythological linked list:
void List::Sort()
{
  Link * currSeg, * nextSeg;  // ptrs to sub-lists to be merged
  segSize = 1;
  do
  {
    numMerges = 0;
    // merge all adjacent pairs of sub-lists of length segSize
    currSeg = firstLink;
    while (currSeg != 0)
    {
      nextSeg = currSeg;
      advance nextSeg segSize steps in the list
      merge the sublist at currSeg with the sublist at nextSeg
        (leaving currSeg at the beginning of the next segment)
      ++numMerges;
    }
    // double the sub-list size to be merged
    segSize = 2 * segSize;
  }
  while (numMerges > 1);  // stop when only 1 merge has occurred - the last merge takes care of remainders
  fix list at ends
}
Note first that the highlighted line ("advance nextSeg segSize steps in the list") implies a loop such as:
for (size_t i = 0; i < segSize; ++i)
  nextSeg = nextSeg->nextLink_;
because linked structures do not provide random access to locations in the list. These pointer advancements add to the runtime cost of the algorithm. This is a tradeoff for the fact that we do not need extra memory as a temporary target for the merge process.
Note that these pointer advances take place in only one of the two segments to be merged next, so that there are about n/2 pointer advances per pass of the outer loop. The outer loop is executed log n times. Thus the total cost of these pointer advances is about (1/2) n log n. We can conclude that List::MergeSort is stable, in-place, with runtime Θ(n log n).
The second line highlighted is implemented by essentially the same algorithm as g_set_merge, with one compare and re-link step for each link in each segment. The total work for a given segment size is therefore proportional to the total number of links in the list.
Exercise. Where is the fallacy in the following argument? We want a sort for arrays (and vectors and deques) that is stable, runs in Θ(n log n) time, and is in-place.
So we just copy the vector to a list, perform
List::MergeSort, and then copy the list back to the vector. The copies are done
in space-conservative manner, so that the vector footprint is decreased whenever
an element is removed.
Quicksort is another modern invention, by C.A.R. Hoare in 1962. Note that it pre-dates heapsort by a couple of years. The classic description of quicksort is recursive and operates on an array of values. The version discussed here is slightly different (and simplified) from the classic as described by Hoare.
// Cormen quicksort
void quick_sort(A,p,r)
// Pre:  A is an array of type T
//       A is defined for the range [p,r)
// Post: A[p,r) is sorted
{
  if (r - p > 1)
  {
    q = Partition(A,p,r);
    quick_sort(A,p,q);
    quick_sort(A,q+1,r);
  }
}
Note that the form of this algorithm is similar to that of classic merge sort, with two recursive calls to sort two sub-ranges in the range. The distinctions are in how the two sub-ranges are obtained and what their order properties are. Merge sort chooses the midpoint of the input range, makes a recursive call to sort the two subranges, and then merges the two together. Quicksort, in contrast, relies on a data-dependent partitioning of the array elements: the result of a call to Partition() is a division of the elements into those less than or equal to the last element (in the range [p,q)) and those greater than the last element (in the range [q+1,r)), with the last element moved to location q. If we could guarantee that q is close to the midpoint between p and r then it would be straightforward to show that quicksort has runtime similar to that of merge sort. However, that is not possible, because the size of the partitions and the location of the "pivot" (at q) are dependent on the data stored in the array.
Most of the work and all of the luck for quicksort is accomplished in Partition(), one version of which follows:
size_t Partition(A,p,r)
{
  i = p;
  for (j = p; j < r-1; ++j)
  {
    // if A[j] is <= the last element, swap it into the low range
    if (A[j] <= A[r-1])
    {
      swap(A[i],A[j]);
      ++i;
    }
  }
  // the last place requires no test:
  swap(A[i],A[r-1]);
  return i;
}
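For example (an illustration of this Partition, not taken from the slide), with pivot value 4:

before: [2 8 7 1 3 5 6 4]   // pivot = A[r-1] = 4
after:  [2 1 3 4 7 5 6 8]   // elements <= 4 precede position q = 3, which is returned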
Note that the last element in the range is used as the pivot value for the resulting partition. This choice and the use of the operator <= keep the partition simple and deterministic, but they also ensure that the worst case run time for the sort is Ω(n²). Moreover, these worst case times occur for two common types of data: when the input range is already sorted and when the input range has many duplicate elements. Here is a summary of properties of quicksort:

- worst case run time: Θ(n²)
- average case run time: Θ(n log n)
- in-place, apart from the space used by the recursion
- not stable
Thus there is no theoretical advantage of quicksort over heapsort whenever the latter can be applied. Heapsort can even be applied to a list by first copying the list to a vector, sorting, and copying back to the list: three generic algorithm calls, at the cost of Θ(n) space.
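A sketch of that three-call approach follows; we assume TVector has a constructor taking an initial size and that g_copy and g_heap_sort are the generic algorithms used elsewhere in these notes:

template < class T , class P >
void list_heap_sort (TList<T>& L, const P& LessThan)
{
  TVector<T> V (L.Size());                     // Θ(n) extra space
  g_copy (L.Begin(), L.End(), V.Begin());      // copy list to vector
  g_heap_sort (V.Begin(), V.End(), LessThan);  // sort the vector
  g_copy (V.Begin(), V.End(), L.Begin());      // copy back to the list
}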
Nevertheless quicksort remains a popular choice for sorting applications as well as courses on algorithms. The reasons quicksort remains popular in applications are (1) it is a very elegant idea, (2) it is relatively easy to code correctly, and (3) in some tests it outperforms heapsort. Reason (3) is due in large measure to quicksort's use of the runtime stack (for recursion) as opposed to the iteration used by heapsort, and the difference is disappearing due to better compiler optimization technology. Reason (2) is of little importance now with generic algorithm technology. Moreover, heapsort retains the fundamental advantage of having worst case run time Θ(n log n).
The reasons quicksort remains popular in algorithms courses are (4) it is a very
elegant idea and (5) the runtime analysis is subtle and gives an opportunity to
showcase more sophisticated analysis technology invented for this kind of
algorithm. The alert reader will note that (4) and (1) are the same. The
cynical reader may ask: math for math's sake?
Yes! Never let it be said that computer science is without tradition: we will now discuss analysis of the runtime of quicksort. An outline of the proof is shown in the slide.
Observation 1. An element is used as a pivot at most one time, so there are at most n calls made to the partition routine.
Observation 2. All comparisons are made inside the partition routine. Therefore
Quicksort runtime <= O(n + x)
where x is the total number of comparisons made by the partition routine.
Observation 3. Consider the case of sorted input. Then each partition call chooses the largest remaining element as the pivot and compares it to every other element in its range. Thus the partition routine is called n-1 times and the successive calls perform n-1, n-2, ..., 1 comparisons. Therefore
Worst case runtime >= (n-1) + (n-2) + ... + 1 = Ω(n²)
Definitions. In order to estimate the average case runtime, define the following entities:
E[x] = expected value of x
z_0, z_1, ..., z_(n-1) = the elements in sorted order
[z_i, z_j] = {z_i, z_(i+1), ..., z_j}
x_ij = bool{z_i is compared to z_j}
e_ij = Probability{z_i is compared to z_j}
Observation 4. No pair is compared more than one time. Therefore
x = Σ_{i=0..n-2} Σ_{j=i+1..n-1} x_ij = ΣΣ_{i,j} x_ij
where the double sum ranges over the upper triangle of indices defined by 0 <= i < j < n.
Observation 5. Compute the expected value:
E[x] = E[ ΣΣ_{i,j} x_ij ] = ΣΣ_{i,j} E[x_ij] = ΣΣ_{i,j} e_ij
Observation 6. Because of the way Partition divides the data using the pivot, once a pivot is chosen from the interval [z_i, z_j], the elements z_i and z_j are compared during that partition step only if the pivot is z_i or z_j itself; if the pivot lies strictly between them, z_i and z_j are separated into different sub-ranges and are never compared afterwards. It follows that:

z_i is compared to z_j iff the first element of [z_i, z_j] to be chosen as a pivot is one of the two ends of the interval.
Observation 7. Assuming pivot values are chosen at random, we have:
e_ij = P{z_i or z_j is first chosen from [z_i, z_j]}
     = P{z_i is first chosen from [z_i, z_j]} + P{z_j is first chosen from [z_i, z_j]}
     = 1/(j - i + 1) + 1/(j - i + 1)
     = 2/(j - i + 1)
Observation 8. Estimate the expected value as follows:
E[x] = Σ_{i=0..n-2} Σ_{j=i+1..n-1} e_ij
     = Σ_{i=0..n-2} Σ_{j=i+1..n-1} 2/(j - i + 1)      [substitute k = j - i]
     = Σ_{i=0..n-2} Σ_{k=1..n-i-1} 2/(k + 1)
     < Σ_{i=0..n-1} Σ_{k=1..n} 2/(k + 1)
     = Σ_{i=0..n-1} O(log n)
     = O(n log n)

(The inner sum Σ_{k=1..n} 2/(k + 1) is twice a partial harmonic series, which is Θ(log n).)
Putting all these observations together completes the argument. Note that the assumption that each element of an interval [z_i, z_j] is equally likely to be the first chosen as a pivot is used in Observation 7.
A basic fact about all general purpose sorts is that they make decisions by key comparison. The following shows that our various sorts with worst case runtime Θ(n log n) are as fast as possible, at least asymptotically:
Theorem. Any comparison sort requires Ω(n log n) comparisons in the worst case.
Proof. For a given sort algorithm and input range size n, consider the decision tree associated with the sort: This is a binary tree whose internal nodes represent comparison between values at position pairs and whose leaves represent all possible permutations of the input data. If the sort algorithm is correct, then every permutation of the n input locations must be represented as a leaf of the decision tree. Exactly which internal nodes (comparisons between positions) exist depends on the particular algorithm, but all permutations must be at the leaves of the tree. The sort of a particular data set is represented by a descending path in the decision tree, and the length of this path is the number of key comparisons made by the algorithm.
Note that the depth of this tree is the worst case runtime for the sort, since every permutation of the input must be reachable by descending the decision tree. Note also that there are n! permutations of the n input locations. Therefore the decision tree must have at least n! leaves.
A binary tree with L leaves must have depth at least log₂ L (a binary tree of depth d has at most 2^d leaves, so L <= 2^d and hence d >= log₂ L). Therefore the depth of the decision tree for any comparison sort on n items must be at least log₂(n!). We conclude that the worst case run time is at least
log₂(n!) = Ω(n log n)
where the last equality is an application of Stirling's formula (see Cormen et al., equation (3.18)).
Here is an elementary proof that log n! = Θ(n log n) :
Proof. Consider these two inequalities:
n! = n(n-1)...1           definition of factorial
   < n * n * ... * n      repeating the largest factor
   = n^n                  definition of exponent

n! = n(n-1)...1           definition of factorial
   > n(n-1)...(n/2)       taking only about half of the factors
   > (n/2)^(n/2)          repeating the smallest factor
In summary, we have (n/2)^(n/2) < n! < n^n. Now take log₂ of all three to get:
log₂((n/2)^(n/2)) < log₂(n!) < log₂(n^n)
Applying basic properties of logarithms, we have:
log₂((n/2)^(n/2)) = (n/2)(log₂ n - log₂ 2) = (n/2)(log₂ n - 1) = Θ(n log n)
and
log₂(n^n) = n log₂ n = Θ(n log n)
Now we have a lower bound and an upper bound for log₂(n!) that are each Θ(n log n). It follows that log₂(n!) = Θ(n log n).
There are sorts that run in linear time. For example, suppose we have a set S of n non-negative integers known to have values in the range [0,k) with no repeated values. We can sort S with a bitvector bv of k bits. Begin with all bits cleared. Then for each number x in S do bv.Set(x). The sorted set of values is obtained by outputting the index of the set bits in bv. This algorithm has runtime Θ(n + k): one loop of length n to set the bits followed by one loop of length k to output the indices of set bits.
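Here is a minimal sketch of that bitvector sort, written with std::vector<bool> standing in for the course bitvector class (so bv.Set(x) becomes bv[x] = true):

#include <vector>
#include <cstddef>

// sorts the distinct values of S (all in the range [0,k)) into out
void bitvector_sort (const std::vector<size_t>& S, size_t k, std::vector<size_t>& out)
{
  std::vector<bool> bv (k, false);              // begin with all bits cleared: Θ(k)
  for (size_t j = 0; j < S.size(); ++j)         // set one bit per element of S: Θ(n)
    bv[S[j]] = true;
  out.clear();
  for (size_t i = 0; i < k; ++i)                // output indices of set bits: Θ(k)
    if (bv[i]) out.push_back(i);
}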
An elaboration of this idea that allows repeated values is shown in the slide, with the loops marked for the following discussion/proof.
Loop 1 initializes all elements of the "counting" data in C to zero. The plan is for C[i] to count the number of elements of A that have value less than or equal to i. Note that this is enough information to define the sorted output.
Loop 2 counts the occurrences of i in A, setting C[i] to the number of elements equal to i. This clever idea uses the fact that elements can be cast to index values.
Loop 3 accumulates, setting C[i] to the number of elements less than or equal to i. Note how the loop effectively adds all the elements C[i] from 0 to i.
Loop 4 does the mapping of A to B using the counting data in C. First place A[j] in its correct location, then decrement the count so that the next element down in A equal to A[j] will go into the position immediately before A[j] in B. Note that the sum of the counts in C is exactly the size of A and B, making the output mapping work correctly.
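Since the slide code is not reproduced here, the following is a sketch of counting_sort consistent with the four loops just described (details may differ from the slide):

template < typename N >
void counting_sort (const N* A, N* B, size_t n, size_t k)
// Pre:  A,B are arrays of type N defined in the index range [0,n)
//       values of A are integers in the range [0,k)
// Post: A is unchanged; B is a stable sorted permutation of A
{
  size_t* C = new size_t [k];
  for (size_t i = 0; i < k; ++i)    // Loop 1: initialize counts to zero
    C[i] = 0;
  for (size_t j = 0; j < n; ++j)    // Loop 2: C[i] = number of elements equal to i
    ++C[A[j]];
  for (size_t i = 1; i < k; ++i)    // Loop 3: C[i] = number of elements <= i
    C[i] += C[i-1];
  for (size_t j = n; j > 0; --j)    // Loop 4: map A to B, working backwards for stability
  {
    B[C[A[j-1]] - 1] = A[j-1];
    --C[A[j-1]];
  }
  delete [] C;
}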
Note that, as expected, we are not using key comparisons to sort the array A. Rather we are using some specific knowledge about the elements to be sorted, namely that they are integers. Note also that the algorithm consists of four simple loops, of length either k or n, so the runtime is Θ(k + n). Typically we have k <= O(n), in which case the run time is Θ(n).
A useful extension of counting_sort uses a function object f as a fifth parameter. The function object is required to map into unsigned integers and counting is of the values f(A[i]). Here is the header of such an extension:
template < typename N , class F > void counting_sort(const N* A, N* B, size_t n, size_t k, F f) // Pre: A,B are arrays of type N // A,B are defined in the index range [0,n) // f maps A to int values in the range [0,k) // Post: A is unchanged // B is a stable f-sorted permutation of A: // I.e., i < j ==> f(B[i]) <= f(B[j])
This version of counting_sort can be used to implement radix sort for any base [radix] integers (including the radix 2 case, bit_sort) and can be applied to types N that are not even numerical.
Exercise. Modify counting sort so that it works for array values in a range [L,U) instead of [0,k) (where L and U are signed integers). Do this two ways - first by modifying the algorithm, and second by applying the 5-parameter version of counting_sort with a judiciously chosen function object.
Exercise. Find a function object so that the 5-parameter
version of counting_sort puts an integer array A into descending order.
Like counting sort, radix sort uses knowledge of the data being sorted instead of key comparisons. One typical use of radix sort is in sorting records by year, month, and day. Rather than define a comparison between two date objects and using a generic sort algorithm, we can first sort by the day field, second sort by the month field, and finally sort by the year field. It is important that each of these individual sorts be stable, so that for example the correct day order is not broken by the sort using the month field. Note that each of the three sorts could use counting sort, so the runtime of the entire radix sort process is Θ(3n) = Θ(n) where n is the number of records. Here is an example of radix sort applied to records with three date fields. Note the importance of starting from "least" and progressing to "most" significant field and that the sorts be stable (at least after the first one).
original data:     sorted by day:     sorted by mon:     sorted by year:
--------------     --------------     --------------     ---------------
mon day year       mon day year       mon day year       mon day year
--- --- ----       --- --- ----       --- --- ----       --- --- ----
 12  20 1981        06  05 1950        04  25 1947        04  25 1947
 07  10 1947        07  10 1947        06  05 1950        07  10 1947
 10  14 1952        10  12 1952        06  30 1990        06  05 1950
 06  30 1990        09  13 1952        07  10 1947        09  13 1952
 11  25 1981        10  14 1952        09  13 1952        10  12 1952
 10  12 1952        12  20 1981        10  12 1952        10  14 1952
 04  25 1947        11  25 1981        10  14 1952        11  25 1981
 09  13 1952        04  25 1947        11  25 1981        12  20 1981
 06  05 1950        06  30 1990        12  20 1981        06  30 1990
The slide shows another version designed for numbers, where the individual sorts use a digit position. If we use counting_sort for the stable sort in radix_sort, a straightforward examination of the algorithm shows that the run time is Θ(d(n+k)). Please refer to Section 8.3 of [Cormen, et al] for more discussion and interesting uses of the radix sort method.
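As a hedged sketch (the slide version may differ), here is how the 5-parameter counting_sort can drive a radix sort on unsigned integers, one digit position at a time; the function object name and details are ours:

// function object isolating the base-"base" digit at position "pos"
struct DigitOf
{
  size_t base, pos;
  DigitOf (size_t b, size_t p) : base(b), pos(p) {}
  size_t operator() (size_t x) const
  {
    for (size_t i = 0; i < pos; ++i) x /= base;
    return x % base;
  }
};

void radix_sort (size_t* A, size_t* B, size_t n, size_t base, size_t d)
// sorts A[0,n) using d base-"base" digits, least significant digit first
{
  for (size_t pos = 0; pos < d; ++pos)
  {
    counting_sort (A, B, n, base, DigitOf(base, pos));  // stable sort on digit "pos"
    for (size_t i = 0; i < n; ++i)                      // copy back for the next pass
      A[i] = B[i];
  }
}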
Radix sort can be made somewhat generic as well, applicable to any situation
where the sort is by a collection of keys (such as year/month/day/hour/second or
the bits in a binary representation of a number) that can be sorted by counting
sort.
The slide shows a summary table of properties of the sorts discussed in this chapter and an interesting reference. Note that Quicksort can be made to work on iterators with significantly less functionality than random access: bidirectional iterators in which the last element of the range can be accessed in constant time, for example, when --i is valid even for i == End().
Note that we give the run space requirements in terms of additional space, using notation such as +Θ(n) [for merge sort]. Note that omitting the "+" notation, in other words giving only an asymptotic estimate of total space usage, is almost meaningless, because Θ(n + n), Θ(n + c), and Θ(n) are all the same asymptotic estimate, so that heap sort and merge sort both have total space usage Θ(n), giving us no information to distinguish them. But heap sort is in-place, that is, +Θ(1), whereas merge sort is +Θ(n).
Note also that tail recursion in quicksort can be eliminated and the algorithm redone in such a way that the stack usage is O(log n), making it more space efficient.
The case of bit sort is interesting. We apply counting sort with a function object that isolates the bit value, looping from least to most significant bit. Counting sort uses extra space the size of the range of bit values, i.e., k = 2. Of course, Θ(2) == Θ(1) and Θ(n + 2) == Θ(n). The runspace of +Θ(n) comes from the creation of a second internal array for the calls to counting sort.
Bit sort can be optimized further: when we know the count c0 of zero values, we can get the count c1 of one values by subtraction: c1 = n - c0, so we don't need a counting array at all: just count the zero bits. And the destination array can be used back and forth, so that on odd calls to counting sort we map from the original array to the local array and on even calls we map back to the original. In the unlikely event there happen to be an odd number of bits for the integer type being sorted we make one last copy back to the original. Even the copy operation can be optimized by swapping the two pointers rather than copying the data.
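Here is a sketch of the optimized bit_sort just described (count only the zero bits, alternate source and destination arrays, and swap pointers instead of copying); names and details are illustrative:

#include <cstddef>

void bit_sort (unsigned long* A, unsigned long* B, size_t n)
// Post: A is sorted; B is scratch space of size n
{
  unsigned long* src = A;
  unsigned long* dst = B;
  const size_t bits = 8 * sizeof(unsigned long);
  for (size_t b = 0; b < bits; ++b)
  {
    size_t c0 = 0;                               // count the zero bits at position b
    for (size_t i = 0; i < n; ++i)
      if (((src[i] >> b) & 1UL) == 0) ++c0;
    size_t zpos = 0, opos = c0;                  // next open slots for 0-bit and 1-bit values
    for (size_t i = 0; i < n; ++i)               // stable distribution pass
    {
      if (((src[i] >> b) & 1UL) == 0) dst[zpos++] = src[i];
      else                            dst[opos++] = src[i];
    }
    unsigned long* tmp = src; src = dst; dst = tmp;  // swap roles instead of copying
  }
  if (src != A)                                  // odd number of bits: one last copy back
    for (size_t i = 0; i < n; ++i)
      A[i] = src[i];
}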
We conclude by mentioning byte sort and word sort. These are very similar to bit
sort, except that we isolate on a byte [word] rather than a single bit. For
64-bit numbers, bit sort has a loop of 64 calls to counting sort while byte sort
has a loop of 8 calls and word sort has a loop of 4 calls. The calls to counting
sort require 2^1 = 2, 2^8 = 256, and 2^16 = 65536
as the "range" argument, respectively.