Project 2: Sort Analysis

Note: This assignment will be used to assess the required outcomes for the course, as outlined in the course syllabus. These outcomes are:

analyze the computational complexity of algorithms used in the solution of a programming problem
evaluate the performance trade-offs of alternative data structures and algorithms

These will be assessed using the following rubric:

I E H
Key:
I = ineffective
E = effective
H = highly effective

Performance Analysis

Runtime Analysis - - -

Runspace Analysis - - -

Tradeoff Analysis

Comparison Sorts - - -

Numerical Sorts - - -

In order to earn a course grade of C- or better, the assessment must result in Effective or Highly Effective for each outcome.

Educational Objectives: On successful completion of this assignment, the student should be able to

Implement a variety of comparison sort algorithms as generic algorithms, including Insertion Sort, Selection Sort, Heap Sort, Merge Sort, and Quick Sort, re-using code as much as possible from the course library. The implementations should cover both default order and order by passed predicate object.
Discuss the capabilities and use constraints for each of these generic algorithms, including assumptions on assumed iterator type, worst and average case runtimes.
Implement the Counting Sort algorithm for specified arrays of integers
Implement the Counting Sort algorithm as a template function taking function object parameter that is used to define the sort value of the input integers, obtaining Bit Sort and Reverse Sort as special cases.
Collect data and use the method of least squares to find a best fit scalability curve for each sort algorithm (including Counting Sort), based on a form derived from the known asymptotic runtime for the algorithm.
Perform a comparative qualitative analysis of these algorithms using asymptotic runtime analysis as well as quantitative analysis using data collected from implementations.

Background Knowledge Required: Be sure that you have mastered the material in these chapters before beginning the project: Sequential Containers, Function Classes and Objects, Iterators, Generic Algorithms, Generic Set Algorithms, Heap Algorithms, and Sorting Algorithms

Part I: Generic Sort Algorithms

Operational Objectives: Implement various comparison sorts as generic algorithms, with the minimal practical constraints on iterator types. Each generic comparison sort should be provided in two froms: (1) default order and (2) order supplied by a predicate class template parameter.

Also implement some numerical sorts as template functions, with the minimal practical constraints on template parameters. Again there should be two versions, one for default order and one for order determined by a function object whose class is passed as a template parameter.

The sorts to be developed and tested are selection sort, insertion sort, heap sort, merge sort, quick sort, counting sort, bit sort, byte sort, and word sort.

Deliverables: Two files:

gsort.h         # contains the generic algorithm implementations of comparison sorts
nsort.h         # contains the numerical sorts and classes Bit, Byte, and Word

Procedural Requirements

The official development, testing, and assessment environment is g++47 -std=c++11 -Wall -Wextra on the linprog machines. Code should compile without error or warning.
Develop and fully test all of the sort algorithms listed under requirements below. Make certain that your testing includes "boundary" cases, such as empty ranges, ranges that have the same element at each location, and ranges that are in correct or reverse order before sorting. Place all of the generic sort algorithms in the file gsort.h and all of the numerical sort algorithms in the file nsort.h. Your test data files should have descriptive names explaining their content.
Turn in gsort.h and nsort.h using the script LIB/proj2/proj21submit.sh.

Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.

Note that Parts 1 and 2 have different due dates.

Code Requirements and Specifications

The two sort algorithm files are expected to operate using the supplied test harnesses: fgsort.cpp (tests gsort.h) and fnsort.cpp (tests nsort.h). Note that this means, among other things, that:
1. All generic sorts in gsort.h should work with ordinary arrays as well as iterators of the appropriate category
2. Both classic (4-argument) counting_sort and the 5-argument version should work
3. bit_sort should work with the class Bit defined in nsort.h
4. byte_sort should work with the class Byte defined in nsort.h
5. word_sort should work with the class Word defined in nsort.h
The comparison sorts should be implemented as generic algorithms with template parameters that are iterator types.
Each comparison sort should have two versions, one that uses default order (operator < on I::ValueType) and one that uses a predicate object whose type is an extra template parameter.
Some of the comparison sorts will require specializations (for both the default and predicate versions) to handle the case of arrays and pointers, for which I::ValueType is not defined.
Re-use as many components as possible, especially existing generic algorithms such as g_copy (in genalg.h), g_set_merge (in gset.h), and the generic heap algorithms (in gheap.h).

Two versions of counting_sort should be implemented: the classic 4-parameter version, plus one that takes a function object as an argument. Here is a prototype for the 5-parameter version:

template < class F >
void counting_sort(const int * A, int * B, size_t n, size_t k, F f)
// Pre:  A,B are arrays of type unsigned int
//       A,B are defined in the range [0,n)
//       f is defined for all elements of A and has values in the range [0,k)
// Post: A is unchanged
//       B is a stable f-sorted permutation of A
//       I.e., i < j ==> f(B[i]) <= f(B[j])

Test and submit both versions of counting_sort.

Also test and submit specific instantiations of radix sort called bit_sort, byte_sort, and word_sort.
1. bit_sort is implemented using a call to counting_sort with an object of type Bit:
```
  template <typename N>
  class Bit
  {
  public:
    size_t operator () (N n)
    {
      return (0 != (mask_ & n)); // the bit at the offset location
    }
    Bit() : mask_(static_cast(0x00)) {}
    void SetBit(unsigned char i)
    {
      mask_ = (static_cast(0x01) << i);  // the ith bit
    }
  private:
    N mask_;
  };
```
  The template parameter represents an integer type. bit_sort is implemented as a loop of calls to counting_sort at each bit (increasing in significance). Note that the size of N can be calculated and used to limit the length of the loop.
2. byte_sort is implemented using a call to counting_sort with an object of type Byte:
```
  template <typename N>
  class Byte
  {
  public:
    size_t operator () (N n)
    {
      return ((n >> offset_) & 0xFF); // the byte at the offset location
    }
    Byte() : offset_(static_cast(0x00)) {}
    void SetByte(unsigned char i)
    {
      offset_ = static_cast(i << 3); // the ith byte = offset*8
    }
  private:
    N offset_;
  };
```
  Again the template parameter represents an integer type. byte_sort is implemented as a loop of calls to counting_sort at each byte (increasing in significance). Again, the size of N can be calculated and used to limit the length of the loop.
3. word_sort is implemented using a call to counting_sort with an object of type Word. Developing this class and the word_sort algorithm is left to your creativity.
Test and submit bit_sort, byte_sort, and word_sort (in file nsort.h).

Hints

g_heap_sort is already done and distributed in LIB/tcpp/gheap.h. There are three slightly different heap sort algorithms implemented: fsu::g_heap_sort, which is discussed in the lecture notes; alt::g_heap_sort, which uses a different "make heap" algorithm; and cormen::g_heap_sort, which is the version discussed in the Cormen text. At some point before midterm exams, you should understand the distinctions among the three.
Similarly, g_selection_sort is fully implemented in gsort_stub.h. The prototypes for the default and predicate versions should be useful as models for the other generic comparison sorts.
You will need specializations for some generic sort algorithms (g_insertion_sort and g_merge_sort) so that they work with arrays (raw pointers), because the generic versions use the iterator feature I::ValueType that pointers do not have.
The 3-parameter version of counting_sort, along with a start on other code, is given in nsort_stub.h. Note that there is a template parameter N in all of the numerical sort implementations that represents the integer type being used. This allows the compiler to select a type based on usage, a nice efficiency since sizeof(N) is a limit on loop size in several of the applications. The 4-parameter version of counting_sort will thus have two parameters: N (the number type) and F (the function class).

The following is a summary of the code files that are supplied in LIB/proj2 needed for Part I:

fgsort.cpp      # functionality test for all of the generic sorts in gsort.h
fnsort.cpp      # functionality test for all of the numeric sorts in nsort.h
gsort_stub.cpp  # contains some complete implementations and other partial implementations
nsort_stub.cpp  # contains some complete implementations and other partial implementations

TAKE NOTES! Use either an engineers lab book or (thoughtfully named) text files to keep careful notes on what you do and what the results are. Date your entries. This will be of immense assistance when you are preparing your report. In real life, these could be whipped out when that argumentative know-it-all starts to question the validity of your report.

Part II: Sort Algorithm Data Collection

Operational Objectives:

Step 1: Problem Selection. Begin by selecting one of the analysis problems for your work:

Curve-Fitting. Use a theoretical review to assign a "form" to each sort algorithm, and then use the method of least squares (aka regression) on actual timing data to find coefficients for a best-fit curve in the form. See curve_fitting for more details.
Optimal Cutoff for Recursive Sorts. Recursive sort algorithms tend to make many recursive calls on small or empty ranges. There is usually a point where these calls make the recursive algorithm less effective than a simple non-recursive sort such as insertion_sort. Use a combination of runtime theory and practical experiment to find the "optimal cutoff size" for switching from the recursive algorithm to a call to insertion_sort, for: merge_sort and quick_sort. Submit revised code for g_merge_sort and g_quick_sort that implements this cutoff.
Sorting Almost Sorted Data. When data is "almost sorted", with only a few (say k) items out of place, discuss the pros and cons of the various sort algorithms. In particular, devise an analysis of insertion_sort for almost sorted data in terms of n (the size of the data set) and k (the number of items not already in order). If you prefer, you could re-phrase the analysis in terms of the average number of places each element is "out of position" from sorted data.
Key-Comp v Numerical Sorting. Given that key comparison sorts cannot run faster than Ω(n log n) and that the numerical sorts have runtime O(n), eventually the numerical sorts must be faster for sufficiently large n. Use actual timing data to estimate the value of n where this change takes place, and also discuss the tradeoffs involved, including memory use. Which of the numerical sorts are practical for these very large data sets?
String Sorts. Discuss the pros and cons of sort algorithms designed specifically for strings, compared to the general-purpose sort algorithms. Consider at least two string sorts: LSD and MSD.

Step 2: Data Collection Plan. Create a plan to collect data for analysis for your chosen analysis problem. This will involve creation of data files, timing data, and/or comp_count data, appropriate for analysis of all of the sorts. The plan should be outlined in data_collection_plan.txt, and makefiles for creating input data and output results should be created that support the plan. The plan should support the analyses you have chosen to do.

Deliverables: Five files:

data_collection_plan.txt # text file describing the data that will be collected and
                         # the rational for the choices
                         # included in sort_analysis as an Appendix
makefile.files.*   # create input data files used for your analysis 
makefile.times.*   # create output timing data used in your analysis
makefile.counts.*  # create output comp_count data used in your analysis

Note that you may have several suffixes for the makefiles. (See Hints below.)

Procedural Requirements

Choose your topic, either from the list in step 1 above or another topic (cleared with the instructor).
Devise a plan to collect data using a CPU timing system (and optionally comparison counters) to obtain appropriate timing / comp_count data to support your analysis. Input sizes should range from small to substantially large. Also qualitative aspects of the data may vary, for example data with many repeats, data that is almost sorted, data with bounded values, and completely random data. Be sure that you have specific questions you want to research and answer by analizing the collected data. Outline the data collection plan, including the questions to be researched, in the text file named "data_collection_plan.txt".
Create makefiles that create the data described in data_collection_plan.txt. Name these makefiles makefile.files, makefile.times, and (optionally) makefile.counts.
Turn in Turn in data_collection_plan.txt, makefile.files, makefile.times, and (optionally) makefile.counts using the script LIB/proj2/proj22submit.sh.

Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.
Note: That your plan and collection makefiles may be changed as you get deeper into the project, just resubmit whenever changes occur.

Hints

Several code files that support data collection are provided:

sorttimer.cpp      # times sort algorithms on input data file
sortspy.cpp        # adds comp_count data for comparison sorts (times are inflated)
ranuint.cpp        # generates file of randum unsigned integers
makefile.files.eg  # sample creates input data files
makefile.times.eg  # sample creates timing data files

The data file nomenclature uses a base file name, such as "random" or "dupes" that you can set for a series of data files. The suffix is the count of items in the file.

Your file naming system should reflect the nature of the contents of the data files. For example, "ran" could be the base name for a series of data files with unconstrained random data. Then "ran.1 ran.10, ran.100, ran.1000, ..., ran.1000000" would be a series of data files of sizes 1, 10, 100, 1000, ... , 1000000 generated by "makefile.files.ran". This, or some other mechanism described carefully in data_collection_plan.txt.
Be sure that your timing data collection plan uses file naming conventions that are compatible with your data file names. For example, your timing data files created by "makefile.times.ran" could have the base name "time.ran" and be names "times.ran.1, times.ran.10, ..." with concatenated results in "times.ran".

Part III: Sort Analysis

Operational Objectives: Perform your analyses of sort algorithms and write the report. The data collection plan submitted in Part II should be followed [or revised, resubmitted, and followed], various analyses completed, and a paper written on your findings. The paper should be named sort_analysis.pdf. Guidelines for the structure pof the paper are given below and should be followed.

Deliverables: One file:

sort_analysis.pdf   # your Assignment 5 report

Procedural Requirements

Read the analysis and report guidelines below.
Collect data according to your data collection plan, perform your analyses, and write your paper.
Turn in sort_analysis.pdf using the script LIB/proj2/proj23submit.sh.

Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.
Also submit your report sort_analysis.pdf to the Blackboard course web site.

Analysis and Report Requirements

Be sure that you use and document good investigation habits by keeping careful records of your analysis and data collection activities.
Before beginning any data collection, think through which versions of sorts you are going to test. These should ideally be versions that are most comparable across all of the sorts and for which the "container overhead" is as low as practical. This means using the array case for all sorts.
You also need to plan what kinds and sizes of data sets you will use. It is generally advisable to create these in advance of collecting runtime data and to use the same data sets for all of the sorts, to reduce the effects of randomness in the comparisons. On the other hand, the data sets themselves should "look random" so as not to hit any particular weakness or strenght of any particular algorithm. For example: if a data set has size 100,000 but consists of integers in the range 0 .. 1000 then there will be lots of repeats in the data set, which could be bad for quicksort.
Generally, it is best to use unsigned integer data for the data sets, so that they can be consumed by all of the sorts, including the numerical sorts.
If you use a multi-user machine to collect data, there will be the possibility that your timings are exaggerated by periods when your process is idled by the OS. One way to compensate for this is to do several runs and use the lowest time among all of the runs in your analysis. Most likely you will need to collect your data using linprog, because the random number generator needs 64 bit words.
The framework of (pseudo) random object generators in LIB/cpp/xran.* has been upgraded to provide 32 bit integers. To simplify your tasks (and ensure some uniformity in the raw data) we supply a random unsigned int generator proj2/ranuint.cpp. This should compile and run on linprog (but not program, due to word size restrictions).
The CPU timing framework in LIB/cpp/timer.* can be used to collect timing data. Again with the goals of simplifying your work load and ensuring more uniformity, a timing program is supplied in proj2/sorttimer.cpp. Like ranuint.cpp, this program requires 64-bit architecture.
The supplied sort timer program outputs time in milliseconds (1ms = sec/1000). You can change your scale for your report if you like. Whatever units are chosen, you will need to deal with large differences of elapsed time, ranging over several orders of magnitude. Some displays may need log scaling in the vertical axis.
Analysis should be done in two senses. First, provide an asymptotic analysis that results in the scalability curve forms shown above. This step is of course independent of platform and programming language. Formal analysis is not required, but an informed and informative discussion is expected. Second, collect data on actual runs of the sorts and use that data to support your findings. If you are doing the curve-fitting project, find a best-fit concrete scalability curve using the form derived above. This curve will depend on almost any choice made, so it is important to use the same choices across the sorts being analyzed and to eliminate irrelevant overhead costs as much as possible: same input data, same machine, simplest data structures. Other projects would follow similar guidelines, as appropriate for the problem at hand.
The tools that are supplied should give you more time to think about the data: what kind of test data to generate, what kind of test data to collect, and have a good plan to accomplish that. Be sure to address these issues in your report as well.
Your report should be structured something like the following outline. You are free to change the titles, organization, and section numbering scheme.
1. Abstract or Executive Summary
  [brief, logical, concise overview of what the report is about and what its results and conclusions/recommendations are; this is for the Admiral or online library]
2. Introduction
  [A narrative overview "story" of what the paper is about: what, why, how, and conclusions, including how the paper is organized]
3. Background
  [what knowledge underpins the paper, such as theory, in this case the known asymptotic runtimes of the sorts, with references, and the statistical procedure to be used, with references]
4. Theoretical Analysis
  [asymptotic runtime analysis of each sort, concluding with what form to use in fitting a curve to runtime data.]
5. Data Analysis Process or Procedure
  [details on what data decisions are made and why; how input data is created; how timing data is collected; and how analysis is accomplished, including references for any software packages that are used and detailed documentation on any software used for regression]
6. Analysis Results
  [give results in both tabular and graphical form] Illustrate all major conclusions and recommendations graphically where appropriate, for example with a single figure comparing concrete scalability curves superimposed in the same graphical display.
7. Conclusions
  [use results, including comparative tabular and/or graphical displays, to back up your conclusions]
8. Appendix 1
  Give complete details on calculation of all coefficients.
9. Appendix 2
  Give tables of all collected data.
10. Appendix 3
  Give detailed descriptions of all input files, including how they were built, there size, and constraints on content. (Do not put the actual input files in the report.)
Reading your report should make it clear how to use the test functions and how data was collected from them.