Project 1: List::Sort()

MergeSort for Linked Lists

Note: This assignment is used to assess the required outcomes for the course, as outlined in the course syllabus. These outcomes are:

analyze the computational complexity of algorithms used in the solution of a programming problem
evaluate the performance trade-offs of alternative data structures and algorithms

These will be assessed using the following rubric:

Key: I = ineffective, E = effective, H = highly effective

                        I    E    H
Performance Analysis
  Runtime Analysis      -    -    -
  Runspace Analysis     -    -    -
Tradeoff Analysis
  Comparison Sorts      -    -    -
  Numerical Sorts       -    -    -

In order to earn a course grade of C- or better, the assessment must result in Effective or Highly Effective for each outcome.

Educational Objectives: After completing this assignment, the student should be able to accomplish the following:

Implement and test MergeSort for linked lists, as an in-place sort
Improve the fsu::List class with modified implementations
Explain why MergeSort is an optimal choice for List::Sort()
Instrument an algorithm implementation [List::MergeSort] to estimate (a) its essential runtime and (b) its overhead runtime
Use a theoretical asymptotic runtime, timing data, and non-linear regression to calculate the best-fit curve approximating the runtime of an implementation of an algorithm

Part 2: Analysis of List::Sort()

Deliverables: One file

assign4.pdf

containing your analysis of List::Sort as answers to the first three questions and an analysis of the efficiency of hash tables using HashTable::Analysis as the answer to the fourth question.

Background: Curve Fitting

By a scalability curve for an algorithm implementation we shall mean an equation whose form is determined by known asymptotic properties of the algorithm and whose coefficients are determined by a least squares fit to actual timing data for the algorithm as a function of input size. For example, the merge sort algorithm, implemented as a generic algorithm named g_merge_sort(), is known to have asymptotic runtime Θ(n log n), and we will use the following for its form:

R = A + B n log n

where R is the predicted run time on input of size n. To obtain the concrete scalability curve, we need to obtain actual timing data for the sort and use that data to find optimal values for the coefficients A and B. Note this curve will depend on the implementation all the way from source code to hardware, so it is important to keep the compiler and testing platform the same in order to compare efficiencies of different sorts using their concrete scalability curves.
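A minimal sketch of how such timing data might be gathered follows. It rests on several assumptions that are not part of the assignment: std::list<int>::sort() stands in for the List::Sort() under test, and the names Sample and CollectTimings, the random input, and the fixed seed are hypothetical choices made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <list>
#include <random>
#include <utility>
#include <vector>

// One data point: (input size n, measured runtime t in seconds).
using Sample = std::pair<std::size_t, double>;

// Times a sort on random input of each requested size.
// ASSUMPTION: std::list<int>::sort() is a stand-in for the sort under test.
std::vector<Sample> CollectTimings(const std::vector<std::size_t>& sizes,
                                   unsigned seed = 12345)
{
  std::mt19937 rng(seed);                       // fixed seed for repeatable input
  std::uniform_int_distribution<int> dist(0, 1000000);
  std::vector<Sample> samples;

  for (std::size_t n : sizes)
  {
    // Build the input before starting the clock so only the sort is timed.
    std::list<int> data;
    for (std::size_t i = 0; i < n; ++i)
      data.push_back(dist(rng));

    auto start = std::chrono::steady_clock::now();
    data.sort();
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    samples.emplace_back(n, elapsed.count());
  }
  return samples;
}
```

Calling CollectTimings({1000, 2000, 4000}) would return three (n, t) pairs. In practice each size is usually timed several times and averaged, since a single measurement on a shared machine can be noisy.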

The method for finding the coefficients A and B is the method of least squares. Assume that we have sample runtime data as follows:

Input size: n₁ n₂ ... n_k

Measured runtime: t₁ t₂ ... t_k

and the scalability form is given by

f(n) = A + B g(n)

Define the total square error of the approximation to be the sum of squares of errors at each data point:

E = Σ [t_i - f(n_i)]²

where the sum is taken from i = 1 to i = k, k being the number of data points. The key observation in the method of least squares is that the total square error E is minimized where the gradient of E is zero, that is, where both partial derivatives D_AE and D_BE are zero. Calculating these partial derivatives gives:

D_XE = -2 Σ [t_i - f(n_i)] D_Xf

= -2 Σ [t_i - (A + B g(n_i))] D_Xf

(where X is A or B). This gives the partial derivatives of E in terms of those of f, which may be calculated to be:

D_Af = 1
D_Bf = g(n)

(because n and g(n) are constant with respect to A and B.) Substituting these into the previous formula and setting the results equal to zero yields the following equations:

A Σ 1 + B Σ g(n_i) = Σ t_i
A Σ g(n_i) + B Σ (g(n_i))² = Σ t_i g(n_i)

Rearranging and using Σ 1 = k yields:

k A + [Σ g(n_i)] B = Σ t_i

[Σ g(n_i)] A + [Σ (g(n_i))²] B = Σ t_i g(n_i)

These are two linear equations in the unknowns A and B. With even a small amount of luck, they have a unique solution, and thus optimal values of A and B are determined. (Here is a link to a more detailed derivation in the quadratic case.)
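For reference, the solution of this 2x2 system can also be written in closed form using Cramer's rule. The following is a worked restatement of the two equations above, not an additional requirement; the shorthand g_i = g(n_i) is introduced here for brevity.

```latex
\begin{aligned}
D &= k \sum_{i=1}^{k} g_i^{2} - \Bigl(\sum_{i=1}^{k} g_i\Bigr)^{2}, \\
B &= \frac{k \sum_{i=1}^{k} t_i g_i - \Bigl(\sum_{i=1}^{k} g_i\Bigr)\Bigl(\sum_{i=1}^{k} t_i\Bigr)}{D}, \\
A &= \frac{\sum_{i=1}^{k} t_i - B \sum_{i=1}^{k} g_i}{k}.
\end{aligned}
```

The determinant D equals k times the sum of squared deviations of the g_i from their mean, so it is zero only when every g(n_i) has the same value. Timing the sort at two or more distinct input sizes is therefore enough to guarantee a unique solution.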

Note that all of the coefficients in these equations may be calculated from the original data table and knowledge of the function g(n), in a spreadsheet or in a simple stand-alone program. The solution to the system of equations itself is probably easiest to find by hand, by row-reducing the 2x3 augmented matrix to upper triangular form and then back-substituting.
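The calculation described above might be sketched as a small stand-alone routine like the following. The names Fit, NLogN, FitCurve, and TotalSquareError are hypothetical, and the use of log base 2 in g(n) = n log n is an assumption (a different base only rescales B). The code forms the four sums, row-reduces the augmented matrix, and back-substitutes.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

struct Fit { double A; double B; };

// g(n) = n log n; base 2 is an assumption made for this sketch.
double NLogN(double n) { return n * std::log2(n); }

// Least squares fit of t ~ A + B g(n) by solving the two normal equations.
Fit FitCurve(const std::vector<double>& n, const std::vector<double>& t,
             double (*g)(double) = NLogN)
{
  if (n.size() != t.size() || n.size() < 2)
    throw std::invalid_argument("need at least two matching data points");

  const double k = static_cast<double>(n.size());
  double Sg = 0.0, Sgg = 0.0, St = 0.0, Stg = 0.0;
  for (std::size_t i = 0; i < n.size(); ++i)
  {
    const double gi = g(n[i]);
    Sg  += gi;          // sum of g(n_i)
    Sgg += gi * gi;     // sum of g(n_i)^2
    St  += t[i];        // sum of t_i
    Stg += t[i] * gi;   // sum of t_i g(n_i)
  }

  // Augmented matrix:  [ k   Sg  | St  ]
  //                    [ Sg  Sgg | Stg ]
  // Row 2 -= (Sg / k) * Row 1 gives upper triangular form.
  const double m = Sg / k;
  const double pivot = Sgg - m * Sg;
  const double rhs = Stg - m * St;
  if (std::fabs(pivot) <= 1e-12 * Sgg)
    throw std::runtime_error("no unique solution: all g(n_i) are (nearly) equal");

  // Back-substitution.
  const double B = rhs / pivot;
  const double A = (St - Sg * B) / k;
  return {A, B};
}

// Total square error E of a fit, useful as a check on the result.
double TotalSquareError(const std::vector<double>& n, const std::vector<double>& t,
                        const Fit& fit, double (*g)(double) = NLogN)
{
  double E = 0.0;
  for (std::size_t i = 0; i < n.size() && i < t.size(); ++i)
  {
    const double err = t[i] - (fit.A + fit.B * g(n[i]));
    E += err * err;
  }
  return E;
}
```

As a sanity check, data generated exactly from t = 1 + 0.5 n log₂ n at n = 2, 4, 8, 16 (that is, t = 2, 5, 13, 33) should yield A close to 1, B close to 0.5, and a total square error close to zero.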

Procedural Requirements

Answer questions 1-4 in Assignment 4.
