Assignment 5

Predict the Next Play in an NFL Game

Due: 26 Nov 2013

Educational objectives: Implementing efficient data structures and comparing the performance of different data structures.

Statement of work: Implement code for the NFL play prediction problem using two different data structures, as follows. (i) Implement a good hash function for NFL plays and use the STL unordered_set to store the plays. (ii) Implement any other efficient data structure to store NFL plays. You may combine different containers, create your own container, or use STL containers and algorithms. Compare the performance of these two data structures. You will be graded on correctness and performance, so use good compiler optimization flags in your makefile. You may also use OpenMP to improve the performance of your code.
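As an illustration of part (i), here is a minimal sketch of one possible design, not a required or reference solution. It hashes only the offense team and the down, so that every play that could be similar to a query lands in the same bucket, and it scans that bucket through the unordered_set bucket interface. The names Play, PlayHash, PlayEq, PlaySet and for_each_candidate, and all field names, are hypothetical. Note the trade-off: this hash produces few distinct values (teams times downs), so buckets are large by design, and whether that counts as a "good" hash function is a judgment call.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical play record; the field names are illustrative only.
struct Play {
    std::string off;      // offense team
    std::string def;      // defense team
    int down = 0;
    int togo = 0;
    int ydline = 0;
    int minutes = 0;
    int year = 0;
    std::size_t line = 0; // line number within that year's file
    std::string type;     // play type derived from the description
};

// Hash only the fields a similar play must match exactly (offense, down).
struct PlayHash {
    std::size_t operator()(const Play& p) const {
        std::size_t h = std::hash<std::string>()(p.off);
        // Mix in the down (the widely used "hash combine" idiom).
        h ^= std::hash<int>()(p.down) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

// Two records are the same play only if they come from the same file line.
struct PlayEq {
    bool operator()(const Play& a, const Play& b) const {
        return a.year == b.year && a.line == b.line;
    }
};

using PlaySet = std::unordered_set<Play, PlayHash, PlayEq>;

// Call fn on every stored play with the given offense and down.
// Only the single bucket that such plays hash to is scanned.
template <typename Fn>
void for_each_candidate(const PlaySet& plays, const std::string& off,
                        int down, Fn fn) {
    if (plays.bucket_count() == 0) return;  // bucket() needs >= 1 bucket
    Play probe;
    probe.off = off;
    probe.down = down;
    const std::size_t b = plays.bucket(probe);
    for (auto it = plays.cbegin(b); it != plays.cend(b); ++it) {
        // Other (offense, down) pairs may share the bucket, so re-check.
        if (it->off == off && it->down == down) fn(*it);
    }
}
```

A caller would pass a lambda that applies the remaining similarity tests to each candidate.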

Deliverables: Turn in a makefile and all header (*.h) and cpp (*.cpp) files that are needed to build your software. Turn in your development log too, which should be a plain ASCII text file called LOG.txt in your project directory. This file should also contain a brief description of your hash function and your data structure. Also turn in an ASCII file, testing.txt, describing how you tested your code. This file should describe any remaining run time errors in your program. You will lose points for errors that we discover which were not identified by you in the above file. You will submit all of these as described in www.cs.fsu.edu/~asriniva/courses/DS13/HWinstructions.html.

Requirements:

Create a subdirectory called proj5.

You will need to have a makefile in this directory. In addition, all the header and cpp files needed to build your software must be present here, as well as the LOG.txt and testing.txt files.

You should implement appropriate classes for the software. Your code should be well designed and object oriented.

Your software's main task is as follows. A user will give it a list of .csv files. Each file contains the plays from every NFL game in one year. The user will then execute a sequence of queries to predict the likely next play. You may imagine, for instance, that the coach for the defense will use this to predict what the offense will do next in order to prepare for it. The queries will be of only one type in this assignment. A list query will list plays that the offense executed in similar game situations.

The software is run by the user on the command line, as follows:

Analyze Year-List, where Year-List is a non-empty list of valid years separated by whitespace. The following years are valid: 2007, 2008, 2009, 2010, 2011, and 2012. This instructs the software to analyze data for the specified years. Data for the year n is present in the file n.csv. Each line of this file contains information on a particular play, except the first line which gives field/column headings. Each field in a line is separated by commas. The relevant fields for us are the following: 2. quarter, 3. minutes remaining in the game, 5. team name for offense, 6. team name for defense, 7. down, 8. yards to go for the next down, 9. starting location for that down, 10. description of the play. Some of the fields may be empty.
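The line format described above could be split into fields roughly as follows. This is a sketch under two assumptions that the handout does not state: that no field contains an embedded or quoted comma, and that the field numbers are 1-based, so field k is found at index k-1. The function name split_csv_line is hypothetical. Empty fields are kept as empty strings so that the positions of later fields do not shift.

```cpp
#include <string>
#include <vector>

// Split one line on commas, keeping empty fields.
// ASSUMPTION: no quoted or embedded commas occur inside a field.
std::vector<std::string> split_csv_line(const std::string& line) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type comma = line.find(',', start);
        if (comma == std::string::npos) {
            fields.push_back(line.substr(start));  // last field
            break;
        }
        fields.push_back(line.substr(start, comma - start));
        start = comma + 1;
    }
    return fields;
}
```

With 1-based numbering, the down (field 7) would then be read from fields[6], after checking that the vector is long enough and the field is not empty.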
The description is a string, which we will use to determine the type of play. We give below play types of interest to us and how they are identified, based on words in the description.

Deep pass right: presence of the words 'deep', 'pass', and 'right' in the description.
Deep pass left: presence of the words 'deep', 'pass', and 'left' in the description.
Deep pass middle: presence of the words 'deep', 'pass', and 'middle' in the description.
Short pass right: presence of the words 'short', 'pass', and 'right' in the description.
Short pass left: presence of the words 'short', 'pass', and 'left' in the description.
Short pass middle: presence of the words 'short', 'pass', and 'middle' in the description.
Run to the right: presence of the word 'right' in the description, but not 'pass'.
Run to the left: presence of the word 'left' in the description, but not 'pass'.
Run to the middle: presence of the word 'middle' in the description, but not 'pass'.
Field goal attempt: presence of the words 'field' and 'goal' in the description.
Punt: presence of the word 'punts' in the description.
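The rules above can be sketched as a single classification function. Several choices here are assumptions, since the list does not settle them: matching is case-insensitive and by substring (so 'right' would also match inside a name such as 'Wright'), pass types are tested first, field goal and punt are tested before runs (so 'wide right' on a kick is not taken as a run), and directions are tried in the order right, left, middle. The function name classify_play is hypothetical, and an empty string stands for a play that is not of interest.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Return the play type for a description, or "" if it is not of interest.
// ASSUMPTIONS: case-insensitive substring matching; the precedence used
// below (pass, field goal, punt, run) is a guess, not part of the rules.
std::string classify_play(const std::string& description) {
    std::string d = description;
    std::transform(d.begin(), d.end(), d.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    auto has = [&d](const char* word) {
        return d.find(word) != std::string::npos;
    };

    const char* depth_key[] = {"deep", "short"};
    const char* depth_name[] = {"Deep", "Short"};
    const char* dirs[] = {"right", "left", "middle"};

    const bool pass = has("pass");
    if (pass) {
        for (int i = 0; i < 2; ++i) {
            if (!has(depth_key[i])) continue;
            for (const char* dir : dirs) {
                if (has(dir)) {
                    return std::string(depth_name[i]) + " pass " + dir;
                }
            }
        }
    }
    if (has("field") && has("goal")) return "Field goal attempt";
    if (has("punts")) return "Punt";
    if (!pass) {
        for (const char* dir : dirs) {
            if (has(dir)) return std::string("Run to the ") + dir;
        }
    }
    return "";
}
```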

The software first reads each file specified on the command line and stores the relevant information in each of the two data structures. You should not store plays or information that are not relevant to our program. The software then waits for a series of user inputs from stdin, and responds to each input as described below.
Possible user actions and required software response:

list n MIN OFF DEF DOWN TOGO YDLINE : The fields have the same meaning as in assignment 4. The software outputs the n most relevant similar plays. If fewer than n plays are similar, then all the similar plays are output.
Plays are considered similar as defined below. For a play to be considered similar, it should be by the same team on offense and on the same down. In addition, its yards to go should be within one yard of TOGO and its field position should be within 10% of YDLINE. If no similar play exists, then output No similar play exists to standard output (not to standard error).
Relevance is a floating point number defined as: -(|MIN-min|*5/3 + |TOGO-togo| + |YDLINE-ydline|). In addition, if the defense teams are identical, add 100 to the relevance. Upper-case names denote fields of the query, and lower-case names denote the corresponding fields of a play in the database. Each line of the output will give the type of play followed by min off def down togo ydline of the play, followed by its relevance. The plays are output in decreasing order of relevance (that is, most relevant play first). If multiple plays have the same relevance value, then the one that occurred in a later year is considered more relevant. If both plays occurred in the same year, then the one that occurred later in the file is considered more relevant.
You need to print the above results first for the hash table and then for your data structure, with a blank line after each set of results. If your implementations are correct, then both sets of results will be identical.
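The similarity test, the relevance formula and the tie-breaking order can be sketched as follows. All type and function names are hypothetical, and several readings are assumptions: every numeric field is an integer, and "within 10%" means |ydline - YDLINE| <= 0.10 * YDLINE. As a deliberate substitution, the sort key is three times the relevance, kept as an exact integer, so that ties are detected without comparing floating point values; the printed relevance is still computed from the formula as given. A plain vector scan stands in for whichever data structure supplies the candidates.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

struct Play {   // hypothetical stored record
    std::string type, off, def;
    int minutes, down, togo, ydline;
    int year;
    std::size_t line;   // line number within that year's file
};

struct Query {  // hypothetical parsed list query (without n)
    int minutes;
    std::string off, def;
    int down, togo, ydline;
};

// ASSUMPTION: "within 10%" is |ydline - YDLINE| <= 0.10 * YDLINE,
// tested in integers as 10 * |difference| <= YDLINE.
bool is_similar(const Query& q, const Play& p) {
    return p.off == q.off && p.down == q.down &&
           std::abs(p.togo - q.togo) <= 1 &&
           10 * std::abs(p.ydline - q.ydline) <= q.ydline;
}

// Three times the relevance as an exact integer, used only for ordering.
int relevance_x3(const Query& q, const Play& p) {
    const int penalty = 5 * std::abs(q.minutes - p.minutes) +
                        3 * (std::abs(q.togo - p.togo) +
                             std::abs(q.ydline - p.ydline));
    return (p.def == q.def ? 300 : 0) - penalty;
}

// Relevance as defined in the text, for printing.
double relevance(const Query& q, const Play& p) {
    return relevance_x3(q, p) / 3.0;
}

struct Scored {
    const Play* play;
    int key_x3;
};

// Higher relevance first, then later year, then later line in the file.
bool more_relevant(const Scored& a, const Scored& b) {
    if (a.key_x3 != b.key_x3) return a.key_x3 > b.key_x3;
    if (a.play->year != b.play->year) return a.play->year > b.play->year;
    return a.play->line > b.play->line;
}

// Return up to n similar plays, most relevant first.
std::vector<Scored> top_similar(const std::vector<Play>& plays,
                                const Query& q, std::size_t n) {
    std::vector<Scored> hits;
    for (const Play& p : plays) {
        if (is_similar(q, p)) hits.push_back({&p, relevance_x3(q, p)});
    }
    const std::size_t k = std::min(n, hits.size());
    std::partial_sort(hits.begin(), hits.begin() + k, hits.end(),
                      more_relevant);
    hits.resize(k);
    return hits;
}
```

Using std::partial_sort orders only the n plays that will be printed, which may help when n is much smaller than the number of similar plays.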

x: Quit the program. But before that, print statistics on the performance of each data structure as follows. The program should print the time taken for storing all the plays, the minimum time for identifying the list of relevant plays, the maximum time for identifying the list of relevant plays, and average time for identifying the list of relevant plays. An example is shown below.
myhash: store 9.1 s, list: min 0.01 s, max 0.09 s, mean 0.03 s
mydatastructure: store 1.1 s, list: min 0.001 s, max 0.02 s, mean 0.03 s
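One way to collect these statistics is sketched below, using std::chrono::steady_clock for the measurements. The names ListStats and seconds_taken are hypothetical. Reporting zeros when no list query has been run is an assumption, since the handout does not cover that case, and the default stream formatting of the numbers is likewise a guess at the expected output.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>

// Accumulates the store time and min/max/mean list-query times
// for one data structure.
class ListStats {
public:
    void set_store_seconds(double s) { store_ = s; }

    void add_query_seconds(double s) {
        min_ = (count_ == 0) ? s : std::min(min_, s);
        max_ = std::max(max_, s);
        total_ += s;
        ++count_;
    }

    // Format one line in the style of the example output.
    // ASSUMPTION: min, max and mean are reported as 0 if no query ran.
    std::string report(const std::string& name) const {
        const double mean = (count_ == 0) ? 0.0 : total_ / count_;
        std::ostringstream out;
        out << name << ": store " << store_ << " s, list: min " << min_
            << " s, max " << max_ << " s, mean " << mean << " s";
        return out.str();
    }

private:
    double store_ = 0.0, min_ = 0.0, max_ = 0.0, total_ = 0.0;
    std::size_t count_ = 0;
};

// Run fn once and return the elapsed wall-clock time in seconds.
template <typename Fn>
double seconds_taken(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Each list query would be wrapped in seconds_taken and the result passed to add_query_seconds, with one ListStats object per data structure.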

Output Invalid command to standard output for any other command.
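Recognizing a list command could look like the sketch below. The field meanings are defined in assignment 4, which is not reproduced here, so treating n, MIN, DOWN, TOGO and YDLINE as integers is an assumption, as is rejecting a line with extra trailing tokens. The names ListCommand and parse_list_command are hypothetical, and range checks on the values are left out.

```cpp
#include <sstream>
#include <string>

// Hypothetical parsed form of: list n MIN OFF DEF DOWN TOGO YDLINE
struct ListCommand {
    int n;
    int minutes;
    std::string off, def;
    int down, togo, ydline;
};

// Return true and fill out if line is a well-formed list command.
// ASSUMPTIONS: numeric fields are integers; trailing tokens are invalid.
bool parse_list_command(const std::string& line, ListCommand& out) {
    std::istringstream in(line);
    std::string word;
    if (!(in >> word) || word != "list") return false;

    ListCommand c;
    if (!(in >> c.n >> c.minutes >> c.off >> c.def >> c.down >> c.togo >>
          c.ydline)) {
        return false;
    }
    std::string extra;
    if (in >> extra) return false;  // unexpected trailing input

    out = c;
    return true;
}
```

A main loop would read each line with std::getline, handle x separately, and print Invalid command to standard output when this function returns false.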

A sample executable will not be provided. You need to develop good test cases to verify the correctness of your program. The .csv files are already available under the NFLData subdirectory of proj1.

Notes:
1. Your program should not have any output other than those specified above.
2. You may lose points if your code is very inefficient.
Bonus points:
You may get up to 50 additional points if your code is correct and the fastest in class. You may get up to 25 bonus points if your data structure or an unordered_set with your hash function works correctly and is substantially faster than our solution to assignment 1.
Copyright: Ashok Srinivasan, Florida State University.
Last modified: 6 Nov 2013