Project 4: Movie Match

Final Algorithms Project

Educational Objectives: After completing this assignment, the student should be able to accomplish the following:

Define and implement graph classes
Implement the adjacency list representation for a graph
Design and implement graph algorithms operating on a standard graph interface
Accurately describe the Breadth First Search [BFS] algorithm
Accurately describe the Depth First Search [DFS] algorithm
Efficiently implement BFS operating on a standard graph API
Efficiently implement DFS operating on a standard graph API
Analyze the runtime and runspace requirements of DFS on adjacency list representations
Analyze the runtime and runspace requirements of BFS on adjacency list representations
Analyze the runtime and runspace requirements of DFS on adjacency matrix representations
Analyze the runtime and runspace requirements of BFS on adjacency matrix representations
Describe and implement the Symbol Graph
Define bipartite graphs
Explain the basic conclusions about path lengths in bipartite graphs
Describe the back-end design for the Movie Match game solver

Operational Objectives: Design and implement the following classes:

Breadth First Survey
Depth First Survey (3-teams only)
Symbol Graph
Movie Match Game

You may have teams of 2 or 3 people. The team should compose a brief summary of work that explains the responsibilities and work products each member of the team accomplished. Also each team member should submit the project individually. Please make certain that the submissions for each member of a team are identical.

Deliverables: Files:

readme.txt     # required for all
bfsurvey.h     # required for all
dfsurvey.h     # required for 3-teams only
symgraph.h     # required for all
moviematch.h   # required for all
makefile       # builds all executables in project, including tests

Procedural Requirements

The official development | testing | assessment environment is gnu g++ on the linprog machines.
Each member of a team submits all team deliverables

Deliverables submitted should be identical across all team members.

The team makeup is listed in the file header documentation of each submitted file (see C++ Style link for standards)

File readme.txt explains how the software was developed, what responsibilities each team member had, how it was tested, and how it is expected to be operated.

Copy all of the test harnesses and graph files from LIB/proj4:

fgraph.cpp     # general test for graph classes
ftopsort.cpp   # another test for directed graphs
fbfsurvey.cpp  # used for fbfsurvey_ug.cpp and fbfsurvey_dg.cpp
fdfsurvey.cpp  # used for fdfsurvey_ug.cpp and fdfsurvey_dg.cpp
KevinBacon.cpp # client program for MovieMatch

Copy the file LIB/proj4/proj4submit.sh into your project directory, change its permissions to executable, and submit the project by executing the script.

Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.

Code Requirements and Specifications - ALGraph

Class ALUGraph implements the adjacency list representation of a graph whose vertices are assumed to be unsigned integers 0,1,...,n-1. The interface should conform to:

namespace fsu
{
  template < typename N >
  class ALUGraph
  {
  public:
    typedef N      Vertex;
    typedef xxxxx  AdjIterator;

    void   SetVrtxSize  (N n);
    void   AddEdge      (Vertex from, Vertex to);
    size_t VrtxSize     () const;
    size_t EdgeSize     () const;
    size_t OutDegree    (Vertex x) const;
    size_t InDegree     (Vertex x) const;
    AdjIterator Begin   (Vertex x) const;
    AdjIterator End     (Vertex x) const;

    ALUGraph ( );
    ALUGraph ( N n );
  ...
  };
} // namespace fsu

where xxxxx is a type that you define. This is an iterator for the adjacency list, which could be fsu::List<Vertex>::ConstIterator, std::list<Vertex>::const_iterator, or some other type. The directed graph API is exactly the same (but for the name of the class):

namespace fsu
{
  template < typename N >
  class ALDGraph
  {
  public:
    typedef N      Vertex;
    typedef xxxxx  AdjIterator;

    void   SetVrtxSize  (N n);
    void   AddEdge      (Vertex from, Vertex to);
    size_t VrtxSize     () const;
    size_t EdgeSize     () const;
    size_t OutDegree    (Vertex x) const;
    size_t InDegree     (Vertex x) const;
    AdjIterator Begin   (Vertex x) const;
    AdjIterator End     (Vertex x) const;

    ALDGraph ( );
    ALDGraph ( N n );
  ...
  };
} // namespace fsu

Much of the implementation code for the undirected and directed cases is identical, so it can be profitable to derive one of these from the other. In the derived class, only AddEdge, EdgeSize, and InDegree require re-definition.

Begin(x) returns an AdjIterator, a forward ConstIterator that iterates through the adjacency list of the vertex x. End(x) returns the end iterator of that adjacency list. So, the loop
```
for (typename GraphType::AdjIterator i = g.Begin(x); i != g.End(x); ++i)
{/*   do something at the vertex *i   */}
```
encounters all of the vertices adjacent from x in the (directed or undirected) graph g.
The template argument is some unsigned integer type. We are using templates mainly as a convenience so that member functions will not be compiled (or even require implementation) if they are not called by client code.
Test graph classes thoroughly using fgraph.cpp.
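
As an illustration only, the interface above can be met with a vector of lists. The sketch below is not the distributed fsu implementation: it lives in a hypothetical namespace sketch, uses std::vector and std::list for storage, and the helper CountAdjacent is an invented name that simply exercises the Begin/End loop shown earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

namespace sketch  // hypothetical namespace, not the course library
{
  template < typename N >
  class ALUGraph
  {
  public:
    typedef N                                          Vertex;
    typedef typename std::list<Vertex>::const_iterator AdjIterator;

    ALUGraph () : al_(), edges_(0) {}
    explicit ALUGraph (N n) : al_(static_cast<std::size_t>(n)), edges_(0) {}

    void SetVrtxSize (N n) { al_.resize(static_cast<std::size_t>(n)); }

    // undirected: record the edge in both adjacency lists, count it once
    void AddEdge (Vertex from, Vertex to)
    {
      assert(from < al_.size() && to < al_.size());
      al_[from].push_back(to);
      al_[to].push_back(from);
      ++edges_;
    }

    std::size_t VrtxSize  () const { return al_.size(); }
    std::size_t EdgeSize  () const { return edges_; }
    std::size_t OutDegree (Vertex x) const { return al_[x].size(); }
    std::size_t InDegree  (Vertex x) const { return al_[x].size(); } // equals out-degree here
    AdjIterator Begin (Vertex x) const { return al_[x].begin(); }
    AdjIterator End   (Vertex x) const { return al_[x].end(); }

  private:
    std::vector< std::list<Vertex> > al_;    // al_[x] = vertices adjacent from x
    std::size_t                      edges_; // number of AddEdge calls
  };

  // client-side loop over one adjacency list, using only the public interface
  template < class G >
  std::size_t CountAdjacent (const G& g, typename G::Vertex x)
  {
    std::size_t count = 0;
    for (typename G::AdjIterator i = g.Begin(x); i != g.End(x); ++i)
      ++count;
    return count;
  }
} // namespace sketch
```

A directed variant under the same assumptions would push only onto al_[from] in AddEdge and would have to compute InDegree by scanning all of the lists, which is consistent with the remark above that only AddEdge, EdgeSize, and InDegree need re-definition in a derived class.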

Code Requirements and Specifications - Algorithms

Algorithms should operate on ALGraph objects via the interface defined above, so that another team's version of ALGraph can be substituted without modification.
Algorithms should be class templates (in line with the graph class template). See discussion of algorithm classes in the Graphs 1 Lecture Notes.
Test algorithms (surveys) thoroughly using the supplied survey tests.
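
To make the "operate via the interface" requirement concrete, here is a minimal sketch of a breadth-first search written as a class template over any graph type that supplies Vertex, AdjIterator, VrtxSize, Begin, and End. The names BFSketch and TinyGraph are hypothetical, and the actual survey classes described in the lecture notes may record more data (color, discovery time) and use a different API.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <queue>
#include <vector>

// Minimal stand-in graph exposing only what the search needs (hypothetical).
struct TinyGraph
{
  typedef std::size_t                       Vertex;
  typedef std::list<Vertex>::const_iterator AdjIterator;

  explicit TinyGraph (std::size_t n) : al(n) {}
  void AddEdge (Vertex x, Vertex y)
  {
    assert(x < al.size() && y < al.size());
    al[x].push_back(y);
    al[y].push_back(x);
  }
  std::size_t VrtxSize () const { return al.size(); }
  AdjIterator Begin (Vertex x) const { return al[x].begin(); }
  AdjIterator End   (Vertex x) const { return al[x].end(); }

  std::vector< std::list<Vertex> > al;
};

template < class G >
class BFSketch
{
public:
  typedef typename G::Vertex Vertex;

  explicit BFSketch (const G& g)
    : distance(g.VrtxSize(), g.VrtxSize()),
      parent  (g.VrtxSize(), static_cast<Vertex>(g.VrtxSize())),
      g_(g)
  {}

  // breadth-first search from s over vertices not yet discovered
  void Search (Vertex s)
  {
    const std::size_t none = g_.VrtxSize();
    if (distance[s] != none) return;     // s was reached by an earlier search
    std::queue<Vertex> q;
    distance[s] = 0;
    q.push(s);
    while (!q.empty())
    {
      Vertex x = q.front();
      q.pop();
      for (typename G::AdjIterator i = g_.Begin(x); i != g_.End(x); ++i)
      {
        if (distance[*i] == none)        // first time *i is seen
        {
          distance[*i] = distance[x] + 1;
          parent[*i]   = x;
          q.push(*i);
        }
      }
    }
  }

  std::vector<std::size_t> distance; // value VrtxSize() means "not reached"
  std::vector<Vertex>      parent;   // value VrtxSize() means "no parent"

private:
  const G& g_;
};
```

Because the search touches the graph only through Begin, End, and VrtxSize, any conforming graph class can be substituted for TinyGraph. On an adjacency list representation each vertex is enqueued at most once and each adjacency list is scanned once.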

Code Requirements and Specifications - SymbolGraph

Class SymbolGraph implements a graph whose vertices are symbols (typically strings). The API is largely the same as that of the abstract graph classes discussed above, with the additional ability to adjust the vertex size "on the fly" using the Push() operation.

namespace fsu
{
  template < typename S , typename N >
  class SymbolGraph
  {
  public:
    typedef S      Vertex;
    typedef xxxxx  AdjIterator;

    void   SetVrtxSize  (N n);
    void   AddEdge      (Vertex from, Vertex to);
    size_t VrtxSize     () const;
    size_t EdgeSize     () const;
    size_t OutDegree    (Vertex x) const;
    size_t InDegree     (Vertex x) const;
    AdjIterator Begin   (Vertex x) const;
    AdjIterator End     (Vertex x) const;

    void   Push         (const S& s); // add s to the vertex set

    // access to underlying data
    const ALUGraph<N>&      GetAbstractGraph() const; // reference to g_
    const HashTable<S,N,H>& GetSymbolMap() const; // reference to s2n_
    const Vector<S>&        GetVertexMap() const; // reference to n2s_

    SymbolGraph ( );
    SymbolGraph ( N n );
    ...
  private:
    ALUGraph<N>      g_;
    HashTable<S,N,H> s2n_;
    Vector<S>        n2s_;
    ...
  };
} // namespace fsu

where xxxxx is the adjacency iterator type. There is a directed version SymbolDirectedGraph<S,N> whose implementation is almost identical to the undirected case, except using ALDGraph<N> as the abstract graph underpinning.

The template arguments are S = SymbolType and N = IntegerType. S is the type for the names of vertices, and is typically some form of string. N is the parameter to instantiate the underpinning abstract graph.
s2n_ is an associative array, or mapping, from symbols to vertices in the abstract graph g_. n2s_ is the inverse mapping from vertices in g_ to symbols. The symbol graph uses the two mappings to translate symbols to abstract vertices and calls operations in the abstract graph.
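
The two-mapping idea can be sketched with standard containers. This is an assumption-laden illustration, not the required class: SymbolGraphSketch is a hypothetical name, std::unordered_map and std::vector stand in for HashTable and Vector, a plain vector of lists stands in for ALUGraph<N>, and having AddEdge call Push on unknown symbols is a design choice of the sketch rather than something the specification states.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

class SymbolGraphSketch
{
public:
  SymbolGraphSketch () : edges_(0) {}

  // add s to the vertex set if it is not already present
  void Push (const std::string& s)
  {
    if (s2n_.find(s) != s2n_.end()) return;
    s2n_[s] = n2s_.size();                   // next unused abstract vertex number
    n2s_.push_back(s);                       // inverse mapping
    al_.push_back(std::list<std::size_t>()); // empty adjacency list for the new vertex
  }

  // translate both symbols, then add the undirected edge in the abstract graph
  void AddEdge (const std::string& from, const std::string& to)
  {
    Push(from);
    Push(to);
    std::size_t x = s2n_[from];
    std::size_t y = s2n_[to];
    assert(x < al_.size() && y < al_.size());
    al_[x].push_back(y);
    al_[y].push_back(x);
    ++edges_;
  }

  std::size_t VrtxSize () const { return n2s_.size(); }
  std::size_t EdgeSize () const { return edges_; }

  // unknown symbols are reported as degree 0
  std::size_t OutDegree (const std::string& s) const
  {
    std::unordered_map<std::string, std::size_t>::const_iterator i = s2n_.find(s);
    return i == s2n_.end() ? 0 : al_[i->second].size();
  }

  const std::vector<std::string>& GetVertexMap () const { return n2s_; }

private:
  std::vector< std::list<std::size_t> >        al_;    // abstract graph (role of g_)
  std::unordered_map<std::string, std::size_t> s2n_;   // symbol -> vertex number
  std::vector<std::string>                     n2s_;   // vertex number -> symbol
  std::size_t                                  edges_;
};
```

Every symbol-level operation follows the same pattern: look the symbol up in s2n_, call the corresponding operation on the abstract graph, and use n2s_ to translate any vertex numbers in the result back to symbols.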

Code Requirements and Specifications - MovieMatch

MovieMatch should provide services required by KevinBacon.cpp. This will require the following (partial) class definition:

class MovieMatch
{
public:

  MovieMatch (const char* baseActor) : baseActor_(0)
  {
    size_t length = strlen(baseActor);
    baseActor_ = new char [length + 1];
    baseActor_[length] = '\0';
    strcpy (baseActor_,baseActor);
  }

  void Load (const char* filename);
  // loads a movie/actor table

  unsigned long MovieDistance(const char* actor);
  // returns the number of movies required to get from actor to baseActor_

  ...

private:
  char* baseActor_;
  SymbolGraph < fsu::String , size_t > sg_;
  ...
};

(The names can be your choice, except for those required by the distributed client program.)

If you prefer you may build the symbol graph directly in MovieMatch.

The underlying graph should be built from the "database" provided in the text file movies.txt. Each line of this file represents a movie and the actors in the movie. Forward slash '/' is used to delimit the strings representing movie titles and actor names in each line.

It will be helpful to use either the cstring library or std::string to read entire lines and break them up into strings using the '/' delimiter, so that spaces are captured. We will distribute a client program for MovieMatch that illustrates this approach by allowing actor names (with blanks) to be entered through the keyboard.

Movie Distance and Kevin Bacon

The Kevin Bacon game is this: given an actor by name, what is his/her Kevin Bacon number?

To solve this we first need a clear definition of the Kevin Bacon number for an actor, or more generally, the movie distance between two actors. The definition is much like the path distance between two vertices in a graph, except using movie chains instead of edges.

A movie chain from actor x to actor y is a sequence m₁ m₂ ... m_k such that

m_j and m_(j+1) have an actor in common for 0 < j < k

x is in movie m₁

y is in movie m_k

The movie distance md(x,y) is defined to be the number of movies in a shortest movie chain from x to y. If there is no movie chain from x to y, we define md(x,y) = infinity.

The Kevin Bacon number of an actor x is the movie distance from x to Kevin Bacon.

Some consequences are:

Kevin Bacon has Kevin Bacon number 0.
All other actors have Kevin Bacon number at least 1.
If x != y and x and y are in the same movie, then md(x,y) = 1.
md(x,z) <= md(x,y) + md(y,z)

The actor-movie graph

To solve the Kevin Bacon game (or any other similar game based on another actor) we use graphs. Specifically, create a graph in which both actors and movies are vertices, and insert an edge whenever an actor is in a movie. Thus each edge has an actor for one vertex and a movie for the other.

A graph is said to be bipartite if the vertices can be colored with two colors, say Red and Blue, such that each edge has different colored vertices, that is, each edge goes between a blue vertex and a red vertex. Clearly the movie-actor graph is bipartite, with actors colored blue and movies colored red.

The following result is proved in discrete math courses and most books on graph theory:

Theorem. In a bipartite graph, a path whose ends have the same color has an even number of edges.

As a consequence, any path from one actor to another in the movie-actor graph has an even number of edges. If P is such a path, with length n, then n is even and n/2 is the number of movies passed through by P. If P is a shortest path from actor x to actor y, then n/2 is the movie distance from x to y.

Thus to solve the Kevin Bacon game, we perform a Breadth-First survey from Kevin Bacon. The Breadth First Tree rooted at Kevin Bacon consists of shortest paths from Kevin Bacon to all other actors who have a finite Kevin Bacon number. Dividing the length of such a path by 2 yields the Kevin Bacon number for the actor at the other end of the path.

In practical terms, we start at an actor x and follow the parent pointers of the BF tree back to Kevin Bacon, counting the steps. Then divide this count by 2 to get the number.
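
The parent-pointer walk can be sketched as follows. This is an illustration on a bare adjacency list with integer vertices rather than on the project's SymbolGraph, and the names AdjList, Link, BFParents, and MovieDistanceSketch are all hypothetical. The sketch marks the root by parent[root] == root and unreached vertices by the value n, which is a convention of the sketch only.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

typedef std::vector< std::vector<std::size_t> > AdjList;

// insert the undirected edge between an actor vertex and a movie vertex
void Link (AdjList& g, std::size_t actor, std::size_t movie)
{
  assert(actor < g.size() && movie < g.size());
  g[actor].push_back(movie);
  g[movie].push_back(actor);
}

// Breadth-first search from root, returning the parent of each vertex in the BF tree.
// parent[root] == root marks the root; parent[v] == g.size() marks "not reached".
std::vector<std::size_t> BFParents (const AdjList& g, std::size_t root)
{
  const std::size_t n = g.size();
  std::vector<std::size_t> parent(n, n);
  std::queue<std::size_t> q;
  parent[root] = root;
  q.push(root);
  while (!q.empty())
  {
    std::size_t x = q.front();
    q.pop();
    for (std::size_t k = 0; k < g[x].size(); ++k)
    {
      std::size_t y = g[x][k];
      if (parent[y] == n) { parent[y] = x; q.push(y); }
    }
  }
  return parent;
}

// Follow parent pointers from actor x back to the root, count the edges, and halve.
// Returns false (leaving md untouched) when x was not reached, i.e. md is infinite.
bool MovieDistanceSketch (const std::vector<std::size_t>& parent,
                          std::size_t x, std::size_t& md)
{
  const std::size_t n = parent.size();
  if (parent[x] == n) return false;
  std::size_t steps = 0;
  while (parent[x] != x) { x = parent[x]; ++steps; }
  md = steps / 2;   // path length is even for an actor, so this is exact
  return true;
}
```

For example, with vertices 0 (base actor), 1 (a movie), 2 (an actor), 3 (a movie), 4 (an actor) and links 0-1, 2-1, 2-3, 4-3, the path from 4 back to 0 has 4 edges, so the sketch reports movie distance 2 for vertex 4 and 1 for vertex 2.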

Hints

The abstract graph classes are provided, so interpret the admonitions to "thoroughly test" them as "thoroughly understand how these are designed and implemented".

Even though our typical use of the graph classes will have the template argument N = size_t, it will be very useful in your implementation code to distinguish between type Vertex and type size_t, and to cast explicitly between the two wherever they carry different connotations. For example, given Vertex x and size_t i, write Begin((Vertex)i) and parent[(size_t)x].

Several graph files are distributed in LIB/proj4. Some of these are named graph.v.e and some are named dag.v.e, where v is the number of vertices and e is the number of edges of the graph represented by the file. DO NOT rely on these suffixes in your programs; they are for human convenience only (and in some instances may not even be accurate). Those named dag are purported to be acyclic when interpreted as directed graphs, but will have cycles when interpreted as undirected graphs.
A thorough understanding of the material in the Graphs 1 Lecture Notes will be helpful.

The following is output from a test of DFSurvey run on G1 (undirected case):

linprog2> fdfsug.x graph1.10.10

Begin DFSurvey functionality test
graph type: undirected adjacency list
 Load complete
 Input file: graph1.10.10
  VrtxSize = 10
  EdgeSize = 10

   df survey data
   ==============
   vertex     dtime      ftime      parent        color
   ------     -----      -----      ------        -----
        0         0         19        NULL            b
        1         1          2           0            b
        2         6         15           5            b
        3         3         18           0            b
        4         4         17           3            b
        5         5         16           4            b
        6         7         14           2            b
        7         8         13           6            b
        8        10         11           9            b
        9         9         12           7            b

Vertex discovery order: 0 1 3 4 5 2 6 7 9 8
Vertex finishing order: 1 8 9 7 6 2 5 4 3 0

End DFSurvey functionality test
linprog2>

Note the table of survey data and the output of the vertices in preorder and postorder.
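
The kind of data in the table above (discovery time, finishing time, parent, color) can be produced by a depth-first search that stamps each vertex from a single clock on discovery and again on finishing. The following is a minimal recursive sketch on a bare adjacency list. DFSketch is a hypothetical name, the color letters 'w', 'g', 'b' are an assumed convention, and the value n stands in for the NULL parent. It is not the DFSurvey class the project asks for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector< std::vector<std::size_t> > AdjList;

struct DFSketch
{
  const AdjList&           g;
  std::vector<std::size_t> dtime;   // discovery time
  std::vector<std::size_t> ftime;   // finishing time
  std::vector<std::size_t> parent;  // g.size() means "no parent"
  std::vector<char>        color;   // 'w' unvisited, 'g' in progress, 'b' finished
  std::size_t              clock;   // one clock shared by both kinds of timestamp

  explicit DFSketch (const AdjList& graph)
    : g(graph), dtime(graph.size()), ftime(graph.size()),
      parent(graph.size(), graph.size()), color(graph.size(), 'w'), clock(0)
  {}

  // visit x and everything reachable from x through unvisited vertices
  void Visit (std::size_t x)
  {
    assert(x < g.size());
    color[x] = 'g';
    dtime[x] = clock++;
    for (std::size_t k = 0; k < g[x].size(); ++k)
    {
      std::size_t y = g[x][k];
      if (color[y] == 'w')
      {
        parent[y] = x;
        Visit(y);
      }
    }
    color[x] = 'b';
    ftime[x] = clock++;
  }

  // survey the whole graph, starting a new tree at each still-unvisited vertex
  void Search ()
  {
    for (std::size_t v = 0; v < g.size(); ++v)
      if (color[v] == 'w') Visit(v);
  }
};
```

With n vertices the clock takes the values 0 through 2n - 1, which agrees with the range 0 to 19 seen for the 10-vertex graph in the table. Recursion depth can reach n on a path-like graph, so an explicit stack may be preferable for large inputs.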

The following is output from a test of BFSurvey run on G1 (directed case):

linprog2> fbfsdg.x graph1.10.10

Begin BFSurvey functionality test
graph type: directed adjacency list
 Load complete
 Input file: graph1.10.10
  VrtxSize = 10
  EdgeSize = 10

   df survey data
   ==============
   vertex  distance      dtime      parent        color
   ------  --------      -----      ------        -----
        0         0          0        NULL            b
        1         1          1           0            b
        2         0          7        NULL            b
        3         1          2           0            b
        4         2          3           3            b
        5         3          4           4            b
        6         1          8           2            b
        7         2          9           6            b
        8         4          5           5            b
        9         5          6           8            b

Vertex discovery order: 0 1 3 4 5 8 9 2 6 7
   grouped by distance: [ ( 0 ) ( 1 3 ) ( 4 ) ( 5 ) ( 8 ) ( 9 ) ] [ ( 2 ) ( 6 ) ( 7 ) ]

End BFSurvey functionality test
linprog2>

Again the table shows the survey data. The vertex discovery order, grouped by distance from the search vertex, is also shown. (BFS discovers and finishes vertices in the same order.) The discovery order grouped by distance uses [ ] to delimit trees in the forest and ( ) to delimit vertices the same distance away from the root of the tree.

The discovery and finishing order are computed post-survey from the timestamps. (Discovery time was added to the usual BFS to facilitate this.) The discovery order "grouped by distance" output in fbfsurvey uses both distance and time. One could also output a Lisp-syntax record of the search forest for either survey.
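
One way to recover an order from timestamps after the survey is to sort the vertex numbers by their times. The sketch below assumes the times are available as a vector indexed by vertex. OrderByTime and ByTime are hypothetical names, and the test harnesses may compute the order differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// comparator: orders vertex numbers by the timestamps stored for them
struct ByTime
{
  const std::vector<std::size_t>& time;
  explicit ByTime (const std::vector<std::size_t>& t) : time(t) {}
  bool operator() (std::size_t a, std::size_t b) const
  {
    assert(a < time.size() && b < time.size());
    return time[a] < time[b];
  }
};

// returns the vertices 0..n-1 arranged by increasing timestamp
std::vector<std::size_t> OrderByTime (const std::vector<std::size_t>& time)
{
  std::vector<std::size_t> order(time.size());
  for (std::size_t i = 0; i < order.size(); ++i)
    order[i] = i;
  std::sort(order.begin(), order.end(), ByTime(time));
  return order;
}
```

Applied to the dtime column of the DFSurvey table above (0 1 6 3 4 5 7 8 10 9), this yields 0 1 3 4 5 2 6 7 9 8, and applied to the ftime column (19 2 15 18 17 16 14 13 11 12) it yields 1 8 9 7 6 2 5 4 3 0, matching the discovery and finishing orders printed by the test.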

Sample executables are available in LIB/area51. These show some elaborations such as digraphs, topological sort for digraphs, and output from the surveys that isn't direct. I added discovery time to BFSurvey, which is handy information to have, as illustrated by the post-survey computation of discovery order. Decode the names as follows:

fgraph.x   # general test of Graph classes  - can supply detailed log
fdfsug.x   # functionality test of DFSurvey - undirected graphs
fdfsdg.x   # functionality test of DFSurvey - directed graphs
fbfsug.x   # functionality test of BFSurvey - undirected graphs
fbfsdg.x   # functionality test of BFSurvey - directed graphs
ftopsort.x # functionality test of TopSort  - directed graphs only

It will be helpful to have a function (or method in MovieMatch) that gets an entire line from movies.txt and returns a vector v of the individual names delimited by '/' in the line. The first name v[0] would be a movie title and the remaining v[1] .. v[v.Size() - 1] would be actors in that movie.
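
A minimal sketch of such a line splitter, assuming std::string and std::vector rather than the fsu equivalents (so v.size() replaces v.Size()). SplitLine and Title are hypothetical names, and skipping empty fields is an assumption about how stray delimiters should be treated.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// break one line of the movie file into the names delimited by '/'
std::vector<std::string> SplitLine (const std::string& line, char delim = '/')
{
  std::vector<std::string> v;
  std::istringstream is(line);
  std::string name;
  while (std::getline(is, name, delim))  // reads up to the next delimiter, blanks included
  {
    if (!name.empty())                   // assumption: ignore empty fields
      v.push_back(name);
  }
  return v;
}

// v[0] is the movie title; v[1] onward are the actors in that movie
const std::string& Title (const std::vector<std::string>& v)
{
  assert(!v.empty());
  return v[0];
}
```

A loader could read each line of the file with std::getline(in, line), call SplitLine(line), and then add an edge between v[0] and each of v[1] through v[v.size() - 1].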

You will need to decide what a "string" is. We recommend using string objects, either fsu::String or std::string.