Dates | Topic
28 Aug - 31 Aug | Chapter 1: Parallel computers and computation
Review questions:
- Do you know the following terminology? SISD, SIMD, MIMD, SMP,
UMA, shared memory machine, distributed memory machine, NUMA, cc-NUMA,
distributed shared memory, SPMD, message passing, shared memory
programming, task/channel model, multicomputer, multiprocessor. (We
have discussed some of these topics in class already, but you should
read Chapter 1.)
- Give an example to demonstrate why a lack of cache coherence can lead to incorrect results. (A sketch follows these questions.)
- Suggest some methods of ensuring cache coherence efficiently.
- Can you simulate, through software, a distributed memory machine on
a shared memory machine, or a shared memory machine on a distributed
memory machine?
- Can you efficiently simulate, through software, a
distributed memory machine on a shared memory machine, or a shared
memory machine on a distributed memory machine?
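For the cache-coherence question above, a minimal sketch (the use of pthreads and the variable names are illustrative assumptions, and a formally correct C version would also need atomics or memory fences): two threads communicate through the shared variables flag and data. On a cache-coherent machine the reader eventually observes the writer's updates; if each processor's cache were not kept consistent, the reader could spin forever on a stale copy of flag, or see flag change yet still read the stale value 0 for data.

#include <pthread.h>
#include <stdio.h>

/* Shared variables: each processor may hold its own cached copy. */
volatile int flag = 0;
volatile int data = 0;

void *writer(void *arg) {
    data = 42;              /* produce the value...               */
    flag = 1;               /* ...then signal that it is ready    */
    return NULL;
}

void *reader(void *arg) {
    while (flag == 0)       /* without coherence, the cached copy */
        ;                   /* of flag may never change           */
    printf("data = %d\n", data);  /* or data may still be stale   */
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}

On real coherent hardware this example normally works; the point is only to illustrate what could go wrong if caches were not kept consistent.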
22 Oct - 26 Oct | Discussion on homework, projects and papers
4 Sep - 7 Sep | Chapter 2: Designing parallel algorithms
Review questions:
- Do you understand the issues of: partitioning, domain
decomposition, functional decomposition, communication, local vs
global communication patterns, structured vs unstructured
communication patterns, static vs dynamic communication patterns,
synchronous vs asynchronous communication patterns, agglomeration,
mapping, use of graph partitioning in mapping, divide and conquer
paradigm.
- For the first prefix scheme we discussed in class, can you
show that the time complexity is O(log₂ N)?
- Can you come up with an example where you may be willing to
sacrifice the load-balance requirement to improve the total computation
time?
- We considered parallel prefix in a message passing
paradigm. How would you implement it in a shared memory paradigm? What
would the time complexity be? What assumptions on memory access did you
use to derive the time complexity (for example, can multiple
processors read the same memory location simultaneously? Can they
write simultaneously?)? (A sketch follows these questions.)
- Design a parallel algorithm for matrix-vector multiplication and matrix-matrix multiplication.
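For the shared-memory prefix question above, a minimal sketch of the log-step (recursive doubling) scheme; N, the input values, and the power-of-two assumption are illustrative, and the parallel work of each round is simulated here by an ordinary loop. In a CREW-style shared-memory model (concurrent reads allowed, writes to distinct locations), each of the log2(N) rounds could be executed by N processors in constant time, which is how the O(log₂ N) bound arises.

#include <stdio.h>
#include <string.h>

#define N 8                               /* assumed power of two */

int main(void) {
    int x[N] = {1, 2, 3, 4, 5, 6, 7, 8};  /* placeholder input    */
    int next[N];

    for (int d = 1; d < N; d *= 2) {      /* log2(N) rounds        */
        memcpy(next, x, sizeof x);
        for (int i = d; i < N; i++)       /* conceptually parallel */
            next[i] = x[i] + x[i - d];
        memcpy(x, next, sizeof next);
    }

    for (int i = 0; i < N; i++)
        printf("%d ", x[i]);              /* inclusive prefix sums */
    printf("\n");
    return 0;
}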
29 Oct - 2 Nov | Discussions on the projects
10 Sep - 14 Sep | Chapter 3: A quantitative basis for design, sections 3.1 - 3.4
Check HW 1
5 Nov - 9 Nov | Midterm review and midterm
17 Sep - 21 Sep | Section 3.7
Check the partial list of papers, from which you will present later in the semester.
Review questions:
- Do you understand the issues of: the various factors to
consider in performance (such as execution time, memory requirement,
software development cost, etc.), Amdahl's law and its limitations,
the limitations of extrapolating from observations, asymptotic analysis,
modeling execution time, the communication model we consider and
improvements to account for contention, efficiency, speed-up,
scalability analysis, the iso-efficiency function, and the different
network topologies. (A worked example of Amdahl's law follows these
questions.)
- Can you perform a scalability analysis for the parallel prefix algorithms we discussed in class?
- Can you create a torus using only "short" wires (that is, wires of constant length)?
- Can you create a hypercube of dimension greater than three in
the 3-dimensional world in which we live? Can you create it such that
wires do not cross? Can you embed an arbitrary graph in 3-D space so
that wires do not cross?
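As a quick worked illustration of Amdahl's law for the questions above (the 90% figure is an assumed example, not from the text): if a fraction p = 0.9 of the running time is parallelizable, the speedup on P processors is bounded by S(P) = 1 / ((1 - p) + p/P). With P = 16 this gives 1 / (0.1 + 0.9/16) = 6.4, and even as P grows without bound the speedup can never exceed 1 / (1 - p) = 10.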
13 Nov - 16 Nov | Paper presentations
24 Sep - 28 Sep | Chapter 4: Putting components together, sections 4.1 - 4.3
19 Nov - 21 Nov | Paper presentations
1 Oct - 5 Oct | Chapter 4: Putting components together, section 4.6
Review questions:
- Do you know the following: modularity issues in sequential and parallel software, the three composition techniques, their advantages and disadvantages, different matrix distribution schemes (block versus cyclic, striped versus checkerboard, one dimensional versus two dimensional)?
- Given a problem, can you suggest a suitable parallel algorithm
and data distribution, and discuss the trade-offs involved with
different composition techniques? For example, give the total memory
required, discuss factors that may change the total execution time,
and analyze the communication and computation costs.
- For an example of the above, try to analyze the following two
problems: (i) addition of N numbers, and (ii) matrix-vector
multiplication, where the matrix is distributed in a checkerboard
manner, while you can assume any suitable distribution for the vector,
with each processor having N/P elements of the vector. (A sketch of
the analysis for (i) follows these questions.)
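For problem (i) above, one possible way to set up the analysis, as a sketch only (the symbols are assumptions: t_c is the time for one addition, t_s and t_w are the message startup and per-word costs of the communication model from Chapter 3, and P is taken to divide N evenly): each processor first sums its N/P local elements, taking about (N/P - 1) t_c, and the P partial sums are then combined up a binary tree in log2(P) steps, each step costing one short message and one addition, for roughly log2(P) (t_s + t_w + t_c). The total is therefore about (N/P) t_c + log2(P) (t_s + t_w + t_c); the memory requirement is N/P elements per processor plus a constant number of temporaries, and the log2(P) communication term is what limits efficiency once N/P becomes small.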
26 Nov - 30 Nov | Project presentations
8 Oct - 12 Oct | Chapter 8: MPI
Pacheco's tutorial
Gropp's tutorial
Review questions:
- Do you know the following? The 6 basic MPI calls, need for tags
and communicators, buffering issues and deadlock, how to prevent
deadlocks, immediate sends and receives, duplicating and splitting
communicators, collective communication, topologies, derived data
types (vectors and structures).
- Given a desired topology (for example, a hypercube), can you give
suitable arguments to create a Cartesian mesh that is identical to the
desired topology?
- Can you give an example to demonstrate how send/recv can cause
deadlocks? Or, given an example, can you determine whether deadlock can
occur, under what conditions, and how it can be prevented using
facilities provided by MPI? (A sketch follows these questions.)
- How can you implement reduction using only sends and receives?
- Given a sequential algorithm, you should be able to write an
efficient MPI program for it. For example, try to do this for
matrix-vector multiplication.
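For the send/recv deadlock question above, a minimal sketch (the message length, tag value, and the assumption of exactly two processes are illustrative): both ranks post a blocking MPI_Send before their MPI_Recv, so if the messages are too large for the MPI implementation to buffer, each send blocks waiting for a receive that is never posted and the program hangs. Having one rank reverse the order of its calls, using MPI_Sendrecv (shown in the comment), or switching to immediate sends and receives all avoid the deadlock.

#include <mpi.h>
#include <stdio.h>

#define COUNT (1 << 20)          /* large enough to defeat buffering */

int main(int argc, char *argv[]) {
    int rank, other;
    static double sendbuf[COUNT], recvbuf[COUNT];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;            /* assumes exactly 2 processes */

    /* Potential deadlock: both ranks block in MPI_Send. */
    MPI_Send(sendbuf, COUNT, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, COUNT, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    /* Safe alternative:
     * MPI_Sendrecv(sendbuf, COUNT, MPI_DOUBLE, other, 0,
     *              recvbuf, COUNT, MPI_DOUBLE, other, 0,
     *              MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     */

    printf("rank %d done\n", rank);
    MPI_Finalize();
    return 0;
}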
3 Dec - 7 Dec | Project presentations
15 Oct - 19 Oct | OpenMP
Review questions:
- Do you know the following? The concept of threads, the OpenMP execution
model, compiling an OpenMP program on the SGI Origin 2000, compiler
directives for creating a parallel region and work-sharing a for loop,
data scope attribute clauses (private, lastprivate, and firstprivate),
how private variables are created, reduction, and the library calls to
set the number of threads and get the thread number.
- Can you give examples (other than those discussed in class) to
demonstrate errors that can occur in a program when multiple threads
execute a piece of code?
- Given a piece of sequential code, you should be able to
parallelize it with OpenMP directives. For example, try parallelizing
matrix-matrix multiplication. (A sketch follows these questions.)
- What do you think is the most likely reason for the loop variable in a work-shared construct to be private by default?
- What do you think is the most likely reason for the restrictions OpenMP places on the type of
for loops?
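For the matrix-matrix multiplication exercise above, a minimal OpenMP sketch (the matrix size N, the initialization values, and the choice to parallelize over rows are illustrative assumptions): the outermost loop is work-shared across threads, the loop indices and the local accumulator are private because of where they are declared, and A, B, and C are shared.

#include <omp.h>
#include <stdio.h>

#define N 512

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    /* placeholder initialization */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            B[i][j] = 2.0;
        }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)       /* rows distributed over threads */
        for (int j = 0; j < N; j++) {
            double sum = 0.0;         /* private by virtue of scope    */
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    printf("C[0][0] = %f (max threads: %d)\n", C[0][0], omp_get_max_threads());
    return 0;
}

Compiling requires the compiler's OpenMP flag (for example, -fopenmp with gcc), and OMP_NUM_THREADS or the corresponding library call selects the number of threads.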