It is quite common to have multiple cores these days; indeed, a processor with only a single core has become the rare case.
As the text notes, instruction-level parallelism (ILP) was of great interest back before the turn of the millennium.
Vector parallelism refers to the application of one operation to multiple data elements; SIMD is the most common realization of this technique, and Intel/AMD have created a lot of SIMD instructions (including ones that handle quite large registers!).
"Black box": you don't even know it's there.
Disjoint tasks: parallel execution over disjoint data
Synchronicity (but also see RCU-type parallelism)
Multithreading seems reasonable when you have multiple logical processors that can act in concert; certainly some tasks, such as complex I/O-driven tasks, seem naturally suited to having separate threads.
However, the great success of the last few years has been the resurrection of event-driven programming, which generally does not use multiple threads, except occasionally as a means to make use of multiple processors. (Your text discusses this alternative rather dismissively in the section titled "The Dispatch Loop Alternative".)
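As an illustration of the event-driven style, here is a minimal sketch using Python's standard asyncio module. The handler names and delays are made up for the example; the point is only that two I/O-like tasks interleave on a single thread, with the event loop acting as the dispatch loop.

```python
import asyncio
import threading

async def handle(name, delay, log):
    # Record which OS thread runs this handler.
    log.append((name, "start", threading.get_ident()))
    # Awaiting hands control back to the event loop, which can run
    # another handler while this one "waits for I/O".
    await asyncio.sleep(delay)
    log.append((name, "done", threading.get_ident()))
    return name

async def main():
    log = []
    # Both handlers are in flight at once, but no extra threads are created.
    results = await asyncio.gather(
        handle("slow", 0.05, log),
        handle("fast", 0.01, log),
    )
    return results, log

results, log = asyncio.run(main())
thread_ids = {tid for _, _, tid in log}
events = [(name, what) for name, what, _ in log]
print(events)
print("distinct threads used:", len(thread_ids))
```

The "fast" handler finishes before the "slow" one even though it was started second, and every log entry carries the same thread id.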
Generally, we have seen communication via either message passing or some sort of shared memory mechanism (though it is possible to mix the two, as in distributed shared memory models, which your text mentions in Design and Implementation note 13.2 on page 636).
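The two communication styles can be contrasted in a small sketch. This is only an illustration using Python's standard threading and queue modules; the function names (producer, consumer, add_many) and the dictionary used as the shared state are assumptions made for the example.

```python
import queue
import threading

# --- Message passing: threads share nothing but a mailbox ---
def producer(outbox, items):
    for item in items:
        outbox.put(item)
    outbox.put(None)  # sentinel: no more messages

def consumer(inbox, received):
    while True:
        item = inbox.get()  # blocks until a message arrives
        if item is None:
            break
        received.append(item)

mailbox = queue.Queue()
received = []
sender = threading.Thread(target=producer, args=(mailbox, [1, 2, 3]))
receiver = threading.Thread(target=consumer, args=(mailbox, received))
sender.start()
receiver.start()
sender.join()
receiver.join()

# --- Shared memory: threads update one location, guarded by a lock ---
def add_many(shared, lock, n):
    for _ in range(n):
        with lock:  # without the lock, updates could be lost
            shared["counter"] += 1

shared = {"counter": 0}
lock = threading.Lock()
workers = [threading.Thread(target=add_many, args=(shared, lock, 1000))
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(received, shared["counter"])
```

In the first half the data moves between threads; in the second half the threads move to the data and must synchronize their access to it.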
Synchronization can be implemented by, say, spinning (busy-wait), or by blocking.
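A minimal sketch of the difference, assuming Python's threading module: one waiter spins on a flag, the other blocks on a threading.Event. The names spin_wait and block_wait are hypothetical, and a busy-wait loop like this one is shown only for contrast.

```python
import threading
import time

def spin_wait(state, out):
    # Busy-wait: keep re-checking the flag, burning CPU the whole time.
    while not state["ready"]:
        pass
    out.append("spinning waiter woke")

def block_wait(event, out):
    # Blocking: the thread sleeps until another thread signals the event.
    event.wait()
    out.append("blocked waiter woke")

state = {"ready": False}
event = threading.Event()
out = []

spinner = threading.Thread(target=spin_wait, args=(state, out))
blocker = threading.Thread(target=block_wait, args=(event, out))
spinner.start()
blocker.start()

time.sleep(0.05)        # both waiters are now waiting
state["ready"] = True   # releases the spinner
event.set()             # releases the blocked thread

spinner.join()
blocker.join()
print(sorted(out))
```

Spinning can pay off when the wait is expected to be very short and a core is otherwise idle; blocking gives the processor back to other work at the cost of a trip through the scheduler.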
In higher-level languages, threading can be done at the language level, at the compiler "extension" level, or via a library.
At the assembly language level, you generally have to ask the operating system to help you (which is what is silently happening in higher-level languages under all three of the above methods).
Languages that support parallelism generally use some variation on the following themes:
co-begin -- all n statements run concurrently
... [stmt1]
... [stmt2]
... [stmt3]
...
... [stmtN]
co-end
Parallel.For(0,3,i => { somefunction(i); });
Parallel.Map( somefunction somelist );
Parallel.Fold( somefunction someaccum somelist );
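The themes above can be sketched in Python with the standard threading and concurrent.futures modules. The helper names co_begin, parallel_map, and parallel_fold are hypothetical, chosen to mirror the pseudocode; they are not part of any library. Note also that in CPython the global interpreter lock generally keeps threads from running Python bytecode simultaneously, so this shows the structure of the constructs rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator
import threading

def co_begin(*statements):
    """Start every statement concurrently; return once all have finished."""
    threads = [threading.Thread(target=stmt) for stmt in statements]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # this is the "co-end"

def parallel_map(fn, items, workers=4):
    """Apply fn to each item using a pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

def parallel_fold(op, identity, items, workers=4):
    """Fold each chunk separately, then combine the partial results.
    Assumes op is associative and identity is its identity element."""
    items = list(items)
    if not items:
        return identity
    size = -(-len(items) // workers)  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: reduce(op, c, identity), chunks))
    return reduce(op, partials, identity)

done = []
co_begin(lambda: done.append("stmt1"),
         lambda: done.append("stmt2"),
         lambda: done.append("stmt3"))

squares = parallel_map(lambda i: i * i, range(3))   # like Parallel.For(0,3,...)
total = parallel_fold(operator.add, 0, range(1, 101))
print(sorted(done), squares, total)
```

The fold only parallelizes cleanly because addition is associative; an operation whose result depends on strict left-to-right order cannot be split into chunks this way.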
In Unix-land, we have
Using these two mechanisms, we can create implementations such as pthreads.
Going lock-free with transactional memory systems: this approach does seem to have promise, and there are a lot of implementations.
You declare a code section to be "atomic", and the system takes responsibility for trying to execute such sections in parallel. If the system supports rollback, it can even try "speculative" computations.
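The speculate, validate, and retry cycle behind an "atomic" section can be illustrated with a toy sketch. This is not a real transactional memory system: VersionedCell and its atomically method are hypothetical names, the cell holds a single value, and a small internal lock is still used to make the commit step itself indivisible. It only shows the shape of the idea: compute against a snapshot without holding a lock, commit if nothing changed in the meantime, otherwise discard the result and try again.

```python
import threading

class VersionedCell:
    """A single value with a version number (toy model, not real STM)."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()

    def read(self):
        # Take a consistent snapshot of (version, value).
        with self._commit_lock:
            return self.version, self.value

    def atomically(self, update):
        """Run update speculatively; retry if another commit got in first.
        update must be free of side effects, since it may run many times."""
        attempts = 0
        while True:
            attempts += 1
            seen_version, seen_value = self.read()
            new_value = update(seen_value)  # speculative work, no lock held
            with self._commit_lock:
                if self.version == seen_version:  # validate the snapshot
                    self.value = new_value
                    self.version += 1
                    return attempts
            # Conflict: drop new_value (the "rollback") and loop again.

cell = VersionedCell(0)

def worker(n):
    for _ in range(n):
        cell.atomically(lambda v: v + 1)

threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.value, cell.version)
```

Every successful commit bumps the version by exactly one, so the final value and the version both equal the total number of increments, no matter how many speculative attempts were thrown away along the way.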
$ cpuid
[ ... ]
extended feature flags (7):
[ ... ]
HLE hardware lock elision = false
[ ... ]
RTM: restricted transactional memory = false