"Shared-Memory" Multiprocessors (Continued)

For memory that is strongly ordered (that is, a modification to memory on one processor is immediately available to the other processors), this solution is correct. It is correct even taking into account that in and out will eventually overflow, as long as BSIZE is less than the largest integer that can be represented in a word.
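
Example 9-5 appears earlier in this chapter and is not reproduced on this page. The following minimal sketch shows the style of lock-free producer/consumer code under discussion; the names buffer, in, out, and BSIZE follow the discussion above, but the details here are an illustration rather than the guide's exact code:

#define BSIZE 256                 /* assumed buffer size */

char buffer[BSIZE];
unsigned int in = 0;              /* total items produced */
unsigned int out = 0;             /* total items consumed */

void producer(char item) {
    while (in - out == BSIZE)
        ;                         /* busy-wait: buffer full */
    buffer[in % BSIZE] = item;
    in++;
}

char consumer(void) {
    char item;
    while (in - out == 0)
        ;                         /* busy-wait: buffer empty */
    item = buffer[out % BSIZE];
    out++;
    return item;
}

Because in and out are unsigned, the difference in - out remains meaningful even after the counters wrap around, which is why the solution tolerates overflow under the BSIZE condition stated above.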

Shared-memory multiprocessors do not necessarily have strongly ordered memory. A change to memory by one processor is not necessarily available immediately to the other processors. When two changes to different memory locations are made by one processor, the other processors do not necessarily detect the changes in the order in which they were made because changes to memory do not happen immediately.

First the changes are stored in store buffers that are not visible to the cache. The processor checks these store buffers to ensure that a program has a consistent view of its own writes, but because store buffers are not visible to other processors, a write by one processor does not become visible to the others until it is written to the cache.

The synchronization primitives (see Chapter 4, Programming With Synchronization Objects) use special instructions that flush the store buffers to cache. So, using locks around your shared data ensures memory consistency.

When memory ordering is very relaxed, Example 9-5 has a problem because the consumer might see that in has been incremented by the producer before it sees the change to the corresponding buffer slot.

This is called weak ordering because stores made by one processor can appear to happen out of order to another processor (memory, however, is always consistent from the point of view of the processor that made the changes). To fix this, the code should use mutexes to flush the store buffers to the cache.
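
As a sketch of the fix, here is the producer from the sketch above reworked to hold a mutex around the shared data (an illustration, assuming the same buffer, in, out, and BSIZE names). Acquiring and releasing the lock issues the store-buffer-flushing instructions, so a consumer that observes the incremented in is guaranteed to observe the filled buffer slot as well:

#include <pthread.h>

pthread_mutex_t buffer_lock = PTHREAD_MUTEX_INITIALIZER;

void producer(char item) {
    for (;;) {
        pthread_mutex_lock(&buffer_lock);
        if (in - out != BSIZE) {             /* room available */
            buffer[in % BSIZE] = item;
            in++;
            pthread_mutex_unlock(&buffer_lock);
            return;
        }
        pthread_mutex_unlock(&buffer_lock);  /* full: release and retry */
    }
}

A production version would use a condition variable rather than this retry loop (see Chapter 4, Programming With Synchronization Objects).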

The trend is toward relaxing memory order. Because of this, programmers are becoming increasingly careful to use locks around all global or shared data.

As demonstrated by Example 9-5 and Example 9-6, locking is essential.

Peterson's Algorithm

The code in Example 9-6 is an implementation of Peterson's Algorithm, which handles mutual exclusion between two threads. This code tries to guarantee that there is never more than one thread in the critical section and that, when a thread calls mut_excl(), it enters the critical section sometime "soon."

An assumption here is that a thread exits fairly quickly after entering the critical section.


Example 9-6 Mutual Exclusion for Two Threads?

void mut_excl(int me /* 0 or 1 */) {
    static int loser;                  /* thread that loses the race waits */
    static int interested[2] = {0, 0};
    int other;                         /* local variable */

    other = 1 - me;
    interested[me] = 1;                /* announce interest */
    loser = me;                        /* defer to the other thread */
    while (loser == me && interested[other])
        ;                              /* busy-wait */

    /* critical section */
    interested[me] = 0;                /* leave the critical section */
}

This algorithm works only if the assumption that the multiprocessor has strongly ordered memory holds. When it does not, the code still works some of the time, which makes the failure hard to detect.

Some multiprocessors, including some SPARC-based multiprocessors, have store buffers. When a thread issues a store instruction, the data is put into a store buffer. The buffer contents are eventually sent to the cache, but not necessarily right away. (Note that the caches on each of the processors maintain a consistent view of memory, but modified data does not reach the cache right away.)

When multiple memory locations are stored into, the changes reach the cache (and memory) in the correct order, but possibly after a delay. SPARC-based multiprocessors with this property are said to have total store order (TSO).

When one processor stores into location A and then loads from location B, and another processor stores into location B and then loads from location A, the expectation is that either the first processor fetches the newly modified value in location B, or the second processor fetches the newly modified value in location A, or both do. The case in which both processors load the old values, it would seem, simply cannot happen.

However, with the delays caused by load and store buffers, this "impossible case" can happen.
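
The pattern just described can be written as the following two-thread litmus test (an assumed sketch for illustration, not code from this guide):

#include <pthread.h>

int A = 0, B = 0;           /* the two shared locations */
int r1, r2;                 /* the values each thread loads */

void *thread1(void *arg) {
    A = 1;                  /* store into A ... */
    r1 = B;                 /* ... then load from B */
    return NULL;
}

void *thread2(void *arg) {
    B = 1;                  /* store into B ... */
    r2 = A;                 /* ... then load from A */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* On strongly ordered memory, r1 == 0 && r2 == 0 cannot occur.
       With store buffers, each store can still be sitting in its
       processor's buffer when the other processor's load executes,
       so some runs end with both r1 and r2 equal to 0. */
    return 0;
}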

What could happen with Peterson's algorithm is that two threads running on separate processors each stores into its own slot of the interested array and then loads from the other slot. Both read the old values (0), each assumes that the other party is not present, and both enter the critical section. (Note that this is the sort of problem that might not occur when you test a program, but only much later.)

This problem is avoided when you use the threads synchronization primitives, whose implementations issue special instructions to force the writing of the store buffers to the cache.
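For example, a version of mut_excl() built on a mutex (a minimal sketch; the pthread calls replace the interested/loser protocol entirely) is correct regardless of memory ordering, because pthread_mutex_lock() and pthread_mutex_unlock() issue those instructions:

#include <pthread.h>

static pthread_mutex_t excl_lock = PTHREAD_MUTEX_INITIALIZER;

void mut_excl(int me /* 0 or 1 */) {
    pthread_mutex_lock(&excl_lock);
    /* critical section */
    pthread_mutex_unlock(&excl_lock);
}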

Parallelizing a Loop on a Shared-Memory Parallel Computer

In many applications, and especially numerical applications, while part of the algorithm can be parallelized, other parts are inherently sequential, as shown in the following:

Thread 1:

    while(many_iterations) {

        sequential_computation
        --- Barrier ---
        parallel_computation
    }

Thread 2 through Thread n:

    while(many_iterations) {

        --- Barrier ---
        parallel_computation
    }

For example, you might produce a set of matrixes with a strictly linear computation, then perform operations on the matrixes using a parallel algorithm, then use the results of these operations to produce another set of matrixes, then operate on them in parallel, and so on.

The nature of the parallel algorithms for such a computation is that little synchronization is required during the computation, but synchronization of all the threads employed is required to ensure that the sequential computation is finished before the parallel computation begins.

The barrier forces all the threads that are doing the parallel computation to wait until all threads involved have reached the barrier. Once all of them have arrived, they are released and begin computing together.
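
A minimal sketch of this structure using POSIX barriers (where the pthread_barrier interfaces are available) follows; the thread count and the more_iterations(), sequential_computation(), and parallel_computation() helpers are assumptions for illustration:

#include <pthread.h>

#define NTHREADS 4                       /* assumed number of threads */

pthread_barrier_t barrier;               /* initialized elsewhere with
                                            pthread_barrier_init(&barrier,
                                            NULL, NTHREADS) */

int  more_iterations(void);              /* assumed: loop control */
void sequential_computation(void);       /* assumed: sequential part */
void parallel_computation(void);         /* assumed: parallel part */

/* Thread 1 executes this. */
void *lead_thread(void *arg) {
    while (more_iterations()) {
        sequential_computation();
        pthread_barrier_wait(&barrier);  /* wait for all n threads */
        parallel_computation();
    }
    return NULL;
}

/* Thread 2 through Thread n execute this. */
void *worker_thread(void *arg) {
    while (more_iterations()) {
        pthread_barrier_wait(&barrier);  /* wait for all n threads */
        parallel_computation();
    }
    return NULL;
}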

Summary

This guide has covered a wide variety of important threads programming issues. Look in Appendix A, Sample Application--Multithreaded grep, for a pthreads program example that uses many of the features and styles that have been discussed. Look in Appendix B, Solaris Threads Example, for a program example that uses Solaris threads.

Further Reading

For more in-depth information about multithreading, see the following book:

  • Programming with Threads by Steve Kleiman, Devang Shah, and Bart Smaalders (Prentice Hall, 1995)

 
 
 