Atomic Operations, Locks, Multithreading, And Other Beasts

I have decided to devote this article to multithreading, and in particular to guarding a shared resource without OS locks, using atomic locks instead. The main objection to OS locks is that they involve one more layer of the application environment – the OS layer – to perform the locking, whereas the atomic approach locks without any additional layers, directly through the locking instructions of an x86-compatible processor.

Before proceeding with the theme of the article, I should clarify that when I talk about locks within the scope of this article, I mean locking for a relatively short time. Accordingly, while a shared resource is locked, the method whose turn it is to use the resource is not expected to call any OS I/O operations or to perform any long-running computations. Long-running concurrent computations and OS I/O operations should still be guarded with OS locks, which are designed for exactly that kind of work; so if you are considering locks only with regard to such operations, you can safely skip this article.
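To make the distinction concrete, here is a minimal sketch (the class, the file name, and its contents are made up for illustration) of the kind of long-running I/O that belongs under an OS-backed lock such as C#'s lock statement, not under the spin lock we will build later:

using System;
using System.IO;

class FileLogger
{
    private static readonly object _sync = new object();

    // Long-running I/O: use an OS-backed lock (Monitor via the lock
    // statement), which puts waiting threads to sleep instead of spinning.
    public static void Append(string line)
    {
        lock (_sync)
        {
            File.AppendAllText("log.txt", line + Environment.NewLine);
        }
    }
}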

Provided you, the reader, have considered the previous paragraph, I can proceed with the article. The first thing I want to direct your attention to is the word “atomic”. What does it mean? “Atomic” traditionally denotes the smallest, indivisible unit of something, and that fits our context well: in concurrent programming, an atomic operation may be composed of several smaller steps, yet the processor executes it as one solid operation, i.e. the execution of an atomic operation cannot be interrupted by anything.

For example, suppose our program has a variable “a” of integer type that is manipulated by many threads, each of which increases “a” by 1. In the usual instruction flow, this operation divides into the following steps: 1. read “a” into some variable “x”, 2. add 1 to “x”, 3. write “x” back to “a”. Now suppose that before one thread completes the last step, another thread has already written its own value to “a”; the first thread then overwrites that value with its stale “x”, and an increment is lost. For the flow to stay correct, each thread would have to wait for the other to finish all three steps.
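Here is a minimal sketch of that lost update (the thread and iteration counts are arbitrary illustration values); running it typically prints a total below the expected one, while replacing the plain increment with Interlocked.Increment restores correctness:

using System;
using System.Linq;
using System.Threading;

class RaceDemo
{
    static int a = 0;

    static void Main()
    {
        var threads = Enumerable.Range(0, 4)
            .Select(_ => new Thread(() =>
            {
                for (int i = 0; i < 100000; i++)
                {
                    a++; // read, add 1, write back: three interruptible steps
                    // Interlocked.Increment(ref a); // the atomic alternative
                }
            }))
            .ToList();

        threads.ForEach(t => t.Start());
        threads.ForEach(t => t.Join());

        Console.WriteLine($"expected: 400000, actual: {a}");
    }
}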

Atomic operations ensure that the above flow is performed without interruption, irrespective of the number of processors, processor cores, or threads running simultaneously. Moreover, atomic operations are implemented at the level of the x86 processor itself, which makes guaranteeing such a flow considerably cheaper than protecting the same flow with a mechanism built on top of the processor, where execution could otherwise be interrupted at any step.

One issue still remains concerning atomicity: different processors/cores accessing the same memory address while simultaneously performing their respective atomic operations. Although each processor/core can execute an atomic operation independently, by default it is not required to wait for another processor/core to finish an atomic operation on the same memory address. To provide that waiting, the “lock” prefix was incorporated into the instruction set of the x86 architecture. The prefix precedes an atomic instruction and ensures that the processor/core first waits for completion if the same memory address is already being operated on by another processor/core.
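For reference, these are the x86 instructions the .NET JIT typically emits for the common Interlocked methods (the exact encodings can differ between JIT versions, so treat the comments as the usual mapping rather than a guarantee):

using System.Threading;

class InterlockedToX86
{
    static int _value;

    static void Main()
    {
        Interlocked.Increment(ref _value);              // typically JIT-compiled to: lock xadd
        Interlocked.Add(ref _value, 5);                 // typically: lock xadd
        Interlocked.Exchange(ref _value, 42);           // typically: xchg (implicitly locked on x86)
        Interlocked.CompareExchange(ref _value, 1, 0);  // typically: lock cmpxchg
    }
}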

In order to verify the behavior described above on the x86 architecture, we have to look at the machine code generated by .NET. The following program uses the atomic operation CompareExchange:

using System;
using System.Threading;

class Program
{
    static int _lock = 0;

    static void Main(string[] args)
    {
        // Atomically: if _lock == 0, set it to 1; returns the original value.
        int r = Interlocked.CompareExchange(ref _lock, 1, 0);
        Console.WriteLine(r);
    }
}

What CompareExchange does is simple: first, it compares _lock with the last argument (0); then, if they are equal, it assigns the second argument (1) to _lock. In both cases it returns the value that _lock held before the call, so a return value of 0 means the exchange happened, while anything else means nothing was changed.
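A short sketch of those return values (the values are just for illustration):

using System;
using System.Threading;

class CompareExchangeDemo
{
    static int _lock = 0;

    static void Main()
    {
        // _lock == 0, so the exchange happens: _lock becomes 1,
        // and the original value (0) is returned.
        int r1 = Interlocked.CompareExchange(ref _lock, 1, 0);
        Console.WriteLine(r1); // 0

        // _lock is now 1, not 0, so nothing changes and the
        // current value (1) is returned.
        int r2 = Interlocked.CompareExchange(ref _lock, 1, 0);
        Console.WriteLine(r2); // 1
    }
}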

Using the OllyDbg debugger, we can observe the machine code that .NET generated for the atomic operation:

OllyDbg Screenshot

The screenshot above gives ample evidence that when we use the atomic operation CompareExchange, the appropriate machine code is generated. In this sample, the machine code generated for the atomic operation is “lock cmpxchg dword ptr DS:[4043AC], EDX”.

Now that we have a grasp of how atomic operations and the processor’s lock prefix work, we can proceed to build a software lock that lets threads take turns managing a shared resource.

Consider the following – the ALock class and its usage:

using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading;

public class ALock
{
    // No volatile needed: Interlocked operations already provide full
    // memory barriers (and passing a volatile field by ref raises CS0420).
    private int _lock = 0;

    public IDisposable Lock(int threadId)
    {
        // threadId must be nonzero, since 0 means "free".
        var spinWait = new SpinWait();

        // CompareExchange returns the original value of _lock:
        // 0 means the lock was free and this call has just acquired it.
        while (Interlocked.CompareExchange(ref _lock, threadId, 0) != 0)
        {
            spinWait.SpinOnce();
        }

        return new ALockAcquired(this);
    }

    protected void Free()
    {
        Interlocked.Exchange(ref _lock, 0);
    }

    private class ALockAcquired : IDisposable
    {
        private readonly ALock _lockedObject;

        public ALockAcquired(ALock lockedObject)
        {
            _lockedObject = lockedObject;
        }

        public void Dispose()
        {
            _lockedObject.Free();
        }
    }
}

class Program
{
    private static ALock _lock = new ALock();

    private static volatile int counter = 0;

    static void Main(string[] args)
    {
        Stopwatch stopWatch = new Stopwatch();

        // Start timing before the threads are created, because
        // CreateThread starts each thread immediately.
        stopWatch.Start();

        var threads = Enumerable.Range(1, 10).Select(threadId => CreateThread(threadId)).ToList();

        threads.ForEach(t => t.Join());

        stopWatch.Stop();

        Console.WriteLine($"expected: {threads.Count * 100000}, actual: {counter}, elapsed: {stopWatch.ElapsedMilliseconds} ms");
    }

    private static Thread CreateThread(int threadId)
    {
        var thread = new Thread(ThreadMain);
        thread.IsBackground = true;
        thread.Start(threadId);
        return thread;
    }

    private static void ThreadMain(object _threadId)
    {
        Thread.BeginThreadAffinity();

        try
        {
            int threadId = (int)_threadId;

            var processorsCount = Environment.ProcessorCount;

            if (processorsCount > 1)
            {
                var currentThreadId = GetCurrentThreadId();

                // Find the OS thread backing this managed thread and pin it
                // to a core so the threads are spread evenly across the cores.
                foreach (ProcessThread thread in Process.GetCurrentProcess().Threads)
                {
                    if (thread.Id == currentThreadId)
                    {
                        var processorNum = threadId >= processorsCount ? threadId % processorsCount : threadId;
                        thread.ProcessorAffinity = (IntPtr)(1L << processorNum);
                        break;
                    }
                }
            }

            for (int i = 0; i < 100000; i++)
            {
                // counter++ is the non-atomic three-step flow described
                // earlier; ALock serializes access so no increment is lost.
                using (_lock.Lock(threadId))
                {
                    counter++;
                }
            }
        }
        finally
        {
            Thread.EndThreadAffinity();
        }
    }

    [DllImport("kernel32.dll")]
    static extern uint GetCurrentThreadId();
}

The first thing I want to emphasize is the use of the SpinWait class within the while loop inside the ALock.Lock method. Since we are constructing a spin lock, we want acquisition and release to happen as quickly as possible. We could use the OS API for the waiting, but in the context of a spin lock that would be expensive in many respects; so instead we simply wait for the lock in a while loop, which for short waits costs far less time than a context switch between threads. But what role does SpinWait play here?

At the processor level, if we use a bare loop – a while or for loop with nothing in its body – we force the processor to waste resources on looping with nothing to do, and those resources are taken away from other threads, including the very thread we are waiting on to release the lock. Moreover, when a whole group of threads is contending for the lock this way, every one of them suffers from the shortage of processor resources, which slows down the waiting itself and, in turn, the main action the resource was acquired for.

There are a few approaches to this issue. The first is to execute something inside the loop that the processor can recognize as a spin-wait hint and handle cheaply (on x86, the pause instruction); the second is to yield, switching context and relinquishing processor resources to other threads. The third combines the first and the second.

SpinWait was designed to perform exactly this kind of waiting. A developer only has to create an instance of the class before a loop and call its SpinOnce() method inside the loop. The method uses the first or second approach, or a combination of them; which one it chooses depends on how many times it has already been called within the loop.
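A minimal sketch of that escalation (the iteration count is arbitrary): SpinWait exposes NextSpinWillYield, which reports whether the next SpinOnce() will stop busy-spinning and yield the processor instead:

using System;
using System.Threading;

class SpinWaitDemo
{
    static void Main()
    {
        var spinWait = new SpinWait();

        for (int i = 0; i < 30; i++)
        {
            // After a number of busy spins, SpinOnce() starts yielding
            // the time slice instead of burning processor cycles.
            Console.WriteLine($"spin #{i}, next will yield: {spinWait.NextSpinWillYield}");
            spinWait.SpinOnce();
        }
    }
}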

With that covered, we can move on to the ThreadMain method above. It is essential for a multithreaded application to distribute its threads evenly among the available cores. Here that is accomplished via the GetCurrentThreadId Win32 function, which returns the id of the OS thread backing the current managed thread; we then use this id to find the corresponding ProcessThread and set its ProcessorAffinity so that all threads are assigned evenly across the available cores.

One more thing is worth mentioning with regard to threading: under some circumstances, a managed thread can be moved from one OS thread to another during its execution, which would break the evenness of the distribution among the cores. Thread.BeginThreadAffinity was intended to address this issue; all you have to do is call Thread.BeginThreadAffinity at the beginning of the thread action and Thread.EndThreadAffinity at the end of it.

In conclusion, we have looked at the beginning of the multithreading journey, or rather a guide to it; much of the detail is still behind the scenes. Treat this article as a peek into threading, because a lot more is happening under the hood, and you will have to dig into it yourself and acquire your own experience.

The repository related to this article is here.

Written on June 20, 2018