How does a mutex work? What does a mutex cost?

Concurrent programming requires synchronisation. We can’t have more than one thread accessing data at the same time otherwise we end up with a data race. The most common solution is to wrap the critical data access in a mutex. Mutexes are, of course, not free. How the mutex is used has a significant impact in the cost of the code we are writing. When used correctly we’ll barely notice the overhead. When used incorrectly it can cause a program to run worse in threaded mode than it would have single threaded!

What is a mutex?

A mutex (mutual exclusion lock) is a synchronization primitive used to prevent multiple threads from accessing a shared resource at the same time. A mutex, in its most fundamental form, is just an integer in memory. This memory can have a few different values depending on the that state of the mutex. Though usually when we speak of mutexes we also talk of the locks which use the mutex. The integer in memory is not very interesting, but the operations around it are.

How does a mutex work?

At a high level, a mutex has two main operations:

Lock (acquire)

• A thread tries to take ownership of the mutex:
• If the mutex is free → thread acquires it and continues
• If it’s already locked → the thread waits (blocks or spins)

Unlock (release)

• The owning thread releases the mutex:
• Another waiting thread can now acquire it

Typical flow:

• Thread A calls lock() → succeeds → enters critical section
• Thread B calls lock() → blocked (since A holds it)
• Thread A calls unlock()
• Thread B wakes up → acquires lock → proceeds

Under the hood:

• Implemented using atomic CPU instructions (like compare-and-swap)
• May involve OS scheduling if threads are blocked
• Can be either:
- Blocking mutex (thread sleeps)
- Spinlock (thread loops until available)

There can be only one lock on a mutex at any given time. If another thread wishes to lock the same mutex it must wait for the first to unlock it. This is the primary goal of the mutex. Attempting to lock an already locked mutex is called contention. In a well planned program contention should be quite low; you should be designing your code so that most attempts to lock the mutex will not block.

There are two reasons why you want to avoid contention. The first is simply that any thread waiting on a mutex is obviously not doing anything else — possibly resulting in unused CPU cycles. The second reason is more interesting for high performance code. Locking a currently unlocked mutex is extremely cheap compared to the contention case. We have to look at how the mutex works to understand why.

What does a mutex cost?

Mutexes are not “free.” They introduce several types of overhead:

a. Time overhead

Lock/unlock operations take CPU cycles
Contention (many threads competing) increases latency
Blocking mutexes may trigger:
Context switches (expensive)
Kernel involvement

b. Performance impact

Serializes access → reduces parallelism
Can become a bottleneck in highly concurrent systems

c. Memory overhead

Mutex object itself (usually small)
OS structures for managing waiting threads

d. Hidden costs

Cache coherence traffic (especially on multicore CPUs)
Priority inversion (low-priority thread holding lock blocks high-priority one)

More details about Mutex

As mentioned before, the data of a mutex is simply an integer in memory. It’s value starts as 0, meaning that it is unlocked. If you wish to lock the mutex you can simply check if it is zero and then assign one. The mutex is now locked and you are the owner of it.

The trick is that the test and set operation has to be atomic. If two threads happen to read 0 at the exact same time, then both would write 1 and think they own the mutex. Without CPU support there is no way to implement a mutex in user space: this operation must be atomic with respect to the other threads. Fortunately CPUs has a function called “compare-and-set” or “test-and-set” which does exactly this. This function takes the address of the integer, and two integer values: a compare and set value. If the compare value matches the current value of the integer then it is replaced with the new value. In C style code this might like look this:

int compare_set( int * to_compare, int compare, int set );
int mutex_value;
int result = compare_set( &mutex_value, 0, 1 );
if( !result ) { /* we got the lock */ }

The caller determines what happens by the return value. It is the value at the pointer provided prior to the swap. If this value is equal to the test value the caller knows the set was successful. If the value is different then the caller knows the value has not changed. When the piece of code is done with the mutex it can simply set the value back to 0. This makes up the very basic part of our mutex.

More details about waiting duration

Now comes the tricky part. Well, only in a way is it tricky, in another way it is simple. The above test-and-set mechanism provides no support for a thread to wait on the value (aside from a CPU intensive spin-lock). The CPU doesn’t really understand high-level threads and processes, so it isn’t in a position to implement waiting. The OS must provide the waiting functionality.

In order for the CPU to wait correctly a caller is going to need to go through a system call. It is the only thing that can synchronise the various threads and provide the waiting functionality. So if we have to wait on a mutex, or release a waiting mutex, we have no choice but to call the OS. Most OSs have built in mutex primitives. In some cases they provide full fledged mutexes. So if a system call does provide a full mutex why would we bother with any sort of test-and-set in user space? The answer is that system calls have quite a bit of overhead and should be avoided when possible.

Various operating systems diverge greatly at this point, and will likely change as time goes on. Under linux there is a system call futex which provides mutex like semantics. It is specifically designed so that non-contention cases can be completely resolved in user space. Contention cases are then delegated to the operating system to handle in a safe, albeit far costlier manner. The waiting is then handled as part of the OS process scheduler.

More details about Mutex cost

There are a few points of interest when it comes to the cost of a mutex. The first, and very vital point, is waiting time. Your threads should spend only a fraction of their time waiting on mutexes. If they are waiting too often then you are losing concurrency. In a worst case scenario many threads always trying to lock the same mutex may result in performance worse than a single thread serving all requests. This really isn’t a cost of the mutex itself, but a serious concern with concurrent programming.

The overhead costs of a mutex relate to the test-and-set operation and the system call that implements a mutex. The test-and-set is likely very low cost; being essential to concurrent processing the CPUs have good reason to make it efficient. We’ve kind of omitted another important instruction however: the fence. This is used in all high-level mutexes and may have a higher cost than the test-and-set operation. More costlier than even that however is the system call. Not only do you suffer the context switch overhead of the system call, the kernel now spends some time in its scheduling code.