CPU threads, affinity and hyperthreading

This post is not a tutorial on C++11 threads, but it uses them as the main threading mechanism to demonstrate its points. It starts with a basic example but then quickly veers off into the specialized area of thread affinities, hardware topologies and performance implications of hyperthreading. It does as much as feasible in portable C++, clearly marking the deviations into platform-specific calls for the really specialized stuff.

CPU Socket: A physical connector on a motherboard that accepts a single physical chip. It is commonplace for modern CPUs to provide multiple physical cores, which are exposed to the operating system as logical CPUs capable of executing parallel streams of instructions. This document uses “socket” and “physical CPU” synonymously. See also: CPU Socket
NUMA: Non-Uniform Memory Access refers to the commonplace architecture in which machines with multiple CPU sockets divide their banks of RAM into nodes on a per-socket basis. Access to memory on a socket’s “local” memory node is faster than access to memory on a remote node tied to a different socket. See also: NUMA
CPU Core: Contemporary CPUs typically contain multiple cores, each of which is exposed to the underlying OS as a CPU. See also: Multi-core processing
Hyper-threading: An Intel technology that makes a single physical core appear to the OS as multiple logical cores on the same chip, in order to improve performance.
Logical CPU: What the operating system sees as a CPU. The number of logical CPUs available to the OS is num sockets * cores per socket * hyper-threads per core.
Processor Affinity: The act of restricting the set of logical CPUs on which a particular program thread can execute (see the sketch just below).
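
For concreteness, here is a minimal sketch (Linux-specific, not part of the original samples) that prints which logical CPUs the calling thread is currently allowed to run on; by default the affinity mask covers all of them:

#include <iostream>
#include <sched.h>  // cpu_set_t, sched_getaffinity (needs _GNU_SOURCE; g++ defines it by default on Linux)

int main() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  // pid 0 means "the calling thread".
  if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
    std::cerr << "sched_getaffinity failed\n";
    return 1;
  }
  std::cout << "Allowed to run on " << CPU_COUNT(&mask) << " logical CPUs:";
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
    if (CPU_ISSET(cpu, &mask)) {
      std::cout << " " << cpu;
    }
  }
  std::cout << "\n";
  return 0;
}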

Benefits of thread affinitization
Pinning a thread to a particular CPU ensures that the OS won’t reschedule it onto another core and incur a context switch that would force the thread to reload its working state from main memory, which results in jitter. When all critical threads in the processing pipeline are pinned to their own CPUs and busy-spinning, the OS scheduler is less likely to schedule other threads onto those cores, keeping the pinned threads’ processor caches hot.
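
To make that concrete, here is a minimal sketch (not from the original samples, using the Linux-specific sched_setaffinity call) of a critical worker that pins itself to one CPU and busy-spins on a flag instead of blocking:

#include <atomic>
#include <sched.h>
#include <thread>

std::atomic<bool> stop{false};

void pinned_spinner(int cpu) {
  // Pin the calling thread to a single logical CPU (Linux-specific;
  // pid 0 means "the calling thread").
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  sched_setaffinity(0, sizeof(cpuset), &cpuset);

  // Busy-spin instead of blocking, so the thread never yields the core and
  // its working set stays in that core's caches.
  while (!stop.load(std::memory_order_relaxed)) {
    // ... poll a queue, check timestamps, etc. ...
  }
}

int main() {
  std::thread t(pinned_spinner, 3);
  // ... produce work for a while ...
  stop.store(true);
  t.join();
  return 0;
}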

Logical CPUs, cores and threads

Most modern machines are multi-CPU. Whether these CPUs are divided into sockets and hardware cores depends on the machine, of course, but the OS sees a number of “logical” CPUs that can execute tasks concurrently.

The easiest way to get this information on Linux is to cat /proc/cpuinfo, which lists the system’s CPUs in order, providing some information about each (such as current frequency, cache size, etc.). On my (8-CPU) machine:

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz
[...]
stepping : 3
microcode : 0x7
cpu MHz : 3501.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
[...]

processor : 1
vendor_id : GenuineIntel
cpu family : 6
[...]

[...]
processor : 7
vendor_id : GenuineIntel
cpu family : 6

A summary output can be obtained from lscpu:

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Stepping: 3
CPU MHz: 3501.000
BogoMIPS: 6984.09
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7

Here it’s also very easy to see that the machine has 4 cores, each having two HW threads (see hyperthreading). And yet the OS sees them as 8 “CPUs” numbered 0-7.
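
To see how the logical CPUs map onto physical cores, Linux exposes the topology under sysfs; the following minimal sketch (not part of the original samples) prints the core id behind each logical CPU, so hyperthread siblings show up as two CPUs sharing one core id:

#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
  unsigned num_cpus = std::thread::hardware_concurrency();
  for (unsigned cpu = 0; cpu < num_cpus; ++cpu) {
    // Each logical CPU reports which physical core it belongs to; two logical
    // CPUs sharing a core id are hyperthread siblings.
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/topology/core_id");
    std::string core_id;
    std::getline(f, core_id);
    std::cout << "CPU " << cpu << " => core " << core_id << "\n";
  }
  return 0;
}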

Launching a thread per CPU

The C++11 threading library gracefully made available a utility function that we can use to find out how many CPUs the machine has, so that we could plan our parallelism strategy. The function is called hardware_concurrency, and here is a complete example that uses it to launch an appropriate number of threads. The following is just a code snippet; full code samples for this post, along with a Makefile for Linux can be found in this repository.

#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main(int argc, const char** argv) {
  unsigned num_cpus = std::thread::hardware_concurrency();
  std::cout << "Launching " << num_cpus << " threads\n";

  // A mutex ensures orderly access to std::cout from multiple threads.
  std::mutex iomutex;
  std::vector<std::thread> threads(num_cpus);
  for (unsigned i = 0; i < num_cpus; ++i) {
    threads[i] = std::thread([&iomutex, i] {
      {
        // Use a lexical scope and lock_guard to safely lock the mutex only for
        // the duration of std::cout usage.
        std::lock_guard<std::mutex> iolock(iomutex);
        std::cout << "Thread #" << i << " is running\n";
      }

      // Simulate important work done by the thread by sleeping for a bit...
      std::this_thread::sleep_for(std::chrono::milliseconds(200));
    });
  }

  for (auto& t : threads) {
    t.join();
  }
  return 0;
}

A std::thread is a thin wrapper around a platform-specific thread object; this is something we’ll use to our advantage shortly. So when we launch a std::thread, an actual OS thread is launched. This is fairly low-level thread control, but in this article I won’t detour into higher-level constructs like task-based parallelism, leaving this to some future post.
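
As a small taste of that advantage, here is a minimal sketch (Linux/glibc-specific, not from the original samples) that passes the handle returned by native_handle to a pthreads call, in this case pthread_setname_np, to give the thread a name visible in tools like top -H:

#include <pthread.h>
#include <thread>

int main() {
  std::thread t([] {
    // ... thread body ...
  });
  // On Linux with libstdc++, native_handle() yields the underlying pthread_t,
  // which can be handed straight to pthread_* functions.
  pthread_setname_np(t.native_handle(), "worker");
  t.join();
  return 0;
}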

Thread affinity

So we know how to query the system for the number of CPUs it has, and how to launch any number of threads. Now let’s do something a bit more advanced.

All modern OSes support setting CPU affinity per thread. Affinity means that instead of being free to run the thread on any CPU it feels like, the OS scheduler is asked to only schedule a given thread to a single CPU or a pre-defined set of CPUs. By default, the affinity covers all logical CPUs in the system, so the OS can pick any of them for any thread, based on its scheduling considerations. In addition, the OS will sometimes migrate threads between CPUs if it makes sense to the scheduler (though it should try to minimize migrations because of the loss of warm caches on the core from which the thread was migrated). Let’s observe this in action with another code sample:

#include <chrono>
#include <iostream>
#include <mutex>
#include <sched.h>
#include <thread>
#include <vector>

int main(int argc, const char** argv) {
  constexpr unsigned num_threads = 4;
  // A mutex ensures orderly access to std::cout from multiple threads.
  std::mutex iomutex;
  std::vector<std::thread> threads(num_threads);
  for (unsigned i = 0; i < num_threads; ++i) {
    threads[i] = std::thread([&iomutex, i] {
      while (1) {
        {
          // Use a lexical scope and lock_guard to safely lock the mutex only
          // for the duration of std::cout usage.
          std::lock_guard<std::mutex> iolock(iomutex);
          std::cout << "Thread #" << i << ": on CPU " << sched_getcpu() << "\n";
        }

        // Simulate important work done by the thread by sleeping for a bit...
        std::this_thread::sleep_for(std::chrono::milliseconds(900));
      }
    });
  }

  for (auto& t : threads) {
    t.join();
  }
  return 0;
}

This sample launches four threads that loop infinitely, sleeping and reporting which CPU they run on. The reporting is done via the sched_getcpu function (glibc specific - other platforms will have other APIs with similar functionality). Here’s a sample run:

$ ./launch-threads-report-cpu
Thread #0: on CPU 5
Thread #1: on CPU 5
Thread #2: on CPU 2
Thread #3: on CPU 5
Thread #0: on CPU 2
Thread #1: on CPU 5
Thread #2: on CPU 3
Thread #3: on CPU 5
Thread #0: on CPU 3
Thread #2: on CPU 7
Thread #1: on CPU 5
Thread #3: on CPU 0
Thread #0: on CPU 3
Thread #2: on CPU 7
Thread #1: on CPU 5
Thread #3: on CPU 0
Thread #0: on CPU 3
Thread #2: on CPU 7
Thread #1: on CPU 5
Thread #3: on CPU 0
^C

Some observations: the threads are sometimes scheduled onto the same CPU, and sometimes onto different CPUs. Also, there’s quite a bit of migration going on. Eventually, the scheduler managed to place each thread onto a different CPU, and keep it there. Different constraints (such as system load) could result in a different scheduling, of course.

Now let’s rerun the same sample, but this time using taskset to restrict the affinity of the process to only two CPUs - 5 and 6:

$ taskset -c 5,6 ./launch-threads-report-cpu
Thread #0: on CPU 5
Thread #2: on CPU 6
Thread #1: on CPU 5
Thread #3: on CPU 6
Thread #0: on CPU 5
Thread #2: on CPU 6
Thread #1: on CPU 5
Thread #3: on CPU 6
Thread #0: on CPU 5
Thread #1: on CPU 5
Thread #2: on CPU 6
Thread #3: on CPU 6
Thread #0: on CPU 5
Thread #1: on CPU 6
Thread #2: on CPU 6
Thread #3: on CPU 6
^C

As expected, though there’s some migration happening here, all threads remain faithfully locked to CPUs 5 and 6, as instructed.

Setting CPU affinity programmatically

As we’ve seen earlier, command-line tools like taskset let us control the CPU affinity of a whole process. Sometimes, however, we’d like to do something more fine-grained and set the affinities of specific threads from within the program. How do we do that?

On Linux, we can use the pthread-specific pthread_setaffinity_np function. Here’s an example that reproduces what we did before, but this time from inside the program. In fact, let’s go a bit fancier and pin each thread to a single known CPU by setting its affinity:

#include <chrono>
#include <iostream>
#include <mutex>
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

int main(int argc, const char** argv) {
  constexpr unsigned num_threads = 4;
  // A mutex ensures orderly access to std::cout from multiple threads.
  std::mutex iomutex;
  std::vector<std::thread> threads(num_threads);
  for (unsigned i = 0; i < num_threads; ++i) {
    threads[i] = std::thread([&iomutex, i] {
      std::this_thread::sleep_for(std::chrono::milliseconds(20));
      while (1) {
        {
          // Use a lexical scope and lock_guard to safely lock the mutex only
          // for the duration of std::cout usage.
          std::lock_guard<std::mutex> iolock(iomutex);
          std::cout << "Thread #" << i << ": on CPU " << sched_getcpu() << "\n";
        }

        // Simulate important work done by the thread by sleeping for a bit...
        std::this_thread::sleep_for(std::chrono::milliseconds(900));
      }
    });

    // Create a cpu_set_t object representing a set of CPUs. Clear it and mark
    // only CPU i as set.
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(i, &cpuset);
    int rc = pthread_setaffinity_np(threads[i].native_handle(),
                                    sizeof(cpu_set_t), &cpuset);
    if (rc != 0) {
      std::cerr << "Error calling pthread_setaffinity_np: " << rc << "\n";
    }
  }

  for (auto& t : threads) {
    t.join();
  }
  return 0;
}

Note how we use the native_handle method discussed earlier in order to pass the underlying native handle to the pthread call (it takes a pthread_t ID as its first argument). The output of this program on my machine is:

$ ./set-affinity
Thread #0: on CPU 0
Thread #1: on CPU 1
Thread #2: on CPU 2
Thread #3: on CPU 3
Thread #0: on CPU 0
Thread #1: on CPU 1
Thread #2: on CPU 2
Thread #3: on CPU 3
Thread #0: on CPU 0
Thread #1: on CPU 1
Thread #2: on CPU 2
Thread #3: on CPU 3
^C

The threads get pinned to single CPUs exactly as requested.
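
For completeness, the same approach carries over to other platforms via their own native calls; here is a rough sketch (hypothetical, assuming Windows with MSVC, where native_handle() is the Win32 thread HANDLE) using SetThreadAffinityMask:

#include <iostream>
#include <thread>
#include <windows.h>

int main() {
  std::thread t([] {
    // ... worker body ...
  });
  // The second argument is a bit mask of allowed logical CPUs;
  // (1 << 2) restricts the thread to logical CPU 2. A zero return
  // value indicates failure.
  if (SetThreadAffinityMask(t.native_handle(), DWORD_PTR(1) << 2) == 0) {
    std::cerr << "SetThreadAffinityMask failed\n";
  }
  t.join();
  return 0;
}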
