What is the difference between an I/O-bound process and a CPU-bound process?
Multi-threading is where it tends to matter the most
In this answer, I will investigate one important use case of distinguishing CPU- vs. IO-bound work: writing multi-threaded code.
RAM I/O bound example: Vector Sum
Consider a program that sums all the values of a single vector:
```c
#define SIZE 1000000000
unsigned int is[SIZE];
unsigned int sum = 0;
size_t i = 0;
for (i = 0; i < SIZE; i++)
    /* Each one of those requires a RAM access! */
    sum += is[i];
```
Parallelizing that by splitting the array equally for each of your cores is of limited usefulness on common modern desktops.
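To make "splitting the array equally" concrete, here is a minimal C++ sketch of the idea (this is not the benchmark code used for the plot, which is linked below; the array size, the fill value of 1, and the use of std::thread are arbitrary illustrative choices):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t size = 100000000;  // smaller than the benchmark so it fits in RAM easily
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<unsigned> is(size, 1);
    std::vector<unsigned long long> partial(nthreads, 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([&, t] {
            const std::size_t begin = t * size / nthreads;
            const std::size_t end = (t + 1) * size / nthreads;
            // Each thread still has to pull its chunk of the array over the
            // shared memory bus, which is what ends up limiting the speedup.
            partial[t] = std::accumulate(is.begin() + begin, is.begin() + end, 0ULL);
        });
    }
    for (auto &th : threads) th.join();
    const unsigned long long sum =
        std::accumulate(partial.begin(), partial.end(), 0ULL);
    std::printf("sum = %llu\n", sum);
}
```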
For example, on my Lenovo ThinkPad P51 laptop running Ubuntu 19.04, with an Intel Core i7-7820HQ CPU (4 cores / 8 threads) and 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB), I get results like this:
[Plot: runtime vs. number of threads for the vector sum benchmark. Plot data.]
Note that there is a lot of variance between runs, however. I can't increase the array size much further since I'm already at 8GiB, and I'm not in the mood for statistics across multiple runs today, but this seemed like a typical run after doing many manual runs.
Benchmark code:
- POSIX C pthread source code used in the graph. And here is a C++ version that produces analogous results.
- plot script
I don't know enough computer architecture to fully explain the shape of the curve, but one thing is clear: the computation does not become 8x faster as one might naively expect from using all 8 threads! For some reason, 2 and 3 threads were the optimum, and adding more just makes things much slower.
Compare this to CPU bound work, which actually does get 8 times faster: What do 'real', 'user' and 'sys' mean in the output of time(1)?
The reason is that all processors share a single memory bus linking them to RAM:
```
CPU 1   --\    Bus    +-----+
CPU 2   ---\__________| RAM |
...        ---/       +-----+
CPU N   --/
```
so the memory bus quickly becomes the bottleneck, not the CPU.
This happens because adding two numbers takes a single CPU cycle, while memory reads take about 100 CPU cycles on 2016 hardware.
So the CPU work done per byte of input data is too small, and we call this an IO-bound process.
The only way to speed up that computation further would be to speed up individual memory accesses with new memory hardware, e.g. multi-channel memory.
Upgrading to a faster CPU clock for example would not be very useful.
Other examples
- Matrix multiplication is CPU-bound on RAM and GPUs. The input contains 2 * N**2 numbers, but N**3 multiplications are done, and that is enough for parallelization to be worth it for practical large N (a minimal sketch illustrating this ratio follows after this list).
This is why parallel CPU matrix multiplication libraries like the following exist:
- http://www.netlib.org/scalapack/pblas_qref.html
- http://icl.cs.utk.edu/magma/software/
Cache usage makes a big difference to the speed of implementations. See for example this didactic GPU comparison example.
See also:
- Why can GPU do matrix multiplication faster than CPU?
- BLAS equivalent of a LAPACK function for GPUs
- Networking is the prototypical IO-bound example.
  Even when we send a single byte of data, it still takes a long time to reach its destination.
  Parallelizing small network requests like HTTP requests can offer huge performance gains.
  If the network is already at full capacity (e.g. downloading a torrent), parallelization can still improve latency (e.g. you can load a web page "at the same time").
- A dummy C++ CPU-bound operation that takes one number and crunches it a lot:
  - serial
  - parallel
- Sorting appears to be CPU bound based on the following experiment: Are C++17 Parallel Algorithms implemented already?, which showed a 4x performance improvement for parallel sort, but I would like to have a more theoretical confirmation as well.
- The well-known CoreMark benchmark from EEMBC explicitly checks how well a suite of problems scales. A sample benchmark result, clearly showing the scaling:
```
                                                 MultiCore SingleCore
Workload Name                                     (iter/s)   (iter/s)    Scaling
----------------------------------------------- ---------- ---------- ----------
cjpeg-rose7-preset                                  526.32     178.57       2.95
core                                                  7.39       2.16       3.42
linear_alg-mid-100x100-sp                           684.93     238.10       2.88
loops-all-mid-10k-sp                                 27.65       7.80       3.54
nnet_test                                            32.79      10.57       3.10
parser-125k                                          71.43      25.00       2.86
radix2-big-64k                                     2320.19     623.44       3.72
sha-test                                            555.56     227.27       2.44
zip-test                                            363.64     166.67       2.18

MARK RESULTS TABLE
Mark Name                                        MultiCore SingleCore    Scaling
----------------------------------------------- ---------- ---------- ----------
CoreMark-PRO                                      18743.79    6306.76       2.97
```
- The linking of a C++ program can be parallelized to a certain degree: Can gcc use multiple cores when linking?
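As promised above, here is a minimal (deliberately naive) C++ sketch of the matrix multiplication example, just to make the operation count visible; it is not a library implementation, and the row-major layout is an arbitrary illustrative choice:

```cpp
#include <cstddef>
#include <vector>

// Naive N x N matrix multiplication: C = A * B.
// Reads 2 * N^2 input numbers but performs about N^3 multiply-adds,
// so for large N there is plenty of CPU work per byte of input.
void matmul(const std::vector<double> &A, const std::vector<double> &B,
            std::vector<double> &C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < N; ++k)
                acc += A[i * N + k] * B[k * N + j];  // N multiplications per output element
            C[i * N + j] = acc;
        }
}
```

For N = 1000 that is about two million input numbers but a billion multiply-adds, which is why the parallel libraries linked above pay off.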
How to find out if you are CPU or IO bound
Non-RAM IO bound like disk, network: `ps aux`, then check if `CPU% / 100 < n threads`. If yes, you are IO bound, e.g. blocking `read`s are just waiting for data and the scheduler is skipping that process. Then use further tools like `sudo iotop` to decide which IO is the problem exactly.
Or, if execution is quick and you parametrize the number of threads, you can easily see from `time` that performance improves as the number of threads increases for CPU-bound work: What do 'real', 'user' and 'sys' mean in the output of time(1)?
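For example, here is a minimal C++ sketch of such a parametrized CPU-bound test (the thread count taken from argv[1] and the busy-loop workload are arbitrary illustrative choices):

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Pure CPU-bound work: almost no memory traffic, so it should scale
// with the number of physical cores.
static void crunch(unsigned long long iters, unsigned long long *out) {
    unsigned long long x = 0;
    for (unsigned long long i = 0; i < iters; ++i)
        x += i * i;
    *out = x;  // store the result so the loop is not optimized away
}

int main(int argc, char **argv) {
    const unsigned nthreads = argc > 1 ? std::max(1, std::atoi(argv[1])) : 1;
    const unsigned long long total = 4000000000ULL;
    std::vector<unsigned long long> results(nthreads);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t)
        threads.emplace_back(crunch, total / nthreads, &results[t]);
    for (auto &th : threads) th.join();
    std::printf("%llu\n", results[0]);
}
```

Running `time ./a.out 1` vs. `time ./a.out 8` on CPU-bound code like this should show `real` shrinking as threads are added (up to the core count), while `user` stays roughly constant.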
RAM-IO bound: harder to tell, as RAM wait time is included in `CPU%` measurements; see also:
- How to check if app is cpu-bound or memory-bound?
- https://askubuntu.com/questions/1540/how-can-i-find-out-if-a-process-is-cpu-memory-or-disk-bound
Some options:
- Intel Advisor Roofline (non-free): https://software.intel.com/en-us/articles/intel-advisor-roofline (archive) "A Roofline chart is a visual representation of application performance in relation to hardware limitations, including memory bandwidth and computational peaks."
GPUs
GPUs have an IO bottleneck when you first transfer the input data from the regular CPU readable RAM to the GPU.
Therefore, GPUs can only be better than CPUs for CPU bound applications.
Once the data is transferred to the GPU however, it can operate on those bytes faster than the CPU can, because the GPU:
- has more data localization than most CPU systems, and so data can be accessed faster for some cores than others
- exploits data parallelism and sacrifices latency by just skipping over any data that is not ready to be operated on immediately.
  Since the GPU has to operate on large parallel input data, it is better to just skip to the next data that might be available, instead of waiting for the current data to become available and blocking all other operations like the CPU mostly does.
Therefore the GPU can be faster than a CPU if your application:
- can be highly parallelized: different chunks of data can be treated separately from one another at the same time
- requires a large enough number of operations per input byte (unlike e.g. vector addition which does one addition per byte only)
- operates on a large number of input bytes
These design choices originally targeted the application of 3D rendering, whose main steps are as shown at What are shaders in OpenGL and what do we need them for?
- vertex shader: multiplying a bunch of 1x4 vectors by a 4x4 matrix
- fragment shader: calculate the color of each pixel of a triangle based on its relative position within the triangle
and so we conclude that those applications are CPU-bound.
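As a tiny plain C++ illustration of the vertex shader step (not GLSL; the types and function name are purely illustrative): one 4x4 matrix times 4-vector product per vertex, with no dependencies between vertices, which is exactly the kind of work that parallelizes across many GPU cores:

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>;

// CPU sketch of what a vertex shader does to every vertex:
// one 4x4 matrix * 4-vector product, fully independent per vertex.
void transform_vertices(const Mat4 &m, std::vector<Vec4> &vertices) {
    for (Vec4 &v : vertices) {
        Vec4 out{};
        for (std::size_t row = 0; row < 4; ++row)
            for (std::size_t col = 0; col < 4; ++col)
                out[row] += m[row][col] * v[col];
        v = out;
    }
}
```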
With the advent of programmable GPGPU, we can observe several GPGPU applications that serve as examples of CPU bound operations:
- Image Processing with GLSL shaders?
  Local image processing operations such as a blur filter are highly parallel in nature.
- Is it possible to build a heatmap from point data at 60 times per second?
  Plotting of heatmap graphs if the plotted function is complex enough.
- https://www.youtube.com/watch?v=fE0P6H8eK4I "Real-Time Fluid Dynamics: CPU vs GPU" by Jesús Martín Berlanga
  Solving partial differential equations such as the Navier-Stokes equations of fluid dynamics:
  - highly parallel in nature, because each point only interacts with its neighbours
  - there tend to be enough operations per byte
See also:
- Why are we still using CPUs instead of GPUs?
- What are GPUs bad at?
- https://www.youtube.com/watch?v=_cyVDoyI6NE "CPU vs GPU (What's the Difference?) - Computerphile"
CPython Global Interpreter Lock (GIL)
As a quick case study, I want to point out the Python Global Interpreter Lock (GIL): What is the global interpreter lock (GIL) in CPython?
This CPython implementation detail prevents multiple Python threads from running CPU-bound work efficiently. The CPython docs say:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use `multiprocessing` or `concurrent.futures.ProcessPoolExecutor`. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Therefore, here we have a case where CPU-bound work is not suitable for threads and I/O-bound work is.
JavaScript async and Node.js worker_threads
The situation is similar to Python's.
JavaScript is fundamentally single-threaded. I'm not sure if this is part of the language standard or not (for Python it is not; there isn't even a language standard besides the reference CPython implementation, AFAIK).
JavaScript has the `async` keyword, which allows execution to stall and then start executing something else. You can write stuff like:
```js
async function myfunc(init) {
  let ret = init
  for (let i = 0; i < 1000000; i++) {
    ret += i*i + 2*i + 3
  }
  return ret;
}

async function doNetworkRequest(init) {
  // Some native method that gets network data.
}

const [first, second, myfunc_ret1, myfunc_ret2] = await Promise.all([
  doNetworkRequest('example.com'),
  doNetworkRequest('example2.com'),
  myfunc(1),
  myfunc(2),
])
```
where `await` says "wait for all these async things to finish before moving on".
However, only one of the `async` methods can run at a time on your CPU, so the CPU-intensive work of `myfunc` does not get sped up by this at all.
The prototypically IO-bound operation of network access, however, can get sped up: the above fires both network requests one after the other and just waits for both to return while the servers/network do the work.
The fact that there is an actual keyword in the language dedicated to this, `async`, is telling: unsurprisingly, network requests are very important in the browser-dominated context of JavaScript.
With the advent of Node.js, however, people increasingly wanted to parallelize CPU workloads as well, and they reached a similar solution to CPython's: create separate processes rather than threads. This is done with the `worker_threads` library, which:
- actually takes a JavaScript script path as input, and launches a new interpreter instance to run it
- since the processes don't share memory space, you pass messages between instances with a messaging system
https://medium.com/@Trott/using-worker-threads-in-node-js-80494136dbb6 contains a good example.
The documentation of `worker_threads` once more states the difference mentioned elsewhere in this answer:
Workers (threads) are useful for performing CPU-intensive JavaScript operations. They do not help much with I/O-intensive work. The Node.js built-in asynchronous I/O operations are more efficient than Workers can be.
In the browser there are also Web Workers; I'm not sure how they compare to Node's implementation: What's the difference between Web Workers and Worker Threads?
Source: https://stackoverflow.com/q/868568