This is a new implementation of BabelStream using Fortran.
The code uses a Fortran driver that is largely equivalent to the C++ one, with a few exceptions. First, it does not use a C++ class for the stream object, since that doesn't seem like a useful way to do things in Fortran. Instead, I use a module that contains the same methods, with alloc and dealloc routines that act like the constructor and destructor.
The current implementations are:
- DO CONCURRENT
- Fortran array notation
- Sequential DO loops
- OpenACC parallel loop
- OpenACC kernels on Fortran array notation
- OpenMP parallel do
- OpenMP taskloop
- OpenMP target teams distribute parallel do simd
- OpenMP target teams loop
- CUDA Fortran (handwritten CUDA Fortran kernels, except DOT)
- CUDA Fortran kernels (!$cuf kernel do <<<*,*>>>)
I have tested with GCC, Intel (ifort and ifx), and NVHPC compilers on AArch64, x86_64 and NVIDIA GPU targets, although not exhaustively. Cray and Fujitsu have been tested as well. The only untested compiler of significance is IBM XLF.
The current build system is GNU Make, and requires the user to manually specify the compiler and implementation.
CSV printing is supported.
Squashed commit of the following:
commit 15f13ef9d326102cc003b2fdfe1b31c4aea55373
Author: Jeff Hammond <email>
Date: Tue Nov 15 06:42:46 2022 +0200
8 cores unless user changes
commit 62ca680546ff89a1987b6fb797273038f767bf7b
Author: Jeff Hammond <email>
Date: Tue Nov 15 06:42:09 2022 +0200
hoist and disable orin flags
commit 76495509abcdb0686f293a72f7ded7c8ed7bb882
Author: Jeff Hammond <email>
Date: Tue Nov 15 06:40:13 2022 +0200
cleanup scripts
commit 5b45df87954282cbb6b0f7eb2dcb3570d08bb5c2
Author: Jeff Hammond <email>
Date: Tue Nov 15 06:39:31 2022 +0200
add autopar flag for GCC
commit 87eb07e4a8c3e8d6247ab5f72e14bf90002733ce
Merge: a732e7c 270644e
Author: Jeff Hammond <email>
Date: Wed Nov 9 15:53:41 2022 +0200
Merge remote-tracking branch 'origin/fortran_compiler_details' into fortran-ports
commit a732e7c49e12ce8aff15e9d4bcbd215fa4a05d82
Merge: cfafd99 5697d94
Author: Jeff Hammond <email>
Date: Wed Nov 9 15:53:36 2022 +0200
Merge remote-tracking branch 'origin/fortran_int32_option' into fortran-ports
commit cfafd993b646d5f5a90eb6d37d347cc545ab36d4
Merge: de5ff67 26a9707
Author: Jeff Hammond <email>
Date: Wed Nov 9 15:53:25 2022 +0200
Merge remote-tracking branch 'origin/fortran_csv' into fortran-ports
commit de5ff6772b2036ad259a6a9c331ff5408146b54c
Merge: 3109653
BabelStream
Measure memory transfer rates to/from global device memory on GPUs. This benchmark is similar in spirit to, and based on, the STREAM benchmark [1] for CPUs.
Unlike other GPU memory bandwidth benchmarks, this does not include the PCIe transfer time.
There are multiple implementations of this benchmark in a variety of programming models.
This code was previously called GPU-STREAM.
Programming Models
BabelStream is currently implemented in the following parallel programming models, listed in no particular order:
- OpenCL
- CUDA
- HIP
- OpenACC
- OpenMP 3 and 4.5
- C++ Parallel STL
- Kokkos
- RAJA
- SYCL and SYCL 2020
- TBB
- Thrust (via CUDA or HIP)
This project also contains implementations in alternative languages with different build systems:
- Julia - JuliaStream.jl
- Java - java-stream
- Scala - scala-stream
- Rust - rust-stream
How is this different to STREAM?
BabelStream implements the four main kernels of the STREAM benchmark (along with a dot product), but by utilising different programming models it expands the platforms on which the code can run beyond CPUs.
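For orientation, the five kernels can be sketched in plain serial C++ as follows. This is a minimal illustration and not code from the BabelStream source: the function names and the use of `std::vector<double>` are assumptions, and the operand assignment follows the original STREAM definitions, so details may differ in a given BabelStream implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names: serial reference versions of the STREAM-style kernels.
using Array = std::vector<double>;

// Copy: c[i] = a[i]
void copy_kernel(const Array& a, Array& c) {
  assert(a.size() == c.size());
  for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i];
}

// Mul (called "Scale" in STREAM): b[i] = scalar * c[i]
void mul_kernel(Array& b, const Array& c, double scalar) {
  assert(b.size() == c.size());
  for (std::size_t i = 0; i < b.size(); ++i) b[i] = scalar * c[i];
}

// Add: c[i] = a[i] + b[i]
void add_kernel(const Array& a, const Array& b, Array& c) {
  assert(a.size() == b.size() && a.size() == c.size());
  for (std::size_t i = 0; i < c.size(); ++i) c[i] = a[i] + b[i];
}

// Triad: a[i] = b[i] + scalar * c[i]
void triad_kernel(Array& a, const Array& b, const Array& c, double scalar) {
  assert(a.size() == b.size() && a.size() == c.size());
  for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i] + scalar * c[i];
}

// Dot: returns the sum over i of a[i] * b[i]
double dot_kernel(const Array& a, const Array& b) {
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
  return sum;
}
```

Each programming model in the project expresses loops of this shape in its own way (for example as device kernels or parallel loops); the serial form above only shows the arithmetic.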
The key differences from STREAM are that:
- the arrays are allocated on the heap
- the problem size is unknown at compile time
- wider platform and programming model support
With stack arrays of known size at compile time, the compiler is able to align data and issue optimal instructions (for example, using non-temporal stores or removing peel/remainder vectorisation loops). But this information is not typically available in real HPC codes today, where the problem size is read from the user at runtime.
BabelStream therefore provides a measure of what memory bandwidth performance can be attained (by a particular programming model) if you follow today's best parallel programming practice.
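To make the measurement concrete, the sketch below allocates arrays on the heap with a size known only at run time, times one triad pass, and converts the result to a bandwidth. It is an illustration under stated assumptions rather than BabelStream's actual driver: the helper names `triad_bytes` and `time_triad_gbps` are hypothetical, and the byte count uses the STREAM convention of counting each array read or written once (three arrays for triad).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Bytes moved by one triad pass: two arrays read, one written (STREAM convention).
std::size_t triad_bytes(std::size_t n) { return 3 * n * sizeof(double); }

// Runs a[i] = b[i] + scalar * c[i] once on heap arrays of runtime size n
// and returns the measured bandwidth in GB/s (1 GB = 1e9 bytes).
double time_triad_gbps(std::size_t n, double scalar) {
  assert(n > 0);
  // std::vector stores its elements on the heap; n is not a compile-time constant.
  std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);

  const auto t0 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + scalar * c[i];
  const auto t1 = std::chrono::steady_clock::now();

  const double seconds = std::chrono::duration<double>(t1 - t0).count();
  // Read a result back so the loop has an observable effect.
  assert(a[n - 1] == 1.0 + scalar * 2.0);
  return static_cast<double>(triad_bytes(n)) / seconds / 1.0e9;
}
```

A real benchmark would repeat the kernel many times and run it in parallel on the target device; a single serial pass is only meant to show how a bandwidth figure is derived from bytes moved and elapsed time.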
BabelStream also includes the nstream kernel from the Parallel Research Kernels (PRK) project, available on GitHub. Details about PRK can be found in the following references:
- Van der Wijngaart, Rob F., and Timothy G. Mattson. The parallel research kernels. IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2014.
- R. F. Van der Wijngaart, A. Kayi, J. R. Hammond, G. Jost, T. St. John, S. Sridharan, T. G. Mattson, J. Abercrombie, and J. Nelson. Comparing runtime systems with exascale ambitions using the Parallel Research Kernels. ISC 2016, DOI: 10.1007/978-3-319-41321-1_17.
- Jeff R. Hammond and Timothy G. Mattson. Evaluating data parallelism in C++ using the Parallel Research Kernels. IWOCL 2019, DOI: 10.1145/3318170.3318192.
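As a rough sketch, the nstream kernel updates one array in place from the other two, following the PRK form `a[i] += b[i] + scalar * c[i]`. The function name and the serial loop below are illustrative assumptions, not the BabelStream source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial version of nstream: a[i] += b[i] + scalar * c[i].
// Unlike triad, the array a is read as well as written.
void nstream_kernel(std::vector<double>& a, const std::vector<double>& b,
                    const std::vector<double>& c, double scalar) {
  assert(a.size() == b.size() && a.size() == c.size());
  for (std::size_t i = 0; i < a.size(); ++i) a[i] += b[i] + scalar * c[i];
}
```

Because `a` accumulates, repeated calls keep growing its values, which is worth bearing in mind when checking results after several iterations.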
Building
The drivers, compiler and software applicable to whichever implementation you would like to build are required.
CMake
The project supports building with CMake >= 3.13.0, which can be installed without root via the official script.
Each BabelStream implementation (programming model) is built as follows:
```shell
$ cd babelstream

# configure the build, build type defaults to Release
# The -DMODEL flag is required
$ cmake -Bbuild -H. -DMODEL=<model> <model specific flags prefixed with -D...>

# compile
$ cmake --build build

# run executables in ./build
$ ./build/<model>-stream
```
The MODEL option selects one implementation of BabelStream to build.
The source for each model's implementation is located in ./src/<model>.
Currently available models are:
omp;ocl;std;std20;hip;cuda;kokkos;sycl;sycl2020;acc;raja;tbb;thrust
Overriding default flags
By default, we have defined a set of optimal flags for known HPC compilers.
These are assigned to RELEASE_FLAGS, and you can override them if required.
To find out which flags each model supports or requires, simply configure while specifying only the model. For example:
```
> cd babelstream
> cmake -Bbuild -H. -DMODEL=ocl
...
-- Common Release flags are `-O3`, set RELEASE_FLAGS to override
-- CXX_EXTRA_FLAGS:
        Appends to common compile flags. These will be used at link phase at well.
        To use separate flags at link time, set `CXX_EXTRA_LINKER_FLAGS`
-- CXX_EXTRA_LINK_FLAGS:
        Appends to link flags which appear *before* the objects.
        Do not use this for linking libraries, as the link line is order-dependent
-- CXX_EXTRA_LIBRARIES:
        Append to link flags which appears *after* the objects.
        Use this for linking extra libraries (e.g `-lmylib`, or simply `mylib`)
-- CXX_EXTRA_LINKER_FLAGS:
        Append to linker flags (i.e GCC's `-Wl` or equivalent)
-- Available models: omp;ocl;std;std20;hip;cuda;kokkos;sycl;acc;raja;tbb
-- Selected model : ocl
-- Supported flags:

   CMAKE_CXX_COMPILER (optional, default=c++): Any CXX compiler that is supported by CMake detection
   OpenCL_LIBRARY (optional, default=): Path to OpenCL library, usually called libOpenCL.so
...
```
Alternatively, refer to the CI script, which test-compiles most of the models, and see which flags are used there.
It is recommended that you delete the build directory when you change any of the build flags.
GNU Make
Support for Make has been removed from 4.0 onwards. However, as the build process only involves a few source files, the required compile commands can be extracted from the CI output.
Results
Sample results can be found in the results subdirectory.
Newer results are found in our Performance Portability repository.
Contributing
As of v4.0, the main branch of this repository will hold the latest released version.
The develop branch will contain unreleased features due for the next (major and/or minor) release of BabelStream.
Pull Requests should be made against the develop branch.
Citing
Please cite BabelStream via this reference:
Deakin T, Price J, Martineau M, McIntosh-Smith S. Evaluating attainable memory bandwidth of parallel programming models via BabelStream. International Journal of Computational Science and Engineering. Special issue. Vol. 17, No. 3, pp. 247–262. 2018. DOI: 10.1504/IJCSE.2018.095847
Other BabelStream publications
- Deakin T, Price J, Martineau M, McIntosh-Smith S. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. 2016. Paper presented at P^3MA Workshop at ISC High Performance, Frankfurt, Germany. DOI: 10.1007/978-3-319-46079-6_34
- Deakin T, McIntosh-Smith S. GPU-STREAM: Benchmarking the achievable memory bandwidth of Graphics Processing Units. 2015. Poster session presented at IEEE/ACM SuperComputing, Austin, United States. You can view the Poster and Extended Abstract.
- Deakin T, Price J, Martineau M, McIntosh-Smith S. GPU-STREAM: Now in 2D!. 2016. Poster session presented at IEEE/ACM SuperComputing, Salt Lake City, United States. You can view the Poster and Extended Abstract.
- Raman K, Deakin T, Price J, McIntosh-Smith S. Improving achieved memory bandwidth from C++ codes on Intel Xeon Phi Processor (Knights Landing). IXPUG Spring Meeting, Cambridge, UK, 2017.
- Deakin T, Price J, McIntosh-Smith S. Portable methods for measuring cache hierarchy performance. 2017. Poster session presented at IEEE/ACM SuperComputing, Denver, United States. You can view the Poster and Extended Abstract.
[1]: McCalpin, John D., 1995: "Memory Bandwidth and Machine Balance in Current High Performance Computers", IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.