Each Stream implementation owns its own data, so the driver code should not allocate large arrays just before the implementation allocates its own. On processors with strong NUMA effects and small memory capacity per NUMA domain, the driver's checking vectors can cause the main arrays to be allocated in the wrong NUMA domain. The fix is to delay the driver allocation until after the computation has finished and the answers need checking. This commit only changes the driver; each model will be updated in subsequent commits. Fixes #80.
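The ordering described in this commit message can be sketched as below. This is a minimal illustration and not the project's driver: `ToyStream`, `read_arrays` and `run_and_check` are hypothetical names, the arrays are plain host vectors, and only two kernels are run. The point is that the checking vectors are created only after the kernels have finished.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical model class: it allocates and initialises its own arrays.
class ToyStream {
public:
  explicit ToyStream(std::size_t n) : a_(n, 0.1), b_(n, 0.2), c_(n, 0.0) {}

  // Copy kernel: c = a.
  void copy() {
    for (std::size_t i = 0; i < a_.size(); ++i) c_[i] = a_[i];
  }

  // Triad kernel: a = b + scalar * c.
  void triad(double scalar) {
    for (std::size_t i = 0; i < a_.size(); ++i) a_[i] = b_[i] + scalar * c_[i];
  }

  // Copy the results into caller-provided host vectors.
  void read_arrays(std::vector<double>& a, std::vector<double>& c) const {
    a = a_;
    c = c_;
  }

private:
  std::vector<double> a_, b_, c_;
};

// Hypothetical driver: run the kernels first, allocate checking storage last.
bool run_and_check(std::size_t n, unsigned iterations) {
  assert(iterations > 0);
  const double scalar = 0.4;

  ToyStream stream(n);  // the model's arrays are the first large allocation
  for (unsigned k = 0; k < iterations; ++k) {
    stream.copy();
    stream.triad(scalar);
  }

  // Only now allocate the large checking vectors.
  std::vector<double> a(n), c(n);
  stream.read_arrays(a, c);

  // Replay the same operations on scalars to obtain the expected values.
  double ea = 0.1, eb = 0.2, ec = 0.0;
  for (unsigned k = 0; k < iterations; ++k) {
    ec = ea;
    ea = eb + scalar * ec;
  }

  for (std::size_t i = 0; i < n; ++i) {
    if (std::fabs(a[i] - ea) > 1e-12 || std::fabs(c[i] - ec) > 1e-12) return false;
  }
  return true;
}
```

Calling `run_and_check(1 << 20, 100)` would exercise the sketch with a larger array. A real driver would time each kernel and cover all of the benchmark's kernels.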
Changelog
All notable changes to this project will be documented in this file.
[Unreleased]
Added
- New implementation using the C++ parallel STL.
- Compiler options for OpenMP and OpenACC GNU offloading to NVIDIA and AMD.
- Compiler options for Arm Clang added to OpenMP and Kokkos.
- Kokkos 3 build system (No code changes made).
- SYCL build rules for ComputeCpp, DPCPP and HipSYCL.
Changed
- Default branch renamed from master to main.
- Driver now delays allocating large checking vectors until after computation has finished.
- Use cl::sycl::id parameters instead of cl::sycl::item.
- Update local copy of OpenCL C++ header file.
- Ensure correct SYCL queue constructor with explicit async_handler.
- Use built in SYCL runtime device discovery.
- Cray compiler OpenMP flags updated.
- Reorder OpenCL objects in class so destructors are called in safe order.
Removed
- Pre-building of kernels in SYCL version to ensure compatibility with SYCL 1.2.1. Pre-building kernels is also not required: omitting it shows no overhead in the results, as the first iteration is not timed.
- OpenACC Cray compiler flags.
- Build support for Kokkos 2.x (No code changes made).
[v3.4] - 2019-04-10
Added
- OpenACC flags to build for Power 9, Volta, Skylake and KNL.
- Kokkos list CLI argument shows some information about which device will be used.
- OpenMP GNU compiler now uses native target flag.
- Support CSV output for Triad only running mode.
- NEC and PGI compiler option for OpenMP version.
- Option to calculate memory bandwidth in base 2 (MiB/s) rather than base 10 (MB/s).
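The base 2 versus base 10 option listed above only changes the divisor used when converting bytes per second into a reported figure. A small sketch of that arithmetic follows. The function names are hypothetical, and the byte count assumes the usual STREAM convention that Triad moves three arrays per iteration (two reads and one write).

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by one Triad iteration, assuming three arrays are streamed.
std::size_t triad_bytes(std::size_t elements, std::size_t element_size) {
  return 3 * elements * element_size;
}

// Convert bytes moved in `seconds` into MiB/s (base 2) or MB/s (base 10).
double bandwidth(std::size_t bytes, double seconds, bool base2) {
  assert(seconds > 0.0);
  const double unit = base2 ? 1024.0 * 1024.0 : 1.0e6;
  return static_cast<double>(bytes) / unit / seconds;
}
```

For the same measurement the base 2 figure is about 4.6% smaller than the base 10 figure, since 1 MiB is 1,048,576 bytes.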
Changed
- Update SYCL implementation to SYCL 1.2.1 interface.
- Output formatting of Kokkos implementation.
- Capitalisation of Kokkos filenames.
- Updated HIP implementation to new interface.
- Use parallel loop instead of kernels for OpenACC.
- OpenMP build for XL compiler uses -qarch=auto.
Removed
- Superfluous OpenMP 4.5 map(to:) clauses on kernel target regions.
- Kokkos namespace not used by default so the API is easier to spot.
- Manual specification of Kokkos layout (DEVICE) as the Kokkos library sets this by default.
Fixed
- Kokkos now compiles and links separately to fix compilation with Kokkos 2.05.00.
- Kokkos can now instantiate single and double precision.
- OpenMP 4.5 map and reduction clause order to ensure reduction result copied back.
- Potential race condition in SYCL code between unloading OpenCL library and device list destructor.
[v3.3] - 2017-12-04
Added
- Add runtime option to run just the Triad kernel.
- Add runtime option for CSV output of results.
- ROCm HC implementation added for AMD GPUs.
Changed
- Renamed project to BabelStream (from GPU-STREAM).
- Update SYCL Makefile to use ComputeCpp path variables.
- SYCL exceptions are now fatal, and are propagated to a runtime exception.
[v3.2] - 2017-04-06
Added
- Build instructions for RAJA and Kokkos libraries.
Changed
- Use RAJA and Kokkos internal iterator types instead of int.
- Ensure RAJA pointers do not alias.
- Align memory to 2MB pages in RAJA and OpenMP.
- Updated Intel compiler flags for OpenMP, Kokkos and RAJA to ensure streaming stores.
- CUDA Makefile now uses variables to set compiler and flags.
- Use static shared memory for dot kernel in CUDA and HIP.
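The 2MB alignment entry in the list above can be illustrated with standard C++17. This is a sketch under assumptions, not the code used by the RAJA or OpenMP versions: `round_up` and `alloc_aligned` are hypothetical helpers, and whether the operating system actually backs the region with large pages depends on its configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// 2 MiB, matching a common large page size.
constexpr std::size_t kAlign = 2 * 1024 * 1024;

// Round `bytes` up to the next multiple of `align` (at least one multiple).
std::size_t round_up(std::size_t bytes, std::size_t align) {
  assert(align > 0);
  if (bytes == 0) return align;
  return (bytes + align - 1) / align * align;
}

// Allocate `n` doubles starting on a 2 MiB boundary.
// The size is padded to a multiple of the alignment before the request is made.
// Release the memory with std::free.
double* alloc_aligned(std::size_t n) {
  const std::size_t bytes = round_up(n * sizeof(double), kAlign);
  return static_cast<double*>(std::aligned_alloc(kAlign, bytes));
}
```

The padding wastes at most one alignment unit per array, which is small next to the benchmark's array sizes.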
Fixed
- Fix initialisation of b array bug in Kokkos implementation.
[v3.1] - 2017-02-25
Added
- Dot kernel HIP implementation.
Changed
- Build system overhauled from CMake to a series of Makefiles.
Deprecated
- Android build instructions.
[v3.0] - 2017-01-30
Added
- New Dot kernel added to the 4 standard kernels.
Changed
- All model implementations now initialise and allocate their own arrays rather than copying from a master copy. This allows for better performance on NUMA architectures.
- Version string definition moved from header to main file.
- Combined OpenMP 3 and 4.5 implementations.
- OpenMP 4.5 target implementation uses alloc instead of to.
- Made SYCL indexing consistent.
- Update SYCL CMake build to use ComputeCpp CE 0.1.1.
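The NUMA benefit mentioned in the first entry of this list comes from first-touch placement: on systems with that policy (the Linux default), a page is placed near the thread that first writes to it. The sketch below shows the idea with OpenMP, assuming the code is built with OpenMP enabled; without it the pragmas are ignored and the loops run serially. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Allocate without writing, then initialise in parallel so that each thread
// is the first to touch the part of the array it will later work on.
std::unique_ptr<double[]> alloc_first_touch(std::size_t n, double value) {
  // new double[n] leaves the elements uninitialised, so their pages are
  // typically not touched until the loop below writes to them.
  std::unique_ptr<double[]> data(new double[n]);
  const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);
  #pragma omp parallel for schedule(static)
  for (std::ptrdiff_t i = 0; i < count; ++i) data[i] = value;
  return data;
}

// Later kernels use the same static schedule so that each thread mostly
// reads the pages it touched first.
double sum_same_schedule(const double* data, std::size_t n) {
  assert(data != nullptr);
  const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);
  double sum = 0.0;
  #pragma omp parallel for schedule(static) reduction(+ : sum)
  for (std::ptrdiff_t i = 0; i < count; ++i) sum += data[i];
  return sum;
}
```

Copying from a master array initialised by a single thread would instead place every page near that one thread, which is the situation the entry above avoids.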
Fixed
- OpenMP destructor now frees GPU memory only on GPU builds.
- SYCL template specializations for float and double.
[v2.1] - 2016-10-21
Added
- New HIP version added.
- Results for v2.0 added.
- Output of OpenCL kernel build log on failure.
Changed
- Use globally defined scalar value.
- Change scalar value to stop overflow.
- Restructure results directory.
- Change SYCL default work-group size.
- CMake defaults to Release build.
Fixed
- CUDA device name output string corrected.
- Out of tree builds.
[v2.0] - 2016-06-30
Added
- Implementations in OpenMP 4.5, OpenACC, RAJA, Kokkos and SYCL.
- Copyright headers to source files.
- Runtime option variables are printed out.
- Device selection added to OpenCL and CUDA.
Changed
- Major refactor to include multiple programming models. The change now uses C++ for driver code, with different models plugged in as classes which implement the STREAM kernels.
- Starting values in the arrays to reduce floating point errors with high iteration counts.
- Default array size now 2^25.
- Default to 100 iterations instead of 10.
- CUDA thread block size set via define rather than hardcoded value.
- Require CUDA 7 for C++11 support.
- OpenCL C++ header updated.
Fixed
- Various CMake build fixes.
- Require at least 2 iterations.
Removed
- Warning message for single precision iterations.
[v1.1] - 2016-05-09
Added
- HIP implementation and results.
- Titan X and Fury X results.
- Output of array sizes and other information at runtime.
- Ability to set CUDA block sizes on command line.
- Android build instructions.
Changed
- Update OpenCL C++ header.
- Requires CUDA 6.5 or above.
- OpenCL uses Kernel Functor APIs instead of make_kernel API.
Fixed
- Unsigned integer warnings.
[v0.9] - 2015-08-07
Initial public release of OpenCL and CUDA GPU-STREAM.