java-stream =========== This is an implementation of BabelStream in Java 8 which contains the following implementations: * `jdk-plain` - Single threaded `for` * `jdk-stream` - Threaded implementation using JDK8's parallel stream API * `tornadovm` - A [TornadoVM](https://github.com/beehive-lab/TornadoVM) implementation for PTX/OpenCL * `aparapi` - A [Aparapi](https://git.qoto.org/aparapi/aparapi) implementation for OpenCL ### Build & Run Prerequisites * JDK >= 8 To run the benchmark, first create a binary: ```shell > cd java-stream > ./mvnw clean package ``` The binary will be located at `./target/java-stream.jar`. Run it with: ```shell > java -version  ✔  11.0.11+9 ☕  tom@soraws-uk  05:03:20 openjdk version "11.0.11" 2021-04-20 OpenJDK Runtime Environment GraalVM CE 21.1.0 (build 11.0.11+8-jvmci-21.1-b05) OpenJDK 64-Bit Server VM GraalVM CE 21.1.0 (build 11.0.11+8-jvmci-21.1-b05, mixed mode) > java -jar target/java-stream.jar --help ``` For best results, benchmark with the following JVM flags: ``` -XX:-UseOnStackReplacement # disable OSR, not useful for this benchmark as we are measuring peak performance -XX:-TieredCompilation # disable C1, go straight to C2 -XX:ReservedCodeCacheSize=512m # don't flush compiled code out of cache at any point ``` Worked example: ```shell > java -XX:-UseOnStackReplacement -XX:-TieredCompilation -XX:ReservedCodeCacheSize=512m -jar target/java-stream.jar BabelStream Version: 3.4 Implementation: jdk-stream; (Java 11.0.11;Red Hat, Inc.; home=/usr/lib/jvm/java-11-openjdk-11.0.11.0.9-4.fc33.x86_64) Running all 100 times Precision: double Array size: 268.4 MB (=0.3 GB) Total size: 805.3 MB (=0.8 GB) Function MBytes/sec Min (sec) Max Average Copy 17145.538 0.03131 0.04779 0.03413 Mul 16759.092 0.03203 0.04752 0.03579 Add 19431.954 0.04144 0.05866 0.04503 Triad 19763.970 0.04075 0.05388 0.04510 Dot 26646.894 0.02015 0.03013 0.02259 ``` If your OpenCL/CUDA installation is not at the default location, TornadoVM and Aparapi may fail to detect your devices. In those cases, you may specify the library directly, for example: ```shell > LD_PRELOAD=/opt/rocm-4.0.0/opencl/lib/libOpenCL.so.1.2 java -jar target/java-stream.jar ... ``` ### Instructions for TornadoVM The TornadoVM implementation requires you to run the binary with a patched JVM. Follow the official [instructions](https://github.com/beehive-lab/TornadoVM/blob/master/assembly/src/docs/10_INSTALL_WITH_GRAALVM.md) or use the following simplified instructions: Prerequisites * CMake >= 3.6 * GCC or clang/LLVM (GCC >= 5.5) * Python >= 2.7 * Maven >= 3.6.3 * OpenCL headers >= 1.2 and/or CUDA SDK >= 9.0 First, get a copy of the TornadoVM source: ```shell > cd > git clone https://github.com/beehive-lab/TornadoVM tornadovm ``` Take note of the required GraalVM version in `tornadovm/assembly/src/docs/10_INSTALL_WITH_GRAALVM.md`. We'll use `21.1.0` in this example. Now, obtain a copy of GraalVM and make sure the version matches the one required by TornadoVM: ```shell > wget https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-21.1.0/graalvm-ce-java11-linux-amd64-21.1.0.tar.gz > tar -xf graalvm-ce-java11-linux-amd64-21.1.0.tar.gz ``` Next, create `~/tornadovm/etc/sources.env` and populate the file with the following: ```shell #!/bin/bash export JAVA_HOME= export PATH=$PWD/bin/bin:$PATH export TORNADO_SDK=$PWD/bin/sdk export CMAKE_ROOT=/usr # path to CMake binary ``` Proceed to compile TornadoVM: ```shell > cd ~/tornadovm > . etc/sources.env > make graal-jdk-11-plus BACKEND={ptx,opencl} ``` To test your build, source the environment file: ```shell > source ~/tornadovm/etc/sources.env > LD_PRELOAD=/opt/rocm-4.0.0/opencl/lib/libOpenCL.so.1.2 tornado --devices Number of Tornado drivers: 1 Total number of OpenCL devices : 3 Tornado device=0:0 AMD Accelerated Parallel Processing -- gfx1012 Global Memory Size: 4.0 GB Local Memory Size: 64.0 KB Workgroup Dimensions: 3 Max WorkGroup Configuration: [1024, 1024, 1024] Device OpenCL C version: OpenCL C 2.0 Tornado device=0:1 Portable Computing Language -- pthread-AMD Ryzen 9 3900X 12-Core Processor Global Memory Size: 60.7 GB Local Memory Size: 8.0 MB Workgroup Dimensions: 3 Max WorkGroup Configuration: [4096, 4096, 4096] Device OpenCL C version: OpenCL C 1.2 pocl Tornado device=0:2 NVIDIA CUDA -- NVIDIA GeForce GT 710 Global Memory Size: 981.3 MB Local Memory Size: 48.0 KB Workgroup Dimensions: 3 Max WorkGroup Configuration: [1024, 1024, 64] Device OpenCL C version: OpenCL C 1.2 ``` You can now use TornadoVM to run java-stream: ```shell > tornado -jar ~/java-stream/target/java-stream.jar --impl tornadovm --arraysize 65536  1 ✘  11.0.11+9 ☕  tom@soraws-uk  05:31:34 BabelStream Version: 3.4 Implementation: tornadovm; (Java 11.0.11;GraalVM Community; home=~/graalvm-ce-java11-21.1.0) Running all 100 times Precision: double Array size: 0.5 MB (=0.0 GB) Total size: 1.6 MB (=0.0 GB) Using TornadoVM device: - Name : NVIDIA GeForce GT 710 CL_DEVICE_TYPE_GPU (available) - Id : opencl-0-0 - Platform : NVIDIA CUDA - Backend : OpenCL Function MBytes/sec Min (sec) Max Average Copy 8791.100 0.00012 0.00079 0.00015 Mul 8774.107 0.00012 0.00061 0.00014 Add 9903.313 0.00016 0.00030 0.00018 Triad 9861.031 0.00016 0.00030 0.00018 Dot 2799.465 0.00037 0.00056 0.00041 ```