Estimating value of pi (π) using Monte Carlo Simulation and Vector API
Java’s vector API takes advantage of certain CPU architectures to parallelize calculations which improves performance multifold. This article briefly discusses the API and the takes a practical approach to show how the performance gain is achieved with an example.
SIMD
SIMD stands for Single Instruction, Multiple Data. It enables a CPU to apply the same instruction to multiple data elements in parallel. Modern CPUs support SIMD instruction sets. These instructions operate on wide CPU registers that can hold multiple data elements at once. When a SIMD instruction is executed, it processes all elements packed in those registers simultaneously. This lets the CPU perform multiple computations per instruction, improving throughput. The Vector API in Java leverages these hardware SIMD capabilities and allows the JIT compiler to generate appropriate vector instructions for the CPU.
Java has supported auto‑vectorization for many years. This means that, when certain conditions are met, the JIT compiler can transform suitable loops into SIMD (vector) instructions, giving performance improvements. The Vector API provides an explicit programming model for vector operations. It expresses vector semantics clearly enough for the JIT compiler to reliably generate optimal SIMD instructions. This allows vectorization to be applied more consistently and gives developers more control, enabling better use of the CPU’s hardware vector capabilities across different architectures.
This JEP describes the Vector API in detail: https://openjdk.org/jeps/529
Monte Carlo simulation for π estimation
When I was thinking about an example for showing Vector API, I wanted an example which is interesting, easy to understand and one where there’s lots of calculation. So “Monte Carlo simulation for estimating value of pi (π)” seemed like a perfect choice.
How does it work?
Look at the picture below. Here we have a square of size 2 and there’s a circle, with radius 1, fitting perfectly in it. Here the area of square is 4 and area of circle is π. (π * r * r where r = 1).
Now, if we put large number of evenly distributed random points inside this square, the ratio of number of points inside the circle to number of points inside the square will be equal to the ratio of area of circle to the area of square. That is,
[ \frac{N_{\text{circle}}}{N_{\text{square}}} = \frac{A_{\text{circle}}}{A_{\text{square}}} = \frac{\pi}{4} ]
where
NcircleN_{\text{circle}}Ncircle = number of random points falling inside the circle NsquareN_{\text{square}}Nsquare = total number of random points inside the enclosing square AcircleA_{\text{circle}}Acircle = area of the circle AsquareA_{\text{square}}Asquare = area of the square
We can even take a step further, by taking a quadrant like this:
Here, the ratio still remains π/4, but it’s easier to calculate because x and y both range from 0 to 1 and any point is inside the circle if ( x^2 + y^2 \le 1 ).
Simple/ scalar implementation
Here’s a simple implementation that calculates pi.
public class PiEstimator {
public static double getPi(int samples, long seed) {
var random = new SplittableRandom(seed);
var insideCircle = 0L;
for (long i = 0; i < samples; i++) {
var x = random.nextDouble();
var y = random.nextDouble();
//inside if x*x + y*y <= 1
if (x * x + y * y <= 1.0) {
insideCircle++;
}
}
return 4.0D * insideCircle / samples;
}
static void main() {
var actual = Math.PI;
var estimated = getPi(100_000_000, 1L);
var errPercent = Math.abs(actual - estimated) / 100;
System.out.printf("Actual value: %.10f%n", actual);
System.out.printf("Estimated value: %.10f%n", estimated);
System.out.printf("Error percentage: %.10f%n", errPercent);
}
}Note: All source code for this post is present here: https://github.com/balkrishnarawool/vector-pi-estimator.
And the output is
Actual value: 3,1415926536
Estimated value: 3,1416522000
Error percentage: 0,0000005955Note: Note that I am using Dutch locale, so the decimal separator is ‘,’ instead of ‘.’.
As we can see we get pretty good approximation of π with 10 million points.
Vector implementation
Here’s implementation using Vector API
public class VectorPiEstimator {
private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;
public static double getPi(int samples, long seed) {
var random = new SplittableRandom(seed);
var insideCircle = 0L;
var vectorLength = SPECIES.length();
var vectorLoops = samples / vectorLength;
var remainder = samples % vectorLength;
// Pre-allocate arrays for better performance
double[] xs = new double[vectorLength];
double[] ys = new double[vectorLength];
// Handle vectors
for (int i = 0; i < vectorLoops; i++) {
for (int j = 0; j < vectorLength; j++) {
xs[j] = random.nextDouble();
ys[j] = random.nextDouble();
}
var xVec = DoubleVector.fromArray(SPECIES, xs, 0);
var yVec = DoubleVector.fromArray(SPECIES, ys, 0);
// x * x + y * y
var distSq = xVec.fma(xVec, yVec.mul(yVec));
var insideMask = distSq.compare(VectorOperators.LE, 1.0);
insideCircle += insideMask.trueCount();
}
// Handle final scalar remainder
int offset = vectorLoops * vectorLength;
for (int i = 0; i < remainder; i++) {
var x = xs[offset + i];
var y = ys[offset + i];
if (x * x + y * y <= 1.0) {
insideCircle++;
}
}
return 4.0D * insideCircle / samples;
}
static void main() {
var actual = Math.PI;
var estimated = getPi(100_000_000, 1L);
var errPercent = Math.abs(actual - estimated) / 100;
System.out.printf("Actual value: %.10f%n", actual);
System.out.printf("Estimated value: %.10f%n", estimated);
System.out.printf("Error percentage: %.10f%n", errPercent);
}
}And the output is
Actual value: 3,1415926536
Estimated value: 3,1416522000
Error percentage: 0,0000005955Also, here we get good approximation of pi. In fact, we can observe that for same seed and number of samples, we get same value of pi in vector implementation as in scalar implementation.
Performance gain
Let’s check the performance gain. For that let’s write a simple Benchmark class:
public class Benchmark {
private static final int SAMPLES = 100_000_000; // 100 million samples
private static final int WARMUP_RUNS = 3;
private static final int BENCHMARK_RUNS = 5;
static void main() {
var samples = SAMPLES;
System.out.printf("Benchmark for Monte Carlo Simulation for Pi Estimation using Vector API%n%n");
System.out.printf("Samples per run: %,d%n%n", samples);
// Use fixed seed for reproducibility
var seed = 1L;
// Warmup phase
System.out.printf("Warming up JIT compiler (%s runs)...%n", WARMUP_RUNS);
for (int i = 0; i < WARMUP_RUNS; i++) {
PiEstimator.getPi(samples, seed);
VectorPiEstimator.getPi(samples, seed);
}
System.out.println("Warmup complete.\n");
// Benchmark scalar implementation
System.out.printf("Running SCALAR implementation (%s runs)...%n", BENCHMARK_RUNS);
System.out.printf("Actual Pi: %.10f%n", Math.PI);
var scalarTotalTime = 0D;
for (int i = 0; i < BENCHMARK_RUNS; i++) {
var startTime = System.nanoTime();
var scalarPi = PiEstimator.getPi(samples, seed);
var endTime = System.nanoTime();
var duration = endTime - startTime;
scalarTotalTime += duration;
System.out.printf(" Run %d: %,d ms%n", i+1, duration / 1_000_000);
System.out.printf(" Estimated Pi: %.10f%n", scalarPi);
}
var scalarAvgTime = scalarTotalTime / BENCHMARK_RUNS;
System.out.printf(" Average: %.2f ms%n%n", scalarAvgTime / 1_000_000);
// Benchmark vector implementation
System.out.printf("Running VECTOR implementation (%s runs)...%n", BENCHMARK_RUNS);
System.out.printf("Vector Length: %d%n", DoubleVector.SPECIES_PREFERRED.length());
System.out.printf("Actual Pi: %.10f%n", Math.PI);
var vectorTotalTime = 0D;
for (int i = 0; i < BENCHMARK_RUNS; i++) {
var startTime = System.nanoTime();
var vectorPi = VectorPiEstimator.getPi(samples, seed);
var endTime = System.nanoTime();
var duration = endTime - startTime;
vectorTotalTime += duration;
System.out.printf(" Run %d: %,d ms%n", i + 1, duration / 1_000_000);
System.out.printf(" Estimated Pi: %.10f%n", vectorPi);
}
var vectorAvgTime = vectorTotalTime / BENCHMARK_RUNS;
System.out.printf(" Average: %.2f ms%n%n", vectorAvgTime / 1_000_000);
// Results comparison
System.out.println("RESULTS:");
var vectorSpeedup = scalarAvgTime / vectorAvgTime;
var vectorImprovement = ((scalarAvgTime - vectorAvgTime) / scalarAvgTime) * 100;
System.out.printf("Scalar time: %.2f ms%n", scalarAvgTime / 1_000_000);
System.out.printf("Vector time: %.2f ms%n%n", vectorAvgTime / 1_000_000);
System.out.printf("Vector Speedup: %.2fx (%.1f%% reduction)%n%n", vectorSpeedup, vectorImprovement);
}
}The output is
Benchmark for Monte Carlo Simulation for Pi Estimation using Vector API
Samples per run: 100.000.000
Warming up JIT compiler (3 runs)...
Warmup complete.
Running SCALAR implementation (5 runs)...
Actual Pi: 3,1415926536
Run 1: 201 ms
Estimated Pi: 3,1416522000
Run 2: 201 ms
Estimated Pi: 3,1416522000
Run 3: 201 ms
Estimated Pi: 3,1416522000
Run 4: 201 ms
Estimated Pi: 3,1416522000
Run 5: 201 ms
Estimated Pi: 3,1416522000
Average: 201,85 ms
Running VECTOR implementation (5 runs)...
Vector Length: 2
Actual Pi: 3,1415926536
Run 1: 185 ms
Estimated Pi: 3,1416522000
Run 2: 187 ms
Estimated Pi: 3,1416522000
Run 3: 185 ms
Estimated Pi: 3,1416522000
Run 4: 186 ms
Estimated Pi: 3,1416522000
Run 5: 185 ms
Estimated Pi: 3,1416522000
Average: 186,00 ms
RESULTS:
Scalar time: 201,85 ms
Vector time: 186,00 ms
Vector Speedup: 1,09x (7,8% reduction)There is performance improvement. But it seems pretty low. It looks like our program is spending a lot of time creating the random points. Let’s pre-calculate the points and see how much performance gain we get.
Estimation with pre-calculated random points
Scalar implementation with pre-calculated points:
public class PiEstimatorPrecomputed {
public static double getPi(double[] xs, double[] ys) {
var insideCircle = 0;
for (int i = 0; i < xs.length; i++) {
var x = xs[i];
var y = ys[i];
//inside if x*x + y*y <= 1
if (x * x + y * y <= 1.0) {
insideCircle++;
}
}
return 4.0D * insideCircle / xs.length;
}
}Vector implementation with pre-calculated points:
public class VectorPiEstimatorPrecomputed {
private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;
public static double getPi(double[] xs, double[] ys) {
var samples = xs.length;
var insideCircle = 0L;
var vectorLength = SPECIES.length();
var vectorLoops = samples / vectorLength;
var remainder = samples % vectorLength;
// Handle vectors
for (int i = 0; i < vectorLoops; i++) {
var offset = i * vectorLength;
var xVec = DoubleVector.fromArray(SPECIES, xs, offset);
var yVec = DoubleVector.fromArray(SPECIES, ys, offset);
// x * x + y * y
var distSq = xVec.fma(xVec, yVec.mul(yVec));
var insideMask = distSq.compare(VectorOperators.LE, 1.0);
insideCircle += insideMask.trueCount();
}
// Handle final scalar remainder
int offset = vectorLoops * vectorLength;
for (int i = 0; i < remainder; i++) {
var x = xs[offset + i];
var y = ys[offset + i];
if (x * x + y * y <= 1.0) {
insideCircle++;
}
}
return 4.0D * insideCircle / samples;
}
}And here’s how Benchmark would look like:
public class BenchmarkPrecomputed {
private static final int SAMPLES = 100_000_000; // 100 million samples
private static final int WARMUP_RUNS = 3;
private static final int BENCHMARK_RUNS = 5;
static void main() {
// Use fixed seed for reproducibility
var seed = 1L;
var samples = SAMPLES;
var xs = new double[samples];
var ys = new double[samples];
var random = new SplittableRandom(seed);
for (int i = 0; i < samples; i++) {
xs[i] = random.nextDouble();
ys[i] = random.nextDouble();
}
System.out.println("Benchmark for Monte Carlo Simulation for Pi Estimation using Vector API with precomputed random points");
System.out.println();
System.out.println("Samples per run: " + String.format("%,d", samples));
System.out.println();
// Warmup phase
System.out.println("Warming up JIT compiler (" + WARMUP_RUNS + " runs)...");
for (int i = 0; i < WARMUP_RUNS; i++) {
PiEstimatorPrecomputed.getPi(xs, ys);
VectorPiEstimatorPrecomputed.getPi(xs, ys);
}
System.out.println("Warmup complete.\n");
// Benchmark scalar implementation
System.out.println("Running SCALAR implementation (" + BENCHMARK_RUNS + " runs)...");
System.out.printf("Actual Pi: %.10f%n", Math.PI);
var scalarTotalTime = 0D;
for (int i = 0; i < BENCHMARK_RUNS; i++) {
var startTime = System.nanoTime();
var scalarPi = PiEstimatorPrecomputed.getPi(xs, ys);
var endTime = System.nanoTime();
var duration = endTime - startTime;
scalarTotalTime += duration;
System.out.printf(" Run %d: %,d ms%n", i+1, duration / 1_000_000);
System.out.printf(" Estimated Pi: %.10f%n", scalarPi);
}
var scalarAvgTime = scalarTotalTime / BENCHMARK_RUNS;
System.out.printf(" Average: %.2f ms%n", scalarAvgTime / 1_000_000);
System.out.println();
// Benchmark vector implementation
System.out.println("Running VECTOR implementation (" + BENCHMARK_RUNS + " runs)...");
System.out.printf("Vector Length: %d%n", DoubleVector.SPECIES_PREFERRED.length());
System.out.printf("Actual Pi: %.10f%n", Math.PI);
var vectorTotalTime = 0D;
for (int i = 0; i < BENCHMARK_RUNS; i++) {
var startTime = System.nanoTime();
var vectorPi = VectorPiEstimatorPrecomputed.getPi(xs, ys);
var endTime = System.nanoTime();
var duration = endTime - startTime;
vectorTotalTime += duration;
System.out.printf(" Run %d: %,d ms%n", i + 1, duration / 1_000_000);
System.out.printf(" Estimated Pi: %.10f%n", vectorPi);
}
var vectorAvgTime = vectorTotalTime / BENCHMARK_RUNS;
System.out.printf(" Average: %.2f ms%n", vectorAvgTime / 1_000_000);
System.out.println();
// Results comparison
System.out.println("RESULTS:");
var vectorSpeedup = scalarAvgTime / vectorAvgTime;
var vectorImprovement = ((scalarAvgTime - vectorAvgTime) / scalarAvgTime) * 100;
System.out.printf("Scalar time: %.2f ms%n", scalarAvgTime / 1_000_000);
System.out.printf("Vector time: %.2f ms%n", vectorAvgTime / 1_000_000);
System.out.println();
System.out.printf("Vector Speedup: %.2fx (%.1f%% reduction)%n", vectorSpeedup, vectorImprovement);
System.out.println();
}
}And the output of the benchmark is:
Benchmark for Monte Carlo Simulation for Pi Estimation using Vector API with precomputed random points
Samples per run: 100.000.000
Warming up JIT compiler (3 runs)...
Warmup complete.
Running SCALAR implementation (5 runs)...
Actual Pi: 3,1415926536
Run 1: 80 ms
Estimated Pi: 3,1416522000
Run 2: 64 ms
Estimated Pi: 3,1416522000
Run 3: 65 ms
Estimated Pi: 3,1416522000
Run 4: 70 ms
Estimated Pi: 3,1416522000
Run 5: 65 ms
Estimated Pi: 3,1416522000
Average: 69,35 ms
Running VECTOR implementation (5 runs)...
Vector Length: 2
Actual Pi: 3,1415926536
Run 1: 38 ms
Estimated Pi: 3,1416522000
Run 2: 38 ms
Estimated Pi: 3,1416522000
Run 3: 40 ms
Estimated Pi: 3,1416522000
Run 4: 41 ms
Estimated Pi: 3,1416522000
Run 5: 38 ms
Estimated Pi: 3,1416522000
Average: 39,62 ms
RESULTS:
Scalar time: 69,35 ms
Vector time: 39,62 ms
Vector Speedup: 1,75x (42,9% reduction)The gain this time is much more. This is exactly what we expected.
If you look at the Vector Length, it is 2. (I am running this on MacBook Pro with Apple M1 Pro chip.)
This means with Vector API, the CPU is now doing 2 calculations at the same time. This should theoretically give us 2x performance gain.
What we get is 1.75x. So wat we see is the practical gain is very close to the theoretical value we expected.
Conclusion
SIMD instructions and wide CPU registers enable modern CPUs to perform same instructions on multiple data-elements at the same time. Java’s Vector API provides a way to express vector computations that the JIT compiler can reliably compile into optimized SIMD instructions. This provides higher performance (more generally than the traditional auto-vectorization did). The example discussed above helps us see the performance gain clearly.
Note: All source code for this post is present here: https://github.com/balkrishnarawool/vector-pi-estimator.