Estimating value of pi (π) using Monte Carlo Simulation and Vector API

Java’s vector API takes advantage of certain CPU architectures to parallelize calculations which improves performance multifold. This article briefly discusses the API and then takes a practical approach to show how the performance gain is achieved with an example.

SIMD and Vector API

SIMD stands for Single Instruction, Multiple Data. It enables a CPU to apply the same instruction to multiple data elements in parallel. Modern CPUs support SIMD instruction sets. These instructions operate on wide CPU registers that can hold multiple data elements at once. When a SIMD instruction is executed, it processes all elements packed in those registers simultaneously. This lets the CPU perform multiple computations per instruction, improving throughput. The Vector API in Java leverages these hardware SIMD capabilities and allows the JIT compiler to generate appropriate SIMD instructions for the CPU.

The multiple data elements present in the CPU registers are called a vector (of data). So a register holds a vector. The instructions that operate on a vector are called SIMD instructions (or even vector instructions). The Vector API provides an API that indicates JIT compiler to generate appropriate SIMD instructions, depending on the specific hardware being used.

Java has supported auto‑vectorization for many years. This means that, when certain conditions are met, the JIT compiler can transform suitable loops into SIMD (vector) instructions, giving performance improvements. The Vector API provides an explicit programming model for vector operations. It expresses vector semantics clearly enough for the JIT compiler to reliably generate optimal SIMD instructions. This allows vectorization to be applied more consistently and gives developers more control, enabling better use of the CPU’s hardware vector capabilities across different architectures.

This JEP describes the Vector API in detail: https://openjdk.org/jeps/529

Monte Carlo simulation for π estimation

When I was thinking about an example for showing Vector API, I wanted an example which is interesting, easy to understand and one where there’s lots of calculation. So “Monte Carlo simulation for estimating value of pi (π)” seemed like a perfect choice.

How does it work?

Look at the picture below. Here we have a square of size 2 and there’s a circle, with radius 1, fitting perfectly in it. Here the area of square is 4 and area of circle is π. ( It is $π * r * r\ where\ r = 1 $).

Now, if we put large number of evenly distributed random points inside this square, the ratio of number of points inside the circle to number of points inside the square will be equal to the ratio of area of circle to the area of square. That is,

\[ \frac{N_{\text{circle}}}{N_{\text{square}}} = \frac{A_{\text{circle}}}{A_{\text{square}}} = \frac{\pi}{4} \]

$where$
$N_{\text{circle}} = Total\ number\ of\ random\ points\ inside\ the\ circle$
$N_{\text{square}} = Total\ number\ of\ random\ points\ inside\ the\ square$
$A_{\text{circle}} = Area\ of\ the\ circle$
$A_{\text{square}} = Area\ of\ the\ square$

We can even take a step further, by taking a quadrant like this:

Here, the ratio still remains π/4, but it’s easier to calculate the areas as x and y both range from 0 to 1 and a point is inside the circle if $x^2 + y^2 <= 1 $.

Simple/ scalar implementation

Here’s a simple implementation that calculates pi.

public class PiEstimator {
    public static double getPi(int samples, long seed) {
        var random = new SplittableRandom(seed);
        var insideCircle = 0L;

        for (long i = 0; i < samples; i++) {
            var x = random.nextDouble();
            var y = random.nextDouble();

            //inside if x*x + y*y <= 1
            if (x * x + y * y <= 1.0) {
                insideCircle++;
            }
        }

        return 4.0D * insideCircle / samples;
    }

    static void main() {
        var actual = Math.PI;
        var estimated = getPi(100_000_000, 1L);
        var errPercent = Math.abs(actual - estimated) / 100;
        System.out.printf("Actual value: %.10f%n", actual);
        System.out.printf("Estimated value: %.10f%n", estimated);
        System.out.printf("Error percentage: %.10f%n", errPercent);
    }
}

Note: All source code for this post is present here: https://github.com/balkrishnarawool/vector-pi-estimator.

And the output is

Actual value: 3,1415926536
Estimated value: 3,1416522000
Error percentage: 0,0000005955

Note: Note that I am using Dutch locale, so the decimal separator is ‘,’ instead of ‘.’.

As we can see we get pretty good approximation of π with 100 million points.

Vector implementation

Here’s implementation using Vector API. Here, we first get the vectorLength (the number of data elements that can be put in a vector). Then process that many number of points as a vector. It is easier to forma vector from an array and we reuse the array objects for efficiency.

public class VectorPiEstimator {
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    public static double getPi(int samples, long seed) {
        var random = new SplittableRandom(seed);
        var insideCircle = 0L;
        
        var vectorLength = SPECIES.length();
        var vectorLoops = samples / vectorLength;
        var remainder = samples % vectorLength;

        // Pre-allocate arrays for better performance
        double[] xs = new double[vectorLength];
        double[] ys = new double[vectorLength];

        // Handle vectors
        for (int i = 0; i < vectorLoops; i++) {
            for (int j = 0; j < vectorLength; j++) {
                xs[j] = random.nextDouble();
                ys[j] = random.nextDouble();
            }
            var xVec = DoubleVector.fromArray(SPECIES, xs, 0);
            var yVec = DoubleVector.fromArray(SPECIES, ys, 0);
            // x * x + y * y
            var distSq = xVec.fma(xVec, yVec.mul(yVec));
            var insideMask = distSq.compare(VectorOperators.LE, 1.0);
            insideCircle += insideMask.trueCount();
        }

        // Handle final scalar remainder
        int offset = vectorLoops * vectorLength;
        for (int i = 0; i < remainder; i++) {
            var x = xs[offset + i];
            var y = ys[offset + i];
            if (x * x + y * y <= 1.0) {
                insideCircle++;
            }
        }
        
        return 4.0D * insideCircle / samples;
    }

    static void main() {
        var actual = Math.PI;
        var estimated = getPi(100_000_000, 1L);
        var errPercent = Math.abs(actual - estimated) / 100;
        System.out.printf("Actual value: %.10f%n", actual);
        System.out.printf("Estimated value: %.10f%n", estimated);
        System.out.printf("Error percentage: %.10f%n", errPercent);
    }
}

And the output is

Actual value: 3,1415926536
Estimated value: 3,1416522000
Error percentage: 0,0000005955

Also, here we get good approximation of pi. In fact, we can observe that for same seed and number of samples, we get same value of pi in vector implementation as in scalar implementation.

Performance gain

Let’s check the performance gain. For that let’s write a simple Benchmark class. This does 3 warmup runs and 5 benchmark runs for both the implementations and calculates average times for both and compares them.

public class Benchmark {
    private static final int SAMPLES = 100_000_000; // 100 million samples
    private static final int WARMUP_RUNS = 3;
    private static final int BENCHMARK_RUNS = 5;
    
    static void main() {
       var samples = SAMPLES;

       System.out.printf("Monte Carlo Simulation for Pi Estimation using Vector API%n%n");
       System.out.printf("Samples per run: %,d%n%n", samples);

       // Use fixed seed for reproducibility
       var seed = 1L;
       
       // Warmup phase
       System.out.printf("Warming up JIT compiler (%s runs)...%n", WARMUP_RUNS);
       for (int i = 0; i < WARMUP_RUNS; i++) {
           PiEstimator.getPi(samples, seed);
           VectorPiEstimator.getPi(samples, seed);
       }
       System.out.println("Warmup complete.\n");
       
       // Benchmark scalar implementation
       System.out.printf("Running SCALAR implementation (%s runs)...%n", BENCHMARK_RUNS);
       System.out.printf("Actual Pi: %.10f%n", Math.PI);
       var scalarTotalTime = 0D;

       for (int i = 0; i < BENCHMARK_RUNS; i++) {
           var startTime = System.nanoTime();
           var scalarPi = PiEstimator.getPi(samples, seed);
           var endTime = System.nanoTime();
           var duration = endTime - startTime;

           scalarTotalTime += duration;
           System.out.printf("  Run %d: %,d ms%n", i+1, duration / 1_000_000);
           System.out.printf("  Estimated Pi: %.10f%n", scalarPi);
       }
       
       var scalarAvgTime = scalarTotalTime / BENCHMARK_RUNS;
       System.out.printf("  Average: %.2f ms%n%n", scalarAvgTime / 1_000_000);

       // Benchmark vector implementation
       System.out.printf("Running VECTOR implementation (%s runs)...%n", BENCHMARK_RUNS);
       System.out.printf("Vector Length: %d%n", DoubleVector.SPECIES_PREFERRED.length());
       System.out.printf("Actual Pi: %.10f%n", Math.PI);
       var vectorTotalTime = 0D;

       for (int i = 0; i < BENCHMARK_RUNS; i++) {
           var startTime = System.nanoTime();
           var vectorPi = VectorPiEstimator.getPi(samples, seed);
           var endTime = System.nanoTime();
           var duration = endTime - startTime;

           vectorTotalTime += duration;
           System.out.printf("  Run %d: %,d ms%n", i + 1, duration / 1_000_000);
           System.out.printf("  Estimated Pi: %.10f%n", vectorPi);
       }
       
       var vectorAvgTime = vectorTotalTime / BENCHMARK_RUNS;
       System.out.printf("  Average: %.2f ms%n%n", vectorAvgTime / 1_000_000);

       // Results comparison
       System.out.println("RESULTS:");

       var vectorSpeedup = scalarAvgTime / vectorAvgTime;
       var vectorImprovement = ((scalarAvgTime - vectorAvgTime) / scalarAvgTime) * 100;
       
       System.out.printf("Scalar time:          %.2f ms%n", scalarAvgTime / 1_000_000);
       System.out.printf("Vector time:          %.2f ms%n%n", vectorAvgTime / 1_000_000);
       System.out.printf("Vector Speedup:          %.2fx (%.1f%% reduction)%n%n", vectorSpeedup, vectorImprovement);
  } 
}

The output is

Monte Carlo Simulation for Pi Estimation using Vector API

Samples per run: 100.000.000

Warming up JIT compiler (3 runs)...
Warmup complete.

Running SCALAR implementation (5 runs)...
Actual Pi: 3,1415926536
    Run 1: 201 ms
    Estimated Pi: 3,1416522000
    Run 2: 201 ms
    Estimated Pi: 3,1416522000
    Run 3: 201 ms
    Estimated Pi: 3,1416522000
    Run 4: 201 ms
    Estimated Pi: 3,1416522000
    Run 5: 201 ms
    Estimated Pi: 3,1416522000
    Average: 201,85 ms

Running VECTOR implementation (5 runs)...
Vector Length: 2
Actual Pi: 3,1415926536
    Run 1: 185 ms
    Estimated Pi: 3,1416522000
    Run 2: 187 ms
    Estimated Pi: 3,1416522000
    Run 3: 185 ms
    Estimated Pi: 3,1416522000
    Run 4: 186 ms
    Estimated Pi: 3,1416522000
    Run 5: 185 ms
    Estimated Pi: 3,1416522000
    Average: 186,00 ms

RESULTS:
Scalar time:          201,85 ms
Vector time:          186,00 ms

Vector Speedup:          1,09x (7,8% reduction)

There is performance improvement. But it seems pretty low. It looks like our program is spending a lot of time creating the random points. Let’s pre-calculate the points and see how much performance gain we get.

Estimation with pre-calculated random points

Here the calculation of random points moves out of the pi calculation and it makes them simpler. Both the implementations take arrays of x and y values.

Scalar implementation with pre-calculated points:

public class PiEstimatorPrecomputed {
    public static double getPi(double[] xs, double[] ys) {
        var insideCircle = 0;

        for (int i = 0; i < xs.length; i++) {
            var x = xs[i];
            var y = ys[i];

            //inside if x*x + y*y <= 1
            if (x * x + y * y <= 1.0) {
                insideCircle++;
            }
        }
        return 4.0D * insideCircle / xs.length;
    }

}

Vector implementation with pre-calculated points:

public class VectorPiEstimatorPrecomputed {
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    public static double getPi(double[] xs, double[] ys) {
        var samples = xs.length;
        var insideCircle = 0L;
        
        var vectorLength = SPECIES.length();
        var vectorLoops = samples / vectorLength;
        var remainder = samples % vectorLength;
        
        // Handle vectors
        for (int i = 0; i < vectorLoops; i++) {
            var offset = i * vectorLength;
            var xVec = DoubleVector.fromArray(SPECIES, xs, offset);
            var yVec = DoubleVector.fromArray(SPECIES, ys, offset);
            // x * x + y * y
            var distSq = xVec.fma(xVec, yVec.mul(yVec));
            var insideMask = distSq.compare(VectorOperators.LE, 1.0);
            insideCircle += insideMask.trueCount();
        }
        
        // Handle final scalar remainder
        int offset = vectorLoops * vectorLength;
        for (int i = 0; i < remainder; i++) {
            var x = xs[offset + i];
            var y = ys[offset + i];
            if (x * x + y * y <= 1.0) {
                insideCircle++;
            }
        }
        
        return 4.0D * insideCircle / samples;
    }
}

The banchmark class has to do a bit more as it has to calculated the random points and them pass them to the implementations. And here’s how the new benchmark looks like:

public class BenchmarkPrecomputed {
    private static final int SAMPLES = 100_000_000; // 100 million samples
    private static final int WARMUP_RUNS = 3;
    private static final int BENCHMARK_RUNS = 5;
    
    static void main() {
        // Use fixed seed for reproducibility
        var seed = 1L;
        var samples = SAMPLES;
        
        var xs = new double[samples];
        var ys = new double[samples];
        
        var random = new SplittableRandom(seed);
        for (int i = 0; i < samples; i++) {
           xs[i] = random.nextDouble();
           ys[i] = random.nextDouble();
        }
        
        System.out.println("Monte Carlo Pi Estimation using Vector API with precomputed random points");
        System.out.println();
        System.out.println("Samples per run: " + String.format("%,d", samples));
        System.out.println();
        
        // Warmup phase
        System.out.println("Warming up JIT compiler (" + WARMUP_RUNS + " runs)...");
        for (int i = 0; i < WARMUP_RUNS; i++) {
           PiEstimatorPrecomputed.getPi(xs, ys);
           VectorPiEstimatorPrecomputed.getPi(xs, ys);
        }
        System.out.println("Warmup complete.\n");
        
        // Benchmark scalar implementation
        System.out.println("Running SCALAR implementation (" + BENCHMARK_RUNS + " runs)...");
        System.out.printf("Actual Pi: %.10f%n", Math.PI);
        var scalarTotalTime = 0D;
        
        for (int i = 0; i < BENCHMARK_RUNS; i++) {
           var startTime = System.nanoTime();
           var scalarPi = PiEstimatorPrecomputed.getPi(xs, ys);
           var endTime = System.nanoTime();
           var duration = endTime - startTime;
        
           scalarTotalTime += duration;
           System.out.printf("  Run %d: %,d ms%n", i+1, duration / 1_000_000);
           System.out.printf("  Estimated Pi: %.10f%n", scalarPi);
        }
        
        var scalarAvgTime = scalarTotalTime / BENCHMARK_RUNS;
        System.out.printf("  Average: %.2f ms%n", scalarAvgTime / 1_000_000);
        System.out.println();
        
        // Benchmark vector implementation
        System.out.println("Running VECTOR implementation (" + BENCHMARK_RUNS + " runs)...");
        System.out.printf("Vector Length: %d%n", DoubleVector.SPECIES_PREFERRED.length());
        System.out.printf("Actual Pi: %.10f%n", Math.PI);
        var vectorTotalTime = 0D;
        
        for (int i = 0; i < BENCHMARK_RUNS; i++) {
           var startTime = System.nanoTime();
           var vectorPi = VectorPiEstimatorPrecomputed.getPi(xs, ys);
           var endTime = System.nanoTime();
           var duration = endTime - startTime;
        
           vectorTotalTime += duration;
           System.out.printf("  Run %d: %,d ms%n", i + 1, duration / 1_000_000);
           System.out.printf("  Estimated Pi: %.10f%n", vectorPi);
        }
        
        var vectorAvgTime = vectorTotalTime / BENCHMARK_RUNS;
        System.out.printf("  Average: %.2f ms%n", vectorAvgTime / 1_000_000);
        System.out.println();
        
        // Results comparison
        System.out.println("RESULTS:");
        
        var vectorSpeedup = scalarAvgTime / vectorAvgTime;
        var vectorImprovement = ((scalarAvgTime - vectorAvgTime) / scalarAvgTime) * 100;
        
        System.out.printf("Scalar time:          %.2f ms%n", scalarAvgTime / 1_000_000);
        System.out.printf("Vector time:          %.2f ms%n", vectorAvgTime / 1_000_000);
        System.out.println();
        System.out.printf("Vector Speedup:          %.2fx (%.1f%% reduction)%n", vectorSpeedup, vectorImprovement);
        System.out.println();
    } 
}

And the output of the benchmark is:

Monte Carlo Pi Estimation using Vector API with precomputed random points

Samples per run: 100.000.000

Warming up JIT compiler (3 runs)...
Warmup complete.

Running SCALAR implementation (5 runs)...
Actual Pi: 3,1415926536
    Run 1: 80 ms
    Estimated Pi: 3,1416522000
    Run 2: 64 ms
    Estimated Pi: 3,1416522000
    Run 3: 65 ms
    Estimated Pi: 3,1416522000
    Run 4: 70 ms
    Estimated Pi: 3,1416522000
    Run 5: 65 ms
    Estimated Pi: 3,1416522000
    Average: 69,35 ms

Running VECTOR implementation (5 runs)...
Vector Length: 2
Actual Pi: 3,1415926536
    Run 1: 38 ms
    Estimated Pi: 3,1416522000
    Run 2: 38 ms
    Estimated Pi: 3,1416522000
    Run 3: 40 ms
    Estimated Pi: 3,1416522000
    Run 4: 41 ms
    Estimated Pi: 3,1416522000
    Run 5: 38 ms
    Estimated Pi: 3,1416522000
    Average: 39,62 ms

RESULTS:
Scalar time:          69,35 ms
Vector time:          39,62 ms

Vector Speedup:          1,75x (42,9% reduction)

The gain this time is much more. This is exactly what we expected. If you look at the Vector Length, it is 2. (I am running this on MacBook Pro with Apple M1 Pro chip.) This means with Vector API, the CPU is now doing 2 calculations at the same time. This should theoretically give us maximum 2x performance gain. What we get is 1.75x. So what we see is that the practical gain is very close to the theoretical maximum value.

Conclusion

SIMD instructions and wide CPU registers enable modern CPUs to perform same instructions on multiple data-elements at the same time. Java’s Vector API provides a way to express vector computations that the JIT compiler can reliably compile into optimized SIMD instructions. This provides higher performance (more generally than the traditional auto-vectorization did). The example discussed above helps us see the performance gain clearly.

Note: All source code for this post is present here: https://github.com/balkrishnarawool/vector-pi-estimator.