Accelerating SHA256 by 100x in Golang on ARM
The 64-bit ARMv8 architecture introduced new instructions for SHA1 and SHA2 acceleration as part of its Cryptography Extensions. We at Minio were curious how much difference these instructions would make, and the answer turned out to be one of those pleasant surprises you get from time to time.
We have been running an ARMv8 server at miniNodes.com and if you look at the CPU info you will see the following:
As you can see, it nicely lists the sha1 and sha2 features, the latter of which is the topic of this blog post.
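To check for these flags programmatically rather than by eye, a minimal sketch in Go could look like the following. It assumes a Linux system that exposes a `Features` line in `/proc/cpuinfo`; the helper name `hasCPUFeature` is ours for illustration and is not part of minio/sha256-simd.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// hasCPUFeature reports whether the given flag appears on any "Features"
// line of /proc/cpuinfo-style text. (Hypothetical helper for illustration.)
func hasCPUFeature(cpuinfo, feature string) bool {
	scanner := bufio.NewScanner(strings.NewReader(cpuinfo))
	for scanner.Scan() {
		// Each line has the form "key<TAB>: value".
		key, value, found := strings.Cut(scanner.Text(), ":")
		if !found || strings.TrimSpace(key) != "Features" {
			continue
		}
		// Flags are whitespace separated; compare whole words so that
		// "sha2" does not accidentally match a longer flag name.
		for _, f := range strings.Fields(value) {
			if f == feature {
				return true
			}
		}
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/cpuinfo") // Linux only
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	info := string(data)
	fmt.Println("sha1:", hasCPUFeature(info, "sha1"))
	fmt.Println("sha2:", hasCPUFeature(info, "sha2"))
}
```

Matching whole words rather than substrings keeps the check from being confused by other flags that merely contain the same characters.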
As part of the minio/sha256-simd repository we have added an arm64 Golang assembly version, an excerpt of which is shown below (just one cycle out of many):
We compared this version against the default implementation that is currently available in Go. Without further ado, here are the results as reported by benchcmp:
This is, of course, a massive increase: from a meager 6 MB/sec to 615 MB/sec per core. To be fair, the default Go implementation on ARM is not accelerated in any way (unlike on Intel CPUs, for instance, where there is an assembly version). Had it been written in assembly, the difference would be quite a bit smaller, but in our view still at least a factor of 10 thanks to the new sha256h, sha256h2, sha256su0, and sha256su1 instructions.
As you may have guessed from the name of the minio/sha256-simd repo, we are working on adding SIMD support (AVX2, AVX and SSE flavors) for SHA256 on Intel, so stay tuned for that. We do not promise a 100x speedup for Intel though…
Interestingly enough, Intel has SHA extensions that are comparable to the ARM ones. Linux 4.4 added support for them, but so far we have not been able to identify any CPUs that will actually run this code. If you know of one, please let us know; with wider support we would be interested in adding this to minio/sha256-simd.
Finishing off, the asm2plan9s tool that we initially developed to assist with minio/blake2b-simd is being extended with ARM support (in addition to the Intel support that is already available).
Sharp readers may have noticed one more interesting ARMv8 feature: PMULL (polynomial multiplication). This instruction may prove very helpful for Reed-Solomon erasure coding, a technique we use at Minio in the XL version to provide additional safety and protection against bit rot. So stay tuned for that as well.
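To give an idea of what polynomial multiplication means here, the sketch below does in plain Go what a carry-less multiply does in hardware: it multiplies two 8-bit polynomials over GF(2) using XOR instead of addition, then reduces the product to get a GF(2^8) multiplication of the kind Reed-Solomon codes are built on. This is an illustration only and not how Minio implements erasure coding; the function names are ours, and the reduction polynomial 0x11d is an assumed choice that is common in Reed-Solomon implementations.

```go
package main

import "fmt"

// clmul8 multiplies two 8-bit polynomials over GF(2) without carries,
// producing a 16-bit product. Partial products are combined with XOR.
func clmul8(a, b uint8) uint16 {
	var r uint16
	for i := uint(0); i < 8; i++ {
		if b&(1<<i) != 0 {
			r ^= uint16(a) << i
		}
	}
	return r
}

// gfReduce reduces a 16-bit carry-less product modulo a degree-8
// polynomial such as 0x11d, yielding an element of GF(2^8).
func gfReduce(p, poly uint16) uint8 {
	for bit := uint(15); bit >= 8; bit-- {
		if p&(1<<bit) != 0 {
			p ^= poly << (bit - 8)
		}
	}
	return uint8(p)
}

// gfMul multiplies two GF(2^8) elements using the assumed polynomial 0x11d.
func gfMul(a, b uint8) uint8 {
	return gfReduce(clmul8(a, b), 0x11d)
}

func main() {
	fmt.Printf("clmul8(3, 3)   = %#x\n", clmul8(3, 3))
	fmt.Printf("gfMul(2, 0x80) = %#x\n", gfMul(2, 0x80))
}
```

The bit-by-bit loop in `clmul8` is the part that a hardware polynomial multiply replaces with a single instruction, which is why it is attractive for this kind of arithmetic.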