Fast hashing in Golang using BLAKE2
In this blog we would like to present an optimized implementation, blake2b-simd , in pure Go for the BLAKE2 hashing algorithm that takes advantage of SIMD instructions. It gives up to a 4x speed increase over the (non-assembly) Go implementation and can achieve hashing speeds close to 1 GB/sec per core on high end CPUs. Compared to the standard SHA512 implementation it gives roughly a 3x performance improvement.
Three flavors are available namely AVX2, AVX, and SSE (best one is automatically selected). The list below shows how they compare against the (non-assembly) Go implementation:
- AVX2: 3.94x
- AVX: 3.28x
- SSE: 2.85x
Comparison to other hashing techniques
Below is a comparison of several hashing techniques and their processing speed in terms of MB/s and the performance improvement for BLAKE2b (AVX2 version).
MD5 : 607.48 MB/s (1.4x)
SHA1 : 522.94 MB/s (1.6x)
SHA256 : 189.58 MB/s (4.5x)
SHA512 : 306.33 MB/s (2.8x)
BLAKE2b: 850.64 MB/s
Bit rot protection
At Minio (for the XL version) we use erasure encoding to protect against loss of data. Objects are split into (by default) 8 data chunks in combination with 8 parity chunks onto a total of 16 hard disks. This means that (in the very unlikely event) of a loss of even 8 hard disks it is still possible to reconstruct the original object and return it to the user (note that a hard disk can of course be replaced and will then be reconstructed automatically).
To protect against bit rot as may happen over time, Minio verifies the integrity of any data chunk read back from disk by computing the hash of the chunk and comparing this to the initially computed hash. Only in the case both hashes are identical it is guaranteed that the data has not been altered in any way (and is thus safe to return to the client). In case of any difference between the hashes the block itself is discarded and has to be reconstructed from the other data and parity chunks.
As this verification step is a very frequent operation it will be clear that the performance of the hashing technique has a great impact on the overall system performance. As such Minio has invested in an as fast as possible yet reliable hashing technique and found BLAKE2 to be a good candidate.
Who else is this for?
In addition to bit rot protection there are many other software use cases for fast hashing. Here you can think of deduplication in storage systems, intrusion detection, version control systems, and integrity checking. Also in these applications hashing is a very frequent task so any performance gains achievable here will have a great overall performance impact.
asm2plan9s
The initial version is based on the reference SSE implementation which is part of BLAKE2. As there is only partial support for AVX/SSE instructions in Go Assembly, we have developed a simple tool called asm2plan9s that helps in assembling the instructions into their BYTE code equivalent.
Behind the scenes this tool uses yasm to assemble every instruction one by one and insert the obtained byte codes into the source code file.
For instance, take the following AVX instruction (saved in a file example.s)
// VPADDQ XMM0,XMM1,XMM8
By running $ asm2plan9s example.s this file will be transformed into
LONG $0xd471c1c4; BYTE $0xc0 // VPADDQ XMM0,XMM1,XMM8
Depending on the instruction, there will be more or fewer LONGs, WORDs, and/or BYTEs. Furthermore it has support for use within defines and for more information see the project on GitHub.
Technical details
BLAKE2b is a hashing algorithm that operates on 64-bit integer values. The AVX2 version uses the 256-bit wide YMM registers in order to essentially process four “Golang” operations in parallel. AVX and SSE operate on 128-bit XMM registers (two operations in parallel). Below are excerpts from the source code of the files compressAvx2_amd64.s, compressAvx_amd64.s, and compress_generic.go respectively.
AVX2:
VPADDQ YMM0,YMM0,YMM1 /* v0 += v4,v1 += v5,v2 += v6,v3 += v7 */
AVX:
VPADDQ XMM0,XMM0,XMM2 /* v0 += v4, v1 += v5 */VPADDQ XMM1,XMM1,XMM3 /* v2 += v6, v3 += v7 */
Golang:
v0 += v4
v1 += v5
v2 += v6
v3 += v7
The parallelism shown in the instructions will be clear. Due to other constraints such as memory access, caching, etc., one does not get the full (theoretical) 2x improvement when going from eg AVX to AVX2 but as time goes the difference may actually widen. (Note that compilers these days also try to vectorize operations.)
Conclusion
We hope that you may find this project useful. Head over to blake2b-simd on GitHub to check it out. And let us know what you think.