1 point | by shrecshrec 6 hours ago
3 comments
I would be glad to hear any suggestions about future improvements and ideas.
I implemented a full secp256k1 engine from scratch in C++ and CUDA with zero external dependencies (no GMP, no OpenSSL).
The goal was to explore performance limits of:
Jacobian mixed-add
Batch inversion using Montgomery’s trick
Large-scale scalar stepping
GPU memory coalescing strategies
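To make the batch inversion item concrete, here is a minimal sketch of Montgomery's trick. It is not the engine's code: it assumes a toy field modulo the prime 2^32 - 5 so that products fit in 64 bits, and the names `mul_mod`, `inv_mod`, and `batch_invert` are hypothetical. It also uses `std::vector` for brevity, which a hot path would avoid.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy field: integers modulo the prime 2^32 - 5.
// This is an assumption for brevity, NOT the secp256k1 field prime.
constexpr std::uint64_t P = 4294967291ULL;

inline std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) {
    return (a * b) % P;  // a, b < 2^32, so the product fits in 64 bits
}

// Single inversion via Fermat's little theorem: a^(P-2) mod P.
inline std::uint64_t inv_mod(std::uint64_t a) {
    std::uint64_t result = 1, base = a % P, e = P - 2;
    while (e > 0) {
        if (e & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        e >>= 1;
    }
    return result;
}

// Montgomery's trick: inverts every element in place using one field
// inversion plus about 3(n-1) multiplications. All inputs must be nonzero.
void batch_invert(std::vector<std::uint64_t>& v) {
    const std::size_t n = v.size();
    if (n == 0) return;

    // prefix[i] = v[0] * v[1] * ... * v[i]
    std::vector<std::uint64_t> prefix(n);
    prefix[0] = v[0];
    for (std::size_t i = 1; i < n; ++i) prefix[i] = mul_mod(prefix[i - 1], v[i]);

    // acc starts as the inverse of the product of all elements.
    std::uint64_t acc = inv_mod(prefix[n - 1]);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uint64_t inv_i = mul_mod(acc, prefix[i - 1]);  // 1 / v[i]
        acc = mul_mod(acc, v[i]);  // now the inverse of prefix[i - 1]
        v[i] = inv_i;
    }
    v[0] = acc;
}
```

The payoff is that one expensive inversion is shared across the whole batch, which is why the technique pairs well with converting many Jacobian points back to affine at once.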
On an RTX 5060, I’m getting ~2.5B mixed-add operations/sec.
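For readers unfamiliar with the operation being counted, a mixed-add adds a Jacobian point to an affine point, which saves multiplications because the second point has Z = 1. The sketch below uses the standard textbook formulas and is not the engine's kernel. It assumes a toy field modulo 2^32 - 5 instead of the 256-bit secp256k1 prime, and the names `fmul`, `fsub`, `Affine`, `Jacobian`, and `mixed_add` are hypothetical.

```cpp
#include <cstdint>

// Toy field modulo the prime 2^32 - 5 (an assumption for brevity; the real
// secp256k1 field prime is 2^256 - 2^32 - 977 and needs multi-limb math).
constexpr std::uint64_t P = 4294967291ULL;

// Both helpers expect inputs already reduced below P.
inline std::uint64_t fmul(std::uint64_t a, std::uint64_t b) { return (a * b) % P; }
inline std::uint64_t fsub(std::uint64_t a, std::uint64_t b) { return (a + P - b) % P; }

struct Affine   { std::uint64_t x, y; };
struct Jacobian { std::uint64_t X, Y, Z; };  // represents (X / Z^2, Y / Z^3)

// Mixed addition: Jacobian p + affine q.
// Assumes p is not the point at infinity and p != +/-q; a complete
// implementation has to handle the H == 0 cases separately.
inline Jacobian mixed_add(const Jacobian& p, const Affine& q) {
    std::uint64_t Z1Z1 = fmul(p.Z, p.Z);
    std::uint64_t U2   = fmul(q.x, Z1Z1);              // q.x scaled to p's Z
    std::uint64_t S2   = fmul(q.y, fmul(p.Z, Z1Z1));   // q.y scaled to p's Z
    std::uint64_t H    = fsub(U2, p.X);
    std::uint64_t R    = fsub(S2, p.Y);
    std::uint64_t HH   = fmul(H, H);
    std::uint64_t HHH  = fmul(H, HH);
    std::uint64_t V    = fmul(p.X, HH);
    std::uint64_t X3   = fsub(fsub(fmul(R, R), HHH), fmul(2, V));
    std::uint64_t Y3   = fsub(fmul(R, fsub(V, X3)), fmul(p.Y, HHH));
    std::uint64_t Z3   = fmul(p.Z, H);
    return {X3, Y3, Z3};
}
```

No field inversion appears anywhere in the function. The cost of returning to affine coordinates is deferred, and can be amortized across many points.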
Key design decisions:
Little-endian limb layout for hardware efficiency
Big-endian only for visualization
Deterministic memory layout
No dynamic allocation in hot paths
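As an illustration of the first two decisions, here is a minimal sketch of a little-endian limb layout with a big-endian view used only for printing. The 4 x 64-bit split and the names `U256`, `add_carry`, and `to_hex_be` are assumptions for this example, not the project's actual types.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical 256-bit value: limb[0] holds the least significant 64 bits.
// The fixed-size array keeps the layout deterministic and heap-free.
struct U256 {
    std::array<std::uint64_t, 4> limb{};
};

// Adds b into a, lowest limb first, and returns the final carry (0 or 1).
// Little-endian order lets the carry chain walk memory in increasing index.
inline std::uint64_t add_carry(U256& a, const U256& b) {
    std::uint64_t carry = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint64_t s  = a.limb[i] + b.limb[i];
        std::uint64_t c1 = s < a.limb[i];  // overflow from the limb sum
        std::uint64_t t  = s + carry;
        std::uint64_t c2 = t < s;          // overflow from adding the carry
        a.limb[i] = t;
        carry = c1 | c2;                   // at most one of them can be set
    }
    return carry;
}

// Big-endian hex string, most significant limb first, for display only.
inline std::string to_hex_be(const U256& x) {
    char buf[65];  // 64 hex digits plus the terminating null
    std::snprintf(buf, sizeof buf, "%016llx%016llx%016llx%016llx",
                  static_cast<unsigned long long>(x.limb[3]),
                  static_cast<unsigned long long>(x.limb[2]),
                  static_cast<unsigned long long>(x.limb[1]),
                  static_cast<unsigned long long>(x.limb[0]));
    return std::string(buf);
}
```

Keeping the conversion to big-endian at the display boundary means the arithmetic never pays for byte or limb reordering.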
Would love feedback from people working on ECC or GPU math.