SLH-DSA — countermeasures
- FIPS spec:
- Crate path: quantica::slh_dsa
- Cargo feature: slh-dsa (on by default).
SLH-DSA (SPHINCS+) is hash-based: it has no rejection sampling, no secret polynomial arithmetic, no NTT — only a large tree of SHAKE / SHA2 calls. That removes most of the classical PQC side-channel concerns and concentrates the remaining risks on three well-studied attack surfaces:
Fault injection anywhere in the hash-tree construction. Corrupting any intermediate hash — whether FORS, WOTS+, or an XMSS authentication node — produces a signature that verifies under a different subtree root; collecting a handful of faulted signatures yields a universal forgery ([CMP18]). Practical voltage-glitch realisation is documented in [GenetKPM18], and recent work ([A+25]) extends the threat to a purely software attacker via Rowhammer — making fault redundancy the central hardening axis for hash-based signatures.
DPA on the PRF that expands SK.seed into WOTS+ and FORS leaf secrets. The same seed is reused across every leaf of every tree, so a DPA attacker accumulates arbitrarily many traces on a controllable input ([KGenetB+18]). Countermeasure directions: masked PRF ([Flu24]) or a threshold-implementation Keccak core ([Saa24a]).
Template / SPA on FORS index extraction and sibling ordering. The FORS digit (a secret derived from the message digest via SK.PRF) drives the order in which sibling subtrees are computed. Template matching on the PRF absorption patterns can recover the digit bit by bit (discussion in [KGenetB+18], SoK update in [DRC+25]).
This chapter lists what is implemented today and what is scheduled for the next hardening round. The threat classes are defined in Threat model.
Coverage matrix
| Threat | Status | Countermeasure(s) |
|---|---|---|
| Fault on FORS / WOTS+ / XMSS (grafting-tree forgery) | implemented (…) | Recompute-and-compare redundancy on the FORS signature path (…) |
| DPA on the master PRF (…) | planned (tier 4, …) | First-order masking of the PRF call that derives WOTS+ and FORS leaf secrets, following the 3-share SHAKE posture of [Flu24]; long-term alternative is a TI Keccak core ([Saa24a]). |
| SPA / template on FORS sibling PRF addresses (leaks idx bits) | implemented (…) | Full-tree streaming FORS sign (…) |
| Fault on digest → FORS indices | implemented (…) | Recompute-and-compare check at the tail of … |
| SPA on hypertree walk / memory-stack SPA on FORS | implemented (tier 2 RAM) | Iterative BDS treehash (…) |
| Software / remote timing | implemented | No secret-dependent early exit in the public signing path; all intermediate comparisons use … |
| Template attacks on WOTS+ chain values | implemented | Same … |
Memory / stack-timing — iterative treehash + streaming signature
Principle
The FORS treehash, if written recursively, allocates ~256 KiB of
stack in the worst parameter set. A recursive trace exposes a memory-
access envelope that matches the tree geometry and indirectly leaks
the FORS digit. The iterative variant keeps a BDS-style stack of
z+1 nodes (~448 B) and walks the tree with a loop counter that
is independent of the secret.
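As a rough illustration of the iterative variant, the sketch below computes a Merkle root with an explicit stack and cross-checks it against a recursive reference. It assumes a toy 64-bit mixing function in place of the real tweaked hash (so it runs without a SHAKE crate); `toy_hash`, `toy_leaf`, `treehash_iterative` and `treehash_recursive` are illustrative names, not the crate's API.

```rust
/// Toy 2-to-1 compression function standing in for the real tweaked hash.
/// NOT cryptographic; it only keeps the sketch dependency-free.
fn toy_hash(left: u64, right: u64) -> u64 {
    let mut x = left.rotate_left(13) ^ right.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    x ^= x >> 29;
    x.wrapping_mul(0xBF58_476D_1CE4_E5B9).wrapping_add(1)
}

/// Toy leaf generator (stand-in for PRF + F on leaf `i`).
fn toy_leaf(i: u32) -> u64 {
    toy_hash(u64::from(i), 0x5348_4C44_5341)
}

/// Iterative treehash over `2^height` leaves.
/// Returns the root and the largest stack depth observed. The loop bounds
/// depend only on `height`, never on a secret index.
fn treehash_iterative(height: u32) -> (u64, usize) {
    let mut stack: Vec<(u64, u32)> = Vec::with_capacity(height as usize + 1);
    let mut max_depth = 0usize;
    for i in 0..(1u32 << height) {
        let mut node = toy_leaf(i);
        let mut h = 0u32;
        // Merge while the stack top has the same height as the current node.
        while let Some(&(top, top_h)) = stack.last() {
            if top_h != h {
                break;
            }
            stack.pop();
            node = toy_hash(top, node);
            h += 1;
        }
        stack.push((node, h));
        max_depth = max_depth.max(stack.len());
    }
    (stack[0].0, max_depth)
}

/// Recursive reference, used only to cross-check the iterative version.
fn treehash_recursive(start: u32, height: u32) -> u64 {
    if height == 0 {
        return toy_leaf(start);
    }
    let half = 1u32 << (height - 1);
    toy_hash(
        treehash_recursive(start, height - 1),
        treehash_recursive(start + half, height - 1),
    )
}
```

The point of the design is that the stack depth is bounded by the tree height plus one, regardless of which leaf is of interest, so the memory-access envelope carries no information about the FORS digit.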
Streaming signature emission complements this: the final signature
is allocated once at the top level and sub-slices are passed to
fors_sign_into / ht_sign_into / xmss_sign_into /
wots_sign_into. No intermediate heap buffer is ever resized, so
the allocator state cannot leak intermediate component sizes.
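A minimal sketch of the single-allocation pattern, assuming a hypothetical `Layout` struct and stub writers in place of the real `*_sign_into` functions (whose signatures are not reproduced here): the signature buffer is allocated once and disjoint sub-slices are handed out with `split_at_mut`.

```rust
/// Hypothetical component lengths; real values come from the parameter set.
struct Layout {
    n: usize,        // randomizer length
    fors_len: usize, // FORS signature length
    ht_len: usize,   // hypertree signature length
}

// Stand-ins for the real writers: each fills exactly the slice it is given.
fn write_randomizer(out: &mut [u8]) {
    out.fill(0x01);
}
fn fors_sign_into_stub(out: &mut [u8]) {
    out.fill(0x02);
}
fn ht_sign_into_stub(out: &mut [u8]) {
    out.fill(0x03);
}

/// Allocate the signature once, then pass disjoint sub-slices to each writer.
/// No buffer is grown or reallocated after the initial `vec!`.
fn sign_streaming(l: &Layout) -> Vec<u8> {
    let mut sig = vec![0u8; l.n + l.fors_len + l.ht_len];
    let (r, rest) = sig.split_at_mut(l.n);
    let (fors, ht) = rest.split_at_mut(l.fors_len);
    write_randomizer(r);
    fors_sign_into_stub(fors);
    ht_sign_into_stub(ht);
    sig
}
```

Because every writer receives a slice of fixed, public length, the allocator sees one request whose size depends only on the parameter set.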
Published basis
Code pointers
| Item | Location |
|---|---|
| Iterative FORS treehash | … |
| Streaming sign entry point | … |
| Per-layer streaming variants | … |
Timing — no secret-dependent branches in the public path
Principle
The SLH-DSA public-signing path (slh_sign_internal) is
deterministic modulo the randomizer R and takes no secret-dependent
early exit. The three inner functions that do contain
conditional branches during signing (fors::fors_pk_from_sig,
wots::chain_iter, xmss::xmss_pk_from_sig) branch on
(md, idx_tree, idx_leaf, digits), which are derived from R, the
message and the public keys. R is the first n bytes of the emitted
signature; once it is transmitted, an observer can recompute these
values from R, the message and the public keys, so leaking them via
timing is information-theoretically equivalent to reading the signature.
This is formally documented as the SLH-DSA block of the ctgrind
suppression file (Verification methodology has the full threat-model
paragraph); the suppression is scheduled to be closed by item
T2-D below.
Code pointers
| Item | Location |
|---|---|
| Signing entry + component layout | … |
| Constant-time helpers used by verify | … |
DFA / fault injection — current posture
SLH-DSA has no rejection sampling and no double representation of
intermediates, so the algorithm offers no built-in redundancy that
would expose a fault; DFA hardening has to be added explicitly. This
is known to be the dominant residual risk for hash-based signatures:
single-fault universal forgery has been the canonical attack class
since [CMP18], with a practical voltage-glitch realisation in
[GenetKPM18] and, more recently, a purely software Rowhammer
realisation in [A+25]. The latter removes the “needs a lab” argument
that previously justified deferring this layer. T1-C (the canonical
recompute-and-compare redundancy), its CT prerequisite T1-F, and
T1-E (digest → FORS-indices integrity check) have all shipped; see
below.
Planned hardening
Of the items below, T1-C, T1-D, T1-E and T1-F have shipped; T4-B and T2-D remain planned for the next hardening round. For the planned items, signatures are provided as rustdoc sketches ahead of implementation; the function bodies are deliberately left out so that the API surface can be reviewed before implementation starts.
T1-C — FORS signature redundancy — shipped
Addresses: grafting-tree universal forgery ([CMP18], [GenetKPM18], [A+25]). Canonical recommendation of [Genet23]: sign the FORS component twice, compare the results in constant time, abort on divergence before the signature can leave the device.
Implementation: fors::fors_sign_into_redundant in
quantica/src/slh_dsa/fors.rs, gated by the sca-fors-redundancy
cargo feature. The routine signs FORS twice into independent heap-
backed [SecretBytes] scratch buffers, derives the FORS public key
from each signature via the constant-time fors_pk_from_sig
(T1-F), then compares both signatures and both derived public
keys under silentops::ct_eq. On any mismatch it returns
Err(SlhDsaError::FaultDetected) without writing anything into
the caller’s signature buffer — the faulted signature never propagates.
On a clean run it copies the validated signature into out and
returns the FORS pk, which the caller (slh::slh_sign_internal_redundant)
feeds straight into the hypertree signer.
```rust
/// Recompute-and-compare FORS signing (T1-C). Returns the validated
/// FORS public key, or `Err(FaultDetected)` on a single-fault attack
/// against the FORS hash chain.
pub fn fors_sign_into_redundant<P: Params>(
    md: &[u8],
    sk_seed: &[u8],
    pk_seed: &[u8],
    adrs_template: &Adrs,
    out: &mut [u8],
) -> Result<Vec<u8>, SlhDsaError>;
```
Comparing both surfaces (signature bytes and derived pk) is
defence-in-depth: a fault that corrupts auth-path bytes might
round-trip to the same FORS root under the verifier path; the byte-
level ct_eq catches that case. Symmetrically, a fault inside the
second fors_pk_from_sig derivation is caught by the pk
ct_eq. Both checks together cost a single extra ct_eq and
are paid only on the slow path that already runs the FORS signer
twice.
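The recompute-and-compare pattern itself is small enough to sketch in isolation. The version below is an assumption-laden toy: `sign_redundant`, `ct_eq_bytes` and `SignError` are local stand-ins (not the crate's `fors_sign_into_redundant`, `silentops::ct_eq` or `SlhDsaError`), the signer is an arbitrary closure, and only the signature surface is compared. Whether the comparison compiles to branch-free code depends on the compiler, which is why a vetted constant-time library is preferable in practice.

```rust
#[derive(Debug, PartialEq)]
enum SignError {
    FaultDetected,
}

/// Branch-free byte comparison: returns 1 if equal, 0 otherwise.
/// Lengths are treated as public.
fn ct_eq_bytes(a: &[u8], b: &[u8]) -> u8 {
    if a.len() != b.len() {
        return 0;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    // diff == 0 maps to 1, anything else to 0, without branching on diff.
    ((diff as u16).wrapping_sub(1) >> 15) as u8
}

/// Run `sign` twice into independent scratch buffers and release the
/// result into `out` only if both runs agree.
fn sign_redundant<F>(mut sign: F, out: &mut [u8]) -> Result<(), SignError>
where
    F: FnMut(&mut [u8]),
{
    let mut first = vec![0u8; out.len()];
    let mut second = vec![0u8; out.len()];
    sign(&mut first);
    sign(&mut second);
    let ok = ct_eq_bytes(&first, &second);
    if ok == 1 {
        out.copy_from_slice(&first);
    }
    // Wipe scratch on both paths (a real implementation would use a
    // zeroizing buffer type rather than a plain Vec).
    first.fill(0);
    second.fill(0);
    if ok == 1 {
        Ok(())
    } else {
        Err(SignError::FaultDetected)
    }
}
```

On a mismatch the caller's buffer is left untouched, which mirrors the abort posture: nothing derived from a faulted run is ever written to the output.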
Abort posture — unlike ML-KEM’s double-decaps + branchless fault-fallback (ML-KEM — countermeasures), this routine aborts rather than substituting a fault-derived value. The asymmetry is deliberate: a KEM must always return a shared secret, while a signer that detects a fault must, per [Genet23], refuse to emit so the faulted signature does not propagate.
Dispatch. The public SlhDsa::<P>::sign switches between the
redundant path (slh::slh_sign_internal_redundant) and the
historic non-redundant path (slh::slh_sign_internal) at
compile time via #[cfg(feature = "sca-fors-redundancy")]. The
non-redundant path stays publicly re-exported as the CAVP / KAT
deterministic entry point.
Validation. Three module tests in fors.rs:
- fors_sign_into_redundant_matches_reference_shake128s / …_shake128f: drive the redundant path on multiple seed × message permutations and assert that (a) the validated signature is byte-identical to the non-redundant fors_sign_into output, and (b) the returned FORS pk matches the standalone fors_pk_from_sig derivation from the produced signature.
- fors_redundancy_compare_detects_divergence: exercises the internal fors_redundancy_compare helper with synthetically divergent buffers (signature mismatch, pk mismatch, both) and asserts that each surfaces Err(FaultDetected); the all-equal case surfaces Ok. This validates the abort logic without injecting a real fault into the FORS signer.
Cost. One extra fors_sign_into (~1× FORS signing time
again) plus two fors_pk_from_sig derivations and two
silentops::ct_eq checks. The bulk is the second signing —
mirrors the double-decaps posture of ML-KEM in spirit.
Memory. One SecretBytes scratch of length fors_sig_len =
K * (1 + A) * N (10,560 B ≈ 10 KiB for SHAKE-256s, 3,696 B ≈ 3.6 KiB
for SHAKE-128f), heap-allocated so the M0 baseline stack budget stays
honest, and drop-zeroized on both the success and the abort path.
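The length formula can be checked with a few lines of arithmetic. The sketch below evaluates K * (1 + A) * N for the six SHAKE parameter sets. The (K, A, N) triples are quoted from the FIPS 205 parameter table as an assumption and should be checked against the crate's own Params implementations; `PARAM_SETS` and `fors_sig_len` are illustrative names, not the crate's API.

```rust
/// (name, K, A, N) per parameter set. Values assumed from FIPS 205;
/// verify against the crate's `Params` implementations before relying on them.
const PARAM_SETS: [(&str, usize, usize, usize); 6] = [
    ("SHAKE-128s", 14, 12, 16),
    ("SHAKE-128f", 33, 6, 16),
    ("SHAKE-192s", 17, 14, 24),
    ("SHAKE-192f", 33, 8, 24),
    ("SHAKE-256s", 22, 14, 32),
    ("SHAKE-256f", 35, 9, 32),
];

/// FORS signature length in bytes: K trees, each contributing one leaf
/// secret plus A authentication-path nodes of N bytes.
fn fors_sig_len(k: usize, a: usize, n: usize) -> usize {
    k * (1 + a) * n
}

/// Scratch size for every parameter set, in bytes.
fn fors_sig_len_table() -> Vec<(&'static str, usize)> {
    PARAM_SETS
        .iter()
        .map(|&(name, k, a, n)| (name, fors_sig_len(k, a, n)))
        .collect()
}
```

Under these assumed parameters the largest scratch buffer is the SHAKE-256f one at 11,200 bytes, which is the figure to budget for if a single heap allocation size is fixed across parameter sets.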
T1-D — full-tree streaming FORS sign — shipped
Addresses: template attack on FORS sibling PRF addresses
([KGenetB+18]). In the FIPS-205 default
path, the address passed to fors_node during the authentication-
path loop is base + s * 2^j where s = floor(idx / 2^j) XOR 1
— the upper (A - j) bits of the secret FORS digit idx with
the lowest bit flipped. The set of addresses absorbed by Keccak
across j ∈ [0, A) reveals idx bit by bit to a template
attacker.
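To make the leak concrete, the following sketch lists the sibling position s at each level for a given digit. `sibling_positions` is a hypothetical helper written for this illustration; it only evaluates s = floor(idx / 2^j) XOR 1 and omits the per-tree base offset.

```rust
/// Sibling position at each authentication-path level j for FORS digit
/// `idx` in a tree of height `a`: s_j = (idx >> j) ^ 1.
fn sibling_positions(idx: u32, a: u32) -> Vec<u32> {
    (0..a).map(|j| (idx >> j) ^ 1).collect()
}
```

For A = 3, idx = 5 yields positions [4, 3, 0] while idx = 4 yields [5, 3, 0]: the level-0 address alone already distinguishes the two digits, and each higher level confirms the remaining upper bits.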
Implementation: the per-FORS-tree inner loop of
fors::fors_sign_into (gated by the sca-fors-dummy-siblings
cargo feature) is replaced by a single BDS-style full-tree
streaming traversal:
Iterate k from 0 to 2^A - 1 in fixed order. For each leaf at position leaf_idx = base + k:
- Generate the leaf secret via fors_sk_gen (absorbs the idx-independent address set_tree_index(leaf_idx)).
- Branchlessly save the leaf secret into the signature’s “leaf secret” slot if k == idx, via silentops::ct_copy guarded by silentops::ct_eq_u32.
- Hash the leaf via f_hash; push the height-0 node onto a BDS stack.
- Iteratively merge same-height stack tops via hash_h (absorbs idx-independent set_tree_index(absolute_pos), where absolute_pos depends only on i and k).
- At each merge to height h, branchlessly save the resulting node to auth_path[h] if (k >> h) == ((idx >> h) XOR 1).
After streaming all 2^A leaves, the BDS stack contains exactly
one node — the FORS root, discarded (the caller re-derives it via
fors_pk_from_sig). Both the leaf secret and the A auth-path
siblings are populated in the signature slot.
Signature stays unchanged — the output bytes are byte-identical
to the FIPS-205 default path on every input (KAT-verified across
all six SHAKE parameter sets, with and without sca-fors-redundancy
composed).
```rust
// `fors_sign_into` under `sca-fors-dummy-siblings` (sketch).
for k in 0..(1u32 << P::A) {
    let leaf_idx = base + k;
    let sk = fors_sk_gen::<P>(sk_seed, pk_seed, &mut adrs, leaf_idx);
    // Keep the leaf secret only when k == idx, without branching on idx.
    silentops::ct_copy(leaf_slot, &sk, silentops::ct_eq_u32(k, idx));
    let mut node = hash::f_hash::<P>(pk_seed, &mut adrs, &sk);
    let mut height = 0u32;
    let mut local_pos = k;
    // Height-0 sibling: saved when this leaf sits at position idx ^ 1.
    silentops::ct_copy(
        &mut auth_slot[0..P::N], &node,
        silentops::ct_eq_u32(local_pos, idx ^ 1),
    );
    while let Some(&(_, top_h)) = stack.last() {
        if top_h != height { break; }
        // ... pop, merge, save auth_slot[h] branchlessly ...
    }
    stack.push((node, height));
}
```
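A self-contained toy model of the same traversal can be cross-checked against the direct, idx-dependent computation. Everything here is an assumption made for illustration: nodes are 8 bytes, `toy_leaf` and `toy_merge` are non-cryptographic stand-ins for the leaf PRF + F and for the 2-to-1 hash, and `ct_eq_mask` / `ct_copy` are local helpers rather than the silentops API. The tree height `a` must be at least 1.

```rust
const N: usize = 8; // toy node size; real nodes are 16 to 32 bytes

// Toy stand-ins for the leaf PRF + F and the 2-to-1 hash. NOT cryptographic.
fn toy_leaf(pos: u32) -> [u8; N] {
    let x = (u64::from(pos) + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    (x ^ (x >> 31)).to_le_bytes()
}

fn toy_merge(left: &[u8; N], right: &[u8; N]) -> [u8; N] {
    let (l, r) = (u64::from_le_bytes(*left), u64::from_le_bytes(*right));
    let x = l.rotate_left(17) ^ r.wrapping_mul(0xBF58_476D_1CE4_E5B9);
    (x ^ (x >> 29)).wrapping_add(1).to_le_bytes()
}

/// 0xFF if a == b, 0x00 otherwise, computed without branching on a or b.
fn ct_eq_mask(a: u32, b: u32) -> u8 {
    let x = a ^ b;
    let nonzero = ((x | x.wrapping_neg()) >> 31) as u8; // 1 iff x != 0
    nonzero.wrapping_sub(1)
}

/// dst = src where mask == 0xFF; dst unchanged where mask == 0x00.
fn ct_copy(dst: &mut [u8; N], src: &[u8; N], mask: u8) {
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d = (*d & !mask) | (*s & mask);
    }
}

/// Visit all 2^a leaves in fixed order; return (auth path for idx, root).
fn auth_path_streaming(idx: u32, a: u32) -> (Vec<[u8; N]>, [u8; N]) {
    let mut auth = vec![[0u8; N]; a as usize];
    let mut stack: Vec<([u8; N], u32)> = Vec::new();
    for k in 0..(1u32 << a) {
        let mut node = toy_leaf(k);
        let mut h = 0u32;
        // Leaf k is the height-0 sibling iff k == idx ^ 1.
        ct_copy(&mut auth[0], &node, ct_eq_mask(k, idx ^ 1));
        while let Some(&(top, top_h)) = stack.last() {
            if top_h != h {
                break;
            }
            stack.pop();
            node = toy_merge(&top, &node);
            h += 1;
            // The merged node sits at position k >> h on height h
            // (h and a are public, so this branch is not secret-dependent).
            if h < a {
                ct_copy(&mut auth[h as usize], &node, ct_eq_mask(k >> h, (idx >> h) ^ 1));
            }
        }
        stack.push((node, h));
    }
    (auth, stack[0].0)
}

/// Reference: compute each sibling subtree root directly (idx-dependent walk).
fn subtree_root(start: u32, h: u32) -> [u8; N] {
    if h == 0 {
        return toy_leaf(start);
    }
    let half = 1u32 << (h - 1);
    toy_merge(&subtree_root(start, h - 1), &subtree_root(start + half, h - 1))
}

fn auth_path_direct(idx: u32, a: u32) -> Vec<[u8; N]> {
    (0..a).map(|j| subtree_root(((idx >> j) ^ 1) << j, j)).collect()
}
```

Both functions return the same authentication path for every idx, but only `auth_path_direct` chooses which leaves to hash based on idx; the streaming version hashes the same leaves in the same order for all digits.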
What this kills. The Keccak absorption sequence becomes a
deterministic function of the public FORS-tree index i only;
no idx-dependent address ever reaches the PRF. The template
oracle of [KGenetB+18] is closed for
FORS signing. The same reasoning protects against DPA on the
leaf-secret PRF (fors_sk_gen) since its address argument is
likewise idx-independent in the streamed path.
Cost. Roughly 2× the default FORS hash count per signature.
The default path computes sum_{j=0..A-1} 2^j = 2^A - 1 leaves
across the auth-path subtrees; the full-tree stream computes
2^A leaves + 2^A - 1 internal merges. KAT wall-time
(host x86_64) goes from ~80 s to ~135 s under
--features slh-dsa,sca-fors-dummy-siblings — ratio consistent
with the predicted ~2×.
Memory. Stack budget unchanged at O(A * N) for the BDS
stack (same as the existing iterative treehash in fors_node,
quantica/src/slh_dsa/fors.rs:62-118). No new heap hot-spot.
Historical correction. An earlier draft of this section
described T1-D as “compute both possible siblings (s = 0 and
s = 1) at fixed positions, select the right one branchlessly”.
That framing is wrong: FIPS-205 Algorithm 16 has
s = floor(idx / 2^j) XOR 1, multi-bit, taking values in
[0, 2^(A-j)) at level j. At j = 0 (deepest level) the
sibling sits at one of up to 2^A idx-dependent positions,
not at one of a fixed pair. A first implementation along the
“two-candidate” line silently produced non-FIPS-compliant
signatures (5/16 KAT vectors diverged). The full-tree streaming
traversal documented above is the only mechanism that produces an
idx-independent address sequence at the same asymptotic cost.
Validation. End-to-end KAT
(cargo test --release -p quantica --test slh_dsa_kat --features
slh-dsa,sca-fors-dummy-siblings) — 16/16 vectors byte-identical
to the default path. Lib tests
(cargo test --release -p quantica --lib --features
sca-fors-dummy-siblings) — 5/5 green; composition with
sca-fors-redundancy also green
(--features sca-fors-dummy-siblings,sca-fors-redundancy).
Aligns with the SLotH threshold-implementation posture
([Saa24a]).
Out of scope. Extension of full-tree streaming to WOTS+ chains inside the hypertree — same template-oracle reasoning applies but the leak surface is smaller; tracked as a Tier-4 candidate.
T1-E — digest → indices integrity check — shipped
Addresses: single-fault attack forcing one of the FORS indices
to a controlled value (zero-index variant of
[CMP18]). The corruption reveals
PRF(SK.seed, addr_0) cleanly. Even with T1-D (full-tree
streaming) shipped, a fault during the upstream
message_to_indices derivation, or during the digest extraction
itself, could redirect the leaf-secret commit to a faulted
position before the streaming traversal kicks in.
Implementation: at the tail of fors::fors_sign_into, the
FORS index vector is re-derived from the same md slice and
CT-compared to the vector consumed during signing. The check
is gated by the sca-fors-indices-check cargo feature; on a
mismatch fors_sign_into returns
Err(SlhDsaError::FaultDetected), the slh_sign_internal
caller propagates via ?, and the hypertree-signing step never
runs — the faulted FORS sub-signature never gets wrapped into a
full signature emitted to the host.
```rust
pub(crate) fn fors_indices_consistency_check<P: Params>(
    md: &[u8],
    used: &[u32],
) -> Result<(), SlhDsaError> {
    // Re-derive the index vector from the same digest slice.
    let recomputed = message_to_indices::<P>(md);
    // Lengths are public, so an early return here leaks nothing secret.
    if recomputed.len() != used.len() {
        return Err(SlhDsaError::FaultDetected);
    }
    // Serialise both vectors so they can be compared in constant time.
    let used_b: Vec<u8> = used.iter().flat_map(|x| x.to_le_bytes()).collect();
    let rec_b: Vec<u8> = recomputed.iter().flat_map(|x| x.to_le_bytes()).collect();
    if silentops::ct_eq(&used_b, &rec_b) != 1 {
        return Err(SlhDsaError::FaultDetected);
    }
    Ok(())
}
```
The fresh derivation is run on the same md slice, so a
fault that lands persistently on md itself (e.g. Rowhammer on
the stack region holding the digest) passes the check; that
threat is the redundant-signing class T1-C already covers
(two independent FORS signings see different intermediate state).
T1-E specifically catches transient faults in the
base_2b bit-extraction or in the index vector storage between
production and consumption.
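For readers who want to see what the re-derivation operates on, here is a dependency-free sketch of the bit extraction and of a recompute-and-compare check built on it. `base_2b` follows the FIPS 205 definition as I understand it (big-endian bit order, `b` bits per output); `indices_consistent` and the `FaultDetected` unit struct are illustrative stand-ins, and the comparison accumulates an XOR difference directly instead of serialising into byte vectors.

```rust
#[derive(Debug, PartialEq)]
struct FaultDetected;

/// Split `x` into `out_len` integers of `b` bits each, most significant
/// bits first. Panics if `x` holds fewer than `out_len * b` bits.
fn base_2b(x: &[u8], b: u32, out_len: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(out_len);
    let (mut pos, mut bits, mut total) = (0usize, 0u32, 0u64);
    for _ in 0..out_len {
        // Pull in bytes until at least `b` unread bits are buffered.
        while bits < b {
            total = (total << 8) | u64::from(x[pos]);
            pos += 1;
            bits += 8;
        }
        bits -= b;
        out.push(((total >> bits) & ((1u64 << b) - 1)) as u32);
    }
    out
}

/// Re-derive the indices from `md` and compare them with the vector that
/// was actually used, accumulating differences without an early exit.
fn indices_consistent(md: &[u8], used: &[u32], a: u32) -> Result<(), FaultDetected> {
    let fresh = base_2b(md, a, used.len());
    let mut diff = 0u32;
    for (u, f) in used.iter().zip(fresh.iter()) {
        diff |= u ^ f;
    }
    // The pass/fail outcome is public, so branching on it is acceptable.
    if diff == 0 {
        Ok(())
    } else {
        Err(FaultDetected)
    }
}
```

As a usage example, the two bytes 0xAB 0xCD with b = 4 decompose into the indices 10, 11, 12, 13; flipping any bit of a stored index after extraction makes the check fail.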
Cost. One extra message_to_indices (= one base_2b) per
FORS signature — negligible byte-shuffling, no hashing,
K * A / 8 bytes processed. The two Vec<u8> serialisations
for silentops::ct_eq allocate 4 * K bytes each, freed at
function return; well under any M0-baseline budget.
Composition. Orthogonal to T1-C (which compares two
independent FORS signings to catch in-FORS faults) and to T1-D
(which closes the template oracle on Keccak addresses). Under
--features sca-fors-redundancy, T1-C’s
fors_sign_into_redundant calls fors_sign_into twice and
each call independently runs the T1-E check (if also enabled).
KAT determinism preserved in every combination
(sca-fors-indices-check on its own; combined with T1-D;
combined with T1-D + T1-C).
Validation. Lib tests
fors_indices_check_accepts_correct_shake128s /
…_shake128f exercise the positive path on multiple seed
permutations. fors_indices_check_rejects_flipped_index
drives the helper with synthetically corrupted index vectors
(one bit flipped, and a length mismatch) and asserts
FaultDetected in each case. End-to-end determinism: KAT
cargo test --release -p quantica --test slh_dsa_kat
--features slh-dsa,sca-fors-indices-check — 16/16 vectors
byte-identical to the default path (~85 s wall-time vs ~80 s by
default; the overhead is in the integrity check, and the signing
itself is unchanged).
T4-B — PRF masking
Addresses: DPA on SK.seed through the FORS / WOTS+ leaf PRF
([KGenetB+18]). The baseline construction
is [Flu24] (3-share SHAKE),
with a hardware-side alternative documented in
[Saa24a].
Planned API (transparent wrapper over the existing
hash::prf):
```rust
/// 3-share masked PRF. Emits the same byte string as
/// `hash::prf` but keeps `sk_seed` split into shares through
/// every SHAKE-absorb step, per Fluhrer's construction.
#[cfg(feature = "sca-masked-prf")]
pub fn prf_masked<P: Params>(
    pk_seed: &[u8],
    sk_seed_s: &MaskedSeed, // shares of SK.seed (3-share encoding)
    adrs: &Adrs,
) -> Vec<u8>;
```
Cost: roughly 1.7× per signature. Gated behind an opt-in feature
until SHAKE masking lands in silentops.
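The share representation underlying such a wrapper can be sketched independently of the masked Keccak permutation, which is the hard part and is not shown. The following assumes a Boolean (XOR) three-share encoding; `MaskedSeed3` and its methods are hypothetical and are not the planned `MaskedSeed` type. Random bytes are supplied by the caller so that the sketch has no external dependency; in real use they must come from a CSPRNG.

```rust
/// Hypothetical three-share Boolean encoding: seed = s0 ^ s1 ^ s2.
struct MaskedSeed3 {
    shares: [Vec<u8>; 3],
}

impl MaskedSeed3 {
    /// Split `seed` using two fresh random masks `m1` and `m2`.
    fn split(seed: &[u8], m1: &[u8], m2: &[u8]) -> Self {
        assert!(seed.len() == m1.len() && seed.len() == m2.len());
        let s0: Vec<u8> = seed
            .iter()
            .zip(m1)
            .zip(m2)
            .map(|((s, a), b)| *s ^ *a ^ *b)
            .collect();
        MaskedSeed3 { shares: [s0, m1.to_vec(), m2.to_vec()] }
    }

    /// Re-randomise the shares without changing the encoded value.
    fn refresh(&mut self, r1: &[u8], r2: &[u8]) {
        assert!(r1.len() == self.shares[0].len() && r2.len() == r1.len());
        for i in 0..r1.len() {
            self.shares[0][i] ^= r1[i] ^ r2[i];
            self.shares[1][i] ^= r1[i];
            self.shares[2][i] ^= r2[i];
        }
    }

    /// Recombine the shares. For tests only: a masked PRF would never
    /// materialise the unmasked seed.
    fn unmask(&self) -> Vec<u8> {
        (0..self.shares[0].len())
            .map(|i| self.shares[0][i] ^ self.shares[1][i] ^ self.shares[2][i])
            .collect()
    }
}
```

Linear steps of the permutation can be applied to each share separately; the non-linear step is where a dedicated masked construction is required.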
T1-F — constant-time fors_pk_from_sig — shipped
Addresses: the secret-dependent branch
if ((idx >> j) & 1) == 0 { ... } else { ... } inside the original
FIPS-205 Algorithm 17. Verifier-side, the branch is on public data;
but when the same routine is reused under T1-C as part of the
signing-side redundancy check, its input becomes secret and a Rust
if would re-introduce a timing leak.
Implementation: fors::fors_pk_from_sig in
quantica/src/slh_dsa/fors.rs was reworked to a single
constant-time routine. For every authentication-path level, the
original branch is replaced by a byte-wise
silentops::ct_select_u8 cswap that materialises the
(left, right) hash_h inputs into two N-byte stack
buffers, then calls hash_h(left, right) unconditionally. The
tree_index written into adrs is identical in both original
branches so it needs no extra masking. Scratch buffers are
silentops::ct_zeroize-d at the end of the routine.
```rust
/// Constant-time FORS pk-from-sig (FIPS-205 Alg. 17). The
/// secret-dependent `hash_h` argument ordering is resolved by
/// a branchless `silentops::ct_select_u8` cswap. Single routine,
/// used by both the standalone verifier and the T1-C signing-side
/// redundancy check (where `idx` is secret).
pub fn fors_pk_from_sig<P: Params>(
    sig_fors: &[u8],
    md: &[u8],
    pk_seed: &[u8],
    adrs: &mut Adrs,
) -> Vec<u8>;
```
A previous variable-time sibling has been removed: keeping a single CT implementation eliminates the foot-gun of a future call site picking a leaky variant by autocomplete.
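The byte-wise cswap can be illustrated with a small sketch. It assumes a local `ct_select_u8(a, b, choice)` that returns `a` for choice 0 and `b` for choice 1; the real `silentops::ct_select_u8` may use a different argument order or mask convention. `order_children` is a hypothetical helper that materialises the (left, right) inputs for one authentication-path level.

```rust
/// Returns `a` if `choice == 0` and `b` if `choice == 1`, without
/// branching on `choice`. Local stand-in, not the silentops signature.
fn ct_select_u8(a: u8, b: u8, choice: u8) -> u8 {
    let mask = 0u8.wrapping_sub(choice & 1); // 0x00 or 0xFF
    a ^ (mask & (a ^ b))
}

/// Order the running node and the auth-path sibling into (left, right)
/// according to bit `j` of `idx`: an even position puts the node on the
/// left, an odd position puts it on the right.
fn order_children<const N: usize>(
    node: &[u8; N],
    auth: &[u8; N],
    idx: u32,
    j: u32,
) -> ([u8; N], [u8; N]) {
    let bit = ((idx >> j) & 1) as u8;
    let mut left = [0u8; N];
    let mut right = [0u8; N];
    for i in 0..N {
        left[i] = ct_select_u8(node[i], auth[i], bit);
        right[i] = ct_select_u8(auth[i], node[i], bit);
    }
    (left, right)
}
```

The hash is then called once on (left, right) regardless of the bit value, so the instruction sequence is the same for both orderings.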
Validation: two round-trip tests in
quantica/src/slh_dsa/fors.rs
(fors_pk_from_sig_round_trip_shake128s and …_shake128f)
exercise the sign → pk-from-sig pipeline across multiple seed /
message permutations and assert that two back-to-back derivations
agree (determinism) and produce N-byte outputs. End-to-end
correctness against FIPS-205 reference output is covered by the
KAT suite quantica/tests/slh_dsa_kat.rs.
Cost: 2 * N byte scratch (~32 B for SHAKE-128, ~64 B for
SHAKE-256) plus 2 * N * A ct_select_u8 calls per FORS
tree per signature. Negligible compared to the underlying SHAKE
work.
T2-D — explicit unpoison of R, digest, indices
Programmatic proof to ctgrind that the branches inside
fors::fors_pk_from_sig, wots::chain_iter,
xmss::xmss_sign_into / xmss_pk_from_sig are on data that
has reached the “publish-ready” state. Closes the four suppressions
listed in tools/ctgrind.supp. Zero-cost on production builds.