SLH-DSA — countermeasures

FIPS spec: [NationalIoSaTechnology24c]
Crate path: quantica::slh_dsa
Cargo feature: slh-dsa (on by default).

SLH-DSA (SPHINCS+) is hash-based: it has no rejection sampling, no secret polynomial arithmetic, no NTT — only a large tree of SHAKE / SHA2 calls. That removes most of the classical PQC side-channel concerns and concentrates the remaining risks on three well-studied attack surfaces:

Fault injection anywhere in the hash-tree construction. Corrupting any intermediate hash — whether FORS, WOTS+, or an XMSS authentication node — produces a signature that verifies under a different subtree root; collecting a handful of faulted signatures yields a universal forgery ([CMP18]). Practical voltage-glitch realisation is documented in [GenetKPM18], and recent work ([A+25]) extends the threat to a purely software attacker via Rowhammer — making fault redundancy the central hardening axis for hash-based signatures.
DPA on the PRF that expands ``SK.seed`` into WOTS+ and FORS leaf secrets. The same seed is reused across every leaf of every tree, so a DPA attacker accumulates arbitrarily many traces on a controllable input ([KGenetB+18]). Countermeasure directions: masked PRF ([Flu24]) or a threshold-implementation Keccak core ([Saa24a]).
Template / SPA on FORS index extraction and sibling ordering. The FORS digit (a secret derived from the message digest via SK.PRF) drives the order in which sibling subtrees are computed. Template matching on the PRF absorption patterns can recover the digit bit by bit (discussion in [KGenetB+18], SoK update in [DRC+25]).

This chapter lists what is implemented today and what is scheduled for the next hardening round. The threat classes are defined in Threat model.

Coverage matrix 

SLH-DSA countermeasure / threat matrix
Threat	Status	Countermeasure(s)
Fault on FORS / WOTS+ / XMSS (grafting-tree forgery)	implemented (`sca-fors-redundancy`)	Recompute-and-compare redundancy on the FORS signature path (`T1-C`, opt-in feature `sca-fors-redundancy`), aligned with [Genet23]. Addresses both physical fault injection ([CMP18], [GenetKPM18]) and software Rowhammer ([A+25]). Consumes the constant-time `fors_pk_from_sig` (`T1-F`).
DPA on the master PRF (`SK.seed` → leaf secrets)	planned (tier 4, `T4-B`)	First-order masking of the PRF call that derives WOTS+ and FORS leaf secrets, following the 3-share SHAKE posture of [Flu24]; long-term alternative is a TI Keccak core ([Saa24a]).
SPA / template on FORS sibling PRF addresses (leaks idx bits)	implemented (`sca-fors-dummy-siblings`)	Full-tree streaming FORS sign (`T1-D`, opt-in feature `sca-fors-dummy-siblings`): all `2^A` leaves of each FORS tree are absorbed in fixed order `[base, base + 2^A)`, the leaf secret + auth-path siblings are extracted branchlessly via `silentops::ct_copy` / `silentops::ct_eq_u32`. PRF address sequence becomes idx-independent, closing the per-bit template oracle of [KGenetB+18]. Output bytes unchanged; cost ~2×.
Fault on digest → FORS indices	implemented (`sca-fors-indices-check`)	Recompute-and-compare check at the tail of `fors_sign_into` (`T1-E`, opt-in feature `sca-fors-indices-check`). The index vector is re-derived from `md` and CT-compared via `silentops::ct_eq` to the vector consumed during signing; on mismatch returns `Err(SlhDsaError::FaultDetected)` before the hypertree step runs. Output bytes unchanged; cost negligible.
SPA on hypertree walk / memory-stack SPA on FORS	implemented (tier 2 RAM)	Iterative BDS treehash (`fors_node`) with data-independent stack depth; streaming signature emission avoids heap allocator side-channels.
Software / remote timing	implemented	No secret-dependent early exit in the public signing path; all intermediate comparisons use `silentops::ct_eq`. The branches ctgrind still flags during signing are on values that are byte-for-byte part of the emitted signature (`R`, `digest`, indices) — interim suppression documented under Verification methodology; scheduled removal via `T2-D`.
Template attacks on WOTS+ chain values	implemented	Same `ct_*` routing + fixed-length chain iteration (`chain_iter` executes a constant number of F-hashes per chain).

Memory / stack-timing — iterative treehash + streaming signature 

Principle 

The FORS treehash, if written recursively, allocates ~256 KiB of stack in the worst parameter set. A recursive trace exposes a memory-access envelope that matches the tree geometry and indirectly leaks the FORS digit. The iterative variant keeps a BDS-style stack of z+1 nodes (~448 B) and walks the tree with a loop counter that is independent of the secret.
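The bounded-stack walk can be illustrated with a minimal sketch. It assumes a toy, non-cryptographic `toy_hash` (FNV-1a) in place of the real tweakable hashes, and the names `leaf`, `root_recursive`, and `root_iterative` are hypothetical, not the crate's `fors_node`. The point is only that the loop structure depends on the public leaf counter and that the stack never holds more than z+1 nodes.

```rust
/// Toy stand-in for the tweakable hashes F / H (FNV-1a, NOT cryptographic).
fn toy_hash(a: &[u8], b: &[u8]) -> [u8; 8] {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &x in a.iter().chain(b.iter()) {
        h ^= x as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h.to_le_bytes()
}

/// Hypothetical leaf generator: depends only on the leaf position.
fn leaf(i: u32) -> [u8; 8] {
    toy_hash(b"leaf", &i.to_le_bytes())
}

/// Recursive reference: call depth (and stack use) follows the tree geometry.
fn root_recursive(start: u32, height: u32) -> [u8; 8] {
    if height == 0 {
        return leaf(start);
    }
    let l = root_recursive(start, height - 1);
    let r = root_recursive(start + (1 << (height - 1)), height - 1);
    toy_hash(&l, &r)
}

/// Iterative treehash: returns the root and the largest stack depth observed.
fn root_iterative(height: u32) -> ([u8; 8], usize) {
    let mut stack: Vec<([u8; 8], u32)> = Vec::with_capacity(height as usize + 1);
    let mut max_depth: usize = 0;
    for k in 0..(1u32 << height) {
        let mut node = leaf(k);
        let mut h = 0u32;
        // Merge while the stack top has the same height as the current node.
        while let Some(&(left, top_h)) = stack.last() {
            if top_h != h {
                break;
            }
            stack.pop();
            node = toy_hash(&left, &node);
            h += 1;
        }
        stack.push((node, h));
        max_depth = max_depth.max(stack.len());
    }
    (stack[0].0, max_depth)
}
```

Both functions produce the same root; the iterative one does so with a stack whose depth is bounded by the tree height plus one.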

Streaming signature emission complements this: the final signature is allocated once at the top level and sub-slices are passed to fors_sign_into / ht_sign_into / xmss_sign_into / wots_sign_into. No intermediate heap buffer is ever resized, so the allocator state cannot leak intermediate component sizes.
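A minimal sketch of the single-allocation layout follows. The `Layout` struct, the `fill` helper, and the component lengths are hypothetical placeholders; only the pattern (allocate once, hand disjoint sub-slices to each layer) mirrors the description above. The component order R, FORS signature, hypertree signature follows FIPS 205.

```rust
/// Hypothetical component lengths in bytes (illustrative values only).
struct Layout {
    r_len: usize,
    fors_len: usize,
    ht_len: usize,
}

/// Stand-in for a `*_sign_into` layer: writes only into the slice it is given.
fn fill(component: &mut [u8], tag: u8) {
    for b in component.iter_mut() {
        *b = tag;
    }
}

/// Allocates the signature once, then passes disjoint sub-slices downward.
fn sign_streaming(layout: &Layout) -> Vec<u8> {
    let mut sig = vec![0u8; layout.r_len + layout.fors_len + layout.ht_len];
    {
        let (r, rest) = sig.split_at_mut(layout.r_len);
        let (fors, ht) = rest.split_at_mut(layout.fors_len);
        fill(r, 0x01); // randomizer R
        fill(fors, 0x02); // FORS signature
        fill(ht, 0x03); // hypertree signature
    }
    sig
}
```

Because every layer receives a fixed-length `&mut [u8]`, no layer can grow or reallocate the output buffer.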

Published basis 

[BRS22]: the memory-constrained methodology used as inspiration (originally for ML-DSA but transferable).
[DLK+25]: compact SLH-DSA variant whose engineering choices match our streaming approach.

Code pointers 

Item	Location
Iterative FORS treehash	`quantica/src/slh_dsa/fors.rs` `fors_node`
Streaming sign entry point	`quantica/src/slh_dsa/slh.rs` `slh_sign_internal` (passes sub-slices of the output buffer to each layer’s `*_sign_into` function).
Per-layer streaming variants	`quantica/src/slh_dsa/wots.rs` `wots_sign_into` ; `quantica/src/slh_dsa/xmss.rs` `xmss_sign_into` ; `quantica/src/slh_dsa/hypertree.rs` `ht_sign_into` ; `quantica/src/slh_dsa/fors.rs` `fors_sign_into`

Timing — no secret-dependent branches in the public path 

Principle 

The SLH-DSA public-signing path (slh_sign_internal) is deterministic modulo the randomizer R and does not take a secret-dependent early exit. The three inner functions that do contain conditional branches during signing (fors::fors_pk_from_sig, wots::chain_iter, xmss::xmss_pk_from_sig) branch on (md, idx_tree, idx_leaf, digits), which are derived from R, the message, and the public keys. R is the first n bytes of the emitted signature; once it is transmitted, an observer holding the message recomputes these values from R and the public keys, so leaking them via timing is information-theoretically equivalent to reading the signature.

This is formally documented as the SLH-DSA block of the ctgrind suppression file (Verification methodology has the full threat-model paragraph); the suppression is scheduled to be closed by item T2-D below.

Code pointers 

Item	Location
Signing entry + component layout	`quantica/src/slh_dsa/slh.rs` `slh_sign_internal`
Constant-time helpers used by verify	`quantica/src/slh_dsa/slh.rs` (`slh_verify_internal` uses `silentops::ct_eq` for the final PK equality check).

DFA / fault injection — current posture 

SLH-DSA has no rejection sampling and no double representation of intermediates, so the default signing path has no built-in DFA hardening layer; fault detection comes only from the opt-in features described below. This is known to be the dominant residual risk for hash-based signatures: a single-fault universal forgery is the canonical attack class since [CMP18], with a practical voltage-glitch realisation in [GenetKPM18] and, more recently, a purely software Rowhammer realisation in [A+25]. The latter removes the “needs a lab” argument that previously justified deferring this layer. T1-C (the canonical recompute-and-compare redundancy), its CT prerequisite T1-F, and T1-E (digest → FORS-indices integrity check) have all shipped; see below.

Planned hardening 

The following items make up the hardening roadmap. Items marked “shipped” are implemented and documented with their final signatures. For the remaining items, signatures are provided as rustdoc sketches ahead of implementation, and the code stubs are deliberately left out so that the API surface can be reviewed before implementation starts.

T1-C — FORS signature redundancy — shipped

Addresses: grafting-tree universal forgery ([CMP18], [GenetKPM18], [A+25]). Canonical recommendation of [Genet23]: sign the FORS component twice, compare the results in constant time, abort on divergence before the signature can leave the device.

Implementation: fors::fors_sign_into_redundant in quantica/src/slh_dsa/fors.rs, gated by the sca-fors-redundancy cargo feature. The routine signs FORS twice into independent heap-backed [SecretBytes] scratch buffers, derives the FORS public key from each signature via the constant-time fors_pk_from_sig (T1-F), then compares both signatures and both derived public keys under silentops::ct_eq. On any mismatch it returns Err(SlhDsaError::FaultDetected) without writing anything into the caller’s signature buffer, so the faulted signature never propagates. On a clean run it copies the validated signature into out and returns the FORS pk, which the caller (slh::slh_sign_internal_redundant) feeds straight into the hypertree signer.

/// Recompute-and-compare FORS signing (T1-C). Returns the validated
/// FORS public key, or `Err(FaultDetected)` on a single-fault attack
/// against the FORS hash chain.
pub fn fors_sign_into_redundant<P: Params>(
    md:            &[u8],
    sk_seed:       &[u8],
    pk_seed:       &[u8],
    adrs_template: &Adrs,
    out:           &mut [u8],
) -> Result<Vec<u8>, SlhDsaError>;

Comparing both surfaces (signature bytes and derived pk) is defence-in-depth: a fault that corrupts auth-path bytes might round-trip to the same FORS root under the verifier path; the byte-level ct_eq catches that case. Symmetrically, a fault inside the second fors_pk_from_sig derivation is caught by the pk ct_eq. Both checks together cost a single extra ct_eq and are paid only on the slow path that already runs the FORS signer twice.
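The two-surface comparison can be sketched as follows. This is a toy illustration, not the crate's `fors_redundancy_compare`: `SignError`, `ct_eq`, and `redundancy_compare_sketch` are hypothetical stand-ins, and the `ct_eq` shown here only avoids an early exit at the source level (a production helper such as `silentops::ct_eq` also has to survive compiler optimisation).

```rust
#[derive(Debug, PartialEq)]
enum SignError {
    FaultDetected,
}

/// Stand-in equality without early exit: returns 1 if equal, 0 otherwise.
fn ct_eq(a: &[u8], b: &[u8]) -> u8 {
    if a.len() != b.len() {
        return 0; // lengths are public
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    // diff == 0 -> 0xFFFF >> 8 = 0xFF; diff != 0 -> 0x00
    let is_zero = ((diff as u16).wrapping_sub(1) >> 8) as u8;
    is_zero & 1
}

/// Compares both surfaces; writes `out` only when both agree.
/// Precondition: `out.len() == sig_a.len()`.
fn redundancy_compare_sketch(
    sig_a: &[u8],
    sig_b: &[u8],
    pk_a: &[u8],
    pk_b: &[u8],
    out: &mut [u8],
) -> Result<(), SignError> {
    // Combine both verdicts before branching, so one branch covers both checks.
    let ok = ct_eq(sig_a, sig_b) & ct_eq(pk_a, pk_b);
    if ok != 1 {
        return Err(SignError::FaultDetected); // `out` is left untouched
    }
    out.copy_from_slice(sig_a);
    Ok(())
}
```

On a mismatch in either surface the caller's buffer keeps its previous contents, which matches the abort behaviour described for the shipped routine.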

Abort posture — unlike ML-KEM’s double-decaps + branchless fault-fallback (ML-KEM — countermeasures), this routine aborts rather than substituting a fault-derived value. The asymmetry is deliberate: a KEM must always return a shared secret, while a signer that detects a fault must, per [Genet23], refuse to emit so the faulted signature does not propagate.

Dispatch. The public SlhDsa::<P>::sign switches between the redundant path (slh::slh_sign_internal_redundant) and the historic non-redundant path (slh::slh_sign_internal) at compile time via #[cfg(feature = "sca-fors-redundancy")]. The non-redundant path stays publicly re-exported as the CAVP / KAT deterministic entry point.

Validation. Three module tests in fors.rs:

fors_sign_into_redundant_matches_reference_shake128s / …_shake128f — drive the redundant path on multiple seed × message permutations and assert that (a) the validated signature is byte-identical to the non-redundant fors_sign_into output, and (b) the returned FORS pk matches the standalone fors_pk_from_sig derivation from the produced signature.
fors_redundancy_compare_detects_divergence — exercises the internal fors_redundancy_compare helper with synthetically divergent buffers (signature mismatch, pk mismatch, both) and asserts each surfaces Err(FaultDetected); the all-equal case surfaces Ok. Lets us validate the abort logic without injecting a real fault into the FORS signer.

Cost. One extra fors_sign_into (~1× FORS signing time again) plus two fors_pk_from_sig derivations and two silentops::ct_eq checks. The bulk is the second signing — mirrors the double-decaps posture of ML-KEM in spirit.

Memory. Each SecretBytes scratch buffer has length fors_sig_len = K * (1 + A) * N (~10 KiB for SHAKE-256s, ~3.6 KiB for SHAKE-128f). The buffers are heap-allocated so the M0 baseline stack budget stays honest, and are drop-zeroized on both the success and the abort path.
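As a worked check of the length formula, the sketch below evaluates K * (1 + A) * N for two parameter sets. The (K, A, N) values are taken from the FIPS 205 parameter table; the free function `fors_sig_len` is a hypothetical helper for illustration, not the crate's API.

```rust
/// FORS signature length in bytes: K trees, each contributing one
/// leaf secret plus A authentication-path nodes of N bytes.
const fn fors_sig_len(k: usize, a: usize, n: usize) -> usize {
    k * (1 + a) * n
}

/// SLH-DSA-SHAKE-256s: K = 22, A = 14, N = 32.
const LEN_256S: usize = fors_sig_len(22, 14, 32);

/// SLH-DSA-SHAKE-128f: K = 33, A = 6, N = 16.
const LEN_128F: usize = fors_sig_len(33, 6, 16);
```

With these values the scratch comes to 10560 bytes for SHAKE-256s and 3696 bytes for SHAKE-128f.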

T1-D — full-tree streaming FORS sign — shipped

Addresses: template attack on FORS sibling PRF addresses ([KGenetB+18]). In the FIPS-205 default path, the address passed to fors_node during the authentication-path loop is base + s * 2^j where s = floor(idx / 2^j) XOR 1, that is, the upper (A - j) bits of the secret FORS digit idx with the lowest bit flipped. The set of addresses absorbed by Keccak across j ∈ [0, A) reveals idx bit by bit to a template attacker.
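A small worked example of the leak: the sketch below lists the sibling index s at each level j for a given idx. The function name `sibling_indices` is hypothetical; the formula is the one quoted above.

```rust
/// Sibling index s = floor(idx / 2^j) XOR 1 for each level j in [0, a).
fn sibling_indices(idx: u32, a: u32) -> Vec<u32> {
    (0..a).map(|j| (idx >> j) ^ 1).collect()
}
```

For A = 3, idx = 0b101 gives the sequence [4, 3, 0] while idx = 0b001 gives [0, 1, 1]: every level exposes a different suffix of idx, which is what a template on the absorbed addresses exploits.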

Implementation: the per-FORS-tree inner loop of fors::fors_sign_into (gated by the sca-fors-dummy-siblings cargo feature) is replaced by a single BDS-style full-tree streaming traversal:

Iterate k from 0 to 2^A - 1 in fixed order.
For each leaf at position leaf_idx = base + k:
- Generate the leaf secret via fors_sk_gen (absorbs the idx-independent address set_tree_index(leaf_idx)).
- Branchlessly save the leaf secret into the signature’s “leaf secret” slot if k == idx, via silentops::ct_copy guarded by silentops::ct_eq_u32.
- Hash the leaf via f_hash; push the height-0 node onto a BDS stack.
- Iteratively merge same-height stack tops via hash_h (absorbs idx-independent set_tree_index(absolute_pos) where absolute_pos depends only on i and k).
- At each merge to height h, branchlessly save the resulting node to auth_path[h] if (k >> h) == ((idx >> h) XOR 1).

After streaming all 2^A leaves, the BDS stack contains exactly one node: the root of the current FORS tree, which is discarded (the caller re-derives it via fors_pk_from_sig). Both the leaf secret and the A auth-path siblings are populated in the signature slot.

Signature stays unchanged — the output bytes are byte-identical to the FIPS-205 default path on every input (KAT-verified across all six SHAKE parameter sets, with and without sca-fors-redundancy composed).

/// `fors_sign_into` under `sca-fors-dummy-siblings` — sketch.
for k in 0..(1u32 << P::A) {
    let leaf_idx = base + k;
    let sk = fors_sk_gen::<P>(sk_seed, pk_seed, &mut adrs, leaf_idx);
    silentops::ct_copy(leaf_slot, &sk, silentops::ct_eq_u32(k, idx));
    let mut node = hash::f_hash::<P>(pk_seed, &mut adrs, &sk);
    let mut height = 0u32;
    let mut local_pos = k;
    silentops::ct_copy(
        &mut auth_slot[0..P::N], &node,
        silentops::ct_eq_u32(local_pos, idx ^ 1),
    );
    while let Some(&(_, top_h)) = stack.last() {
        if top_h != height { break; }
        // ... pop, merge, save auth_slot[h] branchlessly ...
    }
    stack.push((node, height));
}
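A self-contained toy version of the traversal sketched above is given below for a single tree. Everything in it is a stand-in under stated assumptions: `toy_hash` is a non-cryptographic FNV-1a placeholder for the PRF / F / H calls, `ct_eq_mask` and `ct_copy` are hypothetical substitutes for `silentops::ct_eq_u32` and `silentops::ct_copy` (whose real signatures are not shown in this chapter), and addresses are omitted. `subtree_root` is a recursive reference used only to check the result.

```rust
const N: usize = 8;

/// Toy stand-in for the hash calls (FNV-1a, NOT cryptographic).
fn toy_hash(a: &[u8], b: &[u8]) -> [u8; N] {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &x in a.iter().chain(b.iter()) {
        h ^= x as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h.to_le_bytes()
}

/// 0xFF if a == b, else 0x00, computed without a branch.
fn ct_eq_mask(a: u32, b: u32) -> u8 {
    let x = (a ^ b) as u64;
    ((x.wrapping_sub(1) >> 63) as u8).wrapping_neg()
}

/// dst = src where mask == 0xFF; dst unchanged where mask == 0x00.
fn ct_copy(dst: &mut [u8], src: &[u8], mask: u8) {
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d = (*d & !mask) | (*s & mask);
    }
}

fn leaf_secret(k: u32) -> [u8; N] {
    toy_hash(b"sk", &k.to_le_bytes())
}

/// Saves `node` (height h, position k >> h) if it is the auth-path sibling.
fn save_if_sibling(auth: &mut [[u8; N]], node: &[u8; N], k: u32, h: u32, idx: u32) {
    if (h as usize) < auth.len() {
        // Public condition: depends on the tree height only.
        ct_copy(&mut auth[h as usize], node, ct_eq_mask(k >> h, (idx >> h) ^ 1));
    }
}

/// Streams all 2^a leaves in fixed order; returns (leaf secret, auth path, root).
fn stream_tree(idx: u32, a: u32) -> ([u8; N], Vec<[u8; N]>, [u8; N]) {
    let mut leaf_slot = [0u8; N];
    let mut auth = vec![[0u8; N]; a as usize];
    let mut stack: Vec<([u8; N], u32)> = Vec::new();
    for k in 0..(1u32 << a) {
        let sk = leaf_secret(k);
        ct_copy(&mut leaf_slot, &sk, ct_eq_mask(k, idx));
        let mut node = toy_hash(b"F", &sk);
        let mut h = 0u32;
        save_if_sibling(&mut auth, &node, k, h, idx);
        while let Some(&(left, top_h)) = stack.last() {
            if top_h != h {
                break;
            }
            stack.pop();
            node = toy_hash(&left, &node);
            h += 1;
            save_if_sibling(&mut auth, &node, k, h, idx);
        }
        stack.push((node, h));
    }
    (leaf_slot, auth, stack[0].0)
}

/// Recursive reference for checking: root of the subtree starting at `start`.
fn subtree_root(start: u32, height: u32) -> [u8; N] {
    if height == 0 {
        return toy_hash(b"F", &leaf_secret(start));
    }
    let half = 1u32 << (height - 1);
    toy_hash(&subtree_root(start, height - 1), &subtree_root(start + half, height - 1))
}
```

The only places where idx appears are the mask computations, so the sequence of hash inputs and the control flow are the same for every idx.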

What this kills. The Keccak absorption sequence becomes a deterministic function of the public FORS-tree index i only; no idx-dependent address ever reaches the PRF. The template oracle of [KGenetB+18] is closed for FORS signing. The same reasoning removes idx-dependent address leakage from the leaf-secret PRF (fors_sk_gen), since its address argument is likewise idx-independent in the streamed path; DPA on SK.seed itself remains the subject of T4-B.

Cost. Roughly 2× the default FORS hash count per signature. The default path computes sum_{j=0..A-1} 2^j = 2^A - 1 leaves across the auth-path subtrees; the full-tree stream computes 2^A leaves + 2^A - 1 internal merges. KAT wall-time (host x86_64) goes from ~80 s to ~135 s under --features slh-dsa,sca-fors-dummy-siblings, a measured ratio of ~1.7× that is consistent with the predicted ~2× on the FORS component.

Memory. Stack budget unchanged at O(A * N) for the BDS stack (same as the existing iterative treehash in fors_node, quantica/src/slh_dsa/fors.rs:62-118). No new heap hot-spot.

Historical correction. An earlier draft of this section described T1-D as “compute both possible siblings (s = 0 and s = 1) at fixed positions, select the right one branchlessly”. That framing is wrong: FIPS-205 Algorithm 16 has s = floor(idx / 2^j) XOR 1, multi-bit, taking values in [0, 2^(A-j)) at level j. At j = 0 (deepest level) the sibling sits at one of up to 2^A idx-dependent positions, not at one of a fixed pair. A first implementation along the “two-candidate” line silently produced non-FIPS-compliant signatures (5/16 KAT vectors diverged). The full-tree streaming traversal documented above is the only mechanism that produces an idx-independent address sequence at the same asymptotic cost.

Validation. End-to-end KAT (cargo test --release -p quantica --test slh_dsa_kat --features slh-dsa,sca-fors-dummy-siblings) — 16/16 vectors byte-identical to the default path. Lib tests (cargo test --release -p quantica --lib --features sca-fors-dummy-siblings) — 5/5 green; composition with sca-fors-redundancy also green (--features sca-fors-dummy-siblings,sca-fors-redundancy). Aligns with the SLotH threshold-implementation posture ([Saa24a]).

Out of scope. Extension of full-tree streaming to WOTS+ chains inside the hypertree — same template-oracle reasoning applies but the leak surface is smaller; tracked as a Tier-4 candidate.

T1-E — digest → indices integrity check — shipped

Addresses: single-fault attack forcing one of the FORS indices to a controlled value (zero-index variant of [CMP18]). The corruption reveals PRF(SK.seed, addr_0) cleanly. Even with T1-D (full-tree streaming) shipped, a fault during the upstream message_to_indices derivation, or during the digest extraction itself, could redirect the leaf-secret commit to a faulted position before the streaming traversal kicks in.

Implementation: at the tail of fors::fors_sign_into, the FORS index vector is re-derived from the same ``md`` slice and CT-compared to the vector consumed during signing. The check is gated by the sca-fors-indices-check cargo feature; on a mismatch fors_sign_into returns Err(SlhDsaError::FaultDetected), the slh_sign_internal caller propagates via ?, and the hypertree-signing step never runs — the faulted FORS sub-signature never gets wrapped into a full signature emitted to the host.

pub(crate) fn fors_indices_consistency_check<P: Params>(
    md:   &[u8],
    used: &[u32],
) -> Result<(), SlhDsaError> {
    let recomputed = message_to_indices::<P>(md);
    if recomputed.len() != used.len() {
        return Err(SlhDsaError::FaultDetected);
    }
    let used_b: Vec<u8> = used.iter().flat_map(|x| x.to_le_bytes()).collect();
    let rec_b: Vec<u8> = recomputed.iter().flat_map(|x| x.to_le_bytes()).collect();
    if silentops::ct_eq(&used_b, &rec_b) != 1 {
        return Err(SlhDsaError::FaultDetected);
    }
    Ok(())
}

The fresh derivation is run on the same md slice, so a fault that lands persistently on md itself (e.g. Rowhammer on the stack region holding the digest) passes the check; that threat is the redundant-signing class T1-C already covers (two independent FORS signings see different intermediate state). T1-E specifically catches transient faults in the base_2b bit-extraction or in the index vector storage between production and consumption.
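The re-derivation step can be illustrated with a self-contained toy. `base_2b` follows FIPS 205 Algorithm 4; `indices_consistent` and `SignError` are hypothetical stand-ins for the crate's helper and error type, and the difference accumulator is only a source-level substitute for `silentops::ct_eq`.

```rust
#[derive(Debug, PartialEq)]
enum SignError {
    FaultDetected,
}

/// FIPS 205 Algorithm 4: split `x` into `out_len` integers of `b` bits each.
/// Preconditions: 1 <= b <= 31 and x.len() * 8 >= out_len * b.
fn base_2b(x: &[u8], b: u32, out_len: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(out_len);
    let (mut pos, mut bits, mut total) = (0usize, 0u32, 0u64);
    for _ in 0..out_len {
        while bits < b {
            total = (total << 8) | x[pos] as u64;
            pos += 1;
            bits += 8;
        }
        bits -= b;
        out.push(((total >> bits) & ((1u64 << b) - 1)) as u32);
    }
    out
}

/// Re-derives the indices from `md` and compares them to the consumed vector.
fn indices_consistent(md: &[u8], used: &[u32], a: u32) -> Result<(), SignError> {
    let recomputed = base_2b(md, a, used.len());
    // Accumulate all differences before the single final branch.
    let mut diff = 0u32;
    for (u, r) in used.iter().zip(recomputed.iter()) {
        diff |= u ^ r;
    }
    if diff != 0 {
        return Err(SignError::FaultDetected);
    }
    Ok(())
}
```

A single flipped bit in the stored index vector is enough to trip the check, while a fault that alters `md` before both derivations is not, as noted above.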

Cost. One extra message_to_indices (= one base_2b) per FORS signature — negligible byte-shuffling, no hashing, K * A / 8 bytes processed. The two Vec<u8> serialisations for silentops::ct_eq allocate 4 * K bytes each, freed at function return; well under any M0-baseline budget.

Composition. Orthogonal to T1-C (which compares two independent FORS signings to catch in-FORS faults) and to T1-D (which closes the template oracle on Keccak addresses). Under --features sca-fors-redundancy, T1-C’s fors_sign_into_redundant calls fors_sign_into twice and each call independently runs the T1-E check (if also enabled). KAT determinism preserved in every combination (sca-fors-indices-check on its own; combined with T1-D; combined with T1-D + T1-C).

Validation. Lib tests fors_indices_check_accepts_correct_shake128s / …_shake128f exercise the positive path on multiple seed permutations. fors_indices_check_rejects_flipped_index drives the helper with synthetically corrupted index vectors (one bit flipped, and a length mismatch) and asserts FaultDetected in each case. End-to-end determinism: the KAT run cargo test --release -p quantica --test slh_dsa_kat --features slh-dsa,sca-fors-indices-check yields 16/16 vectors byte-identical to the default path (~85 s wall-time vs ~80 s default; the overhead comes from the integrity check, and the signing itself is unchanged).

T4-B — PRF masking 

Addresses: DPA on SK.seed through the FORS / WOTS+ leaf PRF ([KGenetB+18]). The baseline construction is [Flu24] (3-share SHAKE), with a hardware-side alternative documented in [Saa24a].

Planned API (transparent wrapper over the existing hash::prf):

/// 3-share masked PRF. Emits the same byte string as
/// `hash::prf` but keeps `sk_seed` split into shares through
/// every SHAKE-absorb step, per Fluhrer's construction.
#[cfg(feature = "sca-masked-prf")]
pub fn prf_masked<P: Params>(
    pk_seed:   &[u8],
    sk_seed_s: &MaskedSeed,      // three shares of SK.seed
    adrs:      &Adrs,
) -> Vec<u8>;

Cost: roughly 1.7× per signature. Gated behind an opt-in feature until SHAKE masking lands in silentops.
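To make the share representation concrete, here is a minimal sketch of a 3-share split. It assumes a Boolean (XOR) sharing, which this chapter does not spell out, and it takes the two random masks as parameters because the standard library has no RNG. `split3` and `recombine` are hypothetical names, not the planned `MaskedSeed` API. The sketch covers only the data layout; masking the SHAKE permutation itself is the hard part and is out of scope here.

```rust
/// Splits `secret` into three shares with s0 ^ s1 ^ s2 == secret.
/// `r0` and `r1` must be fresh uniform random bytes of the same length.
fn split3(secret: &[u8], r0: &[u8], r1: &[u8]) -> [Vec<u8>; 3] {
    assert!(secret.len() == r0.len() && secret.len() == r1.len());
    let s2: Vec<u8> = secret
        .iter()
        .zip(r0)
        .zip(r1)
        .map(|((s, a), b)| s ^ a ^ b)
        .collect();
    [r0.to_vec(), r1.to_vec(), s2]
}

/// Recombines the shares. A masked implementation would avoid doing this
/// on secret data; it is shown here only to check the sharing.
fn recombine(shares: &[Vec<u8>; 3]) -> Vec<u8> {
    (0..shares[0].len())
        .map(|i| shares[0][i] ^ shares[1][i] ^ shares[2][i])
        .collect()
}
```

With uniform masks, any two of the three shares are statistically independent of the secret, which is the property a first-order DPA countermeasure relies on.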

T1-F — constant-time `fors_pk_from_sig` — shipped

Addresses: the secret-dependent branch if ((idx >> j) & 1) == 0 { ... } else { ... } inside the original FIPS-205 Algorithm 17. Verifier-side, the branch is on public data; but when the same routine is reused under T1-C as part of the signing-side redundancy check, its input becomes secret and a Rust if would re-introduce a timing leak.

Implementation: fors::fors_pk_from_sig in quantica/src/slh_dsa/fors.rs was reworked to a single constant-time routine. For every authentication-path level, the original branch is replaced by a byte-wise silentops::ct_select_u8 cswap that materialises the (left, right) hash_h inputs into two N-byte stack buffers, then calls hash_h(left, right) unconditionally. The tree_index written into adrs is identical in both original branches so it needs no extra masking. Scratch buffers are silentops::ct_zeroize-d at the end of the routine.

/// Constant-time FORS pk-from-sig (FIPS-205 Alg. 17). The
/// secret-dependent `hash_h` argument ordering is resolved by
/// a branchless `silentops::ct_select_u8` cswap. Single routine —
/// used by both the standalone verifier and the T1-C signing-side
/// redundancy check (where `idx` is secret).
pub fn fors_pk_from_sig<P: Params>(
    sig_fors: &[u8],
    md:       &[u8],
    pk_seed:  &[u8],
    adrs:     &mut Adrs,
) -> Vec<u8>;
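The branchless ordering of the `hash_h` inputs can be sketched as below. The `ct_select_u8` shown is a hypothetical stand-in with an assumed argument convention (mask 0x00 selects the first argument, 0xFF the second); the real `silentops::ct_select_u8` signature may differ. `order_hash_inputs` is likewise an illustrative name.

```rust
const N: usize = 16;

/// Returns `a` when mask == 0x00 and `b` when mask == 0xFF, without a branch.
fn ct_select_u8(a: u8, b: u8, mask: u8) -> u8 {
    a ^ (mask & (a ^ b))
}

/// Materialises (left, right) for hash_h from the running node and the
/// auth-path sibling. `idx_bit` is bit j of idx: 0 means the node is the
/// left child, 1 means it is the right child (FIPS 205 Alg. 17).
fn order_hash_inputs(node: &[u8; N], auth: &[u8; N], idx_bit: u8) -> ([u8; N], [u8; N]) {
    let mask = (idx_bit & 1).wrapping_neg(); // 0x00 or 0xFF
    let mut left = [0u8; N];
    let mut right = [0u8; N];
    for i in 0..N {
        left[i] = ct_select_u8(node[i], auth[i], mask);
        right[i] = ct_select_u8(auth[i], node[i], mask);
    }
    (left, right)
}
```

Both buffers are always written and the subsequent hash call is unconditional, so the instruction sequence does not depend on `idx_bit`.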

A previous variable-time sibling has been removed: keeping a single CT implementation eliminates the foot-gun of a future call site picking a leaky variant by autocomplete.

Validation: two round-trip tests in quantica/src/slh_dsa/fors.rs (fors_pk_from_sig_round_trip_shake128s and …_shake128f) exercise the sign → pk-from-sig pipeline across multiple seed / message permutations and assert that two back-to-back derivations agree (determinism) and produce N-byte outputs. End-to-end correctness against FIPS-205 reference output is covered by the KAT suite quantica/tests/slh_dsa_kat.rs.

Cost: 2 * N byte scratch (~32 B for SHAKE-128, ~64 B for SHAKE-256) plus 2 * N * A ct_select_u8 calls per FORS tree per signature. Negligible compared to the underlying SHAKE work.

T2-D — explicit unpoison of `R`, `digest`, indices 

Programmatic proof to ctgrind that the branches inside fors::fors_pk_from_sig, wots::chain_iter, xmss::xmss_sign_into / xmss_pk_from_sig are on data that has reached the “publish-ready” state. Closes the four suppressions listed in tools/ctgrind.supp. Zero-cost on production builds.