ML-DSA — countermeasures

FIPS spec:: [NationalIoSaTechnology24b]
Crate path:: quantica::ml_dsa
Cargo features:: ml-dsa (on by default); sca-protected (on by default, gates masking + shuffling); sca-masked-y (masked y pipeline, on in hardened builds); sca-ct-rejection (branchless rejection loop, on in hardened builds).

ML-DSA has the richest SCA threat surface of the three algorithms: it mixes a non-deterministic rejection-sampling loop, several secret polynomials used in linear combinations, and a Fiat–Shamir challenge that exposes bit-level intermediates. This chapter lists the countermeasures implemented in quantica::ml_dsa, indexed by threat class, followed by the hardening work-list (shipped and outstanding items).

Threat classes reference: Threat model. Primitive reference: Shared side-channel primitives — silentops. Verification methodology: Verification methodology.

Coverage matrix 

ML-DSA countermeasure / threat matrix
Threat	Status	Countermeasure(s)
SPA / SEMA on secret NTT	implemented	Fisher–Yates shuffled NTT for `s1`, `s2`, `t0` (`shuffle::ntt_shuffled`) using a signature-specific `ScaRng`.
DPA on `y` sampling	implemented	Masked `y` sampled as two shares from SHAKE256, kept as shares through masked NTT and masked `A·y` (`sca-masked-y`, `masked::masked_mat_vec_mul*`).
DPA on `s1`, `s2`, `t0`	implemented	First-order arithmetic masking kept across the rejection loop (`masked::MaskedPoly`, `masked_pointwise_mul_public`).
Timing on rejection loop	implemented	Compute all intermediates (cs1, z, cs2, r0, ct0, hint) every iteration, single branch-free accept/reject decision (`sca-ct-rejection`).
Software / remote timing	partial (interim)	All conditional selections route through `silentops::ct_*`; remaining ctgrind flags on branches-into-public-signature are documented in `tools/ctgrind.supp` pending hardening closure.
DFA on norm checks	partial	CT rejection loop already double-checks norms before emission; explicit redundant signing planned (see Roadmap chapter of the README).
Template attacks	implemented	NTT shuffling destroys trace alignment; masking multiplies the profile cost.
Higher-order DPA via mask re-use	implemented (`T1-A`)	Per-iteration `MaskedPoly::refresh` of every polynomial of `s1_hat_m`, `s2_hat_m`, `t0_hat_m` at the head of each rejection iteration (before any operation on the shares, per [HNP25] §4). Output bytes unchanged (mask cancels in unmask, KAT byte-identical).
Hermelink 2025/276 leakage map of masked-y gadgets	audit shipped (`T1-B`)	Information-theoretic audit pass over every masked gadget and unmask call site, classified against the Hermelink leak taxonomy (C1-C5) with a per-row follow-up tracker. See Hermelink 2025/276 audit pass on ml_dsa::masked.

SPA / SEMA — Fisher-Yates shuffled NTT 

Principle 

Same idea as ML-KEM: draw a random permutation of the NTT butterfly groups and of the butterflies within each group, then execute them in the permuted order. The shuffle is applied to the three secret vectors: s1 (l polynomials), s2 (k polynomials) and t0 (k polynomials). The public matrix A uses the classical, unshuffled NTT.

The permutations are drawn from a dedicated ScaRng seeded with K ‖ rnd ‖ tr ‖ M' (SHAKE256), so a given signature uses a reproducible but unpredictable-to-an-attacker order.
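The shuffling idea above can be sketched as follows. This is a minimal illustration, not the code in shuffle.rs: `DemoRng` is a hypothetical, non-cryptographic SplitMix64 stand-in for `ScaRng` (which the crate seeds from SHAKE256), and `demo_permutation` and `apply_shuffled` are names invented for this example. The sketch only shows that a seeded Fisher-Yates permutation is reproducible and that independent per-index operations give the same result in any order.

```rust
/// Hypothetical stand-in for `ScaRng`: a SplitMix64 generator.
/// NOT cryptographic; it only makes the example deterministic.
struct DemoRng(u64);

impl DemoRng {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in 0..bound via rejection sampling (avoids modulo bias).
    /// This loop is variable-time; the sketch makes no constant-time claim.
    fn below(&mut self, bound: u64) -> u64 {
        let zone = u64::MAX - (u64::MAX % bound);
        loop {
            let v = self.next_u64();
            if v < zone {
                return v % bound;
            }
        }
    }
}

/// Fisher-Yates: returns a permutation of 0..n drawn from `rng`.
fn demo_permutation(n: usize, rng: &mut DemoRng) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..n).collect();
    for i in (1..n).rev() {
        let j = rng.below(i as u64 + 1) as usize;
        perm.swap(i, j);
    }
    perm
}

/// Applies an independent per-index operation in the permuted order.
fn apply_shuffled(data: &mut [u32], perm: &[usize], op: impl Fn(u32) -> u32) {
    for &i in perm {
        data[i] = op(data[i]);
    }
}

fn main() {
    let mut rng = DemoRng(0x1234);
    let perm = demo_permutation(8, &mut rng);
    let mut data = [1u32, 2, 3, 4, 5, 6, 7, 8];
    apply_shuffled(&mut data, &perm, |x| x * 2);
    println!("order = {:?}, result = {:?}", perm, data);
}
```

The execution order changes with the seed while the result does not, which is the property the shuffled NTT relies on for butterflies that do not depend on each other within a layer.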

Published basis 

[XWT25] — original shuffling analysis (ML-KEM focus, but the technique transfers directly).
[AKL25] — Cortex-M4/M7 performance measurements for the shuffled variant.
[Azo25] — NIST’s recommended posture including shuffling as an SPA mitigation.

Code pointers 

Item	Location
Fisher-Yates permutation generator	`quantica/src/ml_dsa/shuffle.rs` `generate_permutation`
Shuffled NTT	`quantica/src/ml_dsa/shuffle.rs` `ntt_shuffled`
Call sites (Step 1 of `sign_internal`)	`quantica/src/ml_dsa/dsa.rs` around line 531 — three for-loops applying `ntt_shuffled` to each polynomial of `s1_hat`, `s2_hat`, `t0_hat`.
`ScaRng` construction + seeding	`quantica/src/ml_dsa/dsa.rs` lines 516-526; seed = SHAKE256 of the domain-separator tag `quantica-mldsa-sca-seed-v1` concatenated with `K`, `rnd`, `tr` and `M'`.

DPA — first-order masking of secret polynomials 

Principle 

Each secret polynomial (s1, s2, t0) is kept as a pair (P_0, P_1) with P = P_0 + P_1 (mod q). Operations taking a secret as operand (NTT, pointwise mul with public A, matrix-vector multiplication) are rewritten on shares. The A·y step is the most DPA-critical: it operates on the masked y and the public A, yielding a share representation of w that is unmasked only once in the accept/reject logic.
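As a rough illustration of the share arithmetic, the sketch below masks a single coefficient modulo q = 8380417 (the ML-DSA modulus), multiplies both shares by a public value, and unmasks. The `Shares` type and the helper names are assumptions made for this example and do not mirror `MaskedPoly`; the `%` reductions are used for readability only and are not a constant-time choice.

```rust
/// ML-DSA modulus q (FIPS 204).
const Q: u32 = 8_380_417;

/// Two arithmetic shares of one coefficient: value = s0 + s1 (mod q).
/// Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug)]
struct Shares(u32, u32);

fn add_q(a: u32, b: u32) -> u32 {
    ((a as u64 + b as u64) % Q as u64) as u32
}

/// Requires a, b < q.
fn sub_q(a: u32, b: u32) -> u32 {
    ((a as u64 + Q as u64 - b as u64) % Q as u64) as u32
}

fn mul_q(a: u32, b: u32) -> u32 {
    ((a as u64 * b as u64) % Q as u64) as u32
}

/// Splits `secret` into two shares using the random value `r`.
fn mask(secret: u32, r: u32) -> Shares {
    let r = r % Q;
    Shares(sub_q(secret % Q, r), r)
}

/// Multiplication by a public value is linear, so it acts share-wise.
fn mul_public(s: Shares, a: u32) -> Shares {
    Shares(mul_q(s.0, a), mul_q(s.1, a))
}

/// Recombines the shares; in a masked design this happens as late as possible.
fn unmask(s: Shares) -> u32 {
    add_q(s.0, s.1)
}

fn main() {
    let shares = mask(5, 777);
    let product = mul_public(shares, 3);
    println!("{:?} -> {}", product, unmask(product));
}
```

Because every operation with a public operand is linear, neither share alone depends on the secret value; only the recombination does.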

Published basis 

[RRCC24] — masked hardware ML-DSA (reference construction, we follow the same share topology in software).
[Azo25] — masking recommendation.

Code pointers 

Item	Location
`MaskedPoly` type + arithmetic helpers	`quantica/src/ml_dsa/masked.rs` (`masked_ntt`, `masked_ntt_inv`, `masked_pointwise_mul_public`, `masked_mat_vec_mul`, `masked_mat_vec_mul_lazy`)
Call sites for masked NTT on secrets	`quantica/src/ml_dsa/dsa.rs` — lines around the `s1_hat`, `s2_hat`, `t0_hat` initialization under `#[cfg(feature = "sca-protected")]`.
Zeroization of masked polynomials	`quantica/src/ml_dsa/masked.rs` `zeroize_poly`, `zeroize_bytes`.

DPA on `y` — the `sca-masked-y` pipeline 

Principle 

The masking vector y is the main DPA target: the published signature component z = y + c·s1 is a linear combination of y and s1, and c is public. If y ever appears unmasked on the power trace, averaging many signatures on equal message / equal c lets an attacker estimate y and then solve z = y + c·s1 for s1.

quantica samples y as two arithmetic shares directly from SHAKE256, runs masked NTT on the shares, computes A·y with the public matrix on the shares, and unmasks w = A·y only when the rejection loop has committed to publishing it — exactly the construction of [CGerardL+24].
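The reason A·y can be computed without ever recombining y is linearity: A·(y_0 + y_1) = A·y_0 + A·y_1 (mod q). The sketch below checks this on scalar entries, which is how each coefficient behaves in the NTT domain. The function names are invented for this example and are not the signatures of `masked_mat_vec_mul`.

```rust
/// ML-DSA modulus q (FIPS 204).
const Q: u64 = 8_380_417;

/// Plain matrix-vector product mod q on scalar entries.
fn mat_vec(a: &[Vec<u64>], y: &[u64]) -> Vec<u64> {
    a.iter()
        .map(|row| {
            row.iter()
                .zip(y)
                .fold(0u64, |acc, (&aij, &yj)| (acc + (aij % Q) * (yj % Q)) % Q)
        })
        .collect()
}

/// Share-wise product: the public matrix is applied to each share separately,
/// so the two shares of y are never added together here.
fn masked_mat_vec(a: &[Vec<u64>], y0: &[u64], y1: &[u64]) -> (Vec<u64>, Vec<u64>) {
    (mat_vec(a, y0), mat_vec(a, y1))
}

/// Recombines the shares of w.
fn unmask_vec(w0: &[u64], w1: &[u64]) -> Vec<u64> {
    w0.iter().zip(w1).map(|(&a, &b)| (a + b) % Q).collect()
}

fn main() {
    let a = vec![vec![1, 2], vec![3, 4]];
    let y = [5u64, 6];
    // Second share chosen arbitrarily; first share is y - y1 (mod q).
    let y1 = [100u64, Q - 1];
    let y0: Vec<u64> = y.iter().zip(&y1).map(|(&v, &r)| (v + Q - r) % Q).collect();

    let (w0, w1) = masked_mat_vec(&a, &y0, &y1);
    assert_eq!(unmask_vec(&w0, &w1), mat_vec(&a, &y));
    println!("w = {:?}", unmask_vec(&w0, &w1));
}
```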

Published basis 

[CGerardL+24] — canonical high-order masked generation of the masking vector and masked rejection sampling gadget (TCHES 2024.4). Construction followed by our implementation.
[BelaidBD+26] — SUCRE (TCHES 2026.1), a shuffle-and-unmask alternative that delivers 4–6× speedup on the same security claim. Candidate for T4-A migration evaluation (see below).

Known attacks against the construction 

[HNP25] (CRYPTO 2025): information-theoretic leakage map of masked-y implementations at first, second, and higher orders. Not a break of the construction itself, but an auditor’s checklist for the gadgets instantiating it. T1-B tracks the application of this checklist to our code.
[DFM+25] (ASIACRYPT 2025): introduces concealed ILWE with Huber/Cauchy regression; breaks masked-Dilithium implementations that leak up to 90% of the shares. Motivates strong care on the masked-NTT and masked-A·y gadgets.
[ZCQ+26] (DATE 2026): non-profiling attack on the unmasked / hedged rejection loop (96 traces for c, ~300 traces for the key on a Cortex-M4 target). Primary motivator for the sca-ct-rejection feature below.

Code pointers 

Item	Location
Masked `y` sampling	`quantica/src/ml_dsa/masked.rs` (look for `masked_expand_mask` in the sampling region).
Masked `A·y`	`quantica/src/ml_dsa/masked.rs` `masked_mat_vec_mul` / `masked_mat_vec_mul_lazy`.
Call site in the sign loop	`quantica/src/ml_dsa/dsa.rs` — inside the rejection loop of `sign_internal`, under `#[cfg(feature = "sca-masked-y")]`.
Tests	`quantica/src/ml_dsa/masked.rs` end-of-file (`masked_expand_mask_matches_unmasked_expand_mask`, `masked_mat_vec_mul_matches_unmasked`).

Timing — constant-time rejection loop 

Principle 

The FIPS 204 rejection loop as written branches out of the iteration as soon as the norm check fails:

repeat
    compute w, w1, c, z, r0
    if norm(z) >= gamma1 - beta then restart
    if norm(r0) >= gamma2 - beta then restart
    ...
until accepted

A timing observer can therefore tell at which test the candidate was rejected, which leaks information about z and r0 — and thereby about s1, s2 [LWW+25].

The sca-ct-rejection feature rewrites the loop so that every iteration computes all intermediates (cs1, z, cs2, r0, ct0, hint) and accumulates a single branch-free accept flag that is consulted only at the very end of the iteration. The loop still repeats until a candidate is accepted, but the timing of an iteration no longer reveals which norm check caused the rejection.
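A minimal sketch of the accumulate-then-decide pattern is shown below. The helper names (`ct_lt`, `ct_abs`, `norm_ok`, `accept_flag`) are hypothetical and are not the `silentops::ct_*` or `infinity_norm_*` functions of the crate; the sketch assumes centered coefficients well below 2^31 in magnitude, and it does not guarantee that a compiler keeps the code branch-free.

```rust
/// Returns 1 if a < b, else 0, without a data-dependent branch.
/// Assumes a, b < 2^31 so the sign bit of the wrapped difference is the borrow.
fn ct_lt(a: u32, b: u32) -> u32 {
    a.wrapping_sub(b) >> 31
}

/// Branch-free absolute value of a centered coefficient.
fn ct_abs(x: i32) -> u32 {
    let m = x >> 31; // all ones if x is negative, zero otherwise
    (x ^ m).wrapping_sub(m) as u32
}

/// Returns 1 if every |coeff| < bound. All coefficients are always scanned;
/// there is no early exit on the first violation.
fn norm_ok(coeffs: &[i32], bound: u32) -> u32 {
    let mut ok = 1u32;
    for &c in coeffs {
        ok &= ct_lt(ct_abs(c), bound);
    }
    ok
}

/// Both checks are always evaluated and folded into one flag with AND.
fn accept_flag(z: &[i32], r0: &[i32], z_bound: u32, r0_bound: u32) -> u32 {
    norm_ok(z, z_bound) & norm_ok(r0, r0_bound)
}

fn main() {
    let z = [1, -2, 3];
    let r0 = [0, 4];
    // The flag is consulted once, after all intermediates exist.
    let accept = accept_flag(&z, &r0, 5, 5);
    println!("accept = {}", accept);
}
```

In this shape a rejected candidate costs the same work whichever check failed, so only the accept/reject outcome of the iteration is observable through timing.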

Published basis 

[LWW+25] — the initial timing-leak analysis that motivates the countermeasure.
[ZCQ+26] (DATE 2026) — a non-profiling public-template attack recovering c in 96 traces and the signing key in ~300 traces on a Cortex-M4 target in hedged / unprotected mode. The sca-ct-rejection feature is the intended answer to this attack class.

Code pointers 

Item	Location
Rejection loop with branch-free accept	`quantica/src/ml_dsa/dsa.rs` `sign_internal`, block guarded by `#[cfg(feature = "sca-ct-rejection")]`.
Norm-check helpers returning bit flags	`quantica/src/ml_dsa/dsa.rs` (look for `infinity_norm_*` inside the `sca-ct-rejection` region; the flags feed a single-variable AND accumulator).

Template attacks 

Template attacks against ML-DSA rely on profile-matching the NTT coefficients of s1, s2, t0 or the y sampling. The defences already described — masking + shuffling — destroy the inter-trace alignment a template attack depends on, and multiply the profile size the attacker has to maintain.

See Threat model for cost estimates; see [Chh26] for a practical profile against an unprotected Cortex-M0 implementation and the required trace counts once shuffling is in place.

Planned hardening 

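The following items make up the hardening work-list. T2-A and T2-B each close one or more of the tools/ctgrind.supp entries documented under Verification methodology, and T2-C covers the resulting documentation. T1-A and T1-B have already shipped and are kept here for traceability; T4-A is an evaluation task.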

T2-A — explicit ct_grind::unpoison after the algorithmic unmasking point of w1, h, z. Lets ctgrind re-verify with zero suppressions on the decompose::high_bits_vec, encode::w1_encode, decompose::make_hint_vec, encode::sig_encode paths.
T2-B — branch-free generate_permutation (Feistel- or Floyd-based) to close the suppression on shuffle::generate_permutation.
T1-A — A3: refresh the shares of s1, s2, t0 at the start of every rejection iteration to defeat higher-order DPA variants that combine two iterations’ leakage — shipped. The dsa.rs rejection loop opens with a #[cfg(feature = "sca-protected")] block that calls MaskedPoly::refresh on every polynomial of s1_hat_m, s2_hat_m, t0_hat_m before any operation on the shares — the Hermelink [HNP25] §4 prescription matched exactly. Output bytes are byte-identical to the pre-T1-A baseline (mask cancels in unmask); cost is unchanged versus the previous end-of-cs/ct refresh placement (same number of ScaRng bytes consumed per iteration). Audit row flipped to protected in Hermelink 2025/276 audit pass on ml_dsa::masked.
T2-C — documentation traceability: after A/B/C land, the historical suppression file becomes a “resolved-findings” annex in Verification methodology.
T4-A — SUCRE migration evaluation ([BelaidBD+26]). Benchmark sca-masked-y against SUCRE’s shuffle-and-unmask gadget on our target platforms (Cortex-M4 class). Migrate the masked rejection path if the published 4–6× speedup holds on-device and the transient memory footprint fits the embedded budget. The existing masked-y pipeline remains the fallback if the speedup is swallowed by our other constraints.
T1-B — Hermelink audit pass on masked.rs — shipped ([HNP25]). The information-theoretic leakage map of CRYPTO 2025 has been applied to every gadget of quantica/src/ml_dsa/masked.rs and every unmask call site of the rejection loop in quantica/src/ml_dsa/dsa.rs::sign_internal; each row is classified as protected, partial, or acknowledged residual risk, with a per-row follow-up pointer. The full audit annex is Hermelink 2025/276 audit pass on ml_dsa::masked. Primary follow-up surfaced by the audit — T1-A (per-iteration share refresh) — has since shipped, closing the C4 sufficiency row and reducing the C1 residuals to the plaintext-aggregate floor. Remaining open follow-ups (Tier-2 CT norm-on-shares and Tier-3 share-domain Decompose/MakeHint) are tracked in the audit’s work-list.
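The per-iteration share refresh described under T1-A can be illustrated with the sketch below. `DemoMaskedPoly` and its methods are hypothetical and do not reproduce `MaskedPoly::refresh`; in particular the fresh randomness is passed in as a slice here, whereas the crate draws it from `ScaRng`. The sketch only demonstrates why the output bytes stay unchanged: the same value is added to one share and subtracted from the other.

```rust
/// ML-DSA modulus q (FIPS 204).
const Q: u32 = 8_380_417;

/// Hypothetical two-share polynomial: coefficient i is s0[i] + s1[i] (mod q).
struct DemoMaskedPoly {
    s0: Vec<u32>,
    s1: Vec<u32>,
}

impl DemoMaskedPoly {
    /// Re-randomizes the shares with fresh values; the masked value is unchanged
    /// because r is added to one share and subtracted from the other.
    fn refresh(&mut self, fresh: &[u32]) {
        for ((a, b), &r) in self.s0.iter_mut().zip(self.s1.iter_mut()).zip(fresh) {
            let r = r % Q;
            *a = (*a + r) % Q;
            *b = (*b + Q - r) % Q;
        }
    }

    /// Recombines the shares (shown only to check the invariant).
    fn unmask(&self) -> Vec<u32> {
        self.s0
            .iter()
            .zip(&self.s1)
            .map(|(&a, &b)| (a + b) % Q)
            .collect()
    }
}

fn main() {
    let mut p = DemoMaskedPoly {
        s0: vec![1, 2, 3],
        s1: vec![10, 20, Q - 1],
    };
    let before = p.unmask();
    p.refresh(&[7, 8, 9]);
    assert_eq!(p.unmask(), before);
    println!("value preserved: {:?}", before);
}
```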