ML-DSA — countermeasures

FIPS spec: [NationalIoSaTechnology24b]

Crate path: quantica::ml_dsa

Cargo features: ml-dsa (on by default); sca-protected (on by default, gates masking + shuffling); sca-masked-y (masked y pipeline, on in hardened builds); sca-ct-rejection (branchless rejection loop, on in hardened builds).

ML-DSA has the richest SCA threat surface of the three algorithms: it mixes a non-deterministic rejection-sampling loop, several secret polynomials used in linear combinations, and a Fiat–Shamir challenge that exposes bit-level intermediates. This chapter lists the countermeasures implemented in quantica::ml_dsa, indexed by threat class, and the hardening items still planned for the next round.

Threat classes reference: Threat model. Primitive reference: Shared side-channel primitives — silentops. Verification methodology: Verification methodology.

Coverage matrix

ML-DSA countermeasure / threat matrix

Threat | Status | Countermeasure(s)
SPA / SEMA on secret NTT | implemented | Fisher–Yates shuffled NTT for s1, s2, t0 (shuffle::ntt_shuffled) using a signature-specific ScaRng.
DPA on y sampling | implemented | Masked y sampled as two shares from SHAKE256, kept as shares through masked NTT and masked A·y (sca-masked-y, masked::masked_mat_vec_mul*).
DPA on s1, s2, t0 | implemented | First-order arithmetic masking kept across the rejection loop (masked::MaskedPoly, masked_pointwise_mul_public).
Timing on rejection loop | implemented | Compute all intermediates (cs1, z, cs2, r0, ct0, hint) every iteration, single branch-free accept/reject decision (sca-ct-rejection).
Software / remote timing | partial (interim) | All conditional selections route through silentops::ct_*; remaining ctgrind flags on branches-into-public-signature are documented in tools/ctgrind.supp pending hardening closure.
DFA on norm checks | partial | CT rejection loop already double-checks norms before emission; explicit redundant signing planned (see Roadmap chapter of the README).
Template attacks | implemented | NTT shuffling destroys trace alignment; masking multiplies the profile cost.
Higher-order DPA via mask re-use | implemented (T1-A) | Per-iteration MaskedPoly::refresh of every polynomial of s1_hat_m, s2_hat_m, t0_hat_m at the head of each rejection iteration (before any operation on the shares, per [HNP25] §4). Output bytes unchanged (mask cancels in unmask, KAT byte-identical).
Hermelink 2025/276 leakage map of masked-y gadgets | audit shipped (T1-B) | Information-theoretic audit pass over every masked gadget and unmask call site, classified against the Hermelink leak taxonomy (C1-C5) with a per-row follow-up tracker. See Hermelink 2025/276 audit pass on ml_dsa::masked.

SPA / SEMA — Fisher-Yates shuffled NTT

Principle

Same idea as ML-KEM: draw a random permutation of the NTT butterfly groups and of the butterflies within each group, execute in the permuted order. The shuffle is applied to s1 (l polynomials), s2 (k) and t0 (k) — the three secret vectors. The public matrix A uses the classical NTT.

The permutations are drawn from a dedicated ScaRng seeded with SHAKE256 over K ‖ rnd ‖ tr ‖ M' (prefixed by a domain-separator tag, see Code pointers), so a given signature uses a reproducible order that an attacker cannot predict.
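The permutation step can be illustrated with a minimal, self-contained sketch. Everything named here is hypothetical: ToyRng is a non-cryptographic xorshift64* stand-in for ScaRng, and sketch_permutation is not the crate's generate_permutation. The modulo reduction is slightly biased and the loop is not hardened against timing; the sketch only shows the Fisher-Yates structure and the "visit every index once, in permuted order" pattern.

```rust
/// Hypothetical stand-in for ScaRng: a xorshift64* generator.
/// NOT cryptographically secure; it only keeps the sketch self-contained.
pub struct ToyRng(u64);

impl ToyRng {
    pub fn from_seed(seed: u64) -> Self {
        // The xorshift state must never be zero.
        ToyRng(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }
}

/// Fisher-Yates shuffle: returns a permutation of 0..n.
pub fn sketch_permutation(n: usize, rng: &mut ToyRng) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..n).collect();
    for i in (1..n).rev() {
        // Pick j in 0..=i. The modulo is slightly biased; fine for a sketch.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        perm.swap(i, j);
    }
    perm
}

/// Runs `op` once per index, in the permuted order.
/// In a shuffled NTT, `op` would be one butterfly (or butterfly group).
pub fn for_each_shuffled<F: FnMut(usize)>(perm: &[usize], mut op: F) {
    for &idx in perm {
        op(idx);
    }
}
```

Because the generator is seeded deterministically, the same seed reproduces the same order, which mirrors the "reproducible per signature" property described above.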

Published basis

  • [XWT25] — original shuffling analysis (ML-KEM focus, but the technique transfers directly).

  • [AKL25] — Cortex-M4/M7 performance measurements for the shuffled variant.

  • [Azo25] — NIST’s recommended posture including shuffling as an SPA mitigation.

Code pointers

Item | Location
Fisher-Yates permutation generator | quantica/src/ml_dsa/shuffle.rs generate_permutation
Shuffled NTT | quantica/src/ml_dsa/shuffle.rs ntt_shuffled
Call sites (Step 1 of sign_internal) | quantica/src/ml_dsa/dsa.rs around line 531: three for-loops applying ntt_shuffled to each polynomial of s1_hat, s2_hat, t0_hat.
ScaRng construction + seeding | quantica/src/ml_dsa/dsa.rs lines 516-526; seed = SHAKE256 of the domain-separator tag quantica-mldsa-sca-seed-v1 concatenated with K, rnd, tr and M'.

DPA — first-order masking of secret polynomials

Principle

Each secret polynomial (s1, s2, t0) is kept as a pair (P_0, P_1) with P = P_0 + P_1 (mod q). Operations taking a secret as operand (NTT, pointwise mul with public A, matrix-vector multiplication) are rewritten on shares. The A·y step is the most DPA-critical: it operates on the masked y and the public A, yielding a share representation of w that is unmasked only once in the accept/reject logic.
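A minimal sketch of the share representation, assuming coefficients are stored as values in [0, q) with q = 8380417 (the ML-DSA modulus). The type and function names below are illustrative and are not the crate's MaskedPoly API; the `%` reductions are not constant-time and the randomness is supplied by the caller, so this shows the algebra only.

```rust
/// ML-DSA modulus q.
pub const Q: u32 = 8_380_417;

/// Two arithmetic shares with value = (s0 + s1) mod Q, coefficient-wise.
/// Hypothetical type for illustration.
pub struct Shares {
    pub s0: Vec<u32>,
    pub s1: Vec<u32>,
}

pub fn sub_q(a: u32, b: u32) -> u32 {
    (a % Q + Q - b % Q) % Q
}

pub fn mul_q(a: u32, b: u32) -> u32 {
    ((a as u64 * b as u64) % Q as u64) as u32
}

/// Splits `secret` into shares using caller-supplied randomness in [0, Q).
pub fn mask(secret: &[u32], random: &[u32]) -> Shares {
    debug_assert_eq!(secret.len(), random.len());
    let s0: Vec<u32> = random.iter().map(|&r| r % Q).collect();
    let s1: Vec<u32> = secret.iter().zip(&s0).map(|(&s, &r)| sub_q(s, r)).collect();
    Shares { s0, s1 }
}

/// Multiplication by a public operand is linear, so it is applied to
/// each share independently and the secret is never recombined.
pub fn pointwise_mul_public(x: &Shares, public: &[u32]) -> Shares {
    let mul = |share: &[u32]| -> Vec<u32> {
        share.iter().zip(public).map(|(&a, &p)| mul_q(a, p)).collect()
    };
    Shares { s0: mul(&x.s0), s1: mul(&x.s1) }
}

/// Recombines the shares; in a protected design this happens as late as possible.
pub fn unmask(x: &Shares) -> Vec<u32> {
    x.s0.iter().zip(&x.s1).map(|(&a, &b)| (a + b) % Q).collect()
}
```

The point of the sketch is that every linear operation with a public operand commutes with the sharing, which is why the NTT and the multiplication by A can be rewritten on shares.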

Published basis

  • [RRCC24] — masked hardware ML-DSA (reference construction, we follow the same share topology in software).

  • [Azo25] — masking recommendation.

Code pointers

Item | Location
MaskedPoly type + arithmetic helpers | quantica/src/ml_dsa/masked.rs (masked_ntt, masked_ntt_inv, masked_pointwise_mul_public, masked_mat_vec_mul, masked_mat_vec_mul_lazy)
Call sites for masked NTT on secrets | quantica/src/ml_dsa/dsa.rs: lines around the s1_hat, s2_hat, t0_hat initialization under #[cfg(feature = "sca-protected")].
Zeroization of masked polynomials | quantica/src/ml_dsa/masked.rs zeroize_poly, zeroize_bytes.

DPA on y — the sca-masked-y pipeline

Principle

The masking vector y is the main target for DPA. The published signature component z = y + c·s1 is a linear combination of y and s1 with public c, so if y ever appears unmasked on the power trace, averaging over many signatures with equal message / equal c recovers y and, from it, s1.

quantica samples y as two arithmetic shares directly from SHAKE256, runs masked NTT on the shares, computes A·y with the public matrix on the shares, and unmasks w = A·y only when the rejection loop has committed to publishing it — exactly the construction of [CGerardL+24].
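The "A·y on the shares" step relies only on linearity: A·(y_0 + y_1) = A·y_0 + A·y_1 (mod q). The sketch below illustrates this with scalar entries standing in for single NTT slots of the polynomials; the function names are hypothetical and are not the signatures of masked_mat_vec_mul, and the arithmetic is not constant-time.

```rust
/// ML-DSA modulus q.
pub const Q64: u64 = 8_380_417;

/// w = A·y mod q for a public k×l matrix `a` and a length-l vector `y`.
/// Scalars stand in for one NTT slot of each polynomial.
pub fn mat_vec_mul(a: &[Vec<u64>], y: &[u64]) -> Vec<u64> {
    a.iter()
        .map(|row| {
            row.iter()
                .zip(y)
                .fold(0u64, |acc, (&aij, &yj)| (acc + (aij % Q64) * (yj % Q64)) % Q64)
        })
        .collect()
}

/// Applies the public matrix to each share of y separately.
/// The two results are shares of w; y itself is never recombined.
pub fn masked_mat_vec_mul_sketch(
    a: &[Vec<u64>],
    y0: &[u64],
    y1: &[u64],
) -> (Vec<u64>, Vec<u64>) {
    (mat_vec_mul(a, y0), mat_vec_mul(a, y1))
}

/// Recombines the shares of w.
pub fn unmask_w(w0: &[u64], w1: &[u64]) -> Vec<u64> {
    w0.iter().zip(w1).map(|(&a, &b)| (a + b) % Q64).collect()
}
```

In this sketch the cost is two matrix-vector products instead of one, which is the price of keeping y in shares up to the unmasking of w.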

Published basis

  • [CGerardL+24] — canonical high-order masked generation of the masking vector and masked rejection sampling gadget (TCHES 2024.4). Construction followed by our implementation.

  • [BelaidBD+26] — SUCRE (TCHES 2026.1), a shuffle-and-unmask alternative that delivers 4–6× speedup on the same security claim. Candidate for T4-A migration evaluation (see below).

Known attacks against the construction

  • [HNP25] (CRYPTO 2025): information-theoretic leakage map of masked-y implementations at first, second, and higher orders. Not a break of the construction itself, but an auditor’s checklist for the gadgets instantiating it. T1-B tracks the pass-through of this checklist on our code.

  • [DFM+25] (ASIACRYPT 2025): introduces concealed ILWE with Huber/Cauchy regression; breaks masked-Dilithium implementations that leak up to 90% of the shares. Motivates strong care on the masked-NTT and masked-A·y gadgets.

  • [ZCQ+26] (DATE 2026): non-profiling attack on the unmasked / hedged rejection loop (96 traces for c, ~300 traces for the key on a Cortex-M4 target). Primary motivator for the sca-ct-rejection feature below.

Code pointers

Item | Location
Masked y sampling | quantica/src/ml_dsa/masked.rs (look for masked_expand_mask in the sampling region).
Masked A·y | quantica/src/ml_dsa/masked.rs masked_mat_vec_mul / masked_mat_vec_mul_lazy.
Call site in the sign loop | quantica/src/ml_dsa/dsa.rs: inside the rejection loop of sign_internal, under #[cfg(feature = "sca-masked-y")].
Tests | quantica/src/ml_dsa/masked.rs end-of-file (masked_expand_mask_matches_unmasked_expand_mask, masked_mat_vec_mul_matches_unmasked).

Timing — constant-time rejection loop

Principle

The FIPS 204 rejection loop as written branches out of the iteration as soon as a norm check fails:

repeat
    compute w, w1, c, z, r0
    if norm(z) >= gamma1 - beta then restart
    if norm(r0) >= gamma2 - beta then restart
    ...
until accepted

A timing observer can therefore tell at which test the candidate was rejected, which leaks information about z and r0 — and thereby about s1, s2 [LWW+25].

The sca-ct-rejection feature rewrites the loop so that every iteration computes all intermediates (cs1, z, cs2, r0, ct0, hint) and accumulates a single branch-free accept flag that is consulted only at the very end of the iteration. The loop still repeats until a candidate is accepted, but an observer of a single iteration cannot tell which check caused the rejection.
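A minimal sketch of the accumulator pattern, restricted to the two norm checks shown in the pseudocode above. The function names are hypothetical and are not the crate's infinity_norm_* helpers. The sketch assumes centered signed coefficients and bounds well below 2^30, and it does not by itself guarantee that the compiler emits branch-free machine code; that still has to be checked on the compiled binary.

```rust
/// Returns 1 if every |c| < bound, else 0, with no early exit and no
/// data-dependent branch in the source. Assumes |c| and bound are < 2^30.
pub fn norm_below_flag(coeffs: &[i32], bound: i32) -> u32 {
    let mut ok: u32 = 1;
    for &c in coeffs {
        // Branch-free absolute value: m is all-ones when c is negative.
        let m = c >> 31;
        let abs = (c ^ m).wrapping_sub(m);
        // abs - bound is negative exactly when abs < bound; extract the sign bit.
        let below = (abs.wrapping_sub(bound) as u32) >> 31;
        // AND accumulator: a single failing coefficient clears the flag for good.
        ok &= below;
    }
    ok
}

/// Evaluates both checks unconditionally and combines them into one flag,
/// so the work done does not depend on which check fails.
pub fn accept_flag(z: &[i32], r0: &[i32], gamma1: i32, gamma2: i32, beta: i32) -> u32 {
    let z_ok = norm_below_flag(z, gamma1 - beta);
    let r0_ok = norm_below_flag(r0, gamma2 - beta);
    z_ok & r0_ok
}
```

With the ML-DSA-44 parameters (gamma1 = 2^17, gamma2 = 95232, beta = 78), a call such as accept_flag(&z, &r0, 131_072, 95_232, 78) returns 1 only when both vectors are within bounds; the caller branches on that single flag at the end of the iteration.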

Published basis

  • [LWW+25] — the initial timing-leak analysis that motivates the countermeasure.

  • [ZCQ+26] (DATE 2026) — a non-profiling public-template attack recovering c in 96 traces and the signing key in ~300 traces on a Cortex-M4 target in hedged / unprotected mode. The sca-ct-rejection feature is the intended answer to this attack class.

Code pointers

Item | Location
Rejection loop with branch-free accept | quantica/src/ml_dsa/dsa.rs sign_internal, block guarded by #[cfg(feature = "sca-ct-rejection")].
Norm-check helpers returning bit flags | quantica/src/ml_dsa/dsa.rs (look for infinity_norm_* inside the sca-ct-rejection region; the flags feed a single-variable AND accumulator).

Template attacks

Template attacks against ML-DSA rely on profile-matching the NTT coefficients of s1, s2, t0 or the y sampling. The defences already described — masking + shuffling — destroy the inter-trace alignment a template attack depends on, and multiply the profile size the attacker has to maintain.

See Threat model for cost estimates; see [Chh26] for a practical profile against an unprotected Cortex-M0 implementation and the required trace counts once shuffling is in place.

Planned hardening

The following items are scheduled for the next hardening round; each closes one of the tools/ctgrind.supp entries documented under Verification methodology.

  • T2-A — explicit ct_grind::unpoison after the algorithmic unmasking point of w1, h, z. Lets ctgrind re-verify with zero suppressions on the decompose::high_bits_vec, encode::w1_encode, decompose::make_hint_vec, encode::sig_encode paths.

  • T2-B — branch-free generate_permutation (Feistel- or Floyd-based) to close the suppression on shuffle::generate_permutation.

  • T1-A — A3 (shipped): refresh the shares of s1, s2, t0 at the start of every rejection iteration to defeat higher-order DPA variants that combine two iterations’ leakage. The dsa.rs rejection loop opens with a #[cfg(feature = "sca-protected")] block that calls MaskedPoly::refresh on every polynomial of s1_hat_m, s2_hat_m, t0_hat_m before any operation on the shares, as prescribed by [HNP25] §4. Output bytes are byte-identical to the pre-T1-A baseline (the mask cancels in unmask); cost is unchanged versus the previous end-of-cs/ct refresh placement (same number of ScaRng bytes consumed per iteration). The audit row is flipped to protected in Hermelink 2025/276 audit pass on ml_dsa::masked.

  • T2-C — documentation traceability: after A/B/C land, the historical suppression file becomes a “resolved-findings” annex in Verification methodology.

  • T4-A — SUCRE migration evaluation ([BelaidBD+26]). Benchmark sca-masked-y against SUCRE’s shuffle-and-unmask gadget on our target platforms (Cortex-M4 class). Migrate the masked rejection path if the published 4–6× speedup holds on-device and the transient memory footprint fits the embedded budget. The existing masked-y pipeline remains the fallback if the speedup is swallowed by our other constraints.

  • T1-B — Hermelink audit pass on masked.rs (shipped, [HNP25]). The information-theoretic leakage map of CRYPTO 2025 has been applied to every gadget of quantica/src/ml_dsa/masked.rs and every unmask call site of the rejection loop in quantica/src/ml_dsa/dsa.rs::sign_internal; each row is classified as protected, partial, or acknowledged residual risk, with a per-row follow-up pointer. The full audit annex is Hermelink 2025/276 audit pass on ml_dsa::masked. Primary follow-up surfaced by the audit, T1-A (per-iteration share refresh), has since shipped, closing the C4 sufficiency row and reducing the C1 residuals to the plaintext-aggregate floor. Remaining open follow-ups (Tier-2 CT norm-on-shares and Tier-3 share-domain Decompose/MakeHint) are tracked in the audit’s work-list.
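The per-iteration share refresh listed under T1-A can be sketched as follows. This is an illustration of first-order arithmetic re-sharing, not the body of MaskedPoly::refresh: the function name is hypothetical, the fresh randomness is passed in by the caller (the crate draws it from ScaRng), shares are assumed to lie in [0, q), and the `%` reductions are not constant-time.

```rust
/// ML-DSA modulus q.
pub const Q_MOD: u32 = 8_380_417;

/// Re-randomizes a sharing in place: s0 += r and s1 -= r (mod q).
/// The represented value (s0 + s1) mod q is unchanged, but both shares
/// become independent of the shares used in the previous iteration.
pub fn refresh_shares(s0: &mut [u32], s1: &mut [u32], fresh: &[u32]) {
    debug_assert_eq!(s0.len(), s1.len());
    debug_assert_eq!(s0.len(), fresh.len());
    for ((a, b), &r) in s0.iter_mut().zip(s1.iter_mut()).zip(fresh) {
        let r = r % Q_MOD;
        *a = (*a + r) % Q_MOD;
        *b = (*b + Q_MOD - r) % Q_MOD;
    }
}
```

Since the added and subtracted mask cancels on recombination, the unmasked value and therefore the signature bytes are unaffected, which is consistent with the byte-identical KAT result reported for T1-A.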