###################################################################
ML-DSA: countermeasures
###################################################################

:FIPS spec: :cite:`fips204`
:Crate path: ``quantica::ml_dsa``
:Cargo features: ``ml-dsa`` (on by default); ``sca-protected`` (on by
   default, gates masking + shuffling); ``sca-masked-y`` (masked ``y``
   pipeline, on in hardened builds); ``sca-ct-rejection`` (branchless
   rejection loop, on in hardened builds).

ML-DSA has the richest SCA threat surface of the three algorithms: it
mixes a non-deterministic rejection-sampling loop, several secret
polynomials used in linear combinations, and a Fiat–Shamir challenge
that exposes bit-level intermediates. This chapter lists the
countermeasures implemented in ``quantica::ml_dsa``, indexed by threat
class, and the hardening items still outstanding for the next round.

Threat classes reference: :doc:`../threat_model`.
Primitive reference: :doc:`../primitives`.
Verification methodology: :doc:`../verification`.

.. contents::
   :local:
   :depth: 2

Coverage matrix
===============

.. list-table:: ML-DSA countermeasure / threat matrix
   :header-rows: 1
   :widths: 25 18 57

   * - Threat
     - Status
     - Countermeasure(s)
   * - SPA / SEMA on secret NTT
     - implemented
     - Fisher–Yates shuffled NTT for ``s1``, ``s2``, ``t0``
       (``shuffle::ntt_shuffled``) using a signature-specific
       ``ScaRng``.
   * - DPA on ``y`` sampling
     - implemented
     - Masked ``y`` sampled as two shares from SHAKE256, kept as shares
       through masked NTT and masked ``A·y`` (``sca-masked-y``,
       ``masked::masked_mat_vec_mul*``).
   * - DPA on ``s1``, ``s2``, ``t0``
     - implemented
     - First-order arithmetic masking kept across the rejection loop
       (``masked::MaskedPoly``, ``masked_pointwise_mul_public``).
   * - Timing on rejection loop
     - implemented
     - Compute all intermediates (cs1, z, cs2, r0, ct0, hint) on every
       iteration, then take a single branch-free accept/reject decision
       (``sca-ct-rejection``).
   * - Software / remote timing
     - partial (interim)
     - All conditional selections route through ``silentops::ct_*``;
       remaining ctgrind flags on branches-into-public-signature are
       documented in ``tools/ctgrind.supp`` pending hardening closure.
   * - DFA on norm checks
     - partial
     - CT rejection loop already double-checks norms before emission;
       explicit redundant signing planned (see Roadmap chapter of the
       README).
   * - Template attacks
     - implemented
     - NTT shuffling destroys trace alignment; masking multiplies the
       profile cost.
   * - Higher-order DPA via mask re-use
     - implemented (``T1-A``)
     - Per-iteration ``MaskedPoly::refresh`` of every polynomial of
       ``s1_hat_m``, ``s2_hat_m``, ``t0_hat_m`` at the **head** of each
       rejection iteration (before any operation on the shares, per
       :cite:`hermelink2025_weakest_link_masked_mldsa` §4). Output bytes
       unchanged (mask cancels in unmask, KAT byte-identical).
   * - Hermelink 2025/276 leakage map of masked-y gadgets
     - audit shipped (``T1-B``)
     - Information-theoretic audit pass over every masked gadget and
       unmask call site, classified against the Hermelink leak taxonomy
       (C1-C5) with a per-row follow-up tracker. See
       :doc:`../audits/hermelink_masked`.

SPA / SEMA: Fisher–Yates shuffled NTT
=====================================

Principle
---------

Same idea as ML-KEM: draw a random permutation of the NTT butterfly
groups and of the butterflies within each group, then execute them in
the permuted order. The shuffle is applied to ``s1`` (``l``
polynomials), ``s2`` (``k``) and ``t0`` (``k``), the three secret
vectors. The public matrix ``A`` uses the classical NTT.

The permutations are drawn from a dedicated ``ScaRng`` seeded with
``K ‖ rnd ‖ tr ‖ M'`` (SHAKE256), so a given signature uses a
reproducible but unpredictable-to-an-attacker order.

Published basis
---------------

* :cite:`arxiv2024_mlkem_shuffling_hw`: original shuffling analysis
  (ML-KEM focus, but the technique transfers directly).
* :cite:`eprint2025_cortexm4_m7_slothy`: Cortex-M4/M7 performance
  measurements for the shuffled variant.
* :cite:`nist2025_physical_security_mldsa`: NIST's recommended posture
  including shuffling as an SPA mitigation.

Code pointers
-------------

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Item
     - Location
   * - Fisher–Yates permutation generator
     - ``quantica/src/ml_dsa/shuffle.rs`` ``generate_permutation``
   * - Shuffled NTT
     - ``quantica/src/ml_dsa/shuffle.rs`` ``ntt_shuffled``
   * - Call sites (Step 1 of ``sign_internal``)
     - ``quantica/src/ml_dsa/dsa.rs`` around line 531: three for-loops
       applying ``ntt_shuffled`` to each polynomial of ``s1_hat``,
       ``s2_hat``, ``t0_hat``.
   * - ``ScaRng`` construction + seeding
     - ``quantica/src/ml_dsa/dsa.rs`` lines 516-526; seed = SHAKE256 of
       the domain-separator tag ``quantica-mldsa-sca-seed-v1``
       concatenated with ``K``, ``rnd``, ``tr`` and ``M'``.

DPA: first-order masking of secret polynomials
==============================================

Principle
---------

Each secret polynomial (``s1``, ``s2``, ``t0``) is kept as a pair
``(P_0, P_1)`` with ``P = P_0 + P_1 (mod q)``. Operations taking a
secret as operand (NTT, pointwise mul with public ``A``, matrix-vector
multiplication) are rewritten on shares.

The ``A·y`` step is the most DPA-critical: it operates on the masked
``y`` and the public ``A``, yielding a share representation of ``w``
that is unmasked only once in the accept/reject logic.

Published basis
---------------

* :cite:`eprint2024_mldsa_hw_masking`: masked hardware ML-DSA
  (reference construction, we follow the same share topology in
  software).
* :cite:`nist2025_physical_security_mldsa`: masking recommendation.

Code pointers
-------------

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Item
     - Location
   * - ``MaskedPoly`` type + arithmetic helpers
     - ``quantica/src/ml_dsa/masked.rs`` (``masked_ntt``,
       ``masked_ntt_inv``, ``masked_pointwise_mul_public``,
       ``masked_mat_vec_mul``, ``masked_mat_vec_mul_lazy``)
   * - Call sites for masked NTT on secrets
     - ``quantica/src/ml_dsa/dsa.rs``, lines around the ``s1_hat``,
       ``s2_hat``, ``t0_hat`` initialization under
       ``#[cfg(feature = "sca-protected")]``.
   * - Zeroization of masked polynomials
     - ``quantica/src/ml_dsa/masked.rs`` ``zeroize_poly``,
       ``zeroize_bytes``.

DPA on ``y``: the ``sca-masked-y`` pipeline
===========================================

Principle
---------

The masking vector ``y`` is the main DPA target: the published
signature component ``z = y + c·s1`` reveals a linear combination of
``y`` and ``s1``, so if ``y`` ever appears unmasked on the power trace,
averaging many signatures with equal message / equal ``c`` recovers
``s1``.

``quantica`` samples ``y`` as two arithmetic shares directly from
SHAKE256, runs masked NTT on the shares, computes ``A·y`` with the
public matrix on the shares, and unmasks ``w = A·y`` only when the
rejection loop has committed to publishing it. This is exactly the
construction of :cite:`coron2024_masked_rejection_dilithium`.

Published basis
---------------

* :cite:`coron2024_masked_rejection_dilithium`: canonical high-order
  masked generation of the masking vector and masked rejection sampling
  gadget (TCHES 2024.4). **Construction followed by our
  implementation.**
* :cite:`belaid2026_sucre`: SUCRE (TCHES 2026.1), a shuffle-and-unmask
  alternative that delivers 4–6× speedup on the same security claim.
  Candidate for ``T4-A`` migration evaluation (see below).

Known attacks against the construction
--------------------------------------

* :cite:`hermelink2025_weakest_link_masked_mldsa` (CRYPTO 2025):
  information-theoretic leakage map of masked-``y`` implementations
  at first, second, and higher orders.
  **Not a break of the construction itself, but an auditor's checklist
  for the gadgets instantiating it.** ``T1-B`` tracks the pass-through
  of this checklist on our code.
* :cite:`damm2025_concealed_ilwe` (ASIACRYPT 2025): introduces
  *concealed ILWE* with Huber/Cauchy regression; breaks masked-Dilithium
  implementations that leak up to 90% of the shares. Motivates strong
  care on the masked-NTT and masked-``A·y`` gadgets.
* :cite:`zhao2026_rejection_matters` (DATE 2026): non-profiling attack
  on the *unmasked* / hedged rejection loop (96 traces for ``c``, ~300
  traces for the key on a Cortex-M4 target). **Primary motivator** for
  the ``sca-ct-rejection`` feature below.

Code pointers
-------------

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Item
     - Location
   * - Masked ``y`` sampling
     - ``quantica/src/ml_dsa/masked.rs`` (look for
       ``masked_expand_mask`` in the sampling region).
   * - Masked ``A·y``
     - ``quantica/src/ml_dsa/masked.rs`` ``masked_mat_vec_mul`` /
       ``masked_mat_vec_mul_lazy``.
   * - Call site in the sign loop
     - ``quantica/src/ml_dsa/dsa.rs``, inside the rejection loop of
       ``sign_internal``, under ``#[cfg(feature = "sca-masked-y")]``.
   * - Tests
     - ``quantica/src/ml_dsa/masked.rs`` end-of-file
       (``masked_expand_mask_matches_unmasked_expand_mask``,
       ``masked_mat_vec_mul_matches_unmasked``).

Timing: constant-time rejection loop
====================================

Principle
---------

The FIPS 204 rejection loop as written branches out of the iteration as
soon as a norm check fails:

.. code-block:: text

   repeat
       compute w, w1, c, z, r0
       if norm(z)  >= gamma1 - beta then restart
       if norm(r0) >= gamma2 - beta then restart
       ...
   until accepted

A timing observer can therefore tell at which test the candidate was
rejected, which leaks information about ``z`` and ``r0``, and thereby
about ``s1``, ``s2`` :cite:`eprint2025_rejected_signatures_sca`.
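The contrast between an early-exit check and a single accumulated decision can be sketched as follows. This is a minimal illustration under stated assumptions, not the ``quantica`` code: the names ``exceeds_bound_ct``, ``accept_leaky`` and ``accept_ct`` are hypothetical, coefficients are assumed to be centered signed integers well below 2^30 in magnitude, and plain integer arithmetic stands in for the ``silentops::ct_*`` helpers.

```rust
// Sketch only: hypothetical helper names, not the quantica API.

/// Returns 1 if any |c| >= bound, else 0, with no data-dependent branch.
/// Assumes centered coefficients with |c| < 2^30 and 0 < bound < 2^30.
fn exceeds_bound_ct(coeffs: &[i32], bound: i32) -> u32 {
    let mut flag: u32 = 0;
    for &c in coeffs {
        // Branch-free absolute value: mask is all-ones when c is negative.
        let mask = c >> 31;
        let abs = (c ^ mask) - mask;
        // (bound - 1 - abs) is negative exactly when abs >= bound;
        // its sign bit becomes the per-coefficient reject bit.
        let diff = bound - 1 - abs;
        flag |= (diff as u32) >> 31;
    }
    flag
}

/// Early-exit variant: which check failed first is visible in the timing.
fn accept_leaky(z: &[i32], r0: &[i32], z_bound: i32, r0_bound: i32) -> bool {
    if z.iter().any(|c| c.abs() >= z_bound) {
        return false;
    }
    if r0.iter().any(|c| c.abs() >= r0_bound) {
        return false;
    }
    true
}

/// Accumulating variant: both checks always run over every coefficient
/// and are folded into one flag that is consulted once.
fn accept_ct(z: &[i32], r0: &[i32], z_bound: i32, r0_bound: i32) -> bool {
    let reject = exceeds_bound_ct(z, z_bound) | exceeds_bound_ct(r0, r0_bound);
    reject == 0
}
```

Both variants return the same verdict; only the second does the same amount of work whichever check fails. Rust itself gives no constant-time guarantee for such arithmetic, so a sketch like this still needs verification on the compiled binary.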
The ``sca-ct-rejection`` feature rewrites the loop so that every
iteration computes **all** intermediates (``cs1``, ``z``, ``cs2``,
``r0``, ``ct0``, ``hint``) and accumulates a single branch-free accept
flag that is consulted only at the very end of the iteration. The loop
still repeats until a candidate is accepted, but an observer of a single
iteration cannot tell *which* norm check decided its outcome.

Published basis
---------------

* :cite:`eprint2025_rejected_signatures_sca`: the initial timing-leak
  analysis that motivates the countermeasure.
* :cite:`zhao2026_rejection_matters` (DATE 2026): a non-profiling
  public-template attack recovering ``c`` in 96 traces and the signing
  key in ~300 traces on a Cortex-M4 target in hedged / unprotected
  mode. The ``sca-ct-rejection`` feature is the intended answer to this
  attack class.

Code pointers
-------------

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Item
     - Location
   * - Rejection loop with branch-free accept
     - ``quantica/src/ml_dsa/dsa.rs`` ``sign_internal``, block guarded
       by ``#[cfg(feature = "sca-ct-rejection")]``.
   * - Norm-check helpers returning bit flags
     - ``quantica/src/ml_dsa/dsa.rs`` (look for ``infinity_norm_*``
       inside the ``sca-ct-rejection`` region; the flags feed a
       single-variable AND accumulator).

Template attacks
================

Template attacks against ML-DSA rely on profile-matching the NTT
coefficients of ``s1``, ``s2``, ``t0`` or the ``y`` sampling. The
defences already described (masking + shuffling) destroy the
inter-trace alignment a template attack depends on, and multiply the
profile size the attacker has to maintain. See :doc:`../threat_model`
for cost estimates; see :cite:`arxiv2025_mlkem_mldsa_cortexm0_rp2040`
for a practical profile against an unprotected Cortex-M0 implementation
and the required trace counts once shuffling is in place.
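The alignment-breaking effect of shuffling comes from running independent operations in a fresh random order, so a given time sample no longer maps to a fixed coefficient. A minimal sketch of the Fisher–Yates step, under stated assumptions: ``ToyRng`` is a hypothetical SplitMix64 stand-in for ``ScaRng`` and is not cryptographically secure, ``generate_permutation_sketch`` and ``apply_shuffled`` are illustrative names, and a generic per-element operation replaces the NTT butterflies.

```rust
/// Toy SplitMix64 generator standing in for `ScaRng`.
/// Hypothetical and NOT cryptographically secure.
struct ToyRng(u64);

impl ToyRng {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Fisher-Yates shuffle: returns a permutation of 0..n.
fn generate_permutation_sketch(n: usize, rng: &mut ToyRng) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..n).collect();
    for i in (1..n).rev() {
        // Pick j in 0..=i. The modulo reduction is slightly biased,
        // which is acceptable for a sketch only.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        perm.swap(i, j);
    }
    perm
}

/// Applies an independent per-element operation in permuted order.
/// The result does not depend on the order; only the trace layout does.
fn apply_shuffled(data: &mut [u32], perm: &[usize], op: impl Fn(u32) -> u32) {
    for &idx in perm {
        data[idx] = op(data[idx]);
    }
}
```

The sketch makes no constant-time claim: the swap index depends on the random draw, which is the kind of behaviour a branch-free permutation generator is meant to remove.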
Planned hardening
=================

The following items are scheduled for the next hardening round, except
where marked **shipped**; the ctgrind-related items each close one of
the ``tools/ctgrind.supp`` entries documented under
:doc:`../verification`.

* **T2-A**: explicit ``ct_grind::unpoison`` after the algorithmic
  unmasking point of ``w1``, ``h``, ``z``. Lets ctgrind re-verify with
  zero suppressions on the ``decompose::high_bits_vec``,
  ``encode::w1_encode``, ``decompose::make_hint_vec``,
  ``encode::sig_encode`` paths.
* **T2-B**: branch-free ``generate_permutation`` (Feistel- or
  Floyd-based) to close the suppression on
  ``shuffle::generate_permutation``.
* **T1-A** (A3), **shipped**: refresh the shares of ``s1``, ``s2``,
  ``t0`` at the start of every rejection iteration to defeat
  higher-order DPA variants that combine two iterations' leakage. The
  ``dsa.rs`` rejection loop opens with a
  ``#[cfg(feature = "sca-protected")]`` block that calls
  ``MaskedPoly::refresh`` on every polynomial of ``s1_hat_m``,
  ``s2_hat_m``, ``t0_hat_m`` before any operation on the shares,
  matching the §4 prescription of
  :cite:`hermelink2025_weakest_link_masked_mldsa` exactly. Output bytes
  are byte-identical to the pre-T1-A baseline (mask cancels in unmask);
  cost is unchanged versus the previous end-of-cs/ct refresh placement
  (same number of ``ScaRng`` bytes consumed per iteration). Audit row
  flipped to *protected* in :doc:`../audits/hermelink_masked`.
* **T2-C**: documentation traceability. After A/B/C land, the
  historical suppression file becomes a "resolved-findings" annex in
  :doc:`../verification`.
* **T4-A: SUCRE migration evaluation** (:cite:`belaid2026_sucre`).
  Benchmark ``sca-masked-y`` against SUCRE's shuffle-and-unmask gadget
  on our target platforms (Cortex-M4 class). Migrate the masked
  rejection path if the published 4–6× speedup holds on-device and the
  transient memory footprint fits the embedded budget. The existing
  masked-``y`` pipeline remains the fallback if the speedup is swallowed
  by our other constraints.
* **T1-B: Hermelink audit pass on masked.rs**, **shipped**
  (:cite:`hermelink2025_weakest_link_masked_mldsa`). The
  information-theoretic leakage map of CRYPTO 2025 has been applied to
  every gadget of ``quantica/src/ml_dsa/masked.rs`` and every unmask
  call site of the rejection loop in
  ``quantica/src/ml_dsa/dsa.rs::sign_internal``; each row is classified
  as *protected*, *partial*, or *acknowledged residual risk*, with a
  per-row follow-up pointer. The full audit annex is
  :doc:`../audits/hermelink_masked`. The primary follow-up surfaced by
  the audit, ``T1-A`` (per-iteration share refresh), has since
  **shipped**, closing the C4 sufficiency row and reducing the C1
  residuals to the plaintext-aggregate floor. Remaining open follow-ups
  (Tier-2 CT norm-on-shares and Tier-3 share-domain Decompose/MakeHint)
  are tracked in the audit's work-list.
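The claim that a share refresh leaves the output bytes unchanged follows from the additive sharing itself: adding a fresh value to one share and subtracting it from the other does not change their sum modulo ``q``. A minimal sketch of that invariant, under stated assumptions: ``MaskedSketch`` and its methods are hypothetical names rather than the ``MaskedPoly`` API, the random values are passed in by the caller instead of being drawn from ``ScaRng``, and all slices are assumed to have equal length.

```rust
/// ML-DSA modulus q = 2^23 - 2^13 + 1.
const Q: u32 = 8_380_417;

/// Two additive shares of a coefficient vector: value = s0 + s1 (mod Q).
/// Hypothetical type for illustration only.
struct MaskedSketch {
    s0: Vec<u32>,
    s1: Vec<u32>,
}

impl MaskedSketch {
    /// Splits `coeffs` (each < Q) using caller-supplied mask values.
    /// Assumes `coeffs` and `masks` have the same length.
    fn mask(coeffs: &[u32], masks: &[u32]) -> Self {
        let s0: Vec<u32> = masks.iter().map(|m| m % Q).collect();
        let s1 = coeffs
            .iter()
            .zip(&s0)
            .map(|(c, m)| (c + Q - m) % Q)
            .collect();
        MaskedSketch { s0, s1 }
    }

    /// Re-randomizes the sharing: s0 += r and s1 -= r, coefficient-wise.
    /// The represented value s0 + s1 (mod Q) is unchanged.
    fn refresh(&mut self, fresh: &[u32]) {
        for ((a, b), r) in self.s0.iter_mut().zip(self.s1.iter_mut()).zip(fresh) {
            let r = r % Q;
            *a = (*a + r) % Q;
            *b = (*b + Q - r) % Q;
        }
    }

    /// Recombines the shares. In a protected implementation this happens
    /// only where the value is allowed to become public.
    fn unmask(&self) -> Vec<u32> {
        self.s0.iter().zip(&self.s1).map(|(a, b)| (a + b) % Q).collect()
    }
}
```

Calling ``refresh`` at the head of every iteration means each iteration works on a sharing that is statistically independent of the previous one, while ``unmask`` still returns the same coefficients.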