###################################################################
ML-DSA — countermeasures
###################################################################

:FIPS spec:   :cite:`fips204`
:Crate path:  ``quantica::ml_dsa``
:Cargo features:
    ``ml-dsa`` (on by default); ``sca-protected`` (on by default,
    gates masking + shuffling); ``sca-masked-y`` (masked ``y``
    pipeline, on in hardened builds); ``sca-ct-rejection``
    (branchless rejection loop, on in hardened builds).

ML-DSA has the richest SCA threat surface of the three algorithms
— it mixes a non-deterministic rejection-sampling loop, several
secret polynomials used in linear combinations, and a Fiat–Shamir
challenge that exposes bit-level intermediates. This chapter lists
the countermeasures implemented in ``quantica::ml_dsa``, indexed by
threat class, and the items still outstanding for the next
hardening round.

Threat classes reference: :doc:`../threat_model`. Primitive reference:
:doc:`../primitives`. Verification methodology: :doc:`../verification`.

.. contents::
   :local:
   :depth: 2

Coverage matrix
===============

.. list-table:: ML-DSA countermeasure / threat matrix
   :header-rows: 1
   :widths: 25 18 57

   * - Threat
     - Status
     - Countermeasure(s)
   * - SPA / SEMA on secret NTT
     - implemented
     - Fisher–Yates shuffled NTT for ``s1``, ``s2``, ``t0``
       (``shuffle::ntt_shuffled``) using a signature-specific
       ``ScaRng``.
   * - DPA on ``y`` sampling
     - implemented
     - Masked ``y`` sampled as two shares from SHAKE256, kept as
       shares through masked NTT and masked ``A·y``
       (``sca-masked-y``, ``masked::masked_mat_vec_mul*``).
   * - DPA on ``s1``, ``s2``, ``t0``
     - implemented
     - First-order arithmetic masking kept across the rejection loop
       (``masked::MaskedPoly``, ``masked_pointwise_mul_public``).
   * - Timing on rejection loop
     - implemented
     - Compute all intermediates (cs1, z, cs2, r0, ct0, hint) every
       iteration, single branch-free accept/reject decision
       (``sca-ct-rejection``).
   * - Software / remote timing
     - partial (interim)
     - All conditional selections route through ``silentops::ct_*``;
       the remaining ctgrind findings, on branches over values that
       end up in the public signature, are recorded in
       ``tools/ctgrind.supp`` pending hardening closure.
   * - DFA on norm checks
     - partial
     - CT rejection loop already double-checks norms before emission;
       explicit redundant signing planned (see Roadmap chapter of
       the README).
   * - Template attacks
     - implemented
     - NTT shuffling destroys trace alignment; masking multiplies
       the profile cost.
   * - Higher-order DPA via mask re-use
     - implemented (``T1-A``)
     - Per-iteration ``MaskedPoly::refresh`` of every polynomial of
       ``s1_hat_m``, ``s2_hat_m``, ``t0_hat_m`` at the **head** of
       each rejection iteration (before any operation on the shares,
       per :cite:`hermelink2025_weakest_link_masked_mldsa` §4).
       Output bytes unchanged (mask cancels in unmask, KAT
       byte-identical).
   * - Hermelink 2025/276 leakage map of masked-y gadgets
     - audit shipped (``T1-B``)
     - Information-theoretic audit pass over every masked gadget and
       unmask call site, classified against the Hermelink leak
       taxonomy (C1-C5) with a per-row follow-up tracker. See
       :doc:`../audits/hermelink_masked`.

SPA / SEMA — Fisher-Yates shuffled NTT
======================================

Principle
---------

Same idea as ML-KEM: draw a random permutation of the NTT butterfly
groups and of the butterflies within each group, execute in the
permuted order. The shuffle is applied to ``s1`` (``l`` polynomials),
``s2`` (``k``) and ``t0`` (``k``) — the three secret vectors. The
public matrix ``A`` uses the classical NTT.

The permutations are drawn from a dedicated ``ScaRng`` seeded with
``K ‖ rnd ‖ tr ‖ M'`` (SHAKE256), so the order used for a given
signature is reproducible yet unpredictable to an attacker who does
not know ``K``.
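
As an illustration of the principle above, the following minimal
sketch draws a Fisher-Yates permutation and visits an array in the
permuted order. ``ToyRng``, the ``generate_permutation`` signature
shown here and ``for_each_shuffled`` are assumptions made for this
example; they are not the crate's ``ScaRng`` or ``shuffle`` API, and
the toy generator is neither cryptographically secure nor
constant-time.

.. code-block:: rust

   /// Toy xorshift64* generator standing in for the crate's `ScaRng`
   /// (hypothetical stand-in, NOT cryptographically secure).
   /// The seed must be non-zero.
   struct ToyRng(u64);

   impl ToyRng {
       fn next_u32(&mut self) -> u32 {
           self.0 ^= self.0 >> 12;
           self.0 ^= self.0 << 25;
           self.0 ^= self.0 >> 27;
           (self.0.wrapping_mul(0x2545_F491_4F6C_DD1D) >> 32) as u32
       }
   }

   /// Fisher-Yates shuffle: returns a permutation of `0..n`.
   fn generate_permutation(n: usize, rng: &mut ToyRng) -> Vec<usize> {
       let mut perm: Vec<usize> = (0..n).collect();
       for i in (1..n).rev() {
           // Index in 0..=i. The modulo introduces a small bias that a
           // production generator would have to remove.
           let j = (rng.next_u32() as usize) % (i + 1);
           perm.swap(i, j);
       }
       perm
   }

   /// Applies `op` to every element of `data`, visiting the indices in
   /// the order given by `perm` instead of 0, 1, 2, ...
   fn for_each_shuffled<F: FnMut(&mut i32)>(data: &mut [i32], perm: &[usize], mut op: F) {
       for &idx in perm {
           op(&mut data[idx]);
       }
   }

Every index is visited exactly once, so the result does not depend on
the permutation; only the sequence of operations visible on a power
trace changes from one signature to the next.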

Published basis
---------------

* :cite:`arxiv2024_mlkem_shuffling_hw` — original shuffling analysis
  (ML-KEM focus, but the technique transfers directly).
* :cite:`eprint2025_cortexm4_m7_slothy` — Cortex-M4/M7 performance
  measurements for the shuffled variant.
* :cite:`nist2025_physical_security_mldsa` — NIST's recommended
  posture including shuffling as an SPA mitigation.

Code pointers
-------------

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Item
     - Location
   * - Fisher-Yates permutation generator
     - ``quantica/src/ml_dsa/shuffle.rs`` ``generate_permutation``
   * - Shuffled NTT
     - ``quantica/src/ml_dsa/shuffle.rs`` ``ntt_shuffled``
   * - Call sites (Step 1 of ``sign_internal``)
     - ``quantica/src/ml_dsa/dsa.rs`` around line 531 — three
       for-loops applying ``ntt_shuffled`` to each polynomial of
       ``s1_hat``, ``s2_hat``, ``t0_hat``.
   * - ``ScaRng`` construction + seeding
     - ``quantica/src/ml_dsa/dsa.rs`` lines 516-526; seed =
       SHAKE256 of the domain-separator tag
       ``quantica-mldsa-sca-seed-v1`` concatenated with
       ``K``, ``rnd``, ``tr`` and ``M'``.

DPA — first-order masking of secret polynomials
===============================================

Principle
---------

Each secret polynomial (``s1``, ``s2``, ``t0``) is kept as a pair
``(P_0, P_1)`` with ``P = P_0 + P_1 (mod q)``. Operations taking a
secret as operand (NTT, pointwise multiplication with public ``A``,
matrix-vector multiplication) are rewritten to operate on shares.
The ``A·y`` step is
the most DPA-critical: it operates on the masked ``y`` and the public
``A``, yielding a share representation of ``w`` that is unmasked
only once in the accept/reject logic.
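
The share arithmetic described above can be sketched on a single
coefficient as follows. This is a minimal illustration, not the
crate's ``MaskedPoly`` implementation: the ``Shares`` type and the
``mask``, ``unmask`` and ``mul_public`` helpers are hypothetical
names, and the mask ``r`` is passed in by the caller instead of being
drawn from an RNG.

.. code-block:: rust

   /// ML-DSA modulus q = 2^23 - 2^13 + 1.
   const Q: i64 = 8_380_417;

   /// First-order arithmetic sharing of one coefficient:
   /// value = (p0 + p1) mod q.
   #[derive(Clone, Copy, Debug)]
   struct Shares {
       p0: i64,
       p1: i64,
   }

   /// Splits `x` into two shares using the caller-supplied mask `r`.
   fn mask(x: i64, r: i64) -> Shares {
       let r = r.rem_euclid(Q);
       Shares { p0: (x - r).rem_euclid(Q), p1: r }
   }

   /// Recombines the two shares into the underlying value.
   fn unmask(s: Shares) -> i64 {
       (s.p0 + s.p1).rem_euclid(Q)
   }

   /// Multiplication by a public value is linear, so it is applied to
   /// each share separately and the secret is never recombined.
   fn mul_public(s: Shares, a: i64) -> Shares {
       let a = a.rem_euclid(Q);
       Shares {
           p0: (s.p0 * a).rem_euclid(Q),
           p1: (s.p1 * a).rem_euclid(Q),
       }
   }

For any ``x``, ``r`` and ``a``, ``unmask(mul_public(mask(x, r), a))``
equals ``a·x mod q``. The same linearity is what lets the NTT and the
product with the public matrix run share by share.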

Published basis
---------------

* :cite:`eprint2024_mldsa_hw_masking` — masked hardware ML-DSA
  (reference construction, we follow the same share topology in
  software).
* :cite:`nist2025_physical_security_mldsa` — masking recommendation.

Code pointers
-------------

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Item
     - Location
   * - ``MaskedPoly`` type + arithmetic helpers
     - ``quantica/src/ml_dsa/masked.rs`` (``masked_ntt``,
       ``masked_ntt_inv``, ``masked_pointwise_mul_public``,
       ``masked_mat_vec_mul``, ``masked_mat_vec_mul_lazy``)
   * - Call sites for masked NTT on secrets
     - ``quantica/src/ml_dsa/dsa.rs`` — lines around the ``s1_hat``,
       ``s2_hat``, ``t0_hat`` initialization under
       ``#[cfg(feature = "sca-protected")]``.
   * - Zeroization of masked polynomials
     - ``quantica/src/ml_dsa/masked.rs`` ``zeroize_poly``,
       ``zeroize_bytes``.

DPA on ``y`` — the ``sca-masked-y`` pipeline
============================================

Principle
---------

The masking vector ``y`` is the prime DPA target: the published
signature component ``z = y + c·s1`` is a linear combination of
``y`` and ``s1``, and the challenge ``c`` is public. If ``y`` ever
appears unmasked on the power trace, an attacker who recovers it,
even noisily and by averaging over many signatures, obtains
``c·s1 = z - y`` and from it ``s1``.

``quantica`` samples ``y`` as two arithmetic shares directly from
SHAKE256, runs masked NTT on the shares, computes ``A·y`` with the
public matrix on the shares, and unmasks ``w = A·y`` only when the
rejection loop has committed to publishing it — exactly the
construction of :cite:`coron2024_masked_rejection_dilithium`.
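
The data flow of the masked ``A·y`` step can be sketched with a toy
model in which every matrix and vector entry is a single value modulo
``q`` (in the NTT domain the real product is coefficient-wise, so the
scalar model reflects what happens per coefficient). The function
names below are hypothetical and do not reproduce the signatures of
``masked::masked_mat_vec_mul*``; inputs are assumed to be already
reduced to ``[0, q)``.

.. code-block:: rust

   /// ML-DSA modulus q = 2^23 - 2^13 + 1.
   const Q: i64 = 8_380_417;

   /// Computes A·v mod q for a public matrix `a` and ONE share vector `v`.
   /// Entries are assumed to lie in [0, q).
   fn mat_vec_mul(a: &[Vec<i64>], v: &[i64]) -> Vec<i64> {
       a.iter()
           .map(|row| {
               row.iter()
                   .zip(v)
                   .fold(0i64, |acc, (&aij, &vj)| (acc + aij * vj) % Q)
           })
           .collect()
   }

   /// Masked A·y: each share of y is multiplied by A on its own, so y
   /// is never recombined during the computation.
   fn toy_masked_mat_vec_mul(
       a: &[Vec<i64>],
       y0: &[i64],
       y1: &[i64],
   ) -> (Vec<i64>, Vec<i64>) {
       (mat_vec_mul(a, y0), mat_vec_mul(a, y1))
   }

   /// Recombines the two shares of w; called once, at the unmask point.
   fn unmask_vec(w0: &[i64], w1: &[i64]) -> Vec<i64> {
       w0.iter().zip(w1).map(|(&a, &b)| (a + b) % Q).collect()
   }

The only place where the two shares meet is ``unmask_vec``, which
corresponds to the single unmask point of ``w`` described above.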

Published basis
---------------

* :cite:`coron2024_masked_rejection_dilithium` — canonical
  high-order masked generation of the masking vector and masked
  rejection sampling gadget (TCHES 2024.4). **Construction
  followed by our implementation.**
* :cite:`belaid2026_sucre` — SUCRE (TCHES 2026.1), a
  shuffle-and-unmask alternative that delivers 4–6× speedup on
  the same security claim. Candidate for ``T4-A`` migration
  evaluation (see below).

Known attacks against the construction
--------------------------------------

* :cite:`hermelink2025_weakest_link_masked_mldsa` (CRYPTO 2025):
  information-theoretic leakage map of masked-``y`` implementations
  at first, second, and higher orders. **Not a break of the
  construction itself, but an auditor's checklist for the gadgets
  instantiating it.** ``T1-B`` tracks the application of this
  checklist to our code.
* :cite:`damm2025_concealed_ilwe` (ASIACRYPT 2025): introduces
  *concealed ILWE* with Huber/Cauchy regression; breaks
  masked-Dilithium implementations that leak up to 90% of the
  shares. Motivates strong care on the masked-NTT and
  masked-``A·y`` gadgets.
* :cite:`zhao2026_rejection_matters` (DATE 2026): non-profiling
  attack on the *unmasked* / hedged rejection loop (96 traces for
  ``c``, ~300 traces for the key on a Cortex-M4 target). **Primary
  motivator for the ``sca-ct-rejection`` feature** below.

Code pointers
-------------

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Item
     - Location
   * - Masked ``y`` sampling
     - ``quantica/src/ml_dsa/masked.rs`` (look for
       ``masked_expand_mask`` in the sampling region).
   * - Masked ``A·y``
     - ``quantica/src/ml_dsa/masked.rs``
       ``masked_mat_vec_mul`` / ``masked_mat_vec_mul_lazy``.
   * - Call site in the sign loop
     - ``quantica/src/ml_dsa/dsa.rs`` — inside the rejection loop of
       ``sign_internal``, under ``#[cfg(feature = "sca-masked-y")]``.
   * - Tests
     - ``quantica/src/ml_dsa/masked.rs`` end-of-file
       (``masked_expand_mask_matches_unmasked_expand_mask``,
       ``masked_mat_vec_mul_matches_unmasked``).

Timing — constant-time rejection loop
=====================================

Principle
---------

The FIPS 204 rejection loop as written branches out of the iteration
as soon as the norm check fails:

.. code-block:: text

   repeat
       compute w, w1, c, z, r0
       if norm(z) >= gamma1 - beta then restart
       if norm(r0) >= gamma2 - beta then restart
       ...
   until accepted

A timing observer can therefore tell at which test the candidate
was rejected, which leaks information about ``z`` and ``r0`` — and
thereby about ``s1``, ``s2`` :cite:`eprint2025_rejected_signatures_sca`.

The ``sca-ct-rejection`` feature rewrites the loop so that every
iteration computes **all** intermediates (``cs1``, ``z``, ``cs2``,
``r0``, ``ct0``, ``hint``) and accumulates a single branch-free
accept flag that is consulted only at the very end of the iteration.
The loop keeps running until a candidate is accepted, and an observer
of a single iteration cannot tell *which* norm check caused a
rejection.
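
A branch-free accept flag of this kind can be sketched as below. The
helper names and signatures are hypothetical (the crate's
``infinity_norm_*`` helpers and ``silentops::ct_*`` routines are not
reproduced here), the hint-weight check is omitted for brevity, and
coefficients are assumed to be centered values well inside the
``i32`` range. Source-level branch freedom does not by itself
guarantee constant-time machine code, which is why the result still
has to be verified on the compiled binary.

.. code-block:: rust

   /// Returns 1 if every |c| < bound, else 0, without branching on the
   /// coefficient values. Assumes |c| and bound are far below 2^31.
   fn norm_below(coeffs: &[i32], bound: i32) -> u32 {
       let mut exceeded: u32 = 0;
       for &c in coeffs {
           let sign = c >> 31; // 0 or -1 (arithmetic shift)
           let abs = (c ^ sign) - sign; // |c| without a branch
           // (bound - 1 - abs) is negative exactly when abs >= bound,
           // so its sign bit is the "exceeded" flag for this coefficient.
           exceeded |= ((bound - 1 - abs) >> 31) as u32 & 1;
       }
       exceeded ^ 1
   }

   /// Evaluates all norm checks and combines them with a single AND,
   /// so no check is skipped when an earlier one fails.
   fn accept_flag(
       z: &[i32],
       r0: &[i32],
       ct0: &[i32],
       gamma1: i32,
       gamma2: i32,
       beta: i32,
   ) -> u32 {
       let ok_z = norm_below(z, gamma1 - beta);
       let ok_r0 = norm_below(r0, gamma2 - beta);
       let ok_ct0 = norm_below(ct0, gamma2);
       ok_z & ok_r0 & ok_ct0
   }

The caller consults the returned flag once per iteration; the work
done before that point is the same whether the candidate is accepted
or rejected.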

Published basis
---------------

* :cite:`eprint2025_rejected_signatures_sca` — the initial
  timing-leak analysis that motivates the countermeasure.
* :cite:`zhao2026_rejection_matters` (DATE 2026) — a non-profiling
  public-template attack recovering ``c`` in 96 traces and the
  signing key in ~300 traces on a Cortex-M4 target in hedged /
  unprotected mode. The ``sca-ct-rejection`` feature is the
  intended answer to this attack class.

Code pointers
-------------

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Item
     - Location
   * - Rejection loop with branch-free accept
     - ``quantica/src/ml_dsa/dsa.rs`` ``sign_internal``, block
       guarded by ``#[cfg(feature = "sca-ct-rejection")]``.
   * - Norm-check helpers returning bit flags
     - ``quantica/src/ml_dsa/dsa.rs`` (look for ``infinity_norm_*``
       inside the ``sca-ct-rejection`` region; the flags feed a
       single-variable AND accumulator).

Template attacks
================

Template attacks against ML-DSA rely on profile-matching the NTT
coefficients of ``s1``, ``s2``, ``t0`` or the ``y`` sampling. The
defences already described — masking + shuffling — destroy the
inter-trace alignment a template attack depends on, and multiply
the profile size the attacker has to maintain.

See :doc:`../threat_model` for cost estimates; see
:cite:`arxiv2025_mlkem_mldsa_cortexm0_rp2040` for a practical
profile against an unprotected Cortex-M0 implementation and the
required trace counts once shuffling is in place.

Planned hardening
=================

The following items form the hardening work-list. Those not marked
**shipped** are scheduled for the next hardening round; ``T2-A`` and
``T2-B`` close ``tools/ctgrind.supp`` entries documented under
:doc:`../verification`.

* **T2-A** — explicit ``ct_grind::unpoison`` after the algorithmic
  unmasking point of ``w1``, ``h``, ``z``. Lets ctgrind re-verify
  with zero suppressions on the ``decompose::high_bits_vec``,
  ``encode::w1_encode``, ``decompose::make_hint_vec``,
  ``encode::sig_encode`` paths.

* **T2-B** — branch-free ``generate_permutation`` (Feistel- or
  Floyd-based) to close the suppression on
  ``shuffle::generate_permutation``.

* **T1-A** — A3: refresh the shares of ``s1``, ``s2``, ``t0`` at
  the start of every rejection iteration to defeat higher-order DPA
  variants that combine two iterations' leakage — **shipped**.
  The ``dsa.rs`` rejection loop opens with a
  ``#[cfg(feature = "sca-protected")]`` block that calls
  ``MaskedPoly::refresh`` on every polynomial of ``s1_hat_m``,
  ``s2_hat_m``, ``t0_hat_m`` before any operation on the shares —
  the Hermelink :cite:`hermelink2025_weakest_link_masked_mldsa` §4
  prescription matched exactly. Output bytes are byte-identical to
  the pre-T1-A baseline (mask cancels in unmask); cost is unchanged
  versus the previous end-of-cs/ct refresh placement (same number
  of ``ScaRng`` bytes consumed per iteration). Audit row flipped to
  *protected* in :doc:`../audits/hermelink_masked`.

* **T2-C** — documentation traceability: after A/B/C land, the
  historical suppression file becomes a "resolved-findings" annex in
  :doc:`../verification`.

* **T4-A — SUCRE migration evaluation**
  (:cite:`belaid2026_sucre`). Benchmark ``sca-masked-y`` against
  SUCRE's shuffle-and-unmask gadget on our target platforms
  (Cortex-M4 class). Migrate the masked rejection path if the
  published 4–6× speedup holds on-device and the transient memory
  footprint fits the embedded budget. The existing masked-``y``
  pipeline remains the fallback if the speedup is swallowed by our
  other constraints.

* **T1-B — Hermelink audit pass on masked.rs** — **shipped**
  (:cite:`hermelink2025_weakest_link_masked_mldsa`). The
  information-theoretic leakage map of CRYPTO 2025 has been applied
  to every gadget of ``quantica/src/ml_dsa/masked.rs`` and every
  unmask call site of the rejection loop in
  ``quantica/src/ml_dsa/dsa.rs::sign_internal``; each row is
  classified as *protected*, *partial*, or *acknowledged residual
  risk*, with a per-row follow-up pointer. The full audit annex is
  :doc:`../audits/hermelink_masked`. Primary follow-up surfaced by
  the audit — ``T1-A`` (per-iteration share refresh) — has since
  **shipped**, closing the C4 sufficiency row and reducing the C1
  residuals to the plaintext-aggregate floor. Remaining open
  follow-ups (Tier-2 CT norm-on-shares and Tier-3 share-domain
  Decompose/MakeHint) are tracked in the audit's work-list.
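
The per-iteration share refresh of ``T1-A`` can be illustrated with a
minimal two-share sketch. ``ToyMaskedPoly`` and its methods are
hypothetical stand-ins, not the crate's ``MaskedPoly`` type, and the
fresh randomness is supplied by the caller here instead of being
drawn from ``ScaRng``.

.. code-block:: rust

   /// ML-DSA modulus q = 2^23 - 2^13 + 1.
   const Q: i64 = 8_380_417;

   /// Two-share arithmetic masking of one polynomial:
   /// p[i] = (p0[i] + p1[i]) mod q.
   struct ToyMaskedPoly {
       p0: Vec<i64>,
       p1: Vec<i64>,
   }

   impl ToyMaskedPoly {
       /// Re-randomizes the sharing: adds a fresh value to one share
       /// and subtracts it from the other, leaving the sum unchanged.
       fn refresh(&mut self, fresh: &[i64]) {
           for ((a, b), &r) in self.p0.iter_mut().zip(self.p1.iter_mut()).zip(fresh) {
               let r = r.rem_euclid(Q);
               *a = (*a + r).rem_euclid(Q);
               *b = (*b - r).rem_euclid(Q);
           }
       }

       /// Recombines the shares into the underlying polynomial.
       fn unmask(&self) -> Vec<i64> {
           self.p0
               .iter()
               .zip(&self.p1)
               .map(|(&a, &b)| (a + b).rem_euclid(Q))
               .collect()
       }
   }

Since the fresh value cancels when the shares are summed, ``unmask``
returns the same polynomial before and after ``refresh``, which is
consistent with the byte-identical output noted for ``T1-A``.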