###################################################################
Shared side-channel primitives — ``silentops``
###################################################################

The ``silentops`` crate is the single source of truth for the low-level
side-channel primitives used by ``quantica`` (and by ``arcana`` on the
classical side). Keeping these primitives in a separate crate means:

* a single audit surface for constant-time (CT) correctness,
  independent of any particular algorithm;
* architecture-specific assembly backends selected at compile time via
  Cargo features, so a downstream crate never embeds per-arch ``asm``
  in its own source;
* one set of primitives shared by the statistical (``dudect``) and
  the client-request (``ctgrind``) side-channel verifiers, keeping
  test coverage coherent.

This chapter is a reference for those primitives. The threats they
mitigate are described in :doc:`threat_model`; their algorithmic uses
are covered in :doc:`countermeasures/ml_kem`,
:doc:`countermeasures/ml_dsa`, and :doc:`countermeasures/slh_dsa`.

Module layout
=============

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Module
     - Role
   * - ``silentops::ct``
     - Branchless constant-time primitives with architecture-specific
       assembly backends. ``no_std``. Public functions are re-exported
       at the crate root so call sites write ``silentops::ct_eq(...)``.
   * - ``silentops::ct_grind``
     - Valgrind memcheck client-request helpers (``poison`` /
       ``unpoison``) for constant-time verification.
       See :doc:`verification`.
   * - ``silentops::verify``
     - Dudect-style timing-leak detector (``TTest``,
       ``Xorshift64``, ``measure_ns``, ``report``). ``std`` only.

Constant-time primitives — ``silentops::ct``
============================================

Surface
-------

.. list-table::
   :header-rows: 1
   :widths: 30 20 50

   * - Function
     - Signature (logical)
     - Purpose
   * - ``ct_select_u8``
     - ``(a: u8, b: u8, cond: u8) -> u8``
     - Return ``a`` if ``cond`` is ``1`` and ``b`` if it is ``0``.
       Core branchless select.
   * - ``ct_select_i16``
     - ``(a: i16, b: i16, cond: u8) -> i16``
     - Same, for NTT-domain coefficients in ``i16``.
   * - ``ct_select_i32``
     - ``(a: i32, b: i32, cond: u8) -> i32``
     - Same, for ML-DSA coefficients in ``i32``.
   * - ``ct_eq``
     - ``(a: &[u8], b: &[u8]) -> u8``
     - Constant-time byte-slice equality. No early exit; returns
       ``1`` on equality, ``0`` otherwise (including different lengths).
   * - ``ct_copy``
     - ``(dst: &mut [u8], src: &[u8], cond: u8)``
     - Conditional in-place copy. Always reads both buffers; writes to
       ``dst`` are branch-free XOR-mask updates.
   * - ``ct_zeroize``
     - ``(buf: &mut [u8])``
     - Volatile zeroization resistant to dead-store elimination
       (``write_volatile`` + ``compiler_fence(SeqCst)``).
   * - ``ct_zeroize_i16``
     - ``(buf: &mut [i16])``
     - Same for polynomial coefficient arrays.
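
As a rough illustration of the table above, the following pure-Rust
sketch mirrors the documented behaviour of ``ct_eq`` and
``ct_zeroize``. It is not the crate's source: the ``_sketch`` names are
hypothetical, the early return on a length mismatch is an assumption
(justified only by lengths being public), and a pure-Rust version gives
no guarantee about the machine code LLVM finally emits.

.. code-block:: rust

    use core::ptr::write_volatile;
    use core::sync::atomic::{compiler_fence, Ordering};

    /// Hypothetical stand-in for `ct_eq`: 1 on equality, 0 otherwise.
    pub fn ct_eq_sketch(a: &[u8], b: &[u8]) -> u8 {
        // Assumption: lengths are public, so this check may exit early.
        if a.len() != b.len() {
            return 0;
        }
        // OR together the XOR of every byte pair; no data-dependent exit.
        let mut diff: u8 = 0;
        for (x, y) in a.iter().zip(b.iter()) {
            diff |= x ^ y;
        }
        // (diff | -diff) has its top bit set exactly when diff != 0,
        // so this maps diff == 0 to 1 and anything else to 0.
        let nonzero = (diff | diff.wrapping_neg()) >> 7;
        nonzero ^ 1
    }

    /// Hypothetical stand-in for `ct_zeroize`.
    pub fn ct_zeroize_sketch(buf: &mut [u8]) {
        for byte in buf.iter_mut() {
            // SAFETY: `byte` is a valid, aligned, exclusive reference.
            unsafe { write_volatile(byte as *mut u8, 0) };
        }
        // Keep the compiler from moving later accesses above the wipe.
        compiler_fence(Ordering::SeqCst);
    }

Callers would compare the ``u8`` result against ``1`` rather than
converting it to ``bool`` early, so the verdict stays a plain value
until it is deliberately made public.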

Calling convention
------------------

* ``cond: u8`` must be exactly ``0`` or ``1``. The primitives
  compute the mask via ``0u8.wrapping_sub(cond)``, which yields
  ``0xFF`` for ``1`` and ``0x00`` for ``0``. Passing ``0xFF`` (mask
  ``0x01``) or any other non-``0/1`` value breaks the CT invariant
  **and** the functional result.
* ``ct_eq`` always processes the full buffer length; it is ``O(n)`` in
  ``n = a.len()`` with a fixed per-byte cost. Buffer length itself is
  considered public.
* The loop-based primitives (``ct_eq``, ``ct_copy``,
  ``ct_zeroize``) are marked ``#[inline(never)]`` so that LLVM does
  not inline the loop into a caller context where it could be
  re-optimised into variable-time code.
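
The ``0``/``1`` contract can be checked by hand with a small sketch of
the mask idiom. The function names are hypothetical and the code only
illustrates the arithmetic; it is not the crate's implementation, and
the length assertion in the copy is an assumption about how mismatched
buffers are handled.

.. code-block:: rust

    /// Hypothetical stand-in for `ct_select_u8`; `cond` must be 0 or 1.
    pub fn select_u8_sketch(a: u8, b: u8, cond: u8) -> u8 {
        // cond = 1 -> mask = 0xFF, cond = 0 -> mask = 0x00.
        let mask = 0u8.wrapping_sub(cond);
        b ^ (mask & (a ^ b))
    }

    /// Hypothetical stand-in for `ct_copy`: `dst` becomes `src` when
    /// `cond == 1` and is left unchanged when `cond == 0`.
    pub fn copy_sketch(dst: &mut [u8], src: &[u8], cond: u8) {
        // Assumption: lengths are public and must match.
        assert_eq!(dst.len(), src.len());
        let mask = 0u8.wrapping_sub(cond);
        for (d, s) in dst.iter_mut().zip(src.iter()) {
            // Both buffers are always read; the write is an XOR-mask update.
            *d ^= mask & (*d ^ *s);
        }
    }

With ``a = 0xAA`` and ``b = 0x55``, ``cond = 0xFF`` gives the mask
``0x01`` and the result ``0x54``, which is neither input. This is why
values other than ``0`` and ``1`` are rejected by the contract.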

Architecture dispatch
=====================

The ``silentops/src/ct/mod.rs`` file selects exactly one backend at
compile time, based on ``target_arch`` and the Cargo features listed
below.

.. list-table::
   :header-rows: 1
   :widths: 28 22 50

   * - Target
     - Feature
     - Implementation technique
   * - ``x86_64``
     - ``asm-x86_64``
     - Inline ``cmovne`` on values held in GPRs. Each call compiles to
       ``test`` + ``cmov`` that LLVM cannot introspect or rewrite.
   * - ``aarch64``
     - ``asm-aarch64``
     - ``csel``: one cycle, branch-free, and part of the base AArch64
       instruction set, so it is always available.
   * - ``thumbv7em`` / ``thumbv7m``
     - ``asm-thumbv7``
     - ``IT`` blocks + conditional execution; Cortex-M4/M7/M33
       guarantee fixed timing inside an ``IT`` block.
   * - ``thumbv6m`` (Cortex-M0 / M0+)
     - ``asm-thumbv6m``
     - No ``IT`` blocks and no conditional-move instruction; uses the
       same AND/OR/XOR bitwise mask as the generic fallback, but
       written as inline asm so the compiler cannot regenerate a branch.
   * - ``riscv32``
     - ``asm-riscv32``
     - No conditional move; uses AND/OR/XOR with a mask derived from
       ``neg``, hand-written in asm.
   * - any (default)
     - *none*
     - Pure Rust bitwise fallback. **Not recommended for production
       CT builds** — see the warning below.
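
Compile-time selection of this kind is usually expressed with
``#[cfg]`` attributes. The sketch below shows the general shape for two
of the targets in the table. The module name ``backend`` and the
``NAME`` constant are hypothetical, and the real ``ct/mod.rs`` may be
organised differently.

.. code-block:: rust

    // Hypothetical layout: exactly one `backend` module survives cfg
    // evaluation, so the rest of the crate can refer to it by one name.
    #[cfg(all(target_arch = "x86_64", feature = "asm-x86_64"))]
    mod backend {
        pub const NAME: &str = "x86_64 asm";
    }

    #[cfg(all(target_arch = "aarch64", feature = "asm-aarch64"))]
    mod backend {
        pub const NAME: &str = "aarch64 asm";
    }

    // Fallback: no matching target/feature pair was enabled.
    #[cfg(not(any(
        all(target_arch = "x86_64", feature = "asm-x86_64"),
        all(target_arch = "aarch64", feature = "asm-aarch64"),
    )))]
    mod backend {
        pub const NAME: &str = "generic";
    }

    /// Reports which backend the build selected.
    pub fn active_backend() -> &'static str {
        backend::NAME
    }

Built without any of the asm features, this sketch reports
``"generic"``, which corresponds to the last row of the table.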

Why the pure-Rust fallback is dangerous at ``opt-level >= 2``
-------------------------------------------------------------

The generic fallback writes each primitive as ``b ^ (mask & (a ^ b))``.
The LLVM back-end recognises this pattern. At ``opt-level = 2`` or
above it will frequently rewrite the ``ct_select`` wrapper (e.g. the
32-byte select in ``ml_kem::kem::ct_select``) into::

    test   ecx, ecx
    cmovne rdx, rsi     ; pointer CMOV
    cmovne r8,  rax
    movups xmm0, [rdx]  ; load from the selected address
    movups xmm1, [r8]

This sequence is a **secret-dependent pointer CMOV followed by a
load**. The cache line fetched then depends on the secret ``cond``, a
classic cache-timing leak that a local attacker can exploit.

This behaviour was confirmed in ``ctgrind`` runs against an early
build of ``quantica`` and is the entire reason the ``asm-x86_64``
backend exists. See :doc:`verification` for the ctgrind trace.
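
For comparison, a value-level select written with inline assembly might
look like the sketch below. It is an illustration of the ``test`` +
``cmovne`` technique only, not the contents of
``silentops/src/ct/x86_64.rs``; the function name is hypothetical, and
the non-x86_64 variant exists solely so the example compiles on other
targets.

.. code-block:: rust

    /// Hypothetical value-level select; `cond` must be 0 or 1.
    #[cfg(target_arch = "x86_64")]
    pub fn select_u8_cmov_sketch(a: u8, b: u8, cond: u8) -> u8 {
        // x86 has no 8-bit CMOV, so operands are widened to 32 bits.
        let mut out = b as u32;
        // SAFETY: register-only arithmetic, no memory or stack access.
        unsafe {
            core::arch::asm!(
                "test {c:e}, {c:e}",     // set ZF from cond
                "cmovne {out:e}, {a:e}", // out = a when cond != 0
                c = in(reg) cond as u32,
                a = in(reg) a as u32,
                out = inout(reg) out,
                options(pure, nomem, nostack),
            );
        }
        out as u8
    }

    /// Placeholder for other targets so the example builds everywhere.
    #[cfg(not(target_arch = "x86_64"))]
    pub fn select_u8_cmov_sketch(a: u8, b: u8, cond: u8) -> u8 {
        let mask = 0u8.wrapping_sub(cond);
        b ^ (mask & (a ^ b))
    }

Because both inputs are already in registers when the ``cmovne``
executes, the selection acts on values, not on addresses, and no
load depends on ``cond``.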

Recommended build profile
-------------------------

On ``x86_64`` hosts, build with at minimum::

    cargo build --release \
        -p quantica \
        --features asm-x86_64

The ``quantica_bench/ct-grind`` Cargo feature forwards
``silentops/asm-x86_64`` automatically, so builds intended for
side-channel verification always get the asm backend.

``core::hint::black_box`` shielding — design choice
---------------------------------------------------

The workspace ``SECURITY.md`` (Section 4.1) lists
``core::hint::black_box`` shielding as a workspace-wide rule
"wherever a CT mask is derived from a secret", because without it
LLVM (rustc 1.84+) is known to recover branches over the
``b ^ (mask & (a ^ b))`` idiom — exactly the pattern documented
above as the failure mode of the pure-Rust fallback.

In the quantica crate this rule is satisfied **structurally** by
delegating every CT decision to ``silentops::ct_*``, whose asm
backends (``asm-x86_64``, ``asm-aarch64``, ``asm-thumbv7``,
``asm-thumbv6m``, ``asm-riscv32``) bypass the LLVM optimiser
entirely. Consequently ``quantica/src/`` does not call
``core::hint::black_box`` directly anywhere — the asm backends are
the *stronger fix* mentioned in the same SECURITY.md row.

.. admonition:: Caveat — non-asm targets
   :class: important

   On architectures without an asm backend (notably WebAssembly
   through the ``quantica_wasm`` crate), the CT path falls back to
   ``silentops::ct::generic`` and the LLVM-recovers-branch hazard
   **does** apply. A planned hardening pass (no roadmap ID assigned
   yet — flagged here as a workspace residual) will add explicit
   ``core::hint::black_box`` calls inside
   ``silentops::ct::generic`` so every consumer (quantica + arcana)
   inherits the shielding regardless of target. Until that lands,
   WebAssembly builds of quantica should be considered *best-effort*
   on the CT axis.
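
A minimal sketch of what such shielding could look like follows. The
function name is hypothetical and this is not the planned patch. Note
also that the Rust documentation describes ``core::hint::black_box`` as
best-effort and explicitly offers no guarantee for cryptographic
purposes, so shielding of this kind reduces the hazard without removing
the need for verification.

.. code-block:: rust

    use core::hint::black_box;

    /// Hypothetical shielded select; `cond` must be 0 or 1.
    pub fn select_u8_shielded_sketch(a: u8, b: u8, cond: u8) -> u8 {
        // Ask the optimiser to treat the mask as an opaque value, so it
        // is less likely to recognise the select idiom and emit a branch
        // or a pointer CMOV. This is a hint, not a guarantee.
        let mask = black_box(0u8.wrapping_sub(cond));
        b ^ (mask & (a ^ b))
    }

The functional result is unchanged by the hint; only the optimiser's
view of ``mask`` differs.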

Source pointers
---------------

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Item
     - File
   * - Public API & re-exports
     - ``silentops/src/lib.rs``
   * - Module dispatch
     - ``silentops/src/ct/mod.rs``
   * - Generic (bit-twiddling) fallback
     - ``silentops/src/ct/generic.rs``
   * - x86_64 asm backend
     - ``silentops/src/ct/x86_64.rs``
   * - aarch64 asm backend
     - ``silentops/src/ct/aarch64.rs``
   * - thumbv7 asm backend
     - ``silentops/src/ct/thumbv7.rs``
   * - thumbv6m asm backend
     - ``silentops/src/ct/thumbv6m.rs``
   * - riscv32 asm backend
     - ``silentops/src/ct/riscv32.rs``
   * - CT unit tests (run on every arch)
     - ``silentops/src/ct/tests.rs``

ctgrind instrumentation — ``silentops::ct_grind``
=================================================

``ct_grind`` provides the small API needed to drive
Valgrind/memcheck-based CT verification, namely two marking functions
and an activity query:

.. code-block:: rust

    silentops::ct_grind::poison(buf);     // mark as secret
    silentops::ct_grind::unpoison(buf);   // mark as public again
    silentops::ct_grind::is_active();     // true only when the feature
                                          // is enabled AND the target is
                                          // x86_64-linux or aarch64-linux

The implementation emits the Valgrind client-request magic sequence
via stable ``core::arch::asm!``, with no C shim or third-party crate.
Surrounding ``compiler_fence(SeqCst)`` calls prevent LLVM from
reordering subsequent memory reads past a ``poison`` / ``unpoison``
call — a subtle but critical detail first identified during the
initial ``quantica_bench`` ctgrind bring-up.

When the ``ct-grind`` feature is disabled, or on unsupported
targets, all three functions compile to zero-cost no-ops so call
sites can stay unconditional (no ``#[cfg]`` walls in consumer
code).
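
A typical harness poisons the secret, runs the operation under test,
and unpoisons only the value that is meant to become public before
branching on it. The sketch below shows that pattern with local no-op
stubs in place of ``silentops::ct_grind``. The stub module, the
``&[u8]`` parameter type, and ``tag_matches_sketch`` are all
assumptions made for illustration.

.. code-block:: rust

    /// Hypothetical no-op stand-ins with assumed signatures; they mirror
    /// the documented behaviour when the feature is disabled.
    mod ct_grind_stub {
        pub fn poison(_buf: &[u8]) {}
        pub fn unpoison(_buf: &[u8]) {}
        pub fn is_active() -> bool {
            false
        }
    }

    /// Compares a secret tag with a candidate under the ctgrind pattern.
    pub fn tag_matches_sketch(secret: &[u8; 4], candidate: &[u8; 4]) -> bool {
        // 1. Mark the secret so memcheck tracks everything derived from it.
        ct_grind_stub::poison(secret);

        // 2. Operation under test: accumulate differences, no early exit.
        let mut diff = 0u8;
        for i in 0..4 {
            diff |= secret[i] ^ candidate[i];
        }

        // 3. Declare the verdict public before any branch depends on it.
        let verdict = [diff];
        ct_grind_stub::unpoison(&verdict);
        ct_grind_stub::unpoison(secret);
        verdict[0] == 0
    }

Under memcheck, a branch or memory index that depended on ``diff``
before step 3 would be reported as a use of uninitialised data, which is
the signal ctgrind relies on.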

The full methodology, the demo binary that validates the plumbing,
and the interpretation rules for memcheck output are covered in
:doc:`verification`.

Statistical timing verification — ``silentops::verify``
=======================================================

The ``verify`` module packages the Reparaz–Balasch–Verbauwhede
methodology :cite:`reparaz2017dudect` as a library — a tiny
``Xorshift64`` for class selection, an incremental Welch t-test
(``TTest``), a ``measure_ns`` sampler, and a ``report`` helper that
prints ``PASS`` / ``FAIL`` against ``T_THRESHOLD = 4.5``
(``p < 10⁻⁵``).
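
To make the statistic concrete, here is a minimal incremental Welch
t-test built on Welford's running mean and variance. It is a sketch
under stated assumptions, not the crate's ``TTest``: the type name,
method names, and the ``T_THRESHOLD_SKETCH`` constant are hypothetical,
and only the threshold value ``4.5`` is taken from the text above.

.. code-block:: rust

    /// Threshold value quoted in this chapter (hypothetical constant name).
    pub const T_THRESHOLD_SKETCH: f64 = 4.5;

    /// Hypothetical incremental Welch t-test over two input classes.
    #[derive(Default)]
    pub struct WelchSketch {
        n: [f64; 2],    // sample count per class
        mean: [f64; 2], // running mean per class
        m2: [f64; 2],   // running sum of squared deviations per class
    }

    impl WelchSketch {
        /// Welford update with one timing sample for class 0 or 1.
        pub fn push(&mut self, class: usize, x: f64) {
            self.n[class] += 1.0;
            let delta = x - self.mean[class];
            self.mean[class] += delta / self.n[class];
            self.m2[class] += delta * (x - self.mean[class]);
        }

        /// t = (mean0 - mean1) / sqrt(var0/n0 + var1/n1).
        /// Returns 0.0 until both classes hold at least two samples.
        pub fn t(&self) -> f64 {
            if self.n[0] < 2.0 || self.n[1] < 2.0 {
                return 0.0;
            }
            let var0 = self.m2[0] / (self.n[0] - 1.0);
            let var1 = self.m2[1] / (self.n[1] - 1.0);
            (self.mean[0] - self.mean[1]) / (var0 / self.n[0] + var1 / self.n[1]).sqrt()
        }

        /// True when |t| exceeds the threshold, i.e. a leak is suspected.
        pub fn leak_suspected(&self) -> bool {
            self.t().abs() > T_THRESHOLD_SKETCH
        }
    }

Because the update is incremental, a measurement loop can feed samples
one at a time and check the verdict periodically without storing the
full trace.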

Consumers write their own measurement loops on top of this API. The
canonical example is ``silentops/examples/ct_verify_pqc.rs``, which
exercises ML-KEM-768 Decaps, the ML-KEM Barrett reduction, and
ML-DSA-44 Sign / Verify.

``verify`` is the complement of ``ctgrind`` — it runs on real
hardware and catches timing leaks that depend on microarchitectural
state rather than pure control flow. A typical high-assurance run
uses both: ctgrind on the CI host for control-flow CT correctness,
dudect on the target hardware for timing-on-device evidence.