Shared side-channel primitives — `silentops`

The silentops crate is the single source of truth for the low-level side-channel primitives used by quantica (and by arcana on the classical side). Keeping these primitives in a separate crate means:

a single audit surface for CT correctness, independent of any particular algorithm;
architecture-specific assembly backends selected at compile time via Cargo features, so a downstream crate never embeds per-arch asm in its own source;
the same primitives are used by the statistical (dudect) and the client-request (ctgrind) side-channel verifiers, keeping test coverage coherent.

This chapter is a reference for those primitives. The threats they mitigate are described in Threat model; their algorithmic uses are described in ML-KEM — countermeasures, ML-DSA — countermeasures, and SLH-DSA — countermeasures.

Module layout

Module	Role
`silentops::ct`	Branchless constant-time primitives with architecture-specific assembly backends. `no_std`. Public functions are re-exported at the crate root so call sites write `silentops::ct_eq(...)`.
`silentops::ct_grind`	Valgrind memcheck client-request helpers (`poison` / `unpoison`) for constant-time verification. See Verification methodology.
`silentops::verify`	Dudect-style timing-leak detector (`TTest`, `Xorshift64`, `measure_ns`, `report`). `std` only.

Constant-time primitives — `silentops::ct`

Surface

Function	Signature (logical)	Purpose
`ct_select_u8`	`(a: u8, b: u8, cond: u8) -> u8`	Return `a` if `cond == 1`, `b` if `cond == 0` (no other values are valid; see Calling convention). Core branchless select.
`ct_select_i16`	`(a: i16, b: i16, cond: u8) -> i16`	Same, for NTT-domain coefficients in `i16`.
`ct_select_i32`	`(a: i32, b: i32, cond: u8) -> i32`	Same, for ML-DSA coefficients in `i32`.
`ct_eq`	`(a: &[u8], b: &[u8]) -> u8`	Constant-time byte-slice equality. No early exit; returns `1` on equality, `0` otherwise (including different lengths).
`ct_copy`	`(dst: &mut [u8], src: &[u8], cond: u8)`	Conditional in-place copy. Always reads both buffers; writes to `dst` are branch-free XOR-mask updates.
`ct_zeroize`	`(buf: &mut [u8])`	Volatile zeroization resistant to dead-store elimination (`write_volatile` + `compiler_fence(SeqCst)`).
`ct_zeroize_i16`	`(buf: &mut [i16])`	Same for polynomial coefficient arrays.
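
The two buffer-level rows of the table can be sketched in portable Rust as follows. This is a minimal illustration of the techniques the table names (volatile stores plus a compiler fence, and an XOR-mask conditional copy), not the crate's actual source; the function names `zeroize_sketch` and `copy_sketch` are hypothetical, and the length assertion is an assumption about how mismatched lengths are handled.

```rust
use core::ptr::write_volatile;
use core::sync::atomic::{compiler_fence, Ordering};

/// Hypothetical sketch of volatile zeroization.
/// Every byte is stored with `write_volatile`, so the stores cannot be
/// removed as dead even if `buf` is never read again.
#[inline(never)]
pub fn zeroize_sketch(buf: &mut [u8]) {
    for b in buf.iter_mut() {
        // SAFETY: `b` is a valid, aligned, exclusive reference to one u8.
        unsafe { write_volatile(b, 0) };
    }
    // Stop the compiler from moving later memory accesses above the wipe.
    compiler_fence(Ordering::SeqCst);
}

/// Hypothetical sketch of a conditional copy. `cond` must be exactly 0 or 1.
/// Both buffers are read on every call; `dst` is always written.
#[inline(never)]
pub fn copy_sketch(dst: &mut [u8], src: &[u8], cond: u8) {
    // Assumption: lengths are public, so a length check may panic.
    assert_eq!(dst.len(), src.len());
    let mask = 0u8.wrapping_sub(cond); // 0x00 when cond == 0, 0xFF when cond == 1
    for (d, s) in dst.iter_mut().zip(src) {
        // mask == 0x00 leaves *d unchanged; mask == 0xFF replaces it with *s.
        *d ^= mask & (*d ^ *s);
    }
}
```

As the later sections explain, a pure-Rust loop like `copy_sketch` is still subject to optimiser rewrites, which is why the crate ships asm backends.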

Calling convention

cond: u8 must be exactly 0 or 1. The primitives compute the mask via 0u8.wrapping_sub(cond), so 1 expands to 0xFF and 0 to 0x00; passing 0xFF or any other non-0/1 value yields a partial mask and breaks both the CT invariant and the functional result.
ct_eq always processes the full buffer length; it is O(n) in n = a.len() with a fixed per-byte cost. Buffer length itself is considered public.
The loop-based primitives (ct_eq, ct_copy, ct_zeroize) are marked #[inline(never)] so that LLVM does not re-inline the loop into caller contexts where it might re-optimise it into variable-time code.
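
The calling convention above can be made concrete with a small portable sketch. It shows the mask derivation, the masked select, and a full-length equality fold; the names `mask_u8_sketch`, `select_u8_sketch` and `eq_sketch` are hypothetical, and the code illustrates the contract rather than reproducing any backend.

```rust
/// Expand a 0/1 condition into an all-zeros or all-ones byte mask.
#[inline]
pub fn mask_u8_sketch(cond: u8) -> u8 {
    0u8.wrapping_sub(cond) // 0 -> 0x00, 1 -> 0xFF, anything else -> partial mask
}

/// Return `a` when cond == 1 and `b` when cond == 0, without a branch.
#[inline]
pub fn select_u8_sketch(a: u8, b: u8, cond: u8) -> u8 {
    b ^ (mask_u8_sketch(cond) & (a ^ b))
}

/// Full-length equality: 1 if equal, 0 otherwise. The length is treated as
/// public, so an early return on a length mismatch is acceptable.
#[inline(never)]
pub fn eq_sketch(a: &[u8], b: &[u8]) -> u8 {
    if a.len() != b.len() {
        return 0;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y; // accumulate differences, never exit early
    }
    // Branch-free "is non-zero": negating a non-zero value as u16 sets the
    // whole high byte, so bit 8 is 1 exactly when diff != 0.
    let nonzero = (((diff as u16).wrapping_neg() >> 8) as u8) & 1;
    1 ^ nonzero
}
```

With these definitions, `select_u8_sketch(0xAA, 0x55, 0xFF)` evaluates to `0x54`, which is neither input: the mask for `0xFF` is `0x01`, so only the low bit is taken from `a`. This is the functional breakage the first rule warns about.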

Architecture dispatch

The silentops/src/ct/mod.rs file selects exactly one backend at compile time based on target_arch and the cargo features listed below.

Target	Feature	Implementation technique
`x86_64`	`asm-x86_64`	Inline `cmovne` on values held in GPRs. Each call compiles to `test` + `cmov` that LLVM cannot introspect or rewrite.
`aarch64`	`asm-aarch64`	`csel` (one cycle, branch-free, unconditional in the AArch64 architecture).
`thumbv7em` / `thumbv7m`	`asm-thumbv7`	`IT` blocks + conditional execution; Cortex-M4/M7/M33 guarantee fixed timing inside an `IT` block.
`thumbv6m` (Cortex-M0 / M0+)	`asm-thumbv6m`	No `IT`, no `cmov`; falls back to AND/OR/XOR bitwise mask (same as the generic fallback) but written as inline asm so the compiler cannot regenerate a branch.
`riscv32`	`asm-riscv32`	No conditional move; uses AND/OR/XOR with a mask derived from `neg`, hand-written in asm.
any (default)	none	Pure Rust bitwise fallback. Not recommended for production CT builds — see the warning below.
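
A compile-time dispatch of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the contents of `silentops/src/ct/mod.rs`: the module name `backend` is hypothetical, the gating uses `target_arch` only (the real crate additionally requires the matching cargo feature), and the inline asm is one plausible way to write a `test` + `cmovne` select on 32-bit sub-registers.

```rust
// Hypothetical single-file layout. Exactly one `backend` module is compiled.

#[cfg(target_arch = "x86_64")]
mod backend {
    use core::arch::asm;

    /// Select via `test` + `cmovne`; the instruction sequence is opaque to LLVM.
    #[inline]
    pub fn select_u8(a: u8, b: u8, cond: u8) -> u8 {
        let mut r = b as u32; // start with `b`, overwrite with `a` if cond != 0
        // SAFETY: register-only arithmetic; no memory or stack is touched.
        unsafe {
            asm!(
                "test {c:e}, {c:e}",
                "cmovne {r:e}, {a:e}",
                c = in(reg) cond as u32,
                a = in(reg) a as u32,
                r = inout(reg) r,
                options(pure, nomem, nostack),
            );
        }
        r as u8
    }
}

#[cfg(not(target_arch = "x86_64"))]
mod backend {
    /// Generic bitwise fallback, same contract: cond must be 0 or 1.
    #[inline]
    pub fn select_u8(a: u8, b: u8, cond: u8) -> u8 {
        let mask = 0u8.wrapping_sub(cond);
        b ^ (mask & (a ^ b))
    }
}

// Call sites use one name regardless of which backend was compiled in.
pub use backend::select_u8;
```

Because the selection happens through `cfg`, a caller never names an architecture; it only calls `select_u8` and gets whichever backend the build selected.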

Why the pure-Rust fallback is dangerous at `opt-level >= 2`

The generic fallback writes each primitive as b ^ (mask & (a ^ b)). The LLVM back-end recognises this pattern. At opt-level = 2 or above it will frequently rewrite the ct_select wrapper (e.g. the 32-byte select in ml_kem::kem::ct_select) into:

test   ecx, ecx
cmovne rdx, rsi     ; pointer CMOV
cmovne r8,  rax
movups xmm0, [rdx]  ; load from the selected address
movups xmm1, [r8]

This is a secret-dependent pointer CMOV followed by a load. The cache line fetched then depends on the secret cond, a classical cache-timing leak recoverable by a local attacker.

This behaviour was confirmed in ctgrind runs against an early build of quantica and is the entire reason the asm-x86_64 backend exists. See Verification methodology for the ctgrind trace.

Recommended build profile

On x86_64 hosts, build with at minimum:

cargo build --release \
    -p quantica \
    --features asm-x86_64

The quantica_bench/ct-grind cargo feature forwards silentops/asm-x86_64 automatically, so builds intended for side-channel verification always get the asm backend.

`core::hint::black_box` shielding — design choice

The workspace SECURITY.md (Section 4.1) lists core::hint::black_box shielding as a workspace-wide rule “wherever a CT mask is derived from a secret”, because without it LLVM (rustc 1.84+) is known to recover branches over the b ^ (mask & (a ^ b)) idiom — exactly the pattern documented above as the failure mode of the pure-Rust fallback.

In the quantica crate this rule is satisfied structurally by delegating every CT decision to silentops::ct_*, whose asm backends (asm-x86_64, asm-aarch64, asm-thumbv7, asm-thumbv6m, asm-riscv32) bypass the LLVM optimiser entirely. Consequently quantica/src/ does not call core::hint::black_box directly anywhere — the asm backends are the stronger fix mentioned in the same SECURITY.md row.

Caveat — non-asm targets

On architectures without an asm backend (notably WebAssembly through the quantica_wasm crate), the CT path falls back to silentops::ct::generic and the LLVM-recovers-branch hazard does apply. A planned hardening pass (no roadmap ID assigned yet — flagged here as a workspace residual) will add explicit core::hint::black_box calls inside silentops::ct::generic so every consumer (quantica + arcana) inherits the shielding regardless of target. Until that lands, WebAssembly builds of quantica should be considered best-effort on the CT axis.
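
For illustration, the planned shielding might take roughly the following shape. This is a sketch under assumptions, not the hardening pass itself: the function names are hypothetical, and `core::hint::black_box` is documented by the standard library as a best-effort barrier only, so it reduces the chance of the optimiser recovering a branch or pointer select without guaranteeing it.

```rust
use core::hint::black_box;

/// Hypothetical shielded scalar select. `cond` must be exactly 0 or 1.
#[inline]
pub fn select_u8_shielded(a: u8, b: u8, cond: u8) -> u8 {
    // Hide the mask's provenance so the optimiser cannot easily see that it
    // is either all-zeros or all-ones. Best effort only.
    let mask = black_box(0u8.wrapping_sub(cond));
    b ^ (mask & (a ^ b))
}

/// Hypothetical shielded 32-byte select, the shape most at risk of being
/// rewritten into a pointer select followed by a load.
#[inline(never)]
pub fn select_bytes_shielded(out: &mut [u8; 32], a: &[u8; 32], b: &[u8; 32], cond: u8) {
    let mask = black_box(0u8.wrapping_sub(cond));
    for i in 0..32 {
        // Both inputs are read for every index, whatever `cond` is.
        out[i] = b[i] ^ (mask & (a[i] ^ b[i]));
    }
}
```

Any such change would still need to be confirmed with the ctgrind and dudect checks described in the following sections, since the barrier is not a language-level guarantee.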

Source pointers

Item	File
Public API & re-exports	`silentops/src/lib.rs`
Module dispatch	`silentops/src/ct/mod.rs`
Generic (bit-twiddling) fallback	`silentops/src/ct/generic.rs`
x86_64 asm backend	`silentops/src/ct/x86_64.rs`
aarch64 asm backend	`silentops/src/ct/aarch64.rs`
thumbv7 asm backend	`silentops/src/ct/thumbv7.rs`
thumbv6m asm backend	`silentops/src/ct/thumbv6m.rs`
riscv32 asm backend	`silentops/src/ct/riscv32.rs`
CT unit tests (run on every arch)	`silentops/src/ct/tests.rs`

ctgrind instrumentation — `silentops::ct_grind`

ct_grind provides the small API needed to drive Valgrind/memcheck-based CT verification: two marking functions and an activity probe.

silentops::ct_grind::poison(buf);     // mark as secret
silentops::ct_grind::unpoison(buf);   // mark as public again
silentops::ct_grind::is_active();     // true only when the feature
                                      // is enabled AND the target is
                                      // x86_64-linux or aarch64-linux

The implementation emits the Valgrind client-request magic sequence via stable core::arch::asm!, with no C shim or third-party crate. Surrounding compiler_fence(SeqCst) calls prevent LLVM from moving memory reads across a poison / unpoison call, a subtle but critical detail first identified during the initial quantica_bench ctgrind bring-up.

When the ct-grind feature is disabled, or on non-supported targets, all three functions compile to zero-cost no-ops so call sites can stay unconditional (no #[cfg] walls in consumer code).
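
The typical call pattern can be sketched with local stand-ins so that the example compiles on its own. Everything here is hypothetical: `ct_grind_stub` mimics the disabled (no-op) state described above, the `&[u8]` parameter type is an assumption about the real signatures, and `eq_stub` stands in for `silentops::ct_eq`.

```rust
/// Hypothetical stand-in for `silentops::ct_grind` with the feature disabled.
mod ct_grind_stub {
    #[inline(always)]
    pub fn poison(_buf: &[u8]) {} // would mark the bytes as secret
    #[inline(always)]
    pub fn unpoison(_buf: &[u8]) {} // would mark the bytes as public again
    #[inline(always)]
    pub fn is_active() -> bool {
        false
    }
}

/// Local stand-in for a constant-time equality primitive.
fn eq_stub(a: &[u8], b: &[u8]) -> u8 {
    if a.len() != b.len() {
        return 0; // length is public
    }
    let diff = a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y));
    1 ^ ((((diff as u16).wrapping_neg() >> 8) as u8) & 1)
}

/// Poison the secret, run the operation under test, then declassify only
/// the final verdict before branching on it.
pub fn verify_under_memcheck(secret: &[u8], guess: &[u8]) -> bool {
    ct_grind_stub::poison(secret);
    let verdict = [eq_stub(secret, guess)];
    ct_grind_stub::unpoison(&verdict); // the result is intended to be public
    ct_grind_stub::unpoison(secret);
    verdict[0] == 1 // safe to branch: verdict was declassified
}
```

The call sites carry no `#[cfg]` attributes, which is the point of the no-op fallback: the same harness builds with or without the feature.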

The full methodology, the demo binary that validates the plumbing, and the interpretation rules for memcheck output are covered in Verification methodology.

Statistical timing verification — `silentops::verify`

The verify module packages the Reparaz–Balasch–Verbauwhede methodology [RBV17] as a library — a tiny Xorshift64 for class selection, an incremental Welch t-test (TTest), a measure_ns sampler, and a report helper that prints PASS / FAIL against T_THRESHOLD = 4.5 (p < 10⁻⁵).
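
To make the statistical pieces concrete, here is a minimal sketch of a xorshift64 generator and an incremental Welch t-test. It is not the module's source: the type names carry a `Sketch` suffix to mark them as hypothetical, the running variance uses Welford's update as one reasonable choice, and the shift triple (13, 7, 17) is Marsaglia's standard one rather than a confirmed detail of `Xorshift64`.

```rust
/// Hypothetical xorshift64 generator for picking the input class per sample.
pub struct Xorshift64Sketch(u64);

impl Xorshift64Sketch {
    pub fn new(seed: u64) -> Self {
        // The all-zero state is a fixed point, so substitute a non-zero seed.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Running count, mean and sum of squared deviations (Welford's update).
#[derive(Default, Clone, Copy)]
struct Acc {
    n: f64,
    mean: f64,
    m2: f64,
}

impl Acc {
    fn push(&mut self, x: f64) {
        self.n += 1.0;
        let d = x - self.mean;
        self.mean += d / self.n;
        self.m2 += d * (x - self.mean);
    }
    fn var(&self) -> f64 {
        self.m2 / (self.n - 1.0) // sample variance; NaN or inf when n < 2
    }
}

/// Hypothetical incremental Welch t-test over two classes (0 and 1).
pub struct TTestSketch {
    c: [Acc; 2],
}

impl TTestSketch {
    pub fn new() -> Self {
        Self { c: [Acc::default(); 2] }
    }
    pub fn push(&mut self, class: usize, x: f64) {
        self.c[class].push(x);
    }
    /// Welch's t statistic; meaningful once both classes hold >= 2 samples.
    pub fn t(&self) -> f64 {
        let [a, b] = self.c;
        (a.mean - b.mean) / (a.var() / a.n + b.var() / b.n).sqrt()
    }
}

/// Same threshold value as the text above; the constant name is hypothetical.
pub const T_THRESHOLD_SKETCH: f64 = 4.5;
```

A run is judged by comparing `t().abs()` against the threshold: values above 4.5 indicate that the two classes have distinguishable timing distributions.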

Consumers write their own measurement loops on top of this API. The canonical example is silentops/examples/ct_verify_pqc.rs, which exercises ML-KEM-768 Decaps, the ML-KEM Barrett reduction, and ML-DSA-44 Sign / Verify.
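
A consumer-side measurement loop might be structured along the following lines. This is a hedged sketch, not an excerpt from `ct_verify_pqc.rs`: `measure_ns_sketch` and `collect_sketch` are hypothetical names, the fixed-versus-random class split is the usual dudect arrangement assumed here, and the 32-byte input size is arbitrary.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Time a single invocation of `op` in nanoseconds.
pub fn measure_ns_sketch<F: FnMut()>(mut op: F) -> f64 {
    let start = Instant::now();
    op();
    start.elapsed().as_nanos() as f64
}

/// One xorshift64 step, used for class selection and random inputs.
fn step(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Collect timings for two input classes, interleaved pseudo-randomly so
/// that drift in machine state affects both classes equally.
/// Class 0 uses a fixed input, class 1 a fresh pseudo-random input.
pub fn collect_sketch<F: FnMut(&[u8; 32])>(mut op: F, samples: usize) -> [Vec<f64>; 2] {
    let fixed = [0u8; 32];
    let mut state = 0x9E37_79B9_7F4A_7C15u64;
    let mut out = [Vec::new(), Vec::new()];
    for _ in 0..samples {
        let class = (step(&mut state) & 1) as usize;
        let mut input = fixed;
        if class == 1 {
            // Harness-side branch on a public value; not part of the code
            // under test.
            for chunk in input.chunks_mut(8) {
                chunk.copy_from_slice(&step(&mut state).to_le_bytes());
            }
        }
        // black_box keeps the call from being optimised away.
        let ns = measure_ns_sketch(|| op(black_box(&input)));
        out[class].push(ns);
    }
    out
}
```

The two sample vectors would then be fed into a Welch t-test and compared against the 4.5 threshold; real harnesses usually also discard warm-up samples and outliers, which this sketch omits.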

verify is the complement of ctgrind — it runs on real hardware and catches timing leaks that depend on microarchitectural state rather than pure control flow. A typical high-assurance run uses both: ctgrind on the CI host for control-flow CT correctness, dudect on the target hardware for timing-on-device evidence.
