Shared side-channel primitives — silentops

The silentops crate is the single source of truth for the low-level side-channel primitives used by quantica (and by arcana on the classical side). Keeping these primitives in a separate crate means:

  • a single audit surface for constant-time (CT) correctness, independent of any particular algorithm;

  • architecture-specific assembly backends selected at compile time via Cargo features, so a downstream crate never embeds per-arch asm in its own source;

  • one set of primitives shared by the statistical (dudect) and the client-request (ctgrind) side-channel verifiers, keeping test coverage coherent.

This chapter is a reference for those primitives. The threats they mitigate are described in Threat model; their algorithmic uses are covered in the countermeasures chapters for ML-KEM, ML-DSA, and SLH-DSA.

Module layout

| Module | Role |
|---|---|
| silentops::ct | Branchless constant-time primitives with architecture-specific assembly backends. no_std. Public functions are re-exported at the crate root so call sites write silentops::ct_eq(...). |
| silentops::ct_grind | Valgrind memcheck client-request helpers (poison / unpoison) for constant-time verification. See Verification methodology. |
| silentops::verify | Dudect-style timing-leak detector (TTest, Xorshift64, measure_ns, report). std only. |

Constant-time primitives — silentops::ct

Surface

| Function | Signature (logical) | Purpose |
|---|---|---|
| ct_select_u8 | (a: u8, b: u8, cond: u8) -> u8 | Return a if cond == 1, b if cond == 0. Core branchless select. |
| ct_select_i16 | (a: i16, b: i16, cond: u8) -> i16 | Same, for NTT-domain coefficients in i16. |
| ct_select_i32 | (a: i32, b: i32, cond: u8) -> i32 | Same, for ML-DSA coefficients in i32. |
| ct_eq | (a: &[u8], b: &[u8]) -> u8 | Constant-time byte-slice equality. No early exit; returns 1 on equality, 0 otherwise (including different lengths). |
| ct_copy | (dst: &mut [u8], src: &[u8], cond: u8) | Conditional in-place copy. Always reads both buffers; writes to dst are branch-free XOR-mask updates. |
| ct_zeroize | (buf: &mut [u8]) | Volatile zeroization resistant to dead-store elimination (write_volatile + compiler_fence(SeqCst)). |
| ct_zeroize_i16 | (buf: &mut [i16]) | Same for polynomial coefficient arrays. |
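As a reading aid for the table above, the following is a minimal pure-Rust sketch of what the slice primitives compute. The `*_model` names are hypothetical and this is not the crate's implementation: it only models the functional behaviour, and as plain Rust it carries none of the timing guarantees of the asm backends. The equal-length assertion in the copy model is an assumption.

```rust
use core::ptr::write_volatile;
use core::sync::atomic::{compiler_fence, Ordering};

/// Functional model of ct_eq: 1 if the slices are equal, 0 otherwise.
/// Lengths are treated as public, so the length check may return early.
fn ct_eq_model(a: &[u8], b: &[u8]) -> u8 {
    if a.len() != b.len() {
        return 0;
    }
    // OR together the XOR of every byte pair; no early exit on a mismatch.
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    // Collapse "diff != 0" into one bit without a comparison:
    // for any non-zero diff, (diff | -diff) has its top bit set.
    let nonzero = (diff | diff.wrapping_neg()) >> 7;
    nonzero ^ 1
}

/// Functional model of ct_copy: dst <- src when cond == 1, unchanged when cond == 0.
fn ct_copy_model(dst: &mut [u8], src: &[u8], cond: u8) {
    assert_eq!(dst.len(), src.len()); // assumption: equal, public lengths
    let mask = 0u8.wrapping_sub(cond); // 0 -> 0x00, 1 -> 0xFF
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        // Every byte is read and written regardless of cond.
        *d ^= mask & (*d ^ *s);
    }
}

/// Functional model of ct_zeroize: volatile stores followed by a compiler fence.
fn ct_zeroize_model(buf: &mut [u8]) {
    for byte in buf.iter_mut() {
        // SAFETY: `byte` is a valid, aligned, exclusive reference.
        unsafe { write_volatile(byte, 0) };
    }
    compiler_fence(Ordering::SeqCst);
}
```

With these models, comparing b"abc" against b"abd" yields 0, and a copy with cond = 0 leaves dst untouched while still reading and rewriting every byte.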

Calling convention

  • cond: u8 must be exactly 0 or 1. The primitives compute the mask via 0u8.wrapping_sub(cond), which is 0x00 for 0 and 0xFF for 1; passing 0xFF or any other non-0/1 value breaks both the CT invariant and the functional result.

  • ct_eq always processes the full buffer length; it is O(n) in n = a.len() with a fixed per-byte cost. Buffer length itself is considered public.

  • The loop-based primitives (ct_eq, ct_copy, ct_zeroize) are marked #[inline(never)] so that LLVM does not re-inline the loop into caller contexts where it might re-optimise it into variable-time code.
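The 0-or-1 rule in the calling convention can be checked by hand. The sketch below models the mask arithmetic in plain Rust; `select_u8_model` and `normalize_cond` are hypothetical names for illustration only, and the normalising helper is an assumption about how a caller might prepare a flag, not part of the documented API.

```rust
/// Pure-Rust model of the select contract: a if cond == 1, b if cond == 0.
fn select_u8_model(a: u8, b: u8, cond: u8) -> u8 {
    let mask = 0u8.wrapping_sub(cond); // 0 -> 0x00, 1 -> 0xFF
    b ^ (mask & (a ^ b))
}

/// Hypothetical helper: map any non-zero byte to 1 and zero to 0,
/// using only arithmetic (no comparison, no branch in the source).
fn normalize_cond(x: u8) -> u8 {
    (x | x.wrapping_neg()) >> 7
}
```

With a = 0xAA and b = 0x55, cond = 1 returns 0xAA and cond = 0 returns 0x55. Passing cond = 0xFF makes the mask 0x01, so the result is 0x54, which is neither input. That is the functional breakage the rule warns about.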

Architecture dispatch

The silentops/src/ct/mod.rs file selects exactly one backend at compile time, based on target_arch and the Cargo features listed below.

| Target | Feature | Implementation technique |
|---|---|---|
| x86_64 | asm-x86_64 | Inline cmovne on values held in GPRs. Each call compiles to test + cmov that LLVM cannot introspect or rewrite. |
| aarch64 | asm-aarch64 | csel (one cycle, branch-free, unconditional in the AArch64 architecture). |
| thumbv7em / thumbv7m | asm-thumbv7 | IT blocks + conditional execution; Cortex-M4/M7/M33 guarantee fixed timing inside an IT block. |
| thumbv6m (Cortex-M0 / M0+) | asm-thumbv6m | No IT, no cmov; falls back to AND/OR/XOR bitwise mask (same as the generic fallback) but written as inline asm so the compiler cannot regenerate a branch. |
| riscv32 | asm-riscv32 | No conditional move; uses AND/OR/XOR with a mask derived from neg, hand-written in asm. |
| any (default) | none | Pure Rust bitwise fallback. Not recommended for production CT builds; see the warning below. |
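To illustrate the compile-time selection described above, here is a minimal sketch of one function with two cfg-selected bodies. It is not the crate's source: the name `select_u32` is hypothetical, and the sketch keys on target_arch alone, whereas the table indicates the real dispatch also depends on a Cargo feature such as asm-x86_64.

```rust
/// x86_64 backend sketch: test + cmovne on 32-bit registers.
#[cfg(target_arch = "x86_64")]
fn select_u32(a: u32, b: u32, cond: u8) -> u32 {
    let mut result = b;
    // SAFETY: register-only instructions; no memory or stack access.
    unsafe {
        core::arch::asm!(
            "test {c:e}, {c:e}",     // set ZF from cond
            "cmovne {dst:e}, {a:e}", // dst = a when cond != 0
            c = in(reg) u32::from(cond),
            a = in(reg) a,
            dst = inout(reg) result,
            options(pure, nomem, nostack),
        );
    }
    result
}

/// Fallback sketch for every other target: the bitwise-mask idiom.
#[cfg(not(target_arch = "x86_64"))]
fn select_u32(a: u32, b: u32, cond: u8) -> u32 {
    let mask = 0u32.wrapping_sub(u32::from(cond)); // 0 or 0xFFFF_FFFF
    b ^ (mask & (a ^ b))
}
```

Call sites see a single `select_u32` regardless of target. Exactly one body is compiled in, which mirrors the "exactly one backend" property stated for ct/mod.rs.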

Why the pure-Rust fallback is dangerous at opt-level >= 2

The generic fallback writes each primitive as b ^ (mask & (a ^ b)). The LLVM back-end recognises this pattern. At opt-level = 2 or above it will frequently rewrite the ct_select wrapper (e.g. the 32-byte select in ml_kem::kem::ct_select) into:

test   ecx, ecx
cmovne rdx, rsi     ; pointer CMOV
cmovne r8,  rax
movups xmm0, [rdx]  ; load from the selected address
movups xmm1, [r8]

— a secret-dependent pointer CMOV followed by a load. The cache line fetched then depends on the secret cond, which is a classical cache-timing leak recoverable by a local attacker.

This behaviour was confirmed in ctgrind runs against an early build of quantica and is the entire reason the asm-x86_64 backend exists. See Verification methodology for the ctgrind trace.

core::hint::black_box shielding — design choice

The workspace SECURITY.md (Section 4.1) lists core::hint::black_box shielding as a workspace-wide rule “wherever a CT mask is derived from a secret”, because without it LLVM (rustc 1.84+) is known to recover branches over the b ^ (mask & (a ^ b)) idiom — exactly the pattern documented above as the failure mode of the pure-Rust fallback.

In the quantica crate this rule is satisfied structurally by delegating every CT decision to silentops::ct_*, whose asm backends (asm-x86_64, asm-aarch64, asm-thumbv7, asm-thumbv6m, asm-riscv32) bypass the LLVM optimiser entirely. Consequently quantica/src/ does not call core::hint::black_box directly anywhere — the asm backends are the stronger fix mentioned in the same SECURITY.md row.

Caveat — non-asm targets

On architectures without an asm backend (notably WebAssembly through the quantica_wasm crate), the CT path falls back to silentops::ct::generic and the LLVM-recovers-branch hazard does apply. A planned hardening pass (no roadmap ID assigned yet — flagged here as a workspace residual) will add explicit core::hint::black_box calls inside silentops::ct::generic so every consumer (quantica + arcana) inherits the shielding regardless of target. Until that lands, WebAssembly builds of quantica should be considered best-effort on the CT axis.
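The planned hardening could look roughly like the sketch below, which wraps the mask and the XOR difference in core::hint::black_box. The function names are hypothetical and the placement of the black_box calls is an assumption; the source does not say how the pass will be implemented. The standard library describes black_box as best-effort, so this reduces the likelihood of LLVM recovering a branch without guaranteeing it.

```rust
use core::hint::black_box;

/// Generic-fallback select with the mask hidden from the optimiser (sketch).
fn select_u8_shielded(a: u8, b: u8, cond: u8) -> u8 {
    // black_box makes the mask opaque, so LLVM is less likely to
    // recognise the idiom and turn it back into a conditional.
    let mask = black_box(0u8.wrapping_sub(cond));
    b ^ (mask & black_box(a ^ b))
}

/// 32-byte select built byte by byte, so no pointer is chosen from cond.
fn select_bytes_shielded(out: &mut [u8; 32], a: &[u8; 32], b: &[u8; 32], cond: u8) {
    for i in 0..32 {
        out[i] = select_u8_shielded(a[i], b[i], cond);
    }
}
```

Selecting values rather than addresses keeps both inputs loaded on every call, which is the property the pointer-CMOV rewrite destroys.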

Source pointers

| Item | File |
|---|---|
| Public API & re-exports | silentops/src/lib.rs |
| Module dispatch | silentops/src/ct/mod.rs |
| Generic (bit-twiddling) fallback | silentops/src/ct/generic.rs |
| x86_64 asm backend | silentops/src/ct/x86_64.rs |
| aarch64 asm backend | silentops/src/ct/aarch64.rs |
| thumbv7 asm backend | silentops/src/ct/thumbv7.rs |
| thumbv6m asm backend | silentops/src/ct/thumbv6m.rs |
| riscv32 asm backend | silentops/src/ct/riscv32.rs |
| CT unit tests (run on every arch) | silentops/src/ct/tests.rs |

ctgrind instrumentation — silentops::ct_grind

ct_grind provides the API needed to drive Valgrind/memcheck-based CT verification, namely two marking functions and one status query:

silentops::ct_grind::poison(buf);     // mark as secret
silentops::ct_grind::unpoison(buf);   // mark as public again
silentops::ct_grind::is_active();     // true only when the feature
                                      // is enabled AND the target is
                                      // x86_64-linux or aarch64-linux

The implementation emits the Valgrind client-request magic sequence via stable core::arch::asm!, with no C shim or third-party crate. Surrounding compiler_fence(SeqCst) calls prevent LLVM from moving subsequent memory reads ahead of a poison / unpoison call, a subtle but critical detail first identified during the initial quantica_bench ctgrind bring-up.

When the ct-grind feature is disabled, or on non-supported targets, all three functions compile to zero-cost no-ops so call sites can stay unconditional (no #[cfg] walls in consumer code).
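A typical call pattern is sketched below: poison the secret, run the code under test, then unpoison anything that is legitimately public before branching on it. The three stub functions are hypothetical stand-ins that mimic the no-op behaviour described above so the example compiles without the crate; their `&[u8]` parameter type is an assumption, as is the tag-comparison scenario.

```rust
// Hypothetical stand-ins for silentops::ct_grind with the feature disabled.
fn poison(_buf: &[u8]) {}
fn unpoison(_buf: &[u8]) {}
fn is_active() -> bool {
    false
}

/// Compare a secret tag against a received one under ctgrind-style marking.
fn check_tag_under_ctgrind(secret_tag: &[u8; 16], received: &[u8; 16]) -> bool {
    // Mark the secret; under memcheck, any branch or address derived
    // from it would be reported from this point on.
    poison(secret_tag);

    // Code under test: branch-free accumulation of differences.
    let mut diff = [0u8; 1];
    for i in 0..16 {
        diff[0] |= secret_tag[i] ^ received[i];
    }

    // The accept/reject verdict is public, so declassify it before the
    // comparison below; otherwise the final branch would be flagged.
    unpoison(&diff);
    unpoison(secret_tag);
    diff[0] == 0
}
```

Because the stubs are unconditional no-ops, the same function body runs unchanged in ordinary builds, which is the "no #[cfg] walls" property the text describes.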

The full methodology, the demo binary that validates the plumbing, and the interpretation rules for memcheck output are covered in Verification methodology.

Statistical timing verification — silentops::verify

The verify module packages the Reparaz–Balasch–Verbauwhede methodology [RBV17] as a library — a tiny Xorshift64 for class selection, an incremental Welch t-test (TTest), a measure_ns sampler, and a report helper that prints PASS / FAIL against T_THRESHOLD = 4.5 (p < 10⁻⁵).
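The two numerical pieces named above can be sketched in a few lines. This is an illustrative model, not the crate's code: the type names `Xorshift64Sketch` and `WelchSketch` are hypothetical, the 13/7/17 shift triple is one standard xorshift64 choice and may differ from what silentops uses, and only the threshold value 4.5 comes from the text.

```rust
const T_THRESHOLD: f64 = 4.5; // value quoted in the text

/// Tiny xorshift64 generator (seed must be non-zero).
struct Xorshift64Sketch(u64);

impl Xorshift64Sketch {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Incremental Welch t-test over two classes, using Welford updates.
#[derive(Default)]
struct WelchSketch {
    n: [f64; 2],
    mean: [f64; 2],
    m2: [f64; 2], // running sum of squared deviations per class
}

impl WelchSketch {
    fn push(&mut self, class: usize, x: f64) {
        self.n[class] += 1.0;
        let delta = x - self.mean[class];
        self.mean[class] += delta / self.n[class];
        self.m2[class] += delta * (x - self.mean[class]);
    }

    /// t = (mean0 - mean1) / sqrt(var0/n0 + var1/n1), with sample variances.
    fn t(&self) -> f64 {
        let var0 = self.m2[0] / (self.n[0] - 1.0);
        let var1 = self.m2[1] / (self.n[1] - 1.0);
        (self.mean[0] - self.mean[1]) / (var0 / self.n[0] + var1 / self.n[1]).sqrt()
    }

    /// A leak is suspected when |t| exceeds the threshold.
    fn leaks(&self) -> bool {
        self.t().abs() > T_THRESHOLD
    }
}
```

The incremental form needs only three numbers per class, so a measurement loop can feed it millions of samples without storing them.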

Consumers write their own measurement loops on top of this API. The canonical example is silentops/examples/ct_verify_pqc.rs, which exercises ML-KEM-768 Decaps, the ML-KEM Barrett reduction, and ML-DSA-44 Sign / Verify.
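A consumer-side measurement loop might be structured as in the sketch below, following the usual dudect fixed-versus-random split. Everything here is hypothetical: `measure_ns_sketch` only imitates what a measure_ns-style sampler could do, `toy_target` stands in for the operation under test, and the class-selection rule is an assumption. It is not the code in ct_verify_pqc.rs.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical sampler: wall-clock nanoseconds for one call of `f`.
fn measure_ns_sketch<F: FnMut()>(mut f: F) -> u64 {
    let start = Instant::now();
    f();
    start.elapsed().as_nanos() as u64
}

/// Toy operation under test (an assumption): XOR-fold of a 32-byte input.
fn toy_target(input: &[u8; 32]) -> u8 {
    input.iter().fold(0u8, |acc, b| acc ^ b)
}

/// Collect `n` timing samples, split into class 0 (fixed input)
/// and class 1 (pseudo-random input). `seed` must be non-zero.
fn collect_samples(n: usize, seed: u64) -> [Vec<u64>; 2] {
    let mut state = seed;
    let mut samples = [Vec::new(), Vec::new()];
    for _ in 0..n {
        // xorshift64 step drives both class choice and input bytes.
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let class = (state >> 63) as usize;

        let mut input = [0u8; 32]; // class 0: all-zero fixed input
        if class == 1 {
            for chunk in input.chunks_mut(8) {
                chunk.copy_from_slice(&state.to_le_bytes());
            }
        }

        // black_box keeps the call and its result from being optimised away.
        let ns = measure_ns_sketch(|| {
            black_box(toy_target(black_box(&input)));
        });
        samples[class].push(ns);
    }
    samples
}
```

The two sample vectors would then be fed to a t-test; interleaving the classes at random, rather than measuring them in separate batches, keeps slow drift in the machine's state from showing up as a false leak.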

verify is the complement of ctgrind: it runs on real hardware and catches timing leaks that depend on microarchitectural state, which ctgrind's check for secret-dependent branches and memory addresses cannot observe. A typical high-assurance run uses both: ctgrind on the CI host for secret-dependent branches and memory accesses, dudect on the target hardware for timing-on-device evidence.