Before You Fine-Tune: A Quant-Retention Framework for Edge SLM Feasibility
Most edge-AI projects start by training a model and end by discovering it never fit the target device. This is the inverted method: a measurement-first framework that answers 'can a small model do this job on this hardware at all?' before a single training run — using quant-retention ratios, behavioural adherence rates, and a hardware envelope as the go/no-go gate.
Porsync Research · Published 2026-07-05
Finding
Edge SLM feasibility can be decided without training anything: benchmark candidate models across quantisation levels and treat retention (metric at quant ÷ metric at 8-bit) as the gate — below 0.85 is degradation, below 0.70 is collapse. If no candidate clears the gate inside the device's RAM and latency envelope, fine-tuning cannot save the project and should not be started.
Edge SLM feasibility framework — decision thresholds
The locked metric definitions and go/no-go thresholds of the framework. These are the numbers a feasibility verdict is read against; they are fixed before any model is measured so the verdict cannot be negotiated after the fact.
| Measurement | Threshold or definition | Note |
|---|---|---|
| Quant retention (acceptable) | ≥ 0.85 | metric@quant ÷ metric@8-bit baseline |
| Quant retention (degradation) | ≥ 0.70 and < 0.85 | usable only if the task tolerates errors |
| Quant retention (collapse) | < 0.70 | hard fail for the model+quant pair |
| Behavioural adherence | % of prompts where the model follows the task role | e.g. withholding an answer when asked to hint, or returning schema-valid JSON |
| Hardware envelope | tokens/s, time to first token (TTFT), peak RAM | measured per size/quant pair; maps to a device band |
| Training runs required | 0 | the entire verdict is produced with public models, prompt-only |
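The retention thresholds in the table map onto a very small helper. What follows is a minimal Python sketch under stated assumptions, not the framework's own harness: the names `retention` and `classify_retention` are hypothetical, and the boundary handling (0.85 counts as acceptable, 0.70 counts as degradation) is read off the table rows.

```python
# Thresholds copied from the table; the constant names are illustrative.
ACCEPTABLE = 0.85  # at or above this ratio: acceptable
COLLAPSE = 0.70    # below this ratio: collapse


def retention(metric_at_quant: float, metric_at_8bit: float) -> float:
    """Retention = metric at the lower quantisation / metric at the 8-bit baseline."""
    if metric_at_8bit <= 0:
        # A zero or negative baseline makes the ratio meaningless.
        raise ValueError("8-bit baseline metric must be positive")
    return metric_at_quant / metric_at_8bit


def classify_retention(r: float) -> str:
    """Map a retention ratio onto the three bands from the table."""
    if r >= ACCEPTABLE:
        return "acceptable"
    if r >= COLLAPSE:
        return "degradation"
    return "collapse"
```

As an illustration with made-up scores: a model that scores 0.80 at 8-bit and 0.60 at a lower quantisation has a retention of about 0.75, which `classify_retention` places in the degradation band.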
The failure mode this prevents
The default edge-AI project plan is: collect data, fine-tune a model, then squeeze it onto the device. It fails in a specific, expensive way — months into training, the team discovers that at the quantisation level the device actually requires, the model's reasoning collapses. The training was real; the feasibility was assumed.
The fix is to invert the order. Build the ruler before the product: measure whether *any* candidate model, at the quantisation the hardware demands, can do the job at an acceptable level **prompt-only**. Fine-tuning improves a model that basically works. It does not rescue one that has collapsed under quantisation.
The framework
Four locked components, defined before anything is measured:
**1. A fixed measurement set, not a training corpus.** A few hundred items, gold-answered, covering the actual task — including its hardest slice (for a multilingual task, the lowest-resource language; for a reasoning task, the multi-step items). It stays small on purpose: it is a ruler. The moment it starts growing into training data, the study has silently become the product.
**2. Metrics locked in advance.** Task accuracy, plus the behavioural rates that decide real usability: does the model stay in its role (a tutor that hints without revealing answers, an extractor that returns valid JSON)? Report the hardest slice separately — **never average it away**. An aggregate score that blends an easy majority with a failing minority is how infeasible projects get greenlit.
**3. Quant retention as the headline number.** For each model, establish the 8-bit score as the baseline, then re-run the identical benchmark at each lower quantisation. Retention = metric@quant ÷ metric@8-bit. At or above 0.85, the compression is roughly free. From 0.70 up to 0.85, the task decides whether the loss is tolerable. Below 0.70, that model+quant pair is out, regardless of how good the model looked unquantised. Low-resource and multi-step capabilities usually collapse first, which is exactly why the hardest slice is reported separately.
**4. A hardware envelope, measured not estimated.** Tokens per second, time-to-first-token (TTFT), and peak RAM for each size/quant pair, compared against what the target device class actually has. A model that clears the quality gate but needs 6 GB of RAM on a 3 GB device fails feasibility just as hard as one that collapsed.
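Component 2 is the easiest to get wrong in code, so here is a minimal Python sketch of per-slice reporting. It assumes each benchmark result is a dict with the hypothetical fields `slice`, `correct`, and `adhered`. The function `is_schema_valid_json` is one illustrative adherence check for the JSON-extractor role; a real task would define its own check.

```python
import json
from collections import defaultdict


def is_schema_valid_json(output: str, required_keys: set[str]) -> bool:
    """Illustrative adherence check: output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def per_slice_rates(records: list[dict]) -> dict[str, dict[str, float]]:
    """Accuracy and adherence per slice, so the hardest slice is never averaged away."""
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "adhered": 0})
    for r in records:
        t = totals[r["slice"]]
        t["n"] += 1
        t["correct"] += int(r["correct"])
        t["adhered"] += int(r["adhered"])
    return {
        name: {"accuracy": t["correct"] / t["n"], "adherence": t["adhered"] / t["n"]}
        for name, t in totals.items()
    }


def aggregate_accuracy(records: list[dict]) -> float:
    """The blended score, computed only to contrast it with the per-slice view."""
    return sum(int(r["correct"]) for r in records) / len(records)
```

With made-up data of eight easy items all answered correctly and two hard items both answered wrongly, `aggregate_accuracy` reports 0.8 while `per_slice_rates` reports 0.0 on the hard slice, which is the gap an averaged score hides.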
Reading the verdict
The go/no-go is mechanical by design: at least one model+quant pair must clear the retention gate on every metric, including the hardest slice, *inside* the hardware envelope. If one does, the project is feasible and the same benchmark becomes the regression harness for every later fine-tune. If none does, the honest conclusions are "wait for better small models" or "change the hardware target", and the framework has delivered that answer for the cost of a few evenings on a desktop GPU, not a training budget.
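Because the verdict is mechanical, it can be written down as a pure function. The sketch below is one possible reading, with assumptions flagged: it takes "clearing the retention gate" to mean retention of at least 0.85 on every reported metric, with the hardest slice stored as its own metric key. The `Envelope` and `Candidate` types and their field names are hypothetical, and the device limits would have to come from the target device band, not from this code.

```python
from dataclasses import dataclass

RETENTION_GATE = 0.85  # assumption: the gate is the "acceptable" threshold


@dataclass(frozen=True)
class Envelope:
    """Hypothetical device limits, locked before any model is measured."""
    min_tok_per_s: float
    max_ttft_s: float
    max_peak_ram_gb: float


@dataclass(frozen=True)
class Candidate:
    """Measured results for one model+quant pair."""
    model: str
    quant: str
    retention: dict[str, float]  # metric name -> retention; hardest slice is its own key
    tok_per_s: float
    ttft_s: float
    peak_ram_gb: float


def clears_gate(c: Candidate, env: Envelope) -> bool:
    """True only if every metric clears the gate AND the pair fits the device."""
    quality_ok = bool(c.retention) and all(
        r >= RETENTION_GATE for r in c.retention.values()
    )
    hardware_ok = (
        c.tok_per_s >= env.min_tok_per_s
        and c.ttft_s <= env.max_ttft_s
        and c.peak_ram_gb <= env.max_peak_ram_gb
    )
    return quality_ok and hardware_ok


def verdict(candidates: list[Candidate], env: Envelope) -> tuple[str, list[Candidate]]:
    """'go' if at least one pair clears the gate inside the envelope, else 'no-go'."""
    passing = [c for c in candidates if clears_gate(c, env)]
    return ("go", passing) if passing else ("no-go", [])
```

Freezing both dataclasses is a deliberate choice in this sketch: the thresholds and measurements cannot be edited in place once the harness has produced them, which mirrors the requirement that thresholds are locked before the first run.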
The discipline that makes this work is separating measurement from judgment: the harness produces numbers; a human reads them against thresholds that were locked before the first run. Feasibility studies fail as instruments when the thresholds move after the results arrive.
Why this is a strategy tool, not just an eval
"Should we build this?" is the most expensive question in applied AI, and most organisations answer it with a pilot of the *product*. This framework answers it with a pilot of the *physics* — model capability under the real deployment constraint — at roughly 1% of the cost. It is the pattern behind every adoption question worth asking: define the gate before you fall in love with the build.