Before You Fine-Tune: A Quant-Retention Framework for Edge SLM Feasibility

Name: Edge SLM feasibility framework — decision thresholds
Creator: Porsync
Published: 2026-07-05
Keywords: Edge AI, Small Language Models, Quantisation, Feasibility, AI Strategy, Methodology

Most edge-AI projects start by training a model and end by discovering it never fit the target device. This is the inverted method: a measurement-first framework that answers 'can a small model do this job on this hardware at all?' before a single training run — using quant-retention ratios, behavioural adherence rates, and a hardware envelope as the go/no-go gate.

Finding

Edge SLM feasibility can be decided without training anything: benchmark candidate models across quantisation levels and treat retention (metric at quant ÷ metric at 8-bit) as the gate — below 0.85 is degradation, below 0.70 is collapse. If no candidate clears the gate inside the device's RAM and latency envelope, fine-tuning cannot save the project and should not be started.

Edge SLM feasibility framework — decision thresholds

The locked metric definitions and go/no-go thresholds of the framework. These are the numbers a feasibility verdict is read against; they are fixed before any model is measured so the verdict cannot be negotiated after the fact.

| Measurement | Value | Note |
| --- | --- | --- |
| Quant retention: acceptable | ≥ 0.85 | metric@quant ÷ metric@8-bit baseline |
| Quant retention: degradation | ≥ 0.70 and < 0.85 | usable only if the task tolerates errors |
| Quant retention: collapse | < 0.70 | hard fail for the model+quant pair |
| Behavioural adherence | % of prompts where the model follows the task role | e.g. withholding an answer when asked to hint, or returning schema-valid JSON |
| Hardware envelope | tok/s, TTFT, peak RAM | measured per size/quant pair; maps to a device band |
| Training runs required | 0 | the entire verdict is produced with public models, prompt-only |
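
One way to keep these thresholds from drifting is to encode them as an immutable object and record a hash of it before the first measurement. The sketch below is a minimal illustration, not part of the framework itself: the `Thresholds` class, its field names, and the fingerprinting step are all assumptions introduced here. Only the 0.85 and 0.70 cut-offs come from the table.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Thresholds:
    """Hypothetical container for the retention cut-offs in the table."""
    acceptable: float = 0.85  # retention at or above this is acceptable
    collapse: float = 0.70    # retention below this is a hard fail

    def band(self, retention: float) -> str:
        """Map a retention ratio to the band named in the table."""
        if retention >= self.acceptable:
            return "acceptable"
        if retention >= self.collapse:
            return "degradation"
        return "collapse"

    def fingerprint(self) -> str:
        """SHA-256 of the thresholds, to be written down before any run."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


LOCKED = Thresholds()
print("locked thresholds:", LOCKED.fingerprint()[:12])
for value in (0.91, 0.85, 0.72, 0.69):
    print(value, LOCKED.band(value))
```

If the fingerprint recorded before the study differs from the one computed when the verdict is read, the thresholds changed after the results arrived.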

The failure mode this prevents

The default edge-AI project plan is: collect data, fine-tune a model, then squeeze it onto the device. It fails in a specific, expensive way — months into training, the team discovers that at the quantisation level the device actually requires, the model's reasoning collapses. The training was real; the feasibility was assumed.

The fix is to invert the order. Build the ruler before the product: measure whether *any* candidate model, at the quantisation the hardware demands, can do the job at an acceptable level **prompt-only**. Fine-tuning improves a model that basically works. It does not rescue one that has collapsed under quantisation.

The framework

Four locked components, defined before anything is measured:

**1. A fixed measurement set, not a training corpus.** A few hundred items, gold-answered, covering the actual task — including its hardest slice (for a multilingual task, the lowest-resource language; for a reasoning task, the multi-step items). It stays small on purpose: it is a ruler. The moment it starts growing into training data, the study has silently become the product.

**2. Metrics locked in advance.** Task accuracy, plus the behavioural rates that decide real usability: does the model stay in its role (a tutor that hints without revealing answers, an extractor that returns valid JSON)? Report the hardest slice separately — **never average it away**. An aggregate score that blends an easy majority with a failing minority is how infeasible projects get greenlit.
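
To make the averaging problem concrete, the sketch below scores a toy result set per slice and in aggregate. Everything in it is illustrative: the slice names, the twelve made-up results, and the helpers `is_valid_json_object` and `rate_by_slice` are assumptions for this example, and the adherence check is deliberately simple (the output parses as a JSON object containing the required keys).

```python
import json
from collections import defaultdict


def is_valid_json_object(text: str, required_keys: set[str]) -> bool:
    """Adherence check: output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def rate_by_slice(results: list[dict], field: str) -> dict[str, float]:
    """Fraction of items where `field` is truthy, per slice plus an 'all' aggregate."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for row in results:
        for key in (row["slice"], "all"):
            totals[key] += 1
            hits[key] += bool(row[field])
    return {key: hits[key] / totals[key] for key in totals}


# Made-up results: an easy majority that succeeds and a hard minority that fails.
results = [
    {"slice": "high_resource", "correct": True, "output": '{"answer": "x"}'}
    for _ in range(9)
] + [
    {"slice": "low_resource", "correct": i == 0, "output": "The answer is x"}
    for i in range(3)
]
for row in results:
    row["adherent"] = is_valid_json_object(row["output"], {"answer"})

accuracy = rate_by_slice(results, "correct")
adherence = rate_by_slice(results, "adherent")
for name in ("all", "high_resource", "low_resource"):
    print(f"{name:14s} accuracy={accuracy[name]:.2f} adherence={adherence[name]:.2f}")
```

In this toy data the aggregate accuracy is above 0.8 while the low-resource slice gets one item in three right and never returns valid JSON, which is the pattern a blended score hides.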

**3. Quant retention as the headline number.** For each model, establish the 8-bit score as the baseline, then re-run the identical benchmark at each lower quantisation. Retention = metric@quant ÷ metric@8-bit. At or above 0.85, the compression is roughly free. From 0.70 up to 0.85, the task decides whether the loss is tolerable. Below 0.70, that model+quant pair is out, regardless of how good the model looked unquantised. Low-resource and multi-step capabilities usually collapse first, which is exactly why the hardest slice is reported separately.
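
The retention calculation can be sketched in a few lines. The quant labels (`q8`, `q4`), the slice names, and the scores below are made up for illustration, and `retention_table` and `worst_slice` are hypothetical helper names. The only parts taken from the framework are the ratio itself and the habit of reading the worst slice rather than the average.

```python
def retention_table(
    scores: dict[str, dict[str, float]], baseline: str = "q8"
) -> dict[str, dict[str, float]]:
    """scores[quant][slice] is a metric; returns retention[quant][slice] vs the baseline."""
    base = scores[baseline]
    table: dict[str, dict[str, float]] = {}
    for quant, by_slice in scores.items():
        if quant == baseline:
            continue  # the baseline's retention is 1.0 by definition
        table[quant] = {}
        for name, value in by_slice.items():
            if base[name] <= 0:
                raise ValueError(f"baseline score for {name!r} must be positive")
            table[quant][name] = value / base[name]
    return table


def worst_slice(retention_by_slice: dict[str, float]) -> tuple[str, float]:
    """Return the slice with the lowest retention, since that one decides the gate."""
    name = min(retention_by_slice, key=retention_by_slice.get)
    return name, retention_by_slice[name]


# Illustrative scores for one model at two quantisation levels.
scores = {
    "q8": {"overall": 0.80, "hardest": 0.60},
    "q4": {"overall": 0.76, "hardest": 0.39},
}
table = retention_table(scores)
name, value = worst_slice(table["q4"])
print(f"q4 overall retention: {table['q4']['overall']:.2f}")
print(f"q4 worst slice: {name} at {value:.2f}")
```

With these numbers the overall retention looks comfortable while the hardest slice falls below 0.70, so the pair would be read as a collapse.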

**4. A hardware envelope, measured not estimated.** Tokens per second, time-to-first-token (TTFT), and peak RAM per size/quant pair, mapped against what the target device class actually has. A model that clears the quality gate but needs 6 GB of RAM on a 3 GB device fails feasibility just as hard as one that collapsed.
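
A minimal sketch of the envelope check follows, assuming the three measurements have already been taken on the target device class. The `Envelope` and `Measurement` classes, their field names, the pair labels, and the tok/s and TTFT limits are hypothetical. The 6 GB on a 3 GB device case mirrors the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Limits of the target device class (hypothetical field names)."""
    max_peak_ram_gb: float
    min_tokens_per_s: float
    max_ttft_s: float


@dataclass(frozen=True)
class Measurement:
    """Measured values for one size/quant pair."""
    pair: str
    peak_ram_gb: float
    tokens_per_s: float
    ttft_s: float


def envelope_violations(m: Measurement, env: Envelope) -> list[str]:
    """List every limit the measurement breaks; an empty list means it fits."""
    violations = []
    if m.peak_ram_gb > env.max_peak_ram_gb:
        violations.append(f"peak RAM {m.peak_ram_gb} GB > {env.max_peak_ram_gb} GB")
    if m.tokens_per_s < env.min_tokens_per_s:
        violations.append(f"{m.tokens_per_s} tok/s < {env.min_tokens_per_s} tok/s")
    if m.ttft_s > env.max_ttft_s:
        violations.append(f"TTFT {m.ttft_s} s > {env.max_ttft_s} s")
    return violations


# Illustrative numbers only.
device = Envelope(max_peak_ram_gb=3.0, min_tokens_per_s=8.0, max_ttft_s=2.0)
too_big = Measurement("model-a@q8", peak_ram_gb=6.0, tokens_per_s=12.0, ttft_s=1.1)
fits = Measurement("model-a@q4", peak_ram_gb=2.4, tokens_per_s=15.0, ttft_s=0.9)

for m in (too_big, fits):
    problems = envelope_violations(m, device)
    print(m.pair, "fits" if not problems else problems)
```

Returning the list of violations rather than a bare boolean keeps the reason for a failure visible in the study's output.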

Reading the verdict

The go/no-go is mechanical by design: at least one model+quant pair must clear the retention gate on every metric — including the hardest slice — *inside* the hardware envelope. If one does, the project is feasible and the same benchmark becomes the regression harness for every later fine-tune. If none does, the honest conclusions are "wait for better small models" or "change the hardware target" — and the framework has delivered that answer for the cost of evenings on a desktop GPU, not a training budget.
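
Because the rule is mechanical, it reduces to a short function. The sketch below assumes the strict reading of the gate (retention of at least 0.85 on every reported metric, hardest slice included) and takes the envelope result as an already-computed boolean. The `Candidate` class, the pair labels, and the retention values are invented for illustration.

```python
from dataclasses import dataclass

GATE = 0.85  # assumed strict gate: the "acceptable" retention threshold


@dataclass(frozen=True)
class Candidate:
    """One model+quant pair with its measured retentions (hypothetical structure)."""
    pair: str
    retention: dict[str, float]  # metric or slice name -> retention vs 8-bit
    fits_envelope: bool


def clears_gate(c: Candidate, gate: float = GATE) -> bool:
    """True only if the pair fits the device and every retention meets the gate."""
    return (
        c.fits_envelope
        and bool(c.retention)  # no measurements means no pass
        and all(r >= gate for r in c.retention.values())
    )


def verdict(candidates: list[Candidate]) -> tuple[str, list[str]]:
    """Go if at least one pair clears the gate inside the envelope, else no-go."""
    passing = [c.pair for c in candidates if clears_gate(c)]
    return ("go" if passing else "no-go"), passing


# Illustrative candidates.
candidates = [
    Candidate("model-a@q4", {"accuracy": 0.95, "hardest_slice": 0.65}, True),
    Candidate("model-b@q5", {"accuracy": 0.97, "hardest_slice": 0.90}, False),
    Candidate("model-b@q4", {"accuracy": 0.93, "hardest_slice": 0.87}, True),
]
print(verdict(candidates))
print(verdict(candidates[:2]))
```

The first call reports a go on the strength of a single passing pair. The second shows the no-go case, where one candidate fails its hardest slice and the other does not fit the device.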

The discipline that makes this work is separating measurement from judgment: the harness produces numbers; a human reads them against thresholds that were locked before the first run. Feasibility studies fail as instruments when the thresholds move after the results arrive.

Why this is a strategy tool, not just an eval

"Should we build this?" is the most expensive question in applied AI, and most organisations answer it with a pilot of the *product*. This framework answers it with a pilot of the *physics* — model capability under the real deployment constraint — at roughly 1% of the cost. It is the pattern behind every adoption question worth asking: define the gate before you fall in love with the build.
