Model Quantization: From FP16 to INT4, How to Balance Accuracy and Speed

news 2026/6/18 12:51:55

1. The essence of quantization: the math behind trading precision for speed

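Model quantization converts a model's parameters from high-precision floating point (FP32/FP16) to low-precision integers (INT8/INT4), which cuts both memory footprint and compute. For example, the weights of a 7B-parameter model take about 14 GB of GPU memory in FP16, 7 GB in INT8, and only 3.5 GB in INT4 (weights only; the KV cache and activations come on top). Halving memory means you can serve the model on a cheaper GPU, or run a larger batch on the same GPU.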
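The figures above are plain arithmetic, sketched below. The helper names are illustrative, and the per-group overhead assumes one FP16 scale stored for every 128 weights, which is an assumption about the storage layout rather than something fixed by any particular format.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return num_params * bits_per_weight / 8 / 1e9


def bits_with_group_overhead(bits: int, group_size: int = 128, scale_bits: int = 16) -> float:
    """Effective bits per weight when each group also stores one scale (assumed FP16)."""
    return bits + scale_bits / group_size


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")

# Per-group INT4 is slightly larger than the ideal 3.5 GB once scales are counted.
print(f"INT4 + group scales: {weight_memory_gb(7e9, bits_with_group_overhead(4)):.2f} GB")
```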

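But quantization is not simple truncation. Naively rounding FP16 weights to INT8 with a poorly chosen range can cost a great deal of accuracy, potentially more than 30%. The mathematical basis of quantization is an affine mapping: the floating-point range [rmin, rmax] is mapped linearly onto an integer range, [0, 255] for unsigned INT8 or [0, 15] for INT4. The precision of the mapping depends on how tightly that range fits the data. If the weights are concentrated in [-0.1, 0.1] but the mapped range is set to [-1.0, 1.0], most integer codes are wasted: only about 27 of the 256 INT8 codes are ever used (fewer than 5 effective bits), and with INT4 just 2 or 3 of the 16 codes.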
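To make the wasted-range effect concrete, here is a minimal pure-Python sketch of the affine mapping. The function names and the evenly spaced test weights are illustrative assumptions; real quantizers differ in how they round and store the zero point.

```python
def affine_params(rmin: float, rmax: float, bits: int) -> tuple[float, int]:
    """Scale and integer zero point that map [rmin, rmax] onto [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    scale = (rmax - rmin) / qmax
    zero_point = round(-rmin / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int, bits: int) -> int:
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))  # clamp to the valid code range


def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale


def codes_used(values, rmin, rmax, bits) -> int:
    """How many distinct integer codes the values actually occupy."""
    scale, zp = affine_params(rmin, rmax, bits)
    return len({quantize(v, scale, zp, bits) for v in values})


def max_roundtrip_error(values, rmin, rmax, bits) -> float:
    scale, zp = affine_params(rmin, rmax, bits)
    return max(abs(v - dequantize(quantize(v, scale, zp, bits), scale, zp)) for v in values)


# Hypothetical weights spread evenly over [-0.1, 0.1].
weights = [i / 1000 - 0.1 for i in range(201)]

for rmin, rmax in [(-1.0, 1.0), (-0.1, 0.1)]:
    print(
        f"range [{rmin}, {rmax}]: "
        f"{codes_used(weights, rmin, rmax, 8)} INT8 codes used, "
        f"max error {max_roundtrip_error(weights, rmin, rmax, 8):.5f}"
    )
```

Fitting the range to the data uses far more of the available codes and shrinks the worst-case round-trip error by roughly the same factor as the range itself.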

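The hard part is maximizing inference speedup while minimizing accuracy loss. That requires trade-offs among quantization granularity, calibration strategy, and mixed precision.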

2. Comparing quantization schemes: how accuracy improves from per-tensor to per-group quantization

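From coarse to fine granularity, the main schemes are per-tensor, per-channel, per-group, and mixed-precision quantization. The finer the granularity, the better accuracy is preserved, but the higher the storage and compute overhead.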

Figure: evolution of quantization granularity (Per-Tensor → Per-Channel → Per-Group → Mixed-Precision)

Per-tensor quantization: accuracy ★★, overhead ★. Use case: CV models with uniformly distributed weights.
Per-channel quantization: accuracy ★★★, overhead ★★. Use case: NLP models with large differences between channels.
Per-group quantization: accuracy ★★★★, overhead ★★★. Use case: LLMs with unevenly distributed weights.
Mixed-precision quantization: accuracy ★★★★★, overhead ★★★★. Use case: protecting sensitive layers when accuracy requirements are high.

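Per-tensor quantization uses a single scale factor for the entire weight matrix; it is the simplest to implement and loses the most accuracy. Per-channel quantization uses an independent scale factor for each output channel, which improves accuracy noticeably and is the standard approach for INT8. Per-group quantization splits each channel further into groups (typically group_size=128), each with its own scale factor, and is the mainstream approach for INT4 quantization of LLMs (for example GPTQ and AWQ). Mixed-precision quantization keeps sensitive layers in FP16 and quantizes the rest to INT4, balancing accuracy against speed.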
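The effect of granularity can be seen with a small NumPy experiment. This is a sketch under stated assumptions: the weight matrix is synthetic (rows with very different magnitudes plus one outlier column), the quantizer is plain symmetric round-to-nearest, and `fake_quant` and `per_group` are illustrative helper names rather than any library's API.

```python
import numpy as np


def fake_quant(w: np.ndarray, axis, bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest quantize + dequantize.

    One scale is shared by each slice reduced along `axis`
    (axis=None means a single scale for the whole tensor).
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    absmax = np.max(np.abs(w), axis=axis, keepdims=True)
    scale = np.maximum(absmax, 1e-8) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


def per_group(w: np.ndarray, group_size: int = 128, bits: int = 4) -> np.ndarray:
    """One scale per (output row, group of `group_size` input columns)."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0, "sketch assumes in_features divisible by group_size"
    grouped = w.reshape(out_f, in_f // group_size, group_size)
    return fake_quant(grouped, axis=-1, bits=bits).reshape(out_f, in_f)


# Synthetic weights: row magnitudes span two orders of magnitude,
# and one input column carries outliers.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 512)) * np.logspace(-2, 0, 64)[:, None]
w[:, 7] *= 8.0


def mse(approx: np.ndarray) -> float:
    return float(np.mean((w - approx) ** 2))


results = {
    "per-tensor": mse(fake_quant(w, axis=None)),
    "per-channel": mse(fake_quant(w, axis=1)),
    "per-group": mse(per_group(w)),
}
for name, value in results.items():
    print(f"{name:12s} MSE = {value:.6f}")
```

On data like this the error drops at each step, because a finer scale is never larger than the coarser one covering the same weights. The price is one extra scale per channel or per group.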

3. A Python implementation of the GPTQ quantization algorithm

Below is a simplified implementation of the GPTQ algorithm that shows the core logic of per-group quantization:

# quantization/gptq_simple.py
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class QuantConfig:
    """Quantization settings."""
    bits: int = 4                   # bit width (4 or 8)
    group_size: int = 128           # number of input columns per group
    symmetric: bool = True          # use symmetric quantization
    calibration_samples: int = 128  # calibration set size (unused in this simplified code)


class GPTQQuantizer:
    """
    GPTQ quantizer: per-group quantization guided by Hessian information.

    Idea: after quantizing one column of weights, use second-order
    information to adjust the not-yet-quantized columns so that they
    compensate for the error and the layer's output error stays small.
    """

    def __init__(self, config: QuantConfig):
        self.config = config
        self.qmax = 2 ** config.bits - 1  # largest integer code
        self.qmin = 0

    def quantize_weight(
        self,
        weight: np.ndarray,
        hessian: Optional[np.ndarray] = None,
    ) -> dict:
        """
        Quantize a weight matrix.

        Args:
            weight: weight matrix of shape [out_features, in_features]
            hessian: upper-triangular Cholesky factor of the inverse Hessian,
                shape [in_features, in_features], used for error compensation;
                None falls back to plain round-to-nearest

        Returns:
            dict with the integer weights, scales, and zero points
        """
        out_features, in_features = weight.shape
        group_size = self.config.group_size
        num_groups = (in_features + group_size - 1) // group_size

        q_weights = np.zeros_like(weight, dtype=np.int32)
        scales = np.zeros((out_features, num_groups), dtype=np.float32)
        zeros = np.zeros((out_features, num_groups), dtype=np.float32)

        # Working copy: columns to the right of `col` absorb the compensation updates
        w = weight.astype(np.float32)

        # Quantize column by column (the key to GPTQ is the fixed column order)
        for col in range(in_features):
            group_idx = col // group_size

            # Recompute scale and zero point at the start of each group,
            # using the already-compensated weights
            if col % group_size == 0:
                group_end = min(col + group_size, in_features)
                scales[:, group_idx], zeros[:, group_idx] = (
                    self._compute_scale_and_zero(w[:, col:group_end])
                )
            scale = scales[:, group_idx]
            zero = zeros[:, group_idx]

            q_col = self._quantize_vector(w[:, col], scale, zero)
            q_weights[:, col] = q_col

            # Spread the quantization error over all remaining columns
            # (GPTQ's core compensation step)
            if hessian is not None and col + 1 < in_features:
                dequant_col = self._dequantize_vector(q_col, scale, zero)
                err = (w[:, col] - dequant_col) / (hessian[col, col] + 1e-8)
                w[:, col + 1:] -= np.outer(err, hessian[col, col + 1:])

        return {
            "q_weights": q_weights,
            "scales": scales,
            "zeros": zeros,
            "config": {
                "bits": self.config.bits,
                "group_size": self.config.group_size,
                "symmetric": self.config.symmetric,
            },
        }

    def _compute_scale_and_zero(
        self, weights: np.ndarray
    ) -> tuple[np.ndarray, np.ndarray]:
        """Compute one scale and zero point per row of `weights`."""
        if self.config.symmetric:
            # Symmetric: codes are centered on qmax / 2 and the stored zero point is 0.
            # Note that 0.0 is therefore not exactly representable.
            w_max = np.max(np.abs(weights), axis=-1, keepdims=True)
            scale = w_max / (self.qmax / 2)
            scale = np.clip(scale, 1e-8, None)  # avoid division by zero
            zero = np.zeros_like(scale)
        else:
            # Asymmetric: zero point derived from the value range
            w_min = np.min(weights, axis=-1, keepdims=True)
            w_max = np.max(weights, axis=-1, keepdims=True)
            scale = (w_max - w_min) / self.qmax
            scale = np.clip(scale, 1e-8, None)
            zero = -w_min / scale
        return scale.squeeze(-1), zero.squeeze(-1)

    def _quantize_vector(
        self, vector: np.ndarray, scale: np.ndarray, zero: np.ndarray
    ) -> np.ndarray:
        """Quantize a float vector to integer codes."""
        if self.config.symmetric:
            q = np.round(vector / scale + self.qmax / 2)
        else:
            q = np.round(vector / scale + zero)
        return np.clip(q, self.qmin, self.qmax).astype(np.int32)

    def _dequantize_vector(
        self, q_vector: np.ndarray, scale: np.ndarray, zero: np.ndarray
    ) -> np.ndarray:
        """Map integer codes back to floats."""
        if self.config.symmetric:
            return (q_vector - self.qmax / 2).astype(np.float32) * scale
        return (q_vector - zero).astype(np.float32) * scale


def evaluate_quantization(
    original_weight: np.ndarray,
    quantized_result: dict,
) -> dict:
    """Evaluate quantization quality against the original weights."""
    q = quantized_result["q_weights"]
    s = quantized_result["scales"]
    z = quantized_result["zeros"]
    bits = quantized_result["config"]["bits"]
    group_size = quantized_result["config"]["group_size"]

    # Dequantize group by group
    dequantized = np.zeros_like(original_weight, dtype=np.float32)
    num_groups = (original_weight.shape[1] + group_size - 1) // group_size
    for g in range(num_groups):
        start = g * group_size
        end = min(start + group_size, original_weight.shape[1])
        scale = s[:, g:g + 1]
        zero = z[:, g:g + 1]
        if quantized_result["config"]["symmetric"]:
            dequantized[:, start:end] = (q[:, start:end] - (2 ** bits - 1) / 2) * scale
        else:
            dequantized[:, start:end] = (q[:, start:end] - zero) * scale

    # Error metrics
    error = original_weight - dequantized
    mse = np.mean(error ** 2)
    cos_sim = np.sum(original_weight * dequantized) / (
        np.linalg.norm(original_weight) * np.linalg.norm(dequantized) + 1e-8
    )
    return {
        "mse": float(mse),
        "cosine_similarity": float(cos_sim),
        "max_error": float(np.max(np.abs(error))),
        "compression_ratio": 16 / bits,  # FP16 -> N-bit, ignoring scale/zero storage
    }
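The listing takes the Hessian information as an input and does not show where it comes from. In the GPTQ reference implementation the compensation step works with the upper Cholesky factor of the inverse of a damped Hessian H = 2XᵀX built from calibration activations. The sketch below shows one way such a factor could be computed with NumPy. The function name, the 1% damping ratio, and the random calibration data are illustrative assumptions.

```python
import numpy as np


def inverse_hessian_cholesky(x: np.ndarray, damp_ratio: float = 0.01) -> np.ndarray:
    """Upper-triangular U with U.T @ U == inv(H), where H = 2 X^T X + damping.

    x: calibration inputs to the layer, shape [n_samples, in_features].
    """
    h = 2.0 * x.T @ x
    # Damping keeps H well conditioned when some input features are nearly constant.
    h += damp_ratio * np.mean(np.diag(h)) * np.eye(h.shape[0])
    h_inv = np.linalg.inv(h)
    h_inv = (h_inv + h_inv.T) / 2      # remove tiny numerical asymmetry
    lower = np.linalg.cholesky(h_inv)  # NumPy returns L with L @ L.T == h_inv
    return lower.T                     # transpose to get the upper factor


# Hypothetical calibration batch: 256 samples, 16 input features.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))
u = inverse_hessian_cholesky(x)
print(u.shape, bool(np.allclose(u, np.triu(u))))
```

Row `col` of this factor tells the quantizer how strongly an error in column `col` should be pushed onto each later column.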

4. The accuracy limits of quantization: which layers should not be quantized

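Quantization is not an all-or-nothing decision. Layers differ widely in how sensitive they are to quantization, and a one-size-fits-all strategy can cause accuracy to fall off a cliff.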

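Attention layers vs. FFN layers. Reported experiments suggest that FFN weights in LLMs are relatively robust to quantization: accuracy loss after INT4 quantization is typically within 1-2%. The Query/Key projection matrices in attention layers are far more sensitive, and INT4 quantization can reduce accuracy by more than 10%. The explanation usually given is that the Q/K matrices have a narrower value distribution, so the quantization resolution is insufficient.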

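The special role of the first and last layers. The first and last layers of a model are the most sensitive to quantization. The first layer processes the raw input embeddings and the last layer produces the output logits, so their precision directly affects how the model perceives its input and the quality of its output. Mixed-precision strategies usually keep the first and last layers in FP16 and use INT4 for the layers in between.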
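The two rules of thumb above (protect the Q/K projections, protect the first and last layers) can be written down as a per-module precision plan. This is only a sketch: the module names follow a common Llama-style naming scheme and are assumptions, as is the choice to treat the first and last transformer blocks as the "first and last layers".

```python
# Hypothetical module names; adjust to the actual model's parameter names.
MODULES = ("q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj")
SENSITIVE = frozenset({"q_proj", "k_proj"})


def precision_plan(num_layers: int) -> dict[str, str]:
    """Map every linear module to 'fp16' or 'int4' using the two heuristics."""
    plan = {}
    for layer in range(num_layers):
        is_edge = layer in (0, num_layers - 1)  # first or last block
        for module in MODULES:
            keep_fp16 = is_edge or module in SENSITIVE
            plan[f"layers.{layer}.{module}"] = "fp16" if keep_fp16 else "int4"
    return plan


plan = precision_plan(num_layers=32)
int4_share = sum(p == "int4" for p in plan.values()) / len(plan)
print(f"{int4_share:.1%} of modules quantized to INT4")
```

In practice the sensitive set is better determined by measuring each layer's accuracy impact on representative data than by fixing it up front.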

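The challenge of quantizing activations. Weight quantization is comparatively easy because weights are fixed at inference time. Activations (intermediate results) have a value range that changes with the input, which makes it hard to find a stable scale factor. LLM activations also contain outliers: a few channels have values far larger than the rest, which drags down overall quantization accuracy. SmoothQuant addresses this with a mathematically equivalent transformation that "smooths" the activation outliers into the weights.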
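The equivalence behind SmoothQuant is Y = (X diag(s)⁻¹)(diag(s) W), with a per-input-channel factor s_j = max|X_j|^α / max|W_j|^(1-α). A minimal NumPy sketch follows, assuming α = 0.5 and synthetic data with one outlier channel; the function name `smooth` is illustrative.

```python
import numpy as np


def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Migrate activation outliers into the weights without changing x @ w.

    x: activations [tokens, in_features]; w: weights [in_features, out_features].
    """
    act_max = np.max(np.abs(x), axis=0)  # per input channel
    w_max = np.max(np.abs(w), axis=1)    # per input channel
    s = act_max ** alpha / np.maximum(w_max, 1e-8) ** (1 - alpha)
    s = np.maximum(s, 1e-8)
    return x / s, w * s[:, None]


def channel_spread(a: np.ndarray) -> float:
    """Largest per-channel absmax relative to the median channel."""
    col_max = np.max(np.abs(a), axis=0)
    return float(np.max(col_max) / np.median(col_max))


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))
x[:, 5] *= 50.0  # one outlier activation channel
w = rng.normal(scale=0.05, size=(64, 16))

x_s, w_s = smooth(x, w)
print("output unchanged:", bool(np.allclose(x_s @ w_s, x @ w)))
print("activation spread before/after:", channel_spread(x), channel_spread(x_s))
```

The output is unchanged while the activation range becomes much flatter, which is what makes a single activation scale workable. The weights become somewhat harder to quantize in exchange, and α controls that balance.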

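Cases where INT4 should be avoided: code-generation and math-reasoning models, since these tasks are highly sensitive to numerical precision and INT4 quantization can introduce logic errors; models with fewer than 1B parameters, since small models have little redundancy and little tolerance for quantization error; and models that still need fine-tuning, since fine-tuning a quantized model tends to work noticeably worse, so fine-tune first and quantize afterwards.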

5. Summary

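The core of model quantization is finding the balance point between accuracy and efficiency. Per-group quantization (group_size=128) is the mainstream approach for INT4 quantization of LLMs, and GPTQ, which uses Hessian information to compensate for quantization error, is one of the most effective and widely used post-training quantization algorithms. The key decision is which layers can be quantized and which cannot: a mixed-precision strategy (FP16 for sensitive layers, INT4 for the rest) is the recommended way to preserve accuracy. Before deploying a quantized model, run quality regression tests on your own business data rather than relying on general benchmarks alone. Quantization is a means, not an end: if the accuracy loss at INT4 is unacceptable, falling back to INT8 or FP16 is the right choice.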
