AI Inference Performance Tuning: Communication Optimization for Tensor Parallelism and Pipeline Parallelism

Published: 2026/7/29 13:58:58

1. The Single-GPU Limit: The Memory Wall in Large-Model Inference

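When a model's weights exceed the memory of a single GPU, model parallelism becomes the practical way to deploy it. LLaMA-70B needs about 140 GB for its weights in FP16, far more than a single A100-80GB can hold. Even with INT8 quantization, 70 GB of weights plus the KV cache still pushes against the single-GPU limit.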
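The arithmetic behind these figures can be reproduced in a few lines. This is a rough sketch, not a profiler: the function names are hypothetical, and the KV-cache configuration (80 layers, 8 KV heads, head dimension 128) is an assumption taken from the published Llama 2 70B configuration rather than from this article.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for model weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: one K and one V tensor per layer."""
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / 1e9


if __name__ == "__main__":
    params = 70e9
    print(f"FP16 weights: {weight_memory_gb(params, 2):.0f} GB")
    print(f"INT8 weights: {weight_memory_gb(params, 1):.0f} GB")
    # Assumed config: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
    cache = kv_cache_gb(80, 8, 128, seq_len=4096, batch_size=16)
    print(f"KV cache for 16 requests x 4096 tokens: {cache:.1f} GB")
```

Under these assumptions, the INT8 weights plus a moderately sized KV cache already exceed 80 GB, which is why quantization alone does not remove the need for parallelism.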

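The two mainstream model-parallel strategies are Tensor Parallelism (TP) and Pipeline Parallelism (PP). TP shards the weights of every layer across multiple GPUs, which then compute the same layer in parallel. PP assigns different layers to different GPUs, forming a pipeline.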

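The core challenge for both strategies is communication overhead. TP must synchronize with an All-Reduce after each sharded computation within a layer, and PP must pass intermediate activations between stages. If this overhead is handled poorly, it cancels out the speedup from parallel computation. This article looks at communication patterns, optimization strategies, and engineering practice to show how to keep the communication overhead of parallel large-model inference as low as possible.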

2. Communication Patterns: How Data Flows in TP and PP

2.1 Communication patterns of the two strategies

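```mermaid
flowchart TD
    subgraph "Tensor Parallelism: intra-layer sharding"
        direction TB
        TP1["GPU 0: first half of Q/K/V weights"] --> TP2["All-Reduce<br/>sync attention output"]
        TP3["GPU 1: second half of Q/K/V weights"] --> TP2
        TP2 --> TP4["GPU 0: first half of FFN weights"]
        TP2 --> TP5["GPU 1: second half of FFN weights"]
        TP4 --> TP6["All-Reduce<br/>sync FFN output"]
        TP5 --> TP6
        TP7["Communication profile:<br/>2 All-Reduce per layer<br/>volume proportional to hidden dimension<br/>latency-sensitive"]
    end
    subgraph "Pipeline Parallelism: inter-layer sharding"
        direction TB
        PP1["GPU 0: Layer 0-15"] -->|send activations| PP2["GPU 1: Layer 16-31"]
        PP2 -->|send activations| PP3["GPU 2: Layer 32-47"]
        PP3 -->|send activations| PP4["GPU 3: Layer 48-63"]
        PP5["Communication profile:<br/>1 point-to-point transfer per stage<br/>volume proportional to batch size<br/>throughput-sensitive"]
    end
```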

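2.2 Communication volume analysis

Strategy	Volume per layer (TP) / per stage (PP)	Frequency	Communication type	Bottleneck
TP	2 × b × s × h × 2 bytes	2 per layer	All-Reduce	Latency (synchronization wait)
PP	b × s × h × 2 bytes	1 per stage	P2P	Bandwidth (bubble overhead)

Here b = batch size, s = sequence length, h = hidden dimension, and the trailing factor of 2 bytes is the size of one FP16 element.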

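Take LLaMA-70B as an example (h=8192, b=1, s=2048):

TP volume per layer: 2 × 1 × 2048 × 8192 × 2 bytes = 64 MiB; 80 layers total 5,120 MiB (about 5 GiB)
PP volume per stage boundary: 1 × 2048 × 8192 × 2 bytes = 32 MiB; 4 stages have 3 boundaries, about 96 MiB in total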

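TP moves far more data in total than PP, although part of that traffic can be hidden by overlapping communication with computation. PP moves little data, but pipeline bubbles leave the GPUs underutilized.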
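The volume estimates can be checked with a short script. This is a minimal sketch with hypothetical function names. It counts the tensor payload handed to each collective, not the bytes on the wire (which depend on the All-Reduce algorithm), and it assumes PP transfers occur only at the p − 1 boundaries between stages.

```python
MIB = 1024 ** 2


def tp_bytes_per_layer(b: int, s: int, h: int,
                       bytes_per_elem: int = 2,
                       allreduces_per_layer: int = 2) -> int:
    """All-Reduce payload per Transformer layer under tensor parallelism."""
    return allreduces_per_layer * b * s * h * bytes_per_elem


def pp_bytes_per_boundary(b: int, s: int, h: int, bytes_per_elem: int = 2) -> int:
    """Activation payload sent across one pipeline stage boundary."""
    return b * s * h * bytes_per_elem


if __name__ == "__main__":
    h, layers, stages = 8192, 80, 4
    # Prefill of a 2048-token prompt versus a single decode step (s = 1).
    for label, s in (("prefill s=2048", 2048), ("decode  s=1", 1)):
        tp_total = layers * tp_bytes_per_layer(1, s, h)
        pp_total = (stages - 1) * pp_bytes_per_boundary(1, s, h)
        print(f"{label}: TP {tp_total / MIB:.3f} MiB, PP {pp_total / MIB:.3f} MiB")
```

Running it for s = 1 shows how small the payloads become during decoding, which is why per-call latency rather than bandwidth dominates TP cost at that point.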

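3. Engineering Implementation: Core Strategies for Communication Optimization

3.1 TP communication optimization: overlapping communication with computation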

```python
# tensor_parallel.py: tensor-parallel layers with communication optimization
import torch
import torch.distributed as dist


class TensorParallelLinear(torch.nn.Module):
    """
    Tensor-parallel linear layer.

    split="column": each rank holds out_features / tp_world_size output
    columns; the output stays sharded and needs no communication.
    split="row": each rank holds in_features / tp_world_size input features
    and produces a partial sum that is combined with All-Reduce.
    """

    def __init__(self, in_features: int, out_features: int,
                 tp_rank: int, tp_world_size: int, split: str = "column"):
        super().__init__()
        if split not in ("column", "row"):
            raise ValueError(f"unknown split: {split}")
        self.tp_rank = tp_rank
        self.tp_world_size = tp_world_size
        self.split = split
        if split == "column":
            shape = (out_features // tp_world_size, in_features)
        else:
            shape = (out_features, in_features // tp_world_size)
        # torch.empty leaves the weights uninitialized; load real values before use.
        self.weight = torch.nn.Parameter(torch.empty(*shape))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local matrix multiplication (no communication)
        output_parallel = torch.nn.functional.linear(x, self.weight)
        if self.split == "column":
            return output_parallel
        # Row split: sum the partial results from all ranks
        return self._all_reduce_with_overlap(output_parallel)

    def _all_reduce_with_overlap(self, tensor: torch.Tensor) -> torch.Tensor:
        """
        All-Reduce over NCCL, in place to avoid an extra allocation.
        Design note: with the asynchronous variant, the GPU can run other
        work while the communication is in flight.
        """
        if self.tp_world_size == 1:
            return tensor
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        # Asynchronous variant: returns right after launching the collective,
        # so work that does not depend on `tensor` can overlap with it.
        # handle = dist.all_reduce(tensor, op=dist.ReduceOp.SUM, async_op=True)
        # ... run other computation ...
        # handle.wait()
        return tensor


class TensorParallelAttention(torch.nn.Module):
    """
    Tensor-parallel attention: sharded QKV plus a reduced output projection.
    Design note: attention parallelizes naturally because heads can be
    assigned to different GPUs.
    """

    def __init__(self, hidden_size: int, num_heads: int,
                 tp_rank: int, tp_world_size: int):
        super().__init__()
        self.tp_rank = tp_rank
        self.tp_world_size = tp_world_size
        # Each rank handles num_heads / tp_world_size attention heads
        self.num_heads_per_partition = num_heads // tp_world_size
        self.head_dim = hidden_size // num_heads
        # QKV projection: column split, output stays sharded
        self.qkv_proj = TensorParallelLinear(
            hidden_size, 3 * hidden_size, tp_rank, tp_world_size, split="column"
        )
        # Output projection: row split + All-Reduce
        self.out_proj = TensorParallelLinear(
            hidden_size, hidden_size, tp_rank, tp_world_size, split="row"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, _ = x.shape
        # QKV projection (local, no communication)
        qkv = self.qkv_proj(x)
        qkv = qkv.reshape(batch_size, seq_len, 3,
                          self.num_heads_per_partition, self.head_dim)
        # Move heads ahead of the sequence axis: (batch, heads, seq, head_dim)
        q, k, v = (t.transpose(1, 2) for t in qkv.unbind(dim=2))
        # Attention (local, no communication). Note: no causal mask or
        # KV cache here; a decoder would need both.
        attn_weights = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn_weights = torch.nn.functional.softmax(attn_weights, dim=-1)
        attn_output = torch.matmul(attn_weights, v)
        # Back to (batch, seq, hidden / tp_world_size)
        attn_output = attn_output.transpose(1, 2).reshape(batch_size, seq_len, -1)
        # Output projection + All-Reduce (the only step that communicates)
        output = self.out_proj(attn_output)
        return output
```
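To see why a single All-Reduce after the output projection is enough, the following pure-Python sketch shards a two-matrix product Y = (X·A)·B across simulated ranks: A is split by columns, B by rows, and the per-rank partial results are summed, which is what All-Reduce does across GPUs. It uses plain lists instead of torch, performs no real communication, and ignores the attention computation in between; all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum of per-rank matrices (stand-in for All-Reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def tensor_parallel_forward(x, a, b, tp_size):
    a_shards = split_columns(a, tp_size)  # column-parallel: output stays sharded
    b_shards = split_rows(b, tp_size)     # row-parallel: each rank yields a partial sum
    partials = [matmul(matmul(x, a_r), b_r) for a_r, b_r in zip(a_shards, b_shards)]
    return all_reduce_sum(partials)


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    a = [[1, 0, 2, 1], [0, 1, 1, 3]]
    b = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print("sharded:  ", tensor_parallel_forward(x, a, b, tp_size=2))
    print("unsharded:", matmul(matmul(x, a), b))
```

Because each rank's column block of A lines up with its row block of B, the intermediate result never has to be gathered; only the final partial sums are exchanged.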

3.2 PP communication optimization: micro-batch pipelining

```python
# pipeline_parallel.py: pipeline parallelism with micro-batch scheduling
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    """Pipeline-parallel configuration."""
    num_stages: int = 4               # number of pipeline stages
    num_micro_batches: int = 8        # number of micro-batches
    pipeline_schedule: str = "1f1b"   # schedule: 1f1b / gpipe / interleaved


class PipelineScheduler:
    """
    Pipeline scheduler that aims to keep the bubble fraction low.
    Design note: splitting into micro-batches shrinks the bubbles, and 1F1B
    bounds the number of micro-batches in flight on each stage.
    """

    def __init__(self, config: PipelineConfig, stage_id: int):
        self.config = config
        self.stage_id = stage_id

    def get_schedule(self) -> list[dict]:
        """
        Build the schedule for this stage.
        1F1B = One Forward One Backward: each stage alternates forward and
        backward passes, which lowers peak memory. Backward entries apply to
        training; an inference-only pipeline runs just the forward entries.
        """
        num_stages = self.config.num_stages
        num_micros = self.config.num_micro_batches
        if self.config.pipeline_schedule == "1f1b":
            return self._schedule_1f1b(num_stages, num_micros)
        if self.config.pipeline_schedule == "gpipe":
            return self._schedule_gpipe(num_stages, num_micros)
        if self.config.pipeline_schedule == "interleaved":
            return self._schedule_interleaved(num_stages, num_micros)
        raise ValueError(f"unknown schedule: {self.config.pipeline_schedule}")

    def _schedule_1f1b(self, num_stages: int, num_micros: int) -> list[dict]:
        """
        1F1B schedule: alternate forward and backward passes.
        Bubble fraction: about (p-1) / (m+p-1), where p is the number of
        stages and m the number of micro-batches. This equals GPipe's bubble
        fraction; the gain over GPipe is lower peak activation memory.
        """
        schedule = []
        stage = self.stage_id
        # Warm-up: forward passes only. The last stage needs no warm-up.
        warmup_micros = min(num_micros, num_stages - stage - 1)
        for i in range(warmup_micros):
            schedule.append({
                "type": "forward", "micro_batch": i, "action": "compute_forward"
            })
        # Steady state: one forward, then one backward
        for i in range(num_micros - warmup_micros):
            schedule.append({
                "type": "forward", "micro_batch": warmup_micros + i,
                "action": "compute_forward"
            })
            schedule.append({
                "type": "backward", "micro_batch": i, "action": "compute_backward"
            })
        # Cool-down: run the remaining backward passes
        for i in range(num_micros - warmup_micros, num_micros):
            schedule.append({
                "type": "backward", "micro_batch": i, "action": "compute_backward"
            })
        return schedule

    def _schedule_gpipe(self, num_stages: int, num_micros: int) -> list[dict]:
        """GPipe schedule: run every forward pass, then every backward pass."""
        schedule = []
        for i in range(num_micros):
            schedule.append({"type": "forward", "micro_batch": i})
        for i in range(num_micros - 1, -1, -1):
            schedule.append({"type": "backward", "micro_batch": i})
        return schedule

    def _schedule_interleaved(self, num_stages: int, num_micros: int) -> list[dict]:
        """
        Interleaved schedule: each GPU owns several non-contiguous layer
        groups, which reduces the bubble fraction further.
        """
        # Simplified: a real implementation needs more involved scheduling logic
        return self._schedule_1f1b(num_stages, num_micros)
```

3.3 Communication topology optimization

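```python
# topology.py: GPU communication topology optimization
import math


class CommunicationTopology:
    """
    Pick a parallel strategy from the hardware topology.
    Design note: prefer the highest-bandwidth link available
    (NVLink > PCIe > InfiniBand).
    """

    @staticmethod
    def recommend_parallel_strategy(
        num_gpus: int,
        gpu_memory_gb: float,
        model_size_gb: float,
        nvlink_bandwidth_gbps: float = 600,
        pcie_bandwidth_gbps: float = 64,
    ) -> dict:
        """
        Recommend TP/PP degrees for a hardware configuration.
        Despite the `_gbps` suffix, the bandwidth defaults are in GB/s
        (600 GB/s is A100 NVLink; 64 GB/s is PCIe 4.0 x16, bidirectional).
        `pcie_bandwidth_gbps` is accepted but not used by this heuristic.
        """
        recommendations = {}
        # TP needs high-bandwidth interconnect (NVLink), usually within one node
        max_tp = min(num_gpus, 8)  # at most 8 GPUs per node
        # Fewest GPUs whose usable memory (70% of each) holds the weights
        min_shards = max(1, math.ceil(model_size_gb / (gpu_memory_gb * 0.7)))
        # Round up to a power of two so the TP degree divides the head count
        # (assumes power-of-two head counts, as in LLaMA)
        min_tp = 1 << (min_shards - 1).bit_length()

        if min_tp <= max_tp and nvlink_bandwidth_gbps >= 300:
            # NVLink available: prefer TP; PP can span nodes with the rest
            tp_degree = min_tp
            pp_degree = max(1, num_gpus // tp_degree)
            recommendations["strategy"] = "TP-first"
            recommendations["tp_degree"] = tp_degree
            recommendations["pp_degree"] = pp_degree
            recommendations["reason"] = (
                f"NVLink bandwidth {nvlink_bandwidth_gbps} GB/s is sufficient; "
                f"TP={tp_degree} meets the memory requirement"
            )
        else:
            # NVLink unavailable or insufficient: prefer PP
            pp_degree = min_shards
            tp_degree = max(1, num_gpus // pp_degree)
            recommendations["strategy"] = "PP-first"
            recommendations["tp_degree"] = tp_degree
            recommendations["pp_degree"] = pp_degree
            recommendations["reason"] = (
                "NVLink bandwidth is insufficient or the required TP degree "
                f"exceeds one node; PP={pp_degree} reduces cross-GPU communication"
            )
        return recommendations
```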

4. The Cost of Parallelism: Communication Overhead and Memory Trade-offs

4.1 TP's communication bottleneck

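TP needs 2 All-Reduce operations per Transformer layer. With 8-way TP, each All-Reduce takes about 50-100 μs over NVLink, so 80 layers (160 calls) add up to roughly 8-16 ms per forward pass. For single-request inference, where this cost is paid on every generated token, the latency is not negligible.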
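The accumulated figure is simply the number of All-Reduce calls multiplied by the per-call latency. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the 50-100 μs range is the estimate quoted above, not a measurement.

```python
def tp_sync_overhead_ms(num_layers: int, allreduce_latency_us: float,
                        allreduces_per_layer: int = 2) -> float:
    """Total All-Reduce latency for one forward pass, in milliseconds."""
    return num_layers * allreduces_per_layer * allreduce_latency_us / 1000.0


def max_tokens_per_second(overhead_ms: float) -> float:
    """Upper bound on single-request decode speed if sync were the only cost."""
    return 1000.0 / overhead_ms


if __name__ == "__main__":
    for latency_us in (50, 100):
        overhead = tp_sync_overhead_ms(80, latency_us)
        print(f"{latency_us} us per All-Reduce: {overhead:.0f} ms per token, "
              f"at most {max_tokens_per_second(overhead):.1f} tokens/s")
```

The second function makes the consequence concrete: even before any matrix multiplication, synchronization alone caps how fast a single request can decode.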

4.2 PP's bubble overhead

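Pipeline bubbles lower GPU utilization in PP. With 4 stages, the bubble fraction is about 30-40% under 1F1B scheduling (for example, (p-1)/(m+p-1) = 3/9 ≈ 33% with m = 6 micro-batches), which means roughly one third of GPU time is idle. Adding micro-batches reduces the bubble fraction but increases latency.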
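The bubble fraction can be checked by simulating a forward-only pipeline, which is the relevant case for inference. This sketch assumes every stage takes one time unit per micro-batch and that transfers are free; the function names are hypothetical.

```python
from typing import List, Optional


def forward_timeline(num_stages: int, num_micro_batches: int) -> List[List[Optional[int]]]:
    """timeline[stage][tick] is the micro-batch index, or None when idle."""
    total_ticks = num_micro_batches + num_stages - 1
    timeline = []
    for stage in range(num_stages):
        row = []
        for tick in range(total_ticks):
            micro = tick - stage  # stage k sees micro-batch i at tick i + k
            row.append(micro if 0 <= micro < num_micro_batches else None)
        timeline.append(row)
    return timeline


def idle_fraction(timeline: List[List[Optional[int]]]) -> float:
    """Share of (stage, tick) slots in which a GPU has nothing to do."""
    slots = [cell for row in timeline for cell in row]
    return sum(cell is None for cell in slots) / len(slots)


def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Closed form (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


if __name__ == "__main__":
    tl = forward_timeline(4, 8)
    for stage, row in enumerate(tl):
        cells = " ".join("." if c is None else str(c) for c in row)
        print(f"stage {stage}: {cells}")
    print(f"simulated idle fraction: {idle_fraction(tl):.3f}")
    for m in (4, 6, 8, 16, 32):
        print(f"m={m:2d}: bubble fraction {bubble_fraction(4, m):.3f}")
```

The printed grid shows the idle slots at the start and end of each stage's row, and the sweep over m shows how quickly the returns from adding micro-batches diminish.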

4.3 The complexity of hybrid parallelism

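Hybrid TP+PP has to manage both communication patterns at once, which makes debugging and performance tuning considerably harder. Small differences in the communication topology, such as an asymmetric NVLink layout, can cause large swings in performance.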

4.4 Where each strategy fits

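TP fits best with multiple GPUs in one node connected by NVLink and with latency-sensitive single-request inference. PP fits best with cross-node deployment and throughput-sensitive batch inference. Hybrid TP+PP fits best when deploying very large models (>70B) on multi-node clusters.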

5. Summary

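Tensor Parallelism and Pipeline Parallelism are the two pillars that let large-model inference break through the single-GPU memory limit. TP achieves low-latency inference through intra-layer sharding, but its communication volume grows in proportion to the number of layers. PP reduces communication volume through inter-layer sharding, but bubble overhead lowers GPU utilization. The core optimizations in practice are communication-computation overlap for TP, 1F1B micro-batch scheduling for PP, and automatic strategy recommendation based on hardware topology. The choice of strategy is not static: as model size, hardware configuration, and business requirements change, the best strategy changes with them.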