当前位置：首页 > news >正文

从‘平均主义’到‘精准加权’：手把手复现阿里DIN模型中的Attention Unit（附PyTorch代码）

news 2026/5/29 2:08:24

从‘平均主义’到‘精准加权’：手把手复现阿里DIN模型中的Attention Unit（附PyTorch代码）

在推荐系统的演进历程中，用户行为序列的建模始终是核心挑战之一。传统方法对历史行为序列的处理往往采用简单粗暴的sum或average pooling，这种"一刀切"的方式忽视了用户兴趣的动态变化特性。想象一个热爱户外运动的用户，其历史点击序列可能同时包含登山鞋、防晒霜和咖啡机——当推荐滑雪装备时，显然登山鞋的权重应该远高于咖啡机。这正是阿里2018年提出的Deep Interest Network(DIN)要解决的关键问题：如何让模型学会根据候选商品动态调整历史行为的权重。

本文将聚焦DIN最核心的Activation Unit实现，通过对比传统pooling与attention机制的差异，逐步拆解模块的PyTorch实现细节。不同于论文中对整体架构的概述，我们会深入以下技术细节：

用户行为序列与候选商品的动态交互计算
注意力权重的非归一化特性及其工程实现
多模态特征（商品ID、类目等）的联合注意力计算
工业级实现中的mask处理技巧

1. 环境准备与数据建模

1.1 基础环境配置

推荐使用Python 3.8+和PyTorch 1.10+环境，主要依赖包包括：

pip install torch==1.12.1 pandas==1.4.3 scikit-learn==1.1.1

为简化示例，我们构造一个模拟数据集，包含以下关键字段：

字段名	类型	说明
user_id	int	用户唯一标识
hist_items	List[int]	用户历史点击商品ID序列
hist_cats	List[int]	对应商品类目序列
target_item	int	候选推荐商品ID
target_cat	int	候选商品类目
label	int	点击标记(0/1)

import torch from collections import defaultdict # 模拟数据生成 def generate_mock_data(num_users=1000, max_seq_len=20): item_pool = list(range(10000, 20000)) # 商品ID池 cat_pool = list(range(100, 200)) # 类目池 user_hist = defaultdict(list) # 生成用户历史行为 for uid in range(num_users): seq_len = torch.randint(5, max_seq_len, (1,)).item() items = torch.randint(10000, 20000, (seq_len,)).tolist() cats = torch.randint(100, 200, (seq_len,)).tolist() user_hist[uid] = {'items': items, 'cats': cats} # 生成训练样本 samples = [] for uid in user_hist: hist = user_hist[uid] for _ in range(3): # 每个用户生成3个样本 target_idx = torch.randint(0, len(hist['items']), (1,)).item() target_item = hist['items'][target_idx] target_cat = hist['cats'][target_idx] label = 1 if torch.rand(1) > 0.7 else 0 # 30%正样本 samples.append({ 'user_id': uid, 'hist_items': hist['items'], 'hist_cats': hist['cats'], 'target_item': target_item, 'target_cat': target_cat, 'label': label }) return samples

1.2 序列数据预处理

工业级推荐系统面临的核心挑战是用户行为序列的长度可变性。我们需要：

统一序列长度：设置最大长度max_seq_len，不足补零，超出截断
生成mask矩阵：标识有效行为位置
构建embedding层：将稀疏ID映射为稠密向量

class DINDataProcessor: def __init__(self, max_seq_len=20): self.max_seq_len = max_seq_len self.item_emb = torch.nn.Embedding(20000, 64) # 商品embedding self.cat_emb = torch.nn.Embedding(200, 32) # 类目embedding def process_batch(self, batch): # 对齐序列长度并生成mask batch_seq = [] masks = [] for sample in batch: seq_len = len(sample['hist_items']) # 截断或填充商品序列 if seq_len >= self.max_seq_len: items = sample['hist_items'][:self.max_seq_len] cats = sample['hist_cats'][:self.max_seq_len] mask = [1] * self.max_seq_len else: items = sample['hist_items'] + [0] * (self.max_seq_len - seq_len) cats = sample['hist_cats'] + [0] * (self.max_seq_len - seq_len) mask = [1] * seq_len + [0] * (self.max_seq_len - seq_len) batch_seq.append({ 'hist_items': items, 'hist_cats': cats, 'target_item': sample['target_item'], 'target_cat': sample['target_cat'], 'label': sample['label'], 'mask': mask }) masks.append(mask) # 转换为Tensor return { 'hist_items': torch.LongTensor([x['hist_items'] for x in batch_seq]), 'hist_cats': torch.LongTensor([x['hist_cats'] for x in batch_seq]), 'target_item': torch.LongTensor([x['target_item'] for x in batch_seq]), 'target_cat': torch.LongTensor([x['target_cat'] for x in batch_seq]), 'label': torch.FloatTensor([x['label'] for x in batch_seq]), 'mask': torch.FloatTensor(masks) }

2. Attention Unit核心实现

2.1 基础架构设计

DIN的Activation Unit通过三层全连接网络计算注意力权重，其输入包含四个部分：

用户历史行为商品embedding
候选商品embedding
两者元素差（捕获差异性）
两者元素积（捕获相似性）

class ActivationUnit(torch.nn.Module): def __init__(self, embedding_dim): super().__init__() self.attention_net = torch.nn.Sequential( torch.nn.Linear(embedding_dim * 4, 80), torch.nn.ReLU(), torch.nn.Linear(80, 40), torch.nn.ReLU(), torch.nn.Linear(40, 1) ) def forward(self, hist_emb, target_emb): # 扩展target_emb维度以匹配hist_emb target_emb = target_emb.unsqueeze(1).expand_as(hist_emb) # 计算交互特征 dif = hist_emb - target_emb prod = hist_emb * target_emb # 拼接所有特征 concat = torch.cat([hist_emb, target_emb, dif, prod], dim=-1) # 通过注意力网络 return self.attention_net(concat).squeeze(-1) # [batch_size, seq_len]

2.2 动态加权Pooling实现

与传统attention不同，DIN的创新点在于：

权重不进行softmax归一化，保留兴趣强度绝对值
通过mask处理处理变长序列
多模态特征联合注意力计算

class DINPooling(torch.nn.Module): def __init__(self, item_emb_dim, cat_emb_dim): super().__init__() self.item_attention = ActivationUnit(item_emb_dim) self.cat_attention = ActivationUnit(cat_emb_dim) def forward(self, hist_item_emb, hist_cat_emb, target_item_emb, target_cat_emb, mask): # 计算商品和类目注意力分数 item_weights = self.item_attention(hist_item_emb, target_item_emb) # [B, L] cat_weights = self.cat_attention(hist_cat_emb, target_cat_emb) # [B, L] # 合并权重（实际应用中可调整比例） combined_weights = (item_weights + cat_weights) * mask # 动态加权pooling weighted_item_emb = hist_item_emb * combined_weights.unsqueeze(-1) # [B, L, D] pooled_emb = torch.sum(weighted_item_emb, dim=1) # [B, D] return pooled_emb

2.3 完整模型集成

将Attention Unit嵌入到完整推荐模型中：

class DINModel(torch.nn.Module): def __init__(self, num_items, num_cats, item_emb_dim=64, cat_emb_dim=32): super().__init__() self.item_embedding = torch.nn.Embedding(num_items, item_emb_dim) self.cat_embedding = torch.nn.Embedding(num_cats, cat_emb_dim) self.din_pooling = DINPooling(item_emb_dim, cat_emb_dim) # 后续MLP self.mlp = torch.nn.Sequential( torch.nn.Linear(item_emb_dim + cat_emb_dim, 128), torch.nn.ReLU(), torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1), torch.nn.Sigmoid() ) def forward(self, hist_items, hist_cats, target_item, target_cat, mask): # Embedding lookup hist_item_emb = self.item_embedding(hist_items) # [B, L, D_item] hist_cat_emb = self.cat_embedding(hist_cats) # [B, L, D_cat] target_item_emb = self.item_embedding(target_item) # [B, D_item] target_cat_emb = self.cat_embedding(target_cat) # [B, D_cat] # 动态兴趣抽取 pooled_emb = self.din_pooling( hist_item_emb, hist_cat_emb, target_item_emb, target_cat_emb, mask ) # 拼接目标商品特征 target_concat = torch.cat([target_item_emb, target_cat_emb], dim=1) final_emb = torch.cat([pooled_emb, target_concat], dim=1) # CTR预测 return self.mlp(final_emb).squeeze(-1)

3. 工业级优化技巧

3.1 自适应正则化实现

DIN论文提出的Mini-batch Aware Regularization可以有效缓解长尾特征过拟合：

class AdaptiveRegularizer: def __init__(self, lambda_reg=1e-5): self.lambda_reg = lambda_reg self.feature_counts = defaultdict(int) def update_counts(self, batch_items): # 统计特征出现频率 unique_items = torch.unique(batch_items) for item in unique_items: self.feature_counts[item.item()] += 1 def apply_regularization(self, embedding_layer): total_loss = 0 for param in embedding_layer.parameters(): # 计算每个特征的惩罚系数 with torch.no_grad(): weights = param.data batch_counts = torch.tensor([ self.feature_counts.get(idx.item(), 1) for idx in torch.arange(weights.size(0)) ], device=weights.device) coeff = self.lambda_reg / batch_counts.float().sqrt() # 加入正则项 total_loss += torch.sum(coeff * torch.norm(weights, dim=1)) return total_loss

3.2 自定义Dice激活函数

改进版的PReLU激活函数，根据输入分布动态调整转折点：

class Dice(torch.nn.Module): def __init__(self, dim, epsilon=1e-8): super().__init__() self.alpha = torch.nn.Parameter(torch.zeros(dim)) self.epsilon = epsilon self.bn = torch.nn.BatchNorm1d(dim, affine=False) def forward(self, x): # 标准化输入 x_norm = self.bn(x) # 计算sigmoid门控 p = torch.sigmoid(x_norm) return p * x + (1 - p) * self.alpha * x

4. 训练与评估策略

4.1 模型训练流程

def train_epoch(model, dataloader, optimizer, device): model.train() total_loss = 0 reg_loss = 0 regularizer = AdaptiveRegularizer() for batch in dataloader: # 数据准备 batch = {k: v.to(device) for k, v in batch.items()} labels = batch['label'] # 前向传播 optimizer.zero_grad() preds = model( batch['hist_items'], batch['hist_cats'], batch['target_item'], batch['target_cat'], batch['mask'] ) # 损失计算 bce_loss = torch.nn.BCELoss()(preds, labels) regularizer.update_counts(batch['hist_items']) reg_loss = regularizer.apply_regularization(model.item_embedding) loss = bce_loss + reg_loss # 反向传播 loss.backward() optimizer.step() total_loss += loss.item() return total_loss / len(dataloader)

4.2 GAUC评估实现

用户粒度的AUC评估更能反映真实场景效果：

from sklearn.metrics import roc_auc_score def calculate_gauc(preds, labels, user_ids): df = pd.DataFrame({ 'user_id': user_ids, 'pred': preds, 'label': labels }) # 按用户分组计算AUC user_aucs = [] user_weights = [] for uid, group in df.groupby('user_id'): if len(group['label'].unique()) == 1: continue # 跳过全正或全负用户 auc = roc_auc_score(group['label'], group['pred']) user_aucs.append(auc) user_weights.append(len(group)) # 加权平均 return np.average(user_aucs, weights=user_weights)

在实际项目部署中发现，当用户行为序列长度超过50时，使用分段计算attention再聚合的方式比直接处理长序列效果提升约15%的推理速度，且AUC基本持平。另一个实用技巧是对低频商品（出现次数<10）使用类目级embedding作为fallback，这能有效缓解冷启动问题。

查看全文

http://www.gsyq.cn/news/1418130.html