当前位置：首页 > news >正文

实战指南：用PyTorch快速复现DQN及其变种（DDQN/Dueling DQN）玩转CartPole

news 2026/6/10 11:19:43

深度强化学习实战：从零构建DQN及其变种玩转CartPole

在强化学习领域，CartPole问题就像编程界的"Hello World"，看似简单却蕴含着丰富的学习价值。这个经典控制问题要求我们平衡一根连接在小车上的杆子，虽然状态空间只有四个维度（小车位置、速度、杆子角度和角速度），但要实现长时间稳定控制并非易事。本文将带你用PyTorch从零开始，逐步构建标准的DQN算法，并在此基础上实现其两大改进版本——Double DQN和Dueling DQN，通过代码层面的对比让你深入理解不同算法的设计思想与实现差异。

1. 环境搭建与基础实现

1.1 Gym环境初始化

首先我们需要安装并导入必要的库。OpenAI的Gym库为我们提供了标准化的强化学习环境接口，而PyTorch将作为我们的深度学习框架：

import gym import numpy as np import torch import torch.nn as nn import torch.optim as optim import random from collections import deque import matplotlib.pyplot as plt env = gym.make('CartPole-v1') state_dim = env.observation_space.shape[0] action_dim = env.action_space.n

CartPole-v1环境的状态空间包含4个连续变量，动作空间则是2个离散动作（向左或向右推动小车）。与原始Q-learning相比，DQN最大的突破在于使用神经网络来近似Q函数，从而能够处理连续状态空间。

1.2 原始Q-learning的局限性

传统的表格型Q-learning在这种连续状态空间中会遇到严重问题：

维度灾难：连续状态需要离散化处理，但精细离散化会导致状态空间爆炸
泛化能力差：表格方法无法捕捉状态之间的相似性，每个状态需要单独学习
数据效率低：无法利用相似状态的经验进行泛化学习

以下是一个简单的Q-learning实现，展示了其在CartPole问题上的局限性：

class QLearningAgent: def __init__(self, state_dim, action_dim): self.q_table = np.zeros((state_dim, action_dim)) # 实际中需要对连续状态离散化 self.alpha = 0.1 # 学习率 self.gamma = 0.99 # 折扣因子 self.epsilon = 0.1 # 探索率 def act(self, state): if random.random() < self.epsilon: return random.randint(0, self.action_dim-1) return np.argmax(self.q_table[state]) def learn(self, state, action, reward, next_state, done): best_next_action = np.argmax(self.q_table[next_state]) td_target = reward + self.gamma * self.q_table[next_state][best_next_action] * (1 - done) self.q_table[state][action] += self.alpha * (td_target - self.q_table[state][action])

在实际运行中，这种简单Q-learning很难在CartPole环境中取得好效果，特别是当我们将状态空间离散化得不够精细时。

2. DQN的核心实现

2.1 神经网络架构设计

DQN使用神经网络来近似Q函数，这里我们实现一个简单的三层全连接网络：

class DQN(nn.Module): def __init__(self, state_dim, action_dim): super(DQN, self).__init__() self.fc1 = nn.Linear(state_dim, 64) self.fc2 = nn.Linear(64, 64) self.fc3 = nn.Linear(64, action_dim) def forward(self, x): x = torch.relu(self.fc1(x)) x = torch.relu(self.fc2(x)) return self.fc3(x)

这个网络接收4维状态向量，经过两个隐藏层后输出2个动作的Q值。相比表格方法，神经网络能够自动学习状态特征的抽象表示，实现更好的泛化。

2.2 经验回放机制

经验回放是DQN稳定训练的关键技术，它通过存储和随机采样历史经验来打破数据间的相关性：

class ReplayBuffer: def __init__(self, capacity): self.buffer = deque(maxlen=capacity) def push(self, state, action, reward, next_state, done): self.buffer.append((state, action, reward, next_state, done)) def sample(self, batch_size): state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size)) return np.array(state), np.array(action), np.array(reward), np.array(next_state), np.array(done) def __len__(self): return len(self.buffer)

经验回放带来三个主要好处：

提高数据效率，每个经验可以被多次使用
打破连续样本间的相关性，减少方差
使训练分布更加平滑，避免参数振荡

2.3 目标网络与训练流程

DQN另一个关键创新是使用独立的目标网络来计算TD目标，从而稳定学习过程：

class DQNAgent: def __init__(self, state_dim, action_dim): self.policy_net = DQN(state_dim, action_dim) self.target_net = DQN(state_dim, action_dim) self.target_net.load_state_dict(self.policy_net.state_dict()) self.optimizer = optim.Adam(self.policy_net.parameters(), lr=1e-3) self.buffer = ReplayBuffer(10000) self.batch_size = 64 self.gamma = 0.99 self.epsilon = 1.0 self.epsilon_min = 0.01 self.epsilon_decay = 0.995 def act(self, state): if random.random() < self.epsilon: return random.randint(0, action_dim-1) with torch.no_grad(): q_values = self.policy_net(torch.FloatTensor(state)) return q_values.argmax().item() def update(self): if len(self.buffer) < self.batch_size: return state, action, reward, next_state, done = self.buffer.sample(self.batch_size) state = torch.FloatTensor(state) next_state = torch.FloatTensor(next_state) action = torch.LongTensor(action) reward = torch.FloatTensor(reward) done = torch.FloatTensor(done) current_q = self.policy_net(state).gather(1, action.unsqueeze(1)) next_q = self.target_net(next_state).max(1)[0].detach() target_q = reward + self.gamma * next_q * (1 - done) loss = nn.MSELoss()(current_q.squeeze(), target_q) self.optimizer.zero_grad() loss.backward() self.optimizer.step() self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay) def update_target(self): self.target_net.load_state_dict(self.policy_net.state_dict())

训练过程中，我们每4步更新一次策略网络，每100步同步一次目标网络：

agent = DQNAgent(state_dim, action_dim) episode_rewards = [] for episode in range(500): state = env.reset() total_reward = 0 for t in range(200): action = agent.act(state) next_state, reward, done, _ = env.step(action) agent.buffer.push(state, action, reward, next_state, done) state = next_state total_reward += reward agent.update() if done: break if episode % 100 == 0: agent.update_target() episode_rewards.append(total_reward) print(f"Episode {episode}, Reward: {total_reward}, Epsilon: {agent.epsilon:.2f}")

3. Double DQN实现

3.1 过估计问题分析

传统DQN存在Q值过估计问题，主要源于max操作带来的正向偏差。在计算TD目标时：

target_q = reward + γ * max_a' Q_target(s', a')

这个max操作会系统地高估Q值，因为：

估计误差的存在使得某些动作的Q值被高估
max操作会选择这些被高估的动作，进一步放大误差
这些高估会通过自举传播到其他状态

3.2 Double DQN解决方案

Double DQN通过解耦动作选择和动作评估来减少过估计：

class DoubleDQNAgent(DQNAgent): def update(self): # ... 前面部分与DQN相同 ... current_q = self.policy_net(state).gather(1, action.unsqueeze(1)) # 使用policy_net选择动作，target_net评估动作 next_actions = self.policy_net(next_state).max(1)[1] next_q = self.target_net(next_state).gather(1, next_actions.unsqueeze(1)).squeeze(1) target_q = reward + self.gamma * next_q * (1 - done) loss = nn.MSELoss()(current_q.squeeze(), target_q) # ... 后面部分与DQN相同 ...

关键修改在于TD目标的计算方式：

用策略网络选择最优动作：a* = argmax_a Q_policy(s', a)
用目标网络评估这个动作的Q值：Q_target(s', a*)

这种方法虽然不能完全消除过估计，但能显著降低过估计的程度，在实践中通常能获得更稳定的性能。

4. Dueling DQN实现

4.1 优势分解原理

Dueling DQN的核心思想是将Q值分解为状态值函数V(s)和优势函数A(s,a)：

Q(s,a) = V(s) + A(s,a)

其中：

V(s)表示状态s的整体价值
A(s,a)表示动作a相对于平均动作的优势

这种分解允许网络在不考虑每个动作的情况下学习哪些状态是有价值的，这在某些动作对环境影响很小的场景中特别有用。

4.2 网络架构修改

实现Dueling DQN需要重新设计网络结构：

class DuelingDQN(nn.Module): def __init__(self, state_dim, action_dim): super(DuelingDQN, self).__init__() self.feature = nn.Sequential( nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU() ) self.value_stream = nn.Sequential( nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1) ) self.advantage_stream = nn.Sequential( nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, action_dim) ) def forward(self, x): features = self.feature(x) values = self.value_stream(features) advantages = self.advantage_stream(features) qvals = values + (advantages - advantages.mean()) return qvals

这里有几个关键设计点：

共享的特征提取层，同时为价值和优势流提供输入
价值流输出单个标量V(s)
优势流输出每个动作的优势值A(s,a)
合并时使用优势函数的中心化形式：Q(s,a) = V(s) + (A(s,a) - mean_a(A(s,a)))

这种中心化处理有助于提高数值稳定性，同时保持优势函数的相对排序不变。

5. 算法对比与性能分析

5.1 训练曲线对比

我们同时训练三种算法，记录它们的每轮奖励：

dqn_rewards = [] ddqn_rewards = [] dueling_rewards = [] # 训练代码类似，分别使用三种agent # ... plt.plot(dqn_rewards, label='DQN') plt.plot(ddqn_rewards, label='Double DQN') plt.plot(dueling_rewards, label='Dueling DQN') plt.xlabel('Episode') plt.ylabel('Reward') plt.legend() plt.show()

典型训练曲线可能显示：