强化学习基础:马尔可夫决策过程
1. 技术分析
1.1 强化学习概述
强化学习是一种通过交互学习的机器学习方法:
强化学习要素 Agent: 智能体 Environment: 环境 State: 状态 Action: 动作 Reward: 奖励 目标: 最大化累积奖励1.2 MDP定义
马尔可夫决策过程是强化学习的数学框架:
MDP组成 S: 状态空间 A: 动作空间 P: 状态转移概率 R: 奖励函数 γ: 折扣因子 特性: 马尔可夫性1.3 MDP要素对比
| 要素 | 描述 | 类型 |
|---|---|---|
| 状态 | 环境的描述 | 离散/连续 |
| 动作 | 智能体的行为 | 离散/连续 |
| 奖励 | 即时反馈 | 标量 |
| 策略 | 行为规则 | 概率分布 |
2. 核心功能实现
2.1 MDP建模
import numpy as np class MarkovDecisionProcess: def __init__(self, states, actions, transition_prob, reward, gamma=0.99): self.states = states self.actions = actions self.transition_prob = transition_prob self.reward = reward self.gamma = gamma def get_next_state(self, state, action): probs = self.transition_prob[state][action] return np.random.choice(self.states, p=probs) def get_reward(self, state, action, next_state): return self.reward[state][action][next_state] def step(self, state, action): next_state = self.get_next_state(state, action) reward = self.get_reward(state, action, next_state) done = self._is_terminal(next_state) return next_state, reward, done def _is_terminal(self, state): return state == 'terminal' class GridWorldMDP(MarkovDecisionProcess): def __init__(self, grid_size=4): self.grid_size = grid_size states = [(i, j) for i in range(grid_size) for j in range(grid_size)] actions = ['up', 'down', 'left', 'right'] super().__init__(states, actions, self._build_transition(), self._build_reward()) def _build_transition(self): transition = {} for state in self.states: transition[state] = {} for action in self.actions: transition[state][action] = self._get_transition_probs(state, action) return transition def _get_transition_probs(self, state, action): probs = {s: 0.0 for s in self.states} next_state = self._get_next_state(state, action) if next_state in self.states: probs[next_state] = 1.0 return probs def _get_next_state(self, state, action): i, j = state if action == 'up': return (max(0, i-1), j) elif action == 'down': return (min(self.grid_size-1, i+1), j) elif action == 'left': return (i, max(0, j-1)) elif action == 'right': return (i, min(self.grid_size-1, j+1)) def _build_reward(self): reward = {} for state in self.states: reward[state] = {} for action in self.actions: reward[state][action] = {} for next_state in self.states: if next_state == (3, 3): reward[state][action][next_state] = 100 else: reward[state][action][next_state] = -1 return reward2.2 值函数计算
class ValueFunction: def __init__(self, mdp): self.mdp = mdp self.values = {s: 0.0 for s in mdp.states} def compute_bellman_equation(self, state): value = 0 for action in self.mdp.actions: action_value = 0 for next_state in self.mdp.states: prob = self.mdp.transition_prob[state][action].get(next_state, 0) reward = self.mdp.get_reward(state, action, next_state) action_value += prob * (reward + self.mdp.gamma * self.values[next_state]) value = max(value, action_value) return value def value_iteration(self, threshold=1e-6): while True: delta = 0 for state in self.mdp.states: old_value = self.values[state] self.values[state] = self.compute_bellman_equation(state) delta = max(delta, abs(old_value - self.values[state])) if delta < threshold: break def get_action_value(self, state, action): value = 0 for next_state in self.mdp.states: prob = self.mdp.transition_prob[state][action].get(next_state, 0) reward = self.mdp.get_reward(state, action, next_state) value += prob * (reward + self.mdp.gamma * self.values[next_state]) return value2.3 策略提取
class Policy: def __init__(self, mdp): self.mdp = mdp self.policy = {s: self.mdp.actions[0] for s in mdp.states} def extract_policy(self, value_function): for state in self.mdp.states: best_action = None best_value = float('-inf') for action in self.mdp.actions: action_value = value_function.get_action_value(state, action) if action_value > best_value: best_value = action_value best_action = action self.policy[state] = best_action def get_action(self, state): return self.policy[state] def evaluate(self, episodes=100): total_reward = 0 for _ in range(episodes): state = self.mdp.states[0] episode_reward = 0 while state != 'terminal': action = self.get_action(state) state, reward, done = self.mdp.step(state, action) episode_reward += reward if done: break total_reward += episode_reward return total_reward / episodes3. 性能对比
3.1 值迭代收敛性
| 迭代次数 | 值函数误差 | 策略稳定性 |
|---|---|---|
| 10 | 0.1 | 低 |
| 100 | 0.01 | 中 |
| 1000 | 0.001 | 高 |
3.2 MDP规模影响
| 状态数 | 值迭代时间 | 内存占用 |
|---|---|---|
| 100 | 0.1s | 1MB |
| 1000 | 1s | 10MB |
| 10000 | 10s | 100MB |
3.3 折扣因子影响
| γ | 远期考虑 | 收敛速度 |
|---|---|---|
| 0.9 | 中 | 快 |
| 0.99 | 高 | 慢 |
| 0.999 | 很高 | 很慢 |
4. 最佳实践
4.1 MDP建模技巧
def create_mdp_from_environment(env): states = env.get_states() actions = env.get_actions() transition = {} reward = {} for state in states: transition[state] = {} for action in actions: transition[state][action] = env.get_transition_probs(state, action) reward[state][action] = env.get_reward_function(state, action) return MarkovDecisionProcess(states, actions, transition, reward) class MDPFactory: @staticmethod def create(config): if config['type'] == 'grid_world': return GridWorldMDP(config.get('size', 4)) else: return MarkovDecisionProcess(**config)4.2 值迭代优化
class ValueIterationOptimizer: def __init__(self, mdp): self.mdp = mdp def optimize(self, threshold=1e-6, max_iterations=1000): values = {s: 0.0 for s in self.mdp.states} for _ in range(max_iterations): delta = 0 for state in self.mdp.states: old_value = values[state] values[state] = self._bellman_update(state, values) delta = max(delta, abs(old_value - values[state])) if delta < threshold: break return values def _bellman_update(self, state, values): return max([ sum([ prob * (self.mdp.get_reward(state, action, next_state) + self.mdp.gamma * values[next_state]) for next_state, prob in self.mdp.transition_prob[state][action].items() ]) for action in self.mdp.actions ])5. 总结
马尔可夫决策过程是强化学习的基础:
- MDP建模:定义状态、动作、转移和奖励
- 值迭代:计算最优值函数
- 策略提取:从值函数提取最优策略
- 关键参数:折扣因子影响远期考虑
对比数据如下:
- 值迭代需要足够迭代次数才能收敛
- 折扣因子γ=0.99是常用选择
- 状态空间大小影响计算复杂度
- 推荐使用值迭代求解小型MDP