news 2026/5/26 0:01:38

Tuning Reinforcement Learning Policy Parameters and Implementing Value Iteration: CS188 Proj3 Study Notes


张小明

Front-end developer


Q1.Value Iteration

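The first question is the most basic one: implementing value iteration. It is not difficult; it mostly comes down to typing out the code while following the formula. It helps to first review the Value Iteration framework in Note 8. The only thing to watch out for is that you need the batch version, not the online version. This is a concept the earlier Notes did not cover, and it is easier to understand with a picture.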
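For reference, the update being reproduced can be written in its standard textbook form. The symbols $T$, $R$ and $\gamma$ are the usual names for the transition probabilities, rewards and discount; they correspond to what the project code obtains from `getTransitionStatesAndProbs`, `getReward` and `self.discount`.

```latex
V_k(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V_{k-1}(s') \right]
```

The inner sum is the Q-value of one action, and the outer maximum picks the best action for the state.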

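The data structure used in this question is Counter, whose underlying container is a hash table. A Counter is very similar to a dictionary; it only adds one rule, namely that every value defaults to 0. Each grid in the figure above can be viewed as one Counter: every state has a corresponding value, just like a key-value pair.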
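The "dictionary whose missing keys read as 0" behaviour can be sketched in a few lines. `ZeroDefaultDict` below is a hypothetical stand-in written for illustration only; it is not the project's `util.Counter`, which has more methods.

```python
class ZeroDefaultDict(dict):
    """Hypothetical stand-in for a Counter: missing keys read as 0."""

    def __getitem__(self, key):
        # Store the default on first access so that `d[key] += x` works in place.
        self.setdefault(key, 0)
        return dict.__getitem__(self, key)


values = ZeroDefaultDict()
unseen = values[(0, 0)]      # a state that was never written reads as 0
values[(1, 2)] += 0.5        # accumulate without initialising first
print(unseen, values[(1, 2)])
```

Keys can be any hashable object, so grid coordinates or `(state, action)` tuples work equally well.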

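In the so-called online version, when you update a state during a sweep, you use values of other states that were already updated in that same sweep. Take Figure 1 as an example: if the first state visited is the one with utility 1, then the cell to its left is already influenced by that utility of 1 in the first round. That amounts to peeking at the current round's new values.
Going from Figure 1 to Figure 2 is the batch version: Figure 2 is derived entirely from Figure 1, without consulting any of the new values from the second iteration. This strictly follows the rule that $V_k$ is derived from $V_{k-1}$.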
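The difference between the two versions can be seen on a tiny example. The sketch below uses a made-up three-state chain (0 -> 1 -> 2, where 2 is terminal and entering it pays reward 1); the chain, the names and the discount of 0.9 are all assumptions for illustration, not part of the project.

```python
STATES = [0, 1, 2]
GAMMA = 0.9  # assumed discount


def backup(state, values):
    """One Bellman backup for the single 'move right' action."""
    if state == 2:                      # terminal state keeps value 0
        return 0.0
    nxt = state + 1
    reward = 1.0 if nxt == 2 else 0.0   # reward only for entering state 2
    return reward + GAMMA * values[nxt]


def batch_sweep(values):
    """Batch: every backup reads only the previous iteration's values."""
    return {s: backup(s, values) for s in STATES}


def in_place_sweep(values, order):
    """Online: later backups may read values written earlier in this sweep."""
    values = dict(values)
    for s in order:
        values[s] = backup(s, values)
    return values


v0 = {s: 0.0 for s in STATES}
batch = batch_sweep(v0)
online = in_place_sweep(v0, order=[1, 0, 2])
print(batch[0], online[0])
```

After one batch sweep, state 0 is still 0 because it only saw the old value of state 1. In the in-place sweep, state 1 is updated first, so state 0 already picks up the new value within the same round.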

Code implementation

```python
# Methods of ValueIterationAgent in valueIterationAgents.py

def runValueIteration(self):
    """
    Run the value iteration algorithm. Note that in standard
    value iteration, V_k+1(...) depends on V_k(...)'s.
    """
    "*** YOUR CODE HERE ***"
    for i in range(self.iterations):
        # Batch update: write into a fresh Counter so that every backup in
        # this sweep reads only the previous iteration's values.
        newValues = util.Counter()
        for state in self.mdp.getStates():
            if self.mdp.isTerminal(state):
                newValues[state] = 0
                continue
            actions = self.mdp.getPossibleActions(state)
            if not actions:
                newValues[state] = 0
                continue
            qValues = []
            for action in actions:
                q = self.computeQValueFromValues(state, action)
                qValues.append(q)
            newValues[state] = max(qValues)
        self.values = newValues

def getValue(self, state):
    """
    Return the value of the state (computed in __init__).
    """
    return self.values[state]

def computeQValueFromValues(self, state, action):
    """
    Compute the Q-value of action in state from the
    value function stored in self.values.
    """
    "*** YOUR CODE HERE ***"
    qValue = 0
    transitions = self.mdp.getTransitionStatesAndProbs(state, action)
    for nextState, prob in transitions:
        reward = self.mdp.getReward(state, action, nextState)
        qValue += prob * (reward + self.discount * self.getValue(nextState))
    return qValue

def computeActionFromValues(self, state):
    """
    The policy is the best action in the given state
    according to the values currently stored in self.values.

    You may break ties any way you see fit. Note that if
    there are no legal actions, which is the case at the
    terminal state, you should return None.
    """
    "*** YOUR CODE HERE ***"
    if self.mdp.isTerminal(state):
        return None
    actions = self.mdp.getPossibleActions(state)
    if not actions:
        return None
    bestAction = None
    bestValue = float('-inf')
    for action in actions:
        q = self.computeQValueFromValues(state, action)
        if q > bestValue:
            bestValue = q
            bestAction = action
    return bestAction
```

The overall idea is not hard; while coding, just make sure not to forget the case where a state has no legal actions.


Q2.Policies

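Q2 is even simpler: it is a matter of tuning parameters by intuition. Note that there are three variables to set: the discount, the noise, and the living reward.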
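To make the intuition about the discount concrete, the return of a deterministic path to each exit can be compared directly. This is only a rough sketch: it assumes zero noise, and the step counts (3 moves to the close exit, 5 to the distant one) are illustrative assumptions rather than values read from the project's grid. `discounted_exit_value` is a hypothetical helper.

```python
def discounted_exit_value(exit_reward, steps, discount, living_reward=0.0):
    """Return of a noise-free path: `steps` moves, then the exit reward.

    Each move earns `living_reward`; the exit reward is discounted by
    `discount ** steps`.
    """
    living = sum(living_reward * discount ** t for t in range(steps))
    return living + exit_reward * discount ** steps


for discount in (0.3, 0.8):
    near = discounted_exit_value(exit_reward=1, steps=3, discount=discount)
    far = discounted_exit_value(exit_reward=10, steps=5, discount=discount)
    choice = "close exit" if near > far else "distant exit"
    print(f"discount={discount}: near={near:.4f}, far={far:.4f} -> {choice}")
```

Under these assumptions a small discount favours the close exit and a large one favours the distant exit. Noise then decides whether the path along the cliff is worth the risk, and a positive living reward makes staying alive more attractive than exiting at all.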

Code implementation

```python
def question2a():
    """
    Prefer the close exit (+1), risking the cliff (-10).
    """
    answerDiscount = 0.3
    answerNoise = 0.0
    answerLivingReward = 0.0
    return answerDiscount, answerNoise, answerLivingReward
    # If not possible, return 'NOT POSSIBLE'

def question2b():
    """
    Prefer the close exit (+1), but avoiding the cliff (-10).
    """
    answerDiscount = 0.3
    answerNoise = 0.2
    answerLivingReward = 0.0
    return answerDiscount, answerNoise, answerLivingReward
    # If not possible, return 'NOT POSSIBLE'

def question2c():
    """
    Prefer the distant exit (+10), risking the cliff (-10).
    """
    answerDiscount = 0.8
    answerNoise = 0.0
    answerLivingReward = 0.0
    return answerDiscount, answerNoise, answerLivingReward
    # If not possible, return 'NOT POSSIBLE'

def question2d():
    """
    Prefer the distant exit (+10), avoiding the cliff (-10).
    """
    answerDiscount = 0.8
    answerNoise = 0.3
    answerLivingReward = 0
    return answerDiscount, answerNoise, answerLivingReward
    # If not possible, return 'NOT POSSIBLE'

def question2e():
    """
    Avoid both exits and the cliff (so an episode should never terminate).
    """
    answerDiscount = 0.9
    answerNoise = 0.0
    answerLivingReward = 1
    return answerDiscount, answerNoise, answerLivingReward
    # If not possible, return 'NOT POSSIBLE'
```

Q3.Q-Learning

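Q3 is slightly more involved but still not hard; the coding revolves around implementing Q-learning. Five functions need to be completed. The only point to watch is in computeActionFromQValues: when several actions in the same state share the best Q-value, use random.choice() to pick among them at random, otherwise the autograder will not pass.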
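The tie-breaking step can be isolated into a small standalone sketch. `argmax_with_random_ties` is a hypothetical helper that takes a plain dict of action to Q-value; the project code works on `(state, action)` keys instead.

```python
import random


def argmax_with_random_ties(q_by_action):
    """Pick uniformly among the actions whose Q-value equals the maximum."""
    best = max(q_by_action.values())
    best_actions = [a for a, q in q_by_action.items() if q == best]
    return random.choice(best_actions)


# Before any learning every Q-value is 0, so all four actions tie.
fresh = {"north": 0.0, "south": 0.0, "east": 0.0, "west": 0.0}
print(argmax_with_random_ties(fresh))

# Once one action is clearly better, the choice is deterministic.
learned = {"north": 0.0, "south": 1.0}
print(argmax_with_random_ties(learned))
```

Without the random choice, an agent whose Q-values all start at 0 would always take whichever action happens to come first in the list, which skews its early behaviour toward that action.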

Code implementation

```python
# Imports as in the project's qlearningAgents.py
import random
import util
from learningAgents import ReinforcementAgent


class QLearningAgent(ReinforcementAgent):
    """
    Q-Learning Agent

    Functions you should fill in:
      - computeValueFromQValues
      - computeActionFromQValues
      - getQValue
      - getAction
      - update

    Instance variables you have access to
      - self.epsilon (exploration prob)
      - self.alpha (learning rate)
      - self.discount (discount rate)

    Functions you should use
      - self.getLegalActions(state)
        which returns legal actions for a state
    """
    def __init__(self, **args):
        "You can initialize Q-values here..."
        ReinforcementAgent.__init__(self, **args)

        "*** YOUR CODE HERE ***"
        # Q-table keyed by (state, action); unseen pairs read as 0.
        self.qValues = util.Counter()

    def getQValue(self, state, action):
        """
        Returns Q(state,action)
        Should return 0.0 if we have never seen a state
        or the Q node value otherwise
        """
        "*** YOUR CODE HERE ***"
        return self.qValues[(state, action)]

    def computeValueFromQValues(self, state):
        """
        Returns max_action Q(state,action)
        where the max is over legal actions. Note that if
        there are no legal actions, which is the case at the
        terminal state, you should return a value of 0.0.
        """
        "*** YOUR CODE HERE ***"
        actions = self.getLegalActions(state)
        bestQvalue = -float('inf')
        if not actions:
            return 0.0
        for action in actions:
            if self.getQValue(state, action) > bestQvalue:
                bestQvalue = self.getQValue(state, action)
        return bestQvalue
        # return max([self.getQValue(state, action) for action in actions])

    def computeActionFromQValues(self, state):
        """
        Compute the best action to take in a state. Note that if there
        are no legal actions, which is the case at the terminal state,
        you should return None.
        """
        "*** YOUR CODE HERE ***"
        actions = self.getLegalActions(state)
        bestQvalue = self.computeValueFromQValues(state)
        if not actions:
            return None
        # Collect every action tied for the best Q-value, then pick one at random.
        bestActions = [action for action in actions
                       if self.getQValue(state, action) == bestQvalue]
        return random.choice(bestActions)

    def getAction(self, state):
        """
        Compute the action to take in the current state. With
        probability self.epsilon, we should take a random action and
        take the best policy action otherwise. Note that if there are
        no legal actions, which is the case at the terminal state, you
        should choose None as the action.

        HINT: You might want to use util.flipCoin(prob)
        HINT: To pick randomly from a list, use random.choice(list)
        """
        # Pick Action
        legalActions = self.getLegalActions(state)
        action = None
        "*** YOUR CODE HERE ***"
        if not legalActions:
            return None
        if util.flipCoin(self.epsilon):
            return random.choice(legalActions)
        else:
            return self.computeActionFromQValues(state)

    def update(self, state, action, nextState, reward: float):
        """
        The parent class calls this to observe a
        state = action => nextState and reward transition.
        You should do your Q-Value update here

        NOTE: You should never call this function,
        it will be called on your behalf
        """
        "*** YOUR CODE HERE ***"
        sample = reward + self.discount * self.computeValueFromQValues(nextState)
        oldQ = self.getQValue(state, action)
        # Running average of the old estimate and the new sample.
        self.qValues[(state, action)] = (1 - self.alpha) * oldQ + self.alpha * sample

    def getPolicy(self, state):
        return self.computeActionFromQValues(state)

    def getValue(self, state):
        return self.computeValueFromQValues(state)
```

For reference, flipCoin is defined in util.py as:

```python
def flipCoin(p):
    r = random.random()
    return r < p
```

Q4.Epsilon Greedy

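Q4 turned out to be already implemented in Q3, because I had not read the requirements carefully. It is exactly the ε-greedy policy mentioned just above; the original document also explains how util.flipCoin(p) works.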

You can simulate a binary variable with probability `p` of success by using `util.flipCoin(p)`, which returns `True` with probability `p` and `False` with probability `1-p`.

The original document also gives two nearly identical shell commands:

python gridworld.py -a q -k 100 --noise 0.0 -e 0.1
python gridworld.py -a q -k 100 --noise 0.0 -e 0.9
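The two commands differ only in the value passed to `-e`, the exploration rate. A standalone sketch of what that value controls is shown below; `flip_coin` and `epsilon_greedy` are hypothetical helpers that mirror the logic described above without depending on the project files.

```python
import random


def flip_coin(p):
    """Return True with probability p, False with probability 1 - p."""
    return random.random() < p


def epsilon_greedy(legal_actions, greedy_action, epsilon):
    """With probability epsilon explore uniformly, otherwise exploit."""
    if not legal_actions:
        return None
    if flip_coin(epsilon):
        return random.choice(legal_actions)
    return greedy_action


actions = ["north", "south", "east", "west"]
for epsilon in (0.1, 0.9):
    picks = [epsilon_greedy(actions, "north", epsilon) for _ in range(10_000)]
    share = picks.count("north") / len(picks)
    print(f"epsilon={epsilon}: greedy action taken in about {share:.0%} of steps")
```

Because the random branch can also land on the greedy action, the greedy action is taken with probability 1 - ε + ε/|A|, slightly more often than 1 - ε.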

Q5.Q-Learning and Pacman

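The code above passes the Q5 autograder as is. What is worth understanding and reviewing is that learning mediumGrid with Q-learning does not work, because its state space is huge and tabular Q-learning has no ability to generalize. The agent does not learn that running into a ghost is bad in general; it can only remember that running into a ghost was bad on one specific board configuration.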
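The lack of generalization can be illustrated with a toy comparison. Everything in this sketch is hypothetical: the board names, the feature name `ghost_one_step_away` and the numbers are made up to show the idea and do not come from the project.

```python
from collections import defaultdict

# Tabular Q-learning: one independent entry per exact (state, action) pair.
q_table = defaultdict(float)
q_table[("board_A", "west")] = -50.0          # learned on board A only
print(q_table[("board_B", "west")])           # board B was never visited


def linear_q(weights, features):
    """Q-value as a weighted sum of features (the idea behind Q6)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Feature-based: both boards expose the same feature, so they share one weight.
weights = {"ghost_one_step_away": -50.0}
features_board_a = {"ghost_one_step_away": 1.0}
features_board_b = {"ghost_one_step_away": 1.0}
print(linear_q(weights, features_board_a), linear_q(weights, features_board_b))
```

The table returns the default of 0 for the unseen board, while the feature-based estimate carries the penalty over to any state in which the same feature fires.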


Q6.Approximate Q-Learning

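The Approximate Q-Learning presented in Q6 does have the ability to generalize: the agent learns from experience instead of memorizing which specific action to take in each specific situation. This question is not hard, and the document provides the definitions of the functions you may need.
Notice that the general expression for Approximate Q-Learning, the expression for a heuristic, and the expression for an evaluation function all look somewhat alike. The heuristic for the corners problem in Proj1 Q6 is a concrete heuristic implementation, and the Reflex Agent in Proj2 Q1 is a concrete evaluation function. It is worth going back and comparing the three forms, because they share the same underlying idea.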
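Written out, the expressions that the Q6 code reproduces take the following standard form, where $f_i$ are the features returned by the extractor, $w_i$ the learned weights, $\alpha$ the learning rate and $\gamma$ the discount:

```latex
\begin{align*}
Q(s, a) &= \sum_{i} w_i \, f_i(s, a) \\
\text{difference} &= \Bigl( r + \gamma \max_{a'} Q(s', a') \Bigr) - Q(s, a) \\
w_i &\leftarrow w_i + \alpha \cdot \text{difference} \cdot f_i(s, a)
\end{align*}
```

The first line is a weighted sum of features, which is the same shape as a hand-written heuristic or evaluation function; the difference is that here the weights are learned rather than chosen by hand.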

Code implementation

```python
class ApproximateQAgent(PacmanQAgent):
    """
    ApproximateQLearningAgent

    You should only have to overwrite getQValue
    and update. All other QLearningAgent functions
    should work as is.
    """
    def __init__(self, extractor='IdentityExtractor', **args):
        self.featExtractor = util.lookup(extractor, globals())()
        PacmanQAgent.__init__(self, **args)
        self.weights = util.Counter()

    def getWeights(self):
        return self.weights

    def getQValue(self, state, action):
        """
        Should return Q(state,action) = w * featureVector
        where * is the dotProduct operator
        """
        "*** YOUR CODE HERE ***"
        features = self.featExtractor.getFeatures(state, action)
        qValue = 0.0
        for f in features:
            qValue += self.weights[f] * features[f]
        return qValue

    def update(self, state, action, nextState, reward: float):
        """
        Should update your weights based on transition
        """
        "*** YOUR CODE HERE ***"
        features = self.featExtractor.getFeatures(state, action)
        currentQ = self.getQValue(state, action)
        nextValue = self.computeValueFromQValues(nextState)
        # Compute the difference once, before any weight is modified.
        difference = (reward + self.discount * nextValue) - currentQ
        for f in features:
            self.weights[f] += self.alpha * difference * features[f]

    def final(self, state):
        """Called at the end of each game."""
        # call the super-class final method
        PacmanQAgent.final(self, state)

        # did we finish training?
        if self.episodesSoFar == self.numTraining:
            # you might want to print your weights here for debugging
            "*** YOUR CODE HERE ***"
            pass
```

The getFeatures function in the FeatureExtractor class is defined as follows:

```python
class FeatureExtractor:
    def getFeatures(self, state, action):
        """
        Returns a dict from features to counts
        Usually, the count will just be 1.0 for
        indicator functions.
        """
        util.raiseNotDefined()
```

Nothing in the implementation is particularly hard; you just need to reproduce the formula in code.
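A single weight update can be traced by hand with a standalone sketch. `approx_q_update` is a hypothetical function that mirrors the update logic using plain dicts; the feature names and numbers are invented for the example.

```python
def approx_q_update(weights, features, reward, discount, alpha, next_value):
    """Return new weights after one approximate Q-learning update."""
    current_q = sum(weights.get(f, 0.0) * v for f, v in features.items())
    # The difference is computed once, from the weights before the update.
    difference = (reward + discount * next_value) - current_q
    new_weights = dict(weights)
    for f, v in features.items():
        new_weights[f] = weights.get(f, 0.0) + alpha * difference * v
    return new_weights


weights = {}                                   # all weights start at 0
features = {"bias": 1.0, "closest-food": 0.5}  # made-up feature values
updated = approx_q_update(weights, features, reward=10.0,
                          discount=0.8, alpha=0.2, next_value=0.0)
print(updated)
```

With all weights at 0 the current Q-value is 0, so the difference is 10; each weight then moves by 0.2 × 10 × its feature value, giving 2.0 for `bias` and 1.0 for `closest-food`. Computing the difference inside the loop, after some weights have already changed, would mix old and new weights in one update.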
