强化学习策略参数调节方法及值迭代算法实现 CS188 Proj3 学习笔记-平芜编程栈

强烈推荐的更好的阅读体验

Q1.Value Iteration

第一个问题是最基础的值迭代实现，这个问题没有什么难度，主要就是一边看着公式一遍敲代码复现。可以先回顾一下Note8中的Value Iteration框架.唯一唯一需要注意的就是需要使用的是batch版本，而不是online版本。这里是前面Note没有提及到的概念，用图片来理解一下会更好一点

这个问题用的数据结构是Counter，它的底层容器是哈希表。Counter十分类似于Dictionary,它只是新增了一条设定，即所有初始值都为0。上图中一张图就可以理解为一个Counter，每个状态都有对应的Value,就和键值对应一样

所谓online版本，就是在某轮循环更新某个状态时，你用了本轮其它状态的值。可以参考图一，假设遍历的第一个状态是效用值为1的状态，1左边的方格在第一轮也会受到效用值1的影响。这样就是偷看了本轮新值。
图一到图二经历的过程就是batch版本，这个过程是由图一推演出来的，而并非参考第二轮迭代的新值。这样就严格遵循了V k V_kVk是由V k − 1 V_{k-1}Vk−1推演出来的定式

代码实现

defrunValueIteration(self):""" Run the value iteration algorithm. Note that in standard value iteration, V_k+1(...) depends on V_k(...)'s. """"*** YOUR CODE HERE ***"foriinrange(self.iterations):newValues=util.Counter()forstateinself.mdp.getStates():ifself.mdp.isTerminal(state):newValues[state]=0continueactions=self.mdp.getPossibleActions(state)ifnotactions:newValues[state]=0continueqValues=[]foractioninactions:q=self.computeQValueFromValues(state,action)qValues.append(q)newValues[state]=max(qValues)self.values=newValuesdefgetValue(self,state):""" Return the value of the state (computed in __init__). """returnself.values[state]defcomputeQValueFromValues(self,state,action):""" Compute the Q-value of action in state from the value function stored in self.values. """"*** YOUR CODE HERE ***"qValue=0transitions=self.mdp.getTransitionStatesAndProbs(state,action)fornextState,probintransitions:reward=self.mdp.getReward(state,action,nextState)qValue+=prob*(reward+self.discount*self.getValue(nextState))returnqValue util.raiseNotDefined()defcomputeActionFromValues(self,state):""" The policy is the best action in the given state according to the values currently stored in self.values. You may break ties any way you see fit. Note that if there are no legal actions, which is the case at the terminal state, you should return None. """"*** YOUR CODE HERE ***"ifself.mdp.isTerminal(state):returnNoneactions=self.mdp.getPossibleActions(state)ifnotactions:returnNonebestAction=NonebestValue=float('-inf')foractioninactions:q=self.computeQValueFromValues(state,action)ifq>bestValue:bestValue=q bestAction=actionreturnbestAction util.raiseNotDefined()

整体思路并不难，Coding过程只需要注意一下不要遗漏处理No Leagal Action的情况就可以了

Q2.Policies

Q2问题更简单了，这就是凭直觉调节参数的题目，需要注意的是有三个变量

Discount
- 如果更偏重眼前的利益，就应给在Value Iteration中给更远的奖励值上更小的折扣
- 如果想要长远的奖励值，就只能让其折扣更大，避免长期奖励被过度削弱
Noise
- 如果游走在Cliff附近，就要调小噪音避免不确定行为发生
- 如果绕远路，则鼓励Exploration,多去探索可能性
LivingReward
- 如果目的是一直不退出，则将存活奖励调大
- 反之则调小

代码实现

defquestion2a():""" Prefer the close exit (+1), risking the cliff (-10). """answerDiscount=0.3answerNoise=0.0answerLivingReward=0.0returnanswerDiscount,answerNoise,answerLivingReward# If not possible, return 'NOT POSSIBLE'defquestion2b():""" Prefer the close exit (+1), but avoiding the cliff (-10). """answerDiscount=0.3answerNoise=0.2answerLivingReward=0.0returnanswerDiscount,answerNoise,answerLivingReward# If not possible, return 'NOT POSSIBLE'defquestion2c():""" Prefer the distant exit (+10), risking the cliff (-10). """answerDiscount=0.8answerNoise=0.0answerLivingReward=0.0returnanswerDiscount,answerNoise,answerLivingReward# If not possible, return 'NOT POSSIBLE'defquestion2d():""" Prefer the distant exit (+10), avoiding the cliff (-10). """answerDiscount=0.8answerNoise=0.3answerLivingReward=0returnanswerDiscount,answerNoise,answerLivingReward# If not possible, return 'NOT POSSIBLE'defquestion2e():""" Avoid both exits and the cliff (so an episode should never terminate). """answerDiscount=0.9answerNoise=0.0answerLivingReward=1returnanswerDiscount,answerNoise,answerLivingReward# If not possible, return 'NOT POSSIBLE'

Q3.Q-Learning

Q3稍微有一点复杂但是难度并不大，主要就是围绕着Q-Learning的实现来Coding。其中需要完成的函数共有五个，唯一需要注意的点就是在computeActionFromQValues函数中，面临着相同状态同样最好的QValues，需要用到random.choice()来进行随机选择，要不然会不能通过autograder

代码实现

classQLearningAgent(ReinforcementAgent):""" Q-Learning Agent Functions you should fill in: - computeValueFromQValues - computeActionFromQValues - getQValue - getAction - update Instance variables you have access to - self.epsilon (exploration prob) - self.alpha (learning rate) - self.discount (discount rate) Functions you should use - self.getLegalActions(state) which returns legal actions for a state """def__init__(self,**args):"You can initialize Q-values here..."ReinforcementAgent.__init__(self,**args)"*** YOUR CODE HERE ***"self.qValues=util.Counter()defgetQValue(self,state,action):""" Returns Q(state,action) Should return 0.0 if we have never seen a state or the Q node value otherwise """"*** YOUR CODE HERE ***"returnself.qValues[(state,action)]util.raiseNotDefined()defcomputeValueFromQValues(self,state):""" Returns max_action Q(state,action) where the max is over legal actions. Note that if there are no legal actions, which is the case at the terminal state, you should return a value of 0.0. """"*** YOUR CODE HERE ***"actions=self.getLegalActions(state)bestQvalue=-float('inf')ifnotactions:return0.0foractioninactions:ifself.getQValue(state,action)>bestQvalue:bestQvalue=self.getQValue(state,action)returnbestQvalue# return max([self.getQValue(state, action) for action in actions])util.raiseNotDefined()defcomputeActionFromQValues(self,state):""" Compute the best action to take in a state. Note that if there are no legal actions, which is the case at the terminal state, you should return None. """"*** YOUR CODE HERE ***"actions=self.getLegalActions(state)bestQvalue=self.computeValueFromQValues(state)ifnotactions:returnNonebestActions=[actionforactioninactionsifself.getQValue(state,action)==bestQvalue]returnrandom.choice(bestActions)util.raiseNotDefined()defgetAction(self,state):""" Compute the action to take in the current state. With probability self.epsilon, we should take a random action and take the best policy action otherwise. Note that if there are no legal actions, which is the case at the terminal state, you should choose None as the action. HINT: You might want to use util.flipCoin(prob) HINT: To pick randomly from a list, use random.choice(list) """# Pick ActionlegalActions=self.getLegalActions(state)action=None"*** YOUR CODE HERE ***"ifnotlegalActions:returnNoneifutil.flipCoin(self.epsilon):returnrandom.choice(legalActions)else:returnself.computeActionFromQValues(state)util.raiseNotDefined()defupdate(self,state,action,nextState,reward:float):""" The parent class calls this to observe a state = action => nextState and reward transition. You should do your Q-Value update here NOTE: You should never call this function, it will be called on your behalf """"*** YOUR CODE HERE ***"sample=reward+self.discount*self.computeValueFromQValues(nextState)oldQ=self.getQValue(state,action)self.qValues[(state,action)]=(1-self.alpha)*oldQ+self.alpha*sampledefgetPolicy(self,state):returnself.computeActionFromQValues(state)defgetValue(self,state):returnself.computeValueFromQValues(state)

其中在computeValueFromQValues函数中带#号的一行为for循环以下的简便写法，我写的版本易读性有些差而且简洁性也不够强。
在初始化中，我们依旧选择Counter作为我们的数据存储结构，因为其初始化值为0的特性，我们的Coding过程方便了许多。
在函数update中，就是围绕着Q-Learning的核心公式展开的，这个公式可以回顾Note9中的详细介绍
函数getAction中的util.flipCoin(p)的功能不要遗漏看一下，在util.py文件中给出了如下实现。实现非常简单，含义也很明了：用在getAction函数中的的判断语句含义就是有self.epsilon的概率进入if分支，有1 - self.epsilon的概率进入else分支。这也是在Note10中提及到的ε-Greedy Policies

defflipCoin(p):r=random.random()returnr<p

Q4.Epsilon Greedy

Q4问题在Q3中已经实现了，没看清要求。正是上面刚刚提到的ε-Greedy Policies，原文档中也讲解了一下util.flipCoin(p)的具体逻辑。

You can simulate a binary variable with probability `p` of success by using `util.flipCoin(p)`, which returns `True` with probability `p` and `False` with probability `1-p`.

原文档中还给了两段几乎相同的shell指令

python gridworld.py-aq-k100--noise0.0-e0.1

10% 随机探索
90% 走当前最优 Q

python gridworld.py-aq-k100--noise0.0-e0.9

90% 随机探索
只有 10% 走最优Q
可以很明显发现第一段shell运行的时候，智能体在探索到Reward最高的道路时，就基本一直重复走相同路线。而第二段智能体则基本一直在无规则运动

Q5.Q-Learning and Pacman

上面的代码可以直接通过Q5的autograder。需要理解并回顾一下的是，mediumGrid在用Q-Learning去学习是行不通的，因为其状态空间巨大，Q-Learning并不具备泛化能力.智能体意识不到遇到ghost是坏事，智能体只能记住在某个具体board下撞鬼是坏事

Q6.Approximate Q-Learning

Q6所呈现的Approximate Q-Learning就具备的泛化能力，智能体能够学习经验而不是学习特定的情况下该做出什么特定的行动。这个问题并不难，文档里也提供了可能需要的函数的定义。
我们可以发现，Approximate Q-Learning的总表达式，启发式的表达式和评估函数的表达式是有点类似的，在Proj1中的Q6.遍历角落问题的启发式有着启发式的具体实现，在Proj2中Q1.Reflex Agent的，也可以回顾一下观察三者的形式,他们都有着共同的思想

Approximate Q-Learning表达式
Q ( s , a ) = ∑ i w i f i ( s , a ) \begin{align*} Q(s,a) &= \sum_{i} w_i f_i(s,a) \end{align*}Q(s,a)=i∑wifi(s,a)
启发式表达式
h ( s ) = ∑ i w i f i ( s ) \begin{align*} h(s) &= \sum_{i} w_i f_i(s) \end{align*}h(s)=i∑wifi(s)
评估函数表达式
Eval ⁡ ( s ) = ∑ i w i f i ( s ) \begin{align*} \operatorname{Eval}(s) &= \sum_{i} w_i f_i(s) \end{align*}Eval(s)=i∑wifi(s)

代码实现

classApproximateQAgent(PacmanQAgent):""" ApproximateQLearningAgent You should only have to overwrite getQValue and update. All other QLearningAgent functions should work as is. """def__init__(self,extractor='IdentityExtractor',**args):self.featExtractor=util.lookup(extractor,globals())()PacmanQAgent.__init__(self,**args)self.weights=util.Counter()defgetWeights(self):returnself.weightsdefgetQValue(self,state,action):""" Should return Q(state,action) = w * featureVector where * is the dotProduct operator """"*** YOUR CODE HERE ***"features=self.featExtractor.getFeatures(state,action)qValue=0.0forfinfeatures:qValue+=self.weights[f]*features[f]returnqValue util.raiseNotDefined()defupdate(self,state,action,nextState,reward:float):""" Should update your weights based on transition """"*** YOUR CODE HERE ***"features=self.featExtractor.getFeatures(state,action)currentQ=self.getQValue(state,action)nextValue=self.computeValueFromQValues(nextState)difference=(reward+self.discount*nextValue)-currentQforfinfeatures:self.weights[f]+=self.alpha*difference*features[f]deffinal(self,state):"""Called at the end of each game."""# call the super-class final methodPacmanQAgent.final(self,state)# did we finish training?ifself.episodesSoFar==self.numTraining:# you might want to print your weights here for debugging"*** YOUR CODE HERE ***"pass

其中FeatureExtractor类中的getFeatures函数定义如下

classFeatureExtractor:defgetFeatures(self,state,action):""" Returns a dict from features to counts Usually, the count will just be 1.0 for indicator functions. """util.raiseNotDefined()