Markov Decision Processes

Published 2024-01-12
Updated 2024-05-17

Note: this article keeps much of the English phrasing because many of these concepts are easier to state and understand in English.

What is a Markov Process?

"The future is independent of the past given the present"
(未来仅与当前有关,与过去无关。
即:t+1时刻的状态仅与t时刻的状态有关,与t-1时刻无关
)

这就是马尔可夫过程。

Definition: A state $S_t$ is Markov if and only if

$$ \begin{gather} P[S_{t+1} | S_t] = P[S_{t+1} | S_1, \dots, S_t] \end{gather} $$
  • The state captures all relevant information from the history
  • Once the state is known, the history may be thrown away
  • The state is a sufficient statistic of the future

In a Markov decision process, the probabilities given by p completely characterize the environment's dynamics. That is, the probability of each possible value for $S_t$ and $R_t$ depends only on the immediately preceding state and action, $S_{t-1}$ and $A_{t-1}$, and, given them, not at all on earlier states and actions. This is best viewed as a restriction not on the decision process, but on the state. The state must include information about all aspects of the past agent–environment interaction that make a difference for the future. If it does, then the state is said to have the Markov property.


State transition matrix:
Define the state transition probability from the current Markov state $s$ to a successor state $s'$ as

$$ \begin{gather} p _{ss'} = P[S_{t+1} = s' | S_t =s] \end{gather} $$

Collecting these probabilities over every current state $s$ and successor state $s'$ gives the state transition matrix shown below.

![](20240116101638.png)

The entry $p_{ij}$ of this matrix is the probability of transitioning from state $i$ to state $j$, so every row sums to 1.
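To make the row-stochastic structure concrete, here is a minimal Python sketch; the states and probabilities are made up for illustration, not taken from the figure:

```python
import random

# A small transition matrix for hypothetical states A, B, C.
# Row i holds the probabilities of moving from state i to each state j,
# so every row must sum to 1.
states = ["A", "B", "C"]
P = [
    [0.5, 0.3, 0.2],  # from A
    [0.1, 0.6, 0.3],  # from B
    [0.0, 0.4, 0.6],  # from C
]

# Sanity check: each row of a stochastic matrix sums to 1.
for row in P:
    assert abs(sum(row) - 1.0) < 1e-9

def step(i):
    """Sample the next state index from state i according to row P[i]."""
    return random.choices(range(len(states)), weights=P[i])[0]

print(states[step(0)])  # one of "A", "B", "C"
```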

A Markov process is also known as a Markov chain. It is a memoryless random process that can be represented as a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$, where $\mathcal{S}$ is a finite set of states and $\mathcal{P}$ is a state transition probability matrix.
A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$ where

  • S is a (finite) set of states
  • P is a state transition probability matrix

Episodic and Continuing Tasks

A Markov reward process is a Markov chain with values.
In reinforcement learning, the agent's goal is expressed through a reward signal delivered by the environment, with $R_t \in \mathbb{R}$. The agent's objective is to maximize the cumulative reward it receives over the long run. This is the reward hypothesis:

The goal is to maximize the expected value of the agent's cumulative sum of rewards:
That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

Denote the sequence of rewards received after time step t as $R_{t+1}, R_{t+2}, R_{t+3}, \dots$, and denote the return by $G_t$. In the simplest case, the return is just the sum of the rewards:

$$ \begin{gather} G_t \doteq R_{t+1} + R_{t+2} +R_{t+3} + \cdots + R_T \end{gather} $$

where T is a final time step. Settings such as a play of a game, a trip through a maze, or a game of chess contain a natural "final time step"; each complete run, e.g. one finished game of chess, is called an episode. After an episode ends, a new one begins from a standard starting state, or from a state sampled from a standard distribution of starting states. The opening position of a new chess game, for instance, is completely independent of how the previous game ended.

Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Even if you think of episodes as ending in different ways, such as winning and losing a game, the next episode begins independently of how the previous one ended.
Tasks with episodes of this kind are called episodic tasks.

In the other kind of task, the interaction does not break into episodes but goes on continually, as in a continuous process-control task or a long-running robot application. These are called continuing tasks. Here $T = \infty$, so to keep the return formula convergent we introduce discounting: the agent maximizes the sum of future rewards weighted by a discount factor, as follows:
The return $G_t$ is the total discounted reward from time-step t

$$ \begin{align} G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum _{k=0}^{\infty} \gamma^k R_{t+k+1} \\ &= R_{t+1} + \gamma G_{t+1} \end{align} $$

where the discount rate $\gamma$ satisfies $0 \leq \gamma \leq 1$.
The discount $\gamma \in [0,1]$ expresses the present value of future rewards.
The value of receiving reward $R$ after $k+1$ time steps is $\gamma^k R$.
  • This values immediate reward above delayed reward.
  • $\gamma$ close to 0 leads to "myopic" evaluation.
  • $\gamma$ close to 1 leads to "far-sighted" evaluation.
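The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ suggests computing every return in one backward sweep over a reward sequence. A minimal sketch (the reward values are made up):

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for each t by sweeping backward:
    G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([1.0, 2.0, 3.0], 0.5))  # [2.75, 3.5, 3.0]
```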

As long as the rewards are bounded and $\gamma < 1$, this infinite sum converges. For example, when the reward is the constant 1 at every step:

$$ \begin{gather} G_t = \sum _{k=0}^{\infty} \gamma^k =\frac{1}{1-\gamma} \end{gather} $$
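A quick numerical check of this geometric-series limit (the value of $\gamma$ is chosen arbitrarily):

```python
gamma = 0.9

# A long partial sum of sum_{k=0}^{inf} gamma^k approximates 1/(1-gamma).
partial_sum = sum(gamma**k for k in range(1000))
closed_form = 1.0 / (1.0 - gamma)

print(partial_sum, closed_form)
assert abs(partial_sum - closed_form) < 1e-6
```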

Next, let us look at a unified notation for episodic and continuing tasks.
In episodic tasks we could write $S_{t,i}$ for the state at time $t$ of episode $i$ (and likewise $A_{t,i}, R_{t,i}, \pi_{t,i}, T_i$). In practice, when discussing episodic tasks we almost never need to distinguish between episodes, so we simply write $S_t$.
To describe episodic and continuing tasks with one formula, we treat episode termination as entering a special absorbing state that transitions only to itself and yields only reward 0. The reward sequence then becomes an infinite sequence, and we have:

$$ \begin{gather} G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_{k} \end{gather} $$

This covers both cases: $T$ may be $\infty$, or $\gamma$ may be 1, but not both at once.

Policies and Value Functions

Formally, a policy is a mapping from states to probabilities of selecting each possible action.
The policy $\pi(a|s)$ gives the probability that $A_t = a$ when $S_t = s$:

$$ \begin{gather} \pi (a|s) = \Bbb{P} [A_t =a | S_t = s] \end{gather} $$
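As a concrete sketch, a stochastic policy over a few actions can be stored as a per-state probability table and sampled from; the states, actions, and probabilities below are hypothetical:

```python
import random

# Hypothetical policy: for each state, a distribution over actions.
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

# Each row is a probability distribution, so it must sum to 1.
for dist in policy.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

def sample_action(state):
    """Draw A_t ~ pi(. | S_t = state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return random.choices(actions, weights=weights)[0]

print(sample_action("s0"))  # "left" or "right"
```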

We write $v_{\pi}(s)$ for the value of state $s$ under policy $\pi$: the expected return when the agent starts in state $s$ and follows $\pi$ thereafter. For an MDP, it is defined as:

![](20240115215541.png)

$$ \begin{gather} v_\pi(s) \doteq \mathbb{E}_\pi [G_t \mid S_t = s] = \mathbb{E}_\pi \Big[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s \Big] \end{gather} $$
Here $t$ can be any time step, and the value of a terminal state is always 0. $v_{\pi}$ is called the state-value function for policy $\pi$.
Similarly, the value of taking action $a$ in state $s$ under policy $\pi$, written $q_{\pi}(s,a)$, is the expected return starting from $s$, taking action $a$, and following $\pi$ thereafter:
![](20240115220624.png)

$$ \begin{gather} q_\pi(s,a) \doteq \mathbb{E}_\pi [G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi \Big[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a \Big] \end{gather} $$

![](20240115222113.png)

The value function is the unique solution to its Bellman equation:

![](20240115222759.png)

$$ \begin{gather} v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big] \end{gather} $$
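For a finite Markov reward process (an MDP with the policy held fixed), the Bellman equation is linear, $v = R + \gamma P v$, so the unique solution can be computed in closed form as $v = (I - \gamma P)^{-1} R$. A sketch with made-up numbers:

```python
import numpy as np

# Hypothetical 3-state Markov reward process.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # absorbing (terminal) state
])
R = np.array([1.0, 2.0, 0.0])  # expected immediate reward in each state
gamma = 0.9

# Unique solution of v = R + gamma * P v.
v = np.linalg.solve(np.eye(3) - gamma * P, R)

# The solution satisfies the Bellman equation.
assert np.allclose(v, R + gamma * P @ v)
print(v)
```

The terminal state's value comes out as 0, matching the convention above.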

A Markov Chain Example

The Markov chain below models a student's decision process:

![](20240116104524.png)
Clearly, the future depends only on the present, not on the past.
sleep is the terminal state.
The state transition matrix is:
![](20240116104723.png)
Blank entries are 0.

Markov Reward Process

![](20240116104855.png)

Using the earlier formulas, with $\gamma = 0.5$:

![](20240116105332.png)

With $\gamma = 0$, only the immediate reward matters and long-term reward is ignored:

![](20240116110106.png)

With $\gamma = 0.9$, long-term reward carries more weight:

![](20240116110212.png)

Because this problem has a terminal state, $\gamma = 1$ is also allowed:

![](20240116110348.png)

Iterative computation of the Bellman equation:

![](20240116110726.png)
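The iteration applies the Bellman equation as an update, $v \leftarrow R + \gamma P v$, until the values stop changing. A sketch with made-up numbers (not the student example from the figures):

```python
# Hypothetical 3-state Markov reward process, iterated to the fixed point
# of v = R + gamma * P v.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # absorbing state with reward 0
]
R = [1.0, 2.0, 0.0]
gamma = 0.9

v = [0.0, 0.0, 0.0]
for _ in range(1000):
    v_new = [R[i] + gamma * sum(P[i][j] * v[j] for j in range(3)) for i in range(3)]
    if max(abs(a - b) for a, b in zip(v_new, v)) < 1e-10:
        break
    v = v_new

print(v)  # fixed point of the Bellman equation
```

Because the update is a $\gamma$-contraction, the iterates converge to the same unique solution a direct linear solve would give.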

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:

![](20240116111331.png)

The state-value function:

![](20240116111752.png)

The Bellman equation:

![](20240116111909.png)

The optimal state-value function:

![](20240116112718.png)

The optimal action-value function:

![](20240116112829.png)

The optimal policy:

![](20240116112946.png)
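The optimal value function and a greedy optimal policy can be found by value iteration, repeatedly applying the Bellman optimality backup $v(s) \leftarrow \max_a \big[ r(s,a) + \gamma \, v(s') \big]$. A minimal sketch on a made-up two-state deterministic MDP, where state 1 is terminal:

```python
# Hypothetical MDP: in state 0, "stay" yields reward 0 and remains in state 0,
# while "exit" yields reward 1 and moves to the terminal state 1 (value 0).
gamma = 0.9
actions = {
    "stay": (0.0, 0),  # (reward, deterministic next state)
    "exit": (1.0, 1),
}

v = [0.0, 0.0]  # v[1] stays 0: terminal state
for _ in range(200):
    # Bellman optimality backup for state 0.
    v[0] = max(r + gamma * v[s2] for r, s2 in actions.values())

# Greedy policy with respect to the converged values.
best_action = max(actions, key=lambda a: actions[a][0] + gamma * v[actions[a][1]])

print(v[0], best_action)  # 1.0 exit
```

Here "stay" is worth $0.9 \cdot v^*(0) = 0.9 < 1$, so the greedy policy correctly chooses "exit".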
