The quality of taking a particular action in a state (the action-value function) is “the expected discounted return when taking action \(a\) in state \(s\) and following the policy thereafter”:
\begin{equation} Q(s,a) = R(s,a) + \gamma \sum_{s'} T(s'|s,a) U(s') \end{equation}
where \(T(s'|s,a)\) is the probability of transitioning from \(s\) to \(s'\) when taking action \(a\), \(R(s,a)\) is the immediate reward, and \(\gamma\) is the discount factor.
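As an illustration (not from the text), here is a minimal Python sketch of this computation for a small discrete MDP; the tables `R`, `T`, `U` and their sizes are hypothetical placeholders.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (all numbers are placeholders).
n_states, n_actions = 3, 2
gamma = 0.9                                    # discount factor
R = np.array([[0.0, 1.0],                      # R[s, a]: immediate reward
              [2.0, 0.0],
              [0.5, 0.5]])
T = np.full((n_states, n_actions, n_states),   # T[s, a, s']: P(s' | s, a)
            1.0 / n_states)
U = np.zeros(n_states)                         # U[s']: current utility estimate

def action_value(s, a):
    """Q(s, a) = R(s, a) + gamma * sum_{s'} T(s'|s, a) * U(s')."""
    return R[s, a] + gamma * np.dot(T[s, a], U)
```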
value function
Therefore, the utility of being in a state (called the value function) is:
\begin{equation} U(s) = \max_{a} Q(s,a) \end{equation}
“the utility of a state is the value of its best action”
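Continuing the sketch above (still assuming the hypothetical `action_value` and `n_actions` defined there), the utility of a state is the best achievable action-value:

```python
def utility(s):
    """U(s) = max_a Q(s, a)."""
    return max(action_value(s, a) for a in range(n_actions))
```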
value-function policy
A value-function policy selects, in each state, the action that maximizes the action-value:
\begin{equation} \pi(s) = \arg\max_{a} Q(s,a) \end{equation}
“the policy that, in each state, takes the action with the highest action-value”
We call this \(\pi\) the “greedy policy with respect to \(U\)”.
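Again continuing the same hypothetical sketch, extracting the greedy policy amounts to an argmax over actions in each state:

```python
def greedy_policy(s):
    """pi(s) = argmax_a Q(s, a): the greedy policy with respect to U."""
    return max(range(n_actions), key=lambda a: action_value(s, a))

# One greedy action per state.
policy = [greedy_policy(s) for s in range(n_states)]
```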