The quality of taking a particular action in a state (the action-value function) is “the expected discounted return when taking action \(a\) in state \(s\) and following the policy thereafter”:
\begin{equation} Q(s,a) = R(s,a) + \gamma \sum_{s'} T(s'|s,a) U(s') \end{equation}
where \(T(s'|s,a)\) is the probability of transitioning from \(s\) to \(s'\) when taking action \(a\), \(R(s,a)\) is the immediate reward, and \(\gamma\) is the discount factor.
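As an illustration (not from the text), here is a minimal Python sketch of this computation for a small discrete MDP; the tables `R`, `T`, `U` and their sizes are hypothetical placeholders.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (all numbers are placeholders).
n_states, n_actions = 3, 2
gamma = 0.9                                    # discount factor
R = np.array([[0.0, 1.0],                      # R[s, a]: immediate reward
              [2.0, 0.0],
              [0.5, 0.5]])
T = np.full((n_states, n_actions, n_states),   # T[s, a, s']: P(s' | s, a)
            1.0 / n_states)
U = np.zeros(n_states)                         # U[s']: current utility estimate

def action_value(s, a):
    """Q(s, a) = R(s, a) + gamma * sum_{s'} T(s'|s, a) * U(s')."""
    return R[s, a] + gamma * np.dot(T[s, a], U)
```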
value function
Therefore, the utility of being in a state (called the value function) is:
\begin{equation} U(s) = \max_{a} Q(s,a) \end{equation}
“the utility of a state is the value of its best action”
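Continuing the sketch above (still assuming the hypothetical `action_value` and `n_actions` defined there), the utility of a state is the best achievable action-value:

```python
def utility(s):
    """U(s) = max_a Q(s, a)."""
    return max(action_value(s, a) for a in range(n_actions))
```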
value-function policy
A value-function policy selects, in each state, the action that maximizes the action-value:
\begin{equation} \pi(s) = \arg\max_{a} Q(s,a) \end{equation}
“the policy that, in each state, takes the action with the highest action-value”
We call this \(\pi\) the “greedy policy with respect to \(U\)”.
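Again continuing the same hypothetical sketch, extracting the greedy policy amounts to an argmax over actions in each state:

```python
def greedy_policy(s):
    """pi(s) = argmax_a Q(s, a): the greedy policy with respect to U."""
    return max(range(n_actions), key=lambda a: action_value(s, a))

# One greedy action per state.
policy = [greedy_policy(s) for s in range(n_states)]
```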