## Equilibrium Policies

In my thesis research I put forward the notion of **equilibrium policies**: policies for acting at different points in a spatial landscape, defined as the equilibrium of a specially constructed Markov chain. The idea is to define **local policies**, expressed as **conditional causal distributions**, for how to act at a particular location. These local policies interact with each other over space and time, so that the joint distribution over landscape-wide actions is the equilibrium distribution of those interacting local policies.

The most straightforward way to implement this is to model a parameterized form of the local policy as a conditional distribution

p(action at location c | state of location c, states of nearby locations {d1, d2, d3, ...}, parameters)
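As a concrete illustration of this conditional form, here is a minimal sketch of a parameterized local policy. The softmax-over-linear-features form, the feature vector, and all function names are illustrative assumptions, not the thesis's actual model:

```python
import numpy as np

def local_policy_probs(theta, own_state, neighbor_states):
    """Distribution over actions at one location, conditioned on that
    location's state and the states of nearby locations.

    Assumed form: softmax over linear features. theta has one weight
    row per action, e.g. shape [num_actions, 3] for the 3 features below.
    """
    # Illustrative features: bias, own state, mean of neighbor states.
    features = np.array([1.0, own_state, np.mean(neighbor_states)])
    logits = theta @ features
    exp = np.exp(logits - logits.max())  # subtract max for stability
    return exp / exp.sum()
```

Any other parameterized conditional distribution (e.g. a small neural network) could play the same role.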

Next, construct a Markov chain in which each timestep consists of a set of action variables, one for each location. Perform Gibbs sampling on these variables, resampling the action for each location in turn while conditioning on the most recent values at all other locations. Once the chain (eventually) converges, you are sampling joint landscape actions from the equilibrium distribution represented by the parameters.
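The Gibbs sweep described above can be sketched as follows. This is a self-contained toy on a 1-D ring of locations with binary actions; the logistic local policy, the choice of features, and the coupling through neighbouring actions are all illustrative assumptions made so the sampler has something nontrivial to mix over:

```python
import numpy as np

def gibbs_sample_joint_action(theta, states, num_sweeps=500, seed=0):
    """Gibbs-sample a joint landscape action.

    Each sweep resamples every location's action from an assumed local
    policy (logistic in a few hand-picked features), conditioning on the
    location's own state and the most recent actions of its neighbours.
    After many sweeps, the returned joint action approximates a draw
    from the chain's equilibrium distribution.
    """
    rng = np.random.default_rng(seed)
    n = len(states)
    actions = rng.integers(0, 2, size=n)  # arbitrary binary initialisation
    for _ in range(num_sweeps):
        for c in range(n):  # visit each location in turn
            left = actions[(c - 1) % n]
            right = actions[(c + 1) % n]
            # Illustrative features: bias, own state, neighbour actions.
            f = np.array([1.0, states[c], left, right])
            p_act = 1.0 / (1.0 + np.exp(-theta @ f))  # P(action=1 | rest)
            actions[c] = rng.random() < p_act
    return actions
```

In practice convergence would need to be assessed (burn-in, mixing diagnostics) rather than fixing `num_sweeps` in advance.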

Much research in Reinforcement Learning has been devoted to value function abstractions that increase the size of problems that can be solved.

Abstract policy functions are simply another step in this direction and are particularly useful for direct policy search methods such as policy gradient planning. Rather than learning an abstraction of the values of different states, you represent the policy directly and update its parameters in response to experiences during simulation.
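The parameter-update step of such a direct policy search can be sketched with a REINFORCE-style score-function estimator. The Bernoulli-logistic policy form, the episode representation, and the function name are assumptions for illustration, not the specific method used in the thesis:

```python
import numpy as np

def reinforce_update(theta, episodes, learning_rate=0.01):
    """One REINFORCE-style update for a direct policy-search method.

    `episodes` is a list of (features, action, total_return) triples
    gathered during simulation. For an assumed logistic Bernoulli
    policy, the gradient of log pi(a|s) is (a - p) * features, which we
    weight by the observed return and average over episodes.
    """
    grad = np.zeros_like(theta)
    for features, action, ret in episodes:
        p = 1.0 / (1.0 + np.exp(-theta @ features))  # P(action = 1)
        grad += (action - p) * features * ret        # score * return
    return theta + learning_rate * grad / max(len(episodes), 1)
```

The appeal of this style of update is exactly what the paragraph above notes: no value function is estimated at all, only the policy parameters themselves.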