Deep Q-learning with experience replay.

A variant of Q-learning for function approximation, proposed in the paper
"Human-level control through deep reinforcement learning".

In reinforcement learning, it is known that function approximation with a non-linear model (ex. neuralnet) can be unstable and lead to poor learning results.
To address this problem, deep Q-learning combines two key ideas with QLearning:

  • Use experience replay to reduce correlations in the sequence of learning data (see the sketch after this list).
  • Separate the target network from the behavior network to stabilize the source of the learning targets.
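
As a concrete illustration of the experience replay idea, a minimal replay memory
could look like the sketch below (the ReplayMemory class is illustrative only, not
part of this library):

import random
from collections import deque

# illustrative replay memory, not the library's implementation
class ReplayMemory:

    def __init__(self, capacity):
        # the oldest experience is discarded once capacity N is exceeded
        self.experiences = deque(maxlen=capacity)

    def append(self, state, action, reward, next_state):
        self.experiences.append((state, action, reward, next_state))

    def sample_minibatch(self, minibatch_size):
        # uniform random sampling breaks the correlation between
        # consecutive experiences generated within an episode
        return random.sample(list(self.experiences), minibatch_size)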

Algorithm

    Parameter
        g <- gamma, the discount factor of QLearning
        N <- capacity of the replay memory
        C <- interval (in steps) to sync Q' (target network) with Q
        minibatch_size <- size of the minibatch used to train Q
        replay_start_size <- initial size of the replay memory. Fill D
            with this number of experiences created by a random policy.
            This procedure is done in the setup phase.

    Initialize
        T  <- your RL task
        PI <- policy used to select actions during the episode
        Q  <- approximate action value function (ex. neural network)
        Q' <- target network, initialized as a deep copy of Q
        D  <- replay memory, filled with replay_start_size experiences created by random simulation
              (experience = tuple of state, action, reward and next_state)

    Repeat until computational budget runs out:
        S <- generate initial state of task T
        A <- choose action at S by following PI
        Repeat until S is terminal state:
            S' <- next state of S after taking action A
            R <- reward gained by taking action A at state S
            A' <- next action at S' by following policy PI
            append experience (S,A,R,S') to D
            MB <- sample minibatch_size of experiences from D
            BT <- transform minibatch of experiences into backup targets
            (BT = [r + g * Q'(s', GA) for s,a,r,s' in MB], GA=greedy action at s',
             and the target is just r when s' is a terminal state)
            Update Q by using BT (minibatch of backup targets)
            Every C steps: Q' <- Q (ex. deepcopy weights of Q to Q')
            S, A <- S', A'
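
For reference, the BT step above could be sketched in Python as follows. The
predict and possible_actions arguments are hypothetical stand-ins for your model's
prediction function and the task's action set; the resulting (state, action,
target_value) tuples match the backup_minibatch format used further below.

# hypothetical sketch of the BT step, not part of the library API
def build_backup_targets(minibatch, q_hat_network, gamma, possible_actions, predict):
    backup_targets = []
    for state, action, reward, next_state in minibatch:
        if next_state is None:
            # assuming terminal transitions store None as next_state: no bootstrap term
            target_value = reward
        else:
            # evaluate the greedy action at next_state with the target network Q'
            greedy_value = max(predict(q_hat_network, next_state, next_action)
                               for next_action in possible_actions(next_state))
            target_value = reward + gamma * greedy_value
        backup_targets.append((state, action, target_value))
    return backup_targets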

Value function

The DeepQLearning method provides only the approximation type of value function.

DeepQLearningApproxActionValueFunction

DeepQLearningApproxActionValueFunction has 6 abstract methods. You wrap your prediction model (ex. neuralnet) by implementing these methods.

  • initialize_network : initialize your prediction model here
  • deepcopy_network : define how to create a deep copy of your prediction model
  • predict_value_by_network : predict the value of a state-action pair with your prediction model
  • backup_on_minibatch : train your prediction model on the passed learning minibatch
  • save_networks : save your prediction model as you like (ex. save the weights of the neuralnet)
  • load_networks : load your prediction model from the resource created by save_networks

An implementation with some neuralnet library would look like this.

class MyApproxActionValueFunction(DeepQLearningApproxActionValueFunction):

    # the model returned here is used as "q_network" (Q in the algorithm above)
    def initialize_network(self):
        model = build_neuralnet()
        return model

    # the model returned here is used as "q_hat_network" (Q' in the algorithm above)
    def deepcopy_network(self, q_network):
        original_weight = q_network.get_weights()
        deepcopy_network = self.initialize_network()
        deepcopy_network.set_weights(original_weight)
        return deepcopy_network

    # return the predicted value of the passed state-action pair.
    # the passed network will be either "q_network" or "q_hat_network".
    def predict_value_by_network(self, network, state, action):
        features = build_features(state, action)
        prediction = network.predict(features)
        return prediction

    # train the passed q_network on backup_minibatch.
    # you need to transform backup_minibatch into input-output pairs,
    # as in a supervised learning setup.
    def backup_on_minibatch(self, q_network, backup_minibatch):
        # backup_minibatch is array of (state, action, target_value).
        X = [build_features(state, action) for state, action, _target in backup_minibatch]
        y = [target for _state, _action, target in backup_minibatch]
        q_network.train_on_minibatch(X, y)

    # save the two passed networks into the passed directory
    def save_networks(self, q_network, q_hat_network, save_dir_path):
        q_network.save_weights("%s/q_weight.h5" % save_dir_path)
        q_hat_network.save_weights("%s/q_hat_weight.h5" % save_dir_path)

    # load "q_network" and "q_hat_network" from passed directory and return them in
    # "q_network", "q_hat_network" order.
    def load_networks(self, load_dir_path):
        q_network = self.initialize_network()
        q_network.load_weights("%s/q_weight.h5" % load_dir_path)
        q_hat_network = self.initialize_network()
        q_hat_network.load_weights("%s/q_hat_weight.h5" % load_dir_path)
        return q_network, q_hat_network
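
The helpers build_neuralnet and build_features used above are placeholders. Assuming
a small task with a fixed-length state vector and a discrete action set, and using
the Keras API purely as an example, they could be filled in roughly like this (with
Keras, predict_value_by_network would call network.predict and backup_on_minibatch
would call network.train_on_batch):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

NB_STATE_FEATURES = 4   # hypothetical length of the state vector
NB_ACTIONS = 2          # hypothetical number of discrete actions

def build_neuralnet():
    # small MLP that maps (state features + one-hot action) to a scalar Q value
    model = Sequential()
    model.add(Dense(32, activation="relu", input_dim=NB_STATE_FEATURES + NB_ACTIONS))
    model.add(Dense(1, activation="linear"))
    model.compile(loss="mse", optimizer="adam")
    return model

def build_features(state, action):
    # concatenate the state vector with a one-hot encoding of the action
    action_one_hot = np.zeros(NB_ACTIONS)
    action_one_hot[action] = 1
    return np.concatenate([np.asarray(state, dtype="float32"), action_one_hot])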

Sample code to start learning

TEST_LENGTH = 5000000
task = MyTask()
policy = EpsilonGreedyPolicy(eps=1.0)
policy.set_eps_annealing(initial_eps=1.0, final_eps=0.1, anneal_duration=1000000)
value_func = MyApproxActionValueFunction()
algorithm = DeepQLearning(gamma=0.99, N=100000, C=1000, minibatch_size=32, replay_start_size=50000)
algorithm.setup(task, policy, value_func)
algorithm.run_gpi(TEST_LENGTH)