Deep Q-learning with experience replay
A variant of Q-learning for function approximation, proposed in the paper
"Human-level control through deep reinforcement learning".
In reinforcement learning, it is known that function approximation with a non-linear model (e.g. a neural network) can be unstable and lead to poor learning results.
To address this problem, deep Q-learning combines two key ideas with Q-learning:
- Use experience replay to reduce correlations in the sequence of learning data (a minimal sketch of a replay buffer follows this list).
- Separate the target network from the behavior network to stabilize the source of learning targets.
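To make the experience replay idea concrete, here is a minimal sketch of a replay buffer, assuming a fixed capacity and uniform random sampling. The `ReplayBuffer` class and its method names are illustrative, not part of this library:

```python
import random
from collections import deque

# Minimal replay memory sketch (illustrative, not part of this library).
class ReplayBuffer:

    def __init__(self, capacity):
        # a deque with maxlen drops the oldest experience once capacity is reached
        self.memory = deque(maxlen=capacity)

    def append(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, minibatch_size):
        # uniform random sampling breaks the correlation between
        # consecutive experiences in an episode
        return random.sample(self.memory, minibatch_size)
```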
Algorithm
Parameters
```
g                 <- gamma, the discount factor of Q-learning
N                 <- capacity of the replay memory
C                 <- interval at which to sync Q' (the target network) with Q
minibatch_size    <- size of the minibatch used to train Q
replay_start_size <- initial size of the replay memory. D is filled with this
                     number of experiences created by a random policy.
                     This procedure is done in the setup phase.
```
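These parameters correspond to the constructor arguments of DeepQLearning shown in the sample code at the end of this page.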
Initialize
```
T  <- your RL task
PI <- policy used to select actions during an episode
Q  <- approximate action value function (e.g. a neural network)
Q' <- target network, initialized as a deep copy of Q
D  <- replay memory, filled with replay_start_size experiences created by
      random simulation (an experience is a tuple of state, action, reward,
      and next_state)
```
```
Repeat until the computational budget runs out:
    S <- generate the initial state of task T
    A <- choose an action at S by following PI
    Repeat until S is a terminal state:
        S' <- next state of S after taking action A
        R  <- reward gained by taking action A at state S
        A' <- next action at S' chosen by following policy PI
        append experience (S, A, R, S') to D
        MB <- sample minibatch_size experiences from D
        BT <- transform the minibatch of experiences into backup targets
              (BT = [r + g * Q'(s', GA) for (s, a, r, s') in MB],
               where GA is the greedy action at s')
        update Q by using BT (the minibatch of backup targets)
        every C steps: Q' <- Q (e.g. deepcopy the weights of Q into Q')
        S, A <- S', A'
```
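The backup-target step is the heart of the update, so here is a sketch of how BT might be computed. The `q_hat.predict(state, action)` call and the finite action set `actions` are hypothetical helpers standing in for whatever your model provides; note that for a terminal next state, which the pseudocode above omits, the target would be just r:

```python
# Sketch of the backup-target computation (hypothetical helpers:
# q_hat.predict(state, action) and a finite action set `actions`).
def build_backup_targets(minibatch, q_hat, gamma, actions):
    targets = []
    for state, action, reward, next_state in minibatch:
        # greedy value at next_state according to the target network Q'
        best_value = max(q_hat.predict(next_state, a) for a in actions)
        targets.append((state, action, reward + gamma * best_value))
    return targets
```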
Value function
The DeepQLearning method provides only the approximation type of value function.
DeepQLearningApproxActionValueFunction
DeepQLearningApproxActionValueFunction has six abstract methods. You wrap your prediction model (e.g. a neural network) in these methods:
- initialize_network: initialize your prediction model here
- deepcopy_network: define how to create a deep copy of your prediction model
- predict_value_by_network: predict the value of a state-action pair with your prediction model
- backup_on_minibatch: train your prediction model with the passed learning minibatch
- save_networks: save your prediction model as you like (e.g. save the weights of the neural network)
- load_networks: load your prediction model from the resource created by save_networks
An implementation using some neural network library would look like this:
```python
class MyApproxActionValueFunction(DeepQLearningApproxActionValueFunction):

    # the model returned here is used as "q_network" (Q in the algorithm above)
    def initialize_network(self):
        model = build_neuralnet()
        return model

    # the model returned here is used as "q_hat_network" (Q' in the algorithm above)
    def deepcopy_network(self, q_network):
        original_weight = q_network.get_weights()
        deepcopy_network = self.initialize_network()
        deepcopy_network.set_weights(original_weight)
        return deepcopy_network

    # return the predicted value of the passed state-action pair.
    # the passed network is either "q_network" or "q_hat_network".
    def predict_value_by_network(self, network, state, action):
        features = build_features(state, action)
        prediction = network.predict(features)
        return prediction

    # train the passed q_network with backup_minibatch.
    # you would need to transform backup_minibatch into input-output pairs,
    # as in the supervised learning format.
    def backup_on_minibatch(self, q_network, backup_minibatch):
        # backup_minibatch is an array of (state, action, target_value)
        X = [build_features(state, action) for state, action, _target in backup_minibatch]
        y = [target for _state, _action, target in backup_minibatch]
        q_network.train_on_minibatch(X, y)

    # save the passed two networks in the passed directory
    def save_networks(self, q_network, q_hat_network, save_dir_path):
        q_network.save_weights("%s/q_weight.h5" % save_dir_path)
        q_hat_network.save_weights("%s/q_hat_weight.h5" % save_dir_path)

    # load "q_network" and "q_hat_network" from the passed directory and
    # return them in "q_network", "q_hat_network" order.
    def load_networks(self, load_dir_path):
        q_network = self.initialize_network()
        q_network.load_weights("%s/q_weight.h5" % load_dir_path)
        q_hat_network = self.initialize_network()
        q_hat_network.load_weights("%s/q_hat_weight.h5" % load_dir_path)
        return q_network, q_hat_network
```
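Note that predict_value_by_network receives either network: as in the algorithm above, actions during an episode are evaluated with Q, while the backup targets are computed from Q'.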
Sample code to start learning
```python
TEST_LENGTH = 5000000

task = MyTask()
# start fully exploratory (eps=1.0) and anneal epsilon from 1.0
# down to 0.1 over the first 1000000 steps
policy = EpsilonGreedyPolicy(eps=1.0)
policy.set_eps_annealing(initial_eps=1.0, final_eps=0.1, anneal_duration=1000000)
value_func = MyApproxActionValueFunction()
algorithm = DeepQLearning(gamma=0.99, N=100000, C=1000, minibatch_size=32, replay_start_size=50000)
algorithm.setup(task, policy, value_func)
algorithm.run_gpi(TEST_LENGTH)
```