Tabular Reinforcement Learning Problem
In this tutorial we will solve a so-called tabular reinforcement learning problem.
The keyword tabular means that the state-action space of the problem is small enough to fit in an array or table.
Most reinforcement learning methods have good convergence properties on tabular problems,
so this is the type of problem we approach first.
Blocking maze problem
We will find the shortest path through a blocking maze.
A blocking maze changes its structure during training, as shown below.
Before             After
--------G          --------G
---------          ---------
---------   =>     ---------
XXXXXXXX-          -XXXXXXXX
---------          ---------
---S-----          ---S-----
# 10 steps         # 16 steps  <= minimum steps to the goal
The best path in the first structure is blocked by the transformation, so the agent needs to notice the change and re-learn the shortest path.
Before the transformation the shortest route is 10 steps (5 right, then 5 up); after it, the only opening is on the left, so the shortest route becomes 16 steps (3 left, 5 up, then 8 right).
Implementation
Define blocking maze task
First we define our blocking maze problem as a reinforcement learning task.
What we need to do is implement the 5 abstract methods of the BaseTask class.
Our BlockingMazeTask would look like this.
from kyoka.task import BaseTask

class BlockingMazeTask(BaseTask):

    UP, DOWN, RIGHT, LEFT = 0, 1, 2, 3
    START, GOAL = (5, 3), (0, 8)

    BEFORE_MAZE = [
        ['-','-','-','-','-','-','-','-','G'],
        ['-','-','-','-','-','-','-','-','-'],
        ['-','-','-','-','-','-','-','-','-'],
        ['X','X','X','X','X','X','X','X','-'],
        ['-','-','-','-','-','-','-','-','-'],
        ['-','-','-','S','-','-','-','-','-']
    ]

    AFTER_MAZE = [
        ['-','-','-','-','-','-','-','-','G'],
        ['-','-','-','-','-','-','-','-','-'],
        ['-','-','-','-','-','-','-','-','-'],
        ['-','X','X','X','X','X','X','X','X'],
        ['-','-','-','-','-','-','-','-','-'],
        ['-','-','-','S','-','-','-','-','-']
    ]

    def __init__(self):
        self.maze = self.BEFORE_MAZE

    def generate_initial_state(self):
        return self.START

    def is_terminal_state(self, state):
        return self.GOAL == state

    def transit_state(self, state, action):
        row, col = state
        height, width = 6, 9
        if action == self.UP:
            row = max(0, row-1)
        elif action == self.DOWN:
            row = min(height-1, row+1)
        elif action == self.RIGHT:
            col = min(width-1, col+1)
        elif action == self.LEFT:
            col = max(0, col-1)
        # stay in place when the destination cell is a wall
        if 'X' != self.maze[row][col]:
            return (row, col)
        else:
            return state

    def generate_possible_actions(self, state):
        return [self.UP, self.DOWN, self.RIGHT, self.LEFT]

    def calculate_reward(self, state):
        # reward of 1 only when the agent is on the goal cell
        row, col = state
        return 1 if 'G' == self.maze[row][col] else 0
This code does not implement the transformation feature yet.
We will add the transformation feature later using the kyoka.callback module.
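Before moving on, here is a quick optional sanity check (not part of the tutorial code) that confirms walls block movement and only the goal cell yields a reward:
task = BlockingMazeTask()

state = task.generate_initial_state()                 # (5, 3), the 'S' cell
assert task.transit_state(state, task.UP) == (4, 3)   # row 4 is open, so the agent moves up
assert task.transit_state((4, 3), task.UP) == (4, 3)  # row 3 is a wall here, so the agent stays put

assert task.calculate_reward(task.GOAL) == 1          # reward only at the goal
assert task.calculate_reward(state) == 0
assert task.is_terminal_state(task.GOAL)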
Set up algorithm for blocking maze problem
Next we create the value function for our task.
A value function receives a state-action pair and estimates how good it is for the agent to take that action in that state.
Before starting the implementation, let's think about the size of the state-action space of our blocking maze problem.
In our BlockingMazeTask, state and action are defined like this:
- state = the position of the agent in the maze
- action = the direction to move (UP, DOWN, RIGHT or LEFT)
The number of possible states is 6*9=54 (our maze is 6x9) and the agent can move in 4 directions from every state.
So the size of the state-action space is 54*4=216.
This means our learning problem is small enough to fit in an array or table,
so we can say that the blocking maze problem is a tabular reinforcement learning problem.
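If you want to double-check that arithmetic in code, a throwaway snippet like the following (the names are illustrative, not part of kyoka) reproduces it:
HEIGHT, WIDTH, NUM_ACTIONS = 6, 9, 4       # maze shape and number of actions

num_states = HEIGHT * WIDTH                # 54 grid cells the agent can occupy
num_state_action_pairs = num_states * NUM_ACTIONS  # 216 entries in the table

print(num_states, num_state_action_pairs)  # -> 54 216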
OK, let's resume the implementation.
Here we use the Sarsa method to find the shortest path: roughly, Sarsa nudges each state-action value toward the reward plus the value of the next state-action pair the agent actually takes, as sketched below.
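For intuition only, this is the textbook Sarsa update written out in plain Python; it is a sketch, not kyoka's internal code, and the learning rate alpha and discount gamma below are illustrative defaults:
# Textbook Sarsa update, shown only for intuition.
# Q is a dict mapping (state, action) -> estimated value.
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    td_target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])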
What we need to do is implement the abstract methods of SarsaTabularActionValueFunction, like this.
# SarsaTabularActionValueFunction is assumed to live next to Sarsa in kyoka.algorithm.sarsa
from kyoka.algorithm.sarsa import SarsaTabularActionValueFunction

class MazeTabularValueFunction(SarsaTabularActionValueFunction):

    def generate_initial_table(self):
        # table[row][col][action] holds the estimated value of taking "action" at (row, col)
        height, width, action_num = 6, 9, 4
        return [[[0 for _ in range(action_num)] for _ in range(width)] for _ in range(height)]

    def fetch_value_from_table(self, table, state, action):
        row, col = state
        return table[row][col][action]

    def insert_value_into_table(self, table, state, action, new_value):
        row, col = state
        table[row][col][action] = new_value
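As a small illustrative check (not part of the tutorial flow; kyoka normally manages the table itself), we can exercise the three methods we just wrote directly:
vf = MazeTabularValueFunction()
table = vf.generate_initial_table()            # 6 x 9 x 4 nested list, all zeros

state, action = (5, 3), BlockingMazeTask.RIGHT
assert vf.fetch_value_from_table(table, state, action) == 0

vf.insert_value_into_table(table, state, action, 0.5)
assert vf.fetch_value_from_table(table, state, action) == 0.5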
Implement maze transformation feature
We implement the maze transformation feature using the kyoka.callback module.
This module provides callback methods that let us interact with our task and value function during training.
The base class of all callbacks is kyoka.callback.BaseCallback, which has these 4 callback methods (a minimal skeleton is shown just after the list):
- before_gpi_start(task, value_function)
- before_update(iteration_count, task, value_function)
- after_update(iteration_count, task, value_function)
- after_gpi_finish(task, value_function)
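To make the hook order concrete, here is a minimal, purely illustrative skeleton (the class name VerboseCallback and its messages are made up for this sketch; the real example comes next):
from kyoka.callback import BaseCallback

class VerboseCallback(BaseCallback):
    """Do-nothing callback that only logs when each hook fires."""

    def before_gpi_start(self, task, value_function):
        self.log("training is about to start")

    def before_update(self, iteration_count, task, value_function):
        self.log("starting iteration %d" % iteration_count)

    def after_update(self, iteration_count, task, value_function):
        self.log("finished iteration %d" % iteration_count)

    def after_gpi_finish(self, task, value_function):
        self.log("training finished")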
Here we create MazeTransformationCallback, which switches the shape of the maze after 50 iterations of training.
The code looks like this.
from kyoka.callback import BaseCallback

class MazeTransformationCallback(BaseCallback):

    def after_update(self, iteration_count, task, value_function):
        if 50 == iteration_count:
            task.maze = BlockingMazeTask.AFTER_MAZE
            # use "self.log(message)" instead of "print" inside callbacks
            self.log("Maze transformed after %d iteration." % iteration_count)
Watch performance of agent during training
Before starting training, we implement one more important callback, MazePerformanceWatcher.
This callback logs the performance of the agent at every iteration of training.
(In our case, performance means how many steps the agent takes to reach the goal.)
Given a task and a value_function, we can measure the agent's performance like this.
from kyoka.policy import choose_best_action

MAX_STEP = 10000

def solve_maze(task, value_function):
    step_count = 0
    state = task.generate_initial_state()
    while not task.is_terminal_state(state):
        action = choose_best_action(task, value_function, state)
        state = task.transit_state(state, action)
        step_count += 1
        if step_count >= MAX_STEP:  # the agent may never reach the goal
            break
    return step_count
We create MazePerformanceWatcher by subclassing the kyoka.callback.BasePerformanceWatcher callback.
The code would look like this.
from kyoka.callback import BasePerformanceWatcher

class MazePerformanceWatcher(BasePerformanceWatcher):

    def define_performance_test_interval(self):
        return 1  # "run_performance_test" is called after every "after_update" callback

    # Run the performance test and return its result.
    # The result is passed to "define_log_message" as the "test_result" argument.
    def run_performance_test(self, task, value_function):
        step_to_goal = solve_maze(task, value_function)
        return step_to_goal

    # The message returned here is logged every time "run_performance_test" is called.
    def define_log_message(self, iteration_count, task, value_function, test_result):
        step_to_goal = test_result
        return "Step = %d (nb_iteration=%d)" % (step_to_goal, iteration_count)
Solve blocking maze
OK, we have prepared everything needed for training. Let's start!
from kyoka.algorithm.sarsa import Sarsa
from kyoka.policy import EpsilonGreedyPolicy
task = BlockingMazeTask()
policy = EpsilonGreedyPolicy(eps=0.1)
value_function = MazeTabularValueFunction()
algorithm = Sarsa()
algorithm.setup(task, policy, value_function)
callbacks = [MazeTransformationCallback(), MazePerformanceWatcher()]
algorithm.run_gpi(nb_iteration=100, callbacks=callbacks)
The training log will be printed to the console like this:
[Progress] Start GPI iteration for 100 times
[Progress] Finished 1 / 100 iterations (0.0s)
[MazePerformanceWatcher] Step = 223 (nb_iteration=1)
[Progress] Finished 2 / 100 iterations (0.0s)
[MazePerformanceWatcher] Step = 507 (nb_iteration=2)
[Progress] Finished 3 / 100 iterations (0.0s)
[MazePerformanceWatcher] Step = 220 (nb_iteration=3)
[Progress] Finished 4 / 100 iterations (0.0s)
[MazePerformanceWatcher] Step = 142 (nb_iteration=4)
[Progress] Finished 5 / 100 iterations (0.0s)
[MazePerformanceWatcher] Step = 572 (nb_iteration=5)
[Progress] Finished 6 / 100 iterations (0.0s)
[MazePerformanceWatcher] Step = 781 (nb_iteration=6)
[Progress] Finished 7 / 100 iterations (0.0s)
[MazePerformanceWatcher] Step = 18 (nb_iteration=7)
...
[Progress] Finished 50 / 100 iterations (0.0s)
[MazePerformanceWatcher] Step = 10 (nb_iteration=50)
[MazeTransformationCallback] Maze transformed after 50 iteration
[Progress] Finished 51 / 100 iterations (1.6s)
[MazePerformanceWatcher] Step = 10000 (nb_iteration=51)
[Progress] Finished 52 / 100 iterations (1.4s)
[MazePerformanceWatcher] Step = 10000 (nb_iteration=52)
...
[Progress] Finished 99 / 100 iterations (0.0s)
[MazePerformanceWatcher] Step = 16 (nb_iteration=99)
[Progress] Finished 100 / 100 iterations (0.0s)
[MazePerformanceWatcher] Step = 16 (nb_iteration=100)
[Progress] Completed GPI iteration for 100 times. (total time: 7s)
Great!! The agent found the shortest path through the blocking maze both before and after the transformation.
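As a final optional check, you can reuse the solve_maze helper defined earlier on the trained value function; given the log above, it should report the 16-step route of the transformed maze:
# The maze is already in its AFTER_MAZE shape at this point,
# so the greedy walk should take 16 steps.
steps = solve_maze(task, value_function)
print("Greedy path length after training: %d steps" % steps)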