A compact, from-scratch implementation of Q-learning where an agent learns to reach a goal in a 2D grid with randomly generated obstacles. Built as the hands-on companion to the article "Learning the hard way; Reinforcement Learning Saga — Part I: From zero to Q‑learning".
The visualization shows (clockwise from top‑left): current episode trajectory,
current greedy policy, heatmap of max Q‑values per state, and moving average of returns.
No Gym/Gymnasium or Stable‑Baselines3 is used; the environment, agent, trainer, and visualizer are custom. Dependencies are minimal (NumPy, PyGame, PyYAML).
- Q-learning agent with epsilon-greedy exploration and epsilon decay (see the sketch after this list)
- Obstacles are generated with a configurable per-cell probability; a valid path from start to goal is always guaranteed
- Configurable rewards and penalties for each step, for reaching the goal, and for making an invalid move
- Episode tracking with return history and moving average (a simple way to compute it is sketched below)
- Real-time PyGame UI with keyboard controls:
  `SPACE` to pause/resume, `UP` to increase simulation speed, `DOWN` to decrease simulation speed
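
The epsilon-greedy selection with decay from the feature list might look roughly like this; a minimal sketch, assuming a Q-table indexable by state, with names such as `choose_action` being illustrative rather than the repository's actual API:

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(q_table, state, epsilon, actions):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.choice(actions))        # explore
    return int(np.argmax(q_table[state]))      # exploit

# After each episode, epsilon decays toward its floor:
# epsilon = max(epsilon_min, epsilon * epsilon_decay)
```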
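
Similarly, the moving average of returns (window size `reward_avg_window` in the config) can be computed in many ways; one simple option, not necessarily the one used in this repository, is:

```python
import numpy as np

def moving_average(returns, window=100):
    """Moving average of episode returns (running mean until the window fills)."""
    returns = np.asarray(returns, dtype=float)
    if len(returns) < window:
        return returns.cumsum() / (np.arange(len(returns)) + 1)
    kernel = np.ones(window) / window
    return np.convolve(returns, kernel, mode="valid")
```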
- Clone the repository
- Install the requirements: `pip install -r requirements.txt`
- Run the script: `python main.py`
- A window will open showing the four-panel dashboard. Training starts immediately. Enjoy!
You can test the effect of the hyperparameters simply by changing them in the `config.yml` file. The configuration used for the article's illustrations is:
```yaml
environment:
  grid_size: 20
  obstacle_probability: 0.3
  actions: [0, 1, 2, 3]
  action_to_delta: {
    0: [0, -1],   # Up
    1: [1, 0],    # Right
    2: [0, 1],    # Down
    3: [-1, 0]    # Left
  }
  step_reward: -5.0
  goal_reward: 500.0
  invalid_move_penalty: -10.0

agent:
  alpha: 0.2
  gamma: 0.995
  epsilon: 1.0
  epsilon_min: 0.002
  epsilon_decay: 0.995

trainer:
  max_steps_per_episode: 200
  reward_avg_window: 100
```

It's safe to change most settings; just keep `actions` and the `action_to_delta` mapping coherent.
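
For reference, here is a minimal sketch of how such a file might be loaded with PyYAML and how the reward values could drive an environment step. The function names and the invalid-move handling (the agent staying in place) are assumptions, not the repository's exact code:

```python
import yaml

with open("config.yml") as f:
    cfg = yaml.safe_load(f)

env_cfg = cfg["environment"]
grid_size = env_cfg["grid_size"]              # 20 with the values above
action_to_delta = env_cfg["action_to_delta"]  # {0: [0, -1], 1: [1, 0], ...}

def step(pos, action, goal, obstacles):
    """One environment transition: returns (next_pos, reward, done). Illustrative only."""
    dx, dy = action_to_delta[action]
    x, y = pos[0] + dx, pos[1] + dy
    if not (0 <= x < grid_size and 0 <= y < grid_size) or obstacles[y][x]:
        return pos, env_cfg["invalid_move_penalty"], False   # assumed: agent stays put
    if (x, y) == goal:
        return (x, y), env_cfg["goal_reward"], True
    return (x, y), env_cfg["step_reward"], False
```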
- The grid is generated with random obstacles; a BFS check ensures a valid path from start to goal exists (see the sketch after this list)
- The Q-table is initialized to zeros
- The training loop starts
- The agent starts from a fixed start cell and aims for the goal cell
- Actions are chosen epsilon‑greedily from the Q-table
- After each step, Q is updated using the TD target with the max over next-state actions (see the update sketch after this list)
- Episodes terminate on reaching the goal or hitting a step limit
- The visualization is updated at each step to reflect learning progress
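
The BFS reachability check mentioned in the first item could look roughly like this, assuming a square boolean obstacle grid and 4-connected moves (a sketch, not the repository's exact code):

```python
from collections import deque

def path_exists(obstacles, start, goal):
    """BFS over free cells; obstacles[y][x] is True when the cell is blocked."""
    n = len(obstacles)
    queue, seen = deque([start]), {start}
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            return True
        for dx, dy in ((0, -1), (1, 0), (0, 1), (-1, 0)):   # up, right, down, left
            nx, ny = x + dx, y + dy
            if 0 <= nx < n and 0 <= ny < n and not obstacles[ny][nx] and (nx, ny) not in seen:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return False
```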
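
And the per-step update follows the standard tabular Q-learning rule; a sketch with illustrative names (the real code may organize this inside an agent class):

```python
def td_update(q_table, s, a, reward, s_next, done, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = 0.0 if done else max(q_table[s_next])   # zero future value at terminal states
    td_target = reward + gamma * best_next
    q_table[s][a] += alpha * (td_target - q_table[s][a])
```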
For more details, refer to the article.