Q-Learning GridWorld

A compact, from‑scratch implementation of Q‑learning where an agent learns to reach a goal cell in a 2D grid with randomly generated obstacles. Built as the hands‑on companion to the article "Learning the hard way; Reinforcement Learning Saga — Part I: From zero to Q‑learning".

The visualization shows (clockwise from top‑left): current episode trajectory,
current greedy policy, heatmap of max Q‑values per state, and moving average of returns.


No Gym/Gymnasium or Stable‑Baselines3 is used; the environment, agent, trainer, and visualizer are custom. Dependencies are minimal (NumPy, PyGame, PyYAML).

Highlights

  • Q-learning agent with epsilon-greedy exploration and epsilon decay (a minimal sketch follows this list)
  • Obstacles are placed randomly with a configurable per-cell probability; a valid path from start to goal is always guaranteed
  • Shaped rewards: a penalty for each step, a reward for reaching the goal, and a penalty for invalid moves
  • Episode tracking with return history and moving average
  • Real-time PyGame UI with keyboard controls: SPACE to pause/resume, UP to increase simulation speed, DOWN to decrease simulation speed
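
For illustration, a minimal sketch of epsilon-greedy action selection with epsilon decay. The function and variable names are illustrative assumptions, not necessarily those used in the repo; the epsilon values match config.yml:

import numpy as np

rng = np.random.default_rng()

def choose_action(q_table, state, epsilon, n_actions=4):
    # Explore with probability epsilon, otherwise act greedily w.r.t. the Q-table
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_table[state]))

def decay_epsilon(epsilon, epsilon_min=0.002, epsilon_decay=0.995):
    # Multiplicative decay, floored at epsilon_min (defaults match config.yml)
    return max(epsilon_min, epsilon * epsilon_decay)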

Quickstart

  1. Clone the repository
  2. Install the requirements: pip install -r requirements.txt
  3. Run the script: python main.py
  4. A window will open showing the four-panel dashboard. Training starts immediately. Enjoy!

Config

You can test the effect of the hyperparameters simply by changing them in the config.yml file. The configuration used for the article's illustrations is:

environment:
   grid_size: 20
   obstacle_probability: 0.3
   actions: [0, 1, 2, 3]
   action_to_delta: {
      0: [0, -1],   # Up
      1: [1, 0],    # Right
      2: [0, 1],    # Down
      3: [-1, 0]    # Left
   }
   step_reward: -5.0
   goal_reward: 500.0
   invalid_move_penalty: -10.0

agent:
   alpha: 0.2
   gamma: 0.995
   epsilon: 1.0
   epsilon_min: 0.002
   epsilon_decay: 0.995

trainer:
   max_steps_per_episode: 200
   reward_avg_window: 100

It's safe to change most settings; keep actions and their action_to_delta mapping coherent.
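
For reference, here is a minimal sketch of how such a config could be loaded with PyYAML; the actual loading code in the repo may differ:

import yaml

with open("config.yml") as f:
    config = yaml.safe_load(f)

env_cfg = config["environment"]
grid_size = env_cfg["grid_size"]              # 20
action_to_delta = env_cfg["action_to_delta"]  # {0: [0, -1], 1: [1, 0], ...}
alpha = config["agent"]["alpha"]              # 0.2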

How it works (brief)

  • The grid is generated with random obstacles; a BFS check ensures a valid path exists
  • The Q-table is initialized to zeros
  • The training loop starts
  • The agent starts from a fixed start cell and aims for the goal cell
  • Actions are chosen epsilon‑greedily from the Q-table
  • After each step, Q is updated using the TD target with the max over next‑state actions (see the sketch after this list)
  • Episodes terminate on reaching the goal or hitting a step limit
  • The visualization is updated at each step to reflect learning progress
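
The core of the training loop is the standard tabular Q-learning update. A minimal sketch with illustrative names (the Q-table is assumed here to be a NumPy array indexed by state and action; the alpha and gamma defaults match config.yml):

import numpy as np

def q_update(q_table, state, action, reward, next_state, done,
             alpha=0.2, gamma=0.995):
    # TD target: immediate reward plus discounted best next-state value
    best_next = 0.0 if done else float(np.max(q_table[next_state]))
    td_target = reward + gamma * best_next
    # Move Q(s, a) a fraction alpha toward the TD target
    q_table[state][action] += alpha * (td_target - q_table[state][action])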

For more details, refer to the article.
