A compact, from-scratch implementation of Q-learning where an agent learns to reach a goal in a 2D grid with randomly generated obstacles. Built as the hands-on companion to the article "Learning the hard way; Reinforcement Learning Saga — Part I: From zero to Q‑learning".
The visualization shows (clockwise from top‑left): current episode trajectory,
current greedy policy, heatmap of max Q‑values per state, and moving average of returns.
No Gym/Gymnasium or Stable‑Baselines3 is used; the environment, agent, trainer, and visualizer are custom. Dependencies are minimal (NumPy, PyGame, PyYAML).
- Q-learning agent with epsilon-greedy exploration and epsilon decay (see the sketch after this list)
- Obstacles are generated with a configurable per-cell probability; a valid path from start to goal is always guaranteed
- Configurable rewards and penalties for each step, for reaching the goal, and for making an invalid move
- Episode tracking with return history and moving average (a simple way to compute it is sketched below)
- Real-time PyGame UI with keyboard controls:
  `SPACE` to pause/resume, `UP` to increase simulation speed, `DOWN` to decrease simulation speed
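
The epsilon-greedy selection with decay from the feature list might look roughly like this; a minimal sketch, assuming a Q-table indexable by state, with names such as `choose_action` being illustrative rather than the repository's actual API:

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(q_table, state, epsilon, actions):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.choice(actions))        # explore
    return int(np.argmax(q_table[state]))      # exploit

# After each episode, epsilon decays toward its floor:
# epsilon = max(epsilon_min, epsilon * epsilon_decay)
```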
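
Similarly, the moving average of returns (window size `reward_avg_window` in the config) can be computed in many ways; one simple option, not necessarily the one used in this repository, is:

```python
import numpy as np

def moving_average(returns, window=100):
    """Moving average of episode returns (running mean until the window fills)."""
    returns = np.asarray(returns, dtype=float)
    if len(returns) < window:
        return returns.cumsum() / (np.arange(len(returns)) + 1)
    kernel = np.ones(window) / window
    return np.convolve(returns, kernel, mode="valid")
```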
- Clone the repository
- Install the requirements: `pip install -r requirements.txt`
- Run the script: `python main.py`
- A window will open showing the four-panel dashboard. Training starts immediately. Enjoy!
You can test the effect of the hyperparameters simply by changing them in the `config.yml` file. The configuration used for the article's illustrations is:
```yaml
environment:
  grid_size: 20
  obstacle_probability: 0.3
  actions: [0, 1, 2, 3]
  action_to_delta: {
    0: [0, -1],   # Up
    1: [1, 0],    # Right
    2: [0, 1],    # Down
    3: [-1, 0]    # Left
  }
  step_reward: -5.0
  goal_reward: 500.0
  invalid_move_penalty: -10.0

agent:
  alpha: 0.2
  gamma: 0.995
  epsilon: 1.0
  epsilon_min: 0.002
  epsilon_decay: 0.995

trainer:
  max_steps_per_episode: 200
  reward_avg_window: 100
```

It's safe to change most settings; just keep `actions` and the `action_to_delta` mapping coherent.
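
For reference, here is a minimal sketch of how such a file might be loaded with PyYAML and how the reward values could drive an environment step. The function names and the invalid-move handling (the agent staying in place) are assumptions, not the repository's exact code:

```python
import yaml

with open("config.yml") as f:
    cfg = yaml.safe_load(f)

env_cfg = cfg["environment"]
grid_size = env_cfg["grid_size"]              # 20 with the values above
action_to_delta = env_cfg["action_to_delta"]  # {0: [0, -1], 1: [1, 0], ...}

def step(pos, action, goal, obstacles):
    """One environment transition: returns (next_pos, reward, done). Illustrative only."""
    dx, dy = action_to_delta[action]
    x, y = pos[0] + dx, pos[1] + dy
    if not (0 <= x < grid_size and 0 <= y < grid_size) or obstacles[y][x]:
        return pos, env_cfg["invalid_move_penalty"], False   # assumed: agent stays put
    if (x, y) == goal:
        return (x, y), env_cfg["goal_reward"], True
    return (x, y), env_cfg["step_reward"], False
```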
- The grid is generated with random obstacles; a BFS check ensures a valid path from start to goal exists (see the sketch after this list)
- The Q-table is initialized to zeros
- The training loop starts
- The agent starts from a fixed start cell and aims for the goal cell
- Actions are chosen epsilon‑greedily from the Q-table
- After each step, Q is updated using the TD target with the max over next-state actions (see the update sketch after this list)
- Episodes terminate on reaching the goal or hitting a step limit
- The visualization is updated at each step to reflect learning progress
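
The BFS reachability check mentioned in the first item could look roughly like this, assuming a square boolean obstacle grid and 4-connected moves (a sketch, not the repository's exact code):

```python
from collections import deque

def path_exists(obstacles, start, goal):
    """BFS over free cells; obstacles[y][x] is True when the cell is blocked."""
    n = len(obstacles)
    queue, seen = deque([start]), {start}
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            return True
        for dx, dy in ((0, -1), (1, 0), (0, 1), (-1, 0)):   # up, right, down, left
            nx, ny = x + dx, y + dy
            if 0 <= nx < n and 0 <= ny < n and not obstacles[ny][nx] and (nx, ny) not in seen:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return False
```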
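
And the per-step update follows the standard tabular Q-learning rule; a sketch with illustrative names (the real code may organize this inside an agent class):

```python
def td_update(q_table, s, a, reward, s_next, done, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = 0.0 if done else max(q_table[s_next])   # zero future value at terminal states
    td_target = reward + gamma * best_next
    q_table[s][a] += alpha * (td_target - q_table[s][a])
```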
For more details, refer to the article.