Reinforcement Laboratory

An interactive educational dashboard demonstrating core reinforcement learning (RL) concepts. Watch agents explore, exploit, and optimize strategies in real time, right in your browser.

Q-Learning Pathfinding

The agent navigates a Grid World to find Gold (+100) while avoiding Fire (-100). It learns the "Quality" (Q-value) of every move and builds an optimal policy over thousands of episodes.
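The training loop described above can be sketched in a few lines of Python. The 4×4 layout, gold/fire positions, step cost, and hyperparameters below are illustrative assumptions, not the dashboard's actual configuration:

```python
import random

random.seed(0)  # reproducible runs for this sketch

SIZE = 4                                 # assumed 4x4 grid
GOLD, FIRE = (3, 3), (1, 2)              # assumed terminal cells: +100 and -100
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

alpha, gamma = 0.1, 0.9                  # learning rate, discount factor
Q = {((r, c), a): 0.0
     for r in range(SIZE) for c in range(SIZE) for a in range(4)}

def step(state, a):
    """Move one cell (clamped to the grid); return (next_state, reward, done)."""
    dr, dc = ACTIONS[a]
    nxt = (min(max(state[0] + dr, 0), SIZE - 1),
           min(max(state[1] + dc, 0), SIZE - 1))
    if nxt == GOLD:
        return nxt, 100.0, True
    if nxt == FIRE:
        return nxt, -100.0, True
    return nxt, -1.0, False              # small step cost favors short paths

for episode in range(2000):
    state, done, steps = (0, 0), False, 0
    eps = max(0.05, 1.0 - episode / 1000)   # explore rate decays over episodes
    while not done and steps < 100:
        if random.random() < eps:
            a = random.randrange(4)                          # explore
        else:
            a = max(range(4), key=lambda x: Q[(state, x)])   # exploit
        nxt, reward, done = step(state, a)
        best_next = 0.0 if done else max(Q[(nxt, x)] for x in range(4))
        # Bellman update: nudge Q(s, a) toward reward + discounted best next value
        Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])
        state = nxt
        steps += 1
```

After training, following the highest Q-value in each cell traces a path from the start to the gold.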

Environment Tools

Training Metrics

Episodes: 0
Explore Rate (ε): 100%
Current Reward: 0

What are Q-Values?

The colored triangles in the grid show the agent's learned value for moving in each direction: green means advantageous, red means dangerous. These values are updated via the Bellman equation.
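A single Bellman update, in isolation, looks like this. The learning rate, discount factor, and reward values are toy numbers chosen for illustration:

```python
alpha, gamma = 0.5, 0.9   # learning rate and discount factor (assumed values)

q_old = 0.0               # current estimate Q(s, a)
reward = -1.0             # immediate reward for taking the move
max_next = 10.0           # best Q-value reachable from the next state

# Move the old estimate part-way toward the bootstrapped target
q_new = q_old + alpha * (reward + gamma * max_next - q_old)
print(q_new)  # 4.0
```

Repeated over thousands of moves, these small corrections propagate the gold's +100 (and the fire's -100) backward through the grid.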

Exploration vs. Exploitation

Notice the Explore Rate (ε) dropping over time. Initially, the agent moves randomly to learn the map. Later, it exploits its learned Q-Table to take the optimal path.
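One common way to implement this is an ε-greedy policy with a decaying schedule. The multiplicative decay factor and 0.01 floor below are illustrative assumptions, not the dashboard's exact schedule:

```python
import random

def make_epsilon_schedule(start=1.0, decay=0.995, floor=0.01):
    """Yield an explore rate that decays toward a floor (assumed constants)."""
    eps = start
    while True:
        yield eps
        eps = max(floor, eps * decay)

def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps pick a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

schedule = make_epsilon_schedule()
eps_values = [next(schedule) for _ in range(1000)]
# Starts at 1.0 (fully random), bottoms out at the 0.01 floor
```

The floor keeps a trickle of exploration alive so the agent can still notice if the environment changes.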

Markov Decision Process

Grid cells represent states (S), movement directions represent actions (A), and landing on gold or fire yields rewards (R). The agent's goal is to maximize cumulative future reward.
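"Cumulative future reward" is usually made precise as the discounted return. A minimal sketch, with an assumed discount factor and toy reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... (gamma=0.9 is an assumed value)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Two -1 step costs, then the +100 gold: -1 + 0.9*(-1 + 0.9*100) ≈ 79.1
print(discounted_return([-1, -1, 100]))
```

The discount makes nearer rewards count more, which is why the agent prefers shorter paths to the gold.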

Multi-Armed Bandit

Five slot machines have hidden, unequal win probabilities. The agent must balance trying new machines to find the best payout (Exploration) with pulling the best known lever (Exploitation).
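The whole bandit can be simulated in a short loop. The hidden win probabilities below are illustrative assumptions; the agent never sees them, only the sampled wins and losses:

```python
import random

random.seed(1)  # reproducible runs for this sketch

TRUE_PROBS = [0.10, 0.25, 0.55, 0.40, 0.30]   # hidden; arm 2 is truly best

counts = [0] * 5      # pulls per machine
q = [0.0] * 5         # estimated win probability per machine

for _ in range(10_000):
    if random.random() < 0.1:                      # explore 10% of the time
        arm = random.randrange(5)
    else:                                          # otherwise exploit the best estimate
        arm = max(range(5), key=q.__getitem__)
    reward = 1.0 if random.random() < TRUE_PROBS[arm] else 0.0
    counts[arm] += 1
    q[arm] += (reward - q[arm]) / counts[arm]      # incremental sample average

best = max(range(5), key=q.__getitem__)            # agent's current favorite
```

After enough pulls the estimates settle near the hidden probabilities, and the greedy choice locks onto the best arm.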

Estimated Win Probability (Action Value)

The bar height represents the agent's internal Q-value (expected reward) for each machine. Watch it converge on the true optimal arm.
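Each bar is effectively a running average of that machine's observed payouts, which can be maintained incrementally without storing the history. The pull results below are toy numbers:

```python
# Incremental sample-average update: Q_new = Q_old + (reward - Q_old) / n
q, n = 0.0, 0
for reward in [1, 0, 1, 1, 0]:   # five pulls of one machine (toy data)
    n += 1
    q += (reward - q) / n
print(q)  # ≈ 0.6, the plain average of the five rewards
```

This is the same update the grid-world agent uses, just without a next state to bootstrap from.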