Interactive educational dashboard explaining core reinforcement learning (RL) concepts. Watch agents explore, exploit, and optimize their strategies in real time, right in your browser.
The agent navigates a Grid World to find Gold (+100) while avoiding Fire (-100). It learns the "Quality" (Q-value) of every move and builds an optimal policy over thousands of episodes.
The colored triangles in the grid show the agent's learned value for moving in each direction: green means advantageous, red means dangerous. These values are updated via the Bellman equation.
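The update rule behind those triangles can be sketched in a few lines of tabular Q-learning. This is a minimal illustration, not the dashboard's actual code; the hyperparameters `alpha` (learning rate) and `gamma` (discount factor) are assumed defaults, not values taken from the demo:

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Bellman-style Q-learning update: nudge Q[state][action] toward
    the observed reward plus the discounted best value of the next state."""
    best_next = max(Q[next_state].values())           # max over next actions
    td_target = reward + gamma * best_next            # Bellman target
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q
```

Each move the agent makes triggers one such update, which is why the triangles near the Gold turn green first and the values then propagate backward across the grid over many episodes.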
Notice the Explore Rate (ε) dropping over time. Initially, the agent moves randomly to learn the map. Later, it exploits its learned Q-Table to take the optimal path.
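The dropping explore rate corresponds to ε-greedy action selection with decay. A minimal sketch, assuming a multiplicative decay schedule and a floor value (the actual schedule in the demo may differ):

```python
import random

def choose_action(Q_row, epsilon):
    """epsilon-greedy: with probability epsilon pick a random action,
    otherwise pick the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.choice(list(Q_row))
    return max(Q_row, key=Q_row.get)

# Multiplicative decay with a floor: explore early, exploit later.
# decay and min_epsilon are illustrative values, not the demo's.
epsilon, decay, min_epsilon = 1.0, 0.995, 0.05
for episode in range(1000):
    epsilon = max(min_epsilon, epsilon * decay)
```

With these numbers ε starts fully random and settles at the 5% floor, so a late-stage agent still takes the occasional exploratory step.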
The grid represents states (S), directions represent actions (A), and hitting gold/fire represents rewards (R). The agent's goal is to maximize cumulative future rewards.
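"Cumulative future rewards" means the discounted return G = r₀ + γ·r₁ + γ²·r₂ + …, which a short helper makes concrete (the discount factor `gamma` here is an assumed value, not from the demo):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of a reward sequence, computed back-to-front:
    G = r0 + gamma*r1 + gamma^2*r2 + ..."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G
```

Because γ < 1, a reward two steps away is worth less than the same reward now, which is why the agent learns to take the shortest safe path to the Gold.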
Five slot machines have hidden, unequal win probabilities. The AI must balance trying new machines to find the best payout (Exploration) with pulling the best known lever (Exploitation).
The bar height represents the agent's internal Q-value (expected reward) for each machine. Watch it figure out which is the true optimal arm.
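The bandit behavior described above can be sketched with ε-greedy selection and an incremental sample-mean update, which is what keeps each bar (Q-value) tracking the machine's observed payout. The hidden win probabilities below are illustrative placeholders, not the demo's actual values:

```python
import random

def run_bandit(probs, steps=5000, epsilon=0.1, seed=0):
    """epsilon-greedy multi-armed bandit: Q[i] is the running mean
    payout of arm i (the bar height in the demo), N[i] its pull count."""
    rng = random.Random(seed)
    Q = [0.0] * len(probs)
    N = [0] * len(probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(probs))          # explore
        else:
            arm = max(range(len(probs)), key=lambda i: Q[i])  # exploit
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        N[arm] += 1
        Q[arm] += (reward - Q[arm]) / N[arm]         # incremental mean
    return Q

estimates = run_bandit([0.2, 0.5, 0.35, 0.1, 0.8])
```

After enough pulls, the tallest bar converges to the arm with the highest true win probability, while the ε fraction of random pulls keeps the other estimates from going stale.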