Intelligent Pirate Maze Agent
An autonomous AI agent that uses Deep Q-Learning to navigate an 8×8 maze environment. After training, it reaches the treasure from every valid starting position, a 100% success rate.
Tech Stack
Context
The Problem
Traditional pathfinding requires explicit programming of rules and strategies. This project explores whether an AI agent can learn optimal navigation autonomously through trial and error, without hardcoded instructions.
Constraints
- 8×8 grid maze environment with obstacles and a treasure goal
- Agent must learn from scratch without pre-programmed pathfinding logic
- Must achieve reliable performance across all 44 possible starting positions
- Training must converge within reasonable computational limits (~750 episodes)
Stakes
Academic project for CS-370 (Current Emerging Trends in CS) at Southern New Hampshire University, demonstrating understanding of reinforcement learning, neural networks, and AI fundamentals
My Role
Title
Machine Learning Engineer
Team
Academic Project (Individual)
Ownership
Complete implementation of Deep Q-Learning agent, neural network architecture, training pipeline, reward shaping, and performance analysis
Approach & Key Decisions
Implemented a Deep Q-Learning agent in TensorFlow/Keras. A neural network learns the pathfinding policy from transitions drawn out of an experience replay buffer, and an epsilon-greedy strategy balances exploration against exploitation.
Deep Q-Learning with neural network function approximation
Q-tables are impractical for large state spaces: they need one entry per state-action pair and share nothing between states. A neural network generalizes from states it has seen to ones it has not, which lets the agent handle any starting position.
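The value a Deep Q-Learning network is trained toward for a single transition can be sketched as below. This is a minimal illustration, not the project's code: the function name, the discount factor `gamma=0.95`, and the sample Q-values are all assumptions.

```python
def q_target(reward, next_q_values, done, gamma=0.95):
    """Bellman target for one transition.

    gamma=0.95 is an assumed discount factor, not a value from this project.
    """
    if done:
        # Terminal transition: there is no future value to bootstrap from.
        return reward
    # Otherwise add the discounted value of the best action in the next state.
    return reward + gamma * max(next_q_values)


# Hypothetical network outputs for the four actions in the next state.
target = q_target(reward=-0.04, next_q_values=[0.1, 0.5, -0.2, 0.3], done=False)
```

The network's prediction for the action actually taken is then regressed toward `target`, while the predictions for the other actions are left unchanged.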
Experience replay buffer for training stability
Storing and randomly sampling past experiences breaks correlation between consecutive training samples, leading to more stable learning and faster convergence.
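A replay buffer of this kind might look like the sketch below. The class name, the capacity of 1000, and the extra `done` flag (so terminal transitions can be recognized) are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity=1000):
        # capacity is an assumed value; the oldest transitions are dropped first.
        self.memory = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        # One transition per environment step.
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Never ask for more transitions than are currently stored.
        k = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)
```

Because each batch mixes transitions from many different episodes, consecutive gradient updates no longer see a run of near-identical neighbouring states.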
64-neuron input layer representing flattened 8×8 maze
Each cell in the maze is represented as input, providing the network with complete environmental awareness for decision-making.
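One way to build that 64-value input is sketched here. The cell codes (1.0 for a free cell, 0.5 for the agent) and the all-free placeholder grid are assumptions; the project's actual maze layout is not reproduced.

```python
def encode_state(maze, agent_pos, agent_mark=0.5):
    """Flatten a grid row by row and mark the agent's current cell."""
    flat = []
    for r, row in enumerate(maze):
        for c, cell in enumerate(row):
            # The agent's cell gets its own code; every other cell keeps its value.
            flat.append(agent_mark if (r, c) == agent_pos else float(cell))
    return flat


# Placeholder 8x8 grid in which every cell is free (assumed, for illustration only).
maze = [[1.0] * 8 for _ in range(8)]
state = encode_state(maze, agent_pos=(0, 0))
```

The resulting list has one entry per cell, so its length matches the 64 input units.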
Two hidden layers with PReLU activation functions
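PReLU (Parametric ReLU) learns a slope for negative inputs instead of zeroing them as standard ReLU does. This keeps gradients flowing through units with negative pre-activations, which helps learning dynamics in a setting where rewards, and therefore Q-value estimates, can be negative.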
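The difference between the two activations is easy to see in scalar form. In this sketch `alpha` stands for the slope that training would learn; the value 0.25 used in the example calls is an arbitrary assumption.

```python
def relu(x):
    """Standard ReLU: negative inputs are clamped to zero."""
    return max(0.0, x)


def prelu(x, alpha):
    """Parametric ReLU: identity for positive inputs, slope alpha for the rest."""
    return x if x > 0 else alpha * x


negative_relu = relu(-2.0)          # the signal is discarded
negative_prelu = prelu(-2.0, 0.25)  # a scaled-down signal survives
```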
Epsilon-greedy exploration strategy with decay
Balances exploration (random actions to discover new strategies) with exploitation (using learned knowledge). Decay schedule gradually shifts from exploration to exploitation as training progresses.
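Action selection under this strategy can be sketched as follows. The function name and the optional `rng` argument (handy for seeding in tests) are assumptions.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the call always returns the greedy action, and with `epsilon=1.0` it always returns a random one.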
Shaped reward system with penalties for inefficiency
A large positive reward for reaching the treasure, combined with negative rewards for wall collisions and inefficient moves, encourages the agent to find short, valid paths rather than wander aimlessly.
Alternatives Considered
Considered traditional pathfinding algorithms (A*, Dijkstra), but the goal was to demonstrate AI learning capabilities rather than to implement a deterministic solution.
Challenges & Solutions
⚠Challenge
Agent initially exhibited random wandering without learning progress
✓Solution
Implemented a shaped reward system with an immediate penalty for wall collisions (-0.75) and a small penalty for every step (-0.04) to discourage inefficient exploration, while keeping a large reward (+1.0) for reaching the treasure.
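Using the three values quoted above, a reward function might be sketched like this. The function name and its boolean arguments are assumptions, and the 10-move episode is a made-up example.

```python
def step_reward(reached_treasure, hit_wall):
    """Reward for a single move, using the values quoted in the text."""
    if reached_treasure:
        return 1.0
    if hit_wall:
        return -0.75
    # Every ordinary move costs a little, so shorter paths score higher.
    return -0.04


# Undiscounted return of a hypothetical clean 10-move path:
# nine ordinary moves followed by the move that reaches the treasure.
episode_return = sum(step_reward(False, False) for _ in range(9)) + step_reward(True, False)
```

Here `episode_return` comes to about 0.64, and a single wall collision along the way would cost more than eighteen ordinary steps.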
⚠Challenge
Training was unstable with correlated sequential experiences
✓Solution
Built an experience replay buffer that stores (state, action, reward, next_state) tuples and samples them at random during training, breaking temporal correlations and stabilizing neural network convergence.
⚠Challenge
Agent overfitted to specific starting positions during training
✓Solution
Trained across diverse starting positions and validated performance on all 44 possible free cell locations, ensuring generalization rather than memorization of specific paths.
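A validation pass of this kind can be sketched as a loop over every free cell. `play_episode` is a hypothetical callable standing in for one greedy run of the trained agent; it is not a function from the project.

```python
def completion_check(free_cells, play_episode):
    """Run one episode per starting cell and report the cells that failed.

    play_episode is a hypothetical callable: start cell -> True if the
    treasure was reached from that cell.
    """
    failures = [cell for cell in free_cells if not play_episode(cell)]
    return len(failures) == 0, failures
```

Returning the list of failing cells, rather than only a pass/fail flag, shows which regions of the maze still need more training.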
⚠Challenge
Balancing exploration of new strategies vs exploitation of learned knowledge
✓Solution
Implemented epsilon-greedy strategy with exponential decay schedule, starting at high exploration (ε=1.0) and gradually reducing to low exploration (ε=0.1) as the agent gains experience.
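The schedule can be sketched as a single function of the episode number. The start and floor values follow the text; the per-episode decay factor of 0.995 is an assumption chosen only for illustration.

```python
def epsilon_at(episode, start=1.0, floor=0.1, decay=0.995):
    """Exponentially decayed exploration rate, clamped at the floor.

    decay=0.995 is an assumed per-episode factor; start and floor follow the text.
    """
    return max(floor, start * decay ** episode)


schedule = [epsilon_at(e) for e in (0, 100, 750)]
```

With this assumed factor the rate reaches the 0.1 floor after roughly 460 episodes and stays there for the rest of training.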
Outcomes & Impact
Win Rate
100% success rate across all 44 possible starting positions after training
Training Efficiency
Convergence achieved within ~750 episodes, demonstrating efficient learning
Behavioral Evolution
Agent evolved from random exploration to optimal pathfinding with minimal steps to treasure
Neural Network Architecture
64 inputs → 164 hidden units → 150 hidden units → 4 outputs (one Q-value per action), with PReLU activations on the two hidden layers
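As a rough sense of scale, the trainable parameters of a fully connected stack with these layer sizes can be counted as below. The sketch assumes one bias per unit and one learnable PReLU slope per hidden unit; the project's actual layer configuration may differ.

```python
def count_parameters(layer_sizes):
    """Count weights, biases and PReLU slopes for a fully connected stack.

    Assumes a bias per unit and one learnable PReLU slope per hidden unit.
    """
    total = 0
    last = len(layer_sizes) - 1
    for i in range(last):
        n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
        total += n_in * n_out + n_out  # weights plus biases
        if i + 1 < last:               # hidden layers only
            total += n_out             # one PReLU slope per unit
    return total


total = count_parameters([64, 164, 150, 4])
```

Under those assumptions `total` is 36,328, a small network by deep learning standards.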
Learning Methodology
Successfully implemented Deep Q-Learning with experience replay, epsilon-greedy exploration, and reward shaping