2025

Intelligent Pirate Maze Agent

An autonomous AI agent that uses Deep Q-Learning to navigate an 8×8 maze environment. After training, it reaches the treasure from every starting position (100% success rate).

Tech Stack

Python 3.8+, TensorFlow, Keras, NumPy, Matplotlib, Jupyter Notebook

Context

The Problem

Traditional pathfinding requires explicit programming of rules and strategies. This project explores whether an AI agent can learn optimal navigation autonomously through trial and error, without hardcoded instructions.

Constraints

  • 8×8 grid maze environment with obstacles and a treasure goal
  • Agent must learn from scratch without pre-programmed pathfinding logic
  • Must achieve reliable performance across all 44 possible starting positions
  • Training must converge within reasonable computational limits (~750 episodes)

Stakes

Academic project for CS-370 (Current Emerging Trends in CS) at Southern New Hampshire University, demonstrating an understanding of reinforcement learning, neural networks, and AI fundamentals.

My Role

Title

Machine Learning Engineer

Team

Academic Project (Individual)

Ownership

Complete implementation of Deep Q-Learning agent, neural network architecture, training pipeline, reward shaping, and performance analysis

Approach & Key Decisions

Implemented a Deep Q-Learning agent in TensorFlow/Keras. A neural network learns the pathfinding policy through reinforcement learning, experience replay stabilizes training, and an epsilon-greedy strategy balances exploration against exploitation.

Deep Q-Learning with neural network function approximation

Q-tables are impractical for large state spaces. Neural networks can generalize from seen states to unseen configurations, enabling the agent to handle any starting position.
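
The update that replaces a Q-table lookup can be sketched as follows. This is a minimal NumPy illustration of the standard Q-learning target, not the project's actual training code; the function name `q_targets`, the `dones` flag, and the discount factor `gamma=0.95` are assumptions made for the example.

```python
import numpy as np

def q_targets(q_current, q_next, actions, rewards, dones, gamma=0.95):
    """Build regression targets for a mini-batch of transitions.

    q_current: (batch, n_actions) network predictions for each state
    q_next:    (batch, n_actions) network predictions for each next state
    actions:   (batch,) index of the action that was taken
    rewards:   (batch,) reward that was received
    dones:     (batch,) True where the episode ended on this step
    gamma:     discount factor (assumed value, not from the project)
    """
    targets = q_current.copy()  # untaken actions keep their predicted value
    bootstrapped = rewards + gamma * q_next.max(axis=1)
    # Terminal transitions have no future value to bootstrap from.
    updated = np.where(dones, rewards, bootstrapped)
    targets[np.arange(len(actions)), actions] = updated
    return targets
```

Only the entry for the action actually taken changes, so the network is fitted toward the observed reward plus the discounted best estimate for the next state.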

Experience replay buffer for training stability

Storing and randomly sampling past experiences breaks correlation between consecutive training samples, leading to more stable learning and faster convergence.

64-neuron input layer representing flattened 8×8 maze

Each cell in the maze is represented as input, providing the network with complete environmental awareness for decision-making.
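
One way to build that 64-value input is sketched below. The cell encoding (1.0 for free, 0.0 for wall) and the agent marker value of 0.5 are assumptions for illustration, as is the helper name `encode_state`.

```python
import numpy as np

def encode_state(maze, agent_pos, agent_mark=0.5):
    """Flatten an 8x8 maze into a (1, 64) input row with the agent marked.

    maze:       8x8 array, assumed 1.0 = free cell and 0.0 = wall
    agent_pos:  (row, col) of the agent
    agent_mark: value written at the agent's cell (assumed, hypothetical)
    """
    canvas = np.array(maze, dtype=float)  # copy so the maze itself is untouched
    row, col = agent_pos
    canvas[row, col] = agent_mark
    return canvas.reshape(1, -1)
```

The leading dimension of 1 makes the result usable as a single-sample batch for a network that expects 64 inputs.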

Two hidden layers with PReLU activation functions

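PReLU (Parametric ReLU) learns a slope for negative pre-activations instead of zeroing them as standard ReLU does. This keeps gradients flowing through those units, which helps in reinforcement learning scenarios where rewards, and therefore target Q-values, can be negative.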
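
To make the data flow concrete, here is a forward pass through the layer sizes listed under Outcomes (64 → 164 → 150 → 4) written in plain NumPy. It is an illustrative sketch rather than the project's Keras model: the weight initialization, the starting slope of 0.25, and the names `init_params` and `forward` are assumptions.

```python
import numpy as np

LAYER_SIZES = [64, 164, 150, 4]  # input, two hidden layers, one Q-value per action

def prelu(x, alpha):
    """Parametric ReLU: identity for positive inputs, slope alpha otherwise."""
    return np.where(x > 0, x, alpha * x)

def init_params(rng):
    """Create weights for each layer; hidden layers also get a PReLU slope."""
    params = []
    last = len(LAYER_SIZES) - 2
    for i, (n_in, n_out) in enumerate(zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])):
        layer = {
            "W": rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)),
            "b": np.zeros(n_out),
        }
        if i < last:
            layer["alpha"] = np.full(n_out, 0.25)  # assumed starting slope
        params.append(layer)
    return params

def forward(params, state):
    """Map a (batch, 64) state to (batch, 4) Q-values."""
    h = state
    for layer in params[:-1]:
        h = prelu(h @ layer["W"] + layer["b"], layer["alpha"])
    out = params[-1]
    return h @ out["W"] + out["b"]  # linear output: Q-values are unbounded
```

In a trained network the `alpha` values would be learned alongside the weights; here they are fixed only to show where they enter the computation.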

Epsilon-greedy exploration strategy with decay

Balances exploration (random actions to discover new strategies) with exploitation (using learned knowledge). A decay schedule gradually shifts the agent from exploration to exploitation as training progresses.

Shaped reward system with penalties for inefficiency

A large positive reward for reaching the treasure, combined with negative rewards for wall collisions and inefficient moves, encourages the agent to find short, valid paths rather than wander aimlessly.

Alternatives Considered

Traditional pathfinding algorithms (A*, Dijkstra) were considered, but the goal was to demonstrate AI learning capabilities rather than implement a deterministic solution.

Challenges & Solutions

Challenge

Agent initially exhibited random wandering without learning progress

Solution

Implemented shaped reward system with immediate penalties for wall collisions (-0.75) and small step penalties (-0.04) to discourage inefficient exploration, while maintaining large treasure reward (+1.0) for successful completion.
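
The reward values above translate into a small function like the one below. The values come from the description; the function name `shaped_reward`, the grid encoding (1.0 free, 0.0 wall), and treating a move off the grid the same as a wall collision are assumptions for the sketch.

```python
WALL_PENALTY = -0.75    # immediate penalty for hitting a wall
STEP_PENALTY = -0.04    # small cost for every ordinary move
TREASURE_REWARD = 1.0   # reward for reaching the treasure

def shaped_reward(maze, new_pos, treasure_pos):
    """Reward for attempting to move to new_pos.

    maze is a grid of 1.0 (free) and 0.0 (wall); positions are (row, col).
    Leaving the grid is treated like a wall collision (an assumption).
    """
    row, col = new_pos
    n_rows, n_cols = len(maze), len(maze[0])
    out_of_bounds = not (0 <= row < n_rows and 0 <= col < n_cols)
    if out_of_bounds or maze[row][col] == 0.0:
        return WALL_PENALTY
    if new_pos == treasure_pos:
        return TREASURE_REWARD
    return STEP_PENALTY
```

Because every non-terminal move costs 0.04, a shorter route to the treasure always accumulates a higher total reward than a longer one.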

Challenge

Training was unstable with correlated sequential experiences

Solution

Built experience replay buffer that stores (state, action, reward, next_state) tuples and samples randomly during training, breaking temporal correlations and stabilizing neural network convergence.
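
A replay buffer of this kind can be sketched in a few lines of standard-library Python. The class name `ReplayBuffer`, its method names, and the default capacity of 1000 are hypothetical; only the stored tuple layout comes from the description above.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=1000):
        # A bounded deque drops the oldest experience once capacity is reached.
        self.memory = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Return up to batch_size experiences chosen uniformly at random."""
        k = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)
```

Sampling uniformly from the whole memory means a training batch mixes early and recent moves, rather than a run of consecutive steps from one episode.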

Challenge

Agent overfitted to specific starting positions during training

Solution

Trained across diverse starting positions and validated performance on all 44 possible free cell locations, ensuring generalization rather than memorization of specific paths.
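
A validation pass over every free cell might look like the sketch below. It assumes a grid of 1.0 (free) and 0.0 (wall), an action numbering of left, up, right, down, a `policy` callable that maps a position to an action, and that the treasure cell itself is not counted as a start; none of these details are confirmed by the project description.

```python
MOVES = {0: (0, -1), 1: (-1, 0), 2: (0, 1), 3: (1, 0)}  # left, up, right, down (assumed)

def reaches_treasure(maze, start, treasure, policy, max_steps):
    """Follow the policy from start; True if the treasure is reached in time."""
    pos = start
    for _ in range(max_steps):
        if pos == treasure:
            return True
        d_row, d_col = MOVES[policy(pos)]
        row, col = pos[0] + d_row, pos[1] + d_col
        inside = 0 <= row < len(maze) and 0 <= col < len(maze[0])
        if inside and maze[row][col] == 1.0:
            pos = (row, col)  # blocked moves leave the agent where it is
    return pos == treasure

def win_rate(maze, treasure, policy):
    """Fraction of free starting cells from which the policy finds the treasure."""
    starts = [(r, c) for r, line in enumerate(maze) for c, cell in enumerate(line)
              if cell == 1.0 and (r, c) != treasure]
    max_steps = len(maze) * len(maze[0])  # cap so a looping policy cannot stall
    wins = sum(reaches_treasure(maze, s, treasure, policy, max_steps) for s in starts)
    return wins / len(starts)
```

With a trained network, `policy` would encode the position as a state and return the action with the highest predicted Q-value.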

Challenge

Balancing exploration of new strategies vs exploitation of learned knowledge

Solution

Implemented epsilon-greedy strategy with exponential decay schedule, starting at high exploration (ε=1.0) and gradually reducing to low exploration (ε=0.1) as the agent gains experience.
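
The schedule and action choice can be sketched as follows. The start and end values (1.0 and 0.1) come from the description; the `decay_rate` of 0.01 per episode and the function names are assumptions chosen for illustration.

```python
import math
import random

EPS_START, EPS_END = 1.0, 0.1

def epsilon_at(episode, decay_rate=0.01):
    """Exponentially decay epsilon from EPS_START toward EPS_END.

    decay_rate is a hypothetical value, not taken from the project.
    """
    return EPS_END + (EPS_START - EPS_END) * math.exp(-decay_rate * episode)

def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

With these assumed settings, epsilon is 1.0 at episode 0 and has settled very close to 0.1 by episode 750, so late training is almost entirely greedy.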

Outcomes & Impact

Win Rate

100% success rate across all 44 possible starting positions after training

Training Efficiency

Convergence achieved within ~750 episodes, demonstrating efficient learning

Behavioral Evolution

Agent evolved from random exploration to optimal pathfinding with minimal steps to treasure

Neural Network Architecture

64-input → 164-hidden → 150-hidden → 4-output layers with PReLU activation

Learning Methodology

Successfully implemented Deep Q-Learning with experience replay, epsilon-greedy exploration, and reward shaping
