2026 · Featured Project

PulseBoard

An end-to-end data pipeline that ingests Hacker News and NewsAPI headlines, scores them for sentiment, transforms them through dbt staging and marts, orchestrates runs in Airflow, and serves a live FastAPI + Next.js dashboard

Tech Stack

Python 3.13
PostgreSQL 14
dbt 1.11
Apache Airflow
Confluent Kafka
FastAPI
psycopg2
TextBlob
Next.js 14
Tailwind CSS
SWR
Docker Compose

Context

The Problem

Most data-engineering portfolios show one piece in isolation — a dbt project, an Airflow DAG, or a dashboard. PulseBoard was built to show the whole pipeline working together: ingestion, storage, transformation, orchestration, streaming, sentiment, API, and frontend, all wired into one system.

Constraints

Pipeline must be idempotent — re-running must not produce duplicate rows
Both batch (Airflow) and streaming (Kafka) ingestion paths must write to the same PostgreSQL tables
Every transformation must be covered by dbt schema tests (unique, not_null)
Sentiment scoring must be applied at ingest, not as a downstream lookup
Dashboard must reflect new data without requiring a manual refresh

Stakes

Designed as a coherent data-engineering portfolio demonstrating fluency across the full modern data stack rather than competence in any single tool

My Role

Title

Data Engineer & Full-Stack Developer

Team

Personal Project

Ownership

End-to-end ownership: Python ingestion, PostgreSQL schema, dbt models, Airflow DAGs, Kafka streaming path, FastAPI REST layer, and Next.js dashboard

Approach & Key Decisions

Python fetchers pull Hacker News stories and NewsAPI headlines into a PostgreSQL raw schema, applying TextBlob sentiment polarity at ingest. dbt 1.11 transforms the raw data through staging models (cleaning) into marts (business-ready analytics), with schema tests enforcing unique and not_null constraints. Airflow orchestrates hourly runs; a Confluent Kafka path (run via Docker) offers a real-time alternative that writes to the same tables. FastAPI exposes three endpoints with auto-generated Swagger docs, and a Next.js 14 + Tailwind dashboard uses SWR to auto-refresh every 60 seconds.

Idempotent upserts on PostgreSQL using source-provided IDs

Lets the pipeline be re-run safely — backfills, retries, and replays never produce duplicates, which is critical for any production-grade ingestion.
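A minimal sketch of the upsert pattern, assuming a hypothetical raw_stories table keyed on a source_id column. SQLite stands in for PostgreSQL so the example runs without a server; the INSERT ... ON CONFLICT ... DO UPDATE clause has the same shape in PostgreSQL, where psycopg2 would use %s placeholders instead of ?. The table, column, and function names are illustrative and not taken from the project.

```python
import sqlite3

# Hypothetical table and column names; SQLite stands in for PostgreSQL.
UPSERT_SQL = """
INSERT INTO raw_stories (source_id, title, score)
VALUES (?, ?, ?)
ON CONFLICT (source_id) DO UPDATE SET
    title = excluded.title,
    score = excluded.score
"""


def upsert_stories(conn, stories):
    """Insert new stories, or update existing ones that share a source_id."""
    params = [(s["id"], s["title"], s["score"]) for s in stories]
    conn.executemany(UPSERT_SQL, params)
    conn.commit()


def count_rows(conn):
    """Return the number of rows currently in raw_stories."""
    return conn.execute("SELECT COUNT(*) FROM raw_stories").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_stories ("
    "source_id INTEGER PRIMARY KEY, title TEXT NOT NULL, score INTEGER)"
)

# Made-up sample rows for illustration only.
batch = [
    {"id": 101, "title": "Example story A", "score": 10},
    {"id": 102, "title": "Example story B", "score": 5},
]
upsert_stories(conn, batch)
upsert_stories(conn, batch)  # simulate a retry or backfill of the same batch
print(count_rows(conn))
```

Because the conflict target is the source-provided ID, the second call updates the two existing rows instead of inserting new ones.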

Layered dbt models: raw → staging → marts

Keeps raw data untouched for auditability, isolates cleaning logic in staging, and presents only business-ready models to the API — the standard modern-data-stack pattern.
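The layering can be illustrated with a small sketch. dbt models are SELECT statements that dbt materializes as views or tables, so this example approximates the three layers with SQLite views driven from Python. The model names (raw_hn_stories, stg_hn_stories, mart_sentiment_summary), the columns, and the sample rows are all assumptions for illustration, not the project's actual models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- raw: rows exactly as ingested, never modified afterwards
    CREATE TABLE raw_hn_stories (
        source_id INTEGER PRIMARY KEY,
        title TEXT,
        polarity REAL
    );
    INSERT INTO raw_hn_stories VALUES
        (1, '  Example story A ', 0.5),
        (2, 'Example story B', 0.25),
        (3, NULL, 0.0);

    -- staging: cleaning only (trim whitespace, drop rows without a title)
    CREATE VIEW stg_hn_stories AS
    SELECT source_id, TRIM(title) AS title, polarity
    FROM raw_hn_stories
    WHERE title IS NOT NULL;

    -- mart: business-ready aggregate that an API layer would read
    CREATE VIEW mart_sentiment_summary AS
    SELECT COUNT(*) AS story_count, AVG(polarity) AS avg_polarity
    FROM stg_hn_stories;
    """
)

summary = conn.execute(
    "SELECT story_count, avg_polarity FROM mart_sentiment_summary"
).fetchone()
print(summary)
```

The raw table still holds all three rows, including the one with a missing title, while the mart only sees the two cleaned rows from staging.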

Both Airflow batch and Kafka streaming writing to the same tables

Demonstrates that the storage layer is the integration point, not the ingestion mechanism — either path can feed the same downstream consumers.
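One way to picture this is a single write function shared by both paths. The sketch below is an assumption about the shape of such a design, not the project's code: a dict stands in for the PostgreSQL raw table, a list stands in for one batch fetch, and a list of JSON-encoded bytes stands in for Kafka message values. All function names are hypothetical.

```python
import json


def write_story(table, story):
    """Shared write path: upsert keyed on the source-provided ID."""
    table[story["id"]] = story


def run_batch(table, fetched_stories):
    """Batch path: a scheduled task would pass in the results of one fetch."""
    for story in fetched_stories:
        write_story(table, story)


def run_stream(table, messages):
    """Streaming path: a consumer loop would decode each message value."""
    for raw in messages:
        write_story(table, json.loads(raw))


raw_table = {}  # stand-in for the PostgreSQL raw table

run_batch(raw_table, [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}])
run_stream(raw_table, [b'{"id": 2, "title": "B"}', b'{"id": 3, "title": "C"}'])

print(sorted(raw_table))
```

Story 2 arrives through both paths but is stored once, and nothing downstream of raw_table can tell which path delivered a given row.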

Sentiment scoring at ingest with TextBlob

Stores polarity alongside the raw row so downstream models and APIs never need to recompute it; sentiment becomes a first-class column from the start.
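A rough sketch of scoring at ingest follows. TextBlob reports polarity as a float in the range [-1.0, 1.0] via TextBlob(text).sentiment.polarity. The 0.05 threshold used to derive the positive/negative/neutral label, and the function and key names, are assumptions for illustration; the project's actual cut-offs are not stated.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity in [-1.0, 1.0] to a label (threshold is an assumption)."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def score_headline(title):
    """Return the row fields to store alongside the raw headline."""
    # Imported here so label_polarity can be used without textblob installed.
    from textblob import TextBlob

    polarity = TextBlob(title).sentiment.polarity
    return {
        "title": title,
        "polarity": polarity,
        "sentiment_label": label_polarity(polarity),
    }


print(label_polarity(0.6), label_polarity(-0.3), label_polarity(0.0))
```

An ingest function would call score_headline once per headline and write the returned fields in the same insert as the raw row, so later reads need no recomputation.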

SWR auto-refresh every 60 seconds on the dashboard

Trades a tiny amount of API load for a perceptibly live UI — users see fresh data without needing to reload.

Challenges & Solutions

⚠ Challenge

Re-running ingestion typically risks duplicate rows

✓ Solution

Upserts on source-provided IDs make the pipeline safely re-runnable end to end

⚠ Challenge

Maintaining two ingestion paths (batch and streaming) without divergence

✓ Solution

Both Airflow and Kafka write to the same raw tables; transformations and APIs are unaware of which path produced a row

⚠ Challenge

Guaranteeing transformations don't silently degrade

✓ Solution

Every dbt model is covered by schema tests (unique, not_null) that run on every build — 6/6 tests passing across 3 models
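To make concrete what the two test types assert, here is a plain-Python approximation. In dbt these tests are declared in YAML and compiled to SQL; the functions below only mirror the logic (not_null flags rows with a missing value, unique flags non-null values that appear more than once). The column name and sample rows are hypothetical.

```python
from collections import Counter


def check_not_null(rows, column):
    """Return the rows where the column is missing or None."""
    return [r for r in rows if r.get(column) is None]


def check_unique(rows, column):
    """Return non-null values that appear more than once in the column."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


# Made-up rows: one clean set and one with a duplicate and a missing ID.
clean_rows = [{"story_id": 1}, {"story_id": 2}, {"story_id": 3}]
bad_rows = [{"story_id": 1}, {"story_id": 1}, {"story_id": None}]

print(check_unique(clean_rows, "story_id"), check_not_null(clean_rows, "story_id"))
print(check_unique(bad_rows, "story_id"), check_not_null(bad_rows, "story_id"))
```

A test passes when its check returns nothing, which is why a duplicate or missing key introduced by an upstream change would fail the build rather than pass through silently.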

⚠ Challenge

Re-computing sentiment for every API response would be wasteful

✓ Solution

Polarity is computed once at ingest and stored on the row, so the API and dashboard read it for free

Outcomes & Impact

dbt Models

3 models building successfully with 6/6 schema tests passing (unique, not_null)

Orchestration

Airflow DAG running hourly with 100% task success rate

Sentiment Coverage

Polarity score and label (positive/negative/neutral) applied to 100% of ingested headlines

Streaming Path

Kafka producer/consumer streaming 10+ stories per run into the same PostgreSQL tables as the batch path

Dashboard

Next.js + Tailwind dashboard auto-refreshing every 60 seconds across trending topics, HN, and news feeds

Phases Delivered

8 complete phases: ingestion, storage, transformations, orchestration, API, dashboard, streaming, sentiment

Project Links

View on GitHub
