PulseBoard
An end-to-end data pipeline that ingests Hacker News and NewsAPI headlines, scores them for sentiment, transforms them through dbt staging and marts, orchestrates runs in Airflow, and serves a live FastAPI + Next.js dashboard
Context
The Problem
Most data-engineering portfolios show one piece in isolation — a dbt project, an Airflow DAG, or a dashboard. PulseBoard was built to show the whole pipeline working together: ingestion, storage, transformation, orchestration, streaming, sentiment, API, and frontend, all wired into one system.
Constraints
- Pipeline must be idempotent — re-running must not produce duplicate rows
- Both batch (Airflow) and streaming (Kafka) ingestion paths must write to the same PostgreSQL tables
- Every transformation must be covered by dbt schema tests (unique, not_null)
- Sentiment scoring must be applied at ingest, not as a downstream lookup
- Dashboard must reflect new data without requiring a manual refresh
Stakes
Designed as a coherent data-engineering portfolio piece that demonstrates fluency across the full modern data stack rather than competence in a single tool.
My Role
Title
Data Engineer & Full-Stack Developer
Team
Personal Project
Ownership
End-to-end ownership: Python ingestion, PostgreSQL schema, dbt models, Airflow DAGs, Kafka streaming path, FastAPI REST layer, and Next.js dashboard
Approach & Key Decisions
Python fetchers pull Hacker News stories and NewsAPI headlines into a PostgreSQL raw schema, applying TextBlob sentiment polarity at ingest. dbt 1.11 transforms raw data through staging models (cleaning) into marts (business-ready analytics) with schema tests enforcing uniqueness and not-null. Airflow orchestrates hourly runs; a Confluent Kafka path (via Docker) offers a real-time alternative writing to the same tables. FastAPI exposes three endpoints with auto-generated Swagger docs, and a Next.js 14 + Tailwind dashboard uses SWR to auto-refresh every 60 seconds.
Idempotent upserts on PostgreSQL using source-provided IDs
Lets the pipeline be re-run safely — backfills, retries, and replays never produce duplicates, which is critical for any production-grade ingestion.
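A minimal sketch of the upsert pattern described above. It uses an in-memory SQLite database as a stand-in so it runs without a server; PostgreSQL accepts the same `INSERT ... ON CONFLICT (id) DO UPDATE` form. The table name, columns, and sample rows are hypothetical, not PulseBoard's actual schema.

```python
import sqlite3

# Hypothetical table and column names; the real PulseBoard schema may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS raw_hn_stories (
    id INTEGER PRIMARY KEY,   -- source-provided Hacker News item ID
    title TEXT NOT NULL,
    score INTEGER NOT NULL
)
"""

UPSERT = """
INSERT INTO raw_hn_stories (id, title, score)
VALUES (:id, :title, :score)
ON CONFLICT (id) DO UPDATE SET
    title = excluded.title,
    score = excluded.score
"""


def upsert_stories(conn, stories):
    """Insert new stories and update existing ones, keyed on the source ID."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, stories)


def row_count(conn):
    """Number of rows currently stored in the raw table."""
    return conn.execute("SELECT COUNT(*) FROM raw_hn_stories").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

batch = [
    {"id": 101, "title": "Show HN: PulseBoard", "score": 12},
    {"id": 102, "title": "Postgres tips", "score": 40},
]
upsert_stories(conn, batch)
upsert_stories(conn, batch)  # simulated retry: same IDs, no new rows

# A later fetch sees a higher score for an existing story.
upsert_stories(conn, [{"id": 101, "title": "Show HN: PulseBoard", "score": 57}])
```

After the retry and the later fetch, the table still holds one row per source ID, and story 101 carries the most recent score. Updating on conflict (rather than ignoring the conflict) is a design choice assumed here so that mutable fields such as scores stay current.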
Layered dbt models: raw → staging → marts
Keeps raw data untouched for auditability, isolates cleaning logic in staging, and presents only business-ready models to the API — the standard modern-data-stack pattern.
Both Airflow batch and Kafka streaming writing to the same tables
Demonstrates that the storage layer is the integration point, not the ingestion mechanism — either path can feed the same downstream consumers.
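One way to keep the two paths from diverging is to route both through a single write function. The sketch below assumes that design; the function names, the `raw_headlines` table, and the row shape are hypothetical, and SQLite again stands in for PostgreSQL. The batch entry point receives dicts from a fetcher, while the streaming entry point receives JSON-encoded bytes, as a message consumer typically would.

```python
import json
import sqlite3


# Hypothetical table name and row shape, for illustration only.
def write_headline(conn, row):
    """Single write path shared by the batch and streaming entry points."""
    with conn:
        conn.execute(
            """
            INSERT INTO raw_headlines (id, source, title)
            VALUES (:id, :source, :title)
            ON CONFLICT (id) DO UPDATE SET
                source = excluded.source,
                title = excluded.title
            """,
            row,
        )


def run_batch(conn, fetched_rows):
    """Stand-in for the scheduled task: rows arrive as dicts from a fetcher."""
    for row in fetched_rows:
        write_headline(conn, row)


def run_stream(conn, message_values):
    """Stand-in for the consumer loop: rows arrive as JSON-encoded bytes."""
    for value in message_values:
        write_headline(conn, json.loads(value))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_headlines (id TEXT PRIMARY KEY, source TEXT, title TEXT)"
)

run_batch(conn, [{"id": "hn-1", "source": "hackernews", "title": "A"}])
run_stream(conn, [
    json.dumps({"id": "hn-1", "source": "hackernews", "title": "A"}).encode(),
    json.dumps({"id": "news-9", "source": "newsapi", "title": "B"}).encode(),
])

rows = conn.execute("SELECT id, source FROM raw_headlines ORDER BY id").fetchall()
```

The story `hn-1` arrives through both paths but is stored once, so anything reading `raw_headlines` cannot tell which path delivered it.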
Sentiment scoring at ingest with TextBlob
Stores polarity alongside the raw row so downstream models and APIs never need to recompute it; sentiment becomes a first-class column from the start.
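A sketch of how a polarity score and label could be attached to a row before it is written. The neutral band of ±0.05 is an assumption, since the cut-offs are not stated here, and `enrich`, `label_for`, and `stub_scorer` are hypothetical names. The scorer is passed in as a function so the example runs without TextBlob installed.

```python
from typing import Callable

# Threshold is an assumption; the source does not state the cut-offs it uses.
NEUTRAL_BAND = 0.05


def label_for(polarity: float, band: float = NEUTRAL_BAND) -> str:
    """Map a polarity in [-1.0, 1.0] to positive / negative / neutral."""
    if polarity > band:
        return "positive"
    if polarity < -band:
        return "negative"
    return "neutral"


def enrich(row: dict, scorer: Callable[[str], float]) -> dict:
    """Return a copy of the row with polarity and label attached at ingest."""
    polarity = scorer(row["title"])
    return {**row, "polarity": polarity, "sentiment": label_for(polarity)}


# With TextBlob installed, the scorer could be:
#     from textblob import TextBlob
#     scorer = lambda text: TextBlob(text).sentiment.polarity
# A fixed stub keeps this sketch runnable without the dependency.
def stub_scorer(text: str) -> float:
    return 0.6 if "great" in text.lower() else 0.0


enriched = enrich({"id": 1, "title": "A great release"}, stub_scorer)
```

Because the enriched dict is what gets written, the polarity and label land in the same row as the headline and need no later join or recomputation.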
SWR auto-refresh every 60 seconds on the dashboard
Trades a tiny amount of API load for a perceptibly live UI — users see fresh data without needing to reload.
Challenges & Solutions
⚠Challenge
Re-running ingestion (on retries, backfills, or replays) would otherwise insert duplicate rows
✓Solution
Upserts on source-provided IDs make the pipeline safely re-runnable end to end
⚠Challenge
Maintaining two ingestion paths (batch and streaming) without divergence
✓Solution
Both Airflow and Kafka write to the same raw tables; transformations and APIs are unaware of which path produced a row
⚠Challenge
Guaranteeing transformations don't silently degrade
✓Solution
Every dbt model is covered by schema tests (unique, not_null) that run on every build — 6/6 tests passing across 3 models
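For readers unfamiliar with these two tests: each one passes when a search for offending rows comes back empty. The plain-Python sketch below illustrates that semantics only; it is not how dbt executes tests, and the function names, column names, and sample rows are hypothetical.

```python
from collections import Counter


def failing_not_null(rows, column):
    """Rows that would fail a not_null test on the given column."""
    return [r for r in rows if r.get(column) is None]


def failing_unique(rows, column):
    """Values that would fail a unique test: non-null values seen more than once."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sorted(v for v, n in counts.items() if n > 1)


# Hypothetical staging rows; column names are illustrative.
rows = [
    {"story_id": 1, "title": "A"},
    {"story_id": 2, "title": None},
    {"story_id": 2, "title": "C"},
]

null_failures = failing_not_null(rows, "title")
duplicate_ids = failing_unique(rows, "story_id")
```

Here the sample data fails both checks: one row has a missing title, and `story_id` 2 appears twice. A build with all tests passing corresponds to both lists being empty for every tested column.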
⚠Challenge
Re-computing sentiment for every API response would be wasteful
✓Solution
Polarity is computed once at ingest and stored on the row, so the API and dashboard read it for free
Outcomes & Impact
dbt Models
3 models building successfully with 6/6 schema tests passing (unique, not_null)
Orchestration
Airflow DAG running hourly with 100% task success rate
Sentiment Coverage
Polarity score and label (positive/negative/neutral) applied to 100% of ingested headlines
Streaming Path
Kafka producer/consumer streaming 10+ stories per run into the same PostgreSQL tables as the batch path
Dashboard
Next.js + Tailwind dashboard auto-refreshing every 60 seconds across trending topics, HN, and news feeds
Phases Delivered
8 complete phases: ingestion, storage, transformations, orchestration, API, dashboard, streaming, sentiment