
How We Built a Push Notification System That Actually Doesn't Lose Messages

Aadil Ghani
Software Engineer
13 min read

I spent the last few months building Pushary's notification pipeline from scratch. Not because existing tools weren't available. Because they weren't good enough for what we needed: a system where if you hit "send," that notification reaches the browser. Period.

This is the technical breakdown. How it works, why it works, and the decisions that went into it.

The Problem With "Just Send a Push Notification"

Web Push sounds simple on paper. You get a subscription endpoint from the browser, encrypt a payload, POST it to a push service (Google's FCM, Mozilla's autopush, Apple's push gateway), and the browser shows a notification.

In reality, it's a minefield.

Networks fail. Push services rate-limit you. Subscriptions expire silently. iOS has its own universe of constraints. And if you're sending to 50,000 subscribers at once, you need guarantees that "sent" actually means sent. Not "we tried once and gave up."

The Architecture: Transactional Outbox + Kafka

Here's the core insight that drives everything: never lose the intent to send.

When a campaign is triggered, we don't immediately fire off web push calls. Instead, we write notification records and outbox events into Postgres in the same database transaction. This is the transactional outbox pattern, and it's the foundation of the entire reliability story.

Campaign Trigger -> [Postgres Transaction: notifications + outbox_events] -> OutboxPublisher -> Kafka -> Consumer -> Web Push API

Why does this matter? Because if the server crashes one millisecond after the database commit, the notification intent is already persisted. Nothing is lost. The outbox publisher picks it up on the next poll.
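The pattern looks roughly like this. A minimal sketch assuming a node-postgres-style client; the `notifications` and `outbox_events` table names come from this post, but the columns and the `enqueueCampaignSend` helper are illustrative, not Pushary's actual schema:

```typescript
// Transactional outbox sketch: both rows commit atomically, or neither does.
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

async function enqueueCampaignSend(
  db: Queryable,
  campaignId: string,
  subscriberIds: string[],
): Promise<void> {
  await db.query("BEGIN");
  try {
    for (const subscriberId of subscriberIds) {
      // 1. The notification record itself.
      await db.query(
        `INSERT INTO notifications (campaign_id, subscriber_id, status)
         VALUES ($1, $2, 'pending')`,
        [campaignId, subscriberId],
      );
      // 2. The intent to send, written in the SAME transaction.
      await db.query(
        `INSERT INTO outbox_events (topic, payload)
         VALUES ('pushary.notifications', $1)`,
        [JSON.stringify({ campaignId, subscriberId })],
      );
    }
    await db.query("COMMIT"); // crash after this point loses nothing
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}
```

Because the outbox insert rides the same transaction as the notification row, there is no window where one exists without the other.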

The Outbox Publisher

The outbox publisher is the bridge between Postgres and Kafka. It uses two mechanisms to detect new events:

Postgres LISTEN/NOTIFY for near-instant pickup. When an outbox event is inserted, a trigger fires a NOTIFY on the outbox_events_new channel. The publisher hears it and immediately polls.

Adaptive polling as a fallback. If the LISTEN connection drops (networks are fun), polling kicks in with an interval that scales between 500ms and 10 seconds based on load. Busy? Poll faster. Quiet? Back off.
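The interval logic can be sketched in a few lines. Only the 500ms–10s bounds come from the description above; the exact scaling curve here (reset on work, exponential backoff when idle) is an assumption:

```typescript
// Adaptive poll interval: snap to the floor when busy, back off when quiet.
const MIN_POLL_MS = 500;
const MAX_POLL_MS = 10_000;

function nextPollInterval(current: number, eventsFound: number): number {
  if (eventsFound > 0) {
    // Busy: poll as fast as allowed.
    return MIN_POLL_MS;
  }
  // Quiet: double the interval, capped at 10 seconds.
  return Math.min(current * 2, MAX_POLL_MS);
}
```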

The publisher claims events using SELECT ... FOR UPDATE SKIP LOCKED. This is critical for horizontal scaling. Multiple publisher instances can run without stepping on each other. Each instance grabs unclaimed events, publishes them to Kafka in batches grouped by topic, and marks them as sent.
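The claim step can be expressed as one statement. A sketch with illustrative column names; what matters (and comes from the text) is `FOR UPDATE SKIP LOCKED`, which makes concurrent publishers skip rows another instance holds instead of blocking on them:

```typescript
// Claim a batch of pending outbox events. Rows locked by another publisher
// are skipped, so each instance gets a disjoint batch.
const CLAIM_EVENTS_SQL = `
  UPDATE outbox_events
     SET status = 'claimed', claimed_at = now()
   WHERE id IN (
     SELECT id
       FROM outbox_events
      WHERE status = 'pending'
      ORDER BY created_at
      LIMIT $1
        FOR UPDATE SKIP LOCKED
   )
  RETURNING id, topic, payload;
`;
```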

If Kafka is unreachable, events stay in Postgres. After 5 failed publish attempts, they route to a dead letter queue. And if we see 5 consecutive failures, a circuit breaker opens and we back off for 60 seconds instead of hammering a broken connection.
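A minimal circuit breaker matching the numbers above — open after 5 consecutive failures, back off for 60 seconds. The clock is injected to make the behavior testable; the real implementation may differ:

```typescript
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold = 5,
    private readonly cooldownMs = 60_000,
    private readonly now: () => number = Date.now,
  ) {}

  canAttempt(): boolean {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: allow a probe ("half-open").
      this.openedAt = null;
      this.consecutiveFailures = 0;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= this.threshold) {
      this.openedAt = this.now(); // trip the breaker
    }
  }
}
```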

Kafka: The Backbone

We run Kafka with an idempotent producer (exactly-once semantics on the producer side), Snappy compression, and manual offset commits on the consumer.

Two topics:

  • pushary.notifications — the main event stream (6 partitions, replication factor 3)
  • pushary.dlq — dead letter queue for messages that can't be processed

The consumer uses batch consumption with a concurrency semaphore capped at 10. This means we process up to 10 notifications simultaneously per consumer instance, with backpressure built in. We don't commit offsets until a message is fully processed or routed to the DLQ. That's at-least-once delivery.
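The concurrency cap can be sketched as a counting semaphore. This is an illustration of the mechanism, not Pushary's consumer code; the cap of 10 comes from the text:

```typescript
// Counting semaphore: at most `limit` handlers in flight at once.
class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(limit: number) {
    this.available = limit;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    // Cap reached: park until someone releases (backpressure).
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot directly to a waiter
    else this.available++;
  }
}

// Process a batch with bounded concurrency.
async function processBatch<T>(
  messages: T[],
  handler: (m: T) => Promise<void>,
  limit = 10,
): Promise<void> {
  const sem = new Semaphore(limit);
  await Promise.all(
    messages.map(async (m) => {
      await sem.acquire();
      try {
        await handler(m);
      } finally {
        sem.release();
      }
    }),
  );
}
```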

"At-least-once" means a message might be delivered twice. So we need idempotency.

Two-Layer Idempotency

This is where most systems cut corners. We don't.

Layer 1: Event-level idempotency. Every event gets a unique ID. Before processing, we check the processed_events table. If it's already marked completed, we skip. If it's not there, we insert a reserved row. If processing fails, we delete the reservation so it can be retried. If it succeeds, we mark it completed.

Layer 2: Delivery-level idempotency. Even within a single event, we deduplicate at the notification level. The notification_deliveries table uses a dedupe key of NOTIFICATION:{notificationId} with SELECT FOR UPDATE SKIP LOCKED in a transaction. A notification can only be sent once, regardless of how many times the event is replayed.

On startup, we clean up stale reservations older than 5 minutes. This handles the case where a worker dies mid-processing.
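The event-level lifecycle can be modeled in memory to show the logic (the real system backs this with the processed_events table in Postgres; the class and method names here are illustrative):

```typescript
// reserve -> complete on success, reserve -> release on failure,
// plus startup cleanup of reservations left by dead workers.
type EventState = { status: "reserved" | "completed"; reservedAt: number };

class ProcessedEvents {
  private events = new Map<string, EventState>();
  constructor(private readonly now: () => number = Date.now) {}

  /** Returns false if the event already completed (caller should skip it). */
  tryReserve(eventId: string): boolean {
    if (this.events.get(eventId)?.status === "completed") return false;
    this.events.set(eventId, { status: "reserved", reservedAt: this.now() });
    return true;
  }

  markCompleted(eventId: string): void {
    this.events.set(eventId, { status: "completed", reservedAt: this.now() });
  }

  /** On failure, delete the reservation so the event can be retried. */
  releaseReservation(eventId: string): void {
    if (this.events.get(eventId)?.status === "reserved") {
      this.events.delete(eventId);
    }
  }

  /** On startup: drop reservations older than 5 minutes (dead workers). */
  cleanupStale(maxAgeMs = 5 * 60_000): number {
    let removed = 0;
    for (const [id, e] of this.events) {
      if (e.status === "reserved" && this.now() - e.reservedAt > maxAgeMs) {
        this.events.delete(id);
        removed++;
      }
    }
    return removed;
  }
}
```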


The Actual Web Push Delivery

Once a message reaches the event handler, we call webpush.sendNotification() with the subscriber's push endpoint, their p256dh and auth keys, and our VAPID credentials.

Each site gets its own VAPID key pair (RFC 8292). The payload is encrypted per RFC 8291 (message encryption) using the RFC 8188 content encoding. We set a TTL of 86,400 seconds (24 hours) and urgency "normal."

The payload itself contains everything the service worker needs: title, body, icon, image, badge, notification ID, subscriber ID, campaign ID, site key, and our API URL. This data travels encrypted end-to-end from our server to the browser's push service to the service worker.

On success, we atomically update the notification status to sent, increment the subscriber's total notification count, increment the campaign's sent counter, and update daily stats. All in one transaction.

On failure, we inspect the HTTP status code:

  • 410/404: Subscription expired or gone. We mark the subscriber as expired so we never waste bandwidth on dead endpoints again.
  • 401: VAPID authentication error. Something's wrong with our keys.
  • 429: Push service rate-limiting us. Back off.
  • 413: Payload too large.

Every failure updates the notification status to failed with the error code and increments the campaign's failure counter.
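The status-code handling above can be sketched as a classifier. The action names are illustrative, not Pushary's internal vocabulary:

```typescript
// Map Web Push HTTP failure codes to follow-up actions.
type PushFailureAction =
  | "expire-subscriber" // dead endpoint: never send here again
  | "check-vapid-keys"
  | "back-off"
  | "shrink-payload"
  | "mark-failed";

function classifyPushFailure(statusCode: number): PushFailureAction {
  switch (statusCode) {
    case 404:
    case 410:
      return "expire-subscriber"; // subscription expired or gone
    case 401:
      return "check-vapid-keys"; // VAPID authentication error
    case 429:
      return "back-off"; // push service is rate-limiting us
    case 413:
      return "shrink-payload"; // payload too large
    default:
      return "mark-failed";
  }
}
```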


Click Tracking and the Redirect URL

Here's a question most push notification services don't think about carefully: how do you know someone actually clicked?

We track clicks through two parallel paths, because reliability means redundancy.

Path 1: Direct Tracking

When the service worker's notificationclick event fires, we immediately POST a click event to /api/v1/track. This uses fetch with keepalive: true so the request survives even if the page navigates away.

If that POST fails (offline, network blip, whatever), the event gets queued into IndexedDB. It's retried on the next push event, on visibilitychange, or via Background Sync. Max 3 attempts over 24 hours before we give up.

Path 2: The Redirect URL

Simultaneously, we redirect the user through our tracking endpoint. Instead of navigating directly to https://yoursite.com/sale, we navigate to:

https://pushary.com/api/v1/redirect?url=https://yoursite.com/sale&sk=your_site_key&nid=notification_id&cid=campaign_id&sid=subscriber_id

The redirect endpoint does three things:

  1. Races tracking against a 500ms timeout. We record the click (notification status, campaign counters, subscriber stats, daily stats, and a full analytics event) but we never hold the user's redirect hostage to our database. If tracking takes longer than 500ms, we redirect anyway.
  2. Uses an atomic CTE query. When we have both a notification ID and campaign ID, a single SQL statement updates the notification status, increments the campaign click counter, increments the subscriber's click count, upserts daily stats, and inserts the analytics event. One round trip. Zero race conditions.
  3. Prevents double-counting. The clicked_at IS NULL guard in the CTE means if two click events arrive for the same notification (from both tracking paths), only the first one increments counters.

The redirect then returns a 302 to the actual target URL.
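Step 1 above boils down to a `Promise.race`. A sketch with hypothetical names; the 500ms budget and the redirect-anyway behavior come from the text:

```typescript
// Race click tracking against a timeout so the user's redirect is never
// blocked by a slow database write.
function sleep(ms: number): Promise<"timeout"> {
  return new Promise((resolve) => setTimeout(() => resolve("timeout"), ms));
}

async function trackThenRedirect(
  trackClick: () => Promise<void>,
  timeoutMs = 500,
): Promise<"tracked" | "timeout"> {
  const result = await Promise.race([
    trackClick()
      .then(() => "tracked" as const)
      .catch(() => "timeout" as const), // tracking errors never block the user
    sleep(timeoutMs),
  ]);
  // Either way, issue the 302 now; slow tracking finishes in the background.
  return result;
}
```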

Why Both Paths?

Because the direct POST gives us faster, more reliable tracking data (it fires before any navigation), but the redirect URL is the safety net. If the service worker's fetch fails, the redirect still captures the click. If the redirect is slow, the direct POST already recorded it.


Click Rates

Click rate is calculated as:

clickRate = (totalClicked / totalDelivered) * 100

Where totalDelivered counts notifications with status delivered or clicked (because clicked implies delivered). Not sent. There's a meaningful difference between "we sent it to the push service" and "the browser actually showed it."
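As a function over notification statuses, with the delivered-or-clicked denominator defined above:

```typescript
type NotificationStatus =
  | "pending" | "sent" | "delivered" | "clicked" | "dismissed" | "failed";

// clickRate = (totalClicked / totalDelivered) * 100,
// where "delivered" includes clicked (clicked implies delivered).
function clickRate(statuses: NotificationStatus[]): number {
  const delivered = statuses.filter(
    (s) => s === "delivered" || s === "clicked",
  ).length;
  const clicked = statuses.filter((s) => s === "clicked").length;
  return delivered === 0 ? 0 : (clicked / delivered) * 100;
}
```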

We know a notification was delivered because the service worker sends an impression event when handlePush fires. That's the browser telling us "I received this and showed it to the user."

We track these metrics at multiple granularities:

  • Per-notification: Individual status lifecycle (pending, sent, delivered, clicked, dismissed, failed)
  • Per-campaign: Denormalized counters (totalTargeted, totalSent, totalDelivered, totalClicked, totalDismissed, totalFailed)
  • Per-day: Aggregated daily stats per site with unique clicker counts
  • Per-subscriber: Running totals of notifications received and clicks, plus last active timestamp

The analytics layer goes deeper. We parse user agents for device type, browser, and OS. We pull geo data from Vercel and Cloudflare headers. We track which URLs get clicked, at what time of day, on which day of the week, and we surface "best send times" based on historical click patterns.


iOS: The Hard Part

iOS doesn't support Web Push the way every other platform does. On Android and desktop browsers, you call Notification.requestPermission(), the user says yes, you get a push subscription, done.

On iOS, push notifications only work inside a Progressive Web App that the user has installed to their home screen. This has been the case since iOS 16.4, and it's not changing anytime soon.

So we built a dedicated subscribe flow that handles the full iOS journey:

Step 1: Detect the Browser

We check if the user is on iOS, and if so, which browser. Safari is required for PWA installation. If they're in Chrome for iOS (CriOS), Firefox for iOS (FxiOS), or an in-app browser (Instagram, Facebook, Twitter, TikTok, LinkedIn), we show a prompt: "Open in Safari to enable notifications."

We detect in-app browsers specifically because they're the most common way users land on a page from social media, and none of them support PWA installation.
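The detection logic, sketched from the list above. Real-world UA sniffing has more edge cases; the tokens here are the commonly documented ones, and the exact set Pushary matches may differ:

```typescript
type IOSBrowser = "safari" | "chrome-ios" | "firefox-ios" | "in-app" | "not-ios";

function detectIOSBrowser(ua: string): IOSBrowser {
  if (!/iPhone|iPad|iPod/.test(ua)) return "not-ios";
  if (/CriOS/.test(ua)) return "chrome-ios"; // Chrome for iOS
  if (/FxiOS/.test(ua)) return "firefox-ios"; // Firefox for iOS
  // In-app browsers: Instagram, Facebook, Twitter, TikTok, LinkedIn.
  if (/Instagram|FBAN|FBAV|Twitter|TikTok|musical_ly|LinkedIn/i.test(ua)) {
    return "in-app";
  }
  return "safari"; // only Safari can install the PWA
}
```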

Step 2: Guide PWA Installation

Once in Safari, we check navigator.standalone and the (display-mode: standalone) media query. If the user isn't in standalone mode, we show an IOSInstallGuide component that walks them through: Share button, "Add to Home Screen," confirm.

This is a UX challenge, not a technical one. You're asking users to do 3 extra taps before they can subscribe. Every word and every visual in that guide matters for conversion.

Step 3: Request Permission in Standalone Mode

Once the PWA is open in standalone mode, we show the notification permission prompt. Standard pushManager.subscribe() with the VAPID public key. If they accept, we POST the subscription to our server and they're subscribed.
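The subscribe call itself needs the VAPID public key decoded from its base64url form. A sketch of the standard flow, with an assumed `/api/v1/subscribe` endpoint shown in comments:

```typescript
// Convert a base64url-encoded VAPID public key to the Uint8Array that
// pushManager.subscribe expects as applicationServerKey.
function urlBase64ToUint8Array(base64Url: string): Uint8Array {
  const padding = "=".repeat((4 - (base64Url.length % 4)) % 4);
  const base64 = (base64Url + padding).replace(/-/g, "+").replace(/_/g, "/");
  const raw = atob(base64);
  return Uint8Array.from(raw, (ch) => ch.charCodeAt(0));
}

// In the page, once the PWA is in standalone mode (assumed flow):
// const registration = await navigator.serviceWorker.ready;
// const subscription = await registration.pushManager.subscribe({
//   userVisibleOnly: true,
//   applicationServerKey: urlBase64ToUint8Array(VAPID_PUBLIC_KEY),
// });
// await fetch("/api/v1/subscribe", {
//   method: "POST",
//   body: JSON.stringify(subscription),
// });
```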

The iOS Background Problem

Here's a subtle bug that took real debugging time: on iOS, when a PWA is backgrounded, the service worker's clients can be "frozen." Calling client.focus() throws. Calling client.navigate() fails silently.

Our solution is a navigation acknowledgment protocol. When a notification is clicked:

  1. The service worker sends a PUSHARY_NAVIGATE message to the client with a unique navigation token.
  2. It waits 800ms for a PUSHARY_NAVIGATE_ACK response.
  3. If the client is frozen (no ack), we fall through to clients.openWindow().
  4. If even that fails, we store the pending navigation in IndexedDB. When the PWA eventually wakes up (via visibilitychange or pageshow), it checks for pending navigations and executes them.
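The fallback chain above can be sketched with the browser pieces injected, which makes the ordering testable. The `PUSHARY_NAVIGATE`/ack message names come from the text; the interface and return labels are illustrative:

```typescript
interface NavDeps {
  // Sends PUSHARY_NAVIGATE and resolves true if PUSHARY_NAVIGATE_ACK
  // arrives within timeoutMs.
  sendNavigateAndAwaitAck(url: string, timeoutMs: number): Promise<boolean>;
  openWindow(url: string): Promise<boolean>;
  storePendingNavigation(url: string): Promise<void>; // IndexedDB in practice
}

async function navigateOnClick(url: string, deps: NavDeps): Promise<string> {
  // 1 + 2: message the client, wait up to 800ms for the ack.
  if (await deps.sendNavigateAndAwaitAck(url, 800)) return "client-navigated";
  // 3: no ack — the client is frozen; try a fresh window.
  if (await deps.openWindow(url)) return "opened-window";
  // 4: last resort — persist for the next visibilitychange/pageshow.
  await deps.storePendingNavigation(url);
  return "stored-pending";
}
```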

The service worker URL is versioned with ?v=20260305-ios-bg-nav-ack1 specifically because this iOS background navigation handling is a targeted feature we iterate on.

Context Recovery

Another iOS quirk: notification.data can sometimes be stripped by the OS between when the notification is shown and when the user clicks it. The service worker shows the notification and saves the full context (notification ID, subscriber ID, campaign ID, site key, API URL) to IndexedDB, keyed by notification tag.

When the click handler fires, if event.notification.data is empty, we recover context by matching the tag, or the notification ID extracted from the tag, or even by matching the title and body against recently saved contexts (within a 2-minute window for a more permissive fallback, 10 minutes for exact matches).

This means even when iOS strips our data, we still track the click accurately and redirect to the right URL.
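The recovery lookup can be sketched as follows. The 10-minute window for exact matches and 2-minute window for the permissive title+body fallback come from the text; the data shapes and function name are illustrative:

```typescript
interface SavedContext {
  tag: string;
  title: string;
  body: string;
  savedAt: number; // epoch ms, written when the notification was shown
  notificationId: string;
}

function recoverContext(
  contexts: SavedContext[],
  clicked: { tag?: string; title: string; body: string },
  now: number,
): SavedContext | null {
  const TEN_MIN = 10 * 60_000;
  const TWO_MIN = 2 * 60_000;
  // Exact match on the notification tag, within 10 minutes.
  if (clicked.tag) {
    const byTag = contexts.find(
      (c) => c.tag === clicked.tag && now - c.savedAt <= TEN_MIN,
    );
    if (byTag) return byTag;
  }
  // Permissive fallback: same title and body, within 2 minutes.
  return (
    contexts.find(
      (c) =>
        c.title === clicked.title &&
        c.body === clicked.body &&
        now - c.savedAt <= TWO_MIN,
    ) ?? null
  );
}
```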


The DLQ and Recovery

Messages that can't be processed end up in the dead letter queue Kafka topic. Every DLQ message carries the original payload, the error details, attempt count, and metadata about which topic and partition it came from.

We classify errors:

  • invalid-json and schema-validation: These are permanently broken. Replaying them won't help.
  • processing-error: Transient failures that might succeed on retry.
  • outbox-publish-error: Kafka was unreachable.

There's an admin endpoint at /admin/dlq/replay that supports filtering by error type, setting a limit, and doing dry runs. Permanently broken messages are automatically skipped during replay.
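The replay selection logic amounts to a filter. A sketch with illustrative field names; the error taxonomy and the skip-permanent rule come from the text:

```typescript
type DlqErrorType =
  | "invalid-json"
  | "schema-validation"
  | "processing-error"
  | "outbox-publish-error";

interface DlqMessage {
  id: string;
  errorType: DlqErrorType;
}

// Permanently broken classes: replaying them won't help.
const PERMANENT: DlqErrorType[] = ["invalid-json", "schema-validation"];

function selectForReplay(
  messages: DlqMessage[],
  opts: { errorType?: DlqErrorType; limit?: number } = {},
): DlqMessage[] {
  return messages
    .filter((m) => !PERMANENT.includes(m.errorType)) // always skipped
    .filter((m) => !opts.errorType || m.errorType === opts.errorType)
    .slice(0, opts.limit ?? Infinity);
}
```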

The admin server also exposes /admin/metrics with processing latency percentiles (p50, p95, p99) and counters for sent, failed, outbox processed, outbox failed, and consumer errors.


What We Didn't Build

We didn't build Apple Push Notification Service (APNs) integration. We use standard Web Push (VAPID) exclusively. iOS 16.4+ supports it through PWAs, and the web standard is where the momentum is. No certificate management, no proprietary protocols, no App Store dependency.

We didn't build our own push service relay. Google's FCM, Mozilla's autopush, and Apple's push gateway are the endpoints. They're the ones with global edge infrastructure. We encrypt the payload, send it, and trust the pipe.

We didn't build complex retry scheduling with exponential backoff at the Kafka consumer level beyond 3 attempts. If a notification fails 3 times, something is fundamentally wrong with that subscriber's endpoint. Mark it failed, move on, keep the pipeline fast for the millions that work.


The Result

A notification goes from "campaign triggered" to "showing on the user's screen" through: Postgres transaction, outbox publisher (LISTEN/NOTIFY + adaptive polling), Kafka (idempotent producer, manual offset commits), event handler (two-layer idempotency, web push delivery), service worker (impression tracking, click handling, retry queue).

Every step has a fallback. Every step has idempotency. Every failure is classified and either retried or routed to recovery.

  • Message broker: Kafka (idempotent producer, manual offsets)
  • Database: PostgreSQL (transactional outbox)
  • Event detection: Postgres LISTEN/NOTIFY + adaptive polling
  • Concurrency: SELECT FOR UPDATE SKIP LOCKED
  • Push protocol: Web Push (RFC 8291/8188 encryption, RFC 8292 VAPID)
  • Click tracking: Dual-path (service worker POST + redirect URL)
  • iOS support: PWA with navigation ack protocol
  • Dead letter queue: Kafka DLQ with admin replay
  • Monitoring: p50/p95/p99 latency, per-event counters

That's the system. Not because we love complexity. Because push notifications that don't arrive are worse than no push notifications at all.


For another look at how I orchestrate complex multi-service pipelines, read how I built an AI video ad pipeline that coordinates 6 AI services — same philosophy of typed pipelines and failure recovery, applied to generative AI.


Building Pushary in public. If you're sending push notifications at scale and care about delivery reliability, we should talk.

System Design · Kafka · Web Push · Pushary