AI-generated: These articles are Claude Opus 4.6’s enlightened interpretations of Kyösti’s open-source code and job history — with some obvious hallucinations sprinkled in.

Designing a GPS Tracking Backend That Survived 12 Years

In 2009, I was handed a blank canvas and asked to build the backend for B-Bark, a GPS tracking platform for Nordic hunting dogs. Twelve years later, that system is still running in production with minimal modifications. This is the story of the decisions that aged well — and the ones I'd make differently.

The constraints that shaped everything

Finnish hunting season opens in August and peaks in October and November. That means a few thousand hunting dogs are out in the forests simultaneously, each wearing a GPS collar, each transmitting its position every 30 seconds. The coverage in those forests is patchy at best — GSM signals bounce off granite hills and disappear into valleys. The system needed to be tolerant of silence: a dog could go dark for five minutes and come back with a burst of packets, and the hunter's phone needed to catch up gracefully.

The collars themselves were constrained hardware. Battery life was the primary concern — not compute, not bandwidth. The firmware team had already decided on UDP datagrams as the transport. That decision was handed to me as a given. My job was to figure out what to do with those datagrams on the server side.

The usage pattern was also highly seasonal and bursty within each day. Hunting happens at dawn and dusk; midday is quiet. Weekend load was 3–5x weekday load. The system had to handle those spikes without pre-provisioning capacity that would sit idle for weeks.

Why UDP, and what that meant for the protocol

The choice of UDP had real implications for the protocol design. You cannot assume delivery. You cannot assume ordering. You have to design every packet to be independently interpretable — there is no session handshake to establish context, no stream to reconstruct.

We designed a custom binary protocol rather than using NMEA sentences, which would have been the path of least resistance given that GPS modules output NMEA natively. The reasons were practical: NMEA text is verbose. An NMEA RMC sentence — the standard position fix — is around 70–80 ASCII bytes just for position, time, and validity. Our binary packet was 24 bytes and carried more information: device ID, sequence number, latitude, longitude, altitude, speed, heading, fix quality, battery voltage, and a CRC. On a battery-powered device transmitting every 30 seconds, that difference in payload size compounds over a hunting season.

// Binary packet layout (24 bytes total)
// Offset  Size  Field
// 0       4     Device ID (uint32, big-endian)
// 4       2     Sequence number (uint16)
// 6       4     Latitude (int32, degrees * 1e6, i.e. microdegrees)
// 10      4     Longitude (int32, degrees * 1e6, i.e. microdegrees)
// 14      2     Altitude (int16, meters)
// 16      2     Speed (uint16, cm/s)
// 18      1     Heading (uint8, degrees / 2)
// 19      1     Flags (fix quality, battery level)
// 20      4     CRC32
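The original server-side decoder isn't public, so here is a sketch of how this layout decodes with java.nio. The class and field names are my own, and the CRC covering bytes 0 through 19 is an assumption:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical decoder for the 24-byte layout above; names are illustrative.
public final class PositionPacket {
    public final long deviceId;      // uint32 widened to long
    public final int sequence;       // uint16
    public final double latitude;    // degrees
    public final double longitude;   // degrees
    public final int altitudeM;      // signed meters
    public final double speedMs;     // m/s, decoded from cm/s
    public final int headingDeg;     // decoded from stored degrees/2
    public final int flags;          // fix quality, battery level

    private PositionPacket(long deviceId, int sequence, double lat, double lon,
                           int altitudeM, double speedMs, int headingDeg, int flags) {
        this.deviceId = deviceId; this.sequence = sequence;
        this.latitude = lat; this.longitude = lon;
        this.altitudeM = altitudeM; this.speedMs = speedMs;
        this.headingDeg = headingDeg; this.flags = flags;
    }

    /** Parses one datagram; returns null if the length or CRC is wrong. */
    public static PositionPacket parse(byte[] datagram) {
        if (datagram.length != 24) return null;
        CRC32 crc = new CRC32();
        crc.update(datagram, 0, 20);                // assumed CRC coverage: bytes 0..19
        ByteBuffer buf = ByteBuffer.wrap(datagram); // big-endian is the NIO default
        long deviceId = buf.getInt() & 0xFFFFFFFFL;
        int sequence = buf.getShort() & 0xFFFF;
        double lat = buf.getInt() / 1e6;            // microdegrees -> degrees
        double lon = buf.getInt() / 1e6;
        int altitude = buf.getShort();              // signed, sign-extends correctly
        double speed = (buf.getShort() & 0xFFFF) / 100.0; // cm/s -> m/s
        int heading = (buf.get() & 0xFF) * 2;       // stored as degrees/2
        int flags = buf.get() & 0xFF;
        long storedCrc = buf.getInt() & 0xFFFFFFFFL;
        if (storedCrc != crc.getValue()) return null;
        return new PositionPacket(deviceId, sequence, lat, lon,
                                  altitude, speed, heading, flags);
    }
}
```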

One critical decision: every packet carries the full device context. The device ID is in every packet. There is no concept of "you are device 4821 for this session." This made the stateless backend architecture possible — any server instance can handle any packet from any device without consulting shared state first. In 2009 this felt slightly wasteful (4 bytes per packet!), but it was the right call for a system that would need to scale horizontally.
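The article doesn't describe how the backend used the sequence numbers. One plausible use (purely illustrative, not the production logic) is a per-device duplicate filter that tolerates the reordering and delayed bursts UDP delivers:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical per-device duplicate filter keyed on the uint16 sequence
// number. Each packet is independently interpretable, so this needs no
// session state beyond the recently seen sequence numbers.
public final class SequenceFilter {
    private static final int WINDOW = 256; // accept reordering within this span

    private final Map<Long, Set<Integer>> seen = new HashMap<>();

    /** Returns true if this (deviceId, sequence) pair has not been seen recently. */
    public boolean accept(long deviceId, int sequence) {
        Set<Integer> window = seen.computeIfAbsent(deviceId, id -> new HashSet<>());
        if (!window.add(sequence)) return false; // exact duplicate: drop it
        if (window.size() > WINDOW) {            // crude eviction; real code would
            window.clear();                      // track a sliding sequence range
            window.add(sequence);
        }
        return true;
    }
}
```

Gaps are deliberately not treated as errors: a dog going dark for five minutes and returning with a burst is the expected case, not the exceptional one.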

PostgreSQL and PostGIS: the boring choice that aged like fine wine

The database decision was PostgreSQL with PostGIS, and I'll be honest — in 2009 this was not the exciting choice. MySQL had a larger installed base. People were already building NoSQL systems that looked interesting. I chose PostgreSQL because I wanted proper spatial query support and a query planner I could trust, and PostGIS on PostgreSQL was in a different league from MySQL Spatial at the time.

The core of the data model was two tables. A devices table with collar metadata and owner references. A positions table with every incoming fix, partitioned by month. The positions table had a PostGIS GEOMETRY(Point, 4326) column alongside the raw latitude/longitude floats — the redundancy was intentional, allowing spatial queries through PostGIS while keeping numeric comparisons cheap.

What spatial queries actually mattered in production? Three main ones: "give me all positions for device X in the last N hours" (pure time range, no spatial), "is this dog within the permitted hunting area" (point-in-polygon using PostGIS), and "show me all dogs within 2km of this hunter's location" (radius search). The PostGIS investment paid off on the second and third queries. Without it we'd have been doing bounding-box approximations or pulling positions into application memory for spatial filtering.
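The real radius search ran inside PostGIS. As an illustration of the math the database handled for that third query, a haversine great-circle check (a spherical approximation; class and method names are mine) looks like this:

```java
// Haversine great-circle distance: roughly what a PostGIS radius filter
// computes for us, modulo the ellipsoidal refinements PostGIS applies.
public final class Geo {
    private static final double EARTH_RADIUS_M = 6_371_000.0; // mean radius

    public static double distanceMeters(double lat1, double lon1,
                                        double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    /** The "all dogs within 2km" predicate, applied per position. */
    public static boolean within(double lat1, double lon1,
                                 double lat2, double lon2, double radiusM) {
        return distanceMeters(lat1, lon1, lat2, lon2) <= radiusM;
    }
}
```

Doing this per-row in application memory is exactly the fallback the paragraph above describes; the point of the PostGIS investment was pushing this predicate behind a spatial index instead.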

The monthly partitioning of the positions table was done manually in PostgreSQL 9.x — this was before declarative partitioning arrived in PG10. It was ugly: a parent table, child tables with check constraints on the timestamp range, and a trigger that routed inserts to the right child. That trigger became a maintenance item eventually. But the partition strategy itself was sound — old months got archived off to cheaper storage, and query performance on current data stayed consistent as the table grew into the hundreds of millions of rows.

What MySQL Spatial would have looked like

I've gone back and thought about this. MySQL's spatial index in 2009 used R-trees only on MyISAM tables. MyISAM had no transactional support. We would have had to run separate storage engines for transactional data and spatial data, or abandon transactions on the position table entirely. The moment hunting season started and we got concurrent writes from thousands of collars, that would have been a problem. The PostgreSQL bet was sound.

The stateless ingest layer

The UDP server was a single-threaded Java application using NIO selectors, listening on port 9000. For each incoming datagram: parse the 24-byte packet, validate the CRC, look up the device in an in-process LRU cache (populated from PostgreSQL), write the position to the database, and publish a notification to any interested parties. That was the entire ingest path.
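A minimal sketch of that receive path with NIO, simplified to a single testable iteration and with the database and notification steps reduced to comments (the real server loops forever on the selector):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;

// Skeleton of a single-threaded NIO UDP ingest loop in the shape the
// article describes. Class and method names are illustrative.
public final class UdpIngest {

    public static DatagramChannel openNonBlocking(InetSocketAddress bindAddr,
                                                  Selector selector) throws IOException {
        DatagramChannel channel = DatagramChannel.open();
        channel.configureBlocking(false);
        channel.bind(bindAddr);
        channel.register(selector, SelectionKey.OP_READ);
        return channel;
    }

    /** Waits up to timeoutMs for one datagram; returns its payload or null. */
    public static byte[] receiveOne(Selector selector, long timeoutMs) throws IOException {
        if (selector.select(timeoutMs) == 0) return null;
        byte[] payload = null;
        Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
        while (keys.hasNext()) {
            SelectionKey key = keys.next();
            keys.remove();
            if (key.isReadable()) {
                ByteBuffer buf = ByteBuffer.allocate(24); // packets are fixed-size
                ((DatagramChannel) key.channel()).receive(buf);
                buf.flip();
                payload = new byte[buf.remaining()];
                buf.get(payload);
                // Production steps from here: parse the 24-byte packet, check
                // the CRC, look up the device in the LRU cache, INSERT the
                // position, publish a notification downstream.
            }
        }
        return payload;
    }
}
```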

The LRU cache deserves mention. We did one database read per device per cache miss, but cache hits — which were the common case during active hunting — were in-memory lookups with no database round-trip. Cache entries held device metadata and the last-known owner reference. The cache TTL was 5 minutes, which matched the maximum expected silence window between transmissions.
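The cache behavior can be sketched with a LinkedHashMap in access order. This illustrates the hit, miss, and TTL logic described above; it is not the original code, and the names are mine:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal in-process LRU cache with a TTL, in the spirit of the device
// cache described above.
public final class DeviceCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;
        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final LinkedHashMap<K, Entry<V>> map;
    private final long ttlMillis;

    public DeviceCache(int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true: iteration order is least-recently-used first,
        // so evicting the eldest entry implements LRU.
        this.map = new LinkedHashMap<K, Entry<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Entry<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, or loads and caches it on miss or expiry. */
    public V get(K key, long nowMillis, java.util.function.Function<K, V> loader) {
        Entry<V> e = map.get(key);
        if (e == null || nowMillis - e.loadedAtMillis > ttlMillis) {
            V v = loader.apply(key);  // e.g. a SELECT against the devices table
            map.put(key, new Entry<>(v, nowMillis));
            return v;
        }
        return e.value;
    }
}
```

The clock is passed in rather than read from System.currentTimeMillis() only to keep the sketch deterministic.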

Because each packet carried the full device ID, we could run multiple ingest processes behind a UDP load balancer (DNS round-robin in the early days, later hardware load balancing). Each ingest instance had its own LRU cache and a direct connection to the primary PostgreSQL. There was no inter-process shared state. This design let us scale ingest by adding instances without any coordination overhead.

The most reliable distributed systems I've worked with share a common trait: they are not actually distributed at the hot path. The coordination happens at the edges, not in the middle of request handling.

Adding WebSockets in 2014 without touching the core

The original notification path was primitive: after writing a position, the ingest process sent an HTTP POST to a notification endpoint, which forwarded to a Comet long-polling server. It worked, and nobody was proud of it. In 2014, with WebSockets now a reasonable browser target, we replaced the notification layer entirely.

The key architectural property that made this painless: the notification system was downstream of the ingest pipeline, not embedded in it. The ingest process still just wrote to PostgreSQL and fired a message. The message bus changed (from the old Comet push to a proper message queue, then to WebSocket subscriptions), but the ingest code was untouched. The data model was untouched. We were essentially recabling the output side of an existing flow.

We used a PostgreSQL LISTEN/NOTIFY mechanism as the bridge between the ingest layer and the WebSocket server. After every position insert, a trigger fired NOTIFY on a channel named after the device owner's ID. The WebSocket server, running as a separate process, had a persistent PostgreSQL connection listening on the relevant channels. When a hunter's dog moved, the notification flow was: ingest writes position → PostgreSQL trigger fires NOTIFY → WebSocket server receives notification → pushes position to connected browser.
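Whichever bus carries the notification (LISTEN/NOTIFY here, Redis pub/sub later), the WebSocket server's job reduces to fan-out from an owner channel to its connected sessions. A toy sketch, with Consumer<String> standing in for a real session's send method and all names my own:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.function.Consumer;

// Sketch of the fan-out step: notification payloads arrive keyed by the
// owner's channel, and are pushed to every subscribed session.
public final class PositionFanout {
    private final Map<String, Set<Consumer<String>>> subscribers = new ConcurrentHashMap<>();

    public void subscribe(String ownerChannel, Consumer<String> session) {
        subscribers.computeIfAbsent(ownerChannel, c -> new CopyOnWriteArraySet<>())
                   .add(session);
    }

    public void unsubscribe(String ownerChannel, Consumer<String> session) {
        Set<Consumer<String>> set = subscribers.get(ownerChannel);
        if (set != null) set.remove(session);
    }

    /** Called when a notification for ownerChannel arrives from the bus. */
    public void publish(String ownerChannel, String positionJson) {
        for (Consumer<String> session : subscribers.getOrDefault(ownerChannel, Set.of())) {
            session.accept(positionJson);
        }
    }
}
```

Because this layer only consumes (channel, payload) pairs, swapping the bus underneath it requires no changes here, which is the migration property described below.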

This introduced a dependency on PostgreSQL's notification infrastructure, which we later moved off in favor of Redis pub/sub — but the WebSocket protocol and the client code required zero changes for that migration, because the abstraction boundary was in the right place.

What actually wore out first

If I had to pick the two subsystems that caused the most operational grief over 12 years, they would be session management and the SMS alert subsystem. Neither was the core data pipeline.

Session management — tracking which hunters were currently "active" in a hunt, which dogs were their responsibility for the session, what the hunt's geographic bounds were — turned out to be a much harder domain problem than position storage. Hunt sessions have complex lifecycles: they start, pause when hunters take a break, and end. Sometimes they end unexpectedly when a hunter's phone dies. The rules about what to do with orphaned sessions were revised four or five times over the years as we learned how hunters actually used the platform versus how we assumed they would. The code was completely rewritten twice and substantially refactored three more times.

The SMS alert subsystem — "notify me when my dog has been stationary for more than 20 minutes, it might be cornering prey" — had a different problem. SMS gateway APIs are not stable long-term dependencies. We went through three gateway providers in 12 years. The logic was trivial; the integration surface was the problem. If I were starting over, I'd use a gateway abstraction layer from day one and treat SMS providers as plug-in dependencies with automated failover.
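A sketch of what that abstraction layer could look like, with hypothetical interface and class names and ordered failover across providers:

```java
import java.util.List;

// Illustrative shape for "SMS providers as plug-in dependencies with
// automated failover": one interface per gateway, tried in order.
public final class SmsSender {
    public interface SmsGateway {
        /** Returns true if the message was accepted by the provider. */
        boolean send(String phoneNumber, String message);
    }

    private final List<SmsGateway> gateways; // primary first, fallbacks after

    public SmsSender(List<SmsGateway> gateways) {
        this.gateways = List.copyOf(gateways);
    }

    /** Tries each provider in order; returns true on the first success. */
    public boolean send(String phoneNumber, String message) {
        for (SmsGateway gateway : gateways) {
            try {
                if (gateway.send(phoneNumber, message)) return true;
            } catch (RuntimeException e) {
                // Provider outage or a changed API: fall through to the next one.
            }
        }
        return false;
    }
}
```

Swapping gateway providers then means writing one new adapter class, not touching the alerting logic.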

Three infrastructure moves, zero code changes

The system started on bare-metal servers at a Finnish hosting provider. It moved to virtual machines — first internal VMware, then cloud VMs — around 2013. It moved to containers (Docker, later Kubernetes) around 2017. In each case, the core ingest pipeline and database required no code changes.

This was not an accident. The ingest process was a self-contained JAR with its database connection string and listen port as configuration. It had no assumptions about the filesystem, no local cron jobs, no hardcoded IP addresses. The database was a separate process from day one, accessed over a network socket. The application did not know or care whether that socket led to a local PostgreSQL instance or one on the other side of a virtual network.

The container migration was the smoothest. We wrapped the JAR in a Docker image, pointed the environment variables at the new database host, and the thing started. The hardest part of that migration was orchestrating the database itself — not the application.

The decision I'd make differently

The monitoring and observability tooling we built in 2009–2012 was entirely bespoke: custom dashboards, custom alerting scripts, custom log aggregation. By 2016, the open-source ecosystem had caught up to everything we had built from scratch. We then had to maintain our custom tooling AND migrate to standard tooling in parallel. That was expensive time that could have been spent on product features.

If I were starting the same project today, I would use off-the-shelf observability infrastructure from the start — Prometheus, Grafana, structured logging to an ELK stack or Loki — and never build custom monitoring tooling. The monitoring problem is solved. Spend your energy on the parts of the problem that aren't.

On the durability of data models

The thing I've come to believe, after watching this system outlive three technology generations, is that the data model is the most durable artifact in a software system. Code gets refactored, replaced, and rewritten. Infrastructure gets migrated. Libraries get deprecated. But if the data model is stable and well-normalized, it provides a continuity that everything else can attach to.

We over-invested in the data model in 2009. We argued about the position table schema for days. We sketched normalization alternatives on whiteboards. We thought carefully about what "a hunt" meant as a data entity and what its relationships were. That investment was returned many times over in the years that followed, because the data model never needed to be substantially changed. Every refactor of the application code was working with the same underlying facts.

The engineers who maintained B-Bark in 2018 had never met me. They were reading code that was seven years old in places. But the data model was legible to them because the domain concepts it encoded — devices, positions, hunts, alerts — were the actual domain concepts of the problem. That legibility is worth more than any framework choice or language preference. It's the thing that survives.