SQLite as an IoT Edge Cache: Lessons Learned
Every IoT deployment I've worked on has had the same requirement buried in the spec: "devices should continue operating during network outages." For years I reached for custom binary formats or Redis replicas at the edge. Eventually I found my way back to SQLite, and I'm not leaving.
The edge caching problem, stated precisely
The canonical version of this problem, which I've encountered in almost identical form across GPS collars, fleet telemetry units, and building automation controllers, goes like this: a device collects measurements at a fixed rate — say, one GPS fix per five seconds, or one sensor reading per thirty seconds. Under normal conditions, these measurements are transmitted to a backend immediately or with very short latency. Under abnormal conditions — GSM signal loss in a forest, cellular dead zone in a tunnel, temporary server unavailability — the device must buffer measurements locally and deliver them in order when connectivity is restored.
The GPS collar version is particularly well-defined. A hunting dog wearing a B-Bark collar enters a dense spruce forest and loses GSM signal. The collar continues acquiring GPS fixes at five-second intervals. Over a four-hour hunt, this produces 2,880 fixes — each with timestamp, latitude, longitude, altitude, speed, fix quality, and battery voltage, totalling roughly 200 bytes per record. 2,880 records at 200 bytes is about 576 kilobytes of data. The device has a 32MB flash partition available for buffering. Space is not the constraint; correctness is.
When the dog emerges from the forest and the collar regains GSM connectivity, it must replay those 2,880 fixes to the backend in creation order. The backend must receive them, store them, and produce a correct track reconstruction without knowing or caring that they were delivered in a batch four hours after they were recorded. It must also handle the case where the collar delivers the batch twice — either because the first delivery partially succeeded and the collar isn't sure which records were accepted, or because the retry logic has a bug.
Why not a flat file
The obvious naive implementation is a flat binary file: each record appended in sequence, a pointer in flash memory indicating where unsent records begin. I have implemented this. It has problems.
First, there's no random access for replay-from-point. If the backend accepts the first 500 records of a batch and then becomes unavailable, the collar needs to resume from record 501. In a flat file, finding record 501 requires either maintaining an external offset register (adding state that must be managed, persisted across power cycles, and reconciled with the file) or scanning from the beginning to count records. Neither is clean.
Second, marking records as "sent" without rewriting the file requires either a parallel bitmap (more external state) or a tombstone marker in each record (complicating the format and requiring scanning). Rewriting the file to compact sent records is an expensive operation on flash storage that also wears the medium.
Third, concurrent access from a second process causes corruption. If the main firmware process is appending fixes while a background sync process is reading from the file, the sync process may read a partially written record at the tail. Coordinating this with file locks on an embedded Linux system is doable but fragile, and fragile in ways that will manifest in the field at the worst possible time.
Why not Redis at the edge
Redis is a fine system for many things. It is not a fine system for embedded Linux edge caching. The resource overhead is disproportionate for constrained hardware — Redis in production uses tens to hundreds of megabytes of RAM. Persistence modes (RDB and AOF) add complexity. The Redis process is one more thing to keep alive via a supervisor. The client libraries add a network round-trip even for local connections. And if the Redis process dies unexpectedly during a write — power loss is the relevant scenario here — the recovery semantics require careful thought about what's actually in the RDB snapshot versus the AOF log at the moment of failure.
None of these are dealbreakers in isolation. Together, they add up to a solution that's more infrastructure than problem for what is ultimately a "write records, read records in order, mark records as sent" workload.
Why SQLite
SQLite is a single file. It requires zero configuration. It provides ACID transactions with well-understood failure semantics. It is available as a system library on essentially every embedded Linux target, which means no cross-compilation headaches. The file format has been stable for decades. The source code is in the public domain and has been more thoroughly tested than nearly any other software of similar complexity.
For the edge caching workload, specifically, WAL (Write-Ahead Logging) mode matters. In default rollback journal mode, SQLite holds an exclusive lock on the database during writes, which means readers block. In WAL mode, writers don't block readers and readers don't block writers. This allows the firmware's data collection process and the sync process to access the database concurrently without explicit coordination between them. The WAL file handles the concurrency at the database level.
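The non-blocking behaviour is easy to demonstrate with two connections to the same file. A minimal sketch using Python's stdlib sqlite3 module, purely illustrative; the locking semantics belong to SQLite itself, not to any particular binding:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "outbox.db")

# isolation_level=None puts the connection in autocommit so transactions
# can be managed explicitly.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload BLOB)")

reader = sqlite3.connect(path, isolation_level=None, timeout=0.5)

writer.execute("BEGIN IMMEDIATE")  # take the write lock...
writer.execute("INSERT INTO outbox (payload) VALUES (x'00')")

# ...and the reader still proceeds without blocking, seeing the last
# committed snapshot rather than the in-flight write.
before = reader.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

writer.execute("COMMIT")
after = reader.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(before, after)  # 0 1
```

The reader never sees a partially written record and never receives a "database is locked" error while the write transaction is open, which is exactly the property the two-process firmware layout needs.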
WAL mode also has better crash-safety characteristics for the edge IoT scenario specifically. In rollback journal mode, a crash during a write leaves behind a hot journal that SQLite must roll back the next time the database is opened; if that journal file is deleted or unwritable by the recovering process, the database can be left corrupted or inaccessible. WAL mode's recovery is simpler: uncommitted transactions in the WAL are simply not replayed, and the database file itself is always in a consistent state.
The schema I've converged on
After several iterations across different projects, the schema I reach for is deliberately minimal:
CREATE TABLE IF NOT EXISTS outbox (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at INTEGER NOT NULL,   -- Unix timestamp, milliseconds
    payload    BLOB NOT NULL,      -- serialized record (protobuf, msgpack, JSON)
    sent_at    INTEGER             -- NULL until delivered; Unix ms when confirmed
);

CREATE INDEX IF NOT EXISTS idx_outbox_unsent
    ON outbox (created_at)
    WHERE sent_at IS NULL;
The id column is the stable, monotonically increasing identifier that the backend uses for deduplication. created_at is the device-local timestamp of when the measurement was taken — this is the ordering field for replay. payload is whatever the record format is; I prefer protobuf for space efficiency on constrained flash, but the schema doesn't care. sent_at is the sync cursor: NULL means unsent, a timestamp means the record was acknowledged by the backend.
The partial index on (created_at) WHERE sent_at IS NULL keeps the unsent record query fast even when the table accumulates thousands of rows — as it will during a long outage. Without the partial index, the query scans all rows; with it, only unsent rows are in the index.
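Whether the planner actually uses the partial index can be checked with EXPLAIN QUERY PLAN. A quick sketch with Python's stdlib sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        created_at INTEGER NOT NULL,
        payload BLOB NOT NULL,
        sent_at INTEGER
    );
    CREATE INDEX idx_outbox_unsent ON outbox (created_at)
        WHERE sent_at IS NULL;
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT id, created_at, payload FROM outbox
    WHERE sent_at IS NULL ORDER BY created_at LIMIT 50
""").fetchall()

# The detail column of the plan should name the partial index
# rather than describe a full table scan.
uses_index = any("idx_outbox_unsent" in row[3] for row in plan)
print(uses_index)  # True
```

The WHERE sent_at IS NULL term in the query exactly matches the index's WHERE clause, which is what makes the partial index usable; the index also supplies the created_at ordering, so no separate sort step is needed.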
The fetch-and-mark transaction pair
The sync process runs a simple loop: fetch a batch of unsent records, send them, mark them as sent. The SQL is intentionally straightforward:
-- Fetch next batch of unsent records, ordered by creation time
SELECT id, created_at, payload
FROM outbox
WHERE sent_at IS NULL
ORDER BY created_at
LIMIT 50;
-- After server confirms receipt, mark records as sent in one transaction
BEGIN;
UPDATE outbox
SET sent_at = ?1 -- current Unix ms
WHERE id IN (?2, ?3, ...); -- IDs confirmed by server
COMMIT;
The fetch and the mark are separate operations, separated by the HTTP round-trip to the backend. If the device loses power between fetching a batch and marking it sent, the records remain unsent and will be fetched again on the next sync cycle. The server must handle duplicate delivery — more on this in a moment.
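The loop around these two statements is short. A sketch in Python with the stdlib sqlite3 module; send_batch is a hypothetical stand-in for the firmware's HTTP client and is assumed to return the list of IDs the server acknowledged:

```python
import sqlite3
import time

def sync_once(conn: sqlite3.Connection, send_batch) -> int:
    """One fetch-send-mark cycle; returns the number of records acknowledged."""
    rows = conn.execute(
        "SELECT id, created_at, payload FROM outbox "
        "WHERE sent_at IS NULL ORDER BY created_at LIMIT 50"
    ).fetchall()
    if not rows:
        return 0

    # send_batch (hypothetical transport) returns the IDs the server accepted;
    # if it raises, nothing is marked and the same records are retried on the
    # next cycle -- at-least-once delivery by construction.
    acked = send_batch(rows)
    if not acked:
        return 0

    placeholders = ",".join("?" for _ in acked)
    with conn:  # mark the whole batch sent in one transaction
        conn.execute(
            f"UPDATE outbox SET sent_at = ? WHERE id IN ({placeholders})",
            [int(time.time() * 1000), *acked],
        )
    return len(acked)

# Demo against an in-memory outbox with a fake transport that acks everything.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "created_at INTEGER NOT NULL, payload BLOB NOT NULL, sent_at INTEGER)"
)
for i in range(3):
    conn.execute("INSERT INTO outbox (created_at, payload) VALUES (?, x'00')", (i,))

n = sync_once(conn, lambda rows: [r[0] for r in rows])
unsent = conn.execute(
    "SELECT COUNT(*) FROM outbox WHERE sent_at IS NULL"
).fetchone()[0]
print(n, unsent)  # 3 0
```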
WAL mode in practice
Enabling WAL mode requires exactly one pragma after opening the connection:
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL; -- safe with WAL; 'FULL' is slower and unnecessary
The synchronous=NORMAL setting is correct for WAL mode on embedded Linux. FULL forces an fsync() on every commit, which is the right default for rollback journal mode but excessive under WAL: with NORMAL, a power loss can roll back the most recent commits, but it can never corrupt the database. For an outbox of telemetry records, losing the last few seconds of buffered fixes to a power cut is an acceptable trade for the significantly reduced write load on flash.
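Applied from code, this is one statement per pragma at connection open. A sketch with Python's stdlib sqlite3 module; note that journal_mode=WAL reports the mode actually in effect, so it can be verified, and that WAL is a persistent property of the database file while synchronous must be set on every connection:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "outbox.db")
conn = sqlite3.connect(path)

# journal_mode=WAL returns the resulting mode; checking it catches the
# cases where WAL cannot be enabled (e.g. certain filesystems).
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# synchronous is per-connection, so this line belongs in every open path.
conn.execute("PRAGMA synchronous=NORMAL")
print(mode)  # wal
```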
One operational note: WAL mode creates two additional files alongside the main database — outbox.db-wal and outbox.db-shm. These are normal and expected. Firmware update processes that back up the database by copying the .db file alone will produce an inconsistent backup; they must either copy all three files with writers stopped, run PRAGMA wal_checkpoint(TRUNCATE) first so the WAL is empty, or use SQLite's online backup API. This has bitten at least two projects I know of.
The sync protocol and server-side deduplication
The sync protocol is simple by design. The client sends a JSON (or protobuf) array of records, each with its id and payload. The server processes each record, stores what it needs, and returns the list of IDs it successfully accepted. The client marks those IDs as sent. If the HTTP request times out or returns an error, the client retries on the next sync cycle — the same records will be submitted again, and the server must handle this idempotently.
Server-side deduplication is straightforward: the deduplication key combines the device's hardware identifier with the record's SQLite autoincrement id, making it globally unique per device per record. The server stores incoming records in a table with a unique constraint on (device_id, record_id) and uses INSERT OR IGNORE (or equivalent) semantics. Duplicate records are silently dropped. The server then returns the IDs of all records in the batch that it holds — including any that were duplicates of previously accepted records — so the client can mark them as sent and stop retrying them.
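The server side of this, sketched with Python's stdlib sqlite3 standing in for whatever database the backend actually runs (table and column names are illustrative):

```python
import sqlite3

# Hypothetical server-side store; a real backend would use its own database.
server = sqlite3.connect(":memory:")
server.execute("""
    CREATE TABLE fixes (
        device_id TEXT NOT NULL,
        record_id INTEGER NOT NULL,
        payload   BLOB NOT NULL,
        PRIMARY KEY (device_id, record_id)
    )
""")

def accept_batch(device_id, records):
    """Insert records idempotently and return every ID in the batch the
    server now holds, duplicates included, so the client can mark them sent."""
    with server:
        server.executemany(
            "INSERT OR IGNORE INTO fixes (device_id, record_id, payload) "
            "VALUES (?, ?, ?)",
            [(device_id, rid, payload) for rid, payload in records],
        )
    return [rid for rid, _ in records]

batch = [(1, b"fix-1"), (2, b"fix-2")]
first = accept_batch("collar-42", batch)
second = accept_batch("collar-42", batch)  # retry of same batch: deduplicated
stored = server.execute("SELECT COUNT(*) FROM fixes").fetchone()[0]
print(first, second, stored)  # [1, 2] [1, 2] 2
```

The second delivery stores nothing new but still acknowledges both IDs, which is what lets the client converge after a partially successful first attempt.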
This design means the end-to-end system is at-least-once delivery with idempotent processing, which is the correct failure model for unreliable edge connectivity. Exactly-once delivery semantics require distributed coordination that is not worth the complexity for telemetry data.
SQLite in Rust on embedded Linux
For projects where the firmware is written in Rust — which is increasingly common on embedded Linux targets that have enough resources to run it — the rusqlite crate is the cleanest interface I've found. The bundled feature flag statically links SQLite into the binary, which avoids dependency on the system SQLite version and eliminates a class of deployment surprises where the system library is a different version than what was tested against.
# Cargo.toml
[dependencies]
rusqlite = { version = "0.31", features = ["bundled"] }
The bundled build adds several seconds to the compile time (SQLite is a large C codebase) but produces a self-contained binary that runs identically on any ARM Linux target regardless of the distribution's library versions. For fleet-deployed firmware where you cannot control the base OS image, this is worth the compile time.
What SQLite can't do
SQLite is not a message queue. If the requirement is guaranteed ordered delivery with backpressure — where the producer must block when the consumer is behind — SQLite is the wrong tool. It has no native notification mechanism (though you can poll), no concept of consumer groups, and no backpressure primitives. For these requirements, a proper embedded message queue (NNG, ZeroMQ with a local socket, or a POSIX message queue) is more appropriate.
SQLite also doesn't handle high-frequency concurrent writes well. If multiple processes are appending to the outbox at high rates, WAL mode helps with read concurrency but write contention still serializes. For workloads above roughly a few hundred writes per second on constrained hardware, the lock contention becomes visible in latency. The GPS collar use case — one write per five seconds — is far below this threshold. A high-frequency vibration sensor sampling at 1kHz is not.
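For completeness: the usual way high-rate producers live with this limit is to buffer samples in memory and commit them in a single transaction, paying one fsync per batch rather than one per row. A sketch with Python's stdlib sqlite3 module (buffer size illustrative):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "created_at INTEGER NOT NULL, payload BLOB NOT NULL, sent_at INTEGER)"
)

# 1,000 buffered samples flushed in one transaction: one commit (and, on a
# real file, one fsync) per batch instead of one per row.
now = int(time.time() * 1000)
buffered = [(now + i, b"\x00" * 8) for i in range(1000)]

with conn:  # single transaction for the whole batch
    conn.executemany(
        "INSERT INTO outbox (created_at, payload) VALUES (?, ?)", buffered
    )

stored = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(stored)  # 1000
```

This trades a bounded window of unpersisted samples for throughput, which is often acceptable for vibration-style data and never necessary for the one-fix-per-five-seconds collar.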
Within its intended scope — local caching of discrete records for eventual synchronization to a remote backend — SQLite remains the best solution I've found for IoT edge use cases. The zero-administration footprint, the ACID guarantees, the WAL crash safety, and the universality of deployment make it the boring correct answer, which is exactly what production embedded systems should reach for.