Cassandra at Scale: IoT Fleet Management and Big Data Analysis
Running a large IoT fleet — thousands of GPS-equipped vehicles reporting position, engine state, and sensor data several times per minute — generates a write volume that relational databases handle poorly and a query pattern that most NoSQL databases get wrong. This is the story of how Apache Cassandra became the backbone of a major fleet management operation: the data model decisions, the time-series partitioning strategy, and the hard operational lessons from running a multi-node cluster in production for years.
Why a Relational Database Wasn't Enough
The fleet management platform started on PostgreSQL. At a few hundred devices it worked fine. By the time the fleet reached several thousand vehicles — each reporting GPS position, speed, heading, engine RPM, fuel level, and a dozen digital I/O signals every 30 seconds — the single-node PostgreSQL instance was consuming its write budget on the device_events table alone. The table was growing at roughly 50 million rows per day; the autovacuum daemon was perpetually behind; query latency on historical range scans was climbing into the seconds.
The access patterns told you why PostgreSQL was struggling. Fleet telemetry is almost entirely append-only: you write each event once and rarely update it. Queries fall into two categories: near-real-time tails (show the last N events for device X) and historical range scans (show all events for device X between timestamps T1 and T2). No joins. No transactions. No complex aggregates at query time. The relational model's strength — flexible query composition, ACID transactions, join support — was providing zero benefit for this use case while imposing significant overhead: index maintenance on insert, lock contention under high write concurrency, and B-tree fragmentation on a monotonically-growing time-series table.
Apache Cassandra was designed precisely for this pattern. Its architecture — distributed, leaderless, optimised for sequential writes, with a data model that encodes access patterns into table design — fit the requirements directly. The migration was not without pain, but the fundamental match between Cassandra's design and the fleet telemetry use case made the pain worthwhile.
Cassandra's Distributed Architecture
Cassandra is a distributed wide-column store derived from Amazon Dynamo's partitioning model and Google Bigtable's data model. Each row is identified by a partition key, which determines which nodes in the cluster store that row. Within a partition, rows are ordered by a clustering key. The combined (partition key, clustering key) tuple is the primary key; there is no other index on a given table unless you explicitly create one, and secondary indexes in Cassandra carry significant caveats.
Data is replicated across multiple nodes according to a replication factor. With replication factor 3, every partition is stored on three different nodes. Writes and reads can each be tuned with a consistency level: QUORUM requires a majority of replicas to acknowledge; ONE requires only one. The CAP theorem trade-off is explicit: with ONE consistency you can survive node failures and network partitions at the cost of potentially reading stale data; with ALL consistency — or any combination where the read and write replica counts sum to more than the replication factor — you get strongly consistent reads at the cost of availability.
For fleet telemetry, write consistency of ONE was acceptable — if an event was written to one node and that node failed before replication completed, we would lose that single event. For a GPS position report arriving every 30 seconds, losing one point in a long track was tolerable. Read consistency of QUORUM meant the most recent events were almost always visible without requiring all three replicas to be healthy — not a strict guarantee with ONE writes, since 1 + 2 does not exceed the replication factor of 3, but in practice replication completed within milliseconds.
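The consistency arithmetic behind these choices can be stated directly: a read is only guaranteed to see the latest acknowledged write when every possible read replica set must overlap every possible write replica set. A small illustrative sketch (function names are ours, not driver API):

```python
def quorum(rf: int) -> int:
    """Majority of replicas for a given replication factor."""
    return rf // 2 + 1

def overlap_guaranteed(rf: int, w: int, r: int) -> bool:
    """True when any r-replica read set must intersect any w-replica
    write set, i.e. the read is guaranteed to contact at least one
    replica that acknowledged the write."""
    return r + w > rf

# RF=3: QUORUM writes + QUORUM reads overlap (2 + 2 > 3),
# while ONE writes + QUORUM reads do not (1 + 2 = 3).
```

This is why QUORUM/QUORUM is the usual recipe for read-your-writes, and why our ONE-write configuration traded that guarantee for write availability.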
Data Model: The Time-Series Partition Pattern
The most important decision in a Cassandra data model is the partition key. The partition key determines data distribution across the cluster and the maximum query granularity. A well-chosen partition key ensures even distribution (avoiding "hot" partitions that overload a single node) and matches the query access pattern (ensuring the data you need is co-located).
The naive partition key for fleet events is device_id: one partition per device, with rows inside it ordered by timestamp as the clustering column:
CREATE TABLE device_events (
    device_id uuid,
    ts        timestamp,
    lat       double,
    lon       double,
    speed     float,
    heading   smallint,
    rpm       int,
    fuel_pct  float,
    io_flags  int,
    PRIMARY KEY (device_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1};
This works for small fleets. For large ones it creates unbounded partitions: a device active for three years accumulates 3 × 365 × 24 × 120 = ~3.1 million rows in a single partition. Cassandra handles wide partitions poorly — compaction becomes expensive, node repair is slow, and the memory overhead of partition metadata grows proportionally. The practical upper bound for a Cassandra partition is roughly a few hundred megabytes; a three-year telemetry stream for an active device far exceeds that.
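The sizing arithmetic is worth making explicit; a rough sketch (the 60-byte per-row estimate is an assumption for illustration, not a measured figure):

```python
EVENTS_PER_HOUR = 120   # one event every 30 seconds
ROW_BYTES = 60          # rough on-disk size per event row (assumption)

def partition_rows(years: float) -> int:
    """Rows accumulated in a single unbounded device_id partition."""
    return round(years * 365 * 24 * EVENTS_PER_HOUR)

def partition_mb(years: float) -> float:
    """Approximate uncompressed partition size in megabytes."""
    return partition_rows(years) * ROW_BYTES / 1e6

# Three years: 3,153,600 rows, roughly 190 MB — at or past the
# practical per-partition ceiling, and still growing linearly.
```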
The standard solution is a time-bucketed partition key, appending a time bucket to the device ID:
CREATE TABLE device_events (
    device_id uuid,
    bucket    date,      -- date of the event, truncated to day
    ts        timestamp,
    lat       double,
    lon       double,
    speed     float,
    heading   smallint,
    rpm       int,
    fuel_pct  float,
    io_flags  int,
    PRIMARY KEY ((device_id, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1};
Now each partition holds one day of events for one device — roughly 2,880 rows at 30-second intervals. Partitions are bounded and predictably sized. The cost: a range query spanning multiple days must fan out across multiple partitions. The application layer needs to enumerate the bucket dates in the query range and issue one Cassandra query per bucket, then merge the results. For most fleet analytics queries this is straightforward; the application already knows the time range it is querying.
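The application-side fan-out is mechanical; a minimal sketch of the bucket enumeration (the CQL text mirrors the schema above; session handling and async execution are omitted, and `session` is assumed to be a cassandra-driver Session in a real app):

```python
from datetime import date, datetime, timedelta

# One SELECT per daily bucket; results are merged client-side.
RANGE_CQL = ("SELECT * FROM device_events "
             "WHERE device_id = ? AND bucket = ? "
             "AND ts >= ? AND ts <= ?")

def buckets_for_range(t1: datetime, t2: datetime) -> list[date]:
    """Enumerate the daily bucket keys covering [t1, t2]."""
    d, out = t1.date(), []
    while d <= t2.date():
        out.append(d)
        d += timedelta(days=1)
    return out

# The app issues one (ideally asynchronous) query per bucket:
#   for b in buckets_for_range(t1, t2):
#       rows.extend(session.execute(prepared, (device_id, b, t1, t2)))
```

Because each bucket is an independent partition, the per-bucket queries can run concurrently against different coordinator nodes.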
The TimeWindowCompactionStrategy
Cassandra's write path is append-only: new data goes to a memtable in RAM, which is flushed to immutable SSTable files on disk. Over time, many SSTables accumulate and must be compacted — merged together to reclaim space from deleted/updated rows and to improve read performance. The compaction strategy determines how and when this merging happens.
The default SizeTieredCompactionStrategy (STCS) merges SSTables of similar sizes together. For time-series data, this is suboptimal: it mixes new data with old data in the same compaction operations, and it doesn't respect the fact that old time-series partitions are effectively immutable (they will never be updated once their time window has passed).
TimeWindowCompactionStrategy (TWCS) divides time into windows (we used daily windows) and only merges SSTables containing data from the same window. Once a window closes — the day ends, all its data is flushed — TWCS compacts that window once and never touches it again. This dramatically reduces compaction I/O for old data and is specifically designed for append-only time-series workloads. TWCS was the single most impactful compaction configuration change we made; it reduced background I/O by approximately 60% and eliminated the compaction pressure spikes that had been periodically saturating our storage nodes.
Big Data Analysis: Spark Integration
Raw telemetry storage is only half the value. The fleet analytics layer — trip reconstruction, idle-time analysis, fuel-efficiency ranking, geofence dwell time, predictive-maintenance indicators — required batch processing across the full historical dataset. Cassandra's query model does not support arbitrary analytics: CQL's GROUP BY and aggregates such as SUM operate only within a single partition, and there is no cross-partition join or aggregation. For analytics you need a separate compute layer.
We integrated Apache Spark via the DataStax Spark Cassandra Connector. Spark read partitions from Cassandra in parallel — each Spark executor was assigned a subset of token ranges, allowing the read to be distributed across the cluster without a single coordinator bottleneck. A full scan of the events table for a 30-day period across the entire fleet completed in minutes rather than hours.
The analytics patterns we ran most frequently:
- Trip segmentation — identifying contiguous movement intervals separated by stops (engine off or stationary for >5 minutes). This required a stateful transformation over time-ordered events per device, exactly the kind of operation Spark's groupByKey plus a custom iterator handled well.
- Geofence analysis — determining which events fell inside customer-defined zones using a pre-built spatial index. Spark broadcast the zone geometries to all executors; each executor evaluated the containment test locally without network round trips.
- Anomaly detection — flagging unusual patterns: speed in an idle zone, engine on during declared maintenance period, route deviation beyond a threshold. These were rule-based initially, with a later ML layer (gradient boosted trees) for the harder cases.
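The trip-segmentation step reduces, per device, to one pass over time-ordered events. A simplified single-device sketch of that stateful logic (the 5-minute threshold matches the rule above; the (timestamp, moving) event shape is illustrative):

```python
from datetime import datetime, timedelta

STOP_GAP = timedelta(minutes=5)  # stop threshold from the rule above

def segment_trips(events):
    """Split one device's time-ordered (ts, moving) stream into trips.

    A trip ends when the device has been stopped (engine off or
    stationary) for longer than STOP_GAP. Simplified sketch of the
    per-key pass run inside Spark after groupByKey.
    """
    trips, current, stopped_at = [], [], None
    for ts, moving in events:
        if moving:
            if stopped_at is not None and current and ts - stopped_at > STOP_GAP:
                trips.append(current)      # close the previous trip
                current = []
            current.append(ts)
            stopped_at = None
        elif stopped_at is None:
            stopped_at = ts                # first event of a stop
    if current:
        trips.append(current)
    return trips
```

In the real job this function body runs inside a mapPartitions-style iterator so that a device's multi-day event stream never needs to be collected on the driver.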
The results of Spark jobs were materialised back into Cassandra as summary tables — pre-aggregated by device and time window — so that the application tier could serve dashboard queries from fast, bounded Cassandra reads rather than triggering Spark jobs on demand.
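The summary tables are plain pre-aggregation keyed the same way the dashboards query them. An illustrative sketch of the reduce step (field and key names are assumptions, not the production schema):

```python
from collections import defaultdict
from datetime import datetime

def daily_summaries(events):
    """Reduce raw events into per-(device, day) summary rows, as a
    Spark job would before writing them back to a Cassandra summary
    table partitioned by (device_id, day)."""
    acc = defaultdict(lambda: {"events": 0, "max_speed": 0.0})
    for e in events:
        row = acc[(e["device_id"], e["ts"].date())]
        row["events"] += 1
        row["max_speed"] = max(row["max_speed"], e["speed"])
    return dict(acc)
```

Serving a dashboard then means reading a handful of bounded summary partitions instead of scanning raw telemetry.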
Operational Lessons
Repair is not optional
Cassandra's eventual consistency model means that if a write succeeds at two out of three replicas (with QUORUM write consistency), the third replica is temporarily inconsistent. Over time these inconsistencies compound if not addressed. The repair operation (nodetool repair) reconciles replica disagreements by streaming missing data between nodes. Repair must be run regularly — and at minimum once per gc_grace_seconds window, or tombstones can be purged before every replica has seen them and deleted data can resurface. We ran it weekly, rotating across nodes. Neglecting repair on a production Cassandra cluster is the single most common cause of the subtle data-loss and cascading-failure patterns that give Cassandra a bad reputation.
Tombstones are expensive
In Cassandra, a delete is not a remove — it is a write of a special marker called a tombstone. Tombstones accumulate until compaction can safely discard them (after the gc_grace_seconds period, which defaults to 10 days). A query that reads through many tombstones is slow and generates warnings in the server logs. For fleet telemetry, we deliberately avoided explicit deletes: old data was retained indefinitely or expired via a time-to-live (TTL) set at write time. Expired TTL cells do still become tombstones, but combined with TWCS the effect is benign: once every row in a time window's SSTables has expired, Cassandra can drop those SSTables wholesale during compaction, so expiry never generated delete traffic or tombstone-heavy reads.
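Concretely, TTL-based expiry means stamping the TTL onto every write. A sketch (the 395-day retention figure is an assumption for illustration; prepared-statement execution is omitted):

```python
# Hypothetical retention policy: ~13 months, applied on each INSERT.
RETENTION_DAYS = 395
RETENTION_TTL = RETENTION_DAYS * 24 * 3600  # TTL is given in seconds

# The TTL rides on the write itself — no DELETE is ever issued.
INSERT_CQL = (
    "INSERT INTO device_events "
    "(device_id, bucket, ts, lat, lon, speed) "
    "VALUES (?, ?, ?, ?, ?, ?) "
    f"USING TTL {RETENTION_TTL}"
)
```

Because the TTL is fixed per write, all rows in a daily TWCS window expire together, which is exactly what allows whole-SSTable drops.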
Schema changes require care
Adding a column to a Cassandra table is fast — it is a metadata-only change. Removing a column, renaming it, or changing a primary key requires creating a new table, backfilling from the old table, and migrating traffic. We learned this late, after designing an initial schema that needed restructuring as query patterns evolved. The discipline of treating Cassandra schema design as a permanent commitment — because the data model encodes the query access patterns, changing the queries means changing the schema — is a significant cultural shift for teams coming from relational databases where schema migrations are routine.
What I Would Change
If building the same system today, I would evaluate Apache Cassandra against two alternatives that did not exist in their current form when we started: Apache Parquet on object storage (for the analytics layer) and a purpose-built time-series database for the hot path.
For analytics, writing Spark output to Parquet files on S3 or GCS — rather than materialising to Cassandra summary tables — provides better compression, columnar read performance for aggregation queries, and lower operational overhead. Cassandra as an analytics store requires Spark; Parquet files are queryable by Spark, Athena, BigQuery, DuckDB, and practically everything else without a coordination layer.
For the hot path (real-time device event storage), a dedicated time-series database like InfluxDB IOx or Apache IoTDB provides native time-series semantics — automatic time partitioning, downsample-and-retain policies, continuous queries — without the application-layer bucket management that Cassandra requires. The operations model is substantially simpler.
But Cassandra did the job well for years, at a scale that would have been expensive or impossible on alternatives that were less mature at the time. The lessons from operating it — partition design, compaction strategy, repair discipline — transfer directly to other distributed systems and remain relevant even if the specific tool has been superseded for some use cases.