Modelling Online Gamer Behaviour with SOMs and Decision Trees
In 2008, my team at Advant Games needed to understand player behaviour without the budget for continuous focus groups. We built a hierarchical machine learning pipeline — Self-Organizing Maps for unsupervised clustering, decision trees and random forests for classification — to predict session breaks and risk appetite from 20 behavioural features. Here's how the models worked and what they revealed about the hidden structure of online play.
The Business Problem
At Advant Games in 2008 we developed and operated several browser-based real-money games, including the card game Arvaa Kuka? (Guess Who?). Acquisition was cheap — Flash-based browser games were thriving — but retention was brutal. Players would try a game, lose interest, and vanish. Traditional game design wisdom said you needed focus groups to understand why. Focus groups were expensive, slow, and subject to social desirability bias: players told you what they thought you wanted to hear, not what they actually did.
We had something focus groups couldn't provide: complete logs. Every round, every wager, every session break, every quit — all timestamped and stored. The question was whether we could extract predictive signal from that data without drowning in noise. The answer became a two-tier player model: a strategic layer capturing long-run personality across sessions, and a tactical layer predicting near-term decisions within a single session.
The Two-Level Player Model
The strategic model characterised players across their entire history with the platform. Its inputs included the player's overall win/loss record, session durations, and most importantly their observed play style — whether they consistently chose conservative (safe), measured (cautious), or high-variance (aggressive) bet sizing. These classifications were relatively stable: a player characterised as aggressive in month one rarely became conservative by month three. Strategic labels fed downstream as prior features for the tactical models.
The tactical model operated at the session level, making two distinct predictions:
- Risk level — a 2-class prediction of whether the player was currently in a risk-taking or risk-avoiding mode
- Break prediction — a 2–5-class prediction of whether the player was approaching a session break, and at what time horizon
Both predictions updated round-by-round, giving the game backend a continuously refreshed view of player intent. The potential applications ranged from adaptive difficulty to personalised bonus timing — though in 2008 the primary goal was simply to demonstrate that prediction was feasible at all.
Feature Engineering: Twenty Numbers Per Round
The tactical model consumed 20 input features, split into three conceptual groups.
Game success features (10 parameters)
These captured the player's recent result trajectory. Rather than raw win/loss counts, we computed stimuli: graded numerical representations of outcome significance. A small win registered differently from a large win; the model needed that distinction because the psychological response — and therefore the behavioural consequences — differed substantially. Beyond individual outcomes, we tracked short-term and long-term trends: specifically, the weighted result gradient over the last five rounds within the current streak, and across the player's longer session history.
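The grading and trend ideas can be sketched in a few lines. The exact scaling was never published, so `stimulus` below is an illustrative compression of the net outcome and the recency weights in `weighted_gradient` are an assumption, not the original formula:

```python
import math

def stimulus(payout, stake):
    """Graded significance of one round's outcome: sign from win/loss,
    magnitude compressed so a big win registers differently from a small
    one without dominating.  The log scaling is an illustrative choice."""
    net = payout - stake
    return math.copysign(math.log1p(abs(net) / stake), net)

def weighted_gradient(stimuli, window=5):
    """Recency-weighted slope of the last `window` stimuli; positive
    means the player's results are trending upward."""
    recent = stimuli[-window:]
    if len(recent) < 2:
        return 0.0
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    weights = range(1, len(deltas) + 1)   # later deltas weigh more
    return sum(w * d for w, d in zip(weights, deltas)) / sum(weights)
```

The same gradient function serves both horizons: applied to the last five rounds for the short-term trend, and to the whole session history for the long-term one.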
Game history features (10 parameters)
The second group described the player's cumulative profile: the relative frequency distribution across outcome classes, total rounds played this session, the player's personal return-to-player percentage (observed, not theoretical), and the number of distinct win/loss streaks within the session. The streak count proved surprisingly informative — players who had experienced many reversals in a session behaved differently from players on a single long trend, even when their cumulative P&L was identical.
Thinking time (1 parameter)
The final feature was deceptively simple: the player's decision latency relative to their own historical average. A player who normally decided in three seconds but was now taking eight was exhibiting meaningful behavioural change — likely deliberation, hesitation, or distraction. We normalised this feature per-player rather than using an absolute threshold because individual baseline thinking times varied by an order of magnitude across the player population.
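A per-player normalisation along these lines is straightforward. The median baseline and log ratio here are illustrative choices rather than the original computation:

```python
import math
import statistics

def latency_feature(current_ms, history_ms):
    """Decision latency relative to the player's own baseline: 0 means
    typical speed, positive means slower than usual.  Median and log
    ratio are illustrative choices, not the original formula."""
    baseline = statistics.median(history_ms)   # robust to one-off pauses
    return math.log(current_ms / baseline)
```

The log ratio makes "twice as slow" and "twice as fast" symmetric around zero, which matters when players' absolute baselines differ by an order of magnitude.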
Self-Organizing Maps: Finding Hidden Structure
Before building classifiers, we needed to understand whether the feature space contained natural clusters at all. If player behaviour was a smooth continuum with no discrete segments, classification would be meaningless — we'd just be drawing arbitrary boundaries through noise.
We used a Self-Organizing Map (SOM) to answer this question. A SOM is an unsupervised neural network that learns a topology-preserving projection of high-dimensional data onto a two-dimensional grid. Similar input vectors land near each other on the grid; distant inputs land far apart. The result is a map where the spatial layout directly reflects the geometry of the data.
Our SOM used a 100 × 150 grid — 15,000 neurons — with a torus topology. The torus wrapping is critical and often overlooked: on a standard rectangular grid, neurons at the edges have fewer neighbours than neurons in the centre, which creates artificial boundary effects where the map folds around its periphery. By connecting opposite edges (top to bottom, left to right), every neuron has an equal neighbourhood structure. Clusters that happen to span what would be the grid edge on a flat map appear as continuous regions on the torus.
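A toy version of such a torus SOM fits in a few dozen lines. The grid below is tiny and the decay schedule is a guess rather than the production configuration; the point is the wrap-around distance in `grid_dist`:

```python
import math
import random

class TorusSOM:
    """Minimal SOM on a wrap-around (torus) grid.  A sketch for
    illustration, not the production 100 x 150 map."""

    def __init__(self, rows, cols, dim, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols = rows, cols
        self.w = [[[rng.random() for _ in range(dim)]
                   for _ in range(cols)] for _ in range(rows)]

    def bmu(self, x):
        """Best-matching unit: the cell whose prototype is closest to x."""
        best, best_d = (0, 0), float("inf")
        for r in range(self.rows):
            for c in range(self.cols):
                d = sum((a - b) ** 2 for a, b in zip(self.w[r][c], x))
                if d < best_d:
                    best, best_d = (r, c), d
        return best

    def grid_dist(self, a, b):
        """Distance on the torus: opposite edges count as adjacent."""
        dr = min(abs(a[0] - b[0]), self.rows - abs(a[0] - b[0]))
        dc = min(abs(a[1] - b[1]), self.cols - abs(a[1] - b[1]))
        return math.hypot(dr, dc)

    def train(self, data, epochs=10, lr=0.5, radius=2.0):
        for t in range(epochs):
            frac = t / epochs                   # linear decay schedule
            cur_lr = lr * (1 - frac)
            cur_rad = radius * (1 - frac) + 0.5
            for x in data:
                win = self.bmu(x)
                for r in range(self.rows):
                    for c in range(self.cols):
                        h = math.exp(-self.grid_dist((r, c), win) ** 2
                                     / (2 * cur_rad ** 2))
                        node = self.w[r][c]
                        for i, xi in enumerate(x):
                            node[i] += cur_lr * h * (xi - node[i])
```

With a torus, `grid_dist((0, 0), (rows - 1, 0))` is 1, not `rows - 1`: every neuron sees the same neighbourhood shape, which is exactly the property the flat grid lacks.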
We visualised the trained SOM using two complementary tools. The U-matrix (unified distance matrix) shows the Euclidean distance between adjacent prototype vectors: high values indicate a boundary between distinct clusters; low values indicate a smooth region within a cluster. The P-matrix (population density matrix) shows how many input vectors were mapped to each region of the grid. Together they distinguish true cluster structure (high U-matrix boundary, concentrated P-matrix peaks) from diffuse gradients (low boundaries, spread P-matrix).
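The U-matrix is cheap to compute once the prototypes are trained. This sketch averages the distance from each prototype to its four torus neighbours; the P-matrix would simply be a count of best-matching-unit hits per cell:

```python
import math

def u_matrix(weights, rows, cols):
    """Mean Euclidean distance from each prototype to its four torus
    neighbours; high values mark boundaries between clusters."""
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            w = weights[r][c]
            dists = [math.dist(w, weights[(r + dr) % rows][(c + dc) % cols])
                     for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))]
            row.append(sum(dists) / len(dists))
        out.append(row)
    return out
```

The modulo indexing is what makes this a torus U-matrix: cells on the edge compare themselves against cells on the opposite edge.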
The advantage of SOM over K-means for this dataset was its ability to handle non-convex cluster shapes. Our player data contained ring-shaped distributions — a characteristic of behavioural trajectories that K-means, which assumes spherical clusters, would split artificially. The SOM's neighbourhood-preserving projection kept these structures intact.
Decision Trees: Interpretable Classification
SOMs tell you that clusters exist; they don't tell you the rules for classifying new observations into those clusters in real time. For that we turned to decision trees, specifically the C4.5 algorithm (a descendant of Quinlan's ID3).
C4.5 builds a tree by recursively selecting the feature split that maximises information gain — the reduction in entropy of the label distribution. At each node:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Entropy(S) = -Σ_c p_c · log₂(p_c)
The split is chosen to maximise Gain(S, A) across all candidate features and thresholds. C4.5 uses gain ratio rather than raw information gain to penalise high-cardinality splits (features that split the data into many small subsets get artificially high raw gain but have poor generalisation).
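A minimal version of these formulas, restricted to a binary numeric split (real C4.5 also handles multi-way categorical splits and missing values):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(xs, labels, threshold):
    """Information gain of the split x <= threshold, divided by the
    split's own entropy (C4.5's penalty on fragmenting splits)."""
    left = [l for x, l in zip(xs, labels) if x <= threshold]
    right = [l for x, l in zip(xs, labels) if x > threshold]
    n = len(labels)
    gain = entropy(labels) - sum(
        len(s) / n * entropy(s) for s in (left, right) if s)
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info if split_info else 0.0
```

A perfect split of a balanced binary label set scores 1.0; a degenerate split that puts everything on one side scores 0.0, which is exactly the penalty the gain ratio is designed to apply.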
For the Arvaa Kuka? dataset, we applied decision tree classification to the risk level prediction task, yielding a 4-class model from 14 input features. The resulting tree was shallow enough to be interpretable: the top-level split was almost always on the recent stimulus trend, with thinking time appearing consistently in the second and third layers. This matched intuition — a player who had just lost several rounds in quick succession and was now pausing longer before each decision was exhibiting a recognisable pattern that any experienced dealer would identify.
Pruning was essential. Unpruned trees on our training data achieved near-perfect accuracy but generalised poorly — they had memorised the noise in the training examples. We used reduced-error pruning, starting from leaves and removing subtrees where doing so did not increase the validation set error.
Random Forests: Ensemble Over Interpretability
Decision trees are interpretable but brittle: small changes in training data can produce substantially different trees, and a single tree over-fits more readily than an ensemble. For the break prediction task — where the marginal benefit of accuracy outweighed the need for human interpretability — we used random forests.
A random forest is an ensemble of decision trees, each trained on a bootstrap sample of the training data. Bootstrap sampling (sampling with replacement) means each tree sees approximately 63.2% of the original training observations, with the rest (the "out-of-bag" samples) available for unbiased error estimation. Additionally, each split in each tree considers only a random subset of features — typically √p features where p is the total number of features — which decorrelates the trees and prevents them from all exploiting the same dominant signal.
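The 63.2% figure falls straight out of simulation: the probability a given observation is missed by all n draws with replacement is (1 - 1/n)^n, which tends to 1/e:

```python
import random

def in_bag_fraction(n, seed=0):
    """Draw one bootstrap sample of size n (with replacement) and
    return the fraction of distinct originals that made it in."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}
    return len(in_bag) / n

# Averaged over repeats this approaches 1 - 1/e, about 0.632.
avg = sum(in_bag_fraction(2000, seed=s) for s in range(30)) / 30

# Feature subsampling per split: sqrt(p) of p = 20 features.
m_try = int(20 ** 0.5)   # 4
```

So with 20 features, each split would consider roughly 4 randomly chosen candidates, and each tree would hold out roughly 37% of the data for out-of-bag error estimation.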
Prediction is by majority vote across all trees. Because individual trees are high-variance but their average is low-variance, the ensemble substantially outperforms any single tree. In our experiments the random forest reduced classification error on the break prediction task by 8–12 percentage points compared to a single pruned decision tree, at the cost of interpretability.
Validation: Confusion Matrices and Fleiss Kappa
We evaluated all models using 10-fold cross-validation: the dataset was partitioned into 10 equal folds, and the model was trained and tested 10 times, each time holding out a different fold as the test set. Final metrics were averages across all 10 folds with standard deviation reported as uncertainty.
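The fold construction itself is only a few lines; the shuffle seed here is arbitrary:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds;
    each fold serves once as the held-out test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

Every observation lands in exactly one fold, and fold sizes differ by at most one when n is not divisible by k.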
For the risk level classifier, the confusion matrix on a representative 2-class evaluation showed:
|                | Predicted range 1 | Predicted range 2 |
|----------------|-------------------|-------------------|
| Actual range 1 | 273 (TP)          | 117 (FN)          |
| Actual range 2 | 160 (FP)          | 231 (TN)          |
Precision range 1: 273 / (273+160) = 63.05%
Precision range 2: 231 / (231+117) = 66.38%
Recall range 1: 273 / (273+117) = 70.00%
Recall range 2: 231 / (231+160) = 59.08%
Overall accuracy: (273+231) / (273+117+160+231) = 64.53% ± 4.60%
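These figures can be reproduced directly from the four counts, taking range 1 as the positive class:

```python
# Counts from the risk-level confusion matrix (range 1 = positive class).
tp, fn, fp, tn = 273, 117, 160, 231

precision_1 = tp / (tp + fp)                 # 273 / 433
precision_2 = tn / (tn + fn)                 # 231 / 348
recall_1 = tp / (tp + fn)                    # 273 / 390
recall_2 = tn / (tn + fp)                    # 231 / 391
accuracy = (tp + tn) / (tp + fn + fp + tn)   # 504 / 781
```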
64.53% accuracy on a binary classification task was meaningful in 2008: the two classes were nearly balanced (390 vs. 391 actual cases), so the chance baseline was only around 50–51% for this particular dataset. The ±4.60% standard deviation across folds indicated stable generalisation rather than lucky splits.
For comparing agreement between classifiers, we used Fleiss' Kappa, a chance-corrected agreement statistic for multiple simultaneous raters (strictly a generalisation of Scott's π; Cohen's Kappa covers only the two-rater case). A Kappa above 0.6 is generally considered "substantial agreement"; our risk classifier achieved Kappa ≈ 0.28, indicating "fair agreement" — honest for a fundamentally noisy behavioural prediction task.
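The statistic itself is a short formula over a ratings matrix. This sketch assumes `ratings[i][j]` counts how many classifiers assigned observation i to class j, with the same number of classifiers rating every observation:

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa: ratings[i][j] is the number of raters assigning
    item i to category j (equal rater count per item assumed)."""
    n_raters = sum(ratings[0])
    n_items = len(ratings)
    n_cats = len(ratings[0])
    # Marginal probability of each category across all assignments.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    # Per-item observed agreement among rater pairs.
    p_i = [(sum(c * c for c in row) - n_raters) /
           (n_raters * (n_raters - 1)) for row in ratings]
    p_bar = sum(p_i) / n_items
    p_exp = sum(p * p for p in p_j)
    return (p_bar - p_exp) / (1 - p_exp)
```

Perfect unanimity yields 1.0; agreement no better than the category marginals would predict yields 0, and systematic disagreement goes negative.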
What the Models Didn't Capture
The 20-feature tactical model deliberately excluded external context: time of day, session duration so far, day of week. We excluded these not because they were uninformative — they clearly were — but because including them would have made the model sensitive to calendar patterns rather than player psychology. A model that only fires correctly on Tuesday evenings is not a player model; it's a scheduling model.
The strategic model similarly excluded financial context. The absolute magnitude of wagers was not a feature — only relative bet sizing within a player's personal distribution. This was both a technical choice (normalising across players with wildly different stake levels) and a design principle: we were modelling behaviour, not wealth.
Legacy
The player model described here was never deployed as a live recommendation system — the product direction shifted before that stage. But the research shaped how the team thought about game data from that point forward: not as a record of what happened, but as a signal about what would happen next.
The same hierarchical decomposition — long-run strategic characterisation feeding short-run tactical prediction — recurs in every behaviour modelling system I've seen since. It's a natural structure because it matches the actual layering of human decision-making: personality is more stable than mood, and mood is more stable than the decision you're making right now. Any system that tries to predict decisions without distinguishing those time scales will blend signal and noise in ways that are very hard to untangle later.