OpsWM-1
OpsWM-1 is a passive world model for restaurant kitchen operations. It encodes event streams from restaurant point-of-sale systems into a continuous latent space organised by operational state (stable / degraded / collapsed), time of day, and store identity. The space is learned by self-supervised future-state prediction (JEPA-style) on synthetic data.
This v1 release is the demo-track checkpoint trained on Vita-Mojo/kitchensim-v1, a public synthetic dataset modelling 60 stores × 7 days of events. The production OpsWM-1 (trained on real operational data) is a separate, internal artefact; this checkpoint exists to demonstrate the architecture and dynamics publicly.
Architecture
encoder        : 8-layer bidirectional transformer
                 hidden=256, heads=8, MLP-dim=1024
                 input: 256-event window of (event_type, channel, status_to,
                        hour, day-of-week, log delta-time-since-prev)
                 output: 256-d latent (mean-pooled CLS, NO L2-norm)
                 ~10M params
predictor      : 4-layer transformer
                 hidden=256, heads=8, MLP-dim=512
                 input: [z_t, learned_horizon_token]
                 output: predicted z_target at horizon ∈ {30s, 2min, 5min, 10min}
                 ~4M params
target encoder : EMA copy of encoder, τ=0.99, no gradients
losses         : L_predict  = 1 - cos_sim(z_hat, z_target.detach())
                 L_var      = ReLU(γ - std(z_t, dim=0)).mean()   (VicReg)
                 L_cov      = sum(off-diag cov(z_t)²) / D        (VicReg)
                 L_contrast = InfoNCE over (anchor, future) batch pairs
                              (temperature=0.1)
                 total      = L_predict + 1·L_var + 0.01·L_cov + 1·L_contrast
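To make the loss arithmetic concrete, here is a framework-free sketch of the prediction, variance, and covariance terms and the weighted total, using plain Python lists in place of tensors. Several details are assumptions rather than facts from this card: the value of γ (set to 1.0 here), the use of the sample standard deviation and an (n - 1) covariance normaliser, and averaging the cosine term over the batch. The contrastive term is passed in as a precomputed number.

```python
import math
from statistics import mean, stdev

GAMMA = 1.0  # assumed variance target; the card does not state the value of gamma


def predict_loss(z_hat, z_target):
    """Batch mean of 1 - cosine similarity (rows must be non-zero vectors)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)
    return mean(1.0 - cos(a, b) for a, b in zip(z_hat, z_target))


def variance_loss(z, gamma=GAMMA):
    """ReLU(gamma - std) per latent dimension, averaged over dimensions."""
    columns = list(zip(*z))  # one tuple per latent dimension
    return mean(max(0.0, gamma - stdev(col)) for col in columns)


def covariance_loss(z):
    """Sum of squared off-diagonal covariances, divided by the latent size D."""
    n = len(z)
    columns = list(zip(*z))
    d = len(columns)
    centred = []
    for col in columns:
        mu = mean(col)
        centred.append([x - mu for x in col])
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov_ij = sum(a * b for a, b in zip(centred[i], centred[j])) / (n - 1)
                total += cov_ij ** 2
    return total / d


def total_loss(l_predict, l_var, l_cov, l_contrast):
    """Weights as listed above: 1, 1, 0.01, 1."""
    return l_predict + 1.0 * l_var + 0.01 * l_cov + 1.0 * l_contrast


spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collapsed = [[0.5, 0.5]] * 4  # every input maps to the same latent

print(variance_loss(spread), variance_loss(collapsed))
print(covariance_loss(spread))
```

On the toy batches, the collapsed latents pay the full variance penalty (γ per dimension), while the spread batch pays a smaller one and has zero off-diagonal covariance.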
Training
- 50,000 steps, batch=256, AdamW lr=3e-4 with cosine decay + 2K warmup
- 8.78M trainable params, ~5.2 hours on a single A100 (Modal)
- Pretrained passively: no action conditioning in v1
- Window length 256 events, stride 32, 4 future-horizon targets
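The learning-rate schedule in the first bullet can be sketched as a plain function of the step index. The card gives only the peak rate, the warmup length, and the total step count; the linear warmup shape, the decay floor of zero, and counting warmup inside the 50,000 steps are assumptions made for this illustration.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 50_000


def lr_at(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at a given optimiser step."""
    if step < warmup:
        # Assumed: linear ramp from 0 up to the peak rate.
        return peak * step / warmup
    # Cosine decay from the peak down to 0 (assumed floor) over the remaining steps.
    progress = (step - warmup) / (total - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 26_000, 50_000):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at step 2,000, is half the peak at step 26,000 (the midpoint of the decay phase), and reaches zero at step 50,000.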
Files
- opswm1.pt: best checkpoint (step 45000, eval probe MAE = 51.6s vs naive median 113.9s)
- opswm1_step50000.pt: final-step checkpoint
- model_config.json: ModelConfig dict needed to reconstruct the architecture
Loading
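import torch
from opsvm.model.encoder import ModelConfig, OpsWMModel

# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load("opswm1.pt", map_location="cpu", weights_only=False)
cfg = ModelConfig(**ckpt["model_cfg"])  # rebuild the architecture from the stored config
model = OpsWMModel(cfg)
model.load_state_dict(ckpt["model_state"])
model.encoder.eval()  # inference mode for the encoder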
The encoder's forward(batch, normalize=False) returns a [B, 256] latent given a [B, 256] window of events tokenized per the conventions in opsvm/model/tokenization.py.
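As a rough illustration of how a store's event stream could be cut into the 256-event windows the encoder expects, the sketch below computes window start offsets using the window length (256) and stride (32) from the Training section. The function name `window_starts` is hypothetical, and dropping a trailing partial window (rather than padding it) is an assumption; the actual tokenization code may handle this differently.

```python
def window_starts(n_events, window=256, stride=32):
    """Start index of every full window over a stream of n_events events.

    Assumption: streams shorter than one window yield nothing, and a
    trailing partial window is dropped rather than padded.
    """
    if n_events < window:
        return []
    return list(range(0, n_events - window + 1, stride))


def make_windows(events, window=256, stride=32):
    """Slice a list of (already tokenized) events into overlapping windows."""
    return [events[s:s + window] for s in window_starts(len(events), window, stride)]


events = list(range(1000))  # stand-in for 1000 tokenized events
windows = make_windows(events)
print(len(windows), len(windows[0]), windows[1][0])
```

With stride 32, consecutive windows share 224 of their 256 events, so neighbouring latents are computed from heavily overlapping context.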
Evaluation
Linear-probe results from the 5-task evaluation harness (see Vita-Mojo/opswm1-demo):
| Task | Metric | OpsWM | Specialist baseline | Winner |
|---|---|---|---|---|
| 1. Fulfilment time MAE | seconds ↓ | 49.18 | 24.48 (XGBoost) | baseline |
| 2. State classification | accuracy ↑ | 0.716 | 0.852 (rule on q/EMA) | baseline |
| 3. State at t+5min | accuracy ↑ | 0.728 | 0.682 (Markov chain) | OpsWM |
| 4. Bundle fulfilment MAE | seconds ↓ | 49.18 | 53.30 (linear regression) | OpsWM |
| 5. Recovery in 10min | ROC-AUC ↑ | 0.927 | 0.943 (logistic) | baseline |
OpsWM wins 2/5 tasks, specifically the predictive ones (Task 3 = future state, Task 4 = bundle interaction). It loses on Tasks 1, 2, and 5, where the specialist baselines have direct access to the generator's state variables (queue depth, fulfilment EMA, store_type) as hand-engineered features. Those baselines sit close to the oracle by construction.
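The tally can be re-derived from the table by comparing each pair of scores in the direction its metric improves. The snippet below is only a bookkeeping sketch: the numbers are copied from the table, while the data layout, the `winner` helper, and awarding ties to the baseline are illustrative choices.

```python
# (task, direction in which the metric improves, OpsWM score, baseline score)
RESULTS = [
    ("1. Fulfilment time MAE", "lower", 49.18, 24.48),
    ("2. State classification", "higher", 0.716, 0.852),
    ("3. State at t+5min", "higher", 0.728, 0.682),
    ("4. Bundle fulfilment MAE", "lower", 49.18, 53.30),
    ("5. Recovery in 10min", "higher", 0.927, 0.943),
]


def winner(direction, opswm, baseline):
    """Return which side scores better; ties go to the baseline (arbitrary choice)."""
    if direction == "lower":
        return "OpsWM" if opswm < baseline else "baseline"
    return "OpsWM" if opswm > baseline else "baseline"


winners = [winner(direction, o, b) for _, direction, o, b in RESULTS]
opswm_wins = winners.count("OpsWM")

for (task, _, _, _), who in zip(RESULTS, winners):
    print(f"{task}: {who}")
print(f"OpsWM wins {opswm_wins}/{len(RESULTS)}")
```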
Limitations
- Synthetic data: this checkpoint is trained on simulated kitchen dynamics (Vita-Mojo/kitchensim-v1). It will not transfer directly to real restaurant operations without additional adaptation.
- Passive only: no action conditioning. The latent reflects what happened, not what might happen under interventions.
- Counterfactuals in the demo are simulator playthroughs, not learned causal estimates. They're labelled "directional, not calibrated" in the UI for a reason.
- Window-mean pooling loses per-order detail. Tasks that depend on individual order outcomes (e.g. predicting cancellation per order) don't benefit much from this representation.
- Cancellation rate prediction: the encoder didn't learn this signal well. XGBoost on hand-engineered features beats it at every training-set size.
Architecture journey (the non-obvious bits)
This isn't a vanilla JEPA recipe. Three spec-scale attempts collapsed catastrophically (active_dims = 0/256) before a working configuration was found:
- L2-norm + MSE + SIGReg → encoder collapses to one point on the unit sphere
- CLS + cosine + VicReg → encoder satisfies VicReg's variance term using dropout noise while collapsing in eval mode
- + encoder dropout=0 → still collapses
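A collapse check like the active_dims figure quoted above could be computed along the following lines. This is a sketch under assumptions: the card does not define active_dims, so the definition here (dimensions whose standard deviation across inputs exceeds a small threshold) and the threshold value are hypothetical. The latents should come from the encoder in eval mode, since the second attempt above looked healthy only while dropout noise was active.

```python
from statistics import pstdev


def active_dims(latents, threshold=1e-2):
    """Count latent dimensions that actually vary across inputs.

    latents: list of equal-length rows, one row per input window, taken
    from the encoder in eval mode. threshold is a hypothetical cut-off.
    """
    columns = zip(*latents)  # one tuple of values per latent dimension
    return sum(1 for col in columns if pstdev(col) > threshold)


collapsed = [[0.3, 0.3, 0.3]] * 5                               # same output for every input
partly_alive = [[0.0, 1.0, 5.0], [1.0, 1.0, 5.0], [2.0, 1.0, 5.0]]  # only dim 0 varies

print(active_dims(collapsed), active_dims(partly_alive))
```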
What finally worked (this checkpoint):
- CLS token instead of mean-pool
- No L2-norm on the output β let the encoder pick magnitudes
- Cosine-similarity prediction loss (not MSE)
- VicReg variance + covariance (replaces SIGReg; works on unnormalised latents)
- Contrastive auxiliary loss (InfoNCE on batch pairs). This is the load-bearing piece: without it, VicReg is satisfied by dropout/initial noise even while the encoder gives a near-constant output for every input.
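One way to see why the contrastive term resists collapse: when every input maps to the same latent, all in-batch similarities are equal and InfoNCE sits at ln(B) no matter what, so the only way to lower it is to make different windows distinguishable. The sketch below illustrates this with plain Python lists. It assumes cosine-similarity logits and a single anchor-to-future direction; the card specifies only the temperature (0.1) and that pairs are formed within the batch.

```python
import math


def info_nce(anchors, futures, temperature=0.1):
    """In-batch InfoNCE: futures[i] is the positive for anchors[i]; other rows are negatives."""
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalise(v) for v in anchors]
    f = [normalise(v) for v in futures]
    losses = []
    for i, a_i in enumerate(a):
        # Assumed similarity: cosine, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a_i, f_j)) / temperature for f_j in f]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


constant = [[1.0, 1.0]] * 4  # collapsed encoder: identical latent for every window
distinct = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print(info_nce(constant, constant), math.log(4))
print(info_nce(distinct, distinct))
```

The collapsed batch is stuck at ln(4), while the batch with distinguishable latents drives the loss close to zero.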
Citation
If you use this work, please cite:
@misc{opswm1_demo_2026,
author = {Vita-Mojo Research},
title = {OpsWM-1: Self-supervised World Model for Restaurant Operations},
year = {2026},
url = {https://huggingface.co/Vita-Mojo/opswm1}
}
License
Apache 2.0