OpsWM-1

OpsWM-1 is a passive world model for restaurant kitchen operations. It encodes event streams from restaurant point-of-sale systems into a continuous latent space that organises by operational state (stable / degraded / collapsed), time-of-day, and store identity. The representation is learned via self-supervised future-state prediction (JEPA-style) on synthetic data.

This v1 release is the demo-track checkpoint trained on Vita-Mojo/kitchensim-v1, a public synthetic dataset modelling 60 stores × 7 days of events. The production OpsWM-1 (trained on real operational data) is a separate, internal artefact; this checkpoint exists to demonstrate the architecture and dynamics publicly.

Architecture

encoder        : 8-layer bidirectional transformer
                 hidden=256, heads=8, MLP-dim=1024
                 input: 256-event window of (event_type, channel, status_to,
                 hour, day-of-week, log delta-time-since-prev)
                 output: 256-d latent (CLS token, NO L2-norm)
                 ~6M params

predictor      : 4-layer transformer
                 hidden=256, heads=8, MLP-dim=512
                 input: [z_t, learned_horizon_token]
                 output: predicted z_target at horizon ∈ {30s, 2min, 5min, 10min}
                 ~2M params

target encoder : EMA copy of encoder, τ=0.99, no gradients

losses         : L_predict  = 1 - cos_sim(z_hat, z_target.detach())
                 L_var      = ReLU(γ - std(z_t, dim=0)).mean()  (VicReg)
                 L_cov      = sum(off-diag cov(z_t)²) / D       (VicReg)
                 L_contrast = InfoNCE over (anchor, future) batch pairs
                              (temperature=0.1)

  total       = L_predict + 1·L_var + 0.01·L_cov + 1·L_contrast
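
The four loss terms listed above can be sketched in dependency-free Python, which makes it easy to see how each one reacts to a collapsed encoder. This is an illustration under stated assumptions, not the training code: the margin γ=1.0, the unbiased (n-1) standard deviation, cosine similarity as the InfoNCE score, and all function names are assumptions; the real implementation presumably uses batched tensor operations.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cos(a, b, eps=1e-8):
    return _dot(a, b) / (math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b)) + eps)

def _center(z):
    """Subtract the per-dimension batch mean from every row."""
    n, d = len(z), len(z[0])
    mean = [sum(row[j] for row in z) / n for j in range(d)]
    return [[row[j] - mean[j] for j in range(d)] for row in z]

def predict_loss(z_hat, z_target):
    """L_predict = 1 - cos_sim(z_hat, z_target), averaged over the batch."""
    return sum(1.0 - _cos(a, b) for a, b in zip(z_hat, z_target)) / len(z_hat)

def variance_loss(z, gamma=1.0):  # gamma=1.0 is an assumed margin
    """L_var = mean over dimensions of ReLU(gamma - std across the batch)."""
    c, n, d = _center(z), len(z), len(z[0])
    stds = [math.sqrt(sum(r[j] ** 2 for r in c) / (n - 1)) for j in range(d)]
    return sum(max(0.0, gamma - s) for s in stds) / d

def covariance_loss(z):
    """L_cov = sum of squared off-diagonal covariance entries, divided by D."""
    c, n, d = _center(z), len(z), len(z[0])
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                total += (sum(r[i] * r[j] for r in c) / (n - 1)) ** 2
    return total / d

def contrast_loss(anchors, futures, temperature=0.1):
    """InfoNCE: row i of `futures` is the positive for row i of `anchors`."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [_cos(a, f) / temperature for f in futures]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)

def total_loss(z_t, z_hat, z_target, z_future):
    """Weighted sum with the coefficients given in the spec above."""
    return (predict_loss(z_hat, z_target)
            + 1.0 * variance_loss(z_t)
            + 0.01 * covariance_loss(z_t)
            + 1.0 * contrast_loss(z_t, z_future))

# A fully collapsed batch: every input maps to the same latent.
collapsed = [[0.5, -0.25, 0.125]] * 4
print(predict_loss(collapsed, collapsed))   # ~0: this term cannot see collapse
print(variance_loss(collapsed))             # equals gamma: zero spread everywhere
print(contrast_loss(collapsed, collapsed))  # log(4): positives are indistinguishable
```

On a collapsed batch the prediction loss is near zero, while the contrastive term sits at log(batch size), its chance level, so it keeps penalising constant outputs even when the prediction term is satisfied.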

Training

  • 50,000 steps, batch=256, AdamW lr=3e-4 with cosine decay + 2K warmup
  • 8.78M trainable params, ~5.2 hours on a single A100 (Modal)
  • Pretrained passively: no action conditioning in v1
  • Window length 256 events, stride 32, 4 future-horizon targets
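
The learning-rate schedule above (peak 3e-4, 2K warmup steps, cosine decay over 50,000 steps) can be written as a plain function of the step index. This is a minimal sketch: the linear shape of the warmup and the decay floor of zero are assumptions, and `lr_at` is a hypothetical name rather than part of the training code.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2_000, total_steps=50_000, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Assumed linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp in case step runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 1_999, 26_000, 50_000):
    print(step, lr_at(step))
```

Under these assumptions the rate is half its peak at step 26,000, the midpoint of the decay phase, and reaches zero at step 50,000.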

Files

  • opswm1.pt: best checkpoint (step 45000, eval probe MAE = 51.6s vs naive median 113.9s)
  • opswm1_step50000.pt: final-step checkpoint
  • model_config.json: ModelConfig dict needed to reconstruct the architecture

Loading

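import torch
from opsvm.model.encoder import ModelConfig, OpsWMModel

# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load("opswm1.pt", map_location="cpu", weights_only=False)
cfg = ModelConfig(**ckpt["model_cfg"])  # rebuild the architecture from the stored config
model = OpsWMModel(cfg)
model.load_state_dict(ckpt["model_state"])
model.encoder.eval()  # inference mode: disables dropout in the encoder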

The encoder's forward(batch, normalize=False) returns a [B, 256] latent given a [B, 256] window of events tokenized per the conventions in opsvm/model/tokenization.py.
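
Each row of the [B, 256] input is one fixed-length window of events. With the window length of 256 and stride of 32 given under Training, a store's event stream can be cut into overlapping windows roughly as follows. This is a sketch only: `sliding_windows` is a hypothetical helper, the integers stand in for tokenized events, and dropping the incomplete tail (rather than padding it) is an assumption.

```python
def sliding_windows(events, window=256, stride=32):
    """Yield (start_index, window) pairs of fixed-length, overlapping windows.

    Trailing events that do not fill a whole window are dropped; this is an
    assumption, and padding would be an equally reasonable choice.
    """
    for start in range(0, len(events) - window + 1, stride):
        yield start, events[start:start + window]

stream = list(range(320))  # stand-in for 320 tokenized events
starts = [start for start, _ in sliding_windows(stream)]
print(starts)  # [0, 32, 64]
```

Consecutive windows share 224 of their 256 events, so neighbouring latents are expected to be highly correlated.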

Evaluation

Linear-probe results from the 5-task evaluation harness (see Vita-Mojo/opswm1-demo):

| Task | Metric | OpsWM | Specialist baseline | Winner |
|---|---|---|---|---|
| 1. Fulfilment time | MAE seconds ↓ | 49.18 | 24.48 (XGBoost) | baseline |
| 2. State classification | accuracy ↑ | 0.716 | 0.852 (rule on q/EMA) | baseline |
| 3. State at t+5min | accuracy ↑ | 0.728 | 0.682 (Markov chain) | OpsWM |
| 4. Bundle fulfilment | MAE seconds ↓ | 49.18 | 53.30 (linear regression) | OpsWM |
| 5. Recovery in 10min | ROC-AUC ↑ | 0.927 | 0.943 (logistic) | baseline |

OpsWM wins 2/5 tasks, specifically the predictive ones (Task 3 = future state, Task 4 = bundle interaction). It loses on Tasks 1, 2, 5 where the specialist baselines have direct access to the generator's state variables (queue depth, fulfilment EMA, store_type) as hand-engineered features. Those baselines are sitting close to the oracle by construction.
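
Because the metrics mix "lower is better" (MAE) with "higher is better" (accuracy, ROC-AUC), the Winner column is easy to misread. The short check below re-derives it from the numbers in the table; the scores are copied from the table, while the data layout and the `winner` helper are illustrative.

```python
# (task, higher_is_better, OpsWM score, specialist baseline score)
RESULTS = [
    ("1. Fulfilment time (MAE s)",      False, 49.18, 24.48),
    ("2. State classification (acc)",   True,  0.716, 0.852),
    ("3. State at t+5min (acc)",        True,  0.728, 0.682),
    ("4. Bundle fulfilment (MAE s)",    False, 49.18, 53.30),
    ("5. Recovery in 10min (ROC-AUC)",  True,  0.927, 0.943),
]

def winner(higher_is_better, opswm, baseline):
    """Return which side wins given the direction of the metric."""
    if opswm == baseline:
        return "tie"
    opswm_better = opswm > baseline if higher_is_better else opswm < baseline
    return "OpsWM" if opswm_better else "baseline"

winners = [winner(h, o, b) for _, h, o, b in RESULTS]
for (task, _, _, _), w in zip(RESULTS, winners):
    print(f"{task:34s} {w}")
print(winners.count("OpsWM"), "of", len(winners), "tasks won by OpsWM")
```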

Limitations

  • Synthetic data: this checkpoint is trained on simulated kitchen dynamics (Vita-Mojo/kitchensim-v1). It will not transfer directly to real restaurant operations without additional adaptation.
  • Passive only: no action conditioning. The latent reflects what happened, not what might happen under interventions.
  • Counterfactuals in the demo are simulator playthroughs, not learned causal estimates. They're labelled "directional, not calibrated" in the UI for a reason.
  • Pooling a 256-event window into a single latent loses per-order detail. Tasks that depend on individual order outcomes (e.g. predicting cancellation per order) don't benefit much from this representation.
  • Cancellation rate prediction: the encoder didn't learn this signal well; XGBoost on hand-features beats it at every training-set size.

Architecture journey (the non-obvious bits)

This isn't a vanilla JEPA recipe. Three spec-scale attempts collapsed catastrophically (active_dims = 0/256) before a working configuration was found:

  1. L2-norm + MSE + SIGReg → encoder collapses to one point on the unit sphere
  2. CLS + cosine + VicReg → encoder satisfies VicReg's variance term using dropout noise while collapsing in eval mode
  3. + encoder dropout=0 → still collapses
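
This card does not define the active_dims figure quoted above. One plausible reading, used here purely as an assumption, is the number of latent dimensions whose standard deviation across a batch of eval-mode encodings exceeds a small threshold. The sketch below implements that reading; the threshold of 1e-3 and the function itself are hypothetical.

```python
import math

def active_dims(latents, threshold=1e-3):
    """Count latent dimensions whose std across the batch exceeds `threshold`."""
    n, d = len(latents), len(latents[0])
    count = 0
    for j in range(d):
        column = [row[j] for row in latents]
        mean = sum(column) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
        if std > threshold:
            count += 1
    return count

# Collapsed encoder: every input yields the same latent.
collapsed = [[0.25, -1.0, 3.0]] * 8
# Healthy encoder (toy example): two dimensions vary, one is constant.
healthy = [[float(i), float(-i), 3.0] for i in range(8)]

print(active_dims(collapsed))  # 0
print(active_dims(healthy))    # 2
```

Running such a check with dropout disabled matters for attempt 2 above, where the variance term was satisfied by dropout noise in training mode while eval-mode outputs had collapsed.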

What finally worked (this checkpoint):

  • CLS token instead of mean-pool
  • No L2-norm on the output: let the encoder pick magnitudes
  • Cosine-similarity prediction loss (not MSE)
  • VicReg variance + covariance (replaces SIGReg; works on unnormalised latents)
  • Contrastive auxiliary loss (InfoNCE on batch pairs): the load-bearing piece. Without it, VicReg is satisfied by dropout/initial-noise even while the encoder gives a near-constant output for every input.

Citation

If you use this work, please cite:

@misc{opswm1_demo_2026,
  author = {Vita-Mojo Research},
  title  = {OpsWM-1: Self-supervised World Model for Restaurant Operations},
  year   = {2026},
  url    = {https://huggingface.co/Vita-Mojo/opswm1}
}

License

Apache 2.0
