OpsWM-1

OpsWM-1 is a passive world model for restaurant kitchen operations. It encodes event streams from restaurant point-of-sale systems into a continuous latent space organised by operational state (stable / degraded / collapsed), time of day, and store identity. The representation is learned via self-supervised future-state prediction (JEPA-style) on synthetic data.

This v1 release is the demo-track checkpoint trained on Vita-Mojo/kitchensim-v1, a public synthetic dataset modelling 60 stores × 7 days of events. The production OpsWM-1 (trained on real operational data) is a separate, internal artefact; this checkpoint exists to demonstrate the architecture and dynamics publicly.

Architecture

encoder        : 8-layer bidirectional transformer
                 hidden=256, heads=8, MLP-dim=1024
                 input: 256-event window of (event_type, channel, status_to,
                 hour, day-of-week, log delta-time-since-prev)
                 output: 256-d latent (CLS token, NO L2-norm)
                 ~10M params

predictor      : 4-layer transformer
                 hidden=256, heads=8, MLP-dim=512
                 input: [z_t, learned_horizon_token]
                 output: predicted z_target at horizon ∈ {30s, 2min, 5min, 10min}
                 ~4M params

target encoder : EMA copy of encoder, τ=0.99, no gradients

losses         : L_predict  = 1 - cos_sim(z_hat, z_target.detach())
                 L_var      = ReLU(γ - std(z_t, dim=0)).mean()  (VicReg)
                 L_cov      = sum(off-diag cov(z_t)²) / D       (VicReg)
                 L_contrast = InfoNCE over (anchor, future) batch pairs
                              (temperature=0.1)

  total       = L_predict + 1·L_var + 0.01·L_cov + 1·L_contrast
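
The four loss terms above can be written out in plain Python to make the arithmetic visible. This is a minimal sketch, not the training code: the value of γ (1.0 here), the small epsilon inside the square root, the use of unbiased (n - 1) statistics, cosine similarity as the InfoNCE logit, and the choice of z_t as the InfoNCE anchor are all assumptions that the listing above does not specify.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def predict_loss(z_hat, z_target):
    """L_predict: batch mean of 1 - cos_sim(z_hat, z_target)."""
    return sum(1.0 - cos_sim(h, t) for h, t in zip(z_hat, z_target)) / len(z_hat)

def _center(z):
    """Subtract the per-dimension batch mean from every row."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in z], n, d

def variance_loss(z, gamma=1.0, eps=1e-4):
    """L_var: hinge on the per-dimension batch std (gamma and eps are assumed)."""
    centered, n, d = _center(z)
    total = 0.0
    for j in range(d):
        var = sum(row[j] ** 2 for row in centered) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d

def covariance_loss(z):
    """L_cov: sum of squared off-diagonal covariances, divided by D."""
    centered, n, d = _center(z)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(row[i] * row[j] for row in centered) / (n - 1)
                total += cov ** 2
    return total / d

def info_nce(anchors, futures, temperature=0.1):
    """L_contrast: row i of anchors should match row i of futures."""
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        logits = [cos_sim(anchors[i], futures[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n

def total_loss(z_t, z_hat, z_target):
    """Weighted sum using the coefficients from the listing: 1, 1, 0.01, 1."""
    return (predict_loss(z_hat, z_target)
            + 1.0 * variance_loss(z_t)
            + 0.01 * covariance_loss(z_t)
            + 1.0 * info_nce(z_t, z_target))
```

On a fully collapsed batch (every row identical) the prediction term is close to zero, which is why it cannot prevent collapse on its own. The variance term then sits near γ and the InfoNCE term equals log(batch size), so both penalise the constant solution.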

Training

50,000 steps, batch=256, AdamW lr=3e-4 with cosine decay + 2K warmup
8.78M trainable params, ~5.2 hours on a single A100 (Modal)
Pretrained passively: no action conditioning in v1
Window length 256 events, stride 32, 4 future-horizon targets
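
The learning-rate schedule and the windowing described above can be sketched as follows. Several details are assumptions rather than facts from this card: "2K warmup" is read as 2,000 steps of linear warmup counted inside the 50,000 total, the cosine decay is taken to end at zero, and a trailing partial window is dropped. The function names are hypothetical.

```python
import math

TOTAL_STEPS = 50_000
WARMUP_STEPS = 2_000   # assumed: "2K warmup" means 2,000 optimizer steps
PEAK_LR = 3e-4

def lr_at(step, min_lr=0.0):
    """Linear warmup to PEAK_LR, then cosine decay to min_lr (assumed 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))

def window_starts(n_events, window=256, stride=32):
    """Start indices of full windows over one event stream (partial tail dropped)."""
    if n_events < window:
        return []
    return list(range(0, n_events - window + 1, stride))

# A stream of 320 events yields three overlapping 256-event windows.
starts = window_starts(320)
```

With stride 32 and window 256, consecutive windows share 224 events, so neighbouring training samples are highly correlated.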

Files

opswm1.pt — best checkpoint (step 45000, eval probe MAE = 51.6s vs naive median 113.9s)
opswm1_step50000.pt — final-step checkpoint
model_config.json — ModelConfig dict needed to reconstruct the architecture
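
For context on the probe numbers quoted for opswm1.pt, the sketch below shows one common way a "naive median" baseline is computed, along with the relative error reduction implied by 51.6s versus 113.9s. That the card's baseline predicts the training-set median for every sample is an assumption, and the helper names are hypothetical.

```python
from statistics import median

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def naive_median_mae(y_train, y_eval):
    """MAE of always predicting the training-set median (assumed baseline)."""
    m = median(y_train)
    return mae(y_eval, [m] * len(y_eval))

def relative_reduction(model_mae, baseline_mae):
    """Fraction of the baseline error removed by the model."""
    return 1.0 - model_mae / baseline_mae

# Figures quoted for the step-45000 checkpoint: roughly a 55% reduction.
reduction = relative_reduction(51.6, 113.9)
```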

Loading

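import torch
from opsvm.model.encoder import ModelConfig, OpsWMModel

# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust.
ckpt = torch.load("opswm1.pt", map_location="cpu", weights_only=False)
cfg = ModelConfig(**ckpt["model_cfg"])      # rebuild the architecture from the saved config
model = OpsWMModel(cfg)
model.load_state_dict(ckpt["model_state"])
model.encoder.eval()                        # inference mode (disables dropout)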

The encoder's forward(batch, normalize=False) returns a [B, 256] latent (one 256-d vector per window) for a batch of B windows of 256 events each, tokenized per the conventions in opsvm/model/tokenization.py.

Evaluation

Linear-probe results from the 5-task evaluation harness (see Vita-Mojo/opswm1-demo):

Task	Metric	OpsWM	Specialist baseline	Winner
1. Fulfilment time MAE	seconds ↓	49.18	24.48 (XGBoost)	baseline
2. State classification	accuracy ↑	0.716	0.852 (rule on q/EMA)	baseline
3. State at t+5min	accuracy ↑	0.728	0.682 (Markov chain)	OpsWM
4. Bundle fulfilment MAE	seconds ↓	49.18	53.30 (linear regression)	OpsWM
5. Recovery in 10min	ROC-AUC ↑	0.927	0.943 (logistic)	baseline

OpsWM wins 2 of 5 tasks, both predictive: Task 3 (future state) and Task 4 (bundle interaction). It loses Tasks 1, 2 and 5, where the specialist baselines have direct access to the generator's state variables (queue depth, fulfilment EMA, store_type) as hand-engineered features, which puts them close to the oracle by construction.
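
The win/loss tally follows directly from the table once each metric's direction is taken into account. A small sketch using the reported numbers (the data layout and function name are illustrative only):

```python
# (task, better direction, OpsWM score, baseline score), copied from the table.
RESULTS = [
    ("Fulfilment time MAE", "lower", 49.18, 24.48),
    ("State classification", "higher", 0.716, 0.852),
    ("State at t+5min", "higher", 0.728, 0.682),
    ("Bundle fulfilment MAE", "lower", 49.18, 53.30),
    ("Recovery in 10min", "higher", 0.927, 0.943),
]

def winner(direction, opswm, baseline):
    """Return which side wins given whether lower or higher is better."""
    if direction == "lower":
        return "OpsWM" if opswm < baseline else "baseline"
    return "OpsWM" if opswm > baseline else "baseline"

winners = [winner(d, o, b) for _, d, o, b in RESULTS]
opswm_wins = winners.count("OpsWM")
```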

Limitations

Synthetic data: this checkpoint is trained on simulated kitchen dynamics (Vita-Mojo/kitchensim-v1). It will not transfer directly to real restaurant operations without additional adaptation.
Passive only: no action conditioning. The latent reflects what happened, not what might happen under interventions.
Counterfactuals in the demo are simulator playthroughs, not learned causal estimates. They're labelled "directional, not calibrated" in the UI for a reason.
Pooling each window into a single vector loses per-order detail. Tasks that depend on individual order outcomes (e.g. predicting cancellation per order) don't benefit much from this representation.
Cancellation rate prediction: the encoder didn't learn this signal well — XGBoost on hand-features beats it at every training-set size.

Architecture journey (the non-obvious bits)

This isn't a vanilla JEPA recipe. Three spec-scale attempts collapsed catastrophically (active_dims = 0/256) before a working configuration was found:

L2-norm + MSE + SIGReg → encoder collapses to one point on the unit sphere
CLS + cosine + VicReg → encoder satisfies VicReg's variance term using dropout noise while collapsing in eval mode
CLS + cosine + VicReg + encoder dropout=0 → still collapses

What finally worked (this checkpoint):

CLS token instead of mean-pool
No L2-norm on the output — let the encoder pick magnitudes
Cosine-similarity prediction loss (not MSE)
VicReg variance + covariance (replaces SIGReg; works on unnormalised latents)
Contrastive auxiliary loss (InfoNCE on batch pairs) — the load-bearing piece. Without it, VicReg is satisfied by dropout/initial-noise even while the encoder gives a near-constant output for every input.
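
A collapse check in the spirit of the active_dims figure quoted above might look like the sketch below. The card reports only the count, so the definition used here (number of latent dimensions whose standard deviation across inputs exceeds a threshold) and the threshold value are assumptions. Given the dropout failure mode described above, the latents should come from the encoder in eval mode.

```python
import math

def active_dims(latents, threshold=1e-3):
    """Count latent dimensions whose std across inputs exceeds threshold.

    latents: list of equal-length vectors, one per input window, produced
    with dropout disabled. The threshold is a hypothetical choice.
    """
    n, d = len(latents), len(latents[0])
    count = 0
    for j in range(d):
        col = [row[j] for row in latents]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        if std > threshold:
            count += 1
    return count

# Collapsed encoder: every input maps to the same vector, so no dimension varies.
collapsed = [[0.5, -1.0, 2.0] for _ in range(4)]
# Partially healthy encoder: the third dimension is constant across inputs.
healthy = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
```

Here active_dims(collapsed) is 0 and active_dims(healthy) is 2, since only the first two dimensions vary across inputs.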

Citation

If you use this work, please cite:

@misc{opswm1_demo_2026,
  author = {Vita-Mojo Research},
  title  = {OpsWM-1: Self-supervised World Model for Restaurant Operations},
  year   = {2026},
  url    = {https://huggingface.co/Vita-Mojo/opswm1}
}

License

Apache 2.0
