Hi HF community! I just published carosh/cli-1m — the first large-scale multilingual dataset for natural-language → shell command generation.
**Numbers:**
- 975,933 training pairs
- 6 shells: bash, zsh, fish, PowerShell, nushell, oils-osh
- 13 languages
- 18 industry buckets
- 108× NL2Bash (the previous reference dataset)
**Load it:**
```python
from datasets import load_dataset
# SFT training
ds = load_dataset(“carosh/cli-1m”, split=“train”)
# 50k browse-friendly subset
ds = load_dataset(“carosh/cli-1m”, name=“sample”, split=“train”)
# Domain-specific (security only)
ds = load_dataset(“carosh/cli-1m”, name=“domains”, split=“security”)
```
**Interactive explorer:** CLI-1M Explorer - a Hugging Face Space by carosh
**Help wanted:** Looking for native speakers of Hebrew, Arabic, Hindi, Korean, or Russian to spot-check 50 translations each (~30 min, full credit in dataset card). DM @CaroDaShellShib on X or reply here to join the contributor waitlist.
Apache-2.0. Feedback welcome.