[Dataset] CLI-1M: 975K NL→shell pairs — 13 languages, 6 shells, Apache-2.0

kobi-kadosh · May 14, 2026, 9:08pm

Hi HF community! I just published carosh/cli-1m — the first large-scale multilingual dataset for natural-language → shell command generation.

**Numbers:**

- 975,933 training pairs

- 6 shells: bash, zsh, fish, PowerShell, nushell, oils-osh

- 13 languages

- 18 industry buckets

- 108× NL2Bash (the previous reference dataset)

**Load it:**

```python

from datasets import load_dataset

# SFT training

ds = load_dataset(“carosh/cli-1m”, split=“train”)

# 50k browse-friendly subset

ds = load_dataset(“carosh/cli-1m”, name=“sample”, split=“train”)

# Domain-specific (security only)

ds = load_dataset(“carosh/cli-1m”, name=“domains”, split=“security”)

```

**Interactive explorer:** CLI-1M Explorer - a Hugging Face Space by carosh

**Help wanted:** Looking for native speakers of Hebrew, Arabic, Hindi, Korean, or Russian to spot-check 50 translations each (~30 min, full credit in dataset card). DM @CaroDaShellShib on X or reply here to join the contributor waitlist.

Apache-2.0. Feedback welcome.

Topic		Replies	Views
Translation for Indian languages With CoT Research	4	43	July 8, 2025
Tune mT5 for translation of natural language requests to bash Beginners	1	289	May 23, 2023
Zest, a fine tuned a small Qwen model to work as a command line assistant Show and Tell	0	42	March 8, 2026
How to make a translation dataset Beginners	3	3160	November 18, 2023
Wikilangs - Open NLP for 340+ Wikipedia Languages 🌐 Languages at Hugging Face	0	39	March 8, 2026

[Dataset] CLI-1M: 975K NL→shell pairs — 13 languages, 6 shells, Apache-2.0

Related topics