CommodityBench — LLM export-control classification

The task

Commodity classification assigns an item (hardware, software, or technology) its Export Control Classification Number (ECCN) on the US Commerce Control List (CCL). An ECCN is a five-character code: a category digit, a product-group letter, and three identifying digits, sometimes followed by subparagraphs (e.g. 3A001.a.1.a). An item that is subject to the Export Administration Regulations (EAR) but matches no control-list entry is designated EAR99.
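As a rough illustration of the code structure described above, the sketch below splits an ECCN string into its segments. The function name `split_eccn` and the regular expression are assumptions made for this example, not part of the benchmark; the pattern assumes categories 0-9 and product groups A-E.

```python
import re
from typing import Tuple

# Assumed shape: category digit, product-group letter (A-E), three digits,
# then zero or more dot-separated subparagraphs such as ".a.1.a".
ECCN_RE = re.compile(r"^([0-9])([A-E])([0-9]{3})((?:\.[A-Za-z0-9]+)*)$")


def split_eccn(code: str) -> Tuple[str, ...]:
    """Return the segments of an ECCN, or ("EAR99",) for an unlisted item."""
    code = code.strip()
    if code.upper() == "EAR99":
        return ("EAR99",)
    match = ECCN_RE.match(code)
    if match is None:
        raise ValueError(f"not a recognisable ECCN: {code!r}")
    category, group, number, subs = match.groups()
    # subs looks like ".a.1.a"; drop the empty piece before the first dot.
    sub_parts = tuple(part for part in subs.split(".") if part)
    return (category, group, number) + sub_parts


print(split_eccn("3A001.a.1.a"))
```

Returning a flat tuple keeps later comparisons simple: two codes agree to a given depth exactly when their tuples share a prefix of that length.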

The benchmark

CommodityBench pairs real commodities with the ECCN their manufacturer or another authoritative source assigned, confirmed by human review. Each model reads an item's name and description and returns an ECCN — once unaided, and once with tools that let it read the actual control list. Answers are graded on a partial-credit scale, from the exact code down to the right category.
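The text does not give the actual grading weights, so the following is only a sketch of what a partial-credit scale "from the exact code down to the right category" could look like. The function name `grade` and the values 1.0, 0.75, 0.5, and 0.25 are hypothetical choices for illustration.

```python
def grade(predicted: str, verified: str) -> float:
    """Hypothetical partial-credit grade; the real weights are not stated."""
    p = predicted.strip().upper()
    v = verified.strip().upper()
    if p == v:
        return 1.0   # exact code, including any subparagraphs
    if "EAR99" in (p, v):
        return 0.0   # one side is off the list and the other is on it
    if p[:5] == v[:5]:
        return 0.75  # same five-character ECCN, different subparagraph
    if p[:2] == v[:2]:
        return 0.5   # same category and product group
    if p[:1] == v[:1]:
        return 0.25  # same category only
    return 0.0


print(grade("3A001.b", "3A001.a.1.a"))
```

Any monotone scale would serve the same purpose; the point is only that a deeper match earns more credit.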

Why it matters

An item's ECCN determines which export-control requirements apply to it; the Bureau of Industry and Security (BIS) classifies commodities to set license rules and support enforcement. IAPS has argued that BIS should adopt AI tools to augment a limited workforce, so knowing how reliably current models perform this task is a prerequisite for deploying them in such a high-sensitivity setting.

Leaderboard

Every model, unaided and with CCL tools. Marks show mean grade on the verified set; the line is the distance tools move each model. Models without a filled mark were not run with tools.
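The quantities in the chart can be read as a mean grade per condition and the difference between the two. A minimal sketch, assuming per-item grades are available as plain lists; `tool_uplift` is a name invented for this example:

```python
from statistics import mean
from typing import List, Optional


def tool_uplift(no_tool: List[float],
                with_tools: Optional[List[float]]) -> Optional[float]:
    """Mean grade with tools minus mean grade without; None if tools were not run."""
    if not with_tools:
        return None  # corresponds to a model with no filled mark
    return mean(with_tools) - mean(no_tool)


print(tool_uplift([0.5, 0.5], [1.0, 0.5]))
```

A positive value means the CCL tools helped the model on average; a negative value means they hurt.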

Legend: no tools · with CCL tools

Model	Exact · no tools	Grade · no tools	Grade · tools	Runs	Obs

What the models get wrong

Every prediction sorted by how it misses: pushing an uncontrolled item onto the list (over-classification), calling a controlled item EAR99 (under-classification), or landing in the wrong place on the list. Distribution first, then individual cases with the model's reasoning and tool trace.
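The three kinds of miss can be told apart from the predicted and verified codes alone. The sketch below is one way to do that, with `error_type` and the label strings chosen for this example rather than taken from the benchmark:

```python
from collections import Counter


def error_type(predicted: str, verified: str) -> str:
    """Label a prediction by how it misses the verified ECCN."""
    p = predicted.strip().upper()
    v = verified.strip().upper()
    if p == v:
        return "correct"
    if v == "EAR99":
        return "over-classification"   # uncontrolled item pushed onto the list
    if p == "EAR99":
        return "under-classification"  # controlled item called EAR99
    return "wrong place on the list"


# Illustrative (predicted, verified) pairs, not benchmark data.
pairs = [("3A001", "EAR99"), ("EAR99", "5A002"), ("3A002", "3A001")]
distribution = Counter(error_type(p, v) for p, v in pairs)
print(distribution)
```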

Representative cases · predictions with CCL tools

One item per error type. Chips are the predicted ECCN split into its segments — category · group · number · subparagraphs — lit to the depth each model matched the verified code.
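To make the chip display concrete, here is a sketch of how the "lit" state could be computed: a chip stays lit only while every segment up to and including it matches the verified code. The helper names are hypothetical, and treating each subparagraph level as its own chip is an assumption of this example.

```python
import re
from typing import List, Tuple


def segments(code: str) -> List[str]:
    """Split e.g. '3A001.a.1.a' into ['3', 'A', '001', 'a', '1', 'a']."""
    head, *subs = code.strip().split(".")
    match = re.fullmatch(r"([0-9])([A-E])([0-9]{3})", head)
    if match is None:
        raise ValueError(f"not an ECCN: {code!r}")
    return [*match.groups(), *subs]


def lit_chips(predicted: str, verified: str) -> List[Tuple[str, bool]]:
    """Pair each predicted segment with whether it is lit."""
    pred, gold = segments(predicted), segments(verified)
    chips = []
    still_matching = True
    for i, seg in enumerate(pred):
        # Once one segment misses, every deeper chip stays unlit.
        still_matching = still_matching and i < len(gold) and seg == gold[i]
        chips.append((seg, still_matching))
    return chips


print(lit_chips("3A001.a.2", "3A001.a.1.a"))
```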
