IAPS · Commerce Control List benchmark

CommodityBench

Can language models assign a commodity its correct Export Control Classification Number — the task BIS performs against the Commerce Control List? Each model classifies the same items unaided and with tools that let it read the actual control list. Scores pool every run to date.

The task

Commodity classification assigns an item — hardware, software, or technology — its Export Control Classification Number (ECCN) on the US Commerce Control List. An ECCN is a five-character code: a category digit, a product-group letter, and three identifying digits, sometimes with subparagraphs (e.g. 3A001.a.1.a). An item subject to the export rules but matching no control-list entry is EAR99.
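The structure described above can be sketched as a small parser. This is an illustration only, not CommodityBench's own code: the names `ECCN`, `ECCN_PATTERN`, and `parse_eccn` are hypothetical, and the restriction of product-group letters to A through E reflects the published Commerce Control List rather than anything stated on this page.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Five-character core (category digit, group letter, three digits),
# optionally followed by dotted subparagraphs such as ".a.1.a".
# Assumption: product groups run A-E, as on the published CCL.
ECCN_PATTERN = re.compile(r"^(\d)([A-E])(\d{3})((?:\.[A-Za-z0-9]+)*)$")


class ECCN(NamedTuple):
    category: str
    group: str
    number: str
    subparagraphs: Tuple[str, ...]


def parse_eccn(code: str) -> Optional[ECCN]:
    """Split an ECCN into its segments.

    Returns None for EAR99 (subject to the rules but not on the list)
    and raises ValueError for anything that is not a well-formed code.
    """
    text = code.strip()
    if text.upper() == "EAR99":
        return None
    match = ECCN_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a well-formed ECCN: {code!r}")
    category, group, number, subs = match.groups()
    # subs looks like ".a.1.a"; drop the empty piece before the first dot.
    subparagraphs = tuple(part for part in subs.split(".") if part)
    return ECCN(category, group, number, subparagraphs)


print(parse_eccn("3A001.a.1.a"))
```

Returning `None` for EAR99 keeps "not on the list" distinct from a malformed string, which matters when predictions are later sorted by error type.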

The benchmark

CommodityBench pairs real commodities with the ECCN their manufacturer or another authoritative source assigned, confirmed by human review. Each model reads an item's name and description and returns an ECCN — once unaided, and once with tools that let it read the actual control list. Answers are graded on a partial-credit scale, from the exact code down to the right category.
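One way to picture a partial-credit scale of this kind is to grade by how many leading segments of the prediction match the verified code. The sketch below is an assumption-laden illustration: the function names and the specific weights (0.25, 0.5, 0.75, 1.0) are invented for the example and are not the benchmark's published rubric.

```python
from typing import List


def segments(code: str) -> List[str]:
    """Break '3A001.a.1' into ['3', 'A', '001', 'a', '1']; EAR99 stays whole."""
    text = code.strip()
    if text.upper() == "EAR99":
        return ["EAR99"]
    head, *subs = text.split(".")
    if len(head) != 5:
        raise ValueError(f"expected a five-character ECCN core, got {head!r}")
    return [head[0], head[1], head[2:], *subs]


def match_depth(predicted: str, verified: str) -> int:
    """Count leading segments on which the prediction agrees with the verified code."""
    depth = 0
    for pred_seg, gold_seg in zip(segments(predicted), segments(verified)):
        if pred_seg != gold_seg:
            break
        depth += 1
    return depth


def grade(predicted: str, verified: str) -> float:
    """Hypothetical weights: exact 1.0, full five-character core 0.75,
    category and group 0.5, category only 0.25, otherwise 0.0."""
    if segments(predicted) == segments(verified):
        return 1.0
    depth = match_depth(predicted, verified)
    return {0: 0.0, 1: 0.25, 2: 0.5}.get(depth, 0.75)


for guess in ["3A001.a.1.a", "3A001", "3A002", "3B001", "EAR99"]:
    print(guess, grade(guess, "3A001.a.1.a"))
```

Under these assumed weights, a prediction that names the right entry but omits the subparagraphs scores 0.75, while one that only lands in the right category scores 0.25.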

Why it matters

An item's ECCN determines which export-control requirements apply to it; the Bureau of Industry and Security classifies commodities to set license rules and support enforcement. IAPS has argued BIS should adopt AI tools to augment a limited workforce, so knowing how reliably current models perform this task is a prerequisite for deploying them in such a high-sensitivity setting.

01

Leaderboard

Every model, unaided and with CCL tools. Marks show mean grade on the verified set; the line is the distance tools move each model. Models without a filled mark were not run with tools.

Legend: no tools · with CCL tools
Columns: Model | Exact (no tools) | Grade (no tools) | Grade (tools) | Runs | Obs

02

Within-model tool uplift

For the models run in both conditions, how much does reading the control list change the score, overall and by CCL category? Grade scale 0–1.
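The within-model comparison can be sketched as a difference of mean grades, restricted to models that have runs in both conditions. The record layout, the condition labels `"notool"` and `"tools"`, and the model names below are all hypothetical placeholders, and the numbers are made up for illustration rather than taken from the leaderboard.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def tool_uplift(records: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """records: (model, condition, grade) triples, grade on the 0-1 scale.

    Returns mean(tools) - mean(notool) for each model seen in both conditions.
    """
    grades = defaultdict(lambda: defaultdict(list))
    for model, condition, grade in records:
        grades[model][condition].append(grade)

    uplift = {}
    for model, by_condition in grades.items():
        # Skip models that were never run with tools (or never run unaided).
        if "notool" in by_condition and "tools" in by_condition:
            uplift[model] = mean(by_condition["tools"]) - mean(by_condition["notool"])
    return uplift


# Made-up example data: model-b has no tool runs, so it is left out.
example = [
    ("model-a", "notool", 0.5),
    ("model-a", "notool", 0.5),
    ("model-a", "tools", 1.0),
    ("model-a", "tools", 0.5),
    ("model-b", "notool", 0.25),
]
print(tool_uplift(example))
```

A per-category breakdown follows the same pattern with the CCL category digit added to the grouping key.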

03

What the models get wrong

Every prediction sorted by how it misses: pushing an uncontrolled item onto the list (over-classification), calling a controlled item EAR99 (under-classification), or landing in the wrong place on the list. Distribution first, then individual cases with the model's reasoning and tool trace.
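The three ways of missing can be expressed as a small rule over the predicted and verified codes. This is a minimal sketch with hypothetical function and label names; it compares codes case-insensitively, which is an assumption, and it treats any mismatch between two on-list codes (including a subparagraph-only difference) as "wrong place".

```python
from collections import Counter


def error_type(predicted: str, verified: str) -> str:
    """Sort one prediction into correct / over / under / wrong place."""
    pred = predicted.strip().upper()
    gold = verified.strip().upper()
    if pred == gold:
        return "correct"
    if gold == "EAR99":
        # An uncontrolled item was pushed onto the list.
        return "over-classification"
    if pred == "EAR99":
        # A controlled item was called EAR99.
        return "under-classification"
    return "wrong place"


# Made-up (predicted, verified) pairs for illustration.
pairs = [
    ("3A001.a.1.a", "3A001.a.1.a"),
    ("5A002", "EAR99"),
    ("EAR99", "3A001"),
    ("3B001", "3A001"),
]
distribution = Counter(error_type(p, v) for p, v in pairs)
print(distribution)
```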

Representative cases · predictions with CCL tools

One item per error type. Chips are the predicted ECCN split into its segments — category · group · number · subparagraphs — lit to the depth each model matched the verified code.