Valid JSON, Wrong Answer: Per-Role Regression Detection and Linguistic Presupposition Labeling for Structured Output.
Breck Baldwin, validjson.com.
PDF preprint · GitHub · DOI 10.5281/zenodo.20075999
Abstract
Generating reliable structured output from large language models (JSON of the form "refundable":"True", schema-conformant tool calls, database queries) remains difficult in production. One possible fix pairs grammar-constrained decoding (for syntax) with LoRA fine-tuning (for semantics), evaluated by aggregate loss. We observe counter-current fields: per-grammar-role components whose loss rises under fine-tuning even as aggregate loss falls. We propose two mitigations rooted in a uniqueness-presupposition view of training labels: margin gating (inference-time abstention) and presupposition labeling (training-time relabeling of evidentially underdetermined examples). In tests across Qwen 2.5 at 0.5B, 7B, and 32B on Schema-Guided Dialogue and CUAD, presupposition labeling reduces constrained-content loss by 21–58% and eliminates the boolean regression at 7B and 32B.
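To make the two central ideas of the abstract concrete, here is a minimal Python sketch. It is not the paper's implementation. The function names, the role names ("structure", "string_content", "boolean"), the token-weighted aggregate, the log-probability margin rule, and all numbers are illustrative assumptions. The first function flags roles whose loss rises while the aggregate falls. The second abstains when the top two candidate values are too close to call.

```python
from typing import Dict, List, Optional


def counter_current_roles(
    base_loss: Dict[str, float],
    tuned_loss: Dict[str, float],
    token_counts: Dict[str, int],
) -> List[str]:
    """Return roles whose mean loss rose under fine-tuning while the
    token-weighted aggregate loss fell (assumed definition)."""
    total = sum(token_counts.values())
    agg_base = sum(base_loss[r] * token_counts[r] for r in token_counts) / total
    agg_tuned = sum(tuned_loss[r] * token_counts[r] for r in token_counts) / total
    if agg_tuned >= agg_base:
        return []  # aggregate did not improve, so nothing runs counter to it
    return sorted(r for r in token_counts if tuned_loss[r] > base_loss[r])


def margin_gate(
    candidate_logprobs: Dict[str, float], min_margin: float
) -> Optional[str]:
    """Return the top candidate, or None (abstain) when the gap between the
    two best log-probabilities is below min_margin (assumed gating rule)."""
    ranked = sorted(candidate_logprobs.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (best, best_lp), (_, second_lp) = ranked[0], ranked[1]
    return best if best_lp - second_lp >= min_margin else None


# Made-up per-role mean losses before and after fine-tuning.
base = {"structure": 0.40, "string_content": 1.20, "boolean": 0.30}
tuned = {"structure": 0.10, "string_content": 0.80, "boolean": 0.45}
counts = {"structure": 500, "string_content": 400, "boolean": 100}

flagged = counter_current_roles(base, tuned, counts)
confident = margin_gate({"true": -0.1, "false": -2.5}, min_margin=1.0)
uncertain = margin_gate({"true": -0.65, "false": -0.74}, min_margin=1.0)
print(flagged, confident, uncertain)
```

In this toy example the aggregate loss drops from 0.71 to 0.415, yet the "boolean" role gets worse, which is the pattern an aggregate-only evaluation would hide. The gate returns a value for the well-separated candidates and abstains on the near-tie.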
Reproduce every paper number on a laptop
The supplementary material on GitHub ships pre-computed result JSONs and the table-builder scripts. Every number in the paper can be derived without a GPU:
git clone https://github.com/validjson/valid-json-wrong-answer
cd valid-json-wrong-answer
python3 scripts/build_tables.py
A full from-scratch reproduction of the experiments takes ~5 hours on a single A100 80GB.
Citation
@misc{baldwin2026validjson,
  title  = {Valid JSON, Wrong Answer: Per-Role Regression Detection and
            Linguistic Presupposition Labeling for Structured Output},
  author = {Baldwin, Breck},
  year   = {2026},
  doi    = {10.5281/zenodo.20075999},
  url    = {https://zenodo.org/records/20075999}
}