GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance
这条记录涉及编程工具或代码能力更新,适合开发者评估工作流变化和可复用价值。
Summary
I found an aggregate pattern in Codex token_count
metadata: gpt-5.5
responses disproportionately land at exactly reasoning_output_tokens = 516
, with additional fixed-boundary spikes around 1034
and 1552
.
This appears model-specific and coincides with lower overall reasoning-token intensity, which may help explain degraded performance on complex/high-stakes Codex tasks.
This is related to #29353 , which reported a task-level reproduction where gpt-5.5
runs ending at exactly 516 reasoning tokens returned the wrong answer. This issue adds aggregate evidence across a larger Feb-Jun window.
I am not claiming this proves hidden chain-of-thought truncation. The narrower claim is that Codex telemetry shows a GPT-5.5-specific fixed-token clustering anomaly that looks consistent with thresholded reasoning-budget behavior.
Environment
- Product: Codex
- Model most implicated: gpt-5.5
- Data source: Codex token_count
metadata
- Time window analyzed: Feb 1-Jun 27, 2026 UTC
- Related issue: gpt-5.5 xhigh sometimes short-circuits with reasoning_output_tokens=516 and wrong final_answer in Codex Desktop #29353
Evidence
Metric Value
Response-level token records analyzed 390,195
Sessions represented 865
Exact reasoning_output_tokens = 516
events 3,363