Be Careful with Sonnet 5 Usage!

To test this new bad boy out, I ran this prompt (expecting it to think for like 40 seconds and pump out some standard information):

There is a correlation between being in America and nations like it and having more auto-immune diseases. What are the theories behind that correlation?

The damn thing ran for about 19 minutes (I timed it). And it used 28% of my 5-hour usage window on the Pro plan. Damn, man, that is costlier than Opus-4.8! It was a great answer, though. I can't accuse it of not doing that.
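For a rough sense of what that does to a budget, here is a back-of-the-envelope sketch in Python. It assumes every similar prompt would cost the same share of the window, which is a simplification, and the function name is made up for illustration.

```python
import math


def prompts_per_window(share_percent: float) -> int:
    """Return how many equally expensive prompts fit in one usage window."""
    if share_percent <= 0:
        raise ValueError("share_percent must be positive")
    # Only whole prompts count, so round down.
    return math.floor(100 / share_percent)


# One prompt used 28% of the 5-hour window (figure from the post).
print(prompts_per_window(28))
```

At 28% a pop, that is three prompts like this per window before running out.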

EDIT:

I wrote this post before reading any of the answer (I had somewhere to be in 20 minutes; family was in from Egypt and I was going to eat with them). I've since looked into what happened, and here it is:

He probably set it as a deep research task (or the wording triggered that automatically). It goes out and pulls a few hundred PDFs and blogs that are tens of pages each, plus pictures and such. It eats through your limits insanely fast with crazy numbers of input tokens lol

You're SEMI right! I definitely didn't use a research task, but for whatever reason it ran a lot of "research searches" on the internet (which I've never seen before. Are they new with the 5 series? Over the 20 minutes it ran, I watched it do a SHIT TON of searches). This is in my output (I ask my models to first state the model in use [because I might change models mid-chat and would like to know which model generated what], as well as an estimate of reasoning depth). It said this:

Claude Sonnet 5 — reasoning effort: high (extended thinking + 11 research searches [emphasis added] + 1 diagram; self-assessed from tool/thinking usage, not an exact internal metric)

The reason it chose to do that traces back to the section of my instructions where I say I like studies to be audited, meaning a review of their methodology among other things. Apparently, it pulled in 8 full papers and analyzed each one in full for that methodology audit. It admits as much here:

There are easily 20+ citable papers behind this question. I'm running the full evidence audit (study type, design, sample, effect size, credibility reasoning) on the eight studies that carry the most weight in the argument, and citing the rest more lightly by name, journal, and design. Doing the full field-by-field audit on every paper would roughly triple this response's length for little added value. Where I couldn't verify a study's funding or conflict-of-interest status from what I found, I say so rather than guessing.

For the record, I've done some testing with the same prompt with and without that section. I'm either going to delete it entirely or rewrite it heavily, using more interpretable language instead of demanding that specific analyses be completed. With newer models moving toward being "literal," as they like to call it, when you ask for 15 things, the model does exactly those 15 things, to the letter. I guess I have no one to blame but myself. But, hey, Opus-4.8 never had this issue! The primary difference is in the number of studies it would audit, and perhaps in how it chose which studies to audit. With Opus-4.8 I would generally get 1 or 2 random studies audited, which I liked, since I want to see some real papers audited when I ask about a topic that has studies of this type to audit in the first place. (I have language in that section saying only to audit bla bla bla, so that it doesn't apply an audit with p-values etc. to a study that simply has none. Before I wrote that, long ago, it would repeatedly say, "This study doesn't have bla bla bla bla, so I didn't audit it in that way.")
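One way to make a section like that less open-ended is to put an explicit cap in it. Below is a minimal sketch in Python, assuming the prompt is assembled as a plain string. The function name and the wording are hypothetical, not the actual section described above, and there is no guarantee any given model sticks to the cap.

```python
def build_audit_instruction(max_audits: int = 2) -> str:
    """Build a prompt section that caps how many studies get a full audit.

    Hypothetical wording for illustration; not the original prompt section.
    """
    if max_audits < 0:
        raise ValueError("max_audits must be zero or greater")
    if max_audits == 0:
        return (
            "Do not run full methodology audits. "
            "Cite studies by name, journal, and design only."
        )
    noun = "study" if max_audits == 1 else "studies"
    return (
        f"Run a full methodology audit on at most {max_audits} {noun}, "
        "choosing the ones that carry the most weight in the argument. "
        "Cite any other studies by name, journal, and design only. "
        "Skip audit fields that a study does not report, such as p-values."
    )


print(build_audit_instruction(2))
```

Keeping the cap as a parameter makes it easy to rerun the same question with 0, 1, or 2 audits and compare how much of the usage window each run eats.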

Also, my effort level was indeed on max. I wanted to see what this bad boy could do! I'm sure it wouldn't have gone as nuts, taking nearly 20 minutes, if it were on low or medium. I wasn't using it the way I would for coding tasks.

Here is my intense answer if anyone wants a super-researched answer to this question (lol)

View original: reddit.com
