▲Chain of Recursive Thoughts: Make AI think harder by making it argue with itselfgithub.com

539 points by miles 139 days ago | 239 comments

dudeinhawaii 139 days ago [-]

I see a lot of threads pitting models against each other (or whole swarms of them) in the hope that "wisdom of crowds" will magically appear. After a stack of experiments of my own—and after watching the recent ASU/Microsoft-Research work [1].. I've landed on a simpler takeaway:

An LLM is a terrible verifier of another LLM. Subbarao Kambhampati's "(How) Do LLMs Reason/Plan?" talk shows GPT-4 confidently producing provably wrong graph-coloring proofs until a symbolic SAT solver is introduced as the referee [1]. Stechly et al. quantify the problem: letting GPT-4 critique its own answers *reduces* accuracy, whereas adding an external, sound verifier boosts it by ~30 pp across planning and puzzle tasks [2]. In other words, verification is *harder* than generation for today's autoregressive models, so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).

Because of that asymmetry, stacking multiple LLMs rarely helps. The "LLM-Modulo" position paper argues that auto-regressive models simply can't do self-verification or long-horizon planning on their own and should instead be treated as high-recall idea generators wrapped by a single, sound verifier [3]. In my tests, replacing a five-model "debate" with one strong model + verifier gives equal or better answers with far less latency and orchestration overhead.

[1] https://www.youtube.com/watch?v=0u2hdSpNS2o - (How) Do LLMs Reason/Plan? (talk at Microsoft Research, 11 Apr 2025)

[2] https://arxiv.org/abs/2402.08115

[3] https://arxiv.org/abs/2402.01817 (related to the talk in #1)

zurfer 139 days ago [-]

Your references show me that it is absolutely task depended. In many domains it's true that "criticizing is easier than creating".

The best example might be books and movies, where it's trivial to say the characters were shallow, but it's surprisingly hard to create deeply interesting characters.

In Software Engineering, there are similar dynamics. An LLM with a security vuln finding prompt will be able to point out places, where the generated code might be insecure.

But if you want another LLM to find a reasoning mistake in a mathematical proof it basically has to do all the reasoning work as well. In which case I doubt there will be any significant performance gains.

aoeusnth1 138 days ago [-]

In principle, Math proofs are another relatively easy to verify problem. In the extreme case, you can express any math proof as a computer-verifiable formalism — no intelligence necessary. Step back one step, and you could have a relatively weak model translate a proof into verifiable formalism and then use a tool call to run the verification. Coming up with the proof is an expensive search process, while verifying it is more mechanical. Even if it is not completely trivial to make the proof computer-verifiable, it might still be a vastly easier task compared to finding the proof in the first place.

simulator5g 138 days ago [-]

An LLM cannot reason through a mathematical proof, it would be something other than an LLM if it could.

mycall 138 days ago [-]

LLM is a overloaded term now as ML models can do tool calls, or MoE segmentation can have specialized solvers embedded... but people will call all variations LLMs.

meander_water 139 days ago [-]

For better or worse this has become the defacto standard in LLM Evaluation research papers since the LLM as a Judge paper [0] came out. Its also heavily embedded into frameworks like LangChain and LlamaIndex to evaluate RAG pipelines.

[0] https://arxiv.org/abs/2306.05685

[1] https://arxiv.org/abs/2411.15594

swyx 139 days ago [-]

its for the better, and i'm actually serious about this. it's just that Subbarao is ALSO right and it is not perfect nor human level. but it -DOES- improve results measurably and consistently.

so what i'm saying is don't throw the baby out with the bathwater. LLM as judge doesnt replace human judgement but its a pretty darn good first pass for how cheap it is. and you can imagine that it will get better over time.

hu3 139 days ago [-]

> ...so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).

Agree. What do you think about telling the LLM to also generate unit tests for the code it spits and then run all tests (including previous application unit tests).

I think this is a way to ensure some level of grounded verification:

- Does code compile?

- Do unit test pass?

AI can then consume test results to help fix their own mistakes.

nojs 139 days ago [-]

This works well but only if you eyeball the tests and edit them a bit in my experience. Otherwise it gets lazy and makes them trivial to pass. Also, you’ve often gotta explicitly tell it not to hardcode test cases in the solution to make them pass.

eru 138 days ago [-]

> Also, you’ve often gotta explicitly tell it not to hardcode test cases in the solution to make them pass.

You can use property based testing for that.

But I've often run into cases where the AI gets into a vicious spiral of worse and worse code when you keep feeding it the test failures.

macrolime 139 days ago [-]

Yet, having the LLM evaluate the tests often works well. Especially if you ask it to look for hardcoded test cases.

dwaltrip 139 days ago [-]

Definitely, test runners are a way to ground the model and give it a feedback loop. Not a silver bullet but can be very helpful.

keepamovin 139 days ago [-]

I believe, what the smart AI company is trying to do, right now, in secret, is to use US, the humans, and our replies to the AIs, as training for the next generation of self-verifying-models. :)

Training on corpus data gets you to 1 order of magnitude. But training on interactive data where you can observe and adapt to the OODA-loop? So much more powerful.

At least, that's what I'd be doing if I were doing AI :)

But I just do BrowserBox

captainbland 138 days ago [-]

I think you'd need to screen for quality of response quite stringently as loads of people will produce "corrections" which are just plain wrong.

keepamovin 138 days ago [-]

Good point! But you could probably identify "super users" who are the ones whose responses you want to mine hahaha :)

mcswell 138 days ago [-]

I assume everyone knows this, but the idea of generating answers and testing them, dates back decades, and has been widely used for problems where generating _the_ correct answer(s) is difficult, but where generating a bunch of potential answers--(at least) one of which is likely correct--is easier. Generate-and-test of course relies on having a test algorithm that is reliable, (relatively) fast, and memory efficient, and is most useful when an exact generate algorithm (one that generated only the correct answer(s)) is either slow or inefficient of memory use (or both).

In the case described, the generator is an LLM, and the tester (called a "verifier") is "the compiler, linter, SAT solver, ground truth dataset, etc."

And of course generate-and-test is related to trial-and-error, which has probably existed since the Paleolithic.

foobiekr 138 days ago [-]

"letting GPT-4 critique its own answers reduces accuracy"

This is because the output, being the input, steers directly into the tree as soon as the tree is in the context window.

ashu1461 139 days ago [-]

Would a LLM under human guidance turn out to be a good verifier ? i.e. if LLM knows the rules to verify or has enough data points (internet access, actual responses)

eru 138 days ago [-]

Of course, that only works for problems where you have a verifier.

autokad 138 days ago [-]

actually, I found that you can definitely yield better results. I ran an experiment with 1 prompt at temperature 0 and 9 with temperature 1.

I found the most anomalous response was as good (15/20) or better (5/20) than the temperature 0 response in 20 samples.

odo1242 139 days ago [-]

Something I do sometimes is:

- Have an AI chat model come up with an answer to a problem.

- Have it write a report discussing the details of the problem and why it's answer is correct, directed at a person or AI model who has no knowledge of the initial problem or technical field.

- Have a second AI model with no knowledge of the problem grade the report, and write it's own report either (a) asking for clarification / more information about the problem that the original model didn't provide or (b) pointing out an inconsistency in the argument posed by the original model. Give this report back to the original model and ask it to write it's own report back with either the necessary information or changes.

- Repeat until either the second AI model is convinced by the first AI model's explanation or the first AI model has implemented all the changes requested by the second AI model.

It's super clunky but has given pretty good results in the cases where I tried it lol

ASalazarMX 139 days ago [-]

Ah, now we know why Spain was out of electricity yesterday.

Cthulhu_ 139 days ago [-]

Here I was thinking cryptocurrency pre-heated the grids (and GPU manufacturing) for us already.

danparsonson 139 days ago [-]

Oh that was a good one XD

StopDisinfo910 139 days ago [-]

For anything semi-adversarial, I have had good results asking the AI to come up with a plan, then take the side of the opponent coming with counter play/way to defeat the plan, finally asking for a revision of the initial plan given the potential reaction from the opponent.

The final plan you obtain is generally a lot more well rounded and thought out.

I find that amusing because the technique also works when I apply it to me. Picking flaws in your plan before revisiting it actually works.

meander_water 139 days ago [-]

To be honest, this is what I assumed this repo was doing from the title. It talks about arguing with itself, but it looks like it's just generating multiple alternative responses in parallel and selecting the best one.

Do you find your method handles "sycophancy" well?

StopDisinfo910 139 days ago [-]

I don’t really know.

I stopped using ChatGPT at some point because I disliked how cagey it became about a lot of topics. I used to enjoy making write improbable movies mashup when GPT3 was released and at some point it became very touchy about IP rights and violence which was annoying.

I generally use Deepseek nowadays which is not sycophantic and surprisingly doesn’t seem as censored to me especially if you use a version not hosted by Deepseek themselves.

lblume 138 days ago [-]

Which hosting service would you recommend?

zoogeny 139 days ago [-]

I do the same, and I have one other technique.

I will often have a few chats going for a project, but with different contexts. For example, one might be tech focused, another marketing focused, another with some context on my personal goals, etc.

So I will take the same question and feed it into the chats with differing context. It is almost like having different perspectives on the same problem. And the conclusions can often differ based on the differing contexts.

odie88 138 days ago [-]

This is how I’ve been using Gemini and it’s the first time I’m really seeing consistent value.

I’ll get a context into a solid place with as much information as I can about a project. Usually getting up to 100k tokens.

Then I ask it to give me a summary I can use in a fresh chat, that will maintain the current context. This lets me reclaim space, bring responsiveness back to sane levels, have a baseline chat I use to spin up branches for marketing, design (it’s pretty helpful at trouble shooting Substance Designer graphs), etc.

I’ve found myself going into sub branches from there… like a marketing context that pushes branches into different marketing channels.

jsight 139 days ago [-]

This reminds me a lot of the YT video that went over using Monte Carlo Tree Search with LLMs to maximize result quality. Link: https://www.youtube.com/watch?v=mfAV_bigdRA&ab_channel=Treli...

It seemed like a pretty good idea, though I'd guess that it would greatly increase token usage. I'd also be concerned that the LLM as a judge might struggle to grade things accurately if it wasn't also able to generate good enough answers to begin with.

looofooo0 139 days ago [-]

If you think about marginal cost, such experiments can be run almost at only the cost of extra electricity used for that computation, which in Europe is often zero, at least by the ones who own the compute.

JumpCrisscross 139 days ago [-]

Kagi’s Assistant feature makes this super easy. Just switch assistants and ask them to check the other’s work.

BOOSTERHIDROGEN 139 days ago [-]

How?

nativeit 139 days ago [-]

Ask the AI assistant for instructions.

Pretty soon we'll have new acronyms such as "IDKATFAIA" ["I don't know, ask the f'ing AI already"] as we all succumb to the knowledge soup.

dalmo3 139 days ago [-]

RTFP

factotvm 139 days ago [-]

Read The Fine Prompt, more or less, right?

BOOSTERHIDROGEN 138 days ago [-]

Honestly, the AI assistant isn't as smart as I thought - I'm still having to check its work.

subscribed 139 days ago [-]

I do it all the time in Sillytavern in a group chat - three characters kind of resembling what you just described, and me, participating in the "conversation", them going back and forth until they're satisfied.

With a good model role playing them, works awesome.

hsuduebc2 139 days ago [-]

We're there any situation that first conclusion from AI was completely changed? Can you give generally examples of situations where it changed or significantly improved overall result? It sounds cool.

nomel 139 days ago [-]

I would be interested to know how ofter "oscillations" occur, where they flip flop from being too "agreeable" to challenges (which probably is just a sparse latent space). This happens to me pretty frequently, where you can repeatedly say "no that's wrong" and the LLM will do a 180, explaining why it was "in fact" wrong and you are "right", repeat.

itissid 139 days ago [-]

Isn't this kind of another way of how Inference Time Scaling works? It will basically produce several chain of thoughts and then pursue one that has maximum reward based on an internal function?

pessimizer 139 days ago [-]

I've wondered if it might be helpful to randomly "shard" training data between two LLMs; just feed half the training data to one, and the rest to the other, with no overlap.

So instead of using two models, you'd be making two halves of one model do a similar (deliberative) process to yours. I wonder if that would result in a benefit over a single model with the full training set, and if you could continue to do the same thing by sharding the shards.

ijk 139 days ago [-]

There's some precedent for that: you can do some useful things with the cross entropy of the two models. And k-fold cross validation might also be relevant.

aprilthird2021 139 days ago [-]

This takes such a long time to do though, no? What problems does this save you time on?

dustingetz 139 days ago [-]

i dont understand, is it doing your schoolwork?

Lerc 139 days ago [-]

I kind of want to try something like this at a larger scale in an always-on mode where I have a 'senate' of debate. Rather than responding to prompts on a case by case basis, provide a list of tasks (potentially with deadlines) and let the senate work on them, break off into groups to manage subtasks, challenge results , make suggestions. Even potentially a tree of analysts where suggestions only gets passed up the tree when the parent node thinks a lower analysis is particularly insightful.

I definitely think that directing models to approach a problem from a specific perspective can generate better or worse results. Creating a diverse set of perspectives along with critical analysis of their results should be able to produce some impressive results.

Things like this would generate a massive number of tokens, but the cost per token is definitely heading in the right direction to allow for this. There is also the possibility of setting up an AI only IRC server where anybody can connect their own models for a shared debating chamber.

mikepurvis 139 days ago [-]

In doing some DevOps-y type tasks recently (ansible, packer, docker, baking images with guestfish), I've found it very frustrating how much ChatGPT will confidently tell me to use flags on tools that don't exist, or hallicinate completely non-existent functions or behaviours. And then when I spend time trying what it suggests only to hit a wall and come back like wtf mate it breezily goes "oh yes so you're right, good job figuring that out! You're so close now! Your next step is to do X and Y," and then serves up the same detailed tutorial as before but with the flag or whatever it was that it had wrong subtly changed.

It definitely makes me feel like I'm dealing with an overenthusiastic intern who is throwing stuff over the wall without checking their work, and like maybe having a second bot sitting in front of the first one being like ARE YOUR SURE ABOUT THAT could really improve things.

MoonGhost 139 days ago [-]

You can't get more info from LLMs than it actually holds. Like Anthropic pointed if LLMs knows the name but has no other info it starts hallucinating. The same probably happens here. LLM knows there must be a flag but can't remember all of them. Likely short reminder in prompt will help. (or search web for GPT) Just my $0.02.

mikepurvis 139 days ago [-]

It certainly feels like you can just by challenging it; then it happily finds other paths to what you want. So maybe internally it needs a second voice encouraging it to think harder about alternatives upfront.

buu700 139 days ago [-]

The fact that you can more info from an LLM than it holds is actually a pithy description of this whole challenge.

0x20cowboy 139 days ago [-]

I did a stint in Devops and I found every models to be like this for all of the infra-as-code languages. Anything yaml based was especially bad.

Even Amazon’s own offering completely made things up about Amazon’s own formats.

I’d be curious as to why that is. It seems like there would be enough training data, and for Amazon in particular it seems like they could make a validation tool the model could use.

mikepurvis 139 days ago [-]

Maybe I'm excessively anthropomorphizing, but it does feel a bit analogous to my own thought process, like "I need feature XYZ, and based on other tools I'm more familiar with it should be an --xyz flag, so let me google for that and see if I'm right or if I instead find a four-year-old wontfix on Github where someone asked for what I need and got denied."

Except... the model is missing that final step; instead it just belches out its hypothesis, all dressed up in chirpy, confident-sounding language, certain that I'm moments away from having everything working just perfectly.

meander_water 139 days ago [-]

Cursor has a neat feature where you can upload custom docs, and then reference them with @Docs. I find this prevents hallucinations, and also using a reasoning model

organsnyder 139 days ago [-]

I've enjoyed watching Claude try running commands with incorrect flags, trying them, and then adapting.

corvus-cornix 138 days ago [-]

I've also found LLMs to perform poorly at DevOps tasks. Perhaps there's a lack of training data. On the bright side this hints at better job security for platform engineers.

vunderba 139 days ago [-]

100%. This has happened enough to me that I wished I could just inject the man page docs into it to at least act as a sanity check.

nonelog 139 days ago [-]

Spot on.

vunderba 139 days ago [-]

A year or so ago I experimented with splitting a user prompt down to a set of "different AI personas" that would each try to approach the user's problem in a different way and then bubble back up with a master arbiter for consensus.

I modeled it after the concept of advisors from Civilization II. It worked reasonably well though I think it was at least somewhat limited by being constrained to a single LLM (Mistral). It also lit my computer on fire.

bee_rider 139 days ago [-]

What sort of personalities did you try? A group where some members have grudges against each other and will irrationally poke holes in each other’s plans could be a fun experiment.

throwup238 139 days ago [-]

With multiple groups with external and internal rivalries. The Always Sunny gang versus The IT Crowd.

vintermann 138 days ago [-]

I have played Disco Elysium, and can confirm that a bunch of inner voices arguing with each other can be fun.

nonethewiser 139 days ago [-]

In theory couldnt this just be baked into a single adversarial model?

RevEng 139 days ago [-]

Not entirely. Since generation is auto regressive, the next token depends on the previous tokens. Whatever analysis and decisions it has spit out will influence what it will do next. This tends to cause it to be self reinforcing.

But it's also chaotic. Small changes in input or token choices can give wildly different outcomes, particularly if the sampling distributions are fairly flat (no one right answer). So restarting the generation with a slightly different input, such as a different random seed (or in OP's case, a different temperature) can give wildly different outcomes.

If you try this, you'll see some examples of it vehemently arguing it is right and others equally arguing it is wrong. This is why LLM as judge is so poor by itself, bit also why multiple generations like used in self-consistency can be quite useful at evaluating variance and therefore uncertainty.

tonmoy 139 days ago [-]

Yes, but I guess the model is optimized for relatively quick response, whereas these techniques are allowing the model to spend more time to generate a higher quality response

Lerc 139 days ago [-]

To an extent, but different models are better at different things.

That is something I'm also curious about. Given models (that use the same tokenisation) that are better at different things, would their be interesting things to find by analysing the logprobs for tokens generated from identical inputs (including cross feeding the generated token from one to another)

Surely there must be something notable at particular points when a model goes off on the wrong path.

crowcroft 139 days ago [-]

Like, just endlessly grinding tokens, then processing the output and pulling out good ideas when the endless debate generates them?

Would be interesting what it comes up with with enough time and tokens.

danielmarkbruce 139 days ago [-]

This is being done, and you could apply it to a lot of domains. Go for it for whatever use case you have.

kmacdough 139 days ago [-]

These ensembles have been tested throughout AI progress. Well scaffolded larger models have historically come out ahead in both quality and speed/cost.

Perhaps this is a parricularly effective ensemble, but I would need to see real data.

nativeit 139 days ago [-]

Yeah, but we'll finally get definitive proof that the government's been hiding super-intelligent axolotls from us all.

taneq 139 days ago [-]

A society of mind, if you will. :)

This sounds like a fun thing to set up with a quick-enough local model.

139 days ago [-]

cube2222 139 days ago [-]

This is really cool!

One strategy I often use (which is much simpler and more limited than this), is to finish my message with: “Please do a round of thinking in <thinking></thinking> tags, then a round of self-critique in <critique></critique> tags, and then a final round of <thinking>, before responding.”

It works very well. Similarly just asking it to “find the 5 biggest issues with its proposal” works pretty good (the 5 forcing it to find something, even if it’s mostly irrelevant).

zoogeny 139 days ago [-]

This is one of the reasons I like the massive context window in Gemini. You can do this as part of the message chain. I don't try to one shot it, just use the same idea across 3 messages.

1. Figure out a plan (it responds with the plan)

2. Point out flaws in the plan (it responds with the flaws)

3. Update the plan to address the flaws (it responds with an up to date plan)

The other things I tend to ask are "what might we be missing?", "what are the [performance|security|legal|cost] considerations?". I can often iterate on the "anything else?" kind of nudging prompts, especially guiding it on topics to consider, for a few messages. After each: update the plan to take those into consideration.

danielbln 139 days ago [-]

I always do "now again but put on your critical hat"

CSSer 139 days ago [-]

Makes me wonder how it would do if you tell it "put on your robe and wizard hat"

tomrod 139 days ago [-]

ChatGPT calls you a superstar and it drops into bruhspeak. Emojis proliferate.

sumtechguy 139 days ago [-]

it proceeds to spit out the entirety of bash.org

bentt 139 days ago [-]

Oh I really like that. It makes me want to have it score its ideas with metrics and then keep iterating until it meets some score.

electroly 139 days ago [-]

This seems to be different than I expected from the title. I thought it would be explicitly adversarial.

1. You are the assistant. Please answer the question directly.

2. You are the cross-examiner. The assistant is wrong. Explain why.

3. You are the assistant. The cross-examiner is wrong. Defend your claim.

4. You are a judge. Did either party make their case, or is another round of argumentation required?

I haven't tried this. No idea if it works. But I find it's helpful to ask ChatGPT, in separate prompts, "XYZ is true, explain why" and "XYZ is false, explain why" and see which one seems more convincing.

3np 139 days ago [-]

Also a little clickbaity with "my AI" and then it's all Mistral...

ChadMoran 139 days ago [-]

Check out Fast Agent! (I have no affiliation with it, just use it).

https://github.com/evalstate/fast-agent

mountainriver 139 days ago [-]

Techniques like this have been around since GPT-3.5. There are boatloads of papers on the topic.

I have no idea why anyone thinks this is novel. I guess that speaks to the state of HN

moribunda 139 days ago [-]

Exactly... I thought that implementing STORM was just a basic step in this topic... Looks like we're running in circles.

senordevnyc 139 days ago [-]

Mind sharing a link?

kmacdough 138 days ago [-]

Here's a paper on agent architectures including multi agent. A bit old at this point, but a good overview.

https://arxiv.org/abs/2404.11584

nonethewiser 139 days ago [-]

Chatgpt shares context between chats. I wonder how that impacts it?

It seems like a good approach though. What you dont want to do is ever suggest that its wrong yourself. Usually it will just assume it is wrong.

Actually what I find impressive is when I do this and it actually pushes back to defend itself.

the_af 139 days ago [-]

Does it share context even if no "memory updated" message appears indicating it has stored a fact about you?

I asked ChatGPT and it says no, but then again it's not reliable at introspection or at revealing data about how it works.

visarga 139 days ago [-]

I think they are different systems, one is a collection of saved snippets and the other more like RAG over chat history.

the_af 138 days ago [-]

ChatGPT assures me it doesn't use RAG (fed from my other chat windows), but will use memory-saved preferences (in the store that can be accessed and reviewed in Settings->Personalization->Memory).

Then again, I don't think ChatGPT is reliable when reporting on its own inner workings.

---

Oh, no, here it says it also references chat history: https://help.openai.com/en/articles/8590148-memory-faq

visarga 138 days ago [-]

You can use ChatGPT for these kinds of questions but it needs to use search or research mode, don't ask it in closed book mode.

hnuser123456 139 days ago [-]

I'm having a lot of fun experimenting with stuff like this. I'm trying to put together an unrealengine blueprints style graph editor to allow people to design workflows like this where you start with the user prompt input, which goes to one agent, which makes an initial attempt, and then that conversation history gets passed to another "agent" with a different system prompt telling it to be a harsh critic, but to also give a pass/fail signal, and loop back until the critic judges pass, then send that back to the user as output. Ideally as a little website that can call your own LLM endpoints and save/load/share workflow graphs.

Mistral small 3.1 and gemma 3 feel like the first semi-competent models that can be run locally, but that competence is just a seed, and they still need to be guided with a framework that keeps them on track.

Try giving it python execution in a loop and tell it to explore the world. It'll start trying to download and read news and stuff.

andai 139 days ago [-]

I am thinking the same thing! Multiple "personalities", in parallel, or in series. For example, I have approximated, in GPT, some of Gemini's ability to call out nonsense, sloppy thinking, by telling GPT to be mean! (The politeness seems to filter out much that is of great value!)

However, the result is not pleasant to read. Gemini solved this in their training, by doing it in two phases... and making the first phase private! ("Thinking.")

So I thought, what I need is a two-phase approach, where that "mean" output gets humanized a little bit. (It gets harsh to work in that way for more than short intervals.)

As a side note, I think there would be great value in a UI that allows a "group chat" of different LLM personalities. I don't know if such a thing exists, but I haven't seen it yet, although the message object format seems to have been designed with it in mind (e.g. every message has a name, to allow for multiple users and multiple AIs).

Even better if it supports multiple providers, since they have different strengths. (It's like getting a second opinion.)

jbm 139 days ago [-]

I disagree.

If anything, telling GPT to be blunt seems to downgrade its IQ; it hallucinates more and makes statements without considering priors or context. I jokingly call it Reddit mode.

dingnuts 139 days ago [-]

why would that be a joke? there's a ton of Reddit comments in the training data, and the output is of similar quality. LLMs are literally outputting average Reddit comments.

jbm 139 days ago [-]

I have hard similar things but I think that's an exaggeration. When I tell GPT o3 or o4-high to assume a professional air, it stops acting like a meat-based AIs on r/politics; specifically, it stops making inane assumptions about the situation and starts becoming useful again.

For example, I had a question from a colleague that made no sense and I was trying to understand it. After feeding the question to GPT 3o, it aggressively told me that I made a major mistake in a quote and I had to make major changes. (It would be OK if this is what the colleague had said, but this wasn't the case). In reality the colleague had misunderstood something about the scope of the project and GPT had picked up on the other person's opinion as the "voice of reason" and just projected what it thought he was saying in a stronger way.

I changed its instructions to "Be direct; but polite, professional and helpful. Make an effort to understand the assumptions underlying your own points and the assumptions made by the user. Offer outside-of-the-box thinking as well if you are being too generic.". The aggro was immediately lost, and it instead it actually tried to clarify what my colleague was saying and being useful again.

I agree with those who say the vanilla version is sycophantic, but the plain talk version has far too many bad habits from the wrong crowd. It's a bit like Monday; lots of aggro, little introspection of assumption.

MoonGhost 139 days ago [-]

Reddit works hard to make comments accessible to only Google. However MS + OIA might have grabbed something before Reddit-Google contract.

inanutshellus 139 days ago [-]

See, he's not joking, he's "joking" ...

NitpickLawyer 139 days ago [-]

> As a side note, I think there would be great value in a UI that allows a "group chat" of different LLM personalities.

This is the basic idea behind autogen. They also have a web UI now in autogen studio, it's gotten a bit better. You can create "teams" of agents (with different prompts, themes, tools, etc.) and have them discuss / cooperate. I think they even added memory recently. Have a look at it, might be what you need.

theturtletalks 139 days ago [-]

MoE, but an abstraction deeper?

irthomasthomas 139 days ago [-]

I think you can do most of this already with llm-consortium (maybe needs the llm-openrouter plugin with my pr merging)

A consortium sends the same prompt to multiple models in parallel and the responses are all sent to one arbiter model which judges the model responses. The arbiter decides if more iterations are required. It can also be forced to iterate more until confidence-threshold or min-iterations.

Now, using the pr i made to llm-openrouter, you can save an alias to a model that includes lots of model options. For examples, you can do llm openrouter save -m qwen3 -o online -o temperature 0, system "research prompt" --name qwen-researcher

And now, you can build a consortium where one member is an online research specialist. You could make another uses JSON mode for entity extraction, and a third which writes a blind draft. The arbiter would then make use of all that and synthesize a good answer.

kridsdale1 139 days ago [-]

Any links or names of example implementations of this?

irthomasthomas 139 days ago [-]

https://github.com/irthomasthomas/llm-consortium

also, you aren't limited to cli. When you save a consortium it creates a model. You can then interact with a consortium as if it where a normal model (albeit slower and higher quality). You can then serve your custom models on an openai endpoint and use them with any chat client that supports custom openai endpoints.

The default behaviour is to output just the final synthesis, and this should conform to your user prompt. I recently added the ability to continue conversations with a consortium. In this case it only includes your user prompt and final synthesis in the conversation, so it mimics a normal chat, unlike running multiple iterations in the consortium, where full iteration history and arbiter responses are included.

UV tool install llm

llm install llm-consortium

llm install llm-model-gateway

llm consortium save qwen-gem-sonnet -m qwen3-32b -n 2 -m sonnet-3.7 -m gemini-2.5-pro --arbiter gemini-2.5-flash --confidence-threshold 95 --max-iterations 3

llm serve qwen-gem-sonnet

In this example I used -n 2 on the qwen model since it's so cheap we can include multiple instances of it in a consortium

Gemini flash works well as the arbiter for most prompts. However if your prompt has complex formatting requirements, then embedding that within an already complex consortium prompt often confuses it. In that case use gemini-2.5-pro for the arbiter. .

globalise83 139 days ago [-]

Have you tried n8n? It allows you to build flows like that - you can run the community version in a Docker container within a few minutes and share the configurations for the flows you have built very easily.

mecsred 139 days ago [-]

_#_ has to be one of the worst word shortening schemes I've ever seen get widespread. It only works with a very small number of long-lived technologies, in which case they basically just get a nickname, "k8s" "i18n". It does not at all work for larger contexts. You're basically making someone solve a crossword (2 across, 10 letters with two filled in) just to parse your sentence.

jjj123 139 days ago [-]

I just googled it and it looks like “n8n” is the name of the service. The op wasn’t abbreviating anything so I don’t think it’s the same phenomenon as what you’re describing.

lgas 139 days ago [-]

Well, the service is doing the same thing though. The part I don't understand is that I assume n8n is short for "Nation" but literally every single person I've seen talk about it on YouTube (which is quite a lot) say "En Eight En" every time.

nemomarx 139 days ago [-]

nation is too short for 8 - maybe navigation?

pkaye 139 days ago [-]

Looks like n8n is short for nodemation

firesteelrain 139 days ago [-]

Why do we do this to ourselves?

Y_Y 139 days ago [-]

Techno-flagellation is the only way to atone

lgas 135 days ago [-]

So the 8 stands for "odematio"? That sounds about right.

oppodeldoc 139 days ago [-]

https://github.com/n8n-io/n8n?tab=readme-ov-file#what-does-n...

globalise83 139 days ago [-]

The app is actually called n8n - https://n8n.io/

eddieroger 139 days ago [-]

It's just another form of any other jargon - unknown until you know it, and usually specific to the use case. I see k8s and i18n or a11y and I know exactly what they mean because at some point I learned it and it's part of the world I live in. Searching for stuff is how we learn, not solving crosswords.

wongarsu 139 days ago [-]

I kind of get k8s and can live with i18n (at least it's a long word). But a11y just shouldn't exist. "Oh look, it looks like ally, what a cute play on words". Yeah, but for a dumb joke and 9 saved keystrokes you literally made the word accessibility less accessible. That's exactly the opposite of what accessibility is about

mecsred 139 days ago [-]

Right, my complaint is that it only works like jargon, where you are just giving something a context-specific nickname. As a word shortening scheme, it's terrible. A world where many projects have names like s11g is a nightmare.

psychoslave 138 days ago [-]

No it's not just part of the world and it's fatality we have to live with like gravity. Abbreviation can in rare occasion have a net benefit, but only in very narrow highly unusual context do they bring any general benefit. Most often than not it just obfuscate the message for new comers, making artificial entry barrier higher.

hnuser123456 139 days ago [-]

I had not, but that looks awesome. Microsoft put out something called "agent flows" that also fits this category.[1] I'm working on more of an "at home" version - no "talk to sales" button.

https://www.microsoft.com/en-us/microsoft-copilot/blog/copil...

139 days ago [-]

jedberg 139 days ago [-]

We're really going to need to figure out how to power all these GPUs with green power real quick, or we're going to melt the planet having AIs debate with themselves on the optimal solution to tik-tac-toe...

nonethewiser 139 days ago [-]

Ive felt this way when using chatgpt for a simple search. Stuff that google could handle but would just be slower, mostly from me having to manually filter.

Sometimes its the easiest way to complete a very small task but the cost difference on the backend has to be pretty damn large. The user inevitably ends up not caring whatsoever. Its just not real to them.

ivape 139 days ago [-]

I caught infra people saying that's pretty much the only bottleneck in the data center right now, power and cooling. We know the AI needs to run against itself continuously, and that's just a fact.

mcswell 138 days ago [-]

Maybe we should assign them a practical task, like making paperclips.

Xcelerate 139 days ago [-]

I think this is how we get ML models to come up with novel ideas. Diagonalize against all the ideas they’ve already tried and dismissed via self-argument but keep certain consistency constraints. (Obviously much easier said than done.)

jwally 139 days ago [-]

Scaled up and spread out - this probably gets you pretty close to consciousness(?)

Conway's game of life, but instead of colored squares with rules, they're LLM's with some kind of weighting - all chattering back and forth with one another - bubbling up somehow to cause speach/action

lubujackson 139 days ago [-]

Decades ago I read The Society of Mind by Marvin Minsky. He pushed this sort of idea, that consciousness is composed of individual, competing processes. Worth a revisit!

andai 139 days ago [-]

What you just said is what I tried and failed to say ten minutes ago!

https://news.ycombinator.com/item?id=43835798

Nevermark 139 days ago [-]

It’s working! Oh, wait …

These models have limitations obviously, but many critiques apply equally or more to people.

If people were tasked with one shot, 10 second answers, to be written out in near errorless grammar, the LLM’s viewing our responses to prompts would be spending a lot of time discussing our limitations and how to game us into better responses. Humor, not at all humor.

albertgoeswoof 139 days ago [-]

How far is this going to go? Are we going to have a team of AI agents that runs a scrum team and meets for stand ups every couple of hours?

Are we going to replicate government bureaucracy with agents all debating topics all day long to find the best opinion?

kgeist 139 days ago [-]

I once attended a talk a year ago where a techlead did just that - they had AI agents that ran a scrum team with different roles, each agent's prompt was to disagree with everyone else (or be highly critical) and present their own point of view, and then an arbiter would make the final decision. They claimed it worked for them.

parrit 139 days ago [-]

Maybe. Humans form teams for a reason. Yes there are different exepriences and points of view in a human (vs. Not so much in LLM), but sometimes a different hat it all it takes. E.g. Code reviewer vs. Coder.

Havoc 139 days ago [-]

Seems likely to me. As long as adding more appears to help people will do it

Presumably there is some point where it levels out. And no doubt there will be a committee of AIs to determine said point.

Cause we wouldn’t want to boil the ocean…

faramarz 138 days ago [-]

That's cool! thanks for making it easy to fork and play with this!

I've just begun my own iteration of adding Nash Equilibrium (NECoRT?) and reframing the "prompt engineering" to be a multi-agent negotiation. Curious what others think? https://github.com/faramarz/NECoRT/

my reasoning is that enterprise LLMs wont have any issue with the extra compute costs and would rather reconcile complex financials with various modeling optimizations.

I'm very new to public repo and contributions, and hope someone can point out if I'm doing it wrong.

my intention was to fork the ops codebase so I can test out my theory, and push as PR eventually

alexmolas 139 days ago [-]

There are two examples in the repo, one with CoRT and another one without. And the one without it it's much better than the one that uses it. Weird choice of examples...

2cheeze4u 139 days ago [-]

I think the names were switched up.

joshstrange 139 days ago [-]

I've thought about trying this cross-model as well. Have Claude generate something, have OpenAI check it, have Gemini check that check. Firing multiple of these in parallel.

There was a post here a week or so ago doing the "model checking model"-type thing with GH PRs IIRC that was interesting. I haven't had a chance to play with this idea yet.

K0balt 139 days ago [-]

I’ll second this. I often use a “research assistant “ and skeptical“department head” personas working together/against each other as a research team. It works well and is occasionally hilarious, replete with the occasional HR complaint when things go off the rails. ( I typically use local uncensored models)

k2xl 139 days ago [-]

I've done something similar for learning about a controversial topic. I ask it to act as if it is called Bob is a well informed supporter of one side (like Ukraine) and then act as if it is something named Alice who is a well informed supporter of another side (Russia) and they have to debate each other over a few prompts with a moderator named 'Sue'

Then after a few rounds of the debate where Sue asks a bunch of questions, I ask it to go to the judges - Mark, Phil, Sarah (and I add a few personalities to each of them... Sometimes I pretend they are famous moral philosophers) and then I have them each come up with a rubric and decide who is the winner.

Really fun, and helps me understand different sides of issues.

rat87 139 days ago [-]

That seems like a terrible idea. At best it seems likely to help you make a false but convincing sounding case. I really hope no one is using that to help them understand controversial topics much less using that to determine their stances.

Id recommend looking into actual human experts who are trustworthy and reading them. Trying to get LLM to argue the case will just get you a lot of false information presented in a more convincing fashion

k2xl 137 days ago [-]

I recommend you try it before judging. I will be honest it has been actually very useful.

You are also assuming that the LLM is providing false information (or will have a higher chance of providing false information than a human)

caseyy 139 days ago [-]

I tried something similar when Llama2 came out, pitting two assistants, who each believed the other is the user, against each other. Ultimately, it was the same model talking with itself. The system prompts for both had various instructions to disagree and criticise the opinion of the user. I provided the first message to get things started. Usually, it’s be along the lines of “nuclear proliferation is harmful to humanity”.

After 15 or so iterations, both assistants would keep repeating the same things and find agreement anyway. Sometimes, the chat became unhinged and useless, but 95/100 times, it was agreement.

Happy someone else made it work.

nowittyusername 139 days ago [-]

With my own experiments I've also found this. This behavior is very persistent with llms on default hyperparameters and system prompt. Right now I am exploring how to get these models to output more human like interactions and it seems that a very specific and detailed system prompt is very important to get this to work. These systems are VERY sensitive to system prompt and user input. Meaning that the quality of output varies drastically depending on not just the language you use but how its structured, the order of that structure and also other many nuanced things like system prompt plus user input pre conditioning. So far it seems its possible to get to where we need to for this task but lots of exploration needs to be done in finding the way in how to structure the whole system together. This revelation is kind of nuts when you think about it. It basically means, once you find the right words and the order in which they should be structured for the whole system you can get 2x+ improvement in every variable you care about. That's why I am spending some time creating an automated solution to find these things for x model. Its a tedious effort to do manually, but we have the tools to automate its own optimization and calibration efforts.

generalizations 139 days ago [-]

I always assumed you'd have to use different models. Even if only one of them is large, the others would inject enough difference of opinion to keep it useful.

zamalek 139 days ago [-]

This might be a situation that warrants a higher temperature. Actually, it could be worth starting a very high temperature initially and gradually decreasing it.

caseyy 139 days ago [-]

Even after turning the temperature way up, the outcome was the same, just the text less coherent. Not dismissing the idea, just sharing my exp.

bilekas 139 days ago [-]

This is an interesting approach, it reminds me of YT creator actually. I'll find the YT creator, but basically he would make some script that would play the game like a race-course, with the goal being the finish line and iterate it N number of times, the script would keep iterating until it found the fastest solution.

I believe they called that machine learning.. Or re-enforced training.

I'm being slightly facetious, but my ignorant understanding of AI these days is basically the same no ?

https://www.youtube.com/watch?v=SX08NT55YhA

WhitneyLand 139 days ago [-]

Why try this idea on base models only?

The whole point of reasoning models is to automatically use COT and related techniques to bring out more capabilities.

It would be interesting to see if this is doing anything that’s not already being exploited.

ChadMoran 139 days ago [-]

Fast Agent has this as a first-class citizen called "Evaluator Optimizer" pattern. Where it in a loop with a defined number of max refinements judge itself and give the output a rating, demanding it improve it's output.

Highly encourage others to check out Fast Agent. It has been delightful to use. It has interactive chat mode which I love and it's really tight and easy to implement.

https://github.com/evalstate/fast-agent

Der_Einzige 139 days ago [-]

Debate as a reasoning tactic is massively undervalued. There's tons of papers on this at places like NeurIPS, ICML, ICLR, etc.

Hell, even a whole quanta article. https://www.quantamagazine.org/debate-may-help-ai-models-con...

I got to meet and talk to the authors of this paper at NeurIPS. They're class acts!

139 days ago [-]

lepisma 139 days ago [-]

Debates have worked good for me while learning something new:

https://lepisma.xyz/2024/10/19/interventional-debates-for-st...

I believe there are researches on this too.

aaroninsf 139 days ago [-]

Question: has the the adversarial approach been roled into any coding copilots/assistant frameworks?

Costs of various kinds aside I've wanted that from assistance's inception — with precisely the features many call out and home-roll here, difference by both model/provider, and, "role"...

It seems like if you have the money/compute to burn, and can live with the reasoning wall-clock time,

this has got to be the best approach for the foreseeable future, for a lot of specific requirements.

(I also have wondered if this would illuminate the edges of what modern production models are capable of, "aggregating and integrating" over a variety of contributions might make more clear what the limits of their abilities are.)

badmonster 139 days ago [-]

Have you experimented with weighting the self-evaluations based on specific criteria (e.g., correctness, clarity, creativity), or using external validators to guide the AI’s final choice? Curious how much tuning the evaluation step impacts overall performance.

mortarion 139 days ago [-]

I think Gemini 2.5 already does something similar. If you read the "thinking descriptions" that it outputs it often thinks about going back to older thoughts to verify and criticize.

yieldcrv 139 days ago [-]

Reminds me of baby agi from 2 years ago

but I guess that was before chain of thought models

zekenie 139 days ago [-]

I feel like itd be cool to try prompts based on an adversarial justice system… attorney agents arguing both sides, a judge ruling on “the law”—adherence to instructions etc

ivape 139 days ago [-]

That's very easy to do. A prompt I regularly use is a "council" system. For example:

"I believe I have been contacted by the supernatural. Here are the details <details>. Please form a council of seven people: 1) Secular scientist 2) Religious scientist 3) Paranormal historian 4) Secular Psychologist 5) Religious psychologist 6) Carl Jung 7) Richard Dawkins. The council should all be independent and provide their own objective analysis. Please have them create a final report and conclusions at the end".

Your council can be anything, a law firm, a jury, a parent teacher association, whatever you want, and as you can see, you can throw in known people as well. This can all be done with one prompt. It's one my favorite things to do.

svachalek 139 days ago [-]

Wow, that's a very cool prompt, I haven't tried anything like that before.

hu3 139 days ago [-]

Here's some related challenge I'm facing. Maybe someone can help me:

I also managed to make AI critique itself and that improved code generation a ton.

For a TypeScript backend project that runs with Bun, I tell AI to also generate and run unit tests after every code change suggested by AI.

How do you solve the risk of AI writting and executing unit tests with something like `rm -rf /` and wiping your files?

Docker works but I like to keep things simple.

Deno supports revoking file access but I'd like to keep using Bun.

small_scombrus 138 days ago [-]

> How do you solve the risk of AI writting and executing unit tests with something like `rm -rf /` and wiping your files?

The same way you stop any person or program or third party from doing something dumb or nefarious with your files. Don't give them any access to important files.

zactato 139 days ago [-]

Either you trust AI or you don't? If you don't trust it then you need to review what it's writing.

Docker seems like a pretty low complexity way to create an isolated environment to run automation.

derwiki 139 days ago [-]

Manually approve every terminal command it wants to run instead of vibe mode. Tbh I think an rm -rf scenario is exceedingly unlikely.

ivape 139 days ago [-]

a) You should only do this in a sandbox

b) You can have the AI run a "firewall" prompt on the final output. So your final output should go through a "You are a firewall that checks for dangerous terminal commands such as <enumerate list of dangerous commands>. If you spot dangerous commands, reform the command so that it is not dangerous"

nowittyusername 139 days ago [-]

No way around it, got to sandbox the whole thing no matter what.

schnitzelstoat 139 days ago [-]

I probably don't understand the modern, complex models. But doesn't it basically predict the next token given the context and the better models use more training data and can consider a larger context, and have more parameters to better retain information from the training data etc.

But the fundamental way they operate is the same - predicting the next token given previous tokens. Where/how does reasoning happen here?

small_scombrus 138 days ago [-]

Upfront: I think AI is borderline useless for many of the tasks we give it. But:

1. Do our neurons not just react the same way every time to the same input? A brain is larger than the sum of its parts.

2. They don't reason, but you can somewhat emulate (or pretend to be) reasoning if you feed something back into itself enough times and it pinky promises reasoning is happening

139 days ago [-]

stormfather 139 days ago [-]

I made a trading bot that ingested news. The prompt to assess impact was to simulate a debate between Charlie Munger and Warren Buffet on whether to invest.

internetter 139 days ago [-]

How did it do?

thunderbong 139 days ago [-]

A lot of the comments here are reminiscent of the early Google days when everyone was finding ways to search better!

139 days ago [-]

j45 139 days ago [-]

There appear to be no shortage of token saving attempts that can end up using more tokens, whether it's a monthly paid plan or API.

Having an approach to recognize what is needed from the AI software, and anticipate how it may default to respond based on it's programming is critical.

pkdpic 139 days ago [-]

So glad to see a write up on this finally. I'm no machine learning phd but I always wondered why this wasn't more of a thing. Like an extension of a GAN conceptually, sort of, not really at all Im sure.

Also I think I kind of assumed OpenAI might be doing this behind the curtain?

mritchie712 139 days ago [-]

Did something similar (OverkiLLM) to this waayyyy back in August with open LLMs. I'm sure it'd work much better now:

https://www.definite.app/blog/overkillm

rriley 139 days ago [-]

Makes me wonder what would happen if we combine LLMs with recursive genetic algorithms. Similar to https://github.com/DivergentAI/dreamGPT

noworriesnate 139 days ago [-]

I’ve had success telling the model it really needs to poop and if it gets to the point quickly it’ll be able to leave the meeting and go do that. It actually works amazingly well.

It’s also a lot more ethical than verbal abuse, which some people say improves the results as well.

Programming isn’t what it used to be.

tinix 139 days ago [-]

this works for getting out of traffic tickets too lol

ausbah 139 days ago [-]

at some point this doesn’t make LLMs feel useful. I have to wait 10x as long just so my LLM can have a somewhat higher chance of actually answer my question correctly?

cwillu 139 days ago [-]

Any api that lets you constrain output to a formal syntax should let you do away with the “first output a number, and only then explain yourself” boilerplate.

killerstorm 139 days ago [-]

This is similar to Tree-of-Thought with self-evaluation.

daxfohl 139 days ago [-]

Maybe have a "reconcile" option, for it to see if it can mix and match the best parts of each alternative rather than just choosing one.

grzracz 139 days ago [-]

Your readme demo images are wrong: the terminal one is the non-CoRT one and the GUI one is the one with CoRT. Confused me for a while

Svoka 139 days ago [-]

Oh. I was just asking "Use dialectic method on your solution" in the end of the prompt... It does make it think harder.

ashoeafoot 139 days ago [-]

Give it reward and punishment evaluations, exploring the noise in parallel, extinction for the non rewarding answers ?

keyle 139 days ago [-]

When will we get the `4o` vs `o3` background conversation in "thinking" leading to a more correct result?

kevinrineer 139 days ago [-]

This sounds like the zeitgeist is approaching genetic algorithms, which are super fun. Adversarial stuff is great.

throwawayForMe2 139 days ago [-]

I wonder if the Scholastic method of the Schoolmen would be useful with its argument and counter argument style.

alex1138 139 days ago [-]

Every single one of my prompts would be "Are you suuuuuuure you're not hallucinating that?"

Garlef 139 days ago [-]

Similarly, letting the LLM generate a socratic dialogue can work pretty well to get deeper into a topic.

mangoman 139 days ago [-]

a paper with a similar idea on scaling test time reasoning, this is sorta how all the thinking models work under the hood. https://arxiv.org/abs/2501.19393

gnarlouse 139 days ago [-]

This seems like low hanging fruit; are we seriously supposed to believe this is new and novel?

138 days ago [-]

irthomasthomas 139 days ago [-]

my favourite pattern rn: llm "write a savage, yet grounded roast of: $content" llm -c "Write an equally savage rebuttal" llm -c "first arbitrate and then synthesize a final review."

asdfman123 139 days ago [-]

And when I do this people say I'm overanalyzing

ivape 139 days ago [-]

The thing that makes us weird to regular people is what's going to make us uniquely positioned to utilize AI. If people only knew the level at which I overanalyze and entertain weird ideas. I always inject these personality quirks into my instructions and get very creative results. In a weird way, I'm starting to appreciate just how weird I actually am.

asdfman123 139 days ago [-]

I don't actually think it's that weird though

animitronix 139 days ago [-]

Adversarial networks have been a thing for a while

stevefan1999 139 days ago [-]

That is just reinforcement learning in disguise

akomtu 139 days ago [-]

The modern Alchemy: the belief that you can extract gold (intelligence) from iron (autocomplete by imitation) by mixing iron with itself.

csours 139 days ago [-]

Yes, give the computers anxiety too!

lonetripper 139 days ago [-]

all this hard thinking yet humanity fails to come up with just one girlfriend for me

robofanatic 139 days ago [-]

soon there will be AI debates. Different models debating with each other on a topic

mparnisari 139 days ago [-]

So like rubber ducking for AI?

z2 139 days ago [-]

I would really like to see a fusion guidebook of mental tricks that work for humans and just as well for AI. Or humorously, perhaps prompt-engineering tricks that are also great mental hacks for better or clearer human thinking.

1970-01-01 139 days ago [-]

"While hallucinating a duck, check my script for errors."

jbellis 139 days ago [-]

does it actually make a difference to do M rounds of N vs one round of M*N?

nowittyusername 139 days ago [-]

My gut tells me yes. From my own experiments the order and way in which these things are done are important. I think it all is very strongly tied to the attention mechanism.

firgrove 139 days ago [-]

this is amazing - I love seeing novel approaches to optimizing

celltalk 139 days ago [-]

One of my doctoral propositions is, dialog leads to true artificial intelligence.

139 days ago [-]

getcrunk 139 days ago [-]

Hello cnn’s

parrit 139 days ago [-]

I want to see "Meh" vs. "Holy crap" as a benchmark in a paper published by Google. Or more likely I suspect, Andrej.

codr7 139 days ago [-]

Better yet, let it argue with another AI, preferably using voice; instant entertainment.

139 days ago [-]

antisthenes 139 days ago [-]

Cool. Now I can justify talking to myself.

dqewijodjqweido 139 days ago [-]

[flagged]

casenmgreen 139 days ago [-]

[flagged]

hackinthebochs 139 days ago [-]

Don't people get tired of having this same "debate" on every post about LLMs? And I scare quote debate because the naysayers never support their strong claims beyond the most superficial of responses. It's all just so tiring at this point.

casenmgreen 139 days ago [-]

The most superficial response is adequate, where the claim is so improper.

LLM/AI are extremely useful. I am in no way disputing this.

stevenAthompson 139 days ago [-]

Can you define "thinking" in a way that excludes what the AI is doing, but includes what humans do?

I haven't' really seen anyone else manage it without talking about ghosts or some other kind of metaphysical voodoo.

dttze 139 days ago [-]

Using a conceptual understanding of something to deduce or infer something else.

An LLM doesn't know what anything is. Just what goes around the token representation of that thing.

stevenAthompson 139 days ago [-]

"conceptual understanding of something" is just another way of saying "the relationship between concepts", which is exactly what transformer models use.

*EDIT* To elaborate, how can you define anything in isolation of every other concept/thing? You can't. Things are only defined by their relationships to each other, which is exactly the same thing transformer models do.

dttze 139 days ago [-]

No, it isn't. "Conceptual understanding" is a deep comprehension of a particular concept. It is grasping its meaning, significance, applications, and boundaries. It involves knowing not just what something is definitionally, but understanding how it works, why it matters, and how it connects to other ideas.

"The relationship between concepts," is focusing specifically on how different ideas connect, overlap, contradict, or complement each other. It's more about the network or system of connections rather than deep comprehension of individual concepts.

Understanding relationships between concepts is part of conceptual understanding, sure. But conceptual understanding is broader - it includes both mastery of individual concepts and awareness of their relationships to other concepts.

stevenAthompson 139 days ago [-]

> It is grasping its meaning, significance, applications, and boundaries

To define "thinking" by using words like "meaning", "understanding", or "comprehension" just moves the need for definition further up the abstraction ladder. It doesn't help to define what "thinking" is in any quantifiable way.

To play along, could you define "meaning" or "understanding" in a way that doesn't resort to ghost-talk or just move the definition even further up the abstraction ladder? They are both subjective terms that describe how humans feel, not well defined words that describe objective reality in some way.

To use a more quantifiable metric we could look at something like Humanity's Last Exam. OpenAI's o3 scores something like 20% (a feat which few humans could accomplish). To put that in perspective, consider that fifty four percent of Americans now read below the sixth grade level. Like it or not the machines are "smarter" than the majority of humans and have deeper "understanding" in most of the objective ways we've thought of to measure it. Subjective feelings aside, it's tough to argue that the machines aren't conscious if we're going to accept that our fellow citizens are.

Cantinflas 139 days ago [-]

Why is it so hard to believe that a complex neural network can think? You literally have one over your shoulders that does exactly that.

consumer451 139 days ago [-]

I am not sure that you can make that absolute statement. Reasoning is subdivided into types, and one of those types is inductive reasoning.

> Inductive reasoning refers to a variety of methods of reasoning in which the conclusion of an argument is supported not with deductive certainty, but with some degree of probability. Unlike deductive reasoning (such as mathematical induction), where the conclusion is certain, given the premises are correct, inductive reasoning produces conclusions that are at best probable, given the evidence provided.

Doesn't predicting the next token qualify as doing just that?

https://en.wikipedia.org/wiki/Inductive_reasoning

dttze 139 days ago [-]

Markov chains have done that for ages. They aren't AI. This is just that scaled up.

Just because it can infer a token doesn't mean it can infer a conclusion to an argument.

voidspark 139 days ago [-]

> This is just that scaled up

An LLM is not a Markov process. They are fundamentally different. An LLM conditions the next token prediction on the entire context window (via the attention mechanism), not just the previous token. Besides the token history window it also maintains a cache of neural activations which is updated at every step.

Otherwise you could use the same reasoning to argue that a human is a Markov process, which is absurd, but vacuously true if "state" means the quantum level configuration of every atom in the body.

casenmgreen 139 days ago [-]

To add a bit to this : expert systems have two properties. They give an answer, and they explain their reasoning.

LLM cannot explain their reasoning, and that is because there is no reasoning.

consumer451 139 days ago [-]

To push back on this, a somewhat recent Linus Torvalds ~quote:

"I don't think that 'just predicting the next word' is the insult that people think it is, it's mostly what we all do."

If we break our lives down into the different types of reasoning, and what we mostly do day-to-day, this rings very true to me.

I currently believe that our brains generally operate as very efficient inference machines. Sometimes we slow down to think things through, but for example, when in the ideal "flow state" it's some kind of distilled efficient inference. Isn't it? This is very hard for me to deny at this time.

___

edit:

4o appears to agree with both of you, more than it does with me.

https://chatgpt.com/share/68119b41-1144-8012-b50d-f8f15997eb...

However, Sonnet 3.7 appears to side with me.

https://claude.ai/share/91139bca-3201-4ffc-a940-bdd27329e71f

(Both of these are the default models available for free accounts, on each website, at the time of writing)

IMO, hey, at least we do live in interesting times.

casenmgreen 139 days ago [-]

I may be wrong, but it seems to me this also is a case of improper use of words.

Those LLMs neither agree nor disagree. They do not understand. They produce output, and we read that output and we ourselves consider the output to be something, or something else.

All an LLM does is produce output. There's no conceptual understanding behind it, and so there is no agreement, or disagreement.

consumer451 139 days ago [-]

> All an LLM does is produce output. There's no conceptual understanding behind it, and so there is no agreement, or disagreement.

I think that I agree. However, even on HN, what percentage of human comments are simply some really basic inference, aka output/"reddit"/etc... and those are humans.

I am not trying to elevate LLMs to some form of higher intelligence, my only point is that most of the time, we are not all that much better. Even the 0.000001% best of us fall into these habits sometimes. [0]

I currently believe that modern LLM architecture will likely not lead to AGI/ASI. However, even without that, they could do a lot.

I could also be very wrong.

[0] https://en.wikipedia.org/wiki/Nobel_disease

voidspark 139 days ago [-]

LLMs learn high-dimensional representations that capture conceptual relationships in their training data. They manipulate those representations in ways that approximate human reasoning.

consumer451 138 days ago [-]

> They manipulate those representations in ways that approximate human reasoning.

Fwiw, this is the story of my life. Seriously.

voidspark 138 days ago [-]

LOL everyone is like that most of the time.

System 1 vs System 2 thinking.

System 1 is rapid, uses heuristics to make quick judgements. Not rigorous. System 1 is the default mode.

System 2 is slow deliberate reasoning, energy intensive, and even humans get that wrong.

LLMs often use something like System 1 pattern matching, get the answer wrong initially, then can be prodded into trying again with a System 2 approach (chain of thought).

https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow

139 days ago [-]

verytrivial 139 days ago [-]

Sorry to join the pile-on, but can I just ask: In what way does a brain think that an Ai does not? And does the distinction apply from human brains down to fruit flies? Is it a property of embodiment? (I have suspected for years that consciousness isn't just emergent but specifically that it is NOTHING besides that. It's all about scale and large models are just starting climb the ladder. The ladder does not necessarily go up the same way as embodied thought though.)

throwaway150 139 days ago [-]

How can you be so sure? How do you know that our brains don't work like transformers too, except for having the advantage of having more types of sensory data? How can you settle this debate without defining what "thinking" and "reasoning" is and how what LLMs do is not similar to what a kindergarten level kid may be capable of? I think we all agree kindergarten kids can think and reason, don't we?

senko 139 days ago [-]

Agree 100%.

We should also not be calling a pointing device "a mouse" because it's not a small rodent, there aren't any actual windows inside a computer, and I haven't seen anyone balancing their laptop trying to surf the web.

Also smartphones are not actually smart and are only barely phones.

casenmgreen 139 days ago [-]

I'm finding laymen are thinking AI is reasoning, because the term makes it look like this is what it is.

The potential confusion of terms such as mouse/windows/surfing is not the same as calling LLM AI, and then going on to say it is "thinking" and "reasoning".

rapfaria 139 days ago [-]

Or "thinking" just got a new meaning and it's to convey information in the field - perhaps the Oxford dictionary will add it soon?

jasonthorsness 139 days ago [-]

What words would you use instead?

Philpax 139 days ago [-]

You say these things with such certainty. How can you be so sure?

pfdietz 139 days ago [-]

You are the critic. Construct three rebuttals to your claim.

m3kw9 139 days ago [-]

Isn’t this best of n?

139 days ago [-]

lenerdenator 139 days ago [-]

I, too, like to give Terminator lite anxiety.

hansmayer 139 days ago [-]

Right, so... but you do realise its still just producing random output based on how you reconfigured it's weights, right? Sometimes it will happen to resonate with what you need. But it still neither thinking nor arguing with itself.

DyslexicAtheist 139 days ago [-]

> "I made my AI think" ...

utterly moronic.

They don't “think” ... not even in the most autistic sense of the word.

They can generate solutions by combining existing knowledge in unique ways. But they don't “think”.

mortarion 139 days ago [-]

That's exactly what us humans do when we think about stuff. We combine memories and knowledge in unique ways, then we usually go ask someone else to give input on it.

Loading comments...

dudeinhawaii 139 days ago [-]

[1] https://www.youtube.com/watch?v=0u2hdSpNS2o - (How) Do LLMs Reason/Plan? (talk at Microsoft Research, 11 Apr 2025)

[2] https://arxiv.org/abs/2402.08115

[3] https://arxiv.org/abs/2402.01817 (related to the talk in #1)

zurfer 139 days ago [-]

Your references show me that it is absolutely task depended. In many domains it's true that "criticizing is easier than creating".

The best example might be books and movies, where it's trivial to say the characters were shallow, but it's surprisingly hard to create deeply interesting characters.

In Software Engineering, there are similar dynamics. An LLM with a security vuln finding prompt will be able to point out places, where the generated code might be insecure.

aoeusnth1 138 days ago [-]

simulator5g 138 days ago [-]

An LLM cannot reason through a mathematical proof, it would be something other than an LLM if it could.

mycall 138 days ago [-]

LLM is a overloaded term now as ML models can do tool calls, or MoE segmentation can have specialized solvers embedded... but people will call all variations LLMs.

meander_water 139 days ago [-]

[0] https://arxiv.org/abs/2306.05685

[1] https://arxiv.org/abs/2411.15594

swyx 139 days ago [-]

its for the better, and i'm actually serious about this. it's just that Subbarao is ALSO right and it is not perfect nor human level. but it -DOES- improve results measurably and consistently.

hu3 139 days ago [-]

> ...so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).

Agree. What do you think about telling the LLM to also generate unit tests for the code it spits and then run all tests (including previous application unit tests).

I think this is a way to ensure some level of grounded verification:

- Does code compile?

- Do unit test pass?

AI can then consume test results to help fix their own mistakes.

nojs 139 days ago [-]

eru 138 days ago [-]

> Also, you’ve often gotta explicitly tell it not to hardcode test cases in the solution to make them pass.

You can use property based testing for that.

But I've often run into cases where the AI gets into a vicious spiral of worse and worse code when you keep feeding it the test failures.

macrolime 139 days ago [-]

Yet, having the LLM evaluate the tests often works well. Especially if you ask it to look for hardcoded test cases.

dwaltrip 139 days ago [-]

Definitely, test runners are a way to ground the model and give it a feedback loop. Not a silver bullet but can be very helpful.

keepamovin 139 days ago [-]

I believe, what the smart AI company is trying to do, right now, in secret, is to use US, the humans, and our replies to the AIs, as training for the next generation of self-verifying-models. :)

Training on corpus data gets you to 1 order of magnitude. But training on interactive data where you can observe and adapt to the OODA-loop? So much more powerful.

At least, that's what I'd be doing if I were doing AI :)

But I just do BrowserBox

captainbland 138 days ago [-]

I think you'd need to screen for quality of response quite stringently as loads of people will produce "corrections" which are just plain wrong.

keepamovin 138 days ago [-]

Good point! But you could probably identify "super users" who are the ones whose responses you want to mine hahaha :)

mcswell 138 days ago [-]

In the case described, the generator is an LLM, and the tester (called a "verifier") is "the compiler, linter, SAT solver, ground truth dataset, etc."

And of course generate-and-test is related to trial-and-error, which has probably existed since the Paleolithic.

foobiekr 138 days ago [-]

"letting GPT-4 critique its own answers reduces accuracy"

This is because the output, being the input, steers directly into the tree as soon as the tree is in the context window.

ashu1461 139 days ago [-]

Would a LLM under human guidance turn out to be a good verifier ? i.e. if LLM knows the rules to verify or has enough data points (internet access, actual responses)

eru 138 days ago [-]

Of course, that only works for problems where you have a verifier.

autokad 138 days ago [-]

actually, I found that you can definitely yield better results. I ran an experiment with 1 prompt at temperature 0 and 9 with temperature 1.

I found the most anomalous response was as good (15/20) or better (5/20) than the temperature 0 response in 20 samples.

odo1242 139 days ago [-]

Something I do sometimes is:

- Have an AI chat model come up with an answer to a problem.

- Have it write a report discussing the details of the problem and why it's answer is correct, directed at a person or AI model who has no knowledge of the initial problem or technical field.

- Repeat until either the second AI model is convinced by the first AI model's explanation or the first AI model has implemented all the changes requested by the second AI model.

It's super clunky but has given pretty good results in the cases where I tried it lol

ASalazarMX 139 days ago [-]

Ah, now we know why Spain was out of electricity yesterday.

Cthulhu_ 139 days ago [-]

Here I was thinking cryptocurrency pre-heated the grids (and GPU manufacturing) for us already.

danparsonson 139 days ago [-]

Oh that was a good one XD

StopDisinfo910 139 days ago [-]

The final plan you obtain is generally a lot more well rounded and thought out.

I find that amusing because the technique also works when I apply it to me. Picking flaws in your plan before revisiting it actually works.

meander_water 139 days ago [-]

Do you find your method handles "sycophancy" well?

StopDisinfo910 139 days ago [-]

I don’t really know.

I generally use Deepseek nowadays which is not sycophantic and surprisingly doesn’t seem as censored to me especially if you use a version not hosted by Deepseek themselves.

lblume 138 days ago [-]

Which hosting service would you recommend?

zoogeny 139 days ago [-]

I do the same, and I have one other technique.

I will often have a few chats going for a project, but with different contexts. For example, one might be tech focused, another marketing focused, another with some context on my personal goals, etc.

odie88 138 days ago [-]

This is how I’ve been using Gemini and it’s the first time I’m really seeing consistent value.

I’ll get a context into a solid place with as much information as I can about a project. Usually getting up to 100k tokens.

I’ve found myself going into sub branches from there… like a marketing context that pushes branches into different marketing channels.

jsight 139 days ago [-]

This reminds me a lot of the YT video that went over using Monte Carlo Tree Search with LLMs to maximize result quality. Link: https://www.youtube.com/watch?v=mfAV_bigdRA&ab_channel=Treli...

looofooo0 139 days ago [-]

JumpCrisscross 139 days ago [-]

Kagi’s Assistant feature makes this super easy. Just switch assistants and ask them to check the other’s work.

BOOSTERHIDROGEN 139 days ago [-]

How?

nativeit 139 days ago [-]

Ask the AI assistant for instructions.

Pretty soon we'll have new acronyms such as "IDKATFAIA" ["I don't know, ask the f'ing AI already"] as we all succumb to the knowledge soup.

dalmo3 139 days ago [-]

RTFP

factotvm 139 days ago [-]

Read The Fine Prompt, more or less, right?

BOOSTERHIDROGEN 138 days ago [-]

Honestly, the AI assistant isn't as smart as I thought - I'm still having to check its work.

subscribed 139 days ago [-]

With a good model role playing them, works awesome.

hsuduebc2 139 days ago [-]

nomel 139 days ago [-]

itissid 139 days ago [-]

Isn't this kind of another way of how Inference Time Scaling works? It will basically produce several chain of thoughts and then pursue one that has maximum reward based on an internal function?

pessimizer 139 days ago [-]

I've wondered if it might be helpful to randomly "shard" training data between two LLMs; just feed half the training data to one, and the rest to the other, with no overlap.

ijk 139 days ago [-]

There's some precedent for that: you can do some useful things with the cross entropy of the two models. And k-fold cross validation might also be relevant.

aprilthird2021 139 days ago [-]

This takes such a long time to do though, no? What problems does this save you time on?

dustingetz 139 days ago [-]

i dont understand, is it doing your schoolwork?

Lerc 139 days ago [-]

mikepurvis 139 days ago [-]

MoonGhost 139 days ago [-]

mikepurvis 139 days ago [-]

buu700 139 days ago [-]

The fact that you can more info from an LLM than it holds is actually a pithy description of this whole challenge.

0x20cowboy 139 days ago [-]

I did a stint in Devops and I found every models to be like this for all of the infra-as-code languages. Anything yaml based was especially bad.

Even Amazon’s own offering completely made things up about Amazon’s own formats.

I’d be curious as to why that is. It seems like there would be enough training data, and for Amazon in particular it seems like they could make a validation tool the model could use.

mikepurvis 139 days ago [-]

meander_water 139 days ago [-]

Cursor has a neat feature where you can upload custom docs, and then reference them with @Docs. I find this prevents hallucinations, and also using a reasoning model

organsnyder 139 days ago [-]

I've enjoyed watching Claude try running commands with incorrect flags, trying them, and then adapting.

corvus-cornix 138 days ago [-]

I've also found LLMs to perform poorly at DevOps tasks. Perhaps there's a lack of training data. On the bright side this hints at better job security for platform engineers.

vunderba 139 days ago [-]

100%. This has happened enough to me that I wished I could just inject the man page docs into it to at least act as a sanity check.

nonelog 139 days ago [-]

Spot on.

vunderba 139 days ago [-]

bee_rider 139 days ago [-]

What sort of personalities did you try? A group where some members have grudges against each other and will irrationally poke holes in each other’s plans could be a fun experiment.

throwup238 139 days ago [-]

With multiple groups with external and internal rivalries. The Always Sunny gang versus The IT Crowd.

vintermann 138 days ago [-]

I have played Disco Elysium, and can confirm that a bunch of inner voices arguing with each other can be fun.

nonethewiser 139 days ago [-]

In theory couldnt this just be baked into a single adversarial model?

RevEng 139 days ago [-]

tonmoy 139 days ago [-]

Yes, but I guess the model is optimized for relatively quick response, whereas these techniques are allowing the model to spend more time to generate a higher quality response

Lerc 139 days ago [-]

To an extent, but different models are better at different things.

Surely there must be something notable at particular points when a model goes off on the wrong path.

crowcroft 139 days ago [-]

Like, just endlessly grinding tokens, then processing the output and pulling out good ideas when the endless debate generates them?

Would be interesting what it comes up with with enough time and tokens.

danielmarkbruce 139 days ago [-]

This is being done, and you could apply it to a lot of domains. Go for it for whatever use case you have.

kmacdough 139 days ago [-]

These ensembles have been tested throughout AI progress. Well scaffolded larger models have historically come out ahead in both quality and speed/cost.

Perhaps this is a parricularly effective ensemble, but I would need to see real data.

nativeit 139 days ago [-]

Yeah, but we'll finally get definitive proof that the government's been hiding super-intelligent axolotls from us all.

taneq 139 days ago [-]

A society of mind, if you will. :)

This sounds like a fun thing to set up with a quick-enough local model.

139 days ago [-]

cube2222 139 days ago [-]

This is really cool!

It works very well. Similarly just asking it to “find the 5 biggest issues with its proposal” works pretty good (the 5 forcing it to find something, even if it’s mostly irrelevant).

zoogeny 139 days ago [-]

This is one of the reasons I like the massive context window in Gemini. You can do this as part of the message chain. I don't try to one shot it, just use the same idea across 3 messages.

1. Figure out a plan (it responds with the plan)

2. Point out flaws in the plan (it responds with the flaws)

3. Update the plan to address the flaws (it responds with an up to date plan)

danielbln 139 days ago [-]

I always do "now again but put on your critical hat"

CSSer 139 days ago [-]

Makes me wonder how it would do if you tell it "put on your robe and wizard hat"

tomrod 139 days ago [-]

ChatGPT calls you a superstar and it drops into bruhspeak. Emojis proliferate.

sumtechguy 139 days ago [-]

it proceeds to spit out the entirety of bash.org

bentt 139 days ago [-]

Oh I really like that. It makes me want to have it score its ideas with metrics and then keep iterating until it meets some score.

electroly 139 days ago [-]

This seems to be different than I expected from the title. I thought it would be explicitly adversarial.

1. You are the assistant. Please answer the question directly.

2. You are the cross-examiner. The assistant is wrong. Explain why.

3. You are the assistant. The cross-examiner is wrong. Defend your claim.

4. You are a judge. Did either party make their case, or is another round of argumentation required?

3np 139 days ago [-]

Also a little clickbaity with "my AI" and then it's all Mistral...

ChadMoran 139 days ago [-]

Check out Fast Agent! (I have no affiliation with it, just use it).

https://github.com/evalstate/fast-agent

mountainriver 139 days ago [-]

Techniques like this have been around since GPT-3.5. There are boatloads of papers on the topic.

I have no idea why anyone thinks this is novel. I guess that speaks to the state of HN

moribunda 139 days ago [-]

Exactly... I thought that implementing STORM was just a basic step in this topic... Looks like we're running in circles.

senordevnyc 139 days ago [-]

Mind sharing a link?

kmacdough 138 days ago [-]

Here's a paper on agent architectures including multi agent. A bit old at this point, but a good overview.

https://arxiv.org/abs/2404.11584

nonethewiser 139 days ago [-]

Chatgpt shares context between chats. I wonder how that impacts it?

It seems like a good approach though. What you dont want to do is ever suggest that its wrong yourself. Usually it will just assume it is wrong.

Actually what I find impressive is when I do this and it actually pushes back to defend itself.

the_af 139 days ago [-]

Does it share context even if no "memory updated" message appears indicating it has stored a fact about you?

I asked ChatGPT and it says no, but then again it's not reliable at introspection or at revealing data about how it works.

visarga 139 days ago [-]

I think they are different systems, one is a collection of saved snippets and the other more like RAG over chat history.

the_af 138 days ago [-]

ChatGPT assures me it doesn't use RAG (fed from my other chat windows), but will use memory-saved preferences (in the store that can be accessed and reviewed in Settings->Personalization->Memory).

Then again, I don't think ChatGPT is reliable when reporting on its own inner workings.

---

Oh, no, here it says it also references chat history: https://help.openai.com/en/articles/8590148-memory-faq

visarga 138 days ago [-]

You can use ChatGPT for these kinds of questions but it needs to use search or research mode, don't ask it in closed book mode.

hnuser123456 139 days ago [-]

Try giving it python execution in a loop and tell it to explore the world. It'll start trying to download and read news and stuff.

andai 139 days ago [-]

However, the result is not pleasant to read. Gemini solved this in their training, by doing it in two phases... and making the first phase private! ("Thinking.")

So I thought, what I need is a two-phase approach, where that "mean" output gets humanized a little bit. (It gets harsh to work in that way for more than short intervals.)

Even better if it supports multiple providers, since they have different strengths. (It's like getting a second opinion.)

jbm 139 days ago [-]

I disagree.

If anything, telling GPT to be blunt seems to downgrade its IQ; it hallucinates more and makes statements without considering priors or context. I jokingly call it Reddit mode.

dingnuts 139 days ago [-]

why would that be a joke? there's a ton of Reddit comments in the training data, and the output is of similar quality. LLMs are literally outputting average Reddit comments.

jbm 139 days ago [-]

MoonGhost 139 days ago [-]

Reddit works hard to make comments accessible to only Google. However MS + OIA might have grabbed something before Reddit-Google contract.

inanutshellus 139 days ago [-]

See, he's not joking, he's "joking" ...

NitpickLawyer 139 days ago [-]

> As a side note, I think there would be great value in a UI that allows a "group chat" of different LLM personalities.

theturtletalks 139 days ago [-]

MoE, but an abstraction deeper?

irthomasthomas 139 days ago [-]

I think you can do most of this already with llm-consortium (maybe needs the llm-openrouter plugin with my pr merging)

kridsdale1 139 days ago [-]

Any links or names of example implementations of this?

irthomasthomas 139 days ago [-]

https://github.com/irthomasthomas/llm-consortium

UV tool install llm

llm install llm-consortium

llm install llm-model-gateway

llm consortium save qwen-gem-sonnet -m qwen3-32b -n 2 -m sonnet-3.7 -m gemini-2.5-pro --arbiter gemini-2.5-flash --confidence-threshold 95 --max-iterations 3

llm serve qwen-gem-sonnet

In this example I used -n 2 on the qwen model since it's so cheap we can include multiple instances of it in a consortium

globalise83 139 days ago [-]

mecsred 139 days ago [-]

jjj123 139 days ago [-]

I just googled it and it looks like “n8n” is the name of the service. The op wasn’t abbreviating anything so I don’t think it’s the same phenomenon as what you’re describing.

lgas 139 days ago [-]

nemomarx 139 days ago [-]

nation is too short for 8 - maybe navigation?

pkaye 139 days ago [-]

Looks like n8n is short for nodemation

firesteelrain 139 days ago [-]

Why do we do this to ourselves?

Y_Y 139 days ago [-]

Techno-flagellation is the only way to atone

lgas 135 days ago [-]

So the 8 stands for "odematio"? That sounds about right.

oppodeldoc 139 days ago [-]

https://github.com/n8n-io/n8n?tab=readme-ov-file#what-does-n...

globalise83 139 days ago [-]

The app is actually called n8n - https://n8n.io/

eddieroger 139 days ago [-]

wongarsu 139 days ago [-]

mecsred 139 days ago [-]

psychoslave 138 days ago [-]

hnuser123456 139 days ago [-]

I had not, but that looks awesome. Microsoft put out something called "agent flows" that also fits this category.[1] I'm working on more of an "at home" version - no "talk to sales" button.

https://www.microsoft.com/en-us/microsoft-copilot/blog/copil...

139 days ago [-]

jedberg 139 days ago [-]

nonethewiser 139 days ago [-]

Ive felt this way when using chatgpt for a simple search. Stuff that google could handle but would just be slower, mostly from me having to manually filter.

ivape 139 days ago [-]

I caught infra people saying that's pretty much the only bottleneck in the data center right now, power and cooling. We know the AI needs to run against itself continuously, and that's just a fact.

mcswell 138 days ago [-]

Maybe we should assign them a practical task, like making paperclips.

Xcelerate 139 days ago [-]

jwally 139 days ago [-]

Scaled up and spread out - this probably gets you pretty close to consciousness(?)

lubujackson 139 days ago [-]

Decades ago I read The Society of Mind by Marvin Minsky. He pushed this sort of idea, that consciousness is composed of individual, competing processes. Worth a revisit!

andai 139 days ago [-]

What you just said is what I tried and failed to say ten minutes ago!

https://news.ycombinator.com/item?id=43835798

Nevermark 139 days ago [-]

It’s working! Oh, wait …

These models have limitations obviously, but many critiques apply equally or more to people.

albertgoeswoof 139 days ago [-]

How far is this going to go? Are we going to have a team of AI agents that runs a scrum team and meets for stand ups every couple of hours?

Are we going to replicate government bureaucracy with agents all debating topics all day long to find the best opinion?

kgeist 139 days ago [-]

parrit 139 days ago [-]

Havoc 139 days ago [-]

Seems likely to me. As long as adding more appears to help people will do it

Presumably there is some point where it levels out. And no doubt there will be a committee of AIs to determine said point.

Cause we wouldn’t want to boil the ocean…

faramarz 138 days ago [-]

That's cool! thanks for making it easy to fork and play with this!

my reasoning is that enterprise LLMs wont have any issue with the extra compute costs and would rather reconcile complex financials with various modeling optimizations.

I'm very new to public repo and contributions, and hope someone can point out if I'm doing it wrong.

my intention was to fork the ops codebase so I can test out my theory, and push as PR eventually

alexmolas 139 days ago [-]

There are two examples in the repo, one with CoRT and another one without. And the one without it it's much better than the one that uses it. Weird choice of examples...

2cheeze4u 139 days ago [-]

I think the names were switched up.

joshstrange 139 days ago [-]

I've thought about trying this cross-model as well. Have Claude generate something, have OpenAI check it, have Gemini check that check. Firing multiple of these in parallel.

There was a post here a week or so ago doing the "model checking model"-type thing with GH PRs IIRC that was interesting. I haven't had a chance to play with this idea yet.

K0balt 139 days ago [-]

k2xl 139 days ago [-]

Really fun, and helps me understand different sides of issues.

rat87 139 days ago [-]

k2xl 137 days ago [-]

I recommend you try it before judging. I will be honest it has been actually very useful.

You are also assuming that the LLM is providing false information (or will have a higher chance of providing false information than a human)

caseyy 139 days ago [-]

After 15 or so iterations, both assistants would keep repeating the same things and find agreement anyway. Sometimes, the chat became unhinged and useless, but 95/100 times, it was agreement.

Happy someone else made it work.

nowittyusername 139 days ago [-]

generalizations 139 days ago [-]

I always assumed you'd have to use different models. Even if only one of them is large, the others would inject enough difference of opinion to keep it useful.

zamalek 139 days ago [-]

This might be a situation that warrants a higher temperature. Actually, it could be worth starting a very high temperature initially and gradually decreasing it.

caseyy 139 days ago [-]

Even after turning the temperature way up, the outcome was the same, just the text less coherent. Not dismissing the idea, just sharing my exp.

bilekas 139 days ago [-]

I believe they called that machine learning.. Or re-enforced training.

I'm being slightly facetious, but my ignorant understanding of AI these days is basically the same no ?

https://www.youtube.com/watch?v=SX08NT55YhA

WhitneyLand 139 days ago [-]

Why try this idea on base models only?

The whole point of reasoning models is to automatically use COT and related techniques to bring out more capabilities.

It would be interesting to see if this is doing anything that’s not already being exploited.

ChadMoran 139 days ago [-]

Highly encourage others to check out Fast Agent. It has been delightful to use. It has interactive chat mode which I love and it's really tight and easy to implement.

https://github.com/evalstate/fast-agent

Der_Einzige 139 days ago [-]

Debate as a reasoning tactic is massively undervalued. There's tons of papers on this at places like NeurIPS, ICML, ICLR, etc.

Hell, even a whole quanta article. https://www.quantamagazine.org/debate-may-help-ai-models-con...

I got to meet and talk to the authors of this paper at NeurIPS. They're class acts!

139 days ago [-]

lepisma 139 days ago [-]

Debates have worked good for me while learning something new:

https://lepisma.xyz/2024/10/19/interventional-debates-for-st...

I believe there are researches on this too.

aaroninsf 139 days ago [-]

Question: has the the adversarial approach been roled into any coding copilots/assistant frameworks?

Costs of various kinds aside I've wanted that from assistance's inception — with precisely the features many call out and home-roll here, difference by both model/provider, and, "role"...

It seems like if you have the money/compute to burn, and can live with the reasoning wall-clock time,

this has got to be the best approach for the foreseeable future, for a lot of specific requirements.

badmonster 139 days ago [-]

mortarion 139 days ago [-]

I think Gemini 2.5 already does something similar. If you read the "thinking descriptions" that it outputs it often thinks about going back to older thoughts to verify and criticize.

yieldcrv 139 days ago [-]

Reminds me of baby agi from 2 years ago

but I guess that was before chain of thought models

zekenie 139 days ago [-]

I feel like itd be cool to try prompts based on an adversarial justice system… attorney agents arguing both sides, a judge ruling on “the law”—adherence to instructions etc

ivape 139 days ago [-]

That's very easy to do. A prompt I regularly use is a "council" system. For example:

svachalek 139 days ago [-]

Wow, that's a very cool prompt, I haven't tried anything like that before.

hu3 139 days ago [-]

Here's some related challenge I'm facing. Maybe someone can help me:

I also managed to make AI critique itself and that improved code generation a ton.

For a TypeScript backend project that runs with Bun, I tell AI to also generate and run unit tests after every code change suggested by AI.

How do you solve the risk of AI writting and executing unit tests with something like `rm -rf /` and wiping your files?

Docker works but I like to keep things simple.

Deno supports revoking file access but I'd like to keep using Bun.

small_scombrus 138 days ago [-]

> How do you solve the risk of AI writting and executing unit tests with something like `rm -rf /` and wiping your files?

The same way you stop any person or program or third party from doing something dumb or nefarious with your files. Don't give them any access to important files.

zactato 139 days ago [-]

Either you trust AI or you don't? If you don't trust it then you need to review what it's writing.

Docker seems like a pretty low complexity way to create an isolated environment to run automation.

derwiki 139 days ago [-]

Manually approve every terminal command it wants to run instead of vibe mode. Tbh I think an rm -rf scenario is exceedingly unlikely.

ivape 139 days ago [-]

a) You should only do this in a sandbox

nowittyusername 139 days ago [-]

No way around it, got to sandbox the whole thing no matter what.

schnitzelstoat 139 days ago [-]

But the fundamental way they operate is the same - predicting the next token given previous tokens. Where/how does reasoning happen here?

small_scombrus 138 days ago [-]

Upfront: I think AI is borderline useless for many of the tasks we give it. But:

1. Do our neurons not just react the same way every time to the same input? A brain is larger than the sum of its parts.

2. They don't reason, but you can somewhat emulate (or pretend to be) reasoning if you feed something back into itself enough times and it pinky promises reasoning is happening

139 days ago [-]

stormfather 139 days ago [-]

I made a trading bot that ingested news. The prompt to assess impact was to simulate a debate between Charlie Munger and Warren Buffet on whether to invest.

internetter 139 days ago [-]

How did it do?

thunderbong 139 days ago [-]

A lot of the comments here are reminiscent of the early Google days when everyone was finding ways to search better!

139 days ago [-]

j45 139 days ago [-]

There appear to be no shortage of token saving attempts that can end up using more tokens, whether it's a monthly paid plan or API.

Having an approach to recognize what is needed from the AI software, and anticipate how it may default to respond based on it's programming is critical.

pkdpic 139 days ago [-]

Also I think I kind of assumed OpenAI might be doing this behind the curtain?

mritchie712 139 days ago [-]

Did something similar (OverkiLLM) to this waayyyy back in August with open LLMs. I'm sure it'd work much better now:

https://www.definite.app/blog/overkillm

rriley 139 days ago [-]

Makes me wonder what would happen if we combine LLMs with recursive genetic algorithms. Similar to https://github.com/DivergentAI/dreamGPT

noworriesnate 139 days ago [-]

I’ve had success telling the model it really needs to poop and if it gets to the point quickly it’ll be able to leave the meeting and go do that. It actually works amazingly well.

It’s also a lot more ethical than verbal abuse, which some people say improves the results as well.

Programming isn’t what it used to be.

tinix 139 days ago [-]

this works for getting out of traffic tickets too lol

ausbah 139 days ago [-]

at some point this doesn’t make LLMs feel useful. I have to wait 10x as long just so my LLM can have a somewhat higher chance of actually answer my question correctly?

cwillu 139 days ago [-]

Any api that lets you constrain output to a formal syntax should let you do away with the “first output a number, and only then explain yourself” boilerplate.

killerstorm 139 days ago [-]

This is similar to Tree-of-Thought with self-evaluation.

daxfohl 139 days ago [-]

Maybe have a "reconcile" option, for it to see if it can mix and match the best parts of each alternative rather than just choosing one.

grzracz 139 days ago [-]

Your readme demo images are wrong: the terminal one is the non-CoRT one and the GUI one is the one with CoRT. Confused me for a while

Svoka 139 days ago [-]

Oh. I was just asking "Use dialectic method on your solution" in the end of the prompt... It does make it think harder.

ashoeafoot 139 days ago [-]

Give it reward and punishment evaluations, exploring the noise in parallel, extinction for the non rewarding answers ?

keyle 139 days ago [-]

When will we get the `4o` vs `o3` background conversation in "thinking" leading to a more correct result?

kevinrineer 139 days ago [-]

This sounds like the zeitgeist is approaching genetic algorithms, which are super fun. Adversarial stuff is great.

throwawayForMe2 139 days ago [-]

I wonder if the Scholastic method of the Schoolmen would be useful with its argument and counter argument style.

alex1138 139 days ago [-]

Every single one of my prompts would be "Are you suuuuuuure you're not hallucinating that?"

Garlef 139 days ago [-]

Similarly, letting the LLM generate a socratic dialogue can work pretty well to get deeper into a topic.

mangoman 139 days ago [-]

a paper with a similar idea on scaling test time reasoning, this is sorta how all the thinking models work under the hood. https://arxiv.org/abs/2501.19393

gnarlouse 139 days ago [-]

This seems like low hanging fruit; are we seriously supposed to believe this is new and novel?

138 days ago [-]

irthomasthomas 139 days ago [-]

my favourite pattern rn: llm "write a savage, yet grounded roast of: $content" llm -c "Write an equally savage rebuttal" llm -c "first arbitrate and then synthesize a final review."

asdfman123 139 days ago [-]

And when I do this people say I'm overanalyzing

ivape 139 days ago [-]

asdfman123 139 days ago [-]

I don't actually think it's that weird though

animitronix 139 days ago [-]

Adversarial networks have been a thing for a while

stevefan1999 139 days ago [-]

That is just reinforcement learning in disguise

akomtu 139 days ago [-]

The modern Alchemy: the belief that you can extract gold (intelligence) from iron (autocomplete by imitation) by mixing iron with itself.

csours 139 days ago [-]

Yes, give the computers anxiety too!

lonetripper 139 days ago [-]

all this hard thinking yet humanity fails to come up with just one girlfriend for me

robofanatic 139 days ago [-]

soon there will be AI debates. Different models debating with each other on a topic

mparnisari 139 days ago [-]

So like rubber ducking for AI?

z2 139 days ago [-]

1970-01-01 139 days ago [-]

"While hallucinating a duck, check my script for errors."

jbellis 139 days ago [-]

does it actually make a difference to do M rounds of N vs one round of M*N?

nowittyusername 139 days ago [-]

My gut tells me yes. From my own experiments the order and way in which these things are done are important. I think it all is very strongly tied to the attention mechanism.

firgrove 139 days ago [-]

this is amazing - I love seeing novel approaches to optimizing

celltalk 139 days ago [-]

One of my doctoral propositions is, dialog leads to true artificial intelligence.

139 days ago [-]

getcrunk 139 days ago [-]

Hello cnn’s

parrit 139 days ago [-]

I want to see "Meh" vs. "Holy crap" as a benchmark in a paper published by Google. Or more likely I suspect, Andrej.

codr7 139 days ago [-]

Better yet, let it argue with another AI, preferably using voice; instant entertainment.

139 days ago [-]

antisthenes 139 days ago [-]

Cool. Now I can justify talking to myself.

dqewijodjqweido 139 days ago [-]

[flagged]

casenmgreen 139 days ago [-]

[flagged]

hackinthebochs 139 days ago [-]

casenmgreen 139 days ago [-]

The most superficial response is adequate, where the claim is so improper.

LLM/AI are extremely useful. I am in no way disputing this.

stevenAthompson 139 days ago [-]

Can you define "thinking" in a way that excludes what the AI is doing, but includes what humans do?

I haven't' really seen anyone else manage it without talking about ghosts or some other kind of metaphysical voodoo.

dttze 139 days ago [-]

Using a conceptual understanding of something to deduce or infer something else.

An LLM doesn't know what anything is. Just what goes around the token representation of that thing.

stevenAthompson 139 days ago [-]

"conceptual understanding of something" is just another way of saying "the relationship between concepts", which is exactly what transformer models use.

dttze 139 days ago [-]

stevenAthompson 139 days ago [-]

> It is grasping its meaning, significance, applications, and boundaries

Cantinflas 139 days ago [-]

Why is it so hard to believe that a complex neural network can think? You literally have one over your shoulders that does exactly that.

consumer451 139 days ago [-]

I am not sure that you can make that absolute statement. Reasoning is subdivided into types, and one of those types is inductive reasoning.

Doesn't predicting the next token qualify as doing just that?

https://en.wikipedia.org/wiki/Inductive_reasoning

dttze 139 days ago [-]

Markov chains have done that for ages. They aren't AI. This is just that scaled up.

Just because it can infer a token doesn't mean it can infer a conclusion to an argument.

voidspark 139 days ago [-]

> This is just that scaled up

Otherwise you could use the same reasoning to argue that a human is a Markov process, which is absurd, but vacuously true if "state" means the quantum level configuration of every atom in the body.

casenmgreen 139 days ago [-]

To add a bit to this : expert systems have two properties. They give an answer, and they explain their reasoning.

LLM cannot explain their reasoning, and that is because there is no reasoning.

consumer451 139 days ago [-]

To push back on this, a somewhat recent Linus Torvalds ~quote:

"I don't think that 'just predicting the next word' is the insult that people think it is, it's mostly what we all do."

If we break our lives down into the different types of reasoning, and what we mostly do day-to-day, this rings very true to me.

___

edit:

4o appears to agree with both of you, more than it does with me.

https://chatgpt.com/share/68119b41-1144-8012-b50d-f8f15997eb...

However, Sonnet 3.7 appears to side with me.

https://claude.ai/share/91139bca-3201-4ffc-a940-bdd27329e71f

(Both of these are the default models available for free accounts, on each website, at the time of writing)

IMO, hey, at least we do live in interesting times.

casenmgreen 139 days ago [-]

I may be wrong, but it seems to me this also is a case of improper use of words.

Those LLMs neither agree nor disagree. They do not understand. They produce output, and we read that output and we ourselves consider the output to be something, or something else.

All an LLM does is produce output. There's no conceptual understanding behind it, and so there is no agreement, or disagreement.

consumer451 139 days ago [-]

> All an LLM does is produce output. There's no conceptual understanding behind it, and so there is no agreement, or disagreement.

I think that I agree. However, even on HN, what percentage of human comments are simply some really basic inference, aka output/"reddit"/etc... and those are humans.

I currently believe that modern LLM architecture will likely not lead to AGI/ASI. However, even without that, they could do a lot.

I could also be very wrong.

[0] https://en.wikipedia.org/wiki/Nobel_disease

voidspark 139 days ago [-]

LLMs learn high-dimensional representations that capture conceptual relationships in their training data. They manipulate those representations in ways that approximate human reasoning.

consumer451 138 days ago [-]

> They manipulate those representations in ways that approximate human reasoning.

Fwiw, this is the story of my life. Seriously.

voidspark 138 days ago [-]

LOL everyone is like that most of the time.

System 1 vs System 2 thinking.

System 1 is rapid, uses heuristics to make quick judgements. Not rigorous. System 1 is the default mode.

System 2 is slow deliberate reasoning, energy intensive, and even humans get that wrong.

LLMs often use something like System 1 pattern matching, get the answer wrong initially, then can be prodded into trying again with a System 2 approach (chain of thought).

https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow

139 days ago [-]

verytrivial 139 days ago [-]

throwaway150 139 days ago [-]

senko 139 days ago [-]

Agree 100%.

Also smartphones are not actually smart and are only barely phones.

casenmgreen 139 days ago [-]

I'm finding laymen are thinking AI is reasoning, because the term makes it look like this is what it is.

The potential confusion of terms such as mouse/windows/surfing is not the same as calling LLM AI, and then going on to say it is "thinking" and "reasoning".

rapfaria 139 days ago [-]

Or "thinking" just got a new meaning and it's to convey information in the field - perhaps the Oxford dictionary will add it soon?

jasonthorsness 139 days ago [-]

What words would you use instead?

Philpax 139 days ago [-]

You say these things with such certainty. How can you be so sure?

pfdietz 139 days ago [-]

You are the critic. Construct three rebuttals to your claim.

m3kw9 139 days ago [-]

Isn’t this best of n?

139 days ago [-]

lenerdenator 139 days ago [-]

I, too, like to give Terminator lite anxiety.

hansmayer 139 days ago [-]

DyslexicAtheist 139 days ago [-]

> "I made my AI think" ...

utterly moronic.

They don't “think” ... not even in the most autistic sense of the word.

They can generate solutions by combining existing knowledge in unique ways. But they don't “think”.

mortarion 139 days ago [-]

That's exactly what us humans do when we think about stuff. We combine memories and knowledge in unique ways, then we usually go ask someone else to give input on it.