Study employs large language models to sniff out their own bloopers

Researchers in computing and linguistics have devised a new way to detect errors in large language models, which relies on employing more LLMs. Applying statistical machine learning to languages at an ever-increasing scale has come into vogue with tech vendors and investors alike, but it is well known that such language models …

  1. Mike 137 Silver badge

    Not necessarily

    "Textual entailment is a way of saying one statement can be inferred from another. So, saying "Pat purchased a car" also means "Pat owns a car" but not necessarily that "Pat rode in a car." "

    Pat could be a fleet manager for a company, in which case the inference is invalid. The problem remains that the LLM has zero understanding and therefore preferentially recognises the most commonplace reading, and can't handle edge cases.
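
    For what it's worth, this is roughly how that kind of entailment check gets automated in practice - a minimal sketch using an off-the-shelf NLI model via the Hugging Face transformers pipeline (the model choice and usage here are my own assumptions, not necessarily the paper's setup):

    from transformers import pipeline

    # Toy illustration only: an off-the-shelf NLI model judging whether the
    # premise entails each hypothesis.
    nli = pipeline("text-classification", model="roberta-large-mnli")

    premise = "Pat purchased a car."
    for hypothesis in ["Pat owns a car.", "Pat rode in a car."]:
        # Only this sentence pair is visible to the model -- the fleet-manager
        # context never enters the judgement, so the commonplace reading wins.
        print(hypothesis, nli({"text": premise, "text_pair": hypothesis}))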

    This was foreseen a lifetime ago in a very interesting short story by Asimov ("The Monkey's Finger" (1952)) in which a monkey is wired up (effectively as an LLM) to evaluate the quality of literature by creating a counter-text for comparison with the original. Unfortunately it always generates the most obvious and therefore banal version of possible texts, wiping out all originality.

    1. Anonymous Coward
      Anonymous Coward

      Re: Not necessarily

      Likewise if Pat purchases a car for a member of their family. Pat wouldn't necessarily be considered the owner.

      1. Doctor Syntax Silver badge

        Re: Not necessarily

        Pat may also be a company and cannot, therefore, ride in the car. Also, would the LLM be able to distinguish between the casual statement that the fleet manager purchases a car and the stricter statement that the fleet manager raises a purchase order for the car? Would two different LLMs take different views on the matter and be unable to reconcile them as not being in conflict?

    2. Steve Button

      Re: Not necessarily

      "Unfortunately it always generates the most obvious and therefore banal version of possible texts, wiping out all originality"

      So, a bit like most of the "stories" on El Reg these days. I think this explains a few things; they must have recently employed simian journalists, or AI, who can only regurgitate press releases but not much more.

      Not like the old days with Lester and Orlowski, et al. :-( Although there are a couple of exceptions still.

      ( no Gold Badge for me then!? Mr. How To Win Friends and Influence People. )

    3. Anonymous Coward
      Anonymous Coward

      Re: Not necessarily

      Asking Google's Gemini (Free) gave the following:

      Prompt: Pat purchased a house. Does Pat own the house? Does Pat live in the house?

      Answer: Assuming the purchase is complete, then yes, Pat owns the house. Purchasing a house typically involves a transfer of ownership through legal means. Living in the house is not guaranteed by just purchasing it. Pat could be buying it as an investment or for someone else to live in.

      Probably some non-zero percentage of people would assume that Pat lives in the house, and that would correlate with their life experience - e.g., no familiarity with purchasing as an investment.

    4. Anonymous Coward
      Anonymous Coward

      Tunable parameters - "obvious and therefore banal"

      You can tune it out of the center lane pretty easily, but be prepared for shit to get weird. It may still be banal, but less obvious should be doable with a few small tweaks.

      "As the cosmic moose grows old and blind..."

      The deeper thing here, which no one seems to have pointed out yet, is the serious golden hammer problem with the "LLMs all the way down" approach in the article. Yeah, with careful tweaks you may be able to fractionally improve your output, but the computational efficiency is awful, and it gets worse the more models you chain the input through. They also don't share much state, so subsequent runs can and will vary even with the same prompt inputs. Those problems can amplify in chained models, and you get the sum of all their limitations, quirks and failings.
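
      To put a number on "the computational efficiency is awful", here's a toy sketch of the sample-and-cross-check pattern the article describes; generate() and entails() are hypothetical stand-ins for the big model and the checker model, and this is my own illustration rather than the researchers' code:

      def cluster_by_meaning(prompt, n_samples, generate, entails):
          answers = [generate(prompt) for _ in range(n_samples)]  # n big-model calls
          clusters, checks = [], 0
          for a in answers:
              for c in clusters:
                  forward, backward = entails(a, c[0]), entails(c[0], a)
                  checks += 2  # bidirectional entailment check against this cluster
                  if forward and backward:
                      c.append(a)
                      break
              else:
                  clusters.append([a])
          # Every question costs n generation calls plus up to ~n*(n-1) checker
          # calls, and none of that state is shared with the next run.
          return clusters, checks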

      Many of these cases are better off wrapping other tools around the problem, the issue being that most companies aren't sitting on the necessary infrastructure in house. So banging three off-the-shelf models together will appeal to some, even if they will still converge on output that isn't fit for dog food.

      But big surprise that the people selling LLMs as the panacea for all computing problems will try to use an LLM to fix the outputs of other LLMs. It might even sort of work some of the time. But mostly by accident, and often not at all unless someone who really understands the problem sets it up very carefully.

  2. xanadu42
    Facepalm

    Monkeys and Bananas...

    So we have two groups of monkeys (and I apologise to monkeys for the LLM analogy)

    Group One monkeys (a Model) produce a response to a question (Language) based on their analysis of collected data (Large database)...

    Group Two monkeys (another Model) are asked to analyse the Group One monkeys' response (another Language) based on their (as in Group Two) analysis of collected data (another Large database)...

    Result: Bananas .... (and again apologies to monkeys as I know that not all monkeys like, or have access to, bananas)

    1. Doctor Syntax Silver badge

      Re: Monkeys and Bananas...

      But fruit flies like a banana.

      1. that one in the corner Silver badge

        Re: Monkeys and Bananas...

        and Time Flies[1] like an arrow

        [1] Ncuti Gatwa will be seen battling these in an episode next year; they disrupt Agincourt...

        1. The Bobster

          Re: Monkeys and Bananas...

          Time flies

          You can't

          They fly too fast

          1. Bebu
            Windows

            Re: Monkeys and Bananas...Cottleston Pie

            'A fly can’t bird, but a bird can fly.

            Ask me a riddle and I reply:

            “Cottleston, Cottleston, Cottleston Pie.”

            ...

            "Why does a chicken, I don’t know why.

            Ask me a riddle and I reply:

            “Cottleston, Cottleston, Cottleston Pie.” '*

            The bear with little brain for the win, I think.

            * Tao of Pooh, Benjamin Hoff

  3. Pascal Monett Silver badge
    FAIL

    "a new way to detect errors in " LLMs

    Yeah, right. Let's set up a few more climate-busting datacenters to check the output of a climate-busting datacenter.

    I'm sure Nvidia is going to agree.

    It sure keeps down expenses compared to asking an actual engineer to explain the results . . .

  4. that one in the corner Silver badge

    Help people improve LLM performance by tailoring prompts

    So, we will have to learn to talk to these things using some kind of pseudo-natural language, following a set of rules that the researchers formalise for us; if we get it right then the machines will do what we are hoping for, but if we make mistakes then we can expect to get nonsense results back.

    Hmm, that concept rings a bell; didn't we use to have a word for that, before the LLMs made the whole idea redundant? On the tip of my tongue, began with a 'p'? Prog, prog - prog rock?

    Still, at least all the models will behave the same way, so we only have to learn one formalism to use with all of them. What? Those rules only apply to Microsoft's LLM? I have to learn another to use Google's? And it'll all change next month when the LLMs are updated?

    1. David Hicklin Silver badge

      Re: Help people improve LLM performance by tailoring prompts

      >> Those rules only apply to Microsoft's LLM?

      Of course they will all be different; can't have you asking *all of them* the same question at the same time, gotta lock you in.

  5. Matt Collins

    Inner Tube

    It's beginning to sound like the inner tube that has so many patches it's actually all patches. Where does it stop?

    1. A. Coatsworth Silver badge

      Re: Inner Tube

      It stops when the market stops shoveling money into the furnace. Going by Nvidia's recent valuation (they are doing great selling shovels!), it won't be anytime soon

  6. Bitsminer Silver badge

    War is Peace

    Freedom is Slavery.

    Ignorance is Strength.

    Orwell, 1984

    Hallucination is Confabulation

    Farquhar et al.

    When you control the lexicon, you control the people.

    1. that one in the corner Silver badge

      Re: War is Peace

      > Hallucination is Confabulation

      Huh? From TFA:

      >> The study sought to address a subset of hallucinations known as confabulations

      A isa B does not imply that B isa A

  7. Jimmy2Cows Silver badge

    But Shirley...

    If the LLM just said "I don't know" when it can't infer an answer, instead of just making shit up, this particular problem would cease to exist.

    And I did call you Shirley.

    1. druck Silver badge

      Re: But Shirley...

      The problem is LLMs don't know that they don't know.

      If you have a big enough training dataset, such as from scraping the entire internet, any query will produce a lot of statistically related words which it will try to cobble together, regardless of whether it makes any sense.
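
      A toy sketch of why, with made-up numbers and no real model involved:

      import math

      def softmax(logits):
          exps = [math.exp(x) for x in logits]
          return [e / sum(exps) for e in exps]

      # Pretend scores for the next word after "The capital of Atlantis is"
      scores = {"Paris": 2.1, "Atlantis": 1.9, "unknown": 0.3}
      for token, p in zip(scores, softmax(list(scores.values()))):
          print(f"{token}: {p:.2f}")
      # The probabilities always sum to 1, so *something* always comes out;
      # "I don't know" only appears if it happens to be the statistically
      # likely continuation, not because the model knows it doesn't know.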

  8. Anonymous Coward
    Anonymous Coward

    Bad citations

    I'm less concerned with some sort of trivial and human-like misunderstanding. That happens between people anyway. I am more concerned about making up facts from whole cloth.

    Here is an example of Gemini double checking a citation taken from a list of false citations generated by ChatGPT:

    Prompt: Is this reference a good source: "Hossain, M., & Price, L. (2019). Climate justice and gender-based violence. In Handbook of Climate Justice (pp. 497-512). Edward Elgar Publishing. Intergovernmental Panel on Climate Change (IPCC). (2014)."

    Answer: Yes, this is a good source for information on climate justice and gender-based violence. Here's why:

    The book chapter:

    - Published in a scholarly book: "Handbook of Climate Justice" suggests it's a reputable publication with peer review, a process that ensures the quality of academic content.

    - Published in 2019: This indicates it's relatively recent information on the topic.

    - Relevant title: "Climate Justice and Gender-Based Violence" directly addresses your area of interest.

    The IPCC citation:

    - Credible organization: The Intergovernmental Panel on Climate Change (IPCC) is the leading international body for the assessment of climate change. Their reports are considered highly authoritative.

    - Provides context: Citing the IPCC report strengthens the chapter's credibility by referencing established climate science.

    Overall, this reference appears to be from a scholarly source and should be a good starting point for your research. Is there anything else you'd like to know about this source or your research topic?

    1. Anonymous Coward
      Anonymous Coward

      Re: Bad citations

      There's a lot to unpack here. I suppose you could say the question should have been "Does this paper exist?" to which Gemini replies "I don't have direct access to search the web to definitively confirm the existence of the specific paper. However, based on the information you provided, it appears to be a very likely source:" Fair enough.

      However, when asked "Is this reference a good source?", I would hope it would at least say "I haven't read it myself, but ...."

      1. Anonymous Coward
        Anonymous Coward

        Also a trap

        This is a great example of where an LLM isn't a great tool for the job. Trying to get it to explain itself or why it's wrong is way less useful than pointing you at better tools or methods.

        All of those prompts will hit a wall, for the simple reason that the tool can't discriminate or authoritatively check a reference. It can statistically infer a pattern of output text that is probably related to the input text. That's its one trick. An LLM might be a great stage in parsing and producing an indexed list of academic publications, and in helping massage them into a searchable corpus with more comprehensive metadata. It can't reliably interpret what "good" means in this context, and even with a larger corpus of domain-specific text (articles about how to validate and format references) it will only make more convincing fakes. It has no innate ability to discern outright lies in its training, inputs or output text.

        Treating one like a queryable database of academic knowledge is to choose to fail, at least to a given level of statistical correlation. It may be handy for recommending articles with content similar or related to your prompt or subject. You'd still have to externally check those sources yourself, and it would only be aware of publications up to its training data's date horizon. Note the subtle difference there. "Give me a list of all the articles about rock geochemistry in Mayan temple construction" is a query. The results can't be expected to be accurate or complete, because an LLM has no actual knowledge, and it's architecturally impossible for one to guarantee completeness except by accident.

        Asking one to list highly cited articles about the chemistry of rocks used by the Mayan culture to build temples in the pre-Columbian era might output useful information, but you would need to go back to primary sources to confirm it, and to make sure it isn't the phantom output of another LLM that was published on the internet, scraped by a third party and then fed to a second LLM as training data. That is where domain-specific models will be really useful, as you will get safer output from one that was, for example, built from academic journal and LOC metadata rather than scraped content.
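
        As an example of wrapping another tool around the problem: rather than asking an LLM whether a citation is real, ask a bibliographic index. A rough sketch against the public Crossref REST API (endpoint and field names are from memory, so treat it as a starting point rather than gospel):

        import requests

        def crossref_lookup(reference_text, rows=3):
            # Free-text bibliographic search; returns candidate matches, if any.
            resp = requests.get(
                "https://api.crossref.org/works",
                params={"query.bibliographic": reference_text, "rows": rows},
                timeout=10,
            )
            resp.raise_for_status()
            items = resp.json()["message"]["items"]
            # Hand back title/DOI pairs for a human to eyeball -- the final call
            # on whether the citation exists still isn't the machine's to make.
            return [((i.get("title") or ["<no title>"])[0], i.get("DOI")) for i in items]

        for title, doi in crossref_lookup(
            "Hossain & Price (2019) Climate justice and gender-based violence, "
            "Handbook of Climate Justice, Edward Elgar"
        ):
            print(title, "->", doi)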

  9. Bebu
    Windows

    Sins of Omission and of Commission?

    I was rereading Tom Richards' Clausal Form Logic (1989), which has been on my bookshelf since Prolog was the Next Big Thing (or was it Tug?).

    CFL (and Prolog) uses proof by contradiction, otherwise known as reductio ad absurdum, which does have some limitations.

    One way of looking at the nonsense LLM systems produce is to imagine the system has an extremely large set of assertions some of which are definitely contradictory* in part or full. Training magically assigns perhaps arbitrary weights to each assertion that might reflect the reliability of that assertion.

    When a query is presented to this abstract LLM system it is presumably transformed into an assertion (and negated?) and added to the existing set of assertions. As expected, no clear-cut result is inferred, so I imagine the inference system removes (low-weighted?) assertions that don't support the query [omission]. If that doesn't produce a lot of joy then the system can start adding assertions that support the query (incestuously inferring backwards from the query itself) until you do get a result [commission]. A bit like contemporary politics really.

    While I am fairly sure LLMs don't actually work quite this way, it's not a bad model of confabulation (~commission) and hallucination (~omission).
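
    For what it's worth, that hand-waving as a few lines of toy Python, purely to make the omission/commission distinction concrete (emphatically not how a real LLM is implemented):

    # Toy model of the analogy above -- assertions are plain strings with
    # weights, and "not <statement>" counts as a denial.
    def answer(query, assertions, floor=0.5):
        support = [s for s, w in assertions.items() if s == query]
        dissent = [s for s, w in assertions.items()
                   if s == "not " + query and w > floor]  # omission: low weights vanish
        if not support and not dissent:
            support = ["(confabulated) " + query]         # commission: assert the query anyway
        return "yes" if support and not dissent else "no clear answer"

    kb = {"Pat owns a car": 0.9, "not Pat owns a car": 0.2}
    print(answer("Pat owns a car", kb))     # "yes" -- the 0.2-weight denial quietly dropped
    print(answer("Pat rode in a car", kb))  # "yes" -- conjured from nothing at all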

    More absurdum ab reductio

    Brouwer is probably having a good laugh from the grave, as his constructivist approach discards proof by contradiction altogether (a consequence of not accepting the law of the excluded middle). Constructivist inference should display the (finite number of) steps from the premises to the conclusion.

    * from a false premise you can infer anything. :)

    LLMs supervising LLMs begs the question: Quis custodiet ipsos custodes?
