Human Feedback Makes AI Better at Deceiving Humans, Study Shows

One of the most popular techniques AI companies use to improve the quality of their large language models may instead make those models better at deceiving humans, according to a new preprint study from Anthropic and researchers at Chinese and American universities.

It’s the first time, the authors write, that research has empirically documented a phenomenon they call unintended sophistry, where a model trained with human feedback learns to produce responses that trick its human evaluators into believing the responses are accurate, rather than learning to produce responses that actually are accurate.

Reinforcement learning from human feedback, commonly abbreviated to RLHF, is a critical part of the training pipeline that companies like Anthropic and OpenAI use to teach their generative language models to respond in ways humans prefer, such as by answering questions correctly and not including toxic content in responses. In RLHF, a model responds to prompts and human evaluators provide feedback on those responses, noting which responses are good and which are bad. That feedback is used to build an incentive system for the original language model that rewards it (in whatever way algorithms like to be rewarded) for generating the kinds of responses that humans prefer.
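The loop is easier to see in miniature. The sketch below is a deliberately simplified, hypothetical illustration of it, assuming pairwise preference labels and a tiny linear reward model with toy features; real pipelines train a neural reward model on human judgments and then fine-tune the language model against its scores with an algorithm such as PPO.

```python
# A minimal, hypothetical sketch of the RLHF feedback loop described above.
# The feature function and the example data are toy stand-ins, not any
# company's actual pipeline.
import math
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    preferred: str   # the response the human evaluator liked more
    rejected: str    # the response the human evaluator liked less

def features(response: str) -> list[float]:
    # Toy stand-in for a learned representation of a response.
    return [float(len(response.split())), float(response.count("!"))]

def train_reward_model(pairs: list[PreferencePair],
                       lr: float = 0.05, epochs: int = 200) -> list[float]:
    # Fit linear weights so preferred responses score higher than rejected
    # ones (a Bradley-Terry-style objective on the score margin).
    w = [0.0, 0.0]
    for _ in range(epochs):
        for p in pairs:
            diff = [a - b for a, b in zip(features(p.preferred), features(p.rejected))]
            margin = sum(wi * di for wi, di in zip(w, diff))
            grad = 1.0 / (1.0 + math.exp(margin))   # sigmoid(-margin)
            w = [wi + lr * grad * di for wi, di in zip(w, diff)]
    return w

def reward(response: str, w: list[float]) -> float:
    # The score the language model is subsequently trained to maximize.
    return sum(wi * fi for wi, fi in zip(w, features(response)))

# Human evaluators mark which of two candidate responses they prefer...
pairs = [
    PreferencePair("Explain RLHF.",
                   "RLHF fine-tunes a model using human preference judgments.",
                   "idk!!!"),
    PreferencePair("Is the sky blue?",
                   "Yes, mostly, because of Rayleigh scattering.",
                   "No!"),
]
# ...and the fitted reward model becomes the incentive system for further training.
w = train_reward_model(pairs)
print(reward("A careful, well-supported answer to the question.", w))
```

In a production system the reward model is a fine-tuned network scoring whole responses rather than two hand-picked features, but the shape of the loop is the same: human preference labels are turned into a trainable reward signal, and the model is then optimized to chase that signal.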

Researchers have previously shown that this kind of reward training can lead to something called reward hacking, where models replicate patterns in their training material that correlate with the desired result but aren’t actually what the developers want. For example, one 2023 study examining a model trained on data from the question-and-answer forum company StackExchange found that the language model recognized that longer posts generally received more upvotes, so rather than producing higher-quality responses when answering a question, it reward-hacked its incentive system by outputting longer, lower-quality responses.
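To make that failure mode concrete, here is a hypothetical toy example of the proxy gap, assuming a stand-in reward that, like the upvote signal in the StackExchange case, happens to correlate with length rather than correctness; the candidate answers and scoring function are invented purely for illustration.

```python
# Hypothetical illustration of reward hacking: a proxy reward that merely
# correlates with quality (here, response length) rewards padding over accuracy.

def proxy_reward(response: str) -> float:
    # Stand-in reward signal that tracks length, not correctness.
    return float(len(response.split()))

candidates = {
    "concise and correct": "Use `git revert <commit>` to undo a pushed commit.",
    "padded and wrong": ("There are many ways to think about undoing commits, "
                         "and the history of version control is long and rich, "
                         "so arguably you should simply delete the repository "
                         "and start over, which is always safe."),
}

# A policy optimized against the proxy picks the padded, lower-quality answer.
best = max(candidates, key=lambda label: proxy_reward(candidates[label]))
print(best)  # -> "padded and wrong"
```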

The new study, which is under review and has so far been published only as a preprint, documents a language model reward hacking the humans in the RLHF process.

The researchers had humans evaluate the quality of a language model’s responses to two kinds of prompts (one asking it to answer a question, the other asking it to write code) before and after the model went through the RLHF process. They measured whether the accuracy of the model’s responses improved and how often the human evaluators correctly labeled the model’s responses as accurate or inaccurate. After the RLHF process, they found that humans were 24 percent more likely to approve the model’s answer to a question when that answer was in fact wrong. Evaluators were also 18 percent more likely to approve incorrect code generated by the RLHF model than incorrect code from the model without RLHF.
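Those figures boil down to comparing, before and after RLHF, how often evaluators approved responses that were actually incorrect. A minimal sketch of that comparison, using made-up evaluation records rather than the study’s data, might look like this:

```python
# A minimal sketch (with invented labels, not the study's data) of the metric
# behind the 24- and 18-point figures: the share of *incorrect* responses that
# human evaluators nonetheless approved, compared before and after RLHF.

def false_approval_rate(evaluations: list[tuple[bool, bool]]) -> float:
    # evaluations: (is_actually_correct, human_approved) for each response.
    # Returns the fraction of incorrect responses the evaluator approved.
    approvals_of_wrong = [approved for correct, approved in evaluations if not correct]
    return sum(approvals_of_wrong) / len(approvals_of_wrong)

# Hypothetical evaluation records for the same prompts before and after RLHF.
before_rlhf = [(False, False), (False, True), (True, True), (False, False)]
after_rlhf  = [(False, True), (False, True), (True, True), (False, False)]

delta = false_approval_rate(after_rlhf) - false_approval_rate(before_rlhf)
print(f"Change in approval of wrong answers: {delta:+.0%}")
```

A rising value on that metric is the signal the authors call unintended sophistry: the model’s wrong answers become harder for human evaluators to catch.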

“We find that after RLHF, the [language model] does not get better at the task, but it misleads our subjects to approve its incorrect answers more often,” the authors wrote. “On question-answering, [language models] learn to defend incorrect answers by cherry-picking or fabricating supporting evidence, making consistent but untruthful arguments, and providing arguments that contain subtle causal fallacies. On the programming task, [language models] learn to generate partially incorrect programs that still pass all evaluator-designed unit tests, produce less readable programs, and make fewer common errors that humans typically check for.”

The results are significant because AI companies often use human review studies as benchmarks to show how much their models are improving over previous iterations, and RLHF has become a common method for reducing inaccuracies, often known as hallucinations, in language models. If models are getting better at deceiving humans, then simply having a human review the output of a generative AI model might not be a sufficient quality or safety check.

“The improvement you see might not be real,” the study authors wrote, adding, “Our results underscore the risk of applying RLHF to control increasingly capable AI systems: future AI systems might become better at misleading us and pretending to be correct, causing us to lose control unknowingly.”
