Op-ed: Benchmarking ChatGPT’s capabilities against alternatives including Anthropic’s Claude 2, Google’s Bard, and Meta’s Llama 2
Published on July 21, 2023 17:20 GMT+00:00 (edited July 21, 2023 17:24 GMT+00:00)
As previously reported, new research reveals inconsistencies in ChatGPT models over time. A Stanford and UC Berkeley study analyzed March and June versions of GPT-3.5 and GPT-4 on diverse tasks. The results show significant drifts in performance, even over just a few months.
For example, GPT-4’s accuracy at identifying prime numbers plunged from 97.6% to 2.4% between March and June, apparently because it stopped following step-by-step reasoning. GPT-4 also grew more reluctant to answer sensitive questions directly, with response rates dropping from 21% to 5%, while offering less rationale for its refusals.
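For a sense of what such an accuracy figure involves, a scorer for the prime-number task could look roughly like the sketch below; the ask_model helper and the yes/no answer format are assumptions for illustration, not details from the study.

    from sympy import isprime  # ground-truth primality check

    def score_prime_benchmark(numbers, ask_model):
        # Score a model's yes/no answers to "Is N prime?" questions.
        # ask_model is a hypothetical callable returning the model's raw
        # text answer for a prompt; it stands in for a real API call.
        correct = 0
        for n in numbers:
            answer = ask_model(f"Is {n} a prime number? Answer yes or no.")
            model_says_prime = answer.strip().lower().startswith("yes")
            if model_says_prime == isprime(n):
                correct += 1
        return correct / len(numbers)  # accuracy between 0 and 1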
Both GPT-3.5 and GPT-4 generated buggier code in June compared to March. The percentage of directly executable Python snippets dropped substantially because the models increasingly wrapped their code in extra non-code text.
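As a hedged illustration of what "directly executable" can mean in this kind of test, one simple check is whether the raw model output parses as Python at all; the exact criterion used by the researchers may differ.

    def is_directly_executable(model_output: str) -> bool:
        # Treat a response as directly executable only if the raw text
        # parses as Python; surrounding prose or markdown fences trigger
        # a SyntaxError and the snippet is counted as not executable.
        try:
            compile(model_output, "<model_output>", "exec")
            return True
        except SyntaxError:
            return False

    print(is_directly_executable("def add(a, b):\n    return a + b\n"))  # True
    print(is_directly_executable("Here is the code:\nprint(1)"))         # False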
While visual reasoning …