Can the New Claude AI Sonnet Model Beat ChatGPT-4o? - Decrypt

06/21/2024 21:01

Anthropic says its Claude 3.5 Sonnet release is faster, cheaper, and beats OpenAI's latest model in most tests. We put them to a head-to-head test.

Anthropic, a leading AI research company founded by former OpenAI researchers, announced yesterday the launch of Claude 3.5 Sonnet, the latest and most advanced model in the Claude AI family. This major upgrade follows closely on the heels of OpenAI's release of GPT-4o, a natively multimodal large language model (LLM) that recently claimed the top spot in the LMSys chatbot arena.

Claude 3.5 Sonnet is positioned as a mid-range model, sitting between Haiku, the small model designed for efficient tasks, and Opus, the high-tier model that powers Anthropic's paid version, priced at $20 per month. Right now, Haiku and Opus are only offered in Version 3.0, making Claude 3.5 Sonnet Anthropic's best model in terms of capabilities, knowledge, and efficiency.

Anthropic claims its new model beats GPT-4o in almost all synthetic benchmarks, especially when using multi-shot prompt techniques—providing more than one example, essentially.
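To make "multi-shot" concrete, here is a minimal sketch of how such a prompt could be assembled: several worked examples are placed ahead of the real question so the model can infer the expected pattern. The function name and the `Q:`/`A:` layout are assumptions for illustration, not the format Anthropic used in its benchmarks.

```python
def build_multishot_prompt(examples, question):
    """Place several worked examples ahead of the real question."""
    parts = []
    for example_question, example_answer in examples:
        parts.append(f"Q: {example_question}\nA: {example_answer}")
    # The final question is left unanswered for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical examples; any task with clear input/output pairs would do.
examples = [
    ("What is 2 + 2?", "4"),
    ("What is 7 + 5?", "12"),
    ("What is 10 + 15?", "25"),
]
prompt = build_multishot_prompt(examples, "What is 8 + 9?")
print(prompt)
```

A single-shot prompt would pass one example in the list; a zero-shot prompt would pass none.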

These synthetic benchmarks measure a model’s performance in different areas. By setting a standard set of conditions and tests, it is possible to obtain a quantitative value for a qualitative variable. In other words, these benchmarks don’t just say which model looks or is better at a task; they say how much better a model is in a measurable way.

In terms of performance, Anthropic says Claude 3.5 Sonnet operates at twice the speed of the previous top-tier model, Claude 3 Opus, delivering more power while costing only one-fifth as much. This makes it an ideal choice for complex tasks such as context-sensitive customer support and specialized tasks that require a lot of back and forth interactions with the model.

Its creators say it also demonstrates a marked improvement in understanding nuance, humor, and complex instructions compared to its predecessors.

Claude 3.5 Sonnet also offers advanced visual processing and understanding capabilities. It is particularly adept at interpreting charts and graphs and at transcribing text from imperfect images, Anthropic says. Now, the firm's top model can understand the context of a visual prompt instead of just describing things. This puts it in direct competition with ChatGPT and Reka in terms of multimodal capabilities.

For example, we provided Claude a map and asked what we could do in that location. It figured out that the map was of Chicago and gave us some relevant recommendations, like using public transportation instead of taxis, or visiting Wicker Park, Lincoln Park, and Hyde Park.

The model also delivers advanced coding capabilities. It can independently write, edit, and execute code with sophisticated reasoning and troubleshooting, according to Anthropic—given the relevant tools. This feature makes it effective for streamlining developer workflows and accelerating coding tasks.

One new feature introduced with Claude 3.5 Sonnet is "Artifacts." This allows users to see, edit, and build upon the content Claude generates in real-time. It integrates AI-created outputs directly into projects and workflows, which makes it particularly useful for interacting with code and gives Claude a more polished user interface than traditional chatbots like ChatGPT or Reka.

Anthropic expects to release the Haiku and Opus versions of Claude 3.5 later this year. If Sonnet can challenge GPT-4o, Opus could potentially become a solid competitor to future GPT iterations, such as the hypothetical GPT-5.

Claude 3.5 Sonnet vs. ChatGPT-4o

Overall, both models have demonstrated impressive capabilities, but how do they fare when pitted against each other in various tasks? Let's explore their performance in coding, creative writing, and professional tasks.

Ease of Use and Accessibility

Claude 3.5 Sonnet currently has some limitations in handling heavy user traffic and extended interactions. The free version of Claude offers users a more restricted experience, with a smaller token context and fewer available prompts compared to its paid version. This is especially true if users analyze long documents or work with code.

ChatGPT's free version provides users with a more generous allocation of tokens and prompts, allowing for longer and more complex interactions without the need for a paid upgrade. OpenAI offers a paid “Plus” subscription too, but free users take longer to reach the limit before being asked to upgrade.

Winner: ChatGPT wins this round. Its free version offers greater capacity and accessibility, making it more user-friendly for those unwilling or unable to pay for premium AI services. Claude's approach seems designed to encourage users to upgrade to a paid tier, which may be a barrier for some users.

Coding Capabilities

We tested coding abilities by asking both models to create a game. Rather than asking them to reproduce well-known games that could be part of their training datasets, we came up with a game that measures the reaction time of two players.

Prompt:
I want to create a game. Two players play against each other on the same computer. One is in control of the letter L and the other controls the letter A. We have a field divided by two with a line. Each player controls 50% of the field. The player who controls A controls the left half and the one who controls L controls the right half.

At a random moment, the line will move towards either the left or the right. The player who is losing ground must press the button as fast as possible to prevent the line from moving anymore. When that's done, the line will stay in place and players will have to wait until the line starts to move at a random moment to a random location.

The player who ends up controlling 0% of the screen loses and the game ends. Write it in Python or HTML5. The one you think works better.
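For readers who want to see the rules in the prompt spelled out, here is a minimal sketch of the core game logic, without any graphics or timing. It is our own illustration and not the code either model produced; the class name, the `boundary` percentage, and the step size are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class ReactionGame:
    boundary: float = 50.0  # percent of the field held by player A (left side)
    direction: int = 0      # -1: line moving left, +1: moving right, 0: stopped

    def start_move(self, rng=random):
        """Start the line moving in a random direction."""
        self.direction = rng.choice([-1, 1])

    def tick(self, step=1.0):
        """Advance the line one step while it is moving."""
        moved = self.boundary + self.direction * step
        self.boundary = min(100.0, max(0.0, moved))

    def press(self, key):
        """Stop the line only if the player losing ground pressed their key."""
        losing_key = "A" if self.direction == -1 else "L"
        if self.direction != 0 and key.upper() == losing_key:
            self.direction = 0
            return True
        return False

    def loser(self):
        """Return the player left with 0% of the field, or None."""
        if self.boundary <= 0.0:
            return "A"
        if self.boundary >= 100.0:
            return "L"
        return None


game = ReactionGame()
game.direction = -1      # force a leftward move for the demo
for _ in range(10):
    game.tick()
print(game.boundary)     # 40.0
print(game.press("L"))   # False: L is gaining ground, so the key is ignored
print(game.press("A"))   # True: A is losing ground and stops the line
```

A full game would wrap this in an event loop that calls `start_move()` after a random delay and `tick()` on every frame, which is where the pace of the game is decided.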

Claude 3.5 Sonnet excelled. It not only delivered the game as specified but also took the initiative to incorporate a basic yet functional graphic interface with visual cues to make the game easier to understand.

Claude completed this task swiftly, showcasing enhanced coding capabilities in less than 10 seconds.

ChatGPT also managed to create the game, adhering to the given specifications. However, it took longer to generate the code (nearly 45 seconds) and did not include additional features like the text clues to make the game easier to understand.

Also, the game’s pace is way slower, which defeats the purpose of a reaction game—and the “Game Over” popup doesn't say who won.

Winner: Claude 3.5 Sonnet wins. Its ability to quickly generate more comprehensive and feature-rich code, including unprompted extras like a graphic interface, demonstrates superior coding capabilities.

Also, its “Artifacts” feature proved very handy, making it possible to test the code in the chatbot’s interface without having to copy and paste the code into an external tool—which is how ChatGPT works.

Creative Writing

We asked both models to create a fictional story based on a specific idea. We wanted to test how creative the models were, how rich and engaging their stories were, and how good they were overall for creative writers.

Prompt:

Write a short story about Jose Lanz, a time traveler from the year 2150 who journeys back to the year 1000. Ensure that your narrative is rich in vivid descriptive language, and that Jose's cultural background and physical characteristics are authentically portrayed, regardless of what you choose them to be.

The core of your story should revolve around the time travel paradox and the futility of attempting to solve or alter a problem in the past with the intention of changing one's current timeline. Emphasize the irony that the future exists as it does precisely because the past is what it is. Despite Jose's intentions to influence events in the year 1000, the actions he takes are destined to occur because they are necessary for the year 2150 to exist as it does. The realization of this paradox is a pivotal moment in the story.

Claude 3.5 Sonnet produced a narrative that exhibited a natural flow of language and an engaging structure. The AI skillfully incorporated complex concepts like the time travel paradox, creating a rich, nuanced tale that took creative risks.

In its version, the protagonist tries to prevent the development of a mathematical concept that led to catastrophic consequences in his time. After integrating with the society of the researchers and seemingly preventing the concept's development, he returns to find that he was actually a key part of the time paradox he created, even finding references to himself in ancient writings.

ChatGPT generated a story that adhered to the given guidelines but followed a more predictable path. While competent, its narrative lacked the depth and creative flair displayed by Claude's story.

GPT-4o produced a straightforward story where the protagonist attempts to prevent an energy crisis by sharing advanced teachings with a shaman from the past. However, upon returning to his timeline, he finds that history has repeated itself, and nothing has changed.

Winner: Claude wins in creative writing. Its ability to produce more imaginative, nuanced, and well-structured narratives sets it apart, making it a superior choice for tasks requiring creative prowess.

For example, it’s easier to conceive how integrating into a society may influence a group of researchers and prevent them from discovering something. By contrast, sharing advanced knowledge with a shaman makes less sense as a way to prevent an energy crisis.

Summarization and Analysis

When presented with a 42-page IMF report, ChatGPT accepted the whole document with no problems. Claude, on the other hand, threw up an error, saying the PDF was too long. We cut it to 31 pages, which was enough to be accepted in the Pro version. (The free version is capable of analyzing only around 25 pages.)
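An alternative to trimming a report is to split it into consecutive page ranges that each fit under the cap and submit them one at a time. The sketch below only computes those ranges. The function name is hypothetical, and treating 31 and 25 pages as hard limits is an assumption based on what worked in our test, not a documented figure.

```python
def split_pages(total_pages, max_pages):
    """Return inclusive (first, last) page ranges of at most max_pages each."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(split_pages(42, 31))  # [(1, 31), (32, 42)]
print(split_pages(42, 25))  # [(1, 25), (26, 42)]
```

Splitting this way keeps the whole report available to the model, at the cost of summarizing each part separately.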

Limitations aside, Claude 3.5 Sonnet provided a competent analysis of the shortened document, accurately extracting key points and verbatim quotes without hallucinations. That is already a major improvement over Claude 3, which was prone to fabricating information. However, its quotes were vague and not as relevant as the ones picked by ChatGPT.

ChatGPT impressed by handling the entire 42-page document without truncation. It offered a more comprehensive breakdown, providing a wealth of relevant information.

Its use of bullet points to emphasize key elements, followed by a summary of each section, was more useful than Claude's approach, which produced an unstructured summary that missed key elements of the report.

ChatGPT also demonstrated a strategic approach, focusing on the report's summary and conclusions to effectively distill key points. It's a solid way to get a rough understanding of extensive research before in-depth analysis.

Winner: ChatGPT takes the lead in summarization and analysis. Its ability to process longer documents in their entirety, coupled with its comprehensive and strategic approach to summarization, makes it more suitable for academic research and professional analysis tasks.

Additional Features

Claude 3.5 Sonnet introduces "Artifacts," a feature that allows users to see, edit, and build upon AI-generated content in real-time. This integration of AI outputs directly into projects and workflows enhances user interaction, particularly with code.

ChatGPT Plus offers the ability to build custom GPTs for specific tasks, a feature currently unavailable with Claude. This customization option provides added versatility in professional and academic settings. It also integrates the DALL-E 3 image generator, which is quite useful for generating images using natural language.

Winner: ChatGPT wins in terms of additional features. While Claude's "Artifacts" feature offers unique real-time interaction capabilities, ChatGPT's custom GPT option provides valuable flexibility. Determining the more valuable features would depend on the specific needs of the user, but GPTs can help a broad variety of users. ChatGPT can also create images, which is another advantage over Claude.

Conclusion

Claude 3.5 Sonnet shines in tasks requiring creativity, nuanced language use, and efficient coding. Its ability to grasp and implement complex instructions sets it apart, particularly in creative endeavors and coding tasks.

ChatGPT proves its mettle in handling extensive texts and conducting detailed analyses. Its capacity to process and synthesize large volumes of information makes it a powerful tool for academic research and professional analysis. It also offers more generous free access.

Both models are very capable. However, if you are considering upgrading to a paid tier, ChatGPT may be the best choice for the majority of people given its additional feature set. The exception would be if you work with creative writing or coding, where Claude is the undisputed king, by far.

You could pay for the model that is better for your specific needs and use the free version of the other for different tasks. However, if you’re short on cash and are not a power user, it's great that OpenAI and Anthropic are offering their top-tier models for free.

Edited by Ryan Ozawa.
