Curvestone’s Approach to LLM Benchmarking

Posted by WorkflowGPT
For Azure OpenAI, Google Vertex AI, and Amazon Bedrock (including GPT-4o, GPT-4o mini, Gemini 1.5 Pro, Gemini 1.5 Flash, and Claude 3.5 Sonnet)

Since the release of ChatGPT in November 2022, the adoption of generative AI, particularly Large Language Models (LLMs), has surged. Enterprises, developers, and the general public are leveraging these innovations for their utility and transformative potential. The escalating demand for generative AI products has spurred significant advancements in infrastructure and development within the field.

However, with an abundance of options now available, it can be challenging to discern the most reliable and relevant information to make an informed decision about which model and provider to use. Numerous LLM benchmark analyses, such as Vellum, the LMSYS Chatbot Arena Leaderboard, or Artificial Analysis’ LLM Leaderboard (covering graduate-level reasoning, mathematical reasoning, code generation, and more), point towards the same three flagship models. Yet understanding which of these models is best suited to you is not straightforward.

Adding to the complexity, when shipping a generative AI application into production, academic benchmark results are only one item on a long list of technical requirements that need to be evaluated. These considerations include:

  • Data residency control (i.e., region control)
  • Trust/adoption of hosting provider
  • Data caching (ability to turn off)
  • Data privacy & security
  • Content moderation (existence of and control over)
  • Quotas - tokens and requests per minute (default and extendable)
  • Regional availability
  • Cost - what are the billing units, and does cost scale with request size
  • Speed - time to first token and tokens per second (see the measurement sketch after this list)
  • Reliability
  • Access to tokeniser
  • Ease of integration with LLM libraries (e.g., LangChain)
  • Time before the latest models are available
  • Access to and helpfulness of account managers & technical support

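On the speed point in particular, time to first token and tokens per second are straightforward to measure yourself with a streaming call. Below is a minimal sketch using the OpenAI Python SDK against an Azure OpenAI deployment; the endpoint, key, deployment name, and API version are placeholders to substitute with your own, and the same timing logic carries over to any provider's streaming API.

```python
import time

from openai import AzureOpenAI  # pip install openai

# Placeholder endpoint, key, and API version -- substitute your own values.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-API-KEY",
    api_version="2024-06-01",
)

def measure_speed(prompt: str, deployment: str = "gpt-4o") -> dict:
    """Time to first token and rough tokens/second for one streamed completion."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        temperature=0,
        stream=True,
    )
    for chunk in stream:
        # Some chunks (e.g. content-filter metadata) carry no text; skip them.
        if not chunk.choices or chunk.choices[0].delta.content is None:
            continue
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # each streamed chunk is roughly one token

    total = time.perf_counter() - start
    ttft = (first_token_at - start) if first_token_at else None
    generation_time = (total - ttft) if ttft is not None else 0.0
    return {
        "time_to_first_token_s": ttft,
        "tokens_per_second": chunks / generation_time if generation_time > 0 else None,
    }

print(measure_speed("Summarise the key duties of a data controller under the GDPR."))
```
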
That leaves you with a long list of security, performance, quota, pricing, and other homework to do.

So What? Why Should Law Firms Care?

For technical leaders of law firms, this should make one thing abundantly clear: don’t take a path that locks you into using LLMs from a single provider.

Even in the industry’s infancy, it is becoming apparent that different LLMs have distinct pros and cons, so deciding how to work with LLMs isn't just a technical decision; it's a strategic one.

The right LLM, adopted well, can significantly enhance your firm's efficiency, client service, and competitive edge. For example:

  1. Enhanced Document Review and Analysis: Faster and more reliable LLMs can streamline document review, giving your legal team more flexibility in how it applies AI to reviews.
  2. Improved Client Communication: With robust content moderation and data privacy controls, you can ensure secure and compliant client interactions based on LLMs that perform best in that domain.
  3. Cost Efficiency: Understanding the cost implications, including quotas and scalability, helps with budgeting and avoiding unexpected expenses, and opens the door to fine-tuning smaller models to save on cost (a rough comparison is sketched after this list).
  4. Regional Compliance: Data residency controls ensure compliance with regional data protection laws, a critical factor for law firms handling sensitive information.
  5. Seamless Integration: Easy integration with existing LLM libraries and systems means quicker deployment and reduced downtime.
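
To make the cost point concrete, a back-of-the-envelope estimate is usually enough to compare a flagship model with a smaller one for a given workload. All figures in the sketch below are placeholders, not quoted prices; substitute the per-token rates and request volumes that apply to your firm.

```python
# Back-of-the-envelope monthly cost comparison. Every number below is a
# placeholder -- substitute current rates from your provider's pricing page.
requests_per_month = 50_000
input_tokens_per_request = 3_000   # e.g. a clause plus surrounding context
output_tokens_per_request = 500

models = {
    # (USD per 1K input tokens, USD per 1K output tokens) -- illustrative only
    "flagship-model": (0.005, 0.015),
    "small-fine-tuned-model": (0.0005, 0.0015),
}

for name, (price_in, price_out) in models.items():
    monthly_cost = requests_per_month * (
        input_tokens_per_request / 1_000 * price_in
        + output_tokens_per_request / 1_000 * price_out
    )
    print(f"{name}: ~${monthly_cost:,.0f} per month")
```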

To address this challenge, Curvestone has evaluated the speed and reliability components for three major cloud providers: Microsoft Azure, Google Cloud, and Amazon Web Services. For each provider, we assessed their flagship models (GPT-4o, Gemini 1.5, and Claude 3.5), focusing on response times across European and US regions.

We established a uniform configuration for our requests, sending them to each model on each provider once per hour. Our requests include one text generation prompt and one prompt asking the LLM to describe the contents of an image, restricted to 200 and 1,000 output tokens respectively, both at a temperature of 0.
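
As a rough illustration of what that uniform configuration looks like in code, the sketch below shows one hourly measurement cycle. The probe prompts, the `send_fn` callable, and the provider list are simplified placeholders rather than our production harness, which records the same fields for every provider, model, and region.

```python
import time
from dataclasses import dataclass

@dataclass
class BenchmarkRequest:
    """One standardised probe sent to every model on every provider."""
    name: str
    prompt: str
    max_output_tokens: int
    temperature: float = 0.0

# The two probes described above; prompt wording here is illustrative only.
PROBES = [
    BenchmarkRequest("text", "Explain the concept of force majeure in two paragraphs.", 200),
    BenchmarkRequest("image", "Describe the contents of the attached image.", 1_000),
]

def run_hourly_cycle(send_fn, providers: list[str]) -> list[dict]:
    """Send every probe to every provider once and record end-to-end latency."""
    results = []
    for provider in providers:
        for probe in PROBES:
            start = time.perf_counter()
            send_fn(provider, probe)  # provider-specific API call goes here
            results.append({
                "timestamp": time.time(),
                "provider": provider,
                "probe": probe.name,
                "latency_s": time.perf_counter() - start,
            })
    return results
```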

Doing so provides us with valuable data that enables us to quickly identify which models consistently deliver the fastest responses, how they perform relative to their historical averages, and how their stability fluctuates over time. This is one major piece of the puzzle to consider when shipping generative AI applications into production.
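
For a sense of what that analysis looks like, here is a minimal sketch assuming the hourly results have been collected into a flat table; the file name and column names are illustrative rather than our actual schema.

```python
import pandas as pd

# Assumed flat export of the hourly probes; file and column names are illustrative.
df = pd.read_csv("benchmark_results.csv", parse_dates=["timestamp"])
# columns: timestamp, provider, model, region, probe, latency_s

# Which models consistently deliver the fastest responses, and how stable are they?
summary = (
    df.groupby(["provider", "model", "region"])["latency_s"]
      .agg(mean_latency="mean", p95_latency=lambda s: s.quantile(0.95), stdev="std")
      .sort_values("mean_latency")
)
print(summary)

# How does recent performance compare with the historical average?
df = df.sort_values("timestamp")
df["rolling_24h_mean"] = (
    df.groupby(["provider", "model", "region"])["latency_s"]
      .transform(lambda s: s.rolling(window=24, min_periods=1).mean())
)
```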

You can see the dashboard here:
https://www.workflowgpt.ai/llm-benchmarking-dashboard-workflowgpt


What now?


Choosing the right LLM and provider can transform your law firm's operations, and we have taken the time to thoroughly assess models for their suitability for particular use cases. 

If you want to find out more about the way we assess generative models' readiness, the way we benchmark, or even the prompts we use, reach out to us here. Let us help you make an informed decision that aligns with your strategic goals.


If you’d like to speak with us about how we could help your firm use generative AI in this way, then feel free to book a call with us here.