What Meta's Llama 4 enterprise push means for UK buyers after the first month of real testing

25 April 2026 | By Ashley Marshall

After a month of real-world UK enterprise testing, Llama 4 Maverick has emerged as a credible peer to GPT-5.3 on code and structured reasoning, while Llama 4 Scout's 10 million token context is genuinely useful only on a narrow band of long-document tasks. Procurement teams should treat the two as separate decisions and evaluate them against their own data, not vendor benchmarks.

One month in, Llama 4 is not a single bet. It is two very different models with different ideal jobs, sold under the same brand, and most UK enterprise teams are choosing the wrong one.

The first month: what UK enterprise testing actually shows

The first time we ran Llama 4 Maverick against a real enterprise contract pile, the result was unspectacular and slightly boring, which is exactly what UK buyers should want from a model they are about to trust with revenue. After roughly four weeks of structured testing across legal, finance and operations workloads, the picture is clearer than the launch noise suggested. Maverick is a credible peer to GPT-5.3 on code-shaped tasks, slightly behind on raw reasoning, and well ahead of older open-weight options on long documents. Scout is a different proposition again. It is built for breadth, with a 10 million token context window that shifts what is possible on document review, but it sheds accuracy as you push it.

The early benchmarks bear that out. Box's enterprise content evaluation reported Maverick scoring 85 to 92 per cent on complex clause extraction where Scout dropped to between 45 and 70 per cent. That gap is not academic. For a UK insurer or a mid-market law firm, the difference between 92 per cent and 70 per cent is the difference between a useful drafting assistant and a tool that quietly seeds errors into client work. The headline lesson from month one is that Llama 4 is not one decision. It is two distinct models with different ideal jobs, sold with the same brand and roughly the same marketing language. Treat them as separate procurement choices, evaluate them on your own data, and ignore most of the LinkedIn enthusiasm. The teams getting value from Llama 4 in April 2026 are the ones that built a small, honest evaluation harness in March and let the results pick the model.

Where Maverick earns its keep

Maverick is the workhorse of the Llama 4 line and the model most UK enterprise teams will end up running. The architecture is mixture of experts: 17 billion active parameters drawn from a 400 billion parameter pool of 128 experts, so per-token compute is roughly the same as Scout's despite the bigger pool, although the full set of weights still has to sit in memory. On HumanEval, the standard code generation benchmark, Maverick scored 91.2 per cent, narrowly ahead of GPT-5.3 at 89.7 per cent. That is not a rounding error. For development teams already using Copilot or Cursor, it is the first credible argument that an open-weight model can carry production code workloads without a quality drop. The reasoning gap is real but small, with Maverick trailing GPT-5.3 by one to two percentage points on MMLU-Pro and GPQA Diamond. The LMSYS Chatbot Arena Elo of 1417 puts it ahead of GPT-4o and Gemini 2.0 Flash.

Where Maverick really earns its keep in early UK testing is fine-tuning. Domain teams report 22 per cent accuracy gains on clause extraction in legal work, 25 per cent on entity extraction in finance, and 19 per cent on medical entity recognition in healthcare deployments. Those are not vendor numbers; those are organisations measuring against their own historical baseline. What this means in practice is that Maverick is the right pick for any task where the cost of a wrong answer is meaningful. Contract review, regulatory analysis, fraud signal generation, internal audit support and engineering code review all sit comfortably in its zone. If you are evaluating against Anthropic Claude 3.7 or GPT-5.3 for those workloads, Maverick deserves a place in the test, especially if you have any pressure to keep data on UK soil or to fine-tune on sensitive corpora that you would rather not ship to a US-hosted API.

Scout: long context is not the same as long memory

Scout is the model that gets the most attention and the most disappointment. The 10 million token context window is real and unprecedented in an open-weight model, but the early enterprise reports tell a more careful story. Retrieval accuracy degrades to around 89 per cent at the top of that context window, and drops further on multi-step reasoning that depends on connecting facts spread across that range. Box's testing showed Scout at 45 to 70 per cent accuracy on complex conditional clause extraction, well behind Maverick. The lesson from month one is straightforward. Scout is not a substitute for retrieval-augmented generation; it is a substitute for the chunking and embedding step on a narrow set of long-document tasks where you genuinely need to see the whole text at once.

Use cases where that matters: full-corpus regulatory reviews, multi-year case bundles, long-form research synthesis, and document-heavy diligence work. Use cases where it does not: question answering over a knowledge base of moderate-size documents, customer support, structured data extraction, and most chat workloads. For those, Scout offers no real advantage over Llama 3.3 70B or Mistral Large, and you give up reasoning depth for the privilege. The pricing reflects that positioning. Together AI lists Scout at 0.10 dollars per million input tokens and 0.30 dollars per million output, against Maverick at 0.20 and 0.60. The cost saving is real but you should not pick Scout just because it is cheaper. Pick it when the long context is doing actual work, and Maverick when reasoning quality matters more. What this means in practice is that Scout earns a place in your stack as a specialised tool, not a general one, and most UK teams will end up running both behind a router that sends each query to the right model for the job.
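A router of that kind can be very small. The sketch below is illustrative only: the Query fields, the choose_model and estimate_cost names and the 200,000 token cut-off are assumptions made for the example, not anything Meta or Together AI publishes. The prices are the Together AI list prices quoted above.

```python
from dataclasses import dataclass

# US dollars per million tokens, as quoted in the article for Together AI.
PRICES = {
    "scout": {"input": 0.10, "output": 0.30},
    "maverick": {"input": 0.20, "output": 0.60},
}

# Hypothetical cut-off: prompts above this size count as long-document work.
LONG_CONTEXT_THRESHOLD_TOKENS = 200_000


@dataclass
class Query:
    prompt_tokens: int
    needs_whole_document: bool  # True when the task cannot sensibly be chunked


def choose_model(query: Query) -> str:
    """Send very long, whole-document prompts to Scout; everything else to Maverick."""
    if query.needs_whole_document and query.prompt_tokens > LONG_CONTEXT_THRESHOLD_TOKENS:
        return "scout"
    return "maverick"


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one request at the listed per-million-token prices."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
```

In a real deployment the routing rule would come out of your own evaluation results rather than a fixed token threshold, but the shape stays the same: decide first, then price the call.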

The licence and compliance picture for UK organisations

The Llama 4 Community Licence is permissive in commercial terms but still requires care in UK enterprise procurement. The headline point is that royalty-free commercial use is granted to organisations with fewer than 700 million monthly active users, which covers essentially every UK enterprise. Attribution is required: products and services using Llama must include a "Built with Llama" notice in some form. The geographical restriction in the Acceptable Use Policy applies to the multimodal models from Llama 3.2 onwards, including Llama 4, but only to companies headquartered in the European Union. UK-headquartered businesses are not caught by that clause, although counsel should confirm the position for any group whose contracting entity is an EU subsidiary.

Where UK buyers need to do real work is the wider compliance picture. The UK has not introduced an AI Act and the government's October 2025 AI Growth Lab consultation, which closed on 2 January 2026, points to a sandbox-led, sector-specific approach rather than horizontal rules. The Information Commissioner's Office 2025 to 2026 action plan includes statutory guidance on automated decision-making and consultation on updated profiling rules. For practical purposes, that means a UK enterprise deploying Llama 4 should focus on data protection impact assessments under UK GDPR, transparency records for training data and prompt content, and clear human oversight on any automated decisions affecting individuals. If your organisation also deploys into the EU, the AI Act has extra-territorial reach, and high-risk classifications are likely on most enterprise use cases. The widely missed point in March commentary was that picking an open-weight model does not lower the compliance burden; in some respects it raises it, because you carry more direct responsibility for what the deployed system does. That is a feature for organisations that want sovereignty, but it is also a cost to budget for.

Self-hosting versus API: the unit economics most teams get wrong

The most common mistake in early Llama 4 evaluations is a flawed cost comparison. UK teams used to OpenAI bills look at Together AI's price of 0.20 dollars per million input tokens, multiply by their volumes, and conclude self-hosting is too much hassle. The maths only starts to favour self-hosting at scale. The published breakeven thresholds put Scout self-hosting in the money at around 500 million to 1 billion tokens per month, and Maverick at 1 to 2 billion. That sounds high but is reachable for any organisation running a customer support copilot at moderate scale, an internal knowledge assistant for a mid-sized firm, or production code review across an engineering team of 100 people. At 100 billion tokens per month, the published economics put Scout self-hosting at roughly 12,000 dollars monthly against 150,000 to 200,000 dollars for proprietary APIs.
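The arithmetic behind those figures is easy to reproduce. The sketch below converts the monthly totals quoted above into a price per million tokens and shows how a breakeven volume falls out of a fixed self-hosting cost. The function names are our own, and the 1,500 dollar fixed monthly cost is a hypothetical placeholder chosen only to illustrate the calculation, not a quoted figure.

```python
TOKENS_PER_MILLION = 1_000_000


def implied_price_per_million(monthly_cost: float, monthly_tokens: int) -> float:
    """Convert a monthly bill into dollars per million tokens."""
    return monthly_cost / (monthly_tokens / TOKENS_PER_MILLION)


def breakeven_tokens(fixed_monthly_cost: float,
                     api_price_per_million: float,
                     selfhost_price_per_million: float = 0.0) -> float:
    """Monthly token volume at which self-hosting costs the same as the API."""
    saving_per_million = api_price_per_million - selfhost_price_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_cost / saving_per_million * TOKENS_PER_MILLION


# Figures quoted in the article for 100 billion tokens per month.
VOLUME = 100_000_000_000
selfhost_rate = implied_price_per_million(12_000, VOLUME)    # about 0.12 dollars
api_rate_low = implied_price_per_million(150_000, VOLUME)    # about 1.50 dollars
api_rate_high = implied_price_per_million(200_000, VOLUME)   # about 2.00 dollars

# HYPOTHETICAL fixed cost (hardware plus operations), for illustration only.
HYPOTHETICAL_FIXED_MONTHLY_COST = 1_500.0
example_breakeven = breakeven_tokens(HYPOTHETICAL_FIXED_MONTHLY_COST, api_rate_low)
```

Swap in your own hardware, staffing and API quotes before drawing any conclusion; the breakeven volume moves in direct proportion to the fixed cost you enter.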

Hardware tells the rest of the story. Scout in INT4 quantisation runs on a single H100 80GB at around 55GB of VRAM, with throughput in the range of 40 to 75 tokens per second. Maverick needs roughly 200GB in INT4, meaning three H100s minimum, and delivers 30 to 55 tokens per second. Add 0.5 to 1 full-time engineer for production operations, plus monitoring, observability and rollback tooling, and the total cost of ownership is real. What this means in practice is that most UK organisations should start with a managed deployment on AWS Bedrock or Azure AI, prove the use case, then move to self-hosting only when volume, data sensitivity or unit economics make the case. Skipping that order is how teams burn six months on infrastructure that never serves real users. The 60 to 80 per cent saving that gets quoted in vendor decks is real, but only at the volumes where it applies, and only after you have built the operational muscle to keep an open-weight model healthy in production.
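The card counts above follow from simple division, sketched here as a sizing check. The helper names are our own, and the calculation deliberately ignores KV cache and activation headroom, which grow with context length and will push real deployments above these minimums.

```python
import math

H100_VRAM_GB = 80  # per-card memory of the H100 80GB named in the article


def gpus_needed(model_vram_gb: float, gpu_vram_gb: float = H100_VRAM_GB) -> int:
    """Minimum number of cards needed to hold the quantised weights."""
    return math.ceil(model_vram_gb / gpu_vram_gb)


def seconds_to_generate(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time for a single response at a given decode throughput."""
    return output_tokens / tokens_per_second


scout_gpus = gpus_needed(55)      # INT4 Scout at around 55GB
maverick_gpus = gpus_needed(200)  # INT4 Maverick at roughly 200GB

# A 1,000-token answer at the low end of each quoted throughput range.
scout_latency = seconds_to_generate(1_000, 40)
maverick_latency = seconds_to_generate(1_000, 30)
```

Run against the article's figures, this gives one card for Scout and three for Maverick, matching the minimums stated above.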

What to put on your evaluation roadmap before signing anything

A serious Llama 4 evaluation should run for two to three weeks before any procurement decision. The first task is honest benchmark construction. Pick 100 to 300 examples from your own data, label the right answers, and grade outputs against that set. Generic benchmarks like MMLU and HumanEval matter for vendor screening but they tell you almost nothing about how the model will behave on your particular contracts, support tickets or claims. The second task is to scope the question. Are you replacing a current model, augmenting a workflow, or building something new? Each implies a different decision criterion. Replacement is about parity at lower cost. Augmentation is about marginal lift over the current state. Greenfield is about whether the model is good enough at all.
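The harness for that first task does not need to be elaborate. The sketch below is a minimal version under stated assumptions: the model is any callable that maps a prompt to a string, grading is a normalised exact match, and stub_model plus the two examples are invented stand-ins so the code runs without an API key. Real contract or claims data will usually need a more forgiving grader.

```python
from typing import Callable, Iterable, NamedTuple


class Example(NamedTuple):
    prompt: str
    expected: str


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def accuracy(model: Callable[[str], str], examples: Iterable[Example]) -> float:
    """Fraction of examples where the model output matches the labelled answer."""
    examples = list(examples)
    if not examples:
        raise ValueError("evaluation set is empty")
    correct = sum(
        normalise(model(ex.prompt)) == normalise(ex.expected) for ex in examples
    )
    return correct / len(examples)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own client."""
    return "30 days" if "notice" in prompt else "unknown"


# Invented examples for illustration; use 100 to 300 from your own data.
EXAMPLES = [
    Example("What is the notice period in clause 4?", "30 Days"),
    Example("Which law governs the agreement?", "England and Wales"),
]

score = accuracy(stub_model, EXAMPLES)
```

Because the model is just a callable, the same set can be run against Maverick, Scout and your incumbent model, which gives the replacement, augmentation and greenfield questions a common number to argue over.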

Third, plan the safety stack. Llama Guard 4, LlamaFirewall and Llama Prompt Guard 2 are part of the picture, alongside Nvidia NeMo Guardrails or your own programmable guardrails. Recent industry survey data shows 49 per cent of organisations using prompt adjustments and 47 per cent using safeguard models in their open-source AI deployments. Treat that as a floor, not a ceiling. Fourth, address the counterargument. The standard objection from CIOs is that managed proprietary models offer fewer surprises and a tidier audit story. That is true. The counter is that you concede control over data, latency, model lifecycle and unit cost. Open-weight models exist precisely so you do not have to make that trade. Finally, document your evaluation. The teams that will get value from Llama 4 over the next year are the ones with a written, measurable record of why they picked the model they picked, what they expected, and what they observed. That record is also the foundation of any future ICO or audit conversation, and it is the artefact that converts a tooling decision into an organisational capability.
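One way to keep that record measurable is to store it as structured data alongside the evaluation set. The sketch below is one possible shape, not a prescribed format: the EvaluationRecord fields are our own choice, and the values in the example are hypothetical placeholders rather than observed results.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvaluationRecord:
    model: str
    use_case: str
    decision_type: str        # "replacement", "augmentation" or "greenfield"
    examples_graded: int
    expected_accuracy: float  # what the team expected before testing
    observed_accuracy: float  # what the harness measured
    rationale: str            # why this model was picked

    def met_expectation(self) -> bool:
        """True when the measured result reached the stated expectation."""
        return self.observed_accuracy >= self.expected_accuracy

    def to_json(self) -> str:
        """Serialise the record so it can be versioned with the evaluation set."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# HYPOTHETICAL values, shown only to illustrate the record's shape.
record = EvaluationRecord(
    model="llama-4-maverick",
    use_case="contract clause extraction",
    decision_type="replacement",
    examples_graded=200,
    expected_accuracy=0.85,
    observed_accuracy=0.90,
    rationale="Parity with the incumbent model at lower unit cost.",
)
```

Writing the expectation down before the run is the useful discipline here; the comparison only means something if the target was fixed in advance.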

Frequently Asked Questions

Should we use Llama 4 Maverick or Scout for our customer support copilot?

Maverick. Scout's long context advantage does not help on short-turn conversations, and Maverick's stronger reasoning produces fewer wrong-tone or wrong-fact replies. The cost difference at typical support volumes is small enough that quality should win.

Is Llama 4 actually free to use commercially in the UK?

Yes for almost every UK enterprise. The Community Licence permits royalty-free commercial use by organisations with fewer than 700 million monthly active users and requires a "Built with Llama" attribution. UK-headquartered businesses are not caught by the EU geographical restriction on multimodal models.

How does Llama 4 Maverick compare with GPT-5.3 on code generation?

Maverick scored 91.2 per cent on HumanEval against GPT-5.3 at 89.7 per cent. On real engineering work, expect rough parity, with GPT-5.3 still slightly ahead on multi-file reasoning and Maverick stronger on isolated function generation and code explanation.

What is the cheapest way to start testing Llama 4 in the UK?

Together AI, AWS Bedrock or Azure AI for managed access, with starting prices around 0.10 dollars per million input tokens for Scout and 0.20 for Maverick. Run a two to three week evaluation against your own data before any infrastructure investment.

Do we need GDPR-style impact assessments for an open-weight Llama 4 deployment?

Yes for any use case affecting individuals. UK GDPR DPIA requirements apply regardless of whether the model is open-weight or proprietary. Open-weight deployments often raise the documentation burden because you control more of the system end to end.

Is Scout's 10 million token context window actually useful in production?

For a narrow set of tasks: full-corpus regulatory reviews, long case bundles, multi-document diligence. For general retrieval and chat, it offers no real advantage over a properly built RAG pipeline, and retrieval accuracy degrades at the top of the window.

When is self-hosting Llama 4 actually worth it?

Above roughly 500 million to 1 billion tokens per month for Scout and 1 to 2 billion for Maverick, assuming you can dedicate 0.5 to 1 engineer to production operations. Below those volumes, managed APIs are simpler and cheaper.

What safety tooling should sit alongside Llama 4 in production?

Meta's Llama Guard 4, LlamaFirewall and Llama Prompt Guard 2 are a credible baseline. Most enterprise deployments add programmable guardrails, such as Nvidia NeMo, plus prompt-level adjustments. Treat the safety stack as a separate engineering exercise from model selection.