W&B Weave is an observability and evaluation platform for building reliable LLM applications. Weave helps you understand what your AI application is doing, measure how well it performs, and systematically improve it over time.

Why Weave?

Building LLM applications is fundamentally different from traditional software development. LLM outputs are non-deterministic, which makes debugging harder. Quality is subjective and context-dependent. Small prompt changes can cause unexpected shifts in behavior, and traditional testing approaches fall short. Weave addresses these challenges by providing:
  • Visibility into every LLM call, input, and output in your application
  • Systematic evaluation to measure performance against curated test cases
  • Version tracking for prompts, models, and data so you can understand what changed
  • Feedback collection to capture human judgments and production signals

What you can do with Weave

Debug with traces

Weave automatically traces your LLM calls and shows them in an interactive UI. You can see exactly what went into each call, what came out, how long it took, and how calls relate to each other. Get started with tracing
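
As a rough sketch of what this looks like with the Python SDK (the project and function names below are placeholders), you initialize Weave once and decorate the functions you want traced:

import weave

weave.init("quickstart")  # placeholder project name

# Functions decorated with @weave.op are traced: Weave records their inputs,
# outputs, latency, and how nested calls relate to each other.
@weave.op()
def format_prompt(question: str) -> str:
    return f"Answer concisely: {question}"

@weave.op()
def answer(question: str) -> str:
    prompt = format_prompt(question)  # shows up as a child call in the trace tree
    # A real application would call an LLM here; this stub keeps the sketch runnable.
    return f"(stub answer for: {prompt})"

print(answer("What is Weave?"))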

Evaluate systematically

Run your application against curated test datasets and measure performance with scoring functions. Track how changes to prompts or models affect quality over time. Build an evaluation pipeline
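
A minimal sketch of such a pipeline with the Python SDK, assuming a small inline dataset and a simple exact-match scorer (all names here are illustrative):

import asyncio
import weave

weave.init("quickstart")  # placeholder project name

# Each row of the dataset becomes one evaluation example.
examples = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "What is the capital of France?", "expected": "Paris"},
]

# A scorer receives the model output plus matching fields from the dataset row.
@weave.op()
def exact_match(expected: str, output: str) -> dict:
    return {"correct": output.strip() == expected}

# The function under test; a real application would call an LLM here.
@weave.op()
def my_model(question: str) -> str:
    return "4" if "2 + 2" in question else "Paris"

evaluation = weave.Evaluation(dataset=examples, scorers=[exact_match])
print(asyncio.run(evaluation.evaluate(my_model)))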

Version everything

Weave tracks versions of your prompts, datasets, and model configurations. When something breaks, you can see exactly what changed. When something works, you can reproduce it. Learn about versioning
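
For example, with the Python SDK (the dataset name below is hypothetical), publishing an object under the same name again creates a new version, and references let you fetch a specific one back:

import weave

weave.init("quickstart")  # placeholder project name

# Publishing creates the first version; publishing changed rows under the
# same name later creates the next version.
dataset = weave.Dataset(
    name="qa-examples",
    rows=[{"question": "What is Weave?", "expected": "An observability and evaluation platform"}],
)
weave.publish(dataset)

# Fetch the published object back by name to reproduce a run.
fetched = weave.ref("qa-examples").get()
print(fetched.rows[0]["question"])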

Collect feedback

Capture human feedback, annotations, and corrections from production use. Use this data to build better test cases and improve your application. Collect feedback
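
A sketch of attaching feedback from code with the Python SDK, assuming the op.call() pattern that returns a handle to the logged call (in production, feedback often comes from your application's UI instead):

import weave

weave.init("quickstart")  # placeholder project name

@weave.op()
def answer(question: str) -> str:
    return "Weave is an observability and evaluation platform for LLM apps."

# .call() returns both the result and the logged call, so feedback can be
# attached to that specific trace.
result, call = answer.call("What is Weave?")
call.feedback.add_reaction("👍")                         # emoji reaction
call.feedback.add_note("Good answer; could cite docs.")  # free-text note
call.feedback.add("correctness", {"value": 5})           # custom structured feedback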

Monitor production

Score production traffic with the same scorers you use in evaluation. Set up guardrails to catch issues before they reach users. Set up guardrails and monitors
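
As a rough sketch with the Python SDK, a scorer applied to a live call can act as a guardrail; the attribute holding the scorer's output on the returned object is an assumption here:

import asyncio
import weave

weave.init("quickstart")  # placeholder project name

# The same scorer could also be used inside an Evaluation.
@weave.op()
def is_short_enough(output: str) -> bool:
    return len(output) <= 200

@weave.op()
def generate_reply(question: str) -> str:
    return "Weave traces, evaluates, and monitors LLM applications."

async def handle_request(question: str) -> str:
    reply, call = generate_reply.call(question)
    # Applying the scorer records its result against this call, so the same
    # check also serves as a production monitor in the Weave UI.
    score = await call.apply_scorer(is_short_enough)
    # Guardrail: fall back if the check fails (score.result is assumed to
    # hold the scorer's boolean output).
    return reply if score.result else "Sorry, I can't return that reply."

print(asyncio.run(handle_request("What is Weave?")))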

How Weave fits in your workflow

Weave supports the full LLM application development lifecycle:
Phase     What Weave provides
Build     Trace calls to understand behavior, debug issues, and iterate quickly
Test      Evaluate against datasets with custom and built-in scorers
Deploy    Version prompts and models for reproducible deployments
Monitor   Score production traffic, collect feedback, catch regressions

Supported languages

Weave provides SDKs for Python and TypeScript. Install the Python SDK with pip:
pip install weave
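For TypeScript, install from npm (assuming the SDK is published under the same package name):
npm install weave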
Both SDKs support tracing, evaluations, datasets, and other core Weave features. Some advanced features, such as class-based Models and Scorers, are currently Python-only.

Integrations

Weave integrates with popular LLM providers and frameworks:
  • LLM providers: OpenAI, Anthropic, Google, Mistral, Cohere, and more
  • Frameworks: LangChain, LlamaIndex, DSPy, CrewAI, and more
  • Local models: Ollama, vLLM, and other local inference servers
When you use a supported integration, Weave automatically traces LLM calls without additional code changes. View all integrations
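
For example, with the OpenAI Python client (the model name below is just an example), initializing Weave is enough for the call to be traced:

import weave
from openai import OpenAI

weave.init("quickstart")  # placeholder project name

# Calls made through supported clients are traced automatically once Weave
# is initialized; no decorators are needed.
client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)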

Next steps