Jiwon Min Developer

Production-Grade LLM Application Monitoring: A Complete Guide to Building Observability with LangSmith

This post was generated and edited with the Google Gemini API, then published after operator review. Thumbnails may also be AI-generated.

AI-powered applications, especially those built on LLMs (Large Language Models), often feel like ‘black boxes’ because of their complex internal workings. User prompts go in and plausible results come out, but it is hard to see what happens in between: how much each request costs and where the bottlenecks occur. Traditional server monitoring reports infrastructure signals such as CPU and memory usage, but it does not track the core aspects of LLM applications: ‘quality,’ ‘cost,’ and ‘latency.’

To solve these issues, establishing Observability, a key component of LLMOps (LLM Operations), is essential. LLM observability goes beyond simple logging. It’s an engineering practice that allows detailed tracking of all model requests and responses, granular insight into internal processing, visualization of performance metrics, and collection of user feedback. This enables data-driven improvements to AI applications. Without a good observability system, we’d rely on guesswork to diagnose problems, making cost optimization and performance enhancement nearly impossible.

This post will detail how to fully implement observability for LLM applications in a production environment using LangSmith, the LLM application development platform created by the LangChain development team. My aim is for you to gain practical knowledge for tracing complex RAG (Retrieval-Augmented Generation) pipelines or AI agent executions, analyzing token costs and latency, and automating quality evaluations with LangSmith.

Three Pillars of LLM Application Observability

To fully understand the state of an LLM application, we need three core elements that go beyond traditional monitoring. These are organically connected and provide deep insights into the system’s internal behavior.

1. Tracing

Tracing is a technique for visualizing the entire lifecycle of a single request within an LLM application. It shows the process from user input through all internal components (LLM calls, database queries, API requests, etc.) to the final response, much like a flowchart.

For instance, in a RAG pipeline, tracing allows you to grasp the following at a glance:

  • Input: The question entered by the user.
  • Retriever: Which documents were retrieved from the vector DB using what query?
  • Prompt Generation: How was the final prompt constructed by combining the retrieved context and the question?
  • LLM Call: To which model (e.g., gpt-4-turbo) was the final prompt sent?
  • Output: What was the final answer generated by the model?

This detailed tracing information is crucial for finding the root cause and debugging when asking, “Why did the AI give this answer?”

2. Metrics

Metrics are key indicators that quantitatively measure system performance and cost. For LLM applications, the following metrics are particularly important:

  • Latency: The time taken from request initiation to final response. In particular, Time to First Token (TTFT) directly affects how responsive the application feels to the user.
  • Cost / Token Usage: The number of prompt tokens used and generated tokens for each LLM call. This directly relates to API costs and must be tracked and optimized.
  • Error Rate: The frequency of various error types, such as LLM API call failures, timeouts, or invalid response formats.
  • Throughput: The number of requests processed per unit of time.

Visualizing these metrics in a dashboard and setting threshold-based alerts allows for early detection and rapid response to system anomalies.
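
As a rough illustration of how these four metrics can be derived from per-call records, consider the sketch below. The `CallRecord` fields, the `PRICE_PER_1K` rates, and the threshold names are all assumptions made for the example; real prices depend on your provider and model.

```python
from dataclasses import dataclass
from statistics import quantiles

# Hypothetical USD prices per 1,000 tokens; replace with your provider's rates.
PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}


@dataclass
class CallRecord:
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    ok: bool


def call_cost(r: CallRecord) -> float:
    """Cost of one call, computed from its prompt and completion token counts."""
    return (r.prompt_tokens * PRICE_PER_1K["prompt"]
            + r.completion_tokens * PRICE_PER_1K["completion"]) / 1000


def summarize(records: list[CallRecord], window_s: float) -> dict:
    """Aggregate latency, cost, error rate, and throughput over one time window."""
    if not records:
        raise ValueError("no records to summarize")
    latencies = sorted(r.latency_s for r in records)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = (quantiles(latencies, n=20, method="inclusive")[-1]
           if len(latencies) >= 2 else latencies[0])
    return {
        "p95_latency_s": p95,
        "total_cost_usd": sum(call_cost(r) for r in records),
        "error_rate": sum(not r.ok for r in records) / len(records),
        "throughput_rps": len(records) / window_s,
    }


def breaches(summary: dict, thresholds: dict) -> list[str]:
    """Names of the metrics whose value exceeds the configured threshold."""
    return [name for name, limit in thresholds.items() if summary[name] > limit]
```

A threshold-based alert then reduces to checking whether `breaches(summary, thresholds)` is non-empty for the latest window.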

3. Feedback

Feedback is the most critical means of measuring the ‘quality’ of results generated by the model. Even if an answer is generated quickly and cheaply, it is useless if it is inaccurate or doesn’t align with the user’s intent.

Feedback can be collected in various forms:

  • User Feedback: ‘Like/Dislike’ buttons in a chatbot interface, star ratings, comments, etc.
  • Programmatic Feedback: Evaluating whether the generated answer adheres to a specific format (e.g., JSON) or passes internal business logic validation.
  • Expert Evaluation: Internal operations teams or experts directly assess and score the quality of the answers.

Collected feedback data helps identify problems with specific prompts and serves as a valuable dataset for future model evaluation and fine-tuning.
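
Programmatic feedback is the easiest of the three to automate. The sketch below scores whether an answer is a JSON object containing a set of required keys. The function name, the `valid_json` key, and the required keys are hypothetical; the returned `key`/`score`/`comment` shape is chosen only so that it lines up with the feedback fields used later in this post.

```python
import json


def json_format_feedback(answer: str, required_keys: set[str]) -> dict:
    """Score 1 if `answer` is a JSON object containing all required keys, else 0."""
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError as exc:
        return {"key": "valid_json", "score": 0, "comment": f"not JSON: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {"key": "valid_json", "score": 0, "comment": "JSON is not an object"}
    missing = sorted(required_keys - set(parsed))
    if missing:
        return {"key": "valid_json", "score": 0, "comment": f"missing keys: {missing}"}
    return {"key": "valid_json", "score": 1, "comment": "ok"}
```

Because the check is deterministic, it can run on every response without human involvement.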

Building a Practical Monitoring System with LangSmith

Now, let’s build the three pillars of observability described above using LangSmith. LangChain code is traced simply by setting environment variables, and code that calls the OpenAI SDK directly needs only a thin wrapper around the client, so integration is convenient.

1. Setup and API Key Generation

First, visit the official LangSmith website and sign up. Then, generate a new API key from the Settings > API Keys menu.

These generated keys are highly sensitive information. Do not hardcode them directly into your code; always manage them as environment variables. Create a .env file in your project’s root directory and add the keys as follows:

.env

# [🚨 Security Warning] Replace with actual key values and add this file to .gitignore.
LANGCHAIN_TRACING_V2="true"
LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
LANGCHAIN_API_KEY="<YOUR_LANGCHAIN_API_KEY>"
LANGCHAIN_PROJECT="My-AI-Project" # Group traces by project.

OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"

  • LANGCHAIN_TRACING_V2="true": This is the key setting to enable LangSmith tracing.
  • LANGCHAIN_ENDPOINT: The address of the LangSmith API server.
  • LANGCHAIN_API_KEY: Your issued LangSmith API key.
  • LANGCHAIN_PROJECT: The name of the project to group generated traces. If not specified, it defaults to ‘default’.
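
A missing or placeholder key tends to fail silently (no traces appear) rather than loudly, so a startup check can help. This is a minimal sketch: the `missing_settings` helper and the choice of which variables to treat as required are assumptions for illustration, not part of LangSmith.

```python
import os

# Variables this post's examples rely on; adjust the tuple to your own setup.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY", "OPENAI_API_KEY")


def missing_settings(env=None) -> list[str]:
    """Return names of required variables that are unset, empty, or still placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value or value.startswith("<YOUR_"):
            missing.append(name)
    return missing
```

Calling `missing_settings()` right after `load_dotenv()` and aborting when the list is non-empty catches a forgotten .env entry before any request is sent.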

2. Integrating LangSmith with Python Applications

Now, let’s integrate LangSmith into our Python code. First, install the necessary libraries.

pip install openai langchain langchain-openai langsmith python-dotenv

Write a simple code snippet that loads environment variables and uses the OpenAI client.

simple_llm_call.py

import os
from openai import OpenAI
from dotenv import load_dotenv
from langsmith.wrappers import wrap_openai

# Load environment variables from .env file
load_dotenv()

# LangChain components are traced from environment variables alone, but a plain
# OpenAI SDK client is not: wrap it with LangSmith's wrap_openai helper.
client = wrap_openai(OpenAI(api_key=os.getenv("OPENAI_API_KEY")))

def ask_question(question):
    print("Starting question...")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
    )
    answer = response.choices[0].message.content
    print("Answer:", answer)
    return answer

if __name__ == "__main__":
    ask_question("Explain the importance of observability in LLMOps.")

Running the code above records the details of the OpenAI API call (prompt, response, token usage, latency, etc.) in the ‘My-AI-Project’ project in LangSmith: the wrap_openai wrapper instruments the client, and the LANGCHAIN_TRACING_V2 environment variable switches tracing on.

Now let’s look at a more complex RAG example using LangChain.

rag_chain_example.py

import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load environment variables from .env file
load_dotenv()

# Fake Retriever (in reality, would use a VectorDB)
def fake_retriever(query: str) -> str:
    print(f"Retrieving documents for '{query}'...")
    # In a real implementation, you would query a VectorDB here.
    return "LangSmith is a platform for tracing, monitoring, and evaluating LLM applications."

# Define RAG Chain using LangChain Expression Language (LCEL)
template = """
You are an AI assistant that answers questions.
Use the provided context to answer the question.

Context: {context}

Question: {question}

Answer:
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-4-turbo-preview")

rag_chain = (
    {"context": fake_retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

if __name__ == "__main__":
    print("Running RAG Chain...")
    result = rag_chain.invoke("What is LangSmith?")
    print("Final result:", result)

Executing this code will allow you to visually inspect the entire execution flow of the rag_chain on the LangSmith dashboard. You can break down and see the input, output, and elapsed time for each step, such as the fake_retriever function call, prompt generation, and model call, which greatly facilitates debugging.

3. Exploring the LangSmith Dashboard

After running your code, visit the LangSmith dashboard to explore the collected data in various ways.

  • Projects/Traces View: This lists all executed requests. Clicking on an item shows its detailed trace information. For the RAG example, the retriever and LLM calls are hierarchically displayed, providing an intuitive understanding of the overall pipeline flow.
  • Monitoring Tab: This dashboard displays key project-wide metrics (latency, token costs, error rate, etc.) over time. This allows you to immediately identify issues such as “API costs suddenly surged since last week” or “latency increased during a specific time window.”
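
The “costs suddenly surged since last week” check that the Monitoring tab supports visually can also be expressed as a simple ratio over exported data. The sketch below assumes you already have a list of daily cost totals, oldest first; the `surge_ratio` name and the seven-day window are illustrative choices.

```python
def surge_ratio(daily_costs: list[float], window: int = 7) -> float:
    """Ratio of the latest window's total cost to the previous window's total."""
    if len(daily_costs) < 2 * window:
        raise ValueError("need at least two full windows of data")
    recent = sum(daily_costs[-window:])
    previous = sum(daily_costs[-2 * window:-window])
    if previous == 0:
        # No spend in the baseline window: any new spend is an unbounded surge.
        return float("inf") if recent > 0 else 1.0
    return recent / previous
```

A ratio well above 1.0 is a cue to open the traces from the recent window and look for longer prompts, retries, or a change of model.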

Advanced Usage: Performance Optimization and Evaluations (Evals)

LangSmith goes beyond simple logging and monitoring, offering powerful features necessary for improving applications.

1. Programmatically Logging User Feedback

When a user clicks a ‘Like/Dislike’ button in your application, you can record that feedback in LangSmith, linking it to a specific LLM run.

from langsmith import Client

client = Client()

# ... Assume you obtained the run_id of the LLM call you want to rate ...
# The chain's return value does not include it; capture it at invoke time,
# for example by supplying your own run ID in the invoke config.
# Example run_id (placeholder; use the real ID of the traced run)
example_run_id = "a1b2c3d4-e5f6-..."

# When the user clicks 'Like'
client.create_feedback(
    run_id=example_run_id,
    key="user_score",  # Key indicating the type of feedback
    score=1,           # 1: Good, 0: Bad
    comment="The answer was very accurate and helpful."
)

# When the user clicks 'Dislike'
client.create_feedback(
    run_id=example_run_id,
    key="user_score",
    score=0,
    comment="The answer was irrelevant to the question."
)

This collected feedback is displayed alongside each trace on the LangSmith dashboard. It can be used to analyze “what types of questions lead to answers users dislike.”
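
The feedback calls above take the run_id as given. One way to obtain it, sketched below, is to generate the ID yourself and pass it in the invoke config. This assumes a langchain-core version that honors a `run_id` entry in the config; check this against the version you have installed. The `invoke_with_run_id` and `feedback_kwargs` helpers are hypothetical names.

```python
import uuid


def invoke_with_run_id(chain, question: str):
    """Invoke `chain` under a run ID we generate, and return (answer, run_id)."""
    run_id = uuid.uuid4()
    # Assumption: the chain's invoke() accepts a config dict with a "run_id" key.
    answer = chain.invoke(question, config={"run_id": run_id})
    return answer, run_id


def feedback_kwargs(run_id, liked: bool, comment: str = "") -> dict:
    """Translate a Like/Dislike click into keyword arguments for create_feedback."""
    return {
        "run_id": run_id,
        "key": "user_score",
        "score": 1 if liked else 0,
        "comment": comment,
    }
```

With these in place, a button handler could call `client.create_feedback(**feedback_kwargs(run_id, liked=True))`, keeping the run ID alongside the chat message it belongs to.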

2. Creating Datasets and Running Evaluations

Important success/failure cases discovered during operation can be saved as ‘datasets’ in the LangSmith UI. For example, you can create a dataset titled ‘Collection of Inaccurate Answer Cases.’

New prompts or models can then be automatically evaluated against these created datasets.

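from langsmith.evaluation import evaluate

# ... rag_chain defined previously ...

# evaluate() calls the target with each example's inputs as a dict, so wrap the
# chain. The "question" and "answer" keys are assumptions about the dataset schema.
def target(inputs: dict) -> dict:
    return {"answer": rag_chain.invoke(inputs["question"])}

# Evaluate rag_chain's performance against the 'incorrect-answers-dataset'.
# Scores come from the evaluators you pass in; without them only outputs are recorded.
evaluation_results = evaluate(
    target,  # Function to evaluate
    data="incorrect-answers-dataset",  # Name of the dataset stored in LangSmith
    # evaluators=[...],  # Add built-in or custom evaluators here
)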

The evaluate function runs the rag_chain for each item in the dataset and records the results as an experiment. Any evaluators you supply then score each result against criteria such as similarity to the ground truth. This allows for objective verification of performance improvements after modifying a prompt.
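
A custom evaluator can be as small as a function that compares a run's output with the reference output of the dataset example. The sketch below assumes both sides store their text under an "answer" key, which is a hypothetical schema; adapt the keys to your dataset.

```python
def exact_match(run, example) -> dict:
    """Custom evaluator: 1 if the predicted answer equals the reference, else 0.

    Assumes run.outputs and example.outputs are dicts with an "answer" key.
    """
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")

    def normalize(text) -> str:
        # Compare case-insensitively and ignore differences in whitespace.
        return " ".join(str(text).lower().split())

    return {"key": "exact_match", "score": int(normalize(predicted) == normalize(expected))}
```

Such a function would be passed to evaluate through its evaluators argument, for example `evaluators=[exact_match]`. Exact match is a strict criterion, so for free-form answers a similarity-based or LLM-graded evaluator is usually a better fit.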

Conclusion: Observability is a ‘Must-Have,’ Not a ‘Nice-to-Have’

Developing LLM-based AI applications is not a one-and-done task; it’s a journey requiring continuous measurement and improvement. Observability tools like LangSmith serve as an essential compass for this journey. Moving past an era of guessing and relying on intuition to modify prompts and solve problems, we must now systematically manage the performance, cost, and quality of AI applications based on data and metrics.

By utilizing the methods introduced in this guide, I hope you can gain transparent insight into your LLM applications, quickly identify the root causes of problems, and ultimately build smarter AI systems that provide greater value to users. If you operate AI in a production environment, an observability system is not an option; it’s a necessity.
