Running three different chatbot architectures on resource-constrained infrastructure taught me that observability doesn't have to be expensive or heavy. Here's how I added distributed tracing, custom metrics, and cost tracking with OpenTelemetry while keeping overhead under 30MB.
The Problem: Flying Blind on Production
I had three chatbots in production:
- Kafka Chat: HTTP POST → Kafka → Consumer → PostgreSQL → HTTP Polling
- WebSocket Chat: WebSocket → LLM streaming → PostgreSQL audit log
- Data Agent: Multi-step SQL agent with user approval workflow
Each one worked in development. In production, I had zero visibility into:
- How long users waited for first token (TTFT)
- How much I was spending on Gemini API calls
- Which SQL queries were slow in the Data Agent
- Where requests were getting stuck in the Kafka pipeline
The classic developer mistake: I built the features first, observability never.
The Constraint: Limited Resources
My production environment has constrained resources. I was already running:
- Django + Daphne (ASGI server for WebSockets): ~150MB
- Ollama + llama3.2:1b (local LLM): ~400MB idle, 800MB during inference
- Kafka consumer (Python process): ~80MB
- Categorizer service (Kafka → Ollama classifier): ~60MB
- PostgreSQL client (Aiven hosted, no local RAM)
Peak usage: ~900MB when Ollama was running. Adding heavy observability tooling (local collectors, database exporters, full Prometheus stack) would exceed available resources.
The goal: Add tracing, metrics, and cost tracking for under 30MB of overhead.
Phase 1: Distributed Tracing with Grafana Cloud
OpenTelemetry's beauty is that you don't need to run anything locally. Grafana Cloud's free tier includes 50GB of traces/month and 10K metrics series, and the Python SDK is tiny.
Memory footprint: ~15MB
```
# requirements.txt
opentelemetry-api==1.21.0
opentelemetry-sdk==1.21.0
opentelemetry-exporter-otlp-proto-http==1.21.0
```
What I traced
Each chatbot architecture has a different critical path:
Kafka Chat (6 spans)
```
POST /api/chat/
├─ django.chat_message (view handler)
│  ├─ kafka.send_message (producer)
│  └─ db.write_placeholder (PostgreSQL INSERT)
...
kafka_consumer.process_message
├─ llm.generate (Ollama/Gemini call)
└─ db.write_answer (PostgreSQL UPDATE)
```
WebSocket Chat (3 spans)
```
ws.receive
├─ llm.stream (Gemini streaming)
└─ db.save_message (audit log write)
```
Data Agent (N spans for N steps)
```
ws.receive
├─ step_1.generate_sql (LLM → SQL query)
├─ step_1.await_approval (user interaction)
├─ step_1.execute_sql (PostgreSQL query)
├─ step_2.generate_sql
└─ ...
```
Implementation
```python
# frontend/otel_config.py
import base64
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def init_tracing():
    resource = Resource.create({
        "service.name": "mpwebsite-chatbots",
        "deployment.environment": os.getenv('ENVIRONMENT', 'production'),
    })
    provider = TracerProvider(resource=resource)

    # Grafana Cloud OTLP endpoint. HTTP Basic auth requires
    # base64("instance_id:token"), not the raw pair.
    creds = f"{os.getenv('GRAFANA_INSTANCE_ID')}:{os.getenv('GRAFANA_API_TOKEN')}"
    exporter = OTLPSpanExporter(
        endpoint=f"{os.getenv('GRAFANA_OTLP_ENDPOINT')}/v1/traces",
        headers={"Authorization": f"Basic {base64.b64encode(creds.encode()).decode()}"},
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
```
```python
# django_frontend/settings.py
from frontend.otel_config import init_tracing

init_tracing()
```
Instrumenting a view:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@csrf_exempt
def chat_message(request):
    with tracer.start_as_current_span("django.chat_message") as span:
        data = json.loads(request.body)
        span.set_attribute("user_message", data['message'][:50])

        # Send to Kafka
        with tracer.start_as_current_span("kafka.send_message"):
            kafka_producer.send_message(...)

        # Write to DB
        with tracer.start_as_current_span("db.write_placeholder"):
            ChatMessage.objects.create(...)
```
What it revealed
- Kafka latency was 200-400ms — not Kafka itself, but SSL handshake to Aiven broker
- 95% of WebSocket chat time was in LLM streaming — database write was <10ms
- Data Agent SQL queries took 50-300ms — no indexes on analytics tables
This was invisible before tracing. Now I had data in Grafana's Explore view with full trace waterfall diagrams.
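One detail the Kafka trace depends on: the consumer's spans only join the producer's trace if the trace context travels inside the Kafka message headers. OTel's propagators handle this (propagate.inject on the producer side, propagate.extract in the consumer). As a rough illustration of what actually crosses the wire, here is a hand-rolled sketch of the W3C traceparent header format — the field layout comes from the Trace Context spec, not from my codebase:

```python
# Sketch of W3C Trace Context propagation across the Kafka hop. In real code,
# let OTel's propagators build and read this header; this version only shows
# the format that rides along in the Kafka message headers.

def make_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id:032x}-{span_id:016x}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    """Recover the trace context fields on the consumer side."""
    _version, trace_id, span_id, flags = header.split("-")
    return {
        "trace_id": int(trace_id, 16),
        "span_id": int(span_id, 16),
        "sampled": flags == "01",
    }

# Producer side: attach the context to the Kafka message headers
headers = [("traceparent", make_traceparent(0x3F2A1B9C8D7E6F5A, 0x9A8B7C6D5E4F).encode())]

# Consumer side: restore it so process_message spans join the same trace
ctx = parse_traceparent(headers[0][1].decode())
```

Because both sides share one trace ID, Grafana stitches the HTTP request and the consumer's work into a single waterfall.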
Phase 2: Custom Metrics and Cost Tracking
Memory footprint: +10MB (25MB total)
The OTLP exporter handles both traces and metrics with the same HTTP endpoint — no additional dependencies needed.
What I tracked
| Metric | Type | Why it matters |
|---|---|---|
| llm.tokens.input | Counter | Track Gemini API costs (input tokens × $0.0001875/1K) |
| llm.tokens.output | Counter | Track Gemini API costs (output tokens × $0.00075/1K) |
| llm.cost_usd | Counter | Direct cost tracking (calculated from token counts) |
| llm.ttft_ms | Histogram | Time-to-first-token (user-perceived latency) |
| data_agent.sql_duration_ms | Histogram | SQL query performance in multi-step agent |
| chat.requests | Counter | Request volume per chatbot type |
Implementation
```python
# frontend/otel_config.py
import base64
import os

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader


def init_metrics():
    # HTTP Basic auth requires base64("instance_id:token")
    creds = f"{os.getenv('GRAFANA_INSTANCE_ID')}:{os.getenv('GRAFANA_API_TOKEN')}"
    exporter = OTLPMetricExporter(
        endpoint=f"{os.getenv('GRAFANA_OTLP_ENDPOINT')}/v1/metrics",
        headers={"Authorization": f"Basic {base64.b64encode(creds.encode()).decode()}"},
    )
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=60000)
    provider = MeterProvider(metric_readers=[reader])
    metrics.set_meter_provider(provider)


meter = metrics.get_meter(__name__)
token_counter = meter.create_counter("llm.tokens.input")
cost_counter = meter.create_counter("llm.cost_usd")
ttft_histogram = meter.create_histogram("llm.ttft_ms")
```
Tracking Gemini costs:
```python
def _stream_gemini(user_message):
    import time
    import google.generativeai as genai

    model = genai.GenerativeModel(model_name=GEMINI_MODEL)

    # Track time-to-first-token
    start = time.time()
    first_token = None

    response = model.generate_content(user_message, stream=True)
    for chunk in response:
        if first_token is None:
            first_token = time.time()
            ttft_ms = (first_token - start) * 1000
            ttft_histogram.record(ttft_ms, {"model": GEMINI_MODEL})
        if chunk.text:
            yield chunk.text

    # After streaming completes, token counts are available on the response
    usage = response.usage_metadata
    input_tokens = usage.prompt_token_count
    output_tokens = usage.candidates_token_count

    # Record tokens
    token_counter.add(input_tokens, {"type": "input", "model": GEMINI_MODEL})
    token_counter.add(output_tokens, {"type": "output", "model": GEMINI_MODEL})

    # Calculate cost (Gemini 2.5 Flash pricing)
    cost = (input_tokens / 1000 * 0.0001875) + (output_tokens / 1000 * 0.00075)
    cost_counter.add(cost, {"model": GEMINI_MODEL})
```
What it revealed
- Gemini TTFT averaged 800ms — slower than expected, but consistent
- Daily Gemini costs: ~$0.02/day — 100 requests × ~500 tokens × ~$0.0005/1K tokens, negligible
- Ollama TTFT: 2-4 seconds — local 1B model is slow on shared vCPUs
- Data Agent SQL queries over 1 second needed indexes — added composite index, dropped to 80ms
Logging with Trace Context
The missing piece was correlating logs with traces. Python's standard logging doesn't include trace_id by default.
```python
# frontend/otel_config.py
# Requires the extra package: opentelemetry-instrumentation-logging
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor


def init_logging():
    LoggingInstrumentor().instrument(set_logging_format=True)

    # Configure structured logging with the injected trace context fields
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s [%(levelname)s] trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s'
    )
```
Now every log entry includes the trace ID:
```
2026-04-07 10:23:45 [INFO] trace_id=3f2a1b9c8d7e6f5a span_id=9a8b7c6d5e4f Message sent to Kafka
2026-04-07 10:23:46 [INFO] trace_id=3f2a1b9c8d7e6f5a span_id=1a2b3c4d5e6f Answer written to DB
```
In Grafana's Explore view, clicking a span shows correlated logs in the same interface.
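If you'd rather not pull in the extra instrumentation package, the same effect can be approximated with a plain logging.Filter that stamps each record with the current trace ID. This sketch fakes the span lookup with a class attribute; real code would read it from trace.get_current_span().get_span_context():

```python
import io
import logging

class TraceContextFilter(logging.Filter):
    """Stamp every record with a trace ID, as LoggingInstrumentor does.

    The class attribute is a stand-in for the active span; real code would
    format trace.get_current_span().get_span_context().trace_id as hex.
    """
    current_trace_id = "3f2a1b9c8d7e6f5a"  # placeholder value

    def filter(self, record):
        record.otelTraceID = self.current_trace_id
        return True  # annotate only, never drop the record

# Wire it to a logger writing into a buffer so the output is inspectable
buf = io.StringIO()
logger = logging.getLogger("chat.demo")
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("[%(levelname)s] trace_id=%(otelTraceID)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.warning("Message sent to Kafka")
```

The filter approach saves a dependency at the cost of wiring the span lookup yourself.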
Resource Impact
| Phase | Memory | CPU | Tools added |
|---|---|---|---|
| Phase 1 | +15MB | <1% | Grafana Cloud traces (OTLP) |
| Phase 2 | +10MB | <2% | Grafana Cloud metrics, structured logs |
| Total | +25MB | <3% | Full observability stack |
Before OTel: 900MB peak RAM usage
After OTel: 925MB peak RAM usage
Minimal impact on system resources.
Why This Works with Limited Resources
Three key decisions kept overhead low:
1. Serverless exporters (no local storage)
Grafana Cloud ingests data via OTLP over HTTPS. The OTel SDK batches spans/metrics in memory and exports to Grafana's managed infrastructure. No local database, no collector process, no disk I/O.
2. Batching, not per-request export
BatchSpanProcessor waits 5 seconds or 512 spans (whichever comes first) before exporting. For a chatbot handling 10 requests/minute, that's at most one small export every 5 seconds — negligible network overhead.
3. Sampling strategically
I'm not creating a span for every token in streaming responses. That would create 500+ spans per request. Instead:
- One span per LLM call (not per token)
- One span per SQL query execution (not per row)
- One span per Kafka message (not per partition poll)
For 100 chatbot requests/day × 5 spans each = 500 spans/day. Grafana Cloud's free tier is 50GB of traces/month and 10K active metrics series. I'm well within limits.
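If volume ever grew past those limits, head sampling would be the next knob. The SDK's TraceIdRatioBased sampler decides deterministically from the trace ID, roughly like this sketch (the exact SDK internals may differ; the point is that the decision is a pure function of the trace ID, so every span in a trace gets the same keep/drop verdict with no coordination):

```python
# Sketch of trace-ID ratio sampling: keep a trace iff the lower 64 bits of
# its trace ID fall below ratio * 2^64. Deterministic per trace ID.
TRACE_ID_MASK = (1 << 64) - 1

def should_sample(trace_id: int, ratio: float) -> bool:
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound
```

In the real SDK you would pass sampler=ParentBased(TraceIdRatioBased(0.1)) to the TracerProvider rather than roll your own.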
What I Didn't Implement (Phase 3)
AI-as-a-Judge quality evaluation (checking for hallucinations, groundedness, toxicity) would require:
- Sending every LLM response to another LLM for scoring
- Doubling API costs (evaluation call per response)
- Adding 500ms-1s latency to the critical path
For a personal portfolio site, this is overkill. But for a production chatbot with paying users, I'd implement it as:
- Async background evaluation (don't block user response)
- 10% sampling (evaluate 1 in 10 responses)
- Use Gemini API (not Ollama on the VM — offload to Google's servers)
This would add ~$0.002/day in Gemini costs (10 evals × ~200 tokens × $0.0001). Still trivial.
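A minimal sketch of that design — sampled enqueue on the hot path, scoring in a background worker. The judge_with_gemini call is hypothetical (Phase 3 isn't built); only the sampling and queueing are shown:

```python
import queue
import random
import threading

eval_queue = queue.Queue()

def maybe_enqueue_eval(prompt, response, sample_rate=0.1):
    """Sample ~10% of responses for background AI-as-a-judge scoring.

    Returns True if this pair was enqueued. Nothing here blocks the
    user-facing response; the judge call runs in the worker thread."""
    if random.random() < sample_rate:
        eval_queue.put((prompt, response))
        return True
    return False

def eval_worker():
    while True:
        prompt, response = eval_queue.get()
        # Hypothetical judge call, offloaded to Gemini rather than local Ollama:
        # score = judge_with_gemini(prompt, response)
        eval_queue.task_done()

threading.Thread(target=eval_worker, daemon=True).start()
```

The daemon worker dies with the process, so a production version would want graceful drain on shutdown.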
Lessons Learned
1. Observability doesn't require self-hosted Prometheus
The classic self-hosted stack (Prometheus scraping metrics, Loki for logs, local Grafana) is 300-500MB of RAM. Grafana Cloud with OTLP exporters eliminates this overhead entirely — you get the Grafana UI without running the storage layer.
2. Traces > Logs for debugging async workflows
When a Kafka message gets stuck, logs show:
```
10:23:45 [INFO] Message sent to Kafka
10:23:50 [INFO] Consumer received message
```
Traces show:
```
kafka.send_message: 380ms
├─ ssl_handshake: 320ms (!)
└─ produce: 60ms

kafka_consumer.process_message: 1.2s
├─ llm.generate: 1.1s
└─ db.write_answer: 90ms
```
The SSL handshake was the bottleneck — not the message queue, not the database, not the LLM. Logs would never reveal this.
3. Cost tracking prevents surprises
Without metrics, I assumed Gemini was cheap but had no proof. After adding llm.cost_usd, I discovered:
- Average request: $0.0002
- Peak day (100 requests): $0.02
- Monthly projection: $0.60
This is 0.3% of Gemini's $200 free credit. I could 300× my traffic before paying a cent. That context matters when deciding whether to optimize token usage.
4. TTFT is the metric that matters
Token throughput (tokens/second) is interesting. TTFT (time-to-first-token) is what users feel. A chatbot that takes 2 seconds to start responding but then streams 100 tokens/sec feels slower than one that starts in 200ms at 50 tokens/sec.
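The arithmetic bears this out for short replies, which is what chat mostly is. Plugging in the numbers above:

```python
def time_to_n_tokens(ttft_s: float, tokens_per_s: float, n: int) -> float:
    """Wall-clock seconds until the user has seen n tokens of the reply."""
    return ttft_s + n / tokens_per_s

# Slow start + fast stream vs fast start + slow stream, for a 100-token reply:
slow_start = time_to_n_tokens(2.0, 100, 100)  # 2.0s wait + 1.0s streaming
fast_start = time_to_n_tokens(0.2, 50, 100)   # 0.2s wait + 2.0s streaming
```

With these rates the fast starter delivers a 100-token reply sooner, and the two only break even around 180 tokens; below that, TTFT dominates the experience.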
Tracking TTFT revealed that local Ollama models are slower (2-4s) compared to Gemini API (800ms). That insight drove the decision to default to Gemini and treat Ollama as a fallback.
Code Changes
Total lines added: ~150
Files changed: 4
- frontend/otel_config.py — 60 lines (init tracing, metrics, logging)
- frontend/views.py — 30 lines (instrument Kafka chat view)
- frontend/consumers.py — 40 lines (instrument WebSocket chat + Data Agent)
- frontend/kafka_consumer.py — 20 lines (instrument Kafka consumer)
That's 150 lines for full observability across three architectures.
Try It Yourself
If you're running Django anywhere (AWS, GCP, Azure, bare metal), adding OpenTelemetry with Grafana Cloud is straightforward:
1. Sign up for Grafana Cloud free tier (grafana.com)
2. Get your OTLP credentials:
- Go to Connections → Add new connection → OpenTelemetry
- Copy your OTLP endpoint (e.g., https://otlp-gateway-prod-us-east-0.grafana.net/otlp)
- Copy your instance ID and generate an API token
3. Install dependencies:
```
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-http
```
4. Set environment variables:
```
GRAFANA_OTLP_ENDPOINT=https://otlp-gateway-prod-us-east-0.grafana.net/otlp
GRAFANA_INSTANCE_ID=123456
GRAFANA_API_TOKEN=glc_xxx...
```
5. Initialize in settings.py:
```python
import base64
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()

# HTTP Basic auth requires base64("instance_id:token")
creds = f"{os.getenv('GRAFANA_INSTANCE_ID')}:{os.getenv('GRAFANA_API_TOKEN')}"
exporter = OTLPSpanExporter(
    endpoint=f"{os.getenv('GRAFANA_OTLP_ENDPOINT')}/v1/traces",
    headers={"Authorization": f"Basic {base64.b64encode(creds.encode()).decode()}"},
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```
6. Instrument your code:
```python
tracer = trace.get_tracer(__name__)

def my_view(request):
    with tracer.start_as_current_span("my_view"):
        # Your code here
        pass
```
7. Deploy and view in Grafana: Open Explore → select your data source → see traces and metrics.
Conclusion
Observability is not a luxury reserved for teams with dedicated DevOps engineers and expensive APM contracts. Resource-constrained infrastructure can run distributed tracing, custom metrics, structured logging, and cost tracking — all for 25MB of overhead.
The trick is using cloud-native exporters instead of self-hosted collectors. OpenTelemetry's real power isn't the protocol — it's that the same SDK works with Jaeger, Prometheus, Grafana Cloud, Google Cloud, and any OTLP-compatible backend. Write once, export anywhere.
If you're running a side project, a portfolio chatbot, or a startup MVP on limited infrastructure, there's no excuse for flying blind. Add tracing first, metrics second, and you'll debug 10× faster than with logs alone.
Next steps: Add Grafana alerting when TTFT > 2s, build custom dashboards for each chatbot, and implement async AI-as-a-Judge evaluation for 10% of responses. But that's Phase 3.