Running three different chatbot architectures on resource-constrained infrastructure taught me that observability doesn't have to be expensive or heavy. Here's how I added distributed tracing, custom metrics, and cost tracking with OpenTelemetry while keeping overhead under 30MB.
The Problem: Flying Blind on Production
I had three chatbots in production:
- Kafka Chat: HTTP POST → Kafka → Consumer → PostgreSQL → HTTP Polling
- WebSocket Chat: WebSocket → LLM streaming → PostgreSQL audit log
- Data Agent: Multi-step SQL agent with user approval workflow
Each one worked in development. In production, I had zero visibility into:
- How long users waited for first token (TTFT)
- How much I was spending on Gemini API calls
- Which SQL queries were slow in the Data Agent
- Where requests were getting stuck in the Kafka pipeline
The classic developer mistake: I built the features first, observability never.
The Constraint: Limited Resources
My production environment has constrained resources. I was already running:
- Django + Daphne (ASGI server for WebSockets): ~150MB
- Ollama + llama3.2:1b (local LLM): ~400MB idle, 800MB during inference
- Kafka consumer (Python process): ~80MB
- Categorizer service (Kafka → Ollama classifier): ~60MB
- PostgreSQL client (Aiven hosted, no local RAM)
Peak usage: ~900MB when Ollama was running. Adding heavy observability tooling (local collectors, database exporters, full Prometheus stack) would exceed available resources.
The goal: Add tracing, metrics, and cost tracking for under 30MB of overhead.
Phase 1: Distributed Tracing with Grafana Cloud
OpenTelemetry's beauty is that you don't need to run anything locally. Grafana Cloud's free tier includes 50GB of traces/month and 10K metrics series, and the Python SDK is tiny.
Memory footprint: ~15MB
```
# requirements.txt
opentelemetry-api==1.21.0
opentelemetry-sdk==1.21.0
opentelemetry-exporter-otlp-proto-http==1.21.0
```
What I traced
Each chatbot architecture has a different critical path:
Kafka Chat (6 spans)
```
POST /api/chat/
├─ django.chat_message (view handler)
│  ├─ kafka.send_message (producer)
│  └─ db.write_placeholder (PostgreSQL INSERT)
...
kafka_consumer.process_message
├─ llm.generate (Ollama/Gemini call)
└─ db.write_answer (PostgreSQL UPDATE)
```
WebSocket Chat (3 spans)
```
ws.receive
├─ llm.stream (Gemini streaming)
└─ db.save_message (audit log write)
```
Data Agent (N spans for N steps)
```
ws.receive
├─ step_1.generate_sql (LLM → SQL query)
├─ step_1.await_approval (user interaction)
├─ step_1.execute_sql (PostgreSQL query)
├─ step_2.generate_sql
└─ ...
```
Implementation
```python
# frontend/otel_config.py
import base64
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def init_tracing():
    resource = Resource.create({
        "service.name": "mpwebsite-chatbots",
        "deployment.environment": os.getenv('ENVIRONMENT', 'production'),
    })
    provider = TracerProvider(resource=resource)

    # Grafana Cloud OTLP endpoint. HTTP Basic auth requires
    # base64("instance_id:token"), not the raw pair.
    creds = f"{os.getenv('GRAFANA_INSTANCE_ID')}:{os.getenv('GRAFANA_API_TOKEN')}"
    exporter = OTLPSpanExporter(
        endpoint=f"{os.getenv('GRAFANA_OTLP_ENDPOINT')}/v1/traces",
        headers={"Authorization": f"Basic {base64.b64encode(creds.encode()).decode()}"},
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
```
```python
# django_frontend/settings.py
from frontend.otel_config import init_tracing

init_tracing()
```
Instrumenting a view:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@csrf_exempt
def chat_message(request):
    with tracer.start_as_current_span("django.chat_message") as span:
        data = json.loads(request.body)
        span.set_attribute("user_message", data['message'][:50])

        # Send to Kafka
        with tracer.start_as_current_span("kafka.send_message"):
            kafka_producer.send_message(...)

        # Write to DB
        with tracer.start_as_current_span("db.write_placeholder"):
            ChatMessage.objects.create(...)
```
What it revealed
- Kafka latency was 200-400ms — not Kafka itself, but SSL handshake to Aiven broker
- 95% of WebSocket chat time was in LLM streaming — database write was <10ms
- Data Agent SQL queries took 50-300ms — no indexes on analytics tables
This was invisible before tracing. Now I had data in Grafana's Explore view with full trace waterfall diagrams.
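One detail the Kafka trace depends on: the consumer's spans only join the producer's trace if the trace context travels inside the Kafka message headers. OTel's propagators handle this (propagate.inject on the producer side, propagate.extract in the consumer). As a rough illustration of what actually crosses the wire, here is a hand-rolled sketch of the W3C traceparent header format — the field layout comes from the Trace Context spec, not from my codebase:

```python
# Sketch of W3C Trace Context propagation across the Kafka hop. In real code,
# let OTel's propagators build and read this header; this version only shows
# the format that rides along in the Kafka message headers.

def make_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id:032x}-{span_id:016x}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    """Recover the trace context fields on the consumer side."""
    _version, trace_id, span_id, flags = header.split("-")
    return {
        "trace_id": int(trace_id, 16),
        "span_id": int(span_id, 16),
        "sampled": flags == "01",
    }

# Producer side: attach the context to the Kafka message headers
headers = [("traceparent", make_traceparent(0x3F2A1B9C8D7E6F5A, 0x9A8B7C6D5E4F).encode())]

# Consumer side: restore it so process_message spans join the same trace
ctx = parse_traceparent(headers[0][1].decode())
```

Because both sides share one trace ID, Grafana stitches the HTTP request and the consumer's work into a single waterfall.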
Phase 2: Custom Metrics and Cost Tracking
Memory footprint: +10MB (25MB total)
The OTLP exporter handles both traces and metrics with the same HTTP endpoint — no additional dependencies needed.
What I tracked
| Metric | Type | Why it matters |
|---|---|---|
| llm.tokens.input | Counter | Track Gemini API costs (input tokens × $0.0001875/1K) |
| llm.tokens.output | Counter | Track Gemini API costs (output tokens × $0.00075/1K) |
| llm.cost_usd | Counter | Direct cost tracking (calculated from token counts) |
| llm.ttft_ms | Histogram | Time-to-first-token (user-perceived latency) |
| data_agent.sql_duration_ms | Histogram | SQL query performance in multi-step agent |
| chat.requests | Counter | Request volume per chatbot type |
Implementation
```python
# frontend/otel_config.py
import base64
import os

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader


def init_metrics():
    # HTTP Basic auth requires base64("instance_id:token")
    creds = f"{os.getenv('GRAFANA_INSTANCE_ID')}:{os.getenv('GRAFANA_API_TOKEN')}"
    exporter = OTLPMetricExporter(
        endpoint=f"{os.getenv('GRAFANA_OTLP_ENDPOINT')}/v1/metrics",
        headers={"Authorization": f"Basic {base64.b64encode(creds.encode()).decode()}"},
    )
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=60000)
    provider = MeterProvider(metric_readers=[reader])
    metrics.set_meter_provider(provider)


meter = metrics.get_meter(__name__)
token_counter = meter.create_counter("llm.tokens.input")
cost_counter = meter.create_counter("llm.cost_usd")
ttft_histogram = meter.create_histogram("llm.ttft_ms")
```
Tracking Gemini costs:
```python
def _stream_gemini(user_message):
    import time
    import google.generativeai as genai

    model = genai.GenerativeModel(model_name=GEMINI_MODEL)

    # Track time-to-first-token
    start = time.time()
    first_token = None

    response = model.generate_content(user_message, stream=True)
    for chunk in response:
        if first_token is None:
            first_token = time.time()
            ttft_ms = (first_token - start) * 1000
            ttft_histogram.record(ttft_ms, {"model": GEMINI_MODEL})
        if chunk.text:
            yield chunk.text

    # After streaming completes, token counts are available on the response
    usage = response.usage_metadata
    input_tokens = usage.prompt_token_count
    output_tokens = usage.candidates_token_count

    # Record tokens
    token_counter.add(input_tokens, {"type": "input", "model": GEMINI_MODEL})
    token_counter.add(output_tokens, {"type": "output", "model": GEMINI_MODEL})

    # Calculate cost (Gemini 2.5 Flash pricing)
    cost = (input_tokens / 1000 * 0.0001875) + (output_tokens / 1000 * 0.00075)
    cost_counter.add(cost, {"model": GEMINI_MODEL})
```
What it revealed
- Gemini TTFT averaged 800ms — slower than expected, but consistent
- Daily Gemini costs: ~$0.02/day — 100 requests × ~500 tokens × ~$0.0005/1K tokens, negligible
- Ollama TTFT: 2-4 seconds — local 1B model is slow on shared vCPUs
- Data Agent SQL queries over 1 second needed indexes — added composite index, dropped to 80ms
Logging with Trace Context
The missing piece was correlating logs with traces. Python's standard logging doesn't include trace_id by default.
```python
# frontend/otel_config.py
# Requires the extra package: opentelemetry-instrumentation-logging
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor


def init_logging():
    LoggingInstrumentor().instrument(set_logging_format=True)

    # Configure structured logging with the injected trace context fields
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s [%(levelname)s] trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s'
    )
```
Now every log entry includes the trace ID:
```
2026-04-07 10:23:45 [INFO] trace_id=3f2a1b9c8d7e6f5a span_id=9a8b7c6d5e4f Message sent to Kafka
2026-04-07 10:23:46 [INFO] trace_id=3f2a1b9c8d7e6f5a span_id=1a2b3c4d5e6f Answer written to DB
```
In Grafana's Explore view, clicking a span shows correlated logs in the same interface.
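If you'd rather not pull in the extra instrumentation package, the same effect can be approximated with a plain logging.Filter that stamps each record with the current trace ID. This sketch fakes the span lookup with a class attribute; real code would read it from trace.get_current_span().get_span_context():

```python
import io
import logging

class TraceContextFilter(logging.Filter):
    """Stamp every record with a trace ID, as LoggingInstrumentor does.

    The class attribute is a stand-in for the active span; real code would
    format trace.get_current_span().get_span_context().trace_id as hex.
    """
    current_trace_id = "3f2a1b9c8d7e6f5a"  # placeholder value

    def filter(self, record):
        record.otelTraceID = self.current_trace_id
        return True  # annotate only, never drop the record

# Wire it to a logger writing into a buffer so the output is inspectable
buf = io.StringIO()
logger = logging.getLogger("chat.demo")
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("[%(levelname)s] trace_id=%(otelTraceID)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.warning("Message sent to Kafka")
```

The filter approach saves a dependency at the cost of wiring the span lookup yourself.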
Resource Impact
| Phase | Memory | CPU | Tools added |
|---|---|---|---|
| Phase 1 | +15MB | <1% | Grafana Cloud traces (OTLP) |
| Phase 2 | +10MB | <2% | Grafana Cloud metrics, structured logs |
| Total | +25MB | <3% | Full observability stack |
Before OTel: 900MB peak RAM usage
After OTel: 925MB peak RAM usage
Minimal impact on system resources.
Why This Works with Limited Resources
Three key decisions kept overhead low:
1. Serverless exporters (no local storage)
Grafana Cloud ingests data via OTLP over HTTPS. The OTel SDK batches spans/metrics in memory and exports to Grafana's managed infrastructure. No local database, no collector process, no disk I/O.
2. Batching, not per-request export
BatchSpanProcessor waits 5 seconds or 512 spans (whichever comes first) before exporting. For a chatbot handling 10 requests/minute, that's at most one small export every 5 seconds — negligible network overhead.
3. Sampling strategically
I'm not creating a span for every token in streaming responses. That would create 500+ spans per request. Instead:
- One span per LLM call (not per token)
- One span per SQL query execution (not per row)
- One span per Kafka message (not per partition poll)
For 100 chatbot requests/day × 5 spans each = 500 spans/day. Grafana Cloud's free tier is 50GB of traces/month and 10K active metrics series. I'm well within limits.
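If volume ever grew past those limits, head sampling would be the next knob. The SDK's TraceIdRatioBased sampler decides deterministically from the trace ID, roughly like this sketch (the exact SDK internals may differ; the point is that the decision is a pure function of the trace ID, so every span in a trace gets the same keep/drop verdict with no coordination):

```python
# Sketch of trace-ID ratio sampling: keep a trace iff the lower 64 bits of
# its trace ID fall below ratio * 2^64. Deterministic per trace ID.
TRACE_ID_MASK = (1 << 64) - 1

def should_sample(trace_id: int, ratio: float) -> bool:
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound
```

In the real SDK you would pass sampler=ParentBased(TraceIdRatioBased(0.1)) to the TracerProvider rather than roll your own.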
What I Didn't Implement (Phase 3)
AI-as-a-Judge quality evaluation (checking for hallucinations, groundedness, toxicity) would require:
- Sending every LLM response to another LLM for scoring
- Doubling API costs (evaluation call per response)
- Adding 500ms-1s latency to the critical path
For a personal portfolio site, this is overkill. But for a production chatbot with paying users, I'd implement it as:
- Async background evaluation (don't block user response)
- 10% sampling (evaluate 1 in 10 responses)
- Use Gemini API (not Ollama on the VM — offload to Google's servers)
This would add ~$0.002/day in Gemini costs (10 evals × ~200 tokens × $0.0001). Still trivial.
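A minimal sketch of that design — sampled enqueue on the hot path, scoring in a background worker. The judge_with_gemini call is hypothetical (Phase 3 isn't built); only the sampling and queueing are shown:

```python
import queue
import random
import threading

eval_queue = queue.Queue()

def maybe_enqueue_eval(prompt, response, sample_rate=0.1):
    """Sample ~10% of responses for background AI-as-a-judge scoring.

    Returns True if this pair was enqueued. Nothing here blocks the
    user-facing response; the judge call runs in the worker thread."""
    if random.random() < sample_rate:
        eval_queue.put((prompt, response))
        return True
    return False

def eval_worker():
    while True:
        prompt, response = eval_queue.get()
        # Hypothetical judge call, offloaded to Gemini rather than local Ollama:
        # score = judge_with_gemini(prompt, response)
        eval_queue.task_done()

threading.Thread(target=eval_worker, daemon=True).start()
```

The daemon worker dies with the process, so a production version would want graceful drain on shutdown.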
Lessons Learned
1. Observability doesn't require self-hosted Prometheus
The classic self-hosted stack (Prometheus scraping metrics, Loki for logs, local Grafana) is 300-500MB of RAM. Grafana Cloud with OTLP exporters eliminates this overhead entirely — you get the Grafana UI without running the storage layer.
2. Traces > Logs for debugging async workflows
When a Kafka message gets stuck, logs show:
```
10:23:45 [INFO] Message sent to Kafka
10:23:50 [INFO] Consumer received message
```
Traces show:
```
kafka.send_message: 380ms
├─ ssl_handshake: 320ms (!)
└─ produce: 60ms

kafka_consumer.process_message: 1.2s
├─ llm.generate: 1.1s
└─ db.write_answer: 90ms
```
The SSL handshake was the bottleneck — not the message queue, not the database, not the LLM. Logs would never reveal this.
3. Cost tracking prevents surprises
Without metrics, I assumed Gemini was cheap but had no proof. After adding llm.cost_usd, I discovered:
- Average request: $0.0002
- Peak day (100 requests): $0.02
- Monthly projection: $0.60
This is 0.3% of Gemini's $200 free credit. I could 300× my traffic before paying a cent. That context matters when deciding whether to optimize token usage.
4. TTFT is the metric that matters
Token throughput (tokens/second) is interesting. TTFT (time-to-first-token) is what users feel. A chatbot that takes 2 seconds to start responding but then streams 100 tokens/sec feels slower than one that starts in 200ms at 50 tokens/sec.
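The arithmetic bears this out for short replies, which is what chat mostly is. Plugging in the numbers above:

```python
def time_to_n_tokens(ttft_s: float, tokens_per_s: float, n: int) -> float:
    """Wall-clock seconds until the user has seen n tokens of the reply."""
    return ttft_s + n / tokens_per_s

# Slow start + fast stream vs fast start + slow stream, for a 100-token reply:
slow_start = time_to_n_tokens(2.0, 100, 100)  # 2.0s wait + 1.0s streaming
fast_start = time_to_n_tokens(0.2, 50, 100)   # 0.2s wait + 2.0s streaming
```

With these rates the fast starter delivers a 100-token reply sooner, and the two only break even around 180 tokens; below that, TTFT dominates the experience.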
Tracking TTFT revealed that local Ollama models are slower (2-4s) compared to Gemini API (800ms). That insight drove the decision to default to Gemini and treat Ollama as a fallback.
Code Changes
Total lines added: ~150
Files changed: 4
- frontend/otel_config.py — 60 lines (init tracing, metrics, logging)
- frontend/views.py — 30 lines (instrument Kafka chat view)
- frontend/consumers.py — 40 lines (instrument WebSocket chat + Data Agent)
- frontend/kafka_consumer.py — 20 lines (instrument Kafka consumer)
That's 150 lines for full observability across three architectures.
Try It Yourself
If you're running Django anywhere (AWS, GCP, Azure, bare metal), adding OpenTelemetry with Grafana Cloud is straightforward:
1. Sign up for Grafana Cloud free tier (grafana.com)
2. Get your OTLP credentials:
- Go to Connections → Add new connection → OpenTelemetry
- Copy your OTLP endpoint (e.g., https://otlp-gateway-prod-us-east-0.grafana.net/otlp)
- Copy your instance ID and generate an API token
3. Install dependencies:
```
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-http
```
4. Set environment variables:
```
GRAFANA_OTLP_ENDPOINT=https://otlp-gateway-prod-us-east-0.grafana.net/otlp
GRAFANA_INSTANCE_ID=123456
GRAFANA_API_TOKEN=glc_xxx...
```
5. Initialize in settings.py:
```python
import base64
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()

# HTTP Basic auth requires base64("instance_id:token")
creds = f"{os.getenv('GRAFANA_INSTANCE_ID')}:{os.getenv('GRAFANA_API_TOKEN')}"
exporter = OTLPSpanExporter(
    endpoint=f"{os.getenv('GRAFANA_OTLP_ENDPOINT')}/v1/traces",
    headers={"Authorization": f"Basic {base64.b64encode(creds.encode()).decode()}"},
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```
6. Instrument your code:
```python
tracer = trace.get_tracer(__name__)

def my_view(request):
    with tracer.start_as_current_span("my_view"):
        # Your code here
        pass
```
7. Deploy and view in Grafana: Open Explore → select your data source → see traces and metrics.
Conclusion
Observability is not a luxury reserved for teams with dedicated DevOps engineers and expensive APM contracts. Resource-constrained infrastructure can run distributed tracing, custom metrics, structured logging, and cost tracking — all for 25MB of overhead.
The trick is using cloud-native exporters instead of self-hosted collectors. OpenTelemetry's real power isn't the protocol — it's that the same SDK works with Jaeger, Prometheus, Grafana Cloud, Google Cloud, and any OTLP-compatible backend. Write once, export anywhere.
If you're running a side project, a portfolio chatbot, or a startup MVP on limited infrastructure, there's no excuse for flying blind. Add tracing first, metrics second, and you'll debug 10× faster than with logs alone.
Next steps: Add Grafana alerting when TTFT > 2s, build custom dashboards for each chatbot, and implement async AI-as-a-Judge evaluation for 10% of responses. But that's Phase 3.