Monitoring and Logging in LLM-MCP
This comprehensive guide covers monitoring and logging best practices for the LLM Platform Model Context Protocol (LLM-MCP) to ensure optimal performance, reliability, and observability in production environments.
Table of Contents
- Monitoring Overview
- Logging Configuration
- Metrics and Instrumentation
- Monitoring Dashboards
- Alerting
- Performance Monitoring
- Health Checks
- Distributed Tracing
- Log Analysis
- Integration with External Systems
- Monitoring in Kubernetes
- Best Practices
- Troubleshooting with Monitoring
Monitoring Overview
LLM-MCP provides comprehensive monitoring and logging capabilities to help you observe, understand, and debug your AI model orchestration platform. Proper monitoring enables you to:
- Detect issues before they affect users
- Understand performance characteristics
- Track usage patterns and growth
- Debug problems efficiently
- Plan capacity based on actual usage
- Validate SLAs with metrics
Logging Configuration
Basic Logging Setup
LLM-MCP uses a structured logging system that outputs JSON-formatted logs by default. Configure logging in your config.json:
{
"logging": {
"level": "info",
"format": "json",
"destination": "stdout",
"colors": false,
"timestamp": true,
"serviceName": "llm-mcp",
"correlationIdHeader": "x-correlation-id"
}
}
Advanced Logging Configuration
For production environments, consider a more detailed logging configuration:
{
"logging": {
"level": "info",
"format": "json",
"destination": "stdout",
"colors": false,
"timestamp": true,
"serviceName": "llm-mcp",
"correlationIdHeader": "x-correlation-id",
"sensitiveFields": ["password", "token", "apiKey"],
"includeRequestBody": false,
"includeResponseBody": false,
"logRequestHeaders": ["user-agent", "content-type", "accept"],
"logResponseHeaders": ["content-type", "content-length"],
"requestIdHeader": "x-request-id",
"fileLogging": {
"enabled": true,
"directory": "/var/log/llm-mcp",
"maxSize": "100m",
"maxFiles": 10,
"compress": true
}
}
}
Log Levels
LLM-MCP supports the following log levels, in order of increasing severity. The configured level acts as a threshold: messages at that level and above are emitted, everything below is dropped (see the sketch after this list):
- trace: Highly detailed information for debugging
- debug: Detailed information useful during development
- info: General operational information
- warn: Warning conditions that don't affect service
- error: Error conditions that affect specific operations
- fatal: Critical conditions that require immediate attention
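A minimal illustration of the threshold behavior, assuming the @llm-mcp/logger API used in the correlation-ID example below (getLogger, child, and the per-level methods are assumptions; adapt to your logger):
// With "level": "info" in config.json, trace and debug calls are dropped
// while info, warn, error, and fatal calls are emitted.
import { getLogger } from '@llm-mcp/logger';
const logger = getLogger().child({ component: 'example' });
logger.trace('Raw request payload received');   // suppressed at level "info"
logger.debug('Cache miss for key user-123');    // suppressed at level "info"
logger.info('Request processed successfully');  // emitted
logger.warn('Retrying downstream call');        // emitted
logger.error('Downstream call failed');         // emitted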
Structured Logging
LLM-MCP logs are structured for easy parsing and analysis:
{
"level": "info",
"timestamp": "2025-05-30T12:34:56.789Z",
"service": "llm-mcp",
"correlationId": "c1d2e3f4-g5h6-i7j8-k9l0",
"requestId": "req-1234567890",
"message": "Request processed successfully",
"method": "POST",
"path": "/api/v1/tools/execute",
"statusCode": 200,
"responseTime": 127,
"userId": "user-123",
"toolName": "example-tool",
"component": "tool-executor"
}
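Because each entry is a single JSON object per line, ad-hoc analysis needs no special tooling. The sketch below scans a log file for slow requests; the path follows the fileLogging example above, but the actual filename depends on your deployment:
// Hypothetical ad-hoc analysis: print requests slower than a threshold.
// Assumes one JSON object per line in /var/log/llm-mcp/application.log.
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

async function findSlowRequests(path: string, thresholdMs: number) {
  const lines = createInterface({ input: createReadStream(path) });
  for await (const line of lines) {
    try {
      const entry = JSON.parse(line);
      if (typeof entry.responseTime === 'number' && entry.responseTime > thresholdMs) {
        console.log(`${entry.timestamp} ${entry.path} ${entry.responseTime}ms (${entry.correlationId})`);
      }
    } catch {
      // Skip lines that are not valid JSON
    }
  }
}

findSlowRequests('/var/log/llm-mcp/application.log', 500);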
Correlation IDs
LLM-MCP automatically propagates correlation IDs through the system to track requests across multiple services:
// Example of using correlation IDs in custom code
import { getLogger } from '@llm-mcp/logger';
async function processRequest(req, res) {
const correlationId = req.headers['x-correlation-id'] || generateId();
const logger = getLogger().child({ correlationId });
logger.info('Processing request', {
endpoint: req.path,
method: req.method
});
// Pass correlation ID to downstream services
const result = await callDownstreamService({
headers: {
'x-correlation-id': correlationId
}
});
logger.info('Request completed', { result: 'success' });
}
Metrics and Instrumentation
Core Metrics
LLM-MCP exposes the following core metrics:
- Request Metrics:
  - Request count (total and by endpoint)
  - Request duration (histogram)
  - Request size (histogram)
  - Response size (histogram)
  - Error count (by type and status code)
- Tool Metrics:
  - Tool execution count (by tool)
  - Tool execution duration (histogram)
  - Tool execution errors (by tool and error type)
- System Metrics:
  - CPU usage
  - Memory usage
  - Active connections
  - Event loop lag
  - Garbage collection stats
Metrics Configuration
Configure metrics collection in your config.json. Note that the metric prefix must be a valid Prometheus metric-name fragment (letters, digits, and underscores), so llm_mcp_ is used rather than llm-mcp_:
{
"monitoring": {
"metrics": {
"enabled": true,
"interval": 15,
"prefix": "llm-mcp_",
"defaultLabels": {
"service": "llm-mcp",
"environment": "production"
},
"prometheus": {
"enabled": true,
"port": 9090,
"path": "/metrics"
}
}
}
}
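With metrics enabled, it is worth confirming that the Prometheus endpoint actually serves data before pointing a scraper at it. A minimal check, assuming the port and path configured above and that LLM-MCP runs locally:
// Quick sanity check of the metrics endpoint configured above.
// The base URL is an assumption; adjust it to your deployment.
async function checkMetricsEndpoint(baseUrl = 'http://localhost:9090') {
  const response = await fetch(`${baseUrl}/metrics`);
  if (!response.ok) {
    throw new Error(`Metrics endpoint returned ${response.status}`);
  }
  const body = await response.text();
  // Count exposed metric families (lines starting with "# TYPE")
  const families = body.split('\n').filter((line) => line.startsWith('# TYPE')).length;
  console.log(`Metrics endpoint is up, ${families} metric families exposed`);
}

checkMetricsEndpoint().catch((err) => console.error(err));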
Custom Metrics
You can define custom metrics for your specific use cases:
// Example of defining and using custom metrics
import { metrics } from '@llm-mcp/monitoring';
// Define custom metrics
const customMetrics = {
vectorSearchLatency: metrics.histogram({
name: 'vector_search_latency',
help: 'Vector search operation latency in ms',
labelNames: ['collection', 'dimensions'],
buckets: [10, 50, 100, 250, 500, 1000, 2500, 5000]
}),
modelRegistrySize: metrics.gauge({
name: 'model_registry_size',
help: 'Number of models in the registry',
labelNames: ['status']
}),
toolRegistrationsTotal: metrics.counter({
name: 'tool_registrations_total',
help: 'Total number of tool registrations',
labelNames: ['result']
})
};
// Using custom metrics
async function handleVectorSearch(collection, query, dimensions) {
const startTime = Date.now();
try {
const result = await performVectorSearch(collection, query);
// Record latency
const latency = Date.now() - startTime;
customMetrics.vectorSearchLatency.observe(
{ collection, dimensions: String(dimensions) },
latency
);
return result;
} catch (error) {
// Handle error...
}
}
// Update gauge periodically
function updateModelRegistryMetrics() {
getModelCounts().then(counts => {
customMetrics.modelRegistrySize.set({ status: 'active' }, counts.active);
customMetrics.modelRegistrySize.set({ status: 'pending' }, counts.pending);
customMetrics.modelRegistrySize.set({ status: 'error' }, counts.error);
});
}
// Increment counter
function registerTool(tool) {
try {
// Register tool...
customMetrics.toolRegistrationsTotal.inc({ result: 'success' });
} catch (error) {
customMetrics.toolRegistrationsTotal.inc({ result: 'failure' });
// Handle error...
}
}
Monitoring Dashboards
Grafana Dashboard
LLM-MCP provides pre-configured Grafana dashboards for monitoring. Example dashboard configuration:
{
"dashboards": [
{
"name": "LLM-MCP Overview",
"uid": "llm-mcp-overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(llm-mcp_http_requests_total[5m])) by (status_code)"
}
]
},
{
"title": "Request Latency (95th Percentile)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(llm-mcp_http_request_duration_seconds_bucket[5m])) by (le))"
}
]
},
{
"title": "Memory Usage",
"type": "gauge",
"targets": [
{
"expr": "process_resident_memory_bytes{job=\"llm-mcp\"} / 1024 / 1024"
}
]
},
{
"title": "CPU Usage",
"type": "graph",
"targets": [
{
"expr": "rate(process_cpu_user_seconds_total{job=\"llm-mcp\"}[1m]) * 100"
}
]
}
]
},
{
"name": "LLM-MCP Tools",
"uid": "llm-mcp-tools",
"panels": [
{
"title": "Tool Execution Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(llm-mcp_tool_executions_total[5m])) by (tool_name)"
}
]
},
{
"title": "Tool Execution Latency (95th Percentile)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(llm-mcp_tool_execution_duration_seconds_bucket[5m])) by (tool_name, le))"
}
]
},
{
"title": "Tool Errors",
"type": "graph",
"targets": [
{
"expr": "sum(rate(llm-mcp_tool_execution_errors_total[5m])) by (tool_name, error_type)"
}
]
}
]
}
]
}
Dashboard for Vector Operations
Example of a specialized dashboard for vector operations:
{
"dashboards": [
{
"name": "LLM-MCP Vector Operations",
"uid": "llm-mcp-vector-ops",
"panels": [
{
"title": "Vector Embeddings Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(llm-mcp_vector_embeddings_total[5m])) by (model)"
}
]
},
{
"title": "Vector Search Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(llm-mcp_vector_searches_total[5m])) by (collection)"
}
]
},
{
"title": "Vector Search Latency (95th Percentile)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(llm-mcp_vector_search_duration_seconds_bucket[5m])) by (collection, le))"
}
]
},
{
"title": "Vector Database Size",
"type": "graph",
"targets": [
{
"expr": "llm-mcp_vector_collection_size_bytes"
}
]
}
]
}
]
}
Alerting
Alert Configuration
Configure alerts based on metrics thresholds:
# prometheus-alerts.yml
groups:
- name: llm-mcp-alerts
rules:
- alert: HighRequestLatency
expr: histogram_quantile(0.95, sum(rate(llm_mcp_http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "High request latency"
description: "95th percentile request latency is above 500ms for 5 minutes"
- alert: HighErrorRate
expr: sum(rate(llm_mcp_http_requests_total{status_code=~"5.."}[5m])) / sum(rate(llm_mcp_http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate"
description: "Error rate is above 5% for 5 minutes"
- alert: HighMemoryUsage
expr: process_resident_memory_bytes{job="llm-mcp"} > 1.5e9
for: 15m
labels:
severity: warning
annotations:
summary: "High memory usage"
description: "Memory usage is above 1.5GB for 15 minutes"
- alert: ToolExecutionErrors
expr: sum(rate(llm_mcp_tool_execution_errors_total[5m])) by (tool_name) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Tool execution errors"
description: "Tool [$labels.tool_name] has a high error rate"
Alert Notifications
Configure notification channels for alerts:
# alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'team-email'
routes:
- match:
severity: critical
receiver: 'team-pager'
receivers:
- name: 'team-email'
email_configs:
- to: '[email protected]'
send_resolved: true
- name: 'team-pager'
pagerduty_configs:
- service_key: '<pagerduty-service-key>'
send_resolved: true
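Before depending on this routing in an incident, send a synthetic alert and confirm the expected channel receives it. The sketch below posts a test alert to Alertmanager's v2 API; the Alertmanager URL is an assumption, and the severity label is chosen to exercise the 'team-email' route defined above:
// Post a synthetic alert to Alertmanager to verify notification routing.
// The host and port are assumptions; adjust to your deployment.
async function sendTestAlert(alertmanagerUrl = 'http://alertmanager:9093') {
  const alerts = [
    {
      labels: {
        alertname: 'RoutingTest',
        severity: 'warning', // should route to 'team-email'
        service: 'llm-mcp'
      },
      annotations: {
        summary: 'Test alert to verify Alertmanager routing',
        description: 'Safe to ignore; fired manually from a routing check script'
      },
      startsAt: new Date().toISOString()
    }
  ];
  const response = await fetch(`${alertmanagerUrl}/api/v2/alerts`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(alerts)
  });
  console.log(`Alertmanager responded with ${response.status}`);
}

sendTestAlert().catch((err) => console.error(err));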
Performance Monitoring
Key Performance Indicators (KPIs)
Monitor these KPIs to ensure optimal LLM-MCP performance:
- Request Success Rate: Percentage of successful requests (non-5xx responses)
- Request Latency: Response time for API requests (p50, p95, p99)
- Tool Execution Success Rate: Percentage of successful tool executions
- Tool Execution Latency: Time taken to execute tools (p50, p95, p99)
- System Resource Utilization: CPU, memory, disk, and network usage
- Database Performance: Query latency, connection pool usage
- Vector Operations Performance: Vector embedding and search latency
Service Level Objectives (SLOs)
Define SLOs for your LLM-MCP deployment:
# service-level-objectives.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: llm-mcp-slos
namespace: monitoring
spec:
groups:
- name: llm-mcp-slos
rules:
- record: llm_mcp:request_availability:ratio_5m
expr: sum(rate(llm_mcp_http_requests_total{status_code!~"5.."}[5m])) / sum(rate(llm_mcp_http_requests_total[5m]))
- record: llm_mcp:latency_sli:ratio_5m
expr: sum(rate(llm_mcp_http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(llm_mcp_http_request_duration_seconds_count[5m]))
- record: llm_mcp:tool_execution_availability:ratio_5m
expr: sum(rate(llm_mcp_tool_executions_total{status="success"}[5m])) / sum(rate(llm_mcp_tool_executions_total[5m]))
- alert: SLOAvailabilityBudgetBurning
expr: llm_mcp:request_availability:ratio_5m < 0.995
for: 15m
labels:
severity: warning
annotations:
summary: "SLO availability budget burning"
description: "Service availability is below 99.5% for 15 minutes"
- alert: SLOLatencyBudgetBurning
expr: llm_mcp:latency_sli:ratio_5m < 0.95
for: 15m
labels:
severity: warning
annotations:
summary: "SLO latency budget burning"
description: "Less than 95% of requests are completing within 300ms for 15 minutes"
Health Checks
Implementing Health Checks
LLM-MCP provides built-in health check endpoints:
- Liveness Check: /health/live - Indicates if the service is running
- Readiness Check: /health/ready - Indicates if the service is ready to accept requests
- Startup Check: /health/startup - Indicates if the service has completed startup
Configure health checks in your config.json:
{
"health": {
"enabled": true,
"port": 8081,
"checks": {
"database": true,
"vectorStore": true,
"cache": true,
"externalServices": true,
"diskSpace": {
"enabled": true,
"threshold": 90
},
"memory": {
"enabled": true,
"threshold": 90
}
}
}
}
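Besides wiring these endpoints into Kubernetes probes (see the Kubernetes section below), you can call them from smoke tests or deployment scripts. A minimal sketch, assuming the health port configured above; the exact response body shape may vary between versions:
// Poll the built-in health endpoints on the configured health port (8081).
// The JSON response shape is an assumption; adjust parsing to your version.
async function checkHealth(baseUrl = 'http://localhost:8081') {
  for (const endpoint of ['/health/live', '/health/ready', '/health/startup']) {
    const response = await fetch(`${baseUrl}${endpoint}`);
    const body = await response.json().catch(() => ({}));
    console.log(`${endpoint}: HTTP ${response.status}`, body);
  }
}

checkHealth().catch((err) => console.error(err));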
Custom Health Checks
You can implement custom health checks:
// Example of implementing custom health checks
import { HealthCheck, registerHealthCheck } from '@llm-mcp/health';
class CustomServiceHealthCheck implements HealthCheck {
name = 'custom-service';
async check() {
try {
// Check custom service health
const response = await fetch('https://custom-service.example.com/health');
if (response.ok) {
return {
status: 'pass',
message: 'Custom service is healthy'
};
} else {
return {
status: 'warn',
message: `Custom service returned status ${response.status}`
};
}
} catch (error) {
return {
status: 'fail',
message: `Custom service check failed: ${error.message}`
};
}
}
}
// Register the custom health check
registerHealthCheck(new CustomServiceHealthCheck());
Distributed Tracing
OpenTelemetry Integration
LLM-MCP supports OpenTelemetry for distributed tracing:
{
"tracing": {
"enabled": true,
"exporter": "jaeger",
"serviceName": "llm-mcp",
"jaeger": {
"endpoint": "http://jaeger:14268/api/traces",
"username": "",
"password": ""
},
"samplingRatio": 0.1
}
}
Implementing Tracing
Example of using tracing in custom code:
// Example of implementing custom tracing
import { tracer } from '@llm-mcp/tracing';
async function processToolExecution(request) {
const span = tracer.startSpan('tool-execution', {
attributes: {
'tool.name': request.toolName,
'request.id': request.id
}
});
try {
// Tool execution logic
const params = request.params;
// Create child span for parameter validation
const validationSpan = tracer.startSpan('parameter-validation', {
attributes: {
'tool.name': request.toolName
}
});
const validationResult = validateParameters(params);
validationSpan.end();
if (!validationResult.valid) {
span.setAttribute('error', true);
span.setAttribute('error.message', validationResult.message);
span.end();
return { error: validationResult.message };
}
// Create child span for tool execution
const executionSpan = tracer.startSpan('execute-tool-handler', {
attributes: {
'tool.name': request.toolName
}
});
const result = await executeTool(request.toolName, params);
executionSpan.end();
span.end();
return result;
} catch (error) {
span.setAttribute('error', true);
span.setAttribute('error.message', error.message);
span.end();
throw error;
}
}
Log Analysis
Log Aggregation
Configure log aggregation with tools like ELK Stack (Elasticsearch, Logstash, Kibana) or Loki:
{
"logging": {
"aggregation": {
"enabled": true,
"type": "elasticsearch",
"elasticsearch": {
"nodes": ["http://elasticsearch:9200"],
"username": "elastic",
"password": "use_environment_variable",
"index": "llm-mcp-logs"
}
}
}
}
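Once logs are aggregated, a common task is pulling every entry for one correlation ID. The sketch below queries the Elasticsearch search API against the index configured above; the host, the credential handling, and the assumption that timestamp is mapped as a date field are all deployment-specific:
// Fetch all log entries for a single correlation ID from Elasticsearch.
// Host and index follow the aggregation config above; the password is read
// from an environment variable rather than hard-coded.
async function logsForCorrelationId(correlationId: string) {
  const auth = Buffer.from(`elastic:${process.env.ELASTIC_PASSWORD}`).toString('base64');
  const response = await fetch('http://elasticsearch:9200/llm-mcp-logs/_search', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Basic ${auth}`
    },
    body: JSON.stringify({
      size: 100,
      sort: [{ timestamp: 'asc' }], // assumes timestamp is mapped as a date
      query: { match: { correlationId } }
    })
  });
  const result = await response.json();
  return result.hits.hits.map((hit: any) => hit._source);
}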
Log Parsing
Example of log parsing configuration for Logstash:
# logstash.conf
input {
file {
path => "/var/log/llm-mcp/application.log"
codec => "json"
}
}
filter {
# Parse timestamps
date {
match => [ "timestamp", "ISO8601" ]
target => "@timestamp"
}
# Add environment tag
mutate {
add_field => {
"environment" => "[ENVIRONMENT:production]"
}
}
# Parse user agent
if [userAgent] {
useragent {
source => "userAgent"
target => "user_agent"
}
}
# Parse error stack traces
if [error][stack] {
grok {
match => { "[error][stack]" => "(?m)%{JAVASTACK:stack_trace}" }
}
}
}
output {
elasticsearch {
hosts => ["[ES_HOST:elasticsearch:9200]"]
user => "[ES_USER:elastic]"
password => "[ES_PASSWORD]"
index => "llm-mcp-logs-%{+YYYY.MM.dd}"
}
}
Log Visualization
Example Kibana dashboard configuration for LLM-MCP logs:
{
"attributes": {
"title": "LLM-MCP Logs Dashboard",
"hits": 0,
"description": "Dashboard for LLM-MCP logs analysis",
"panelsJSON": "[
{
\"panelIndex\": \"1\",
\"gridData\": {
\"x\": 0,
\"y\": 0,
\"w\": 24,
\"h\": 8,
\"i\": \"1\"
},
\"version\": \"7.10.0\",
\"type\": \"visualization\",
\"id\": \"llm-mcp-logs-over-time\"
},
{
\"panelIndex\": \"2\",
\"gridData\": {
\"x\": 0,
\"y\": 8,
\"w\": 12,
\"h\": 8,
\"i\": \"2\"
},
\"version\": \"7.10.0\",
\"type\": \"visualization\",
\"id\": \"llm-mcp-error-distribution\"
},
{
\"panelIndex\": \"3\",
\"gridData\": {
\"x\": 12,
\"y\": 8,
\"w\": 12,
\"h\": 8,
\"i\": \"3\"
},
\"version\": \"7.10.0\",
\"type\": \"visualization\",
\"id\": \"llm-mcp-response-times\"
},
{
\"panelIndex\": \"4\",
\"gridData\": {
\"x\": 0,
\"y\": 16,
\"w\": 24,
\"h\": 12,
\"i\": \"4\"
},
\"version\": \"7.10.0\",
\"type\": \"search\",
\"id\": \"llm-mcp-error-logs\"
}
]",
"timeRestore": false,
"kibanaSavedObjectMeta": {
"searchSourceJSON": "{\"query\":{\"language\":\"kuery\",\"query\":\"\"},\"filter\":[]}"
}
}
}
Integration with External Systems
Prometheus Integration
Configure Prometheus to scrape LLM-MCP metrics:
# prometheus.yml
scrape_configs:
- job_name: 'llm-mcp'
scrape_interval: 15s
metrics_path: '/metrics'
static_configs:
- targets: ['llm-mcp:9090']
relabel_configs:
- source_labels: [__address__]
target_label: instance
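Once Prometheus is scraping, confirm that the target is healthy and that LLM-MCP series are queryable via the Prometheus HTTP API:
// Check scrape health for the llm-mcp job through the Prometheus HTTP API.
// The Prometheus URL is an assumption; adjust to your deployment.
async function checkScrape(prometheusUrl = 'http://prometheus:9090') {
  const query = encodeURIComponent('up{job="llm-mcp"}');
  const response = await fetch(`${prometheusUrl}/api/v1/query?query=${query}`);
  const body = await response.json();
  if (body.data.result.length === 0) {
    console.warn('No llm-mcp targets found - check the scrape configuration');
    return;
  }
  for (const series of body.data.result) {
    console.log(`${series.metric.instance}: up=${series.value[1]}`);
  }
}

checkScrape().catch((err) => console.error(err));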
Grafana Integration
Configure Grafana with Prometheus data source:
# grafana-datasource.yaml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
version: 1
editable: false
ELK Stack Integration
Configure Filebeat to ship logs to Elasticsearch:
# filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/llm-mcp/*.log
json.keys_under_root: true
json.add_error_key: true
output.elasticsearch:
hosts: ["elasticsearch:9200"]
username: "elastic"
password: "[ELASTICSEARCH_PASSWORD]"
index: "llm-mcp-logs-%[+yyyy.MM.dd]"
Monitoring in Kubernetes
Kubernetes Monitoring Setup
Example Kubernetes manifests for monitoring LLM-MCP:
# prometheus-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: llm-mcp
namespace: monitoring
spec:
selector:
matchLabels:
app: llm-mcp
endpoints:
- port: metrics
interval: 15s
path: /metrics
# grafana-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: llm-mcp-grafana-dashboards
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
llm-mcp-dashboard.json: |
{
"dashboard": {
"id": null,
"title": "LLM-MCP Overview",
"tags": ["llm-mcp", "generated"],
"timezone": "browser",
"panels": [
// Dashboard panels...
],
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m", "30m", "1h", "2h", "1d"]
}
}
}
Kubernetes Resource Monitoring
Configure resource monitoring for LLM-MCP pods:
# llm-mcp-deployment.yaml (excerpt)
spec:
template:
spec:
containers:
- name: llm-mcp
image: llm-mcp:latest
resources:
limits:
cpu: "2"
memory: "2Gi"
requests:
cpu: "500m"
memory: "1Gi"
livenessProbe:
httpGet:
path: /health/live
port: 8081
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 8081
initialDelaySeconds: 10
periodSeconds: 5
Best Practices
Monitoring Best Practices
- Define clear SLOs: Set realistic Service Level Objectives
- Use proper log levels: Use appropriate log levels for different situations
- Include context in logs: Add relevant context to logs (user IDs, request IDs, etc.)
- Implement structured logging: Use structured logging for easier analysis
- Use correlation IDs: Track requests across services with correlation IDs
- Monitor business metrics: Track metrics relevant to business processes
- Set up proactive alerts: Alert on symptoms, not causes
- Use dashboards effectively: Create focused, purpose-driven dashboards
- Implement health checks: Add comprehensive health checks
- Regular review: Regularly review monitoring and adjust as needed
Logging Best Practices
- Log at the right level: Don't over-log or under-log
- Protect sensitive data: Mask sensitive information in logs (see the masking sketch after this list)
- Use structured logging: Structure logs for easier parsing
- Include metadata: Add useful metadata to logs (service name, version, etc.)
- Standardize log formats: Use consistent log formats across services
- Implement log rotation: Rotate logs to manage disk space
- Centralize logs: Aggregate logs in a central location
- Implement log retention: Define clear log retention policies
- Use contextual logging: Include relevant context in logs
- Log actionable information: Log information that helps troubleshooting
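The "protect sensitive data" practice can be enforced in code as well as through the sensitiveFields option shown earlier. A hypothetical masking helper (the field list and helper name are illustrative):
// Hypothetical helper that masks sensitive fields before anything is logged.
// Complements the "sensitiveFields" option in the logging configuration.
const SENSITIVE_FIELDS = new Set(['password', 'token', 'apiKey', 'authorization']);

function maskSensitive<T>(value: T): T {
  if (Array.isArray(value)) {
    return value.map(maskSensitive) as unknown as T;
  }
  if (value && typeof value === 'object') {
    const masked: Record<string, unknown> = {};
    for (const [key, val] of Object.entries(value as Record<string, unknown>)) {
      masked[key] = SENSITIVE_FIELDS.has(key) ? '***' : maskSensitive(val);
    }
    return masked as T;
  }
  return value;
}

// logger.info('User login', maskSensitive({ user: 'alice', password: 'hunter2' }));
// -> { user: 'alice', password: '***' }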
Troubleshooting with Monitoring
Common Issues and Monitoring Indicators
Issue | Monitoring Indicators | Potential Causes |
---|---|---|
High Latency | Increased llm_mcp_http_request_duration_seconds | Database slowdown, CPU saturation, memory issues |
Error Spikes | Increased llm_mcp_http_requests_total{status_code=~"5.."} | Code bugs, dependency failures, resource exhaustion |
Memory Leaks | Growing process_resident_memory_bytes | Code bugs, improper caching, large response handling |
Connection Issues | Decreased llm_mcp_active_connections | Network problems, misconfiguration, dependency failures |
Database Problems | Increased llm_mcp_database_query_duration_seconds | Slow queries, index issues, connection pool saturation |
Tool Execution Failures | Increased llm_mcp_tool_execution_errors_total | Tool bugs, dependency failures, timeout issues |
Debugging with Logs and Metrics
Steps to debug issues using logs and metrics:
- Identify the issue: Use dashboards to identify symptoms
- Correlate with logs: Find relevant logs using correlation IDs
- Check dependencies: Examine metrics for dependent services
- Analyze patterns: Look for patterns in errors and latency
- Isolate components: Determine which component is causing the issue
- Check recent changes: Correlate issues with recent deployments
- Review resource usage: Check for resource constraints
- Examine traces: Use distributed tracing to follow request flow
- Test hypotheses: Make targeted changes to verify root cause
- Implement fix: Deploy fix and monitor results
Example Troubleshooting Workflow
graph TD;
A[Alert: High Error Rate] --> B[Check Error Logs];
B --> C[Identify Error Pattern];
C --> D[Error Type?];
D -->|Database Errors| E[Check Database Metrics];
D -->|Tool Execution Errors| F[Check Tool Metrics];
D -->|External Service Errors| G[Check External Service Status];
E --> H[Check Database Connections];
F --> I[Check Tool Response Times];
G --> J[Check Network Connectivity];
H --> K[Implement Fix];
I --> K;
J --> K;
K --> L[Verify Fix with Metrics];
L --> M[Update Monitoring if Needed];
This comprehensive guide provides everything you need to effectively monitor and log your LLM-MCP deployment. By implementing these practices, you'll ensure optimal performance, reliability, and observability for your AI model orchestration platform.