Debugging MCP Servers: Logs, Contracts & Timeouts
When an MCP server "just hangs" or returns puzzling errors, you need a reliable path from symptom to fix. This guide gives you a 10-minute reproduction workflow, shows what good logs look like, and walks through the top failure classes: contract mismatches, timeouts, rate limits, and health check pitfalls. Keep it handy as your go-to runbook.
Start Here: A 10-Minute Repro Workflow
Quick Debug Steps
- Capture the failing call. Note client, tool/action, inputs, timestamp, and any request ID.
- Re-run locally with max logging. Enable DEBUG and structured logs; disable batching/retries to see raw behavior.
- Pin versions. Record MCP server version, client/SDK version, and any schema/contract version hash.
- Minimize the input. Reduce to the smallest payload that still fails; save it as a fixture.
- Check server health. Probe liveness/readiness endpoints; note CPU/memory saturation and open file/socket counts.
- Triangulate:
  - If you get 4xx → suspect contract/validation.
  - If 5xx or timeout → suspect downstream, rate limit, or resource exhaustion.
- Decide next branch: contract mismatch path or timeout/limits path (sections below).
Tip: Attach a correlation ID to every request and propagate it through logs/traces. It turns "mystery" into timeline.
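As a minimal sketch of that idea in Python (the CorrelationFilter class and handle_request function are illustrative, not part of any MCP SDK): store the correlation ID in a contextvars.ContextVar when a request arrives, and inject it into every log record with a logging.Filter so all lines for one request share the same request_id.

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's correlation ID for the duration of that request.
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class CorrelationFilter(logging.Filter):
    """Inject the current correlation ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s"))
logging.basicConfig(level=logging.DEBUG, handlers=[handler])

def handle_request(payload: dict) -> None:
    # Reuse the caller's ID when present so the timeline spans client and server.
    request_id_var.set(payload.get("request_id") or str(uuid.uuid4()))
    logging.info("request started tool=%s", payload.get("tool"))
    # ... perform the tool call ...
    logging.info("request finished status=ok")

handle_request({"tool": "search_documents", "request_id": "req-123"})
```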
Reading MCP Server Logs (What Good Looks Like)
Well-designed logs let you answer who, what, when, where, how long, and why.
Minimum Fields
- timestamp, level, request_id/trace_id
- client_id, tool/method
- latency_ms, status, bytes_in/out
- On error: error_code, error_class
- contract_version, retries, upstream_service
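For illustration, here is what a single well-formed record carrying those fields might look like, emitted as one JSON object per line (the emit_log helper and all field values are made up):

```python
import json
import sys
import time

def emit_log(**fields) -> None:
    # One JSON object per line keeps logs both grep-able and machine-parsable.
    fields.setdefault("timestamp", time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()))
    sys.stdout.write(json.dumps(fields, sort_keys=True) + "\n")

emit_log(
    level="ERROR",
    request_id="req-123",
    client_id="client-42",
    tool="search_documents",
    latency_ms=1840,
    status=502,
    bytes_in=512,
    bytes_out=0,
    error_code="UPSTREAM_TIMEOUT",
    error_class="DeadlineExceeded",
    contract_version="2024-06-01",
    retries=2,
    upstream_service="vector-db",
)
```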
Useful Patterns
- Request started/finished pairs with identical request_id
- Input/output hashing (e.g., input_sha256)
- Cause chains: error.cause shows underlying errors
- Sampling: sample successful spans, never sample errors
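A small sketch of the cause-chain pattern in Python, where `raise ... from` plays the role of error.cause and logging.exception records the full chain (the tool name and upstream error are invented):

```python
import logging

logging.basicConfig(level=logging.INFO)

def call_upstream() -> None:
    raise ConnectionError("vector-db: connection refused")  # simulated upstream failure

def run_tool(name: str) -> None:
    try:
        call_upstream()
    except ConnectionError as exc:
        # `from exc` preserves the underlying error, so the log shows both layers.
        raise RuntimeError(f"tool {name!r} failed") from exc

try:
    run_tool("search_documents")
except RuntimeError:
    # logging.exception attaches the traceback, including the chained cause.
    logging.exception("request finished status=error request_id=req-123")
```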
Red Flags
- Missing request_id, ambiguous status messages, or multi-line stack traces without context.
- Fix your logging first; debugging gets 10x easier.
Contract Mismatches: How to Find & Fix
Symptoms
- Immediate 4xx validation errors.
- Tools return "unknown field" or "required property missing."
- Client upgrades suddenly fail, or a rollout breaks only specific client versions.
Contract Fix Checklist
- Compare schemas. Diff the client's expected contract vs server's published contract (types, required fields, enums, nullability).
- Check version negotiation. Ensure the client sends a supported version or feature flags, and that fallback semantics are explicit.
- Tighten validation messages. Return machine-readable errors (path to offending field, expected type, allowed values).
- Add compatibility shims. For minor changes, support old + new field names; mark deprecations with dates.
- Contract tests in CI. Run example payloads from docs against the live schema on every release (see the sketch below).
- Freeze rolling changes. If prod is burning, lock versions and gate deploys behind canaries.
Golden rule: Backward-compatible first—additive changes, never silently repurpose fields.
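A sketch of the "machine-readable errors" and "contract tests in CI" items using the jsonschema package; the search_documents schema and payloads are hypothetical.

```python
from jsonschema import Draft202012Validator

# Published contract for a hypothetical search_documents tool.
TOOL_SCHEMA = {
    "type": "object",
    "required": ["query", "limit"],
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer", "minimum": 1, "maximum": 100},
        "mode": {"enum": ["fast", "thorough"]},
    },
    "additionalProperties": False,
}

def validation_errors(payload: dict) -> list:
    """Return machine-readable errors: JSON path, failed check, expected value."""
    validator = Draft202012Validator(TOOL_SCHEMA)
    return [
        {
            "path": "/" + "/".join(str(p) for p in err.absolute_path),
            "check": err.validator,           # e.g. "type", "enum", "required"
            "expected": err.validator_value,  # e.g. "integer" or the allowed enum values
            "message": err.message,
        }
        for err in validator.iter_errors(payload)
    ]

# Contract test: the documented example must pass; a malformed payload must not.
assert validation_errors({"query": "q3 report", "limit": 5}) == []
print(validation_errors({"query": "q3 report", "limit": "five", "mode": "slow"}))
```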
Timeouts & Retries: From Symptoms to Root Cause
Common Symptoms
- Client sees deadline exceeded.
- Server logs show long tail latencies or canceled requests.
- Partial successes after many retries.
Root Causes
- Downstream slowness: database or external API latency spikes
- N+1 or heavy computation: unbounded loops, large file reads
- Contention or locks: thread pool exhaustion, connection pool wait
- Network issues: DNS delays, packet loss, TLS renegotiation
Fix Path
- Measure the budget: Define SLOs and per-call latency budgets
- Set timeouts explicitly at every hop
- Right-size retries: Use idempotent retries with exponential backoff (see the sketch after this list)
- Break the chain: Add circuit breakers to fail fast
- Trim the work: Stream responses, paginate, cache
- Observe the tail: Track p95/p99 latencies
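Here is a minimal sketch of the timeout-and-retry items above, assuming an idempotent downstream call that accepts a timeout argument: every attempt gets an explicit per-hop timeout, backoff grows exponentially with jitter, and no retry is allowed to outlive the overall budget.

```python
import random
import time

class DeadlineExceeded(Exception):
    pass

def call_with_budget(call, total_budget_s=5.0, per_attempt_s=1.5, max_attempts=4):
    """Retry an idempotent call with exponential backoff, never exceeding the budget."""
    deadline = time.monotonic() + total_budget_s
    for attempt in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            # Explicit per-hop timeout, capped by whatever is left of the budget.
            return call(timeout=min(per_attempt_s, remaining))
        except TimeoutError:
            # Exponential backoff with jitter; stop if it would blow the deadline.
            backoff = 0.2 * (2 ** attempt) * random.uniform(0.5, 1.5)
            if time.monotonic() + backoff >= deadline:
                break
            time.sleep(backoff)
    raise DeadlineExceeded(f"exhausted {total_budget_s}s budget")

# Usage (call_upstream is a placeholder for your downstream client):
# result = call_with_budget(lambda timeout: call_upstream(query, timeout=timeout))
```

Only retry calls you know are idempotent; otherwise a retry can double-apply side effects.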
Rate Limits, Backpressure & Throttling
- Identify where limits apply: at the MCP server, per client, and on each downstream dependency (models, storage, third-party APIs).
- Differentiate 429 vs 503: 429 = client too fast, 503 = server/backends unhealthy.
- Adaptive throttling: slow callers when error/latency rises; communicate Retry-After (see the sketch after this list).
- Queue discipline: use bounded queues; shed load early; prioritize small/fast jobs to keep the system responsive.
- Fairness: allocate per-tenant quotas; prevent noisy neighbors.
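A compact sketch of the queue-discipline and Retry-After points, with made-up numbers: a bounded queue sheds load as soon as it is full, and a rate-limited caller backs off by whatever the server advertises.

```python
import queue
import time

# Bounded queue: reject new work immediately rather than letting latency grow unbounded.
work_queue: "queue.Queue[dict]" = queue.Queue(maxsize=100)

def submit(job: dict) -> bool:
    """Shed load early: return False (a 429 to the caller) instead of queueing forever."""
    try:
        work_queue.put_nowait(job)
        return True
    except queue.Full:
        return False

def backoff_after_429(response_headers: dict, attempt: int) -> None:
    # Honor the server's Retry-After hint when present; otherwise back off exponentially.
    delay_s = float(response_headers.get("Retry-After", 2 ** attempt))
    time.sleep(delay_s)
```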
Health Checks & Readiness Probes
- Liveness: check only process viability; don't cascade restarts on transient downstream issues.
- Readiness: verify dependencies you must have (config loaded, secrets fetched, DB reachable).
- Startup: protect cold starts from premature restarts.
Best Practices
- Return fast, cacheable JSON with probe status and dependency breakdown.
- Expose a lightweight /info or /build endpoint: version, git SHA, contract hash, config checksum.
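A framework-agnostic sketch of a readiness payload with a dependency breakdown and build metadata; check_db, check_secrets, and the build values are placeholders, not part of any MCP framework.

```python
import json
import time

BUILD_INFO = {"version": "1.4.2", "git_sha": "abc1234", "contract_hash": "c0ffee12"}

def check_db() -> bool:
    return True   # replace with a real connectivity check

def check_secrets() -> bool:
    return True   # replace with a real secrets-fetched check

def readiness() -> tuple:
    """Return (http_status, json_body) for the readiness probe."""
    checks = {
        "config_loaded": True,
        "secrets_fetched": check_secrets(),
        "db_reachable": check_db(),
    }
    ready = all(checks.values())
    body = json.dumps({
        "status": "ready" if ready else "not_ready",
        "checks": checks,
        "checked_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        **BUILD_INFO,
    })
    return (200 if ready else 503), body

print(readiness())
```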
Copy/Paste Troubleshooting Checklists
A. Contract Mismatch (4xx)
- Capture the failing payload and its request_id.
- Diff the client's expected schema against the server's published contract.
- Check version negotiation and feature flags; pin versions if needed.
B. Timeout (Deadline Exceeded)
- Confirm explicit timeouts at every hop and the end-to-end budget.
- Check downstream latency, connection pools, and retry behavior.
- Look at p95/p99 latencies, not averages.
C. Rate Limited (429/Throttled)
- Identify which limit was hit: server, per-client, or downstream.
- Back off exponentially and honor Retry-After.
- Bound queues and shed load early.
D. Health Check Failing
- Separate liveness from readiness; read the dependency breakdown.
- Verify config, secrets, and DB reachability before marking ready.
E. Logging & Observability Gaps
- Ensure every entry carries request_id, client_id, tool, latency_ms, status.
- Propagate correlation IDs; never sample error spans.
FAQs
Why do I see different errors between local and prod?
Contracts or dependencies differ. Compare versions, contract hashes, and feature flags. Prod also hits rate limits and real traffic patterns.
How long should my MCP timeout be?
Start from your user SLO (e.g., 2–5s interactive). Allocate budgets per hop and ensure cumulative retries cannot exceed the original deadline.
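For instance, with a 3 s interactive SLO and an assumed two-hop path (MCP server, then one downstream API), the per-hop budgets might be carved up like this; all numbers are examples:

```python
USER_SLO_S = 3.0          # end-to-end interactive budget
overhead_s = 0.3          # auth, serialization, network slack
remaining_s = USER_SLO_S - overhead_s

mcp_timeout_s = 0.9       # MCP server's own work
downstream_timeout_s = 0.8
downstream_retries = 1    # one retry on the flakier hop

worst_case_s = mcp_timeout_s + downstream_timeout_s * (1 + downstream_retries)
assert worst_case_s <= remaining_s, "retries would exceed the original deadline"
```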
What's the quickest way to diagnose a contract bug?
Log the request body path that failed validation and the expected type; provide a curlable minimal example. Contract tests in CI stop it from coming back.
Where should I log inputs without leaking data?
Hash large or sensitive fields; keep raw payloads only in secure, short-retention debug buckets guarded by access controls.
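One way to do that, sketched in Python; the field list and size cutoff are examples to adjust per contract:

```python
import hashlib
import json

SENSITIVE_FIELDS = {"api_key", "email", "document_text"}   # adjust per contract
MAX_LOGGED_BYTES = 1024

def scrub_for_logging(payload: dict) -> dict:
    """Replace sensitive or oversized values with short hashes before logging."""
    scrubbed = {}
    for key, value in payload.items():
        text = json.dumps(value, sort_keys=True, default=str)
        if key in SENSITIVE_FIELDS or len(text) > MAX_LOGGED_BYTES:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
            scrubbed[key] = f"sha256:{digest} ({len(text)} bytes)"
        else:
            scrubbed[key] = value
    return scrubbed

print(scrub_for_logging({"query": "q3 report", "api_key": "sk-secret", "limit": 5}))
```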
Key Takeaways
- Reproduction: Capture failing calls, pin versions, minimize inputs, check health
- Logging: Structured logs with request IDs, correlation IDs, and proper error chains
- Contracts: Backward-compatible changes, machine-readable errors, CI contract tests
- Timeouts: Set explicit budgets per hop, use exponential backoff, add circuit breakers
- Health checks: Separate liveness/readiness, don't cascade failures, expose build info