Performance and sizing

This guide provides sizing recommendations and performance characteristics to help you plan Virtual MCP Server (vMCP) deployments.

Resource requirements

Baseline resources

Minimal deployment (development/testing):

  • CPU: 100m (0.1 cores)
  • Memory: 128Mi

Production deployment (recommended):

  • CPU: 500m (0.5 cores)
  • Memory: 512Mi

Scaling factors

Resource needs increase based on:

  • Number of backends: Each backend adds minimal overhead (~10-20Mi memory)
  • Request volume: Higher traffic requires more CPU for request processing
  • Data volume: Large inputs and tool responses increase memory usage and network bandwidth
  • Composite tool complexity: Workflows with many parallel steps consume more memory
  • Token caching: Authentication token cache grows with unique client count

Backend scale recommendations

vMCP performs well across different scales:

Backend count   Use case                         Notes
1-5             Small teams, focused toolsets    Minimal resource overhead
5-15            Medium teams, diverse tools      Recommended range for most use cases
15-30           Large teams, comprehensive       Increase health check interval
30+             Enterprise-scale deployments     Consider multiple vMCP instances

Scaling strategies

Horizontal scaling

Horizontal scaling is possible for stateless use cases where MCP sessions can be resumed on any vMCP instance. However, stateful backends (e.g., Playwright browser sessions, database connections) complicate horizontal scaling because requests must be routed to the same vMCP instance that established the session.

Session considerations:

  • vMCP uses MCP session IDs to cache routing tables and maintain consistency
  • Some backends maintain persistent state that requires session affinity
  • Clients must be able to disconnect and resume sessions for horizontal scaling to work reliably

When horizontal scaling works well:

  • Stateless backends (fetch, search, read-only operations)
  • Short-lived sessions with no persistent state
  • Use cases where session affinity can be reliably maintained

When horizontal scaling is challenging:

  • Stateful backends (Playwright, database connections, file system operations)
  • Long-lived sessions requiring persistent state
  • Complex session interdependencies

Configuration

To scale horizontally, increase replicas in the Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmcp-my-vmcp
spec:
  replicas: 3 # Horizontal scaling
  # ... rest of deployment spec

Backend scaling

When scaling vMCP horizontally, the backend MCP servers will also see increased load. Ensure your backend deployments (MCPServer resources) are also scaled appropriately to handle the additional traffic.

Session affinity is required when using multiple replicas. Clients must be routed to the same vMCP instance for the duration of their session. Configure based on your deployment:

  • Kubernetes Service: Use sessionAffinity: ClientIP for basic client-to-pod stickiness
    • Note: This is IP-based and may not work well behind proxies or with changing client IPs
  • Ingress Controller: Configure cookie-based sticky sessions (recommended)
    • nginx: Use nginx.ingress.kubernetes.io/affinity: cookie
    • Other controllers: Consult your Ingress controller documentation
  • Gateway API: Use appropriate session affinity configuration based on your Gateway implementation

Session affinity recommendations

  • For stateless backends: Cookie-based sticky sessions work well and provide reliable routing through proxies
  • For stateful backends (Playwright, databases): Consider vertical scaling or dedicated vMCP instances instead of horizontal scaling with session affinity, as session resumption may not work reliably
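
The Service and nginx Ingress options listed above can be sketched as follows. This is a minimal example using standard Kubernetes and ingress-nginx fields. The resource names, the `app` label, the port number, the hostname, and the cookie name are all placeholders (assumptions), not values defined by vMCP. Use one option or the other, not both.

```yaml
# Option A: IP-based stickiness on the Service.
apiVersion: v1
kind: Service
metadata:
  name: vmcp-my-vmcp            # placeholder name
spec:
  selector:
    app: vmcp-my-vmcp           # placeholder label; must match your vMCP pods
  ports:
    - port: 8080                # placeholder port
      targetPort: 8080
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600      # example stickiness window
---
# Option B: cookie-based stickiness with ingress-nginx.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vmcp-my-vmcp            # placeholder name
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "vmcp-affinity"   # example name
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"         # seconds
spec:
  ingressClassName: nginx
  rules:
    - host: vmcp.example.com    # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vmcp-my-vmcp
                port:
                  number: 8080
```

Cookie-based affinity only helps if the MCP client stores the cookie and sends it back on later requests, so verify this with the clients you expect to support.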

Vertical scaling

Vertical scaling (increasing CPU and memory per instance) is the simplest approach and works for all use cases, including stateful backends. However, it is bounded by node capacity and does not provide high availability on its own, since a single instance failure affects all sessions.

Recommended approach:

  • Start with vertical scaling for simplicity
  • Add horizontal scaling with session affinity when vertical limits are reached
  • For stateful backends, consider dedicated vMCP instances per team/use case

When to scale

Scale up (increase resources)

Increase CPU and memory when you observe:

  • High CPU usage (>70% sustained) during normal operations
  • Memory pressure or OOM (out-of-memory) kills
  • Slow response times (>1 second) for simple tool calls
  • Health check timeouts or frequent backend unavailability

Scale out (increase replicas)

Add more vMCP instances when:

  • CPU usage remains high despite increasing resources
  • You need higher availability and fault tolerance
  • Request volume exceeds capacity of a single instance
  • You want to distribute load across multiple availability zones
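
For the availability-zone case, one possible sketch uses the standard Kubernetes `topologySpreadConstraints` field on the pod template. The `app: vmcp-my-vmcp` label is a placeholder (an assumption); replace it with whatever labels your vMCP pods actually carry. Whether your deployment tooling lets you set this field directly is also an assumption to verify.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmcp-my-vmcp
spec:
  replicas: 3
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # at most 1 pod difference between zones
          topologyKey: topology.kubernetes.io/zone  # well-known node label for zones
          whenUnsatisfiable: ScheduleAnyway         # prefer spreading, but do not block scheduling
          labelSelector:
            matchLabels:
              app: vmcp-my-vmcp                     # placeholder label
      # ... rest of pod spec
```

`ScheduleAnyway` keeps pods schedulable when a zone is unavailable; `DoNotSchedule` enforces the spread strictly at the cost of pending pods.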

Scale configuration

Adjust operational settings when scaling:

Large backend counts (15+)

Health checks become a significant source of overhead with many backends. With the default 30-second interval, 20 backends generate 40 health check requests per minute. Increasing the healthCheckInterval to 60 seconds cuts this overhead in half while still detecting failures within a reasonable timeframe.

Raising the unhealthyThreshold from 3 to 5 prevents transient network issues or brief slowdowns from unnecessarily removing backends from rotation. This is especially important in larger deployments where temporary hiccups shouldn't trigger immediate failover.

spec:
  config:
    operational:
      failureHandling:
        healthCheckInterval: 60s
        unhealthyThreshold: 5

The tradeoff is slower failure detection: with these settings, five failed checks at 60-second intervals mean it takes up to 5 minutes to mark a backend unhealthy. This is acceptable for most use cases, but if your backends serve latency-sensitive operations, consider keeping the 30-second interval and raising only the threshold.
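
The numbers in this section follow from two simple relations, worked through below. Here N is the number of backends, I the health check interval in seconds, and k the unhealthy threshold. The detection time is an approximation that ignores per-check timeouts and assumes the threshold counts back-to-back failed checks.

```latex
\begin{align*}
R &= N \cdot \frac{60}{I} && \text{(health checks per minute)} \\
R &= 20 \cdot \frac{60}{30} = 40 && (I = 30\ \text{s}) \\
R &= 20 \cdot \frac{60}{60} = 20 && (I = 60\ \text{s}) \\[6pt]
T_{\text{detect}} &\approx k \cdot I && \text{(time to mark a backend unhealthy)} \\
T_{\text{detect}} &\approx 3 \cdot 30 = 90\ \text{s} && \text{(defaults)} \\
T_{\text{detect}} &\approx 5 \cdot 60 = 300\ \text{s} && \text{(settings above)} \\
T_{\text{detect}} &\approx 5 \cdot 30 = 150\ \text{s} && \text{(threshold raised, interval kept)}
\end{align*}
```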

High request volumes

Resource requirements scale with request volume and backend complexity. Monitor your current deployment's CPU and memory usage under typical load. If you're seeing CPU throttling (check throttled_time metrics) or memory pressure, increase resources proportionally.

For production deployments handling sustained traffic, allocate headroom for traffic spikes. A good starting point is requests of 1 CPU and 1Gi memory with limits of 2 CPUs and 2Gi memory, allowing the pod to burst during peak load.

spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: '1'
              memory: 1Gi
            limits:
              cpu: '2'
              memory: 2Gi

Watch for memory growth over time if you're using token caching with many unique clients or backends exposing large resource payloads. Use vMCP telemetry to track actual usage and adjust accordingly.

Performance optimization

Backend discovery

Backend discovery happens at vMCP startup and affects how quickly your deployment becomes ready. The discovery mode you choose has significant performance implications:

Discovered mode (default) queries the Kubernetes API to find backends matching your group selector. This is flexible and updates automatically as backends change, but adds 1-3 seconds of startup latency for 10-20 backends. For deployments with frequent pod restarts, this can add up.

Inline mode specifies backends directly in the VirtualMCPServer spec, eliminating Kubernetes API calls for near-instantaneous startup. Use this when your backend set is relatively static and you're willing to update the vMCP configuration when backends change.

Beyond the discovery mode, optimize individual backends by reducing the number of tools and resources each exposes. A backend advertising 50 tools takes longer to initialize than one with 5. Group related tools into focused backend servers rather than creating monolithic servers.

Authentication

Authentication adds latency to every backend request, so optimizing this layer pays dividends at scale. vMCP's token cache (enabled by default) stores tokens with their expiration times, eliminating repeated fetches for the same client. This is particularly effective when you have a small number of clients making many requests.

For internal backends running in the same cluster—especially those providing read-only operations or serving trusted internal services—consider using unauthenticated mode. This removes authentication latency entirely (typically 50-200ms per request) and simplifies configuration. Only do this for backends that don't require access control.

When using OAuth/OIDC authentication, configure token expiration times to balance security and performance. Tokens that expire too quickly (under 5 minutes) force frequent re-authentication. Tokens that last too long (over 1 hour) increase security risk. A 15-30 minute expiration typically provides a good balance.

Composite workflows

Composite workflows enable multi-step automation, but design matters for performance. The key is maximizing parallelism—vMCP can execute up to 10 steps concurrently by default.

When designing workflows, identify which steps must run sequentially versus which are independent. For example, fetching data from three backends should use parallel steps, with only the aggregation step depending on them.

Use onError.action: continue for steps that provide optional enrichment. If a workflow fetches a user profile and then enriches it with activity data, marking the enrichment as continuable ensures failures don't prevent workflow completion. Reserve strict error handling for critical steps.

Set explicit timeouts on steps calling potentially slow backends. Without timeouts, a hanging backend can block execution for 30-60 seconds. If a backend typically responds in under 5 seconds, set a 10-second timeout to fail fast.
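
The design guidance above can be pictured with a sketch like the one below: three independent fetch steps, an aggregation step that depends on them, an optional enrichment step that may fail, and explicit timeouts. Only `onError.action: continue` is taken from this guide. The other field names (`steps`, `id`, `tool`, `dependsOn`, `timeout`) and all tool names are assumptions for illustration, so check the composite tool reference for the actual schema before using it.

```yaml
# Illustrative only: field names other than onError.action are assumed.
steps:
  - id: fetch_profile
    tool: users.get_profile          # hypothetical tool
    timeout: 10s                     # backend usually answers in under 5s
  - id: fetch_activity
    tool: analytics.get_activity     # hypothetical tool
    timeout: 10s
    onError:
      action: continue               # optional enrichment; failure is tolerated
  - id: fetch_orders
    tool: orders.list_recent         # hypothetical tool
    timeout: 10s
  - id: aggregate
    tool: reports.summarize          # hypothetical tool
    dependsOn:                       # the only sequential dependency
      - fetch_profile
      - fetch_activity
      - fetch_orders
```

Because the three fetch steps declare no dependencies on each other, they are candidates to run concurrently; only `aggregate` waits.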

Monitoring

Effective monitoring helps maintain vMCP performance at scale. Use the Telemetry and metrics integration to export metrics to your observability platform.

Focus on these key metrics:

Backend request latency: Track P95 and P99 latency for requests to each backend. Sudden increases indicate backend degradation or network issues. Set alerts when P95 exceeds your SLO.

Backend error rates: Monitor the percentage of failing requests for each backend. Healthy backends should have error rates under 1%. Sustained rates above 5% suggest problems that need investigation.

Health check success rates: Track the percentage of successful health checks for each backend. This leading indicator often reveals problems before they impact user requests. Declining rates may indicate resource pressure or impending failure.

Workflow execution times: For composite workflows, monitor total execution time and per-step timing. This helps identify bottlenecks and verify that workflows are benefiting from parallelism.

Set up dashboards showing these metrics over time and configure alerts for anomalies. Proactive monitoring catches degradation early, often before users notice impact.