In this technical deep dive, we’ll explore how orra’s Plan Engine implements an intelligent caching system that goes beyond simple key-value lookups. We’ll examine the architecture, design decisions, and technical challenges involved in creating a semantic caching layer for LLM-generated execution plans.
When we first built orra’s orchestration system, we quickly encountered a challenge that’s common in LLM-powered applications: while LLMs provide powerful dynamic orchestration capabilities, they also introduce significant latency and cost overhead.
As request volumes increased in production environments, we found that many actions were semantically similar or functionally equivalent, yet each one triggered a full, expensive LLM call.
This created both performance bottlenecks and unpredictable scaling costs.
Traditional caching approaches didn’t work well for this problem.
Exact string matching was too brittle for natural language inputs, and simple parameter templating couldn’t handle the flexibility needed for a general-purpose orchestration system. We needed something more sophisticated.
The core challenge we’ll address is how to recognise when different user requests are functionally equivalent despite textual differences, and how to efficiently adapt cached plans to work with new parameters.
This involves solving several technical problems: recognising when two differently worded actions are functionally equivalent, working out which task inputs correspond to which action parameters so cached plans can be adapted, and managing the cache safely in production (per-project isolation, expiry and size limits, and concurrent requests).
The approaches described here aren’t just specific to orra – they can inform your own work when building LLM-powered applications.
As LLMs are incorporated into production systems, techniques for semantic caching, dynamic parameter substitution, and vector similarity can be applied to reduce costs, improve response times, and enhance user experiences across many different contexts.
By the end of this article, you’ll understand the inner workings of orra’s caching system, the tradeoffs involved in its design, and how these principles can be applied to significantly improve both performance and cost-efficiency in your own LLM-powered applications.
At the heart of orra’s orchestration system is the Plan Engine - responsible for coordinating multi-agent workflows through execution plans. An execution plan is a structured representation of how different services and agents should interact to accomplish a user’s requested action.
Here’s a simplified example:
{
  "tasks": [
    {
      "id": "task0",
      "input": {
        "customerId": "CUST789",
        "orderId": "ORD456"
      }
    },
    {
      "id": "task1",
      "service": "customer-service",
      "input": {
        "customerId": "$task0.customerId"
      }
    },
    {
      "id": "task2",
      "service": "order-system",
      "input": {
        "orderId": "$task0.orderId"
      }
    }
  ],
  "parallel_groups": [
    ["task1", "task2"]
  ]
}
These plans are generated by the Plan Engine through an LLM, which analyzes the user’s intent and creates an optimized coordination structure. Note that task0 carries the action’s parameters, which downstream tasks reference via $task0.<field> - a detail that becomes important for caching. While effective, this approach presents two significant challenges: every new action requires a full LLM call, adding seconds of latency, and each call incurs API cost that scales directly with request volume.
This is where the Plan Engine’s caching system comes in. By recognising when new actions are semantically similar to previously processed ones, it can reuse and adapt existing plans rather than generating new ones each time.
Most caching systems work as simple lookup tables - they either find an exact match or they don’t.
When building orra’s Plan Engine, we asked ourselves:
Could we create a caching system that understands when two differently worded requests are asking for the same thing, and then adapt a cached plan to work with new parameters?
The result is a semantic caching layer that recognises when a new action is semantically equivalent to a previously processed one, and then adapts the cached plan to the new action’s parameters.
This approach dramatically changes the performance and cost profile of LLM-driven orchestration:
Traditional approach:
User1: "Process order #1234" → Full LLM API call (1-3 seconds)
User2: "Process order #5678" → Another full LLM API call (1-3 seconds)
With semantic caching:
User1: "Process order #1234" → Full LLM API call (1-3 seconds) + cache storage
User2: "Process order #5678" → Semantic match + parameter substitution (~50ms, no LLM API call)
For applications handling thousands or millions of similar but distinct requests, this can reduce both latency and API costs by over 90%.
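As a rough, purely illustrative calculation: if plan generation costs $0.01 per LLM call (an assumed figure, not a real price) and a workload sees one million actions with a 90% semantic cache hit rate, LLM spend drops from about $10,000 to about $1,000, and nine hundred thousand of those requests return in tens of milliseconds rather than seconds. The per-call price is hypothetical; the shape of the saving is the point.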
Let’s explore how the Plan Engine accomplishes this.
The first part of the solution involves determining when actions are semantically similar. orra’s Plan Engine uses vector embeddings to represent the meaning of actions rather than just their text.
Implementation Details
func (pc *ProjectCache) findBestMatch(query CacheQuery) (*CacheEntry, float64) {
    // Lock for concurrent access safety
    pc.mu.RLock()
    defer pc.mu.RUnlock()

    var bestScore float64 = -1
    var bestEntry *CacheEntry

    // First pass: quick filtering
    for _, entry := range pc.entries {
        // Filter out entries with different service signatures
        if entry.ServicesHash != query.servicesHash {
            continue
        }

        // Filter based on grounding state
        if entry.Grounded != query.grounded {
            continue
        }

        // Calculate semantic similarity
        score := CosineSimilarity(query.actionVector, entry.ActionVector)

        // Track best match
        if score > bestScore {
            bestScore = score
            bestEntry = entry

            // Early exit for near-perfect matches
            if score > 0.999 {
                break
            }
        }
    }

    return bestEntry, bestScore
}
The Plan Engine converts action text into vector embeddings, creating a mathematical representation of semantic meaning. By computing cosine similarity between these vectors, it can identify when two differently worded requests are functionally equivalent.
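For reference, here is a minimal cosine similarity over the gonum mat.VecDense vectors used in the cache entries above. It is a sketch of the standard formula, not necessarily orra’s exact implementation:

import "gonum.org/v1/gonum/mat"

// CosineSimilarity returns dot(a, b) / (||a|| * ||b||): 1.0 when the vectors
// point in the same direction, lower as they diverge.
func CosineSimilarity(a, b *mat.VecDense) float64 {
    normA := mat.Norm(a, 2)
    normB := mat.Norm(b, 2)
    if normA == 0 || normB == 0 {
        return 0 // degenerate embeddings never match anything
    }
    return mat.Dot(a, b) / (normA * normB)
}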
Several questions needed to be addressed in this approach: how similar is similar enough to count as a match, how to avoid matching plans built against a different set of services (hence the services-hash filter), and how to keep grounded and ungrounded plans separate (hence the grounding filter).
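The first of these is typically settled with a minimum similarity threshold. Purely for illustration - the threshold value and helper name below are assumptions, not orra’s actual code:

// lookupPlan treats a match as usable only when it clears a minimum score.
func lookupPlan(pc *ProjectCache, query CacheQuery) (*CacheEntry, bool) {
    const similarityThreshold = 0.95 // assumed value for illustration
    entry, score := pc.findBestMatch(query)
    if entry == nil || score < similarityThreshold {
        return nil, false // miss: fall back to a full LLM plan-generation call
    }
    return entry, true
}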
Finding similar plans solves half the challenge. The next question was:
How can we adapt a cached plan to work with different parameter values while maintaining its structure and validity?
Parameter Mapping and Substitution
When a plan is first generated and cached, the Plan Engine analyzes which task inputs correspond to action parameters:
func extractParamMappings(actionParams ActionParams, task0Input map[string]interface{}) ([]TaskZeroCacheMapping, error) {
    // Maps for different value types
    stringValues := make(map[string]string) // For primitive values
    jsonValues := make(map[string]string)   // For complex types

    // Build lookup maps from action parameters
    for _, param := range actionParams {
        field := param.Field

        // For primitive types (strings, numbers, booleans)
        if isPrimitive(param.Value) {
            stringValues[fmt.Sprintf("%v", param.Value)] = field
        } else {
            // For complex types (arrays, objects), use JSON representation
            jsonBytes, err := json.Marshal(param.Value)
            if err == nil {
                jsonValues[string(jsonBytes)] = field
            }
        }
    }

    var mappings []TaskZeroCacheMapping

    // Find Task0 input values that match action param values
    for task0Field, task0Value := range task0Input {
        matched := false
        actionField := ""
        valueToStore := ""

        // Try to match primitive values
        if isPrimitive(task0Value) {
            strVal := fmt.Sprintf("%v", task0Value)
            if field, ok := stringValues[strVal]; ok {
                matched = true
                actionField = field
                valueToStore = strVal
            }
        } else {
            // Try to match complex values via JSON comparison
            jsonBytes, err := json.Marshal(task0Value)
            if err == nil {
                jsonStr := string(jsonBytes)
                if field, ok := jsonValues[jsonStr]; ok {
                    matched = true
                    actionField = field
                    valueToStore = jsonStr
                }
            }
        }

        if matched {
            mappings = append(mappings, TaskZeroCacheMapping{
                Field:       task0Field,
                ActionField: actionField,
                Value:       valueToStore,
            })
        }
    }

    return mappings, nil
}
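To make this concrete: for the example plan shown earlier, and assuming the original action’s params were named customerId and orderId (field names assumed for illustration), the extracted mappings would look roughly like this:

mappings := []TaskZeroCacheMapping{
    {Field: "customerId", ActionField: "customerId", Value: "CUST789"},
    {Field: "orderId", ActionField: "orderId", Value: "ORD456"},
}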
When a cache hit occurs, the Plan Engine uses these mappings to generate a new version of the plan with updated parameters:
func substituteTask0Params(content string, originalInput, newParams json.RawMessage, mappings []TaskZeroCacheMapping) (string, error) {
    // Parse the execution plan
    var plan ExecutionPlan
    if err := json.Unmarshal([]byte(content), &plan); err != nil {
        return "", fmt.Errorf("failed to parse calling plan for task0 param substitution: %w", err)
    }

    // Parse original Task0 input
    var origTask0Input map[string]interface{}
    if err := json.Unmarshal(originalInput, &origTask0Input); err != nil {
        return "", fmt.Errorf("failed to parse original Task0 input: %w", err)
    }

    // Parse new action params
    var actionParams ActionParams
    if err := json.Unmarshal(newParams, &actionParams); err != nil {
        return "", fmt.Errorf("failed to parse new action params: %w", err)
    }

    // Generate new Task0 input using mappings
    newTask0Input, err := applyParamMappings(origTask0Input, actionParams, mappings)
    if err != nil {
        return "", err
    }

    // Find and update Task0 in the plan
    task0Found := false
    for i, task := range plan.Tasks {
        if task.ID == "task0" {
            plan.Tasks[i].Input = newTask0Input
            task0Found = true
            break
        }
    }
    if !task0Found {
        return "", fmt.Errorf("task0 not found in calling plan")
    }

    // Marshal the updated plan
    updatedContent, err := json.Marshal(plan)
    if err != nil {
        return "", fmt.Errorf("failed to marshal updated plan: %w", err)
    }

    return string(updatedContent), nil
}
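applyParamMappings isn’t shown above; a minimal sketch of what it plausibly does is below - start from the original Task0 input, then overwrite each mapped field with the value of the corresponding field from the new action params. The error handling is an assumption, not orra’s actual code:

func applyParamMappings(origTask0Input map[string]interface{}, newParams ActionParams, mappings []TaskZeroCacheMapping) (map[string]interface{}, error) {
    // Index the new action params by field name for quick lookup
    byField := make(map[string]interface{}, len(newParams))
    for _, p := range newParams {
        byField[p.Field] = p.Value
    }

    // Copy the original input so the cached entry is never mutated
    out := make(map[string]interface{}, len(origTask0Input))
    for k, v := range origTask0Input {
        out[k] = v
    }

    // Substitute every field that maps back to an action parameter
    for _, m := range mappings {
        val, ok := byField[m.ActionField]
        if !ok {
            return nil, fmt.Errorf("new action params missing field %q", m.ActionField)
        }
        out[m.Field] = val
    }
    return out, nil
}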
Several challenging questions had to be solved in this process: matching both primitive and complex (JSON-encoded) parameter values, updating only task0 so that downstream $task0 references remain valid, and failing safely when a cached plan no longer has the expected structure.
To be trusted in production environments, a caching system needs robust management capabilities. Here are some of the questions we addressed in the Plan Engine’s cache management:
How do we keep one project’s cache from affecting others?
Each project maintains its own isolated cache:
func (c *VectorCache) getProjectCache(projectID string) *ProjectCache {
    c.mu.Lock()
    defer c.mu.Unlock()

    pc, exists := c.projectCaches[projectID]
    if !exists {
        pc = newProjectCache(c.logger)
        c.projectCaches[projectID] = pc
        c.logger.Info().
            Str("projectID", projectID).
            Msg("Created new project cache")
    }
    return pc
}
This design ensures that one project’s cached plans can never be served to another project, and that lock contention stays scoped to a single project’s cache.
How do we prevent the cache from growing indefinitely?
The Plan Engine implements both TTL-based expiration and size constraints:
func (c *VectorCache) cache(projectID string, planJson string, actionVector *mat.VecDense, servicesHash string, task0Input json.RawMessage, taskZeroCacheMappings []TaskZeroCacheMapping, actionWithFields string, grounded bool) *CacheEntry {
    pc := c.getProjectCache(projectID)

    // Create new cache entry
    entry := &CacheEntry{
        ID:                     uuid.New().String(),
        Response:               planJson,
        ActionVector:           actionVector,
        ServicesHash:           servicesHash,
        Task0Input:             task0Input,
        CacheMappings:          taskZeroCacheMappings,
        Timestamp:              time.Now(),
        CachedActionWithFields: actionWithFields,
        Grounded:               grounded,
    }

    // Add to project cache with size management
    pc.mu.Lock()
    if len(pc.entries) >= c.maxSize {
        // Remove oldest entry
        pc.entries = pc.entries[1:]
    }
    pc.entries = append(pc.entries, entry)
    pc.mu.Unlock()

    return entry
}
The system also includes automatic cleanup of expired entries:
func (c *VectorCache) cleanup() {
    c.mu.RLock()
    defer c.mu.RUnlock()

    now := time.Now()
    for projectID, pc := range c.projectCaches {
        pc.mu.Lock()
        var validIdx int
        for i, entry := range pc.entries {
            if now.Sub(entry.Timestamp) < c.ttl {
                if validIdx != i {
                    pc.entries[validIdx] = entry
                }
                validIdx++
            }
        }
        pc.entries = pc.entries[:validIdx]
        pc.mu.Unlock()

        c.logger.Debug().
            Str("projectID", projectID).
            Int("remainingEntries", validIdx).
            Msg("Cleaned project cache")
    }
}
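How this cleanup is scheduled isn’t shown here; one straightforward way to drive it - sketched below with assumed method and parameter names - is a ticker goroutine started when the cache is created:

// startCleanupLoop runs cleanup on a fixed interval until done is closed.
// Illustrative only; the real scheduling may differ.
func (c *VectorCache) startCleanupLoop(interval time.Duration, done <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            c.cleanup()
        case <-done:
            return
        }
    }
}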
How do we handle multiple similar requests arriving simultaneously?
The Plan Engine uses a singleflight pattern to prevent duplicate LLM calls for the same action:
func (c *VectorCache) Get(ctx context.Context, projectID, action string, actionParams json.RawMessage, serviceDescriptions string, groundingHit *GroundingHit, backPromptContext string) (*CacheResult, json.RawMessage, error) {
    result, err, _ := c.group.Do(fmt.Sprintf("%s:%s", projectID, action), func() (interface{}, error) {
        return c.getWithRetry(ctx, projectID, action, actionParams, serviceDescriptions, groundingHit, backPromptContext)
    })
    if err != nil {
        return nil, nil, err
    }
    cacheResult := result.(*CacheResult)
    // ...
}
This ensures that concurrent requests for the same action don’t result in duplicate LLM calls, even before the result is cached.
The combination of semantic matching and parameter substitution delivers several concrete benefits for developers building multi-agent applications: sub-100ms responses for repeat actions instead of multi-second LLM calls, a large reduction in LLM API spend, and more predictable scaling as request volumes grow.
The Plan Engine’s caching system isn’t without trade-offs. Some challenges include sensitivity to parameter naming, the cost of generating embeddings at scale, and the loss of the in-memory cache on restart - each discussed below.
orra’s Plan Engine demonstrates a practical approach to improving both the performance and cost-efficiency of LLM-driven applications. By recognizing semantic similarity between actions and dynamically adapting cached plans to new parameters, it provides a path to significantly faster response times and lower API costs, without sacrificing the flexibility that makes LLMs valuable.
Through building this system, we’ve learned several important lessons about balancing performance, cost, and capability in LLM-driven orchestration. At the same time, we’ve identified several areas for continued improvement:
Currently, the system depends on action parameter names being identical between requests. If a developer changes parameter names but keeps the same underlying values and structure, it results in a cache miss. For example:
// Will hit cache if previously seen
action: "Process order", params: { orderId: "12345" }
// Will miss cache, despite functional equivalence
action: "Process order", params: { order_id: "12345" }
We’re exploring techniques to recognise functionally equivalent parameter structures despite naming differences, potentially using schema matching or more advanced embedding techniques.
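One simple possibility - a sketch, not orra’s implementation - is to normalise field names before comparing them, so camelCase and snake_case variants of the same name are treated as equal:

import (
    "strings"
    "unicode"
)

// normaliseFieldName lowercases a field name and strips separators, so
// "orderId", "order_id" and "order-id" all normalise to "orderid".
func normaliseFieldName(name string) string {
    var b strings.Builder
    for _, r := range name {
        if r == '_' || r == '-' {
            continue
        }
        b.WriteRune(unicode.ToLower(r))
    }
    return b.String()
}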
While using embedding models for semantic matching is effective, it raises questions about cost at scale. We’re investigating alternatives, including locally hosted embedding models.
Initial testing suggests local models can reduce costs but may impact match quality or latency, so finding the right balance requires careful evaluation.
The current implementation stores cache entries in memory, meaning the cache is lost if the Plan Engine restarts. This creates a “cold start” problem where performance and cost benefits are temporarily lost after deployments or outages. We plan to implement cache persistence to disk or a database, allowing the Plan Engine to restore its cache on startup and carry the benefits of previously generated plans across restarts and deployments.
Hopefully you’ve found this useful!
Be sure to check out orra if you want to build production-ready multi-agent applications that handle complex real-world interactions.