This video discusses the limitations of context windows in agent memory systems and introduces persistent memory architectures. It explores how these systems can enhance AI interactions by retaining knowledge across sessions, ultimately improving efficiency and accuracy in processing tasks.
Context Limitations: Even with advancements like 400K-token context windows, traditional LLMs struggle to retain memory across interactions, leading to repeated token costs.
Persistent Memory: Implementing persistent memory architectures allows agents to remember information across sessions, enhancing their ability to provide consistent and informed responses.
Database Integration: Using multiple database types (vector, episodic, network) lets agents efficiently store and retrieve contextual information, improving their operational capabilities.
Agent Evolution: Agents can learn and adapt over time by tracking their experiences, leading to improved capabilities and more efficient problem-solving.
Welcome to Agentic Impressions, where we'll be discussing various topics across the agentic AI landscape. Our first season covers the basics of agentic AI, and today, in our second episode, we'll be discussing agent memory systems. Hi everyone, my name is Shereen Bellamy. I'm a Senior Developer Advocate at Cisco, focusing on AI, security, and quantum, and I'm here to share my findings and discuss them with you. In our last episode, we showed that architecture beats raw power. But here's a problem everyone's still hitting: context windows. Context windows are still a bottleneck in the industry. Even with Claude at around 200,000 tokens, and GPT-5 now up to 400K tokens, you're still constrained by what fits into one single conversation. Let's tackle a specific problem today. We'll call it the token triangle problem. Say we want to work with an LLM that has a context window of 128K tokens. That seems like a large enough number, but we hit the limit after analyzing 85 configs, way short of our goal. One result is high cost through expensive token usage: you pay per token, and everything you send to the LLM gets converted into tokens, as does everything it sends back. Another is the inability to keep interacting with the LLM, because there's a limit on how much it can digest in one go. How many tokens you can send is defined by the LLM's ability, and sometimes by your plan. And LLMs naturally forget between API calls. So here's what the scenario looks like: you send 50K tokens of context, configs, logs, topology, to the LLM. The cost might be, say, 50 cents. You wait for your LLM to respond with advice; that's 500 tokens.
And then five minutes later, you ask a follow-up question to get more clarification, the answer you need, maybe more data. But your standalone LLM has forgotten everything, so you have to send all 50K tokens again, which costs another 50 cents. And this might happen about 10 times in a troubleshooting session while you work through trial and error to fix the issue. Now your total cost comes to around $5, and the total tokens sent can reach 500,000, with massive duplication in that data. At the same time, you're constantly hitting context limits and waiting for a refresh to get your next output. Through the concept of persistent memory, we'll see today how databases can help us overcome that problem, and what specifically we put inside those databases that we can build on in the future. So again, if we look at this issue from a critical-thinking perspective: what if agents had persistent memory that worked across sessions, learned from every interaction, and got better over time? That's what we're focused on today. We're going to build memory architectures. I'll show you persistent memory systems, benchmark different approaches, and demonstrate the concept of agent self-registration, where agents have the ability to remember who they are and what they've learned. Again, when we look at architecture, especially the design of something like memory, we're able to optimize what makes the agent so special: its skills. So let's build it. Let's start by running this persistent agent Python file and see what it does; then I'll show you how the code works. In these lines, the agent is storing network events, router interface down, switch CPU high, interface back up, and remembering the device configurations.
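The re-send cost above is simple arithmetic, and it's worth making concrete. Here's a minimal sketch of that cost model in Python; the per-token price is back-calculated from the episode's illustrative "50 cents per 50K tokens" figure, not real API pricing:

```python
# Illustrative cost model for the stateless re-send scenario:
# every follow-up turn re-sends the full context from scratch.
CONTEXT_TOKENS = 50_000          # configs, logs, topology sent each turn
RESPONSE_TOKENS = 500            # the model's reply
COST_PER_TOKEN = 0.50 / 50_000   # ~$0.50 per 50K tokens, per the example

def session_cost(turns: int) -> tuple[int, float]:
    """Total tokens sent and approximate dollar cost for a session."""
    total_tokens = turns * (CONTEXT_TOKENS + RESPONSE_TOKENS)
    return total_tokens, total_tokens * COST_PER_TOKEN

tokens, dollars = session_cost(10)
print(f"{tokens:,} tokens, ${dollars:.2f}")  # ~500K tokens, ~$5
```

Ten turns lands right around the 500K tokens and $5 mentioned above, and almost all of it is duplicated context.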
Instead of keeping everything in the AI's temporary memory, we're using a database, just like how your routers store their config in NVRAM instead of regular RAM. We have three types of databases: vector, episodic, and network. SQLite is just a file-based database; think of it like a really smart spreadsheet that never forgets, and we're using three of them. We have our vector DB for searching by meaning: if you search something like "interface problems", it will find related events even if they don't contain those exact words. We have an episodic database, which stores things in time order, like your syslog from oldest to newest. Then we have our network database, a custom database for network-specific data like device IDs and IP addresses. So what happens when the agent remembers something? When you call the remember function, like when we remembered that router R1's interface went down, it writes to all three databases. That's three different ways to find that information later. It's like documenting an incident in your ticketing system, your runbook, and your network diagram: a little redundant, but powerful. Next we integrate the AI portion by calling the OpenAI API. When we analyze a network issue, we're not just searching the database; we're asking a real AI to look at the problem and suggest fixes. So let me run it again and show you the AI-powered analysis. This analysis comes from GPT-4 looking at the network issue, noting that the router interface flapped and the switch has high CPU, and giving real troubleshooting steps. It's like having a senior network engineer on call 24/7. When we mix the AI's intelligence with permanent database memory, the AI provides the analysis and the database provides the history. Try closing the terminal and running it again: those .db files are still there, and the agent still remembers everything from last time, making it persistent.
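To make the write-to-three-databases idea concrete, here's a minimal sketch of a remember function. The file names, table names, and schemas are assumptions for illustration, not the exact code from the repo, and the vector store here just keeps raw text (a real version would also store an embedding for search-by-meaning):

```python
import sqlite3
import time

def remember(event: str, device_id: str, severity: str = "info") -> None:
    """Write one network event to all three SQLite stores (sketch)."""
    ts = time.time()
    # Episodic DB: events in time order, like a syslog.
    with sqlite3.connect("episodic.db") as db:
        db.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, event TEXT)")
        db.execute("INSERT INTO events VALUES (?, ?)", (ts, event))
    # Network DB: network-specific fields (device ID, severity).
    with sqlite3.connect("network.db") as db:
        db.execute(
            "CREATE TABLE IF NOT EXISTS net_events "
            "(device TEXT, severity TEXT, event TEXT)"
        )
        db.execute("INSERT INTO net_events VALUES (?, ?, ?)",
                   (device_id, severity, event))
    # Vector DB stand-in: raw text only; a real version adds embeddings.
    with sqlite3.connect("vector.db") as db:
        db.execute("CREATE TABLE IF NOT EXISTS docs (text TEXT)")
        db.execute("INSERT INTO docs VALUES (?)", (event,))

remember("R1 interface GigabitEthernet0/1 down", "R1", "critical")
```

Because these are plain files on disk, the data survives the process exiting, which is exactly the "close the terminal and run it again" behavior described above.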
Okay, so we can store things in a database. Let's run a memory benchmark to see why that matters. This benchmark creates 100 network devices, so imagine a medium-sized enterprise network, and then tests two approaches: one, the persistent database approach we just discussed, and two, the traditional context window approach, which is what most AI tools use. Watch what happens. This for loop creates 100 fake network devices, dev1, dev2, up to dev100. Each has a hostname, IP address, type, and status, just like your real network inventory. For the results, we can see that persistent memory gave us 100% accuracy in two seconds, while context windows gave us 40% accuracy in 0.8 seconds. Our database approach remembers every device perfectly: you can ask it about dev1, and it gives you dev1's info every single time. Our context window approach, on the other hand, only remembers about 40% of devices. Why? Because it's trying to hold 100 devices in temporary memory. It's like trying to memorize the entire network topology during a phone call: you'll most likely forget most of it and only recall some. Also, the individual database lookup is 40 times faster; that's the difference between querying a well-indexed database and searching through notes. And in terms of scale, context windows max out around 20 devices most of the time, while our approach handles 100 devices, five times more, and could handle 10,000 just as easily. But with 10,000 entries, grouping things together really helps, and that's where event correlation comes in. In this section, we can track that dev1 has had three related events: interface flapping detected, packet loss increased, and the interface went down. These events all happened at different times, but we can correlate them because they're all stored permanently with timestamps and device IDs.
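A stripped-down version of that benchmark can be sketched in a few lines. The 20-device cap modeling the context limit and the recall numbers here are illustrative assumptions, not the episode's exact measurements:

```python
import sqlite3

# 100 fake network devices, dev1..dev100, like a real inventory.
devices = [
    {"hostname": f"dev{i}", "ip": f"10.0.0.{i}", "type": "switch", "status": "up"}
    for i in range(1, 101)
]

# Persistent approach: every device is indexed and retrievable.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (hostname TEXT PRIMARY KEY, ip TEXT)")
db.executemany("INSERT INTO inventory VALUES (?, ?)",
               [(d["hostname"], d["ip"]) for d in devices])

# Context-window stand-in: only the most recent N devices "fit".
WINDOW = 20
context = devices[-WINDOW:]

hits_db = sum(
    db.execute("SELECT 1 FROM inventory WHERE hostname = ?",
               (d["hostname"],)).fetchone() is not None
    for d in devices
)
hits_ctx = sum(any(c["hostname"] == d["hostname"] for c in context)
               for d in devices)
print(f"persistent: {hits_db}/100, context window: {hits_ctx}/100")
```

The database answers every lookup; the capped window only recalls whatever still fits, which is the gap the benchmark is measuring.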
That's how you do root cause analysis, which helps us see patterns over time. Context windows can't do that, because they've forgotten the first event by the time the third one happens. Finally, we get to our self-registering agent. This agent can track its own capabilities and learn new ones from experience. Let's run it and see what it says. The agent just created itself with a unique ID. It starts with three basic capabilities: device monitoring, configuration management, and event logging. Now watch what happens when we log a critical network event. We log that router R1's interface went down as a critical event, and on the next line, we can see that the agent automatically learned a new capability called incident troubleshooting. This is possible through our log network event function, which says: if the severity is critical and we haven't learned troubleshooting yet, then learn it. Just like humans build new skills from experience, this agent is learning incident response by doing it. So let's give it four more capabilities. Now our agent knows BGP troubleshooting, automated failover, security analysis, and performance optimization. That's eight total capabilities, up from three. These capabilities are stored in the databases we created earlier, in a table that tracks what the agent knows and when it learned it. If you restart the agent, it will still know all eight capabilities. This matters for network operations because you can track: what has this agent handled before? When did it learn to handle BGP issues? And what evidence do we have of what it can do? Similar to episode one, the conclusion here is that structure matters. Just like you wouldn't store your entire network config in one single text file, you shouldn't store all of your agent's knowledge in temporary memory.
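The self-registration idea above boils down to a capabilities table plus a rule that critical events trigger learning. Here's a minimal sketch; the class name, method names, and capability strings follow the episode's description, not the exact repo code:

```python
import sqlite3
import uuid

class PersistentAgent:
    """Sketch of a self-registering agent: capabilities live in a DB table."""

    def __init__(self, db_path: str = ":memory:"):
        self.agent_id = str(uuid.uuid4())  # unique identity on creation
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS capabilities "
            "(name TEXT PRIMARY KEY, learned_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )
        # The three starting capabilities from the episode.
        for cap in ("device monitoring", "configuration management",
                    "event logging"):
            self.learn(cap)

    def learn(self, capability: str) -> None:
        # PRIMARY KEY + OR IGNORE means a capability is only learned once.
        self.db.execute(
            "INSERT OR IGNORE INTO capabilities (name) VALUES (?)",
            (capability,))

    def capabilities(self) -> list[str]:
        return [row[0] for row in self.db.execute("SELECT name FROM capabilities")]

    def log_network_event(self, event: str, severity: str) -> None:
        # Learning by doing: handling a critical incident teaches a new skill.
        if severity == "critical" and \
                "incident troubleshooting" not in self.capabilities():
            self.learn("incident troubleshooting")

agent = PersistentAgent()
agent.log_network_event("R1 interface down", severity="critical")
print(len(agent.capabilities()))  # started with 3, learned a 4th
```

With a file path instead of `:memory:`, the capabilities table survives restarts, which is what lets you audit what the agent has handled and when it learned each skill.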
You should use the right tool for each job: a database for persistence, vector search for semantic similarity, graphs for relationships, and timelines for chronological events. So what's next? How do you implement this? First, take a look at the GitHub repo in the description and try to implement what we did today. In the repo, you'll see an additional file called agencymemoryimplementation.py. I won't be running it here today. It contains a full agency SDK implementation demo, including its Corto and Longo protocols, named after coffee. Corto is the memory coordination protocol: how agents share and synchronize their memories across a distributed system. And Longo is the self-registration protocol: how agents discover each other and register their capabilities in a directory. It also shows you what a production-ready environment looks like with all of the databases working together. In the real world, this looks like customer service agents that remember every interaction across months. In research, assistants that build knowledge graphs from all your documents. In coding, software development agents that learn your coding patterns and preferences. And in business, agents that maintain context across projects and teams. We're moving from stateless interactions to persistent AI relationships. These aren't just tools; they're collaborative partners that grow smarter with every interaction. Coming up next week, we'll focus on multi-agent teams. For deeper information on agency-specific implementations, feel free to check out that YouTube channel as well. Please subscribe for updates on when we release episode three, and drop a comment on what type of agent team you'd most like to build.
Memory architecture is important to look at, but building a team with what we learned might yield some pretty cool results. As usual, thanks everyone for joining, and I'll see you next time.