(Original Post courtesy of StreamNative)
Introduction: The Challenge of Building an AI Platform
Late one afternoon, a team of developers set out to build TrustGraph – an open-source AI product creation platform aimed at orchestrating sophisticated AI agents. They faced a familiar challenge: how to connect a constellation of microservices (knowledge extractors, vector indexers, agent runtimes, etc.) into one cohesive system that can scale and adapt dynamically. Traditional point-to-point integrations felt brittle and hard to scale. The team needed a nervous system for their platform – a messaging backbone that could seamlessly link all components in real-time. Enter Apache Pulsar, the technology that would become the high-performance core of TrustGraph’s event-driven architecture. Pulsar (with enterprise support from StreamNative) offered exactly what TrustGraph needed: a reliable publish/subscribe foundation with the flexibility to handle everything from real-time agent queries to large-scale data ingestion. What follows is the story of how Pulsar powers TrustGraph, enabling developers to build modular AI systems that are scalable, resilient, and a joy to work with.
Why TrustGraph Chose Pulsar as its Backbone
From the outset, the TrustGraph engineers recognized that building a scalable AI platform meant embracing event-driven design. They needed a messaging layer that could support diverse workloads – from synchronous API calls to asynchronous data pipelines – without becoming a bottleneck. Apache Pulsar stood out for several reasons:
- It “just works” for ops: Pulsar provides an operations-friendly way to connect complex processing elements. Its simplicity in managing communication patterns and scaling freed the team from writing custom pipeline glue code. Site reliability engineers could focus on deploying and monitoring AI capabilities rather than debugging message passing.
- Native Pub/Sub Model: Pulsar’s publish-subscribe architecture was a perfect fit for TrustGraph’s decoupled microservices. Components like the Knowledge Graph Builder, AI Agent Runtime, and data processors communicate by publishing events and subscribing to the topics they care about – no direct dependencies needed. This decoupling means each service can evolve or scale independently, a critical requirement for a modular AI platform.
- Persistent and Non-Persistent Topics: Pulsar lets you choose between persistent and non-persistent messaging on a per-topic basis. TrustGraph leverages this to balance reliability against latency. For critical data (e.g. ingesting documents into a knowledge base), TrustGraph uses persistent topics to guarantee delivery – ensuring no data is lost even if a service goes down. Conversely, for high-speed, ephemeral interactions (like an AI agent responding to a user query), TrustGraph uses non-persistent topics to minimize overhead and latency. This flexibility ensures each use case gets the right trade-off between speed and safety.
- Multi-Tenancy and Isolation: Pulsar’s built-in multi-tenancy (via tenants and namespaces) proved invaluable for TrustGraph’s vision of dynamic “Flows.” A Flow in TrustGraph is essentially an isolated AI pipeline or workspace. Pulsar’s tenant/namespace model allows TrustGraph to create isolated channels for each Flow, ensuring that projects or tenants don’t interfere with each other’s data streams. This strong isolation was critical for enabling TrustGraph to support multiple concurrent AI agent workflows in one cluster, whether they belong to different teams, customers, or use cases.
In summary, Pulsar provided the scalability, flexibility, and reliability that TrustGraph needed in a messaging backbone. As Mark Adams, Co-founder of TrustGraph, put it, building on Pulsar gave them confidence that the communication layer would not be the limiting factor in scaling intelligent agents. It laid a rock-solid foundation on which to construct an AI platform ready for both rapid iteration and production-grade stability.
Architecting TrustGraph with Pulsar: Key Patterns
With Apache Pulsar at its core, TrustGraph’s architecture evolved a set of powerful patterns. These patterns illustrate how Pulsar’s features are used in practice to create an event-driven, modular AI system:
1. Dynamic and Scalable “Flows”: In TrustGraph, a Flow represents a configurable pipeline of AI tasks (for example, a data ingestion flow or an agent reasoning flow). Some services are global (shared across all Flows), while others are flow-specific. Pulsar enables this dynamic behavior through dynamic queue naming and creation.
- Global Services (like configuration, knowledge base, and librarian APIs) listen on well-known, fixed Pulsar topics since they are always available and shared.
- Flow-Hosted Services (like a GraphRAG processor, Agent runtime, or custom embeddings service) spin up when a new Flow is started. TrustGraph automatically generates unique Pulsar topics for that Flow’s services. For example, if a Flow is named research-flow, the GraphRAG service in that flow might publish/subscribe on topics named:
non-persistent://tg/request/graph-rag:research-flow
non-persistent://tg/response/graph-rag:research-flow
Each new Flow gets its own set of topics, isolating its traffic. Multiple Flows can run concurrently without stepping on each other’s messages – a huge win for multi-project and multi-tenant deployments. When the Flow is stopped, its topics can be torn down just as easily. This dynamic provisioning of queues means the platform can scale out new pipelines on the fly with full isolation, all thanks to Pulsar’s flexible naming and multi-tenancy.
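The per-flow naming scheme above can be sketched as a small helper. This is an illustrative sketch, not TrustGraph's actual code: the function name is hypothetical, while the `tg` tenant and the request/response topic pattern follow the example names shown above.

```python
def flow_topic(kind: str, service: str, flow: str,
               persistent: bool = False) -> str:
    """Build a Pulsar topic name for a flow-hosted service.

    Mirrors the naming pattern shown above, e.g.
    non-persistent://tg/request/graph-rag:research-flow
    """
    scheme = "persistent" if persistent else "non-persistent"
    return f"{scheme}://tg/{kind}/{service}:{flow}"

# Each Flow gets its own isolated request/response pair of topics.
req = flow_topic("request", "graph-rag", "research-flow")
resp = flow_topic("response", "graph-rag", "research-flow")

# Durable pipeline stages would use the persistent scheme instead.
ingest = flow_topic("flow", "text-load", "research-flow", persistent=True)
```

Because topic names embed the flow name, starting a second flow (say `analysis-flow`) yields a disjoint set of topics, which is what keeps concurrent Flows from seeing each other's traffic.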
2. Diverse Communication Patterns (Pub/Sub Flexibility): TrustGraph doesn’t force a one-size-fits-all messaging style; instead, it uses Pulsar to support different interaction patterns within the platform:
- Request/Response Messaging: For interactive services—such as an AI Agent API or the GraphRAG query service—TrustGraph sets up dedicated request and response topics. For example, when a user’s query hits the Agent service, it is published to a request topic, the agent processes it, and the answer comes back on a response topic tied to that user’s session or flow. This pub/sub request-response pattern feels like a direct call from the client’s perspective, but under the hood it’s decoupled and asynchronous. The client can await a response without knowing which specific service instance will handle it. This pattern gives synchronous behavior on top of asynchronous internals, combining interactivity with scalability.
- Fire-and-Forget Ingestion: For one-way data pipelines like ingesting documents, TrustGraph uses a simpler fire-and-forget approach. A client (say, a data loader component or a user uploading a file) will publish data to an ingestion topic and immediately move on. Downstream processor services (e.g. a Text Load service or a Triples Store loader) are subscribed and will process the data in due course. Crucially, these ingestion topics are persistent in Pulsar. This guarantees that if a processor is slow or temporarily down, the data remains in the queue until processed, ensuring no loss. Developers benefit by not having to babysit the pipeline – they trust Pulsar to eventually deliver data when the consumers are ready, improving the system’s resilience to spikes or faults.
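The request/response pattern rests on correlating each response with the request that triggered it. The sketch below illustrates the idea using in-memory queues as stand-ins for the two Pulsar topics (with the real pulsar-client, these would be a producer on the request topic and a consumer on the response topic); the function names and message shape are assumptions for illustration.

```python
import queue
import uuid

# In-memory stand-ins for the request and response Pulsar topics.
request_topic: "queue.Queue[dict]" = queue.Queue()
response_topic: "queue.Queue[dict]" = queue.Queue()

def handle_one_request() -> None:
    """Service side: consume one request and publish the answer,
    tagged with the same correlation id so the caller can match it."""
    msg = request_topic.get(timeout=5)
    response_topic.put({"id": msg["id"], "answer": f"echo: {msg['question']}"})

def ask(question: str) -> str:
    """Client side: publish a request, then await the response whose
    correlation id matches ours. Feels synchronous to the caller,
    but the transport underneath is fully decoupled pub/sub."""
    corr_id = str(uuid.uuid4())
    request_topic.put({"id": corr_id, "question": question})
    handle_one_request()  # in production this runs inside the agent service
    while True:
        msg = response_topic.get(timeout=5)
        if msg["id"] == corr_id:
            return msg["answer"]
```

Fire-and-forget ingestion is the same publish step without the waiting half: the producer sends to a persistent topic and returns immediately, leaving delivery to the broker.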
3. Centralized, Push-Based Configuration: Running a complex AI platform means lots of configuration: prompts for the LLM, tool definitions for agents, pipeline parameters, etc. TrustGraph chose to manage configuration changes through Pulsar as well, turning config into an event stream. There is a dedicated Pulsar topic (e.g. persistent://tg/config/config) that acts as a central config channel. Whenever an administrator or developer updates a configuration – for instance, adjusting a prompt template or adding a new tool plugin – that update is published as a message on the config topic. All services that care about config subscribe to this channel. TrustGraph’s services (built on the common FlowProcessor and AsyncProcessor base classes) are designed to receive these config events and reconfigure themselves on the fly. The moment a new Flow is launched or a parameter changes, every component gets the memo via Pulsar and updates its behavior without needing a restart. This push-based config distribution makes the platform highly dynamic – developers can deploy new capabilities or tune the system in real-time, and Pulsar ensures a consistent configuration state across the distributed system.
These patterns highlight a theme: Pulsar decouples parts of the system while keeping them coordinated. Dynamic topic creation lets TrustGraph scale out new processing flows easily. Multiple messaging patterns let each service communicate in the style that fits its role. A config event stream keeps everything in sync. All of it is implemented on Pulsar’s robust pub/sub substrate, meaning it inherits Pulsar’s strengths like horizontal scalability, durability, and back-pressure handling.
Benefits to Developers and AI Teams
By weaving Pulsar so deeply into its design, TrustGraph reaps numerous benefits that directly address pain points developers often face in building AI systems:
- Easier Scaling: Need to handle more load? Simply add more consumers to a Pulsar topic to scale out a microservice – no complex rebalancing needed. Because each TrustGraph component processes messages from a queue, scaling is as straightforward as running another instance that subscribes to the same topic. For example, if the AI Agent requests spike, the team can spin up additional agent service containers; Pulsar will automatically distribute requests among them. This elasticity means the system can handle varying workloads on different parts of the AI pipeline without a hitch.
- Resilience and Fault Tolerance: Pulsar’s persistent messaging ensures critical data isn’t lost if something fails. Developers don’t have to write custom retry logic or worry about data gaps – if the Knowledge Graph builder goes down for a bit, all pending documents remain queued. When it comes back up, it picks up where it left off. Also, thanks to the decoupled design, a failure in one component (e.g., the vector embedding service) won’t crash the entire platform. Messages will queue up until that service recovers, while the rest of the system continues unaffected. By containing failures this way, the design makes the overall platform far more robust in production.
- Flexibility for New Features: The dynamic Flow architecture allows teams to deploy new pipelines or custom components without modifying the core system. Because Pulsar handles the routing, a new service can be introduced by simply defining the topics it will use and plugging it in. This pluggable architecture means TrustGraph can evolve quickly. For instance, a developer could add a new “Sentiment Analysis” microservice into a Flow by having it subscribe to an intermediate topic – no need for a full redeploy or breaking existing flows. Pulsar’s multi-tenant setup means this can happen in an isolated way, so experimentation in one Flow won’t disrupt others.
- Better Observability: Because Pulsar is the central hub for all messages, it provides a one-stop view into the system’s activity. TrustGraph takes advantage of Pulsar’s metrics – like message rates, consumer backlogs, throughput, and latency per topic – to give developers deep insight into how each part of the platform is performing. These metrics feed into Grafana dashboards where the team can see, for example, if the ingestion queue is backing up or if response times on the agent request topic are rising. Such observability helps pinpoint bottlenecks quickly (maybe a vector DB is slow, causing a backlog) and aids in capacity planning. It essentially turns Pulsar into a stethoscope on the health of the AI platform.
- Faster Iteration: Perhaps most importantly, this Pulsar-driven architecture empowers faster development cycles. Because adding new flows or services is low-friction, developers can prototype new AI capabilities without weeks of pipeline engineering. The combination of fewer bottlenecks, auto-scaling behavior, safe fault handling, and real-time config updates means the team spends less time on infrastructure and more on innovating AI features. In practice, that could mean quickly trying a new large language model in the Agent service or connecting an experimental knowledge source – TrustGraph will handle the messaging and integration details, so the developer can focus on the AI logic.
All these benefits fundamentally spring from Pulsar’s role as a unified messaging layer. It abstracts away the hard parts of distributed communication (scaling, reliability, ordering, isolation), letting developers concentrate on building intelligent agents and knowledge pipelines.
Pulsar in Action: A Day in the Life with TrustGraph
To cement how Pulsar powers real-world usage of TrustGraph, let’s walk through a hypothetical scenario:
Meet Alice, an AI engineer at an enterprise, who is using TrustGraph to build a new AI-powered research assistant. She begins her day by defining a new processing Flow for the project, aptly named research-flow. When Alice starts this Flow via TrustGraph’s CLI, under the hood the platform spins up microservices for that Flow – an Agent service, a GraphRAG service, an Embeddings service, etc. – each with their own Pulsar topics. Alice doesn’t have to manually configure any queues; Pulsar automatically provisions topics like tg/request/graph-rag:research-flow and tg/response/graph-rag:research-flow for her new Flow. Immediately, her Flow’s services begin running in isolation. In fact, a colleague can launch a separate analysis-flow in parallel, and thanks to Pulsar, the two sets of services won’t conflict. This allows different teams to use TrustGraph on the same infrastructure, each with their own dedicated message streams.
Later that morning, Alice feeds a batch of documents (PDF reports) into TrustGraph for ingestion. As she uploads them via the Workbench UI, each document’s content is published as a message to the Text Load service’s Pulsar topic. The ingestion is designed as fire-and-forget – the upload request immediately returns, and Alice can go grab a coffee while TrustGraph pipelines the data. Pulsar’s persistent queue means even if the Text Load processor or downstream Knowledge Graph builder is busy, all documents will be queued reliably. After a brief break, Alice checks the dashboard: the documents are being processed one by one, and there are no errors. One of the processing containers did restart (maybe due to a transient error), but because of Pulsar, no data was lost and the pipeline resumed automatically once the service recovered. Alice silently thanks the decision to use Pulsar; in past projects with DIY messaging, a crash often meant writing custom retry logic or manual data cleanup, but not anymore.
In the afternoon, Alice decides to improve the AI agent’s behavior by tweaking its prompt and adding a new tool for it. She opens TrustGraph’s configuration UI, updates the prompt template, and registers an external API as a new tool. The moment she hits “Save”, TrustGraph’s Config service publishes an update event to the tg/config/config topic. All running services in research-flow receive this update within milliseconds, thanks to their Pulsar subscriptions. The Agent runtime immediately pulls in the new prompt and tool definitions – there’s no need to restart anything. Alice initiates a test query to her agent; it responds using the updated prompt format and can even call the new API tool as needed, all in real-time. This kind of live reconfiguration makes it incredibly easy for Alice to iterate on her AI agent’s capabilities. In traditional setups, such changes might require editing config files on multiple servers or restarting processes, disrupting the workflow. With Pulsar’s event-driven config, TrustGraph achieves seamless, centralized control.
Before wrapping up, Alice reviews the system’s performance. Using TrustGraph’s observability stack, she notices the message backlog on the research-flow ingestion topic grew slightly during peak load, but then drained as additional consumers auto-scaled. The Grafana metrics (sourced from Pulsar) show healthy throughput. One insight stands out: the response queue for the Agent service shows occasional latency spikes. Investigating further, Alice realizes that complex user questions trigger multiple knowledge searches, slowing responses. She decides to allocate another instance of the GraphRAG service to that Flow to handle these heavy queries. Thanks to Pulsar, scaling out is straightforward – the new instance will simply become another consumer on the relevant topics. Sure enough, once deployed, the next test query is handled faster, as the load is now balanced. The bottleneck was resolved by a one-line configuration change to scale the service, without any code changes or downtime. This agility in tuning performance is a direct consequence of the Pulsar-based design.
By the end of the day, Alice has not only built a functioning AI research assistant, but she’s also iterated on it multiple times – all without struggling with messaging middleware. TrustGraph, empowered by Pulsar, took care of the heavy lifting: routing messages, preserving data, triggering reconfigurations, and scaling services on demand. For Alice, the developer experience is night-and-day compared to earlier projects. She can focus on crafting AI logic, confident that the event-driven backbone (powered by Pulsar and StreamNative’s expertise) will handle the rest.
Conclusion: Pulsar as the Foundation for Event-Driven AI
The story of TrustGraph underscores a broader lesson for AI platform developers: a robust messaging backbone is the key to unlocking scalable, modular, event-driven systems. Apache Pulsar proved to be that backbone for TrustGraph – acting as the central nervous system that links independent AI modules into one intelligent whole. Its pub/sub model, dynamic queue management, multi-tenancy, and mix of persistent vs. transient messaging enabled TrustGraph to achieve a level of flexibility and resilience that would be hard to realize otherwise. By using Pulsar, the TrustGraph team and its users gained scalability, fault tolerance, and speed of iteration as first-class features of the architecture. Developers can add new capabilities without fear of breaking the system, ops engineers can sleep easier knowing spikes or failures won’t collapse the pipeline, and organizations can deploy multiple AI agent flows concurrently with confidence in their isolation and security.
In essence, Pulsar (with StreamNative’s enterprise support in the wings) serves as the foundation for TrustGraph’s vision of an AI platform. It demonstrates how an advanced event streaming technology can solve the pain points of building AI products: eliminating brittle point-to-point links, preventing data loss, simplifying scaling, and improving observability. For any team looking to build the next generation of AI systems – be it autonomous agents, real-time analytics, or context-driven LLM applications – the combination of TrustGraph’s modular framework and Pulsar’s event-driven backbone offers a compelling blueprint. Pulsar enabled TrustGraph to transform from an ambitious idea into a production-grade reality, reinforcing its role as a foundational enabler for event-driven AI platforms. The result is a story of technology empowering developers: with Apache Pulsar under the hood, TrustGraph can truly deliver on its promise of creating intelligent, context-aware AI agents at scale.
Contact:
Discord Community
https://trustgraph.ai
https://github.com/trustgraph-ai/trustgraph