System
Exposure Monitor.
Consolidated risk monitoring at a global hedge fund.
A consolidated tally service that absorbs every order and trade from the fund’s portfolio managers and computes risk metrics in real time, wire-to-wire in tens of microseconds, on two servers.
Problem
Independent PM decisions, fund-level imbalance.
Portfolio managers at a global hedge fund manage their portfolios independently, each with its own risk profile and its own decisions on order slicing, long/short balance, and margin utilization. When those decisions run concurrently, they create fund-level imbalances that no single PM can see.
The fund needed a consolidated exposure monitor to absorb every order and trade from every PM in real time, maintain running risk tallies, and feed those tallies back to the PM desks and market-connectivity engines that drive trade slicing. The fund started building it internally and concluded the infrastructure layer was taking more engineering than the decisioning logic itself.
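To make "running risk tallies" concrete, here is a minimal sketch of an incremental tally that updates per-PM and fund-level exposure on each fill. It is an illustration only, not the fund's calculation logic: the `Fill` fields, the `ExposureTally` class, and the use of signed notional at fill price as the exposure measure are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fill:
    """One executed trade. All field names are hypothetical."""
    pm: str        # portfolio manager identifier
    symbol: str    # instrument identifier
    qty: int       # signed quantity: positive = buy, negative = sell
    price: float   # fill price


class ExposureTally:
    """Running tallies, updated in constant time per fill."""

    def __init__(self) -> None:
        self.pm_net = defaultdict(float)      # (pm, symbol) -> signed notional
        self.fund_net = defaultdict(float)    # symbol -> signed notional, all PMs
        self.fund_gross = defaultdict(float)  # symbol -> sum of |per-PM notional|

    def apply(self, fill: Fill) -> None:
        key = (fill.pm, fill.symbol)
        old = self.pm_net[key]
        new = old + fill.qty * fill.price
        self.pm_net[key] = new
        # Adjust fund-level totals by the delta instead of re-summing all PMs.
        self.fund_net[fill.symbol] += new - old
        self.fund_gross[fill.symbol] += abs(new) - abs(old)


tally = ExposureTally()
for f in (
    Fill("pm_a", "XYZ", 100, 10.0),
    Fill("pm_b", "XYZ", 200, 10.0),
    Fill("pm_c", "XYZ", -50, 10.0),
):
    tally.apply(f)

# Each PM sees only its own line; the fund-level view is the sum.
print(tally.pm_net[("pm_a", "XYZ")])  # 1000.0
print(tally.fund_net["XYZ"])          # 2500.0
print(tally.fund_gross["XYZ"])        # 3500.0
```

In this toy run no single PM holds more than 2,000 of notional in XYZ, yet the fund is net long 2,500, which is the kind of imbalance that is only visible in the consolidated view.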
Architectural Constraint
Tallies must update faster than the desk can use them.
For the tallies to be useful, they must update in double-digit microseconds. Any longer and the upstream slicing decisions are made on stale risk. A traditional two-tier architecture cannot meet that budget, because each compute step must fetch the data it needs across the network from the data tier, on a synchronous critical path.
Working around that limit requires multi-threaded concurrency and horizontal scaling of both tiers. Application-level consensus on failure is also non-trivial to engineer; building it from scratch was exactly the kind of work the fund did not want to do.
Rumi solution
Storage, serving, streaming, and compute in one node.
The exposure monitor was rebuilt as a hyperconverged Rumi node. The tally computation logic, the durable tally storage, the serving of tallies to the PM desks, and the streaming of tallies to market-connectivity engines all live in the same node, on the same data.
The fund’s tally calculation logic was ported unchanged. Reliability, including primary/backup consensus and zero-loss replay across failures, comes from the platform rather than the application. Engineering time went back to the decisioning logic, which was the work the fund actually wanted to do.
Operational Outcomes
Tens of microseconds, six million orders a second, two servers.
- 3 months to first deployment
- >1 ms → <50 µs wire-to-wire tally compute latency (20×)
- ~6M orders/sec, sustained
- 1 + 1 primary and backup servers, <4 threads each
- Linear horizontal scaling with cluster partitions
- Zero-loss recovery across network, process, machine, and DC failures
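The figures above can be related with some back-of-the-envelope arithmetic. The sketch below takes the boundary values from the list (1 ms, 50 µs, 6M orders/sec) and assumes they describe steady-state averages, which the list does not state. The in-flight estimate uses Little's law (items in system = arrival rate × time in system); the variable names are illustrative.

```python
# Boundary values taken from the outcomes list; treated here as steady-state averages.
LATENCY_BEFORE_US = 1_000      # >1 ms, expressed in microseconds
LATENCY_AFTER_US = 50          # <50 µs
ORDERS_PER_SEC = 6_000_000     # ~6M orders/sec

# Latency improvement at the boundary values (a lower bound on the real ratio).
speedup = LATENCY_BEFORE_US / LATENCY_AFTER_US

# Average time available per order across the whole node, in nanoseconds.
ns_per_order = 1_000_000_000 / ORDERS_PER_SEC

# Little's law: orders in flight = arrival rate x time in system.
# With latency below 50 µs this is an upper bound.
in_flight = ORDERS_PER_SEC * LATENCY_AFTER_US / 1_000_000

print(f"speedup: {speedup:.0f}x")
print(f"per-order budget: {ns_per_order:.1f} ns")
print(f"orders in flight: {in_flight:.0f}")
```

Under these assumptions the node has roughly 167 ns of aggregate processing time per order and at most about 300 orders in flight at once, which gives a sense of why a network fetch on the critical path of each compute step would not fit the budget.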