SimBricks users can build and run small virtual prototypes comprising only a handful of system components, or use the same building blocks to assemble large system prototypes from hundreds or thousands of components. However, simulating larger systems on a single physical machine quickly results in impractically long simulation times because of limited computational resources. To address this, SimBricks supports distributing larger system simulations across multiple physical machines. In this post, we present how SimBricks addresses three key challenges: 1) minimizing communication overheads that increase simulation time, 2) lowering the complexity of running distributed simulations for users, and 3) avoiding additional implementation complexity in each component simulator.
Communication Challenges for Distributed Simulations
SimBricks realizes virtual prototypes by combining multiple simulator instances for different components as separate, parallel, and loosely coupled processes. Component simulators communicate via shared-memory message passing along natural component interfaces for data transfers and synchronization. When distributing components across multiple physical machines, some of this communication needs to be implemented over the network instead.
However, this introduces two costs: additional processing overhead for sending and receiving data over the network, and higher message transfer latency. While both have the potential to increase simulation times, we found the former to be vastly dominant. With our efficient shared-memory message passing, simulators spend fewer than 100 cycles sending and receiving messages, while sending or receiving a message over the network is typically 100x more expensive. Because bottleneck simulators in particular would incur this overhead on the critical path, it is likely to lead to substantially longer simulation times. Additionally, implementing multiple message-passing mechanisms in each component simulator would substantially increase the complexity and effort of integrating and developing component simulators.
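To make the gap concrete, here is a rough back-of-envelope calculation. It reads the 100-cycle figure as a per-message upper bound and takes the 100x factor at face value; the message count is an arbitrary assumption chosen only for illustration, not a measurement.

```python
# Back-of-envelope comparison of messaging overhead in a simulator.
# The two constants come from the text; everything else is an assumption.
SHM_CYCLES_PER_MSG = 100     # upper bound for shared-memory message passing
NETWORK_COST_FACTOR = 100    # network send/receive is "typically 100x" costlier


def messaging_cycles(num_messages: int, cycles_per_message: int) -> int:
    """Total cycles a simulator spends on message handling."""
    return num_messages * cycles_per_message


messages = 1_000_000  # assumed number of messages handled by one simulator

shm_total = messaging_cycles(messages, SHM_CYCLES_PER_MSG)
net_total = messaging_cycles(messages, SHM_CYCLES_PER_MSG * NETWORK_COST_FACTOR)

print(f"shared memory: {shm_total:,} cycles")
print(f"network:       {net_total:,} cycles")
```

Under these assumptions, a simulator that handled the network itself would spend about 10 billion cycles on messaging instead of about 100 million. If that simulator is the bottleneck, the difference lands directly on the critical path.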
Scale Out with Separate Proxy Processes
To avoid these drawbacks, we instead implement network communication separately in proxies. SimBricks proxies convert between shared-memory message passing and other message transports, such as TCP or RDMA. A component simulator that needs to communicate with a simulator on another host instead connects to a local proxy instance using regular SimBricks shared-memory message passing. The local proxy then forwards messages over the network to a proxy on the remote host, which in turn converts them back to shared-memory message passing. Since the proxy on each host runs as a separate process, it requires an additional processor core on each host.
However, moving the implementation of network communication into a separate process on a separate core has two key advantages. First, this approach is fully transparent to component simulators and requires no changes in individual simulators. Second, and more importantly, it moves network processing overheads out of individual simulators, especially bottleneck simulators, and thereby avoids increasing simulation time in most cases.
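The forwarding path described above can be sketched in a few lines. This is a minimal illustration and not the SimBricks proxy implementation: a `queue.Queue` stands in for a shared-memory queue, threads stand in for the separate proxy processes, a local socket pair stands in for the TCP connection between hosts, and the length-prefixed framing and all function names are assumptions made for this example.

```python
import queue
import socket
import struct
import threading

_HEADER = struct.Struct("!I")  # 4-byte big-endian length prefix (assumed framing)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf.extend(chunk)
    return bytes(buf)


def forward_to_network(local_q: queue.Queue, sock: socket.socket) -> None:
    """Sending proxy: drain the local queue and write framed messages."""
    while True:
        msg = local_q.get()
        if msg is None:  # sentinel: signal end of stream with a zero length
            sock.sendall(_HEADER.pack(0))
            return
        sock.sendall(_HEADER.pack(len(msg)) + msg)


def forward_from_network(sock: socket.socket, local_q: queue.Queue) -> None:
    """Receiving proxy: read framed messages and enqueue them locally."""
    while True:
        (length,) = _HEADER.unpack(_recv_exact(sock, _HEADER.size))
        if length == 0:  # end-of-stream marker, so empty messages are unsupported
            local_q.put(None)
            return
        local_q.put(_recv_exact(sock, length))


def demo() -> list:
    sender_q, receiver_q = queue.Queue(), queue.Queue()
    a, b = socket.socketpair()  # stands in for a TCP connection between hosts
    tx = threading.Thread(target=forward_to_network, args=(sender_q, a))
    rx = threading.Thread(target=forward_from_network, args=(b, receiver_q))
    tx.start()
    rx.start()
    # The "simulator" only ever touches its local queue.
    for msg in (b"pcie-read", b"eth-frame"):
        sender_q.put(msg)
    sender_q.put(None)
    tx.join()
    rx.join()
    a.close()
    b.close()
    received = []
    while (msg := receiver_q.get()) is not None:
        received.append(msg)
    return received
```

The simulator side of this sketch never calls into the socket API. All network work happens in the forwarding functions, which is what keeps that cost off the simulator's critical path.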
At the moment, SimBricks provides two proxy implementations supporting two protocols for network communication: TCP and RDMA. Surprisingly, we found that for most synchronized simulations, lower-latency RDMA proxies provided no benefit over TCP, as message latency was not a bottleneck. Additional proxies can easily be added to support further communication protocols, e.g., for simulations that benefit from HPC interconnects.
Orchestrating Distributed Simulations
While proxies provide users with the necessary building blocks for assembling even large distributed simulations, manually instantiating and configuring proxies, assigning simulators to hosts, and so on is extremely tedious and error-prone. To make this easier, the SimBricks orchestration framework can automatically configure proxies and distribute full-system simulations across multiple machines. The user first prepares a regular SimBricks simulation configuration, just as with a non-distributed simulation. Next, the user can either rely on automatic partitioning and distribution to hosts, or manually assign components to physical hosts for more control. From there, the orchestration framework takes care of instantiating and configuring proxies and orchestrating the execution of all simulators across the available machines. Finally, the orchestration framework collects outputs from simulators exactly as with non-distributed SimBricks simulations.
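To illustrate the kind of bookkeeping the orchestration framework takes off the user's hands, the sketch below derives which links need proxies from a manual assignment of components to machines. It does not use the SimBricks orchestration API; `plan_proxies`, the component names, and the machine names are all hypothetical.

```python
from collections import defaultdict


def plan_proxies(links, assignment):
    """Split links into local ones and cross-machine ones that need proxies.

    links:      iterable of (component_a, component_b) pairs
    assignment: dict mapping each component to the machine it runs on
    Returns (local_links, remote_links), where remote_links maps a sorted
    machine pair to the links a proxy pair between them must carry.
    """
    local_links = []
    remote_links = defaultdict(list)
    for a, b in links:
        machine_a, machine_b = assignment[a], assignment[b]
        if machine_a == machine_b:
            local_links.append((a, b))  # plain shared-memory channel
        else:
            pair = tuple(sorted((machine_a, machine_b)))
            remote_links[pair].append((a, b))  # routed through proxies
    return local_links, dict(remote_links)


# Hypothetical two-host topology: each host has a NIC, both NICs share a switch.
links = [
    ("host0", "nic0"),
    ("nic0", "switch"),
    ("host1", "nic1"),
    ("nic1", "switch"),
]
assignment = {
    "host0": "machineA",
    "nic0": "machineA",
    "switch": "machineA",
    "host1": "machineB",
    "nic1": "machineB",
}

local_links, remote_links = plan_proxies(links, assignment)
```

With this assignment only the link between `nic1` and `switch` crosses machines, so a single proxy pair between `machineA` and `machineB` suffices. Moving `switch` to a third machine would instead turn both NIC-to-switch links into proxied links.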
If you have questions or would like to learn more: