MemVerge and Open Source Community Partner to Protect Distributed HPC Applications with DMTCP
ST. LOUIS, November 15, 2021 /PRNewswire/ -- At SC21, MemVerge® and the DMTCP project announced a partnership designed to accelerate the development and adoption of the long-awaited Distributed MultiThreaded Checkpointing (DMTCP) technology.
Checkpointing is commonly used by business applications to minimize downtime, but it is nearly impossible for complex distributed HPC applications with large data sets. In development for over a decade, DMTCP has recently made the impossible possible for a number of workloads, including VLSI circuit simulators, circuit verification, formalization of mathematics, bioinformatics, network simulators, high-energy physics, cybersecurity, big data, middleware, mobile computing, cloud computing, GPU virtualization and high-performance computing (HPC). DMTCP is ready for commercialization and wider deployment.
The collaboration between the DMTCP project and MemVerge will facilitate DMTCP’s entry into the market. The partnership includes MemVerge developers joining the DMTCP project and contributing to open source development; MemVerge providing commercial support for open source DMTCP software; and MemVerge integrating the fully tested and supported version into application-specific big memory solutions. MemVerge has also started a collaboration with the National Energy Research Scientific Computing Center (NERSC) to optimize MPI-Agnostic Network-Agnostic (MANA), a plugin on top of DMTCP that has been used for transparent checkpointing of MPI on the Cori and Perlmutter supercomputers.
“Robust and efficient checkpointing gives us flexibility in scheduling system maintenance tasks and real-time data processing for experimental facilities. This capability also allows us to run jobs more effectively, which ultimately leads to increased system utilization and improved job throughput for our nearly 8,000 scientific users,” said Rebecca Hartman-Baker, User Engagement Group Leader, National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory.
Gene Cooperman, a professor at Northeastern University, has led the open source DMTCP project for almost 20 years. He is particularly excited about the recent three-way collaboration to support MANA for MPI.
According to Professor Cooperman, “The collaboration between NERSC/LBNL, MemVerge and the open source DMTCP community will provide reliable and efficient transparent checkpointing for MPI (and later CUDA) for the production market. While DMTCP and MANA will always remain free and open source, the use of MemVerge technology for fast writing from memory to stable storage will provide a significant improvement to this technology.”
“Distributed checkpointing is a perfect complement to the ZeroIO™ In-Memory Snapshot technology pioneered by MemVerge,” said Charles Fan, CEO of MemVerge. “We look forward to collaborating with the DMTCP community on the future development of the technology and the market.”
“Being able to transparently and gracefully recover from system failures during complex simulation runs is critical to maximizing the efficiency of jobs with long run times,” said Marc Nossokoff, Senior Research Analyst at Hyperion Research. “Checkpointing is a well-understood technique for saving the memory state of independent nodes during a failure and restoring that state when the machine is back up and running. [...] large datasets are expected to enable the adoption of in-memory computing techniques within the HPC and AI communities. Kudos to MemVerge for committing to providing industry leadership to make DMTCP a commercial reality.”
About DMTCP and the DMTCP / MANA project
Distributed MultiThreaded Checkpointing (DMTCP) transparently checkpoints a single-host or distributed computation in user space, without any changes to user code or the operating system. It works with most Linux applications, including Python, Matlab, R, GUI desktops, MPI, etc. It is robust and widely used (on Sourceforge since 2007). MANA is an implementation of transparent checkpointing for MPI. MANA is still under development, but it has already demonstrated robust and transparent checkpointing for computations with 1000 MPI processes.
About MemVerge
MemVerge is the pioneer of Big Memory Computing and Big Memory Cloud technology for the memory-centric, multi-cloud future. MemVerge® Memory Machine™ is the industry’s first software to virtualize memory hardware for precise provisioning of capacity, performance, availability and mobility. In addition to transparent memory service, Memory Machine provides another industry first: ZeroIO™ in-memory snapshots that can encapsulate terabytes of application state in seconds and enable data management at the speed of memory. The revolutionary capabilities of Big Memory Computing and Big Memory Cloud technology open the door to the agility and flexibility of the cloud for thousands of Big Memory applications. To learn more about MemVerge, visit www.memverge.com.