MLSysOps

Machine Learning for Autonomic System Operation in the Heterogeneous Edge-Cloud Continuum

Overview

MLSysOps will make substantial research contributions to AI-based system adaptation across the cloud-edge continuum by introducing advanced methods and tools for optimal system management and application deployment. The project will design, implement and evaluate a complete framework for autonomic end-to-end system management across the full cloud-edge continuum. The framework will employ a hierarchical agent-based AI architecture to interface with the underlying resource management and application deployment/orchestration mechanisms of the continuum. Adaptivity will be achieved through continual ML model learning combined with intelligent retraining that runs concurrently with application execution, while openness and extensibility will be supported through explainable ML methods and an API for pluggable ML models. Flexible and efficient application execution on heterogeneous infrastructures and nodes will be enabled through innovative portable container-based technology. MLSysOps uses ML models to address several key concerns: energy efficiency; performance and low latency; efficient, resilient and trusted tier-less storage; cross-layer orchestration that includes resource-constrained devices; resilience to imperfections of physical networks; and trust and security. The framework architecture separates management from control and interfaces seamlessly with popular control frameworks for the different layers of the continuum. The framework will be evaluated on research testbeds as well as two real-world application-specific testbeds in the domains of smart cities and smart agriculture. These testbeds will also be used to collect the system-level data needed to train and validate the ML models, while realistic system simulators will be used to conduct scale-out experiments. The MLSysOps consortium is a balanced blend of academic/research and industry/SME partners, bringing together the scientific and technological skills needed to ensure successful implementation and impact.

Contribution

Project management; development of the MLSysOps framework; development of AI models for the management of computing resources

Acknowledgement

EU-funded