Tutorials
TBA
Title: Multimodal Machine Learning in the Wild: From Foundations to Application in Real-world Scenarios
Abstract: Multimodal machine learning has emerged as a central paradigm in modern AI research, enabling systems to jointly process and reason over heterogeneous data sources such as images, text, and audio. This tutorial provides a structured introduction to the theoretical foundations of multimodal models, covering the key architectural principles, fusion strategies, and training objectives that characterise the field.
Building on these foundations, the tutorial examines two application domains that illustrate the distinct challenges multimodal systems face in real-world deployment. The first scenario addresses a medical application, where models must contend with data scarcity and modality missingness. We discuss how these factors shape model design, evaluation protocols, and the requirements for robustness under incomplete observations. The second scenario focuses on Human-Robot Interaction (HRI), where the demands shift toward low-latency, continuous processing of multimodal streams including vision, speech, and physiological signals. Here, real-time processing of sensory inputs imposes stringent constraints on computational efficiency, synchronization, and uncertainty handling.
By the end of the tutorial, participants will have acquired a principled grounding in multimodal learning and a concrete understanding of how deployment context and domain constraints drive different research agendas.