In the rapidly evolving landscape of multimodal communication research, this paper explores the transformative role of machine learning (ML), particularly using multimodal large language models, in tracking, augmenting, annotating, and analyzing multimodal data. Building upon the foundations laid in our previous work, we explore the capabilities that have emerged over the past years. The integration of ML allows researchers to gain richer insights from multimodal data, enabling a deeper understanding of human (and non-human) communication across modalities. In particular, augmentation methods have become indispensable because they facilitate the synthesis of multimodal data and further increase the diversity and richness of training datasets. In addition, ML-based tools have accelerated annotation processes, reducing human effort while improving accuracy.
Continued advances in ML and the proliferation of more powerful models suggest even more sophisticated analyses of multimodal communication, e.g., through models like ChatGPT, which can now “understand” images. This makes it all the more important to assess what these models can achieve now or in the near future, and what is likely to remain out of reach.
We also acknowledge the ethical and practical challenges associated with these advancements, emphasizing the importance of responsible AI and data privacy. Care must be taken to ensure that the benefits are shared equitably and that the technology respects individual rights.
In this paper, we highlight advances in ML-based multimodal research and discuss what the near future holds. Our goal is to provide insights into this research stream for both the multimodal research community, especially in linguistics, and the broader ML community. In this way, we hope to foster collaboration in an area that is likely to shape the future of technologically mediated human communication.