This paper introduces Animal Recognition Through Enhanced Multimodal Integration System (ARTEMIS), a transformer-based framework designed for multilabel animal action recognition by fusing video, image, and textual modalities. ARTEMIS utilizes state-of-the-art captioning and language models, such as BLIP2 and Llama 3, to generate textual descriptions from video frames, which are input to the model, significantly enhancing its performance unlikely previous results that do not consider this modality. Through comprehensive ablation studies, we explore the contribution of various model components and propose optimization strategies, including genetic algorithms and reinforcement learning, to dynamically adjust ensemble weights. Our feature alignment techniques-using contrastive and cosine similarity losses-further improve multimodal integration. Evaluations on the Animal Kingdom dataset, which includes 30,100 clips across 140 action classes, demonstrate that ARTEMIS achieves a new state-of-the-art mAP of 79.82, outperforming existing methods. The combination of multimodal fusion and ensemble strategies makes ARTEMIS a robust solution for complex animal action recognition tasks. The code of our fusion method is available at https://github.com/edofazza/ARTEMIS.

ARTEMIS: animal recognition through enhanced multimodal integration system

Fazzari, Edoardo
Primo
;
Romano, Donato
Secondo
;
Stefanini, Cesare
Ultimo
2025-01-01

Abstract

This paper introduces Animal Recognition Through Enhanced Multimodal Integration System (ARTEMIS), a transformer-based framework designed for multilabel animal action recognition by fusing video, image, and textual modalities. ARTEMIS utilizes state-of-the-art captioning and language models, such as BLIP2 and Llama 3, to generate textual descriptions from video frames, which are input to the model, significantly enhancing its performance unlikely previous results that do not consider this modality. Through comprehensive ablation studies, we explore the contribution of various model components and propose optimization strategies, including genetic algorithms and reinforcement learning, to dynamically adjust ensemble weights. Our feature alignment techniques-using contrastive and cosine similarity losses-further improve multimodal integration. Evaluations on the Animal Kingdom dataset, which includes 30,100 clips across 140 action classes, demonstrate that ARTEMIS achieves a new state-of-the-art mAP of 79.82, outperforming existing methods. The combination of multimodal fusion and ensemble strategies makes ARTEMIS a robust solution for complex animal action recognition tasks. The code of our fusion method is available at https://github.com/edofazza/ARTEMIS.
2025
File in questo prodotto:
File Dimensione Formato  
s13042-025-02602-3.pdf

accesso aperto

Tipologia: Documento in Pre-print/Submitted manuscript
Licenza: Creative commons (selezionare)
Dimensione 2.62 MB
Formato Adobe PDF
2.62 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11382/577014
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
social impact