ARTEMIS: animal recognition through enhanced multimodal integration system
Fazzari, Edoardo; Romano, Donato; Stefanini, Cesare
2025-01-01
Abstract
This paper introduces the Animal Recognition Through Enhanced Multimodal Integration System (ARTEMIS), a transformer-based framework designed for multilabel animal action recognition by fusing video, image, and textual modalities. ARTEMIS utilizes state-of-the-art captioning and language models, such as BLIP2 and Llama 3, to generate textual descriptions from video frames; these descriptions are fed to the model as an additional input and significantly enhance its performance, unlike previous approaches that do not consider this modality. Through comprehensive ablation studies, we explore the contribution of various model components and propose optimization strategies, including genetic algorithms and reinforcement learning, to dynamically adjust ensemble weights. Our feature alignment techniques, based on contrastive and cosine similarity losses, further improve multimodal integration. Evaluations on the Animal Kingdom dataset, which includes 30,100 clips across 140 action classes, demonstrate that ARTEMIS achieves a new state-of-the-art mAP of 79.82, outperforming existing methods. The combination of multimodal fusion and ensemble strategies makes ARTEMIS a robust solution for complex animal action recognition tasks. The code of our fusion method is available at https://github.com/edofazza/ARTEMIS.
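As a rough illustration of two ideas mentioned in the abstract, the sketch below pairs a cosine-similarity alignment loss between modality embeddings with a weighted late fusion of per-modality logits. The function names, tensor dimensions, and the softmax-normalized weighting scheme are assumptions made for illustration only and are not the released ARTEMIS implementation (see the repository linked above for the actual fusion method).

```python
# Minimal sketch (assumptions): cosine-similarity feature alignment between
# paired modality embeddings, plus a weighted late fusion of per-modality
# logits. Names, dimensions, and the weighting scheme are illustrative.
import torch
import torch.nn.functional as F

def cosine_alignment_loss(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Pull paired video/text embeddings together: mean of (1 - cosine similarity)."""
    return (1.0 - F.cosine_similarity(video_emb, text_emb, dim=-1)).mean()

def weighted_fusion(logits_per_modality: list[torch.Tensor], weights: torch.Tensor) -> torch.Tensor:
    """Combine per-modality logits with normalized ensemble weights."""
    w = torch.softmax(weights, dim=0)
    return sum(wi * logits for wi, logits in zip(w, logits_per_modality))

# Toy usage: 4 clips, 512-d embeddings, 140 action classes (as in Animal Kingdom).
video_emb, text_emb = torch.randn(4, 512), torch.randn(4, 512)
align_loss = cosine_alignment_loss(video_emb, text_emb)
logits = weighted_fusion([torch.randn(4, 140) for _ in range(3)], torch.zeros(3))
probs = torch.sigmoid(logits)  # per-class probabilities for multilabel prediction
```

In the paper, the ensemble weights are tuned dynamically with genetic algorithms and reinforcement learning rather than left fixed; the zero-initialized weight vector here merely stands in for the result of that search.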
| File | Size | Format |
|---|---|---|
| s13042-025-02602-3.pdf (open access; type: pre-print/submitted manuscript; license: Creative Commons) | 2.62 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.