Evaluating Vision-Enabled LLMs: A Comparative Study on Cloud Detection using Horizon Camera Imagery
Research Article
Keywords: Remote Sensing; Large Language Models; Sky Clouds; Multimodal LLMs

Abstract
The rapid advancement of vision-enabled large language models (LLMs) presents transformative opportunities for specialized domains such as atmospheric science. This study evaluates the efficacy of multimodal LLMs in cloud identification tasks by leveraging a curated subset of the Clouds-1500 dataset, annotated with World Meteorological Organization (WMO) cloud classes. We introduce a novel pipeline that converts segmentation masks into text-based spatial, coverage, and class representations, enabling structured LLM analysis through custom prompts and the BAML library for response standardization. Benchmarking 18 state-of-the-art models revealed significant performance variations, with Anthropic’s Claude 3.5 Sonnet (71.67% class accuracy), OpenAI’s GPT-4o (68.89%), and xAI’s Grok Vision Beta (70.00%) emerging as top performers. However, challenges persist in low-coverage scenarios, where even leading models exhibited accuracy drops of 30–50%. The study demonstrates that while LLMs show promise in interpreting complex meteorological data, their effectiveness depends on task complexity, model architecture, and domain-specific adaptations. These findings provide a framework for integrating LLMs into remote sensing workflows, balancing automation with the precision required for operational meteorology.
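The abstract describes converting segmentation masks into text-based coverage and class representations for structured LLM analysis. The exact conversion used in the study is not given here; the following is a minimal sketch of one plausible approach, assuming an integer-labeled mask and a hypothetical class mapping (`CLASS_NAMES` and the output phrasing are illustrative, not the paper's actual format).

```python
import numpy as np

# Hypothetical label-to-name mapping; the study's WMO class set may differ.
CLASS_NAMES = {0: "sky", 1: "cumulus", 2: "stratus", 3: "cirrus"}

def mask_to_text(mask: np.ndarray) -> str:
    """Summarize a segmentation mask as per-class coverage text
    suitable for embedding in an LLM prompt."""
    total = mask.size
    parts = []
    for cls_id, name in CLASS_NAMES.items():
        if cls_id == 0:
            continue  # skip clear-sky pixels
        pct = 100.0 * np.count_nonzero(mask == cls_id) / total
        if pct > 0:
            parts.append(f"{name}: {pct:.1f}% coverage")
    return "; ".join(parts) if parts else "clear sky"

# Toy 4x4 mask: 4 of 16 pixels labeled cumulus
mask = np.zeros((4, 4), dtype=int)
mask[0:2, 0:2] = 1
print(mask_to_text(mask))  # → cumulus: 25.0% coverage
```

A real pipeline would likely add spatial descriptors (e.g. which image region each class occupies) before passing the text to the prompt template, as the abstract mentions spatial as well as coverage representations.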
License
Copyright (c) 2025 Allan Cerentini, Juliana Marian Arrais, Bruno Juncklaus Martins, Sylvio Luiz Mantelli Neto, Tiago Oliveira da Luz, Aldo von Wangenheim

This work is licensed under a Creative Commons Attribution 4.0 International License.