Writing assistants such as Grammarly and Microsoft Copilot typically generate diverse image captions through syntactic and semantic variation. Human-written captions, in contrast, rely on pragmatic cues to convey a central message alongside the image. To better capture this pragmatic diversity, we introduce RONA, a prompting strategy for Multi-modal Large Language Models (MLLMs) that uses Coherence Relations as an axis of variation. RONA produces more diverse and accurate caption sets than standard MLLM baselines across multiple domains. Code:
https://github.com/aashish2000/RONA