This research paper from Amazon proposes a Multimodal-CoT technique to enhance the complex reasoning ability of language models by incorporating both text and image modalities through a two-stage framework that separates rationale generation and answer inference. Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, previous CoT studies have only focused on the language modality.