Object detection is a fundamental task in computer vision, with critical applications in autonomous driving, surveillance, and robotics. Traditional object detection models rely primarily on RGB images, which perform well under favorable lighting but degrade in low-visibility environments such as nighttime or adverse weather. Infrared (IR) imagery, which captures thermal information, offers improved performance in such conditions but lacks structural and color details. Combining the RGB and IR modalities can therefore enhance detection accuracy by leveraging their complementary strengths. However, RGB-IR fusion for aerial imagery remains underexplored, and the scarcity of publicly available paired datasets further limits research in this area. Moreover, deploying fusion models onboard aerial platforms such as drones poses significant challenges, including the complexity of feature-level fusion and high computational overhead. In this work, we propose an efficient RGB-IR fusion framework specifically designed for aerial image datasets. Our framework integrates pixel-level fusion with transformer-based feature-level fusion to capture both low-level and high-level cross-modal interactions. To address onboard computational constraints, we introduce a token selection mechanism that dynamically retains the most informative tokens, reducing inference time while maintaining high detection performance. Extensive experiments on an RGB-IR aerial image dataset demonstrate that our framework significantly improves both detection accuracy and computational efficiency.
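The core idea behind token selection, pruning a transformer's token set down to the most informative entries so that attention cost shrinks quadratically, can be illustrated with a minimal sketch. The L2-norm scoring function and the 196/64 token counts below are illustrative assumptions, not the learned selector or configuration used in the proposed framework:

```python
import numpy as np

def select_top_k_tokens(tokens, k):
    """Keep the k highest-scoring tokens from a (n_tokens, dim) array.

    `tokens` stands in for fused RGB-IR feature tokens. The L2 norm is
    used here as a placeholder importance score; a real system would
    typically use a small learned scoring head instead (assumption).
    """
    scores = np.linalg.norm(tokens, axis=1)
    keep = np.argsort(scores)[-k:][::-1]  # indices of the k largest scores
    return tokens[keep], keep

# Example: 196 tokens (a 14x14 patch grid), keep the 64 most informative.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 256))
selected, idx = select_top_k_tokens(tokens, 64)
print(selected.shape)  # (64, 256)
```

Because self-attention scales quadratically in the token count, dropping from 196 to 64 tokens cuts the attention cost by roughly (196/64)^2 ≈ 9x, which is where the inference-time savings in such schemes come from.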