Chinese Image Captioning Based on Deep Fusion Feature and Multi-Layer Feature Filtering Block
Blog Article
Cross-modal research has long been a critical pillar for the future development of human-computer interaction. With deep learning achieving remarkable results in computer vision and natural language processing, image captioning has emerged as a key focus area in artificial intelligence research. Traditionally, most image captioning studies have focused on the English context; however, interdisciplinary efforts should not be confined to monolingual environments. It is essential to expand into other languages, particularly given that Chinese is one of the world’s most widely used logographic languages.
The study of Chinese image captioning holds immense value but presents significant challenges due to the complexity of Chinese semantic features. To address these difficulties, we propose a Deep Fusion Feature Encoder, which enables the model to extract more detailed visual features from images. Additionally, we introduce Swi-Gumbel Attention and develop a Feature Filtering Block based on it, aiding the model in accurately capturing core semantic elements during caption generation. Experimental results demonstrate that our method achieves superior performance across multiple Chinese datasets.
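The article does not detail how Swi-Gumbel Attention is computed. Assuming it builds on the standard Gumbel-Softmax trick for sharpening attention toward a few core regions, a minimal, illustrative sketch of Gumbel-Softmax weighting over visual features might look like the following (the function, shapes, and temperature value are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Sample a sharpened attention distribution via Gumbel-Softmax.

    logits : unnormalized attention scores, shape (n,)
    tau    : temperature; lower values push weights closer to one-hot,
             which is one way a model could filter out non-core features
    """
    rng = rng or np.random.default_rng(0)
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

# Hypothetical example: weight three image-region feature vectors
# by the sampled attention distribution.
features = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # 3 regions, dim 2
scores = np.array([2.0, 0.1, -1.0])                        # attention logits
weights = gumbel_softmax(scores, tau=0.5)
attended = weights @ features                              # weighted sum, shape (2,)
```

A low temperature makes the sampled weights nearly discrete, which matches the filtering intuition: the caption decoder attends mostly to the few regions carrying the core semantics rather than averaging over everything.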
Specifically, in the experimental section of this paper, we compare our proposed model with models based on recurrent neural networks and Transformers, demonstrating both its advantages and limitations. We also provide insights into future research directions for Chinese image captioning. Through ablation experiments, we validate the effectiveness of the Deep Fusion Feature Encoder, Swi-Gumbel Attention, and Triple-Layer Feature Filtering Block. We also explore the impact of different architectural configurations within the Multi-Layer Feature Filtering Block on caption accuracy.