[Paper Collection] Papers on Multimodal Large Language Models
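This article surveys recent research on multimodal large language models (MLLMs) and highlights 12 representative papers. Topics include: ClearSight, which mitigates object hallucination through visual signal enhancement; the VCR framework, which generates synthetic data for mathematical reasoning based on the "Cone of Experience"; EarthGPT, which handles multisensor image comprehension in the remote sensing domain; and the contributions of Chain of Images and LION to intuitive reasoning and dual-level visual knowledge. It also includes two important surveys: "From a data-centric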
This article discusses research on multimodal large language models and lists several papers in the field together with their links. Key points:
- ClearSight paper: ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models, link: https://arxiv.org/abs/2503.13107
- VCR paper: VCR: A “Cone of Experience” Driven Synthetic Data Generation Framework for Mathematical Reasoning (Proceedings of the AAAI Conference on Artificial Intelligence), link: https://ojs.aaai.org/index.php/AAAI/article/view/34645
- Data-centric survey: A Survey of Multimodal Large Language Model from A Data-centric Perspective, link: https://arxiv.org/abs/2405.16640v1
- Multimodal large language model survey: The Revolution of Multimodal Large Language Models: A Survey, link: https://aclanthology.org/2024.findings-acl.807.pdf
- EarthGPT paper: EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain (IEEE Xplore), link: https://ieeexplore.ieee.org/abstract/document/10547418?signout=success
- Chain of Images paper: Chain of Images for Intuitively Reasoning, link: https://arxiv.org/abs/2311.09241
- GPT-4 multimodal analysis paper: GPT-4 Multimodal Analysis on Ophthalmology Clinical Cases Including Text and Images (medRxiv), link: https://www.medrxiv.org/content/10.1101/2023.11.24.23298953v1
1. ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models
Paper link: https://arxiv.org/abs/2503.13107
2. VCR: A “Cone of Experience” Driven Synthetic Data Generation Framework for Mathematical Reasoning
Paper link: https://ojs.aaai.org/index.php/AAAI/article/view/34645
3. A Survey of Multimodal Large Language Model from A Data-centric Perspective
Paper link: https://arxiv.org/abs/2405.16640v1
4. The Revolution of Multimodal Large Language Models: A Survey
Paper link: https://aclanthology.org/2024.findings-acl.807.pdf
5. EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain
Paper link: https://ieeexplore.ieee.org/abstract/document/10547418?signout=success
6. Chain of Images for Intuitively Reasoning
Paper link: https://arxiv.org/abs/2311.09241
7. GPT-4 Multimodal Analysis on Ophthalmology Clinical Cases Including Text and Images
Paper link: https://www.medrxiv.org/content/10.1101/2023.11.24.23298953v1
8. LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
Paper link: https://arxiv.org/abs/2311.11860
9. MM-LLMs: Recent Advances in MultiModal Large Language Models
Paper link: https://arxiv.org/abs/2401.13601
10. SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation
Paper link: https://arxiv.org/abs/2409.06633
11. VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
Paper link: https://openreview.net/forum?id=BH7ZAmkWVc
12. RocketEval: Efficient automated LLM evaluation via grading checklist
Paper link: https://openreview.net/forum?id=zJjzNj6QUe