论文网址:[2502.15786] MindLLM: A Subject-Agnostic and Versatile Model for fMRI-to-Text Decoding

The English text is entirely hand-typed! It summarizes and paraphrases the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments. This article leans toward study notes, so read with caution.

Table of Contents

1. Takeaways

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. Related Works

2.4. Method

2.4.1. Method Overview

2.4.2. fMRI Encoder

2.4.3. Brain Instruction Tuning (BIT)

2.5. Experiments

2.5.1. Settings

2.5.2. Brain Captioning

2.5.3. Versatile Decoding

2.5.4. Unseen Subject Generalization

2.5.5. Adapting to New Tasks

2.5.6. Ablation Study

2.5.7. Visualizations and Interpretations

2.6. Conclusion

1. Takeaways

(1) A substantial amount of work was done

2. Section-by-Section Reading

2.1. Abstract

        ①Challenges: suboptimal performance, limited task variety, and poor generalization across subjects

2.2. Introduction

        ①Design and implementation of MindLLM:

        ②Selecting responsive voxels brings higher performance but yields a different number of voxels per subject. Pooling or sampling them down to the same number may cause loss of information

        ③Their method aims to complete tasks of perception & scene understanding, memory & knowledge retrieval, language & symbolic processing, and complex reasoning

prosthesis  n. an artificial body part (e.g., an artificial limb, eye, or tooth)

2.3. Related Works

        ①⭐VQA returns answers that are not relevant to the β values

        ②⭐Cross-subject methods do not handle voxel differences well; flattening or sampling may cause loss of spatial/individual information:

        ③Designing a separate encoder for each subject is inherently limiting, and relying solely on caption annotations is another limitation

2.4. Method

2.4.1. Method Overview

        ①Overall framework of MindLLM:

where the LLM is Vicuna-7b (suited to open-ended dialogue?? long-text understanding??)

        ②Input brain signal of each subject: \boldsymbol{v}=[v_{1},v_{2},\cdots,v_{N}]\in\mathbb{R}^{N}, where N\in[12682,17907] denotes the number of voxels (it varies across subjects)

        ③The fMRI encoder f_\theta encodes \boldsymbol{v} into fMRI tokens X_{v}=[\boldsymbol{x}_{v,1},\boldsymbol{x}_{v,2},\cdots,\boldsymbol{x}_{v,L}]\in\mathbb{R}^{L\times d} with L tokens of dimension d

2.4.2. fMRI Encoder

        ①In the attention, the value V is a voxel's activation, and the key K concatenates that voxel's Fourier positional coordinates with several region embeddings of the ROIs it belongs to in different brain atlases:

k_i=k_i^{\mathrm{pos}}\|k_i^{\mathrm{reg},\mathcal{P}^1}\|k_i^{\mathrm{reg},\mathcal{P}^2}\|\cdots

        ②The attention layer outputs \boldsymbol{z}_{q}\in\mathbb{R}^{N_{q}}, which is then passed through an MLP:

X_{v}=\mathrm{reshape}\left(\mathrm{MLP}(\{\boldsymbol{z}_{q}\})\right)\in\mathbb{R}^{L\times d}
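The idea above can be sketched as follows (a minimal illustration, assuming hypothetical shapes and names; the real architecture may differ): a fixed set of learned queries cross-attends over a variable number of voxels, each key mixes the voxel's positional features with its region embeddings, each value is the scalar voxel activation, and the pooled vector z_q is mapped by an MLP to L fMRI tokens.

```python
import torch
import torch.nn as nn

class FMRIEncoderSketch(nn.Module):
    """Hypothetical sketch: learned queries attend over voxels whose
    keys concatenate positional features and atlas region embeddings."""
    def __init__(self, d_key=32, n_queries=64, n_tokens=8, d_model=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_key))
        # MLP maps the pooled vector z_q in R^{N_q} to L*d, then reshape
        self.mlp = nn.Linear(n_queries, n_tokens * d_model)
        self.n_tokens, self.d_model = n_tokens, d_model

    def forward(self, values, keys):
        # values: (N,) voxel activations v; keys: (N, d_key), each row
        # k_i = [Fourier position features || region embeddings]
        attn = torch.softmax(
            self.queries @ keys.T / keys.shape[1] ** 0.5, dim=-1
        )                    # (N_q, N); N may vary across subjects
        z_q = attn @ values  # (N_q,) -- scalar activations pooled per query
        return self.mlp(z_q).view(self.n_tokens, self.d_model)  # X_v: (L, d)
```

Because the voxel dimension N only appears inside the attention, the same encoder accepts any number of voxels, which is what makes the design subject-agnostic.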

2.4.3. Brain Instruction Tuning (BIT)

        ①Tasks of MindLLM:

signifier  n. the form of a linguistic sign

        ②Each sample \boldsymbol{v} is paired with a multi-turn conversation X_{t}=(X_u^1,X_a^1,\cdots,X_u^T,X_a^T) with T\geq1 turns, where X_a^t is a message from the assistant and X_u^t is a message from the user

        ③Training objective:

\arg\max_\theta p(X_a|X_v,X_{\mathrm{inst}})=\prod_{t=1}^Tp(X_a^t\mid X_u^{\leq t},X_a^{<t},X_{\mathrm{inst}},X_v)
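In implementation terms this objective is ordinary next-token cross-entropy computed only on assistant tokens, with the instruction, the fMRI tokens, and the user turns serving as context. A minimal sketch (the function name and mask construction are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def bit_loss(logits, labels, assistant_mask):
    """logits: (T, V) LM outputs; labels: (T,) token ids;
    assistant_mask: (T,) bool, True where the token belongs to an
    assistant message X_a (only those positions contribute loss)."""
    # Shift so that position t predicts token t+1.
    shifted_logits, shifted_labels = logits[:-1], labels[1:]
    mask = assistant_mask[1:].float()
    per_token = F.cross_entropy(shifted_logits, shifted_labels,
                                reduction="none")
    # Average over assistant tokens only; context tokens are masked out.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

Masking this way means changing a context token's label leaves the loss untouched, which is exactly the conditioning structure of the product over turns above.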

        ④Examples of Q&A:

2.5. Experiments

2.5.1. Settings

        ①Datasets: NSD and other downstream datasets

2.5.2. Brain Captioning

        ①Captioning performance:

where CIDEr is scaled by a factor of 100

2.5.3. Versatile Decoding

        ①Performance of versatile decoding:

2.5.4. Unseen Subject Generalization

        ①Train on 1–7 subjects but evaluate on the 8th:

2.5.5. Adapting to New Tasks

        ①Performance on sentiment understanding and utility/affordance tasks:

2.5.6. Ablation Study

        ①Ablation of position encoding:
 

2.5.7. Visualizations and Interpretations

        ①Attention of brain voxels:
 

2.6. Conclusion

        ~
