<aside>
🔥 Check the papers linked in the Paper relation, and organize the papers assigned to you in the ReadingLog relation. (You may use review materials of all kinds to build understanding, but in the end be sure to read the paper itself.)
</aside>
- [Week 0] Preliminaries - Word Embedding
- [Week 0] Preliminaries - Attention and Transformer
- [Week 1] Introduction - Overview and Survey Papers
- [Week 2] Generative/Predictive Self-Supervised Learning
- [Week 3] Contrastive (Self-)Supervised Learning
- [Week 4] Self-Distilled Self-Supervised Learning
- [Week 5] Word/Sentence/Document-level SSL for Text Encoders
- [Week 6] Node/Graph-level SSL for Graph Encoders
- [Week 7] Image-Text Alignment for Generation based on Single-Stream Fusion
- [Week 8] Image-Text Alignment for Retrieval/Generation based on Dual-Stream Fusion
- [Week 9] Large-scale Multimodal Alignment based on Multi-Stream Fusion
- [Week 10] Adapter-based Multimodal Models (toward MLLMs)
- [Week 11] Multimodal Large Language Models
- [Week 12] Multimodal Recommendation
- [Week 12+] Multimodal Recommendation based on Graph Learning
[Week 1] Introduction - Overview and Survey Papers

- [2023][MIR][-] VLP - A Survey on Vision-Language Pre-training (295)
- [2022][CVPR][METER] An Empirical Study of Training End-to-End Vision-and-Language Transformers (470) # dual-stream

---

- [2024][ACL-F][-] The Revolution of Multimodal Large Language Models - A Survey (113)
- [2024][TPAMI][-] A Survey on Multimodal Large Language Models (1984)
- [2024][ACL-F][-] MM-LLMs - Recent Advances in MultiModal Large Language Models (424)

---

- [2023][arXiv][-] A Comprehensive Survey on Multimodal Recommender Systems - Taxonomy, Evaluation, and Future Directions (109)
- [2023][TRS][-] A Survey of Graph Neural Networks for Recommender Systems - Challenges, Methods, and Directions (635)
- [2023][TIS][-] Contrastive Self-supervised Learning in Recommender Systems - A Survey (85)
- [2024][CSUR][-] Multimodal Recommender Systems - A Survey (123)
- [2024][TKDE][-] Self-Supervised Learning for Recommender Systems - A Survey (449)
[Week 2] Generative/Predictive Self-Supervised Learning

- [2018][OpenAI][GPT-1] Improving Language Understanding by Generative Pre-Training (15473) # decoder
- [2019][NAACL][BERT] Pre-training of Deep Bidirectional Transformers for Language Understanding (142678) # encoder, masked language modeling
- [2019][OpenAI][GPT-2] Language Models are Unsupervised Multitask Learners (18051) # decoder, autoregressive modeling
- [2020][JMLR][T5] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (26079) # encoder-decoder
- [2020][NIPS][GPT-3] Language Models are Few-Shot Learners (52889) # decoder
- [2023][OpenAI][GPT-4] Technical Report (5168) # decoder

---

- [2020][ICML][iGPT] Generative Pretraining from Pixels (2103)
- [2022][CVPR][MAE] Masked Autoencoders Are Scalable Vision Learners (11219)
- [2022][ICLR][BEiT] BERT Pre-Training of Image Transformers (3779) # Microsoft
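The papers above split into two families of objectives: autoregressive next-token prediction (the GPT line, iGPT) and masked prediction (BERT, MAE, BEiT). Below is a minimal PyTorch sketch of the masked-prediction idea; the two-layer encoder, vocabulary size, and masking ratio are toy placeholders, not any paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup: vocabulary size, special token ids, and encoder depth are
# placeholders, not the actual BERT configuration.
VOCAB_SIZE, MASK_ID, PAD_ID = 1000, 1, 0

embed = nn.Embedding(VOCAB_SIZE, 64)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(64, VOCAB_SIZE)

def mlm_loss(tokens: torch.Tensor, mask_ratio: float = 0.15) -> torch.Tensor:
    """BERT-style objective: corrupt a subset of tokens, predict the originals."""
    # Choose ~15% of non-padding positions to mask.
    maskable = tokens != PAD_ID
    masked = (torch.rand_like(tokens, dtype=torch.float) < mask_ratio) & maskable
    corrupted = tokens.masked_fill(masked, MASK_ID)
    # Encode the corrupted sequence and predict a token id at every position.
    logits = lm_head(encoder(embed(corrupted)))
    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[masked], tokens[masked])

tokens = torch.randint(2, VOCAB_SIZE, (8, 32))  # batch of 8 length-32 sequences
loss = mlm_loss(tokens)
loss.backward()
print(loss.item())
```

MAE applies the same recipe to image patches instead of tokens (masking a much larger fraction and regressing pixels), and the GPT family replaces the random mask with a causal one.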
[Week 3] Contrastive (Self-)Supervised Learning

- [2015][CVPR][FaceNet] A Unified Embedding for Face Recognition and Clustering (19437) # supervised, triplet loss
- [2015][ICLR][triplet network] Deep Metric Learning with Triplet Network (2888) # supervised, triplet loss
- [2016][CVPR][-] Deep Metric Learning via Lifted Structured Feature Embedding (2195) # hard negative, supervised
- [2020][NIPS][SupCon] Supervised Contrastive Learning (6708) # Google, hard negative, supervised

---

- [2018][arXiv][CPC] Representation Learning with Contrastive Predictive Coding (12471) # self-supervised
- [2018][CVPR][NPID] Unsupervised Feature Learning via Non-Parametric Instance Discrimination (4924) # self-supervised, instance discrimination
- [2020][ICML][SimCLR] A Simple Framework for Contrastive Learning of Visual Representations (26029) # Google, self-supervised
- [2020][CVPR][MoCo] Momentum Contrast for Unsupervised Visual Representation Learning (16840) # self-supervised, memory bank
- [2020][NIPS][SwAV] Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (5141) # Facebook, clustering, self-supervised
- [2020][NIPS][SimCLRv2] Big Self-Supervised Models are Strong Semi-Supervised Learners (2916)
- [2020][arXiv][MoCo-2] Improved Baselines with Momentum Contrastive Learning (4272)
- [2021][ICCV][MoCo-3] An Empirical Study of Training Self-Supervised Vision Transformers (2430) # Transformer
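Most of the self-supervised entries above optimize an InfoNCE-style objective: pull two views of the same instance together, push all other instances in the batch apart. Below is a minimal sketch of the SimCLR-style NT-Xent loss, assuming the two augmented views of each image have already been encoded (the backbone, augmentations, and projection head are omitted).

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """SimCLR-style NT-Xent: each view's positive is the other view of the
    same image; the remaining 2N - 2 embeddings in the batch are negatives."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit length
    sim = z @ z.t() / tau                                # temperature-scaled cosine sims
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # Row i's positive sits at index i + N (mod 2N).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Usage with pre-computed embeddings of two views of the same 16 images.
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(nt_xent(z1, z2).item())
```

MoCo keeps the same loss but draws negatives from a queue filled by a momentum encoder instead of the current batch, which decouples the number of negatives from the batch size.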
[Week 4] Self-Distilled Self-Supervised Learning

- [2020][NIPS][BYOL] Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning (8971) # DeepMind
- [2021][CVPR][SimSiam] Exploring Simple Siamese Representation Learning (5623) # Meta, simplified BYOL architecture
- [2021][ICCV][DINO] Emerging Properties in Self-Supervised Vision Transformers (8179) # Meta
- [2022][ICLR][iBOT] Image BERT Pre-Training with Online Tokenizer (1154) # Microsoft
- [2024][TMLR][DINOv2] Learning Robust Visual Features without Supervision (4465) # Meta
- [2021][ACL][BSL] Bootstrapped Unsupervised Sentence Representation Learning (43)
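The shared mechanic in this week's papers is a student ("online") network trained to match a teacher ("target") network that is a stop-gradient and/or exponential-moving-average copy of itself, with no negative pairs at all. Below is a minimal BYOL-style sketch; the toy MLPs, learning rate, and momentum value are placeholders, and the paper's symmetrized loss over both view orderings is omitted for brevity.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy online/target networks; the papers use ResNet or ViT backbones plus
# projection heads. All dimensions here are placeholders.
online = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
predictor = nn.Linear(64, 64)            # prediction head, online branch only
target = copy.deepcopy(online)           # momentum (EMA) copy of the online net
for p in target.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(list(online.parameters()) + list(predictor.parameters()), lr=0.1)

def byol_step(view1: torch.Tensor, view2: torch.Tensor, tau: float = 0.99) -> float:
    # Online branch predicts the target branch's embedding of the other view.
    p = F.normalize(predictor(online(view1)), dim=1)
    with torch.no_grad():                # stop-gradient through the teacher
        z = F.normalize(target(view2), dim=1)
    loss = (2 - 2 * (p * z).sum(dim=1)).mean()  # MSE between unit vectors
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update: the target slowly tracks the online network.
    with torch.no_grad():
        for pt, po in zip(target.parameters(), online.parameters()):
            pt.mul_(tau).add_(po, alpha=1 - tau)
    return loss.item()

x1, x2 = torch.randn(8, 32), torch.randn(8, 32)  # stand-ins for two augmented views
print(byol_step(x1, x2))
```

SimSiam shows the EMA copy is not strictly needed (stop-gradient plus the predictor suffices), while DINO replaces the regression loss with a cross-entropy between sharpened/centered teacher and student distributions.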
[Week 5] Word/Sentence/Document-level SSL for Text Encoders

- [2019][EMNLP][SBERT] Sentence Embeddings using Siamese BERT-Networks (18399) # supervised
- [2018][ACL][USE] Universal Sentence Encoder (1788) # Google, supervised

---

- [2021][EMNLP-F][TSDAE] Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning (274) # generative/predictive

---

- [2021][ACL][DeCLUTR] Deep Contrastive Learning for Unsupervised Textual Representations (603) # contrastive, unsupervised
- [2021][ACL][ConSERT] A Contrastive Framework for Self-Supervised Sentence Representation Transfer (679) # contrastive, unsupervised
- [2021][EMNLP][SimCSE] Simple Contrastive Learning of Sentence Embeddings (4193) # contrastive, (un)supervised
- [2022][EMNLP][PromptBERT] Improving BERT Sentence Embeddings with Prompts (224) # contrastive, supervised / weakly-supervised
- [2022][ACL-F][ST5] Scalable Sentence Encoders from Pre-trained Text-to-Text Models (660) # contrastive, Google, supervised
- [2022][TMLR][Contriever] Unsupervised Dense Information Retrieval with Contrastive Learning (1017) # ⭐, contrastive, Meta, unsupervised
- [2022][arXiv][E5] Text Embeddings by Weakly-Supervised Contrastive Pre-training (724) # ⭐, Microsoft, weakly-supervised
- [2023][ACL-F][ReContriever] Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training (39)

---

- [2020][NIPS][MiniLM] Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers (1700) # general-purpose language model, Microsoft, knowledge distillation
- [2024][ACL-F][BGE-M3] Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings through Self-Knowledge Distillation (659) # ⭐, contrastive, knowledge distillation
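Unsupervised SimCSE is a useful anchor for the contrastive block above: its positive pair is simply the same sentence encoded twice, so the only difference between the two "views" is the dropout mask, and the loss is the same InfoNCE as in Week 3. Below is a minimal sketch; the tiny MLP stands in for a BERT encoder, and the dimensions and temperature are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a BERT encoder. Dropout is the only source of noise,
# which is exactly the augmentation unsupervised SimCSE exploits.
encoder = nn.Sequential(nn.Linear(32, 64), nn.Dropout(p=0.1), nn.Linear(64, 64))

def simcse_loss(x: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Two forward passes of the same batch yield two dropout-perturbed
    'views'; each sentence's positive is its own second encoding."""
    encoder.train()                       # keep dropout active
    z1 = F.normalize(encoder(x), dim=1)   # first dropout mask
    z2 = F.normalize(encoder(x), dim=1)   # second, independent dropout mask
    sim = z1 @ z2.t() / tau               # (N, N) similarity matrix
    targets = torch.arange(x.size(0))     # positives lie on the diagonal
    return F.cross_entropy(sim, targets)

x = torch.randn(16, 32)                   # stand-in for a batch of sentence inputs
print(simcse_loss(x).item())
```

The supervised SimCSE variant, and retrieval models like Contriever and E5, keep this loss but replace the dropout positives with NLI pairs, cropped spans, or weakly labeled query-document pairs.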
[Week 6] Node/Graph-level SSL for Graph Encoders

- [2019][ICLR][DGI] Deep Graph Infomax (3530) # node-level, MIM, contrastive
- [2020][KDD][GCC] Graph Contrastive Coding for Graph Neural Network Pre-Training (1064) # node-level, graph-level, contrastive
- [2020][ICLR][InfoGraph] Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization (1334) # graph-level, MIM, contrastive
- [2020][KDD][GPT-GNN] Generative Pre-Training of Graph Neural Networks (661) # node-level, generative
- [2020][ICML][-] Contrastive Multi-View Representation Learning on Graphs (1870) # node-level, graph-level, contrastive
- [2022][KDD][GraphMAE] Self-Supervised Masked Graph Autoencoders (640) # node-level, graph-level, generative
- [2021][NIPS][InfoGCL] Information-Aware Graph Contrastive Learning (230) # node-level, graph-level, contrastive
- [2021][SIGIR][SGL] Self-supervised Graph Learning for Recommendation (1110) # contrastive, node-level
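DGI, the first entry above, trains a GNN by maximizing mutual information between node embeddings and a graph-level summary, using a feature-shuffled corruption of the graph as the negative sample. Below is a minimal sketch with a hand-rolled one-layer GCN; the row-normalized adjacency, dimensions, and bilinear discriminator are simplified placeholders rather than the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneLayerGCN(nn.Module):
    """Minimal GCN stand-in: H = ReLU(A_hat @ X @ W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, a_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        return F.relu(a_hat @ self.lin(x))

def dgi_loss(gcn: OneLayerGCN, disc_w: torch.Tensor,
             a_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Positives: node embeddings of the real graph. Negatives: embeddings of
    # a corrupted graph whose node features are row-shuffled (DGI's corruption).
    h_pos = gcn(a_hat, x)
    h_neg = gcn(a_hat, x[torch.randperm(x.size(0))])
    s = torch.sigmoid(h_pos.mean(dim=0))          # readout: graph-level summary
    # Bilinear discriminator scores each node embedding against the summary.
    logits = torch.cat([h_pos @ disc_w @ s, h_neg @ disc_w @ s])
    labels = torch.cat([torch.ones(x.size(0)), torch.zeros(x.size(0))])
    return F.binary_cross_entropy_with_logits(logits, labels)

n, d = 10, 16
gcn = OneLayerGCN(d, 32)
disc_w = nn.Parameter(torch.randn(32, 32) * 0.1)         # discriminator weights
adj = (torch.rand(n, n) < 0.3).float()                   # random toy graph
a_hat = adj / adj.sum(dim=1, keepdim=True).clamp(min=1)  # row-normalized adjacency
x = torch.randn(n, d)
print(dgi_loss(gcn, disc_w, a_hat, x).item())
```

InfoGraph extends the same node-vs-summary contrast to graph-level tasks, while GraphMAE and GPT-GNN drop the discriminator entirely in favor of generative reconstruction of masked features or edges.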