A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation
Recaptioning images with high-quality samples improves text-to-image generation
#image-generation
#multi-modal
Video Language Planning
Vision-language models can make long-horizon task plans
#language-model
#multi-modal
#robotics
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Contrastive ViT Makes VLM Stronger
#language-model
#multi-modal
Collaborative Score Distillation for Consistent Visual Synthesis
Extend to higher dimensions by leveraging consistency, without changing the architecture
#multi-modal
#diffusion-model
LiT: Zero-Shot Transfer with Locked-image Text Tuning
Image-text pre-training with a pre-trained image model enhances zero-shot performance
#multi-modal
#contrastive-learning
End-to-end Multi-modal Video Temporal Grounding
Adding depth and flow to RGB improves video understanding
#video-understanding
#multi-modal
CLIP-It! Language-Guided Video Summarization
Get a highlight clip of your favorite player by typing it
#multi-modal
#video-understanding
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Make your cat cute by typing it into this StyleGAN-CLIP hybrid
#multi-modal
#generative-adversarial-network
#image-manipulation