[2310.10625] Video Language Planning

Significance

Vision-language models can produce long-horizon task plans

Review

The authors propose Video Language Planning (VLP), which targets visual planning for long-horizon tasks. VLP combines two large models with tree search. First, a vision-language model (PaLM-E) takes the current image and the task instruction and generates candidate text actions. Each text action is then fed to a text-to-video model, which generates a video rollout of the predicted future. The resulting action-video pairs are passed back to a vision-language model, which scores them so that the tree search can replace the least promising plans. After several rounds of this planning loop, the best plan is executed with goal-conditioned policies.
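The loop described above can be sketched in a few lines of Python. This is only a toy illustration, not the authors' implementation: integers stand in for images, and `propose_actions`, `generate_rollout`, and `score` are hypothetical stubs for the vision-language policy, the text-to-video model, and the vision-language scorer. The pruning here is a plain beam-style search, which is an assumed simplification of the paper's tree search, and execution with goal-conditioned policies is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Plan:
    actions: Tuple[str, ...]  # text actions chosen so far
    frames: Tuple[int, ...]   # toy "video": each frame is an integer state


def propose_actions(frame: int, instruction: str, k: int) -> List[str]:
    """Hypothetical stand-in for the VLM policy: return k candidate text actions."""
    return ["move +2", "move +1", "move -1"][:k]


def generate_rollout(frame: int, action: str) -> List[int]:
    """Hypothetical stand-in for the text-to-video model: predict future frames."""
    step = int(action.split()[1])
    return [frame + step]


def score(frame: int, goal: int) -> float:
    """Hypothetical stand-in for the VLM scorer: higher means closer to the goal."""
    return -abs(goal - frame)


def plan(start: int, goal: int, instruction: str,
         horizon: int = 4, beams: int = 2, branching: int = 3) -> Plan:
    frontier = [Plan((), (start,))]
    for _ in range(horizon):
        candidates = []
        for p in frontier:
            last = p.frames[-1]
            for action in propose_actions(last, instruction, branching):
                rollout = generate_rollout(last, action)
                candidates.append(
                    Plan(p.actions + (action,), p.frames + tuple(rollout))
                )
        # Keep only the most promising plans; the rest are dropped.
        candidates.sort(key=lambda c: score(c.frames[-1], goal), reverse=True)
        frontier = candidates[:beams]
        if score(frontier[0].frames[-1], goal) == 0:
            break  # the best plan already reaches the goal
    return frontier[0]


best = plan(start=0, goal=5, instruction="reach 5")
print(best.actions, best.frames)
```

In the real system each stub would be a large model call, so the branching factor and the number of kept plans directly control how much compute one planning round costs.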

[Figure: Video Language Planning]

The demo videos on the project page show that VLP performs impressively on long-horizon task planning.
