MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation

This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves meticulously planning both camera movements and object motions in a scene. However, enabling intuitive shot design in modern image-to-video generation systems presents two main challenges: first, effectively capturing user intentions on the motion design, where both camera movements and scene-space object motions must be specified jointly; and second, representing motion information that can be effectively utilized by a video diffusion model to synthesize the image animations.

To address these challenges, we introduce MotionCanvas, a method that integrates user-driven controls into image-to-video (I2V) generation models, allowing users to control both object and camera motions in a scene-aware manner. By connecting insights from classical computer graphics and contemporary video generation techniques, we demonstrate the ability to achieve 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data. MotionCanvas enables users to intuitively depict scene-space motion intentions, and translates them into spatiotemporal motion-conditioning signals for video diffusion models. We demonstrate the effectiveness of our method on a wide range of real-world image content and shot-design scenarios, highlighting its potential to enhance the creative workflows in digital content creation and adapt to various image and video editing applications.
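As an illustration of how a scene-space camera move can become a screen-space motion signal, the following minimal sketch traces one static scene point through a "dolly in" move. It is not the authors' implementation. It assumes a pinhole camera with a known focal length `f` and principal point `(cx, cy)`, a known depth for the pixel, and a camera that translates linearly along its optical axis. All function and parameter names are hypothetical.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]


def unproject(u: float, v: float, depth: float,
              f: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with a known depth to a 3D point in camera space."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


def project(x: float, y: float, z: float,
            f: float, cx: float, cy: float) -> Point2D:
    """Project a camera-space 3D point back to pixel coordinates."""
    return (f * x / z + cx, f * y / z + cy)


def dolly_in_track(u: float, v: float, depth: float,
                   f: float, cx: float, cy: float,
                   forward: float, num_frames: int) -> List[Point2D]:
    """Screen-space track of a static point while the camera moves forward.

    Assumption: the camera translates by `forward` scene units along its
    optical axis, linearly over `num_frames` frames.
    """
    x, y, z = unproject(u, v, depth, f, cx, cy)
    track = []
    for t in range(num_frames):
        s = t / (num_frames - 1) if num_frames > 1 else 0.0
        z_t = z - s * forward  # the point gets closer as the camera advances
        if z_t <= 0:
            raise ValueError("camera would pass through the point")
        track.append(project(x, y, z_t, f, cx, cy))
    return track


if __name__ == "__main__":
    # A pixel to the right of the image center drifts outward during a dolly in.
    for uv in dolly_in_track(60, 50, depth=2.0, f=100, cx=50, cy=50,
                             forward=1.0, num_frames=3):
        print(uv)
```

Repeating this for many pixels, each with its own depth, gives a set of point trajectories in which nearer points move faster than distant ones. That parallax is what makes a screen-space signal depth-aware.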

	Inputs	Camera: Static	Camera: Dolly out	Camera: Orbit right + Pedestal up

Object Motions

Camera motion	Object global motion	Object local motion
[ Pedestal up + Dolly in ]
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Static ]
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Roll clockwise ]
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Tilting up ]		⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Dolly in ]		⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Trucking right + Pedestal up ]		⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Orbit right ]	⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Panning left + Tilting up ]		⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Pedestal down + Tilting up ]		⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Panning left + Dolly in ]	⌀	⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Panning left ]		⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Trucking right ]		⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Trucking left + Dolly out ]		⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Camera motion	Object global motion	Object local motion
[ Panning left + Pedestal up + Orbit right ]	⌀	⌀
DragAnything	MOFA-Video	MotionCanvas (Ours)

Input image	Motion control signal	Result sample #1	Result sample #2

Input image	Result sample #1	Result sample #2

Input source video
Transfer results

Input video
Editing results

Inputs	Camera: Trucking right	Camera: Zoom in	Camera: Roll clockwise



Inputs	Camera: Static	Camera: Dolly in	Camera: Diagonal bottom-right

Inputs	DragAnything	MOFA-Video

Camera	TrackDiffusion	Ours
[ Static ]

Inputs	DragAnything	MOFA-Video

Camera	TrackDiffusion	Ours
[ Trucking right ]

Input	Gauss. Map	Plucker	Point Traj Coeff. (Ours)
[ Dolly out + Panning right ]
Input	Gauss. Map	Plucker	Point Traj Coeff. (Ours)
[ Roll clockwise + Zoom out ]

"A man."	"A man crossing a stream."	"A man with a red backpack steps over a stream in a mountain valley."

"A man wearing a blue flannel shirt, hiking boots, and a red backpack carefully steps across a rocky stream in a picturesque valley surrounded by rugged mountains."	"A man crossing a stream. It is raining."	"A man crossing a stream and turning around."

-	Inputs	Preview	w/o transform	w/ transform (Ours)
Camera-aware transformation
Camera-aware transformation
Camera-object-aware transformation