Inputs | Camera: Static | Camera: Dolly out | Camera: Orbit right + Pedestal up | |
---|---|---|---|---|
![]() | | | | |
Object Motions | ![]() | | | |
![]() | | | | |
This paper presents a method that lets users design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves meticulously planning both camera movements and object motions in a scene. However, enabling intuitive shot design in modern image-to-video generation systems poses two main challenges: first, effectively capturing user intent for the motion design, where camera movements and scene-space object motions must be specified jointly; and second, representing the motion information in a form that a video diffusion model can use effectively to synthesize the image animations.
To address these challenges, we introduce MotionCanvas, a method that integrates user-driven controls into image-to-video (I2V) generation models, allowing users to control both object and camera motions in a scene-aware manner. By connecting insights from classical computer graphics with contemporary video generation techniques, we achieve 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data. MotionCanvas enables users to intuitively depict scene-space motion intentions and translates them into spatiotemporal motion-conditioning signals for video diffusion models. We demonstrate the effectiveness of our method on a wide range of real-world image content and shot-design scenarios, highlighting its potential to enhance creative workflows in digital content creation and to adapt to various image and video editing applications.
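As a rough illustration of what translating a scene-space camera motion into a screen-space signal can look like, the sketch below lifts pixels to 3D using a per-pixel depth value, moves a pinhole camera along a straight line, and reprojects the points to obtain one 2D trajectory per pixel. Everything in it is an assumption made for illustration: the function names (`unproject`, `project`, `camera_tracks`), the translation-only camera, the intrinsics `focal`, `cx`, `cy`, and the availability of depth are hypothetical and are not taken from the MotionCanvas implementation.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              focal: float, cx: float, cy: float) -> Point3D:
    """Lift pixel (u, v) with a known depth to 3D in the first camera's frame."""
    return ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)


def project(point: Point3D, cam_pos: Point3D,
            focal: float, cx: float, cy: float) -> Point2D:
    """Project a 3D point into a camera translated to cam_pos (no rotation)."""
    x = point[0] - cam_pos[0]
    y = point[1] - cam_pos[1]
    z = point[2] - cam_pos[2]
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal * x / z + cx, focal * y / z + cy)


def camera_tracks(pixels: Sequence[Point2D], depths: Sequence[float],
                  end_pos: Point3D, num_frames: int,
                  focal: float, cx: float, cy: float) -> List[List[Point2D]]:
    """One 2D trajectory per pixel as the camera moves linearly to end_pos."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    tracks = []
    for (u, v), depth in zip(pixels, depths):
        point = unproject(u, v, depth, focal, cx, cy)
        track = []
        for t in range(num_frames):
            a = t / (num_frames - 1)  # interpolation weight in [0, 1]
            cam = (a * end_pos[0], a * end_pos[1], a * end_pos[2])
            track.append(project(point, cam, focal, cx, cy))
        tracks.append(track)
    return tracks


# Hypothetical "dolly in": the camera moves 2 units forward over 5 frames.
tracks = camera_tracks([(400.0, 300.0)], [4.0], (0.0, 0.0, 2.0), 5,
                       focal=500.0, cx=320.0, cy=240.0)
print(tracks[0][0], tracks[0][-1])
```

In this example the tracked pixel drifts away from the image center as the camera approaches, which is the expansion pattern one expects from a dolly in. A fuller treatment would also handle camera rotation and object motion, which this sketch leaves out.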
Effectiveness in Cinematic Shot Design (Joint camera and object motion control in a 3D-scene-aware manner).
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Pedestal up + Dolly in ] | ![]() | ![]() |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Static ] | ![]() | ![]() |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Roll clockwise ] | ![]() | ![]() |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Tilting up ] | ![]() | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Dolly in ] | ![]() | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Trucking right + Pedestal up ] | ![]() | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Orbit right ] | ⌀ | ![]() |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Panning left + Tilting up ] | ![]() | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Dolly in ] | ![]() | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Pedestal down + Tilting up ] | ![]() | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Panning left + Dolly in ] | ⌀ | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Panning left ] | ![]() | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Trucking right ] | ![]() | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Static ] | ![]() | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Trucking left + Dolly out ] | ![]() | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Camera motion | Object global motion | Object local motion |
---|---|---|
[ Panning left + Pedestal up + Orbit right ] | ⌀ | ⌀ |
DragAnything | MOFA-Video | MotionCanvas (Ours) |
Shot Design with Joint Camera and Object Control.
Inputs | Camera: Trucking right | Camera: Zoom in | Camera: Roll clockwise |
---|---|---|---|
![]() | | | |
![]() | | | |
![]() | | | |
Inputs | Camera: Static | Camera: Dolly in | Camera: Diagonal bottom-right |
![]() | | | |
![]() | | | |
![]() | | | |
Camera: Dolly out | Camera: Orbit left | Camera: Pedestal up |
---|---|---|
Camera: Orbit left | Camera: [Trucking left + Pedestal up] | Camera: Dolly in |
---|---|---|
Camera: Dolly out | Camera: Dolly in | Camera: Trucking left |
---|---|---|
Long Videos with Complex Trajectories.
Input image | Motion control signal | Result sample #1 | Result sample #2 |
---|---|---|---|
![]() | ![]() | | |
![]() | ![]() | | |
Input image | Result sample #1 | Result sample #2 |
---|---|---|
![]() | | |
![]() | | |
![]() | | |
![]() | | |
![]() | | |
Object Local Motion Control.
Inputs | ![]() | ![]() | ![]() | ![]() |
---|---|---|---|---|
Results | ||||
Inputs | ![]() | ![]() | ![]() | ![]() |
Results | ||||
Inputs | ![]() | ![]() | ![]() | ![]() |
---|---|---|---|---|
Results | ||||
Inputs | ![]() | ![]() | ![]() | ![]() |
Results | ||||
Motion Transfer.
Input source video | |||
Transfer results | |||
Video Editing.
Input video | |||
Editing results | |||
Camera Motion Control.
Input camera control | MotionCtrl | CameraCtrl | Ours |
---|---|---|---|
[ Dolly in + Zoom out ] (Dolly zoom) | | | |
[ Trucking right ] | | | |
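For readers unfamiliar with the dolly zoom listed above, the following minimal sketch shows why a dolly in must be paired with a zoom out. It assumes an ideal pinhole camera, where an object's projected size is proportional to focal length divided by distance. The helper names `projected_size` and `dolly_zoom_focal` and all numeric values are hypothetical and are not part of any compared method.

```python
def projected_size(object_size: float, focal: float, distance: float) -> float:
    """Image-plane size of an object under an ideal pinhole camera model."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return focal * object_size / distance


def dolly_zoom_focal(focal_start: float, dist_start: float, dist_now: float) -> float:
    """Focal length that keeps the subject's projected size unchanged."""
    if dist_start <= 0 or dist_now <= 0:
        raise ValueError("distances must be positive")
    return focal_start * dist_now / dist_start


# The camera dollies in from 4 units to 2 units away from the subject.
focal_end = dolly_zoom_focal(focal_start=50.0, dist_start=4.0, dist_now=2.0)

# The subject keeps its projected size because focal / distance is constant.
subject_before = projected_size(1.0, 50.0, 4.0)
subject_after = projected_size(1.0, focal_end, 2.0)

# A background object 6 units behind the subject changes size instead.
background_before = projected_size(1.0, 50.0, 10.0)
background_after = projected_size(1.0, focal_end, 8.0)

print(focal_end, subject_before, subject_after, background_before, background_after)
```

The focal length shrinks as the camera moves closer (a zoom out), the subject stays the same size, and the background appears smaller, which produces the characteristic perspective shift of the effect.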
Object Motion Control.
Inputs | DragAnything | MOFA-Video |
---|---|---|
![]() | | |
Camera | TrackDiffusion | Ours |
[ Static ] | | |
Inputs | DragAnything | MOFA-Video |
![]() | | |
Camera | TrackDiffusion | Ours |
[ Trucking right ] | | |
Camera Motion Representation.
Input | Gauss. Map | Plücker | Point Traj Coeff. (Ours) |
---|---|---|---|
[ Dolly out + Panning right ] | | | |
Input | Gauss. Map | Plücker | Point Traj Coeff. (Ours) |
[ Roll clockwise + Zoom out ] | | | |
Bounding Box Conditioning.
Input | Ours (coord) | Ours |
---|---|---|
![]() | | |
Input | Ours (coord) | Ours |
![]() | | |
Effect of Point Track Density on Camera Motion Control.
Density=0.1 | Density=0.4 | Density=0.7 | Density=1.0 | |
---|---|---|---|---|
Input point track | ||||
Results | ||||
Effect of Text Prompt.
- We show "dolly in" camera motion control with text prompts of varying detail.
"A man." | "A man crossing a stream." | "A man with a red backpack steps over a stream in a mountain valley." |
---|---|---|
"A man wearing a blue flannel shirt, hiking boots, and a red backpack carefully steps across a rocky stream in a picturesque valley surrounded by rugged mountains." | "A man crossing a stream. It is raining." | "A man crossing a stream and turning around." |
Essentiality of Camera-aware and Camera-object-aware Transformations.
- | Inputs | Preview | w/o transform | w/ transform (Ours) |
---|---|---|---|---|
Camera-aware transformation | ![]() | | | |
![]() | | | | |
Camera-object-aware transformation | ![]() | | | |