Moving Off-the-Grid: Scene-Grounded Video Representations

1Google Research, 2Google DeepMind, 3Inceptive
(*: equal contribution)

MooG is a recurrent, transformer-based video representation model that can be unrolled through time. MooG learns a set of “off-the-grid” latent representations.

Abstract

Current vision models typically maintain a fixed correspondence between their representation structure and image space. Each layer comprises a set of tokens arranged “on-the-grid,” which biases patches or tokens to encode information at a specific spatial (or spatio-temporal) location. In this work we present Moving Off-the-Grid (MooG), a self-supervised video representation model that offers an alternative approach, allowing tokens to move “off-the-grid” so they can represent scene elements consistently, even as those elements move across the image plane through time. By using a combination of cross-attention and positional embeddings we disentangle the representation structure from the image structure. We find that a simple self-supervised objective, next-frame prediction, trained on video data results in a set of latent tokens which bind to specific scene structures and track them as they move. We demonstrate the usefulness of MooG’s learned representation both qualitatively and quantitatively by training readouts for a variety of downstream tasks on top of the learned representation. We show that MooG can provide a strong foundation for different vision tasks when compared to “on-the-grid” baselines.
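
To make the recurrent structure concrete, below is a minimal PyTorch sketch of an off-the-grid update of the kind described above: a set of latent tokens reads the current frame via cross-attention over positionally embedded patch features, is updated recurrently, and is decoded with grid-positional queries to predict the next frame. The class OffGridVideoModel, its layer choices, dimensions, and the patch-space L2 loss are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class OffGridVideoModel(nn.Module):
    """Toy recurrent "off-the-grid" video model (illustrative only)."""

    def __init__(self, num_tokens=1024, token_dim=512, patch_dim=3 * 8 * 8):
        super().__init__()
        # Learned initial token set; apart from this initialisation,
        # no parameter depends on the number of tokens.
        self.init_tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.02)
        self.patch_embed = nn.Linear(patch_dim, token_dim)
        # Tokens query the positionally embedded frame features (cross-attention) ...
        self.read = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)
        # ... and are updated recurrently.
        self.update = nn.TransformerEncoderLayer(
            token_dim, nhead=8, dim_feedforward=2048, batch_first=True
        )
        # Grid-positional queries decode the tokens into patches of the next frame.
        self.decode = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)
        self.to_pixels = nn.Linear(token_dim, patch_dim)

    def step(self, tokens, frame_patches, pos_emb):
        """One time step: read the current frame, update tokens, predict the next frame."""
        batch = frame_patches.shape[0]
        pos = pos_emb.expand(batch, -1, -1)                  # (B, P, D) grid positions
        feats = self.patch_embed(frame_patches) + pos        # "on-the-grid" features
        tokens, _ = self.read(tokens, feats, feats)          # cross-attention read
        tokens = self.update(tokens)                         # "off-the-grid" state
        decoded, attn = self.decode(pos, tokens, tokens)     # attn: (B, P, num_tokens)
        return tokens, self.to_pixels(decoded), attn


def next_frame_loss(model, clip, pos_emb):
    """Self-supervised next-frame prediction over a clip of shape (B, T, P, patch_dim)."""
    batch, steps = clip.shape[:2]
    tokens = model.init_tokens.expand(batch, -1, -1)
    loss = 0.0
    for t in range(steps - 1):
        tokens, pred, _ = model.step(tokens, clip[:, t], pos_emb)
        loss = loss + ((pred - clip[:, t + 1]) ** 2).mean()  # L2 in patch space
    return loss / (steps - 1)
```

The per-pixel decoder attention weights returned by step are the quantity used in the visualizations below.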

Video Summary

Visualizing Token Attention Maps

For each pixel location, at each frame, we colour code the token with the highest attention weight at that location. If the representation is stable, i.e. if the same token tracks the same content as it moves, the argmax colouring should move with the scene motion, which is indeed what we observe.

Top row, from left to right: colour-coded argmax tokens, argmax blended with the video, ground-truth video, frame predictions. Bottom row: four randomly selected tokens latch onto specific scene elements and track them as they move.
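
Below is a rough NumPy sketch of this colour coding, assuming per-frame decoder attention maps of shape (T, H, W, num_tokens) have already been extracted from the model; the helper names are hypothetical.

```python
import numpy as np


def colour_code_argmax(attn_maps, seed=0):
    """attn_maps: (T, H, W, K) decoder attention weights. Returns (T, H, W, 3) RGB."""
    num_tokens = attn_maps.shape[-1]
    rng = np.random.default_rng(seed)
    palette = rng.uniform(0.2, 1.0, size=(num_tokens, 3))  # one random colour per token
    winners = attn_maps.argmax(axis=-1)                    # token index winning each pixel
    return palette[winners]                                # (T, H, W, 3)


def blend(colours, video, alpha=0.5):
    """Overlay the colour coding on a video in [0, 1] of shape (T, H, W, 3)."""
    return alpha * colours + (1.0 - alpha) * video


# Usage: rgb = blend(colour_code_argmax(attn_maps), video)
```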

PCA Analysis of Tokens

We unroll the model over a batch of 24 short clips, each 12 frames in length. We take the predicted token states of all clips across all time steps to obtain a 294912 x 512 matrix, where 512 is the token dimension. We compute the principal components of this matrix, take 3 of the leading components (out of 512), and visualize them as RGB. Note the consistent structure across sequences, which corresponds to meaningful elements in the scene. From left to right: PCA components as RGB, PCA blended with the original video, original video.
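
A small NumPy sketch of this procedure, using the shapes quoted above (the helper tokens_to_pca_rgb is hypothetical, and blending the result with the video is omitted):

```python
import numpy as np


def tokens_to_pca_rgb(token_states):
    """token_states: (num_clips, T, K, D) predicted token states.
    Returns per-token RGB values of shape (num_clips, T, K, 3)."""
    num_clips, steps, num_tokens, dim = token_states.shape
    flat = token_states.reshape(-1, dim)   # e.g. (24 * 12 * 1024, 512) = (294912, 512)
    flat = flat - flat.mean(axis=0)        # centre before PCA
    # Principal components via SVD of the centred matrix.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T                 # project onto the 3 leading components
    # Rescale each component to [0, 1] so it can be used as a colour channel.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / (hi - lo + 1e-8)
    return proj.reshape(num_clips, steps, num_tokens, 3)
```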

Changing the Number of Tokens

Since the model operates on a latent set of tokens, there are no parameters in the model that depend on the number of tokens. As a result we can instantiate the model with a different number of tokens without retraining. As can be seen, the model adapts gracefully: with fewer tokens, each token binds to a larger area of the image, yet the model still predicts future frames adequately and tracks scene structure. This model was trained with 1024 tokens. Shown, from top to bottom, are 256, 512 and 1024 tokens; from right to left: predictions, ground-truth frames, blended argmax attention, argmax attention.

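Reusing the hypothetical OffGridVideoModel sketch from the abstract section, the snippet below illustrates why this works: only the learned initial token set has a token-count dimension, so it can simply be sliced before unrolling, while every attention layer operates on sets of arbitrary size.

```python
import torch

# A model "trained" with 1024 tokens (weights are random here, for illustration).
model = OffGridVideoModel(num_tokens=1024, token_dim=512, patch_dim=3 * 8 * 8)
dummy_clip = torch.randn(1, 12, 16 * 16, 3 * 8 * 8)     # (B, T, P, patch_dim)
pos_emb = torch.randn(16 * 16, 512)                     # grid positional embeddings

with torch.no_grad():
    for k in (256, 512, 1024):
        tokens = model.init_tokens[:k].unsqueeze(0)     # slice the learned token set
        for t in range(dummy_clip.shape[1]):
            tokens, pred, attn = model.step(tokens, dummy_clip[:, t], pos_emb)
        print(f"{k} tokens -> prediction {tuple(pred.shape)}, attention {tuple(attn.shape)}")
```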

Related Links

SAVi: Conditional Object-Centric Learning from Video encodes a video into a set of temporally-consistent latent variables (object slots), and is trained to predict optical flow or to reconstruct input frames.

SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos improves SAVi by utilizing depth prediction and by adopting best practices for model scaling in terms of architecture design and data augmentation.

Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations takes few posed or unposed images of novel real-world scenes as input and produces a set-latent scene representation that is decoded to 3D videos & semantics.

RUST: Latent Neural Scene Representations from Unposed Imagery learns a latent pose space through self-supervision by taking a peek at the target view during training.

DyST: Towards Dynamic Neural Scene Representations on Real-World Videos learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene.