DBR-TAD: Diffusion-Based Boundary Refinement for Temporal Action Detection

Wenjie Zhang, Zhiheng Li, Wenhao Tan, Ran Song, Jiyu Cheng, and Wei Zhang
School of Control Science and Engineering, Shandong University, Jinan, China.

Overview of DBR-TAD. First, the IAL module takes as input the video features produced by a pretrained video encoder and generates initial action proposals at each discrete time step from the multiscale features extracted by ConvFormer. Then, the DBR module vectorizes the differences between the initial action proposal boundaries and the ground-truth boundaries, and performs a diffusion-based denoising process on the difference vector. Finally, the refined action boundaries are obtained by adding the denoised difference vector to the initial action proposal boundaries.
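The refinement step described in the overview can be illustrated with a minimal sketch. It is not the authors' implementation: the linear noise schedule, the deterministic DDIM-style sampler, the choice to have the denoiser predict the clean difference vector, and every name below (make_alpha_bars, add_noise, ddim_denoise, refine_boundaries, predict_d0) are assumptions made for illustration. The trained, feature-conditioned denoising network is replaced by a hypothetical callable.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(d0, t, alpha_bars, rng):
    """Forward process used at training time:
    d_t = sqrt(abar_t) * d0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, 1)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in d0]


def ddim_denoise(d_T, predict_d0, alpha_bars, steps):
    """Deterministic DDIM-style reverse process over descending timesteps.

    predict_d0(d_t, t) is a hypothetical stand-in for a trained network that
    estimates the clean difference vector from its noisy version.
    """
    d_t = list(d_T)
    for i, t in enumerate(steps):
        d0_hat = predict_d0(d_t, t)
        if i + 1 == len(steps):
            return list(d0_hat)  # last step: keep the clean estimate
        a_t, a_prev = alpha_bars[t], alpha_bars[steps[i + 1]]
        # Noise implied by the current sample and the clean estimate.
        eps_hat = [(x - math.sqrt(a_t) * d) / math.sqrt(1.0 - a_t)
                   for x, d in zip(d_t, d0_hat)]
        # Re-noise the clean estimate to the next (smaller) timestep.
        d_t = [math.sqrt(a_prev) * d + math.sqrt(1.0 - a_prev) * e
               for d, e in zip(d0_hat, eps_hat)]
    return d_t


def refine_boundaries(initial, predict_d0, num_steps=1000, sampling_steps=10, seed=0):
    """Denoise a difference vector from Gaussian noise and add it to the
    initial (start, end) proposal boundaries."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    d_T = [rng.gauss(0.0, 1.0) for _ in initial]
    stride = max(num_steps // sampling_steps, 1)
    steps = list(range(num_steps - 1, -1, -stride))
    d_hat = ddim_denoise(d_T, predict_d0, alpha_bars, steps)
    return [b + d for b, d in zip(initial, d_hat)]
```

At training time, the difference vector would be the ground-truth boundaries minus the initial proposal boundaries, and add_noise would produce the noisy input for the denoiser. At inference time no ground truth is available, so refine_boundaries starts from pure noise and relies entirely on the denoiser's estimate.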

Abstract

Existing temporal action detection (TAD) methods take videos of different lengths as input and produce a fixed-length feature sequence through feature extraction and temporal downsampling, followed by action boundary localization and action classification. However, temporal downsampling often discards action information, which makes accurate action boundaries difficult to locate. To address this issue, we introduce DBR-TAD, a diffusion-based boundary refinement method for TAD. DBR-TAD recovers accurate action boundaries from noisy ones through a progressive denoising process. Its core component is the diffusion-based boundary refinement (DBR) module, which progressively converts the distributions of the uncertain, noisy action boundaries predicted by any TAD model into the distributions of accurate action boundaries. Extensive experiments demonstrate that DBR-TAD achieves state-of-the-art performance on three single-label datasets and two multi-label datasets.

Visualization on THUMOS14

Visualization on ActivityNet-1.3

Visualization on EPIC-Kitchens 100 Noun

Visualization on EPIC-Kitchens 100 Verb