Visual object tracking is a fundamental video task in computer vision. Recently, the rapidly increasing power of perception algorithms has enabled the unification of single-/multi-object and box-/mask-based tracking. Among these algorithms, the Segment Anything Model (SAM) has attracted much attention. In this report, we propose HQTrack, a framework for High Quality Tracking anything in videos. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the object to be tracked in the initial frame of a video, VMOS propagates the object masks to the current frame. The mask results at this stage are not accurate enough, since VMOS is trained on several closed-set video object segmentation (VOS) datasets and therefore has limited ability to generalize to complex scenes and corner cases. To further improve the quality of the tracking masks, a pretrained MR model is employed to refine the tracking results. As a compelling testament to the effectiveness of our paradigm, without employing any tricks such as test-time data augmentation or model ensembling, HQTrack ranks 2nd in the Visual Object Tracking and Segmentation (VOTS2023) challenge. Code and models are available at https://github.com/jiawen-zhu/HQTrack.
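The propagate-then-refine flow described above can be illustrated with a minimal sketch. It is not the HQTrack implementation: `track_video`, `copy_previous`, and `fill_row_gaps` are hypothetical names, the two callables are toy stand-ins for VMOS and MR, and continuing propagation from the coarse mask (rather than the refined one) is an assumption the abstract does not settle.

```python
from typing import Any, Callable, List, Sequence

Mask = List[List[int]]  # binary mask as nested lists; 1 marks an object pixel


def track_video(
    frames: Sequence[Any],
    init_mask: Mask,
    propagate: Callable[[Any, Mask], Mask],
    refine: Callable[[Any, Mask], Mask],
) -> List[Mask]:
    """Return one mask per frame using a propagate-then-refine loop."""
    outputs = [init_mask]  # the first-frame mask is given, not predicted
    coarse_prev = init_mask
    for frame in frames[1:]:
        coarse = propagate(frame, coarse_prev)  # stage 1: hypothetical VMOS stand-in
        outputs.append(refine(frame, coarse))   # stage 2: hypothetical MR stand-in
        coarse_prev = coarse  # assumption: propagation continues from the coarse mask
    return outputs


# Toy stand-ins so the sketch runs end to end; they are not real models.
def copy_previous(frame: Any, prev_mask: Mask) -> Mask:
    """Pretend propagation: reuse the previous mask unchanged."""
    return [row[:] for row in prev_mask]


def fill_row_gaps(frame: Any, mask: Mask) -> Mask:
    """Pretend refinement: set a 0 pixel to 1 when both horizontal neighbours are 1."""
    out = [row[:] for row in mask]
    for r, row in enumerate(mask):
        for c in range(1, len(row) - 1):
            if row[c - 1] == 1 and row[c + 1] == 1:
                out[r][c] = 1
    return out
```

Calling `track_video(frames, first_mask, copy_previous, fill_row_gaps)` returns a list with one mask per frame. Passing the two stages as callables keeps the loop independent of any particular segmenter or refiner.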
Human pose estimation is a challenging structured-data sequence modeling task. Most existing methods consider only the pair-wise interactions of human body joints during model learning. Unfortunately, this causes 3D pose estimation to fail in difficult cases such as $\textit{overlapping joints}$ and $\textit{fast-changing}$ poses, because pair-wise relations cannot exploit fine-grained human body priors. To this end, we revamp the 3D pose estimation framework with a $\textit{High-order}$ $\textit{Directed}$ $\textit{Transformer}$ (HDFormer), which coherently exploits high-order relations among bones and joints to boost pose estimation performance. Specifically, HDFormer adopts both self-attention and high-order attention schemes to build a multi-order attention module that models information flow across first-order $"\textit{joint$\leftrightarrow$joint}"$, second-order $"\textit{bone$\leftrightarrow$joint}"$, and high-order $"\textit{hyperbone$\leftrightarrow$joint}"$ relationships (a hyperbone is defined as a set of joints), improving predictions in hard cases involving fast-changing poses and heavy occlusion. Moreover, modernized CNN techniques are applied to the transformer-based architecture to speed up HDFormer, achieving a favorable trade-off between effectiveness and efficiency. We compare our model with other state-of-the-art (SOTA) models on the Human3.6M and MPI-INF-3DHP datasets. The results demonstrate that the proposed HDFormer achieves superior performance with only $\textbf{1/10}$ of the parameters and a much lower computational cost than current SOTA methods. Moreover, HDFormer can be applied to various real-world applications, enabling real-time and accurate 3D pose estimation. The source code is available at https://github.com/hyer/HDFormer.
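To make the three relationship orders concrete, the following is a rough sketch of joints attending to joint, bone, and hyperbone features. It is not the HDFormer module: forming bone and hyperbone features by mean-pooling their member joints, using keys as values with no learned projections, and fusing the three outputs by summation are all assumptions made for illustration, and every function name here is hypothetical.

```python
import math
from typing import List, Sequence

Vec = List[float]


def mean_pool(feats: Sequence[Vec], idx: Sequence[int]) -> Vec:
    """Average the feature vectors of the joints listed in idx (assumed pooling)."""
    dim = len(feats[0])
    return [sum(feats[i][d] for i in idx) / len(idx) for d in range(dim)]


def attend(queries: Sequence[Vec], keys: Sequence[Vec]) -> List[Vec]:
    """Scaled dot-product attention with values equal to keys (no learned weights)."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) / total for d in range(len(q))]
        )
    return out


def multi_order_attention(
    joints: Sequence[Vec],
    bones: Sequence[Sequence[int]],
    hyperbones: Sequence[Sequence[int]],
) -> List[Vec]:
    """Sum joint<->joint, bone<->joint, and hyperbone<->joint attention outputs."""
    bone_feats = [mean_pool(joints, b) for b in bones]        # a bone links two joints
    hyper_feats = [mean_pool(joints, h) for h in hyperbones]  # a hyperbone is a joint set
    first = attend(joints, joints)       # first-order: joint <-> joint
    second = attend(joints, bone_feats)  # second-order: bone <-> joint
    high = attend(joints, hyper_feats)   # high-order: hyperbone <-> joint
    return [[a + b + c for a, b, c in zip(f, s, h)] for f, s, h in zip(first, second, high)]
```

For a toy skeleton with three joints, `multi_order_attention(joints, [(0, 1), (1, 2)], [(0, 1, 2)])` returns one fused feature vector per joint. A trained model would replace the pooling and summation with learned components.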