Abstract:Articulated object pose estimation is a core task in embodied AI. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) failing to incorporate intrinsic kinematic constraints. In this work, we introduce DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the GT pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy, estimating the pose of each rigid part hierarchically to respect the object's kinematic structure. We validate DICArt on both synthetic and real-world datasets. Experimental results demonstrate its superior performance and robustness. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments.




Abstract:Video detection and human action recognition may be computationally expensive, and need a long time to train models. In this paper, we were intended to reduce the training time and the GPU memory usage of video detection, and achieved a competitive detection accuracy. Other research works such as Two-stream, C3D, TSN have shown excellent performance on UCF101. Here, we used a LSTM structure simply for video detection. We used a simple structure to perform a competitive top-1 accuracy on the entire validation dataset of UCF101. The LSTM structure is named Context-LSTM, since it may process the deep temporal features. The Context-LSTM may simulate the human recognition system. We cascaded the LSTM blocks in PyTorch and connected the cell state flow and hidden output flow. At the connection of the blocks, we used ReLU, Batch Normalization, and MaxPooling functions. The Context-LSTM could reduce the training time and the GPU memory usage, while keeping a state-of-the-art top-1 accuracy on UCF101 entire validation dataset, show a robust performance on video action detection.




Abstract:Artificial neural networks that simulate human achieves great successes. From the perspective of simulating human memory method, we propose a stepped sampler based on the "repeated input". We repeatedly inputted data to the LSTM model stepwise in a batch. The stepped sampler is used to strengthen the ability of fusing the temporal information in LSTM. We tested the stepped sampler on the LSTM built-in in PyTorch. Compared with the traditional sampler of PyTorch, such as sequential sampler, batch sampler, the training loss of the proposed stepped sampler converges faster in the training of the model, and the training loss after convergence is more stable. Meanwhile, it can maintain a higher test accuracy. We quantified the algorithm of the stepped sampler.