Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wen Yang

Oriented Tiny Object Detection: A Dataset, Benchmark, and Dynamic Unbiased Learning

Dec 16, 2024

Chang Xu, Ruixiang Zhang, Wen Yang, Haoran Zhu, Fang Xu, Jian Ding, Gui-Song Xia

Abstract:Detecting oriented tiny objects, which are limited in appearance information yet prevalent in real-world applications, remains an intricate and under-explored problem. To address this, we systemically introduce a new dataset, benchmark, and a dynamic coarse-to-fine learning scheme in this study. Our proposed dataset, AI-TOD-R, features the smallest object sizes among all oriented object detection datasets. Based on AI-TOD-R, we present a benchmark spanning a broad range of detection paradigms, including both fully-supervised and label-efficient approaches. Through investigation, we identify a learning bias presents across various learning pipelines: confident objects become increasingly confident, while vulnerable oriented tiny objects are further marginalized, hindering their detection performance. To mitigate this issue, we propose a Dynamic Coarse-to-Fine Learning (DCFL) scheme to achieve unbiased learning. DCFL dynamically updates prior positions to better align with the limited areas of oriented tiny objects, and it assigns samples in a way that balances both quantity and quality across different object shapes, thus mitigating biases in prior settings and sample selection. Extensive experiments across eight challenging object detection datasets demonstrate that DCFL achieves state-of-the-art accuracy, high efficiency, and remarkable versatility. The dataset, benchmark, and code are available at https://chasel-tsui.github.io/AI-TOD-R/.

Via

Access Paper or Ask Questions

Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt

Dec 08, 2024

Damien de Mijolla, Wen Yang, Philippa Duckett, Christopher Frye, Mark Worrall

Figure 1 for Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt

Figure 2 for Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt

Figure 3 for Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt

Figure 4 for Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt

Abstract:Prompting and fine-tuning have emerged as two competing paradigms for augmenting language models with new capabilities, such as the use of tools. Prompting approaches are quick to set up but rely on providing explicit demonstrations of each tool's usage in the model's prompt, thus coupling tool use to the task at hand and limiting generalisation. Fine-tuning removes the need for task-specific demonstrations of tool usage at runtime; however, this ties new capabilities to a single model, thus making already-heavier setup costs a recurring expense. In this paper, we introduce language hooks, a novel framework for augmenting language models with new capabilities that is decoupled both from the model's task-specific prompt and from the model itself. The language hook algorithm interleaves text generation by the base model with the execution of modular programs that trigger conditionally based on the existing text and the available capabilities. Upon triggering, programs may call external tools, auxiliary language models (e.g. using tool specific prompts), and modify the existing context. We benchmark our method against state-of-the-art baselines, find that it outperforms task-aware approaches, and demonstrate its ability to generalise to novel tasks.

* This work was conducted during Summer 2023. Experimental results and references reflect the state of the field at that time and may not account for subsequent developments

Via

Access Paper or Ask Questions

Tiny Object Detection with Single Point Supervision

Dec 08, 2024

Haoran Zhu, Chang Xu, Ruixiang Zhang, Fang Xu, Wen Yang, Haijian Zhang, Gui-Song Xia

Figure 1 for Tiny Object Detection with Single Point Supervision

Figure 2 for Tiny Object Detection with Single Point Supervision

Figure 3 for Tiny Object Detection with Single Point Supervision

Figure 4 for Tiny Object Detection with Single Point Supervision

Abstract:Tiny objects, with their limited spatial resolution, often resemble point-like distributions. As a result, bounding box prediction using point-level supervision emerges as a natural and cost-effective alternative to traditional box-level supervision. However, the small scale and lack of distinctive features of tiny objects make point annotations prone to noise, posing significant hurdles for model robustness. To tackle these challenges, we propose Point Teacher--the first end-to-end point-supervised method for robust tiny object detection in aerial images. To handle label noise from scale ambiguity and location shifts in point annotations, Point Teacher employs the teacher-student architecture and decouples the learning into a two-phase denoising process. In this framework, the teacher network progressively denoises the pseudo boxes derived from noisy point annotations, guiding the student network's learning. Specifically, in the first phase, random masking of image regions facilitates regression learning, enabling the teacher to transform noisy point annotations into coarse pseudo boxes. In the second phase, these coarse pseudo boxes are refined using dynamic multiple instance learning, which adaptively selects the most reliable instance from dynamically constructed proposal bags around the coarse pseudo boxes. Extensive experiments on three tiny object datasets (i.e., AI-TOD-v2, SODA-A, and TinyPerson) validate the proposed method's effectiveness and robustness against point location shifts. Notably, relying solely on point supervision, our Point Teacher already shows comparable performance with box-supervised learning methods. Codes and models will be made publicly available.

Via

Access Paper or Ask Questions

Frequency-Adaptive Low-Latency Object Detection Using Events and Frames

Dec 05, 2024

Haitian Zhang, Xiangyuan Wang, Chang Xu, Xinya Wang, Fang Xu, Huai Yu, Lei Yu, Wen Yang

Figure 1 for Frequency-Adaptive Low-Latency Object Detection Using Events and Frames

Figure 2 for Frequency-Adaptive Low-Latency Object Detection Using Events and Frames

Figure 3 for Frequency-Adaptive Low-Latency Object Detection Using Events and Frames

Figure 4 for Frequency-Adaptive Low-Latency Object Detection Using Events and Frames

Abstract:Fusing Events and RGB images for object detection leverages the robustness of Event cameras in adverse environments and the rich semantic information provided by RGB cameras. However, two critical mismatches: low-latency Events \textit{vs.}~high-latency RGB frames; temporally sparse labels in training \textit{vs.}~continuous flow in inference, significantly hinder the high-frequency fusion-based object detection. To address these challenges, we propose the \textbf{F}requency-\textbf{A}daptive Low-Latency \textbf{O}bject \textbf{D}etector (FAOD). FAOD aligns low-frequency RGB frames with high-frequency Events through an Align Module, which reinforces cross-modal style and spatial proximity to address the Event-RGB Mismatch. We further propose a training strategy, Time Shift, which enforces the module to align the prediction from temporally shifted Event-RGB pairs and their original representation, that is, consistent with Event-aligned annotations. This strategy enables the network to use high-frequency Event data as the primary reference while treating low-frequency RGB images as supplementary information, retaining the low-latency nature of the Event stream toward high-frequency detection. Furthermore, we observe that these corrected Event-RGB pairs demonstrate better generalization from low training frequency to higher inference frequencies compared to using Event data alone. Extensive experiments on the PKU-DAVIS-SOD and DSEC-Detection datasets demonstrate that our FAOD achieves SOTA performance. Specifically, in the PKU-DAVIS-SOD Dataset, FAOD achieves 9.8 points improvement in terms of the mAP in fully paired Event-RGB data with only a quarter of the parameters compared to SODFormer, and even maintains robust performance (only a 3 points drop in mAP) under 80$\times$ Event-RGB frequency mismatch.

Via

Access Paper or Ask Questions

QuadricsReg: Large-Scale Point Cloud Registration using Quadric Primitives

Dec 04, 2024

Ji Wu, Huai Yu, Shu Han, Xi-Meng Cai, Ming-Feng Wang, Wen Yang, Gui-Song Xia

Figure 1 for QuadricsReg: Large-Scale Point Cloud Registration using Quadric Primitives

Figure 2 for QuadricsReg: Large-Scale Point Cloud Registration using Quadric Primitives

Figure 3 for QuadricsReg: Large-Scale Point Cloud Registration using Quadric Primitives

Figure 4 for QuadricsReg: Large-Scale Point Cloud Registration using Quadric Primitives

Abstract:In the realm of large-scale point cloud registration, designing a compact symbolic representation is crucial for efficiently processing vast amounts of data, ensuring registration robustness against significant viewpoint variations and occlusions. This paper introduces a novel point cloud registration method, i.e., QuadricsReg, which leverages concise quadrics primitives to represent scenes and utilizes their geometric characteristics to establish correspondences for 6-DoF transformation estimation. As a symbolic feature, the quadric representation fully captures the primary geometric characteristics of scenes, which can efficiently handle the complexity of large-scale point clouds. The intrinsic characteristics of quadrics, such as types and scales, are employed to initialize correspondences. Then we build a multi-level compatibility graph set to find the correspondences using the maximum clique on the geometric consistency between quadrics. Finally, we estimate the 6-DoF transformation using the quadric correspondences, which is further optimized based on the quadric degeneracy-aware distance in a factor graph, ensuring high registration accuracy and robustness against degenerate structures. We test on 5 public datasets and the self-collected heterogeneous dataset across different LiDAR sensors and robot platforms. The exceptional registration success rates and minimal registration errors demonstrate the effectiveness of QuadricsReg in large-scale point cloud registration scenarios. Furthermore, the real-world registration testing on our self-collected heterogeneous dataset shows the robustness and generalization ability of QuadricsReg on different LiDAR sensors and robot platforms. The codes and demos will be released at \url{https://levenberg.github.io/QuadricsReg}.

* 25 pages, 17 figures

Via

Access Paper or Ask Questions

MCVO: A Generic Visual Odometry for Arbitrarily Arranged Multi-Cameras

Dec 04, 2024

Huai Yu, Junhao Wang, Yao He, Wen Yang, Gui-Song Xia

Figure 1 for MCVO: A Generic Visual Odometry for Arbitrarily Arranged Multi-Cameras

Figure 2 for MCVO: A Generic Visual Odometry for Arbitrarily Arranged Multi-Cameras

Figure 3 for MCVO: A Generic Visual Odometry for Arbitrarily Arranged Multi-Cameras

Figure 4 for MCVO: A Generic Visual Odometry for Arbitrarily Arranged Multi-Cameras

Abstract:Making multi-camera visual SLAM systems easier to set up and more robust to the environment is always one of the focuses of vision robots. Existing monocular and binocular vision SLAM systems have narrow FoV and are fragile in textureless environments with degenerated accuracy and limited robustness. Thus multi-camera SLAM systems are gaining attention because they can provide redundancy for texture degeneration with wide FoV. However, current multi-camera SLAM systems face massive data processing pressure and elaborately designed camera configurations, leading to estimation failures for arbitrarily arranged multi-camera systems. To address these problems, we propose a generic visual odometry for arbitrarily arranged multi-cameras, which can achieve metric-scale state estimation with high flexibility in the cameras' arrangement. Specifically, we first design a learning-based feature extraction and tracking framework to shift the pressure of CPU processing of multiple video streams. Then we use the rigid constraints between cameras to estimate the metric scale poses for robust SLAM system initialization. Finally, we fuse the features of the multi-cameras in the SLAM back-end to achieve robust pose estimation and online scale optimization. Additionally, multi-camera features help improve the loop detection for pose graph optimization. Experiments on KITTI-360 and MultiCamData datasets validate the robustness of our method over arbitrarily placed cameras. Compared with other stereo and multi-camera visual SLAM systems, our method obtains higher pose estimation accuracy with better generalization ability. Our codes and online demos are available at \url{https://github.com/JunhaoWang615/MCVO}

* 8 pages, 8 figures

Via

Access Paper or Ask Questions

ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

Nov 29, 2024

Wanyue Zhang, Ziyong Li, Wen Yang, Chunlin Leng, Yinan Bai, Qianlong Du, Chengqing Zong, Jiajun Zhang

Figure 1 for ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

Figure 2 for ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

Figure 3 for ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

Figure 4 for ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

Abstract:During the development of large language models (LLMs), pre-training data play a critical role in shaping LLMs' capabilities. In recent years several large-scale and high-quality pre-training datasets have been released to accelerate the research of LLMs, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, focus has increasingly shifted to domain-specific capabilities and safety concerns, making those previous coarse-grained texts insufficient for meeting training requirements. Furthermore, fine-grained information, such as quality, domain and toxicity, is becoming increasingly important in building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper we propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information. First, we employ manually crafted rules to discard explicit noisy texts from raw contents. Second, the quality evaluation model, domain classifier, and toxicity evaluation model are well-designed to assess the remaining cleaned data respectively. Finally, we integrate these three types of fine-grained information for each text. With this approach, we release the largest, high-quality and fine-grained Chinese text ChineseWebText2.0, which consists of 3.8TB and each text is associated with a quality score, domain labels, a toxicity label and a toxicity score, facilitating the LLM researchers to select data based on various types of fine-grained information. The data, codes and the tool-chain are available on this website https://github.com/CASIA-LM/ChineseWebText-2.0

* ChineseWebTex2.0 dataset is available at https://github.com/CASIA-LM/ChineseWebText-2.0

Via

Access Paper or Ask Questions

Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Nov 22, 2024

Haoyuan Li, Chang Xu, Wen Yang, Li Mi, Huai Yu, Haijian Zhang

Figure 1 for Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Figure 2 for Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Figure 3 for Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Figure 4 for Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Abstract:Unmanned Aerial Vehicle (UAV) Cross-View Geo-Localization (CVGL) presents significant challenges due to the view discrepancy between oblique UAV images and overhead satellite images. Existing methods heavily rely on the supervision of labeled datasets to extract viewpoint-invariant features for cross-view retrieval. However, these methods have expensive training costs and tend to overfit the region-specific cues, showing limited generalizability to new regions. To overcome this issue, we propose an unsupervised solution that lifts the scene representation to 3d space from UAV observations for satellite image generation, providing robust representation against view distortion. By generating orthogonal images that closely resemble satellite views, our method reduces view discrepancies in feature representation and mitigates shortcuts in region-specific image pairing. To further align the rendered image's perspective with the real one, we design an iterative camera pose updating mechanism that progressively modulates the rendered query image with potential satellite targets, eliminating spatial offsets relative to the reference images. Additionally, this iterative refinement strategy enhances cross-view feature invariance through view-consistent fusion across iterations. As such, our unsupervised paradigm naturally avoids the problem of region-specific overfitting, enabling generic CVGL for UAV images without feature fine-tuning or data-driven training. Experiments on the University-1652 and SUES-200 datasets demonstrate that our approach significantly improves geo-localization accuracy while maintaining robustness across diverse regions. Notably, without model fine-tuning or paired training, our method achieves competitive performance with recent supervised methods.

* 13 pages

Via

Access Paper or Ask Questions

Markov Chain of Thought for Efficient Mathematical Reasoning

Oct 23, 2024

Wen Yang, Kai Fan, Minpeng Liao

Figure 1 for Markov Chain of Thought for Efficient Mathematical Reasoning

Figure 2 for Markov Chain of Thought for Efficient Mathematical Reasoning

Figure 3 for Markov Chain of Thought for Efficient Mathematical Reasoning

Figure 4 for Markov Chain of Thought for Efficient Mathematical Reasoning

Abstract:Chain of Thought (CoT) of multi-step benefits from the logical structure of the reasoning steps and task-specific actions, significantly enhancing the mathematical reasoning capabilities of large language models. As the prevalence of long CoT, the number of reasoning steps exceeds manageable token limits and leads to higher computational demands. Inspired by the fundamental logic of human cognition, ``derive, then reduce'', we conceptualize the standard multi-step CoT as a novel Markov Chain of Thought (MCoT). In this study, we consider the mathematical reasoning task, defining each reasoning step as text accompanied by a Python code snippet. To facilitate a longer reasoning path, self-correction is enabled through interactions with the code interpreter. Our MCoT aims to compress previous reasoning steps into a simplified question, enabling efficient next-step inference without relying on a lengthy KV cache. In our experiments, we curate the \texttt{MCoTInstruct} dataset, and the empirical results indicate that MCoT not only significantly enhances efficiency but also maintains comparable accuracy. While much remains to be explored, this work paves the way for exploring the long CoT reasoning abilities of LLMs.

* Work in progress

Via

Access Paper or Ask Questions

Language Imbalance Driven Rewarding for Multilingual Self-improving

Oct 11, 2024

Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang

Abstract:Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited "first-class" languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLM in a self-improving manner. Thus, we propose $\textit{Language Imbalance Driven Rewarding}$, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language's capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs.

* Work in progress

Via

Access Paper or Ask Questions