Abstract:Enhancing model interpretability can address spurious correlations by revealing how models arrive at their predictions. Concept Bottleneck Models (CBMs) provide a principled way of disclosing and guiding model behaviors through human-understandable concepts, albeit at a high cost in human annotation effort. In this paper, we leverage a synergy of multiple foundation models to construct CBMs with nearly no human effort. We discover undesirable biases in CBMs built on pre-trained models and propose a novel framework designed to exploit pre-trained models while being immune to these biases, thereby reducing vulnerability to spurious correlations. Specifically, our method offers a seamless pipeline that adopts foundation models for assessing potential spurious correlations in datasets, annotating concepts for images, and refining the annotations for improved robustness. We evaluate the proposed method on multiple datasets, and the results demonstrate its effectiveness in reducing model reliance on spurious correlations while preserving its interpretability.
Abstract:This paper presents a novel framework for benchmarking hierarchical gender hiring bias in Large Language Models (LLMs) for resume scoring, revealing significant issues of reverse bias and overdebiasing. Our contributions are fourfold: First, we introduce a framework using a real, anonymized resume dataset from the Healthcare, Finance, and Construction industries, meticulously curated to avoid confounding factors. It evaluates gender hiring biases across hierarchical levels, including Level bias, Spread bias, Taste-based bias, and Statistical bias. This framework generalizes easily to other social traits and tasks. Second, we propose novel statistical and computational hiring bias metrics based on a counterfactual approach, including Rank After Scoring (RAS), Rank-based Impact Ratio, Permutation Test-Based Metrics, and Fixed Effects Model-based Metrics. These metrics, rooted in labor economics, NLP, and law, enable holistic evaluation of hiring biases. Third, we analyze hiring biases in ten state-of-the-art LLMs. Six out of ten LLMs show significant biases against males in healthcare and finance. An industry-effect regression reveals that the healthcare industry is the most biased against males. GPT-4o and GPT-3.5 are the most biased models, showing significant bias in all three industries. Conversely, Gemini-1.5-Pro, Llama3-8b-Instruct, and Llama3-70b-Instruct are the least biased. The hiring bias of all LLMs, except for Llama3-8b-Instruct and Claude-3-Sonnet, remains consistent regardless of random expansion or reduction of resume content. Finally, we offer a user-friendly demo to facilitate adoption and practical application of the framework.
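To make the counterfactual, permutation-test idea in this abstract concrete, the following is a minimal sketch and not the authors' metric. It assumes each resume is scored twice by an LLM, once with a male-coded and once with a female-coded variant, and runs a generic paired sign-flip permutation test on the score differences. The function name, parameters, and the choice of the mean difference as the test statistic are illustrative assumptions.

```python
import random


def paired_permutation_pvalue(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    scores_a[i] and scores_b[i] are assumed to be the scores of the same
    resume under two counterfactual variants (hypothetical setup).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the variant labels are exchangeable,
        # so each difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

A small p-value would suggest that the two variants are scored differently on average; it says nothing by itself about the hierarchical bias levels the framework distinguishes.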
Abstract:Neuromorphic vision sensors or event cameras have made visual perception with extremely low reaction time possible, opening new avenues for high-dynamic robotics applications. These event cameras' output is dependent on both motion and texture. However, the event camera fails to capture object edges that are parallel to the camera motion. This is a problem intrinsic to the sensor and therefore challenging to solve algorithmically. Human vision deals with perceptual fading using the active mechanism of small involuntary eye movements, the most prominent ones called microsaccades. By moving the eyes constantly and slightly during fixation, microsaccades can substantially maintain texture stability and persistence. Inspired by microsaccades, we designed an event-based perception system capable of simultaneously maintaining low reaction time and stable texture. In this design, a rotating wedge prism was mounted in front of the aperture of an event camera to redirect light and trigger events. The geometrical optics of the rotating wedge prism allows for algorithmic compensation of the additional rotational motion, resulting in a stable texture appearance and high informational output independent of external motion. The hardware device and software solution are integrated into a system, which we call Artificial MIcrosaccade-enhanced EVent camera (AMI-EV). Benchmark comparisons validate the superior data quality of AMI-EV recordings in scenarios where both standard cameras and event cameras fail to deliver. Various real-world experiments demonstrate the potential of the system to facilitate robotics perception for both low-level and high-level vision tasks.
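The compensation step described in this abstract can be pictured with a simplified first-order model; this is a sketch under stated assumptions, not the AMI-EV algorithm. It assumes the rotating wedge prism shifts the whole image along a circle of fixed radius (in pixels) at a known constant angular velocity, so each event can be mapped back by subtracting the shift at its timestamp. The names `prism_offset`, `compensate_event`, `radius_px`, `omega`, and `phase` are hypothetical.

```python
import math


def prism_offset(t, radius_px, omega, phase=0.0):
    """Image shift (dx, dy) at time t under the assumed circular model."""
    angle = omega * t + phase
    return radius_px * math.cos(angle), radius_px * math.sin(angle)


def compensate_event(x, y, t, radius_px, omega, phase=0.0):
    """Map an observed event position back to a prism-free image position."""
    dx, dy = prism_offset(t, radius_px, omega, phase)
    return x - dx, y - dy
```

Under this model a static scene point produces events along a circle on the sensor, and all of them map back to the same compensated location. A real system would also need calibration of the radius, phase, and any lens distortion.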
Abstract:Accurate traffic flow prediction is crucial for optimizing traffic management, enhancing road safety, and reducing environmental impacts. Existing models face challenges with long sequence data, requiring substantial memory and computational resources, and often suffer from slow inference times due to the lack of a unified summary state. This paper introduces ST-MambaSync, an innovative traffic flow prediction model that combines transformer technology with the ST-Mamba block, representing a significant advancement in the field. We are the first to employ the Mamba mechanism, which is an attention mechanism integrated with ResNet, within a transformer framework; this significantly enhances the model's explainability and performance. ST-MambaSync effectively addresses key challenges such as data length and computational efficiency, setting new benchmarks for accuracy and processing speed, as demonstrated by comprehensive comparative analysis. This development has significant implications for urban planning and real-time traffic management, establishing a new standard in traffic flow prediction technology.
Abstract:Balancing accuracy with computational efficiency is paramount in machine learning, particularly when dealing with high-dimensional data, such as spatial-temporal datasets. This study introduces ST-MambaSync, an innovative framework that integrates a streamlined attention layer with a simplified state-space layer. The model achieves competitive accuracy in spatial-temporal prediction tasks. We delve into the relationship between attention mechanisms and the Mamba component, revealing that Mamba functions akin to attention within a residual network structure. This comparative analysis underpins the efficiency of state-space models, elucidating their capability to deliver superior performance at reduced computational costs.
Abstract:Accurate and efficient traffic prediction is crucial for planning, management, and control of intelligent transportation systems. Most state-of-the-art methods for traffic prediction effectively make both long-term and short-term predictions by employing spatio-temporal neural networks as prediction models, together with transformers to learn global information on prediction objects (e.g., traffic states of road segments). However, these methods often have a high computational cost to obtain good performance. This paper introduces an innovative approach to traffic flow prediction, the Spatial-Temporal Selective State Space Model (ST-SSMs), featuring the novel ST-Mamba block, which can achieve good prediction accuracy with less computational cost. A comparative analysis highlights the ST-Mamba layer's efficiency, revealing its equivalence to three attention layers, yet with markedly reduced processing time. Through rigorous testing on diverse real-world datasets, the ST-SSMs model demonstrates exceptional improvements in prediction accuracy and computational simplicity, setting new benchmarks in the domain of traffic flow forecasting.
Abstract:Efficient text-to-image generation remains a challenging task due to the high computational costs associated with the multi-step sampling in diffusion models. Although distillation of pre-trained diffusion models has been successful in reducing sampling steps, low-step image generation often falls short in terms of quality. In this study, we propose a novel sampling design to achieve high-quality one-step image generation aligning with human preferences, particularly focusing on exploring the impact of the prior noise distribution. Our approach, Prompt Adaptive Human Preference Inversion (PAHI), optimizes the noise distributions for each prompt based on human preferences without the need for fine-tuning diffusion models. Our experiments showcase that the tailored noise distributions significantly improve image quality with only a marginal increase in computational cost. Our findings underscore the importance of noise optimization and pave the way for efficient and high-quality text-to-image synthesis.
Abstract:Accurate and effective traffic forecasting is vital for smart traffic systems and crucial in urban traffic planning and management. Current Spatio-Temporal Transformer models, despite their prediction capabilities, struggle to balance computational efficiency and accuracy, favor global over local information, and handle spatial and temporal data separately, which limits insight into complex interactions. We introduce the Criss-Crossed Dual-Stream Enhanced Rectified Transformer model (CCDSReFormer), which includes three innovative modules: Enhanced Rectified Spatial Self-attention (ReSSA), Enhanced Rectified Delay Aware Self-attention (ReDASA), and Enhanced Rectified Temporal Self-attention (ReTSA). These modules aim to lower computational needs via sparse attention, focus on local information for better traffic dynamics understanding, and merge spatial and temporal insights through a unique learning method. Extensive tests on six real-world datasets highlight CCDSReFormer's superior performance. An ablation study also confirms the significant impact of each component on the model's predictive accuracy, showcasing our model's ability to forecast traffic flow effectively.
Abstract:E-commerce platforms usually present an ordered list, mixed with several organic items and an advertisement, in response to each user's page view request. This list, the outcome of ad auction and allocation processes, directly impacts the platform's ad revenue and gross merchandise volume (GMV). Specifically, the ad auction determines which ad is displayed and the corresponding payment, while the ad allocation decides the display positions of the advertisement and organic items. The prevalent methods of segregating the ad auction and allocation into two distinct stages face two problems: 1) Ad auction does not consider externalities, such as the influence of actual display position and context on ad Click-Through Rate (CTR); 2) The ad allocation, which utilizes the auction-winning ad's payment to determine the display position dynamically, fails to maintain incentive compatibility (IC) for the advertisement. For instance, in the auction stage employing the traditional Generalized Second Price (GSP), even if the winning ad increases its bid, its payment remains unchanged. This implies that the advertisement cannot secure a better position and thus loses the opportunity to achieve higher utility in the subsequent ad allocation stage. Previous research often focused on one of the two stages, neglecting the two-stage problem, which may result in suboptimal outcomes...
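The GSP observation in this abstract (a winning ad that raises its bid pays the same amount) can be checked with a small sketch. It assumes the simplest setting, a single ad slot ranked by bid alone with no quality scores or reserve price, where the winner pays the second-highest bid. The function name and the example bids are hypothetical.

```python
def gsp_single_slot(bids):
    """Return (winner, payment) for a one-slot GSP auction ranked by bid only.

    bids maps an advertiser name to its bid. The winner is the highest
    bidder and pays the second-highest bid (simplifying assumption: no
    quality scores, no reserve price, no ties).
    """
    if len(bids) < 2:
        raise ValueError("need at least two bidders")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    payment = ranked[1][1]
    return winner, payment
```

For example, with bids of 5, 3, and 1, the top bidder wins and pays 3; raising the top bid to 9 still yields a payment of 3, so an allocation stage that keys the display position on payment cannot reward the higher bid.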
Abstract:Human pose estimation is a critical component in autonomous driving and parking, enhancing safety by predicting human actions. Traditional frame-based cameras and videos are commonly applied, yet they become less reliable in scenarios under high dynamic range or heavy motion blur. In contrast, event cameras offer a robust solution for navigating these challenging contexts. Predominant methodologies incorporate event cameras into learning frameworks by accumulating events into event frames. However, such methods tend to marginalize the intrinsic asynchronous and high temporal resolution characteristics of events. This disregard leads to a loss in essential temporal dimension data, crucial for safety-critical tasks associated with dynamic human activities. To address this issue and to unlock the 3D potential of event information, we introduce two 3D event representations: the Rasterized Event Point Cloud (RasEPC) and the Decoupled Event Voxel (DEV). The RasEPC collates events within concise temporal slices at identical positions, preserving 3D attributes with statistical cues and markedly mitigating memory and computational demands. Meanwhile, the DEV representation discretizes events into voxels and projects them across three orthogonal planes, utilizing decoupled event attention to retrieve 3D cues from the 2D planes. Furthermore, we develop and release EV-3DPW, a synthetic event-based dataset crafted to facilitate training and quantitative analysis in outdoor scenes. On the public real-world DHP19 dataset, our event point cloud technique excels in real-time mobile predictions, while the decoupled event voxel method achieves the highest accuracy. Experiments reveal our proposed 3D representation methods' superior generalization capacities against traditional RGB images and event frame techniques. Our code and dataset are available at https://github.com/MasterHow/EventPointPose.
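The rasterization idea behind RasEPC, as described in this abstract, can be sketched as grouping events that share a pixel within the same short time slice and replacing each group with one point carrying summary statistics. This is an illustrative sketch only: the abstract does not say which statistics are kept, so the mean timestamp, polarity sum, and event count used here are assumptions, as are the function name and the `(x, y, t, p)` event layout.

```python
from collections import defaultdict


def rasterize_events(events, slice_duration):
    """Collapse events at the same pixel and time slice into single points.

    events: iterable of (x, y, t, p) with polarity p in {-1, +1}.
    Returns a list of (x, y, mean_t, polarity_sum, count), ordered by
    time slice, then x, then y.
    """
    groups = defaultdict(list)
    for x, y, t, p in events:
        slice_index = int(t // slice_duration)
        groups[(slice_index, x, y)].append((t, p))

    points = []
    for (_, x, y), items in sorted(groups.items()):
        count = len(items)
        mean_t = sum(t for t, _ in items) / count
        polarity_sum = sum(p for _, p in items)
        points.append((x, y, mean_t, polarity_sum, count))
    return points
```

Because many raw events collapse into one point per pixel and slice, the point cloud is smaller than the raw stream while still keeping a time coordinate, which is the property the abstract attributes to the representation.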