On-device machine learning (ML) moves computation from the cloud to personal devices, protecting user privacy and enabling intelligent user experiences. However, fitting models on devices with limited resources presents a major technical challenge: practitioners need to optimize models and balance hardware metrics such as model size, latency, and power. To help practitioners create efficient ML models, we designed and developed Talaria: a model visualization and optimization system. Talaria enables practitioners to compile models to hardware, interactively visualize model statistics, and simulate optimizations to test their impact on inference metrics. Since its internal deployment two years ago, we have evaluated Talaria using three methodologies: (1) a log analysis highlighting its growth to 800+ practitioners who have submitted 3,600+ models; (2) a usability survey with 26 users assessing the utility of 20 Talaria features; and (3) qualitative interviews with the 7 most active users about their experience using Talaria.
Recent camera-based 3D object detection is limited by the precision of the transformation from image to 3D feature space, as well as by the accuracy of object localization within the 3D space. This paper addresses a fundamental problem of camera-based 3D object detection: how to effectively learn depth information for accurate feature lifting and object localization. Unlike previous methods, which directly predict depth distributions with a supervised estimation model, we propose a cascade framework consisting of two depth-aware learning paradigms. First, a depth estimation (DE) scheme leverages relative depth information to realize effective feature lifting from 2D to 3D space. Second, a depth calibration (DC) scheme introduces depth reconstruction to further adjust 3D object localization perturbations along the depth axis. In practice, DE is realized explicitly through both absolute and relative depth optimization losses that improve the precision of depth prediction, while the capability of DC is embedded implicitly into the detection Transformer through a depth denoising mechanism during training. The entire model is trained end to end. We propose a baseline detector and evaluate the effectiveness of our proposal, obtaining +2.2%/+2.7% NDS/mAP improvements on the NuScenes benchmark and a competitive performance of 55.9%/45.7% NDS/mAP. Furthermore, we conduct extensive experiments to demonstrate its generality across various detectors, with improvements of about +2% NDS.
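The abstract does not spell out the exact form of the DE losses; the following is a minimal sketch of how an absolute regression term and a relative (pairwise-difference) term might be combined, where the function name, the L1 choice, and the random-pairing scheme are all our own assumptions:

```python
import torch
import torch.nn.functional as F

def depth_losses(pred_depth, gt_depth, num_pairs=1024):
    """Hypothetical absolute + relative depth objective.

    pred_depth, gt_depth: (N,) per-object or per-pixel depths.
    """
    # Absolute term: direct regression to ground-truth depth.
    l_abs = F.l1_loss(pred_depth, gt_depth)

    # Relative term: preserve pairwise depth differences between
    # randomly sampled pairs, so relative ordering and offsets along
    # the depth axis are learned even when absolute scale is uncertain.
    i = torch.randint(0, pred_depth.numel(), (num_pairs,))
    j = torch.randint(0, pred_depth.numel(), (num_pairs,))
    l_rel = F.l1_loss(pred_depth[i] - pred_depth[j],
                      gt_depth[i] - gt_depth[j])

    return l_abs + l_rel
```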
Autonomous robot navigation in dynamic, unknown environments is crucial for mobile robotic applications, including last-mile delivery and automated supply delivery in industrial and hospital settings. Current solutions still suffer from limitations: robots cannot recognize unknown objects in real time and cannot navigate freely in dynamic, narrow, and complex environments. We propose a complete software framework for autonomous robot perception and navigation among very dense obstacles and dense human crowds. First, we propose a framework that accurately detects and segments open-world object categories in a zero-shot manner, overcoming the over-segmentation limitation of the current SAM model. Second, we propose a distillation strategy that transfers the knowledge needed to segment the free space of the walkway for robot navigation without labels. In addition, we design a trimming strategy that works collaboratively with distillation to enable lightweight inference, so that the neural network can be deployed on edge devices such as the NVIDIA TX2 or Xavier NX during autonomous navigation. Integrated into the robot navigation system, our framework achieves superior accuracy and efficiency in both robot scene perception and autonomous robot navigation, as demonstrated by extensive experiments.
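The abstract leaves the distillation objective unspecified; a minimal label-free distillation sketch, assuming a frozen open-world teacher and a lightweight student that both output per-pixel segmentation logits (all names below are ours), could look like this:

```python
import torch
import torch.nn.functional as F

def distill_free_space(student, teacher, images, T=2.0):
    """Label-free distillation sketch (hypothetical): `student` and
    `teacher` are segmentation networks returning (B, C, H, W) logits."""
    with torch.no_grad():
        t_logits = teacher(images)          # frozen open-world teacher
    s_logits = student(images)              # lightweight student

    # Soften both distributions and match them pixel-wise with KL
    # divergence, so the student learns free-space masks from the
    # teacher's predictions without any manual labels.
    t_prob = F.softmax(t_logits / T, dim=1)
    s_logp = F.log_softmax(s_logits / T, dim=1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * (T * T)
```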
Consumer credit services offered by e-commerce platforms provide customers with convenient loan access during shopping and have the potential to stimulate sales. To understand the causal impact of credit lines on spending, previous studies have employed causal estimators based on direct regression (DR), inverse propensity weighting (IPW), and double machine learning (DML) to estimate the treatment effect. However, these estimators do not consider the notion that an individual's spending can be understood and represented as a distribution, which captures the range and pattern of amounts spent across different orders. By treating the outcome as a scalar rather than a distribution, valuable insights embedded within the outcome distribution may be overlooked. This paper develops a distribution-valued estimator framework that extends existing real-valued DR-, IPW-, and DML-based estimators to distribution-valued estimators within Rubin's causal framework. We establish their consistency and apply them to a real dataset from a large e-commerce platform. Our findings reveal that credit lines positively influence spending across all quantiles; however, as credit lines increase, consumers allocate more to luxuries (higher quantiles) than to necessities (lower quantiles).
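As a concrete illustration of the distribution-valued idea (not the paper's exact estimator), one can apply IPW quantile-by-quantile to each individual's empirical spending distribution; all names below are hypothetical and a binary treatment is assumed for simplicity:

```python
import numpy as np

def ipw_distributional_effect(spend_lists, treated, propensity,
                              qs=np.linspace(0.05, 0.95, 19)):
    """Sketch of a distribution-valued IPW estimator.

    spend_lists: list of arrays, one per individual (spend per order);
                 each unit's outcome is its empirical quantile function.
    treated:     (n,) 0/1 treatment indicator (e.g. credit line granted).
    propensity:  (n,) estimated P(T=1 | X).
    Returns the estimated effect curve tau(q) over quantiles qs.
    """
    # Represent each individual's spending as a quantile function.
    Q = np.stack([np.quantile(s, qs) for s in spend_lists])  # (n, |qs|)

    # Hajek-style IPW averaging, applied quantile-by-quantile.
    w1 = treated / propensity
    w0 = (1 - treated) / (1 - propensity)
    mu1 = (w1[:, None] * Q).sum(0) / w1.sum()
    mu0 = (w0[:, None] * Q).sum(0) / w0.sum()

    # Positive values mean the treatment raises spending at that quantile.
    return mu1 - mu0
```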
When capturing colored document images, e.g., posters and magazines, multiple degradations such as shadows and wrinkles are often introduced simultaneously by external factors. Restoring multi-degraded colored document images is a significant yet overlooked challenge, as most existing algorithms focus on enhancing document images via binarization while ignoring color. We therefore propose DocStormer, a novel algorithm designed to restore multi-degraded colored documents to their potential pristine PDF quality. The contributions are fourfold. First, we propose a "Perceive-then-Restore" paradigm with a reinforced transformer block, which more effectively encodes and exploits the distribution of degradations. Second, we are the first to use a GAN and pristine PDF magazine images to narrow the distribution gap between the enhanced results and PDF images, in pursuit of fewer residual degradations and better visual quality. Third, we propose a non-parametric strategy, PFILI, which enables training at a smaller scale and testing at larger resolutions with an acceptable trade-off in detail, while saving memory and inference time. Fourth, we are the first to propose a Multi-Degraded Colored Document image Enhancing dataset, named MD-CDE, for both training and evaluation. Experimental results show that DocStormer exhibits superior performance, revitalizing multi-degraded colored documents into their potentially pristine digital versions and filling the current academic gap from the perspectives of method, data, and task.
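The abstract does not give the adversarial objective; below is a minimal sketch of the standard GAN losses that could be used to narrow the gap between enhanced outputs and pristine PDF pages (the function and variable names are ours, not the paper's):

```python
import torch
import torch.nn.functional as F

def adversarial_losses(discriminator, enhanced, pristine_pdf):
    """Sketch of standard GAN losses for distribution-gap narrowing:
    real samples = pristine PDF renders, fake samples = restored images."""
    # Discriminator: tell pristine PDF pages apart from enhanced outputs.
    d_real = discriminator(pristine_pdf)
    d_fake = discriminator(enhanced.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # Restoration network: fool the discriminator, so enhanced results
    # drift toward the distribution of pristine PDF images.
    g_fake = discriminator(enhanced)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss
```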
In this paper, we propose a novel training strategy called SupFusion, which provides auxiliary feature-level supervision for effective LiDAR-Camera fusion and significantly boosts detection performance. Our strategy involves a data enhancement method named Polar Sampling, which densifies sparse objects and trains an assistant model to generate high-quality features as the supervision. These features are then used to train the LiDAR-Camera fusion model, where the fusion feature is optimized to mimic the generated high-quality features. Furthermore, we propose a simple yet effective deep fusion module, which consistently achieves superior performance compared with previous fusion methods under the SupFusion strategy. Our proposal has the following advantages. First, SupFusion introduces auxiliary feature-level supervision that can boost LiDAR-Camera detection performance without introducing extra inference cost. Second, the proposed deep fusion can continuously improve the detector's abilities. Our proposed SupFusion and deep fusion module are plug-and-play, and we conduct extensive experiments to demonstrate their effectiveness. Specifically, we gain around 2% 3D mAP improvement on the KITTI benchmark across multiple LiDAR-Camera 3D detectors.
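The exact form of the supervision term is not given in the abstract; assuming a simple feature-matching formulation, the training objective might look like the following sketch, in which the assistant model's features are used only at training time:

```python
import torch.nn.functional as F

def supfusion_loss(fusion_feat, hq_feat, det_loss, lam=1.0):
    """Sketch of SupFusion-style training (the MSE form is our assumption).

    fusion_feat: feature from the LiDAR-Camera fusion model.
    hq_feat:     high-quality feature from the assistant model trained on
                 Polar-Sampling-densified data (frozen, training-time only).
    det_loss:    the fusion detector's usual detection loss.
    """
    # Auxiliary feature-level supervision: pull the fused feature toward
    # the assistant's feature. The assistant is dropped at inference,
    # so no extra runtime cost is introduced.
    aux = F.mse_loss(fusion_feat, hq_feat.detach())
    return det_loss + lam * aux
```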
Visual localization plays a critical role in the functionality of low-cost autonomous mobile robots. Current state-of-the-art approaches to accurate visual localization are 3D scene-specific, requiring additional computational and storage resources to construct a 3D scene model for each new environment. The alternative of directly using a database of 2D images for visual localization offers more flexibility, but such methods currently suffer from limited localization accuracy. In this paper, we propose an accurate and robust multiple-checking-based, 3D-model-free visual localization system to address these issues. To ensure high accuracy, we focus on estimating the pose of a query image relative to the retrieved database images using 2D-2D feature matches. Theoretically, by incorporating the local planar motion constraint into both the essential matrix estimation and the triangulation stages, we reduce the minimum number of feature matches required for absolute pose estimation, thereby enhancing the robustness of outlier rejection. Additionally, we introduce a multiple-checking mechanism to verify the correctness of the solution throughout the solving process. For validation, qualitative and quantitative experiments are performed on a simulation dataset and two real-world datasets, and the results demonstrate a significant enhancement in both accuracy and robustness afforded by the proposed 3D-model-free visual localization system.
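The abstract does not state the constraint explicitly; as a standard illustration in our own notation (assuming the camera's $y$-axis is aligned with the vertical), planar motion restricts the relative pose to a yaw rotation and an in-plane translation, which sparsifies the essential matrix:

$$
R=\begin{pmatrix}\cos\theta & 0 & \sin\theta\\ 0 & 1 & 0\\ -\sin\theta & 0 & \cos\theta\end{pmatrix},\qquad
t=(t_x,\,0,\,t_z)^\top,\qquad
E=[t]_\times R=\begin{pmatrix}0 & -t_z & 0\\ t_z\cos\theta+t_x\sin\theta & 0 & t_z\sin\theta-t_x\cos\theta\\ 0 & t_x & 0\end{pmatrix}.
$$

Since $E$ is defined only up to scale, it retains just two degrees of freedom ($\theta$ and the translation direction), so far fewer correspondences than the general five-point case are needed, leaving more matches available for outlier rejection and multiple checking.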
Time series forecasting is prevalent in various real-world applications. Despite the promising results of deep learning models in time series forecasting, especially Recurrent Neural Networks (RNNs), explanations of time series models, which are critical in high-stakes applications, have received little attention. In this paper, we propose a Decomposition-based Linear Explainable LSTM (DeLELSTM) to improve the interpretability of the LSTM. Conventional work on RNN interpretability concentrates only on variable importance and time importance; we additionally distinguish between the instantaneous influence of newly arriving data and the long-term effect of historical data. Specifically, DeLELSTM consists of two components, a standard LSTM and a tensorized LSTM. The tensorized LSTM assigns each variable a unique hidden state, and together these make up a matrix $\mathbf{h}_t$, while the standard LSTM models all the variables with a shared hidden state $\mathbf{H}_t$. By decomposing $\mathbf{H}_t$ into a linear combination of the past information $\mathbf{h}_{t-1}$ and the fresh information $\mathbf{h}_{t}-\mathbf{h}_{t-1}$, we obtain the instantaneous influence and the long-term effect of each variable. In addition, the linear-regression formulation makes the explanations transparent and clear. We demonstrate the effectiveness and interpretability of DeLELSTM on three empirical datasets. Extensive experiments show that the proposed method achieves competitive performance against the baseline methods and provides reliable explanations consistent with domain knowledge.
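Concretely, the decomposition described above can be written as (the per-variable coefficient notation is ours; the abstract does not fix it):

$$
\mathbf{H}_t \;\approx\; \sum_{j=1}^{d}\left[\alpha_t^{j}\,\mathbf{h}_{t-1}^{j} \;+\; \beta_t^{j}\left(\mathbf{h}_t^{j}-\mathbf{h}_{t-1}^{j}\right)\right],
$$

where $\mathbf{h}_t^{j}$ is the tensorized LSTM's hidden state for variable $j$, $\alpha_t^{j}$ captures the long-term effect of that variable's history, and $\beta_t^{j}$ captures the instantaneous influence of its newest observation; fitting these weights by linear regression at each step is what keeps the explanation transparent.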
Semantic segmentation in complex scenes relies not only on local object appearance but also on object locations and the surrounding environment. Nonetheless, modeling long-range context as pairwise point correlations is difficult due to its huge computational cost for large-scale point clouds. In this paper, we propose to use regions, rather than fine-grained points or voxels, as the intermediate representation of point clouds to reduce the computational burden. We introduce a novel Region-Enhanced Feature Learning network (REFL-Net) that leverages region correlations to enhance the features of ambiguous points. We design a Region-based Feature Enhancement (RFE) module, which consists of a Semantic-Spatial Region Extraction (SSRE) stage and a Region Dependency Modeling (RDM) stage. In the SSRE stage, we group the input points into a set of regions according to point distances in both semantic and spatial space. In the RDM stage, we explore region-wise semantic and spatial relationships via a self-attention block over region features and fuse point features with the region features to obtain more discriminative representations. The proposed RFE module is plug-and-play and can be integrated with common semantic segmentation backbones. We conduct extensive experiments on the ScanNetv2 and S3DIS datasets and evaluate our RFE module with different segmentation backbones. REFL-Net achieves mIoU gains of 1.8% on ScanNetv2 and 1.0% on S3DIS with negligible computational cost relative to the backbone networks. Both quantitative and qualitative results show the powerful long-range context modeling ability and strong generalization of our REFL-Net.
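The following sketch illustrates the RDM idea in PyTorch (interfaces and names are our assumptions, not the paper's code): attention runs over a small set of $R$ region features rather than $N$ points, so long-range context costs $O(R^2)$ instead of $O(N^2)$:

```python
import torch
import torch.nn as nn

class RegionDependencyModeling(nn.Module):
    """Hypothetical RDM-style block: self-attention over region features,
    then fusing each point with its context-enhanced region."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, point_feats, region_ids, num_regions):
        # point_feats: (N, C); region_ids: (N,) from an SSRE-like grouping.
        C = point_feats.shape[1]
        region_feats = torch.zeros(num_regions, C, device=point_feats.device)
        # Average-pool point features into their regions.
        region_feats.index_reduce_(0, region_ids, point_feats,
                                   "mean", include_self=False)

        # Region-wise self-attention: R regions instead of N points.
        r = region_feats.unsqueeze(0)
        r, _ = self.attn(r, r, r)
        r = r.squeeze(0)

        # Fuse each point with its enhanced region feature.
        return self.fuse(torch.cat([point_feats, r[region_ids]], dim=-1))
```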
This paper presents Scalable Semantic Transfer (SST), a novel training paradigm that explores how to leverage the mutual benefits of data from different label domains (i.e., various levels of label granularity) to train a powerful human parsing network. In practice, two common application scenarios are addressed, termed universal parsing and dedicated parsing: the former aims to learn homogeneous human representations from multiple label domains and switch predictions simply by using different segmentation heads, while the latter aims to learn a specific domain's predictions while distilling semantic knowledge from the other domains. The proposed SST has the following appealing benefits: (1) it serves as an effective training scheme for embedding semantic associations of human body parts from multiple label domains into the human representation learning process; (2) it is an extensible semantic transfer framework that does not require predetermining the overall relations of the label domains, allowing human parsing datasets to be added continuously to improve training; (3) the relevant modules are used only for auxiliary training and can be removed during inference, eliminating extra inference cost. Experimental results demonstrate that SST achieves promising universal human parsing performance as well as impressive improvements over its counterparts on three human parsing benchmarks (i.e., PASCAL-Person-Part, ATR, and CIHP). Code is available at https://github.com/yangjie-cv/SST.
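As an illustrative sketch of the universal-parsing setup only (the class names and counts below reflect the three benchmarks; everything else is our own scaffolding, not the released code), a shared backbone learns one homogeneous representation and predictions are switched by selecting a domain-specific head:

```python
import torch.nn as nn

class UniversalParser(nn.Module):
    """Sketch: one shared backbone, one segmentation head per label domain."""

    def __init__(self, backbone, classes_per_domain, dim=256):
        super().__init__()
        self.backbone = backbone  # any dense feature extractor -> (B, dim, H, W)
        self.heads = nn.ModuleDict({
            name: nn.Conv2d(dim, n_cls, kernel_size=1)
            for name, n_cls in classes_per_domain.items()
        })

    def forward(self, images, domain):
        feats = self.backbone(images)      # shared human representation
        return self.heads[domain](feats)   # switch prediction by head

# e.g. classes_per_domain = {"PASCAL-Person-Part": 7, "ATR": 18, "CIHP": 20}
```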