This paper introduces a novel approach using Large Language Models (LLMs) integrated into an agent framework for flexible and efficient personal mobility generation. LLMs overcome the limitations of previous models by efficiently processing semantic data and offering versatility in modeling various tasks. Our approach addresses the critical need to align LLMs with real-world urban mobility data, focusing on three research questions: aligning LLMs with rich activity data, developing reliable activity generation strategies, and exploring LLM applications in urban mobility. The key technical contribution is a novel LLM agent framework that accounts for individual activity patterns and motivations, including a self-consistency approach to align LLMs with real-world activity data and a retrieval-augmented strategy for interpretable activity generation. The framework is validated comprehensively in experiments on real-world data. This research is a pioneering effort in designing an LLM agent framework for activity generation based on real-world human activity data, offering a promising tool for urban mobility analysis.
Most existing traffic sign-related works are dedicated to detecting and recognizing a subset of traffic signs individually, which fails to analyze the global semantic logic among signs and may convey inaccurate traffic instructions. To address these issues, we propose a traffic sign interpretation (TSI) task, which aims to interpret globally, semantically interrelated traffic signs (e.g.,~driving instruction-related texts, symbols, and guide panels) into natural language to provide accurate instruction support for autonomous or assisted driving. Meanwhile, we design a multi-task learning architecture for TSI, which is responsible for detecting and recognizing various traffic signs and interpreting them into natural language as a human would. Furthermore, the absence of a publicly available TSI dataset prompts us to build a traffic sign interpretation dataset, namely TSI-CN. The dataset consists of real road scene images captured on highways and urban roads in China from a driver's perspective. It contains rich location labels of texts, symbols, and guide panels, and the corresponding natural language description labels. Experiments on TSI-CN demonstrate that the TSI task is achievable and that the TSI architecture can interpret traffic signs from scenes successfully even when there is complex semantic logic among signs. The TSI-CN dataset and the source code of the TSI architecture will be publicly available after the revision process.
Next Point-of-Interest (POI) recommendation plays a crucial role in urban mobility applications. Recently, POI recommendation models based on Graph Neural Networks (GNNs) have been extensively studied; however, the effective incorporation of both spatial and temporal information into such GNN-based models remains challenging. Extracting distinct fine-grained features unique to each type of information is difficult because temporal information often includes spatial information, as users tend to visit nearby POIs. To address this challenge, we propose \textbf{\underline{Mob}}ility \textbf{\underline{G}}raph \textbf{\underline{T}}ransformer (MobGT), which enables us to fully leverage graphs to capture both the spatial and temporal features in users' mobility patterns. MobGT combines individual spatial and temporal graph encoders to capture unique features and global user-location relations. Additionally, it incorporates a mobility encoder based on Graph Transformer to extract higher-order information between POIs. To address the long-tailed problem in spatial-temporal data, MobGT introduces a novel loss function, Tail Loss. Experimental results demonstrate that MobGT outperforms state-of-the-art models on various datasets and metrics, achieving a 24\% improvement on average. Our code is available at \url{https://github.com/Yukayo/MobGT}.
Contour-based instance segmentation methods include one-stage and multi-stage schemes. These approaches achieve remarkable performance. However, they have to define a large number of points to segment precise masks, which leads to high complexity. To address this issue, we present a single-shot method, called \textbf{VeinMask}, that achieves competitive performance with low design complexity. Concretely, we observe that a leaf locates its coarse margins via major veins and grows minor veins to refine twisty parts, which makes it possible to cover any object accurately. Meanwhile, major and minor veins share the same growth mode, which avoids modeling them separately and keeps the model simple. Given these advantages, we propose VeinMask to formulate the instance segmentation problem as the simulation of the vein growth process and to predict the major and minor veins in polar coordinates. In addition, centroidness is introduced for instance segmentation tasks to help suppress low-quality instances. Furthermore, a surroundings cross-correlation sensitive (SCCS) module is designed to enhance the feature expression by utilizing the surroundings of each pixel. Additionally, a Residual IoU (R-IoU) loss is formulated to supervise the regression tasks of major and minor veins effectively. Experiments demonstrate that VeinMask performs much better than other contour-based methods with low design complexity. In particular, our method outperforms existing one-stage contour-based methods on the COCO dataset with almost half the design complexity.
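As a rough illustration of what predicting a contour in polar coordinates involves, the following minimal sketch converts ray lengths at evenly spaced angles around a center into Cartesian contour points. It is not VeinMask's actual decoding: the function name `polar_to_contour`, the evenly spaced angle layout, and the omission of minor veins are all assumptions made for illustration.

```python
import math


def polar_to_contour(center, lengths):
    """Convert ray lengths at evenly spaced angles into contour points.

    Hypothetical helper: `center` is (cx, cy) and `lengths[k]` is the
    length of the k-th ray, whose angle is assumed to be 2*pi*k/n.
    """
    cx, cy = center
    n = len(lengths)
    points = []
    for k, r in enumerate(lengths):
        theta = 2.0 * math.pi * k / n  # angle of the k-th ray
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points


if __name__ == "__main__":
    # Four unit-length rays around the origin give a diamond-shaped contour.
    for x, y in polar_to_contour((0.0, 0.0), [1.0, 1.0, 1.0, 1.0]):
        print(round(x, 6), round(y, 6))
```

With this representation a contour of n points needs only n scalar lengths plus a center, which is one way to see why the number of predicted points drives the complexity discussed above.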
Existing real-time text detectors reconstruct text contours directly from shrink-masks, which simplifies the framework and allows the model to run fast. However, the strong dependence on predicted shrink-masks leads to unstable detection results. Moreover, the discrimination of shrink-masks is a pixelwise prediction task. Supervising the network with shrink-masks alone loses much semantic context, which leads to false detections of shrink-masks. To address these problems, we construct an efficient text detection network, Adaptive Shrink-Mask for Text Detection (ASMTD), which improves the accuracy during training and reduces the complexity of the inference process. First, the Adaptive Shrink-Mask (ASM) is proposed to represent texts by shrink-masks and independent adaptive offsets. It weakens the coupling of texts to shrink-masks, which improves the robustness of detection results. Then, the Super-pixel Window (SPW) is designed to supervise the network. It utilizes the surroundings of each pixel to improve the reliability of predicted shrink-masks and is not used during testing. Finally, a lightweight feature merging branch is constructed to reduce the computational cost. As demonstrated in the experiments, our method is superior to existing state-of-the-art (SOTA) methods in both detection accuracy and speed on multiple benchmarks.
Text detection, the key technology for understanding scene text, has become an attractive research topic. To detect various scene texts, researchers have proposed many detectors with different advantages: detection-based models enjoy fast detection speed, and segmentation-based algorithms are not limited by text shapes. However, for most intelligent systems, the detector needs to detect arbitrary-shaped texts with high speed and accuracy simultaneously. Thus, in this study, we design an efficient pipeline named MT, which can detect adhesive arbitrary-shaped texts with only a single binary mask in the inference stage. This paper makes contributions in three aspects: (1) a lightweight detection framework is designed to speed up the inference process while keeping high detection accuracy; (2) a multi-perspective feature module is proposed to learn more discriminative representations to segment the mask accurately; (3) a multi-factor constraints IoU minimization loss is introduced for training the proposed model. The effectiveness of MT is evaluated on four real-world scene text datasets, and it surpasses the state-of-the-art competitors by a large margin.
Existing object detection-based text detectors mainly concentrate on detecting horizontal and multi-oriented text. However, they do not pay enough attention to complex-shape text (curved or other irregularly shaped text). Recently, segmentation-based text detection methods have been introduced to deal with complex-shape text; however, the pixel-level processing increases the computational cost significantly. To further improve accuracy and efficiency, we propose a novel framework for arbitrary-shape text detection, termed RayNet. RayNet uses a Center Point Set (CPS) and Ray Distance (RD) to fit text, where the CPS is used to determine the general position of the text and the RD is combined with the CPS to compute Ray Points (RP) that localize the accurate shape of the text. Since the RP are disordered, we develop the Ray Points Connection (RPC) algorithm to reorder them, which significantly improves the detection performance on complex-shape text. RayNet achieves impressive performance on the existing curved text dataset (CTW1500) and quadrangle text dataset (ICDAR2015), which demonstrates its superiority over several state-of-the-art methods.
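To make the reordering problem concrete, the sketch below sorts an unordered set of boundary points into a closed contour by their angle around the centroid. This is a simple stand-in and not the RPC algorithm itself, whose details the abstract does not give; the function name `order_points_by_angle` is hypothetical, and angular sorting is only reliable when the shape is roughly star-shaped around its centroid, which strongly curved text may not be.

```python
import math


def order_points_by_angle(points):
    """Sort 2D points counter-clockwise by angle around their centroid.

    Stand-in for a contour reordering step (assumption, not RPC).
    """
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    # atan2 returns angles in (-pi, pi], so the ordering starts just
    # after the negative x-axis and proceeds counter-clockwise.
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))


if __name__ == "__main__":
    unordered = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
    print(order_points_by_angle(unordered))
```

Connecting the points in the returned order yields a non-self-intersecting polygon for the square example, whereas connecting them in the input order would cross itself.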
Recently, arbitrary-shape text detection has attracted more and more research attention. Although segmentation-based methods, which are not limited by the text shape, have been studied to improve performance, their slow detection speed, complicated post-processing, and text adhesion problem still limit practical application. In this paper, we propose a simple yet effective arbitrary-shape text detector, named Bold Outline Text Detector (BOTD). It is a novel one-stage detection framework with few post-processing steps. At the same time, the text adhesion problem is also well alleviated. Specifically, BOTD first generates a center mask (CM) for each text instance, which makes adhesive text easy to distinguish. Based on the CM, we further compute the polar minimum distance (PMD) for each text instance. PMD denotes the shortest distance between the center point of the CM and the outline of the text instance. By dividing the text mask into CM and PMD, the outline of an arbitrary-shape text instance can be obtained by simply predicting its CM and PMD. Without any bells and whistles, BOTD achieves an F-measure of 80.1% on CTW1500 at 52 FPS. Note that the post-processing time accounts for only 9% of the whole inference time. Code and trained models will be publicly available soon.
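The PMD definition above, the shortest distance between a center point and the text outline, can be sketched for a polygonal outline as the minimum point-to-segment distance over all edges. This is a minimal sketch under stated assumptions: the outline is assumed to be given as a closed list of polygon vertices, and the function names `point_segment_distance` and `polar_minimum_distance` are hypothetical rather than taken from BOTD.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def polar_minimum_distance(center, outline):
    """Shortest distance from `center` to a closed polygonal outline."""
    n = len(outline)
    return min(
        point_segment_distance(center, outline[i], outline[(i + 1) % n])
        for i in range(n)
    )


if __name__ == "__main__":
    # A 10 x 4 text box with its center point in the middle.
    box = [(0.0, 0.0), (10.0, 0.0), (10.0, 4.0), (0.0, 4.0)]
    print(polar_minimum_distance((5.0, 2.0), box))
```

For the elongated box in the usage example, the result is governed by the short side of the box, which matches the intuition that the PMD measures how far the center is from the nearest part of the outline.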