Honor Device Co., Ltd
Abstract:Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.
Abstract:Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.
Abstract:We study exact-regenerating codes for entanglement-assisted distributed storage systems. Consider an $(n,k,d,α,β_{\mathsf{q}},B)$ distributed system that stores a file of $B$ classical symbols across $n$ nodes with each node storing $α$ symbols. A data collector can recover the file by accessing any $k$ nodes. When a node fails, any $d$ surviving nodes share an entangled state, and each of them transmits a quantum system of $β_{\mathsf{q}}$ qudits to a newcomer. The newcomer then performs a measurement on the received quantum systems to generate its storage. Recent work [1] showed that, under functional repair where the regenerated content may differ from that of the failed node, there exists a unique optimal regenerating point that \emph{simultaneously minimizes both storage $α$ and repair bandwidth $d β_{\mathsf{q}}$} when $d \geq 2k-2$. In this paper, we show that, under \emph{exact repair}, where the newcomer reproduces exactly the same content as the failed node, this optimal point remains achievable. Our construction builds on the classical product-matrix framework and the Calderbank-Shor-Steane (CSS)-based stabilizer formalism.
Abstract:The integration of spatial multi-omics data from single tissues is crucial for advancing biological research. However, a significant data imbalance impedes progress: while spatial transcriptomics data is relatively abundant, spatial proteomics data remains scarce due to technical limitations and high costs. To overcome this challenge we propose STProtein, a novel framework leveraging graph neural networks with multi-task learning strategy. STProtein is designed to accurately predict unknown spatial protein expression using more accessible spatial multi-omics data, such as spatial transcriptomics. We believe that STProtein can effectively addresses the scarcity of spatial proteomics, accelerating the integration of spatial multi-omics and potentially catalyzing transformative breakthroughs in life sciences. This tool enables scientists to accelerate discovery by identifying complex and previously hidden spatial patterns of proteins within tissues, uncovering novel relationships between different marker genes, and exploring the biological "Dark Matter".
Abstract:An important part of the information theory folklore had been about the output statistics of codes that achieve the capacity and how the empirical distributions compare to the output distributions induced by the optimal input in the channel capacity problem. Results for a variety of such empirical output distributions of good codes have been known in the literature, such as the comparison of the output distribution of the code to the optimal output distribution in vanishing and non-vanishing error probability cases. Motivated by these, we aim to achieve similar results for the quantum codes that are used for classical communication, that is the setting in which the classical messages are communicated through quantum codewords that pass through a noisy quantum channel. We first show the uniqueness of the optimal output distribution, to be able to talk more concretely about the optimal output distribution. Then, we extend the vanishing error probability results to the quantum case, by using techniques that are close in spirit to the classical case. We also extend non-vanishing error probability results to the quantum case on block codes, by using the second-order converses for such codes based on hypercontractivity results for the quantum generalized depolarizing semi-groups.
Abstract:This work investigates the use of quantum resources in distributed storage systems. Consider an $(n,k,d)$ distributed storage system in which a file is stored across $n$ nodes such that any $k$ nodes suffice to reconstruct the file. When a node fails, any $d$ helper nodes transmit information to a newcomer to rebuild the system. In contrast to the classical repair, where helper nodes transmit classical bits, we allow them to send classical information over quantum channels to the newcomer. The newcomer then generates its storage by performing appropriate measurements on the received quantum states. In this setting, we fully characterize the fundamental tradeoff between storage and repair bandwidth (total communication cost). Compared to classical systems, the optimal storage--bandwidth tradeoff can be significantly improved with the enhancement of quantum entanglement shared only among the surviving nodes, particularly at the minimum-storage regenerating point. Remarkably, we show that when $d \geq 2k-2$, there exists an operating point at which \textit{both storage and repair bandwidth are simultaneously minimized}. This phenomenon breaks the tradeoff in the classical setting and reveals a fundamentally new regime enabled by quantum communication.
Abstract:We provide new insights into an open problem recently posed by Yuan-Sun [ISIT 2025], concerning the minimum individual key rate required in the vector linear secure aggregation problem. Consider a distributed system with $K$ users, where each user $k\in [K]$ holds a data stream $W_k$ and an individual key $Z_k$. A server aims to compute a linear function $\mathbf{F}[W_1;\ldots;W_K]$ without learning any information about another linear function $\mathbf{G}[W_1;\ldots;W_K]$, where $[W_1;\ldots;W_K]$ denotes the row stack of $W_1,\ldots,W_K$. The open problem is to determine the minimum required length of $Z_k$, denoted as $R_k$, $k\in [K]$. In this paper, we characterize a new achievable region for the rate tuple $(R_1,\ldots,R_K)$. The region is polyhedral, with vertices characterized by a binary rate assignment $(R_1,\ldots,R_K) = (\mathbf{1}(1 \in \mathcal{I}),\ldots,\mathbf{1}(K\in \mathcal{I}))$, where $\mathcal{I}\subseteq [K]$ satisfies the \textit{rank-increment condition}: $\mathrm{rank}\left(\bigl[\mathbf{F}_{\mathcal{I}};\mathbf{G}_{\mathcal{I}}\bigr]\right) =\mathrm{rank}\bigl(\mathbf{F}_{\mathcal{I}}\bigr)+N$. Here, $\mathbf{F}_\mathcal{I}$ and $\mathbf{G}_\mathcal{I}$ are the submatrices formed by the columns indexed by $\mathcal{I}$. Our results uncover the novel fact that it is not necessary for every user to hold a key, thereby strictly enlarging the best-known achievable region in the literature. Furthermore, we provide a converse analysis to demonstrate its optimality when minimizing the number of users that hold keys.




Abstract:Continual Anomaly Detection (CAD) enables anomaly detection models in learning new classes while preserving knowledge of historical classes. CAD faces two key challenges: catastrophic forgetting and segmentation of small anomalous regions. Existing CAD methods store image distributions or patch features to mitigate catastrophic forgetting, but they fail to preserve pixel-level detailed features for accurate segmentation. To overcome this limitation, we propose ReplayCAD, a novel diffusion-driven generative replay framework that replay high-quality historical data, thus effectively preserving pixel-level detailed features. Specifically, we compress historical data by searching for a class semantic embedding in the conditional space of the pre-trained diffusion model, which can guide the model to replay data with fine-grained pixel details, thus improving the segmentation performance. However, relying solely on semantic features results in limited spatial diversity. Hence, we further use spatial features to guide data compression, achieving precise control of sample space, thereby generating more diverse data. Our method achieves state-of-the-art performance in both classification and segmentation, with notable improvements in segmentation: 11.5% on VisA and 8.1% on MVTec. Our source code is available at https://github.com/HULEI7/ReplayCAD.




Abstract:Multimodal medical images play a crucial role in the precise and comprehensive clinical diagnosis. Diffusion model is a powerful strategy to synthesize the required medical images. However, existing approaches still suffer from the problem of anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency for enhancing global low-frequency features and effective high-frequency details as well as suppressing high-frequency noise. To further overcome the challenges posed by the large differences between different medical modalities, we construct a knowledge-guided mechanism that fuses the prior clinical knowledge from a visual language model with visual features, to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority.
Abstract:Vision-based regression tasks, such as hand pose estimation, have achieved higher accuracy and faster convergence through representation learning. However, existing representation learning methods often encounter the following issues: the high semantic level of features extracted from images is inadequate for regressing low-level information, and the extracted features include task-irrelevant information, reducing their compactness and interfering with regression tasks. To address these challenges, we propose TI-Net, a highly versatile visual Network backbone designed to construct a Transformation Isomorphic latent space. Specifically, we employ linear transformations to model geometric transformations in the latent space and ensure that {\rm TI-Net} aligns them with those in the image space. This ensures that the latent features capture compact, low-level information beneficial for pose estimation tasks. We evaluated TI-Net on the hand pose estimation task to demonstrate the network's superiority. On the DexYCB dataset, TI-Net achieved a 10% improvement in the PA-MPJPE metric compared to specialized state-of-the-art (SOTA) hand pose estimation methods. Our code will be released in the future.