IDSIA
Abstract:Recent work has demonstrated the remarkable potential of Large Language Models (LLMs) in test-time scaling. By making the models think before answering, they are able to achieve much higher accuracy with extra inference computation. However, in many real-world scenarios, models are used under time constraints, where an answer should be given to the user within a certain output length. It is unclear whether and how the reasoning abilities of LLMs remain effective under such constraints. We take a first look at this problem by conducting an in-depth empirical study. Specifically, we test more than 25 LLMs on common reasoning datasets under a wide range of output length budgets, and we analyze the correlation between the inference accuracy and various properties including model type, model size, prompt style, etc. We also consider the mappings between the token budgets and the actual on-device latency budgets. The results have demonstrated several interesting findings regarding the budget-aware LLM reasoning that differ from the unconstrained situation, e.g. the optimal choices of model sizes and prompts change under different budgets. These findings offer practical guidance for users to deploy LLMs under real-world latency constraints.
Abstract:Many applications demand context sensing to offer personalized and timely services. Yet, developing sensing programs can be challenging for developers and using them is privacy-concerning for end-users. In this paper, we propose to use natural language as the unified interface to process personal data and sense user context, which can effectively ease app development and make the data pipeline more transparent. Our work is inspired by large language models (LLMs) and other generative models, while directly applying them does not solve the problem - letting the model directly process the data cannot handle complex sensing requests and letting the model write the data processing program suffers error-prone code generation. We address the problem with 1) a unified data processing framework that makes context-sensing programs simpler and 2) a feedback-guided query optimizer that makes data query more informative. To evaluate the performance of natural language-based context sensing, we create a benchmark that contains 133 context sensing tasks. Extensive evaluation has shown that our approach is able to automatically solve the context-sensing tasks efficiently and precisely. The code is opensourced at https://github.com/MobileLLM/ChainStream.
Abstract:It is widely acknowledged that the epileptic foci can be pinpointed by source localizing interictal epileptic discharges (IEDs) via Magnetoencephalography (MEG). However, manual detection of IEDs, which appear as spikes in MEG data, is extremely labor intensive and requires considerable professional expertise, limiting the broader adoption of MEG technology. Numerous studies have focused on automatic detection of MEG spikes to overcome this challenge, but these efforts often validate their models on synthetic datasets with balanced positive and negative samples. In contrast, clinical MEG data is highly imbalanced, raising doubts on the real-world efficacy of these models. To address this issue, we introduce LV-CadeNet, a Long View feature Convolution-Attention fusion Encoder-Decoder Network, designed for automatic MEG spike detection in real-world clinical scenarios. Beyond addressing the disparity between training data distribution and clinical test data through semi-supervised learning, our approach also mimics human specialists by constructing long view morphological input data. Moreover, we propose an advanced convolution-attention module to extract temporal and spatial features from the input data. LV-CadeNet significantly improves the accuracy of MEG spike detection, boosting it from 42.31\% to 54.88\% on a novel clinical dataset sourced from Sanbo Brain Hospital Capital Medical University. This dataset, characterized by a highly imbalanced distribution of positive and negative samples, accurately represents real-world clinical scenarios.
Abstract:There are unique challenges to developing item recommender systems for e-commerce platforms like eBay due to sparse data and diverse user interests. While rich user-item interactions are important, eBay's data sparsity exceeds other e-commerce sites by an order of magnitude. To address this challenge, we propose CoActionGraphRec (CAGR), a text based two-tower deep learning model (Item Tower and User Tower) utilizing co-action graph layers. In order to enhance user and item representations, a graph-based solution tailored to eBay's environment is utilized. For the Item Tower, we represent each item using its co-action items to capture collaborative signals in a co-action graph that is fully leveraged by the graph neural network component. For the User Tower, we build a fully connected graph of each user's behavior sequence, with edges encoding pairwise relationships. Furthermore, an explicit interaction module learns representations capturing behavior interactions. Extensive offline and online A/B test experiments demonstrate the effectiveness of our proposed approach and results show improved performance over state-of-the-art methods on key metrics.
Abstract:Reinforcement learning method is extremely competitive in gait generation techniques for quadrupedal robot, which is mainly due to the fact that stochastic exploration in reinforcement training is beneficial to achieve an autonomous gait. Nevertheless, although incremental reinforcement learning is employed to improve training success and movement smoothness by relying on the continuity inherent during limb movements, challenges remain in adapting gait policy to diverse terrain and external disturbance. Inspired by the association between reinforcement learning and the evolution of animal motion behavior, a self-improvement mechanism for reference gait is introduced in this paper to enable incremental learning of action and self-improvement of reference action together to imitate the evolution of animal motion behavior. Further, a new framework for reinforcement training of quadruped gait is proposed. In this framework, genetic algorithm is specifically adopted to perform global probabilistic search for the initial value of the arbitrary foot trajectory to update the reference trajectory with better fitness. Subsequently, the improved reference gait is used for incremental reinforcement learning of gait. The above process is repeatedly and alternatively executed to finally train the gait policy. The analysis considering terrain, model dimensions, and locomotion condition is presented in detail based on simulation, and the results show that the framework is significantly more adaptive to terrain compared to regular incremental reinforcement learning.
Abstract:The family of local maximum likelihood (LML) detectors and the family of likelihood ascent search (LAS) detectors are successfully applied to symbol detection in massive multi-input multi-output (MIMO) communications and demonstrate performance approaching the NP-hard global maximum likelihood (GML) detector. The LML and LAS detectors in the two families are akin to each other and possess common properties significant in both theory and practical MIMO communications. This book chapter reviews the LML and LAS detectors in a unified framework. The focus is on their formulation, relationships, properties, and GML performance in bit error rate (BER) and spectral efficiency in large MIMO channels.
Abstract:The conceptualization and categorization of natural objects in the human mind have long intrigued cognitive scientists and neuroscientists, offering crucial insights into human perception and cognition. Recently, the rapid development of Large Language Models (LLMs) has raised the attractive question of whether these models can also develop human-like object representations through exposure to vast amounts of linguistic and multimodal data. In this study, we combined behavioral and neuroimaging analysis methods to uncover how the object concept representations in LLMs correlate with those of humans. By collecting large-scale datasets of 4.7 million triplet judgments from LLM and Multimodal LLM (MLLM), we were able to derive low-dimensional embeddings that capture the underlying similarity structure of 1,854 natural objects. The resulting 66-dimensional embeddings were found to be highly stable and predictive, and exhibited semantic clustering akin to human mental representations. Interestingly, the interpretability of the dimensions underlying these embeddings suggests that LLM and MLLM have developed human-like conceptual representations of natural objects. Further analysis demonstrated strong alignment between the identified model embeddings and neural activity patterns in many functionally defined brain ROIs (e.g., EBA, PPA, RSC and FFA). This provides compelling evidence that the object representations in LLMs, while not identical to those in the human, share fundamental commonalities that reflect key schemas of human conceptual knowledge. This study advances our understanding of machine intelligence and informs the development of more human-like artificial cognitive systems.
Abstract:Solving partial differential equations (PDEs) and their inverse problems using Physics-informed neural networks (PINNs) is a rapidly growing approach in the physics and machine learning community. Although several architectures exist for PINNs that work remarkably in practice, our theoretical understanding of their performances is somewhat limited. In this work, we study the behavior of a Bayesian PINN estimator of the solution of a PDE from $n$ independent noisy measurement of the solution. We focus on a class of equations that are linear in their parameters (with unknown coefficients $\theta_\star$). We show that when the partial differential equation admits a classical solution (say $u_\star$), differentiable to order $\beta$, the mean square error of the Bayesian posterior mean is at least of order $n^{-2\beta/(2\beta + d)}$. Furthermore, we establish a convergence rate of the linear coefficients of $\theta_\star$ depending on the order of the underlying differential operator. Last but not least, our theoretical results are validated through extensive simulations.
Abstract:The perception of transparent objects for grasp and manipulation remains a major challenge, because existing robotic grasp methods which heavily rely on depth maps are not suitable for transparent objects due to their unique visual properties. These properties lead to gaps and inaccuracies in the depth maps of the transparent objects captured by depth sensors. To address this issue, we propose an end-to-end network for transparent object depth completion that combines the strengths of single-view RGB-D based depth completion and multi-view depth estimation. Moreover, we introduce a depth refinement module based on confidence estimation to fuse predicted depth maps from single-view and multi-view modules, which further refines the restored depth map. The extensive experiments on the ClearPose and TransCG datasets demonstrate that our method achieves superior accuracy and robustness in complex scenarios with significant occlusion compared to the state-of-the-art methods.
Abstract:The grasp generation of dexterous hand often requires a large number of grasping annotations. Especially for functional grasp-requiring the grasp pose to be convenient for the subsequent use of the object. However, annotating high DoF dexterous hand pose is rather challenging. This prompt us to explore how people achieve manipulations on new objects based on past grasp experiences. We find that people are adept at discovering and leveraging various similarities between objects when grasping new items, including shape, layout, and grasp type. In light of this, we analyze and collect grasp-related similarity relationships among 51 common tool-like object categories and annotate semantic grasp representation for 1768 objects. These data are organized into the form of a knowledge graph, which helps infer our proposed cross-category functional grasp synthesis. Through extensive experiments, we demonstrate that the grasp-related knowledge indeed contributed to achieving functional grasp transfer across unknown or entirely new categories of objects. We will publicly release the dataset and code to facilitate future research.