Xiao Ma

Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation

Mar 06, 2024

Xiao Ma, Sumit Patidar, Iain Haughton, Stephen James

Abstract: This paper introduces Hierarchical Diffusion Policy (HDP), a hierarchical agent for multi-task robotic manipulation. HDP factorises a manipulation policy into a hierarchical structure: a high-level task-planning agent which predicts a distant next-best end-effector pose (NBP), and a low-level goal-conditioned diffusion policy which generates optimal motion trajectories. The factorised policy representation allows HDP to tackle long-horizon task planning while generating fine-grained low-level actions. To generate context-aware motion trajectories while satisfying robot kinematics constraints, we present a novel kinematics-aware goal-conditioned control agent, Robot Kinematics Diffuser (RK-Diffuser). Specifically, RK-Diffuser learns to generate both the end-effector pose and joint position trajectories, and distills the accurate but kinematics-unaware end-effector pose diffuser into the kinematics-aware but less accurate joint position diffuser via differentiable kinematics. Empirically, we show that HDP achieves a significantly higher success rate than the state-of-the-art methods in both simulation and the real world.

* Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2024). Videos and code: https://yusufma03.github.io/projects/hdp/
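
The HDP abstract mentions distilling end-effector pose trajectories into joint position trajectories through differentiable kinematics. As a rough illustration of that general idea only, and not the authors' RK-Diffuser, the sketch below uses a hypothetical planar two-link arm: it takes a roughly correct joint trajectory and pulls each waypoint toward a matching end-effector target by gradient descent through the forward kinematics. The link lengths, step size, trajectories, and every function name are assumptions made for this example.

```python
import math

# Hypothetical planar two-link arm; the link lengths are illustrative assumptions.
LINK_1, LINK_2 = 1.0, 0.8


def forward_kinematics(q1, q2):
    """Map joint angles (radians) to the end-effector position (x, y)."""
    x = LINK_1 * math.cos(q1) + LINK_2 * math.cos(q1 + q2)
    y = LINK_1 * math.sin(q1) + LINK_2 * math.sin(q1 + q2)
    return x, y


def refine_joints(q, target, steps=2000, lr=0.3):
    """Gradient descent on 0.5 * ||FK(q) - target||^2 via the analytic Jacobian."""
    q1, q2 = q
    for _ in range(steps):
        x, y = forward_kinematics(q1, q2)
        ex, ey = x - target[0], y - target[1]
        # Jacobian entries of the forward kinematics.
        dx_dq1, dy_dq1 = -y, x
        dx_dq2 = -LINK_2 * math.sin(q1 + q2)
        dy_dq2 = LINK_2 * math.cos(q1 + q2)
        # The loss gradient is the Jacobian transpose times the position error.
        q1 -= lr * (ex * dx_dq1 + ey * dy_dq1)
        q2 -= lr * (ex * dx_dq2 + ey * dy_dq2)
    return q1, q2


def refine_trajectory(joint_traj, ee_traj):
    """Pull each joint waypoint toward its matching end-effector waypoint."""
    return [refine_joints(q, p) for q, p in zip(joint_traj, ee_traj)]


def position_error(q, target):
    """Euclidean distance between FK(q) and the target position."""
    x, y = forward_kinematics(*q)
    return math.hypot(x - target[0], y - target[1])


if __name__ == "__main__":
    # Made-up stand-ins for the two predicted trajectories.
    ee_traj = [forward_kinematics(0.9, 0.7), forward_kinematics(1.0, 0.6)]
    joint_traj = [(0.7, 0.9), (0.85, 0.8)]
    for q, p in zip(refine_trajectory(joint_traj, ee_traj), ee_traj):
        print(q, position_error(q, p))
```

Because the refined waypoints are joint angles, they stay expressible by the arm by construction, while the end-effector targets supply the positional accuracy. A learned system would apply this kind of signal as a training loss rather than as per-waypoint optimisation.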

InsActor: Instruction-driven Physics-based Characters

Dec 28, 2023

Jiawei Ren, Mingyuan Zhang, Cunjun Yu, Xiao Ma, Liang Pan, Ziwei Liu

Abstract: Generating animation of physics-based characters with intuitive control has long been a desirable task with numerous applications. However, generating physically simulated animations that reflect high-level human instructions remains a difficult problem due to the complexity of physical environments and the richness of human language. In this paper, we present InsActor, a principled generative framework that leverages recent advancements in diffusion-based human motion models to produce instruction-driven animations of physics-based characters. Our framework empowers InsActor to capture complex relationships between high-level human instructions and character motions by employing diffusion policies for flexibly conditioned motion planning. To overcome invalid states and infeasible state transitions in planned motions, InsActor discovers low-level skills and maps plans to latent skill sequences in a compact latent space. Extensive experiments demonstrate that InsActor achieves state-of-the-art results on various tasks, including instruction-driven motion generation and instruction-driven waypoint heading. Notably, the ability of InsActor to generate physically simulated animations using high-level human instructions makes it a valuable tool, particularly in executing long-horizon tasks with a rich set of instructions.

* NeurIPS 2023. Project page is at https://jiawei-ren.github.io/projects/insactor

Gemini: A Family of Highly Capable Multimodal Models

Dec 19, 2023

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth (+930 more)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of Gemini models in cross-modal reasoning and language understanding will enable a wide variety of use cases and we discuss our approach toward deploying them responsibly to users.

Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis

Dec 18, 2023

Tianyao He, Huabin Liu, Yuxi Li, Xiao Ma, Cheng Zhong, Yang Zhang, Weiyao Lin

Abstract: Video Correlation Learning (VCL), which aims to analyze the relationships between videos, has been widely studied and applied in various general video tasks. However, applying VCL to instructional videos is still quite challenging due to their intrinsic procedural temporal structure. Specifically, procedural knowledge is critical for accurate correlation analyses on instructional videos. Nevertheless, current procedure-learning methods heavily rely on step-level annotations, which are costly and not scalable. To address this problem, we introduce a weakly supervised framework called Collaborative Procedure Alignment (CPA) for procedure-aware correlation learning on instructional videos. Our framework comprises two core modules: collaborative step mining and frame-to-step alignment. The collaborative step mining module enables simultaneous and consistent step segmentation for paired videos, leveraging the semantic and temporal similarity between frames. Based on the identified steps, the frame-to-step alignment module performs alignment between the frames and steps across videos. The alignment result serves as a measurement of the correlation distance between two videos. We instantiate our framework in two distinct instructional video tasks: sequence verification and action quality assessment. Extensive experiments validate the effectiveness of our approach in providing accurate and interpretable correlation analyses for instructional videos.

* Accepted by AAAI 2024

Adjustable Robust Transformer for High Myopia Screening in Optical Coherence Tomography

Dec 12, 2023

Xiao Ma, Zetian Zhang, Zexuan Ji, Kun Huang, Na Su, Songtao Yuan, Qiang Chen

Abstract: Myopia is a manifestation of visual impairment caused by an excessively elongated eyeball. Image data is critical material for studying high myopia and pathological myopia. Measurements of spherical equivalent and axial length are the gold standards for identifying high myopia, but the available image data for matching them is scarce. In addition, the criteria for defining high myopia vary from study to study, and therefore the inclusion of samples in automated screening efforts requires an appropriate assessment of interpretability. In this work, we propose a model called adjustable robust transformer (ARTran) for high myopia screening of optical coherence tomography (OCT) data. Based on the vision transformer, we propose anisotropic patch embedding (APE) to capture more discriminative features of high myopia. To make the model effective under variable screening conditions, we propose an adjustable class embedding (ACE) to replace the fixed class token, which changes the output to adapt to different conditions. Considering the ambiguity of the data near the threshold between high and low myopia, we introduce a label noise learning strategy and propose a shifted subspace transition matrix (SST) to enhance the robustness of the model. In addition, by combining the two structures proposed above, the model can provide evidence for uncertainty evaluation. The experimental results demonstrate the effectiveness and reliability of the proposed method. Code is available at: https://github.com/maxiao0234/ARTran.

* 11 pages, 3 figures, MICCAI 2023 - Accepted Papers; International Conference on Medical Image Computing and Computer-Assisted Intervention, 2023: 504-514

Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses

Dec 01, 2023

Xiao Ma, Swaroop Mishra, Ariel Liu, Sophie Su, Jilin Chen, Chinmay Kulkarni, Heng-Tze Cheng, Quoc Le, Ed Chi

Abstract: Large language model (LLM) powered chatbots are primarily text-based today, and impose a large interactional cognitive load, especially for exploratory or sensemaking tasks such as planning a trip or learning about a new city. Because the interaction is textual, users have little scaffolding in the way of structure, informational "scent", or ability to specify high-level preferences or goals. We introduce ExploreLLM, which allows users to structure thoughts, explore different options, navigate through the choices and recommendations, and more easily steer models to generate more personalized responses. We conduct a user study and show that users find it helpful to use ExploreLLM for exploratory or planning tasks, because it provides a useful schema-like structure to the task, and guides users in planning. The study also suggests that users can more easily personalize responses with high-level preferences with ExploreLLM. Together, ExploreLLM points to a future where users interact with LLMs beyond the form of chatbots, through systems designed to support complex user tasks with a tighter integration between natural language and graphical user interfaces.

* 19 pages, 11 figures

Improving Few-shot Generalization of Safety Classifiers via Data Augmented Parameter-Efficient Fine-Tuning

Oct 25, 2023

Ananth Balashankar, Xiao Ma, Aradhana Sinha, Ahmad Beirami, Yao Qin, Jilin Chen, Alex Beutel

Abstract: As large language models (LLMs) are widely adopted, new safety issues and policies emerge, to which existing safety classifiers do not generalize well. If we have only observed a few examples of violations of a new safety rule, how can we build a classifier to detect violations? In this paper, we study the novel setting of domain-generalized few-shot learning for LLM-based text safety classifiers. Unlike prior few-shot work, these new safety issues can be hard to uncover and we do not get to choose the few examples. We demonstrate that existing few-shot techniques do not perform well in this setting, and instead propose parameter-efficient fine-tuning (PEFT) combined with augmenting training data based on similar examples from prior existing rules. We empirically show that our approach of similarity-based data-augmentation + prompt-tuning (DAPT) consistently outperforms baselines that omit either data augmentation or PEFT by 7-17% F1 score in the Social Chemistry moral judgement task and 9-13% AUC in the Toxicity detection task, even when the new rule is loosely correlated with existing ones.

Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting

Oct 25, 2023

Preethi Lahoti, Nicholas Blumm, Xiao Ma, Raghavendra Kotikalapudi, Sahitya Potluri, Qijun Tan, Hansa Srinivasan, Ben Packer, Ahmad Beirami, Alex Beutel (+1 more)

Abstract: A crucial challenge for generative large language models (LLMs) is diversity: when a user's prompt is under-specified, models may follow implicit assumptions while generating a response, which may result in homogenization of the responses, as well as certain demographic groups being under-represented or even erased from the generated responses. In this paper, we formalize diversity of representation in generative LLMs. We present evaluation datasets and propose metrics to measure diversity in generated responses along people and culture axes. We find that LLMs understand the notion of diversity, and that they can reason about and critique their own responses for that goal. This finding motivated a new prompting technique called collective-critique and self-voting (CCSV) to self-improve the people diversity of LLMs by tapping into their diversity reasoning capabilities, without relying on handcrafted examples or prompt tuning. Extensive empirical experiments with both human and automated evaluations show that our proposed approach is effective at improving people and culture diversity, and outperforms all baseline methods by a large margin.

* To appear at EMNLP 2023 main conference

A global product of fine-scale urban building height based on spaceborne lidar

Oct 22, 2023

Xiao Ma, Guang Zheng, Chi Xu, L. Monika Moskal, Peng Gong, Qinghua Guo, Huabing Huang, Xuecao Li, Yong Pang, Cheng Wang (+4 more)

Abstract: Characterizing urban environments with broad coverage and high precision is more important than ever for achieving the UN's Sustainable Development Goals (SDGs), as half of the world's population lives in cities. Urban building height, as a fundamental 3D urban structural feature, has far-reaching applications. However, producing readily available datasets of recent urban building heights with fine spatial resolution and global coverage remains a challenging task. Here, we provide an up-to-date global product of urban building heights on a fine 150 m grid for around 2020 by combining the spaceborne lidar instrument GEDI with multi-sourced data, including remotely sensed images (i.e., Landsat-8, Sentinel-2, and Sentinel-1) and topographic data. Our results show that the method for estimating building height samples from the GEDI data was effective, with a Pearson's r of 0.78 and an RMSE of 3.67 m in comparison to the reference data. The mapping product also demonstrated good performance, as indicated by its strong correlation with the reference data (i.e., Pearson's r = 0.71, RMSE = 4.60 m). Compared with existing products, our global urban building height map provides a higher spatial resolution (i.e., 150 m) with a great level of inherent detail about spatial heterogeneity, and the flexibility of updating using the GEDI samples as inputs. This work will boost future urban studies across many fields including climate, environmental, ecological, and social sciences.
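
The building-height abstract reports its accuracy as Pearson's r and RMSE against reference data. For readers who want to see how those two metrics are computed, here is a minimal sketch in plain Python. The height values are invented for illustration and are not taken from the paper; the function names are likewise assumptions.

```python
import math


def pearson_r(pred, ref):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)


def rmse(pred, ref):
    """Root-mean-square error, in the same units as the inputs."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))


if __name__ == "__main__":
    # Hypothetical building heights in metres (estimated vs. reference).
    estimated = [12.0, 30.5, 8.2, 45.0, 19.3]
    reference = [10.5, 33.0, 9.0, 40.2, 21.0]
    print("Pearson's r:", pearson_r(estimated, reference))
    print("RMSE (m):", rmse(estimated, reference))
```

Pearson's r measures how well the estimates track the reference linearly and is insensitive to a constant offset, whereas RMSE is expressed in metres and penalises such an offset, which is one reason the two are commonly reported together.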

Let's Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning

Jun 25, 2023

Xiao Ma, Swaroop Mishra, Ahmad Beirami, Alex Beutel, Jilin Chen

Abstract: Language models still struggle with moral reasoning, despite their impressive performance in many other tasks. In particular, the Moral Scenarios task in MMLU (Multi-task Language Understanding) is among the worst performing tasks for many language models, including GPT-3. In this work, we propose a new prompting framework, Thought Experiments, to teach language models to do better moral reasoning using counterfactuals. Experimental results show that our framework elicits counterfactual questions and answers from the model, which in turn helps improve the accuracy on the Moral Scenarios task by 9-16% compared to other zero-shot baselines. Interestingly, unlike math reasoning tasks, zero-shot Chain-of-Thought (CoT) reasoning doesn't work out of the box, and even reduces accuracy by around 4% compared to direct zero-shot. We further observed that with minimal human supervision in the form of 5 few-shot examples, the accuracy of the task can be improved to as much as 80%.

* 8 pages, ICML Neural Conversational AI workshop, thought experiments, moral reasoning
