Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jun Wang

IBM T. J. Watson Research Center

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Nov 30, 2024

Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu

Figure 1 for LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Figure 2 for LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Figure 3 for LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Figure 4 for LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Abstract:It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.

Via

Access Paper or Ask Questions

Polarization under the Channel Noise with Memory

Nov 25, 2024

Tianfu Qi, Jun Wang

Abstract:The channel polarization under the channel noise with memory is comprehensively studied. With the help of the genie-aided channel, we prove that the polarized channels also converge to extremal channels under the standard polar codes structure. More importantly, the ratio of the perfect channel can be larger than $I(W)$ which is the capacity of the original channel. However, the polarization rate is shown to be slower than the binary-input discrete memoryless channel (DMC) case. Specifically, the upper bound of the block error is $\mathcal{O}(L^{-c_0})$ where $L$ is the block length and $c_0$ is some positive constant. Furthermore, the upper and lower bound of the gap between the capacity and cutoff rate is investigated when the block length is finite, which is more useful for practical applications.

Via

Access Paper or Ask Questions

Multi-LLM-Agent Systems: Techniques and Business Perspectives

Nov 21, 2024

Yingxuan Yang, Qiuying Peng, Jun Wang, Weinan Zhang

Figure 1 for Multi-LLM-Agent Systems: Techniques and Business Perspectives

Figure 2 for Multi-LLM-Agent Systems: Techniques and Business Perspectives

Figure 3 for Multi-LLM-Agent Systems: Techniques and Business Perspectives

Figure 4 for Multi-LLM-Agent Systems: Techniques and Business Perspectives

Abstract:In the era of (multi-modal) large language models, most operational processes can be reformulated and reproduced using LLM agents. The LLM agents can perceive, control, and get feedback from the environment so as to accomplish the given tasks in an autonomous manner. Besides the environment-interaction property, the LLM agents can call various external tools to ease the task completion process. The tools can be regarded as a predefined operational process with private or real-time knowledge that does not exist in the parameters of LLMs. As a natural trend of development, the tools for calling are becoming autonomous agents, thus the full intelligent system turns out to be a multi-LLM-agent system (MLAS). This paper discusses the technical and business landscapes of MLAS. Compared to the previous single-LLM-agent system, a MLAS has the advantages of i) higher potential of task-solving performance, ii) higher flexibility for system changing, iii) proprietary data preserving for each participating entity, and iv) feasibility of monetization for each entity. To support the ecosystem of MLAS, we provide a preliminary version of such MLAS protocol considering technical requirements, data privacy, and business incentives. As such, MLAS would be a practical solution to achieve artificial collective intelligence in the near future.

Via

Access Paper or Ask Questions

I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception

Nov 20, 2024

Jiawei Zhang, Tian-Hao Zhang, Jun Wang, Jiaran Gao, Xinyuan Qian, Xu-Cheng Yin

Abstract:Controlling the style and characteristics of speech synthesis is crucial for adapting the output to specific contexts and user requirements. Previous Text-to-speech (TTS) works have focused primarily on the technical aspects of producing natural-sounding speech, such as intonation, rhythm, and clarity. However, they overlook the fact that there is a growing emphasis on spatial perception of synthesized speech, which may provide immersive experience in gaming and virtual reality. To solve this issue, in this paper, we present a novel multi-modal TTS approach, namely Image-indicated Immersive Text-to-speech Synthesis (I2TTS). Specifically, we introduce a scene prompt encoder that integrates visual scene prompts directly into the synthesis pipeline to control the speech generation process. Additionally, we propose a reverberation classification and refinement technique that adjusts the synthesized mel-spectrogram to enhance the immersive experience, ensuring that the involved reverberation condition matches the scene accurately. Experimental results demonstrate that our model achieves high-quality scene and spatial matching without compromising speech naturalness, marking a significant advancement in the field of context-aware speech synthesis. Project demo page: https://spatialTTS.github.io/ Index Terms-Speech synthesis, scene prompt, spatial perception

* 5pages,4figures

Via

Access Paper or Ask Questions

Intensity-Spatial Dual Masked Autoencoder for Multi-Scale Feature Learning in Chest CT Segmentation

Nov 20, 2024

Yuexing Ding, Jun Wang, Hongbing Lyu

Figure 1 for Intensity-Spatial Dual Masked Autoencoder for Multi-Scale Feature Learning in Chest CT Segmentation

Figure 2 for Intensity-Spatial Dual Masked Autoencoder for Multi-Scale Feature Learning in Chest CT Segmentation

Figure 3 for Intensity-Spatial Dual Masked Autoencoder for Multi-Scale Feature Learning in Chest CT Segmentation

Figure 4 for Intensity-Spatial Dual Masked Autoencoder for Multi-Scale Feature Learning in Chest CT Segmentation

Abstract:In the field of medical image segmentation, challenges such as indistinct lesion features, ambiguous boundaries,and multi-scale characteristics have long revailed. This paper proposes an improved method named Intensity-Spatial Dual Masked AutoEncoder (ISD-MAE). Based on the tissue-contrast semi-masked autoencoder, a Masked AutoEncoder (MAE) branch is introduced to perform intensity masking and spatial masking operations on chest CT images for multi-scale feature learning and segmentation tasks. The model utilizes a dual-branch structure and contrastive learning to enhance the ability to learn tissue features and boundary details. Experiments are conducted on multiple 2D and 3D datasets. The results show that ISD-MAE significantly outperforms other methods in 2D pneumonia and mediastinal tumor segmentation tasks. For example, the Dice score reaches 90.10% on the COVID19 LESION dataset, and the performance is relatively stable. However, there is still room for improvement on 3D datasets. In response to this, improvement directions are proposed, including optimizing the loss function, using enhanced 3D convolution blocks, and processing datasets from multiple perspectives.Our code is available at:https://github.com/prowontheus/ISD-MAE.

* 10 pages,6 figures,3 tables

Via

Access Paper or Ask Questions

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Nov 12, 2024

Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park(+6 more)

Figure 1 for BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Figure 2 for BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Figure 3 for BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Figure 4 for BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Abstract:We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale

Via

Access Paper or Ask Questions

Direct Preference Optimization Using Sparse Feature-Level Constraints

Nov 12, 2024

Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang(+1 more)

Figure 1 for Direct Preference Optimization Using Sparse Feature-Level Constraints

Figure 2 for Direct Preference Optimization Using Sparse Feature-Level Constraints

Figure 3 for Direct Preference Optimization Using Sparse Feature-Level Constraints

Figure 4 for Direct Preference Optimization Using Sparse Feature-Level Constraints

Abstract:The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignments.

Via

Access Paper or Ask Questions

Adaptive Conditional Expert Selection Network for Multi-domain Recommendation

Nov 11, 2024

Kuiyao Dong, Xingyu Lou, Feng Liu, Ruian Wang, Wenyi Yu, Ping Wang, Jun Wang

Abstract:Mixture-of-Experts (MOE) has recently become the de facto standard in Multi-domain recommendation (MDR) due to its powerful expressive ability. However, such MOE-based method typically employs all experts for each instance, leading to scalability issue and low-discriminability between domains and experts. Furthermore, the design of commonly used domain-specific networks exacerbates the scalability issues. To tackle the problems, We propose a novel method named CESAA consists of Conditional Expert Selection (CES) Module and Adaptive Expert Aggregation (AEA) Module to tackle these challenges. Specifically, CES first combines a sparse gating strategy with domain-shared experts. Then AEA utilizes mutual information loss to strengthen the correlations between experts and specific domains, and significantly improve the distinction between experts. As a result, only domain-shared experts and selected domain-specific experts are activated for each instance, striking a balance between computational efficiency and model performance. Experimental results on both public ranking and industrial retrieval datasets verify the effectiveness of our method in MDR tasks.

Via

Access Paper or Ask Questions

Proprioceptive and Exteroceptive Information Perception in a Fabric Soft Robotic Arm via Physical Reservoir Computing with minimal training data

Nov 11, 2024

Jun Wang, Zhi Qiao, Wenlong Zhang, Suyi Li

Figure 1 for Proprioceptive and Exteroceptive Information Perception in a Fabric Soft Robotic Arm via Physical Reservoir Computing with minimal training data

Figure 2 for Proprioceptive and Exteroceptive Information Perception in a Fabric Soft Robotic Arm via Physical Reservoir Computing with minimal training data

Figure 3 for Proprioceptive and Exteroceptive Information Perception in a Fabric Soft Robotic Arm via Physical Reservoir Computing with minimal training data

Figure 4 for Proprioceptive and Exteroceptive Information Perception in a Fabric Soft Robotic Arm via Physical Reservoir Computing with minimal training data

Abstract:Over the past decades, we have witnessed a rapid emergence of soft and reconfigurable robots thanks to their capability to interact safely with humans and adapt to complex environments. However, their softness makes accurate control very challenging. High-fidelity sensing is critical in improving control performance, especially posture and contact estimation. To this end, traditional camera-based sensors and load cells have limited portability and accuracy, and they will inevitably increase the robot's cost and weight. In this study, instead of using specialized sensors, we only collect distributed pressure data inside a pneumatics-driven soft arm and apply the physical reservoir computing principle to simultaneously predict its kinematic posture (i.e., bending angle) and payload status (i.e., payload mass). Our results show that, with careful readout training, one can obtain accurate bending angle and payload mass predictions via simple, weighted linear summations of pressure readings. In addition, our comparative analysis shows that, to guarantee low prediction errors within 10\%, bending angle prediction requires less training data than payload prediction. This result reveals that balanced linear and nonlinear body dynamics are critical for the physical reservoir to accomplish complex proprioceptive and exteroceptive information perception tasks. Finally, the method of exploring the most efficient readout training methods presented in this paper could be extended to other soft robotic systems to maximize their perception capabilities.

Via

Access Paper or Ask Questions

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Nov 05, 2024

Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab(+8 more)

Figure 1 for Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Figure 2 for Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Figure 3 for Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Figure 4 for Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Abstract:We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework to enable it to dynamically process memory in a nested structure, effectively learning from accumulated experience stored to handle complex reasoning tasks. It optimises long- and short-term memory by selectively storing and retrieving key information, guiding future decisions based on environmental rewards. This iterative approach allows it to refine decisions without fine-tuning or backpropagation, achieving continuous improvement through experiential learning. We evaluate our agent's apabilities using Kaggle competitions as a case study. Following a fully automated protocol, Agent K v1.0 systematically addresses complex and multimodal data science tasks, employing Bayesian optimisation for hyperparameter tuning and feature engineering. Our new evaluation framework rigorously assesses Agent K v1.0's end-to-end capabilities to generate and send submissions starting from a Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5\% success rate across tasks, spanning tabular, computer vision, NLP, and multimodal domains. When benchmarking against 5,856 human Kaggle competitors by calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38\%, demonstrating an overall skill level comparable to Expert-level users. Notably, its Elo-MMR score falls between the first and third quartiles of scores achieved by human Grandmasters. Furthermore, our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle's progression system.

Via

Access Paper or Ask Questions