Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhecheng An

Seed1.5-VL Technical Report

May 11, 2025

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang(+187 more)

Figure 1 for Seed1.5-VL Technical Report

Figure 2 for Seed1.5-VL Technical Report

Figure 3 for Seed1.5-VL Technical Report

Figure 4 for Seed1.5-VL Technical Report

Abstract:We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

Via

Access Paper or Ask Questions

CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

May 27, 2023

Linhao Dong, Zhecheng An, Peihao Wu, Jun Zhang, Lu Lu, Zejun Ma

Figure 1 for CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

Figure 2 for CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

Figure 3 for CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

Figure 4 for CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

Abstract:Speech or text representation generated by pre-trained models contains modal-specific information that could be combined for benefiting spoken language understanding (SLU) tasks. In this work, we propose a novel pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT). It relies on a simple but effective frame-to-token alignment: continuous integrate-and-fire (CIF) to bridge the representations between speech and text. It jointly performs speech-to-text training and language model distillation through CIF as the pre-training (PT). Evaluated on SLU benchmark SLURP dataset, CIF-PT outperforms the state-of-the-art model by 1.94% of accuracy and 2.71% of SLU-F1 on the tasks of intent classification and slot filling, respectively. We also observe the cross-modal representation extracted by CIF-PT obtains better performance than other neural interfaces for the tasks of SLU, including the dominant speech representation learned from self-supervised pre-training.

* Accepted by ACL 2023 Findings

Via

Access Paper or Ask Questions

Improving Contextual Representation with Gloss Regularized Pre-training

May 13, 2022

Yu Lin, Zhecheng An, Peihao Wu, Zejun Ma

Figure 1 for Improving Contextual Representation with Gloss Regularized Pre-training

Figure 2 for Improving Contextual Representation with Gloss Regularized Pre-training

Figure 3 for Improving Contextual Representation with Gloss Regularized Pre-training

Figure 4 for Improving Contextual Representation with Gloss Regularized Pre-training

Abstract:Though achieving impressive results on many NLP tasks, the BERT-like masked language models (MLM) encounter the discrepancy between pre-training and inference. In light of this gap, we investigate the contextual representation of pre-training and inference from the perspective of word probability distribution. We discover that BERT risks neglecting the contextual word similarity in pre-training. To tackle this issue, we propose an auxiliary gloss regularizer module to BERT pre-training (GR-BERT), to enhance word semantic similarity. By predicting masked words and aligning contextual embeddings to corresponding glosses simultaneously, the word similarity can be explicitly modeled. We design two architectures for GR-BERT and evaluate our model in downstream tasks. Experimental results show that the gloss regularizer benefits BERT in word-level and sentence-level semantic representation. The GR-BERT achieves new state-of-the-art in lexical substitution task and greatly promotes BERT sentence representation in both unsupervised and supervised STS tasks.

* Accepted to Findings of NAACL 2022

Via

Access Paper or Ask Questions

Towards Diverse Paraphrase Generation Using Multi-Class Wasserstein GAN

Sep 30, 2019

Zhecheng An, Sicong Liu

Figure 1 for Towards Diverse Paraphrase Generation Using Multi-Class Wasserstein GAN

Figure 2 for Towards Diverse Paraphrase Generation Using Multi-Class Wasserstein GAN

Figure 3 for Towards Diverse Paraphrase Generation Using Multi-Class Wasserstein GAN

Figure 4 for Towards Diverse Paraphrase Generation Using Multi-Class Wasserstein GAN

Abstract:Paraphrase generation is an important and challenging natural language processing (NLP) task. In this work, we propose a deep generative model to generate paraphrase with diversity. Our model is based on an encoder-decoder architecture. An additional transcoder is used to convert a sentence into its paraphrasing latent code. The transcoder takes an explicit pattern embedding variable as condition, so diverse paraphrase can be generated by sampling on the pattern embedding variable. We use a Wasserstein GAN to align the distributions of the real and generated paraphrase samples. We propose a multi-class extension to the Wasserstein GAN, which allows our generative model to learn from both positive and negative samples. The generated paraphrase distribution is forced to get closer to the positive real distribution, and be pushed away from the negative distribution in Wasserstein distance. We test our model in two datasets with both automatic metrics and human evaluation. Results show that our model can generate fluent and reliable paraphrase samples that outperform the state-of-art results, while also provides reasonable variability and diversity.

* 8 pages, 2 figures

Via

Access Paper or Ask Questions