Abstract:Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful -- across document and GUI benchmarks, only 22--71\% of image patches are pixel-unique, the rest being exact duplicates of another patch in the same image. We propose \textbf{PixelPrune}, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches \emph{before} the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression ($τ{=}0$) as well as controlled lossy compression ($τ{>}0$). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2$\times$ inference speedup and 1.9$\times$ training acceleration. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.
Abstract:Fake news detection has been a critical task for maintaining the health of the online news ecosystem. However, very few existing works consider the temporal shift issue caused by the rapidly-evolving nature of news data in practice, resulting in significant performance degradation when training on past data and testing on future data. In this paper, we observe that the appearances of news events on the same topic may display discernible patterns over time, and posit that such patterns can assist in selecting training instances that could make the model adapt better to future data. Specifically, we design an effective framework FTT (Forecasting Temporal Trends), which could forecast the temporal distribution patterns of news data and then guide the detector to fast adapt to future distribution. Experiments on the real-world temporally split dataset demonstrate the superiority of our proposed framework. The code is available at https://github.com/ICTMCG/FTT-ACL23.




Abstract:Numerous fake images spread on social media today and can severely jeopardize the credibility of online content to public. In this paper, we employ deep networks to learn distinct fake image related features. In contrast to authentic images, fake images tend to be eye-catching and visually striking. Compared with traditional visual recognition tasks, it is extremely challenging to understand these psychologically triggered visual patterns in fake images. Traditional general image classification datasets, such as ImageNet set, are designed for feature learning at the object level but are not suitable for learning the hyper-features that would be required by image credibility analysis. In order to overcome the scarcity of training samples of fake images, we first construct a large-scale auxiliary dataset indirectly related to this task. This auxiliary dataset contains 0.6 million weakly-labeled fake and real images collected automatically from social media. Through an AdaBoost-like transfer learning algorithm, we train a CNN model with a few instances in the target training set and 0.6 million images in the collected auxiliary set. This learning algorithm is able to leverage knowledge from the auxiliary set and gradually transfer it to the target task. Experiments on a real-world testing set show that our proposed domain transferred CNN model outperforms several competing baselines. It obtains superiror results over transfer learning methods based on the general ImageNet set. Moreover, case studies show that our proposed method reveals some interesting patterns for distinguishing fake and authentic images.