6G promises a paradigm shift in which positioning and sensing are inherently integrated, not only enhancing communication performance but also enabling location- and context-aware services. Historically, positioning and sensing have been viewed through the lens of cost and performance trade-offs, in which improved performance implies an increased demand for radio, physical, and computational resources. However, 6G goes beyond this traditional perspective to encompass a set of broader values, namely sustainability, inclusiveness, and trustworthiness. This paper aims to: (i) shed light on these important value indicators and their relationship with the conventional key performance indicators, and (ii) unveil the dual nature of 6G in relation to these key value indicators (i.e., ensuring operation according to the values and enabling services that affect the values).
The far-field channel model has historically been used in wireless communications owing to its mathematical simplicity, its convenience for algorithm design, and its validity for relatively small array apertures. With the need for high data rates, low latency, and ubiquitous connectivity in the sixth generation (6G) of communication systems, new technology enablers such as extremely large antenna arrays (ELAA), reconfigurable intelligent surfaces (RISs), and distributed multiple-input-multiple-output (D-MIMO) systems will be adopted. These enablers not only aim to improve communication services but also affect localization and sensing (L\&S), which are expected to be integrated into future wireless systems. Despite appearing in different scenarios and supporting different frequency bands, these enablers share the so-called near-field (NF) features, which provide extra geometric information. In this work, starting from a brief description of NF channel features, we highlight the opportunities and challenges for 6G NF L\&S.
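As a rough illustration of why the far-field model is valid only for relatively small apertures, the sketch below evaluates the classical Fraunhofer distance $2D^2/\lambda$, a common rule of thumb for the boundary between near field and far field. The function name and the example aperture and carrier values are illustrative assumptions and are not taken from this work.

```python
# Rule-of-thumb near-field/far-field boundary (Fraunhofer distance).
# The function name and the example values below are illustrative assumptions.

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def fraunhofer_distance(aperture_m: float, carrier_hz: float) -> float:
    """Return 2 * D^2 / lambda in meters for aperture D and a carrier frequency."""
    if aperture_m <= 0 or carrier_hz <= 0:
        raise ValueError("aperture and carrier frequency must be positive")
    wavelength_m = SPEED_OF_LIGHT / carrier_hz
    return 2.0 * aperture_m ** 2 / wavelength_m


if __name__ == "__main__":
    # Same 28 GHz carrier, growing aperture: the boundary scales with D^2.
    for aperture in (0.05, 0.5, 1.0):
        distance = fraunhofer_distance(aperture, 28e9)
        print(f"D = {aperture} m -> boundary at about {distance:.1f} m")
```

Because the boundary grows quadratically with the aperture, a user at a fixed range can move from the far field into the near field simply because the array becomes larger.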
Video question--answering is a fundamental task in the field of video understanding. Although current vision--language models (VLMs) equipped with Video Transformers have enabled temporal modeling and yielded superior results, they come at the cost of huge computational power and are thus too expensive to deploy in real-time application scenarios. An economical workaround samples only a small portion of frames to represent the main content of the video and tunes an image--text model on these sampled frames. Recent video understanding models usually sample a set of frames or clips at random, without regard to the internal correlations among their visual contents or their relevance to the problem. We argue that such aimless sampling may omit the key frames from which the correct answer can be deduced, and that the situation worsens as the sampling sparsity increases, which always happens as video length increases. To mitigate this issue, we propose two frame sampling strategies, namely the most domain frames (MDF) and most implied frames (MIF), to maximally preserve those frames that are most likely vital to the given questions. MDF passively minimizes the risk of key frame omission in a bootstrap manner, while MIF actively searches for key frames customized for each video--question pair with the assistance of auxiliary models. Experimental results on three public datasets with three advanced VLMs (CLIP, GIT and All-in-one) demonstrate that our proposed strategies can boost the performance of image--text pretrained models. The source code for the method proposed in this paper is publicly available at https://github.com/declare-lab/sas-vqa.
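The contrast between question-agnostic and question-aware sampling can be sketched as follows. This is not the MDF or MIF procedure from the paper: it is a minimal top-k selection by cosine similarity, assuming that per-frame and question embeddings have already been computed by some auxiliary model. All function names and the toy embeddings are hypothetical.

```python
# Question-agnostic uniform sampling versus a simple question-aware top-k pick.
# NOT the MDF/MIF procedures; embeddings and names are hypothetical placeholders.
import math
from typing import List, Sequence


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick k evenly spaced frame indices, ignoring content and question."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    # Take the midpoint of each of the k equal-length segments.
    return [int(step * i + step / 2) for i in range(k)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def question_aware_sample(
    frame_embs: Sequence[Sequence[float]],
    question_emb: Sequence[float],
    k: int,
) -> List[int]:
    """Keep the k frames most similar to the question, in temporal order."""
    scores = [cosine(frame, question_emb) for frame in frame_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[: max(k, 0)])


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]  # toy embeddings
    question = [1.0, 0.0]
    print(uniform_sample(len(frames), 2))
    print(question_aware_sample(frames, question, 2))
```

With a fixed budget of k frames, the fraction of the video covered by `uniform_sample` shrinks as the video grows, which is the sparsity problem the abstract describes.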
Recently, lightweight Vision Transformers (ViTs) have demonstrated superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs and emphasize their potential for mobile devices. We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This results in a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. On ImageNet, RepViT achieves over 80\% top-1 accuracy with nearly 1 ms latency on an iPhone 12, which, to the best of our knowledge, is a first for a lightweight model. Our largest model, RepViT-M3, obtains 81.4\% accuracy with only 1.3 ms latency. The code and trained models are available at \url{https://github.com/jameslahm/RepViT}.
Aspect Sentiment Triplet Extraction (ASTE) is a subtask of Aspect-Based Sentiment Analysis (ABSA) that considers each opinion term, its expressed sentiment, and the corresponding aspect targets. However, existing methods are limited to the in-domain setting with two domains. Hence, we propose a domain-expanded benchmark to address the in-domain, out-of-domain and cross-domain settings. We support the new benchmark by annotating more than 4000 data samples for two new domains based on hotel and cosmetics reviews. Our analysis of five existing methods shows that while there is a significant gap between in-domain and out-of-domain performance, generative methods have a strong potential for domain generalization. Our datasets, code implementation and models are available at https://github.com/DAMO-NLP-SG/domain-expanded-aste .
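To make the task format concrete, the sketch below represents an ASTE output as (aspect, opinion, sentiment) triplets and scores predictions with exact-match precision, recall, and F1. The class name, field names, and example review are hypothetical, and exact-match F1 is assumed here as the metric; the benchmark's own data format and evaluation script may differ.

```python
# Minimal ASTE triplet representation with exact-match scoring.
# Names, fields, and the example sentence are hypothetical illustrations.
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Triplet:
    aspect: str     # aspect target, e.g. "room"
    opinion: str    # opinion term, e.g. "spacious"
    sentiment: str  # assumed label set: "positive", "negative", "neutral"


def triplet_f1(
    predicted: Iterable[Triplet], gold: Iterable[Triplet]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); a triplet counts only if all fields match."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical hotel review: "The room was spacious but the staff were rude."
    gold = [
        Triplet("room", "spacious", "positive"),
        Triplet("staff", "rude", "negative"),
    ]
    pred = [
        Triplet("room", "spacious", "positive"),
        Triplet("staff", "rude", "positive"),  # wrong sentiment, so no credit
    ]
    print(triplet_f1(pred, gold))
```

Under this scoring, a prediction with the right aspect and opinion but the wrong sentiment earns no credit, which is one reason out-of-domain performance can drop sharply.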
Reconfigurable intelligent surfaces (RISs) are expected to be a main component of future 6G networks, due to their capability to create a controllable wireless environment and to achieve extended coverage and improved localization accuracy. In this paper, we present a novel cooperative positioning use case for the RIS at mmWave frequencies, and show that in the presence of an RIS, together with sidelink communications, localization with zero access points (APs) is possible. We show that multiple (at least three) half-duplex single-antenna user equipments (UEs) can cooperatively estimate their positions through device-to-device communications with a single RIS as an anchor without the need for any APs. We start by formulating a three-dimensional positioning problem and deriving the Cram\'er-Rao lower bound (CRLB) for performance analysis. After that, we discuss the RIS profile design and the power allocation strategy between the UEs. Then, we propose low-complexity estimators for the channel parameters and the UEs' positions. Finally, we evaluate the performance of the proposed estimators and RIS profiles in the considered scenario via extensive simulations and show that sub-meter level positioning accuracy can be achieved under multi-path propagation.
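For reference, the general form of the bound used in this kind of analysis can be written as below. The notation is an assumption made for illustration and may differ from the paper's: $\boldsymbol{\eta}$ is the vector of unknown parameters, $\hat{\boldsymbol{\eta}}$ is any unbiased estimator of it, $\mathbf{J}(\boldsymbol{\eta})$ is the Fisher information matrix, and $\mathbf{p}_k \in \mathbb{R}^3$ is the position of the $k$-th UE.

\[
\mathbb{E}\big[(\hat{\boldsymbol{\eta}}-\boldsymbol{\eta})(\hat{\boldsymbol{\eta}}-\boldsymbol{\eta})^{\top}\big] \succeq \mathbf{J}^{-1}(\boldsymbol{\eta}),
\qquad
\mathrm{PEB}_k = \sqrt{\operatorname{tr}\big(\big[\mathbf{J}^{-1}(\boldsymbol{\eta})\big]_{\mathbf{p}_k}\big)},
\]

where $[\cdot]_{\mathbf{p}_k}$ denotes the $3 \times 3$ diagonal block associated with $\mathbf{p}_k$. The position error bound $\mathrm{PEB}_k$ lower-bounds the root-mean-square position error of any unbiased estimator, so it serves as a benchmark against which the low-complexity estimators can be compared.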