It is well-known that document context is vital for resolving a range of translation ambiguities, and in fact the document setting is the most natural setting for nearly all translation. It is therefore unfortunate that machine translation -- both research and production -- largely remains stuck in a decades-old sentence-level translation paradigm. It is also an increasingly glaring problem in light of competitive pressure from large language models, which are natively document-based. Much work in document-context machine translation exists, but for various reasons has been unable to catch hold. This paper suggests a path out of this rut by addressing three impediments at once: what architectures should we use? where do we get document-level information for training them? and how do we know whether they are any good? In contrast to work on specialized architectures, we show that the standard Transformer architecture is sufficient, provided it has enough capacity. Next, we address the training data issue by taking document samples from back-translated data only, where the data is not only more readily available, but is also of higher quality compared to parallel document data, which may contain machine translation output. Finally, we propose generative variants of existing contrastive metrics that are better able to discriminate among document systems. Results in four large-data language pairs (DE$\rightarrow$EN, EN$\rightarrow$DE, EN$\rightarrow$FR, and EN$\rightarrow$RU) establish the success of these three pieces together in improving document-level performance.
We study a distributed stochastic multi-armed bandit where a client supplies the learner with communication-constrained feedback based on the rewards for the corresponding arm pulls. In our setup, the client must encode the rewards such that the second moment of the encoded rewards is no more than $P$, and this encoded reward is further corrupted by additive Gaussian noise of variance $\sigma^2$; the learner only has access to this corrupted reward. For this setting, we derive an information-theoretic lower bound of $\Omega\left(\sqrt{\frac{KT}{\mathtt{SNR} \wedge1}} \right)$ on the minimax regret of any scheme, where $ \mathtt{SNR} := \frac{P}{\sigma^2}$, and $K$ and $T$ are the number of arms and time horizon, respectively. Furthermore, we propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which matches this lower bound to a minor additive factor. $\mathtt{UE\text{-}UCB++}$ performs uniform exploration in its initial phases and then utilizes the {\em upper confidence bound }(UCB) bandit algorithm in its final phase. An interesting feature of $\mathtt{UE\text{-}UCB++}$ is that the coarser estimates of the mean rewards formed during a uniform exploration phase help to refine the encoding protocol in the next phase, leading to more accurate mean estimates of the rewards in the subsequent phase. This positive reinforcement cycle is critical to reducing the number of uniform exploration rounds and closely matching our lower bound.
Point cloud registration is an important task in robotics and autonomous driving to estimate the ego-motion of the vehicle. Recent advances following the coarse-to-fine manner show promising potential in point cloud registration. However, existing methods rely on good superpoint correspondences, which are hard to be obtained reliably and efficiently, thus resulting in less robust and accurate point cloud registration. In this paper, we propose a novel network, named RDMNet, to find dense point correspondences coarse-to-fine and improve final pose estimation based on such reliable correspondences. Our RDMNet uses a devised 3D-RoFormer mechanism to first extract distinctive superpoints and generates reliable superpoints matches between two point clouds. The proposed 3D-RoFormer fuses 3D position information into the transformer network, efficiently exploiting point clouds' contextual and geometric information to generate robust superpoint correspondences. RDMNet then propagates the sparse superpoints matches to dense point matches using the neighborhood information for accurate point cloud registration. We extensively evaluate our method on multiple datasets from different environments. The experimental results demonstrate that our method outperforms existing state-of-the-art approaches in all tested datasets with a strong generalization ability.
Beamforming techniques have been widely used in the millimeter wave (mmWave) bands to mitigate the path loss of mmWave radio links as the narrow straight beams by directionally concentrating the signal energy. However, traditional mmWave beam management algorithms usually require excessive channel state information overhead, leading to extremely high computational and communication costs. This hinders the widespread deployment of mmWave communications. By contrast, the revolutionary vision-assisted beam management system concept employed at base stations (BSs) can select the optimal beam for the target user equipment (UE) based on its location information determined by machine learning (ML) algorithms applied to visual data, without requiring channel information. In this paper, we present a comprehensive framework for a vision-assisted mmWave beam management system, its typical deployment scenarios as well as the specifics of the framework. Then, some of the challenges faced by this system and their efficient solutions are discussed from the perspective of ML. Next, a new simulation platform is conceived to provide both visual and wireless data for model validation and performance evaluation. Our simulation results indicate that the vision-assisted beam management is indeed attractive for next-generation wireless systems.
Incomplete multi-view clustering (IMVC) is an unsupervised approach, among which IMVC via contrastive learning has received attention due to its excellent performance. The previous methods have the following problems: 1) Over-reliance on additional projection heads when solving the dimensional collapse problem in which latent features are only valid in lower-dimensional subspaces during clustering. However, many parameters in the projection heads are unnecessary. 2) The recovered view contain inconsistent private information and useless private information will mislead the learning of common semantics due to consistent learning and reconstruction learning on the same feature. To address the above issues, we propose a novel incomplete multi-view contrastive clustering framework. This framework directly optimizes the latent feature subspace, utilizes the learned feature vectors and their sub-vectors for reconstruction learning and consistency learning, thereby effectively avoiding dimensional collapse without relying on projection heads. Since reconstruction loss and contrastive loss are performed on different features, the adverse effect of useless private information is reduced. For the incomplete data, the missing information is recovered by the cross-view prediction mechanism and the inconsistent information from different views is discarded by the minimum conditional entropy to further avoid the influence of private information. Extensive experimental results of the method on 5 public datasets show that the method achieves state-of-the-art clustering results.
The World Wide Lab (WWL) connects the Digital Shadows (DSs) of processes, products, companies, and other entities allowing the exchange of information across company boundaries. Since DSs are context- and purpose-specific representations of a process, as opposed to Digital Twins (DTs) which offer a full simulation, the integration of a process into the WWL requires the creation of DSs representing different aspects of the process. Human-Robot Collaboration (HRC) for assembly processes was recently studied in the context of the WWL where Behaviour Trees (BTs) were proposed as a standard task-level representation of these processes. We extend previous work by proposing to standardise safety functions that can be directly integrated into these BTs. This addition uses the WWL as a communication and information exchange platform allowing industrial and academic practitioners to exchange, reuse, and experiment with different safety requirements and solutions in the WWL.
In this report, we summarize the first NTIRE challenge on light field (LF) image super-resolution (SR), which aims at super-resolving LF images under the standard bicubic degradation with a magnification factor of 4. This challenge develops a new LF dataset called NTIRE-2023 for validation and test, and provides a toolbox called BasicLFSR to facilitate model development. Compared with single image SR, the major challenge of LF image SR lies in how to exploit complementary angular information from plenty of views with varying disparities. In total, 148 participants have registered the challenge, and 11 teams have successfully submitted results with PSNR scores higher than the baseline method LF-InterNet \cite{LF-InterNet}. These newly developed methods have set new state-of-the-art in LF image SR, e.g., the winning method achieves around 1 dB PSNR improvement over the existing state-of-the-art method DistgSSR \cite{DistgLF}. We report the solutions proposed by the participants, and summarize their common trends and useful tricks. We hope this challenge can stimulate future research and inspire new ideas in LF image SR.
Medical image recognition often faces the problem of insufficient data in practical applications. Image recognition and processing under few-shot conditions will produce overfitting, low recognition accuracy, low reliability and insufficient robustness. It is often the case that the difference of characteristics is subtle, and the recognition is affected by perspectives, background, occlusion and other factors, which increases the difficulty of recognition. Furthermore, in fine-grained images, the few-shot problem leads to insufficient useful feature information in the images. Considering the characteristics of few-shot and fine-grained image recognition, this study has established a recognition model based on attention and Siamese neural network. Aiming at the problem of few-shot samples, a Siamese neural network suitable for classification model is proposed. The Attention-Based neural network is used as the main network to improve the classification effect. Covid- 19 lung samples have been selected for testing the model. The results show that the less the number of image samples are, the more obvious the advantage shows than the ordinary neural network.
Neural radiance fields (NeRFs) enable novel view synthesis with unprecedented visual quality. However, to render photorealistic images, NeRFs require hundreds of deep multilayer perceptron (MLP) evaluations - for each pixel. This is prohibitively expensive and makes real-time rendering infeasible, even on powerful modern GPUs. In this paper, we propose a novel approach to distill and bake NeRFs into highly efficient mesh-based neural representations that are fully compatible with the massively parallel graphics rendering pipeline. We represent scenes as neural radiance features encoded on a two-layer duplex mesh, which effectively overcomes the inherent inaccuracies in 3D surface reconstruction by learning the aggregated radiance information from a reliable interval of ray-surface intersections. To exploit local geometric relationships of nearby pixels, we leverage screen-space convolutions instead of the MLPs used in NeRFs to achieve high-quality appearance. Finally, the performance of the whole framework is further boosted by a novel multi-view distillation optimization strategy. We demonstrate the effectiveness and superiority of our approach via extensive experiments on a range of standard datasets.
The duality of content and style is inherent to the nature of art. For humans, these two elements are clearly different: content refers to the objects and concepts in the piece of art, and style to the way it is expressed. This duality poses an important challenge for computer vision. The visual appearance of objects and concepts is modulated by the style that may reflect the author's emotions, social trends, artistic movement, etc., and their deep comprehension undoubtfully requires to handle both. A promising step towards a general paradigm for art analysis is to disentangle content and style, whereas relying on human annotations to cull a single aspect of artworks has limitations in learning semantic concepts and the visual appearance of paintings. We thus present GOYA, a method that distills the artistic knowledge captured in a recent generative model to disentangle content and style. Experiments show that synthetically generated images sufficiently serve as a proxy of the real distribution of artworks, allowing GOYA to separately represent the two elements of art while keeping more information than existing methods.