Medical conversational systems can relieve the burden on doctors and improve the efficiency of healthcare, especially during the pandemic. This paper presents a medical conversational question answering (CQA) system based on a multi-modal knowledge graph, namely "LingYi", which is designed as a pipeline framework to maintain high flexibility. Our system automates medical procedures including medical triage, consultation, image-text drug recommendation and record. To conduct knowledge-grounded dialogues with patients, we first construct a Chinese Medical Multi-Modal Knowledge Graph (CM3KG) and collect a large-scale Chinese Medical CQA (CMCQA) dataset. Compared with other existing medical question-answering systems, our system adopts several state-of-the-art technologies, including medical entity disambiguation and medical dialogue generation, which make it better suited to providing medical services to patients. In addition, we have open-sourced our code, which contains the back-end models and front-end web pages, at https://github.com/WENGSYX/LingYi. The datasets, CM3KG at https://github.com/WENGSYX/CM3KG and CMCQA at https://github.com/WENGSYX/CMCQA, are also released to promote future research.
Generative Adversarial Networks (GANs) have achieved remarkable success in image synthesis. This success relies on large-scale datasets, which are costly to collect. How to stabilize the training process of GANs and generate realistic images with limited training data has therefore attracted increasing attention. The challenges of Data-Efficient GANs (DE-GANs) mainly arise from three aspects: (i) Mismatch Between Training and Target Distributions, (ii) Overfitting of the Discriminator, and (iii) Imbalance Between Latent and Data Spaces. Although many augmentation and pre-training strategies have been proposed to alleviate these issues, a systematic survey that summarizes the properties, challenges, and solutions of DE-GANs is still lacking. In this paper, we revisit and define DE-GANs from the perspective of distribution optimization. We summarize and analyze the challenges of DE-GANs. We also propose a taxonomy that classifies the existing methods into three categories: Data Selection, GANs Optimization, and Knowledge Sharing. Finally, we highlight the current problems and future directions.
The last decade has witnessed enormous improvements in science and technology, stimulating a growing demand for economic and cultural exchanges among countries. Building neural machine translation (NMT) systems has become an urgent need, especially in the low-resource setting. However, recent work tends to study NMT systems for low-resource languages centered on English, while few works focus on low-resource NMT systems centered on other languages such as Chinese. To address this gap, the low-resource multilingual translation challenge of the 2021 iFLYTEK AI Developer Competition provides Chinese-centric multilingual low-resource NMT tasks, where participants are required to build NMT systems based on the provided low-resource samples. In this paper, we present the winning competition system, which leverages monolingual word-embedding data enhancement, bilingual curriculum learning, and contrastive re-ranking. In addition, a new Incomplete-Trust (In-trust) loss function is proposed to replace the traditional cross-entropy loss during training. The experimental results demonstrate that implementing these ideas leads to better performance than other state-of-the-art methods. All the experimental code is released at: https://github.com/WENGSYX/Low-resource-text-translation.
Automatic laparoscope motion control is fundamentally important for surgeons to perform operations efficiently. However, traditional control methods based on tool tracking ignore the information hidden in surgical scenes and are not intelligent enough, while the latest supervised imitation learning (IL)-based methods require expensive sensor data and suffer from distribution mismatch caused by limited demonstrations. In this paper, we propose a novel Imitation Learning framework for Laparoscope Control (ILLC) with reinforcement learning (RL), which can efficiently learn the control policy from limited surgical video clips. Specifically, we first extract surgical laparoscope trajectories from unlabeled videos as the demonstrations and reconstruct the corresponding surgical scenes. To learn fully from the limited motion trajectory demonstrations, we propose Shape Preserving Trajectory Augmentation (SPTA) to augment these data, and we build a simulation environment that supports parallel RGB-D rendering so that the RL policy can interact with the environment efficiently. With adversarial training for IL, we obtain the laparoscope control policy from the generated rollouts and surgical demonstrations. Extensive experiments are conducted in unseen reconstructed surgical scenes, and our method outperforms previous IL methods, which demonstrates the feasibility of our unified learning-based framework for laparoscope control.
Temporal answering grounding in the video (TAGV) is a new task naturally derived from temporal sentence grounding in the video (TSGV). Given an untrimmed video and a text question, this task aims to locate the matching span in the video that semantically answers the question. Existing methods tend to formulate the TAGV task as visual span-based question answering (QA), matching the visual frame span queried by the text question. However, due to the weak correlations and large gaps between the semantic features of the textual question and the visual answer, existing methods that adopt a visual span predictor perform poorly on the TAGV task. To bridge these gaps, we propose a visual-prompt text span localizing (VPTSL) method, which introduces the timestamped subtitles as a passage in which to perform text span localization for the input text question, and prompts the pre-trained language model (PLM) with visual highlight features to enhance the joint semantic representations. Specifically, context query attention is used to perform cross-modal interaction between the extracted textual and visual features. Then, the highlight features are obtained through video-text highlighting for the visual prompt. To alleviate semantic differences between textual and visual features, we design the text span predictor by encoding the question, the subtitles, and the prompted visual highlight features with the PLM. As a result, the TAGV task is formulated as predicting the span of subtitles that matches the visual answer. Extensive experiments on the medical instructional dataset MedVidQA show that the proposed VPTSL outperforms the state-of-the-art (SOTA) method by a large margin of 28.36% in terms of mIOU, which demonstrates the effectiveness of the proposed visual prompt and the text span predictor.
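The span formulation and the mIOU metric mentioned above can be illustrated with a minimal sketch. It assumes that each subtitle carries a `(start_sec, end_sec)` timestamp and that a predicted subtitle span is mapped back to video time by taking the start of its first subtitle and the end of its last one. The function names and the data layout are hypothetical and are not taken from the released VPTSL code.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec), an assumed layout


def subtitle_span_to_time(subtitles: List[Span], start_idx: int, end_idx: int) -> Span:
    """Map a predicted subtitle index span (inclusive) to a time span in the video."""
    return subtitles[start_idx][0], subtitles[end_idx][1]


def span_iou(pred: Span, gold: Span) -> float:
    """Temporal intersection over union of two time spans."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Span], golds: List[Span]) -> float:
    """Average the per-example IoU over a set of predictions."""
    return sum(span_iou(p, g) for p, g in zip(preds, golds)) / len(preds)


if __name__ == "__main__":
    subs = [(0.0, 4.0), (4.0, 9.5), (9.5, 15.0)]
    pred = subtitle_span_to_time(subs, 1, 2)  # subtitles 1..2 cover 4.0s to 15.0s
    print(pred, span_iou(pred, (5.0, 15.0)))
```

Predicting indices over subtitles rather than over frames keeps the question and the candidate answer in the same textual modality, which is the motivation the abstract gives for the text span predictor.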
This paper describes the LingJing team's submission to the Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA) 2022 shared task on Personality Prediction (PER) and Reactivity Index Prediction (IRI). We adopt a prompt-based method with a pre-trained language model to accomplish these tasks. Specifically, the prompt is designed to provide extra knowledge that enhances the pre-trained model. Data augmentation and model ensembling are adopted to obtain better results. Extensive experiments show the effectiveness of the proposed method. In the final submission, our system achieves Pearson Correlation Coefficients of 0.2301 and 0.2546 on Track 3 and Track 4, respectively. We ranked first on both sub-tasks.
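For readers unfamiliar with the evaluation metric, the following is a minimal sketch of the Pearson Correlation Coefficient, together with a plain averaging ensemble. The averaging scheme is an assumption made for illustration; the abstract does not state how the team's models were combined, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ensemble(predictions: List[Sequence[float]]) -> List[float]:
    """Average per-example scores across models (an assumed ensembling scheme)."""
    return [sum(scores) / len(scores) for scores in zip(*predictions)]


if __name__ == "__main__":
    gold = [1.0, 2.0, 3.0, 4.0]
    model_a = [1.2, 1.9, 3.5, 3.8]
    model_b = [0.8, 2.4, 2.6, 4.1]
    print(pearson(average_ensemble([model_a, model_b]), gold))
```

Python 3.10 and later also provide `statistics.correlation`, which computes the same quantity.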
How to efficiently utilize temporal features is crucial, yet challenging, for video restoration. Temporal features usually contain noisy and uncorrelated information that may interfere with the restoration of the current frame. This paper proposes learning noise-robust feature representations to help video restoration. We are inspired by the observation that a neural codec is a natural denoiser. In a neural codec, noisy and uncorrelated content, which is hard to predict and costs many bits, tends to be discarded to save bitrate. Therefore, we design a neural compression module to filter the noise and keep the most useful information in the features for video restoration. To achieve robustness to noise, our compression module adopts a spatial channel-wise quantization mechanism that adaptively determines the quantization step size for each position in the latent. Experiments show that our method can significantly boost performance on video denoising, where we obtain a 0.13 dB improvement over BasicVSR++ with only 0.23x FLOPs. Our method also obtains SOTA results on video deraining and dehazing.
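The idea of a position-dependent quantization step can be sketched as follows. This is only an illustration under stated assumptions: the latent is flattened into a list, the step sizes are given as inputs rather than predicted by a network, and plain rounding is used. None of the names come from the paper.

```python
from typing import List, Sequence


def quantize(latent: Sequence[float], steps: Sequence[float]) -> List[float]:
    """Quantize each latent value with its own step size.

    A larger step discards more detail at that position, so it suits positions
    dominated by noise; a smaller step preserves more information.
    """
    return [round(v / s) * s for v, s in zip(latent, steps)]


if __name__ == "__main__":
    latent = [0.26, 1.9, -0.7]
    steps = [0.5, 1.0, 0.25]  # hypothetical per-position step sizes
    for v, s, q in zip(latent, steps, quantize(latent, steps)):
        print(f"value={v:+.2f} step={s:.2f} quantized={q:+.2f}")
```

With uniform rounding the reconstruction error at each position is at most half the step size, so the step size directly controls how much fine-grained (and possibly noisy) variation survives.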
This paper studies multi-unit auctions powered by intermediaries, where each intermediary owns a private set of unit-demand buyers and all intermediaries are networked with each other. Our goal is to incentivize the intermediaries to diffuse the auction information to the individuals they can reach, including their private buyers and neighboring intermediaries, so that more potential buyers are able to participate in the auction. To this end, we build a diffusion-based auction framework that incorporates the strategic interaction of intermediaries. It is shown that the classic Vickrey-Clarke-Groves (VCG) mechanism within the framework can achieve the maximum social welfare, but it may decrease the seller's revenue or even lead to a deficit. To overcome the revenue issue, we propose a novel auction, called the critical neighborhood auction, which not only maximizes the social welfare but also improves the seller's revenue compared to the VCG mechanism with or without intermediaries.
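As background for the VCG baseline, the sketch below covers only the textbook setting without intermediaries or a network: k identical units and unit-demand buyers who bid directly. In that setting VCG allocates the units to the k highest bidders, and each winner pays the (k+1)-th highest bid, which is the externality it imposes on the others. The diffusion framework and the critical neighborhood auction from the abstract are not modeled here, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def vcg_multi_unit(bids: Dict[str, float], k: int) -> Tuple[List[str], Dict[str, float]]:
    """VCG for k identical units and unit-demand buyers (no intermediaries).

    Returns the winners and the payment of each winner.
    """
    # Sort by bid, highest first; break ties by name for determinism.
    ranked = sorted(bids.items(), key=lambda item: (-item[1], item[0]))
    winners = [name for name, _ in ranked[:k]]
    # Each winner displaces the best losing bidder, so it pays that bid
    # (or zero when there are no more than k bidders).
    price = ranked[k][1] if len(ranked) > k else 0.0
    return winners, {name: price for name in winners}


if __name__ == "__main__":
    winners, payments = vcg_multi_unit({"a": 10, "b": 8, "c": 5, "d": 3}, k=2)
    print(winners, payments, "revenue:", sum(payments.values()))
```

In the networked setting of the abstract, payments must additionally reward intermediaries for spreading the auction information, which is how VCG revenue can fall or even become negative.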
Blood pressure indicates cardiac function and peripheral vascular resistance and is critical for disease diagnosis. Traditionally, blood pressure data are mainly acquired through contact sensors, which require high maintenance and may be inconvenient and unfriendly to some people (e.g., burn patients). In this paper, an efficient non-contact blood pressure measurement network based on face videos is proposed for the first time. An innovative oversampling training strategy is proposed to handle the unbalanced data distribution. The input video sequences are first normalized and converted to our proposed YUVT color space. Then, the spatio-temporal slicer encodes them into a multi-domain spatio-temporal mapping. Next, the neural network computation module extracts high-dimensional features from the multi-domain spatial feature mapping, an LSTM strengthens the time-domain association of the extracted features, and the blood pressure classifier computes the blood pressure measurement intervals. Finally, the blood pressure calculator combines the output of feature extraction with the classification result to compute the blood pressure measurement values. Using a blood pressure classifier to predict blood pressure intervals helps the neural network distinguish between the high-dimensional features of different blood pressure intervals and alleviates overfitting. It also locates the blood pressure interval, corrects the final blood pressure values, and improves network performance. Experimental results on two datasets show that the network outperforms existing state-of-the-art methods.
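The abstract does not describe the oversampling strategy in detail, so the following sketch shows only generic random oversampling as a point of reference: samples from under-represented blood pressure intervals are repeated until every class matches the largest one. The function name, the class labels, and the seeding are hypothetical and may differ from the authors' strategy.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple


def random_oversample(
    samples: Sequence, labels: Sequence[Hashable], seed: int = 0
) -> List[Tuple[object, Hashable]]:
    """Repeat minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Keep every original sample and draw the shortfall with replacement.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    clips = list(range(10))                # stand-ins for video clips
    intervals = ["normal"] * 8 + ["high"] * 2
    balanced = random_oversample(clips, intervals)
    print(Counter(label for _, label in balanced))
```

Oversampling should be applied to the training split only, since duplicated samples in a validation or test split would distort the reported metrics.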