In recent years, the rapid growth of online multimedia services, such as e-commerce platforms, has necessitated the development of personalised recommendation approaches that can encode diverse content about each item. Indeed, modern multi-modal recommender systems exploit diverse features obtained from raw images and item descriptions to enhance the recommendation performance. However, the existing multi-modal recommenders primarily depend on the features extracted individually from different media through pre-trained modality-specific encoders, and exhibit only shallow alignments between different modalities, which limits these systems' ability to capture the underlying relationships between the modalities. In this paper, we investigate the use of large multi-modal encoders within the specific context of recommender systems, as these have previously demonstrated state-of-the-art effectiveness when ranking items across various domains. Specifically, we tailor two state-of-the-art multi-modal encoders (CLIP and VLMo) for recommendation tasks using a range of strategies, including the exploration of pre-trained and fine-tuned encoders, as well as the assessment of the end-to-end training of these encoders. We demonstrate that pre-trained large multi-modal encoders can generate more aligned and effective user/item representations compared to existing modality-specific encoders across three multi-modal recommendation datasets. Furthermore, we show that fine-tuning these large multi-modal encoders with recommendation datasets leads to enhanced recommendation performance. In terms of different training paradigms, our experiments highlight the essential role of the end-to-end training of large multi-modal encoders in multi-modal recommendation systems.
Recommender systems are frequently challenged by the data sparsity problem. One approach to mitigate this issue is through cross-domain recommendation techniques. In a cross-domain context, sharing knowledge between domains can enhance the effectiveness in the target domain. Recent cross-domain methods have employed a pre-training approach, but we argue that these methods often result in suboptimal fine-tuning, especially with large neural models. Modern language models utilize prompts for efficient model tuning. Such prompts act as a tunable latent vector, allowing for the freezing of the main model parameters. In our research, we introduce the Personalised Graph Prompt-based Recommendation (PGPRec) framework, which leverages the advantages of prompt-tuning. Within this framework, we formulate personalized graph prompts item-wise, rooted in items that a user has previously engaged with. Specifically, we employ Contrastive Learning (CL) to produce pre-trained embeddings that offer greater generalizability in the pre-training phase, ensuring robust training during the tuning phase. Our evaluation of PGPRec in cross-domain scenarios involves comprehensive testing with the top-k recommendation tasks and a cold-start analysis. Our empirical findings, based on four Amazon Review datasets, reveal that the PGPRec framework can decrease the tuned parameters by as much as 74% while maintaining competitive performance. Remarkably, there is an 11.41% enhancement in performance against the leading baseline in cold-start situations.
One advantage of neural ranking models is that they are meant to generalise well in situations of synonymity, i.e., where two words have similar or identical meanings. In this paper, we investigate and quantify how well various ranking models perform in a clear-cut case of synonymity: when words are simply expressed in different surface forms due to regional differences in spelling conventions (e.g., color vs. colour). We first explore the prevalence of American and British English spelling conventions in datasets used for the pre-training, training and evaluation of neural retrieval methods, and find that American spelling conventions are far more prevalent. Despite these biases in the training data, we find that retrieval models often generalise well in this case of synonymity. We explore the effect of document spelling normalisation in retrieval and observe that all models are affected by normalising the document's spelling. While they all experience a drop in performance when the document is normalised to a different spelling convention than that of the query, we observe varied behaviour when the document is normalised to share the query spelling convention: lexical models show improvements, dense retrievers remain unaffected, and re-rankers exhibit contradictory behaviour.
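To make the idea of document spelling normalisation concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes a tiny hand-written word mapping (`BRITISH_TO_AMERICAN` is hypothetical and purely illustrative) and rewrites whole words to one convention; a real study would need a far larger lexicon.

```python
import re

# Hypothetical, deliberately tiny mapping used only for illustration.
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "centre": "center",
    "normalise": "normalize",
}
# Inverting the mapping gives the opposite normalisation direction.
AMERICAN_TO_BRITISH = {us: uk for uk, us in BRITISH_TO_AMERICAN.items()}


def normalise_spelling(text, mapping):
    """Rewrite every word found in `mapping` to its target spelling."""

    def replace(match):
        word = match.group(0)
        target = mapping.get(word.lower())
        if target is None:
            return word  # word is not in the lexicon: leave it untouched
        # Preserve only the capitalisation of the first letter.
        return target.capitalize() if word[0].isupper() else target

    # Match alphabetic words; punctuation and whitespace are left as-is.
    return re.sub(r"[A-Za-z]+", replace, text)


if __name__ == "__main__":
    doc = "Colour is measured from the centre."
    print(normalise_spelling(doc, BRITISH_TO_AMERICAN))
```

Normalising a document towards or away from the query's convention then amounts to choosing which of the two mappings to apply before indexing.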
Performing automatic reformulations of a user's query is a popular paradigm used in information retrieval (IR) for improving effectiveness -- as exemplified by the pseudo-relevance feedback approaches, which expand the query in order to alleviate the vocabulary mismatch problem. Recent advancements in generative language models have demonstrated their ability to generate responses that are relevant to a given prompt. In light of this success, we seek to study the capacity of such models to perform query reformulation and how they compare with long-standing query reformulation methods that use pseudo-relevance feedback. In particular, we investigate two representative query reformulation frameworks, GenQR and GenPRF. GenQR directly reformulates the user's input query, while GenPRF provides additional context for the query by making use of pseudo-relevance feedback information. For each reformulation method, we leverage different techniques, including fine-tuning and direct prompting, to harness the knowledge of language models. The reformulated queries produced by the generative models are demonstrated to markedly benefit the effectiveness of a state-of-the-art retrieval pipeline on four TREC test collections (varying from TREC 2004 Robust to the TREC 2019 Deep Learning). Furthermore, our results indicate that the generative models we study can outperform various statistical query expansion approaches while remaining comparable to other existing complex neural query reformulation models, with the added benefit of being simpler to implement.
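For readers unfamiliar with pseudo-relevance feedback, the sketch below illustrates the general idea with a simple frequency-based expansion. It is a stand-in for the statistical baselines mentioned above, not an implementation of GenQR, GenPRF, or any specific published method; the function name `expand_query` and its parameters are hypothetical.

```python
from collections import Counter


def expand_query(query, feedback_docs, num_terms=2, stopwords=frozenset()):
    """Append the most frequent new terms from top-ranked documents to the query.

    `feedback_docs` is assumed to hold the text of the top-k documents
    returned by a first-pass retrieval for `query`.
    """
    query_terms = query.lower().split()
    counts = Counter()
    for doc in feedback_docs:
        for term in doc.lower().split():
            # Only count terms that would add new vocabulary to the query.
            if term not in query_terms and term not in stopwords:
                counts[term] += 1
    # Sort by descending frequency, breaking ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    expansion = [term for term, _ in ranked[:num_terms]]
    return query_terms + expansion


if __name__ == "__main__":
    docs = ["jaguar cat speed cat", "cat habitat jaguar", "jaguar car"]
    print(expand_query("jaguar speed", docs))
```

The expanded query can then be issued to the retrieval system again, which is how such methods try to reduce vocabulary mismatch between query and documents.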
We propose a new uniform framework for text classification and ranking that can automate the process of identifying check-worthy sentences in political debates and speech transcripts. Our framework combines the semantic analysis of the sentences with additional entity embeddings obtained through the identified entities within the sentences. In particular, we analyse the semantic meaning of each sentence using state-of-the-art neural language models such as BERT, ALBERT, and RoBERTa, while embeddings for entities are obtained from knowledge graph (KG) embedding models. Specifically, we instantiate our framework using five different language models, entity embeddings obtained from six different KG embedding models, as well as two combination methods, leading to several Entity-Assisted neural language models. We extensively evaluate the effectiveness of our framework using two publicly available datasets from the CLEF 2019 & 2020 CheckThat! Labs. Our results show that the neural language models significantly outperform traditional TF.IDF and LSTM methods. In addition, we show that the ALBERT model is consistently the most effective model among all the tested neural language models. Our entity embeddings significantly outperform other existing approaches from the literature that are based on similarity and relatedness scores between the entities in a sentence, when used alongside a KG embedding.
Social networks (SNs) are increasingly important sources of news for many people. The online connections made by users allow information to spread more easily than in traditional news media (e.g., newspaper, television). However, they also make the spread of fake news easier than in traditional media, especially through the users' social network connections. In this paper, we focus on investigating whether the connection structure of SN users can aid fake news detection on Twitter. In particular, we propose to embed users based on their follower or friendship networks on the Twitter platform, so as to identify the groups that users form. Indeed, by applying unsupervised graph embedding methods on the graphs from the Twitter users' social network connections, we observe that users engaged with fake news are more tightly clustered together than users only engaged in factual news. Thus, we hypothesise that the embedded users' network can help detect fake news effectively. Through extensive experiments using a publicly available Twitter dataset, our results show that applying graph embedding methods on SNs, using the user connections as network information, can indeed classify fake news more effectively than most language-based approaches. Specifically, we observe a significant improvement over using only the textual information (i.e., TF.IDF or a BERT language model), as well as over models that deploy both advanced textual features (i.e., stance detection) and complex network features (e.g., users' network, publishers' cross-citations). We conclude that the Twitter users' friendship and followers network information can significantly outperform language-based approaches, as well as the existing state-of-the-art fake news detection models that use a more sophisticated network structure, in classifying fake news on Twitter.
Despite its troubled past, the AOL Query Log continues to be an important resource to the research community -- particularly for tasks like search personalisation. When using the query log for these ranking experiments, little attention is usually paid to the document corpus. Recent work typically uses a corpus containing versions of the documents collected long after the log was produced. Given that web documents are prone to change over time, we study the differences present between a version of the corpus containing documents as they appeared in 2017 (which has been used by several recent works) and a new version we construct that includes documents close to how they appeared at the time the query log was produced (2006). We demonstrate that this new version of the corpus has a far higher coverage of documents present in the original log (93%) than the 2017 version (55%). Among the overlapping documents, the content often differs substantially. Given these differences, we re-conduct session search experiments that originally used the 2017 corpus and find that when using our corpus for training or evaluation, system performance improves. We place the results in context by introducing recent adhoc ranking baselines. We also confirm the navigational nature of the queries in the AOL corpus by showing that including the URL substantially improves performance across a variety of models. Our version of the corpus can be easily reconstructed by other researchers and is included in the ir-datasets package.
We present ir-measures, a new tool that makes it convenient to calculate a diverse set of evaluation measures used in information retrieval. Rather than implementing its own measure calculations, ir-measures provides a common interface to a handful of evaluation tools. The necessary tools are automatically invoked (potentially multiple times) to calculate all the desired metrics, simplifying the evaluation process for the user. The tool also makes it easier for researchers to use recently proposed measures (such as those from the C/W/L framework) alongside traditional measures, potentially encouraging their adoption.
The advent of contextualised language models has brought gains in search effectiveness, not just when applied for re-ranking the output of classical weighting models such as BM25, but also when used directly for passage indexing and retrieval, a technique called dense retrieval. In the existing literature in neural ranking, two dense retrieval families have become apparent: single representation, where entire passages are represented by a single embedding (usually BERT's [CLS] token, as exemplified by the recent ANCE approach), or multiple representations, where each token in a passage is represented by its own embedding (as exemplified by the recent ColBERT approach). These two families have not been directly compared. However, because of the likely importance of dense retrieval moving forward, a clear understanding of their advantages and disadvantages is paramount. To this end, this paper contributes a direct study on their comparative effectiveness, noting situations where each method under- or over-performs with respect to the other, and with respect to a BM25 baseline. We observe that, while ANCE is more efficient than ColBERT in terms of response time and memory usage, multiple representations are statistically more effective than the single representations for MAP and MRR@10. We also show that multiple representations obtain better improvements than single representations for queries that are the hardest for BM25, as well as for definitional queries, and those with complex information needs.
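The difference between the two families can be sketched with toy vectors. The code below is an illustrative sketch only: the function names are hypothetical, the embeddings are hand-written rather than produced by BERT, and the multiple-representation score uses the sum-of-maximum-similarities (MaxSim) late interaction commonly associated with ColBERT, which is assumed here rather than stated in the abstract.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def single_representation_score(query_vec, passage_vec):
    """One embedding per query and per passage: score is a single dot product."""
    return dot(query_vec, passage_vec)


def multi_representation_score(query_token_vecs, passage_token_vecs):
    """One embedding per token: for each query token, take its best-matching
    passage token, then sum those maxima over the query tokens."""
    return sum(
        max(dot(q, d) for d in passage_token_vecs)
        for q in query_token_vecs
    )


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, chosen by hand for illustration.
    query_tokens = [[1.0, 0.0], [0.0, 1.0]]
    passage_tokens = [[1.0, 0.0], [0.5, 0.5]]
    print(multi_representation_score(query_tokens, passage_tokens))
    print(single_representation_score([1.0, 1.0], [0.5, 0.5]))
```

The sketch also hints at the efficiency trade-off noted above: the single-representation score needs one vector per passage, whereas the multiple-representation score stores and compares one vector per passage token.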