Abstract: We introduce GISTBench, a benchmark for evaluating the ability of Large Language Models (LLMs) to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed from real user interactions on a global short-form video platform. The dataset contains both implicit and explicit engagement signals as well as rich textual descriptions. We validate the dataset's fidelity against user surveys and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
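To make the precision/recall decomposition of Interest Groundedness concrete, here is a minimal sketch. The abstract does not give the exact formulas, so this assumes the standard set-based definitions: precision is the share of predicted interest categories that are supported by a reference set, and recall is the share of reference categories that the prediction covers. The function name `ig_precision_recall` and the example categories are hypothetical.

```python
def ig_precision_recall(predicted, reference):
    """Set-based precision and recall over interest categories.

    Assumption: `reference` holds the interests supported by the user's
    engagement history; the benchmark's actual definition may differ.
    """
    predicted, reference = set(predicted), set(reference)
    supported = predicted & reference
    # Precision drops when the model predicts interests with no support.
    precision = len(supported) / len(predicted) if predicted else 0.0
    # Recall drops when supported interests are missing from the prediction.
    recall = len(supported) / len(reference) if reference else 0.0
    return precision, recall


precision, recall = ig_precision_recall(
    predicted=["cooking", "travel", "chess"],
    reference=["cooking", "travel", "gardening", "music"],
)
```

In this example one of three predicted categories is unsupported, which lowers precision, and two of four reference categories are missed, which lowers recall.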
Abstract: This paper explores mobile crowdsensing, which leverages mobile devices and their users for collective sensing tasks under the coordination of a central requester. The primary challenge is the variability in the sensing capabilities of individual workers, which are initially unknown and must be learned progressively. In each round of task assignment, the requester selects a group of workers to handle specific tasks. This process inherently leads to task overlaps within a round and repetitions across rounds. We propose a novel model that enhances task diversity over the rounds by dynamically adjusting the weight of each task in each round based on how frequently it has been assigned. The model also accommodates the variability in task completion quality caused by overlaps within a round, which can range from the maximum quality of any individual worker to the sum of the qualities of all workers assigned to the overlapping task. A significant constraint is the requester's budget, which demands an efficient strategy for worker recruitment. Our approach maximizes the overall weighted quality of tasks completed in each round, using a combinatorial multi-armed bandit framework with an upper confidence bound approach. The paper further presents a regret analysis and simulations using realistic data to demonstrate the efficacy of our model.
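The ingredients named in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the UCB bonus form, the `1 / (1 + count)` diversity weight, the linear interpolation between maximum and sum, and the greedy index-per-cost selection are all hypothetical choices, as are the function and parameter names.

```python
import math


def ucb_index(mean_quality, times_selected, round_number, exploration=1.5):
    """Optimistic quality estimate; untried workers get an infinite index."""
    if times_selected == 0:
        return math.inf
    bonus = math.sqrt(exploration * math.log(round_number) / times_selected)
    return mean_quality + bonus


def task_weight(times_assigned):
    """Assumed diversity weight: shrinks as a task is assigned more often."""
    return 1.0 / (1.0 + times_assigned)


def overlap_quality(qualities, mix):
    """Assumed interpolation: mix=0 gives the max, mix=1 gives the sum."""
    return (1.0 - mix) * max(qualities) + mix * sum(qualities)


def select_workers(indices, costs, budget):
    """Greedy budgeted selection by index-to-cost ratio (a heuristic)."""
    order = sorted(indices, key=lambda w: indices[w] / costs[w], reverse=True)
    chosen, spent = [], 0.0
    for worker in order:
        if spent + costs[worker] <= budget:
            chosen.append(worker)
            spent += costs[worker]
    return chosen


# Hypothetical round: three workers, a budget of 3 cost units.
chosen = select_workers(
    indices={"a": 0.9, "b": 0.5, "c": 0.8},
    costs={"a": 3.0, "b": 1.0, "c": 2.0},
    budget=3.0,
)
```

The greedy ratio rule is used here only because it is simple to read; a budgeted combinatorial bandit would typically plug the UCB indices into whatever approximation oracle its regret analysis assumes.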