Abstract: Recently, multi-Large Language Model (LLM) frameworks have been proposed to solve contextualized tasks. However, these frameworks do not explicitly emulate the division of roles in human teams, which may lead to a single perspective and thereby weaken performance on multi-step contextualized tasks. To address this issue, we propose TeamLLM, a human-like Team-Oriented Multi-LLM Collaboration Framework. TeamLLM adopts four team roles with a distinct division of responsibilities and employs a three-phase multi-LLM collaboration for multi-step contextualized tasks. To evaluate the effectiveness of TeamLLM on such tasks, we propose Contextually-Grounded and Procedurally-Structured tasks (CGPST) and construct the CGPST benchmark. This benchmark has four core features: contextual grounding, procedural structure, process-oriented evaluation, and multi-dimensional assessment. We evaluate ten popular LLMs on CGPST at the overall, step, and dimension levels. Results show that TeamLLM substantially improves performance on CGPST. We release the benchmark with scenarios, full-process responses, and human scores from ten LLMs. The code and data are available at https://anonymous.4open.science/r/TeamLLM-anonymous-C50E/.
Abstract: Cognitive diagnosis is an essential research topic in intelligent education that aims to assess students' level of mastery of different skills. Many existing works use deep learning models to explore the complex interactions between students, questions, and skills. However, the performance of existing methods is frequently limited by the long-tailed distribution of the data and by its dynamic changes. To address these challenges, we propose a meta-learning framework for cognitive diagnosis based on continual learning (MetaCD). This framework alleviates the long-tailed problem by using meta-learning to learn an optimal initialization state, enabling the model to achieve good accuracy on new tasks with only a small amount of data. In addition, we use a continual learning method, a parameter protection mechanism, so that MetaCD can adapt to new skills or new tasks as the data change. MetaCD not only improves the plasticity of the model on a single task, but also ensures its stability and generalization on sequential tasks. Comprehensive experiments on five real-world datasets show that MetaCD outperforms other baselines in both accuracy and generalization.
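The abstract does not say how the parameter protection mechanism is implemented. As a rough illustration only, one common way to protect parameters in continual learning is a quadratic penalty that discourages important parameters from drifting away from the values learned on earlier tasks. The sketch below assumes that form; the function name `protected_loss` and the arguments `anchor_params`, `importance`, and `strength` are hypothetical and not taken from MetaCD.

```python
from typing import Sequence


def protected_loss(
    task_loss: float,
    params: Sequence[float],
    anchor_params: Sequence[float],
    importance: Sequence[float],
    strength: float = 1.0,
) -> float:
    """Task loss plus a quadratic drift penalty (assumed form, not MetaCD's).

    params:        current parameter values while training on the new task
    anchor_params: parameter values saved after the previous task
    importance:    per-parameter weights; larger means "protect more"
    strength:      global trade-off between plasticity and stability
    """
    # Each parameter is penalized in proportion to its importance and to
    # the squared distance from its value on the earlier task.
    penalty = sum(
        w * (p - a) ** 2
        for p, a, w in zip(params, anchor_params, importance)
    )
    return task_loss + 0.5 * strength * penalty
```

With `strength=0` the penalty vanishes and the model is free to adapt to the new task; increasing it trades that plasticity for stability on earlier tasks.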
Abstract: Multimodal Large Language Models (MLLMs) show remarkable progress across many visual-language tasks; however, their capacity to evaluate artistic expression remains limited. Aesthetic concepts are inherently abstract and open-ended, and multimodal artwork annotations are scarce. We introduce KidsArtBench, a new benchmark of over 1k children's artworks (ages 5-15) annotated by 12 expert educators across 9 rubric-aligned dimensions, together with expert comments for feedback. Unlike prior aesthetic datasets that provide single scalar scores on adult imagery, KidsArtBench targets children's artwork and pairs multi-dimensional annotations with comment supervision to enable both ordinal assessment and formative feedback. Building on this resource, we propose an attribute-specific multi-LoRA approach, where each attribute corresponds to a distinct evaluation dimension (e.g., Realism, Imagination) in the scoring rubric, with Regression-Aware Fine-Tuning (RAFT) to align predictions with ordinal scales. On Qwen2.5-VL-7B, our method increases correlation from 0.468 to 0.653, with the largest gains on perceptual dimensions and narrowed gaps on higher-order attributes. These results show that educator-aligned supervision and attribute-aware training yield pedagogically meaningful evaluations and establish a rigorous testbed for sustained progress in educational AI. We release data and code with ethics documentation.
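The abstract does not detail how RAFT aligns predictions with ordinal scales. One way to make a language model's output regression-aware, assumed here purely for illustration, is to read the predicted score as the probability-weighted average over the candidate score tokens and to train with squared error against the annotated score. The function names and the 1-to-3 scale in the usage comment are hypothetical, not taken from KidsArtBench.

```python
import math
from typing import Sequence


def expected_score(logits: Sequence[float], scores: Sequence[float]) -> float:
    """Probability-weighted mean score under a softmax over score tokens.

    logits: model logits for each candidate score token (assumed input)
    scores: the numeric value each token stands for, e.g. [1, 2, 3]
    """
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return sum(s * e / total for s, e in zip(scores, exps))


def squared_error(
    logits: Sequence[float], scores: Sequence[float], target: float
) -> float:
    """Squared error between the expected score and the annotated score."""
    return (expected_score(logits, scores) - target) ** 2


# Usage: uniform logits over a 1-3 scale give an expected score of 2.
# expected_score([0.0, 0.0, 0.0], [1, 2, 3])
```

Unlike plain cross-entropy over score tokens, this loss penalizes a prediction of 1 more than a prediction of 2 when the annotated score is 3, which is the ordinal behaviour the abstract asks for.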