Abstract: Generative Artificial Intelligence (GenAI) tools such as ChatGPT are emerging as revolutionary tools in education, bringing both benefits and challenges for educators and students and reshaping how learning and teaching are approached. This study aims to identify and evaluate the key competencies students need to engage effectively with GenAI in education and to provide strategies for lecturers to integrate GenAI into teaching practices. The study applied a mixed-methods approach, combining a literature review with a quantitative survey of 130 students from South Asia and Europe. The literature review identified 14 essential student skills for GenAI engagement, with AI literacy, critical thinking, and ethical AI practices emerging as the most critical. The student survey revealed gaps in prompt engineering, bias awareness, and AI output management. In our study of lecturer strategies, we identified six key areas, with GenAI Integration and Curriculum Design being the most emphasised. Our findings highlight the importance of incorporating GenAI into education. While the literature prioritised ethics and policy development, students favoured hands-on, project-based learning and practical AI applications. To foster inclusive and responsible GenAI adoption, institutions should ensure equitable access to GenAI tools, establish clear academic integrity policies, and advocate for global GenAI research initiatives.
Abstract: We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure model performance on this task. The benchmark is created by a novel pipeline consisting of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts' judgment. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluate a diverse set of language models and retrieval systems on the benchmark and find that model performance still falls significantly short of expert level on this task. To show the scalability of our proposed pipeline, we create a larger dataset, EvidenceBench-100k, containing 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at https://github.com/EvidenceBench/EvidenceBench