Filighera, Anna (2023)
Automatic Short Answer Grading Using Neural Models. Examining Adversarial Robustness and Elaborated Feedback Generation.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00024394
Ph.D. Thesis, Primary publication, Publisher's Version
Text: 2023_08_10_Filighera_Anna.pdf
Copyright Information: CC BY-SA 4.0 International - Creative Commons, Attribution ShareAlike.
Item Type: | Ph.D. Thesis | ||||
---|---|---|---|---|---|
Type of entry: | Primary publication | ||||
Title: | Automatic Short Answer Grading Using Neural Models. Examining Adversarial Robustness and Elaborated Feedback Generation | ||||
Language: | English | ||||
Referees: | Steinmetz, Prof. Dr. Ralf ; Schroeder, Prof. Dr. Ulrik | ||||
Date: | 30 August 2023 | ||||
Place of Publication: | Darmstadt | ||||
Collation: | viii, 155 pages | ||||
Date of oral examination: | 13 July 2023 | ||||
DOI: | 10.26083/tuprints-00024394 | ||||
Abstract: | High-quality feedback is essential for learners. It reveals misconceptions, knowledge gaps and improvement opportunities. Asking short-answer questions and giving elaborated feedback on the learners' responses is highly effective in increasing not only their understanding of the material but also their ability to transfer the knowledge to new contexts. However, providing even basic feedback, such as verifying correctness, is time-consuming. For this reason, neural feedback systems have risen in popularity in recent years. While such systems have matured to achieve high grading accuracy on some datasets, their decision process is opaque and their behavior when confronted with out-of-training-distribution data remains underexplored. Thus, the first research question posed in this thesis concerns current state-of-the-art grading models' robustness to adversarial examples - answers crafted to fool the grading model. The second research question explores how grading systems can be expanded to provide elaborated feedback explaining learners' mistakes instead of merely verifying correctness. In total, we make four contributions to these research questions. First, we investigate grading models' robustness to adversarial examples crafted by students as well as an existing automatic attack. We show that current models are generally vulnerable to adversarial attacks and provide evidence that their predictions are at least partially based on spurious correlations. However, we also find that existing adversarial attacks are difficult to employ in typical summative assessment scenarios. Therefore, we propose an adversarial attack tailored to summative assessments as our second contribution. We demonstrate the attack's effectiveness on multiple models and domains and empirically evaluate manipulated responses with human experts. Our third contribution consists of the bilingual Short Answer Feedback dataset. In contrast to existing datasets, it contains elaborated feedback in addition to verification feedback. We annotated learner responses from three domains spanning college-level and life-long learning. We demonstrate that this novel task challenges current state-of-the-art models. We provide an evaluation framework and benchmark models to lay the groundwork for research in this field. Though the feedback generated by the benchmark models is imperfect, we observed positive effects on learning outcomes compared to no feedback and even human feedback conditions in a college course field study. Finally, we propose an unsupervised elaborated feedback generation method for domains where costly data annotation is infeasible as our fourth contribution. It aims to find small counterfactual changes to students' responses that would have led the grading model to classify them as correct instead. These changes can be considered concrete improvement suggestions in the student's own words. We compare four counterfactual generation approaches and find further evidence for the grading models' unreliability but also genuine improvements, indicating that such feedback may be feasible in the future. Overall, this thesis provides insight into the robustness of neural Automatic Short Answer Grading systems to various forms of input manipulation. We also present evidence for the usefulness of even imperfect elaborated feedback models while providing the tools for further research on improved approaches. The garnered understanding can be helpful to practitioners seeking to employ grading systems more securely, understandably and safely. | ||||
Status: | Publisher's Version | ||||
URN: | urn:nbn:de:tuda-tuprints-243945 | ||||
Classification DDC: | 000 Generalities, computers, information > 004 Computer science | ||||
Divisions: | 18 Department of Electrical Engineering and Information Technology > Institute of Computer Engineering > Multimedia Communications | ||||
Date Deposited: | 30 Aug 2023 14:11 | ||||
Last Modified: | 18 Oct 2023 13:45 | ||||
URI: | https://tuprints.ulb.tu-darmstadt.de/id/eprint/24394 | ||||
PPN: | 511417683 | ||||