There is a growing need for, and interest in, using automated feedback as part of writing training. This raises several questions that have not yet been adequately answered:

  • Do automated feedback systems identify the same issues as live instructors, and with similar frequency?
  • When instructors have access to automated feedback, do they trust it? Do they comment on different issues, focus on higher-order problems, or continue grading as if they had no such insights?
  • Do students respond to automated feedback in the same way as feedback from live instructors?

 

Origins of Automated Feedback

Interest in automated feedback on writing emerged in the mid-1960s. Systems have since progressed from basic spelling checkers to syntax-aware spelling and grammar evaluation. More recently, automated essay scoring (AES) systems have emerged that can provide feedback on open-response writing. AES systems use natural language processing and machine learning algorithms to produce a score based on syntactic and lexical patterns and features such as sentence length, word frequency, and part of speech (Shermis et al. 2010; Balfour 2013).
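
To make the feature-based approach concrete, the Python sketch below extracts a few surface features of the kind AES systems draw on and fits a simple regression model to human-assigned scores. The features, essays, and scores are illustrative only and are not drawn from any of the commercial systems named below; production AES systems use far richer features and much larger training sets.

    # A minimal sketch of feature-based essay scoring, as described above.
    # Features, essays, and scores are illustrative, not any vendor's pipeline.
    import re
    from sklearn.linear_model import Ridge

    def surface_features(text):
        """Compute a few surface features of the kind AES systems rely on."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text.lower())
        n_words = max(len(words), 1)
        return [
            len(words) / max(len(sentences), 1),    # mean sentence length
            sum(len(w) for w in words) / n_words,   # mean word length
            len(set(words)) / n_words,              # type-token ratio (vocabulary variety)
        ]

    # Hypothetical training data: essays paired with human-assigned scores.
    essays = [
        "Short essay. Few words.",
        "A longer essay that develops its argument across several varied "
        "sentences, drawing on a wider vocabulary and more complex structure.",
    ]
    human_scores = [2.0, 4.0]

    # Fit a regularized linear model, then score a new, unseen essay.
    model = Ridge().fit([surface_features(e) for e in essays], human_scores)
    print(model.predict([surface_features("A new essay to be scored automatically.")]))
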

Demand for AES systems has spurred development of several commercial products, including Criterion®, WriteToLearn™, Intelligent Essay Assessor™, IntelliMetric®, WriteLab, and Turnitin Feedback Studio.

AES systems have the advantage of being scalable, which makes them attractive for large face-to-face courses and for asynchronous courses such as massive open online courses (Balfour 2013). Still, they are not without drawbacks. First, unsupervised or minimally supervised AES systems may not score student work consistently. Perelman (2013) highlighted their poor accuracy, using obviously faked texts to demonstrate how easily current AES applications are fooled. Second, it is unclear to what extent students pay attention to automated feedback. One small study (Heffernan and Otoshi 2015) compared the relative impacts of teacher feedback and technology-driven feedback from ETS' Criterion® tool on the writing of 12 Japanese English-as-a-foreign-language students. Automated feedback alone was less effective than feedback from the teacher and Criterion® combined. However, the authors compared scores only on three independent, consecutive writing assignments; they did not compare the effects of feedback on draft versus final versions of the same assignment, nor did they compare automated feedback alone to teacher comments alone.

The degree of authority students attach to automated and human feedback is another confounding variable. Anecdotally, most instructors can cite many examples of students ignoring the basic spelling and grammar feedback that is standard in word processors like MS Word. Ruegg (2015) found that students did not value feedback from peer reviewers as much as teachers' feedback, even though the teachers' comments were less explicit than the peer reviewers'. Ruegg argued that this is likely because teachers are the ones who ultimately grade the students' writing.

Despite their limitations, demand remains high for AES systems, and there are ongoing efforts to integrate them into more traditional writing review models (Balfour 2013). Constructed-response assessments reveal more about student thinking and the persistence of misconceptions than do multiple-choice questions. However, they require more analysis on the part of the educator, and so are less likely to be used in large-enrollment courses. Urban-Lurain and colleagues developed automated analysis of constructed response (AACR) as a way to give students constructed-response test questions and other work that can be scored automatically (Kaplan et al. 2014; Weston et al. 2015; Ha, Baldwin, and Nehm 2015; Urban-Lurain et al. 2015; Ha and Nehm 2016). AACR uses a robust, scalable platform for scoring open responses to pre-written prompts that probe important disciplinary concepts from biology and biological chemistry. The prompts are administered via online course management systems, where students enter their responses. Lexical and statistical analyses then predict what expert ratings of those responses would be. Work to date shows that automated analyses can predict expert ratings of students' work on these topics accurately, with higher inter-rater reliability than a group of trained human graders. The AACR model is growing rapidly to include additional STEM disciplines and has a national presence (http://create4stem.msu.edu/project/aacr).
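
As a rough illustration of the general approach described above (not AACR's actual prompts, rubrics, or models), the Python sketch below builds lexical features from a handful of invented student responses, trains a simple classifier to reproduce hypothetical expert ratings, and measures agreement with those ratings using Cohen's kappa, the kind of inter-rater statistic used to compare automated and human scoring.

    # A rough sketch of lexical-analysis-based scoring of constructed responses.
    # The responses and expert ratings below are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import cohen_kappa_score

    responses = [
        "Natural selection acts on heritable variation already present in the population.",
        "Organisms evolve because they want to become better adapted over time.",
        "The mutation alters the protein, so the enzyme can no longer bind its substrate.",
        "The cell decides it needs more energy, so it chooses to make more ATP.",
    ]
    expert_ratings = ["scientific", "misconception", "scientific", "misconception"]

    # Lexical features (word counts) feed a statistical classifier trained to
    # reproduce the expert ratings.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(responses)
    model = MultinomialNB().fit(X, expert_ratings)

    # Agreement between predicted and expert ratings; in practice this would be
    # computed on held-out responses and compared with human-human reliability.
    predicted = model.predict(X)
    print(cohen_kappa_score(expert_ratings, predicted))
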

The success of AACR suggests that the general lexical analysis principles on which this project is based can be applied to the analysis of student writing. However, more research is needed before we can say with any confidence that automated feedback benefits students or their instructors.

Our NSF-sponsored proposal explores whether automated feedback affects how graduate teaching assistants (TAs) grade writing-related activities. Using the text classifier being built for this FLC project, we will determine whether TAs provide students with more feedback focused on higher-order thinking and less on low-level copy editing.

 


Works Cited

Balfour, S. P. 2013. “Assessing Writing in MOOCs: Automated Essay Scoring and Calibrated Peer Review™.” Research & Practice in Assessment 8.

Ha, M., B. C. Baldwin, and R. H. Nehm. 2015. “The Long-Term Impacts of Short-Term Professional Development: Science Teachers and Evolution.” Evolution: Education and Outreach 8 (1). doi:10.1186/s12052-015-0040-9.

Ha, M., and R. H. Nehm. 2016. “Predicting the Accuracy of Computer Scoring of Text: Probabilistic, Multi-Model, and Semantic Similarity Approaches.” Baltimore, MD: NARST.

Heffernan, N., and J. Otoshi. 2015. “Comparing the Pedagogical Benefits of Both Criterion and Teacher Feedback on Japanese EFL Students’ Writing.” JALT CALL Journal 11 (1).

Kaplan, J. J., K. C. Haudek, M. Ha, N. Rogness, and D. G. Fisher. 2014. “Using Lexical Analysis Software to Assess Student Writing in Statistics.” Technology Innovations in Statistics Education 8.

Perelman, L. 2013. “Critique (Ver. 3.4) of Mark D. Shermis and Ben Hamner, ‘Contrasting State-of-the-Art Automated Scoring of Essays: Analysis.’”

Ruegg, R. 2015. “Differences in the Uptake of Peer and Teacher Feedback.” RELC Journal 46 (2): 131–45.

Shermis, M. D., J. Burstein, D. Higgins, and K. Zechner. 2010. “Automated Essay Scoring: Writing Assessment and Instruction.” In International Encyclopedia of Education, 4: 20–26.

Urban-Lurain, M., M. M. Cooper, K. C. Haudek, J. J. Kaplan, J. K. Knight, and P. P. Lemons. 2015. “Expanding a National Network for Automated Analysis of Constructed Response Assessments to Reveal Student Thinking in STEM.” Computers in Education Journal 6: 65–81.

Weston, M., K. C. Haudek, L. Prevost, M. Urban-Lurain, and J. Merrill. 2015. “Examining the Impact of Question Surface Features on Students’ Answers to Constructed-Response Questions on Photosynthesis.” CBE-Life Sciences Education 14. doi:10.1187/cbe.14-07-0110.


Copyright © 2019 A. Daniel Johnson. All rights reserved.