Evaluating Computer Automated Scoring: Issues, Methods, and an Empirical Illustration
With the continual progress of computer technologies, computer automated scoring (CAS) has become a popular tool for evaluating writing assessments. Research on applying these methodologies to new types of performance assessments is still emerging. While research has generally shown high agreement between CAS-generated scores and those produced by human raters, concerns and questions have been raised about the appropriate analyses and about the validity of decisions and interpretations based on those scores. In this paper we expand the emerging discussion of validation strategies for CAS by illustrating several analyses that can be accomplished with available data. These analyses compare the degree to which two CAS systems accurately score data from a structured interview, using the original scores provided by human raters as the criterion. Results suggest key differences between the two systems as well as differences among the statistical procedures used to evaluate them. The use of several statistical and qualitative analyses is recommended for evaluating contemporary CAS systems.
Automated Scoring, Computerized Testing, Structured Interviews, Validity
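The comparison described in the abstract, CAS scores evaluated against human-rater scores as the criterion, can be illustrated with two common agreement statistics. The abstract does not say which statistics the paper reports, so the choice of exact agreement and Cohen's kappa here is an assumption, and the function names and score lists are hypothetical. This is a minimal sketch, not the authors' procedure.

```python
from collections import Counter
import math


def exact_agreement(human, machine):
    """Proportion of cases where the machine score equals the human score."""
    if len(human) != len(machine) or len(human) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(h == m for h, m in zip(human, machine))
    return matches / len(human)


def cohens_kappa(human, machine):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_observed = exact_agreement(human, machine)
    human_counts = Counter(human)
    machine_counts = Counter(machine)
    # Chance agreement: sum over categories of the product of marginal proportions.
    p_expected = sum(
        human_counts[c] * machine_counts[c] for c in human_counts
    ) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        return math.nan
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical dichotomous scores (1 = credit, 0 = no credit).
    human_scores = [1, 1, 0, 0]
    cas_scores = [1, 0, 0, 0]
    print("exact agreement:", exact_agreement(human_scores, cas_scores))
    print("Cohen's kappa:", cohens_kappa(human_scores, cas_scores))
```

Reporting both values side by side shows how a raw agreement rate can look stronger than the chance-corrected figure, which is one reason to use several statistics rather than a single index when evaluating a scoring system.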