The Effects of Rater-Scale Interaction on EFL Essay Rating Outcomes and Processes
Abstract
Direct tests of writing, which require examinees to produce one or more writing samples, are currently the most widely used method of assessing writing (Weigle 2002). Because the scoring of such tests involves multiple factors and their interactions, including raters, rating scales, examinees, essays, writing tasks, and rater training (Milanovic & Saville 1996: 7; Weigle 2002: 60; Barkaoui 2008: 8), both the rating process and its outcomes often vary considerably. Among these factors, the interaction between raters and rating scales affects scoring most directly. As the centerpiece of the rating process (Lumley 2002: 267), raters interact with the rating scale, the operational definition of the test construct, and in doing so directly determine the de facto construct validity of a writing test and strongly influence its reliability. The rater-scale interaction therefore lies at the heart of the reliability and validity of writing assessment. However, existing studies at home and abroad have not reached a consensus on how this interaction affects essay rating processes and outcomes, and the few available studies leave room for improvement in both method and design. This study therefore combines qualitative and quantitative methods to clarify further how the interaction between raters and holistic versus analytic rating scales affects rating processes and outcomes.
     In keeping with Chinese testing practice, the study focuses on the writing section of the College English Test Band 6 (CET6), and the experimental materials were drawn from essays written in an operational CET6 administration. Nine raters with prior CET6 essay-rating experience scored the same batch of 60 operational CET6 essays twice, once with the CET6 holistic rating scale and once with an analytic rating scale designed specifically for this study. To obtain empirical evidence on the rating process, all raters produced think-aloud protocols while rating 10 of these essays. In addition, to probe their understanding, use, and evaluation of the rating scales further, all raters completed questionnaires and semi-structured interviews on the two scales after the think-aloud sessions.
     Because the study found that producing think-aloud protocols affected rating outcomes to some extent, the quantitative analysis of essay scores is based on the two rounds of independent ratings (50 essays). To capture the effect of rater-scale interaction on rating outcomes at both the group and the individual level, essay scores were analyzed with both generalizability theory (G-theory) and many-facet Rasch measurement (MFRM). To describe the think-aloud protocols comprehensively and in detail, a dedicated coding scheme was constructed on the basis of the rating scales used and the specific research questions; it classifies both rating strategies and the aspects of the texts that raters attend to. On this basis, the main coding categories of the two rounds of think-aloud protocols were compared quantitatively. Finally, to gain a deeper understanding of raters' rating strategies, textual foci, and rating difficulties under the two scales, interpretative qualitative analyses were carried out on the think-aloud protocols and on raters' responses to the relevant questionnaire items and semi-structured interview questions.
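     For reference, many-facet Rasch measurement of a design like this one (examinees, raters, and, for the analytic scale, scale dimensions) is commonly written as follows; this is the textbook formulation (e.g., McNamara 1996), not necessarily the exact model statement used in the dissertation:

        \log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \alpha_j - \delta_i - \tau_k

where \theta_n is the ability of examinee n, \alpha_j the severity of rater j, \delta_i the difficulty of scale dimension i (dropped for the holistic scale), \tau_k the threshold of category k relative to category k-1, and P_{nijk} the probability of examinee n receiving category k from rater j on dimension i.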
     The quantitative analysis of essay scores shows that raters' interaction with the two scales did indeed affect rating outcomes differently:
     First, the decision (D-) study in the G-theory analysis shows that with a single rater, the generalizability coefficients of the scores obtained under both scales fell short of 0.7; the coefficient for the composite of the analytic sub-scores (0.695), however, was higher than that for the holistic scores (0.606).
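     As a rough illustration of the D-study logic behind these figures (a minimal sketch that assumes a simple person-by-rater random design with a single relative-error term, a simplification of the study's actual design), a single-rater generalizability coefficient can be projected to panels of several raters as follows:

        def project_g(rho_1, n_raters):
            # Spearman-Brown-type projection of a single-rater generalizability
            # coefficient to a panel of n_raters (simple p x r random design assumed).
            return n_raters * rho_1 / (1 + (n_raters - 1) * rho_1)

        # Hypothetical projections using the single-rater coefficients reported above:
        for label, rho_1 in [("holistic", 0.606), ("analytic composite", 0.695)]:
            print(label, [round(project_g(rho_1, n), 3) for n in (1, 2, 3)])

Under this simplification, both coefficients would already exceed 0.7 with two raters, which is one reason operational double-rating is commonly recommended.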
     Second, a comparison of the examinee separation indices and reliabilities shows that, compared with the holistic scale, the analytic scale allowed raters to make finer distinctions among examinees' English writing ability; at the same time, fewer examinees misfit under the analytic scale. These findings suggest that an analytic scale may be better suited to measuring L2 writing ability.
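     The separation statistics referred to here are the standard indices reported by FACETS; in their conventional form (the definitions below follow Wright & Masters 1982 and are not values specific to this study):

        G = \frac{SD_{true}}{RMSE}, \qquad SD_{true}^2 = SD_{obs}^2 - MSE, \qquad R = \frac{G^2}{1 + G^2} = \frac{SD_{true}^2}{SD_{obs}^2}

where SD_{obs} is the spread of the examinee ability estimates, RMSE the root mean square of their standard errors, G the separation ratio, and R the separation reliability; a larger G means the ratings distinguish more statistically distinct levels of writing ability.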
     Third, switching scales produced some change in rater severity. Although raters differed significantly in severity in both rounds of rating, the analytic scale allowed them to distinguish more levels of writing ability, so differences in rater severity affected examinee scores less under the analytic scale than under the holistic one. In addition, the G-theory analysis of the individual analytic sub-scores shows that rater severity differences were smaller when relatively local language features such as grammar and vocabulary were rated, and larger when sentence- or discourse-level features such as sentence structure, coherence, and content were rated. This suggests that, for analytic dimensions that can be described relatively objectively, such as grammar and vocabulary, raters can reach a high degree of consistency in both their understanding and their application of the scale.
     Fourth, although raters on the whole achieved good intra-rater consistency in both rounds, four raters showed a tendency to overfit when using the holistic scale, indicating a degree of central tendency in their ratings. Bias analyses, moreover, revealed more significant rater-examinee and rater-dimension interactions under the analytic scale. Two explanations seem plausible: first, the analytic scale requires raters to assign several scores to each essay, which multiplies the opportunities for both types of interaction to occur; second, although the participating raters all had considerable CET6 essay-rating experience, none had used an analytic scale before, and this unfamiliarity may have undermined their internal consistency. The rater-examinee bias analysis further shows that the two scales gave rise to different patterns of bias interaction. First, under the holistic scale raters were more likely to show bias interactions with high-ability examinees, whereas under the analytic scale they were more likely to show bias interactions with low-ability examinees. Second, although under the holistic scale raters tended to be harsher toward higher-ability examinees and more lenient toward lower-ability ones, this tendency was comparatively weak under the analytic scale. Finally, under the analytic scale raters showed more frequent bias interactions with examinees at both extremes of the ability continuum, a pattern that was not evident under the holistic scale.
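     The bias analysis referred to here is conventionally run by adding an interaction term to the Rasch model and flagging combinations whose standardized estimate exceeds roughly |2|; a generic rater-by-examinee version (following the approach of Wigglesworth 1993, not the dissertation's exact specification) is:

        \log \frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \alpha_j - \tau_k - \phi_{nj}

where \phi_{nj} captures the extra severity (positive values) or leniency (negative values) that rater j shows toward examinee n beyond what the main effects predict; the harsh and lenient patterns described above correspond to significant values of this term.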
     Fifth, the MFRM analysis of rating scale functioning shows that raters' use of some scores on the holistic scale (11, 12, and 13) did not meet model expectations; moreover, the thresholds of almost all adjacent holistic scores were separated by less than 1.4 logits, meaning that these scores were not clearly distinguished from one another. By contrast, no anomalies were detected in the use of any score on any of the five dimensions of the analytic scale, and all adjacent-score thresholds were separated by between 1.4 and 5 logits, the recommended range, meaning that every pair of adjacent scores was clearly distinct.
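     The 1.4-to-5-logit band invoked here follows Linacre's (1999) guidelines for rating scale category functioning; the check itself is simple, as in this minimal sketch (the threshold values are invented for illustration and are not the study's estimates):

        def check_threshold_advance(thresholds, low=1.4, high=5.0):
            # Flag adjacent Rasch-Andrich thresholds that advance by less than
            # `low` logits (categories not clearly distinct) or by more than
            # `high` logits (a category spanning too wide an ability range).
            gaps = [b - a for a, b in zip(thresholds, thresholds[1:])]
            return [(i, round(g, 2), low <= g <= high) for i, g in enumerate(gaps, start=1)]

        # Hypothetical thresholds for a five-category analytic dimension:
        print(check_threshold_advance([-4.1, -1.8, 0.3, 2.9]))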
     Taken together, the quantitative results indicate that raters distinguish the scale categories more clearly when using the analytic scale, and that their differences in severity affect examinee ability estimates less. More importantly, the analytic scale enables raters to differentiate examinees' L2 writing ability more finely and more accurately. Although the single-rater reliability of the individual analytic dimensions was unsatisfactory, the reliability of their composite approached 0.7, and the high correlations among the universe scores of the five dimensions justify combining the sub-scores into a composite. The study did find more rater-examinee and rater-dimension interactions under the analytic scale, but previous research (Engelhard 1992; Weigle 1998; Cho 1999) shows that training can effectively improve raters' internal consistency and reduce significant interactions between raters and examinees or scale dimensions. Overall, then, the quantitative results suggest that the analytic scale has the more beneficial effect on the rating outcomes of L2 writing tests.
     The quantitative and qualitative analyses of the think-aloud protocols, the questionnaires, and the semi-structured interviews show, in turn, that raters' interaction with the two scales also affected the rating process differently.
     First, the two scales led to differences in the rating strategies raters used. With the holistic scale, raters more frequently used interpretation strategies, especially self-monitoring interpretation strategies such as reading the text, together with judgement strategies that consider local language features. The holistic scale also prompted raters to use more strategies that help them build an overall impression of an essay, such as articulating a general impression explicitly and speculating about the examinee's language ability or test-taking strategies. Moreover, because raters had difficulty distinguishing adjacent scores, they also more frequently used judgement strategies that weigh adjacent scores against each other. With the analytic scale, by contrast, raters used more judgement strategies, particularly self-monitoring strategies and strategies for judging essay quality; and because the analytic scale requires separate scores for different aspects of language use, they also used error-classifying interpretation strategies more often. These findings indicate that the scoring method and the foci embodied in a rating scale exert a far from negligible influence on the choice of rating strategies.
     Second, the two scales also led to differences in the aspects of the texts raters attended to. Compared with the analytic scale, the holistic scale drew raters' attention more to the overall quality of language use and to non-scale-related language features, especially Chinglish; raters also attended more frequently to comprehensibility, spelling errors, and vocabulary size. Compared with the holistic scale, the analytic scale drew raters' attention more to coherence and grammar, especially the overall quality of these two aspects, and raters attended more frequently to the completeness of content, the overall quality of sentence structure and vocabulary, and error frequency. Individual differences in textual foci were also smaller under the analytic scale. These findings show that raters' textual foci are shaped by the descriptors and foci contained in the rating scale. Because the analytic descriptors are more specific and detailed, and because raters do not need to weigh the components of the criteria against one another to reach a single holistic decision (Goulden 1994), their textual foci also vary less from rater to rater. The findings further suggest that the analytic scale helps concentrate raters' attention on the criteria the scale contains, as reflected in their less frequent use of essay-comparison judgement strategies and their reduced attention to non-scale-related language features.
     Third, although raters found coherence difficult to rate under both scales and considered the descriptors of both scales insufficiently fine-grained, the two scales also gave rise to different kinds of rating difficulty. With the holistic scale, the main difficulties lay in distinguishing adjacent scores, and in particular the two adjacent score bands of 5 and 8; essays with uneven performance across content, coherence, and language also made holistic rating difficult. With the analytic scale, the main difficulties were, first, coping with the cognitive load of rating five different scale dimensions and, second, keeping the dimensions apart, especially sentence structure versus grammar, grammar versus vocabulary, and coherence versus language quality.
     Fourth, the similarities and differences in the rating process described above point to the main characteristics of raters' interaction with the two scales. In their interaction with the holistic scale, raters' understanding and application of the scale departed from what the scale itself stipulates, and they had difficulty defining its individual scores. In their interaction with the analytic scale, raters' application of the scale was largely consistent with what it stipulates, but their understanding of the scale still differed from that of the scale developers.
     The quantitative and qualitative analyses of the rating process thus show that raters' understanding and application of a rating scale do not coincide with what the scale stipulates, and that the degree of mismatch differs from scale to scale. Rater-scale interaction produces differences not only in the rating strategies used but also in the textual features attended to. On the whole, although the analytic scale is more time-consuming to use, it reduces raters' attention to non-scale-related language features, and raters' understanding and application of it accord better with the scale developers' intentions. The analytic scale can therefore be expected to have the more beneficial effect on the rating process of L2 writing tests.
     At the theoretical level, the main findings carry the following implications. First, raters and rating scales interact in complex ways. On the one hand, the foci and descriptors contained in a scale shape raters' understanding of the construct being measured and the criteria they actually apply, while the number of score categories the scale contains strongly affects both the difficulty of the rating task and the precision of the rating outcomes. On the other hand, raters themselves play an important role in this interaction, for three reasons: first, no scale can describe textual features exhaustively, and the "gap" between descriptors and texts can only be filled by the raters; second, the weighting of the components of a holistic scale and the overlap among the dimensions of an analytic scale mean that both kinds of scale introduce a degree of indeterminacy into rating, which again only raters can resolve; third, raters' understanding of the construct in turn strongly affects how readily they accept the scale, how they weight the components of the holistic scale, and how they disentangle the overlapping dimensions of the analytic scale.
     Second, rater interaction with a holistic scale affects the rating process and outcomes as follows. (1) To arrive at an overall evaluation, raters frequently use strategies that help them build a global impression of the essay and pay more attention to the overall quality of its language. (2) Because holistic descriptors tend to be vague and holistic scales usually lack explicit weightings for their components, such scales constrain raters only weakly; raters therefore draw more on non-scale-related criteria, and individual differences easily arise in how they understand and apply the scale, differences that show up both in their textual foci and in their severity. (3) Because raters tend to derive their overall impression from salient but relatively superficial textual features such as handwriting, spelling errors, and vocabulary size, they may still achieve good intra-rater consistency; yet this practice not only threatens the validity of the test but may also restrict raters' use of the full range of scale scores, producing central tendency in the ratings.
     Finally, rater interaction with an analytic scale affects the rating process and outcomes as follows. (1) Because an analytic scale requires raters to score particular aspects of essay quality, raters attend more closely to the overall quality of those aspects and make greater use of quality-judging and error-classifying strategies. (2) Because analytic descriptors tend to be more detailed and the scale does not require raters to assign weights to its components, raters' understanding and application of the scale are more tightly constrained by the scale itself; this increases their attention to the more difficult dimensions (such as coherence), reduces their use of non-scale-related criteria, and helps safeguard their intra-rater consistency. (3) Because raters differ in their understanding of the construct and in their views on the overlap among the dimensions, marked individual differences in their understanding of the scale remain; these differences affect their textual foci and widen the gaps in their severity. Finally, although the analytic scale gives rise to more rater-examinee and rater-dimension interactions, it also helps raters describe examinees' writing ability in finer detail and differentiate it more accurately.
     Beyond these theoretical implications, the findings also bear on the rating practice of the CET6 writing test and on the methodology of validity research into performance assessment. In brief, CET6 essay rating needs improvement in both scale development and rater training. Methodologically, the findings indicate, first, that G-theory and MFRM are strongly complementary and are best applied together when analyzing the rating outcomes of performance tests; second, that although think-aloud protocols do affect the rating process and its outcomes and cannot reveal that process in its entirety, they remain the best available method for obtaining direct empirical evidence about it; third, that think-aloud coding schemes should be constructed for the specific research context and research questions at hand; and finally, that multiple methods of data collection and analysis are both necessary and important in research on performance assessment.
This research investigates the effects of rater-scale interaction on essay rating outcomes and processes in the context of the CET6.
     A group of 9 experienced CET6 raters rated the same batch of 60 CET6 essays produced in an operational CET6 administration twice, using both the CET6 holistic essay rating scale and an analytic rating scale designed specifically for the study. In order to collect data on raters' rating processes, all nine raters provided think-aloud protocols while rating 10 of these essays. In addition, the raters also completed two questionnaires and two semi-structured interviews about their rating process and their perceptions of the rating scales. The think-aloud protocols were coded in terms of the rating strategies adopted as well as the aspects of writing attended to. The results were then compared quantitatively across rating scales. Meanwhile, interpretative analysis was also carried out on both the think-aloud protocols and raters' responses to the questionnaires and semi-structured interviews. Essay scores were analyzed using G-theory and MFRM to estimate both facet- and item-level reliability indices across rating scales.
     With regard to essay rating outcomes, it is found that the use of the analytic scale led to finer distinctions among examinees in terms of their English writing ability and a higher proportion of examinees with acceptable fit. Meanwhile, though there was considerable variability in rater severity with both scales, the impact of this variability on examinee ability estimates was smaller with the analytic scale. Moreover, while most adjacent holistic scores were not clearly distinguished from one another, this problem was not detected for the analytic scale categories. All in all, though the use of the analytic scale led to higher proportions of rater-examinee and rater-scale interactions, it still had a more favorable impact on essay rating outcomes. As to essay rating processes, it is found that the degree of conformity between raters' understanding and application of the rating criteria and what the scales stipulate differed across the two scales. Meanwhile, this interaction affected both the types and frequencies of rating strategies adopted and the specific aspects of the essays attended to. On the whole, when applying the analytic scale, raters tended to focus more on scale-based criteria, and there also seemed to be more similarity in their understanding and application of the scale.
References
Alderson, J. C. (1991). Bands and Scores. In J. C. Alderson & B. North (eds.). Language Testing in the 1990s. London:Macmillan:71-86.
    Andrich, D. (1978). Rating Formulation for Ordered Response Categories. Psychometrika 43 (4):561-573.
    Andrich, D. (1998). Thresholds, Steps and Rating Scale Conceptualization. Rasch Measurement Transactions 12 (3):648-649. http://www.rasch.org/rmt/rmtl239.htm (accessed 23/06/2011)
    Attali, Y.& J. Burstein (2004). Automated Essay Scoring with E-rater v.2.0. Paper presented at the Conference of the International Association for Educational Assessment (IAEA). Philadelphia.
    Bacha, N. (2001). Writing Evaluation:What can Analytic versus Holistic Essay Scoring Tell Us? System 29 (3):371-383.
    Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
    Bachman, L. F. (2004). Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press.
    Bachman, L. F., B. K. Lynch,& M. Mason (1995). Investigating Variability in Tasks and Rater Judgments in a Performance Test of Foreign Language Speaking. Language Testing 12 (2):238-257.
    Barkaoui, K. (2007). Rating Scale Impact on EFL Essay Marking:A Mixed-method Study. Assessing Writing 12 (2):86-107.
    Barkaoui, K. (2008). Effects of Scoring Method and Rater Experience on ESL Rating Outcomes and Processes. Ph. D Dissertation. Toronto:University of Toronto.
    Barkaoui, K. (2010). Variability in ESL Essay Rating Processes:The Role of the Rating Scale and Rater Experience. Language Assessment Quarterly 7 (1):54-74.
    Barkaoui, K. (2011) Think-aloud Protocols in Research on Essay Rating:An Empirical Study on Their Veridicality and Reactivity. Language Testing 28 (1):51-75.
    Barritt, L., P. L. Stock & F. Clark (1986). Researching Practice:Evaluating Assessment Essays. College Composition and Communication 37 (3):315-327.
    Bauer, B. A. (1981). A Study of the Reliabilities and Cost-effectiveness of Three Methods of Assessment for Writing Ability. (ERIC Document Reproduction Service No. ED 216 357).
    Blaikie, N. (2000). Designing Social Research. Molden, MA:Polity Press & Blackwell.
    Bond, T. G.& C. M. Fox (2007). Applying the Rasch Model:Fundamental Measurement in the Human Sciences (2nd edition). Mahwah, NJ:Lawrence Erlbaum.
    Bonk, W. J.& G. J. Ockey (2003). A Many-facet Rasch Analysis of the Second Language Group Oral Discussion Task. Language Testing 20 (1):89-110.
    Breland, H. M.& R. J. Jones (1984). Perceptions of Writing Skills. Written Communication 1 (1):101-119.
    Breland, H., Y. Lee & E. Muraki (2004). Comparability of TOEFL CBT Writing Prompts:Response Mode Analyses. TOEFL Research Report No. RR-04-23. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rr-04-23 (accessed 03/06/2010)
    Brown, A. (1995). The Effect of Rater Variables in the Development of an Occupation-specific Language Performance Test. Language Testing 12(1):1-15.
    Brown, A., N. Iwashita & T. McNamara (2005). An Examination of Rater Orientations and Test-taker Performance on English-for-Academic-Purposes Speaking Tasks. TOEFL Monograph Series No.29. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rr-05-05_toefl-ms-29 (accessed 06/12/2010).
    Brown, J. D. (1991). Do English and ESL Faculties Rate Writing Samples Differently? TESOL Quarterly 25 (4):587-603.
    Carr, N. (2000). A Comparison of the Effects of Analytic and Holistic Composition in the Context of Composition Tests. Issues in Applied Linguistics 11(2):207-241.
    Charney, D. (1984). The Validity of Using Holistic Scoring to Evaluate Writing. Research in the Teaching of English 18 (1):65-81.
    Cheng, L. (2008). The Key to Success:English Language Testing in China. Language Testing 25 (1):15-37.
    Cheng, X.& M. S. Steffensen (1996). Metadiscourse:A Technique for Improving Student Writing. Research in the Teaching of English 30 (2):149-181.
    Cho, D. (1999). A Study on ESL Writing Assessment:Intra-rater Reliability of ESL Compositions. Melbourne Papers in Language Testing 8 (1):1-24.
    Cohen, A. (1987). Using Verbal Reports in Research on Language Learning. In C. Faerch & G. Kasper (eds.). Introspection in Second Language Research. Clevedon, UK:Multilingual Matters:82-95.
    Condon, P. J.& J. McQueen (2000). The Stability of Rater Severity in Large-scale Assessment Programs. Journal of Educational Measurement 37 (2):163-178.
    Connor, U. (1996). Contrastive Rhetoric:Cross-Cultural Aspects of Second Language Writing. Cambridge:Cambridge University Press.
    Connor, U.& M. Farmer (1990). The Teaching of Topical Structure Analysis as a Revision Strategy for ESL Writers. In B. Kroll (ed.). Second Language Writing: Research Insights for the Classroom. Cambridge:Cambridge University Press: 126-139.
    Connor, U.& P. Carrell (1993). The Interpretation of Tasks by Writers and Readers in Holistically Rated Direct Assessment of Writing. In J. Carson and I. Leki (eds.). Reading in the Composition Classroom:Second Language Perspectives. Boston, MA:Heinle & Heinle:141-160.
    Connor, U.& M. Kramer (1995). Writing from Sources:Case Studies of Graduate Students in Business Management. In D. Belcher & G. Braine (eds.). Academic Writing in a Second Language:Essays on Research and Pedagogy. Norwood, NJ: Ablex:155-182.
    Connor-Linton, J. (1995a). Crosscultural Comparison of Writing Standards:American ESL and Japanese EFL. World Englishes 14 (1):99-115.
    Connor-Linton, J. (1995b). Looking behind the Curtain:What do L2 Composition Ratings Really Mean? TESOL Quarterly 29 (4):762-765.
    Council of Europe. (2001). The Common European Framework of Reference for Languages:Learning, Teaching, Assessment. Cambridge:Cambridge University Press.
    Crismore, A., R. Markkanen & M. S. Steffensen (1993). Metadiscourse in Persuasive Writing:A Study of Texts Written by American and Finnish University Students. Written Communication 10 (1),39-71.
    Cumming, A. (1989). Writing Expertise and Second Language Proficiency. Language Learning 39 (1):81-141.
    Cumming, A. (1990). Expertise in Evaluating Second Language Compositions. Language Testing 7 (1):31-51.
    Cumming, A. (1997). The Testing of Writing in a Second Language. In C. Clapham & D. Corson (eds.). Encyclopedia of Language and Education. Volume 7:Language Testing and Assessment. Dordrecht:Kluwer Academic Publishers:51-63.
    Cumming, A., R. Kantor & D. Powers (2001). Scoring of TOEFL Essays and TOEFL 2000 Prototype Writing Tasks:An Investigation into Raters'Decision Making and Development of a Preliminary Analytic Framework. TOEFL Monograph Series No. 22. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rm-01-04_toefl-ms-22 (accessed 03/04/2010)
    Cumming, A., R. Kantor & D. Powers (2002). Decision Making While Rating ESL/EFL Writing Tasks:A Descriptive Framework. Modern Language Journal 86 (1):67-96.
    Cumming, A., L. Grant, P. Mulcahy-Ernt & D. Powers (2005). A Teacher-verification Study of Speaking and Writing Prototype Tasks for a New TOEFL. TOEFL Monograph Series No.26. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rm-04-05_toefl-ms-26 (accessed 12/06/2010)
    Davidson, F. (1991). Statistical Support for Training in ESL Composition Rating. In L. Hamp-Lyons (ed.). Assessing Second Language Writing in Academic Contexts. Norwood, NJ:Ablex:155-164.
    Davies, A., A. Brown, C. Elder, K. Hill, T. Lumley & T. McNamara (1999) Dictionary of Language Testing. Cambridge:Cambridge University Press.
    Delaruelle, S. (1997). Text Type and Rater Decision-making in the Writing Module. In G. Brindley & G. Wigglesworth (eds.). Access:Issues in English Language Test Design and Delivery. Sydney, Australia:National Center for English Language Teaching and Research, Macquarie University:215-242.
    DeRemer, M. (1998). Writing Assessment:Raters'Elaboration of the Rating Task. Assessing Writing 5 (1):7-29.
    Deville, G.& M. Chalhoub-Deville (2006). Old and New Thoughts on Test Score Variability:Implications for Reliability and Validity. In M. Chalhoub-Deville, C. A. Chapelle & P. Duff (eds.). Inference and Generalizability in Applied Linguistics: Multiple Perspectives. Amsterdam:John Benjamins:9-25.
    Diederich, P., J. French & S. Carlton. (1961). Factors in Judgments of Writing Ability. Research Bulletin 61-15, Princeton:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rb-61-15. (accessed 19/03/10).
    di Gennaro, K. (2009). Investigating Differences in the Writing Performance of International and Generation 1.5 Students. Language Testing 26 (4):533-559.
    Douglas, D. (1994). Quantity and Quality in Speaking Test Performance. Language Testing 11 (2):125-144.
    Douglas, D.& L. Selinker (1992). Analyzing Oral Proficiency Test Performance in General and Specific Purpose Contexts. System 20 (3):317-328.
    Douglas, D.& L. Selinker (1993). Performance on a General versus a Field-specific Test of Speaking Proficiency by International Teaching Assistants. In D. Douglas & C. Chapelle (eds.). A New Decade of Language Testing Research:Selected Papers from the 1990 Language Testing Research Colloquium. Alexandria, VA: TESOL:235-256.
    Dunbar, S. B., D. M. Koretz & H. D. Hoover (1991). Quality Control in the Development and Use of Performance Assessments. Applied Measurement in Education 4 (4):289-303.
    Eckes, T. (2008). Rater Types in Writing Performance Assessment:A Classification Approach to Rater Variability. Language Testing 25 (2):155-185.
    Elder, C., U. Knoch, G. Barkhuizen & J. von Randow (2005). Feedback to Enhance Rater Training:Does It Work? Language Assessment Quarterly 2 (3):175-196.
    Engelhard, G. (1992). The Measurement of Writing Ability with a Many-faceted Rasch Model. Applied Measurement in Education 5 (3):171-191.
    Engelhard, G. (1994). Examining Rater Errors in the Assessment of Written Composition with a Many-faceted Rasch Model. Journal of Educational Measurement 31(2):93-112.
    Enright, M. K.& T. Quinlan (2010). Complementing Human Judgment of Essays Written by English Language Learners with E-rater® Scoring. Language Testing 27 (3):317-334.
    Erdosy, M. (2004). Exploring Variability in Judging Writing Ability in a Second Language:A Study of Four Experienced Raters of ESL Compositions. TOEFL Research Report No. RR-03-17. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rr-03-17 (accessed 19/03/2010)
    Ericsson, K. A.& H. A. Simon (1987). Verbal Protocols on Thinking. In C. Faerch & G. Kasper (eds.). Introspection in Second Language Research. Clevedon, UK: Multilingual Matters:24-53.
    Ericsson, K. A.& H. A. Simon (1984/1993). Protocol Analysis:Verbal Reports as Data. Cambridge, MA:MIT Press.
    Frase, L. T., J. Faletti, A. Ginther & L. Grant (1999). Computer Analysis of the TOEFL Test of Written English. TOEFL Research Report No. RR-98-42. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rr-98-42_toefl-rr-64 (accessed 21/08/2010)
    Freedman, S. W. (1977). Influences on the Evaluators of Student Writing. Dissertation Abstract International 37:5306A.
    Freedman, S. W. (1979a). How Characteristics of Students Essays Influence Teachers'Evaluation. Journal of Educational Psychology 71 (3):328-338.
    Freedman, S. W. (1979b). Why do Teachers Give the Grades They Do? College Composition and Communication 30 (2):161-164.
    Freedman, S. W. (1981). Influences on Evaluators of Expository Essays:Beyond the Text. Research in the Teaching of English 15 (3):245-255.
    Freedman, S. W. (1984). The Registers of Student and Professional Expository Writing:Influences on Teachers' Responses. In R. Beach & S. Bridwell (eds.). New Directions in Composition Research. New York:Guilford Press:334-347.
    Freedman, S. W.& R. C. Calfee (1983). Holistic Assessment of Writing: Experimental Design and Cognitive Theory. In P. Mosenthal, L.Tamor & S. A. Walmsley (eds.). Research on Writing:Principles and Methods. New York, NY: Longman:75-98.
    Fulcher, G. (1996). Does Thick Description Lead to Smart Tests? A Data-based Approach to Rating Scale Construction. Language Testing 13 (2):208-238.
    Gebril, A. (2009). Score Generalizability of Academic Writing Tasks:Does One Test Method Fit It All? Language Testing 26(4):507-531.
    Gebril, A. (2010). Bringing Reading-to-write and Writing-only Assessment Tasks Together:A Generalizability Analysis. Assessing Writing 15 (2):100-117.
    Glaser, B.& A. L. Strauss (1967). The Discovery of Grounded Theory:Strategies for Qualitative Research. Chicago, IL:Aldine.
    Goulden, N. R. (1992). Theory and Vocabulary for Communication Assessments. Communication Education 41 (3):258-269.
    Goulden, N. R. (1994). Relationship of Analytic and Holistic Methods to Raters' Scores for Speeches. The Journal of Research and Development in Education 27: 73-82.
    Green, A. (1998). Verbal Protocol Analysis in Language Testing Research:A Handbook. Cambridge:Cambridge University Press.
    Green, J. C., V. J. Caracelli & W. F. Graham (1989). Towards a Conceptual Framework for Mixed-method Evaluation Designs. Educational Evaluation and Policy Analysis 11 (3):255-274.
    Grobe, C. (1981). Syntactic Maturity, Mechanics, and Vocabulary as Predicators of Writing Quality. Research in the Teaching of English 15 (1):75-85.
    Hake, R. L.& J. M. Williams (1981). Style and Its Consequences:Do as I Do, Not as I Say. College English 43 (5):433-451.
    Hale, G. A., C. Taylor, B. Bridgeman, J. Carson, B. Kroll,& R. Kantor (1996). A Study of Writing Tasks Assigned in Academic Degree Programs. TOEFL Research Report RR-54. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rr-95-44 (accessed 02/12/2009)
    Hales, L. W.& E. Tokar (1975). The Effects of the Quality of Preceding Responses on the Grades Assigned to Subsequent Responses to an Essay Question. Journal of Educational Measurement 12 (2):115-117.
    Hamp-Lyons, L. (1986). Testing Writing across the Curriculum. Papers in Applied Linguistics, Michigan (PALM) 2 (1):17-29.
    Hamp-Lyons, L. (1990). Second Language Writing:Assessment Issues. In B. Kroll (ed.). Second Language Writing. Cambridge:Cambridge University Press:69-87.
    Hamp-Lyons, L. (1991a). Pre-text:Task Related Influences on the Writer. In L Hamp-Lyons (ed.). Assessing Second Language Writing in Academic Contexts. Norwood, NJ:Ablex:87-107.
    Hamp-Lyons, L. (1991b). Scoring Procedures. In L. Hamp-Lyons (ed.). Assessing Second Language Writing in Academic Contexts. Norwood, NJ:Ablex:241-276.
    Hamp-Lyons, L. (1995). Rating Non-native Writing:The Trouble with Holistic Scoring. TESOL Quarterly 29 (4):759-762.
    Hamp-Lyons, L. (2003). Writing Teachers as Assessors of Writing. In B. Kroll (ed.). Exploring the Dynamics of Second Language Writing. Cambridge:Cambridge University Press:162-189.
    Hamp-Lyons, L. (2011). Writing Assessment:Shifting Issues, New Tools, Enduring Questions. Assessing Writing 16 (1):3-5.
    Hamp-Lyons, L.& B. Kroll (1997). TOEFL 2000-writing:Composition, Community and Assessment. TOEFL Monograph Series No.5. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rm-96-05_toefl-ms-05 (accessed 23/07/2010)
    Henning, G. (1992). Dimensionality and Construct Validity of Language Tests. Language Testing 9(1):1-11.
    Hill, K. (1997). Who Should be the Judge? The Use of Non-native Speakers as Raters on a Test of English as an International Language. In A. Huhta, V. Kohonen, L. Kurkisuonio & S. Luoma (eds.). Current Developments and Alternatives in Language Assessment:Proceedings of LTRC 96. Jyvaskyla:University of Jyvaskyla:275-290.
    Hinkel, E. (2002). Second Language Writers'Text:Linguistic and Rhetorical Features. Mahwah, NJ:Erlbaum.
    Homburg, T. J. (1984). Holistic Evaluation of ESL Composition:Can It be Validated Objectively? TESOL Quarterly 18 (1):87-107.
    Huang, J. (2008). How Accurate are ESL Students'Holistic Writing Scores on Large-scale Assessments? A Generalizability Theory Approach. Assessing Writing 13 (3): 201-218.
    Hughes, D. E., B. Keeling & B. F. Tuck. (1980). The Influence of Context Position and Scoring Method on Essay Scoring. Journal of Educational Measurement 17 (2): 131-135.
    Huot, B. (1988). The Validity of Holistic Scoring:A Comparison of the Talk-aloud Protocols of Expert and Novice Holistic Raters. Ph. D Dissertation. Pennsylvania: Indiana University of Pennsylvania.
    Huot, B. (1990a). Reliability, Validity, and Holistic Scoring:What We Know and What We Need to Know. College Composition and Communication 41 (2):201-213.
    Huot, B. (1990b). The Literature of Direct Writing Assessment:Major Concerns and Prevailing Trends. Review of Educational Research 60 (2):237-263.
    Huot, B. (1993). The Influence of Holistic Scoring Procedures on Reading and Rating Student Essays. In M. M. Williamson & B. A. Huot (eds.). Validating Holistic Scoring for Writing Assessment:Theoretical and Empirical Foundations. Creskill, NJ:Hampton Press:206-236.
    Huot, B. (1996). Toward a New Theory of Writing Assessment. College Composition and Communication 47 (4):549-66.
    Huot, B. (2002). (Re)Articulating Writing Assessment:Writing Assessment for Teaching and Learning. Logan, Utah:Utah State University Press.
    Intaraprawat, P.& S. Steffensen (1995). The Use of Metadiscourse in Good and Poor ESL Essays. Journal of Second Language Writing 4 (3):253-272.
    Jacobs, H. L., S. A. Zingkgraf, D. R. Wormuth, V. F. Hartfiel & J. B. Hughey (1981). Testing ESL Composition:A Practical Approach. Rowley, MA:Newbury House Publishers, Inc.
    Jin, T., Y. Wang, C. Song & S. Guo (2008). An Empirical Study of Fuzzy Scoring Methods for Speaking Tests. Modern Foreign Languages 31 (2):157-164.
    Jin, T., B. Mak & P. Zhou (2012). Confidence Scoring of Speaking Performance:How does Fuzziness Become Exact? Language Testing 29 (1):43-65.
    Jin, Y. (2005). The National College English Test of China. Paper presented at the International Association of Applied Linguistics (AILA). Madison, WI.
    Jin, Y. (2009). The National College English Testing Committee. In L. Cheng,& A. Curtis (eds.). English Language Assessment and the Chinese Learner. New York and London:Routledge:44-59.
    Johnson, J. S.& G. S. Lim (2009). The Influence of Rater Language Background on Writing Performance Assessment. Language Testing 26 (4):485-505.
    Kenyon, D. (1992). Introductory Remarks at Symposium on Development and Use of Rating Scales in Language Testing.14th Language Testing Research Colloquium. Vancouver.
    Knoch, U. (2007).'Little Coherence, Considerable Strain for Reader':A Comparison between Two Rating Scales for the Assessment of Coherence. Assessing Writing 12 (2):108-128.
    Knoch, U. (2011). Investigating the Effectiveness of Individualized Feedback to Rating Behavior-A Longitudinal Study. Language Testing 28 (2):179-200.
    Knoch, U., J. Read & J. von Randow (2007). Re-training Raters Online:How does It Compare with Face-to-face Training? Assessing Writing 12 (1):26-43.
    Kobayashi, H.& C. Rinnert. (1996). Factors Affecting Composition Evaluation in an EFL Context:Cultural Rhetorical Pattern and Readers'Background. Language Learning 46 (3):397-437.
    Kobayashi, T. (1992). Native and Nonnative Reactions to ESL Compositions. TESOL Quarterly 26(1):81-112.
    Kondo-Brown, K. (2002). A FACETS Analysis of Rater Bias in Measuring Japanese L2 Writing Performance. Language Testing 19 (1):3-31.
    Kunnan, A. J. (1990) Differential Item Functioning:The Case of an ESL Placement Examination. TESOL Quarterly 24 (4):741-746.
    Lane, S.& D. Sabers (1989). Use of Generalizability Theory for Estimating the Dependability of a Scoring System for Sample Essays. Applied Measurement in Education 2 (3):195-205.
    Lee, H. K. (2004). A Comparative Study of ESL Writers'Performance in a Paper-based and a Computer-delivered Writing Test. Assessing Writing 9(1):4-26.
    Lee, Y.& R. Kantor (2005). Dependability of New ESL Writing Test Scores:Tasks and Alternative Rating Schemes. TOEFL Report MS-31. Princeton, NJ: Educational Testing Service. http://www.ets.org/research/policy_research_reports/rr-05-14_toefl-ms-31 (accessed 15/07/2010)
    Lee, Y., C. Gentile & R. Kantor, (2008). Analytic Scoring of TOEFL CBT Essays: Scores from Humans and E-raters. TOEFL Research Report RR-08-01. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rr-08-01_toefl-rr-81 (accessed 17/05/2010)
    Lim, G. S. (2011) The Development and Maintenance of Rating Quality in Performance Writing Assessment:A Longitudinal Study of New and Experienced Raters. Language Testing 28 (4):543-560.
    Linacre, J. M. (1994). Constructing Measurement with a Many-Facet Rasch Model. In M. Wilson (ed.). Objective Measurement:Theory into Practice. Volume 2. Norwood, NJ:Ablex:129-144.
    Linacre, J. M. (1996). Generalizability Theory and Many-facet Rasch Measurement. In G. Engelhard & M. Wilson (eds.).Objective Measurement:Theory into Practice. Volume 3. Norwood, NJ:Ablex:85-98.
    Linacre, J. M. (1999). Investigating Rating Scale Category Utility. Journal of Outcome Measurement 3 (2):103-122.
    Linacre, J. M. (2010). A User's Guide to FACETS Rasch Model Computer Program. http://www.winsteps.com (accessed 16/12/2010).
    Lumley, T. (2002). Assessment Criteria in a Large-scale Writing Test:What do They Really Mean to the Raters? Language Testing 19 (3):246-276.
    Lumley, T. (2005). Assessing Second Language Writing:The Rater's Perspective. New York:Peter Lang.
    Lunz, E. M., J. A. Stahl & B. D. Wright (1996). The Invariance of Judge Severity Calibration. In G. Engelhard & M. Wilson (eds.). Objective Measurement:Theory into Practice. Volume 3. Norwood, NJ:Ablex:173-179.
    Lynch, B. K. (1996). Language Program Evaluation:Theory and Practice. Cambridge:Cambridge University Press.
    Lynch, B. K.& T. F. McNamara (1998). Using G-theory and Many-facet Rasch Measurement in the Development of Performance Assessment of the ESL Speaking Skills of Immigrants. Language Testing 15 (2):158-180.
    Markham, L.R. (1976). Influence of Handwriting Quality on Teacher Evaluation of Written Work. American Educational Research Journal 13 (4):277-283.
    Masters, G. N. (1982). A Rasch Model for Partial Credit Scoring. Psychometrika 47 (2):149-174.
    Mathison, S. (1988). Why Triangulate? Educational Researcher 17 (2):13-17.
    McNamara, T. (1996). Measuring Second Language Performance. New York: Addison Wesley Longman Limited.
    Mendelsohn, D.& A. Cumming. (1987). Professors'Ratings of Language Use and Rhetorical Organization in ESL Compositions. TESL Canada Journal 5 (1):9-26.
    Michael, W. B., T. Cooper, P. Shaffer & E. Wallis (1980). A Comparison of the Reliability and Validity of Ratings of Student Performance on Essay Examinations by Professors of English and by Professors in Other Disciplines. Educational and Psychological Measurement 40 (1):183-195.
    Milanovic, M.& N. Saville (1994). An Investigation of Marker Strategies Using Verbal Protocols. Paper presented at the 16th Language Testing Research Colloquium. Washington, D.C.
    Milanovic, M.& N. Saville (1996). Introduction. In M. Milanovic & N. Saville (eds.), Performance Testing, Cognition and Assessment:Selected Papers from the 15th Language Testing Colloquium (LTRC), Cambridge and Arnhem. Cambridge: Cambridge University Press,3-33.
    Milanovic, M., N. Saville & S. Shuhong (1996). A study of the Decision-making Behaviour of Composition Markers. In M. Milanovic,& N. Saville (eds.), Performance Testing, Cognition and Assessment:Selected Papers from the 15th Language Testing Colloquium (LTRC), Cambridge and Arnhem. Cambridge: Cambridge University Press,92-107.
    Myford, C. M.& E. W. Wolfe. (2000a). Monitoring Sources of Variability within the Test of Spoken English Assessment System. TOEFL Research Report No. RR-65. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rr-00-06 (accessed 12/12/2009)
    Myford, C. M.& E. W. Wolfe. (2000b). Strengthening the Ties that Bind:Improving the Linking Network in Sparsely Connected Rating Designs. TOEFL Technical Report No. TR-15. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rr-00-09 (accessed 21/01/2010)
    Myford, C. M.& E. W. Wolfe. (2003). Understanding Rasch Measurement:Detecting and Measuring Rater Effects Using Many-facet Rasch Measurement:Part Ⅰ. Journal of Applied Measurement 4 (4):386-422.
    Myford, C. M.& E. W. Wolfe. (2004). Understanding Rasch Measurement:Detecting and Measuring Rater Effects Using Many-facet Rasch Measurement:Part Ⅱ. Journal of Applied Measurement 5 (2):189-227.
    Neilsen, L.& G. Piche (1981). The Influence of Headed Nominal Complexity and Lexical Choice of Teachers'Evaluation of Writing. Research in the Teaching of English 15 (1):65-74.
    Nold, E. W.& S. W. Freedman (1977). An Analysis of Readers'Responses to Essays. Research in the Teaching of English 11 (2):164-174.
    North, B. (2000). Linking Language Assessments:An Example in the Low Stakes Context. System 28 (4):555-577.
    O'Loughlin, K. (1992). Do English and ESL Teachers Rate Essays Differently? Melbourne Papers in Language Testing 1 (2):19-44.
    O'Loughlin, K. (1994). The Assessment of Writing by English and ESL Teachers. Australian Review of Applied Linguistics 17 (1):23-44.
    O'Sullivan, B.& M. Rignall (2007). Assessing the Value of Bias Analysis Feedback to Raters for the IELTS Writing Module. In L. Taylor & P. Falvey (eds.). IELTS Collected Papers. Cambridge:Cambridge University Press,446-476.
    Park, Y. M. (1988). Academic and Ethnic Background as Factors Affecting Writing Performance. In A. C. Purves (ed.). Writing across Languages and Cultures. Beverly Hills, CA:Sage,261-272.
    Patton, M. Q. (2002). Qualitative Evaluation and Research Methods (3rd edition.). Newbury Park, CA:Sage.
    Penny J., R. L. Johnson & B. Gordon (2000). The Effect of Rating Augmentation on Inter-rater Reliability:An Empirical Study of a Holistic Rubric. Assessing Writing 7(2):143-164.
    Perkins, K. (1983). On the Use of Composition Scoring Techniques, Objective Measures, and Objective Tests to Evaluate ESL Writing Ability. TESOL Quarterly 17 (4):651-671.
    Polio, C.& M. Glew (1996). ESL Writing Assessment Prompts:How Students Choose. Journal of Second Language Writing 5(1):35-49.
    Powers, D. E., M. E. Fowles, M. Farnum & P. Ramsey. (1994). Will They Think Less of My Handwritten Essay if Others Word Process Theirs? Effects on Essay Scores of Intermingling Handwritten and Word-processed Essays. Journal of Educational Measurement 31 (3):220-233.
    Rafoth, B. A.& D. L. Rubin (1984). The Impact of Content and Mechanics on Judgments of Writing Quality. Written Communication 1 (4):446-458.
    Raimes, A. (1987). Language Proficiency, Writing Ability, and Composing Strategies: A Study of ESL College Student Writers. Language Learning 37 (3):439-468.
    Reid, J. (1990). Responding to Different Topic Types:A Quantitative Analysis from a Contrastive Rhetoric Perspective. In B. Kroll (ed.). Second Language Writing: Research Insights for the Classroom. Cambridge:Cambridge University Press, 191-210.
    Rinnert, C.& H. Kobayashi (2001). Differing Perceptions of EFL Writing among Readers in Japan. The Modern Language Journal 85 (2):189-209.
    Russo, J. E., E. J. Johnson & D. L. Stephens (1989). The Validity of Verbal Protocols. Memory and Cognition 17 (6):759-769.
    Ruth, L.& S. Murphy (1988). Designing Writing Tasks for the Assessment of Writing. Norwood. NJ:Ablex.
    Sakyi, A. A. (2000) Validation of Holistic Scoring for ESL Writing Assessment:A Study of How Raters Evaluate ESL Compositions on a Holistic Scale. In A. J. Kunnan (ed.). Fairness and Validation in Language Assessment. Cambridge: Cambridge University Press,130-153.
    Sakyi, A. A. (2003). A Study of the Holistic Scoring Behaviors of Experienced and Novice ESL Instructors. Ph. D Dissertation. Toronto:University of Toronto.
    Santos, T. (1988). Professors'Reactions to the Academic Writing of Nonnative-speaking Students. TESOL Quarterly 22 (1):69-90.
    Schaefer, E. (2008). Rater Bias Patterns in an EFL Writing Assessment. Language Testing 25 (4):465-493.
    Schoonen, R. (2005). Generalizability of Writing Scores:An Application of Structural Equation Modeling. Language Testing 22 (1):1-30.
    Schoonen, R., M. Vergeer & M. Eiting (1997). The Assessment of Writing Ability: Expert Readers versus Lay Readers. Language Testing 14(2):157-184.
    Shi, L. (2001). Native- and Nonnative-speaking EFL Teachers'Evaluation of Chinese Students'English Writing. Language Testing 18 (3):303-325.
    Shohamy, E., C. M. Gordon & R. Kraemer (1992). The Effects of Raters'Background and Training on the Reliability of Direct Writing Tests. The Modern Language Journal 76 (1):27-33.
    Sloan, C.& I. McGinnis (1982). The Effect of Handwriting on Teachers'Grading of High School Essays. Journal of the Association for the Study of Perception 17 (2): 15-21.
    Smagorinsky, P. (1994). Think-aloud Protocol Analysis:Beyond the Black Box. In P. Smagorinsky (ed.). Speaking about Writing:Reflections on Research Methodology. Thousand Oaks, CA:Sage,3-19.
    Smith, D. (2000). Rater Judgment in the Direct Assessment of Competency-based Second Language Writing Ability. In G. Brindley (ed.). Studies in Immigrant English Language Assessment. Volume I. Sydney:Macquarie University,159-189.
    Song, C. B.& I. Caruso. (1996). Do English and ESL Faculty Differ in Evaluating the Essays of Native English-speaking and ESL Students? Journal of Second Language Writing 5(2):163-182.
    Stansfield, C. W.& J. Ross (1988). A Long-term Research Agenda for the Test of Written English. Language Testing 5 (2):160-186.
    Stratman, J. F.& L. Hamp-Lyons (1994). Reactivity in Concurrent Think-aloud Protocols:Issues for Research. In P. Smagorinsky, (ed.). Speaking about Writing: Reflections on Research Methodology. Thousand Oaks, CA:Sage,89-111.
    Strauss, A.& J. Corbin (1994). Grounded Theory Methodology:An Overview. In N.K. Denzin & Y.S. Lincoln (eds.). Handbook of Qualitative Research. Thousand Oaks, CA:Sage,273-285.
    Stuhlmann, J., K. Daniel, A. Dellinger, R. Denny & T. Powers (1999). A Generalizability Study of the Effects of Training on Teachers'Ability to Rate Children's Writing Using a Rubric. Journal of Reading Psychology 20 (2):107-127.
    Sudweeks, R. R., S. Reeve & W. S. Bradshaw (2005). A Comparison of Generalizability Theory and Many-facet Rasch Measurement in an Analysis of College Sophomore Writing. Assessing Writing 9 (3):239-261.
    Sullivan, F. J. (1987). Negotiating Expectations:Writing and Reading Placement Tests. Paper presented at the meeting of the Conference of College Composition and Communication. Atlanta, GA.
    Swartz, C. W., S. R. Hooper, J. W. Montgomery, M. B. Wakely, R. E. L. DeKruif, M. Reed, et al. (1999). Using Generalizability Theory to Estimate the Reliability of Writing Scores Derived from Holistic and Analytical Scoring Methods. Educational and Psychological Measurement 59 (3):492-506.
    Sweedler-Brown, C. O. (1985). The Influence of Training and Experience on Holistic Essay Evaluation. English Journal 74 (5):49-55.
    Tedick, D.& M. Mathison (1995). Holistic Scoring in ESL Writing Assessment: What does an Analysis of Rhetorical Features Reveal? In D. Belcher & B. Braine (eds.). Academic Writing in a Second Language:Essays on Research and Pedagogy. Norwood, NJ:Ablex,205-230.
    van Dijk, T. A. (1980). Text and Context:Explorations in the Semantics and Pragmatics of Discourse. London:Longman.
    Vann, R. J., F. O. Lorenz & D. M. Meyer (1991). Error Gravity:Faculty Response to Errors in the Written Discourse of Non-native Speakers of English. In L. Hamp-Lyons (ed.). Assessing Second Language Writing in Academic Contexts. Norwood, NJ:Ablex,181-195.
    Vaughan, C. (1987). What Affects Raters' Judgment? Paper presented at the meeting of the Conference on College Composition and Communication. Atlanta.
    Vaughan, C. (1991). Holistic assessment:What Goes on in the Rater's Mind? In L. Hamp-Lyons (ed.). Assessing Second Language Writing in Academic Contexts. Norwood, NJ:Ablex,111-125.
    Watson Todd, R. T., P. Thienpermpool,& S. Keyuravong (2004). Measuring the Coherence of Writing Using Topic-based Analysis. Assessing Writing 9 (2):85-104.
    Weigle, S. C. (1994a). Effects of Training on Raters of English as a Second Language Compositions:Quantitative and Qualitative Approaches. Ph. D Dissertation. Los Angeles:University of California.
    Weigle, S. C. (1994b). Effects of Training on Raters of ESL Compositions. Language Testing 11(2):197-223.
    Weigle, S. C. (1998). Using FACETS to Model Rater Training Effects. Language Testing 15 (2):263-287.
    Weigle, S. C. (1999). Investigating Rater/Prompt Interactions in Writing Assessment: Quantitative and Qualitative Approaches. Assessing Writing 6 (2):145-178.
    Weigle, S.C. (2002). Assessing Writing. Cambridge:Cambridge University Press.
    Weir, C. J. (1990). Communicative Language Testing. NJ:Prentice Hall Regents.
    White, E. M. (1984). Holisticism. College Composition and Communication 35 (4): 400-409.
    White, E. M. (1993). Holistic Scoring:Past Triumphs, Future Challenges. In M.M. Williamson & B. A. Huot (eds.). Validating Holistic Scoring for Writing Assessment:Theoretical and Empirical Foundations. Creskill, NJ:Hampton Press, 79-108.
    White, E. M. (1995). An Apologia for the Timed Impromptu Essay Test. College Composition and Communication 46 (1):30-45.
    Widdowson, H. G. (1983). Learning Purpose and Language Use. Oxford:Oxford University Press.
    Wigglesworth, G. (1993). Exploring Bias Analysis as a Tool for Improving Rater Consistency in Assessing Oral Interaction. Language Testing 10 (3):305-335.
    Wigglesworth, G. (1994). Patterns of Rater Behaviour in the Assessment of an Oral Interaction Test. Australian Review of Applied Linguistics 17 (2):77-103.
    Wiseman, C. (2008). Investigating Selected Facets in Measuring Second Language Writing Ability Using Holistic and Analytic Scoring Methods. Ph. D Dissertation. New York:Columbia University.
    Wolfe, E. W.& B. Feltovich (1994). Learning to Rate Essays:A Story of Scorer Cognition. Paper presented at the Annual Meeting of the American Educational Research Association. New Orleans, LA.
    Wolfe, E. W. (1997). The Relationship between Essay Reading Style and Scoring Proficiency in a Psychometric Scoring System. Assessing Writing 4 (1):83-106.
    Wolfe, E. W., C. Kao & M. Ranney (1998). Cognitive Differences in Proficient and Nonproficient Essay Scorers. Written Communication 15 (4):465-492.
    Wright, B. D.& G. N. Masters (1982). Rating Scale Analysis. Chicago:MESA Press.
    Xi, X. (2007). Evaluating Analytic Scoring for TOEFL Academic Speaking Test (TAST) for Operational Use. Language Testing 24 (2):251-286.
    Xi, X.& P. Mollaun (2006). Investigating the Utility of Analytic Scoring for the TOEFL Academic Speaking Test (TAST). TOEFL iBT Research Report TOEFLiBT-01. Princeton, NJ:Educational Testing Service. http://www.ets.org/research/policy_research_reports/rr-06-07_toeflibt-01 (accessed 12/11/2009)
    Yoko, K. (2004). Using GENOVA and FACETS to Set Multiple Standards on Performance Assessment for Certification in Medical Translation from Japanese into English. Language Testing 21 (1):1-27.
    Zheng, Ying & Cheng Liying (2008). Test Review:College English Test (CET) in China. Language Testing,25 (3):408-417.
    何莲珍、张洁(2008),多层面Rasch模型下大学英语四、六级口语考试(CET-SET)信度研究,《现代外语》31(4):388-398。
    教育部高等教育司(2007),《大学英语课程教学要求College English Curriculum Requirements》。北京:外语教学与研究出版社。
    李春花(2010),高考英语写作评分中评分员和评分量表的关系探讨,硕士学位论文。太原:山西大学。
    李航(2011),基于概化理论和多层面Rasch模型的CET-6作文评分信度研究,《外语与外语教学》260:51-56。
    李清华、孔文(2010),TEM-4写作新分项式评分标准的多层面Rasch模型分析,《外语电化教学》131:19-25。
    李文中(1993),中国英语与中国式英语,《外语教学与研究》4:18-24。
    梁茂成(2011),《中国学生英语作文自动评分模型的构建》。北京:外语教学与研究出版社。
    刘建达(2005),话语填充测试方法的多层面Rasch模型分析,《现代外语》28(2):157-169。
    刘建达(2010),评卷人效应的多层面Rasch模型研究,《现代外语》33(2):185-193。
    罗娟(2007),作文整体评分与分项评分方法的比较研究,硕士学位论文。长沙:湖南大学。
    罗娟、肖云南(2008),基于多元概化理论的英语写作评分误差分析研究,《中国外语》5(5):61-66。
    全国大学英语四六级考试委员会(2006),《大学英语六级考试大纲(2006修订版)》。北京:高等教育出版社。
    杨惠中、C. Weir(1998),《大学英语四、六级考试效度研究》。上海:上海外语教育出版社。
    张洁(2009),评分过程与评分员信念-评分员差异的内在因素研究,博士学位论文。广州:广东外语外贸大学。
