Workshop on Evaluation for Language and Dialogue Systems

Toulouse, France
July 6-7, 2001


The aim of this two-day workshop is to identify and synthesize current needs for language-technology evaluation.

The space of possible dialogues is enormous, even for limited domains such as travel information servers, and generalizing evaluation methodologies across application domains and languages remains an open problem. A review of published evaluations of dialogue models and systems suggests that usability techniques are the standard method: dialogue-based systems are often evaluated in terms of standard, objective usability metrics, such as task-completion time and number of user actions. Researchers have proposed and debated theory-based methods for modifying and testing the underlying dialogue model, as well as more precise, empirical methods for evaluating the effectiveness of dialogue models, but usability testing remains the most widely used method of evaluation. For task-based interaction, typical measures of effectiveness are time-to-completion and task outcome, but the evaluation should focus on user satisfaction rather than on arbitrary effectiveness measurements. Indeed, the problems faced in current approaches to measuring the effectiveness of dialogue models and systems include:

  1. Direct measures are unhelpful because efficient performance on the nominal task may not represent the most effective interaction
  2. Indirect measures usually rely on judgment and are vulnerable to weak relationships between the inputs and outputs
  3. Subjective measures are unreliable and domain-specific

For the first day of the workshop, the organizers solicit papers on these issues, with particular emphasis on methods that go beyond usability testing to address the underlying dialogue model. Representative questions to be addressed include:

  1. How do we deal with the combinatorial explosion of dialogue states?
  2. How can satisfaction be measured with respect to underlying dialogue models?
  3. Are there useful direct measures of dialogue properties that do not depend on task efficiency?
  4. What is the role of agent-based simulation in evaluation of dialogue models?

Of course, the problems faced in evaluating dialogue models and systems are also found in other domains of language engineering, even for non-interactive processes such as part-of-speech tagging, parsing, semantic disambiguation, information extraction, speech transcription, and audio document indexing. The issue of evaluation can therefore be viewed at a more generic level, raising fundamental, theoretical questions such as:

  1. What are the interest and benefits of evaluation for language engineering?
  2. Do we really need these specific methodologies, since a form of evaluation should always be present in any scientific investigation?
  3. If evaluation is needed in language engineering, is it the case for all domains?
  4. What form should it take? Technology evaluation (task-oriented, in a laboratory environment) or field/user evaluation (complete systems in real-life conditions)?
  5. As noted above, the evaluation of dialogue models is still an unsolved problem, but for domains where metrics already exist, are they satisfactory and sufficient? How can we take into account, or abstract from, the subjective factor introduced by human operators in the process?
  6. Do similarity measures and standards offer appropriate answers to this problem? Most of the efforts focus on evaluating process, but what about the issue of language resources evaluation?

For the second day of the workshop, the organizers solicit papers on these issues, with the intent to address the problem of evaluation both from a broader perspective (including novel application domains for evaluation, new metrics for known tasks, and resource evaluation) and from a more theoretical point of view (including formal theory of evaluation and infrastructural needs of language engineering).

NOTE: People who would like to submit a paper on lexical semantic disambiguation evaluation should consider the parallel workshop, on July 5-6, for the closure of the SENSEVAL-2 evaluation campaign.


The organization of each of the two days of the workshop will reflect the workshop's two main themes. Each day will begin with a session of presentations of selected papers and follow with panel discussions to synthesize and develop possible methodologies from additional selected workshop papers.


The workshop seeks participation from people involved or interested in the problem of evaluation in language processing, and from the research and industrial communities that study and implement dialogue models for natural-language interaction systems.

The first part of the workshop will specifically draw on the natural-language interaction community, such as the one developing at the confluence of SIGdial and SIGCHI, which will find in this workshop an atmosphere more flavored by computational-linguistics issues (see, for example, the First SIGdial Workshop on Discourse and Dialogue).

The second part of the workshop is intended to provide a forum for a broader audience, more in the spirit of the one that attended the LREC'2000 Satellite Workshop on Evaluation, in particular offering an opportunity to people involved in language engineering evaluation (e.g., the CLASS audience) in the context of national or transnational projects or programs, both in Europe and abroad.


Paper submissions should follow the two-column format of ACL proceedings and should not exceed eight (8) pages, including references. We strongly recommend the use of the ACL LaTeX style files or Microsoft Word style files tailored for this year's conference. They are available from the ACL-2001 program committee Web site at

Papers should be submitted electronically, as a LaTeX, Word, or PDF file, to either:


Deadline for workshop paper submissions: April 6, 2001
Deadline for notification of workshop paper acceptance: April 27, 2001
Deadline for camera-ready workshop papers: May 16, 2001
Workshop date: July 6-7, 2001


David G. Novick
Department of Computer Science
University of Texas at El Paso
El Paso, TX 79968, USA
Phone: +1 915-747-6952

Joseph Mariani
Limsi - CNRS
Bâtiment 508 Université Paris XI
BP 133 - 91403 ORSAY Cedex - France
Fax: +33 (0)1 69 85 80 88

Candy Kamm
AT&T Labs
180 Park, Bldg 103
Florham Park, NJ 07932, USA
+1 973-360-8540

Patrick Paroubek
Spoken Language Processing Group / Human-Machine Communication Department
Limsi - CNRS
Bâtiment 508 Université Paris XI
BP 133 - 91403 ORSAY Cedex - France
Fax: +33 (0)1 69 85 80 88
Phone: +33 (0)1 69 85 81 91

Nils Dahlbäck
Computer & Information Science Department
Linköping University
S-581 83 Linköping Sweden
Phone: +46 13 28 16 64

Frankie James
RIACS Mail Stop 19-39
NASA Ames Research Center
Moffett Field, CA 94035, USA
Phone: +1 650-604-0197

Karen Ward
Department of Computer Science
University of Texas at El Paso
El Paso, TX 79968, USA
Phone: +1 915-747-6957



ACL 2001

We also anticipate co-sponsorship from SIGdial.


Additional information on the workshop, including accepted papers and the workshop schedule, will be made available as needed at

DGN and PAP, February 19, 2001