Participating

Join the REAIS Slack workspace: http://bit.ly/reais-2020-slack

You should receive an invitation to join Underline via email with the subject “Welcome to HCOMP 2020” after registering for the conference and/or workshops. Click the link in that email to sign up for Underline and view the conference.

Having trouble? Ask in the #ask-for-help-here channel in the REAIS 2020 Slack.

Program Agenda

Note from organizers: Daylight saving time has changed for Amsterdam (but not yet in the US), so take care with timezone calculations. We apologize for any confusion this has caused.

2:00pm CET / 6:00am PDT Tech Platform Onboarding
2:10pm CET / 6:10am PDT Introduction
2:25pm CET / 6:25am PDT Invited Talk: José Hernández-Orallo
3:05pm CET / 7:05am PDT Break
3:15pm CET / 7:15am PDT NLP Invited Talks: Ellen Voorhees, João Sedoc, Courtney Napoles
4:45pm CET / 8:45am PDT Panel discussion
5:05pm CET / 9:05am PDT Break
5:15pm CET / 9:15am PDT Reporting from discussions
5:20pm CET / 9:20am PDT Invited Talk: Besmira Nushi
6:00pm CET / 10:00am PDT Lightning Talks: Cynthia Freeman, Ian Beaver
6:20pm CET / 10:20am PDT Closing remarks

Speakers

José Hernández-Orallo, Professor, Universitat Politècnica de València (website)

Title: Adversarial Testing: from Adversarial Datasets to Cognitively-Adaptive Testing

Abstract: AI evaluation has been fighting a decades-long battle against overfitting, caused by AI researchers designing their systems to the test or training them to the benchmark dataset of collected task instances. Very recently, an adversarial approach has been emerging as a way of collecting examples that take AI systems to their limits and expose their vulnerabilities and limitations. However, this is a very time-consuming effort that relies on human computation, usually through crowdsourcing. We would like this kind of ‘Turing testing’ and ‘adversarial testing’ to become more dynamic and automated. We must recognise that this effort is cognitive and hence hard, but not impossible, to automate. In this talk I discuss some possibilities to automate ‘cognitively-adaptive testing’ based on adversarial (generative) models, and other more general techniques such as computerised adaptive testing, adaptive sampling, active learning, cognitively-precise evaluation, machine teaching, etc.

Ellen Voorhees, Senior Research Scientist, NIST (website)

Title: Evaluating Messy Tasks

Abstract: A messy task is one in which humans have valid differences of opinion as to the quality or correctness of a candidate response. Language processing is rife with messy tasks—humans disagree about the relevance of documents retrieved in response to a query, the acceptability of answers to a question (even for factoid questions), the quality of a document summary, the appropriateness of a translation, and the helpfulness of an explanation, to name a few. Evaluating automated systems for messy tasks is complicated by this lack of a true gold standard. Human annotators can be trained to converge on a single interpretation of quality, but that often defeats the purpose since eventual end users of the technology will neither know nor care about annotation standards. This talk will review how humans’ inherent disagreement has shaped the evaluation methodologies used to drive technology development for messy language tasks.
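
As a toy illustration of the disagreement the talk describes (not material from the talk itself), the sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, for two hypothetical annotators' binary relevance judgments; the judgment lists are invented placeholders.

```python
# Hedged sketch: quantifying annotator disagreement on a "messy" labeling task
# with Cohen's kappa. The relevance judgments below are made-up placeholders;
# any two aligned label lists would work.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently at their
    # observed label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Example: binary relevance judgments for ten query-document pairs.
annotator_1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0]
print(f"kappa = {cohen_kappa(annotator_1, annotator_2):.2f}")
```

A kappa near 1 indicates near-perfect agreement, while a value near 0 indicates agreement no better than chance; for the messy tasks described above, even trained annotators typically land well below 1.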

João Sedoc, Assistant Professor, New York University Stern School of Business (website)

Title: Evaluating Conversational Agents

Abstract: There has been a renewed focus on dialog systems, including non-task-driven conversational agents. Dialog is a challenging problem since it spans multiple conversational turns. To further complicate the problem, there are many possible valid utterances that may be semantically different. This makes automatic evaluation difficult, which is why the current best practice for analyzing and comparing dialog systems is the use of human judgments. This talk focuses on evaluation, presenting a theoretical framework for the systematic evaluation of open-domain conversational agents, including the usage of Item Response Theory (Lord, 1968) for efficient chatbot evaluation and evaluation set creation. We introduce ChatEval (https://chateval.org), a unified framework for human evaluation of chatbots that augments existing tools and provides a web-based hub for researchers to share and compare their dialog systems.
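
As a rough sketch of the Item Response Theory idea mentioned above (not ChatEval's actual implementation), the snippet below fits a one-parameter (Rasch) model to a made-up matrix of binary human judgments, estimating a latent "ability" per chatbot and a "difficulty" per evaluation prompt; all data and variable names are illustrative assumptions.

```python
# Hedged sketch: a one-parameter (Rasch) IRT model fit by gradient ascent.
# Rows are chatbots, columns are evaluation prompts, entries are binary human
# judgments ("acceptable reply" = 1). All numbers here are illustrative.
import math

judgments = [  # judgments[s][i]: system s on item i
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
]

n_sys, n_items = len(judgments), len(judgments[0])
ability = [0.0] * n_sys       # one latent "skill" per chatbot
difficulty = [0.0] * n_items  # one latent difficulty per prompt

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

lr = 0.1
for _ in range(2000):  # plain gradient ascent on the Bernoulli log-likelihood
    grad_a = [0.0] * n_sys
    grad_d = [0.0] * n_items
    for s in range(n_sys):
        for i in range(n_items):
            p = sigmoid(ability[s] - difficulty[i])
            err = judgments[s][i] - p
            grad_a[s] += err
            grad_d[i] -= err
    ability = [a + lr * g for a, g in zip(ability, grad_a)]
    difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
    # Anchor the scale: the model is only identified up to a shift, so center
    # difficulties at zero and shift abilities by the same amount.
    mean_d = sum(difficulty) / n_items
    difficulty = [d - mean_d for d in difficulty]
    ability = [a - mean_d for a in ability]

print("estimated abilities:   ", [round(a, 2) for a in ability])
print("estimated difficulties:", [round(d, 2) for d in difficulty])
```

Under a model like this, prompts whose estimated difficulty sits far from every system's ability contribute little information, which is one reason IRT is attractive for building smaller, more efficient evaluation sets.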

Courtney Napoles, Language Data Manager, Grammarly (website)

Title: Evaluating text generation with “gold-standard” labeled data

Abstract: Like other areas of AI, natural language processing relies on automatic evaluation to support the advancement of new models. Since much automatic evaluation is supervised, text generation poses a particular challenge because there is no single gold answer. Even the highest-quality labels provided by highly skilled human annotators are subject to bias and oversights. This talk will consider challenges encountered when evaluating text generation models with gold-standard datasets, including the risks of relying on a human-labeled gold standard, efficiently supporting automatic evaluation with human judgments, and the importance of establishing a feedback loop to reassess the metrics themselves as domains and models change.

Besmira Nushi, Senior Researcher, Microsoft Research AI (website)

Title: The Unpaved Path of Deploying Reliable and Human-Centered Machine Learning Systems

Abstract: As machine learning systems are increasingly becoming part of user-facing applications, their reliability and robustness are key to building and maintaining trust with users, especially for high-stakes domains such as healthcare. While advances in learning are continuously improving model performance in expectation and in isolation, there is an emergent need for identifying, understanding, and mitigating cases where models may fail in unexpected ways and therefore break human trust or dependencies with other, larger software ecosystems. Current development infrastructures and methodologies, often designed with traditional software in mind, still provide very little support to help practitioners debug and troubleshoot systems over time. This discussion will look at such problems through two different stakeholder lenses: machine learning practitioners and end-user decision makers. From a practitioner perspective, it will summarize some of the current gaps in tooling for responsible ML development and evaluation, and present ongoing work that can enable in-depth error analysis and careful model versioning. Next, from an end-user perspective, it will propose rethinking the optimization of machine learning models so that it takes into consideration human-centered properties of human-machine collaboration and partnership. While both of these lenses pose research and engineering challenges, they also require close collaboration with domain experts who are using machine learning in the field, to ensure that deployed systems meet real-world expectations.