Rigorous Evaluation for AI Systems Workshop

This year, the REAIS workshop will run a joint program with DEW 2020 (the Data Excellence Workshop) across two days:

Program Agenda

Note from organizers: Daylight saving time has already ended in Amsterdam (but not yet in the US), so take care with timezone conversions. We apologize for any confusion this has caused.

2:00pm CET / 6:00am PDT Tech Platform Onboarding
2:10pm CET / 6:10am PDT Introduction
2:25pm CET / 6:25am PDT Invited Talk: José Hernández-Orallo
3:05pm CET / 7:05am PDT Break
3:15pm CET / 7:15am PDT NLP Invited Talks: Ellen Voorhees, João Sedoc, Courtney Napoles
4:45pm CET / 8:45am PDT Panel discussion
5:05pm CET / 9:05am PDT Break
5:15pm CET / 9:15am PDT Reporting from discussions
5:20pm CET / 9:20am PDT Invited Talk: Besmira Nushi
6:00pm CET / 10:00am PDT Lightning Talks: Cynthia Freeman, Ian Beaver
6:20pm CET / 10:20am PDT Closing remarks
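For anyone double-checking the organizers' timezone caveat, the conversion can be verified with a short script. This is a minimal sketch using Python's standard-library `zoneinfo` (Python 3.9+); the date and start time are taken from the agenda above.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Workshop start: 2:00pm in Amsterdam on 25 October 2020.
# Europe ended daylight saving time that morning, so Amsterdam is on
# CET (UTC+1); the US stayed on daylight saving time until 1 November,
# so the US Pacific zone is still PDT (UTC-7) -- an 8-hour difference.
start_ams = datetime(2020, 10, 25, 14, 0, tzinfo=ZoneInfo("Europe/Amsterdam"))
start_pacific = start_ams.astimezone(ZoneInfo("America/Los_Angeles"))

print(start_ams.strftime("%H:%M %Z"))      # 14:00 CET
print(start_pacific.strftime("%H:%M %Z"))  # 06:00 PDT
```

Letting the timezone database resolve the offsets, rather than hard-coding "CET minus 8 hours", avoids exactly the kind of DST mismatch the note warns about.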

The last decade has seen massive progress in AI research powered by crowdsourced datasets and benchmarks such as ImageNet, Freebase, and SQuAD, as well as widespread adoption and increasing use of AI in deployed systems. A crucial ingredient has been the role of crowdsourcing in operationalizing empirical ways of evaluating, comparing, and assessing progress.

Crowdsourced techniques for evaluating AI systems’ success at tasks such as image labeling and question answering have proven to be powerful enablers for research. However, adoption of such approaches is typically driven by the mere existence and size of crowdsourced contributions, without proper scrutiny of their scope, quality, and limitations. While crowdsourcing has enabled a burst of published work on specific problems, we cannot determine whether that work has resulted in real progress without a deeper understanding of how datasets and benchmarks support the scientific or performance claims of the AI systems they evaluate. This workshop will provide a forum for growing our collective understanding of what makes an evaluation good, the value of improved datasets and collection methods, and how to decide when to invest in more robust data acquisition.

This year’s focus: Third Party And Independent Meta-evaluation

AI systems and datasets are often evaluated by measures that have only mathematical and theoretical components, while the real, physical world is messy, irregular, and subject to constant change. Some systems and approaches that perform well under the scrutiny of clean, elegant theory may still fail spectacularly in real-world applications, simply because the theory did not match reality. Traditionally, approaches such as crowdsourcing, human-in-the-loop decision systems, HCI mechanisms, and other human-centered sources of ground truth have been powerful complements used to fill in that “real-world” gap. However, because this human component of measurement and evaluation is often part of the same construction by the same creators, it shares the correlated biases and omissions of the rest of the system. To overcome these biases and omissions, an independent layer of third-party, human-centered evaluation, or “meta-evaluation,” may be needed. If this separate, external scrutiny is crafted from the perspective of the actual users and consumers of such datasets and systems, rather than the expectations of their creators, it can more accurately measure real-world performance and value.

This year, we will focus on how human computation can be used to craft external, independent evaluations of AI datasets and systems.

Key Dates

October 14, 2020: Extended abstracts and short papers due
October 18, 2020: Notification of acceptance
October 25, 2020: Rigorous Evaluation for AI Systems workshop
October 26, 2020: Sister workshop, the Data Excellence Workshop (DEW 2020)

Authors are invited to submit extended abstracts (2-4 pages) or short papers (4-6 pages), plus any number of additional pages containing references only.

Organizing committee

Bernease Herman
Sarah Luger
Kurt Bollacker
Maria Stone


The workshop will be held at the eighth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2020).


All questions about submissions should be emailed to rigorous-evaluation@googlegroups.com.