Evaluating Evaluation of AI Systems (Meta-Eval 2020)
Workshop at the Thirty-Fourth AAAI Conference on Artificial Intelligence
Submission Deadline: November 27, 2019
Notification of acceptance: December 13, 2019
Final camera-ready papers due: December 18, 2019
Call for papers
The last decade has seen massive progress in AI research powered by crowdsourced datasets and benchmarks such as ImageNet, Freebase, and SQuAD, as well as widespread adoption and increasing use of AI in deployed systems. A crucial ingredient is the role of crowdsourcing in operationalizing empirical ways of evaluating, comparing, and assessing this progress.
The focus of this workshop is not on evaluating AI systems, but on evaluating the quality of evaluations of AI systems. When these evaluations rely on crowdsourced datasets or methodologies, we are interested in the meta-questions around characterization of those methodologies. Some of the expected activities in the workshop include:
- Asking the question of "what makes evaluations good?"
- Defining "what good looks like" in evaluations of different types of AI systems (image recognition, recommender systems, search, voice assistants, etc.).
- Collecting, examining and sharing current evaluation efforts, whether comprehensive evaluations of one system or competitive evaluations of multiple systems, with the goal of critically evaluating the evaluations themselves
- Developing an open repository of existing evaluations with methodology fully documented and raw data and outcomes available for public scrutiny
Crowdsourced datasets for evaluating AI systems’ success at tasks such as image labeling and question answering have proven powerful enablers for research. However, adoption of such datasets is typically driven by the mere existence and size of a dataset, without proper scrutiny of its scope, quality, and limitations. While crowdsourcing has enabled a burst of published work on specific problems, determining whether that work has resulted in real progress requires a deeper understanding of how the dataset supports the scientific or performance claims of the AI systems it is evaluating. This workshop will provide a forum for growing our collective understanding of what makes a dataset good, the value of improved datasets and collection methods, and how to inform decisions about when to invest in more robust data acquisition.
Topics of Interest
We invite scientific contributions and position papers on the following topics:
- META-EVALUATION: Quality of evaluation approaches, datasets / benchmarks
- Characteristics of a ‘good’ dataset / benchmark?
- Shortcomings of existing evaluation approaches, datasets / benchmarks?
- Building new / improving existing metrics
- Measuring trustworthiness, interpretability and fairness of crowdsourced benchmark datasets
- Measuring added value of improvements to previous versions of benchmark datasets
- Comparative evaluations between mainstream AI systems, e.g. recommenders, voice assistants, etc.
- Measuring quality of guidelines for content moderation, search evaluation, etc.
- Comparison of results between offline (e.g. crowdsourced) and online (e.g. A/B testing) evaluations?
- Open questions and challenges in meta-evaluation?
- TRANSPARENCY: Making quality and characteristics of (crowdsourced) benchmark datasets transparent and explainable
- Reproducibility of crowdsourced datasets
- Replicability of crowdsourced evaluations of AI systems
- Explainability of crowdsourced evaluations to different stakeholders, e.g. users, scientists, developers
- RESOURCE BUILDING: Making existing evaluation methodologies, raw data and outcomes, discoverable, fully documented and available for public scrutiny
- How do we make evaluations and related datasets archival and discoverable?
- What can we learn from other systematic evaluation efforts and communities such as TREC, SIGIR, etc.?
- Matt Lease, UT Austin
- Paul Tepper, Nuance
- Sid Suri, Microsoft
- Danna Gurari, UT Austin
- Anbang Xu, IBM
- Chris Welty, Google
- Lora Aroyo, Google
- Omar Alonso, Microsoft
- Walter Lasecki, Michigan
- Sarah Luger, Orange
- Alex Quinn, Purdue
- Brad Klingenberg, StitchFix
- Ka Wong, Google
- Panos Ipeirotis, NYU
Email: rigorous-evaluation AT googlegroups.com
- Praveen Paritosh, Google
- Kurt Bollacker, Long Now Foundation
- Maria Stone, Spotify
- Lora Aroyo, Google