1st Data Excellence Workshop (DEW 2020)

At the 8th AAAI Conference on Human Computation and Crowdsourcing

Date: 26 October 2020
Time PT: 6:00am-10:20am
Time EST: 9:00am-1:20pm
Time CET: 2:00pm-6:20pm
Location: Virtual


Human-annotated data is crucial for operationalizing empirical ways of evaluating, comparing, and assessing the progress of ML/AI research. As human-annotated data represents the compass that the entire ML/AI community relies on, the human computation (HCOMP) research community has a multiplicative effect on the progress of the field. Optimizing the cost, size, and speed of collecting data has attracted significant attention from HCOMP and related research communities. In the rush to be first to market with data, aspects of maintainability, reliability, validity, and fidelity of datasets are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection.

Panos Ipeirotis [Ipeirotis, 2010], one of the founders of the HCOMP research community, warned us that crowdsourced data collection platforms had the structural properties of a market for lemons [Akerlof, 1967]. Due to uncertainty in the notion of quality, the market focuses on price, resulting in an equilibrium state where the good sellers are priced out of the market and only lemons remain. The focus on scale, speed, and cost of building datasets has come at the expense of quality, which is nebulous and often circularly defined, since the annotators are the source of both data and ground truth [Riezler, 2014]. While dataset quality is still the top concern everyone has, the ways in which it is measured in practice are poorly understood and sometimes simply wrong. A decade later, we see some cause for concern: fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], and lack of documentation and replication of data [Katsuno et al., 2019].

Currently, data excellence happens organically due to appropriate support, expertise, diligence, commitment, pride, community, etc. However, data excellence is more than maintaining a minimum standard for the ways in which we collect, publish or assess our data. It’s a metascientific enterprise of recognizing what’s important in the long term for science.

This workshop continues a series of preceding workshops on related topics:

This year, the DEW workshop will run a joint program with REAIS 2020 across two days:

Important Dates

Submission deadline: October 15, 2020
Notification of acceptance: October 20, 2020
Final camera-ready papers due: October 24, 2020
Workshop: October 26, 2020 (PT: 6:00am-10:20am, EST: 9:00am-1:20pm, CET: 2:00pm-6:20pm)

Call for Papers

Decades of catastrophic failures in high-stakes software projects (e.g., explosions of billion-dollar spacecraft in the 60s and 90s due to floating point overflows and missing hyphens in source code) have burnt the vital importance of upfront investments in software engineering excellence into our collective memory. It was through careful post-hoc analysis of these kinds of disasters that software engineering matured as a field and achieved a more robust understanding of the costs and benefits that come with processes like systematic code reviews, standards like coding conventions and design patterns, infrastructure for debugging and experimentation, and incentive structures that prioritize careful quality control over hasty roll-outs.

With the rise of artificial intelligence, human-labeled data has increasingly become the fuel and compass of AI-based software systems. However, an analogous framework for excellence in data engineering does not yet exist, which carries the risk that similar catastrophes will arise from the use of insufficient datasets in AI-based software systems.

This workshop aims to leverage lessons learned from decades of excellence in software engineering to inspire and enable an analogous framework for data excellence in AI systems.

The outcomes of this workshop will be:

  • defining properties and metrics of data excellence
  • gathering and reviewing various case studies of data excellence and data catastrophes
  • building a catalog of best practices and incentive structures for data excellence
  • discussing the cost-benefit trade-off for investments in data excellence
  • gathering and reviewing empirical and theoretical methodologies for reliability, validity, maintainability, fidelity of data

We welcome short papers (1-2 pages) and long papers (4 pages) addressing one or more of the topics of interest below. Papers will be peer-reviewed by the program committee and accepted papers will be presented as lightning talks during the workshop.

Submission Link

Topics of Interest

We invite scientific contributions and position papers on the following topics:

Maintainability: Maintaining data at scale, e.g., Knowledge Graph [Bollacker et al., 2008], poses challenges similar to maintaining software at scale, or potentially to knowledge engineering. Data engineering often refers to data munging tasks, and is far more important and challenging than has been appreciated thus far. The fastest thing to do isn’t the most maintainable or reusable:

  • What lessons for maintaining data at scale could we learn or adapt from software and knowledge engineering at scale?
  • Methods and tools for data documentation (e.g., [Gebru, 2018]), management of owners/maintainers, forking/replication
  • Data should not be abandoned and should not be disposable
  • What if it is a one-off dataset?
    • There is no good software that is abandoned.
    • Even a one-off dataset requires maintenance over time, extension to other languages, addition of new word pairs, updates, scrutiny.
    • How do we help other people to contribute to growing and improving this data?
  • Open infrastructure for supporting maintenance and growth

Reliability: Reliability captures internal aspects of the validity of data, such as consistency, replicability, and reproducibility. Irreproducible data allows us to draw whatever conclusions we want to draw, while giving us the facade of being data-driven, when it is dangerously hunch-driven:

  • Irreproducible data is not maintainable
  • Reliable mechanisms to account for the human aspects of data collection
  • Gathering data from humans requires acknowledging our humanity: frailty, subjectiveness, imperfections, knowledge gaps, etc.
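
As one illustration of how the consistency aspect of reliability might be quantified, the sketch below computes Cohen's kappa, a chance-corrected agreement score between two annotators who labeled the same items. This is a minimal example under stated assumptions, not a method prescribed by the workshop: the function name `cohens_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical sentiment labels from two annotators (illustrative only).
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

A value near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance. Kappa covers only two annotators and categorical labels; other settings would need a different statistic.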

Validity: Validity tells us about how well the data helps us explain things related to the phenomena captured by the data, e.g., via correlation between the data and external measures:

  • Does the specific operationalization of a data collection procedure (e.g., answering a certain question on a 5-point Likert scale) accurately represent and allow generalizations to the abstract construct that is being modeled (e.g., human sentiment reactions to sentences)?
  • Was the construct properly defined before it was operationalized (e.g., getting convicted of a crime is not necessarily equivalent to having committed the crime)?
  • Does the data operationalization account for potential complexity, subjectivity, multi-valence or ambiguity of the intended construct?
  • Does the data operationalization predict features of the represented construct (e.g., education research tries to explore whether grades are valid by studying their correlations to external indicators of success)?
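
The correlation-based check mentioned above can be sketched as follows. This is only an illustration, assuming numeric annotations (for example, mean Likert ratings per item) and a numeric external measure; the function name `pearson_r` and all sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between annotation scores and an external measure."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: mean Likert rating per item and an external outcome score.
likert_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
external_scores = [0.1, 0.3, 0.2, 0.7, 0.9]
r = pearson_r(likert_ratings, external_scores)
print(f"Correlation with external measure: {r:.3f}")
```

A strong correlation is evidence for validity but does not establish it; a rank-based statistic may suit ordinal scales better.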

Fidelity: Users of data often assume that the dataset accurately and comprehensively represents the phenomenon, which is almost never the case. For example, if the sentences in a sentiment corpus were sampled from Wikipedia, a model built on it might not work as well on the news headlines required by the product:

  • Importance of a balanced / unbiased / representative dataset
  • Sampling, e.g.:
    • Temporal splitting can introduce bias if not done right in cases such as signal processing or sequential learning.
    • User-based splitting that does not keep each user's data separated is a potential source of bias, e.g., if data from the same user appears in both the train and test sets.
  • Does the dataset live up to scientific standards of data collection?
  • Correlation between the data and external measures or objective outcomes
  • Problem of circular data validation (i.e., validating data by itself)
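
The two sampling pitfalls listed above can be avoided with splits along the lines of the sketch below. It is a minimal illustration under stated assumptions: each record is a dictionary with hypothetical `user` and `time` fields, and the function names are invented for this example.

```python
import hashlib


def assign_split(user_id, test_fraction=0.2):
    """Deterministically map a user to 'train' or 'test' via a hash of the id."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_by_user(records, test_fraction=0.2):
    """Keep all records of a given user on the same side of the split."""
    train, test = [], []
    for record in records:
        side = assign_split(record["user"], test_fraction)
        (test if side == "test" else train).append(record)
    return train, test


def split_by_time(records, test_fraction=0.2):
    """Train on the earliest records and test on the latest ones."""
    ordered = sorted(records, key=lambda r: r["time"])
    cut = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical records: 10 users, 50 examples, integer timestamps.
records = [
    {"user": f"u{i % 10}", "time": i, "text": f"example {i}"} for i in range(50)
]
user_train, user_test = split_by_user(records)
time_train, time_test = split_by_time(records)
```

Hashing the user identifier keeps the assignment stable when the dataset grows, at the cost of split sizes that only approximate `test_fraction`. The temporal split assumes that later records must never inform predictions about earlier ones.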

Organizing Committee

Praveen Paritosh, Google
Matt Lease, Amazon and UT Austin
Mike Schaekermann, Amazon and University of Waterloo
Lora Aroyo, Google

Program Committee

Chris Welty, Google Research
Ana Rogers, University of Copenhagen
Omar Alonso, Curai
Markus Krause, Brainworks
Juho Kim, KAIST
Alex Quinn, Purdue University
Owen Rambow, Elemental Cognition



PT        EST       CET      Session
6:00 AM   9:00 AM   2:00 PM  Intro
6:10 AM   9:10 AM   2:10 PM  Invited talk: Ben Hutchinson
6:30 AM   9:30 AM   2:30 PM  Invited talk: Ian Soboroff
6:50 AM   9:50 AM   2:50 PM  Invited talk: Andrea Olgiati
7:10 AM   10:10 AM  3:10 PM  3x Lightning Talks
7:25 AM   10:25 AM  3:25 PM  Short Break
7:35 AM   10:35 AM  3:35 PM  Talks Discussion
8:05 AM   11:05 AM  4:05 PM  Invited talk: Emily Dinan
8:25 AM   11:25 AM  4:25 PM  Invited talk: Aleksander Madry
8:45 AM   11:45 AM  4:45 PM  Short Break
8:55 AM   11:55 AM  4:55 PM  Invited talk: Quang Duong
9:15 AM   12:15 PM  5:15 PM  Invited talk: Peter Hallinan
9:35 AM   12:35 PM  5:35 PM  Talks Discussion
10:05 AM  1:05 PM   6:05 PM  Wrap Up
10:20 AM  1:20 PM   6:20 PM  End of Workshop

Accepted Papers

  • Reducing Annotation Artifacts in Crowdsourcing Datasets for Natural Language Processing
    Donghoon Han, Juho Kim and Alice Oh

  • Machine Learning Training to Support Diversity of Opinion
    Johanne Christensen and Benjamin Watson

  • Data Desiderata: Reliability and Fidelity in High-stakes AI
    Shivani Kapania, Nithya Sambasivan, Kristen Olson, Hannah Highfill, Diana Akrong, Praveen Paritosh and Lora Aroyo

Invited Talks

Aleksander Madry (MIT)

What Do Our Models Learn?

Large-scale vision benchmarks have driven—and often even defined—progress in machine learning. However, these benchmarks are merely proxies for the real-world tasks we actually care about. How well do our benchmarks capture such tasks?

In this talk, I will discuss the alignment between our benchmark-driven ML paradigm and the real-world use cases that motivate it. First, we will explore examples of biases in the ImageNet dataset, and how state-of-the-art models exploit them. We will then demonstrate how these biases arise as a result of design choices in the data collection and curation processes.

Throughout, we illustrate how one can leverage relatively standard tools (e.g., crowdsourcing, image processing) to quantify the biases that we observe.

Based on joint works with Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Jacob Steinhardt, Dimitris Tsipras and Kai Xiao.


Aleksander Madry is a Professor of Computer Science in the MIT EECS Department and a Principal Investigator in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He received his PhD from MIT in 2011 and, prior to joining the MIT faculty, he spent some time at Microsoft Research New England and on the faculty of EPFL.

Aleksander's research interests span algorithms, continuous optimization, the science of deep learning, and understanding machine learning from robustness and deployability perspectives. His work has been recognized with a number of awards, including an NSF CAREER Award, an Alfred P. Sloan Research Fellowship, an ACM Doctoral Dissertation Award Honorable Mention, and a Presburger Award.

Emily Dinan (Facebook AI)

Adversarial Data Collection for Chatbot Safety

Models trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which may include offensive or otherwise toxic behavior and unwanted biases. I will discuss several methods for mitigating these issues in neural dialogue models. In particular, I will highlight the use of iterative, adversarial human-and-model-in-the-loop data collection and model training procedures for the purpose of making chatbots robust to adversarial attack from humans. In contrast to static benchmarks, these approaches yield a “moving goalpost” for natural language understanding systems, and recent work shows that models trained on these increasingly complex tasks are more robust to adversarial attack.


Emily is a Research Engineer at Facebook AI Research in New York. She graduated with a Master's degree in Mathematics from the University of Washington. Emily's research interests include conversational AI, natural language processing, and fairness in these fields.

Ian Soboroff (NIST)

TREC: Building datasets within a community of use

NIST's Text Retrieval Conference (TREC) and TRECVID evaluations of multimedia search are long-running evaluation workshops that develop datasets for search ranking, classification, and other AI tasks, along with appropriate metrics that the research community can use to gauge progress in those tasks. Since the start of TREC in 1991, NIST has developed a large store of practical experience in collecting, assembling, labeling, and measuring the quality of datasets. In this brief talk, I will describe the history of TREC relevance assessments, the processes we use to determine those labels, and research on how to make those processes more robust, reliable, and inexpensive.


Dr. Ian Soboroff is a computer scientist and leader of the Retrieval Group at the National Institute of Standards and Technology (NIST). The Retrieval Group organizes the Text REtrieval Conference (TREC), the Text Analysis Conference (TAC), and the TREC Video Retrieval Evaluation (TRECVID). These are all large, community-based research workshops that drive the state-of-the-art in information retrieval, video search, web search, information extraction, text summarization and other areas of information access. He has co-authored many publications in information retrieval evaluation, test collection building, text filtering, collaborative filtering, and intelligent software agents. His current research interests include building test collections for complex information seeking tasks.

Quang Duong (Google Health)

Asynchronous adjudication in labeling medical data: lessons, improvements and expansions

Crowdsourcing has enabled the collection, aggregation and refinement of human knowledge and judgment, i.e. ground truth, for problem domains with data of increasing complexity and scale. A lack of comprehensive diagnostic reference standards in medicine significantly increases the complexity of labeling medical data. Asynchronous adjudication, a newly introduced method of facilitating asynchronous discussions among remote healthcare professionals, has enabled the generation of such ground truth, i.e. medical diagnoses, with high quality at scale, and consequently, the development of machine learning based medical applications.

In this talk, I will present the challenges of managing labellers, who often are medical experts, involved in asynchronously adjudicating medical labels. The discussion will focus on what information labellers find important to doing their jobs efficiently, and how such information impacts their work’s speed and quality. I next propose improvements to the current asynchronous adjudication mechanism. Last, I will describe a novel patented async adjudication expansion, designed and built to address the challenges of resolving disagreements among experts on visual annotations of medical images.


I received my Ph.D. in Computer Science and Engineering from the University of Michigan, Artificial Intelligence Lab. From 2014 to 2018, I led and managed a data quality team of engineers and data scientists at Google Maps, responsible for building machine learning time-series models that detect and predict visit patterns useful in assisting users to plan their trips efficiently. We built innovative location-based systems for crowdsourcing users’ assessments of their physical surroundings in real time. Since 2018, I have been leading a team tasked with developing a crowdsourcing platform for medical labeling and diagnoses in Health, a new healthcare division at Google. The platform's output of high-quality labeled data, generated securely and reliably at scale, is imperative to Google Health's efforts in building and deploying state-of-the-art machine-learning-powered diagnostic solutions, including ML models for diabetic retinopathy and breast cancer detection.

Andrea Olgiati (Amazon AI)

Data =? Software

Managing large amounts of data in a rigorous way is key to the success of ML initiatives. Yet, we still find that many organizations often do not have a complete view of their data, leading to sub-optimal and biased solutions. What can we learn from the world of software development to build a better-controlled system? How can the creation of controlled datasets become less of an art and more of a science?


Andrea is a Senior Principal Engineer at AWS, where he is the technical leader for the AWS SageMaker managed ML platform. In the past he has worked on computer vision, databases and chip design. He holds a Laurea degree from Politecnico di Milano in Italy.

Peter Hallinan (AI & Data @ Amazon)

Getting good data faster

The AI industry is collectively fielding tens of thousands of ML applications, many of which require human-labeled data (HLD) for training, evaluation or production “patching”. We identify business and operational challenges to producing high-quality HLD at scale, and review the pros and cons of common organizational solutions: outsourcing HLD to third-party vendors, insourcing HLD to a product or engineering sub-team, insourcing HLD to a specialized HLD function, letting science contributors fend for themselves, and hybrids of these.


Peter W. Hallinan leads ML initiatives in human-labeled data at Amazon. Prior to Amazon, he was a partner at Blindsight (computer vision for the blind, sold to AMZN), and at Techcelerator, a venture development firm. Earlier, he was a vice president at Analysis Group / Integral, where he specialized in growth strategy and the management of innovation for Fortune 500 companies in high-tech, life science, and finance. He holds A.B., S.M., and Ph.D. degrees from Harvard University, where he conducted research in pattern recognition and artificial intelligence, co-authored an early book on face recognition, and moonlighted as a teaching fellow in policy analysis and experimental design at the John F. Kennedy School of Government. His volunteer activities have included serving as a consulting professor at the Stanford School of Medicine, and leading the American Chamber of Commerce in Madagascar.

Ben Hutchinson (Google Research)

Towards Accountability for Machine Learning Datasets: Lessons from Software Engineering

Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However, the datasets which empower machine learning are often used, shared and re-used with little visibility into the processes of deliberation which led to their creation. In this talk, I introduce a rigorous framework for dataset development transparency which supports decision-making and accountability. The framework uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle. Each stage of the data development lifecycle yields a set of documents that facilitate improved communication and decision-making, and that draw attention to the value and necessity of careful data work. The proposed framework is intended to contribute to closing the accountability gap in artificial intelligence systems, by making visible the often overlooked work that goes into dataset creation.


Ben Hutchinson is a Senior Engineer in Google Research, working on artificial intelligence, dataset development practices, data ethics, fairness, and accountability. His research focuses on incorporating knowledge and practices from a variety of engineering and social science disciplines in order to inform the ethical development of AI. Prior to joining Google Research, he spent ten years writing code for a variety of products such as Google Maps, Knowledge Graph, and Google Search. He uses this engineering experience to help bridge the gaps between research and engineering, and between theory and practice.