NeurIPS Data-Centric AI Workshop

Date: 14 December 2021
Location: Virtual


Data-Centric AI (DCAI) represents the recent transition from focusing on modeling to focusing on the underlying data used to train and evaluate models. Increasingly, common model architectures have begun to dominate a wide range of tasks, and predictable scaling rules have emerged. While building and using datasets has been critical to these successes, the endeavor is often artisanal: painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The DCAI movement aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems.

The main objective of this workshop is to cultivate the DCAI community into a vibrant interdisciplinary field that tackles practical data problems. We consider some of those problems to be: data collection/generation, data labeling, data preprocessing/augmentation, data quality evaluation, data debt, and data governance. Many of these areas are nascent, and we hope to further their development by knitting them together into a coherent whole. Together we will define the DCAI movement that will shape the future of AI and ML. Please see our call for papers below to take an active role in shaping that future! If you have any questions, please reach out to the organizers.

This workshop builds on a series of earlier workshops focusing on the role of data in AI.

Important Dates

Submission deadline: September 30, 2021
Notification of acceptance: October 22, 2021
Workshop: December 14, 2021


For questions, please check the FAQ.

Call for Papers

The ML community has a strong track record of building and using datasets for AI systems. But this endeavor is often artisanal: painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The core challenge, then, is to accelerate dataset creation and iteration while increasing the efficiency of use and reuse by democratizing data engineering and evaluation.

If 80 percent of machine learning work is data preparation, then ensuring data quality is the most important work of a machine learning team and therefore a vital research area. Human-labeled data has increasingly become the fuel and compass of AI-based software systems, yet innovative efforts have mostly focused on models and code. The growing focus on the scale, speed, and cost of building and improving datasets has affected quality, which is itself nebulous and often circularly defined, since the annotators are the source of both the data and the ground truth [Riezler, 2014]. The development of tools for making repeatable and systematic adjustments to datasets has also lagged. While dataset quality remains the top concern, the ways in which it is measured in practice are poorly understood and sometimes simply wrong. A decade later, we see some cause for concern: fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], lack of documentation and replication of data [Katsuno et al., 2019], and unrealistic performance metrics [Bernstein, 2021].

We need a framework for excellence in data engineering, and none exists yet. In the first-to-market rush with data, the maintainability, reproducibility, reliability, validity, and fidelity of datasets are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection. Building an active research community focused on Data-Centric AI is an important part of defining the core problems and creating ways to measure progress in machine learning through data quality tasks.

Submission Instructions

We welcome short papers (1-2 pages) and long papers (4 pages) addressing one or more of the topics of interest below. All papers need to be formatted according to the NeurIPS 2021 Formatting Instructions. Papers will be peer-reviewed by the program committee, and accepted papers will be presented as lightning talks during the workshop. If you have any questions about submission, please check the FAQ first. Email us only if your question is not answered there or if you experience problems with the submission site.

Topics of Interest

The Data-Centric AI workshop invites position papers from researchers and practitioners on topics that include, but are not limited to, the following:

New Datasets in areas:

  • Speech, vision, manufacturing, medical, recommendation/personalization
  • Science:

Tools & methodologies for accelerating open-source dataset iteration:

  • Tools that quantify and accelerate time to source and prepare high quality data
  • Tools that ensure that the data is labeled consistently, such as label consensus
  • Tools that make improving data quality more systematic
  • Tools that automate the creation of high quality supervised learning training data from low quality resources, such as forced alignment in speech recognition
  • Tools that produce consistent and low noise data samples, or remove labeling noise or inconsistencies from existing data
  • Tools for controlling what goes into the dataset and for making high level edits efficiently to very large datasets, e.g. adding new words, languages, or accents to speech datasets with thousands of hours
  • Search methods for finding suitably licensed datasets based on public resources
  • Tools for creating training datasets for small data problems, or for rare classes in the long tail of big data problems
  • Tools for timely incorporation of feedback from production systems into datasets
  • Tools for understanding dataset coverage of important classes, and editing them to cover newly identified important cases
  • Dataset importers that allow easy combination and composition of existing datasets
  • Dataset exporters that make the data consumable for models and interface with model training and inference systems, such as webdataset
  • System architectures and interfaces that enable composition of dataset tools, such as MLCube, Docker, and Airflow
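
As one concrete illustration of the label-consensus topic listed above, the following is a minimal sketch of majority-vote consensus across annotators. The function name `label_consensus`, the `min_agreement` threshold, and the example ids are hypothetical and chosen only for illustration; real tools in this area may use weighted or model-based aggregation instead.

```python
from collections import Counter


def label_consensus(annotations, min_agreement=0.6):
    """Majority-vote consensus per example (illustrative sketch only).

    annotations: dict mapping an example id to the list of labels
                 assigned by different annotators.
    min_agreement: hypothetical threshold; the share of annotators that
                   must agree before a label is accepted.

    Returns (consensus, flagged): consensus maps example id to the
    winning label, and flagged lists ids that need another look.
    """
    consensus = {}
    flagged = []
    for example_id, labels in annotations.items():
        if not labels:
            # No annotations at all: nothing to vote on.
            flagged.append(example_id)
            continue
        label, votes = Counter(labels).most_common(1)[0]
        agreement = votes / len(labels)
        if agreement >= min_agreement:
            consensus[example_id] = label
        else:
            flagged.append(example_id)
    return consensus, flagged


# Made-up annotations for three audio clips.
annotations = {
    "clip_001": ["cat", "cat", "dog"],
    "clip_002": ["cat", "dog", "bird"],
    "clip_003": ["dog", "dog", "dog"],
}
consensus, flagged = label_consensus(annotations)
print(consensus)
print(flagged)
```

Examples that land in `flagged` are candidates for relabeling or for clearer labeling instructions, which is one simple way to make label consistency measurable.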

Algorithms for working with limited labeled data and improving label efficiency:

  • Data selection techniques such as active learning and core-set selection for identifying the most valuable examples to label.
  • Semi-supervised learning, few-shot learning, and weak supervision methods for maximizing the power of limited labeled data.
  • Transfer learning and self-supervised learning approaches for developing powerful representations that can be used for many downstream tasks with limited labeled data.
  • Novelty and drift detection to identify when more data needs to be labeled.
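
To make the data selection topic above more concrete, the following is a minimal sketch of one common active learning heuristic, entropy-based uncertainty sampling. It assumes a model has already produced class probabilities for each unlabeled example; the function names, the `budget` parameter, and the example ids are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a list of class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` examples the model is least certain about.

    predictions: dict mapping an example id to the model's predicted
                 class probabilities for that example.
    budget: number of examples that can be sent to annotators.
    """
    ranked = sorted(
        predictions,
        key=lambda example_id: entropy(predictions[example_id]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Made-up model outputs for three unlabeled examples.
predictions = {
    "a": [0.98, 0.02],
    "b": [0.55, 0.45],
    "c": [0.80, 0.20],
}
print(select_for_labeling(predictions, budget=2))  # ['b', 'c']
```

Entropy is only one possible uncertainty score; margin-based scores or core-set selection would plug into the same selection loop.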

Responsible AI development:

  • Fairness, bias, and diversity evaluation and analysis for datasets and models/algorithms
  • Tools for green AI hardware-software system design and evaluation
  • Scalable, reliable training methods and systems
  • Tools, methodologies, and techniques for private, secure machine learning training
  • Efforts toward reproducible AI, such as data cards and model cards

Organizing Committee

Andrew Ng, Landing AI, DeepLearning.AI
Lora Aroyo, Google Research
Cody Coleman, Stanford University
Greg Diamos, Landing AI
Vijay Janapa Reddi, Harvard University
Joaquin Vanschoren, Eindhoven University of Technology
Sharon Zhou, Stanford University