A Community Library for NLP by Hugging Face


There has been enormous research into the applications of NLP since its inception. Nowadays we have powerful tools like BERT that let us build a robust NLP model quickly. When preparing such models we often spend a lot of time collecting suitable data, which means going through various repositories like Kaggle, UCI ML, etc. So is there a way to access a variety of this data in one place? The answer is yes. A few months ago, Hugging Face introduced a community library called Datasets, which provides over 600 publicly available datasets in a standard format, covering 467 different languages. In this article, we are going to discuss this library and see concretely how we can take advantage of it. The main points to be discussed are listed below.


  1. Need for this community library
  2. Library design
  3. Implementation in Python

Let’s start the discussion by understanding the need for this framework.


The size, variety, and number of publicly available natural language processing (NLP) datasets have grown rapidly as researchers come up with new tasks, larger models, and novel benchmarks. Curated datasets are used for evaluation and benchmarking; supervised datasets are used for model training and fine-tuning; and massive unsupervised datasets are needed for pre-training and language modeling. Each type of dataset differs in scale, granularity, and structure, as well as in its annotation approach.

In the past, new dataset paradigms have been essential in advancing NLP. Today’s NLP systems consist of a pipeline that includes a wide variety of datasets with varying dimensions and annotation levels. Several datasets are used for pre-training, fine-tuning, and benchmarking. As a result, the number of datasets available to the NLP community has exploded. As the number of datasets grows, significant issues such as interface standardization, versioning, and documentation arise.

One should be able to work with a variety of datasets without having to use multiple interfaces. Also, a group of people working on the same dataset should know that they are all using the same version. Finally, interfaces should not have to change as datasets grow in size.

This is where Datasets comes into play. Datasets is a modern community library created to support the NLP community. It aims to standardize end-user interfaces, versioning, and documentation while also providing a lightweight front end that behaves the same for small datasets as for internet-scale corpora.

The library was built with a distributed, community-driven approach to adding datasets and documenting their usage. After a year of development, the library has over 650 unique datasets and over 250 contributors, and it has supported many original research initiatives on shared datasets and tasks.

Datasets is a community library dedicated to managing data and access issues while promoting community culture and standards. The project has hundreds of contributors from all over the world, and every dataset is tagged and documented. Each dataset should be in a standard tabular format that can be versioned and cited; datasets are compute- and memory-efficient by default, and they work well with tokenization and featurization.

Library design

Users can access a dataset by simply referring to its global name. Each dataset has its own feature schema and its own metadata. Users don't need to load the entire dataset: Datasets provides three splits (typically train, validation, and test) for almost all datasets, and users can load them separately and access examples by indexing. In addition, we can apply various pre-processing steps directly to the corpus.

Datasets divides its workflow into four simple steps, as follows.

Retrieving and creating datasets

The underlying raw datasets are not hosted by Datasets; instead, it uses a distributed approach and accesses the data hosted by the original authors. Each dataset has a community-contributed builder module. The builder module is responsible for converting raw data, such as text or CSV files, into a standardized dataset interface.

Data point representation

Internally, each constructed dataset is represented as a table with typed columns. A variety of common and NLP-targeted types are available in the Datasets type system. Besides atomic values (ints, floats, strings, and binary blobs) and JSON-like dicts and lists, the library also includes named categorical class labels, sequences, paired translations, and higher-dimensional arrays for images, videos, or waveforms.

Memory access

Datasets is built on Apache Arrow, a cross-language columnar data framework. Arrow provides a local caching system that allows datasets to be backed by a memory-mapped on-disk cache for fast lookup. This architecture allows large datasets to be used on machines with relatively little memory. Arrow also allows zero-copy hand-offs to popular machine learning tools like NumPy, Pandas, PyTorch, and TensorFlow.
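The memory-mapping idea can be sketched with the Python standard library alone. This is not how Arrow is implemented; it is a toy fixed-width column file, with a made-up layout and helper names, showing that a single row can be read from disk without loading the whole file into memory.

```python
import mmap
import os
import struct
import tempfile

# Toy layout (an assumption): one little-endian 64-bit integer per row.
RECORD = struct.Struct("<q")

def write_column(path, values):
    """Write the values to disk as fixed-width records."""
    with open(path, "wb") as f:
        for value in values:
            f.write(RECORD.pack(value))

def read_row(path, index):
    """Read one row through a memory map; only the touched pages are loaded."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, index * RECORD.size)[0]

column_path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(column_path, range(100_000))
row = read_row(column_path, 54_321)
print(row)
```

Because every record has the same width, the byte offset of a row is simply its index times the record size, which is what makes random access cheap.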

User processing

The library provides access to typed data with minimal preprocessing at download time. It includes sorting, shuffling, splitting, and filtering functions for manipulating datasets. For complex manipulations it has a powerful map function, which supports arbitrary Python functions for creating new in-memory tables. The map can be run in batched, multi-process mode to apply parallel processing to large datasets. Data processed by the same function is also cached automatically between sessions.

The complete flow of the request

When you request a dataset, it is downloaded from its original host. This triggers the execution of the dataset-specific builder code, which converts the raw text to a typed tabular format that conforms to the feature schema and caches the table. The user receives a memory-mapped table. The user can then run any vectorized code for additional data processing, such as tokenization, and cache the results.
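The flow described above can be mimicked with a toy, standard-library-only sketch: a dataset-specific generator turns raw lines into typed rows, the rows are checked against a schema, and the resulting table is cached so that a second request skips the conversion. The file format, schema, and function names here are assumptions for illustration and do not reflect the library's internals.

```python
import json
import os
import tempfile

SCHEMA = {"text": str, "label": int}  # hypothetical feature schema

def generate_examples(raw_lines):
    """Dataset-specific generator: parse 'label<TAB>text' lines into typed rows."""
    for line in raw_lines:
        label, text = line.rstrip("\n").split("\t", 1)
        yield {"text": text, "label": int(label)}

def conforms(row):
    """Check that a row has exactly the schema's columns with the right types."""
    return set(row) == set(SCHEMA) and all(
        isinstance(row[name], kind) for name, kind in SCHEMA.items()
    )

def load_table(raw_lines, cache_path):
    """Return (rows, from_cache); convert and cache only on the first request."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return [json.loads(line) for line in f], True
    rows = list(generate_examples(raw_lines))
    if not all(conforms(row) for row in rows):
        raise ValueError("row does not match the schema")
    with open(cache_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return rows, False

raw = ["1\tgreat film\n", "0\tdull plot\n"]
cache_file = os.path.join(tempfile.mkdtemp(), "toy_cache.jsonl")
first, first_cached = load_table(raw, cache_file)
second, second_cached = load_table(raw, cache_file)
print(first_cached, second_cached)
```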

Python implementation

In this section, we see in practice how we can take advantage of Datasets to build NLP applications. We will first see how to inspect and load a dataset, then preprocess it and make it compatible with a model. Let's start by installing and importing dependencies.

! pip install datasets
! pip install transformers
from datasets import load_dataset, load_dataset_builder

It is often useful to quickly get all the relevant information about a dataset before taking the time to download it. The load_dataset_builder() function lets you inspect the attributes of a dataset without having to download it.

dataset_builder = load_dataset_builder('imdb')
# get feature information
print(dataset_builder.info.features)
# get split information
print(dataset_builder.info.splits)

Once you find the dataset you want, load it in a single line with load_dataset(). You can see the entire schema just by printing the variable. You can even export it to a CSV file with to_csv().

data = load_dataset('imdb',split="train")

So far, we've seen how to load a dataset from the Hugging Face Hub and access the data it contains. We will now tokenize our data and prepare it for a framework like TensorFlow. By default, all columns in the dataset are returned as Python objects, so they need to be formatted to be compatible with TensorFlow types.

To get started, let's take a look at tokenization. Tokenization is the process of splitting text into individual units called tokens. The tokens are then converted to numbers, which the model uses as input. The first step is to import a tokenizer. To ensure that the text is split consistently, we need to use the tokenizer associated with the model. Since we are using a BERT model in this example, we load the BERT tokenizer.
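The text-to-numbers step can be illustrated with a toy whitespace tokenizer. This is only a sketch: the real BERT tokenizer splits text into WordPiece subwords and adds special tokens, and the vocabulary and `encode` helper below are invented for the example.

```python
PAD, UNK = "[PAD]", "[UNK]"

# Hypothetical vocabulary mapping tokens to integer ids.
vocab = {PAD: 0, UNK: 1, "the": 2, "film": 3, "was": 4, "great": 5}

def encode(text, max_length):
    """Split on whitespace, map tokens to ids, then truncate and pad."""
    tokens = text.lower().split()
    ids = [vocab.get(token, vocab[UNK]) for token in tokens][:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [vocab[PAD]] * n_pad,
        # 1 marks a real token, 0 marks padding
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

encoded = encode("The film was great", max_length=6)
truncated = encode("The plot was great and the film was long", max_length=4)
print(encoded)
print(truncated)
```

Truncation and padding give every example the same length, which is what allows examples to be stacked into a batch.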

import tensorflow as tf
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
encoded_data = data.map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length'), batched=True)

TensorFlow and PyTorch are two widely used frameworks for building models. We will continue with the TensorFlow example. To wrap the dataset with tf.data, we can use to_tf_dataset(). This returns a tf.data.Dataset object, which can be iterated to produce batches of data that can then be passed directly to methods such as model.fit(). to_tf_dataset() takes a number of arguments, such as:

  • columns: specifies which columns should be formatted (including inputs and labels).
  • shuffle: specifies whether the dataset should be shuffled.
  • batch_size: specifies the size of each batch.
  • collate_fn: specifies a data collator that will batch and pad each processed example. If you are using a DataCollator, be sure to set return_tensors="tf" when you initialize it.

# make the dataset compatible with TensorFlow
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
train_dataset = encoded_data.to_tf_dataset(
   columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'],
   shuffle=True,
   batch_size=16,  # illustrative batch size (assumption)
   collate_fn=data_collator,
)

We have now created a dataset that is ready to be used in the training loop of a TensorFlow model.


Final words

The Datasets library is designed to be simple to use, fast, and to offer the same interface for datasets of varying sizes. Having over 600 datasets in one place is a gift for any developer, experienced or novice. In this article, we have tried to understand how the library is organized and shown how we can use it for NLP applications.



Sam D. Gomez