Blogs

TensorFlow Datasets

Peter McLaughlinOctober 30, 2022

TensorFlow Datasets are commonly used for sharing datasets in the public domain. Well-known examples include the MNIST dataset for classification and the Oxford-IIIT Pet dataset for segmentation. This article explains how TensorFlow Datasets work and how to create your own.

There are three main scenarios in which TensorFlow Datasets are used:

  • Creating a dataset for sharing in the public domain
  • Modifying a publicly available dataset
  • Creating a dataset for sharing in a private context, for example within a company or a research institution

The objective of TensorFlow Datasets is to automate the work of fetching data and preparing it in a TensorFlow-ready standard format on disk. Concretely, TensorFlow Datasets are Python classes which inherit from tfds.core.DatasetBuilder. In practice, a custom dataset usually derives from tfds.core.GeneratorBasedBuilder, a convenience sub-class of tfds.core.DatasetBuilder.


Accessing a publicly available dataset is very straightforward; for example, the following code shows how to download and access the oxford_iiit_pet dataset. The remainder of this article focuses on creating a new dataset.

import tensorflow_datasets as tfds
dataset, info = tfds.load('oxford_iiit_pet:3.*.*', with_info=True)

Datasets can be very large, so if you wish to control where the data is stored locally on your machine, the data_dir argument can be used.

dataset, info = tfds.load(name="oxford_iiit_pet:3.*.*", data_dir="C:\\Users\\[complete path]", with_info=True)

Building your own TensorFlow Dataset

The TensorFlow Datasets package is not automatically installed with TensorFlow and must be installed separately using the command pip install tensorflow-datasets. This package also includes the TFDS CLI (command line interface) which provides convenience commands for creating and working with TensorFlow Datasets.

Once the TFDS package and CLI are installed, we can use the CLI to create boilerplate code for our dataset. The command tfds new creates the required Python files, e.g. tfds new my_dataset. A class is created which is named as per the argument provided to the tfds new command. As can be seen from the class declaration, the newly created class sub-classes tfds.core.GeneratorBasedBuilder. The GeneratorBasedBuilder parent class expects sub-classes to override the _split_generators method to return a dictionary mapping split names to example generators. If the dataset is being accessed from a public website or a local server, TensorFlow's download manager can be used to transfer and uncompress the files. The usual practice is then to call the _generate_examples method to read and parse the data files for each split.

class MyDataset(tfds.core.GeneratorBasedBuilder):  
  """DatasetBuilder for my_dataset dataset."""
  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {
      '1.0.0': 'Initial release.',
  }
    
  def _info(self) -> tfds.core.DatasetInfo:
    """Returns the dataset metadata."""
    # TODO(my_dataset): Specifies the tfds.core.DatasetInfo object
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            # These are the features of your dataset like images, labels ...
            'image': tfds.features.Image(shape=(None, None, 3)),
            'label': tfds.features.ClassLabel(names=['no', 'yes']),
        }),
        # If there's a common (input, target) tuple from the
        # features, specify them here. They'll be used if
        # `as_supervised=True` in `builder.as_dataset`.
        supervised_keys=('image', 'label'),  # Set to `None` to disable
        homepage='https://dataset-homepage/',
        citation=_CITATION,
    )  
  
  def _split_generators(self, dl_manager: tfds.download.DownloadManager):
    """Returns SplitGenerators."""
    # TODO(my_dataset): Downloads the data and defines the splits
    path = dl_manager.download_and_extract('https://todo-data-url')
    # TODO(my_dataset): Returns the Dict[split names, Iterator[Key, Example]]
    return {
        'train': self._generate_examples(path / 'train_imgs'),
    }
    
  def _generate_examples(self, path):
    """Yields examples."""
    # TODO(my_dataset): Yields (key, example) tuples from the dataset
    for f in path.glob('*.jpeg'):
      # Keys must be unique across the dataset; the file name is a
      # convenient choice here.
      yield f.name, {
          'image': f,
          'label': 'yes',
      }

Public datasets can be referred to for examples of how best to implement the three methods generated by the tfds new command. The source code for datasets included in the TensorFlow catalog is typically linked on the TensorFlow dataset page. For example, the source code for the oxford_iiit_pet dataset illustrates how to use the download manager to obtain the archives locally, uncompress them and generate the data splits.
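When a dataset does not ship with predefined splits, a common pattern is to assign each file to a split deterministically, for example by hashing the file name. The helper below is a hypothetical sketch (assign_split and test_fraction are illustrative names, not part of the TFDS API):

```python
import hashlib

def assign_split(filename: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a file to the 'train' or 'test' split.

    Hashing the file name (rather than drawing a random number) keeps
    the assignment stable across runs and machines, so the same file
    always lands in the same split.
    """
    digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % 100
    return 'test' if bucket < test_fraction * 100 else 'train'
```

Inside _split_generators, such a helper could be used to partition the extracted files into the per-split generators returned to TFDS.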

Conclusion

TensorFlow Datasets allow the code required to download and prepare the data to be packaged with the dataset itself. This mechanism simplifies sharing of data both in the public domain and in private contexts such as companies and research institutions.
