By Afshine Amidi and Shervine Amidi

Motivation
Have you ever had to load a dataset so memory-consuming that you wished a magic trick could seamlessly take care of it? Large datasets are increasingly becoming part of our lives, as we are able to harness an ever-growing quantity of data.
We have to keep in mind that in some cases, even the most state-of-the-art configuration won't have enough memory space to process the data the way we used to do it. That is the reason why we need to find other ways to do that task efficiently. In this blog post, we are going to show you how to generate your data on multiple cores in real time and feed it right away to your deep learning model.
This tutorial will show you how to do so on the GPU-friendly framework PyTorch, where an efficient data generation scheme is crucial to leverage the full potential of your GPU during the training process.
Tutorial
Previous situation
Before reading this article, your PyTorch script probably looked like this (a representative sketch; file and variable names are illustrative):
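```python
import torch

# Illustrative hyperparameters
max_epochs = 100
batch_size = 64

# Load the ENTIRE dataset into memory at once
# ('some_training_set_with_labels.pt' is an illustrative file storing an (X, y) tuple)
X, y = torch.load('some_training_set_with_labels.pt')
n_batches = len(X) // batch_size

for epoch in range(max_epochs):
    for i in range(n_batches):
        # Slice local batches and labels out of the in-memory tensors
        local_X = X[i * batch_size:(i + 1) * batch_size]
        local_y = y[i * batch_size:(i + 1) * batch_size]

        # Model computations go here
        # ...
```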
or even this (the same pattern with NumPy arrays; again illustrative):
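```python
import numpy as np

# Illustrative hyperparameters
max_epochs = 100
batch_size = 64

# Load the ENTIRE dataset into memory at once
# ('some_training_set_with_labels.npz' is an illustrative archive with 'X' and 'y' arrays)
data = np.load('some_training_set_with_labels.npz')
X, y = data['X'], data['y']
n_batches = len(X) // batch_size

for epoch in range(max_epochs):
    for i in range(n_batches):
        local_X = X[i * batch_size:(i + 1) * batch_size]
        local_y = y[i * batch_size:(i + 1) * batch_size]

        # Model computations go here
        # ...
```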
This article is about optimizing the entire data generation process, so that it does not become a bottleneck in the training procedure.
In order to do so, let's dive into a step-by-step recipe that builds a parallelizable data generator suited for this situation. By the way, the following code is a good skeleton to use for your own project; you can copy/paste the following pieces of code and fill in the blanks accordingly.
Notations
Before getting started, let's go through a few organizational tips that are particularly useful when dealing with large datasets.
Let `ID` be the Python string that identifies a given sample of the dataset. A good way to keep track of samples and their labels is to adopt the following framework:

- Create a dictionary called `partition` where you gather:
  - in `partition['train']` a list of training IDs
  - in `partition['validation']` a list of validation IDs
- Create a dictionary called `labels` where, for each `ID` of the dataset, the associated label is given by `labels[ID]`
For example, let's say that our training set contains `id-1`, `id-2` and `id-3` with respective labels `0`, `1` and `2`, with a validation set containing `id-4` with label `1`. In that case, the Python variables `partition` and `labels` look like the following:
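```python
>>> partition
{'train': ['id-1', 'id-2', 'id-3'], 'validation': ['id-4']}
>>> labels
{'id-1': 0, 'id-2': 1, 'id-3': 2, 'id-4': 1}
```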
Also, for the sake of modularity, we will write the PyTorch code and customized classes in separate files, so that your folder looks like the following (file names other than `data/` are illustrative):
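```
folder/
├── data/
├── my_classes.py
└── pytorch_script.py
```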
where `data/` is assumed to be the folder containing your dataset.
Finally, it is good to note that the code in this tutorial is aimed at being general and minimal, so that you can easily adapt it for your own dataset.
Dataset
Now, let's go through the details of how to set up the Python class `Dataset`, which will characterize the key features of the dataset you want to generate.
First, let's write the initialization function of the class. We make the latter inherit the properties of `torch.utils.data.Dataset` so that we can later leverage nice functionalities such as multiprocessing.
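A minimal sketch (the parameter name `list_IDs` is illustrative):

```python
import torch

class Dataset(torch.utils.data.Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, list_IDs, labels):
        'Initialization'
        # Store the labels dictionary and the list of sample IDs
        self.labels = labels
        self.list_IDs = list_IDs
```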
There, we store important information such as labels and the list of IDs that we wish to generate at each pass.
Each call requests a sample index, for which the upper bound is specified in the `__len__` method.
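Continuing the sketch of the class:

```python
    def __len__(self):
        'Denotes the total number of samples'
        return len(self.list_IDs)
```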
Now, when the sample corresponding to a given index is called, the generator executes the `__getitem__` method to generate it.
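A sketch of this method, assuming each sample is stored as `data/ID.pt` as described next:

```python
    def __getitem__(self, index):
        'Generates one sample of data'
        # Select sample
        ID = self.list_IDs[index]

        # Load the stored Torch tensor and get its label
        X = torch.load('data/' + ID + '.pt')
        y = self.labels[ID]

        return X, y
```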
During data generation, this method reads the Torch tensor of a given example from its corresponding file `ID.pt`. Since our code is designed to be multicore-friendly, note that you can do more complex operations instead (e.g. computations from source files) without worrying that data generation becomes a bottleneck in the training process.
The complete code corresponding to the steps that we described in this section is shown below.
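Putting the pieces above together (e.g. in `my_classes.py`, following the folder layout):

```python
import torch

class Dataset(torch.utils.data.Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, list_IDs, labels):
        'Initialization'
        self.labels = labels
        self.list_IDs = list_IDs

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.list_IDs)

    def __getitem__(self, index):
        'Generates one sample of data'
        # Select sample
        ID = self.list_IDs[index]

        # Load data and get label
        X = torch.load('data/' + ID + '.pt')
        y = self.labels[ID]

        return X, y
```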
PyTorch script
Now, we have to modify our PyTorch script accordingly so that it accepts the generator that we just created. In order to do so, we use PyTorch's `DataLoader` class, which in addition to our `Dataset` class also takes in the following important arguments:

- `batch_size`, which denotes the number of samples contained in each generated batch.
- `shuffle`. If set to `True`, we will get a new order of exploration at each pass (or just keep a linear exploration scheme otherwise). Shuffling the order in which examples are fed to the classifier is helpful so that batches between epochs do not look alike; doing so will eventually make our model more robust.
- `num_workers`, which denotes the number of processes that generate batches in parallel. A sufficiently high number of workers ensures that CPU computations are managed efficiently, i.e. that the bottleneck is indeed the neural network's forward and backward operations on the GPU (and not data generation).
A proposed code template that you can write in your script is shown below (a sketch under the assumptions above: the `Dataset` class lives in `my_classes.py`, and the parameter values are illustrative):
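```python
import torch
from my_classes import Dataset

# CUDA for PyTorch
use_cuda = torch.cuda.is_available()
device = torch.device('cuda:0' if use_cuda else 'cpu')
torch.backends.cudnn.benchmark = True

# Parameters (illustrative values)
params = {'batch_size': 64,
          'shuffle': True,
          'num_workers': 6}
max_epochs = 100

# Datasets (using the toy example from the Notations section)
partition = {'train': ['id-1', 'id-2', 'id-3'], 'validation': ['id-4']}
labels = {'id-1': 0, 'id-2': 1, 'id-3': 2, 'id-4': 1}

# Generators
training_set = Dataset(partition['train'], labels)
training_generator = torch.utils.data.DataLoader(training_set, **params)

validation_set = Dataset(partition['validation'], labels)
validation_generator = torch.utils.data.DataLoader(validation_set, **params)

# Loop over epochs
for epoch in range(max_epochs):
    # Training
    for local_batch, local_labels in training_generator:
        # Transfer batch to GPU
        local_batch, local_labels = local_batch.to(device), local_labels.to(device)

        # Model computations go here
        # ...

    # Validation (no gradients needed)
    with torch.no_grad():
        for local_batch, local_labels in validation_generator:
            # Transfer batch to GPU
            local_batch, local_labels = local_batch.to(device), local_labels.to(device)

            # Model computations go here
            # ...
```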
Conclusion
This is it! You can now run your PyTorch script with a command like the following (assuming the script file from the layout above, `pytorch_script.py`):
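```bash
python3 pytorch_script.py
```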
and you will see that, during the training phase, data is generated in parallel by the CPU and can then be fed to the GPU for the neural network computations.