Machine Learning: Data Preprocessing

I have already started getting into deep learning and have been allotted some great tasks for the projects in AGV. Deep learning starts with collecting a dataset, which usually contains errors to begin with, and different machine learning libraries expect that dataset in different formats. Some libraries, e.g. Caffe, can import images straight from a directory. Others, like TensorFlow or Torch, generally work with tensors or NumPy n-dimensional arrays, which makes loading and feeding the data faster.

Currently, I am working with two libraries, Caffe and TensorFlow, in two different projects, and in both I ran into issues with these dataset preprocessing requirements. So I wrote a Python script which, when run on a dataset organized class-wise, writes a separate pickle file for each class. It also offers options such as reformatting to a tensor or ndarray (to match the ConvNet input) and generating a merged pickle file. The only hard requirement for running the code is that the dataset is split by class into separate directories, which I think most datasets found online already follow. Examples include MNIST, notMNIST, CIFAR-10 and CIFAR-100, and for driverless cars there are GTSRB (traffic sign recognition) and GTSDB (traffic sign detection) from the Institute for Neuroinformatics (INI) at Ruhr University Bochum. ImageNet provides more than 1000 classes in a similar way.

The code is self-explanatory. Still, here is a brief outline of the pipeline.

# Collect the images class-wise and normalize them. By normalization, I mean bringing the mean close to zero and the standard deviation close to 0.5. Normalized inputs help optimizers converge quickly.

# Create a NumPy ndarray from each set of images. I treat each class as one set and dump it into its own pickle file.

# Merge the pickle files as required. Most of the time only a single chunk of data is available, and it has to be split into a training set, a validation set, and a test set. The code takes care of this: just specify the proportion of data for each set before running it.

# Give parameters for reformatting, in case you need it.
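
The four steps above can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the function names (`normalize`, `save_class`, `merge_and_split`, `reformat`), the 80/10/10 default split, and the assumption of 8-bit images are all made up for the example. Decoding images from disk is left out, so random arrays stand in for the real images.

```python
import pickle

import numpy as np

PIXEL_DEPTH = 255.0  # assumption: 8-bit images with values in 0..255


def normalize(images):
    """Shift and scale pixel values so they fall in [-0.5, 0.5]."""
    return (images.astype(np.float32) - PIXEL_DEPTH / 2) / PIXEL_DEPTH


def save_class(images, path):
    """Write the normalized images of one class to its own pickle file."""
    with open(path, "wb") as f:
        pickle.dump(normalize(images), f, pickle.HIGHEST_PROTOCOL)


def merge_and_split(class_arrays, proportions=(0.8, 0.1, 0.1), seed=0):
    """Merge per-class arrays, shuffle, and split into train/valid/test.

    The label of each image is the index of its class in class_arrays.
    Whatever is left after the first two proportions goes to the test set.
    """
    data = np.concatenate(class_arrays)
    labels = np.concatenate(
        [np.full(len(a), i, dtype=np.int64) for i, a in enumerate(class_arrays)]
    )
    order = np.random.default_rng(seed).permutation(len(data))
    data, labels = data[order], labels[order]
    n_train = int(proportions[0] * len(data))
    n_valid = int(proportions[1] * len(data))
    cuts = [n_train, n_train + n_valid]
    return list(zip(np.split(data, cuts), np.split(labels, cuts)))


def reformat(data, labels, num_classes, num_channels=1):
    """Reshape images to (N, H, W, C) and one-hot encode the labels."""
    n, h, w = data.shape[:3]
    data = data.reshape(n, h, w, num_channels).astype(np.float32)
    one_hot = (np.arange(num_classes) == labels[:, None]).astype(np.float32)
    return data, one_hot


# Stand-in data: 3 classes of 10 random 28x28 grayscale images each.
rng = np.random.default_rng(0)
classes = [rng.integers(0, 256, size=(10, 28, 28), dtype=np.uint8) for _ in range(3)]

train, valid, test = merge_and_split([normalize(c) for c in classes])
train_x, train_y = reformat(*train, num_classes=3)
print(train_x.shape, train_y.shape)
```

In a real run, `save_class` would be called once per class directory and the merge step would read those pickle files back instead of using in-memory arrays.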

Here is the link to the GitHub repository.


After the code runs successfully, a pickle file is generated in the dataset directory itself. To load a pickle file back into code, I modified a snippet available online to make it more versatile. It simply loads the pickle file into an ndarray.
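
A minimal sketch of such a loader, assuming the pickle file holds a single array. The name `load_pickle` and the file name are hypothetical, and the loader in the repository may differ.

```python
import os
import pickle
import tempfile

import numpy as np


def load_pickle(path):
    """Load a pickle file and return its contents as an ndarray."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    return np.asarray(data)


# Round trip on a throwaway file to show the call.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "class0.pickle")
    with open(path, "wb") as f:
        pickle.dump(np.zeros((5, 28, 28), dtype=np.float32), f)
    images = load_pickle(path)

print(images.shape, images.dtype)
```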

Please comment in case you find something wrong with this code. Suggestions are also most welcome.

