In a recent session of Machine Learning for Scientists & Engineers, we were talking about the use of folds in cross-validation, and a student did one of my favorite things: he asked a perceptive question. “How is folding related to the concept of batching I’ve heard about for deep learning?” We had a good discussion about batching and folding in machine learning, and about how the two are similar and how they differ.
What is Machine Learning?
Terms like “AI” and “machine learning” have become nearly meaningless in casual conversation and advertising media—especially since the arrival of large language models like ChatGPT. At Diller Digital, we define AI (that is, “artificial intelligence”) as computerized decision-making, covering areas from robotics and computer vision to language processing and machine learning.

Machine learning refers to the development of predictive models that are configured, or trained, by exposure to sample data rather than by explicitly encoded instructions. For example, you can develop a classification model that sorts pictures into dogs and cats by showing it many example photos of dogs and cats. (Sign up for the class to learn the details of how to do this.)
Or you can develop a regression model to predict the temperature at my house tomorrow by training it on the last 10 years of temperature, pressure, humidity, and other measurements from my personal weather station.
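To make that concrete, here is a minimal sketch in Python using scikit-learn. The data is synthetic, standing in for a real weather log, and the column meanings are hypothetical; the class works through fuller examples.

    # A minimal sketch: fit a regression model on historical readings to
    # predict tomorrow's temperature. The data is synthetic, standing in
    # for roughly 10 years of daily weather-station measurements.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n_days = 3650
    X = rng.normal(size=(n_days, 3))   # e.g., pressure, humidity, today's temp
    y = X @ np.array([0.5, -0.2, 0.8]) + rng.normal(scale=0.1, size=n_days)

    model = LinearRegression()
    model.fit(X, y)                    # "training" = fitting to sample data
    print(model.predict(X[-1:]))       # predicted temperature for tomorrow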
Classical vs Deep Learning
Broadly speaking, there are two kinds of machine learning: what we at Diller Digital call classical machine learning and deep learning. Classical machine learning is characterized by relatively small data sets, and it requires a skilled modeler to do feature engineering to make the best use of the available (and limited) training data. This is the subject of our Machine Learning for Scientists & Engineers class. Deep learning is a subset of machine learning that uses many-layered models that function as a rough analog of the neurons in a human brain. Training such models requires much more data but less manual feature engineering by the modeler. The skill in deep learning lies in configuring the architecture of the model, and that is the subject of our Deep Learning for Scientists & Engineers class.
Parameters and Hyperparameters
There is one more pair of definitions we need to cover before we can talk about folding versus batching: parameters and hyperparameters.
At the heart of both kinds of machine learning is the adjustment of a model’s parameters, sometimes also called coefficients or weights. Simply stated, these are the numbers the model learns from the data; in the simplest case, they are the coefficients of a linear regression.
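In scikit-learn, for example, you can inspect those fitted parameters directly after training. A quick sketch with made-up numbers:

    # After fitting, the learned parameters (weights) live on the model.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([2.1, 3.9, 6.2, 8.1])       # roughly y = 2x

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)     # the fitted parameters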

Each model also has what are called hyperparameters, or parameters that govern how the model behaves algorithmically. These might include things like how you score your model’s performance or what method you use to update the model weights.
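In code, hyperparameters typically show up as arguments you choose before training begins. A brief sketch using scikit-learn’s random forest as an example (the particular values here are arbitrary):

    # Hyperparameters are chosen up front, not learned from the data.
    from sklearn.ensemble import RandomForestRegressor

    model = RandomForestRegressor(
        n_estimators=200,  # how many trees to build
        max_depth=5,       # how deep each tree may grow
    )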
The process of training a model is one of adjusting the parameters until you get the best possible predictions from your model. For this reason, we typically divide our data into two parts: one (the training set) for adjusting the weights, the other (the testing set) for assessing the performance of the model. It’s important to score your model on data that was not used in the training step because you’re testing its predictive power on things it hasn’t seen before.
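One common way to make that division is scikit-learn’s train_test_split; a minimal sketch on synthetic data:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for a real data set.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([0.5, -0.2, 0.8]) + rng.normal(scale=0.1, size=500)

    # Hold 20% of the data back for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LinearRegression().fit(X_train, y_train)  # adjust weights on training data
    print(model.score(X_test, y_test))                # score on data the model never saw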
What is Folding?
So this brings us, finally, to the subject of folding and batching. Folding typically arises in the context of cross-validation, when you’re trying to decide on the best hyperparameters for your model. That process involves fitting your model with different sets of hyperparameters and seeing which combination gives the best results. How can you do that without using your test data set? (Using the test data set during training would be cheating: it would sacrifice your model’s ability to generalize for the short-term gain of a better score.) The answer is to divide the training data into folds. We hold one fold back as a “mini-test” data set, train on the others, and repeat until every fold has served as the holdout. The average score across the folds becomes our cross-validation score, and it gives us a way to evaluate that set of hyperparameters without dipping into the test data set.
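In scikit-learn, KFold and cross_val_score capture this pattern. Here is a sketch comparing two hyperparameter settings on synthetic data; in practice, X and y would be your training set only:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))
    y = 2 * X[:, 0] + rng.normal(scale=0.1, size=300)

    folds = KFold(n_splits=5, shuffle=True, random_state=42)

    # Each candidate hyperparameter setting gets a score averaged over
    # the five folds; no test data involved.
    for depth in (2, 8):
        model = RandomForestRegressor(max_depth=depth, random_state=0)
        scores = cross_val_score(model, X, y, cv=folds)
        print(f"max_depth={depth}: mean CV score {scores.mean():.3f}")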

What is Batching?
Batching looks a lot like folding but is a distinct concept used in a different context. Batching arises in the context of training deep models, and it serves two purposes. First, training a deep learning model typically requires a lot of training data (orders of magnitude more than classical methods need), and except in trivial cases you can’t fit all of it into working memory at once. You solve that problem by dividing the training data into batches, much as you would divide it into folds for cross-validation, and then iteratively updating the model parameters with each batch until you have used the entire training data set. One full pass through all of the batches is called an epoch, and training a deep learning model typically takes multiple epochs.
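In PyTorch, for example, a DataLoader handles the division into batches, and the training loop updates the parameters once per batch. A minimal sketch on synthetic data (the model and sizes here are arbitrary):

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Synthetic data standing in for a large training set.
    X = torch.randn(10_000, 20)
    y = torch.randn(10_000, 1)

    # Each iteration of the loader yields one batch of 64 examples.
    loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(5):                # one epoch = one pass over all batches
        for X_batch, y_batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(X_batch), y_batch)
            loss.backward()               # gradients computed from this batch only
            optimizer.step()              # parameters updated after each batch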

Beyond considerations of working memory, there’s a second important reason to train a deep model on batches: because there are so many model parameters with so many possible configurations, and because of the way the layers of the model insulate some of the parameters from information in the training data, it helps that smaller batches are “noisier” and provide more variation for the training algorithm to use when adjusting the model parameters. As a physical analogy, think of how shaking a pan while you pour sand into it helps the sand settle into a flat surface more quickly than gravity alone would; without shaking, you might end up with lumps and bumps.
So hopefully, by this point you can see how folding and batching are similar and how they are distinct. Both divide training data into segments, but folding is used in cross-validation to optimize hyperparameters, while batching is used in training deep learning models to limit memory requirements and improve convergence when fitting model parameters.
Diller Digital offers Machine Learning for Scientists & Engineers and Deep Learning for Scientists & Engineers at least once per quarter. Sign up to join us; bring your curiosity, questions, and toughest problems, and see what you can learn! Maybe you’ll join the chorus of those who leave glowing feedback.