Are you overwhelmed with news about deep learning but unsure of its practical implications? Let’s demystify this fascinating topic! In this article, we will explore how to create programs that recognize objects in images using deep learning technologies. Specifically, we will uncover the magic that allows services like Google Photos to identify and search your images based on their content.

Just like the previous parts of this series, this guide is aimed at anyone interested in machine learning, regardless of their prior knowledge. Our goal is to make this topic accessible, with some generalizations included. If we can spark interest in ML, then our mission is accomplished!
(Please refer to Part 1 and Part 2 if you haven’t done so already!)
Understanding Object Recognition with Deep Learning
You may have encountered the popular xkcd comic, illustrating the absurdity that a toddler can recognize a bird, while teaching a computer to do so has stumped experts for decades. Thankfully, advancements in deep convolutional neural networks (CNNs) have provided a viable solution for object recognition, making it possible to identify complex patterns in images.
Let’s embark on a journey to build a program capable of recognizing birds!
Starting Simple: Recognizing Handwritten Digits
Before tackling the challenge of bird recognition, we will start with something simpler — recognizing the handwritten numeral “8”.
In the previous part, we developed a basic neural network to estimate house prices based on features like the number of bedrooms and area. Now, we will adapt that network to recognize images of the numeral “8”.
To train our network effectively, we need a substantial dataset. Fortunately, researchers have curated the MNIST dataset, which contains 60,000 images of handwritten digits, including numerous examples of the number “8”.

Feeding Images as Input
Although our previous neural network handled a few numerical inputs, how can we modify it to process images? The answer lies in understanding that a digital image is essentially a matrix of numbers representing pixel brightness.
We will use an 18×18 pixel image, translating it into an array of 324 numerical values. To accommodate this input, we will expand our neural network to include 324 input nodes. Our network will now have two outputs: one representing the likelihood that the image depicts an “8” and the other predicting that it does not.
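To make the "image as numbers" idea concrete, here is a minimal numpy sketch; the random array is just a stand-in for a real scanned digit:

```python
import numpy as np

# Stand-in for one 18x18 grayscale image: each value is a pixel brightness (0.0-1.0)
image = np.random.rand(18, 18)

# Flatten the grid into one long array of 324 numbers,
# one value per input node of the neural network
inputs = image.reshape(-1)

print(inputs.shape)  # (324,)
```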
As we train our network on labeled examples, images of "8"s alongside images of every other digit, it learns to tell the two apart.
Training the neural network can be completed quickly on a modern laptop, and once finished, we will have a functioning model capable of recognizing the numeral “8” with decent accuracy.
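If you're curious what that looks like in code, here is a minimal sketch using TFLearn (the same library we'll use later for the bird classifier). The arrays X and Y are assumed to already hold the flattened images and their one-hot labels:

```python
import tflearn

# 324 inputs (one per pixel) and two softmax outputs:
# P(image is an "8") and P(image is not an "8")
net = tflearn.input_data(shape=[None, 324])
net = tflearn.fully_connected(net, 64, activation='relu')
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)

# X: flattened 18x18 images, Y: one-hot labels ([1, 0] = "8", [0, 1] = anything else)
model = tflearn.DNN(net)
model.fit(X, Y, n_epoch=10, validation_set=0.1)
```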
Challenges of Tunnel Vision
While it is impressive that we can achieve image recognition by merely feeding pixel data into a neural network, it’s essential to note the limitations.
Our network does a great job identifying an "8" when it is perfectly centered in the image. However, it fails completely when the numeral is even slightly off-center, because it was trained only on centered examples and has no idea what an "8" looks like anywhere else in the frame.
Brute Force Solutions to Improve Recognition
Brute Force Idea #1: Sliding Window Technique
A straightforward approach would be a "sliding window" technique: scan the entire image section by section, testing each smaller patch for a potential "8". This can work, but it is computationally wasteful, since we end up checking the same image over and over, once for every possible position and window size.
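Here is a sketch of the idea, not an efficient implementation. The `classifier` argument stands in for whatever "is this an 8?" model we trained earlier; it takes a flattened patch and returns True or False:

```python
import numpy as np

def sliding_window_search(image, classifier, window=18, stride=4):
    """Run the same "is this an 8?" classifier over every patch of the image."""
    hits = []
    height, width = image.shape
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            patch = image[top:top + window, left:left + window]
            if classifier(patch.reshape(-1)):  # flatten the patch, test it
                hits.append((top, left))
    return hits
```

Even this version only scans at one window size; handling digits of different sizes means repeating the whole scan at every scale, which is exactly the wasted work described above.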
Brute Force Idea #2: Enhanced Training Data and Deep Networks
Instead of relying on a single perspective of the “8”, we can enhance our dataset by including variations of the numeral in all positions and sizes. We can achieve this by generating synthetic training images, utilizing scripts to create numerous iterations of the “8” in various locations.
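Such a script can be very simple. Here is one numpy sketch that scatters copies of a digit across a larger blank canvas; the canvas size and copy count are arbitrary choices:

```python
import numpy as np

def random_placements(digit, canvas_size=36, copies=10):
    """Create synthetic training images by pasting the same digit
    at random positions on a larger blank canvas."""
    h, w = digit.shape
    images = []
    for _ in range(copies):
        canvas = np.zeros((canvas_size, canvas_size), dtype=digit.dtype)
        top = np.random.randint(0, canvas_size - h + 1)
        left = np.random.randint(0, canvas_size - w + 1)
        canvas[top:top + h, left:left + w] = digit
        images.append(canvas)
    return images
```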
This added diversity makes the learning problem harder, but we can compensate by building a deeper neural network with more layers, giving it the capacity to learn more intricate patterns.
Introducing Deep Neural Networks
Deep neural networks, characterized by multiple layers, have been around since the late 1960s, but it was only with the advent of cheap, fast hardware, particularly graphics processing units (GPUs), that training such extensive networks became efficient.
With a modern GPU, building and training a deep neural network becomes practically feasible, allowing us to tackle more complex recognition tasks.
Understanding Convolution for Translation Invariance
To enhance our network’s capabilities, we need to instill a sense of translation invariance—recognizing that an “8” is still an “8” regardless of where it appears in an image. This is achieved through a process known as Convolution.
Steps of Convolution
- Breaking the Image into Tiled Sections: We’ll pass a sliding window over the original image to create smaller overlapping tiles.
- Processing Each Tile: We feed each tile through the same neural network while keeping the weights uniform across all tiles.
- Recording Results: We save the outcomes from processing each tile in a grid that mirrors the original image’s arrangement.
- Downsampling: Using a technique called max pooling, we shrink the result grid by keeping only the largest value from each small section.
- Final Prediction: The resulting smaller array, composed of numerical values, serves as input to another fully connected neural network, determining if an image contains the numeral “8”.
This iterative, multi-layered process is what allows convolutional networks to effectively learn complex features.
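To see the mechanics of steps 1 through 4 in miniature, here is a toy numpy sketch with a single hand-provided filter. A real convolutional layer learns many filters at once, and learns the kernel weights itself rather than being handed them:

```python
import numpy as np

def convolve_and_pool(image, kernel, stride=1, pool=2):
    """Toy version of steps 1-4 with a single shared filter.

    The same small kernel (shared weights) is applied to every tile,
    each response is recorded in a grid, and max pooling then shrinks
    the grid by keeping only the largest value in each block.
    """
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    responses = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            tile = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            responses[i, j] = max(0.0, (tile * kernel).sum())  # ReLU activation
    h = (out_h // pool) * pool  # trim so the pooling blocks divide evenly
    w = (out_w // pool) * pool
    blocks = responses[:h, :w].reshape(h // pool, pool, w // pool, pool)
    return blocks.max(axis=(1, 3))

# An 18x18 image with a 3x3 kernel gives a 16x16 response grid,
# which max pooling halves to 8x8
pooled = convolve_and_pool(np.random.rand(18, 18), kernel=np.random.rand(3, 3))
print(pooled.shape)  # (8, 8)
```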
Constructing the Bird Classifier
Now that we’ve grasped the foundational concepts, it’s time to build a classifier for recognizing birds. We will use the CIFAR10 dataset, which contains 6,000 images of birds and 54,000 images of things that are not birds, supplemented with roughly 12,000 additional bird photos from the Caltech-UCSD Birds-200-2011 dataset.
While 72,000 images is a good starting point, real-world applications often demand millions of diverse images to achieve high effectiveness. This realization underscores why major companies like Google collect extensive user data.
Using TFLearn, a wrapper around TensorFlow, we can simplify the task of constructing our model: a few lines of code are all it takes to define the layers of our convolutional neural network.
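Here is a sketch of what that definition can look like in TFLearn, close in spirit to the kind of network described above; the data arrays (X, Y, X_test, Y_test) are assumed to be loaded already:

```python
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

# 32x32 color images in, bird / not-bird out
network = input_data(shape=[None, 32, 32, 3])
network = conv_2d(network, 32, 3, activation='relu')   # convolution: 32 filters
network = max_pool_2d(network, 2)                      # max pooling
network = conv_2d(network, 64, 3, activation='relu')
network = conv_2d(network, 64, 3, activation='relu')
network = max_pool_2d(network, 2)
network = fully_connected(network, 512, activation='relu')
network = dropout(network, 0.5)                        # dropout to reduce overfitting
network = fully_connected(network, 2, activation='softmax')
network = regression(network, optimizer='adam',
                     loss='categorical_crossentropy', learning_rate=0.001)

# X, Y: training images and one-hot bird/not-bird labels; X_test, Y_test held out
model = tflearn.DNN(network)
model.fit(X, Y, n_epoch=50, shuffle=True,
          validation_set=(X_test, Y_test), show_metric=True)
```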
As the model trains, accuracy climbs: roughly 75.4% after the first pass, 91.7% after ten training iterations, and a stable 95.5% or so after around 50 iterations.
Evaluating the Network’s Performance
With our newly trained neural network, we can test its performance. By validating with a set of 15,000 images, the network correctly identifies birds about 95% of the time. However, to gain a true understanding of its efficacy, we need to examine the details behind this accuracy.
Precise Analysis: True Positives, True Negatives, False Positives, and False Negatives
To gauge the quality of our classification system, we analyze outputs beyond simple accuracy. We categorize our predictions into four classes:
- True Positives: Correctly identified birds.
- True Negatives: Correctly identified non-bird images.
- False Positives: Non-bird images mistakenly classified as birds.
- False Negatives: Actual birds our network failed to recognize.
This nuanced analysis reveals that not all mistakes are equal. For instance, in medical diagnosis, failing to identify a condition (false negative) is far more critical than incorrectly identifying it (false positive).
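Tallying these four counts takes only a few lines. Here `predictions` and `labels` are hypothetical parallel lists of booleans, with True meaning "bird":

```python
def confusion_counts(predictions, labels):
    """Tally the four outcome types for a binary bird/not-bird classifier."""
    tp = sum(p and l for p, l in zip(predictions, labels))          # true positives
    tn = sum(not p and not l for p, l in zip(predictions, labels))  # true negatives
    fp = sum(p and not l for p, l in zip(predictions, labels))      # false positives
    fn = sum(not p and l for p, l in zip(predictions, labels))      # false negatives
    return tp, tn, fp, fn
```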
By calculating Precision and Recall, we can further clarify our network’s success.
- Precision: Of all the times the network guessed "bird", what fraction were actually birds.
- Recall: Of all the actual birds in the dataset, what fraction the network managed to find.
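Both metrics fall straight out of the four counts above; the example numbers here are made up purely for illustration:

```python
def precision_and_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of everything we called a bird, how much really was
    recall = tp / (tp + fn)     # of all the real birds, how many we actually found
    return precision, recall

# Hypothetical counts from a validation run (illustrative only)
precision, recall = precision_and_recall(tp=450, fp=30, fn=50)
print(f"precision={precision:.0%}, recall={recall:.0%}")  # precision=94%, recall=90%
```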
These metrics provide insights into the model’s effectiveness, highlighting the strengths and weaknesses in our classification attempts.
Next Steps
Now that you are well-versed in deep convolutional networks, consider experimenting with the existing examples in TFLearn to explore various neural network architectures! You might also venture into training computers to play Atari games.
If you enjoyed this article, please consider subscribing to my Machine Learning is Fun! email list. You’ll receive updates whenever I publish something new and exciting.
Feel free to connect with me on Twitter, email, or LinkedIn; I’d be happy to assist you or your team with machine learning inquiries.