Get Started With Image Recognition in Core ML

With technological advances, we’re at the point where our devices can use their built-in cameras to accurately identify and label images using a pre-trained data set. You can also train your own models, but in this tutorial, we’ll be using an open-source model to create an image classification app.

I’ll show you how to create an app that can identify images. We’ll start with an empty Xcode project, and implement machine-learning-powered image recognition one step at a time.

Getting Started

Xcode Version

Before we begin, make sure you have the latest version of Xcode installed on your Mac. This is very important because Core ML is only available in Xcode 9 or newer. You can check your version by opening Xcode and choosing Xcode > About Xcode from the menu bar.

If your version of Xcode is older than Xcode 9, you can update it through the Mac App Store, or download it there for free if you don’t have it installed.

Sample Project

New Project

After you have made sure you have the right version of Xcode, you’ll need to make a new Xcode project. 

Go ahead and open Xcode and click Create a new Xcode project.

Figure 1 Create an Xcode Project

Next, you’ll need to choose a template for your new Xcode project. It’s pretty common to use a Single View App, so go ahead and select that and click Next.

Figure 2 Select a Single View Application

You can name your project anything you like, but I will be naming mine CoreML Image Classification. For this project, we’ll be using Swift, so make sure that it’s selected in the Language dropdown.

Figure 3 Selecting Language and Naming Application

Preparing to Debug

Connecting an iPhone

Since the Xcode Simulator doesn’t have a camera, you’ll need to plug in your iPhone. Unfortunately, if you don’t have an iPhone, you’ll need to borrow one to be able to follow along with this tutorial (and for any other camera-related apps). If you already have an iPhone connected to Xcode, you can skip ahead to the next step.

A nifty new feature in Xcode 9 is that you can wirelessly debug your app on a device, so let’s take the time to set that up now:

In the top menu bar, choose Window > Devices and Simulators. In the window that appears, make sure that Devices is selected at the top.

Now, plug in your device using a Lightning cable. This should make your device appear in the left pane of the Devices and Simulators window. Simply click your device, and check the Connect via Network box.

Figure 4 Devices and Simulators

You will now be able to wirelessly debug on this iPhone for all future apps. To add other devices, you can follow a similar process.

Simulator Selection

Figure 5 Select a Simulator

When you want to finally use your iPhone to debug, simply select it from the dropdown beside the Run button. You should see a network icon next to it, showing that it’s connected for wireless debugging. I’ve selected Vardhan’s iPhone, but you need to select your specific device.

Diving Deeper

Now that you’ve created your project and set up your iPhone for debugging, we’ll dive a bit deeper and begin programming the real-time image classification app.

Preparing Your Project

Getting a Model

To be able to start making your Core ML image classification app, you’ll first need to get the Core ML model from Apple’s website. As I mentioned before, you can also train your own models, but that requires a separate process. If you scroll to the bottom of Apple’s machine learning website, you’ll be able to choose and download a model.

In this tutorial, I will be using the MobileNet.mlmodel model, but you can use any model as long as you know its name and can ensure that it ends in .mlmodel.

Figure 6 Working with Models

Importing Libraries

There are a couple of frameworks you’ll need to import along with the usual UIKit. At the top of the file, make sure the following import statements are present:

import UIKit
import AVKit
import Vision

We’ll need AVKit because we’ll be creating an AVCaptureSession to display a live feed while classifying images in real time. Also, since this is using computer vision, we’ll need to import the Vision framework.

Designing Your User Interface

An important part of this app is displaying the image classification data labels as well as the live video feed from the device’s camera. To begin designing your user interface, head to your Main.storyboard file.

Adding an Image View

Head to the Object Library and search for an Image View. Simply drag this onto your View Controller to add it in. If you’d like, you can also add a placeholder image so that you can get a general idea of what the app will look like when it’s being used.

If you do choose to have a placeholder image, make sure that the Content Mode is set to Aspect Fit, and that you check the box which says Clip to Bounds. This way, the image will not appear stretched, and it won’t appear outside of the UIImageView box.

Figure 7 Content Mode

Here’s what your storyboard should now look like:

Figure 8 Storyboard

Adding a View

Back in the Object Library, search for a View and drag it onto your View Controller. This will serve as a nice background for our labels so that they don’t get hidden in the image being displayed. We’ll be making this view translucent so that some of the preview layer is still visible (this is just a nice touch for the user interface of the app).

Drag this to the bottom of the screen so that it touches the container on three sides. It doesn’t matter what height you choose because we’ll be setting constraints for it in a moment.

Figure 9 Storyboard

Adding Labels

This, perhaps, is the most important part of our user interface. We need to display what our app thinks the object is, and how sure it is (confidence level). As you’ve probably guessed, you’ll need to drag two Label(s) from the Object Library to the view we just created. Drag these labels somewhere near the center, stacked on top of each other.

For the top label, head to the Attributes Inspector and click the button next to the font style and size and, in the popup, select System as the font. To differentiate this from the confidence label, select Black as the style. Lastly, change the size to 24.

Figure 10 Object Label Attributes

For the bottom label, follow the same steps, but instead of selecting Black as the style, select Regular, and for the size, select 17.

Figure 11 Confidence Label Attributes

The image below shows how your Storyboard should look when you’ve added all these views and labels. Don’t worry if they aren’t exactly the same as yours; we’ll be adding constraints to them in the next step.

Figure 12 Storyboard Final

Adding Constraints

In order for this app to work on different screen sizes, it’s important to add constraints. This step isn’t crucial to the rest of the app, but it’s highly recommended that you do this in all your iOS apps.

Image View Constraints

The first thing to constrain is our UIImageView. To do this, select your image view, and open the Pin menu in the bottom toolbar (the second icon from the right, which looks like a square flanked by lines). Then, you’ll need to add the following values:

Figure 13 Image Constraints

Before you proceed, make sure that the Constrain to Margins box isn’t checked as this will create a gap between the screen and the actual image view. Then, hit Enter. Now your UIImageView is centered on the screen, and it should look right on all device sizes.

View Constraints

Now, the next step is to constrain the view on which the labels appear. Select the view, and then go to the Pin Menu again. Add the following values:

Figure 14 View Constraints

Now, simply hit Enter to save the values. Your view is now constrained to the bottom of the screen.

Label Constraints

Since the view is now constrained, you can add constraints to the labels relative to the view instead of the screen. This is helpful if you later decide to change the position of the labels or the view.

Select both of the labels, and embed them in a stack view. If you don’t know how to do this, simply press the Embed in Stack button (second from the left in the bottom toolbar), which looks like a stack of books with a downward arrow. The two labels will then become one selectable object.

Click on your stack view, and then click on the Align Menu (third from the left) and make sure the following boxes are checked:

Figure 15 Label Constraints

Now, hit Enter. Your labels should be centered in the view from the previous step, and they will now appear the same on all screen sizes.

Interface Builder Outlets

The last step in the user interface would be to connect the elements to your ViewController() class. Simply open the Assistant Editor and then Control-Click and Drag each element to the top of your class inside ViewController.swift. Here’s what I’ll be naming them in this tutorial:

  • UILabel: objectLabel
  • UILabel: confidenceLabel
  • UIImageView: imageView

Of course, you can name them whatever you want, but these are the names you’ll find in my code.

Preparing a Capture Session

The live video feed will require an AVCaptureSession, so let’s create one now. We’ll also be displaying our camera input to the user in real time. Making a capture session is a pretty long process, and it’s important that you understand how to do it because it will be useful in any other development you do using the on-board camera on any of Apple’s devices.

Class Extension and Function

To begin, we can create a class extension and then make it conform to the AVCaptureVideoDataOutputSampleBufferDelegate protocol. You can easily do this within the actual ViewController class, but we’re using best practices here so that the code is neat and organized (this is the way you would be doing it for production apps).

So that we can call this inside of viewDidLoad(), we’ll need to create a function called setupSession() which doesn’t take in any parameters. You can name this anything you want, but be mindful of the naming when we call this method later.

Once you’re finished, your code should look like the following:

// MARK: - AVCaptureSession
extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func setupSession() {
        // Your code goes here
    }
}

Device Input and Capture Session

The first step in creating the capture session is to check whether or not the device has a camera. In other words, don’t attempt to use the camera if there is no camera. We’ll then need to create the actual capture session.

Add the following code to your setupSession() method:

guard let device = AVCaptureDevice.default(for: .video) else { return }
guard let input = try? AVCaptureDeviceInput(device: device) else { return }

let session = AVCaptureSession()
session.sessionPreset = .hd4K3840x2160

Here, we’re using a guard let statement to check if the device (AVCaptureDevice) has a camera. When you try to get the camera of the device, you must also specify the mediaType, which, in this case, is .video.
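If you ever need a specific camera rather than whatever the system picks as the default, AVCaptureDevice also lets you ask for a particular device type and position. A minimal optional sketch (the back-facing wide-angle camera here is just an example choice, and backCamera() is a hypothetical helper name):

```swift
import AVKit

// Asks specifically for the back-facing wide-angle camera instead of
// whatever AVCaptureDevice.default(for:) would return.
func backCamera() -> AVCaptureDevice? {
    return AVCaptureDevice.default(.builtInWideAngleCamera,
                                   for: .video,
                                   position: .back)
}
```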

Then, we create an AVCaptureDeviceInput, which is an input which brings the media from the device to the capture session.

Finally, we simply create an instance of the AVCaptureSession class and assign it to a variable called session. We’ve set the session quality to Ultra High Definition (UHD), which is 3840 by 2160 pixels. You can experiment with this setting to see what works for you.
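Not every device supports 4K capture, so a more defensive version of this step might check the preset before applying it and fall back to a safer one. This is an optional refinement, not something the rest of the tutorial depends on:

```swift
import AVKit

let session = AVCaptureSession()

// Fall back to the generic high-quality preset on devices
// that can't capture 4K video.
if session.canSetSessionPreset(.hd4K3840x2160) {
    session.sessionPreset = .hd4K3840x2160
} else {
    session.sessionPreset = .high
}
```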

Preview Layer and Output

The next step in doing our AVCaptureSession setup is to create a preview layer, where the user can see the input from the camera. We’ll be adding this onto the UIImageView we created earlier in our Storyboard. The most important part, though, is actually creating our output for the Core ML model to process later in this tutorial, which we’ll also do in this step.

Add the following code directly underneath the code from the previous step:

let previewLayer = AVCaptureVideoPreviewLayer(session: session)
previewLayer.frame = view.frame
imageView.layer.addSublayer(previewLayer)

let output = AVCaptureVideoDataOutput()
output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "videoQueue"))

We first create an instance of the AVCaptureVideoPreviewLayer class, and then initialize it with the session we created in the previous step. After that’s done, we’re assigning it to a variable called previewLayer. This layer is used to actually display the input from the camera.

Next, we’ll make the preview layer fill the whole screen by setting the frame dimensions to those of the view. This way, the desired appearance will persist for all screen sizes. To actually show the preview layer, we’ll add it in as a sub-layer of the UIImageView that we created when we were making the user interface.
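One optional touch: by default, the preview layer may letterbox the camera feed inside its frame. If you’d rather have it fill the layer the way the Camera app does, you can set its video gravity. This is an extra refinement, not required by the rest of the tutorial:

```swift
import AVKit

let session = AVCaptureSession()
let previewLayer = AVCaptureVideoPreviewLayer(session: session)

// Fill the layer's bounds, cropping the feed instead of letterboxing it.
previewLayer.videoGravity = .resizeAspectFill
```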

Now, for the important part: we create an instance of the AVCaptureVideoDataOutput class and assign it to a variable called output. We then set its sample buffer delegate to self, on a background queue named "videoQueue", so that this extension receives each new video frame.

Input and Start Session

Finally, we’re done with our capture session. All that’s left to do before the actual Core ML code is to add the input and start the capture session. 

Add the following lines of code directly under the previous step:

// Sets the input and output of the AVCaptureSession
session.addInput(input)
session.addOutput(output)

// Starts the capture session
session.startRunning()

This adds the input and output that we created earlier to the AVCaptureSession; before this, we had only created them and hadn’t attached them to the session. Lastly, the final line of code starts the session which we’ve spent so long creating.
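Putting the whole capture setup together, the finished setupSession() should look roughly like the following. This is a consolidated sketch with two small additions the prose describes but the fragments above omit: attaching the preview layer to the imageView outlet and adding the output to the session, both of which are needed for the feed and the delegate callbacks to work.

```swift
import UIKit
import AVKit

extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func setupSession() {
        // Make sure the device actually has a camera before proceeding.
        guard let device = AVCaptureDevice.default(for: .video) else { return }
        guard let input = try? AVCaptureDeviceInput(device: device) else { return }

        let session = AVCaptureSession()
        session.sessionPreset = .hd4K3840x2160

        // Show the camera feed inside the image view.
        let previewLayer = AVCaptureVideoPreviewLayer(session: session)
        previewLayer.frame = view.frame
        imageView.layer.addSublayer(previewLayer)

        // Deliver frames to this extension on a background queue.
        let output = AVCaptureVideoDataOutput()
        output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "videoQueue"))

        session.addInput(input)
        session.addOutput(output)
        session.startRunning()
    }
}
```

Remember to call setupSession() from viewDidLoad() so the feed starts when the screen appears.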

Integrating the Core ML Model

We’ve already downloaded the model, so the next step is to actually use it in our app. So let’s get started with using it to classify images. 

Delegate Method

To begin, you’ll need to add the following delegate method into your app:

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    // Your code goes here
}

This delegate method is triggered when a new video frame is written. In our app, this happens every time a frame gets recorded through our live video feed (the speed of this is solely dependent on the hardware which the app is running on).

Pixel Buffer and Model

Now, we’ll be turning the image (one frame from the live feed) into a pixel buffer, a format which the model can recognize. With this, we’ll later be able to create a VNCoreMLRequest.

Add the following two lines of code inside the delegate method you created earlier:

guard let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
guard let model = try? VNCoreMLModel(for: MobileNet().model) else { return }

First we create a pixel buffer (a format which Core ML accepts) from the argument passed in through the delegate method, and then assign it to a variable called pixelBuffer. Then we assign our MobileNet model to a constant called model.

Notice that both of these are created using guard let statements, so the function returns early if either of them is nil.

Creating a Request

After the previous two lines of code have been executed, we know for sure that we have a pixel buffer and a model. The next step would be to create a VNCoreMLRequest using both of them. 

Right below the previous step, paste the following lines of code inside of the delegate method:

let request = VNCoreMLRequest(model: model) { (data, error) in
    // Your code goes here
}

Here, we’re creating a constant called request and assigning it a new VNCoreMLRequest, initialized with our model and a completion handler that will receive the classification results.

Getting and Sorting Results

We’re almost finished! All we need to do now is get our results (what the model thinks our image is) and then display them to the user. 

Add the next two lines of code into the completion handler of your request:

// Checks if the data is in the correct format and assigns it to results
guard let results = data.results as? [VNClassificationObservation] else { return }
// Assigns the first result (if it exists) to firstObject
guard let firstObject = results.first else { return }

If the results from the data (from the completion handler of the request) are available as an array of VNClassificationObservations, this line of code gets the first object from the array we created earlier. It will then be assigned to a constant called firstObject. The first object in this array is the one for which the image recognition engine has the most confidence.
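If you’d like to see more than just the best guess while debugging, the results array Vision returns is already sorted by descending confidence, so you can peek at the top few observations. A small optional sketch (logTopResults is a hypothetical helper name, not part of the tutorial's code):

```swift
import Vision

// Prints the model's top three guesses with their confidence values.
func logTopResults(_ results: [VNClassificationObservation]) {
    for observation in results.prefix(3) {
        print("\(observation.identifier): \(observation.confidence * 100)%")
    }
}
```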

Displaying Data and Image Processing

Remember when we created the two labels (confidence and object)? We’ll now be using them to display what the model thinks the image is.

Append the following lines of code after the previous step:

if firstObject.confidence * 100 >= 50 {
    DispatchQueue.main.async {
        self.objectLabel.text = firstObject.identifier.capitalized
        self.confidenceLabel.text = String(firstObject.confidence * 100) + "%"
    }
}

The if statement makes sure that the algorithm is at least 50% certain about its identification of the object. If it is, we set the identifier of firstObject as the text of objectLabel, and display the certainty percentage using the text property of confidenceLabel. Since firstObject.confidence is represented as a decimal, we need to multiply by 100 to get the percentage. Also, because this delegate method runs on a background queue, the label updates are wrapped in DispatchQueue.main.async, since UI changes must always be made on the main thread.

The last thing to do is to process the image through the algorithm we just created. To do this, you’ll need to type the following line of code directly before exiting the captureOutput(_:didOutput:from:) delegate method:

try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])
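For reference, here is the complete delegate method with all of the pieces from this section in one place. This is a consolidated sketch using the same names as above, with the label updates dispatched to the main queue, which is required for UI work:

```swift
import AVKit
import Vision

func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    // Convert the frame into a pixel buffer and load the Core ML model.
    guard let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    guard let model = try? VNCoreMLModel(for: MobileNet().model) else { return }

    // Classify the frame and display the result if confidence is high enough.
    let request = VNCoreMLRequest(model: model) { (data, error) in
        guard let results = data.results as? [VNClassificationObservation] else { return }
        guard let firstObject = results.first else { return }

        if firstObject.confidence * 100 >= 50 {
            DispatchQueue.main.async {
                self.objectLabel.text = firstObject.identifier.capitalized
                self.confidenceLabel.text = String(firstObject.confidence * 100) + "%"
            }
        }
    }

    try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])
}
```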


The concepts you learned in this tutorial can be applied to many kinds of apps. I hope you’ve enjoyed learning to classify images using your phone. While it may not yet be perfect, you can train your own models in the future to be more accurate.

Here’s what the app should look like when it’s done:

Figure 16 Final Application

While you’re here, check out some of our other posts on machine learning and iOS app development!
