In this article, Shachar Ilan, Director of Computer Vision at Nike at the time of the talk, explains how you can more accurately measure shoe fit by leveraging the power of computer vision and deep learning.
I've been working in the visual algorithm development industry for over 12 years. I began my career in 3D, geometry, and physical simulation, and for many years I worked in the medical industry.
I've previously been a researcher at Disney Research Pittsburgh, and I joined Nike through the acquisition of Invertex, an Israeli startup that worked on body scanning.
A short disclaimer: I'm speaking here on behalf of myself, not on behalf of Nike. I will only be discussing products that are already deployed, and only to the extent that an expert could probably work out from observing them.
I'll be focusing on the problem of fit in footwear rather than in apparel, and I'll be giving both product and technical insights so both functions can enjoy and suffer equally.
The problem with shoe fit
What is the fit problem? Well, it's pretty simple actually. It's what happens when you shop for apparel or footwear online and you don't always know what size you need to pick.
It's pretty common for a customer to be frustrated or at least unsure of what he needs to choose, which leads to both customer and retailer pain. On the customer's side, the pain is a lack of confidence, which will sometimes prevent the purchase. In other cases, the customer may get the wrong size and then be frustrated with the process of returns and disappointed by not having the product.
The retailer will suffer pain from having to deal with returns, which are costly and a logistical headache.
So why are sizes so hard to get? Sizes aren't consistent across brands. If you look at the table below, you'll see that different brands have very different size definitions.
Sizes aren't even consistent within a single brand when you move between different products. And actually, can they even be consistent? The short answer is probably no.
The old solution to this problem is unfortunately still the current solution in most cases. The consumer is expected to take manual measurements. This is hard in apparel, but almost impossible for feet, because a foot-measuring device is very rarely found in homes.
Once the consumer has those measurements, he can simply find his size through a table like the one below. This process is both annoying and inaccurate, and since the table is usually global for an entire brand, it's quite inconsistent across products and doesn't account for an individual's anatomy, only an average one.
Anatomy is actually reduced to a single parameter. If you think about footwear, this would be the primary principal component of the foot, which is its length. And obviously, that's not the only important parameter when fitting footwear.
Another problem comes from geographic conversion issues. The size steps in one scale don't line up with the steps in another. For example, the US size scale won't match the EU size scale. So if you're trying to convert your US size when buying shoes in the EU, you might not be converting very accurately: a chart might say an eight converts to a 39, when the true equivalent is maybe a 39.3, which isn't a size you can buy.
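To make the mismatch concrete, here is a minimal sketch. It assumes the textbook definitions of the two scales (a US size step is one barleycorn, 1/3 inch; an EU size step is one Paris point, 2/3 cm) and a commonly quoted US men's offset. Real brands deviate from all of these, so the absolute numbers won't match any particular chart or the illustrative figures above; the point is the fractional remainder that a conversion table has to round away.

```python
BARLEYCORN_MM = 25.4 / 3   # one full US size step (1/3 inch)
PARIS_POINT_MM = 20.0 / 3  # one full EU size step (2/3 cm)

# Assumption: textbook US men's convention, size = 3 * last_length_in - 24.
# Brands use their own offsets, which is part of the problem.
US_MENS_OFFSET = 24.0


def us_to_eu_exact(us_size: float) -> float:
    """Convert a US size to the EU scale without rounding."""
    last_length_mm = (us_size + US_MENS_OFFSET) * BARLEYCORN_MM
    return last_length_mm / PARIS_POINT_MM


def nearest_half(size: float) -> float:
    """Snap to the nearest half size (assumed to be the finest step sold)."""
    return round(size * 2) / 2


for us in (8.0, 8.5, 9.0):
    exact = us_to_eu_exact(us)
    sold = nearest_half(exact)
    residual_mm = (exact - sold) * PARIS_POINT_MM
    print(f"US {us}: exact EU {exact:.2f}, sold as {sold}, off by {residual_mm:+.1f} mm")
```

Because one US step is about 1.27 EU steps under these definitions, the two grids never line up at regular intervals, so some rounding error is unavoidable.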
Choosing the right solution
This is where computer vision comes in. And since this is a big problem with potential gains of billions of dollars in eCommerce for solving it, there are many types of solutions that have been developed.
I'm not personally familiar with all of them, but some are displayed below.
So, how do you choose between all of these solutions? Let's start by presenting some of the considerations one can look at when choosing a solution. We'll assume that we're choosing a solution for a company like Nike that has a very large physical presence, with more than a billion customers entering stores worldwide a year, along with a very big eCommerce presence.
With such a big presence, the first consideration would probably be the scalability of the solution. This means that it shouldn't cost too much, and it should be easy to maintain and replace. It should be very easy to use and integrate into flows, it should be accurate, and if there's a physical scanner, the scanner size and footprint are important if you intend to put it in stores.
Also from a data science perspective, we should remember that small scale means little data and that will hinder the possibilities of improving the product gradually.
So, what are the trade-offs when looking to choose such a solution and find the right sweet spot?
I think about it as this spectrum, where on the left side you have very big solutions aimed at maybe laboratories with the necessity to prepare for the scan. They try to capture the perfect geometry, they sometimes need a technical operator, they're large, they need maintenance, they're likely costly, and if put in stores in a selling flow, they might add friction.
As you move down the scale, you’ll find more low-cost devices that scale better. The scan will be quicker, but there may be some limitations so you might not get the perfect geometry for the entire foot. You might work with a low-tech mat that has no batteries or electricity, and you'll probably need less maintenance.
The more you move down the scale, the closer you’ll get to your end user and deliver a solution that they can use unguided and very simply.
So to find the sweet spot here, we went with the KISS principle: keep it simple, and keep the mat hardware stupid.
Introducing Nike’s in-store scan mat
We started with an in-store solution because stores are more controlled, the technology is much simpler, and a scan can still help drive customer engagement in stores and provide customers with a fit profile they can later use to shop online at home. There's a large number of customers walking into Nike stores, so while it isn't the largest possible audience, it still provides some scale.
Another very important factor about starting in stores is that you’ll get try-on data. Try-on data is very important for developing recommendation systems, and at home, you won't get that data so easily.
So we looked for something that's easy to scale, low-cost, easily replaceable, has no batteries, or at least not in the mat, and is easy to operate by the store athlete (an athlete is the Nike term for the sales representative in the store). It should have minimum friction, and we wanted all the tech to be in a mobile device.
Many Nike stores are operated through a mobile device anyway, so what we did was integrate our system into the device and the app that was already in the stores. We added low-tech mats to make the scan more controlled and easier technologically.
The scan is operated by the in-store athlete and takes about 10 seconds to perform. It's usually enjoyable, helps the selling flow in the store, and customers like it.
With one scan you get the millimeter size of the feet, and as soon as the fit profile has been built, the athlete can scan a barcode and immediately get a size recommendation.
If the person who was scanned is a member, we attach the fit profile to his membership. If he isn't a Nike member yet, he can onboard as one immediately, and save that profile for later online shopping.
His feedback is collected after he does a try-on. This is very important for improving the system and predicting sizes for new footwear styles, and for giving that specific individual a better recommendation based on his preferences, for example, if he likes a looser fit than average.
Going from scan to recommendation
First of all, it's important to remember that we want to recommend the size for a specific person and a specific product, so that's what our system is intended to do. The inputs will be the product and the foot features. So we'll supply both anatomical and geometrical features of the feet from the scan to our recommendation system.
This is a pretty data-hungry mission because not only do we need a lot of people to be scanned, we need those same people that were scanned to try on shoes and give us feedback on their ideal size for different styles. So how do you get such data?
At first, it's pretty hard. You start by perhaps conducting an artificial experiment. You ask people such as Nike employees, users, or user study participants to try on different styles and get scanned.
After you have this initial base of data, you can basically deploy an initial product at a small scale, which will help with the recommendation of popular styles that you have feedback for.
As soon as it's deployed in a real store, you’ll start collecting real data. So you're done bootstrapping, and now you can leverage scale to improve your system, learn about new footwear styles, and scale up as you improve and become confident in the system.
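The talk doesn't describe the recommendation model itself, so the following is purely an illustrative sketch of how try-on feedback could feed a recommendation. It assumes a hypothetical `base_size` derived from foot length alone, and learns a mean size shift per style from (style, base size, chosen size) try-on records. All names and the averaging approach are assumptions, not the deployed system.

```python
from collections import defaultdict
from statistics import mean


def fit_style_offsets(try_ons):
    """Mean size shift per style from (style, base_size, chosen_size) records."""
    shifts = defaultdict(list)
    for style, base_size, chosen_size in try_ons:
        shifts[style].append(chosen_size - base_size)
    return {style: mean(values) for style, values in shifts.items()}


def recommend(base_size, style, style_offsets, user_offset=0.0):
    """Base size plus style shift plus personal preference, snapped to half sizes."""
    raw = base_size + style_offsets.get(style, 0.0) + user_offset
    return round(raw * 2) / 2


# Hypothetical feedback: the "runner" style tends to be chosen half a size up.
feedback = [
    ("runner", 9.0, 9.5),
    ("runner", 10.0, 10.5),
    ("runner", 8.0, 8.5),
    ("court", 9.0, 9.0),
]
offsets = fit_style_offsets(feedback)
print(recommend(9.0, "runner", offsets))
print(recommend(9.0, "court", offsets, user_offset=0.5))  # prefers a looser fit
```

A style with no feedback falls back to the base size, which is why the bootstrapping data matters: without try-ons for a style, there is nothing to correct the length-only guess.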
A US half-size is about four and a half millimeters, so a measurement error larger than half of that can push you to recommend the wrong size. Your error therefore needs to stay within about half a half-size step, a little over two millimeters, in almost all cases.
You want to extract features from feet. Length is pretty simple, but other features are not very easy to define. Feet come in a variety of different shapes, and finding the right way to define foot features can be challenging, so you need to experiment.
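As a small worked example of that error budget, here is a sketch that snaps a measured length to the nearest half-size step and checks whether a given measurement error changes the result. The 4.5 mm step is the approximate figure quoted above; the grid origin and the helper names are hypothetical.

```python
HALF_SIZE_MM = 4.5               # approximate figure quoted in the talk
MAX_ERROR_MM = HALF_SIZE_MM / 2  # error budget when the foot sits mid-step


def half_size_index(length_mm, grid_origin_mm=0.0):
    """Index of the half-size step nearest to a measured length."""
    return round((length_mm - grid_origin_mm) / HALF_SIZE_MM)


def recommendation_changes(true_length_mm, error_mm):
    """True if the measurement error moves the foot to a different half size."""
    measured = true_length_mm + error_mm
    return half_size_index(measured) != half_size_index(true_length_mm)


true_length = 270.0  # sits exactly on a step of this hypothetical grid
print(recommendation_changes(true_length, 2.0))  # inside the budget
print(recommendation_changes(true_length, 3.0))  # outside the budget
```

Note that the 2.25 mm budget is the best case: a foot whose true length sits near the boundary between two steps can flip with a much smaller error, which is one more reason to keep the measurement error as low as possible.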
You also need to consider if you want a 2D scan or a 3D scan. A 2D scan compromises on some of the foot features like the arch height, but a 3D scan will require you to use either more complex hardware or specific devices or to move the phone around the foot to cover the entire area, which can be problematic.
You need to consider that you’ll have perspective warps, even if you're trying to take a flat image with the device that’s parallel to the ground.
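One standard way to undo such a warp, sketched here under assumptions and not as a description of the production system, is to fit a homography from four known markers on the mat and map image pixels into mat coordinates in millimeters. The marker positions, pixel coordinates, and function names below are all hypothetical, and the solver is plain Gaussian elimination to keep the example dependency-free.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_homography(image_pts, mat_pts):
    """Homography (9 values, last fixed to 1) from 4 image points to 4 mat points."""
    a, b = [], []
    for (x, y), (u, v) in zip(image_pts, mat_pts):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b) + [1.0]


def to_mat_mm(h, point):
    """Map an image pixel to mat coordinates in millimeters."""
    x, y = point
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical marker detections (pixels) and their printed positions (mm).
image_markers = [(100, 100), (500, 120), (560, 700), (60, 680)]
mat_markers = [(0, 0), (300, 0), (300, 400), (0, 400)]
h = fit_homography(image_markers, mat_markers)
print(to_mat_mm(h, (300, 400)))
```

This only rectifies points that lie on the mat plane. Anything above the plane, such as the top of the foot, is still displaced by perspective, which is part of why a near-flat capture matters.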
Heel visibility is also an issue. The heel is not naturally visible in a picture that’s taken from waist height, and it might be covered by pants in many cases.
Overcoming computer vision challenges with deep learning
The segmentation has to be very accurate compared to many other applications. There will also be a lot of sock variability; socks can pretty much look like anything, and at scale, you'll see a huge variety. There are so many different and confusing patterns.
There will also be a lot of hard shadows and highlights. If you think a store is a controlled environment, it really isn't. There are lots of spotlights and different lighting in stores, and you can get very weird highlights and strong shadows. Sometimes these shadows look like they're extending a foot, especially if it's a dark sock color.
Your markers can be hidden, such as in the image below on the bottom right, where both the markers and the shape of the sock and foot are unintentionally hidden by a coat.
You will have to detect a very large variety of real-world problems that will find their way into your system.
This is where deep learning will come to the rescue. At first glance, this looks like a pretty classic deep learning segmentation task. It isn't very far from it, but there are a few twists. You want to engineer for accurate measurements, not just for a classically accurate segmentation.
We went with a classic CNN architecture: a downsized U-Net. A U-Net is built out of an encoder part followed by a decoder part, and encoder and decoder layers of similar resolution are interconnected using long skip connections. The input is an RGB image and the output is a heat map of the segmentation.
We used a higher than typical resolution because we wanted better accuracy. What we found to be really important is to use a special loss and special metrics for training this network, and I'll show why plain IoU isn't the best idea here.
If you look at the image of a foot on a scan mat below, you'll see that the shadow increases the foot's length a little bit, and the shadow edge is much stronger than the sock edge. So it'll be very typical for a deep learning network to segment into that shadow and increase the length of the foot. That's what would happen with typical binary cross-entropy or IoU-based losses.
What we did was put extra weight on the border of the mask and feed that weighted mask into the loss to create a weighted BCE + IoU loss. That worked much better in covering such small but important cases, which would otherwise move us off by a half size or even a full size.
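To illustrate the idea of a border-weighted loss, here is a dependency-free sketch on tiny nested-list "images". The border weight of 5, the 4-neighbor border definition, and the way the weights enter the soft IoU term are all assumptions made for illustration; the talk does not give those details, and a real training setup would use tensors in a deep learning framework.

```python
import math


def border_weights(mask, border_weight=5.0):
    """Weight map: pixels with a 4-neighbor of the other class get extra weight."""
    height, width = len(mask), len(mask[0])
    weights = [[1.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and mask[ny][nx] != mask[y][x]:
                    weights[y][x] = border_weight
                    break
    return weights


def weighted_bce_iou_loss(pred, mask, weights, eps=1e-7):
    """Weighted binary cross-entropy plus (1 - weighted soft IoU)."""
    bce = inter = union = total = 0.0
    for pred_row, mask_row, weight_row in zip(pred, mask, weights):
        for p, t, w in zip(pred_row, mask_row, weight_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            bce -= w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
            inter += w * p * t
            union += w * (p + t - p * t)
            total += w
    return bce / total + 1.0 - (inter + eps) / (union + eps)


# One-row toy example: background on the left, foot on the right.
mask = [[0, 0, 0, 0, 1, 1, 1, 1]]
weights = border_weights(mask)
uniform = [[1.0] * 8]
leak = [[0.01, 0.01, 0.01, 0.99, 0.99, 0.99, 0.99, 0.99]]  # bleeds past the edge
far = [[0.99, 0.01, 0.01, 0.01, 0.99, 0.99, 0.99, 0.99]]   # error away from the edge

print(weighted_bce_iou_loss(leak, mask, uniform), weighted_bce_iou_loss(far, mask, uniform))
print(weighted_bce_iou_loss(leak, mask, weights), weighted_bce_iou_loss(far, mask, weights))
```

With uniform weights the two mistakes cost the same, even though only the one at the edge changes the measured foot length. With the border weights, the edge mistake is penalized much more heavily, which is the behavior you want when the measurement matters more than the overall overlap.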
We trained a very greedy deep learning network, and we generalized it only as much as needed. If you give it feet that are upside down, it won't know what to do. And it's so greedy that if you give it something that looks like socks, it won't really tell the difference. The image below shows what happens when you give it shoes.
You can see it's not as confident as it is with socks, but it mostly does a decent job of segmenting those two. So we use other means to tell if it's really a foot or not. At the store, it's less of a problem because we have the in-store athlete with us.
We work hard to get amazing and challenging data sets for training. So how do we do that? Again, you don't have any real data pre-launch, so you have to come up with it to train something initially and bootstrap your system.
We spent a lot of time, money, and sweat on getting and rounding up good data. We worked with a few external vendors, we improvised a lot, and we were creative. For instance, we asked vendors to purchase different light sources and lamps to light the photos very differently so we’d have a huge variety of lighting. We also sent them huge bags of colorful patterned socks.
We used a lot of augmentation. We believe in augmenting very heavily, both in geometrical space and in color space.
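One detail worth spelling out for segmentation is that geometric augmentations must be applied identically to the image and its mask, while color augmentations touch the image only. The sketch below shows that split on tiny nested-list images; the specific transforms and ranges are assumptions, not the ones used in training. It only flips horizontally, in keeping with a network that isn't expected to handle upside-down feet.

```python
import random


def augment(image, mask, rng):
    """image: rows of (r, g, b) floats in [0, 1]; mask: rows of 0/1 labels."""
    # Geometric: the same horizontal flip is applied to image and mask.
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    # Color: brightness gain and per-channel tint change the image only.
    gain = rng.uniform(0.6, 1.4)
    tint = [rng.uniform(0.9, 1.1) for _ in range(3)]
    image = [
        [tuple(min(1.0, max(0.0, c * gain * t)) for c, t in zip(pixel, tint))
         for pixel in row]
        for row in image
    ]
    return image, mask


rng = random.Random(0)  # seeded so a run is repeatable
image = [[(0.2, 0.4, 0.6), (0.9, 0.9, 0.9), (0.0, 0.0, 0.0)] for _ in range(2)]
mask = [[1, 0, 0], [1, 0, 0]]
aug_image, aug_mask = augment(image, mask, rng)
print(aug_mask)
```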
Deep learning is awesome, and so were our results. All the images you see on the left below are actual heat maps, not post-processed segmentations. The confidence is very high, and there are very few errors, if any.
In the top left, even though the shadow is very strong, it still gets the foot extremely right, except for maybe the bottom part next to the heel. The heel is less important to us. And if you look at the image, you really can't tell where the shadow ends and where the foot begins. But this is still a very good result and is more than sufficient for a recommendation.
Even with noisy socks like the white one, or patterned socks that were pretty similar in color to the mat, we had no problem.
We trained the system on other backgrounds as well and it does very well on carpets and floors with different socks.
Dealing with ‘in the wild’ images
So what could possibly go wrong in such a system? Apparently, it's a jungle out there, and they’re not called ‘in the wild’ images for nothing. People will do a bunch of crazy things, mostly unintentionally, but sometimes intentionally. And here are just a few examples.
You can see different occlusions in the top left, two scan mats in one image in the top middle, and an image that has a partial scan mat and partial feet, which happens a lot.
The bottom left image has a bad standing position, very hard socks, and very hard shadows all together in one image. The next image with the white socks has socks that are completely loose and won't give a decent measurement.
The next one has shoes on, and there's one trying to trick the system by putting just one foot on the scanner. The last one is simply standing on the scan markers.
There are literally hundreds of different errors we've seen people make. So how do you prepare for all of these errors?
Try to be forgiving. You need to detect these errors, but they can be severe or mild, and you need to balance your response accordingly. Handle as many cases as you can without bothering the user; otherwise, your users will get tired of getting flagged. Flagging is what we call getting back to the user with a message saying that something went wrong and what it was.
You need to let your user know what went wrong, which means using computer vision, and writing more computer vision code, to explain what went wrong.
You also need to be as robust and forgiving as you can because you can't have too many false positives which will annoy your users and stop them using your system. And that's the case even if those users are in-store athletes. They won't like having these errors in front of customers.
It's not enough to say a marker is hidden. It's more useful to say whether clothing or shadows have occluded the markers, because these are two different situations the user has to deal with.
Once you've done that, you also need to make sure you have the minimum possible amount of false negatives. A false negative is when you’ve missed a real error and you're giving a bad measurement, and then consequently a bad recommendation. So these have to be absolutely minimized, even if that means a few more false positives.
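The trade-off between false positives and false negatives can be made explicit when calibrating a flag. The sketch below assumes a hypothetical error detector that outputs a severity score per scan, plus a labeled calibration set, and picks the highest threshold that keeps the false-negative rate within a chosen limit. The function names, scores, and labels are illustrative.

```python
def pick_flag_threshold(scores, is_real_error, max_fn_rate=0.01):
    """Highest threshold (flag when score >= threshold) within the FN limit."""
    error_scores = sorted(s for s, e in zip(scores, is_real_error) if e)
    if not error_scores:
        raise ValueError("need at least one real error to calibrate")
    allowed_misses = int(max_fn_rate * len(error_scores))
    # The lowest-scoring `allowed_misses` real errors are allowed to slip through.
    return error_scores[allowed_misses]


def error_rates(scores, is_real_error, threshold):
    """Return (false-negative rate, false-positive rate) at a threshold."""
    pairs = list(zip(scores, is_real_error))
    n_errors = sum(1 for _, e in pairs if e)
    n_clean = len(pairs) - n_errors
    missed = sum(1 for s, e in pairs if e and s < threshold)
    false_alarms = sum(1 for s, e in pairs if not e and s >= threshold)
    return missed / max(n_errors, 1), false_alarms / max(n_clean, 1)


# Hypothetical calibration set: severity score and whether the scan was truly bad.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [False, False, False, True, False, True, True, True]

strict = pick_flag_threshold(scores, labels, max_fn_rate=0.0)
lenient = pick_flag_threshold(scores, labels, max_fn_rate=0.25)
print(strict, error_rates(scores, labels, strict))
print(lenient, error_rates(scores, labels, lenient))
```

On this toy data, refusing to miss any real error costs one false alarm in four clean scans, while tolerating one missed error removes the false alarms entirely. Where to sit on that curve is the product decision described above.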
You can see two different shadow cases on the top row of the image below. One is too severe, because you won't be able to tell where the edge of the foot is, and the system has to know to say so. The other one is not that severe, and a robust system should be able to give a recommendation.
In the second row, there's a marker occlusion because of clothing. Again, one case is very severe, where a lot of the foot and many markers are occluded. In the other case, the system should be able to cope.
In the last row, we have two socks with strings coming out. In one, the sock really changes the shape of the foot, and you wouldn't want to use it to get measurements. In the second one, it's not that harsh, and you would want it to pass. So this is a very delicate balance you have to play with and get right.
So that's a very nice system for stores. We've deployed it in over 100 stores already and it's scaling.
But what about back home?
I can't really say too much about this because it's still in the works, but what I can say is that harder segmentation is the least of our problems. Getting the physical scale of the image is a difficult product and technological challenge, because if you don't have the mat, you don't have an immediate reference for scale. You either need to use a reference object or get the scale from something else, like a depth sensor on a phone.
Many more things can and will go wrong. People won't stand correctly. With no in-store athlete, they'll do more things you hadn't expected. They'll stand on very weird carpets, have their pets walk in, wear shoes, or do other weird things that your system may or may not be built to handle. So there's a lot of work until you can get such a system to a level of production.
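The two options for recovering scale come down to simple arithmetic, sketched here under simplifying assumptions: the image is already rectified, the reference object lies in the same plane as the foot, and the camera follows the pinhole model. The function names and example numbers are hypothetical.

```python
def length_from_reference(length_px, reference_mm, reference_px):
    """Scale a pixel length using an object of known size in the same plane."""
    if reference_px <= 0:
        raise ValueError("reference must span a positive number of pixels")
    return length_px * reference_mm / reference_px


def length_from_depth(length_px, depth_mm, focal_length_px):
    """Pinhole model: real size = pixel size * depth / focal length (in pixels)."""
    if focal_length_px <= 0:
        raise ValueError("focal length must be positive")
    return length_px * depth_mm / focal_length_px


# Hypothetical numbers: an A4 sheet's long edge (297 mm) spans 1188 px.
print(length_from_reference(1040, reference_mm=297.0, reference_px=1188))
# Hypothetical depth reading of 1 m with a 3200 px focal length.
print(length_from_depth(800, depth_mm=1000.0, focal_length_px=3200))
```

Either way, the scale estimate feeds straight into the measurement: a 1% error in the reference or the depth turns into roughly 2.6 mm on a 260 mm foot, which is already more than half of a half-size step.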
Additionally, you won’t have try-on data, which will make training recommendations very hard. If you've deployed a store system as we have, you can use the try-on data from the stores to make recommendations for the home system.
I hope you’ve enjoyed this talk, and I hope the technology we're developing will help you and your families find great fits when you shop online at Nike.