AI is practically built for e-commerce image editing (thank you MIT, Google, Microsoft, and Adobe)
If you’re not already using AI to work smarter, you’re falling behind. Automation is one of the greatest efficiencies to be found in any workflow, and that’s especially true in expensive yet repetitive work like product photography, outsourced photo editing, and post-production.
In fact, today’s AI is almost perfectly aligned with the needs of e-commerce image editing. Why?
- The best minds in tech are intensely focused on AI image analysis
- Standard input + standard output = efficiencies you can exploit
- You can use the cloud to autoscale AI
Before I awe (or bore) you, here’s some quick background on what Pixelz is already doing with AI. It’s not just theory to us, or something to look at in the future. The development of AI is driving our company—it was key to our recent round of Series A funding—and has been for a long time.
Pixelz is 20% automated (and will be 50% by 2019).
Update 2021: Pixelz photo retouching service is now 64% automated.
We’re deep believers in standardization, automation, and quality prioritization. Why? Because lean production has proven, time and time again and in virtually every industry, that it delivers better quality faster and more efficiently.
That’s why we built S.A.W.™, a Photoshop assembly line for product image editing. S.A.W.™ breaks image editing down into small steps that are completed by a blend of specialist human editors, AI, and scripts.
In fact, 20% of our image editing is automated—and that’s steadily increasing.
As our CEO, Thomas Kragelund, recently told Forbes, “We estimate 50% of our post-production will be automated by 2019, which is possible because we’ve been analyzing a massive amount of data. We’ve been tracking our Photoshop activity for years — all retouching happens in our proprietary Photoshop extension — so we’ve got data on millions of images, and that knowledge guides us. Image editing is complex, but breaking it down into hundreds of microsteps allows us to train AI precisely. The data practically makes decisions for us, and that’s extremely exciting.”
So how does it work? How can AI be integrated into post-production?
First, you need something like S.A.W.™. Because we break image editing down into component steps, we’re able to isolate individual processes and train intelligences on contained tasks. Controlling the input is critical: it limits exceptions and other “special” cases that are difficult for an AI to handle—just as they are for humans.
For example, in a product photo the product is usually in the center of the image, and we typically have an idea what kind of product a customer will be uploading. That makes it much easier to train an AI to draw a mask around the product. We can also use humans to identify edges before sending an image to the AI, and to validate results immediately after a step is completed—further training the AI.
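To make the idea concrete, here’s a minimal sketch of what one such microstep could look like in code. It’s illustrative only; names like `run_microstep` and `training_queue` are hypothetical, not part of S.A.W.™ itself:

```python
# Minimal sketch (not Pixelz' actual API) of one "microstep": an AI proposes a
# result for a narrow task, a human editor validates or corrects it, and any
# correction is queued as future training data.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StepResult:
    image_id: str
    ai_output: dict                  # e.g. a proposed mask or classification
    approved: bool
    correction: Optional[dict] = None

training_queue: list = []            # corrections collected here retrain the AI later

def run_microstep(image_id: str,
                  ai_model: Callable[[str], dict],
                  human_review: Callable[[str, dict], StepResult]) -> dict:
    """AI handles the contained task; an editor validates immediately after."""
    proposal = ai_model(image_id)
    result = human_review(image_id, proposal)
    if not result.approved:
        training_queue.append(result)        # human correction feeds back into training
        return result.correction or proposal
    return proposal
```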
Let’s start with a real life example.
Using AI to remove moles (It works! But nobody wants it...)
One of our earliest functional AIs was trained to remove moles during skin retouching. We used a traditional algorithm to detect skin, another to detect “candidates” (potential moles to remove), and then an AI to classify those candidates.
Mole candidates were determined by color difference from the surrounding skin. Seems straightforward—and that’s why existing algorithms handle it—but the problem we encountered was that loose hair triggered false positives. The job of the AI was to classify moles and not-moles, and to train it we fed it 65,000 images of moles and not-moles, sorted manually (fun job!).
After the moles and not-moles were properly classified, we used a standard Photoshop script to remove them.
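For the curious, here’s a rough sketch of that flow using OpenCV. The skin thresholds, candidate filter, and classifier are placeholders, and inpainting stands in for the Photoshop removal script we actually used, so treat it as an illustration of the pipeline rather than production code:

```python
# Illustrative mole-removal pipeline: skin detection -> candidate detection by
# color difference -> classification -> removal. Thresholds are made up.
import cv2
import numpy as np

def detect_skin(image_bgr: np.ndarray) -> np.ndarray:
    """Crude skin mask via an HSV range threshold (illustrative values)."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    return cv2.inRange(hsv, (0, 30, 60), (25, 180, 255))

def find_candidates(image_bgr: np.ndarray, skin_mask: np.ndarray) -> list:
    """Find small dark spots on skin: potential moles (or stray hairs)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    dark = cv2.inRange(gray, 0, 90)                       # darker than surrounding skin
    spots = cv2.bitwise_and(dark, skin_mask)
    contours, _ = cv2.findContours(spots, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) < 400]

def remove_moles(image_bgr: np.ndarray, classify_mole) -> np.ndarray:
    """classify_mole is the trained mole/not-mole classifier; inpaint the moles."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    for (x, y, w, h) in find_candidates(image_bgr, detect_skin(image_bgr)):
        patch = image_bgr[y:y + h, x:x + w]
        if classify_mole(patch):
            mask[y:y + h, x:x + w] = 255
    return cv2.inpaint(image_bgr, mask, 3, cv2.INPAINT_TELEA)
```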
The project worked! We successfully automated mole removal—but in the meantime, trends changed, and nowadays most of our clients prefer a natural look complete with moles. It was mostly a research project anyway, and its success propelled us further down the AI path. It also highlights something important to remember: AI doesn’t always mean a human is replaced. In this case, we developed an AI tool that a human photo editor uses to work more efficiently. Sometimes the AI’s work still needs to be tweaked for perfection, but the net result is a gain in efficiency.
Using AI as traffic control (Toothbrushes to the left, chairs to the right!)
In the Pixelz smart factory, the first AI process is focused on classifying images after upload: basically, looking at a photo and figuring out what’s in it. Much like in our “mole or not-mole” example, but with far broader parameters. Is there a model? A mannequin? Shoe? Bottle? Table? Etc.
Maybe that sounds simple to you (“My two year old can tell whether there’s a person in a photo or not!”), but it’s actually one of the biggest challenges in artificial intelligence. The human brain excels at interpreting visual and auditory input, and what children intuit without seeming effort can stump supercomputers. Almost all of the latest, coolest, and most popular AI revolves around image and audio recognition.
Self-driving cars? That’s image interpretation, and lives literally depend on it (and radar and lidar, hopefully).
Alexa? Siri? Google Home? Speech recognition.
Translating street signs with your phone and a camera app? Image classification. Spotting warning signs of cancer in x-rays? Image classification. You get the idea.
So yes, Pixelz’ AI primarily classifies images (using the “Inception” architecture in a network called “GoogLeNet,” designed by—you guessed it—Google). The first time it does so is during a stage we call “image preparation.”
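If you want to play with the same family of network, here’s a hedged example using the pretrained GoogLeNet that ships with torchvision. It’s trained on ImageNet’s generic categories, not on our product taxonomy, so it only illustrates the classification step:

```python
# Illustrative only: classifying a photo with GoogLeNet (Inception v1) from
# torchvision, pretrained on ImageNet. Pixelz' production model is trained on
# its own categories (model, mannequin, shoe, bottle, etc.).
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
model.eval()

def classify(path: str, top_k: int = 3):
    """Return the top-k ImageNet class indices and confidences for one image."""
    batch = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)[0]
    return torch.topk(probs, top_k)
```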
Image preparation is the first step all images go through, and it determines the future steps for each image. The bulk of it is categorization: as our COO Jakob Osterby says, “If you had the same input every time, it wouldn't be a problem because you know the object. But it could be a chair, it could be a toothbrush, or it could be a jacket. If it’s a jacket, there’s a big difference between leather and fur.”
Fortunately for us, we do have some expectations regarding the input, because customers set up specifications for products ahead of time, but even so there’s still a lot of variance. “If there’s a prop, if there’s not a prop, that’s a huge difference in our workflow,” says Jakob. “Maybe a template doesn't have a retouching step. But if there's a prop in front of the object—could be a hanger, a fishing line, or a bag stand—we're going to remove it.” This adds complexity to our image preparation process, for which we have developed over a dozen in-house image classification outputs: Prop Detection, Model Detection, Skin Detection, Image Complexity, and more. For a bulk image editing service, understanding our images is important to us.
The AI and human hybrid (Not a cyborg)
AI not only sorts images by type during image prep, it also assigns complexity scores based on things like facial recognition, background contrast, points in a layer mask, and the presence of skin.
That score helps to determine costs, timelines, and which AI or specialist editor an image is routed to. “Having AI in classification, classifying something that helps another AI model algorithm perform something later on, I think that's a beautiful thing,” says Jakob.
Let’s look at a quick but realistic (and important!) example.
One of the things the AI detects is contrast against the background. To take our earlier example, a black leather jacket will have higher contrast against a white background than a white fur coat will. The black leather jacket may be routed directly to an AI for automatic background removal, while the low contrast white-on-white image is routed to an editor to draw a trimap, then to an AI for masking, then to another editor to polish off the mask.
The data we record during the white-on-white background removal is stored and used to help train the AI later, with the goal of improving its ability to remove the background on low contrast images. Update 2021: our AI now works very well on low contrast images.
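Here’s a toy version of that scoring-and-routing idea. The weights, thresholds, and route names are invented for illustration; the real system is trained on our production data:

```python
# Toy complexity score and routing. Signals mirror the ones mentioned above
# (face, skin, mask points, background contrast); numbers are made up.
def complexity_score(has_face: bool, has_skin: bool,
                     mask_points: int, background_contrast: float) -> float:
    """Higher score = harder image. background_contrast is normalized to 0..1."""
    score = 0.0
    score += 2.0 if has_face else 0.0
    score += 1.5 if has_skin else 0.0
    score += mask_points / 100.0              # more mask points = more work
    score += 1.0 - background_contrast        # white-on-white scores high here
    return score

def route(score: float) -> str:
    """Black leather jacket on white -> low score -> straight to AI masking."""
    return "ai_auto_mask" if score < 2.0 else "editor_trimap_then_ai"
```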
Trimaps to find the way (Masking for the win)
That’s a pretty simple example, and there’s lots of software that can do a good job when images have sharp edges and high contrast (including Photoshop itself). Where it gets more difficult to draw a mask is when colors are subtle, products have lots of edges (like a mesh chair back, or jewelry chain), and when models are involved.
Are you ready for some shocking news? Most models have hair. Lots of it, artfully styled and dramatically tossed.
Talk to any product image retoucher, and they’ll tell you that masking around hair is one of the most time consuming parts of their job. Fine strands flying everywhere, crossing lines and adding tons of soft new points to draw around.
Solving masking is quite possibly Pixelz’ most important challenge. “As it is today, 40% of our time is spent on masking alone,” says Jakob. Update 2021: now, thanks to AI, it is down to 35% including advanced masks and paths (where our customers require several masks/paths for the image details).
“I think the big push in machine learning is going to be a hybrid model between AI and human,” Jakob continues. “For masking, the idea is that when the image comes in we have somebody—right now an editor, but we’re training an AI—draw a rough path around it. It takes two seconds in the interface, and it helps the algorithm detect the edge and distinguish between props and the actual product. Then we push it server side, AI removes the background, and it’s sent back to an editor that validates and maybe refines it. Over time, the AI should be able to do more and more.”
Drawing a rough path around the product is part of generating a “trimap.” A trimap in our system breaks an image into three segments: the foreground (keep it!), the background (delete it!), and the border area outlining where the mask will be drawn.
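One simple way to build such a trimap, assuming you already have a rough binary mask rasterized from the editor’s path, is with morphological erosion and dilation. This is a sketch under that assumption, not our production code:

```python
# Build a trimap from a rough mask: erode for "definite foreground", dilate for
# "definite background"; the band in between is the unknown region to refine.
import cv2
import numpy as np

def trimap_from_rough_mask(rough_mask: np.ndarray, band: int = 15) -> np.ndarray:
    """rough_mask: uint8 array, 255 inside the rough path, 0 outside."""
    kernel = np.ones((band, band), np.uint8)
    sure_fg = cv2.erode(rough_mask, kernel)      # definitely product: keep it
    sure_bg = cv2.dilate(rough_mask, kernel)     # outside this: definitely background
    trimap = np.zeros_like(rough_mask)
    trimap[sure_bg > 0] = 128                    # unknown border band
    trimap[sure_fg > 0] = 255                    # foreground
    return trimap                                # 0 = background, 128 = unknown, 255 = foreground
```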
Gains, gains, and more gains (or how to mask a million images in 30 seconds)
Okay, that sounds cool. But why go to so much trouble? What kind of tangible gains can we REALLY see from AI automation?
The answers are pretty astonishing. It’s a work in progress and broad metrics are hard to quantify, but for AI masking alone we’ve seen:
- 15x faster masking
  - There’s a lot of variance, since masking time is product specific. A human might take anywhere from 20 seconds to 30 minutes to mask a product, while AI ranges from near instantaneous to 1-2 minutes (or longer for both humans and AI on something like a bicycle). At present, most AI masks need additional adjusting afterward—but the human finishing off the mask has a huge head start! Update 2021: now 50% of our AI-made masks do not need additional work! Which leads to...
- Savings of 15% on production costs (projected for 2018)
  - We spent a lot on research and development, but now that AI masking is up and running we’re seeing significant ROI (remember, masking accounts for 40% of all our retouching time). As an added bonus, our editors have more time to become proficient at more advanced retouching tasks, adding more value for our clients.
- Infinite scale
  - Our neural networks run in the cloud. Additionally, more image volume means more accurate AI, because each image edited helps train it.
On that last point—and this is where you really begin to see the power of AI—we’ve made our masking AI autoscaling on Amazon servers. As Pixelz CTO Janus Matthesen puts it, “We automatically spin up new servers as we get more images and scale down again when we are done. That means we can decide how fast we want our images to move through the step by simply spinning up more servers. The AI spends about 30 seconds per image, so in theory (with enough servers), we can complete any AI Mask workload in 30 seconds.”
That's right. Whether 10 images, 10,000, or 1,000,000, they can all be masked in 30 seconds. Update 2021: we can actually do it in 12 seconds now, with better precision and higher quality. There is a tradeoff between speed and cost, though, and we don't always need our AI to work that fast, so we throttle the speed to meet our production demands.
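The back-of-envelope math is simple: if each (hypothetical) worker handles one image at a time at roughly 30 seconds per image, the number of workers to spin up is just the queue size divided by how many images one worker can finish within your target turnaround. A tiny sketch:

```python
# Back-of-envelope autoscaling math. Assumes one image at a time per worker,
# at ~30 seconds per image; worker counts here are illustrative, not our infra.
import math

SECONDS_PER_IMAGE = 30

def workers_needed(queue_size: int, target_seconds: int) -> int:
    """How many workers to spin up so the whole queue clears in target_seconds."""
    images_per_worker = max(1, target_seconds // SECONDS_PER_IMAGE)
    return math.ceil(queue_size / images_per_worker)

# 1,000,000 images in 30 seconds -> 1,000,000 workers (in theory, with enough servers)
# 1,000,000 images in one hour   -> 8,334 workers
print(workers_needed(1_000_000, 30), workers_needed(1_000_000, 3600))
```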
Goodbye, bottlenecks.
Not all automation is intelligent (Scripts aren’t AI!)
Let’s do a quick rewind before we dive into some of the more technical aspects.
First, start with a basic fact: not all automated processes are intelligent. To truly be intelligent, a process must be capable of learning.
For example, we have many automated processes that are not intelligent. They’re bots, scripts that work on tasks like “Apply Path,” “Apply Mask,” “Auto Preparation,” “Auto Finalize,” and “Auto Stencil.” They’re sophisticated, but limited in scope by their author’s imagination. They’re not going to get better at their tasks without human intervention: for example, when a scripter sees we need new handling for a different product type and goes in to manually modify the script.
That type of unintelligent bot is where Pixelz began automation years ago.
“The origins of it?” Jakob says. “We had a step where we only needed to import a layer to a step. We had people sitting and doing that, just pushing a button and then waiting for a script to run, ten or fifteen seconds. It was not only time consuming, but a hell of a boring job.”
AI is different. An artificial intelligence is capable of learning, primarily by guessing and learning from the results of trial and error.
How to scale with AI (Use the cloud, duh)
An AI doesn’t become intelligent without education, and for that you need heaps and heaps of data and a lot of computing power. Just ask our CTO, Janus Matthesen. “When we train our AI models, we need to train on millions of images—a quite time consuming process,” says Janus. “To find the optimal configuration, we also need to tweak weights and hyperparameters to find the right combination. We have moved this work to the cloud, and we have been able to scale it and test multiple configurations at the same time.”
“We used to use local servers with many GPUs, but now that Amazon has released the P2 and P3 EC2 instances, we have moved this work to the cloud,” says Janus. Update 2021: we have now extended our infrastructure to Microsoft Azure in addition to AWS, and recently acquired a new on-premises GPU server with eight very powerful Nvidia RTX 3090 GPUs. “Bringing down the time we use to find the right configurations and scaling our AI model training is a competitive advantage for us. We recently joined Nvidia's Inception program for AI startups, where we became aware of the Nvidia GPU Cloud, which runs well on top of P3 instances.”
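As a hedged illustration of what “testing multiple configurations at the same time” can look like, here’s a tiny sketch that fans a hyperparameter grid out into independent jobs, one per cloud GPU instance. The parameter names and the job submission are placeholders, not our actual training setup:

```python
# Fan a hyperparameter grid out into independent training jobs so they can run
# in parallel on separate GPU instances. Values below are purely illustrative.
from itertools import product

search_space = {
    "learning_rate": [1e-3, 3e-4, 1e-4],
    "batch_size": [32, 64],
    "backbone": ["efficientnet-b3", "unet3plus"],
}

configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]

for i, cfg in enumerate(configs):
    # In practice this would submit a training job to a cloud GPU instance
    # (one worker per configuration) rather than just print the config.
    print(f"job-{i:02d}", cfg)
```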
AI just keeps getting better and better (Capsule Networks recognize pose)
As we’ve stated before, Pixelz plans to automate 50% of post-production by the end of 2019. (Update 2021: We succeeded, and we're not slowing down. We can now offer retouching in less than 3 hours, and for Flow customers in as little as 10 minutes!)
We could easily get there just by applying our existing AI approach to additional areas (more specific retouching jobs like the mole removal example, or auto-cropping, or selecting primary images for marketplaces, etc.), but in truth we’re anticipating more revolutionary advancements.
For example, a recent major development resulted from the publication of two October 2017 research papers. The AI we (and everyone else) are using are “Convolutional Neural Networks” (CNN). CNN are what people usually mean when they refer to “Deep Learning” or “Machine Learning.” (Update 2021: We are way beyond these ancient concepts from 2017 :D Right now our work focuses on improvements on EfficientNet and Unet3+ networks. We are also keeping an eye on the developments of transformer networks applied to computer vision.)
I don’t want to get too deep in the weeds, but simplistically, CNN do a lot of classifying and counting of objects without identifying their relationship in 3 dimensions. As a result, a major challenge for CNN is recognizing similar objects with different poses (position, size, orientation).
What that means for product photography is that CNN aren’t always great at recognizing products that have been rotated, have atypical zoom, or are photographed from atypical angles. Or, more commonly, products that don’t have fixed shape—like necklaces, with their wide variety of styling, chains, and charms. We’re able to mitigate those challenges by controlling the input and training with lots and lots of images.
So it rocked the AI world when one of the godfathers of Deep Learning, Geoffrey Hinton, introduced a new type of neural network that’s being called a “Capsule Network” (CapsNet). Capsule networks work in large part by identifying pose!
The images they used for proof of concept were a set of toy figurines, photographed individually against a plain background from a variety of angles. Overhead, straight-on, side view, etc. Sounds a lot like product imagery, doesn’t it?
Identifying pose in itself is huge, and valuable, but a CapsNet also needs much less data for training. That’s encouraging for us, as it would allow us to adapt CapsNet AI to more areas more quickly, and hopefully an increased understanding of image content leads to even greater precision.
We’re experimenting with CapsNet right now. It may or may not pan out for us, but the constant advancement in AI technology is extraordinarily exciting to be a part of.
How to use AI in your retouching workflow
I hope you enjoyed learning about how Pixelz uses AI to retouch product images, and I hope you’re convinced of its value (and a little less scared of our future SkyNet overlords).
If you’re looking to incorporate AI into your own local retouching workflow, feel free to ask us questions! Comment, email, or hit us up on social media. We like to hear the challenges other people are encountering, and problem solving is fun. We’re not information hoarders.
Of course, not everyone has the time and resources to dive into AI headfirst. It probably doesn’t make sense if you’re a brand or retailer—you likely don’t have the image volume to justify a Deep Learning dev team for post-production. We do, so you can always test drive our system first.
Thanks for reading!