The Robot Industry Podcast – Creating Better 3D Cameras

2021-04-28

Our CTO, Henrik Schumann-Olsen, was interviewed on the Robot Industry Podcast hosted by Jim Beretta. In this 30-min podcast recording, they discussed why some of the automation challenges are not solved and how to tackle them using 3D vision technologies.

The podcast covered the following questions:

What is the difference between 2D and 3D vision?
What is a point cloud?
Where is the 3D algorithm created, and how do we make it better?
How did Zivid start as a 3D vision company and why?
What are the main pick and place challenges, and how can they be solved with good 3D data?
3D hardware and software, which one is harder?
Is optical glass important for 3D vision?
What does true-to-reality mean in 3D vision?
Why does an on-arm robot mounting matter?

You can find the answers to the questions by listening to the full episode here.

Transcript:

At Zivid, we are creating a human-like vision for robots!

Jim:
Hello everyone, and welcome to the A3 robot industry podcast. I'd like to welcome the listeners from all over the world, and I'm thrilled to have Henrik Schumann-Olsen from Zivid. If you don't know Henrik or if you've not heard about Zivid, they manufacture one of the leading 3D cameras that enables robots to do what they do, from pick and place to bin-picking and machine tending, for some examples. Zivid and Henrik are located in Oslo, Norway.

So let me tell you a little bit about Henrik. He's an experienced founder with a demonstrated history of working in the industrial automation industry. He led the 3D vision company Zivid from its incubation at SINTEF, which is one of Scandinavia's largest independent research organizations, to a fast-growing company in the 3D robot and automation field internationally. He's a seasoned and awarded senior researcher in machine vision, pattern recognition, 3D cameras, optics, robotics programming, and parallel processing. He graduated from the Norwegian University of Science and Technology called NTNU in engineering cybernetics and robotics.

Thanks, Henrik, for coming on the A3 robot industry podcast. This will be a bit different because much of what you supply is kind of on the visual side, best shown on a webinar or YouTube or at a trade event, so I'm looking forward to the conversation.

Henrik:
Thank you for that nice introduction, and thank you for having me. This is going to be great to talk about the thing I'm so passionate about, namely 3D vision.

Jim:
Before we dive into the industry and the tech, can you answer a few questions for those in the audience that may not be exclusively vision experts and might be headed on a walk or at the gym or whatever so we're going to have to visualize some of the things?
My first question is: can you explain what a 3D camera is and why is it different from a 2D camera?

Henrik:
Sure, I will try to do it very visually. Think of a 2D camera we all know. That's a flat representation of the world. So you have the X and Y data, like a plane, so to say. That is actually a projection. You stand and take an image, and then that is a projection of what you see. In a 3D camera, we introduce the depth dimension. If you think of it, for every pixel, you also have the depth, the distance from the camera origin to the objects you are imaging, and in that sense, you have X, Y, and Z, where the Z represents depth. It's more of a true representation of what you actually see because we humans see in three dimensions, and the 3D camera gives a picture represented in three dimensions. (Learn more about 2D vs. 3D)
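
As a minimal sketch of the "depth for every pixel" idea, the Python below turns a small depth image into X, Y, Z points. It assumes an ideal pinhole camera and that each depth value is the Z distance along the optical axis; the function name `depth_to_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical illustration values, not parameters of any Zivid camera.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def depth_to_points(depth: List[List[Optional[float]]],
                    fx: float, fy: float,
                    cx: float, cy: float) -> List[Point]:
    """Back-project a depth image into XYZ points (pinhole model).

    depth[v][u] is the Z distance (here in mm) for pixel column u, row v.
    None marks a pixel with no valid measurement.
    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    All four are assumed values chosen for illustration.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None:
                continue  # no depth here, so no 3D point
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# A 2 x 3 depth image: a flat background at 1000 mm, one closer pixel
# at 500 mm, and one pixel where the camera saw nothing.
depth = [[1000.0, 1000.0, None],
         [1000.0, 500.0, 1000.0]]
points = depth_to_points(depth, fx=500.0, fy=500.0, cx=1.0, cy=0.5)
for p in points:
    print(p)
```

A 2D image would stop at the pixel grid; here every valid pixel also carries a position in space, measured from the camera origin.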

Jim:
Thank you for that. Also, I'm going to ask you to explain something to those in the audience who may not know, because I think it's important to the podcast: what is a point cloud?

Henrik:
That is what we call a 3D picture. A point cloud is a set of points in space, so it's a digital representation. If you think of the image we talked about earlier, for every pixel in the sensor, instead of just having the color information that you would have in a regular 2D color image, you also have X, Y, and Z coordinates. Those coordinates give the distance and position of surface points of the objects that you are imaging. Together, all of these points make up a point cloud: all the points on the surfaces of all the objects that you are taking a picture of.

Jim:
That's a great explanation. Thank you, and I was also going to ask you another kind of basic 101 3D vision question. Where is the vision algorithm created, and how do we make it better?

Henrik:
That is a good question. Of course, there are several types of algorithms and one thing that we do at Zivid is the vision algorithm to create a point cloud. Then you go into the measurement technology and how to do 3D imaging and so on. That is an algorithm itself that creates a point cloud as an output. When you have the point cloud, then depending on what you're going to use that point cloud for, you need to make an algorithm to do that.

Let's say you want to pick objects. You have a robot, and you want to pick objects. And then you have a 3D camera, for instance, from Zivid or some others. You take an image, you get a point cloud. Then you want to try to detect objects in the point cloud, and you can, for instance, have a CAD model of the object, and then you search in the point cloud, try to match that CAD model with the points in space. If you get a match, you can tell that match to the robot, and it can go in and pick it.
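
The matching step can be sketched very roughly as below. This is a deliberately simplified illustration, not how Zivid's customers necessarily do it: it assumes the CAD model has already been sampled into points, it only searches over a handful of candidate translations (real systems also search over rotation and refine the pose), and the names `match_score` and `find_object`, the tolerance, and the score threshold are all hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def match_score(model: List[Point], scene: List[Point],
                offset: Point, tol: float) -> float:
    """Fraction of model points that, after translating the model by
    `offset`, lie within `tol` of at least one scene point."""
    hits = 0
    for mx, my, mz in model:
        p = (mx + offset[0], my + offset[1], mz + offset[2])
        if any(math.dist(p, s) <= tol for s in scene):
            hits += 1
    return hits / len(model)


def find_object(model: List[Point], scene: List[Point],
                candidates: List[Point], tol: float = 1.0,
                min_score: float = 0.9) -> Optional[Tuple[Point, float]]:
    """Return (offset, score) for the best candidate translation,
    or None if no candidate reaches `min_score`."""
    scored = [(c, match_score(model, scene, c, tol)) for c in candidates]
    best = max(scored, key=lambda item: item[1])
    return best if best[1] >= min_score else None


# Model: four corners of a 10 mm square, standing in for a sampled CAD model.
model = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (10.0, 10.0, 0.0)]

# Scene: the same square near (50, 20, 300) with slight noise, plus clutter.
scene = [(50.2, 20.0, 300.1), (60.0, 19.9, 300.0),
         (50.0, 30.1, 299.8), (59.9, 30.0, 300.0),
         (120.0, 80.0, 310.0)]

# Candidate positions to test (a coarse grid at a fixed depth).
candidates = [(x, y, 300.0) for x in (0.0, 50.0, 100.0)
              for y in (0.0, 20.0, 40.0)]

result = find_object(model, scene, candidates)
print(result)
```

The returned offset is what would be handed on to the robot as the pick position. If the point cloud is noisy or distorted, fewer model points find a neighbor within the tolerance and the match is rejected.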

The algorithm to do all of that is another vision algorithm, and it plays the role of the human brain. You know, our eyes see the world, and then we have the brain that tells the hand where to go and pick something, tells it where it is in space, and so on. So there are several algorithms, but we at Zivid work on creating the best point clouds. Our customers then use that point cloud to do their so-called vertical integration, whether that is picking, assembly, dimension control, inspection, or guiding robots.

Jim:
How did you get started in the 3D vision space? Like you went to school, this was part of your education?

Henrik:
I did a robotics education. We were seven students who created a robot that was competing in a competition in France. We had to make a robot for the competition, where two robots competed at bursting balloons. A friend and I did the vision system and the brain. So I got inspired by the field in the early days. That was in the year 2000.

After school, I started working at SINTEF, where I joined the vision group, and we worked very closely with the robotics group in the same research institute. I was part of building up a group that was particularly oriented towards robot vision. We then worked with all types of 3D camera systems like time-of-flight, laser, stereo vision, and structured light, which is the core technology behind the Zivid camera.

So it was around 15 years of research before a colleague and I started Zivid in 2015.

Jim:
And your motivation was to make the best 3D camera available?

Henrik:
We did lots of the analysis part in regards to the vision algorithm we talked about. And then we bought 3D cameras or systems online, and it was always a little problematic because the quality was not as good as we wanted. Of course, there existed some high-end systems like the metrology scanners and so on but they were also extremely expensive and very slow.

So mostly, we chose cheaper camera systems, which is pretty normal to do in research or in universities. We were also inspired by the Kinect. The first Kinect came out at that time, and we saw a need for a better 3D camera.

During different projects, I worked for a time for ESA and NASA, creating a time-of-flight camera that was going to be used in space, things like that. So we had explored different technologies and followed the development for a long time. We saw the customer need through our industrial clients that came to us and asked if we could do these kinds of inspection or picking tasks.

We tried different cameras that weren't good enough. Eventually, we started a strategic project where we wanted to create a new camera that unifies the world of high-end metrology scanners with more of the low-end Kinect-ish type of fast and affordable systems. So we could get high quality, fast, and at an affordable price, such that it could open up this field of robotics within material handling that we saw was emerging.

Now we see that several camera companies are offering these solutions, Zivid included. We see a wave of automation happening on top of these cameras in regards to picking, placing, assembly, and all of these tasks. So that was the start.

Jim:
That's great to hear that history. So now, in your work in the industry, do you work with integrators, or with OEM customers, or is it a mix of both?

Henrik:
It's a mix. It depends, but since we only provide the point cloud, the image, our customers mainly have competence in doing the other vision algorithms: detecting an object and the vertical integration.

So whether it's a system integrator or an OEM that creates an off-the-shelf system for some sort of task, we are a supplier of 3D cameras to those customers. It's very similar to the 2D camera industry. The camera manufacturers sell their cameras, and people use them for different tasks, but the 2D manufacturer doesn't sell to the end customer. It sells via some integrator or OEM or something. So that's similar for us.

Jim:
What are some of the challenges that your customers and your manufacturers and integrators are facing? Is it ease of use, reliability, speed, or is it just the pick and place?

Henrik:
That's a good question, and also, it's a huge topic. But also, of course, why I think this automation is so exciting. There are so many challenges. But it's frustrating also because this is easy for humans, right? To stand in front of a bin and pick an object from a bin, everyone can do it. It's really simple for us. But for a robot, this is ridiculously hard.

And there are lots of reasons for that, and it's about the uncertainty in all aspects. So the uncertainty in the vision system, in the gripper, in the motion planning, in the robot, and of course, the complexity of creating a software brain to control all of that.

All of that together needs to work for such a system to be robust and used in the industry. A lot of this is done in research, and there are systems out there already, but it's still early days, I would say, in regards to human-level capabilities. So it's just the sheer complexity of all parts involved. And we are focusing on one of those complexities: the 3D camera itself and the ability to get good data on all types of materials and surfaces so that you can image all of those millions of SKUs you will need in the warehouse or in the manufacturing plant.

The cameras out there today, Zivid included, can't see everything, but they can see a lot. That's what we're working on to improve all the time and get better and better to make those robots more capable, giving them this human-like vision that we set out to do.

Jim:
By good data, you mean good vision, good pictures. You've said that pick and place are not solved, and that's what you mean?

Henrik:
Right. Just take a look at the logistics industry. Last year, of course also due to Covid, there was a huge rise in online shopping, and Amazon alone hired 400,000 people to do picking in the warehouses. And if this were solved, there would be loads of robots doing that work, right?

Also, in the logistics industry, intra-logistics has come pretty far with the moving of stuff inside. Small robots are moving the shelves, or they're moving small boxes and so on. But they move that to human pickers who fulfill the order. If you have ordered something online, maybe a cell phone and some cables, the human worker picks it, puts it in the packaging container, packs it, and sends it. So that part of the industry is now going hard on automation. (This is called robotic piece picking, and you can read more about it here: https://www.zivid.com/applications/piece-picking)

And then you would need better vision because when all those robots are going to do the human tasks, they need to see more like us, because we have ridiculously good vision. So, an employer wouldn't go around and accept a worker ("the robot") that can barely see and can't pick 60% of what's in the bin; it wouldn't work. It's just not a sustainable business. There need to be improvements there, and we're not there yet. But we are off to a good start, lots of things are happening, and of course, a lot can already be automated.

Jim:
You've got both hardware and software in your life when it comes to vision. Is there one that's harder than the other? Have you solved this on the hardware's part, and you're working on the software?

Henrik:
I think these things go hand in hand. When you have been in the industry for a while, you understand that it takes some time to call something solved. There are always things that can be improved.

In regards to hardware, some of the things that we have worked a lot on, and which are very complex and important for high-end industrial 3D vision, are thermal and mechanical stability: the changes that can happen to the camera due to temperature variations, shock, and vibrations.

Take a warehouse again. At night time, the temperature goes down. And in the daytime, maybe in different parts of the world, it can be up to 40 degrees in the warehouse. And if it's next to an open door where, in the wintertime, the cold air is streaming in every time a truck goes by, there can be rapid changes, things like that. When the temperature changes, everything changes. So the lens changes, the mechanics change, everything changes. And then, being a high-end calibrated and accurate system, you need to be able to cope with these things.

So that's something we worked a lot on, and that's pure hardware, and then you can add software on top of it. We have so-called floating thermal calibration where we measure everything that happens inside, and we try to adapt based on readings from sensors inside.

Then, of course, we need to build smartness on top of the camera to understand what is noise and what is real. That's yet another thing we've worked a lot on: shiny and reflective surfaces.

Jim:
I watched your YouTube video. I thought it was really good. I will ask the listeners to check that out maybe later on after the podcast. I was gonna ask you a question about the glass, the optical glass, and the camera. Of course, you've got two lenses because you're recording in 3D. How important is the glass in your camera?

Henrik:
We apply a measurement principle called structured light, in contrast to stereo vision, which you might know about. Stereo vision is also what we have as humans, you know, with our two eyes; it's the same principle (read more about all types of 3D measurement principles here: https://www.zivid.com/3d-vision-technology-principles).

In structured light, however, one of the lenses belongs to a projector. So there's a projector behind the lens. And if you think about a projector, it's some sort of inverse camera where instead of receiving light, it's sending out light. Since it's structured light, it's coded, so we know what we are sending out, and then we look at the patterns in the camera, which is offset from the projector by a baseline and therefore sees a different projection. We can then look at how that projection changes, and from that we can infer the 3D information. So understandably, since we are projecting and we need very high accuracy, it sets a lot of demands for the optics and the glass.
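
To illustrate why small optical errors matter, here is a minimal triangulation sketch. It assumes an idealized, rectified projector-camera pair where depth follows Z = f * b / d, with f the focal length in pixels, b the baseline, and d the column shift (disparity) between where a coded stripe was projected and where the camera observed it. The function name, the sign convention, and all numbers are illustrative assumptions, not Zivid's actual geometry or algorithm.

```python
def depth_from_disparity(camera_col: float, projector_col: float,
                         focal_px: float, baseline_mm: float) -> float:
    """Triangulate depth (mm) for one decoded stripe.

    Assumes a rectified projector-camera pair, so the stripe only shifts
    along image columns. The disparity is taken as camera column minus
    projector column (an assumed sign convention).
    """
    disparity = camera_col - projector_col
    if disparity <= 0:
        raise ValueError("non-positive disparity: cannot triangulate")
    return focal_px * baseline_mm / disparity


# Hypothetical setup: 1000 px focal length, 100 mm baseline.
z = depth_from_disparity(camera_col=700.0, projector_col=600.0,
                         focal_px=1000.0, baseline_mm=100.0)

# Same point, but the stripe is located a tenth of a pixel off,
# for example because the optics shifted slightly.
z_off = depth_from_disparity(camera_col=700.1, projector_col=600.0,
                             focal_px=1000.0, baseline_mm=100.0)

print(f"depth: {z:.1f} mm, with 0.1 px error: {z_off:.1f} mm")
```

In this toy setup a tenth of a pixel moves the measured depth by about a millimetre at one metre, which gives a feel for why the lenses and their stability are so demanding.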

Regarding what I talked about earlier about thermal and mechanical stability, we need to have ruggedized systems. So we put a lot of effort into the design. And the devil is in the details there, of course: how to make a system that can be stable, be industrial grade and usable, and live for years in an industrial environment. So the glass is very important, absolutely.

Jim:
Can you tell the audience what you mean by true-to-reality?

Henrik:
Sure. It's a hard one. If you think of the foundation, we have a term called accuracy, and I think it's something that is confused a lot when you hear people talk about it. But accuracy has two components: one is precision, and the other is trueness.

Precision is about the noise. If you image a surface, there will be some random noise in the point cloud, like not every point is perfectly placed where it should be. It's a little over the surface, a little under. There's noise there. It's like this blurriness or graininess of an image. In 3D, that's the precision. In other words, when you take a lot of images, how close to each other are these points?

And then you have trueness. Trueness is about whether you are representing reality correctly. That means: are sizes exactly what they are in real life, are the rotations correct, and are the absolute distances correct? In 3D you always measure from an origin inside the 3D camera, so what is the correct absolute distance to the object that you are looking at? All of that is the trueness.

If you have good trueness, you have a very true-to-reality representation. The combination of trueness and precision together gives you accuracy. So that is low noise and high trueness. This is very important to understand when using a camera in all sorts of applications.
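
The two terms can be made concrete with repeated measurements of a known distance. The sketch below uses a common way of quantifying them, which is an assumption here rather than Zivid's own specification method: trueness error as the offset of the mean from the reference, and precision as the standard deviation of the readings. The two "cameras" and their readings are made-up numbers.

```python
import statistics
from typing import Sequence


def trueness_error(measurements: Sequence[float], reference: float) -> float:
    """Systematic offset: how far the average reading is from the truth."""
    return abs(statistics.mean(measurements) - reference)


def precision(measurements: Sequence[float]) -> float:
    """Random noise: spread of repeated readings (sample standard deviation)."""
    return statistics.stdev(measurements)


reference_mm = 1000.0  # known true distance to a target

# Hypothetical camera A: very repeatable readings, but all about 5 mm off.
camera_a = [1004.9, 1005.0, 1005.1, 1005.0]
# Hypothetical camera B: noisier readings, but centred on the true value.
camera_b = [998.0, 1002.0, 999.0, 1001.0]

for name, data in (("A", camera_a), ("B", camera_b)):
    print(f"camera {name}: trueness error {trueness_error(data, reference_mm):.2f} mm, "
          f"precision {precision(data):.2f} mm")
```

Camera A has good precision but poor trueness, and camera B the other way around. High accuracy in the sense described above needs both numbers to be small.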

A good example is the first Kinect that came out. It wasn't so good in regards to data quality, but what it worked quite okay for was gesture recognition in front of your TV. In that particular case, what you want to do is to recognize a gesture. And it doesn't matter if your hand is at 1.1 meters from the camera or 1.2 meters, it's not so important. It's more important that you recognize the actual gesture. That is the precision: you can distinguish gestures from each other, but you are not necessarily placing the hands and fingers right in space.

But as soon as you introduce a robot and want to use the data to interact with objects, trueness is very important. Because if you see an object and it's placed wrongly, maybe a little to the side, a little bigger than in reality, or rotated, then you can crash when you try to grip it. And that is a huge problem. So true-to-reality means that it's correct in regards to all of these factors. (Learn more about trueness)

Jim:
That's great. Thank you for that clarity. In some of the videos that I watched in preparation for the podcast, I see the Zivid mounted on a robot arm. Is that kind of the natural, preferred state, or are you just seeing a lot more of that application?

Stationary mounting (left) vs. on-arm mounting (right) with a Zivid Two 3D camera

Henrik:
It's used sometimes, but mostly the industry has been mounting cameras stationary. It's a little more straightforward to do that. The thing with on-arm is that it gives a lot of appealing benefits if you can do it. One thing is that you can get closer. A lot of the problems with 3D come from distance. You want high resolution, you want high-quality images, and you want it over the full working range. A long working range will pretty quickly lead to degradation of quality.

However, if you have the camera on-arm, you can always maintain the same distance. If you, for instance, do a depalletization task, you take the image at the same height. When you have taken the first layer of boxes, you go a little closer, take the next layer, and so on. So you get closer and are not so disturbed by the ambient light, and you can minimize occlusion and shadowing because you can see from different sides.

It's also very economical: instead of having one camera at the pick position and one at the place position, you can use the same on-arm camera to do everything. So why isn't the industry using it more? Because it's also complex. That's something we have worked a lot on, and now we have released our Zivid Two camera, where we have reduced the size significantly. So it's more easily mountable. It doesn't restrict the motions of the robot so much, and it's also mechanically stable. This is very important when you have vibrations or when you accelerate the robot.

I think if on-arm solutions were more available, the industry would use them more, and that's what we are trying to do: enable the industry to reap these benefits. Just think of the mobile manipulators coming more and more. The flexibility when you have on-arm cameras, like picking from shelves. You couldn't mount cameras on all the shelves, so you need to have something on the robot moving around. The flexibility of looking at different shelves and so on. I think it's something that we will see more and more of. We believe so much in it that we have positioned ourselves to be a leading actor delivering those kinds of systems to the industry. (Learn more about stationary vs. on-arm mounting)

Jim:
So the autonomous robot industry is one of your focuses as well?

Henrik:
Yes, it's about the flexibility, what you can do with it. All these factories of the future where everything changes, it's dynamic. The stationary setups have their beauties, but they are not as flexible.

Jim:
I think the future is mobile. That's for sure. What are some of the challenges that still remain in the industry?

Henrik:
There are still challenges regarding, for instance, 3D quality on all types of materials and surfaces. Some objects are not possible to image as of today, especially transparent materials. That's kind of the Holy Grail.

Another challenge and something we have worked a lot on is shiny and reflective surfaces. We have come pretty far on that, but there're still some challenges. (Learn more about solving challenges with shiny objects)

In general, it's a challenge in 3D if you want a large working area, at long distances, and you want high resolution and great image quality, and you want it all together. Typically a bigger field of view makes it harder to achieve high resolution and good 3D quality. The market wants super great 3D quality, on all surfaces and materials, and they want it in a smaller and better package. If you put on-arm capabilities on top of that, then you understand the challenges.

Jim:
So those are kind of those things that we've all been dealing with. Everybody wants it smaller or bigger?

Henrik:
It's like better, smaller, bigger, faster. You are kind of hitting your head into some of the physical limitations. So there's a lot of innovation going on and ways to get around it, and it's gradually improving indeed. What we see going forward is more on-arm, and better quality 3D vision, as important factors.

We see an influx of more demand for high-end industrial-grade vision systems, combined with AI. There has been a lot of AI going on lately, and I think that will be the future. But it's still early for that in regards to picking and placing.

If you take ourselves (humans), we have great vision in combination with our brain, so I think we need to offer the robots the same! You know, we need to give them great vision as well. I think that's something that will be very important in order to achieve universal picking for robots.

Jim:
Henrik, thanks for coming on the podcast. If someone's interested in getting in touch with you or learning more about Zivid, how should they do that?

Henrik:
The easiest would be just to contact us on zivid.com. If you want to get hold of me, I'm pretty active on LinkedIn, so you can contact me there.
