Tonight’s Release, Xbox Kinect: How Does It Work?

The prototype for Microsoft’s Kinect camera and microphone famously cost $30,000. At midnight tonight, the company is releasing it as a motion-capture Xbox 360 peripheral for $150.

Microsoft is projecting that it will sell five million units between now and Christmas. It’s worth taking some time to think about what’s happening here.

I’ve used Kinect to play video games without a controller, watch digital movies without a remote, and do audio-video chat from across the room. I’ve spent even more time researching the technology behind it and explaining how it works.

Kinect’s camera is powered by both hardware and software, and it does two things: it generates a three-dimensional (moving) image of the objects in its field of view, and it recognizes (moving) human beings among those objects.

Older software programs used differences in color and texture to distinguish objects from their backgrounds. PrimeSense, the company whose tech powers Kinect, and recent Microsoft acquisition Canesta use a different model. The camera transmits invisible near-infrared light and measures its time of flight after it reflects off the objects.

Time-of-flight works like sonar: if you know how long the light takes to return, you know how far away an object is. Cast a big field, with lots of pings going back and forth at the speed of light, and you can know how far away a lot of objects are.
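To make the arithmetic concrete, here is a minimal Python sketch of the time-of-flight idea described above. It is only an illustration of the principle, not Kinect’s or PrimeSense’s actual firmware; the function names and the sample timings are invented for the example. The one piece of physics it relies on is that the light travels out and back, so the distance is half the round trip.

```python
# Illustrative sketch only: function names and sample timings are hypothetical.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # metres per second


def distance_from_round_trip(seconds: float) -> float:
    """Distance in metres to an object, given a light pulse's round-trip time."""
    if seconds < 0:
        raise ValueError("round-trip time cannot be negative")
    # The pulse covers the distance twice (out and back), so halve it.
    return SPEED_OF_LIGHT_M_PER_S * seconds / 2


def depth_map(round_trip_times):
    """Turn a grid of round-trip times (one per 'ping') into a grid of distances."""
    return [[distance_from_round_trip(t) for t in row] for row in round_trip_times]


if __name__ == "__main__":
    # A made-up 2x2 field of pings, timed in seconds (tens of nanoseconds).
    pings = [[20e-9, 20e-9],
             [14e-9, 20e-9]]
    for row in depth_map(pings):
        print([round(d, 2) for d in row])
```

A round trip of 20 nanoseconds works out to about three metres, roughly the distance from a television to a couch, which gives a sense of how fine the timing has to be.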

Using an infrared generator also partially solves the problem of ambient light, which can throw off recognition like a random finger on a touchscreen: the sensor really isn’t designed to register visible light, so it doesn’t get quite as many false positives.

PrimeSense and Kinect go one step further and encode information in the near-IR light. As that information is returned, some of it is deformed — which in turn can help generate a finer image of those objects’ three-dimensional texture, not just their depth.

With this tech, Kinect can distinguish objects’ depth within 1 cm and their height and width within 3 mm.

Figure from PrimeSense Explaining the PrimeSensor Reference Design.

At this point, both the Kinect’s hardware (its camera and IR light projector) and the receiver’s firmware (sometimes called “middleware”) are operating. An onboard processor runs algorithms on the data to render the three-dimensional image.

The middleware also can recognize people: both distinguishing human body parts, joints, and movements and distinguishing individual human faces from one another. When you step in front of it, the camera knows who you are.

Please note: I’m keenly aware here of the standard caution against anthropomorphizing inanimate objects. But at a certain point, we have to accept that if the meaning of “to know” is its use, in the sense of familiarity, connaissance, whatever you want to call it, functionally, this camera knows who you are. It’s got your image — a kind of biometric — and can map it to a persona with very limited encounters, as naturally and nearly as accurately as a street cop looking at your mug shot and fingerprints.

Does it “know” you in the sense of embodied neurons firing, or the way your mother knows your personality or your priest your soul? Of course not. It’s a video game.

But it’s a pretty remarkable video game. You can’t quite get the fine detail of a table tennis slice, but the first iteration of the WiiMote couldn’t get that either. And all the jury-rigged foot pads and Nunchuks strapped to thighs can’t capture whole-body running or dancing like Kinect can.

That’s where the Xbox’s processor comes in: translating the movements captured by the Kinect camera into meaningful on-screen events. These are context-specific. If a river rafting game requires jumping and leaning, it’s going to look for jumping and leaning. If navigating a Netflix Watch Instantly menu requires horizontal and vertical hand-waving, that’s what will register on the screen.
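One way to picture that context-specific lookup is as a table keyed by what is on screen. The Python sketch below is purely hypothetical: the context names, gesture names, and event names are made up for illustration and are not drawn from Microsoft’s software. It only shows how the same recognized movement can mean something in one context and nothing in another.

```python
# Hypothetical illustration: none of these names come from the Xbox software.
from typing import Optional

# Each context listens only for the gestures it cares about.
GESTURES_BY_CONTEXT = {
    "rafting_game": {
        "jump": "raft_hops",
        "lean_left": "steer_left",
        "lean_right": "steer_right",
    },
    "video_menu": {
        "swipe_horizontal": "scroll_titles",
        "swipe_vertical": "change_row",
    },
}


def interpret(context: str, gesture: str) -> Optional[str]:
    """Return the on-screen event for a gesture, or None if this context ignores it."""
    return GESTURES_BY_CONTEXT.get(context, {}).get(gesture)


if __name__ == "__main__":
    print(interpret("rafting_game", "jump"))  # the game is looking for jumps
    print(interpret("video_menu", "jump"))    # the menu is not, so nothing happens
```

The design choice worth noticing is that an unrecognized gesture simply maps to nothing, so jumping in front of a movie menu does not register on the screen.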

It has an easier time recognizing some gestures and postures than others. As Kotaku noted this summer, recognizing human movement — at least, any movement more subtle than a hand-wave — is easier to do when someone is standing up (with all of their joints articulated) than sitting down.

So you can move your arms to navigate menus, watch TV and movies, or browse the internet. You can’t sit on the couch wiggling your thumbs and pretending you’re playing Street Fighter II. It’s not a magic trick cooked up by MI6. It’s a camera that costs $150.


I should mention too that it has a stereo microphone to enable chat and voice commands. The tech on the audio capture is fairly well-known, but it’s worth observing that unlike the noise-cancelling microphone you might have on your smartphone or laptop’s webcam, Kinect has a wide-field, conic audio capture.

This is because, unlike with a smartphone, you wouldn’t want the Kinect’s microphone to capture only the sounds closest to it: you’d just pick up the television set. You want it to capture ambient speech: whole groups of people watching sports or playing games in their living rooms, talking to people in other living rooms.

Screenshot from Kinect Sports Hurdles

A video game controller is individual and serial: it’s me and whatever I’m controlling on the screen versus you and what you’re controlling. We might play co-operatively, but we’re basically discrete entities isolated from one another, manipulating objects in our hands.

A video game controller is also usually a highly specialized device. It might do light work as a remote control. But the buttons, d-pads, joysticks, accelerometers, gyroscopes, haptic feedback mechanisms and interface (wired or wireless) with the console are all designed to communicate very specific kinds of information.

Kinect is communal, continuous and general — a Natural User Interface (or NUI) for multimedia, rather than a GUI for gaming. The specificity, where it exists, is overwhelmingly on the software side of the device — and the hardware side of us, its users.

