In 2021, OpenCV launched what would become a wildly successful Kickstarter campaign for an experimental line of cameras with highly versatile functionality. Enter: the OpenCV AI Kit (OAK). OAK was the first device of this hardware-software renaissance and was priced at a cool $199. While the family of OAK devices has grown over the years, the core offerings have mostly seen incremental performance improvements. Here are some of the features of this award-winning camera:

  • 4K RGB Capture, at up to 30 FPS
  • 1280x800 Stereo Depth Capture, at up to 120 FPS
  • Intel Movidius Myriad X Visual Processing Unit (VPU), up to 4 trillion operations per second
    • Support for on-board Neural Network Inference
    • Pre/Post processing for on-board AI Pipelines
  • IMU (Accelerometer, Gyroscope, Magnetometer), orientation estimates at up to 400 Hz

In this post, I'll be discussing my personal work with the OAK-D camera. I have fairly extensive experience developing software with this camera; however, Project OAK-Deep focuses on exploring the vast range of computer vision and robotics applications that can be powered with minimal hardware beyond the OAK-D camera itself. This project has infrastructure for the following:

  • Network Training (Classification, Object Detection, Instance Segmentation, Image Generation)
  • Data Collection (Hardware-Specific Training Dataset Curation)
  • RGB, Depth, IMU Pipeline Interchangeability (highly customizable, JSON-based pipeline construction; see the sketch after this list)
  • Custom Neural Inference Pipelines (Object Detection, OpenPose, Facial Keypoints)
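
To give a rough sense of what these interchangeable pipelines boil down to, here is a minimal sketch of an RGB + stereo depth + IMU pipeline written directly against the DepthAI Python API that drives the OAK-D. OAK-Deep assembles comparable pipelines from its JSON configuration rather than by hand, and the exact node/enum names can vary slightly between DepthAI versions, so treat this as an illustrative hand-rolled equivalent, not the project's code.

```python
import depthai as dai

# Minimal sketch of an RGB + stereo depth + IMU pipeline with the DepthAI API.
pipeline = dai.Pipeline()

# RGB camera -> host
cam_rgb = pipeline.create(dai.node.ColorCamera)
cam_rgb.setPreviewSize(640, 400)
xout_rgb = pipeline.create(dai.node.XLinkOut)
xout_rgb.setStreamName("rgb")
cam_rgb.preview.link(xout_rgb.input)

# Stereo depth from the two mono cameras -> host
mono_left = pipeline.create(dai.node.MonoCamera)
mono_right = pipeline.create(dai.node.MonoCamera)
mono_left.setBoardSocket(dai.CameraBoardSocket.LEFT)
mono_right.setBoardSocket(dai.CameraBoardSocket.RIGHT)
stereo = pipeline.create(dai.node.StereoDepth)
mono_left.out.link(stereo.left)
mono_right.out.link(stereo.right)
xout_depth = pipeline.create(dai.node.XLinkOut)
xout_depth.setStreamName("depth")
stereo.depth.link(xout_depth.input)

# IMU samples -> host (accelerometer at 400 Hz)
imu = pipeline.create(dai.node.IMU)
imu.enableIMUSensor(dai.IMUSensor.ACCELEROMETER_RAW, 400)
imu.setBatchReportThreshold(1)
imu.setMaxBatchReports(10)
xout_imu = pipeline.create(dai.node.XLinkOut)
xout_imu.setStreamName("imu")
imu.out.link(xout_imu.input)

with dai.Device(pipeline) as device:
    rgb_q = device.getOutputQueue("rgb", maxSize=4, blocking=False)
    depth_q = device.getOutputQueue("depth", maxSize=4, blocking=False)
    imu_q = device.getOutputQueue("imu", maxSize=50, blocking=False)
    rgb_frame = rgb_q.get().getCvFrame()    # BGR numpy array
    depth_frame = depth_q.get().getFrame()  # uint16 depth map in millimeters
```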

Network Training

Stepping away from the scope of deploying algorithms and systems onto the OAK-D camera, OAK-Deep also supports training, validation, and testing of neural networks for several core computer vision tasks. Currently, the following are supported (a minimal training sketch follows the list):

  • Image Classification
    • Fully Convolutional Network (FCN) Architecture
    • MobileNet V2 Architecture
  • Object Detection
    • YOLOv8 Architecture
  • Instance Segmentation
    • YOLOv8 Architecture
  • Image Generation
    • Basic GAN Architecture
    • Deep Convolutional GAN (DCGAN) Architecture
    • Deep Variational AutoEncoder (VAE) Architecture
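
For the YOLOv8 tasks, OAK-Deep essentially wraps the Ultralytics training API, so a stripped-down version of what happens under the hood looks roughly like this. The model weights, dataset YAML, and hyperparameter values below are illustrative placeholders rather than the project's defaults.

```python
from ultralytics import YOLO

# Minimal YOLOv8 training sketch with the Ultralytics API.
# "yolov8n.pt" and "coco128.yaml" are placeholders; swap in "yolov8n-seg.pt"
# and a segmentation dataset for instance segmentation.
model = YOLO("yolov8n.pt")
results = model.train(
    data="coco128.yaml",   # dataset config (Roboflow/Ultralytics/custom exports fit this format)
    epochs=50,
    batch=16,
    lr0=0.01,              # initial learning rate
    optimizer="SGD",
)
metrics = model.val()      # evaluate on the dataset's validation split
```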

Alongside this variety of tasks and architectures, the user/developer can also choose from a vast number of datasets. OAK-Deep currently supports datasets sourced from Roboflow, PyTorch, Ultralytics, and even custom datasets with proper formatting. This puts hundreds, possibly even thousands, of datasets at the fingertips of developers.
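
As one concrete example of the PyTorch-sourced datasets, a torchvision classification set can be pulled down and wrapped in a loader in a few lines; Roboflow and Ultralytics datasets arrive as exported folders/YAMLs instead, and custom data just needs to match one of those layouts. The paths and batch size here are arbitrary.

```python
import torch
from torchvision import datasets, transforms

# Example of a PyTorch-sourced classification dataset (CIFAR-10).
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
train_set = datasets.CIFAR10(root="./datasets", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
```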

With customization of task, network, and dataset, users/developers can stick to the defaults for training their own model, or they can choose hyperparameters best suited to their task. Developers can control a set of well-studied training hyperparameters, such as batch size, number of epochs, learning rate, momentum, and weight decay, as well as the optimizer and loss function. For more seasoned deep learning developers, these parameters can also be tuned/explored relatively automatically with Weights and Biases Sweeps. Here are some examples of projects I have created with OAK-Deep, visualized with Weights and Biases.
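
For reference, a Weights and Biases sweep over a couple of those hyperparameters can be set up roughly as sketched below. The parameter names, ranges, and the `train()` stand-in are placeholders for whatever training entry point a project exposes, not OAK-Deep's exact configuration.

```python
import wandb

# Hypothetical sweep over learning rate and batch size with Weights and Biases.
sweep_config = {
    "method": "bayes",  # random / grid / bayes
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-4, "max": 1e-1},
        "batch_size": {"values": [16, 32, 64]},
    },
}

def train():
    run = wandb.init()
    lr = run.config.learning_rate
    bs = run.config.batch_size
    # ... build the model and run a normal training loop here ...
    run.log({"val_loss": 0.0})  # placeholder metric
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="oak-deep-example")
wandb.agent(sweep_id, function=train, count=10)
```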

I have designed this subset of the project to allow users to train and test their own neural networks in a general way, but my next step is to fit it into the OAK-Deep ecosystem by offering automated deployment of such models onto the OAK-D camera.

Data Collection

In data collection mode, the camera records and saves frames from specified sources, and the user/developer has the option to upload the collected data to an AWS S3 bucket. This feature can save RGB frames and depth frames, but at the moment it cannot save IMU data or neural network predictions (this might happen in a future release). The OAK-D offers on-board video encoding, but I have yet to explore that feature extensively. I'd like to explore efficient compression methods/filetypes, such as the h5 file extension, to handle faster data saving. I'd also like to enable saving of IMU data and neural network predictions, which would also require timestamps to be recorded for each type of frame (RGB and depth), each IMU sample, and each network prediction. For now, though, the dataset collection pipeline is stable, and users/developers can find their dataset in the ./datasets directory.
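
Below is a rough sketch of what a frame-saving loop with optional S3 upload looks like. The directory layout, bucket name, and the `get_frames` stand-in (which replaces the real DepthAI output queues) are illustrative, not OAK-Deep's exact conventions.

```python
import os
import cv2
import numpy as np
import boto3

# Illustrative frame-saving loop with optional S3 upload; paths and bucket/key
# names are placeholders, not OAK-Deep's exact layout.
save_dir = "./datasets/session_000"
os.makedirs(save_dir, exist_ok=True)
s3 = boto3.client("s3")
upload_to_s3 = False  # flip on to push each saved frame to S3

def get_frames(n=10):
    """Stand-in for the DepthAI output queues: yields (RGB, depth) frame pairs."""
    for _ in range(n):
        yield np.zeros((400, 640, 3), np.uint8), np.zeros((400, 640), np.uint16)

for i, (rgb_frame, depth_frame) in enumerate(get_frames()):
    rgb_path = os.path.join(save_dir, f"rgb_{i:06d}.png")
    depth_path = os.path.join(save_dir, f"depth_{i:06d}.png")
    cv2.imwrite(rgb_path, rgb_frame)
    cv2.imwrite(depth_path, depth_frame)  # 16-bit depth is preserved in PNG
    if upload_to_s3:
        s3.upload_file(rgb_path, "my-oak-bucket", f"session_000/rgb_{i:06d}.png")
        s3.upload_file(depth_path, "my-oak-bucket", f"session_000/depth_{i:06d}.png")
```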

On-device Pipeline Customization

Other modes exist, such as facial landmark mode and display mode, and the parameters relating to the OAK-D camera are in the oak_config.MODE.json file. Shown below is a demo of the facial landmark mode, with depth map visualization.

Figure 1: Demonstration of facial landmark mode, with depth map visualization. Notice that the network predictions plotted on the RGB image are distorted. This is purely a result of how the frames are visualized: I use fullscreen OpenCV visualization, which stretches and squashes the frames to fit the display.
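
To make the configuration idea concrete, here is a hypothetical illustration of what an oak_config.MODE.json might contain, written as a small Python snippet that dumps it to disk. The key names and the file name are invented for illustration; the actual files in the repo are the authority.

```python
import json

# Hypothetical illustration of an oak_config.MODE.json; the actual key names
# and options are defined by the OAK-Deep repo, not by this sketch.
example_config = {
    "mode": "facial_landmark",
    "rgb": {"resolution": "1080p", "fps": 30},
    "depth": {"enabled": True, "align_to_rgb": True},
    "display": {"fullscreen": True, "show_depth_map": True},
}

with open("oak_config.facial_landmark.json", "w") as f:
    json.dump(example_config, f, indent=2)
```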

Off-device Pipeline Customization

Similarly, there are pipelines which leverage off-device functionality. The parameters that control these pipelines are also in the oak_config.MODE.json file. Currently, most of the off-device pipeline functionality is geared toward visualizing network outputs. My next steps will be to implement a basic SLAM pipeline with visualization, as well as AprilTag/ArUco marker AR visualization.
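
As a small example of that off-device, visualization-oriented side, drawing detections onto the host's copy of the RGB frame is just standard OpenCV. The normalized (xmin, ymin, xmax, ymax) layout below mirrors DepthAI's detection messages, but the detections themselves are dummy values and the exact field names in OAK-Deep's pipelines may differ.

```python
import cv2
import numpy as np

# Off-device visualization sketch: draw normalized-coordinate detections on an RGB frame.
frame = np.zeros((400, 640, 3), np.uint8)  # stand-in for a frame pulled from the camera
detections = [{"label": "person", "confidence": 0.87,
               "xmin": 0.10, "ymin": 0.15, "xmax": 0.45, "ymax": 0.90}]

h, w = frame.shape[:2]
for det in detections:
    x1, y1 = int(det["xmin"] * w), int(det["ymin"] * h)
    x2, y2 = int(det["xmax"] * w), int(det["ymax"] * h)
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(frame, f'{det["label"]} {det["confidence"]:.2f}',
                (x1, max(y1 - 8, 12)), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

cv2.imshow("detections", frame)
cv2.waitKey(0)
```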