- David Battel
- Connor Goggins
- Kjartan Brownell (TA)
FingerSpark is a program that tracks a user’s hand and individual fingers in a video feed, then interprets a gesture from the movement of the user’s fingertips. The user will wear a glove with differently colored fingertips, making it easier for the camera to pinpoint the two-dimensional location of each finger. The user positions his hand 2-3 feet in front of the camera, where a video feed is recorded and interpreted by the Raspberry Pi B+’s CPU. The final product is mounted on a tripod for the user’s convenience. To detect the movement of brightly colored points at approximately 2-3 feet away from the camera, we use the Raspberry Pi Camera Module’s slow-motion video mode, which takes 90 frames/second at 640x480p of resolution.
To achieve the desired functionality, we wrote an image-processing program to analyze individual frames from the camera’s video feed. We then mask each frame and perform a bitwise OR on the array of frames to create a composite mask. We then perform image comparisons on this composite image and a series of templates, utilizing cropping techniques and an adapted form of Hooke-Jeeves’ algorithm. This optimization algorithm searches for patterns to find the template with the highest degree of similarity to the image, thereby determining which gesture the user performed.
Our goal in creating FingerSpark is to work towards eliminating the barriers to perfectly natural user control of electronic devices. We believe that our product will be an essential next step in developing three-dimensional operating systems, creating robots that can flawlessly mimic the fine motor skills of humans, and producing interactive augmented reality technologies.
Our demonstration at the end of the semester will consist of a user moving his hand in the glove with colored fingertips in front of the camera, making a gesture of his choice. After the video is recorded, our program will process the video input and correctly select the user’s gesture from our set of templates. Once the user’s gesture is correctly identified, our program will output what type of gesture the user made.
Masking and Color Identification
The core of our image-processing technique is the masking algorithm. This takes in an image (Image 1) and a set of color bounds, and returns an array composed of 1s and 0s representing which pixels are within the color bounds specified. Originally, we implemented this method using RGB encoding, but we quickly learned that better results could be obtained with HSV encoding (because it can more easily identify colors in the way humans do, and better accounts for changes in lighting). To generate Image 2, we calculated the mask between the HSV bounds of [0, 0, x-10] and [255, 255, x] for various values of x, then took the bitwise OR of all masks generated this way. The masks can also be applied onto the original images, as shown in Image 3.
Image Processing Steps
Upon recording and/or loading the video file of the gesture, the mask is applied to each frame (Image 4). The code then iterates through each frame, and generates a composite image by taking the bitwise OR of all of the frames (Image 5). However, there is a small amount of noise generated by this process (seen in the middle of the circle in Image 5), requiring the use of the Hooke-Jeeves optimization algorithm to find the best fit of the three templates onto the composite mask. We treated this as a four-parameter problem: the x and y bias (dx and dy), and the x and y gain (fx and fy). Using Hooke-Jeeves’ algorithm requires an initial value for these four parameters, as well as a fitness function. The initial values define an initial template orientation as shown in Image 6, determined by excluding 5% of the white pixels in each direction. The fitness function is not the raw number of pixels that overlap between the resized template and the image (calculated with a bitwise AND). Instead, the fitness function calculates o/sqrt(i*t), where o represents the number of overlapping white pixels between the two images, i represents the number of white pixels in the composite mask, and t represents the number of white pixels in the resized template. Because the fitness function is very discontinuous, the step size is varied and the optimization algorithm is run several times. The best fit found for each template is shown in Images 7-9 (with the template highlighted in red).
On the road to developing our final product, our group faced significant challenges.
First, we did not anticipate how challenging implementing OpenCV would be. Both the lack of documentation for this library and the issues associated with referencing the library in Python on the Raspberry Pi were problematic in the early stages of our project. However, by researching the library extensively we were able to leverage OpenCV’s methods to effectively process the video feed.
Second, we needed to find a way to compare a video feed to the static template images. We worked though this challenge by pulling individual frames from the video feed and applying a mask to each frame. This created an array of 0s and 1s, representing the location of the user’s finger as white on a black background. We then performed a bitwise OR on the set of masked frames, showing the path of the user’s finger throughout the video. We also formatted our templates as black and white images to allow the comparison of the user’s gesture with the templates.
Our third major challenge was the time required to execute our program. Using simple brute-force methods to compare the image with each template took nearly half an hour to return a result. We rose to this challenge by writing a custom version of Hooke-Jeeves’ algorithm to handle our image comparisons. When this still took too much time, we reduced the search space by cropping the image and each template to similar dimensions and then applied Hooke-Jeeves’ algorithm and the comparison. Our program currently takes under one minute to execute successfully with a high degree of accuracy.
In its current state, FingerSpark is able to track multiple colors (several distinct colors have been tested successfully). With more time, we would have liked to check for gestures that involve multiple fingers by simultaneously tracking multiple colors. Another improvement would have been to interpret gestures in real-time video (rather than relying on recording videos and then analyzing them). Furthermore, we would have loved to incorporate more gestures and eventually code an interface that is controlled exclusively by a user’s gestures through FingerSpark.
- Raspberry Pi B+ - $29.95 (Will likely use classroom kit)
- Raspberry Pi Camera Module - $24.99 (Need to purchase)
- Set of comfortable black gloves - 2 pairs: $7.89 x 2 = $15.78 (Need to purchase)
- Spray Paint Set: $15.99 (Need to purchase)
- Tripod: $19.99 (Need to purchase)
- White Backdrop: $9.50 (Need to purchase)