- David Battel
- Connor Goggins
- Kjartan Brownell (TA)
FingerSpark is a program that tracks a user’s hand and individual fingers in a video feed, then interprets a gesture from the movement of the user’s fingertips. The user will wear a glove with differently colored fingertips, making it easier for the camera to pinpoint the two-dimensional location of each finger. The user positions his hand 2-3 feet in front of the camera, where a video feed is recorded and interpreted by the Raspberry Pi B+’s CPU. The final product is mounted on a tripod for the user’s convenience. To detect the movement of brightly colored points at approximately 2-3 feet away from the camera, we use the Raspberry Pi Camera Module’s slow-motion video mode, which takes 90 frames/second at 640x480p of resolution.
To achieve the desired functionality, we wrote an image-processing program to analyze individual frames from the camera’s video feed. We then mask each frame and perform a bitwise OR on the array of frames to create a composite mask. We then perform image comparisons on this composite image and a series of templates, utilizing cropping techniques and an adapted form of Hooke-Jeeves’ algorithm. This optimization algorithm searches for patterns to find the template with the highest degree of similarity to the image, thereby determining which gesture the user performed.
Our goal in creating FingerSpark is to work towards eliminating the barriers to perfectly natural user control of electronic devices. We believe that our product will be an important next step in developing three-dimensional operating systems, creating robots that can flawlessly mimic the fine motor skills of humans, and producing interactive augmented reality technologies.
Our demonstration at the end of the semester will consist of a user moving his hand in the glove with colored fingertips in front of the camera, making a gesture of his choice. After the video is recorded, our program will process the video input and correctly select the user’s gesture from our set of templates. Once the user’s gesture is correctly identified, our program will output what type of gesture the user made.
Masking and Color Identification
The core of our image-processing technique is the masking algorithm. This takes in an image (Image 1) and a set of color bounds, and returns an array composed of 1s and 0s representing which pixels are within the color bounds specified. Originally, we implemented this method using RGB encoding, but we quickly learned that better results could be obtained with HSV encoding (because it can more easily identify colors in the way humans do, and better accounts for changes in lighting). To generate Image 2, we calculated the mask between the HSV bounds of [0,0,x-10] and [255,255,x] for various values of x, then took the bitwise OR of all masks generated this way. The masks can also be applied onto the original images, as shown in Image 3.
Image Processing Steps
Upon recording and/or loading the video file of the gesture, the mask is applied to each frame (Image 4). The code then iterates through each frame, and generates a composite image by taking the bitwise OR of all of the frames (Image 5). However, there is a small amount of noise generated by this process (seen in the middle of the circle in Image 5), requiring the use of the Hooke-Jeeves optimization algorithm to find the best fit of the three templates onto the composite mask. We treated this as a four-parameter problem: the x and y bias (dx and dy), and the x and y gain (fx and fy). Using Hooke-Jeeves’ algorithm requires an initial value for these four parameters, as well as a fitness function. The initial values define an initial template orientation as shown in Image 6, determined by excluding 5% of the white pixels in each direction. The fitness function is not the raw number of pixels that overlap between the resized template and the image (calculated with a bitwise AND). Instead, the fitness function calculates o/sqrt(i*t), where o represents the number of overlapping white pixels between the two images, i represents the number of white pixels in the composite mask, and t represents the number of white pixels in the resized template. Because the fitness function is very discontinuous, the step size is varied and the optimization algorithm is run several times. The best fit found for each template is shown in Images 7-9 (with the template highlighted in red). The program then informs the user of which gesture was most closely matched to their own hand motions.
On the road to developing our final product, our group faced significant challenges.
First, we did not anticipate how challenging implementing OpenCV would be. Both the lack of documentation for this library and the issues associated with referencing the library in Python on the Raspberry Pi were problematic in the early stages of our project. However, by researching the library extensively we were able to leverage OpenCV’s methods to effectively process the video feed.
Second, we needed to find a way to compare a video feed to the static template images. We worked though this challenge by pulling individual frames from the video feed and applying a mask to each frame. This created an array of 0s and 1s, representing the location of the user’s finger as white on a black background. We then performed a bitwise OR on the set of masked frames, showing the path of the user’s finger throughout the video. We also formatted our templates as black and white images to allow the comparison of the user’s gesture with the templates.
Our third major challenge was the time required to execute our program. Using simple brute-force methods to compare the image with each template took nearly half an hour to return a result. We rose to this challenge by writing a custom version of Hooke-Jeeves’ algorithm to handle our image comparisons. When this still took too much time, we reduced the search space by cropping the image and each template to similar dimensions and then applied Hooke-Jeeves’ algorithm and the comparison. Our program currently takes under one minute to execute successfully with a high degree of accuracy.
Overall, our demonstration in Lopata Gallery was a success. Our team of two accomplished the objective we set out to achieve (accurately interpreting a user's gestures from a video feed) and went one step further: we provided users with a glove that had red and blue fingertips, and we allowed users to choose whether they wanted the program to track the blue or red fingertip during their gesture. FingerSpark successfully identified gestures in each color, and we experienced few complications during the presentation of our project to the WUSTL engineering community. During the demonstration, we also took Professor Gonzalez's advice to heart and quickly modified our program to show the image-template comparison process for each gesture in real time on the monitor, making our project significantly more visually appealing.
The key factor that prevented us from getting perfect results was the lighting in Lopata Gallery. Although we had tested our product in Lopata Gallery previously at nighttime, the midday sun became an increasing annoyance as the demo hour progressed. While initially FingerSpark was able to match a gesture to one specific template with ease (often returning a percentage match value for the correct template over three times the value of the highest percentage match of an incorrect template), by the end of the hour the differences in percentage match between the correct gesture and the two other gestures decreased significantly to just a few percentage points. Despite this interference, our results still remained accurate. However, given another few days to work on the project, we would have liked to configure FingerSpark to take an initial read of the video feed from the camera module and use that information to determine the amount of light in the frame. In particular, the optimal solution would have been to identify the white glove, see what color range it included (to see what color the lighting was), and used that information to modify the HSV range of the gesture. This would have allowed us to eliminate the problem of variable lighting altogether.
Another important factor that prevented us from getting better results was the extreme discontinuity of our fitness function. The Hooke-Jeeves' algorithm is only capable of finding a local maximum from a starting point, not an absolute maximum. However, because our fitness function was very "noisy", approximately an eighth of all points were local maxima, and the optimization algorithm was unable to find good maxima. It was impractical to check too many starting points - at that point, we're just implementing brute force methods - and for this reason, the Hooke-Jeeves algorithm was ill-suited to serving as the fitness function of FingerSpark. Because our goal was to find the absolute best match of an image to a template, a different optimization algorithm that checked for more than a single local maxima may have been more effective.
In its current state, FingerSpark is able to track multiple colors (several distinct colors have been tested successfully). With more time, we would have liked to check for gestures that involve multiple fingers by simultaneously tracking multiple colors. Another improvement would have been to interpret gestures in real-time video (rather than relying on recording videos and then analyzing them). Furthermore, we would have loved to incorporate more gestures and eventually code an operating system that is controlled exclusively by a user’s gestures through FingerSpark. We also had implemented a capability for users to record their own unique gestures, but we never tested it enough to be comfortable enabling it at the demo (and thus it was excluded from the final code). With more time and resources, this would have been another valuable feature to include in our product.
- Raspberry Pi B+ - $29.95 (Will use classroom kit)
- Raspberry Pi Camera Module - $24.99 (Need to purchase)
- Set of comfortable black gloves - 2 pairs: $7.89 x 2 = $15.78 (Need to purchase)
- Spray Paint Set: $15.99 (Need to purchase)
- Tripod: $19.99 (Need to purchase)
- White Backdrop: $9.50 (Need to purchase)