Tech Reflect Voice

From ESE205 Wiki
Project Overview

The best creations come not from reinventing the wheel, but from integrating existing technologies in new and interesting ways. This is why, when we saw the original Tech Reflect project, we realized that there was so much opportunity to improve upon it. With the increasing prevalence of IoT devices, home assistants like Google Home and Amazon Alexa, open-source and free-to-use APIs, and the decreasing cost of display technology, it has become possible to cheaply and easily create a piece of home hardware that combines the strengths of home assistants and cheap displays while minimizing their obtrusiveness in your life.

Group Members

  • Ethan Shry
  • Tony Sancho-Spore
  • Baihao Xu (Kevin)
  • Ellen Dai (TA)

Project Proposal

https://docs.google.com/presentation/d/1UtUwZfxM7SI90nvJo0Fx1iLGG8HB7N2jdl1ApcL0bRY/edit?usp=sharing

Objectives

We hope to construct a proof-of-concept bathroom mirror which responds to user feedback. The mirror and GUI should be somewhat aesthetically pleasing. It should listen to the user and be able to convert what they ask for into a visual response displayed on-screen. We hope to show that there is some novelty or value in integrating technology into everyday items like mirrors.

Challenges

Due to the tools available to us, making sure that the hardware talks to the Python listener, which in turn talks to the GUI server, will be somewhat logistically challenging, especially as we will be running code in several different programming languages.

Choosing which command to run when analyzing user speech will also be a nightmare should we decide to allow multiple different trigger commands. A simple solution would look only for exact string matches; a more robust solution will require further investigation.

Our current plan for the mirror is to 3D print the frame. Because no large printers are available to us, the frame needs to be printed in many pieces (more than 10), which may prove infeasible.

Additionally, Kevin will need to become comfortable with the Pug templating language and Node.js.

Gantt Chart

Figure: GanttChartTRVoiceSpring2018.PNG

Budget

21.5" Display: $65 @ Microcenter

Google AIY Voice Kit (Includes Speaker and Microphone): $10 @ Microcenter (DISCONTINUED)

Mirror: $50 @ Amazon (https://www.amazon.com/gp/product/B01G4MQ966/ref=oh_aui_detailpage_o00_s00?ie=UTF8&psc=1)

Raspberry Pi 3: $35

Raspberry Pi NoIR Camera: $25

Arduino Mini: $10

LED Strip: About $25 for more than you need

3 spools Standard Breadboard-y wire (22 gauge): $15

Frame: $20ish for wood and screws

Paint: $5

Power Strip: $10

3D Printed Frame: $35 (approximately 1kg of PLA)

2 Power Supplies/MicroUSB Cables: $25

Epoxy: $10

Total Cost: $340 (ish)

Code and CAD Files

All the code for this project can be found on GitHub.

All the CAD files for this project can be found on GrabCAD.

Design And Solutions: Modules

Product Design Process Overview

Step 1: Find and target users

We think our target users are families and other home users, so we can break down the information we display into what is most pertinent to them:

1. Our users need access to information on the internet: Twitter, stocks, wiki search

2. Our users need to know what they need to do during the day: reminders

3. Our users need to know the environment around them: weather

4. Our users need some personalized features: timer, personalized hotwords, etc.


Step 2: Designing the main page

When we designed the main page, we wanted to give consumers a brief overview of some of the things the mirror can do for them, to inspire possible uses of the mirror.

Step 3: Prioritize features based on client importance

1. Solo Users: These users will care about utility first, which means the weather, reminders, and the timer. The timer helps them keep track of the time they spend getting ready every morning. The weather helps them decide what clothes to wear for the day. The reminder feature reminds them not to forget anything during the day. The Twitter and stock features are optional, because users might not have time to look at them, but they will be very attractive to users who like them and will make us competitive compared to potential competitors.

2. Families: Family users may need features similar to those of our solo users. However, they need more powerful back-end support, because more information, such as reminders, has to be saved in the database, and they need the ability to keep different information and different profiles for different users in the household.

Step 4: Prioritize our features

At this point we pared down our feature list to only the most important features: the ones our users will rely on most heavily and the ones most vital to client use.


Step 5: Design Testing and Guidelines

Due to the nature of a smart mirror, it is very important that UI designs have high contrast: a two-way mirror blocks part of the light coming through it, so low-contrast designs can be hard to see. Additionally, the less cluttered the design is, the more functional the mirror is as a mirror.

Software Overview

Below is the overview of the Software flow for this project.

Figure: SmVoiceCodeFlow.png

HTML & CSS to Pug

Node.js gives us access to tools for writing HTML pages to be rendered on the server; we chose to use the Pug language for this purpose. Pug is lightweight, fast, and highly readable, making it easy to learn and understand. This allows us to quickly write what would otherwise be very tedious HTML pages and render them securely on the server with minimal impact on the client. The tutorial and syntax of Pug can be found at https://github.com/pugjs/pug.

Speech to Text

Figure: SmVoiceSpeechTesting.png

After sampling different speech-to-text libraries, it became clear that Google Cloudspeech was vastly superior to any other solution for what we wanted to achieve. While it would have been nice to run something locally, like CMU Sphinx, or to use a totally free solution, in the end Google was easiest to integrate with and had the most accurate speech-to-text conversion.

Additionally, the use of Google Cloudspeech allowed us to use the Google AIY Voice Kit, which was cheap and made integration with Google's services incredibly easy, with a pre-configured distribution of Raspbian available for use.

What we essentially have is a two-loop system for hotword detection and then speech recognition; see the following pseudocode:

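# Outer loop: keep listening for the hotword for as long as we should be listening.
while checkIfShouldListen():
    text = listen()                      # blocks until a phrase has been transcribed
    if hotword in text:
        commandText = listen()           # second listen: the phrase after the hotword is the command
        sendToNodeServer(commandText)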
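
The loop above can be exercised without a microphone by passing the three operations in as functions. This is only a minimal sketch: `run_listener`, its parameter names, and the example hotword "mirror" are hypothetical stand-ins, not the names used in the project code.

```python
def run_listener(listen, send_to_node_server, should_listen, hotword="mirror"):
    """Hotword loop: wait for the hotword, then forward the next phrase.

    listen              -- callable returning the next transcribed phrase (assumed)
    send_to_node_server -- callable that receives the command text (assumed)
    should_listen       -- callable returning False when the loop should stop
    hotword             -- example hotword only; the real one is configurable
    """
    while should_listen():
        text = listen()
        if hotword.lower() in text.lower():
            # The phrase spoken right after the hotword is treated as the command.
            command_text = listen()
            send_to_node_server(command_text)


# Scripted stand-in for the microphone: each call to listen() pops one phrase.
phrases = ["background chatter", "hey mirror", "show me the weather for Seattle"]
sent = []
run_listener(lambda: phrases.pop(0), sent.append, lambda: len(phrases) > 0)
```

Injecting the functions keeps the control flow separate from the speech service, so the same loop can be tested with scripted phrases.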

Command Parsing

A command string will come down from the server as follows:

   "Show me the weather for Seattle"

Our custom command parser will try to match that to an array of commands. Every command takes the form of an object in a JavaScript list, as follows:

   {
       name: "descriptive name, only for coder use",
       // cmdStrings: list of strings to match against the spoken command.
       //   Can make use of %?% for a parameter.
       //   Can make use of %$% for a continuous whitelist character sequence.
       //   Max 1 of each per command.
       //   Must have a separator between %?% and %$%,
       //   i.e. "%$% timer for %?%" is valid, "timer %$% %?%" is not valid.
       cmdStrings: ["%$% timer for %?%"],
       // keywords: should be unique to only this command. The idea is that if no
       // cmdString matches for any command, the parser loops back to look for a
       // keyword in the command. No %?% or %$% allowed.
       keywords: ["keyword list"],
       trigger: (param, activeUser) => {
           // function to be called when command is triggered
       },
       // viewName: name of the .pug file which should be displayed as a result of this command
       viewName: "stockView"
   }

Our command parser then works as follows:

   for command in possibleCommandList:
       for cmdString in command.cmdStrings:
           if cmdString strictly matches spokenCommand (excluding %?% or %$% templates):
               // we have matched a command: call its trigger function with the parameter (if one exists) and stop
   // no command matched, try to match a keyword
   for command in possibleCommandList:
       for keyword in command.keywords:
           if spokenCommand contains keyword:
               // we have matched a command: call its trigger function and stop

For instance, our sample command would match to the following:

   sample spokenCommand: "Show me the weather for Seattle"
   would match to commands as follows:
   commandString: "%$% weather for %?%" - would match with parameter: Seattle
   commandString: "Show me the weather %$%" - would match with no parameter
   keyword - "weather" - would match with no parameter
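
The matching rules above can be sketched in Python with regular expressions. The project's parser itself is written in JavaScript, so this is an illustration of the idea, not the project code: `template_to_regex`, `match_command`, the `COMMANDS` list, and the exact set of characters allowed for %$% are all assumptions.

```python
import re

# Assumed whitelist for %$%: word characters, apostrophes and spaces.
WILDCARD = r"[\w' ]+"


def template_to_regex(template):
    """Turn a cmdString such as "%$% weather for %?%" into a compiled regex."""
    pattern = ""
    for part in re.split(r"(%\?%|%\$%)", template):
        if part == "%?%":
            pattern += r"(?P<param>.+)"   # captured parameter
        elif part == "%$%":
            pattern += WILDCARD           # matched but not captured
        else:
            pattern += re.escape(part)    # literal text must match exactly
    return re.compile("^" + pattern + "$", re.IGNORECASE)


def match_command(spoken, commands):
    """Return (command name, parameter or None); (None, None) if nothing matches."""
    spoken = spoken.strip()
    # First pass: strict template match.
    for command in commands:
        for template in command["cmdStrings"]:
            found = template_to_regex(template).match(spoken)
            if found:
                return command["name"], found.groupdict().get("param")
    # Second pass: fall back to keyword search.
    for command in commands:
        for keyword in command["keywords"]:
            if keyword.lower() in spoken.lower():
                return command["name"], None
    return None, None


# Hypothetical command list for illustration.
COMMANDS = [
    {"name": "weather", "cmdStrings": ["%$% weather for %?%"], "keywords": ["weather"]},
]
```

With this list, `match_command("Show me the weather for Seattle", COMMANDS)` takes the template path and captures "Seattle", while a phrase such as "is the weather nice" only matches through the keyword fallback.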

Facial Recognition

To recognize a person's face, we used Amazon's Rekognition AWS service. Rekognition is designed to work as a complete Computer Vision (CV) API, which includes facial recognition, object detection, image comparison, etc. We also used Amazon's S3 AWS service, a file storage service comparable to Google Drive that the other AWS APIs can read from. The complete version of our facial recognition program works as follows:

  • Wait for the "Switch User" command
  • Take a photo using the Raspi camera
  • Upload the picture to the S3 bucket as "UnknownFace.jpg"
  • Compare the picture to every other picture in the S3 bucket ("TonyFace.jpg", "EthanFace.jpg"...)
  • If any of the pictures match the unknown face, then we know who is currently using the mirror
  • Else, switch back to the default user (Future Update: be able to add users on the fly)
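
The comparison steps above could be sketched as follows. This is a hedged sketch, not the project's program: `identify_user`, `rekognition_matcher`, the 90% similarity threshold, and the "NameFace.jpg" naming rule are assumptions for illustration. The comparison function is passed in, so the selection logic can be tried without AWS credentials; `rekognition_matcher` shows how the boto3 `compare_faces` call on two S3 objects could fill that role.

```python
def identify_user(unknown_key, known_keys, faces_match, default_user="default"):
    """Compare the unknown photo to every stored photo; return the matching user."""
    suffix = "Face.jpg"
    for key in known_keys:
        if key == unknown_key:
            continue  # do not compare the new photo with itself
        if faces_match(unknown_key, key):
            # "TonyFace.jpg" -> "Tony" (assumed naming convention)
            return key[:-len(suffix)] if key.endswith(suffix) else key
    return default_user


def rekognition_matcher(bucket, threshold=90.0):
    """Build a faces_match function backed by Amazon Rekognition (needs boto3)."""
    import boto3  # imported lazily so the sketch loads without AWS installed

    client = boto3.client("rekognition")

    def faces_match(source_key, target_key):
        response = client.compare_faces(
            SourceImage={"S3Object": {"Bucket": bucket, "Name": source_key}},
            TargetImage={"S3Object": {"Bucket": bucket, "Name": target_key}},
            SimilarityThreshold=threshold,  # assumed value
        )
        return len(response["FaceMatches"]) > 0

    return faces_match
```

In use, the photo would first be uploaded to the bucket as "UnknownFace.jpg", and then something like `identify_user("UnknownFace.jpg", keys, rekognition_matcher("my-bucket"))` would run, where the bucket name is a placeholder.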

LED Ring

The LED Ring is a ring of 128 RGB LEDs controlled by an Arduino Uno. The Uno is connected to the Raspi 3 via a UART serial connection, and to the LED strip via 1 signal pin. When the Arduino receives a mode number from the Raspi corresponding to some action/command from the user, it updates the patterns for the LEDs.

Some of the modes are:

  • Steady on 1 color
  • Fade in and out 1 color
  • "Chasing" start up sequence
  • Wrap up from bottom to top once when hotword is detected
  • All off

The particular LED strip we're using is finicky, requiring about 3 hours and 1 blown-up Arduino to get working properly with the Raspi. So that you don't suffer the same fate, here are some of the issues we ran into, along with some that we foresaw before they could happen:

  • Not sharing a common ground. This was a tricky one to debug (Shoutout to Professor Feher for figuring this one out!!), as it wasn't obvious to us that the ground voltage of the Arduino and the LEDs could be different given that they were both plugged into the same power strip.
  • Not placing a suitably sized capacitor on the LED strip's power leads.
  • Not placing an isolator between the Arduino and the LED strip (blew up an Arduino Leonardo due to this one)
  • Not tying the signal pin of the LEDs to ground
  • Not placing a small resistor on the signal pin of the Arduino connected to the LEDs

For future iterations of this subsystem, the only change we would make is to the way the Arduino checks the serial port. In the current design, it can only check for data from the Raspberry Pi once every lighting cycle (i.e. one fade in-fade out cycle, one loop of the start-up sequence, etc.), because the LEDs are updated sequentially from the data line, with very specific timing required to avoid problems.

The WS2812B chipset that we are using for our LED strip reads data encoded as a PWM signal, with a duty cycle of approximately 33% representing a 0 and a duty cycle of approximately 67% representing a 1. Each bit cycle takes approximately 1.25 microseconds, with a maximum leeway of 0.6 microseconds. The strip expects 8-bit color for each of the three channels, which totals 24 bits per LED. That comes to 30 microseconds per LED, with a maximum leeway of 14.4 microseconds. If the data line is held low for more than 50 microseconds, however, the strip resets itself, so for the Arduino to check the serial port in between LED write cycles, it has a maximum of 34.4 microseconds to do a Serial.available() call. During testing, the fastest Serial.available() call took far longer than that, so it became apparent that checking for data was only possible between full write cycles to the LED strip. A full write cycle takes approximately 4 milliseconds on average, given that there are 128 LEDs on the strip.
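
The timing figures above follow from a few multiplications, which can be checked with a short sketch. The constant and function names are ours for illustration; the numbers are the ones quoted in the text.

```python
BIT_TIME_US = 1.25     # approximate time to send one bit
BIT_LEEWAY_US = 0.6    # maximum timing leeway per bit
BITS_PER_LED = 24      # 8 bits for each of the three color channels
LED_COUNT = 128        # LEDs on the ring


def per_led_time_us():
    """Time to send the color data for one LED."""
    return BIT_TIME_US * BITS_PER_LED


def per_led_leeway_us():
    """Accumulated leeway over the 24 bits of one LED."""
    return BIT_LEEWAY_US * BITS_PER_LED


def full_write_ms(led_count=LED_COUNT):
    """Time for one complete write to the whole strip, in milliseconds."""
    return per_led_time_us() * led_count / 1000.0


print(per_led_time_us(), per_led_leeway_us(), full_write_ms())
```

The full write works out to 3.84 ms for 128 LEDs, which is the "approximately 4 milliseconds" quoted above.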

During the development process, having the Arduino control the LEDs while seamlessly taking serial input from the Raspberry Pi proved difficult. The Arduino's serial buffer does not have enough space to store a complete command from the Pi. To solve this, we used a one-byte handshake from the Pi which, when read by the Arduino on its next read cycle, instructs the Arduino to reply with its own handshake and prepare to receive the new command. Although the time from when the Arduino checks the serial port to when the new light sequence starts should be less than 100 ms, the actual times were between 1 and 3 seconds. This could be due to a number of things, including how often the process that communicates with the Arduino gets processor time, a potentially bad connection that results in a high data loss rate, or a coding error on either device. Because we were unable to pinpoint the cause of the extra delay, we decided that the Arduino would check for a new command only after every lighting cycle, so that the animations appeared smooth rather than choppy and sluggish.
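
The Pi side of the handshake described above might look like the following sketch. The byte values, the `send_command` name, and the `FakeArduinoPort` class are hypothetical; `port` is assumed to be any object with `write(bytes)` and `read(size)` methods, which is the interface a pyserial `Serial` object provides.

```python
HANDSHAKE_REQUEST = b"\x01"  # hypothetical byte sent by the Pi
HANDSHAKE_ACK = b"\x02"      # hypothetical byte the Arduino replies with


def send_command(port, command):
    """Send one command to the Arduino, preceded by a one-byte handshake."""
    port.write(HANDSHAKE_REQUEST)
    # The Arduino only answers at the end of its current lighting cycle,
    # so a real port should be opened with a generous read timeout.
    if port.read(1) != HANDSHAKE_ACK:
        return False  # no reply: the command is not sent
    port.write(command)
    return True


class FakeArduinoPort:
    """In-memory stand-in for the serial port, for trying the flow without hardware."""

    def __init__(self, responsive=True):
        self.responsive = responsive
        self.awaiting_command = False
        self.commands = []
        self._outbox = b""

    def write(self, data):
        if self.awaiting_command:
            self.commands.append(bytes(data))
            self.awaiting_command = False
        elif data == HANDSHAKE_REQUEST and self.responsive:
            self._outbox += HANDSHAKE_ACK
            self.awaiting_command = True

    def read(self, size=1):
        chunk, self._outbox = self._outbox[:size], self._outbox[size:]
        return chunk
```

Because the command is only written after the reply arrives, the Arduino never has to hold more than the handshake byte in its buffer while it is busy driving the LEDs.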

In the next iteration of the LED subsystem, there are two possible solutions to fix the response times. The first is to implement a circuit that can store an entire command from the Raspberry Pi and output it to the Arduino either through its serial port or through its data pins. Of these two outputs, the data pins are more likely to be viable: each digitalRead() cycle takes approximately 5 microseconds, so with a maximum of 128 bytes per command, reading a command takes at least 5 milliseconds and should take no more than 30 milliseconds in the worst case. Since most commands are significantly shorter than 128 bytes, the average time for this process should be no more than a serial read of 128 bytes, which is what our data input algorithm assumes. The second possible solution is to completely rewrite the command structure to use fewer additional arguments, which means giving up some of how much each animation can be customized in exchange for response time. An entire command could then fit within the Arduino's 64-byte serial buffer, which removes the need for a handshake before data transmission can begin. The Arduino code would also have to be completely rewritten so that it checks the serial port after a certain number of complete write cycles to the LED strip.
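
The two estimates above can be checked with a short sketch. It assumes that one bit is read per digitalRead() call on a single pin, which is our reading of the first option; the function names are hypothetical.

```python
DIGITAL_READ_US = 5.0      # approximate time of one digitalRead() call
MAX_COMMAND_BYTES = 128    # maximum command length quoted in the text
SERIAL_BUFFER_BYTES = 64   # Arduino serial buffer size quoted in the text


def pin_read_time_ms(n_bytes, read_us=DIGITAL_READ_US):
    """Minimum time to clock in n_bytes, one bit per digitalRead() (assumed)."""
    return n_bytes * 8 * read_us / 1000.0


def fits_serial_buffer(command, buffer_bytes=SERIAL_BUFFER_BYTES):
    """True if the whole command fits in the serial buffer, so no handshake is needed."""
    return len(command) <= buffer_bytes


print(pin_read_time_ms(MAX_COMMAND_BYTES))
```

A full 128-byte command comes to 5.12 ms, matching the "at a minimum 5 milliseconds" figure, and any command trimmed to 64 bytes or fewer would pass the buffer check.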

Physical Design

The mirror was designed such that the monitor would be central on the mirror, with the Raspberry Pi and Arduino mounted near the camera mount on top of the mirror, and all of the other electronics, except the speaker, mounted nearby on the top. The speaker is mounted on its mounting block on the bottom of the mirror, with extended wires going to the Google AIY Voice HAT's output screw terminals.

In order to accomplish this, we constructed a simple open wooden box large enough to fit inside the interior edge of the 3D printed frame. Exterior dimensions should be no larger than 26"x32", and no more than 2" shorter in either dimension. The box we constructed was 6" deep due to the wood we bought, although a box half as deep would have sufficed.

There are 2 wooden crossbeams inside of the wooden box to hold the monitor in its appropriate position. These are held to the box with 2 screws on either side, with the lower beam approximately 8.5" above the bottom of the box, and the second beam approximately 10" above the first beam. These measurements should be adjusted to accommodate different size monitors.

Finally, as a part of cable management, the power supplies and power strip were held to the left edge of the box using screws. Any other cable management is accomplished using electrical tape as a temporary measure, as we intend to improve the physical box in the future, and any permanent cable management or other mounting mechanism would make upgrading the wooden box difficult.

3D Printed Frame

The front facing exterior frame was constructed of 18 individual 3D printed parts, using 6 different CAD files. Parts list:

  • (4) Corner blocks
  • (1) Camera mount
  • (1) Speaker mount
  • (9) 6"x4" block
  • (2) 5.6"x4" block
  • (1) Cord hole block

The top is constructed by connecting a corner block, 6x4 block, camera mount, another 6x4 block, and finally another corner block. The bottom is constructed the same way, except the speaker mount replaces the camera mount. The left side is constructed with 2 6x4 blocks, a 5.6x4 block, and the cord hole block. The right side is constructed with 3 6x4 blocks, and a 5.6x4 block.

Before assembling the 4 sides of the frame, make sure that the mirror fits snugly within the ruts, and that the mirror has the protective layers removed before final insertion.

Results

Safety/Privacy Concerns

Since this mirror has access to both a camera and an always-on microphone, there is a fair amount of concern over violations of the user's privacy. This is especially pertinent since these files are uploaded over the internet and analyzed by third-party servers, which is simply something to be aware of. In the future we would like the ability to take the mirror offline (using CMU Sphinx and OpenCV), which would limit functionality but increase security for concerned users.

Additionally, this mirror is an electronic device, so people should be careful not to pour water over it. The box and plastic frame are what we would call "splash resistant", but anyone actively trying to water-damage the components could easily do so.

Frame and Physical Design

The frame was a bit of a nightmare: our connections weren't long enough, and so the frame was highly unstable. With lots of epoxy, tape, and rigid frame supports, we were able to get the frame into a sturdy-ish state, and it looked cool in the end. The physical box was a challenge to assemble and work with, as our measurements for the wood weren't perfect: the frame was a bit taller than the box, the box was a bit deeper than we expected it to be, and we didn't have all the tools we needed for the job. In the end, however, the box came together well and we had plenty of room for all our components with minimal backlight issues.

LEDs

The LEDs integrated into the project very well, and for the most part worked as intended.

Speech to Text and Audio Output

Speech to text turned out well. It struggles in a space where a large number of people are talking close to the microphone at once, and no STT engine handles accents very well, but it is surprisingly sensitive and fantastic at tuning out background noise and music when listening. Audio out worked well too, though pronunciation for Google Audio isn't perfect quite yet.

Facial Recognition

Facial recognition, when it works, works great. Ours broke because fraudulent charges to our account led to the account being suspended, and there was not much we could do about that.

General Software

The software in general works really well. It's a little finicky to ensure all the servers boot up in the correct order, but in general the websocket communication keeps everything running quickly and smoothly, which leads to a very satisfying result overall.

Commands

In the end we were able to implement even more commands than we planned, including commands that make use of web scraping techniques, APIs with secure keys, simple JavaScript functions, and local filesystem storage. Additionally, we designed the command engine so that adding a command to the mirror takes about 40 lines of code in total, split between a command object and a UI.

SubPart: Future Improvements/ How would we improve on this Project

Cost Saving:

It would have been great to cut costs on this project. One major cost-saving measure would have been to use a piece of glass and a reflective coating instead of an actual two-way mirror. This is more technically challenging but is about half as expensive per square foot, based on brief research. Looking harder for a cheaper display would also have been a great way to cut costs.

Physical Design:

Spending more time on the physical design of the mirror would have been nice. The 3D printed frame is a great idea for our context (testing out many different technologies and seeing what we can make work), but it would have been much easier to simply spend more time designing a better, lighter frame/box for our mirror and components, as opposed to spending hours CADing and printing parts.

We would also love to get a better mic/speaker onto this mirror. We got ours as part of the AIY Voice HAT Kit, which was cheap and easy, but a better speaker would definitely improve the project's functionality.

Software:

The main software improvements I would make are around the STT and facial recognition engines. I would love to move them offline to something like CMU Sphinx and OpenCV, both for reliability without internet access and for the privacy of the user. I would also like to give each command its own folder, to make it even easier to add new commands to the system without having to modify the core mirror files, and to improve the reliability of bootup.


Tech Reflect Voice Log: https://classes.engineering.wustl.edu/ese205/core/index.php?title=Tech_Reflect_Voice_Log