Difference between revisions of "Tech Reflect Voice"

From ESE205 Wiki

Revision as of 15:59, 1 May 2018

Project Overview

The best creations come not from reinventing the wheel, but from integrating existing technologies in new and interesting ways. This is why, when we saw the original Tech Reflect project, we realized that there was so much opportunity to improve upon it. With the increasing prevalence of IoT devices, home assistants like Google Home and Amazon Alexa, open-source and free-to-use APIs, and the decreasing cost of display technology, it has become possible to cheaply and easily create a piece of physical hardware for the home that uses the strengths of home assistants and cheap displays while minimizing their intrusion into your life.

Group Members

  • Ethan Shry
  • Tony Sancho-Spore
  • Baihao Xu (Kevin)
  • Ellen Dai (TA)

Project Proposal

https://docs.google.com/presentation/d/1UtUwZfxM7SI90nvJo0Fx1iLGG8HB7N2jdl1ApcL0bRY/edit?usp=sharing

Objectives

We hope to construct a proof-of-concept bathroom mirror which responds to user feedback. The mirror and GUI should be somewhat aesthetically pleasing. It should listen to the user and be able to convert what they ask for into a visual response displayed on-screen. We hope to show that there is some novelty or value in integrating technology into everyday items like mirrors.

Challenges

Due to the tools available to us, ensuring that the hardware talks to the Python listener, which in turn talks to the GUI server, will be somewhat logistically challenging, especially as we will be running code in several different programming languages.

Selecting which command to run when analyzing user speech will also be a nightmare should we decide to allow multiple different trigger commands. A simple solution would only look for exact string matches; a more robust solution will require further investigation.

Our current plan for the mirror is to 3D print the frame. Due to the lack of large printers available to us, this needs to be done in many pieces (>10), which may prove infeasible.

Additionally, Kevin will need to become comfortable with the Pug templating language and Node.js.

Gantt Chart

[Figure: Gantt chart (Media:GanttChartTRVoiceSpring2018.PNG)]

Budget

21.5" Display: $65 @ Microcenter

Google AIY Voice Kit (Includes Speaker and Microphone): $10 @ Microcenter (DISCONTINUED)

Mirror: $50 @ Amazon (https://www.amazon.com/gp/product/B01G4MQ966/ref=oh_aui_detailpage_o00_s00?ie=UTF8&psc=1)

Raspberry Pi 3: $35

Raspberry Pi NoIR Camera: $25

Arduino Mini: $10

LED Strip: About $25 for more than you need

3 spools Standard Breadboard-y wire (22 gauge): $15

Frame: $20ish for wood and screws

Paint: $5

Power Strip: $10

3D Printed Frame: $35 (approximately 1kg of PLA)

2 Power Supplies/MicroUSB Cables: $25

Epoxy: $10

Total Cost: $340 (ish)

Code and CAD Files

All the code for this project can be found on Github

All the CAD Files for this project can be found on Grabcad

Design And Solutions: Modules

Product Design Process Overview

Step 1: Find and target users

We think our target users are families and home users, so we can break down the information we show into what is most pertinent to them:

1. Our users need access to information on the internet: Twitter, stocks, Wiki search

2. Our users need to know what they need to do during the day: Reminders

3. Our users need to know the environment around them: Weather

4. Our users need some personalized features: Timer, personalized hotwords, etc.


Step 2: Designing the main page

When we designed the main page, we wanted to give consumers a brief overview of some of the things the mirror can do for them, to inspire possible uses of the mirror.

Step 3: Prioritize features based on client importance

1. Solo Users: Our users will care about utility first, which means the weather, reminder, and timer features. The timer helps them control the time they spend in front of the mirror every morning. The weather helps them decide what clothes to wear for the day. The reminder feature reminds them not to forget anything during the day. The Twitter and stock features are optional, because users might not have time to look at them, but they will be very attractive to users who like them and will make us competitive compared to potential competitors.

2. Families: Family users may need features similar to those of our solo users. However, they might need more powerful back-end support, because more information, such as reminders, needs to be saved in the database, and they need the ability to keep different information and different profiles for different users in the household.

Step 4: Prioritize our features

At this point we pared down our feature list to only the most important features: the ones that our users will rely on most heavily, and the ones most vital to client use.


Step 5: Design Testing and Guidelines

Due to the nature of a smart mirror, it is very important that UI designs have high contrast: a two-way mirror blocks part of the light coming through, so low-contrast designs can be hard to see. Additionally, the less cluttered the design is, the more functional the mirror is as a mirror.

Our Design Principles

Main Page: We gave the main page a black background, because the screen sits behind the mirror, and made all text and pictures white; we think this is the best way to guarantee that users can see the screen clearly. The main page also needs to let users know what features we offer, so it lists all of them, such as Twitter and reminders. For the weather feature, we even show an icon for each weather condition, such as cloudy or raining.


Twitter: We tried to make the user experience as close to Twitter's as possible, but because our screen is generally black and white only, we kept the background black and the text white. We redesigned the page layout to better fit our screen. In one respect it is better than real Twitter, because it is an ad-free version.


Weather: In the weather section, we imported many icons, because we want to give users a real sense of what the weather will be like today alongside the text. To keep the UX consistent from day to day, we use a single icon set, which makes the product feel higher quality. If some users prefer another set of weather icons, we can swap it based on their preference.


Reminder: The reminder page reminds people what to do every day, so the most important thing here is to let users know what they need to do, where, and at what time. Our layout includes these parameters to make for a more user-friendly experience.


Timer: Our timer helps users control the time they spend at the mirror each day. For example, when users have just gotten up in the morning, their time is valuable; we want them not to miss anything important (by using our reminder feature) while still enjoying the features the mirror offers.


Stock: This is an optional feature because it only appeals to people who follow stocks. They can view the most up-to-date information about a stock on the mirror, which may give them the most exciting news of their day.

Software Overview

Below is the overview of the Software flow for this project.

[Figure: software flow diagram (Media:smVoiceCodeFlow.png)]

HTML&CSS to PUG

When we designed the layouts for our project, we followed a two-step procedure. First, we wrote our ideas in HTML. After all members of the group agreed on the design, we translated the HTML to Pug. Pug is an open-source, high-performance template engine, heavily influenced by Haml and implemented in JavaScript for Node.js and browsers. It provides the ability to write dynamic and reusable HTML documents. This allows us to easily and quickly write what would otherwise be very tedious HTML pages and render them securely on the server with minimal impact on the client. The tutorial and syntax of Pug can be found at https://github.com/pugjs/pug.

Speech to Text

[Figure: speech-to-text testing (Media:SmVoiceSpeechTesting.png)]

After sampling different speech-to-text libraries, it became clear that Google Cloudspeech was vastly superior to any other solution in terms of what we wanted to achieve. While it would have been nice to run something locally like CMU Sphinx, or to use a totally free solution, in the end Google was the easiest to integrate with and had the most accurate speech-to-text conversion.

Additionally, the use of Google Cloudspeech allowed us to use the Google AIY Voice Kit, which was cheap and made integration with Google's services incredibly easy, with a pre-configured distro of Raspbian available for use.

What we essentially have is a two-loop system for hotword detection and then speech recognition, as the following pseudocode indicates:

shouldBeListening = True
while shouldBeListening:
    shouldBeListening = checkIfShouldListen()
    text = listen()                    # first listen: wait for the hotword
    if hotword in text:
        commandText = listen()         # second listen: capture the command
        sendToNodeServer(commandText)
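
The pseudocode above can be turned into a small runnable sketch. This is only an illustration, not the project's actual listener: the function names, the hotword "mirror", and the idea of passing the microphone and server calls in as plain functions are all assumptions made so the loop can be exercised without any hardware.

```python
from typing import Callable, List


def run_listener(
    listen: Callable[[], str],
    should_listen: Callable[[], bool],
    send_to_node_server: Callable[[str], None],
    hotword: str = "mirror",  # hypothetical hotword
) -> None:
    """Wait for the hotword, then forward the next utterance as a command."""
    while should_listen():
        text = listen().lower()
        if hotword in text:
            # Hotword heard: the next utterance is treated as the command.
            command_text = listen()
            send_to_node_server(command_text)


def demo() -> List[str]:
    """Drive the loop with scripted utterances instead of a microphone."""
    queue = ["background chatter", "hey mirror", "Show me the weather for Seattle"]
    sent: List[str] = []
    run_listener(
        listen=lambda: queue.pop(0),
        should_listen=lambda: len(queue) > 0,
        send_to_node_server=sent.append,
    )
    return sent


if __name__ == "__main__":
    print(demo())
```

In a real deployment, `listen` would wrap the speech-to-text call and `send_to_node_server` would write to the socket connected to the Node server.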

Command Parsing

A command string will come down from the server as follows:

   "Show me the weather for Seattle"

Our custom command parser will try to match that against an array of commands. Each command takes the form of an object in a JavaScript array, as follows:

   {
       name: "descriptive name, only for coder use",
       cmdStrings: [
           // List of strings.
           // Can make use of %?% for a parameter.
           // Can make use of %$% for a continuous whitelist character sequence.
           // Max 1 of each per command.
           // Must have a separator between %?% and %$%,
           // i.e. "%$% timer for %?%" is valid, "timer %$% %?%" is not valid.
           "%$% timer for %?%"
       ],
       keywords: ["keyword list. should be unique to only this command- idea is if no cmdString matches for any command, will loop back to look for keyword in command. no %?% or %$% allowed"],
       trigger: (param, activeUser) => {
           // function to be called when command is triggered
       },
       viewName: "name of .pug file which should be displayed as a result of this command i.e. stockView"
   }

Our Command parser then works as follows:

   for command in possibleCommandList:
       for commandString in command.cmdStrings:
           if commandString strictly matches spokenCommand (excluding %?% or %$% templates):
               // we have matched to a command, call the command trigger function with a passed in parameter (if one exists)
   // no command matched, try to match a keyword
   for command in possibleCommandList:
       for keyword in command.keywords:
           if spokenCommand contains keyword:
               // we have matched to a command, call the command trigger function

For instance, our sample command would match to the following:

   sample spokenCommand: "Show me the weather for Seattle"
   would match to commands as follows:
   commandString: "%$% weather for %?%" - would match with parameter: Seattle
   commandString "Show me the weather %$%" - would match with no parameter
   keyword - "weather" - would match with no parameter
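
As a rough illustration of the two-pass matching described above, here is a minimal Python sketch (the project's own parser is written in JavaScript). It translates each command string into a regular expression; the `Command` class, the function names, and the exact set of characters accepted by `%$%` are assumptions made for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PARAM = "%?%"     # captured parameter
WILDCARD = "%$%"  # uncaptured run of whitelisted characters


@dataclass
class Command:
    name: str
    cmd_strings: List[str]
    keywords: List[str] = field(default_factory=list)


def template_to_regex(template: str) -> re.Pattern:
    """Turn a command string into a case-insensitive regular expression."""
    pieces = []
    # Splitting on a capturing group keeps the placeholders in the result.
    for part in re.split(r"(%\?%|%\$%)", template):
        if part == PARAM:
            pieces.append(r"(?P<param>.+)")
        elif part == WILDCARD:
            pieces.append(r"[\w\s']+")  # assumed whitelist: words, spaces, apostrophes
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces), re.IGNORECASE)


def parse(spoken: str, commands: List[Command]) -> Optional[Tuple[Command, Optional[str]]]:
    text = spoken.strip()
    # Pass 1: strict match against every command string.
    for command in commands:
        for template in command.cmd_strings:
            match = template_to_regex(template).fullmatch(text)
            if match:
                return command, match.groupdict().get("param")
    # Pass 2: keyword fallback, which never carries a parameter.
    lowered = text.lower()
    for command in commands:
        if any(keyword.lower() in lowered for keyword in command.keywords):
            return command, None
    return None


WEATHER = Command("weather", ["%$% weather for %?%", "Show me the weather %$%"], ["weather"])
TIMER = Command("timer", ["%$% timer for %?%"], ["timer"])

if __name__ == "__main__":
    print(parse("Show me the weather for Seattle", [TIMER, WEATHER]))
```

Returning on the first match is a design choice of this sketch; when several command strings could match, as in the sample above, the order of the command list decides which one wins.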

Facial Recognition

To recognize a person's face, we used Amazon's Rekognition AWS service. Rekognition is designed to work as a complete Computer Vision (CV) API, which includes facial recognition, object finding, image comparisons, etc. We also used Amazon's S3 AWS service, which works much like Google Drive, but for the AWS APIs. The complete version of our facial recognition program works as follows:

  • Wait for the "Switch User" command
  • Take a photo using the Raspi camera
  • Upload the picture to the S3 bucket as "UnknownFace.jpg"
  • Compare the picture to every other picture in the S3 bucket ("TonyFace.jpg", "EthanFace.jpg"...)
  • If any of the pictures match the unknown face, then we know who is currently using the mirror
  • Else, switch back to the default user (Future Update: be able to add users on the fly)
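
The steps above could be sketched roughly as follows. This is not the project's code: the function name, the "NameFace.jpg" naming rule, the 80% similarity threshold, and the bucket name are assumptions. The S3 and Rekognition clients are passed in as arguments; with boto3 they would come from `boto3.client("s3")` and `boto3.client("rekognition")`, and the small stand-in classes here exist only so the sketch runs without an AWS account.

```python
from typing import Any, List, Optional

UNKNOWN_KEY = "UnknownFace.jpg"


def identify_user(s3: Any, rekognition: Any, bucket: str, photo_path: str,
                  default_user: str = "default", threshold: float = 80.0) -> str:
    """Upload a fresh photo and compare it to every stored face in the bucket."""
    s3.upload_file(photo_path, bucket, UNKNOWN_KEY)
    listing = s3.list_objects_v2(Bucket=bucket)
    for obj in listing.get("Contents", []):
        key = obj["Key"]
        if key == UNKNOWN_KEY:
            continue  # do not compare the photo to itself
        response = rekognition.compare_faces(
            SourceImage={"S3Object": {"Bucket": bucket, "Name": UNKNOWN_KEY}},
            TargetImage={"S3Object": {"Bucket": bucket, "Name": key}},
            SimilarityThreshold=threshold,
        )
        if response.get("FaceMatches"):
            # "TonyFace.jpg" -> "Tony" (naming rule assumed for this sketch)
            return key[: -len("Face.jpg")] if key.endswith("Face.jpg") else key
    return default_user


class FakeS3:
    """Stand-in for the S3 client, for the demo only."""
    def __init__(self, keys: List[str]) -> None:
        self.keys = list(keys)

    def upload_file(self, filename: str, bucket: str, key: str) -> None:
        if key not in self.keys:
            self.keys.append(key)

    def list_objects_v2(self, Bucket: str) -> dict:
        return {"Contents": [{"Key": k} for k in self.keys]}


class FakeRekognition:
    """Stand-in that reports a match for exactly one stored picture."""
    def __init__(self, matching_key: Optional[str]) -> None:
        self.matching_key = matching_key

    def compare_faces(self, SourceImage: dict, TargetImage: dict,
                      SimilarityThreshold: float) -> dict:
        hit = TargetImage["S3Object"]["Name"] == self.matching_key
        return {"FaceMatches": [{"Similarity": 99.0}] if hit else []}


if __name__ == "__main__":
    stored = ["TonyFace.jpg", "EthanFace.jpg"]
    print(identify_user(FakeS3(stored), FakeRekognition("EthanFace.jpg"),
                        "mirror-faces", "photo.jpg"))
```

Against the real service, `compare_faces` raises an error when no face is found in an image, so a production version would need to catch that case and fall back to the default user as well.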

LED Ring

The LED Ring is a ring of 128 RGB LEDs controlled by an Arduino Uno. The Uno is connected to the Raspi 3 via a UART serial connection, and to the LED strip via one signal pin. When the Arduino receives a mode number from the Raspi corresponding to some action or command from the user, it updates the pattern shown on the LEDs.

Some of the modes are:

  • Steady on 1 color
  • Fade in and out 1 color
  • "Chasing" start up sequence
  • Wrap up from bottom to top once when hotword is detected
  • All off
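
On the Raspi side, sending a mode number could look something like the sketch below. The mode numbering, the one-byte message format, and the names `LedMode` and `send_mode` are all hypothetical; the source only states that a mode number is sent over serial. The function accepts any binary stream, so it works with an in-memory buffer here and should work with a pySerial port object (for example `serial.Serial("/dev/serial0", 9600)`, with the device path and baud rate also assumed).

```python
import io
from enum import IntEnum
from typing import BinaryIO


class LedMode(IntEnum):
    # Hypothetical numbering; the Arduino sketch must use the same values.
    ALL_OFF = 0
    STEADY = 1
    FADE = 2
    CHASE_STARTUP = 3
    HOTWORD_WRAP = 4


def send_mode(port: BinaryIO, mode: LedMode) -> bytes:
    """Write one mode number as a single byte and return what was sent."""
    frame = bytes([int(mode)])
    port.write(frame)
    port.flush()
    return frame


if __name__ == "__main__":
    fake_port = io.BytesIO()  # stands in for the UART connection
    send_mode(fake_port, LedMode.HOTWORD_WRAP)
    print(fake_port.getvalue())
```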

The particular LED strip we're using is finicky, requiring about 3 hours and 1 blown-up Arduino to get working properly with the Raspi. To help others avoid the same fate, here are some of the issues we ran into, along with some we foresaw before they could happen:

  • Not sharing common ground. This was a tricky one to debug (Shoutout to Professor Feher for figuring this one out!!), as it wasn't obvious to us that the ground voltage of the Arduino and the LEDs could be different given that they were both plugged into the same power strip.
  • Not placing a suitably sized capacitor on the LED strip's power leads.
  • Not placing an isolator between the Arduino and the LED strip (blew up an Arduino Leonardo due to this one)
  • Not tying the signal pin of the LEDs to ground
  • Not placing a small resistor on the signal pin of the Arduino connected to the LEDs

Physical Design

The mirror was designed such that the monitor would be central on the mirror, with the Raspberry Pi and Arduino mounted near the camera mount on top of the mirror, and all of the other electronics, except the speaker, mounted nearby on the top. The speaker is mounted on its mounting block on the bottom of the mirror, with extended wires going to the Google AIY Voice HAT's output screw terminals.

In order to accomplish this, a simple open wooden box large enough to fit the interior edge of the 3D printed frame inside is constructed. Exterior dimensions should be no larger than 26"x32", and no less than 2" shorter for either dimension. The box we constructed was 6" deep due to the wood we bought, although a box half as thick would have sufficed.

There are 2 wooden crossbeams inside of the wooden box to hold the monitor in its appropriate position. These are held to the box with 2 screws on either side, with the lower beam approximately 8.5" above the bottom of the box, and the second beam approximately 10" above the first beam. These measurements should be adjusted to accommodate different size monitors.

Finally, as a part of cable management, the power supplies and power strip were held to the left edge of the box using screws. Any other cable management is accomplished using electrical tape as a temporary measure, as we intend to improve the physical box in the future, and any permanent cable management or other mounting mechanism would make upgrading the wooden box difficult.

3D Printed Frame

The front facing exterior frame was constructed of 18 individual 3D printed parts, using 6 different CAD files. Parts list:

  • (4) Corner blocks
  • (1) Camera mount
  • (1) Speaker mount
  • (9) 6"x4" block
  • (2) 5.6"x4" block
  • (1) Cord hole block

The top is constructed by connecting a corner block, 6x4 block, camera mount, another 6x4 block, and finally another corner block. The bottom is constructed the same way, except the speaker mount replaces the camera mount. The left side is constructed with 2 6x4 blocks, a 5.6x4 block, and the cord hole block. The right side is constructed with 3 6x4 blocks, and a 5.6x4 block.

Before assembling the 4 sides of the frame, make sure that the mirror fits snugly within the ruts, and that the mirror has the protective layers removed before final insertion.

Results

Frame and Physical Design

The frame was a bit of a nightmare: our connections weren't long enough, so the frame was highly unstable. With lots of epoxy, tape, and rigid frame supports, we were able to get the frame into a sturdy-ish state, and it looked cool in the end. The physical box was a challenge to assemble and work with, as our measurements for the wood weren't perfect: the frame was a bit taller than the box, the box was a bit deeper than we expected it to be, and we didn't have all the tools we needed for the job. In the end, however, the box came together well and we had plenty of room for all our components with minimal backlight issues.

LEDs

The LEDs integrated into the project very well and worked as intended. They are occasionally slow to update, due to the nature of sequentially updating LEDs.

Speech to Text and Audio Output

Speech to text turned out well. It struggles in a space where a large number of people are talking close to the microphone at once, and no STT engine handles accents very well, but it is surprisingly sensitive and fantastic at tuning out background noise and music when listening. Audio output worked well too, though pronunciation for Google Audio isn't perfect quite yet.

Facial Recognition

Facial recognition, when it works, works great. Ours broke because fraudulent charges to our account led to the account being suspended, and there was not much we could do about that.

General Software

The software in general works really well. It's a little finicky to ensure all the servers boot up in the correct order, but in general the websocket communication keeps everything running quickly and smoothly, and overall leads to a very satisfying result.

Commands

In the end we were able to implement even more commands than we planned, including commands that make use of web scraping techniques, APIs with secure keys, simple JavaScript functions, and local filesystem storage. Additionally, we designed the command engine so that adding a new command to the mirror requires only about 40 lines of code in total, split between a command object and a UI.

Future Improvements: How We Would Improve on This Project

Cost Saving:

It would have been great to cut costs on this project. One major cost-saving measure would have been to use a piece of glass and a reflective coating instead of an actual two-way mirror. This is more technically challenging but is about half as expensive per square foot, based on brief research. Looking harder for a cheaper display would also have been a great way to cut costs.

Physical Design:

Spending more time on the physical design of the mirror would have been nice. The 3D printed frame is a great idea for our context (testing out many different technologies and seeing what we can make work), but it would have been much easier to simply spend more time designing a better, lighter frame/box for our mirror and components, as opposed to spending hours CADing and printing parts.

We would also love to get a better mic/speaker onto this mirror. We got ours as part of the AIY Voice HAT Kit, which was cheap and easy, but a better speaker would definitely improve the project's functionality.

Software:

The main software improvements I would make are around the STT and facial recognition engines. I would love to move them offline to something like CMU Sphinx and OpenCV, simply for reliability without internet access and for the privacy of the user. I would also like to give each command its own folder, to make it even easier to add new commands to the system without having to modify the core mirror files, and to improve the reliability of bootup.


Tech Reflect Voice Log: https://classes.engineering.wustl.edu/ese205/core/index.php?title=Tech_Reflect_Voice_Log