Difference between revisions of "BusyBear"
(→Managing a Database)
|Line 122:||Line 122:|
==Managing a Database==
==Managing a Database==
database through AWS named BusyBear. Which we can access through MySQL workbench. From the database, we created multiple tables to story current MAC addresses and historical information. It was important for us that the table storing MAC addresses auto populate timestamps with the current time. As far as externally connecting the database to both the Pi and the website, we found that using a combination of python and PHP was the most effective. It is important to note, however, to keep the database secure we had to escape all incoming data to prevent SQL injection attacks. As of now, our database isn't in 3rd normal form however this is a goal we have to store and access data most efficiently. One hurdle was attempting to store pictures in the database (as LONGBLOB) but we discovered that because a) the complexity caused instability, and b) picture's large size made queries take an exceptional amount of time. It was best to just upload images directly to the website and bypass the database.<br>
=== Databases and Structures Used ===
=== Databases and Structures Used ===
Revision as of 17:08, 29 April 2019
- 1 Project Proposal
- 2 Design and Solutions
- 3 Results
- 4 Next Steps
- 5 References
It always seems like an impossible task to find an open table to work or a quick line for food across the WashU campus. BusyBear's goal is to create a database that is accessible to WashU students that will show the population and busyness trends of popular locations on campus, beginning with Bear's Den. By using a network adapter connected to the Raspberry Pi, we will receive an approximate measurement of busyness based on the number of found MAC addresses for a specific region. By looking at pictures taken simultaneously with the MAC address collection, a historic trend between the number of found MAC addresses and relative busyness can be determined. We hope to be able to store this information in a database hosted by AWS and display this data on a website. Our end goal is to gather information to allow the WashU community to create more educated decisions regarding where to go and when to go there.
David Tekien, TA
Jim Feher, Instructor
- Learn and be able to code in Python as related to the Pi
- Use sniffing/MAC tracking method in the analysis of busyness
- Investigate the use of the camera in the analysis of busyness
- Be able to monitor busyness fairly accurately by filtering detected devices
- Compare busyness at different times of day and between buildings
- Design a GUI for an aesthetically pleasing and useful website
- Host a website displaying useful and relevant data through Amazon Web Services (AWS)
- Limited experience with working with WiFi receivers or anything to do with MAC Addresses
- Limited knowledge of Python and Raspberry Pi
- Connecting our data with a database, AWS, and a website
- Privacy Concerns
|AWS||Website Hosting||$5 / month||https://aws.amazon.com/pricing/?nc2=h_ql_pr|
|2 x TL-WN722N||Network Adapter||returned: $7.21||https://www.amazon.com/TP-Link-TL-WN722N-Wireless-network-Adapter/dp/B002SZEOLG|
|1 x 5dBi Long Range WiFi for Raspberry Pi||Network Adapter||returned: $5.00||https://www.amazon.com/5dBi-Long-Range-WiFi-Raspberry/dp/B00YI0AIRS/ref=lp_9026119011_1_1?srs=9026119011&ie=UTF8&qid=1550447401&sr=8-1|
|1 x Alfa AWUSO36NH High Gain USB Wireless G/N Long-Range WiFi Network Adapter||Network Adapter||$31.99||https://www.amazon.com/Alfa-AWUSO36NH-Wireless-Long-Rang-Network/dp/B0035APGP6/ref=sr_1_1_sspa?keywords=alfa+network+adapter&qid=1553045771&s=gateway&sr=8-1-spons&psc=1|
Design and Solutions
Build the Device
We began by constructing a device to collect MAC addresses. Initially, we hoped that with the RaspberryPi's WiFi capabilities, we could simply use the base hardware for detection. We quickly determined that the RaspberryPi was not capable of entering a monitoring mode; we would need external hardware to serve this purpose. We went through a variety of external network adapters, and ultimately found one with both monitoring mode capabilities and compatibility with the RaspberryPi . Using the Network Adapter's functionality will be explored further in the Collect Information section and in the Network Adapter in Monitoring Mode tutorial.
We decided that a RaspberryPi camera should be added to the device to strengthen the validity of the data gathered from the network adapter. The Pi Camera is fairly simple to connect and the functionality is implemented through Pi commands. By analyzing a combination of the number of addresses collected and the visual busyness found in the picture, more accurate trends over time can be determined.
This is our final project:
In the before mentioned tutorial [Network Adapter Monitoring Mode Tutorial] we established how setup the network adapter in monitoring mode and install kismet, the software we used to utilize monitoring mode. Once properly configured, simply calling kismet spews out text into the console as so:
INFO: 802.11 Wi-Fi device 00:A7:42:FC:6E:03 advertising SSID 'wustl-guest-2.0' INFO: Detected new 802.11 Wi-Fi access point 00:A7:42:FC:6E:00 INFO: 802.11 Wi-Fi device 00:A7:42:FC:6E:00 advertising SSID 'eduroam' INFO: Detected new 802.11 Wi-Fi device 74:B5:87:C6:90:1E INFO: Detected new 802.11 Wi-Fi device 00:08:E3:FF:FD:EC
We could collect data, but we needed a way to be able to consolidate the MAC Addresses, find out what device it belonged to, and upload that information to the database all periodically throughout a gap of time. Over a couple of months, we finalized our data collection design to utilize crontab  , a program used the schedule execution of programs at certain times. Crontab was also utilized to setup emailing the Pi's IP address on boot, as detailed in a class tutorial.  Crontab was used to schedule two tasks: Running kismet and dumping the output into a text file, and running a script to parse through the text file and upload the necessary information.
The crontab usage can be seen below. The first program runs kismet through the timeout modifier  such that it only runs for four minutes. All the contents of the output is written to the the kismetlog.txt
# m h dom mon dow command */5 * * * * /usr/bin/timeout 240 /usr/local/bin/kismet > kismetlog.txt */5 * * * * ./busybear2
The second task to run is an executable file named "busybear2"  whose contents is shown below.
sleep 242s python3 uploader.py
This bash script's only purpose is to wait 4 minutes and 2 seconds (essentially waiting for the kismet task to terminate) before executing the uploader script. The contents of "uploader.py" can be seen below.
# Regular Expression, only gets MAC Addresses after it sees "device" MAC_regex = re.compile(r"(?<=\bdevice\b\s)\w\w[:]\w\w[:]\w\w[:]\w\w[:]\w\w[:]\w\w") # Loop through the lines of the file to find MAC Addresse for line in testFile: MAC_addresses = MAC_regex.findall(line) # Compile all found mac addresses in var MAC_addresses for address in MAC_addresses: # Loop through the individual MAC Addresses req = requests.get(MAC_URL%address) obj = req.json() for key, value in obj.items(): if('company' in value): values = (address,value['company']) else: values = (address,'Null') cursor.execute(qString,values)
The uploader script uses regular expressions    in order to isolate the desired MAC addresses from the text file constructed in the first crontab task. From there, it utilizes a MAC Address Vendor Lookup API  in order to attach a vendor name to a MAC Address. From there, the MAC Address and its associated vendor is uploaded to our mySQL database  whereupon it is automatically assigned a timestamp and unique ID. More on how our database was created and structure in the next section. NOTE: Not all code is shown, full code can be accessed on our GitHub.
Managing a Database
RDS database through AWS named BusyBear. Which we can access through MySQL workbench. From the database, we created multiple tables to story current MAC addresses and historical information. It was important for us that the table storing MAC addresses auto populate timestamps with the current time. As far as externally connecting the database to both the Pi and the website, we found that using a combination of python and PHP was the most effective. It is important to note, however, to keep the database secure we had to escape all incoming data to prevent SQL injection attacks. As of now, our database isn't in 3rd normal form however this is a goal we have to store and access data most efficiently. One hurdle was attempting to store pictures in the database (as LONGBLOB) but we discovered that because a) the complexity caused instability, and b) picture's large size made queries take an exceptional amount of time. It was best to just upload images directly to the website and bypass the database.
Databases and Structures Used
Contained within our uploader.py program (see GitHub for full code), we are uploading ALL MAC Addresses to a single table. From there, we sort this information into more tables. The basic idea can be seen by the strings used to access/alter the mySQL table below:
qString = 'INSERT INTO wifiMAC_BD (macAdd, vendor) VALUES (%s, %s)'
The first qString demonstrates what happens first; we upload every MAC Address, regardless of what it was, into a database named wifiMAC_BD. We can see an example of the entries below. Notice all the MAC Addresses, the timestamp they were inserted, and the vendors that refer to each address. Null simply means our API did not have a registered vendor under that MAC Address.
qString2 = "SELECT count(*) FROM wifiMAC_BD where (timestampe > now() - interval '5' minute) AND ( vendor = 'Apple, Inc.' OR vendor = 'Google, Inc.' Or vendor ='Microsoft' Or vendor ='Samsung Electronics Co.,Ltd' OR vendor ='HUAWEI TECHNOLOGIES CO.,LTD');" qString3 = "INSERT INTO historicalData_BD_Limited (numAddresses,helper) VALUES (%s,%s)"
Next, using qString2, we query the database with all the entries and only record the number of entries from the past 5 minutes (our code runs every 5 minutes) and that have a valid vendor. Valid vendors are those that probably corresponds to devices used by students, like Apple, Samsung, Microsoft, etc. This filters out any noise like routers, printers, and many more random devices at any given time which gives consistency to our design. The next qString inserts this number into a separate table called historicalData_BD_Limited as seen in qString3. This table is used to create our charts located on our website. An example of what this table's entries looks like can be seen below. The important piece is that there is an entry representing the total number of filtered devices at a given 5 minute interval.
Create a Website
We initially imagined that our website would flow like this:
At the moment our plan is to create a website very similar to this, but with direct access to the available location from the home page. Our current website should function in the following:
Our measure of busyness is completely relative. That is, it relies on us having a lot of reliable data. We do so by having many entries of the number of devices in a given location over a long period of time. From there, we can find the maximum and minimum number of devices in the history of the location. This gives us a relative "empty" and "full" which we can then compare the current busyness. Similar thinking can be applied to the historical information (hour periods over a given day) to construct relative charts. Doing so actually gives us a good estimate of how busy Bear's Den is at a given point of time. Full code can be seen in our BusyBear_BD.php file located on our GitHub.
To display live busyness, we went with a pie chart because it gives a really easy to see visualization of how free or busy the location is. A slice of a pie gives a very instantaneous and intuitive sense of how free the location is which gives useful insight on whether to go to the location or not. For historical busyness, we went with a bar graph over hour long times. This gives a very nice visual trend over time. You can see spikes or periods of high busyness that make logical sense.
Below we can see a brief snippet of the start of the code to get the maxes and mins so that you can locate it on our Git. Examples of the charts can also be seen in the Results tab.
$sql2 = "SELECT Max(numAddresses) AS max FROM historicalData_BD_Limited"; $sql3 = "SELECT Min(numAddresses) AS min FROM historicalData_BD_Limited"; $sql4 = "SELECT numAddresses AS current FROM historicalData_BD_Limited where timestampe > now() - interval '5' minute";
If you're viewing this after the Spring semester 2019, we've taken down the website as to not incur charges. You can always recreate the website using our code on GitHub, but below are a few screenshots taken from a phone of the website to give you a better idea how it looks.
Data collection went very well and was generally reflective of busyness. We hit our goals almost exactly, except for trying to extend this project to multiple locations so that people could reliably tell where to go to eat or study. We wanted to capture information for a week in one location at a constant time to construct good historical data and charts for at least one location. Since we were limited to one device, we could not record multiple locations at once. Additionally, since we did not have permission by any university staff and no feasible way to hide the device while taking pictures, we needed to sit with the device. The risk of having the Pi stolen or something going wrong was too high.
For privacy reasons we will not show any of the pictures recorded, but we did self-analyze the approximate busyness every 30 minutes or so while recording data to ensure accuracy. The pictures along with passively intercepting packets fell under an ethical dilemma, but in the end we felt like we were not doing that any other device/person could accomplish. However, most apps are required to tell you what information they are collecting before you download them and there are probably some actual ethical considerations to be made. For the scope of the project, no user data was being used maliciously, and there was basically no way to be able to correlate a MAC Address to a specific person.
Shown below are examples of live data from 5:10pm and 6:45pm on Tuesday, 4/16/19 and an example of our historical data from 5-11pm. As you can see, at 5:10pm it was not as busy, which reflected the real life scenario with only a few booths being occupied with people working. Later at 6:45pm you can see busyness ramp up a lot, as it's peak hour for people coming in and out to eat dinner.
Regarding the historical data, we can see a clear trend surrounding dinner time. BD starts quiet as BD just starts to open and peaks from around 6-7pm before slowing dying down throughout the night. This might vary day to day, but this is the overall trend we observed. Live data is more reflective of the real time busyness, whereas the historical data reflects big trends.
There are several areas where we can improve upon or explore.
- Increasing capacity in which the camera is used: Currently, the pictures captured by the Pi are strictly for reference only. It would be a more robust system if some sort of human detection was implemented and automatically analyzed to refine our busyness level.
- Privacy and security concerns: Obviously nothing strictly illegal was done, however, that does not mean there are no worries. While sniffing packets is done regularly by numerous devices, apps, etc. there is still a concern of proper ethical considerations into tracking and monitoring people. Additionally, pictures being taken and stored could potentially capture damaging information. All of this paired with the fact we did not encrypt any information is certainly a front in which we would strengthen.
- Make a more compact body and design for the actual finished product. Right now the case and network adapter are very clunky to work with. We settled on the first network that worked, but in reality we could probably downgrade the network adapter and Raspberry Pi to be more cost effective and smaller in overall size.
- Expand our system such that it works at multiple locations. This means more devices, more mySQL tables, and more website tabs. In a world where cost isn't a huge issue, the beauty of our design is that we replicate the same process in every location we would want, the only limitation being where/how we can stow away the device itself.
- A minor change to our database structure would probably be ideal. Instead of having to have a separate table for ALL MAC addresses for each location, we could have one big with an added column/entry that was for locationID. This would allow us to query only the main database for each location's data analysis.
Pi Blinking LED (tutorial sake)
nmap (unused in the end)
fping (unused in the end)
openCV (unused in the end)
kismet & monitoring mode (referenced in our tutorial)
- [Python Dictionary] (unused in the end)
- [More Python Dictionary Documentation] (unused in the end)
- [Wireshark OUI lookup] (API used instead)
- Pi Network - 
- Network Adapter - 
- Network Adapter in Monitoring Mode Tutorial - 
- Pi Camera - 
- How to use Crontab - 
- SSHing into your Pi Tutorial - 
- Using timeout with crontab - 
- How to make a file executable - 
- CSE 330 Wiki: Regular Expressions - 
- Regular expressions look-ahead/behind - 
- Online Regular Expression tester - 
- MAC Address Vendor Lookup API - 
- Uploading data to a mySQL database -