Up in the Air{bnb}

About

Founded in 2008, Airbnb has been a blockbuster hit, with guest bookings skyrocketing throughout the 2010s as an alternative to traditional hotels. As the company prepares to go public this year, we wanted to focus on the Austin market for Airbnb rentals, as local subject matter experts (we live here, after all!), to determine what the unique "characteristics" of Austin Airbnbs are. Here, we analyze and present an open dataset from Inside Airbnb that includes each listing's cost per night, zip code, ratings, availability, and more.

Using Python, Tableau, AWS, Databricks, the John Snow Labs Spark NLP library, and JavaScript libraries, we aim to show what makes Austin Airbnbs not only unique, but weird.

Fig.1 - Airbnb's Massive Growth from 2009-2019. Source: Statista infographic, "New Year's Peak Illustrates Airbnb's Growing Stature"

Data Insights

by Hazel & Hayley

These are basic charts and plots we made to investigate particular details of the Austin Airbnb listings, such as property and room types by region, average fees, average number of guests, and minimum nights.
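As a rough illustration of the kind of grouping behind these charts, here is a minimal stdlib-only sketch. The listing records, field names, and values below are hypothetical stand-ins, loosely mirroring the Inside Airbnb schema, not rows from the actual dataset:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample of listing records (fields loosely mirror Inside Airbnb)
listings = [
    {"zipcode": "78704", "room_type": "Entire home/apt", "cleaning_fee": 85.0, "accommodates": 6},
    {"zipcode": "78704", "room_type": "Private room",    "cleaning_fee": 30.0, "accommodates": 2},
    {"zipcode": "78702", "room_type": "Entire home/apt", "cleaning_fee": 60.0, "accommodates": 4},
]

# Group listings by region (zip code) ...
by_zip = defaultdict(list)
for row in listings:
    by_zip[row["zipcode"]].append(row)

# ... then compute the per-region averages the charts display
averages = {
    z: {
        "avg_fee": mean(r["cleaning_fee"] for r in rows),
        "avg_guests": mean(r["accommodates"] for r in rows),
    }
    for z, rows in by_zip.items()
}
```

The same group-then-aggregate shape applies to property types and minimum nights; only the grouped field changes.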






Below is a map made in Tableau showing the distribution of listing price levels across the Austin area. A tooltip shows metadata about each listing along with a link to the listing on the Airbnb site.


Sentiment Map

by Chris

The review data for the listings was loaded onto an AWS Postgres server, and John Snow Labs' Spark NLP library was used to assess the general sentiment ("positive" or "negative") of each review on a sentence-by-sentence level. After each review was analyzed, the sentence sentiments were aggregated in various ways to assign the review an "overall" sentiment. Each aggregation could produce wildly different results, so each was compared against the listing's overall star rating as a measure of accuracy.

Note that Airbnb reviews are overwhelmingly positive (the average rating in the dataset was about 4.8 stars!). Each aggregation's results are as follows:

As you can see, simply counting positive vs. negative sentences produced the worst accuracy. Some tweaking could still be done (certain characters were interpreted oddly, so the text cleaning could be improved), but from what I could see, the John Snow Spark NLP library tends to rate sentences as negative with low confidence much of the time. Taking the single most confident sentence, as scored by the library, to represent the overall review gave the best accuracy.
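Two of the aggregation strategies described above can be sketched in plain Python, assuming the sentence-level results (a label plus a confidence score) have already been produced by the NLP step. The data shapes and values here are illustrative, not the library's actual output format:

```python
# Each review is a list of (sentiment, confidence) pairs, one per sentence,
# as produced by a sentence-level sentiment model (values here are made up).
review = [
    ("positive", 0.95),
    ("negative", 0.55),
    ("positive", 0.80),
]

def by_majority(sentences):
    """Overall sentiment = whichever label has more sentences (ties -> positive)."""
    pos = sum(1 for label, _ in sentences if label == "positive")
    neg = len(sentences) - pos
    return "positive" if pos >= neg else "negative"

def by_most_confident(sentences):
    """Overall sentiment = label of the single most confident sentence."""
    label, _ = max(sentences, key=lambda s: s[1])
    return label
```

Note how the most-confident rule sidesteps the low-confidence negative sentences entirely, which is consistent with it tracking the star ratings best.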


Fig.2 - Sentiment Map of Austin Airbnbs ("positive" vs "negative")

Word Clouds

by Chris

The text of the reviews was isolated and cleaned, with many common words and stopwords ("a", "the", "and", etc.) removed. We were interested in which words were most uniquely popular in the reviews for each zip code. In our first pass, the word frequencies seemed very homogeneous, with most reviews sharing the same few key words ("clean" or "host", for example), which was not very helpful in determining what made each area unique. After a second round of cleaning to remove these ubiquitous words, we arrived at the final dashboard below. The word clouds and bar charts can be a useful indicator of what makes each zip code a unique place to stay.
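Our approach was simply to filter out the ubiquitous words, but the underlying idea of "uniquely popular" can also be sketched as a relative-frequency score: how much more often a word appears in one zip code's reviews than in the corpus as a whole. The token lists below are entirely made up for illustration:

```python
from collections import Counter

# Hypothetical cleaned token lists per zip code (not real review data)
tokens_by_zip = {
    "78704": ["tacos", "walkable", "host", "clean", "tacos"],
    "78731": ["quiet", "host", "clean", "quiet", "lake"],
}

# Corpus-wide word counts across all zip codes
overall = Counter(t for toks in tokens_by_zip.values() for t in toks)
total = sum(overall.values())

def uniqueness(zipcode):
    """Score words by local share divided by corpus-wide share.

    A score of 1.0 means the word is no more common here than anywhere
    else ("clean", "host"); higher scores flag locally distinctive words.
    """
    local = Counter(tokens_by_zip[zipcode])
    n = sum(local.values())
    return {w: (c / n) / (overall[w] / total) for w, c in local.items()}
```

Words like "host" that appear everywhere score 1.0 and can be dropped, which is the same effect our second cleaning pass achieved by hand.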

The pipeline for the John Snow Spark NLP library was much like the one for the sentiment map, except that this time it was set up to tokenize sentences, remove stopwords, and find both the most common unigrams (single words) and bigrams (word pairs) in the text. Bigrams were included because people tend to describe things in an adjective-noun pattern (such as "clean house" or "great host").
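Outside of Spark, the same tokenize → remove stopwords → count n-grams steps can be sketched with the standard library alone. The stopword list here is a tiny illustrative subset, and the sample text is made up:

```python
import re
from collections import Counter

STOPWORDS = {"a", "the", "and", "was", "is", "to", "in"}  # tiny illustrative subset

def top_ngrams(text, n=1, k=3):
    """Lowercase, tokenize, drop stopwords, and count the k most common n-grams."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams).most_common(k)

reviews = "The host was great and the house was clean. Clean house, great host!"
```

Calling `top_ngrams(reviews, n=2)` surfaces exactly the adjective-noun pairings ("clean house", "great host") that motivated including bigrams.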


Note: Use the map to zoom and pan across zip codes in the Austin area, but a zip code must be selected in the dropdown menu to change the plots.
Fig.3 - Unigrams Word Cloud

Fig.4 - Bigrams Word Cloud
Fig.5 - Top 10 Unigrams

Fig.6 - Top 10 Bigrams

Review Writer AI

by Nathan

A dataframe of all the reviews was also created, with each review linked to its listing and that property's average rating. The reviews were then separated into three buckets by rating:

0 stars - 2 stars ("bad")
2.1 stars - 3.9 stars ("ok")
4 stars - 5 stars ("good")
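The bucketing above amounts to a simple threshold function, with cutoffs exactly as listed:

```python
def rating_bucket(stars):
    """Map an average star rating onto the three training buckets."""
    if stars <= 2.0:
        return "bad"       # 0 - 2 stars
    elif stars < 4.0:
        return "ok"        # 2.1 - 3.9 stars
    return "good"          # 4 - 5 stars
```

Given the ~4.8-star average in the dataset, almost every review lands in the "good" bucket, which is why that set dwarfed the other two.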

Taking all the good ratings, we trained an AI model to write reviews based on our data, using textgenrnn on top of TensorFlow and Keras. Unfortunately, we only had time to train the "good ratings" model: there were 69 million characters to train on, and by the time we got to the "bad ratings" set, we realized it contained roughly 1,000x less data and would need many more training runs.


What follows are snippets written by the "good ratings" AI at increasing temperature settings (aka writing creativity). Lower temperature values mean the AI follows its training data more closely.
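Temperature works by rescaling the model's next-character distribution before sampling. Here is a stdlib-only sketch of the standard softmax-with-temperature formulation (the probabilities below are made up, not the model's actual output):

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: low T sharpens it, high T flattens it."""
    logits = [math.log(p) for p in probs]       # back to log space
    scaled = [l / temperature for l in logits]  # divide by temperature
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]            # renormalize (softmax)

probs = [0.7, 0.2, 0.1]  # hypothetical next-character distribution
cold = apply_temperature(probs, 0.2)  # near-greedy: top choice dominates
hot = apply_temperature(probs, 1.0)   # unchanged: sample from the model as-is
```

At 0.2 the top choice is sampled almost every time, which is why that snippet loops on "spacious and clean"; at 1.0 the raw distribution is used, so rare (and often nonsensical) character sequences slip through.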

#################### Temperature: 0.2 ####################
The house is spacious and clean. The bed was super comfortable and the place is a great place to stay in Austin. The house is spacious and clean. The house was clean and cozy. The house is in a great location and very clean. We were able to walk


#################### Temperature: 0.5 ####################
ievine and the space was awesome! The house is in a great location for exploring Austin and want to stay in Austin and we would definitely stay here again. My boyfriend and I had a wonderful stay at the house. It's also a bit of a place to stay in a great location and close to everything in Austin


#################### Temperature: 1.0 ####################
My husband and I attended the friend and he went out of his way to visit on shopping to all the Austin farex - clean, company and had good cassy giant in the eventing (and all the genek in the covering area I was really high & again when I get to the same channel trafficility was residential and qui

Future Analysis Directions


There are many future directions that we can take our analysis (especially if we had more data!):