Creating a Processing Pipeline: AirBnB Data

As I get deeper into the world of data science, I find that a huge amount of the work comes from data cleaning and investigation. Although often fun, it becomes rather irritating when you have to repeat it, especially when new data arrives that you need to either add to your model or validate against.

This is an example of setting up a pipeline using AirBnB data for properties in Sydney. In the initial section I will run through some of the data exploration and cleaning, followed by the development of the pipeline. The pipeline does all the necessary data cleaning, so when new data is added a single function call makes everything model-ready. There is no modelling done here, but if you are interested in what I would do, don't hesitate to reach out.

The data comes from the lecture set, which can be found at https://github.com/Finance-781/FinML/tree/master/Lecture%202%20-%20End-to-End%20ML%20Project%20/Practice/datasets

There are a huge number of features in the data set initially, of which we will only deal with a few. The full set is listed first, followed by the reduced set that we keep.

'id', 'listing_url', 'name', 'summary', 'space', 'description',
'neighborhood_overview', 'notes', 'transit', 'access', 'interaction',
'house_rules', 'picture_url', 'host_id', 'host_url', 'host_name',
'host_since', 'host_location', 'host_about', 'host_response_time',
'host_response_rate', 'host_is_superhost', 'host_thumbnail_url',
'host_picture_url', 'host_neighbourhood', 'host_listings_count',
'host_total_listings_count', 'host_verifications',
'host_has_profile_pic', 'host_identity_verified', 'street',
'neighbourhood', 'neighbourhood_cleansed',
'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
'smart_location', 'country_code', 'country', 'latitude', 'longitude',
'is_location_exact', 'property_type', 'room_type', 'accommodates',
'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
'price', 'weekly_price', 'monthly_price', 'security_deposit',
'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
'maximum_nights', 'calendar_updated', 'has_availability',
'availability_30', 'availability_60', 'availability_90',
'availability_365', 'number_of_reviews', 'first_review', 'last_review',
'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'instant_bookable', 'cancellation_policy',
'require_guest_profile_picture', 'require_guest_phone_verification',
'calculated_host_listings_count', 'reviews_per_month'

'price', 'city', 'longitude', 'latitude', 'review_scores_rating',
'number_of_reviews', 'minimum_nights', 'security_deposit',
'cleaning_fee', 'accommodates', 'bathrooms', 'bedrooms', 'beds',
'property_type', 'room_type', 'availability_365',
'host_identity_verified', 'host_is_superhost', 'host_since',
'cancellation_policy'
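Trimming the raw frame down to that reduced set is a one-liner. A minimal sketch follows; the file name sydney_airbnb.csv and the DataFrame name df are my assumptions, not from the original post.

import pandas as pd

# Hypothetical: read the raw listings and keep only the reduced feature set.
df = pd.read_csv("sydney_airbnb.csv")  # path is an assumption

keep_cols = [
    "price", "city", "longitude", "latitude", "review_scores_rating",
    "number_of_reviews", "minimum_nights", "security_deposit",
    "cleaning_fee", "accommodates", "bathrooms", "bedrooms", "beds",
    "property_type", "room_type", "availability_365",
    "host_identity_verified", "host_is_superhost", "host_since",
    "cancellation_policy",
]
df = df[keep_cols]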

Whilst investigating the data I looked into measures of skewness and kurtosis. Skewness is usually described as a measure of a dataset's symmetry, or lack of symmetry. A perfectly symmetrical data set has a skewness of 0; the normal distribution, for example, has a skewness of 0.

Kurtosis was originally thought to be a measure of the "peakedness" of a distribution. However, since the central portion of the distribution is virtually ignored by this parameter, kurtosis cannot be said to measure peakedness directly. While there is a correlation between peakedness and kurtosis, the relationship is an indirect and imperfect one at best. So kurtosis is really about the tails of the distribution, not the peakedness or flatness: it measures the tail-heaviness of the distribution.

Initially, price had a skewness of 13.8 and a kurtosis of 413.43. Keeping only data below the 99.5th percentile, the skewness drops to 2.657 and the kurtosis to 11.187, which is a radical difference. It is clear from this that getting rid of those extreme outliers can be useful in making our data more symmetric and thus easier to manage in our statistical models later.
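A minimal sketch of that check, assuming the df frame from the sketch above and that price has already been converted to a numeric column; note that pandas' .kurt() reports excess kurtosis (Fisher's definition).

# Skewness and kurtosis of price before trimming
print("skew:", df["price"].skew(), "kurtosis:", df["price"].kurt())

# Keep only prices below the 99.5th percentile and re-check
cutoff = df["price"].quantile(0.995)
df = df[df["price"] < cutoff]
print("skew:", df["price"].skew(), "kurtosis:", df["price"].kurt())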

Histogram plots for the different features demonstrate nicely the right skewness of the data.

Right skewness of the numeric features

Another nice way of cutting outliers for geolocated data is to look at the spread through longitude and latitude. The following longitude and latitude visualisations help to identify areas of dense property. On the left we have all the properties. I then removed locations with longitude under 151.16, latitude over -33.75 and prices over £600, which is shown on the right. The size of the circles is based on the number of reviews, to give a true sense of the density. A sketch of this filter follows the figures.

All properties included here

Properties within the specific bounds and under £600 a night
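A minimal sketch of that filter, keeping the thresholds described above (again assuming the df frame):

# Keep only the dense area and drop extreme nightly prices
df = df[(df["longitude"] >= 151.16) &
        (df["latitude"] <= -33.75) &
        (df["price"] <= 600)]
print(len(df))  # check the remaining row count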

We are now down from 27,000 rows of data to 6,230 rows. Still a sizeable amount.

Some new features are created (a quick sketch of these follows the list):

  • Bedrooms per person

  • Bathrooms per person

  • Host since

  • Days on AirBnB
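The first two are also built later inside the pipeline; a quick sketch of all four, assuming host_since is a string date in the raw data (the snapshot date is my assumption):

# Ratio features
df["bedrooms_per_person"] = df["bedrooms"] / df["accommodates"]
df["bathrooms_per_person"] = df["bathrooms"] / df["accommodates"]

# Host tenure features: parse host_since and count days on AirBnB
df["host_since"] = pd.to_datetime(df["host_since"])
snapshot = pd.Timestamp("2019-01-01")  # hypothetical snapshot date
df["days_on_airbnb"] = (snapshot - df["host_since"]).dt.days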

Pre-Processing

Now to the pre-processing. The aim here is to make it as generic as possible so that it can clean new data with ease in the future.

  1. First thing to note is that the review_scores_rating contains NaNs. So let’s replace any NaN in the data with the median. 

  2. Select only rows where the host is a Superhost and the host identity is verified (both == 't'); see the sketch after this list.

  3. Encode the categorical data, most naturally with a one-hot encoder.
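A minimal sketch of step 2; steps 1 and 3 are handled inside the pipeline below. The 't'/'f' flag values come from the raw AirBnB export.

# Keep only verified superhosts (the raw data stores these flags as 't'/'f')
df = df[(df["host_is_superhost"] == "t") &
        (df["host_identity_verified"] == "t")]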

Preprocessing set up:

First the CombinedAttributesAdder class is set up. It simply adds new features to the raw data.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# dtypes used later to pick out the numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

# Receive a numpy array, convert to pandas to build the features,
# convert back to an array for output.

# This initial class generates new attributes such as bedrooms
# per person and bathrooms per person.

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, popularity=True, num_cols=None):
        self.popularity = popularity
        self.num_cols = num_cols  # names of the numeric columns fed in

    # For these pipeline classes we always need a fit and a
    # transform method.

    def fit(self, X, y=None):
        return self  # we are not doing any fitting in this class

    def transform(self, X, y=None):
        ### Some feature engineering
        X = pd.DataFrame(X, columns=self.num_cols)
        X["bedrooms_per_person"] = X["bedrooms"] / X["accommodates"]
        X["bathrooms_per_person"] = X["bathrooms"] / X["accommodates"]

        # I hate this global variable.
        # Have not spent time thinking of a clever way around it.
        global feats
        feats = ["bedrooms_per_person", "bathrooms_per_person"]

        if self.popularity:
            X["past_and_future_popularity"] = (
                X["number_of_reviews"] / (X["availability_365"] + 1))
            feats.append("past_and_future_popularity")
        return X.values
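To see what the transformer produces on its own, here is a quick standalone check; the toy numbers are made up.

import numpy as np

# Toy check: 5 numeric columns in, 5 + 3 engineered columns out
toy_cols = ["bedrooms", "bathrooms", "accommodates",
            "number_of_reviews", "availability_365"]
toy = np.array([[2, 1, 4, 10, 100],
                [1, 1, 2, 50, 200]])

adder = CombinedAttributesAdder(num_cols=toy_cols, popularity=True)
print(adder.transform(toy).shape)  # (2, 8)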

Here we begin the pipeline. We split into test and training data (a sketch of how the stratified split might look is below). The first part of the pipeline uses the CombinedAttributesAdder, as well as a SimpleImputer to replace NaNs with the median and a StandardScaler to scale the numeric values.
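The post uses strat_train_set without showing how it was built. A minimal sketch of one way to produce it, assuming the split is stratified on binned price (the binning and bin count are my assumptions):

from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical: stratify the train/test split on binned price so both
# sets see the full price range.
df["price_cat"] = pd.qcut(df["price"], q=5, labels=False, duplicates="drop")

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(df, df["price_cat"]):
    strat_train_set = df.iloc[train_idx].drop("price_cat", axis=1)
    strat_test_set = df.iloc[test_idx].drop("price_cat", axis=1)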

# Create the initial part of the pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

strat_train_set2 = strat_train_set.dropna()
X = strat_train_set2.copy().drop("price",axis=1)
Y = strat_train_set2["price"]

num_cols = list(X.select_dtypes(include=numerics).columns)
cat_cols = list(X.select_dtypes(include=[object]).columns)

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder(num_cols=num_cols,
                                                  popularity=True)),
        ('std_scaler', StandardScaler()),
    ])

This is the second part of our pipeline, where a ColumnTransformer applies the num_pipeline to the numeric columns and one-hot encodes the categorical columns.

# The middle section of the pipeline
# This one-hot encodes the cat cols

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import itertools


mid_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_cols),
        ("cat", OneHotEncoder(),cat_cols ),
    ])

The class ToPandasDF moves the numpy array back to a pandas dataframe. I am not happy with the class as it requires num_cols and one_cols to come from outside it. Perhaps this is the norm. In the case where you only call the pipeline function you are fine, I suppose, as you call everything chronologically anyway.
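The post never shows where one_cols (the names of the one-hot encoded columns) comes from, so the following is a hypothetical sketch of one way to build it up front: fit an encoder on the categorical columns and name each dummy column "<column>_<category>".

# Hypothetical: build the dummy column names so ToPandasDF can use them.
# The ordering matches how OneHotEncoder lays out its output columns.
_enc = OneHotEncoder().fit(X[cat_cols])
one_cols = ["{}_{}".format(col, cat)
            for col, cats in zip(cat_cols, _enc.categories_)
            for cat in cats]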

The pipe function simply assembles and returns the full pipeline.

class ToPandasDF(BaseEstimator, TransformerMixin):
    def __init__(self, fit_index=[]):
        self.fit_index = fit_index  # row index to attach to the output frame

    def fit(self, X_df, y=None):
        return self  # nothing to fit

    def transform(self, X_df, y=None):
        # Rebuild the column names: numeric columns, then the engineered
        # features, then the one-hot encoded columns.
        global cols
        cols = num_cols.copy()
        cols.extend(feats)
        cols.extend(one_cols)  # one-hot column names in place of cat_cols
        X_df = pd.DataFrame(X_df, columns=cols,
                            index=self.fit_index)

        return X_df

def pipe(inds):
    return Pipeline([
            ("mid", mid_pipeline),
            ("PD", ToPandasDF(inds)),
        ])
    
params = {"inds" : list(X.index)}

X_pr = pipe(**params).fit_transform(X) 
# Now we have done all the preprocessing in one call instead of step by step.
# The pipeline becomes extremely handy in
# the cross-validation step.

If new data is added that you want to validate with, you can simply run it through the pipeline. This was an exercise more than anything, but something cool to learn.
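For example, a hedged sketch assuming new listings arrive in a DataFrame called X_new with the same raw columns as X (X_new is hypothetical):

# Run a new batch of listings through the same preprocessing.
# Because ToPandasDF stores the row index at construction time, the
# simplest approach with this design is to build the pipe per batch.
X_new_pr = pipe(list(X_new.index)).fit_transform(X_new)

Note that fit_transform refits the imputer, scaler and encoder on the new batch; to reuse the statistics learned from the training data you would instead keep hold of the fitted pipeline and call transform on it.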

Kevin Synnott