25 Dec Strategy of Predicting Repeat Restaurant Bookings
This post responds to a request to share my strategy for the Kaggle in-class competition “Predict Repeat Restaurant Bookings” (https://inclass.kaggle.com/c/predict-repeat-restaurant-bookings). I am still new to Kaggle, and this was my first competition. I see Kaggle as a terrific playground for applying the theory that I am currently learning in my Master of Science studies in Predictive Analytics.
I used R and RStudio throughout the competition and submitted predicted probabilities. The following points cover the major aspects of my approach:
- The winning model was a neural network. I also tried a GLM and Random Forest; the GLM performed worst. The Random Forest models usually returned the best AUC values in my own tests, but on the public leaderboard my best Random Forest models almost always scored a few percentage points lower. I am not quite sure why.
- The final set of predictors consisted of:
- waitingPeriod (constructed predictor: trainData$dateTime - trainData$cDate)
- purpose (the ‘Purpose’ variable contained many different strings with the same or similar meaning in both English and Chinese, so I translated all values into English and normalized the strings to reduce the number of unique purposes)
- priceLow (because priceLow and priceHigh were highly correlated, I removed priceHigh from the set of predictors)
- lat (similarly, because lat and lng were highly correlated, I removed lng from the set of predictors)
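As a rough illustration of two ideas from the predictor list, the following base R sketch builds a waiting-period feature and checks the correlation between the two price columns. The toy data frame and its values are made up; only the column names dateTime, cDate, priceLow and priceHigh come from the competition data. Reading cDate as the time the booking was created, the timestamp format, the choice of days as the unit and the 0.9 cutoff are all assumptions.

```r
# Toy bookings; all values are invented for illustration only.
trainData <- data.frame(
  cDate     = as.POSIXct(c("2014-11-01 10:00:00", "2014-11-03 18:30:00",
                           "2014-11-05 09:15:00", "2014-11-07 12:00:00"),
                         tz = "UTC"),
  dateTime  = as.POSIXct(c("2014-11-02 19:00:00", "2014-11-03 20:30:00",
                           "2014-11-12 19:00:00", "2014-11-08 12:00:00"),
                         tz = "UTC"),
  priceLow  = c(100, 200, 150, 400),
  priceHigh = c(200, 400, 300, 800)
)

# Waiting period between cDate (assumed: booking creation) and the reserved
# time, expressed in days.
trainData$waitingPeriod <- as.numeric(
  difftime(trainData$dateTime, trainData$cDate, units = "days")
)

# Pearson correlation between the two price columns.
priceCor <- cor(trainData$priceLow, trainData$priceHigh)

# Drop priceHigh when the pair is almost collinear (0.9 is an arbitrary cutoff).
if (abs(priceCor) > 0.9) {
  trainData$priceHigh <- NULL
}
```

Using difftime with an explicit unit avoids the surprise that plain subtraction of two date-times picks its own unit (seconds, minutes, hours or days) depending on the size of the differences.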
- Because the target class was imbalanced, I attempted to balance the training data by combining over- and under-sampling with the ROSE package in R:
library(ROSE)
trainData.balanced.ou <- ovun.sample(return90 ~ ., data=trainData, N=nrow(trainData), p=0.5, seed=100, method="both")$data
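To make the idea behind combined over- and under-sampling more concrete, here is a minimal base R sketch on made-up data. It illustrates the general technique only and is not ROSE's actual implementation; the helper name balanceBoth and the toy data are hypothetical.

```r
set.seed(100)

# Toy data: 10% positives in the target column return90 (values invented).
toy <- data.frame(
  x        = rnorm(1000),
  return90 = c(rep(1, 100), rep(0, 900))
)

# Hypothetical helper: draw a sample of size N in which each row belongs to the
# minority class with probability p. Minority rows are drawn with replacement
# (over-sampling), majority rows without replacement (under-sampling).
balanceBoth <- function(data, target, N = nrow(data), p = 0.5) {
  counts   <- table(data[[target]])
  minLabel <- names(counts)[which.min(counts)]
  minIdx   <- which(as.character(data[[target]]) == minLabel)
  majIdx   <- which(as.character(data[[target]]) != minLabel)

  nMin <- rbinom(1, N, p)   # number of minority rows in the new sample
  nMaj <- N - nMin          # remaining rows come from the majority class
  stopifnot(nMaj <= length(majIdx))

  newIdx <- c(majIdx[sample.int(length(majIdx), nMaj)],
              minIdx[sample.int(length(minIdx), nMin, replace = TRUE)])
  data[newIdx, , drop = FALSE]
}

balanced <- balanceBoth(toy, "return90")
prop.table(table(balanced$return90))   # class shares should be close to 0.5 each
```

Because minority rows are repeated, any balancing of this kind should be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.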
- Another obstacle was that a considerable number of restaurants were missing from the Restaurant data set. For these, I replaced the missing values with the median or mode of the existing restaurant observations.
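The imputation step might look roughly like the sketch below. The table and column names (bookings, restaurants, restaurantId, cuisine) are hypothetical and the data are invented. Using the median for numeric columns and the mode for categorical ones is an assumption about how the two statistics were split.

```r
# Invented restaurant table; restaurant 3 is absent on purpose.
restaurants <- data.frame(
  restaurantId = c(1, 2, 4, 5),
  priceLow     = c(100, 200, 150, 400),
  cuisine      = c("Thai", "Italian", "Thai", "Japanese"),
  stringsAsFactors = FALSE
)
bookings <- data.frame(bookingId = 1:5, restaurantId = c(1, 3, 2, 3, 5))

# Left join: bookings for unknown restaurants get NA in the restaurant columns.
joined <- merge(bookings, restaurants, by = "restaurantId", all.x = TRUE)

# Most frequent non-missing value (ties go to the first value in sort order).
modeValue <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Fill NAs with statistics computed from the existing restaurant rows.
for (col in setdiff(names(restaurants), "restaurantId")) {
  missing <- is.na(joined[[col]])
  if (is.numeric(restaurants[[col]])) {
    joined[[col]][missing] <- median(restaurants[[col]], na.rm = TRUE)
  } else {
    joined[[col]][missing] <- modeValue(restaurants[[col]])
  }
}
```

Computing the statistics from the restaurant table rather than from the joined bookings keeps popular restaurants from dominating the imputed values; whether that is preferable depends on the data.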
- Regarding imputation, I think I need to do more research on how to impute values when a data set is missing entire observations. I would be interested to learn how other contestants handled the missing restaurants.
- Regarding the imbalanced target class, my understanding is that this is a common challenge in data mining. While I found a suitable package in ROSE and made it work for my project, I would like a better overview of the available techniques that solve, or at least partly remedy, class imbalance. Again, I would be happy to learn how other Kaggle members handled or would have handled it.
- The “openingHours” attribute contained unstructured text that I think might hold valuable information for further improving the performance of the predictive models. Did anyone attempt to structure this information and measure the attribute’s importance?