The Anatomy Of A Large-Scale CTR* Prediction System
* With little or no modification, the proposed system design and algorithms can be used to optimize other metrics such as Cost Per Viewable Completion, Cost Per Completion, Cost Per Engagement, Cost, etc.
Aim: To build a scalable, distributed machine learning framework based on a stacked classifier comprising Gradient Boosting Machines and Logistic Regression trained with online Stochastic Gradient Descent. The system is specifically designed to achieve high performance for display advertising campaigns with a varied range of targeting criteria. The final aim of the prediction system is not restricted to achieving high predictive performance; it also extends to ease of deployment, frequent model updates in an online fashion, high scalability and distributed learning.
A Glimpse Into The Modeling Aspects:
Click and conversion prediction for display advertising presents its own set of challenges:
- Sparse Features: Categorical features like domain, publisher, advertiser ID, creative ID, cookie ID, etc., when converted to a binary representation with One Hot Encoding, lead to very sparse feature vectors.
- Biased Training Data: The number of click instances is typically far smaller (often less than 0.1%) than the number of negative training instances. This class imbalance further complicates the training of a conventional classifier.
- Feature Generation: Usually we generate different permutations of features, e.g. quadratic features and combinations of features, to help the classifier form a non-linear decision boundary. Although this increases the performance of the system, it leads to an explosion in the dimensionality of the features and makes the model harder to train (a toy sketch of this blow-up follows this list).
- Tuning the Log Linear Model: L1 vs L2 regularization (conventional wisdom suggests going with L1 since it encourages sparsity).
- Tuning Stochastic Gradient Descent: Finding optimal learning rates for the weight updates.
- Subsampling: Finding the optimal sampling rate for the negative class to get a balanced data set.
- Periodic Updates to the Model: The Logistic Regression model will be trained using online stochastic gradient descent. This makes training fast and scalable in a distributed manner, though we will still need to retrain the model every few hours to incorporate new data points.
- Gradient Boosting Machines: I will be training GBMs separately to generate features that will be consumed by the log linear model. Since training GBMs takes comparably much more time (I will have to test training time versus the number of trees and training data size for GBMs on the Spark cluster), we will have to schedule a daily job to update the GBM.
- Dealing with New Ads/Creatives: We will have to build a separate, generalized model to handle new advertisers/creatives. Though the performance of this model will not be up to the benchmark, it will still help us exploit historical patterns and collect better training data for the new ad campaign.
- Generating Historical Features: We will be adding historical performance of advertiser and user as additional features to our feature vector.
- Mobile Advertisement: Deciding whether to build a separate model for mobile/app-based ads, since the feature set will involve many new elements, e.g. device id, device type, app, etc.
- Feature Hashing: Reducing the dimensionality of the generated feature vector.
- Explore and Exploit: How to sometimes show "wrong" ads to "wrong" users with the sole intention of exploring new users, finding new patterns and discovering covariate shift. If we only show ads according to our system, then the next day we will get a highly biased training sample.
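To make the sparse-features and feature-generation points above concrete, here is a toy Python sketch; the field names and cardinalities are my own assumptions, not production values:

# A toy sketch (cardinalities are assumptions) of how one-hot encoding plus
# quadratic cross-features blows up dimensionality.
from itertools import combinations
from sklearn.feature_extraction import DictVectorizer

# One impression represented as raw categorical fields (values are made up)
impression = {"domain": "news.example.com", "advertiser_id": "adv_42",
              "creative_id": "cr_7", "device_type": "mobile"}

# Add quadratic (pairwise) cross-features before encoding
crossed = dict(impression)
for (k1, v1), (k2, v2) in combinations(impression.items(), 2):
    crossed[f"{k1}_x_{k2}"] = f"{v1}|{v2}"

vec = DictVectorizer()                 # one-hot encodes every field=value pair
X = vec.fit_transform([crossed])       # scipy sparse matrix: almost all zeros

# With realistic cardinalities the quadratic terms dominate the dimensionality
cards = {"domain": 1e5, "advertiser_id": 1e4, "creative_id": 1e5, "cookie_id": 1e8}
base_dim = sum(cards.values())
quad_dim = sum(c1 * c2 for c1, c2 in combinations(cards.values(), 2))
print(f"one-hot dims ~ {base_dim:.0e}, with quadratic crosses ~ {quad_dim:.0e}")

Even with these modest assumed cardinalities the quadratic crosses push the dimensionality into the 10^13 range, which is exactly why subsampling, hashing and regularization appear in the list above.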
Advanced Non-Linear Approaches
I will avoid using fancy ensemble techniques (the winners of the Avazu and Criteo CTR prediction challenges used 20-model ensembles and many other hacks) and algorithms like Field-aware Factorization Machines. These have been shown to perform slightly better (improving the performance metric in the fourth decimal place) in Kaggle competitions, but they lead to grave overfitting and are not at all scalable in production. We need a mechanism that can be easily trained on a cluster and can be efficiently updated within a few hours. After going through more than a dozen research papers and other literature, I narrowed the final model down to an online-gradient-descent-based log linear model with feature generation via gradient boosted machines. With appropriate computing power (a Spark cluster) we should be able to train the model every few hours and hence keep exploring new data. Many people have used Vowpal Wabbit to train log linear models in Kaggle competitions. Though it takes comparatively little time when using Stochastic Gradient Descent, it took them days to generate basic features and do other data munging. Hence I plan to go with Spark, which claims to provide lightning-fast computation due to its in-memory computation framework.
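To make the chosen stack concrete, here is a hedged single-machine sketch in Python, with scikit-learn standing in for the Spark pipeline; the data set, hashing size and hyperparameters are illustrative assumptions rather than the production configuration:

# A hedged sketch of the chosen stack: GBM-generated leaf features feeding an
# online SGD logistic regression. Toy data and toy hyperparameters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Toy stand-in for one-hot encoded impressions with ~3% positives
X, y = make_classification(n_samples=20000, n_features=50,
                           weights=[0.97, 0.03], random_state=0)

# 1. Daily job: train the GBM that will generate features
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbm.fit(X, y)

# 2. Each tree becomes a categorical feature whose value is the leaf index the
#    instance falls into; hash the (tree, leaf) pairs into a fixed-size space
leaves = gbm.apply(X)[:, :, 0].astype(int)          # shape (n_samples, n_trees)
hasher = FeatureHasher(n_features=2 ** 18, input_type="string")
leaf_features = hasher.transform(
    [[f"tree{t}_leaf{l}" for t, l in enumerate(row)] for row in leaves]
)

# 3. Hourly job: logistic regression trained with online SGD on leaf features;
#    partial_fit lets us keep updating the model as new hourly batches arrive
lr = SGDClassifier(loss="log_loss", penalty="l1", alpha=1e-6)  # "log" in older sklearn
lr.partial_fit(leaf_features, y, classes=np.array([0, 1]))
predicted_ctr = lr.predict_proba(leaf_features)[:, 1]

The same structure maps onto the batch jobs described later: the GBM part is retrained daily, while the SGD logistic regression is updated hourly through further partial_fit-style online passes.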
One interesting approach mentioned in the ad click prediction literature to solve the problem of exploration is Bayesian Logistic Regression. The problem of exploration can be explained with the following example: imagine that we trained our CTR prediction model using two days of training data. On the third day we used the trained model to serve ads only to requests with a high predicted CTR. The third day's training data will then only contain instances that had a high predicted CTR, i.e. we will be getting a narrow/biased training sample. If new segments are encountered, or segment behavior changes, our model will not have the training data necessary to adapt to that change. So we need a procedure to gradually explore along with exploiting high performing segments.
When it comes to the "Explore and Exploit" paradigm I have a strong proclivity towards Bayesian methods. Bayesian Logistic Regression places a prior distribution on the weights of Logistic Regression and gradually updates the weights using the incoming data stream. This method is highly suited to online stream learning. Finally, we draw a sample (using Thompson Sampling) from the weights' posterior distributions and use these sampled weights to predict the performance of a segment. I have used Bayesian-bandit-based Thompson Sampling techniques to design an RTB optimal bid price detection algorithm, and hence I am well acquainted with the underlying statistics.
Graepel et al. [3] give a detailed account of how this method is nearly equivalent to Logistic Regression trained with Stochastic Gradient Descent:
- With a Gaussian prior on the weights, Bayesian Logistic Regression is equivalent to Logistic Regression trained with Stochastic Gradient Descent with L2 regularization.
- With a Laplacian prior on the weights, Bayesian Logistic Regression is equivalent to Logistic Regression trained with Stochastic Gradient Descent with L1 regularization.
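As a rough illustration of the explore-and-exploit mechanics, the following toy sketch keeps a diagonal Gaussian posterior over the weights and uses Thompson Sampling to score segments; it is a simplification of [3], with dimensions and update rule chosen purely for illustration:

# A toy sketch of Thompson Sampling with Bayesian Logistic Regression: keep a
# (here assumed diagonal) Gaussian posterior over the weights, sample one
# weight vector per request, and score the segment with the sampled weights.
# This is a simplification of the scheme in [3], not a faithful implementation.
import numpy as np

rng = np.random.default_rng(0)
d = 1000                      # hashed feature dimension (assumption)
mu = np.zeros(d)              # posterior means, initialized at the prior
sigma2 = np.ones(d)           # posterior variances (Gaussian prior ~ L2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_ctr_thompson(x):
    """Score a request with weights sampled from the posterior (exploration)."""
    w = rng.normal(mu, np.sqrt(sigma2))
    return sigmoid(x @ w)

def update(x, y, lr=0.1):
    """Crude online update of the posterior mean via the logistic gradient; a
    full treatment would also shrink the per-weight variances as in [3]."""
    p = sigmoid(x @ mu)
    mu[:] = mu + lr * (y - p) * x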
Note:
In the initial version I will not be using Bayesian Logistic Regression (BLR). Instead I will go with Logistic Regression (LR) trained with Stochastic/Online Gradient Descent (SGD). Research has shown that LR trained with SGD inherently leads to some exploration: the gradient does not converge to the optimal weights instantaneously (convergence takes time), and anomalies in the data cause the gradient to deviate fairly often, which makes the algorithm inherently suited to exploration. So sometimes we will predict suboptimally, i.e. after hourly updates our algorithm will predict inaccurate CTRs for some segments, but this can act as a blessing in disguise since it lets us explore more unseen segments, which will become part of the training data set in the next training iteration.
Research [1,3,4] has shown that these algorithms perform on nearly the same level. In later versions I would definitely love to try BLR, as it has a more robust update rule.
Model Training And Tuning Procedure
This will be the most critical and time-consuming step. Once it is done, we will simply hard-code the various tuned model parameters and generated features into the model generation job. I have a few ideas for generating more useful features, like forming clusters of existing users with respect to their behavior, forming clusters of different advertiser IDs, and including the respective cluster IDs in the feature vector. This will act as a hierarchical feature based on historical behavioral data and may lead to increased performance. All of this experimentation will, however, be constrained by the project timeline.
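A hedged sketch of the cluster-id idea, with a made-up behavior matrix and cluster count:

# Cluster users on historical behaviour aggregates and add the cluster id as
# an extra categorical feature. The behaviour matrix and the number of
# clusters are assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Rows = users, columns = historical behaviour aggregates (CTR, visit counts, ...)
user_behaviour = np.random.rand(10000, 12)

kmeans = KMeans(n_clusters=50, n_init=10, random_state=0)
user_cluster_id = kmeans.fit_predict(user_behaviour)   # one cluster id per user

# Each id then becomes another categorical feature, e.g. "user_cluster=17",
# one-hot encoded / hashed along with the rest of the feature vector.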
For modeling I will be following the main findings and steps described in the four landmark papers on CTR prediction by Facebook, Google, Microsoft and Criteo, together with the results and findings from the two Kaggle competitions on CTR prediction hosted by Criteo and Avazu.
REFERENCES:
1. Xinran He et al., "Practical Lessons from Predicting Clicks on Ads at Facebook", ADKDD'14.
2. H. B. McMahan et al., "Ad Click Prediction: a View from the Trenches", KDD'13.
3. T. Graepel et al., "Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine", Proceedings of the 27th International Conference on Machine Learning (ICML).
4. O. Chapelle et al., "Simple and Scalable Response Prediction for Display Advertising", ACM Transactions on Intelligent Systems and Technology (ISSN 2157-6904).
5. http://mlwave.com/predicting-click-through-rates-with-online-machine-learning/
6. https://github.com/songgc/display-advertising-challenge
7. https://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/10322/beat-the-benchmark-with-less-then-200mb-of-memory/53674
Main Modeling Steps:
- Feature Generation: Try quadratic, cubic and other combinations of features; check the performance and use the best combination. Continuous features can be binned, treating the bin index as a feature.
- Historical Features: Generate the following historical features to see if they improve performance:
a. Counting Features:
i. Device_IP count
ii. device id count
iii. hourly user count
iv. user count
v. hourly impression count
vi. #impressions to the user in a given day
vii. #impressions to user per pub or app id
viii. time interval since last visit
ix. #impressions to user per pub or app id in last hour
x. features that occurred less than 10 times will be converted to a "rare" category
b. Bag Features
i. user, pubid or app id, bag of app ids
c. Click History:
i. #clicks (over all) by a user
ii. #clicks (over all) per pub/app id
- Feature Hashing:
The issue with the dummy coding presented above is that the dimensionality d can get very large when there are variables of high cardinality. The idea is to use a hash function to reduce the number of values a feature can take. We still make use of the dummy coding described in the previous section, but instead of a c-dimensional code, we end up with a d-dimensional one, where d is the number of bins used with hashing. If d < c, this results in a compressed representation. In that case, collisions are bound to occur, but as explained later, this is not a major concern.
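A minimal sketch of this idea for a single high-cardinality feature (the hash function and bin count d are my assumptions):

# Hash one high-cardinality categorical feature into d bins.
import hashlib

d = 2 ** 20                                  # number of bins; d < c compresses the code

def hashed_index(feature_name, value, d=d):
    """Index of the single non-zero entry in the d-dimensional one-hot code."""
    h = hashlib.md5(f"{feature_name}={value}".encode("utf-8")).hexdigest()
    return int(h, 16) % d

idx = hashed_index("publisher_id", "pub_23")  # collisions can occur, but are rare for large d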
When dealing with several features, there are two possible strategies:
(1) Hash each feature f into a d_f-dimensional space and concatenate the codes, resulting in sum_f d_f dimensions.
(2) Hash all features into the same space; a different hash function is used for each feature.
We use the latter approach as it is easier to implement. The tuning procedure will consist of finding the optimal value of the reduced dimension d.
- Tuning Gradient Boosting Machines: This step involves training GBMs on the data set and generating features from them. We treat each individual tree as a categorical feature that takes as its value the index of the leaf an instance ends up falling in, so the final representation of the input feature vector is binary coded. Tuning the GBMs means searching over:
• shrinkage
• number of trees
• interaction depth
Our goal is to find the optimal value of all these parameters using k-fold cross validation; a hedged tuning sketch follows.
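# A hedged sketch of the GBM tuning step, using scikit-learn as a stand-in for
# the Spark training job; the data, parameter grid and scoring are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=10000, n_features=50,
                           weights=[0.97, 0.03], random_state=0)

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],   # shrinkage
    "n_estimators": [100, 300, 500],      # number of trees
    "max_depth": [2, 3, 4],               # interaction depth
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="neg_log_loss",               # cross entropy, the metric used elsewhere
    cv=5,                                 # k-fold cross validation
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)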
5. Learning Rate Parameters of Stochastic Gradient Descent:
A per-coordinate learning rate has been shown to achieve the best accuracy [1]. The learning rate for feature i at iteration t is set to eta_t,i = alpha / (beta + sqrt(sum_{j=1..t} g_j,i^2)), where g_j,i is the gradient of the loss with respect to weight i at iteration j and alpha, beta are tunable hyperparameters (a small sketch of this update appears after item 7).
6. Learning the weights and the regularization parameter of Logistic Regression using the cross-entropy cost function.
* To Do: I have to study and test Follow The Regularized Leader (FTRL-Proximal). It was recently published by Google [2] and is reported to perform slightly better than plain L1 regularization on sparse data.
7. Finding the optimal negative sampling rate: Since we have a class imbalance problem, we need to find a subsampling rate for the negative (no click) instances. We need to experiment with different negative down-sampling rates and test the prediction accuracy of the learned model. We can vary the rate in {0.1, 0.01, 0.001, 0.0001} and check the effect on the final cross entropy achieved. Note that after down-sampling, the predicted probabilities must be re-calibrated back to the full-traffic scale [1]; this is included in the sketch below.
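The following NumPy sketch pulls together items 5 and 7: a per-coordinate learning rate update for logistic regression and the re-calibration of probabilities after negative down-sampling (following [1]); the constants are illustrative assumptions:

# Per-coordinate SGD update and post-sampling re-calibration (hedged sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(weights, grad_sq_sum, x, y, alpha=0.1, beta=1.0):
    """One per-coordinate SGD step: eta_t,i = alpha / (beta + sqrt(sum_j g_j,i^2))."""
    p = sigmoid(x @ weights)
    g = (p - y) * x                               # gradient of the cross-entropy loss
    grad_sq_sum += g ** 2
    eta = alpha / (beta + np.sqrt(grad_sq_sum))   # one learning rate per coordinate
    weights -= eta * g
    return weights, grad_sq_sum

def recalibrate(p, w):
    """Map a probability predicted on negatively down-sampled data (rate w)
    back to the full-traffic scale: q = p / (p + (1 - p) / w), as in [1]."""
    return p / (p + (1.0 - p) / w)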
Infrastructure And Data Pipeline
I will have to discuss this section with you in more detail in person. Putting it very broadly (a rough sketch), we will need the following batch jobs (a minimal sketch of the hourly preprocessing job follows the list):
1. Hourly SPARK job to process the RTB logs (read logs from HDFS and create training segments).
2. Hourly SPARK job for preprocessing, feature and training data generation (one-hot encoded vectors).
3. Daily SPARK job to generate the Gradient Boosting Machines model (if training on the Spark cluster is not computationally expensive, maybe we can train this model much more often).
4. Hourly SPARK job to generate the Logistic Regression model.
5. Job to deploy the updated model into production
6. Daily job to generate a generalized model to handle new advertisers/creatives.
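As a rough illustration, here is a hedged PySpark sketch of the hourly preprocessing job (job 2); the paths, log schema and sampling rate are assumptions:

# Hourly preprocessing job: read RTB logs, label instances, down-sample negatives.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hourly_ctr_training_data").getOrCreate()

# Read one hour of RTB logs from HDFS (path layout is assumed)
logs = spark.read.json("hdfs:///rtb/logs/yyyy/mm/dd/hh/")

# Keep the fields needed for the feature vector and label the instances
training = (
    logs
    .select("publisher_id", "advertiser_id", "creative_id", "device_type", "clicked")
    .withColumn("label", F.col("clicked").cast("int"))
    .drop("clicked")
)

# Down-sample the negative class to rebalance (rate is one of the values to tune)
negatives = training.filter("label = 0").sample(fraction=0.01, seed=42)
positives = training.filter("label = 1")
balanced = positives.unionByName(negatives)

# Write the hourly training set for the feature generation / model training jobs
balanced.write.mode("overwrite").parquet("hdfs:///rtb/training/yyyy_mm_dd_hh/")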
This is a very insightful post. Did you use one-hot encoding for all your categorical values? Why did you choose one-hot encoding and not label encoding? If you used Athena in AWS, would it be easier, and would you still need Spark?
I was using Vowpal Wabbit 7 for log linear modeling. One doesn't need to explicitly do one-hot encoding for categorical variables there. All you need to do is create the training file in a particular format, e.g. each line contains the label (+1 for click, -1 for no click when using logistic loss) followed by space-separated attributes. Vowpal Wabbit uses training data in a sparse feature representation, i.e. you only provide the features that are present. So if you got a click, the publisher id was "pub_23", the ad id was "adid_456" and the page title was "daily news", then your training data instance would look like
1 |p pub_23 |a adid_456 ad_daily ad_news
here p (publisher_features) is a namespace where you put all the publisher-related features,
and a (ad_features) is a namespace for ad-specific features (you can name them according to your needs).
Vowpal Wabbit uses feature hashing, so we don't need to generate one-hot encodings ourselves. While training, say, a logistic regression model, you can even generate quadratic features as follows:
vw -d train_data_file.txt -c --loss_function=logistic --passes 5 -f model_name --l1 0.0000001 --l2 0.00000001 --readable_model readable.txt -q pa -b 26
Here -q pa corresponds to generating quadratic features by crossing the publisher and ad namespaces.
There is no need to use Spark for the training itself, since Vowpal Wabbit is an online learner and it is really fast even on a single machine. You can, however, use Spark for training data generation.
Did you use previous CTR as one of your features for training?