Fraud detection applications are ubiquitous. For companies working in online advertising in particular, click fraud can occur at such an overwhelming volume that it results in huge financial losses. Indeed, ad channels can drive up costs by spamming clicks at very large scale through so-called "click farms". For this reason, a score representing the likelihood that an app is downloaded after a click on a mobile advertisement could be very beneficial for any advertising agency trying to detect fraudulent clicks, or should we say taps.
Certainly, the damage done by these "click farms" could be drastically reduced by an accurate system estimating the conversion/download probability after the tap. For example, ads could be displayed only if the conversion probability is above a certain threshold, thereby sparing the ad agency from paying for a worthless click. To this end, we have access to different features for each click, and the goal is to predict whether this click will lead to a download or not.
Click Fraud in Numbers
Figures from digital security company White Ops suggest that marketers lost $7.2 billion to digital ad fraud in 2016. Another estimate, by ad verification company Adloox, puts marketers' losses to ad fraud at around $6.4 billion in 2017.
Clearly, even if those numbers are off, ad fraud is a massive problem and is forecast to keep growing steadily in the coming years. On top of that, up to 25% of marketing budgets is consumed by click fraud, and this percentage is increasing in spite of huge and widely publicized efforts to eliminate it. Finally, this curse impacts the whole industry, as 23% of ad networks have more than 20% fraudulent clicks.
On this subject, the CEO of White Ops, Michael Tiffany, concludes:
“But, as these declines are relatively modest, it’s critical that those affected by this threat remain vigilant,” – White Ops CEO
For this problem, we have access to a dataset of 50'000 events (25'000 which led to a download, and 25'000 which did not) captured from an ad agency.
The available features are listed in the table hereafter:
| Feature | Description |
| --- | --- |
| ip | IP address of the user |
| app | App id for marketing |
| device | Device type id of the user's mobile phone (e.g. iPhone 7, Huawei Mate 7, etc.) |
| os | OS version id of the user's mobile phone |
| channel | Channel id of the mobile ad publisher |
| click_time | Timestamp of the click (e.g. 2017-11-06 23:10:28) |
| downloaded | Target variable (0 or 1), indicating whether the click led to a download or not |
We can see that we do not have many features (6) at our disposal, but we will squeeze the most out of them with the help of the next stage of our development process!
How do we get the most out of the data at our disposal? That is the challenge feature engineering solves: the success of applied Machine Learning algorithms depends on how you present the data to them. Hence, it is an essential step in solving any problem with Machine Learning.
Here is how I would define Feature Engineering:
Feature engineering is the transformation process from raw data & features to new ones that better describe the underlying trends and problems to the predictive algorithm, resulting in improved generalization and hence accuracy on unseen data.
Therefore, feature engineering is basically designing what the input X to the algorithm should be, built from the features at our disposal. If you want further information, here is a good article about the art of Feature Engineering.
For this problem, we came up with the following new features:
| Feature | Description |
| --- | --- |
| ip_day_hour combination count | Number of clicks for a given ip during a given hour of a given day. |
| ip_app combination count | Number of clicks for a given ip on a particular app. |
| ip_app_os combination count | Number of clicks for a given ip on a particular app and os. |
| hour | Hour extracted from the click_time feature. |
| day | Day extracted from the click_time feature. |
| month | Month extracted from the click_time feature. |
| week_day | Week day (0 for Monday, 1 for Tuesday, …) extracted from the click_time feature. |
Finally, we remove the click_time feature now that we have extracted the valuable information out of it. With these 6 (original features) + 7 (new features) – 1 (click_time removed) = 12 features, we will now build our Deep Learning model!
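As a minimal sketch, these features can be built with pandas; the toy DataFrame below stands in for the real dataset, and the column names follow the tables above:

```python
import pandas as pd

# Toy clicks table with the raw columns from the dataset (values are made up).
df = pd.DataFrame({
    "ip": [1, 1, 2, 1],
    "app": [10, 10, 11, 12],
    "os": [5, 5, 5, 6],
    "click_time": pd.to_datetime([
        "2017-11-06 23:10:28",
        "2017-11-06 23:45:01",
        "2017-11-07 08:02:13",
        "2017-11-07 09:30:00",
    ]),
})

# Time-based features extracted from click_time.
df["hour"] = df["click_time"].dt.hour
df["day"] = df["click_time"].dt.day
df["month"] = df["click_time"].dt.month
df["week_day"] = df["click_time"].dt.weekday  # 0 = Monday, 1 = Tuesday, ...

# Count features: how many clicks share a given combination of keys.
for name, keys in {
    "ip_day_hour": ["ip", "day", "hour"],
    "ip_app": ["ip", "app"],
    "ip_app_os": ["ip", "app", "os"],
}.items():
    df[name] = df.groupby(keys)["ip"].transform("size")

# Drop click_time now that its information has been extracted.
df = df.drop(columns=["click_time"])
```

The `groupby(...).transform("size")` idiom attaches the group's click count back to every row of that group, which is exactly what the three combination-count features require.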
Our model will be a Neural Network taking these 12 features as an input, and outputting a single number between 0 and 1, representing the estimate of the target variable downloaded.
Firstly, we need to embed all the features, since they are either ordinal or categorical. We thus have 12 embedding layers, each embedding its feature into 32 dimensions; the results are concatenated into a single vector of 12 * 32 = 384 dimensions. This vector is then processed by a fully-connected layer of 256 neurons, followed by another fully-connected layer of 128 neurons. Finally, these neurons are connected to a single output neuron, which gives the desired output. On top of that, we apply dropout at each layer with a probability of 25%, to prevent the network from over-fitting (memorizing instead of generalizing) during the training phase.
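A sketch of this architecture in PyTorch could look as follows; the framework choice and the per-feature cardinalities (number of distinct values per feature) are assumptions, as the article does not show the original implementation:

```python
import torch
import torch.nn as nn

class ClickNet(nn.Module):
    """Embeds 12 categorical/ordinal features, then applies two dense layers."""

    def __init__(self, cardinalities, emb_dim=32, p_drop=0.25):
        super().__init__()
        # One embedding table per feature; cardinalities lists the number
        # of distinct values for each of the 12 input features.
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, emb_dim) for card in cardinalities
        )
        in_dim = emb_dim * len(cardinalities)  # 12 * 32 = 384
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 1), nn.Sigmoid(),  # estimated download probability
        )

    def forward(self, x):
        # x: (batch, 12) integer-encoded features.
        embedded = [emb(x[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.net(torch.cat(embedded, dim=1)).squeeze(1)

# Hypothetical cardinalities for the 12 features (placeholders only).
model = ClickNet([1000, 50, 30, 20, 100, 500, 400, 300, 24, 31, 12, 7])
probs = model(torch.zeros(4, 12, dtype=torch.long))  # one probability per click
```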
To train and evaluate our model, we randomly split the data into 75% – 25% (training and testing set), and we train the Deep Learning model with back-propagation for a few hours (15 epochs).
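A minimal sketch of the split and training loop, on toy stand-in data; the linear stand-in model, the binary cross-entropy loss, and the Adam optimizer are assumptions, since the article only specifies back-propagation, a 75%/25% split, and 15 epochs:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for the real 50'000-click dataset.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 12)).astype(np.float32)
y = rng.integers(0, 2, size=200).astype(np.float32)

# 75% / 25% train/test split, as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Minimal stand-in model; the real one is the embedding network described above.
model = nn.Sequential(nn.Linear(12, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer is an assumption
loss_fn = nn.BCELoss()

for epoch in range(15):  # the article trains for 15 epochs
    optimizer.zero_grad()
    preds = model(torch.from_numpy(X_train)).squeeze(1)
    loss = loss_fn(preds, torch.from_numpy(y_train))
    loss.backward()
    optimizer.step()
```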
Now that the training phase is finished and we have our model ready, we will test its generalization capabilities on the testing set. This set has never been seen by our model and thus represents the fraud detection accuracy we can expect in a practical application.
Accuracy on the testing set: 89.26%
Accuracy on the training set: 90.24%
We can thus see that our model did not over-fit (thanks to the dropout) and that we reach a very good classification accuracy! We will now further investigate the mistakes made by the model through metrics other than accuracy; for example, here are the precision & recall on the testing set:
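For illustration, per-class precision and recall can be computed with scikit-learn; the labels and predictions below are made-up toy values, not the article's actual results:

```python
from sklearn.metrics import precision_score, recall_score

# Toy true labels and model predictions (1 = downloaded, 0 = not downloaded).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0]

# Precision and recall computed separately for each class.
for cls, name in [(0, "Not downloaded"), (1, "Downloaded")]:
    p = precision_score(y_true, y_pred, pos_label=cls)
    r = recall_score(y_true, y_pred, pos_label=cls)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```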
We can see that the Precision of the “Not downloaded” inputs as well as the Recall of the “Downloaded” ones are the factors driving the accuracy down.
And for the bravest of you, here is the receiver operating characteristic curve, with a respectable area under the curve of 0.954:
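For reference, the points of such a ROC curve and the area under it can be computed with scikit-learn; the labels and scores below are toy values, not the article's data:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted download probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# fpr/tpr trace the ROC curve as the decision threshold is swept.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
```

The AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which is why it is a threshold-free summary of the classifier.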
We successfully built a Deep Learning based model that achieves very good accuracy in estimating the conversion/download probability after the tap. This way, an ad agency could very easily deploy such a system in production to display ads only if the estimated conversion probability is above a certain threshold, saving millions of dollars in its fight against fraudulent clicks.