Understanding the Prior and Posterior in Bayesian Inference for Anomalous User Behavior Detection
Today, I want to dive deeper into the technical details of how we calculate the values of \(\alpha_{prior}\) and \(\beta_{prior}\) in our Bayesian inference model at Fortscale. In my previous posts, I explained how we use these values to incorporate prior knowledge and prevent false alerts for users who have never acted anomalously.
The prior is a crucial component of Bayesian inference, as it lets us fold existing knowledge into the probabilities we calculate. In our case, it addresses the problem that a user whose history contains only zero SMART values would otherwise trigger an alert on any positive value. By setting the right values for \(\alpha_{prior}\) and \(\beta_{prior}\), we can strike a balance between incorporating organizational knowledge and giving weight to the user’s actual data.
To determine the values of \(\alpha_{prior}\) and \(\beta_{prior}\), we need to consider the organization’s overall level of anomalous activity. When anomalous activities are common, the threshold at which an analyst finds an activity interesting is higher. We can simulate this effect through the prior by setting \(\alpha_{prior}\) to the number of SMART values observed in the organization and \(\beta_{prior}\) to their sum. This way, the prior encodes how much anomalous activity the organization sees.
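To make this concrete, here is a minimal sketch of that straightforward mapping. It is illustrative only: the function and variable names are mine, not taken from our production code.

```python
def naive_prior(org_smart_values):
    """Prior taken directly from the organization's SMART values:
    alpha_prior is the number of values, beta_prior is their sum."""
    alpha_prior = len(org_smart_values)
    beta_prior = sum(org_smart_values)
    return alpha_prior, beta_prior
```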
However, setting \(\alpha_{prior}\) too high makes the prior too influential, so the user’s own data has little impact on the calculated probability. To address this, we experimented with real-life data and found that setting \(\alpha_{prior}\) to a reasonably small number, such as 20, while updating \(\beta_{prior}\) to be \(\alpha_{prior}\) times the average of the organization’s SMART values, strikes the right balance.
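A corresponding sketch of the tuned variant (again with illustrative names; the fixed value of 20 is simply the number that worked well in the experiments described above):

```python
def tuned_prior(org_smart_values, alpha_prior=20):
    """Tuned prior: a small, fixed alpha_prior, with beta_prior scaled
    to alpha_prior times the organization's average SMART value."""
    org_average = sum(org_smart_values) / len(org_smart_values)
    beta_prior = alpha_prior * org_average
    return alpha_prior, beta_prior
```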
Choosing a smaller \(\alpha_{prior}\) reduces the prior’s influence, letting the user’s own data shape their threshold while still accounting for the organization’s level of anomalous activity. The variance of the prior also increases, which leaves room for uncertainty in the expected value. This balance between the organization’s knowledge and the user’s data is what personalizes the threshold and reduces false alerts.
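To see why this balances the two sources of information, consider the following sketch. It assumes the posterior parameters are formed by the standard conjugate update, \(\alpha_{posterior} = \alpha_{prior} + n\) and \(\beta_{posterior} = \beta_{prior} + \sum\) of the user’s SMART values; that update rule is an assumption of this illustration, not a detail spelled out in this post.

```python
def personalized_scale(user_smart_values, alpha_prior, beta_prior):
    """Illustrative sketch assuming the standard conjugate update:
    alpha_post = alpha_prior + n, beta_post = beta_prior + sum(user values).
    The ratio beta_post / alpha_post blends the organization's average
    SMART value with the user's own average, weighted by alpha_prior vs. n."""
    n = len(user_smart_values)
    alpha_post = alpha_prior + n
    beta_post = beta_prior + sum(user_smart_values)
    return beta_post / alpha_post
```

Under that assumed update, with \(\alpha_{prior} = 20\) a user who has contributed 20 SMART values gives equal weight to the organization’s average and to their own; as their history grows, their own data gradually dominates.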
In conclusion, calculating the values of \(\alpha_{prior}\) and \(\beta_{prior}\) in our Bayesian inference model requires careful consideration of the organization’s level of anomalous activities and the desired influence of the user’s data. By striking the right balance, we can effectively detect and prevent insider threats while minimizing false alerts and maintaining personalized thresholds.