Principal Components Analysis (PCA) and its sibling Singular Value Decomposition (SVD) are commonly used dimensionality reduction tools that can dramatically improve the performance of supervised and unsupervised machine learning algorithms. They are also useful in exploratory data analysis, where projecting a complex high-dimensional dataset into two or three dimensions helps expose the underlying distribution of the data.
Pros and Cons of the PCA/SVD Approach
PCA/SVD can also be useful in reducing the amount of noise in data (de-noising), as random deviations tend to be pushed into components with lower explained variance. By keeping the strongest components and discarding the least significant ones, we discard much of the noise, since the least significant components contribute the least to the overall variance of the data.
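As a rough illustration of the de-noising idea, the sketch below reconstructs a dataset from only its strongest components using NumPy's SVD. The synthetic data (a rank-2 signal plus Gaussian noise), the function name pca_denoise and the choice of two components are all assumptions made for the example, not something taken from a real pipeline.

```python
import numpy as np

def pca_denoise(X, n_components):
    """Rebuild X from its top n_components principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = n_components
    # Keep only the k strongest components, then undo the centring.
    return (U[:, :k] * s[:k]) @ Vt[:k] + mean

rng = np.random.default_rng(0)
# Hypothetical data: a rank-2 signal living in 20 dimensions, plus noise.
signal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 20))
noisy = signal + 0.1 * rng.normal(size=signal.shape)

denoised = pca_denoise(noisy, n_components=2)

err_noisy = np.linalg.norm(noisy - signal)
err_denoised = np.linalg.norm(denoised - signal)
print(f"error before: {err_noisy:.3f}, after: {err_denoised:.3f}")
```

In this toy setting the reconstruction should sit closer to the clean signal than the noisy input does, because most of the noise lives in the discarded components. Real data rarely has such a clean split between signal and noise.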
This approach is fine when the only goal of the exercise is to better understand the overall distribution of the data. If this is the goal, insight and understanding are typically achieved once a reasonable-looking low-dimensional space is created, with a couple of clusters or some complex structures that can be explored and mined for insights.
There is a danger to this approach, however. What if the features that help you understand the data are not the ones that account for most of the variance in the data? For example, you may have been paying attention to what someone is saying, because this is an information-rich signal, when you should instead have been paying attention to their subtle body language.
SPC Provides the Answer
In such a case, what we really want is a mathematical embedding that eliminates irrelevant features and maximizes the contrast for the most important features. Supervised Principal Components (SPC) achieves this by combining the unsupervised, non-parametric techniques of PCA/SVD with supervised learning.
There are many ways of implementing Supervised Principal Components (SPC). It is not a specific algorithm but rather an approach to applying algorithms. In short, SPC requires the investigators to use their knowledge of the system to identify a fundamental key variable, for example the output of the system (regression) or a label (classification). First, SPC uses this key variable to select the subset of features in the data that have the strongest association (e.g. mutual information) with the output of the system. Then, PCA/SVD is only allowed to “see” those features that are the most valuable in predicting the output of the system. The SPC approach can greatly reduce the number of features that contribute to the variance of the data, and it’s no surprise that SPC was originally developed as a solution for p >> n problems (i.e. far more features than observations) such as microarray gene expression analysis.
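The two steps above can be sketched in a few lines of NumPy. This is only one of the many possible variants, written under stated assumptions: the association score is the absolute Pearson correlation with the key variable (chosen for brevity; mutual information or a univariate model score could be swapped in), the function name supervised_pca is hypothetical, and the data is synthetic with 2,000 features, 100 observations and 10 planted informative features.

```python
import numpy as np

def supervised_pca(X, y, n_features, n_components=2):
    """Screen features by association with y, then run PCA on the survivors.

    Association is absolute Pearson correlation here, which is an
    assumption for this sketch rather than a fixed part of SPC.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    norms = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    norms[norms == 0] = np.inf            # constant features score zero
    scores = np.abs(Xc.T @ yc) / norms
    # Step 1: keep the n_features most strongly associated features.
    selected = np.sort(np.argsort(scores)[-n_features:])
    # Step 2: PCA (via SVD) sees only the selected features.
    U, s, Vt = np.linalg.svd(Xc[:, selected], full_matrices=False)
    embedding = U[:, :n_components] * s[:n_components]
    return embedding, selected

rng = np.random.default_rng(0)
n, p = 100, 2000                          # a p >> n problem
y = np.repeat([0.0, 1.0], n // 2)         # hypothetical binary key variable
X = rng.normal(size=(n, p))
X[:, :10] += 2.0 * y[:, None]             # only the first 10 features matter

embedding, selected = supervised_pca(X, y, n_features=10)
print("selected features:", selected)
print("corr(PC1, y):", np.corrcoef(embedding[:, 0], y)[0, 1])
```

The sign of a principal component is arbitrary, so the printed correlation may be positive or negative; what matters is its magnitude. In practice the number of features to keep would normally be tuned, for instance by cross-validation, rather than fixed in advance.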
Leveraging SPC at Arkose Labs
At Arkose Labs, when we’re defending a client against online abuse, we routinely use this technique with great success on datasets with over a million features. When a hostile agent makes their presence known to us by starting an attack, we can rapidly identify an obviously unusual feature in some of their traffic. Attackers will try to hide their presence by using a variety of fake appearances for their traffic, so not all of the attacker’s traffic may have the unusual feature we detected. However, that one feature is enough to allow SPC to identify the 500 to 2,000 features that are most powerful in discriminating between attacker traffic and normal traffic.
When we embed all of the traffic into the low-dimensional Euclidean space created by SPC, this tends to create two clusters: the normal traffic and the attacker traffic. The attacker cluster is typically a mixture of two types of traffic: traffic we know for sure to be hostile, and suspicious traffic. This suspicious traffic shares many features with the hostile traffic and also looks different from the bulk of the normal traffic being validated by the Arkose Labs platform. This newly identified suspicious traffic is then treated appropriately.
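To make the clustering step concrete, here is a toy sketch of how suspicious traffic might be pulled out of an embedding. Everything in it is an assumption made for illustration: the two-dimensional embedding is synthetic, the clustering is a hand-rolled two-cluster k-means, and the helper names two_means and flag_suspicious are hypothetical. It does not describe the method used in the Arkose Labs platform.

```python
import numpy as np

def two_means(points, n_iter=50):
    """Tiny two-cluster k-means; returns a 0/1 cluster label per row."""
    # Start from the two points at the extremes of the first axis.
    order = np.argsort(points[:, 0])
    centers = points[[order[0], order[-1]]].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def flag_suspicious(embedding, known_hostile):
    """Flag points that share a cluster with known-hostile traffic."""
    labels = two_means(embedding)
    # The attacker cluster is the one holding most known-hostile points.
    attacker = np.bincount(labels[known_hostile], minlength=2).argmax()
    return (labels == attacker) & ~known_hostile

rng = np.random.default_rng(0)
# Hypothetical embedding: 300 normal sessions and 60 attacker sessions.
normal = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(300, 2))
attack = rng.normal(loc=[10.0, 0.0], scale=1.0, size=(60, 2))
embedding = np.vstack([normal, attack])

# Only 20 of the attacker sessions carry the obviously unusual feature.
known_hostile = np.zeros(len(embedding), dtype=bool)
known_hostile[300:320] = True

suspicious = flag_suspicious(embedding, known_hostile)
print("newly flagged sessions:", int(suspicious.sum()))
```

The clusters here are deliberately well separated. With real traffic the boundary is usually fuzzier, so a density-based or probabilistic clustering method may be a better fit than plain k-means.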
Supervised Principal Components is not a well-known technique, especially when compared to something like Gradient Boosted Trees or Convolutional Neural Networks. That said, at Arkose Labs we have found it to be an extremely powerful technique both for extracting insights from data and for improving the contrast of a manifold to maximize the performance of our machine learning algorithms.