A Comparison of Techniques for Visualizing High-Dimensional Data

Quinn Hubbarth, March 06, 2020

High-dimensional data is hard to understand. A single record may contain hundreds of variables that describe it. Though such high-dimensional data may be well suited to machine learning algorithms, it is very hard for humans to grasp. Low-dimensional data, on the other hand, is easy for humans to read through and visualize. 2D and 3D data can be graphed, and visual encodings like color or size can represent additional dimensions. These visualization methods help researchers understand the data. High-dimensional data simply has too many variables for these approaches to give a comprehensive picture.

Dimensionality reduction (DR) is the process of decreasing the number of variables in a dataset to make the data easier to understand while preserving its most important features. There are many different techniques and uses for DR; notably, it’s frequently used for data visualization. In some cases, reduced dimensions also make more efficient features for machine learning algorithms: lower-dimensional data requires less storage and computing power to process, and the reduction can strip out unimportant variation and shrink the number of parameters a model needs. This allows for much higher speeds with very little loss in accuracy.
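To make that efficiency argument concrete, here is a minimal sketch (not our exact setup) of DR as a preprocessing step in a scikit-learn pipeline, using the bundled digits dataset as a stand-in for any high-dimensional table; the component count is an arbitrary choice for illustration:

```python
# A minimal sketch: PCA as a preprocessing step before a classifier,
# using scikit-learn's digits dataset as a stand-in.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 64 pixel features per image

# Classify on 16 principal components instead of all 64 raw features.
reduced = make_pipeline(PCA(n_components=16), LogisticRegression(max_iter=5000))
full = LogisticRegression(max_iter=5000)

print(cross_val_score(reduced, X, y).mean())  # accuracy with 16 features
print(cross_val_score(full, X, y).mean())     # accuracy with all 64 features
```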

As mentioned previously, DR is a great way (and, in many cases, the only way) to visualize high-dimensional data. For example, if a dataset with 100 variables can be re-expressed with just 2, it can be graphed as a scatterplot. Of course, squeezing 100 variables into just 2 loses information, and each DR technique makes that trade-off differently, so each produces a visualization that surfaces different patterns from the original high-dimensional dataset. This post explores a few common techniques - PCA, SVD, T-SNE, and UMAP - as applied to tabular, text, and image data. We will see how these methods produce vastly different results and discuss their strengths and weaknesses for visualizing high-dimensional data.
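As a minimal sketch of that idea, the snippet below generates synthetic 100-variable data (a stand-in for a real dataset), reduces it to 2 dimensions with PCA, and draws the scatterplot:

```python
# Re-express 100 variables as 2 and plot the result.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))  # 500 records, 100 variables each

X_2d = PCA(n_components=2).fit_transform(X)  # 100 dimensions -> 2

plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.show()
```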

Methods

There are many different methods machine learning researchers can use to perform dimension reduction, each with its own pros and cons. Some methods are extremely accurate, others are extremely fast, and still others are great for niche tasks. We chose to test four different dimension reduction methods (a minimal sketch of invoking each one follows the list):

  • PCA (Principal component analysis): this method has been used for over 100 years in mathematics and statistics. PCA is a very simple and efficient algorithm that runs quickly on modern computers, and is thus quite useful for fast machine learning programs. It is available in the scikit-learn library.
  • SVD (Singular value decomposition): SVD is similar to PCA in its usage and efficiency, and is also an older mathematical method. It has a very similar runtime to PCA, and is also available in the scikit-learn library.
  • T-SNE (t-distributed stochastic neighbor embedding): this method is much newer than PCA and SVD, having been introduced in 2008. It relies on more sophisticated machinery such as manifold learning to produce better-separated results, and is thus more computationally expensive than PCA or SVD.
  • UMAP (Uniform manifold approximation and projection): UMAP was introduced in 2018 and was designed to build on T-SNE’s improvements. It is similar to T-SNE in effectiveness but attempts to be more efficient, though it still has a much longer runtime than PCA or SVD.
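Here is one way to invoke all four methods on the same data, as a sketch; the random matrix stands in for a real dataset, and PCA, TruncatedSVD, and TSNE ship with scikit-learn while UMAP comes from the separate umap-learn package:

```python
# Fit all four reducers on the same stand-in data.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE
from umap import UMAP  # pip install umap-learn

X = np.random.rand(1000, 100)  # stand-in for a real dataset

reducers = {
    "PCA": PCA(n_components=2),
    "SVD": TruncatedSVD(n_components=2),
    "T-SNE": TSNE(n_components=2),
    "UMAP": UMAP(n_components=2),
}

embeddings = {name: r.fit_transform(X) for name, r in reducers.items()}
for name, emb in embeddings.items():
    print(name, emb.shape)  # each is (1000, 2)
```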

Datasets/Results

To show how broadly applicable these techniques are, we chose three very different datasets, each of which is frequently used as a benchmark for machine learning algorithms. Notably, none of them is purely numerical: each had to be converted into a fully numerical form through some kind of data manipulation before dimension reduction could be performed.

AMES Housing

The AMES Housing dataset is a widely used tabular dataset recording home sale prices along with 80 other variables about each home, such as land area, square footage, and number of bathrooms. We chose to focus on a subset of the variables from the full AMES dataset. Here’s an example of the data showing the selected variables:

Link to data here.
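As a rough sketch of the preprocessing step, the snippet below assumes a hypothetical local CSV copy of the data ("ames.csv"); the column names are illustrative, not the exact subset we selected:

```python
# Load the tabular data and separate the held-out target.
import pandas as pd

df = pd.read_csv("ames.csv")  # hypothetical local copy

# Numerical columns feed straight into dimension reduction...
features = df[["LotArea", "GrLivArea", "FullBath", "YearBuilt"]]

# ...while the sale price is held out and used only to color the plots.
price = df["SalePrice"]

# Any categorical columns would first be one-hot encoded, e.g.:
# features = pd.get_dummies(df[selected_columns])
```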

Results

Below are the results from the four dimension reduction techniques. The graphs are color-coded by house price, a variable that was not included in the dimension reduction.
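For reference, a plot like these can be produced along the following lines, where `features` and `price` are the hypothetical objects from the loading sketch above:

```python
# Reduce the feature matrix to 2-D and color each point by sale price.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

embedding = PCA(n_components=2).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=price, cmap="viridis", s=5)
plt.colorbar(label="sale price")
plt.title("PCA projection, colored by house price")
plt.show()
```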

For this dataset, the more complicated algorithms (T-SNE and UMAP) produced graphs that are visually appealing but hard to interpret. These graphs show many local structures, but no cohesive color arrangement across the entire graph. PCA and SVD produced relatively simple graphs, which makes them easier to understand. Nonetheless, every graph shows a strong relationship to house price, which is evidence of a good visualization.

AG News

The AG News dataset is a collection of thousands of news articles, each classified into one of four categories: world, sports, business, or sci/tech news. To represent text data in a way that machine learning algorithms can work with, the text is translated into numerical representations known as encodings. FastText, the library we use to generate encodings, translates each article into a vector of 100 numbers, and we reduce this high-dimensional data to 2 dimensions to visualize it. Here’s an example of what the text data looks like:

Link to data here.
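A sketch of the encoding step is shown below; it assumes the article text has been written one article per line to a hypothetical file "ag_news.txt", and the two headlines are purely illustrative:

```python
# Train 100-dimension word vectors, then encode whole articles.
import fasttext  # pip install fasttext
import numpy as np

model = fasttext.train_unsupervised("ag_news.txt", dim=100)

articles = [
    "Oil prices climb as supply worries grow",
    "Home team edges rivals in overtime thriller",
]
encodings = np.array([model.get_sentence_vector(a) for a in articles])
print(encodings.shape)  # (2, 100): one 100-number vector per article
```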

Results

Below are the results of the dimension reduction, color-coded by news category:

For this dataset, no method stood out as producing a dramatically different structure. PCA and SVD were again very similar, and the graphs produced by T-SNE and UMAP were not nearly as complex as those for the AMES Housing dataset. Different parameters for T-SNE and UMAP might have changed these results: we used the default parameters for this blog post, but tuning them can drastically change how these algorithms behave. For more information on T-SNE and UMAP parameters, visit here and here.
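As a rough illustration of the kind of tuning we mean, the sketch below sweeps T-SNE’s perplexity and UMAP’s n_neighbors/min_dist on stand-in data; these are the parameters that most strongly shift the balance between local and global structure:

```python
# Illustrative parameter sweeps on stand-in data.
import numpy as np
from sklearn.manifold import TSNE
from umap import UMAP  # pip install umap-learn

X = np.random.rand(500, 100)  # stand-in for the 100-dimension encodings

for perplexity in (5, 30, 100):  # default is 30
    embedding = TSNE(n_components=2, perplexity=perplexity).fit_transform(X)

for n_neighbors in (5, 15, 50):  # default is 15
    embedding = UMAP(n_components=2, n_neighbors=n_neighbors,
                     min_dist=0.1).fit_transform(X)
```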

MNIST

MNIST is a benchmark image dataset for machine learning algorithms. Each image in the dataset is a hand-drawn digit (0-9), translated into a vector of 784 values, one for every pixel in the 28x28 image. Dimension reduction takes these 784 dimensions down to 2, and each point is then color-coded by the true value of its hand-drawn digit. Here’s an example of what some of these images look like:

MNIST image

Link to data here.
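A minimal sketch of this pipeline appears below, fetching the 784-pixel images from OpenML and using T-SNE for the reduction (any of the four methods could be swapped in); a subsample keeps the runtime manageable:

```python
# Fetch MNIST, reduce 784 dimensions to 2, color by the true digit.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:5000], y[:5000].astype(int)  # 5,000 of the 70,000 images

embedding = TSNE(n_components=2).fit_transform(X)  # 784 dimensions -> 2

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```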

Results

Here are the DR results, color-coded based on digits:

This dataset showed the advantages of T-SNE and UMAP over PCA and SVD most clearly. Each of the more complicated algorithms produced graphs with clear clusters for each digit; UMAP in particular did an excellent job of separating the digits. PCA and SVD were barely able to separate any of the digits.

Conclusions

The results across all three datasets showed that PCA performed similarly to SVD, and T-SNE performed similarly to UMAP. On some datasets, like MNIST, T-SNE and UMAP produced significantly better visualizations than PCA and SVD. This highlights some of the strengths and weaknesses of these DR methods. For example, the reason T-SNE and UMAP performed so well on MNIST may be the size of the MNIST data: each data point has 784 dimensions, and the more complex algorithms did a better job than PCA and SVD of uncovering multiple patterns in the sparse data. Similarly, in the AMES Housing dataset, T-SNE and UMAP captured many relationships between different variables and house price, but the result was difficult to understand visually. PCA and SVD showed their strengths on the AMES dataset by forming one strong relationship between the data as a whole and house price.

These outcomes show that each method of dimension reduction has pros and cons regarding effective visualization. However, in order to determine the effectiveness of a visualization, one must first consider what makes a visualization “good”. Some visualizations can effectively show one relationship between variables, whereas others can effectively show many relationships between many variables (for example, in the AMES housing dataset). The “best” visualization for a certain dataset depends greatly on what type of relationships a researcher wishes to explore. For example, if a researcher wanted to use a linear regression model on a visualization created by DR, it would be very easy to do so using the PCA or SVD results from the AMES Housing dataset. However, if a researcher wanted to explore how each individual variable in the AMES Housing dataset affected the house price, a technique like T-SNE or UMAP would be effective. Because different DR algorithms analyze data in unique ways, it’s important to consider the goals of the visualization before selecting a method to use.

Dimension reduction is extremely useful for visualization in a wide variety of machine learning applications. However, different DR techniques can produce very different results. Every technique has its pros and cons, so researchers should choose an algorithm that aligns with their desired results. Overall, dimension reduction is an easy-to-use and practical solution for visualizing machine learning data, and using it to visualize results is an important part of many data science research efforts.