Mining Your Path Through Dark Data

NOV 26, 2018

By Sam Fangman, Trace3 Research Analyst

Over half of today’s data is ‘dark’, collected but never put to use. Can emerging tech help bridge the gap?

According to a recent report, 85% of today’s data goes unused: roughly a third of all data is redundant, obsolete, or trivial (ROT), and 52% is ‘dark data’, data whose value has not yet been identified. In an ever-growing data landscape, it is apparent that the tools and technologies used to process and analyze data are not keeping up with the rate at which we create and consume it.

The data lifecycle has numerous points that can bottleneck the process and perpetuate the gap between collection and analysis. Compiling data from various sources into a single repository can be painful and slow, and ensuring that data is correctly classified and cleansed is labor-intensive. If your company consumes streaming data, processing that information in real time is both time-critical and resource-intensive.

For decades, the technologies used to tackle these processes have been based on the CPU (Central Processing Unit), largely due to its ability to handle a wide variety of complex instructions. While the CPU’s generalized approach is versatile, it is not as efficient as more specialized processors that can accelerate data processing and lower these hurdles.

Popularized in 1999 by NVIDIA, the GPU (Graphics Processing Unit) has long been used for image processing. With hundreds or thousands of simple cores designed to handle vectorized data, GPUs thrive on repetitive ‘simple’ tasks such as vector manipulation and matrix operations. While these tasks traditionally served graphics workloads, the GPU is now staking a claim in the enterprise market as a major data processing accelerator.

The process of training deep learning models exemplifies the accelerating power of the GPU. Deep learning models are based on deep neural networks, sets of algorithms used to recognize and classify patterns. A network is made up of inputs and weights, such that when an input is passed to the network, an output is generated based on the weights. These models require extensive training, consisting of forward and backward passes. A forward pass is completed when an input is passed through the network and an output is generated. The backward pass is then completed by adjusting the weights based on the error from the forward pass.
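
To make those two passes concrete, here is a minimal sketch of a single forward and backward pass for a one-layer network in plain NumPy; the sizes, tanh activation, and learning rate are illustrative assumptions rather than details from any particular framework.

```python
# A minimal sketch of one forward and one backward pass for a tiny
# one-layer network; all sizes and hyperparameters here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)           # input vector (4 features)
W = rng.standard_normal((3, 4))      # weights: 3 outputs x 4 inputs
y_true = np.array([0.0, 1.0, 0.0])   # desired output

# Forward pass: the output is generated from the input and the weights.
y_pred = np.tanh(W @ x)

# Backward pass: adjust the weights based on the error from the forward pass.
error = y_pred - y_true
grad_W = np.outer(error * (1.0 - y_pred**2), x)  # chain rule through tanh
W -= 0.1 * grad_W                                # small learning-rate step
```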

A forward pass is a simple task for either a CPU or a GPU if the matrix of weights is small. However, a typical deep learning model has millions of weights spread across multiple hidden layers; a single fully connected layer with 1,000 inputs and 1,000 outputs already requires a million multiply-accumulate operations, so millions of computations are needed to calculate the output for a single input.


This process would be tremendously time-consuming for a CPU, which handles these vector calculations over many cycles. A GPU, however, can perform these vector multiplications hundreds of times faster by using its numerous cores in parallel.
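
As a rough illustration of that contrast, here is a hedged sketch comparing the same matrix multiplication on the CPU (NumPy) and the GPU (the open-source CuPy library); CuPy is our illustrative choice, and actual speedups depend heavily on the hardware and matrix sizes involved.

```python
# A sketch comparing one large matrix multiplication on CPU vs. GPU.
# CuPy is an illustrative choice; timings vary widely by hardware.
import time
import numpy as np
import cupy as cp

n = 4096
a_cpu = np.random.rand(n, n).astype(np.float32)
b_cpu = np.random.rand(n, n).astype(np.float32)

t0 = time.perf_counter()
np.matmul(a_cpu, b_cpu)              # CPU: cores work through the matrix in chunks
cpu_s = time.perf_counter() - t0

a_gpu, b_gpu = cp.asarray(a_cpu), cp.asarray(b_cpu)
t0 = time.perf_counter()
cp.matmul(a_gpu, b_gpu)              # GPU: thousands of cores work in parallel
cp.cuda.Stream.null.synchronize()    # wait for the asynchronous GPU kernel to finish
gpu_s = time.perf_counter() - t0     # note: the first GPU call also pays one-time setup costs

print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s")
```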

So who are the players capitalizing on the benefits of GPUs?

GPU hardware:

  • NVIDIA has provided much of the momentum for this use case. Its parallel computing platform, CUDA, integrates with existing code (C++, Python, etc.) to offload general-purpose processing to GPUs; see the sketch after this list.
  • AMD and Intel are also players in the GPU hardware space.
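
For a taste of the CUDA programming model mentioned above, here is a minimal sketch of a GPU kernel written from Python with the open-source Numba compiler; Numba is an illustrative choice of CUDA binding, not something the article prescribes.

```python
# A minimal CUDA kernel from Python via Numba (illustrative choice).
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)             # this thread's global index
    if i < out.size:
        out[i] = a[i] + b[i]     # each of thousands of threads handles one element

n = 1_000_000
a = np.arange(n, dtype=np.float32)
b = 2 * a
out = np.zeros_like(a)

threads = 256
blocks = (n + threads - 1) // threads
vector_add[blocks, threads](a, b, out)   # Numba copies the arrays to the GPU automatically
```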

GPU-enabled technologies:

  • Anaconda provides a GPU-accelerated data science platform for building advanced analytics.
  • H2O.ai offers a GPU-accelerated machine learning platform for building and training deep learning models.
  • Fastdata.io developed a GPU-enabled streaming engine that accelerates stream processing by up to 1,000x.
  • A number of players, including Kinetica, SQream, OmniSci (formerly MapD), and BlazingDB, have released GPU-enabled databases that can speed up queries by as much as 100x; see the sketch after this list.
  • Datalogue developed a GPU-accelerated data preparation platform, powered by a deep learning model, that automates the ETL work of preparing and cleaning your data for analytics.
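
To give a feel for what GPU-accelerated data work looks like in practice, here is a minimal sketch using the open-source cuDF library as an illustrative stand-in for the platforms above; the file and column names are made up.

```python
# A minimal GPU-accelerated query using the open-source cuDF library.
# cuDF is an illustrative stand-in, not one of the products named above;
# the file and column names are hypothetical.
import cudf

# Load and aggregate directly on the GPU with a pandas-like API.
df = cudf.read_csv("events.csv")
daily = (
    df[df["status"] == "error"]      # filter rows on the GPU
    .groupby("day")["latency_ms"]    # group and aggregate on the GPU
    .mean()
)
print(daily.head())
```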

GPU in the cloud:

  • One of the fastest-growing arenas for GPU-enabled processing, the cloud (AWS, Google, Alibaba, etc.) offers subscription-based access to GPUs without the need for a large up-front hardware purchase.

It is important to note that GPUs carry a significant up-front cost, and their high power density requires more power and cooling per unit of rack space. For smaller companies that deal with more manageable volumes of data, GPU processing may not be the best solution. For companies with large volumes of data to process, however, the GPU provides a great option for accelerated data processing.

GPUs are not the only alternative processing chips that can accelerate data processing. Field Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs) can be specialized to accelerate particular use cases. For example, the Tensor Processing Unit (TPU) is an ASIC designed by Google to accelerate machine learning in its native TensorFlow framework. Another example is the Deep Neural Network Processing Unit (DNPU), developed by the Korea Advanced Institute of Science and Technology (KAIST) to perform deep neural network training. As more processing units like the TPU and DNPU mature, their cost and performance may challenge the GPU in the market.
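
As a small illustration of how one framework lets the same code target different processors, here is a hedged TensorFlow sketch of device placement; the device strings are illustrative, and TPUs in particular are typically driven through TensorFlow’s distribution strategies rather than a bare device string.

```python
# A hedged sketch of targeting different processors from the same code in
# TensorFlow; which devices exist depends on the hardware actually present.
import tensorflow as tf

tf.config.set_soft_device_placement(True)  # fall back gracefully if a device is absent

a = tf.random.uniform((2048, 2048))
b = tf.random.uniform((2048, 2048))

with tf.device("/CPU:0"):
    c_cpu = tf.matmul(a, b)   # runs on the CPU

with tf.device("/GPU:0"):
    c_gpu = tf.matmul(a, b)   # same operation, offloaded to the accelerator
```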

With the ever-growing data landscape, we need to start considering how to cut into the ‘dark’ 52% of our data. Faster queries mean more time creating new analytics and insights. Better-trained deep learning models mean discovering more patterns and asking better questions about the world around us. While we don’t yet know the ultimate solution for uncovering dark data, GPUs enable us to spend less time waiting and more time exploring the insights that lie beneath the surface.


Trace3 understands these dark data trends can be challenging – please contact our Data Intelligence team to discuss solutions that will help you start exploring.
