Data Trends: IT Corralling Data Science; Big Data May Not Be Better


We’ve recently seen data science shift markedly from a peripheral capability to a core function, with larger teams tackling increasingly complex analytics problems. We’ve watched rapid advances in data science platforms and their big implications for data and analytics teams. But what surprises are in store in the realm of data, analytics and machine learning going forward?

What new developments in data science will we be talking about in a year? Here are our three predictions:

1. Big data’s diminishing returns: De-emphasizing the size of the data. We are increasingly seeing that bigger data is often not better. Companies are realizing that extracting more data may not help them address certain problems more effectively.

While more data can be useful if it is clean, the vast majority of business use cases see diminishing marginal returns from additional data. More data can actually slow innovation: testing takes longer and requires more infrastructure, which makes it harder for data scientists to iterate quickly.

Experimenting and iterating faster will lead to better models and outcomes than running fewer experiments on larger data sets. “If companies want to get value from their data, they need to focus on accelerating human understanding of data, scaling the number of modeling questions they can ask of that data in a short amount of time,” writes MIT researcher Kalyan Veeramachaneni.

Indeed, Fortune 500 companies will take a more agile and iterative approach by focusing on learning more from higher-quality samples of data. They will use techniques that extract more representative data examples, so that they can draw better conclusions from these sub-samples. For example, rather than processing petabytes of call center recordings, they will sample the last two to three months, run dozens of experiments, and more quickly deliver a churn prediction to their team for feedback.
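To make the sampling idea concrete, here is a minimal Python sketch of one way to pull a recent, representative sub-sample. It is an illustration under stated assumptions, not a method described in the article: the function name recent_stratified_sample, the record fields timestamp and churned, and the choice of stratified random sampling are all hypothetical.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta


def recent_stratified_sample(records, now, days=90, fraction=0.1, seed=0):
    """Keep records from the last `days` days, then sample `fraction` of each label group.

    Assumes each record is a dict with a `timestamp` (datetime) and a
    `churned` (bool) field; both names are hypothetical.
    """
    cutoff = now - timedelta(days=days)
    recent = [r for r in records if r["timestamp"] >= cutoff]

    # Group by label so the sample keeps roughly the same class balance.
    groups = defaultdict(list)
    for r in recent:
        groups[r["churned"]].append(r)

    rng = random.Random(seed)  # fixed seed keeps experiments repeatable
    sample = []
    for rows in groups.values():
        k = max(1, round(len(rows) * fraction))  # at least one row per group
        sample.extend(rng.sample(rows, k))
    return sample


# Synthetic stand-in for a year of call center records (illustrative data only).
now = datetime(2024, 6, 30)
calls = [
    {"call_id": i, "timestamp": now - timedelta(days=i), "churned": i % 5 == 0}
    for i in range(365)
]
sample = recent_stratified_sample(calls, now, days=90, fraction=0.2)
```

Because the sub-sample is small, a team could rerun a function like this with different windows or fractions many times in the time a single pass over the full archive would take.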

2. CIOs dealing with data science Wild West: IT teams bringing order to data and analytics. IT organizations have traditionally managed analytical data infrastructure, such as data warehouses and production processes. Driven by a desire to experiment, data scientists, who sit in the middle of the stack between IT and business consumers, are increasingly creating their own shadow IT infrastructure. They download tools and install them locally on their desktops or on shared servers scattered across departments. They use RStudio, Jupyter, Anaconda and myriad open source packages that improve almost daily.

This Wild West of tooling creates a plethora of governance challenges. Many CIO teams are realizing how much data scientists need consistent, secure tooling that does not constrain their ability to experiment and innovate.
