Dennis D. McDonald: Here’s One for The Reluctant Data Janitor

Written by Dennis D. McDonald

When will artificial intelligence finally take the data prep out of data analysis? Data janitors need to know, says DENNIS D. McDONALD …

aNewDomain — As many as 40 percent of data pros spend half their time prepping data, according to a recent release I caught.

That’s as opposed to actually doing their job and analyzing it.

Not all of us are into that kind of extreme drudgery. Some of us would rather spend our workdays doing something fun and sexy — like the analytical, predictive, and visualization stuff we’re actually getting paid to do.

But what if that’s not where your time actually goes? What then?

The end of data prep

The study’s findings are disappointing, for sure. That people still have to spend so much time preparing data for analysis in these days of constantly improving data analysis tools is downright depressing. Aren’t we supposedly speeding toward the day when AI handles the data prep, at least as envisioned by IBM?

Not yet. And looking at the numbers, you would think that things haven’t changed since the days when I crunched numbers for a living and would feel really lucky if I could actually spend 20 percent of a statistical or survey research project on analysis.

Or since the days when I built digital database products composed of text, numeric, and image data extracted from dozens of different systems and platforms.

In those days, data cleanup and standardization were always a major cost and time component for any client project, be the client an appliance retailer, a truck manufacturer, or an international insurance company.
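To make that concrete, and purely as an illustration rather than anything from those projects, here is the kind of standardization pass such work involves, sketched in Python with pandas. The source systems, column names, and formats below are all hypothetical:

```python
import pandas as pd

# Hypothetical extracts. On a real project these would come from dozens of
# systems, each with its own naming, date, and currency conventions.
retail = pd.DataFrame({
    "cust_name": ["ACME Appliances ", "smith & co."],
    "purchase_date": ["03/15/2016", "11/02/2015"],
    "amount": ["$1,200.00", "$349.99"],
})
trucking = pd.DataFrame({
    "customer": ["Acme Appliances", "SMITH & CO"],
    "date": ["2016-03-15", "2015-11-02"],
    "invoice_total": [1200.0, 349.99],
})

def standardize(df, name_col, date_col, amount_col):
    """Map one source's columns and formats onto a common layout."""
    out = pd.DataFrame()
    out["customer"] = df[name_col].str.strip().str.upper().str.rstrip(".")
    out["date"] = pd.to_datetime(df[date_col])
    out["amount"] = (
        df[amount_col]
        .astype(str)
        .str.replace(r"[$,]", "", regex=True)
        .astype(float)
    )
    return out

# One consistent table instead of two incompatible ones.
combined = pd.concat(
    [
        standardize(retail, "cust_name", "purchase_date", "amount"),
        standardize(trucking, "customer", "date", "invoice_total"),
    ],
    ignore_index=True,
)
print(combined)
```

Multiply that by every field, every source system, and every undocumented exception, and you can see where the analysts’ hours go.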

And heaven help you if you had to move a decade of customer payments data from one mainframe system to another and the systems were based on radically different — and inconsistently applied — financial or customer route models.

(I know, modern network-based digital businesses that have grown up with the Web may not have such concerns, but I’m talking about the messy real world here.)

Distributed systems

Obviously today’s tools are much better and, perhaps even more important, there does seem to be a recognition that better data governance is one way to improve the ratio of data analysis to data cleanup time. Also, the distributed ledgers in blockchain systems require synchronized and compatible data.

(Author’s note: See Linking Up Blockchain and Data Integration by David Linthicum.)

Better data governance

From the same release cited at the top of this post:

Nearly a third (32 percent) of respondents’ organizations are planning or researching a formalized data governance program, and nearly 20 percent (19.4 percent) are in the early stages of rolling out their governance programs, primarily with the goal of ensuring that everyone is working with consistent data.

Down and dirty

As important as I think better data governance is, there’s no real substitute for getting “down and dirty” with the data, regardless of how sophisticated the planned analysis is going to be. You have to “run your fingers” through the data and get a feel for it — hence the “data janitor” reference above.
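What does that look like in practice? A minimal sketch, assuming a pandas workflow and a hypothetical survey extract, of the kind of quick inspection I mean:

```python
import pandas as pd

# "survey_responses.csv" is a hypothetical extract, not a file from the
# study mentioned above.
df = pd.read_csv("survey_responses.csv")

print(df.shape)                    # how much data is there, really?
print(df.dtypes)                   # did numbers arrive as strings?
print(df.head(10))                 # eyeball some raw rows
print(df.isna().sum())             # where are the holes?
print(df.describe(include="all"))  # ranges, outliers, odd categories
print(df.duplicated().sum())       # silently duplicated records
```

None of that is analysis yet. It’s just how you find the missing values, stray formats, and duplicate records before they find you.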

For aNewDomain, I’m Dennis D. McDonald.

Cover image: iStart.co.nz

An earlier version of this story ran on Dennis D. McDonald’s DDMCD site. Read it here.