10% of the data holds 90% of the value: four reasons why
Often, 10% of the data contains 90% of its value for solving the problems we need solved. This means that we can waste enormous resources cleansing, migrating, and managing the other 90%, with little to show for it. What’s worse, we can miss the valuable 10%, which is a major obstacle to achieving our objectives.
There are at least four reasons for this:
- Operational data has different quality requirements than analytical data, yet many organizations treat them the same. For instance, a telecommunications company I worked with used billing data, in the form of call detail records (CDRs), both operationally and to support decision-making (such as deciding where to build new parts of its network). For operational purposes, every CDR mattered, because each represented billable revenue. However, to support network buildout decisions, the company only needed enough data to understand the trend. This is a familiar concept to statisticians, who understand, for instance, that polling data can be used effectively to predict an election. But you’d be surprised how many organizations say “the decision is only as good as the data”, by which they mean that they must do the equivalent of polling every voter. This is not necessary for many use cases.
- When you “connect the dots” between data and benefit (e.g. revenue), some data fields have a big impact, and some not so much. So, for instance, the color of your competitor’s product may not turn out to matter much to your competitive position. But their price does. This is an obvious example, but there are many not-so-obvious ones as well. The principle here is that without some knowledge of how the data combines with decisions to impact business objectives, it’s impossible to tell the highly important data from the less important data. Many organizations stovepipe their data people, so that they don’t have access to this bigger picture, and ask their teams to “just make it all as clean as you can”. This can be a tremendous waste of resources. For instance, I worked with one organization that spent 18 months trying to find keys to join about a hundred tables reflecting customer data. When we did a preliminary analysis for this organization, we learned that a join between just two tables would give them much of the insight that they required.
- Decisions are not based on data alone. They also draw on human expertise, things you may not yet have thought to measure, and intangible factors like reputation and brand. For this reason, looking to data as the sole basis for decision support is a bit like looking for your keys under the lamppost. Again, if we work backwards from business outcomes using a decision intelligence approach, we often find that the most valuable information needed to support a decision is not in existing data stores.
- In many environments, there is no data for your situation, because it’s brand-new. Here, we need to generate “data from the future” (see also Mark Zangari speaking on this topic). In addition, we can apply techniques like inductive transfer learning, which I invented and wrote about in Learning to Learn.
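To make the first reason concrete, here is a minimal sketch of how a sample can recover a trend about as well as the full data set. Everything in it is an assumption for illustration: the records are synthetic (not real CDRs), the linear growth in call duration is made up, and the names `make_records` and `trend_slope` are hypothetical helpers rather than anything from the projects described here.

```python
import random


def make_records(n, rng):
    """Generate synthetic (month, duration) records with an upward trend.

    Assumption for illustration: average duration grows by 0.5 per month,
    with random noise on every record.
    """
    records = []
    for _ in range(n):
        month = rng.randrange(12)
        duration = 5.0 + 0.5 * month + rng.gauss(0, 2.0)
        records.append((month, duration))
    return records


def trend_slope(records):
    """Least-squares slope of duration against month."""
    n = len(records)
    mean_x = sum(m for m, _ in records) / n
    mean_y = sum(d for _, d in records) / n
    cov = sum((m - mean_x) * (d - mean_y) for m, d in records)
    var = sum((m - mean_x) ** 2 for m, _ in records)
    return cov / var


rng = random.Random(42)  # fixed seed so the run is repeatable
all_records = make_records(100_000, rng)

# Keep a random 10% of the records, like polling a subset of voters.
sample = rng.sample(all_records, len(all_records) // 10)

full_slope = trend_slope(all_records)
sample_slope = trend_slope(sample)
print(f"slope from all records: {full_slope:.3f}")
print(f"slope from 10% sample:  {sample_slope:.3f}")
```

In this sketch the two slopes should land close to each other, which is the statistician's point: billing needs every record, but a trend estimate usually does not.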
One particular project that followed this pattern was with a large bank. When we joined the project, the bank had already built an extensive database containing dozens of tables. Yet this data wasn’t enough to solve an important use case. We used decision intelligence to analyze the bank’s needs, and discovered that its problem could be solved with a few dozen fields and a data set small enough to fit inside a spreadsheet.
There’s been a lot of excitement about big data. Data stores are now so big that we and our projects can get lost in them. So it’s important to remember that not all data is created equal, and that there are inexpensive ways to go about using it as well as unnecessarily time-consuming ones.