The hardest part of data science

So what’s the hardest part of data science?

Mmm – well, hyper-parameter tuning can be a bit tricky. Algorithm selection for best results is as much an art as a science. Feature extraction and selection is a real skill, and probably makes a bigger difference to accuracy. NLP needs a whole new set of skills. And don’t get me started on the importance of data cleansing and preparation… or having good model evaluation mechanisms. Unfortunately for the technically excited, it’s none of these. The biggest barrier we see to the exploitation of artificial intelligence approaches, and specifically machine learning, is…

…not getting started in the first place

Ready…steady…stop

Why would that happen then? Here are some reasons:

  • Expectation management – it is important to “sell” the benefits in a genuine way. It’s a research and investigative process, with many iterations. The outcome is not known or guaranteed at the beginning, and that’s fine – but everyone needs to understand what the project is and isn’t.
  • Distraction and focus – there may be a potential prize, but if your business stakeholders’ trousers are on fire for ten other reasons, then your nice R&D project isn’t going to get very high up their agenda, regardless of its validity, business case or strategic importance.
  • Buy-in – I understand that it takes fewer muscles to smile than it does to frown, but with innovation initiatives like the application of predictive analytics to business data it’s the opposite: it’s a lot easier to say no (no-one got sacked for risk-aversion – well, not quickly anyway :)) than yes (which requires some vision).
  • Information security – there is no way around it: if you want to learn more about the relationships and patterns in your business data, and even make near-real-time predictions from it, then you… er… need access to your business data. Typically we start by using heavily anonymised data to eliminate any PII risks. This is not because we don’t have the information security answers, but so that we don’t have to have the conversation, process and overhead that they imply. Travel light – move fast. But the time will rapidly come when the case has been made that more business value and insight will flow from using all the data. This is all possible of course, but it increases the chances of a stakeholder throwing in a blocker or a delay. At the risk of generalising (ML pun intended!), what we tend to find is that business leaders “get it” – their starting position is “I’m sold on the potential value, explain to me what I need to do to unlock it” – whilst still appreciating that security must be a first-class, non-negotiable consideration.
  • Resources – someone needs to extract the data for you, open up that API, give you a DB connection string, and so on. And they are probably three weeks behind on their to-do list already. That someone is generally close to the data layer – a DBA, or a SysAdmin controlling file shares – and organisations often have single points of failure in these positions. Related to the earlier point, one strategy we employ is anonymisation of data sets to minimise information security concerns, and this too generates a small additional task for the already overworked DBA.

Pulling it all together, the potential barriers all mean one thing – time. All these issues are surmountable, but energy has to be consumed to overcome them, and this takes time, bleeding your pet project of momentum and energy.

Bring me a solution, not a problem!

OK, enough moaning 🙂 – what do we do about it?

Like many things in life, much comes back to expectation management. We need to be honest about what challenges might need to be overcome in a predictive analytics initiative and tackle them head-on, rather than hoping they won’t come up.
Don’t oversell what AI is going to do for your customer – i.e. it’s not magic; it’s still “garbage in, garbage out”, and it’s definitely “no data in, no data out”. That’s not to say that I don’t think it’s amazing – I do! Just spend some time looking at these Google AI experiments if you are not convinced, and imagine if even a fraction of these capabilities were brought to bear on the 99% of enterprise data sets. This is just the beginning – take a look at the ambitions of the likes of Elemental Cognition for where the research agenda is heading.

We need to minimise the initial information security and data extraction effort. Our approach is that it’s better to have less data earlier than more data later. Once the enterprise appetite is whetted and it’s clear that more investment will be repaid with magnified return, it is massively easier to go back and get more, and “deeper”, data sets. So our data anonymisation strategies and pre-secured RAMP platform as a data destination are critical here.
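To make the anonymisation idea concrete, here is a minimal sketch of one common technique: salted one-way hashing of direct-identifier columns before the data leaves the source system. This is an illustration only, not our actual implementation – the column names, salt handling and row layout are all assumptions for the example.

```python
import hashlib

# Assumption for illustration: the salt stays in the source environment,
# so the data science team receives only irreversible tokens.
SALT = "replace-with-a-secret-salt"

def pseudonymise(value: str, salt: str = SALT) -> str:
    """Salted one-way hash: the same input always yields the same token,
    so joins across tables still work, but the raw value cannot be
    recovered without the salt."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

# Hypothetical extract with a mix of PII and non-PII columns.
rows = [
    {"customer_id": "C1001", "email": "alice@example.com", "spend": 120.0},
    {"customer_id": "C1002", "email": "bob@example.com", "spend": 85.5},
]

PII_COLUMNS = {"customer_id", "email"}

# Hash the identifier columns; leave the analytical values untouched.
anonymised = [
    {k: (pseudonymise(v) if k in PII_COLUMNS else v) for k, v in row.items()}
    for row in rows
]
```

Because the tokens are deterministic, the overworked DBA can run this once at extraction time and the resulting data set can still be joined and aggregated downstream – which is what makes “less data earlier” workable in practice.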

And last but not least – find the right stakeholder. At some level of abstraction everything is a change management project, and this is no different. Find out who really cares about the outcome. Clue – generally (but not always, as it is use-case dependent) look in the business community, not IT.

Robin Meehan
robin@inawisdom.com