15 Nov Data science capability anti-patterns and remedies
At Inawisdom we are perfectly placed to see trends in the data science community – we are growing exponentially so we need to hire and grow/train, and we also see our customers growing out their own data science teams. We’ve seen some of our customers run into challenges, and even come across some customers who have completely abandoned their own attempts to grow and sustain an internal data science capability. So I thought it would be interesting to share a few perspectives on this. After all – surely everyone needs their own internal data science team, right? Data is the new oil…etc etc (insert your favourite marketing hype here!).
I don’t see this as a backlash against the promises of data science hype …it’s more of an execution problem. Like most things that are worth doing, it can be difficult!
So what are the failure patterns that we see? This could be a long list, but these are some of the key ones…
Lack of ability to implement and realise the benefits
This is #1 for sure. It’s not a palatable message for a data scientist. Their internal value system is often coupled with the accuracy of the models they produce, but it’s way more important to have a so-so accuracy machine learning model in production, adding real business value, than a great machine learning model sitting on the shelf. I’d say that the main organisational design flaw that I see is organisations who have invested in a data science-biased capability but virtually no data engineering/DevOps/cloud skills. These are essential to deliver those models into production, keep them trained as they drift over time, and ensure they are sufficiently monitored with operational alerting etc. It’s obvious, but true.
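To make the point about drift and monitoring a little more concrete, here is a minimal sketch of one common way to notice that live data has moved away from the training data: the population stability index (PSI). It is an illustration only, not a description of how any particular team does it. The function names, the choice of 10 bins and the 0.2 alert threshold (a frequently quoted rule of thumb) are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training-time sample and a live sample of one feature or score."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the training range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the training range are clamped into the end bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    p_expected = proportions(expected)
    p_actual = proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p_expected, p_actual))


def needs_retraining(psi: float, threshold: float = 0.2) -> bool:
    """Assumed alerting rule: flag the model once PSI exceeds the threshold."""
    return psi > threshold


# Hypothetical data: a training sample and a live sample that has shifted upwards.
training_sample = [i / 100 for i in range(100)]
live_sample = [0.5 + i / 200 for i in range(100)]

psi_unchanged = population_stability_index(training_sample, training_sample)
psi_shifted = population_stability_index(training_sample, live_sample)
```

In practice a check like this would run on a schedule against production inputs and feed the operational alerting mentioned above, which is exactly the data engineering and DevOps work that a data science-only team tends to lack.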
…and dare I say, PhD snobbery? It comes with the territory really – many data scientists have come from academia, so understandably they see any business data exploitation challenge through that lens. This is a plus and a minus. At Inawisdom we deliberately have a mix of PhD and non-PhD data scientists. The scientific method and experimental design are more strongly baked into the academic-leaning individuals, which is a massive asset – it really is. However, this can also be a hindrance – “great” can become the enemy of “good”.
The bottom line is…er…we are working to impact the bottom line (for our customers). So it’s not about the cleverest model for bragging rights. Lose sight of this in a commercial environment and you fail. Deep learning is probably the best example of this – in some circles there is an unwritten implication that if you’re not working on a complex LSTM with an attention mechanism, then you’re a nobody. This is not the Inawisdom philosophy! Bragging rights come from the results delivered for customers. That’s not to say we don’t embrace cutting-edge neural network architectures where they are the best way to solve a customer need. In summary – horses for courses!
Not having a process
Data science is a process. If you don’t know where to start, look at CRISP-DM. Processes encapsulate best practice and protect us from being lured into working in the wrong direction. There are many seductive and well-intended side-investigations that a data scientist can burn time on…and your business stakeholders will not have infinite patience. This is related to two other failure patterns.
The first is a failure to adequately scope out the “exam questions” that are being worked on, including critical success factors and how they will be measured. You wouldn’t start a software development project without some idea of scope, however agile it was, so why would you do it for a data science initiative? When I hear that a data science team is “being allowed to play with the data to see what they can do” it’s generally a concern. The second process area to get right is use case prioritisation.
Months/years (!) of discovery
Related to the point made above, it is not always about a hyper-accurate model. Some things in life are just not highly predictable, but the crucial point is that they are predictable enough for the application of a machine learning model in a business process to significantly move the dial in terms of revenue, profit or some other important KPI. For example, many organisations have a workflow process where their staff just plough through a worklist on a FIFO (“first in, first out”) or gut-feel basis.
It’s often a relatively simple optimisation to prioritise this list using machine learning and therefore maximise the impact of the staff you have at your disposal. This is especially true for worklists that have a time-expiry nature to them (as most of them do) and where the human resources you can bring to bear are so limited that you have to accept you cannot process the whole worklist. A great example of this is sales leads, where the lead “goes cold” if you don’t get to it quickly, and there are more leads, however tenuous, than your sales team has time to address.
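The sales lead example can be sketched in a few lines. This is a toy illustration under stated assumptions: the `Lead` class, the `p_convert` scores (standing in for the output of some trained classifier), the deal values and the capacity of two leads are all made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Lead:
    lead_id: str
    arrival_order: int  # position in the FIFO queue
    p_convert: float    # hypothetical model score: probability the lead converts
    deal_value: float   # hypothetical estimated value if it converts


def expected_value(lead: Lead) -> float:
    return lead.p_convert * lead.deal_value


def fifo_worklist(leads: List[Lead], capacity: int) -> List[Lead]:
    """Baseline: work the list in arrival order until capacity runs out."""
    return sorted(leads, key=lambda lead: lead.arrival_order)[:capacity]


def prioritised_worklist(leads: List[Lead], capacity: int) -> List[Lead]:
    """Work the leads with the highest expected value first."""
    return sorted(leads, key=expected_value, reverse=True)[:capacity]


def total_expected_value(worklist: List[Lead]) -> float:
    return sum(expected_value(lead) for lead in worklist)


leads = [
    Lead("A", 1, 0.05, 1000.0),
    Lead("B", 2, 0.10, 500.0),
    Lead("C", 3, 0.60, 2000.0),
    Lead("D", 4, 0.25, 4000.0),
    Lead("E", 5, 0.02, 800.0),
]
capacity = 2  # the team can only reach two leads before the rest go cold

fifo = fifo_worklist(leads, capacity)
ranked = prioritised_worklist(leads, capacity)
```

Here the FIFO approach spends the team's limited time on leads A and B, while the ranked list picks C and D. The scores do not need to be highly accurate for this to help; they only need to order the list better than arrival time does.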
Data science is a broad church, including shallow learning (e.g. classic scikit-learn classification and regression), deep learning for image processing, NLP, Bayesian/probabilistic modelling, time series prediction, reinforcement learning etc. The point is that no data scientist can know it all, and just by career experience they naturally end up specialising. The dangerous logic here is to say “I can only hire very few people in my fledgling data science team, so I’ll hire someone with all these skills”. This is a flawed strategy for a number of reasons. Hiring one person who does all that is like trying to find a God particle.
Firstly, those people generally don’t exist. Secondly, if you can find and hire them, they are ruinously expensive. And finally, they are very hard to retain, as they are the data science equivalent of hen’s teeth. The reality is that you cannot cover all these areas until you get to a certain scale (and therefore a number of people). This was a challenge for us in the early days of Inawisdom, just like it is for our customers. However, we are now well past that point and have sufficient scale to have all the skills we need in our team.
Data scientists have a hunger for solutions and thrive on problem-solving. A good way to hack off your data science team is to ask them to spend a year working on the same problem. Of course this can also be very engaging, i.e. drilling into a problem very, very deeply. But simply put, interesting work = interested people!
This is where, as a professional services organisation, I feel that we have an edge. Due to the nature of our business model, we see a very broad range of data science problems across many industry sectors and domains, in fact some would say frighteningly so! It’s not dull, and even applying similar data science approaches whilst translating them from one business domain to another is very rewarding. Every customer’s data opportunities are unique to them and so is their data – which is less the case when performing data science as part of an internal team in an end-user organisation.
Bring me some solutions!
That’s enough about the challenges – let’s talk solutions! What strategies do we recommend?
- Design your sourcing model – If you think it will be differentiating for your organisation, build your own data science team. Whilst you are on this multi-year journey, add in “bridge” capability (skills, experience and capacity) from organisations like ourselves. There’s nothing wrong with that, as long as it’s delivering business value. In fact, we’d go further and say as long as it is delivering business value and easily washes its face in any ROI analysis.
- Focus on impact – It’s better to have a 70% accurate model that is adding business value than a 73.2% accurate model on the shelf – keep this in mind and bias your processes towards impact and implementation. The critical success factor is impact, not accuracy or perfection. And it’s the integral of impact really, i.e. £ impact per day multiplied by time – so the earlier you get it in production, the more £/$ pile up! Perhaps the key thing though is the learning gained from being in production, so you find where the gold is buried via tuning and refinement from evidence – not from untested hypotheses. It doesn’t have to be risky or cavalier. For example, A/B testing and interleaving allow model outputs to be tested in a controlled way.
- Prioritise – This is a whole subject in itself. Suffice to say that it’s all about being focused, with benefit-assessed business-executable outputs for a stakeholder who cares enough to deliver the associated business changes into the organisation.
- Be full-stack – Ensure you have a delivery team (however it is sourced) that provides “full-stack” AI/ML capability. You need this full pyramid of skills:
- talented data scientists who can cover the range of modelling opportunities that you need
- data engineering to make it happen
- the DevOps capability to keep it happening with a high degree of automation and repeatability (including run-time model evaluation and retraining), and also keep the operational burden minimised
- the cloud skills to architect and deploy it in a secure and maintainable way
- Keep it interesting – If you are building a new data science team, make sure you have enough interesting use cases to work on, and sufficiently mature (quality, catalogued etc) and accessible data, to keep the team interested.
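The “integral of impact” idea under “Focus on impact” can be shown with some simple arithmetic. The sketch below assumes a constant impact per day once a model is live, which is a simplification, and every figure in it is hypothetical: a so-so model that goes live on day 30 and a slightly better one that takes until day 120 to ship.

```python
def cumulative_impact(daily_impact: float, go_live_day: int, horizon_day: int) -> float:
    """Total impact by horizon_day: impact per day multiplied by days in production."""
    live_days = max(0, horizon_day - go_live_day)
    return daily_impact * live_days


def break_even_day(
    early_daily: float, early_go_live: int, late_daily: float, late_go_live: int
) -> float:
    """Day on which the later, better model catches up with the earlier one."""
    if late_daily <= early_daily:
        raise ValueError("the later model never catches up")
    return (late_daily * late_go_live - early_daily * early_go_live) / (
        late_daily - early_daily
    )


# Hypothetical figures, in £ per day.
good_enough = {"daily_impact": 1000.0, "go_live_day": 30}
more_accurate = {"daily_impact": 1045.0, "go_live_day": 120}

first_year_good_enough = cumulative_impact(
    good_enough["daily_impact"], good_enough["go_live_day"], 365
)
first_year_more_accurate = cumulative_impact(
    more_accurate["daily_impact"], more_accurate["go_live_day"], 365
)
catch_up_day = break_even_day(
    good_enough["daily_impact"],
    good_enough["go_live_day"],
    more_accurate["daily_impact"],
    more_accurate["go_live_day"],
)
```

With these made-up numbers the earlier model is well ahead after the first year, and the more accurate one needs 2,120 days to catch up. The sketch also ignores the learning gained from being in production, which would normally let the earlier model improve in the meantime.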