I regularly get asked how to apply machine learning and other advanced data methods to get the most out of a client’s data. The first thing I ask is “what are you actually trying to do?”, followed shortly by “have you looked at the data?” People then look at me strangely, as if I’ve said something confusing. Surely machine learning doesn’t need to understand the data; it just learns and gives me insight! Well, in reality you always need to start at the beginning: what is the goal, and what sort of data do you have? When it comes to understanding what sort of insight you can achieve, lots of data is not the same as big data. You can’t just buy an algorithm or a piece of software off the shelf, apply it to any data source and expect magic, even though this is often the sales pitch of many companies in this field.
What sort of data do you have?
So you may ask: what the hell is this guy talking about? Data is the future and I’ve got big data, so show me the insight!
Simply put, the first step is to understand what your feasible actions are. Insight is only relevant if it can create change. Why spend thousands, or millions, of pounds on systems to provide insight if that insight can’t change anything? For example, I have created systems that can detect an integrity failure, a failure that costs around a million euros to fix. However, there is actually no financial gain in preventing the failure, as the cost of repair is the same whether or not the equipment fails completely; so why spend the money detecting the problem? Well, in this case I could provide a solution relatively cheaply and, whilst the early warning only saves around fifty thousand euros by reducing the time lost to planning and resourcing, it was worth doing anyway.
It’s just data
What sort of data do you have? Where does it come from? Have you looked at it? Some data scientists don’t think of these as relevant starting questions, but the answers have a massive influence on how you approach the problem. Rather than just reaching for your favourite variant of data cleansing, categorisation and neural network, why not look at the data and try to understand what a user needs to know to be able to make an actionable decision?
Sometimes there’s no better solution than a simple threshold, and if you talk to the right people the position of the threshold is already known, no PCA required! And yes, with a bit of clever maths you can determine an actionable limit that’s 5% more accurate, but does anybody actually care? Sometimes the answer is no, so accept it! Get on with helping the client to use the answer in a functional, stable and clear manner that fits into their processes.
Simple thresholds can be functional if you understand your inputs
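As a rough illustration of how little code a threshold approach needs, here is a minimal sketch. The function name, the limit of 80.0 and the three-sample persistence rule are all assumptions made up for this example; in practice the limit would come from the people who know the equipment.

```python
from typing import Iterable, List


def threshold_alerts(
    readings: Iterable[float],
    limit: float,
    min_consecutive: int = 3,
) -> List[int]:
    """Return the indices where a reading has stayed above `limit`
    for exactly `min_consecutive` samples in a row.

    Requiring a short run of high readings (an assumed rule, not a
    universal one) stops a single noisy sample from raising an alert.
    """
    alerts: List[int] = []
    run = 0  # length of the current run of readings above the limit
    for i, value in enumerate(readings):
        run = run + 1 if value > limit else 0
        if run == min_consecutive:
            alerts.append(i)  # alert once per run, when it first qualifies
    return alerts


# Hypothetical temperature readings; 80.0 is an illustrative limit only.
readings = [70, 82, 85, 79, 81, 83, 84, 86, 70]
print(threshold_alerts(readings, limit=80.0))  # prints [6]
```

The only things to tune are the limit and the persistence, and both can be explained to an operator in one sentence, which is most of the point.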
OK, so you can’t solve the problem with simple pragmatism. What do you do next? The next question to ask is: where is the statistical significance? This takes us to a massively important point: where is my data big? I work in an industry with highly efficient, well-maintained, heavy-duty, fairly bespoke equipment. It does break, but it doesn’t break too often. When it does break, it rarely does so in the same way. At this point I will tell you that most big data algorithms that try to detect failure, when run against raw data from machines of this type, are absolutely useless! Why? Well, the data is small! There is no statistical significance to the failure data.
In the case where your failures are varied and few, you cannot look to your failures for insight.
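One quick way to ask “where is my data big?” is simply to count how many examples you have of each condition. The sketch below assumes every record already carries a condition label; the label names and the minimum of 30 examples are hypothetical choices for illustration, not a rule.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def where_is_data_big(
    labels: Iterable[str],
    min_examples: int = 30,
) -> Dict[str, Tuple[int, bool]]:
    """Map each label to (count, has_enough_examples).

    `min_examples` is an assumed cut-off; pick one that suits
    the method you intend to use.
    """
    counts = Counter(labels)
    return {label: (n, n >= min_examples) for label, n in counts.items()}


# Hypothetical history: plenty of normal operation, a handful of failures.
labels = ["normal"] * 1000 + ["bearing_failure"] * 2 + ["seal_leak"]
for label, (n, enough) in where_is_data_big(labels).items():
    print(f"{label}: {n} examples, enough to learn from: {enough}")
```

With a history like this one, only the normal-operation data is big; the failure classes are far too small to learn from directly.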
But you may say: my client wanted an all-singing, all-dancing fault and failure detection system that uses machine learning! I’m sorry to say it, but your client doesn’t understand their own data. You need to start backwards… Their big data is the good data!
When you have lots of data that describes machines operating correctly, this is where you have statistical significance. Where you can’t easily define failure, you have to begin with anomaly. Detect what’s good and highlight everything that isn’t!
Few systems exist that can do this effectively, as they need to be flexible, sustainable and scalable, and the output often requires interpretation. Expertise is crucial in understanding which anomalies are actionable and which are not.
These systems have to be multi-modal in nature, and deal with data-instrumentation issues regularly.
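The “detect what’s good and highlight everything that isn’t” idea can be sketched in a few lines for a single sensor. This is a deliberately simple stand-in, not the kind of multi-modal system described above: it assumes one signal, known-good history that is roughly symmetric around its mean, and a band of three standard deviations, all of which are assumptions for illustration.

```python
import statistics
from typing import List, Sequence, Tuple


def fit_normal_band(good: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Learn a (low, high) band from known-good readings only.

    The band is mean +/- k standard deviations; k = 3 is an assumed default.
    """
    mean = statistics.fmean(good)
    spread = statistics.stdev(good)  # sample standard deviation
    return mean - k * spread, mean + k * spread


def flag_anomalies(readings: Sequence[float], band: Tuple[float, float]) -> List[int]:
    """Return the indices of readings that fall outside the learned band."""
    low, high = band
    return [i for i, value in enumerate(readings) if not (low <= value <= high)]


# Hypothetical readings taken while the machine was known to be healthy.
good = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
band = fit_normal_band(good)

# New readings of unknown health.
new = [10.1, 9.9, 12.5, 10.0, 7.0]
print(flag_anomalies(new, band))  # prints [2, 4]
```

Note that the failure data is never used: the band is learned entirely from normal operation, and deciding whether a flagged reading is actionable or just an instrumentation glitch still needs someone who knows the machine.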
So in summary, lots of data does not mean big data, and data science doesn’t have to be complicated. Be pragmatic, and use your own intelligence first. You will often find that, by asking some simple questions, you can determine your own insight: insight which your run-of-the-mill data science algorithm would never actually achieve!