Welcome to the trending topics update
IIA is in the fortunate position of working with some of the most influential Fortune 1000 analytics teams. On a daily basis we partner with these teams to accelerate progress on their most pressing projects and initiatives. Collectively, we’ve worked with these analytics professionals on a couple thousand different topics over the years and thought it could be helpful to the broader analytics community to begin sharing some of them.
Last time, we reviewed the following two topics:
- Reconciling different workstyles and behaviors across the data science and data engineering teams
- How an organization accustomed to one-off analytics project work can evolve toward a product-portfolio way of thinking
Now, for this iteration, without further ado…
TOPIC #1:
With ML becoming more prevalent as a way to provide insights into business problems, the gap in responsibilities and expectations between your data analysts and data scientists is widening. Can you still develop talent internally, and if so, how?
We tend to see this come up with our clients on two fronts:
- As part of an HR effort to establish and manage hierarchical structures within larger analytics organizations for hiring, promotion and retention purposes
- When the analytics organizations themselves are trying to map applicable skills to problem sets in service of work throughput, i.e., who can and should be doing what?
These types of more advanced analytics organizations may be better off thinking of analysts and scientists as separate teams rather than a single one, with the former not necessarily being the farm team for the latter. Data Scientists focus on longer-term, impactful projects, whereas Data Analysts tend to be more concerned with one-off projects and ad hoc analysis. Data Scientists are strong in statistics, predictive modeling techniques, and, obviously in this scenario, ML techniques, whereas Data Analysts are strong in business intelligence, reporting, and dashboarding.
The Data Scientist works on a product for the end user — the end product for a Data Analyst is used internally.
Even the best Data Analysts might not be a good fit for an ML team because they are not usually accustomed to thinking about the user experience. Interview internal Data Analyst candidates as you would external candidates, and look for people with product insight, product sense, and some customer-facing experience. This could be participation in user-testing sessions or reviewing input from customers, for example.
Moreover, a Data Scientist is more than a Data Analyst with additional requirements. Data analysis is only about 40% of the skill set needed to be a Data Scientist. A few of the key skills to look for in an internal Data Scientist candidate (e.g., a Data Analyst you are looking to bump up) who will work on a machine learning project include:
- Knowledge of Python (as opposed to R, which many analysts prefer)
- Understanding of ML algorithms and ability to use libraries
- Creativity — many analysts come from more repetitive positions
- Data knowledge — an organization’s data comes with a steep learning curve, so existing familiarity with it is a point in their favor
TOPIC #2:
JUSTIFYING THE FREQUENCY OF MODEL TRAINING
We recently had a client deploying an automated, predictive service-prioritization model every week, training on the past four months of data and refreshing it each week with the newest week’s data (dropping the oldest week). It raised this question: how much model retraining is the right amount?
Tests that check for stability can also tell you whether it’s worth the effort to refresh every week. In addition to comparing the refreshed model to the previous week’s model, compare it to the previous year’s model. If the models are effectively the same, you might question why it would be necessary to train 52 times in the next year.
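As a rough illustration, the comparison can be as mechanical as scoring this week’s candidate model, last week’s model, and last year’s model on the same holdout set and asking how much the metric actually moves. The sketch below is a minimal example assuming scikit-learn-style classifiers and an arbitrary lift threshold; it is not the client’s pipeline.

```python
# Minimal sketch: does weekly retraining buy anything over older models?
# Assumes scikit-learn-style classifiers; the 0.002 AUC lift threshold is
# an illustrative assumption, not a recommendation.
from sklearn.metrics import roc_auc_score

def retraining_adds_value(candidate, last_week, last_year,
                          X_holdout, y_holdout, min_lift=0.002):
    """Score three models on one holdout and report whether the fresh
    candidate meaningfully beats the week-old and year-old baselines."""
    scores = {
        name: roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
        for name, model in [("candidate", candidate),
                            ("last_week", last_week),
                            ("last_year", last_year)]
    }
    # If the candidate barely beats a year-old model, 52 retrains a year
    # is hard to justify on performance grounds alone.
    lift_vs_year = scores["candidate"] - scores["last_year"]
    return lift_vs_year >= min_lift, scores
```

If the candidate rarely clears the threshold against the year-old baseline, a monthly or quarterly cadence may be worth piloting before committing to 52 automated deployments a year.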
Additionally, a model trained on only four months of data has no seasonal knowledge: it will know nothing about winter when winter arrives, which is a problem if your business has seasonal effects.
Safeguards:
In general, it’s good practice to retrain models to avoid drift over time, as long as you have the appropriate safeguards in place. The following are some of the checks and balances to think about when your refreshed model deploys itself automatically:
- If you’re retraining every week, you need robust monitoring, but monitoring is pointless if you can’t block deployment when something goes wrong (see the sketch of a deployment gate after this list).
- Use feature importance techniques to examine the most impactful inputs to the model. If the list changes significantly on a regular basis, the model is relying on fundamentally different pieces of information from week to week. Even if the model’s effectiveness remains the same, investigate, because important corner cases might be impacted.
- Look at data sparsity. Powerful models, such as gradient boosting machines (GBMs), trained on too little data can be unstable and behave erratically.
- Perform a risk assessment of the model creation process as a whole. Enumerate the risks and how to mitigate them. Ask yourself questions such as: What if key data doesn’t load properly? What if data with the wrong timestamps is used?
- As a best practice, the person who trained the model should also write the tests for the model, similar to software development.
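To make the first three points concrete, here is a minimal sketch of a pre-deployment gate that checks holdout performance, feature-importance drift, and training-set size, and blocks the automated release if any check fails. The thresholds and helper names are illustrative assumptions, and the importance check assumes a tree ensemble (such as a GBM) that exposes feature_importances_.

```python
# Minimal sketch of a pre-deployment gate for an auto-refreshed model.
# Thresholds are illustrative assumptions; adapt them to your own service.
import numpy as np
from sklearn.metrics import roc_auc_score

MIN_AUC = 0.70           # absolute floor on holdout performance
MAX_RANK_SHIFT = 3       # how far a top feature may move before we investigate
MIN_TRAIN_ROWS = 50_000  # guard against training a GBM on too little data

def importance_ranks(model, feature_names):
    """Map each feature to its importance rank (0 = most important)."""
    order = np.argsort(model.feature_importances_)[::-1]
    return {feature_names[i]: rank for rank, i in enumerate(order)}

def deployment_gate(candidate, previous, X_holdout, y_holdout,
                    feature_names, n_train_rows, top_k=5):
    """Return (ok_to_deploy, reasons); block the release if any check fails."""
    failures = []

    # 1. Monitoring with teeth: refuse to ship a model below the floor.
    auc = roc_auc_score(y_holdout, candidate.predict_proba(X_holdout)[:, 1])
    if auc < MIN_AUC:
        failures.append(f"holdout AUC {auc:.3f} is below the floor of {MIN_AUC}")

    # 2. Feature-importance drift: flag big moves among last week's top features.
    new_ranks = importance_ranks(candidate, feature_names)
    old_ranks = importance_ranks(previous, feature_names)
    top_features = sorted(old_ranks, key=old_ranks.get)[:top_k]
    for feat in top_features:
        shift = abs(new_ranks[feat] - old_ranks[feat])
        if shift > MAX_RANK_SHIFT:
            failures.append(f"feature '{feat}' moved {shift} ranks week over week")

    # 3. Data sparsity: a powerful model on too little data behaves erratically.
    if n_train_rows < MIN_TRAIN_ROWS:
        failures.append(f"only {n_train_rows} training rows in the window")

    return len(failures) == 0, failures
```

If the gate returns False, the automation should leave the previous week’s model in production and surface the failure reasons for a human to review, rather than deploy first and investigate afterward.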
These topics may not track exactly with what you're working on, but hopefully it's interesting to see what others are up to in the small community of analytics professionals we all work in.
Till next month!
— Doug