5 tips for improving your data science workflow


The biggest wastes in data science and machine learning don’t stem from inefficient code, random bugs, or incorrect analysis. They stem from flaws in planning and communication. Execution mistakes can cost a day or two to fix, but planning mistakes can take weeks to months to set right. Here are five ways you can avoid making those mistakes in the first place:

1. Set the right objective (function)

Mathematician and data analysis pioneer John Tukey said “an approximate answer to the right question is better than an exact answer to the wrong question.” Machine learning solutions work by optimizing toward an objective function, a mathematical formula that describes the value to be maximized or minimized. One of the most basic examples is a profit function: Profit = Revenue – Costs.

While machine learning algorithms excel at finding the optimal solution, they can’t tell you if you’re maximizing the right thing at the right time. Periodically check that your objective function reflects your current priorities and values. For example, an early-stage company may worry less about profitability and instead try to maximize revenue in order to increase market share. A company preparing for an IPO may want to demonstrate profitability, so it may focus on minimizing costs while maintaining the same level of market share. If you capture only the currently important metric (revenue) at specific points in time (quarterly), you will hinder your ability to predict against a new objective (profitability) at different times.
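As a minimal sketch of why the choice of objective matters, the snippet below scores the same three pricing options under a revenue objective and under the profit objective (Profit = Revenue – Costs). The option names, prices, volumes, and unit costs are made-up numbers for illustration only.

```python
# Hypothetical pricing options; every number here is invented for illustration.
options = [
    {"name": "low_price", "price": 10.0, "units": 1000, "unit_cost": 9.0},
    {"name": "mid_price", "price": 15.0, "units": 600, "unit_cost": 9.0},
    {"name": "high_price", "price": 25.0, "units": 200, "unit_cost": 9.0},
]


def revenue(option):
    """Objective for a company focused on growth: total revenue."""
    return option["price"] * option["units"]


def profit(option):
    """Objective for a company focused on profitability: revenue minus costs."""
    return revenue(option) - option["unit_cost"] * option["units"]


# The same candidates, ranked by two different objective functions.
best_for_revenue = max(options, key=revenue)
best_for_profit = max(options, key=profit)

print("Best for revenue:", best_for_revenue["name"])
print("Best for profit:", best_for_profit["name"])
```

With these invented numbers the two objectives pick different options, which is the point: the optimizer does its job either way, and only you can decide which objective is the right one today.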

Along those lines, data scientists can also fall into the trap of optimizing model metrics instead of business metrics. For example, they may use the area under a precision-recall curve or a receiver operating characteristic (ROC) curve to evaluate overall model performance, but those summaries don’t necessarily translate to business success. Instead, setting an objective like “Minimize false positives while maintaining a total false negative rate of X%” can be specific to your current business conditions, and can be used to weigh the specific costs of false positives and false negatives. Capturing event-level data before it is aggregated, and periodically re-examining the problem you’re trying to solve, will keep you moving in the right direction instead of optimizing for the wrong problem.
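One way to act on an objective like this is to choose the decision threshold directly from it. The sketch below is an illustration, not a prescribed method: it assumes the objective means "false negative rate at most X%", and the function name `pick_threshold`, the scores, and the labels are all hypothetical.

```python
def pick_threshold(scores, labels, max_fnr):
    """Return (threshold, false_positives, fnr) with the fewest false positives
    among thresholds whose false negative rate is at most max_fnr.

    A score at or above the threshold is predicted positive.
    Returns None if no candidate threshold satisfies the constraint.
    """
    positives = sum(labels)
    best = None
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
        fnr = fn / positives if positives else 0.0
        if fnr <= max_fnr and (best is None or fp < best[1]):
            best = (threshold, fp, fnr)
    return best


# Made-up model scores and true labels (1 = positive case).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

result = pick_threshold(scores, labels, max_fnr=0.25)
print(result)
```

Changing `max_fnr` as business conditions change moves the threshold without retraining the model, which keeps the business constraint explicit rather than buried in a summary metric.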

2. Get on the same page

To your business stakeholders, there’s a huge difference between “We saw a 100 point increase in accuracy in the test set of 100,000 examples” and “If we had these improvements in place, we would have saved $20,000 in the last business quarter.” “100,000 examples” and “100 point increase” are hard to visualize, whereas “$20,000” and “last business quarter” tend to be a lot easier for business stakeholders to grasp. Standardize your units of analysis so that your team and the business leaders spend less time translating and more time ideating.
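Translating model errors into dollars can be as simple as attaching a cost to each error type. The following is a rough sketch under stated assumptions: the per-error costs and the error counts are invented, and were chosen only so that the result lands on the $20,000 figure used in the example above.

```python
def dollar_cost(false_positives, false_negatives, cost_per_fp, cost_per_fn):
    """Total cost of a model's errors over a period, in dollars."""
    return false_positives * cost_per_fp + false_negatives * cost_per_fn


# Hypothetical per-error costs, agreed on with business stakeholders.
COST_PER_FP = 25    # e.g. wasted effort on a false alarm
COST_PER_FN = 100   # e.g. a missed case

# Hypothetical error counts for last quarter under each model.
old_cost = dollar_cost(400, 300, COST_PER_FP, COST_PER_FN)
new_cost = dollar_cost(300, 125, COST_PER_FP, COST_PER_FN)

savings = old_cost - new_cost
print(f"Estimated savings last quarter: ${savings:,}")
```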

The points in time that are critical can also differ by business stakeholder. A sales or customer success practitioner may need weekly, monthly, or event-based measures (e.g., first subscription event, renewal event, support request events), while a revenue leader may need models per business segment, sales rep, or product line on a quarterly or yearly basis. Collect data at the event level so that you can compute measures for any of these time frames as the need arises.
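To illustrate why event-level data is the flexible choice, the sketch below rolls the same small event log up two ways: by ISO week for a practitioner and by calendar quarter for a revenue leader. The events, dates, and helper names are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical event log: one row per event, nothing aggregated yet.
events = [
    (date(2021, 1, 4), "first_subscription"),
    (date(2021, 1, 6), "support_request"),
    (date(2021, 2, 15), "support_request"),
    (date(2021, 4, 2), "renewal"),
]


def week_key(day):
    """(ISO year, ISO week number) for a date."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def quarter_key(day):
    """(year, calendar quarter 1-4) for a date."""
    return (day.year, (day.month - 1) // 3 + 1)


# Two different views computed from the same raw events.
weekly = Counter(week_key(day) for day, _ in events)
quarterly = Counter(quarter_key(day) for day, _ in events)

print(dict(weekly))
print(dict(quarterly))
```

Had only quarterly totals been stored, the weekly view could not be recovered afterward; going from events to either view is a one-line aggregation.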

We’ve been on teams where train and test sets were left to the whims of the particular data scientist. Our analyses weren’t comparable with one another, and the model metrics we used were incomprehensible to the stakeholder. Once we standardized on business metrics and on time frames meaningful to the business (e.g., all deals from last quarter, subscription activity in the last month), it became easier to compare models internally and externally, and easier to present impactful business cases for the usage of our models.

3. Allow room for discovery

Data science is an inherently creative endeavor, and advancements in models often come from unexpected places. The biggest breakthroughs come from exploring new avenues and new opportunities. One of the beautiful things about data science is that it takes ideas and methods from a broad array of scientific disciplines. Algorithms developed for genetics are used to analyze literature, and methods for analyzing literature can be adapted to make romantic matches on a dating app or provide recommendations for a vacation.

Advances in solutions often come from looking at the same problem from a different angle or frame of reference. For example, some of the first models didn’t take into account demographic information. For a long time now, data scientists have understood that including demographic data may help ads reach the right person or measure unintended bias. Then, when the frame of psychology was introduced, data scientists began looking at the problem from a psychographic angle: Can demographics and demonstrated interest improve results? For example, adding in data about what someone shared on social media could provide a link to what they are likely to buy. Recently, event-based behavioral data, available in near real time, has entered the space, bringing both new information and time into the picture. For instance, very small gas station purchases followed minutes later by a very large TV purchase may signal a stolen credit card.
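The stolen-card pattern shows how time-ordered events carry signal that aggregates hide. Below is a deliberately simple rule-based sketch, not a production fraud model: the function name, the dollar thresholds, and the 10-minute window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def flag_small_then_large(purchases, small_max=5.0, large_min=500.0,
                          window=timedelta(minutes=10)):
    """Return large purchases that follow a small purchase within `window`.

    The thresholds and window are illustrative defaults, not tuned values.
    """
    ordered = sorted(purchases, key=lambda p: p["time"])
    flagged = []
    for i, current in enumerate(ordered):
        if current["amount"] < large_min:
            continue
        # Look back over earlier purchases for a small one close in time.
        for earlier in ordered[:i]:
            gap = current["time"] - earlier["time"]
            if earlier["amount"] <= small_max and gap <= window:
                flagged.append(current)
                break
    return flagged


# Hypothetical card activity.
purchases = [
    {"time": datetime(2021, 7, 1, 12, 0), "amount": 1.50, "merchant": "gas station"},
    {"time": datetime(2021, 7, 1, 12, 4), "amount": 899.00, "merchant": "electronics"},
    {"time": datetime(2021, 7, 2, 9, 0), "amount": 650.00, "merchant": "furniture"},
]

flagged = flag_small_then_large(purchases)
print([p["merchant"] for p in flagged])
```

The furniture purchase is just as large as a suspicious one, but nothing small preceded it within the window, so only the ordering and timing of events separate the two cases.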

While you don’t want to spend all your time running down rabbit holes and chasing wild geese, setting aside time to try new and creative solutions or explore different angles will pay off in new capabilities, better models, and faster time to results. Whether you set aside time every week to chase down new leads or build exploration tasks into your workflow, in the long run you’ll have happier scientists and better long-term results, because they are free to find new solutions and perspectives for the problems at hand.

4. Talk to your consumer

If you build a model without understanding your end user and the problems they’re trying to solve, your model will be missing vital context. Business leaders tend to view things from 50,000 feet, whereas your models are often deployed at ground level with sales reps. Conditions on the ground never fully match what is viewed from above, so if you only take into account what you can see at that higher level, you’ll miss out on vital information. We’ve spent months building models for business leaders, only to discover that the system we built to make life easier made things more difficult for the sales rep. We saved the company money, but we could’ve had a much bigger, faster impact if we had built systems that were more closely aligned with our end users.

There are countless little contextual things that your users take for granted, and without speaking to your customers and working to understand them, you’ll miss out on this critical context. Talking to your users will ensure that your models meet their needs. For example, a sales rep may be assigned to a territory and product line and expect the model they are given to reflect this nuance. A revenue leader is looking across all reps to forecast the business. The features that make a model predictive at a global level will not be the same as those at a more granular level. In addition, a revenue leader cares more about accurate forecasting at the start of a quarter and month, while a sales rep cares about when and what they can do to increase their success on a specific account. This context implies that you should build at least three different models, with features computed at different points in time, to increase accuracy and prevent leakage.
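Computing features "as of" a point in time is one common way to guard against leakage: each model only sees events that happened before its prediction time. The sketch below assumes a simple event log; the account name, event types, and the `features_as_of` helper are hypothetical.

```python
from datetime import date

# Hypothetical event log for one account.
events = [
    {"account": "acme", "date": date(2021, 1, 10), "type": "support_request"},
    {"account": "acme", "date": date(2021, 2, 20), "type": "support_request"},
    {"account": "acme", "date": date(2021, 4, 5), "type": "renewal"},
]


def features_as_of(events, account, cutoff):
    """Count an account's events that occurred strictly before `cutoff`.

    Events on or after the cutoff are invisible, so a model trained on
    these features cannot learn from the future.
    """
    visible = [e for e in events
               if e["account"] == account and e["date"] < cutoff]
    return {
        "support_requests": sum(1 for e in visible if e["type"] == "support_request"),
        "renewals": sum(1 for e in visible if e["type"] == "renewal"),
    }


# The same account, seen from two prediction times.
start_of_february = features_as_of(events, "acme", date(2021, 2, 1))
start_of_q2 = features_as_of(events, "acme", date(2021, 4, 1))

print(start_of_february)
print(start_of_q2)
```

The April 5 renewal does not appear in the features computed for the start of Q2, which is exactly what a forecast made on April 1 could have known.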

5. Optimal solutions tend to be suboptimal

Highly optimized solutions cost more to implement, cost more to maintain, and tend to be less flexible. Build simpler solutions whenever possible. Just because something is theoretically better doesn’t mean that it’s practically better. We were once working on a simple prediction-logging database so that we could debug and replicate production predictions. At first, we wanted a fancy serverless AWS Athena setup that wouldn’t require a constantly running database machine. We spent a day digging into Athena and trying to set it up before realizing that we had already spent more in payroll costs than a persistent cloud machine would cost to run for two years.

This ties in with “setting the right objective.” Optimized solutions are only optimized if your objective function is 100% correct and isn’t likely to change. When it does change, your highly optimized solution is likely to be optimized in the wrong direction, such as a model highly optimized to increase revenue and market share when the business needs to shift toward profitability. A solution that is slightly less optimized but more flexible, understandable, and adaptable will likely serve you better in the long run as priorities shift and as you better understand the costs associated with the problem space.

You’ll notice that many of these tips work together. In order to set the right objective function, you’ll want to talk to your consumer and get on the same page as your stakeholder. The ability to pivot your objective function to meet changing demands comes from building something flexible rather than a hyper-optimal solution to the local problem. And of course, allowing room for discovery enables the exploration of new potential optima or problem spaces. Your business and model problems will change over time; set yourself up for success not just today, but into the future. These changes won’t merely save you 5 or 10 minutes here or there; they will save you weeks of effort by minimizing the time spent building the wrong solutions.

Max Boyd is Senior Data Scientist at Tomo.

Charna Parkey is VP of Product at Kaskada.
