Xeno Acharya

This is a 5-part mini-series (with a bonus Part 0 on desirability) on my experience building AI/ML and data products over the past decade. If you have not read Part 0 (desirability) or Part 1a (the first section on viability), it would be worth going back to read these first, as these things do have some order to them. Restating here that some may find what I say here to be agreeable, others heretical; however, I hope all of you do find some value in applying these principles in your own product development journey. Whatever your take, I’d love to hear your feedback and comments.

Now that you have a clear problem statement, you know what needs to be solved, you know who you need to solve it for, and you know that solving this problem is in fact a lucrative endeavour for you and your company, you need to assess whether this is actually an AI/ML problem. The three main things you need to assess to ensure that this is in fact an AI/ML problem to solve are:

AI/ML tractability

Data availability

Ethics

AI/ML tractability

Tractability in machine learning refers to whether a problem is solvable within a reasonable time frame using available computational resources. For our purposes, AI/ML tractability isn't something as abstract as "is Artificial General Intelligence (AGI) tractable"; here I simply mean whether this is a problem where applying AI/ML will outperform other methods. There are two dimensions to this. The first is that applying AI/ML to a problem simple enough to be solved with non-AI/ML methods, such as traditional statistics, can be like using a sledgehammer to crack a nut: the overkill wastes time and resources. If there are well-established, well-documented, well-founded methods that have been widely used for decades to solve the problem well, it is best to go with a non-AI/ML solution. Likewise, if statistical models explicitly define the relationships between variables, allowing a clear understanding of how factors influence outcomes, it is best to go with a non-AI/ML solution.
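To make the "sledgehammer" point concrete, here is a minimal sketch (entirely synthetic data, illustrative only) of the check I have in mind: fit a plain least-squares baseline first, and if a simple, interpretable model already explains the outcome well, reaching for AI/ML is likely overkill.

```python
import numpy as np

# Synthetic example: three explanatory variables with a genuinely
# linear relationship to the outcome, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Ordinary least squares via numpy; the coefficients are directly
# interpretable, unlike most ML model internals.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

pred = X1 @ beta
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r2, 3))  # near 1.0: a linear model suffices, skip the ML
```

If the baseline's fit is poor and you suspect nonlinear or high-dimensional structure, that is a signal the problem may warrant AI/ML, which is exactly the second dimension below.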

It makes sense to apply AI/ML solutions if the problem exhibits specific patterns in any of these five domains.

Structure

Patterns in structure such as data being completely unstructured (text, images, videos, audio) or semi-structured (JSON or XML files, for example).

Relationships

Patterns in relationships such as nonlinear relationships between input variables (predicting house prices based on location, amenities, market trends, etc.) or data with many interdependent features that contribute to outcomes (high-dimensional relationships) where manual exploration is infeasible (genomic data analysis).

Dynamics

Input data is temporal or sequential (determining pre-existing conditions during medical claims) or evolving (product recommendations based on user behaviour).

Distribution

Data constitutes hidden groupings that aren’t obvious or directly measurable (customer segmentation based on purchasing behaviour) or patterns that emerge from multimodal input data (self-driving cars taking sensor and camera data to make driving decisions).

Variability

Input data is noisy or inconsistent but patterns need to be extracted nonetheless (anomaly detection), or probabilistic patterns with inherent uncertainty (weather forecasting).

The above rule of thumb, like all heuristics, has exceptions.
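As one illustration of the Variability case above, here is a hedged sketch (synthetic sensor readings, an illustrative threshold) of extracting anomalies from noisy data using a robust z-score; a production anomaly detector would be more sophisticated, but the underlying idea of separating signal from noise is the same.

```python
import numpy as np

# Noisy but well-behaved readings, with three anomalies injected
# at known positions so we can see the method recover them.
rng = np.random.default_rng(1)
readings = rng.normal(loc=50.0, scale=2.0, size=500)
readings[[10, 200, 321]] = [90.0, 5.0, 120.0]

# Median and MAD are robust to the outliers themselves, unlike
# the mean and standard deviation.
median = np.median(readings)
mad = np.median(np.abs(readings - median))
robust_z = 0.6745 * (readings - median) / mad

anomalies = np.flatnonzero(np.abs(robust_z) > 5)
print(anomalies)  # indices of the injected outliers
```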

The second dimension is more nuanced, and changes over time. Historically, many problems we consider solved or nearly solved today were considered intractable: machine learning techniques that worked, or were correct, were unreasonably slow, and techniques that were fast did not always work or were not always correct. Over time, however, we have arrived at reasonably good solutions for complex (seemingly intractable) problems such as search, language reasoning, games, and so on. This kind of tractability is dynamic and depends on the available technology. Unless you are building cutting-edge AI/ML products or working in AI research, this kind of tractability is less meaningful in building AI/ML products.

Image by Gerd Altman, Pixabay

Data availability

AI/ML products are data hungry. If the data is insufficient in quantity, quality, diversity, accessibility, or testing capabilities — AI/ML may not be a viable solution.

Quantity of data

AI/ML solutions typically require large amounts of data for training and testing. The question to ask is whether there is enough data to train a reliable model. Depending on the task at hand and the quality of the labels, supervised learning can require hundreds, thousands, or millions of labelled samples. Unsupervised learning can work with smaller datasets but still benefits from more data. If you are building a product for a dynamic system (a recommendation engine, for example), having a way to generate new data or update it regularly is important.

Quality of data

Data must be accurate (correct, precise, consistent) for it to be useful: typos and mistakes will drastically degrade model performance. Data must be complete: significant gaps (missing values, for example) can lead to limited or biased outputs. There are various ways to handle data that is missing completely at random, missing at random, or missing not at random, which we will not go into here. Data must be relevant to the problem you are trying to solve: this one is obvious; using web traffic data to predict customer churn may not be effective without any transactional data. And finally, for supervised or semi-supervised learning, labelling the data correctly is important for robust outputs.
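A minimal, hypothetical example of the completeness check: quantifying missingness per feature before deciding how to handle it. The feature names and values here are made up purely for illustration.

```python
import numpy as np

# Toy dataset: rows are records, columns are features, NaN marks
# a missing value.
data = np.array([
    [1.0,    200.0,  np.nan],
    [2.0,    np.nan, 0.3],
    [np.nan, 180.0,  0.1],
    [4.0,    210.0,  0.2],
])
features = ["age", "income", "churn_score"]  # illustrative names

# Fraction of missing values per feature; a high fraction suggests
# imputation, collection fixes, or dropping the feature.
missing_frac = np.isnan(data).mean(axis=0)
for name, frac in zip(features, missing_frac):
    print(f"{name}: {frac:.0%} missing")
```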

Diversity of data

AI/ML models generalise best when trained on diverse data that reflects real-world use cases. Data must have variety, i.e. it must cover all relevant scenarios, edge cases, and user types, and it must be representative of the target population. For identifying rare cases, either the sample size must be large or oversampling techniques must be used to ensure representativeness.
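A small sketch of the oversampling idea, assuming binary labels where the positive class is rare; real projects would likely reach for a library such as imbalanced-learn, but the core move is just resampling the minority class with replacement.

```python
import numpy as np

# Imbalanced toy labels: 95 negatives, 5 positives.
rng = np.random.default_rng(42)
labels = np.array([0] * 95 + [1] * 5)

minority_idx = np.flatnonzero(labels == 1)
majority_idx = np.flatnonzero(labels == 0)

# Randomly resample minority indices (with replacement) until they
# match the majority count, then combine.
resampled = rng.choice(minority_idx, size=len(majority_idx), replace=True)
balanced_idx = np.concatenate([majority_idx, resampled])

print(np.bincount(labels[balanced_idx]))  # balanced class counts
```

Note that naive oversampling duplicates examples rather than adding genuinely new information, which is why the non-technical levers (more sources, better collection) listed below matter just as much.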

Data diversity is important because it plays a critical role in managing the balance between bias and variance, ultimately impacting model performance and generalisation. Diverse data helps reduce bias by providing a more comprehensive representation of the problem space: when a model is trained on diverse data, it is less likely to make oversimplified assumptions. Equally, a more diverse dataset can help reduce variance by exposing the model to a wider range of examples, making it less sensitive to small fluctuations in the data. This is especially relevant for large neural networks, which typically have low bias but high variance; training on large, diverse datasets greatly reduces variance, allowing these models to fit complicated functions while maintaining good generalisation. In ensemble learning, diversity helps reduce overall error by counterbalancing individual model weaknesses.

Diversity of data can be achieved by drawing on multiple different sources, oversampling underrepresented groups, using transfer learning and open datasets, and supplementing with synthetic data where needed. There are also technical ways to address diversity, such as ensemble methods (bagging, boosting) and regularisation, but these sit more in the realm of data science or the actual engineering of ML models. The important thing to keep in mind in product is to ensure that data diversity is monitored and addressed appropriately.

While we are on this topic, you should definitely check out an article Datum’s CTO Ravi Bajracharya wrote about Data Centric Architecture (DCA) here. If you want your data to be your long term asset and company moat, adopting DCA principles is the way to go.

AI Ethics

There has been a lot written about AI Ethics — so I will keep this brief. We already discussed potential for bias and underrepresentation in the data section — AI models amplify the biases present in the training data, so ensuring that the data available is as fair and representative as possible is important. Of course when using data for any kind of AI training, sensitive data needs to be given proper privacy and data protection considerations. There are laws like HIPAA and GDPR for this very reason, and as a product manager, it is your job to ensure that your product adheres to them. AI/ML models can have vulnerabilities just like any system, so making sure they are protected against adversarial attacks or being prepared for this before your product is deployed is helpful. Data breaches have affected several large companies in the past few years — there are design and engineering choices you should discuss with your engineering team to ensure these are minimised. Model transparency and explainability are important factors to consider when your product is making life-critical decisions (particularly in health and safety or finance) — there are again design choices and technical methods your team can use to ensure this, if these are critical for your industry.

Just like any AI/ML implementation, your product could contribute to the gap between those who have access to technology and those who do not, thus increasing discrimination and marginalisation; it could contribute to job displacement; it could be misused or weaponised with unintended consequences; or it could have long-term societal impact by increasing over-reliance on AI systems. These what-if scenarios and doomsday stories are all plausible, but beyond staying aware of them I don't have solid suggestions for avoiding them at this point.

The take home message on AI ethics is to have a checklist and go through this during your design and discovery phase, and have a mechanism to monitor potential issues at each stage of your development journey. And if any issues do arise, to acknowledge them, own them, and address them as soon as possible. This honesty and accountability, above all, is the responsibility of the product manager.

The next post (Part 2) will talk in a bit more detail about data since it sits at the core of any AI/ML application.

♻️ If you like what I write, please reshare with your network.
