How to build the AI products of the future
Over the last six months, we have seen new innovations in artificial intelligence (AI), particularly in large language models, almost every week. AI is already impacting nearly every aspect of our society in ways we are just beginning to understand, and its role in our lives will likely continue to grow as it evolves. However, a lack of AI-worthy (large, high-quality) data is a major barrier to AI adoption in the vast majority of enterprises outside of consumer tech. Data availability, accessibility, and quality are core issues that most enterprises face when it comes to using their data for AI products. One of the major ways to overcome this is to build systems that allow enterprise users to generate their own AI-worthy data at scale. This means enabling subject-matter experts to create, engineer, and curate core datasets that can power AI products. Today’s enterprise architecture is a bottleneck to creating such AI-worthy datasets. In this article, I explain why, and what we can do to avoid this bottleneck.
Introduction to data centric architectures
The way we build technology today is use-case driven: we identify the business case (application), identify the data that may provide answers (products & services) to this business case, and build out a full stack of technology infrastructure (think databases, messaging layer, libraries, user interfaces, etc.) to solve it. There is a different, much more efficient way of building technology products, and it involves data centric architecture.
Data centric architecture puts data at the core of its design, functionality, and decision-making process. This means enterprise organizations can eliminate data silos, improve data quality, governance, security, scalability, flexibility, and ultimately make data-driven decisions regarding their products and services. Below are five key principles your company can implement to enable data centric architecture.
- Data as central asset: data is the core asset of your company; and therefore, you put significant resources into curating the best, most accurate, highly labeled, interoperable, scalable data across the enterprise. Applications are considered transient components that come and go while your core data stores are what withstand the test of time.
- Zero data silos: data accessibility is a fundamental principle within your company; you take pride in making your highly curated high quality data easily and readily accessible to everyone within your organization for analysis, reporting and decision-making. Data integration is seamless with internal and external data — ensuring scalability and flexibility throughout the data lifecycle.
- Data quality is paramount: you ensure the highest data quality by implementing processes and mechanisms for efficient data capture, cleaning, validation, and standardization. You build or implement tools to enable your subject matter experts to do this at scale. You build company-wide dashboards to monitor and alert changes in data quality continuously.
- In built data governance and security: as a company, you have built data governance and data security into your workflows and processes; you have policies and controls to ensure data privacy, compliance and protection against unauthorized access and breaches. Data lifecycle management is part of the governance plan — and it ensures data is managed effectively throughout its lifecycle, optimizing storage, performance, compliance (retention, archiving, and data disposal processes). Again, you have built company-wide dashboards to monitor and alert changes in governance and security.
- Data-driven decision making: organizational decisions are based on up-to-date information powered by your data; and you regularly experiment with and adopt advanced analytics, machine learning, and artificial intelligence to derive insights and drive business outcomes.
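To make the third principle concrete, here is a minimal sketch in Python of the kind of automated validation a data-quality dashboard could be built on. The record fields and rules (id, age, email) are hypothetical, chosen only for illustration; a real enterprise would encode its own domain rules and wire the aggregate metrics into monitoring and alerting.

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of data-quality issues found in a single record.

    The checks below are illustrative rules, not a prescribed standard.
    """
    issues = []
    if not record.get("id"):
        issues.append("missing id")
    age = record.get("age")
    if age is not None and not (0 <= age <= 130):
        issues.append(f"age out of range: {age}")
    email = record.get("email", "")
    if email and "@" not in email:
        issues.append(f"malformed email: {email}")
    return issues


def quality_report(records: list[dict]) -> dict:
    """Aggregate per-record issues into the kind of metric a dashboard would chart."""
    flagged = {}
    for i, record in enumerate(records):
        issues = validate_record(record)
        if issues:
            # Fall back to a positional key when the record has no id.
            flagged[record.get("id") or f"row-{i}"] = issues
    total = len(records)
    return {
        "total": total,
        "flagged": len(flagged),
        "error_rate": len(flagged) / total if total else 0.0,
        "issues": flagged,
    }
```

The point is not the specific checks but that validation is codified once, runs continuously, and produces numbers (here, an error rate) that can be monitored and alerted on company-wide.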
There are a few fundamental differences between application centric AI product development and data centric AI product development. In an application centric world, integrating external data with internal enterprise data is extremely difficult and costly; data integration today consumes 35–50% of the technology budget in most enterprise companies. This is because data exists in a wide variety of heterogeneous formats, structures, meanings, and terminologies, and munging, normalizing, and standardizing it can be resource heavy. Every new product comes with its own set of data requirements, and once it is built, change can be prohibitively expensive. Adding new features or upgrading legacy products to more advanced technology is necessary, but it can end up costing companies millions. And because data is tied up in various applications throughout the enterprise, making data-driven decisions takes non-trivial effort, and monitoring product and system performance requires even more.
In contrast, in data centric AI product development, internal and external data are either already integrated or readily integrated, as new data integration comes at a negligible cost. This globally integrated data sharing a common meaning can be expanded as new products need to be developed, building upon the core data stores that can be exported and shared in any needed format. This keeps the cost of building, upgrading and scaling applications reasonable. And while new applications may be born and old ones die, the data underneath powering them lives on.
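As a toy illustration of that last point, the sketch below keeps one canonical set of records and serializes it on demand into whatever format a consumer needs, here JSON and CSV using only the Python standard library. The field names are hypothetical:

```python
import csv
import io
import json

# One canonical core data store (illustrative records and fields).
canonical = [
    {"customer_id": "c-001", "region": "EMEA", "lifetime_value": 1250.0},
    {"customer_id": "c-002", "region": "APAC", "lifetime_value": 980.5},
]


def to_json(records: list[dict]) -> str:
    """Export the canonical records as a JSON document."""
    return json.dumps(records, indent=2)


def to_csv(records: list[dict]) -> str:
    """Export the same records as CSV, deriving the header from the data."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

Applications that need JSON, CSV, or any other representation read from the same store; adding a new export format touches one function, not every system that holds a copy of the data.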
Key differences between application-centric AI product development and data-centric AI product development.
Moving from application centric AI product development to data centric AI product development requires a cultural shift and a change in mindset. Traditionally, companies have considered data a necessary component of building products and services at best, or a by-product or support element of their products and services at worst. Companies need to start thinking about data as a product or a service in and of itself: it is as important to invest in the data as it is in their core products and services.
This cultural shift comes with significant advantages for forward-looking companies. A data centric AI product development approach simplifies data management, and therefore product development, in several ways:
- Centralized data management: data centric architectures place data at the core of the IT systems and enable a centralized approach to data management. This simplifies data governance, integration, quality, and dissemination. It reduces complexity and increases transparency and efficiency. It also improves data accessibility in and across the enterprise.
- Unified data model: adopting a data centric architecture involves establishing a unified data model that serves as a common framework for data representation and integration. This unified model simplifies data integration efforts by providing a standardized structure and meaning for data across different applications and systems. It enables seamless data sharing and eliminates the need for complex data mapping and transformation processes.
- Enhanced collaboration and efficiency: working within a common data framework means better collaboration among different teams and stakeholders. This means a much more agile product development process, shortening the build-measure-learn cycle and enabling building decoupled systems that are easier to maintain and can evolve faster without disrupting the underlying data infrastructure.
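The unified data model idea can be sketched as follows: two hypothetical source systems (called "crm" and "billing" here) describe the same entity with different field names, and a single declarative mapping onto one shared model replaces ad hoc pairwise transformations between every system. All names are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """The unified model: one agreed-upon representation of a customer."""
    customer_id: str
    full_name: str
    country: str


# One mapping per source system, from unified field -> source field.
# With N sources this needs N mappings, not N*(N-1) pairwise converters.
FIELD_MAPS = {
    "crm":     {"customer_id": "cust_no", "full_name": "name",      "country": "country_code"},
    "billing": {"customer_id": "acct_id", "full_name": "acct_name", "country": "iso_country"},
}


def normalize(source: str, raw: dict) -> Customer:
    """Map a raw record from any known source onto the unified model."""
    mapping = FIELD_MAPS[source]
    return Customer(**{target: raw[src] for target, src in mapping.items()})
```

Once both systems' records normalize to the same `Customer`, downstream teams can compare, join, and share data without knowing which system it came from.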
Future outlook on data centric AI
As AI models become more ubiquitous and open-sourced, what will differentiate companies is the quality of the data that enables customization to their specific use case and context. As Andrew Ng aptly put it: companies with consistent and accurately labeled output data, representative and high-quality input data, and data that reflects post-deployment changes (“real world” data) will win in the future. Rather than constantly changing the model (finding better models or fine-tuning existing ones), companies will want to focus on better engineering (tuning) their data so that existing models can deliver increased performance. The popularization of concepts like DataOps, an emerging practice focused on streamlining and automating data operations to ensure the timely delivery of high-quality data for analytics and product development, is a signal that the industry is already moving in this direction. As more AI models become open sourced and commoditized, creating core custom data assets is what will drive the AI products of the future.
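One small example of tuning the data rather than the model, in the spirit of the label-consistency point above: auditing a labeled dataset for inputs that received conflicting labels from different annotators. This is an illustrative sketch with made-up example data, not a prescribed DataOps workflow.

```python
from collections import defaultdict

def find_label_conflicts(examples: list[tuple[str, str]]) -> dict:
    """Group (input, label) pairs by input and return inputs with more
    than one distinct label, i.e. the records to re-review and fix."""
    by_input = defaultdict(set)
    for text, label in examples:
        by_input[text].add(label)
    return {text: sorted(labels) for text, labels in by_input.items() if len(labels) > 1}
```

Fixing the conflicts this surfaces improves every model trained on the dataset, present and future, which is exactly the leverage the data centric approach is after.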
***
If you enjoyed this post, please clap / love / like and share. If you’re excited about building AI products, reach out to me on LinkedIn here.