9 investors discuss hurdles, opportunities and the impact of cloud vendors in enterprise data lakes

The Evolution of Big Data and the Rise of Data Lakes
About ten years ago, I was discussing big data with a colleague. We both assumed at first that handling very large datasets was mainly a concern for major companies such as Facebook, Yahoo, and Google, not for most businesses.
That assessment proved wrong. It quickly became apparent that dealing with large volumes of data would become a universal necessity. Large quantities of data also turned out to be the essential fuel for machine learning applications, a development neither of us anticipated.
Early Frameworks and the Emergence of Data Lakes
Frameworks such as Hadoop and Spark began to appear, and the data warehouse concept evolved alongside them. That approach worked well for structured data, such as credit card details.
However, traditional data warehouses were not equipped to handle the unstructured data required for building machine learning algorithms. Consequently, the concept of the data lake emerged as a solution for storing unprocessed data until it was needed.
Unlike the organized structure of data warehouses, data lakes offered a more flexible and raw storage approach.
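The contrast between the two storage models can be made concrete with a small sketch. The Python below is illustrative only: the record fields, the `read_with_schema` helper and the in-memory list standing in for a lake are assumptions made for this example, not any vendor's implementation. Raw records are kept exactly as they arrived, and structure is applied only when a consumer reads them.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no upfront schema).
RAW_LAKE = [
    '{"user": "a1", "amount": "19.99", "note": "gift"}',
    '{"user": "b2", "clicks": 7}',
    '{"user": "c3", "amount": "5.00"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time; skip records that lack a required field."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            # Keep only the requested fields, cast to the requested types.
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows

# A consumer interested in payments defines the structure it needs.
payments = read_with_schema(RAW_LAKE, {"user": str, "amount": float})
```

A different consumer could read the same raw lines with a different schema, for example one built around the `clicks` field, without anything being rewritten in storage.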
Cloud Vendor Adoption and Investment
This evolving concept soon attracted the attention of major cloud providers, including Amazon, Microsoft, and Google. Furthermore, it sparked interest from investors, leading to the establishment of significant companies like Snowflake and Databricks, built upon the data lake principle.
Simultaneously, entrepreneurs began identifying and addressing related challenges. These included efficiently moving data into the data lake, cleaning and processing it, and then directing it to applications and algorithms capable of utilizing it.
During this period, data science transitioned from primarily academic research to becoming a more integrated function within businesses.
A Modern Ecosystem and Investor Perspectives
This evolution resulted in a completely new, modern ecosystem. With such developments, new ideas surfaced, companies were founded, and investment capital flowed into the sector.
We consulted with nine investors to gain insights into the appeal of the data lake concept, the role of cloud companies in this space, how investors identify promising new companies in a maturing market, and the opportunities and challenges within this profitable field.
Investors Consulted
The following investors contributed their perspectives:
- Caryn Marooney, general partner, Coatue Management
- Dharmesh Thakker, general partner, Battery Ventures
- Casey Aylward, principal, Costanoa Ventures
- Derek Zanutto, general partner, CapitalG
- Navin Chaddha, managing director, Mayfield
- Jon Lehr, co-founder and general partner, Work-Bench
- Peter Wagner, founding partner, Wing Ventures
- Nicole Priel, managing director, Ibex Ventures
- Ilya Sukhar, partner, Matrix Partners
Exploring Startup Opportunities in the Data Lakes Landscape Amidst Established Players
Caryn Marooney: The data market presents substantial opportunities, fueled by the drive to derive value from digital transformation initiatives. Both data lake and data warehouse architectures will remain crucial, as they address distinct requirements.
Organizations with pre-existing, extensive data infrastructure may find a complete migration to a data warehouse costly and protracted. For such entities, a data lake offers a viable solution, providing flexibility and the ability to conduct federated queries across diverse data sources.
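As a rough illustration of what a federated query means, the sketch below uses SQLite's `ATTACH DATABASE` from Python's standard library as a stand-in. This is an assumption made for the example: production federated engines query remote, heterogeneous systems, and the table and column names here are hypothetical. The point is only that one query can join data living in two separate stores without first copying it into one.

```python
import sqlite3

# Two independent in-memory databases stand in for separate data sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)", [(1, 40.0), (1, 10.0), (2, 25.0)]
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)

# One query spans both sources without consolidating them first.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.total)
    FROM crm.customers AS c
    JOIN main.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```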
Dharmesh Thakker: Companies like Databricks and Snowflake have become prominent in the data lake and warehouse sectors, respectively. However, evolving technical demands and business needs require continuous investment and innovation from both to sustain their competitive positions.
Irrespective of future developments, we are optimistic about the burgeoning ecosystem surrounding these key players, given the widespread data proliferation occurring across cloud and on-premise environments, and the variety of data storage solutions available. Significant opportunities exist for vendors to emerge as “unification layers” connecting data sources with various user groups – data scientists, engineers, analysts, and others – through integration middleware, real-time analytics, data governance, security, and monitoring solutions. These markets hold considerable potential.
Casey Aylward: Despite the presence of established cloud infrastructure providers, several significant opportunities remain within the data lake space:
- Convergence between business intelligence/analytics/SQL and machine learning/code (Scala or Python) is possible, though these areas cater to different users with varying skillsets and preferences. Architectural lock-in is a major concern for users regarding cloud providers, storage, and compute. Solutions promoting flexibility will be essential.
- Reprocessing data on each platform as it moves is currently inefficient and expensive. Technology that lets data move without rewriting transformations, pipelines, and procedures presents a valuable opportunity.
- We are witnessing increased adoption of general data processing frameworks beyond MapReduce, particularly within the Python data science community. This represents a shift away from Hadoop or Spark, which may not be optimal for unstructured data or modern algorithms.
Derek Zanutto: The increasing adoption of data lake models is creating a substantial and rapidly expanding market alongside the more mature data warehouse model exemplified by Snowflake and major cloud vendors. Data lakes empower enterprises to gain insights from a wider range of data types – both structured and unstructured – for a broader spectrum of applications, from historical reporting to predictive analytics.
While data lakes offer numerous advantages, emerging challenges, such as data reliability and query performance, must be addressed to accelerate enterprise adoption. Companies developing solutions to these pain points will be well-positioned to capture a significant share of the profit pool generated by the data lake model.
Navin Chaddha: The widespread use of enterprise data lakes is fostering a new generation of data-first applications, requiring robust tooling for data privacy, integrity, governance, and access management. This trend also emphasizes the importance of workflows for data engineering and science teams, including data quality/observability and process automation.
Jon Lehr: DataOps represents a substantial opportunity for startups operating in the data lakes space. Data governance, lineage, and the creation and delivery of data features are critical for machine learning and remain largely unaddressed by companies like Snowflake or other established leaders. DataOps and MLOps are essential for the successful scaling of machine learning operations.
Peter Wagner: The traditional distinctions between data lakes, data warehouses, and databases have become blurred due to overlapping capabilities and marketing considerations.
Within the broader cloud data landscape, we see promising opportunities to expand the ecosystems around major platforms like Snowflake and AWS. Upsolver, a data preparation product enabling performant data lakes using commodity cloud storage, is a prime example. Pepperdata offers observability and optimization for cloud data.
The rise of machine learning as a strategic workload also presents compelling opportunities. Startups can create specialized tooling and infrastructure for ML workloads and their unique data types, such as Pinecone’s ML vector database and Truera’s model intelligence platform.
The “cloud-prem” model, where managed cloud services execute partially within a customer’s own infrastructure, is gaining traction due to economic and data privacy benefits. Hydrolix utilizes this model to disrupt the observability data market, while Upsolver employs it for cloud-native data preparation, minimizing data movement.
Finally, we are interested in specialized data platforms tailored to specific industries or enterprise functions, built on top of horizontal platforms like Snowflake, incorporating domain-specific integrations and knowledge. Segment and SetSail exemplify this trend.
Snowflake’s success demonstrates the value of a well-designed, managed service. Ease of adoption, scalability, and consumption-based pricing are key attributes of successful cloud data platforms.
Nicole Priel: Companies like Snowflake have effectively introduced elastic storage and compute, along with pay-as-you-go pricing, to the data lake space. However, challenges remain in ensuring employees understand the data within the warehouse and can easily access it. Companies like Panoply are innovating in the “last mile” of data analytics, bridging the gap between business users and the data lake with no-code automation.
Ilya Sukhar: Directly competing with major cloud providers, Snowflake, or Databricks at this stage would be challenging. These products are experiencing rapid growth and intense competition. While a next generation of products may emerge, adoption cycles may not be favorable for new entrants.
A fruitful area for innovation lies in storing and organizing unstructured data, such as images and videos, used in machine learning processes. Existing approaches have limitations.
I am focusing on companies building products one layer above the warehouse/lake, leveraging the “modern data stack” enabled by tools like Fivetran and dbt. The ability to gather and store all enterprise data is now simple and affordable.
This shift enables new applications and an inevitable disruption of existing categories like business intelligence. We anticipate the emergence of data-enabled applications previously unimaginable.
What are the biggest challenges for startups entering the data lake market currently, and how can they be addressed?
Caryn Marooney: A significant hurdle is the presence of established, successful companies already operating within this market. However, this also presents an opportunity for startups to integrate their offerings with these existing leaders. New ventures can focus on building solutions that complement current platforms, fostering a more extensive data ecosystem around the query engine.
Dharmesh Thakker: A key misstep for some startups is failing to establish partnerships with larger platforms like Snowflake and Databricks early in their market entry strategy. We believe major vendors – including AWS, Azure, and GCP – are actively seeking to expand the data infrastructure ecosystem through collaboration with third-party platforms, rather than attempting to control the entire technology stack. Companies should proactively seek opportunities to integrate with relevant platforms and collaborate on sales efforts.
Maintaining a self-sufficient business model is vital, but startups should be aware that larger platforms may prioritize co-selling with vendors they are more familiar with. Investing in these relationships from the outset can yield substantial benefits and create synergistic opportunities.
Casey Aylward: A primary challenge for startups lies in the proliferation of standards designed to lock users into solutions optimized for specific compute engines. While storage costs are decreasing, data processing has become the central competitive arena. Cloud data warehouses currently dominate SQL-based analytics, and that functionality is now expanding into the data lake space through companies like Databricks and Starburst. To thrive, startups should concentrate on serving specific data processing use cases, differentiating themselves from these broader platforms.
Derek Zanutto: Startups face the task of guiding enterprises through a fundamental shift in data management. They must persuade buyers, accustomed to 40 years of data warehouse procurement, to reconsider their approach to data storage and analytics. The data lake category is also intensely competitive, with startups often contending against established data platforms that control the underlying object storage infrastructure.
Success in the data lake market requires a focus on product depth rather than breadth. Companies specializing in “best of breed” technology for a specific use case are more likely to achieve scalability. From a go-to-market perspective, successful startups address concrete, real-world pain points with their solutions. Securing small, paid pilot projects demonstrating tangible business impact is crucial, and these successes can then be leveraged for further expansion.
Furthermore, open-sourcing the core technology can be a powerful strategy for achieving widespread adoption and building brand recognition. It also allows startups to offer solutions compatible with multi-cloud and hybrid cloud environments, which appeals to chief data officers who want flexibility and want to avoid vendor lock-in.
Navin Chaddha: Here are three common pitfalls for data startups today:
- Underestimating integration work: with no open standard, startups must integrate with multiple data lake solutions.
- Drifting into a services business instead of prioritizing strong product differentiation.
- Failing to articulate clearly how they differ from established vendors and to demonstrate return on investment to potential customers.
Jon Lehr: Differentiation is paramount. Startups entering the data lake market must clearly define their competitive advantage: Are they faster? More cost-effective? Do they provide greater value? Or do they offer unique capabilities, such as graph data management? Databricks, for example, initially distinguished itself by offering a more affordable and efficient Spark implementation on AWS compared to EMR.
Peter Wagner: A major challenge for new entrants is market clutter. Data teams are bombarded with similar claims, making it difficult to gain attention. Elegant product definition and design that facilitates genuine product-led growth are key to breaking through the noise.
Achieving strategic alignment and collaboration with major cloud data providers is also critically important. Startups need to identify key partner platforms and tailor their product and go-to-market strategies accordingly.
Nicole Priel: Ongoing challenges within the data lake space center around data reliability and query performance.
How do the leading cloud providers influence the data lake market with their respective offerings?
Caryn Marooney: Major cloud vendors represent a significant force in any discussion concerning software infrastructure. Their substantial resources allow them to consistently present competitive products and deploy them extensively. However, history demonstrates that opportunities remain for inventive startups to thrive by providing novel software and superior customer support.
Dharmesh Thakker: Their impact is considerable, though customer responses differ. These large cloud providers aim to be the definitive source for all storage and computational requirements within the data environment. Therefore, understanding their proposed value and product roadmap is crucial for any organization beginning its exploration of this field. AWS, Azure, and GCP possess the capacity to substantially reduce costs for customers, but two primary drawbacks exist when collaborating with these large entities.
Considering the widespread distribution of data across both cloud and on-premises systems, clients frequently seek a platform-independent solution that functions across multiple cloud providers, rather than being restricted to a single one. Furthermore, aligning a customer’s toolkit with a single cloud vendor can ultimately result in vendor lock-in, granting the provider significant control over pricing and infrastructure needs.
For smaller organizations seeking a simplified approach, a complete stack from a cloud vendor may be beneficial. However, as clients evolve, they often prioritize solutions that operate across diverse platforms.
Casey Aylward: Cloud vendors generate revenue through their data lake storage solutions, such as Azure’s Data Lake Storage and Amazon’s S3. Owning the foundational layer of the stack positions them logically to integrate additional value through data management and processing tools, utilizing either SQL or code-based methods. Their ownership and monetization of storage enable them to offer competitive pricing on other features and facilitate rapid distribution.
Within the open-source community, emerging data formats like Apache Arrow are poised to significantly lower data transfer costs and enhance interoperability across platforms and programming languages. This technology serves as a key enabler in preventing platform-specific lock-in.
Derek Zanutto: The major cloud vendors are developing and marketing comprehensive ecosystems centered around their data lake offerings. Currently, they primarily provide the fundamental object storage infrastructure upon which data lakes are constructed, alongside data integration tools for data ingestion. They also offer a range of data services, including data science notebooks and federated SQL query engines, to make data within the lake accessible to users. This holistic platform approach provides enterprises with a “one-stop shop” for all data services, often at attractive price points due to the bundling of services with core infrastructure expenses.
Moreover, their services integrate seamlessly with other technologies within their cloud portfolios, presenting a compelling value proposition for many enterprises. Consequently, cloud vendors have profoundly impacted the data lake market. To effectively compete, data lake companies must prioritize genuine product innovation and differentiation, potentially through verticalized solutions tailored to specific industry needs.
Navin Chaddha: The leading cloud vendors have, in many respects, already commoditized the infrastructure market for data lakes. They continue to add value by integrating deeply with applications and incorporating business intelligence and visualization tools.
Jon Lehr: The lower costs and established credibility of large cloud vendors are driving a shift from on-premise solutions to cloud-based ones. Currently, a primary challenge for the data lake market is efficiently transferring data to the cloud. However, this transition creates opportunities for best-in-class solutions once data resides in the cloud.
Peter Wagner: The major cloud vendors present a dual nature – they are both enablers of the cloud data opportunity and formidable competitors. Snowflake has successfully navigated this dynamic, building a thriving business that simultaneously serves as a customer and competitor to the major cloud service providers.
The success of these major cloud data players also fosters ecosystem development. Their achievements have revealed unmet needs that startups are actively addressing. Examining the traditional enterprise data market reveals that these niches – and new ones – will be filled by cloud-native companies with modern architectures and more appealing business models. Companies that effectively address ecosystem needs for key partners can achieve substantial market growth.
Nicole Priel: When considering data lake storage, many companies, particularly those starting out or planning a migration, are evaluating Snowflake and Google BigQuery. It is vital that any startups within the data ecosystem ensure compatibility with these platforms.
Ilya Sukhar: They all recognize the strategic importance of the data warehouse (or data lake, depending on the preferred terminology). Consequently, they are effectively competing on price and bundling the warehouse as part of larger agreements with significant clients.
Exploring Startup Opportunities Beyond Data Lakes: Governance, Preparation, and Management
The landscape surrounding data lakes is expanding, with numerous adjacent services focused on data governance, preparation, and management. What potential exists for startups within these evolving markets?
The Growing Importance of Data Foundations
Caryn Marooney: The opportunities in data preparation, management, and governance are still in their nascent stages. As the data ecosystem develops, adoption of these solutions will increase, particularly for large organizations where scalability is paramount.
Dharmesh Thakker: A robust ecosystem has already formed around data lakes and warehouses, manifesting as “unification layers.” This presents a significant opportunity for startups. Companies like Matillion, Fivetran, Streamsets, and dbt have successfully built solutions for replicating data from cloud applications such as Salesforce, Workday, and NetSuite, and then transforming it for analytical purposes.
Companies such as Confluent and Rockset are concentrating on real-time, streaming analytics. Furthermore, once data reaches its destination, data cataloging, lineage tracking, governance, security, and monitoring become crucial for extracting actionable insights from data silos.
Casey Aylward: A unified layer above existing repositories could empower users across an organization, even those without technical expertise, to access data. Collaborative notebooks, due to their language-agnostic nature, appear to be a promising solution.
Data reliability is also critical for operational applications, especially in areas like artificial intelligence and machine learning. Monitoring data with the same rigor applied to core infrastructure and applications is a substantial opportunity.
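To give a sense of what monitoring data the way one monitors infrastructure might look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `check_batch` helper, the field names, the `loaded_at` timestamp and the thresholds are hypothetical and not drawn from any product mentioned here. It checks a batch of records for two common reliability signals, missing values and staleness.

```python
from datetime import datetime, timedelta, timezone

def check_batch(rows, required_fields, max_null_rate, max_age, now):
    """Return a list of human-readable problems found in a batch of records."""
    if not rows:
        return ["batch is empty"]
    problems = []
    # Completeness: flag fields whose share of missing values is too high.
    for field in required_fields:
        nulls = sum(1 for row in rows if row.get(field) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            problems.append(
                f"{field}: null rate {rate:.0%} exceeds {max_null_rate:.0%}"
            )
    # Freshness: flag the batch if even its newest record is too old.
    newest = max(row["loaded_at"] for row in rows)
    if now - newest > max_age:
        problems.append("data is stale")
    return problems

now = datetime(2021, 1, 2, tzinfo=timezone.utc)
batch = [
    {"id": 1, "price": 9.5, "loaded_at": now - timedelta(hours=30)},
    {"id": 2, "price": None, "loaded_at": now - timedelta(hours=29)},
]
alerts = check_batch(
    batch,
    ["id", "price"],
    max_null_rate=0.25,
    max_age=timedelta(hours=24),
    now=now,
)
```

In practice, checks like these would run on a schedule and feed the same alerting channels used for application and infrastructure incidents.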
Addressing Key Pain Points: Quality, Governance, and Time-to-Insight
Derek Zanutto: Startups addressing data governance and management challenges, regardless of their connection to data lakes, have a considerable advantage. Conversations with chief data officers reveal that their biggest concerns revolve around data quality, governance, and the speed at which they can gain insights.
One-third of data users surveyed lack confidence in the data their teams provide. This stems from inconsistent data usage, terminology, and the absence of a single source of truth, leading to inaccurate models and flawed business decisions.
Increasingly strict regulations like GDPR and CCPA have made data protection and privacy a top priority. However, 80% of chief data officers surveyed do not feel fully compliant with GDPR. Many organizations lack the ability to answer fundamental questions about their data – its origin, location, usage, and potential impact in the event of a breach.
Navin Chaddha: Key areas for startup innovation include:
- Governance: privacy consent, access control, and compliance.
- Preparation: data quality, synchronization/integration, and stream processing.
- Management: data ops, exploratory analysis, and workflow automation.
The Rise of DataOps and the Last-Mile Problem
Jon Lehr: Organizations are proactively building internal data management practices, creating traction in these adjacent markets. Algorithmia, a Work-Bench portfolio company, exemplifies this trend by providing a platform for managing the entire machine learning lifecycle, ensuring rapid, secure, and cost-effective model deployment.
DataOps extends beyond traditional DevOps for data, automating and accelerating the entire data analytics cycle.
Peter Wagner: Exciting new companies are emerging in areas like data preparation (e.g., Upsolver), data stack observability (e.g., Pepperdata), optimized infrastructure for machine learning (e.g., Pinecone), and observability data platforms (e.g., Hydrolix). Technologies enabling data privacy and sharing also present significant opportunities.
Nicole Priel: A key challenge lies in bridging the gap between data end-users and the experts who understand the data lake’s contents. Simplifying data discovery and leveraging machine learning to automatically generate insights can help close this divide.
Ilya Sukhar: Fivetran’s impact on the ecosystem is noteworthy: it serves as the primary method for moving data from SaaS products and internal databases into data warehouses. Building value-added services on top of Fivetran, such as a Plaid-like API platform for programmatic data access, is a promising avenue.
The recent Snowflake IPO has spurred significant investment in data-related startups tackling well-known problems like governance, quality, monitoring, cataloging, and integration. However, these categories are becoming increasingly competitive.
Ron Miller
Ron Miller previously served as an enterprise reporter for TechCrunch, covering developments within the technology sector. Before that, he was a longtime contributing editor at EContent Magazine and a regular contributor to CITEworld, DaniWeb, TechTarget, Internet Evolution, and FierceContentManagement.
Disclosures: Miller formerly maintained a corporate blog for Intronis, where he published weekly articles on IT-related topics. He has also contributed to corporate blogs for other organizations, including Ness, Novell, and the IBM Mid-market Blogger Program.