In the digital economy, data is the new oil. For businesses big and small, across all industries and regions, staying ahead of the curve requires data pipelines that are faster, cheaper and better. DataOps opens up new opportunities in data pipeline development and management.
DataOps, short for “data operations”, is an agile, lean methodology for delivering data analytics solutions. As you can probably tell from its name, the concept is borrowed from DevOps, which transformed the software industry by enabling large teams of developers and IT operators to continuously deliver high-quality releases and provide value to customers. DataOps follows the same rationale: it improves data quality and accelerates data delivery by building robust, automated data pipelines.
Data workflows today are far from good enough. Traditional data pipelines are notoriously fragile and fragmented: jobs break easily when application data changes, and because developers, data scientists, data engineers and data analysts each hold tacit knowledge of their own domains, cross-team communication is full of friction. In addition, maintenance is costly and time-consuming, and documentation is out of date and inaccurate.
To tidy up this mess, we need a holistic approach that brings all resources (people, processes, and technology) together to manage and use data effectively across the entire lifecycle. DataOps, a collaborative practice for automating dataflows, ensuring data quality and deploying data analytics, will be part of the solution.
According to DataKitchen, a major player in the DataOps enterprise software industry, DataOps has four key components: orchestration, testing, deployment automation, and data science model deployment. Testing is applied in both the development phase and the production phase to ensure data quality and pipeline functionality. Testing is to data analytics what product quality inspectors and facility inspectors are to a manufacturing plant. Although it seems to add extra workload, it reduces the cases where data errors occur, schema changes break production jobs, or data drift undermines the reliability of analysis results. As a checkpoint, testing determines the speed of data analytics and establishes trust in the data and confidence in the results. For modern data systems, automated testing is necessary to manage complexity, maintain continuity and build trust before doubts accumulate.
Introducing Superconductive and Great Expectations
Superconductive is the company behind Great Expectations, an open-source tool for preventing tech debt in data pipelines through testing, documentation, and profiling. Great Expectations offers a framework that helps data teams increase productivity and decrease operating risk through a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (rather than code) and at batch time (rather than compile or deploy time).
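To make the idea concrete, here is a minimal sketch of a batch-time pipeline test in plain Python. Unlike a unit test, which runs against code at build time, this check runs against each new batch of data before it flows downstream. The function and field names are illustrative assumptions for this sketch, not Great Expectations' actual API.

```python
def validate_batch(rows):
    """Return a list of (row_index, reason) failures for one batch of records."""
    failures = []
    for i, row in enumerate(rows):
        # Expectation 1: user_id must never be null.
        if row.get("user_id") is None:
            failures.append((i, "user_id is null"))
        # Expectation 2: age must fall in a plausible range.
        if not (0 <= row.get("age", -1) <= 120):
            failures.append((i, "age out of expected range"))
    return failures

# A batch arriving at the pipeline, with two bad records.
batch = [
    {"user_id": 1, "age": 34},
    {"user_id": None, "age": 29},
    {"user_id": 3, "age": 150},
]

failures = validate_batch(batch)
if failures:
    # Halt the run instead of propagating bad data downstream.
    print(f"Batch rejected: {len(failures)} expectation(s) failed")
```

The key design point is the placement of the check: it gates data at batch time, so a schema change or data drift is caught at the pipeline boundary rather than surfacing later as a broken dashboard.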
Why We Invested in Superconductive
Great Expectations is the world’s leading open-source tool for automated testing in the DataOps space. Its essence is a shared standard for quality assurance in data governance, called “expectations”. Expectations are flexible, declarative assertions that describe the expected shape of data. They can be deployed directly into existing infrastructure, compiled directly into human-readable documentation, and extended to specific data domains.
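The declarative style is what makes expectations double as documentation. Below is a hedged sketch of the idea: expectations expressed as data, then “compiled” into plain-English documentation. The check names and rendering are illustrative assumptions, not Great Expectations' real syntax.

```python
# An expectation suite expressed declaratively, as data rather than code.
expectations = [
    {"column": "order_id", "check": "not_null"},
    {"column": "amount", "check": "between", "min": 0, "max": 10_000},
    {"column": "currency", "check": "in_set", "values": ["USD", "EUR", "GBP"]},
]

def to_docs(suite):
    """Compile an expectation suite into human-readable documentation lines."""
    lines = []
    for e in suite:
        if e["check"] == "not_null":
            lines.append(f"- {e['column']} must never be null.")
        elif e["check"] == "between":
            lines.append(f"- {e['column']} must be between {e['min']} and {e['max']}.")
        elif e["check"] == "in_set":
            lines.append(f"- {e['column']} must be one of {e['values']}.")
    return "\n".join(lines)

print(to_docs(expectations))
```

Because the documentation is generated from the same declarations the tests run against, it cannot silently drift out of date the way hand-written data docs do.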
In this sense, Great Expectations is a rule-maker and a game-changer at the frontier of data quality assurance. We value the vision and innovation of the Superconductive team and foresee the possibility that expectations will become a common language for describing and testing data. Setting clear expectations will undoubtedly help keep everyone on the same page and facilitate interdisciplinary collaboration.
Great Expectations’ business model is open core, which allows it to disrupt the data world from the bottom up. Under the hood there is not only a dedicated team but also a large community that actively contributes effort and insight. The power of the masses can be massive, as we can see from publicly listed open-core software companies such as MongoDB and Elastic.
So far Great Expectations has shown amazing traction. A broad variety of companies, including big names like Lyft, Snowflake, McKinsey and Morningstar, have leveraged Great Expectations to pay down pipeline debt in their data systems. As of late May 2021, Great Expectations had earned 4.5k stars and over 550 forks on GitHub, and over 3,200 Slack members.
Great Expectations’ strategic standing in the DataOps revolution gives it distinct advantages. Data testing is an ideal entry point because it is an obvious pain point: according to a whitepaper from the Eckerson Group, data analytics teams devote up to 20% of their code to testing. No existing solution conducts pipeline tests effectively while getting these two important things right: 1) establishing a common ground for describing the rules data must conform to, and 2) offering open access for the public to explore.
Besides data testing, the initial landing spot in Great Expectations’ GTM strategy, there are plenty of promising opportunities to expand across the entire organization. For instance, Great Expectations is also used as a source for generating documentation that reflects the data and remains up to date. In addition, adjacent markets like data modeling, metadata management, and data lakes and warehouses could be potential new territory in the future.
To sum up, data-driven insights are extremely valuable assets in the digital economy, and DataOps, an emerging discipline for collaborative analytics, is the future of data governance. We certainly expect Great Expectations to have a great journey ahead and look forward to witnessing its contribution to a bottom-up revolution toward a data world with better quality, integrity, and connectivity.