Pipelines bring data from important business sources. In many cases, they feed reports and analysis that endure for a long time. Unless your business never expects to change how it operates, and its low-level processes never change, your data pipelines will need to adapt to changes in the underlying processes, new IT, or the data itself.
As something that should respond to and embrace regular change, pipelines should be treated as products rather than projects.
This means that there should be multi-year funding to monitor and maintain the existing pipelines and provide headroom to add new ones, as well as support the analysis and retirement of old ones.
Pipelines need product managers to understand the pipelines’ current statuses and operability, and to prioritise the work.
(See the products-over-projects article on Martin Fowler's site for a wider description of working in product mode rather than project mode.)
Find ways to make common use of the data
The data collected for a given problem or piece of analysis will nearly always be useful in answering other questions.
When creating pipelines, we try to architect them in a way that allows reuse whilst also remaining lean in our implementation choices.
In many cases there are simple ways of achieving this. For example, there are usually a variety of places where data is stored in the pipeline. Raw ingested data might be useful for unanticipated purposes, and can often be made available to skilled users by providing them access to the landing zone.
Appropriate identity and access technologies, such as role-based access, can support re-use while permitting strict adherence to data-protection policies and regulations. The fundamental architecture can stay the same, with access being provided by adding or amending access roles and permissions to data buckets, databases or data warehouses.
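As a rough illustration of the point above, the sketch below models role-based access to pipeline storage zones. The zone names ("landing", "curated"), the role names and the users are hypothetical, and a real deployment would rely on the identity and access service of its platform rather than hand-written code like this. The point is only that re-use is enabled by changing roles and grants, not the pipeline architecture.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    """A named role and the storage zones it may read (hypothetical names)."""
    name: str
    readable_zones: frozenset


@dataclass
class AccessPolicy:
    roles: dict = field(default_factory=dict)       # role name -> Role
    user_roles: dict = field(default_factory=dict)  # user -> set of role names

    def add_role(self, role: Role) -> None:
        self.roles[role.name] = role

    def grant(self, user: str, role_name: str) -> None:
        if role_name not in self.roles:
            raise KeyError(f"unknown role: {role_name}")
        self.user_roles.setdefault(user, set()).add(role_name)

    def can_read(self, user: str, zone: str) -> bool:
        # A user may read a zone if any of their roles lists that zone.
        return any(
            zone in self.roles[name].readable_zones
            for name in self.user_roles.get(user, set())
        )


# Assumed setup: dashboard builders see curated data only, while
# skilled users are also given access to the raw landing zone.
policy = AccessPolicy()
policy.add_role(Role("dashboard_builder", frozenset({"curated"})))
policy.add_role(Role("data_scientist", frozenset({"landing", "curated"})))
policy.grant("alice", "dashboard_builder")
policy.grant("bob", "data_scientist")
```

Opening the landing zone to a further user is then a single extra grant, for example `policy.grant("alice", "data_scientist")`, with no change to how the data is stored or moved.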
A pipeline should operate as a well-defined unit of work
Pipelines have a cadence that is driven by the need for decision-making and limited by the availability of source data. The developers and users of a pipeline should understand and recognise this as a well-defined unit of work, whether it runs every few seconds, hourly, daily, monthly or in response to events.
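One way to make the unit of work explicit for a time-based cadence is to give every run a fixed window. The sketch below is a minimal example under our own assumptions: the `UnitOfWork` class, the `window_for` helper and the `event_time` field are hypothetical names, timestamps are assumed to be timezone-aware UTC, and windows are aligned to the Unix epoch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class UnitOfWork:
    """One pipeline run: every record with start <= event_time < end."""
    start: datetime  # inclusive
    end: datetime    # exclusive

    @property
    def run_id(self) -> str:
        # A stable identifier, so that re-running a window can replace
        # its earlier output instead of duplicating it.
        return f"{self.start:%Y%m%dT%H%M%SZ}_{self.end:%Y%m%dT%H%M%SZ}"


def window_for(moment: datetime, cadence: timedelta) -> UnitOfWork:
    """Return the cadence-aligned window that contains `moment`."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    index = (moment - epoch) // cadence  # whole cadences since the epoch
    start = epoch + index * cadence
    return UnitOfWork(start, start + cadence)


def select_records(records, unit: UnitOfWork):
    """Keep only the records that belong to this unit of work."""
    return [r for r in records if unit.start <= r["event_time"] < unit.end]


moment = datetime(2021, 3, 21, 10, 42, tzinfo=timezone.utc)
hourly = window_for(moment, timedelta(hours=1))
daily = window_for(moment, timedelta(days=1))
```

Because the window boundaries depend only on the cadence and the timestamp, developers and users can agree on exactly which data a given run covers.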
Pipelines should be built around use cases
In general, we recommend building pipelines around the use case rather than the data source. This helps ensure that business value is achieved early. In some cases, the same data source might be important to several use cases, each with different cadences and access rights. Understanding when to re-use parts of pipelines and when to create new ones is an important consideration. For example, faster pipelines can always be used for slower cadences, but they typically require more effort to maintain and adapt. It might be easier to create a simple batch pipeline for a new, less time-critical use case that is not expected to change substantially than to upgrade a fast streaming pipeline to meet the new requirements.
We have been helping the Office for National Statistics (ONS) to plan and manage the 2021 census of England and Wales. Our end users include people who construct dashboards showing important management information and people who need to manage large device estates. They need information from many ONS suppliers, who provide that data in a variety of formats over different channels.
We built up our data pipeline architecture by starting with an initial high-priority use case, from which we created the first thin slice: a single (if complex) data source providing data for a dashboard. We iterated around this initial architecture, reusing key pieces of infrastructure (such as user groups and permissions) and data engineering patterns (such as a common approach to ETL), and adding extra complexity, such as new ingestion methods, decryption and secrets management, as required by new use cases.
Taking a use case-based approach has enabled us to prioritise the areas of highest value first, meaning we focus on what matters most to the business users and dashboard makers who rely on our pipelines.
We want to be able to amend our data pipelines in an agile fashion as the data environment and needs of the business change. So, just like any other piece of working software, continuous delivery practices should be adopted to enable continuous updates of data pipelines in production.
Adopting this mindset and these practices is essential to support continuous improvement and create feedback loops that rapidly expose problems and address user feedback.
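In practice, continuous delivery of a pipeline depends on automated checks that run on every change. The sketch below shows one possible shape for such a check: a small transformation step and two tests for it. The function `normalise_device_record` and its fields are hypothetical examples, not taken from any real pipeline.

```python
def normalise_device_record(raw: dict) -> dict:
    """Hypothetical transformation step: tidy up one ingested record."""
    return {
        "device_id": raw["device_id"].strip().upper(),
        # A missing status is assumed to mean "unknown".
        "status": raw.get("status", "unknown").strip().lower(),
    }


def test_normalise_device_record():
    raw = {"device_id": " ab-123 ", "status": "Active "}
    expected = {"device_id": "AB-123", "status": "active"}
    assert normalise_device_record(raw) == expected


def test_missing_status_defaults_to_unknown():
    record = normalise_device_record({"device_id": "x1"})
    assert record["status"] == "unknown"
```

A test runner such as pytest would collect the `test_` functions automatically, so they can gate each deployment of the pipeline code.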
Consider how you name and partition your data
Data pipelines are a mix of code and infrastructure that can become confusing as they grow if care is not taken with naming. A pipeline will include, at a minimum, databases, tables, attributes, buckets and roles, and these should be named consistently to make the pipeline easier to understand and maintain, and to make the data meaningful to end users.
In many architectures, naming will directly affect how your data is partitioned, which in turn affects the speed of the search and retrieval of data.
Consider what will be the most frequent queries when specifying bucket names, table partitions, shards, and so on.
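As an illustration, suppose the most frequent queries filter by dataset and by date. One possible naming scheme, sketched below, puts those fields at the front of the object key so that a query for a given month only has to scan one prefix. The zone/source/dataset layout, the `year=/month=/day=` partition style and all of the example names are assumptions made for this sketch.

```python
from datetime import date


def partition_key(zone: str, source: str, dataset: str,
                  day: date, filename: str) -> str:
    """Build a consistently named, date-partitioned object key."""
    for part in (zone, source, dataset):
        # Enforce one convention: lower-case letters, digits and underscores.
        if not part.replace("_", "").isalnum() or part != part.lower():
            raise ValueError(f"name breaks the naming convention: {part!r}")
    return (
        f"{zone}/{source}/{dataset}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/{filename}"
    )


def prefix_for_month(zone: str, source: str, dataset: str,
                     year: int, month: int) -> str:
    """Prefix that selects a whole month of one dataset."""
    return f"{zone}/{source}/{dataset}/year={year:04d}/month={month:02d}/"


key = partition_key("curated", "field_ops", "device_status",
                    date(2021, 3, 21), "part-0000.parquet")
march = prefix_for_month("curated", "field_ops", "device_status", 2021, 3)
```

If the dominant queries were by region rather than by date, the same idea would apply with the region placed ahead of the date in the key.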