Pipelines bring data from important business sources into reports and analyses that often endure for a long time. Unless your business never changes how it operates and never amends its low-level processes, data pipelines will always need to adapt to changes in those fundamental processes, to new technology, or to the data itself.
Because they must respond to and embrace regular change, pipelines should be treated as products rather than projects.
This means that there should be multi-year funding to monitor and maintain the existing pipelines and provide headroom to add new ones, as well as support the analysis and retirement of old ones.
Pipelines need product managers who understand each pipeline's current status and operability, and who can prioritise the work.
The data collected for a given problem or piece of analysis will nearly always be useful in answering other questions.
When creating pipelines, we try to architect them in a way that allows reuse whilst also remaining lean in our implementation choices.
In many cases there are simple ways of achieving this. For example, there are usually a variety of places where data is stored in the pipeline. Raw ingested data might be useful for unanticipated purposes, and can often be made available to skilled users by providing them access to the landing zone.
Appropriate identity and access technologies, such as role-based access, can support reuse while permitting strict adherence to data-protection policies and regulations. The fundamental architecture can stay the same, with access being provided by adding or amending access roles and permissions to data buckets, databases or data warehouses.
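As a sketch, role-based access over storage zones can be expressed as a simple mapping from roles to the prefixes they may read. The role names and bucket paths here are illustrative assumptions, not any particular cloud provider's API:

```python
# Illustrative storage zones; the bucket names are assumptions.
RAW_ZONE = "s3://acme-landing/raw/"
CURATED_ZONE = "s3://acme-warehouse/curated/"

# Each role lists the storage prefixes it is granted read access to.
# Skilled users (here, data scientists) also get landing-zone access.
ROLE_READ_GRANTS = {
    "data-engineer": [RAW_ZONE, CURATED_ZONE],
    "analyst": [CURATED_ZONE],
    "data-scientist": [RAW_ZONE, CURATED_ZONE],
}

def can_read(role: str, path: str) -> bool:
    """Return True if the role may read the object at the given path."""
    return any(path.startswith(prefix) for prefix in ROLE_READ_GRANTS.get(role, []))
```

Granting a new consumer access then means amending `ROLE_READ_GRANTS` (or its real-world equivalent, such as an IAM policy) rather than rebuilding the pipeline.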
Pipelines have a cadence which is driven by the need for decision-making and limited by the availability of source data. The developers and users of a pipeline should understand and recognise this as a well-defined unit of work - whether every few seconds, hourly, daily, monthly or event-driven.
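One way to make that cadence explicit is to record it as part of each pipeline's definition, so developers and users can see the unit of work at a glance. This is a minimal sketch; the cadence values and pipeline names are illustrative assumptions:

```python
# A sketch of declaring a pipeline's cadence as an explicit, named unit of work.
from dataclasses import dataclass
from enum import Enum

class Cadence(Enum):
    STREAMING = "every few seconds"
    HOURLY = "hourly"
    DAILY = "daily"
    MONTHLY = "monthly"
    EVENT_DRIVEN = "on source event"

@dataclass(frozen=True)
class Pipeline:
    name: str
    source: str
    cadence: Cadence

# Illustrative pipelines with different cadences from different sources.
sales_report = Pipeline("sales-report", "orders-db", Cadence.DAILY)
fraud_alerts = Pipeline("fraud-alerts", "payments-stream", Cadence.EVENT_DRIVEN)
```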
In general, we recommend building pipelines around the use case rather than the data source. This will help ensure that business value is achieved early. In some cases, the same data source might be important to several use cases, each with different cadences and access rights. Understanding when to reuse parts of pipelines and when to create new ones is an important consideration. For example, faster pipelines can always serve slower cadences, but they typically require more effort to maintain and adapt. It might be simpler to create a new batch pipeline for a slower-cadence use case that is not expected to change substantially than to upgrade a fast streaming pipe to meet the new requirements.
We want to be able to amend our data pipelines in an agile fashion as the data environment and needs of the business change. So, just like any other piece of working software, continuous delivery practices should be adopted to enable continuous updates of data pipelines in production.
Adopting this mindset and these practices is essential to support continuous improvement and create feedback loops that rapidly expose problems and address user feedback.
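As an illustration of such a feedback loop, each pipeline transformation can carry automated tests that run in CI on every change, so problems surface before they reach production. The transform and its records below are illustrative assumptions:

```python
# A sketch: tests like this run on every commit, giving the fast
# feedback loop that continuous delivery of pipelines depends on.
def dedupe_orders(rows: list[dict]) -> list[dict]:
    """Illustrative transform: keep only the latest record per order_id."""
    latest: dict = {}
    for row in rows:
        key = row["order_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())

def test_dedupe_keeps_latest():
    rows = [
        {"order_id": 1, "updated_at": "2024-01-01", "status": "placed"},
        {"order_id": 1, "updated_at": "2024-01-02", "status": "shipped"},
    ]
    assert dedupe_orders(rows) == [
        {"order_id": 1, "updated_at": "2024-01-02", "status": "shipped"}
    ]

test_dedupe_keeps_latest()
```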
Data pipelines are a mix of code and infrastructure that can become confusing as they grow if care is not taken with naming. A pipeline will include at least databases, tables, attributes, buckets and roles, and these should be named consistently to facilitate understanding and maintenance of the pipelines, as well as to make the data meaningful to end users.
In many architectures, naming will directly affect how your data is partitioned, which in turn affects the speed of the search and retrieval of data.
Consider what the most frequent queries will be when specifying bucket names, table partitions, shards, and so on.
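For example, Hive-style `key=value` path segments encode partitions directly in object names, letting a query engine skip data that the most frequent queries filter out. The bucket, table and partition keys below are illustrative assumptions:

```python
# A sketch of partition-aware naming: the path structure itself
# determines which data a query engine must scan.
from datetime import date

def partition_path(bucket: str, table: str, event_date: date, region: str) -> str:
    """Build a partitioned object key; a WHERE clause on event_date or
    region lets the engine prune every non-matching partition."""
    return (
        f"s3://{bucket}/{table}/"
        f"event_date={event_date.isoformat()}/region={region}/part-0000.parquet"
    )

path = partition_path("acme-warehouse", "orders", date(2024, 3, 1), "eu")
```

Choosing `event_date` and `region` as the leading partition keys only pays off if those are the columns the most frequent queries actually filter on, which is why the naming decision should follow the query patterns.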