When designing the pipeline, consider what level of latency you need. What is your speed of decision? How quickly do you need the data? Building and running a low latency, real-time data pipeline will be significantly more expensive, so make sure that you know you need one before embarking on that path. You should also ask how fast your pipeline can be - is it even possible for you to have a real-time data pipeline? If all your data sources are produced by daily batch jobs, then the best latency you can reach will be daily updates, and the extra cost of real-time implementations will not provide any business benefits.
If you do need real-time or near real-time data, then latency needs to be a key factor at each step of the pipeline: the pipeline can only be as fast as its slowest stage.
And be careful not to confuse the need for a real-time decision engine with the need for a real-time historical data store, such as a data warehouse for the data scientists. Decision models are created from stores of historical data and need to be validated before deployment into production - model release usually takes place at a slower cadence (e.g., weekly or monthly).
Of course, the deployed model will need to work on a live data stream, but we consider this part of the application development - this is not the appropriate use for a data warehouse or similar.
Ingestion should start by storing raw data in the pipeline without making any changes. In most environments, data storage is cheap, and it is common to persist all ingested data unchanged. Typically, this is done via cloud file storage (S3, GCP Cloud Storage, Azure Storage), or HDFS for on-premises data.
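As a minimal sketch of this pattern, the snippet below writes ingested bytes unchanged into a landing zone partitioned by ingestion date. The local filesystem paths and the `sales_feed` source name are illustrative stand-ins for a real S3/GCS/Azure bucket:

```python
import pathlib
from datetime import date

# Hypothetical landing zone; in practice this would be cloud object storage.
RAW_ROOT = pathlib.Path("raw/sales_feed")

def persist_raw(payload: bytes, source_file: str, ingest_date: date) -> pathlib.Path:
    """Store the ingested bytes unchanged, partitioned by ingestion date."""
    partition = RAW_ROOT / f"ingest_date={ingest_date.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / source_file
    target.write_bytes(payload)  # no parsing, no cleaning - raw means raw
    return target

path = persist_raw(b'{"order_id": 1}', "orders.json", date(2024, 3, 1))
```

Partitioning by ingestion date is what later makes selective reprocessing possible.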
Keeping this data allows you to reprocess it without re-ingestion if any business rule changes, and it also leaves open the possibility of building new pipelines on this data if, for example, a new dashboard is needed.
Pipelines are usually composed of several transformations of the data - activities such as format validation, conformance against master data, enrichment, imputation of missing values, etc. Data pipelines are no different from other software and should thus follow modern software development practices of breaking work down into small, reproducible tasks. Each task should target a single output and be deterministic and idempotent: running a transformation on the same data multiple times should always produce the same result, and rerunning it on its own output should change nothing.
By creating easily tested tasks, we increase the quality of, and confidence in, the pipeline, as well as its maintainability. If we need to add or change something in a transformation, we have the guarantee that if we rerun it, the only changes will be the ones we made.
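A sketch of what such a task can look like, assuming a hypothetical currency-normalisation step with a single output:

```python
def conform_currency(rows: list[dict]) -> list[dict]:
    """Deterministic, single-output transformation: normalise currency codes.

    A pure function of its input - rerunning it on the same rows always
    yields the same result, and running it on its own output changes
    nothing, so the task is idempotent.
    """
    return [{**row, "currency": row["currency"].strip().upper()} for row in rows]

rows = [{"order_id": 1, "currency": " gbp "}]
once = conform_currency(rows)
twice = conform_currency(once)
assert once == twice == [{"order_id": 1, "currency": "GBP"}]
```

Because the task neither reads hidden state nor mutates its input, it can be unit tested in isolation and rerun safely during a backfill.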
Pipelines develop and mature - at the start of development, it may not be possible to fully evaluate whether the pipeline is working correctly or not. Is this metric unusual because this is what always happens on Mondays, or is it a fault in the pipeline? We may well find at a later date that some of the ingested data was incorrect. Imagine you discover that for one month a source was reporting incorrect results, while for the rest of the time the data was correct.
We should engineer our pipelines so that we can correct them as our understanding of the dataflows matures.
We should be able to backfill the stored data when we have identified a problem in the source or at some point in the pipeline, and ideally, it should be possible to backfill just for the corresponding period of time, leaving the data for other periods untouched.
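A minimal sketch of period-scoped backfill, assuming the date-partitioned layout described above (paths and the upper-casing "business logic" are illustrative):

```python
import pathlib
import shutil

RAW = pathlib.Path("raw/sales_feed")      # immutable raw zone (hypothetical paths)
CLEAN = pathlib.Path("clean/sales_feed")  # derived zone, rebuildable from raw

def backfill(partition: str) -> None:
    """Rebuild a single date partition from raw, leaving other periods untouched."""
    target = CLEAN / partition
    if target.exists():
        shutil.rmtree(target)             # drop only the affected period
    target.mkdir(parents=True)
    for src in (RAW / partition).glob("*.json"):
        # rerun the (corrected) transformation for this period only;
        # upper-casing stands in for the real business logic
        (target / src.name).write_text(src.read_text().upper())

# simulate one raw partition, then backfill just that period
(RAW / "ingest_date=2024-03-01").mkdir(parents=True, exist_ok=True)
(RAW / "ingest_date=2024-03-01" / "orders.json").write_text('{"id": 1}')
backfill("ingest_date=2024-03-01")
```

Because the raw zone is immutable and the tasks are idempotent, the backfill can be rerun as many times as needed without side effects on other periods.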
When starting at a greenfield site, we typically build up data pipelines iteratively around a steel thread - first a thin slice through the architecture that progressively validates the quality and security of the data. The first thread creates an initial point of value - probably a single data source, with some limited processing, stored where it can be accessed by at least one data user. Its purpose is to provide an initial path to data and uncover unexpected blockers, so it is selected for simplicity rather than for the highest end-user value.
BEAR IN MIND THAT IN THE FIRST ITERATION, YOU WILL NEED TO:
- Create a cloud environment which meets the organisation’s information security needs.
- Set up the continuous delivery environment.
- Create an appropriate test framework.
- Model the data and create the first schemas in a structured data store.
- Coach end users on how to access the data.
- Implement simple monitoring of the pipeline.
Later iterations will bring in more data sources and provide access to wider groups of users.
THEY WILL ALSO BRING IN MORE COMPLEX FUNCTIONALITY SUCH AS:
- Including sources of reference or master data.
- Advanced monitoring and alerting.
Pipelines are a mixture of infrastructure (e.g., hosting services, databases, etc.), processing code, and scripting/configuration. They can be implemented using proprietary and/or open-source technologies. However, all of the major cloud providers offer excellent cloud-native services for defining, operating and monitoring data pipelines. These are usually superior in their ability to scale with increasing volumes, simpler to configure and operate, and support a more agile approach to data architecture.
Whichever solution is adopted, since pipelines are a mixture of components, it is critical to adopt an infrastructure-as-code approach. Only by having the pipeline defined and built using tools such as Terraform, and source-controlled in a repository, will pipeline owners have control over the pipeline and the confidence to rebuild and refine it as needed.
Data sources can suddenly stop functioning for many reasons - unexpected changes to the format of the input data, an unanticipated rotation of secrets or change to access rights, or something happens in the middle of the pipeline that drops the data. This should be expected and means of observing the health of data flows should be implemented. Monitoring the data flows through the pipelines will help detect when failures have occurred and prevent adverse impacts. Useful tactics to apply include:
- Measuring counts or other statistics of data going in and coming out at various points in the pipeline.
- Implementing thresholds or anomaly detection on data volumes and alarms when they are triggered.
- Viewing log graphs - use the shapes to tell you when data volumes have dropped unexpectedly.
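The first two tactics can be sketched as a simple stage-level check. The 5% threshold and the function name are illustrative; in practice, thresholds are tuned per source or replaced by anomaly detection on historical volumes:

```python
def check_row_counts(rows_in: int, rows_out: int, max_drop_ratio: float = 0.05) -> None:
    """Raise an alarm if a pipeline stage drops more rows than expected."""
    if rows_in == 0:
        raise RuntimeError("No data arrived at this stage - the source may be down")
    dropped = (rows_in - rows_out) / rows_in
    if dropped > max_drop_ratio:
        raise RuntimeError(
            f"Stage dropped {dropped:.1%} of rows (limit {max_drop_ratio:.0%})"
        )

check_row_counts(rows_in=10_000, rows_out=9_900)  # 1% drop: within threshold
```

In a real deployment, the raised error would feed an alerting channel rather than simply failing the run.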
For data to be valuable to the end users (BI teams or data scientists), it has to be understandable at the point of use. In addition, analytics will almost always require the ability to merge data from sources. In our experience, many organisations do not suffer from big data as much as complex data - with many sources reporting similar or linked data - and a key challenge is to conform the data as a step before merging and aggregating it.
All these challenges require a shared understanding of data entities and fields - and need some kind of data model to resolve to. If you ignore this data model at the start of the pipeline, you will have to address these needs later on.
However, we do not recommend the development of an enterprise data model before data can be ingested into the system. Rather, starting with the needs of the data users in the initial use cases will lead you to a useful data model that can be iterated and developed over time.
Most pipelines require data to be conformed not just to the schema but also against known entities such as organisational units, product lists, currencies, people, companies, and so forth. Ignoring this master data on ingestion will make it harder to merge data later on. However, master data management often becomes overwhelming and starts to seem as if the whole enterprise needs modelling. To avoid data analysis paralysis, we recommend starting from the initial use cases and iteratively building reference data and master data into the pipelines as they are needed.
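A minimal sketch of conforming against master data on ingestion, assuming a hypothetical product list built up from the initial use case:

```python
# Hypothetical master data: canonical product IDs keyed by the aliases
# each source system uses. Start small and grow it as use cases demand.
PRODUCT_MASTER = {
    "WIDGET-A": "PRD-001",
    "widget a": "PRD-001",
    "WIDGET-B": "PRD-002",
}

def conform_product(row: dict) -> dict:
    """Resolve a source-specific product code to the master identifier."""
    code = PRODUCT_MASTER.get(row["product"])
    if code is None:
        # Route unmatched rows aside for review rather than failing the run
        return {**row, "product_id": None, "needs_review": True}
    return {**row, "product_id": code, "needs_review": False}
```

Resolving to a shared identifier at ingestion is what makes later merging and aggregation across sources straightforward.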
Pipelines typically support complex data flows composed of several tasks. For all but the simplest pipelines, it is good practice to separate the data flow from the code for the individual tasks. There are many tools that support this separation - usually in the form of Directed Acyclic Graphs (DAGs). In addition to supporting a clear isolate-and-reuse approach and enabling continuous development through version control of the data flow, DAGs usually provide a simple means of showing the data dependencies in a clear form, which is often useful in identifying bugs and optimising flows.
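The separation can be sketched with the standard library alone: the flow is pure data, and a tiny runner walks it in dependency order. Task names are illustrative; real orchestrators add scheduling, retries and parallelism on top:

```python
from graphlib import TopologicalSorter

# Flow definition kept separate from task code: each task maps to the
# set of tasks it depends on.
flow = {
    "validate": {"ingest"},
    "conform": {"validate"},
    "enrich": {"validate"},
    "publish": {"conform", "enrich"},
}

def run(flow: dict) -> list[str]:
    """Execute tasks in dependency order and return the order used."""
    order = list(TopologicalSorter(flow).static_order())
    for task in order:
        pass  # look up and invoke the task's code here
    return order

order = run(flow)
```

Because the flow is plain data in the repository, changes to the pipeline's shape are reviewed and versioned like any other code change.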
DEPENDING ON THE ENVIRONMENT AND THE NATURE AND PURPOSE OF THE PIPELINE, SOME TOOLS WE HAVE FOUND USEFUL ARE:
- Argo Workflows
- AWS Glue
As with any continuous delivery development, a data pipeline needs to be continuously tested.
HOWEVER, DATA PIPELINES DO FACE ADDITIONAL CHALLENGES:
- There are typically many more dependencies such as databases, data stores and data transfers from external sources, all of which make pipelines more fragile than application software - the pipes can break in many places. Many of these dependencies are complex in themselves and difficult to mock out.
- Even individual stages of a data pipeline can take a long time to process - anything with big data may well take hours to run. So feedback time and iteration cycles can be substantially longer.
- In pipelines with Personally Identifiable Information (PII), PII data will only be available in the production environment. So how do you do your tests in development? You can use sample data which is PII-clean for development purposes. However, this will miss errors caused by unexpected data that is not in the development dataset, so you will also need to test within production environments - which can feel uncomfortable for many continuous delivery practitioners.
- In a big data environment, it will not be possible to test everything - volumes of data can be so large that you cannot expect to test against all of it.
WE HAVE USED A VARIETY OF TESTING PRACTICES TO OVERCOME THESE CHALLENGES:
- The extensive use of integration tests - providing mock-ups of critical interfaces or using smaller-scale databases with known data to give quick feedback on schemas, dependencies and data validation.
- Implementing ‘development’ pipelines in the production environment with isolated ‘development’ clusters and namespaces. This brings testing to the production data, avoiding PII issues and the need for sophisticated data replication/emulation across environments.
- Statistics-based testing against sampled production data for smaller feedback loops on data quality checks.
- Using infrastructure-as-code testing tools to test whether critical resources are in place and correct (see our blog post ‘Testing infrastructure as code: 3 lessons learnt’ for a discussion of some existing tools).
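As a sketch of the first practice, an in-memory SQLite database can stand in for the production warehouse, giving fast feedback on the SQL, the schema and the data validation. The stage under test and the table are hypothetical:

```python
import sqlite3

def load_orders(conn: sqlite3.Connection) -> list[tuple]:
    """Stage under test: reject rows with non-positive quantities."""
    return conn.execute(
        "SELECT order_id, qty FROM orders WHERE qty > 0"
    ).fetchall()

def test_load_orders_filters_bad_rows() -> None:
    # In-memory SQLite stands in for the production database, so the
    # test runs in milliseconds with known data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, qty INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 5), (2, -1)])
    assert load_orders(conn) == [(1, 5)]

test_load_orders_filters_bad_rows()
```

Such tests will not catch every difference between SQLite and the production engine, which is why they are complemented by the production-side practices above.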