When designing the pipeline, consider what level of latency you need. What is your speed of decision? How quickly do you need the data? Building and running a low latency, real-time data pipeline will be significantly more expensive, so make sure that you know you need one before embarking on that path. You should also ask how fast your pipeline can be - is it even possible for you to have a real-time data pipeline? If all your data sources are produced by daily batch jobs, then the best latency you can reach will be daily updates, and the extra cost of real-time implementations will not provide any business benefits.
If you do need real-time or near real-time data, then this needs to be a key factor at each stage of the pipeline. The speed of the pipeline is constrained by the speed of its slowest stage.
And be careful not to confuse the need for a real-time decision engine with the need for a real-time historical data store, such as a data warehouse for the data scientists. Decision models are created from stores of historical data and need to be validated before deployment into production - model release usually takes place at a slower cadence (e.g., weekly or monthly).
Of course, the deployed model will need to work on a live data stream, but we consider this part of the application development - this is not the appropriate use for a data warehouse or similar.
Ingestion should start by storing raw data in the pipeline without making any changes. In most environments, data storage is cheap, and it is common to persist all ingested data unchanged. Typically, this is done via cloud object storage (S3, Google Cloud Storage, Azure Storage), or HDFS for on-premises data.
Keeping this data allows you to reprocess it without re-ingestion if any business rule changes, and it also keeps open the possibility of building new pipelines on this data if, for example, a new dashboard is needed.
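As a minimal sketch of this pattern (the bucket layout and the `raw_landing_key` helper are our own illustrative assumptions, not a prescribed convention), a date-partitioned key scheme keeps raw landings unchanged and easy to reprocess by period:

```python
from datetime import datetime, timezone

def raw_landing_key(source: str, filename: str, ingested_at: datetime) -> str:
    """Build a date-partitioned object key for unmodified raw data,
    e.g. raw/orders/2023/05/14/orders.csv."""
    d = ingested_at.astimezone(timezone.utc)
    return f"raw/{source}/{d:%Y/%m/%d}/{filename}"
```

With a scheme like this, landing the file is a single upload call (for example, boto3's `s3.upload_file(local_path, bucket, key)` on S3), with no transformation applied on the way in.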
Pipelines are usually composed of several transformations of the data, activities such as format validation, conformance against master data, enrichment, imputation of missing values, etc. Data pipelines are no different from other software and should thus follow modern software development practices of breaking down software units into small reproducible tasks. Each task should target a single output and be deterministic and idempotent. If we run a transformation on the same data multiple times, the results should always be the same.
By creating easily tested tasks, we increase the quality of and confidence in the pipeline, as well as enhance the pipeline's maintainability. If we need to add or change something in a transformation, we have the guarantee that if we rerun it, the only changes in the output will be the ones we made.
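The sketch below illustrates what deterministic and idempotent mean in practice; the record shape and the keyed store are hypothetical, standing in for whatever table or object store a real task writes to:

```python
def transform(records):
    """Deterministic: the output depends only on the input -
    no clock, no randomness, no reads of shared mutable state."""
    return sorted(
        ({"id": r["id"], "amount_gbp": round(r["amount_pence"] / 100, 2)}
         for r in records),
        key=lambda r: r["id"],
    )

def run_task(records, store):
    """Idempotent: output is written as a keyed overwrite (upsert),
    so rerunning the task leaves the store in the same state."""
    for row in transform(records):
        store[row["id"]] = row  # upsert, never append
    return store
```

Because the write is an overwrite keyed on `id` rather than an append, running the task twice on the same input produces exactly the same store.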
Pipelines develop and mature - at the start of development, it may not be possible to fully evaluate if the pipeline is working correctly or not. Is this metric unusual because this is what always happens on Mondays, or is it a fault in the pipeline? We may well find at a later date that some of the ingested data was incorrect. Imagine you find out that during a month, a source was reporting incorrect results, but for the rest of the time, the data was correct.
We should engineer our pipelines so that we can correct them as our understanding of the dataflows matures.
We should be able to backfill the stored data when we have identified a problem in the source or at some point in the pipeline, and ideally, it should be possible to backfill just for the corresponding period of time, leaving the data for other periods untouched.
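A backfill of this kind can be sketched as below, assuming daily partitions and an idempotent per-partition task (the `process_partition` callable is hypothetical); only the affected days are reprocessed, and all other partitions are left untouched:

```python
from datetime import date, timedelta

def backfill(process_partition, start: date, end: date):
    """Re-run the idempotent partition task for each day in [start, end].
    Because each run overwrites only that day's output, data outside
    the faulty period is never touched."""
    day = start
    while day <= end:
        process_partition(day)
        day += timedelta(days=1)
```

For the "one bad month" scenario above, this would be called with the first and last day of that month, leaving the rest of the history as-is.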
When starting at a greenfield site, we typically build up data pipelines iteratively around a steel thread - a thin slice through the architecture that progressively validates the quality and security of the data. The first thread creates an initial point of value - probably a single data source, with some limited processing, stored where it can be accessed by at least one data user. The purpose of this first thread is to provide an initial path to data and uncover unexpected blockers, so it is selected for simplicity rather than for the highest end-user value.
BEAR IN MIND THAT IN THE FIRST ITERATION, YOU WILL NEED TO:
- Create a cloud environment which meets the organisation’s information security needs.
- Set up the continuous delivery environment.
- Create an appropriate test framework.
- Model the data and create the first schemas in a structured data store.
- Coach end users on how to access the data.
- Implement simple monitoring of the pipeline.
Later iterations will bring in more data sources and provide access to wider groups of users.
THEY WILL ALSO BRING IN MORE COMPLEX FUNCTIONALITY SUCH AS:
- Including sources of reference or master data.
- Advanced monitoring and alerting.
With the development of the Multi Digital Tax Platform, Her Majesty's Revenue and Customs (HMRC) required an auditing and transaction monitoring capability to provide the right level of assurance across the digital estate. This 'need' evolved into what is now referred to as the Customer Insight Platform (CIP). CIP provides insight into users' digital interactions over time, which underpins improvements, security and fraud prevention. More recently, this data and insight have been used in upfront risking.
They asked Equal Experts to assist in the design, development and delivery of CIP. Using an agile, continuous delivery approach, we worked with HMRC to build upon the CIP platform, starting with an initial steel thread, and building up to a full production system. More recently, we’ve introduced a data modelling environment that provides a clear path to production for the models/insights which have been pivotal to the COVID response.
Pipelines are a mixture of infrastructure (e.g., hosting services, databases, etc.), processing code, and scripting/configuration. They can be implemented using proprietary and/or open-source technologies. However, all of the cloud providers have excellent cloud-native services for defining, operating and monitoring data pipelines. These are usually superior in their ability to scale with increasing volumes, simpler to configure and operate, and support a more agile approach to data architecture.
Whichever solution is adopted, since pipelines are a mixture of components, it is critical to adopt an infrastructure-as-code approach. Only by having the pipeline defined and built using tools such as Terraform, and source-controlled in a repository, will pipeline owners have control over the pipeline and the confidence to rebuild and refine it as needed.
Data sources can suddenly stop functioning for many reasons - unexpected changes to the format of the input data, an unanticipated rotation of secrets or change to access rights, or something happens in the middle of the pipeline that drops the data. This should be expected and means of observing the health of data flows should be implemented. Monitoring the data flows through the pipelines will help detect when failures have occurred and prevent adverse impacts. Useful tactics to apply include:
- Measuring counts or other statistics of data going in and coming out at various points in the pipeline.
- Implementing thresholds or anomaly detection on data volumes and alarms when they are triggered.
- Viewing log graphs - use the shapes to tell you when data volumes have dropped unexpectedly.
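The first two tactics can be combined into a simple check like the sketch below; the rolling history and the 50% tolerance are illustrative assumptions, and in practice the returned message would be routed to your alerting system rather than just returned:

```python
def check_volume(stage: str, count: int, history: list, tolerance: float = 0.5):
    """Alert when a stage's record count drops well below its recent
    average. Returns an alert message, or None if the volume looks normal."""
    if not history:
        return None  # no baseline yet - nothing to compare against
    baseline = sum(history) / len(history)
    if count < baseline * (1 - tolerance):
        return f"ALERT: {stage} produced {count} records, expected ~{baseline:.0f}"
    return None
```

Even a crude threshold like this catches the common failure mode where a source silently stops delivering and a stage's output collapses to near zero.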
One of our clients had an architecture which ingested data from a wide range of sources in order to understand how their business was operating and how their customers were using their products. It was a fast-growing company, and the data estate had grown rapidly with them. The rapid growth had meant that their pipelines were not sufficiently monitored, leading to undetected losses of data, and reduced confidence in the reports which depended on the data.
To help our client to regain confidence in their data, we first developed and implemented tactical sampling and measurement of data flows through the architecture, with simple text reports. These highlighted where problems were occurring in the data architecture, clearly showing where remediation was required. It rapidly identified some key problems which were subsequently fixed, as well as issues which were in the data sources themselves.
Using existing Grafana and Prometheus monitoring tooling, we created enduring data visualisations and alerts to enable the identification and resolution of any further data losses.
For data to be valuable to the end users (BI teams or data scientists), it has to be understandable at the point of use. In addition, analytics will almost always require the ability to merge data from sources. In our experience, many organisations do not suffer from big data as much as complex data - with many sources reporting similar or linked data - and a key challenge is to conform the data as a step before merging and aggregating it.
All these challenges require a shared understanding of data entities and fields - and need some kind of data model to resolve to. If you ignore this data model at the start of the pipeline, you will have to address these needs later on.
However, we do not recommend the development of an enterprise data model before data can be ingested into the system. Rather, starting with the needs of the data users in the initial use cases will lead you to a useful data model that can be iterated and developed over time.
Most pipelines require data to be conformed not just to the schema but also against known entities such as organisational units, product lists, currencies, people, companies, and so forth. Ignoring this master data on ingestion will make it harder to merge data later on. However, master data management often becomes overwhelming and starts to seem as if the whole enterprise needs modelling. To avoid data analysis paralysis, we recommend starting from the initial use cases and iteratively building reference data and master data into the pipelines as they are needed.
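As a sketch of conforming a single attribute, the snippet below maps the variant spellings seen in source systems onto one canonical code; the master list here is entirely hypothetical, and would in practice be built iteratively from the initial use cases as described above:

```python
# Hypothetical master list: variant source spellings -> canonical region code.
REGION_MASTER = {
    "uk": "GB", "united kingdom": "GB", "great britain": "GB",
    "deutschland": "DE", "germany": "DE",
}

def conform_region(raw_value: str):
    """Map a source value onto the master list. Unknown values return
    None so they can be quarantined and reviewed, rather than guessed."""
    return REGION_MASTER.get(raw_value.strip().lower())
```

Returning None for unknown values, rather than passing them through, is what keeps later merges clean: unmatched records surface as a data quality issue instead of silently fragmenting the aggregates.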
We worked with a vehicle manufacturer to create tools to help their dealers and internal users improve their understanding of used vehicle prices. As the data came from a variety of sources, it needed to be conformed against master data lists for key business attributes such as geographical region and product types. It would have been almost impossible to arrive at agreed-upon master data lists for these attributes which work for every part of the organisation. So, we took a pragmatic approach - creating geographies and product hierarchies that were meaningful to their dealer organisations and, at the same time, were reflected in their data. This enabled us to implement data pipelines within a short timescale, as well as a user experience that is understandable and useful across their global business.
Pipelines typically support complex data flows composed of several tasks. For all but the simplest pipelines, it is good practice to separate the data flow from the code for the individual tasks. There are many tools that support this separation - usually in the form of Directed Acyclic Graphs (DAGs). In addition to supporting a clear isolate-and-reuse approach, and enabling continuous development by providing version control of the data flow, DAGs usually offer a simple means of showing the data dependencies in a clear form, which is often useful in identifying bugs and optimising flows.
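To illustrate the separation, the sketch below (using Python's standard-library `graphlib` rather than any particular workflow tool, and with hypothetical task names) keeps the flow as pure data, distinct from the task code, so it can be version-controlled and inspected on its own:

```python
from graphlib import TopologicalSorter

# The flow is plain data: each task maps to the set of tasks it depends on.
DAG = {
    "ingest":   set(),
    "validate": {"ingest"},
    "enrich":   {"validate"},
    "publish":  {"enrich"},
}

def run(dag, tasks):
    """Execute tasks in dependency order. The flow definition and the
    task implementations stay independent of each other."""
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        tasks[name]()
    return order
```

A real orchestrator adds scheduling, retries and visualisation on top, but the core idea is the same: the dependency graph is declared separately from the code for each task.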
DEPENDING ON THE ENVIRONMENT AND THE NATURE AND PURPOSE OF THE PIPELINE, SOME TOOLS WE HAVE FOUND USEFUL ARE:
- Argo Workflows
- AWS Glue
We helped a large broadcaster collect metadata on their digital assets from a wide variety of external providers. In the initial phase, pipeline ingestions and the associated scheduling were set up inside a single Clojure project, with management being done via configuration files present within that project. Whilst this worked well, it limited the flexibility to change a scheduler or ingestion. What we wanted was something that could:
- Schedule data ingestions.
- Manually trigger data ingestions.
- Perform data backfills for specific ingestions.
- Easily show an overview of the status of all running ingestions.
As our existing pipelines were implemented in Clojure with transformations implemented in SQL, we quickly settled on Argo Workflows for workflow management, coupled with dbt for transformation management. Using this approach allowed us to centralise the analytics logic and led to a reduction in data ingestion and processing from two hours to ten minutes.
As with any continuous delivery development, a data pipeline needs to be continuously tested.
HOWEVER, DATA PIPELINES DO FACE ADDITIONAL CHALLENGES:
- There are typically many more dependencies such as databases, data stores and data transfers from external sources, all of which make pipelines more fragile than application software - the pipes can break in many places. Many of these dependencies are complex in themselves and difficult to mock out.
- Even individual stages of a data pipeline can take a long time to process - anything with big data may well take hours to run. So feedback time and iteration cycles can be substantially longer.
- In pipelines with Personally Identifiable Information (PII), PII data will only be available in the production environment. So how do you do your tests in development? You can use sample data which is PII-clean for development purposes. However, this will miss errors caused by unexpected data that is not in the development dataset, so you will also need to test within production environments - which can feel uncomfortable for many continuous delivery practitioners.
- In a big data environment, it will not be possible to test everything - volumes of data can be so large that you cannot expect to test against all of it.
WE HAVE USED A VARIETY OF TESTING PRACTICES TO OVERCOME THESE CHALLENGES:
- The extensive use of integration tests - providing mock-ups of critical interfaces or using smaller-scale databases with known data to give quick feedback on schemas, dependencies and data validation.
- Implementing ‘development’ pipelines in the production environment with isolated ‘development’ clusters and namespaces. This brings testing to the production data, avoiding PII issues and the need for sophisticated data replication/emulation across environments.
- Statistics-based testing against sampled production data for smaller feedback loops on data quality checks.
- Using infrastructure-as-code testing tools to test whether critical resources are in place and correct (see our blog, ‘Testing infrastructure as code: 3 lessons learnt’, for a discussion of some existing tools).
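A statistics-based check of the kind described above can be sketched as follows; the sample size, pass-rate threshold and record shape are illustrative assumptions, chosen so the test gives a quick verdict without scanning the full dataset:

```python
import random

def sample_quality_check(rows, predicate, sample_size=1000,
                         min_pass_rate=0.99, seed=42):
    """Check a data quality rule against a random sample rather than the
    whole (potentially huge) dataset. Passes if at least min_pass_rate
    of the sampled rows satisfy the predicate. A fixed seed keeps the
    check reproducible between runs."""
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    passed = sum(1 for r in sample if predicate(r))
    return passed / len(sample) >= min_pass_rate
```

Sampling trades certainty for speed: it will not catch a handful of bad rows in billions, but it reliably flags systematic problems well within a short feedback loop.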