Pitfalls

Avoid tightly coupling your analytics pipelines with other business processes

Analytics data pipelines provide data to produce insights about your customers, business operations, technology performance, and more. For example, the role of a data warehouse is to create an historical record of data that can be mined for insights.
It is tempting to see these rich data sources as the best source of data for all data processing and plumb key business activities in these repositories. However, this can easily end up preventing the extraction of insights it was implemented for. Data warehouses can become so integrated into business operations - effectively acting as the Operational Data Store (ODS) - that they can no longer function as a data warehouse. Key business activities end up dependent on the fast processing of data drawn from the data warehouse, which prevents other users from running queries on the data they need for their analyses.
Modern architectures utilise a micro-service architecture, and we advocate this digital platform approach to delivering IT functionality (see our Digital Platforms playbook).

Micro-services should own their own data - there is unlikely to be a one-size-fits-all solution to volumes, latencies, or use of master or reference data of the many critical business data flows implemented as micro-services.

Great care should be taken as to which part of the analytics data pipelines micro-services draw their data from. The nearer the data they use is to the end users, the more constrained your data analytics pipeline will become over time, and the more restricted analytics users will become in what they can do.
If a micro-service is using a whole pipeline as part of its critical functionality, it is probably time to reproduce the pipeline as a micro-service in its own right, as the needs of the analytics users and the micro-service will diverge over time.

Include data users early on

We are sometimes asked if we can implement data pipelines without bothering data users. They are often very busy interfacing at senior levels, and as their work provides key inputs to critical business activities and decisions, it can be tempting to reduce the burden on them and think that you already understand their needs.
In our experience this is nearly always a mistake. Like any software development, understanding user needs as early as you can, and validating that understanding through the development, is much more likely to lead to a valued product. Data users almost always welcome a chance to talk about what data they want, what form they want it in, and how they want to access it. When it becomes available, they may well need some coaching on how to access it.

Keep unstructured raw inputs separate from processed data

In pipelines where the raw data is unstructured (e.g. documents or images), and the initial stages of the pipeline extract data from it, such as entities (names, dates, phone numbers, etc.), or categorisations, it can be tempting to keep the raw data together with the extracted information. This is usually a mistake. Unstructured data is always of a much higher volume, and keeping it together with extracted data will almost certainly lead to difficulties in processing or searching the useful, structured data later on. Keep the unstructured data in separate storage (e.g., different buckets), and store links to it instead.
Last modified 7mo ago