Data Lake: Building a Bridge between Technology and Business
Is Data a Technology or a Business function? Debates over trade-offs such as “who owns the data?”, “who can edit data in production?”, or even “where should the CDO function reside?” tend to look past a fundamental principle of the data industry: effectively managing data is a two-way street. It is neither solely Business nor solely Technology. Data is an enterprise asset that exists where the handshake between business and technology occurs. Data Lakes and open source tools are erasing archaic frameworks used for decades in establishing policy, process, infrastructure, application design, controls, etc. Those who influence the “hearts and minds” of the organization to reset some of these effective but outdated practices will greatly accelerate the timeline to value from the data lake. Here are several decision points that may require re-examination to increase momentum.
A production environment, where the business operates to meet the needs of its customers, requires stability and controls to support the business function for which it is designed. This level of stability helps support teams meet their required service level agreements.
A data lake supporting analytics is an organic, iterative environment for testing hypotheses. Once a hypothesis is proven, the “EUREKA” moment, that insight needs to be shared. Analysts can also bring their own data and mix it with the production data. The needs change on a daily basis, so the process cannot be defined to the level required when designing a stable production application.
Establishing self-service analytics with sandboxes in the production environment can help bridge the traditional production data environment and the organic data blending required to support analytics. Allow the analysts to load data into a dedicated workspace that can be joined to the production data in the data lake. Self-service capabilities allow for the capture, blending, and sharing of information without running a software development project. Some mature organizations have implemented business-driven schedules for implementing repeatable processes.
Controls such as a 90-day expiration or charge-back methods can be put in place on the sandboxes to help establish outer boundaries for compute, storage, and network traffic.
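As a rough illustration of such sandbox controls, the sketch below models a 90-day expiration, a storage quota, and a simple charge-back calculation. The class name, field names, quota, and rate are all hypothetical, chosen only for illustration; a real data lake platform would enforce these boundaries through its own administration tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SandboxPolicy:
    """Outer boundaries for one analyst sandbox (hypothetical model)."""
    owner: str
    cost_center: str            # where charge-back is billed (assumed field)
    created: date
    ttl_days: int = 90          # the 90-day expiration control
    max_storage_gb: int = 500   # assumed quota, not a recommendation

    def expires_on(self) -> date:
        """Date on which the sandbox is due for cleanup."""
        return self.created + timedelta(days=self.ttl_days)

    def is_expired(self, today: date) -> bool:
        return today >= self.expires_on()

    def within_quota(self, used_storage_gb: float) -> bool:
        return used_storage_gb <= self.max_storage_gb

    def monthly_chargeback(self, used_storage_gb: float, rate_per_gb: float) -> float:
        """Storage-based charge-back; compute and network could be added similarly."""
        return round(used_storage_gb * rate_per_gb, 2)


# Example usage with made-up values.
policy = SandboxPolicy(owner="analyst1", cost_center="MKT-001", created=date(2024, 1, 1))
print(policy.expires_on(), policy.is_expired(date(2024, 2, 15)))
```

Keeping the limits in one small policy object makes the boundaries visible to the business owner paying the charge-back, not just to the platform team.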
For example, a marketing initiative purchases data to analyze price sensitivity or market basket offerings within a specific segment. Once the team finds a profitable segment, the EUREKA moment, the marketing process begins across production channels. The result is analytics on production customer data, blended with purchased market data, sending leads to production channels.
Controlling access to data tends to occur within technology, sometimes as part of a project, following information security guidelines. As a result, access control tends to be designed from an application or source perspective. This model falls down in several key areas.
- Technology will design access within a project and may not have a deep enough understanding of business utilization to appropriately assess the inherent risk. This drives toward tighter controls than the level of risk warrants.
- Justification processes are performed on an individual basis, which leads to inconsistent entitlements.
- Multitude of sources. Data Lakes need information from many sources. If each source has a different provisioning process, provisioning in the lake becomes more complicated.
One concept for addressing this is to ensure the appropriate first line of defense is defined. For most companies, the first line of defense is the business or department directly, not IT. This reduces, but doesn’t eliminate, the role that technology plays and can enable a more balanced approach to risk and control. It also elevates the accountability of the business in defining the right level of risk.
A second opportunity is to change the design from where data is sourced to where it is consumed. This is commonly known as a role-based provisioning process. Aligning access control with a specific department more closely ties the data needs to relevant policies and justifications. Role-based design also eliminates the complexities of provisioning data from multiple sources.
For example, with a role called FINANCE, all GL, Transaction, and Account related information can be provisioned together. Most roles in finance would have a consistent business need to know. Relevant policies, e.g., Sarbanes-Oxley, would be applied to anyone with this role. On the other hand, a second role called MARKETING could require masked customer data, with relevant Privacy or GLBA policies implemented with this role.
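The FINANCE and MARKETING roles described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference design: the `Role` structure, the dataset names, the masked field names (`name`, `ssn`), and the helper functions are all hypothetical, and a real deployment would rely on the access control features of its data lake platform.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """A consumption-side role: what it can read and which policies follow it."""
    name: str
    datasets: frozenset[str]                 # data domains provisioned with the role
    policies: frozenset[str]                 # policies applied to every holder of the role
    masked_fields: frozenset[str] = frozenset()


# Hypothetical role catalog based on the two roles named in the text.
ROLES = {
    "FINANCE": Role(
        name="FINANCE",
        datasets=frozenset({"gl", "transaction", "account"}),
        policies=frozenset({"Sarbanes-Oxley"}),
    ),
    "MARKETING": Role(
        name="MARKETING",
        datasets=frozenset({"customer"}),
        policies=frozenset({"Privacy", "GLBA"}),
        masked_fields=frozenset({"name", "ssn"}),  # assumed sensitive fields
    ),
}


def can_read(role_name: str, dataset: str) -> bool:
    """Access is decided by the consumer's role, not by the data's source system."""
    role = ROLES.get(role_name)
    return role is not None and dataset in role.datasets


def apply_masking(role_name: str, record: dict) -> dict:
    """Return a copy of the record with the role's masked fields hidden."""
    masked = ROLES[role_name].masked_fields
    return {k: ("***" if k in masked else v) for k, v in record.items()}


# Example usage with a made-up customer record.
print(can_read("FINANCE", "gl"), can_read("MARKETING", "gl"))
print(apply_masking("MARKETING", {"name": "Ann", "segment": "A"}))
```

Because entitlements and policies hang off the role, adding a new source to the lake only means assigning its datasets to existing roles, not creating another provisioning process.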
One Size Does Not Fit All
Each usage of data has a unique context. The performance, quality, ease of use, cost, and speed to decision can all vary. Regulatory reporting may require a GL-reconciled data mart with full lineage from when data is acquired to when it is put on a report. A question such as “Hurricane Katrina is coming; do we have any customers, inventory, or facilities that will be impacted?” has a different level of urgency. In that case, the best data quality at hand is good enough.
Segmenting your data lake to support different analytic needs will help drive the success of its consumers. Many analysts are used to wrangling data from disparate sources. Some information consumers need a more structured organization of the data assets.
Plan for a variety of tiers, perhaps starting with three: RAW, FULLY ORGANIZED, and one in between. Be prepared to allow data to be used across the sandbox, raw, and fully organized tiers.