
Data lakes vs data warehouses: considerations for storing data

I hosted a data management roundtable where one participant started off by saying she had been told to invest in a data lake solution for her firm because it was the latest technology for data management. I asked what her goals were for this data management project. The goals were to get daily, monthly, and quarterly position, risk, and P&L reporting into the hands of their PMs and investors as quickly as possible, using data from their OMS, administrator, and several data providers, including credit, loan, and private market sources. She had been advised by an IT vendor that a data lake would solve all of her data analysis and reporting challenges while also growing to an effectively unlimited scale over time. That’s a bold promise. Her business need is one that applies to all investment managers regardless of size. Is a data lake a good choice?

Data lakes were born from the need to store large amounts of semi-structured and unstructured data that were expected to grow continually over time (“big data”). You can put structured data into a data lake, even though that was not its original intent. Data lakes are frequently cited as a technology solution for IoT and social media big data systems, where upwards of several hundred million producers generate data records all day, every day. In finance, most datasets are not big data. On the buy side, big data isn’t position records; it isn’t trade or order data; it isn’t security master, reference data, or regulatory filings. Even 1 million, 10 million, or 100 million records would not qualify as big data. Keep this scale in mind when somebody suggests a data lake solution.

Data is put into a data lake and stored in raw form, without any pre-processing. The raw form makes it easier to analyze with machine learning (ML) tools, and the lack of pre-processing makes ingestion very fast. Once the data is in, then what? How do you know what data is in there? How do you judge its quality? How do you fix data that turns out to be wrong? And once it’s fixed, how do you know which reports, if any, need to be rerun?
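To make the point concrete, here is a minimal sketch of what raw, schema-less ingestion into a lake typically looks like, assuming an S3-based lake and the boto3 library; the bucket, prefix, and file names are illustrative, not a prescription.

```python
# Minimal sketch: raw, schema-less ingestion into an S3-based data lake.
# Bucket and key names are illustrative. Nothing is validated, tagged, or
# cataloged on the way in, which is exactly why the questions above
# (what is in there, how good is it, what depends on it) are hard to answer later.
import boto3
from datetime import date

s3 = boto3.client("s3")

def ingest_raw(local_path: str, source: str) -> None:
    # Files land as-is, partitioned only by source and load date.
    key = f"raw/{source}/{date.today():%Y/%m/%d}/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, "my-data-lake", key)

ingest_raw("admin_positions_20240131.csv", source="administrator")
```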

A data lake alone cannot help you with those issues. In fact, because it’s so easy to keep adding data in any format, detractors coined the term “data swamp” to describe data lakes at firms that lack robust tagging, indexing, and overall data governance.

The data used to support an investment management firm will be structured and semi-structured. Examples are positions, trades, and security master data from counterparties, prime brokers, and administrators, as well as market data from vendors like Bloomberg, S&P, FactSet, IHS Markit, and others. A field may be added or removed from time to time, but the formats generally stay the same. To cater to firms of all sizes, these datasets are made available as CSV files, with the expectation that hedge fund operations and investment professionals need only Excel to open a file and get the information.
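Because these feeds are tabular and only drift occasionally, a few lines of code can catch the changes that do happen. The sketch below, with hypothetical column names standing in for a typical administrator position extract, shows one way to flag field drift when a vendor CSV is loaded.

```python
# Sketch: loading a vendor position CSV and checking for field drift.
# The file name and column names are hypothetical examples, not a real feed spec.
import pandas as pd

EXPECTED_COLUMNS = {"trade_date", "account", "security_id", "quantity", "market_value"}

positions = pd.read_csv("admin_positions_20240131.csv")

# Flag columns that were added or dropped since the feed was last mapped.
added = set(positions.columns) - EXPECTED_COLUMNS
missing = EXPECTED_COLUMNS - set(positions.columns)
if added or missing:
    print(f"Schema drift detected - added: {added or 'none'}, missing: {missing or 'none'}")
```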

Could this data all be stored in a data lake? Yes, but it’s always better to use technology designed for the problem at hand. Unless you need to store big data and analyze it with ML tools, a modern data warehouse is a better fit. A modern data warehouse should allow its users to ingest data in various formats. It should harness cloud technology to scale, provide redundancy, and ensure data remains within specific regions (as required by GDPR). It should include tagging and indexing to build a data catalog. With the data catalog in place, quality control checks can be implemented and corrective actions defined for your data operations team. Because the datasets are not especially large, they can be versioned and purged after a retention period has elapsed. Any number of BI and reporting tools can then be used to access, analyze, format, and present the data. These are the characteristics and features that will meet the needs of most investment management firms.
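As a flavour of the quality control step, here is a short sketch of warehouse-style checks run before a daily position file is published to reporting. The rules, thresholds, and file name are illustrative assumptions; in practice a data catalog would drive these checks from metadata rather than hard-coding them.

```python
# Sketch of warehouse-style quality control checks run before data is published.
# Rules and the sample file are illustrative; a real catalog would supply them.
import pandas as pd

def quality_checks(positions: pd.DataFrame, as_of: str) -> list[str]:
    issues = []
    if positions["security_id"].isna().any():
        issues.append("positions with a missing security identifier")
    if positions.duplicated(subset=["account", "security_id"]).any():
        issues.append("duplicate account/security rows")
    if (positions["trade_date"] != as_of).any():
        issues.append("rows dated outside the expected business date")
    return issues

problems = quality_checks(pd.read_csv("admin_positions_20240131.csv"), as_of="2024-01-31")
if problems:
    # Route to the data operations team for corrective action rather than loading silently.
    print("Hold load and investigate:", "; ".join(problems))
```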

With so many data lake projects failing to deliver the expected business value, companies are looking to claw back some of their data lake investment by implementing the tagging, indexing, and data catalogs that a modern data warehouse provides. The result is an integrated environment that combines the low-cost storage and machine learning potential of a data lake with the easy-access analytical tools of a data warehouse. This possibility is drawing significant interest from high-profile investors: the latest round of investment in Databricks, a firm pioneering the use of an integrated ‘lakehouse’ environment, increased the company’s valuation to $38 billion. It will be interesting to see whether the lakehouse gains traction over time or remains a niche solution for companies that believe a combination is preferable to two dedicated solutions, each serving different needs and use cases.

References:

1 – https://databricks.com/discover/data-lakes/introduction
2 – https://uk-gdpr.org/chapter-3-article-17/
3 – https://techcrunch.com/2021/08/31/databricks-raises-1-6b-at-38b-valuation-as-it-blasts-past-600m-arr/
