Data Lakes

  • Data lakes are an alternative to data warehouses
    • Typically a data warehouse has a common structured schema (i.e. every piece of data has the same format): it pulls data from different sources and converts it to that schema
    • A data lake contains the original source data in its original formats
    • Tools like Spark can then pull in data in a variety of formats, convert it to dataframes, and join the results into one consistent schema
  • James Dixon coined the term
    • Data lake = a large body of water in its natural state
    • Data mart = akin to a small unit of bottled water
    • Users can dive into the lake and take samples
  • But nobody really knew how to design them
    • Backlash: people started saying that data lakes turn into data swamps
    • A lot is broken both in getting data out of operational systems and in acting on the insights from that data
    • Watch out for people trying to build data lakes with no particular purpose in mind!
  • Data lakes shouldn’t be used as a way for operational systems to talk to each other
    • We want operational systems to share data freely and directly with one another, not via the lake
    • One of the reasons data warehouses came about was to take query pressure off operational systems, so data was replicated and hived off into a separate store for querying
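
The idea of keeping source data in its original formats and converting it to one consistent schema only when it is read can be sketched as below. This is a minimal illustration, not a real pipeline: the notes mention Spark, but the sketch uses only the Python standard library so that it runs on its own, and the raw data, field names and function names are all hypothetical.

```python
import csv
import io
import json

# Hypothetical raw data as it might sit in a lake, each in its source format.
RAW_CUSTOMERS_CSV = "customer_id,name\n1,Ada\n2,Grace\n"
RAW_ORDERS_JSONL = (
    '{"cust": 1, "total": "19.99"}\n'
    '{"cust": 2, "total": "5.00"}\n'
    '{"cust": 1, "total": "3.50"}\n'
)


def read_customers(text):
    """Parse the CSV source and map it onto the common schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"customer_id": int(r["customer_id"]), "name": r["name"]} for r in reader]


def read_orders(text):
    """Parse the JSON-lines source and map it onto the common schema."""
    records = (json.loads(line) for line in text.splitlines() if line.strip())
    return [{"customer_id": int(o["cust"]), "total": float(o["total"])} for o in records]


def join_on_customer(customers, orders):
    """Join the two normalised sources on customer_id (inner join)."""
    names = {c["customer_id"]: c["name"] for c in customers}
    return [
        {"customer_id": o["customer_id"], "name": names[o["customer_id"]], "total": o["total"]}
        for o in orders
        if o["customer_id"] in names
    ]


rows = join_on_customer(read_customers(RAW_CUSTOMERS_CSV), read_orders(RAW_ORDERS_JSONL))
for row in rows:
    print(row)
```

The raw strings are never modified; each reader applies its own mapping at read time, which is the contrast with a warehouse that converts everything to one schema on the way in.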
