Columnar Data: Apache Arrow and Parquet with Julien Le Dem and Jacques Nadeau
Software Engineering Daily
Originally posted on Software Engineering Daily
Column-oriented data storage keeps the values of each column next to each other, so a query can read every entry in one column without touching the rest of the table. Today, columnar storage formats are mostly used for large analytics jobs.
For example, if you are a bank and you want the sum of all the financial transactions that took place on your system in the last week, you don't want to iterate through every row in a database of transactions. It is more efficient to read only the column that holds the amount of money and to skip columns the sum doesn't use, such as user ID.
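To make the bank example concrete, here is a minimal sketch in plain Python of the same sum computed over a row-oriented layout and a column-oriented one. The field names (`timestamp`, `user_id`, `amount_cents`) and the three transactions are made up for illustration, and the layouts are simplified stand-ins that do not use the actual Parquet or Arrow formats.

```python
from array import array

# Hypothetical transactions (made-up values, for illustration only).
rows = [
    {"timestamp": 1700000000, "user_id": 17, "amount_cents": 1250},
    {"timestamp": 1700000060, "user_id": 42, "amount_cents": 9900},
    {"timestamp": 1700000120, "user_id": 17, "amount_cents": 305},
]


def sum_row_oriented(records):
    """Visit every record and pull out the one field we need."""
    total = 0
    for record in records:
        total += record["amount_cents"]
    return total


def to_columns(records):
    """Pivot the records into one contiguous typed array per field."""
    return {
        "timestamp": array("q", (r["timestamp"] for r in records)),
        "user_id": array("q", (r["user_id"] for r in records)),
        "amount_cents": array("q", (r["amount_cents"] for r in records)),
    }


def sum_column_oriented(columns):
    """Read only the amount column; the other columns are never touched."""
    return sum(columns["amount_cents"])


columns = to_columns(rows)
row_total = sum_row_oriented(rows)
column_total = sum_column_oriented(columns)
```

Both functions return the same total. The difference is in how much data each one has to walk past: the row version visits every record, while the column version scans a single packed array of integers.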
Julien Le Dem co-created Parquet, a file format for storing columnar data on disk. Jacques Nadeau is a VP of Apache Arrow, a format for in-memory columnar representation. Both work at Dremio, and they join the show to talk about how columnar data is stored, processed, and shared between systems like Spark, Hadoop, and Python.