
Columnar Data: Apache Arrow and Parquet with Julien Le Dem and Jacques Nadeau

Software Engineering Daily

Originally posted on Software Engineering Daily

Column-oriented data storage allows us to access all of the entries in a database column quickly and efficiently. Columnar storage formats are mostly relevant today for performing large analytics jobs.

For example, if you are a bank and you want the sum of all the financial transactions that took place on your system in the last week, you don’t want to read every field of every row in a database of transactions. It is more efficient to read just the column that holds the amount of money and skip fields like timestamp and user ID.
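
To make the bank example concrete, here is a minimal sketch in plain Python that stores the same transactions in a row layout and in a column layout, then sums the amounts in each. The field names (`timestamp`, `user_id`, `amount_cents`) and the sample values are made up for illustration, and this is not how Parquet or Arrow are implemented; it only shows the difference in access pattern.

```python
from array import array

# Hypothetical transactions in a row-oriented layout:
# each record keeps all of its fields together.
rows = [
    {"timestamp": 1700000000, "user_id": 17, "amount_cents": 1250},
    {"timestamp": 1700000060, "user_id": 42, "amount_cents": 9900},
    {"timestamp": 1700000120, "user_id": 17, "amount_cents": 300},
]

# Summing over rows visits every record, even though only one field is needed.
row_total = sum(row["amount_cents"] for row in rows)

# The same data in a column-oriented layout:
# each field is stored as its own contiguous array of 64-bit integers.
columns = {
    "timestamp": array("q", (r["timestamp"] for r in rows)),
    "user_id": array("q", (r["user_id"] for r in rows)),
    "amount_cents": array("q", (r["amount_cents"] for r in rows)),
}

# Summing a column reads only the amounts; the other columns are never touched.
column_total = sum(columns["amount_cents"])

print(row_total, column_total)
```

Both totals are the same. The difference is in how much data has to be read: the columnar version scans one compact array, while the row version walks through every record to pick out a single field.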

Julien Le Dem co-created Parquet, a file format for storing columnar data on disk. Jacques Nadeau is a VP of Apache Arrow, a format for in-memory columnar representation. They are both part of Dremio, and they join the show to talk about how columnar data is stored, processed, and shared between systems like Spark, Hadoop, and Python.

Sponsors



To understand how your application is performing, you need visibility into your database. VividCortex provides database monitoring for MySQL, Postgres, Redis, MongoDB, and Amazon Aurora. Database uptime, efficiency, and performance can all be measured using VividCortex. You can learn more about how VividCortex works at vividcortex.com/sedaily.


Incapsula is a cloud service that protects applications from attackers and improves performance. Incapsula recognizes and blocks botnets and denial-of-service attacks, so your API servers and microservices do not have to respond to unwanted requests. To try Incapsula, go to incapsula.com/sedaily and get a month free for Software Engineering Daily listeners.


Saagie is an end-to-end data platform that lets you focus on deriving business value from data. Saagie helps you take control of your wide variety of data sources and brings them together in one place. Check it out at Saagie.com.


About the Podcast