Parquet#

Reading and writing Parquet files is supported with format='parquet' if the pyarrow and pandas packages are installed. For writing, the file extensions .parquet or .parq will automatically imply the 'parquet' format. For reading, Parquet files are automatically identified regardless of the extension if the first four bytes of the file are b'PAR1'. In many cases you do not need to explicitly specify format='parquet', but it may be a good idea to do so anyway if there is any ambiguity about the file format.
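The four-byte magic-number check described above can be sketched in plain Python. The helper name `looks_like_parquet` is illustrative only and is not part of the astropy API; astropy performs an equivalent check internally when identifying Parquet files:

```python
import pathlib
import tempfile

def looks_like_parquet(path):
    """Return True if the file starts with the Parquet magic bytes b'PAR1'.

    Illustrative helper only -- not part of the astropy API.
    """
    with open(path, "rb") as fh:
        return fh.read(4) == b"PAR1"

# Demonstrate with a dummy file whose name does not end in .parquet:
with tempfile.TemporaryDirectory() as tmp:
    fake = pathlib.Path(tmp) / "observations.dat"
    fake.write_bytes(b"PAR1" + b"\x00" * 8)  # not a real Parquet file, just the magic bytes
    print(looks_like_parquet(fake))  # True: identified by content, not extension
```

This shows why the extension does not matter for reading: identification is based on the file's leading bytes.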

Multiple-file Parquet datasets are not supported for either reading or writing.

Examples#

To read a table from a Parquet file named observations.parquet, you can do:

>>> t = Table.read('observations.parquet')

To write a table to a new file, simply do:

>>> t.write('new_file.parquet')

As with other formats, the overwrite=True argument is supported for overwriting existing files.
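The overwrite behavior can be mimicked with a small stand-alone sketch. `write_bytes_safely` is a hypothetical helper, not the astropy implementation, but it follows the same rule: refuse to replace an existing file unless overwrite=True is passed.

```python
import os
import tempfile

def write_bytes_safely(path, data, overwrite=False):
    """Write data to path, raising OSError if the file already exists
    and overwrite is False (hypothetical sketch, not astropy code)."""
    if os.path.exists(path) and not overwrite:
        raise OSError(f"File exists: {path} (use overwrite=True to replace it)")
    with open(path, "wb") as fh:
        fh.write(data)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "new_file.parquet")
    write_bytes_safely(target, b"first")       # creates the file
    try:
        write_bytes_safely(target, b"second")  # refused: file exists
    except OSError as err:
        print(err)
    write_bytes_safely(target, b"second", overwrite=True)  # replaces the file
```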

One big advantage of the Parquet format is that each column is stored independently, so reading a subset of columns is fast and efficient. To find out which columns a table contains without reading the data, use the schema_only=True keyword as shown below. This returns a zero-length table with the appropriate columns:

>>> schema = Table.read('observations.parquet', schema_only=True)

To read only a subset of the columns, use the include_names and/or exclude_names keywords:

>>> t_sub = Table.read('observations.parquet', include_names=['mjd', 'airmass'])
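The column-selection semantics can be illustrated with a small pure-Python sketch. `select_columns` is an illustrative helper, not astropy code; it mirrors how the keywords combine: a column is kept if it appears in include_names (when given) and does not appear in exclude_names (when given).

```python
def select_columns(colnames, include_names=None, exclude_names=None):
    """Filter column names the way include_names/exclude_names combine:
    include_names=None means "include everything"; exclusion always wins.
    Illustrative sketch only, not astropy code."""
    selected = []
    for name in colnames:
        if include_names is not None and name not in include_names:
            continue  # an include list was given and this column is not on it
        if exclude_names is not None and name in exclude_names:
            continue  # this column is explicitly excluded
        selected.append(name)
    return selected

columns = ["mjd", "airmass", "ra", "dec"]
print(select_columns(columns, include_names=["mjd", "airmass"]))  # ['mjd', 'airmass']
print(select_columns(columns, exclude_names=["ra", "dec"]))       # ['mjd', 'airmass']
```

Because only the selected columns are read from disk, restricting the column set this way avoids I/O on the excluded columns entirely.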