Apache Parquet Reader/Writer

Apache Parquet is a columnar, file-based storage format that originated in the Apache Hadoop ecosystem. It can be queried efficiently, compresses well, supports null values, and has no native spatial component. It is supported by many Apache big data frameworks, such as Drill, Hive, and Spark.

Parquet is additionally supported by several large-scale query services, such as Amazon Athena, Google BigQuery, and Microsoft Azure Data Lake Analytics.

About Apache Parquet Format Support

Note that this format differs from the Apache Parquet FME package format:

The package format is an easy way to add the Parquet format to an existing FME installation. This format, the Apache Parquet Reader/Writer, is released with FME and has full Bulk Mode functionality.

Parquet Product and System Requirements

Format | FME Desktop License                              | FME Server | FME Cloud | Windows 32-bit | Windows 64-bit | Linux | Mac
-------|--------------------------------------------------|------------|-----------|----------------|----------------|-------|----
Reader | Available in FME Professional Edition and higher | Yes        | Yes       | No             | Yes            | Yes   | Yes
Writer | Available in FME Professional Edition and higher | Yes        | Yes       | No             | Yes            | Yes   | Yes

Parquet File Extensions

A Parquet dataset consists of multiple *.parquet files in a folder, potentially nested into partitions by attribute. For example, a Parquet dataset of customer information, partitioned by account type, might look like this:

  • customers
    • starter
      • Ks4Fju.parquet
      • N7GQGb.parquet
      • DsuO3K.parquet
      • ...
    • plus
      • e0UOXZ.parquet
      • 5htd7H.parquet
      • ...
    • enterprise
      • GqZIV1.parquet
      • GZgJAk.parquet
      • GShRhm.parquet
      • Tz06cp.parquet
      • ...

To write to this dataset, you would select the "customers" folder as the format dataset.
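
For context outside FME, the sketch below shows one way a partitioned dataset of this shape might be produced and read back with pyarrow. The column name account_type, the sample values, and the folder name customers are assumptions for illustration; note that pyarrow names partition folders in Hive style (for example, account_type=starter) rather than with the bare values shown above.

    # Illustrative sketch only (pyarrow, not FME): write and read a Parquet
    # dataset partitioned by an assumed "account_type" column.
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    table = pa.table({
        "customer_id": [1, 2, 3, 4],
        "name": ["Ada", "Grace", "Alan", "Edsger"],
        "account_type": ["starter", "plus", "enterprise", "starter"],
    })

    # Writes one subfolder per partition value under the "customers" folder,
    # each containing one or more *.parquet files.
    pq.write_to_dataset(table, root_path="customers", partition_cols=["account_type"])

    # Reading the top-level folder picks up every *.parquet file in all
    # partitions and exposes the partition value as a regular column.
    dataset = ds.dataset("customers", format="parquet", partitioning="hive")
    print(dataset.to_table())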

Reader Overview

Although Parquet is a columnar format, the Parquet reader produces one feature for each row in the dataset. Bulk Mode bridges this gap between columnar storage and row-based access in FME.

The reader operates on a single *.parquet file, but multiple files that make up a partitioned dataset can be selected. The writer writes a single file per feature type; data can be partitioned using feature type fanout.

A dataset has only one reader feature type.
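
As a rough illustration of that columnar-to-row bridge (again outside FME, and not a description of the reader's internals), the following pyarrow sketch reads a single Parquet file in columnar record batches and then iterates over it row by row. The file path, batch size, and column names are placeholders.

    # Sketch: columnar storage consumed as one record per row.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("customers/enterprise/GqZIV1.parquet")  # placeholder path

    for batch in pf.iter_batches(batch_size=10_000):  # columnar chunks
        for row in batch.to_pylist():                 # one dict per row ("feature")
            print(row["customer_id"], row["name"])    # assumed column names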

Latitude/Longitude and x, y, z coordinates

FME automatically recognizes some common attribute names as potential x, y, and z coordinates and sets their types accordingly.

Parquet data does not necessarily have a spatial component, but columns can be identified as x, y, or z coordinates to create point geometries. If a schema scan is performed and field names contain variations of x/y, east/north, or easting/northing, FME creates point geometries from those columns.

If FME detects latitude and longitude column names (for example, Latitude/Longitude or Lat/Long), the source coordinate system is set to LL-WGS84.
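
The exact name-matching rules are not spelled out beyond the examples above, so the following Python sketch is only a hypothetical heuristic of the same flavour: it scans column names for x/y/z- and latitude/longitude-like labels so that matching columns could be treated as point coordinates.

    # Hypothetical column-name heuristic; not FME's actual matching logic.
    import re

    X_LIKE = re.compile(r"^(x|east(ing)?|lon(g(itude)?)?)$", re.IGNORECASE)
    Y_LIKE = re.compile(r"^(y|north(ing)?|lat(itude)?)$", re.IGNORECASE)
    Z_LIKE = re.compile(r"^(z|elev(ation)?|alt(itude)?)$", re.IGNORECASE)

    def classify_columns(names):
        """Map coordinate roles ('x', 'y', 'z') to the first matching column."""
        roles = {}
        for name in names:
            if X_LIKE.match(name):
                roles.setdefault("x", name)
            elif Y_LIKE.match(name):
                roles.setdefault("y", name)
            elif Z_LIKE.match(name):
                roles.setdefault("z", name)
        return roles

    # A Latitude/Longitude pair maps to y/x, suggesting LL-WGS84 as above.
    print(classify_columns(["Longitude", "Latitude", "name"]))
    # -> {'x': 'Longitude', 'y': 'Latitude'}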

Writer Overview

The Parquet writer writes all the attributes of a feature to a Parquet dataset. The writer operates on a folder and will write a single .parquet file to the folder for each feature type. The file name will be the feature type name. Existing files of the same name will be overwritten. There is a file version option for backwards compatibility with older external readers and a compression option to reduce file size.

Each .parquet file corresponds to a single writer feature type, meaning a file contains features from only one feature type.
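
The file version and compression trade-offs can also be seen outside FME; the sketch below uses pyarrow (its option names, not the writer's keywords) to write one file with an older format version for backwards compatibility and Snappy compression to reduce size.

    # Sketch with pyarrow: format version and compression choices.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"customer_id": [1, 2], "account_type": ["starter", "plus"]})

    pq.write_table(
        table,
        "customers.parquet",
        version="1.0",         # older format version for older external readers
        compression="snappy",  # alternatives include "gzip", "zstd", "none"
    )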

FME Community

Search the FME Community for Parquet.