Apache Parquet Reader/Writer
Apache Parquet is a columnar, file-based storage format that originated in the Apache Hadoop ecosystem. It can be queried efficiently, is highly compressed, supports null values, and has no native spatial component. It is supported by many Apache big data frameworks, such as Drill, Hive, and Spark.
Parquet is also supported by several large-scale query services, such as Amazon Athena, Google BigQuery, and Microsoft Azure Data Lake Analytics.
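Outside of FME, Parquet files can be inspected with common libraries. The following is a minimal sketch using pyarrow; the file name and column are hypothetical and only illustrate the columnar, selectively readable nature of the format.

```python
import pyarrow.parquet as pq

# Read an entire (hypothetical) Parquet file into an Arrow table.
table = pq.read_table("example.parquet")
print(table.schema)  # column names and types, including nullable columns

# Because the format is columnar, individual columns can be read selectively,
# which is part of why Parquet can be queried efficiently.
names_only = pq.read_table("example.parquet", columns=["name"])
```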
About Apache Parquet Format Support
Note that this format replaces the now-deprecated Apache Parquet FME package format.
Parquet Product and System Requirements
| Format | FME Desktop License | FME Server | FME Cloud | Windows | Linux | Mac |
|---|---|---|---|---|---|---|
| Reader | Available in FME Professional Edition and higher | Yes | Yes | 64-bit: Yes | Yes | Yes |
| Writer | Available in FME Professional Edition and higher | Yes | Yes | 64-bit: Yes | Yes | Yes |
- More about FME Licenses and Subscriptions.
- More about FME Desktop Editions and Licenses.
Parquet File Extensions
A Parquet dataset consists of multiple *.parquet files in a folder, potentially nested into partitions by attribute. For example, a Parquet dataset of customer information, partitioned by account type, might look like this:
- customers
  - starter
    - Ks4Fju.parquet
    - N7GQGb.parquet
    - DsuO3K.parquet
    - ...
  - plus
    - e0UOXZ.parquet
    - 5htd7H.parquet
    - ...
  - enterprise
    - GqZIV1.parquet
    - GZgJAk.parquet
    - GShRhm.parquet
    - Tz06cp.parquet
    - ...
To write to this dataset, you would select the "customers" folder as the format dataset.
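For illustration, a folder layout like this could be produced outside FME with pyarrow. This is only a sketch: the column names, values, and randomized file names are assumptions chosen to mirror the example above.

```python
import os
import uuid

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Sample customer data with an account_type column to partition on (assumed schema).
table = pa.table({
    "customer_id": [101, 102, 103, 104],
    "name": ["Ada", "Bram", "Chen", "Dara"],
    "account_type": ["starter", "plus", "enterprise", "starter"],
})

root = "customers"
for account_type in pc.unique(table.column("account_type")).to_pylist():
    # Select the rows for this partition and write them to their own subfolder.
    subset = table.filter(pc.equal(table.column("account_type"), account_type))
    folder = os.path.join(root, account_type)
    os.makedirs(folder, exist_ok=True)
    # Randomized file names, similar in spirit to the Ks4Fju.parquet-style names above.
    pq.write_table(subset, os.path.join(folder, f"{uuid.uuid4().hex[:6]}.parquet"))
```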
Reader Overview
While Parquet is a columnar format, the Parquet reader produces one feature for each row in the dataset. Bulk Mode bridges the gap between columnar storage and FME's row-based feature processing.
The reader operates on a single *.parquet file, but multiple files making up a partitioned dataset can be selected. The writer will write a single file per feature type, but data can be partitioned using feature type fanout.
A dataset has only one reader feature type.
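As a rough analogy outside FME (not the reader's actual implementation), the sketch below reads every *.parquet file under a partitioned folder and exposes the columnar data row by row. The "customers" path and column names refer to the example layout above and are assumptions.

```python
import pyarrow.dataset as ds

# Discover all *.parquet files under the partitioned folder.
dataset = ds.dataset("customers", format="parquet")

# Columns are read in bulk, then rows are materialized from them --
# loosely analogous to one feature per row in FME.
for batch in dataset.to_batches():
    for row in batch.to_pylist():
        print(row["customer_id"], row["account_type"])
```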
Latitude/Longitude and x, y, z coordinates
FME automatically recognizes some common attribute names as potential x,y,z coordinates and sets their types.
Parquet data does not necessarily have a spatial component, but columns can be identified as x, y, or z coordinates to create point geometry. If a schema scan is performed and field names contain variations of x/y, east/north, or easting/northing, FME will create point geometry from those columns.
If FME detects latitude and longitude column names (for example, Latitude/Longitude or Lat/Long), the source coordinate system will be set to LL-WGS84.
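The sketch below gives a rough sense of this kind of column-name matching. It is illustrative only and is not FME's actual detection logic; the patterns and function name are assumptions.

```python
import re

# Illustrative patterns only -- not FME's actual rules.
X_NAMES = re.compile(r"^(x|lon(gitude)?|long|east(ing)?)$", re.IGNORECASE)
Y_NAMES = re.compile(r"^(y|lat(itude)?|north(ing)?)$", re.IGNORECASE)

def guess_point_columns(column_names):
    """Return the first column that looks like x and the first that looks like y."""
    x = next((c for c in column_names if X_NAMES.match(c)), None)
    y = next((c for c in column_names if Y_NAMES.match(c)), None)
    return x, y

print(guess_point_columns(["Longitude", "Latitude", "name"]))  # ('Longitude', 'Latitude')
```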
Writer Overview
The Parquet writer writes all the attributes of a feature to a Parquet dataset. The writer operates on a folder and will write a single .parquet file to the folder for each feature type. The file name will be the feature type name. Existing files of the same name will be overwritten. There is a file version option for backwards compatibility with older external readers and a compression option to reduce file size.
Each .parquet file corresponds to a single writer feature type, meaning a file only contains features from that feature type.
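For comparison, equivalent options exist when writing Parquet directly with pyarrow. The sketch below is an illustration of the file version and compression settings, not FME's writer; the table schema and file name are assumptions.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Minimal table standing in for a feature type's attributes (assumed schema).
table = pa.table({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

pq.write_table(
    table,
    "customers.parquet",  # in FME, the file name comes from the feature type name
    version="1.0",        # older format version for backwards-compatible readers
    compression="gzip",   # smaller files; "snappy", "zstd", etc. are alternatives
)
```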