HTML Table Reader

Licensing options for this format begin with FME Professional Edition.

The HTML Table Reader provides FME with the ability to read table and list data from HTML documents.

Overview

HTML (Hypertext Markup Language) is used on the internet to format documents for display in web browsers. Although its primary purpose is not to store data in machine-readable form, table and list elements often contain useful data. HTML resembles XML, but it is not compatible with strict XML parsing. As a further complication, because web browsers parse leniently, an HTML document does not have to follow the HTML specification fully to display reasonably well.
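The difference between strict XML parsing and lenient HTML parsing can be illustrated with a short Python sketch that uses only the standard library. It is not how FME parses HTML; the sample markup and the names CellCollector, strict_xml_accepts, and lenient_cells are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Markup a browser would tolerate: <td>, <tr>, and <br> are never closed.
SAMPLE = "<table><tr><td>Vancouver<td>662248<br></table>"


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside <td> cells."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new cell starts, even if the previous one was never closed.
            self.in_cell = True
            self.cells.append("")
        elif tag in ("tr", "table"):
            self.in_cell = False

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def strict_xml_accepts(text):
    """Return True only if the text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def lenient_cells(text):
    """Return the cell text recovered by the tolerant HTML parser."""
    parser = CellCollector()
    parser.feed(text)
    parser.close()
    return parser.cells
```

With this sample, the strict XML parser rejects the document because of the unclosed elements, while the tolerant parser still recovers both cell values.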

The HTML Table Reader lists all of the table and list (ul and ol) elements in the HTML document and allows you to select which tables or lists to read. Note that the feature type names for the tables and lists are determined based on the Table Name From reader parameter.
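As a rough illustration of what listing the table and list elements of a document involves, the following Python sketch records every table, ul, and ol start tag in document order. It is an assumption-laden approximation, not the reader's implementation: the names TableListFinder and find_tables_and_lists are hypothetical, and the sketch reports only the tag and its id attribute rather than the feature type name FME would derive from the Table Name From parameter.

```python
from html.parser import HTMLParser


class TableListFinder(HTMLParser):
    """Hypothetical helper: records table, ul, and ol elements as they open."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in ("table", "ul", "ol"):
            # Keep the tag name and its id attribute (None when absent).
            self.found.append((tag, dict(attrs).get("id")))


def find_tables_and_lists(html_text):
    """Return (tag, id) pairs for each table and list, in document order."""
    finder = TableListFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found
```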

Attribute names are taken from the table header when reading an HTML table that contains one. For lists, or tables without a header row, attribute names are generated. HTML tables without a header row will have attributes Col1 through ColN, while columns that contain row headers but have no column heading will be named RowHeading1 through RowHeadingN, where N in both cases is the number of columns. Attribute types in both tables and lists are determined by scanning the data rows.
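The naming rules above can be sketched in Python as follows. This is only an approximation under stated assumptions: the function attribute_names and its parameters are hypothetical, RowHeading names are assumed to be numbered in order of appearance, and a column that has neither a heading nor row headers is assumed to fall back to a positional ColN name. The reader's actual numbering may differ.

```python
def attribute_names(header, column_count, row_header_columns=()):
    """Sketch of the naming rules (hypothetical helper, not FME's code).

    header: list of heading strings ("" or None where a heading is missing),
            or None when the table has no header row.
    row_header_columns: zero-based indexes of columns holding row headers.
    """
    names = []
    row_heading_counter = 0
    for index in range(column_count):
        heading = header[index] if header and index < len(header) else None
        if heading:
            # A column heading exists, so it becomes the attribute name.
            names.append(heading)
        elif index in row_header_columns:
            # Assumption: row-heading columns are numbered as they appear.
            row_heading_counter += 1
            names.append(f"RowHeading{row_heading_counter}")
        else:
            # Assumption: ColN uses the one-based column position.
            names.append(f"Col{index + 1}")
    return names
```

For example, a three-column table without a header row yields Col1, Col2, and Col3, while a header row of "", "2020", "2021" with row headers in the first column yields RowHeading1, 2020, and 2021.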

When reading an HTML table, a feature is produced for every row of the table. A single feature is produced for each HTML list, with the list contents stored in a single attribute called html_list_content.
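The mapping from rows and lists to features can be pictured with a small Python sketch, in which a feature is modelled as a plain dictionary. The helper names table_to_features and list_to_feature are hypothetical, and representing html_list_content as a Python list is an assumption made for illustration; this page does not specify how FME stores the list contents inside that attribute.

```python
def table_to_features(attribute_names, rows):
    """One feature (modelled here as a dict) per table row."""
    return [dict(zip(attribute_names, row)) for row in rows]


def list_to_feature(items):
    """One feature per HTML list, with all items in a single attribute.

    Assumption: the contents are kept as a Python list for illustration.
    """
    return {"html_list_content": list(items)}
```

A table with two data rows therefore produces two features, whereas a list with any number of items produces exactly one.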

HTML File Extensions

By convention, HTML files have the extension .htm or .html. However, web URLs will often have no file extension, or reflect the source script used to generate the HTML output, such as .php or .asp.

Note that URLs that generate HTML pages are valid datasets provided the request to the URL returns valid HTML. The HTML Table Reader allows any file extension when reading from disk.

Reader Overview

The HTML Table Reader parses the selected tables and lists in the document and outputs them as features.

Schema Scanning

Since the values in an HTML table do not have an associated schema, FME scans the table to determine reasonable data types for each attribute. In the case of lists, or tables without a header row, generic attribute names will be generated.
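One way to picture schema scanning is to try progressively looser types on each column until every non-empty value fits. The Python sketch below does this under stated assumptions: the type labels "integer", "real", and "string" and the helpers infer_type and infer_schema are invented for this example and do not describe the types or rules FME actually applies.

```python
def infer_type(values):
    """Return a hypothetical type label for one column of cell text."""
    non_empty = [value.strip() for value in values if value.strip()]

    def all_parse(cast):
        # True when every non-empty value converts without error.
        for value in non_empty:
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if not non_empty:
        return "string"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "real"
    return "string"


def infer_schema(attribute_names, rows):
    """Scan all data rows and map each attribute name to an inferred type."""
    columns = zip(*rows)  # transpose rows into columns
    return {name: infer_type(column) for name, column in zip(attribute_names, columns)}
```

A single value that fails to parse as a number, such as "n/a", is enough to make the whole column a string in this sketch, which is why the scan has to look at the data rows rather than at a single sample.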

Workbench Reader Dataset

The value for the Reader Dataset is the path or URL to an HTML document.