HTMLExtractor

Extracts structured data from web pages or other HTML sources that are formatted for human readability (screen scraping), using CSS selectors to extract portions of the HTML content into feature attributes.

Typical Uses

  • Extracting content from a web page

How does it work?

The HTMLExtractor lets you define multiple queries to run against incoming HTML content, which can be provided either as an attribute or as a file. Each query consists of an output attribute name, a CSS selector that identifies which tags to extract, and a choice of what to extract from each match: whole tags, values, text, or HTML attributes.

You may either extract the first matching tag only, or keep multiple results as a list attribute.
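The query model described above can be illustrated outside FME with a minimal Python sketch. It is not the HTMLExtractor's implementation: the names SelectorQuery and extract, the sample PAGE, and the as_list flag are hypothetical, and only a tiny subset of CSS selectors is assumed (tag, .class, tag.class, and #id). It shows a selector picking tags, a choice between text and an HTML attribute, and first-match versus list results.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SelectorQuery(HTMLParser):
    """Hypothetical helper: collects elements matching 'tag', '.class', 'tag.class' or '#id'."""

    def __init__(self, selector):
        super().__init__()
        self.want_id = self.want_tag = self.want_class = None
        if selector.startswith("#"):
            self.want_id = selector[1:]
        else:
            tag, _, cls = selector.partition(".")
            self.want_tag = tag or None
            self.want_class = cls or None
        self.stack = []    # open tags as (tag, record or None)
        self.matches = []  # one record per matching element, in document order

    def _is_match(self, tag, attrs):
        if self.want_id is not None:
            return attrs.get("id") == self.want_id
        if self.want_tag and tag != self.want_tag:
            return False
        classes = (attrs.get("class") or "").split()
        if self.want_class and self.want_class not in classes:
            return False
        return True

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        record = None
        if self._is_match(tag, attrs):
            record = {"attrs": attrs, "text": []}
            self.matches.append(record)
        if tag not in VOID_TAGS:
            self.stack.append((tag, record))

    def handle_endtag(self, tag):
        # Close the most recent open tag of this name, dropping unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text belongs to every matched element that is currently open.
        for _, record in self.stack:
            if record is not None:
                record["text"].append(data)


def extract(html, selector, what="text", attribute=None, as_list=False):
    """Return the first match (or None), or all matches when as_list is True."""
    parser = SelectorQuery(selector)
    parser.feed(html)
    parser.close()
    if what == "text":
        values = ["".join(m["text"]).strip() for m in parser.matches]
    else:  # what == "attribute"
        values = [m["attrs"].get(attribute) for m in parser.matches]
    if as_list:
        return values
    return values[0] if values else None


# Made-up sample content, for illustration only.
PAGE = """
<html><body>
  <h1 id="title">City Parks</h1>
  <ul>
    <li class="park"><a href="/parks/1">North Park</a></li>
    <li class="park"><a href="/parks/2">River Park</a></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    print(extract(PAGE, "#title"))                                # first match, text
    print(extract(PAGE, "li.park", as_list=True))                 # all matches, text
    print(extract(PAGE, "a", what="attribute", attribute="href")) # first match, attribute
```

In this sketch, each call to extract plays the role of one query: the selector chooses the tags, the what argument chooses the kind of value, and as_list decides whether only the first result or every result is kept.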

The HTMLExtractor is better suited to HTML content than the XML transformers or regular expression searches, because its parsing is more lenient and its filters can withstand minor changes to page content.
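The difference between lenient HTML parsing and strict XML parsing can be seen in a short Python sketch. It uses Python's standard library as a stand-in and says nothing about how FME's transformers are built; the names TextCollector, html_text, xml_parses, and the SLOPPY sample are hypothetical. The markup below has unclosed li, p, and br tags, which is common in real pages.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up markup with unclosed <li>, <p> and <br> tags.
SLOPPY = "<ul><li>First<li>Second</ul><p>Unclosed paragraph<br>"


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers non-empty text chunks from HTML."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_text(markup):
    """Parse markup leniently and return its text chunks in document order."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


def xml_parses(markup):
    """Return True if a strict XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(html_text(SLOPPY))   # the lenient parser still yields the text
    print(xml_parses(SLOPPY))  # the strict parser rejects the same input
```

The lenient parser recovers all three pieces of text, while the strict XML parser refuses the input outright because the tags are not balanced.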

Examples

Usage Notes

Configuration

Input Ports

Output Ports

Parameters

Editing Transformer Parameters

Using a set of menu options, transformer parameters can be assigned by referencing other elements in the workspace. More advanced functions, such as an advanced editor and an arithmetic editor, are also available in some transformers. To access a menu of these options, click beside the applicable parameter. For more information, see Transformer Parameter Menu Options.

Defining Values

There are several ways to define a value for use in a transformer. The simplest is to type in a value or string, which can include functions of various types such as attribute references, math and string functions, and workspace parameters. A number of tools and shortcuts can assist in constructing values; these are generally available from the drop-down context menu adjacent to the value field.

Dialog Options - Tables

Transformers with table-style parameters have additional tools for populating and manipulating values.

Reference

Processing Behavior  Feature-Based
Feature Holding      No
Dependencies         None
Aliases
History              Released: FME 2017.0

FME Community

The FME Community is the place for demos, how-tos, articles, FAQs, and more. Get answers to your questions, learn from other users, and suggest, vote, and comment on new features.

Search for all results about the HTMLExtractor on the FME Community.

Examples may contain information licensed under the Open Government Licence – Vancouver and/or the Open Government Licence – Canada.