HTMLExtractor

Example: Extracting links from a web page

In this portion of a workspace, all of the links on a web page will be extracted and output as a list attribute.

An HTTPCaller retrieves the contents of a web page, using the GET method. The contents of the page are stored as HTML in the _response_body attribute.

In the HTMLExtractor, the same attribute is set as the HTML source, and a query is constructed to find all links (CSS Selector = a[href]), extract only the link itself (Tag Part/HTML Attribute = href), and store it in a new attribute called links.

The Return Format is set to List Attribute, and so all matches will be included.

The output will look similar to this:

links{0} = ‘https://www.example.com/page1.html’

links{1} = ‘https://www.example.com/page2.html’

links{2} = ‘https://www.example.com/page3.html’
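For readers who want to see the effect of this query outside a workspace, the following is a minimal Python sketch of what selecting a[href] and extracting href amounts to. It is not how the HTMLExtractor is implemented. It uses only the standard library, and the response_body string and the extract_links helper are hypothetical stand-ins for the _response_body attribute and the Extract Query.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one (like a[href])."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href":
                # An href with no value is stored as an empty string.
                self.links.append(value or "")
                break  # only one href is taken per tag


def extract_links(html):
    """Hypothetical helper: return all href values in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical stand-in for the contents of the _response_body attribute.
response_body = """
<html><body>
  <a href="https://www.example.com/page1.html">Page 1</a>
  <a name="anchor-only">No link here</a>
  <a href="https://www.example.com/page2.html">Page 2</a>
</body></html>
"""

links = extract_links(response_body)
for i, link in enumerate(links):
    print(f"links{{{i}}} = {link}")
```

The second anchor tag has no href, so it does not match the selector and contributes nothing to the list.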

Example: Extracting text content for a given “div”

In this portion of a workspace, an HTTPCaller uses the GET method to retrieve the contents of a web page and store them in the attribute _response_body.

In the HTMLExtractor, a query is constructed to find the div tag with the id “article” (CSS Selector = div#article). The contents of that tag will be extracted (Tag Part/HTML Attribute = Value), and output to the new attribute articleText.

With the Return Format set to First Match, the contents of the first matching div tag encountered will be output as an ordinary (non-list) attribute.
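As a rough illustration of this query, the Python sketch below collects the text inside the first div whose id is "article" and ignores everything else. It is a standard-library approximation under stated assumptions, not the transformer's own logic: the first_div_text helper and the sample page are hypothetical, and treating Value as whitespace-normalized text content is an assumption.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects the text inside the first <div> whose id matches."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.matched = False  # True once the target div has been seen
        self.depth = 0        # div nesting depth inside the target; 0 = outside
        self.done = False     # True once the target div has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the target
        elif dict(attrs).get("id") == self.element_id:
            self.matched = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # later matches are ignored (first match only)

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def first_div_text(html, element_id):
    """Hypothetical helper: text of the first matching div, or None."""
    parser = DivTextCollector(element_id)
    parser.feed(html)
    parser.close()
    if not parser.matched:
        return None
    # Assumption: collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())


# Hypothetical stand-in for the contents of the _response_body attribute.
response_body = """
<div id="nav">Home</div>
<div id="article">
  <h1>Title</h1>
  <p>First paragraph.</p>
  <div class="note">Nested note.</div>
</div>
<div id="footer">Footer</div>
"""

articleText = first_div_text(response_body, "article")
print(articleText)
```

The result is a single string, which corresponds to an ordinary (non-list) attribute.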

Input

This transformer accepts any feature.

Output

Features with attributes containing the results of Extract Queries.

If an error occurs, the feature will be output via the <Rejected> port, with information about the error contained in the fme_rejection_code and fme_rejection_message attributes.

Rejected Feature Handling can be set to either terminate the translation or continue running when a rejected feature is encountered. This setting is available both as a default FME option and as a workspace parameter.

HTML Source

HTML Input

The type of source. Choices include:

File
Content

HTML Content

If HTML Input is set to Content, HTML content can either be specified directly in the HTML Content field, or set to the value of an attribute.

HTML File

If HTML Input is set to File, the path to an input HTML file can be specified.

Extract Queries

Target Attribute

The name of the attribute that will hold the results of the query.

CSS Selector

A CSS selector which specifies a tag or set of tags in the HTML document or content.

A list of selectors can be found at:

CSS Selector Reference

Tag Part/HTML Attribute

This parameter can be set to one of the following:

Value: the value inside the HTML tag will be extracted
Whole: the entire tag will be extracted

Alternatively, an HTML attribute name (such as “href” or “alt”) can be entered. This will result in the attribute being extracted from the tag.
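The difference between the three choices can be sketched in Python on a single tag. This is only an illustration: it assumes a well-formed fragment so that the standard xml.etree.ElementTree parser can read it, it assumes Value corresponds to the text inside the tag, and the extract_part helper is hypothetical rather than part of FME.

```python
import xml.etree.ElementTree as ET


def extract_part(element, part):
    """Hypothetical helper mirroring the Tag Part/HTML Attribute choices."""
    if part == "Value":
        # Assumption: Value is the text found inside the tag.
        return "".join(element.itertext())
    if part == "Whole":
        # The entire tag, including its attributes and contents.
        return ET.tostring(element, encoding="unicode")
    # Anything else is treated as an HTML attribute name, such as href or alt.
    return element.get(part)


# A single well-formed tag used as sample input.
tag = ET.fromstring('<a href="/docs/index.html" title="Docs">Documentation</a>')

value = extract_part(tag, "Value")
whole = extract_part(tag, "Whole")
href = extract_part(tag, "href")
alt = extract_part(tag, "alt")  # attribute not present on this tag

print(value)
print(whole)
print(href)
print(alt)
```

In this sketch a missing attribute yields None; how the transformer itself reports a missing attribute is not covered here.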

Output

Return Format

If this is set to First Match, the target attributes will contain only the first element found that matches the query.

If set to List Attributes, the target attributes will be lists, and will contain all results matching the query.
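The two return formats can be pictured as two ways of turning the same set of matches into attributes. The Python sketch below is a hypothetical model, not FME code: apply_return_format is an invented helper, the list naming follows the links{0} style shown in the first example, and returning no attributes when nothing matches is an assumption.

```python
def apply_return_format(target, matches, return_format):
    """Hypothetical helper: map query matches to attribute names and values."""
    if not matches:
        return {}  # assumption: no match means no attribute is created
    if return_format == "First Match":
        # Only the first element found is kept, as an ordinary attribute.
        return {target: matches[0]}
    # Otherwise every match is kept, as elements of a list attribute.
    return {f"{target}{{{i}}}": value for i, value in enumerate(matches)}


matches = [
    "https://www.example.com/page1.html",
    "https://www.example.com/page2.html",
]

first = apply_return_format("links", matches, "First Match")
as_list = apply_return_format("links", matches, "List Attributes")

print(first)
print(as_list)
```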

Additional Tools

Row Reordering

Enabled once a row in the Extract Queries table has been selected. Choices include:

Add a row
Remove a row (Action set to Remove)
Move current row up one
Move current row down one
Move current row to top
Move current row to bottom

How to Set Parameter Values

Using the Text Editor

The Text Editor provides a convenient way to construct text strings (including regular expressions) from various data sources, such as attributes, parameters, and constants, where the result is used directly inside a parameter.

Text Editor

Using the Arithmetic Editor

The Arithmetic Editor provides a convenient way to construct math expressions from various data sources, such as attributes, parameters, and feature functions, where the result is used directly inside a parameter.

Arithmetic Editor

Conditional Values

Set values depending on one or more test conditions that either pass or fail.

Parameter Condition Definition Dialog

Content

Expressions and strings can include a number of functions, characters, parameters, and more - whether entered directly in a parameter or constructed using one of the editors.

Content Types

String Functions	These functions manipulate and format strings.
Special Characters	A set of control characters is available in the Text Editor.
Math Functions	Math functions are available in both editors.
Math Operators	These operators are available in the Arithmetic Editor.
FME Feature Functions	These return primarily feature-specific values.
FME Parameters	FME and workspace-specific parameters may be used.
Working with User Parameters	Create your own editable parameters.

Related Transformers

HTTPCaller

Processing Behavior	Feature-Based
Feature Holding	No
Dependencies	None
FME Licensing Level	FME Professional Edition and above
Aliases
History	Released: FME 2017.0
Categories	Integrations Strings Web Workflows
