HtmlParseFilter (apache-nutch 1.23-SNAPSHOT API)
Package org.apache.nutch.parse
Interface HtmlParseFilter
All Superinterfaces:
Configurable, Pluggable
All Known Implementing Classes:
CCParseFilter
DebugParseFilter
HeadingsParseFilter
HTMLLanguageParser
JSParseFilter
MetaTagsParser
NaiveBayesParseFilter
RegexParseFilter
RelTagParser
public interface HtmlParseFilter extends Pluggable, Configurable
Extension point for DOM-based HTML parsers. Implementations can add
metadata to HTML parses. All plugins that implement this extension
point are run sequentially on the parse.
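The sequential execution described above can be pictured with a small, self-contained sketch. It does not use the Nutch API: `MetadataFilter` is a hypothetical stand-in for `HtmlParseFilter`, a plain `Map` stands in for the parse metadata, and the keys and values are invented for illustration. It assumes that each filter receives the result returned by the filter before it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FilterChainSketch {

    // Hypothetical stand-in for HtmlParseFilter (not part of the Nutch API):
    // takes the current parse metadata and returns the possibly modified metadata.
    interface MetadataFilter {
        Map<String, String> filter(Map<String, String> parseMetadata);
    }

    // Runs every filter in list order; each filter sees the previous filter's output.
    static Map<String, String> runAll(List<MetadataFilter> filters, Map<String, String> initial) {
        Map<String, String> current = new LinkedHashMap<>(initial);
        for (MetadataFilter f : filters) {
            current = f.filter(current);
        }
        return current;
    }

    // Two illustrative filters; the metadata keys are invented for the example.
    static List<MetadataFilter> sampleFilters() {
        List<MetadataFilter> filters = new ArrayList<>();
        filters.add(meta -> {
            Map<String, String> out = new LinkedHashMap<>(meta);
            out.put("lang", "en");
            return out;
        });
        filters.add(meta -> {
            Map<String, String> out = new LinkedHashMap<>(meta);
            // Depends on the entry added by the first filter, so order matters.
            out.put("summary", "lang=" + out.getOrDefault("lang", "unknown"));
            return out;
        });
        return filters;
    }

    public static void main(String[] args) {
        Map<String, String> initial = new LinkedHashMap<>();
        initial.put("title", "Example page");
        System.out.println(runAll(sampleFilters(), initial));
    }
}
```

Because the second filter reads what the first one wrote, swapping the two would change the result, which is why the order in which plugins run can matter.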
Field Summary
Fields
Modifier and Type | Field | Description
static final String | X_POINT_ID | The name of the extension point.
Method Summary
Modifier and Type | Method | Description
ParseResult | filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Details
X_POINT_ID
static final String X_POINT_ID
The name of the extension point.
Method Details
filter
ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Adds metadata or otherwise modifies a parse of HTML content, given the DOM
tree of a page.
Parameters:
content - the Content for a given response
parseResult - the result of running one or more Parsers on the content
metaTags - a populated HTMLMetaTags object
doc - a DocumentFragment (DOM) that can be processed during filtering
Returns:
a filtered ParseResult
See Also:
Parser.getParse(Content)
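As an illustration of the kind of DOM processing a filter might perform on the doc parameter, the following sketch walks a DocumentFragment and collects the text of heading elements. It uses only the JDK's org.w3c.dom and javax.xml.parsers classes, not the Nutch API. The class name `HeadingWalkSketch`, its methods, and the hand-built sample fragment are all hypothetical. A real filter would receive the fragment from the parser and would typically store what it finds in the parse metadata.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class HeadingWalkSketch {

    // Collects the trimmed text of every element named tagName (case-insensitive),
    // in document order.
    static List<String> collectText(Node root, String tagName) {
        List<String> found = new ArrayList<>();
        walk(root, tagName, found);
        return found;
    }

    // Depth-first traversal over the node and all of its descendants.
    private static void walk(Node node, String tagName, List<String> found) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tagName)) {
            found.add(node.getTextContent().trim());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, tagName, found);
        }
    }

    // Builds a small fragment by hand so the sketch runs without an HTML parser.
    static DocumentFragment sampleFragment() {
        Document document;
        try {
            document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        DocumentFragment fragment = document.createDocumentFragment();
        Element body = document.createElement("body");
        fragment.appendChild(body);
        body.appendChild(textElement(document, "h1", " First heading "));
        body.appendChild(textElement(document, "p", "Some text."));
        // Upper-case name, since HTML parsers may differ in how they case element names.
        body.appendChild(textElement(document, "H1", "Second heading"));
        return fragment;
    }

    private static Element textElement(Document document, String name, String text) {
        Element element = document.createElement(name);
        element.setTextContent(text);
        return element;
    }

    public static void main(String[] args) {
        System.out.println(collectText(sampleFragment(), "h1"));
    }
}
```

The tag name is compared case-insensitively because the casing of element names in the fragment depends on the HTML parser that produced it.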