ScoringFilter (apache-nutch 1.23-SNAPSHOT API)
Package org.apache.nutch.scoring

Interface ScoringFilter

All Superinterfaces:
Configurable, Pluggable

All Known Implementing Classes:
AbstractScoringFilter, DepthScoringFilter, LinkAnalysisScoringFilter, MetadataScoringFilter, OPICScoringFilter, OrphanScoringFilter, ScoringFilters, SimilarityScoringFilter, URLMetaScoringFilter

public interface ScoringFilter
extends Configurable, Pluggable
A contract defining behavior of scoring plugins.

A scoring filter will manipulate scoring variables in CrawlDatum and in
resulting search indexes. Filters can be chained in a specific order, to
provide multi-stage scoring adjustments.
Author:
Andrzej Bialecki
Field Summary

static final String X_POINT_ID
The name of the extension point.
Method Summary

CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Distribute score value from the current page to all its outlinked pages.

float generatorSortValue(Text url, CrawlDatum datum, float initSort)
This method prepares a sort value for the purpose of sorting and selecting
top N scoring pages during fetchlist generation.

float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
This method calculates an indexed document score/boost.

void initialScore(Text url, CrawlDatum datum)
Set an initial score for newly discovered pages.

void injectedScore(Text url, CrawlDatum datum)
Set an initial score for newly injected pages.

default void orphanedScore(Text url, CrawlDatum datum)
This method may change the score or status of CrawlDatum during CrawlDb
update, when the URL is neither fetched nor has any inlinks.

void passScoreAfterParsing(Text url, Content content, Parse parse)
Currently a part of score distribution is performed using only data coming
from the parsing process.

void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
This method takes all relevant score information from the current datum
(coming from a generated fetchlist) and stores it into Content metadata.

void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
This method calculates a new score of CrawlDatum during CrawlDb update,
based on the initial value of the original CrawlDatum, and also score
values contributed by inlinked pages.
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Details
X_POINT_ID
static final String X_POINT_ID
The name of the extension point.
Method Details
injectedScore
void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException
Set an initial score for newly injected pages. Note: newly injected pages
may have no inlinks, so filter implementations may wish to set this score
to a non-zero value, to give newly injected pages some initial credit.
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException - if there is a fatal error setting an initial score for newly injected pages
initialScore
void initialScore(Text url, CrawlDatum datum) throws ScoringFilterException
Set an initial score for newly discovered pages. Note: newly discovered
pages have at least one inlink with its score contribution, so filter
implementations may choose to set the initial score to zero (unknown value);
the inlink score contribution will then set the "real" value of the new
page.
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException - if there is a fatal error setting an initial score for newly discovered pages
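The contrast between injectedScore and initialScore can be sketched without any Nutch dependency. In the minimal example below, SeedDatum is a stand-in for CrawlDatum, and the credit value 1.0 is an assumption made for illustration, not a documented default.

```java
/** Stand-in for CrawlDatum: only the score field matters for this sketch. */
final class SeedDatum {
    float score;
}

/** Hypothetical policy: credit injected seeds, leave discovered pages at zero. */
final class InitialScoreSketch {
    // Assumed seed credit, chosen for this example only.
    static final float INJECTED_CREDIT = 1.0f;

    /** Injected pages may have no inlinks, so they get a non-zero starting score. */
    static void injectedScore(String url, SeedDatum datum) {
        datum.score = INJECTED_CREDIT;
    }

    /** Discovered pages start at zero; inlink contributions supply the real value later. */
    static void initialScore(String url, SeedDatum datum) {
        datum.score = 0.0f;
    }

    public static void main(String[] args) {
        SeedDatum seed = new SeedDatum();
        injectedScore("http://example.com/", seed);

        SeedDatum discovered = new SeedDatum();
        initialScore("http://example.com/a", discovered);

        System.out.println("seed=" + seed.score + " discovered=" + discovered.score);
    }
}
```

Both methods modify the datum in place, mirroring the "Filters will modify it in-place" note on the datum parameter.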
generatorSortValue
float generatorSortValue(Text url, CrawlDatum datum, float initSort) throws ScoringFilterException
This method prepares a sort value for the purpose of sorting and selecting
top N scoring pages during fetchlist generation.
Parameters:
url - url of the page
datum - page's datum, should not be modified
initSort - initial sort value, or a value from previous filters in chain
Returns:
a sort value for use in sorting and selecting the top N scoring pages during fetchlist generation
Throws:
ScoringFilterException - if there is a fatal error preparing the sort value
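The initSort parameter is what makes chaining work: each filter receives the value returned by the previous one. The sketch below illustrates that hand-off with a stand-in SortValueStep interface (a plain String and float replace Text and CrawlDatum); both filters and the query-string penalty are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in for one filter's generatorSortValue; datumScore replaces the CrawlDatum. */
interface SortValueStep {
    float generatorSortValue(String url, float datumScore, float initSort);
}

final class SortChainSketch {
    /** Feeds each filter's result to the next filter as its initSort. */
    static float chain(List<SortValueStep> filters, String url, float datumScore, float initSort) {
        float sort = initSort;
        for (SortValueStep filter : filters) {
            sort = filter.generatorSortValue(url, datumScore, sort);
        }
        return sort;
    }

    public static void main(String[] args) {
        // Hypothetical filter 1: weight the incoming sort value by the page score.
        SortValueStep byScore = (url, score, init) -> score * init;
        // Hypothetical filter 2: halve the sort value of URLs with a query string.
        SortValueStep halveQueries = (url, score, init) -> url.contains("?") ? init * 0.5f : init;

        List<SortValueStep> filters = Arrays.asList(byScore, halveQueries);
        System.out.println(chain(filters, "http://example.com/a?b=1", 2.0f, 1.0f));
    }
}
```

Neither step mutates its inputs, in line with the note that the datum should not be modified here.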
passScoreBeforeParsing
void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) throws ScoringFilterException
This method takes all relevant score information from the current datum
(coming from a generated fetchlist) and stores it into Content metadata.
This is needed to pass these values to the mechanism that distributes them
to outlinked pages.
Parameters:
url - url of the page
datum - source datum. NOTE: modifications to this value are not persisted.
content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
Throws:
ScoringFilterException - if there is a fatal error injecting score information from the current datum into Content metadata
passScoreAfterParsing
void passScoreAfterParsing(Text url, Content content, Parse parse) throws ScoringFilterException
Currently a part of score distribution is performed using only data coming
from the parsing process. This method is needed to ensure that score data
is present in those steps.
Parameters:
url - page url
content - original content. NOTE: modifications to this value are not persisted.
parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
Throws:
ScoringFilterException - if there is a fatal error processing score data in subsequent steps after parsing
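Taken together, passScoreBeforeParsing and passScoreAfterParsing relay a score from the datum through the content metadata into the parse metadata. The sketch below models both metadata containers as plain maps; the key name "example.score" is hypothetical and chosen only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class ScorePassingSketch {
    // Hypothetical metadata key, used only in this example.
    static final String SCORE_KEY = "example.score";

    /** Before parsing: copy the fetchlist datum's score into the content metadata. */
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    /** After parsing: copy the score from the content metadata into the parse metadata. */
    static void passScoreAfterParsing(Map<String, String> contentMeta, Map<String, String> parseMeta) {
        String score = contentMeta.get(SCORE_KEY);
        if (score != null) {
            parseMeta.put(SCORE_KEY, score);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();

        passScoreBeforeParsing(0.75f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);

        System.out.println(parseMeta.get(SCORE_KEY));
    }
}
```

Each step writes only to its target container (content first, then parse), which matches the notes that changes to the source arguments are not persisted.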
distributeScoreToOutlinks
CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) throws ScoringFilterException
Distribute score value from the current page to all its outlinked pages.
Parameters:
fromUrl - url of the source page
parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
targets - <url, CrawlDatum> pairs. NOTE: filters can modify this in-place, all changes will be persisted.
adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
allCount - number of all collected outlinks from the source page
Returns:
if needed, implementations may return an instance of CrawlDatum, with status CrawlDatum.STATUS_LINKED, which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
Throws:
ScoringFilterException - if there is a fatal error distributing score data from the current page to all of its outlinks
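One simple distribution strategy is an equal split: every target receives the page score divided by allCount. The sketch below shows only that arithmetic and is not the behavior of any particular Nutch plugin; OutlinkDatum stands in for CrawlDatum, the page score is passed directly instead of being read from ParseData metadata, and dividing by allCount rather than by the number of targets is a choice made for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Map;

/** Stand-in for CrawlDatum. */
final class OutlinkDatum {
    float score;
}

final class EqualSplitSketch {
    /**
     * Gives every target an equal share of pageScore, based on allCount.
     * Returns adjust unchanged (null when the caller passed null), because
     * this strategy needs no adjustment of the source page.
     */
    static OutlinkDatum distributeScoreToOutlinks(float pageScore,
            Collection<Map.Entry<String, OutlinkDatum>> targets,
            OutlinkDatum adjust, int allCount) {
        if (allCount <= 0) {
            return adjust; // nothing to distribute
        }
        float share = pageScore / allCount;
        for (Map.Entry<String, OutlinkDatum> target : targets) {
            target.getValue().score = share; // modified in place
        }
        return adjust;
    }

    public static void main(String[] args) {
        Collection<Map.Entry<String, OutlinkDatum>> targets = new ArrayList<>();
        targets.add(new AbstractMap.SimpleEntry<>("http://example.com/a", new OutlinkDatum()));
        targets.add(new AbstractMap.SimpleEntry<>("http://example.com/b", new OutlinkDatum()));

        distributeScoreToOutlinks(1.0f, targets, null, 4);
        for (Map.Entry<String, OutlinkDatum> target : targets) {
            System.out.println(target.getKey() + " -> " + target.getValue().score);
        }
    }
}
```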
updateDbScore
void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) throws ScoringFilterException
This method calculates a new score of CrawlDatum during CrawlDb update,
based on the initial value of the original CrawlDatum, and also score
values contributed by inlinked pages.
Parameters:
url - url of the page
old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values, because the datum parameter may contain values that are no longer valid if other updates occurred between generation and this update.
datum - the new datum, with the original score saved at the time when fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
inlinked - (partial) list of CrawlDatum instances (with their scores) from links pointing to this page, found in the current update batch.
Throws:
ScoringFilterException - if there is a fatal error calculating a new score of CrawlDatum during CrawlDb update
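The rule about preferring old over datum as the starting value can be illustrated with an additive update. This is a sketch under stated assumptions: DbDatum is a stand-in for CrawlDatum, and summing the inlink contributions is just one possible policy.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in for CrawlDatum. */
final class DbDatum {
    float score;

    DbDatum(float score) {
        this.score = score;
    }
}

final class UpdateScoreSketch {
    /**
     * Starts from old.score when old is present (datum.score may be stale),
     * otherwise from datum.score, then adds the inlink contributions.
     * The result is written to datum in place.
     */
    static void updateDbScore(DbDatum old, DbDatum datum, List<DbDatum> inlinked) {
        float base = (old != null) ? old.score : datum.score;
        float contributed = 0.0f;
        for (DbDatum inlink : inlinked) {
            contributed += inlink.score;
        }
        datum.score = base + contributed;
    }

    public static void main(String[] args) {
        DbDatum old = new DbDatum(1.0f);
        DbDatum datum = new DbDatum(5.0f); // stale value from fetchlist generation
        updateDbScore(old, datum, Arrays.asList(new DbDatum(0.25f), new DbDatum(0.5f)));
        System.out.println(datum.score);
    }
}
```

For a newly discovered page, old is null and the update starts from whatever initial score the datum already carries.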
orphanedScore
default void orphanedScore(Text url, CrawlDatum datum) throws ScoringFilterException
This method may change the score or status of CrawlDatum during CrawlDb
update, when the URL is neither fetched nor has any inlinks.
Parameters:
url - URL of the page
datum - CrawlDatum for page
Throws:
ScoringFilterException - if there is a fatal error whilst changing the score or status of CrawlDatum during CrawlDb update, when the URL is neither fetched nor has any inlinks
indexerScore
float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) throws ScoringFilterException
This method calculates an indexed document score/boost.
Parameters:
url - url of the page
doc - indexed document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
dbDatum - current page from CrawlDb. NOTE: changes made to this instance are not persisted. May be null if indexing is done without CrawlDb or if the segment is generated not from the CrawlDb (via FreeGenerator).
fetchDatum - datum from FetcherOutput (containing among others the fetching status)
parse - parsing result. NOTE: changes made to this instance are not persisted.
inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
initScore - initial boost value for the indexed document.
Returns:
boost value for the indexed document. This value is passed as an argument to the next scoring filter in chain. NOTE: implementations may also express other scoring strategies by modifying the indexed document directly.
Throws:
ScoringFilterException - if there is a fatal error whilst calculating the indexed document score/boost
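Because dbDatum may be null, an implementation needs a fallback before it reads the CrawlDb score. The sketch below shows one hedged possibility: it returns initScore unchanged when no CrawlDb score is available and otherwise scales initScore by the score raised to a power. The nullable Float parameter stands in for the dbDatum argument, and SCORE_POWER is a hypothetical tuning value.

```java
final class IndexerScoreSketch {
    // Hypothetical exponent that dampens large CrawlDb scores.
    static final float SCORE_POWER = 0.5f;

    /**
     * dbScore is null when no CrawlDb datum is available; in that case the
     * incoming boost is passed through unchanged to the next filter.
     */
    static float indexerScore(Float dbScore, float initScore) {
        if (dbScore == null) {
            return initScore;
        }
        return (float) Math.pow(dbScore, SCORE_POWER) * initScore;
    }

    public static void main(String[] args) {
        // Chaining: the second call receives the first call's result as initScore.
        float first = indexerScore(4.0f, 1.0f);
        float second = indexerScore(null, first);
        System.out.println(first + " " + second);
    }
}
```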