tarexp.evaluation module#
Consistent implementation of effectiveness metrics, including tricky issues such as tiebreaking, is critical to TAR experiments.
This is true both for evaluation and because stopping rules may incorporate effectiveness estimates based on small samples.
We provide all metrics from the open source package ir-measures through the tarexp.workflow.Workflow.getMetrics() method.
Metrics are computed on both the full collection and unreviewed documents to support both finite population and generalization
perspectives.
In addition to standard IR metrics, TARexp implements OptimisticCost (tarexp.evaluation.OptimisticCost) to
support the idealized end-to-end cost analysis for TAR proposed in Yang et al. [1]. Such an analysis requires specifying a
target recall and a cost structure associated with the TAR process. TARexp also provides helper functions for plotting
cost dynamics graphs (tarexp.helper.plotting).
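As a quick illustration, here is a minimal sketch that builds a list of measures mixing an ir-measures metric with an OptimisticCost instance (documented below). The recall target and cost values are arbitrary placeholders; passing the list to Workflow.getMetrics() is an assumption about that method's calling convention, while evaluate() below accepts such a list explicitly.

```python
import ir_measures
from tarexp.evaluation import OptimisticCost

# A standard IR metric provided via ir-measures.
p10 = ir_measures.parse_measure("P@10")

# The idealized end-to-end cost measure of Yang et al. [1]; the 80% recall
# target and the (1, 1, 5, 5) cost structure are illustrative values only.
cost = OptimisticCost(target_recall=0.8, cost_structure=(1, 1, 5, 5))

measures = [p10, cost]
```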
- class tarexp.evaluation.MeasureKey(measure: str | ir_measures.measures.Measure = None, section: str = 'all', target_recall: float = None)[source]#
Bases: object
Hashable key for an evaluation metric.
- measure: str | ir_measures.measures.Measure = None#
Name of the measurement.
- section: str = 'all'#
Part of the collection that the evaluation is measured on. Possible values include, but are not limited to, “all” and “known” (reviewed documents).
- target_recall: float = None#
The recall target. Can be None, depending on whether the measure requires one.
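As a brief illustration, MeasureKey instances can be built with the keyword constructor shown in the signature above (a sketch; the field values are arbitrary):

```python
from tarexp.evaluation import MeasureKey

# Key identifying precision@10 measured over the full collection.
key_all = MeasureKey(measure="P@10", section="all")

# Same measure restricted to reviewed ("known") documents.
key_known = MeasureKey(measure="P@10", section="known")
```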
- class tarexp.evaluation.CostCountKey(measure: str | ir_measures.measures.Measure = None, section: str = 'all', target_recall: float = None, label: bool = None)[source]#
Bases: MeasureKey
Hashable key for recording counts of documents.
- measure#
Name of the measurement.
- section#
Part of the collection that the evaluation is measured on. Possible values include, but are not limited to, “all” and “known” (reviewed documents).
- target_recall#
The recall target. Can be None, depending on whether the measure requires one.
- label: bool = None#
The ground truth label that the measure is counting.
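A similar sketch for a count key, again assuming the keyword constructor shown in the signature; the recall target is illustrative:

```python
from tarexp.evaluation import CostCountKey

# Count of reviewed ("known") documents whose ground-truth label is
# positive, tracked under an 80% recall target.
pos_reviewed = CostCountKey(section="known", target_recall=0.8, label=True)
```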
- class tarexp.evaluation.OptimisticCost(measure: str | ir_measures.measures.Measure = None, section: str = 'all', target_recall: float = None, cost_structure: tuple[float, float, float, float] = None)[source]#
Bases: MeasureKey
Optimistic Cost
The cost measure that records the total cost of reviewing documents in both the first and (an optimal) second phase of the review workflow. Please refer to Yang et al. [1] for further details.
- measure#
Name of the measurement.
- section#
Part of the collection that the evaluation is measured on. Possible values include, but are not limited to, “all” and “known” (reviewed documents).
- target_recall#
The recall target. Can be None, depending on whether the measure requires one.
- cost_structure: tuple[float, float, float, float] = None#
Four-tuple cost structure. The elements are the unit costs of reviewing a positive and a negative document in the first phase, and a positive and a negative document in the second phase, respectively.
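The sketch below spells out the ordering of the tuple; the unit costs and recall target are illustrative values, not settings taken from Yang et al. [1]:

```python
from tarexp.evaluation import OptimisticCost

cost_measure = OptimisticCost(
    target_recall=0.8,  # illustrative recall target
    cost_structure=(
        1,  # unit cost: positive document, first-phase review
        1,  # unit cost: negative document, first-phase review
        5,  # unit cost: positive document, second-phase review
        5,  # unit cost: negative document, second-phase review
    ),
)
```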
- static calc_all(measures, df)[source]#
Static method for calculating multiple OptimisticCost measures given a cost Pandas DataFrame.
The dataframe should contain "query_id", "iteration", "relevance" (as ground truth), "score", "control" (boolean values indicating whether the document is in the control set), and "known" (whether the document has been reviewed) as columns, and all documents in the collection as rows. This dataframe is similar to the one used in ir_measures.calc_all(). This method is also used internally by evaluate().
- Parameters:
measures – A list of OptimisticCost instances to calculate.
df – The cost Pandas DataFrame.
- Returns:
A Python dictionary with the measures as keys and the measurement values as values. The count for each section, given all recall targets provided in the measures argument, is also returned as auxiliary information.
- Return type:
dict
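A minimal sketch of calc_all() on a toy cost DataFrame with the columns described above; the document values, and the convention used for the "iteration" of unreviewed documents, are assumptions made for illustration:

```python
import pandas as pd
from tarexp.evaluation import OptimisticCost

# One row per document in the collection.
df = pd.DataFrame({
    "query_id":  ["topic-1"] * 6,
    "iteration": [0, 0, 1, 1, -1, -1],           # assumed: -1 marks not-yet-reviewed documents
    "relevance": [1, 0, 1, 0, 1, 0],              # ground-truth labels
    "score":     [0.9, 0.2, 0.8, 0.4, 0.7, 0.1],  # current classifier scores
    "control":   [False] * 6,                     # none of these are control-set documents
    "known":     [True, True, True, True, False, False],  # reviewed so far
})

measures = [OptimisticCost(target_recall=0.8, cost_structure=(1, 1, 5, 5))]
results = OptimisticCost.calc_all(measures, df)
```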
- tarexp.evaluation.evaluate(labels, ledger: Ledger, score, measures) → Dict[MeasureKey, int | float][source]#
Evaluate a TAR run based on a given tarexp.ledger.Ledger.
This function calculates the evaluation metrics based on the provided Ledger. The measures are evaluated on the last round recorded in the ledger. To calculate metrics for a past round, provide a ledger that only contains information up to that round by using tarexp.ledger.Ledger.freeze_at().
It serves as a catch-all function for all evaluation metrics TARexp supports, including all measurements in ir-measures and OptimisticCost. Future additions to the supported evaluation metrics should also be added to this function for completeness.
- Parameters:
labels – The ground-truth labels of the documents. These differ from the labels recorded in the Ledger, which are the review results (not necessarily the ground truth unless a tarexp.components.labeler.PerfectLabeler is used).
ledger – The ledger that recorded the progress of the run.
score – A list of the document scores.
measures – A list of MeasureKey instances, names of measurements supported in ir-measures, or ir-measures measurement objects (such as ir_measures.P@10).
- Returns:
A Python dictionary with MeasureKey instances as keys and the corresponding measurement values as values.
- Return type:
dict[MeasureKey, int | float]
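A usage sketch, assuming `ledger` is a populated tarexp.ledger.Ledger from a TAR run (optionally frozen at an earlier round with freeze_at()) and that `labels` and `scores` are sequences aligned with the collection's document order:

```python
from tarexp.evaluation import OptimisticCost, evaluate

measures = [
    "P@10",  # a measure name understood by ir-measures
    OptimisticCost(target_recall=0.8, cost_structure=(1, 1, 5, 5)),
]

# `labels`, `ledger`, and `scores` are placeholders for objects produced
# by an actual TAR run.
results = evaluate(labels, ledger, scores, measures)

for key, value in results.items():
    # Each key is a MeasureKey (or subclass); key.section indicates whether
    # the value was computed on "all" documents or only "known" ones.
    print(key, value)
```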