tarexp.dataset module
A dataset contains the essential information of the collection for retrieval. It is designed to be a static variable throughout the TAR run.
Documents must be encoded as vectors, but the encoding is not limited to any particular form. Depending on the intended experiments, the vectors can be generated by the scikit-learn TF-IDF vectorizer or even by Huggingface Transformers tokenizers. We leave this flexibility to users for further extensions.
Ground truth labels are essential when running experiments without human intervention. They are also used in most evaluations, since these require ground truth. If the workflow is designed to run with actual human review, the labels are no longer required.
- class tarexp.dataset.Dataset(name=None)
- Bases: object
- Meta class of TARexp datasets.
- The class defines the basic features of a dataset. All downstream datasets that inherit from this class should implement the following properties and methods:
- Essentials
- identifier
- The unique identifier of the dataset. It is used to verify that the dataset provided is identical when resuming a workflow. The identifier should summarize both the vectors and the labels with a hash that depends on the actual content, not on the memory location of the variable (as, e.g., the built-in hash() function can). The utility function tarexp.util.stable_hash() provides this capability.
 
- ingest()
- Ingests a list of raw texts into vectors and stores them in the attribute _vectors.
 
- getAllData()
- (Optional) Returns the vectors of all documents in the dataset. Ideally, it should return a copy of the vectors, but it may return a reference if the collection is too large to copy in memory. This meta class implements a simple version, but users implementing a new dataset class should consider re-implementing it to support on-demand processing of the vectors (such as collators in PyTorch).
 
- getTrainingData()
- (Optional) Takes a tarexp.ledger.Ledger as an argument and returns the vectors of the reviewed documents along with their labels from the ledger. This meta class also implements a simple version, but subclasses should consider re-implementing it for the same reason as getAllData().
 
- duplicate()
- Returns a copy of the dataset along with any information that should be copied. This method should perform a deep copy of all contained objects to avoid shared memory references that would prevent fast multi-processing.
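The identifier requirement in the Essentials list can be illustrated with a small, standard-library-only sketch. The helper `content_hash` is hypothetical and is not how `tarexp.util.stable_hash()` is implemented; it only shows the idea of hashing the content of the vectors and labels, so that two separately created but equal collections produce the same identifier.

```python
import hashlib


def content_hash(vectors, labels=None):
    """Hypothetical helper: digest of the content, not of the object identity."""
    digest = hashlib.sha256()
    for row in vectors:
        # repr() of a tuple of floats is the same in every Python process.
        digest.update(repr(tuple(row)).encode("utf-8"))
    digest.update(b"|labels|")  # separator between vectors and labels
    if labels is not None:
        digest.update(repr(tuple(labels)).encode("utf-8"))
    return digest.hexdigest()


vectors_a = [[0.0, 1.5], [2.0, 0.0]]
vectors_b = [[0.0, 1.5], [2.0, 0.0]]  # equal content, separate object

# Equal content gives the same identifier; changing the labels changes it.
same_content = content_hash(vectors_a, [1, 0]) == content_hash(vectors_b, [1, 0])
different_labels = content_hash(vectors_a, [1, 0]) != content_hash(vectors_a, [0, 1])
```

A digest of this kind can be compared across processes and machines, which is what resuming a workflow needs.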
 
 
- Labels (Optional) - labels
- The labels of the dataset. We recommend implementing this information as a property instead of an attribute of the class to prevent the labels from being modified by accident during the workflow. If the labels are intended to be unavailable, please consider raising a NotImplemented exception instead of NotImplementedError to reflect the intention.
 
- pos_doc_ids and neg_doc_ids
- The sets of positive and negative document ids.
 
- setLabels()
- Returns a new dataset that contains the labels of all documents in the dataset. It should also check that every document is assigned a label. Spawning a new instance ensures that the original dataset instance is not polluted.
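As a rough illustration of the label conventions listed here, the sketch below uses a hypothetical `ToyDataset` class (not part of TARexp) with plain Python lists. It assumes that a length mismatch raises `ValueError`, and it deliberately leaves out the handling of unavailable labels.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset class; not part of TARexp."""

    def __init__(self, vectors, labels=None):
        self._vectors = vectors
        self._labels = labels

    @property
    def n_docs(self):
        return len(self._vectors)

    @property
    def hasLabels(self):
        return self._labels is not None

    @property
    def labels(self):
        # Hand out a copy so callers cannot change the stored labels.
        # Handling of unavailable labels is left out of this sketch.
        return list(self._labels)

    @property
    def pos_doc_ids(self):
        return {i for i, y in enumerate(self._labels) if y}

    @property
    def neg_doc_ids(self):
        return {i for i, y in enumerate(self._labels) if not y}

    def setLabels(self, labels):
        # Every document must receive a label.
        if len(labels) != self.n_docs:
            raise ValueError("expected one label per document")
        # Spawn a new instance so the original stays unlabeled.
        return ToyDataset(self._vectors, list(labels))


base = ToyDataset([[1.0], [2.0], [3.0]])
task = base.setLabels([1, 0, 1])

leaked = task.labels
leaked[0] = 0  # mutating the returned copy does not touch the dataset
```

Because `labels` is a read-only property that returns a copy, accidental writes during a run affect only the caller's copy, and `base` can be reused to spawn further tasks.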
 
 
 - property identifier
 - property name
 - property n_docs
- Number of documents in the dataset.
 - property labels
 - property hasLabels
 - property pos_doc_ids: set
 - property neg_doc_ids: set
 
- class tarexp.dataset.SparseVectorDataset(vectorizer=None)
- Bases: Dataset
- Dataset with SciPy sparse matrix.
- Parameters:
- vectorizer – A function or a class instance that has a fit_transform method (such as the vectorizers from scikit-learn).
 - property n_docs
- Number of documents in the dataset.
 - property labels
- Returns a copy of the labels of all documents.
 - property identifier
 - property pos_doc_ids: set
- Returns the ids of the positive documents.
 - property neg_doc_ids: set
- Returns the ids of the negative documents.
 - ingest(text, force=False)
- Ingest the text using the vectorizer and store the vectors in this instance.
- Parameters:
- text – A list of texts that will be ingested and stored. If the labels are set, the length of the list should be identical to the length of the labels.
- force – Whether to skip the check that the length of the text matches the length of the labels.
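The interplay between the vectorizer, the length check, and force can be sketched without any third-party dependency. Both `ToyCountVectorizer` and the free function `ingest` below are hypothetical illustrations, not the library's code; the sketch assumes that a failed length check raises `ValueError` and that a plain function is called directly on the list of texts.

```python
from collections import Counter


class ToyCountVectorizer:
    """Hypothetical vectorizer exposing fit_transform, like scikit-learn's."""

    def fit_transform(self, texts):
        vocab = sorted({tok for t in texts for tok in t.lower().split()})
        index = {tok: i for i, tok in enumerate(vocab)}
        # One sparse row per document: {column index: term count}.
        return [
            {index[tok]: n for tok, n in Counter(t.lower().split()).items()}
            for t in texts
        ]


def ingest(vectorizer, text, labels=None, force=False):
    """Sketch of the length check followed by vectorization."""
    if labels is not None and not force and len(text) != len(labels):
        raise ValueError("text and labels differ in length")
    # Accept either an object with fit_transform or a plain function.
    transform = getattr(vectorizer, "fit_transform", vectorizer)
    return transform(text)


vectors = ingest(ToyCountVectorizer(), ["a b a", "b c"], labels=[1, 0])
```

Any object with a compatible `fit_transform` method could be passed in place of the toy vectorizer.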
 
 
 - setLabels(labels, inplace=False)
- Returns a new dataset with new labels.
- Parameters:
- labels – A list or NumPy array of binary labels. The length should match the number of documents in the dataset.
- inplace – Whether to apply this set of labels to the current dataset. If True, the method replaces the labels and returns None. Default False.
 
 
 - duplicate(deep=False)
- Duplicate the dataset.
- Parameters:
- deep – Whether to perform deep copy on the vectors. Default - False.
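The difference between the two settings of deep can be shown with the standard `copy` module. `CopyDemoDataset` is a hypothetical stand-in that keeps its vectors in nested lists; copying the labels in both modes is an assumption of this sketch, not documented behavior.

```python
import copy


class CopyDemoDataset:
    """Hypothetical stand-in; not part of TARexp."""

    def __init__(self, vectors, labels=None):
        self._vectors = vectors
        self._labels = labels

    def duplicate(self, deep=False):
        # Assumption of this sketch: labels are copied in both modes,
        # while the vectors are copied only when deep=True.
        vectors = copy.deepcopy(self._vectors) if deep else self._vectors
        labels = None if self._labels is None else list(self._labels)
        return CopyDemoDataset(vectors, labels)


original = CopyDemoDataset([[0.0, 1.0], [2.0, 0.0]], labels=[1, 0])
shallow = original.duplicate()
deep = original.duplicate(deep=True)

shares_vectors = shallow._vectors is original._vectors  # same object in memory
independent = deep._vectors is not original._vectors    # separate copy
```

A shallow duplicate is cheap but shares the vector storage with the original, whereas a deep duplicate costs memory and time proportional to the collection size and is fully independent.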
 
 - classmethod from_sparse(matrix)
- Create a SparseVectorDataset instance from a sparse matrix.
 
- class tarexp.dataset.TaskFeeder(dataset: Dataset, labels: Any)
- Bases: object
- Python iterator that yields review tasks with different sets of labels given the same base dataset (a dataset without labels).
- This class supports iteration in a for loop or with the next() function, as well as index lookup with [] if the list of labels provided has already been materialized (not an iterator).
- Parameters:
- dataset – A Dataset instance that does not contain any labels. This instance will spawn downstream tasks with different labels.
- labels – One of the following:
- If a Python dictionary is provided, the keys are used as the names of the tasks and the values are the corresponding labels. Every set of labels should have the same length as the number of documents in the base dataset.
- If a Pandas DataFrame is provided, the columns are considered to be the tasks and the column names are used as the task names. The number of rows in the DataFrame should be the same as the number of documents in the base dataset.
- If an iterator is provided, the length of the labels is not checked against the number of documents in the base dataset and a warning is raised. The iterator should yield tuples of the task name and the corresponding labels. The order of the yielded tasks should also be stable, especially when running experiments across multiple machines. If the iterator supports length via __len__, the class will respect it.
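The dictionary and iterator cases described for labels can be sketched as follows. `ToyTaskFeeder` is a hypothetical, simplified class: it takes the number of documents instead of a base dataset, yields (task name, labels) pairs rather than labeled datasets, and assumes that [] looks tasks up by name. None of these choices is confirmed by the documentation above.

```python
import warnings


class ToyTaskFeeder:
    """Hypothetical, simplified feeder; not the TARexp implementation."""

    def __init__(self, n_docs, labels):
        if isinstance(labels, dict):
            for name, task_labels in labels.items():
                if len(task_labels) != n_docs:
                    raise ValueError(f"task {name!r} has the wrong number of labels")
            self._tasks = dict(labels)  # materialized, so lookup is possible
            self._source = iter(self._tasks.items())
        else:
            # Lazy input: lengths cannot be verified up front.
            warnings.warn("label lengths are not checked for iterator input")
            self._tasks = None
            self._source = iter(labels)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._source)  # a (task name, labels) pair

    def __getitem__(self, name):
        if self._tasks is None:
            raise TypeError("lookup needs materialized labels")
        return self._tasks[name]


feeder = ToyTaskFeeder(3, {"topic_a": [1, 0, 0], "topic_b": [0, 1, 1]})
first = next(feeder)  # works with next() ...
rest = list(feeder)   # ... and with for-style iteration
```

Python dictionaries preserve insertion order, so the dictionary input yields tasks in a stable order; a custom iterator has to guarantee this itself.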