BaseDataset
Bases: JSONLMixin, BaseTextDataset
Source code in src/llm_datasets/datasets/jsonl_dataset.py
get_document_from_item(item)
Returns the document, including its text field, built from item. Dataset classes can override this method, for example to implement filtering.
get_text_from_item(item)
Returns the text field from item. Dataset classes can override this method, for example to implement filtering.
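A minimal sketch of the override pattern described above. It does not use the library's classes: SketchJSONLDataset is a hypothetical stand-in, and the "text" key, the min_chars threshold, and the convention of returning None to skip an item are all assumptions made for illustration.

```python
from typing import Optional


class SketchJSONLDataset:
    """Hypothetical stand-in for a JSONL-backed dataset class (not the library's code)."""

    def get_text_from_item(self, item: dict) -> Optional[str]:
        # Default behaviour: return the text field unchanged.
        # The field name "text" is an assumption of this sketch.
        return item["text"]


class MinLengthDataset(SketchJSONLDataset):
    """Subclass that overrides the hook to filter out very short texts."""

    min_chars = 20  # assumed threshold, chosen only for this example

    def get_text_from_item(self, item: dict) -> Optional[str]:
        text = super().get_text_from_item(item)
        # Returning None to mean "skip this item" is an assumption of this sketch.
        if text is None or len(text) < self.min_chars:
            return None
        return text
```

With this layout, the filtering logic lives entirely in the subclass, and the code that iterates over items only has to skip None values.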
get_texts()
get_texts_with_multi_proc()
get_texts_with_single_proc()
Iterate over all input files and read JSON from each line.
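The single-process read described above could look roughly like the following. This is a sketch under stated assumptions, not the library's implementation: the function name iter_texts_single_proc, the text_field parameter, and the choice to skip blank lines are all hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, Union


def iter_texts_single_proc(
    paths: Iterable[Union[str, Path]], text_field: str = "text"
) -> Iterator[str]:
    """Yield the text field of every JSON line in every input file, in order."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    # Skipping blank lines is an assumption of this sketch.
                    continue
                item = json.loads(line)  # one JSON object per line
                yield item[text_field]
```

Because the function is a generator, files are read lazily, so memory use stays bounded by the longest line instead of the total dataset size.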