## Integration with other frameworks
LLM-Datasets can be used in combination with your own processing pipelines or integrated into other frameworks, for example Hugging Face's DataTrove.
### DataTrove integration
Hugging Face's DataTrove is a library for processing, filtering, and deduplicating text data at very large scale. All datasets implemented in LLM-Datasets can be processed with DataTrove.
To do so, use the `LLMDatasetsDatatroveReader` class as the input for any DataTrove pipeline. The class takes a dataset ID (or a list of IDs) and a config loaded from config files as arguments, as shown in the example below:
```python
from datatrove.pipeline.filters import SamplerFilter
from datatrove.pipeline.writers import JsonlWriter

from llm_datasets.datatrove_reader import LLMDatasetsDatatroveReader
from llm_datasets.utils.config import Config, get_config_from_paths

# Load the LLM-Datasets config from one or more YAML files.
llmds_config: Config = get_config_from_paths(["path/to/my/config.yaml"])

pipeline = [
    # Read the "legal_mc4_en" dataset via LLM-Datasets.
    LLMDatasetsDatatroveReader("legal_mc4_en", llmds_config),
    # Randomly keep ~50% of the documents.
    SamplerFilter(rate=0.5),
    # Write the remaining documents as JSONL files.
    JsonlWriter(
        output_folder="/my/output/path"
    ),
]
```
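
The list above only defines the pipeline; to actually execute it, it must be passed to one of DataTrove's executors. As a minimal sketch, continuing the example above, the pipeline can be run on a single machine with DataTrove's `LocalPipelineExecutor` (the `tasks` value here is an arbitrary example):

```python
from datatrove.executor import LocalPipelineExecutor

# Run the pipeline locally, splitting the work across 4 parallel tasks.
executor = LocalPipelineExecutor(pipeline=pipeline, tasks=4)
executor.run()
```

The same pipeline definition can also be handed to one of DataTrove's distributed executors (e.g., its Slurm executor) when processing at larger scale.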