Getting Started
Installation
Install the llm-datasets package with pip:
In order to keep the package minimal by default, llm-datasets comes with optional dependencies useful for some use cases.
For example, if you want to have the text extraction for all available datasets, run:
Quick start
Download and text extraction
To download and extract the plain-text of one or more datasets, run the following command:
By default, output is saved as JSONL files. To change the output format, you can use the --output_format argument as below:
Available datasets
A list or table with all available datasets can be print with the follow command:
Pipeline commands
usage: llm-datasets <command> [<args>]
positional arguments:
  {chunkify,collect_metrics,compose,convert_parquet_to_jsonl,extract_text,hf_upload,print_stats,shuffle,train_tokenizer}
                        llm-datasets command helpers
    chunkify            Split the individual datasets into equally-sized file chunks (based on bytes or rows)
    collect_metrics     Collect metrics (token count etc.) from extracted texts
    compose             Compose the final train/validation set based on the individual datasets
    convert_parquet_to_jsonl
                        Convert Parquet files to JSONL
    extract_text        Extract text from raw datasets
    hf_upload           Upload files or directories to Huggingface Hub.
    print_stats         Print dataset statistics as CSV, Markdown, ...
    shuffle             Shuffle the individual datasets on the file-chunk level (no global shuffle!)
    train_tokenizer     Train a tokenizer (only: sentencepiece supproted)
options:
  -h, --help            show this help message and exit