Compose dataset

The pipeline step that produces the final training or validation set is the compose step. Before you run this command, you should specify in the config files what datasets should be selected and how they should be sampled.

llm-datasets compose –-split=train –-configs=my_dataset.yaml \
    --text_data_dir=/data/my_text_data \
    --composed_data_dir=/data/my_composed_data/train/

Depending on the your system (especially IO-speed) and dataset size this step can take a substantial amount of time (> 24 hours for a 1T token dataset).