The LLM Dataset Generator is an open-source tool designed to facilitate the creation of text datasets using various language models. This repository provides a framework for generating and collecting text data, supporting research and development in natural language processing (NLP) and related fields.
This project is released under the CC0 1.0 Universal license. It is freely available for use by anyone for research, personal development, or commercial purposes without restriction. For more details, please refer to the LICENSE file.
Check the following for actual output examples:
For Jupyter Notebook execution results, see:
You can check the code in phi-3-text-generator/.
To start the Ollama container and install the Phi-3:14B model, follow these steps:
- Docker Desktop must be installed.
- Open a terminal and pull the Ollama Docker image from Docker Hub:
  docker pull ollama/ollama:latest
- Run the Ollama container with the following command:
  docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
- Verify that the container is running:
  docker ps
  Example output:
  CONTAINER ID   IMAGE                  COMMAND               CREATED         STATUS         PORTS                      NAMES
  aa492e7068d7   ollama/ollama:latest   "/bin/ollama serve"   9 seconds ago   Up 8 seconds   0.0.0.0:11434->11434/tcp   ollama
- Check that Ollama is running correctly:
  curl localhost:11434
  You should see a message indicating that Ollama is running.
- Pull the Phi-3:14B model:
  docker exec -it ollama ollama pull phi3:medium
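
The last two checks can also be scripted. The following is a minimal sketch, assuming the container publishes Ollama on localhost:11434 as in the docker run command above. It reads Ollama's /api/tags endpoint, which lists the locally installed models, and looks for phi3:medium. The helper names are hypothetical and are not part of this repository.

```python
import json
import urllib.request

# Assumption: Ollama is published on the default port, as in the docker run command.
OLLAMA_BASE_URL = "http://localhost:11434"


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def installed_model_names(tags_payload):
    """Extract model names from the body returned by GET /api/tags."""
    return [model.get("name", "") for model in tags_payload.get("models", [])]


def model_is_installed(tags_payload, model="phi3:medium"):
    """Return True if the given model tag appears in the /api/tags body."""
    return model in installed_model_names(tags_payload)


if __name__ == "__main__":
    try:
        payload = fetch_json(f"{OLLAMA_BASE_URL}/api/tags")
    except (OSError, ValueError) as exc:
        # OSError covers connection failures; ValueError covers a non-JSON body.
        print(f"Could not query Ollama at {OLLAMA_BASE_URL}: {exc}")
    else:
        if model_is_installed(payload):
            print("phi3:medium is installed.")
        else:
            print("phi3:medium not found; installed models:",
                  installed_model_names(payload))
```

If the script reports that the model is missing, rerun the pull command from the last step.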
Before running phi-3-text-generator.py, you can modify several constants in the file.
- OLLAMA_BASE_URL:
  - If running inside the Docker container: "http://host.docker.internal:11434"
  - If running locally: "http://localhost:11434"
- TARGET_SENTENCE_COUNT: Specify the number of sentences to generate.
- OUTPUT_FILE_COUNT: Specify the number of output files to generate.
- OUTPUT_DIRECTORY: Specify the directory where the generated files will be saved.
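
As an illustration, the constants block might look like the sketch below. The four constant names come from the list above; the numeric values are placeholders chosen for the example, not the project's defaults, and check_constants is a hypothetical helper that does not exist in the script.

```python
# Constant names are taken from this README; the values are illustrative only.
OLLAMA_BASE_URL = "http://localhost:11434"  # use host.docker.internal inside a container
TARGET_SENTENCE_COUNT = 100  # placeholder value
OUTPUT_FILE_COUNT = 5        # placeholder value
OUTPUT_DIRECTORY = "out"     # the README describes "out" as the default location


def check_constants(base_url, sentence_count, file_count, output_directory):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if not base_url.startswith(("http://", "https://")):
        problems.append("OLLAMA_BASE_URL should start with http:// or https://")
    if sentence_count < 1:
        problems.append("TARGET_SENTENCE_COUNT should be at least 1")
    if file_count < 1:
        problems.append("OUTPUT_FILE_COUNT should be at least 1")
    if not str(output_directory).strip():
        problems.append("OUTPUT_DIRECTORY should not be empty")
    return problems


if __name__ == "__main__":
    issues = check_constants(
        OLLAMA_BASE_URL, TARGET_SENTENCE_COUNT, OUTPUT_FILE_COUNT, OUTPUT_DIRECTORY
    )
    print("Settings look usable." if not issues else "\n".join(issues))
```

A quick sanity check like this catches a mistyped URL or a zero count before a long generation run starts.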
- Open the phi-3-text-generator.py file and set the constants as needed.
- Install Poetry:
  pip install poetry
- Install the project dependencies. From the project's root directory, run:
  poetry install
- Activate the virtual environment:
  poetry shell
- Run the script with the following command:
  python phi-3-text-generator.py
  By default, the generated files will be saved in an out directory created in the same directory as the script.
- Navigate to the directory where the script was executed.
- Check for the out directory:
  ls out
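
The same check can be done from Python, which is convenient on systems without ls. This is a minimal sketch, assuming the default out directory; list_generated_files is a hypothetical helper, and nothing is assumed about the names or format of the generated files.

```python
from pathlib import Path


def list_generated_files(output_directory="out"):
    """Return the regular files in the output directory, sorted by path."""
    directory = Path(output_directory)
    if not directory.is_dir():
        # The script has not been run yet, or it wrote somewhere else.
        return []
    return sorted(path for path in directory.iterdir() if path.is_file())


if __name__ == "__main__":
    files = list_generated_files()
    if not files:
        print("No generated files found in 'out'.")
    for path in files:
        print(f"{path.name}\t{path.stat().st_size} bytes")
```

Pass a different path to list_generated_files if OUTPUT_DIRECTORY was changed.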