LLM Dataset Generator

Overview

The LLM Dataset Generator is an open-source tool designed to facilitate the creation of text datasets using various language models. This repository provides a framework for generating and collecting text data, supporting research and development in natural language processing (NLP) and related fields.

License

This project is released under the CC0 1.0 Universal license. It is freely available for use by anyone for research, personal development, or commercial purposes without restriction. For more details, please refer to the LICENSE file.

Examples

For actual output examples, see the Jupyter Notebook execution results in:

phi-3-text-generator.ipynb

Setup Guide for Ollama and Phi-3 Text Generator

The code for this guide is in the phi-3-text-generator/ directory.

1. Creating and Starting the Ollama Container

To start the Ollama container and install the Phi-3:14B model, follow these steps:

Prerequisites

Docker Desktop must be installed.

Steps

Open a terminal and pull the Ollama Docker image from Docker Hub:
```
docker pull ollama/ollama:latest
```

Run the Ollama container with the following command. The --gpus=all flag assumes Docker has access to an NVIDIA GPU; omit it to run on the CPU only:

```
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

Verify that the container is running:

```
docker ps
```

Example output:

```
CONTAINER ID   IMAGE                 COMMAND                  CREATED          STATUS          PORTS                     NAMES
aa492e7068d7   ollama/ollama:latest  "/bin/ollama serve"      9 seconds ago    Up 8 seconds    0.0.0.0:11434->11434/tcp  ollama
```

Check if Ollama is running correctly:
```
curl localhost:11434
```
You should see a message indicating that Ollama is running.
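The same check can be scripted, which may be handy before starting a long generation run. The following is a minimal sketch using only the Python standard library; the function name `is_ollama_up` and its defaults are illustrative and not part of this repository.

```python
import urllib.error
import urllib.request


def is_ollama_up(base_url: str = "http://localhost:11434", timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET on base_url answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `is_ollama_up()` with no arguments probes the port published by the `docker run` command above; pass a different `base_url` if the server is reachable under another address.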

Pull the Phi-3:14B model:

```
docker exec -it ollama ollama pull phi3:medium
```
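To confirm that the pull succeeded, one option is to ask the server which models it holds. The sketch below assumes Ollama's `/api/tags` endpoint, which returns a JSON object with a `models` list whose entries carry a `name` field; the helper names are hypothetical and do not come from this repository.

```python
import json
import urllib.request


def installed_models(tags_payload: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def fetch_installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return installed_models(json.load(response))
```

With the container running, `"phi3:medium" in fetch_installed_models()` should evaluate to true once the pull has finished.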

2. How to Use the Project

Before running phi-3-text-generator.py, you can modify several constants in the file.

Constants to Set

- OLLAMA_BASE_URL: The address of the Ollama server.
  - If the script runs inside a Docker container: "http://host.docker.internal:11434"
  - If the script runs directly on the host: "http://localhost:11434"
- TARGET_SENTENCE_COUNT: The number of sentences to generate.
- OUTPUT_FILE_COUNT: The number of output files to generate.
- OUTPUT_DIRECTORY: The directory where the generated files are saved.
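As a rough illustration, the constants might be set as follows. All values here are placeholders chosen for the example, and the `check_settings` helper is hypothetical; the actual defaults and types in phi-3-text-generator.py may differ.

```python
# Illustrative values only; not the defaults shipped with the script.
OLLAMA_BASE_URL = "http://localhost:11434"  # or "http://host.docker.internal:11434"
TARGET_SENTENCE_COUNT = 100  # placeholder
OUTPUT_FILE_COUNT = 5        # placeholder
OUTPUT_DIRECTORY = "out"     # placeholder, matching the default directory name


def check_settings(base_url: str, sentence_count: int, file_count: int) -> None:
    """Raise ValueError if a setting is obviously unusable (hypothetical helper)."""
    if not base_url.startswith(("http://", "https://")):
        raise ValueError(f"OLLAMA_BASE_URL must be an HTTP(S) URL, got {base_url!r}")
    if sentence_count <= 0 or file_count <= 0:
        raise ValueError("TARGET_SENTENCE_COUNT and OUTPUT_FILE_COUNT must be positive")


# Fail early, before any request is sent to the model.
check_settings(OLLAMA_BASE_URL, TARGET_SENTENCE_COUNT, OUTPUT_FILE_COUNT)
```

Validating the values up front is a design choice for this sketch: a typo in the URL or a zero count surfaces immediately instead of partway through a run.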

Execution Steps

Open the phi-3-text-generator.py file and set the constants as needed.
Install Poetry:
```
pip install poetry
```
Install the project dependencies. Go to the root directory of the project, and run:
```
poetry install
```
Activate the virtual environment:
```
poetry shell
```
Run the script with the following command:
```
python phi-3-text-generator.py
```

3. How to Check the Generated Files

By default, the generated files will be saved in an out directory created in the same directory as the script.

Steps to Verify

Navigate to the directory where the script was executed.
Check for the out directory:
```
ls out
```
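For a slightly richer check than `ls`, the files can be listed with their sizes so that empty outputs stand out. This is a minimal sketch that assumes nothing about the file names or format the script produces; `list_generated_files` is an illustrative name, not a function from this repository.

```python
from pathlib import Path


def list_generated_files(directory: str = "out") -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for each regular file, sorted by name."""
    root = Path(directory)
    if not root.is_dir():
        # The script has not been run here yet, or it wrote somewhere else.
        return []
    return sorted((p.name, p.stat().st_size) for p in root.iterdir() if p.is_file())
```

Comparing `len(list_generated_files())` with OUTPUT_FILE_COUNT is a quick way to see whether a run completed, assuming the script writes one file per count and the directory held no earlier output.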
