If you got the exception ModuleNotFoundError: No module named 'oasst_data'
you
first need to install the oasst_data
package:
Run pip install -e .
in the oasst-data/
directory of the Open-Assistant
repository to install the oasst_data
python package in editable mode.
Reading jsonl files is in general very simple in Python. To further simplify the
process for OA data the oasst_data
module comes with Pydantic class
definitions for validation and helper functions to load and traverse message
trees.
Code example:
# parsing OA data files with oasst_data helpers
from oasst_data import read_message_trees, visit_messages_depth_first, ExportMessageNode
messages: list[ExportMessageNode] = []
input_file_path = "data_file.jsonl.gz"
for tree in read_message_trees(input_file_path):
if tree.prompt.lang not in ["en","es"]: # filtering by language tag (optional)
continue
# example use of depth first tree visitor help function
visit_messages_depth_first(tree.prompt, visitor=messages.append, predicate=None)
A more comprehensive example of loading all conversation threads ending in assistant replies can be found in the file oasst_dataset.py which is used to load Open-Assistant export data for supervised fine-tuning (training) of our language models.
You can also load jsonl data completely without dependencies to oasst_data
solely with standard python libraries. In this case the json objects are loaded
as nested dicts which need to be 'parsed' manually by you:
# loading jsonl files without using oasst_data
import gzip
import json
from pathlib import Path
input_file_path = Path(input_file_path)
if input_file_path.suffix == ".gz":
file_in = gzip.open(str(input_file_path), mode="tr", encoding="UTF-8")
else:
file_in = input_file_path.open("r", encoding="UTF-8")
with file_in:
# read one object per line
for line in file_in:
dict_tree = json.loads(line)
# manual parsing of data now goes here ...
Open-Assistant export data is written as standard
JSON Lines data. The generated files are UTF-8 encoded
text files with single JSON objects in each line. The files come either
uncompressed with the ending .jsonl
or compressed with the ending .jsonl.gz
.
Three different types of objects can appear in these files:
- Individual Messages
- Conversation Threads
- Message Trees
For readability the following JSON examples are shown formatted with indentation on multiple lines although they are be stored without indentation in the actual data file.
Message objects can be identified by the presence of a "message_id"
property.
In files written by Open-Assistant this property will appear as the first
property on the line directly after the opening curly brace.
Each message needs at least an id (UUID), message text, a role (either "prompter" or "assistant") and a language tag (BCP 47) like "en" for English.
Minimal example of a message:
{
"message_id": "13714ad5-3161-4ead-9593-7248b0a3f218",
"text": "List the pieces of a reinforcement learning system (..)",
"role": "prompter",
"lang": "en"
}
Example of a message with more properties:
{
"message_id": "218440fd-5317-4355-91dc-d001416df62b",
"parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
"user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
"text": "It was the winter of 2035, and artificial intelligence (..)",
"role": "assistant",
"lang": "en",
"review_count": 3,
"review_result": true,
"deleted": false,
"rank": 0,
"synthetic": true,
"model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
"labels": {
"spam": { "value": 0.0, "count": 3 },
"lang_mismatch": { "value": 0.0, "count": 3 },
"pii": { "value": 0.0, "count": 3 },
"not_appropriate": { "value": 0.0, "count": 3 },
"hate_speech": { "value": 0.0, "count": 3 },
"sexual_content": { "value": 0.0, "count": 3 },
"quality": { "value": 0.416, "count": 3 },
"toxicity": { "value": 0.16, "count": 3 },
"humor": { "value": 0.0, "count": 3 },
"creativity": { "value": 0.33, "count": 3 },
"violence": { "value": 0.16, "count": 3 }
}
},
The backend export tool
(export.py)
will generate jsonl files with individual messages when a set of messages is
exported that is not a full tree. This is for example the case when filtering
messages based on properties like user, deleted, spam or synthetic. Spam
messages are those which have a review_result
that is false
.
Conversation threads are a linear lists of messages. THese objects can be
identified by the presence of the "thread_id"
property which contains the UUID
of the last message of the thread (which can be used to reconstruct the thread
by returning the list of ancestor messages up to the prompt root message). The
message_id of the first message is normally also the id of the message-tree that
contains the thread.
{
"thread_id": "534c7711-afb5-4410-9006-489dc885280e",
"thread": [
{
"message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
"text": "Why can't we divide by 0? (..)",
"role": "prompter",
"lang": "en"
},
{
"message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
"text": "The reason we cannot divide by zero is because (..)",
"role": "assistant",
"lang": "en"
},
{
"message_id": "1c9210e9-af9e-4507-abc5-3b3c7bca4dce",
"text": "Can you explain why we created a definition (..)",
"role": "prompter",
"lang": "en"
},
{
"message_id": "534c7711-afb5-4410-9006-489dc885280e",
"text": "The historical origin of the imaginary (..)",
"role": "assistant",
"lang": "en"
}
]
}
Message trees have of a prompt message at the root and can then branch out into
multiple different reply branches which each can again have further replies.
Message trees can be identified by the "message_tree_id"
property. The
message_tree_id
always matches the id of the prompt-message.
Example of a tree with minimal messages:
For clarity only the mandatory elements of the message are shown here. The full export format contains all the message attributes as shown above in the full message example.
{
"message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
"tree_state": "ready_for_export",
"prompt": {
"message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
"text": "Why can't we divide by 0? (..)",
"role": "prompter",
"lang": "en",
"replies": [
{
"message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
"text": "The reason we cannot divide by zero is because (..)",
"role": "assistant",
"lang": "en",
"replies": [
{
"message_id": "1c9210e9-af9e-4507-abc5-3b3c7bca4dce",
"text": "Can you explain why we created a definition (..)",
"role": "prompter",
"lang": "en",
"replies": [
{
"message_id": "534c7711-afb5-4410-9006-489dc885280e",
"text": "The historical origin of the imaginary (..)",
"role": "assistant",
"lang": "en",
"replies": []
},
{
"message_id": "bb791a11-2de2-4e39-9b99-55da5cc730a0",
"text": "The square root of -1, denoted i, was (..)",
"role": "assistant",
"lang": "en",
"replies": []
}
]
}
]
},
{
"message_id": "84d0913b-0fd9-4508-8ef5-205626a7039d",
"text": "The reason that the result of a division by zero is (..)",
"role": "assistant",
"lang": "en",
"replies": [
{
"message_id": "3352725e-f424-4e3b-a627-b6db831bdbaa",
"text": "Math is confusing. Like those weird Irrational (..)",
"role": "prompter",
"lang": "en",
"replies": [
{
"message_id": "f46207ca-3149-46e9-a466-9163d4ce499c",
"text": "Irrational numbers are simply numbers (..)",
"role": "assistant",
"lang": "en",
"replies": []
},
{
"message_id": "d63d5610-338b-46b1-b537-9211cdb0ddc6",
"text": "Irrational numbers can be confusing (..)",
"role": "assistant",
"lang": "en",
"replies": []
},
{
"message_id": "0ef7430e-314a-4da1-92bd-49a6967dc22f",
"text": "Irrational numbers are real numbers (..)",
"role": "assistant",
"lang": "en",
"replies": []
}
]
}
]
}
]
}
}
This format is used when whole trees are exported with
export.py
(for example all trees in ready_to_export
state).